Python: reading a CSV file that contains two double quotes and line breaks inside quoted fields

python file csv dataframe

I have a large file that I want to read with Python. It looks like this:

"2019-10-09 10:11:09","NICK","Hello, how are you
today? I'm like ""weather"", often changing."

I want to read this file into a DataFrame that looks like this:

col1                  col2          col3
2019-10-09 09:32:09   NICK          Hello, how are you today? I'm like ""weather"", often changing.

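I have a few problems. First, my separator is a comma (","), which also appears in some of the messages in col3. Second, some of the messages in col3 contain line breaks that I don't know how to handle (as after "you" in the example). Finally, the messages in col3 also contain two consecutive double quotes (""), which stand for a quotation mark inside the message.

I tried to read the file with the following code: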

with open('/data/myfile.csv', 'r', encoding='utf-8') as csvfile:
    df = pd.read_csv(csvfile, sep=",", quotechar='"', escapechar='\\')

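Unfortunately, this approach does not work, and I don't know which of these three things causes the problem. It raises an error saying that it expected three columns but found more.

EDIT: There is still some other problem, because I still get this error:

Error tokenizing data. C error: Expected 3 fields in line 60, saw 5

When I look at the file, I can't tell how it interprets the lines, because some of the messages in col3 contain line breaks. How can I print the line that causes the problem?

EDIT 2: I ran the following command in the terminal: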
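One possible way to locate such a record is to let the standard-library csv module do the record splitting and report every record whose field count is unexpected, together with the physical lines it spans. This is only a sketch: `find_bad_records` is a hypothetical helper, the sample text is assembled from the rows shown in this question, and csv.reader is not guaranteed to count lines exactly the way the pandas C parser does.

```python
import csv
import io

def find_bad_records(lines, expected_fields=3):
    """Yield (start_line, end_line, fields) for records with an unexpected field count."""
    reader = csv.reader(lines)
    previous_end = 0
    for fields in reader:
        # reader.line_num is the number of physical lines consumed so far
        start, end = previous_end + 1, reader.line_num
        previous_end = end
        if len(fields) != expected_fields:
            yield start, end, fields

# Sample built from the rows in the question; a real file would be opened
# with open(path, newline="", encoding="utf-8") instead.
sample = io.StringIO(
    '"2019-10-09 10:11:09","NICK","Hello, how are you\n'
    'today? I\'m like ""weather"", often changing."\n'
    '"2019-10-09 10:11:09","NICK","This is some text "and this, is quote" and it is also text\n'
    'Awww. and, there was, line break"\n'
)
bad = list(find_bad_records(sample))
for start, end, fields in bad:
    print(f"physical lines {start}-{end}: {len(fields)} fields -> {fields}")
```

Note that this only flags records with the wrong number of fields. The continuation line of a broken record can happen to split into exactly three fields, so the lines right after a reported record are worth inspecting as well.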

sed -n 60p myfile.csv

It printed an empty line, so I also printed a few lines before and after it. They look like this:

"2019-10-09 10:11:09","som1","This isn't this.
It's like this, and this.

And as my opinions is this.

Finally, it's the end."

EDIT 3: @Boendal was right. The line I included was not the one causing the problem. I have now changed my code to:

with open('opinions-ml.csv', 'r', encoding='utf-8') as csvfile:
    df = pd.read_csv(csvfile, names=['col1', 'col2', 'col3'], sep=",", quotechar='"', escapechar='\\')

I found that the problem is caused by rows like this one:

"2019-10-09 10:11:09","NICK","This is some text "and this, is quote" and it is also text
Awww. and, there was, line break"

Python reads it into the DataFrame like this:

col1                  col2          col3
2019-10-09 09:32:09   NICK          This is some text and this
Awww. and             there was     line break

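Do you think there is any chance of fixing this? Perhaps with a regular expression? Or should I go back to the file provider to get this fixed?

EDIT 4: There is one more row of this kind: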
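A regular expression can work if the file is regular enough. The sketch below assumes (these are assumptions, not facts confirmed by the question) that every record starts on a new line with a quoted timestamp of the form "YYYY-MM-DD HH:MM:SS", that col2 never contains a quote or a line break, and that col3 runs up to the last quote before the next record or the end of the file. `RECORD` and `parse_records` are hypothetical names.

```python
import re

# One record: "timestamp","name","message" where the message ends at a quote
# that is followed only by whitespace and then the next record or end of text.
RECORD = re.compile(
    r'^"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})","([^"\n]*)","(.*?)"\s*'
    r'(?=^"\d{4}-\d{2}-\d{2} |\Z)',
    re.MULTILINE | re.DOTALL,
)

def parse_records(text):
    """Return a list of (col1, col2, col3) tuples, tolerating stray quotes in col3."""
    rows = []
    for match in RECORD.finditer(text):
        col1, col2, col3 = match.groups()
        # Collapse the CSV-style doubled quotes; single stray quotes stay as they are
        rows.append((col1, col2, col3.replace('""', '"')))
    return rows

text = (
    '"2019-10-09 10:11:09","NICK","Hello, how are you\n'
    'today? I\'m like ""weather"", often changing."\n'
    '"2019-10-09 10:11:09","NICK","This is some text "and this, is quote" and it is also text\n'
    'Awww. and, there was, line break"\n'
)
rows = parse_records(text)
for row in rows:
    print(row)
```

The resulting list could then be passed to pandas.DataFrame(rows, columns=['col1', 'col2', 'col3']). The heuristic breaks if a message itself contains a quote at the end of a line followed by a line that looks like the start of a record, so it is worth comparing the number of parsed records with the expected row count.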

"2019-10-09 10:11:09","NICK","This is some text "and this is quote" and it is also text
Awww. and there, was line break"

Python reads it into the DataFrame like this:

col1                  col2            col3
2019-10-09 09:32:09   NICK            This is some text and this is quote" and it is also text
Awww. and there       was line break  NaN

As far as I know, a csv dialect may help. The code below produces the correct output.

import pandas as pd
import csv

csv.register_dialect('mydialect', delimiter=',', quoting=csv.QUOTE_ALL, doublequote=True)
df = pd.read_csv('test.csv', dialect='mydialect')
df
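To see what these dialect settings do without involving pandas, the same registered dialect can be passed to the standard-library csv.reader. This is a small illustration on the first sample record from the question, read from an in-memory string rather than from test.csv. With doublequote=True (which is also the csv module's default), a pair of double quotes inside a quoted field is read as one literal quote, and a line break inside a quoted field stays part of that field.

```python
import csv
import io

csv.register_dialect("mydialect", delimiter=",", quoting=csv.QUOTE_ALL, doublequote=True)

# First sample record from the question, held in memory for the demonstration
sample = io.StringIO(
    '"2019-10-09 10:11:09","NICK","Hello, how are you\n'
    'today? I\'m like ""weather"", often changing."\n'
)
rows = list(csv.reader(sample, dialect="mydialect"))

print(len(rows))    # number of logical records, not physical lines
print(rows[0][2])   # the message, with the embedded line break preserved
```

This only covers well-formed input. A single unescaped quote inside a message, as in the rows from EDIT 3 and EDIT 4, is not valid under this dialect and will still be split incorrectly.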

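Solution 2: reformat the data

The first two columns need no processing.
The third column needs to be escaped.

Split each line on , (comma) and escape the values from the third index onward.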

import csv
with open('test.csv') as infile, open('reformated_data.csv', 'w', newline='') as outfile:

    outputWriter = csv.writer(outfile, delimiter=',',
                            escapechar='\\', quoting=csv.QUOTE_NONE)
    for line in infile:
        line = line.split(',')
        col12 = line[0:2]
        col3 = ''.join(line[2:]).encode("unicode_escape").decode("utf-8")
        outputWriter.writerow(col12 + [col3])
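One detail of the split step is worth checking on a sample line: joining the pieces after the second comma with an empty string drops the commas that were inside the message. A possible variant, assuming the first two fields never contain a comma, is to limit the split to two cuts so the third piece keeps its commas. The line used here is the first physical line of the sample record from the question.

```python
# First physical line of the sample record from the question
line = '"2019-10-09 10:11:09","NICK","Hello, how are you\n'

# Unlimited split followed by ''.join(...) loses the comma inside the message
joined = "".join(line.split(",")[2:])

# maxsplit=2 cuts only at the first two commas and keeps the rest intact
col1, col2, col3 = line.split(",", 2)

print(repr(joined))
print(repr(col3))
```

Either way, a line-by-line loop sees the continuation line of a multi-line message as a separate line without its first two columns, so records with embedded line breaks would need to be reassembled before this step.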

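As I told you in the comments, your problem is that in ""weather"" the " characters are not escaped, so pandas interprets them as the quotechar. As far as I know, there is no way around this other than preprocessing the file and changing ""weather"" to \"\"weather\"\".

One way to do this is: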

with open('/data/myfile.csv', 'r', encoding='utf-8') as f_in, open("/data/preprocessed.csv", 'w') as f_out:
    for line in f_in:
        line = line.replace('""', '\\\"\\\"')
        f_out.write(line)

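This changes

"2019-10-09 10:11:09","NICK","Hello, how are you
today? I'm like ""weather"", often changing."
"2019-10-09 10:11:09","som1","This isn't this.
It's like this, and this.

And as my opinions is this.

Finally, it's the end."

into the same text with every "" replaced by \"\".

You can then create a DataFrame from it (using the code you posted above together with the newly created file), which looks like this: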

                  col1  col2                                                                                                  col3
0  2019-10-09 10:11:09  NICK                                      Hello, how are you\ntoday? I'm like ""weather"", often changing.
1  2019-10-09 10:11:09  som1  This isn't this.\nIt's like this, and this.\n\nAnd as my opinions is this.\n\nFinally, it's the end.

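You posted only one row of data, so there may be other errors that you are not aware of. I think the single quotes in your data may be a bigger problem than the double quotes, but try it and see how it works.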
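The effect of this preprocessing can be checked on a single record with the standard-library csv.reader, using the same quotechar and escapechar as in the read_csv call. This is a sketch on the sample record only, and it assumes that csv.reader treats the escape character the same way the pandas parser does here, which matches the DataFrame output shown above.

```python
import csv
import io

raw = (
    '"2019-10-09 10:11:09","NICK","Hello, how are you\n'
    'today? I\'m like ""weather"", often changing."\n'
)

# Same replacement as in the preprocessing loop: "" becomes \"\"
preprocessed = raw.replace('""', '\\"\\"')

rows = list(csv.reader(io.StringIO(preprocessed), quotechar='"', escapechar="\\"))
print(preprocessed)
print(rows[0][2])
```

Each escaped quote is kept as a literal character, so the parsed message contains two quote characters on each side of weather, as in the DataFrame above. The blanket replacement would also rewrite an empty quoted field (""), so it is only safe if the file contains no empty fields.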

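Your problem is that the "" in ""weather"" is not escaped, so pandas treats it as a quote character. If you change it to \"\"weather\"\" it will work (you specified "\" as the escapechar). I think you have to do some preprocessing before loading it into pandas.

@Boendal, do you think this preprocessing can be done in Python?

See my answer, maybe it helps you.

Print the line that causes the problem in your question, instead of asking where to print it.

@Boendal, how can I print the line which causes the problem? When I open this file in the terminal, I don't know how to count the lines of messages that contain line breaks (so some lines hold only part of col3).

It still doesn't work for me. It raises the error: Error tokenizing data. C error: Expected 7 fields in line 60, saw 9

Maybe some messages contain double line breaks. Depending on your file, you can also provide lineterminator in the dialect. (Check for these: \r\n\r\n.) There is also another option in pd.read_csv, error_bad_lines=False, which will drop the bad rows from the data.

I know about skipping lines, but it would skip about 3k lines from my file, which is a lot (the total is 150k). With lineterminator it still shows the same error.

I think you need to reformat the data before processing it. I provided solution 2 in my answer, which may help.

As you said, there is still some other problem, because it still shows this error: Error tokenizing data. C error: Expected 7 fields in line 60, When I look at the file, I don't know how it interprets the lines, because I have messages in col3 that contain line breaks. Can I print the line that causes the problem?

@maliniaki post it in your question.

OK, I posted it. Also, it says it expected 3 fields, not 7 as I mentioned.

@maliniaki We don't know what your data looks like. You have to give us more information/data so that we can help you. We cannot guess your problem from this alone.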