Python: reading a CSV file with doubled double quotes and line breaks inside quoted fields


I have a large file that I want to read with Python. It looks like this:

"2019-10-09 10:11:09","NICK","Hello, how are you
today? I'm like ""weather"", often changing."
I want to read this file into a data frame that looks like this:

col1                  col2          col3
2019-10-09 09:32:09   NICK          Hello, how are you today? I'm like ""weather"", often changing.
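I have a few problems. First, my separator is ",", which also appears inside some of the messages in col3. Second, some messages in col3 contain line breaks that I don't know how to handle (as after "you" in the example). Finally, messages in col3 also contain two consecutive double quotes "", which represent a quotation mark inside the message.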

I tried to read the file with the following code:

with open('/data/myfile.csv', 'r', encoding='utf-8') as csvfile:
    df = pd.read_csv(csvfile, sep=",", quotechar='"', escapechar='\\')
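Unfortunately, this approach does not work. I don't know which of the three things causes the problem. It raises an error saying that it expected three fields but saw more.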

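Edit: There is still some other problem, because it still shows this error:

Error tokenizing data. C error: Expected 3 fields in line 60, saw 5

When I look at the file, I don't know how the lines are counted, because some messages in col3 contain line breaks. How can I print the line that causes the problem?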

Edit 2: I used the following command in the terminal:

sed -n 60p myfile.csv
It printed an empty line, so I also printed a few lines before and after it. They look like this:

"2019-10-09 10:11:09","som1","This isn't this.
It's like this, and this.

And as my opinions is this.

Finally, it's the end."
Edit 3: @Boendal was right. The line I included does not cause the problem. I have now changed my code to:

with open('opinions-ml.csv', 'r', encoding='utf-8') as csvfile:
    df = pd.read_csv(csvfile, names=['col1', 'col2', 'col3'], sep=",", quotechar='"', escapechar='\\')
I found that the problem is caused by lines like this:

"2019-10-09 10:11:09","NICK","This is some text "and this, is quote" and it is also text
Awww. and, there was, line break"
Python reads it into the data frame as follows:

col1                  col2          col3
2019-10-09 09:32:09   NICK          This is some text and this
Awww. and             there was     line break
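Do you think there is a chance to solve this? Maybe with a regular expression? Or should I go back to the file provider to get it fixed?

Edit 4: Here is another such line: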

"2019-10-09 10:11:09","NICK","This is some text "and this is quote" and it is also text
Awww. and there, was line break"
Python reads it into the data frame as follows:

col1                  col2            col3
2019-10-09 09:32:09   NICK            This is some text and this is quote" and it is also text
Awww. and there       was line break  NaN

As far as I understand, a csv dialect may help. The code below produces the correct output.

import pandas as pd
import csv

csv.register_dialect('mydialect', delimiter=',', quoting=csv.QUOTE_ALL, doublequote=True)
df = pd.read_csv('test.csv', dialect='mydialect')
df
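
As a rough, self-contained check of the dialect idea, the sketch below feeds the sample record from the question to the standard csv module. The dialect name quoted_demo and the in-memory sample are assumptions made for illustration only. With doublequote=True, a doubled "" inside a quoted field is read as one literal quote, and the line break stays inside the field.

```python
import csv
import io

# Hypothetical dialect name; the settings mirror the ones registered above.
csv.register_dialect('quoted_demo', delimiter=',', quotechar='"',
                     quoting=csv.QUOTE_ALL, doublequote=True)

# The sample record from the question, including the embedded line break.
sample = io.StringIO(
    '"2019-10-09 10:11:09","NICK","Hello, how are you\n'
    'today? I\'m like ""weather"", often changing."\n'
)

# Each parsed row is a list of column values.
rows = list(csv.reader(sample, dialect='quoted_demo'))
for row in rows:
    print(row)
```

The parsed rows could then be passed to pd.DataFrame(rows, columns=['col1', 'col2', 'col3']) if a data frame is needed.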
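Solution 2: reformat the data

  • The first two columns do not need any processing
  • The third column needs to be escaped
  • Split each line on , (comma) and escape the values from the third field onward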

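    import csv

    # Treats every physical line as one record: the first two columns are kept
    # as-is, and everything after the second comma becomes a single escaped col3.
    with open('test.csv') as infile, open('reformated_data.csv', 'w', newline='') as outfile:
        outputWriter = csv.writer(outfile, delimiter=',',
                                  escapechar='\\', quoting=csv.QUOTE_NONE)
        for line in infile:
            line = line.split(',')
            col12 = line[0:2]
            # Join with ',' so that commas inside the message are not lost
            col3 = ','.join(line[2:]).encode("unicode_escape").decode("utf-8")
            outputWriter.writerow(col12 + [col3])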

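As I told you in the comments, your problem is that the quotes in ""weather"" are not escaped, so pandas interprets them as the quotechar. As far as I know, there is no way around this other than preprocessing the file and changing ""weather"" to \"\"weather\"\".

One way to do this: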

with open('/data/myfile.csv', 'r', encoding='utf-8') as f_in, open("/data/preprocessed.csv", 'w') as f_out:
    for line in f_in:
        line = line.replace('""', '\\\"\\\"')
        f_out.write(line)
This changes data such as:

"2019-10-09 10:11:09","NICK","Hello, how are you
today? I'm like ""weather"", often changing."
"2019-10-09 10:11:09","som1","This isn't this.
It's like this, and this.

And as my opinions is this.

Finally, it's the end."

You can then create a data frame from it (using the code posted above with the newly created file), and it looks like this:

                  col1  col2                                                                                                  col3
0  2019-10-09 10:11:09  NICK                                      Hello, how are you\ntoday? I'm like ""weather"", often changing.
1  2019-10-09 10:11:09  som1  This isn't this.\nIt's like this, and this.\n\nAnd as my opinions is this.\n\nFinally, it's the end.

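You posted only one line of data, so there may be other errors you are not aware of. I think the single quotes in your data may be a bigger problem than the double quotes, but try it and see how it goes.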
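
For rows where the inner quotes are not doubled at all (as in Edit 3 and Edit 4 of the question), one possible workaround is to bypass the CSV quoting rules and split the records by their shape. This is only a sketch under strong assumptions: every record starts at the beginning of a line with a quoted timestamp, col1 and col2 never contain a double quote, no message contains a line that itself looks like a record start, and a doubled "" in a message always stands for one quote. The names RECORD_START, RECORD and parse_loose_csv are made up for this example.

```python
import re

# Assumed record start: a quoted timestamp at the beginning of a line.
RECORD_START = re.compile(r'^"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}","', re.MULTILINE)
# Three quoted columns; the third one runs up to the last quote of the record.
RECORD = re.compile(r'^"([^"]*)","([^"]*)","(.*)"\s*$', re.DOTALL)


def parse_loose_csv(text):
    """Split text into (col1, col2, col3) tuples without relying on quote escaping."""
    starts = [m.start() for m in RECORD_START.finditer(text)]
    rows = []
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        match = RECORD.match(text[begin:end])
        if match is None:
            continue  # chunk does not have the expected three-column shape
        col1, col2, col3 = match.groups()
        # Collapse doubled quotes; single inner quotes are kept unchanged.
        rows.append((col1, col2, col3.replace('""', '"')))
    return rows


sample_text = (
    '"2019-10-09 10:11:09","NICK","Hello, how are you\n'
    'today? I\'m like ""weather"", often changing."\n'
    '"2019-10-09 10:11:09","NICK","This is some text "and this, is quote" and it is also text\n'
    'Awww. and, there was, line break"\n'
)
parsed = parse_loose_csv(sample_text)
for record in parsed:
    print(record)
```

The resulting tuples could be turned into a data frame with pd.DataFrame(parsed, columns=['col1', 'col2', 'col3']). If the assumptions do not hold for the real file, going back to the file provider is probably the safer route.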

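Your problem is that the quotes in ""weather"" are not escaped, so pandas treats them as the quotechar. If you change this to \"\"weather\"\" it will work (you specified "\" as the escapechar). I think you have to do some preprocessing before loading the file into pandas.

@Boendal, do you think this preprocessing can be done in Python?

See my answer, maybe it helps you.

Print the line that causes the problem in your question instead of asking where to print it.

@Boendal, how can I print the line which caused the problem? When I open this file in the terminal, I don't know how to count the lines when messages contain line breaks (so some lines contain only part of col3).

It still does not work for me. It produces the error: Error tokenizing data. C error: Expected 7 fields in line 60, saw 9

Possibly some messages contain double line breaks. You can also provide a lineterminator in the dialect according to your file (check for \r\n\r\n). There is another option in pd.read_csv, error_bad_lines=False, which drops bad lines from the data.

I know about skipping lines, but it would skip about 3k lines from my file, which is a lot (the total is 150k). With lineterminator it still shows the same error.

I think you need to reformat the data before processing it. I provided Solution 2 in my answer, which may help.

As you said, there is still some other problem, because it still shows this error: Error tokenizing data. C error: Expected 7 fields in line 60. When I look at the file, I don't know how the lines are interpreted, because I have messages in col3 that contain line breaks. Can I print the line that causes the problem?

@maliniaki Post it in your question.

OK, I posted it. It also shows Expected 3 fields, not 7 as I mentioned.

@maliniaki We don't know what your data looks like. You have to give us more information/data so that we can help you. We cannot guess your problem from this alone.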
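
On the open question from the comments of how to see which record trips the parser: one option is to walk through the data with the standard csv module and report every record whose field count differs from the expected three. This is a sketch, not a drop-in diagnosis; the function name find_bad_records and the in-memory sample are assumptions for illustration. Note that reader.line_num counts physical lines, so the reported number is the last physical line of the offending record.

```python
import csv
import io


def find_bad_records(lines, expected_fields=3):
    """Return (last_physical_line, fields) for records with an unexpected field count."""
    reader = csv.reader(lines, delimiter=',', quotechar='"', doublequote=True)
    bad = []
    for fields in reader:
        # Skip completely empty lines; flag everything else that is not 3 columns wide.
        if fields and len(fields) != expected_fields:
            bad.append((reader.line_num, fields))
    return bad


# One well-formed record followed by a record with unescaped inner quotes.
sample = io.StringIO(
    '"2019-10-09 10:11:09","NICK","Hello, how are you\n'
    'today? I\'m like ""weather"", often changing."\n'
    '"2019-10-09 10:11:09","NICK","This is some text "and this, is quote" and it is also text\n'
    'Awww. and, there was, line break"\n'
)
bad_records = find_bad_records(sample)
for line_no, fields in bad_records:
    print(line_no, len(fields), fields)
```

For a real file, the same function can be called on a handle opened with open(path, newline='', encoding='utf-8'), which is how the csv module expects files to be opened.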