Python misses some end-of-line characters when loading a CSV file

I have a CSV file (tab-separated) written in German. I did not create the file. I am trying to read it with Python's pandas package. This is what I did:

import pandas as pd
trn_file ="data/train.csv"
pd_train = pd.read_csv(trn_file,delimiter='\t',encoding='utf-8',header=None)
# pd_train is [1153 rows x 12 columns]
# the first  couple of rows of pd_train can be seen below:
>>> pd_train
        0                                                  1                                     2    3           4   5   6                                                7                                                8                      9     10    11
0       35  Auch in Großbritannien, wo 19 Atomreaktoren in...                              Ausstieg -1.0  2011-03-13  10  10                                     Sunday Times                                     Sunday Times           Sunday Times   NaN     1
1      117  Deswegen sollte Deutschland nicht für [...] we...                              Ausstieg  1.0  2011-04-11  60  62                                 Dietram Hoffmann                                 Dietram Hoffmann                    NaN   NaN   121
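When I inspected the DataFrame, I realized the file was not parsed correctly: some rows appear to be merged even though there is a newline between them. For example, the example below shows a single entry, but it actually contains 4 sentences, which should be in separate rows of the DataFrame:

How can I solve this problem?

Please let me know if I need to provide more information.

Edit: I tried @Abby's suggestions. Giving the full path changed nothing, and when I removed the delimiter and encoding arguments I got the following error: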

pd.read_csv(trn_file,header=None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 446, in _read
    data = parser.read(nrows)
  File "anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1848, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 891, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 945, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 11 fields in line 14, saw 12

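The problem is that some of the text entries contain quote characters, which mask the delimiters and newlines that follow them. Passing quoting=csv.QUOTE_NONE (from the standard csv module) to pd.read_csv turns off this special handling of quote characters, so a file with occasional quote characters is read line by line. See the csv module documentation: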

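csv.QUOTE_NONE

Instructs writer objects to never quote fields. When the current delimiter occurs in output data it is preceded by the current escapechar character. If escapechar is not set, the writer will raise Error if any characters that require escaping are encountered.

Instructs reader to perform no special processing of quote characters.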
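To make the masking effect concrete, here is a minimal sketch that uses the standard csv module rather than pandas. The sample rows are invented for illustration and are not taken from the asker's file; the sketch only shows how a pair of quote characters can swallow a tab and a newline, and how QUOTE_NONE prevents that.

```python
import csv
import io

# Hypothetical tab-separated sample: a quote opens on line 2 and closes on line 3.
raw = (
    '1\tErster Satz\tA\n'
    '2\t"Zweiter Satz\tB\n'
    '3\tDritter Satz"\tC\n'
    '4\tVierter Satz\tD\n'
)

# Default quoting: everything between the two quote characters is one field,
# so the tab and the newline inside it no longer separate columns or rows.
default_rows = list(csv.reader(io.StringIO(raw), delimiter='\t'))

# QUOTE_NONE: quote characters are ordinary text, so every line is its own row.
plain_rows = list(
    csv.reader(io.StringIO(raw), delimiter='\t', quoting=csv.QUOTE_NONE)
)

print(len(default_rows), len(plain_rows))
```

With this sample the default reader yields three rows, because lines 2 and 3 are merged, while the QUOTE_NONE reader yields four. Note that with QUOTE_NONE the quote characters stay in the field text.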


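2 things: 1) try giving the full path to the file, 2) try removing the delimiter and encoding arguments.
@Abby I'll try both now. However, I'm curious what your thinking is behind the first option?
@Abby I edited my question after trying your suggestions. Unfortunately, they did not work.
What version of Python are you using? You can pass error_bad_lines=False to skip these lines instead of raising an error on them.
My Python version is 3.6.5. Thanks for the suggestion; however, when I use delimiter='\t' I do not get any error. Besides, my data is very limited, and I would prefer to (somehow) fix these lines rather than ignore them.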
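For anyone who, like the asker, would rather repair the affected lines than skip them, a first step is finding them. The following is a rough sketch under one assumption: a line with an odd number of quote characters is a likely culprit. The helper name lines_with_odd_quotes and the sample lines are hypothetical.

```python
def lines_with_odd_quotes(lines, quotechar='"'):
    """Return the 1-based numbers of lines holding an odd count of quotechar."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if line.count(quotechar) % 2 == 1
    ]


# Hypothetical sample: lines 2 and 3 each carry one unmatched quote character,
# while line 4 has a balanced pair.
sample = [
    '1\tErster Satz\tA\n',
    '2\t"Zweiter Satz\tB\n',
    '3\tDritter Satz"\tC\n',
    '4\tVierter "Satz"\tD\n',
]

suspects = lines_with_odd_quotes(sample)
print(suspects)
```

The function accepts any iterable of strings, so an open file handle such as open(trn_file, encoding='utf-8') can be passed in place of the sample list. An odd count is only a heuristic; a line can still be problematic with a balanced pair of quotes.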
import csv
pd_train = pd.read_csv(trn_file, delimiter='\t', encoding='utf-8', header=None, quoting=csv.QUOTE_NONE)
pd.read_csv("train.csv", quotechar='"', skipinitialspace=True)
quotechar='"' --  Any commas between these characters shouldn't be treated as new columns.

skipinitialspace=True --  Skip spaces after delimiter.
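As a small illustration of how these two options interact, the sketch below uses the standard csv module, whose reader accepts options with the same names; it is a stand-in, not a claim that pandas behaves identically on every file. The comma-separated sample is made up.

```python
import csv
import io

# Hypothetical sample: a space follows each comma, and the quoted text holds a comma.
raw = 'id, text\n1, "Hallo, Welt"\n'

# Without skipinitialspace the field starts with a space, so the quote character
# is not at the start of the field and is treated as ordinary text.
kept_space_rows = list(csv.reader(io.StringIO(raw), quotechar='"'))

# With skipinitialspace=True the leading space is dropped, the quote opens a
# quoted field, and the comma inside it is no longer a column separator.
skipped_space_rows = list(
    csv.reader(io.StringIO(raw), quotechar='"', skipinitialspace=True)
)

print(kept_space_rows[1])
print(skipped_space_rows[1])
```

In this sample the second data row splits into three fields without skipinitialspace and into two fields with it, which is why the two options are usually set together when a file puts a space after each delimiter.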