Python misses some line endings when loading a CSV file
I have a CSV file written in German (tab-separated). I did not create the file. I am trying to read it with Python's pandas package. This is what I did:
import pandas as pd
trn_file = "data/train.csv"
pd_train = pd.read_csv(trn_file, delimiter='\t', encoding='utf-8', header=None)
# pd_train is [1153 rows x 12 columns]
# the first couple of rows of pd_train can be seen below:
>>> pd_train
0 1 2 3 4 5 6 7 8 9 10 11
0 35 Auch in Großbritannien, wo 19 Atomreaktoren in... Ausstieg -1.0 2011-03-13 10 10 Sunday Times Sunday Times Sunday Times NaN 1
1 117 Deswegen sollte Deutschland nicht für [...] we... Ausstieg 1.0 2011-04-11 60 62 Dietram Hoffmann Dietram Hoffmann NaN NaN 121
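When I inspected the DataFrame, I realized that the file had not been parsed correctly. Some rows appear to be merged even though there is a newline between them in the file. For example, a cell that should hold one sentence actually contains 4 sentences, which should be in separate rows of the DataFrame.
How can I solve this problem?
Please let me know if I need to provide more information.
Edit
I tried @Abby's suggestions. Giving the full path changed nothing, and when I removed the delimiter and encoding arguments I got the following error: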
pd.read_csv(trn_file,header=None)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 678, in parser_f
return _read(filepath_or_buffer, kwds)
File "anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 446, in _read
data = parser.read(nrows)
File "anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1036, in read
ret = self._engine.read(nrows)
File "anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1848, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 891, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 945, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 11 fields in line 14, saw 12
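The problem is that some text entries contain quote characters. These mask delimiters and newlines. You can turn off this special handling of quote characters by specifying quoting=csv.QUOTE_NONE.
So pass quoting=csv.QUOTE_NONE to pd.read_csv to read a file with occasional quote characters.
See the Python csv module documentation:
csv.QUOTE_NONE
Instructs writer objects to never quote fields. When the current delimiter occurs in output data it is preceded by the current escapechar character. If escapechar is not set, the writer will raise Error if any characters that require escaping are encountered.
Instructs reader to perform no special processing of quote characters.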
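To make the effect concrete, here is a minimal sketch of how an unbalanced quote character can merge rows, and how quoting=csv.QUOTE_NONE keeps them apart. The sample data and the names raw, default and fixed are made up for illustration; they are not taken from the asker's file.

```python
import csv
import io

import pandas as pd

# Hypothetical tab-separated sample. The text field in row 2 opens a quote
# that is only closed at the end of the text field in row 3.
raw = (
    '1\tErster Satz\tA\n'
    '2\t"Zweiter Satz\tB\n'
    '3\tDritter Satz"\tC\n'
    '4\tVierter Satz\tD\n'
)

# Default quoting: everything between the two quote characters, including
# the tab and the newline, becomes one field, so rows 2 and 3 are merged.
default = pd.read_csv(io.StringIO(raw), delimiter='\t', header=None)

# QUOTE_NONE: quote characters are ordinary text, so every line is a row.
fixed = pd.read_csv(
    io.StringIO(raw), delimiter='\t', header=None, quoting=csv.QUOTE_NONE
)

print(default.shape)  # (3, 3)
print(fixed.shape)    # (4, 3)
```

With QUOTE_NONE the quote characters stay in the text (for example '"Zweiter Satz'), so they may need to be stripped afterwards if they are unwanted.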
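Two things: 1) try giving the full path of the file, 2) try removing the delimiter and encoding arguments.
@Abby I will try both now. However, I am curious about your reasoning for the first option?
@Abby I edited my question after trying your suggestions. Unfortunately, they did not work.
Which Python version are you using? You can pass error_bad_lines=False to skip those lines instead of raising an error on them.
My Python version is 3.6.5. Thanks for the suggestion; however, when I use delimiter='\t' I do not get any error. Also, my data is very limited, and I would prefer to fix these lines (somehow) rather than ignore them.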
import csv
pd_train = pd.read_csv(trn_file, delimiter='\t', encoding='utf-8', header=None, quoting=csv.QUOTE_NONE)
pd.read_csv("train.csv", quotechar='"', skipinitialspace=True)
quotechar='"' -- Any commas between these characters shouldn't be treated as new columns.
skipinitialspace=True -- Skip spaces after delimiter.
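As a rough illustration of these two parameters, the sketch below reads a small comma-separated sample in which a quoted field contains a comma and each delimiter is followed by a space. The data and the names raw and df are invented for this example, and it assumes a comma-delimited file rather than the tab-delimited one from the question.

```python
import io

import pandas as pd

# Hypothetical comma-separated sample: spaces follow each comma, and the
# text field is wrapped in double quotes because it may contain a comma.
raw = (
    'id, text, label\n'
    '1, "Hallo, Welt", A\n'
    '2, "Guten Tag", B\n'
)

# skipinitialspace=True drops the space after each delimiter, so the quote
# is the first character of the field and the field is read as quoted.
df = pd.read_csv(io.StringIO(raw), quotechar='"', skipinitialspace=True)

print(list(df.columns))   # ['id', 'text', 'label']
print(df.loc[0, 'text'])  # Hallo, Welt
```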