Python Pandas can't read row 216 of a jagged text file

I have a jagged txt file (each line has a different number of columns) that I'm trying to read with Pandas. For some reason it can read the first 216 lines, but not the first 217:

>>> df = pd.read_table("test.txt", names = range(2000), nrows = 216)
>>> df = pd.read_table("test.txt", names = range(2000), nrows = 217)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/alexwhatley/anaconda3/lib/python3.5/site-packages/pandas/io/parsers.py", line 562, in parser_f
return _read(filepath_or_buffer, kwds)
File "/Users/alexwhatley/anaconda3/lib/python3.5/site-packages/pandas/io/parsers.py", line 321, in _read
return parser.read(nrows)
File "/Users/alexwhatley/anaconda3/lib/python3.5/site-packages/pandas/io/parsers.py", line 815, in read
ret = self._engine.read(nrows)
File "/Users/alexwhatley/anaconda3/lib/python3.5/site-packages/pandas/io/parsers.py", line 1314, in read
data = self._reader.read(nrows)
File "pandas/parser.pyx", line 805, in pandas.parser.TextReader.read (pandas/parser.c:8748)
File "pandas/parser.pyx", line 839, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:9208)
File "pandas/parser.pyx", line 881, in pandas.parser.TextReader._read_rows (pandas/parser.c:9731)
File "pandas/parser.pyx", line 868, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:9602)
File "pandas/parser.pyx", line 1865, in pandas.parser.raise_parser_error (pandas/parser.c:23325)
pandas.io.common.CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

The file is here: . Does anyone know what's going on?

A workaround is:

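import pandas as pd

# Split each line on tabs by hand, so rows may have different field counts.
the_file = []
with open(r"./genes.txt", "r") as f:  # text mode, so split("\t") works on str
    for line in f:
        the_file.append(line.rstrip("\n").split("\t"))

# Size the frame to the widest row; shorter rows are padded with missing values.
df = pd.DataFrame(the_file, columns=range(max(len(l) for l in the_file)))
print(df[0])
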
Result:

0 KEGG_GLYCOLYSIS_GLUCONEOGENESIS
1 KEGG_CITRATE_CYCLE_TCA_CYCLE
2 KEGG_PENTOSE_PHOSPHATE_PATHWAY
3 KEGG_PENTOSE_AND_GLUCURONATE_INTERCONVERSIONS
4 KEGG_FRUCTOSE_AND_MANNOSE_METABOLISM
5 KEGG_GALACTOSE_METABOLISM
6 KEGG_ASCORBATE_AND_ALDARATE_METABOLISM
7 KEGG_FATTY_ACID_METABOLISM
8 KEGG_STEROID_BIOSYNTHESIS
9 KEGG_PRIMARY_BILE_ACID_BIOSYNTHESIS
10 KEGG_STEROID_HORMONE_BIOSYNTHESIS
11 KEGG_OXIDATIVE_PHOSPHORYLATION
12 KEGG_PURINE_METABOLISM
13 KEGG_PYRIMIDINE_METABOLISM
14 KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM
15 KEGG_GLYCINE_SERINE_AND_THREONINE_METABOLISM
16 KEGG_CYSTEINE_AND_METHIONINE_METABOLISM
17 KEGG_VALINE_LEUCINE_AND_ISOLEUCINE_DEGRADATION
18 KEGG_VALINE_LEUCINE_AND_ISOLEUCINE_BIOSYNTHESIS
19 KEGG_LYSINE_DEGRADATION
20 KEGG_ARGININE_AND_PROLINE_METABOLISM
21 KEGG_HISTIDINE_METABOLISM
22 KEGG_TYROSINE_METABOLISM
23 KEGG_PHENYLALANINE_METABOLISM
24 KEGG_TRYPTOPHAN_METABOLISM
25 KEGG_BETA_ALANINE_METABOLISM
26 KEGG_TAURINE_AND_HYPOTAURINE_METABOLISM
27 KEGG_SELENOAMINO_ACID_METABOLISM
28 KEGG_GLUTATHIONE_METABOLISM
29 KEGG_STARCH_AND_SUCROSE_METABOLISM
...
425 ST_GAQ_PATHWAY
426 ST_GA13_PATHWAY
427 ST_STAT3_PATHWAY
428 SA_FAS_SIGNALING
429 SA_G1_AND_S_PHASES
430 SIG_INSULIN_RECEPTOR_PATHWAY_IN_CARDIAC_MYOCYTES
431 ST_T_CELL_SIGNAL_TRANSDUCTION
432 ST_TYPE_I_INTERFERON_PATHWAY
433 ST_PAC1_RECEPTOR_PATHWAY
434 SIG_PIP3_SIGNALING_IN_B_LYMPHOCYTES
435 SIG_BCR_SIGNALING_PATHWAY
436 SA_G2_AND_M_PHASES
437 ST_B_CELL_ANTIGEN_RECEPTOR
438 ST_INTERLEUKIN_4_PATHWAY
439 ST_WNT_BETA_CATENIN_PATHWAY
440 SA_MMP_CYTOKINE_CONNECTION
441 ST_JNK_MAPK_PATHWAY
442 SA_PROGRAMMED_CELL_DEATH
443 ST_FAS_SIGNALING_PATHWAY
444 ST_MYOCYTE_AD_PATHWAY
445 SA_PTEN_PATHWAY
446 SA_REG_CASCADE_OF_CYCLIN_EXPR
447 SA_TRKA_RECEPTOR
448 ST_PHOSPHOINOSITIDE_3_KINASE_PATHWAY
449 PID_FANCONI_PATHWAY
450 PID_SMAD2_3NUCLEAR_PATHWAY
451 PID_FCER1_PATHWAY
452 PID_ENDOTHELIN_PATHWAY
453 PID_BCR_5PATHWAY
454 PID_PRL_SIGNALING_EVENTS_PATHWAY
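
To make the workaround easier to try without the original file, here is a minimal sketch of the same idea on a small in-memory sample. The helper name `read_ragged_tsv` and the sample rows are made up for illustration; they are not from the original post. It pads short rows explicitly instead of relying on the DataFrame constructor to do so.

```python
import io

import pandas as pd


def read_ragged_tsv(handle):
    """Split each line on tabs and pad short rows so pandas gets a rectangle."""
    rows = [line.rstrip("\n").split("\t") for line in handle]
    width = max(len(row) for row in rows)
    # Pad every row to the widest row's length with None (treated as missing).
    padded = [row + [None] * (width - len(row)) for row in rows]
    return pd.DataFrame(padded, columns=range(width))


# Made-up sample standing in for the real file: three rows of different widths.
sample = io.StringIO("SET_A\tg1\tg2\nSET_B\tg1\nSET_C\tg1\tg2\tg3\tg4\n")
df = read_ragged_tsv(sample)
print(df.shape)
print(df[0])
```

The same function should accept an open file object, for example `with open("genes.txt") as f: df = read_ragged_tsv(f)`, assuming the file is tab-separated as in the question.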
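
What happens if you delete the first 216 lines of the file and use nrows=1? Since the file has barely about 100 columns, is there any particular reason to use 2000 columns?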
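
One way to check how wide the file really is before choosing a value for names is to count the fields on each line. The sketch below uses only the standard library. The helper name `field_counts` and the sample data are hypothetical, and it assumes tab-separated input as in the question. It does not explain the parser error by itself; it only shows which line is widest.

```python
import io


def field_counts(handle, sep="\t"):
    """Return the number of sep-separated fields on each line."""
    return [len(line.rstrip("\n").split(sep)) for line in handle]


# Made-up sample standing in for the real file.
sample = io.StringIO("SET_A\tg1\tg2\nSET_B\tg1\nSET_C\tg1\tg2\tg3\tg4\n")
counts = field_counts(sample)

# Zero-based index of the widest line, and its width.
widest = max(range(len(counts)), key=counts.__getitem__)
print("max fields:", max(counts), "on line index:", widest)
```

With the real file, `max(counts)` would give a tighter bound to pass as `names=range(...)` than an arbitrary 2000.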