Python: reading a CSV with a missing/incomplete header or an irregular number of columns
I have a file.csv with roughly 15k rows that looks like this:
SAMPLE_TIME, POS, OFF, HISTOGRAM
2015-07-15 16:41:56, 0-0-0-0-3, 1, 2,0,5,59,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,
2015-07-15 16:42:55, 0-0-0-0-3, 1, 0,0,5,9,0,0,0,0,0,2,0,0,0,50,0,
2015-07-15 16:43:55, 0-0-0-0-3, 1, 0,0,5,5,0,0,0,0,0,2,0,0,0,0,4,0,0,0,
2015-07-15 16:44:56, 0-0-0-0-3, 1, 2,0,5,0,0,0,0,0,0,2,0,0,0,6,0,0,0,0
I would like to import it into a pandas.DataFrame, giving arbitrary names to the columns that have no header, like this:
SAMPLE_TIME, POS, OFF, HISTOGRAM 1 2 3 4 5 6
2015-07-15 16:41:56, 0-0-0-0-3, 1, 2, 0, 5, 59, 4, 0, 0,
2015-07-15 16:42:55, 0-0-0-0-3, 1, 0, 0, 5, 0, 6, 0, nan
2015-07-15 16:43:55, 0-0-0-0-3, 1, 0, 0, 5, 0, 7, nan nan
2015-07-15 16:44:56, 0-0-0-0-3, 1, 2, 0, 5, 0, 0, 2, nan
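This seems impossible to import. I tried different solutions, such as supplying a header explicitly, but still no joy. The only way I could make it work was to add a header to the .csv file by hand, which rather defeats the purpose of automation.
Then I tried this: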
lines=list(csv.reader(open('file.csv')))
header, values = lines[0], lines[1:]
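It reads the file correctly and gives me a list, values, of ~15k elements. Each element is a list of strings, and each string is a correctly parsed data field from the file. But when I try this: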
data = {h:v for h,v in zip (header, zip(*values))}
df = pd.DataFrame.from_dict(data)
Or this:
data2 = {h:v for h,v in zip (str(xrange(16)), zip(*values))}
df2 = pd.DataFrame.from_dict(data)
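Then the headerless columns disappear and the column order gets completely scrambled. Is there any possible solution?

Assuming your data is in a file called foo.csv, you can do the following. This was tested against pandas 0.17.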
df = pd.read_csv('foo.csv', names=['sample_time', 'pos', 'off', 'histogram', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17'], skiprows=1)
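The long names list in the call above does not have to be typed by hand. A minimal sketch, assuming the same four real column names and the same 17 extra columns as in that call (base and extra_count are illustrative names, not part of the answer):

```python
# Real column names, as used in the read_csv call above.
base = ["sample_time", "pos", "off", "histogram"]

# Assumption: 17 unnamed columns, matching the hand-written list.
extra_count = 17

# Numeric placeholder names "1" .. "17" for the headerless columns.
names = base + [str(i) for i in range(1, extra_count + 1)]
```

The resulting list could then be passed as names=names, together with skiprows=1, exactly like the literal list.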
How about this? I made a csv from your sample data.
When importing the rows:
with open('test.csv','rb') as f:
    lines = list(csv.reader(f))
headers, values = lines[0], lines[1:]
To generate usable header names, use the following line:
headers = [i or ind for ind, i in enumerate(headers)]
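Because of the way (I assume) the csv works, headers should contain a bunch of empty-string values. An empty string evaluates to False, so this comprehension returns the column's index for every column that has no header.
Then make a df: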
df = pd.DataFrame(values,columns=headers)
Which looks like:
SAMPLE_TIME POS OFF HISTOGRAM 4 5 6 7 8 9 \
0 15/07/2015 16:41 0-0-0-0-3 1 2 0 5 59 0 0 0
1 15/07/2015 16:42 0-0-0-0-3 1 0 0 5 9 0 0 0
2 15/07/2015 16:43 0-0-0-0-3 1 0 0 5 5 0 0 0
3 15/07/2015 16:44 0-0-0-0-3 1 2 0 5 0 0 0 0
... 12 13 14 15 16 17 18 19 20 21
0 ... 2 0 0 0 0 0 0 0 0 0
1 ... 2 0 0 0 50 0
2 ... 2 0 0 0 0 4 0 0 0
3 ... 2 0 0 0 6 0 0 0 0
[4 rows x 22 columns]
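The header-naming step relies on the header row already having one (possibly empty) field per column. If the header turns out shorter than some data rows, both would presumably need padding first. A minimal sketch using only the csv module and written for Python 3, with a made-up three-line input (RAW and the padding logic are illustrative assumptions, not part of the answer above):

```python
import csv
import io

# Made-up input: two named columns, two unnamed ones, and one short row.
RAW = "A,B,,\n1,2,3,4\n5,6,7\n"

rows = list(csv.reader(io.StringIO(RAW)))
headers, values = rows[0], rows[1:]

# Width of the widest row, header included.
width = max(len(r) for r in rows)

# Pad the header with empty strings, then replace every empty name
# with its column index (an empty string is falsy).
headers = headers + [""] * (width - len(headers))
headers = [name or index for index, name in enumerate(headers)]

# Pad short data rows with None so every row has the same length.
values = [r + [None] * (width - len(r)) for r in values]
```

After this, headers and values have matching widths, which is what a call like pd.DataFrame(values, columns=headers) expects.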
You can split the HISTOGRAM column into a new DataFrame and join it back to the original columns.
print df
SAMPLE_TIME, POS, OFF, \
0 2015-07-15 16:41:56 0-0-0-0-3, 1,
1 2015-07-15 16:42:55 0-0-0-0-3, 1,
2 2015-07-15 16:43:55 0-0-0-0-3, 1,
3 2015-07-15 16:44:56 0-0-0-0-3, 1,
HISTOGRAM
0 2,0,5,59,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,
1 0,0,5,9,0,0,0,0,0,2,0,0,0,50,0,
2 0,0,5,5,0,0,0,0,0,2,0,0,0,0,4,0,0,0,
3 2,0,5,0,0,0,0,0,0,2,0,0,0,6,0,0,0,0
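The code that performs the split is not shown above. One possible way to do it, sketched here under the assumption that the first three commas of each line separate SAMPLE_TIME, POS and OFF and that everything after them belongs to HISTOGRAM (SAMPLE, records, hist and result are illustrative names; written for Python 3 and a reasonably recent pandas):

```python
import pandas as pd

# Two shortened rows in the question's layout (illustrative data).
SAMPLE = (
    "SAMPLE_TIME, POS, OFF, HISTOGRAM\n"
    "2015-07-15 16:41:56, 0-0-0-0-3, 1, 2,0,5,59,0,\n"
    "2015-07-15 16:42:55, 0-0-0-0-3, 1, 0,0,5,\n"
)

# Split each data line on the first three commas only, so that everything
# after OFF stays together in a single HISTOGRAM string.
records = [
    [part.strip() for part in line.split(",", 3)]
    for line in SAMPLE.splitlines()[1:]
]
df = pd.DataFrame(records, columns=["SAMPLE_TIME", "POS", "OFF", "HISTOGRAM"])

# Expand the HISTOGRAM string into one column per value; shorter rows
# are padded with missing values.
hist = df["HISTOGRAM"].str.rstrip(",").str.split(",", expand=True)

# Put the new columns next to the original three.
result = pd.concat([df.drop(columns="HISTOGRAM"), hist], axis=1)
```

The expanded values are still strings at this point, so a numeric conversion would be a separate step.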
You can create the columns based on the length of the first actual data row:
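import pandas as pd
with open("out.txt") as f:
    # Read the header line, then count the fields in the first data row.
    h, ln = next(f), len(next(f).split(","))
    header = h.strip().split(",")
    # Rewind and skip the header line so pandas only sees data rows.
    f.seek(0), next(f)
    # Append numeric names for the columns that have no header.
    header += range(ln)
    print(pd.read_csv(f, names=header))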
This gives you:
SAMPLE_TIME POS OFF HISTOGRAM 0 1 2 3 \
0 2015-07-15 16:41:56 0-0-0-0-3 1 2 0 5 59 0
1 2015-07-15 16:42:55 0-0-0-0-3 1 0 0 5 9 0
2 2015-07-15 16:43:55 0-0-0-0-3 1 0 0 5 5 0
3 2015-07-15 16:44:56 0-0-0-0-3 1 2 0 5 0 0
4 5 ... 13 14 15 16 17 18 19 20 21 22
0 0 0 ... 0 0 0 0 0 NaN NaN NaN NaN NaN
1 0 0 ... 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 0 0 ... 4 0 0 0 NaN NaN NaN NaN NaN NaN
3 0 0 ... 0 0 0 0 NaN NaN NaN NaN NaN NaN
[4 rows x 27 columns]
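The first data row is not necessarily the widest one, and appending one name per field on top of the four real headers yields more columns than the data needs (27 above). A hedged variant, assuming it is acceptable to scan the whole input once, sizes the name list from the widest row instead. SAMPLE and build_names are hypothetical stand-ins, and the code is written for Python 3:

```python
import csv
import io

import pandas as pd

# Shortened sample standing in for the real file (illustrative data).
SAMPLE = (
    "SAMPLE_TIME, POS, OFF, HISTOGRAM\n"
    "2015-07-15 16:41:56, 0-0-0-0-3, 1, 2,0,5,59,0,\n"
    "2015-07-15 16:42:55, 0-0-0-0-3, 1, 0,0,5,\n"
)

def build_names(text):
    """Return the real header fields padded with "1", "2", ... up to the widest row."""
    rows = list(csv.reader(io.StringIO(text), skipinitialspace=True))
    header = [h.strip() for h in rows[0]]
    width = max(len(r) for r in rows)
    extra = [str(i) for i in range(1, width - len(header) + 1)]
    return header + extra

names = build_names(SAMPLE)

# skiprows=1 skips the original header line; short rows are filled with NaN.
df = pd.read_csv(io.StringIO(SAMPLE), names=names, skiprows=1,
                 skipinitialspace=True)
```

Because every line in the sample ends with a comma, the last generated column holds only missing values.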
Alternatively, you can clean the file before passing it to pandas:
import pandas as pd
from tempfile import TemporaryFile
with open("in.csv") as f, TemporaryFile("w+") as t:
    # Copy the file into a temporary file with all spaces removed.
    for line in f:
        t.write(line.replace(" ", ""))
    t.seek(0)
    # line still holds the last row; use its field count for the extra names.
    ln = len(line.strip().split(","))
    header = t.readline().strip().split(",")
    header += range(ln)
    print(pd.read_csv(t,names=header))
This gives you:
SAMPLE_TIME POS OFF HISTOGRAM 0 1 2 3 4 5 ... 11 \
0 2015-07-1516:41:56 0-0-0-0-3 1 2 0 5 59 0 0 0 ... 0
1 2015-07-1516:42:55 0-0-0-0-3 1 0 0 5 9 0 0 0 ... 0
2 2015-07-1516:43:55 0-0-0-0-3 1 0 0 5 5 0 0 0 ... 0
3 2015-07-1516:44:56 0-0-0-0-3 1 2 0 5 0 0 0 0 ... 0
12 13 14 15 16 17 18 19 20
0 0 0 0 0 0 0 NaN NaN NaN
1 50 0 NaN NaN NaN NaN NaN NaN NaN
2 0 4 0 0 0 NaN NaN NaN NaN
3 6 0 0 0 0 NaN NaN NaN NaN
[4 rows x 25 columns]
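Note that replacing every space also fuses the date and the time in SAMPLE_TIME, as the output above shows. A possible alternative, sketched here with the standard csv module, is skipinitialspace=True, which only drops whitespace that directly follows a delimiter. pandas.read_csv accepts a parameter of the same name. The sample line is illustrative and the code is written for Python 3:

```python
import csv
import io

# One shortened line in the question's layout (illustrative data).
line = "2015-07-15 16:41:56, 0-0-0-0-3, 1, 2,0\n"

# skipinitialspace=True ignores the blank after each comma but keeps
# the space between the date and the time.
row = next(csv.reader(io.StringIO(line), skipinitialspace=True))
```

Here row keeps "2015-07-15 16:41:56" as a single field, and the remaining fields carry no leading blanks.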
Or, to drop the columns that are entirely NaN:
print(pd.read_csv(f, names=header).dropna(axis=1,how="all"))
Which gives you:
SAMPLE_TIME POS OFF HISTOGRAM 0 1 2 3 \
0 2015-07-15 16:41:56 0-0-0-0-3 1 2 0 5 59 0
1 2015-07-15 16:42:55 0-0-0-0-3 1 0 0 5 9 0
2 2015-07-15 16:43:55 0-0-0-0-3 1 0 0 5 5 0
3 2015-07-15 16:44:56 0-0-0-0-3 1 2 0 5 0 0
4 5 ... 8 9 10 11 12 13 14 15 16 17
0 0 0 ... 2 0 0 0 0 0 0 0 0 0
1 0 0 ... 2 0 0 0 50 0 NaN NaN NaN NaN
2 0 0 ... 2 0 0 0 0 4 0 0 0 NaN
3 0 0 ... 2 0 0 0 6 0 0 0 0 NaN
[4 rows x 22 columns]
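To make the effect of how="all" concrete, here is a small sketch on a made-up frame (the column names a, b and c are purely illustrative). Only columns in which every value is missing are removed, so a column with some NaN values survives:

```python
import numpy as np
import pandas as pd

# "b" is partly missing, "c" is missing everywhere.
df = pd.DataFrame({
    "a": [1, 2],
    "b": [np.nan, 3.0],
    "c": [np.nan, np.nan],
})

# axis=1 works on columns; how="all" drops a column only if all its values are NaN.
trimmed = df.dropna(axis=1, how="all")
```

With how="any" instead, column b would be dropped as well.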
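Python 2.7.10 on Windows 7, Anaconda 2.1.0 64-bit, pandas 0.17.1, csv 1.0.
I don't understand your doubt.
So the input contains all of these values in a single cell. I see my mistake now.
Yes, the first example is an input with a whole bunch of problems.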