Python: reading key-value pairs into pandas
pandas makes reading a CSV file very easy:
pd.read_table('data.txt', sep=',')
Does pandas have a similar facility for files made up of key-value pairs? I came up with this:
pd.DataFrame([dict([p.split('=') for p in l.split(',')]) for l in open('data.txt')])
If nothing is built in, is there perhaps a more idiomatic way?
The file of interest looks like this:
symbol=ESM3,exchange=GLOBEX,timestamp=1365428525690751,price=1548.00,quantity=551
symbol=ESM3,exchange=GLOBEX,timestamp=1365428525697183,price=1548.00,quantity=551
symbol=ESM3,exchange=GLOBEX,timestamp=1365428525714498,price=1548.00,quantity=551
symbol=ESM3,exchange=GLOBEX,timestamp=1365428525734967,price=1548.00,quantity=551
symbol=ESM3,exchange=GLOBEX,timestamp=1365428525735567,price=1548.00,quantity=555
symbol=ESM3,exchange=GLOBEX,timestamp=1365428525735585,price=1548.00,quantity=556
symbol=ESM3,exchange=GLOBEX,timestamp=1365428525736116,price=1548.00,quantity=556
symbol=ESM3,exchange=GLOBEX,timestamp=1365428525740757,price=1548.00,quantity=556
symbol=ESM3,exchange=GLOBEX,timestamp=1365428525748502,price=1548.00,quantity=556
symbol=ESM3,exchange=GLOBEX,timestamp=1365428525748952,price=1548.00,quantity=557
Every line has exactly the same keys, in the same order. There are no null values. The table to produce is:
exchange price quantity symbol timestamp
0 GLOBEX 1548.00 551\n ESM3 1365428525690751
1 GLOBEX 1548.00 551\n ESM3 1365428525697183
2 GLOBEX 1548.00 551\n ESM3 1365428525714498
3 GLOBEX 1548.00 551\n ESM3 1365428525734967
4 GLOBEX 1548.00 555\n ESM3 1365428525735567
5 GLOBEX 1548.00 556\n ESM3 1365428525735585
6 GLOBEX 1548.00 556\n ESM3 1365428525736116
7 GLOBEX 1548.00 556\n ESM3 1365428525740757
8 GLOBEX 1548.00 556\n ESM3 1365428525748502
9 GLOBEX 1548.00 557\n ESM3 1365428525748952
(I can strip the \n from quantity with rstrip() after the import.)
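The trailing \n comes from iterating over the file line by line without stripping each line first. A minimal sketch of the same list-comprehension idea with that fixed; the `io.StringIO` sample stands in for data.txt, and `parse_line` and `df_kv` are hypothetical names used only for illustration:

```python
import io
import pandas as pd

# Two sample lines in the format shown above; a real file object would work the same way.
sample = io.StringIO(
    "symbol=ESM3,exchange=GLOBEX,timestamp=1365428525690751,price=1548.00,quantity=551\n"
    "symbol=ESM3,exchange=GLOBEX,timestamp=1365428525735567,price=1548.00,quantity=555\n"
)

def parse_line(line):
    # Strip the newline first so it cannot end up in the last value
    pairs = line.strip().split(",")
    # Split each pair on the first '=' only, in case a value contains '='
    return dict(pair.split("=", 1) for pair in pairs)

# Skip blank lines; every value is kept as a string
df_kv = pd.DataFrame([parse_line(line) for line in sample if line.strip()])
```

All columns come out as strings here, so numeric columns would still need converting afterwards.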
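If you know the key names in advance, and the names always appear in the same order, you can use converters to chop off the key names and then use the names parameter to name the columns: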
import pandas as pd

def value(item):
    # Keep only the text after the first '=' (drops the "key=" prefix)
    return item[item.find('=')+1:]

df = pd.read_table('data.txt', header=None, delimiter=',',
                   converters={i: value for i in range(5)},
                   names='symbol exchange timestamp price quantity'.split())
print(df)
On the data you posted, this gives:
symbol exchange timestamp price quantity
0 ESM3 GLOBEX 1365428525690751 1548.00 551
1 ESM3 GLOBEX 1365428525697183 1548.00 551
2 ESM3 GLOBEX 1365428525714498 1548.00 551
3 ESM3 GLOBEX 1365428525734967 1548.00 551
4 ESM3 GLOBEX 1365428525735567 1548.00 555
5 ESM3 GLOBEX 1365428525735585 1548.00 556
6 ESM3 GLOBEX 1365428525736116 1548.00 556
7 ESM3 GLOBEX 1365428525740757 1548.00 556
8 ESM3 GLOBEX 1365428525748502 1548.00 556
9 ESM3 GLOBEX 1365428525748952 1548.00 557
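Note that price prints as 1548.00 above, which suggests the converter output is kept as strings rather than parsed as numbers. A minimal sketch of converting the numeric columns afterwards, assuming the five columns shown above; the `io.StringIO` sample stands in for data.txt, and the `pd.to_numeric` step and the name `df_conv` are additions for illustration, not part of the answer:

```python
import io
import pandas as pd

# Two sample lines in the format from the question
sample = io.StringIO(
    "symbol=ESM3,exchange=GLOBEX,timestamp=1365428525690751,price=1548.00,quantity=551\n"
    "symbol=ESM3,exchange=GLOBEX,timestamp=1365428525735567,price=1548.00,quantity=555\n"
)

def value(item):
    # Keep everything after the first '='
    return item[item.find("=") + 1:]

names = "symbol exchange timestamp price quantity".split()
df_conv = pd.read_csv(sample, header=None, sep=",",
                      converters={i: value for i in range(len(names))},
                      names=names)

# Convert the columns that should be numeric explicitly
for col in ["timestamp", "price", "quantity"]:
    df_conv[col] = pd.to_numeric(df_conv[col])
```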
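I'm not sure what the best way to do this is, but assuming the delimiters never appear in the values (thinking about the corner cases hurts my brain), something like this is not very elegant but it is simple: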
>>> df = pd.read_csv("esm.csv", sep=",|=", header=None, engine="python")
>>> df2 = df.iloc[:, 1::2].copy()
>>> df2.columns = list(df.iloc[0, 0::2])
>>> df2
symbol exchange timestamp price quantity
0 ESM3 GLOBEX 1365428525690751 1548 551
1 ESM3 GLOBEX 1365428525697183 1548 551
2 ESM3 GLOBEX 1365428525714498 1548 551
3 ESM3 GLOBEX 1365428525734967 1548 551
4 ESM3 GLOBEX 1365428525735567 1548 555
5 ESM3 GLOBEX 1365428525735585 1548 556
6 ESM3 GLOBEX 1365428525736116 1548 556
7 ESM3 GLOBEX 1365428525740757 1548 556
8 ESM3 GLOBEX 1365428525748502 1548 556
9 ESM3 GLOBEX 1365428525748952 1548 557
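For anyone who wants to try the session above without a file, here is a self-contained sketch of the same idea. It assumes a pandas version where `.ix` is no longer available, so it uses `.iloc`, and it passes `engine="python"` because the separator is a regular expression; the `io.StringIO` sample and the names `raw` and `values` are only illustrative:

```python
import io
import pandas as pd

# Two sample lines in the format from the question
sample = io.StringIO(
    "symbol=ESM3,exchange=GLOBEX,timestamp=1365428525690751,price=1548.00,quantity=551\n"
    "symbol=ESM3,exchange=GLOBEX,timestamp=1365428525735567,price=1548.00,quantity=555\n"
)

# Splitting on ',' or '=' yields alternating key and value columns
raw = pd.read_csv(sample, sep=",|=", header=None, engine="python")

# Odd positions hold the values
values = raw.iloc[:, 1::2].copy()
# Even positions of the first row hold the keys
values.columns = list(raw.iloc[0, 0::2])
```

Unlike the converter approach, the value columns go through the normal type inference here, so the numeric columns are parsed as numbers.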
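Basically, read everything in first, then do the pivoting yourself: keep every other element as the values, then fix up the column names.
Can you give an example of what the file looks like and what the DataFrame should look like?
@DSM I've added an example.
This works well. I can set the column names automatically with keys=[l.split('=')[0::2][0] for l in open('data.txt').readline().split(',')]
Right, that's a good idea. Or, perhaps a little simpler: names=[item.split('=')[0] for item in open('data.txt').readline().split(',')]
This also works well, although @unutbu's solution runs in half the time.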
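Putting the comment's suggestion together with the converter approach might look like the sketch below. It assumes, as the question states, that every line has the same keys in the same order; the in-memory `text` stands in for data.txt, and `df_auto` is a hypothetical name:

```python
import io
import pandas as pd

# Two sample lines in the format from the question
text = (
    "symbol=ESM3,exchange=GLOBEX,timestamp=1365428525690751,price=1548.00,quantity=551\n"
    "symbol=ESM3,exchange=GLOBEX,timestamp=1365428525735567,price=1548.00,quantity=555\n"
)

# Take the key names from the first line only
first_line = text.splitlines()[0]
names = [item.split("=")[0] for item in first_line.split(",")]

def value(item):
    # Everything after the first '='
    return item.partition("=")[2]

# One converter per column, sized from the discovered names
df_auto = pd.read_csv(io.StringIO(text), header=None, sep=",",
                      converters={i: value for i in range(len(names))},
                      names=names)
```

With a real file, the first line could be read with open('data.txt').readline() as in the comment, before handing the path to the reader.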