Python: trouble merging on-disk tables with millions of rows
Hello! I currently have two big HDFStores, each containing one node, and neither node fits in memory. The nodes contain no NaN values. Now I would like to merge these two nodes. I first tested this on a small store where all the data fit in one chunk, and that worked fine. But when the merge has to run chunk by chunk, it gives me the following error:
TypeError: Cannot serialize the column [date] because its data contents are [empty] object dtype
This is the code I am running:
>>> import pandas as pd
>>> from pandas import HDFStore
>>> print pd.__version__
0.12.0rc1
>>> h5_1 ='I:/Data/output/test8\\var1.h5'
>>> h5_3 ='I:/Data/output/test8\\var3.h5'
>>> h5_1temp = h5_1.replace('.h5','temp.h5')
>>> A = HDFStore(h5_1)
>>> B = HDFStore(h5_3)
>>> Atemp = HDFStore(h5_1temp)
>>> print A
<class 'pandas.io.pytables.HDFStore'>
File path: I:/Data/output/test8\var1.h5
/var1 frame_table (shape->12626172)
>>> print B
<class 'pandas.io.pytables.HDFStore'>
File path: I:/Data/output/test8\var3.h5
/var3 frame_table (shape->6313086)
>>> nrows_a = A.get_storer('var1').nrows
>>> nrows_b = B.get_storer('var3').nrows
>>> a_chunk_size = 500000
>>> b_chunk_size = 500000
>>> for a in xrange(int(nrows_a / a_chunk_size) + 1):
...     a_start_i = a * a_chunk_size
...     a_stop_i = min((a + 1) * a_chunk_size, nrows_a)
...     a = A.select('var1', start = a_start_i, stop = a_stop_i)
...     for b in xrange(int(nrows_b / b_chunk_size) + 1):
...         b_start_i = b * b_chunk_size
...         b_stop_i = min((b + 1) * b_chunk_size, nrows_b)
...         b = B.select('var3', start = b_start_i, stop = b_stop_i)
...         Atemp.append('mergev13', pd.merge(a, b , left_index=True, right_index=True,how='inner'))
...
Traceback (most recent call last):
  File "<interactive input>", line 9, in <module>
  File "D:\Python27\lib\site-packages\pandas\io\pytables.py", line 658, in append
    self._write_to_group(key, value, table=True, append=True, **kwargs)
  File "D:\Python27\lib\site-packages\pandas\io\pytables.py", line 923, in _write_to_group
    s.write(obj = value, append=append, complib=complib, **kwargs)
  File "D:\Python27\lib\site-packages\pandas\io\pytables.py", line 3251, in write
    return super(AppendableMultiFrameTable, self).write(obj=obj.reset_index(), data_columns=data_columns, **kwargs)
  File "D:\Python27\lib\site-packages\pandas\io\pytables.py", line 2983, in write
    **kwargs)
  File "D:\Python27\lib\site-packages\pandas\io\pytables.py", line 2715, in create_axes
    raise e
TypeError: Cannot serialize the column [date] because
its data contents are [empty] object dtype
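One thing I noticed is that the node attributes report pandas_version := '0.10.1', whereas my installed pandas version is 0.12.0rc1. Some more specific information about the nodes: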
>>> A.select_column('var1','date').unique()
array([2006001, 2006009, 2006017, 2006025, 2006033, 2006041, 2006049,
2006057, 2006065, 2006073, 2006081, 2006089, 2006097, 2006105,
2006113, 2006121, 2006129, 2006137, 2006145, 2006153, 2006161,
2006169, 2006177, 2006185, 2006193, 2006201, 2006209, 2006217,
2006225, 2006233, 2006241, 2006249, 2006257, 2006265, 2006273,
2006281, 2006289, 2006297, 2006305, 2006313, 2006321, 2006329,
2006337, 2006345, 2006353, 2006361], dtype=int64)
>>> B.select_column('var3','date').unique()
array([2006001, 2006017, 2006033, 2006049, 2006065, 2006081, 2006097,
2006113, 2006129, 2006145, 2006161, 2006177, 2006193, 2006209,
2006225, 2006241, 2006257, 2006273, 2006289, 2006305, 2006321,
2006337, 2006353], dtype=int64)
>>> A.get_storer('var1').levels
['x', 'y', 'date']
>>> A.get_storer('var1').attrs
/var1._v_attrs (AttributeSet), 12 attributes:
[CLASS := 'GROUP',
TITLE := '',
VERSION := '1.0',
data_columns := ['date', 'y', 'x'],
index_cols := [(0, 'index')],
levels := ['x', 'y', 'date'],
nan_rep := 'nan',
non_index_axes := [(1, ['x', 'y', 'date', 'var1'])],
pandas_type := 'frame_table',
pandas_version := '0.10.1',
table_type := 'appendable_multiframe',
values_cols := ['values_block_0', 'date', 'y', 'x']]
>>> A.get_storer('var1').table
/var1/table (Table(12626172,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
"date": Int64Col(shape=(), dflt=0, pos=2),
"y": Int64Col(shape=(), dflt=0, pos=3),
"x": Int64Col(shape=(), dflt=0, pos=4)}
byteorder := 'little'
chunkshape := (3276,)
autoIndex := True
colindexes := {
"date": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
"index": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
"y": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
"x": Index(6, medium, shuffle, zlib(1)).is_CSI=False}
>>> B.get_storer('var3').levels
['x', 'y', 'date']
>>> B.get_storer('var3').attrs
/var3._v_attrs (AttributeSet), 12 attributes:
[CLASS := 'GROUP',
TITLE := '',
VERSION := '1.0',
data_columns := ['date', 'y', 'x'],
index_cols := [(0, 'index')],
levels := ['x', 'y', 'date'],
nan_rep := 'nan',
non_index_axes := [(1, ['x', 'y', 'date', 'var3'])],
pandas_type := 'frame_table',
pandas_version := '0.10.1',
table_type := 'appendable_multiframe',
values_cols := ['values_block_0', 'date', 'y', 'x']]
>>> B.get_storer('var3').table
/var3/table (Table(6313086,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
"date": Int64Col(shape=(), dflt=0, pos=2),
"y": Int64Col(shape=(), dflt=0, pos=3),
"x": Int64Col(shape=(), dflt=0, pos=4)}
byteorder := 'little'
chunkshape := (3276,)
autoIndex := True
colindexes := {
"date": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
"index": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
"y": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
"x": Index(6, medium, shuffle, zlib(1)).is_CSI=False}
>>> print Atemp
<class 'pandas.io.pytables.HDFStore'>
File path: I:/Data/output/test8\var1temp.h5
/mergev13 frame_table (shape->823446)
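The merge of some chunk pairs apparently comes back empty, and an empty frame cannot be appended to the store. Check the result before appending:
df = pd.merge(a, b , left_index=True, right_index=True,how='inner')
if len(df):
    Atemp.append('mergev46', df)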
<class 'pandas.io.pytables.HDFStore'>
File path: var4.h5
/var4 frame_table (shape->1334)
<class 'pandas.io.pytables.HDFStore'>
File path: var6.h5
/var6 frame_table (shape->667)
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1334 entries, (928, 310, 2006001) to (1000, 238, 2006361)
Data columns (total 1 columns):
var4 1334 non-null values
dtypes: float64(1)
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 667 entries, (928, 310, 2006001) to (1000, 238, 2006353)
Data columns (total 1 columns):
var6 667 non-null values
dtypes: float64(1)
<class 'pandas.io.pytables.HDFStore'>
File path: var4temp.h5
/mergev46 frame_table (shape->977)
Closing remaining open files: var6.h5... done var4.h5... done var4temp.h5... done
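The guarded chunk-by-chunk merge can be tried without any HDF5 files. What follows is only a minimal sketch in Python 3 (the session above is Python 2.7): small in-memory frames stand in for the two store nodes, `iloc` slices stand in for `select(..., start=, stop=)`, and a plain list stands in for `Atemp.append`. The helper names `chunk_bounds` and `merge_in_chunks` and the toy data are assumptions made for illustration, not part of the original code.

```python
import numpy as np
import pandas as pd


def chunk_bounds(nrows, chunk_size):
    """Yield (start, stop) row ranges that cover nrows rows."""
    for start in range(0, nrows, chunk_size):
        yield start, min(start + chunk_size, nrows)


def merge_in_chunks(left, right, chunk_size):
    """Inner-join two frames on their index, one pair of chunks at a time."""
    pieces = []
    for a_start, a_stop in chunk_bounds(len(left), chunk_size):
        a = left.iloc[a_start:a_stop]        # stand-in for A.select(..., start=, stop=)
        for b_start, b_stop in chunk_bounds(len(right), chunk_size):
            b = right.iloc[b_start:b_stop]   # stand-in for B.select(..., start=, stop=)
            df = pd.merge(a, b, left_index=True, right_index=True, how="inner")
            if len(df):                      # skip chunk pairs with no rows in common
                pieces.append(df)            # stand-in for Atemp.append('mergev46', df)
    return pieces


# Toy data using the same (x, y, date) index levels as the nodes in the question.
idx_a = pd.MultiIndex.from_product(
    [[928, 929], [310], [2006001, 2006009, 2006017, 2006025]],
    names=["x", "y", "date"])
idx_b = pd.MultiIndex.from_product(
    [[928, 929], [310], [2006001, 2006017]],
    names=["x", "y", "date"])
var4 = pd.DataFrame({"var4": np.arange(len(idx_a), dtype="float64")}, index=idx_a)
var6 = pd.DataFrame({"var6": np.arange(len(idx_b), dtype="float64")}, index=idx_b)

pieces = merge_in_chunks(var4, var6, chunk_size=3)
merged = pd.concat(pieces).sort_index()
print(merged)
```

With this toy data several chunk pairs share no index values, so their inner merge is empty and is skipped. The remaining pieces together contain the same rows as a single in-memory merge of the two frames.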