Python: problem merging on-disk tables with millions of rows

Tags: python, python-2.7, pandas, pytables, hdfstore

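TypeError: Cannot serialize the column [date] because its data contents are [empty] object dtype

Hello! I currently have two large HDFStores, each containing a single node; neither node fits in memory. The nodes contain no NaN values. I would now like to merge these two nodes. I first tested this on a small store where all the data fit in one chunk, and that worked fine. But in the case where the merge has to be done chunk by chunk, I get the following error:
TypeError: Cannot serialize the column [date] because its data contents are [empty] object dtype

Here is the code I am running: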

>>> import pandas as pd
>>> from pandas import HDFStore
>>> print pd.__version__
0.12.0rc1

>>> h5_1 ='I:/Data/output/test8\\var1.h5'
>>> h5_3 ='I:/Data/output/test8\\var3.h5'
>>> h5_1temp = h5_1.replace('.h5','temp.h5')

>>> A = HDFStore(h5_1)
>>> B = HDFStore(h5_3)
>>> Atemp = HDFStore(h5_1temp)

>>> print A
<class 'pandas.io.pytables.HDFStore'>
File path: I:/Data/output/test8\var1.h5
/var1            frame_table  (shape->12626172)
>>> print B
<class 'pandas.io.pytables.HDFStore'>
File path: I:/Data/output/test8\var3.h5
/var3            frame_table  (shape->6313086)

>>> nrows_a = A.get_storer('var1').nrows
>>> nrows_b = B.get_storer('var3').nrows
>>> a_chunk_size = 500000
>>> b_chunk_size = 500000
>>> for a in xrange(int(nrows_a / a_chunk_size) + 1):
...     a_start_i = a * a_chunk_size
...     a_stop_i  = min((a + 1) * a_chunk_size, nrows_a)
...     a = A.select('var1', start = a_start_i, stop = a_stop_i)
...     for b in xrange(int(nrows_b / b_chunk_size) + 1):
...         b_start_i = b * b_chunk_size
...         b_stop_i = min((b + 1) * b_chunk_size, nrows_b)
...         b = B.select('var3', start = b_start_i, stop = b_stop_i)
...         Atemp.append('mergev13', pd.merge(a, b , left_index=True, right_index=True,how='inner'))

... 
Traceback (most recent call last):
  File "<interactive input>", line 9, in <module>
  File "D:\Python27\lib\site-packages\pandas\io\pytables.py", line 658, in append
    self._write_to_group(key, value, table=True, append=True, **kwargs)
  File "D:\Python27\lib\site-packages\pandas\io\pytables.py", line 923, in _write_to_group
    s.write(obj = value, append=append, complib=complib, **kwargs)
  File "D:\Python27\lib\site-packages\pandas\io\pytables.py", line 3251, in write
    return super(AppendableMultiFrameTable, self).write(obj=obj.reset_index(), data_columns=data_columns, **kwargs)
  File "D:\Python27\lib\site-packages\pandas\io\pytables.py", line 2983, in write
    **kwargs)
  File "D:\Python27\lib\site-packages\pandas\io\pytables.py", line 2715, in create_axes
    raise e
TypeError: Cannot serialize the column [date] because
its data contents are [empty] object dtype

One thing I noticed is that the store reports pandas_version := '0.10.1', although my pandas version is 0.12.0rc1. Some more specific information about the nodes:

>>> A.select_column('var1','date').unique()
array([2006001, 2006009, 2006017, 2006025, 2006033, 2006041, 2006049,
       2006057, 2006065, 2006073, 2006081, 2006089, 2006097, 2006105,
       2006113, 2006121, 2006129, 2006137, 2006145, 2006153, 2006161,
       2006169, 2006177, 2006185, 2006193, 2006201, 2006209, 2006217,
       2006225, 2006233, 2006241, 2006249, 2006257, 2006265, 2006273,
       2006281, 2006289, 2006297, 2006305, 2006313, 2006321, 2006329,
       2006337, 2006345, 2006353, 2006361], dtype=int64)

>>> B.select_column('var3','date').unique()
array([2006001, 2006017, 2006033, 2006049, 2006065, 2006081, 2006097,
       2006113, 2006129, 2006145, 2006161, 2006177, 2006193, 2006209,
       2006225, 2006241, 2006257, 2006273, 2006289, 2006305, 2006321,
       2006337, 2006353], dtype=int64)

>>> A.get_storer('var1').levels
['x', 'y', 'date']

>>> A.get_storer('var1').attrs
/var1._v_attrs (AttributeSet), 12 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := ['date', 'y', 'x'],
    index_cols := [(0, 'index')],
    levels := ['x', 'y', 'date'],
    nan_rep := 'nan',
    non_index_axes := [(1, ['x', 'y', 'date', 'var1'])],
    pandas_type := 'frame_table',
    pandas_version := '0.10.1',
    table_type := 'appendable_multiframe',
    values_cols := ['values_block_0', 'date', 'y', 'x']]

>>> A.get_storer('var1').table
/var1/table (Table(12626172,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
  "date": Int64Col(shape=(), dflt=0, pos=2),
  "y": Int64Col(shape=(), dflt=0, pos=3),
  "x": Int64Col(shape=(), dflt=0, pos=4)}
  byteorder := 'little'
  chunkshape := (3276,)
  autoIndex := True
  colindexes := {
    "date": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "y": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "x": Index(6, medium, shuffle, zlib(1)).is_CSI=False}

>>> B.get_storer('var3').levels
['x', 'y', 'date']

>>> B.get_storer('var3').attrs
/var3._v_attrs (AttributeSet), 12 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := ['date', 'y', 'x'],
    index_cols := [(0, 'index')],
    levels := ['x', 'y', 'date'],
    nan_rep := 'nan',
    non_index_axes := [(1, ['x', 'y', 'date', 'var3'])],
    pandas_type := 'frame_table',
    pandas_version := '0.10.1',
    table_type := 'appendable_multiframe',
    values_cols := ['values_block_0', 'date', 'y', 'x']]

>>> B.get_storer('var3').table
/var3/table (Table(6313086,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
  "date": Int64Col(shape=(), dflt=0, pos=2),
  "y": Int64Col(shape=(), dflt=0, pos=3),
  "x": Int64Col(shape=(), dflt=0, pos=4)}
  byteorder := 'little'
  chunkshape := (3276,)
  autoIndex := True
  colindexes := {
    "date": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "y": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "x": Index(6, medium, shuffle, zlib(1)).is_CSI=False}

>>> print Atemp
<class 'pandas.io.pytables.HDFStore'>
File path: I:/Data/output/test8\var1temp.h5
/mergev13            frame_table  (shape->823446)
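
As a side note on the loop bounds in the question: `int(nrows / chunk_size) + 1` iterates one extra time when the row count is an exact multiple of the chunk size, so the last `select` asks for an empty range. A minimal sketch of an alternative is shown below; `chunk_ranges` is a hypothetical helper name, not part of pandas, and it is written for Python 3 (`range` instead of `xrange`).

```python
def chunk_ranges(nrows, chunk_size):
    """Yield (start, stop) row ranges covering nrows rows, never an empty range."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, nrows, chunk_size):
        yield start, min(start + chunk_size, nrows)


# Row count of /var1 as printed in the question.
ranges_a = list(chunk_ranges(12626172, 500000))
print(len(ranges_a), ranges_a[0], ranges_a[-1])
```

Each `(start, stop)` pair can then be passed as the `start` and `stop` arguments of `select`, as in the question's loop.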
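
The merge of two chunks can come back with no rows at all, and appending such an empty frame is most likely what raises this error. Check the length before appending:

df = pd.merge(a, b, left_index=True, right_index=True, how='inner')

if len(df):
    Atemp.append('mergev46', df)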

Output from a small test with the stores var4.h5 and var6.h5:
<class 'pandas.io.pytables.HDFStore'>
File path: var4.h5
/var4            frame_table  (shape->1334)
<class 'pandas.io.pytables.HDFStore'>
File path: var6.h5
/var6            frame_table  (shape->667)
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1334 entries, (928, 310, 2006001) to (1000, 238, 2006361)
Data columns (total 1 columns):
var4    1334  non-null values
dtypes: float64(1)
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 667 entries, (928, 310, 2006001) to (1000, 238, 2006353)
Data columns (total 1 columns):
var6    667  non-null values
dtypes: float64(1)
<class 'pandas.io.pytables.HDFStore'>
File path: var4temp.h5
/mergev46            frame_table  (shape->977)
Closing remaining open files: var6.h5... done var4.h5... done var4temp.h5... done
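
Putting the length check together with the chunked double loop, a minimal sketch might look like the following. It is only an illustration under stated assumptions: `merge_in_chunks`, `select_a`, `select_b` and `sink` are hypothetical names, the two on-disk nodes are replaced by small in-memory frames so the example runs without any HDF5 file, and the code targets Python 3.

```python
import pandas as pd


def merge_in_chunks(select_a, select_b, nrows_a, nrows_b, chunk_size, sink):
    """Inner-join two row-addressable tables chunk by chunk.

    select_a / select_b: hypothetical adapters, callables (start, stop) -> DataFrame.
    sink: callable that receives each non-empty merged DataFrame.
    Returns the total number of merged rows handed to sink.
    """
    total = 0
    for a_start in range(0, nrows_a, chunk_size):
        a = select_a(a_start, min(a_start + chunk_size, nrows_a))
        for b_start in range(0, nrows_b, chunk_size):
            b = select_b(b_start, min(b_start + chunk_size, nrows_b))
            df = pd.merge(a, b, left_index=True, right_index=True, how="inner")
            if len(df):  # skip chunk pairs that share no index keys
                sink(df)
                total += len(df)
    return total


# In-memory stand-ins for the two nodes (made-up values, same index levels).
idx_a = pd.MultiIndex.from_tuples(
    [(1, 1, 2006001), (1, 1, 2006009), (2, 1, 2006001), (2, 1, 2006009)],
    names=["x", "y", "date"],
)
idx_b = pd.MultiIndex.from_tuples(
    [(1, 1, 2006001), (2, 1, 2006001)], names=["x", "y", "date"]
)
frame_a = pd.DataFrame({"var1": [0.1, 0.2, 0.3, 0.4]}, index=idx_a)
frame_b = pd.DataFrame({"var3": [1.0, 2.0]}, index=idx_b)

merged_parts = []
n = merge_in_chunks(
    lambda start, stop: frame_a.iloc[start:stop],
    lambda start, stop: frame_b.iloc[start:stop],
    len(frame_a), len(frame_b), chunk_size=1, sink=merged_parts.append,
)
result = pd.concat(merged_parts)
print(n, len(result))
```

With real stores, `select_a` could wrap `A.select('var1', start=start, stop=stop)` and `sink` could wrap `Atemp.append('mergev13', df)`, the same calls used in the question. Note that every chunk of one table is still compared against every chunk of the other, so the cost grows with the product of the two chunk counts.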