Problem merging on-disk tables with millions of rows in Python

python python-2.7 pandas pytables hdfstore

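TypeError: Cannot serialize the column [date] because its data contents are [empty] object dtype

Hi! I currently have two large HDFStores, each containing one node, and neither node fits in memory. The nodes contain no NaN values. Now I want to merge these two nodes. I first tested on a small store where all the data fits in a single chunk, and that worked fine. But now that the merge has to be done chunk by chunk, it gives me the following error:

TypeError: Cannot serialize the column [date] because its data contents are [empty] object dtype

This is the code I am running: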

>>> import pandas as pd
>>> from pandas import HDFStore
>>> print pd.__version__
0.12.0rc1

>>> h5_1 ='I:/Data/output/test8\\var1.h5'
>>> h5_3 ='I:/Data/output/test8\\var3.h5'
>>> h5_1temp = h5_1.replace('.h5','temp.h5')

>>> A = HDFStore(h5_1)
>>> B = HDFStore(h5_3)
>>> Atemp = HDFStore(h5_1temp)

>>> print A
<class 'pandas.io.pytables.HDFStore'>
File path: I:/Data/output/test8\var1.h5
/var1            frame_table  (shape->12626172)
>>> print B
<class 'pandas.io.pytables.HDFStore'>
File path: I:/Data/output/test8\var3.h5
/var3            frame_table  (shape->6313086)

>>> nrows_a = A.get_storer('var1').nrows
>>> nrows_b = B.get_storer('var3').nrows
>>> a_chunk_size = 500000
>>> b_chunk_size = 500000
>>> for a in xrange(int(nrows_a / a_chunk_size) + 1):
...     a_start_i = a * a_chunk_size
...     a_stop_i  = min((a + 1) * a_chunk_size, nrows_a)
...     a = A.select('var1', start = a_start_i, stop = a_stop_i)
...     for b in xrange(int(nrows_b / b_chunk_size) + 1):
...         b_start_i = b * b_chunk_size
...         b_stop_i = min((b + 1) * b_chunk_size, nrows_b)
...         b = B.select('var3', start = b_start_i, stop = b_stop_i)
...         Atemp.append('mergev13', pd.merge(a, b , left_index=True, right_index=True,how='inner'))

... 
Traceback (most recent call last):
  File "<interactive input>", line 9, in <module>
  File "D:\Python27\lib\site-packages\pandas\io\pytables.py", line 658, in append
    self._write_to_group(key, value, table=True, append=True, **kwargs)
  File "D:\Python27\lib\site-packages\pandas\io\pytables.py", line 923, in _write_to_group
    s.write(obj = value, append=append, complib=complib, **kwargs)
  File "D:\Python27\lib\site-packages\pandas\io\pytables.py", line 3251, in write
    return super(AppendableMultiFrameTable, self).write(obj=obj.reset_index(), data_columns=data_columns, **kwargs)
  File "D:\Python27\lib\site-packages\pandas\io\pytables.py", line 2983, in write
    **kwargs)
  File "D:\Python27\lib\site-packages\pandas\io\pytables.py", line 2715, in create_axes
    raise e
TypeError: Cannot serialize the column [date] because
its data contents are [empty] object dtype
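The `[empty]` in the error message suggests that one of the merged chunks had no rows. A minimal sketch of how that can happen, assuming two toy chunks (hypothetical values, not taken from the stores above) whose `(x, y, date)` indexes do not overlap:

```python
import pandas as pd

names = ["x", "y", "date"]

# Two small stand-ins for a chunk of 'var1' and a chunk of 'var3'.
# The values are made up; the point is that the index keys are disjoint.
a = pd.DataFrame(
    {"var1": [0.1, 0.2]},
    index=pd.MultiIndex.from_tuples(
        [(1, 1, 2006001), (1, 1, 2006009)], names=names
    ),
)
b = pd.DataFrame(
    {"var3": [0.3, 0.4]},
    index=pd.MultiIndex.from_tuples(
        [(2, 2, 2006001), (2, 2, 2006017)], names=names
    ),
)

# An inner merge on the index keeps only keys present in both chunks,
# so with disjoint keys the result has the right columns but zero rows.
merged = pd.merge(a, b, left_index=True, right_index=True, how="inner")
print(len(merged))
```

With row-based chunking, most chunk pairs are likely to look like this, so an empty result is the common case rather than the exception.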

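What I noticed is that the node attributes report pandas_version := '0.10.1', whereas my installed pandas version is 0.12.0rc1. Some more specific information about the nodes: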

>>> A.select_column('var1','date').unique()
array([2006001, 2006009, 2006017, 2006025, 2006033, 2006041, 2006049,
       2006057, 2006065, 2006073, 2006081, 2006089, 2006097, 2006105,
       2006113, 2006121, 2006129, 2006137, 2006145, 2006153, 2006161,
       2006169, 2006177, 2006185, 2006193, 2006201, 2006209, 2006217,
       2006225, 2006233, 2006241, 2006249, 2006257, 2006265, 2006273,
       2006281, 2006289, 2006297, 2006305, 2006313, 2006321, 2006329,
       2006337, 2006345, 2006353, 2006361], dtype=int64)

>>> B.select_column('var3','date').unique()
array([2006001, 2006017, 2006033, 2006049, 2006065, 2006081, 2006097,
       2006113, 2006129, 2006145, 2006161, 2006177, 2006193, 2006209,
       2006225, 2006241, 2006257, 2006273, 2006289, 2006305, 2006321,
       2006337, 2006353], dtype=int64)

>>> A.get_storer('var1').levels
['x', 'y', 'date']

>>> A.get_storer('var1').attrs
/var1._v_attrs (AttributeSet), 12 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := ['date', 'y', 'x'],
    index_cols := [(0, 'index')],
    levels := ['x', 'y', 'date'],
    nan_rep := 'nan',
    non_index_axes := [(1, ['x', 'y', 'date', 'var1'])],
    pandas_type := 'frame_table',
    pandas_version := '0.10.1',
    table_type := 'appendable_multiframe',
    values_cols := ['values_block_0', 'date', 'y', 'x']]

>>> A.get_storer('var1').table
/var1/table (Table(12626172,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
  "date": Int64Col(shape=(), dflt=0, pos=2),
  "y": Int64Col(shape=(), dflt=0, pos=3),
  "x": Int64Col(shape=(), dflt=0, pos=4)}
  byteorder := 'little'
  chunkshape := (3276,)
  autoIndex := True
  colindexes := {
    "date": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "y": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "x": Index(6, medium, shuffle, zlib(1)).is_CSI=False}

>>> B.get_storer('var3').levels
['x', 'y', 'date']

>>> B.get_storer('var3').attrs
/var3._v_attrs (AttributeSet), 12 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := ['date', 'y', 'x'],
    index_cols := [(0, 'index')],
    levels := ['x', 'y', 'date'],
    nan_rep := 'nan',
    non_index_axes := [(1, ['x', 'y', 'date', 'var3'])],
    pandas_type := 'frame_table',
    pandas_version := '0.10.1',
    table_type := 'appendable_multiframe',
    values_cols := ['values_block_0', 'date', 'y', 'x']]

>>> B.get_storer('var3').table
/var3/table (Table(6313086,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
  "date": Int64Col(shape=(), dflt=0, pos=2),
  "y": Int64Col(shape=(), dflt=0, pos=3),
  "x": Int64Col(shape=(), dflt=0, pos=4)}
  byteorder := 'little'
  chunkshape := (3276,)
  autoIndex := True
  colindexes := {
    "date": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "y": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "x": Index(6, medium, shuffle, zlib(1)).is_CSI=False}

>>> print Atemp
<class 'pandas.io.pytables.HDFStore'>
File path: I:/Data/output/test8\var1temp.h5
/mergev13            frame_table  (shape->823446)

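# Some chunk pairs share no index values, so the inner merge can return an
# empty frame; appending that is what appears to raise the TypeError.
df = pd.merge(a, b, left_index=True, right_index=True, how='inner')

# Only append when the merged chunk actually contains rows.
if len(df):
    Atemp.append('mergev46', df)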
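The same guard can be tried without any HDF5 files. The following is a rough sketch, not the original poster's code: the helper names `iter_chunks` and `merge_in_chunks` and the toy data are assumptions made for illustration, and `iloc` slices stand in for `select(..., start=, stop=)`. It merges every chunk pair, skips the empty results, and leaves a full in-memory merge alongside for comparison.

```python
import pandas as pd


def iter_chunks(frame, chunk_size):
    """Yield consecutive row slices, mimicking select(start=..., stop=...)."""
    for start in range(0, len(frame), chunk_size):
        yield frame.iloc[start:start + chunk_size]


def merge_in_chunks(left, right, chunk_size):
    """Inner-merge two indexed frames pair by pair, skipping empty results."""
    pieces = []
    for a in iter_chunks(left, chunk_size):
        for b in iter_chunks(right, chunk_size):
            df = pd.merge(a, b, left_index=True, right_index=True, how="inner")
            if len(df):  # ignore chunk pairs that share no index keys
                pieces.append(df)
    return pd.concat(pieces) if pieces else pd.DataFrame()


# Toy data standing in for the two nodes (hypothetical values).
names = ["x", "y", "date"]
idx_a = pd.MultiIndex.from_product(
    [[1, 2], [10, 20], [2006001, 2006009]], names=names
)
idx_b = pd.MultiIndex.from_product([[1, 2], [10, 20], [2006001]], names=names)
left = pd.DataFrame({"var1": range(len(idx_a))}, index=idx_a)
right = pd.DataFrame({"var3": range(len(idx_b))}, index=idx_b)

chunked = merge_in_chunks(left, right, chunk_size=3)
full = pd.merge(left, right, left_index=True, right_index=True, how="inner")
```

Collecting the pieces in a list only works here because the toy frames are tiny; with stores that do not fit in memory, each non-empty piece would be appended to the output store instead, as in the snippet above.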

<class 'pandas.io.pytables.HDFStore'>
File path: var4.h5
/var4            frame_table  (shape->1334)
<class 'pandas.io.pytables.HDFStore'>
File path: var6.h5
/var6            frame_table  (shape->667)
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1334 entries, (928, 310, 2006001) to (1000, 238, 2006361)
Data columns (total 1 columns):
var4    1334  non-null values
dtypes: float64(1)
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 667 entries, (928, 310, 2006001) to (1000, 238, 2006353)
Data columns (total 1 columns):
var6    667  non-null values
dtypes: float64(1)
<class 'pandas.io.pytables.HDFStore'>
File path: var4temp.h5
/mergev46            frame_table  (shape->977)

Closing remaining open files: var6.h5... done var4.h5... done var4temp.h5... done