When using Pandas to_hdf, is it possible to specify a vlen special_dtype/VLArray column dtype for ragged tensors?


I have a Pandas column that contains numpy arrays or lists of different sizes. If I try to write the DataFrame to HDF5 with to_hdf, I get this message:

PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed-integer,key->block0_values]

I guess this is because the tensors in the Pandas column are ragged. h5py does have a special dtype for ragged tensors.

Here is an example:

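import h5py
import numpy as np

# yourData is a sequence of variable-length int32 arrays
h5f = h5py.File('data.h5', 'w')
dt = h5py.special_dtype(vlen=np.dtype('int32'))
h5f.create_dataset('batch', data=yourData, dtype=dt, compression='gzip', compression_opts=9)
h5f.close()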

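So I could convert the pandas DataFrame to numpy and then save each numpy array separately, using the special vlen dtype for the variable-length column.

I would like to know whether there is a way to do this from Pandas itself.

Below is a simple example that uses a small chunk of my data. It downloads and opens a small piece of the DataFrame and saves it to HDF5.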
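A minimal sketch of that manual route, assuming a small hypothetical DataFrame (the `paperId` column, its values, and the file name `ragged_h5py.h5` are made up for illustration; only `totalCites2` comes from the question). The fixed-width column is written as an ordinary dataset and the ragged column is written row by row into a vlen dataset:

```python
import h5py
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real DataFrame (values are invented)
df = pd.DataFrame({
    "paperId": [10, 11, 12],
    "totalCites2": [[1, 2, 3], [4], [5, 6]],
})

# Variable-length int32 dtype, as in the h5py example above
vlen_int = h5py.special_dtype(vlen=np.dtype("int32"))

with h5py.File("ragged_h5py.h5", "w") as h5f:
    # Fixed-width column: a plain dataset
    h5f.create_dataset("paperId", data=df["paperId"].to_numpy())
    # Ragged column: allocate one vlen slot per row, then fill each slot
    ragged = h5f.create_dataset("totalCites2", shape=(len(df),), dtype=vlen_int)
    for i, row in enumerate(df["totalCites2"]):
        ragged[i] = np.asarray(row, dtype="int32")

# Read the ragged column back as a list of Python lists
with h5py.File("ragged_h5py.h5", "r") as h5f:
    restored = [arr.tolist() for arr in h5f["totalCites2"][:]]
```

This bypasses to_hdf entirely, so the resulting file is not in the layout that pd.read_hdf expects; the columns have to be reassembled into a DataFrame by hand.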

import requests
import pickle
import numpy as np
import pandas as pd

#Download function for google drive 

def download_file_from_google_drive(id, destination):
    URL = "https://docs.google.com/uc?export=download"

    session = requests.Session()

    response = session.get(URL, params = { 'id' : id }, stream = True)
    token = get_confirm_token(response)

    if token:
        params = { 'id' : id, 'confirm' : token }
        response = session.get(URL, params = params, stream = True)

    save_response_content(response, destination)    

def get_confirm_token(response):
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value

    return None

def save_response_content(response, destination):
    CHUNK_SIZE = 32768

    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)

#download the google drive file 
download_file_from_google_drive('1-0R28Yhdrq2QWQ-4MXHIZUdZG2WZK2qR', 'sample.pkl') 

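# Load the pickled sample DataFrame
sampleDF2 = pd.read_pickle('sample.pkl')

# 'totalCites2' holds Python lists of different lengths; this call emits the PerformanceWarning
sampleDF2.to_hdf('pandasList.hdf', key='first', complevel=9)

# Convert each list to a numpy array and save again
sampleDF2['totalCites2'] = sampleDF2['totalCites2'].apply(np.array)

sampleDF2.to_hdf('pandasNumpy.hdf', key='first', complevel=9)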

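For convenience, here is a Colab notebook containing this code.

Edit:

As hpualj mentioned, Pandas uses PyTables rather than h5py, so the question should really be how to use VLArray, which is how PyTables stores variable-length arrays.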
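For reference, a minimal sketch of writing ragged rows to a PyTables VLArray directly. The data, node name, and file name here are hypothetical, and this calls PyTables itself rather than going through to_hdf, so it does not answer whether to_hdf can be told to do the same:

```python
import numpy as np
import tables

# Hypothetical ragged data standing in for the 'totalCites2' column
rows = [[1, 2, 3], [4], [5, 6]]

with tables.open_file("ragged_pytables.h5", mode="w") as h5:
    # One VLArray node; each appended row may have a different length
    vla = h5.create_vlarray(
        h5.root,
        "totalCites2",
        atom=tables.Int32Atom(),
        title="ragged int32 rows",
        filters=tables.Filters(complevel=9, complib="zlib"),
    )
    for row in rows:
        vla.append(np.asarray(row, dtype="int32"))

# read() returns a list of numpy arrays, one per row
with tables.open_file("ragged_pytables.h5", mode="r") as h5:
    restored = [r.tolist() for r in h5.root.totalCites2.read()]
```

As with the h5py route, the remaining DataFrame columns would need to be stored separately and joined back by row position when reading.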

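to_hdf uses pytables to interface with HDF5, not h5py.

It looks like I should look into VLArray, which is how pytables stores variable-length arrays.

Did you find an answer to this question? I would also like to be able to save ragged tensors with pandas.

Some workarounds. You can save the arrays as text and convert them back to numbers when you open the file. Or you can put all the values in a single column and use a second column to indicate which class each data point belongs to; when you open the file, you can use pandas or numpy to stack all the rows that share a given class.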
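Both workarounds can be sketched with pandas alone. The DataFrame below is hypothetical (`paperId` and all values are invented), JSON is just one possible text encoding, and the sketch assumes no row holds an empty list, since explode would turn an empty list into NaN:

```python
import json
import pandas as pd

# Hypothetical stand-in for the real DataFrame
df = pd.DataFrame({
    "paperId": [10, 11, 12],
    "totalCites2": [[1, 2, 3], [4], [5, 6]],
})

# Workaround 1: store each list as a text (JSON) string, parse it when loading
as_text = df.assign(totalCites2=df["totalCites2"].apply(json.dumps))
from_text = as_text["totalCites2"].apply(json.loads).tolist()

# Workaround 2: long format, one value per row, with paperId marking the group
long_df = (
    df.explode("totalCites2")
      .astype({"totalCites2": "int64"})
      .reset_index(drop=True)
)

# After loading, regroup the values that share the same paperId
regrouped = (
    long_df.groupby("paperId", sort=False)["totalCites2"].apply(list).tolist()
)
```

In the long format every column is a plain integer column, so the frame no longer contains object values that would have to be pickled; `as_text` or `long_df` can then be passed to to_hdf in place of the original frame.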