Efficient reading and writing of pandas DataFrames in Python

python pandas dataframe


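I have a pandas DataFrame that I want to split into several smaller pieces of 100k rows each and save to disk, so that I can read the data back in and process it piece by piece. I have tried dill and hdf storage, because CSV and raw text seem to take a lot of time.

I am trying this out on a subset of the data with about 500k rows and five columns of mixed data. Two columns contain strings, one an integer, one a float, and the last one contains bigram counts from sklearn.feature_extraction.text.CountVectorizer, stored as a scipy.sparse.csr.csr_matrix sparse matrix.

It is this last column that I have problems with. Dumping and loading the data works without issues, but when I try to actually access the data, it is a pandas.Series object instead. Second, each row in that Series is a tuple that contains the whole data set.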

# Before dumping, the original df has 100k rows.
# Each column has one value except for 'counts' which has 1400.
# Meaning that df['counts'] gives me a sparse matrix that is 100k x 1400.
import dill
import numpy as np
import pandas
import sklearn.feature_extraction.text

vectorizer = sklearn.feature_extraction.text.CountVectorizer(analyzer='char', ngram_range=(2, 2))
counts = vectorizer.fit_transform(df['string_data'])
df['counts'] = counts

df_split = pandas.DataFrame(np.column_stack([df['string1'][0:100000],
                                             df['string2'][0:100000],
                                             df['float'][0:100000],
                                             df['integer'][0:100000],
                                             df['counts'][0:100000]]),
                            columns=['string1', 'string2', 'float', 'integer', 'counts'])
# Pickled data is binary, so the file is opened in 'wb' / 'rb' mode.
with open(file[i], 'wb') as f:
    dill.dump(df_split, f)

with open(file[i], 'rb') as f:
    df = dill.load(f)
print(type(df['counts']))
> <class 'pandas.core.series.Series'>
print(np.shape(df['counts']))
> (100000,)
print(np.shape(df['counts'][0]))
> (496718, 1400)    # 496718 is the number of rows in my complete data set.
print(type(df['counts'][0]))
> <type 'tuple'>


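Am I making any obvious mistakes, or is there a better, less time-consuming way to store this data in this format? It has to scale to my full data set of 100 million rows.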

df['counts'] = counts

This produces a pandas Series (column) whose number of elements equals len(df), where each element is the sparse matrix returned by vectorizer.fit_transform(df['string_data']).

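You can try the following instead:

# counts.toarray() is the spelled-out form of counts.A.
# On scikit-learn 1.0 and later, use vectorizer.get_feature_names_out().
df = df.join(pd.DataFrame(counts.toarray(), columns=vectorizer.get_feature_names(), index=df.index))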

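NOTE: be aware that this will explode your sparse matrix into a dense (not sparse) DataFrame, so it will use much more memory and you can end up with a MemoryError.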
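To get a feel for how large that blow-up can be, the footprint of both representations can be estimated before converting anything. The following is only a rough sketch: the helper names sparse_nbytes and dense_nbytes are made up for illustration, and the random matrix is a hypothetical stand-in for the real CountVectorizer output (which would typically hold integer counts rather than floats).

```python
import numpy as np
from scipy import sparse


def sparse_nbytes(m):
    """Bytes held by the three arrays that back a CSR matrix."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes


def dense_nbytes(m):
    """Bytes the same matrix would need as a dense array (nothing is allocated)."""
    rows, cols = m.shape
    return rows * cols * m.dtype.itemsize


# Hypothetical stand-in for the vectorizer output: 10,000 x 1,400 with 1% non-zeros.
counts = sparse.random(10_000, 1_400, density=0.01, format="csr",
                       dtype=np.float64, random_state=0)

print("sparse:", sparse_nbytes(counts), "bytes")
print("dense: ", dense_nbytes(counts), "bytes")
```

The dense figure grows with rows times columns regardless of how many entries are non-zero, which is why joining the counts as ordinary columns gets expensive quickly at 100 million rows.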
Conclusion: that is why I would recommend storing your original DF and the count sparse matrix separately.
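One possible way to store the two separately while keeping them row-aligned is sketched below. This is an assumption about how the advice could be applied, not the answerer's own code: it swaps in pandas to_pickle and scipy.sparse.save_npz for the dill calls from the question, and the function names save_chunks and load_chunk as well as the file naming scheme are hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from scipy import sparse


def save_chunks(df, counts, out_dir, chunk_rows=100_000):
    """Write row-aligned chunks: one pickle per DataFrame slice, one .npz per matrix slice."""
    counts = counts.tocsr()  # CSR supports efficient row slicing
    n_chunks = 0
    for start in range(0, len(df), chunk_rows):
        stop = start + chunk_rows
        df.iloc[start:stop].to_pickle(os.path.join(out_dir, f"frame_{n_chunks}.pkl"))
        sparse.save_npz(os.path.join(out_dir, f"counts_{n_chunks}.npz"), counts[start:stop])
        n_chunks += 1
    return n_chunks


def load_chunk(out_dir, i):
    """Read back the i-th DataFrame slice together with its matching sparse rows."""
    frame = pd.read_pickle(os.path.join(out_dir, f"frame_{i}.pkl"))
    chunk_counts = sparse.load_npz(os.path.join(out_dir, f"counts_{i}.npz"))
    return frame, chunk_counts


# Tiny made-up data set: 5 rows, split into chunks of 2 rows each.
df = pd.DataFrame({"string1": ["a", "b", "c", "d", "e"], "integer": range(5)})
counts = sparse.csr_matrix(np.arange(15).reshape(5, 3))

with tempfile.TemporaryDirectory() as tmp:
    n = save_chunks(df, counts, tmp, chunk_rows=2)
    frame, chunk = load_chunk(tmp, 2)  # last chunk, holding the single remaining row
```

Because both pieces are cut at the same row offsets, row k of a loaded frame corresponds to row k of the loaded matrix, and the counts never have to be converted to a dense array.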
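How did you create/append the counts column?

I have added this to the code.

I don't think it's a good idea to store a sparse matrix as a pandas column. In my opinion it is an error-prone approach. I would store them separately...

Why do you think it is error-prone? Can you elaborate on why you think it's a bad idea?

Just reread your own question: "when I try to actually access the data, it is a pandas.Series object" ;-)

Thanks, the size did indeed explode. I will keep the two separate as you suggested.

@Tobias, glad it helped :)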