Python: applying json.loads to a column of a dataframe with dask

python pandas dataframe dask

I have a dataframe, fulldb_accrep_united, that looks like this:

   SparkID  ...                                             Period
0   913955  ...  {"@PeriodName": "2000", "@DateBegin": "2000-01...
1   913955  ...  {"@PeriodName": "1999", "@DateBegin": "1999-01...
2    16768  ...  {"@PeriodName": "2007", "@DateBegin": "2007-01...
3    16768  ...  {"@PeriodName": "2006", "@DateBegin": "2006-01...
4    16768  ...  {"@PeriodName": "2005", "@DateBegin": "2005-01...

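I need to convert the Period column (currently a column of strings) into a column of parsed json values. Normally I do this with df.apply(lambda x: json.loads(x)), but this dataframe is too large to process as a whole. I would like to use dask, but I seem to be missing something important. I think I do not understand how to use apply in dask, and I cannot find a solution.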

The code

If I were using pandas with the whole df in memory, I would do this:

#%% read df
os.chdir('/opt/data/.../download finance/output')
fulldb_accrep_united = pd.read_csv('fulldb_accrep_first_download_raw_quotes_corrected.csv', index_col = 0, encoding = 'utf-8')
os.chdir('..')

#%% Deleting some freaky symbols from column
condition = fulldb_accrep_united['Period'].str.contains('\\xa0', na = False, regex = False)
fulldb_accrep_united.loc[condition.values, 'Period'] = fulldb_accrep_united.loc[condition.values, 'Period'].str.replace('\\xa0', ' ', regex = False).values

#%% Convert to json
fulldb_accrep_united.loc[fulldb_accrep_united['Period'].notnull(), 'Period'] = fulldb_accrep_united['Period'].dropna().apply(lambda x: json.loads(x))

This is the code I tried to use with dask:

#%% load data with dask
os.chdir('/opt/data/.../download finance/output')
fulldb_accrep_united = dd.read_csv('fulldb_accrep_first_download_raw_quotes_corrected.csv', encoding = 'utf-8', blocksize = 16 * 1024 * 1024) #16Mb chunks
os.chdir('..')

#%% setup calculation graph. No work is done here.
def transform_to_json(df):
    condition = df['Period'].str.contains('\\xa0', na = False, regex = False)
    df['Period'] = df['Period'].mask(condition.values, df['Period'][condition.values].str.replace('\\xa0', ' ', regex = False).values)

    condition2 = df['Period'].notnull()
    df['Period'] = df['Period'].mask(condition2.values, df['Period'].dropna().apply(lambda x: json.loads(x)).values)

result = transform_to_json(fulldb_accrep_united)

The last cell here raises an error:

NotImplementedError: Series getitem in only supported for other series objects with matching partition structure

What am I doing wrong? I spent almost 5 hours looking for similar topics, but I think I am missing something important, since I am new to this subject.

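Your question is quite long, and I did not read all of it. My apologies.

However, judging from the title, you probably want to apply the json.loads function to every element of a dataframe column: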

df["column-name"] = df["column-name"].apply(json.loads)
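
The one-liner above assumes every cell holds a valid JSON string, whereas the question's column also contains missing values and the literal text \xa0. A minimal sketch of an element-wise helper that covers both cases follows; the name parse_period is hypothetical and not part of the original answer, and the cleanup rule simply mirrors the str.replace call in the question.

```python
import json
import math


def parse_period(value):
    """Parse one cell of the Period column; pass missing values through."""
    # Missing cells show up as None or float NaN; leave them untouched.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return value
    # The question's data contains the literal four characters \xa0, which
    # is not a valid JSON escape, so swap it for a space before parsing.
    return json.loads(value.replace("\\xa0", " "))
```

With a helper like this, the assignment would read df["Period"] = df["Period"].apply(parse_period), with no separate masking step.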

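Yes, but I want to use dask because the dataframe is too large.

The code above works the same way on both pandas dataframes and dask dataframes.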