Python: iterating over a dask dataframe with if-else conditions

I have a dataset of about 15 million rows, and pandas cannot get through a for loop over it. I am trying a dask dataframe to speed up the execution time, but the iteration does not work.

Sample of the initial dataframe:

cols = ['id', 'cur_age', 'EGI0', 'EXP0', 'PEGI', 'PExp', 'gEGI', 'TotExp']
data = [[12003, 1, 446499.51, 214319.76, np.nan, np.nan, 0.00228, 0.00228],
        [12003, 2, 446499.51, 214319.76, np.nan, np.nan, 0.00228, 0.00228],
        [12003, 3, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183],
        [12003, 4, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183],
        [12003, 5, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183],
        [12003, 6, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183],
        [12003, 7, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183],
        [12003, 8, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183],
        [12003, 9, 446499.51, 214319.76, np.nan, np.nan, 0.00184, 0.00184],
        [12003, 10, 446499.51, 214319.76, np.nan, np.nan, 0.00184, 0.00184],
        [12014, 1, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183],
        [12014, 2, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183],
        [12014, 3, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183],
        [12014, 4, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183],
        [12014, 5, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183]
]
bookdf = pd.DataFrame(data, columns = cols)
Desired output:

cols = ['id', 'cur_age', 'EGI0', 'EXP0', 'PEGI', 'PExp', 'gEGI', 'TotExp', 'position', 'check', 'egix', 'expx']
data = [[12003, 1, 446499.51923, 23.76, np.nan, np.nan, 0.00228, 0.00228, 0, 446499.51923],
        [12003, 2, 446499.51923, 32.76, np.nan, np.nan, 0.00228, 0.00228, 1, 447517.89163],
        [12003, 3, 446499.51923, 214319.76, np.nan, np.nan, 0.00183, 0.00183, 2, 448338.21855],
        [12003, 4, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183, 3, 449160.04918],
        [12003, 5, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183, 4, 449983.38628],
        [12003, 6, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183, 5, 450808.23260],
        [12003, 7, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183, 6, 451634.59091],
        [12003, 8, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183, 7, 452462.46399],
        [12003, 9, 446499.51, 214319.76, np.nan, np.nan, 0.00184, 0.00184, 8, 453294.43921],
        [12003, 10, 446499.51, 214319.76, np.nan, np.nan, 0.00184, 0.00184, 9, 454127.94424],
        [12014, 1, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183, 0, 163392.40385],
        [12014, 2, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183, 1, 163765.06788],
        [12014, 3, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183, 2, 164065.25900],
        [12014, 4, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183, 3, 164366.00038],
        [12014, 5, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183, 4, 164667.29304]
]
bookdf = pd.DataFrame(data, columns = cols)
Working code in pandas, which is only feasible for small datasets:

# The 'check' column is created to mark the first row of each group w.r.t. the 'id' column.
# I need to take the first row of each group and apply the calculation below to the remaining rows of
# each group, but ```bookdf.groupby('id').first()``` does not work with the calculation below, which
# basically carries the previous value forward and does the math on it.

bookdf['check'] = bookdf.groupby('id').cumcount()
bookdf['egix']  = np.where((bookdf.check==0) & (bookdf.PEGI>0), bookdf.PEGI, bookdf.EGI0)
bookdf['expx']  = np.where((bookdf.check==0) & (bookdf.PExp>0), bookdf.PExp, bookdf.EXP0)
for ind in bookdf.index:
    if bookdf['check'][ind]!=0:
        bookdf['egix'][ind] = bookdf['egix'][ind-1]*(1 + bookdf['gEGI'][ind])
        bookdf['expx'][ind] = bookdf['expx'][ind-1]*(1 + bookdf['TotExp'][ind])
If I try to run the same code with a dask dataframe, I get the following error:

for ind in range(0, len(book1df)):
    if book1df['check'][ind]!=0:
        book1df['egix'][ind] = book1df['egix'][ind-1]*(1 + book1df['gEGI'][ind])
        book1df['expx'][ind] = book1df['expx'][ind-1]*(1 + book1df['TotExp'][ind])

**Error**: Series getitem is only supported for other series objects with matching partition structure.

Is there any way to implement this with a dask dataframe, or is there a better way to get the desired output for a large dataset?
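For reference, the question never shows how the dask dataframe book1df used above was built; presumably it was created from the pandas frame along these lines (npartitions here is an arbitrary assumption):

import dask.dataframe as dd

# Hypothetical reconstruction of the dask dataframe referenced above;
# npartitions=8 is an arbitrary choice, not taken from the question.
book1df = dd.from_pandas(bookdf, npartitions=8)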

One option is to get rid of the loop entirely:

# This creates the mask of interest
mask = bookdf['check'] != 0
# Now we can apply the mask with .loc
bookdf.loc[mask, 'egix'] = bookdf.loc[mask, 'egix'].shift(1) * (1 + bookdf.loc[mask, 'gEGI'])
bookdf.loc[mask, 'expx'] = bookdf.loc[mask, 'expx'].shift(1) * (1 + bookdf.loc[mask, 'TotExp'])

This should work with both pandas and dask dataframes.
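If masked .loc assignment turns out not to be supported by your dask version, a common fallback is to wrap the per-group pandas logic in a plain function and run it with map_partitions. The following is only a sketch under the assumption that the frame is shuffled so no 'id' group is split across partitions (set_index('id') does that shuffle); the helper name fill_partition is my own, not from the question or answers:

import dask.dataframe as dd
import numpy as np
import pandas as pd

def fill_partition(pdf):
    # pdf is an ordinary pandas partition; 'id' is its index after set_index below
    pdf = pdf.reset_index().sort_values(['id', 'cur_age'])
    pdf['check'] = pdf.groupby('id').cumcount()
    pdf['egix'] = np.where((pdf.check == 0) & (pdf.PEGI > 0), pdf.PEGI, pdf.EGI0)
    pdf['expx'] = np.where((pdf.check == 0) & (pdf.PExp > 0), pdf.PExp, pdf.EXP0)
    egix, expx = pdf['egix'].to_numpy(), pdf['expx'].to_numpy()
    gegi, totexp = pdf['gEGI'].to_numpy(), pdf['TotExp'].to_numpy()
    check = pdf['check'].to_numpy()
    for i in range(1, len(pdf)):        # positional loop inside the partition
        if check[i] != 0:
            egix[i] = egix[i - 1] * (1 + gegi[i])
            expx[i] = expx[i - 1] * (1 + totexp[i])
    pdf['egix'], pdf['expx'] = egix, expx
    return pdf

ddf = dd.from_pandas(bookdf, npartitions=8)
# set_index('id') shuffles the data so every 'id' ends up inside a single partition
result = ddf.set_index('id').map_partitions(fill_partition).compute()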

Since you need access to the previous 'egix' and 'expx' values within each group, create two new columns to store those values so the calculation can be done efficiently. Then use the apply method row-wise across the df:

bookdf['egix_prev'] = bookdf.groupby('id')['egix'].shift(1)
bookdf['expx_prev'] = bookdf.groupby('id')['expx'].shift(1)

bookdf['egix'] = bookdf.apply(lambda x:x['egix_prev']*(1+x['gEGI']) if x['check']!=0 else x['egix'],axis=1)
bookdf['expx'] = bookdf.apply(lambda x:x['expx_prev']*(1+x['TotExp']) if x['check']!=0 else x['expx'],axis=1)

If you want to iterate over the rows of a large table efficiently, using df.iterrows() is almost always better. The code below should be roughly 20x faster than the outer for loop:

row_list = []
for row in bookdf.iterrows():
    # row is an (index, Series) tuple; row_list[-1] is the previously processed row,
    # so the updated egix/expx values propagate down the rows of each group
    if row[1]['check'] != 0:
        row[1]['egix'] = row_list[-1][1]['egix'] * (1 + row[1]['gEGI'])
        row[1]['expx'] = row_list[-1][1]['expx'] * (1 + row[1]['TotExp'])
    row_list.append(row)
# Rebuild a dataframe from the collected (and possibly updated) row Series
rebuilt_df = pd.DataFrame([row[1] for row in row_list])

It would be easier if you provided a fully reproducible dataframe so that people can run the code and try to see the error themselves. At first glance, without knowing much more, I would suggest trying the pandas shift() method alongside apply().

@HMReliable Hi, I have attached the reproducible dataframes as requested. Please help me solve this. I have already tried shift and apply(), but they do not work because I have to iterate over the whole column to get the required calculation. This code does return the expected output, which is basically the previous row multiplied by (1 + the current row). Can I get the result shown in the reproducible desired output above?

It is not clear how the 'check' column is generated.

The 'check' column is generated to distinguish the first row of each group w.r.t. the 'id' column, so that the calculation can be applied to the remaining rows of the group.

I tried running the pandas code, but it does not work because the egix/expx columns are never defined.

I have updated the code above with details of how egix/expx are generated. Please take a look.

Thank you for your code @HMReliable. Shift does not work in this case because 'egix' is updated on every row after each calculation; shift only uses the first-row value of 'egix' for the next row, whereas the following rows should use the updated 'egix' value. That is why I used the for loop, taking the previous index and writing the change back at the current index across the whole column.

Please see my updated answer @Ankit Chaudhary
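For what it's worth, the recurrence egix[n] = egix[n-1] * (1 + gEGI[n]) unrolls to egix[n] = egix[0] * (1 + gEGI[1]) * ... * (1 + gEGI[n]), so the cumulative dependence described in the last comment can also be removed with a grouped cumulative product. The following is a minimal pandas sketch of that identity (my own addition, not one of the answers above); the same function could be run per partition with map_partitions as shown earlier:

import numpy as np
import pandas as pd

bookdf['check'] = bookdf.groupby('id').cumcount()

# Seed value for each group: the first row's PEGI/PExp if positive, else EGI0/EXP0
egix0 = np.where((bookdf.check == 0) & (bookdf.PEGI > 0), bookdf.PEGI, bookdf.EGI0)
expx0 = np.where((bookdf.check == 0) & (bookdf.PExp > 0), bookdf.PExp, bookdf.EXP0)
seed_egix = pd.Series(egix0, index=bookdf.index).groupby(bookdf['id']).transform('first')
seed_expx = pd.Series(expx0, index=bookdf.index).groupby(bookdf['id']).transform('first')

# Growth factor is 1 on the first row of a group, (1 + rate) afterwards
f_egi = pd.Series(np.where(bookdf.check == 0, 1.0, 1.0 + bookdf.gEGI), index=bookdf.index)
f_exp = pd.Series(np.where(bookdf.check == 0, 1.0, 1.0 + bookdf.TotExp), index=bookdf.index)

# Cumulative product of the factors within each group gives the compounded growth
bookdf['egix'] = seed_egix * f_egi.groupby(bookdf['id']).cumprod()
bookdf['expx'] = seed_expx * f_exp.groupby(bookdf['id']).cumprod()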