Python: Speeding up DataFrame operations in pandas

python python-3.x pandas dataframe

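I am currently switching from R to Python, and I am wondering whether I can speed up the DataFrame operation below. I have a sales dataset of 500k rows and 17 columns, and I need to run some calculations on it before putting it into a dashboard. My data looks like this: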

location  time  product  sales
store1    2017  brandA   10
store1    2017  brandB   17 
store1    2017  brandC   15
store1    2017  brandD   19
store1    2017  catTot   86
store2    2017  brandA   8
store2    2017  brandB   23 
store2    2017  brandC   5
store2    2017  brandD   12
store2    2017  catTot   76
.         .     .         .
.         .     .         .
.         .     .         .
.         .     .         .

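catTot is a pre-aggregate that I get from the raw dataset; it shows the total sales for a given store in a given time period. As you can see, the other products are only a fraction of the total and never add up to it, but they are included in the total. For dashboard performance reasons I want to show the total sales for a given location without displaying every product, so I need to replace the catTot value with a remainder: the current value minus the sum of the other products.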

Currently I iterate over nested for loops to make the change. The code looks like this:

df['location'] = df.location.astype('category')
df['time'] = df.time.astype('category')

for var_time in df.time.cat.categories:
    for var_geo in df.location.cat.categories:
        # Rows for one store in one time period; catTot is the last row.
        df_tmp = df[(df['location'] == var_geo) & (df['time'] == var_time)]
        # Remainder = catTot minus the sum of all other products (column 3 is sales).
        fct_eur = df_tmp.iloc[-1, 3] - df_tmp.iloc[:-1, 3].sum()
        df.loc[(df['location'] == var_geo) & (df['time'] == var_time) & (df['product'] == 'catTot'), ['sales']] = fct_eur

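As you can see, catTot is always the last row in the masked DataFrame. This operation currently takes about 9 minutes per run, because I have 23 stores, roughly 880 products, 30 time periods and 5 different measures, which results in about 500k rows. Is there a more elegant, or at least faster, way to do this kind of operation?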

You can create a grouping key in which everything that is not catTot is labelled sales, then pivot the table to aggregate the sales column, e.g.:

import numpy as np

agg = df.pivot_table(
    index=['location', 'time'],
    # Two column groups: the catTot rows, and every other product as 'sales'.
    columns=np.where(df['product'] == 'catTot', 'catTot', 'sales'),
    values='sales',
    aggfunc='sum'
)

This gives you:

               catTot  sales
location time
store1   2017      86     61
store2   2017      76     48

Then you can compute new_total = agg['catTot'] - agg['sales'], which is the remainder for each location and time period.
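A minimal sketch of the full round trip, assuming the ten sample rows from the question as input. The answer stops at new_total; writing the result back into the catTot rows of df, and the names is_total and keys, are assumptions added here for illustration.

```python
import numpy as np
import pandas as pd

# Sample rows from the question.
df = pd.DataFrame({
    "location": ["store1"] * 5 + ["store2"] * 5,
    "time": [2017] * 10,
    "product": ["brandA", "brandB", "brandC", "brandD", "catTot"] * 2,
    "sales": [10, 17, 15, 19, 86, 8, 23, 5, 12, 76],
})

# One row per (location, time); columns 'catTot' and 'sales' (all other products).
agg = df.pivot_table(
    index=["location", "time"],
    columns=np.where(df["product"] == "catTot", "catTot", "sales"),
    values="sales",
    aggfunc="sum",
)
new_total = agg["catTot"] - agg["sales"]

# Assumption: overwrite each catTot row with the remainder for its group.
is_total = df["product"] == "catTot"
keys = pd.MultiIndex.from_frame(df.loc[is_total, ["location", "time"]])
df.loc[is_total, "sales"] = new_total.reindex(keys).to_numpy()

print(df[is_total])  # catTot is now 25 for store1 and 28 for store2
```

This replaces the nested loop with a single grouped aggregation, so the cost no longer grows with the number of store and time combinations that have to be masked one by one.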

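A friend actually suggested this way of solving my problem. The code is his as well: it builds a nested dictionary and, for each row, adds the measure value to the entry for that row's key, with everything except catTot multiplied by -1. In the end, only the remainder is left.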

def safe_add(mapping, store, year, brand, count):
    # Create the nested store -> year entries on first use.
    if store not in mapping:
        mapping[store] = {}
    if year not in mapping[store]:
        mapping[store][year] = 0
    # Every product except catTot is subtracted from the total.
    if brand != 'catTot':
        count = count * -1
    mapping[store][year] += count

mapping = {}
# data: an iterable of (location, time, product, sales) rows
for row in data:
    safe_add(mapping, row[0], int(row[1]), row[2], int(row[3]))
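The same signed accumulation can be sketched with collections.defaultdict, which removes the explicit membership checks. This is an illustrative variant rather than the friend's code: the function name remainder_by_store_year and the rows list of string tuples are assumptions.

```python
from collections import defaultdict

# Sample rows from the question, as (location, time, product, sales) strings.
rows = [
    ("store1", "2017", "brandA", "10"),
    ("store1", "2017", "brandB", "17"),
    ("store1", "2017", "brandC", "15"),
    ("store1", "2017", "brandD", "19"),
    ("store1", "2017", "catTot", "86"),
    ("store2", "2017", "brandA", "8"),
    ("store2", "2017", "brandB", "23"),
    ("store2", "2017", "brandC", "5"),
    ("store2", "2017", "brandD", "12"),
    ("store2", "2017", "catTot", "76"),
]

def remainder_by_store_year(rows):
    """Add catTot counts and subtract every other product, per store and year."""
    mapping = defaultdict(lambda: defaultdict(int))
    for store, year, brand, count in rows:
        sign = 1 if brand == "catTot" else -1
        mapping[store][int(year)] += sign * int(count)
    return mapping

mapping = remainder_by_store_year(rows)
print(mapping["store1"][2017])  # 25
print(mapping["store2"][2017])  # 28
```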

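After building the nested dictionary, I looped over it once to count the number of rows I would need to write out. I did this so that I could pre-allocate an empty df and then fill it.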

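# Count one output row per (store, time period) entry.
counter = 0
for geo in mapping.keys():
    for time in mapping[geo].keys():
        counter += 1
# Pre-allocate an empty frame with the same columns as df.
df_annex = pd.DataFrame(data=None, index=np.arange(0, counter), columns=df.columns)
counterb = 0  # position of the row currently being filled
for geo in mapping.keys():
    for time in mapping[geo].keys():
        df_annex.iloc[counterb, 0] = geo
        .
        .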

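After writing out the dictionary, I simply subset the old totals out of df and concatenated the rest with the annex. The result: 7.88 seconds instead of 9 minutes.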
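A rough sketch of that last step, assuming the nested dictionary already holds the remainders for the sample data. Instead of counting rows and pre-allocating, it builds the annex from a list of records in one call; this is a swapped-in alternative to the fill loop, and the name records is hypothetical.

```python
import pandas as pd

# Sample rows from the question.
df = pd.DataFrame({
    "location": ["store1"] * 5 + ["store2"] * 5,
    "time": [2017] * 10,
    "product": ["brandA", "brandB", "brandC", "brandD", "catTot"] * 2,
    "sales": [10, 17, 15, 19, 86, 8, 23, 5, 12, 76],
})

# Assumed result of the accumulation step: store -> time -> remainder.
mapping = {"store1": {2017: 25}, "store2": {2017: 28}}

# One record per (store, time) entry, shaped like a row of df.
records = [
    {"location": store, "time": year, "product": "catTot", "sales": value}
    for store, years in mapping.items()
    for year, value in years.items()
]
df_annex = pd.DataFrame(records, columns=df.columns)

# Drop the old totals and append the recomputed ones.
result = pd.concat([df[df["product"] != "catTot"], df_annex], ignore_index=True)
print(result.tail(2))
```

Building the frame from a list avoids element-wise writes through iloc, which tend to be slow when repeated for many rows.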

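That works great, thank you for your answer. In fact, it also got me started using nested dicts, because my actual problem involves five measures rather than just one.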