Python: apply calculations to filtered values in a DataFrame
Tags: python, pandas, dataframe

I am new to pandas. Consider this as my DataFrame:

df
Search             Impressions  Clicks  Transactions  ContainsBest  ContainsFree  Country
Best phone                  10       5             1  True          False         UK
Best free phone             15       4             2  True          True          UK
free phone                  20       3             4  False         True          UK
good phone                  13       1             5  False         False         US
just a free phone           12       3             4  False         True          US
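I have the columns ContainsBest and ContainsFree. I want to sum all Impressions, Clicks and Transactions where ContainsBest is True, then sum Impressions, Clicks and Transactions where ContainsFree is True, and do the same for each unique value in the Country column. So the new DataFrame would look like this: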
output_df
Country         Impressions  Clicks  Transactions
UK                       45      12             7
  ContainsBest           25       9             3
  ContainsFree           35       7             6
US                       25       4             9
  ContainsBest            0       0             0
  ContainsFree           12       3             4
For this, I understand that I need to use something like the following:
uk_toal_impressions = df['Impressions'].sum().where(df['Country']=='UK')
uk_best_impressions = df['Impressions'].sum().where(df['Country']=='UK' & df['ContainsBest'])
uk_free_impressions = df['Impressions'].sum().where(df['Country']=='UK' & df['ContainsFree'])
Then I would apply the same logic to Clicks and Transactions, and repeat the same code for the country US.
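For reference, a minimal sketch of how the three sums above could be written so that they run. It assumes the sample df from the question; the mask name is_uk is only illustrative. Two things differ from the attempt: .sum() returns a single number, so the filter has to be applied before summing, and & binds more tightly than ==, so each comparison is stored in its own mask (or wrapped in parentheses).

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Search": ["Best phone", "Best free phone", "free phone",
               "good phone", "just a free phone"],
    "Impressions": [10, 15, 20, 13, 12],
    "Clicks": [5, 4, 3, 1, 3],
    "Transactions": [1, 2, 4, 5, 4],
    "ContainsBest": [True, True, False, False, False],
    "ContainsFree": [False, True, True, False, True],
    "Country": ["UK", "UK", "UK", "US", "US"],
})

# Build the boolean mask first, then select rows and sum.
is_uk = df["Country"] == "UK"  # illustrative name, not from the question

uk_total_impressions = df.loc[is_uk, "Impressions"].sum()
uk_best_impressions = df.loc[is_uk & df["ContainsBest"], "Impressions"].sum()
uk_free_impressions = df.loc[is_uk & df["ContainsFree"], "Impressions"].sum()

print(uk_total_impressions, uk_best_impressions, uk_free_impressions)
```

Repeating this by hand for every country and every Containsxxx column does not scale, which is what the rest of the question is about.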
The second thing I am trying to achieve is to add TopCategories columns per Country for Impressions, Clicks and Transactions, so that my final_output_df looks like this:
final_output_df
Country         Impressions  Clicks  Transactions  TopCategoriesForImpressions  TopCategoriesForClicks  TopCategoriesForTransactions
UK                       45      12             7  ContainsFree                 ContainsBest            ContainsFree
  ContainsBest           25       9             3  ContainsBest                 ContainsFree            ContainsBest
  ContainsFree           35       7             6
US                       25       4             9  ContainsFree                 ContainsFree            ContainsFree
  ContainsBest            0       0             0
  ContainsFree           12       3             4
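The logic of the TopCategoriesForxx columns is a simple sort of the ContainsBest and ContainsFree rows under each value of the Country column. So TopCategoriesForImpressions for the country UK is:
ContainsFree
ContainsBest
while TopCategoriesForClicks for the country UK is:
ContainsBest
ContainsFree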
I know I need to use something like this:
TopCategoriesForImpressions = output_df['Impressions'].sort_values(by='Impressions', ascending=False).where(output_df['Country']=='UK')
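I just find it hard to put everything together into my final final_output_df. Also, I assume I do not need to create output_df; I only added it to make the steps toward final_output_df easier to follow.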
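So my questions are:
1. How do I apply calculations based on one or several conditions? See the rows ContainsBest and ContainsFree.
2. How do I sort column values based on a condition? See the column TopCategoriesForImpressions.
3. In reality I have 70 countries and 20 Containsxxx columns. Is there a way to achieve this without writing out conditions for all 70 countries and 20 Containsxxx columns?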
Many thanks for your suggestions.

The first part of the solution should be:
# drop the unneeded Search column and add a ContainAll column filled with True
df1 = df.drop(columns='Search').assign(ContainAll = True)
# value columns to sum and boolean columns to test
cols1 = ['Impressions','Clicks','Transactions']
cols2 = ['ContainsBest','ContainsFree','ContainAll']
print (df1[cols2].dtypes)
ContainsBest    bool
ContainsFree    bool
ContainAll      bool
dtype: object
print (df1[cols1].dtypes)
Impressions     int64
Clicks          int64
Transactions    int64
dtype: object
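The step that turns this wide df1 into a long frame with a Type column, which the second part filters on, is not shown in this excerpt. A possible sketch of such a step follows, built around the melt call quoted in the comments below. The names long_df, summed and full_index are illustrative and not from the original answer, and the reindex over every Country/Type pair is an assumption made so that empty combinations such as US/ContainsBest appear as zeros.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Search": ["Best phone", "Best free phone", "free phone",
               "good phone", "just a free phone"],
    "Impressions": [10, 15, 20, 13, 12],
    "Clicks": [5, 4, 3, 1, 3],
    "Transactions": [1, 2, 4, 5, 4],
    "ContainsBest": [True, True, False, False, False],
    "ContainsFree": [False, True, True, False, True],
    "Country": ["UK", "UK", "UK", "US", "US"],
})

cols1 = ["Impressions", "Clicks", "Transactions"]
cols2 = ["ContainsBest", "ContainsFree", "ContainAll"]
df1 = df.drop(columns="Search").assign(ContainAll=True)

# Wide to long: one row per (original row, boolean column) pair.
long_df = df1.melt(["Country"] + cols1, var_name="Type", value_name="mask")

# Keep only rows whose flag is True, then sum per country and flag.
summed = long_df[long_df["mask"]].groupby(["Country", "Type"])[cols1].sum()

# Assumption: add missing Country/Type pairs back as zero rows.
full_index = pd.MultiIndex.from_product(
    [df1["Country"].unique(), cols2], names=["Country", "Type"]
)
summed = summed.reindex(full_index, fill_value=0).reset_index()
print(summed)
```

Because the country and flag names are taken from the data and from cols2, the same code should cover 70 countries and 20 Containsxxx columns without writing any per-country condition.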
For the second part, it is possible to use numpy.argsort per group, after filtering the rows whose order should be checked:
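def f(x):
    # category labels (the Type index) of the rows in this country
    i = x.index.to_numpy()
    # per column, order the labels from the largest value to the smallest
    a = i[(-x.to_numpy()).argsort(axis=0)]
    return pd.DataFrame(a, columns=x.columns)

# keep only the ContainsBest/ContainsFree rows that are not all zeros
df2 = (df1[df1['Type'].isin(['ContainsBest','ContainsFree']) &
           ~df1[cols1].eq(0).all(axis=1)]
       .set_index('Type')
       .groupby('Country')[cols1]
       .apply(f)
       .add_prefix('TopCategoriesFor')
       .rename_axis(['Country','Type'])
       .rename({0:'ContainsBest', 1:'ContainsFree'})
      )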
print (df2)
                     TopCategoriesForImpressions TopCategoriesForClicks  \
Country Type
UK      ContainsBest                ContainsFree           ContainsBest
        ContainsFree                ContainsBest           ContainsFree
US      ContainsBest                ContainsFree           ContainsFree

                     TopCategoriesForTransactions
Country Type
UK      ContainsBest                 ContainsFree
        ContainsFree                 ContainsBest
US      ContainsBest                 ContainsFree
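To make the argsort step easier to follow, here is a small standalone sketch of what happens inside f for a single country. It uses the UK sums from the question in a hand-built frame; the names uk, labels, order and ranked are illustrative only.

```python
import pandas as pd

# UK sums from the question, indexed by category (hand-built for illustration).
uk = pd.DataFrame(
    {"Impressions": [25, 35], "Clicks": [9, 7], "Transactions": [3, 6]},
    index=["ContainsBest", "ContainsFree"],
)

labels = uk.index.to_numpy()

# Negating the values makes argsort return row positions in descending
# order of the original values, independently for each column (axis=0).
order = (-uk.to_numpy()).argsort(axis=0)

# Indexing the label array with the position matrix gives, per column,
# the category names ranked from largest to smallest.
ranked = pd.DataFrame(labels[order], columns=uk.columns)
print(ranked)
```

Row 0 of ranked holds the top category for each metric and row 1 the runner-up, which corresponds to the two UK rows in the printed df2.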
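For some reason I do get the same structure as you, but all my values are 0. Do you know why that might be? Before using the code you added, I write df to a CSV; could to_csv leave df empty?
@JonasPalačionis - hmmm, is the data numeric?
Checking. I also got a warning: FutureWarning: Passing list-likes to .loc or [] with any missing label will raise KeyError in the future, you can use .reindex() as an alternative. Not sure whether this affects the script.
@JonasPalačionis - one idea is print(df1.melt(['Country']+cols1,var_name='Type',value_name='mask').dtypes)?
@JonasPalačionis - added some prints for checking.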