Python: is there a way to rank some items in a DataFrame and exclude others?
I have a pandas DataFrame named ranks that contains my clusters and their key metrics. I rank them with rank(), but there are two specific clusters that I want ranked differently from the others.
import pandas as pd

ranks = pd.DataFrame(data={
    'Cluster': ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'],
    'No. Customers': [145118, 2, 1236, 219847, 9837, 64865, 3855, 219549, 34171, 3924120],
    'Ave. Recency': [39.0197, 47.0, 15.9716, 41.9736, 23.9330, 24.8281, 26.5647, 17.7493, 23.5205, 24.7933],
    'Ave. Frequency': [1.7264, 19.0, 24.9101, 3.0682, 3.2735, 1.8599, 3.9304, 3.3356, 9.1703, 1.1684],
    'Ave. Monetary': [14971.85, 237270.00, 126992.79, 17701.64, 172642.35, 13159.21, 54333.56, 17570.67, 42136.68, 4754.76],
})
# Average spend per purchase: monetary value divided by frequency
ranks['Ave. Spend'] = ranks['Ave. Monetary'] / ranks['Ave. Frequency']
I then apply the rank() method as follows:
ranks['r_rank'] = ranks['Ave. Recency'].rank()
ranks['f_rank'] = ranks['Ave. Frequency'].rank(ascending=False)
ranks['m_rank'] = ranks['Ave. Monetary'].rank(ascending=False)
ranks['s_rank'] = ranks['Ave. Spend'].rank(ascending=False)
ranks['overall'] = ranks.apply(lambda row: row.r_rank + row.f_rank + row.m_rank + row.s_rank, axis=1)
ranks['overall_rank'] = ranks['overall'].rank(method='first')
This gives me:
   Cluster  No. Customers  Ave. Recency  Ave. Frequency  Ave. Monetary  Ave. Spend  r_rank  f_rank  m_rank  s_rank  overall  overall_rank
0        0         145118       39.0197          1.7264      14,971.85    8,672.07       8       9       8       4       29             9
1        1              2          47.0            19.0     237,270.00   12,487.89      10       2       1       3       16             3
2        2           1236       15.9716         24.9101     126,992.79    5,098.02       1       1       3       8       13             1
3        3         219847       41.9736          3.0682      17,701.64    5,769.23       9       7       6       6       28             7
4        4           9837       23.9330          3.2735     172,642.35   52,738.42       4       6       2       1       13             2
5        5          64865       24.8281          1.8599      13,159.21    7,075.19       6       8       9       5       28             8
6        6           3855       26.5647          3.9304      54,333.56   13,823.64       7       4       4       2       17             4
7        7         219549       17.7493          3.3356      17,570.67    5,267.52       2       5       7       7       21             6
8        8          34171       23.5205          9.1703      42,136.68    4,594.89       3       3       5       9       20             5
9        9        3924120       24.7933          1.1684       4,754.76    4,069.21       5      10      10      10       35            10
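That is what it should do, but the cluster with the highest Ave. Spend must always rank first, and the cluster with the highest Ave. Recency must always rank last.
So I modified the code above as follows: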
ranks['r_rank'] = ranks['Ave. Recency'].rank()
ranks['f_rank'] = ranks['Ave. Frequency'].rank(ascending=False)
ranks['m_rank'] = ranks['Ave. Monetary'].rank(ascending=False)
ranks['s_rank'] = ranks['Ave. Spend'].rank(ascending=False)
ranks['overall'] = ranks.apply(lambda row: row.r_rank + row.f_rank + row.m_rank + row.s_rank, axis=1)
ranks['overall_rank'] = ranks['overall'].rank(method='first')
if(ranks['s_rank'].min() == 1):
    ranks['overall_rank_2'] = 1
elif(ranks['r_rank'].max() == len(ranks)):
    ranks['overall_rank_2'] = len(ranks)
else:
    ranks_2 = ranks.drop(ranks.index[[ranks[ranks['s_rank'] == ranks['s_rank'].min()].index[0],ranks[ranks['r_rank'] == ranks['r_rank'].max()].index[0]]])
    ranks_2['r_rank'] = ranks_2['Ave. Recency'].rank()
    ranks_2['f_rank'] = ranks_2['Ave. Frequency'].rank(ascending=False)
    ranks_2['m_rank'] = ranks_2['Ave. Monetary'].rank(ascending=False)
    ranks_2['s_rank'] = ranks_2['Ave. Spend'].rank(ascending=False)
    ranks_2['overall'] = ranks.apply(lambda row: row.r_rank + row.f_rank + row.m_rank + row.s_rank, axis=1)
    ranks['overall_rank_2'] = ranks_2['overall'].rank(method='first')
Then I get this:
   Cluster  No. Customers  Ave. Recency  Ave. Frequency  Ave. Monetary  Ave. Spend  r_rank  f_rank  m_rank  s_rank  overall  overall_rank  overall_rank_2
0        0         145118       39.0197          1.7264      14,971.85    8,672.07       8       9       8       4       29             9               1
1        1              2          47.0            19.0     237,270.00   12,487.89      10       2       1       3       16             3               1
2        2           1236       15.9716         24.9101     126,992.79    5,098.02       1       1       3       8       13             1               1
3        3         219847       41.9736          3.0682      17,701.64    5,769.23       9       7       6       6       28             7               1
4        4           9837       23.9330          3.2735     172,642.35   52,738.42       4       6       2       1       13             2               1
5        5          64865       24.8281          1.8599      13,159.21    7,075.19       6       8       9       5       28             8               1
6        6           3855       26.5647          3.9304      54,333.56   13,823.64       7       4       4       2       17             4               1
7        7         219549       17.7493          3.3356      17,570.67    5,267.52       2       5       7       7       21             6               1
8        8          34171       23.5205          9.1703      42,136.68    4,594.89       3       3       5       9       20             5               1
9        9        3924120       24.7933          1.1684       4,754.76    4,069.21       5      10      10      10       35            10               1
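Please help me fix the if statement above, or suggest a completely different approach. Of course, this needs to be as dynamic as possible.

So you want a custom ranking on the DataFrame in which the cluster (/row) with the highest Ave. Spend always ranks first, and the cluster with the highest Ave. Recency always ranks last.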
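The solution is five lines. Notes:
- You had the right idea with DataFrame.drop(); just use idxmax() to get the index of each of the two rows that need special treatment, and store it, so that you don't need a huge, unwieldy logical filter expression inside drop().
- There is no need to create so many temporary columns, nor the temporary copy ranks_2=ranks.drop(…); just pass the result of drop() into rank()...
- ...via .sum(axis=1) on the desired columns; there is no need to define a lambda or to save its output in the temp column 'overall'.
- ...then we just feed those row sums into rank(), which gives us values 1..8, so we add 1 to offset the result of rank() to 2..9. (You can generalize this.)
- We manually set 'overall_rank' for the Ave. Spend and Ave. Recency rows.
- (Yes, you could also implement all of this as a custom function whose input is either the four Ave. columns or the four *_rank columns.)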
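Code: (see the boilerplate at the bottom for reading in your DataFrame; next time please provide a reproducible example (MCVE), to help us help you)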
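A minimal sketch of how the notes above might translate into code. It is a reconstruction under stated assumptions, not necessarily the answerer's exact code: it assumes the four *_rank columns are computed once on the full frame (as in the question), that the row with the highest Ave. Spend and the row with the highest Ave. Recency are two different rows, and the names rank_cols, first, last and middle are illustrative.

```python
import pandas as pd

# Data from the question
ranks = pd.DataFrame({
    'Cluster': ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'],
    'Ave. Recency': [39.0197, 47.0, 15.9716, 41.9736, 23.9330,
                     24.8281, 26.5647, 17.7493, 23.5205, 24.7933],
    'Ave. Frequency': [1.7264, 19.0, 24.9101, 3.0682, 3.2735,
                       1.8599, 3.9304, 3.3356, 9.1703, 1.1684],
    'Ave. Monetary': [14971.85, 237270.00, 126992.79, 17701.64, 172642.35,
                      13159.21, 54333.56, 17570.67, 42136.68, 4754.76],
})
ranks['Ave. Spend'] = ranks['Ave. Monetary'] / ranks['Ave. Frequency']

# Per-metric ranks, exactly as in the question
ranks['r_rank'] = ranks['Ave. Recency'].rank()
ranks['f_rank'] = ranks['Ave. Frequency'].rank(ascending=False)
ranks['m_rank'] = ranks['Ave. Monetary'].rank(ascending=False)
ranks['s_rank'] = ranks['Ave. Spend'].rank(ascending=False)
rank_cols = ['r_rank', 'f_rank', 'm_rank', 's_rank']  # illustrative name

# Index labels of the two rows that get a fixed position
first = ranks['Ave. Spend'].idxmax()    # always ranked 1
last = ranks['Ave. Recency'].idxmax()   # always ranked len(ranks)

# Rank the remaining rows on the sum of their ranks, shifted to 2..len(ranks)-1
middle = ranks.drop([first, last])[rank_cols].sum(axis=1).rank(method='first') + 1

ranks['overall_rank_2'] = middle                 # the two special rows are NaN for now
ranks.loc[first, 'overall_rank_2'] = 1
ranks.loc[last, 'overall_rank_2'] = len(ranks)
ranks['overall_rank_2'] = ranks['overall_rank_2'].astype(int)

print(ranks[['Cluster', 'overall_rank_2']])
```

For comparison, the condition ranks['s_rank'].min() == 1 in the question is a single True/False value for the whole frame rather than a per-row test. With this data it is True, so the first branch runs and assigns 1 to every row, which is the output shown in the question.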
Here is the boilerplate for reading in the data:
import pandas as pd
from io import StringIO
# """Cluster No. Customers| Ave. Recency| Ave. Frequency| Ave. Monetary| Ave. Spend|
dat = """
0 145118 39.0197 1.7264 14,971.85 8,672.07
1 2 47.0 19.0 237,270.00 12,487.89
2 1236 15.9716 24.9101 126,992.79 5,098.02
3 219847 41.9736 3.0682 17,701.64 5,769.23
4 9837 23.9330 3.2735 172,642.35 52,738.42
5 64865 24.8281 1.8599 13,159.21 7,075.19
6 3855 26.5647 3.9304 54,333.56 13,823.64
7 219549 17.7493 3.3356 17,570.67 5,267.52
8 34171 23.5205 9.1703 42,136.68 4,594.89
9 3924120 24.7933 1.1684 4,754.76 4,069.21 """
# Remove the comma thousands-separator, to prevent your floats being read in as string
dat = dat.replace(',', '')
# Raw string for the regex separator, so '\s' is not treated as a string escape
ranks = pd.read_csv(StringIO(dat), sep=r'\s+', names=
    "Cluster|No. Customers|Ave. Recency|Ave. Frequency|Ave. Monetary|Ave. Spend".split('|'))
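Welcome to StackOverflow. Please take the time to read this post, as well as the one on how to provide an answer, and revise your question accordingly. These tips on how to ask a good question may also be useful.

OK, so you want a custom ranking on the DataFrame in which the cluster (/row) with the highest Ave. Spend always ranks first and the one with the highest Ave. Recency always ranks last. Please edit your code sample so that it is executable (MCVE). By the way, column names containing spaces are annoying and make the code long; I would rename them to AvRec, AvFrq, AvMon, AvSpn or whatever.

By the way, rank() returns floats (not ints) when applied to a float column. Because there can be ties, rank(method='average') returns floats. Downcasting the result with .astype(int) is not safe, because you need to handle those pesky ties. As for stating the problem clearly: "I have a DataFrame that contains my clusters and their key metrics." Is a cluster a group of customers? (Credit-card users? B2B? Grocery?) A group of computers? Something else?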
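To illustrate the point about ties, here is a small sketch on a made-up Series (the names scores, avg and first are illustrative, not from the thread). It compares the default method='average' with method='first', and shows what a plain .astype(int) does to the averaged ranks.

```python
import pandas as pd

# Two tied scores followed by two distinct ones
scores = pd.Series([13.0, 13.0, 16.0, 17.0])

avg = scores.rank()                   # method='average' is the default
first = scores.rank(method='first')   # ties broken by order of appearance

print(avg.tolist())                   # [1.5, 1.5, 3.0, 4.0]
print(first.tolist())                 # [1.0, 2.0, 3.0, 4.0]

# Casting the averaged ranks truncates 1.5 to 1, so two rows share rank 1
# and rank 2 is never assigned
print(avg.astype(int).tolist())       # [1, 1, 3, 4]

# With method='first' every rank is a whole number, so the cast loses nothing
print(first.astype(int).tolist())     # [1, 2, 3, 4]
```

This is why the question's use of rank(method='first') for the overall rank makes the integer cast harmless there, while the same cast on an averaged rank silently changes the result.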