Python Postgres: set a column to the percentile
I have a transactions table and want to add a percentile column that gives, for each transaction, its percentile within its month based on the amount column. Taking quartiles rather than percentiles as an example:
Example input:
id | month | amount
1 | 1 | 1
2 | 1 | 2
3 | 1 | 5
4 | 1 | 3
5 | 2 | 1
6 | 2 | 3
1 | 2 | 5
1 | 2 | 7
1 | 2 | 9
1 | 2 | 11
1 | 2 | 15
1 | 2 | 16
Example output:
id | month | amount | quartile
1 | 1 | 1 | 25
2 | 1 | 2 | 50
3 | 1 | 5 | 100
4 | 1 | 3 | 75
5 | 2 | 1 | 25
6 | 2 | 3 | 25
1 | 2 | 5 | 50
1 | 2 | 15 | 100
1 | 2 | 9 | 75
1 | 2 | 11 | 75
1 | 2 | 7 | 50
1 | 2 | 16 | 100
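Currently I use Postgres's percentile_cont function to find the amount values at the cutoff points of the different percentiles, then iterate over them and update the percentile column accordingly. Unfortunately this approach is far too slow, because I have many different months. Any ideas on how to do this faster, preferably by combining the percentile calculation and the update into a single SQL statement?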
My code:
num_buckets = 10
for i in range(num_buckets):
    # Upper and lower bounds of the current bucket as fractions in [0, 1]
    decimal_percentile = (i + 1) * (1.0 / num_buckets)
    prev_decimal_percentile = i * 1.0 / num_buckets
    percentile = int(decimal_percentile * 100)
    # One query per bucket: cutoff amounts for every month
    cursor.execute("""SELECT month,
                             percentile_cont(%s) WITHIN GROUP (ORDER BY amount ASC),
                             percentile_cont(%s) WITHIN GROUP (ORDER BY amount ASC)
                      FROM transactions GROUP BY month;""",
                   (prev_decimal_percentile, decimal_percentile))
    iter_cursor = connection.cursor()
    # One UPDATE per (bucket, month) pair, which is what makes this slow
    for data in cursor:
        iter_cursor.execute("""UPDATE transactions SET percentile=%s
                               WHERE month = %s
                               AND amount >= %s AND amount <= %s;""",
                            (percentile, data[0], data[1], data[2]))
You can do this in a single query, e.g. for 4 buckets:
update transactions t
set percentile = calc_percentile
from (
    select distinct on (month, amount)
        id,
        month,
        amount,
        calc_percentile
    from transactions
    join (
        select
            bucket,
            month as calc_month,
            percentile_cont(bucket*1.0/4) within group (order by amount asc) as calc_amount,
            bucket*100/4 as calc_percentile
        from transactions
        cross join generate_series(1, 4) bucket
        group by month, bucket
    ) s on month = calc_month and amount <= calc_amount
    order by month, amount, calc_percentile
) s
where t.month = s.month and t.amount = s.amount;
By the way, id should be a primary key; it could then be used in the join for better performance.
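Since the question drives the update from Python, the single statement above could be issued through a cursor roughly as follows. This is a minimal sketch, not a definitive implementation: the names `UPDATE_SQL` and `update_percentiles` are hypothetical, the bucket count is passed as a named `%(n)s` parameter on the assumption that the driver is psycopg2 (which substitutes parameters client-side), and the table and column names are taken from the question.

```python
# Hypothetical helper: the 4-bucket query from the answer, with the bucket
# count turned into a named parameter (psycopg2 "pyformat" style).
UPDATE_SQL = """
update transactions t
set percentile = calc_percentile
from (
    select distinct on (month, amount)
        id,
        month,
        amount,
        calc_percentile
    from transactions
    join (
        select
            bucket,
            month as calc_month,
            percentile_cont(bucket * 1.0 / %(n)s)
                within group (order by amount asc) as calc_amount,
            bucket * 100 / %(n)s as calc_percentile
        from transactions
        cross join generate_series(1, %(n)s) bucket
        group by month, bucket
    ) s on month = calc_month and amount <= calc_amount
    order by month, amount, calc_percentile
) s
where t.month = s.month and t.amount = s.amount;
"""


def update_percentiles(cursor, num_buckets=4):
    """Run the single-statement update for the given number of buckets.

    `cursor` is assumed to be a psycopg2 cursor; committing the
    transaction is left to the caller.
    """
    if num_buckets < 1:
        raise ValueError("num_buckets must be at least 1")
    cursor.execute(UPDATE_SQL, {"n": num_buckets})
```

With the `cursor` and `connection` objects from the question, this would be called as `update_percentiles(cursor, 10)` followed by `connection.commit()`. Note that `bucket * 100 / n` is integer division in Postgres, so the labels are truncated when 100 is not divisible by the bucket count.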
select *
from transactions
order by month, amount;
id | month | amount | percentile
----+-------+--------+------------
1 | 1 | 1 | 25
2 | 1 | 2 | 50
4 | 1 | 3 | 75
3 | 1 | 5 | 100
5 | 2 | 1 | 25
6 | 2 | 3 | 25
1 | 2 | 5 | 50
1 | 2 | 7 | 50
1 | 2 | 9 | 75
1 | 2 | 11 | 75
1 | 2 | 15 | 100
1 | 2 | 16 | 100
(12 rows)
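To see why the query produces these labels, the bucket logic can be reproduced in plain Python on the sample data. The sketch below is illustrative only: `percentile_cont` here is a hypothetical Python stand-in that linearly interpolates between neighbouring sorted values, as the Postgres aggregate of the same name is documented to do, and `assign_buckets` mimics the join condition `amount <= calc_amount` combined with picking the smallest matching bucket.

```python
from collections import defaultdict


def percentile_cont(sorted_values, fraction):
    """Interpolated percentile over an already sorted list (stand-in for the SQL aggregate)."""
    position = fraction * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] + weight * (sorted_values[upper] - sorted_values[lower])


def assign_buckets(rows, num_buckets=4):
    """Label each (month, amount) row with the smallest bucket whose cutoff covers it."""
    amounts_by_month = defaultdict(list)
    for month, amount in rows:
        amounts_by_month[month].append(amount)

    # Cutoff amount per month and bucket, like the inner grouped subquery
    cutoffs = {}
    for month, amounts in amounts_by_month.items():
        amounts.sort()
        cutoffs[month] = [
            percentile_cont(amounts, bucket / num_buckets)
            for bucket in range(1, num_buckets + 1)
        ]

    labelled = []
    for month, amount in rows:
        for bucket, cutoff in enumerate(cutoffs[month], start=1):
            if amount <= cutoff:
                # Integer division, matching bucket*100/4 in the query
                labelled.append((month, amount, bucket * 100 // num_buckets))
                break
    return labelled


# Sample data from the question as (month, amount) pairs
rows = [(1, 1), (1, 2), (1, 5), (1, 3),
        (2, 1), (2, 3), (2, 5), (2, 7), (2, 9), (2, 11), (2, 15), (2, 16)]
labels = assign_buckets(rows)
```

For month 1 the interpolated cutoffs are 1.75, 2.5, 3.5 and 5, and for month 2 they are 4.5, 8, 12 and 16, which gives the same percentile column as the table above.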