Python Postgres: set a column to its percentile [python, postgresql, psycopg2]


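I have a transactions table and want to add a percentile column that gives each transaction's percentile within its month, based on the amount column.

Using quartiles rather than percentiles as an example:

Example input: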

id | month | amount
1  |   1   |   1
2  |   1   |   2
3  |   1   |   5
4  |   1   |   3
5  |   2   |   1
6  |   2   |   3
1  |   2   |   5
1  |   2   |   7
1  |   2   |   9
1  |   2   |   11
1  |   2   |   15
1  |   2   |   16

Example output:

id | month | amount |  quartile
1  |   1   |   1    |      25
2  |   1   |   2    |      50
3  |   1   |   5    |      100
4  |   1   |   3    |      75
5  |   2   |   1    |      25
6  |   2   |   3    |      25
1  |   2   |   5    |      50
1  |   2   |   15   |      100
1  |   2   |   9    |      75
1  |   2   |   11   |      75
1  |   2   |   7    |      50
1  |   2   |   16   |      100

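Currently I use Postgres's percentile_cont function to find the amount values at the cutoff points of the different percentiles, then loop over them and update the percentile column accordingly. Unfortunately this approach is too slow because I have many different months. Any ideas on how to do this faster, ideally combining the percentile calculation and the update into a single SQL statement?

My code: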

num_buckets = 10

# One pass per bucket: fetch each month's lower and upper cutoff amounts,
# then label every transaction whose amount lies between them.
for i in range(num_buckets):
    decimal_percentile = (i + 1) * (1.0 / num_buckets)
    prev_decimal_percentile = i * 1.0 / num_buckets
    percentile = int(decimal_percentile * 100)
    # Triple-quoted strings so the SQL can span several lines.
    cursor.execute("""SELECT month,
                             percentile_cont(%s) WITHIN GROUP (ORDER BY amount ASC),
                             percentile_cont(%s) WITHIN GROUP (ORDER BY amount ASC)
                      FROM transactions GROUP BY month;""",
                   (prev_decimal_percentile, decimal_percentile))
    iter_cursor = connection.cursor()
    # data = (month, lower cutoff, upper cutoff)
    for data in cursor:
        iter_cursor.execute("""UPDATE transactions SET percentile=%s
                               WHERE month = %s
                                     AND amount >= %s AND amount <= %s;""",
                            (percentile, data[0], data[1], data[2]))

You can do this in a single query, for example with 4 buckets:
update transactions t
set percentile = calc_percentile
from (
    select distinct on (month, amount) 
        id, 
        month, 
        amount, 
        calc_percentile
    from transactions
    join (
        select 
            bucket,
            month as calc_month, 
            percentile_cont(bucket*1.0/4) within group (order by amount asc) as calc_amount,
            bucket*100/4 as calc_percentile
        from transactions 
        cross join generate_series(1, 4) bucket
        group by month, bucket
        ) s on month = calc_month and amount <= calc_amount
    order by month, amount, calc_percentile 
    ) s
where t.month = s.month and t.amount = s.amount;
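
Since the question runs its SQL through psycopg2, the statement above could be wrapped roughly as follows. This is only a sketch: the function name update_percentiles, the constant UPDATE_SQL and the named parameter num_buckets are made up for illustration, and the SQL is the same query with the hard-coded 4 replaced by a parameter.

```python
# Hypothetical wrapper; names are illustrative and not part of the original answer.
UPDATE_SQL = """
update transactions t
set percentile = calc_percentile
from (
    select distinct on (month, amount)
        month,
        amount,
        calc_percentile
    from transactions
    join (
        select
            bucket,
            month as calc_month,
            percentile_cont(bucket * 1.0 / %(num_buckets)s)
                within group (order by amount asc) as calc_amount,
            bucket * 100 / %(num_buckets)s as calc_percentile
        from transactions
        cross join generate_series(1, %(num_buckets)s) bucket
        group by month, bucket
        ) c on month = calc_month and amount <= calc_amount
    order by month, amount, calc_percentile
    ) s
where t.month = s.month and t.amount = s.amount;
"""


def update_percentiles(connection, num_buckets=10):
    """Run the single-statement update and return the number of rows touched.

    `connection` is assumed to be an open psycopg2 connection.
    """
    if num_buckets < 1:
        raise ValueError("num_buckets must be at least 1")
    with connection.cursor() as cursor:
        # psycopg2 substitutes the %(num_buckets)s placeholders from the dict.
        cursor.execute(UPDATE_SQL, {"num_buckets": num_buckets})
        updated = cursor.rowcount
    connection.commit()
    return updated
```

Calling update_percentiles(connection, 4) should then issue the quartile version shown above in one round trip, instead of one SELECT plus many UPDATEs per bucket.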

By the way, id should be a primary key; it could then be used in the join for better performance.

select *
from transactions
order by month, amount;

 id | month | amount | percentile 
----+-------+--------+------------
  1 |     1 |      1 |         25
  2 |     1 |      2 |         50
  4 |     1 |      3 |         75
  3 |     1 |      5 |        100
  5 |     2 |      1 |         25
  6 |     2 |      3 |         25
  1 |     2 |      5 |         50
  1 |     2 |      7 |         50
  1 |     2 |      9 |         75
  1 |     2 |     11 |         75
  1 |     2 |     15 |        100
  1 |     2 |     16 |        100
(12 rows)
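
To cross-check the result above without a database, the same bucketing rule can be imitated in plain Python. This is a rough sketch, not the answer's method: the helper names percentile_cont and assign_buckets are made up here, and the interpolation is assumed to match Postgres's percentile_cont (linear interpolation between neighbouring sorted values). Each amount gets the smallest bucket whose cutoff is greater than or equal to it, which is what the distinct on ... order by calc_percentile part of the query selects.

```python
import math
from collections import defaultdict


def percentile_cont(sorted_values, fraction):
    """Linear-interpolation percentile over an already sorted list."""
    pos = fraction * (len(sorted_values) - 1)
    lower = math.floor(pos)
    upper = math.ceil(pos)
    low_val = sorted_values[lower]
    return low_val + (sorted_values[upper] - low_val) * (pos - lower)


def assign_buckets(rows, num_buckets=4):
    """rows: iterable of (month, amount). Returns (month, amount, percentile)."""
    rows = list(rows)
    by_month = defaultdict(list)
    for month, amount in rows:
        by_month[month].append(amount)

    # Per month: list of (cutoff amount, percentile label), in ascending order.
    cutoffs = {}
    for month, amounts in by_month.items():
        amounts.sort()
        cutoffs[month] = [
            (percentile_cont(amounts, b / num_buckets), b * 100 // num_buckets)
            for b in range(1, num_buckets + 1)
        ]

    # Pick the first (smallest) bucket whose cutoff covers the amount.
    return [
        (month, amount,
         next(label for cutoff, label in cutoffs[month] if amount <= cutoff))
        for month, amount in rows
    ]


# (month, amount) pairs from the example input in the question.
sample = [(1, 1), (1, 2), (1, 5), (1, 3),
          (2, 1), (2, 3), (2, 5), (2, 7), (2, 9), (2, 11), (2, 15), (2, 16)]
for row in assign_buckets(sample):
    print(row)
```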