PostgreSQL/Python: get the last N rows without duplicates

python postgresql

Is there a way to do this?

For example, if my table contains the following rows:

id | username | profile_photo
---+----------+--------------
 1 |     juan | urlphoto/juan
 2 |   nestor | urlphoto/nestor
 3 |    pablo | urlphoto/pablo
 4 |    pablo | urlphoto/pablo

and I want the last 2 rows, I should get:

id 2 -> nestor | urlphoto/nestor
id 3 -> pablo  | urlphoto/pablo

Thanks for your time.

Solution:

The solution is to insert an item only if it is not already among the last n rows:

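import psycopg2
import psycopg2.extras

# n (how many recent rows to check) and user_id (the candidate value)
# are assumed to be defined by the caller.
db = psycopg2.connect("")

cursor = db.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
# Pass n as a query parameter instead of writing the letter "n" into the SQL.
cursor.execute("SELECT * FROM users ORDER BY id DESC LIMIT %s;", (n,))
recent_ids = [item['user_id'] for item in cursor.fetchall()]

# Insert only if the value is not already among the last n rows.
if user_id not in recent_ids:
    cursor.execute("INSERT..")  # statement elided in the original
    db.commit()
cursor.close()
db.close()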

How about:

SELECT id, username, profile_photo
FROM (SELECT min(id) AS id, username, profile_photo
      FROM users  -- table name taken from the asker's snippet
      GROUP BY username, profile_photo) tmp
ORDER BY id DESC LIMIT 2

If you don't care about the order of the last rows, then go with:

SELECT min(id), username, profile_photo 
FROM oh_my_table
GROUP BY username, profile_photo
ORDER BY min(id) DESC 
LIMIT 2

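In your example you did not describe what makes a row a duplicate. Strictly speaking nothing is duplicated, because the id makes every row unique. I assume you want rows that are distinct on all columns except id, and that you do not care which of the several possible ids you get for a repeated row.

Let's start with some test data: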
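To make that assumption concrete, here is a minimal Python sketch of the intended result, independent of any database. The function name `last_n_distinct` and the tuple layout are illustrative choices, not part of the question. It keeps the highest id for each repeated row, so for the question's table it returns id 4 for pablo rather than id 3.

```python
from typing import Iterable, List, Tuple

Row = Tuple[int, str, str]  # (id, username, profile_photo)


def last_n_distinct(rows: Iterable[Row], n: int) -> List[Row]:
    """Return the n newest rows that differ on every column except id."""
    if n <= 0:
        return []
    seen = set()
    result = []
    # Walk from the highest id down, as ORDER BY id DESC would.
    for row in sorted(rows, key=lambda r: r[0], reverse=True):
        key = row[1:]  # (username, profile_photo)
        if key in seen:
            continue  # an equal row with a higher id was already kept
        seen.add(key)
        result.append(row)
        if len(result) == n:
            break
    return result


rows = [
    (1, "juan", "urlphoto/juan"),
    (2, "nestor", "urlphoto/nestor"),
    (3, "pablo", "urlphoto/pablo"),
    (4, "pablo", "urlphoto/pablo"),
]
print(last_n_distinct(rows, 2))
# [(4, 'pablo', 'urlphoto/pablo'), (2, 'nestor', 'urlphoto/nestor')]
```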

CREATE UNLOGGED TABLE profile_photos (id int, username text, profile_photo text);
Time: 417.014 ms

INSERT INTO profile_photos
SELECT g.id, r.username, 'urlphoto/' || r.username
FROM generate_series(1, 10000000) g (id)
CROSS JOIN substr(md5(g.id::text), 0, 8) r (username);
INSERT 0 10000000
Time: 24497.335 ms

I'll test two possible solutions, with two indexes available to each:

CREATE INDEX id_btree ON profile_photos USING btree (id);
CREATE INDEX
Time: 8139.347 ms

CREATE INDEX username_profile_photo_id_btree ON profile_photos USING btree (username, profile_photo, id DESC);
CREATE INDEX
Time: 81667.411 ms

VACUUM ANALYZE profile_photos;
VACUUM
Time: 1338.034 ms

So, the first solution is the one given by Sami and Clément; their queries are essentially the same:

SELECT min(id), username, profile_photo 
FROM profile_photos
GROUP BY username, profile_photo
ORDER BY min(id) DESC 
LIMIT 2;

   min    | username |  profile_photo   
----------+----------+------------------
 10000000 | d1ca3aa  | urlphoto/d1ca3aa
  9999999 | 283f427  | urlphoto/283f427
(2 rows)
Time: 5088.611 ms

The result looks fine, but if any of these users had posted the same profile photo earlier, this query can produce undesired results. Let's simulate that:

UPDATE profile_photos
SET (username, profile_photo) = ('d1ca3aa', 'urlphoto/d1ca3aa')
WHERE id = 1;
UPDATE 1
Time: 1.313 ms

SELECT min(id), username, profile_photo 
FROM profile_photos
GROUP BY username, profile_photo
ORDER BY min(id) DESC 
LIMIT 2;

   min   | username |  profile_photo   
---------+----------+------------------
 9999999 | 283f427  | urlphoto/283f427
 9999998 | facf1f3  | urlphoto/facf1f3
(2 rows)
Time: 5032.213 ms

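So the query ignores anything newer that a user may have added, because min(id) ranks each group by its oldest row. That does not look like what you want, so I suggest replacing min(id) with max(id):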

SELECT max(id), username, profile_photo 
FROM profile_photos
GROUP BY username, profile_photo
ORDER BY max(id) DESC 
LIMIT 2;

   max    | username |  profile_photo   
----------+----------+------------------
 10000000 | d1ca3aa  | urlphoto/d1ca3aa
  9999999 | 283f427  | urlphoto/283f427
(2 rows)
Time: 5068.507 ms

That is correct, but it looks slow. The query plan is:

                                                                                         QUERY PLAN                                                                                          
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=655369.97..655369.98 rows=2 width=29) (actual time=6215.284..6215.285 rows=2 loops=1)
   ->  Sort  (cost=655369.97..678809.36 rows=9375755 width=29) (actual time=6215.282..6215.282 rows=2 loops=1)
         Sort Key: (max(id))
         Sort Method: top-N heapsort  Memory: 25kB
         ->  GroupAggregate  (cost=0.56..561612.42 rows=9375755 width=29) (actual time=0.104..4945.534 rows=9816449 loops=1)
               ->  Index Only Scan using username_profile_photo_id_btree on profile_photos  (cost=0.56..392855.43 rows=9999925 width=29) (actual time=0.089..1849.036 rows=10000000 loops=1)
                     Heap Fetches: 0
 Total runtime: 6215.344 ms
(8 rows)
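The difference between min(id) and max(id) is easy to reproduce on a tiny table. The sketch below uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, which is an assumption made only so that it runs without a server; the GROUP BY query itself is the same. The fourth row, where juan posts the same photo again, is made-up illustration data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE profile_photos (id INTEGER, username TEXT, profile_photo TEXT)"
)
conn.executemany(
    "INSERT INTO profile_photos VALUES (?, ?, ?)",
    [
        (1, "juan", "urlphoto/juan"),
        (2, "nestor", "urlphoto/nestor"),
        (3, "pablo", "urlphoto/pablo"),
        (4, "juan", "urlphoto/juan"),  # juan posts the same photo again
    ],
)

# The aggregate name is substituted from a fixed choice, never from user input.
QUERY = """
    SELECT {agg}(id), username, profile_photo
    FROM profile_photos
    GROUP BY username, profile_photo
    ORDER BY {agg}(id) DESC
    LIMIT 2
"""

with_min = conn.execute(QUERY.format(agg="min")).fetchall()
with_max = conn.execute(QUERY.format(agg="max")).fetchall()
conn.close()

print(with_min)  # [(3, 'pablo', 'urlphoto/pablo'), (2, 'nestor', 'urlphoto/nestor')]
print(with_max)  # [(4, 'juan', 'urlphoto/juan'), (3, 'pablo', 'urlphoto/pablo')]
```

With min(id), juan's group is ranked by id 1 and drops out of the top two even though he posted most recently; with max(id) he comes first.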

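The thing to note here is that the aggregate with GROUP BY is not really being used as an aggregate: GROUP BY only filters out the duplicates, and the aggregate merely picks one row from each group. Postgres has an extension, DISTINCT ON, that lets you discard duplicates on a set of columns: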

SELECT *
FROM (    
    SELECT DISTINCT ON (username, profile_photo) *
    FROM profile_photos
) X
ORDER BY id DESC
LIMIT 2;

    id    | username |  profile_photo   
----------+----------+------------------
 10000000 | d1ca3aa  | urlphoto/d1ca3aa
  9999999 | 283f427  | urlphoto/283f427
(2 rows)
Time: 3779.723 ms

This is a bit faster, for the following reason:

                                                                                         QUERY PLAN                                                                                          
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=630370.16..630370.17 rows=2 width=29) (actual time=4921.031..4921.031 rows=2 loops=1)
   ->  Sort  (cost=630370.16..653809.55 rows=9375755 width=29) (actual time=4921.030..4921.030 rows=2 loops=1)
         Sort Key: profile_photos.id
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Unique  (cost=0.56..442855.06 rows=9375755 width=29) (actual time=0.114..4220.410 rows=9816449 loops=1)
               ->  Index Only Scan using username_profile_photo_id_btree on profile_photos  (cost=0.56..392855.43 rows=9999925 width=29) (actual time=0.111..2040.601 rows=10000000 loops=1)
                     Heap Fetches: 0
 Total runtime: 4921.081 ms
(8 rows)
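The shape of that plan (index scan, then Unique, then a top-N sort) can be mimicked in a few lines of Python. This is only a sketch under the assumption that rows arrive in the order of the (username, profile_photo, id DESC) index, as they do in the plan above; without an ORDER BY inside the subquery, PostgreSQL does not guarantee which row of each group DISTINCT ON keeps. The function name and the sample rows are illustrative.

```python
import heapq
from itertools import groupby
from operator import itemgetter

rows = [
    (1, "juan", "urlphoto/juan"),
    (2, "nestor", "urlphoto/nestor"),
    (3, "pablo", "urlphoto/pablo"),
    (4, "pablo", "urlphoto/pablo"),
]


def distinct_on_then_top_n(rows, n):
    # Index Only Scan: rows ordered by (username, profile_photo, id DESC).
    index_order = sorted(rows, key=lambda r: (r[1], r[2], -r[0]))
    # Unique: keep the first row of each (username, profile_photo) run.
    unique = (
        next(group)
        for _, group in groupby(index_order, key=itemgetter(1, 2))
    )
    # Top-N heapsort: keep only the n rows with the highest id.
    return heapq.nlargest(n, unique, key=itemgetter(0))


print(distinct_on_then_top_n(rows, 2))
# [(4, 'pablo', 'urlphoto/pablo'), (2, 'nestor', 'urlphoto/nestor')]
```

Like the query, the sketch still has to pass over every row before it can return the top two, which is why it is only modestly faster than the GROUP BY version.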

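If only we could somehow fetch the last row with a simple ORDER BY id DESC LIMIT 1, and then look from the end of the table for another row that is not a duplicate of it: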

WITH first AS (
    SELECT *
    FROM profile_photos
    ORDER BY id DESC
    LIMIT 1
)
SELECT *
FROM first
UNION ALL
(SELECT *
FROM profile_photos p
WHERE EXISTS (
    SELECT 1
    FROM first
    WHERE (first.username, first.profile_photo) <> (p.username, p.profile_photo))
ORDER BY id DESC
LIMIT 1);

    id    | username |  profile_photo   
----------+----------+------------------
 10000000 | d1ca3aa  | urlphoto/d1ca3aa
  9999999 | 283f427  | urlphoto/283f427
(2 rows)
Time: 1.217 ms

This is very fast, but it is hand-crafted to yield only two rows. Let's replace it with something more automated:

WITH RECURSIVE last (id, username, profile_photo, a) AS (
    (SELECT id, username, profile_photo, ARRAY[ROW(username, profile_photo)] a
    FROM profile_photos
    ORDER BY id DESC
    LIMIT 1)
    UNION ALL
    (SELECT older.id, older.username, older.profile_photo, last.a || ROW(older.username, older.profile_photo)
    FROM last
    JOIN profile_photos older ON last.id > older.id AND NOT ROW(older.username, older.profile_photo) = ANY(last.a)
    WHERE array_length(a, 1) < 10
    ORDER BY id DESC
    LIMIT 1)
)
SELECT id, username, profile_photo
FROM last;

    id    | username |  profile_photo   
----------+----------+------------------
 10000000 | d1ca3aa  | urlphoto/d1ca3aa
  9999999 | 283f427  | urlphoto/283f427
  9999998 | facf1f3  | urlphoto/facf1f3
  9999997 | 305ebab  | urlphoto/305ebab
  9999996 | 74ab43a  | urlphoto/74ab43a
  9999995 | 23f2458  | urlphoto/23f2458
  9999994 | 6b465af  | urlphoto/6b465af
  9999993 | 33ee85a  | urlphoto/33ee85a
  9999992 | c0b9ef4  | urlphoto/c0b9ef4
  9999991 | b63d5bf  | urlphoto/b63d5bf
(10 rows)
Time: 2706.837 ms

This is faster than the earlier queries, but as you can see in the query plan below, for every generated row it has to scan the index on id again:

                                                                                      QUERY PLAN                                                                                       
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 CTE Scan on last  (cost=6.52..6.74 rows=11 width=68) (actual time=0.104..4439.807 rows=10 loops=1)
   CTE last
     ->  Recursive Union  (cost=0.43..6.52 rows=11 width=61) (actual time=0.098..4439.780 rows=10 loops=1)
           ->  Limit  (cost=0.43..0.47 rows=1 width=29) (actual time=0.095..0.095 rows=1 loops=1)
                 ->  Index Scan Backward using id_btree on profile_photos  (cost=0.43..333219.47 rows=9999869 width=29) (actual time=0.093..0.093 rows=1 loops=1)
           ->  Limit  (cost=0.43..0.58 rows=1 width=61) (actual time=443.965..443.966 rows=1 loops=10)
                 ->  Nested Loop  (cost=0.43..1406983.38 rows=9510977 width=61) (actual time=443.964..443.964 rows=1 loops=10)
                       Join Filter: ((last_1.id > older.id) AND (ROW(older.username, older.profile_photo) <> ALL (last_1.a)))
                       Rows Removed by Join Filter: 8
                       ->  Index Scan Backward using id_btree on profile_photos older  (cost=0.43..333219.47 rows=9999869 width=29) (actual time=0.008..167.755 rows=1000010 loops=10)
                       ->  WorkTable Scan on last last_1  (cost=0.00..0.25 rows=3 width=36) (actual time=0.000..0.000 rows=0 loops=10000102)
                             Filter: (array_length(a, 1) < 10)
                             Rows Removed by Filter: 1
 Total runtime: 4439.907 ms
(14 rows)

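Since Postgres 9.3 there is a new join type available, the LATERAL join. It lets you make the join decision at the row level, i.e. the subquery is evaluated for each row. We can use it to implement the following logic: as long as we have fewer than N rows, for each generated row look for a row that is older than the last one and not yet seen, and if there is one, add it to the result: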

WITH RECURSIVE last (id, username, profile_photo, a) AS (
    (SELECT id, username, profile_photo, ARRAY[ROW(username, profile_photo)] a
    FROM profile_photos
    ORDER BY id DESC
    LIMIT 1)
    UNION ALL
    (SELECT older.id, older.username, older.profile_photo, last.a || ROW(older.username, older.profile_photo)
    FROM last
    CROSS JOIN LATERAL (
        SELECT *
        FROM profile_photos older
        WHERE last.id > older.id AND NOT ROW(older.username, older.profile_photo) = ANY(last.a)
        ORDER BY id DESC
        LIMIT 1
    ) older
    WHERE array_length(a, 1) < 10
    ORDER BY id DESC
    LIMIT 1)
)
SELECT id, username, profile_photo
FROM last;

    id    | username |  profile_photo   
----------+----------+------------------
 10000000 | d1ca3aa  | urlphoto/d1ca3aa
  9999999 | 283f427  | urlphoto/283f427
  9999998 | facf1f3  | urlphoto/facf1f3
  9999997 | 305ebab  | urlphoto/305ebab
  9999996 | 74ab43a  | urlphoto/74ab43a
  9999995 | 23f2458  | urlphoto/23f2458
  9999994 | 6b465af  | urlphoto/6b465af
  9999993 | 33ee85a  | urlphoto/33ee85a
  9999992 | c0b9ef4  | urlphoto/c0b9ef4
  9999991 | b63d5bf  | urlphoto/b63d5bf
(10 rows)
Time: 1.966 ms

Now it is fast... until N gets too large:

                                                                                        QUERY PLAN                                                                                        
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 CTE Scan on last  (cost=18.61..18.83 rows=11 width=68) (actual time=0.074..0.359 rows=10 loops=1)
   CTE last
     ->  Recursive Union  (cost=0.43..18.61 rows=11 width=61) (actual time=0.070..0.346 rows=10 loops=1)
           ->  Limit  (cost=0.43..0.47 rows=1 width=29) (actual time=0.067..0.068 rows=1 loops=1)
                 ->  Index Scan Backward using id_btree on profile_photos  (cost=0.43..333219.47 rows=9999869 width=29) (actual time=0.065..0.065 rows=1 loops=1)
           ->  Limit  (cost=1.79..1.79 rows=1 width=61) (actual time=0.026..0.026 rows=1 loops=10)
                 ->  Sort  (cost=1.79..1.80 rows=3 width=61) (actual time=0.025..0.025 rows=1 loops=10)
                       Sort Key: older.id
                       Sort Method: quicksort  Memory: 25kB
                       ->  Nested Loop  (cost=0.43..1.77 rows=3 width=61) (actual time=0.020..0.021 rows=1 loops=10)
                             ->  WorkTable Scan on last last_1  (cost=0.00..0.25 rows=3 width=36) (actual time=0.001..0.001 rows=1 loops=10)
                                   Filter: (array_length(a, 1) < 10)
                                   Rows Removed by Filter: 0
                             ->  Limit  (cost=0.43..0.49 rows=1 width=29) (actual time=0.017..0.017 rows=1 loops=9)
                                   ->  Index Scan Backward using id_btree on profile_photos older  (cost=0.43..161076.14 rows=3170326 width=29) (actual time=0.016..0.016 rows=1 loops=9)
                                         Index Cond: (last_1.id > id)
                                         Filter: (ROW(username, profile_photo) <> ALL (last_1.a))
                                         Rows Removed by Filter: 0
 Total runtime: 0.439 ms
(19 rows)
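Since the question also involves Python, the LATERAL query can be wrapped in a small helper that takes N as a parameter. This is a rough sketch, not a tested production function: the name `fetch_last_n_distinct` is made up here, the cursor is assumed to be a psycopg2 cursor (or any DB-API cursor that uses %s placeholders), and the table and columns are those of the test data above.

```python
LAST_N_DISTINCT_SQL = """
WITH RECURSIVE last (id, username, profile_photo, a) AS (
    (SELECT id, username, profile_photo, ARRAY[ROW(username, profile_photo)] a
    FROM profile_photos
    ORDER BY id DESC
    LIMIT 1)
    UNION ALL
    (SELECT older.id, older.username, older.profile_photo,
            last.a || ROW(older.username, older.profile_photo)
    FROM last
    CROSS JOIN LATERAL (
        SELECT *
        FROM profile_photos older
        WHERE last.id > older.id
          AND NOT ROW(older.username, older.profile_photo) = ANY(last.a)
        ORDER BY id DESC
        LIMIT 1
    ) older
    WHERE array_length(a, 1) < %s
    ORDER BY id DESC
    LIMIT 1)
)
SELECT id, username, profile_photo
FROM last;
"""


def fetch_last_n_distinct(cursor, n):
    """Run the recursive LATERAL query, asking for at most n distinct rows."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # n is sent as a bound parameter, never formatted into the SQL text.
    cursor.execute(LAST_N_DISTINCT_SQL, (n,))
    return cursor.fetchall()
```

With an open psycopg2 connection `db`, it would be called as `fetch_last_n_distinct(db.cursor(), 10)`. The array of already seen rows grows with every step, so the same caveat about large N applies.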