Identifying duplicate time series in Postgres

python sql postgresql time-series

I have a time series table in a Postgres DB with the columns

item_id,  country_id,  year,  month, value

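This table contains duplicate time series: they have the same country_id and the same dates/values, but were assigned different item_ids, for example "Red Apples" and "Apples, Red".

How can I identify these duplicate time series? I want country_id, year, month and value to match on all dates for which the items exist.

I'm a beginner, so please forgive any details I have left out. I'm mainly looking for a conceptual approach; I can implement it in Postgres or in Python/pandas.

For example, I'd like to be able to detect something like this: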

item_id,     country_id,     year,     month,    value
-------------------------------------------------------
Red Apples   5               1996      1         300
Red Apples   5               1996      2         500
Red Apples   5               1996      3         370
Apples, Red  5               1996      1         300
Apples, Red  5               1996      2         500
Apples, Red  5               1996      3         370

I would like the output to look like this:

item_id1,     item_id2,      country_id,     year,     month_range
-----------------------------------------------------------------
Red Apples    Apples, Red         5          1996       [1,3]

Something like this would be fine too:

item_id1,     item_id2,      country_id,     year,     time_month,   value
--------------------------------------------------------------------------
Red Apples    Apples, Red         5          1996         1           300
Red Apples    Apples, Red         5          1996         2           500
Red Apples    Apples, Red         5          1996         3           370

I was thinking of trying something like this:

select distinct A.country_id, A.item_id, B.item_id, A.year, A.month, A.value
                      from my_table as A,
                      my_table as B 
                      where
                      (A.country_id=B.country_id and 
                      A.item_id<>B.item_id and 
                      A.year=B.year and 
                      A.month=B.month and 
                      A.value=B.value )

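Then I would check that all dates/values show up for each identified pair of item_ids. But if possible, I'd like to check all dates/values in one go.

I'm not sure whether a table join is the right approach...?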

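select country_id, year, month, value
from my_table
group by country_id, year, month, value
having count(item_id) > 1

!! This is untested.

See the update below.

Unless you provide more details about the sample data and the expected results, I think the following queries might help: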

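SELECT country_id, year, month, value
  FROM a_table
 GROUP BY country_id, year, month, value
HAVING count(*) > 1;

This query shows all entries that are equal in everything except item_id. If you want to find all rows that belong to the duplicate groups, use the following query: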

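SELECT item_id, country_id, year, month, value
  FROM a_table
 WHERE (country_id, year, month, value) IN (
        SELECT country_id, year, month, value
          FROM a_table
         GROUP BY country_id, year, month, value
        HAVING count(*) > 1)
 ORDER BY country_id, year, month, value, item_id;

I have put the item_id column last in the sort order, which should make the duplicates easier to spot. Feel free to adjust it. This query might take a while, depending on your data.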
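For readers who would rather prototype this in Python, the same grouping idea can be sketched with the standard library alone. This is only an illustration: the `rows` list is the sample data from the question, `duplicate_groups` is a hypothetical helper name, and it counts distinct item_ids per key rather than raw rows, which differs slightly from `count(*) > 1`.

```python
from collections import defaultdict

# Sample rows copied from the question: (item_id, country_id, year, month, value)
rows = [
    ("Red Apples", 5, 1996, 1, 300),
    ("Red Apples", 5, 1996, 2, 500),
    ("Red Apples", 5, 1996, 3, 370),
    ("Apples, Red", 5, 1996, 1, 300),
    ("Apples, Red", 5, 1996, 2, 500),
    ("Apples, Red", 5, 1996, 3, 370),
]

def duplicate_groups(rows):
    """Map each (country_id, year, month, value) key to the item_ids sharing it.

    Only keys shared by more than one distinct item_id are kept.
    """
    groups = defaultdict(set)
    for item_id, country_id, year, month, value in rows:
        groups[(country_id, year, month, value)].add(item_id)
    return {key: sorted(items) for key, items in groups.items() if len(items) > 1}

dups = duplicate_groups(rows)
for key in sorted(dups):
    print(key, dups[key])
```

Unlike the first SQL query, this keeps the conflicting item_ids next to each key, at the cost of holding every key in memory.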

To prevent such duplicates in the future, you might want to create a unique constraint, like this:

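ALTER TABLE a_table ADD CONSTRAINT u_cymv UNIQUE (country_id, year, month, value);

EDIT: After the added comments, I came up with the following query to find series of duplicates: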

WITH a_table(item_id,country_id,year,month,value) AS (VALUES
    ('Red Apples'::text,5,1996,1,300::numeric),
    ('Red Apples',5,1996,2,500),
    ('Red Apples',5,1996,3,370),
    ('Apples, Red',5,1996,1,300),
    ('Apples, Red',5,1996,2,500),
    ('Apples, Red',5,1996,3,370)
), dups AS (
    SELECT string_agg(item_id,'/') AS items,
           country_id,value,
           daterange(to_date(year::text||month,'YYYYMM'),
                     (to_date(year::text||month,'YYYYMM')
                      +INTERVAL'1mon')::date,'[)') AS range
      FROM a_table
     GROUP BY country_id,year,month,value
    HAVING count(*) > 1
)
SELECT grp,count(*),items,country_id,
       daterange(min(lower(range)), max(upper(range)), '[)') r,
       array_agg(value)
  FROM ( 
    SELECT items,country_id,range,value,
           sum(g) OVER (ORDER BY country_id, range) grp
      FROM (
        SELECT items,country_id,
               range,value,
               CASE WHEN lag(range) OVER (PARTITION BY country_id
                                          ORDER BY range) -|- range
                    THEN NULL ELSE 1 END g
          FROM dups) s
    ) s
 GROUP BY grp,country_id,items
HAVING count(*) >= 3
 ORDER BY country_id,r,items;

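What it does:

a_table is a copy of the sample data provided;
dups is a CTE that finds the duplicate records. I also convert the year and month columns into a daterange, because I see no other way to correctly find series that span the New Year;
with the duplicates listed, I compare the previous range with the current one within each country_id, and set the group flag g if they are not adjacent;
next, I use sum as a window function to create the group identifier grp. For the sample data this yields just one group;
finally, I use grp in the GROUP BY to collect the data into series. I also include country_id and items in the GROUP BY, but only to avoid wrapping them in aggregate functions; they are unique within each grp. I also build a new daterange column by hand, because there are no built-in aggregate functions for range types.

Before running this query you may want to increase work_mem, up to 1GB, depending on the number of rows in the real table.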
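The adjacency check and running group counter described above can also be illustrated outside SQL. The following is a minimal Python sketch of the same idea, not a translation of the query: `month_index` and `consecutive_runs` are hypothetical helper names, and a month counter stands in for the daterange so that December and January are treated as neighbours.

```python
def month_index(year, month):
    """Months since year 0, so 1996-12 and 1997-01 differ by exactly 1."""
    return year * 12 + (month - 1)

def consecutive_runs(months):
    """Split (year, month) pairs into runs of consecutive months."""
    runs = []
    for ym in sorted(set(months), key=lambda p: month_index(*p)):
        # Extend the current run when this month directly follows the last one,
        # otherwise start a new run (the role of the flag g in the query).
        if runs and month_index(*ym) - month_index(*runs[-1][-1]) == 1:
            runs[-1].append(ym)
        else:
            runs.append([ym])
    return runs

runs = consecutive_runs([(1996, 11), (1996, 12), (1997, 1), (1997, 5)])
print(runs)
# [[(1996, 11), (1996, 12), (1997, 1)], [(1997, 5)]]
```

The first three months form one run across the year boundary, while the isolated month starts a run of its own.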

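Please give it a try and let me know whether it works for you. It would be great if you could share the EXPLAIN (analyze, buffers) output for this one.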

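Sorry, maybe my question is clearer now. I'm not trying to identify duplicate rows, but entire data series that were given two different names. Your suggestion might work, but I would really like to see the two conflicting item_id values.

Thanks, that gets me part of the way there. But it still doesn't identify the conflicting item_ids, and my original table has 10 million rows and maybe 1000 distinct item_ids, so it can't be done by hand.

@user3591836, I don't understand what you mean by identify. The query I provided returns only the duplicated series. Please be precise.

I want the output to include the item_ids of all duplicated time series, plus the time interval over which they are identical. Something like "Red Apples", "Apples, Red", 1996, [1,3].

What if your data has another entry, such as Yellow Bananas, 5, 1996, 1, 300? Does that count as a duplicate here too?

I only want to identify duplicated time series, or at least subsequences. Not coincidental matches on a single date.

What is the minimum length of a series? And how should series that cross a year boundary be handled, such as 1996-12, 1997-1?

Each item_id, country_id pair will have several years of data, and I want to find data that is identical for at least 3 consecutive months. The exact output format doesn't matter, as long as it returns all item_ids and country_ids & dates where the values are the same.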
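The requirement from the last comment (pairs of item_ids whose values match for at least 3 consecutive months, including across a year boundary) could be prototyped along these lines in plain Python. This is a sketch under assumptions, not a tested solution for the real table: `duplicated_runs` and `min_len` are hypothetical names, rows are assumed to be (item_id, country_id, year, month, value) tuples that fit in memory, and the "Yellow Bananas" row is the coincidental match raised in the comments.

```python
from collections import defaultdict
from itertools import combinations

def duplicated_runs(rows, min_len=3):
    """Return (item_id1, item_id2, country_id, first_month, last_month) tuples,
    one per run of at least min_len consecutive months with identical values."""
    # 1. Collect the item_ids that share each (country_id, year, month, value).
    by_key = defaultdict(set)
    for item_id, country_id, year, month, value in rows:
        by_key[(country_id, year, month, value)].add(item_id)

    # 2. For each pair of items sharing a key, record the month as a counter.
    matches = defaultdict(set)
    for (country_id, year, month, _value), items in by_key.items():
        for a, b in combinations(sorted(items), 2):
            matches[(a, b, country_id)].add(year * 12 + month - 1)

    # 3. Split each pair's months into consecutive runs and keep the long ones.
    results = []
    for (a, b, country_id), idxs in sorted(matches.items()):
        run = []
        for idx in sorted(idxs) + [None]:  # the trailing None flushes the last run
            if run and idx is not None and idx - run[-1] == 1:
                run.append(idx)
                continue
            if len(run) >= min_len:
                first, last = run[0], run[-1]
                results.append((a, b, country_id,
                                (first // 12, first % 12 + 1),
                                (last // 12, last % 12 + 1)))
            run = [idx] if idx is not None else []
    return results

rows = [
    ("Red Apples", 5, 1996, 1, 300),
    ("Red Apples", 5, 1996, 2, 500),
    ("Red Apples", 5, 1996, 3, 370),
    ("Apples, Red", 5, 1996, 1, 300),
    ("Apples, Red", 5, 1996, 2, 500),
    ("Apples, Red", 5, 1996, 3, 370),
    ("Yellow Bananas", 5, 1996, 1, 300),  # single coincidental match
]
result = duplicated_runs(rows)
print(result)
# [('Apples, Red', 'Red Apples', 5, (1996, 1), (1996, 3))]
```

The single-month match with "Yellow Bananas" is dropped because its run is shorter than `min_len`. On a table with 10 million rows, the pair-building step may grow quickly when many items share a key, so filtering in SQL first is probably still worthwhile.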