R: Replacing loop-based deduplication code with Redshift SQL


r loops amazon-redshift


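We are trying to migrate a large amount of legacy R code used for dataset manipulation to Redshift SQL. All of it has been easy to port except the piece below, which has proven intractable. That is why I come to you, gentle reader. I suspect that what I am asking is impossible, but I lack the ability to prove it.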

The R code below uses a looping mechanism to dedupe unique integer identifiers. You will see the full details in the inline comments.

Before we begin, here is a small annotated sample set to give you an idea of the effect the desired SQL code should have:

Here is the annotated R code we are trying to replace with Redshift SQL:

# the purpose of this function is to dedupe a set of identifiers
    # so that each month, the set of identifiers grouped under that month
    # will not have appeared in the previous two months
    # it does this by building 3 sets:
        # current month
        # previous month
        # 2 months ago
        # In a loop, it sets the current month set for the current year-month value in the loop
            # then filters that set against the contents of previous 2 months' sets
            # then unions the surviving month's set against the survivors of previous months so far

# I believe the functionality below is mainly taken from library(dplyr)
library(dplyr)
library(tidyverse)
library(lubridate)
library(multidplyr) 
library(purrr)
library(stringr)
library(RJDBC)

dedupeIdentifiers <- function(dataToDedupe, YearToStart = 2014, YearToEnd = 2016) { 
    # dataToDedupe is input set
    # YearToStart = default starting year
    # YearToEnd = default ending year

    monthYearSeq <- expand.grid(Month = 1:12, Year = YearToStart:YearToEnd) %>% as_tibble() # make a grid having all months 1:12 from starting to ending year
    twoMonthsAgoIdentifiers <- tibble(uniqueidentifier = integer(0)) # empty tibble to hold the identifiers kept two months ago
    oneMonthAgoIdentifiers  <- tibble(uniqueidentifier = integer(0)) # empty tibble to hold the identifiers kept one month ago
    identifiersToKeep <- dataToDedupe %>% slice(0) # empty copy of the input (same columns, no rows) to accumulate the surviving rows

    for(i in 1:nrow(monthYearSeq)) {
        curMonth <- monthYearSeq$Month[i] # get current month for row in loop of monthYearSeq
        curYear <- monthYearSeq$Year[i] # get current year for row in loop of monthYearSeq

        curIdentifiers <- dataToDedupe %>% filter(year(initialdate) == curYear, month(initialdate) == curMonth)%>% 
            # initialdate is the date variable in the set by which the set is filtered
            # start by filtering to make a subset, curIdentifiers, which is the set where initialdate == current month and year in the loop
            group_by(uniqueidentifier) %>% slice(1) %>% ungroup() %>%  # take just 1 example of each unique identifier in the subset
            anti_join(twoMonthsAgoIdentifiers) %>% # filter out uniqueidentifier that were in set two months ago
            anti_join(oneMonthAgoIdentifiers) # filter out uniqueidentifier that were in set one month ago

        twoMonthsAgoIdentifiers <- oneMonthAgoIdentifiers # move one month set into two month set
        oneMonthAgoIdentifiers <- curIdentifiers %>% select(uniqueidentifier) # move current month set into one month set
        identifiersToKeep <- bind_rows(identifiersToKeep, curIdentifiers) # add "surviving" unique identifiers after filtering for last 2 months
            # to the updated set of deduped identifiers
    } # lather, rinse, repeat

    return(identifiersToKeep) # return all survivors
}
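
For checking any SQL candidate against the loop, a compact reference port of the function above may help. The following is a minimal Python sketch, not part of the original code: the name `dedupe_identifiers` and the `(identifier, initial_date)` tuple layout are assumptions made for illustration, and `rows` is assumed to be a list. It mirrors the three rolling sets and the "first row per identifier" step.

```python
from datetime import date


def dedupe_identifiers(rows, year_start=2014, year_end=2016):
    """Replay the R loop over a list of (identifier, initial_date) tuples.

    Hypothetical helper for parity checks; returns the kept rows in loop order.
    """
    two_months_ago = set()  # identifiers kept two months before the current one
    one_month_ago = set()   # identifiers kept in the previous month
    kept = []
    # expand.grid varies Month fastest, so year is the outer loop
    for year in range(year_start, year_end + 1):
        for month in range(1, 13):
            current = {}
            for identifier, initial_date in rows:
                if (initial_date.year, initial_date.month) != (year, month):
                    continue
                if identifier in two_months_ago or identifier in one_month_ago:
                    continue  # the two anti_join steps
                # keep only the first row per identifier, like slice(1)
                current.setdefault(identifier, (identifier, initial_date))
            two_months_ago = one_month_ago
            one_month_ago = set(current)
            kept.extend(current.values())
    return kept


# identifier 123 appears in five consecutive months of 2014
rows = [(123, date(2014, m, 15)) for m in range(1, 6)]
print([d.month for _, d in dedupe_identifiers(rows)])  # [1, 4]
```

Because each anti-join only matches on the identifier, every identifier is processed independently of the others. That is why the example with a single identifier is enough to show the rule: months 1 and 4 survive, months 2, 3, and 5 are dropped.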

Finally, here are some things we have tried so far without success:

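- Recursive CTEs were proposed. Redshift does not allow recursive CTEs.
- Using lags to evaluate the date difference between the current date value and previous date values, partitioned on the unique identifier. This does not work when the same unique identifier, say 123, has a consecutive run of months 1-5. In that case months 4 and 5 are both kept, but month 5 should actually be dropped.
- Left-joining the set against itself on the unique identifier so that all month permutations can be evaluated. This has the same problem as using lags.
- Using a dummy date set containing all desired months and years to inject the missing months and years into the set to be filtered, flagging the rows that come from the original set, then using dense_rank partitioned on unique identifier and flag to select each row whose rank % 3 = 0. The problem is that you cannot always get the dense_rank values to count as needed across partitions, so the % 3 values come out wrong.
- Using combinations of the above.

We can get to about 90% parity with the original loop code, but unfortunately we must have a perfect replacement.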
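
The trouble with lag-based attempts can be seen on the consecutive-months case. The sketch below is only an illustration in Python: `kept_by_naive_lag` is one plausible form of a lag rule, not necessarily the variant tried above (which failed differently), and both function names are hypothetical. Months are plain integers, for example `year * 12 + month`, with at most one row per identifier per month.

```python
def kept_by_naive_lag(months):
    """Keep a month when the previous *raw* row is more than 2 months back."""
    kept, previous = [], None
    for m in sorted(months):
        if previous is None or m - previous > 2:
            kept.append(m)
        previous = m  # always advances, whether or not the row was kept
    return kept


def kept_by_loop_rule(months):
    """Keep a month when the previous *kept* row is more than 2 months back."""
    kept = []
    for m in sorted(months):
        if not kept or m - kept[-1] > 2:
            kept.append(m)
    return kept


months = [1, 2, 3, 4, 5]  # one identifier, five consecutive months
print(kept_by_naive_lag(months))  # [1]
print(kept_by_loop_rule(months))  # [1, 4]
```

The loop's decision for a row depends on whether earlier rows were themselves kept, so a fixed-offset comparison against raw neighbouring rows cannot reproduce it in general.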

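Please respect our goal of reproducing this in SQL, or of proving that reproducing the loop's results in SQL is impossible in this case. Responses such as "stick with R", "do the loop in Python", or "try this new package" will not be helpful.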

Thank you very much for your constructive suggestions.

Your process can be done in Redshift using SQL sessionization techniques.

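Basically, you use a number of LAG statements to compare data over specific windows, then compare the results to make the final classification.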

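How many values can you have with the same unique identifier? Max?

Thanks @JonScott. I misstated this earlier: in a given month, based on initialdate, there should be only one instance of a unique identifier value, and that value should not have appeared in the previous 2 months.

It is possible to do this with a Redshift Python UDF. For each row, you can pass the current row's data and an array of all previous values, which can be created in SQL by joining to an array_agg summary sub-table. Python can then apply the same complex logic as in R and return a flag indicating whether the row should be kept.

Thanks! So far, using lags has not worked, but I will give it a try and see whether it offers anything new.

Please don't hesitate to use temp tables or staging tables. Break the logic into simpler parts. Also be sure to declare a dist key for the temp tables. If they all share the same dist key, the final calculations will be faster.

The essence of these articles seems to be using lag to look back and flag the rows to keep or act on. I have not yet found a lag approach that catches all cases, but I will give it a try. Just so you know, a full test of an approach against all the data is somewhat involved, so if I don't respond quickly, that is why. I will see whether I can dig up some exceptions from past efforts where lag did not work; maybe that will expose whatever I am missing about your sessionization approach.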
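
The flag logic described in the Python UDF comment could look roughly like the sketch below. It is written as plain Python and assumes, purely for illustration, that the earlier months for the identifier arrive as a comma-separated string of month indices (`year * 12 + month`). The name `should_keep`, the string format, and how the string is aggregated in SQL are all assumptions, and wrapping this in an actual Redshift UDF is left out.

```python
def should_keep(current_month, previous_months_csv):
    """Return True if the row for current_month survives the dedupe.

    current_month: month index of this row (year * 12 + month).
    previous_months_csv: strictly earlier month indices for the same
        identifier, e.g. "24169,24170"; may be empty or None.
    """
    text = previous_months_csv or ""
    previous = sorted(int(p) for p in text.split(",") if p.strip())
    last_kept = None
    # replay the keep/drop decisions in order, ending with the current row
    for m in previous + [current_month]:
        if last_kept is None or m - last_kept > 2:
            last_kept = m
    return last_kept == current_month


print(should_keep(4, "1,2,3"))    # True
print(should_keep(5, "1,2,3,4"))  # False
```

Replaying the whole history for every row is quadratic in the number of months per identifier, which should be acceptable when each identifier has at most a few dozen months. The sketch also assumes every month passed in falls inside the year range covered by the original loop.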