Using the Redshift COPY command to do a merge

amazon-web-services amazon-redshift

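I have a process that iterates over its input and writes data to AWS Firehose, which I have configured to upload into a Redshift table I created. One problem is that rows can sometimes be duplicated, because the process needs to re-evaluate the data. For example: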

Event_date, event_id, event_cost
2015-06-25, 123, 3
2015-06-25, 123, 4

From what I have read, I want to replace the old rows with the new values, something like this:

insert into event_table_staging  
select event_date,event_id, event_cost from <s3 location>;

delete from event_table  
using event_table_staging  
where event_table.event_id = event_table_staging.event_id;

insert into target 
select * from event_table_staging;

delete from event_table_staging  
select * from event_table_staging;

Is it possible to do something like the following instead:

Redshift columns: event_date,event_id,cost
copy event_table from <s3> 
(update event_table 
select c_source.event_date,c_source.event_id,c_source.cost from <s3 source> as c_source join event_table on c_source.event_id = event_table.event_id) 
CSV


copy event_table from <s3> 
(insert into event_table 
select c_source.event_date,c_source.event_id,c_source.cost from event_table left outer join<s3 source> as c_source join on c_source.event_id = event_table.event_id where c_source.event_id is NULL) 
CSV

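You cannot do a merge directly from a COPY.

However, your initial approach can be wrapped in a transaction, using a temp table to stage the load data for best performance.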

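BEGIN
;
CREATE TEMP TABLE event_table_staging (
     event_date  TIMESTAMP  NULL
    ,event_id    BIGINT     NULL
    ,event_cost  INTEGER    NULL )
DISTSTYLE KEY
DISTKEY (event_id)
SORTKEY (event_id)
;
-- Load the new batch into the staging table; add your own authorization and format options.
COPY event_table_staging
FROM <s3 location>
COMPUPDATE ON
;
-- Update rows that already exist in the target and whose values have changed.
-- The target is named only once; repeating it in FROM would create an unintended self-join.
-- The sentinels make the comparison NULL-safe (a NULL and the sentinel itself compare as equal).
UPDATE event_table
SET    event_date = stg.event_date
      ,event_cost = stg.event_cost
FROM   event_table_staging AS stg
WHERE  event_table.event_id = stg.event_id
  AND (   COALESCE(event_table.event_date, '1900-01-01'::TIMESTAMP) <> COALESCE(stg.event_date, '1900-01-01'::TIMESTAMP)
       OR COALESCE(event_table.event_cost, 0) <> COALESCE(stg.event_cost, 0) )
;
-- Insert staged rows whose event_id is not yet present in the target.
INSERT INTO event_table
SELECT  stg.event_date
       ,stg.event_id
       ,stg.event_cost
FROM        event_table_staging AS stg
LEFT JOIN   event_table         AS trg
       ON   trg.event_id = stg.event_id
WHERE trg.event_id IS NULL
;
COMMIT
;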
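
The delete-then-insert idea from the question can be written the same way. This is only a sketch: it assumes event_table_staging has already been created and loaded in the current session, that it holds at most one row per event_id, and that event_table has the three columns in this order.

```sql
BEGIN;

-- Remove target rows that are about to be replaced by staged rows.
DELETE FROM event_table
USING event_table_staging
WHERE event_table.event_id = event_table_staging.event_id;

-- Insert every staged row: replacements and brand-new events alike.
INSERT INTO event_table
SELECT event_date, event_id, event_cost
FROM event_table_staging;

COMMIT;
```

Compared with the update-plus-insert script, this rewrites every matching row even when nothing changed, so it is probably the simpler but heavier of the two.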

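As long as you use a single transaction and the total volume of updates is relatively low (single-digit %), this approach actually performs very well. The only caveat is that your target table needs to be VACUUMed periodically; once a month is enough.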
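
For reference, the periodic maintenance could look like the sketch below. The table name comes from the question; the ANALYZE step is an addition that the answer does not mention.

```sql
-- Reclaim the space left behind by updated and deleted rows and re-sort the table.
-- VACUUM cannot run inside a transaction block, so issue it on its own.
VACUUM event_table;

-- Refresh the planner statistics after the table has changed significantly.
ANALYZE event_table;
```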

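We run this every hour on several tables in the 100-million-row range, i.e. 100-million-row merges into 100-million-row tables. User queries against the merged tables still perform well.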

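Redshift is optimized to handle large volumes of data cost-effectively, and you need to change some of the ideas about data and databases that you bring from other databases.

The main concept is that you should not update data in Redshift. You should think of the data in Redshift as a "log". You can use operations such as INSERT or UPDATE, but they will severely limit the amount of data you can handle.

You can handle duplicates in several ways:

You can prevent duplicates from being written in the first place by keeping an in-memory lookup table of all the IDs being processed (for example, in Redis) and ignoring records that have already been processed.
You can keep the duplicates in Redshift and handle them with a function that returns only one of the records.
You can load the raw events into Redshift and do the aggregation in your queries against the database rather than as a pre-processing step. This pattern also gives you the flexibility to change how you aggregate the data. Redshift is very fast at these aggregations, and pre-aggregation is rarely needed.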
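
As an illustration of the second option, a query can keep the duplicates in the table and pick one row per event at read time. This is a sketch under one assumption: load_id is a hypothetical IDENTITY column, not part of the original table, used to decide which duplicate is the latest.

```sql
-- Read-time deduplication: return only the most recently loaded row per event_id.
-- ASSUMPTION: load_id is a hypothetical IDENTITY column added to event_table.
SELECT event_date, event_id, event_cost
FROM (
    SELECT event_date,
           event_id,
           event_cost,
           ROW_NUMBER() OVER (PARTITION BY event_id
                              ORDER BY load_id DESC) AS rn
    FROM event_table
) AS ranked
WHERE rn = 1;
```

Identity values are unique, but they may not strictly follow arrival order when a load runs in parallel, so a load timestamp column may be a safer ordering key.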

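If you still want "clean" and aggregated data in Redshift, you can UNLOAD that data with an SQL query that uses the right aggregation or window functions, drop the old table, and COPY the data back into Redshift.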
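
A variation that stays inside Redshift, rather than the unload-and-reload route described above, is to rebuild the table with CREATE TABLE ... AS. The sketch below rests on assumptions: load_id is a hypothetical IDENTITY column, event_table_clean and event_table_old are made-up names, and nothing else (such as a view) depends on the old table.

```sql
BEGIN;

-- Build a deduplicated copy that keeps the latest row per event_id.
-- ASSUMPTION: load_id is a hypothetical IDENTITY column on event_table.
CREATE TABLE event_table_clean AS
SELECT event_date, event_id, event_cost
FROM (
    SELECT event_date,
           event_id,
           event_cost,
           ROW_NUMBER() OVER (PARTITION BY event_id
                              ORDER BY load_id DESC) AS rn
    FROM event_table
) AS ranked
WHERE rn = 1;

-- Swap the clean copy into place and drop the original.
ALTER TABLE event_table RENAME TO event_table_old;
ALTER TABLE event_table_clean RENAME TO event_table;
DROP TABLE event_table_old;

COMMIT;
```

The new table does not inherit the distribution and sort keys of the original, so they would need to be declared in the CREATE TABLE statement if they matter for your queries.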

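Thanks, I added an identity column so that I can tell rows with the same event id apart. I will create a separate table that stores only the latest updated results.

I do disagree that UPDATE, INSERT and DELETE are unilaterally wrong in Redshift. These commands are all fully supported and well documented. We use them to merge millions of rows per hour in Redshift without any problems. Also, doing an unload and reload into the same database is completely unnecessary; INSERT INTO and CREATE TABLE ... AS exist for exactly this purpose.

In this case I do not plan to unload, nor to use window functions. I do plan to use ElastiCache on the stream side. My solution will stream into a staging table with an autoid column. Since I break out the event dates, I will create a separate load step that merges the actuals at EOD, which will take the latest values and clean up the historical data.

I just realized that my solution is closer to yours... should I accept your answer?

I can't do it in one pass, because the input is a Firehose stream. I don't need hourly pulls, only EOD, which is what I will configure a Lambda or Data Pipeline job for (still deciding whether I need event-based execution).

Which part matters for performance? Why is it important to use a transaction and a temp table? If the staging table is fairly large, I assume the temp table is not held in memory (or distributed in memory?) but on disk, like a permanent table.

I have been using the AWS DMS tool to sync from Postgres to Redshift, and the vast majority of the usage on the RS cluster is DMS update jobs just trying to keep up. I noticed that it does not create properly typed stage tables; instead, everything is a wide string and merge/upsert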