Apache NiFi: extracting multiline content from flowfile content

Tags: apache-nifi, hortonworks-data-platform, hortonworks-dataflow

I am importing data from a MySQL table (only selected columns) and putting it into HDFS. Once that is done, I want to create a table in Hive.

For this I have a schema.sql file that contains the CREATE TABLE statement for the entire table, and I want to generate a new CREATE TABLE statement only for the columns I imported, similar to what I do with grep in the example below.
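A hypothetical sketch of that idea, assuming schema.sql holds one column definition per line and col1/col2 stand in for the imported columns:

 grep -E 'col1|col2' schema.sql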

I tried FetchFile followed by ExtractText but could not get it to work. How can I achieve this with NiFi processors, or even with the Expression Language, if I put the whole schema into an attribute?


Or is there a better way to create the table on the imported data?

NiFi can generate create table statements based on the flowfile contents.

1. Creating the ORC table using the ConvertAvroToORC processor:

  • If you want to convert the Avro data into ORC format and then store it into HDFS, the ConvertAvroToORC processor adds a hive.ddl attribute to the flowfile.

  • The PutHDFS processor adds an absolute.hdfs.path attribute to the flowfile.

  • We can use these hive.ddl and absolute.hdfs.path attributes to create the ORC table on top of the HDFS directory dynamically.

Flow:

 Pull data from source (ExecuteSQL...etc)
  -> ConvertAvroToORC //add Hive DbName.TableName in the Hive Table Name property value
  -> PutHDFS //store the orc file into the HDFS location
  -> ReplaceText //replace the flowfile content with ${hive.ddl} LOCATION '${absolute.hdfs.path}'
  -> PutHiveQL //execute the create table statement
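With that flow, the statement executed by PutHiveQL would look something like the sketch below. The table name, columns, and location are placeholders for a hypothetical id/name table; the real values come from the hive.ddl and absolute.hdfs.path attributes:

 CREATE EXTERNAL TABLE IF NOT EXISTS default.mytable (id BIGINT, name STRING)
 STORED AS ORC
 LOCATION '/user/nifi/orc/mytable'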
For more details on the above flow, refer to this link.

2. Creating the Avro table using the ExtractAvroMetadata processor:

  • In NiFi, once we pull the data using the QueryDatabaseTable or ExecuteSQL processors, the data is in Avro format.

  • We can create an Avro table based on an Avro schema (.avsc file): using the ExtractAvroMetadata processor we can extract the schema and keep it as a flowfile attribute, then use that schema to create Avro tables dynamically (a sample schema is sketched below).
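A hypothetical example of the schema JSON that ends up in the avro.schema attribute for a two-column table:

 {
   "type": "record",
   "name": "mytable",
   "fields": [
     {"name": "id",   "type": ["null", "long"]},
     {"name": "name", "type": ["null", "string"]}
   ]
 }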

Flow:

 ExecuteSQL (success)|-> PutHDFS //store data into HDFS
            (success)|-> ExtractAvroMetadata //configure Metadata Keys as avro.schema
                      -> ReplaceText //replace flowfile content with the avro.schema attribute
                      -> PutHDFS //store the avsc file into the schema directory
                      -> ReplaceText //create the avro table on top of the schema directory
                      -> PutHiveQL //execute the hive ddl
Sample create table statement:

 CREATE TABLE as_avro
   ROW FORMAT SERDE
   'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
   STORED AS INPUTFORMAT
   'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
   OUTPUTFORMAT
   'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
   TBLPROPERTIES (
     'avro.schema.url'='/path/to/the/schema/test_serializer.avsc');
We are going to use the ReplaceText processor in the above flow to change the path of the schema url.
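A minimal sketch of that ReplaceText configuration; the replacement value is illustrative and assumes the avsc file was just written by the preceding PutHDFS step, so its path can be built from the absolute.hdfs.path and filename attributes:

 Replacement Strategy: Always Replace
 Evaluation Mode: Entire text
 Replacement Value: CREATE TABLE as_avro ... TBLPROPERTIES ('avro.schema.url'='${absolute.hdfs.path}/${filename}');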

Another approach is to use an ExecuteSQL processor to get all the create table statements (or the column information) from sys.tables/INFORMATION_SCHEMA.columns..etc (if the source system permits), write a script that maps the data types to the matching Hive data types, and then store the tables in Hive in the required format.
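A sketch of that idea for MySQL, assuming a placeholder table name my_table and covering only a few common type mappings:

 -- generate Hive column definitions from MySQL metadata
 SELECT CONCAT(column_name, ' ',
               CASE data_type
                 WHEN 'varchar'  THEN 'STRING'
                 WHEN 'datetime' THEN 'TIMESTAMP'
                 WHEN 'int'      THEN 'INT'
                 WHEN 'bigint'   THEN 'BIGINT'
                 ELSE 'STRING' -- fallback; extend as needed
               END) AS hive_column
 FROM INFORMATION_SCHEMA.COLUMNS
 WHERE table_name = 'my_table'
 ORDER BY ordinal_position;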

Edit:

To run a grep command against the flowfile content, we need to use the ExecuteStreamCommand processor.

ESC configs:

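A minimal sketch of the relevant properties, assuming grep is available on the NiFi host; the pattern is a placeholder for whatever matches the column lines you need:

 Command Path: grep
 Command Arguments: -E;col1|col2
 Argument Delimiter: ;
 Ignore STDIN: false //the flowfile content is streamed to grep's stdin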

Then feed the output stream relationship to the ExtractText processor.

ET configs:

Add a new property:

 content
 (?s)(.*)

The content attribute is then added to the flowfile, and you can use that attribute to prepare your create table statement.

Hi. Thanks for the detailed answer. I had tried Avro before and tried again after your answer. The column data types in the schema are not correct: it marks almost everything as string type; even bigint and timestamp columns are marked as string. Regarding the last approach, the schema.sql file mentioned in the question already has the column names with their corresponding Hive data types, so I just want to extract the lines for the columns of interest and then construct the create table query with a location.

@pratpor, in ExecuteSQL try setting the Use Avro Logical Types property value to true..etc, then check the data types in the Avro data file.

@pratpor, check the edit section of my updated answer, which describes how to extract only the required columns from the flowfile content.

Reading about ExecuteStreamCommand. That worked, thanks for the help.

Updated the Use Avro Logical Types property; it doesn't work. It still has timestamp and bigint as string.