Complex transformation and filters with Apache NiFi


I have a JSON array:

[ {
  "account_login" : "some_mail@gmail.com",
  "view_id" : 11313231,
  "join_id" : "utm_campaign=toyota&utm_content=multiformat_sites&utm_medium=cpc&utm_source=mytarget",
  "start_date" : "2020-08-01",
  "end_date" : "2020-08-31"
}, {
  "account_login" : "another_mail@lab.net",
  "view_id" : 19556319183,
  "join_id" : "utm_campaign=mazda&utm_content=keywords_social-networks&utm_medium=cpc&utm_source=facebook",
  "start_date" : "2020-12-22",
  "end_date" : "2020-12-23"
}, {
...
} ]
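For each join_id I need to do the following:

  • Split the string into key-value pairs:
    utm_campaign, toyota; utm_content, multiformat_sites; etc.
  • Filter them (Java code below)
  • Convert the keys to another format, using a table from the database (Java code below)

My main goal is to reproduce the following Java code: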

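    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;
    import java.util.Collections;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class GaUtmFactoryService {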
    
        private static final String INVALID_MACRO_FOOTPRINTS = "^.*[{\\[%]+.+[}\\]%].*$";
    
        public Map<String, String> extractUtmMarks(String utmMarks) {
            if (utmMarks == null || utmMarks.isBlank()) {
                return Collections.emptyMap();
            }
            return Arrays.stream(utmMarks.split("\\s*&\\s*"))
                    .map(s -> s.trim().split("\\s*=\\s*"))
                    .filter(this::isUtmMarksValid)
                    .collect(Collectors.toMap(
                            key -> convertCsUtmMarkToGa(key[0]),
                            value -> value[1],
                            (val1, val2) -> val2)
                    );
        }
    
        
        private boolean isUtmMarksValid(String[] utmMarks) {
            return utmMarks.length == 2
                    && !convertCsUtmMarkToGa(utmMarks[0]).isBlank()
                    && !utmMarks[1].isBlank()
                    && Arrays.stream(utmMarks).noneMatch(this::isUtmMarkContainsInvalidChars);
        }
    
        private boolean isUtmMarkContainsInvalidChars(String utmMark) {
            return utmMark.matches(INVALID_MACRO_FOOTPRINTS)
                    || !StandardCharsets.US_ASCII.newEncoder().canEncode(utmMark);
        }
    
       
        private String convertCsUtmMarkToGa(String utmMark) {
            switch (utmMark) {
                case "utm_medium":
                    return "ga:medium";
                case "utm_campaign":
                    return "ga:campaign";
                case "utm_source":
                    return "ga:source";
                case "utm_content":
                    return "ga:adContent";
                case "utm_term":
                    return "ga:keyword";
                case "utm_target":
                case "utm_a":
                    return "";
                default:
                    return utmMark;
            }
        }
    
    }
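
To make the filter step easier to reason about, here is a minimal, self-contained sketch of the invalid-character check on its own. The class name UtmMarkFilterSketch and the sample values are made up for illustration; the regular expression and the US-ASCII test are copied from the code above. Under these assumptions, a value that looks like an unexpanded macro (for example {campaign_id} or %site%) or that contains non-ASCII characters is rejected, while plain values such as toyota pass.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

final class UtmMarkFilterSketch {
    // Same expression as INVALID_MACRO_FOOTPRINTS in the question, compiled once.
    private static final Pattern INVALID_MACRO_FOOTPRINTS =
            Pattern.compile("^.*[{\\[%]+.+[}\\]%].*$");

    /** True if the text looks like an unexpanded macro or contains non-ASCII characters. */
    static boolean containsInvalidChars(String utmMark) {
        return INVALID_MACRO_FOOTPRINTS.matcher(utmMark).matches()
                || !StandardCharsets.US_ASCII.newEncoder().canEncode(utmMark);
    }

    public static void main(String[] args) {
        // Hypothetical sample values, not taken from real data.
        String[] samples = {"toyota", "{campaign_id}", "%site%", "50%", "тойота"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + containsInvalidChars(sample));
        }
    }
}
```

Note that a single % (as in 50%) is not rejected: the pattern needs an opening character, at least one character in between, and a closing character.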
    
    
Usage from outside:

    public Map<String, String> getConvertedMarks() {
        GaUtmFactoryService gaUtmFactoryService = new GaUtmFactoryService();
        String utmMarks = "utm_campaign=toyota&utm_content=multiformat_sites&utm_medium=cpc&utm_source=facebook";
        Map<String, String> converted = gaUtmFactoryService.extractUtmMarks(utmMarks);
        // should be:
        // {ga:campaign=toyota, ga:adContent=multiformat_sites, ga:medium=cpc, ga:source=facebook}
        return converted;
    }
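
One detail about the expected output: it lists the keys in the order of the input string, but the three-argument Collectors.toMap makes no ordering guarantee (in practice it returns a HashMap). If the order matters, a possible variant is the four-argument overload with LinkedHashMap::new. The sketch below is only an illustration: the class name OrderedUtmMarksSketch and the GA_KEYS lookup map are my own, the key mapping is copied from convertCsUtmMarkToGa, and the macro/ASCII filter is left out to keep it short.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

final class OrderedUtmMarksSketch {
    // Key mapping copied from convertCsUtmMarkToGa; "" means "drop this mark".
    private static final Map<String, String> GA_KEYS = Map.of(
            "utm_medium", "ga:medium",
            "utm_campaign", "ga:campaign",
            "utm_source", "ga:source",
            "utm_content", "ga:adContent",
            "utm_term", "ga:keyword",
            "utm_target", "",
            "utm_a", "");

    static Map<String, String> extractUtmMarks(String utmMarks) {
        if (utmMarks == null || utmMarks.isBlank()) {
            return Map.of();
        }
        return Arrays.stream(utmMarks.split("\\s*&\\s*"))
                .map(s -> s.trim().split("\\s*=\\s*"))
                // Simplified validity check: the macro/ASCII filter is omitted here.
                .filter(kv -> kv.length == 2
                        && !GA_KEYS.getOrDefault(kv[0], kv[0]).isBlank()
                        && !kv[1].isBlank())
                .collect(Collectors.toMap(
                        kv -> GA_KEYS.getOrDefault(kv[0], kv[0]), // unknown keys pass through
                        kv -> kv[1],
                        (first, second) -> second,  // on duplicate keys the later value wins
                        LinkedHashMap::new));       // keeps the order of the input string
    }

    public static void main(String[] args) {
        System.out.println(extractUtmMarks(
                "utm_campaign=toyota&utm_content=multiformat_sites&utm_medium=cpc&utm_source=facebook"));
    }
}
```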
    
    
Is this possible in NiFi? Or, if it is too hard, should I create a REST microservice with a few endpoints for this task?

UPDATE

I used EvaluateJsonPath and SplitJson. Now each JSON flow file has an attribute:

    utm.marks=utm_campaign=toyota&utm_content=multiformat_sites&utm_medium=cpc&utm_source=mytarget

I need to split this attribute and get something like:

    campaign.key=ga:campaign
    campaign.value=toyota
    content.key=ga:content
    content.value=multiformat_sites

and so on.
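
To pin down the attribute layout I am after, here is a rough plain-Java sketch of that flattening, independent of NiFi. Everything named here is hypothetical: the class UtmAttributesSketch, the toAttributes method, and the naming rule (attribute prefix = original key without "utm_"). The key mapping follows the Java switch above, so utm_content becomes ga:adContent, and pairs without a mapping are simply skipped. Attaching the resulting map to the flow file is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class UtmAttributesSketch {
    // Subset of the mapping from convertCsUtmMarkToGa in the question.
    private static final Map<String, String> GA_KEYS = Map.of(
            "utm_medium", "ga:medium",
            "utm_campaign", "ga:campaign",
            "utm_source", "ga:source",
            "utm_content", "ga:adContent",
            "utm_term", "ga:keyword");

    /** Turns "utm_campaign=toyota&..." into campaign.key / campaign.value style entries. */
    static Map<String, String> toAttributes(String utmMarks) {
        Map<String, String> attributes = new LinkedHashMap<>();
        if (utmMarks == null || utmMarks.isBlank()) {
            return attributes;
        }
        for (String pair : utmMarks.split("\\s*&\\s*")) {
            String[] kv = pair.trim().split("\\s*=\\s*");
            String gaKey = kv.length == 2 ? GA_KEYS.get(kv[0]) : null;
            if (gaKey == null || kv[1].isBlank()) {
                continue; // skip malformed pairs and keys without a mapping
            }
            // Hypothetical naming rule: prefix is the original key without "utm_".
            String prefix = kv[0].substring("utm_".length());
            attributes.put(prefix + ".key", gaKey);
            attributes.put(prefix + ".value", kv[1]);
        }
        return attributes;
    }

    public static void main(String[] args) {
        toAttributes("utm_campaign=toyota&utm_content=multiformat_sites&utm_medium=cpc&utm_source=mytarget")
                .forEach((name, value) -> System.out.println(name + "=" + value));
    }
}
```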

For this transformation, an ExecuteGroovyScript script could look like this:

    import groovy.json.*
    //get file from session
    def ff=session.get()
    if(!ff)return
    //read stream, convert to reader, parse to list/objects
    def data=ff.read().withReader("UTF-8"){r-> new JsonSlurper().parse(r) }
    //transform json
    data.each{i->
        i.join_id = i.join_id
            .split("\\s*&\\s*")  //# to array
            .collectEntries{
                //# convert each item to map entry
                String[] kv = it.split("\\s*=\\s*")
                kv[0] = [
                    "utm_medium"   : "ga:medium",
                    "utm_campaign" : "ga:campaign",
                    "utm_source"   : "ga:source",
                    "utm_content"  : "ga:adContent",
                    "utm_term"     : "ga:keyword",
                ].get( kv[0] )
                kv
            }
            .findAll{ k,v-> k } //# filter out empty/null keys
    }
    //write back to file
    ff.write("UTF-8"){w-> new JsonBuilder(data).writeTo(w)}
    //transfer to success
    REL_SUCCESS<<ff

If you have Java code, it will be easier to use ExecuteGroovyScript.

Yes, I had the same idea. But the problem is that I don't know Groovy.

What should the output be in your case?

For example, the string utm_campaign=toyota&utm_content=multiformat_sites&utm_source=facebook should be transformed into this map:
{key:ga:campaign,value:toyota,key:ga:adContent,value:multiformat_sites,key:ga:source,value:facebook}

Your code is incomplete: there is no entry point, there are uninitialized variables, ... Please edit your question and provide the input and the corresponding output. Thanks!

I tried this code, and then...

Solution based on the answer, for a single JSON object (not an array):

    import groovy.json.*
    //get file from session
    def ff=session.get()
    if(!ff)return
    //read stream, convert to reader, parse to list/objects
    
    def data=ff.read().withReader("UTF-8"){r-> new JsonSlurper().parse(r) }
    def builder = new JsonBuilder(data)
    
    builder.content.join_id = builder.content.join_id.split("\\s*&\\s*")  //# to array
            .collectEntries{ 
                    //# convert each item to map entry
                    String[] kv = it.split("\\s*=\\s*")
                    kv[0] = [
                        "utm_medium"   : "ga:medium",
                        "utm_campaign" : "ga:campaign",
                        "utm_source"   : "ga:source",
                        "utm_content"  : "ga:adContent",
                        "utm_term"     : "ga:keyword",
                    ].get( kv[0] )
                    kv
                }
            .findAll{ k,v-> k } //# filter out empty/null keys
    ff.write("UTF-8"){w-> builder.writeTo(w)}
    //transfer to success
    REL_SUCCESS<<ff