SQL Scala Spark: use filter expressions with ACL to create extra columns


Hi everyone,

This is about a profile restriction. I have a source dataset, for example:

+-----------+----------+----------+
|   Col1    |   Col2   |   Col3   |
+-----------+----------+----------+
| ValueA 1  | ValueB 2 | ValueC 3 |
| ValueA 1  | ValueB 3 | ValueC 4 |
+-----------+----------+----------+
I need to get the following dataset:

+-----------+----------+----------+----------+
|   Col1    |   Col2   |   Col3   | Profile1 |
+-----------+----------+----------+----------+
| ValueA 1  | ValueB 2 | ValueC 3 |        1 |
| ValueA 1  | ValueB 3 | ValueC 4 |        0 |
+-----------+----------+----------+----------+
  • 1 - the filter function returned true
  • 0 - the filter function returned false
I know how to do this with joins (filter the source dataset with sql_expr, join the result back with a marker column, and so on), but I have about 100 profiles, and I am not going to do 100 joins. I'm not looking for a ready-made solution, but any advice on how to make this efficient would hit the mark. I think I could build a collection of profile restrictions (profile_id, sql_expression), map over each row to produce a column holding an array of the matching profile_ids, and finally flatMap it, as sketched below.
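A minimal Scala sketch of that idea, assuming each restriction is a valid Spark SQL expression that expr() can evaluate; the profile ids, expressions, and the sourceDf name below are made up:

    import org.apache.spark.sql.functions.{array, array_remove, expr, lit, when}

    // Hypothetical (profile_id, sql_expression) pairs.
    val profiles = Seq(
      "profile1" -> "Col2 = 'ValueB 2'",
      "profile2" -> "Col3 = 'ValueC 4'"
    )

    // One pass over the data: for each profile, emit its id when the
    // expression matches (empty string otherwise), collect the ids into
    // an array column, and drop the empty placeholders.
    val matchedIds = profiles.map { case (id, sqlExpr) =>
      when(expr(sqlExpr), lit(id)).otherwise(lit(""))
    }
    val withProfileIds = sourceDf.withColumn(
      "profile_ids", array_remove(array(matchedIds: _*), ""))

From there, explode(col("profile_ids")) would give the flatMap step.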

Update 1: for now I'm using this solution, but I can't test it because locally it never finishes:

    @Override
    public <V extends SomeData, T extends ObjWithRestr> Dataset<Row> filterByMultipleRestrictionObjs(Dataset<V> source,
                                                                                                     List<T> objsWithRestr,
                                                                                                     Class<V> tClass) {
        Dataset<Row> resultDataset = source.as(Encoders.bean(Row.class));
        for (T objWithRestr : objsWithRestr) {
            Profile profile = (Profile) objWithRestr;
            String client_id = profile.getClient_id();
            ProfileRestrictions profileRestrictions = gsonAdapter
                    .fromJson(new StringReader(objWithRestr.getRestrictions()), ProfileRestrictions.class);

            // Build the profile's filter expression, split the dataset into
            // non-matching (0) and matching (1) halves, then re-union them.
            String combinedFilter = getCombinedFilter(profileRestrictions.getDemoFilter(), profileRestrictions.getMediaFilter());
            Dataset<Row> filteredDataset = resultDataset.filter(combinedFilter);
            Dataset<Row> falseDataset = resultDataset.exceptAll(filteredDataset).withColumn(client_id, lit(0));
            Dataset<Row> trueDataset = resultDataset.intersectAll(filteredDataset).withColumn(client_id, lit(1));
            resultDataset = falseDataset.unionByName(trueDataset);
        }
        return resultDataset;
    }
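A guess at why this never finishes: each iteration recomputes the whole dataset through exceptAll/intersectAll and re-unions it, so the plan snowballs across profiles. A minimal single-pass sketch in Scala of the same 0/1 marking, assuming each combined filter string is a valid Spark SQL expression (the method and parameter names here are mine, not from the code above):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{expr, lit, when}

    // filters: client_id -> combined filter expression.
    // Adds one 0/1 column per profile without splitting the dataset.
    def markProfiles(source: DataFrame, filters: Map[String, String]): DataFrame =
      filters.foldLeft(source) { case (df, (clientId, combinedFilter)) =>
        df.withColumn(clientId, when(expr(combinedFilter), lit(1)).otherwise(lit(0)))
      }

With ~100 profiles it may still be worth building all the when(...) columns up front and adding them in one select, rather than chaining withColumn calls, to keep the plan shallow.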

@Sangam.gavine that's a cool solution. But my filter sql_expression values are full expressions (not just ValueB = 1), which is why I can't create a temp table. I have found a solution where I can create a MapFunction, build a single-row dataset, apply the filter, and write the record with 1 if the dataset size is 1 and with 0 if the size is 0. But I don't like it because it has a lot of overhead: I have to create many datasets.
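For reference, the per-row variant described above could look roughly like this (driver-side Scala; the names are mine). It makes the overhead visible: one single-row Dataset, one filter, and one count per source row and profile:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Returns 1/0 per collected row: build a one-row Dataset, apply the
    // profile's filter expression, and check whether the row survived.
    def markRowByRow(spark: SparkSession, source: DataFrame, sqlExpr: String): Seq[Int] =
      source.collect().toSeq.map { row =>
        val single = spark.createDataFrame(
          spark.sparkContext.parallelize(Seq(row)), source.schema)
        if (single.filter(sqlExpr).count() == 1) 1 else 0
      }

Each count() triggers a separate Spark job, which is exactly the overhead mentioned above.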
With the help of the approach below, I believe you can solve the issue.

Your filter condition values:
filter_col1|filter_col2
valueA 3|ValueB 2
valueA 4|ValueB 3
valueA 5|ValueB 4
valueA 6|ValueB 5

//read them and convert them into a dataframe - filter_cond_df
//create a temp table on top of filter_cond_df
filter_cond_df.createOrReplaceTempView("filter_temp")
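For example, the read step could look like this (the path is made up; any pipe-delimited source with a header row works):

    val filter_cond_df = spark.read
      .option("header", "true")
      .option("delimiter", "|")            // the filter file uses | as separator
      .csv("/tmp/filter_conditions.csv")   // hypothetical location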

Your input data:
+-----------+----------+----------+
|   Col1    |   Col2   |   Col3   |
+-----------+----------+----------+
| ValueA 1  | ValueB 2 | ValueC 3 |
| ValueA 1  | ValueB 3 | ValueC 4 |
+-----------+----------+----------+

//consider this as input_df and create a temp table on top of it
input_df.createOrReplaceTempView("input_temp")
//get only the rows matching your filter condition
val matching_df = spark.sql("""select * from input_temp where col1 in (select filter_col1 from filter_temp) or col2 in (select filter_col2 from filter_temp)""")

//get the remaining (not matched) rows from your input
val notmatching_df = input_df.except(matching_df)

//add the profile column with value 1 to matching_df
val result1 = matching_df.withColumn("profile", lit(1))
//add the profile column with value 0 to notmatching_df
val result2 = notmatching_df.withColumn("profile", lit(0))

val final_result = result1.union(result2)

I hope this helps!