SQL Scala Spark: use filter expressions with ACL to create extra columns


Hi everyone,

This is about a profile restriction. I have a source dataset, for example:

+-----------+----------+----------+
|   Col1    |   Col2   |   Col3   |
+-----------+----------+----------+
| ValueA 1  | ValueB 2 | ValueC 3 |
| ValueA 1  | ValueB 3 | ValueC 4 |
+-----------+----------+----------+
I need to get the following dataset:

+-----------+----------+----------+----------+
|   Col1    |   Col2   |   Col3   | Profile1 |
+-----------+----------+----------+----------+
| ValueA 1  | ValueB 2 | ValueC 3 |        1 |
| ValueA 1  | ValueB 3 | ValueC 4 |        0 |
+-----------+----------+----------+----------+
  • 1 - the filter function returned true
  • 0 - the filter function returned false
I know how to do this with joins (filter the source dataset with sql_expr, join the result back with a marker column, and so on), but I have about 100 profiles, and I am not going to do 100 joins. I'm not looking for a ready-made solution, but any advice on how to make this efficient would hit the mark. I think I could build a collection of profile restrictions (profile_id, sql_expression), map over each row to produce a column holding an array of the matching profile_ids, and finally flatMap it, as sketched below.
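A minimal Scala sketch of that idea, assuming each restriction is a valid Spark SQL expression that expr() can evaluate; the profile ids, expressions, and the sourceDf name below are made up:

    import org.apache.spark.sql.functions.{array, array_remove, expr, lit, when}

    // Hypothetical (profile_id, sql_expression) pairs.
    val profiles = Seq(
      "profile1" -> "Col2 = 'ValueB 2'",
      "profile2" -> "Col3 = 'ValueC 4'"
    )

    // One pass over the data: for each profile, emit its id when the
    // expression matches (empty string otherwise), collect the ids into
    // an array column, and drop the empty placeholders.
    val matchedIds = profiles.map { case (id, sqlExpr) =>
      when(expr(sqlExpr), lit(id)).otherwise(lit(""))
    }
    val withProfileIds = sourceDf.withColumn(
      "profile_ids", array_remove(array(matchedIds: _*), ""))

From there, explode(col("profile_ids")) would give the flatMap step.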

Update 1: for now I'm using this solution, but I can't test it because locally it never finishes:

    @Override
    public <V extends SomeData, T extends ObjWithRestr> Dataset<Row> filterByMultipleRestrictionObjs(Dataset<V> source,
                                                                                                     List<T> objsWithRestr,
                                                                                                     Class<V> tClass) {
        Dataset<Row> resultDataset = source.as(Encoders.bean(Row.class));
        for (T objWithRestr : objsWithRestr) {
            Profile profile = (Profile) objWithRestr;
            String client_id = profile.getClient_id();
            ProfileRestrictions profileRestrictions = gsonAdapter
                    .fromJson(new StringReader(objWithRestr.getRestrictions()), ProfileRestrictions.class);

            // Build the profile's filter expression, split the dataset into
            // non-matching (0) and matching (1) halves, then re-union them.
            String combinedFilter = getCombinedFilter(profileRestrictions.getDemoFilter(), profileRestrictions.getMediaFilter());
            Dataset<Row> filteredDataset = resultDataset.filter(combinedFilter);
            Dataset<Row> falseDataset = resultDataset.exceptAll(filteredDataset).withColumn(client_id, lit(0));
            Dataset<Row> trueDataset = resultDataset.intersectAll(filteredDataset).withColumn(client_id, lit(1));
            resultDataset = falseDataset.unionByName(trueDataset);
        }
        return resultDataset;
    }
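A guess at why this never finishes: each iteration recomputes the whole dataset through exceptAll/intersectAll and re-unions it, so the plan snowballs across profiles. A minimal single-pass sketch in Scala of the same 0/1 marking, assuming each combined filter string is a valid Spark SQL expression (the method and parameter names here are mine, not from the code above):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{expr, lit, when}

    // filters: client_id -> combined filter expression.
    // Adds one 0/1 column per profile without splitting the dataset.
    def markProfiles(source: DataFrame, filters: Map[String, String]): DataFrame =
      filters.foldLeft(source) { case (df, (clientId, combinedFilter)) =>
        df.withColumn(clientId, when(expr(combinedFilter), lit(1)).otherwise(lit(0)))
      }

With ~100 profiles it may still be worth building all the when(...) columns up front and adding them in one select, rather than chaining withColumn calls, to keep the plan shallow.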

@Sangam.gavine that's a cool solution. But my filter sql_expression values are full expressions (not just ValueB = 1), which is why I can't create a temp table. I have found a solution where I can create a MapFunction, build a single-row dataset, apply the filter, and write the record with 1 if the dataset size is 1 and with 0 if the size is 0. But I don't like it because it has a lot of overhead: I have to create many datasets.
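For reference, the per-row variant described above could look roughly like this (driver-side Scala; the names are mine). It makes the overhead visible: one single-row Dataset, one filter, and one count per source row and profile:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Returns 1/0 per collected row: build a one-row Dataset, apply the
    // profile's filter expression, and check whether the row survived.
    def markRowByRow(spark: SparkSession, source: DataFrame, sqlExpr: String): Seq[Int] =
      source.collect().toSeq.map { row =>
        val single = spark.createDataFrame(
          spark.sparkContext.parallelize(Seq(row)), source.schema)
        if (single.filter(sqlExpr).count() == 1) 1 else 0
      }

Each count() triggers a separate Spark job, which is exactly the overhead mentioned above.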
With the help of the approach below, I believe you can solve the issue.

Your filter condition values:
filter_col1|filter_col2
valueA 3|ValueB 2
valueA 4|ValueB 3
valueA 5|ValueB 4
valueA 6|ValueB 5

//read them and convert them into a dataframe - filter_cond_df
//create a temp table on top of filter_cond_df
filter_cond_df.createOrReplaceTempView("filter_temp")
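For example, the read step could look like this (the path is made up; any pipe-delimited source with a header row works):

    val filter_cond_df = spark.read
      .option("header", "true")
      .option("delimiter", "|")            // the filter file uses | as separator
      .csv("/tmp/filter_conditions.csv")   // hypothetical location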

Your input data:
+-----------+----------+----------+
|   Col1    |   Col2   |   Col3   |
+-----------+----------+----------+
| ValueA 1  | ValueB 2 | ValueC 3 |
| ValueA 1  | ValueB 3 | ValueC 4 |
+-----------+----------+----------+

//consider this as input_df and create a temp table on top of it
input_df.createOrReplaceTempView("input_temp")
//get only the rows matching your filter condition
val matching_df = spark.sql("""select * from input_temp where col1 in (select filter_col1 from filter_temp) or col2 in (select filter_col2 from filter_temp)""")

//get the remaining (not matched) rows from your input
val notmatching_df = input_df.except(matching_df)

//add the profile column with value 1 to matching_df
val result1 = matching_df.withColumn("profile", lit(1))
//add the profile column with value 0 to notmatching_df
val result2 = notmatching_df.withColumn("profile", lit(0))

val final_result = result1.union(result2)

I hope this helps!