Sql Scala Spark: create an extra column with an ACL using filter expressions
Hi everyone. This is a profile (ACL) restriction. I have a source dataset, for example:
+-----------+----------+----------+
| Col1 | Col2 | Col3 |
+-----------+----------+----------+
| ValueA 1 | ValueB 2 | ValueC 3 |
| ValueA 1 | ValueB 3 | ValueC 4 |
+-----------+----------+----------+
I need to get the following dataset:
+-----------+----------+----------+----------+
| Col1 | Col2 | Col3 | Profile1 |
+-----------+----------+----------+----------+
| ValueA 1 | ValueB 2 | ValueC 3 | 1 |
| ValueA 1 | ValueB 3 | ValueC 4 | 0 |
+-----------+----------+----------+----------+
- 1 means the filter function returned true for that row
- 0 means the filter function returned false for that row
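Stripped of Spark, the target transformation is just "tag each row with 1 when a predicate accepts it, else 0". A minimal plain-Java model of that expected output (the row-as-map representation and all names here are illustrative, not from the post):

```java
import java.util.*;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Minimal model of the desired output: tag each row with "1" when the
// profile predicate accepts it, "0" otherwise. A row is a simple column map.
public class ProfileFlag {
    public static List<Map<String, String>> addFlag(
            List<Map<String, String>> rows,
            String flagName,
            Predicate<Map<String, String>> pred) {
        return rows.stream().map(r -> {
            Map<String, String> out = new LinkedHashMap<>(r);
            out.put(flagName, pred.test(r) ? "1" : "0");
            return out;
        }).collect(Collectors.toList());
    }
}
```

This only pins down the expected semantics; in Spark the tagging would of course run distributed, per partition.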
@Override
public <V extends SomeData, T extends ObjWithRestr> Dataset<Row> filterByMultipleRestrictionObjs(Dataset<V> source,
                                                                                                 List<T> objsWithRestr,
                                                                                                 Class<V> tClass) {
    Dataset<Row> resultDataset = source.as(Encoders.bean(Row.class));
    for (T objWithRestr : objsWithRestr) {
        Profile profile = (Profile) objWithRestr;
        String client_id = profile.getClient_id();
        ProfileRestrictions profileRestrictions = gsonAdapter
                .fromJson(new StringReader(objWithRestr.getRestrictions()), ProfileRestrictions.class);
        String combinedFilter = getCombinedFilter(profileRestrictions.getDemoFilter(), profileRestrictions.getMediaFilter());
        // Split the dataset into matching (flag 1) and non-matching (flag 0) halves, then re-union them
        Dataset<Row> filteredDataset = resultDataset.filter(combinedFilter);
        Dataset<Row> falseDataset = resultDataset.exceptAll(filteredDataset).withColumn(client_id, lit(0));
        Dataset<Row> trueDataset = resultDataset.intersectAll(filteredDataset).withColumn(client_id, lit(1));
        resultDataset = falseDataset.unionByName(trueDataset);
    }
    return resultDataset;
}
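The loop above makes two extra passes over the data per profile (`exceptAll` and `intersectAll`) and then re-unions the halves. Assuming `combinedFilter` is a valid Spark SQL boolean expression, the same flag could likely be computed in a single pass with `resultDataset.withColumn(client_id, expr(combinedFilter).cast("int"))`. A plain-Java sketch of that one-pass idea for several profiles at once (class and names hypothetical, predicates standing in for the parsed restriction filters):

```java
import java.util.*;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// One pass over the rows tags every profile column at once, instead of
// splitting the dataset with exceptAll/intersectAll and re-unioning it.
public class MultiProfileTagger {
    public static List<Map<String, String>> tagAll(
            List<Map<String, String>> rows,
            Map<String, Predicate<Map<String, String>>> profiles) {
        return rows.stream().map(r -> {
            Map<String, String> out = new LinkedHashMap<>(r);
            // Each profile contributes one flag column keyed by its client id
            profiles.forEach((clientId, pred) ->
                    out.put(clientId, pred.test(r) ? "1" : "0"));
            return out;
        }).collect(Collectors.toList());
    }
}
```

A single pass also preserves the original row order, which the exceptAll/union round trip does not guarantee.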
@Sangam.gavine That's a cool solution, but my filter SQL expressions contain values like (not ValueB = 1), which is why I can't create a temp table. I have found a solution where I can create a MapFunction, build a single-row dataset, filter it, and write the record with 1 if the filtered dataset has size 1, or with 0 if it has size 0. But I don't like it because of the overhead: I have to create many datasets.
# I believe the approach below can solve the issue
Your filter-condition values:
filter_col1|filter_col2
valueA 3|ValueB 2
valueA 4|ValueB 3
valueA 5|ValueB 4
valueA 6|ValueB 5
//read them and convert them into a dataframe - filter_cond_df
//Create temp table on top of filter_cond_df
filter_cond_df.createOrReplaceTempView("filter_temp")
Your input data:
+-----------+----------+----------+
| Col1 | Col2 | Col3 |
+-----------+----------+----------+
| ValueA 1 | ValueB 2 | ValueC 3 |
| ValueA 1 | ValueB 3 | ValueC 4 |
+-----------+----------+----------+
//consider this as input_df, create a temp table on top it
input_df.createOrReplaceTempView("input_temp")
//to get only the matching for your filter condition
val matching_df = spark.sql("""select * from input_temp where col1 in (select filter_col1 from filter_temp) or col2 in (select filter_col2 from filter_temp)""")
//get the remaining or not matched from your input
val notmatching_df = input_df.except(matching_df)
//adding profile column with value 1 to matching_df
val result1 = matching_df.withColumn("profile", lit(1))
//adding profile column with value 0 to notmatching_df
val result2 = notmatching_df.withColumn("profile",lit(0))
val final_result = result1.union(result2)
I hope this helps!
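For reference, here is a plain-Java sketch of the answer's matching/not-matching split, using set membership on Col1 or Col2 in place of the two IN subqueries (data taken from the example above; class and method names are hypothetical):

```java
import java.util.*;

// Sketch of the answer's flow: a row matches when its Col1 value appears in
// the first filter column OR its Col2 value appears in the second; matching
// rows get profile flag 1, the rest get 0.
public class FilterSplit {
    public static Map<List<String>, Integer> flag(
            List<List<String>> input, Set<String> filterCol1, Set<String> filterCol2) {
        Map<List<String>, Integer> result = new LinkedHashMap<>();
        for (List<String> row : input) {
            boolean match = filterCol1.contains(row.get(0)) || filterCol2.contains(row.get(1));
            result.put(row, match ? 1 : 0);
        }
        return result;
    }
}
```

One caveat on the Scala answer itself: Spark's `except` is EXCEPT DISTINCT and drops duplicate rows, so if the input can contain duplicates, `exceptAll` (as in the question's own code) preserves their multiplicity.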