Scala Spark: join condition with an array (nullable)
I have two DataFrames that I want to join, and I want to filter out the rows whose OrgTypeToExclude matches for each TransactionId. In a word, TransactionId is the join condition and OrgTypeToExclude is the exclude condition. Sharing a simple example here:
import org.apache.spark.sql.functions.expr
import spark.implicits._
val jsonstr ="""{
  "id": "3b4219f8-0579-4933-ba5e-c0fc532eeb2a",
  "Transactions": [
    {
      "TransactionId": "USAL",
      "OrgTypeToExclude": ["A","B"]
    },
    {
      "TransactionId": "USMD",
      "OrgTypeToExclude": ["E"]
    },
    {
      "TransactionId": "USGA",
      "OrgTypeToExclude": []
    }
  ]
}"""
val df = Seq((1, "USAL","A"),(4, "USAL","C"), (2, "USMD","B"),(5, "USMD","E"), (3, "USGA","C")).toDF("id", "code","Alp")
val json = spark.read.json(Seq(jsonstr).toDS).select("Transactions.TransactionId","Transactions.OrgTypeToExclude")
df.printSchema()
json.printSchema()
df.join(json, $"code" <=> $"TransactionId".cast("string") && !expr("array_contains(OrgTypeToExclude, Alp)"), "inner").show()
-- Expected output
id  code  Alp
4   USAL  C
2   USMD  B
3   USGA  C
Thanks,
Manoj.
Transactions is an array type, and you are accessing TransactionId and OrgTypeToExclude through it, so you get arrays of values for each. Instead, first explode the root-level Transactions array, then extract the TransactionId and OrgTypeToExclude struct values; the next steps are straightforward.
Please check the code below.
scala> val jsonstr ="""{
|
| "id": "3b4219f8-0579-4933-ba5e-c0fc532eeb2a",
| "Transactions": [
| {
| "TransactionId": "USAL",
| "OrgTypeToExclude": ["A","B"]
| },
| {
| "TransactionId": "USMD",
| "OrgTypeToExclude": ["E"]
| },
| {
| "TransactionId": "USGA",
| "OrgTypeToExclude": []
| }
| ]
| }"""
jsonstr: String =
{
"id": "3b4219f8-0579-4933-ba5e-c0fc532eeb2a",
"Transactions": [
{
"TransactionId": "USAL",
"OrgTypeToExclude": ["A","B"]
},
{
"TransactionId": "USMD",
"OrgTypeToExclude": ["E"]
},
{
"TransactionId": "USGA",
"OrgTypeToExclude": []
}
]
}
scala> val df = Seq((1, "USAL","A"),(4, "USAL","C"), (2, "USMD","B"),(5, "USMD","E"), (3, "USGA","C")).toDF("id", "code","Alp")
df: org.apache.spark.sql.DataFrame = [id: int, code: string ... 1 more field]
scala> val json = spark.read.json(Seq(jsonstr).toDS).select(explode($"Transactions").as("Transactions")).select($"Transactions.*")
json: org.apache.spark.sql.DataFrame = [OrgTypeToExclude: array<string>, TransactionId: string]
scala> df.show(false)
+---+----+---+
|id |code|Alp|
+---+----+---+
|1 |USAL|A |
|4 |USAL|C |
|2 |USMD|B |
|5 |USMD|E |
|3 |USGA|C |
+---+----+---+
scala> json.show(false)
+----------------+-------------+
|OrgTypeToExclude|TransactionId|
+----------------+-------------+
|[A, B] |USAL |
|[E] |USMD |
|[] |USGA |
+----------------+-------------+
scala> df.join(json,(df("code") === json("TransactionId") && !array_contains(json("OrgTypeToExclude"),df("Alp"))),"inner").select("id","code","alp").show(false)
+---+----+---+
|id |code|alp|
+---+----+---+
|4 |USAL|C |
|2 |USMD|B |
|3 |USGA|C |
+---+----+---+
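A side note on the "nullable" part of the question: the asker's attempt used the null-safe operator `<=>`, while the join above uses `===`. If `code` or `TransactionId` can be null, `===` evaluates to null for a null-to-null comparison and the row is dropped, whereas `<=>` treats two nulls as equal. A sketch of the null-safe variant, assuming the same `df` and `json` DataFrames built above:

```scala
import org.apache.spark.sql.functions.array_contains

// Null-safe variant: <=> matches a null join key to a null TransactionId,
// which a plain === comparison would silently drop.
df.join(json,
    df("code") <=> json("TransactionId") &&
      !array_contains(json("OrgTypeToExclude"), df("Alp")),
    "inner")
  .select("id", "code", "Alp")
  .show(false)
```

With the sample data (no nulls) this produces the same three rows as the `===` join.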
First, you seem to have missed the fact that Transactions is itself an array, which we can handle with explode:
val json = spark.read.json(Seq(jsonstr).toDS)
.select(explode($"Transactions").as("t")) // deal with Transactions array first
.select($"t.TransactionId", $"t.OrgTypeToExclude")
Also, the second argument of array_contains is a value rather than a column. I'm not aware of a version that supports referencing a column there, so we'll create a udf:
val arr_con = udf { (a: Seq[String], v: String) => a.contains(v) }
Then we can modify the join condition as follows:
df.join(json, $"code" <=> $"TransactionId" && ! arr_con($"OrgTypeToExclude", $"Alp"), "inner").show()
The expected result is:
scala> df.join(json, $"code" <=> $"TransactionId" && ! arr_con($"OrgTypeToExclude", $"Alp"), "inner").show()
+---+----+---+-------------+----------------+
| id|code|Alp|TransactionId|OrgTypeToExclude|
+---+----+---+-------------+----------------+
| 4|USAL| C| USAL| [A, B]|
| 2|USMD| B| USMD| [E]|
| 3|USGA| C| USGA| []|
+---+----+---+-------------+----------------+
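A udf is not strictly required here: the SQL expression form of array_contains does accept column references, since the SQL parser resolves both arguments against the joined relation. That is essentially what the question attempted; it only failed because Transactions had not been exploded yet. A sketch, assuming the exploded `json` DataFrame from above:

```scala
import org.apache.spark.sql.functions.expr

// Same exclude condition without a udf: array_contains in a SQL
// expression string can reference the Alp column directly.
df.join(json,
    $"code" <=> $"TransactionId" && !expr("array_contains(OrgTypeToExclude, Alp)"),
    "inner")
  .show()
```

Keeping the SQL string form also lets Spark optimize the predicate, whereas a udf is a black box to the optimizer.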
Could you also add the expected output?