
Scala Spark: join condition with an array (nullable)

Tags: scala, apache-spark, join, apache-spark-sql, azure-databricks

I have 2 dataframes that I want to join, and then I want to filter the data: for each TransactionId, I want to filter out the rows whose Alp value matches that transaction's OrgTypeToExclude.

In a word, my TransactionId is the join condition and OrgTypeToExclude is the exclude condition. Sharing a simple example here:

import org.apache.spark.sql.functions.expr
import spark.implicits._
val jsonstr ="""{

  "id": "3b4219f8-0579-4933-ba5e-c0fc532eeb2a",
  "Transactions": [
    {
      "TransactionId": "USAL",
      "OrgTypeToExclude": ["A","B"]
    },
    {
      "TransactionId": "USMD",
      "OrgTypeToExclude": ["E"]
    },
    {
      "TransactionId": "USGA",
      "OrgTypeToExclude": []
    }
    ]   
}"""
val df = Seq((1, "USAL","A"),(4, "USAL","C"), (2, "USMD","B"),(5, "USMD","E"), (3, "USGA","C")).toDF("id", "code","Alp")
val json = spark.read.json(Seq(jsonstr).toDS).select("Transactions.TransactionId","Transactions.OrgTypeToExclude")

df.printSchema()
json.printSchema()
df.join(json,$"code"<=> $"TransactionId".cast("string") && !exp("array_contains(OrgTypeToExclude, Alp)") ,"inner" ).show()

--Expected output
id   code    Alp
4    "USAL"  "C"
2    "USMD"  "B"
3    "USGA"  "C"
Thanks,
Manoj.

Transactions is an array type, and you are accessing TransactionId and OrgTypeToExclude through it, so you are getting multiple arrays.

Instead, simply explode the root-level Transactions array and extract the TransactionId and OrgTypeToExclude struct values; the next steps will be simple.

Please check the code below.


scala> val jsonstr ="""{
     |
     |   "id": "3b4219f8-0579-4933-ba5e-c0fc532eeb2a",
     |   "Transactions": [
     |     {
     |       "TransactionId": "USAL",
     |       "OrgTypeToExclude": ["A","B"]
     |     },
     |     {
     |       "TransactionId": "USMD",
     |       "OrgTypeToExclude": ["E"]
     |     },
     |     {
     |       "TransactionId": "USGA",
     |       "OrgTypeToExclude": []
     |     }
     |     ]
     | }"""
jsonstr: String =
{

  "id": "3b4219f8-0579-4933-ba5e-c0fc532eeb2a",
  "Transactions": [
    {
      "TransactionId": "USAL",
      "OrgTypeToExclude": ["A","B"]
    },
    {
      "TransactionId": "USMD",
      "OrgTypeToExclude": ["E"]
    },
    {
      "TransactionId": "USGA",
      "OrgTypeToExclude": []
    }
    ]
}

scala> val df = Seq((1, "USAL","A"),(4, "USAL","C"), (2, "USMD","B"),(5, "USMD","E"), (3, "USGA","C")).toDF("id", "code","Alp")
df: org.apache.spark.sql.DataFrame = [id: int, code: string ... 1 more field]

scala> val json = spark.read.json(Seq(jsonstr).toDS).select(explode($"Transactions").as("Transactions")).select($"Transactions.*")
json: org.apache.spark.sql.DataFrame = [OrgTypeToExclude: array<string>, TransactionId: string]

scala> df.show(false)
+---+----+---+
|id |code|Alp|
+---+----+---+
|1  |USAL|A  |
|4  |USAL|C  |
|2  |USMD|B  |
|5  |USMD|E  |
|3  |USGA|C  |
+---+----+---+


scala> json.show(false)
+----------------+-------------+
|OrgTypeToExclude|TransactionId|
+----------------+-------------+
|[A, B]          |USAL         |
|[E]             |USMD         |
|[]              |USGA         |
+----------------+-------------+


scala> df.join(json,(df("code") === json("TransactionId") && !array_contains(json("OrgTypeToExclude"),df("Alp"))),"inner").select("id","code","alp").show(false)
+---+----+---+
|id |code|alp|
+---+----+---+
|4  |USAL|C  |
|2  |USMD|B  |
|3  |USGA|C  |
+---+----+---+


scala>
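A side note on the "(nullable)" part of the title: === drops rows whose join key is null on either side. If code can be null and such rows should still match a null TransactionId, a minimal sketch (under that assumption, not required for the sample data shown) swaps in Spark's null-safe <=> operator, as in the original attempt:

// Null-safe variant: <=> treats null <=> null as true, unlike ===.
// Assumption: rows with a null code should match a null TransactionId.
df.join(json,
    df("code") <=> json("TransactionId") &&
      !array_contains(json("OrgTypeToExclude"), df("Alp")),
    "inner")
  .select("id", "code", "Alp")
  .show(false)

Note that if OrgTypeToExclude itself can be null, array_contains returns null there and the row is dropped; wrapping the test in coalesce(..., lit(false)) would keep such rows.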



First, you seem to have overlooked the fact that Transactions is also an array, which we can handle with explode:

val json = spark.read.json(Seq(jsonstr).toDS)
  .select(explode($"Transactions").as("t")) // deal with Transactions array first
  .select($"t.TransactionId", $"t.OrgTypeToExclude")
Also, the second argument of array_contains is a value rather than a column. I'm not aware of a version that supports referencing a column there, so we'll create a udf:

import org.apache.spark.sql.functions.udf
val arr_con = udf { (a: Seq[String], v: String) => a.contains(v) }
Then we can modify the join condition as follows:

df.join(json, $"code" <=> $"TransactionId" && !arr_con($"OrgTypeToExclude", $"Alp"), "inner").show()
The expected result is:

scala> df.join(json, $"code" <=> $"TransactionId" && ! arr_con($"OrgTypeToExclude", $"Alp"), "inner").show()
+---+----+---+-------------+----------------+
| id|code|Alp|TransactionId|OrgTypeToExclude|
+---+----+---+-------------+----------------+
|  4|USAL|  C|         USAL|          [A, B]|
|  2|USMD|  B|         USMD|             [E]|
|  3|USGA|  C|         USGA|              []|
+---+----+---+-------------+----------------+
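As an aside, the first answer's join passes a column as the second argument of array_contains and produces the expected output, and the SQL-expression form from the original attempt resolves Alp as a column reference as well, so on such Spark versions the udf can likely be avoided. A sketch of that variant, reusing the exploded json dataframe from above:

import org.apache.spark.sql.functions.expr
// Assumption: the running Spark version resolves Alp as a column reference
// inside the SQL expression (the first answer's output suggests it does).
df.join(json,
    $"code" <=> $"TransactionId" &&
      !expr("array_contains(OrgTypeToExclude, Alp)"),
    "inner")
  .show()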

Could you also add the expected output?