Hashing a JSON column in Spark


I have a Cassandra table whose last column, named "fullJson", holds JSON log entries. I need to hash the userID value inside each JSON row with MD5. Below is my approach, but for some reason I always get stuck at some point. The loaded Cassandra table:

scala> val rawCass = sc.cassandraTable[cassFormat]("keyspace", "logs").repartition(200)
rawCass: org.apache.spark.rdd.RDD[cassFormat] = MapPartitionsRDD[73] at coalesce at CassandraTableScanRDD.scala:256
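For reference, a minimal sketch of what the load looks like with the DataStax spark-cassandra-connector; the `cassFormat` case class is not shown in the question, so its fields here are pure assumptions:

```scala
import com.datastax.spark.connector._

// Hypothetical case class mirroring the Cassandra schema; the real
// field list is not given in the question, only that the last
// column is the JSON string "fullJson".
case class cassFormat(id: String, timestamp: String, fullJson: String)

// Requires spark-cassandra-connector on the classpath.
val rawCass = sc.cassandraTable[cassFormat]("keyspace", "logs").repartition(200)
```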
From it I get:

scala> val cassDF2 = spark.createDataFrame(rawCass).select("fullJson")
cassDF2: org.apache.spark.sql.DataFrame = [fullJson: string]

scala> cassDF2.printSchema
root
 |-- fullJson: string (nullable = true)
My JSON consists of a "header" and a "body". I figured the best approach would be to build a DataFrame, select the userID column, and hash it with MD5:

scala> val nestedJson = spark.read.json(cassDF2.select("fullJson").rdd.map(_.getString(0))).select("header","body")
nestedJson: org.apache.spark.sql.DataFrame = [header: struct<KPI: string, action: string ... 16 more fields>, body: struct<1MYield: double, 1YYield: double ... 147 more fields>]

scala> nestedJson.printSchema
root
 |-- header: struct (nullable = true)
 |    |-- KPI: string (nullable = true)
 |    |-- action: string (nullable = true)
 |    |-- appID: string (nullable = true)
 |    |-- appVersion: string (nullable = true)
 |    |-- context: string (nullable = true)
 |    |-- eventID: string (nullable = true)
 |    |-- interestArea: string (nullable = true)
 |    |-- location: struct (nullable = true)
 |    |    |-- lat: string (nullable = true)
 |    |    |-- lon: string (nullable = true)
 |    |-- navigationGroup: string (nullable = true)
 |    |-- sessionID: string (nullable = true)
 |    |-- timestamp: string (nullable = true)
 |    |-- userAge: string (nullable = true)
 |    |-- userAgent: struct (nullable = true)
 |    |    |-- browser: string (nullable = true)
 |    |    |-- browserVersion: string (nullable = true)
 |    |    |-- deviceName: string (nullable = true)
 |    |    |-- deviceResolution: string (nullable = true)
 |    |    |-- deviceType: string (nullable = true)
 |    |    |-- deviceVendor: string (nullable = true)
 |    |    |-- os: string (nullable = true)
 |    |    |-- osVersion: string (nullable = true)
 |    |-- userID: string (nullable = true)
 |    |-- userSegment: string (nullable = true)
 |-- body: struct (nullable = true)
 |    |-- OS: string (nullable = true)
 |    |-- active: boolean (nullable = true)
 |    |-- amount: double (nullable = true)
 |    |-- amountCritical: string (nullable = true)
 |    |-- beneficiary: struct (nullable = true)
 |    |    |-- beneficiaryAccounts: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- beneficiaryAccountBank: string (nullable = true)
...
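One way to hash the nested field in place, without flattening anything, is Spark's built-in `md5` function combined with `Column.withField` (available since Spark 3.1); a sketch, assuming `nestedJson` as built above:

```scala
import org.apache.spark.sql.functions.{col, md5}

// Replace header.userID with its MD5 hex digest while leaving the
// rest of the header struct untouched (requires Spark 3.1+).
val hashedDF = nestedJson.withColumn(
  "header",
  col("header").withField("userID", md5(col("header.userID"))))
```

On older Spark versions there is no `withField`, so the `header` struct would have to be rebuilt field by field with `struct(...)`.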
I wanted to save this as a CSV file, but that fails because some of the columns are structs:

newDF.write.format("com.databricks.spark.csv").option("header", "true").option("delimiter", "|").save("cass_full.csv")
I tried to avoid the struct type, but couldn't because of the additional nesting (e.g. location contains lat and lon).

Basic question
What is the simplest and preferred approach here? Should I just replace the userID value in every JSON row, or can I do something different at the DataFrame level? The reason for all this is that I have a CSV file from another database that must be hashed with the same algorithm and then joined.
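For the join to work, the CSV side has to produce byte-identical digests; Spark's `md5` returns the lowercase hex digest of the UTF-8 bytes, which can be reproduced outside Spark with plain JVM code. A sketch:

```scala
import java.security.MessageDigest

// MD5 hex digest in the same format Spark's md5() emits:
// lowercase hexadecimal of the UTF-8 bytes of the input.
def md5Hex(s: String): String =
  MessageDigest.getInstance("MD5")
    .digest(s.getBytes("UTF-8"))
    .map("%02x".format(_))
    .mkString

// e.g. the RFC 1321 test vector:
// md5Hex("abc") == "900150983cd24fb0d6963f7d28e17f72"
```

Hashing the userID column of the other CSV with this function keeps both sides join-compatible.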

Try saving it as parquet, then continue with the second part of your join logic.

Hope this helps
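The suggestion above can be sketched as follows; unlike CSV, parquet preserves the nested struct schema, so nothing needs to be flattened before the join (the path is a placeholder):

```scala
// Parquet keeps the full nested schema, so struct columns such as
// header and body survive the round trip unchanged.
nestedJson.write.mode("overwrite").parquet("cass_full.parquet")

// Read it back later and continue with the join logic.
val reloaded = spark.read.parquet("cass_full.parquet")
```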


Never used parquet before, but I can give it a try:
scala> val tempT = newDF.select($"header.*",$"body.*")
tempT: org.apache.spark.sql.DataFrame = [KPI: string, action: string ... 165 more fields]

scala> tempT.printSchema
root
 |-- KPI: string (nullable = true)
 |-- action: string (nullable = true)
 |-- appID: string (nullable = true)
 |-- appVersion: string (nullable = true)
 |-- context: string (nullable = true)
 |-- eventID: string (nullable = true)
 |-- interestArea: string (nullable = true)
 |-- location: struct (nullable = true)
 |    |-- lat: string (nullable = true)
 |    |-- lon: string (nullable = true)
 |-- navigationGroup: string (nullable = true)
...
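Even after `select($"header.*", $"body.*")`, columns like location and userAgent are still structs, so the CSV writer will keep rejecting them. One hedged workaround, assuming `tempT` as above, is to serialize every remaining struct column back to a JSON string before writing:

```scala
import org.apache.spark.sql.functions.{col, to_json}
import org.apache.spark.sql.types.StructType

// Convert each remaining struct column to its JSON string form,
// leaving scalar columns untouched, so every cell is CSV-representable.
val csvReady = tempT.schema.fields.foldLeft(tempT) { (df, f) =>
  f.dataType match {
    case _: StructType => df.withColumn(f.name, to_json(col(f.name)))
    case _             => df
  }
}

csvReady.write
  .option("header", "true")
  .option("delimiter", "|")
  .csv("cass_full_flat.csv")
```

Note that in Spark 2.x and later the built-in `.write.csv(...)` replaces the external `com.databricks.spark.csv` package used above.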