Scala: How to access an external DataFrame inside an RDD map function?
I have two DataFrames.
countryDF
+-------+-------------------+--------+---------+
| id | CountryName |Latitude|Longitude|
+-------+-------------------+--------+---------+
| 1 | United States | 39.76 | -98.5 |
| 2 | China | 35 | 105 |
| 3 | India | 20 | 77 |
| 4 | Brazil | -10 | -55 |
...
+-------+-------------------+--------+---------+
salesDF
+-------+-------------------+--------+---------+--------+
| id | Country |Latitude|Longitude|revenue |
+-------+-------------------+--------+---------+--------+
| 1 | Japan | | | 11 |
| 2 | China | | | 12 |
| 3 | Brazil | | | 56 |
| 4 | Scotland | | | 12 |
...
+-------+-------------------+--------+---------+--------+
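The task is to fill in the latitude and longitude for salesDF. Each value in the salesDF column Country is looked up in the countryDF column CountryName. If a matching row is found, the corresponding latitude and longitude are appended to that sales row.
The output DataFrame is: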
+-------+-------------------+--------+---------+---------+
| id | CountryName |Latitude|Longitude|revenue |
+-------+-------------------+--------+---------+---------+
| 1 | Japan | 35.6 | 139 | 11 |
| 2 | China | 35 | 105 | 12 |
| 3 | Brazil | -10 | -55 | 56 |
| 4 | Scotland | 55.95 | -3.18 | 12 |
...
+-------+-------------------+--------+---------+---------+
I wrote a map function to do this, but the map function does not seem to be able to access the outer DataFrame variable. Is there a solution?
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col

val countryDF = spark.read
  .option("inferSchema", "true")
  .option("header", "true")
  .csv("Country.csv")
var revenueDF = spark.read
  .option("inferSchema", "true")
  .option("header", "true")
  .csv("revenue.csv")
var resultRdd = revenueDF.rdd.map(row => {
  // Copies the row and overwrites the latitude and longitude cells
  val generateRow = (row: Row, latitude: Any, longitude: Any, latitudeIndex: Int, longitudeIndex: Int) => {
    val arr = row.toSeq.toArray
    arr(latitudeIndex) = latitude
    arr(longitudeIndex) = longitude
    Row.fromSeq(arr)
  }
  val countryName = row.getAs[String](1)
  // cannot access countryDF here, it is corrupted
  val countryRow = countryDF.where(col("CountryName") === countryName)
  generateRow(row, row.getAs[String](2), row.getAs[String](3), 2, 3)
})
revenueDF.sqlContext.createDataFrame(resultRdd, revenueDF.schema).show()
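The operation you are looking for is a join. You cannot use DataFrames, RDDs, or other distributed objects inside map, a udf, or equivalent functions.
import spark.implicits._ // enables the $"..." column syntax

// revenue is selected as well so that it appears in the output
salesDF.select("id", "Country", "revenue").join(
  countryDF.select("CountryName", "Latitude", "Longitude"),
  $"CountryName" === $"Country",
  "left"
).drop("Country")
Why does the record for Brazil in the output DataFrame have 20, 77 rather than -10, -55 as latitude and longitude?
Sorry, that was a typo.
I found that the join performs poorly. Do you have any other suggestions to improve performance?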
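On the performance question: if countryDF is small enough to fit in memory on every executor (an assumption, the thread does not state its size), two common options are a broadcast join hint, or collecting the lookup table into a plain Map, broadcasting that, and reading it inside a udf. The sketch below shows both. The object name CountryLookupSketch and its two methods are hypothetical names made up for illustration; only the column names come from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{broadcast, col, udf}

// Hypothetical helper object; the names are illustrative and not part of the original answer.
object CountryLookupSketch {

  // Variant 1: the same left join, with a broadcast hint on the small lookup table,
  // so Spark can send it to every executor instead of shuffling the sales data.
  def withBroadcastJoin(salesDF: DataFrame, countryDF: DataFrame): DataFrame = {
    val lookup = countryDF.select("CountryName", "Latitude", "Longitude")
    salesDF
      .select("id", "Country", "revenue")
      .join(broadcast(lookup), col("Country") === col("CountryName"), "left")
      .drop("CountryName") // unmatched countries keep their name, with null coordinates
  }

  // Variant 2: collect the lookup table to the driver, broadcast it as a plain Map,
  // and read it inside a udf. A plain Map, unlike a DataFrame, can be used on executors.
  // Assumes non-null country names and numeric, non-null coordinates in countryDF.
  def withBroadcastMap(spark: SparkSession, salesDF: DataFrame, countryDF: DataFrame): DataFrame = {
    import spark.implicits._
    val lookup: Map[String, (Double, Double)] = countryDF
      .select(col("CountryName"), col("Latitude").cast("double"), col("Longitude").cast("double"))
      .as[(String, Double, Double)]
      .collect()
      .map { case (name, lat, lon) => name -> ((lat, lon)) }
      .toMap
    val lookupB = spark.sparkContext.broadcast(lookup)

    // A udf returning Option yields a nullable column: None is stored as null.
    val latitudeOf = udf((country: String) => lookupB.value.get(country).map(_._1))
    val longitudeOf = udf((country: String) => lookupB.value.get(country).map(_._2))

    salesDF
      .select("id", "Country", "revenue")
      .withColumn("Latitude", latitudeOf(col("Country")))
      .withColumn("Longitude", longitudeOf(col("Country")))
  }
}
```

For example, CountryLookupSketch.withBroadcastJoin(salesDF, countryDF).show() should print the sales rows with coordinates filled in where a country matches. Spark may already pick a broadcast join by itself when the lookup table is below spark.sql.autoBroadcastJoinThreshold, so the first variant mainly helps when that automatic choice is not made. Neither variant is a good fit if countryDF is large.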