Scala: how to access an external DataFrame inside an RDD map function?


I have two DataFrames:

countryDF

+-------+-------------------+--------+---------+
|   id  |    CountryName    |Latitude|Longitude|
+-------+-------------------+--------+---------+
|  1    | United States     |  39.76 |   -98.5 |
|  2    | China             |  35    |   105   |
|  3    | India             |  20    |   77    |
|  4    | Brazil            |  -10   |   -55   |
...
+-------+-------------------+--------+---------+
salesDF

+-------+-------------------+--------+---------+--------+
|   id  |    Country        |Latitude|Longitude|revenue |
+-------+-------------------+--------+---------+--------+
|  1    | Japan             |        |         |   11   |
|  2    | China             |        |         |   12   |
|  3    | Brazil            |        |         |   56   |
|  4    | Scotland          |        |         |   12   |
...
+-------+-------------------+--------+---------+--------+
The task is to fill in Latitude and Longitude in salesDF: for each row, look up its Country value in countryDF's CountryName column, and if a matching row is found, copy that row's Latitude and Longitude over.

The output DataFrame should be:

+-------+-------------------+--------+---------+---------+
|   id  |    CountryName    |Latitude|Longitude|revenue  |
+-------+-------------------+--------+---------+---------+
|  1    | Japan             |  35.6  |   139   | 11      |
|  2    | China             |  35    |   105   | 12      |
|  3    | Brazil            |  -10   |   -55   | 56      |
|  4    | Scotland          |  55.95 |  -3.18  | 12      |
...
+-------+-------------------+--------+---------+---------+
I wrote a map function to do this, but the map function does not seem to be able to access the outer DataFrame variable. Is there a workaround?

val countryDF = spark.read
  .option("inferSchema", "true")
  .option("header", "true")
  .csv("Country.csv")

val revenueDF = spark.read
  .option("inferSchema", "true")
  .option("header", "true")
  .csv("revenue.csv")

val resultRdd = revenueDF.rdd.map(row => {
  val generateRow = (row: Row, latitude: Any, longitude: Any, latitudeIndex: Int, longitudeIndex: Int) => {
    val arr = row.toSeq.toArray
    arr(latitudeIndex) = latitude
    arr(longitudeIndex) = longitude
    Row.fromSeq(arr)
  }
  val countryName = row.getAs[String](1)
  // cannot access countryDF here, it is corrupted
  val countryRow = countryDF.where(col("CountryName") === countryName)
  generateRow(row, row.getAs[String](2), row.getAs[String](3), 2, 3)
})
revenueDF.sqlContext.createDataFrame(resultRdd, revenueDF.schema).show()

No, you cannot use DataFrames, RDDs, or other distributed objects inside map, a udf, or the equivalent: they exist only on the driver, which is why countryDF appears "corrupted" inside the task. The operation you are looking for is a join.
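If you really do need a per-row lookup inside an RDD map, the standard workaround is to collect the small DataFrame to the driver and broadcast it as a plain Scala Map. A minimal sketch, assuming a local SparkSession and toy stand-ins for the two CSVs:

```scala
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("lookup").getOrCreate()
import spark.implicits._

// Hypothetical toy data standing in for Country.csv and revenue.csv.
val countryDF = Seq((1, "China", 35.0, 105.0), (2, "Brazil", -10.0, -55.0))
  .toDF("id", "CountryName", "Latitude", "Longitude")
val salesDF = Seq((1, "China", 12), (2, "Brazil", 56))
  .toDF("id", "Country", "revenue")

// countryDF is small: collect it to the driver, then broadcast a plain Map.
// The broadcast value is an ordinary Scala object, so map() can use it freely.
val lookup = spark.sparkContext.broadcast(
  countryDF.collect()
    .map(r => r.getAs[String]("CountryName") ->
      (r.getAs[Double]("Latitude"), r.getAs[Double]("Longitude")))
    .toMap
)

val resultRdd = salesDF.rdd.map { row =>
  // Look up the country in the broadcast Map; NaN if not found.
  val (lat, lon) = lookup.value
    .getOrElse(row.getAs[String]("Country"), (Double.NaN, Double.NaN))
  Row(row.getAs[Int]("id"), row.getAs[String]("Country"), lat, lon, row.getAs[Int]("revenue"))
}
```

This keeps your map-based approach, but it only works because the lookup table fits comfortably in driver and executor memory.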

Comment: Why does the Brazil record in the output DataFrame have 20, 77 instead of -10, -55 as latitude/longitude? Reply: Sorry, that was a typo.
Comment: I found the join's performance to be poor. Do you have any other suggestions for improving it?
salesDF.select("id", "Country").join(
  countryDF.select("CountryName", "Latitude", "Longitude")
  $"CountryName" === $"Country",
  "left"
).drop("Country")