Merging multiple DataFrames produced inside a foreach into a single DataFrame - Scala Spark
I have two CSV files, as shown below.
a.csv
ID,Name,Age,Subject
1,Arun,23,English
2,Melan,22,IT
b.csv
ID,Name,Department_ID,Age,Subject
3,Kumar,004,21,Science
4,Sagar,008,20,IT
As you can see, the two files have different structures. I only want the ID and Subject columns, so I put the file paths in a list and did the following:
val cols = List("ID", "Subject")
val file_path = List("path to a.csv", "path to b.csv")
file_path.foreach(path => {
  val df =
    spark
      .read
      .option("header", "true")
      .option("delimiter", ",")
      .csv(path)
      .select(cols.head, cols.tail: _*)
  df.show()
  df.count()
})
First DataFrame:
+---+-------+
| ID|Subject|
+---+-------+
|  1|English|
|  2|     IT|
+---+-------+
Second DataFrame:
+---+-------+
| ID|Subject|
+---+-------+
|  3|Science|
|  4|     IT|
+---+-------+
But I need a single DataFrame obtained by merging these two, as shown below:
+---+-------+
| ID|Subject|
+---+-------+
|  1|English|
|  2|     IT|
|  3|Science|
|  4|     IT|
+---+-------+
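Is there a way to do this? I don't want to write the two DataFrames to files and read them back as one DataFrame.
Thanks.

Use map and reduce instead of foreach to achieve this: foreach discards each DataFrame, whereas map returns them as a list that reduce can combine with union.
Please see below.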
scala> val dfr = spark.read.format("csv").option("header","true")
dfr: org.apache.spark.sql.DataFrameReader = org.apache.spark.sql.DataFrameReader@cd6ccda
scala> val paths = List("/tmp/data/da.csv","/tmp/data/db.csv")
paths: List[String] = List(/tmp/data/da.csv, /tmp/data/db.csv)
scala> val columns = List("id","subject").map(c => col(c))
columns: List[org.apache.spark.sql.Column] = List(id, subject)
scala> spark.time { paths.map(path => dfr.load(path).select(columns:_*)).reduce(_ union _).show(false) }
+---+-------+
|id |subject|
+---+-------+
|1 |English|
|2 |IT |
|3 |Science|
|4 |IT |
+---+-------+
Time taken: 247 ms
scala>
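For use outside the shell, the same map-and-reduce idea can be wrapped in a small standalone program. This is only a minimal sketch: the object name UnionCsvFiles, the helper loadAndUnion, the local[*] master and the app name are illustrative assumptions, not part of the original answer. The file paths are the ones used in the session above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical wrapper around the map + reduce(union) approach shown above.
object UnionCsvFiles {

  // Reads every CSV on its own (so the columns are resolved against that
  // file's own header), keeps only the requested columns, then unions all
  // per-file DataFrames into one.
  def loadAndUnion(spark: SparkSession, paths: Seq[String], cols: Seq[String]): DataFrame = {
    // reduce throws on an empty collection, so fail early with a clear message.
    require(paths.nonEmpty, "at least one path is required")
    paths
      .map(path => spark.read.option("header", "true").csv(path).select(cols.map(c => col(c)): _*))
      .reduce(_ union _)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this illustration.
    val spark = SparkSession.builder().appName("union-csv-files").master("local[*]").getOrCreate()
    val merged = loadAndUnion(spark, List("/tmp/data/da.csv", "/tmp/data/db.csv"), List("id", "subject"))
    merged.show(false)
    spark.stop()
  }
}
```

union matches columns by position, which is safe here because every per-file select lists the columns in the same order. On Spark 2.3 or later, unionByName could be used instead if the order might differ.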
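Edit
Because the two files have different schemas, loading all of them in a single call produces wrong results: columns get mapped by position from one file's header, so values from the other file end up null or in the wrong column. Check the following: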
scala> val da = spark.read.option("header","true").csv("/tmp/data/da.csv")
da: org.apache.spark.sql.DataFrame = [id: string, name: string ... 2 more fields]
scala> da.show(false)
+---+-----+---+-------+
|id |name |age|subject|
+---+-----+---+-------+
|1 |Arun |23 |English|
|2 |Melan|22 |IT |
+---+-----+---+-------+
scala> val db = spark.read.option("header","true").csv("/tmp/data/db.csv")
db: org.apache.spark.sql.DataFrame = [id: string, name: string ... 3 more fields]
scala> db.show(false)
+---+-----+-------------+---+-------+
|id |name |department_id|age|subject|
+---+-----+-------------+---+-------+
|3 |Kumar|004 |21 |Science|
|4 |Sagar|008 |20 |IT |
+---+-----+-------------+---+-------+
scala> val paths = List("/tmp/data/da.csv","/tmp/data/db.csv")
paths: List[String] = List(/tmp/data/da.csv, /tmp/data/db.csv)
scala> val columns = List("id","subject").map(c => col(c))
columns: List[org.apache.spark.sql.Column] = List(id, subject)
scala> spark.read.option("header", "true" ).option("delimiter", "," ).csv(paths: _* ).select(columns:_*).show(false)
20/04/29 18:35:07 WARN CSVDataSource: CSV header does not conform to the schema.
Header: id,
Schema: id, subject
Expected: subject but found:
CSV file: file:///tmp/data/da.csv
+---+-------+
|id |subject|
+---+-------+
|3 |Science|
|4 |IT |
|1 |null |
|2 |null |
+---+-------+
scala> spark.read.option("header", "true" ).option("delimiter", "," ).csv(paths: _* ).select("id","name").show(false) // columns common to both files in the same position: id, name
+---+-----+
|id |name |
+---+-----+
|3 |Kumar|
|4 |Sagar|
|1 |Arun |
|2 |Melan|
+---+-----+
scala> spark.read.option("header", "true" ).option("delimiter", "," ).csv(paths: _* ).select("id","name","age").show(false) // file 1 has id,name,age; file 2 has id,name,department_id,age, so in file 2 age comes after department_id
20/04/29 18:43:53 WARN CSVDataSource: CSV header does not conform to the schema.
Header: id, name, subject
Schema: id, name, age
Expected: age but found: subject
CSV file: file:///tmp/data/da.csv
+---+-----+-------+
|id |name |age |
+---+-----+-------+
|3 |Kumar|21 |
|4 |Sagar|20 |
|1 |Arun |English|
|2 |Melan|IT |
+---+-----+-------+
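One way to guard against the mismatch shown above is to compare the headers before deciding whether a single multi-path read is safe. The sketch below is an illustration only: the object HeaderCheck and its methods headers and sameHeader are hypothetical names, not part of Spark or of the original answer. It relies only on spark.read ... csv(path) and DataFrame.columns.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper: inspects each file's header individually.
object HeaderCheck {

  // Column names of each CSV file, read one file at a time.
  def headers(spark: SparkSession, paths: Seq[String]): Seq[Seq[String]] =
    paths.map(p => spark.read.option("header", "true").csv(p).columns.toSeq)

  // True only when every file has exactly the same columns in the same order,
  // which is the case where csv(paths: _*) maps every value to the right column.
  def sameHeader(spark: SparkSession, paths: Seq[String]): Boolean =
    headers(spark, paths).distinct.size <= 1
}
```

For da.csv and db.csv this check returns false, which signals that the files should be read one by one and combined with union, as in the map and reduce version.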
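Spark DataFrames have built-in support for loading from multiple files at once. I think that, rather than loading the files separately and then combining them, you could load them all in a single call, as shown below: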
object LoadJoinDataframe {
  def main(args: Array[String]): Unit = {
    val cols = List("ID", "Subject")
    val file_path = List("path to a.csv", "path to b.csv")
    // Constant.getSparkSess is the answerer's own helper (not shown here);
    // it is presumably expected to return a SparkSession.
    val spark = Constant.getSparkSess
    val df = spark
      .read
      .option("header", "true")
      .option("delimiter", ",")
      .csv(file_path: _*) // pass all paths to one read call
      .select(cols.head, cols.tail: _*)
    df.show()
    df.count()
  }
}