Python Spark: read key-value pairs from a file into a DataFrame

I need to read a log file and convert it into a Spark DataFrame. Input file content:
dateCreated : 20200720
customerId : 001
dateCreated : 20200720
customerId : 002
dateCreated : 20200721
customerId : 003
Expected dataframe:

---------------------------
|dateCreated | customerId |
---------------------------
|20200720    | 001        |
|20200720    | 002        |
|20200721    | 003        |
---------------------------
Spark code:

val spark = org.apache.spark.sql.SparkSession.builder.master("local").getOrCreate
val inputFile = "C:\\log_data.txt"
val rddFromFile = spark.sparkContext.textFile(inputFile)
// Split each "key : value" line and trim the surrounding whitespace
val rdd = rddFromFile.map(f => f.split(":").map(_.trim))
rdd.foreach(f => println(f(0) + "\t" + f(1)))
Any ideas on how to convert the above RDD into the required DF?

Please check the code below:
scala> "cat /tmp/sample/input.csv".!
dateCreated : 20200720
customerId : 001
dateCreated : 20200720
customerId : 002
dateCreated : 20200721
customerId : 003
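Since each record spans two consecutive lines (a `dateCreated` line followed by a `customerId` line), one approach is to fold the parsed key-value pairs into records, starting a new record whenever a key repeats. The core pairing logic is independent of Spark; a minimal sketch in plain Python (the function name `lines_to_records` is illustrative, not from the original post):

```python
def lines_to_records(lines):
    """Fold 'key : value' lines into records, starting a new record
    whenever a key repeats (here: on every 'dateCreated' line)."""
    records = []
    current = {}
    for line in lines:
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key in current:  # key seen again -> previous record is complete
            records.append(current)
            current = {}
        current[key] = value
    if current:
        records.append(current)
    return records

lines = [
    "dateCreated : 20200720",
    "customerId : 001",
    "dateCreated : 20200720",
    "customerId : 002",
    "dateCreated : 20200721",
    "customerId : 003",
]
print(lines_to_records(lines))
```

In PySpark this logic could be applied per partition with `mapPartitions`, converting the resulting dicts to rows for `spark.createDataFrame`; note that records straddling partition boundaries would need extra handling.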
Is there any way to do this other than with window functions? I expect window functions to be an expensive operation, and the input file is large.
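One window-free alternative: since every record occupies exactly two consecutive lines, you can index each line and group pairs by `index // 2`. In Spark the index would come from `rdd.zipWithIndex()` followed by a key-based aggregation; the grouping idea itself can be sketched in plain Python (function and variable names below are illustrative):

```python
from collections import defaultdict

def group_pairs(lines):
    """Group 'key : value' lines into two-line records by integer-dividing
    each line's index by 2 (mirrors rdd.zipWithIndex() + aggregate on i // 2)."""
    buckets = defaultdict(dict)
    for i, line in enumerate(lines):
        key, _, value = line.partition(":")
        buckets[i // 2][key.strip()] = value.strip()
    # Return the records in their original line order
    return [buckets[k] for k in sorted(buckets)]

lines = [
    "dateCreated : 20200720",
    "customerId : 001",
    "dateCreated : 20200721",
    "customerId : 003",
]
print(group_pairs(lines))
```

This avoids a window entirely: the per-record key is derived arithmetically from the line position, so the shuffle is a plain aggregation rather than an ordered window scan. It does assume the two-lines-per-record layout holds throughout the file.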