Python Spark: read key-value pairs from a file into a DataFrame

I need to read a log file and convert it into a Spark DataFrame. Input file content:
dateCreated : 20200720
customerId : 001
dateCreated : 20200720
customerId : 002
dateCreated : 20200721
customerId : 003
Expected dataframe:

---------------------------
|dateCreated | customerId |
---------------------------
|20200720    | 001        |
|20200720    | 002        |
|20200721    | 003        |
---------------------------
Spark code:

val spark = org.apache.spark.sql.SparkSession.builder.master("local").getOrCreate
val inputFile = "C:\\log_data.txt"
val rddFromFile = spark.sparkContext.textFile(inputFile)
// Split each "key : value" line and trim the surrounding whitespace
val rdd = rddFromFile.map(f => f.split(":").map(_.trim))
rdd.foreach(f => println(f(0) + "\t" + f(1)))
Any ideas on how to convert the above RDD into the required DF?

Please check the code below:
scala> "cat /tmp/sample/input.csv".!
dateCreated : 20200720
customerId : 001
dateCreated : 20200720
customerId : 002
dateCreated : 20200721
customerId : 003
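Since each record spans two consecutive lines (a `dateCreated` line followed by a `customerId` line), one approach is to fold the parsed key-value pairs into records, starting a new record whenever a key repeats. The core pairing logic is independent of Spark; a minimal sketch in plain Python (the function name `lines_to_records` is illustrative, not from the original post):

```python
def lines_to_records(lines):
    """Fold 'key : value' lines into records, starting a new record
    whenever a key repeats (here: on every 'dateCreated' line)."""
    records = []
    current = {}
    for line in lines:
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key in current:  # key seen again -> previous record is complete
            records.append(current)
            current = {}
        current[key] = value
    if current:
        records.append(current)
    return records

lines = [
    "dateCreated : 20200720",
    "customerId : 001",
    "dateCreated : 20200720",
    "customerId : 002",
    "dateCreated : 20200721",
    "customerId : 003",
]
print(lines_to_records(lines))
```

In PySpark this logic could be applied per partition with `mapPartitions`, converting the resulting dicts to rows for `spark.createDataFrame`; note that records straddling partition boundaries would need extra handling.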
Is there any way to do this other than with window functions? I expect window functions to be an expensive operation, and the input file is large.
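One window-free alternative: since every record occupies exactly two consecutive lines, you can index each line and group pairs by `index // 2`. In Spark the index would come from `rdd.zipWithIndex()` followed by a key-based aggregation; the grouping idea itself can be sketched in plain Python (function and variable names below are illustrative):

```python
from collections import defaultdict

def group_pairs(lines):
    """Group 'key : value' lines into two-line records by integer-dividing
    each line's index by 2 (mirrors rdd.zipWithIndex() + aggregate on i // 2)."""
    buckets = defaultdict(dict)
    for i, line in enumerate(lines):
        key, _, value = line.partition(":")
        buckets[i // 2][key.strip()] = value.strip()
    # Return the records in their original line order
    return [buckets[k] for k in sorted(buckets)]

lines = [
    "dateCreated : 20200720",
    "customerId : 001",
    "dateCreated : 20200721",
    "customerId : 003",
]
print(group_pairs(lines))
```

This avoids a window entirely: the per-record key is derived arithmetically from the line position, so the shuffle is a plain aggregation rather than an ordered window scan. It does assume the two-lines-per-record layout holds throughout the file.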