
Scala Spark: reading a CSV with comment headers


I need to read the following file using Spark in Scala:

#Version: 1.0
#Fields: date time location timezone
2018-02-02  07:27:42 US LA
2018-02-02  07:27:42 UK LN
I am currently trying to extract the fields using:

spark.read.csv(filepath)
I am new to Spark and Scala, and I'd like to know whether there is a better way to extract the fields based on the #Fields line at the top of the file.

You should read the text file with sparkContext's textFile API and then filter out the comment lines.

It should look something like this.
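A minimal sketch, assuming `sc` is your SparkContext and "sample.csv" is an illustrative path:

import org.apache.spark.rdd.RDD

// read the raw lines of the file
val rdd: RDD[String] = sc.textFile("sample.csv")

// the data rows are the lines that do not start with the "#" comment marker
val dataLines: RDD[String] = rdd.filter(line => !line.startsWith("#"))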

Now, if you want to create a DataFrame, you should parse the #Fields line to form the schema, then filter the data lines to form Rows, and finally use the SQLContext to create the DataFrame.
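A minimal sketch of those three steps, assuming the rdd and dataLines values from the snippet above, space-separated fields, and an existing sqlContext:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// parse the "#Fields:" comment into column names, dropping the "#Fields:" token itself
val fieldNames: Array[String] =
  rdd.filter(_.startsWith("#Fields:")).map(_.split(" ").tail).first()

// every column is read as a plain string
val schema: StructType =
  StructType(fieldNames.map(name => StructField(name, StringType, nullable = true)))

// turn each data line into a Row and build the data frame
val rows = dataLines.map(line => Row.fromSeq(line.split(" ").toSeq))
val df = sqlContext.createDataFrame(rows, schema)
df.show(false)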

This should give you:

+----------+--------+--------+--------+
|date      |time    |location|timezone|
+----------+--------+--------+--------+
|2018-02-02|07:27:42|US      |LA      |
|2018-02-02|07:27:42|UK      |LN      |
+----------+--------+--------+--------+
Note: if your file is tab-separated, then instead of line.split(" ") you should use line.split("\t").
Sample input file "sample.csv":

#Version: 1.0
#Fields: date time location timezone
2018-02-02 07:27:42 US LA
2018-02-02 07:27:42 UK LN

Scala test:
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession.Builder
import org.apache.spark.sql._

import scala.util.Try

object Test extends App {

  // create spark session and sql context
  val builder: Builder = SparkSession.builder.appName("testAvroSpark")
  val sparkSession: SparkSession = builder.master("local[1]").getOrCreate()
  val sc: SparkContext = sparkSession.sparkContext
  val sqlContext: SQLContext = sparkSession.sqlContext

  case class CsvRow(date: String, time: String, location: String, timezone: String)

  // path of your csv file
  val path: String =
    "sample.csv"

  // read the csv file and skip the first two (comment) lines

  val csvString: Seq[String] =
    sc.textFile(path).toLocalIterator.drop(2).toSeq

  // parse only valid rows; Try silently drops malformed lines
  val csvRdd: RDD[(String, String, String, String)] =
    sc.parallelize(csvString).flatMap(r =>
      Try {
        val row: Array[String] = r.split(" ")
        CsvRow(row(0), row(1), row(2), row(3))
      }.toOption)
      .map(csvRow => (csvRow.date, csvRow.time, csvRow.location, csvRow.timezone))

  import sqlContext.implicits._

  // make data frame
  val df: DataFrame =
    csvRdd.toDF("date", "time", "location", "timezone")

  // display the data frame
  df.show()
}
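Note that this test collects the whole file to the driver with toLocalIterator and then re-parallelizes it, which is fine for a small sample file; for larger inputs, filtering the comment lines directly on the RDD (as in the first snippet above) avoids pulling the data through the driver.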