How to optimize schema inference for a remote CSV file with Spark
I have a remote file in S3 (or elsewhere) and I need its schema. I have not found an option to sample the data for schema inference the way JSON allows, e.g. `read.option("samplingRatio", 0.25)`.

Is there a way to optimize reading of the schema?

Spark reads the entire CSV file over the network before returning the inferred schema. For large files this can take quite a while. `.option("samplingRatio", samplingRatio)` does not work for CSV.
/**
 * Infers the schema of a remote CSV file by reading only a sample of the file and inferring on that.
 * Spark's schema inference by default reads the entire dataset once!
 * For large remote files this is not desired (e.g. inferring the schema of a 3GB file across oceans takes a while).
 * The speedup is achieved by reading only the first `schemaSampleSize` rows.
 *
 * @param fileLocation     path of the remote CSV file
 * @param schemaSampleSize number of rows to be taken into consideration for inferring the schema
 * @param headerOption     whether the first line is a header
 * @param delimiterOption  the field delimiter
 * @return the inferred schema
 */
import org.apache.spark.sql.{DataFrameReader, Dataset, SparkSession}
import org.apache.spark.sql.execution.datasources.csv.{CSVOptions, TextInputCSVDataSource}
import org.apache.spark.sql.types.StructType

def inferSchemaFromSample(sparkSession: SparkSession, fileLocation: String, schemaSampleSize: Int, headerOption: Boolean, delimiterOption: String): StructType = {
  val dataFrameReader: DataFrameReader = sparkSession.read
  // Read only the first `schemaSampleSize` lines as plain text, avoiding a full scan of the remote file
  val dataSample: Array[String] = dataFrameReader.textFile(fileLocation).head(schemaSampleSize)
  val firstLine = dataSample.head
  import sparkSession.implicits._
  val ds: Dataset[String] = sparkSession.createDataset(dataSample)
  val extraOptions = new scala.collection.mutable.HashMap[String, String]
  extraOptions += ("inferSchema" -> "true")
  extraOptions += ("header" -> headerOption.toString)
  extraOptions += ("delimiter" -> delimiterOption)
  val csvOptions: CSVOptions = new CSVOptions(extraOptions.toMap, sparkSession.sessionState.conf.sessionLocalTimeZone)
  // Run Spark's CSV schema inference on the in-memory sample instead of the full file
  val schema: StructType = TextInputCSVDataSource.inferFromDataset(sparkSession, ds, Some(firstLine), csvOptions)
  schema
}
For example: schemaSampleSize = 10000, delimiterOption = ','
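A minimal usage sketch of the helper above (the S3 path and app name are placeholders, not from the original answer): infer the schema from a sample once, then pass it explicitly to the reader so Spark skips its own full-file inference pass.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.StructType

val spark: SparkSession = SparkSession.builder()
  .appName("sampled-schema-inference")
  .getOrCreate()

// Hypothetical remote file location -- substitute your own bucket/path.
val fileLocation = "s3a://my-bucket/large-file.csv"

// Infer the schema from the first 10000 rows only.
val schema: StructType = inferSchemaFromSample(
  spark, fileLocation,
  schemaSampleSize = 10000,
  headerOption = true,
  delimiterOption = ",")

// With an explicit schema, Spark does not scan the file again to infer one.
val df: DataFrame = spark.read
  .schema(schema)
  .option("header", "true")
  .option("delimiter", ",")
  .csv(fileLocation)
```

Note that `CSVOptions` and `TextInputCSVDataSource` live in `org.apache.spark.sql.execution.datasources.csv`, an internal Spark package, so this approach is tied to the Spark version you compile against.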