Scala: converting a CSV file to a DataFrame in Spark 1.5.2 without the Databricks library
I am trying to convert a CSV file to a DataFrame in Spark 1.5.2 using Scala, without using the Databricks spark-csv library, because this is a community project and that library is not available there. My approach is as follows:
var inputPath = "input.csv"
var text = sc.textFile(inputPath)
var rows = text.map(line => line.split(",").map(_.trim))
var header = rows.first()
var data = rows.filter(_(0) != header(0))
var df = sc.makeRDD(1 to data.count().toInt)
  .map(i => (data.take(i).drop(i-1)(0)(0),
             data.take(i).drop(i-1)(0)(1),
             data.take(i).drop(i-1)(0)(2),
             data.take(i).drop(i-1)(0)(3),
             data.take(i).drop(i-1)(0)(4)))
  .toDF(header(0), header(1), header(2), header(3), header(4))
Although this code is quite messy, it does not return any error. The problem appears when I try to display the data in df to verify that this approach is correct, and then to run some queries against df. The error I get after running df.show() is SPARK-5063. My questions are:

1) Why is it not possible to print the contents of df?

2) Is there a more direct way to convert a CSV into a DataFrame in Spark 1.5.2, without using the Databricks library?
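For context on question 1: SPARK-5063 is raised because data.take(i) is called from inside a transformation on another RDD, and Spark does not allow RDD operations to be nested inside transformations. On question 2, the split-and-trim step itself can be checked locally without any SparkContext. A minimal pure-Scala sketch (parseLine is a name of my own choosing), which also shows why a bare split(",") is fragile:

```scala
// Naive CSV field splitter, the same per-line logic the question uses:
// adequate only when fields never contain embedded commas or quotes.
def parseLine(line: String): Array[String] =
  line.split(",").map(_.trim)

val lines = Seq(
  "id,name,surname",
  "1, John , Doe",
  "2, Jane , Roe"
)

val header = parseLine(lines.head)      // Array(id, name, surname)
val data   = lines.tail.map(parseLine)  // the data rows, header excluded

// Limitation: a quoted field containing a comma is split in two.
val broken = parseLine("3, John, \"Doe, Jr.\"")  // 4 elements, not 3
```

This is why a real CSV parser (or the spark-csv package) is preferable whenever fields can contain commas.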
You can create it like this:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

SparkSession spark = SparkSession
    .builder()
    .appName("RDDtoDF_Updated")
    .master("local[2]")
    .config("spark.some.config.option", "some-value")
    .getOrCreate();

StructType schema = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("eid", DataTypes.IntegerType, false),
    DataTypes.createStructField("eName", DataTypes.StringType, false),
    DataTypes.createStructField("eAge", DataTypes.IntegerType, true),
    DataTypes.createStructField("eDept", DataTypes.IntegerType, true),
    DataTypes.createStructField("eSal", DataTypes.IntegerType, true),
    DataTypes.createStructField("eGen", DataTypes.StringType, true)});

String filepath = "F:/Hadoop/Data/EMPData.txt";
JavaRDD<Row> empRDD = spark.read()
    .textFile(filepath)
    .javaRDD()
    .map(line -> line.split("\\,"))
    .map(r -> RowFactory.create(Integer.parseInt(r[0]), r[1].trim(), Integer.parseInt(r[2]),
        Integer.parseInt(r[3]), Integer.parseInt(r[4]), r[5].trim()));

Dataset<Row> empDF = spark.createDataFrame(empRDD, schema);
empDF.groupBy("eDept").max("eSal").show();
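One caveat with the mapping above: Integer.parseInt throws a NumberFormatException on a blank or malformed field, which fails the whole job at the first bad row. A hedged sketch of a more tolerant per-field parse, in Scala (the safeInt and parseEmployee helpers are my own, not part of the answer, and the (id, name, age) layout is illustrative rather than the answer's full schema):

```scala
// Parse an integer field defensively: None instead of an exception
// when the field is blank or not a number.
def safeInt(field: String): Option[Int] =
  try Some(field.trim.toInt)
  catch { case _: NumberFormatException => None }

// Keep a row only when every numeric column parses cleanly.
def parseEmployee(cols: Array[String]): Option[(Int, String, Int)] =
  for {
    id  <- safeInt(cols(0))
    age <- safeInt(cols(2))
  } yield (id, cols(1).trim, age)

val good = parseEmployee(Array("7", " Ann ", "34"))  // Some((7,Ann,34))
val bad  = parseEmployee(Array("x", " Ann ", "34"))  // None
```

In an RDD pipeline this would pair naturally with flatMap, so malformed rows are dropped instead of killing the job.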
Using Spark with Scala:
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types._

val hiveCtx = new HiveContext(sc)
val inputPath = "input.csv"
val text = sc.textFile(inputPath)
val fields = text.map(line => line.split(",").map(_.trim))
val header = fields.first()
// Keep only the data rows and convert each to a Row
val rows = fields.filter(_(0) != header(0)).map(a => Row.fromSeq(a))
val schema = StructType(header.map(fieldName => StructField(fieldName, StringType, true)))
val df = hiveCtx.createDataFrame(rows, schema)
This should work, though a more standard way of creating the DataFrame is recommended.

For Spark 1.5.x, the snippet below can be used to convert the input into a DataFrame:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// This is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
// Define the schema using a case class.
// Note: case classes in Scala 2.10 support only up to 22 fields. To work around
// this limit, you can use custom classes that implement the Product interface.
case class DataClass(id: Int, name: String, surname: String, bdate: String, address: String)
// Create an RDD of DataClass objects and register it as a table.
val peopleData = sc.textFile("input.csv")
  .map(_.split(","))
  .map(p => DataClass(p(0).trim.toInt, p(1).trim, p(2).trim, p(3).trim, p(4).trim))
  .toDF()
peopleData.registerTempTable("dataTable")
val peopleDataFrame = sqlContext.sql("SELECT * from dataTable")
peopleDataFrame.show()
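The per-line mapping in this snippet can be exercised locally before involving a SparkContext; a small sketch (parsePerson is a hypothetical helper name of mine, wrapping the same split/trim/convert logic the RDD map performs):

```scala
case class DataClass(id: Int, name: String, surname: String, bdate: String, address: String)

// Same split/trim/convert logic the RDD map applies to each line.
def parsePerson(line: String): DataClass = {
  val p = line.split(",").map(_.trim)
  DataClass(p(0).toInt, p(1), p(2), p(3), p(4))
}

val row = parsePerson("1, John, Doe, 1990-01-01, Main Street 5")
// DataClass(1,John,Doe,1990-01-01,Main Street 5)
```

Testing this function on a handful of sample lines catches type-conversion problems early, before they surface as executor-side exceptions.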
"It is a community project" -- are you serious? Do you know that Databricks is the company driving Spark development? And did you know that the spark-csv plugin has been merged into the Spark 2.x core library?

The problem is that I have no way to change that, so I am looking for other ways to parse a CSV into a DataFrame without Databricks.

"Change that" -- what do you mean? Can't you download the JAR once and attach it to your job with --jars, along with its commons-csv dependency? Spark bundled in the CDH distribution works fine for me this way (note that the Apache build's --jars did not work on CDH for me, and I had to use the spark.driver.extraClassPath property and an explicit sc.addJar() as a workaround).