
Java Apache Spark Dataset API - schema StructType not accepted


I have a class that uses the Spark Dataset API to load a headerless CSV file.

My problem is that I cannot get the SparkSession to accept the schema StructType that should define each column. The resulting DataFrame has unnamed columns, all of type string.

public class CsvReader implements java.io.Serializable {

    private StructType builder;

    public CsvReader(StructType builder) {
        this.builder = builder;
    }

    SparkConf conf = new SparkConf().setAppName("csvParquet").setMaster("local");
    // create Spark context
    SparkContext context = new SparkContext(conf);
    // create Spark session
    SparkSession sparkSession = new SparkSession(context);

    Dataset<Row> df = sparkSession
            .read()
            .format("com.databricks.spark.csv")
            .option("header", false)
            //.option("inferSchema", true)
            .schema(builder)
            .load("/Users/Chris/Desktop/Meter_Geocode_Data.csv"); //TODO: CMD line arg

    public void printSchema() {
        System.out.println(builder.length());
        df.printSchema();
    }

    public void printData() {
        df.show();
    }

    public void printMeters() {
        df.select("meter").show();
    }

    public void printMeterCountByGeocode_result() {
        df.groupBy("geocode_result").count().show();
    }

    public Dataset<Row> getDataframe() {
        return df;
    }
}
The debugger shows that the 'builder' StructType is correctly defined:

0 = {StructField@4904} "StructField(geocode_result,DoubleType,false)"
1 = {StructField@4905} "StructField(meter,StringType,false)"
2 = {StructField@4906} "StructField(orig_easting,StringType,false)"
3 = {StructField@4907} "StructField(orig_northing,StringType,false)"
4 = {StructField@4908} "StructField(temetra_easting,StringType,false)"
5 = {StructField@4909} "StructField(temetra_northing,StringType,false)"
6 = {StructField@4910} "StructField(orig_address,StringType,false)"
7 = {StructField@4911} "StructField(orig_postcode,StringType,false)"
8 = {StructField@4912} "StructField(postcode_easting,StringType,false)"
9 = {StructField@4913} "StructField(postcode_northing,StringType,false)"
10 = {StructField@4914} "StructField(distance_calc_method,StringType,false)"
11 = {StructField@4915} "StructField(distance,StringType,false)"
12 = {StructField@4916} "StructField(geocoded_address,StringType,false)"
13 = {StructField@4917} "StructField(geocoded_postcode,StringType,false)"
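
For reference, a StructType like the one in that dump would typically be assembled with Spark's DataTypes factory methods. This is only a sketch reconstructed from the debugger output above; the actual construction code is not shown in the question:

import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Rebuild the schema shown in the debugger (nullable = false throughout)
StructType builder = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("geocode_result", DataTypes.DoubleType, false),
        DataTypes.createStructField("meter", DataTypes.StringType, false),
        DataTypes.createStructField("orig_easting", DataTypes.StringType, false),
        // ...the remaining StringType fields from the dump follow the same pattern
});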

What am I doing wrong? Any help is greatly appreciated.

If you want df to be initialized from the builder, that initialization should go in the constructor (or in a member method). As the class is written, the df field initializer runs before the constructor body assigns this.builder, so .schema(builder) is called with null and the schema is never applied, which is why every column comes back as an unnamed string.
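
A minimal sketch of the constructor-based variant, keeping the asker's names and construction style (the CSV path is the asker's; treat it as a placeholder):

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;

public class CsvReader implements java.io.Serializable {

    private final StructType builder;
    private final Dataset<Row> df;

    public CsvReader(StructType builder) {
        this.builder = builder;

        SparkConf conf = new SparkConf().setAppName("csvParquet").setMaster("local");
        SparkSession sparkSession = new SparkSession(new SparkContext(conf));

        // builder is already assigned at this point, so the schema is actually applied
        this.df = sparkSession
                .read()
                .format("com.databricks.spark.csv")
                .option("header", false)
                .schema(builder)
                .load("/Users/Chris/Desktop/Meter_Geocode_Data.csv");
    }

    public Dataset<Row> getDataframe() {
        return df;
    }
}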

Define a Dataset<Row> df member variable and move the code block that reads the CSV file into the getDataframe() method, as follows:

private Dataset<Row> df = null;

public Dataset<Row> getDataframe() {
    df = sparkSession
            .read()
            .format("com.databricks.spark.csv")
            .option("header", false)
            //.option("inferSchema", true)
            .schema(builder)
            .load("src/main/java/resources/test.csv"); //TODO: CMD line arg
    return df;
}
Then call it like this:

    CsvReader cr = new CsvReader(schema);
    Dataset<Row> df = cr.getDataframe();
    cr.printSchema();
I suggest you redesign your class; one option is to pass df as a parameter to the other methods. Also, if you are using Spark 2.0, you do not need SparkConf; see the Spark documentation on creating a SparkSession.
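
On that last point, the standard Spark 2.x entry point looks like this (app name and master taken from the question; treat them as placeholders). Since 2.0 the CSV source is also built in, so "csv" can replace "com.databricks.spark.csv":

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession sparkSession = SparkSession.builder()
        .appName("csvParquet")
        .master("local")
        .getOrCreate();

Dataset<Row> df = sparkSession.read()
        .format("csv")
        .option("header", false)
        .schema(builder)
        .load("src/main/java/resources/test.csv");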
