Apache Spark (Java): how to iterate over a Dataset of Rows and remove null fields

java apache-spark

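I am trying to build a Spark application that reads data from a Hive table and writes the output as JSON.

In the code below, I have to iterate over the Dataset of Rows and remove the null fields before writing the output.

I want my output to look like the following. Please suggest how I can achieve this.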

{"personId":"101","personName":"Sam","email":"Sam@gmail.com"}
{"personId":"102","personName":"Smith"}  // as email is null or blank should not be included in output

Here is my code:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import com.fdc.model.Person;

public class ExtractionExample {

    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("ExtractionExample")
                .config("spark.sql.warehouse.dir", "/user/hive/warehouse/").enableHiveSupport().getOrCreate();
        Dataset<Row> sqlDF = spark.sql("SELECT person_id as personId, person_name as personName, email_id as emailId FROM person");
        Dataset<Person> person = sqlDF.as(Encoders.bean(Person.class));

        /*
         * Attempt 1: iterate through all the columns, identify null values and drop them.
         * It looks like this would drop the column from the entire table, but when I tried it, nothing happened.
         * String[] columns = sqlDF.columns();
        for (String column : columns) {
            String colValue = sqlDF.select(column).toString();
            System.out.println("printing the column: "+ column +" colvalue:"+colValue.toString());
            if(colValue != null && colValue.isEmpty() && (colValue).trim().length() == 0) {
                System.out.println("dropping the null value");
                sqlDF = sqlDF.drop(column);
            }

        }
        sqlDF.write().json("/data/testdb/test/person_json");
        */

        /*
         * Attempt 2: I was unable to get to the bottom of this solution.
         * Also, collect() is a heavy operation; is there a better way to do this?
         * List<Row> rowListDf = person.javaRDD().map(new Function<Row, Row>() {
                @Override
                public Row call(Row record) throws Exception {
                   String[] fieldNames =  record.schema().fieldNames();
                    Row modifiedRecord = new RowFactory().create();
                   for(int i=0; i < fieldNames.length; i++ ) {
                       String value = record.getAs(i).toString();
                      if (value!= null && !value.isEmpty() && value.trim().length() > 0) {
                          //   RowFactory.create(record.get(i)); ---> throwing this error
                      }
                   }
                    // return RowFactory object
                    return null;
                }
            }).collect();*/


        person.write().json("/data/testdb/test/person_json");

    }
}

As user9613318 suggested, the JSON writer ignores null fields by default.

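There is nothing to do here. By default, the JSON writer ignores null fields. If you have empty strings, you will also have to convert them to NULL. My assumption had been that we need to iterate over every row of the dataset and remove the nulls, which turns out not to be needed.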
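For the empty-string case, one possible approach is to rewrite every column with a Spark SQL expression that turns blank values into NULL, and then let the JSON writer drop them. The sketch below is only an illustration, not code from the question or the answer: the class `BlankToNull` and its two helper methods are hypothetical names, and it assumes the resulting strings are passed to `Dataset.selectExpr(String...)`. The helper itself is plain Java, so it can be checked without a cluster.

```java
import java.util.Arrays;

// Hypothetical helper (not part of Spark): builds Spark SQL expressions
// that replace blank or whitespace-only values with NULL.
final class BlankToNull {

    private BlankToNull() {
    }

    // One expression per column, in the same order as the input.
    static String[] blankToNullExprs(String[] columns) {
        return Arrays.stream(columns)
                .map(BlankToNull::blankToNullExpr)
                .toArray(String[]::new);
    }

    // CASE WHEN trim(CAST(`c` AS STRING)) = '' THEN NULL ELSE `c` END AS `c`
    // The CAST is only used for the blank check; the ELSE branch returns the
    // column unchanged. A NULL input falls through to ELSE and stays NULL.
    static String blankToNullExpr(String column) {
        // Backtick-quote the name, doubling any embedded backticks.
        String quoted = "`" + column.replace("`", "``") + "`";
        return "CASE WHEN trim(CAST(" + quoted + " AS STRING)) = '' THEN NULL ELSE "
                + quoted + " END AS " + quoted;
    }

    // Assumed usage with the question's Dataset (needs Spark on the classpath):
    //   Dataset<Row> cleaned = sqlDF.selectExpr(BlankToNull.blankToNullExprs(sqlDF.columns()));
    //   cleaned.write().json("/data/testdb/test/person_json");
}
```

With this in place, a row whose `emailId` is `""` or only spaces should carry NULL in that column, so the field is left out of that row's JSON object. No per-row iteration or `collect()` is involved, because the expressions are evaluated by Spark as part of the query.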