
Newline characters in a CSV file read with Apache Spark


I'm trying to read a CSV file using Apache Spark (version 3), and I'm facing two issues:

  • One of the fields in the CSV file contains a newline character, so the row is split into two rows.

  • One of the columns contains a comma (in addition to the newline character), so the record is split across multiple columns.

Below is my code and the output I get:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class MainApp {

        public static void main(String[] args) {
            SparkSession session = SparkSession.builder()
                    .appName("Sample App")
                    .master("local[*]")
                    .getOrCreate();

            session.sparkContext().setLogLevel("ERROR");

            Dataset<Row> df = session.read().format("csv")
                    .option("quote", "*")
                    .option("multiLine", "true")
                    .option("ignoreLeadingWhiteSpace", "true")
                    .option("ignoreTrailingWhiteSpace", "true")
                    .load("/home/deepak/sample_dataset/*.csv");
            df.printSchema();
            df.show(false);

            session.stop();
        }
    }
    
    
    
    root
     |-- _c0: string (nullable = true)
     |-- _c1: string (nullable = true)
     |-- _c2: string (nullable = true)
     |-- _c3: string (nullable = true)
    
    +--------------------------------------------+---------------+------+----------------------------+
    |_c0                                         |_c1            |_c2   |_c3                         |
    +--------------------------------------------+---------------+------+----------------------------+
    |15169                                       |Sample data I  |RST1  |"*Insurance data conversion.|
    |Sample (Sample/Sample)Sample Sample't Sample|Sample Box 9999|Sample|Sample 888888-7014*"        |
    +--------------------------------------------+---------------+------+----------------------------+
    

    I can't figure out how to make Spark read this as a single record.

    You don't need to specify the quote option, because your quote character (") is already Spark's default. With multiLine enabled and the default quote, the newline and the commas stay inside the quoted field.
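To see why the quote character matters here, the sketch below is a minimal, hypothetical quote-aware splitter. It is not Spark's actual CSV parser (Spark delegates CSV parsing to the univocity library), just an illustration of the quoting rule the answer relies on: a comma or newline inside a quoted field is an ordinary character and does not end the field or the record.

```java
import java.util.ArrayList;
import java.util.List;

public class QuoteAwareSplit {

    // Split one CSV record into fields, honoring the given quote character.
    // Inside a quoted field, commas and newlines are ordinary characters.
    static List<String> split(String record, char quote) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (char c : record.toCharArray()) {
            if (c == quote) {
                inQuotes = !inQuotes;      // toggle quoted state, drop the quote itself
            } else if (c == ',' && !inQuotes) {
                fields.add(field.toString());
                field.setLength(0);
            } else {
                field.append(c);           // a newline inside quotes lands here
            }
        }
        fields.add(field.toString());
        return fields;
    }

    public static void main(String[] args) {
        // One logical record whose third field contains both a newline and a comma.
        String record = "15169,RST1,\"Insurance data conversion.\nSample Box 9999, Sample\"";
        List<String> fields = split(record, '"');
        System.out.println(fields.size());  // 3 fields: the comma and newline stayed inside the quotes
        System.out.println(fields.get(2));
    }
}
```

With quote set to '"', the record parses as three fields; this is why dropping the non-default quote option and keeping multiLine fixes both issues at once.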

    For example, given this sample CSV file:
    
    *15169*,Sample data I,RST1,"*Insurance data conversion.
    Sample (Sample/Sample)Sample Sample't Sample, Sample Box 9999, Sample, Sample 888888-7014*"
    
    Dataset<Row> df = session.read().format("csv")
                     .option("multiLine", "true")
                     .option("ignoreLeadingWhiteSpace", "true")
                     .option("ignoreTrailingWhiteSpace", "true")
                     .load("/home/deepak/sample_dataset/*.csv");
    
    df.show(false);
    +-------+-------------+----+----------------------------------------------------------------------------------------------------------------------+
    |_c0    |_c1          |_c2 |_c3                                                                                                                   |
    +-------+-------------+----+----------------------------------------------------------------------------------------------------------------------+
    |*15169*|Sample data I|RST1|*Insurance data conversion.
    Sample (Sample/Sample)Sample Sample't Sample, Sample Box 9999, Sample, Sample 888888-7014*|
    +-------+-------------+----+----------------------------------------------------------------------------------------------------------------------+
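After parsing, the values still carry the literal * markers (e.g. *15169*), since * is data rather than a quote character here. If those markers should be stripped (an assumption; the question doesn't ask for it), Spark's regexp_replace function can remove a leading and trailing * from every column. The class name StripStars and the reuse of the sample path are illustrative:

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.regexp_replace;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class StripStars {

    public static void main(String[] args) {
        SparkSession session = SparkSession.builder()
                .appName("Strip stars")
                .master("local[*]")
                .getOrCreate();

        Dataset<Row> df = session.read().format("csv")
                .option("multiLine", "true")
                .option("ignoreLeadingWhiteSpace", "true")
                .option("ignoreTrailingWhiteSpace", "true")
                .load("/home/deepak/sample_dataset/*.csv");

        // "^\\*|\\*$" matches a '*' at the start or the end of the value;
        // regexp_replace removes every match, leaving the inner text intact.
        Dataset<Row> cleaned = df;
        for (String name : df.columns()) {
            cleaned = cleaned.withColumn(name,
                    regexp_replace(col(name), "^\\*|\\*$", ""));
        }
        cleaned.show(false);

        session.stop();
    }
}
```

With this, *15169* becomes 15169 while values without the markers pass through unchanged.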