Newline characters in a CSV file with Apache Spark

I am trying to read a CSV file using Apache Spark (version 3), and I am facing two problems:

A field in the CSV file contains a newline character, so the row gets split across two lines.

One of the columns contains commas (as well as the newline character), so the record gets split into multiple columns.

Below is my code and the output I get.
import org.apache.spark.rdd.RDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

public class MainApp {
    public static void main(String[] args) {
        SparkSession session = SparkSession.builder().appName("Sample App").master("local[*]").getOrCreate();
        session.sparkContext().setLogLevel("ERROR");
        Dataset<Row> df = session.read().format("csv")
                .option("quote", "*")
                .option("multiLine", "true")
                .option("ignoreLeadingWhiteSpace", "true")
                .option("ignoreTrailingWhiteSpace", "true")
                .load("/home/deepak/sample_dataset/*.csv");
        df.printSchema();
        df.show(false);
        session.stop();
    }
}
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: string (nullable = true)
|-- _c3: string (nullable = true)
+--------------------------------------------+---------------+------+----------------------------+
|_c0 |_c1 |_c2 |_c3 |
+--------------------------------------------+---------------+------+----------------------------+
|15169 |Sample data I |RST1 |"*Insurance data conversion.|
|Sample (Sample/Sample)Sample Sample't Sample|Sample Box 9999|Sample|Sample 888888-7014*" |
+--------------------------------------------+---------------+------+----------------------------+
I cannot figure out how to get Spark to read this as a single record.

You don't need to specify the quote option at all, because the quote character in your file is the default ". With this sample CSV:
*15169*,Sample data I,RST1,"*Insurance data conversion.
Sample (Sample/Sample)Sample Sample't Sample, Sample Box 9999, Sample, Sample 888888-7014*"
Dataset<Row> df = session.read().format("csv")
        .option("multiLine", "true")
        .option("ignoreLeadingWhiteSpace", "true")
        .option("ignoreTrailingWhiteSpace", "true")
        .load("/home/deepak/sample_dataset/*.csv");
df.show();
+-------+-------------+----+----------------------------------------------------------------------------------------------------------------------+
|_c0 |_c1 |_c2 |_c3 |
+-------+-------------+----+----------------------------------------------------------------------------------------------------------------------+
|*15169*|Sample data I|RST1|*Insurance data conversion.
Sample (Sample/Sample)Sample Sample't Sample, Sample Box 9999, Sample, Sample 888888-7014*|
+-------+-------------+----+----------------------------------------------------------------------------------------------------------------------+
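For intuition: multiLine=true makes Spark's CSV parser treat a newline that falls inside a quoted field as part of the data rather than as a record separator. The sketch below illustrates that idea in plain Java; MultiLineCsv and its records method are hypothetical names for this example, not Spark APIs. It splits raw CSV text into logical records by tracking whether the parser is currently inside a quoted field:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Spark): joins physical lines into logical
// CSV records by tracking quote state, which is conceptually what Spark's
// multiLine=true option enables.
public class MultiLineCsv {
    public static List<String> records(String csv, char quote) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < csv.length(); i++) {
            char c = csv.charAt(i);
            if (c == quote) {
                inQuotes = !inQuotes;    // entering or leaving a quoted field
                cur.append(c);
            } else if (c == '\n' && !inQuotes) {
                out.add(cur.toString()); // newline outside quotes ends the record
                cur.setLength(0);
            } else {
                cur.append(c);           // newline inside quotes stays in the field
            }
        }
        if (cur.length() > 0) out.add(cur.toString());
        return out;
    }

    public static void main(String[] args) {
        // Two physical lines, but the newline sits inside a quoted field,
        // so the splitter produces a single logical record.
        String csv = "a,\"first line\nsecond line\",b\nc,d,e";
        System.out.println(records(csv, '"').size());
    }
}
```

Applied to the sample file above, the two physical lines collapse into one logical record, which is exactly what the multiLine option achieves inside Spark's CSV reader.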