Scala 如何在Spark中使用分隔符拆分列,该分隔符在记录中也可用
我有以下数据集Scala 如何在Spark中使用分隔符拆分列,该分隔符在记录中也可用,scala,apache-spark,apache-spark-sql,Scala,Apache Spark,Apache Spark Sql,我有以下数据集 Segment.organizationId|^|Segment.segmentId|^|SegmentType|^|SegmentName|^|SegmentName.languageId|^|SegmentLocalLanguageLabel|^|SegmentLocalLanguageLabel.languageId|^|ValidFromPeriodEndDate|^|ValidToPeriodEndDate|^|SegmentInactivationReasonCode
Segment.organizationId|^|Segment.segmentId|^|SegmentType|^|SegmentName|^|SegmentName.languageId|^|SegmentLocalLanguageLabel|^|SegmentLocalLanguageLabel.languageId|^|ValidFromPeriodEndDate|^|ValidToPeriodEndDate|^|SegmentInactivationReasonCode|^|SegmentOrganizationId|^|IsShariaCompliant|^|IsCorporate|^|IsElimination|^|IsOther|^|InactiveReasonOtherDescription|^|InactiveReasonOtherDescription.languageId|^|IsOperatingSegment|^|SegmentFundbDescription|^|SegmentFundbDescription.languageId|^|SegmentTypeId|^|SegmentInactiveReasonId|^|FFAction|!|
4295876080|^|7|^|B|^|Test ||^|505074|^|jtrsu|^|505126|^|2010-03-31T00:00:00Z|^||^||^||^|False|^|False|^|False|^|False|^||^|505074|^|False|^||^|505074|^|3013618|^||^|I|!|
这是我的密码
val df = sqlContext.read.format("csv").option("header", "true").option("delimiter", "|").option("inferSchema","true").load("s3://trfsmallfffile/FinancialSegment/TEST")
但这并没有给我正确的输出
这是我的输出
+----------------------+-----------------+-----------+-----------+----------------------+-------------------------+------------------------------------+----------------------+--------------------+-----------------------------+---------------------+-----------------+-----------+-------------+-------+------------------------------+-----------------------------------------+------------------+-----------------------+----------------------------------+-------------+-----------------------+--------+-------------+
|Segment_organizationId|Segment_segmentId|SegmentType|SegmentName|SegmentName_languageId|SegmentLocalLanguageLabel|SegmentLocalLanguageLabel_languageId|ValidFromPeriodEndDate|ValidToPeriodEndDate|SegmentInactivationReasonCode|SegmentOrganizationId|IsShariaCompliant|IsCorporate|IsElimination|IsOther|InactiveReasonOtherDescription|InactiveReasonOtherDescription_languageId|IsOperatingSegment|SegmentFundbDescription|SegmentFundbDescription_languageId|SegmentTypeId|SegmentInactiveReasonId|FFAction|DataPartition|
+----------------------+-----------------+-----------+-----------+----------------------+-------------------------+------------------------------------+----------------------+--------------------+-----------------------------+---------------------+-----------------+-----------+-------------+-------+------------------------------+-----------------------------------------+------------------+-----------------------+----------------------------------+-------------+-----------------------+--------+-------------+
| 4295876080| 7| B| Test | ^| ^| ^| ^| ^| ^| ^| ^| ^| ^| ^| ^| ^| ^| ^| ^| ^| ^| ^| Japan|
+----------------------+-----------------+-----------+-----------+----------------------+-------------------------+------------------------------------+----------------------+--------------------+-----------------------------+---------------------+-----------------+-----------+-------------+-------+------------------------------+-----------------------------------------+------------------+-----------------------+----------------------------------+-------------+-----------------------+--------+-------------+
我得到这个是因为记录中使用了|字符
我如何处理这种情况
我的预期产出如下
...+----------------------+-----------------+-----------+-----------+----------------------+-------------------------+------------------------------------+----------------------+--------------------+-----------------------------+---------------------+-----------------+-----------+-------------+-------+------------------------------+-----------------------------------------+------------------+-----------------------+----------------------------------+-------------+-----------------------+-----------+
|Segment.organizationId|Segment.segmentId|SegmentType|SegmentName|SegmentName.languageId|SegmentLocalLanguageLabel|SegmentLocalLanguageLabel.languageId|ValidFromPeriodEndDate|ValidToPeriodEndDate|SegmentInactivationReasonCode|SegmentOrganizationId|IsShariaCompliant|IsCorporate|IsElimination|IsOther|InactiveReasonOtherDescription|InactiveReasonOtherDescription.languageId|IsOperatingSegment|SegmentFundbDescription|SegmentFundbDescription.languageId|SegmentTypeId|SegmentInactiveReasonId|FFAction|
+----------------------+-----------------+-----------+-----------+----------------------+-------------------------+------------------------------------+----------------------+--------------------+-----------------------------+---------------------+-----------------+-----------+-------------+-------+------------------------------+-----------------------------------------+------------------+-----------------------+----------------------------------+-------------+-----------------------+-----------+
|4295876080 |7 |B |Test | |505074 |jtrsu |505126 |2010-03-31T00:00:00Z | | | |False |False |False |False | |505074 |False | |505074 |3013618 | |I |
+----------------------+-----------------+-----------+-----------+----------------------+-------------------------+------------------------------------+----------------------+--------------------+-----------------------------+---------------------+-----------------+-----------+-------------+-------+------------------------------+-----------------------------------------+------------------+-----------------------+----------------------------------+-------------+-----------------------+-----------+
选项参数中的spark sql不支持分隔符的多个字符。因此,我建议您使用sparkContext,因为有支持多个字符的拆分函数
因此,第一步是使用sparkContext读取文件
然后,您需要将标题的第一行分隔开来,并从中创建模式
最后一步是使用创建的模式创建所需的数据帧
您应该具有以下数据帧
选项参数中的spark sql不支持分隔符的多个字符。因此,我建议您使用sparkContext,因为有支持多个字符的拆分函数
因此,第一步是使用sparkContext读取文件
然后,您需要将标题的第一行分隔开来,并从中创建模式
最后一步是使用创建的模式创建所需的数据帧
您应该具有以下数据帧
分隔符不能使用两个字符。这在spark中不受支持sql@RameshMaharjan更新了我的问题请看一看您应该使用sparkContext使用多字符分隔符并将rdd转换为dataset或DataFrame您不能使用两个字符作为分隔符。这在spark中不受支持sql@RameshMaharjan更新了我的问题,请再次查看。您应该使用sparkContext使用多字符分隔符,并将rdd转换为数据集或数据帧
val rdd = sc.textFile("s3://trfsmallfffile/FinancialSegment/TEST")
val header = rdd.filter(_.contains("Segment.organizationId")).map(line => line.split("\\|\\^\\|")).first()
val schema = StructType(header.map(cols => StructField(cols.replace(".", "_"), StringType)).toSeq)
val data = sqlContext.createDataFrame(rdd.filter(!_.contains("Segment.organizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schema).show(false)
+----------------------+-----------------+-----------+-----------+----------------------+-------------------------+------------------------------------+----------------------+--------------------+-----------------------------+---------------------+-----------------+-----------+-------------+-------+------------------------------+-----------------------------------------+------------------+-----------------------+----------------------------------+-------------+-----------------------+-----------+
|Segment_organizationId|Segment_segmentId|SegmentType|SegmentName|SegmentName_languageId|SegmentLocalLanguageLabel|SegmentLocalLanguageLabel_languageId|ValidFromPeriodEndDate|ValidToPeriodEndDate|SegmentInactivationReasonCode|SegmentOrganizationId|IsShariaCompliant|IsCorporate|IsElimination|IsOther|InactiveReasonOtherDescription|InactiveReasonOtherDescription_languageId|IsOperatingSegment|SegmentFundbDescription|SegmentFundbDescription_languageId|SegmentTypeId|SegmentInactiveReasonId|FFAction|!||
+----------------------+-----------------+-----------+-----------+----------------------+-------------------------+------------------------------------+----------------------+--------------------+-----------------------------+---------------------+-----------------+-----------+-------------+-------+------------------------------+-----------------------------------------+------------------+-----------------------+----------------------------------+-------------+-----------------------+-----------+
|4295876080 |7 |B |Test | |505074 |jtrsu |505126 |2010-03-31T00:00:00Z | | | |False |False |False |False | |505074 |False | |505074 |3013618 | |I|!| |
+----------------------+-----------------+-----------+-----------+----------------------+-------------------------+------------------------------------+----------------------+--------------------+-----------------------------+---------------------+-----------------+-----------+-------------+-------+------------------------------+-----------------------------------------+------------------+-----------------------+----------------------------------+-------------+-----------------------+-----------+