Apache Spark: designing a Spark job for a CSV that contains records with different schemas


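I have a huge CSV file containing records that belong to 10 different schemas. I'm developing a Spark application in which I read the whole file and clean the data (I use RDD transformations; I can't use a DataFrame because there is no single schema).

Sample CSV: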

Record1,test,name,id
Record10,test8,customer,value,info,id
Record9,record,door,lamp,sofa,tv,sink,table,box,window

After cleaning the records, for each schema I create a DataFrame from the previous RDD and then save it to HDFS.

My question is: what can I do to reduce shuffling? For example, partitioning by schema type first and then saving the data.

Any feedback is much appreciated :)

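You can read the file as lines of text, map each line to a tuple of its schema type and the actual line, then filter by type and create a DataFrame for each type. Here is some PySpark code that may help.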

import re

import pyspark

# Assumes an existing SparkSession named `spark` (as in the pyspark shell).
rdd = spark.sparkContext.textFile('/home/roi/repos/roizaig/top/spark/stackoverflow_answers/csv_multiple_schema.csv')


def record_mapper(line):
    # I'm sure you can write better code to identify the schema type, but this is the basic idea.
    # Also consider parsing the line into its CSV values here instead of making an additional pass over the data.
    # Each pattern matches "at least N commas", so the widest schema must be tested first.
    if re.search('^.*,.*,.*,.*,.*,.*,.*,.*,.*,.*$', line):
        # Record9,record,door,lamp,sofa,tv,sink,table,box,window
        schema_type = 3
    elif re.search('^.*,.*,.*,.*,.*,.*$', line):
        # Record10,test8,customer,value,info,id
        schema_type = 2
    elif re.search('^.*,.*,.*,.*$', line):
        # Record1,test,name,id
        schema_type = 1
    else:
        # Unrecognized line; type 0 is skipped by the loop below.
        schema_type = 0
    return (schema_type, line)


def part(record):
    # Hashes a record's schema type; not used below.
    return hash(record[0])


type_rdd = rdd.map(record_mapper, preservesPartitioning=True)
# Use persist here to avoid recomputing type_rdd for every filter.
type_rdd.persist(pyspark.StorageLevel.MEMORY_AND_DISK)

for schema_type in range(1, 4):
    # Since the schema type is known, you can create a DataFrame with a defined schema.
    # The default argument binds the current schema_type to the lambda.
    df = type_rdd.filter(lambda t, st=schema_type: t[0] == st).toDF()
    df.show()

+---+--------------------+
| _1|                  _2|
+---+--------------------+
|  1|Record1,test,name,id|
+---+--------------------+

+---+--------------------+
| _1|                  _2|
+---+--------------------+
|  2|Record10,test8,cu...|
+---+--------------------+

+---+--------------------+
| _1|                  _2|
+---+--------------------+
|  3|Record9,record,do...|
+---+--------------------+
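
The comment in `record_mapper` suggests parsing each line into its CSV values while identifying the schema type. A minimal plain-Python sketch of that idea follows. It assumes each schema can be recognized by its exact field count (4, 6 and 10, taken from the three sample records); `parse_record` and `FIELD_COUNT_TO_TYPE` are hypothetical names, and the real file with 10 schemas may need a different rule. Unlike the regular expressions above, which match "at least N commas", this matches the field count exactly and returns `None` for anything else.

```python
import csv

# Assumption: a schema is identified by its number of fields,
# based on the three sample records in the question.
FIELD_COUNT_TO_TYPE = {4: 1, 6: 2, 10: 3}


def parse_record(line):
    """Parse one CSV line and return (schema_type, fields).

    schema_type is None when the field count matches no known schema.
    """
    # csv.reader also handles quoted fields that contain commas.
    fields = next(csv.reader([line]), [])
    return (FIELD_COUNT_TO_TYPE.get(len(fields)), fields)
```

In the Spark job, `rdd.map(parse_record)` could stand in for `record_mapper`, so the fields are already split when each subset is filtered by type.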