PySpark shows null records when reading a TSV file from a Spark function
I am trying to create a Spark function that reads various types of files, and I have also created a reference schema for files that have no header. When I call the function, the schema is applied correctly to the headerless file, but all of the column values come back null. Below are my code and output.
def generate_df3(spark, feed_type="File", type="csv", options=None):
    # options={"sep":"|", "header":"xyz", "path":"xyz", "multiLine":"true", "inferSchema":"true", "quote":"\""}
    options_local = {}  # for some default local params if not passed
    from pyspark.sql.types import StructType, StructField, LongType, IntegerType, StringType, TimestampType, DoubleType
    referral_schema = StructType([StructField("id_code", IntegerType(), True),
                                  StructField("de_description", StringType(), True)])
    path = None
    if spark is None:
        spark = get_spark()
    if feed_type is None:
        raise ValueError("Feed type can not be none")
    if "path" in options:
        path = options["path"]
        del options["path"]
    else:
        raise ValueError("Path is not provided in provided options")
    if feed_type == "File":
        flatten_schema = False
        supported_file_formats = ["csv", "psv", "tsv", "avro", "parquet", "orc", "json"]
        if type not in supported_file_formats:
            raise ValueError("Given type is not supported. Supported file formats : " + str(supported_file_formats))
        if type in ["psv", "tsv", "csv"]:
            type = "csv"
            options_local["header"] = "true"  # by default read header
        if type == "psv":
            options["sep"] = "|"
        if type == "tsv":
            options["sep"] = "\t"
        if type in ["json"]:
            options_local["multiLine"] = "true"  # by default read multiline JSON
        if "flatten" in options:
            if options["flatten"] == "true":
                flatten_schema = True
            del options["flatten"]
        spark_read = spark.read.format(type)
        options_local.update(options)
        for k, v in options_local.items():
            spark_read = spark_read.option(k, v)
        if path == "C:/Users/HP/Desktop/bigdata/hit_data_tsbukglobaldev_20200816_sample.tsv":
            df = spark_read.schema(hit_file_schema).load(path)  # hit_file_schema is defined elsewhere
        if path == "C:/Users/HP/Downloads/connection_type.tsv":
            df = spark_read.schema(referral_schema).load(path)
        else:
            df = spark_read.load(path)
        if flatten_schema:
            df = df.select(flatten(df.schema))
        return df.show()
    elif feed_type == "Database":
        jdbc_url, table, connection_properties = read_db_config(path)
        df = spark.read.jdbc(url=jdbc_url, table=table, properties=connection_properties)
        return df
generate_df3(spark, feed_type="File", type="tsv", options={"path":"C:/Users/HP/Downloads/connection_type.tsv"})
Output:
+-------+--------------+
|id_code|de_description|
+-------+--------------+
| null| null|
| null| null|
| null| null|
| null| null|
+-------+--------------+
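For reference, reading the same file directly does return the expected rows once the tab separator and the header option are set explicitly. This is a minimal sketch, assuming a local SparkSession and the file path from the question; it bypasses the wrapper function entirely:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("tsv-read-check").getOrCreate()

referral_schema = StructType([
    StructField("id_code", IntegerType(), True),
    StructField("de_description", StringType(), True),
])

# Read the headerless lookup file with an explicit tab separator;
# header must be "false" so the first data row is not consumed as a header.
df = (spark.read.format("csv")
      .option("sep", "\t")
      .option("header", "false")
      .schema(referral_schema)
      .load("C:/Users/HP/Downloads/connection_type.tsv"))
df.show()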
Below are sample records from the file I am trying to read (fields are tab-separated):
0	Not Specified
1	Modem
2	LAN/Wifi
3	Unknown
4	Mobile Carrier
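Reading the function above, two details look relevant to the null output. First, type is reassigned to "csv" before the if type == "tsv" check runs, so options["sep"] = "\t" is never reached and Spark falls back to the default comma separator; each tab-separated line then parses as a single field that cannot match the two-column schema, producing nulls in PERMISSIVE mode. Second, options_local["header"] = "true" makes Spark consume the first data row as a header, which would explain four null rows for a five-record file. A minimal sketch of the relevant block from generate_df3, reordered so the separator is decided while type still distinguishes the formats (not a definitive fix, just the ordering change):

# Decide the separator from the original type BEFORE collapsing it to "csv",
# and let callers override the header default via options.
if type in ["psv", "tsv", "csv"]:
    if type == "psv":
        options.setdefault("sep", "|")
    if type == "tsv":
        options.setdefault("sep", "\t")
    type = "csv"
    options_local.setdefault("header", "true")  # pass options={"header": "false"} for headerless files

Separately, the branch that loads hit_data with hit_file_schema is followed by an independent if/else, so its DataFrame is immediately overwritten by the else branch; an elif chain would avoid that.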