PySpark shows null records when reading a TSV file from a Spark function
I am trying to create a Spark function that reads various types of files, and I have also created a reference schema for files that have no header. When I call the function, the schema is applied correctly to the headerless file, but all of the column values come back null. Below are my code and output.
def generate_df3(spark, feed_type="File", type="csv", options=None):
    # options={"sep":"|", "header":"xyz", "path":"xyz", "multiLine":"true", "inferSchema":"true", "quote":"\""}
    options_local = {}  # for some default local params if not passed
    from pyspark.sql.types import StructType, StructField, LongType, IntegerType, StringType, TimestampType, DoubleType
    referral_schema = StructType([StructField("id_code", IntegerType(), True),
                                  StructField("de_description", StringType(), True)])
    path = None
    if spark is None:
        spark = get_spark()
    if feed_type is None:
        raise ValueError("Feed type can not be none")
    if "path" in options:
        path = options["path"]
        del options["path"]
    else:
        raise ValueError("Path is not provided in provided options")
    if feed_type == "File":
        flatten_schema = False
        supported_file_formats = ["csv", "psv", "tsv", "avro", "parquet", "orc", "json"]
        if type not in supported_file_formats:
            raise ValueError("Given type is not supported. Supported file formats : " + str(supported_file_formats))
        if type in ["psv", "tsv", "csv"]:
            type = "csv"
            options_local["header"] = "true"  # by default read header
        if type == "psv":
            options["sep"] = "|"
        if type == "tsv":
            options["sep"] = "\t"
        if type in ["json"]:
            options_local["multiLine"] = "true"  # by default read multiline JSON
        if "flatten" in options:
            if options["flatten"] == "true":
                flatten_schema = True
            del options["flatten"]
        spark_read = spark.read.format(type)
        options_local.update(options)
        for k, v in options_local.items():
            spark_read = spark_read.option(k, v)
        if path == "C:/Users/HP/Desktop/bigdata/hit_data_tsbukglobaldev_20200816_sample.tsv":
            df = spark_read.schema(hit_file_schema).load(path)  # hit_file_schema is defined elsewhere
        if path == "C:/Users/HP/Downloads/connection_type.tsv":
            df = spark_read.schema(referral_schema).load(path)
        else:
            df = spark_read.load(path)
        if flatten_schema:
            df = df.select(flatten(df.schema))
        return df.show()
    elif feed_type == "Database":
        jdbc_url, table, connection_properties = read_db_config(path)
        df = spark.read.jdbc(url=jdbc_url, table=table, properties=connection_properties)
        return df
generate_df3(spark, feed_type="File", type="tsv", options={"path":"C:/Users/HP/Downloads/connection_type.tsv"})
Output:
+-------+--------------+
|id_code|de_description|
+-------+--------------+
| null| null|
| null| null|
| null| null|
| null| null|
+-------+--------------+
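For reference, reading the same file directly does return the expected rows once the tab separator and the header option are set explicitly. This is a minimal sketch, assuming a local SparkSession and the file path from the question; it bypasses the wrapper function entirely:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("tsv-read-check").getOrCreate()

referral_schema = StructType([
    StructField("id_code", IntegerType(), True),
    StructField("de_description", StringType(), True),
])

# Read the headerless lookup file with an explicit tab separator;
# header must be "false" so the first data row is not consumed as a header.
df = (spark.read.format("csv")
      .option("sep", "\t")
      .option("header", "false")
      .schema(referral_schema)
      .load("C:/Users/HP/Downloads/connection_type.tsv"))
df.show()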
Below are sample records from the file I am trying to read (fields are tab-separated):
0	Not Specified
1	Modem
2	LAN/Wifi
3	Unknown
4	Mobile Carrier
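Reading the function above, two details look relevant to the null output. First, type is reassigned to "csv" before the if type == "tsv" check runs, so options["sep"] = "\t" is never reached and Spark falls back to the default comma separator; each tab-separated line then parses as a single field that cannot match the two-column schema, producing nulls in PERMISSIVE mode. Second, options_local["header"] = "true" makes Spark consume the first data row as a header, which would explain four null rows for a five-record file. A minimal sketch of the relevant block from generate_df3, reordered so the separator is decided while type still distinguishes the formats (not a definitive fix, just the ordering change):

# Decide the separator from the original type BEFORE collapsing it to "csv",
# and let callers override the header default via options.
if type in ["psv", "tsv", "csv"]:
    if type == "psv":
        options.setdefault("sep", "|")
    if type == "tsv":
        options.setdefault("sep", "\t")
    type = "csv"
    options_local.setdefault("header", "true")  # pass options={"header": "false"} for headerless files

Separately, the branch that loads hit_data with hit_file_schema is followed by an independent if/else, so its DataFrame is immediately overwritten by the else branch; an elif chain would avoid that.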