Python: Input row doesn't have expected number of values required by the schema


I am reading data from a csv file that contains 7 million rows. Sample data from the file:

"dfg.AAIXpWU4Q","1"
"cvbc.AAU3aXfQ","1"
"T-L5aL1uT_OfFbk","1"
"D9TOXrA_LsQa-awVk","2"
"JWg8_0lGDWcH_9aDc","2"
"ewrq.AAbCVh5wA","1"
"ewrq.AALAC-Qku3heg","1"
"ewrq.AADSmhJ7A","2"
"ewrq.AAEAoHUNA","1"
"ewrq.AALfV5u-7Yg","1"
I read it in like this:

>>> rdd = sc.textFile("/path/to/file/*")

>>> rdd.take(2)
['"7wAfdgdfgd","7"', '"1x3Qdfgdf","1"']
​
# reading the RDD into a dataframe
>>> my_df = rdd.map(lambda x: (x.split(","))).toDF()

# changing column names
>>> df1 = my_df.selectExpr("_1 as user_id", "_2 as hits")

>>> df1.show(3)
+-------+----+
|user_id|hits|
+-------+----+
|"aYk...| "7"|
|"yDQ...| "1"|
|"qUU...|"13"|
+-------+----+
only showing top 3 rows

>>> from pyspark.sql.functions import col
>>> df2 = df1.sort(col('hits').desc())
>>> df2.show(10)
But this gives me the following error:

Input row doesn't have expected number of values required by the schema. 2 fields are required while 18 values are provided.


I'm guessing it's the way I'm converting the RDD to a DataFrame. Maybe x.split(",") doesn't account for bad data. How can I fix this?
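To see why the count can reach 18: a plain split produces one extra field for every comma embedded in a quoted value, so a single row containing 17 commas yields exactly the 18 values the error reports. A minimal sketch (the malformed row below is hypothetical):

>>> good = '"ewrq.AAbCVh5wA","1"'
>>> bad = '"id,with,embedded,commas","2"'  # hypothetical record with commas inside the quoted field
>>> len(good.split(","))
2
>>> len(bad.split(","))
5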

Following @pault's comment, I did the following to solve this:

>>> from pyspark.sql.functions import col
>>> from pyspark.sql.types import IntegerType

>>> rdd = sc.textFile("/path/to/file/*")

# checking out how the data looks
>>> rdd.take(2)
['"7wAfdgdfgd","7"', '"1x3Qdfgdf","1"']

>>> my_df = spark.read.csv("/path/to/file/*", quote='"', sep=",")

>>> df1 = my_df.selectExpr("_c0 as user_id", "_c1 as hits")

>>> df1 = df1.withColumn("hits", df1["hits"].cast(IntegerType()))

>>> df2 = df1.sort(col('hits').desc())
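A quick sanity check that the cast took effect before sorting:

>>> df1.printSchema()
root
 |-- user_id: string (nullable = true)
 |-- hits: integer (nullable = true)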

Most likely you have a bad record somewhere in your file that contains the delimiter inside a line. Spark is lazy, so you won't hit the error until it has to operate on that particular part of the data. In this case, sort has to read the entire file, which is what triggers the error.

I know why the error exists; what I'm asking is how to avoid it.

It looks like your data is quoted, but the way you read it ignores the quotes. Instead, try: df = spark.read.csv("/path/to/file/*", quote='"', sep=","), and if that doesn't work, try adding mode="DROPMALFORMED".

@kev Maybe you can specify a maximum in the split, like x.split(",", 1). That might allow the df to be created, but it won't drop the "malformed data"!
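For reference, a sketch of those two suggestions side by side (the sample string is hypothetical):

>>> # let the CSV reader honor the quotes, and drop rows it cannot parse
>>> my_df = spark.read.csv("/path/to/file/*", quote='"', sep=",", mode="DROPMALFORMED")

>>> # the maxsplit workaround: always yields 2 fields, but mis-splits rows with embedded commas
>>> '"id,with,commas","2"'.split(",", 1)
['"id', 'with,commas","2"']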