PySpark EOFError after calling map
I'm new to Spark and PySpark. I'm reading a small csv file (~40k rows) into a dataframe:
from pyspark.sql import Row
from pyspark.sql import functions as F
from pyspark.mllib.linalg import Vectors  # Spark 1.6; use pyspark.ml.linalg on 2.x

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferSchema='true').load('/tmp/sm.csv')
df = df.withColumn('verified', F.when(df['verified'] == 'Y', 1).otherwise(0))
df2 = df.map(lambda x: Row(label=float(x[0]), features=Vectors.dense(x[1:]))).toDF()
I get some strange errors — not every time, but fairly often:
>>> df2.show(1)
+--------------------+---------+
| features| label|
+--------------------+---------+
|[0.0,0.0,0.0,0.0,...|4700734.0|
+--------------------+---------+
only showing top 1 row
>>> df2.count()
41999
>>> df2.show(1)
+--------------------+---------+
| features| label|
+--------------------+---------+
|[0.0,0.0,0.0,0.0,...|4700734.0|
+--------------------+---------+
only showing top 1 row
>>> df2.count()
41999
>>> df2.show(1)
Traceback (most recent call last):
File "spark-1.6.1/python/lib/pyspark.zip/pyspark/daemon.py", line 157, in manager
File "spark-1.6.1/python/lib/pyspark.zip/pyspark/daemon.py", line 61, in worker
File "spark-1.6.1/python/lib/pyspark.zip/pyspark/worker.py", line 136, in main
if read_int(infile) == SpecialLengths.END_OF_STREAM:
File "spark-1.6.1/python/lib/pyspark.zip/pyspark/serializers.py", line 545, in read_int
raise EOFError
EOFError
+--------------------+---------+
| features| label|
+--------------------+---------+
|[0.0,0.0,0.0,0.0,...|4700734.0|
+--------------------+---------+
only showing top 1 row
Once the EOFError has been raised, I don't see it again until I do something that requires interacting with the Spark server.
When I call df2.count() it shows the [Stage xxx] prompt, which is what I mean by sending work to the Spark server. Anything that triggers that with df2 eventually seems to raise the EOFError again.
It doesn't seem to happen with df (as opposed to df2), so something must be going on in the df.map() line.

Try doing the map after converting the dataframe to an RDD. You are applying a map function to a dataframe and then creating a dataframe from it again:
df2 = df.rdd.map(lambda x: Row(label=float(x[0]), features=Vectors.dense(x[1:]))).toDF()
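For intuition, here is a plain-Python sketch of what that per-row lambda does, run outside Spark on made-up sample rows (the data and helper name are assumptions, not from the question):

```python
# Hypothetical stand-in for the per-row transform applied via df.rdd.map():
# the first column becomes the label, the rest become the feature vector.
def to_labeled(row):
    return (float(row[0]), [float(v) for v in row[1:]])

rows = [(4700734, 0, 0, 1), (12345, 1, 0, 0)]  # made-up sample data
print([to_labeled(r) for r in rows])
# [(4700734.0, [0.0, 0.0, 1.0]), (12345.0, [1.0, 0.0, 0.0])]
```

In real PySpark code, `Vectors.dense(x[1:])` plays the role of the feature list here; the point is that the mapping runs per row over the RDD, not over the DataFrame.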
Let me know whether it works. Thanks.

I believe you are running Spark 2.x or above. The code below should create the dataframe from csv:
df = spark.read.format("csv").option("header", "true").load("csvfile.csv")
Then you can use:
df = df.withColumn('verified', F.when(df['verified'] == 'Y', 1).otherwise(0))
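As a quick check on what that when/otherwise expression computes, here is a plain-Python equivalent applied to made-up sample values (the function name is an illustration, not a PySpark API):

```python
# Plain-Python equivalent of F.when(df['verified'] == 'Y', 1).otherwise(0):
# any value other than 'Y' (including None) falls through to the otherwise branch.
def verified_flag(value):
    return 1 if value == 'Y' else 0

print([verified_flag(v) for v in ['Y', 'N', None]])  # [1, 0, 0]
```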
Then you can create df2 without Row and toDF().
Let me know whether this works, or whether you are on Spark 1.6… Thanks.

I heard from the Spark user list that this message is a bit too verbose and can be ignored.

Pete, can you point us to the archive? I searched the Spark user list but couldn't find anything about the EOFError :(

I think the problem is with the dataframe types; rdd.collect() and df.toJSON().collect() don't throw the error, so I've been ignoring it.