
Python: pyspark EOFError after calling map

I'm new to Spark & PySpark.

I'm reading a small csv file (~40k) into a dataframe:

from pyspark.sql import functions as F
from pyspark.sql import Row                  # needed for the map below
from pyspark.mllib.linalg import Vectors     # Spark 1.6-era (mllib) Vectors

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('/tmp/sm.csv')
df = df.withColumn('verified', F.when(df['verified'] == 'Y', 1).otherwise(0))
df2 = df.map(lambda x: Row(label=float(x[0]), features=Vectors.dense(x[1:]))).toDF()
I get some strange errors. They don't happen every time, but they do happen quite often:

>>> df2.show(1)
+--------------------+---------+
|            features|    label|
+--------------------+---------+
|[0.0,0.0,0.0,0.0,...|4700734.0|
+--------------------+---------+
only showing top 1 row

>>> df2.count()
41999                                                                           
>>> df2.show(1)
+--------------------+---------+
|            features|    label|
+--------------------+---------+
|[0.0,0.0,0.0,0.0,...|4700734.0|
+--------------------+---------+
only showing top 1 row

>>> df2.count()
41999                                                                           
>>> df2.show(1)
Traceback (most recent call last):
  File "spark-1.6.1/python/lib/pyspark.zip/pyspark/daemon.py", line 157, in manager
  File "spark-1.6.1/python/lib/pyspark.zip/pyspark/daemon.py", line 61, in worker    
  File "spark-1.6.1/python/lib/pyspark.zip/pyspark/worker.py", line 136, in main
    if read_int(infile) == SpecialLengths.END_OF_STREAM:
  File "spark-1.6.1/python/lib/pyspark.zip/pyspark/serializers.py", line 545, in read_int
    raise EOFError
EOFError
+--------------------+---------+
|            features|    label|
+--------------------+---------+
|[0.0,0.0,0.0,0.0,...|4700734.0|
+--------------------+---------+
only showing top 1 row
Once the EOFError has been raised, I won't see it again until I do something that requires interacting with the Spark server.

When I call df2.count() it shows the [Stage xxx] progress prompt, which is what I mean by it being sent to the Spark server. Anything I do with df2 that triggers that seems to eventually raise the EOFError again.


This doesn't seem to happen with df (as opposed to df2), so it looks like something is going on with the df.map() line.

Please try doing the map after converting the dataframe to an RDD. You are applying the map function to a dataframe and then creating a dataframe from it again:

df.rdd.map().toDF()
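
Filling in that pattern with the lambda from the question (a sketch; it assumes the same Row and Vectors imports as above), the only change is mapping over df.rdd explicitly instead of mapping the DataFrame itself:

# Same transformation as in the question, but mapped over df.rdd instead of df
df2 = df.rdd.map(lambda x: Row(label=float(x[0]), features=Vectors.dense(x[1:]))).toDF()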

Please let me know if it works. Thanks.

I believe you are running Spark 2.x or above. The code below should create the dataframe from the csv:

df = spark.read.format("csv").option("header", "true").load("csvfile.csv")
Then you can use the following code:

df = df.withColumn('verified', F.when(df['verified'] == 'Y', 1).otherwise(0))
Then you can create df2 without Row and toDF(); one way to do this is sketched below.
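
This is only a sketch, not the answerer's exact code: it assumes the label lives in a column named label and that every other column is a numeric feature, which may not match the real layout of sm.csv. It uses VectorAssembler to build the features column instead of constructing Row objects and calling toDF():

from pyspark.ml.feature import VectorAssembler
from pyspark.sql import functions as F

# Assumed layout: a 'label' column plus numeric feature columns; adjust to the real csv
feature_cols = [c for c in df.columns if c != 'label']
assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')
df2 = assembler.transform(df).select(F.col('label').cast('double').alias('label'), 'features')
df2.show(1)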


Let me know if this works, or whether you are using Spark 1.6… thanks.

I heard on the Spark user list that this message is a bit too verbose and can be ignored.

Pete, can you point us to the archives? I searched the Spark user list but couldn't find anything about the EOFError :(

I think the problem is with the dataframe type; rdd.collect() or df.toJSON().collect() don't throw the error, so I'm ignoring it.
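
For reference, the workarounds mentioned in that last comment would look roughly like this (a sketch; both paths reportedly did not raise the error in this session):

# Pull results through the RDD or JSON path instead of show()
rows = df2.rdd.collect()            # collect Row objects via the RDD
json_rows = df2.toJSON().collect()  # or collect the rows as JSON strings
print(rows[0])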