Apache Spark: rdd.toDF() fails, but spark.createDataFrame(rdd) works
I have an RDD of the form RDD[(String, List(Tuple))], as shown below:
[(u'C1589::HG02922', [(83779208, 2), (677873089, 0), ...]
When I run the following code to convert it to a DataFrame, spark.createDataFrame(rdd) works fine, but rdd.toDF() fails.
vector_df1 = spark.createDataFrame(vector_rdd) # Works fine.
vector_df1.show()
+--------------+--------------------+
| _1| _2|
+--------------+--------------------+
|C1589::HG02922|[[83779208,2], [6...|
| HG00367|[[83779208,0], [6...|
| C477::HG00731|[[83779208,0], [6...|
| HG00626|[[83779208,0], [6...|
| HG00622|[[83779208,0], [6...|
...
vector_df2 = vector_rdd.toDF() # Tosses the error.
The error raised is:
Traceback (most recent call last):
File "/tmp/7ff0f62d-d849-4884-960f-bb89b5f3dd80/ml_on_vds.py", line 47, in <module>
vector_df2 = vector_rdd.toDF().show()
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 57, in toDF
File "/usr/lib/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 1124, in __call__
File "/usr/lib/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 1094, in _build_args
File "/usr/lib/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py", line 289, in get_command_part
AttributeError: 'PipelinedRDD' object has no attribute '_get_object_id'
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [7ff0f62d-d849-4884-960f-bb89b5f3dd80] entered state [ERROR] while waiting for [DONE].
As requested, here is more of the code that produces the error:
vector_rdd = (indexed_df.rdd.map(lambda r: (r[0], (r[3], r[2])))
              .groupByKey()
              .mapValues(lambda l: Vectors.sparse((max_index + 1), list(l))))
vector_df = spark.createDataFrame(vector_rdd, ['s', 'features']) # Works
vector_df1 = vector_rdd.toDF()
vector_df1.show() # Fails
indexed_df is a DataFrame with the schema:
StructType(List(StructField(s,StringType,true),StructField(variant_hash,IntegerType,false),StructField(call,IntegerType,true),StructField(index,DoubleType,true)))
and it looks like this:
+--------------+------------+----+-----+
| s|variant_hash|call|index|
+--------------+------------+----+-----+
|C1046::HG02024| -60010252| 0|225.0|
|C1046::HG02025| -60010252| 1|225.0|
|C1046::HG02026| -60010252| 0|225.0|
|C1047::HG00731| -60010252| 0|225.0|
|C1047::HG00732| -60010252| 1|225.0|
|C1047::HG00733| -60010252| 0|225.0|
|C1048::HG02024| -60010252| 0|225.0|
|C1048::HG02025| -60010252| 1|225.0|
|C1048::HG02026| -60010252| 0|225.0|
|C1049::HG00731| -60010252| 0|225.0|
|C1049::HG00732| -60010252| 1|225.0|
|C1049::HG00733| -60010252| 0|225.0|
|C1050::HG03006| -60010252| 0|225.0|
|C1051::HG03642| -60010252| 0|225.0|
|C1589::HG02922| -60010252| 2|225.0|
|C1589::HG03006| -60010252| 0|225.0|
|C1589::HG03052| -60010252| 2|225.0|
|C1589::HG03642| -60010252| 0|225.0|
|C1589::NA12878| -60010252| 1|225.0|
|C1589::NA19017| -60010252| 1|225.0|
+--------------+------------+----+-----+
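The toDF method is executed under SparkSession in Spark 2.x and under SQLContext in 1.x versions.
So a session has to exist before toDF is available on an RDD: after spark = SparkSession(sc), hasattr(rdd, "toDF") should be true.
If you are using Scala, you need to import spark.implicits._ where spark is the SparkSession object that you created.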
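Hope this helps.

I have added how I initialize the SparkSession to the bottom of the script. Shouldn't that give me access to the toDF() method?

If you are on Scala, you need to import spark.implicits._

I am using Python. I have included my imports.

Have you tried vectorrdd.map(lambda x: (x, )).toDF()?

@koiralo You could edit the Scala part of the solution and add the following to make it clearer: "If you are in Scala, you need to import spark.implicits._ where spark is the SparkSession".

I am having a hard time understanding the situation. Could you give me a slightly larger sample of the RDD so that I can build a test example to work on this?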
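from pyspark.sql import SparkSession

spark = SparkSession(sc)  # sc is the existing SparkContext
hasattr(rdd, "toDF")  # expected to be True once the session exists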
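
To illustrate why creating the session matters, here is a minimal sketch of the pattern the answer describes: a session constructor attaches toDF to the RDD class, and toDF simply delegates to createDataFrame. It is a toy model that runs without pyspark. ToyRDD and ToySession are hypothetical names invented for this illustration, not pyspark classes, and the real implementation differs in detail.

```python
# Toy model only: ToyRDD and ToySession are hypothetical stand-ins,
# not pyspark classes. They mimic the "session attaches toDF" pattern.


class ToyRDD:
    """Stand-in for an RDD; it defines no toDF method of its own."""

    def __init__(self, rows):
        self.rows = list(rows)


class ToySession:
    """Stand-in for a session whose constructor patches toDF onto ToyRDD."""

    def __init__(self):
        session = self

        def toDF(rdd, names=None):
            # Delegate to the session, the way rdd.toDF() hands off
            # to createDataFrame.
            return session.createDataFrame(rdd, names)

        # Attach to the class, so every existing instance gains the method.
        ToyRDD.toDF = toDF

    def createDataFrame(self, rdd, names=None):
        width = len(rdd.rows[0]) if rdd.rows else 0
        # Default column names mirror the _1, _2 seen in the question.
        names = names or ["_%d" % (i + 1) for i in range(width)]
        return [dict(zip(names, row)) for row in rdd.rows]


rdd = ToyRDD([("C1589::HG02922", 2), ("HG00367", 0)])
before = hasattr(rdd, "toDF")  # checked before any session exists
session = ToySession()
after = hasattr(rdd, "toDF")   # checked after the constructor ran
frame = rdd.toDF()
```

In this model both entry points produce the same result, which is why a difference between them in a real job usually points at how, or whether, the session was set up rather than at the data itself.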