
Python pyspark error: AttributeError: 'SparkSession' object has no attribute 'parallelize'


I am using pyspark in a Jupyter notebook. Here are the steps of my Spark setup:

import findspark
findspark.init(spark_home='/home/edamame/spark/spark-2.0.0-bin-spark-2.0.0-bin-hadoop2.6-hive', python_path='python2.7')

import pyspark
from pyspark.sql import *

sc = pyspark.sql.SparkSession.builder.master("yarn-client").config("spark.executor.memory", "2g").config('spark.driver.memory', '1g').config('spark.driver.cores', '4').enableHiveSupport().getOrCreate()

sqlContext = SQLContext(sc)
Then when I do:

spark_df = sqlContext.createDataFrame(df_in)
where df_in is a pandas dataframe, I get the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-9-1db231ce21c9> in <module>()
----> 1 spark_df = sqlContext.createDataFrame(df_in)


/home/edamame/spark/spark-2.0.0-bin-spark-2.0.0-bin-hadoop2.6-hive/python/pyspark/sql/context.pyc in createDataFrame(self, data, schema, samplingRatio)
    297         Py4JJavaError: ...
    298         """
--> 299         return self.sparkSession.createDataFrame(data, schema, samplingRatio)
    300 
    301     @since(1.3)

/home/edamame/spark/spark-2.0.0-bin-spark-2.0.0-bin-hadoop2.6-hive/python/pyspark/sql/session.pyc in createDataFrame(self, data, schema, samplingRatio)
    520             rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
    521         else:
--> 522             rdd, schema = self._createFromLocal(map(prepare, data), schema)
    523         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
    524         jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())

/home/edamame/spark/spark-2.0.0-bin-spark-2.0.0-bin-hadoop2.6-hive/python/pyspark/sql/session.pyc in _createFromLocal(self, data, schema)
    400         # convert python objects to sql data
    401         data = [schema.toInternal(row) for row in data]
--> 402         return self._sc.parallelize(data), schema
    403 
    404     @since(2.0)

AttributeError: 'SparkSession' object has no attribute 'parallelize'
---------------------------------------------------------------------------

Does anyone know what I did wrong? Thanks!

SparkSession is not a replacement for SparkContext but an equivalent of SQLContext. Just use it the same way you used to use SQLContext:

spark.createDataFrame(...)

If you ever have to access the SparkContext, use the sparkContext attribute:

spark.sparkContext

So if you need a SQLContext for backwards compatibility, you can:

SQLContext(sparkContext=spark.sparkContext, sparkSession=spark)
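
Putting the answer together, a minimal corrected version of the setup from the question might look as follows. The spark_home path and resource settings are copied from the question; renaming the variable from sc to spark is a suggestion, since the object is a SparkSession, not a SparkContext:

import findspark
findspark.init(spark_home='/home/edamame/spark/spark-2.0.0-bin-spark-2.0.0-bin-hadoop2.6-hive',
               python_path='python2.7')

from pyspark.sql import SparkSession, SQLContext

# Build the session under the name `spark` rather than `sc`,
# since it is a SparkSession, not a SparkContext.
spark = (SparkSession.builder
         .master("yarn-client")
         .config("spark.executor.memory", "2g")
         .config("spark.driver.memory", "1g")
         .config("spark.driver.cores", "4")
         .enableHiveSupport()
         .getOrCreate())

# Create DataFrames directly on the session:
spark_df = spark.createDataFrame(df_in)  # df_in: the pandas DataFrame from the question

# Only needed if legacy code still expects a SQLContext:
sqlContext = SQLContext(sparkContext=spark.sparkContext, sparkSession=spark)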

Then I tried spark_df = sc.createDataFrame(df_in), but spark_df seems to be corrupted. Is spark_df = sc.createDataFrame(df_in) the right way to do the conversion here?

Only if df_in is a valid argument for createDataFrame.

df_in is a pandas dataframe, so I would think it should work? Thanks @zero323
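
For reference, a minimal sketch of the pandas-to-Spark conversion discussed in these comments; the toy df_in contents and column names here are invented for illustration:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-in for the question's df_in (hypothetical toy data).
df_in = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# SparkSession.createDataFrame accepts a pandas DataFrame directly
# and infers the schema from it.
spark_df = spark.createDataFrame(df_in)
spark_df.show()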