
Python: Error converting a Pandas DataFrame to a Spark DataFrame


I basically have a dataframe with two columns. It is currently a Spark DataFrame and looks like this:

recommender_sdf.show(5,truncate=False)
+------+-------------------------+
|TRANS |ITEM                     |
+------+-------------------------+
|163589|How to Motivate Employees|
|373053|How to Motivate Employees|
|280169|How to Motivate Employees|
|495281|How to Motivate Employees|
|3498  |How to Motivate Employees|
+------+-------------------------+
There are basically two columns: TRANS is the person's ID, and ITEM is the name of the video that person watched.

I want to cross-tabulate this dataset so that each observed item becomes its own column, with the total count as that column's value.
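To make the target shape concrete, here is a toy illustration using pandas (the IDs are made up and the second title is hypothetical; pandas is used only for the illustration):

import pandas as pd

# Toy data: one row per (person, video-view); second title invented for contrast.
toy = pd.DataFrame({
    'TRANS': [163589, 163589, 373053],
    'ITEM': ['How to Motivate Employees',
             'How to Motivate Employees',
             'Effective Meetings'],
})

# One row per TRANS, one column per ITEM, cell = view count:
print(pd.crosstab(toy['TRANS'], toy['ITEM']))
# ITEM    Effective Meetings  How to Motivate Employees
# TRANS
# 163589                   0                          2
# 373053                   1                          0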

I first tried Spark's DataFrame crosstab function, but it fails with the following Java heap error:

Py4JJavaError: An error occurred while calling o72.crosstab.
: java.lang.OutOfMemoryError: Java heap space
    at org.apache.spark.sql.catalyst.expressions.GenericMutableRow.<init>(rows.scala:252)
    at org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$4.apply(StatFunctions.scala:123)
    at org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$4.apply(StatFunctions.scala:122)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
    at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
    at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
    at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
    at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.AbstractTraversable.map(Traversable.scala:105)
    at org.apache.spark.sql.execution.stat.StatFunctions$.crossTabulate(StatFunctions.scala:122)
    at org.apache.spark.sql.DataFrameStatFunctions.crosstab(DataFrameStatFunctions.scala:133)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:745)
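For reference, the stack trace points at DataFrameStatFunctions.crosstab, so the failing call was presumably of the form sketched below (the exact line is not quoted in the post). crossTabulate assembles the entire contingency table in driver memory, which is the likely reason the heap fills up once ITEM has many distinct values; a groupBy/pivot aggregation, available from Spark 1.6, keeps the counting distributed:

# Presumed failing call, per the DataFrameStatFunctions.crosstab frames above:
recommender_sct = recommender_sdf.stat.crosstab('TRANS', 'ITEM')

# Distributed alternative (Spark 1.6+): pivot on ITEM and count. Note that
# missing TRANS/ITEM combinations come back as null rather than 0.
recommender_counts = recommender_sdf.groupBy('TRANS').pivot('ITEM').count()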

So I converted the Spark DataFrame to a pandas DataFrame and created the crosstab in pandas:

# Creating a new pandas dataframe for cross-tab
recommender_pct=pd.crosstab(recommender_pdf['TRANS'], recommender_pdf['ITEM'])

Converting the pandas DataFrame back to a Spark DataFrame then fails with a UnicodeEncodeError:

# Creating a new Spark dataframe for cross-tab from the pandas data frame
recommender_sct=sqlContext.createDataFrame(recommender_pct)

UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-31-234fe3fbf3e5> in <module>()
      1 # Creating a new dataframe for cross-tab
----> 2 recommender_sct=sqlContext.createDataFrame(recommender_pct)

/Users/i854319/spark/python/pyspark/sql/context.pyc in createDataFrame(self, data, schema, samplingRatio)
    423             rdd, schema = self._createFromRDD(data, schema, samplingRatio)
    424         else:
--> 425             rdd, schema = self._createFromLocal(data, schema)
    426         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
    427         jdf = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json())

/Users/i854319/spark/python/pyspark/sql/context.pyc in _createFromLocal(self, data, schema)
    331         if has_pandas and isinstance(data, pandas.DataFrame):
    332             if schema is None:
--> 333                 schema = [str(x) for x in data.columns]
    334             data = [r.tolist() for r in data.to_records(index=False)]
    335 

UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 1: ordinal not in range(128)

Any help?

Comments:

May I suggest that you write the pandas DataFrame out to a file (CSV, Parquet) and try reading it with Spark?

Yes, or ditch Spark and move to Dask.

@AlbertoBonsanto Quick questions: 1) any idea why I get the error above from Spark's crosstab function? 2) Is writing the pandas DataFrame to a file and then importing that file into Spark failing because of a problem in pandas (I am able to import other pandas DataFrames correctly), or because of this same issue? Also, the spark-csv package doesn't work with my Spark; can you help me convert the pandas DataFrame to another format that Spark can read? Thanks a lot.
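A minimal sketch of the file round-trip suggested in the first comment. Since the asker reports that spark-csv is unavailable, the sketch uses JSON lines, which Spark reads without extra packages; the path is only an example, and lines=True needs pandas 0.19 or newer:

# Round-trip through a JSON-lines file (one JSON object per line).
# reset_index() keeps TRANS as a column, since pd.crosstab put it in the index.
recommender_pct.reset_index().to_json('/tmp/recommender_pct.json',
                                      orient='records', lines=True)
recommender_sct = sqlContext.read.json('/tmp/recommender_pct.json')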
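As for the UnicodeEncodeError itself: the traceback fails at schema = [str(x) for x in data.columns], i.e. PySpark calls str() on every pandas column name, and under Python 2 that breaks as soon as a name contains a non-ASCII character; here the offending u'\u201c' is evidently a curly quote from one of the ITEM titles, which became column names after the crosstab. A minimal workaround sketch, assuming Python 2 as the u'' literal indicates, is to force the column names to ASCII before converting:

# Force every column name to plain ASCII so the str(x) call inside
# _createFromLocal cannot raise; non-ASCII characters become '?'.
recommender_pct = recommender_pct.reset_index()  # keep TRANS as a column
recommender_pct.columns = [unicode(c).encode('ascii', 'replace')
                           for c in recommender_pct.columns]
recommender_sct = sqlContext.createDataFrame(recommender_pct)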