Python: error when converting a pandas DataFrame to a Spark DataFrame
I basically have a DataFrame with two columns. It is currently a Spark DataFrame and looks like this:
recommender_sdf.show(5,truncate=False)
+------+-------------------------+
|TRANS |ITEM |
+------+-------------------------+
|163589|How to Motivate Employees|
|373053|How to Motivate Employees|
|280169|How to Motivate Employees|
|495281|How to Motivate Employees|
|3498 |How to Motivate Employees|
+------+-------------------------+
There are basically two columns: TRANS is a person's ID, and ITEM is the name of the video that person watched.

I want to cross-tabulate this dataset so that each observed item becomes its own column, with its total count as that column's value.

I first tried the Spark DataFrame crosstab function, but it fails with the Java heap error below:
Py4JJavaError: An error occurred while calling o72.crosstab.
: java.lang.OutOfMemoryError: Java heap space
at org.apache.spark.sql.catalyst.expressions.GenericMutableRow.<init>(rows.scala:252)
at org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$4.apply(StatFunctions.scala:123)
at org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$4.apply(StatFunctions.scala:122)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.execution.stat.StatFunctions$.crossTabulate(StatFunctions.scala:122)
at org.apache.spark.sql.DataFrameStatFunctions.crosstab(DataFrameStatFunctions.scala:133)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
Any help?

Comments:

- May I suggest writing the pandas frame to a file (csv, parquet) and trying to read it with Spark? Yes, or drop Spark entirely and move to Dask.
- @AlbertoBonsanto Quick questions: 1) Any idea why I get the Spark crosstab error above? 2) Is writing from pandas to a file and then importing that file into Spark failing because of something on the pandas side (I can import other pandas DataFrames correctly), or is it related to this problem?
- Hi, the spark-csv package doesn't work with my Spark install. Could you help me convert the pandas DataFrame to another format that Spark can read? Thanks a lot.
# Creating a new pandas dataframe for cross-tab
recommender_pct=pd.crosstab(recommender_pdf['TRANS'], recommender_pdf['ITEM'])
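For reference, on a small made-up sample `pd.crosstab` produces exactly the wide shape being aimed for, with TRANS values as the index and one column per ITEM (the sample rows below are illustrative, not the original data):

```python
import pandas as pd

# Illustrative sample with the same two-column shape as recommender_pdf.
recommender_pdf = pd.DataFrame({
    "TRANS": [163589, 373053, 163589],
    "ITEM": ["How to Motivate Employees", "How to Motivate Employees", "Leading Teams"],
})

# One row per TRANS, one column per observed ITEM, counts as values.
recommender_pct = pd.crosstab(recommender_pdf["TRANS"], recommender_pdf["ITEM"])
print(recommender_pct)
```

Note that `pd.crosstab` puts TRANS into the index, not a column, and the ITEM titles become column names, which matters for the error below.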
Converting the pandas DataFrame back to a Spark DataFrame:
# Creating a new Spark dataframe for cross-tab from Pandas data frame
recommender_sct=sqlContext.createDataFrame(recommender_pct)
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-31-234fe3fbf3e5> in <module>()
1 # Creating a new dataframe for cross-tab
----> 2 recommender_sct=sqlContext.createDataFrame(recommender_pct)
/Users/i854319/spark/python/pyspark/sql/context.pyc in createDataFrame(self, data, schema, samplingRatio)
423 rdd, schema = self._createFromRDD(data, schema, samplingRatio)
424 else:
--> 425 rdd, schema = self._createFromLocal(data, schema)
426 jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
427 jdf = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
/Users/i854319/spark/python/pyspark/sql/context.pyc in _createFromLocal(self, data, schema)
331 if has_pandas and isinstance(data, pandas.DataFrame):
332 if schema is None:
--> 333 schema = [str(x) for x in data.columns]
334 data = [r.tolist() for r in data.to_records(index=False)]
335
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 1: ordinal not in range(128)
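The traceback points at `schema = [str(x) for x in data.columns]`: on Python 2, `str()` tries to ASCII-encode each column name, and after the crosstab the column names are the ITEM titles, at least one of which contains a curly quote (`u'\u201c'`). A sketch of a workaround is to sanitize the column names to ASCII before calling `createDataFrame` (the `to_ascii` helper and the sample frame below are illustrative assumptions, not from the original code):

```python
import unicodedata
import pandas as pd

def to_ascii(name):
    # Normalize, then drop any character that can't be ASCII-encoded,
    # e.g. the curly quote u'\u201c' that trips str() in the traceback.
    normalized = unicodedata.normalize("NFKD", u"%s" % name)
    return normalized.encode("ascii", "ignore").decode("ascii")

# Illustrative frame with one problematic column name.
recommender_pct = pd.DataFrame({u"\u201cQuoted\u201d Title": [1], u"Plain Title": [2]})
recommender_pct.columns = [to_ascii(c) for c in recommender_pct.columns]

print(list(recommender_pct.columns))
```

With sanitized names, `sqlContext.createDataFrame(recommender_pct)` should no longer hit the ascii-codec `UnicodeEncodeError`; alternatively, upgrading to Python 3 (where `str` is Unicode) sidesteps this particular failure.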