Python PySpark DataFrame: long format to wide format
I have a list of customers who purchased items:
# sc is the SparkContext that the pyspark shell provides
rdd = sc.parallelize([('A','Item1'), ('A','Item3'), ('B','Item1'), ('B','Item2')])
df = rdd.toDF(['Person','Item'])
df.show()
+------+-----+
|Person| Item|
+------+-----+
|     A|Item1|
|     A|Item3|
|     B|Item1|
|     B|Item2|
+------+-----+
Now I want to convert it to wide format using PySpark. The result should look like this:
+------+-----+-----+-----+
|Person|Item1|Item2|Item3|
+------+-----+-----+-----+
|     A|    1|    0|    0|
|     A|    0|    0|    1|
|     B|    1|    0|    0|
|     B|    0|    1|    0|
+------+-----+-----+-----+
Do you know how to do this?
Best regards,
Felix

I actually found a solution:
>>> df.crosstab('Person', 'Item').show()
+-----------+-----+-----+-----+
|Person_Item|Item1|Item2|Item3|
+-----------+-----+-----+-----+
|          A|    1|    0|    1|
|          B|    1|    1|    0|
+-----------+-----+-----+-----+
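
To make explicit what the counts in the crosstab output represent, here is a minimal plain-Python sketch (not Spark) that builds the same contingency table from the (Person, Item) pairs. The helper name crosstab and the list-of-lists return shape are assumptions for illustration only; they are not part of the PySpark API.

```python
from collections import Counter

# Same (Person, Item) pairs as in the question.
rows = [('A', 'Item1'), ('A', 'Item3'), ('B', 'Item1'), ('B', 'Item2')]

def crosstab(pairs):
    """Hypothetical helper: count each (row_key, col_key) pair and lay the
    counts out with one row per row_key and one column per col_key."""
    counts = Counter(pairs)
    row_keys = sorted({r for r, _ in pairs})
    col_keys = sorted({c for _, c in pairs})
    header = ['Person_Item'] + col_keys
    # Counter returns 0 for combinations that never occur.
    body = [[r] + [counts[(r, c)] for c in col_keys] for r in row_keys]
    return header, body

header, body = crosstab(rows)
print(header)
for line in body:
    print(line)
```

Each row is one person and each cell is the number of times that person bought that item, so the result has one row per person rather than one row per purchase. Within Spark, df.groupBy('Person').pivot('Item').count() is a commonly used alternative; it leaves null where crosstab reports 0, so a fill step such as .na.fill(0) is typically added.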