Python: Creating a DataFrame from a dictionary using an RDD in PySpark

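I have a dictionary named Word_Counts. Its keys are words and its values are the number of times each word occurs in the text. My goal is to convert it into a DataFrame with two columns, word and count.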

items = list(Word_Counts.items())[:5]
items
Output:

[('Akdeniz’in', 14), ('en', 13287), ('büyük', 3168), ('deniz', 1276), ('festivali:', 6)]

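When I build an RDD with sc.parallelize, I noticed that it drops all the values and keeps only the keys. As a result, the table I create contains only the keys. Please let me know how to build a DataFrame from a dictionary using an RDD.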

rdd1 = sc.parallelize(Word_Counts)
Df_Hur = spark.read.json(rdd1)
rdd1.take(5)
Output:

['Akdeniz’in', 'en', 'büyük', 'deniz', 'festivali:']

Df_Hur.show(5)
+---------------+
|_corrupt_record|
+---------------+
|     Akdeniz’in|
|             en|
|          büyük|
|          deniz|
|     festivali:|
+---------------+
My goal is:

   word       count
  Akdeniz’in    14
  en            13287
  büyük         3168
  deniz         1276
  festivali:    6

You can pass word_count.items() directly to parallelize:

df_hur=sc.parallelize(word_count.items()).toDF(['word','count'])
df_hur.show()
Output:

+----------+-----+
|      word|count|
+----------+-----+
|Akdeniz’in|   14|
|        en|13287|
|     büyük| 3168|
|     deniz| 1276|
|festivali:|    6|
+----------+-----+
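
As for why the original attempt lost the counts: iterating over a Python dict yields only its keys, so sc.parallelize(Word_Counts) distributes bare words. spark.read.json then tries to parse each word as a JSON document, fails, and puts it in _corrupt_record. The sketch below shows the difference in plain Python, using the five sample items from the question. The to_dataframe helper is a hypothetical name, not part of PySpark, and it assumes you already have a SparkSession to pass in. It uses createDataFrame with an explicit schema string as one possible alternative to toDF.

```python
# Sample data: the first five items shown in the question.
word_counts = {
    "Akdeniz’in": 14,
    "en": 13287,
    "büyük": 3168,
    "deniz": 1276,
    "festivali:": 6,
}

# Iterating over a dict yields only its keys. This is what
# sc.parallelize(Word_Counts) ends up distributing.
keys_only = list(word_counts)

# .items() yields (key, value) tuples, so both columns survive.
pairs = list(word_counts.items())

print(keys_only)
print(pairs[:2])


def to_dataframe(spark, counts):
    """Hypothetical helper: build a (word, count) DataFrame from a dict.

    `spark` is assumed to be an existing SparkSession.
    """
    rows = list(counts.items())
    # A DDL-style schema string names and types both columns explicitly.
    return spark.createDataFrame(rows, schema="word string, count long")
```

With a session available, to_dataframe(spark, Word_Counts).show(5) should give the same two-column table as the toDF version. The explicit schema only pins the column types instead of leaving them to inference.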