Python: counting distinct texts in a Spark RDD of array objects

python apache-spark pyspark

I have a Spark RDD (words) that consists of arrays of text. For example,

words.take(3)

returns something like:

[ ["A", "B"], ["B", "C"], ["C", "A", "D"] ]

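Now I want to find the total number of texts as well as the number of unique texts. If the RDD contained only the three records above, the result would be: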

total_words = 7
unique_words = 4 (only A, B, C, D)

To get the total, I did something like this:

text_count_rdd = words.map(lambda x: len(x))
text_count_rdd.sum()

But I am stuck on how to retrieve the unique count.

Just use flatMap, then take distinct and count:

words.flatMap(set).distinct().count()
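
To see what each step of that pipeline computes, here is a minimal plain-Python sketch that mirrors it on the three sample records from the question, with no Spark session required. The names records and flattened are illustrative only and do not come from the original post; the comments map each line to the RDD call it stands in for.

```python
from itertools import chain

# Sample records copied from the question; in Spark these would be the rows of `words`.
records = [["A", "B"], ["B", "C"], ["C", "A", "D"]]

# Stands in for flatMap(set): dedupe inside each record, then flatten into one stream.
flattened = list(chain.from_iterable(set(r) for r in records))

# Stands in for distinct().count(): dedupe across records and count what is left.
unique_words = len(set(flattened))

# Stands in for map(len).sum(): total number of texts, duplicates included.
total_words = sum(len(r) for r in records)

print(total_words, unique_words)  # prints: 7 4
```

Passing set to flatMap removes duplicates within each array before distinct runs, so distinct has less data to process. words.flatMap(lambda x: x).distinct().count() should return the same number.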