Bigram counting in PySpark
I am trying to piece together a bigram counting program in PySpark that takes a text file and outputs the frequency of each proper bigram (two consecutive words within a sentence).
ngram_df.select("bigrams") now contains:
+--------------------+
| bigrams|
+--------------------+
|[April is, is the...|
|[It is, is one, o...|
|[April always, al...|
|[April always, al...|
|[April's flowers,...|
|[Its birthstone, ...|
|[The meaning, mea...|
|[April comes, com...|
|[It also, also co...|
|[April begins, be...|
|[April ends, ends...|
|[In common, commo...|
|[In common, commo...|
|[In common, commo...|
|[In years, years ...|
|[In years, years ...|
+--------------------+
So each sentence now has its own list of bigrams. How do I count the distinct bigrams across all sentences? Also, the whole code still seems unnecessarily verbose, so I would be happy to see a more concise solution.

If you already use the RDD API, you can simply follow these steps:
# split the text into sentences, each sentence into words,
# and emit every pair of adjacent words as a bigram tuple
bigrams = text_file.flatMap(lambda line: line.split(".")) \
    .map(lambda line: line.strip().split(" ")) \
    .flatMap(lambda xs: (tuple(x) for x in zip(xs, xs[1:])))

# count occurrences of each distinct bigram
bigrams.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
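To see what each stage of that RDD pipeline produces, here is the same logic in plain Python on a small made-up snippet (the sample text is an assumption, not from the question's input file):

```python
from collections import Counter

# hypothetical input text, standing in for the text file's contents
text = "April is the fourth month. It is one of four months with 30 days."

# same steps as the RDD pipeline: split into sentences,
# split each sentence into words, pair adjacent words
sentences = [s.strip().split(" ") for s in text.split(".") if s.strip()]
bigrams = [pair for words in sentences for pair in zip(words, words[1:])]

# equivalent of map((x, 1)) + reduceByKey(add): tally each distinct bigram
counts = Counter(bigrams)
print(counts[("April", "is")])
```

Each `(word1, word2)` tuple plays the role of the key that `reduceByKey` aggregates on.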
Otherwise:
from pyspark.sql.functions import explode
ngram_df.select(explode("bigrams").alias("bigram")).groupBy("bigram").count()