Bigram counting in PySpark
I am trying to piece together a bigram counting program in PySpark that takes a text file and outputs the frequency of each proper bigram (two consecutive words within a sentence).
ngram_df.select("bigrams") now contains:
+--------------------+
| bigrams|
+--------------------+
|[April is, is the...|
|[It is, is one, o...|
|[April always, al...|
|[April always, al...|
|[April's flowers,...|
|[Its birthstone, ...|
|[The meaning, mea...|
|[April comes, com...|
|[It also, also co...|
|[April begins, be...|
|[April ends, ends...|
|[In common, commo...|
|[In common, commo...|
|[In common, commo...|
|[In years, years ...|
|[In years, years ...|
+--------------------+
So each sentence now has its own list of bigrams. How do I count the distinct bigrams across all sentences? Also, the whole code still seems unnecessarily verbose, so I would be happy to see a more concise solution.

If you already use the RDD API, you can simply follow these steps:
# split the text into sentences, each sentence into words,
# and emit every pair of adjacent words as a bigram tuple
bigrams = text_file.flatMap(lambda line: line.split(".")) \
    .map(lambda line: line.strip().split(" ")) \
    .flatMap(lambda xs: (tuple(x) for x in zip(xs, xs[1:])))

# count occurrences of each distinct bigram
bigrams.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
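To see what each stage of that RDD pipeline produces, here is the same logic in plain Python on a small made-up snippet (the sample text is an assumption, not from the question's input file):

```python
from collections import Counter

# hypothetical input text, standing in for the text file's contents
text = "April is the fourth month. It is one of four months with 30 days."

# same steps as the RDD pipeline: split into sentences,
# split each sentence into words, pair adjacent words
sentences = [s.strip().split(" ") for s in text.split(".") if s.strip()]
bigrams = [pair for words in sentences for pair in zip(words, words[1:])]

# equivalent of map((x, 1)) + reduceByKey(add): tally each distinct bigram
counts = Counter(bigrams)
print(counts[("April", "is")])
```

Each `(word1, word2)` tuple plays the role of the key that `reduceByKey` aggregates on.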
Otherwise:
from pyspark.sql.functions import explode
ngram_df.select(explode("bigrams").alias("bigram")).groupBy("bigram").count()