
Bigram count in PySpark


I'm trying to piece together a bigram-counting program in PySpark that takes a text file and outputs the frequency of each proper bigram (two consecutive words within a sentence).

ngram_df.select("bigrams") now contains:

+--------------------+
|             bigrams|
+--------------------+
|[April is, is the...|
|[It is, is one, o...|
|[April always, al...|
|[April always, al...|
|[April's flowers,...|
|[Its birthstone, ...|
|[The meaning, mea...|
|[April comes, com...|
|[It also, also co...|
|[April begins, be...|
|[April ends, ends...|
|[In common, commo...|
|[In common, commo...|
|[In common, commo...|
|[In years, years ...|
|[In years, years ...|
+--------------------+
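A column like this can be produced with NGram from pyspark.ml.feature. Since the original code isn't shown, here is only a minimal sketch of how ngram_df might have been built; the file path "input.txt" and the column names are assumptions, and split from pyspark.sql.functions is used instead of Tokenizer so that the capitalization seen above is preserved:

from pyspark.ml.feature import NGram
from pyspark.sql.functions import split

# Assumed setup: one sentence per row, in a string column called "sentence".
sentences = spark.read.text("input.txt").withColumnRenamed("value", "sentence")

# Split each sentence into words (preserving case), then slide a
# two-word window over each word list to form the bigrams.
words_df = sentences.withColumn("words", split("sentence", " "))
ngram = NGram(n=2, inputCol="words", outputCol="bigrams")
ngram_df = ngram.transform(words_df)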

So each sentence has a list of bigrams. Now the distinct bigrams need to be counted. How? Also, the whole code still seems unnecessarily verbose, so I'd be happy to see a more concise solution.

If you're already using the RDD API, you can simply follow these steps:

# Split the file into sentences, each sentence into words,
# then pair every word with its successor to form bigrams.
bigrams = text_file.flatMap(lambda line: line.split(".")) \
                   .map(lambda line: line.strip().split(" ")) \
                   .flatMap(lambda xs: zip(xs, xs[1:]))

# Classic word-count pattern: emit (bigram, 1) and sum the ones per key.
bigram_counts = bigrams.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
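To inspect the result, you could for example pull the most frequent pairs back to the driver; a small usage sketch (the cutoff of 10 is arbitrary):

# Ten most frequent bigrams, highest count first.
for pair, count in bigram_counts.takeOrdered(10, key=lambda kv: -kv[1]):
    print(pair, count)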
Otherwise, starting from the ngram_df you already have:

from pyspark.sql.functions import explode

# Flatten the per-sentence bigram lists into one row per bigram,
# then count the occurrences of each distinct value.
ngram_df.select(explode("bigrams").alias("bigram")).groupBy("bigram").count()
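To see the most common bigrams first, sort by the count column; a small usage sketch, where counts is just a name for the DataFrame returned above:

from pyspark.sql.functions import col, explode

counts = ngram_df.select(explode("bigrams").alias("bigram")).groupBy("bigram").count()
counts.orderBy(col("count").desc()).show()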