How to find vowels in an RDD in PySpark

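I need to find the number of vowels in each word. I have written the code below, but I am not getting the expected output. Can anyone help with this case?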

from pyspark import SparkContext,SparkConf

conf = SparkConf().setAppName("find vowel counnt").setMaster("local[*]")
sc = SparkContext()
inputRDD=sc.textFile("file:///home/vikram/data/vowel.txt")

inputRDD.collect()

['vikram is best person']

flatRDD = inputRDD.flatMap(lambda x : x.split(" "))
flatRDD.collect()

['vikram', 'is', 'best', 'person']

vowels='aeiou'

def vowel_check(flatRDD, vowels):
    final=[x for x in flatRDD.collect() if x in vowels]
    print(len(final))
    print(final)

vowel_check(flatRDD,vowels)
You can use the regex function findall and count its matches. This performs the count and produces a tuple of each word with its vowel count:

import re
flatRDD.map(lambda l: (l, len(re.findall('[aeiou]', l)))).collect()
This produces:

[('vikram', 2), ('is', 1), ('best', 1), ('person', 2)]
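
The same per-word count can also be sketched without regex. The helper name `count_vowels` and the case-insensitive matching are assumptions added for illustration (the regex above matches lowercase vowels only). The function is plain Python, so it could be passed to `flatRDD.map` in place of the lambda:

```python
# Hypothetical helper, not part of the original answer.
VOWELS = set("aeiou")

def count_vowels(word):
    """Return (word, number of vowels in word), ignoring case."""
    return (word, sum(1 for ch in word.lower() if ch in VOWELS))

# Local check on the sample words. With Spark this would be:
#   flatRDD.map(count_vowels).collect()
words = ["vikram", "is", "best", "person"]
result = [count_vowels(w) for w in words]
print(result)
```

Lowercasing before the membership test means a capitalized word such as "Vikram" is counted the same way as "vikram".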

The output should pair each word with its vowel count, like this: Vikram-2, is-1, best-1, person-2.
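
If the word-count form requested in that comment is wanted, the collected tuples can be joined into a string on the driver. A minimal sketch, assuming `pairs` holds the list returned by `collect()`; the variable names are illustrative:

```python
# Assumed input: the (word, count) tuples returned by collect().
pairs = [("vikram", 2), ("is", 1), ("best", 1), ("person", 2)]

# Join each tuple as "word-count", separated by commas.
formatted = ", ".join(f"{word}-{count}" for word, count in pairs)
print(formatted)
```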