Python Spark word count
I just want to count words in Spark (PySpark), but my code ends up mapping either the individual letters or the whole string. I have tried the whole string:
v1='Hi hi hi bye bye bye word count'
v1_temp=sc.parallelize([v1])
v1_map = v1_temp.flatMap(lambda x: x.split('\t'))
v1_counts = v1_map.map(lambda x: (x, 1))
v1_counts.collect()
Or just the letters:
v1='Hi hi hi bye bye bye word count'
v1_temp=sc.parallelize(v1)
v1_map = v1_temp.flatMap(lambda x: x.split('\t'))
v1_counts = v1_map.map(lambda x: (x, 1))
v1_counts.collect()
When you call sc.parallelize(sequence), you are creating an RDD that will be operated on in parallel. In the first case, the sequence is a list with a single element: the whole sentence. In the second case, the sequence is a string, which in Python behaves like a list of characters.
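A quick way to see the difference (a minimal sketch, assuming an active SparkContext sc and the v1 string from the question):
v1 = 'Hi hi hi bye bye bye word count'
sc.parallelize([v1]).count()  # 1  -- a single element: the whole sentence
sc.parallelize(v1).count()    # 31 -- one element per character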
If you want to count the words in parallel, you can do the following:
from operator import add
s = 'Hi hi hi bye bye bye word count'
seq = s.split() # ['Hi', 'hi', 'hi', 'bye', 'bye', 'bye', 'word', 'count']
sc.parallelize(seq)\
    .map(lambda word: (word, 1))\
    .reduceByKey(add)\
    .collect()
which will give you:
[('count', 1), ('word', 1), ('bye', 3), ('hi', 2), ('Hi', 1)]
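Note that 'Hi' and 'hi' are counted separately because the comparison is case-sensitive. If that is not what you want, a lower-casing map merges them (a sketch under the same assumptions):
sc.parallelize(seq)\
    .map(lambda word: (word.lower(), 1))\
    .reduceByKey(add)\
    .collect()
# e.g. [('count', 1), ('word', 1), ('bye', 3), ('hi', 3)]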
If you only want to count alphanumeric words, this could be a solution:
import re
from pyspark import SparkContext, SparkConf

def linesToWordsFunc(line):
    # split on whitespace, strip non-word characters, keep non-empty words
    wordsList = line.split()
    wordsList = [re.sub(r'\W+', '', word) for word in wordsList]
    filtered = filter(lambda word: re.match(r'\w+', word), wordsList)
    return list(filtered)

def wordsToPairsFunc(word):
    return (word, 1)

def reduceToCount(a, b):
    return a + b

def main():
    conf = SparkConf().setAppName("Words count").setMaster("local")
    sc = SparkContext(conf=conf)
    rdd = sc.textFile("your_file.txt")
    words = rdd.flatMap(linesToWordsFunc)
    pairs = words.map(wordsToPairsFunc)
    counts = pairs.reduceByKey(reduceToCount)
    # Get the top 100 words by count (Python 3: no tuple unpacking in lambdas)
    output = counts.takeOrdered(100, lambda kv: -kv[1])
    for (word, count) in output:
        print(word + ': ' + str(count))
    sc.stop()

if __name__ == "__main__":
    main()
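As a quick sanity check, linesToWordsFunc can be exercised without Spark; the sample line below is made up for illustration:
print(linesToWordsFunc("Hello, world! It's 2x faster... #spark"))
# ['Hello', 'world', 'Its', '2x', 'faster', 'spark']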
There are many versions of word count online; here is one of them:
# count the words in a file (hdfs:///, file:///, or a local file "./samplefile.txt")
rdd = sc.textFile(filename)
# or you can initialize with your list
v1 = 'Hi hi hi bye bye bye word count'
rdd = sc.parallelize([v1])
wordcounts = rdd.flatMap(lambda l: l.split(' ')) \
    .map(lambda w: (w, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .map(lambda wc: (wc[1], wc[0])) \
    .sortByKey(ascending=False)  # swap to (count, word) so sortByKey sorts by count
output = wordcounts.collect()
for (count, word) in output:
    print("%s: %i" % (word, count))
The problem here isn't Spark at all: you are splitting on tab with split('\t'), when what you need is a plain split(), which splits on whitespace.
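To make this concrete, compare the two splits on the question's string:
v1 = 'Hi hi hi bye bye bye word count'
v1.split('\t')  # ['Hi hi hi bye bye bye word count'] -- no tabs, so one chunk
v1.split()      # ['Hi', 'hi', 'hi', 'bye', 'bye', 'bye', 'word', 'count']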