In PySpark, how do I check whether consecutive words in a text file start with the same letter?
Tags: python, apache-spark, pyspark

My file contains the following text:

Horrid Henry’s hound hunts in the massive Murree mountains. While silly stupid Samuel’s dark dreadful dragon likes to hunt in
skies.
Horrid Henry’s hound and Samuel’s dreadful dragon Dany are fast friends and like to hunt and play together. They call themselves
fantastic fanciful foursome.
I load this file and split it into words as follows:

lines=sc.textFile("BigData test.txt")
RddWords=lines.flatMap(lambda line: line.split(" "))
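This converts it into a list of strings (each word is one string). I want to check whether three consecutive words start with the same letter. The expected output is: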

H => 3
M => 1
S => 1
D => 1
F => 1
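Runs of three consecutive words starting with "H" occur 3 times. Similarly, a run of consecutive words starting with "M" occurs only once. The individual occurrences are listed below: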

Horrid Henry’s hound => 2
Henry’s hound hunts => 1
massive Murree mountains => 1
silly stupid Samuel’s => 1
dreadful dragon Dany => 1
fantastic fanciful foursome => 1
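I can write a Python function that checks three consecutive words in a string, but I can't figure out how to apply that function to the parallel RDD named RddWords. If I write a map function, it is applied to each element x of RddWords separately. How do I handle consecutive words? Can someone point me in the right direction? Many thanks.

Solution 1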

You need to turn each line into rolling trigrams:

(word0, word1, word2)
(word1, word2, word3)
...
Then map a function f over the trigrams that extracts the information you need.
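
The two steps above could be sketched as follows. This is only one possible reading of the answer: the helper names trigrams, shared_initial and count_shared_initials are made up for illustration, the comparison is assumed to be case-insensitive (so "massive Murree mountains" counts), and windows are built per line, so a run that spans a line break is not detected.

```python
from operator import add


def trigrams(line):
    """Return every window of three consecutive whitespace-separated words."""
    words = line.split()
    return list(zip(words, words[1:], words[2:]))


def shared_initial(trigram):
    """Return the upper-cased first letter if all three words share it, else None."""
    initials = {word[0].upper() for word in trigram}
    return initials.pop() if len(initials) == 1 else None


def count_shared_initials(lines_rdd):
    """Count qualifying trigrams per letter.

    lines_rdd is assumed to be an RDD of text lines, e.g. the result of
    sc.textFile(...). Returns an RDD of (letter, count) pairs.
    """
    return (
        lines_rdd.flatMap(trigrams)          # one record per rolling trigram
        .map(shared_initial)                 # this plays the role of f
        .filter(lambda letter: letter is not None)
        .map(lambda letter: (letter, 1))
        .reduceByKey(add)
    )
```

With the question's variables this would be called as count_shared_initials(lines).collect(). On the sample text it also picks up "dark dreadful dragon", which the question's hand tally does not list, so the count for D may differ from the expected output shown there.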

Solution 2


Use the DataFrame API and apply a rolling window function of length 3.

Thank you very much for pointing me in the right direction. I have now built an RDD as follows:
words = lines.map(lambda line: line.split()).flatMap(lambda xs: (tuple(x) for x in zip(xs, xs[1:], xs[2:])))
It looks like this:
[('Horrid', 'Henry’s', 'hound'), ('Henry’s', 'hound', 'hunts'), ('hound', 'hunts', 'in'), ..., ('They', 'call', 'themselves'), ..., ('fantastic', 'fanciful', 'foursome.')]
I think I can solve the problem from here. Many thanks.
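
For comparison, the rolling-window idea from Solution 2 might look roughly like the sketch below. The answer gives no code, so everything here is an assumption: the function name initial_run_counts is hypothetical, spark is assumed to be an existing SparkSession, and the window is partitioned per line, so runs that cross a line break are again missed. The pyspark import is placed inside the function so that the definition itself loads without Spark.

```python
def initial_run_counts(spark, path):
    """Count, per letter, the windows of three consecutive words sharing an initial.

    spark: an existing SparkSession (assumed). path: the text file to read.
    Returns a DataFrame with the columns `initial` and `count`.
    """
    from pyspark.sql import Window, functions as F

    words = (
        spark.read.text(path)  # one row per line, in a column named "value"
        .withColumn("line_id", F.monotonically_increasing_id())
        .select("line_id", F.posexplode(F.split("value", r"\s+")).alias("pos", "word"))
        .where(F.col("word") != "")
        .withColumn("initial", F.upper(F.substring("word", 1, 1)))
    )

    # Within each line, order words by position so lead() sees the next words.
    w = Window.partitionBy("line_id").orderBy("pos")

    return (
        words
        .withColumn("next1", F.lead("initial", 1).over(w))
        .withColumn("next2", F.lead("initial", 2).over(w))
        .where((F.col("initial") == F.col("next1")) & (F.col("initial") == F.col("next2")))
        .groupBy("initial")
        .count()
    )
```

Calling initial_run_counts(spark, "BigData test.txt").show() would print one row per letter. The last two words of each line have null lookahead values and are dropped by the filter, which mirrors how zip(xs, xs[1:], xs[2:]) stops two words before the end.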