Python pyspark: sliding across different lines in a txt file

I need to find all 3-gram shingles in a txt file (sports articles with a title and body text) in a MapReduce fashion. However, the txt file is formatted as

This is the title
Content is here on the next line.
This is another line.
If I use sc.textFile() without any further processing, text = sc.textFile().collect() looks like this:

['This is the title',
 '',
 'Content is here on the next line.',
 '',
 'This is another line.']
So the text file is read as multiple separate lines, and the 3-gram shingles are computed per line, like this:

[['This is the',
  'is the title'],
 [],
 ['Content is here', 
  'is here on',
  'here on the',
  'on the next',
  'the next line.'],
 [],
 ['This is another',
  'is another line.']]
when I use the map function:
text.map(shingling)
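
The question does not show shingling itself. A minimal sketch of what such a helper might look like, assuming word-level shingles and plain whitespace tokenization (the k parameter and the function body are assumptions, not the asker's actual code):

```python
def shingling(line, k=3):
    """Return every run of k consecutive words in `line`, in order."""
    # split() with no argument collapses repeated whitespace and
    # returns [] for an empty string.
    words = line.split()
    # If the line has fewer than k words, the range is empty,
    # so the result is an empty list.
    return [' '.join(words[i:i + k]) for i in range(len(words) - k + 1)]


print(shingling('This is the title'))  # ['This is the', 'is the title']
print(shingling(''))                   # []
```

Because it only ever sees one line at a time, a helper like this cannot produce shingles that span a line break, which is why the per-line map gives the output above.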

What I want is:

['This is the',
 'is the title',
 'the title Content',
 'title Content is',
 ......]

I would like to know whether there is a function I can use for this, or how I should modify my code to achieve it.

You may need to combine the lines into a single string first, using code like the following:

rdd = sc.textFile('text')

rdd2 = sc.parallelize([rdd.fold('', lambda x, y: x + ' ' + y)]).map(shingling)

>>> rdd2.collect()
[['This is the', 'is the title', 'the title Content', 'title Content is',
  'Content is here', 'is here on', 'here on the', 'on the next', 'the next line.',
  'next line. This', 'line. This is', 'This is another', 'is another line.']]
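
To see why joining first gives the cross-line shingles, the same idea can be tried locally without Spark. This is only a sketch: functools.reduce stands in for rdd.fold, the lines list mirrors the collect() output from the question, and shingling is the hypothetical whitespace-based helper assumed earlier, not code from the original post.

```python
from functools import reduce


def shingling(line, k=3):
    """Hypothetical helper: every run of k consecutive words in `line`."""
    words = line.split()  # collapses repeated whitespace
    return [' '.join(words[i:i + k]) for i in range(len(words) - k + 1)]


# Same content as the collect() output in the question.
lines = ['This is the title',
         '',
         'Content is here on the next line.',
         '',
         'This is another line.']

# Local stand-in for rdd.fold('', lambda x, y: x + ' ' + y).
joined = reduce(lambda x, y: x + ' ' + y, lines, '')

# The joined string has a leading space and double spaces where the
# blank lines were; split() inside shingling ignores them.
shingles = shingling(joined)
print(shingles[:4])
```

Note that this approach gathers the whole file into one string on the driver, so it is a reasonable fit for individual articles but probably not for very large files.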
