Recursively grouping sentences by conjunctions in Python
Tags: python, regex, string, text, recursion

I have a list of sentences, for example:
Sentence 1.
And Sentence 2.
Or Sentence 3.
New Sentence 4.
New Sentence 5.
And Sentence 6.
I'm trying to group these sentences based on a "conjunction criterion": if the next sentence starts with a conjunction (currently only "and" or "or"), I want to group them together, so that:
Group 1:
Sentence 1.
And Sentence 2.
Or Sentence 3.
Group 2:
New Sentence 4.
Group 3:
New Sentence 5.
And Sentence 6.
I wrote the code below, which somehow detects consecutive sentences, but not all of them.
How can I write this recursively? I tried writing it iteratively, but it fails in some cases, and I don't know how to express it as recursion.
import nltk  # requires the 'punkt' tokenizer data: nltk.download('punkt')

conjucture_list = ["and", "or"]

tokens = ["Sentence 1.", "And Sentence 2.", "Or Sentence 3.",
          "New Sentence 4.", "New Sentence 5.", "And Sentence 6."]
already_selected = []
attachlist = {}
for i in tokens:
    attachlist[i] = []
for i in range(len(tokens)):
    if i in already_selected:
        pass
    else:
        for j in range(i + 1, len(tokens)):
            if j not in already_selected:
                # attach the sentence if its first word is a conjunction
                first_word = nltk.tokenize.word_tokenize(tokens[j].lower())[0]
                if first_word in conjucture_list:
                    attachlist[tokens[i]].append(tokens[j])
                    already_selected.append(j)
                else:
                    break
Result:
[['Sentence 1.', 'And Sentence 2.', 'Or Sentence 3.'],
['New Sentence 4.'],
['New Sentence 5.', 'And Sentence 6.']]
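For comparison, the same iterative idea can be written without nltk in a few lines. This is a sketch of my own (the function name `groups_iterative` and the simple word-splitting are not from the question's code), producing the list-of-lists shown above:

```python
conjunctions = ("and", "or")

def groups_iterative(sentences):
    groups = []
    for s in sentences:
        first_word = s.split()[0].lower()
        # Extend the last group if the sentence begins with a conjunction,
        # otherwise start a new group.
        if groups and first_word in conjunctions:
            groups[-1].append(s)
        else:
            groups.append([s])
    return groups

tokens = ["Sentence 1.", "And Sentence 2.", "Or Sentence 3.",
          "New Sentence 4.", "New Sentence 5.", "And Sentence 6."]
groups_iterative(tokens)
```

Note that this compares whole first words, so it avoids matching "Andy…" or "Orwell…" by accident.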
I like nested iterators and generic code, so here's a super-generic approach:
import re

class split_by:
    def __init__(self, iterable, predicate=None):
        self.iter = iter(iterable)
        self.predicate = predicate or bool
        try:
            self.head = next(self.iter)
        except StopIteration:
            self.finished = True
        else:
            self.finished = False

    def __iter__(self):
        return self

    def _section(self):
        yield self.head
        for self.head in self.iter:
            if self.predicate(self.head):
                break
            yield self.head
        else:
            self.finished = True

    def __next__(self):
        if self.finished:
            raise StopIteration
        section = self._section()
        return section
[list(x) for x in split_by(tokens, lambda sentence: not re.match("(?i)or|and", sentence))]
#>>> [['Sentence 1.', 'And Sentence 2.', 'Or Sentence 3.'], ['New Sentence 4.'], ['New Sentence 5.', 'And Sentence 6.']]
It's longer, but it runs in O(1) space and takes a predicate of your choice. This problem is solved by iteration rather than recursion, because the output only needs one level of grouping. If you're looking for a recursive solution, please give an example that requires arbitrary levels of grouping.
def is_conjunction(sentence):
    return sentence.startswith('And') or sentence.startswith('Or')

tokens = ["Sentence 1.", "And Sentence 2.", "Or Sentence 3.",
          "New Sentence 4.", "New Sentence 5.", "And Sentence 6."]

def group_sentences_by_conjunction(sentences):
    result = []
    for s in sentences:
        if result and not is_conjunction(s):
            yield result  # flush the last group
            result = []
        result.append(s)
    if result:
        yield result  # flush the rest of the result buffer
>>> groups = group_sentences_by_conjunction(tokens)
Using yield is preferable when the result might not fit in memory, for example when reading all the sentences of a book stored in a file.
If for some reason you need the result as a list, use
>>> groups_list = list(groups)
Result:
[['Sentence 1.', 'And Sentence 2.', 'Or Sentence 3.'],
['New Sentence 4.'],
['New Sentence 5.', 'And Sentence 6.']]
If you need group numbers, use enumerate(groups).
is_conjunction has the same problem mentioned under the other answer; modify it as needed to fit your criteria. Why do you need recursion? Honestly, that's a fool's errand.

@unutbu I updated the code based on your point about sentences like "Andy…" or "Orwell…". But I don't think using result = [tokens[:1]] or handling an IndexError is a good idea. I'm not interested in malformed text, since the question body gives no instructions for handling such sentences. We might need to catch the error or ignore those sentences entirely, depending on the application.
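The enumerate(groups) suggestion can be used like this (a self-contained sketch repeating the generator from the answer, with is_conjunction tightened per the comment thread to require a trailing space):

```python
def is_conjunction(sentence):
    # Trailing space so that "Andy..." or "Orwell..." are not matched.
    return sentence.startswith('And ') or sentence.startswith('Or ')

def group_sentences_by_conjunction(sentences):
    result = []
    for s in sentences:
        if result and not is_conjunction(s):
            yield result  # flush the last group
            result = []
        result.append(s)
    if result:
        yield result  # flush the rest of the result buffer

tokens = ["Sentence 1.", "And Sentence 2.", "Or Sentence 3.",
          "New Sentence 4.", "New Sentence 5.", "And Sentence 6."]
for number, group in enumerate(group_sentences_by_conjunction(tokens), start=1):
    print("Group %d:" % number)
    for sentence in group:
        print("   ", sentence)
```

This prints the numbered groups in the same shape the question asks for (Group 1 through Group 3).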