
Python: Is there a way to remove duplicate and consecutive words/phrases in a string?


Is there a way to remove duplicate and consecutive words/phrases in a string? For example:

[in]:
foo foo bar foo bar

[out]:
foo bar foo bar

I tried this:

>>> s = 'this is a foo bar bar black sheep , have you any any wool woo , yes sir yes sir three bag woo wu wool'
>>> [i for i,j in zip(s.split(),s.split()[1:]) if i!=j]
['this', 'is', 'a', 'foo', 'bar', 'black', 'sheep', ',', 'have', 'you', 'any', 'wool', 'woo', ',', 'yes', 'sir', 'yes', 'sir', 'three', 'bag', 'woo', 'wu']
>>> " ".join([i for i,j in zip(s.split(),s.split()[1:]) if i!=j]+[s.split()[-1]])
'this is a foo bar black sheep , have you any wool woo , yes sir yes sir three bag woo wu'
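What happens when it gets more complicated and I want to remove phrases (assume a phrase consists of at most 5 words)? How can that be done? For example: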

[in]:
foo bar foo bar foo bar

[out]:
foo bar

Another example:

[in]:
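this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate . sentence are not phrases .

[out]:
this is a sentence where phrases duplicate . sentence are not phrases .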

You can use the re module for this:

>>> import re
>>> s = 'foo foo bar bar'
>>> re.sub(r'\b(.+)\s+\1\b', r'\1', s)
'foo bar'

>>> s = 'foo bar foo bar foo bar'
>>> re.sub(r'\b(.+)\s+\1\b', r'\1', s)
'foo bar foo bar'
If you want to match any number of consecutive occurrences:

>>> s = 'foo bar foo bar foo bar'
>>> re.sub(r'\b(.+)(\s+\1\b)+', r'\1', s)
'foo bar'    
Edit: an addition for the last example. To handle it, you have to keep calling re.sub for as long as duplicated phrases remain. So:

>>> s = 'this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate'
>>> while re.search(r'\b(.+)(\s+\1\b)+', s):
...   s = re.sub(r'\b(.+)(\s+\1\b)+', r'\1', s)
...
>>> s
'this is a sentence where phrases duplicate'
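The question limits phrases to at most 5 words, and the pattern above places no bound on the phrase length. A minimal sketch of a bounded variant follows; the function name collapse_repeats, the max_words parameter, and the whitespace-based token definition are assumptions made for illustration, not part of the original answer.

```python
import re

def collapse_repeats(text, max_words=5):
    """Collapse consecutively repeated phrases of up to max_words tokens."""
    # A phrase is 1..max_words whitespace-separated tokens (assumption).
    # The lookarounds keep matches aligned to whole tokens, so "," and "."
    # are treated like any other token.
    pattern = re.compile(
        r'(?<!\S)(\S+(?:\s+\S+){0,%d})(?:\s+\1)+(?!\S)' % (max_words - 1)
    )
    while True:
        # subn returns the new string and the number of substitutions made
        text, count = pattern.subn(r'\1', text)
        if count == 0:  # nothing left to collapse
            return text

print(collapse_repeats('foo bar foo bar foo bar'))  # foo bar
```

With max_words=1 only repeated single words are collapsed, so 'foo foo bar foo bar' becomes 'foo bar foo bar'.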
Output (from a remove_duplicate_words function whose definition is not shown):

In [7]: remove_duplicate_words(txt1)
Out[7]: 'this is a foo bar black sheep , have you any wool woo yes sir three bag wu'

In [8]: remove_duplicate_words(txt2)
Out[8]: 'this is a sentence where phrases duplicate'

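This fixes any number of adjacent duplicate words and works for both examples. I convert the string to a list, fix it, then convert it back to a string for output: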

mywords = "foo foo bar bar foo bar"
words = mywords.split()  # renamed from `list` to avoid shadowing the built-in

def remove_adjacent_dups(alist):
    result = []
    most_recent_elem = None
    for e in alist:
        # keep an element only if it differs from the one just before it
        if e != most_recent_elem:
            result.append(e)
            most_recent_elem = e
    return ' '.join(result)

print(remove_adjacent_dups(words))
Output:

foo bar foo bar

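I love itertools. It seems like every time I want to write something, itertools already has it. In this case, groupby takes a list and groups repeated, consecutive items from that list into tuples of (item_value, iterator_of_values). Use it here like: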

>>> from itertools import groupby
>>> s = 'this is a foo bar bar black sheep , have you any any wool woo , yes sir yes sir three bag woo wu wool'
>>> ' '.join(item[0] for item in groupby(s.split()))
'this is a foo bar black sheep , have you any wool woo , yes sir yes sir three bag woo wu wool'
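To make the (item_value, iterator_of_values) pairs concrete, here is a small sketch that counts the length of each consecutive run. The variable names words and runs are just illustrative.

```python
from itertools import groupby

words = 'foo foo bar bar foo bar'.split()

# Each item from groupby is (key, iterator over the consecutive run of that key).
runs = [(key, len(list(run))) for key, run in groupby(words)]
print(runs)  # [('foo', 2), ('bar', 2), ('foo', 1), ('bar', 1)]

# Keeping only the keys drops adjacent duplicates.
print(' '.join(key for key, _ in groupby(words)))  # foo bar foo bar
```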
Let's extend that with a function that returns a list with its duplicated values removed:

from itertools import chain, groupby

def dedupe(lst):
    return list(chain(*[item[0] for item in groupby(lst)]))
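This works fine for one-word phrases, but doesn't help with longer ones. What to do? Well, first we want to stride across our original phrase to check for longer phrases: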

def stride(lst, offset, length):
    if offset:
        yield lst[:offset]

    while True:
        yield lst[offset:offset + length]
        offset += length
        if offset >= len(lst):
            return
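As a quick illustration of what stride yields, the sketch below repeats the function and runs it on a made-up five-element list with two different offsets.

```python
def stride(lst, offset, length):
    # Leading partial chunk, so the remaining chunks start at `offset`
    if offset:
        yield lst[:offset]

    while True:
        yield lst[offset:offset + length]
        offset += length
        if offset >= len(lst):
            return

words = ['a', 'b', 'c', 'd', 'e']
print(list(stride(words, 0, 2)))  # [['a', 'b'], ['c', 'd'], ['e']]
print(list(stride(words, 1, 2)))  # [['a'], ['b', 'c'], ['d', 'e']]
```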
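Now we're cooking! OK. So our strategy is to first remove all the single-word duplicates. Next, we remove the two-word duplicates, starting at offset 0 and then 1. After that, three-word duplicates starting at offsets 0, 1, and 2, and so on until we've handled five-word duplicates: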

def cleanse(list_of_words, max_phrase_length):
    for length in range(1, max_phrase_length + 1):
        for offset in range(length):
            list_of_words = dedupe(stride(list_of_words, offset, length))

    return list_of_words
Putting it all together:

from itertools import chain, groupby

def stride(lst, offset, length):
    if offset:
        yield lst[:offset]

    while True:
        yield lst[offset:offset + length]
        offset += length
        if offset >= len(lst):
            return

def dedupe(lst):
    return list(chain(*[item[0] for item in groupby(lst)]))

def cleanse(list_of_words, max_phrase_length):
    for length in range(1, max_phrase_length + 1):
        for offset in range(length):
            list_of_words = dedupe(stride(list_of_words, offset, length))

    return list_of_words

a = 'this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate . sentence are not prhases .'

b = 'this is a sentence where phrases duplicate . sentence are not prhases .'

print(' '.join(cleanse(a.split(), 5)) == b)
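To see why each offset is tried, here is a small trace on a made-up input where the repeated two-word phrase does not start at index 0. It reuses stride as written above; dedupe is written with chain.from_iterable, which should behave the same as the chain(*[...]) form.

```python
from itertools import chain, groupby

def stride(lst, offset, length):
    if offset:
        yield lst[:offset]
    while True:
        yield lst[offset:offset + length]
        offset += length
        if offset >= len(lst):
            return

def dedupe(chunks):
    # Drop consecutive equal chunks, then flatten back to a list of words
    return list(chain.from_iterable(key for key, _ in groupby(chunks)))

words = 'x foo bar foo bar'.split()
for offset in range(2):
    chunks = list(stride(words, offset, 2))
    print(offset, chunks, '->', dedupe(chunks))
# offset 0 chunks as [x foo][bar foo][bar]: no equal neighbours, nothing removed
# offset 1 chunks as [x][foo bar][foo bar]: the repeat lines up and is removed
```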

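Personally, I don't think we need any other modules for this (although I admit some of them are great). I just managed it with simple loops, first converting the string into a list. I tried it on all the examples listed above. It works fine.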

sentence = input("Please enter your sentence:\n")

word_list = sentence.split()

def check_if_same(i, j):  # checks whether two adjacent slices of word_list are the same
    end = (2 * j) - i  # end point of the second slice (essentially j + phrase_len)
    return word_list[i:j] == word_list[j:end]

phrase_len = 1

while phrase_len <= len(word_list) // 2:  # checks the sentence for different phrase lengths

    curr_word_index = 0

    while curr_word_index < len(word_list):  # checks all the words of the sentence for the specified phrase length

        if check_if_same(curr_word_index, curr_word_index + phrase_len):
            del word_list[curr_word_index : curr_word_index + phrase_len]  # deletes the repeated phrase
        else:
            curr_word_index += 1

    phrase_len += 1

print("Answer: " + ' '.join(word_list))
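The same loop can also be wrapped in a function so it can be reused and tested without console input. This is only a sketch of that idea; the name remove_repeated_phrases is chosen here for illustration.

```python
def remove_repeated_phrases(sentence):
    """Remove consecutively repeated phrases, shortest phrase length first."""
    words = sentence.split()
    phrase_len = 1
    while phrase_len <= len(words) // 2:
        i = 0
        while i < len(words):
            j = i + phrase_len
            # Compare the phrase at i with the phrase immediately after it
            if words[i:j] == words[j:j + phrase_len]:
                del words[i:j]  # drop the first copy and re-check at i
            else:
                i += 1
        phrase_len += 1
    return ' '.join(words)

print(remove_repeated_phrases('foo bar foo bar foo bar'))  # foo bar
```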
Using a pattern similar to sharcashmo's, you can call subn in a while loop; it also returns the number of replacements:

import re

txt = r'this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate . sentence are not phrases .'

pattern = re.compile(r'(\b\w+(?: \w+)*)(?: \1)+\b')
repl = r'\1'

res = txt

while True:
    res, nbr = pattern.subn(repl, res)  # nbr is the number of substitutions made
    if nbr == 0:
        break

print(res)
The while loop stops when there are no more replacements.


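With this approach you get all overlapping matches (which is not possible in a single pass in a replacement context) without testing the same pattern twice.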
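For reference, a tiny sketch of what subn returns, using a simplified single-word pattern and a made-up input string:

```python
import re

# Simplified pattern: one word followed by one or more repeats of itself
pattern = re.compile(r'\b(\w+)(?: \1\b)+')

# subn returns a (new_string, number_of_substitutions) tuple
result, count = pattern.subn(r'\1', 'a a b b c')
print(result, count)  # a b c 2
```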

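Verbose but itertooly =)
Hey! I'm sure this could be shortened, and parts of it could be one-liners, but I wanted to strike a balance between brevity and readability. I hope I succeeded. :-)
Clever answer, +1. But won't it run into performance problems if applied to very large strings?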