How to eliminate repeated sequences of strings in Python

Tags: python, python-3.x, regex

I have a complicated task: removing repeated consecutive words or sentences. Here is a sample input:

The
The Up
The Up next
The Up next we
The Up next we bring
The Up next we bring you
The Up next we bring you a
The Up next we bring you a rebroadcast
The Up next we bring you a rebroadcast of
The Up next we bring you a rebroadcast of.
of. The
of. The Diane
of. The Diane Rehm
of. The Diane Rehm radio
of. The Diane Rehm radio talk
of. The Diane Rehm radio talk show
of. The Diane Rehm radio talk show.
The Diane Rehm radio talk show. The
The Diane Rehm radio talk show. The program
The Diane Rehm radio talk show. The program is
The Diane Rehm radio talk show. The program is heard
The Diane Rehm radio talk show. The program is heard over
The Diane Rehm radio talk show. The program is heard over W.A.M.
The Diane Rehm radio talk show. The program is heard over W.A.M. you
The program is heard over W.A.M. you F.M.
The program is heard over W.A.M. you F.M. on
The program is heard over W.A.M. you F.M. on the
The program is heard over W.A.M. you F.M. on the campus
The program is heard over W.A.M. you F.M. on the campus of
The program is heard over W.A.M. you F.M. on the campus of the
The program is heard over W.A.M. you F.M. on the campus of the American
F.M. on the campus of the American University
F.M. on the campus of the American University in
F.M. on the campus of the American University in the
F.M. on the campus of the American University in the nation's
F.M. on the campus of the American University in the nation's capital
F.M. on the campus of the American University in the nation's capital.
University in the nation's capital. The
University in the nation's capital. The special
University in the nation's capital. The special Martin
University in the nation's capital. The special Martin Luther
University in the nation's capital. The special Martin Luther King
University in the nation's capital. The special Martin Luther King Day
University in the nation's capital. The special Martin Luther King Day show
The special Martin Luther King Day show recorded
The special Martin Luther King Day show recorded Monday
The special Martin Luther King Day show recorded Monday.
recorded Monday. Focused
recorded Monday. Focused on
recorded Monday. Focused on race
recorded Monday. Focused on race relations
recorded Monday. Focused on race relations.
Focused on race relations. Ms
Focused on race relations. Ms Rames
Focused on race relations. Ms Rames guests
Focused on race relations. Ms Rames guests were
Focused on race relations. Ms Rames guests were Eleanor
Focused on race relations. Ms Rames guests were Eleanor Holmes
Ms Rames guests were Eleanor Holmes Norton
Ms Rames guests were Eleanor Holmes Norton.
The current output is below:

The Up next we bring you a rebroadcast of.
of. The Diane Rehm radio talk show.
The Diane Rehm radio talk show. The program is heard over W.A.M. you
The program is heard over W.A.M. you F.M. on the campus of the American
F.M. on the campus of the American University in the nation's capital.
University in the nation's capital. The special Martin Luther King Day show
The special Martin Luther King Day show recorded Monday.
recorded Monday. Focused on race relations.
Focused on race relations. Ms Rames guests were Eleanor Holmes
Ms Rames guests were Eleanor Holmes Norton.
As you can see, even after this processing there are still repetitions, such as:

The Up next we bring you a rebroadcast of.
of. The Diane Rehm radio talk show.
The Diane Rehm radio talk show. The program is heard over W.A.M. you
The program is heard over W.A.M. you F.M. on the campus of the American
I just want something like this:

The Up next we bring you a rebroadcast of.
The Diane Rehm radio talk show.
The program is heard over W.A.M. you F.M. on the campus of the American
University in the nation's capital. The special Martin Luther King Day show
recorded Monday. Focused on race relations.
...etc
How can I accomplish this task?

Current code:

import os

def load_and_discard(file_path):
        """
        Load and discard previous substrings.

        Args:
                file_path (PathLike): path to data file

        Returns:
                list[str]
        """
        data = []
        with open("./input/" + file_path) as f:  # use the parameter, not the global
                for i, line in enumerate(f):
                        st = line.strip()
                        if i > 0 and st.startswith(data[-1]):
                                data[-1] = st
                        elif len(st) > 0:  # guard against empty string
                                data.append(st)
        return data

def find_lebms(s1, s2):
        """
        Linear search for the longest-end-begin-matching-substring (LEBMS).

        Args:
                s1 (str): 1st stripped str (match the end)
                s2 (str): 2nd stripped str (match the begin)

        Returns:
                int: length of LEBMS
        """

        # search up to this length
        n1 = min(len(s1), len(s2))

        for i in range(1, n1+1):
                if s1[-i:] == s2[:i]:
                        return i
                else:
                        return 0


def remove_repeated_substr(data):
        """
        Generate strings (in-place) ready for concatenation by
        removing the repeated substring in the first string.

        Args:
                data (list[str]): list of strings

        Returns:
                None
        """

        n0 = len(data)
        for i, st in enumerate(data):

                # guard: no chopping for the last line
                if i == n0 - 1:
                        break

                # chop the current row
                n = find_lebms(st, data[i + 1])
                if n > 0:  # guard against n = 0
                        data[i] = st[:-n]

directory = './input'
for filename in os.listdir(directory):

        infile_path = filename

        data = load_and_discard(infile_path)
        remove_repeated_substr(data)

        # (optional) prevent un-spaced ending periods
        for i, st in enumerate(data):
                if st[-1] == ".":
                        data[i] += " "

        ans = "\n".join(data)
        with open("./output/"+filename, "w") as text_file:
                text_file.write(ans)
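One possible reason the overlaps survive: if the indentation shown above is what actually runs, `find_lebms` returns from inside the first loop iteration, so only an overlap of length 1 is ever tested. Below is a minimal sketch of an overlap search that tries every length, longest first. The names `longest_overlap` and `chop_overlaps` are hypothetical and are not part of the code above.

```python
def longest_overlap(s1, s2):
    """Length of the longest suffix of s1 that is also a prefix of s2."""
    for i in range(min(len(s1), len(s2)), 0, -1):
        if s1[-i:] == s2[:i]:
            return i
    return 0  # reached only after every length has been tried


def chop_overlaps(lines):
    """Return a copy of lines where each line loses its overlap with the next."""
    out = list(lines)
    for i in range(len(out) - 1):
        n = longest_overlap(out[i], out[i + 1])
        if n > 0:
            out[i] = out[i][:-n]
    return out


sample = [
    "The Up next we bring you a rebroadcast of.",
    "of. The Diane Rehm radio talk show.",
    "The Diane Rehm radio talk show. The program is heard over W.A.M. you",
]
result = chop_overlaps(sample)
print("".join(result))
```

After chopping, a line can consist only of leftover whitespace (the second sample line is reduced to `"of. "`), so any later step that indexes `st[-1]` should guard against empty or blank strings.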
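If you prefer, you can use my output as your input (if that is easier), so that you do not have to deal with the repeated lines. Whether you use the original input or my output as your input is entirely up to you, but please let me know which one you used when you post.

Alternative input: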

You can watch a representative.
Twenty three zero seven of the Rayburn Office Building.
Washington D.C. each week. C.-SPAN
Washington D.C. each week. C.-SPAN breaks
Washington D.C. each week. C.-SPAN breaks from
Washington D.C. each week. C.-SPAN breaks from its
Washington D.C. each week. C.-SPAN breaks from its public
Washington D.C. each week. C.-SPAN breaks from its public affairs
C.-SPAN breaks from its public affairs programming
C.-SPAN breaks from its public affairs programming to
C.-SPAN breaks from its public affairs programming to give
C.-SPAN breaks from its public affairs programming to give the
C.-SPAN breaks from its public affairs programming to give the viewer
C.-SPAN breaks from its public affairs programming to give the viewer updated schedule information.
Join us at eight o'clock A.M. Eastern five o'clock A.M. Pacific Time.
Six thirty P.M. Eastern three thirty P.M. Pacific Time.
Eight o'clock P.M. Eastern five o'clock P.M. Pacific Time.
One o'clock A.M. Eastern ten o'clock P.M. Pacific Time. As always C.-SPAN
P.M. Pacific Time. As always C.-SPAN scheduled
P.M. Pacific Time. As always C.-SPAN scheduled programming
As always C.-SPAN scheduled programming is preempted by live coverage of the U.S. House of Representatives.
Going on this election year.
Covering every issue in the campaign calendar.
The calendar list the network's plans for campaign.
From now through election day.
In addition to election coverage.
Other major events are cameras record.
Call toll free one eight hundred three four six. Her it to order the C.-SPAN
four six. Her it to order the C.-SPAN update for
Her it to order the C.-SPAN update for twenty four dollars.
You can use your credit card or will be glad to send you a bill.
Call one eight hundred three four six eight hundred.
And you'll receive fifty issues of the C.-SPAN update.
If you order an update subscription now.
The receive a free gift. The C.-SPAN road to the White House
The C.-SPAN road to the White House poster is twenty two by twenty eight inch pen and ink drawing.
Attractively depicts the spans grassroots approach to the campaign called.

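You can use this regex, which relies on a lookahead and a backreference to match overlapping duplicates and remove them:

(\b[-\w\s.']+?)(?=[\s.]+\1)[\s.]+
Use an empty string as the replacement.

Code: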

import re

# Double quotes are needed because the pattern itself contains an apostrophe.
s = re.sub(r"(\b[-\w\s.']+?)(?=[\s.]+\1)[\s.]+", '', s)
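RegEx details:

  • (: start capture group #1
    • \b: word boundary
    • [-\w\s.']+?: lazily match 1+ word characters, whitespace, hyphens, dots, or apostrophes
  • ): end capture group #1
  • (?=[\s.]+\1): positive lookahead asserting that, after 1+ whitespace/dot characters, the value captured in group #1 occurs again
  • [\s.]+: match 1+ whitespace or dot characters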
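As a quick check of the breakdown above, here is a minimal sketch that applies the same pattern to three lines taken from the question. The names `OVERLAP`, `sample`, and `cleaned` are only illustrative.

```python
import re

# Pattern from the answer; written with double quotes because it
# contains an apostrophe inside the character class.
OVERLAP = re.compile(r"(\b[-\w\s.']+?)(?=[\s.]+\1)[\s.]+")

sample = (
    "The Up next we bring you a rebroadcast of.\n"
    "of. The Diane Rehm radio talk show.\n"
    "The Diane Rehm radio talk show. The program is heard"
)

# Each match is the first copy of a repeated run plus the separators
# after it; replacing it with "" leaves only the second copy.
cleaned = OVERLAP.sub("", sample)
print(cleaned)
```

One caveat worth keeping in mind: the backreference is not followed by a word boundary, so a word followed by a longer word that starts with it also matches. For example, `OVERLAP.sub("", "on one")` gives `"one"`.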

To keep multiple lines, you can use two substitutions:

s = re.sub(r"(\b[-\w\s.']+?)(?=[\s.]+\1)[\s.]+", '\n', s)
s = re.sub(r'\A\n+|(?<=[^.] )\n+|\n+(?=\n)|\n+\Z', '', s)
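A minimal sketch of the two-step variant, wrapped in a helper so it can be tried on a short sample. The function name `dedupe_keep_lines` is hypothetical; the two patterns are the ones given above, with the first written in double quotes so the apostrophe does not end the string.

```python
import re


def dedupe_keep_lines(text):
    # Step 1: replace the first copy of each repeated run (plus its
    # trailing separators) with a newline instead of deleting it.
    text = re.sub(r"(\b[-\w\s.']+?)(?=[\s.]+\1)[\s.]+", "\n", text)
    # Step 2: drop newlines at the very start or end, newlines that
    # follow a non-dot character and a space (a break in mid-sentence),
    # and all but the last newline in a run of newlines.
    return re.sub(r"\A\n+|(?<=[^.] )\n+|\n+(?=\n)|\n+\Z", "", text)


sample = (
    "The Up next we bring you a rebroadcast of.\n"
    "of. The Diane Rehm radio talk show.\n"
    "The Diane Rehm radio talk show. The program is heard"
)
result = dedupe_keep_lines(sample)
print(result)
```

On this sample the newline that lands in the middle of a sentence is removed, while the one that follows `of. ` is kept, so the text splits after a sentence-ending period.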
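
If I understand correctly, you want to remove the repeated lines (for example, remove "I am" and keep "I am a fish"). Maybe you could use a tree data structure and store each word as a node in the tree. That is an efficient way to find repeated sequences in linear time, but it assumes that the repeated sequences start the same way.

Yes, but it is slightly different. You can see that the input contains
The Up next we bring you a rebroadcast of. of. The Diane Rehm radio talk show.
where "of" is repeated twice, at the end and at the beginning of two different lines. If possible, the result should somehow be
The Up next we bring you a rebroadcast of. The Diane Rehm radio talk show.

Well, if every sentence is a continuation of the previous one, I think the simplest thing to do is to check whether it contains a substring of the previous sentence.

Thanks for your solution! What if I want to add a line break after each sentence? How can I do that? Thank you very much!

Thank you very much for your answer. Could you take a look at the updated input? It seems the code does not work in all cases.

Can you check this demo with your new code? Input: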