Removing the overlap between two text blocks with Python


I have two text files that overlap slightly, i.e.:

text1 = """Some of the major unsolved problems in physics are theoretical, meaning that existing theories seem incapable of explaining a certain observed phenomenon or experimental result. The others are experimental, meaning that there is a difficulty in creating an experiment to test a proposed theory or investigate a phenomenon in"""

text2 = """theory or investigate a phenomenon in greater detail.There are still some deficiencies in the Standard Model of physics, such as the origin of mass, the strong CP problem, neutrino mass, matter–antimatter asymmetry, and the nature of dark matter and dark energy."""

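As you can see, the last sentence of text1 and the first sentence of text2 overlap slightly. I now want to get rid of this overlap, essentially by removing from text2 the strings that also appear in the last sentence of text1.

To do this, I can extract the last sentence of text1: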

text1_last_sentence = list(filter(None,text1.split(".")))[-1]

and the first sentence of text2:

text2_first_sentence = text2.split(".")[0]

... but now the question is:

How do I find the part of text2's first sentence that should stay in text2, and put everything back together?

Edit 1:

Expected output:

text1 = """Some of the major unsolved problems in physics are theoretical, meaning that existing theories seem incapable of explaining a certain observed phenomenon or experimental result. The others are experimental, meaning that there is a difficulty in creating an experiment to test a proposed theory or investigate a phenomenon in"""

text2 = """greater detail.There are still some deficiencies in the Standard Model of physics, such as the origin of mass, the strong CP problem, neutrino mass, matter–antimatter asymmetry, and the nature of dark matter and dark energy."""

Edit 2:

Here is the complete code:

text1 = """Some of the major unsolved problems in physics are theoretical, meaning that existing theories seem incapable of explaining a certain observed phenomenon or experimental result. The others are experimental, meaning that there is a difficulty in creating an experiment to test a proposed theory or investigate a phenomenon in"""
text2 = """theory or investigate a phenomenon in greater detail.There are still some deficiencies in the Standard Model of physics, such as the origin of mass, the strong CP problem, neutrino mass, matter–antimatter asymmetry, and the nature of dark matter and dark energy.""" 

text1_last_sentence = list(filter(None,text1.split(".")))[-1]
text2_first_sentence = text2.split(".")[0]

print(text1_last_sentence, "\n")
print(text2_first_sentence, "\n")

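 The others are experimental, meaning that there is a difficulty in creating an experiment to test a proposed theory or investigate a phenomenon in

theory or investigate a phenomenon in greater detail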

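This is a bit hacky, but it works:

text1 = """Some of the major unsolved problems in physics are theoretical, meaning that existing theories seem incapable of explaining a certain observed phenomenon or experimental result. The others are experimental, meaning that there is a difficulty in creating an experiment to test a proposed theory or investigate a phenomenon in"""
text2 = """theory or investigate a phenomenon in greater detail.There are still some deficiencies in the Standard Model of physics, such as the origin of mass, the strong CP problem, neutrino mass, matter–antimatter asymmetry, and the nature of dark matter and dark energy."""
text1_ls = list(filter(None, text1.split(".")))[-1]
text2_fs = text2.split(".")[0]
temp2 = text2_fs.split(" ")
for i in range(1, len(temp2)):
    # grow the prefix of text2_fs word by word until it is no longer found in text1_ls
    if " ".join(temp2[:i]) not in text1_ls:
        text2_fs = " ".join(temp2[(i-1):])
        break
print(text1_ls, "\n")
print(text2_fs, "\n")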

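Basically, you take progressively larger substrings from the start of text2_fs until one is no longer a substring of text1_ls, which tells you that the last word of that substring is the first word that does not occur in text1_ls.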
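The same idea can be wrapped in a function. This is only a minimal sketch: the name `drop_leading_overlap` and the choice to return an empty string when the whole sentence overlaps are assumptions, not part of the answer above. Note that the substring test is not anchored to the end of the first text, so a prefix that merely occurs somewhere inside it is also treated as overlap.

```python
def drop_leading_overlap(last_sentence, first_sentence):
    """Return first_sentence without the leading words that also occur in last_sentence."""
    words = first_sentence.split(" ")
    for i in range(1, len(words) + 1):
        # Grow the prefix one word at a time until it stops being a substring.
        if " ".join(words[:i]) not in last_sentence:
            return " ".join(words[i - 1:])
    # Every prefix was found, so the whole sentence overlaps (assumed behaviour).
    return ""


text1_ls = " to test a proposed theory or investigate a phenomenon in"
text2_fs = "theory or investigate a phenomenon in greater detail"
print(drop_leading_overlap(text1_ls, text2_fs))  # greater detail
```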

This may not cover every corner case, but it works for the text in question:

first_word_text2 = text2.split()[0]
pos = len(text1) - text1.rfind(first_word_text2)
text2[pos:].strip()
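One corner case is a first word that occurs in text1 without an actual overlap following it. A hedged variant, assuming a hypothetical helper name `remove_overlap_rfind`, checks that text2 really starts with the tail of text1 before cutting anything off. It still only looks at the last occurrence of the first word, so it remains a sketch rather than a general solution.

```python
def remove_overlap_rfind(text1, text2):
    """Drop the start of text2 if it repeats the end of text1 (assumed helper)."""
    words2 = text2.split()
    if not words2:
        return text2
    # Last position in text1 where the first word of text2 appears.
    start = text1.rfind(words2[0])
    if start == -1:
        return text2
    candidate = text1[start:]
    # Only strip when the tail of text1 really is the beginning of text2.
    if text2.startswith(candidate):
        return text2[len(candidate):].strip()
    return text2


print(remove_overlap_rfind("a proposed theory or investigate a phenomenon in",
                           "theory or investigate a phenomenon in greater detail."))
# greater detail.
print(remove_overlap_rfind("a theory of everything", "theory or practice"))
# theory or practice
```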


Here is an approach that finds the largest possible overlap:

text1 = """Some of the major unsolved problems in physics are theoretical, meaning that existing theories seem incapable of explaining a certain observed phenomenon or experimental result. The others are experimental, meaning that there is a difficulty in creating an experiment to test a proposed theory or investigate a phenomenon in"""
text2 = """theory or investigate a phenomenon in greater detail.There are still some deficiencies in the Standard Model of physics, such as the origin of mass, the strong CP problem, neutrino mass, matter–antimatter asymmetry, and the nature of dark matter and dark energy."""

def remove_overlap(text1, text2):
    """Returns the part of text2 that doesn't overlap with text1"""

    words1 = text1.split()
    words2 = text2.split()

    # all appearances of the last word of text1 in text2
    last_word_appearances = [index for index, word in enumerate(words2) if word == words1[-1]]
    # we look for the largest possible overlap
    for n in reversed(last_word_appearances):
        # are the first n+1 words of text2 the same as the (n+1) last from text1? 
        if words2[:n+1] == words1[-(n+1):]:
            return ' '.join(words2[n+1:])
    else:
        # no overlap found
        return text2


remove_overlap(text1, text2)
# 'greater detail.There are still some deficiencies in [...]
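To put the two pieces back together, one possible sketch follows. The helper `merge_texts` is hypothetical; it compares whitespace-separated words, tries the longest candidate overlap first, and joins the result with single spaces, so the original spacing and line breaks are not preserved.

```python
def merge_texts(text1, text2):
    """Join text1 and text2, keeping a shared word sequence only once (assumed helper)."""
    words1, words2 = text1.split(), text2.split()
    # Try the longest candidate overlap first, down to a single word.
    for size in range(min(len(words1), len(words2)), 0, -1):
        if words1[-size:] == words2[:size]:
            return " ".join(words1 + words2[size:])
    # No shared sequence at the boundary: plain concatenation.
    return " ".join(words1 + words2)


print(merge_texts("investigate a phenomenon in", "a phenomenon in greater detail."))
# investigate a phenomenon in greater detail.
```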


Try:

text1 = "In the first theory the solution is, but in the second theory"

and

text2 = "In the first theory the solution is, but in the second theory the solution is"

Very nice solution!