Finding identical adjacent strings with regular expressions and Python

python regex regex-negation regex-lookarounds

Consider the following text:

...
bedeubedeu France The Provençal name for tripe
bee balmbee balm Bergamot
beechmastbeechmast Beech nut
beech nutbeech nut A small nut from the beech tree,

genus Fagus and Nothofagus, similar in
flavour to a hazelnut but not commonly used.
A flavoursome oil can be extracted from
them. Also called beechmast

beechwheatbeechwheat Buckwheat
beefbeef The meat of the animal known as a cow

(female) or bull (male) (NOTE: The Anglo-
saxon name ‘Ox’ is still used for some of what
were once the less desirable parts e.g. oxtail,
ox liver)

beef bourguignonnebeef bourguignonne See boeuf à la
bourguignonne
...

I want to parse this text with Python and keep only the strings that appear twice in a row. For example, an acceptable result would be:

bedeu
bee balm
beechmast
beech nut
beechwheat
beef
beef bourguignonne

because the pattern is that each string is immediately followed by an identical copy of itself, like this:

bedeubedeu
bee balmbee balm
beechmastbeechmast
beech nutbeech nut
beechwheatbeechwheat
beefbeef
beef bourguignonnebeef bourguignonne

So how can one search for adjacent, identical strings with a regular expression? I have been testing my own attempts. Thanks.

You can use the following regular expression:

(\b.+)\1


Or, to match and capture only the unique (non-repeated) part of the substring:

(\b.+)(?=\1)

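The word boundary \b makes sure the match can only begin at a word boundary (in this data, the start of a word). Then .+ matches one or more characters other than a newline (in single-line mode, . also matches newlines). Finally, with the help of the backreference \1, we match exactly the same sequence of characters that (\b.+) captured.

In the version with the (?=\1) lookahead, the matched text does not include the repeated part, because a lookahead does not consume text, so the match leaves those chunks out.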
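To make the difference between the two patterns concrete, here is a minimal Python 3 sketch that runs both on a few lines from the sample text. The variable names (`sample`, `doubled`, `half`, `wholes`, `halves`) are illustrative choices, not part of the original answer.

```python
import re

# A few lines taken from the sample text in the question.
sample = [
    "bedeubedeu France The Provençal name for tripe",
    "bee balmbee balm Bergamot",
    "beechmastbeechmast Beech nut",
    "beech nutbeech nut A small nut from the beech tree,",
    "beefbeef The meat of the animal known as a cow",
]

doubled = re.compile(r"(\b.+)\1")    # consumes both copies
half = re.compile(r"(\b.+)(?=\1)")   # consumes only the first copy

wholes = []
halves = []
for line in sample:
    m_doubled = doubled.search(line)
    m_half = half.search(line)
    if m_doubled and m_half:
        wholes.append(m_doubled.group(0))  # e.g. "beefbeef"
        halves.append(m_half.group(0))     # e.g. "beef"

print(wholes)
print(halves)
```

Because `.+` is greedy, each search first tries the longest possible candidate and backtracks until the captured text is immediately followed by a copy of itself.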

Update

Output:

zyme
abbrühen

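Thank you very much for the correct answer. I was wondering whether there is also a way to get just half of the string with the regex (since that gives the desired result), so as to save a second pass over the data for the final output. Thanks again.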
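One way to get the half without a second pass, sketched here under the assumption of Python 3, is to read capture group 1 of the backreference pattern: the group already holds a single copy. The names `doubled`, `whole`, `half`, and `collapsed` are illustrative.

```python
import re

doubled = re.compile(r"(\b.+)\1")

line = "beef bourguignonnebeef bourguignonne See boeuf à la"

m = doubled.search(line)
whole = m.group(0)   # both copies: the full match
half = m.group(1)    # one copy: what the capture group holds

# Alternatively, collapse the doubled headword in place.
collapsed = doubled.sub(r"\1", line, count=1)

print(half)
print(collapsed)
```

The lookahead variant `(\b.+)(?=\1)` gives the same half as its full match, so either form avoids post-processing.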

stribizhev

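Sorry, I think I should have posted this from the start. Is that right? No need to thank me so much, an upvote really is enough :) By the way, you should post what you have tried, since I can see you did try something.

I now see that when I search this data with the suggested regex: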

zymezyme Yeast, the origin of the word enzyme, as the first enzymes were extracted from yeast Page 632 Thursday, August 19, 2004 7:50 PM

what I get is

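[['zyme'], [], [], [''], [''], [''], ['']]

that is, it also parses the commas. I am using the following code:

reg = re.compile(r"(\b.+)(?=\1)")
for line in textfile:
    matches += [(reg.findall(line))]
textfile.close()

Do you think this can be improved? Also, why is "abbrühenabbrühen" parsed as "abbr\xc3\xbchen"? How can I avoid having these special characters parsed this way?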
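On the last two questions, a hedged sketch assuming Python 3 and a UTF-8 encoded input: the bytes \xc3\xbc are the UTF-8 encoding of ü, so seeing "abbr\xc3\xbchen" most likely means the text was handled as undecoded bytes and printed via the list's repr. Decoding the file as UTF-8 yields proper text. The sketch also swaps `findall` for `match`, which is a substitution on my part and not the original code: `match` only looks at the start of each line, so lines without a doubled headword contribute nothing instead of an empty list. The in-memory `raw` data stands in for the real file.

```python
import io
import re

reg = re.compile(r"(\b.+)(?=\1)")

# Stand-in for a UTF-8 encoded file on disk (hypothetical contents).
raw = (
    "zymezyme Yeast\n"
    "abbrühenabbrühen\n"
    "flavour to a hazelnut but not commonly used.\n"
).encode("utf-8")

# With a real file: open(path, encoding="utf-8")
textfile = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8")

matches = []
for line in textfile:
    m = reg.match(line)          # anchored at the start of the line
    if m:
        matches.append(m.group(1))
textfile.close()

print(matches)
print("abbrühen".encode("utf-8"))  # the byte form seen in the question
```

Appending the captured string directly gives a flat list of headwords, so no second pass is needed to drop empty entries.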