Regex n-grams: overlapping matches with regular expressions

regex python-3.x

I have a string that I need to extract n-grams from using a regular expression:

"hello COMMA the matche's roll over matche's or the expression for details PCRE flavors of regex are supported here"

I want to find its bigrams and trigrams. Focusing on bigrams, the result should be:

hello COMMA
COMMA the
the matche's
etc

I wrote this regex to do that, but it doesn't catch the overlapping results:

[\w'-]+ [\w'-]+

It only catches:

hello COMMA
the matche's
etc

When I wrap it in a lookahead, it catches all sorts of garbage. What am I missing?

(?=([\w'-]+ [\w'-]+))

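Also, overlap=True doesn't work for me for some reason.

Don't use regular expressions for text processing. The NLTK package is designed specifically for this job: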

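import nltk

# word_tokenize needs NLTK's tokenizer data (installable via nltk.download)
text = "hello COMMA the matche's roll over ..."
words = nltk.word_tokenize(text)

# Note: the tokenizer splits "matche's" into "matche" and "'s"
list(nltk.bigrams(words))
# [('hello', 'COMMA'), ('COMMA', 'the'), ('the', 'matche'), ...]
list(nltk.trigrams(words))
# [('hello', 'COMMA', 'the'), ('COMMA', 'the', 'matche'), ...]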

Would you please try the following:

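import re

# Named `text` rather than `str` to avoid shadowing the built-in type
text = "hello COMMA the matche's roll over matche's or the expression for details PCRE flavors of regex are supported here"

# \S+\s consumes one word and the whitespace after it; the lookahead captures
# the next word without consuming it, so that word can start the next match.
matches = re.finditer(r'\S+\s(?=(\S+))', text)
for match in matches:
    print(match.group(0) + match.group(1))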

Output:

hello COMMA
COMMA the
the matche's
matche's roll
[snipped]

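The regex (?=(\S+)) is a positive lookahead assertion that contains a capture group.

Because a lookahead is a zero-width match, the substring it matches is assigned to match.group(1) without moving the position forward.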
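The zero-width behaviour also explains the "garbage" reported in the question. The following is only an illustrative sketch on a shortened input (the three-word `text` is an assumption made for brevity): a pattern that consists of nothing but a lookahead never consumes anything, so the engine retries at every character offset, including offsets inside a word. The pattern from this answer consumes one word first, so each new search starts at a word boundary.

```python
import re

text = "hello COMMA the"  # shortened sample, chosen for illustration

# Lookahead-only pattern from the question: every match is zero-width,
# so the engine advances one character at a time and also matches mid-word.
lookahead_only = re.findall(r"(?=([\w'-]+ [\w'-]+))", text)
print(lookahead_only[:3])

# Pattern from this answer: group 0 is consumed, group 1 is only looked at.
spans = [(m.span(0), m.span(1)) for m in re.finditer(r"\S+\s(?=(\S+))", text)]
print(spans)
```

Each captured word begins exactly where the consumed part ends, which is why it is available again as the first word of the next match.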

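The regex below is a generalization and simplification of the one @Wiktor suggested in a comment on the question. Wiktor's solution is for 2-grams, or bigrams. This solution is for 3-grams, or trigrams. For n-grams, where n is a variable, replace {2} with {n-1}.

First assume the string contains only word characters and spaces. Then the following regex can be used to extract trigrams: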

(?=(?<!\S)(\w+(?:\s+\w+){2}))

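However, the regex quickly breaks down if too many requirements are placed on how many of certain characters may appear and where.

This is an example of a task for which a regex should not be used, because the desired array can be produced more easily with other tools. It is nevertheless a useful exercise, as it does improve one's ability to use regular expressions.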
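As a rough sketch of the {n-1} substitution described above, the pattern can be built from n at run time. The helper name `regex_ngrams` and its `token` parameter are hypothetical and not part of the answer; `token` defaults to the answer's \w+ and can be swapped for the question's [\w'-]+ when words contain apostrophes or hyphens.

```python
import re

def regex_ngrams(text, n, token=r"\w+"):
    """Return overlapping n-grams of `text` (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # (?<!\S) only lets a match start at the beginning of a token;
    # the outer lookahead keeps each match zero-width so results can overlap.
    pattern = rf"(?=(?<!\S)({token}(?:\s+{token}){{{n - 1}}}))"
    return re.findall(pattern, text)

print(regex_ngrams("a b c d", 3))
print(regex_ngrams("hello COMMA the matche's roll", 2, token=r"[\w'-]+"))
```

With the default token, a word such as matche's interrupts the match, which is the kind of breakdown the answer warns about.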

Comments:

Try (?=…)

In Ruby, for an n-gram, this could be done as follows. If str = "hello COMMA ... supported here" and n = 3 (a trigram), then str.split.each_cons(n).map { |a| a.join(' ') } #=> ["hello COMMA the", "COMMA the matche's", "the matche's roll", ...]. I expect this could be done in Python using Python's equivalents of the Ruby methods.

That's neat, but I need to do this with a regex. I've updated the OP to reflect that.
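For reference, a minimal Python sketch of the split-and-window idea from the Ruby comment might look like the following. The function name `split_ngrams` is hypothetical, and Python has no direct built-in counterpart to each_cons, so list slicing stands in for it here.

```python
def split_ngrams(text, n):
    """Return overlapping n-grams of whitespace-separated words (hypothetical helper)."""
    words = text.split()
    # Slide a window of n words across the list; an input with fewer than
    # n words produces an empty result.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(split_ngrams("hello COMMA the matche's roll", 3))
```

This does not use a regex, so it does not meet the constraint stated in the question, but it keeps matche's intact because it splits on whitespace only.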