Regex in Python: return the words surrounding a character

python regex python-3.x


I have a string containing millions of words, and I would like a regex that returns the five words on either side of any dollar sign. For example:

string = 'I have a sentence with $10.00 within it and this sentence is done. '

I would like the regex to return

surrounding = ['I', 'have', 'a', 'sentence', 'with', 'within', 'it', 'and', 'this', 'sentence']

My ultimate goal is to count all the words that appear around a mention of "$", so the list above would become:

final_return = [('I', 1), ('have', 1), ('a', 1), ('sentence', 2), ('with', 1), ('within', 1), ('it', 1), ('and', 1), ('this', 1)]

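So far I have developed the regex below, which returns the string attached to the currency symbol together with the 5 characters on either side. Is there a way to edit the regex so that it captures the five surrounding words instead? Should I (and if so, how) use NLTK's tokenizer to accomplish this?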

import re

# A dollar amount plus the 5 characters on each side of it
pattern = re.compile(r'.....\$\s?\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{1,2})?.....')

You can start with the code below; I am still trying to find a simpler way to solve it.

import re

string = 'I have a sentence with $10.00 within it and this sentence is done. '

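# Five words before and five words after a dollar amount.
# \s+ (rather than \s*) keeps one word from being split across several groups.
match = re.search(r'(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+\$\d+(?:\.\d{2})?\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)', string)
surrounding = match.groups() if match else ()  # re.search returns None when nothing matches

print(surrounding)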

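I don't think a regex is the right tool for this problem. Instead, you can extract all 10 words around the dollar sign by looping over the words and keeping track of the 5 previously traversed words, so that you can return them when a match is found.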

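In that case you can use collections.deque(), a suitable data structure with a bounded number of items, to hold the preceding five words. You can then use a collections.Counter() object to return the counts of the words within that window.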

from collections import deque
from collections import Counter
from itertools import chain

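def my_counter(string):
    # Sliding window holding at most the five most recent words
    container = deque(maxlen=5)
    words = iter(string.split())

    def next_five(words):
        # Pull up to five more words from the same iterator
        for _ in range(5):
            try:
                yield next(words)
            except StopIteration:
                return

    for w in words:
        if w.startswith('$'):
            # Count the five words before and the five words after the amount
            yield Counter(chain(container, next_five(words)))
        else:
            container.append(w)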

Demo:
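The original demo output was not preserved on this page. As a minimal sketch, the generator could be exercised as follows. The function is restated in condensed form (using `itertools.islice` in place of the inner helper) only so the snippet runs on its own; `sample` and `result` are illustrative names.

```python
from collections import Counter, deque
from itertools import chain, islice


def my_counter(string):
    """Yield one Counter per '$' token: the 5 words before it and the 5 after."""
    container = deque(maxlen=5)      # sliding window of the previous five words
    words = iter(string.split())
    for w in words:
        if w.startswith('$'):
            # islice pulls up to five following words from the same iterator
            yield Counter(chain(container, islice(words, 5)))
        else:
            container.append(w)


sample = 'I have a sentence with $10.00 within it and this sentence is done. '
result = list(my_counter(sample))
print(result)
# [Counter({'sentence': 2, 'I': 1, 'have': 1, 'a': 1, 'with': 1, 'within': 1, 'it': 1, 'and': 1, 'this': 1})]
```

Because the function is a generator, it produces one Counter per dollar amount and never holds more than a five-word window in memory, which suits a very long input string.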

You can combine a regular expression with a Counter, using a pattern like this:

(?P<before>(?:\w+\W+){5})
\$\d+(?:\.\d+)?
(?P<after>(?:\W+\w+){5})

This yields one Counter per match (note that a Counter is already a dict).

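Use split to separate the words, use isalpha to drop the non-word tokens, then count the frequency of each word in the list. Note that this counts every purely alphabetic token in the string (so 'is' is included and 'done.' is dropped), not only the five words on each side of the dollar sign.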

string = 'I have a sentence with $10.00 within it and this sentence is done. '
string1 = string.split()                        # split on whitespace
string2 = [s for s in string1 if s.isalpha()]   # drops tokens such as '$10.00' and 'done.'
print([[x, string2.count(x)] for x in set(string2)])
# e.g. [['and', 1], ['within', 1], ['sentence', 2], ['it', 1], ['a', 1], ['have', 1], ['with', 1], ['this', 1], ['is', 1], ['I', 1]]
# (the order varies because a set is unordered)
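Because `list.count` rescans the whole list once for every distinct word, the snippet above can do quadratic work, which matters for a string with millions of words. A single-pass alternative, sketched here with `collections.Counter` (the names `tokens` and `counts` are illustrative), gives the same counts:

```python
from collections import Counter

string = 'I have a sentence with $10.00 within it and this sentence is done. '

# Keep only purely alphabetic tokens, then count them in one pass.
tokens = [s for s in string.split() if s.isalpha()]
counts = Counter(tokens)

print(sorted(counts.items()))
# [('I', 1), ('a', 1), ('and', 1), ('have', 1), ('is', 1), ('it', 1), ('sentence', 2), ('this', 1), ('with', 1), ('within', 1)]
```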

Are you able to import the regex module?

Thank you very much! This is really helpful. Is there any way I can return these words in numerical order?

from collections import Counter
import re

rx = re.compile(r'''
    (?P<before>(?:\w+\W+){5})
    \$\d+(?:\.\d+)?
    (?P<after>(?:\W+\w+){5})
    ''', re.VERBOSE)

sentence = 'I have a sentence with $10.00 within it and this sentence is done. '
words = [Counter(m.group('before').split() + m.group('after').split())
         for m in rx.finditer(sentence)]
print(words)
# [Counter({'sentence': 2, 'I': 1, 'have': 1, 'a': 1, 'with': 1, 'within': 1, 'it': 1, 'and': 1, 'this': 1})]
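On the follow-up question about ordering: a `Counter` can be listed from most to least frequent with `most_common()`, which also produces the list-of-tuples shape shown in the question. The sketch below merges the counts from every match into one total before sorting; `total` and `final_return` are names chosen for illustration, and the inline pattern comments are added here.

```python
import re
from collections import Counter

rx = re.compile(r'''
    (?P<before>(?:\w+\W+){5})   # five words before the amount
    \$\d+(?:\.\d+)?             # the dollar amount itself
    (?P<after>(?:\W+\w+){5})    # five words after the amount
    ''', re.VERBOSE)

sentence = 'I have a sentence with $10.00 within it and this sentence is done. '

# Accumulate the surrounding words of every match into a single Counter.
total = Counter()
for m in rx.finditer(sentence):
    total.update(m.group('before').split() + m.group('after').split())

# most_common() sorts by count, highest first; ties keep first-seen order.
final_return = total.most_common()
print(final_return)
# [('sentence', 2), ('I', 1), ('have', 1), ('a', 1), ('with', 1), ('within', 1), ('it', 1), ('and', 1), ('this', 1)]
```

Passing a number, as in `total.most_common(3)`, returns only the top entries, and `sorted(total.items())` gives alphabetical order instead.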
