How to search content with regular expressions in Python

python regex string algorithm search

I have a hierarchical dictionary with a "content" key:

where "content" is the content of an HTML file:

<p>It is common for content in Arabic, Hebrew, and other languages that use right-to-left scripts to include numerals or include text from  other scripts. Both of these typically flow  left-to-right within the overall right-to-left  context. </p> 
<p>This article tells you how to write HTML where text with different writing directions is mixed <em>within a paragraph or other HTML block</em> (ie. <dfn id="term_inline">inline or phrasal</dfn> content). (A companion article <a href="/International/questions/qa-html-dir"><cite>Structural markup and right-to-left text in HTML</cite></a> tells you how to use HTML markup for  elements such as <code class="kw">html</code>, and structural markup such as <code class="kw">p</code> or <code class="kw">div</code> and forms.)</p>

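Using p (the match position), I want to extract a few words before and after the found word at the point where content is assigned to the "content" key (including the case where the found word is at the start of a sentence):

For example:

How can I implement this in Python with regular expressions or some other method?

Thanks in advance.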

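I'm not sure whether your dictionary structure and the way you navigate it are relevant to your question, so I'll restate it:

"How do I use a regular expression to search for a word and get the words before and after it?"

The answer is to use regex capture groups.

Here is an example that finds the word before and the word after the search word. You may need to adjust the expression to capture the number of words you want or to handle punctuation: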

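import re

test_string = "How much wood could a wood chuck chuck if a wood chuck would chuck wood"
search_word = "wood"

# Group 1: the previous word plus its trailing space (empty if there is none).
# Group 2: the next word plus its leading space (empty if there is none).
# re.escape keeps special characters in search_word from being read as regex syntax.
pattern = r'([^ ]*? |)%s( [^ ]*|)' % re.escape(search_word)

for match in re.finditer(pattern, test_string):
    print("entire match: %s" % match.group(0))
    print("prev word: %s" % match.group(1))
    print("next word: %s" % match.group(2))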
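
To get more than one word on each side, the same capture-group idea can be extended with a bounded repetition. The following is only a minimal sketch, not a definitive implementation: words_around and its parameter n are names made up for this illustration, and a "word" is assumed to be any run of non-whitespace characters.

```python
import re

def words_around(text, phrase, n=2):
    """Yield (before, match, after) for each match of phrase in text.

    before and after are lists of up to n whitespace-separated words.
    (Hypothetical helper for illustration; not from the original post.)
    """
    pattern = re.compile(
        r'((?:\S+\s+){0,%d})'   # group 1: up to n words before the phrase
        r'(%s)'                 # group 2: the phrase itself, taken literally
        r'((?:\s+\S+){0,%d})'   # group 3: up to n words after the phrase
        % (n, re.escape(phrase), n)
    )
    for m in pattern.finditer(text):
        yield m.group(1).split(), m.group(2), m.group(3).split()

text = "This article tells you how to write HTML where text with different writing directions is mixed"
for before, found, after in words_around(text, "how to write", n=3):
    print(before, found, after)
# ['article', 'tells', 'you'] how to write ['HTML', 'where', 'text']
```

When the phrase is at the very start of the text, the list of preceding words is simply empty. One caveat: finditer returns non-overlapping matches, so an occurrence that falls inside another match's context is not reported separately.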

By the way, if you haven't already, visit www.regex101.com to test and tweak your regex patterns.

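import re
from bs4 import BeautifulSoup  # assumed import; the original post does not show it

def look_through(d, s):
    # Collect one result per match of s in the file at d["path"], then recurse.
    r = []
    content = readFile(d["path"])  # readFile is the asker's own helper, not shown
    content = BeautifulSoup(content)
    content = content.getText()  # plain text with the HTML tags stripped
    pos = [m.start() for m in re.finditer(s, content)]  # start offset of each match
    if pos:
        if "phrase" not in d:
            d["phrase"] = [s]
        else:
            d["phrase"].append(s)
        for p in pos:
            # p is not used yet: the whole text is stored, not a snippet around p
            r.append({"content": content, "phrase": d["phrase"], "name": d["name"]})
    for b in d["decendent"] or []:
        r += look_through(b, s)
    return r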

r.append({"content": content, "phrase": d["phrase"], "name": d["name"]})

>>> look_through(dict, "how to write") 
[{"content": "article tells you how to write HTML where text", "phrase": "how to write", "name" : "Section_3"}]
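
Since look_through already collects the match offsets, one option that avoids a more complex regex is to slice the text around each match and split the slices into words. This is a rough sketch under the assumption that words are separated by whitespace; snippets and n are hypothetical names, not part of the original code.

```python
import re

def snippets(text, phrase, n=3):
    """Return one context string per occurrence of phrase in text.

    Each string holds up to n words before the match, the match itself,
    and up to n words after it. (Hypothetical helper for illustration.)
    """
    found = []
    for m in re.finditer(re.escape(phrase), text):
        # Words preceding the match; empty when the match starts the text.
        before = text[:m.start()].split()[-n:] if n > 0 else []
        # Words following the match; empty when the match ends the text.
        after = text[m.end():].split()[:n]
        found.append(" ".join(before + [m.group(0)] + after))
    return found

text = "This article tells you how to write HTML where text with different writing directions is mixed"
print(snippets(text, "how to write"))
# ['article tells you how to write HTML where text']
```

Inside look_through, the "content" value of each result could then hold the snippet for that match instead of the whole text. Splitting the full prefix for every match is wasteful on long documents, so this is meant to show the idea rather than to be fast.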
