How to search content with regular expressions in Python
I have a hierarchical dictionary with a "content" key, where "content" holds the contents of an HTML file:
<p>It is common for content in Arabic, Hebrew, and other languages that use right-to-left scripts to include numerals or include text from other scripts. Both of these typically flow left-to-right within the overall right-to-left context. </p>
<p>This article tells you how to write HTML where text with different writing directions is mixed <em>within a paragraph or other HTML block</em> (ie. <dfn id="term_inline">inline or phrasal</dfn> content). (A companion article <a href="/International/questions/qa-html-dir"><cite>Structural markup and right-to-left text in HTML</cite></a> tells you how to use HTML markup for elements such as <code class="kw">html</code>, and structural markup such as <code class="kw">p</code> or <code class="kw">div</code> and forms.)</p>
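Using p (the match position), I want to extract a few words before and after the found phrase at the point where content is assigned to the "content" key, including the case where the found phrase is at the start of a sentence. The function, the line in question, and an example of the expected result are listed further down.
How can I implement this in Python, using regular expressions or some other method?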
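Thank you in advance.

I'm not sure whether your dictionary structure and the way you navigate it are relevant to your question, so I'll restate it: "How do I search for a word with a regular expression and get the words before and after it?"

The answer is to use regex capture groups. Here is an example that finds the one word before and the one word after the search term. You may need to adjust the expression to capture more words or punctuation, as needed: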
import re

test_string = "How much wood could a wood chuck chuck if a wood chuck would chuck wood"
search_word = "wood"

# Group 1 captures the previous word (if any), group 2 the next word (if any).
pattern = r'([^ ]*? |)%s( [^ ]*|)' % re.escape(search_word)
for match in re.finditer(pattern, test_string):
    print("entire match: %s" % match.group(0))
    print("prev word: %s" % match.group(1))
    print("next word: %s" % match.group(2))
By the way, if you haven't already, visit www.regex101.com to test and tweak your regex patterns.
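To get several words on each side rather than one, the capture groups can be given a bounded repetition. The following is only a sketch of that adjustment: the function name words_around and the parameter n are made up for illustration, and words are assumed to be separated by whitespace.

```python
import re


def words_around(text, phrase, n=3):
    """Return (before, match, after) tuples with up to n words on each side."""
    # (?:\S+\s+){0,n} greedily takes up to n whitespace-separated words before
    # the phrase; (?:\s+\S+){0,n} does the same after it.
    pattern = r'((?:\S+\s+){0,%d})(%s)((?:\s+\S+){0,%d})' % (
        n, re.escape(phrase), n)
    return [(m.group(1).split(), m.group(2), m.group(3).split())
            for m in re.finditer(pattern, text)]


test_string = "How much wood could a wood chuck chuck if a wood chuck would chuck wood"
for before, found, after in words_around(test_string, "wood", n=2):
    print(before, found, after)
```

Note that re.finditer returns non-overlapping matches, so words already consumed as trailing context of one match are not available as leading context of the next. In the test string, the second "wood" therefore comes back with no preceding words.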
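import re

from bs4 import BeautifulSoup  # assumed source of BeautifulSoup (not shown in the question)


# Function from the question. readFile is the asker's own helper (not shown).
def look_through(d, s):
    r = []
    content = readFile(d["path"])
    content = BeautifulSoup(content)
    content = content.getText()  # strip the HTML tags, keep the text
    # Start offsets of every occurrence of s in the text
    pos = [m.start() for m in re.finditer(s, content)]
    if pos:
        # Record the phrase on the node
        if "phrase" not in d:
            d["phrase"] = [s]
        else:
            d["phrase"].append(s)
        # One result per match position; p is not used yet
        for p in pos:
            r.append({"content": content, "phrase": d["phrase"], "name": d["name"]})
    # Recurse into child nodes ("decendent" is the key name used in the question)
    for b in d["decendent"] or []:
        r += look_through(b, s)
    return r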
The line where content is assigned to the "content" key:
r.append({"content": content, "phrase": d["phrase"], "name": d["name"]})
Example of the desired result:
>>> look_through(dict, "how to write")
[{"content": "article tells you how to write HTML where text", "phrase": "how to write", "name": "Section_3"}]
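One way to produce a snippet like the one in this example is to work from the match position directly, slicing the text at the start and end of each match and keeping a few words from each side. This is a minimal sketch, not the asker's code: context_snippets and n are hypothetical names, the phrase is treated as literal text via re.escape (the question passes it to re.finditer unescaped), and words are assumed to be whitespace-separated.

```python
import re


def context_snippets(content, phrase, n=3):
    """Return one snippet per match: up to n words before and after the phrase."""
    snippets = []
    for m in re.finditer(re.escape(phrase), content):
        # Text left of the match start (the question's p) and right of its end.
        before = content[:m.start()].split()
        after = content[m.end():].split()
        # Slicing copes with matches near the start or end of the text.
        words = before[max(0, len(before) - n):] + [m.group(0)] + after[:n]
        snippets.append(" ".join(words))
    return snippets


text = ("This article tells you how to write HTML where text with "
        "different writing directions is mixed")
print(context_snippets(text, "how to write"))
```

Because each match is handled independently, neighbouring matches can share context words, and a phrase at the very beginning of the text simply gets fewer (or no) preceding words. Inside look_through, the "content" value in r.append could presumably be replaced by such a snippet for each position.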