Python: find the characters between two delimiters in a string


python html string


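I am trying to parse a string to find all the characters between two delimiters, <code> and </code>.

I tried using a regular expression, but I can't work out what is going wrong.

My attempt: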

import re
re.findall('<code>(.*?)</code>', processed_df['question'][2])

Here processed_df['question'][2] is the string below. It is one continuous string; I have typed it across several lines for readability:

 '<code>for x in finallist:\n    matchinfo = 
 requests.get("https://api.opendota.com/api/matches/{}".format(x)).json() 
 ["match_id"]\n    print(matchinfo)\n</code>'

I also tested with this test string:

 test_string = '<code> this is a test </code>'

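That seemed to work.

I have a feeling it has to do with special characters between <code> and </code>, but I don't know how to fix it. Thanks for your help.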

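I think the problem is the newline character \n: by default, "." does not match a newline. Make sure to match with the DOTALL flag, for example: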

import re
regex = r"<code>(.*?)</code>"  # non-greedy, so each <code> block is matched on its own

test_str = ("<code>for x in finallist:\\n    matchinfo = \n"
    " requests.get(\"https://api.opendota.com/api/matches/{}\".format(x)).json() \n"
    " [\"match_id\"]\\n    print(matchinfo)\\n</code>\n")

re.findall(regex, test_str, re.DOTALL)

['for x in finallist:\\n    matchinfo = \n requests.get("https://api.opendota.com/api/matches/{}".format(x)).json() \n ["match_id"]\\n    print(matchinfo)\\n']
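
To make the effect of the flag concrete, here is a minimal sketch that runs the same pattern with and without re.DOTALL. The sample string `text` is made up for illustration and is not the asker's data. It also shows why a greedy `.*` can be a problem once the string holds more than one block.

```python
import re

# Hypothetical sample: two <code> blocks, the first one spans two lines.
text = "<code>a = 1\nb = 2</code> and <code>print(a)</code>"

pattern = r"<code>(.*?)</code>"

# Without DOTALL, "." stops at "\n", so the multi-line block never matches.
without_flag = re.findall(pattern, text)

# With DOTALL, "." also matches "\n", so both blocks are found.
with_flag = re.findall(pattern, text, re.DOTALL)

# A greedy ".*" with DOTALL runs on to the last </code> and merges the blocks.
greedy = re.findall(r"<code>(.*)</code>", text, re.DOTALL)

print(without_flag)
print(with_flag)
print(greedy)
```

This would also explain why the single-line test string in the question worked: it contains no newline, so the flag makes no difference there.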

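The question doesn't explicitly say that a regular expression is required. That said, I think it is best not to use one: an HTML parser is probably a better fit than a regex, e.g.: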

import lxml.html

html_snippet = """
 ...
 <p>Some stuff</p>
 ...
 <code>for x in finallist:\n    matchinfo = 
 requests.get("https://api.opendota.com/api/matches/{}".format(x)).json() 
 ["match_id"]\n    print(matchinfo)\n</code>
 ...
 And some Stuff
 ...
 another code block <br />
 <code>
    print('Hello world')
 </code>
 """

dom = lxml.html.fromstring(html_snippet)
codes = dom.xpath('//code')


for code in codes:
    print(code.text)

 >>>> for x in finallist:
 >>>>     matchinfo = 
 >>>> requests.get("https://api.opendota.com/api/matches/{}".format(x)).json() 
 >>>> ["match_id"]
 >>>>    print(matchinfo)

 >>>> print('Hello world')

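I didn't see the sample string.

Sorry, I accidentally deleted it while cutting and pasting. I've fixed it.

DOTALL is enough.

@blhsing, noted and removed.

Would this solution work if there are multiple code blocks in the HTML snippet? I'm trying to capture multiple code blocks in one string and extract features from each piece of code.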
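
On the multiple-block question, a parser-based approach returns one entry per <code> element. The sketch below uses the standard library's html.parser rather than lxml, so it runs without extra installs. The names `CodeCollector` and `extract_code_blocks`, the sample string, and the two "features" are all hypothetical and only there for illustration. It also assumes <code> elements are not nested.

```python
from html.parser import HTMLParser


class CodeCollector(HTMLParser):
    """Collect the text inside every <code>...</code> element (no nesting)."""

    def __init__(self):
        super().__init__()
        self.blocks = []    # finished code blocks, in document order
        self._parts = None  # text pieces of the block being read, else None

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            self._parts = []

    def handle_data(self, data):
        # Only keep text while we are inside a <code> element.
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "code" and self._parts is not None:
            self.blocks.append("".join(self._parts))
            self._parts = None


def extract_code_blocks(html):
    """Return a list with the text of each <code> block in `html`."""
    parser = CodeCollector()
    parser.feed(html)
    parser.close()
    return parser.blocks


# Hypothetical input with two code blocks.
sample = (
    "<p>Some stuff</p><code>for x in items:\n    print(x)</code>"
    " text <code>print('Hello world')</code>"
)

blocks = extract_code_blocks(sample)

# Illustrative per-block features: line count and character count.
features = [{"n_lines": len(b.splitlines()), "n_chars": len(b)} for b in blocks]

for block, feature in zip(blocks, features):
    print(feature, repr(block))
```

If the strings live in a DataFrame column, a function like this could be applied to each row in turn, giving a list of blocks per question.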
