Python: How do I find HTML tags that contain specific text? (BeautifulSoup)

Tags: python, regex, beautifulsoup

Here is the source:

<span class="new"> <a class="blog" href="http://whatever1.com" rel="nofollow">whatever1</a> do something at <a class="others" href="http://example1.com" rel="nofollow">example1</a></span>

<span class="new"> <a class="blog" href="http://whatever2.com" rel="nofollow">whatever2</a> do other things at <a class="others" href="http://example2.com" rel="nofollow">example2</a></span>

<span class="new"> <a class="blog" href="http://whatever3.com" rel="nofollow">whatever3</a> do something at <a class="others" href="http://example3.com" rel="nofollow">example3</a></span> 

Nothing is found. If I remove
text=re.compile('.*do something.*')
all of the tags above are found, so I assume something is wrong with my regex pattern. What is the correct form?

Iterate over the contents of the HTML file and print the lines that match. Here I have replaced the file contents with a list l:

>>> import re
>>> l = ['<span class="new"> <a class="blog" href="http://whatever1.com" rel="nofollow">whatever1</a> do something at <a class="others" href="http://example1.com" rel="nofollow">example1</a></span>',
...      '<span class="new"> <a class="blog" href="http://whatever2.com" rel="nofollow">whatever2</a> do other things at <a class="others" href="http://example2.com" rel="nofollow">example2</a></span>',
...      '<span class="new"> <a class="blog" href="http://whatever3.com" rel="nofollow">whatever3</a> do something at <a class="others" href="http://example3.com" rel="nofollow">example3</a></span>']
>>> for line in l:
...     # keep only the lines whose span contains the phrase
...     if re.search('<span class="new">.*do something.*', line):
...         print(line)
...
<span class="new"> <a class="blog" href="http://whatever1.com" rel="nofollow">whatever1</a> do something at <a class="others" href="http://example1.com" rel="nofollow">example1</a></span>
<span class="new"> <a class="blog" href="http://whatever3.com" rel="nofollow">whatever3</a> do something at <a class="others" href="http://example3.com" rel="nofollow">example3</a></span>

You could try a hybrid approach:

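import re
import bs4
# html is assumed to hold the page source shown in the question
soup = bs4.BeautifulSoup(html, "lxml")
spans = soup.findAll("span", attrs={"class": "new"})
regex = re.compile('.*do something at.*')
# span.text also includes the text of the child <a> tags
desired_tags = [span for span in spans if regex.match(span.text)]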
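
Once the matching spans are in hand, the rest of the work (grabbing URLs and so on) can stay in BeautifulSoup. A minimal sketch, assuming the three spans from the question are stored in a string named html and using the built-in "html.parser" so that lxml is not required; the helper names find_spans_with_text and link_urls are made up for illustration:

```python
import re
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = """
<span class="new"> <a class="blog" href="http://whatever1.com" rel="nofollow">whatever1</a> do something at <a class="others" href="http://example1.com" rel="nofollow">example1</a></span>
<span class="new"> <a class="blog" href="http://whatever2.com" rel="nofollow">whatever2</a> do other things at <a class="others" href="http://example2.com" rel="nofollow">example2</a></span>
<span class="new"> <a class="blog" href="http://whatever3.com" rel="nofollow">whatever3</a> do something at <a class="others" href="http://example3.com" rel="nofollow">example3</a></span>
"""


def find_spans_with_text(markup, pattern):
    """Return the <span class="new"> tags whose combined text matches pattern."""
    soup = BeautifulSoup(markup, "html.parser")
    regex = re.compile(pattern)
    spans = soup.find_all("span", attrs={"class": "new"})
    # get_text() joins the span's own text with the text of its child tags.
    return [span for span in spans if regex.search(span.get_text())]


def link_urls(span):
    """Return the href of every <a> tag inside the given span."""
    return [a["href"] for a in span.find_all("a", href=True)]


matches = find_spans_with_text(html, r"do something")
urls = [link_urls(span) for span in matches]
for pair in urls:
    print(pair)
```

Using regex.search instead of regex.match means the pattern does not need the leading and trailing .* to find the phrase in the middle of the text.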

This is how I usually look for text:

spans = soup.findAll("span", attrs={"class": "new"})
for s in spans:
    if "do something" in str(s):
        print(s)  # placeholder body; the original snippet stops at the if line
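
One thing to keep in mind with this approach: str(s) is the tag's full markup, so the substring test also sees attribute values and tag names, not just the visible text. A small sketch of the difference, using the first span from the question and assuming the built-in "html.parser"; the variable names are mine:

```python
from bs4 import BeautifulSoup

html = (
    '<span class="new"> <a class="blog" href="http://whatever1.com" '
    'rel="nofollow">whatever1</a> do something at <a class="others" '
    'href="http://example1.com" rel="nofollow">example1</a></span>'
)
span = BeautifulSoup(html, "html.parser").span

# "nofollow" only appears in the rel attribute, never in the visible text.
in_markup = "nofollow" in str(span)
in_text = "nofollow" in span.get_text()

# The phrase from the question is part of the visible text.
phrase_in_text = "do something" in span.get_text()

print(in_markup, in_text, phrase_in_text)
```

If only the visible text should count, testing against s.get_text() avoids accidental hits inside the markup.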

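This does work, but the problem is that I need to parse the selected tags, grab the URLs, and so on. BeautifulSoup would do that better.

Thanks, this does work. What I don't understand is why it doesn't work when you combine all of the above into one line:
spans = soup.findAll("span", attrs={"class": "new"}, text=re.compile('.*do something.*'))

My guess is that
text=
only applies to tags that contain nothing but text, not to tags that contain other tags. In your HTML, every <span> has <a> tags mixed in with the text. If the span had no child tags, I think it would work fine.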
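
That guess can be checked directly. As far as I understand it, when a tag name is given as well, the text filter is compared against the tag's .string, and .string is None whenever a tag has more than one child. A small sketch, assuming the built-in "html.parser" and a bs4 version that accepts string= (the newer spelling of text=); the trimmed-down markup and variable names are mine, not from the question:

```python
import re
from bs4 import BeautifulSoup

markup = (
    '<span class="new">do something at home</span>'
    '<span class="new"> <a href="http://whatever1.com">whatever1</a>'
    ' do something at <a href="http://example1.com">example1</a></span>'
)
soup = BeautifulSoup(markup, "html.parser")
plain, mixed = soup.find_all("span", attrs={"class": "new"})

# A span with a single text child exposes that text as .string ...
print(plain.string)
# ... but a span that mixes text and <a> tags has no single string.
print(mixed.string)

pattern = re.compile("do something")

# Filtering inside find_all only sees the span whose .string is set.
filtered = soup.find_all("span", attrs={"class": "new"}, string=pattern)

# Filtering afterwards on get_text() sees both spans.
by_text = [s for s in (plain, mixed) if pattern.search(s.get_text())]

print(len(filtered), len(by_text))
```

This is why filtering on get_text() (or span.text) after the findAll call finds the spans from the question, while passing the pattern to findAll itself does not.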