Scraping a hidden value from HTML with Python using the requests and bs4 libraries


python html web-scraping


I am trying to grab a CAPTCHA value from an HTML source that has the following format:

<div id="Custom"><!-- test: vdfnhu --></div>

BeautifulSoup has NavigableString and Comment objects which, just like Tag objects, can be children, siblings, and so on. The BeautifulSoup documentation has more details.

So, you want to find the div "Custom":

div = soup.find('div', id='Custom')

Then you want to find its Comment child:

comment = next(child for child in div.children if isinstance(child, bs4.Comment))

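If the format is really as fixed as what you've shown, you can simplify this to next(div.children), which works only because the comment is the div's first child. On the other hand, if the markup is more variable, you may want to iterate over all of the Comment nodes rather than just grabbing the first one.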
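To illustrate the "iterate over all Comment nodes" case, here is a minimal sketch. The markup is hypothetical: it reuses the div from the question but adds a second comment and an extra tag, which is an assumption made only for this example. It walks div.descendants, so comments nested deeper than direct children are found too.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup: the question's div, plus an extra tag and a second comment.
html = '<div id="Custom"> <b>x</b><!-- test: vdfnhu --><!-- other: abc --></div>'

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", id="Custom")

# Keep every Comment node under the div, at any depth, in document order.
comments = [node for node in div.descendants if isinstance(node, Comment)]

for comment in comments:
    # A Comment is a str subclass, so strip() and friends work directly.
    print(comment.strip())
```

Using div.children instead of div.descendants would restrict the search to direct children, which matches the generator expression shown earlier.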

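And, since a Comment is basically just a string (as in, it supports all of the str methods), you can pull the value out with ordinary string operations:

test = comment.partition(':')[-1].strip()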
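As a small sketch of what "supports all str methods" means in practice, the snippet below builds a Comment by hand and splits it with str.partition. The helper name parse_comment is hypothetical and not part of bs4. It also covers the case where the separator is missing, since partition then returns an empty separator and an empty tail.

```python
from bs4 import Comment


def parse_comment(comment, sep=":"):
    """Split 'key: value' text into (key, value); return None if sep is absent.

    Hypothetical helper for illustration only.
    """
    key, found, value = comment.partition(sep)
    if not found:  # partition() yields an empty separator when sep is missing
        return None
    return key.strip(), value.strip()


comment = Comment(" test: vdfnhu ")
print(isinstance(comment, str))
print(parse_comment(comment))
```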

Putting it all together:

>>> import bs4
>>> html = '''<html><head></head>
...           <body><div id="Custom"><!-- test: vdfnhu --></div>\n</body></html>'''
>>> soup = bs4.BeautifulSoup(html, 'html.parser')
>>> div = soup.find('div', id='Custom')
>>> comment = next(div.children)
>>> test = comment.partition(':')[-1].strip()
>>> test
'vdfnhu'
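Since the question mentions requests, the same steps could be wrapped into a pair of functions, roughly as sketched below. The function names, the div_id parameter, and the 10-second timeout are assumptions for illustration. The URL is left to the caller because the lab page's address is not given here. Unlike the session above, this version returns None instead of raising when the div or the comment is missing.

```python
import requests
from bs4 import BeautifulSoup, Comment


def extract_hidden_value(html, div_id="Custom"):
    """Return the text after ':' in the first comment inside the div, or None."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", id=div_id)
    if div is None:
        return None
    # First direct child that is a Comment; None if the div has no comment.
    comment = next((c for c in div.children if isinstance(c, Comment)), None)
    if comment is None:
        return None
    return comment.partition(":")[-1].strip()


def fetch_hidden_value(url, div_id="Custom"):
    """Download a page with requests and extract the hidden value from it."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    return extract_hidden_value(response.text, div_id)


# Offline check against the markup from the question.
sample = '<html><body><div id="Custom"><!-- test: vdfnhu --></div></body></html>'
print(extract_hidden_value(sample))
```

Keeping the parsing separate from the download makes it possible to test the extraction on a saved copy of the page without hitting the network.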


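As a side note, what site are you scraping that uses a CAPTCHA but includes the answer in the source? That completely defeats the purpose; it won't even slow down bots, but it will still annoy users…

It's a lab I'm working on for my master's in cyber security. :) I have many, many more courses to take. Life would be much simpler if I could write all of the code in C. Learning Python isn't hard, but learning all of the libraries is a real pain…

Have you looked at IronPython? Python language, .NET libraries… it sounds like it might be your kind of thing.

It works great with next(div.children). Thanks! I just couldn't wrap my head around the comment for some reason; it had me stuck in a mental loop…

@Phil: BeautifulSoup's documentation is very complete and well written… but it isn't organized in a way that's easy to use if you don't already know what you're searching for.

I agree, the docs are well written, but as you said, the organization is hard to digest for someone who is just learning the library.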
