Python: getting the alert text out of a window alert when the alert is shown

python web-scraping

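I am using the Python requests library and BeautifulSoup. There is a URL that returns HTML with a pop-up alert() when the request is invalid.
The problem is that with BeautifulSoup I cannot get the window.alert popup text.
I tried the regex approach from another answer, but it does not seem to work.
This is what I tried: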
for script in soup.find_all("script"):
    alert = re.findall(r'(?<=alert\(\").+(?=\")', script.text)

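It was solved by using the html5lib parser library.
According to the documentation, it parses the page the same way a web browser does,
so it is able to pick up scripts that sit outside the body tag.

import re
from bs4 import BeautifulSoup

# payload holds the HTML text of the response
soup = BeautifulSoup(payload, 'html5lib')
error = None
for scr in soup.find_all("script"):
    # Detach the script from the tree and search its text
    scrExtract = scr.extract()
    # Accept either quote style around the value assigned to err
    alert = re.findall(r'''err=["'](.*\w)["']''', scrExtract.text)
    if len(alert) > 0:
        error = alert[0]
print(error)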
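
The regex above hardcodes the variable name err. As a rough alternative, the sketch below first captures whatever alert() is called with and, if that is a variable rather than a string literal, looks up a simple assignment to it. The function name alert_messages and the sample script_text are hypothetical, made up for illustration; the approach only handles simple cases (no nested parentheses, no string concatenation).

```python
import re

# Hypothetical sample that mirrors a script calling alert() with a variable.
script_text = """
    var err='User ID';
    alert(err);
    iBankForm.action='login.jsp';
    iBankForm.submit();
"""


def alert_messages(js):
    """Return the text passed to each alert() call in a JavaScript snippet."""
    messages = []
    # Capture whatever sits between the parentheses of alert(...).
    for arg in re.findall(r'alert\(\s*(.*?)\s*\)', js):
        literal = re.fullmatch(r'''(["'])(.*)\1''', arg)
        if literal:
            # alert("text") or alert('text'): the argument is the message.
            messages.append(literal.group(2))
            continue
        # Otherwise treat the argument as a variable name and look for a
        # simple assignment such as: var err='User ID';
        assignment = re.search(
            r'''\b%s\s*=\s*(["'])(.*?)\1''' % re.escape(arg), js)
        if assignment:
            messages.append(assignment.group(2))
    return messages


print(alert_messages(script_text))
```

In the loop above, the same function could be applied to scrExtract.text in place of the fixed err pattern.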
Running BeautifulSoup's diagnose() on your data gives me the following:
data = '''
<script language="JavaScript">
if(top.frames.length != 0) {
    location.href="frame_break.jsp"
}
</script>

<html>
<body>

</body>
</html>


<script>
    var err='User ID';
    alert(err);
    iBankForm.action='login.jsp';
    iBankForm.submit();
  </script>'''

from bs4.diagnose import diagnose

diagnose(data)

Prints:
Diagnostic running on Beautiful Soup 4.7.1
Python version 3.6.8 (default, Jan 14 2019, 11:02:34) 
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]]
Found lxml version 4.3.3.0
Found html5lib version 1.0.1

Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<script language="JavaScript">
 if(top.frames.length != 0) {
    location.href="frame_break.jsp"
}
</script>
<html>
 <body>
 </body>
</html>
<script>
 var err='User ID';
    alert(err);
    iBankForm.action='login.jsp';
    iBankForm.submit();
</script>
--------------------------------------------------------------------------------
Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<html>
 <head>
  <script language="JavaScript">
   if(top.frames.length != 0) {
    location.href="frame_break.jsp"
}
  </script>
 </head>
 <body>
  <script>
   var err='User ID';
    alert(err);
    iBankForm.action='login.jsp';
    iBankForm.submit();
  </script>
 </body>
</html>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml
Here's what lxml did with the markup:
<html>
 <head>
  <script language="JavaScript">
   if(top.frames.length != 0) {
    location.href="frame_break.jsp"
}
  </script>
 </head>
 <body>
 </body>
</html>

--------------------------------------------------------------------------------
Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>
<script language="JavaScript">
 if(top.frames.length != 0) {
    location.href="frame_break.jsp"
}
</script>
--------------------------------------------------------------------------------

['err']
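
The diagnostic output above shows html.parser keeping both script blocks, while lxml drops the one that follows the closing html tag. As a dependency-free cross-check, here is a minimal sketch that uses the standard library's HTMLParser (the parser behind BeautifulSoup's 'html.parser' option) to collect every script body and search it. The class name ScriptCollector is hypothetical, and the regex assumes the message is assigned to err in single quotes, as in the sample data.

```python
import re
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text of every <script> element, wherever it appears."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the last entry.
        if self._in_script:
            self.scripts[-1] += data


# Same markup as the sample data: one script before <html>, one after </html>.
data = '''
<script language="JavaScript">
if(top.frames.length != 0) {
    location.href="frame_break.jsp"
}
</script>

<html>
<body>

</body>
</html>

<script>
    var err='User ID';
    alert(err);
    iBankForm.action='login.jsp';
    iBankForm.submit();
  </script>'''

collector = ScriptCollector()
collector.feed(data)
collector.close()

# Search every collected script for the value assigned to err.
found = [m for s in collector.scripts for m in re.findall(r"err='(.*?)'", s)]
print(len(collector.scripts), found)
```

Because this parser is event-driven and builds no tree, a script placed outside the html element is reported like any other.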

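Related:

@JoaoPereira This does not work because the HTML has multiple script tags, so I need soup.find_all() instead of soup.find(). With find_all, even when I loop over the results, it cannot find the right script. I think the alert is shown in the window, so parsing stops at the alert before all the scripts are collected.

@Fozoro It is different because, given the way this HTML is written, the alert cannot be retrieved. I tried the other answer in my tests and it works, but not with this HTML structure.

In that answer, extract() is called on the script found with the find() method. Did you try calling extract() on every script instance inside the loop?

It is outside the HTML tag, so it does not appear in the soup. Please check the HTML and see whether you can add a lookup that isolates the right variable.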