Python: scraping the alert text from a window.alert() popup

Tags: python, web-scraping, beautifulsoup, screen-scraping


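I am using the Python requests library together with BeautifulSoup. There is a URL that, when the request is invalid, returns HTML that pops up an alert(). The problem is that with BeautifulSoup I cannot get the window.alert popup text.

I tried the regex method from another answer, but it does not seem to work.

So, when doing this: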

for script in soup.find_all("script"):
    alert = re.findall(r'(?<=alert\(\").+(?=\")', script.text)

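This was solved by using the html5lib parser library. According to the documentation, it parses the page the same way a web browser does, so it is able to pick up scripts that sit outside the body tag.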

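import re
from bs4 import BeautifulSoup

# payload: the HTML text returned by the request
soup = BeautifulSoup(payload, 'html5lib')
errors = None
for scr in soup.find_all("script"):
    scrExtract = scr.extract()
    # capture the string assigned to the err variable
    alert = re.findall(r'err="(.*\w)"', scrExtract.text)
    if len(alert) > 0:
        errors = alert[0]
print(errors)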
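The regex in the question only matches a double-quoted literal passed directly to alert(), while the sample page in this thread calls alert(err) with a variable assigned using single quotes. As a minimal sketch, assuming the message is either a string literal or a simple variable assigned a string literal in the same script, the lookup could be written with the standard library alone. The helper name extract_alert_text is hypothetical and not part of BeautifulSoup.

```python
import re

# alert("text") or alert('text'): the argument is a string literal.
LITERAL_ALERT = re.compile(r"""alert\(\s*(["'])(.*?)\1\s*\)""")
# alert(name): the argument is a bare identifier.
VARIABLE_ALERT = re.compile(r"alert\(\s*([A-Za-z_$][\w$]*)\s*\)")


def extract_alert_text(script_text):
    """Return the alert message found in script_text, or None.

    Hypothetical helper: it only handles a literal argument or a variable
    that is assigned a quoted string somewhere in the same script.
    """
    literal = LITERAL_ALERT.search(script_text)
    if literal:
        return literal.group(2)

    variable = VARIABLE_ALERT.search(script_text)
    if variable is None:
        return None

    # Look for an assignment such as: var err='User ID';
    name = re.escape(variable.group(1))
    assignment = re.search(
        rf"""(?<![\w$]){name}\s*=\s*(["'])(.*?)\1""", script_text
    )
    return assignment.group(2) if assignment else None


sample_script = "var err='User ID';\nalert(err);"
print(extract_alert_text(sample_script))
```

The function takes the text of a single script element, so it can be called on script.text inside the find_all("script") loop. Messages built by concatenation or by function calls are outside what this sketch handles.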

Running BeautifulSoup's diagnose() on your data, I get the following information:

data = '''
<script language="JavaScript">
if(top.frames.length != 0) {
    location.href="frame_break.jsp"
}
</script>

<html>
<body>

</body>
</html>


<script>
    var err='User ID';
    alert(err);
    iBankForm.action='login.jsp';
    iBankForm.submit();
  </script>'''

from bs4.diagnose import diagnose

diagnose(data)
Prints:

Diagnostic running on Beautiful Soup 4.7.1
Python version 3.6.8 (default, Jan 14 2019, 11:02:34) 
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]]
Found lxml version 4.3.3.0
Found html5lib version 1.0.1

Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<script language="JavaScript">
 if(top.frames.length != 0) {
    location.href="frame_break.jsp"
}
</script>
<html>
 <body>
 </body>
</html>
<script>
 var err='User ID';
    alert(err);
    iBankForm.action='login.jsp';
    iBankForm.submit();
</script>
--------------------------------------------------------------------------------
Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<html>
 <head>
  <script language="JavaScript">
   if(top.frames.length != 0) {
    location.href="frame_break.jsp"
}
  </script>
 </head>
 <body>
  <script>
   var err='User ID';
    alert(err);
    iBankForm.action='login.jsp';
    iBankForm.submit();
  </script>
 </body>
</html>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml
Here's what lxml did with the markup:
<html>
 <head>
  <script language="JavaScript">
   if(top.frames.length != 0) {
    location.href="frame_break.jsp"
}
  </script>
 </head>
 <body>
 </body>
</html>

--------------------------------------------------------------------------------
Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>
<script language="JavaScript">
 if(top.frames.length != 0) {
    location.href="frame_break.jsp"
}
</script>
--------------------------------------------------------------------------------
['err']
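In the diagnose() output above, html.parser and html5lib both keep the second script, while lxml drops it. If html5lib is not installed, a rough alternative is to collect the script bodies with the standard library's html.parser module directly. This is only a sketch: ScriptCollector and collect_scripts are hypothetical names, and the sample markup is a shortened copy of the data from this answer.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text of every <script> element, wherever it appears."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so accumulate it.
        if self._in_script:
            self.scripts[-1] += data


def collect_scripts(markup):
    parser = ScriptCollector()
    parser.feed(markup)
    parser.close()
    return parser.scripts


sample = """
<script language="JavaScript">
if(top.frames.length != 0) {
    location.href="frame_break.jsp"
}
</script>
<html>
<body>
</body>
</html>
<script>
    var err='User ID';
    alert(err);
</script>"""

for body in collect_scripts(sample):
    print(body.strip())
    print("-" * 20)
```

Because this parser reports tags as it streams through the text instead of building a repaired tree, a script placed after the closing html tag is still reported.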

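Comments:

@JoaoPereira This does not work because the HTML has multiple script tags, so I need to use soup.find_all() instead of soup.find(). With find_all, even when I loop over the results, it cannot find the right script. I think the alert is shown in the window, so it stops at the alert before getting all the scripts.

@Fozoro This is different: because of the way the HTML is written, it cannot get the alert. I tried the other answer in my tests and it works, but not with this HTML structure.

In that answer, extract() is called on the script found with the find() method. Have you tried calling extract() on each script instance inside the loop?

It is outside the HTML tag, so it does not appear in the soup. Please inspect the HTML and see whether you can add a lookup that isolates the correct variable.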