Python BeautifulSoup 4: spans containing '@' return strange results


I was able to get the list of spans I need with:

attrs = soup.find_all("span")
This returns the spans as key/value pairs:

[
    <span>back camera resolution</span>, 
    <span class="even">12 MP</span>
]

[
    <span>front camera resolution</span>, 
    <span class="even">16 MP</span>
]

[
    <span>video resolution</span>, 
    <span class="even"><a class="__cf_email__" data-cfemail="b98b888f89c9f98a89dfc9ca" href="/cdn-cgi/l/email-protection">[email protected]</a><script data-cfhash="f9e31" type="text/javascript">/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */</script> - <a class="__cf_email__" data-cfemail="4677767e7636067576203635" href="/cdn-cgi/l/email-protection">[email protected]</a><script data-cfhash="f9e31" type="text/javascript">/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */</script> - <a class="__cf_email__" data-cfemail="5067626010616260362023" href="/cdn-cgi/l/email-protection">[email protected]</a><script data-cfhash="f9e31" type="text/javascript">/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return 
t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */</script>
    </span>
]
The text of these spans comes out as:

[
back camera resolution,
12 MP
]
[
front camera resolution,
16 MP
]
[
video resolution,
/*  */ - /*  */ - /*  */
]
The raw HTML for this is the span markup shown above.

Why is "video resolution" converted this way?

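The site is using Cloudflare email address obfuscation, which appears to have replaced every string containing an @ with an obfuscated (XOR-encrypted) value, to prevent scrapers from harvesting email addresses. Each replacement comes with the JavaScript code needed to decode it.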

BeautifulSoup does not execute JavaScript, but your browser has, and it replaced the <a class="__cf_email__"> tags with the decrypted data.

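You can do the same thing with a small Python 3 function. All the JavaScript code does is "decrypt" the (hex-encoded) value with a simple XOR routine that uses the first byte as the key: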

def decode(cfemail):
    enc = bytes.fromhex(cfemail)
    return bytes([c ^ enc[0] for c in enc[1:]]).decode('utf8')

def deobfuscate_cf_email(soup):
    for encrypted_email in soup.select('a.__cf_email__'):
        decrypted = decode(encrypted_email['data-cfemail'])
        # remove the <script> tag from the tree
        script_tag = encrypted_email.find_next_sibling('script')
        script_tag.decompose()
        # replace the <a class="__cf_email__"> tag with the decoded result
        encrypted_email.replace_with(decrypted)
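
To make the XOR step easier to follow, here is a minimal standalone sketch that uses only the standard library. The names `xor_decode` and `xor_encode` are hypothetical helpers introduced for illustration; `xor_encode` is simply the inverse of the scheme described above and is not meant to represent Cloudflare's actual implementation.

```python
def xor_decode(cfemail: str) -> str:
    """Decode a hex string whose first byte is the XOR key."""
    raw = bytes.fromhex(cfemail)
    key, payload = raw[0], raw[1:]
    # XOR every payload byte with the key to recover the original text.
    return bytes(b ^ key for b in payload).decode("utf8")


def xor_encode(text: str, key: int) -> str:
    """Illustrative inverse: XOR each byte with the key, then prepend the key."""
    payload = bytes(b ^ key for b in text.encode("utf8"))
    return (bytes([key]) + payload).hex()


# Value taken from the first data-cfemail attribute in the question.
sample = "b98b888f89c9f98a89dfc9ca"
print(xor_decode(sample))                          # 2160p@30fps
print(xor_encode("2160p@30fps", 0xB9) == sample)   # True
```

Because XOR is its own inverse, applying the same key twice restores the input, which is why the round trip above matches the attribute value exactly.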

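Don't confuse the DOM viewer with the source the server delivers to the browser. BeautifulSoup cannot execute the JavaScript the server sends. It looks like the server uses a JavaScript library to obfuscate email addresses automatically, and the browser runs that JavaScript to re-insert the text.

@MartijnPieters Wow! If it's that complicated, I guess it's not that important, so I'll skip it. Thanks.

It isn't hard to reverse; I posted a method in my answer. It takes the BeautifulSoup tree and replaces every occurrence with the deobfuscated result.

Demo of deobfuscate_cf_email on the markup from the question: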
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''
...     <span>video resolution</span>,
...     <span class="even"><a class="__cf_email__" data-cfemail="b98b888f89c9f98a89dfc9ca" href="/cdn-cgi/l/email-protection">[email protected]</a><script data-cfhash="f9e31" type="text/javascript">/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */</script> - <a class="__cf_email__" data-cfemail="4677767e7636067576203635" href="/cdn-cgi/l/email-protection">[email protected]</a><script data-cfhash="f9e31" type="text/javascript">/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */</script> - <a class="__cf_email__" data-cfemail="5067626010616260362023" href="/cdn-cgi/l/email-protection">[email protected]</a><script data-cfhash="f9e31" type="text/javascript">/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return 
t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */</script>
...     </span>
... ''')
>>> deobfuscate_cf_email(soup)
>>> soup
<html><body><span>video resolution</span>,
    <span class="even">2160p@30fps - 1080p@30fps - 720@120fps
</span>
</body></html>
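
If only the decoded strings are needed rather than a repaired tree, one possible alternative is to pull the `data-cfemail` values straight out of the raw HTML. The sketch below is a swapped-in technique, not part of the answer above: it uses a regular expression instead of BeautifulSoup, and it assumes the attribute is always double-quoted and contains only hex digits. The name `extract_cf_values` is hypothetical.

```python
import re

# Assumption: the attribute is written as data-cfemail="<hex digits>".
CFEMAIL_RE = re.compile(r'data-cfemail="([0-9a-fA-F]+)"')


def decode(cfemail):
    # Same routine as in the answer: the first byte is the XOR key.
    enc = bytes.fromhex(cfemail)
    return bytes([c ^ enc[0] for c in enc[1:]]).decode('utf8')


def extract_cf_values(html):
    """Return the decoded value of every data-cfemail attribute, in document order."""
    return [decode(value) for value in CFEMAIL_RE.findall(html)]


html = (
    '<span class="even">'
    '<a class="__cf_email__" data-cfemail="b98b888f89c9f98a89dfc9ca" '
    'href="/cdn-cgi/l/email-protection">[email protected]</a> - '
    '<a class="__cf_email__" data-cfemail="5067626010616260362023" '
    'href="/cdn-cgi/l/email-protection">[email protected]</a>'
    '</span>'
)
print(extract_cf_values(html))  # ['2160p@30fps', '720@120fps']
```

This loses the surrounding text (such as the " - " separators), so the tree-based function is the better fit when the full span content matters.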