Python web scraping: decoding an encoded price


python html function web-scraping


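When scraping an article page, the price is visible in the rendered element but not in the page source. In its place, the source contains the following encoded text: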


var f3699334f586f4f2bb6edc10899026d63 = function(value){
    return base64UTF8Codec.decode(arguments[0])
};
replaceWith(
    document.getElementById('9ad80ca8-79ac-4fd8-8998-cb6662e8cc9a'),
    f3699334f586f4f2bb6edc10899026d63('CiAgICAgICAgICAgICAgICA8c3BhbiBjbGFzcz0icHVsbC1yaWdodCI+IDIuNTkwLC0gPC9zcGFuPgogICAgICAgICAgICA=')
);

How can I decode this text to get the price?

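The text is Base64-encoded. If you can locate the right <script> tag with BeautifulSoup, you can use the re module to pull out the encoded string and then decode it: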

import re
import base64
from bs4 import BeautifulSoup

txt = '''<script>
var f3699334f586f4f2bb6edc10899026d63 = function(value){return base64UTF8Codec.decode(arguments[0])};
replaceWith(document.getElementById('9ad80ca8-79ac-4fd8-8998-cb6662e8cc9a'), f3699334f586f4f2bb6edc10899026d63('CiAgICAgICAgICAgICAgICA8c3BhbiBjbGFzcz0icHVsbC1yaWdodCI+IDIuNTkwLC0gPC9zcGFuPgogICAgICAgICAgICA='));
</script>'''

soup = BeautifulSoup(txt, 'html.parser')

# 1. locate the right <script> tag
script = soup.script

# 2. get coded text from the script tag
coded_text = re.findall(r".*\('(.*?)'\)\);", script.text)[0]

# 3. decode the text
decoded_text = base64.b64decode(coded_text)  # b'\n                <span class="pull-right"> 2.590,- </span>\n            '

# 4. get the price from the decoded text
soup2 = BeautifulSoup(decoded_text, 'html.parser')

print(soup2.span.get_text(strip=True))
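
The value printed above is still a display string ("2.590,-"). If a number is needed, the conversion might look like the sketch below. It assumes European-style formatting, which the question does not confirm: "." as the thousands separator, "," as the decimal separator, and a trailing ",-" meaning no fractional part. The helper name parse_price is hypothetical.

```python
from decimal import Decimal


def parse_price(text):
    """Convert a display price such as '2.590,-' to a Decimal.

    Assumption (not confirmed by the source page): '.' separates thousands,
    ',' is the decimal separator, and a trailing ',-' means "no cents".
    """
    cleaned = text.strip()
    if cleaned.endswith(",-"):
        cleaned = cleaned[:-2]          # drop the ",-" suffix
    cleaned = cleaned.replace(".", "")  # remove thousands separators
    cleaned = cleaned.replace(",", ".") # use '.' as the decimal point
    return Decimal(cleaned)


print(parse_price("2.590,-"))   # 2590
print(parse_price("1.234,50"))  # 1234.50
```

Decimal is used rather than float so that amounts are not subject to binary rounding. If the site formats prices differently, the two replace calls need adjusting.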

To answer your question: the page is loaded and then populated with additional data by JavaScript. Sharing the URL is almost always helpful when the page is public rather than behind a login.
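
Because the page is filled in by JavaScript, a real page may well contain many script tags and several encoded fragments, so taking the first script is not always enough. Below is a minimal sketch that scans the whole HTML for calls of the shape shown in the question and maps each element id to its decoded text. It uses only the standard library; the regular expression, the helper names, and the sample page are assumptions made for illustration, and BeautifulSoup would serve equally well for the parsing steps.

```python
import base64
import re
from html.parser import HTMLParser

# Assumed call shape, taken from the snippet in the question:
#   replaceWith(document.getElementById('<id>'), <func>('<base64>'));
CALL_RE = re.compile(
    r"getElementById\('([^']+)'\)\s*,\s*\w+\('([A-Za-z0-9+/=]+)'\)"
)


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def fragment_text(fragment):
    """Return the visible text of an HTML fragment, stripped of whitespace."""
    collector = _TextCollector()
    collector.feed(fragment)
    collector.close()
    return "".join(collector.parts).strip()


def extract_encoded_fragments(html):
    """Return {element_id: decoded_html} for every matching call in the page."""
    return {
        element_id: base64.b64decode(payload).decode("utf-8")
        for element_id, payload in CALL_RE.findall(html)
    }


# Hypothetical page: one unrelated script plus the script from the question.
page = '''<html><body>
<script>var unrelated = 1;</script>
<script>
var f3699334f586f4f2bb6edc10899026d63 = function(value){return base64UTF8Codec.decode(arguments[0])};
replaceWith(document.getElementById('9ad80ca8-79ac-4fd8-8998-cb6662e8cc9a'), f3699334f586f4f2bb6edc10899026d63('CiAgICAgICAgICAgICAgICA8c3BhbiBjbGFzcz0icHVsbC1yaWdodCI+IDIuNTkwLC0gPC9zcGFuPgogICAgICAgICAgICA='));
</script>
</body></html>'''

prices = {
    element_id: fragment_text(fragment)
    for element_id, fragment in extract_encoded_fragments(page).items()
}
print(prices)
```

Keying the result by element id makes it possible to match each decoded fragment back to the placeholder element it replaces in the page.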