Python BeautifulSoup web-scraping problem


When I open the URL I want to pull information from in my browser, the HTML shows everything. But the HTML my script gets back only contains part of it, and doesn't even match what I see. The site does show a loading screen when it opens in my browser, but I'm not sure whether that's the problem. Maybe they are blocking people from scraping it? The HTML I get back:

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<title></title>
<base href="/app"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<link href="favicon.ico" rel="icon" type="image/x-icon"/>
<link href="https://fonts.googleapis.com/icon?family=Material+Icons" rel="stylesheet"/>
<link href="styles.css" rel="stylesheet"/></head>
<body class="cl">
<app-root>
<div class="loader-wrapper">
<div class="loader"></div>
</div>
</app-root>
<script src="runtime.js" type="text/javascript"></script><script src="polyfills.js" type="text/javascript"></script><script src="scripts.js" type="text/javascript"></script><script src="main.js" type="text/javascript"></script></body>
<script src="https://www.google.com/recaptcha/api.js"></script>
<noscript>
<meta content="0; URL=assets/javascript-warning.html" http-equiv="refresh"/>
</noscript>
</html>

It looks like the page content is generated dynamically by JavaScript. You can combine selenium with BeautifulSoup to parse such pages. The advantage of selenium is that it can reproduce user behavior in the browser: clicking buttons or links, typing text into input fields, and so on.

Here is a short example:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup

# maximum wait of 30 seconds
DELAY = 30

# define URL
url = '<<WEBSITE_URL>>'

# define options for the selenium driver
chrome_options = webdriver.ChromeOptions()
# this makes the browser "invisible";
# comment it out to watch all actions performed by selenium
chrome_options.add_argument('--headless')

# create the selenium web driver
# (Selenium 4+ takes the driver path via a Service object)
driver = webdriver.Chrome(service=Service("<PATH_TO_CHROME_DRIVER>"), options=chrome_options)

# open the web page
driver.get(url)

# wait up to 30 seconds for the h1 element to appear
h1_element = WebDriverWait(driver, DELAY).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'h1.trend-and-value'))
)

# parse the rendered page content using bs4
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

print(soup)
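Rather than printing the whole document, the soup can also be queried for just the element of interest. A minimal sketch of that last step on a static HTML string (the `trend-and-value` class name comes from the answer's selector; the text content is made up for illustration):

```python
from bs4 import BeautifulSoup

# stand-in for driver.page_source after JavaScript has rendered the page
rendered_html = '<html><body><h1 class="trend-and-value">AAPL +1.2%</h1></body></html>'

soup = BeautifulSoup(rendered_html, 'html.parser')
h1 = soup.select_one('h1.trend-and-value')  # CSS selector, first match only
print(h1.get_text())  # → AAPL +1.2%
```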

Another solution could be to analyze the requests the JavaScript-rendered page makes. Often such pages retrieve their data in JSON format from a backend endpoint, which your scraper could call directly instead.
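A minimal sketch of that approach. The endpoint path and the field names here are hypothetical, made up for illustration; the real ones would be found in the browser DevTools "Network" tab while the page loads:

```python
import json

# In DevTools you would look for an XHR/fetch request returning JSON,
# e.g. something like <<WEBSITE_URL>>/api/trends, and call it directly:
#
#   import requests
#   payload = requests.get('<<WEBSITE_URL>>/api/trends').json()

def extract_trends(payload):
    """Pull (name, value) pairs out of a hypothetical JSON payload."""
    return [(item['name'], item['value']) for item in payload['trends']]

# sample payload standing in for the real endpoint's response
sample = json.loads('{"trends": [{"name": "AAPL", "value": 171.2}]}')
print(extract_trends(sample))  # → [('AAPL', 171.2)]
```

Calling the JSON endpoint directly is usually faster and more robust than driving a browser, since no rendering is involved.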

Can you add a code snippet? Please post the HTML and your code.
The HTML may be generated dynamically via JS. In that case, scraping with BeautifulSoup only gets the initial HTML generated by the server; the dynamically generated parts will not be in the BeautifulSoup response.
@Ananth I added the code and output.
@DukeOfHazard I included the code and the HTML in the post. If the HTML is generated via JS, how do I get the JS that generates it?
Hi! The code works up until the declaration of h1_element; it says wd is not defined. Also, the item I want to scrape is a div, but I assume it will work if I just replace h1.trend-and-value with div.trend-and-value.