Python BeautifulSoup web-scraping problem


When I open the URL I want to pull information from in my browser, the HTML shows everything. But the HTML my script gets back only contains part of it, and doesn't even match what I see. The site does show a loading screen when it opens in my browser, but I'm not sure whether that's the problem. Maybe they are blocking people from scraping it? The HTML I get back:

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<title></title>
<base href="/app"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<link href="favicon.ico" rel="icon" type="image/x-icon"/>
<link href="https://fonts.googleapis.com/icon?family=Material+Icons" rel="stylesheet"/>
<link href="styles.css" rel="stylesheet"/></head>
<body class="cl">
<app-root>
<div class="loader-wrapper">
<div class="loader"></div>
</div>
</app-root>
<script src="runtime.js" type="text/javascript"></script><script src="polyfills.js" type="text/javascript"></script><script src="scripts.js" type="text/javascript"></script><script src="main.js" type="text/javascript"></script></body>
<script src="https://www.google.com/recaptcha/api.js"></script>
<noscript>
<meta content="0; URL=assets/javascript-warning.html" http-equiv="refresh"/>
</noscript>
</html>

It looks like the page content is generated dynamically by JavaScript. You can combine selenium with BeautifulSoup to parse such pages. The advantage of selenium is that it can reproduce user behavior in the browser: clicking buttons or links, typing text into input fields, and so on.

Here is a short example:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup

# maximum wait of 30 seconds
DELAY = 30

# define URL
url = '<<WEBSITE_URL>>'

# define options for the selenium driver
chrome_options = webdriver.ChromeOptions()
# this makes the browser "invisible";
# comment it out to watch all actions performed by selenium
chrome_options.add_argument('--headless')

# create the selenium web driver
# (Selenium 4+ takes the driver path via a Service object)
driver = webdriver.Chrome(service=Service("<PATH_TO_CHROME_DRIVER>"), options=chrome_options)

# open the web page
driver.get(url)

# wait up to 30 seconds for the h1 element to appear
h1_element = WebDriverWait(driver, DELAY).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'h1.trend-and-value'))
)

# parse the rendered page content using bs4
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

print(soup)
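Rather than printing the whole document, the soup can also be queried for just the element of interest. A minimal sketch of that last step on a static HTML string (the `trend-and-value` class name comes from the answer's selector; the text content is made up for illustration):

```python
from bs4 import BeautifulSoup

# stand-in for driver.page_source after JavaScript has rendered the page
rendered_html = '<html><body><h1 class="trend-and-value">AAPL +1.2%</h1></body></html>'

soup = BeautifulSoup(rendered_html, 'html.parser')
h1 = soup.select_one('h1.trend-and-value')  # CSS selector, first match only
print(h1.get_text())  # → AAPL +1.2%
```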

Another solution could be to analyze the requests the JavaScript-rendered page makes. Often such pages retrieve their data in JSON format from a backend endpoint, which your scraper could call directly instead.
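A minimal sketch of that approach. The endpoint path and the field names here are hypothetical, made up for illustration; the real ones would be found in the browser DevTools "Network" tab while the page loads:

```python
import json

# In DevTools you would look for an XHR/fetch request returning JSON,
# e.g. something like <<WEBSITE_URL>>/api/trends, and call it directly:
#
#   import requests
#   payload = requests.get('<<WEBSITE_URL>>/api/trends').json()

def extract_trends(payload):
    """Pull (name, value) pairs out of a hypothetical JSON payload."""
    return [(item['name'], item['value']) for item in payload['trends']]

# sample payload standing in for the real endpoint's response
sample = json.loads('{"trends": [{"name": "AAPL", "value": 171.2}]}')
print(extract_trends(sample))  # → [('AAPL', 171.2)]
```

Calling the JSON endpoint directly is usually faster and more robust than driving a browser, since no rendering is involved.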

Can you add a code snippet? Please post the HTML and your code.
The HTML may be generated dynamically via JS. In that case, scraping with BeautifulSoup only gets the initial HTML generated by the server; the dynamically generated parts will not be in the BeautifulSoup response.
@Ananth I added the code and output.
@DukeOfHazard I included the code and the HTML in the post. If the HTML is generated via JS, how do I get the JS that generates it?
Hi! The code works up until the declaration of h1_element; it says wd is not defined. Also, the item I want to scrape is a div, but I assume it will work if I just replace h1.trend-and-value with div.trend-and-value.