Python web scraping: Beautiful Soup

python web-scraping

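I have a web scraping problem. I am trying to get the score difference between two teams (for example: +2, +1, ...), but when I apply the find_all method, it returns an empty list.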

from bs4 import BeautifulSoup
from requests import get
url='https://www.mismarcadores.com/partido/Q942gje8/#punto-a-punto;1'
response=get(url)
html_soup=BeautifulSoup(response.text,'html.parser')


html_soup.find_all('span',class_='match-history-diff-score-inc')

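If you inspect the page source (for example via view-source: in Chrome or Firefox, or by writing the HTML string to a file), you will see that the elements you are looking for (search for match-history-diff-score-inc) are not there. In fact, these score differences are loaded dynamically using JS.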
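A quick way to confirm this before reaching for a heavier tool is to test whether the class name occurs in the raw HTML at all. The sketch below is only an illustration: the two HTML strings are hypothetical stand-ins for what the server sends and what the browser shows after the scripts run, and class_in_raw_html is a helper name introduced here.

```python
def class_in_raw_html(html: str, class_name: str) -> bool:
    """Return True if class_name appears anywhere in the raw HTML string."""
    return class_name in html


TARGET_CLASS = "match-history-diff-score-inc"

# Hypothetical stand-ins: the markup the server sends vs. the markup
# the browser holds after JavaScript has filled in the table.
served_html = '<html><body><div id="detail"></div><script src="app.js"></script></body></html>'
rendered_html = '<td><span class="match-history-diff-score-inc">+2</span></td>'

print(class_in_raw_html(served_html, TARGET_CLASS))    # False
print(class_in_raw_html(rendered_html, TARGET_CLASS))  # True
```

With the code from the question, the same check would be class_in_raw_html(response.text, TARGET_CLASS). If it returns False, find_all has nothing to match regardless of which parser is used.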

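The problem is that the web content is generated dynamically by JavaScript. requests only downloads the initial HTML and does not execute scripts, so it is better to use a browser automation tool such as Selenium.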

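Edit: following @λ用户's suggestion, I modified my answer to search for the elements via XPath using only Selenium. Note that I use the XPath function starts-with() to get both match-history-diff-score-dec and match-history-diff-score-inc. Selecting only one of them would make you miss almost half of the relative score updates. That is why the output has 103 results instead of 56.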

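from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.mismarcadores.com/partido/Q942gje8/#punto-a-punto;1")

# Match both the "-inc" and "-dec" spans through their shared class prefix.
spans = driver.find_elements(By.XPATH, '//td//span[starts-with(@class, "match-history-diff-score-")]')

# Collect the content of each span so the final print shows the full list.
results = [span.get_attribute('innerHTML') for span in spans]
print(results)

driver.quit()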

This produces:

['+2', '+1', '+2', '+2', '+1', '+2', '+4', '+2', '+2', '+4', '+7', '+5', '+8', '+5', '+7', '+5', '+3', '+2', '+5', '+3', '+5', '+3', '+5', '+6', '+4', '+6', '+7', '+6', '+5', '+2', '+4', '+2', '+5', '+7', '+6', '+8', '+5', '+3', '+1', '+2', '+1', '+4', '+7', '+5', '+8', '+6', '+9', '+11', '+10', '+9', '+11', '+9', '+10', '+11', '+9', '+7', '+5', '+3', '+2', '+1', '+3', '+1', '+3', '+2', '+1', '+3', '+2', '+4', '+1', '+2', '+3', '+6', '+3', '+5', '+2', '+1', '+1', '+2', '+4', '+3', '+2', '+4', '+1', '+3', '+5', '+7', '+5', '+8', '+7', '+6', '+5', '+4', '+1', '+4', '+6', '+9', '+7', '+9', '+7', '+10', '+11', '+12', '+10']
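Because the spans are inserted by JavaScript, a lookup made immediately after the page opens can run before they exist. One possible safeguard, sketched here as an assumption and not as part of the original answer, is an explicit wait. The function name fetch_score_diffs and the 10 second default timeout are illustrative choices.

```python
DIFF_XPATH = '//td//span[starts-with(@class, "match-history-diff-score-")]'


def fetch_score_diffs(url, timeout=10):
    """Open url in Chrome, wait for the score spans, return their inner HTML."""
    # Imported here so the module can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Blocks until at least one matching span is in the DOM;
        # raises TimeoutException if none appears within `timeout` seconds.
        spans = WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.XPATH, DIFF_XPATH))
        )
        return [span.get_attribute("innerHTML") for span in spans]
    finally:
        # Close the browser even if the wait times out.
        driver.quit()


# Example call (needs Chrome and a matching driver):
# diffs = fetch_score_diffs("https://www.mismarcadores.com/partido/Q942gje8/#punto-a-punto;1")
```

The wait returns as soon as the first matching span is present, so on a page that fills its table in several steps the list could still be incomplete. Whether that happens on this site is not something the thread establishes.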

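Selenium may solve your problem, but I suggest tracing the network traffic in your browser and finding the request that returns the data you need. In your case it is d_mh_Q942gje8_es_1.

I don't like Selenium because it is heavy and makes your script slow. It was built for automated testing, not web scraping.

Here is my script using requests, which undoubtedly runs faster than Selenium:
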
import requests
from bs4 import BeautifulSoup

url = 'https://d.mismarcadores.com/x/feed/d_mh_Q942gje8_es_1'

r = requests.get(url, headers={'x-fsign':'SW9D1eZo'}) # Got this from browser
soup = BeautifulSoup(r.text, 'html.parser')
diff_list = [diff.text for diff in soup.find_all('span',{'class' : 'match-history-diff-score-inc'})]
print(diff_list)

Output:
['+2', '+1', '+2', '+2', '+2', '+4', '+2', '+4', '+7', '+8', '+7', '+5', '+5', '+5', '+6', '+6', '+7', '+4', '+5', '+7', '+8', '+1', '+2', '+4', '+7', '+8', '+9', '+11', '+11', '+10', '+11', '+1', '+3', '+3', '+3', '+4', '+2', '+3', '+6', '+5', '+1', '+1', '+2', '+4', '+4', '+3', '+5', '+7', '+8', '+4', '+6', '+9', '+9', '+10', '+11', '+12']
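The script above selects only the match-history-diff-score-inc class, which is why it returns 56 values where the Selenium answer, matching both the -inc and -dec classes, returned 103. A minimal sketch of collecting both from an HTML string follows. It uses only the standard library's html.parser so it runs without extra dependencies; the DiffSpanParser class, the extract_diffs helper, and the sample markup are all illustrative, not taken from the real feed.

```python
from html.parser import HTMLParser


class DiffSpanParser(HTMLParser):
    """Collect the text of <span> tags having a class that starts with a prefix."""

    def __init__(self, prefix="match-history-diff-score-"):
        super().__init__()
        self.prefix = prefix
        self.diffs = []
        self._in_diff_span = False

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            classes = (dict(attrs).get("class") or "").split()
            self._in_diff_span = any(c.startswith(self.prefix) for c in classes)

    def handle_data(self, data):
        if self._in_diff_span and data.strip():
            self.diffs.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_diff_span = False


def extract_diffs(html):
    """Return the text of every span whose class matches the prefix."""
    parser = DiffSpanParser()
    parser.feed(html)
    parser.close()
    return parser.diffs


# Hypothetical sample markup, shaped like the spans discussed in this thread.
sample = (
    '<table><tr><td><span class="match-history-diff-score-inc">+2</span></td></tr>'
    '<tr><td><span class="match-history-diff-score-dec">+1</span></td></tr>'
    '<tr><td><span class="other">x</span></td></tr></table>'
)
print(extract_diffs(sample))  # ['+2', '+1']
```

With BeautifulSoup, a comparable selection should be possible by passing a compiled regular expression such as re.compile('^match-history-diff-score-') as the class_ argument of find_all. Note that this sketch matches any class token with the prefix, whereas the XPath starts-with(@class, ...) test looks at the start of the whole class attribute.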

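It seems the content you are looking for has not been loaded when BeautifulSoup scrapes the page. It looks like it is loaded onto the page dynamically, and BeautifulSoup cannot pick it up. I suggest looking at existing questions on this topic.

Selenium will make it slow. Trace the network from your browser, find the specific request, and then you can solve your problem with requests + bs4. Trust me, your code will be 5 times faster.

If you are using Selenium, why parse the source again with BeautifulSoup? Selenium can already run XPath queries and extract text for you.

Yes, you are right. It is a bit redundant. Maybe I will edit the answer into a Selenium-only solution.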