Python: What is wrong with my Beautiful Soup? Is it not finding all the code here?


python web-scraping


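I feel like I'm missing something basic, but I'm stuck. I'm just trying to return a table with Beautiful Soup, but for some reason it isn't grabbing the table with the line score by ID. I can target other divs and tables on this page by ID, but this particular table returns nothing. Any idea what I'm missing?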

from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

url = 'https://www.sports-reference.com/cbb/boxscores/2020-01-14-19-clemson.html'
html = urlopen(url)
soup = BeautifulSoup(html.read(), 'html.parser')
ls = soup.find_all('table', {"id": "line-score"})

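The table appears to be added by JavaScript. If you look at the actual page source (right-click and choose View Source rather than Inspect), you will see that the table is commented out in the HTML.

You need to render the actual page in a browser and grab the resulting source.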
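Because the table markup sits inside an HTML comment, another option that avoids rendering is to read the comment nodes and parse their contents a second time. The following is only a minimal sketch of that idea: `find_commented_table`, `table_rows`, and the inline `sample_html` are hypothetical names and data made up for illustration, and the real page's markup may be laid out differently.

```python
from bs4 import BeautifulSoup, Comment


def find_commented_table(html, table_id):
    """Return the first <table id=table_id> found inside an HTML comment, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Comment nodes are strings, so filter the document's strings by type.
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        if table_id not in comment:
            continue
        # Parse the comment body as HTML in its own right.
        inner = BeautifulSoup(str(comment), "html.parser")
        table = inner.find("table", id=table_id)
        if table is not None:
            return table
    return None


def table_rows(table):
    """Return the text of every cell, one list per table row."""
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]


# Hypothetical stand-in for the downloaded page: the table is commented out.
sample_html = """
<div id="all_line-score">
<!--
<table id="line-score">
<tr><td>Duke</td><td>33</td><td>39</td><td>72</td></tr>
<tr><td>Clemson</td><td>40</td><td>39</td><td>79</td></tr>
</table>
-->
</div>
"""

# A direct lookup finds nothing, because the parser sees only a comment.
direct = BeautifulSoup(sample_html, "html.parser").find("table", id="line-score")

table = find_commented_table(sample_html, "line-score")
rows = table_rows(table)
print(rows)
```

With the real page, `sample_html` would be replaced by the downloaded HTML, for example `html.read()` from the question.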

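As Ipellis said, the table you are trying to scrape is added by JavaScript. That means the table is rendered after you make the initial request, so the HTML you receive does not contain it as a regular element.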

To get the data, I suggest switching from BeautifulSoup to requests-html (a very similar library). It can handle these JavaScript cases and also takes care of the web requests.

You need to install it first:

pip install requests-html

Then you simply do the following:

from requests_html import HTMLSession

session = HTMLSession()
response = session.get('https://www.sports-reference.com/cbb/boxscores/2020-01-14-19-clemson.html')
response.html.render()  # Run the page's JavaScript so the table is added to the DOM

# xpath() returns a list of matching elements
table = response.html.xpath('//table[@id="line-score"]')

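I recommend using XPath when looking up elements. It is a more precise way to get the element you are looking for. If you are not familiar with XPath, it is worth reading an introduction to it.

The requests-html documentation has more details.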
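For readers new to XPath, here is a small sketch of what an expression like the one above selects. It uses the standard library's `xml.etree.ElementTree` rather than requests-html, so two assumptions apply: ElementTree supports only a subset of XPath (the path is written relative, as `.//table[...]`), and the input must be well-formed markup. The `snippet` string is hypothetical and far simpler than the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet with two tables; only one has the id we want.
snippet = """
<div>
  <table id="other"><tr><td>ignored</td></tr></table>
  <table id="line-score">
    <tr><td>Duke</td><td>72</td></tr>
    <tr><td>Clemson</td><td>79</td></tr>
  </table>
</div>
"""

root = ET.fromstring(snippet)

# ".//table" searches all descendants; "[@id='line-score']" filters on the attribute.
table = root.find(".//table[@id='line-score']")

rows = [[td.text for td in tr.findall("td")] for tr in table.findall("tr")]
print(rows)
```

The predicate in square brackets is what makes the lookup specific: the first table is skipped because its `id` attribute does not match.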

Yes, the table is added via JS, but the data is already in the page source. You can get it as follows:

from simplified_scrapy.request import req
from simplified_scrapy.simplified_doc import SimplifiedDoc

html = req.get('https://www.sports-reference.com/cbb/boxscores/2020-01-14-19-clemson.html')
doc = SimplifiedDoc(html)

# Line score table
table = doc.getElement('table', attr='id', value='line-score')
trs = table.trs.notContains('thead', attr='class').notContains('colspan')  # Filter out the head
for tr in trs:
    tds = tr.children
    print([td.text for td in tds])

# Four factors table
table = doc.getElement('table', attr='id', value='four-factors')
trs = table.trs.notContains('thead', attr='class').notContains('colspan')  # Filter out the head
for tr in trs:
    tds = tr.children
    print([td.text for td in tds])

Result:

['Duke', '33', '39', '72']
['Clemson', '40', '39', '79']
['Duke', '73.5', '.574', '19.3', '13.8', '.185', '98.6']
['Clemson', '73.5', '.642', '19.3', '20.7', '.208', '108.2']

['Duke', '33', '39', '72']
['Clemson', '40', '39', '79']

Examples for SimplifiedDoc are available. You can also get the table by rendering the page. The code below uses the pyppeteer library.

from simplified_html.request_render import RequestRender
from simplified_scrapy.simplified_doc import SimplifiedDoc

# Path to a local Chrome executable (macOS path shown)
req = RequestRender({'executablePath': '/Applications/chrome.app/Contents/MacOS/Google Chrome'})

def callback(html, url, data):
    doc = SimplifiedDoc(html)
    table = doc.getElementByID('line-score')
    trs = table.trs.notContains('thead', attr='class').notContains('colspan')  # Filter out the head
    for tr in trs:
        tds = tr.children
        print([td.text for td in tds])

req.get('https://www.sports-reference.com/cbb/boxscores/2020-01-14-19-clemson.html', callback)

Result:

['Duke', '33', '39', '72']
['Clemson', '40', '39', '79']
['Duke', '73.5', '.574', '19.3', '13.8', '.185', '98.6']
['Clemson', '73.5', '.642', '19.3', '20.7', '.208', '108.2']

['Duke', '33', '39', '72']
['Clemson', '40', '39', '79']

The problem seems to lie in the call itself. I tried it and was puzzled as well. I suggest using the requests package, which has served me well.