Scraping information - BeautifulSoup/Python
My code goes to a website and extracts the URLs, then visits each URL it scraped (that part works fine). On each of these new pages I want to get some information (the author name), but it prints blank. The code is below:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
import requests

driver = webdriver.Chrome()

eachLink = []
baseurl = 'https://meetinglibrary.asco.org'

for x in range(1, 2):
    driver.get(f'https://meetinglibrary.asco.org/results?meetingView=2020%20ASCO%20Virtual%20Scientific%20Program&page={x}')
    time.sleep(3)
    page_source = driver.page_source
    soup = BeautifulSoup(page_source, 'html.parser')

    productlist = soup.find_all('a', class_='ng-star-inserted')
    for item in productlist:
        for link in item.find_all('a', href=True):
            eachLink.append(baseurl + link['href'])

print(eachLink)

infobox = []

for b in eachLink:
    r = requests.get(b)
    time.sleep(1)
    soup1 = BeautifulSoup(r.content, 'html.parser')

    auth = soup1.find('a', class_='asset-metadata-value link ng-star-inserted')
    print(auth)
It may be that time.sleep(1) in the eachLink loop is not long enough and the page is still loading. Instead of time.sleep (an implicit wait), you can use an explicit wait that checks for an expected condition:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

path = "//div[@id='YOURIDHERE']"  # change this to an element that should appear on each link's page
button = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(
        (By.XPATH, path)
    )
)
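A likely root cause of the blank output is that these pages are rendered client-side (the ng-star-inserted classes suggest Angular), so the HTML that requests.get receives does not yet contain the author link; only a browser that executes the JavaScript sees it. A minimal sketch, with made-up HTML standing in for the served page versus the browser-rendered page:

```python
from bs4 import BeautifulSoup

SELECTOR_CLASS = 'asset-metadata-value link ng-star-inserted'  # from the question's code

# Hypothetical HTML as initially served: an empty Angular shell, no author yet.
served_html = "<html><body><app-root></app-root></body></html>"

# Hypothetical HTML after the browser has executed the JavaScript.
rendered_html = """
<html><body><app-root>
  <a class="asset-metadata-value link ng-star-inserted">Jane Doe</a>
</app-root></body></html>
"""

# Parsing the served HTML finds nothing -- this is what requests.get sees.
auth = BeautifulSoup(served_html, 'html.parser').find('a', class_=SELECTOR_CLASS)
print(auth)  # None, hence the "blank" output

# Parsing the rendered HTML (e.g. driver.page_source) finds the author.
auth = BeautifulSoup(rendered_html, 'html.parser').find('a', class_=SELECTOR_CLASS)
print(auth.get_text())  # Jane Doe
```

Note that passing a multi-word string to class_ makes BeautifulSoup match the exact class attribute value, which is why the full three-class string from the question works here.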
I think this helps. There is no need for a wait-and-loop over each link to extract the authors:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
import requests

driver = webdriver.Chrome()

eachLink = []
authors = []
baseurl = 'https://meetinglibrary.asco.org'

for x in range(1, 2):
    driver.get(f'https://meetinglibrary.asco.org/results?meetingView=2020%20ASCO%20Virtual%20Scientific%20Program&page={x}')
    time.sleep(120)
    page_source = driver.page_source
    soup = BeautifulSoup(page_source, 'html.parser')

    productlist = soup.find_all('a', class_='ng-star-inserted')

    for auth in soup.find_all('div', {'class': 'record__ellipsis'}):
        authors.append(auth.text)

    for item in productlist:
        for link in item.find_all('a', href=True):
            eachLink.append(baseurl + link['href'])

print(eachLink)
print('\n', authors, '\n')
driver.quit()
So what exactly is the problem? — That is already stated clearly in the post: it prints blank… — Thanks, but I eventually want to grab more information, so I want to get it from the href pages rather than the current page.
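For pulling the author from each href page (as the last comment asks), one option consistent with the explicit-wait answer above is to load each link in the Selenium driver, wait for the metadata element, and parse driver.page_source instead of requests' response. The class string comes from the question's code; the page layout and the wait selector are assumptions, not verified against the live site:

```python
from bs4 import BeautifulSoup

def extract_author(page_html):
    """Return the author name from a rendered detail page, or None.

    Assumes the author is the <a> tag with the class string used in
    the question; this is a sketch, not a tested scraper.
    """
    soup = BeautifulSoup(page_html, 'html.parser')
    auth = soup.find('a', class_='asset-metadata-value link ng-star-inserted')
    return auth.get_text(strip=True) if auth else None

# Sketch of the per-link loop: render each page in the browser rather than
# fetching it with requests, so the Angular-inserted content actually exists.
#
# for b in eachLink:
#     driver.get(b)
#     WebDriverWait(driver, 10).until(EC.presence_of_element_located(
#         (By.CSS_SELECTOR, 'a.asset-metadata-value')))
#     print(extract_author(driver.page_source))

# Demonstration on a hypothetical rendered fragment:
sample = '<a class="asset-metadata-value link ng-star-inserted">John Smith</a>'
print(extract_author(sample))  # John Smith
```

The same helper could be extended to pull other fields from the detail page once they are identified in the rendered HTML.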