Scraping information - BeautifulSoup/Python
My code goes to a website and extracts the URLs, then visits each URL it scraped (that part works fine). On each of these new pages I want to get some information (the author name), but it prints blank. The code is below:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
import requests

driver = webdriver.Chrome()

eachLink = []
baseurl = 'https://meetinglibrary.asco.org'

for x in range(1, 2):
    driver.get(f'https://meetinglibrary.asco.org/results?meetingView=2020%20ASCO%20Virtual%20Scientific%20Program&page={x}')
    time.sleep(3)
    page_source = driver.page_source
    soup = BeautifulSoup(page_source, 'html.parser')

    productlist = soup.find_all('a', class_='ng-star-inserted')
    for item in productlist:
        for link in item.find_all('a', href=True):
            eachLink.append(baseurl + link['href'])

print(eachLink)

infobox = []

for b in eachLink:
    r = requests.get(b)
    time.sleep(1)
    soup1 = BeautifulSoup(r.content, 'html.parser')

    auth = soup1.find('a', class_='asset-metadata-value link ng-star-inserted')
    print(auth)
It may be that time.sleep(1) in the eachLink loop is not long enough and the page is still loading. Instead of time.sleep (an implicit wait), you can use an explicit wait that checks for an expected condition:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

path = "//div[@id='YOURIDHERE']"  # change this to an element that should appear on each link's page
button = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(
        (By.XPATH, path)
    )
)
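A likely root cause of the blank output is that these pages are rendered client-side (the ng-star-inserted classes suggest Angular), so the HTML that requests.get receives does not yet contain the author link; only a browser that executes the JavaScript sees it. A minimal sketch, with made-up HTML standing in for the served page versus the browser-rendered page:

```python
from bs4 import BeautifulSoup

SELECTOR_CLASS = 'asset-metadata-value link ng-star-inserted'  # from the question's code

# Hypothetical HTML as initially served: an empty Angular shell, no author yet.
served_html = "<html><body><app-root></app-root></body></html>"

# Hypothetical HTML after the browser has executed the JavaScript.
rendered_html = """
<html><body><app-root>
  <a class="asset-metadata-value link ng-star-inserted">Jane Doe</a>
</app-root></body></html>
"""

# Parsing the served HTML finds nothing -- this is what requests.get sees.
auth = BeautifulSoup(served_html, 'html.parser').find('a', class_=SELECTOR_CLASS)
print(auth)  # None, hence the "blank" output

# Parsing the rendered HTML (e.g. driver.page_source) finds the author.
auth = BeautifulSoup(rendered_html, 'html.parser').find('a', class_=SELECTOR_CLASS)
print(auth.get_text())  # Jane Doe
```

Note that passing a multi-word string to class_ makes BeautifulSoup match the exact class attribute value, which is why the full three-class string from the question works here.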
I think this helps. There is no need for a wait-and-loop over each link to extract the authors:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
import requests

driver = webdriver.Chrome()

eachLink = []
authors = []
baseurl = 'https://meetinglibrary.asco.org'

for x in range(1, 2):
    driver.get(f'https://meetinglibrary.asco.org/results?meetingView=2020%20ASCO%20Virtual%20Scientific%20Program&page={x}')
    time.sleep(120)
    page_source = driver.page_source
    soup = BeautifulSoup(page_source, 'html.parser')

    productlist = soup.find_all('a', class_='ng-star-inserted')

    for auth in soup.find_all('div', {'class': 'record__ellipsis'}):
        authors.append(auth.text)

    for item in productlist:
        for link in item.find_all('a', href=True):
            eachLink.append(baseurl + link['href'])

print(eachLink)
print('\n', authors, '\n')
driver.quit()
So what exactly is the problem? — That is already stated clearly in the post: it prints blank… — Thanks, but I eventually want to grab more information, so I want to get it from the href pages rather than the current page.
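For pulling the author from each href page (as the last comment asks), one option consistent with the explicit-wait answer above is to load each link in the Selenium driver, wait for the metadata element, and parse driver.page_source instead of requests' response. The class string comes from the question's code; the page layout and the wait selector are assumptions, not verified against the live site:

```python
from bs4 import BeautifulSoup

def extract_author(page_html):
    """Return the author name from a rendered detail page, or None.

    Assumes the author is the <a> tag with the class string used in
    the question; this is a sketch, not a tested scraper.
    """
    soup = BeautifulSoup(page_html, 'html.parser')
    auth = soup.find('a', class_='asset-metadata-value link ng-star-inserted')
    return auth.get_text(strip=True) if auth else None

# Sketch of the per-link loop: render each page in the browser rather than
# fetching it with requests, so the Angular-inserted content actually exists.
#
# for b in eachLink:
#     driver.get(b)
#     WebDriverWait(driver, 10).until(EC.presence_of_element_located(
#         (By.CSS_SELECTOR, 'a.asset-metadata-value')))
#     print(extract_author(driver.page_source))

# Demonstration on a hypothetical rendered fragment:
sample = '<a class="asset-metadata-value link ng-star-inserted">John Smith</a>'
print(extract_author(sample))  # John Smith
```

The same helper could be extended to pull other fields from the detail page once they are identified in the rendered HTML.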