Selenium webdriver: how do I get links using Selenium and BeautifulSoup?
I want to collect articles from this website. I used BeautifulSoup on its own earlier, but it did not pick up the links, so I tried Selenium. Now I am trying the code below, but it prints "None". I have never used Selenium before, so I don't know much about it. What should I change in this code so that it works and gives the desired result?
import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
base = 'https://metro.co.uk'
url = 'https://metro.co.uk/search/#gsc.tab=0&gsc.q=cybersecurity&gsc.sort=date&gsc.page=7'
browser = webdriver.Safari(executable_path='/usr/bin/safaridriver')
wait = WebDriverWait(browser, 10)
browser.get(url)
link = browser.find_elements_by_class_name('gs-title')
for links in link:
    links.get_attribute('href')
soup = BeautifulSoup(browser.page_source, 'lxml')
date = soup.find('span', {'class': 'post-date'})
title = soup.find('h1', {'class':'headline'})
content = soup.find('div',{'class':'article-body'})
print(date)
print(title)
print(content)
time.sleep(3)
browser.close()
I want to collect the date, title, and content from all the articles on this page, and likewise from the other pages, pages 7 to 18.
Thank you.

Instead of using Selenium to fetch the anchors, I tried first extracting the page source with Selenium's help and then using Beautiful Soup on it. So, in short:
import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
base = 'https://metro.co.uk'
url = 'https://metro.co.uk/search/#gsc.tab=0&gsc.q=cybersecurity&gsc.sort=date&gsc.page=7'
browser = webdriver.Safari(executable_path='/usr/bin/safaridriver')
#wait = WebDriverWait(browser, 10) #Not actually required
browser.get(url)
soup = BeautifulSoup(browser.page_source, 'html.parser') #Get the Page Source
anchors = soup.find_all("a", class_ = "gs-title") #Now find the anchors
for anchor in anchors:
    browser.get(anchor['href']) #Connect to the news link and extract its page source
    sub_soup = BeautifulSoup(browser.page_source, 'html.parser')
    date = sub_soup.find('span', {'class': 'post-date'})
    title = sub_soup.find('h1', {'class':'post-title'}) #Note that the class attribute for the heading is 'post-title' and not 'headline'
    content = sub_soup.find('div',{'class':'article-body'})
    print([date.string, title.string, content.string])
    #time.sleep(3) #Even this I don't believe is required
browser.close()
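One caveat with the loop above: `Tag.string` returns None whenever a tag has more than one child, which is almost always the case for an article-body div. A small helper that falls back to `get_text()` avoids that (a sketch; `safe_text` is a name chosen here for illustration, not part of the original code):

```python
def safe_text(tag):
    """Return the stripped text of a BeautifulSoup tag, or None if the
    tag was not found. Unlike Tag.string, get_text() also works when the
    tag contains nested elements (e.g. paragraphs inside a div)."""
    if tag is None:
        return None
    return tag.get_text(strip=True)

# Inside the loop, this would replace the .string accesses:
# print([safe_text(date), safe_text(title), safe_text(content)])
```

This also keeps the script from crashing with an AttributeError when one of the `find()` calls returns None for a page with a different layout.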
With this modification, I believe you can get the content you need.

You can use the same API that the page itself uses. Change the parameters to get the results for all the pages:
import requests
import json
import re

# Hit the Google CSE endpoint that the search page calls behind the scenes
r = requests.get('https://cse.google.com/cse/element/v1?rsz=filtered_cse&num=10&hl=en&source=gcsc&gss=.uk&start=60&cselibv=5d7bf4891789cfae&cx=012545676297898659090:wk87ya_pczq&q=cybersecurity&safe=off&cse_tok=AKaTTZjKIBzl-5fANH8dQ8f78cv2:1560500563340&filter=0&sort=date&exp=csqr,4229469&callback=google.search.cse.api3732')
# The response is JSONP; strip the "google.search.cse.api3732(...)" wrapper
# to get at the JSON payload
p = re.compile(r'api3732\((.*)\);', re.DOTALL)
data = json.loads(p.findall(r.text)[0])
links = [item['clicktrackUrl'] for item in data['results']]
print(links)
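To cover pages 7 to 18, the `start` parameter appears to step in units of `num=10`, so page N would use `start=(N-1)*10` (the URL above uses `start=60` for page 7, which fits). A sketch of building the request URLs for that range, assuming that mapping holds; note that the `cse_tok` value is session-specific and would need to be refreshed from a live page:

```python
from urllib.parse import urlencode

BASE = 'https://cse.google.com/cse/element/v1'

def page_params(page, cse_tok):
    """Query parameters for one results page, assuming start = (page-1)*10."""
    return {
        'num': 10,
        'q': 'cybersecurity',
        'sort': 'date',
        'start': (page - 1) * 10,
        'cx': '012545676297898659090:wk87ya_pczq',
        'cse_tok': cse_tok,  # session-specific token, copied from a live request
        'callback': 'google.search.cse.api3732',
    }

# One URL per page for pages 7 through 18
urls = [BASE + '?' + urlencode(page_params(p, 'TOKEN')) for p in range(7, 19)]
```

Each URL can then be fetched and unwrapped with the same regex as above.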
Thanks, it is working. But why does it print each article twice?

If you look at the page source, you will see that each result block contains two places with a class="gs-title" tag. Both are essentially anchors wrapped in divs, but the wrappers differ in class: one has class="gsc-thumbnail-inside" and the other has class="gs-title gsc-table-cell-thumbnail gsc-thumbnail-left". I believe this can easily be solved by checking, at the start of each loop iteration, whether the current anchor is the same as the previous one.
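The duplicate check suggested above can be done by tracking the hrefs already seen in a set, which also handles duplicates that are not adjacent (a sketch; the anchor list below is illustrative sample data, not real scraped output):

```python
def unique_hrefs(anchors):
    """Yield each anchor's href only once, skipping repeated links."""
    seen = set()
    for anchor in anchors:
        href = anchor.get('href')  # works for both bs4 Tags and plain dicts
        if href and href not in seen:
            seen.add(href)
            yield href

# The search page renders every result's link twice (text + thumbnail),
# so the raw anchor list looks roughly like this:
anchors = [
    {'href': 'https://metro.co.uk/a'},
    {'href': 'https://metro.co.uk/a'},   # thumbnail copy of the same link
    {'href': 'https://metro.co.uk/b'},
    {'href': 'https://metro.co.uk/b'},
]
print(list(unique_hrefs(anchors)))  # each article URL appears once
```

The loop in the answer would then iterate over `unique_hrefs(anchors)` instead of `anchors`, visiting each article page only once.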