Python Selenium and BeautifulSoup with multiple links
I want to extract links from multiple web pages. The extraction itself works fine, but with multiple URLs the first URL is captured twice and the last one is not captured at all. What is the reason?
import re
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import csv
from bs4 import BeautifulSoup

URLs = ["https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/1",
        "https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/2",
        "https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/3",
        "https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/4",
        "https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/5",
        "https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/6",
        "https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/7"]

driver = webdriver.Chrome(ChromeDriverManager().install())
file = open('linkler.csv', 'w+', newline='')
writer = csv.writer(file)
writer.writerow(['linkler'])

for link in URLs:
    driver.get(link)
    html_source = driver.page_source
    soup = BeautifulSoup(html_source, "html.parser")
    for links in soup.findAll('a', attrs={'href': re.compile("^/soccer/turkey/super-lig-2019-2020/")}):
        writer.writerow([links.get('href')])

driver.quit()
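The filter passed to findAll keeps only anchors whose href starts with the league path. The same matching logic can be sketched without Selenium or BeautifulSoup, using only the standard library; the HTML fragment below is made up for illustration (real markup would come from driver.page_source):

```python
import re
from html.parser import HTMLParser

PATTERN = re.compile(r"^/soccer/turkey/super-lig-2019-2020/")

class LinkCollector(HTMLParser):
    """Collect href values of <a> tags whose href matches PATTERN."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if PATTERN.match(href):
                self.hrefs.append(href)

# Hypothetical page fragment: two league links and one unrelated link
html = '''
<a href="/soccer/turkey/super-lig-2019-2020/match-a/">A</a>
<a href="/basketball/usa/nba/">NBA</a>
<a href="/soccer/turkey/super-lig-2019-2020/match-b/">B</a>
'''

parser = LinkCollector()
parser.feed(html)
# parser.hrefs now holds only the two hrefs that match the league prefix
```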
What is happening?
The duplicates are caused by duplicate links on the site that match your regex, so the script is working as designed - the good news is that you can fix it ;)

How do I avoid writing duplicates?
Create a list that holds only unique hrefs and check whether each newly scraped href is already in it. If it is not, write it to the csv and update the list at the same time. (You could also collect everything into the list first and write it to the csv afterwards.)

Example
...
file = open('linkler.csv', 'w+', newline='')
writer = csv.writer(file)
writer.writerow(['linkler'])

hrefList = []

for link in URLs:
    driver.get(link)
    html_source = driver.page_source
    soup = BeautifulSoup(html_source, "html.parser")
    for links in soup.findAll('a', attrs={'href': re.compile("^/soccer/turkey/super-lig-2019-2020/")}):
        if links.get('href') not in hrefList:
            hrefList.append(links.get('href'))
            writer.writerow([links.get('href')])

file.close()
...
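One detail worth noting: `href not in hrefList` scans the whole list on every check, while a set gives constant-time membership tests. A minimal sketch of the same dedup logic with a set, keeping first-seen order; the sample hrefs are hypothetical, and an in-memory buffer stands in for the real csv file:

```python
import csv
import io

# Hypothetical hrefs as they might come out of the scraping loop (duplicate included)
scraped = [
    "/soccer/turkey/super-lig-2019-2020/match-a/",
    "/soccer/turkey/super-lig-2019-2020/match-b/",
    "/soccer/turkey/super-lig-2019-2020/match-a/",  # same link seen on a later page
]

seen = set()        # O(1) membership tests, unlike a list
unique_hrefs = []   # preserves first-seen order for writing

for href in scraped:
    if href not in seen:
        seen.add(href)
        unique_hrefs.append(href)

buffer = io.StringIO()  # stands in for open('linkler.csv', 'w', newline='')
writer = csv.writer(buffer)
writer.writerow(['linkler'])
writer.writerows([h] for h in unique_hrefs)
```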
After many runs I found the problem: if there are no pauses between requests, the site blocks you, so I solved it by adding a sleep! Now your code works fine, I tested it.
import re
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import csv
from bs4 import BeautifulSoup
import time

URLs = ["https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/1",
        "https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/2",
        "https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/3",
        "https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/4",
        "https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/5",
        "https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/6",
        "https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/7"]

driver = webdriver.Chrome(ChromeDriverManager().install())
file = open('linkler.csv', 'w+', newline='')
writer = csv.writer(file)
writer.writerow(['linkler'])

for link in URLs:
    driver.get(link)
    time.sleep(5)
    html_source = driver.page_source
    soup = BeautifulSoup(html_source, "html.parser")
    for links in soup.findAll('a', attrs={'href': re.compile("^/soccer/turkey/super-lig-2019-2020/")}):
        writer.writerow([links.get('href')])

driver.quit()
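A fixed time.sleep(5) works, but a small random jitter makes the request pattern look less mechanical to the server. A sketch of a throttle helper going a step beyond the fixed sleep above; the delay values are arbitrary and tuned down here only so the example runs quickly:

```python
import random
import time

def polite_sleep(base=5.0, jitter=2.0):
    """Sleep for base seconds plus a random extra of up to jitter seconds,
    and return the delay actually used."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay

# Example: tiny delays for three simulated page fetches
delays = [polite_sleep(base=0.01, jitter=0.01) for _ in range(3)]
```

In the scraping loop above, `polite_sleep()` would replace the `time.sleep(5)` call after each `driver.get(link)`.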
What do you mean by "the last one is not captured"? Please explain in detail and improve your question. - I mean the last URL in the URL list. Thank you for your effort, it helped me a lot.