Python Selenium and BeautifulSoup with multiple links


I want to extract links from several web pages. The extraction itself works, but with multiple URLs the first URL gets scraped twice and the last one not at all. What is the reason?

import re
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import csv
from bs4 import BeautifulSoup

URLs = ["https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/1","https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/2",
        "https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/3","https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/4","https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/5",
        "https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/6","https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/7"]

driver = webdriver.Chrome(ChromeDriverManager().install())

file = open('linkler.csv', 'w+', newline='')
writer = csv.writer(file)
writer.writerow(['linkler'])


for link in URLs:
  driver.get(link)

  html_source = driver.page_source

  soup = BeautifulSoup(html_source, "html.parser")

  for links in soup.findAll('a', attrs={'href': re.compile("^/soccer/turkey/super-lig-2019-2020/")}):
    writer.writerow([links.get('href')])


driver.quit()
What is happening? The duplicates you get are caused by duplicate links on the site that match the regex, so the script works as designed - the good news is that you can fix it ;)
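
If you want to confirm this, count how often each matching href occurs on a single page; any count above 1 means the page itself contains the link more than once. A small sketch, reusing the soup object from the question:

from collections import Counter

# tally every href that matches the pattern on the current page;
# a count above 1 is a duplicate coming from the page itself
hrefs = [a.get('href') for a in soup.findAll('a', attrs={'href': re.compile("^/soccer/turkey/super-lig-2019-2020/")})]
print(Counter(hrefs).most_common(5))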

How to avoid writing duplicates? Keep a list that only holds unique hrefs and check whether each newly scraped href is already in it. If it is not, write it to the CSV and update the list at the same time. (You could also collect everything first and write the list to the CSV afterwards.)

Example

...
file = open('linkler.csv', 'w+', newline='')
writer = csv.writer(file)
writer.writerow(['linkler'])

hrefList = []

for link in URLs:
    driver.get(link)

    html_source = driver.page_source

    soup = BeautifulSoup(html_source, "html.parser")
    
    for links in soup.findAll('a', attrs={'href': re.compile("^/soccer/turkey/super-lig-2019-2020/")}):
        if links.get('href') not in hrefList:
            hrefList.append(links.get('href'))
            writer.writerow([links.get('href')])

file.close()
...
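
If the exact data structure does not matter, a set gives O(1) membership checks instead of scanning a list. A minimal sketch of the same deduplication idea (the name seen_hrefs is just for illustration); rows are still written in the order they are scraped:

seen_hrefs = set()

for link in URLs:
    driver.get(link)

    soup = BeautifulSoup(driver.page_source, "html.parser")

    for a in soup.findAll('a', attrs={'href': re.compile("^/soccer/turkey/super-lig-2019-2020/")}):
        href = a.get('href')
        if href not in seen_hrefs:        # O(1) lookup in the set
            seen_hrefs.add(href)
            writer.writerow([href])

file.close()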

After many more scraping runs I found the problem: if there is no pause between requests, the site blocks them, so I solved it by adding a sleep. Now your code works fine - I tested it.

import re
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import csv
from bs4 import BeautifulSoup
import time

URLs = ["https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/1",
        "https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/2",
        "https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/3",
        "https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/4",
        "https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/5",
        "https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/6",
        "https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/7"]

driver = webdriver.Chrome(ChromeDriverManager().install())

file = open('linkler.csv', 'w+', newline='')
writer = csv.writer(file)
writer.writerow(['linkler'])

for link in URLs:
    driver.get(link)
    time.sleep(5)
    html_source = driver.page_source

    soup = BeautifulSoup(html_source, "html.parser")

    for links in soup.findAll('a', attrs={'href': re.compile("^/soccer/turkey/super-lig-2019-2020/")}):
        writer.writerow([links.get('href')])

driver.quit()
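
A fixed time.sleep(5) works, but it waits the full five seconds even when the page loads faster. An explicit wait is a possible alternative; the sketch below assumes the results are rendered inside an element with the id "tournamentTable" - that id is an assumption and not verified against the site. Note also that these URLs differ only in the #/page fragment, so the table element from the previous page may still be present when the wait runs, which can make the explicit wait less reliable here than a plain sleep:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

for link in URLs:
    driver.get(link)
    # wait up to 10 seconds for the (assumed) results table instead of a fixed 5-second sleep
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "tournamentTable"))  # "tournamentTable" is an assumed id
    )

    soup = BeautifulSoup(driver.page_source, "html.parser")

    for links in soup.findAll('a', attrs={'href': re.compile("^/soccer/turkey/super-lig-2019-2020/")}):
        writer.writerow([links.get('href')])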

What do you mean by "the last one is not fetched"? Please explain in detail and improve your question. Thanks.
I mean the last URL in the list of URLs.
Thank you for your effort. It helped me a lot.