Python Selenium and BeautifulSoup with multiple links
I want to extract links from multiple web pages. The extraction itself works fine, but with multiple URLs the first URL is captured twice and the last one is not captured at all. What is the reason?
import re
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import csv
from bs4 import BeautifulSoup

URLs = ["https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/1",
        "https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/2",
        "https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/3",
        "https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/4",
        "https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/5",
        "https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/6",
        "https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/7"]

driver = webdriver.Chrome(ChromeDriverManager().install())
file = open('linkler.csv', 'w+', newline='')
writer = csv.writer(file)
writer.writerow(['linkler'])

for link in URLs:
    driver.get(link)
    html_source = driver.page_source
    soup = BeautifulSoup(html_source, "html.parser")
    for links in soup.findAll('a', attrs={'href': re.compile("^/soccer/turkey/super-lig-2019-2020/")}):
        writer.writerow([links.get('href')])

driver.quit()
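The filter passed to findAll keeps only anchors whose href starts with the league path. The same matching logic can be sketched without Selenium or BeautifulSoup, using only the standard library; the HTML fragment below is made up for illustration (real markup would come from driver.page_source):

```python
import re
from html.parser import HTMLParser

PATTERN = re.compile(r"^/soccer/turkey/super-lig-2019-2020/")

class LinkCollector(HTMLParser):
    """Collect href values of <a> tags whose href matches PATTERN."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if PATTERN.match(href):
                self.hrefs.append(href)

# Hypothetical page fragment: two league links and one unrelated link
html = '''
<a href="/soccer/turkey/super-lig-2019-2020/match-a/">A</a>
<a href="/basketball/usa/nba/">NBA</a>
<a href="/soccer/turkey/super-lig-2019-2020/match-b/">B</a>
'''

parser = LinkCollector()
parser.feed(html)
# parser.hrefs now holds only the two hrefs that match the league prefix
```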
What is happening?
The duplicates are caused by duplicate links on the site that match your regex, so the script is working as designed - the good news is that you can fix it ;)

How do I avoid writing duplicates?
Create a list that holds only unique hrefs and check whether each newly scraped href is already in it. If it is not, write it to the csv and update the list at the same time. (You could also collect everything into the list first and write it to the csv afterwards.)

Example
...
file = open('linkler.csv', 'w+', newline='')
writer = csv.writer(file)
writer.writerow(['linkler'])

hrefList = []

for link in URLs:
    driver.get(link)
    html_source = driver.page_source
    soup = BeautifulSoup(html_source, "html.parser")
    for links in soup.findAll('a', attrs={'href': re.compile("^/soccer/turkey/super-lig-2019-2020/")}):
        if links.get('href') not in hrefList:
            hrefList.append(links.get('href'))
            writer.writerow([links.get('href')])

file.close()
...
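One detail worth noting: `href not in hrefList` scans the whole list on every check, while a set gives constant-time membership tests. A minimal sketch of the same dedup logic with a set, keeping first-seen order; the sample hrefs are hypothetical, and an in-memory buffer stands in for the real csv file:

```python
import csv
import io

# Hypothetical hrefs as they might come out of the scraping loop (duplicate included)
scraped = [
    "/soccer/turkey/super-lig-2019-2020/match-a/",
    "/soccer/turkey/super-lig-2019-2020/match-b/",
    "/soccer/turkey/super-lig-2019-2020/match-a/",  # same link seen on a later page
]

seen = set()        # O(1) membership tests, unlike a list
unique_hrefs = []   # preserves first-seen order for writing

for href in scraped:
    if href not in seen:
        seen.add(href)
        unique_hrefs.append(href)

buffer = io.StringIO()  # stands in for open('linkler.csv', 'w', newline='')
writer = csv.writer(buffer)
writer.writerow(['linkler'])
writer.writerows([h] for h in unique_hrefs)
```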
After many runs I found the problem: if there are no pauses between requests, the site blocks you, so I solved it by adding a sleep! Now your code works fine, I tested it.
import re
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import csv
from bs4 import BeautifulSoup
import time

URLs = ["https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/1",
        "https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/2",
        "https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/3",
        "https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/4",
        "https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/5",
        "https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/6",
        "https://www.oddsportal1.com/soccer/turkey/super-lig-2019-2020/results/#/page/7"]

driver = webdriver.Chrome(ChromeDriverManager().install())
file = open('linkler.csv', 'w+', newline='')
writer = csv.writer(file)
writer.writerow(['linkler'])

for link in URLs:
    driver.get(link)
    time.sleep(5)
    html_source = driver.page_source
    soup = BeautifulSoup(html_source, "html.parser")
    for links in soup.findAll('a', attrs={'href': re.compile("^/soccer/turkey/super-lig-2019-2020/")}):
        writer.writerow([links.get('href')])

driver.quit()
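A fixed time.sleep(5) works, but a small random jitter makes the request pattern look less mechanical to the server. A sketch of a throttle helper going a step beyond the fixed sleep above; the delay values are arbitrary and tuned down here only so the example runs quickly:

```python
import random
import time

def polite_sleep(base=5.0, jitter=2.0):
    """Sleep for base seconds plus a random extra of up to jitter seconds,
    and return the delay actually used."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay

# Example: tiny delays for three simulated page fetches
delays = [polite_sleep(base=0.01, jitter=0.01) for _ in range(3)]
```

In the scraping loop above, `polite_sleep()` would replace the `time.sleep(5)` call after each `driver.get(link)`.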
What do you mean by "the last one is not captured"? Please explain in detail and improve your question. - I mean the last URL in the URL list. Thank you for your effort, it helped me a lot.