How can I improve this web-scraping Python script?

python selenium web-scraping scrapy


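In short, I started using Python two weeks ago, so please don't hesitate to correct any mistakes you see or suggest improvements. I'm trying to scrape data from the list of clubs returned by a search on www.fff.fr.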

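My plan is:

Go to the home page

Accept the cookies

Use the search bar to search for a city name

Get the list of results

Follow each URL on the results page

Go to each club's "staff" subsection

Extract the data from that page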

I started building the Python code below, but so far it doesn't work. I'd really appreciate feedback on how to actually get it working.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from shutil import which

chrome_path = which("chromedriver")

driver = webdriver.Chrome(executable_path=chrome_path)
driver.get("https://fff.fr")

cookie_btn = driver.find_element_by_id("didomi-notice-agree-button")
cookie_btn.click()

search_input = driver.find_element_by_xpath("/html//form[@id='proximiteSearch']//input[@id='fff_club_form_club_near_to_search_address']")
search_input.send_keys("Paris")
search_input.send_keys(Keys.ENTER)

self.html = driver.page_source
driver.close()

def parse(self, response):
        resp = Selector(text=self.html)
        clubs = resp.xpath("(//ul[contains(@id, 'listresulclub')])/li/text()")
        for club in clubs:
            name = club.xpath(".//text()").get()
            name_link = club.xpath(".//@href").get()

            url = f"https://www.ffr.fr{name_link}"
            absolute_url = url[:-10] + "/le-staff"
            # absolute_url =  response.urljoin()

            yield scrapy.Request(url=absolute_url, meta={'club_name':name})
            #yield response.follow (url = name_link, callback=self.parse_country, meta={'club_name': name})

def parse_country(self, response):
        name = response.request.meta['club_name']
        contacts = response.xpath("//div[@class='coor-block-content']/ol")
        for contact in contacts:
            contact_nom = contact.xpath(".//li[1]/text()").get()
            yield {
                'club_name': name,
                'correspondant_nom': contact_nom
            }

You can try the same thing without Selenium, and it works well:

import datetime
import os
import re
import sys
import unicodedata

import bs4
import requests

# Today's date, e.g. "05-Mar-2021" (computed but not used further below)
current_date = datetime.datetime.today().strftime('%d-%b-%Y')

cityname = sys.argv[1]  # city to search for, passed on the command line

filename = os.path.join(r"D:\Huzefa\Desktop", "footballstuff.txt")
url = "https://www.fff.fr/resultats?search=" + cityname
res = requests.get(url)
soup = bs4.BeautifulSoup(res.text, "lxml")


def to_ascii(text):
    # Drop anything inside (...) or [...], then strip accents and non-ASCII characters
    text = re.sub(r"[\(\[].*?[\)\]]", "", text)
    return unicodedata.normalize('NFD', text).encode('ascii', 'ignore')


# Write the text of every link on the page, each followed by two line breaks
with open(filename, 'wb') as file:
    for link in soup.select("a"):
        file.write(to_ascii(link.text))
        file.write(os.linesep.encode('ascii') * 2)
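To see what the cleaning step in the script above does to a single link text, here is a small standalone sketch using only the standard library. The function name clean_text and the sample club name are made up for illustration; the regex and the NFD/ASCII normalization are the ones used in the script.

```python
import re
import unicodedata


def clean_text(text):
    """Remove bracketed parts, then reduce accented letters to plain ASCII."""
    # Drop anything inside (...) or [...]
    without_brackets = re.sub(r"[\(\[].*?[\)\]]", "", text)
    # NFD splits "É" into "E" plus a combining accent; the accent is then
    # discarded because it cannot be encoded as ASCII
    decomposed = unicodedata.normalize("NFD", without_brackets)
    return decomposed.encode("ascii", "ignore")


# Hypothetical sample input, not taken from the real site
print(clean_text("Étoile FC (Paris)"))
```

Note that the result is bytes, not str, which is why the script opens its output file in 'wb' mode. The space before the removed bracket is kept, so a .strip() may be worth adding if trailing whitespace matters.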

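Please be specific about what doesn't work. Which of the steps above does work?

I'll just give you some hints. In my opinion, Selenium is the way to go. When you automate something or try to extract data, think about how you would do it by hand. Try splitting the task into separate functions called from a main function, so you can test them one by one and check that each returns what you expect. That approach makes errors easier to spot and the code easier to update later. Keep it up!

Also, the two functions in the code you posted are never called, so that part of the code never runs. Maybe try calling parse(self.html) right after driver.close()?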
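As a rough sketch of the "split it into functions" advice, the steps could be arranged something like this. It assumes a Selenium 4 install (By locators and explicit waits; recent Selenium 4 releases no longer provide find_element_by_id) and reuses the element ids and the XPath from the question, which may no longer match the live site. The function names, the 10-second timeout, and the example link passed to staff_url are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.fff.fr"


def fetch_results_html(city):
    """Open the site, accept cookies, search for a city, return the page source."""
    # Imported here so the pure helper below can be tested without a browser
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(BASE_URL)
        wait = WebDriverWait(driver, 10)  # timeout is an arbitrary choice
        # Wait for the cookie banner instead of assuming it is already there
        wait.until(EC.element_to_be_clickable(
            (By.ID, "didomi-notice-agree-button"))).click()
        search_input = wait.until(EC.presence_of_element_located(
            (By.ID, "fff_club_form_club_near_to_search_address")))
        search_input.send_keys(city)
        search_input.send_keys(Keys.ENTER)
        # Wait until the result list from the question's XPath is present
        wait.until(EC.presence_of_element_located(
            (By.XPATH, "//ul[contains(@id, 'listresulclub')]")))
        return driver.page_source
    finally:
        driver.quit()


def staff_url(club_href):
    """Swap the last path segment of a club link for 'le-staff'."""
    parent = club_href.rsplit("/", 1)[0]
    return urljoin(BASE_URL, parent + "/le-staff")


def main(city):
    html = fetch_results_html(city)
    print(len(html), "characters of HTML fetched for", city)
```

Each piece can then be checked on its own: staff_url("/club/123-example/infos") can be tried in a plain Python shell with no browser, and main("Paris") exercises the Selenium part separately. Replacing the last path segment is only a guess at what the url[:-10] slice in the question was meant to do.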