Scraping multiple articles from a web page

javascript python selenium web-scraping

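I am trying to scrape every job listed on a page, without success. I have tried several approaches and none of them worked. After opening and scraping the first job, the script crashes. I don't know what to do next so that it carries on to the next job. Can anyone help? Thanks in advance. I had to shorten the code because the site would not let me post that much code.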

First part:

from selenium import webdriver
import pandas as pd
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
df = pd.DataFrame(columns=['Title', 'Description', 'Job-type', 'Skills'])

for i in range(25):
    driver.get('https://www.reed.co.uk/jobs/care-jobs?pageno=' + str(i))
    jobs = []
    driver.implicitly_wait(20)
    for job in driver.find_elements_by_xpath('//*[@id="content"]/div[1]/div[3]'):
        soup = BeautifulSoup(job.get_attribute('innerHTML'), 'html.parser')
        element = WebDriverWait(driver, 50).until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, '#onetrust-accept-btn-handler')))
        element.click()
        try:
            title = soup.find("h3", class_="title").text.replace("\n", "").strip()
            print(title)
        except:
            title = 'None'
        sum_div = job.find_element_by_css_selector('#jobSection42826858 > div.row > div > header > h3 > a')
        sum_div.click()
        driver.implicitly_wait(2)
        try:
            job_desc = driver.find_element_by_css_selector('#content > div > div.col-xs-12.col-sm-12.col-md-12 > article > div > div.branded-job-details-container > div.branded-job-content > div.branded-job-description-container > div').text
            print(job_desc)
        except:
            job_desc = 'None'
        try:
            job_type = driver.find_element_by_xpath('//*[@id="content"]/div/div[2]/article/div/div[2]/div[3]/div[2]/div[3]/span').text
            print(job_type)
        except:
            job_type = 'None'
        try:
            job_skills = driver.find_element_by_xpath('//*[@id="content"]/div/div[2]/article/div/div[2]/div[3]/div[6]/div[2]/ul').text
            print(job_skills)
        except:
            job_skills = 'None'
        driver.back()
        driver.implicitly_wait(2)
        df = df.append({'Title': title, 'Description': job_desc, 'Job-type': job_type, 'Skills': job_skills}, ignore_index=True)

df.to_csv(r"C:\Users\Desktop\Python\newreed.csv", index=False)

In my experience, Chrome is harder to manage with Selenium than Firefox or Edge. If Chrome is not a requirement, I would try the Firefox or Edge driver. I have had good luck with Edge when Chrome gave me problems.

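You should avoid Selenium here; it was not originally designed for web scraping. Instead, open the browser's developer tools (F12 -> Network) and inspect the HTML or XHR tabs to see which requests actually deliver the data.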

Here is my code:

import requests as rq
from bs4 import BeautifulSoup as bs

def processPageData(soup):
    """Return {job_id: {...}} for every <article> on one results page."""
    articles = soup.find_all("article")
    resultats = {}
    for article in articles:
        # The id looks like "jobSection42826858"; drop the 10-character prefix.
        job_id = article["id"][10:]

        # Use a dict ({"class": ...}); the set {"class", ...} only worked by accident.
        res1 = article.find_all("div", {"class": "metadata"})[0]
        location = res1.find("li", {"class": "location"}).text.strip().split('\n')

        resultats[job_id] = {
            'location': list(map(str.strip, location)),
            'salary': res1.find("li", {"class": "salary"}).text,
            'description': article.find_all("div", {"class": "description"})[0].find("p").text,
            'posted_by': article.find_all("div", {"class": "posted-by"})[0].text.strip(),
        }

    return resultats

Calling the previous function on each results page:

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0",
           "Host": "www.reed.co.uk"}

resultats = {}

# Reuse a single session for all requests instead of creating one per page.
s = rq.session()

for i in range(1, 10):  # pages 1 to 9
    url = "https://www.reed.co.uk/jobs/care-jobs?pageno=%d" % i

    resp = s.get(url, headers=headers)
    soup = bs(resp.text, "lxml")
    r = processPageData(soup)
    resultats.update(r)

This gives:

{'42826858': {'location': ['Horsham', 'West Sussex'],
  'salary': '£11.50 - £14.20 per hour',
  'description': 'Come and join the team as a Care Assistant and make the Alina Homecare difference. We are looking for kind and caring people who want to make a difference to the lives of others. If you have a caring attitude and willingness to make a difference, come...',
  'posted_by': 'Posted Today by Alina Homecare'},

 '42827040': {'location': ['Redhill', 'Surrey'],
  'salary': '£11.00 - £13.00 per hour',
  'description': 'Come and join the team as a Care Assistant and make the Alina Homecare difference. We are looking for kind and caring people who want to make a difference to the lives of others. If you have a caring attitude and willingness to make a difference, come...',
  'posted_by': 'Posted Today by Alina Homecare'},

....
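The output above has no job title. A minimal sketch of how the title and the link to the job page could be read from each article, assuming the markup matches the selectors quoted in the question (`#jobSection<ID> ... header > h3 > a`, with `h3` carrying the class `title`). The sample HTML and its `href` value are hypothetical stand-ins so the snippet runs offline; the live page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modelled on the selectors in the question.
# The href value is made up for illustration only.
SAMPLE_HTML = """
<article id="jobSection42826858">
  <div class="row"><div><header>
    <h3 class="title"><a href="/jobs/care-assistant/42826858">Care Assistant</a></h3>
  </header></div></div>
</article>
"""


def extract_title_and_link(article):
    """Return (title, href) for one <article>, or (None, None) if missing."""
    heading = article.find("h3", class_="title")
    link = heading.find("a") if heading else None
    if link is None:
        return None, None
    return link.get_text(strip=True), link.get("href")


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
article = soup.find("article")
job_id = article["id"][len("jobSection"):]  # same as article["id"][10:]
title, href = extract_title_and_link(article)
print(job_id, title, href)
```

Inside `processPageData`, the same two values could be stored next to `location` and `salary` under the job's key.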

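Note 1: the keys of resultats are the job identifiers, which let you navigate to a job's page if you need more details.

Note 2: I iterate over a fixed range of pages (range(1, 10) covers pages 1 to 9), but you can adapt the code to go up to the last available page.

Note 3: as general advice, try to understand the site's data model rather than relying on trial and error or using Selenium for something it was not meant to do.

Note 4: CSS selectors and XPath selectors are ugly; I prefer selecting by tag. Personal opinion.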
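Regarding Note 2, one way to avoid hard-coding the page count is to keep requesting pages until one contributes nothing new. The sketch below is only an illustration: `scrape_until_empty`, `fetch_page` and `max_pages` are hypothetical names, and the stop rule assumes that a page past the end returns either no articles or only jobs already seen, which has not been verified against the site. A fake fetcher stands in for the network call so the snippet runs offline.

```python
def scrape_until_empty(fetch_page, max_pages=1000):
    """Merge the dicts returned by fetch_page(1), fetch_page(2), ...

    Stops at the first page that adds no new keys (empty page or a
    repeat of earlier results), or after max_pages as a safety cap.
    Both the stop rule and the cap are assumptions.
    """
    results = {}
    for page in range(1, max_pages + 1):
        page_results = fetch_page(page)
        # An empty dict, or one whose keys were all seen before, ends the loop.
        if not page_results or page_results.keys() <= results.keys():
            break
        results.update(page_results)
    return results


# Offline stand-in for "download page N and run processPageData on it".
FAKE_SITE = {
    1: {"42826858": {"salary": "£11.50 - £14.20 per hour"}},
    2: {"42827040": {"salary": "£11.00 - £13.00 per hour"}},
}


def fake_fetch(page):
    return FAKE_SITE.get(page, {})


all_jobs = scrape_until_empty(fake_fetch)
print(sorted(all_jobs))
```

With the real code, `fetch_page` would wrap the body of the loop shown earlier: build the URL for page `i`, download it with the session, parse it, and return `processPageData(soup)`.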

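I think the problem would occur with any driver; my problem is that I don't know how to get Selenium to scrape the next article.

Why driver.back()? Is it really needed? At a quick glance it seems redundant. Is there any debug output?

I only put driver.back() there to get back to the main listing page; the problem is the same with or without it.

I forgot the title, but you can easily add it.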
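For anyone who still wants to stay with Selenium: no traceback was posted, so the following is a guess at the cause rather than a diagnosis. Elements located on the listing page become stale once the browser navigates to a job page, so clicking into a job and then going back tends to break the loop, and the hard-coded `#jobSection42826858` selector can only ever match one job. A common workaround is to read all job URLs first and then load each one directly. The function names, the `scrape_detail` callback and the selector in the usage note are hypothetical; `"css selector"` is the string value of Selenium's `By.CSS_SELECTOR`.

```python
def collect_job_links(driver, link_selector):
    """Read every matching href up front, before any navigation happens."""
    # "css selector" is the value of selenium's By.CSS_SELECTOR constant.
    links = driver.find_elements("css selector", link_selector)
    return [link.get_attribute("href") for link in links]


def scrape_jobs(driver, listing_url, link_selector, scrape_detail):
    """Visit each job page by URL instead of using click() and back()."""
    driver.get(listing_url)
    hrefs = collect_job_links(driver, link_selector)
    rows = []
    for href in hrefs:
        driver.get(href)  # fresh page load, so no stale element references
        rows.append(scrape_detail(driver))
    return rows
```

A call might look like `scrape_jobs(driver, url, "article h3.title a", my_detail_scraper)`, where `my_detail_scraper(driver)` returns one row with the title, description, job type and skills. The selector is an assumption derived from the question's own selectors.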