Python 3.x: I need to scrape the job description text for each job on the mentioned page (python-3.x, web-scraping, beautifulsoup)

I need to scrape, for each job on the mentioned page (), the job description into separate columns of a csv file using Python modules, e.g. Section (Accounting), Job Title (Staff Accountant), Job Description text.

I am new to Beautiful Soup; I tried a few things but it is not working. Could you help me with the code?

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

start = time.time()

url = ""
data = []
while True:
    resp = requests.get(url)
    soup = BeautifulSoup(resp.content, 'lxml')
    jobdesc = soup.find("li", {'class': 'col-xs-12 col-sm-4'})
    section = soup.find("h4")
    jd = {"jobdescription": jobdesc.text, "topic": section.text}
    data.append(jd)

df = pd.DataFrame(data)
df.to_csv("JD.csv")

Here is one way: use :has (bs4 4.7.1+) to isolate the sections to loop over, and zip_longest so we can join the section heading to each job.

import requests, csv
from bs4 import BeautifulSoup as bs
from itertools import zip_longest

r = requests.get('https://resources.workable.com/job-descriptions/#', headers = {'User-Agent':'Mozilla/5.0'})
soup = bs(r.content, 'lxml')

with open("data.csv", "w", encoding="utf-8-sig", newline='') as csv_file:

    w = csv.writer(csv_file, delimiter = ",", quoting=csv.QUOTE_MINIMAL)
    w.writerow(['Section','Job Title'])

    # each section that contains at least one job listing
    for section in soup.select('section:has(.job)'):
        title = section.select_one('a').text.strip()
        jobs = [job.text for job in section.select('li a')]
        # pair the section title with every job in that section
        rows = list(zip_longest([title], jobs, fillvalue = title))
        for row in rows:
            w.writerow(row)
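
For reference, the zip_longest call with fillvalue=title is what repeats the section heading next to every job in that section; a minimal sketch with made-up values (the title and job names below are hypothetical):

from itertools import zip_longest

title = "Accounting"                                           # hypothetical section title
jobs = ["Staff Accountant", "Accounting Clerk", "Bookkeeper"]  # hypothetical job titles

rows = list(zip_longest([title], jobs, fillvalue=title))
# -> [('Accounting', 'Staff Accountant'),
#     ('Accounting', 'Accounting Clerk'),
#     ('Accounting', 'Bookkeeper')]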

I got a 403 Forbidden with the requests package, so I decided to use selenium instead.
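
As an aside, the 403 with requests can sometimes be avoided just by sending a browser-like User-Agent header, as the answer above does; a minimal sketch of that check (same URL, the header value is only an example):

import requests

# Probe the page with and without a browser-like User-Agent header.
url = "https://resources.workable.com/job-descriptions/#"
plain = requests.get(url)
spoofed = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
print(plain.status_code, spoofed.status_code)  # e.g. 403 vs 200 if the header is what matters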

You can try the following:

from bs4 import BeautifulSoup
import requests
import pandas as pd
import time
from selenium import webdriver

url = "https://resources.workable.com/job-descriptions/#"
data = []
#resp = requests.get(url)
#soup = BeautifulSoup(resp.text, 'html.parser')
driver = webdriver.Firefox()
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
section = soup.find_all('section',{'class':'box-white'})
for s in section:
    title = s.find('h4').text
    lis = s.find_all("li",{'class':'col-xs-12 col-sm-4'})  # search within the current section, not the whole page
    for li in lis:
        jd = {"jobdescription":li.text,"topic":title}
        data.append(jd)
df = pd.DataFrame(data)
df.to_csv("JD.csv")
Edit: get the description for every job

from bs4 import BeautifulSoup
import requests
import pandas as pd
import time
from selenium import webdriver

url = "https://resources.workable.com/job-descriptions/#"
data = []
#resp = requests.get(url)
#soup = BeautifulSoup(resp.text, 'html.parser')
driver = webdriver.Firefox()
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
section = soup.find_all('section',{'class':'box-white'})
for s in section:
    title = s.find('h4').text
    lis = s.find_all("li",{'class':'col-xs-12 col-sm-4'})
    for li in lis:
        job = li.text
        # open each job's own page and grab the full description text
        driver.get(li.find('a').get('href'))
        soup2 = BeautifulSoup(driver.page_source, 'html.parser')
        jd = {"job":job,"topic":title, "description": soup2.find('div',{'class':'entry-content article-content'}).text}
        data.append(jd)

df = pd.DataFrame(data)
df.to_csv("JD.csv")

Scrape data from Monster jobs and upload it to MongoDB

from time import *
from selenium import webdriver
import pymongo
from pymongo.results import InsertManyResult
import os


client = pymongo.MongoClient()
mydb =  client['jobs']
collection  = mydb['med_title']

driver = webdriver.Chrome("C:/Users/91798/Desktop/pythn_files/chromedriver.exe")
driver.get("https://www.monsterindia.com/")

driver.implicitly_wait(9)
driver.find_element_by_id("SE_home_autocomplete").send_keys("nursing , Therapist , docter , medical ,nurse , hospital")

#for normal search use this 
driver.find_element_by_xpath("//body/div[@id='themeDefault']/section[1]/div[1]/div[1]/div[1]/div[2]/div[1]/div[1]/div[1]/div[2]/form[1]/div[1]/div[2]/input[1]").click()
driver.implicitly_wait(20)
temp = 1
while True:
    # stop after the first 4 result pages
    if temp == 5:
        break
    all_jobs =  driver.find_elements_by_class_name("card-apply-content")
    link_list = []
    for job in all_jobs:
        try:
            company = ""
            com_name = job.find_elements_by_class_name("job-tittle")
            driver.implicitly_wait(1)
            for ele in com_name:
                company = ele.find_element_by_class_name('company-name').text
            job_title = ""
            for ele in com_name:
                job_title = ele.find_element_by_class_name('medium').text
       
            location = job.find_element_by_class_name("loc").text
            driver.implicitly_wait(1)
            lnks= job.find_elements_by_tag_name("a")
            for lnk in lnks:
                link_list.append(lnk.get_attribute('href'))
                break
            driver.implicitly_wait(1)
            desc = job.find_element_by_class_name("job-descrip").text
            driver.implicitly_wait(1)
            skills = job.find_element_by_class_name("descrip-skills").text

        except:
            desc =  'desc Not Specified'
            skills =  'skills Not Specified'  
            location = ' location Not Specified'
            company = 'company  Not Specified'
            job_title = 'job_title not specified'
        
        # tokenize the skills text, drop stray ',' tokens; s[2:] drops the first two tokens
        s = skills.split(' ')
        for i in s:
            if i == ',':
                s.remove(',')
        data = {"job_title": job_title, "comapany_name": company, "job_location": location,
                "job_desc": desc, "skills": s[2::], "card_link": link_list[0]}
        link_list.clear()
        y =  collection.insert_one(data)
        print(y.inserted_id)
    driver.find_element_by_xpath("//button[contains(text(),'Next')]").click()
    sleep(25)
    temp = temp +1
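
To sanity-check what a run actually stored in MongoDB, a quick sketch using the same database and collection names as above:

import pymongo

client = pymongo.MongoClient()
collection = client["jobs"]["med_title"]

print(collection.count_documents({}))         # how many postings were inserted
for doc in collection.find().limit(3):        # peek at a few of them
    print(doc["job_title"], "|", doc["comapany_name"])   # key spelled as in the scraper above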

Comments on the answers:

Sorry, it is returning an empty csv file. Could you please check it once it is fixed? @Maaz
@praveen please check whether the information is inside the soup object. It works for me. Do you have a stacktrace?
The soup element contains "403 Forbidden ... nginx", and it returns an empty csv file at the end.
Are you running the code with selenium or with requests? Because that is what I get when I use requests.
It is with requests. Here is the error I get when running the code: TypeError: file() takes at most 3 arguments (4 given), 'encoding' is an invalid keyword argument for this function, UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 26: ordinal not in range(128).
Yes, I ran it as is. I am using Python 2.7, so I only changed zip_longest to izip_longest, no other changes... the problem must be with open....
Since it is Python 2.7 you need to change the open() line: remove the encoding parameter.
The code runs fine, but in Python 3 it also did not return any data. I also need the section, job title and job description, but it only shows empty columns for section and job title.
I don't understand those two comments. Are you running Python 3 now?
Please provide details about what your code does and how it solves the problem. This is VLQ.
OK sir, this is my first post, please don't mind.
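
For the Python 2.7 errors discussed above (open() rejecting the encoding keyword, and izip_longest vs zip_longest), a Python 2.7-flavoured variant of the CSV-writing part might look like this (a sketch only; the title and jobs values are hypothetical):

# -*- coding: utf-8 -*-
import csv
from itertools import izip_longest          # Python 2 name for zip_longest

title = "Accounting"                         # hypothetical section title
jobs = [u"Staff Accountant", u"Accounting Clerk"]

with open("data.csv", "wb") as csv_file:     # binary mode; Python 2 open() has no encoding/newline
    w = csv.writer(csv_file, delimiter=",", quoting=csv.QUOTE_MINIMAL)
    w.writerow(["Section", "Job Title"])
    for row in izip_longest([title], jobs, fillvalue=title):
        w.writerow([cell.encode("utf-8") for cell in row])   # encode to avoid the ascii codec error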