
Python: Web scraping with Selenium. Iterating through a table and retrieving data


I am learning Python and decided to do a web scraping project in which I use BeautifulSoup and Selenium.

Site:

Goal: retrieve all the variables related to each job ad. Identified variables: ID, job title, URL, city, state, zip code, country, date posted.

Problem: I managed to extract the data from the first page of the table. However, I was unable to extract data from any of the other pages. I did use the option to go to the next page.

Any help would be greatly appreciated.

Please find my code below:

```
import re
import os
import selenium
import pandas as pd

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import presence_of_element_located
from bs4 import BeautifulSoup


#driver = webdriver.Chrome(ChromeDriverManager().install())
browser = webdriver.Chrome("")  # path needed to execute chromedriver; check your own path
browser.get('https://careers.amgen.com/ListJobs?')
browser.implicitly_wait(100)
soup = BeautifulSoup(browser.page_source, 'html.parser')
code_soup = soup.find_all('tr', attrs={'role': 'row'})

# creating data set
df =pd.DataFrame({'id':[],
                  'jobs':[],
                 'url':[],
                 'city':[],
                 'state':[],
                  'zip':[],
                  'country':[],
                 'added':[]
                 })
d = code_soup

next_page = browser.find_element_by_xpath('//*[@id="jobGrid0"]/div[2]/a[3]/span')



for i in range(2,12): #catch error, out of bounds?
    df = df.append({'id' : d[i].find_all("td", {"class": "DisplayJobId-cell"}),
                     "jobs" : d[i].find_all("td", {"class":"JobTitle-cell"}),
                     "url" : d[i].find("a").attrs['href'],
                     "city" : d[i].find_all("td", {"class": "City-cell"}),
                     "state" : d[i].find_all("td", {"class": "State-cell"}),
                     "zip" : d[i].find_all("td", {"class": "Zip-cell"}),
                     "country" : d[i].find_all("td", {"class": "Country-cell"}),
                     "added" : d[i].find_all("td", {"class": "AddedOn-cell"})}, ignore_index=True)
    
df['url'] = 'https://careers.amgen.com/' + df['url'].astype(str)
df["company"] = "Amgen"
df

#iterate through the pages

next_page = browser.find_element_by_xpath('//*[@id="jobGrid0"]/div[2]/a[3]/span')
for p in range(1,7): #go from page 1 to 6
    next_page.click()
    browser.implicitly_wait(20)
    print(p)
```

I tried multiple things; this is my latest attempt. It did not work:

```
p = 0
next_page = browser.find_element_by_xpath('//*[@id="jobGrid0"]/div[2]/a[3]/span')

for p in range(1,7):   
    for i in range(2,12):
        df1 = df.append({'id' : d[i].find_all("td", {"class": "DisplayJobId-cell"}),
                         "jobs" : d[i].find_all("td", {"class":"JobTitle-cell"}),
                         "url" : d[i].find("a").attrs['href'],
                         "city" : d[i].find_all("td", {"class": "City-cell"}),
                         "state" : d[i].find_all("td", {"class": "State-cell"}),
                         "zip" : d[i].find_all("td", {"class": "Zip-cell"}),
                         "country" : d[i].find_all("td", {"class": "Country-cell"}),
                         "added" : d[i].find_all("td", {"class": "AddedOn-cell"})}, ignore_index=True)
        p += 1
        next_page.click()
    print(p)
```

Instead of driving the table with Selenium, you can read the site's jobs API directly with requests:

```
import requests
import re
import pandas as pd

params = {
    'sort': 'AddedOn desc',
    'page': '1',
    'pageSize': '1000',
    'group': '',
    'filter': '',
    'fields': 'JobTitle,DisplayJobId,City,State,Zip,Country,AddedOn,UrlJobTitle'
}

headers = {
    'Origin': 'https://careers.amgen.com'
}


def main(url):
    r = requests.get(url)
    api = re.search('JobsApiUrl="(.*?)"', r.text).group(1)
    r = requests.get(api, params=params, headers=headers).json()
    df = pd.DataFrame(r['Data'])
    print(df)
    df.to_csv("data.csv", index=False)


main("https://careers.amgen.com/ListJobs")
```
Output:

Sample:
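If the API caps how many rows a single request may return, the same parameters suggest a simple paging loop. The sketch below is untested and built only from the answer above: the behaviour of 'page'/'pageSize', the 'Data' key, and the 'Origin' header are assumptions carried over from that code, not verified against the live API.

```
import re

import pandas as pd
import requests

BASE = 'https://careers.amgen.com/ListJobs'
headers = {'Origin': 'https://careers.amgen.com'}
fields = 'JobTitle,DisplayJobId,City,State,Zip,Country,AddedOn,UrlJobTitle'


def fetch_all(page_size=200):
    # Find the API endpoint embedded in the listing page, as the answer above does.
    html = requests.get(BASE).text
    api = re.search('JobsApiUrl="(.*?)"', html).group(1)

    frames, page = [], 1
    while True:
        params = {'sort': 'AddedOn desc', 'page': str(page),
                  'pageSize': str(page_size), 'group': '', 'filter': '',
                  'fields': fields}
        data = requests.get(api, params=params, headers=headers).json().get('Data', [])
        if not data:  # assume an empty page means we are past the last one
            break
        frames.append(pd.DataFrame(data))
        page += 1
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()


df = fetch_all()
df['company'] = 'Amgen'  # extra column the question also builds
df.to_csv('data.csv', index=False)
```

Reading the JSON endpoint avoids the browser entirely, which is why it sidesteps the pagination problem from the question.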


A one-line change to your code will do the job. Instead of the existing XPath you use to select the "Next" arrow, use the following XPath:

>>> next_page = browser.find_element_by_xpath('//a[@class="k-link k-pager-nav"]//following::a[@class="k-link k-pager-nav"]')
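
For completeness, here is a sketch (untested) of how that XPath could be folded back into the question's loop: the page is re-parsed and the "Next" link re-located on every iteration, so the element reference never goes stale. The ten-rows-per-page slice, the six-page range, and the cell class names are all taken from the question and may need adjusting.

```
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Chrome("")  # path to chromedriver
browser.get('https://careers.amgen.com/ListJobs?')
browser.implicitly_wait(100)


def cell(row, cls):
    # Text of a <td> with the given class, or None if it is missing.
    td = row.find('td', {'class': cls})
    return td.get_text(strip=True) if td else None


records = []
for p in range(1, 7):  # pages 1..6, as in the question
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    for row in soup.find_all('tr', attrs={'role': 'row'})[2:12]:
        link = row.find('a')
        records.append({
            'id': cell(row, 'DisplayJobId-cell'),
            'jobs': cell(row, 'JobTitle-cell'),
            'url': 'https://careers.amgen.com/' + link.attrs['href'] if link else None,
            'city': cell(row, 'City-cell'),
            'state': cell(row, 'State-cell'),
            'zip': cell(row, 'Zip-cell'),
            'country': cell(row, 'Country-cell'),
            'added': cell(row, 'AddedOn-cell'),
        })
    # Re-locate the pager link after every page load instead of reusing
    # the element found on the first page.
    next_page = browser.find_element_by_xpath(
        '//a[@class="k-link k-pager-nav"]//following::a[@class="k-link k-pager-nav"]')
    next_page.click()
    browser.implicitly_wait(20)

df = pd.DataFrame(records)
df['company'] = 'Amgen'
```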