Python: scraping more than one page and eliminating duplicate results
Right now my program never gets past the first page, and the results are repeated when I write them to Excel. I would like to know how to fix this. I have been looking at the URL, but I still can't work out why every job posting gives me the same repeated result instead of one result per posting.
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup as Soup

col = ['Name','Company','City','Ratings','Summary','Date']
indeed = pd.DataFrame(columns = col)

for page in range(0,5):
    url = "https://www.indeed.com/jobs?q=Analyst&l=92840&radius=150&start=10"
    P_url = requests.get(url)
    P_html = P_url.text
    P_soup = Soup(P_html, 'html.parser')
    containers = P_soup.findAll("div", {"data-tn-component": "organicJob"})
    #print(len(containers))
    #print(Soup.prettify(containers[0]))
    container = containers[0]
    for container in containers:
        Name = container.findAll("a", {"class": "jobtitle turnstileLink"})
        if len(Name) !=0:
            name = Name[0].text.strip()
        else:
            name = "NaN"
    Company = container.findAll("span", {"class":"company"})
    if len(Company) !=0:
        comp = Company[0].text.strip()
    else:
        comp = "NaN"
    City = container.findAll('span', {"class":"location accessible-contrast-color-location"})
    if len(City) !=0:
        city = City[0].text.strip()
    else:
        city = "NaN"
    ratings = container.findAll("span", {"class":"ratingDisplay"})
    if len(ratings) !=0:
        rat = ratings[0].text.strip()
    else:
        rat = "NaN"
    Summ = container.findAll("div", {"class":"summary"})
    if len(Summ) !=0:
        summ = Summ[0].text.strip()
    else:
        summ = "NaN"
    date = container.findAll('span', {"class":"date"})
    if len(date) !=0:
        dat = date[0].text.strip()
    else:
        dat = "NaN"
    data = pd.DataFrame([[name, comp, city, rat, summ, dat]])
    data.columns = col
    indeed = indeed.append(data, ignore_index = True)
    P_url = requests.get(url)
    P_url.text

print(indeed)
indeed.to_excel("output.xlsx")
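It looks like you are not updating the url variable, and you are not iterating over the containers correctly: Company, City, etc. are not inside the container loop (check the indentation).

For the url, you probably want something like this:

url = "https://www.indeed.com/jobs?q=Analyst&l=92840&radius=150&start={}"
for page in range(0, 5):
    P_url = requests.get(url.format(10*page))
    ...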
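To check the pagination logic without sending any requests, the sketch below only builds and prints the page URLs. It assumes, as the snippet above does, that the site advances results in steps of 10 through the start parameter; the helper name page_urls and the BASE constant are made up for illustration.

```python
from urllib.parse import urlencode

# Assumption: same search endpoint and query values as in the question.
BASE = "https://www.indeed.com/jobs"

def page_urls(pages, step=10):
    """Return one search URL per results page (hypothetical helper)."""
    urls = []
    for page in range(pages):
        params = {"q": "Analyst", "l": "92840", "radius": 150, "start": page * step}
        urls.append(BASE + "?" + urlencode(params))
    return urls

# Each URL should differ only in its start offset.
for u in page_urls(5):
    print(u)
```

If all printed URLs are identical, every iteration downloads the same page, which is one way the repeated rows can arise.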
This should work as expected:
import pandas as pd
import requests
from bs4 import BeautifulSoup as Soup

col = ['Name', 'Company', 'City', 'Ratings', 'Summary', 'Date']
indeed = pd.DataFrame(columns=col)
url = "https://www.indeed.com/jobs?q=Analyst&l=92840&radius=150&start={}"

for page in range(0, 5):
    # Request a different results page on each iteration (start=0, 10, 20, ...)
    P_url = requests.get(url.format(10*page))
    P_html = P_url.text
    P_soup = Soup(P_html, 'html.parser')
    containers = P_soup.findAll("div", {"data-tn-component": "organicJob"})
    for container in containers:
        # find() returns None when nothing matches, so test for None instead of len()
        Name = container.find("a", {"class": "jobtitle turnstileLink"})
        if Name is not None:
            name = Name.text.strip()
        else:
            name = "NaN"
        Company = container.findAll("span", {"class": "company"})
        if len(Company) != 0:
            comp = Company[0].text.strip()
        else:
            comp = "NaN"
        City = container.findAll('span', {"class": "location accessible-contrast-color-location"})
        if len(City) != 0:
            city = City[0].text.strip()
        else:
            city = "NaN"
        ratings = container.findAll("span", {"class": "ratingDisplay"})
        if len(ratings) != 0:
            rat = ratings[0].text.strip()
        else:
            rat = "NaN"
        Summ = container.findAll("div", {"class": "summary"})
        if len(Summ) != 0:
            summ = Summ[0].text.strip()
        else:
            summ = "NaN"
        date = container.findAll('span', {"class": "date"})
        if len(date) != 0:
            dat = date[0].text.strip()
        else:
            dat = "NaN"
        # One row per job posting, appended inside the container loop
        data = pd.DataFrame([[name, comp, city, rat, summ, dat]])
        data.columns = col
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        indeed = pd.concat([indeed, data], ignore_index=True)
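The six near-identical if/else blocks can be folded into one small helper. This is only a sketch: the name first_text is made up, and the HTML string is an invented stand-in for a real job container, not markup copied from the site.

```python
from bs4 import BeautifulSoup

def first_text(container, tag, attrs, default="NaN"):
    """Return the stripped text of the first matching tag, or `default` (hypothetical helper)."""
    node = container.find(tag, attrs)  # find() returns None when nothing matches
    return node.get_text(strip=True) if node is not None else default

# Invented sample markup, used only to exercise the helper offline.
html = '<div><span class="company"> Acme </span></div>'
container = BeautifulSoup(html, "html.parser").div

comp = first_text(container, "span", {"class": "company"})  # tag is present
dat = first_text(container, "span", {"class": "date"})      # tag is missing, falls back
print(comp, dat)
```

Inside the container loop, each field would then take one line, for example comp = first_text(container, "span", {"class": "company"}).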
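If you want to move from page to page, the url is not being formatted properly inside the loop. Add print(url) to see which URL is actually requested on each iteration. To fix it, I believe you need url = url + f'&start={page*10}'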
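OK, now I have a better url and it is doing what I want, but I still end up with repeated output. How can I get unique data for each company in the output instead of duplicates?

  Name                                            Company            City    Ratings
0 Procurement Analyst, Body, Southern California  Rivian Automotive  Irvine
1 Procurement Analyst, Body, Southern California  Rivian Automotive  Irvine
2 Procurement Analyst, Body, Southern California  Rivian Automotive  Irvine
3 Procurement Analyst, Body, Southern California  Rivian Automotive  Irvine
4 Procurement Analyst, Body, Southern California  Rivian Automotive  Irvine

To improve the code further: as the number of items grows, creating a DataFrame on every iteration and using append will get slow. It is better to add each row to a list and create the DataFrame once, after the loop :) Nice answer, too.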
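The list-based approach from the last comment, plus one way to drop repeated rows, might look like the sketch below. The rows are made-up placeholders rather than real scraped data, and drop_duplicates is pandas' general de-duplication method, not something the thread itself settled on.

```python
import pandas as pd

col = ['Name', 'Company', 'City', 'Ratings', 'Summary', 'Date']

# Made-up rows standing in for scraped results; the first two are identical.
rows = [
    ["Analyst A", "Company X", "Irvine", "NaN", "Summary 1", "1 day ago"],
    ["Analyst A", "Company X", "Irvine", "NaN", "Summary 1", "1 day ago"],
    ["Analyst B", "Company Y", "Anaheim", "4.1", "Summary 2", "2 days ago"],
]

# Build the DataFrame once, after all rows have been collected in the list.
indeed = pd.DataFrame(rows, columns=col)

# Keep the first occurrence of each fully identical row and renumber the index.
unique = indeed.drop_duplicates().reset_index(drop=True)
print(unique)
```

In the scraper, rows.append([name, comp, city, rat, summ, dat]) would replace the per-iteration DataFrame inside the container loop. To treat postings as duplicates based on only some columns, drop_duplicates accepts a subset argument, for example subset=['Name', 'Company'].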