Python: scraping more than one page and eliminating duplicate results

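Currently my program doesn't get past the first page, and the results are repeated when I write them to Excel. I'd like to know how to fix this. I've been looking at the URL, but I keep wondering why I get the same job posting repeated instead of one result per posting.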

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup as Soup

col = ['Name','Company','City','Ratings','Summary','Date']
indeed = pd.DataFrame(columns = col)

for page in range(0,5):
    url = "https://www.indeed.com/jobs?q=Analyst&l=92840&radius=150&start=10"
    P_url = requests.get(url)
    P_html = P_url.text
    P_soup = Soup(P_html, 'html.parser')
    containers = P_soup.findAll("div", {"data-tn-component": "organicJob"})
    #print(len(containers))
    #print(Soup.prettify(containers[0]))
    container = containers[0]
    for container in containers:
        Name = container.findAll("a", {"class": "jobtitle turnstileLink"})
        if len(Name) !=0:
            name = Name[0].text.strip()
        else:
            name = "NaN"
    
    Company = container.findAll("span", {"class":"company"})
    if len(Company) !=0:
        comp = Company[0].text.strip()
    else:
        comp = "NaN"
    
    City = container.findAll('span', {"class":"location accessible-contrast-color-location"})
    if len(City) !=0:
        city = City[0].text.strip()
    else:
        city = "NaN"
        
    ratings = container.findAll("span", {"class":"ratingDisplay"})
    if len(ratings) !=0:
        rat = ratings[0].text.strip()
    else:
        rat = "NaN"
        
    Summ = container.findAll("div", {"class":"summary"})
    if len(Summ) !=0:
        summ = Summ[0].text.strip()
    else:
        summ = "NaN"
        
    date = container.findAll('span', {"class":"date"})
    if len(date) !=0:
        dat = date[0].text.strip()
    else:
        dat = "NaN"
        
    data = pd.DataFrame([[name, comp, city, rat, summ, dat]])
    data.columns = col
    indeed = indeed.append(data, ignore_index = True)
    
P_url = requests.get(url)
P_url.text
    
print(indeed)
indeed.to_excel("output.xlsx")  

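It looks like you're not updating the url variable, and you're not iterating over the containers correctly: Company, City, etc. are not inside the container loop (check the indentation).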
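
The effect of the indentation can be reproduced with a small stand-in. This is only a sketch: the strings below are placeholders for the parsed containers, not real scrape data. Code that is dedented out of the loop runs once, after the loop has finished, so it only ever sees the last container.

```python
# Placeholder values standing in for the parsed job containers
containers = ["job 1", "job 2", "job 3"]

# Mis-indented: the append sits outside the loop, so it runs once
# and only stores the value left over from the last iteration.
wrong = []
for container in containers:
    name = container.upper()
wrong.append(name)

# Correctly indented: the append runs once per container.
right = []
for container in containers:
    name = container.upper()
    right.append(name)

print(wrong)
print(right)
```

Combined with a url that never changes, this is why the same single posting shows up once per page in the output.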

For the url, you probably want something like this:

url = "https://www.indeed.com/jobs?q=Analyst&l=92840&radius=150&start={}"

for page in range(0, 5):
    P_url = requests.get(url.format(10*page))
    ...

This should work as expected:

import pandas as pd
import requests
from bs4 import BeautifulSoup as Soup

col = ['Name', 'Company', 'City', 'Ratings', 'Summary', 'Date']
indeed = pd.DataFrame(columns=col)
url = "https://www.indeed.com/jobs?q=Analyst&l=92840&radius=150&start={}"

for page in range(0, 5):
    P_url = requests.get(url.format(10*page))
    P_html = P_url.text
    P_soup = Soup(P_html, 'html.parser')
    containers = P_soup.findAll("div", {"data-tn-component": "organicJob"})

    for container in containers:
        Name = container.find("a", {"class": "jobtitle turnstileLink"})
        if Name is not None:  # find() returns None when nothing matches
            name = Name.text.strip()
        else:
            name = "NaN"

        Company = container.findAll("span", {"class": "company"})
        if len(Company) != 0:
            comp = Company[0].text.strip()
        else:
            comp = "NaN"

        City = container.findAll('span', {"class": "location accessible-contrast-color-location"})
        if len(City) != 0:
            city = City[0].text.strip()
        else:
            city = "NaN"

        ratings = container.findAll("span", {"class": "ratingDisplay"})
        if len(ratings) != 0:
            rat = ratings[0].text.strip()
        else:
            rat = "NaN"

        Summ = container.findAll("div", {"class": "summary"})
        if len(Summ) != 0:
            summ = Summ[0].text.strip()
        else:
            summ = "NaN"

        date = container.findAll('span', {"class": "date"})
        if len(date) != 0:
            dat = date[0].text.strip()
        else:
            dat = "NaN"

        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        data = pd.DataFrame([[name, comp, city, rat, summ, dat]], columns=col)
        indeed = pd.concat([indeed, data], ignore_index=True)

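If you want to change pages, the url in your loop isn't formatted correctly. Run print(url) to see which url is requested on each iteration. To fix this, I believe you need
url = url + f'&start={page*10}'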
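
One caveat with appending to url inside the loop is that the start parameters pile up from one iteration to the next. A minimal sketch of building each page's URL from a fixed base instead, assuming the start parameter advances in steps of 10 as in the answer above; page_url is a hypothetical helper name, not part of the original code.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com/jobs"


def page_url(page, query="Analyst", location="92840", radius=150):
    """Return the search URL for a zero-based page index (hypothetical helper)."""
    params = {"q": query, "l": location, "radius": radius, "start": 10 * page}
    return f"{BASE_URL}?{urlencode(params)}"


# Print the URLs first to confirm that every iteration requests a different page.
for page in range(3):
    print(page_url(page))
```

Each call starts from the same base, so the result depends only on the page index and not on earlier iterations.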
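OK, now I have a better url and it does what I want, but I keep getting duplicate output at the end. How can I get unique data for each company in the output instead of repeated rows?
   Name                                         company            City    Ratings
0  Purchasing Analyst, Body Southern California  Rivian Automotive  Irvine
1  Purchasing Analyst, Body Southern California  Rivian Automotive  Irvine
2  Purchasing Analyst, Body Southern California  Rivian Automotive  Irvine
3  Purchasing Analyst, Body Southern California  Rivian Automotive  Irvine
4  Purchasing Analyst, Body Southern California  Rivian Automotive  Irvine
To improve the code further: as the number of items grows, creating data and appending it to the DataFrame on every iteration gets slow. It is better to append to a list and create the DataFrame after the loop :) Good answer, too.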
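
The last suggestion can be sketched as follows: collect one dict per posting in a plain list, build the DataFrame once after the loop, and then drop repeated rows with DataFrame.drop_duplicates. The sample rows are made up for illustration (hypothetical data, not real scrape output), and comparing all columns when deduplicating is an assumption; the subset argument restricts the comparison to fewer columns.

```python
import pandas as pd

col = ['Name', 'Company', 'City', 'Ratings', 'Summary', 'Date']

# Hypothetical rows standing in for the values scraped from each container
scraped = [
    ("Analyst A", "Company X", "Irvine", "NaN", "Summary 1", "1 day ago"),
    ("Analyst A", "Company X", "Irvine", "NaN", "Summary 1", "1 day ago"),
    ("Analyst B", "Company Y", "Anaheim", "4.1", "Summary 2", "3 days ago"),
]

rows = []
for values in scraped:
    # Appending to a list is cheap; no DataFrame is rebuilt per iteration.
    rows.append(dict(zip(col, values)))

# Build the DataFrame once, after the loop.
indeed = pd.DataFrame(rows, columns=col)

# Keep the first occurrence of each identical row and renumber the index.
unique = indeed.drop_duplicates().reset_index(drop=True)
print(unique)
```

If the same posting can appear with a different date or summary, something like drop_duplicates(subset=['Name', 'Company', 'City']) may be closer to what is wanted.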