Python: iterating over a list with BeautifulSoup
I'm using BeautifulSoup4 to build a JSON-style list containing 'title', 'company', 'location', 'date posted', and 'link' from a public LinkedIn job search. I have the format the way I want it, but it only captures one job listing from the page. I'd like to iterate over every job on the page in the same format. For example, I'm trying to achieve the following:
[{'title': 'Job 1', 'company': 'company 1.', 'location': 'sunny side, California', 'date posted': '2 weeks ago', 'link': 'example1.com'}]
[{'title': 'Job 2', 'company': 'company 2.', 'location': 'runny side, California', 'date posted': '2 days ago', 'link': 'example2.com'}]
I've tried changing lines 48, 52, 56, 60, and 64 from contents.find to contents.findAll, but that returns everything lumped together rather than one job at a time as I'm trying to achieve:
from bs4 import BeautifulSoup
import requests

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

def search_website(url):
    # Search HTML Page
    result = requests.get(url)
    content = result.content
    soup = BeautifulSoup(content, 'html.parser')
    # Job List
    jobs = []
    for contents in soup.find_all('body'):
        # Title
        title = contents.find('h3', attrs={'class': 'result-card__title job-result-card__title'})
        formatted_title = strip_tags(str(title))
        # Company
        company = contents.find('h4', attrs={'class': 'result-card__subtitle job-result-card__subtitle'})
        formatted_company = strip_tags(str(company))
        # Location
        location = contents.find('span', attrs={'class': 'job-result-card__location'})
        formatted_location = strip_tags(str(location))
        # Date Posted
        posted = contents.find('time', attrs={'class': 'job-result-card__listdate'})
        formatted_posted = strip_tags(str(posted))
        # Apply Link
        links = contents.find('a', attrs={'class': 'result-card__full-card-link'})
        formatted_link = links.get('href')
        # Add a new compiled job to our dict
        jobs.append({'title': formatted_title,
                     'company': formatted_company,
                     'location': formatted_location,
                     'date posted': formatted_posted,
                     'link': formatted_link
                     })
    # Return our jobs
    return jobs

link = ("https://www.linkedin.com/jobs/search/currentJobId=1396095018&distance=25&f_E=3%2C4&f_LF=f_AL&geoId=102250832&keywords=software%20engineer&location=Mountain%20View%2C%20California%2C%20United%20States")
print(search_website(link))
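For context, strip_tags above relies on an MLStripper helper whose definition isn't shown in the question. A typical definition (this is an assumption, following the common html.parser-based pattern) looks like:

```python
from html.parser import HTMLParser
from io import StringIO

class MLStripper(HTMLParser):
    """Collects only the text content of an HTML fragment, discarding tags."""
    def __init__(self):
        super().__init__()
        self.text = StringIO()

    def handle_data(self, data):
        # Called for each run of plain text between tags.
        self.text.write(data)

    def get_data(self):
        return self.text.getvalue()

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

print(strip_tags('<h3 class="result-card__title">Job 1</h3>'))
```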
I want the output to look like this:
[{'title': 'x', 'company': 'x', 'location': 'x', 'date posted': 'x', 'link': 'x'}] [{'title': 'x', 'company': 'x', 'location': 'x', 'date posted': 'x', 'link': 'x'}] +..
The output when switching to findAll returns:
[{'title': 'x''x''x''x''x', 'company': 'x''x''x''x''x', 'location': 'x''x''x''x', 'date posted': 'x''x''x''x', 'link': 'x''x''x''x'}]
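The root cause is that soup.find_all('body') yields a single element (the page body), so contents.find(...) returns only the first match, while contents.findAll(...) returns every match at once. The usual fix is to find_all the per-job container elements and call find within each one. Here is a minimal sketch against a synthetic page (the <li> wrapper and its class name are assumptions; inspect the real page markup to confirm the container selector):

```python
from bs4 import BeautifulSoup

# Synthetic stand-in for the results page: two job cards.
html = """
<ul>
  <li class="result-card">
    <h3 class="result-card__title job-result-card__title">Job 1</h3>
    <span class="job-result-card__location">sunny side, California</span>
  </li>
  <li class="result-card">
    <h3 class="result-card__title job-result-card__title">Job 2</h3>
    <span class="job-result-card__location">runny side, California</span>
  </li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')
jobs = []
# Iterate over one container per listing, not over the whole <body>.
for card in soup.find_all('li', attrs={'class': 'result-card'}):
    jobs.append({
        'title': card.find('h3').text.strip(),
        'location': card.find('span').text.strip(),
    })

for job in jobs:
    print(job)
```

Each pass of the loop sees only one card, so plain find returns that card's own title and location rather than the first one on the page.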
Here's a simplified version of your code, but it should help you get there:
import requests
from bs4 import BeautifulSoup as bs

result = requests.get('https://www.linkedin.com/jobs/search/?distance=25&f_E=2%2C3&f_JT=F&f_LF=f_AL&geoId=102250832&keywords=software%20engineer&location=Mountain%20View%2C%20California%2C%20United%20States')
soup = bs(result.content, 'html.parser')
# Job List
jobs = []
for contents in soup.find_all('body'):
    # Title
    title = contents.find('h3', attrs={'class': 'result-card__title job-result-card__title'})
    # Company
    company = contents.find('h4', attrs={'class': 'result-card__subtitle job-result-card__subtitle'})
    # Location
    location = contents.find('span', attrs={'class': 'job-result-card__location'})
    # Date Posted
    posted = contents.find('time', attrs={'class': 'job-result-card__listdate'})
    # Apply Link
    link = contents.find('a', attrs={'class': 'result-card__full-card-link'})
    # Add a new compiled job to our dict
    jobs.append({'title': title.text,
                 'company': company.text,
                 'location': location.text,
                 'date posted': posted.text,
                 'link': link.get('href')
                 })

for job in jobs:
    print(job)
Output:
{'title': 'Systems Software Engineer - Controls', 'company': 'Blue River Technology', 'location': 'Sunnyvale, California', 'date posted': '1 day ago', 'link': 'https://www.linkedin.com/jobs/view/systems-software-engineer-controls-at-blue-river-technology-1380882942?position=1&pageNum=0&trk=guest_job_search_job-result-card_result-card_click'}
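If the goal is literal JSON output rather than printed Python dicts, the finished jobs list can be serialized with json.dumps (the entries below reuse the example data from the question):

```python
import json

jobs = [
    {'title': 'Job 1', 'company': 'company 1.', 'location': 'sunny side, California',
     'date posted': '2 weeks ago', 'link': 'example1.com'},
    {'title': 'Job 2', 'company': 'company 2.', 'location': 'runny side, California',
     'date posted': '2 days ago', 'link': 'example2.com'},
]

# Serialize the scraped list to a JSON string; indent makes it human-readable.
payload = json.dumps(jobs, indent=2)
print(payload)
```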
I'm confused: what is the original url (for the "public LinkedIn job search") used in result = requests.get(url)?
Sorry, I wrote this late and forgot to include it. It's also in the link variable.
Thanks, that's a much better approach than my previous string-replace argument.