Extracting data from a website into a dictionary with Python

python beautifulsoup web-crawler data-cleaning

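Below is the code I use to extract the data for a specific job posting from indeed.com, along with its output. Besides the data I want, I also get a lot of junk. I would like to separate out the title, location, job description, and other important fields. How can I convert this into a dictionary?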

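from bs4 import BeautifulSoup
import urllib2  # Python 2; on Python 3 use urllib.request instead

final_site = 'http://www.indeed.com/cmp/Pullskill-techonoligies/jobs/Data-Scientist-229a6b09c5eb6b44?q=%22data+scientist%22'
html = urllib2.urlopen(final_site).read()
soup = BeautifulSoup(html)
deep = soup.find("td", "snip")  # the <td class="snip"> cell that holds the posting
deep.get("p", "ul")  # no effect: get() reads an attribute named "p" and defaults to "ul"
deep.get_text(strip=True)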

Output:

u'Title : Data ScientistLocation : Seattle WADuration : Fulltime / PermanentJob Responsibilities:Implement advanced and predictive analytics models usingJava,R, and Pythonetc.Develop deep expertise with Company\u2019s data warehouse, systems, product and other resources.Extract, collate and analyze data from a variety of sources to provide insights to customersCollaborate with the research team to incorporate qualitative insights into projects where appropriateKnowledge, Skills and Experience:Exceptional problem solving skillsExperience withJava,R, and PythonAdvanced data mining and predictive modeling (especially Machine learning techniques) skillsMust have statistics orientation (Theory and applied)3+ years of business experience in an advanced analytics roleStrong Python and R programming skills are required. SAS, MATLAB will be plusStrong SQL skills are looked for.Analytical and decisive strategic thinker, flexible problem solver, great team player;Able to effectively communicate to all levelsImpeccable attention to detail and very strong ability to convert complex data into insights and action planThanksNick ArthurLead Recruiternick(at)pullskill(dot)com201-497-1010 Ext: 106Salary: $120,000.00 /yearRequired experience:Java And Python And R And PHD Level Education: 4 years5 days ago-save jobwindow[\'result_229a6b09c5eb6b44\'] = {"showSource": false, "source": "Indeed", "loggedIn": false, "showMyJobsLinks": true,"undoAction": "unsave","relativeJobAge": "5 days ago","jobKey": "229a6b09c5eb6b44", "myIndeedAvailable": true, "tellAFriendEnabled": false, "showMoreActionsLink": false, "resultNumber": 0, "jobStateChangedToSaved": false, "searchState": "", "basicPermaLink": "http://www.indeed.com", "saveJobFailed": false, "removeJobFailed": false, "requestPending": false, "notesEnabled": true, "currentPage" : "viewjob", "sponsored" : false, "reportJobButtonEnabled": false};\xbbApply NowPlease review all application instructions before applying to Pullskill 
Technologies.(function(d, s, id){var js, iajs = d.getElementsByTagName(s)[0], iaqs = \'vjtk=1aa24enhqagvcdj7&hl=en_US&co=US\'; if (d.getElementById(id)){return;}js = d.createElement(s); js.id = id; js.async = true; js.src = \'https://apply.indeed.com/indeedapply/static/scripts/app/bootstrap.js\'; js.setAttribute(\'data-indeed-apply-qs\', iaqs); iajs.parentNode.insertBefore(js, iajs);}(document, \'script\', \'indeed-apply-js\'));Recommended JobsData Scientist, Energy AnalyticsRenew Financial-Oakland, CARenew Financial-5 days agoData ScientistePrize-Seattle, WAePrize-7 days agoData ScientistDocuSign-Seattle, WADocuSign-12 days agoEasily applyEngineer - US Citizen or Permanent ResidentVoxel Innovations-Raleigh, NCIndeed-8 days agoEasily applyData ScientistUnity Technologies-San Francisco, CAUnity Technologies-22 days agoEasily apply'

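If the output always has the same structure, you can build the dictionary with regular expressions:

import re

# `output` is the string returned by deep.get_text(strip=True).
job = {}
# Use re.search rather than re.match: match() only matches at the start
# of the string, so the Location pattern would never be found.
title_match = re.search(r'Title : (.+?)(?=Location)', output)
job['Title'] = title_match.group(1)
location_match = re.search(r'Location : (.+?)(?=Duration)', output)
job['Location'] = location_match.group(1)

Of course, this is a very brittle solution. You would probably be better off using BeautifulSoup's built-in parsing to get the result you need, since I would guess the fields are wrapped in standard tags.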
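
The regex idea above extends to any list of labels that appear in a fixed order. What follows is only a minimal sketch under that assumption: the helper name `extract_fields` and the shortened sample string are hypothetical, made up for illustration, with the labels taken from the output shown in the question.

```python
import re


def extract_fields(text, labels):
    """Map each label to the text between 'label :' and the next label."""
    fields = {}
    for i, label in enumerate(labels):
        # Stop at the next label if there is one, otherwise at end of string.
        if i + 1 < len(labels):
            stop = r'(?=' + re.escape(labels[i + 1]) + r')'
        else:
            stop = r'$'
        pattern = re.escape(label) + r'\s*:\s*(.+?)\s*' + stop
        match = re.search(pattern, text, re.DOTALL)
        if match:  # skip labels that are missing instead of raising
            fields[label] = match.group(1)
    return fields


# Shortened, hypothetical sample in the same shape as the scraped output.
sample = 'Title : Data ScientistLocation : Seattle WADuration : Fulltime / Permanent'
job = extract_fields(sample, ['Title', 'Location', 'Duration'])
print(job)
```

Checking `if match` avoids the `AttributeError` that `match.group(1)` raises when a label is absent, but the approach stays as brittle as the original: it still depends on the labels appearing in the expected order.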


Find the job summary element, find all the <b> elements inside it, and split each <b> element's text on " : ":

for elm in soup.find("span", id="job_summary").p.find_all("b"):
    label, text = elm.get_text().split(" : ")

    print(label.strip(), text.strip())
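
To collect these pairs into a dictionary rather than printing them, the same loop can fill a dict. This is a minimal sketch, assuming the <b> elements hold text of the form "Label : value"; the inline HTML is a hypothetical stand-in for the real page, whose markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the job page; the real markup may differ.
html = """
<span id="job_summary">
  <p><b>Title : Data Scientist</b><br>
     <b>Location : Seattle WA</b><br>
     <b>Duration : Fulltime / Permanent</b></p>
</span>
"""

soup = BeautifulSoup(html, "html.parser")
job = {}
for elm in soup.find("span", id="job_summary").p.find_all("b"):
    # partition() splits on the first colon only and never raises,
    # so <b> elements without a colon are simply skipped.
    label, sep, value = elm.get_text().partition(":")
    if sep:
        job[label.strip()] = value.strip()

print(job)
```

Using `partition(":")` instead of `split(" : ")` avoids the `ValueError` that tuple unpacking raises when a <b> element contains no separator or more than one.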


Thanks. This only gives the title, location, and position. I would like to get the job responsibilities and the other information listed under the job. Anything that appears under the job posting is valuable.
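
One possible way to approach the request above is to split the posting text on its section headings. This is a rough sketch, not a tested solution for the real page: it assumes the text was extracted with line breaks preserved, for example with `deep.get_text("\n", strip=True)`, and that the heading names are known in advance. The helper name `split_sections` and the shortened sample text are hypothetical; the two headings are the ones visible in the output in the question.

```python
import re


def split_sections(text, headings):
    """Return {heading: [lines]} for each known heading found in `text`."""
    # One alternation that matches any known heading followed by a colon.
    pattern = '(' + '|'.join(re.escape(h) for h in headings) + r')\s*:'
    # With one capture group, re.split yields
    # [before, heading, body, heading, body, ...].
    parts = re.split(pattern, text)
    sections = {}
    for heading, body in zip(parts[1::2], parts[2::2]):
        sections[heading] = [ln.strip() for ln in body.splitlines() if ln.strip()]
    return sections


# Shortened, hypothetical sample with one item per line.
sample = """Job Responsibilities:
Implement advanced and predictive analytics models
Extract, collate and analyze data from a variety of sources
Knowledge, Skills and Experience:
Exceptional problem solving skills
Strong SQL skills are looked for.
"""

sections = split_sections(
    sample, ['Job Responsibilities', 'Knowledge, Skills and Experience'])
print(sections)
```

Headings that are not in the list stay inside the body of the preceding section, so the list of headings has to be kept in sync with the postings being scraped.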