Python requests.get does not return the text inside a certain tag of the HTML document

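I am trying to parse job descriptions for a personal project. I am using Python 3.6, BeautifulSoup4 and the requests library. When I fetch the HTML of a job vacancy page with requests.get, the returned HTML is missing the most important part: the description text. For one such page, I wrote the following code: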

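import requests
from bs4 import BeautifulSoup

# Both functions take self, so they are presumably methods of a scraper class.
def scrape_job_desc(self, url):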
    job_desc_html = self._get_search_page_html(url)
    soup = BeautifulSoup(job_desc_html, features='html.parser')
    try:
        short_desc = str(soup.find('p', {'class': 'job-teaser svelte-a3rpl2'}).getText())
        full_desc = soup.find('div', {'class': 'job-description-wrapper svelte-a3rpl2'}).find('p').getText()
    except AttributeError:
        short_desc = None
        full_desc = None
    return short_desc, full_desc

def _get_search_page_html(self, url):
    html = requests.get(url=url, headers={'User-Agent': 'Mozilla/5.0 CK={} (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko'})
    return html.text

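It returns the short description but not the full one. In fact, the text of the tag I need is not present in the returned HTML at all. But when I download the page with a browser, it is there. What causes this?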

This is a typical web scraping mistake.

You probably looked at the rendered HTML in your browser and tried to get the text of the p element inside job-description-wrapper.

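However, if you load only the page itself (the first request the browser makes) and look at its content, you will find that the paragraph is empty at that point. A script fills in its content afterwards, but this happens so quickly that a user hardly notices.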

Check the output of this:

print(requests.get(url='https://djinni.co/jobs2/144172-data-scientist').text)
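
If the raw response is long, a small helper can tell whether a phrase you see in the browser is in the served HTML at all and, if it is, whether it only appears inside a script tag. This is a minimal sketch using only the standard library; locate_phrase and the sample HTML are made up for illustration and are not taken from the real page.

```python
import re

# Matches a whole <script>...</script> element, including its content.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script>", re.IGNORECASE | re.DOTALL)


def locate_phrase(html, phrase):
    """Classify where `phrase` occurs in the raw HTML returned by the server."""
    if phrase not in html:
        return "absent"        # not served at all (fetched by a later request)
    if phrase in SCRIPT_RE.sub("", html):
        return "markup"        # present in ordinary tags, BeautifulSoup can find it
    return "script-only"       # only embedded in JavaScript, inserted at render time


# Hypothetical response: empty paragraph, text only inside an inline script.
sample_html = (
    '<div class="job-description-wrapper"><p></p></div>'
    '<script>var data = {fullDescription:"What you get to deal with:"};</script>'
)
print(locate_phrase(sample_html, "What you get to deal with"))  # script-only
```

A result of "script-only" or "absent" means that parsing the response of requests.get with BeautifulSoup alone will not find the text in the tag you expect.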

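That is the root of the problem. How to solve it is another matter. One approach is to drive a headless browser from Python, let it run the page's scripts, and read the content only once the page has finished loading everything. For that, you can look at tools such as selenium.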
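
A rough sketch of the headless-browser approach with Selenium 4 could look like the following. It assumes Chrome and a matching driver are available on the machine; fetch_rendered_text is a hypothetical helper name, and the CSS selector in the usage comment is a guess based on the class names in the question, not something verified against the live page.

```python
def fetch_rendered_text(url, css_selector, timeout=10):
    """Load `url` in headless Chrome and return the element's text once it is non-empty."""
    # Imported lazily so that defining this function does not require selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # until() polls until the callable returns a truthy value, here the
        # non-empty text; a missing element is simply retried until the timeout.
        return WebDriverWait(driver, timeout).until(
            lambda d: d.find_element(By.CSS_SELECTOR, css_selector).text.strip()
        )
    finally:
        driver.quit()  # always close the browser, even on timeout


# Example call (assumed selector, needs a local Chrome installation):
# full_desc = fetch_rendered_text(
#     "https://djinni.co/jobs2/144172-data-scientist",
#     ".job-description-wrapper p",
# )
```

Waiting for non-empty text rather than for the element alone matters here, because the empty paragraph is already in the initial HTML and only its content arrives later.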

The full job description is stored in the page as a JavaScript variable. You can extract it with selenium or with the re module:

import re
import requests
from bs4 import BeautifulSoup


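url = 'https://djinni.co/jobs2/144172-data-scientist'
html_data = requests.get(url).text

# The full description is embedded in an inline script as fullDescription:"...";
# the raw match contains literal \r\n sequences, which are turned into newlines.
full_desc = re.search(r'fullDescription:"(.*?)",', html_data).group(1).replace(r'\r\n', '\n')
# The short description is ordinary markup, so BeautifulSoup can read it directly.
short_desc = BeautifulSoup(html_data, 'html.parser').select_one('.job-teaser').get_text()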

print(short_desc)
print('-' * 80)
print(full_desc)

Prints:

Together Networks is looking for an experienced Data Scientist to join our Agile team. Together Networks is a worldwide leader in the online dating niche with millions of users across more than 45 countries. Our brands are BeNaughty, CheekyLovers, Flirt, Click&Flirt, Flirt Spielchen.
--------------------------------------------------------------------------------
What you get to deal with:

- Active collaboration with stakeholders throughout the organization;
- User experience modelling;
- Advanced segmentation;
- User behavior analytics;
- Anomaly detection, fraud detection;
- Looking for bottlenecks;
- Churn prediction.
 

You need to have (required):

- Masterâs or PHD in Statistics, Mathematics, Computer Science or another quantitative field;
- 2+ years of experience manipulating data sets and building statistical models;
- Strong knowledge in a wide range of machine learning methods and algorithms for classification, regression, clustering, and others;
- Knowledge and experience in statistical and data mining techniques;
- Experience using statistical computer languages (Python, SLQ, etc.) to manipulate data and draw insights from large data sets.
- Knowledge of a variety of machine learning techniques and their real-world advantages\u002Fdrawbacks;
- Experience visualizing\u002Fpresenting insights for stakeholders;
- Independent, creative thinking, and ability to learn fast.

Would be a great plus:

- Experience dealing with end to end machine learning projects: data exploration, feature engineering\u002Fdefinition, model building, production, maintenance;
- Experience in data visualization with Tableau;
- Experience in dating, game dev, social projects.
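
The printed text still contains JavaScript escapes such as \u002F (a forward slash), because the regular expression captures the raw body of the string literal. One way to decode them, sketched here under the assumption that the literal only uses escapes that JSON also understands, is to hand the captured text to json.loads. decode_js_string is a hypothetical helper, and the sample string is a shortened piece of the output above.

```python
import json


def decode_js_string(raw):
    """Decode backslash escapes in the body of a double-quoted JS string literal."""
    # Assumption: only JSON-compatible escapes occur; anything else (or an
    # unescaped double quote) makes json.loads raise a ValueError.
    # strict=False additionally tolerates literal control characters.
    return json.loads('"' + raw + '"', strict=False)


# Raw text as captured by the regex: backslashes are literal characters here.
raw = r"advantages\u002Fdrawbacks;\r\n- Experience visualizing\u002Fpresenting"
decoded = decode_js_string(raw).replace("\r\n", "\n")
print(decoded)
```

With this helper, the .replace(r'\r\n', '\n') call on the raw match is no longer needed; after decoding, the line endings are real characters and can be normalized as shown.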