Python: how do I get links from the job-classifieds site monster.com? I want to scrape the URLs for a particular job category from the pages that display the search results:


If you look at the HTML, you will see that the URLs are located in blocks like the one shown further below. The web page showing the search results ends with a
page=1
parameter. We want to increment that value until the "load more jobs" button turns into a "No More Results" message.
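The increment-the-page-parameter idea can be sketched like this (a minimal sketch; `search_url` is a hypothetical helper, and the query parameters are the ones from the URL in this question):

```python
from urllib.parse import urlencode

# Hypothetical helper: build the search URL for a given page number,
# using the query parameters from the URL in this question.
def search_url(page_num):
    base = "https://www.monster.com/jobs/search/"
    params = {"q": "python", "where": "aurora__2C-co",
              "stpage": 1, "page": page_num}
    return base + "?" + urlencode(params)

print(search_url(1))
print(search_url(2))
```

The fetch-and-check-for-"No More Results" step would then go inside a loop over `page_num`.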

My code is provided below. It does not work very well:
import urllib.request

# BEFORE clicking the `LOAD MORE JOBS` button:
#    https://www.monster.com/jobs/search/?q=python&where=aurora__2C-co&stpage=1&page=2
# AFTER clicking `LOAD MORE JOBS`:
#    https://www.monster.com/jobs/search/?q=python&where=aurora__2C-co&stpage=1&page=3
# At the end of the URL, `page=2` changes to `page=3`

prefix = "https://www.monster.com/jobs/search/?q=python&where=aurora__2C-co&stpage=1&page="

sentinel = """
<a class="mux-btn btn-secondary no-more-jobs-btn disabled "
style="display:none" id="noMoreResults" role="button">No More Results</a>
"""
# Strip the newlines so the sentinel can be matched inside the raw HTML.
sentinel = "".join(ch for ch in sentinel if ch not in "\n\r")

for page_num in range(1, 90):
    print("page_num ==", page_num)
    fp = urllib.request.urlopen(prefix + str(page_num))
    mybytes = fp.read()
    page_html = mybytes.decode("utf8")
    fp.close()
    if sentinel in page_html:
        break
# `page_html` now holds the last page fetched

print("len(page_html) ==", len(page_html))

class LineIter:
    """Iterate over the lines of a string."""
    def __init__(self, stryng):
        self.it = iter(str(stryng))
        self.delims = "\n\r"
        self.depleted = False
    def __iter__(self):
        return self
    def __next__(self):
        if self.depleted:
            raise StopIteration()
        line = list()
        try:
            # Skip any leading line delimiters.
            while True:
                ch = next(self.it)
                if ch not in self.delims:
                    break
            # Collect characters up to the next delimiter.
            while ch not in self.delims:
                line.append(ch)
                ch = next(self.it)
        except StopIteration:
            self.depleted = True
            if not line:
                raise
        return "".join(line)

urls = list()
for line in LineIter(page_html):
    start = line.find("https://job-openings.monster.com/")
    if start >= 0:
        stop = line.find('"', start)
        urls.append(line[start:stop])

As I already mentioned in the comments, there are better ways to get this data in general.

However, for the question as you asked it: manually searching the HTML string would be very painful and error-prone.

First, use BeautifulSoup to parse the HTML for you:

import requests
from bs4 import BeautifulSoup

url = 'https://www.monster.com/jobs/search/?q=python&where=aurora__2C-co&stpage=1&page=2'
html = requests.get(url).text
soup = BeautifulSoup(html, "lxml")
Now you can search for and extract the script tag that contains the data you are looking for:

tags = soup('script', {'type': 'application/ld+json'})
# On this page the data is in the second of two such tags.
# You will need to verify this for other pages.
tag = tags[-1]
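Since the position of the right tag can vary, a more defensive option is to check each candidate's "@type" instead of hard-coding tags[-1]. A sketch with made-up sample strings (`pick_item_list` and the sample data are hypothetical, not from the answer):

```python
import json

# Hypothetical sketch: pick the ld+json block whose "@type" is "ItemList",
# skipping anything that is not valid JSON.
candidates = [
    '{"@type": "Organization", "name": "Monster"}',
    '{"@type": "ItemList", "itemListElement": []}',
]

def pick_item_list(texts):
    for text in texts:
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            continue
        if data.get("@type") == "ItemList":
            return data
    return None

print(pick_item_list(candidates)["@type"])  # ItemList
```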
The contents of the tag can then be parsed as JSON:

import json
data = json.loads(tag.text)
and the data in it accessed like a dictionary:

>>> data['itemListElement']
[{'@type': 'ListItem',
  'position': 51,
  'url': 'https://job-openings.monster.com/predictive-analytics-developer-python-100-remote-denver-co-us-edp-recruiting-services/e5041b2e-28fd-4036-9f17-0a3510a457dc'},
 #
 # [...]
 #
 {'@type': 'ListItem',
  'position': 108,
  'url': 'https://job-openings.monster.com/uipath-rpa-architect-denver-co-us-enquero/f14cc922-dd4d-4636-a3d4-dc3b18eec843'}]
To get the output you want, simply extract all the URLs into a list, filtering out any empty strings:

>>> [el['url'] for el in data['itemListElement'] if el['url']]
['https://job-openings.monster.com/predictive-analytics-developer-python-100-remote-denver-co-us-edp-recruiting-services/e5041b2e-28fd-4036-9f17-0a3510a457dc',
'https://job-openings.monster.com/python-automation-engineer-denver-co-us-apidel-technologies/77e8f683-2e91-403f-b663-def61b62226e',
# [...]
'https://job-openings.monster.com/full-stack-java-engineer-denver-co-us-srinav-inc/fda6f2fd-be2f-4199-90eb-585ff8f96874',
'https://job-openings.monster.com/uipath-rpa-architect-denver-co-us-enquero/f14cc922-dd4d-4636-a3d4-dc3b18eec843']
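For illustration, the same list-comprehension filter applied to a small made-up sample with the structure shown above (the URLs here are invented):

```python
# Made-up sample mirroring the structure of `data` above.
data = {
    "@type": "ItemList",
    "itemListElement": [
        {"@type": "ListItem", "position": 1,
         "url": "https://job-openings.monster.com/example-job-1"},
        {"@type": "ListItem", "position": 2, "url": ""},
        {"@type": "ListItem", "position": 3,
         "url": "https://job-openings.monster.com/example-job-2"},
    ],
}

# Empty-string URLs are falsy, so the `if el['url']` filter drops them.
urls = [el["url"] for el in data["itemListElement"] if el["url"]]
print(urls)  # two URLs; the empty one is filtered out
```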

Is there a reason you are not using their software? Please share the output you got:
<script type="application/ld+json">
            {"@context":"https://schema.org","@type":"ItemList","mainEntityOfPage":{
            "@type":"CollectionPage","@id":"https://www.monster.com/jobs/search/?q=python&amp;where=aurora__2C-co&amp;stpage=1&amp;page=10"
            }
            ,"itemListElement":[

                 {"@type":"ListItem","position":2251,"url":"https://job-openings.monster.com/19-16001-senior-python-developer-boulder-co-us-sunrise-systems-inc/e09cfe38-2a32-465d-bd66-8846b9549c6a"}
L = [
    "https://job-openings.monster.com/senior-python-architect-boulder-co-us-experis/26b7c4e8-ec4f-4d93-84e4-959fd28e150a",
    "https://job-openings.monster.com/predictive-analytics-developer-python-100-remote-denver-co-us-edp-recruiting-services/e5041b2e-28fd-4036-9f17-0a3510a457dc",
    "https://job-openings.monster.com/python-automation-engineer-denver-co-us-apidel-technologies/77e8f683-2e91-403f-b663-def61b62226e",
    "https://job-openings.monster.com/immediate-need-for-python-developer-6-month-contract-onsite-in-boulder-co-boulder-co-us-addon-technologies-inc/e2826a70-490b-4e16-a4bb-05e767c8fb1f",
    "https://job-openings.monster.com/software-test-technician-englewood-co-us-kratos-defense-security-solutions/fa39cdfe-0fe8-4e02-b325-28f21561ac33" 
]