Python: how do I get links from the job-classifieds site monster.com? I want to scrape the URLs for a particular job category from the pages that display the search results:


If you look at the HTML, you will see that the URLs are located in blocks like the one shown further below. The web page showing the search results ends with a
page=1
parameter. We want to increment that value until the "load more jobs" button turns into a "No More Results" message.
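The increment-the-page-parameter idea can be sketched like this (a minimal sketch; `search_url` is a hypothetical helper, and the query parameters are the ones from the URL in this question):

```python
from urllib.parse import urlencode

# Hypothetical helper: build the search URL for a given page number,
# using the query parameters from the URL in this question.
def search_url(page_num):
    base = "https://www.monster.com/jobs/search/"
    params = {"q": "python", "where": "aurora__2C-co",
              "stpage": 1, "page": page_num}
    return base + "?" + urlencode(params)

print(search_url(1))
print(search_url(2))
```

The fetch-and-check-for-"No More Results" step would then go inside a loop over `page_num`.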

My code is provided below. It does not work very well:
import urllib.request

# BEFORE clicking the `LOAD MORE JOBS` button:
#    https://www.monster.com/jobs/search/?q=python&where=aurora__2C-co&stpage=1&page=2
# AFTER clicking `LOAD MORE JOBS`:
#    https://www.monster.com/jobs/search/?q=python&where=aurora__2C-co&stpage=1&page=3
# At the end of the URL, `page=2` changes to `page=3`

prefix = "https://www.monster.com/jobs/search/?q=python&where=aurora__2C-co&stpage=1&page="

sentinel = """
<a class="mux-btn btn-secondary no-more-jobs-btn disabled "
style="display:none" id="noMoreResults" role="button">No More Results</a>
"""
# Strip the newlines so the sentinel can be matched inside the raw HTML.
sentinel = "".join(ch for ch in sentinel if ch not in "\n\r")

for page_num in range(1, 90):
    print("page_num ==", page_num)
    fp = urllib.request.urlopen(prefix + str(page_num))
    mybytes = fp.read()
    page_html = mybytes.decode("utf8")
    fp.close()
    if sentinel in page_html:
        break
# `page_html` now holds the last page fetched

print("len(page_html) ==", len(page_html))

class LineIter:
    """Iterate over the lines of a string."""
    def __init__(self, stryng):
        self.it = iter(str(stryng))
        self.delims = "\n\r"
        self.depleted = False
    def __iter__(self):
        return self
    def __next__(self):
        if self.depleted:
            raise StopIteration()
        line = list()
        try:
            # Skip any leading line delimiters.
            while True:
                ch = next(self.it)
                if ch not in self.delims:
                    break
            # Collect characters up to the next delimiter.
            while ch not in self.delims:
                line.append(ch)
                ch = next(self.it)
        except StopIteration:
            self.depleted = True
            if not line:
                raise
        return "".join(line)

urls = list()
for line in LineIter(page_html):
    start = line.find("https://job-openings.monster.com/")
    if start >= 0:
        stop = line.find('"', start)
        urls.append(line[start:stop])

As I already mentioned in the comments, there are better ways to get this data in general.

However, for the question as you asked it: manually searching the HTML string would be very painful and error-prone.

First, use BeautifulSoup to parse the HTML for you:

import requests
from bs4 import BeautifulSoup

url = 'https://www.monster.com/jobs/search/?q=python&where=aurora__2C-co&stpage=1&page=2'
html = requests.get(url).text
soup = BeautifulSoup(html, "lxml")
Now you can search for and extract the script tag that contains the data you are looking for:

tags = soup('script', {'type': 'application/ld+json'})
# On this page the data is in the second of two such tags.
# You will need to verify this for other pages.
tag = tags[-1]
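Since the position of the right tag can vary, a more defensive option is to check each candidate's "@type" instead of hard-coding tags[-1]. A sketch with made-up sample strings (`pick_item_list` and the sample data are hypothetical, not from the answer):

```python
import json

# Hypothetical sketch: pick the ld+json block whose "@type" is "ItemList",
# skipping anything that is not valid JSON.
candidates = [
    '{"@type": "Organization", "name": "Monster"}',
    '{"@type": "ItemList", "itemListElement": []}',
]

def pick_item_list(texts):
    for text in texts:
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            continue
        if data.get("@type") == "ItemList":
            return data
    return None

print(pick_item_list(candidates)["@type"])  # ItemList
```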
The contents of the tag can then be parsed as JSON:

import json
data = json.loads(tag.text)
and the data in it accessed like a dictionary:

>>> data['itemListElement']
[{'@type': 'ListItem',
  'position': 51,
  'url': 'https://job-openings.monster.com/predictive-analytics-developer-python-100-remote-denver-co-us-edp-recruiting-services/e5041b2e-28fd-4036-9f17-0a3510a457dc'},
 #
 # [...]
 #
 {'@type': 'ListItem',
  'position': 108,
  'url': 'https://job-openings.monster.com/uipath-rpa-architect-denver-co-us-enquero/f14cc922-dd4d-4636-a3d4-dc3b18eec843'}]
To get the output you want, simply extract all the URLs into a list, filtering out any empty strings:

>>> [el['url'] for el in data['itemListElement'] if el['url']]
['https://job-openings.monster.com/predictive-analytics-developer-python-100-remote-denver-co-us-edp-recruiting-services/e5041b2e-28fd-4036-9f17-0a3510a457dc',
'https://job-openings.monster.com/python-automation-engineer-denver-co-us-apidel-technologies/77e8f683-2e91-403f-b663-def61b62226e',
# [...]
'https://job-openings.monster.com/full-stack-java-engineer-denver-co-us-srinav-inc/fda6f2fd-be2f-4199-90eb-585ff8f96874',
'https://job-openings.monster.com/uipath-rpa-architect-denver-co-us-enquero/f14cc922-dd4d-4636-a3d4-dc3b18eec843']
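For illustration, the same list-comprehension filter applied to a small made-up sample with the structure shown above (the URLs here are invented):

```python
# Made-up sample mirroring the structure of `data` above.
data = {
    "@type": "ItemList",
    "itemListElement": [
        {"@type": "ListItem", "position": 1,
         "url": "https://job-openings.monster.com/example-job-1"},
        {"@type": "ListItem", "position": 2, "url": ""},
        {"@type": "ListItem", "position": 3,
         "url": "https://job-openings.monster.com/example-job-2"},
    ],
}

# Empty-string URLs are falsy, so the `if el['url']` filter drops them.
urls = [el["url"] for el in data["itemListElement"] if el["url"]]
print(urls)  # two URLs; the empty one is filtered out
```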

Is there a reason you are not using their software? Please share the output you got:
<script type="application/ld+json">
            {"@context":"https://schema.org","@type":"ItemList","mainEntityOfPage":{
            "@type":"CollectionPage","@id":"https://www.monster.com/jobs/search/?q=python&amp;where=aurora__2C-co&amp;stpage=1&amp;page=10"
            }
            ,"itemListElement":[

                 {"@type":"ListItem","position":2251,"url":"https://job-openings.monster.com/19-16001-senior-python-developer-boulder-co-us-sunrise-systems-inc/e09cfe38-2a32-465d-bd66-8846b9549c6a"}
L = [
    "https://job-openings.monster.com/senior-python-architect-boulder-co-us-experis/26b7c4e8-ec4f-4d93-84e4-959fd28e150a",
    "https://job-openings.monster.com/predictive-analytics-developer-python-100-remote-denver-co-us-edp-recruiting-services/e5041b2e-28fd-4036-9f17-0a3510a457dc",
    "https://job-openings.monster.com/python-automation-engineer-denver-co-us-apidel-technologies/77e8f683-2e91-403f-b663-def61b62226e",
    "https://job-openings.monster.com/immediate-need-for-python-developer-6-month-contract-onsite-in-boulder-co-boulder-co-us-addon-technologies-inc/e2826a70-490b-4e16-a4bb-05e767c8fb1f",
    "https://job-openings.monster.com/software-test-technician-englewood-co-us-kratos-defense-security-solutions/fa39cdfe-0fe8-4e02-b325-28f21561ac33" 
]