Python: unable to scrape similar links from different depths of a webpage

python python-3.x web-scraping

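I've written a script in Python to parse different links from a webpage. The landing page has two sections: one is Top Experiences and the other is More Experiences. My current attempt can fetch the links from both categories.

The type of link I want to collect currently appears (only a few of them) under the Top Experiences section. However, when I follow the links under More Experiences, I can see that they all lead to a page containing a section named Experiences, and the links under that section are similar to those under Top Experiences on the landing page. I want to grab them all.

One such desired link looks like:

https://www.airbnb.com/experiences/20712?source=seo

My current attempt fetches the links from both categories: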

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

URL = "https://www.airbnb.com/sitemaps/v2/experiences_pdp-L0-0"

def get_links(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    items = [urljoin(link,item.get("href")) for item in soup.select("div[style='margin-top:16px'] a._1f0v6pq")]
    return items

if __name__ == '__main__':
    for item in get_links(URL):
        print(item)

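How can I parse all the links under the Top Experiences section, along with the links under the Experiences section that can be found by following the links under More Experiences?

Please check the image if anything is unclear. I annotated it with a brush tool, so the writing may be a little hard to read.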

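It seems the "Top Experiences" and "More Experiences" links share the same class, so you can use .find_all to get the links:

import requests
# from urllib.parse import urljoin
from bs4 import BeautifulSoup

# URL to scrape
url = "https://www.airbnb.com/sitemaps/v2/experiences_pdp-L0-0"

# Make the request and initialise BS4 with the response content
req = requests.get(url)
soup = BeautifulSoup(req.content, "lxml")

# Tags that contain the "Top Experiences" and "More Experiences" links
links = soup.find_all(class_="_l8g1fr")

# Test code: print each link's title and href
for link in links:
    print(link.find("a").get_text())
    print(link.find("a").get('href'))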

Refactor the code to fit your own coding style.
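
The snippet above prints the raw href values. If those values are relative (an assumption here, suggested by the question's use of urljoin), they can be resolved against the page URL. A minimal sketch, where the href strings are made-up examples rather than values taken from the live page:

```python
from urllib.parse import urljoin

BASE = "https://www.airbnb.com/sitemaps/v2/experiences_pdp-L0-0"

# Hypothetical href values, chosen only to illustrate the three common cases
hrefs = [
    "/experiences/20712?source=seo",          # root-relative link
    "experiences_pdp-L1-3",                   # relative to the current directory
    "https://www.airbnb.com/experiences/1",   # already absolute
]

# urljoin resolves each href against the URL of the page it was found on
absolute = [urljoin(BASE, href) for href in hrefs]
for link in absolute:
    print(link)
```

An absolute href is returned unchanged, so it is safe to pass every href through urljoin without checking it first.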
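You can scrape the links from the div with class _12kw8n71.

Output: only the Top Experiences links and part of the More Experiences links were shown, because the full output exceeds Stack Overflow's character limit.

Process:
1. Get all the Top Experiences links.
2. Get all the More Experiences links.
3. Send a request to each More Experiences link, one by one, and get the links under Experiences on each page.

The div that holds the links has the same class, _12kw8n71, on every page.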

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
from time import sleep
from random import randint
URL = "https://www.airbnb.com/sitemaps/v2/experiences_pdp-L0-0"
res = requests.get(URL)
soup = BeautifulSoup(res.text,"lxml")
top_experiences= [urljoin(URL,item.get("href")) for item in soup.find_all("div",class_="_12kw8n71")[0].find_all('a')]
more_experiences= [urljoin(URL,item.get("href")) for item in soup.find_all("div",class_="_12kw8n71")[1].find_all('a')]
generated_experiences=[]
# Visit each link in more_experiences and parse that page (not the landing page)
for url in more_experiences:
    sleep(randint(1,10))  # random delay to reduce the chance of being blocked
    page_soup = BeautifulSoup(requests.get(url).text,"lxml")
    generated_experiences.extend([urljoin(URL,item.get("href")) for item in page_soup.find_all("div",class_="_12kw8n71")[0].find_all('a')])

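Notes:
The links you need will be in three lists: top_experiences, more_experiences and generated_experiences.
I added a random delay to avoid getting blocked.
The lists are not printed because they are too long.
top_experiences - 50 links
more_experiences - 299 links
generated_experiences - 14950 links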
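
Since the lists are long and may overlap, it can help to merge them, keep only the links of the wanted shape, and drop duplicates. A minimal sketch, assuming a wanted link always has a path of the form /experiences/<numeric id> as in the example from the question; the helper names and the sample data are hypothetical:

```python
import re
from urllib.parse import urlparse

# Assumed shape of a wanted link: the path is /experiences/<digits>
EXPERIENCE_PATH = re.compile(r"/experiences/\d+")

def is_experience_link(url):
    """Return True if the URL path looks like /experiences/<digits>."""
    return EXPERIENCE_PATH.fullmatch(urlparse(url).path) is not None

def unique_experience_links(*link_lists):
    """Merge the lists, keep experience links only, drop duplicates, keep order."""
    merged = (url for links in link_lists for url in links)
    # dict.fromkeys keeps the first occurrence of each key in insertion order
    return list(dict.fromkeys(url for url in merged if is_experience_link(url)))

# Made-up sample data standing in for the scraped lists
sample_top = [
    "https://www.airbnb.com/experiences/20712?source=seo",
    "https://www.airbnb.com/sitemaps/v2/experiences_pdp-L1-3",
]
sample_generated = [
    "https://www.airbnb.com/experiences/20712?source=seo",
    "https://www.airbnb.com/experiences/99?source=seo",
]
wanted = unique_experience_links(sample_top, sample_generated)
print(len(wanted))
```

The query string is ignored when matching because only the parsed path is tested, so links with and without ?source=seo both pass the filter.
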
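The solution is slightly tricky, and it can be achieved in several ways. The approach I found most useful is to pass the links under More Experiences back through the get_links() function recursively. All the links under More Experiences share a common keyword, _pdp-.

So when you define a conditional statement inside the function that sends those links through get_links() recursively, the else block produces the desired links. The most important thing to notice is that all the desired links are within the class _1f0v6pq, so the logic for getting the links is fairly simple: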
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

URL = "https://www.airbnb.com/sitemaps/v2/experiences_pdp-L0-0"

def get_links(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select("div[style='margin-top:16px'] a._1f0v6pq"):
        if "_pdp-" in item.get("href"):
            get_links(urljoin(URL,item.get("href")))
        else:
            print(urljoin(URL,item.get("href")))

if __name__ == '__main__':
    get_links(URL)
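
The recursive idea above can also be written so that it returns the links instead of printing them, and so that it never visits the same page twice in case pages link to each other. This is only a sketch of that variation: collect_experience_links, get_hrefs and FAKE_SITE are hypothetical names, and the page fetching is passed in as a function so the traversal can be tried without network access.

```python
from urllib.parse import urljoin

def collect_experience_links(start_url, get_hrefs, marker="_pdp-"):
    """Walk sitemap pages depth-first and return the other links in discovery order.

    get_hrefs(url) must return the raw href strings found on that page.
    """
    seen_pages = set()
    found = {}  # dict used as an ordered set

    def visit(page_url):
        if page_url in seen_pages:  # guard against revisiting pages and cycles
            return
        seen_pages.add(page_url)
        for href in get_hrefs(page_url):
            if not href:  # skip anchors without an href
                continue
            absolute = urljoin(page_url, href)
            if marker in href:
                visit(absolute)         # another sitemap page: recurse into it
            else:
                found[absolute] = None  # a wanted link: record it once

    visit(start_url)
    return list(found)

# Tiny in-memory stand-in for the site, used only to exercise the traversal
FAKE_SITE = {
    "https://example.com/sitemaps/root_pdp-L0-0": [
        "/experiences/1", "/sitemaps/more_pdp-L1-0",
    ],
    "https://example.com/sitemaps/more_pdp-L1-0": [
        "/experiences/2", "/experiences/1", "/sitemaps/root_pdp-L0-0",
    ],
}
links = collect_experience_links(
    "https://example.com/sitemaps/root_pdp-L0-0",
    lambda url: FAKE_SITE.get(url, []),
)
print(links)
```

For real use, get_hrefs would be a small function that requests the page and returns item.get("href") for each element matched by the same soup.select(...) call used in get_links() above.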

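Perhaps it's my limitation that I couldn't make you understand what I'm trying to achieve, @Erick Guerra. If you follow the links under More Experiences, the Experiences section of the resulting page also offers the same type of links. I want to grab them all. Thanks.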