Scraping multiple pages from a website with Python


I would like to know how I can use BeautifulSoup to scrape multiple different pages for one city (e.g. London) from a website, without having to repeat my code over and over.

Ideally, my goal is to first scrape all pages related to one city.

Here is my code:

session = requests.Session()
session.cookies.get_dict()
url = 'http://www.citydis.com'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1)  AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = session.get(url, headers=headers)

soup = BeautifulSoup(response.content, "html.parser")
metaConfig = soup.find("meta",  property="configuration")


jsonUrl = "https://www.citydis.com/s/results.json?&q=Paris&customerSearch=1&page=0"
response = session.get(jsonUrl, headers=headers)
js_dict = (json.loads(response.content.decode('utf-8')))

for item in js_dict:
    headers = js_dict['searchResults']["tours"]
    prices = js_dict['searchResults']["tours"]

    for title, price in zip(headers, prices):
        title_final = title.get("title")
        price_final = price.get("price")["original"]

        print("Header: " + title_final + " | " + "Price: " + price_final)
The output looks like this:

Header: London Travelcard: 1 Tag lang unbegrenzt reisen | Price: 19,44 €
Header: 105 Minuten London bei Nacht im verdecklosen Bus | Price: 21,21 €
Header: Ivory House London: 4 Stunden mittelalterliches Bankett | Price: 58,92 €
Header: London: Themse Dinner Cruise | Price: 96,62 €
It only returns the results of the first page (4 results), but I would like to get all results for London (which must be more than 200 results).

Can you give me some advice? I assume I have to count the pages in the jsonUrl, but I don't know how to do that.

UPDATE

Thanks to the help I received, I was able to get a step further.

In this case I can only scrape one page (page=0), but I want to scrape the first 10 pages. My approach is therefore as follows:

Relevant snippet of my code:

soup = bs4.BeautifulSoup(response.content, "html.parser")
metaConfig = soup.find("meta",  property="configuration")

page = 0
while page <= 11:
    page += 1

    jsonUrl = "https://www.citydis.com/s/results.json?&q=Paris&customerSearch=1&page=" + str(page)
    response = session.get(jsonUrl, headers=headers)
    js_dict = (json.loads(response.content.decode('utf-8')))

    for item in js_dict:
        headers = js_dict['searchResults']["tours"]
        prices = js_dict['searchResults']["tours"]

        for title, price in zip(headers, prices):
            title_final = title.get("title")
            price_final = price.get("price")["original"]

            print("Header: " + title_final + " | " + "Price: " + price_final)
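Two details in the snippet above are worth noting: `page` is incremented before the first request, so it fetches pages 1 through 12 and skips page 0; and the inner loop rebinds `headers`, clobbering the request headers used by every later `session.get` call. A sketch of one way to fetch exactly the first ten pages — the URL pattern is taken from the question, and the `fetch` callable is a hypothetical seam so the paging logic can be shown without network access:

```python
def page_url(query, page):
    # URL pattern from the question; the parameter names are assumed stable.
    return ("https://www.citydis.com/s/results.json"
            "?&q={}&customerSearch=1&page={}".format(query, page))

def collect_tours(fetch, query, pages=10):
    """Fetch pages 0 .. pages-1 and flatten the tour dicts into one list.

    `fetch` maps a URL to a parsed JSON dict; for real use, something like
    lambda url: session.get(url, headers=headers).json()
    """
    tours = []
    for page in range(pages):            # starts at 0, no off-by-one
        js_dict = fetch(page_url(query, page))
        tours.extend(js_dict['searchResults']['tours'])
    return tours
```

With requests, you would pass `fetch=lambda url: session.get(url, headers=headers).json()`; each tour dict then exposes `tour.get("title")` and `tour.get("price")["original"]` exactly as in the code above, and no variable named `headers` is ever shadowed.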
You should really make sure that your code samples are complete (there are missing imports) and syntactically correct (your code contains indentation problems). While trying to put together a working example, I came up with the following:

import requests, json, bs4
session = requests.Session()
session.cookies.get_dict()
url = 'http://www.getyourguide.de'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1)  AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = session.get(url, headers=headers)

soup = bs4.BeautifulSoup(response.content, "html.parser")
metaConfig = soup.find("meta",  property="configuration")
metaConfigTxt = metaConfig["content"]
csrf = json.loads(metaConfigTxt)["pageToken"]


jsonUrl = "https://www.getyourguide.de/s/results.json?&q=London&customerSearch=1&page=0"
headers.update({'X-Csrf-Token': csrf})
response = session.get(jsonUrl, headers=headers)
js_dict = (json.loads(response.content.decode('utf-8')))
print(js_dict.keys())

for item in js_dict:
    headers = js_dict['searchResults']["tours"]
    prices = js_dict['searchResults']["tours"]

    for title, price in zip(headers, prices):
        title_final = title.get("title")
        price_final = price.get("price")["original"]

        print("Header: " + title_final + " | " + "Price: " + price_final)
That gives me more than four results.


In general, you will find that many sites which return JSON paginate their replies, delivering a fixed number of results per page. In those cases, every page except the last usually contains a key whose value gives you the URL of the next page. It is then a simple matter to loop over the pages, breaking out of the loop when you detect the absence of that key.
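That pattern can be sketched as follows. The key name `nextPageUrl` is purely hypothetical — inspect one page of the real JSON response to find what the site actually calls it:

```python
def crawl_all_pages(fetch, first_url, next_key="nextPageUrl"):
    """Follow a paginated JSON API until the next-page key disappears.

    `fetch` maps a URL to a parsed JSON dict (for real use, something like
    lambda url: session.get(url, headers=headers).json()); `next_key` is an
    assumed name -- check the actual response for the real one.
    """
    results = []
    url = first_url
    while url:
        page = fetch(url)
        results.extend(page.get('searchResults', {}).get('tours', []))
        url = page.get(next_key)   # missing on the last page -> loop ends
    return results
```

The same function works for any number of pages, because termination is driven by the response itself rather than by a hard-coded page count.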

If you want the proper way of crawling web pages, look up XPaths. That will drastically reduce your code, possibly to as little as 5 lines for what you did above. It is the standard way of doing anything related to crawling and scraping.

Thanks for the information, I will give it a try. However, could you give me some feedback on how to solve the problem above with the approach I am already using? Thanks a lot, I will take your feedback into account.

In this case I can only scrape one page (page=0), but I want to scrape the first 10 pages. I have already posted my approach in my first post. Hope you can guide me to the right solution. Thanks for your patience :)

Glad to help. I think any further progress will depend on the specifics of the website, and would therefore probably be beyond the scope of Stack Overflow.
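For reference, this is roughly what XPath-style extraction looks like. The sketch below uses the limited XPath subset available in the standard library's xml.etree.ElementTree, and the markup and class names are invented for illustration — a real page would call for lxml (which supports full XPath on messy HTML) and the site's actual structure:

```python
import xml.etree.ElementTree as ET

# Invented markup standing in for one results page.
doc = ET.fromstring("""
<ul>
  <li class="tour"><span class="title">London Travelcard</span></li>
  <li class="tour"><span class="title">Thames Dinner Cruise</span></li>
</ul>
""")

# One path expression replaces the nested find/loop code above.
titles = [span.text
          for span in doc.findall('.//li[@class="tour"]/span[@class="title"]')]
```

A single path expression selects every matching node at once, which is why the comment above claims XPath can shrink the scraping code so much.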