Python: getting the list of "things to do" from TripAdvisor

Tags: python, web-scraping, tripadvisor

How do I get the list of "things to do"? I'm new to web scraping and I can't work out how to loop through every page to collect the hrefs of all the "things to do". Can you tell me where I'm going wrong? Any help would be greatly appreciated. Thanks in advance.

import requests
import re
from bs4 import BeautifulSoup
from urllib.request import urlopen



offset = 0
url = 'https://www.tripadvisor.com/Attractions-g255057-Activities-oa' + str(offset) + '-Canberra_Australian_Capital_Territory-Hotels.html#ATTRACTION_LIST_CONTENTS'
urls = []
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")


for link in soup.find_all('a', {'last'}):
    page_number = link.get('data-page-number')
    last_offset = int(page_number) * 30
    print('last offset:', last_offset)


for offset in range(0, last_offset, 30):
    print('--- page offset:', offset, '---')
    url = 'https://www.tripadvisor.com/Attractions-g255057-oa' + str(offset) + '-Canberra_Australian_Capital_Territory-Hotels.html#ATTRACTION_LIST_CONTENTS'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")

    for link in soup.find_all('a', {'property_title'}):
        iurl='https://www.tripadvisor.com/Attraction_Review-g255057' + link.get('href')
        print(iurl)
Basically, I want the href of each "thing to do". My expected output for the "things to do" is:

   https://www.tripadvisor.com/Attraction_Review-g255057-d3377852-Reviews-Weston_Park-Canberra_Australian_Capital_Territory.html
   https://www.tripadvisor.com/Attraction_Review-g255057-d591972-Reviews-Canberra_Museum_and_Gallery-Canberra_Australian_Capital_Territory.html
   https://www.tripadvisor.com/Attraction_Review-g255057-d312426-Reviews-Lanyon_Homestead-Canberra_Australian_Capital_Territory.html
   https://www.tripadvisor.com/Attraction_Review-g255057-d296666-Reviews-Australian_National_University-Canberra_Australian_Capital_Territory.html
Just like in the example below, where I use this code to get the href of every restaurant in Canberra. My restaurant code is:

import requests
import re
from bs4 import BeautifulSoup
from urllib.request import urlopen



with requests.Session() as session:
    for offset in range(0, 1050, 30):
        url = 'https://www.tripadvisor.com/Restaurants-g255057-oa{0}-Canberra_Australian_Capital_Territory.html#EATERY_LIST_CONTENTS'.format(offset)

        soup = BeautifulSoup(session.get(url).content, "html.parser")
        for link in soup.select('a.property_title'):
            iurl = 'https://www.tripadvisor.com/' + link.get('href')
            print(iurl)        
The output of the restaurant code is:

   https://www.tripadvisor.com/Restaurant_Review-g255057-d1054676-Reviews-Lanterne_Rooms-Canberra_Australian_Capital_Territory.html
   https://www.tripadvisor.com/Restaurant_Review-g255057-d755055-Reviews-Courgette_Restaurant-Canberra_Australian_Capital_Territory.html
   https://www.tripadvisor.com/Restaurant_Review-g255057-d6893178-Reviews-Pomegranate-Canberra_Australian_Capital_Territory.html
   https://www.tripadvisor.com/Restaurant_Review-g255057-d7262443-Reviews-Les_Bistronomes-Canberra_Australian_Capital_Territory.html
    .
    .
    .
    .

OK, this isn't hard, you just need to know which tags to use.
Let me explain with this example:

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.tripadvisor.com/'  ## we need this to join the links later ##
main_page = 'https://www.tripadvisor.com/Attractions-g255057-Activities-oa{}-Canberra_Australian_Capital_Territory-Hotels.html#ATTRACTION_LIST_CONTENTS'
links = []

## get the initial page to find the number of pages ##
r = requests.get(main_page.format(0))  
soup = BeautifulSoup(r.text, "html.parser")
## select the last page from the list of pages ('a', {'class':'pageNum taLnk'}) ##
last_page = max([ int(page.get('data-offset')) for page in soup.find_all('a', {'class':'pageNum taLnk'}) ])

## now iterate over that range (first page, last page, number of links), and extract the links from each page ##
for i in range(0, last_page + 30, 30):
    page = main_page.format(i)
    soup = BeautifulSoup(requests.get(page).text, "html.parser") ## get the next page and parse it with BeautifulSoup ##  
    ## get the hrefs from ('div', {'class':'listing_title'}), and join them with base_url to make the links ##
    links += [ base_url + link.find('a').get('href') for link in soup.find_all('div', {'class':'listing_title'}) ]

for link in links:
    print(link)
We have a total of 8 pages and 212 links (30 per page, 2 on the last page).
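
As a side note, the same loop can also be written in the style of your restaurant code, with a requests.Session and CSS selectors. This is just a sketch under the same assumptions about the page markup (the listing_title and pageNum taLnk classes), not a different method:

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.tripadvisor.com/'
main_page = 'https://www.tripadvisor.com/Attractions-g255057-Activities-oa{}-Canberra_Australian_Capital_Territory-Hotels.html#ATTRACTION_LIST_CONTENTS'
links = []

with requests.Session() as session:
    ## read the pagination links on the first page to find the last offset ##
    soup = BeautifulSoup(session.get(main_page.format(0)).content, "html.parser")
    last_page = max(int(a.get('data-offset')) for a in soup.select('a.pageNum.taLnk'))

    ## walk the pages and collect the attraction hrefs from the listing titles ##
    for offset in range(0, last_page + 30, 30):
        soup = BeautifulSoup(session.get(main_page.format(offset)).content, "html.parser")
        links += [base_url + a.get('href') for a in soup.select('div.listing_title > a')]

for link in links:
    print(link)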

I hope this makes things a bit clearer.

Try passing a dict to soup.find_all, e.g. soup.find_all('a', {'k': 'v'}).
You mean I have to use soup.find_all('a', {'class': 'listing_element'}), like this?
Yes, something like that.
It still doesn't work :(
Can you update your code and give an example of the output you get and the output you expect?
You're right, I forgot to fetch the next pages. I've updated the code, try again.
You're very welcome, I hope I helped you understand the process.
Yes, it helped a lot :) ty
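
To make the comment about find_all a bit more concrete, here is a minimal, self-contained example of passing a dict of attributes; the HTML snippet and hrefs are made up for illustration only:

from bs4 import BeautifulSoup

## a tiny made-up snippet, just to show the two matching styles ##
html = """
<div class="listing_title"><a href="/Attraction_Review-1">Place one</a></div>
<div class="listing_title"><a href="/Attraction_Review-2">Place two</a></div>
<a class="pageNum taLnk" data-offset="30" href="#">2</a>
"""

soup = BeautifulSoup(html, "html.parser")

## a dict matches attributes explicitly, here the class attribute ##
titles = soup.find_all('div', {'class': 'listing_title'})
print([div.find('a').get('href') for div in titles])   ## ['/Attraction_Review-1', '/Attraction_Review-2']

## the same dict style works for any attribute, e.g. data-offset ##
pages = soup.find_all('a', {'data-offset': '30'})
print([a.get('data-offset') for a in pages])            ## ['30']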