Python: Scraping hotels from a TripAdvisor search. How do I get the hotels from all pages (e.g. pages 1 to 10) and store them?

Tags: python, csv, web-scraping, beautifulsoup, python-3.5

My code only shows the first page of hotels. Why doesn't it show more?

import csv
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

hotels = []
i = 0

url0 = 'https://www.tripadvisor.com/Hotels-g295424-Dubai_Emirate_of_Dubai-Hotels.html#EATERY_LIST_CONTENTS'

r = requests.get(url0)
data = r.text
soup = BeautifulSoup(r.text, "html.parser")

with open('hotels_Data.csv', 'wb') as file:
    for link in soup.findAll('a', {'property_title'}):
        print('https://www.tripadvisor.com/Hotels-g295424-' + link.get('href'))
        print(link.string)

    for i in range(20):
        while int(i) <= 20:
            i = str(i)
            url1 = 'https://www.tripadvisor.com/Hotels-g295424-oa' + i + '-Dubai_Emirate_of_Dubai-Hotels.html#EATERY_LIST_CONTENTS'
            r1 = requests.get(url1)
            data1 = r1.text
            soup1 = BeautifulSoup(data1, "html.parser")
            for link in soup1.findAll('a', {'property_title', 'price'}):
                print('https://www.tripadvisor.com/Hotels-g294212-' + link.get('href'))
                print(link.string)
                for link in soup.select("a.reference.internal"):
                    url1 = link["href"]
                    absolute_url = urljoin(base_url, url1)
                    print(url1, absolute_url)
                writer = csv.writer(file)
                for row in hotels:
                    writer.writerow([s.encode("utf-8") for s in row])
            break
Check the link to the next page at the bottom of the page - this portal doesn't use page numbers 1, 2, 3, etc., but offsets 0, 30, 60, 90, etc. (because it shows 30 offers per page).

So you have to use the values 0, 30, 60, 90, etc. in the url:

"...-oa" + offset + "-Dubai_Emirate..."

You can use e.g. range(0, 250, 30) to get the values 0, 30, 60, 90, ...

import requests
from bs4 import BeautifulSoup

for offset in range(0, 250, 30):
    print('--- page offset:', offset, '---')

    url = 'https://www.tripadvisor.com/Hotels-g295424-oa' + str(offset) + '-Dubai_Emirate_of_Dubai-Hotels.html#EATERY_LIST_CONTENTS'

    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")

    for link in soup.find_all('a', {'property_title'}):
        print(link.text)
But there may be more than 250 offers, so you have to get the link to the last page to find the correct value instead of 250:

import requests
from bs4 import BeautifulSoup

offset = 0
url = 'https://www.tripadvisor.com/Hotels-g295424-oa' + str(offset) + '-Dubai_Emirate_of_Dubai-Hotels.html#EATERY_LIST_CONTENTS'

r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

for link in soup.find_all('a', {'last'}):
    page_number = link.get('data-page-number')
    last_offset = int(page_number) * 30
    print('last offset:', last_offset)
and then use last_offset + 1 in range(0, last_offset + 1, 30).

EDIT: restaurants load their data using JavaScript and AJAX

import requests
from bs4 import BeautifulSoup

size = 30

# direct url - doesn't have expected information
#url = 'https://www.tripadvisor.com/Restaurants-g187791-Rome_Lazio.html'

# url used by AJAX
url = 'https://www.tripadvisor.com/RestaurantSearch?Action=PAGE&geo=187791&ajax=1&itags=10591&sortOrder=relevance&o=a' + str(size) + '&availSearchEnabled=true&eaterydate=2017_04_27&date=2017-04-28&time=20%3A00%3A00&people=2'

r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

link = soup.find_all('a')[-1]
page_number = link.get('data-page-number')
last_offset = int(page_number) * size # *30
print('last offset:', last_offset)

offset = link.get('data-offset')
offset = int(offset) + size # +30
print('offset:', offset)
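To page through the AJAX results you would plug successive offsets into the `o=a...` parameter. A minimal sketch of a URL builder, assuming the query-string parameters captured above stay valid (TripAdvisor may change them at any time, and the helper name ajax_url is made up here):

```python
# Sketch: build the restaurant AJAX search URL for a given offset.
# Parameter names (Action, geo, ajax, o=a<offset>, ...) are copied
# from the captured AJAX URL above; they are site internals and may change.

BASE = 'https://www.tripadvisor.com/RestaurantSearch'

def ajax_url(offset, geo=187791):
    return (BASE + '?Action=PAGE&geo=' + str(geo) +
            '&ajax=1&itags=10591&sortOrder=relevance' +
            '&o=a' + str(offset))

# one request per page of 30 results
for offset in range(0, 90, 30):
    print(ajax_url(offset))
```

Each of those URLs can then be requested and parsed the same way as the direct pages.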

Comments:

- When you asked TripAdvisor whether you're allowed to do this, didn't they give you access through an API?
- No, they only provide the API to people with a business (for official use)... I'm a student and I just need some data for my project.
- You could try using something like Selenium to find the "next page" button on the page. It takes slightly longer than BS because it actually opens a browser window to interact with, but it can solve the problem quickly.
- The portal uses the values 30, 60, 90, 120, etc. instead of 1, 2, 3 as the next page number - because there are 30 offers on a page.
- @furas can you tell me how to do it? I need your help. Can you tell me how to get the restaurant offsets from TripAdvisor? I used your method above for hotels and it worked fine, but it doesn't work for restaurants. Please help me.
- @Hifzaahmad what does "doesn't work" mean? I didn't check the page, but restaurants may use different tags or different pagination, or even JavaScript. That wouldn't be surprising.