
Python: scraping reviews from TripAdvisor

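I am new to web scraping in Python 3. I want to scrape the reviews of all hotels in Dubai, but I can only scrape reviews for the one hotel whose URL I put in the script. Can anyone tell me how to get the reviews of all hotels without hard-coding the URL of each hotel?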

import requests
from bs4 import BeautifulSoup


importurl = 'https://www.tripadvisor.com/Hotel_Review-g295424-d302778-Reviews-Roda_Al_Bustan_Dubai_Airport-Dubai_Emirate_of_Dubai.html'
r = requests.get(importurl)
soup = BeautifulSoup(r.content, "lxml")
resultsoup = soup.find_all("p", {"class": "partial_entry"})
# print each review
for review in resultsoup:
    review_list = review.get_text()
    print(review_list)
# save the reviews to a test text file locally
with open('testreview.txt', 'w') as fid:
    for review in resultsoup:
        review_list = review.get_text()
        fid.write(review_list)

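You should find the index pages that list all the hotels, collect every hotel link into a list, and then loop over that list of URLs to fetch the reviews.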

import bs4
import requests

# One index page per 30 hotels: offsets 0, 30, ..., 510 (18 pages)
index_pages = ('http://www.tripadvisor.cn/Hotels-g295424-oa{}-Dubai_Emirate_of_Dubai-Hotels.html#ACCOM_OVERVIEW'.format(i) for i in range(0, 540, 30))
urls = []
with requests.Session() as s:
    for index in index_pages:
        r = s.get(index)
        soup = bs4.BeautifulSoup(r.text, 'lxml')
        # each hotel title links to that hotel's review page
        url_list = [i.get('href') for i in soup.select('.property_title')]
        # extend (not append) keeps urls a flat list of links
        urls.extend(url_list)
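
The `href` values collected from an index page are usually site-relative paths (an assumption here; inspect the real markup to confirm), so they have to be joined with the site root before they can be requested. Below is a minimal sketch of the offset arithmetic and the join, using only the standard library. `BASE`, `INDEX_TEMPLATE`, `index_page_urls`, and `absolutize` are hypothetical names, the sample path is made up, and the `#ACCOM_OVERVIEW` fragment is dropped because fragments are not sent to the server.

```python
from urllib.parse import urljoin

BASE = 'http://www.tripadvisor.cn'  # site root taken from the index URL
INDEX_TEMPLATE = BASE + '/Hotels-g295424-oa{}-Dubai_Emirate_of_Dubai-Hotels.html'


def index_page_urls(total=540, per_page=30):
    """Return one index URL per page; the oa{} slot holds the hotel offset."""
    return [INDEX_TEMPLATE.format(offset) for offset in range(0, total, per_page)]


def absolutize(hrefs, base=BASE):
    """Join relative hrefs with the site root, skipping missing (None) values."""
    return [urljoin(base, h) for h in hrefs if h]


pages = index_page_urls()
print(len(pages))  # 18 pages: offsets 0, 30, ..., 510
print(pages[1])
print(absolutize(['/Hotel_Review-example.html', None]))
```

The defaults `total=540` and `per_page=30` mirror `range(0, 540, 30)`; the real hotel count changes over time, so treat them as placeholders.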
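This is not the complete list of hotels, only the hotels on the first page; there are 18 more pages.

@Andersson This is just an example: if you can get one page, you can get all 18 with a loop.

But the URL has no page counter. It is always
http://www.tripadvisor.cn/Hotels-g295424-Dubai_Emirate_of_Dubai-Hotels.html
whether you are on page 1 or page 19...

@Andersson Yes, I noticed. The page uses JavaScript to fetch its data, which is hard to handle with requests.

@Andersson Done!

Output:

len(urls): 540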
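
Once the hotel URLs are collected, the remaining step is the loop that pulls the reviews out of each page. The sketch below reuses the `partial_entry` class from the question; whether that class still matches the live markup is an assumption. `extract_reviews`, `scrape_reviews`, and the `fetch` parameter are hypothetical names, and the demonstration parses a made-up HTML string rather than a live page.

```python
from bs4 import BeautifulSoup


def extract_reviews(html):
    """Return the text of every <p class="partial_entry"> element in a page."""
    soup = BeautifulSoup(html, 'html.parser')
    return [p.get_text(strip=True) for p in soup.find_all('p', class_='partial_entry')]


def scrape_reviews(urls, fetch):
    """Map each hotel URL to its list of reviews.

    `fetch` is any callable that takes a URL and returns the page HTML.
    """
    return {url: extract_reviews(fetch(url)) for url in urls}


# Offline demonstration with a made-up page instead of a live request.
sample = '<div><p class="partial_entry"> Great stay </p><p>ignored</p></div>'
demo = scrape_reviews(['hotel-1'], lambda url: sample)
print(demo)  # {'hotel-1': ['Great stay']}
```

With a live session, `fetch` could be something like `lambda url: s.get(url).text` inside a `with requests.Session() as s:` block. This only reads the reviews present in the first HTML response for each hotel; reviews loaded later by JavaScript or on further review pages would need extra requests.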