Web scraping: scraping multiple web pages with Python

python, web-scraping, beautifulsoup

I want to extract the ranking, the review text and the review date from the Trustpilot page below, but I don't know how to scrape multiple pages and build a pandas DataFrame from the scraped results.

from bs4 import BeautifulSoup
import requests

url = 'https://uk.trustpilot.com/review/thread.com'
for pg in range(1, 10):
    # fetch each page before parsing it
    page = requests.get(url + '?page=' + str(pg))
    soup = BeautifulSoup(page.content, 'lxml')
    for paragraph in soup.find_all('p'):
        print(paragraph.text)

The site is dynamic. While you can use BeautifulSoup to find certain elements of the reviews, you need to use selenium to access the dynamically generated content:
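from bs4 import BeautifulSoup as soup
from selenium import webdriver
import re, time

d = webdriver.Chrome('/Users/jamespetullo/Downloads/chromedriver')
d.get('https://uk.trustpilot.com/review/thread.com')

def scrape_review(_d: soup) -> dict:
    return {'date': _d.find('time').text,
            'ranking': re.findall('(?…
[the rest of this snippet is truncated in the source]

Hi, you need to send a request to each page and then process the response. Also, since some items are not directly available as text in the tags, you can get them from the embedded JavaScript (I use json.loads to get the date) or from the class name (that is how I get the rating).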
Output
   | Title | Content | Date | Rating
0  | I ordered a jacket 2 weeks ago | I ordered a jacket 2 weeks ago. Still hasn't ... | 2019-01-13 | 1
1  | I've used this service for many years… | I've used this service for many years and get ... | 2018-12-31 | 4
2  | Great website | Great website, tailored recommendations, and e... | 2018-12-19 | 5
3  | I was excited by the prospect offered… | I was excited by the prospect offered by threa... | 2018-12-18 | 1
4  | Thread set the benchmark for customer service | Firstly, their customer service is second to n... | 2018-12-12 | 5
5  | It's a good idea | It's a good idea. I am in between sizes and d... | 2018-12-02 | 3
6  | Great experience so far | Great experience so far. Big choice of clothes... | 2018-10-31 | 5
7  | Absolutely love using Thread.com | Absolutely love using Thread.com. As a man wh... | 2018-10-31 | 5
8  | I'd like to give Thread a one star… | I'd like to give Thread a one star review, but... | 2018-10-30 | 2
9  | Really enjoying the shopping experience… | Really enjoying the shopping experience on thi... | 2018-10-22 | 5
10 | The only way I buy clothes | I absolutely love Thread. I've been surviving ... | 2018-10-15 | 5
11 | Excellent Service | Excellent ServiceQuick delivery, nice items th... | 2018-07-27 | 5
12 | Convenient way to order clothes online | Convenient way to order clothes online, and gr... | 2018-07-05 | 5
13 | Superb - would thoroughly recommend | Recommendations have been brilliant - no more ... | 2018-06-24 | 5
14 | First time ordering from Thread | First time ordering from Thread - Very slow de... | 2018-06-22 | 1
15 | Some of these criticisms are just madness | I absolutely love thread.com, and I can't reco... | 2018-05-28 | 5
16 | Top service! | Great idea and fantastic service. I just recei... | 2018-05-17 | 5
17 | Great service | Great service. Great clothes which come well p... | 2018-05-05 | 5
18 | Thumbs up | Easy, straightforward and very good costumer s... | 2018-04-17 | 5
19 | Good idea, ruined by slow delivery | I really love the concept and the ordering pro... | 2018-04-08 | 3
20 | I love Thread | I have been using thread for over a year. It i... | 2018-03-12 | 5
21 | Clever simple idea but.. low quality clothing | Clever simple idea but.. low quality clothingL... | 2018-03-12 | 2
22 | Initially I was impressed.... | Initially I was impressed with the Thread shop... | 2018-02-07 | 2
23 | Happy new customer | Joined the site a few weeks ago, took a short ... | 2018-02-06 | 5
24 | Style tips for mature men | I'm a man of mature age, let's say a "baby boo... | 2018-01-31 | 5
25 | Every shop, every item and in one place | Simple, intuitive and makes online shopping a ... | 2018-01-28 | 5
26 | Fantastic experience all round | Fantastic experience all round. Quick to regi... | 2018-01-28 | 5
27 | Superb "all in one" shopping experience … | Superb "all in one" shopping experience that i... | 2018-01-25 | 5
28 | Great for time poor people who aren’t fond of ... | Rally love this company. Super useful for thos... | 2018-01-22 | 5
29 | Really is worth trying! | Quite cautious at first, however, love the way... | 2018-01-10 | 4
30 | 14 days for returns is very poor given … | 14 days for returns is very poor given most co... | 2017-12-20 | 3
31 | A great intro to online clothes … | A great intro to online clothes shopping. Usef... | 2017-12-15 | 5
32 | I was skeptical at first | I was skeptical at first, but the service is s... | 2017-11-16 | 5
33 | seems good to me as i hate to shop in … | seems good to me as i hate to shop in stores, ... | 2017-10-23 | 5
34 | Great concept and service | Great concept and service. This service has be... | 2017-10-17 | 5
35 | Slow dispatch | My Order Dispatch was extremely slow compared ... | 2017-10-07 | 1
36 | This company sends me clothes in boxes | This company sends me clothes in boxes! I find... | 2017-08-28 | 5
37 | I've been using Thread for the past six … | I've been using Thread for the past six months... | 2017-08-03 | 5
38 | Thread | Thread, this site right here is literally the ... | 2017-06-22 | 5
39 | good concept | The website is a good concept in helping buyer... | 2017-06-14 | 3
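Note:
Although I was able to "hack" the results out of this site, it is better to use selenium to scrape dynamic pages.

Edit: code that finds the number of pages automatically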
from bs4 import BeautifulSoup
import json
import math
import pandas as pd
import requests

final_list = []  # rows that will become the DataFrame
url = 'https://uk.trustpilot.com/review/thread.com'

# make a request to get the number of reviews
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
review_count_h2 = soup.find('h2', class_="header--inline").text
review_count = int(review_count_h2.strip().split(' ')[0].strip())

# there are 20 reviews per page, so the number of pages is
pages = int(math.ceil(review_count / 20))

# loop over pages 1..pages
for pg in range(1, pages + 1):
    page_url = url + '?page=' + str(pg)
    r = requests.get(page_url)
    soup = BeautifulSoup(r.text, 'lxml')
    for paragraph in soup.find_all('section', class_='review__content'):
        try:
            title = paragraph.find('h2', class_='review-content__title').text.strip()
            content = paragraph.find('p', class_='review-content__text').text.strip()
            # the dates are stored as JSON text inside this div
            datedata = json.loads(paragraph.find('div', class_='review-content-header__dates').text)
            date = datedata['publishedDate'].split('T')[0]
            # the rating is the last part of the second class name
            rating_class = paragraph.find('div', class_='star-rating')['class']
            rating = rating_class[1].split('-')[-1]
            final_list.append([title, content, date, rating])
        except AttributeError:
            # skip reviews that are missing one of the expected elements
            pass

df = pd.DataFrame(final_list, columns=['Title', 'Content', 'Date', 'Rating'])
print(df)
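
The two extraction tricks used in the code above, reading the date out of embedded JSON and reading the rating out of a CSS class name, can be tried without any network access. What follows is only a minimal sketch, not the answerer's exact code: the helper names `rating_from_classes` and `date_from_json` and the sample inputs are made up for illustration, and the `star-rating-N` class pattern is assumed from the snippet above.

```python
import json


def rating_from_classes(classes):
    """Return the star rating encoded in a list of CSS class names.

    BeautifulSoup returns tag['class'] as a list of strings, so this takes
    such a list, e.g. ['star-rating', 'star-rating-4'] (assumed pattern).
    Returns None when no class of the form 'star-rating-<digits>' is found.
    """
    for name in classes:
        prefix, _, tail = name.rpartition('-')
        if prefix == 'star-rating' and tail.isdigit():
            return int(tail)
    return None


def date_from_json(text):
    """Return the YYYY-MM-DD part of the 'publishedDate' field in a JSON string."""
    data = json.loads(text)
    return data['publishedDate'].split('T')[0]


# Hypothetical inputs shaped like the ones the scraper above relies on.
sample_classes = ['star-rating', 'star-rating-4', 'star-rating--medium']
sample_json = '{"publishedDate": "2019-01-13T09:30:00Z", "updatedDate": null}'

print(rating_from_classes(sample_classes))
print(date_from_json(sample_json))
```

Matching the class by its pattern rather than by position (`rating_class[1]`) should keep working if the order of the class names ever changes.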
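
You can extract the information from the script tag that contains JSON. This also lets you calculate the number of pages, because the JSON includes a total review count and you can count the number of reviews on each page.
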
import requests
from bs4 import BeautifulSoup as bs
import json
import math
import pandas as pd

def getInfo(url):
    res = requests.get(url)
    soup = bs(res.content, 'lxml')
    data = json.loads(soup.select_one('[type="application/ld+json"]').text.strip()[:-1])[0]
    return data

def addItems(data):
    result = []
    for item in data['review']:
        review = {
            'Headline': item['headline'],
            'Ranking': item['reviewRating']['ratingValue'],
            'Review': item['reviewBody'],
            'ReviewDate': item['datePublished']
        }
        result.append(review)
    return result

url = 'https://uk.trustpilot.com/review/thread.com?page={}'
results = []
data = getInfo(url.format(1))
results.append(addItems(data))
totalReviews = int(data['aggregateRating']['reviewCount'])
reviewsPerPage = len(data['review'])
totalPages = math.ceil(totalReviews / reviewsPerPage)

if totalPages > 1:
    for page in range(2, totalPages + 1):
        data = getInfo(url.format(page))
        results.append(addItems(data))

final = [item for result in results for item in result]
df = pd.DataFrame(final)
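
The page-count arithmetic that both answers rely on (total reviews divided by reviews per page, rounded up) can be checked in isolation. This is a small sketch under stated assumptions: the helper names `page_count` and `page_urls` are hypothetical, and the figures of 45 reviews at 20 per page are invented for the example rather than taken from the site.

```python
import math


def page_count(total_reviews, reviews_per_page):
    """Number of pages needed to show every review (ceiling division)."""
    if reviews_per_page <= 0:
        raise ValueError('reviews_per_page must be positive')
    return math.ceil(total_reviews / reviews_per_page)


def page_urls(template, total_reviews, reviews_per_page):
    """Build one URL per page from a template with '{}' for the page number."""
    pages = page_count(total_reviews, reviews_per_page)
    return [template.format(page) for page in range(1, pages + 1)]


template = 'https://uk.trustpilot.com/review/thread.com?page={}'
# 45 reviews at 20 per page is a made-up example; it needs 3 pages.
for url in page_urls(template, 45, 20):
    print(url)
```

Rounding up matters because a final, partially filled page still has to be requested; plain integer division would silently drop it.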
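
Comments:

What ranking are you referring to?

Hello Bitto, the ranking is the stars.

Hi Ajax, your solution only looks at the first page. Any clue how to do this automatically for all pages and add the review title to the data frame? Thanks.

@ZakkYang Please see my recent edit. I use a while loop to keep looking up the next page, because each pagination bar can show at most six pages of results.

That is a perfect solution, Bitto. Do you know how to do this automatically for all pages without entering the range()?

@ZakkYang See my edit. I think I have found a solution.

How should I put it? You did a brilliant job! May I ask how you came up with soup.find('h2', class_="header--inline").text?

@ZakkYang You can look at the page source. It is the class on the heading that holds the review count text.

Hi Bitto, do you have any idea why this URL does not work today, while it still works for other companies' reviews on trustpilot.com?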
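
The while-loop approach mentioned in the comments (keep following the "next page" link until there is none) is not shown in the thread, so the following is only a rough sketch of the general idea, not the commenter's code. The `crawl` function, the `fetch` callback and the fake three-page site are all hypothetical stand-ins for real Selenium or HTTP calls.

```python
def crawl(fetch, start_url):
    """Collect items from every page, following 'next' links until none is left.

    `fetch` is a caller-supplied function (hypothetical here) that takes a URL
    and returns a tuple (items, next_url), with next_url set to None on the
    last page.
    """
    items, url, seen = [], start_url, set()
    while url is not None and url not in seen:
        seen.add(url)  # guard against a pagination loop
        page_items, url = fetch(url)
        items.extend(page_items)
    return items


# A fake three-page site standing in for real requests.
FAKE_SITE = {
    'page1': (['review A', 'review B'], 'page2'),
    'page2': (['review C'], 'page3'),
    'page3': (['review D'], None),
}

print(crawl(FAKE_SITE.get, 'page1'))
```

Unlike the `range()` versions, this form does not need the page count in advance, which is useful when the pagination bar only exposes a few page links at a time.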