Python web scraping and missing data
I'm trying to scrape some data from Yelp pages. However, when I get the results, some values are missing, and which data is missing changes on every run (for example: on the first run 2 records are missing, on the second run 1 record is missing). Do you know why this happens? Thanks.
import time
import requests as r
import pandas as pd
from bs4 import BeautifulSoup

# data_rev (a DataFrame with a 'url' column) and pages (the start offsets)
# are defined elsewhere in my script.
review_listings = []
cols2 = ['restaurant name', 'username', 'ratings', 'review.text']
copy = 0
for url in data_rev['url']:  # Each url has 20 reviews per page, so step through start offsets
    start = time.time()
    for p in pages:
        url_review = url + "&start={}".format(str(p))
        page = r.get(url_review)
        soup = BeautifulSoup(page.content, 'html.parser')
        res_name = soup.find("h1", {"class": "lemon--h1__373c0__2ZHSL heading--h1__373c0___56D3 undefined heading--inline__373c0__1jeAh"}).text
        tables = soup.findAll('li', {'class': 'lemon--li__373c0__1r9wz margin-b3__373c0__q1DuY padding-b3__373c0__342DA border--bottom__373c0__3qNtD border-color--default__373c0__3-ifU'})
        if len(tables) == 0:
            print(url_review)
            break
        else:
            for table in tables:
                # name, ratings, username:
                username = table.find("span", {"class": "lemon--span__373c0__3997G text__373c0__2Kxyz fs-block text-color--blue-dark__373c0__1jX7S text-align--left__373c0__2XGa- text-weight--bold__373c0__1elNz"}).a.text
                ratings = table.find("span", {"class": "lemon--span__373c0__3997G display--inline__373c0__3JqBP border-color--default__373c0__3-ifU"}).div.get("aria-label")
                text = table.find("span", {"class": "lemon--span__373c0__3997G raw__373c0__3rKqk"}).text
                review_listings.append([res_name, username, ratings, text])
    rev_df = pd.DataFrame.from_records(review_listings, columns=cols2)
    size_df = len(rev_df)
    print("review sizes are =>", size_df - copy)
    print(res_name)
    copy = size_df
    end = time.time()
    print(end - start)
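One likely cause of the intermittently missing values: when Yelp serves a variant of the markup, `find()` returns `None`, and the chained `.text` then raises (or a row is silently dropped). A minimal defensive sketch, using a hypothetical `safe_text` helper and static HTML in place of a live page, that surfaces a missing field instead of losing it:

```python
from bs4 import BeautifulSoup

def safe_text(node, tag, class_name):
    """Return the tag's text, or None when the element is absent."""
    found = node.find(tag, {"class": class_name})
    return found.text if found is not None else None

# Self-test with static HTML: one element present, one missing.
soup = BeautifulSoup('<div><span class="user">Jackie L.</span></div>', 'html.parser')
print(safe_text(soup, 'span', 'user'))    # Jackie L.
print(safe_text(soup, 'span', 'rating'))  # None
```

Inside the table loop, a `None` result could then be logged together with `url_review` to see exactly which pages fail to parse.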
All the data you're interested in seems to be stored as JSON in the page source. That is probably a more reliable way to get information from this page:
import re
import json
import requests
from bs4 import BeautifulSoup

## Using headers is always a good practice
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}
response = requests.get('https://www.yelp.com/biz/saku-vancouver-3', headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
# Let's find the 'script' tag which contains the restaurant's information
data_tag = soup.find('script', text=re.compile('"@type":'))
# Load it properly as json
data = json.loads(data_tag.text)
print(data)
Output
{'@context': 'https://schema.org',
 '@type': 'Restaurant',
 'name': 'Saku',
 'image': 'https://s3-media0.fl.yelpcdn.com/bphoto/_TjVeAVRczn0yITxvBqrCA/l.jpg',
 'priceRange': 'CA$11-30',
 'telephone': '',
 'address': {'streetAddress': '548 W Broadway',
             'addressLocality': 'Vancouver',
             'addressCountry': 'CA',
             'addressRegion': 'BC',
             'postalCode': 'V5Z 1E9'},
 'review': [{'author': 'Jackie L.',
             'datePublished': '1970-01-19',
             'reviewRating': {'ratingValue': 5},
             'description': 'With restaurants .... }
            ...]
}
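The reviews in that parsed dict can then be flattened into rows. This sketch uses a trimmed copy of the printed output in place of a live response, so the only assumptions are the schema.org keys shown above:

```python
# Stand-in for the dict returned by json.loads(data_tag.text) above,
# trimmed to the fields needed for the rows.
data = {
    'name': 'Saku',
    'review': [
        {'author': 'Jackie L.',
         'reviewRating': {'ratingValue': 5},
         'description': 'With restaurants ...'},
    ],
}

# One (restaurant, author, rating, text) tuple per review.
rows = [
    (data['name'], rev['author'], rev['reviewRating']['ratingValue'], rev['description'])
    for rev in data['review']
]
print(rows[0])  # ('Saku', 'Jackie L.', 5, 'With restaurants ...')
```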
Try the approach below to fetch the restaurant name, all reviewers' names, their reviews, and their ratings across multiple pages. Assuming, of course, that you haven't already been blocked by the site.
import requests

url = 'https://www.yelp.com/biz/XAH2HpuUUtu7CUO26pbs4w/review_feed?'
params = {
    'rl': 'en',
    'sort_by': 'relevance_desc',
    'q': '',
    'start': ''
}

page = 0
while True:
    params['start'] = page
    res = requests.get(url, params=params)
    if not res.json()['reviews']: break
    for item in res.json()['reviews']:
        restaurant = item['business']['name']
        rating = item['rating']
        user = item['user']['markupDisplayName']
        review = item['comment']['text']
        print(restaurant, rating, user, review)
    page += 20
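If the goal is the same DataFrame as in the question, the printed fields can be accumulated into rows instead. A sketch with a single placeholder row standing in for the loop output:

```python
import pandas as pd

cols2 = ['restaurant name', 'username', 'ratings', 'review.text']
review_listings = []

# Inside the while loop above, replace print(...) with:
#     review_listings.append([restaurant, user, rating, review])
# The row below is a placeholder so the sketch runs without network access.
review_listings.append(['Saku', 'Jackie L.', 5, 'With restaurants ...'])

rev_df = pd.DataFrame.from_records(review_listings, columns=cols2)
print(len(rev_df))  # 1
```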
Can you share some example urls? @user14245642
Use Selenium instead of BeautifulSoup.
@ZarakiKenpachi I thought about that too, but I have to collect thousands of records, so Selenium would take too long. The classes you are using are most likely dynamic.
Thanks for the reply, but it gives me an error: JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Did you test it with the url you gave me? If so, your IP may be temporarily blacklisted. Have you tried @user14245642's script?
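On the JSONDecodeError mentioned above: it usually means the server answered with an HTML block page rather than JSON, so `res.json()` has nothing to parse. A minimal guard, with `json.loads` on a string standing in for `res.json()` on a live response:

```python
import json

def parse_reviews(body):
    """Return the 'reviews' list, or None when the body is not valid JSON."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return None  # e.g. an HTML block/captcha page came back
    return payload.get('reviews')

print(parse_reviews('{"reviews": []}'))       # []
print(parse_reviews('<html>blocked</html>'))  # None
```

When this returns `None`, backing off and retrying (or rotating the User-Agent) is more useful than crashing mid-crawl.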