Python Amazon Best Sellers scraping
I have developed a script to scrape URLs, titles and other information from the Amazon Best Sellers categories. The script below works fine, but it is very slow: Amazon has many subcategories, so traversing all of them takes a lot of time. Is there anything I can do to make it faster? I am using Python 2.7, 64-bit. Thanks.
import requests
import json
import threading
from bs4 import BeautifulSoup
import re

def GetSoupResponseFromURL(url):
    response = requests.get(url, timeout=180)
    soup = BeautifulSoup(response.content, 'html.parser')
    return soup

def GetSubCategories(categoryURL):
    subCategory = []
    soup = GetSoupResponseFromURL(categoryURL)
    try:
        ul = soup.find('span', {'class':'zg_selected'}).parent.parent.find('ul')
        if ul is not None:
            subCategories = ul.find_all('a')
            for category in subCategories:
                catTitle = category.text
                url = category.get('href')

                lists = soup.find('ul', {'id':'zg_browseRoot'}).find_all('ul')
                del lists[-1]

                global titleList
                titleList = []
                for ulist in lists:
                    text = re.sub(r'[^\x00-\x7F]+', '', ulist.find('li').text)
                    titleList.append(text.strip(' \t\n\r'))
                fullTitle = (' > '.join(map(str, titleList)) + ' > ' + catTitle)

                soup = GetSoupResponseFromURL(url)
                title = soup.find('span', {'class':'category'})
                if title is not None:
                    title = title.text
                else:
                    title = soup.find('div', {'id':'zg_rssLinks'}).find_all('a')[-1].text
                    title = title[title.index('>') + 2:]

                print('Complete Title: ' + fullTitle)
                print('Title: ' + title)
                print('URL: ' + url)
                print('-----------------------------------')

                data = {}
                data['completeTitle'] = fullTitle
                data['title'] = title
                data['url'] = url
                data['subCategory'] = GetSubCategories(url)
                subCategory.append(data)
    except Exception, e:
        pass
    return subCategory

class myThread(threading.Thread):
    def __init__(self, threadID, url):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.url = url

    def run(self):
        print "Starting Thread " + str(self.threadID)
        array = GetSubCategories(self.url)
        with open('Category ' + str(self.threadID) + '.json', 'w') as outfile:
            json.dump(array, outfile)
        print "Exiting Thread " + str(self.threadID)

mainURL = 'https://www.amazon.fr/gp/bestsellers/ref=zg_bs_unv_petsupplies_0_2036875031_3'
soup = GetSoupResponseFromURL(mainURL)
mainCategories = soup.find('ul', {'id':'zg_browseRoot'}).find_all('a')
print mainCategories

counter = 1
for category in mainCategories[1:2]:
    thread = myThread(counter, category.get('href'))
    thread.start()
    counter += 1
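One common way to speed a script like this up is to fan the category URLs out over a small worker pool instead of creating one thread per top-level category (and, in the real script, to reuse a single requests.Session so TCP connections are pooled). Below is a minimal sketch using multiprocessing.dummy.Pool, which is thread-based and available in both Python 2.7 and 3; fetch_category is a hypothetical stand-in for the real GetSoupResponseFromURL plus parsing, stubbed out so the sketch runs without network access:

```python
from multiprocessing.dummy import Pool  # thread-based Pool (same API as multiprocessing.Pool)

def fetch_category(url):
    # Hypothetical stand-in for GetSoupResponseFromURL + parsing; the real
    # version would issue the HTTP request with a shared requests.Session.
    return {'url': url, 'title': 'parsed-' + url.rsplit('/', 1)[-1]}

category_urls = [
    'https://www.amazon.fr/gp/bestsellers/cat-a',
    'https://www.amazon.fr/gp/bestsellers/cat-b',
    'https://www.amazon.fr/gp/bestsellers/cat-c',
]

pool = Pool(4)  # 4 worker threads; tune to how much load you want to generate
results = pool.map(fetch_category, category_urls)  # results keep input order
pool.close()
pool.join()

for r in results:
    print(r['title'])
```

Because most of the time here is spent waiting on network I/O, threads help even under the GIL; pool.map also collects results in order, which is easier to merge into the JSON output than per-thread files.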
- Use the Amazon API instead of web scraping. It's also worth mentioning that the faster you scrape, the higher your chances of getting an IP ban from Amazon, since they don't want you scraping their pages.
- Well, actually I'm not scraping products; I'm looking for the category list and its subcategory data.
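On the IP-ban point above: if you do parallelize the requests, throttling them helps. A minimal sketch of a rate-limiting wrapper that enforces a minimum delay between consecutive requests; the names polite_get and MIN_DELAY are hypothetical, and the fetch argument is a stub standing in for requests.get so the sketch runs offline:

```python
import time

MIN_DELAY = 1.0        # hypothetical minimum gap between requests, in seconds
_last_request = [0.0]  # mutable cell so the wrapper can record the last call time

def polite_get(url, fetch=lambda u: 'response-for-' + u):
    # Sleep just long enough that consecutive calls are >= MIN_DELAY apart.
    wait = MIN_DELAY - (time.time() - _last_request[0])
    if wait > 0:
        time.sleep(wait)
    _last_request[0] = time.time()
    return fetch(url)

start = time.time()
a = polite_get('http://example.com/1')  # first call goes through immediately
b = polite_get('http://example.com/2')  # second call is delayed by ~MIN_DELAY
elapsed = time.time() - start
```

In the real script you would pass requests.get (or a session's get) as fetch; combined with a small worker pool, this keeps the overall request rate bounded regardless of how many categories are being traversed.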