Python Amazon Best Sellers scraping
I have developed a script to scrape URLs, titles and other information from the Amazon Best Sellers categories. The script below works fine, but it is very slow: Amazon has many subcategories, so traversing all of them takes a lot of time. Is there anything I can do to make it faster? I am using Python 2.7, 64-bit. Thanks.
import requests
import json
import threading
from bs4 import BeautifulSoup
import re

def GetSoupResponseFromURL(url):
    response = requests.get(url, timeout=180)
    soup = BeautifulSoup(response.content, 'html.parser')
    return soup

def GetSubCategories(categoryURL):
    subCategory = []
    soup = GetSoupResponseFromURL(categoryURL)
    try:
        ul = soup.find('span', {'class':'zg_selected'}).parent.parent.find('ul')
        if ul is not None:
            subCategories = ul.find_all('a')
            for category in subCategories:
                catTitle = category.text
                url = category.get('href')

                lists = soup.find('ul', {'id':'zg_browseRoot'}).find_all('ul')
                del lists[-1]

                global titleList
                titleList = []
                for ulist in lists:
                    text = re.sub(r'[^\x00-\x7F]+', '', ulist.find('li').text)
                    titleList.append(text.strip(' \t\n\r'))
                fullTitle = (' > '.join(map(str, titleList)) + ' > ' + catTitle)

                soup = GetSoupResponseFromURL(url)
                title = soup.find('span', {'class':'category'})
                if title is not None:
                    title = title.text
                else:
                    title = soup.find('div', {'id':'zg_rssLinks'}).find_all('a')[-1].text
                    title = title[title.index('>') + 2:]

                print('Complete Title: ' + fullTitle)
                print('Title: ' + title)
                print('URL: ' + url)
                print('-----------------------------------')

                data = {}
                data['completeTitle'] = fullTitle
                data['title'] = title
                data['url'] = url
                data['subCategory'] = GetSubCategories(url)
                subCategory.append(data)
    except Exception, e:
        pass
    return subCategory

class myThread(threading.Thread):
    def __init__(self, threadID, url):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.url = url

    def run(self):
        print "Starting Thread " + str(self.threadID)
        array = GetSubCategories(self.url)
        with open('Category ' + str(self.threadID) + '.json', 'w') as outfile:
            json.dump(array, outfile)
        print "Exiting Thread " + str(self.threadID)

mainURL = 'https://www.amazon.fr/gp/bestsellers/ref=zg_bs_unv_petsupplies_0_2036875031_3'
soup = GetSoupResponseFromURL(mainURL)
mainCategories = soup.find('ul', {'id':'zg_browseRoot'}).find_all('a')
print mainCategories

counter = 1
for category in mainCategories[1:2]:
    thread = myThread(counter, category.get('href'))
    thread.start()
    counter += 1
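One common way to speed a script like this up is to fan the category URLs out over a small worker pool instead of creating one thread per top-level category (and, in the real script, to reuse a single requests.Session so TCP connections are pooled). Below is a minimal sketch using multiprocessing.dummy.Pool, which is thread-based and available in both Python 2.7 and 3; fetch_category is a hypothetical stand-in for the real GetSoupResponseFromURL plus parsing, stubbed out so the sketch runs without network access:

```python
from multiprocessing.dummy import Pool  # thread-based Pool (same API as multiprocessing.Pool)

def fetch_category(url):
    # Hypothetical stand-in for GetSoupResponseFromURL + parsing; the real
    # version would issue the HTTP request with a shared requests.Session.
    return {'url': url, 'title': 'parsed-' + url.rsplit('/', 1)[-1]}

category_urls = [
    'https://www.amazon.fr/gp/bestsellers/cat-a',
    'https://www.amazon.fr/gp/bestsellers/cat-b',
    'https://www.amazon.fr/gp/bestsellers/cat-c',
]

pool = Pool(4)  # 4 worker threads; tune to how much load you want to generate
results = pool.map(fetch_category, category_urls)  # results keep input order
pool.close()
pool.join()

for r in results:
    print(r['title'])
```

Because most of the time here is spent waiting on network I/O, threads help even under the GIL; pool.map also collects results in order, which is easier to merge into the JSON output than per-thread files.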
- Use the Amazon API instead of web scraping. It's also worth mentioning that the faster you scrape, the higher your chances of getting an IP ban from Amazon, since they don't want you scraping their pages.
- Well, actually I'm not scraping products; I'm looking for the category list and its subcategory data.
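On the IP-ban point above: if you do parallelize the requests, throttling them helps. A minimal sketch of a rate-limiting wrapper that enforces a minimum delay between consecutive requests; the names polite_get and MIN_DELAY are hypothetical, and the fetch argument is a stub standing in for requests.get so the sketch runs offline:

```python
import time

MIN_DELAY = 1.0        # hypothetical minimum gap between requests, in seconds
_last_request = [0.0]  # mutable cell so the wrapper can record the last call time

def polite_get(url, fetch=lambda u: 'response-for-' + u):
    # Sleep just long enough that consecutive calls are >= MIN_DELAY apart.
    wait = MIN_DELAY - (time.time() - _last_request[0])
    if wait > 0:
        time.sleep(wait)
    _last_request[0] = time.time()
    return fetch(url)

start = time.time()
a = polite_get('http://example.com/1')  # first call goes through immediately
b = polite_get('http://example.com/2')  # second call is delayed by ~MIN_DELAY
elapsed = time.time() - start
```

In the real script you would pass requests.get (or a session's get) as fetch; combined with a small worker pool, this keeps the overall request rate bounded regardless of how many categories are being traversed.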