Python: Missing data when using Scrapy


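I am using Scrapy to fetch data.

I created some items to hold the information, but I do not get all of the data every time I run the script. I usually get some empty items, so I have to run the script again until I get all of them.

Here is the spider code: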

import scrapy
from tutorial.items import Product



class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["bbb.org/"]
    start_urls = [
        "http://www.bbb.org/greater-san-francisco/business-reviews/architects/klopf-architecture-in-san-francisco-ca-152805"
        #"http://www.bbb.org/greater-san-francisco/business-reviews/architects/a-d-architects-in-oakland-ca-133229"
        #"http://www.bbb.org/greater-san-francisco/business-reviews/architects/aecom-in-concord-ca-541360"
    ]


    def parse(self, response):
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        # Build the item; each field holds the list of strings returned by extract(),
        # which is empty when the XPath matches nothing.
        producto = Product(
            Name=response.xpath('//*[@id="business-detail"]/div/h1/text()').extract(),
            Telephone=response.xpath('//*[@id="business-detail"]/div/p/span[1]/text()').extract(),
            Address=response.xpath('//*[@id="business-detail"]/div/p/span[2]/span[1]/text()').extract(),
            Description=response.xpath('//*[@id="business-description"]/p[2]/text()').extract(),
            BBBAccreditation=response.xpath('//*[@id="business-accreditation-content"]/p[1]/text()').extract(),
            Complaints=response.xpath('//*[@id="complaint-sort-container"]/text()').extract(),
            Reviews=response.xpath('//*[@id="complaint-sort-container"]/p/text()').extract(),
            WebPage=response.xpath('//*[@id="business-detail"]/div/p/span[3]/a/text()').extract(),
            Rating=response.xpath('//*[@id="accedited-rating"]/img/text()').extract(),
            ServiceArea=response.xpath('//*[@id="business-additional-info-text"]/span[4]/p/text()').extract(),
            ReasonForRating=response.xpath('//*[@id="reason-rating-content"]/ul/li[1]/text()').extract(),
            NumberofEmployees=response.xpath('//*[@id="business-additional-info-text"]/p[8]/text()').extract(),
            LicenceNumber=response.xpath('//*[@id="business-additional-info-text"]/p[6]/text()').extract(),
            # Contact, BBBFileOpened and BusinessStarted use the same XPath in the
            # original post, so they will hold identical values.
            Contact=response.xpath('//*[@id="business-additional-info-text"]/span[3]/span/span[1]/text()').extract(),
            BBBFileOpened=response.xpath('//*[@id="business-additional-info-text"]/span[3]/span/span[1]/text()').extract(),
            BusinessStarted=response.xpath('//*[@id="business-additional-info-text"]/span[3]/span/span[1]/text()').extract(),
        )
        return producto

This page requires a user agent to be set, so I have a file of user agents. Could some of them be wrong?

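Yes, some of your user agents may be wrong (possibly old or deprecated ones), and the site may be rejecting them. If using a single user agent is not a problem, you can set it in settings.py:

USER_AGENT="someuseragent"

Remember to remove or disable the user-agent randomization middleware in settings.py.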
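
As a rough sketch of that change, settings.py might look like the following. The middleware path `tutorial.middlewares.RandomUserAgentMiddleware` is a hypothetical name, since the post does not show how the randomization was wired in; substitute whatever path the project actually registered. Mapping a middleware to `None` in `DOWNLOADER_MIDDLEWARES` is Scrapy's way of disabling it.

```python
# settings.py (sketch, not the asker's actual file)

# Placeholder value taken from the answer; replace it with one real,
# current browser user-agent string that the site accepts.
USER_AGENT = "someuseragent"

DOWNLOADER_MIDDLEWARES = {
    # Hypothetical path: the project's own randomizing middleware.
    # Setting the value to None disables the middleware.
    "tutorial.middlewares.RandomUserAgentMiddleware": None,
}

# If the random user-agent setup had disabled Scrapy's built-in
# scrapy.downloadermiddlewares.useragent.UserAgentMiddleware, remove that
# entry so the USER_AGENT setting above is applied to requests again.
```

With a single fixed user agent, every run sends the same header, which makes it easier to tell whether the remaining empty items come from the XPaths rather than from the request headers.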

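
To check whether particular user agents are behind the empty items, one option is to record the user agent each page was fetched with (inside a Scrapy callback it is available as `response.request.headers.get('User-Agent')`) alongside the scraped item, then count blank results per user agent. The helpers below are a minimal sketch of that bookkeeping in plain Python; the names `is_blank`, `empty_fields` and `blank_counts` are illustrative and not part of Scrapy, and the sample records are made up.

```python
from collections import Counter


def is_blank(value):
    """True for None, '', or a list/tuple holding only blank entries."""
    if value is None:
        return True
    if isinstance(value, str):
        return not value.strip()
    if isinstance(value, (list, tuple)):
        # extract() returns a list; an empty list counts as blank.
        return all(is_blank(v) for v in value)
    return False


def empty_fields(item):
    """Return the sorted names of the fields in `item` that are blank."""
    return sorted(name for name, value in dict(item).items() if is_blank(value))


def blank_counts(records):
    """Count, per user agent, how many items had at least one blank field.

    `records` is an iterable of (user_agent, item) pairs.
    """
    counts = Counter()
    for user_agent, item in records:
        if empty_fields(item):
            counts[user_agent] += 1
    return dict(counts)


if __name__ == "__main__":
    # Made-up sample data, only to show the shape of the input.
    sample = [
        ("ua-a", {"Name": ["Example Co"], "Telephone": ["000"]}),
        ("ua-b", {"Name": [], "Telephone": ["  "]}),
    ]
    print(blank_counts(sample))
```

User agents that show up repeatedly in the result are the ones worth removing from the list.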