
Python Scrapy: spider returns nothing

Tags: python, python-2.7, web-scraping, web-crawler, scrapy-spider

This is my first time creating a spider, and despite my efforts it still returns nothing to the CSV export. My code is:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

class Emag(CrawlSpider):
    name = "emag"
    allowed_domains = ["emag.ro"]
    start_urls = [
        "http://www.emag.ro/"]

    rules = (Rule(SgmlLinkExtractor(allow=(r'www.emag.ro')), callback="parse", follow= True))

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//a/@href').extract()
        for site in sites:
            site = str(site)

        for clean_site in site:
            name = clean_site.xpath('//[@id=""]/span').extract()
            return name
The problem is that if I print sites, it gives me a list of URLs, which is fine. If I search for the name on a single URL in the scrapy shell, it finds it. The problem appears when I try to get all the names from all the links; I run it with scrapy crawl emag > emag.csv.


Can you give me a hint?

There are multiple issues in this spider:

  • rules should be an iterable; the comma before the last bracket is missing
  • no Items are specified - you need to define an Item class and return/yield instances of it from the spider's parse() callback
Here is the fixed version of the spider:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Field, Item


class MyItem(Item):
    name = Field()


class Emag(CrawlSpider):
    name = "emag"
    allowed_domains = ["emag.ro"]
    start_urls = [
        "http://www.emag.ro/"]

    # The trailing comma makes rules a one-element tuple, i.e. an iterable
    rules = (Rule(SgmlLinkExtractor(allow=(r'www.emag.ro')), callback="parse", follow=True), )

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//a/@href')
        for site in sites:
            item = MyItem()
            # XPath placeholder kept from the question (it lacks a node test,
            # e.g. //*[@id="..."]); the real id still needs to be filled in
            item['name'] = site.xpath('//[@id=""]/span').extract()
            yield item
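
One more caveat worth flagging: the Scrapy docs warn against using parse as the callback in CrawlSpider rules, because CrawlSpider uses the parse method internally to implement its crawling logic, and overriding it can leave the spider yielding nothing. A minimal sketch with the callback renamed (parse_page is an arbitrary name chosen here, and it simply stores each link href as a stand-in for the real selector):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Field, Item


class MyItem(Item):
    name = Field()


class Emag(CrawlSpider):
    name = "emag"
    allowed_domains = ["emag.ro"]
    start_urls = ["http://www.emag.ro/"]

    # A callback name other than "parse" leaves CrawlSpider's internal
    # parse() method intact
    rules = (Rule(SgmlLinkExtractor(allow=(r'www.emag.ro')),
                  callback="parse_page", follow=True), )

    def parse_page(self, response):
        sel = Selector(response)
        for site in sel.xpath('//a/@href'):
            item = MyItem()
            item['name'] = site.extract()
            yield item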

One problem may be that you have been banned by the site via robots.txt. You can check for this in the log traces. If that is the case, go to your settings.py and set ROBOTSTXT_OBEY = False.
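For example, in settings.py:

# settings.py
ROBOTSTXT_OBEY = False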
That solved my problem.

Spent a lot of time on it, but it still returns nothing; is the XPath selection also wrong? @user3753592 try running the spider like this:

scrapy crawl emag -o output.csv -t csv

Thanks, that is how I ran it initially; either way the files are still empty. @user3753592 what do you actually want to extract from the site's pages? I don't understand the intent of the XPath you provided; I thought you had omitted the id intentionally. No, I want to extract the product names and prices.
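
For product names and prices, a minimal sketch of what the extraction could look like; the ProductItem class and the XPath expressions below are hypothetical placeholders rather than emag.ro's real markup, and would need to be adapted after inspecting the actual pages:

from scrapy.selector import Selector
from scrapy.item import Field, Item


class ProductItem(Item):
    name = Field()
    price = Field()


# Callback sketch; this goes inside the spider class
def parse_product(self, response):
    sel = Selector(response)
    # Hypothetical selectors -- inspect the real page structure first
    for product in sel.xpath('//div[contains(@class, "product")]'):
        item = ProductItem()
        item['name'] = product.xpath('.//h2/a/text()').extract()
        item['price'] = product.xpath('.//span[contains(@class, "price")]/text()').extract()
        yield item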