Scrapy HtmlXPathSelector


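I'm just trying out Scrapy and trying to get a basic spider working. I know this is probably something I've missed, but I've tried everything I can think of.

The error I get is: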

line 11, in JustASpider
    sites = hxs.select('//title/text()')
NameError: name 'hxs' is not defined

My code is very basic at the moment, but I still can't find where it goes wrong. Thanks for your help.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class JustASpider(BaseSpider):
    name = "google.com"
    start_urls = ["http://www.google.com/search?hl=en&q=search"]


    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//title/text()')
        for site in sites:
            print site.extract()


SPIDER = JustASpider()

Make sure you are actually running the code you showed us.

Try deleting the *.pyc files in your project.
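The cleanup suggested above could be scripted roughly like this. It is only a sketch in Python 3 (not the Python 2 used in the question), and remove_pyc_files is a made-up helper name, not anything Scrapy provides:

```python
from pathlib import Path


def remove_pyc_files(project_dir):
    """Delete every *.pyc file under project_dir and return the removed paths.

    remove_pyc_files is a hypothetical helper, not part of Scrapy.
    """
    removed = []
    # sorted() materializes the matches before anything is deleted.
    for pyc in sorted(Path(project_dir).rglob("*.pyc")):
        pyc.unlink()
        removed.append(pyc)
    return removed
```

Calling remove_pyc_files(".") from the project root should leave the .py sources untouched and force Python to recompile them on the next run.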

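I removed the SPIDER call at the end and removed the for loop. There is only one title tag (as one would expect), and the loop seems to have been what was throwing it off. My working code is as follows: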

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class JustASpider(BaseSpider):
    name = "google.com"
    start_urls = ["http://www.google.com/search?hl=en&q=search"]


    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//title/text()')
        final = titles.extract()

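I had a similar problem, NameError: name 'hxs' is not defined, and it was caused by spaces and tabs: my IDE was inserting spaces instead of tabs, so you should check that.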
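One way to check for this is to scan the file for tabs in the leading whitespace. The sketch below is Python 3, and find_indentation_problems is a hypothetical helper written for this answer, not a Scrapy or standard-library function:

```python
def find_indentation_problems(source):
    """Return (line_number, kind) pairs for lines indented with tabs.

    kind is "mixed" when the leading whitespace of a line contains both
    tabs and spaces, and "tab" when it contains only tabs.
    Hypothetical helper, for illustration only.
    """
    problems = []
    for number, line in enumerate(source.splitlines(), start=1):
        # Leading whitespace is whatever lstrip() would remove.
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        if "\t" in indent and " " in indent:
            problems.append((number, "mixed"))
        elif "\t" in indent:
            problems.append((number, "tab"))
    return problems
```

If the rest of the file is indented with spaces, any line this reports is a candidate for the bug. On Python 2, running the script with python -tt should also turn inconsistent tab usage into an error.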

This worked for me: save the file as test.py and run it with the scrapy runspider command, for example:

scrapy runspider test.py

The code looks correct.

In recent versions of Scrapy, HtmlXPathSelector is deprecated. Use Selector instead:

hxs = Selector(response)
sites = hxs.xpath('//title/text()')

This is just a demo, but it works. It will of course need to be customized.

#!/usr/bin/env python

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        for site in sites:
            title = site.select('a/text()').extract()
            link = site.select('a/@href').extract()
            desc = site.select('text()').extract()
            print title, link, desc

You should change

from scrapy.selector import HtmlXPathSelector

to

from scrapy.selector import Selector

and use this instead:

hxs = Selector(response)

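The code looks old. I suggest using this instead:

from scrapy.spider import Spider
from scrapy.selector import Selector

class JustASpider(Spider):
    name = "googlespider"
    allowed_domains = ["google.com"]
    start_urls = ["http://www.google.com/search?hl=en&q=search"]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//title/text()').extract()
        print sites
        # for site in sites: (I don't know why you would loop to extract the text of the title element)
        #     print site.extract()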

I use Scrapy together with BeautifulSoup 4. For me, soup is easy to read and understand. It is an option if you don't have to use HtmlXPathSelector. Hope this helps.

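import scrapy
from bs4 import BeautifulSoup


class LinkItem(scrapy.Item):
    # LinkItem is an illustrative name; a bare Item() has no declared
    # fields, so item['url'] needs a Field to be defined like this.
    url = scrapy.Field()


def parse(self, response):
    # Spider callback: parse the page with BeautifulSoup instead of XPath
    soup = BeautifulSoup(response.body, 'html.parser')
    print 'Current url: %s' % response.url
    for link in soup.find_all('a'):
        if link.get('href') is not None:
            # Resolve relative links against the current page URL
            url = response.urljoin(link.get('href'))
            item = LinkItem()  # new item per link, so earlier ones are not overwritten
            item['url'] = url
            yield scrapy.Request(url, callback=self.parse)
            yield item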

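How are you running your spider? With scrapy crawl "google.com"? There is nothing wrong with your code (apart from no longer needing to declare SPIDER); it works for me.

@Leo That is how I run it.

What output do you get from "scrapy version -v" on the command line?

@stav Scrapy: 0.14.4 Twisted: 12.1.0 Python: 2.7.2 (v2.7.2:8527427914a2, Jun 11 2011, 15:22:34) - [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] Platform: Darwin-10.8.0-i386-64bit

After deleting all the pyc files in the folder, I still get the same error. If I were missing a dependency, would I get an import error?

Please check the indentation in your code. Maybe you are mixing tabs with spaces?

Your code works, but it is better to give the spider a simple name such as "google" or "googleSpider" rather than "google.com".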
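A note on why indentation is a plausible suspect: the traceback says "line 11, in JustASpider" rather than "in parse", which means that line was executed as part of the class body, not inside the method. The sketch below reproduces that situation on Python 3 with an explicitly dedented line. The string assigned to hxs is just a stand-in for the selector, and run_and_capture is a helper invented for this demonstration. Python 3 itself usually rejects inconsistent tab and space mixing with a TabError, so the dedent is written out by hand here.

```python
import textwrap

# The last statement is dedented to class level, as mixed tabs and spaces
# can cause in Python 2, so it runs while the class body is being executed,
# where the method-local name hxs does not exist.
BROKEN_SOURCE = textwrap.dedent('''
    class JustASpider(object):
        def parse(self, response):
            hxs = "selector"
        sites = hxs
''')


def run_and_capture(source):
    """Execute source and return the NameError message, or None if it ran."""
    try:
        exec(source, {})
    except NameError as error:
        return str(error)
    return None
```

Running run_and_capture(BROKEN_SOURCE) should report the same kind of NameError as in the question, without ever calling parse.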
