Python Scrapy: extract from links
python, scrapy, scrapy-spider

I am trying to extract information from certain links, but the spider never visits them. It only extracts information from the start_url, and I don't know why.

This is my code:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from tutorial.items import DmozItem
from scrapy.selector import HtmlXPathSelector

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python"
    ]
    rules = [Rule(SgmlLinkExtractor(allow=[r'Books']), callback='parse')]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = DmozItem()
        # Extract links
        item['link'] = hxs.select("//li/a/text()").extract()  # Xpath selector for tag(s)
        print item['title']
        for cont, i in enumerate(item['link']):
            print "link: ", cont, i
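
Instead of getting the links from "", I am getting them from "". Why?

For the rules to work, you need to use a CrawlSpider, not the generic scrapy.Spider.

You also need to rename your parsing callback to something other than parse. Otherwise you override an important method of CrawlSpider and the rules will not work. See the warning in the documentation.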
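
To illustrate why the callback name matters, here is a toy model in plain Python. It is not Scrapy's code: ToyCrawlSpider, GoodSpider, BrokenSpider and the (substring, callback name) rule pairs are all hypothetical stand-ins. It only assumes what the explanation above says, namely that the base class's parse method is what applies the rules.

```python
# Toy model only: these classes imitate the idea and are not Scrapy's code.

class ToyCrawlSpider:
    """Base class whose parse() is the piece that applies the rules."""

    # Hypothetical stand-in for Rule objects: (substring, callback name) pairs.
    rules = []

    def parse(self, page):
        # Hand every link that matches a rule to that rule's callback.
        results = []
        for link in page["links"]:
            for pattern, callback_name in self.rules:
                if pattern in link:
                    results.append(getattr(self, callback_name)(link))
        return results


class GoodSpider(ToyCrawlSpider):
    rules = [("Books", "parse_item")]

    def parse_item(self, link):
        return "item from " + link


class BrokenSpider(ToyCrawlSpider):
    rules = [("Books", "parse")]

    def parse(self, page_or_link):
        # This override replaces the rule-applying logic of the base class,
        # so no link is ever followed.
        return "item from start page"


start_page = {"links": ["/Python/Books", "/Python/FAQs"]}
good = GoodSpider().parse(start_page)
broken = BrokenSpider().parse(start_page)
print(good)
print(broken)
```

In this sketch GoodSpider reaches the matching link through its rule, while BrokenSpider only ever processes the start page because its own parse has replaced the rule handling.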

Your code was scraping links from "" because the plain Spider ignores the rules attribute.

This code should work:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from dmoz.items import DmozItem
from scrapy.selector import HtmlXPathSelector

class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python"
    ]
    rules = [Rule(SgmlLinkExtractor(allow=[r'Books']), callback='parse_item')]

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = DmozItem()
        # Extract links
        item['link'] = hxs.select("//li/a/text()").extract()  # Xpath selector for tag(s)
        print item['link']
        for cont, i in enumerate(item['link']):
            print "link: ", cont, i