
Python: How to scrape an XML URL using Scrapy


Hi, I am using Scrapy to scrape an XML URL.

Suppose the following is my spider.py code:

from scrapy.spider import BaseSpider

class TestSpider(BaseSpider):
    name = "test"
    # allowed_domains must be a list, not a set
    allowed_domains = ["www.example.com"]

    start_urls = [
        "https://example.com/jobxml.asp"
    ]

    def parse(self, response):
        print response, "??????????????????????"
Result:

2012-07-24 16:43:34+0530 [scrapy] INFO: Scrapy 0.14.3 started (bot: testproject)
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Enabled item pipelines: 
2012-07-24 16:43:34+0530 [test] INFO: Spider opened
2012-07-24 16:43:34+0530 [test] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-07-24 16:43:36+0530 [testproject] DEBUG: Retrying <GET https://example.com/jobxml.asp> (failed 1 times): 400 Bad Request
2012-07-24 16:43:37+0530 [test] DEBUG: Retrying <GET https://example.com/jobxml.asp> (failed 2 times): 400 Bad Request
2012-07-24 16:43:38+0530 [test] DEBUG: Gave up retrying <GET https://example.com/jobxml.asp> (failed 3 times): 400 Bad Request
2012-07-24 16:43:38+0530 [test] DEBUG: Crawled (400) <GET https://example.com/jobxml.asp> (referer: None)
2012-07-24 16:43:38+0530 [test] INFO: Closing spider (finished)
2012-07-24 16:43:38+0530 [test] INFO: Dumping spider stats:
    {'downloader/request_bytes': 651,
     'downloader/request_count': 3,
     'downloader/request_method_count/GET': 3,
     'downloader/response_bytes': 504,
     'downloader/response_count': 3,
     'downloader/response_status_count/400': 3,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2012, 7, 24, 11, 13, 38, 573931),
     'scheduler/memory_enqueued': 3,
     'start_time': datetime.datetime(2012, 7, 24, 11, 13, 34, 803202)}
2012-07-24 16:43:38+0530 [test] INFO: Spider closed (finished)
2012-07-24 16:43:38+0530 [scrapy] INFO: Dumping global stats:
    {'memusage/max': 263143424, 'memusage/startup': 263143424}
Does Scrapy work for XML scraping? If so, can anyone please provide an example of how to scrape XML tag data?


Thanks in advance….

There is a spider dedicated to scraping XML feeds. This is from the Scrapy documentation:

XMLFeedSpider example

These spiders are pretty easy to use; let's look at an example:

from scrapy import log
from scrapy.contrib.spiders import XMLFeedSpider
from myproject.items import TestItem

class MySpider(XMLFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.xml']
    iterator = 'iternodes'  # This is actually unnecessary, since it's the default value
    itertag = 'item'        # parse_node is called once for each <item> node

    def parse_node(self, response, node):
        log.msg('Hi, this is a <%s> node!: %s' % (self.itertag, ''.join(node.extract())))

        item = TestItem()  # was Item(), which is never imported here
        item['id'] = node.select('@id').extract()
        item['name'] = node.select('name').extract()
        item['description'] = node.select('description').extract()
        return item
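The example imports TestItem from myproject.items. The snippet below is a minimal sketch of what that item class might look like; it is not from the original post, and the field names are simply the ones assigned in parse_node:

from scrapy.item import Item, Field

# Hypothetical myproject/items.py for the example above; the actual
# fields depend on what your feed contains.
class TestItem(Item):
    id = Field()
    name = Field()
    description = Field()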
Then you parse the XML file you downloaded and saved, using iterparse, something like this:

from xml.etree.ElementTree import iterparse

for event, elem in iterparse('xml/test.xml'):
    if elem.tag == "properties":
        print elem.text
This is just an example of how to traverse the XML tree.


Also, this is some old code of mine, so you would be better off opening the file with a with statement.
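A minimal sketch of that suggestion, assuming the same hypothetical xml/test.xml path; iterparse() accepts an open file object as well as a filename:

from xml.etree.ElementTree import iterparse

# Open the file with a with statement so it is closed even if parsing fails.
with open('xml/test.xml') as f:
    for event, elem in iterparse(f):
        if elem.tag == "properties":
            print elem.text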

Where does the log output come from? Where is the HTTP request performed?

@Tichodroma: I have edited the actual result into the question above, please take a look.

As for the answer: the only thing I did was inherit from XMLFeedSpider as you mentioned. I have run the code, but I still get the same retrying problem…. Could it be a problem with the URL? (It is very long; if we save it to the local desktop, the total size is actually about 7.6 MB.)

That should not be a problem; XML feeds are usually a few MB in size. But I cannot tell you for certain, since I have never used this spider. For feeds I actually download the XML feed with plain urllib2 and then parse it with iterparse. If you like, I can send you an example of that spider.
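For reference, a minimal sketch of the urllib2-plus-iterparse approach mentioned in the last comment; the feed URL, the local path, and the properties tag are placeholders, not details from the original post:

import urllib2
from xml.etree.ElementTree import iterparse

# Download the feed with plain urllib2 and save it locally
# (placeholder URL and path).
feed = urllib2.urlopen('http://www.example.com/feed.xml')
with open('xml/test.xml', 'w') as f:
    f.write(feed.read())

# Stream-parse the saved file instead of loading it all into a DOM.
for event, elem in iterparse('xml/test.xml'):
    if elem.tag == "properties":
        print elem.text
        elem.clear()  # free memory used by processed nodes on large feeds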