Android Scrapy only finds files at the start of the crawl

I am using a command-line application from GitHub (android-apps-crawler), running on a Linux VirtualBox, to fetch all of the APK files hosted by the third-party app store www.mumayi.com so that I can analyse them. I know the site hosts a large number of APKs.

However, when I run the program it works very well at first, finding files quickly (within 35-50 seconds on average) and inserting them into a database for later download, but after it has been running for 1-2 minutes it stops finding anything, no matter how long I leave it going, even though I know there are many more APK files there.

Can anyone explain why this happens? Is it because the site doesn't like a program browsing its files?

I have included a sample command-line log below. Note that I only let this run for around 10 minutes after it stopped finding files, but I have left other runs going for 24 hours with the same result.

matt@matt-VirtualBox:~/Downloads/android-apps-crawler-master/crawler$ ./crawl.sh mumayi.com
/home/matt/Downloads/android-apps-crawler-master/crawler/android_apps_crawler/spiders/android_apps_spider.py:3: ScrapyDeprecationWarning: Module `scrapy.spider` is deprecated, use `scrapy.spiders` instead
from scrapy.spider import Spider
/home/matt/Downloads/android-apps-crawler-master/crawler/android_apps_crawler/spiders/android_apps_spider.py:7: ScrapyDeprecationWarning: Module `scrapy.log` has been deprecated, Scrapy now relies on the builtin Python library for logging. Read the updated logging entry in the documentation to learn more.
from scrapy import log
2015-08-31 09:38:28 [scrapy] INFO: Scrapy 1.0.3 started (bot: android_apps_crawler)
2015-08-31 09:38:28 [scrapy] INFO: Optional features available: ssl, http11
2015-08-31 09:38:28 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'android_apps_crawler.spiders', 'SPIDER_MODULES': ['android_apps_crawler.spiders'], 'LOG_LEVEL': 'INFO', 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11(KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11', 'BOT_NAME': 'android_apps_crawler'}
2015-08-31 09:38:28 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-08-31 09:38:28 [scrapy] INFO: Enabled downloader middlewares: DownloaderMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-08-31 09:38:28 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
../repo/databases/mumayi.com.db
2015-08-31 09:38:28 [scrapy] INFO: Enabled item pipelines: AppPipeline, SQLitePipeline
2015-08-31 09:38:28 [scrapy] INFO: Spider opened
2015-08-31 09:38:28 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-08-31 09:38:41 [py.warnings] WARNING: /home/matt/Downloads/android-apps-crawler-master/crawler/android_apps_crawler/spiders/android_apps_spider.py:85: ScrapyDeprecationWarning: log.msg has been deprecated, create a python logger and log through it instead
log.msg("Catch an application: %s" % url, level=log.INFO)

2015-08-31 09:38:41 [scrapy] INFO: Catch an application: http://down.mumayi.com/54049
2015-08-31 09:38:41 [py.warnings] WARNING: /home/matt/Downloads/android-apps-crawler-master/crawler/android_apps_crawler/pipelines.py:12: ScrapyDeprecationWarning: log.msg has been deprecated, create a python logger and log through it instead
log.msg("Catch an AppItem", level=log.INFO)

2015-08-31 09:38:41 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:41 [py.warnings] WARNING: /home/matt/Downloads/android-apps-crawler-master/crawler/android_apps_crawler/pipelines.py:33: ScrapyDeprecationWarning: log.msg has been deprecated, create a python logger and log through it instead
log.msg("Inserting into database");

2015-08-31 09:38:41 [scrapy] INFO: Inserting into database
2015-08-31 09:38:44 [scrapy] INFO: Catch an application: http://down.mumayi.com/989871
2015-08-31 09:38:45 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:45 [scrapy] INFO: Inserting into database
2015-08-31 09:38:45 [scrapy] INFO: Catch an application: http://down.mumayi.com/1003630
2015-08-31 09:38:45 [scrapy] INFO: Catch an application: http://down.mumayi.com/217624
2015-08-31 09:38:45 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:45 [scrapy] INFO: Inserting into database
2015-08-31 09:38:45 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:45 [scrapy] INFO: Inserting into database
2015-08-31 09:38:45 [scrapy] INFO: Catch an application: http://down.mumayi.com/970142
2015-08-31 09:38:45 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:45 [scrapy] INFO: Inserting into database
Ignore request!
Ignore request!
Ignore request!
2015-08-31 09:38:47 [scrapy] INFO: Catch an application: http://down.mumayi.com/42860
2015-08-31 09:38:47 [scrapy] INFO: Catch an application: http://down.mumayi.com/555845
2015-08-31 09:38:47 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:47 [scrapy] INFO: Inserting into database
2015-08-31 09:38:47 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:47 [scrapy] INFO: Inserting into database
2015-08-31 09:38:47 [scrapy] INFO: Catch an application: http://down.mumayi.com/121890
2015-08-31 09:38:47 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:47 [scrapy] INFO: Inserting into database
2015-08-31 09:38:48 [scrapy] INFO: Catch an application: http://down.mumayi.com/197417
2015-08-31 09:38:48 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:48 [scrapy] INFO: Inserting into database
2015-08-31 09:38:48 [scrapy] INFO: Catch an application: http://down.mumayi.com/254262
2015-08-31 09:38:48 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:48 [scrapy] INFO: Inserting into database
2015-08-31 09:38:49 [scrapy] INFO: Catch an application: http://down.mumayi.com/308575
2015-08-31 09:38:49 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:49 [scrapy] INFO: Inserting into database
2015-08-31 09:38:50 [scrapy] INFO: Catch an application: http://down.mumayi.com/227335
2015-08-31 09:38:50 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:50 [scrapy] INFO: Inserting into database
2015-08-31 09:38:50 [scrapy] ERROR: Spider error processing <GET http://down.mumayi.com/minisetup/970142> (referer: http://www.mumayi.com/android-970142.html)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/offsite.py", line 28, in process_spider_output
    for x in result:
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/depth.py", line 54, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/matt/Downloads/android-apps-crawler-master/crawler/android_apps_crawler/spiders/android_apps_spider.py", line 36, in parse
    self.parse_xpath(response, xpath_rule[key]))
  File "/home/matt/Downloads/android-apps-crawler-master/crawler/android_apps_crawler/spiders/android_apps_spider.py", line 82, in parse_xpath
    sel = Selector(response)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/selector/unified.py", line 80, in __init__
    _root = LxmlDocument(response, self._parser)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/selector/lxmldocument.py", line 27, in __new__
    cache[parser] = _factory(response, parser)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/selector/lxmldocument.py", line 13, in _factory
    body = response.body_as_unicode().strip().encode('utf8') or '<html/>'
AttributeError: 'Response' object has no attribute 'body_as_unicode'
2015-08-31 09:38:50 [scrapy] INFO: Catch an application: http://down.mumayi.com/45243
2015-08-31 09:38:50 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:50 [scrapy] INFO: Inserting into database
2015-08-31 09:38:50 [scrapy] INFO: Catch an application: http://down.mumayi.com/7937
2015-08-31 09:38:50 [scrapy] INFO: Catch an application: http://down.mumayi.com/858308
2015-08-31 09:38:50 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:50 [scrapy] INFO: Inserting into database
2015-08-31 09:38:50 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:51 [scrapy] INFO: Inserting into database
2015-08-31 09:38:51 [scrapy] INFO: Catch an application: http://down.mumayi.com/499346
2015-08-31 09:38:52 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:52 [scrapy] INFO: Inserting into database
2015-08-31 09:38:52 [scrapy] INFO: Catch an application: http://down.mumayi.com/1003438
2015-08-31 09:38:52 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:52 [scrapy] INFO: Inserting into database
2015-08-31 09:38:52 [scrapy] INFO: Catch an application: http://down.mumayi.com/549777
2015-08-31 09:38:52 [scrapy] INFO: Catch an application: http://down.mumayi.com/1002249
2015-08-31 09:38:52 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:52 [scrapy] INFO: Inserting into database
2015-08-31 09:38:52 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:52 [scrapy] INFO: Inserting into database
2015-08-31 09:38:53 [scrapy] INFO: Catch an application: http://down.mumayi.com/335562
2015-08-31 09:38:53 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:53 [scrapy] INFO: Inserting into database
2015-08-31 09:39:21 [scrapy] INFO: Catch an application: http://down.mumayi.com/51129
2015-08-31 09:39:21 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:21 [scrapy] INFO: Inserting into database
2015-08-31 09:39:22 [scrapy] INFO: Catch an application: http://down.mumayi.com/72090
2015-08-31 09:39:22 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:22 [scrapy] INFO: Inserting into database
2015-08-31 09:39:23 [scrapy] INFO: Catch an application: http://down.mumayi.com/318245
2015-08-31 09:39:23 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:23 [scrapy] INFO: Inserting into database
2015-08-31 09:39:23 [scrapy] INFO: Catch an application: http://down.mumayi.com/52958
2015-08-31 09:39:23 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:24 [scrapy] INFO: Inserting into database
2015-08-31 09:39:25 [scrapy] INFO: Catch an application: http://down.mumayi.com/212803
2015-08-31 09:39:25 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:25 [scrapy] INFO: Inserting into database
2015-08-31 09:39:26 [scrapy] INFO: Catch an application: http://down.mumayi.com/287
2015-08-31 09:39:26 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:26 [scrapy] INFO: Inserting into database
2015-08-31 09:39:26 [scrapy] INFO: Catch an application: http://down.mumayi.com/426381
2015-08-31 09:39:26 [scrapy] INFO: Catch an application: http://down.mumayi.com/32326
2015-08-31 09:39:26 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:26 [scrapy] INFO: Inserting into database
2015-08-31 09:39:26 [scrapy] INFO: Catch an application: http://down.mumayi.com/113156
2015-08-31 09:39:26 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:26 [scrapy] INFO: Inserting into database
2015-08-31 09:39:26 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:26 [scrapy] INFO: Inserting into database
2015-08-31 09:39:28 [scrapy] INFO: Crawled 184 pages (at 184 pages/min), scraped 29 items (at 29 items/min)
2015-08-31 09:39:28 [scrapy] INFO: Catch an application: http://down.mumayi.com/230146
2015-08-31 09:39:28 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:28 [scrapy] INFO: Inserting into database
2015-08-31 09:39:28 [scrapy] INFO: Catch an application: http://down.mumayi.com/208
2015-08-31 09:39:28 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:28 [scrapy] INFO: Inserting into database
2015-08-31 09:39:28 [scrapy] ERROR: Spider error processing <GET http://down.mumayi.com/minisetup/318245> (referer: http://www.mumayi.com/android-318245.html)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/offsite.py", line 28, in process_spider_output
    for x in result:
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/depth.py", line 54, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/matt/Downloads/android-apps-crawler-master/crawler/android_apps_crawler/spiders/android_apps_spider.py", line 36, in parse
    self.parse_xpath(response, xpath_rule[key]))
  File "/home/matt/Downloads/android-apps-crawler-master/crawler/android_apps_crawler/spiders/android_apps_spider.py", line 82, in parse_xpath
    sel = Selector(response)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/selector/unified.py", line 80, in __init__
    _root = LxmlDocument(response, self._parser)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/selector/lxmldocument.py", line 27, in __new__
    cache[parser] = _factory(response, parser)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/selector/lxmldocument.py", line 13, in _factory
    body = response.body_as_unicode().strip().encode('utf8') or '<html/>'
AttributeError: 'Response' object has no attribute 'body_as_unicode'
2015-08-31 09:39:28 [scrapy] INFO: Catch an application: http://down.mumayi.com/59
2015-08-31 09:39:28 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:28 [scrapy] INFO: Inserting into database
2015-08-31 09:39:30 [scrapy] INFO: Catch an application: http://down.mumayi.com/882209
2015-08-31 09:39:30 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:30 [scrapy] INFO: Inserting into database
2015-08-31 09:39:30 [scrapy] INFO: Catch an application: http://down.mumayi.com/987896
2015-08-31 09:39:30 [scrapy] INFO: Catch an application: http://down.mumayi.com/97686
2015-08-31 09:39:30 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:30 [scrapy] INFO: Inserting into database
2015-08-31 09:39:30 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:30 [scrapy] INFO: Inserting into database
2015-08-31 09:39:31 [scrapy] INFO: Catch an application: http://down.mumayi.com/979277
2015-08-31 09:39:31 [scrapy] INFO: Catch an application: http://down.mumayi.com/350618
2015-08-31 09:39:31 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:31 [scrapy] INFO: Inserting into database
2015-08-31 09:39:31 [scrapy] INFO: Catch an application: http://down.mumayi.com/343323
2015-08-31 09:39:31 [scrapy] INFO: Catch an application: http://down.mumayi.com/21799
2015-08-31 09:39:31 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:31 [scrapy] INFO: Inserting into database
2015-08-31 09:39:31 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:31 [scrapy] INFO: Inserting into database
2015-08-31 09:39:31 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:31 [scrapy] INFO: Inserting into database
2015-08-31 09:39:31 [scrapy] INFO: Catch an application: http://down.mumayi.com/485394
2015-08-31 09:39:31 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:31 [scrapy] INFO: Inserting into database
2015-08-31 09:39:32 [scrapy] INFO: Catch an application: http://down.mumayi.com/24615
2015-08-31 09:39:32 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:32 [scrapy] INFO: Inserting into database
2015-08-31 09:39:32 [scrapy] INFO: Catch an application: http://down.mumayi.com/872176
2015-08-31 09:39:32 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:32 [scrapy] INFO: Inserting into database
2015-08-31 09:39:32 [scrapy] INFO: Catch an application: http://down.mumayi.com/63575
2015-08-31 09:39:32 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:32 [scrapy] INFO: Inserting into database
2015-08-31 09:39:33 [scrapy] INFO: Catch an application: http://down.mumayi.com/1007326
2015-08-31 09:39:33 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:33 [scrapy] INFO: Inserting into database
2015-08-31 09:39:35 [scrapy] INFO: Catch an application: http://down.mumayi.com/62258
2015-08-31 09:39:35 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:35 [scrapy] INFO: Inserting into database
2015-08-31 09:39:35 [scrapy] INFO: Catch an application: http://down.mumayi.com/64880
2015-08-31 09:39:35 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:35 [scrapy] INFO: Inserting into database
2015-08-31 09:39:35 [scrapy] INFO: Catch an application: http://down.mumayi.com/455675
2015-08-31 09:39:35 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:35 [scrapy] INFO: Inserting into database
2015-08-31 09:39:36 [scrapy] INFO: Catch an application: http://down.mumayi.com/851783
2015-08-31 09:39:36 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:36 [scrapy] INFO: Inserting into database
2015-08-31 09:39:40 [scrapy] INFO: Catch an application: http://down.mumayi.com/14037
2015-08-31 09:39:40 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:40 [scrapy] INFO: Inserting into database
2015-08-31 09:39:43 [scrapy] INFO: Catch an application: http://down.mumayi.com/274799
2015-08-31 09:39:43 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:43 [scrapy] INFO: Inserting into database
2015-08-31 09:40:28 [scrapy] INFO: Crawled 333 pages (at 149 pages/min), scraped 50 items (at 21 items/min)
2015-08-31 09:41:28 [scrapy] INFO: Crawled 538 pages (at 205 pages/min), scraped 50 items (at 0 items/min)
2015-08-31 09:42:28 [scrapy] INFO: Crawled 795 pages (at 257 pages/min), scraped 50 items (at 0 items/min)
2015-08-31 09:43:28 [scrapy] INFO: Crawled 1044 pages (at 249 pages/min), scraped 50 items (at 0 items/min)
2015-08-31 09:44:28 [scrapy] INFO: Crawled 1269 pages (at 225 pages/min), scraped 50 items (at 0 items/min)
2015-08-31 09:45:28 [scrapy] INFO: Crawled 1616 pages (at 347 pages/min), scraped 50 items (at 0 items/min)
2015-08-31 09:46:28 [scrapy] INFO: Crawled 2041 pages (at 425 pages/min), scraped 50 items (at 0 items/min)
2015-08-31 09:47:28 [scrapy] INFO: Crawled 2417 pages (at 376 pages/min), scraped 50 items (at 0 items/min)
2015-08-31 09:48:28 [scrapy] INFO: Crawled 2790 pages (at 373 pages/min), scraped 50 items (at 0 items/min)
2015-08-31 09:49:28 [scrapy] INFO: Crawled 3131 pages (at 341 pages/min), scraped 50 items (at 0 items/min)
2015-08-31 09:50:28 [scrapy] INFO: Crawled 3463 pages (at 332 pages/min), scraped 50 items (at 0 items/min)
^C2015-08-31 09:51:11 [scrapy] INFO: Received SIGINT, shutting down gracefully. Send again to force 
2015-08-31 09:51:11 [scrapy] INFO: Closing spider (shutdown)
2015-08-31 09:51:21 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
 'downloader/exception_type_count/scrapy.exceptions.IgnoreRequest': 3,
 'downloader/request_bytes': 5030837,
 'downloader/request_count': 4613,
 'downloader/request_method_count/GET': 4613,
 'downloader/response_bytes': 49981622,
 'downloader/response_count': 4613,
 'downloader/response_status_count/200': 3742,
 'downloader/response_status_count/302': 871,
 'dupefilter/filtered': 545833,
 'finish_reason': 'shutdown',
 'finish_time': datetime.datetime(2015, 8, 31, 8, 51, 21, 908163),
 'item_scraped_count': 50,
 'log_count/ERROR': 2,
 'log_count/INFO': 170,
 'log_count/WARNING': 3,
 'offsite/domains': 140,
 'offsite/filtered': 81248,
 'request_depth_max': 143,
 'response_received_count': 3742,
 'scheduler/dequeued': 4616,
 'scheduler/dequeued/disk': 4616,
 'scheduler/enqueued': 20763,
 'scheduler/enqueued/disk': 20763,
 'spider_exceptions/AttributeError': 2,
 'start_time': datetime.datetime(2015, 8, 31, 8, 38, 28, 184357)}
2015-08-31 09:51:21 [scrapy] INFO: Spider closed (shutdown)

Comments:

Please add your spider's code to the question. Without seeing the code I can only guess wildly: the dupefilter/filtered count is quite high. Maybe you need to add dont_filter to your requests!?

Added the code from my spider (below). I'll try turning the filtering off if I can find where it happens!

OK, added dont_filter=True to the Request and it seems to be finding more applications! Unfortunately it now throws an error every few seconds:

2015-08-31 11:42:41 [scrapy] ERROR: Error downloading
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 45, in mustbe_deferred
    result = f(*args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/handlers/__init__.py", line 40, in download_request
    raise NotSupported("Unsupported URL scheme '%s': %s" % (scheme, msg))
NotSupported: Unsupported URL scheme 'tencent': no handler available for that scheme

It seems the code is picking up URLs that don't start with http, and "tencent" is obviously not a valid URL scheme. Maybe you should ask the author of android-apps-crawler what exactly is going on. (A sketch combining both suggested changes follows the spider code below.)

The spider code:
import re

from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapy.http import HtmlResponse
from scrapy import log

from urlparse import urlparse
from urlparse import urljoin

from android_apps_crawler.items import AppItem
from android_apps_crawler import settings
from android_apps_crawler import custom_parser


class AndroidAppsSpider(Spider):
    name = "android_apps_spider"
    scrape_rules = settings.SCRAPE_RULES

    def __init__(self, market=None, database_dir="../repo/databases/", *args, **kwargs):
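        # market selects which app store's allowed domains, start URLs and
        # scrape rules apply; database_dir is where the SQLite database lives.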
        super(AndroidAppsSpider, self).__init__(*args, **kwargs)
        self.allowed_domains = settings.ALLOWED_DOMAINS[market]
        self.start_urls = settings.START_URLS[market]
        settings.MARKET_NAME = market
        settings.DATABASE_DIR = database_dir

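    # Parse a crawled page: collect APK download links via the matching
    # market-specific rules, then follow every link on the page.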
    def parse(self, response):
        response_domain = urlparse(response.url).netloc
        appItemList = []
        cookie = {}
        xpath_rule = self.scrape_rules['xpath']
        for key in xpath_rule.keys():
            if key in response_domain:
                appItemList.extend(
                        self.parse_xpath(response, xpath_rule[key]))
                break
        custom_parser_rule = self.scrape_rules['custom_parser']
        for key in custom_parser_rule.keys():
            if key in response_domain:
                appItemList.extend(
                        getattr(custom_parser, custom_parser_rule[key])(response))
                break
        #if "appchina" in response_domain:
        #    xpath = "//a[@id='pc-download' and @class='free']/@href"
        #    appItemList.extend(self.parse_xpath(response, xpath))
        #elif "hiapk" in response_domain:
        #    xpath = "//a[@class='linkbtn d1']/@href"
        #    appItemList.extend(self.parse_xpath(response, xpath))
        #elif "android.d.cn" in response_domain:
        #    xpath = "//a[@class='down']/@href"
        #    appItemList.extend(self.parse_xpath(response, xpath))
        #elif "anzhi" in response_domain:
        #    xpath = "//div[@id='btn']/a/@onclick"
        #    appItemList.extend(self.parse_anzhi(response, xpath))
        #else:
        #    pass
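        # Follow every <a href> on the page; note that Scrapy's default
        # dupefilter drops requests for URLs it has already seen (counted as
        # dupefilter/filtered in the stats).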
        sel = Selector(response)
        for url in sel.xpath('//a/@href').extract():
            url = urljoin(response.url, url)
            yield Request(url, meta=cookie, callback=self.parse)

        for item in appItemList:
            yield item


    #def parse_appchina(self, response):
    #    appItemList = []
    #    hxs = HtmlXPathSelector(response)
    #    for url in hxs.select(
    #        "//a[@id='pc-download' and @class='free']/@href"
    #        ).extract():
    #        url = urljoin(response.url, url)
    #        log.msg("Catch an application: %s" % url, level=log.INFO)
    #        appItem = AppItem()
    #        appItem['url'] = url
    #        appItemList.append(appItem)
    #    return appItemList

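    # Build an AppItem for each URL matched by the market-specific XPath rule.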
    def parse_xpath(self, response, xpath):
        appItemList = []
        sel = Selector(response)
        for url in sel.xpath(xpath).extract():
            url = urljoin(response.url, url)
            log.msg("Catch an application: %s" % url, level=log.INFO)
            appItem = AppItem()
            appItem['url'] = url
            appItemList.append(appItem)
        return appItemList

    #def parse_anzhi(self, response, xpath):
    #    appItemList = []
    #    hxs = HtmlXPathSelector(response)
    #    for script in hxs.select(xpath).extract():
    #        id = re.search(r"\d+", script).group()
    #        url = "http://www.anzhi.com/dl_app.php?s=%s&n=5" % (id,)
    #        appItem = AppItem()
    #        appItem['url'] = url
    #        appItemList.append(appItem)
    #    return appItemList
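

Putting the two comment suggestions together, here is a minimal sketch (an illustration only, not the crawler author's fix) of how the link-following loop at the end of parse could be adjusted: dont_filter=True stops Scrapy's duplicate filter from dropping requests (the stats above show 545833 filtered), and a scheme check skips links such as tencent:// that Scrapy's downloader has no handler for.

    def parse(self, response):
        appItemList = []
        cookie = {}
        # ... XPath / custom-parser rule matching exactly as in the original ...
        sel = Selector(response)
        for url in sel.xpath('//a/@href').extract():
            url = urljoin(response.url, url)
            # Skip schemes Scrapy cannot download; this avoids the
            # "Unsupported URL scheme 'tencent'" errors from the downloader.
            if urlparse(url).scheme not in ("http", "https"):
                continue
            # dont_filter=True bypasses the dupefilter (the stats above show
            # 545833 requests silently filtered out).
            yield Request(url, meta=cookie, callback=self.parse, dont_filter=True)

        for item in appItemList:
            yield item

Note that dont_filter=True on every request makes the crawl unbounded: without the dupefilter the spider will revisit the same pages indefinitely, so some stopping condition (for example Scrapy's CLOSESPIDER_PAGECOUNT setting) is worth adding alongside it.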