Python Scrapy: following links with a regular expression
Tags: python, regex, hyperlink, scrapy, forum


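I want to crawl threads from a German forum.

The subforums are located at:

Subforum: musiker-board.de/forum/subforumname

The threads have addresses of the form: musiker-board.de/Threads/threadname

I want to follow all links to all subforums and extract all the threads in them, but the thread URLs no longer match the start URL.

However, if I choose "musiker-board.de/" as the start URL, it does not follow the links to all the subforums.

Here is the code: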

allowed_domains = ["musiker-board.de"]
start_urls = ['http://www.musiker-board.de/forum/'
             ]
rules = (
         Rule(SgmlLinkExtractor(allow=[r'forum/\w+']), follow=True),
         Rule(SgmlLinkExtractor(allow=[r'threads/\w+']), callback='parse_item'),
         )

def parse_item(self, response):
    # extract items...
    pass
How should I follow all musiker-board.de/forum/subforumname links and extract all musiker-board.de/threads/threadname pages?
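
A quick way to sanity-check the two `allow` patterns from the rules above is to run them against a few URLs with `re.search`. This is only a sketch: it assumes Scrapy's link extractors apply `allow` patterns as case-sensitive regex searches on the absolute URL, and the subforum and thread names are the placeholders from the question, not real pages. The `classify` helper is hypothetical and exists only for this illustration.

```python
import re

# The two allow patterns from the rules above.
FORUM_PATTERN = re.compile(r'forum/\w+')
THREAD_PATTERN = re.compile(r'threads/\w+')


def classify(url):
    """Return 'forum' or 'thread' for the first pattern that matches, else None.

    Hypothetical helper for illustration; it checks the patterns in the
    same order as the rules tuple.
    """
    if FORUM_PATTERN.search(url):
        return 'forum'
    if THREAD_PATTERN.search(url):
        return 'thread'
    return None


# Sample URLs; 'subforumname' and 'threadname' are placeholders.
samples = [
    'http://www.musiker-board.de/forum/',
    'http://www.musiker-board.de/forum/subforumname/',
    'http://www.musiker-board.de/threads/threadname/',
    'http://www.musiker-board.de/Threads/threadname/',
]

for url in samples:
    print(url, '->', classify(url))
```

Under that assumption, the bare start URL matches neither pattern (nothing follows `forum/`), and a capitalized `Threads/` path would not match `threads/\w+`, so the case used in the real URLs matters.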

The following code (built from your snippet) seems to work fine:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class Scrapy1Spider(CrawlSpider):

    name = "musiker"
    allowed_domains = ["musiker-board.de"]
    start_urls = ['http://www.musiker-board.de/forum/']
    rules = (
        # Follow subforum pages without a callback.
        Rule(LinkExtractor(allow=[r'forum/\w+']), follow=True),
        # Hand thread pages to parse_item.
        Rule(LinkExtractor(allow=[r'threads/\w+']), callback='parse_item'),
    )

    def parse_item(self, response):
        # Log each thread URL that reaches the callback.
        self.logger.info('response.url=%s', response.url)
It produces at least this output (truncated):

INFO: response.url=http://www.musiker-board.de/threads/peavey-ms-412-userthread.271458/
INFO: response.url=http://www.musiker-board.de/threads/peavey-5150-6505-etc-userthread.180295/
INFO: response.url=http://www.musiker-board.de/threads/marshall-ma-serie-user-thread.386428/
INFO: response.url=http://www.musiker-board.de/threads/h-k-metal-master-shredder-user-thread.250846/
INFO: response.url=http://www.musiker-board.de/threads/hughes-und-kettner-grandmeister-user-thread.553487/
INFO: response.url=http://www.musiker-board.de/threads/ibanez-userthread.190547/
INFO: response.url=http://www.musiker-board.de/threads/hughes-kettner-edition-blue-user-thread.209499/page-2
INFO: response.url=http://www.musiker-board.de/threads/fender-prosonic-userthread.239519/
INFO: response.url=http://www.musiker-board.de/threads/fender-prosonic-userthread.239519/page-5
INFO: response.url=http://www.musiker-board.de/threads/engl-steve-morse-signature-e656-user-thread.427802/page-2
INFO: response.url=http://www.musiker-board.de/threads/engl-sovereign-user-thread.136266/page-20
INFO: response.url=http://www.musiker-board.de/threads/engl-steve-morse-signature-e656-user-thread.427802/
INFO: response.url=http://www.musiker-board.de/threads/engl-sovereign-user-thread.136266/page-19
INFO: response.url=http://www.musiker-board.de/threads/engl-sovereign-user-thread.136266/page-18
INFO: response.url=http://www.musiker-board.de/threads/engl-invader-user-thread.248090/page-5
INFO: response.url=http://www.musiker-board.de/threads/engl-sovereign-user-thread.136266/
INFO: response.url=http://www.musiker-board.de/threads/engl-invader-user-thread.248090/page-4
INFO: response.url=http://www.musiker-board.de/threads/engl-invader-user-thread.248090/page-3
INFO: response.url=http://www.musiker-board.de/threads/fender-cybertwin-userthread.305789/
INFO: response.url=http://www.musiker-board.de/threads/fenders-famose-farbwelten.454766/

Strange, I only get links that contain /forum. Which version of Scrapy are you using?

I used Scrapy 1.0.1.

I reinstalled Scrapy and now it works fine.
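
Since the comments trace the problem to the installed Scrapy version, it can help to check which version is actually installed before debugging the rules. The sketch below uses only the standard library (`importlib.metadata`, Python 3.8 or later); `installed_version` is a hypothetical helper name, and it assumes the distribution is registered under the name `scrapy`.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing.

    Hypothetical helper for illustration only.
    """
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the installed Scrapy version, or None if Scrapy is not installed
# in the current environment.
print('Scrapy version:', installed_version('scrapy'))
```

As far as I know, the `scrapy.spiders` and `scrapy.linkextractors` import paths used in the answer date from Scrapy 1.0, so a much older installation would be one plausible reason for different behavior.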