Python: Scrapy with SgmlLinkExtractor

python regex scrapy

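I am trying to scrape song pages whose URLs have the form matched by the allow pattern in the rule below. I want to hit such URLs from my laptop, but because they only work in the app and over WAP, I set the user agent in settings.py to "Mozilla/5.0 (Linux; U; Android 2.3.4; fr; HTC Desire Build/GRJ22) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1". My code file reads: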

from scrapy import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

from wynks.items import WynksItem


class MySpider(CrawlSpider):
    name = "wynk"
    # allowed_domains = ["wynk.in"]
    start_urls = ["http://www.wynk.in/"]
    # start_urls = []
    # Follow only links that look like song pages and parse each one.
    rules = (
        Rule(SgmlLinkExtractor(allow=[r'/music/song/\w+.html']),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = Selector(response)
        if hxs:
            # Song titles are link texts inside the songDetails table cells.
            tds = hxs.xpath("//div[@class='songDetails']//tr//td")
            if tds:
                for td in tds.xpath('.//div'):
                    titles = td.xpath("a/text()").extract()
                    if titles:
                        for title in titles:
                            print(title)
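For reference, the user-agent override described above is a single line in the project's settings.py. A minimal sketch, assuming the standard Scrapy USER_AGENT setting and the string quoted in the question:

```python
# settings.py (sketch): present the crawler as a mobile browser.
# The string is the one quoted in the question, split across two lines.
USER_AGENT = (
    "Mozilla/5.0 (Linux; U; Android 2.3.4; fr; HTC Desire Build/GRJ22) "
    "AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
)
```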

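I run it with: scrapy crawl wynk -o abcd.csv -t csv

However, the only result I get is:

Crawled (200) <GET http://www.wynk.in/> (referer: None)
2015-03-23 11:06:04+0530 [wynk] INFO: Closing spider (finished)

What am I doing wrong?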

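Because the home page has no direct links to URLs of this form, the rule had nothing to follow and the crawl stopped after the first page. I solved the problem by collecting all links on each page and issuing requests recursively until the music/song pages are reached. I also changed the spider to inherit from Spider instead of CrawlSpider.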
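The answer does not include its code, so the following is only a sketch of the idea, written without Scrapy so that it runs on its own. It follows every same-host link and records the URLs that match the song pattern. The fetch callable, the PAGES dictionary, and the example.com URLs are made up for illustration, and the sketch uses an explicit queue rather than recursion. In a Scrapy Spider the same shape would be expressed by yielding scrapy.Request(url, callback=self.parse) for each link found in parse.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Pattern from the question's rule, with the dot escaped to match a literal ".".
SONG_URL = re.compile(r"/music/song/\w+\.html")


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def crawl(start_url, fetch, max_pages=100):
    """Follow every same-host link from start_url; return the song URLs found.

    `fetch` is a hypothetical callable that maps a URL to its HTML text.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    songs = []
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        visited += 1
        if SONG_URL.search(url):
            songs.append(url)  # parse_item-style extraction would run here
        collector = LinkCollector()
        collector.feed(fetch(url))
        for href in collector.hrefs:
            link = urljoin(url, href)  # resolve relative links
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return songs


# Tiny in-memory "site" (made up): the song page is two clicks from home.
PAGES = {
    "http://example.com/": '<a href="/music/index.html">Music</a>',
    "http://example.com/music/index.html": '<a href="/music/song/srch_abc.html">Song</a>',
    "http://example.com/music/song/srch_abc.html": '<a href="/">Home</a>',
}

print(crawl("http://example.com/", lambda url: PAGES.get(url, "")))
```

The seen set is what keeps the traversal from looping when pages link back to each other; Scrapy's scheduler filters duplicate requests by default, so a spider would not need to track this by hand.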

Try removing the allowed_domains field. That did not help; edited for a faster search:

name = "wynk"
allowed_domains = ["wynk.in"]
start_urls = ["http://www.wynk.in"]
rules = (
    Rule(SgmlLinkExtractor(allow=[r'/music/song/srch_\w+.html']),
         callback='parse', follow=True),
)

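Also, I should add that these links are not available on the first page. I want the scraper to find links matching this pattern anywhere on the site. Is that possible?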
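As far as I know, Scrapy's link extractors apply each allow pattern as a regular-expression search against the absolute URL of every link on a page that was actually fetched, so a pattern can only find links on pages the crawler visits. The check itself can be tried in isolation. This is a minimal sketch, assuming the edited pattern is /music/song/srch_\w+.html; the three URLs are hypothetical.

```python
import re

# Pattern from the edited rule, with the dot escaped to match a literal ".".
pattern = re.compile(r"/music/song/srch_\w+\.html")

# Hypothetical URLs, for illustration only.
candidates = [
    "http://www.wynk.in/music/song/srch_abc123.html",
    "http://www.wynk.in/music/song/abc123.html",
    "http://www.wynk.in/music/album/srch_abc123.html",
]

# search() looks for the pattern anywhere in the URL, not only at the start.
matches = [url for url in candidates if pattern.search(url)]
print(matches)
```

Only the first URL matches. The other two would be ignored by a rule using this pattern, and with follow=True on that rule alone, pages reached only through non-matching links are never visited.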