Scrapy: re.match not working when using a regular expression to find URLs in a string


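I am trying to crawl multiple URLs in the same domain. I have to collect the list of URLs into a single string, and I want to search that string with a regular expression to find the URLs. But re.match() always returns None. I have tested my regex and it works. Here is my code: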

# -*- coding: UTF-8 -*-

import scrapy
import codecs 
import re

from scrapy.contrib.spiders import CrawlSpider, Rule

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from scrapy import Request

from scrapy.selector import HtmlXPathSelector

from hurriyet.items import HurriyetItem

class hurriyet_spider(CrawlSpider):
    name = 'hurriyet'
    allowed_domains = ['hurriyet.com.tr']
    start_urls = ['http://www.hurriyet.com.tr/gundem/']

    rules = (Rule(SgmlLinkExtractor(allow=('\/gundem(\/\S*)?.asp$')),'parse',follow=True),) 

    def parse(self, response):
        image = HurriyetItem()
        text =  response.xpath("//a/@href").extract()
        print text

        urls = ''.join(text)


        page_links = re.match("(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'\".,<>?«»“”‘’]))", urls, re.M)

        image['title'] = response.xpath("//h1[@class = 'title selectionShareable'] | //h1[@itemprop = 'name']/text()").extract()
        image['body'] = response.xpath("//div[@class = 'detailSpot']").extract()
        image['body2'] = response.xpath("//div[@class = 'ctx_content'] ").extract()
        print page_links

        return image, text

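There is no need to use the re module. Scrapy selectors have a built-in .re() method:

def parse(self, response):
    ...
    page_links = response.xpath("//a/@href").re('your_regex_expression')
    ...

That said, I would recommend trying this approach in the Scrapy shell first to make sure your regex actually works, because I don't expect people to try debugging a mile-long regex. It is basically a write-only language :)

Use re.findall. re.match only matches at the beginning of the string.

I tried it and it still does not work: re.match() returns None and re.findall() returns [].

That means your regex is wrong. Does the regex in this post help: ? There is also another question you should check:

Hey, why write-only? :) Take a look at the question I just answered: . Regular expressions are not that unreadable.

It's a joke about how ugly they are to read while being much more fluent to write.

You're kidding. The regex above is only half a mile long. E.g. sites like this can turn it into beautiful abstract art.

@lcd047: I'm not one to compete in who-has-a-bigger-regex, so I didn't actually measure; it was more of a quick approximation :D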
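To make the re.match versus re.findall advice concrete, here is a minimal sketch in plain Python 3 that needs no Scrapy install. The href values and the short URL pattern are assumptions made up for illustration; the pattern is deliberately much simpler than the one in the question and is not a replacement for it. The sketch also shows two things that may contribute to the empty results reported above: the question's pattern is an ordinary string literal, so its leading "\b" is a backspace character rather than a word boundary, and ''.join(text) glues the hrefs together with no separator.

```python
import re

# Hypothetical href values, standing in for
# response.xpath("//a/@href").extract() in the spider.
hrefs = [
    "/gundem/",
    "http://www.hurriyet.com.tr/gundem/a.asp",
    "http://www.hurriyet.com.tr/spor/b.asp",
]

# Simplified illustrative pattern (an assumption, not the asker's regex).
# The r prefix keeps \b as a regex word boundary.
URL_RE = re.compile(r"\bhttps?://[^\s<>]+", re.IGNORECASE)

# Join with a space so neighbouring URLs do not run into each other.
text = " ".join(hrefs)

# re.match only tries to match at position 0; the text starts with
# "/gundem/", so there is nothing to match there.
match_result = URL_RE.match(text)

# re.findall scans the whole string and returns every match.
found = URL_RE.findall(text)

# Without the r prefix, "\b" is the backspace character (0x08),
# so this pattern looks for a literal backspace before "http".
broken_found = re.compile("\bhttps?://").findall(text)

print(match_result)
print(found)
print(broken_found)
```

Applying the pattern to each href separately, or using the selector's .re() method from the answer above, avoids the joining step entirely.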