Python next page and Scrapy crawler doesn't work

I'm trying to follow next pages that use a very strange numbering scheme. Instead of regular indexing, the next pages look like this:

new/v2.php?cat=69&pnum=2&pnum=3
new/v2.php?cat=69&pnum=2&pnum=3&pnum=4
new/v2.php?cat=69&pnum=2&pnum=3&pnum=4&pnum=5
from scrapy.item import Item, Field
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
import re
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin


from mymobile.items import MymobileItem


class MmobySpider(CrawlSpider):
    name = "mmoby2" 
    allowed_domains = ["mymobile.ge"]
    start_urls = [
        "http://mymobile.ge/new/v2.php?cat=69&pnum=1"
    ]

    rules = (Rule(SgmlLinkExtractor(allow=("new/v2.php\?cat=69&pnum=\d*", ))
            , callback="parse_items", follow=True),)

    def parse_items(self, response):
        sel = Selector(response)
        titles = sel.xpath('//table[@width="1000"]//td/table[@class="probg"]')
        items = []
        for t in titles:
            url = t.xpath('tr//a/@href').extract()
            item = MymobileItem()
            item["brand"] = t.xpath('tr[2]/td/text()').re('^([\w\-]+)')
            item["model"] = t.xpath('tr[2]/td/text()').re('\s+(.*)$')
            item["price"] = t.xpath('tr[3]/td//text()').re('^([0-9\.]+)')
            item["url"] = urljoin("http://mymobile.ge/new/", url[0])

            items.append(item)

        return(items)   
As a result, my scraper gets into a loop and never stops, scraping items from pages like this:

DEBUG: Scraped from <200 http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=1&pnum=1&pnum=2&pnum=3>

Any suggestions on how I can tame it?
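
For reference, one way to keep Scrapy's duplicate filter effective with URLs like these would be to canonicalize the repeated pnum parameters before the links are scheduled, for example through the link extractor's process_value hook. This is only a minimal sketch, not part of the original question, and canonicalize_pnum is a made-up helper name:

from urlparse import urlparse, parse_qsl
from urllib import urlencode

def canonicalize_pnum(url):
    # dict(parse_qsl(...)) keeps only the last value of each repeated key,
    # so ...&pnum=2&pnum=3&pnum=4 collapses to ...&pnum=4
    parts = urlparse(url)
    query = urlencode(dict(parse_qsl(parts.query)))
    return "%s://%s%s?%s" % (parts.scheme, parts.netloc, parts.path, query)

# plugged into the rule as, for example:
# SgmlLinkExtractor(allow=(r"new/v2\.php\?cat=69&pnum=\d*",), process_value=canonicalize_pnum)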

As far as I can tell, all the page numbers appear in the start URL, http://mymobile.ge/new/v2.php?cat=69&pnum=1, so you can use follow=False and the rule will only be executed once, but it will extract all the links on that first pass.

I tried this:

from scrapy.item import Item, Field
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
import re
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin


class MmobySpider(CrawlSpider):
    name = "mmoby2" 
    allowed_domains = ["mymobile.ge"]
    start_urls = [ 
        "http://mymobile.ge/new/v2.php?cat=69&pnum=1"
    ]   

    rules = ( 
        Rule(SgmlLinkExtractor(
                allow=("new/v2\.php\?cat=69&pnum=\d*",),
            )   
            , callback="parse_items", follow=False),)

    def parse_items(self, response):
        sel = Selector(response)
        print response.url
Running it like this:

scrapy crawl mmoby2
gives a request count of 6, with output like:

...
2014-05-18 12:20:35+0200 [mmoby2] DEBUG: Crawled (200) <GET http://mymobile.ge/new/v2.php?cat=69&pnum=1> (referer: None)
2014-05-18 12:20:36+0200 [mmoby2] DEBUG: Crawled (200) <GET http://mymobile.ge/new/v2.php?cat=69&pnum=1> (referer: http://mymobile.ge/new/v2.php?cat=69&pnum=1)
http://mymobile.ge/new/v2.php?cat=69&pnum=1
2014-05-18 12:20:37+0200 [mmoby2] DEBUG: Crawled (200) <GET http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=4> (referer: http://mymobile.ge/new/v2.php?cat=69&pnum=1)
http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=4
2014-05-18 12:20:38+0200 [mmoby2] DEBUG: Crawled (200) <GET http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=2> (referer: http://mymobile.ge/new/v2.php?cat=69&pnum=1)
http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=2
2014-05-18 12:20:38+0200 [mmoby2] DEBUG: Crawled (200) <GET http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=5> (referer: http://mymobile.ge/new/v2.php?cat=69&pnum=1)
http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=5
2014-05-18 12:20:39+0200 [mmoby2] DEBUG: Crawled (200) <GET http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=3> (referer: http://mymobile.ge/new/v2.php?cat=69&pnum=1)
http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=3
2014-05-18 12:20:39+0200 [mmoby2] INFO: Closing spider (finished)
2014-05-18 12:20:39+0200 [mmoby2] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 1962,
         'downloader/request_count': 6,
         'downloader/request_method_count/GET': 6,
         ...

If extracting the links with SgmlLinkExtractor fails, you can always use a simple Scrapy Spider and extract the next-page link with a selector/XPath, then yield a Request for the next page with a callback to parse, and stop the process when there is no next-page link.

Something like this should work for you:

from scrapy.spider import Spider
from scrapy.http import Request
from scrapy.selector import Selector
from urlparse import urljoin

from mymobile.items import MymobileItem

class MmobySpider(Spider):
    name = "mmoby2"
    allowed_domains = ["mymobile.ge"]
    start_urls = [
        "http://mymobile.ge/new/v2.php?cat=69&pnum=1"
    ]

    def parse(self, response):
        sel = Selector(response)
        titles = sel.xpath('//table[@width="1000"]//td/table[@class="probg"]')
        items = []
        for t in titles:
            url = t.xpath('tr//a/@href').extract()
            item = MymobileItem()
            item["brand"] = t.xpath('tr[2]/td/text()').re('^([\w\-]+)')
            item["model"] = t.xpath('tr[2]/td/text()').re('\s+(.*)$')
            item["price"] = t.xpath('tr[3]/td//text()').re('^([0-9\.]+)')
            item["url"] = urljoin("http://mymobile.ge/new/", url[0])

            yield item

        # extract next page link
        next_page_xpath = "//td[span]/following-sibling::td[1]/a[contains(@href, 'num')]/@href"
        next_page = sel.xpath(next_page_xpath).extract()

        # if there is next page yield Request for it
        if next_page:
            next_page = urljoin(response.url, next_page[0])
            yield Request(next_page, callback=self.parse)

The XPath for the next page isn't an easy one, since there's no really reliable markup on your page to identify it, but it should work fine.
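
If you want to sanity-check that XPath before running the whole crawl, the Scrapy shell is handy. A hedged example (sel is the selector shortcut that Scrapy versions of that era expose in the shell):

scrapy shell "http://mymobile.ge/new/v2.php?cat=69&pnum=1"
>>> sel.xpath("//td[span]/following-sibling::td[1]/a[contains(@href, 'num')]/@href").extract()
# should print the relative href of the next page if the markup matches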

Sorry, fell asleep :). Please see the updated question and full code. Thanks mate for this wonderful tip, I'll have to dig deeper into it.