Python next page and Scrapy crawler doesn't work

I'm trying to follow next pages that use a very strange numbering scheme. Instead of regular indexing, the next pages look like this:

new/v2.php?cat=69&pnum=2&pnum=3
new/v2.php?cat=69&pnum=2&pnum=3&pnum=4
new/v2.php?cat=69&pnum=2&pnum=3&pnum=4&pnum=5
from scrapy.item import Item, Field
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
import re
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin


from mymobile.items import MymobileItem


class MmobySpider(CrawlSpider):
    name = "mmoby2" 
    allowed_domains = ["mymobile.ge"]
    start_urls = [
        "http://mymobile.ge/new/v2.php?cat=69&pnum=1"
    ]

    rules = (Rule(SgmlLinkExtractor(allow=("new/v2.php\?cat=69&pnum=\d*", ))
            , callback="parse_items", follow=True),)

    def parse_items(self, response):
        sel = Selector(response)
        titles = sel.xpath('//table[@width="1000"]//td/table[@class="probg"]')
        items = []
        for t in titles:
            url = t.xpath('tr//a/@href').extract()
            item = MymobileItem()
            item["brand"] = t.xpath('tr[2]/td/text()').re('^([\w\-]+)')
            item["model"] = t.xpath('tr[2]/td/text()').re('\s+(.*)$')
            item["price"] = t.xpath('tr[3]/td//text()').re('^([0-9\.]+)')
            item["url"] = urljoin("http://mymobile.ge/new/", url[0])

            items.append(item)

        return(items)   
As a result, my scraper gets into a loop and never stops, scraping items from pages like this:

DEBUG: Scraped from <200 http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=1&pnum=1&pnum=2&pnum=3>

Any suggestions on how I can tame it?
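
For reference, one way to keep Scrapy's duplicate filter effective with URLs like these would be to canonicalize the repeated pnum parameters before the links are scheduled, for example through the link extractor's process_value hook. This is only a minimal sketch, not part of the original question, and canonicalize_pnum is a made-up helper name:

from urlparse import urlparse, parse_qsl
from urllib import urlencode

def canonicalize_pnum(url):
    # dict(parse_qsl(...)) keeps only the last value of each repeated key,
    # so ...&pnum=2&pnum=3&pnum=4 collapses to ...&pnum=4
    parts = urlparse(url)
    query = urlencode(dict(parse_qsl(parts.query)))
    return "%s://%s%s?%s" % (parts.scheme, parts.netloc, parts.path, query)

# plugged into the rule as, for example:
# SgmlLinkExtractor(allow=(r"new/v2\.php\?cat=69&pnum=\d*",), process_value=canonicalize_pnum)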

As far as I can tell, all the page numbers appear in the start URL, http://mymobile.ge/new/v2.php?cat=69&pnum=1, so you can use follow=False and the rule will only be executed once, but it will extract all the links on that first pass.

I tried this:

from scrapy.item import Item, Field
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
import re
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin


class MmobySpider(CrawlSpider):
    name = "mmoby2" 
    allowed_domains = ["mymobile.ge"]
    start_urls = [ 
        "http://mymobile.ge/new/v2.php?cat=69&pnum=1"
    ]   

    rules = ( 
        Rule(SgmlLinkExtractor(
                allow=("new/v2\.php\?cat=69&pnum=\d*",),
            )   
            , callback="parse_items", follow=False),)

    def parse_items(self, response):
        sel = Selector(response)
        print response.url
Running it like this:

scrapy crawl mmoby2
gives a request count of 6, with output like:

...
2014-05-18 12:20:35+0200 [mmoby2] DEBUG: Crawled (200) <GET http://mymobile.ge/new/v2.php?cat=69&pnum=1> (referer: None)
2014-05-18 12:20:36+0200 [mmoby2] DEBUG: Crawled (200) <GET http://mymobile.ge/new/v2.php?cat=69&pnum=1> (referer: http://mymobile.ge/new/v2.php?cat=69&pnum=1)
http://mymobile.ge/new/v2.php?cat=69&pnum=1
2014-05-18 12:20:37+0200 [mmoby2] DEBUG: Crawled (200) <GET http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=4> (referer: http://mymobile.ge/new/v2.php?cat=69&pnum=1)
http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=4
2014-05-18 12:20:38+0200 [mmoby2] DEBUG: Crawled (200) <GET http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=2> (referer: http://mymobile.ge/new/v2.php?cat=69&pnum=1)
http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=2
2014-05-18 12:20:38+0200 [mmoby2] DEBUG: Crawled (200) <GET http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=5> (referer: http://mymobile.ge/new/v2.php?cat=69&pnum=1)
http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=5
2014-05-18 12:20:39+0200 [mmoby2] DEBUG: Crawled (200) <GET http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=3> (referer: http://mymobile.ge/new/v2.php?cat=69&pnum=1)
http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=3
2014-05-18 12:20:39+0200 [mmoby2] INFO: Closing spider (finished)
2014-05-18 12:20:39+0200 [mmoby2] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 1962,
         'downloader/request_count': 6,
         'downloader/request_method_count/GET': 6,
         ...

If extracting the links with SgmlLinkExtractor fails, you can always use a simple Scrapy Spider and extract the next-page link with a selector/XPath, then yield a Request for the next page with a callback to parse, and stop the process when there is no next-page link.

Something like this should work for you:

from scrapy.spider import Spider
from scrapy.http import Request
from scrapy.selector import Selector
from urlparse import urljoin

from mymobile.items import MymobileItem

class MmobySpider(Spider):
    name = "mmoby2"
    allowed_domains = ["mymobile.ge"]
    start_urls = [
        "http://mymobile.ge/new/v2.php?cat=69&pnum=1"
    ]

    def parse(self, response):
        sel = Selector(response)
        titles = sel.xpath('//table[@width="1000"]//td/table[@class="probg"]')
        items = []
        for t in titles:
            url = t.xpath('tr//a/@href').extract()
            item = MymobileItem()
            item["brand"] = t.xpath('tr[2]/td/text()').re('^([\w\-]+)')
            item["model"] = t.xpath('tr[2]/td/text()').re('\s+(.*)$')
            item["price"] = t.xpath('tr[3]/td//text()').re('^([0-9\.]+)')
            item["url"] = urljoin("http://mymobile.ge/new/", url[0])

            yield item

        # extract next page link
        next_page_xpath = "//td[span]/following-sibling::td[1]/a[contains(@href, 'num')]/@href"
        next_page = sel.xpath(next_page_xpath).extract()

        # if there is next page yield Request for it
        if next_page:
            next_page = urljoin(response.url, next_page[0])
            yield Request(next_page, callback=self.parse)

The XPath for the next page isn't an easy one, since there's no really reliable markup on your page to identify it, but it should work fine.
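
If you want to sanity-check that XPath before running the whole crawl, the Scrapy shell is handy. A hedged example (sel is the selector shortcut that Scrapy versions of that era expose in the shell):

scrapy shell "http://mymobile.ge/new/v2.php?cat=69&pnum=1"
>>> sel.xpath("//td[span]/following-sibling::td[1]/a[contains(@href, 'num')]/@href").extract()
# should print the relative href of the next page if the markup matches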

Sorry, fell asleep :). Please see the updated question and full code. Thanks mate for this wonderful tip, I'll have to dig deeper into it.