
Python: how do I get all the outbound links from a given web page and follow them?

I have some code that gets all the links on a web page:

from scrapy.spider import Spider
from scrapy import Selector
from socialmedia.items import SocialMediaItem
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MySpider(Spider):
    name = 'smm'
    allowed_domains = ['*']
    start_urls = ['http://en.wikipedia.org/wiki/Social_media']

    def parse(self, response):
        items = []
        # Build one item per <a> element found on the page.
        for link in response.xpath("//a"):
            item = SocialMediaItem()
            item['SourceTitle'] = link.xpath('/html/head/title').extract()  # title of the page being parsed
            item['TargetTitle'] = link.xpath('text()').extract()            # anchor text of the link
            item['link'] = link.xpath('@href').extract()                    # target URL
            items.append(item)
        return items
I would like to do the following:

1) Instead of getting all the links, get only the outbound links, or at least only those starting with http/https
2) Follow the outbound links
3) Scrape the next page only if its metadata contains certain keywords
4) Repeat the whole process for a given number of loops

Can anyone help?
Cheers


Dani
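
For point 1, a minimal sketch of one way to keep only the outbound links: compare each link's domain against the page's own domain. The helper name outbound_links is hypothetical, and it assumes a Python 3 / current Scrapy environment (urllib.parse and response.urljoin), not the scrapy.contrib setup used in the question:

from urllib.parse import urlparse

def outbound_links(response):
    """Yield absolute URLs of links that point to a different domain."""
    page_domain = urlparse(response.url).netloc
    for href in response.xpath('//a/@href').extract():
        url = response.urljoin(href)  # resolve relative links against the current page
        parsed = urlparse(url)
        # Keep only http/https links that leave the current domain.
        if parsed.scheme in ('http', 'https') and parsed.netloc != page_domain:
            yield url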

I think you may be looking for something like Scrapy's Rule and LinkExtractor:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class MySpider(CrawlSpider):  # rules are only processed by CrawlSpider, not the plain Spider
    name = 'smm'
    # no allowed_domains, so offsite (outbound) links are not filtered out
    start_urls = ['http://en.wikipedia.org/wiki/Social_media']
    keyword = 'social media'  # placeholder; the followed page must contain this term

    rules = (
        # Extract only links whose text contains "http" and send every
        # followed page to pre_parse (the argument name is restrict_xpaths).
        Rule(LinkExtractor(restrict_xpaths=('//a[contains(., "http")]',)),
             callback='pre_parse'),
    )

    def pre_parse(self, response):
        # Only run the full parse when the keyword appears in the page.
        if self.keyword in response.body:
            return self.parse_page(response)

    def parse_page(self, response):  # renamed: CrawlSpider reserves parse() for itself
        pass  # full scraping of the matching page would go here
This code is completely untested, but should give an idea of how to grab all the links and then check the followed pages for the keyword before doing the full parse.
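
For points 3 and 4, a rough sketch that is not part of the original answer: the keyword test can be run against the page's <meta> tags rather than the whole body, and Scrapy's built-in DEPTH_LIMIT setting caps how many hops the crawl follows. The spider name, keyword list and depth value below are placeholders, and the imports use the current module paths (scrapy.spiders / scrapy.linkextractors) rather than the old scrapy.contrib ones:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class KeywordSpider(CrawlSpider):
    name = 'smm_keywords'
    start_urls = ['http://en.wikipedia.org/wiki/Social_media']

    # Point 4: stop following links after this many hops from the start URL.
    custom_settings = {'DEPTH_LIMIT': 3}

    # Placeholder keyword list; adjust to whatever terms matter.
    keywords = ['social media']

    rules = (
        # Point 2: follow every extracted link and hand each page to parse_page.
        Rule(LinkExtractor(), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # Point 3: look for the keywords in the page's <meta> tags only.
        meta_text = ' '.join(response.xpath(
            '//meta[@name="description" or @name="keywords"]/@content'
        ).extract()).lower()
        if any(kw in meta_text for kw in self.keywords):
            yield {
                'url': response.url,
                'title': response.xpath('//title/text()').extract_first(),
            }

Checking only the meta tags keeps the filter cheap; if the keywords should also be matched against the page body, response.text can be searched in the same way.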


Good luck.

Thanks for asking a separate question instead of trying to solve separate issues in the comments on answers to different questions.
You're welcome. Can you help?