How to tell python scrapy to move on to the next start URL

Tags: python, web-scraping, scrapy

I wrote a Scrapy spider that has many start URLs and extracts email addresses from those sites. The script takes a very long time to run, so I want to tell Scrapy to stop crawling a particular site as soon as it finds an email address and move on to the next site.

EDIT: added the code

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
import csv
from urlparse import urlparse

from entreprise.items import MailItem

class MailSpider(CrawlSpider):
    name = "mail"
    start_urls = []
    allowed_domains = []
    # Build start_urls and allowed_domains from the 6th column of the CSV,
    # skipping the header row.
    with open('scraped_data.csv', 'rb') as csvfile:
        reader = csv.reader(csvfile, delimiter=',', quotechar='"')
        next(reader)
        for row in reader:
            url = row[5].strip()
            if (url.strip() != ""):
                start_urls.append(url)
                # Keep the registrable part of the hostname: the last two
                # fragments, or three when the second-level label is short
                # (e.g. example.co.uk).
                fragments = urlparse(url).hostname.split(".")
                hostname = ".".join(len(fragments[-2]) < 4 and fragments[-3:] or fragments[-2:])
                allowed_domains.append(hostname)

    rules = [
        Rule(SgmlLinkExtractor(allow=('.+')), follow=True, callback='parse_item'),
        Rule(SgmlLinkExtractor(allow=('.+')), callback='parse_item')
    ]

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        # Pull anything that looks like an email address out of the body text.
        for mail in hxs.select('//body//text()').re(r'[\w.-]+@[\w.-]+'):
            item = MailItem()
            item['url'] = response.url
            item['mail'] = mail
            items.append(item)
        return items

The idea is to use a process_links callback to decide which URLs get crawled next. In addition, the class-level set parsed_hostnames keeps track of the hostnames for which an email has already been extracted.

I have also changed the way the hostname is taken from the URL; it now uses urlparse directly.


In theory this should work. Hope it helps.

The links extracted by each Rule are now filtered through a process_links callback:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
import csv
from urlparse import urlparse

class MailItem(Item):
    url = Field()
    mail = Field()

class MailSpider(CrawlSpider):
    name = "mail"

    # Hostnames for which an email address has already been extracted.
    parsed_hostnames = set()

    rules = [
        Rule(SgmlLinkExtractor(allow=('.+')), follow=True, callback='parse_item', process_links='process_links'),
        Rule(SgmlLinkExtractor(allow=('.+')), callback='parse_item', process_links='process_links')
    ]

    start_urls = []
    allowed_domains = []
    with open('scraped_data.csv', 'rb') as csvfile:
        reader = csv.reader(csvfile, delimiter=',', quotechar='"')
        next(reader)
        for row in reader:
            url = row[5].strip()
            if (url.strip() != ""):
                start_urls.append(url)
                hostname = urlparse(url).hostname
                allowed_domains.append(hostname)

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        mails = hxs.select('//body//text()').re(r'[\w.-]+@[\w.-]+')
        if mails:
            for mail in mails:
                item = MailItem()
                item['url'] = response.url
                item['mail'] = mail
                items.append(item)
                hostname = urlparse(response.url).hostname
                self.parsed_hostnames.add(hostname)

        return items

    def process_links(self, links):
        # Drop links pointing at hostnames that already yielded an email,
        # so the spider stops following anything on those sites.
        return [l for l in links if urlparse(l.url).hostname not in self.parsed_hostnames]
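
A note for readers on current Scrapy releases: scrapy.contrib, SgmlLinkExtractor and HtmlXPathSelector have since been removed. Below is a minimal sketch of the same per-site stop idea with the modern API, assuming Python 3 and a recent Scrapy; the CSV file name and column index are simply carried over from the code above.

import csv
from urllib.parse import urlparse

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MailItem(scrapy.Item):
    url = scrapy.Field()
    mail = scrapy.Field()


class MailSpider(CrawlSpider):
    name = "mail"

    # Hostnames that already yielded at least one email address.
    parsed_hostnames = set()

    rules = [
        Rule(LinkExtractor(allow=(".+",)), follow=True,
             callback="parse_item", process_links="process_links"),
    ]

    start_urls = []
    allowed_domains = []
    with open("scraped_data.csv", newline="") as csvfile:
        reader = csv.reader(csvfile, delimiter=",", quotechar='"')
        next(reader)  # skip the header row
        for row in reader:
            url = row[5].strip()
            if url:
                start_urls.append(url)
                allowed_domains.append(urlparse(url).hostname)

    def parse_item(self, response):
        hostname = urlparse(response.url).hostname
        for mail in response.xpath("//body//text()").re(r"[\w.-]+@[\w.-]+"):
            # Remember the host so process_links stops following its links.
            self.parsed_hostnames.add(hostname)
            yield MailItem(url=response.url, mail=mail)

    def process_links(self, links):
        # Filter out links whose hostname has already produced an email.
        return [l for l in links
                if urlparse(l.url).hostname not in self.parsed_hostnames]

The behaviour matches the answer above: once parse_item records a hostname, process_links returns no further links for that site, its pending requests drain, and the crawl continues with the remaining start URLs.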

Comments:
- Can you show your spider's code? It would help with the answer.
- The allowed domains are not applied correctly: I tested the code and the spider crawls twitter URLs that are not in the list.
- OK, that is because allow_domains is not set on your rules' link extractors. I have edited the code, try it. link_extractor.allow_domains is a set rather than a list, so I use add instead of append.
- The script still does not stop crawling the current domain once an email address is found, so nothing has changed.
- Yes, I have edited the answer (it is a set, right). Try again: I now remove the hostname from allow_domains when it is already in parsed_hostnames.
- Still not working :-) You can test with this url; it should stop right after the first request.
- Thanks to your help I found the solution. The trick was to read the CrawlSpider source code and understand how it works.
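
For illustration only, and explicitly not the poster's actual code: the kind of fix that reading the CrawlSpider source usually suggests is overriding its private _requests_to_follow hook, the method that turns a response into follow-up requests via the rules. A hypothetical sketch of that approach under the same Python 3 / recent Scrapy assumption as above (the method is a Scrapy internal, so its name and behaviour may differ between versions):

from urllib.parse import urlparse

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class StopPerSiteSpider(CrawlSpider):
    """Hypothetical variant: cut a site off inside CrawlSpider itself."""

    name = "mail_stop_per_site"
    start_urls = ["http://example.com"]  # placeholder start URL
    rules = [Rule(LinkExtractor(allow=(".+",)), follow=True,
                  callback="parse_item")]

    parsed_hostnames = set()

    def parse_item(self, response):
        hostname = urlparse(response.url).hostname
        for mail in response.xpath("//body//text()").re(r"[\w.-]+@[\w.-]+"):
            self.parsed_hostnames.add(hostname)
            yield {"url": response.url, "mail": mail}

    def _requests_to_follow(self, response):
        # Skip link extraction entirely for hosts that already produced an
        # email; everything else goes through the stock implementation.
        if urlparse(response.url).hostname in self.parsed_hostnames:
            return []
        return super()._requests_to_follow(response)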