Scraping all email addresses from a website using Ruby's Anemone gem


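I am trying to use a single-file Ruby script to scrape all the email addresses on a given site. At the bottom of the file I have a hard-coded test case that uses the URL of a specific page listing an email address (so it should find an email address on the first iteration of the first loop).

For some reason, my regular expression does not seem to match: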

#get_emails.rb
require 'rubygems'
require 'open-uri'
require 'nokogiri'
require 'mechanize'
require 'uri'
require 'anemone'

class GetEmails

  def initialize
      @urlCounter, @anemoneCounter  = 0
      $allUrls, $emailUrls, $emails = []
  end


  def has_email?(listingUrl)
   hasListing = false
   Anemone.crawl(listingUrl) do |anemone|
      anemone.on_every_page do |page|
      body_text = page.body.to_s
      matchOrNil = body_text.match(/\A[^@\s]+@[^@\s]+\z/)
       if matchOrNil != nil
        $emailUrls[$anemoneCounter] = listingUrl
        $emails[$anemoneCounter] = body_text.match
        $anemoneCounter += 1
        hasListing = true
      else 
      end
    end
   end
   return hasListing
  end

end 

emailGrab = GetEmails.new()
emailGrab.has_email?("http://genuinestoragesheds.com/contact/")
puts $emails[0]

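\A and \z match the start and the end of the string, respectively. The web page obviously contains more than just the email string (otherwise there would be no point in testing it with a regular expression at all), so the anchored pattern can never match the whole body.

You can simplify it to just

/[^@\s]+@[^@\s]+/

but you will still need to clean up the string when extracting the email.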
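To make the effect of the anchors concrete, here is a minimal sketch that runs both patterns against a small page body. The HTML snippet and the address in it are made up for illustration; only the two regular expressions come from the discussion above.

```ruby
# Hypothetical page body; the markup and the address are invented for this example.
body = %(<p>Contact: <a href="mailto:info@example.com">Email us</a></p>)

anchored   = /\A[^@\s]+@[^@\s]+\z/
unanchored = /[^@\s]+@[^@\s]+/

# \A and \z force the pattern to span the entire string,
# so a whole page body does not match.
anchored_match = body.match(anchored)
puts anchored_match.inspect

# Without the anchors the pattern matches, but it also captures the
# surrounding markup up to the nearest whitespace on either side.
unanchored_match = body.match(unanchored)
puts unanchored_match[0]

# The anchored pattern only succeeds when the string is nothing but the address.
bare_match = "info@example.com".match(anchored)
puts bare_match[0]
```

The second match includes the `href="mailto:` prefix and part of the tag, which is why a cleanup step is still needed with the simplified pattern.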

So here is a working version of the code. It uses one regular expression to find the string containing the email and three more to clean it up.

#get_emails.rb
require 'rubygems'
require 'open-uri'
require 'nokogiri'
require 'mechanize'
require 'uri'
require 'anemone'

class GetEmails

  def initialize
      @urlCounter = 0
      $anemoneCounter  = 0
      $allUrls = []
      $emailUrls = []
      $emails = []
  end

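  # Strip attribute prefixes (e.g. href=), scheme prefixes (e.g. mailto:)
  # and surrounding double quotes from a matched string.
  def email_clean(email)
    email = email.gsub(/(\w+=)/,"")  
    email = email.gsub(/(\w+:)/, "")
    # gsub, not gsub!: gsub! returns nil when nothing was replaced
    email = email.gsub(/\A"|"\Z/, '')
    return email
  end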


  def has_email?(listingUrl)
   hasListing = false
   Anemone.crawl(listingUrl) do |anemone|
      anemone.on_every_page do |page|
      body_text = page.body.to_s
      #matchOrNil = body_text.match(/\A[^@\s]+@[^@\s]+\z/)   
      matchOrNil = body_text.match(/[^@\s]+@[^@\s]+/)
       if matchOrNil != nil
        $emailUrls[$anemoneCounter] = listingUrl
        $emails[$anemoneCounter] = matchOrNil[0]  # store the matched string, not the MatchData
        $anemoneCounter += 1
        hasListing = true
      else 
      end
    end
   end
   return hasListing
  end

end 

emailGrab = GetEmails.new()
found_email = "href=\"mailto:genuinestoragesheds@gmail.com\""
puts emailGrab.email_clean(found_email)
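One difference between the two scripts that is easy to overlook is the initializer. The original assigns several variables from a single right-hand value, while the working version assigns each variable on its own line. The original also sets @anemoneCounter but later reads $anemoneCounter, which is a different, uninitialized variable. A minimal sketch of how parallel assignment behaves in Ruby, using local variables with made-up names in place of the script's globals:

```ruby
# Parallel assignment with a single right-hand value only fills the first variable.
url_counter, anemone_counter = 0
puts url_counter.inspect      # 0
puts anemone_counter.inspect  # nil

# An empty array on the right is destructured, so every variable ends up nil.
all_urls, email_urls, emails = []
puts [all_urls, email_urls, emails].inspect  # [nil, nil, nil]

# Indexing into one of those nil values raises as soon as a match is stored.
error_raised = false
begin
  emails[0] = "someone@example.com"
rescue NoMethodError => e
  error_raised = true
  puts "NoMethodError: #{e.message}"
end

# Assigning each variable separately gives the intended starting state.
counter = 0
found = []
found[counter] = "someone@example.com"
counter += 1
puts found.inspect
```

So even with a pattern that matches, the original initializer would have caused a failure on the first hit.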

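This gem must be unmaintained.

Just wondering, why the dollar signs?

I created global variables so that I could access them directly from irb.

That's interesting. I actually ran into some false-positive problems. For example, the regex above matches the following as an "email address": "href="

Try with something like

/\A[\w+\-.]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+\z/i

and check it.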
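Note that the pattern suggested in the last comment is anchored with \A and \z, so it is suited to validating a single candidate string rather than searching a whole page. A minimal sketch of both uses follows. The unanchored variant (anchors removed, group made non-capturing so that String#scan returns whole matches) is an adaptation assumed here, not part of the comment, and the page body is invented for illustration.

```ruby
# Anchored pattern from the comment: validates one candidate string.
valid_email = /\A[\w+\-.]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+\z/i

puts valid_email.match?("genuinestoragesheds@gmail.com")                   # true
puts valid_email.match?('href="mailto:genuinestoragesheds@gmail.com"')     # false

# Assumed adaptation for searching text: no anchors, non-capturing group.
email_in_text = /[\w+\-.]+@[a-z\d\-]+(?:\.[a-z\d\-]+)*\.[a-z]+/i

# Hypothetical page body.
body = <<~HTML
  <a href="mailto:sales@example.com">sales@example.com</a>
  <p>Support: help.desk@mail.example.org</p>
HTML

# scan returns every match on the page; uniq drops repeated addresses.
emails = body.scan(email_in_text).uniq
puts emails
```

Because the character classes exclude quotes, colons and equals signs, the matches no longer carry the `href="mailto:` prefix, so no separate cleanup pass is needed in this sketch. It is still a loose pattern and will not cover every valid address.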