Python: scraping emails from specific pages of a given list of URLs

python email scrapy


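I have a list of URLs in a txt file, plus a list of contact-page patterns. For each URL, I only want to check those specific pages and scrape the email addresses from them.

Please suggest some ways I could do this. I am new to Python and Scrapy. Thanks in advance.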

   import re
   import scrapy

   class FinalspiderSpider(scrapy.Spider):
       name = "finalspider"
       # One URL per line; text mode yields str rather than bytes
       with open("/Users/NiveRam/Documents/urllist.txt") as source_urls:
           start_urls = [url.strip() for url in source_urls if url.strip()]
       contact_page_pattern = ['help', 'office', 'global', 'feedback', 'branch', 'contact', 'about']

       def parse(self, response):
           # response.text is the decoded page body (response.body is bytes)
           emails = re.findall(r'[\w\.-]+@[\w\.-]+', response.text)
           story = FinaltestItem()  # item class from the project's items.py
           story["url"] = response.url
           story["title"] = response.xpath("//title/text()").extract()
           story["email"] = emails
           return story

This retrieves emails from the entire page of every URL in the list and outputs them like this:

email: [info@abc.com, infor@abc.com, yourname@abc.com]

You can access the current URL through the url attribute of the response object:

import re
import scrapy

class MySpider(scrapy.Spider):
    url_keywords = ['stackoverflow', 'tea']

    def parse(self, response):
        story = FinaltestItem()  # item class from the project's items.py
        # check if any of the defined keywords can be found in response.url
        get_email = any(k in response.url for k in self.url_keywords)
        if get_email:  # if yes, add the emails
            # response.text is the decoded page body (response.body is bytes)
            emails = re.findall(r'[\w\.-]+@[\w\.-]+', response.text)
            story["email"] = emails
        story["url"] = response.url
        story["title"] = response.xpath("//title/text()").extract()
        return story
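The pattern used in both spiders is quite permissive: the local part cannot contain a plus sign, and the domain part accepts underscores. Below is a minimal sketch of a somewhat stricter pattern with a small post-filter. The names EMAIL_RE, ASSET_SUFFIXES and extract_emails are made up for illustration, and the suffix list is only an assumption about which false matches are common; this is not a complete RFC-compliant validator.

```python
import re

# Local part: letters, digits and a few common punctuation marks (including "+").
# Domain: dot-separated labels of letters, digits and hyphens, then an alphabetic TLD.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}")

# Hypothetical post-filter: asset names such as "logo@2x.png" look like addresses.
ASSET_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp")


def extract_emails(text):
    """Return unique candidate addresses, lowercased, in order of first appearance."""
    found = []
    for match in EMAIL_RE.findall(text):
        candidate = match.lower()
        if candidate.endswith(ASSET_SUFFIXES):
            continue  # skip image file names that merely contain "@"
        if candidate not in found:
            found.append(candidate)
    return found


sample = ("Write to firstname_lastname+stuff@example.com or Info@abc.com. "
          "Icon: logo@2x.png, again info@abc.com")
print(extract_emails(sample))
```

In the spider, extract_emails(response.text) could replace the re.findall call. Quoted local parts, which the RFC allows, are deliberately not handled here.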

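You're a spammer, right?

I really need to learn this as part of website crawling. I am not a spammer.

Your email regex is not very accurate. It will miss firstname_lastname+stuff@example.com and will wrongly extract almost anything that contains an @. If you do some post-filtering, the false matches are probably easy to weed out. The conventional wisdom is not to rely on a regex alone.

Sorry, I just took some earlier suggestions and comments and used them. So you suggest I try a different regex?

Underscores and plus signs are not allowed in the domain part. The dot should not be backslash-escaped inside a character class. But yes, this handles the cases I came up with off the top of my head. There are some special cases, such as "quoted string"@example.com, which the RFC allows but which are not seen in practice.

You mean "article_by_john" is the contact page pattern I have to pass to the response object?

@Niveram Yes, see my edit for how to implement this simple version.

Thank you, I understand your comment. But I have a doubt: if the URL is, for example, www.abc.com, I need to append the words from contact_page_pattern to it, giving URLs like www.abc.com/about, and then search those specific pages for emails.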

get_email = any(k in response.url for k in self.contact_page_pattern)
print(get_email)

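When I print get_email it returns False, and the email addresses are not retrieved.

Yes, get_email is a boolean that indicates whether to collect emails. If any of the patterns is found in response.url, get_email becomes True and the emails are added to the item; if none of the keywords is found, get_email is False and no emails are added.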
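The check returns False for a bare URL such as www.abc.com because none of the patterns occurs in that string; the spider only ever requests the URLs it is given. One possible way to cover the follow-up requirement is to generate the candidate contact-page URLs up front. The sketch below assumes each pattern is a path directly under the site root, which may not hold for every site; normalize, contact_page_urls and should_get_email are hypothetical helper names.

```python
from urllib.parse import urljoin

CONTACT_PAGE_PATTERN = ['help', 'office', 'global', 'feedback',
                        'branch', 'contact', 'about']


def normalize(url):
    """Add a scheme when the list only holds bare hosts like 'www.abc.com'."""
    url = url.strip()
    if "://" not in url:
        url = "http://" + url  # assumption: plain http is acceptable
    return url


def contact_page_urls(base_url, patterns=CONTACT_PAGE_PATTERN):
    """Build one candidate URL per pattern, e.g. http://www.abc.com/about."""
    root = normalize(base_url).rstrip("/") + "/"
    return [urljoin(root, pattern) for pattern in patterns]


def should_get_email(url, patterns=CONTACT_PAGE_PATTERN):
    """Same test as in the answer: does any pattern occur in the URL?"""
    return any(pattern in url for pattern in patterns)


print(should_get_email("http://www.abc.com"))
for candidate in contact_page_urls("www.abc.com"):
    print(candidate, should_get_email(candidate))
```

Inside the spider, a start_requests method could yield scrapy.Request(u, callback=self.parse) for every URL produced by contact_page_urls. Many of the generated pages will not exist on a given site, and because the substring test looks at the whole URL, a pattern such as global can also match inside a host name.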