Python unable to scrape email IDs

I am trying to scrape email IDs from the following page using Scrapy, Python, and regular expressions:

To do so, I wrote the following commands, each of which returns an empty list:

response.xpath('//a/*[@href = "#"]/text()').extract()

response.xpath('//a/@onclick').extract()

response.xpath('//a/@onclick/text()').extract()

response.xpath('//span/*[@class = ""]/a/text()').extract()

response.xpath('//a/@onclick/text()').extract()
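A side note on why the attempts above come back empty: `//a/*[@href = "#"]` selects *child elements* of `<a>` that carry `href="#"`, not the `<a>` elements themselves; the predicate form is `//a[@href = "#"]`. Likewise `@onclick/text()` can never match, because an attribute node has no `text()` children. The element-vs-child distinction can be illustrated with the standard library's `ElementTree` (a stand-in here, not Scrapy) on an assumed snippet:

```python
import xml.etree.ElementTree as ET

# Assumed minimal markup: the attribute predicate belongs on <a> itself.
root = ET.fromstring('<div><a href="#">mail me</a></div>')

# './/a/*[@href="#"]' looks at children of <a>, so it finds nothing here:
print(root.findall('.//a/*[@href="#"]'))   # []

# './/a[@href="#"]' selects the <a> element itself:
print(root.findall('.//a[@href="#"]')[0].text)
```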
Besides this, I also plan to use a regular expression to extract the email ID from the description. To that end, I wrote a command to scrape the description, which returned everything except the email ID at the end of the description:

response.xpath('//*[@property = "schema:description"]/text()').extract()
The output of the above command is:

[u'\n\t\t\t\t\t\t\t     "Your Future is created by what you do today Let\'s shape it With Summer Training Program \u2026\u2026\u2026 ."', u'\n', u'\nWith ever changing technologies & methodologies, the competition today is much greater than ever before. The industrial scenario needs constant technical enhancements to cater to the rapid demands.', u'\nHT India Labs is presenting Summer Training Program to acquire and clear your concepts about your respective fields. ', u'\nEnroll on ', u' and avail Early bird Discounts.', u'\n', u'\nFor Registration or Enquiry call 9911330807, 7065657373 or write us at ', u'\t\t\t\t\t\t']
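As a sketch of the planned regex step: the `text()` fragments above do not include the address itself (it sits in a child `<span>`), so one would first append the span text and undo the obfuscation. The `" | "` → `"@"` and `" ! "` → `"."` mapping is an assumption based on the single sample seen on this page:

```python
import re

# Fragments as returned by the schema:description query above (shortened),
# plus the span text that holds the obfuscated address.
fragments = [
    "\nFor Registration or Enquiry call 9911330807, 7065657373 or write us at ",
    "htindialabsworkshops | gmail ! com",
]

# Undo the site's obfuscation first (assumed mapping: " | " -> "@", " ! " -> ".").
text = "".join(fragments).replace(" | ", "@").replace(" ! ", ".")

# A simple email pattern; adjust if addresses on the page vary in format.
match = re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
print(match.group(0) if match else None)
```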

I don't know much about the onclick event attribute. My guess is that when it is set to return false, requests usually skips that part. However, if you try the approach I show below, you may get very close to the result you want:

import requests
from scrapy import Selector

res = requests.get("https://allevents.in/bangalore/project-based-summer-training-program/1851553244864163")
# A requests Response exposes .text, which Selector reads for parsing.
sel = Selector(res)
for items in sel.css("div[property='schema:description']"):
    # The (obfuscated) email sits in a <span> inside the description div.
    emailid = items.css("span::text").extract_first()
    print(emailid)
Output:

htindialabsworkshops | gmail ! com

Thank you very much. I converted your code into the equivalent XPath command, which produces the same output as yours: response.xpath('//div[@property="schema:description"]/span/text()').extract_first(). If all the email IDs share the same format, then response.xpath('//div[@property="schema:description"]/span/text()').extract_first().replace(' | ', '@').replace(' ! ', '.') should solve everything.
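The replace chain described in the comment above can be sketched on its own; note the `" | "` → `"@"` and `" ! "` → `"."` mapping is inferred from the single sample output, so it is an assumption about how this site obfuscates addresses:

```python
# De-obfuscate the address printed by the scraper above.
# Assumption: " | " stands for "@" and " ! " stands for "." on this site.
raw = "htindialabsworkshops | gmail ! com"
email = raw.replace(" | ", "@").replace(" ! ", ".")
print(email)  # htindialabsworkshops@gmail.com
```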