Python: How to fix parsing emails from a modal after a POST request
I am using Windows 10, Python 3, and Scrapy. This is the site from which I need to parse email addresses. To get an individual email address you have to click each entry, so I took the POST request from the browser's Network tab and built a Scrapy spider around it, but it still does not parse any email. The request URL and payload are:
url = "https://find.plasticsurgery.org/default.aspx/GetMemberInfo"
and the payload = {"memberId":"102971","searchId":"38066000"}
Here is my spider code:
from scrapy.http import Request, FormRequest
from scrapy.utils.response import open_in_browser
from time import sleep
import scrapy
import csv
import json
import urllib
# urllib.parse.urlencode()

class PlasticsurgerySpider(scrapy.Spider):
    name = 'plasticsurgery'
    url = "https://find.plasticsurgery.org/default.aspx/GetMemberInfo"
    start_urls = [url]

    def parse(self, response):
        payload = {"memberId":"102971","searchId":"38066000"}
        yield Request(response.url, self.parse_page, method="POST", body=urllib.parse.urlencode(payload))
        # yield FormRequest.from_response(
        #     response=response,
        #     formdata=payload,
        #     callback=self.parse_page,
        # )

    def parse_page(self, response):
        # data = json.loads(response.body)
        # print(data)
        # open_in_browser(response)
        email = response.xpath('//*[@class="btn btn-default card-btn email"]//@href').extract_first()
        email = email.replace('mailto:','')
        yield {
            'email':email
        }
At the end of the output I only get {'email': '#'}.
I expect an email address in the result, such as {'email': 'any@anyemail.com'}.

Maybe you need to use the actual headers:
headers = {
    'origin': 'https://find.plasticsurgery.org',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36',
    'content-type': 'application/json; charset=UTF-8',
    'accept': 'application/json, text/javascript, */*; q=0.01',
    'referer': 'https://find.plasticsurgery.org/city/new-york',
    'authority': 'find.plasticsurgery.org',
    'x-requested-with': 'XMLHttpRequest',
    'dnt': '1',
}
# The body is a JSON string, matching the content-type header above.
body = '{"searchId":"38074964","memberId":"20747"}'
# Scrapy's Request defaults to GET, so the method is set explicitly.
yield Request('https://find.plasticsurgery.org/default.aspx/GetMemberInfo', method='POST', headers=headers, body=body)
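One visible difference from the spider in the question is how the body is encoded: the question passes the payload through urllib.parse.urlencode, while the headers here declare application/json and the body is a JSON string. A minimal stdlib-only sketch of the two encodings, using the payload from the question (that the server only accepts the JSON form is an assumption, not something verified here):

```python
import json
from urllib.parse import urlencode

# Payload taken from the question.
payload = {"memberId": "102971", "searchId": "38066000"}

# Form encoding, as produced by the spider in the question.
form_body = urlencode(payload)

# JSON encoding, which matches a 'content-type: application/json' header.
json_body = json.dumps(payload)

print(form_body)  # memberId=102971&searchId=38066000
print(json_body)  # {"memberId": "102971", "searchId": "38066000"}
```

If the endpoint expects JSON, only the second string would be parsed into memberId and searchId on the server side.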
Here is what you might want to do to get their names and email addresses. Feel free to use a different searchId to get different results, such as 38078106 or 38066000.
import json
import scrapy

class PlasticsurgerySpider(scrapy.Spider):
    name = 'plasticsurgery'
    post_url = "https://find.plasticsurgery.org/default.aspx/GetMemberInfo"
    headers = {"content-type": "application/json; charset=UTF-8"}
    start_urls = ["https://find.plasticsurgery.org/city/new-york"]

    def parse(self, response):
        # Collect the unique member ids from links whose onclick starts with showMemberInfo.
        onclicks = response.css("a[onclick^='showMemberInfo']::attr(onclick)").getall()
        items = {item.split("('")[1].split("')")[0] for item in onclicks}
        for item in items:
            payload = {'memberId': item, 'searchId': '38066000'}
            # Send the payload as a JSON body, matching the content-type header.
            yield scrapy.Request(url=self.post_url, headers=self.headers, callback=self.parse_page, method="POST", body=json.dumps(payload))

    def parse_page(self, response):
        # response.text replaces the deprecated response.body_as_unicode().
        data = json.loads(response.text)
        for item in data:
            name = data[item]['MemberName'].strip()
            email = data[item]['Email']
            yield {"name": name, "email": email}
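The two parsing steps in this spider can be tried without Scrapy or network access. The sketch below assumes an onclick value of the form showMemberInfo('20747') and a response shaped like {"d": {"MemberName": ..., "Email": ...}}. Both shapes are inferred from the spider code rather than confirmed against the site, and the helper names and sample values are hypothetical. It also uses a regular expression instead of the split calls, which is a swapped-in technique that captures only the first quoted argument.

```python
import json
import re


def extract_member_id(onclick):
    """Return the first quoted argument of showMemberInfo('...'), or None."""
    match = re.search(r"showMemberInfo\('([^']+)'", onclick)
    return match.group(1) if match else None


def parse_member_info(body):
    """Turn a JSON response body into a list of name/email dicts."""
    data = json.loads(body)
    return [
        {"name": entry["MemberName"].strip(), "email": entry["Email"]}
        for entry in data.values()
    ]


# Hypothetical sample inputs; the real site may use different shapes.
sample_onclick = "showMemberInfo('20747')"
sample_body = '{"d": {"MemberName": " Jane Doe ", "Email": "jane@example.com"}}'

print(extract_member_id(sample_onclick))
print(parse_member_info(sample_body))
```

Keeping these steps as plain functions makes it easy to check them against a saved onclick attribute or response body before running the full crawl.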