Web scraping: Sending a POST request with Scrapy
I'm learning how to use Scrapy for web scraping, but I've run into a problem with dynamically loaded content. I'm trying to scrape a phone number from a website that sends a POST request to the server in order to retrieve the number.
These are the headers of the POST request it sends:
Host: www.mymarket.ge
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0
Accept: */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
Referer: https://www.mymarket.ge/en/pr/16399126/savaWro-inventari/fulis-yuTi
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
X-Requested-With: XMLHttpRequest
Content-Length: 13
Origin: https://www.mymarket.ge
Connection: keep-alive
Cookie: Lang=en; split_test_version=v1; CookieID=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJEYXRhIjp7IklEIjozOTUwMDY2MzUsImN0IjoxNTkyMzA2NDMxfSwiVG9rZW5JRCI6Ik55empxVStDa21QT1hKaU9lWE56emRzNHNSNWtcL1wvaVVUYjh2dExCT3ZKWT0iLCJJc3N1ZWRBdCI6MTU5MjMyMTc1MiwiRXhwaXJlc0F0IjoxNTkyMzIyMDUyfQ.mYR-I_51WLQbzWi-EH35s30soqoSDNIoOyXgGQ4Eu84; ka=da; SHOW_BETA_POPUP=B; APP_VERSION=B; LastSearch=%7B%22CatID%22%3A%22515%22%7D; PHPSESSID=eihhfcv85liiu3kt55nr9fhu5b; PopUpLog=%7B%22%2A%22%3A%222020-05-07+15%3A13%3A29%22%7D
And this is the body:
PrID=16399126
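As a quick sanity check on the request above, the body and the Content-Length header agree: form-encoding the single PrID field should give a 13-byte string. A minimal sketch using only the Python standard library (the variable names are illustrative and not part of the original post):

```python
from urllib.parse import urlencode

# Form-encode the single field exactly as an
# application/x-www-form-urlencoded body would carry it.
body = urlencode({"PrID": "16399126"})

# Content-Length counts bytes, so encode before measuring.
content_length = len(body.encode("utf-8"))

print(body)
print(content_length)
```

This is why the Content-Length header normally does not have to be set by hand: an HTTP client can compute it from the encoded form data.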
I managed to replicate the POST request successfully, but I can't figure out how to do it with Scrapy. This is what my code looks like:
class MymarketcrawlerSpider(CrawlSpider):
    name = "mymarketcrawler"
    allowed_domains = ["mymarket.ge"]
    start_urls = ["http://mymarket.ge/"]
    rules = (
        Rule(
            LinkExtractor(allow=r".*mymarket.ge/ka/*", restrict_css=".product-card"),
            callback="parse_item",
            follow=True,
        ),
    )
    def parse_item(self, response):
        item_loader = ItemLoader(item=MymarketItem(), response=response)
        def parse_num(response):
            try:
                response_text = response.text
                response_dict = ast.literal_eval(response_text)
                number = response_dict['Data']['Data']['numberToShow']
                nonlocal item_loader
                item_loader.add_value("number", number)
                yield item_loader.load_item()
            except Exception as e:
                raise CloseSpider(e)
        yield FormRequest.from_response(
            response,
            url=r"https://www.mymarket.ge/ka/pr/ShowFullNumber/",
            headers={
                "Host": "www.mymarket.ge",
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0",
                "Accept": "*/*",
                "Accept-Language": "en-US,en;q=0.5",
                "Accept-Encoding": "gzip, deflate, br",
                "Referer": "https://www.mymarket.ge/ka/pr/16399126/savaWro-inventari/fulis-yuTi",
                "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
                "X-Requested-With": "XMLHttpRequest",
            },
            formdata={"PrID": "16399126"},
            method="POST",
            dont_filter=True,
            callback=parse_num,
        )
        item_loader.add_xpath(
            "seller", "//div[@class='d-flex user-profile']/div/span/text()"
        )
        item_loader.add_xpath(
            "product",
            "//div[contains(@class, 'container product')]//h1[contains(@class, 'product-title')]/text()",
        )
        item_loader.add_xpath(
            "price",
            "//div[contains(@class, 'container product')]//span[contains(@class, 'product-price')][1]/text()",
            TakeFirst(),
        )
        item_loader.add_xpath(
            "images",
            "//div[@class='position-sticky']/ul[@id='imageGallery']/li/@data-src",
        )
        item_loader.add_xpath(
            "condition", "//div[contains(@class, 'condition-label')]/text()"
        )
        item_loader.add_xpath(
            "city",
            "//div[@class='d-flex font-14 font-weight-medium location-views']/span[contains(@class, 'location')]/text()",
        )
        item_loader.add_xpath(
            "number_of_views",
            "//div[@class='d-flex font-14 font-weight-medium location-views']/span[contains(@class, 'svg-18')]/span/text()",
        )
        item_loader.add_xpath(
            "publish_date",
            "//div[@class='d-flex left-side']//div[contains(@class, 'font-12')]/span[2]/text()",
        )
        item_loader.add_xpath(
            "total_products_amount",
            "//div[contains(@class, 'user-profile')]/div/a/text()",
            re=r"\d+",
        )
        item_loader.add_xpath(
            "description", "//div[contains(@class, 'text-full')]/p/text()"
        )
        item_loader.add_value("url", response.url)
        yield item_loader.load_item()
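The code above doesn't work; the number field isn't populated.
I can print the number to the screen, but I can't save it to the CSV file. The number column in the CSV file is empty and contains no values.

Scrapy works asynchronously: every link to crawl, every item to process, and so on is put into a queue. That's why you yield a request and then wait for the spider, the downloader, the item pipelines, etc. to handle it.

What happens is that your requests are processed separately, which is why you don't see the result: the item yielded at the end of parse_item is built before the response to the POST request has arrived. Personally, I would parse the results of the first request, store them in "meta", and pass them along to the next request so the data can be used later.

For example: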
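import ast
from scrapy import FormRequest
from scrapy.exceptions import CloseSpider
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider, Rule
# MymarketItem comes from the project's own items module; import it from wherever your project defines it.
class MymarketcrawlerSpider(CrawlSpider):
    name = "mymarketcrawler"
    allowed_domains = ["mymarket.ge"]
    start_urls = ["http://mymarket.ge/"]
    rules = (
        Rule(
            LinkExtractor(allow=r".*mymarket.ge/ka/*", restrict_css=".product-card"),
            callback="parse_item",
            follow=True,
        ),
    )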
    def parse_item(self, response):
        def parse_num(response):
            item_loader = ItemLoader(item=MymarketItem(), response=response)
            try:
                response_text = response.text
                response_dict = ast.literal_eval(response_text)
                number = response_dict['Data']['Data']['numberToShow']
                # New part:
                product = response.meta['product']
                # You won't need this now: nonlocal item_loader
                # Also new:
                item_loader.add_value("number", number)
                item_loader.add_value("product", product)
                yield item_loader.load_item()
            except Exception as e:
                raise CloseSpider(e)
        # Rewrite your parsers like this:
        product = response.xpath(
            "//div[contains(@class, 'container product')]//h1[contains(@class, 'product-title')]/text()"
        ).get()
        yield FormRequest.from_response(
            response,
            url=r"https://www.mymarket.ge/ka/pr/ShowFullNumber/",
            headers={
                "Host": "www.mymarket.ge",
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0",
                "Accept": "*/*",
                "Accept-Language": "en-US,en;q=0.5",
                "Accept-Encoding": "gzip, deflate, br",
                "Referer": "https://www.mymarket.ge/ka/pr/16399126/savaWro-inventari/fulis-yuTi",
                "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
                "X-Requested-With": "XMLHttpRequest",
            },
            formdata={"PrID": "16399126"},
            method="POST",
            dont_filter=True,
            callback=parse_num,
            meta={"product": product},
        )
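One thing the example above leaves as it was in the question is the hardcoded PrID and Referer, so every crawled listing would request the number of the same product. The sketch below shows two small helpers that might address this, using only the standard library. Both function names are hypothetical, the URL pattern is an assumption based on the single example URL in the question, and the sample payload is made up to match the ['Data']['Data']['numberToShow'] key path rather than captured from the server. It also swaps ast.literal_eval for json.loads, since JSON literals such as true, false and null are not valid Python literals.

```python
import json
import re

# Assumption: the listing ID is the numeric path segment right after "/pr/",
# as in the one example URL shown in the question.
_PRODUCT_ID_RE = re.compile(r"/pr/(\d+)")


def product_id_from_url(url):
    """Return the numeric listing ID from a product URL, or None if absent."""
    match = _PRODUCT_ID_RE.search(url)
    return match.group(1) if match else None


def number_from_payload(text):
    """Extract the phone number from the JSON text of the POST response."""
    payload = json.loads(text)
    return payload["Data"]["Data"]["numberToShow"]


# Hypothetical usage: the payload is invented for illustration only.
url = "https://www.mymarket.ge/en/pr/16399126/savaWro-inventari/fulis-yuTi"
sample = '{"Data": {"Data": {"numberToShow": "555123456"}}, "ok": true}'
print(product_id_from_url(url))
print(number_from_payload(sample))
```

Inside parse_item these could be wired in as formdata={"PrID": product_id_from_url(response.url)} with "Referer": response.url, and parse_num could call number_from_payload(response.text) in place of the ast.literal_eval lines.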