Cannot scrape a URL with Python Scrapy because it contains # (a URI fragment)
I am facing the same problem. Can anyone tell me how to scrape the URL shown below?
start_urls = [ 'https://onlinelibrary.ectrims-congress.eu/ectrims/#!*menu=6*browseby=3*sortby=2*media=3*ce_id=1428' ]
The response I get is:
Crawled (200) <GET https://onlinelibrary.ectrims-congress.eu/ectrims/?_escaped_fragment_=%2Amenu%3D6%2Abrowseby%3D3%2Asortby%3D2%2Amedia%3D3%2Ace_id%3D1428%3E> (referer: None) ['cached']
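Unfortunately, I cannot extract any data with response.xpath: every query returns an empty result. When I open the response URL, it is not the page I actually want to extract data from.
Please help.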
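Looking at the website, you can see that the content you want is rendered by JavaScript. The JavaScript issues AJAX requests, so there is a good chance the data is loaded through an API endpoint. In the XHR tab of the Chrome developer tools you can see that 5 requests are made when the page loads. Of these, the endpoint https://onlinelibrary.ectrims-congress.eu/ectrims/listing/events/banners
returns the data you need, provided you pass the required parameters. In Scrapy these are the headers, cookies, and body of the request.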
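To see why the original start URL comes back without the listing, it helps to reproduce the URL from the "Crawled (200)" log line in the question. A fragment (everything after #) is never sent to the server, and a `#!` fragment is rewritten into an `_escaped_fragment_` query parameter under the old AJAX crawling scheme. The sketch below re-creates that rewrite with the standard library only. It is an illustration, not Scrapy's own code, and the function name `escaped_fragment_url` is made up for this example.

```python
from urllib.parse import quote, urldefrag


def escaped_fragment_url(url):
    """Rewrite a '#!' (hashbang) URL into its '_escaped_fragment_' form."""
    # urldefrag separates the part before '#' from the fragment after it.
    base, fragment = urldefrag(url)
    if not fragment.startswith("!"):
        # Not a hashbang URL: nothing to rewrite.
        return url
    # Append to an existing query string if there is one.
    sep = "&" if "?" in base else "?"
    # Percent-encode every reserved character, including '*' and '='.
    return base + sep + "_escaped_fragment_=" + quote(fragment[1:], safe="")


url = ("https://onlinelibrary.ectrims-congress.eu/ectrims/"
       "#!*menu=6*browseby=3*sortby=2*media=3*ce_id=1428")
print(escaped_fragment_url(url))
```

The printed URL is the one that appears in the crawl log. The server answers it with a page shell that does not contain the listing, which would explain the empty XPath results.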
Code
import scrapy
from scrapy import Request


class Ectrims(scrapy.Spider):
    name = 'library'

    # Headers copied from the browser's XHR request (Chrome dev tools).
    headers = {
        "Connection": "keep-alive",
        "sec-ch-ua": "\" Not;A Brand\";v=\"99\", \"Google Chrome\";v=\"91\", \"Chromium\";v=\"91\"",
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "X-Requested-With": "XMLHttpRequest",
        "sec-ch-ua-mobile": "?0",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "Origin": "https://onlinelibrary.ectrims-congress.eu",
        "Sec-Fetch-Site": "same-origin",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Dest": "empty",
        "Referer": "https://onlinelibrary.ectrims-congress.eu/ectrims/",
        "Accept-Language": "en-US,en;q=0.9"
    }

    # Cookies from one browser session; replace them with values from your own session.
    cookies = {
        "PHPSESSID": "if994kqobo2l80nk1ki7q233i5",
        "_ga": "GA1.2.212877690.1624208120",
        "_gid": "GA1.2.291339791.1624208120",
        "intercom-id-aucjjau5": "bdcafc49-97d0-42fb-b61e-46c74cfed3b0",
        "cp_user_200": "{\"1\":0}"
    }

    # Form-encoded body: the parameters from the URL fragment plus the page number.
    body = 'menu=6&browseby=3&sortby=2&media=3&ce_id=1428&getpage=1'

    def start_requests(self):
        # POST to the API endpoint instead of requesting the '#!' page URL.
        url = 'https://onlinelibrary.ectrims-congress.eu/ectrims/listing/events/banners'
        yield Request(url=url, method='POST', cookies=self.cookies, headers=self.headers, body=self.body, callback=self.parse)

    def parse(self, response):
        # Print the raw response so you can inspect what the endpoint returns.
        print(response.body)
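The `body` string in the spider looks like the fragment of the original URL with `*` swapped for `&`, plus a `getpage` parameter. Assuming that mapping holds for other listing URLs on the site (this is inferred from comparing the two strings, not confirmed by the site), a small helper could build the body instead of hard-coding it. The name `fragment_to_form_body` and its `page` argument are hypothetical.

```python
from urllib.parse import urldefrag, urlencode


def fragment_to_form_body(url, page=1):
    """Turn a '#!*key=value*key=value' fragment into a form-encoded POST body."""
    # Everything after '#' is the fragment; the browser never sends it to the server.
    _, fragment = urldefrag(url)
    # Drop the leading '!' and split the '*'-separated key=value pairs.
    pairs = [p.split("=", 1) for p in fragment.lstrip("!").split("*") if "=" in p]
    params = dict(pairs)
    # Assumed paging parameter, taken from the body string used in the spider.
    params["getpage"] = str(page)
    return urlencode(params)


url = ("https://onlinelibrary.ectrims-congress.eu/ectrims/"
       "#!*menu=6*browseby=3*sortby=2*media=3*ce_id=1428")
print(fragment_to_form_body(url))
```

For the URL in the question this prints the same string as `body` in the spider. Passing a different `page` value would be one way to request further pages, if the endpoint pages its results that way.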
Please upvote if this helped.