Python: how to set a custom feed filename per request in a Scrapy spider?
I have set up a spider as shown below. From Postman I send multiple requests, each with a different search string, to the same Facebook spider. The output feed file is generated with the same name, fb_20201025024201.json, for all requests. Since the requests are made at different times, I expect the filename to be different for each request.
class Facebook(scrapy.Spider):
    name = "fb"
    start_urls = [FB_SEARCH_URL]
    allowed_domains = ["facebook.com"]
    fb_url = FB_SEARCH_URL
    timestring = time.strftime("%Y%m%d%H%M%S")
    FB_ROOT = "/fb"
    feed_uri = setting.S3_BASE_PATH + FB_ROOT + "/fb_{}.json".format(timestring)

    # settings to save extracted data
    custom_settings = {
        "ITEM_PIPELINES": {"fb_scrapping.fb_scrapping.pipelines.JSONPipeline": 200},
        "FEEDS": {
            feed_uri: {"format": "json", "encoding": "utf8", "indent": 4},
        },
        "FEED_EXPORT_ENCODING": "utf-8",
        "FEED_EXPORT_INDENT": 2,
    }
    def parse(self, response):
        search_key_list = getattr(self, "keys", None)
        if len(search_key_list) == 0:
            self.logger.error("no search key details provided")
        else:
            self.logger.info(f"crawler request payload : {search_key_list}")
            for search_key in search_key_list:
                search_string = {
                    "key": search_key,
                }
                self.logger.debug(f"scrapping for key: {search_string}")
                # fetch user details for each search string
                details_info = self.extract_info(search_string, GET_DETAILS)
                self.logger.debug("after extraction " + json.dumps(details_info))
            for search_key in search_key_list:
                user_name = details_info["User Name"]
                search_string = {
                    "key": search_key,
                }
                # fetch more details for each key
                more_details_info = self.extract_info(search_string, GET_MORE_DETAILS)
                doc = {
                    "search_key": search_key,
                    "INFO": details_info,
                    "MORE_INFO": more_details_info,
                }
                yield doc
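The repeated filename is consistent with `timestring` being a class attribute: it is evaluated once, when the module is first imported (for example, when a long-running worker process starts), not once per request. A minimal sketch of the difference between class-level and per-instance evaluation (the class names here are illustrative, not part of the spider above):

```python
import time


class ClassLevel:
    # evaluated once, at class definition time
    stamp = time.strftime("%Y%m%d%H%M%S")


class PerInstance:
    def __init__(self):
        # evaluated on every instantiation
        self.stamp = time.strftime("%Y%m%d%H%M%S")


a = ClassLevel()
time.sleep(1.1)
b = ClassLevel()
print(a.stamp == b.stamp)   # True: both share the single class-level value

c = PerInstance()
time.sleep(1.1)
d = PerInstance()
print(c.stamp == d.stamp)   # False: each instance gets its own timestamp
```

Note that moving the timestamp into `__init__` is not enough on its own here, because `custom_settings` is also a class attribute and Scrapy reads it before the spider is instantiated.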
Request 1:
> curl --location --request POST 'http://localhost:8000/v1/fb/search/' \
--header 'X-CSRFToken: NDZ1cJZBq8Mbk9xEObdimb5BgI4KiAKXOYOQg6Ipeu4wDN' \
--header 'Content-Type: application/json' \
--header 'Cookie: csrftoken=NDZ1cJZBq8Mbk9xEObdimb5BgI4KiAKXOYOQg6Ipeu4wDN' \
--data-raw '[
    {
        "key": "Amazon"
    }
]'
Request 2:
> curl --location --request POST 'http://localhost:8000/v1/fb/search/' \
--header 'X-CSRFToken: NDZ1cJZBq8Mbk9xEObdimb5BgI4KiAKXOYOQg6Ipeu4wDN' \
--header 'Content-Type: application/json' \
--header 'Cookie: csrftoken=NDZ1cJZBq8Mbk9xEObdimb5BgI4KiAKXOYOQg6Ipeu4wDN' \
--data-raw '[
    {
        "key": "Google"
    }
]'
Both requests produced the same file, fb_20201025024201.json, so they overwrite each other.
Additional info: I am using Django and Celery to launch the Scrapy tasks.
Could you help me generate a different file for each request?

Scrapy feed exports do not support per-request file storage. Rather than using feed exports, you will need to implement the corresponding logic yourself in the spider, an item pipeline, or a spider middleware.
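One way to follow that suggestion is a small item pipeline that computes the filename in `open_spider`, so the timestamp is taken when each crawl starts rather than when the module is imported. A sketch under the assumption that a plain local file is acceptable (the S3 path and the existing `JSONPipeline` wiring are left out; the class name is hypothetical):

```python
import json
import time


class PerRunJsonPipeline:
    """Collects items and writes them to a JSON file named per crawl run."""

    def open_spider(self, spider):
        # timestamp evaluated when this crawl starts, not at import time,
        # so each Celery-triggered run gets its own file
        timestring = time.strftime("%Y%m%d%H%M%S")
        self.path = f"fb_{timestring}.json"
        self.items = []

    def process_item(self, item, spider):
        self.items.append(dict(item))
        return item

    def close_spider(self, spider):
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.items, f, ensure_ascii=False, indent=4)
```

The pipeline would then be enabled through `ITEM_PIPELINES` in `custom_settings` instead of the `FEEDS` entry. If items should stream to disk instead of being buffered in memory, the same per-run filename idea applies with Scrapy's `JsonItemExporter`.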