Crawling RSS: Scrapy returns no data


scrapy


This is my code for crawling the BBC RSS feed, but it returns nothing.

I checked the XPath interactively in Chrome using "Inspect", and it looked fine.

import scrapy


class BbcSpider(scrapy.Spider):
    name = "bbc"
    allowed_domains = ["feeds.bbci.co.uk/news/world/rss.xml"]
    start_urls = ["https://feeds.bbci.co.uk/news/world/rss.xml"]

    def parse(self, response):
        all_rss = response.xpath('//div[@id="item"]/ul/li')
        for rss in all_rss:
            rss_url = rss.xpath('//a/@href').extract_first()
            rss_title = rss.xpath('//a/text()').extract_first()
            rss_short_content = rss.xpath('//div/text()').extract_first()
            yield {
            "URL": rss_url,
            "Title": rss_title,
            "Short Content": rss_short_content
        }

Any help would be much appreciated.

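The main reason this spider yields no data is that the all_rss list is empty. Scrapy only sees the response to the initial GET request, so if you open the page source with Ctrl/Cmd+U you will not find any element with the id "item". As a result, your response.xpath('//div[@id="item"]/ul/li') selector returns an empty list and the for loop never executes.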
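One way to confirm this outside Scrapy is to parse the raw XML with Python's standard library and count the elements it actually contains. The sketch below is only an illustration: SAMPLE_RSS is a hand-written feed with placeholder titles and URLs (not real BBC data), and count_elements is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written sample that mimics the general shape of an RSS 2.0 feed.
# Titles and URLs are placeholders, not real BBC data.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>https://example.com/</link>
    <description>Channel-level description</description>
    <item>
      <title>First story</title>
      <link>https://example.com/1</link>
      <description>Summary of the first story</description>
    </item>
    <item>
      <title>Second story</title>
      <link>https://example.com/2</link>
      <description>Summary of the second story</description>
    </item>
  </channel>
</rss>"""


def count_elements(xml_text, tag):
    """Count how many elements named `tag` appear anywhere in the document."""
    root = ET.fromstring(xml_text)
    return len(list(root.iter(tag)))


if __name__ == "__main__":
    # The raw feed has no <div> at all, so a div-based selector matches nothing.
    print("div elements:", count_elements(SAMPLE_RSS, "div"))
    # The news entries live in <item> elements instead.
    print("item elements:", count_elements(SAMPLE_RSS, "item"))
```

If the div count is zero for the real feed as well, the selector has to target item elements rather than whatever markup the browser's inspector displays.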

Try this:

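    for rss in response.css('item'):
        rss_url = rss.css('link::text').extract_first()
        rss_title = rss.css('title::text').extract_first()
        # Query the current item (rss), not the whole response, so each
        # item gets its own description instead of the first one in the feed
        rss_short_content = rss.css('description::text').extract_first()
        yield {"URL": rss_url, "Title": rss_title, "Short Content": rss_short_content}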

The response is an XML document rather than an HTML page, so you can parse it like this:

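import scrapy


class BbcSpider(scrapy.Spider):
    name = "bbc"
    # allowed_domains takes bare domain names, not full URLs or paths
    allowed_domains = ["feeds.bbci.co.uk"]
    start_urls = ["https://feeds.bbci.co.uk/news/world/rss.xml"]

    def parse(self, response):
        # The first two <link> and <title> nodes belong to the feed header
        # rather than to news items, so skip them
        rss_url = response.xpath('//link/text()').extract()[2:]
        rss_title = response.xpath('//title/text()').extract()[2:]
        # Assumption: the feed header carries exactly one <description> of
        # its own; skipping it keeps the three lists aligned item by item
        rss_short_content = response.xpath('//description/text()').extract()[1:]
        # zip stops at the shortest list, which avoids an IndexError
        for url, title, short_content in zip(rss_url, rss_title, rss_short_content):
            yield {
                "URL": url,
                "Title": title,
                "Short Content": short_content,
            }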

The first two URLs and titles are not news items, so I dropped them.
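Dropping a fixed number of leading entries depends on the feed header never changing. A possible alternative is to read each field relative to its own item element, so header-level titles and links are never picked up and a missing field cannot shift the others. The sketch below shows the idea with the standard library only; FEED is a made-up sample, and extract_items is a hypothetical helper, not part of Scrapy.

```python
import xml.etree.ElementTree as ET

# Made-up feed: header-level title/link/description plus an image block,
# followed by two items. The second item has no <description> on purpose.
FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>https://example.com/</link>
    <description>Channel-level description</description>
    <image>
      <title>Example feed logo</title>
      <link>https://example.com/</link>
    </image>
    <item>
      <title>First story</title>
      <link>https://example.com/1</link>
      <description>Summary of the first story</description>
    </item>
    <item>
      <title>Second story</title>
      <link>https://example.com/2</link>
    </item>
  </channel>
</rss>"""


def extract_items(xml_text):
    """Return one dict per <item>, reading each field from that item only."""
    root = ET.fromstring(xml_text)
    results = []
    for item in root.iter("item"):
        results.append({
            # findtext looks at direct children of this item and
            # returns None when the child element is missing
            "URL": item.findtext("link"),
            "Title": item.findtext("title"),
            "Short Content": item.findtext("description"),
        })
    return results


if __name__ == "__main__":
    for entry in extract_items(FEED):
        print(entry)
```

In a Scrapy spider the same pattern corresponds to looping over the item nodes and running relative queries on each one, rather than slicing three document-wide lists.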

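Hi @Ikram, so how should I fix it? I added rss_url = response.xpath('//div[@id="item"]/ul/li/a/@href').extract_first() inside the function, without the loop, but it still returns None.

That won't change anything. You are trying to select content that does not exist in the page source. Try this.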