XPath for loop scrapes only one item instead of all items

xpath web-scraping scrapy

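I'm trying to scrape about 20 articles from a single web page, but for some reason the spider only finds the information for the first article. How can I make it scrape every article on the page?

I've tried changing the XPaths several times, but I think I'm too new to this to tell what the problem is. When I take all the paths out of the for loop, everything is scraped fine, but the result comes out in a format that doesn't let me export the data to a CSV file.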

import scrapy


class AfgSpider(scrapy.Spider):
    name = 'afg'
    allowed_domains = ['www.pajhwok.com/en']
    start_urls = ['https://www.pajhwok.com/en/security-crime']

    def parse(self, response):
        container = response.xpath("//div[@id='taxonomy-page-block']")
        for x in container:
            title = x.xpath(".//h2[@class='node-title']/a/text()").get()
            author = x.xpath(".//div[@class='field-item even']/a/text()").get()
            rel_url = x.xpath(".//h2[@class='node-title']/a/@href").get()

            yield {
                'title' : title,
                'author' : author,
                'rel_url' : rel_url
            }

You can use this code to collect the information you need:

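import scrapy


class AfgSpider(scrapy.Spider):
    name = 'test'
    # allowed_domains takes bare domain names, so the '/en' path is dropped here
    allowed_domains = ['www.pajhwok.com']
    start_urls = ['https://www.pajhwok.com/en/security-crime']

    def parse(self, response):
        # Select one node per article instead of the single wrapper div
        container = response.css("div#taxonomy-page-block div.node-article")
        for x in container:
            title = x.xpath(".//h2[@class='node-title']/a/text()").get()
            author = x.xpath(".//div[@class='field-item even']/a/text()").get()
            rel_url = x.xpath(".//h2[@class='node-title']/a/@href").get()
            yield {
                'title': title,
                'author': author,
                'rel_url': rel_url
            }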

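The problem is that you wrote

container = response.xpath("//div[@id='taxonomy-page-block']")

which returns only a single element, because an id should be unique across the whole page, whereas a class can be shared by several tags. The for loop therefore runs exactly once, and each .get() call inside it returns only the first match found in that one wrapper.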
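
To see the difference without running a crawl, here is a minimal sketch. It uses Python's standard-library xml.etree.ElementTree as a stand-in for Scrapy's selectors, and the markup in PAGE is a made-up, simplified version of the page structure (an assumption, not the real site's HTML). `find` plays the role of `.get()` here in that it returns only the first match.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: one wrapper with an id,
# several articles that share the same class.
PAGE = """
<html><body>
  <div id="taxonomy-page-block">
    <div class="node-article"><h2 class="node-title"><a href="/en/a1">First</a></h2></div>
    <div class="node-article"><h2 class="node-title"><a href="/en/a2">Second</a></h2></div>
    <div class="node-article"><h2 class="node-title"><a href="/en/a3">Third</a></h2></div>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE)

# Looping over the id match: a single wrapper, so a single iteration,
# and only the first title inside it is picked up.
wrappers = root.findall(".//div[@id='taxonomy-page-block']")
titles_from_wrapper = [w.find(".//h2[@class='node-title']/a").text for w in wrappers]

# Looping over the per-article nodes: one iteration per article.
articles = root.findall(".//div[@class='node-article']")
titles_from_articles = [a.find(".//h2[@class='node-title']/a").text for a in articles]

print(len(wrappers), titles_from_wrapper)
print(len(articles), titles_from_articles)
```

The first loop mirrors the spider in the question, the second mirrors the fixed selector.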

That is a good answer from @Roman. Other options for fixing the script:

- Declare the correct XPath for the loop step:

container = response.xpath("//div[@class='node-inner clearfix']")

- Or remove the loop step and use the .getall() method to fetch the data:

title = response.xpath(".//h2[@class='node-title']/a/text()").getall()
author = response.xpath(".//div[@class='field-item even']/a/text()").getall()
rel_url = response.xpath(".//h2[@class='node-title']/a/@href").getall()
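
Note that .getall() gives three separate lists, which is probably why the output did not fit a CSV file when the paths were taken out of the loop. One way to get rows back is to pair the lists up by position. The sketch below uses made-up sample values in place of the real .getall() results, and `to_rows` is a hypothetical helper name, not part of Scrapy.

```python
import csv
import io

# Hypothetical values standing in for the three .getall() results.
titles = ["First", "Second"]
authors = ["Author A", "Author B"]
rel_urls = ["/en/a1", "/en/a2"]


def to_rows(titles, authors, rel_urls):
    """Pair the i-th entry of each list into one dict per article."""
    return [
        {"title": t, "author": a, "rel_url": u}
        for t, a, u in zip(titles, authors, rel_urls)
    ]


rows = to_rows(titles, authors, rel_urls)

# Write the rows to an in-memory CSV to show the resulting layout.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "author", "rel_url"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

In a spider you would yield each dict from `rows` rather than write the file yourself. Be aware that zip pairs by position and stops at the shortest list, so if one article has no author, every later row ends up misaligned. Looping over per-article nodes avoids that problem.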