Python 3.x: Extracting Wikipedia links with Scrapy

python-3.x scrapy

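I am learning Scrapy and trying to use it to crawl a Wikipedia page (the URL in start_urls below).

I want to scrape each country together with the hyperlink attached to that country. Here is my code so far: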

import scrapy


class CountrypopSpider(scrapy.Spider):
    name = 'countryPop'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/List_of_sovereign_states_in_the_2020s']

    def parse(self, response):
        countries = response.xpath('//table//b//@title').extract()

        for country in countries:

            country_url = response.xpath("//table//b[contains(@href, 'Afghanistan')]").extract()

            yield {'countries': country}

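What it currently does is get all the countries from the main table. I then want it to loop over each country and use the country name to get the URL. I am struggling to find a way to look up the URL by country name, though; my latest attempt uses contains().

Any other comments on my scraping code would be much appreciated.

Thanks.

Try this.

Method 1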

import scrapy


class CountrypopSpider(scrapy.Spider):
    name = 'countryPop'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/List_of_sovereign_states_in_the_2020s']

    def parse(self, response):
        cnames = ['Australia', 'Bhutan']  # countries to look up
        for cname in cnames:
            # Quote the name so XPath treats it as a string literal, not a node name
            href = response.xpath(f'//table[1]//a[contains(@title, "{cname}")]/@href').get()
            if href:
                # .get() returns only the first match, so each country is yielded once
                yield {cname: response.urljoin(href)}

Method 2

import scrapy


class CountrypopSpider(scrapy.Spider):
    name = 'countryPop'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/List_of_sovereign_states_in_the_2020s']
    # Per-spider settings go in custom_settings
    custom_settings = {'LOG_LEVEL': 'INFO'}

    def parse(self, response):
        countries = 200  # upper bound on the number of table rows to inspect
        cnames = ['Australia', 'Bhutan']
        for i in range(5, countries):
            # Bold link in the first cell of table row i + 2
            for title in response.xpath('//*[@id="mw-content-text"]/div/table[1]/tbody/tr[' + str(i + 2) + ']/td[1]/b/a'):
                name = title.xpath('text()').get()
                if name in cnames:
                    # Read href directly and resolve it against the page URL
                    yield {name: response.urljoin(title.xpath('@href').get())}

If you export the output to a JSON file, it will look like this

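I don't know Scrapy well enough to answer this question fully, but you can reference href directly, the same way you reference title. It is also worth checking whether an existing example does something very similar to what you are doing.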