Python: Recursively extracting URLs from a website archive in Scrapy

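Hi, I want to scrape data from a site where all the URLs are archived by day, month, and year. To get the list of URLs first, I took some existing code and modified it for my site: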

import scrapy
import urllib    
def etUrl():
    totalWeeks = []
    totalPosts = []
    url = 'http://economictimes.indiatimes.com/archive.cms'
    data = urllib.urlopen(url).read()
    hxs = scrapy.Selector(text=data)
    months = hxs.xpath('//ul/li/a').re('http://economictimes.indiatimes.com/archive.cms/\\d+-\\d+/news.cms')
    admittMonths = 12*(2013-2007) + 8
    months = months[:admittMonths]
    for month in months:
        data = urllib.urlopen(month).read()
        hxs = scrapy.Selector(text=data)
        weeks = hxs.xpath('//ul[@class="weeks"]/li/a').re('http://economictimes.indiatimes.com/archive.cms/\\d+-\\d+/news/day\\d+\.cms')
        totalWeeks += weeks
        for week in totalWeeks:
            data = urllib.urlopen(week).read()
            hxs = scrapy.Selector(text=data)
            posts = hxs.xpath('//ul[@class="archive"]/li/h1/a/@href').extract()
            totalPosts += posts
            with open("eturls.txt", "a") as myfile:
                for post in totalPosts:
                    post = post + '\n'
                    myfile.write(post)

etUrl()

I saved the file as urlGenerator.py and ran it with the command $ python urlGenerator.py

I get no results. Can anyone help me adapt this code to my site's use case, or suggest another solution?

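Try stepping through your code one line at a time using pdb. Run python -m pdb urlGenerator.py and follow the instructions for using pdb in its documentation.

If you step through the code line by line, you can see immediately that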

data = urllib.urlopen(url).read()

fails to return anything useful:

(pdb) print(data)
<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>

You don't have permission to access "http&#58;&#47;&#47;economictimes&#46;indiatimes&#46;com&#47;archive&#46;cms" on this server.<P>
Reference&#32;&#35;18&#46;6057c817&#46;1508411706&#46;1c3ffe4
</BODY>
</HTML>

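Beyond that, the months XPath query in your code returns an empty list even when given the site's real HTML. If you look at the HTML, the links are clearly in a table, not an unordered list. Your URL pattern is also wrong. Something like this works instead: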

months = response.xpath('//table//tr//a/@href').re(r'/archive/year-\d+,month-\d+\.cms')
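To make the point about the URL pattern concrete, here is a small sketch that compares the question's pattern with the corrected one using only the standard re module. The sample hrefs are hypothetical, shaped after the pattern in the answer rather than copied from the live site.

```python
import re

# Hypothetical hrefs (an assumption for illustration, not taken from the live page).
hrefs = [
    "/archive/year-2013,month-8.cms",
    "/archive/year-2007,month-12.cms",
    "/about.cms",
]

# Pattern from the question: expects absolute URLs in a different layout.
old_pattern = re.compile(
    r"http://economictimes.indiatimes.com/archive.cms/\d+-\d+/news.cms"
)
# Pattern from the answer, with the final dot escaped.
new_pattern = re.compile(r"/archive/year-\d+,month-\d+\.cms")

old_matches = [h for h in hrefs if old_pattern.search(h)]
new_matches = [h for h in hrefs if new_pattern.search(h)]

print(old_matches)  # nothing matches the old pattern
print(new_matches)  # the two month links
```

Checking a pattern against a handful of hrefs like this, before wiring it into a spider, makes an empty result much easier to diagnose.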

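If you want to build a web scraper, then rather than starting from some code you found that is not even correct and trying to modify it blindly, try working through the Scrapy documentation, starting with some very simple examples and building up from there. For example: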

import scrapy
from scrapy.crawler import CrawlerProcess


class EtSpider(scrapy.Spider):
    name = 'et'
    start_urls = ["https://economictimes.indiatimes.com/archive.cms"]

    def parse(self, response):
        # The month links sit in a table; keep only hrefs matching the archive pattern.
        months = response.xpath('//table//tr//a/@href').re(r'/archive/year-\d+,month-\d+\.cms')
        for month in months:
            self.logger.info(month)


# Run the spider from a plain Python script.
process = CrawlerProcess()
process.crawl(EtSpider)
process.start()

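This runs correctly, and you can clearly see in the log that it finds the right URLs for each month. From there you can use callbacks, as explained in the documentation, to issue further requests.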

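Ultimately, reading the documentation and understanding what you are doing, rather than lifting some dubious code from the internet and trying to shoehorn it into your problem, will save you a lot of time and trouble.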

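Is etUrl actually called, traditionally guarded by an if __name__ == '__main__': etUrl() type construct? It is also very strange to install Scrapy and then use urllib-based request-response; arguably 50% of Scrapy's power lies in how it handles that whole process, including well-defined callbacks that avoid indentation four levels deep.

I took the liberty of cleaning up your post a bit, since I assume you did not mean to call etUrl recursively at the bottom. However, looking at the code you adapted, it seems the for loops are not nested the way you show them. Is the code you posted actually your real code? In Python, indentation matters a great deal, so please make sure what you post matches the code you actually run.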
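The two points in the comments can be sketched together: loops that run one after another instead of inside each other, and an entry point behind a __main__ guard. The helper collect_posts and the callables list_weeks and list_posts are hypothetical names, and the dictionaries stand in for network calls so the sketch runs offline.

```python
def collect_posts(month_urls, list_weeks, list_posts):
    """Collect post URLs with sequential (not nested) loops.

    list_weeks and list_posts are hypothetical callables that take a URL
    and return the links found on that page.
    """
    weeks = []
    for month in month_urls:
        weeks.extend(list_weeks(month))

    # This loop starts only after every month has been processed,
    # so each week page is visited exactly once.
    posts = []
    for week in weeks:
        posts.extend(list_posts(week))
    return posts


def main():
    # Stand-in data instead of real requests (an assumption for the demo).
    fake_weeks = {"m1": ["w1", "w2"], "m2": ["w3"]}
    fake_posts = {"w1": ["p1"], "w2": ["p2", "p3"], "w3": ["p4"]}
    posts = collect_posts(["m1", "m2"], fake_weeks.get, fake_posts.get)
    print("\n".join(posts))


if __name__ == "__main__":
    # Runs only when the file is executed directly, not when imported.
    main()
```

With the loops nested as in the question, the same weeks and posts are reprocessed and rewritten on every iteration; keeping the loops sequential avoids those duplicates.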
