Python: Why doesn't my Scrapy spider work correctly with Flask?

python flask scrapy

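I have a Flask application that takes a URL from the user, crawls that website, and returns the links found on it. Previously I had a problem where the crawler would run only once and would not run again after that. I solved it by using CrawlerRunner instead of CrawlerProcess. This is what my code looks like: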

from flask import Flask, render_template, request, redirect, url_for, session, make_response
from flask_executor import Executor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor
from urllib.parse import urlparse
from uuid import uuid4
import sys
import urllib3, requests, urllib.parse

app = Flask(__name__)
executor = Executor(app)

http = urllib3.PoolManager()
runner = CrawlerRunner()

# Module-level result sets shared by every request (note: `list` shadows the built-in)
list = set([])
list_validate = set([])
list_final = set([])

@app.route('/', methods=["POST", "GET"])
def index():
    if request.method == "POST":
        url_input = request.form["usr_input"]

        # Modifying URL
        if 'https://' in url_input and url_input[-1] == '/':
            url = str(url_input)
        elif 'https://' in url_input and url_input[-1] != '/':
            url = str(url_input) + '/'
        elif 'https://' not in url_input and url_input[-1] != '/':
            url = 'https://' + str(url_input) + '/'
        elif 'https://' not in url_input and url_input[-1] == '/':
            url = 'https://' + str(url_input)
        # Validating URL
        try:
            response = requests.get(url)
            error = http.request("GET", url)
            if error.status == 200:
                parse = urlparse(url).netloc.split('.')
                base_url = parse[-2] + '.' + parse[-1]
                start_url = [str(url)]
                allowed_url = [str(base_url)]

                # Crawling links
                class Crawler(CrawlSpider):
                    name = "crawler"
                    start_urls = start_url
                    allowed_domains = allowed_url
                    rules = [Rule(LinkExtractor(), callback='parse_links', follow=True)]

                    def parse_links(self, response):
                        base_url = url
                        href = response.xpath('//a/@href').getall()
                        list.add(urllib.parse.quote(response.url, safe=':/'))
                        for link in href:
                            if base_url not in link:
                                list.add(urllib.parse.quote(response.urljoin(link), safe=':/'))
                        for link in list:
                            if base_url in link:
                                list_validate.add(link)

                def start_spider():
                    # runner.crawl() returns a Deferred immediately
                    d = runner.crawl(Crawler)

                    # Validation step, run as a callback once the crawl has finished
                    def start(d):
                        for link in list_validate:
                            error = http.request("GET", link)
                            if error.status == 200:
                                list_final.add(link)
                        original_stdout = sys.stdout
                        with open('templates/file.txt', 'w') as f:
                            sys.stdout = f
                            for link in list_final:
                                print(link)
                        # Restore stdout after writing the file
                        sys.stdout = original_stdout

                    d.addCallback(start)

                def run():
                    reactor.run(0)

                unique_id = uuid4().__str__()
                executor.submit_stored(unique_id, start_spider)
                executor.submit(run)
                return redirect(url_for('crawling', id=unique_id))

            elif error.status != 200:
                return render_template('index.html')

        except requests.ConnectionError as exception:
            return render_template('index.html')
    else:
        return render_template('index.html')

@app.route('/crawling-<string:id>')
def crawling(id):
    if not executor.futures.done(id):
        return render_template('start-crawl.html', refresh=True)
    else:
        executor.futures.pop(id)
        return render_template('finish-crawl.html')

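The problem is that it only renders start-crawl.html while crawling, not while validating. So basically, it takes the URL and crawls it while rendering start-crawl.html. Then, while it is still validating, it goes to finish-crawl.html.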

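I believe the problem may be in start_spider(), in the line d.addCallback(start). I think that line may be executing in the background, which I don't want. What I believe is happening is that in start_spider(), d = runner.crawl(Crawler) is executed, and then d.addCallback(start) happens in the background, which is why I get finish-crawl.html while validation is still in progress. I want the whole function to execute in the background, not just that part. That is why I have: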

executor.submit_stored(unique_id, start_spider)

I want this code to take the URL, then crawl and validate it while rendering start-crawl.html. Then, when it has finished, I want it to render finish-crawl.html.

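Anyway, if that is not the problem, does anyone know what it is and how to fix it? Please ignore the complexity of this code and anything that isn't "programming convention". Thanks in advance, everyone.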
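One possible reading of the behaviour described above: runner.crawl() returns a Twisted Deferred immediately, so start_spider() returns as soon as the callback has been attached, and the future stored under unique_id can be reported as done before start() has finished validating. The sketch below is a standard-library analogy only. It uses threads instead of Twisted, and every function name in it is hypothetical. It shows that a future for a function that merely schedules work completes before the work itself does.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

def slow_validation(results):
    # Stand-in for the validation loop in start(); purely illustrative.
    time.sleep(0.2)
    results.append("validated")

def schedule_only(results):
    # Like start_spider(): kick the work off and return immediately.
    worker = threading.Thread(target=slow_validation, args=(results,))
    worker.start()
    return worker

def schedule_and_wait(results):
    # Blocks until the work is finished, so the future reflects real completion.
    worker = threading.Thread(target=slow_validation, args=(results,))
    worker.start()
    worker.join()
    return worker

def demo():
    early_results, late_results = [], []
    with ThreadPoolExecutor(max_workers=2) as pool:
        early = pool.submit(schedule_only, early_results)
        worker = early.result()                    # future is already done here
        finished_early = len(early_results) == 1   # work is still sleeping
        worker.join()                              # clean up the background thread

        late = pool.submit(schedule_and_wait, late_results)
        late.result()
        finished_late = len(late_results) == 1
    return finished_early, finished_late
```

If this reading is right, the status check in the crawling route would need to track completion of the validation callback itself rather than the return of start_spider().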

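Looking at the code, I see that everything should work if you call the function run() at some point, because right now it is never called. Also, as mentioned in the comments, you should move the class and the functions out of the route into a separate file. Basically, you should restructure your code so that the stack can work properly, and if you need to store state, use some tmp files or at least SQLite as a queue and for the results.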

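First of all, I think most of the code you put in the index route should probably be outside the route. Every time someone goes to the root (/) route, you define a class/method and functions. Also, from what I can see, you create these things every time but never actually call them. I'm not sure whether it is just an indentation issue, and even if it is, I don't see you calling anything.

Sorry, there was an indentation error when I copied the code.

OK, now I see it, but you have a try block. Can you show what is in the except block?

I have added the rest of it.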
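To make the "move it out of the route" advice concrete, the URL handling at the top of index() could live in module-level helpers along these lines. This is a sketch, not the asker's code: normalize_url and base_domain are hypothetical names, and the prefix check uses startswith rather than the original substring test.

```python
from urllib.parse import urlparse

def normalize_url(url_input):
    """Ensure an https:// prefix and a trailing slash, like the four-branch
    'Modifying URL' block in the question."""
    url = str(url_input).strip()
    if not url.startswith("https://"):
        url = "https://" + url
    if not url.endswith("/"):
        url += "/"
    return url

def base_domain(url):
    """Last two labels of the host, as in the question.
    Naive: gives the wrong answer for suffixes such as co.uk."""
    labels = urlparse(url).netloc.split(".")
    return ".".join(labels[-2:])
```

The route would then only call normalize_url(request.form["usr_input"]) and pass the result on, which also makes these pieces easy to test without starting Flask or Scrapy.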
{% if refresh %}
    <meta http-equiv="refresh" content="5">
{% endif %}