Selenium-based scrape requests to a Flask API deployed with Gunicorn and NGINX load indefinitely
I am writing a scraper that fetches transaction data from a bank, and I am using Flask to create an API endpoint for it so that another program (test_scraper.py) can make a GET request and retrieve the data. On a GET call to the endpoint, I want to trigger the Selenium-based scraping function scrape_bank() and return the corresponding transactions as JSON.

When I run the Flask app on localhost and send requests to it, it works perfectly well across different data sizes and scrape durations.

app.py:
from flask import Flask, render_template, request, jsonify
from flask_restful import Resource, Api, reqparse
from BankScraper import scrape_bank
import sys
import json
import traceback
import os
import re
import logging

app = Flask(__name__)
api = Api(app)

__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
CHROMEDRIVER_PATH = '/usr/local/bin/chromedriver'

# When running under Gunicorn, forward Flask's logs to Gunicorn's error logger
if __name__ != '__main__':
    gunicorn_logger = logging.getLogger('gunicorn.error')
    app.logger.handlers = gunicorn_logger.handlers
    app.logger.setLevel(gunicorn_logger.level)


class BankTransactions(Resource):
    # scrape and return bank transactions
    def get(self):
        """
        Request format:
        http://127.0.0.1:5000/banktransactions?userId=<yashkukadia>&password1=<pass1>&password2=<pass2>
        Values within <> will be replaced for different users.
        :return:
        """
        try:
            parser = reqparse.RequestParser()  # initialize
            parser.add_argument('userId', required=False)  # add args
            parser.add_argument('password1', required=False)
            parser.add_argument('password2', required=False)
            args = parser.parse_args()  # parse arguments to dictionary

            userID = args['userId']
            password1 = args['password1']
            password2 = args['password2']

            # if no credentials are received, set defaults
            if userID is None:
                userID = "username"
            if password1 is None:
                password1 = "password1"
            if password2 is None:
                password2 = "password2"

            results = scrape_bank(username=userID, password_one=password1,
                                  password_two=password2, CHROMEDRIVER_PATH=CHROMEDRIVER_PATH)
            print("Results in GET method of /banktransactions")
            print(results)
            print(type(results))
            if results:
                print("Returning results")
                return results, 200
            else:
                return {'Result': 'Empty dictionary'}, 200
        except Exception:
            var = traceback.format_exc()
            return {'Message': 'Error: ' + var}, 500


api.add_resource(BankTransactions, '/banktransactions')  # bank transactions endpoint


@app.route('/')
def hello_world():
    """Default route"""
    return jsonify('hello world')


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000, debug=True)
test_scraper.py:

import requests
import time

start_time = time.time()

# api-endpoint
URL = "http://52.175.56.98:5000/banktransactions"

# sending get request and saving the response as response object
r = requests.get(url=URL)

# extracting data in json format
data = r.json()

print(data)
print("--- {} seconds ---".format(time.time() - start_time))
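Note that the test client above passes no `timeout=` to `requests.get`, so it will block forever if the server never answers. A minimal, self-contained sketch of how a client-side timeout turns an indefinite hang into an exception, using a throwaway local `http.server` as a stand-in for the real endpoint (stdlib `urllib` is used here only so the sketch runs on its own; `requests.get` accepts the same `timeout=` parameter):

```python
import socket
import threading
import time
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class SlowHandler(BaseHTTPRequestHandler):
    """Stands in for a scrape that takes longer than the client will wait."""
    def do_GET(self):
        time.sleep(5)  # simulate a slow scrape
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b'{"Result": "done"}')

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet

server = HTTPServer(('127.0.0.1', 0), SlowHandler)  # port 0 picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:{}/banktransactions'.format(server.server_port)

try:
    # Without timeout= this call would block for the whole "scrape"; with it,
    # a hung request surfaces as an exception the caller can act on.
    urllib.request.urlopen(url, timeout=1)
    outcome = 'responded'
except (socket.timeout, urllib.error.URLError):
    outcome = 'timed out'

print(outcome)
```

This does not fix the server-side hang, but it at least lets the caller fail fast and log it instead of waiting forever.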
When I run this script for fewer accounts/transactions (less data, both in size and in the time needed to scrape), I get the desired results through the test script, by opening the link in a browser, or by sending the request with Postman.

However, when the scraper takes a long time to fetch the data from the website, the GET request loads/runs indefinitely: nothing is returned and the connection is never closed.

I turned on the output logs to check whether anything was going wrong inside the scrape_bank() function and debugged the program: it does fetch the data and prints "Returning results", but the subsequent return statement in app.py does not seem to execute. Another thing: on a successfully completed request the output log shows "Closing connection", which does not appear in this case.
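One thing I suspect but have not confirmed from the logs: Gunicorn's default sync workers are killed if a request takes longer than the worker timeout (30 seconds by default), and NGINX's `proxy_read_timeout` defaults to 60 seconds, so a scrape running longer than either limit could leave the client hanging exactly like this. A sketch of raising both limits (the `app:app` module path and the 300s value are placeholders for illustration, not my actual config):

```shell
# Gunicorn: raise the worker timeout (default 30s) so a long Selenium
# scrape is not killed mid-request.
gunicorn --workers 3 --timeout 300 --bind 127.0.0.1:8000 app:app

# NGINX server block: raise the proxy read timeout (default 60s) to match.
# location /banktransactions {
#     proxy_pass http://127.0.0.1:8000;
#     proxy_read_timeout 300s;
# }
```

If a worker is being killed, the Gunicorn error log would typically show a `[CRITICAL] WORKER TIMEOUT` line; I have not seen one, but raising the limits seems like a reasonable first check.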
Although the scraper fetches the transactions, the API does not seem to return them, and I would appreciate some help. Please feel free to ask if any other details are needed.