Python: What is the fastest way to delete a collection from Firestore?

python google-app-engine flask google-cloud-firestore

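I have an app that loads millions of documents into a collection, using 30-80 workers to load the data simultaneously. Sometimes the loading process does not complete cleanly. With other databases I could simply drop the table and start over, but that is not possible with a Firestore collection. I have to list the documents and delete them, and I have not found a way to scale this to the same capacity as my loading process. What I do now is use two App Engine hosted Flask/Python methods: one fetches a page of 1000 documents and passes it to the other, which deletes them. This way, listing documents is not blocked by deleting them. It still takes days to complete, which is too long.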

The method that gets the list of documents and creates tasks to delete them, which is single-threaded:

@app.route('/delete_collection/<collection_name>/<batch_size>', methods=['POST'])
def delete_collection(collection_name, batch_size):
    batch_size = int(batch_size)
    coll_ref = db.collection(collection_name)
    print('Received request to delete collection {} {} docs at a time'.format(
        collection_name,
        batch_size
    ))
    num_docs = batch_size
    while num_docs >= batch_size:
        docs = coll_ref.limit(batch_size).stream()
        found = 0
        deletion_request = {
            'doc_ids': []
        }
        for doc in docs:
            deletion_request['doc_ids'].append(doc.id)
            found += 1
        num_docs = found
        print('Creating request to delete docs: {}'.format(
            json.dumps(deletion_request)
        ))
        # Add to task queue
        queue = tasks_client.queue_path(PROJECT_ID, LOCATION, 'database-manager')

        task_meet = {
            'app_engine_http_request': {  # Specify the type of request.
                'http_method': 'POST',
                'relative_uri': '/delete_documents/{}'.format(
                    collection_name
                ),
                'body': json.dumps(deletion_request).encode(),
                'headers': {
                    'Content-Type': 'application/json'
                }
            }
        }
        task_response_meet = tasks_client.create_task(queue, task_meet)
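        # Report the number of documents actually found; the last page may be short.
        print('Created task to delete {} docs: {}'.format(
            found,
            json.dumps(deletion_request)
        ))
    # A Flask view must return a response once the loop finishes.
    return 'Finished'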

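Below is the method I use to delete the documents, and it does scale. In practice it only processes 5-10 requests at a time, limited by the rate at which the other method passes it pages of document IDs to delete. Separating the two helps, but not that much.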

@app.route('/delete_documents/<collection_name>', methods=['POST'])
def delete_documents(collection_name):
    # Validate we got a body in the POST
    if flask.request.json:
        print('Request received to delete docs from :{}'.format(collection_name))
    else:
        message = 'No json found in request: {}'.format(flask.request)
        print(message)
        return message, 400

    # Validate that the payload includes a list of doc_ids
    doc_ids = flask.request.json.get('doc_ids', None)
    if doc_ids is None:
        return 'No doc_ids specified in payload: {}'.format(flask.request.json), 400
    print('Received request to delete docs: {}'.format(doc_ids))
    for doc_id in doc_ids:
        db.collection(collection_name).document(doc_id).delete()
    return 'Finished'


if __name__ == '__main__':
    # Set environment variables for running locally
    app.run(host='127.0.0.1', port=8080, debug=True)

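I have tried running multiple concurrent executions of delete_collection(), but I am not sure that even helps, because I do not know whether each call to limit(batch_size).stream() gets a distinct set of documents or possibly returns duplicates.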

How can I make this run faster?
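
On the duplicate-page concern: a query that is re-run with only limit() and no cursor starts from the beginning of the collection each time, so until the earlier deletes have actually been applied, repeated or concurrent calls can return the same documents. The following is a minimal sketch of cursor-based paging, not the asker's code. The helper name `iter_pages` is hypothetical, and ordering by the `__name__` (document ID) field is an assumption made to give the cursor a stable position.

```python
def iter_pages(query, page_size):
    """Yield lists of document snapshots, each page resuming after the previous one.

    `query` is assumed to be a Firestore CollectionReference or Query.
    """
    # Assumption: order by document name so the cursor has a stable position.
    ordered = query.order_by('__name__')
    last = None
    while True:
        page_query = ordered.limit(page_size)
        if last is not None:
            # Resume strictly after the last snapshot of the previous page.
            page_query = page_query.start_after(last)
        page = list(page_query.stream())
        if not page:
            return
        yield page
        if len(page) < page_size:
            # A short page means the collection is exhausted.
            return
        last = page[-1]


# Usage, assuming `db = firestore.Client()` as in the question:
# for page in iter_pages(db.collection('my_collection'), 1000):
#     doc_ids = [snapshot.id for snapshot in page]
#     ...enqueue one deletion task per page...
```

With a cursor, the listing step no longer depends on the deletes having finished, so each enqueued task receives a distinct set of IDs.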

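This article describes how to use a callable Cloud Function to take advantage of the firestore delete command, which can delete up to 4000 documents per second.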

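Here is the simple Python script I used to test batch deletion. As @Chris32 said, batch mode will delete thousands of documents per second if the latency is not too bad.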

from time import time
from uuid import uuid4
from google.cloud import firestore

DB = firestore.Client()

def generate_user_data(entries = 10):
    print('Creating {} documents'.format(entries))
    now = time()
    batch = DB.batch()
    for counter in range(entries):
        # Each transaction or batch of writes can write to a maximum of 500 documents.
        # https://cloud.google.com/firestore/quotas#writes_and_transactions
        if counter % 500 == 0 and counter > 0:
            batch.commit()

        user_id = str(uuid4())
        data = {
            "some_data": str(uuid4()),
            "expires_at": int(now)
            }
        user_ref = DB.collection(u'users').document(user_id)
        batch.set(user_ref, data)
    batch.commit()
    print('Wrote {} documents in {:.2f} seconds.'.format(entries, time() - now))

def delete_one_by_one():
    print('Deleting documents one by one')
    now = time()
    docs = DB.collection(u'users').where(u'expires_at', u'<=', int(now)).stream()
    counter = 0
    for doc in docs:
        doc.reference.delete()
        counter = counter + 1
    print('Deleted {} documents in {:.2f} seconds.'.format(counter, time() - now))

def delete_in_batch():
    print('Deleting documents in batch')
    now = time()
    docs = DB.collection(u'users').where(u'expires_at', u'<=', int(now)).stream()
    batch = DB.batch()
    counter = 0
    for doc in docs:
        counter = counter + 1
        if counter % 500 == 0:
            batch.commit()
        batch.delete(doc.reference)
    batch.commit()
    print('Deleted {} documents in {:.2f} seconds.'.format(counter, time() - now))


generate_user_data(10)
delete_one_by_one()
print('###')
generate_user_data(10)
delete_in_batch()
print('###')
generate_user_data(2000)
delete_in_batch()
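
The same batching idea can be applied to the list of document IDs that the question's delete_documents handler receives, instead of deleting them one at a time. The sketch below is a hedged illustration rather than a tested production approach: the helper names (`chunked`, `delete_chunk`, `delete_by_ids`) and the `max_workers` value are hypothetical, the 500-write chunk size follows the limit cited in the script above, and it assumes a single Firestore client can safely be shared across threads.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def delete_chunk(db, collection_name, doc_ids):
    """Delete one chunk of documents with a single batched commit."""
    batch = db.batch()
    coll_ref = db.collection(collection_name)
    for doc_id in doc_ids:
        batch.delete(coll_ref.document(doc_id))
    batch.commit()
    return len(doc_ids)


def delete_by_ids(db, collection_name, doc_ids, batch_size=500, max_workers=8):
    """Delete documents by ID, committing several batches concurrently.

    Assumption: `db` (a firestore.Client) may be shared across threads.
    """
    chunks = chunked(doc_ids, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        counts = pool.map(
            lambda chunk: delete_chunk(db, collection_name, chunk), chunks
        )
        return sum(counts)


# Usage, assuming `DB = firestore.Client()` as in the script above:
# deleted = delete_by_ids(DB, 'users', doc_ids)
```

Each thread builds its own batch, so no batch object is shared between workers; only the client is.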

Here is what I came up with. It is not very fast (120-150 documents per second), but none of the other examples I found in Python worked at all:

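from datetime import datetime, timedelta

from google.cloud import firestore

# Placeholder values: the original answer does not show how these were defined.
DOCUMENT_EXPIRATION_DAYS = 30
collection_name = 'my_collection'

db = firestore.Client()

now = datetime.now()
then = now - timedelta(days=DOCUMENT_EXPIRATION_DAYS)
doc_counter = 0
commit_counter = 0
limit = 5000
while True:
    print('Getting next doc handler')
    # Fetch the next page of expired documents, oldest first.
    docs = list(
        db.collection(collection_name)
        .where('id.time', '<=', then)
        .order_by('id.time', direction=firestore.Query.ASCENDING)
        .limit(limit)
        .stream()
    )
    batch = db.batch()
    for doc in docs:
        doc_counter += 1
        # Commit every 500 deletes to stay within the per-batch write limit.
        if doc_counter % 500 == 0:
            commit_counter += 1
            print('Committing batch {} from {}'.format(commit_counter, doc.to_dict()['id']['time']))
            batch.commit()
        batch.delete(doc.reference)
    batch.commit()
    # A full page means more documents may remain; a short page means we are done.
    if len(docs) < limit:
        break

print('Deleted {} documents in {:.2f} seconds.'.format(
    doc_counter, (datetime.now() - now).total_seconds()))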

Yes, I saw that. Unfortunately, the only example is in Node. I am looking for a way to do this in Python.

Don't you get a bunch of deadline errors after 15-20 batches?

This worked for me, but I had to retry it frequently before it got through all the documents.
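
For the deadline errors and manual retries mentioned in these comments, one option is to wrap each commit in a retry loop with exponential backoff. This is a generic sketch, not something taken from the answers above: `commit_with_retry` and its parameters are hypothetical names, and it assumes that a failed `batch.commit()` leaves the batch's pending writes in place so that calling it again re-sends them.

```python
import random
import time


def commit_with_retry(commit, retryable=(Exception,), max_attempts=5,
                      base_delay=1.0, sleep=time.sleep):
    """Call `commit()` and retry with exponential backoff on retryable errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return commit()
        except retryable:
            if attempt == max_attempts:
                # Out of attempts: let the caller see the last error.
                raise
            # Wait 1x, 2x, 4x, ... the base delay, plus up to 10% jitter.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 10))


# Usage with a Firestore batch, retrying only on deadline errors:
# from google.api_core.exceptions import DeadlineExceeded
# commit_with_retry(batch.commit, retryable=(DeadlineExceeded,))
```

Narrowing `retryable` to the specific exception you observe avoids retrying on errors that will never succeed, such as permission failures.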