Python 2.7: Removing duplicates after a bulk insert

python-2.7, elasticsearch, bigdata

I have an index with documents of this kind:

{
    "email": email,
    "data": {
        domain: [{
            "purchase_date": date,
            "amount": amount
        }]
    }
}

This is the Python method I wrote to insert the data into ES:

# 1: check if mail exists
mailExists = es.exists(index=index_param, doc_type=doctype_param, id=email)

# if the email does not exist => insert the entire doc
if not mailExists:
    doc = {
        "email": email,
        "data": {
            domain: [{
                "purchase_date": purchase_date,
                "amount": amount
            }]
        }
    }

    res = es.index(index=index_param, doc_type=doctype_param, id=email, body=doc)
# 2: the email already exists => check whether the domain is already present
else:
    query = es.get(index=index_param, doc_type=doctype_param, id=email)
    # save json content into mydata
    mydata = query['_source']['data']

    # if the domain exists => check whether this 'purchase_date' is already stored
    if domain in mydata:
        differentPurchaseDate = True
        for element in mydata[domain]:
            if element['purchase_date'] == purchase_date:
                differentPurchaseDate = False
                break
        # if the 'purchase_date' is not stored yet => append it to the current domain
        if differentPurchaseDate:
            es.update(index=index_param, doc_type=doctype_param, id=email,
                body={
                    "script": {
                        "inline": "ctx._source.data['" + domain + "'].add(params.newPurchaseDate)",
                        "params": {
                            "newPurchaseDate": {
                                "purchase_date": purchase_date,
                                "amount": amount
                            }
                        }
                    }
                })

    # domain not present yet => add the entire domain
    else:
        es.update(index=index_param, doc_type=doctype_param, id=email,
            body={
                "script": {
                    "inline": "ctx._source.data['" + domain + "'] = params.newDomain",
                    "params": {
                        "newDomain": [{
                            "purchase_date": purchase_date,
                            "amount": amount
                        }]
                    }
                }
            })

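The problem is that with this algorithm every inserted row takes about 50 seconds, and I am working with very large files. So I wondered: is it possible to reduce the import time by doing a bulk insert for each file and removing the duplicates after each file has been processed? Thanks.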

Try using parallel bulk (helpers.parallel_bulk).
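A minimal sketch of what a parallel bulk load could look like, not the answerer's own script. It assumes each input row is an (email, domain, purchase_date, amount) tuple, merges the rows in memory so repeated purchase dates are dropped before anything is sent, and then passes one index action per email to helpers.parallel_bulk. The names merge_rows, build_actions and bulk_load are hypothetical. An index action replaces any document already stored under the same id, so this only fits the case where a single run sees every row for a given email.

```python
from collections import OrderedDict


def merge_rows(rows):
    """Group rows by email and domain, dropping repeated purchase dates."""
    merged = OrderedDict()
    for email, domain, purchase_date, amount in rows:
        purchases = merged.setdefault(email, {}).setdefault(domain, [])
        if all(p["purchase_date"] != purchase_date for p in purchases):
            purchases.append({"purchase_date": purchase_date, "amount": amount})
    return merged


def build_actions(merged, index, doc_type):
    """Yield one bulk 'index' action per email (the email is the document id)."""
    for email, data in merged.items():
        yield {
            "_op_type": "index",
            "_index": index,
            "_type": doc_type,  # assumption: a cluster that still uses doc types, as in the question
            "_id": email,
            "_source": {"email": email, "data": data},
        }


def bulk_load(es, rows, index, doc_type, thread_count=4):
    """Send the merged documents with helpers.parallel_bulk; return failed items."""
    # Imported here so the two pure helpers above can be used without the client installed.
    from elasticsearch import helpers

    failures = []
    actions = build_actions(merge_rows(rows), index, doc_type)
    # parallel_bulk is a lazy generator: nothing is sent until it is iterated.
    for ok, info in helpers.parallel_bulk(es, actions, thread_count=thread_count):
        if not ok:
            failures.append(info)
    return failures
```

With es = Elasticsearch() and the rows of one file in a list, bulk_load(es, rows, index_param, doctype_param) would send the whole file in a handful of bulk requests instead of several requests per row.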

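If you also want to batch the get and exists queries, you should use Elasticsearch's msearch (multi search) query. In that case you build an ordered list of queries, and you have to change the structure of your script: you receive a single response containing an ordered list of results for all the exists or get queries, so you cannot use the if-else statements the way you do now. If you can provide more information, I will help you implement the multi search query.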

Here is an example of an mget query for the get queries:

 emails = [ <list_of_email_ID_values> ]
 results = es.mget(index = index_param,
                doc_type = doctype_param,
                body = {'ids': emails})
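An mget response carries a "docs" list with one entry per requested id; each entry has an "_id" and a "found" flag, and a "_source" when the document exists. A minimal sketch of how the per-row exists/get branching might be replaced by one pass over that list, assuming the standard response shape. The helper name split_mget_response is hypothetical.

```python
def split_mget_response(response):
    """Split an mget response into existing sources and missing ids."""
    existing = {}  # email id -> stored _source
    missing = []   # email ids with no document yet
    for doc in response.get("docs", []):
        if doc.get("found"):
            existing[doc["_id"]] = doc["_source"]
        else:
            missing.append(doc["_id"])
    return existing, missing
```

Calling existing, missing = split_mget_response(results) on the result of the mget call separates the emails that need a fresh document from those whose stored data has to be compared and updated.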

Hi @lupanoide, thanks for your suggestion. After running the script, I got the following error:

for success, info in helpers.parallel_bulk(client=es, action=paramL, thread_count=4):
TypeError: parallel_bulk() takes at least 2 arguments (2 given)

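@bit That is very strange, because three arguments are passed to the method: the client (es), the action (paramL) and the thread count (4). Can you check with a print statement whether one of these three arguments is inconsistent? If you just copied and pasted my script without modifying it, you have to define the es variable, es = Elasticsearch(), and then it works.

My es variable was empty, which is why I got that error. So, one question: inserting 3181378 records this way takes 4 hours. Looking ahead, is there a more efficient way to insert even more data? Thanks.

Yes, you can use msearch for the exists queries and mget for the get queries, and batch them. I will add an example to my answer. Keep in mind that this way all the queries run in a single loop, so you have to change the if-else conditions.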
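Batching the lookups means sending the ids in fixed-size groups instead of one request per row or one request for the whole file. A small sketch of such a splitter, assuming the ids fit in a list; the name chunked and the default batch size of 1000 are illustrative choices, not values from this thread.

```python
def chunked(items, size=1000):
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Each batch could then be passed to es.mget as body={'ids': batch}, so a file with millions of emails turns into a few thousand lookups.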
