Setting the codec / searching Elasticsearch for Unicode values from Python


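This problem is probably down to my ignorance of ELK, Python, and Unicode.

I have an index containing logstash-digested logs, including a field "req_host" that holds a host name. Using elasticsearch-py, I pull the host name out of a record and use it to search in another index. However, if the host name contains multibyte characters, the search fails with a UnicodeDecodeError. Exactly the same query works fine when I enter it from the command line with "curl -XGET". The Unicode character is a lowercase "a" with a diaeresis (two dots). Its UTF-8 encoding is C3 A4, and the Unicode code point appears to be 00E4 (the language is Swedish).

These curl commands work fine from the command line: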

 curl -XGET 'http://localhost:9200/logstash-2015.01.30/logs/_search?pretty=1' -d ' { "query" : {"match" :{"req_host" : "www.utkl\u00E4dningskl\u00E4derna.se" }}}'
 curl -XGET 'http://localhost:9200/logstash-2015.01.30/logs/_search?pretty=1' -d ' { "query" : {"match" :{"req_host" : "www.utklädningskläderna.se" }}}'

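They find and return the records.

(The second line shows how the host name appears in the log I extract it from, with the lowercase "a" with diaeresis in two places.)

I wrote a very short Python script to illustrate the problem: it uses hard-wired queries, prints each one along with its type, then tries to use it in a search.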
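
A quick way to see why both curl commands match the same record is to parse the two request bodies and compare them. This is only a sketch using the standard json module; the variable names are made up for the illustration.

```python
# -*- coding: utf-8 -*-
import json

# Body of the first curl command: the character is written as a JSON \u escape.
# A raw string keeps the backslashes intact so the JSON parser sees them.
escaped_body = r'{ "query" : {"match" :{"req_host" : "www.utkl\u00E4dningskl\u00E4derna.se" }}}'

# Body of the second curl command: the character appears literally.
literal_body = u'{ "query" : {"match" :{"req_host" : "www.utklädningskläderna.se" }}}'

# Both spellings decode to the same query.
same = json.loads(escaped_body) == json.loads(literal_body)
print(same)  # True
```

So the escape form and the literal form are two spellings of one JSON document, and the difference seen from Python has to come from how the body is handled on the client side.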

 #!/usr/bin/python
 # -*- coding: utf-8 -*-

 import json
 import elasticsearch

 es = elasticsearch.Elasticsearch()

 if __name__=="__main__":
   #uq = u'{ "query": { "match": { "req_host": "www.utklädningskläderna.se" }}}'           # raw utf-8 characters. does not work
   #uq = u'{ "query": { "match": { "req_host": "www.utkl\u00E4dningskl\u00E4derna.se" }}}' # quoted unicode characters. does not work
   #uq = u'{ "query": { "match": { "req_host": "www.utkl\uC3A4dningskl\uC3A4derna.se" }}}' # quoted utf-8 characters. does not work
   uq = u'{ "query": { "match": { "req_host": "www.facebook.com" }}}'                     # non-unicode. works fine
   print "uq", type(uq), uq
   result = es.search(index="logstash-2015.01.30",doc_type="logs",timeout=1000,body=uq);
   if result["hits"]["total"] == 0:
     print "nothing found"
   else:
     print "found some"
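
As a side note on the third commented-out variant: a \u escape names a code point, not UTF-8 bytes, so \uC3A4 does not denote the same character as \u00E4. The sketch below checks this with the standard library only; the variable names are illustrative.

```python
# -*- coding: utf-8 -*-
import unicodedata

a_umlaut = u"\u00e4"         # code point U+00E4
different_char = u"\uc3a4"   # code point U+C3A4, which lies in the Hangul Syllables block

name = unicodedata.name(a_umlaut)
utf8_bytes = a_umlaut.encode("utf-8")

print(name)                                   # LATIN SMALL LETTER A WITH DIAERESIS
print(utf8_bytes == b"\xc3\xa4")              # True: C3 A4 is the UTF-8 encoding of U+00E4
print(a_umlaut == different_char)             # False
print(len(different_char.encode("utf-8")))    # 3
```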

If I run the "facebook" query as shown, it works fine. The output is:

$python testutf8b.py
uq <type 'unicode'> { "query": { "match": { "req_host": "www.facebook.com" }}}
found some

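Note that the query string "uq" is unicode.

But if I use any of the other three strings, which contain the Unicode characters, it blows up. For example, with the second line I get: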

$python testutf8b.py
uq <type 'unicode'> { "query": { "match": { "req_host": "www.utklädningskläderna.se" }}}
Traceback (most recent call last):
   File "testutf8b.py", line 15, in <module>
    result = es.search(index="logstash-2015.01.30",doc_type="logs",timeout=1000,body=uq);
  File "build/bdist.linux-x86_64/egg/elasticsearch/client/utils.py", line 68, in _wrapped
  File "build/bdist.linux-x86_64/egg/elasticsearch/client/__init__.py", line 497, in search
  File "build/bdist.linux-x86_64/egg/elasticsearch/transport.py", line 307, in perform_request
  File "build/bdist.linux-x86_64/egg/elasticsearch/connection/http_urllib3.py", line 82, in perform_request
elasticsearch.exceptions.ConnectionError: ConnectionError('ascii' codec can't decode byte 0xc3 in position 45: ordinal not in range(128)) caused by: UnicodeDecodeError('ascii' codec can't decode byte 0xc3 in position 45: ordinal not in range(128))
$
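
The position reported in the traceback can be checked against the query itself. The sketch below encodes the second query to UTF-8 and then decodes the result as ASCII, which fails at byte 45 on 0xc3, the same position and byte as in the error message. That suggests, though it does not prove, that the body had already been encoded to UTF-8 and was then combined with text somewhere in the request path, where Python 2 falls back to an implicit ASCII decode. The helper name first_non_ascii_offset is made up for this illustration.

```python
# -*- coding: utf-8 -*-

# The second query variant from the script, as a text (unicode) string.
query = u'{ "query": { "match": { "req_host": "www.utkl\u00e4dningskl\u00e4derna.se" }}}'


def first_non_ascii_offset(data):
    """Return the byte offset at which ASCII decoding of `data` fails, or None."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


encoded = query.encode("utf-8")
offset = first_non_ascii_offset(encoded)

print(offset)                                      # 45
print(encoded[offset:offset + 2] == b"\xc3\xa4")   # True
```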

Note again that the query string is a unicode string (and yes, the source line is the one with the \u00E4 characters).

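I'd really like to resolve this. I've tried various combinations of uq = uq.encode("utf-8") and uq = uq.decode("utf-8"), but nothing seems to help. I'm beginning to wonder whether there's an issue in the elasticsearch-py library.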

Thanks!

PS: This is under CentOS 7, using ES 1.5.0. The logs were ingested into ES under a slightly older version, using logstash-1.4.2.

Basically, you don't need to pass the body as a string. Use native Python data structures, or convert them on the fly. Give this a try:

>>> import elasticsearch
>>> es = elasticsearch.Elasticsearch()
>>> es.index(index='unicode-index', body={'host': u'www.utklädningskläderna.se'}, doc_type='log')

{u'_id': u'AUyGJuFMy0qdfghJ6KwJ',
 u'_index': u'unicode-index',
 u'_type': u'log',
 u'_version': 1,
 u'created': True}

>>> es.search(index='unicode-index', body={}, doc_type='log')

{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5},
 u'hits': {u'hits': [{u'_id': u'AUyBTz5CsiBSSvubLioQ',
    u'_index': u'unicode-index',
    u'_score': 1.0,
    u'_source': {u'host': u'www.utkl\xe4dningskl\xe4derna.se'},
    u'_type': u'log'}],
  u'max_score': 1.0,
  u'total': 1},
 u'timed_out': False,
 u'took': 5}

>>> es.search(index='unicode-index', body={'query': {'match': {'host': u'www.utklädningskläderna.se'}}}, doc_type='log')

{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5},
 u'hits': {u'hits': [{u'_id': u'AUyBTz5CsiBSSvubLioQ',
    u'_index': u'unicode-index',
    u'_score': 0.30685282,
    u'_source': {u'host': u'www.utkl\xe4dningskl\xe4derna.se'},
    u'_type': u'log'}],
  u'max_score': 0.30685282,
  u'total': 1},
 u'timed_out': False,
 u'took': 122}

>>> import json

>>> body={'query': {'match': {'host': u'www.utklädningskläderna.se'}}}

>>> es.search(index='unicode-index', body=body, doc_type='log')

{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5},
 u'hits': {u'hits': [{u'_id': u'AUyBTz5CsiBSSvubLioQ',
    u'_index': u'unicode-index',
    u'_score': 0.30685282,
    u'_source': {u'host': u'www.utkl\xe4dningskl\xe4derna.se'},
    u'_type': u'log'}],
  u'max_score': 0.30685282,
  u'total': 1},
 u'timed_out': False,
 u'took': 4}

>>> es.search(index='unicode-index', body=json.dumps(body), doc_type='log')

{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5},
 u'hits': {u'hits': [{u'_id': u'AUyBTz5CsiBSSvubLioQ',
    u'_index': u'unicode-index',
    u'_score': 0.30685282,
    u'_source': {u'host': u'www.utkl\xe4dningskl\xe4derna.se'},
    u'_type': u'log'}],
  u'max_score': 0.30685282,
  u'total': 1},
 u'timed_out': False,
 u'took': 5}

>>> json.dumps(body)
'{"query": {"match": {"host": "www.utkl\\u00e4dningskl\\u00e4derna.se"}}}'
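
One reason the json.dumps(body) variant is safe to hand over as a string: by default json.dumps escapes every non-ASCII character, so the serialized body contains ASCII only and nothing is left for an ASCII codec to trip over. A minimal sketch with the standard library, no Elasticsearch connection involved:

```python
# -*- coding: utf-8 -*-
import json

body = {"query": {"match": {"host": u"www.utkl\u00e4dningskl\u00e4derna.se"}}}

serialized = json.dumps(body)          # ensure_ascii=True is the default
serialized.encode("ascii")             # succeeds: only ASCII characters are present

# Parsing the escaped form gives back the original query, umlauts included.
round_trip = json.loads(serialized)
print(round_trip == body)              # True
```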

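I've determined that it is possible to make this query from Python by running *.encode('utf-8') on the query string and sending it over a raw socket with the appropriate HTTP headers. The same approach doesn't seem to work with elasticsearch-py's search.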
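
For illustration, here is a rough sketch of what building such a raw request could look like. It only assembles the bytes and does not open a socket. The function name build_search_request, the choice of POST, and the exact header set are assumptions made for this example, not something taken from the post. The detail worth noting is that Content-Length has to count the encoded bytes, not the characters.

```python
# -*- coding: utf-8 -*-
import json


def build_search_request(host, path, query):
    """Return the raw bytes of an HTTP/1.1 request carrying `query` as UTF-8 JSON."""
    # Keep the characters literal, then encode the whole body to UTF-8 bytes.
    payload = json.dumps(query, ensure_ascii=False).encode("utf-8")
    head = (
        "POST {path} HTTP/1.1\r\n"
        "Host: {host}\r\n"
        "Content-Type: application/json; charset=UTF-8\r\n"
        "Content-Length: {length}\r\n"   # length in bytes, not characters
        "Connection: close\r\n"
        "\r\n"
    ).format(path=path, host=host, length=len(payload))
    return head.encode("ascii") + payload


request = build_search_request(
    "localhost:9200",
    "/logstash-2015.01.30/logs/_search",
    {"query": {"match": {"req_host": u"www.utkl\u00e4dningskl\u00e4derna.se"}}},
)
head, _, payload = request.partition(b"\r\n\r\n")
# Each of the two umlauts takes two bytes, so the payload is two bytes longer
# than its character count.
print(len(payload) - len(payload.decode("utf-8")))   # 2
```

The resulting bytes could then be written to a socket connected to the Elasticsearch HTTP port.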