Elasticsearch analyzer tokenization of alphanumeric values containing dots
Tags: elasticsearch, analyzer
I have a text field with this value:
term1-term2-term3-term4-term5-RWHPSA951000155.2013-05-27.log
When I check it with the Analyze API (default analyzer), I get:
{
"tokens": [
{
"token": "text",
"start_offset": 2,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "term1",
"start_offset": 9,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "term2",
"start_offset": 15,
"end_offset": 20,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "term3",
"start_offset": 21,
"end_offset": 26,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "term4",
"start_offset": 27,
"end_offset": 32,
"type": "<ALPHANUM>",
"position": 5
},
{
"token": "term5",
"start_offset": 33,
"end_offset": 38,
"type": "<ALPHANUM>",
"position": 6
},
{
"token": "rwhpsa951000155.2013",
"start_offset": 39,
"end_offset": 59,
"type": "<ALPHANUM>",
"position": 7
},
{
"token": "05",
"start_offset": 60,
"end_offset": 62,
"type": "<NUM>",
"position": 8
},
{
"token": "27",
"start_offset": 63,
"end_offset": 65,
"type": "<NUM>",
"position": 9
},
{
"token": "log",
"start_offset": 66,
"end_offset": 69,
"type": "<ALPHANUM>",
"position": 10
}
]
}
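I am specifically curious about this token: rwhpsa951000155.2013. How did this happen? Currently my search for RWHPSA951000155 fails because of it. How can I make the analyzer recognize RWHPSA951000155 and 2013 as separate tokens?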
Note that if the value is term1-term2-term3-term4-term5-RWHPSA.2013-05-27.log, RWHPSA and 2013 are split into separate tokens. So the behavior has something to do with 951000155.
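Thanks.

The analyzer is tokenizing rwhpsa951000155.2013 as a product number. From the Lucene documentation: it splits words at hyphens, unless there is a number in the token, in which case the whole token is interpreted as a product number and is not split.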
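The quoted rule talks about hyphens, while the token in question is joined by a dot. The following is a minimal sketch, assuming (this is not stated in the thread) that the tokenizer keeps a dot inside a token only when it sits between two digits or between two letters. The function names `keeps_dot` and `sketch_tokenize` are hypothetical, and this is a rough approximation for illustration, not the Lucene implementation. Under that assumption it reproduces the tokens reported in the question for both values.

```python
def keeps_dot(prev_ch, next_ch):
    """Assumed rule: a dot stays inside a token only between two digits or two letters."""
    return (prev_ch.isdigit() and next_ch.isdigit()) or (
        prev_ch.isalpha() and next_ch.isalpha()
    )


def sketch_tokenize(text):
    """Rough approximation for illustration only; not the real tokenizer."""
    tokens, current = [], ""
    for i, ch in enumerate(text):
        prev_ch = text[i - 1] if i > 0 else ""
        next_ch = text[i + 1] if i + 1 < len(text) else ""
        if ch.isalnum():
            current += ch
        elif ch == "." and keeps_dot(prev_ch, next_ch):
            # "155.2013": digit on both sides, so the dot does not split the token.
            current += ch
        else:
            # Hyphens, and dots such as "A.2" or "7.l", end the current token.
            if current:
                tokens.append(current.lower())
            current = ""
    if current:
        tokens.append(current.lower())
    return tokens


print(sketch_tokenize("term5-RWHPSA951000155.2013-05-27.log"))
print(sketch_tokenize("term5-RWHPSA.2013-05-27.log"))
```

In the first value the dot sits between the digits 5 and 2, so the sketch keeps rwhpsa951000155.2013 whole. In the second it sits between the letter A and the digit 2, so RWHPSA and 2013 come out separately, matching the observation in the question.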
You can add a pattern_replace char filter that replaces "." with a space. The standard tokenizer will then split the terms the way you want.
PUT /test
{
  "settings": {
    "index": {
      "analysis": {
        "char_filter": {
          "my_pattern": {
            "type": "pattern_replace",
            "pattern": "\\.",
            "replacement": " "
          }
        },
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "standard",
            "char_filter": [
              "my_pattern"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "test": {
          "type": "string",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}
Calling the Analyze API:
curl -XGET 'localhost:9200/test/_analyze?analyzer=my_analyzer&pretty=true' -d 'term1-term2-term3-term4-term5-RWHPSA951000155.2013-05-27.log'
Returns:
{
"tokens" : [ {
"token" : "term1",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "term2",
"start_offset" : 6,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 2
}, {
"token" : "term3",
"start_offset" : 12,
"end_offset" : 17,
"type" : "<ALPHANUM>",
"position" : 3
}, {
"token" : "term4",
"start_offset" : 18,
"end_offset" : 23,
"type" : "<ALPHANUM>",
"position" : 4
}, {
"token" : "term5",
"start_offset" : 24,
"end_offset" : 29,
"type" : "<ALPHANUM>",
"position" : 5
}, {
"token" : "RWHPSA951000155",
"start_offset" : 30,
"end_offset" : 45,
"type" : "<ALPHANUM>",
"position" : 6
}, {
"token" : "2013",
"start_offset" : 46,
"end_offset" : 50,
"type" : "<NUM>",
"position" : 7
}, {
"token" : "05",
"start_offset" : 51,
"end_offset" : 53,
"type" : "<NUM>",
"position" : 8
}, {
"token" : "27",
"start_offset" : 54,
"end_offset" : 56,
"type" : "<NUM>",
"position" : 9
}, {
"token" : "log",
"start_offset" : 57,
"end_offset" : 60,
"type" : "<ALPHANUM>",
"position" : 10
} ]
}
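To make the effect of the char filter concrete, here is a minimal Python sketch of the same idea, run outside Elasticsearch. The helper names `char_filter` and `rough_tokenize` are hypothetical, and the regex split is only a stand-in for the standard tokenizer, so treat this as an illustration rather than a reproduction of the analysis chain.

```python
import re


def char_filter(text):
    # Same idea as the pattern_replace char filter: each literal dot becomes a space.
    return re.sub(r"\.", " ", text)


def rough_tokenize(text):
    # Stand-in for the tokenizer: keep runs of letters and digits, drop the rest.
    return re.findall(r"[A-Za-z0-9]+", text)


value = "term1-term2-term3-term4-term5-RWHPSA951000155.2013-05-27.log"
filtered = char_filter(value)

print(filtered)
print(rough_tokenize(filtered))
```

Because each dot is replaced by exactly one character, the filtered text has the same length as the input, which is consistent with the offsets in the response above still lining up with the original string. The tokens also keep their original case, as in that response, since my_analyzer as defined has no lowercase token filter.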
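Comments:

Thanks! I had been going through the ES docs to understand how the standard analyzer works and never thought to look at Lucene's documentation. By the way, I can assign an analyzer to a field in the mapping. How do I make an analyzer the default for the index? The ES docs say: "a default analyzer can be configured using the default_index logical name, which will be used at index time". Will that work? I already have mappings defined and data in the index. How do I update only the default analyzer?

Yes, it works. In the settings example above, change "my_analyzer" to "default_index". To update the analyzer, see this link:

I tried this link, but something seems to have gone wrong. Now, when I use the Analyze API with the text mentioned, I get {"tokens":[]}. When I run curl -XGET 'http://localhost:9200/myindex/_settings?pretty', I can see the new analyzer. How do I remove all analyzers and filters and revert to the old state? Before this, I had never used the analysis {} section in the settings.

The "." in the filter needs to be escaped, i.e. "\\.". Please try updating again.

Dan, I just observed something else. It is splitting the tokens as expected, but I cannot search by RWHPSA951000155 against the existing data. Before, I was able to search by RWHPSA951000155.2013. Now that has stopped working too. I don't want to drag this out, but do you have any idea what could be causing it? Do I need to reindex my data?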
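As a hedged illustration of the update discussed in these comments, the sketch below only builds the requests and does not send them. It assumes the index is called myindex (as in the curl command above), uses the default_index analyzer name quoted from the ES docs in the comments, and assumes the usual close, update settings, open sequence for changing analysis settings on an existing index. The helper `build_requests` is hypothetical. It also shows the escaping point: the Python string `"\\."` is a backslash followed by a dot, and it serializes to `"\\."` in the JSON body.

```python
import json

INDEX = "myindex"  # index name taken from the comment above

analysis_settings = {
    "analysis": {
        "char_filter": {
            "my_pattern": {
                "type": "pattern_replace",
                "pattern": "\\.",  # regex for a literal dot; a bare "." would match any character
                "replacement": " ",
            }
        },
        "analyzer": {
            # "default_index" is the logical name quoted in the comments.
            "default_index": {"tokenizer": "standard", "char_filter": ["my_pattern"]}
        },
    }
}


def build_requests(index, settings):
    """Return (method, path, body) tuples; nothing is sent over the network."""
    return [
        ("POST", f"/{index}/_close", None),
        ("PUT", f"/{index}/_settings", json.dumps(settings)),
        ("POST", f"/{index}/_open", None),
    ]


for method, path, body in build_requests(INDEX, analysis_settings):
    print(method, path, body or "")
```

Changing the analyzer affects only how new text is analyzed. Documents indexed earlier keep the tokens produced by the old analyzer, so they would generally need to be reindexed before a search for RWHPSA951000155 can match them.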