Regexp "starts with" not working in Elasticsearch 6.*
I'm having trouble understanding the regexp mechanism in Elasticsearch. I have documents that represent property units:
{
"Unit" :
{
"DailyAvailablity" :
"UIAOUUUUUUUIAAAAAAAAAAAAAAAAAOUUUUIAAAAOUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUIAAAAAOUUUUUUUUUUUUUIAAAAOUUUUUUUUUUUUUIAAAAAAAAOUUUUUUIAAAAAAAAAOUUUUUUUUUUUUUUUUUUIUUUUUUUUIUUUUUUUUUUUUUUIAAAOUUUUUUUUUUUUUIUUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
}
}
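The DailyAvailability field encodes the property's availability day by day for the next two years, starting from today. "A" means available, "U" unavailable, "I" available for check-in, and "O" available for check-out. How can I write a regexp filter that returns all units available on specific dates?
I tried to find a substring of "A"s with a specific length and offset in the DailyAvailability field. For example, to find units that are available for 7 days, starting 7 days from today: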
{
"query": {
"bool": {
"filter": [
{
"regexp": { "Unit.DailyAvailability": {"value": ".{7}a{7}.*" } }
}
]
}
}
}
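This query returns, for instance, a unit whose DailyAvailability starts with "UUU...IAA" but contains a suitable sequence somewhere inside the field. How can I anchor the regexp to the entire source string? The ES docs say that Lucene regular expressions should be anchored by default.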
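Also, I tried '^.{7}a{7}.*$'. It returns an empty set.
It looks like you are using the text datatype to store Unit.DailyAvailability (which is also the default for strings if you are using dynamic mapping). You should consider using the keyword datatype instead.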
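Let me explain in more detail.
Why does my regex match in the middle of a text field?
When the text datatype is used, the data is analyzed for full-text search. The analysis applies some transformations, such as lowercasing and splitting into tokens.
Let's try the Analyze API on your input: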
POST _analyze
{
"text": "UIAOUUUUUUUIAAAAAAAAAAAAAAAAAOUUUUIAAAAOUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUIAAAAAOUUUUUUUUUUUUUIAAAAOUUUUUUUUUUUUUIAAAAAAAAOUUUUUUIAAAAAAAAAOUUUUUUUUUUUUUUUUUUIUUUUUUUUIUUUUUUUUUUUUUUIAAAOUUUUUUUUUUUUUIUUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
}
The response is:
{
"tokens": [
{
"token": "uiaouuuuuuuiaaaaaaaaaaaaaaaaaouuuuiaaaaouuuiaouuuuuuuuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaouuuuuuuuuuiaaaaaouuuuuuuuuuuuuiaaaaouuuuuuuuuuuuuiaaaaaaaaouuuuuuiaaaaaaaaaouuuuuuuuuuuuuuuuuuiuuuuuuuuiuuuuuuuuuuuuuuiaaaouuuuuuuuuuuuuiuuuuiaouuuuuuuuuuuuuuu",
"start_offset": 0,
"end_offset": 255,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "uuuuuuuuuuuuuuiaaaaaaaaaaaaouuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
"start_offset": 255,
"end_offset": 510,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaouuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
"start_offset": 510,
"end_offset": 732,
"type": "<ALPHANUM>",
"position": 2
}
]
}
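The effect of analysis on an anchored pattern can be reproduced outside Elasticsearch. The following is a minimal Python sketch, not Elasticsearch code. It assumes the analyzer does nothing more than lowercase the value and cut it into chunks of 255 characters (the default max_token_length of the standard tokenizer), and it uses Python's re.fullmatch as a stand-in for Lucene's always-anchored matching. The function names and the sample string are invented for illustration.

```python
import re

# Assumption: default max_token_length of the standard tokenizer.
MAX_TOKEN_LENGTH = 255


def simulate_text_analysis(value, max_len=MAX_TOKEN_LENGTH):
    """Rough stand-in for analysis of a text field: lowercase, then split long runs."""
    lowered = value.lower()
    return [lowered[i:i + max_len] for i in range(0, len(lowered), max_len)]


def regexp_matches_text_field(pattern, value):
    """On a text field the anchored pattern is tried against every token separately."""
    return any(re.fullmatch(pattern, token) for token in simulate_text_analysis(value))


def regexp_matches_keyword_field(pattern, value):
    """On a keyword field the anchored pattern is tried against the whole, unchanged string."""
    return re.fullmatch(pattern, value) is not None


# Hypothetical value: unavailable for 262 days, then 7 available days, then unavailable again.
availability = "U" * 255 + "U" * 7 + "A" * 7 + "U" * 100

# The second token starts at offset 255, so "7 characters, then 7 a's" matches that token
# even though the whole string does not have A's at offset 7.
print(regexp_matches_text_field(".{7}a{7}.*", availability))
print(regexp_matches_keyword_field(".{7}A{7}.*", availability))
```

In this sketch the text-style check succeeds because of the second token, while the keyword-style check fails, which mirrors the "match in the middle" behaviour described in the question.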
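As you can see, the analyzer lowercased the input and split it into three tokens, so the regexp is matched against each token rather than against the whole string. If the field is mapped as keyword instead, the string is stored as is, and you will be able to run a query like this one, which matches the document from the post: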
POST my_regexes/doc/_search
{
"query": {
"bool": {
"filter": [
{
"regexp": { "Unit.DailyAvailablity": "UIAOUUUUUUUIA.*" }
}
]
}
}
}
Note that the query becomes case-sensitive because the field is not analyzed.
This regexp will no longer return any results: ".{12}a{7}.*"
This one will: ".{12}A{7}.*"
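Building such patterns by hand is error-prone, so here is a small Python sketch that generates the pattern and the filter body for a given offset and length. The helper names availability_pattern and availability_query are hypothetical, the pattern uses an uppercase "A" on the assumption that the field is mapped as keyword, and re.fullmatch is only used as a local approximation of Lucene's anchored matching.

```python
import json
import re


def availability_pattern(offset_days, nights):
    """Skip `offset_days` characters, then require `nights` consecutive 'A' characters."""
    if offset_days < 0 or nights < 1:
        raise ValueError("offset_days must be >= 0 and nights must be >= 1")
    return ".{%d}A{%d}.*" % (offset_days, nights)


def availability_query(field, offset_days, nights):
    """Request body with a regexp filter, in the same shape as the queries in this thread."""
    pattern = availability_pattern(offset_days, nights)
    return {"query": {"bool": {"filter": [{"regexp": {field: pattern}}]}}}


pattern = availability_pattern(12, 7)

# Shortened sample that starts like the document in the question.
sample = "UIAOUUUUUUUI" + "A" * 17 + "OUUUU"

print(pattern)
print(re.fullmatch(pattern, sample) is not None)
print(json.dumps(availability_query("Unit.DailyAvailablity", 12, 7), indent=2))
```

The field name is passed in as a parameter because it has to match the name used in the index mapping exactly.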
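So what about anchoring?
The regular expressions are anchored:
Lucene's patterns are always anchored. The pattern provided must match the entire string.
The reason the anchoring looked wrong is most likely that the value was split into tokens in the analyzed text field.
In addition to the brilliant and helpful answer from Nikolay Vasiliev: in my case I had to go further to make it work with NEST for .NET. I added an attribute mapping to DailyAvailability: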
[Keyword(Name = "DailyAvailability")]
public string DailyAvailability { get; set; }
The filter still didn't work, and this is the mapping I got:
"DailyAvailability":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
}
My field contains about 732 characters, so the keyword sub-field ignored it when indexing. I tried:
[Keyword(Name = "DailyAvailability", IgnoreAbove = 1024)]
public string DailyAvailability { get; set; }
This had no effect on the mapping. It only started working after I added the mapping manually:
var client = new ElasticClient(settings);
client.CreateIndex("vrp", c => c
.Mappings(ms => ms.Map<Unit>(m => m
.Properties(ps => ps
.Keyword(k => k.Name(u => u.DailyAvailability).IgnoreAbove(1024))
)
)
));
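The point is:
ignore_above: Do not index any string longer than this value. Defaults to 2147483647 so that all values would be accepted. Please note, however, that the default dynamic mapping rules create a sub keyword field that overrides this default by setting ignore_above: 256.
So if you need to filter long keyword fields with regexp, use an explicit mapping to set ignore_above.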
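To make the consequence concrete, here is a minimal Python model of ignore_above. It is only an illustration of the rule quoted above, not Elasticsearch code: the function names are hypothetical, the sample value is synthetic (732 characters, the length mentioned for the field), and re.fullmatch stands in for the anchored regexp query.

```python
import re


def indexed_keyword_terms(value, ignore_above=2147483647):
    """Model of a keyword field: values longer than ignore_above are not indexed at all."""
    return [value] if len(value) <= ignore_above else []


def regexp_hits(pattern, terms):
    """A regexp query can only match terms that actually made it into the index."""
    return any(re.fullmatch(pattern, term) for term in terms)


# Synthetic availability string with the same length as the field in this thread.
availability = "UIAO" * 183
print(len(availability))

# Dynamic mapping default (ignore_above: 256): the value is skipped, so nothing can match.
print(regexp_hits("UIAO.*", indexed_keyword_terms(availability, ignore_above=256)))

# Explicit mapping with ignore_above: 1024: the value is indexed and the regexp can match.
print(regexp_hits("UIAO.*", indexed_keyword_terms(availability, ignore_above=1024)))
```

This is why the document stays in the index and is still returned by other queries, while the regexp filter on the keyword field silently returns nothing.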
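Regexp is case-sensitive. Try "[^A]*A{7}..*"
I thought case was ignored. I tried uppercase letters. It returns an empty set every time.
What happens when you use: {"query":{"regexp":{"Unit.DailyAvailability":".{7}A{7}.*"}}}
Uppercase symbols lead to an empty result. '.{7}a{7}.*' returns all units that contain a suitable sequence somewhere inside. I suspect the problem is related to the anchoring of regexps in Lucene. Another possibility: ES cannot parse large strings correctly.
How do you analyze the field? Your analysis is the reason you need to search with lowercase letters, and it may also be the cause of this behavior.
Great answer.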