
Regexp "starts with" not working in Elasticsearch 6.*


I am having trouble understanding the regexp mechanism in ElasticSearch. I have documents that represent property units:

{
    "Unit" :
    {
         "DailyAvailablity" : 
         "UIAOUUUUUUUIAAAAAAAAAAAAAAAAAOUUUUIAAAAOUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUIAAAAAOUUUUUUUUUUUUUIAAAAOUUUUUUUUUUUUUIAAAAAAAAOUUUUUUIAAAAAAAAAOUUUUUUUUUUUUUUUUUUIUUUUUUUUIUUUUUUUUUUUUUUIAAAOUUUUUUUUUUUUUIUUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
    }
}
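The DailyAvailability field encodes the property's availability day by day for the next two years, starting from today. "A" means available, "U" unavailable, "I" available for check-in, and "O" available for check-out. How can I write a regexp filter to get all units that are available on particular dates?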

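I tried to find a substring of "A" characters with a particular length and offset in the DailyAvailability field. For example, to find units that are available for 7 days starting 7 days from today: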

{
 "query": {
   "bool": {
     "filter": [
        {
         "regexp": { "Unit.DailyAvailability": {"value": ".{7}a{7}.*" } }
        }
      ]
    }
  }
}
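This query returns, for instance, a unit whose DailyAvailability starts with "UUU...IAA" but contains a suitable sequence somewhere inside the field. How can I anchor the regexp to the entire source string? The ES docs say that Lucene regular expressions should be anchored by default.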


I also tried '^.{7}a{7}.*$'. It returns an empty set.

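It looks like you are using the text datatype to store Unit.DailyAvailability (which is also the default for strings if you rely on dynamic mapping). You should consider using the keyword datatype instead.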

Let me explain in more detail.

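Why does my regex match in the middle of a text field?

When the text datatype is used, the data is analyzed for full-text search. It goes through some transformations, such as lowercasing and splitting into tokens.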

Let's try the Analyze API on your input:

POST _analyze
{
  "text": "UIAOUUUUUUUIAAAAAAAAAAAAAAAAAOUUUUIAAAAOUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUIAAAAAOUUUUUUUUUUUUUIAAAAOUUUUUUUUUUUUUIAAAAAAAAOUUUUUUIAAAAAAAAAOUUUUUUUUUUUUUUUUUUIUUUUUUUUIUUUUUUUUUUUUUUIAAAOUUUUUUUUUUUUUIUUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
}
The response is:

{
  "tokens": [
    {
      "token": "uiaouuuuuuuiaaaaaaaaaaaaaaaaaouuuuiaaaaouuuiaouuuuuuuuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaouuuuuuuuuuiaaaaaouuuuuuuuuuuuuiaaaaouuuuuuuuuuuuuiaaaaaaaaouuuuuuiaaaaaaaaaouuuuuuuuuuuuuuuuuuiuuuuuuuuiuuuuuuuuuuuuuuiaaaouuuuuuuuuuuuuiuuuuiaouuuuuuuuuuuuuuu",
      "start_offset": 0,
      "end_offset": 255,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "uuuuuuuuuuuuuuiaaaaaaaaaaaaouuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
      "start_offset": 255,
      "end_offset": 510,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaouuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
      "start_offset": 510,
      "end_offset": 732,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
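As the output shows, the input was lowercased and split into three tokens. With a keyword field instead, you will be able to run a query like this one, which matches the document from the post: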

POST my_regexes/doc/_search
{
 "query": {
   "bool": {
     "filter": [
        {
         "regexp": { "Unit.DailyAvailablity": "UIAOUUUUUUUIA.*"  }
        }
      ]
    }
  }
}
Note that the query becomes case-sensitive, because the field is not analyzed.

This regexp will no longer return any results:
".{12}a{7}.*"

This one will:
".{12}A{7}.*"
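The case sensitivity can be reproduced outside Elasticsearch. The following is a minimal sketch, not Lucene itself: it assumes Python's `re.fullmatch` behaves like an anchored Lucene pattern for these simple expressions, and `value` is an abbreviated stand-in for the start of the string in the question. The helper name `keyword_regexp` is made up for illustration.

```python
import re

# Abbreviated stand-in for the start of the value from the question:
# 12 leading characters, then a run of "A" (available) days.
value = "UIAO" + "U" * 7 + "I" + "A" * 17 + "O"


def keyword_regexp(pattern, stored):
    """Approximate a regexp query on a keyword field.

    A keyword field indexes the whole string unchanged, and the
    pattern has to match all of it, so fullmatch is used here.
    """
    return re.fullmatch(pattern, stored) is not None


lower_hit = keyword_regexp(".{12}a{7}.*", value)  # lowercase "a"
upper_hit = keyword_regexp(".{12}A{7}.*", value)  # uppercase "A"

print(lower_hit, upper_hit)  # prints: False True
```

Only the uppercase pattern matches, because nothing lowercases the stored value on a keyword field.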

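What about anchoring?

Regular expressions are anchored:

Lucene's patterns are always anchored. The pattern provided must match the entire string.


The reason anchoring appears to be wrong is most likely that the value is split into tokens in the analyzed text field.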
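To see how token splitting makes an anchored pattern appear to match "in the middle", here is a rough simulation. It is only a sketch under stated assumptions: the analyzer is approximated as lowercasing plus cutting at 255 characters (the default `max_token_length` of the standard tokenizer, which agrees with the offsets in the `_analyze` response), `re.fullmatch` stands in for Lucene's anchored matching, and the input string and helper names are hypothetical.

```python
import re

MAX_TOKEN_LENGTH = 255  # default max_token_length of the standard tokenizer


def analyze_like_text(value):
    """Rough stand-in for the standard analyzer on one unbroken word:
    lowercase it, then cut it into chunks of at most 255 characters."""
    lowered = value.lower()
    return [lowered[i:i + MAX_TOKEN_LENGTH]
            for i in range(0, len(lowered), MAX_TOKEN_LENGTH)]


def any_term_matches(pattern, terms):
    """The pattern is anchored, but it is tested against each indexed term."""
    return any(re.fullmatch(pattern, term) for term in terms)


# Hypothetical value: unavailable for 262 days, then 7 available days.
value = "U" * 262 + "A" * 7 + "U" * 20

text_terms = analyze_like_text(value)  # roughly what a text field indexes
keyword_terms = [value]                # what a keyword field indexes

print(len(text_terms))                                  # 2
print(any_term_matches(".{7}a{7}.*", text_terms))       # True
print(any_term_matches(".{7}A{7}.*", keyword_terms))    # False
print(any_term_matches(".{262}A{7}.*", keyword_terms))  # True
```

The second token starts at offset 255, so `.{7}a{7}.*` matches it from its own beginning even though the "A" run sits 262 days into the original string. On the keyword field the offset in the pattern refers to the real start of the value.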

In addition to the brilliant and helpful answer from Nikolay Vasiliev: in my case I had to go further to make it work with NEST (.NET). I added an attribute mapping to DailyAvailability:

[Keyword(Name = "DailyAvailability")]
public string DailyAvailability { get; set; }
The filter still did not work, and this is the mapping I got:

 "DailyAvailability":{
     "type":"text",
     "fields":{
         "keyword":{
             "type":"keyword",
             "ignore_above":256
         }
      }
 }
My field contains about 732 characters, so the index ignored it. I tried:

[Keyword(Name = "DailyAvailability", IgnoreAbove = 1024)]
public string DailyAvailability { get; set; }
This had no effect on the mapping. It only started working after I added a manual mapping:

var client = new ElasticClient(settings);
client.CreateIndex("vrp", c => c
    .Mappings(ms => ms.Map<Unit>(m => m
        .Properties(ps => ps
            .Keyword(k => k.Name(u => u.DailyAvailability).IgnoreAbove(1024))
        )
     )
  ));
The key point is:

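ignore_above: Do not index any string longer than this value. Defaults to 2147483647 so that all values would be accepted. Please note, however, that the default dynamic mapping rules create a keyword sub-field that overrides this default by setting ignore_above: 256.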


So if you need to filter a long keyword field with regexp, use an explicit mapping to set ignore_above.
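The effect of ignore_above on a 732-character value can be sketched as follows. This is an illustration under assumptions, not Elasticsearch code: `is_indexed` is a hypothetical helper that only mimics the length check, `value` is a made-up string of the length mentioned above, and `properties` shows just the field fragment of an explicit mapping (the surrounding index and type structure is left out).

```python
# Field fragment of an explicit mapping, mirroring the NEST call above.
properties = {
    "DailyAvailability": {"type": "keyword", "ignore_above": 1024}
}

# Limit set on the keyword sub-field that dynamic mapping creates.
DYNAMIC_IGNORE_ABOVE = 256


def is_indexed(value, ignore_above):
    """Mimic ignore_above: a longer string is not indexed in the
    keyword field, so no regexp query can ever match it there."""
    return len(value) <= ignore_above


value = "A" * 732  # made-up string of the length mentioned above

indexed_dynamic = is_indexed(value, DYNAMIC_IGNORE_ABOVE)
indexed_explicit = is_indexed(
    value, properties["DailyAvailability"]["ignore_above"]
)

print(indexed_dynamic, indexed_explicit)  # prints: False True
```

With the dynamic default the value is skipped silently, which is why the filter returned nothing until the limit was raised.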

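Regexp is case-insensitive. Try "[^A]*A{7}..*"

I think case is ignored. I tried with uppercase letters; it returns an empty set every time.

What happens when you use: {"query":{"regexp":{"Unit.DailyAvailability":".{7}A{7}.*"}}}

Uppercase characters produce an empty result. '.{7}a{7}.*' returns all units that contain a suitable sequence somewhere inside. I suspect the problem is related to the anchoring of regexps in Lucene. Another possibility: ES cannot parse large strings correctly.

How are you analyzing the field? Your analysis is the reason you need to search in lowercase, and it may well be the cause of this behaviour too.

Great answer.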