XQuery: searching for different combinations of letters
I have the following XML structure:
<image>
<id>88091942</id>
<imageType>Primary</imageType>
<format>pdf</format>
<status timestamp="2019-11-20T12:20:02.616Z">Accepted</status>
<size/>
<languageCode>
<val>eng</val>
</languageCode>
<comments/>
<effectiveDate>2013-01-01T00:00:00.000Z</effectiveDate>
<extractedText> Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, ABCDE remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including B.C.D.E versions of Lorem Ipsum.</extractedText>
</image>
I get the following result:
88091942
What I should be getting:
88091942!!!!Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, ABCDE remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including B.C.D.E versions of Lorem Ipsum.
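Any help is much appreciated.
contains(., "A or B or C")
searches for the literal text string "A or B or C".
What you want is contains(., "A") or contains(., "B") or contains(., "C")
Alternatively, you can reformulate it as matches(., "A|B|C")
Or, if the strings are related, such as ABCDE, A.B.C.D.E, Abcde, then you could try
contains(. => upper-case() => translate('.', ''), "ABCDE")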
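To compare the variants above side by side, here is a minimal sketch. The inline sample document is an assumption (a shortened stand-in that reuses the element names from the question), and the arrow operator requires an XQuery 3.1 processor.

```xquery
xquery version "3.1";

(: Hypothetical sample data; only the element names come from the question. :)
let $doc :=
  <image>
    <id>88091942</id>
    <extractedText>electronic typesetting, ABCDE remaining, including B.C.D.E versions</extractedText>
  </image>
let $text := $doc/extractedText
return (
  (: Looks for the literal string "A or B or C", which is absent: false :)
  contains($text, "A or B or C"),
  (: Three separate tests combined with the "or" operator: true :)
  contains($text, "A") or contains($text, "B") or contains($text, "C"),
  (: Regular-expression alternation: true :)
  matches($text, "A|B|C"),
  (: Upper-case the text, strip the dots, then look for the joined form: true :)
  contains($text => upper-case() => translate('.', ''), "ABCDE")
)
```

The last form normalises the text before comparing, so it treats dotted and mixed-case spellings alike, at the cost of also matching any other text that happens to reduce to the same letters.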
Your example is not entirely clear, but it looks like you want to use this kind of regular expression in XQuery:
let $id := /image/id
let $text := /image/extractedText[matches(.,'((a|A)\.)?(b|B)\.(c|C)\.(d|D)\.(e|E)')]
return fn:string-join(($id,$text),"!!!!")
Output:
88091942!!!! Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, ABCDE remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including B.C.D.E versions of Lorem Ipsum.
If you prefer the XPath/XQuery-specific regular expression style, you can use something like
matches(., '(A\.)?B\.C\.D\.E', 'i')
with the i flag to ignore case.
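As a rough illustration of the case-insensitive form, the sketch below applies it to a small inline document. The sample data is an assumption (the lower-case b.c.d.e is chosen deliberately to exercise the flag); the element names follow the question.

```xquery
xquery version "3.1";

(: Hypothetical sample data; element names follow the question. :)
let $doc :=
  <image>
    <id>88091942</id>
    <extractedText>PageMaker including b.c.d.e versions of Lorem Ipsum.</extractedText>
  </image>
let $id := $doc/id
(: The 'i' flag makes the match case-insensitive, so the (b|B) style
   alternations are not needed. :)
let $text := $doc/extractedText[matches(., '(A\.)?B\.C\.D\.E', 'i')]
return string-join(($id, $text), "!!!!")
```

If the predicate does not match, $text is empty and only the id is returned, which is the symptom described in the question.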
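The original poster mentioned that the content resides in MarkLogic, so I will give a MarkLogic-centric answer for the use case, based on the information available. Tuning a complete implementation, including case sensitivity, may require a deeper understanding of MarkLogic's basic features.
Cases: "A.B.C", "A.B.D.C", etc.
Under the hood, all content is indexed. The word index tokenizes content and applies case-sensitivity and diacritic-sensitivity rules when the content is stored. Word boundaries are based on a sensible set of defaults around whitespace and punctuation. This means the data is already prepared for items separated by a few characters. Analysing the example above, we can see: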
xdmp:describe(cts:tokenize("A.B D.C"))
This shows that, for the sample, whitespace is ignored and punctuation is understood, with the following result:
(cts:word("A"), cts:punctuation("."), cts:word("B"), ...)
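To see which parts of a candidate string count as words, the token sequence can be filtered by type. This is a minimal sketch that assumes a MarkLogic server (it uses the 1.0-ml dialect and the cts token types); it is not part of the original answer.

```xquery
xquery version "1.0-ml";

(: Keep only the word tokens of a candidate string, dropping
   punctuation and whitespace tokens. :)
for $token in cts:tokenize("A.B D.C")
return
  typeswitch ($token)
    case cts:word return fn:string($token)
    default return ()
```

For the sample string this should leave just the four single-letter words, which are the terms the word index works with.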
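This means we simply need to take into account the relationship between the individual words in the text. To do that, we make sure the words (A, B, C, D) are near each other. The database setting called word positions may help performance here. In my case, I left it as it was for my sample of 300,000 documents. Our query is very simple: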
cts:search(doc(), cts:near-query(cts:word-query(("A", "B", "C", "D")), 1))
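Breakdown:
- doc() is the simplest searchable expression; in JavaScript you would not have this
- We then ask for a word query on the words A, B, C, D. The internal rules of cts:word-query() already expand this into a list of or-queries
- All of this is constrained by position: the word-query matches should be within one position of each other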
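To get from the search results to the id!!!!text output asked for in the question, the query above can be wrapped in a FLWOR expression. This is only a sketch: it reuses the answer's query unchanged and assumes each matching document is rooted at an image element shaped like the one in the question.

```xquery
xquery version "1.0-ml";

(: Assumption: every document has <image> as its root element,
   with <id> and <extractedText> children as in the question. :)
for $doc in cts:search(
              fn:doc(),
              cts:near-query(cts:word-query(("A", "B", "C", "D")), 1))
return fn:string-join(($doc/image/id, $doc/image/extractedText), "!!!!")
```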
You can also combine this approach with an element lexicon: expand the terms to those that actually exist in the system, and still pass them into the search as a sequence.