Clojure re-seq regular expression gives unexpected results
I am having some difficulty coming up with the right regex for the following: I have an input file that I am trying to split into groups by a keyword expression. The sample input (call it case 1) contains two such groups, Foo: B and Foo: C. The following regex:
#"(?s)(Foo:)(?:(?!Foo:).)*"
works like a charm and produces the result I expect:
(["Foo: B\n \"This is instance B of type Foo\"\n Bar: X\n etc.\n\n"
"Foo:"]
["Foo: C\n \"This is instance C of type Foo\"\n Bar: Y\n etc.\n\n\n"
"Foo:"])
However, if someone adds a colon after the "Foo" inside the quoted comment, it goes wonky and results in:
(["Foo: B\n \"This is instance B of type " "Foo:"]
["Foo:\"\n Bar: X\n etc.\n\n" "Foo:"]
["Foo: C\n \"This is instance C of type Foo\"\n Bar: Y\n etc.\n\n\n"
"Foo:"])
(["Foo: B\n \"This is instance B of type Foo:\"\n Bar: X\n etc.\n\nFoo: C\n \"This is instance C of type Foo:\"\n Bar: Y\n etc.\n\n\n\n"
"Foo:"])
If, as a test, I remove Foo: C and its contents from the input and change the regex to:
#"(?s)(Foo:)(?:(?!\"Foo:\").)*"
I get the expected result:
(["Foo: B\n \"This is instance B of type Foo:\"\n Bar: X\n etc.\n\n\n\n"
"Foo:"])
However, once Foo: C is added back into the mix, it no longer respects the boundary and results in:
(["Foo: B\n \"This is instance B of type " "Foo:"]
["Foo:\"\n Bar: X\n etc.\n\n" "Foo:"]
["Foo: C\n \"This is instance C of type Foo\"\n Bar: Y\n etc.\n\n\n"
"Foo:"])
(["Foo: B\n \"This is instance B of type Foo:\"\n Bar: X\n etc.\n\nFoo: C\n \"This is instance C of type Foo:\"\n Bar: Y\n etc.\n\n\n\n"
"Foo:"])
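I also tried, without success: "(?s)(Foo:)(?:(?!Foo:\"Foo:\")*"
along with what feels like a thousand other unsuccessful variations.
Thanks for any help. The goal is to use a regex to chunk the file.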
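One variation that may be worth trying, assuming every real group header starts at the beginning of a line (as it appears to in the sample, where the in-comment Foo: is preceded by other text): enable multiline mode and anchor both the keyword and the lookahead with ^. The sketch below runs the unanchored and the anchored pattern against an approximation of the sample input; the exact indentation of the input and the names `input`, `unanchored` and `anchored` are assumptions made for illustration.

```clojure
;; Approximation of the sample input; the exact indentation is an assumption.
(def input
  (str "Foo: B\n"
       "  \"This is instance B of type Foo:\"\n"
       "  Bar: X\n"
       "  etc.\n\n"
       "Foo: C\n"
       "  \"This is instance C of type Foo\"\n"
       "  Bar: Y\n"
       "  etc.\n"))

;; Pattern from the question, minus the capturing group: the lookahead is
;; checked before every character, so ANY later "Foo:" ends the chunk,
;; including the one inside the quoted comment.
(def unanchored (re-seq #"(?s)Foo:(?:(?!Foo:).)*" input))

;; (?m) makes ^ match at the start of each line; (?s) lets . cross newlines.
;; Only a "Foo:" at the start of a line can now begin or end a chunk.
(def anchored (re-seq #"(?ms)^Foo:(?:(?!^Foo:).)*" input))

(count unanchored) ;; => 3
(count anchored)   ;; => 2
```

With no capturing group in the pattern, re-seq returns plain strings rather than vectors. This only helps if the keyword never legitimately appears at the start of a line inside a comment.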
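Current solution
I no longer use a regex, as it was too nuanced for the simple chunking I need. My first solution was a loop/recur with several (too many) conditionals and a mutable atom serving as the accumulating map.
I had been itching to do something concrete with reduce. It may not be the best application of it, but I learned it through this exercise and removed a lot of excess lines of code.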
(require '[clojure.string :as s])

;; Maps each OWL keyword to the key its lines are collected under.
(def owl-type-map
  {"Prefix:"               :prefixes
   "AnnotationProperty:"   :annotation-properties
   "Ontology:"             :ontology
   "Datatype:"             :data-types
   "DataProperty:"         :data-properties
   "ObjectProperty:"       :object-properties
   "Class:"                :classes
   "Individual:"           :individuals
   "EquivalentClasses:"    :miscellaneous
   "DisjointClasses:"      :miscellaneous
   "EquivalentProperties:" :miscellaneous
   "DisjointProperties:"   :miscellaneous
   "SameIndividual:"       :miscellaneous
   "DifferentIndividuals:" :miscellaneous})

;; Initial accumulator: every target key mapped to nil, plus :current (the active key).
(def owl-control (reduce #(assoc %1 (second %2) nil) {:current nil} owl-type-map))

;; Splits a line on single spaces.
(def space-split #(s/split (str %) #" "))

(defn owl-chunk
  "Reduce ready function to accumulate a series of strings associated to
  particular instaparse EBNF productions (e.g. Class:, Prefix:, Ontology:).
  owl-type-map refers to the association between owl-type (string) and EBNF production"
  [acc v]
  (let [odex  (:current acc)
        stip  ((comp first space-split) v)  ; first token of the line
        index (get owl-type-map stip odex)  ; new key, or stay on the current one
        imap  (if (= index odex) acc (assoc-in acc [:current] index))]
    (assoc-in imap [index] (str (get imap index) v "\n"))))

;; Calling, where lines is the input file as a sequence of lines
(reduce owl-chunk owl-control lines)
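To make the accumulation easier to follow, here is a cut-down sketch of the same idea on a few invented lines. The two-entry `type-map`, the function name `chunk-line` and the sample lines are all hypothetical stand-ins, not part of the solution above. It uses `update` (Clojure 1.7+) in place of the get/assoc-in pair.

```clojure
(require '[clojure.string :as str])

;; Hypothetical, cut-down keyword table (stand-in for the full owl-type-map).
(def type-map {"Prefix:" :prefixes, "Class:" :classes})

(defn chunk-line
  "Appends line to the string stored under the active key. The active key
  changes only when the first space-delimited token of the line is a
  known keyword."
  [acc line]
  (let [current (:current acc)
        token   (first (str/split line #" "))
        index   (get type-map token current)
        acc     (assoc acc :current index)]
    (update acc index str line "\n")))

;; Invented input lines. The indented line starts with a space, so its first
;; token is the empty string and it stays with the current key, even though
;; it mentions "Class:" further along.
(def result
  (reduce chunk-line
          {:current nil}
          ["Prefix: ex: <http://example.org/>"
           "Class: A"
           "  Annotations: rdfs:comment \"mentions Class: inline\""
           "Class: B"]))

(:prefixes result) ;; => "Prefix: ex: <http://example.org/>\n"
```

Because only the first token of each line is looked up, a keyword that appears later in a line never starts a new chunk, which is the property the regex approach struggled with.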
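You might want to consider using a parser generator. Mark Engelberg's instaparse is an excellent parsing library for Clojure, designed to make this an easy choice; the first line of its README is "What if context-free grammars were as easy to use as regular expressions?"
Here is an example of how to use it to parse your sample input:
;; [instaparse "1.3.5"]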
(require '[instaparse.core :as insta])

(def p (insta/parser "
S = Group*
Group = GroupHeader GroupComment GroupBody
GroupHeader = #'[A-Za-z]+' ': ' #'[A-Za-z]+' '\n'
GroupComment = ws? '\"' #'[^\"]+' '\"\n'
GroupBody = Line*
Line = #'.*' '\n'
ws = #'\\s+'
"))

(p "Foo: B
 \"This is instance B of type Foo\"
 Bar: X
Foo: C
 \"This is instance C of type Foo\"
 Bar: Y
")
;;=>
[:S
 [:Group
  [:GroupHeader "Foo" ": " "B" "\n"]
  [:GroupComment [:ws " "] "\"" "This is instance B of type Foo" "\"\n"]
  [:GroupBody
   [:Line " Bar: X" "\n"]]]
 [:Group
  [:GroupHeader "Foo" ": " "C" "\n"]
  [:GroupComment [:ws " "] "\"" "This is instance C of type Foo" "\"\n"]
  [:GroupBody
   [:Line " Bar: Y" "\n"]]]]
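Adding a colon after "Foo" in the quoted string is not a problem. (Of course the grammar above is very simplistic; I suppose you might want to start nested groups at Bar: etc.)
Michal, I am using instaparse only past this point; I need to chunk the file before feeding it in. It is a matter of scale... More specifically, the instaparse grammar I am working with has ambiguities that lead to exponential parse times. Chunking before input lets me target specific productions and breeze through the parse. Hope this helps and makes sense.
Maybe you could use a simpler "chunking grammar" in a preprocessing step and then apply your ambiguous grammar to the resulting chunks?
I ended up avoiding both regex and instaparse and chunking with the clojure.string/split approach. Added the final solution, in which I got my feet wet with reduce. As always, thanks for the suggestions.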
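For the preprocessing idea raised in the comments, a plain line-based splitter may be enough when each instance should become its own chunk before being handed to the real grammar. The following is only a sketch: the function name `chunk-by-header`, the sample text and its indentation are assumptions, and `clojure.string/starts-with?` requires Clojure 1.8 or later.

```clojure
(require '[clojure.string :as str])

(defn chunk-by-header
  "Splits text into chunks, starting a new chunk at every line that begins
  with header. Lines before the first header go into the first chunk."
  [header text]
  (->> (str/split-lines text)
       (reduce (fn [chunks line]
                 (if (or (empty? chunks) (str/starts-with? line header))
                   (conj chunks [line])                             ; open a new chunk
                   (update chunks (dec (count chunks)) conj line))) ; extend the last one
               [])
       (mapv #(str/join "\n" %))))

;; Approximation of the sample input; the indentation is an assumption.
(def input
  (str "Foo: B\n"
       "  \"This is instance B of type Foo:\"\n"
       "  Bar: X\n"
       "Foo: C\n"
       "  \"This is instance C of type Foo\"\n"
       "  Bar: Y\n"))

(def chunks (chunk-by-header "Foo:" input))

(count chunks) ;; => 2
```

Each resulting string could then be passed to the parser on its own, so an ambiguous grammar only ever sees one small group at a time.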