Regex: Clojure re-seq regular expression gives unexpected results


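I am having some trouble working out the right regular expression for the following:

I have an input file that I am trying to group by keyword expression. Here is a sample of the file (let's call it case 1):

The following regex: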

#"(?s)(Foo:)(?:(?!Foo:).)*"
works like a charm and produces the result I expect:

(["Foo: B\n  \"This is instance B of type Foo\"\n  Bar: X\n  etc.\n\n"
  "Foo:"]
 ["Foo: C\n  \"This is instance C of type Foo\"\n  Bar: Y\n  etc.\n\n\n"
  "Foo:"])
However, if someone adds a colon after the "Foo" inside the comment, it goes wonky and results in:

(["Foo: B\n  \"This is instance B of type " "Foo:"]
 ["Foo:\"\n  Bar: X\n  etc.\n\n" "Foo:"]
 ["Foo: C\n  \"This is instance C of type Foo\"\n  Bar: Y\n  etc.\n\n\n"
  "Foo:"])
(["Foo: B\n  \"This is instance B of type Foo:\"\n  Bar: X\n  etc.\n\nFoo: C\n  \"This is instance C of type Foo:\"\n  Bar: Y\n  etc.\n\n\n\n"
  "Foo:"])
If, as a test, I remove Foo: C and its contents from the input and change the regex to:

"(?s)(Foo:)(?:(?!\"Foo:\").)*"
I get the expected result:

(["Foo: B\n  \"This is instance B of type Foo:\"\n  Bar: X\n  etc.\n\n\n\n"
  "Foo:"])
However, once Foo: C is added back into the mix, it no longer respects the boundaries and results in:

(["Foo: B\n  \"This is instance B of type " "Foo:"]
 ["Foo:\"\n  Bar: X\n  etc.\n\n" "Foo:"]
 ["Foo: C\n  \"This is instance C of type Foo\"\n  Bar: Y\n  etc.\n\n\n"
  "Foo:"])
(["Foo: B\n  \"This is instance B of type Foo:\"\n  Bar: X\n  etc.\n\nFoo: C\n  \"This is instance C of type Foo:\"\n  Bar: Y\n  etc.\n\n\n\n"
  "Foo:"])
I have also tried, without success:
"(?s)(Foo:)(?:(?!Foo:\"Foo:\")*"
to name just one of a few thousand unsuccessful variations.

Thanks for your help. The goal is to chunk the file using a regex.
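The outputs above suggest why the match breaks: the lookahead (?!Foo:) stops at any occurrence of Foo:, including the one inside the quoted comment. A minimal sketch of one possible workaround follows, assuming that real headers always start at the beginning of a line while the comment is indented (as in the outputs shown). The names sample-input, line-anchored-foo and chunks are made up for illustration and are not from the original post.

```clojure
;; Hypothetical sample mirroring the outputs above, with "Foo:" inside the comments.
(def sample-input
  (str "Foo: B\n"
       "  \"This is instance B of type Foo:\"\n"
       "  Bar: X\n"
       "  etc.\n"
       "\n"
       "Foo: C\n"
       "  \"This is instance C of type Foo:\"\n"
       "  Bar: Y\n"
       "  etc.\n"))

;; (?m) lets ^ match at the start of every line, (?s) lets . match newlines.
;; The lookahead only stops at a "Foo:" that begins a line (assumption: headers
;; are never indented), so an indented "Foo:" in a comment is consumed normally.
(def line-anchored-foo #"(?ms)^Foo:(?:(?!^Foo:).)*")

;; With no capturing group, re-seq returns plain strings rather than vectors.
(def chunks (re-seq line-anchored-foo sample-input))

(println (count chunks)) ; prints 2
```

This only moves the boundary test from "any Foo:" to "a Foo: at the start of a line"; it would still split incorrectly if a comment line itself began with Foo: in column 0.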

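Current solution: I no longer use regex, as it was too finicky for the simple chunking I need. My first solution was a loop/recur with several (too many) conditions and a mutated atom as the accumulating map.

I had been itching to do something specific with reduce, and while this may not be its best application, I learned it through this exercise and eliminated a lot of lines of code.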

(def owl-type-map
    {
     "Prefix:"               :prefixes
     "AnnotationProperty:"   :annotation-properties
     "Ontology:"             :ontology
     "Datatype:"             :data-types
     "DataProperty:"         :data-properties
     "ObjectProperty:"       :object-properties
     "Class:"                :classes
     "Individual:"           :individuals
     "EquivalentClasses:"    :miscellaneous
     "DisjointClasses:"      :miscellaneous
     "EquivalentProperties:" :miscellaneous
     "DisjointProperties:"   :miscellaneous
     "SameIndividual:"       :miscellaneous
     "DifferentIndividuals:" :miscellaneous
     })

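(require '[clojure.string :as s])

;; Seed for the reduction: :current plus one nil entry per section key.
(def owl-control (reduce #(assoc %1 (second %2) nil) {:current nil} owl-type-map))

;; Split a line into its space-separated tokens.
(def space-split #(s/split (str %) #" "))

(defn owl-chunk
  "Reduce-ready function that accumulates a series of strings associated with
  particular instaparse EBNF productions (e.g. Class:, Prefix:, Ontology:).
  owl-type-map holds the association between owl-type (string) and EBNF production."
  [acc v]
  (let [odex  (:current acc)                   ; section currently being filled
        stip  ((comp first space-split) v)     ; first token of the line
        index (get owl-type-map stip odex)     ; new section, or stay in the current one
        imap  (if (= index odex) acc (assoc-in acc [:current] index))]
    (assoc-in imap [index] (str (get imap index) v "\n"))))

;; Calling (input-lines is the sequence of lines read from the file)

(reduce owl-chunk owl-control input-lines)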
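To make the reduce-based chunking easier to follow, here is a small self-contained sketch of the same idea run on a few lines. The names demo-type-map, demo-chunk, demo-lines and demo-result, as well as the sample lines themselves, are hypothetical and only stand in for the real table and input.

```clojure
(require '[clojure.string :as str])

;; Hypothetical, trimmed-down keyword table for illustration only.
(def demo-type-map
  {"Prefix:" :prefixes
   "Class:"  :classes})

(defn demo-chunk
  "Appends line to the section named by its first token, or to the
  current section when the first token is not a known keyword."
  [acc line]
  (let [current (:current acc)
        token   (first (str/split line #" "))
        section (get demo-type-map token current)]
    (-> acc
        (assoc :current section)
        (update section str line "\n"))))

;; Made-up input: the third line is indented and mentions "Class:" inside quotes.
(def demo-lines
  ["Prefix: ex: <http://example.org/>"
   "Class: ex:Foo"
   "    Annotations: rdfs:comment \"Class: inside a comment\""
   "Class: ex:Bar"])

(def demo-result
  (reduce demo-chunk {:current nil} demo-lines))

(println (:classes demo-result))
```

Because only the first space-delimited token of each line is looked up, a keyword that appears later in a line (for example inside a quoted comment) does not start a new chunk; the indented line simply stays in the :classes section.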

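You may want to consider using a parser generator. Mark Engelberg's instaparse is an excellent parsing library for Clojure, designed to make this an easy choice; the first line of its README asks what it would be like if context-free grammars were as easy to use as regular expressions.

Here is an example of how to use it to parse the sample input: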

;; [instaparse "1.3.5"]
(require '[instaparse.core :as insta])

(def p (insta/parser "

S = Group*
Group = GroupHeader GroupComment GroupBody
GroupHeader = #'[A-Za-z]+' ': ' #'[A-Za-z]+' '\n'
GroupComment = ws? '\"' #'[^\"]+' '\"\n'
GroupBody = Line*
Line = #'.*' '\n'
ws = #'\\s+'

"))

(p "Foo: B
  \"This is instance B of type Foo\"
  Bar: X
Foo: C
  \"This is instance C of type Foo\"
  Bar: Y
")
;;=
[:S
 [:Group
  [:GroupHeader "Foo" ": " "B" "\n"]
  [:GroupComment [:ws "  "] "\"" "This is instance B of type Foo" "\"\n"]
  [:GroupBody
   [:Line "  Bar: X" "\n"]]]
 [:Group
  [:GroupHeader "Foo" ": " "C" "\n"]
  [:GroupComment [:ws "  "] "\"" "This is instance C of type Foo" "\"\n"]
  [:GroupBody
   [:Line "  Bar: Y" "\n"]]]]

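Adding a colon after the "Foo" in the quoted string is not a problem. (Of course, the grammar above is very simplistic; I imagine you might want to start nesting groups at Bar: and so on.)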

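Michal, I only use instaparse at this point, and I need to chunk the file before feeding it in. It is a matter of scale... More specifically, the instaparse grammar I am working with has ambiguities that lead to exponential parsing. Chunking before input lets me target specific productions and get through the parse easily. Hope this helps and makes sense.

Perhaps you could use a simpler "chunking grammar" in a preprocessing step and then apply your ambiguous grammar to the resulting chunks?

I ended up avoiding both regex and instaparse and using the clojure.string/split approach for the chunking. I have added the final solution; I got my feet wet with reduce. As always, thanks for the advice.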
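The "chunk first, then parse each chunk" pipeline discussed in these comments could be sketched roughly as follows. This is an illustrative alternative, not the asker's final code: split-into-chunks and parse-chunk are hypothetical names, parse-chunk merely stands in for the real per-chunk parser, and the sketch again assumes that headers start at the beginning of a line.

```clojure
(require '[clojure.string :as str])

(defn split-into-chunks
  "Splits text before every line that starts with header-keyword (e.g. \"Foo:\")."
  [text header-keyword]
  (let [quoted   (java.util.regex.Pattern/quote header-keyword)
        ;; Zero-width boundary: start of a line that is followed by the keyword.
        boundary (re-pattern (str "(?m)^(?=" quoted ")"))]
    (str/split text boundary)))

;; Hypothetical stand-in for the real parser that would be applied to each chunk.
(defn parse-chunk [chunk]
  (let [lines (str/split-lines chunk)]
    {:header     (first lines)
     :line-count (count lines)}))

;; Made-up input with "Foo:" appearing inside the indented comments.
(def two-groups
  (str "Foo: B\n  \"type Foo:\"\n  Bar: X\n"
       "Foo: C\n  \"type Foo:\"\n  Bar: Y\n"))

(def parsed
  (mapv parse-chunk (split-into-chunks two-groups "Foo:")))

(println parsed)
```

Each chunk is parsed independently, so an expensive or ambiguous grammar only ever sees one small piece of the file at a time.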