Clojure re-seq regex with unexpected results


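I'm having some trouble figuring out the correct regular expression for the following:

I have an input file that I'm trying to group based on a keyword expression. Here is a sample of the file (let's call it Case 1):

The following regex: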

#"(?s)(Foo:)(?:(?!Foo:).)*"

works like a charm and produces my expected results:

(["Foo: B\n  \"This is instance B of type Foo\"\n  Bar: X\n  etc.\n\n"
  "Foo:"]
 ["Foo: C\n  \"This is instance C of type Foo\"\n  Bar: Y\n  etc.\n\n\n"
  "Foo:"])

However, if someone adds a colon to the "Foo" inside the quoted comment, it goes quirky and results in:

(["Foo: B\n  \"This is instance B of type " "Foo:"]
 ["Foo:\"\n  Bar: X\n  etc.\n\n" "Foo:"]
 ["Foo: C\n  \"This is instance C of type Foo\"\n  Bar: Y\n  etc.\n\n\n"
  "Foo:"])

(["Foo: B\n  \"This is instance B of type Foo:\"\n  Bar: X\n  etc.\n\nFoo: C\n  \"This is instance C of type Foo:\"\n  Bar: Y\n  etc.\n\n\n\n"
  "Foo:"])

If, while testing, I remove Foo: C and its contents from the input and change the regex to:

#"(?s)(Foo:)(?:(?!\"Foo:\").)*"

I get the expected result:

(["Foo: B\n  \"This is instance B of type Foo:\"\n  Bar: X\n  etc.\n\n\n\n"
  "Foo:"])

However, once Foo: C is added back into the mix, it no longer respects the boundary and results in:

(["Foo: B\n  \"This is instance B of type " "Foo:"]
 ["Foo:\"\n  Bar: X\n  etc.\n\n" "Foo:"]
 ["Foo: C\n  \"This is instance C of type Foo\"\n  Bar: Y\n  etc.\n\n\n"
  "Foo:"])

(["Foo: B\n  \"This is instance B of type Foo:\"\n  Bar: X\n  etc.\n\nFoo: C\n  \"This is instance C of type Foo:\"\n  Bar: Y\n  etc.\n\n\n\n"
  "Foo:"])

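I have tried, without success:

"(?s)(Foo:)(?:(?!Foo:\"Foo:\")*"

to name one of a few thousand unsuccessful variations.

Thanks for your help. The goal is to use a regex to chunk the file.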
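One possible way to keep the regex approach is to anchor the keyword to the start of a line, so that a "Foo:" appearing inside an indented, quoted comment is no longer treated as a boundary. This is only a sketch: it assumes that group headers always begin in column 0 and that comments are indented, and the sample string below is reconstructed from the outputs shown above rather than taken from the real input file.

```clojure
;; Reconstructed sample (an assumption): both comments contain "Foo:".
(def sample
  (str "Foo: B\n"
       "  \"This is instance B of type Foo:\"\n"
       "  Bar: X\n"
       "  etc.\n"
       "\n"
       "Foo: C\n"
       "  \"This is instance C of type Foo:\"\n"
       "  Bar: Y\n"
       "  etc.\n"))

;; (?m) lets ^ match at the start of every line; (?s) lets . match newlines.
;; The negative lookahead now stops only at a "Foo:" that begins a line.
(def chunk-re #"(?ms)^Foo:(?:(?!^Foo:).)*")

;; With no capturing groups, re-seq returns a seq of plain strings.
(def chunks (re-seq chunk-re sample))

(count chunks)
;; => 2
```

The trade-off is that the pattern now depends on layout (indentation) rather than on the keyword alone, so it would break if a header were ever indented.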

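Current solution

I no longer use a regex, as it is too nuanced for the simple chunking I need. The first solution was a loop/recur affair with several (too many) conditions and a mutated atom as the accumulating map.

I had been itching to do something specific with reduce. It may not be the best application, but I learned it through this exercise and removed an excessive number of lines of code.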

(require '[clojure.string :as s])

;; Maps the first token of a line (an OWL keyword) to the production it belongs to.
(def owl-type-map
  {"Prefix:"               :prefixes
   "AnnotationProperty:"   :annotation-properties
   "Ontology:"             :ontology
   "Datatype:"             :data-types
   "DataProperty:"         :data-properties
   "ObjectProperty:"       :object-properties
   "Class:"                :classes
   "Individual:"           :individuals
   "EquivalentClasses:"    :miscellaneous
   "DisjointClasses:"      :miscellaneous
   "EquivalentProperties:" :miscellaneous
   "DisjointProperties:"   :miscellaneous
   "SameIndividual:"       :miscellaneous
   "DifferentIndividuals:" :miscellaneous})

;; Initial accumulator: one nil entry per production, plus the :current production.
(def owl-control (reduce #(assoc %1 (second %2) nil) {:current nil} owl-type-map))

(def space-split #(s/split (str %) #" "))

(defn owl-chunk
  "Reduce ready function to accumulate a series of strings associated to
  particular instaparse EBNF productions (e.g. Class:, Prefix:, Ontology:).
  owl-type-map refers to the association between owl-type (string) and EBNF production"
  [acc v]
  (let [odex  (:current acc)
        stip  ((comp first space-split) v)
        ;; Lines that do not start with a known keyword stay in the current production.
        index (get owl-type-map stip odex)
        imap  (if (= index odex) acc (assoc-in acc [:current] index))]
    (assoc-in imap [index] (str (get imap index) v "\n"))))

;; Calling (s is assumed here to be the sequence of input lines)

(reduce owl-chunk owl-control s)
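To make the behaviour of this kind of reducer concrete, here is a small self-contained sketch of the same idea run on made-up data. The names type-map, chunk-line, and sample-lines, and the "Baz:" keyword, are hypothetical and exist only for illustration; the real code uses owl-type-map and owl-chunk.

```clojure
(require '[clojure.string :as str])

;; Hypothetical keyword table standing in for owl-type-map.
(def type-map {"Foo:" :foos
               "Baz:" :bazzes})

(defn chunk-line
  "Append line to the bucket named by its first token, or to the
  current bucket when the first token is not a known keyword."
  [acc line]
  (let [kw    (first (str/split line #" "))
        index (get type-map kw (:current acc))]
    (assoc acc
           :current index
           index (str (get acc index) line "\n"))))

;; Made-up input: the comment line contains "Foo:" but does not start with it.
(def sample-lines
  ["Foo: B"
   "  \"This is instance B of type Foo:\""
   "Baz: Q"
   "  Bar: Y"])

(def result (reduce chunk-line {:current nil} sample-lines))

(:foos result)
;; => "Foo: B\n  \"This is instance B of type Foo:\"\n"
(:bazzes result)
;; => "Baz: Q\n  Bar: Y\n"
```

Because only the first space-separated token of each line is inspected, a keyword that appears later in a line (such as inside the quoted comment) never switches buckets.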

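You may want to consider using a parser generator. Mark Engelberg's instaparse is an excellent parsing library for Clojure, designed to make that an easy choice; the first line of its README is "What if context-free grammars were as easy to use as regular expressions?"

Here is an example of how it could be used to parse the sample input: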

;; [instaparse "1.3.5"]
(require '[instaparse.core :as insta])

(def p (insta/parser "

S = Group*
Group = GroupHeader GroupComment GroupBody
GroupHeader = #'[A-Za-z]+' ': ' #'[A-Za-z]+' '\n'
GroupComment = ws? '\"' #'[^\"]+' '\"\n'
GroupBody = Line*
Line = #'.*' '\n'
ws = #'\\s+'

"))

(p "Foo: B
  \"This is instance B of type Foo\"
  Bar: X
Foo: C
  \"This is instance C of type Foo\"
  Bar: Y
")
;;=>
[:S
 [:Group
  [:GroupHeader "Foo" ": " "B" "\n"]
  [:GroupComment [:ws "  "] "\"" "This is instance B of type Foo" "\"\n"]
  [:GroupBody
   [:Line "  Bar: X" "\n"]]]
 [:Group
  [:GroupHeader "Foo" ": " "C" "\n"]
  [:GroupComment [:ws "  "] "\"" "This is instance C of type Foo" "\"\n"]
  [:GroupBody
   [:Line "  Bar: Y" "\n"]]]]

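Adding a colon after the "Foo" inside the quoted string is not a problem. (Of course the grammar above is very simplistic; I suppose you might want to start nested groups at Bar: and so on.)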

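Michal, I am only using instaparse at this point; I need to chunk the file before feeding it in. It is a matter of scale... More specifically, the grammar I am using with instaparse has ambiguities that lead to exponential parsing. Chunking before input lets me target specific productions and breeze through the parse. Hope this helps and makes sense.

Perhaps you could use a simpler "chunking grammar" in a preprocessing step and then apply your ambiguous grammar to the resulting chunks?

I avoided both regex and instaparse and did the chunking with clojure.string/split. Added the final solution; I got my feet wet with reduce. As always, thanks for the advice.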
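The "chunk first, then parse each chunk" idea from these comments could look roughly like the sketch below. It is not the poster's final code: chunk-text and chunk-name are hypothetical helpers, chunk-name is only a stand-in for a real per-chunk parser, and the split relies on headers starting at the beginning of a line.

```clojure
(require '[clojure.string :as str])

(defn chunk-text
  "Split text in front of every line that starts with kw (e.g. \"Foo:\")."
  [text kw]
  (let [re (re-pattern (str "(?m)(?=^" (java.util.regex.Pattern/quote kw) ")"))]
    ;; remove guards against an empty leading piece on older JVMs
    (remove str/blank? (str/split text re))))

;; Hypothetical stand-in for a real per-chunk parser:
;; it just extracts the name that follows the keyword.
(defn chunk-name [chunk]
  (second (re-find #"^\w+: (\w+)" chunk)))

;; Made-up input with "Foo:" inside both quoted comments.
(def sample
  (str "Foo: B\n  \"type Foo:\"\n  Bar: X\n"
       "Foo: C\n  \"type Foo:\"\n  Bar: Y\n"))

(map chunk-name (chunk-text sample "Foo:"))
;; => ("B" "C")
```

Each chunk can then be handed to a targeted grammar production on its own, which keeps any ambiguity confined to a small input.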