Converting an HTML structure into a Clojure structure

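I have an HTML page containing a structure that I would like to convert into a Clojure data structure. I am wondering how to approach this in an idiomatic way.

This is my structure: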

<div class=“group”>
  <h2>title1</h2>
  <div class=“subgroup”>
    <p>unused</p>
    <h3>subheading1</h3>
    <a href=“path1” />
  </div>
  <div class=“subgroup”>
    <p>unused</p>
    <h3>subheading2</h3>
    <a href=“path2” />
  </div>
</div>
<div class=“group”>
  <h2>title2</h2>
  <div class=“subgroup”>
    <p>unused</p>
    <h3>subheading3</h3>
    <a href=“path3” />
  </div>
</div>

The repeated titles are intentional.

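I have read up on this. What I found offers a good solution if there is parity between groups and subgroups, but in this case the relationship can be random.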

Thanks for any suggestions.

You can use Hickory for parsing, and Clojure then provides some very nice tools for transforming the parsed HTML into the format you want:

(require '[hickory.core :as html])

(defn classifier [tag klass]
  (comp #{[:element tag klass]} (juxt :type :tag (comp :class :attrs))))

(def group? (classifier :div "“group”"))
(def subgroup? (classifier :div "“subgroup”"))
(def path? (classifier :a nil))
(defn identifier? [tag] (classifier tag nil))

(defn only [x]
  ;; https://stackoverflow.com/a/14792289/5044950
  {:pre [(seq x)
         (nil? (next x))]}
  (first x))

(defn identifier [tag element]
  (->> element :content (filter (identifier? tag)) only :content only))

(defn process [data]
  (for [group (filter group? (map html/as-hickory (html/parse-fragment data)))
        :let [title (identifier :h2 group)]
        subgroup (filter subgroup? (:content group))
        :let [subheading (identifier :h3 subgroup)]
        path (filter path? (:content subgroup))]
    [title subheading (:href (:attrs path))]))
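
The `classifier` helper above is compact, so a small illustration of how it behaves may help. The sketch below repeats its definition and applies it to a hand-written node. The node is an assumption: it only imitates the `:type`/`:tag`/`:attrs`/`:content` map shape that Hickory produces, and its class value is simplified to a plain `"group"`.

```clojure
;; Same definition as above: juxt extracts [type tag class] from a node,
;; and the one-element set acts as a predicate on that vector.
(defn classifier [tag klass]
  (comp #{[:element tag klass]} (juxt :type :tag (comp :class :attrs))))

;; Hand-written node in Hickory's map shape (illustrative only).
(def node {:type :element :tag :div :attrs {:class "group"} :content []})

((juxt :type :tag (comp :class :attrs)) node)
;; => [:element :div "group"]

((classifier :div "group") node)
;; => [:element :div "group"]   ; truthy, so filter keeps the node

((classifier :a nil) node)
;; => nil                       ; falsey, so filter drops the node
```

Because a set returns the matching element or `nil`, the composed function can be passed straight to `filter`, which is how `group?`, `subgroup?` and `path?` are used in `process`.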

For example:

(require '[clojure.pprint :as pprint])

(def data
"<div class=“group”>
  <h2>title1</h2>
  <div class=“subgroup”>
    <p>unused</p>
    <h3>subheading1</h3>
    <a href=“path1” />
  </div>
  <div class=“subgroup”>
    <p>unused</p>
    <h3>subheading2</h3>
    <a href=“path2” />
  </div>
</div>
<div class=“group”>
  <h2>title2</h2>
  <div class=“subgroup”>
    <p>unused</p>
    <h3>subheading3</h3>
    <a href=“path3” />
  </div>
</div>")

(pprint/pprint (process data))
;; (["title1" "subheading1" "“path1”"]
;;  ["title1" "subheading2" "“path2”"]
;;  ["title2" "subheading3" "“path3”"])
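
If a nested structure is preferred over flat rows, the result can be regrouped with core functions. The following is only a sketch: the name `nest-rows` and the `:title`/`:subgroups`/`:subheading`/`:path` keys are made up for illustration, and it assumes that rows belonging to one group are adjacent and that neighbouring groups have different titles.

```clojure
;; Flat [title subheading path] rows, copied from the output above.
;; The curly quotes are part of the attribute values in the sample HTML.
(def rows
  [["title1" "subheading1" "“path1”"]
   ["title1" "subheading2" "“path2”"]
   ["title2" "subheading3" "“path3”"]])

(defn nest-rows
  "Groups consecutive rows that share a title into one map per group."
  [rows]
  (->> rows
       (partition-by first)
       (mapv (fn [group-rows]
               {:title     (ffirst group-rows)
                :subgroups (mapv (fn [[_ subheading path]]
                                   {:subheading subheading :path path})
                                 group-rows)}))))

(nest-rows rows)
;; => [{:title "title1"
;;      :subgroups [{:subheading "subheading1" :path "“path1”"}
;;                  {:subheading "subheading2" :path "“path2”"}]}
;;     {:title "title2"
;;      :subgroups [{:subheading "subheading3" :path "“path3”"}]}]
```

`partition-by` is used instead of `group-by` so that the original document order is preserved and non-adjacent groups with the same title stay separate.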

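The solution can be split into two parts:

Parsing: parse the HTML with any HTML parser.
Custom data structure: transform the parsed HTML into the structure you need, with a helper library if you like.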
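
The second part can be sketched with plain `clojure.core` functions. Everything below is an assumption made for illustration: `parsed` is hand-written data imitating the `:tag`/`:attrs`/`:content` maps that parsers such as `clojure.xml` and Enlive return (whitespace nodes left out, class values simplified), and the helper names are hypothetical. It also assumes every `div` directly inside a group is a subgroup.

```clojure
;; Hand-written stand-in for parser output (not produced by a real parser).
(def parsed
  [{:tag :div :attrs {:class "group"}
    :content [{:tag :h2 :attrs nil :content ["title1"]}
              {:tag :div :attrs {:class "subgroup"}
               :content [{:tag :p :attrs nil :content ["unused"]}
                         {:tag :h3 :attrs nil :content ["subheading1"]}
                         {:tag :a :attrs {:href "path1"} :content nil}]}
              {:tag :div :attrs {:class "subgroup"}
               :content [{:tag :p :attrs nil :content ["unused"]}
                         {:tag :h3 :attrs nil :content ["subheading2"]}
                         {:tag :a :attrs {:href "path2"} :content nil}]}]}
   {:tag :div :attrs {:class "group"}
    :content [{:tag :h2 :attrs nil :content ["title2"]}
              {:tag :div :attrs {:class "subgroup"}
               :content [{:tag :p :attrs nil :content ["unused"]}
                         {:tag :h3 :attrs nil :content ["subheading3"]}
                         {:tag :a :attrs {:href "path3"} :content nil}]}]}])

(defn children
  "Direct child elements of node that have the given tag."
  [node tag]
  (filter #(= tag (:tag %)) (:content node)))

(defn text
  "First text item of the first child with the given tag."
  [node tag]
  (-> (children node tag) first :content first))

(defn subgroup->map [subgroup]
  {:subheading (text subgroup :h3)
   :paths      (mapv (comp :href :attrs) (children subgroup :a))})

(defn group->map [group]
  {:title     (text group :h2)
   :subgroups (mapv subgroup->map (children group :div))})

(mapv group->map parsed)
;; => [{:title "title1"
;;      :subgroups [{:subheading "subheading1" :paths ["path1"]}
;;                  {:subheading "subheading2" :paths ["path2"]}]}
;;     {:title "title2"
;;      :subgroups [{:subheading "subheading3" :paths ["path3"]}]}]
```

Because each level is handled by its own small function, a group may contain any number of subgroups without the code relying on a fixed pattern.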

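You can solve this problem using tupelo.forest. Below is an annotated unit test that demonstrates the approach. More information is available in the library's documentation, and further documentation is forthcoming.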

(dotest
  (with-forest (new-forest)
    (let [html-str        "<div class=“group”>
                              <h2>title1</h2>
                              <div class=“subgroup”>
                                <p>unused</p>
                                <h3>subheading1</h3>
                                <a href=“path1” />
                              </div>
                              <div class=“subgroup”>
                                <p>unused</p>
                                <h3>subheading2</h3>
                                <a href=“path2” />
                              </div>
                            </div>
                            <div class=“group”>
                              <h2>title2</h2>
                              <div class=“subgroup”>
                                <p>unused</p>
                                <h3>subheading3</h3>
                                <a href=“path3” />
                              </div>
                            </div>"

          enlive-tree     (->> html-str
                            java.io.StringReader.
                            en-html/html-resource
                            first)
          root-hid        (add-tree-enlive enlive-tree)
          tree-1          (hid->hiccup root-hid)

          ; Removing whitespace nodes is optional; just done to keep things neat
          blank-leaf-hid? (fn fn-blank-leaf-hid? ; whitespace pred fn
                            [hid]
                            (let [node (hid->node hid)]
                              (and (contains-key? node ::tf/value)
                                (ts/whitespace? (grab ::tf/value node)))))
          blank-leaf-hids (keep-if blank-leaf-hid? (all-leaf-hids)) ; find whitespace nodes
          >>              (apply remove-hid blank-leaf-hids) ; delete whitespace nodes found
          tree-2          (hid->hiccup root-hid)
          >>              (is= tree-2 [:html
                                       [:body
                                        [:div {:class "“group”"}
                                         [:h2 "title1"]
                                         [:div {:class "“subgroup”"}
                                          [:p "unused"]
                                          [:h3 "subheading1"]
                                          [:a {:href "“path1”"}]]
                                         [:div {:class "“subgroup”"}
                                          [:p "unused"]
                                          [:h3 "subheading2"]
                                          [:a {:href "“path2”"}]]]
                                        [:div {:class "“group”"}
                                         [:h2 "title2"]
                                         [:div {:class "“subgroup”"}
                                          [:p "unused"]
                                          [:h3 "subheading3"]
                                          [:a {:href "“path3”"}]]]]])

          ; find consecutive nested [:div :h2] pairs at any depth in the tree
          div-h2-paths    (find-paths root-hid [:** :div :h2])
          >>              (is= (format-paths div-h2-paths)
                            [[{:tag :html}
                              [{:tag :body}
                               [{:class "“group”", :tag :div}
                                [{:tag :h2, :tupelo.forest/value "title1"}]]]]
                             [{:tag :html}
                              [{:tag :body}
                               [{:class "“group”", :tag :div}
                                [{:tag :h2, :tupelo.forest/value "title2"}]]]]])

          ; find the hid for each top-level :div (i.e. "group"); the next-to-last (-2) hid in each vector
          div-hids        (mapv #(idx % -2) div-h2-paths)
          ; for each of div-hids, find and collect nested :h3 values
          dif-h3-paths    (vec
                            (lazy-gen
                              (doseq [div-hid div-hids]
                                (let [h2-value  (find-leaf-value div-hid [:div :h2])
                                      h3-paths  (find-paths div-hid [:** :h3])
                                      h3-values (it-> h3-paths (mapv last it) (mapv hid->value it))]
                                  (doseq [h3-value h3-values]
                                    (yield [h2-value h3-value]))))))
          ]
      (is= dif-h3-paths
        [["title1" "subheading1"]
         ["title1" "subheading2"]
         ["title2" "subheading3"]])

      )))
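
For comparison, the same title/subheading pairs can be pulled out of a Hiccup-style tree with core functions only. This is a rough sketch rather than part of the library: `tree` repeats the cleaned tree from the test above, and all helper names (`element?`, `kids`, `find-all`, and so on) are hypothetical.

```clojure
;; Hiccup-style tree, same shape as tree-2 in the test above.
(def tree
  [:html
   [:body
    [:div {:class "“group”"}
     [:h2 "title1"]
     [:div {:class "“subgroup”"}
      [:p "unused"] [:h3 "subheading1"] [:a {:href "“path1”"}]]
     [:div {:class "“subgroup”"}
      [:p "unused"] [:h3 "subheading2"] [:a {:href "“path2”"}]]]
    [:div {:class "“group”"}
     [:h2 "title2"]
     [:div {:class "“subgroup”"}
      [:p "unused"] [:h3 "subheading3"] [:a {:href "“path3”"}]]]]])

(defn element? [x] (and (vector? x) (keyword? (first x))))
(defn tag-of [el] (first el))
(defn kids [el] (filter element? (rest el)))           ; child elements only
(defn text-of [el] (first (filter string? (rest el)))) ; first text child

(defn find-all
  "All elements with the given tag at any depth, in document order."
  [tag el]
  (filter #(= tag (tag-of %)) (tree-seq element? kids el)))

(defn child-with-tag [tag el]
  (first (filter #(= tag (tag-of %)) (kids el))))

;; Every :div that directly contains an :h2 is treated as a group;
;; all :h3 elements nested below it are paired with that title.
(def pairs
  (vec (for [div   (find-all :div tree)
             :let  [h2 (child-with-tag :h2 div)]
             :when h2
             h3    (find-all :h3 div)]
         [(text-of h2) (text-of h3)])))

pairs
;; => [["title1" "subheading1"]
;;     ["title1" "subheading2"]
;;     ["title2" "subheading3"]]
```
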
