Converting an HTML structure into a Clojure structure

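I have an HTML page containing a structure that I would like to convert into a Clojure data structure. I am stuck on how to approach this idiomatically.

This is the structure I have: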

<div class=“group”>
  <h2>title1</h2>
  <div class=“subgroup”>
    <p>unused</p>
    <h3>subheading1</h3>
    <a href=“path1” />
  </div>
  <div class=“subgroup”>
    <p>unused</p>
    <h3>subheading2</h3>
    <a href=“path2” />
  </div>
</div>
<div class=“group”>
  <h2>title2</h2>
  <div class=“subgroup”>
    <p>unused</p>
    <h3>subheading3</h3>
    <a href=“path3” />
  </div>
</div>
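The repetition of titles in the result is intentional.

I have read a related post. It offers a nice solution when there is parity between groups and subgroups, but in this case the number of subgroups per group can be arbitrary.

Thanks for any suggestions.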

You can use Hickory for parsing, and Clojure then provides some very nice tools for turning the parsed HTML into the form you want:

(require '[hickory.core :as html])

(defn classifier [tag klass]
  (comp #{[:element tag klass]} (juxt :type :tag (comp :class :attrs))))

(def group? (classifier :div "“group”"))
(def subgroup? (classifier :div "“subgroup”"))
(def path? (classifier :a nil))
(defn identifier? [tag] (classifier tag nil))

(defn only [x]
  ;; https://stackoverflow.com/a/14792289/5044950
  {:pre [(seq x)
         (nil? (next x))]}
  (first x))

(defn identifier [tag element]
  (->> element :content (filter (identifier? tag)) only :content only))

(defn process [data]
  (for [group (filter group? (map html/as-hickory (html/parse-fragment data)))
        :let [title (identifier :h2 group)]
        subgroup (filter subgroup? (:content group))
        :let [subheading (identifier :h3 subgroup)]
        path (filter path? (:content subgroup))]
    [title subheading (:href (:attrs path))]))
For example:

(require '[clojure.pprint :as pprint])

(def data
"<div class=“group”>
  <h2>title1</h2>
  <div class=“subgroup”>
    <p>unused</p>
    <h3>subheading1</h3>
    <a href=“path1” />
  </div>
  <div class=“subgroup”>
    <p>unused</p>
    <h3>subheading2</h3>
    <a href=“path2” />
  </div>
</div>
<div class=“group”>
  <h2>title2</h2>
  <div class=“subgroup”>
    <p>unused</p>
    <h3>subheading3</h3>
    <a href=“path3” />
  </div>
</div>")

(pprint/pprint (process data))
;; (["title1" "subheading1" "“path1”"]
;;  ["title1" "subheading2" "“path2”"]
;;  ["title2" "subheading3" "“path3”"])
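To see why `classifier` and `only` behave as predicates and guards, it can help to run them on hand-written nodes. The sketch below needs no Hickory dependency: the node maps are hypothetical stand-ins for what `as-hickory` returns, and plain class values are assumed instead of the curly-quoted ones in the example data.

```clojure
;; Same definition as in the answer: juxt pulls out [type tag class],
;; and the one-element set acts as a function that returns its argument
;; when it is a member, or nil otherwise.
(defn classifier [tag klass]
  (comp #{[:element tag klass]} (juxt :type :tag (comp :class :attrs))))

(def group? (classifier :div "group"))
(def path?  (classifier :a nil))

;; Hypothetical nodes in the shape Hickory produces.
(def group-node    {:type :element :tag :div :attrs {:class "group"} :content []})
(def subgroup-node {:type :element :tag :div :attrs {:class "subgroup"} :content []})
(def link-node     {:type :element :tag :a :attrs {:href "path1"} :content nil})

(group? group-node)    ;; => [:element :div "group"] (truthy)
(group? subgroup-node) ;; => nil
(path? link-node)      ;; => [:element :a nil] (matches only when no class is set)

;; `only` returns the single element and fails its precondition otherwise.
(defn only [x]
  {:pre [(seq x)
         (nil? (next x))]}
  (first x))

(def only-rejects-two?
  (try (only ["a" "b"]) false
       (catch AssertionError _ true)))
```

Note that a classifier built with `nil` as the class, such as `path?`, rejects elements that do carry a class attribute, which is what keeps `identifier?` strict about the `h2` and `h3` elements it accepts.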
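The solution can be split into two parts:

  • Parsing: parse the HTML with a parser library (any parser will do).
  • Custom data structure: transform the parsed HTML into the shape you need.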
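The two steps above can be sketched with nothing but the standard library. This is only an illustration under stated assumptions: `clojure.xml` ships with Clojure but accepts only well-formed XML, so the fragment is wrapped in a hypothetical `<root>` element and uses plain quotes. Real-world HTML would need a proper HTML parser for step 1. The helper names (`parse-str`, `children`, `has-class?`, `text`, `rows`) are made up for this example.

```clojure
(require '[clojure.xml :as xml])

(def html-str
  "<root>
     <div class='group'>
       <h2>title1</h2>
       <div class='subgroup'><p>unused</p><h3>subheading1</h3><a href='path1'/></div>
       <div class='subgroup'><p>unused</p><h3>subheading2</h3><a href='path2'/></div>
     </div>
     <div class='group'>
       <h2>title2</h2>
       <div class='subgroup'><p>unused</p><h3>subheading3</h3><a href='path3'/></div>
     </div>
   </root>")

;; Step 1: parsing. Yields nested maps with :tag, :attrs and :content.
(defn parse-str [s]
  (xml/parse (java.io.ByteArrayInputStream. (.getBytes s "UTF-8"))))

;; Step 2: custom data structure. Small helpers plus one `for`.
(defn children [node tag]
  (filter #(and (map? %) (= tag (:tag %))) (:content node)))

(defn has-class? [node klass]
  (= klass (get-in node [:attrs :class])))

(defn text [node]
  (apply str (filter string? (:content node))))

(defn rows [root]
  (vec
    (for [group (children root :div)
          :when (has-class? group "group")
          :let [title (text (first (children group :h2)))]
          sub (children group :div)
          :when (has-class? sub "subgroup")
          :let [subheading (text (first (children sub :h3)))]
          a (children sub :a)]
      [title subheading (get-in a [:attrs :href])])))

(rows (parse-str html-str))
;; => [["title1" "subheading1" "path1"]
;;     ["title1" "subheading2" "path2"]
;;     ["title2" "subheading3" "path3"]]
```

Because the title is bound in the outer loop, it repeats once per subgroup no matter how many subgroups a group contains.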

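You can solve this problem using the tupelo.forest library. Below is an annotated unit test that demonstrates the approach. More information is available in the library's documentation, and further docs are forthcoming.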

(dotest
  (with-forest (new-forest)
    (let [html-str        "<div class=“group”>
                              <h2>title1</h2>
                              <div class=“subgroup”>
                                <p>unused</p>
                                <h3>subheading1</h3>
                                <a href=“path1” />
                              </div>
                              <div class=“subgroup”>
                                <p>unused</p>
                                <h3>subheading2</h3>
                                <a href=“path2” />
                              </div>
                            </div>
                            <div class=“group”>
                              <h2>title2</h2>
                              <div class=“subgroup”>
                                <p>unused</p>
                                <h3>subheading3</h3>
                                <a href=“path3” />
                              </div>
                            </div>"

          enlive-tree     (->> html-str
                            java.io.StringReader.
                            en-html/html-resource
                            first)
          root-hid        (add-tree-enlive enlive-tree)
          tree-1          (hid->hiccup root-hid)

          ; Removing whitespace nodes is optional; just done to keep things neat
          blank-leaf-hid? (fn fn-blank-leaf-hid? ; whitespace pred fn
                            [hid]
                            (let [node (hid->node hid)]
                              (and (contains-key? node ::tf/value)
                                (ts/whitespace? (grab ::tf/value node)))))
          blank-leaf-hids (keep-if blank-leaf-hid? (all-leaf-hids)) ; find whitespace nodes
          >>              (apply remove-hid blank-leaf-hids) ; delete whitespace nodes found
          tree-2          (hid->hiccup root-hid)
          >>              (is= tree-2 [:html
                                       [:body
                                        [:div {:class "“group”"}
                                         [:h2 "title1"]
                                         [:div {:class "“subgroup”"}
                                          [:p "unused"]
                                          [:h3 "subheading1"]
                                          [:a {:href "“path1”"}]]
                                         [:div {:class "“subgroup”"}
                                          [:p "unused"]
                                          [:h3 "subheading2"]
                                          [:a {:href "“path2”"}]]]
                                        [:div {:class "“group”"}
                                         [:h2 "title2"]
                                         [:div {:class "“subgroup”"}
                                          [:p "unused"]
                                          [:h3 "subheading3"]
                                          [:a {:href "“path3”"}]]]]])

          ; find consecutive nested [:div :h2] pairs at any depth in the tree
          div-h2-paths    (find-paths root-hid [:** :div :h2])
          >>              (is= (format-paths div-h2-paths)
                            [[{:tag :html}
                              [{:tag :body}
                               [{:class "“group”", :tag :div}
                                [{:tag :h2, :tupelo.forest/value "title1"}]]]]
                             [{:tag :html}
                              [{:tag :body}
                               [{:class "“group”", :tag :div}
                                [{:tag :h2, :tupelo.forest/value "title2"}]]]]])

          ; find the hid for each top-level :div (i.e. "group"); the next-to-last (-2) hid in each vector
          div-hids        (mapv #(idx % -2) div-h2-paths)
          ; for each of div-hids, find and collect nested :h3 values
          dif-h3-paths    (vec
                            (lazy-gen
                              (doseq [div-hid div-hids]
                                (let [h2-value  (find-leaf-value div-hid [:div :h2])
                                      h3-paths  (find-paths div-hid [:** :h3])
                                      h3-values (it-> h3-paths (mapv last it) (mapv hid->value it))]
                                  (doseq [h3-value h3-values]
                                    (yield [h2-value h3-value]))))))
          ]
      (is= dif-h3-paths
        [["title1" "subheading1"]
         ["title1" "subheading2"]
         ["title2" "subheading3"]])

      )))
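For comparison, the same traversal idea (find every `div` that directly contains an `h2`, then collect the `h3` values nested anywhere beneath it) can be sketched with `tree-seq` from clojure.core. This is not the tupelo.forest API. It is a plain-Clojure approximation over a hand-written tree of `:tag`/`:attrs`/`:content` maps, and the helper names are hypothetical.

```clojure
;; Hand-written tree in the map shape that Enlive-style parsers produce.
(def tree
  {:tag :html
   :content
   [{:tag :body
     :content
     [{:tag :div :attrs {:class "group"}
       :content [{:tag :h2 :content ["title1"]}
                 {:tag :div :attrs {:class "subgroup"}
                  :content [{:tag :p :content ["unused"]}
                            {:tag :h3 :content ["subheading1"]}
                            {:tag :a :attrs {:href "path1"}}]}
                 {:tag :div :attrs {:class "subgroup"}
                  :content [{:tag :p :content ["unused"]}
                            {:tag :h3 :content ["subheading2"]}
                            {:tag :a :attrs {:href "path2"}}]}]}
      {:tag :div :attrs {:class "group"}
       :content [{:tag :h2 :content ["title2"]}
                 {:tag :div :attrs {:class "subgroup"}
                  :content [{:tag :p :content ["unused"]}
                            {:tag :h3 :content ["subheading3"]}
                            {:tag :a :attrs {:href "path3"}}]}]}]}]})

;; All nodes with the given tag, at any depth (depth-first order).
(defn tagged [tag root]
  (filter #(= tag (:tag %)) (tree-seq map? :content root)))

;; First direct child with the given tag, or nil.
(defn direct-child [tag node]
  (first (filter #(= tag (:tag %)) (:content node))))

(defn text [node]
  (apply str (filter string? (:content node))))

(defn title-subheading-pairs [root]
  (vec
    (for [div (tagged :div root)
          :let [h2 (direct-child :h2 div)]
          :when h2                     ; only the "group" divs own an h2
          h3 (tagged :h3 div)]
      [(text h2) (text h3)])))

(title-subheading-pairs tree)
;; => [["title1" "subheading1"]
;;     ["title1" "subheading2"]
;;     ["title2" "subheading3"]]
```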