Converting an HTML structure into a Clojure structure
I have an HTML page containing a structure that I want to convert into a Clojure data structure. I'm trying to work out how to approach this idiomatically. This is my structure:
<div class=“group”>
<h2>title1</h2>
<div class=“subgroup”>
<p>unused</p>
<h3>subheading1</h3>
<a href=“path1” />
</div>
<div class=“subgroup”>
<p>unused</p>
<h3>subheading2</h3>
<a href=“path2” />
</div>
</div>
<div class=“group”>
<h2>title2</h2>
<div class=“subgroup”>
<p>unused</p>
<h3>subheading3</h3>
<a href=“path3” />
</div>
</div>
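The repeated titles are intentional.
I have read up on this. What I found gives a nice solution when there is parity between groups and subgroups, but in this case the number of subgroups may be random.
Thanks for any suggestions.
You can use Hickory for parsing, and Clojure then provides some very nice tools for transforming the parsed HTML into the form you want: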
(require '[hickory.core :as html])

(defn classifier [tag klass]
  (comp #{[:element tag klass]} (juxt :type :tag (comp :class :attrs))))

(def group? (classifier :div "“group”"))
(def subgroup? (classifier :div "“subgroup”"))
(def path? (classifier :a nil))

(defn identifier? [tag] (classifier tag nil))

(defn only [x]
  ;; https://stackoverflow.com/a/14792289/5044950
  {:pre [(seq x)
         (nil? (next x))]}
  (first x))

(defn identifier [tag element]
  (->> element :content (filter (identifier? tag)) only :content only))

(defn process [data]
  (for [group (filter group? (map html/as-hickory (html/parse-fragment data)))
        :let [title (identifier :h2 group)]
        subgroup (filter subgroup? (:content group))
        :let [subheading (identifier :h3 subgroup)]
        path (filter path? (:content subgroup))]
    [title subheading (:href (:attrs path))]))
For example:
(require '[clojure.pprint :as pprint])
(def data
"<div class=“group”>
<h2>title1</h2>
<div class=“subgroup”>
<p>unused</p>
<h3>subheading1</h3>
<a href=“path1” />
</div>
<div class=“subgroup”>
<p>unused</p>
<h3>subheading2</h3>
<a href=“path2” />
</div>
</div>
<div class=“group”>
<h2>title2</h2>
<div class=“subgroup”>
<p>unused</p>
<h3>subheading3</h3>
<a href=“path3” />
</div>
</div>")
(pprint/pprint (process data))
;; (["title1" "subheading1" "“path1”"]
;; ["title1" "subheading2" "“path2”"]
;; ["title2" "subheading3" "“path3”"])
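The predicates in this answer are built by composing `juxt` with a set lookup, which can be hard to read at first. The following is a minimal sketch of that mechanism on a hand-written node, assuming the Hickory node shape `{:type :tag :attrs :content}`. The names `node` and `describe` are made up for illustration, and the class value is a plain string here rather than the curly-quoted value carried by the sample HTML.

```clojure
;; Hypothetical hand-written node in Hickory's map shape (no parsing involved).
(def node {:type :element, :tag :div, :attrs {:class "group"}, :content []})

;; juxt collects several lookups into one vector that describes the node.
(def describe (juxt :type :tag (comp :class :attrs)))

;; A one-element set acts as a function: it returns the vector when the
;; description matches exactly, and nil otherwise.
(defn classifier [tag klass]
  (comp #{[:element tag klass]} describe))

(describe node)                       ; => [:element :div "group"]
((classifier :div "group") node)      ; => [:element :div "group"] (truthy)
((classifier :div "subgroup") node)   ; => nil
((classifier :div "group") "\n")      ; => nil, whitespace strings never match
```

Because keyword lookups on a string return nil, the whitespace text nodes between elements fall through every predicate without any special handling.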
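The solution can be split into two parts:
- Parsing: parse the HTML with any HTML parser.
- Custom data structure: transform the parsed HTML into the shape you need, optionally with a helper library.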
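The second step can be sketched with nothing but `clojure.core`, assuming the parser returns nodes shaped like `{:tag :attrs :content}` (the shape Enlive and `clojure.xml` produce). The `parsed` tree below is written by hand to stand in for parser output, and `children`, `has-class?`, `text`, and `rows` are hypothetical helper names, not part of any library. Class and href values are plain strings in this sketch.

```clojure
;; Hand-written stand-in for the parsed HTML (assumed node shape).
(def parsed
  {:tag :body
   :attrs nil
   :content
   [{:tag :div, :attrs {:class "group"}
     :content [{:tag :h2, :attrs nil, :content ["title1"]}
               {:tag :div, :attrs {:class "subgroup"}
                :content [{:tag :p, :attrs nil, :content ["unused"]}
                          {:tag :h3, :attrs nil, :content ["subheading1"]}
                          {:tag :a, :attrs {:href "path1"}, :content nil}]}
               {:tag :div, :attrs {:class "subgroup"}
                :content [{:tag :p, :attrs nil, :content ["unused"]}
                          {:tag :h3, :attrs nil, :content ["subheading2"]}
                          {:tag :a, :attrs {:href "path2"}, :content nil}]}]}
    {:tag :div, :attrs {:class "group"}
     :content [{:tag :h2, :attrs nil, :content ["title2"]}
               {:tag :div, :attrs {:class "subgroup"}
                :content [{:tag :p, :attrs nil, :content ["unused"]}
                          {:tag :h3, :attrs nil, :content ["subheading3"]}
                          {:tag :a, :attrs {:href "path3"}, :content nil}]}]}]})

(defn children
  "Direct child elements of node with the given tag."
  [node tag]
  (filter #(= tag (:tag %)) (:content node)))

(defn has-class? [node klass]
  (= klass (get-in node [:attrs :class])))

(defn text
  "Concatenated string children of node; empty string when node is nil."
  [node]
  (apply str (filter string? (:content node))))

(defn rows
  "One [title subheading href] row per link, for any number of subgroups."
  [root]
  (for [group (tree-seq map? :content root)
        :when (and (= :div (:tag group)) (has-class? group "group"))
        :let [title (text (first (children group :h2)))]
        subgroup (children group :div)
        :when (has-class? subgroup "subgroup")
        :let [subheading (text (first (children subgroup :h3)))]
        link (children subgroup :a)]
    [title subheading (get-in link [:attrs :href])]))

(rows parsed)
;; => (["title1" "subheading1" "path1"]
;;     ["title1" "subheading2" "path2"]
;;     ["title2" "subheading3" "path3"])
```

The nested `for` repeats the group title on every row, so groups with differing numbers of subgroups need no special casing.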
;; Namespace setup is an assumption; the original answer does not show its requires,
;; and the namespace name below is hypothetical.
(ns tst.demo.core
  (:use tupelo.core tupelo.forest tupelo.test)
  (:require
    [net.cgrand.enlive-html :as en-html]
    [tupelo.forest :as tf]
    [tupelo.string :as ts]))

(dotest
  (with-forest (new-forest)
    (let [html-str "<div class=“group”>
                      <h2>title1</h2>
                      <div class=“subgroup”>
                        <p>unused</p>
                        <h3>subheading1</h3>
                        <a href=“path1” />
                      </div>
                      <div class=“subgroup”>
                        <p>unused</p>
                        <h3>subheading2</h3>
                        <a href=“path2” />
                      </div>
                    </div>
                    <div class=“group”>
                      <h2>title2</h2>
                      <div class=“subgroup”>
                        <p>unused</p>
                        <h3>subheading3</h3>
                        <a href=“path3” />
                      </div>
                    </div>"
          enlive-tree (->> html-str
                           java.io.StringReader.
                           en-html/html-resource
                           first)
          root-hid (add-tree-enlive enlive-tree)
          tree-1 (hid->hiccup root-hid)

          ; Removing whitespace nodes is optional; just done to keep things neat
          blank-leaf-hid? (fn fn-blank-leaf-hid? ; whitespace pred fn
                            [hid]
                            (let [node (hid->node hid)]
                              (and (contains-key? node ::tf/value)
                                   (ts/whitespace? (grab ::tf/value node)))))
          blank-leaf-hids (keep-if blank-leaf-hid? (all-leaf-hids)) ; find whitespace nodes
          >> (apply remove-hid blank-leaf-hids) ; delete whitespace nodes found
          tree-2 (hid->hiccup root-hid)
          >> (is= tree-2 [:html
                          [:body
                           [:div {:class "“group”"}
                            [:h2 "title1"]
                            [:div {:class "“subgroup”"}
                             [:p "unused"]
                             [:h3 "subheading1"]
                             [:a {:href "“path1”"}]]
                            [:div {:class "“subgroup”"}
                             [:p "unused"]
                             [:h3 "subheading2"]
                             [:a {:href "“path2”"}]]]
                           [:div {:class "“group”"}
                            [:h2 "title2"]
                            [:div {:class "“subgroup”"}
                             [:p "unused"]
                             [:h3 "subheading3"]
                             [:a {:href "“path3”"}]]]]])

          ; find consecutive nested [:div :h2] pairs at any depth in the tree
          div-h2-paths (find-paths root-hid [:** :div :h2])
          >> (is= (format-paths div-h2-paths)
                  [[{:tag :html}
                    [{:tag :body}
                     [{:class "“group”", :tag :div}
                      [{:tag :h2, :tupelo.forest/value "title1"}]]]]
                   [{:tag :html}
                    [{:tag :body}
                     [{:class "“group”", :tag :div}
                      [{:tag :h2, :tupelo.forest/value "title2"}]]]]])

          ; find the hid for each top-level :div (i.e. "group"); the next-to-last (-2) hid in each vector
          div-hids (mapv #(idx % -2) div-h2-paths)

          ; for each of div-hids, find and collect nested :h3 values
          dif-h3-paths (vec
                         (lazy-gen
                           (doseq [div-hid div-hids]
                             (let [h2-value (find-leaf-value div-hid [:div :h2])
                                   h3-paths (find-paths div-hid [:** :h3])
                                   h3-values (it-> h3-paths (mapv last it) (mapv hid->value it))]
                               (doseq [h3-value h3-values]
                                 (yield [h2-value h3-value]))))))
          ]
      (is= dif-h3-paths
           [["title1" "subheading1"]
            ["title1" "subheading2"]
            ["title2" "subheading3"]])
      )))