Parsing an HTML string with JSoup in Clojure

java web-scraping clojure

I am trying to parse an HTML string with JSoup from Clojure. The source code is below.

Dependencies

:dependencies [[org.clojure/clojure "1.10.1"]
               [org.jsoup/jsoup "1.13.1"]]

Source code

(import '(org.jsoup Jsoup))   ; needed so that Jsoup/parse resolves
(require '[clojure.string :as str])
(def HTML (str "<html><head><title>Website title</title></head>
                <body><p>Sample paragraph number 1 </p>
                      <p>Sample paragraph number 2</p>
                </body></html>"))

(defn fetch_html [html]
  (let [soup (Jsoup/parse html)
        titles (.title soup)
        paragraphs (.getElementsByTag soup "p")]
    {:title titles :paragraph paragraphs}))

(fetch_html HTML)

Unfortunately, the result is not what I expected:

user ==> (fetch_html HTML)
{:title "Website title", :paragraph []}

I have a suggestion that may help. Try running it this way. To use it in your project, add the following line:

[tupelo "21.01.05"]

to the :dependencies in project.clj.
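
For orientation, a minimal project.clj with that line in place might look like the sketch below. The project name `demo` and the version string are placeholders, not taken from the question. The other two dependencies are the ones the question already lists.

```clojure
;; project.clj (sketch; `demo` and "0.1.0-SNAPSHOT" are hypothetical)
(defproject demo "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.10.1"]
                 [org.jsoup/jsoup "1.13.1"]
                 [tupelo "21.01.05"]])
```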

Code example:
(ns tst.demo.core
  (:use demo.core tupelo.core tupelo.test)
  (:require
    [tupelo.parse.tagsoup :as tagsoup]
    ))

(dotest
  (let [html "<html>
                <head><title>Website title</title></head>
                <body><p>Sample paragraph number 1 </p>
                      <p>Sample paragraph number 2</p>
                </body></html>"]
    (is= (tagsoup/parse html)
      {:tag     :html,
       :attrs   {},
       :content [{:tag     :head,
                  :attrs   {},
                  :content [{:tag :title, :attrs {}, :content ["Website title"]}]}
                 {:tag     :body,
                  :attrs   {},
                  :content [{:tag :p, :attrs {}, :content ["Sample paragraph number 1 "]}
                            {:tag :p, :attrs {}, :content ["Sample paragraph number 2"]}]}]})))
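
The parse result above is a plain nested map, so the paragraph text can be pulled out with core functions alone. The following is a minimal sketch, assuming the tree has exactly the shape shown in the test. The names `parsed` and `texts-for-tag` are made up for illustration and are not part of tupelo.

```clojure
;; The tree from the test above, written out as a literal so that this
;; snippet runs without any extra dependency.
(def parsed
  {:tag     :html
   :attrs   {}
   :content [{:tag     :head
              :attrs   {}
              :content [{:tag :title :attrs {} :content ["Website title"]}]}
             {:tag     :body
              :attrs   {}
              :content [{:tag :p :attrs {} :content ["Sample paragraph number 1 "]}
                        {:tag :p :attrs {} :content ["Sample paragraph number 2"]}]}]})

(defn texts-for-tag
  "Hypothetical helper: walks the tree depth-first and returns a vector with
  the concatenated string children of every node whose :tag equals `tag`."
  [tree tag]
  (->> (tree-seq map? :content tree)        ; maps are branches, strings are leaves
       (filter #(= tag (:tag %)))           ; keep only nodes with the wanted tag
       (mapv #(apply str (filter string? (:content %))))))

(texts-for-tag parsed :p)
;; => ["Sample paragraph number 1 " "Sample paragraph number 2"]
(texts-for-tag parsed :title)
;; => ["Website title"]
```

Note that the trailing space of the first paragraph is kept, because the strings are taken from the tree unchanged.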

(.getElementsByTag ...) returns a sequence of Element objects; you need to call the .text() method on each element to get its text value. I am using JSoup version 1.13.1.

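(ns core
  (:import (org.jsoup Jsoup))
  (:require [clojure.string :as str]))

(def HTML (str "<html><head><title>Website title</title></head>
                <body><p>Sample paragraph number 1 </p>
                      <p>Sample paragraph number 2</p>
                </body></html>"))

(defn fetch_html [html]
  (let [soup (Jsoup/parse html)
        titles (.title soup)
        paragraphs (.getElementsByTag soup "p")]
    ;; call .text on every element to get a vector of plain strings
    {:title titles :paragraph (mapv #(.text %) paragraphs)}))

(fetch_html HTML)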

Also consider using Reaver, a Clojure library that wraps Jsoup, or any other wrapper, as others have suggested.
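
As a variation on the Jsoup answer above, the same text can be collected with a CSS selector and `eachText`, which returns the text of every matched element as a list of strings. This is only a sketch, assuming jsoup 1.13.1 is on the classpath. The names `sample-html` and `paragraph-texts` are made up for illustration.

```clojure
(import '(org.jsoup Jsoup))

;; Same markup as in the question.
(def sample-html
  "<html><head><title>Website title</title></head>
   <body><p>Sample paragraph number 1 </p>
         <p>Sample paragraph number 2</p>
   </body></html>")

(defn paragraph-texts
  "Hypothetical helper: parses `html` and returns the text of every <p>
  element as a Clojure vector of strings."
  [html]
  (let [doc (Jsoup/parse html)]
    ;; .select takes a CSS query; .eachText calls .text on each match
    (vec (.eachText (.select doc "p")))))

(paragraph-texts sample-html)
;; => ["Sample paragraph number 1" "Sample paragraph number 2"]
```

Jsoup's `text` normalizes and trims whitespace, which is why the trailing space of the first paragraph does not appear in the result.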
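Comments:

Did you try passing "p" instead of "a" to the getElementsByTag method?

Please be specific in the code about the versions used, etc. The str there is not needed (nor is the import), but that does not affect the result.

@Rulle Typo fixed. Thank you.

@cfrick :dependencies [[org.clojure/clojure "1.10.1"] [org.jsoup/jsoup "1.13.1"]]

Thank you, I will try embedding it in my code.

Thanks for your answer, please see the update.

Don't embed it, just pull it in via project.clj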

For reference, the source of the tupelo.parse.tagsoup namespace used in the code example:

(ns tupelo.parse.tagsoup
  (:use tupelo.core)
  (:require
    [schema.core :as s]
    [tupelo.parse.xml :as xml]
    [tupelo.string :as ts]
    [tupelo.schema :as tsk]))

(s/defn ^:private tagsoup-parse-fn
  [input-source :- org.xml.sax.InputSource
   content-handler]
  (doto (org.ccil.cowan.tagsoup.Parser.)
    (.setFeature "http://www.ccil.org/~cowan/tagsoup/features/default-attributes" false)
    (.setFeature "http://www.ccil.org/~cowan/tagsoup/features/cdata-elements" true)
    (.setFeature "http://www.ccil.org/~cowan/tagsoup/features/ignorable-whitespace" true)
    (.setContentHandler content-handler)
    (.setProperty "http://www.ccil.org/~cowan/tagsoup/properties/auto-detector"
      (proxy [org.ccil.cowan.tagsoup.AutoDetector] []
        (autoDetectingReader [^java.io.InputStream is]
          (java.io.InputStreamReader. is "UTF-8"))))
    (.setProperty "http://xml.org/sax/properties/lexical-handler" content-handler)
    (.parse input-source)))

; #todo make use string input:  (ts/string->stream html-str)
(s/defn parse-raw :- tsk/KeyMap
  "Loads and parse an HTML resource and closes the input-stream."
  [html-str :- s/Str]
  (xml/parse-raw-streaming
    (org.xml.sax.InputSource.
      (ts/string->stream html-str))
    tagsoup-parse-fn))

; #todo make use string input:  (ts/string->stream html-str)
(s/defn parse :- tsk/KeyMap
  "Loads and parse an HTML resource and closes the input-stream."
  [html-str :- s/Str]
  (xml/enlive-remove-whitespace
    (xml/enlive-normalize
      (parse-raw
        html-str))))