How do I extract elements from HTML in Racket?

I want to extract URLs from reddit. My code is:

#lang racket

(require net/url)
(require html)

(define reddit (string->url "http://www.reddit.com/r/programming/search?q=racket&sort=relevance&restrict_sr=on&t=all"))
(define in (get-pure-port reddit #:redirections 5))

(define response-html (read-html-as-xml in))
(define content-0 (list-ref response-html 0))

(close-input-port in)

The content-0 above is:

(element
 (location 0 0 15)
 (location 0 0 82)
...

I'd like to know how to extract specific content from it.

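  • Usually it's more convenient to deal with HTML as x-expressions instead of the html module's structs.

  • You should also probably use call/input-url to handle closing the port automatically.

  • You can combine both ideas by defining a read-html-as-xexpr function and using it like so: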

    #lang racket/base
    
    (require html
             net/url
             xml)
    
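    (define (read-html-as-xexpr in) ;; input-port? -> xexpr?
      ;; read-html-as-xml returns a list of content, so wrap it in a
      ;; dummy root element that xml->xexpr can convert, then use
      ;; caddr to skip the 'root tag and its empty attribute list.
      (caddr
       (xml->xexpr
        (element #f #f 'root '()
                 (read-html-as-xml in)))))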
    
    (define reddit (string->url "http://www.reddit.com/r/programming/search?q=racket&sort=relevance&restrict_sr=on&t=all"))
    
    (call/input-url reddit
                    get-pure-port
                    read-html-as-xexpr)
    
    This returns a big x-expression like:

    '(html
      ((lang "en") (xml:lang "en") (xmlns "http://www.w3.org/1999/xhtml"))
      (head
       ()
       (title () "programming: search results")
       (meta
        ((content " reddit, reddit.com, vote, comment, submit ")
         (name "keywords")))
       (meta
        ((content "reddit: the front page of the internet") (name "description")))
       (meta ((content "origin") (name "referrer")))
       (meta ((content "text/html; charset=UTF-8") (http-equiv "Content-Type")))
    ... snip ...
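If the shape of that output is unfamiliar: each element in an x-expression is a list of the tag symbol, a list of attributes, and then the children. Here is a minimal sketch using string->xexpr and xexpr->string from the xml library on a one-element snippet. The names link and link-as-string are only for illustration.

```racket
#lang racket/base

(require xml)

;; Parse a tiny XML snippet into an x-expression:
;; (tag ((attribute "value") ...) child ...)
(define link
  (string->xexpr "<a href=\"http://racket-lang.org/\">Racket</a>"))

;; Convert the x-expression back into markup.
(define link-as-string (xexpr->string link))

link           ; '(a ((href "http://racket-lang.org/")) "Racket")
link-as-string ; "<a href=\"http://racket-lang.org/\">Racket</a>"
```

Because an x-expression is plain list data, the usual list tools (car, cdr, match and so on) work on it directly.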
    
    How do you extract specific pieces of that?

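    • For simple HTML where I don't expect the overall structure to change, I will often just use match.

    • However, a more correct and robust way to go about it is to use the xml/path module.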
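To make the difference between the two approaches concrete, here is a rough sketch on a small hand-written x-expression. The page value is made up for illustration and is far simpler than the real reddit output. The match pattern depends on the exact layout of the document, while se-path* looks for the element at any depth.

```racket
#lang racket/base

(require racket/match
         xml/path)

;; A small hand-written x-expression standing in for a parsed page.
(define page
  '(html ()
     (head () (title () "programming: search results"))
     (body ()
       (p () "See " (a ((href "http://racket-lang.org/")) "Racket") "."))))

;; match: concise, but tied to the exact shape of the document.
;; If head gains another child before title, this falls through to #f.
(define title
  (match page
    [`(html ,_ (head ,_ (title ,_ ,t)) . ,_) t]
    [_ #f]))

;; se-path*: returns the first match for a title element,
;; wherever it sits in the tree.
(define title-via-path (se-path* '(title) page))
```

Both definitions pick out the title string here. Only the se-path* version keeps working if the surrounding structure changes.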



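    UPDATE: I noticed your question started by asking about extracting URLs. Here's the example updated to use se-path*/list to get all the href attributes of all the <a> elements: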

    #lang racket/base
    
    (require html
             net/url
             xml
             xml/path)
    
    (define (read-html-as-xexprs in) ;; (-> input-port? xexpr?)
      (caddr
       (xml->xexpr
        (element #f #f 'root '()
                 (read-html-as-xml in)))))
    
    (define reddit (string->url "http://www.reddit.com/r/programming/search?q=racket&sort=relevance&restrict_sr=on&t=all"))
    
    (define xe (call/input-url reddit
                               get-pure-port
                               read-html-as-xexprs))
    
    (se-path*/list '(a #:href) xe)
    
    Result:

    '("#content"
      "http://www.reddit.com/r/announcements/"
      "http://www.reddit.com/r/Art/"
      "http://www.reddit.com/r/AskReddit/"
      "http://www.reddit.com/r/askscience/"
      "http://www.reddit.com/r/aww/"
      "http://www.reddit.com/r/blog/"
      "http://www.reddit.com/r/books/"
      "http://www.reddit.com/r/creepy/"
      "http://www.reddit.com/r/dataisbeautiful/"
      "http://www.reddit.com/r/DIY/"
      "http://www.reddit.com/r/Documentaries/"
      "http://www.reddit.com/r/EarthPorn/"
      "http://www.reddit.com/r/explainlikeimfive/"
      "http://www.reddit.com/r/Fitness/"
      "http://www.reddit.com/r/food/"
      ... snip ...
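The list includes page-internal links such as "#content", and the same URL can appear more than once on a page. If you only want external links, one possible post-processing step is sketched below. The helper name absolute-links is hypothetical, and the input is a short hand-written sample rather than the real result.

```racket
#lang racket/base

(require racket/list
         racket/string)

;; Keep only absolute http(s) links and drop repeats,
;; preserving the order of first appearance.
(define (absolute-links hrefs)
  (remove-duplicates
   (filter (lambda (h)
             (or (string-prefix? h "http://")
                 (string-prefix? h "https://")))
           hrefs)))

(absolute-links '("#content"
                  "http://www.reddit.com/r/announcements/"
                  "http://www.reddit.com/r/Art/"
                  "http://www.reddit.com/r/announcements/"))
;; => '("http://www.reddit.com/r/announcements/"
;;      "http://www.reddit.com/r/Art/")
```

In the full program you would call it as (absolute-links (se-path*/list '(a #:href) xe)).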