Parsing local HTML files with Groovy

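I'm working on a Groovy script that will take all of my local HTML files and parse certain tags out of them. I have tried things like html clean, but it just does not work. I also tried reading the files line by line, but that only works when what I need sits on a single line. I have the script on GitHub. Thanks for any input.

Edit: So I'm getting closer. I now have this code: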

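// NekoHTML's SAX parser tolerates real-world, non-well-formed HTML
def parser = new org.cyberneko.html.parsers.SAXParser()
new XmlParser( parser ).parse( curFile + "/index.html" ).with { page ->
    // NekoHTML reports element names in upper case, hence DIV
    page.'**'.DIV.grep { it.'@class'?.contains 'entry-content' }.each {
        println it
        println "--------------------------------"
    }
}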

This is what it prints:

DIV[attributes={class=entry-content}; value=[P[attributes={}; value=[As an automation developer, I have learned how to write code in Java. When I am having an issue, one of the nice things that you can do is debug your code, line by line. For the longest I had wished that something like this existed in PHP. I have come to find out that you can actually debug code, like I do in Java. This is such a helpful task because I do not have to waste time using var_dump and such on variables or results. In your apache/php server you need to install and or enable something called, A[attributes={href=http://xdebug.org/}; value=[Xdebug]], . I will work on a tutorial on how to use xdebug while writing code in Sublime Text 2. So keep an eye out on my blog and or, A[attributes={href=http://www.youtube.com/jrock20041}; value=[YouTube]], channel for this tutorial.]]]]

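So basically, what I want is to isolate the text, including the HTML elements, inside the div with class entry-content. If you would like to see the page, it can be found here --

Thanks for the help.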
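The first half of the task described in the question, collecting every local HTML file before parsing, could be sketched roughly as follows. The helper name findHtmlFiles, the temporary directory and the sample file names are all made up for illustration; the asker's actual directory layout is not shown in the thread.

```groovy
import groovy.io.FileType
import java.nio.file.Files

// Hypothetical helper: collect every .html/.htm file below a root directory.
List<File> findHtmlFiles(File root) {
    def found = []
    root.eachFileRecurse(FileType.FILES) { f ->
        // Match on the lower-cased name so INDEX.HTML is picked up too
        if (f.name.toLowerCase() ==~ /.*\.html?/) {
            found << f
        }
    }
    found.sort { it.path }
}

// Demo on a throwaway directory so the sketch runs anywhere.
def tmp = Files.createTempDirectory('html-demo').toFile()
new File(tmp, 'post1').mkdirs()
new File(tmp, 'post1/index.html').text = '<html></html>'
new File(tmp, 'notes.txt').text = 'not html'

findHtmlFiles(tmp).each { println it.path }
tmp.deleteDir()
```

Each File returned this way can then be handed to whichever parser ends up working for the page.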

It does work... Save this page's HTML to a file, and you can then parse it.

The following code prints the author name of every comment on the page:

@Grab('net.sourceforge.nekohtml:nekohtml:1.9.16')
def parser = new org.cyberneko.html.parsers.SAXParser()

new XmlParser( parser ).parse( file ).with { page ->
  page.'**'.A.grep { it.'@class'?.contains 'comment-user' }.each {
    println it.text()
  }
}

When file is set to a File pointing at the saved HTML (or to a String containing the URL of this question), it prints:

tim_yates
jrock2004
tim_yates

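Comments:

Try nekohtml:

It doesn't seem to work.

Do you have a simple example of it failing?

Tim, it is not failing. When I run the file with the code added above, I get a bunch of empty lines. I think the issue is that everything inside the div is nested HTML. So when I println it.text(), it is empty, because there is no text directly in the div.

Ahhh... What do you want? The plain text without HTML? Or the HTML inside the div?

I guess my problem was how to look up the element. Thanks for the help. This solved my problem.

Edit: To print the contents of a given node, you can do the following (using the example from the edited question):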

@Grab('net.sourceforge.nekohtml:nekohtml:1.9.16')
import groovy.xml.*

def parser = new org.cyberneko.html.parsers.SAXParser()

new XmlParser( parser ).parse( 'http://jcwebconcepts.net/blog/2013/02/02/xdebug/' ).with { page ->
  page.'**'.DIV.grep { it.'@class'?.contains 'entry-content' }.each { it ->
    println XmlUtil.serialize( it )
  }
}
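If the plain text without markup is wanted instead (the other option raised in the comments), one possible approach is to walk the node's children recursively rather than rely on it.text(). The sketch below is only an illustration: plainText is a made-up helper, and the inline sample string stands in for the parsed page so that it runs without NekoHTML or a network connection.

```groovy
import groovy.xml.*

// Hypothetical helper: gather the text of a node and all of its descendants.
String plainText(node) {
    if (node instanceof String) {
        return node
    }
    // Join the children's text, then collapse runs of whitespace
    node.children().collect { plainText(it) }.join(' ').replaceAll(/\s+/, ' ').trim()
}

// Well-formed stand-in for the entry-content div (assumption, not the real page)
def sample = '<DIV class="entry-content"><P>Install <A href="http://xdebug.org/">Xdebug</A> first.</P></DIV>'
def div = new XmlParser().parseText(sample)

println plainText(div)
```

With the NekoHTML version above, the same helper could be called on each matching DIV in place of XmlUtil.serialize( it ).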