Parsing HTML page content without using selectors

html css xml xpath


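I am going to parse some web pages with a Java program. For that purpose I wrote a small piece of code that parses page content using XPath as the selector. To parse different sites, you need to find the appropriate XPath for each site. The problem is that this requires an operator to find the right XPath for you (for example, with the FirePath Firefox plugin). Suppose you don't know in advance which pages you will have to parse, or the number of sites grows too large for an operator to find the right XPath for each one. In that case you need a way to parse pages without using any selector (the same problem applies to CSS selectors), or there should be a way to find the XPath automatically. What approaches exist for parsing web pages this way? Here is the small piece of code I wrote; feel free to extend it when presenting your solution.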

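import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.PrettyXmlSerializer;
import org.htmlcleaner.TagNode;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Downloads a page, cleans it into well-formed XML and writes it to disk.
public static void downloadHTML(String url) throws IOException {
        CleanerProperties props = new CleanerProperties();

        // set some properties to non-default values
        props.setTranslateSpecialEntities(true);
        props.setTransResCharsToNCR(true);
        props.setOmitComments(true);

        // do parsing
        TagNode tagNode = new HtmlCleaner(props).clean(
            new URL(url)
        );

        // serialize to xml file
        new PrettyXmlSerializer(props).writeToFile(
            tagNode, "c:\\TEMP\\clean.xml", "utf-8"
        );
    }


// Evaluates an XPath pattern against the cleaned XML file and prints the matches.
public static void testJavaxXpath(String pattern)
            throws ParserConfigurationException, SAXException, IOException,
            XPathExpressionException {

        DocumentBuilder b = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder();

        // try-with-resources closes the file even if parsing fails
        try (InputStream in = new FileInputStream("c:\\TEMP\\clean.xml")) {
            org.w3c.dom.Document doc = b.parse(in);

            // Evaluate XPath against Document itself
            javax.xml.xpath.XPath xPath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xPath.evaluate(pattern,
                    doc.getDocumentElement(), XPathConstants.NODESET);
            for (int i = 0; i < nodes.getLength(); ++i) {
                // a match may be a text or attribute node, or an empty element
                Node first = nodes.item(i).getFirstChild();
                if (first != null) {
                    System.out.println(first.getTextContent());
                }
            }
        }
    }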


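Comments:

If you don't know what XPath expression to use, what do you know? Are you trying to find a particular string or id?

@LarsH I know that I want to parse the content of a blog post. But I don't know which blog it will be, so the XPath or CSS selector for that blog is unknown!

Apart from not knowing which blog you will be looking at, do you know what you are looking for in the blog post? XPath is not an XML parser; it is used to select particular content from the document structure after it has been parsed. You are already parsing the document with b.parse(). Now you need to decide what to look for in the parsed structure (doc). The FirePath Firefox plugin, for example, requires you to tell it which element you want to select.

@LarsH I know that I want to extract the post content, the post title and the post date from blog pages, but the problem is that each blog has its own page structure, and I can't find a single XPath that works on all blogs!

Yes, the HTML output structure will be very different for each one. The best approach I can suggest is to pick the three or so most popular blog formats (WordPress, etc.), work out an XPath expression in each format for every item you want, and then combine them with the | (union) operator. If you need help with the union part, post the XPath expressions here, e.g. for the title on specific blogs, and we can help you put them together.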
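To illustrate the union suggestion, here is a minimal sketch using only the JDK's javax.xml.xpath API. The class name UnionXPathDemo, the helper methods and the three title expressions are assumptions made up for this example, not expressions taken from any real blog platform; you would replace them with the XPaths you actually find for each format. The input is assumed to be well-formed XML, such as the cleaned output written by downloadHTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class UnionXPathDemo {

    // Hypothetical per-format title expressions (assumptions for illustration only).
    public static final String[] TITLE_XPATHS = {
        "//h1[@class='entry-title']",
        "//h3[@class='post-title']",
        "//div[@id='post']/h2"
    };

    /** Joins the alternatives with the XPath union operator "|". */
    public static String union(String... xpaths) {
        return String.join(" | ", xpaths);
    }

    /** Evaluates the expression against well-formed XML and returns the text of each match. */
    public static List<String> select(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xPath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xPath.evaluate(expression, doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent().trim());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Two toy pages with different structures; each matches a different alternative.
        String blogA = "<html><body><h1 class='entry-title'>Hello A</h1></body></html>";
        String blogB = "<html><body><div id='post'><h2>Hello B</h2></div></body></html>";
        String titles = union(TITLE_XPATHS);
        System.out.println(select(blogA, titles));
        System.out.println(select(blogB, titles));
    }
}
```

The union only selects whichever alternatives happen to match, so a page in an unknown format simply yields an empty list. That makes it easy to detect which sites still need an operator to supply an expression.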