Parsing dblp.xml with Java DOM/SAX

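I am trying to parse dblp.xml in Java to extract author names, titles, years and so on, but because the file is large (860 MB) I cannot use DOM/SAX on the complete file.

So I split the file into several smaller files of roughly 100 MB each.

Each file now contains thousands of nodes of various kinds, like this one: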

<dblp>
<inproceedings mdate="2011-06-23" key="conf/aime/BianchiD95">
<author>Nadia Bianchi</author>
<author>Claudia Diamantini</author>
<title>Integration of Neural Networks and Rule Based Systems in the Interpretation of Liver Biopsy Images.</title>
<pages>367-378</pages>
<year>1995</year>
<crossref>conf/aime/1995</crossref>
<booktitle>AIME</booktitle>
<url>db/conf/aime/aime1995.html#BianchiD95</url>
<ee>http://dx.doi.org/10.1007/3-540-60025-6_152</ee>
</inproceedings>
</dblp>


I assumed that 100 MB should be readable with DOM, but the code stops after about 45k lines. Here is the Java code I am using:

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public List<dblpModel> readConfigDOM(String configFile) {
    List<dblpModel> items = new ArrayList<dblpModel>();

    try {
        File fXmlFile = new File(configFile);
        DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
        // DOM builds the whole document tree in memory before returning.
        Document doc = dBuilder.parse(fXmlFile);
        doc.getDocumentElement().normalize();

        // Only <incollection> records are selected; other record types
        // (such as <inproceedings>) are skipped.
        NodeList nList = doc.getElementsByTagName("incollection");

        for (int temp = 0; temp < nList.getLength(); temp++) {
            dblpModel item = new dblpModel();
            Node nNode = nList.item(temp);
            if (nNode.getNodeType() == Node.ELEMENT_NODE) {
                Element eElement = (Element) nNode;

                List<String> titles = getTagValueString("title", eElement);
                if (!titles.isEmpty()) {
                    System.out.println(titles.get(0));
                }

                List<String> authors = getTagValueString("author", eElement);
                System.out.println("Author : " + authors.size());
                for (String s : authors) {
                    System.out.println(s);
                }
            }
            // item is added as created; this snippet never fills in its fields.
            items.add(item);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return items;
}


// Returns the text of every <sTag> descendant of eElement, one list entry per element.
private static List<String> getTagValueString(String sTag, Element eElement) {
    List<String> strList = new ArrayList<String>();
    // Look the elements up once instead of repeating the search on every iteration.
    NodeList nodes = eElement.getElementsByTagName(sTag);
    for (int i = 0; i < nodes.getLength(); i++) {
        // getTextContent() joins all descendant text, so nested markup inside a
        // title is included and an empty element yields "" rather than a null child.
        strList.add(nodes.item(i).getTextContent());
    }
    return strList;
}

Any help? I also tried using the StAX API to parse the large document, but that also …

If your goal is just to get the details, simply read the file as a text file with a BufferedReader. Add some regular expressions if you like.
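A minimal sketch of that idea, assuming each `<author>`, `<title>` and `<year>` element opens and closes on a single line, as in the sample from the question. The class name `DblpLineScanner` and the method `scan` are made up for illustration. Character entities such as `&amp;` are not decoded, and titles that wrap across lines are missed, so treat this as a rough extractor rather than a real XML parser.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DblpLineScanner {
    // Matches one author, title or year element that opens and closes on the same line.
    private static final Pattern FIELD =
            Pattern.compile("<(author|title|year)>(.*?)</\\1>");

    /** Returns "tag=value" strings for every matching element, in input order. */
    public static List<String> scan(Reader source) {
        List<String> found = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            // Only one line is held in memory at a time, regardless of file size.
            while ((line = reader.readLine()) != null) {
                Matcher m = FIELD.matcher(line);
                while (m.find()) {
                    found.add(m.group(1) + "=" + m.group(2));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        String sample = "<dblp>\n"
                + "<inproceedings mdate=\"2011-06-23\" key=\"conf/aime/BianchiD95\">\n"
                + "<author>Nadia Bianchi</author>\n"
                + "<author>Claudia Diamantini</author>\n"
                + "<year>1995</year>\n"
                + "</inproceedings>\n"
                + "</dblp>\n";
        // For a real file, pass a reader over it instead,
        // e.g. Files.newBufferedReader(path, charset).
        for (String field : scan(new StringReader(sample))) {
            System.out.println(field);
        }
    }
}
```

In real use you would group the fields by record (for example, start a new record whenever a line begins with `<inproceedings` or `<article`) instead of collecting one flat list.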

If using MySQL is an option, you can go through its …

Hope this helps.

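Don't worry too much about the XML format. It isn't very useful here anyway. Just read the file as text and parse the lines as strings. You can then export the data to CSV and work with it however you like from there. Unfortunately, XML is not very efficient for large documents. I did something similar for a research project here: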
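As a rough illustration of the CSV export step mentioned above (this is not the code from the research project referred to), here is a small sketch that turns one record into a CSV line. The class name `DblpCsvRow`, the column order and the `;` separator between authors are all assumptions. Fields are quoted following the usual CSV convention of doubling embedded double quotes.

```java
import java.util.Arrays;
import java.util.List;

public class DblpCsvRow {
    /** Quotes a field when it contains a comma, a double quote or a line break. */
    public static String escape(String field) {
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    /** Builds one CSV line: authors joined with ';', then title, then year. */
    public static String toCsv(List<String> authors, String title, String year) {
        return escape(String.join(";", authors)) + "," + escape(title) + "," + escape(year);
    }

    public static void main(String[] args) {
        System.out.println(toCsv(
                Arrays.asList("Nadia Bianchi", "Claudia Diamantini"),
                "Integration of Neural Networks and Rule Based Systems in the "
                        + "Interpretation of Liver Biopsy Images.",
                "1995"));
    }
}
```

Titles in dblp often contain commas, so skipping the quoting step would shift columns in any tool that reads the file later.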

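You would get better answers if you said exactly what "the code stops" means. Does readConfigDOM() return, or does it hang? If it hangs, on which line (you can run it under a debugger and/or take a thread dump)? By the way, SAX has no problem with large files.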
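To illustrate the point about SAX, here is a minimal sketch of a streaming handler that collects authors, titles and years without building a tree. The names `DblpSaxSketch`, `FieldHandler` and `parse` are made up for this example. It uses only the standard `javax.xml.parsers` and `org.xml.sax` APIs.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class DblpSaxSketch {
    /** Collects "tag=value" for author, title and year elements as they stream past. */
    static class FieldHandler extends DefaultHandler {
        final List<String> fields = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();
        private String current; // name of the wanted element currently open, or null

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (qName.equals("author") || qName.equals("title") || qName.equals("year")) {
                current = qName;
                text.setLength(0);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // The parser may deliver one element's text in several chunks, so append.
            if (current != null) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals(current)) {
                fields.add(current + "=" + text);
                current = null;
            }
        }
    }

    /** Parses the source with SAX and returns the collected fields in document order. */
    public static List<String> parse(InputSource source) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            FieldHandler handler = new FieldHandler();
            parser.parse(source, handler);
            return handler.fields;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String sample = "<dblp><inproceedings key=\"conf/aime/BianchiD95\">"
                + "<author>Nadia Bianchi</author><author>Claudia Diamantini</author>"
                + "<year>1995</year></inproceedings></dblp>";
        // For a real file, parser.parse(new File(path), handler) works the same way.
        for (String field : parse(new InputSource(new StringReader(sample)))) {
            System.out.println(field);
        }
    }
}
```

Memory use stays roughly constant because only the current element's text is buffered. If the parser aborts partway through the full dblp.xml with an entity-expansion error, raising the JDK limit (for example with the `entityExpansionLimit` system property) is a commonly reported workaround. Whether that is what happens in the question is only a guess, since the actual error was not posted.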