Java SAXParseException problem (how to remove any BOM characters from an XML file)

java io processing

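I have some data in an XML file, and I am parsing that file with the Processing library. I ran into the byte order mark (BOM) problem, which caused some errors to be thrown. I found a very slow workaround elsewhere: I use the Apache Commons BOMInputStream to skip the bytes that represent the BOM, then read the rest of the file as a series of bytes.

I think the root of my problem is really my lack of knowledge about streams, readers, and writers. There are so many different readers and writers, plus all kinds of "streams" (a word I barely understand), that I struggle to work out which one to use and how. I suspect I simply picked the wrong implementation.

Question: can someone tell me why my code is so slow, and help me improve my understanding of file I/O?

Code: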

private static XML noBOM(String filename, PApplet p) throws FileNotFoundException, IOException{

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    File f = new File(filename);
    InputStream stream = new FileInputStream(f);
    BOMInputStream bomIn = new BOMInputStream(stream);

    int tmp = -1;
    while ((tmp = bomIn.read()) != -1){
        out.write(tmp);
    }

    String strXml = out.toString();
    return p.parseXML(strXml);
}

public static Map<String, Float> lifeExpectancyFromXML(String filename, PApplet p, 
        int year) throws FileNotFoundException, IOException{


    Map<String, Float> dataMap = new HashMap<>();

    XML xml = noBOM(filename, p);

    if(xml != null){

        XML[] records = xml.getChild("data").getChildren("record");

        for (XML record : records){
            XML[] fields = record.getChildren("field");

            String country = fields[0].getContent();
            int entryYear = fields[2].getIntContent();
            float lifeEx = fields[3].getFloatContent();

            if (entryYear == year){
                System.out.println("Country: " + country);
                System.out.println("Life Expectency: " + lifeEx);
                dataMap.put(country, lifeEx);
            }
        }
    } 
    else {
        System.out.println("String could not be parsed.");
    }

    return dataMap;
}

The problem may be that the InputStream is being read one byte at a time. Try using a buffer to improve performance:
try (BOMInputStream bis = new BOMInputStream(new FileInputStream(new File(filename)))) {
    byte[] buffer = new byte[1000];
    while (bis.read(buffer) != -1) {
        out.write(buffer);
    }
}

Update:
The resulting ByteArrayOutputStream may contain some null bytes at the end. To remove them, trim the resulting string:
out.toString("UTF-8").trim()
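A note on the buffered loop above: `read(buffer)` returns the number of bytes it actually stored, and the final read usually fills only part of the array. Writing the whole array therefore appends whatever was left over from the previous read, which is a plausible source of trailing content that `trim()` cannot remove, since `trim()` only strips characters at or below U+0020. The following is a minimal sketch of the usual fix, using an in-memory stream in place of the file. The class and method names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Processing or Commons IO.
final class BufferedCopySketch {

    // Copies in to out, writing only the bytes that each read() delivered.
    static void copy(InputStream in, OutputStream out, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }

    // The same loop without the count, kept only for comparison.
    static void copyIgnoringCount(InputStream in, OutputStream out, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        while (in.read(buffer) != -1) {
            out.write(buffer); // always writes bufferSize bytes, stale ones included
        }
    }

    // Runs one of the two loops over an in-memory copy of text.
    static String copyToString(String text, int bufferSize, boolean honourCount) {
        try {
            InputStream in = new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            if (honourCount) {
                copy(in, out, bufferSize);
            } else {
                copyIgnoringCount(in, out, bufferSize);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<a><b/></a>"; // 11 bytes, read in chunks of 4
        System.out.println(copyToString(xml, 4, true));  // prints <a><b/></a>
        System.out.println(copyToString(xml, 4, false)); // prints <a><b/></a><
    }
}
```

In the second call the last chunk holds only three fresh bytes, so the fourth byte written is a leftover `<` from the previous chunk. An XML parser would reject that as content after the root element.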

My solution was to use a BufferedReader instead of creating my own buffer. It made everything considerably faster:
private static XML noBOM(String path, PApplet p) throws
        FileNotFoundException, UnsupportedEncodingException, IOException {

    // Default encoding, used when the file has no BOM
    String defaultEncoding = "UTF-8";

    // Wrap the file in a BOMInputStream to get rid of any byte order mark;
    // try-with-resources closes the stream even if reading or parsing fails
    try (BOMInputStream bomIn = new BOMInputStream(new FileInputStream(path))) {

        // If a BOM is present, use the encoding it indicates. If not, use UTF-8
        ByteOrderMark bom = bomIn.getBOM();
        String charSet = bom == null ? defaultEncoding : bom.getCharsetName();

        // Buffered reader for speed
        BufferedReader breader = new BufferedReader(new InputStreamReader(bomIn, charSet));

        // Build the string to parse into XML using Processing's PApplet.parseXML
        StringBuilder buildXML = new StringBuilder();
        int c;
        while ((c = breader.read()) != -1) {
            buildXML.append((char) c);
        }
        return p.parseXML(buildXML.toString());
    }
}
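For context on what the BOM stripping above amounts to: a UTF-8 byte order mark is the three bytes EF BB BF at the very start of the file. The sketch below skips it using only the standard library (Java 9 or later, for `readNBytes` and `readAllBytes`). It is an illustration under those assumptions, not the Commons IO implementation. It handles UTF-8 only, and the class and method names are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; handles only the UTF-8 BOM.
final class Utf8BomSkipper {

    // Returns a stream positioned just after a leading UTF-8 BOM, if there is one.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = pushback.readNBytes(head, 0, 3); // may be fewer than 3 for tiny inputs
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            pushback.unread(head, 0, n); // not a BOM: put the bytes back
        }
        return pushback;
    }

    // Decodes raw bytes as UTF-8 after dropping a leading BOM.
    static String decode(byte[] raw) {
        try (InputStream in = skipUtf8Bom(new ByteArrayInputStream(raw))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        System.out.println(decode(withBom));
    }
}
```

If the first three bytes are not a BOM they are pushed back, so files without one pass through unchanged.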

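This approach seems to leave some trailing data that is treated as illegal. I now get the error org.xml.sax.SAXParseException; Content is not allowed in trailing section.

Updated the answer with an example of removing the trailing characters.

I'm afraid the error persists even after the update. Also, would you mind explaining what exactly the buffer is doing?

When a buffer is used, data is read from the InputStream and written to the OutputStream in chunks, each up to the size of the buffer. This should improve read/write performance. Do you have a sample of the resulting string as retrieved after all the processing (written to a log, or taken from debug data)? Does it have leading or trailing whitespace or other characters?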
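To make the effect of a buffer concrete, the sketch below counts how often the underlying source is asked for data when a consumer reads one byte at a time, with and without a `BufferedInputStream` in between. With a real `FileInputStream` each of those requests is a call into the operating system, which is why the unbuffered byte-by-byte loop in the question is slow. The counting wrapper and the class name are hypothetical, and an in-memory array stands in for the file.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical demo class for illustration only.
final class ReadCallCounter {

    // Counts every read request that reaches the wrapped stream.
    static final class CountingInputStream extends FilterInputStream {
        int calls = 0;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            calls++;
            return super.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return super.read(b, off, len);
        }
    }

    // Reads size bytes one at a time and reports how many requests hit the source.
    static int callsWhenReadingByteByByte(int size, boolean buffered) {
        try {
            CountingInputStream source =
                    new CountingInputStream(new ByteArrayInputStream(new byte[size]));
            InputStream in = buffered ? new BufferedInputStream(source, 8192) : source;
            while (in.read() != -1) {
                // discard the data; only the number of requests matters here
            }
            return source.calls;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("unbuffered: " + callsWhenReadingByteByByte(100_000, false));
        System.out.println("buffered:   " + callsWhenReadingByteByByte(100_000, true));
    }
}
```

Without the buffer, every `read()` goes straight to the source, so 100,000 bytes cost a little over 100,000 requests. With the buffer, the source is asked for up to 8,192 bytes at a time and the consumer's single-byte reads are served from memory.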