Removing comments from large files using Java


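I have .sh, .txt, .sql, .pkb and similar files that are larger than 10 MB, which means more than 100k lines.

I want to remove the comments from these files and then work further with the uncommented content. I have written the following code for this.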

/**
 * Removes all the commented part from the file content as well as returns a
 * file structure which have just lines with declaration syntax for eg.
 * Create Package packageName <- Stores all declaration lines as separate
 * string in an array
 * 
 * @param file
 * @return file content
 * @throws IOException
 */
private static String[] filterContent(File file) throws IOException {

    String withoutComment = "";
    String declare = "";
    String[] content;
    List<String> readLines = FileUtils.readLines(file);

    int size = readLines.size();
    System.out.println(file.getName() + " Files number of lines "+ size + " at "+new Date());
    String[] declareLines = new String[size];
    int startComment = 0;
    int endComment = 0;
    Boolean check = false;
    int j = 0;
    int i=0;
    // Reading content line by line
    for (String line:readLines) {
        // If line contains */ that means comment is ending in this line,
        // making a note of the line number
        if (line.toString().contains("*/")) {
            endComment = i;
            // Removing the content before */ from the line
            int indexOf = line.indexOf("*/");
            line = line.replace(line.substring(0, indexOf + 2), "");
        }

        // If startComment is assigned fresh value and end comment hasn't,
        // that means the current line is part of the comment
        // Ignoring the line in this case and moving on to the next one
        if ((startComment > 0 && endComment == 0) || (endComment < startComment) || check)
            continue;

        // If line contains /* that means comment is starting in this line,
        // making a note of the line number
        if (line.contains("/*")) {
            startComment = i;
            // Removing the content after /* from the line
            int indexOf = line.indexOf("/*");
            line = line.replace(line.substring(indexOf), "");
            if (i == 0)
                check = true; // means comment in the very first line
        }

        // If line contains -- that means single line comment is present in
        // this line,
        // removing the content after --
        if (line.contains("--")) {
            int indexOf = line.indexOf("--");
            line = line.replace(line.substring(indexOf), "");
        }
        // If line contains # that means a single line comment is present in
        // this line,
        // removing the content after #
        if (line.contains("#")) {
            int indexOf = line.indexOf("#");
            line = line.replace(line.substring(indexOf), "");
        }

        // At this point, all commented part is removed from the line, hence
        // appending it to the final content
        if (!line.isEmpty())
            withoutComment = withoutComment + line + " \n";
        // If line contains CREATE its a declaration line, holding it
        // separately in the array
        if (line.toUpperCase().contains(("CREATE"))) {
            // If next line does not contains Create and the current line is
            // the not the last line,
            // then considering two consecutive lines as declaration line,
            if (i < size - 1 && !readLines.get(i + 1).toString().toUpperCase().contains(("CREATE"))) {
                declare = line + " " + readLines.get(i + 1).toString() + "\n";
            } else if (i < size) {// If the line is last line, including
                                    // that line alone.
                declare = line + "\n";
            }

            declareLines[j] = declare.toUpperCase();
            j++;
        }
        i++;
    }
    System.out.println("Read lines "+ new Date());
    List<String> list = new ArrayList<String>(Arrays.asList(declareLines));
    list.removeAll(Collections.singleton(null));

    content = list.toArray(new String[list.size() + 1]);

    withoutComment = withoutComment.toUpperCase();
    content[j] = withoutComment;
    System.out.println("Returning uncommented content "+ new Date());
    return content;
}


public static void main(String[] args) throws IOException {
    String[] content = filterContent(new File("abc.txt"));
}

You could create multiple threads to do this work (the lines would need to be split up correctly).

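Some ways to speed up this code:

Use an InputStream to read the file and parse the lines directly, writing the stripped text to a new, comment-free file. This avoids going over the file contents more than once (once to build the List readLines, and once more in the iteration).

In the design, you could use a map of comment syntaxes instead of this redundant code.

Once that is done, this should be much faster. Multithreading could of course be a solution, but it needs some checks to make sure you don't split the file in the middle of a comment block. So improve the code first; then you can consider that.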
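
The streaming idea above can be sketched roughly as follows. This is a minimal illustration, not the poster's code: the class name `CommentStripper`, the method names, and the `LINE_MARKERS` list are assumptions made for the example, and a plain list stands in for the suggested map of comment syntaxes (a map keyed by file extension would be a natural next step). Like the code in the question, it does not recognise comment markers that appear inside string literals.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;

class CommentStripper {

    // Assumed single-line comment markers; adjust per file type.
    private static final List<String> LINE_MARKERS = Arrays.asList("--", "#");
    private static final String BLOCK_START = "/*";
    private static final String BLOCK_END = "*/";

    /** Copies reader to writer one line at a time, dropping comments. */
    static void strip(BufferedReader reader, BufferedWriter writer) throws IOException {
        boolean inBlock = false; // true while inside an unterminated /* ... */
        String line;
        while ((line = reader.readLine()) != null) {
            StringBuilder kept = new StringBuilder();
            int pos = 0;
            while (pos < line.length()) {
                if (inBlock) {
                    int end = line.indexOf(BLOCK_END, pos);
                    if (end < 0) {
                        break; // the block comment continues on the next line
                    }
                    inBlock = false;
                    pos = end + BLOCK_END.length();
                } else {
                    int start = line.indexOf(BLOCK_START, pos);
                    int cut = start < 0 ? line.length() : start;
                    // A line marker that appears before the block start wins.
                    boolean lineComment = false;
                    for (String marker : LINE_MARKERS) {
                        int m = line.indexOf(marker, pos);
                        if (m >= 0 && m < cut) {
                            cut = m;
                            lineComment = true;
                        }
                    }
                    kept.append(line, pos, cut);
                    if (lineComment || start < 0) {
                        break; // rest of the line is a comment, or nothing is left
                    }
                    inBlock = true;
                    pos = start + BLOCK_START.length();
                }
            }
            // Skip lines that contain nothing but comments or whitespace.
            if (!kept.toString().trim().isEmpty()) {
                writer.write(kept.toString());
                writer.newLine();
            }
        }
        writer.flush();
    }

    /** Convenience wrapper for small in-memory inputs. */
    static String stripText(String text) {
        StringWriter out = new StringWriter();
        try (BufferedReader r = new BufferedReader(new StringReader(text));
             BufferedWriter w = new BufferedWriter(out)) {
            strip(r, w);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(stripText("CREATE TABLE t /* temp */ (id INT) -- key\n# note\n"));
    }
}
```

For a real file, `strip` could be called with readers and writers obtained from `Files.newBufferedReader` and `Files.newBufferedWriter`, so that only one line is held in memory at a time.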

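It turned out that the biggest problem with my code was the use of String. Which method I used to read the lines made little difference, but using StringBuilder instead of String to store the uncommented lines changed the performance dramatically. The same code with StringBuilder now removes the comments in a few seconds, where it used to take hours.

Here is the code. For better performance I also changed the List to a BufferedReader.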

/**
     * Removes all the commented part from the file content as well as returns a
     * file structure which have just lines with declaration syntax for eg.
     * Create Package packageName <- Stores all declaration lines as separate
     * string in an array
     * 
     * @param file
     * @return file content
     * @throws IOException
     */
    private static List<String> filterContent(File file) throws IOException {

        StringBuilder withoutComment = new StringBuilder();
//      String declare = "";
//      String[] content;
//      List<String> readLines = FileUtils.readLines(file);
//
//      int size = readLines.size();
        System.out.println(file.getName() + "  at " + new Date());
        List<String> declareLines = new ArrayList<String>();
        // String line = null;
        int startComment = 0;
        int endComment = 0;
        Boolean check = false;
        Boolean isLineDeclaration = false;

        int j = 0;
        int i = 0;

        InputStream in = new FileInputStream(file);
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        String line;
        // Reading content line by line
        while ((line = reader.readLine()) != null) {
            // for (int i = 0; i < size; i++) {
            // line = readLines.get(i).toString();// storing current line data
            // If line contains */ that means comment is ending in this line,
            // making a note of the line number
            if (line.toString().contains("*/")) {
                endComment = i;
                // Removing the content before */ from the line
                int indexOf = line.indexOf("*/");
                line = line.replace(line.substring(0, indexOf + 2), "");
            }

            // If startComment is assigned fresh value and end comment hasn't,
            // that means the current line is part of the comment
            // Ignoring the line in this case and moving on to the next one
            if ((startComment > 0 && endComment == 0) || (endComment < startComment) || check)
                continue;

            // If line contains /* that means comment is starting in this line,
            // making a note of the line number
            if (line.contains("/*")) {
                startComment = i;
                // Removing the content after /* from the line
                int indexOf = line.indexOf("/*");
                line = line.replace(line.substring(indexOf), "");
                if (i == 0)
                    check = true; // means comment in the very first line
            }

            // If line contains -- that means single line comment is present in
            // this line,
            // removing the content after --
            if (line.contains("--")) {
                int indexOf = line.indexOf("--");
                line = line.replace(line.substring(indexOf), "");
            }
            // If line contains # that means a single line comment is present in
            // this line,
            // removing the content after #
            if (line.contains("#")) {
                int indexOf = line.indexOf("#");
                line = line.replace(line.substring(indexOf), "");
            }

            // At this point, all commented part is removed from the line, hence
            // appending it to the final content
            if (!line.isEmpty())
                withoutComment.append(line).append(" \n");
            // If line contains CREATE its a declaration line, holding it
            // separately in the array
            if (line.toUpperCase().contains(("CREATE"))) {
                // If next line does not contains Create and the current line is
                // the not the last line,
                // then considering two consecutive lines as declaration line,
                declareLines.add(line.toUpperCase());

                isLineDeclaration = true;
                j++;
            } else if (isLineDeclaration && !line.toUpperCase().contains(("CREATE"))) {
                // If next line does not contains Create and the current line is
                // the not the last line,
                // then considering two consecutive lines as declaration line,
                declareLines.set(j - 1, declareLines.get(j - 1) + " " + line.toUpperCase());
                isLineDeclaration = false;
            }
            i++;
        }

        reader.close();
        System.out.println("Read lines " + new Date());
//      List<String> list = new ArrayList<String>(Arrays.asList(declareLines));
        declareLines.removeAll(Collections.singleton(null));

//      content = list.toArray(new String[list.size() + 1]);

//      withoutComment = withoutComment..toUpperCase();
        declareLines.add(withoutComment.toString().toUpperCase());
        System.out.println("Returning uncommented content " + new Date());
        return declareLines;
    }
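
The difference between String and StringBuilder described above can be reproduced in isolation. Because String is immutable, every `text = text + line` copies everything accumulated so far, so the total work grows roughly quadratically with the number of lines, while `StringBuilder.append` adds to a growing buffer. The sketch below is only an illustration; the class and method names are made up for the example and the line count is kept small so the slow variant still finishes quickly.

```java
class ConcatDemo {

    /** Builds the text with repeated String concatenation (copies the whole text each time). */
    static String joinWithString(int lines) {
        String text = "";
        for (int i = 0; i < lines; i++) {
            text = text + "line " + i + " \n";
        }
        return text;
    }

    /** Builds the same text with a single StringBuilder. */
    static String joinWithBuilder(int lines) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            text.append("line ").append(i).append(" \n");
        }
        return text.toString();
    }

    public static void main(String[] args) {
        int lines = 20_000; // far fewer than the 100k+ lines in the question
        long t0 = System.nanoTime();
        String a = joinWithString(lines);
        long t1 = System.nanoTime();
        String b = joinWithBuilder(lines);
        long t2 = System.nanoTime();
        System.out.println("same result: " + a.equals(b));
        System.out.println("String concat ms: " + (t1 - t0) / 1_000_000);
        System.out.println("StringBuilder ms: " + (t2 - t1) / 1_000_000);
    }
}
```

The exact timings depend on the machine and JVM, but the gap widens quickly as the line count grows.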

1. Don't keep the whole file in memory. 2. Why are you doing this?

First of all, don't put it into a List; use an InputStream to read the file and parse the lines directly. You can easily check whether a line contains a comment delimiter such as /* or */, remove that part, and write a new file without the comments. Reading a file of more than 100 MB shouldn't take that long…

Possible duplicate.

The file may even have 500k lines. Wouldn't creating hundreds of threads overload the thread stack?
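
On the concern about creating hundreds of threads: one possible approach, sketched here under assumptions that are not in the original posts, is to keep the number of threads fixed and to hand each thread a whole file rather than a slice of one, so that no block comment is ever split between workers. The class name `ParallelFiles` and the method `processAll` are hypothetical; the `task` function stands in for whatever per-file filtering is used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class ParallelFiles {

    /** Applies task to every input on a bounded pool; results come back in input order. */
    static <T, R> List<R> processAll(List<T> inputs, Function<T, R> task) {
        // The pool size is tied to the CPU count, not to the number of inputs.
        int threads = Math.max(1, Runtime.getRuntime().availableProcessors());
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (T input : inputs) {
                Callable<R> job = () -> task.apply(input);
                futures.add(pool.submit(job));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                try {
                    results.add(future.get()); // waits for this task to finish
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException(e.getCause());
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for "strip the comments from this file":
        // here each task just returns the length of the file name.
        List<String> names = Arrays.asList("a.sql", "b.pkb", "c.sh");
        System.out.println(processAll(names, String::length));
    }
}
```

With this shape, extra files simply wait in the pool's queue instead of each getting a thread of its own.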