An efficient way to retrieve a single specific number from a text file in Java

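My apologies if this has been asked before; I tried searching and found nothing close to it.
I have a text file that I have already read into a String in my Java program. I need to search that text for a specific phrase containing a number, and extract that number so it can be stored in an int variable. Here is a sample excerpt from one of the files: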
≥                                                                              ≥
≥CREDIT CARD TERMS                                                             ≥
≥ORDER IS ON HOLD FOR PREPAYMENT OF ORDER TOTAL + FREIGHT BY CREDIT CARD.      ≥
≥ORDER TOTAL DOES NOT REFLECT FREIGHT COSTS & WILL BE CHARGED AFTER ORDER      ≥
≥SHIPS. ORDER WILL SHIP _5_ WORKING DAYS FROM RECEIPT OF ALL APPROVALS &       ≥
≥RECEIPT OF CREDIT CARD FORM.                                                  ≥
≥                                                                              ≥

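This snippet usually appears toward the bottom of the text file, but not on a specific line. I need to extract the number 5 from the phrase "ORDER WILL SHIP _5_ WORKING DAYS" in the file.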

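I can already picture one way of doing this, consisting of two consecutive regex searches: the first finds the phrase, and the second finds the number within it. That seems rather inefficient to me, though, since instances of the Pattern and Matcher classes have to be created for both passes.
I think there must be a more efficient way to perform this extraction without two regex searches. Is there such a way? Or are the two consecutive searches the only way to do it?
Amendment 3/30/2016: I forgot to mention that the number I need to extract may or may not have underscores around it. This may affect any answers that do not use regex.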
A non-regex way. I am not sure whether it is more efficient, though:
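static final String PREFIX = "ORDER WILL SHIP _";
static final String SUFFIX = "_ WORKING DAYS";

String s = fileContent; // the file content, already read into a String
int start = s.indexOf(PREFIX) + PREFIX.length();
int value = Integer.parseInt(s.substring(start, s.indexOf(SUFFIX, start)));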

A single regex could also be used.
With a single regex you can do the following:
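String text = "SHIPS. ORDER WILL SHIP _5_ WORKING DAYS FROM RECEIPT OF ALL APPROVALS";
Pattern p = Pattern.compile("ORDER\\sWILL\\sSHIP\\s_?(\\d+)_?\\sWORKING\\sDAYS");
Matcher m = p.matcher(text);
// find() looks for the pattern anywhere in the text; matches() would need the whole string to match
if (m.find()) {
    // group(1) is the captured number; group(0) would be the whole matched phrase
    System.out.println(m.group(1));
}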

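First of all, why read the whole file into a String? That is inefficient and you do not need it. Use BufferedReader's readLine() to read the file line by line, and process only the current line. That way you do not consume an unnecessary amount of memory.
Using regex on such repetitive text is also overkill. Plain String indexOf() calls with "ORDER WILL SHIP" and "WORKING DAYS" as arguments should be enough to identify the right line and the position of the wanted number within it.
Extracting the desired int value is then easy: just call Integer.parseInt(String s), where s is the substring of the current line between the indices returned by the indexOf() calls.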
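The line-by-line approach described above could be sketched roughly as follows. The class name ShipDaysLineScan and the method findShipDays are made up for illustration, and the sketch assumes that the whole phrase sits on one line and that only spaces and optional underscores surround the digits.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.OptionalInt;

// Hypothetical helper, not part of any answer in this thread.
class ShipDaysLineScan {
    private static final String PRE = "ORDER WILL SHIP";
    private static final String SUF = "WORKING DAYS";

    /** Returns the number found between PRE and SUF on the first line that has both. */
    static OptionalInt findShipDays(BufferedReader reader) {
        try {
            String line;
            // readLine() returns null once the end of the stream is reached
            while ((line = reader.readLine()) != null) {
                int start = line.indexOf(PRE);
                if (start < 0) {
                    continue;
                }
                start += PRE.length();
                int end = line.indexOf(SUF, start);
                if (end < 0) {
                    continue;
                }
                // Assumption: only spaces and optional underscores surround the digits
                String candidate = line.substring(start, end).replace("_", "").trim();
                try {
                    return OptionalInt.of(Integer.parseInt(candidate));
                } catch (NumberFormatException e) {
                    // Not a plain number between the markers; keep scanning
                }
            }
            return OptionalInt.empty();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        String sample = "CREDIT CARD TERMS\n"
                + "SHIPS. ORDER WILL SHIP _5_ WORKING DAYS FROM RECEIPT OF ALL APPROVALS &\n"
                + "RECEIPT OF CREDIT CARD FORM.\n";
        try (BufferedReader reader = new BufferedReader(new StringReader(sample))) {
            OptionalInt days = findShipDays(reader);
            System.out.println(days.isPresent() ? "days = " + days.getAsInt() : "phrase not found");
        }
    }
}
```

Returning an OptionalInt is just one way to signal "phrase not found"; a sentinel value or an exception would work as well.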
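I wrote some code to compare two different ways of achieving the goal and the time each one takes. If you can do it the way Einar wrote (with the first substring parameter increased, of course), it will be much faster than using regex.
Sample code: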
public static void main(String[] args) {
    // CREATE TEST-DATA
    StringBuilder testSequenceBuilder = new StringBuilder();
    testSequenceBuilder.append("                                                                              ");
    testSequenceBuilder.append("CREDIT CARD TERMS                                                             ");
    testSequenceBuilder.append("ORDER IS ON HOLD FOR PREPAYMENT OF ORDER TOTAL + FREIGHT BY CREDIT CARD.      ");
    testSequenceBuilder.append("ORDER TOTAL DOES NOT REFLECT FREIGHT COSTS & WILL BE CHARGED AFTER ORDER      ");
    testSequenceBuilder.append("SHIPS. ORDER WILL SHIP _52_ WORKING DAYS FROM RECEIPT OF ALL APPROVALS &      ");
    testSequenceBuilder.append("RECEIPT OF CREDIT CARD FORM.                                                  ");
    testSequenceBuilder.append("                                                                              ");

    // TEST
    String testSequence = testSequenceBuilder.toString();

    // REGEX
    performAndPrintNanos(() -> {
        Pattern pattern = Pattern.compile("ORDER WILL SHIP _(?<g>[0-9]+)_ WORKING DAYS",
                Pattern.CASE_INSENSITIVE); // Edited with Kuzeko's pretty example. If you want the pattern to be case-sensitive, just remove the second param of Pattern.compile
        Matcher matcher = pattern.matcher(testSequence);
        if (matcher.find()) {
            System.out.println("OUTPUT-regex: " + matcher.group(1));
        }
    });

    // SUBSTRING
    performAndPrintNanos(() -> {
        String pre = "ORDER WILL SHIP _";
        String suf = "_ WORKING DAYS";
        System.out.println("OUTPUT-java: "
                + testSequence.substring(testSequence.indexOf(pre) + pre.length(), testSequence.indexOf(suf)));
    });
}

private static void performAndPrintNanos(Runnable runnable) {
    long startNanos = System.nanoTime();
    runnable.run();
    System.out.println(System.nanoTime() - startNanos);
}
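A single timed run like the one above also measures pattern compilation, console output and JIT warm-up, so the numbers can vary a lot between runs. A slightly steadier comparison, still only a rough sketch, compiles the Pattern once and averages over many calls. The class ExtractionTiming and its methods are made up for illustration, and a benchmarking harness such as JMH would be the more reliable tool for real measurements.

```java
import java.util.function.Supplier;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical timing sketch, not a rigorous benchmark.
class ExtractionTiming {
    // Compiled once, so compilation cost is not part of each measured call
    private static final Pattern PATTERN =
            Pattern.compile("ORDER WILL SHIP _([0-9]+)_ WORKING DAYS");
    private static final String PRE = "ORDER WILL SHIP _";
    private static final String SUF = "_ WORKING DAYS";

    static String viaRegex(String text) {
        Matcher m = PATTERN.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    static String viaSubstring(String text) {
        int start = text.indexOf(PRE);
        if (start < 0) {
            return null;
        }
        start += PRE.length();
        int end = text.indexOf(SUF, start);
        return end < 0 ? null : text.substring(start, end);
    }

    /** Average nanoseconds per call, measured after a warm-up phase. */
    static long averageNanos(Supplier<String> task, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            task.get();
        }
        long startNanos = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            task.get();
        }
        return (System.nanoTime() - startNanos) / runs;
    }

    public static void main(String[] args) {
        String text = "SHIPS. ORDER WILL SHIP _52_ WORKING DAYS FROM RECEIPT OF ALL APPROVALS &";
        System.out.println("regex:     " + averageNanos(() -> viaRegex(text), 10_000, 100_000) + " ns");
        System.out.println("substring: " + averageNanos(() -> viaSubstring(text), 10_000, 100_000) + " ns");
    }
}
```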

Update (from the comments):
How to use substring when the presence of the underscores is unknown:
String pre = "ORDER WILL SHIP ";
String suf = " WORKING DAYS";
String output = testSequence.substring(testSequence.indexOf(pre) + pre.length(), testSequence.indexOf(suf));
if(output.startsWith("_")&&output.endsWith("_")){
    output = output.substring(1, output.length()-1);
}
int num = Integer.parseInt(output);

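How to use Integer.parseInt(...) on "5" (including the quotes) and get 5:
The conditional trimming is exactly the same as above, which is why I use an inline if (a ternary) in this example. One more example: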
String input = "\"5\""; // "5" escaped
int num = (input.startsWith("\"") && input.endsWith("\""))
        ? Integer.parseInt(input.substring(1, input.length() - 1)) : Integer.parseInt(input);
System.out.println(num);

Update #2 (from the comments):
Possible line breaks:
    String pre = "ORDER WILL SHIP ";
    String suf = " WORKING DAYS";
    String output = testSequence.substring(testSequence.indexOf(pre) + pre.length(), testSequence.indexOf(suf));
    // remove linebreaks
    output = output.replaceAll("\n", "");
    // Remove "_" in front and after the digit.
    if (output.startsWith("_") && output.endsWith("_")) {
        // Before (example): output = "_5_"
        output = output.substring(1, output.length() - 1);
        // After (example): output = "5"
    }
    int num = Integer.parseInt(output);

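Update #3 (BufferedReader example)
For large files you should read line by line, for example with a BufferedReader. I assume the phrase you want to detect will not span more than two lines. If you use a reader, though, you need to keep one line in a cache, as I said in the comments.
Here is an example of how to achieve this: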
        String pre = "ORDER WILL SHIP ";
        String suf = " WORKING DAYS";
        String cache = "";
        String readLine;
        // readLine() returns null at the end of the stream, so no ready() check is needed
        while ((readLine = bufferedReader.readLine()) != null) {
            // readLine() already strips the line terminator, so there are no linebreaks to remove
            // we concat the last read line and the actual read one
            String concatLine = cache + readLine;
            // We check, if the concat line contains both: pre and suf (in that order)
            int start = concatLine.indexOf(pre);
            int end = start < 0 ? -1 : concatLine.indexOf(suf, start + pre.length());
            if (end >= 0) {
                String output = concatLine.substring(start + pre.length(), end);
                // Remove "_" in front and after the digit.
                if (output.startsWith("_") && output.endsWith("_")) {
                    // Before (example): output = "_5_"
                    output = output.substring(1, output.length() - 1);
                    // After (example): output = "5"
                }
                int num = Integer.parseInt(output);
                // break here too if you only have one digit in that input file.
            }
            // cached line is now the one we just read
            cache = readLine;
        } // And don't forget to close the Reader afterwards ;-)

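Do you have multiple numbers in your text file? Or can it be assumed that the number you are looking for is the last one in the file? Finally, do you create the text file yourself?
You can use Matcher#find() with a regexp that has a group, e.g. ORDER WILL SHIP _(?<groupName>[0-9]+)_ WORKING DAYS, and then m.group("groupName") will return it.
@RoelStrolenberg: Yes, there are many numbers in the file, so I have to look only for the one in that key phrase. @Kuzeko: I am replying under your answer below.
This will not work. You have to increase the first substring param (+ "ORDER WILL SHIP _".length) so that s is cut after the _.
This is a great approach that I would never have thought of. However, there is one issue I overlooked in my original post: the number I need may or may not have underscores around it. I will amend my post to make that clear. You definitely deserve credit for such a nice solution, though!
I think this may be the most elegant solution I have seen, and it can be modified slightly to account for the fact that the underscores may sometimes be missing from the key phrase. I guess there is no need to name the group explicitly, since m.group(0) should work if it is the first (and only) group in the regex?
I think both perform the same. The same goes for replacing [0-9]+ with \\d+.
I will also modify the regex slightly to account for the "key phrase" possibly wrapping across lines. I am not sure whether literal spaces in a regex match line breaks, but I doubt it. So I will probably use something like "ORDER\\sWILL\\sSHIP\\s_?(\\d+)_?\\sWORKING\\sDAYS", which should account for line breaks.
Of course, that makes more sense; it is the most flexible regexp I can think of. Very nice!
Interesting that you ran a speed test on the two approaches. I don't suppose you know whether the substring method can be used when the presence of the underscores is unknown? Or whether Integer.parseInt() can handle input like "5" and return 5?
Glad I could help. See my updated post for answers to your other questions.
I forgot that markdown likes to turn underscores into emphasis (italics). I meant to ask whether .parseInt() can recognize "_5_" and turn it into 5. I have tried it and it failed, so I will definitely need your "presence of the underscores is unknown" approach. If anything else odd turns up, I can take it from here.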
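The flexible pattern discussed in the comments could be sketched like this. The class name ShipDaysRegex is made up for illustration. The sketch uses \\s+ rather than \\s so that several spaces or a line break between words are tolerated, and it assumes that only whitespace (no border characters) separates the words of the phrase. Note that m.group(0) is always the entire match, while the first capturing group, here the digits, is m.group(1).

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the single-regex approach.
class ShipDaysRegex {
    // \s+ matches spaces as well as line breaks; _? makes each underscore optional
    private static final Pattern SHIP_DAYS =
            Pattern.compile("ORDER\\s+WILL\\s+SHIP\\s+_?(\\d+)_?\\s+WORKING\\s+DAYS");

    static OptionalInt findShipDays(String text) {
        Matcher m = SHIP_DAYS.matcher(text);
        // find() searches anywhere in the text; matches() would require the whole text to match
        if (m.find()) {
            // group(0) is the whole matched phrase, group(1) is the captured digits.
            // Integer.parseInt throws NumberFormatException if the digits overflow an int.
            return OptionalInt.of(Integer.parseInt(m.group(1)));
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        String withUnderscores = "SHIPS. ORDER WILL SHIP _5_ WORKING DAYS FROM RECEIPT OF ALL APPROVALS";
        String wrapped = "ORDER WILL\nSHIP 12\nWORKING DAYS";
        System.out.println(findShipDays(withUnderscores).getAsInt());
        System.out.println(findShipDays(wrapped).getAsInt());
    }
}
```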