Java: extracting information based on minimum and maximum timestamps

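Given the search-query log below, I want to do two things: first, pick a user at random (by ID); second, extract the first and last timestamps of that user. I was given an answer to a similar question that uses a regex (LINE_REGEX) with the code shown after the log: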

1410    google  2006-05-01 21:40:54 1   http://www.google.com
2005    google  2006-03-24 21:25:10 1   http://www.google.com
2005    google  2006-03-26 21:58:12
2178    google  2006-03-27 20:58:44 1   http://www.google.com
2178    google  2006-04-11 11:06:20
2178    google  2006-04-11 11:06:41
2178    google  2006-05-16 10:54:39 1   http://www.google.com
2421    google  2006-05-04 15:39:25 1   http://www.google.com
2421    google  2006-05-04 21:14:33 1   http://www.google.com
2421    google  2006-05-05 16:16:01
2722    google  2006-04-12 15:18:12 1   http://www.google.com
2722    google  2006-05-02 09:09:19 1   http://www.google.com
2722    google  2006-05-25 15:42:26 1   http://www.google.com
2722    google  2006-05-25 15:42:26 1   http://www.google.com
6497    google  2006-04-06 22:47:10 1   http://www.google.com
6497    google  2006-04-06 23:05:58 1   http://www.google.com
9777    google  2006-03-11 23:25:57 1   http://www.google.com
9844    google  2006-03-19 10:31:09
9844    google  2006-03-19 10:31:12 1   http://www.google.com
12404   google  2006-03-04 00:42:26 1   http://www.google.com
12404   google  2006-03-13 21:17:22 1   http://www.google.com
12404   google  2006-03-13 21:17:22 1   http://www.google.com
12404   google  2006-03-13 21:17:22 1   http://www.google.com
12404   google  2006-03-13 21:17:22 1   http://www.google.com
12404   google  2006-03-13 21:47:04 1   http://www.google.com
12404   google  2006-03-13 21:47:04 1   http://www.google.com
12404   google  2006-03-22 16:57:44 1   http://www.google.com
12404   google  2006-03-23 22:07:33 1   http://www.google.com
12404   google  2006-03-23 22:07:33 1   http://www.google.com
12404   google  2006-03-23 22:07:33 1   http://www.google.com

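try (Stream<String> stream = Files.lines(Paths.get("file name"))) {
    result = stream.map(LINE_REGEX::matcher)
        // filter out any lines without an Action
        .filter(Matcher::matches)
        // group by User
        .collect(Collectors.groupingBy((Matcher m) -> m.group(1),
            Collectors.collectingAndThen(
                // compare Timestamp (min for earliest, max for latest)
                Collectors.maxBy(Comparator.comparing((Matcher m) -> m.group(2))),
                // extract Action
                (Optional<Matcher> m) -> m.get().group(3))));
}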

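But there are two problems. First, it groups by keyword (in my case) rather than by user ID. Second, if I use .minBy(), it returns the first timestamp of some other, random user, not the user that .maxBy() returned.

Any idea how to fix this?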

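You are currently grouping by user name (or keyword) rather than by user ID. Because the user name is always "google", all lines end up in a single group.

Put the first part of the regex (the user ID) in parentheses. Then either remove the group parentheses around the user-name part, or increase the group indices used for the timestamp and the action.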
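To make the effect of the grouping key visible, here is a minimal sketch that counts lines per capture group. The class name GroupingKeyDemo, the helper countBy, and the simplified pattern are assumptions made for this illustration; they are not the regex from the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class GroupingKeyDemo {
    // Hypothetical simplified pattern: group 1 = user id, group 2 = keyword.
    static final Pattern LINE = Pattern.compile("([0-9]+)\\s+(\\S+)\\s+.*");

    /** Counts matching lines per value of the given capture group. */
    static Map<String, Long> countBy(List<String> lines, int group) {
        return lines.stream()
            .map(LINE::matcher)
            .filter(Matcher::matches)
            .collect(Collectors.groupingBy((Matcher m) -> m.group(group),
                                           Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "1410    google  2006-05-01 21:40:54 1   http://www.google.com",
            "2005    google  2006-03-24 21:25:10 1   http://www.google.com",
            "2005    google  2006-03-26 21:58:12");
        // Keyed by keyword: every line lands in the single "google" group.
        System.out.println(countBy(lines, 2)); // {google=3}
        // Keyed by user id: one group per user (1410 once, 2005 twice).
        System.out.println(countBy(lines, 1));
    }
}
```

Any downstream minBy or maxBy collector only ever sees the rows of its own group, so with the keyword as key it compares rows of all users at once.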

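Now I have wrapped each part of the line in a capturing group (see the Pattern documentation for more information). That way you can access each part at the end by asking the matcher for the corresponding group.

Groups are numbered in the order in which they appear in the regex, starting at 1. Calling matcher.group(0) returns the whole line.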
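The numbering can be checked with a small sketch like the one below. The class GroupDemo, the helper groups, and the simplified pattern are hypothetical and only serve this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Hypothetical, simplified pattern for illustration only.
    static final Pattern LINE = Pattern.compile(
        "([0-9]+)"          // group 1: user id
        + "\\s+(\\S+)"      // group 2: keyword
        + "\\s+(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2})" // group 3: timestamp
        + ".*");            // remainder of the line, not captured

    /** Returns groups 0..3 of the line, or null if the line does not match. */
    static String[] groups(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i); // index 0 is the whole match
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "2178    google  2006-03-27 20:58:44 1   http://www.google.com";
        String[] g = groups(line);
        // Prints the whole line for group(0), then 2178, google, 2006-03-27 20:58:44.
        for (int i = 0; i < g.length; i++) {
            System.out.println("group(" + i + ") = " + g[i]);
        }
    }
}
```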

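Do you mean brackets? The user ID is already inside brackets.

Oh, yes. Sorry, I always mix those up. I meant the round ones (parentheses). I'll change my answer. The user ID [0-9]* has to be placed inside "()". You can then use it to group the items by calling matcher.group(1) in the groupingBy call. At the moment, group 1 refers to the user name/keyword .*?[^\\s], which is the same for all items.

Is it possible to modify this answer so that it returns the whole line rather than only the matched part? In other words, can we do the matching first and then return the line in which the match occurred?

Sure, we don't have to extract only the action; just take the matcher and ask it for everything. I added a complete example above.

This is embarrassing, but my IDE says that group is not defined, so v.group does nothing.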

try(Stream<String> stream = Files.lines(Paths.get("file name"))) {
    result = stream.map(LINE_REGEX::matcher)
        // filter out any lines without an Action
        .filter(Matcher::matches)
        // group by User
        .collect(Collectors.groupingBy((Matcher m) -> m.group(1),
            Collectors.collectingAndThen(
                // compare Timestamp (min for earliest, max for latest)
                Collectors.maxBy(Comparator.comparing((Matcher m) -> m.group(2))),
                // extract Action
                (Optional<Matcher> m) -> m.get().group(3))));
}

private static final Pattern LINE_REGEX = Pattern.compile(
    "([0-9]+)" // user id (group 1)            <- parentheses go here
    + "\\s+" // space after user id
    + ".*?[^\\s]" // user name (no group)      <- not here
    + "\\s+" // space after user name
    + "([0-9]+-.{14})" // timestamp (group 2)
    + "\\s+" //space after timestamp
    + "[0-9]*" // random int
    + "\\s+" //space after random int
    + "(.*[^\\s])" // user action (group 3)
);

Pattern LINE_REGEX = Pattern.compile("([0-9]+)"       // user id (group 1)
                                   + "\\s+"           // space after user id
                                   + "(.*?[^\\s])"    // user name (group 2)
                                   + "\\s+"           // space after user name
                                   + "([0-9]+-.{14})" // timestamp (group 3)
                                   + "\\s+"           // space after timestamp
                                   + "([0-9]*)"       // random int (group 4)
                                   + "\\s+"           // space after random int
                                   + "(.*[^\\s])"     // user action (group 5)
);
Stream<String> lines = Stream.of("1410    google  2006-05-01 21:40:54 1   http://www.google.com",
        "2005    google  2006-03-24 21:25:10 1   http://www.google.com", "2005    google  2006-03-26 21:58:12",
        "2178    google  2006-03-27 20:58:44 1   http://www.google.com", "2178    google  2006-04-11 11:06:20",
        "2178    google  2006-04-11 11:06:41", "2178    google  2006-05-16 10:54:39 1   http://www.google.com",
        "2421    google  2006-05-04 15:39:25 1   http://www.google.com", "2421    google  2006-05-04 21:14:33 1   http://www.google.com",
        "2421    google  2006-05-05 16:16:01", "2722    google  2006-04-12 15:18:12 1   http://www.google.com",
        "2722    google  2006-05-02 09:09:19 1   http://www.google.com", "2722    google  2006-05-25 15:42:26 1   http://www.google.com",
        "2722    google  2006-05-25 15:42:26 1   http://www.google.com", "6497    google  2006-04-06 22:47:10 1   http://www.google.com",
        "6497    google  2006-04-06 23:05:58 1   http://www.google.com", "9777    google  2006-03-11 23:25:57 1   http://www.google.com",
        "9844    google  2006-03-19 10:31:09", "9844    google  2006-03-19 10:31:12 1   http://www.google.com",
        "12404   google  2006-03-04 00:42:26 1   http://www.google.com", "12404   google  2006-03-13 21:17:22 1   http://www.google.com",
        "12404   google  2006-03-13 21:17:22 1   http://www.google.com", "12404   google  2006-03-13 21:17:22 1   http://www.google.com",
        "12404   google  2006-03-13 21:17:22 1   http://www.google.com", "12404   google  2006-03-13 21:47:04 1   http://www.google.com",
        "12404   google  2006-03-13 21:47:04 1   http://www.google.com", "12404   google  2006-03-22 16:57:44 1   http://www.google.com",
        "12404   google  2006-03-23 22:07:33 1   http://www.google.com", "12404   google  2006-03-23 22:07:33 1   http://www.google.com",
        "12404   google  2006-03-23 22:07:33 1   http://www.google.com");

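// Group 1 is the user id and group 3 is the timestamp (group 2 is the user name).
// minBy keeps each user's earliest matching row; use maxBy for the latest one.
Map<String, Matcher> result = 
    lines.map(LINE_REGEX::matcher)
         .filter(Matcher::matches)
         .collect(Collectors.groupingBy((Matcher m) -> m.group(1),
                                        Collectors.collectingAndThen(Collectors.minBy(Comparator.comparing((Matcher m) -> m.group(3))),
                                                                     Optional<Matcher>::get)));

// Print the user id (the map key) followed by every captured part of the row.
result.forEach((k, v) -> System.out.println(k + ": " + v.group(1) + " " + v.group(2) + " "
                                            + v.group(3) + " " + v.group(4) + " " + v.group(5)));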

--------- Output ---------------

2722: 2722 google 2006-04-12 15:18:12 1 http://www.google.com
9777: 9777 google 2006-03-11 23:25:57 1 http://www.google.com
2005: 2005 google 2006-03-24 21:25:10 1 http://www.google.com
9844: 9844 google 2006-03-19 10:31:12 1 http://www.google.com
6497: 6497 google 2006-04-06 22:47:10 1 http://www.google.com
1410: 1410 google 2006-05-01 21:40:54 1 http://www.google.com
2421: 2421 google 2006-05-04 15:39:25 1 http://www.google.com
2178: 2178 google 2006-03-27 20:58:44 1 http://www.google.com
12404: 12404 google 2006-03-04 00:42:26 1 http://www.google.com
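
The question also asks for both the first and the last timestamp of one randomly chosen user, which a single minBy or maxBy collector does not provide. One possible approach, sketched here under stated assumptions, is to reduce each user's rows to a (first, last) pair with Collectors.toMap and a merge function, and then pick a random key from the resulting map. The class TimestampRange, the nested Range type, the helper names, and the simplified pattern are all hypothetical; unlike the regex above, this pattern also accepts rows that have no trailing action, so those rows count toward the range.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class TimestampRange {
    // Hypothetical simplified pattern: group 1 = user id, group 2 = timestamp.
    static final Pattern LINE = Pattern.compile(
        "([0-9]+)\\s+\\S+\\s+(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}).*");
    static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    /** Earliest and latest timestamp seen for one user. */
    static final class Range {
        final LocalDateTime first;
        final LocalDateTime last;

        Range(LocalDateTime first, LocalDateTime last) {
            this.first = first;
            this.last = last;
        }

        /** Combines two ranges into one that covers both. */
        Range merge(Range other) {
            return new Range(first.isBefore(other.first) ? first : other.first,
                             last.isAfter(other.last) ? last : other.last);
        }
    }

    /** Maps each user id to its first and last timestamp; non-matching lines are skipped. */
    static Map<String, Range> rangesByUser(List<String> lines) {
        return lines.stream()
            .map(LINE::matcher)
            .filter(Matcher::matches)
            .collect(Collectors.toMap(
                (Matcher m) -> m.group(1),
                (Matcher m) -> {
                    LocalDateTime t = LocalDateTime.parse(m.group(2), FORMAT);
                    return new Range(t, t); // a single row is its own first and last
                },
                Range::merge));
    }

    /** Picks one user id uniformly at random; the map must not be empty. */
    static String randomUser(Map<String, Range> ranges, Random random) {
        List<String> ids = new ArrayList<>(ranges.keySet());
        return ids.get(random.nextInt(ids.size()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "2178    google  2006-03-27 20:58:44 1   http://www.google.com",
            "2178    google  2006-04-11 11:06:20",
            "2178    google  2006-05-16 10:54:39 1   http://www.google.com",
            "9777    google  2006-03-11 23:25:57 1   http://www.google.com");
        Map<String, Range> ranges = rangesByUser(lines);
        String user = randomUser(ranges, new Random());
        Range r = ranges.get(user);
        System.out.println(user + ": first=" + r.first + ", last=" + r.last);
    }
}
```

Because both ends of the range come from the same map entry, the first and last timestamps are guaranteed to belong to the same user. Parsing into LocalDateTime is a design choice of this sketch; comparing the raw strings would also order this fixed-width format correctly.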