Java: using a regular expression to extract dialogue snippets from text

java regex

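I'm trying to extract snippets of dialogue from the text of a book. For example, if I have the string

"What's the matter with the flag?" inquired Captain MacWhirr. "Seems all right to me."

then I want to extract

"What's the matter with the flag?"

and

"Seems all right to me."

I found a regular expression to use, which is

"[^"\\]*(\\.[^"\\]*)*"

This works great in Eclipse when I do a Ctrl+F regex find on my book.txt file, but when I run the following code: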

String regex = "\"[^\"\\\\]*(\\\\.[^\"\\\\]*)*\"";
String bookText = "\"What's the matter with the flag?\" inquired Captain MacWhirr. \"Seems all right to me.\"";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(bookText);

if(m.find())
 System.out.println(m.group(1));

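the only thing that prints is

null

So, am I not converting the regex to a Java string properly? Do I need to take into account the fact that a double quote inside a Java string literal is written as \"?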

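In natural-language text, a " is unlikely to be escaped by a preceding backslash, so you should be able to use just the pattern

"([^"]*)"

As a Java string literal, this is

"\"([^\"]*)\""

Here it is in Java: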

String regex = "\"([^\"]*)\"";
String bookText = "\"What's the matter with the flag?\" inquired Captain MacWhirr. \"Seems all right to me.\"";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(bookText);

while (m.find()) {
    System.out.println(m.group(1));
}

The above prints:

What's the matter with the flag?
Seems all right to me.
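
If the snippets are needed as data rather than printed, the same loop can be wrapped in a small helper. This is only a minimal sketch: the class name DialogueExtractor and the method extract are made up for illustration and do not appear in the original post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DialogueExtractor {
    // The simple pattern from the answer: a quote, anything but a quote, a quote.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    // Hypothetical helper: returns the text between each pair of double quotes.
    static List<String> extract(String text) {
        List<String> snippets = new ArrayList<>();
        Matcher m = QUOTED.matcher(text);
        while (m.find()) {
            snippets.add(m.group(1)); // group 1 is the part between the quotes
        }
        return snippets;
    }

    public static void main(String[] args) {
        String bookText = "\"What's the matter with the flag?\" inquired Captain MacWhirr. \"Seems all right to me.\"";
        for (String snippet : extract(bookText)) {
            System.out.println(snippet);
        }
    }
}
```

Compiling the pattern once as a constant avoids recompiling it on every call, which may matter when scanning a whole book.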

On escape sequences

Given this declaration:

String s = "\"";
System.out.println(s.length()); // prints "1"

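The string s has only one character, the double quote ("). The backslash is an escape sequence that exists only at the Java source-code level; the string itself contains no backslash.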
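
One way to check that a regex was converted to a Java string literal correctly is to print the string and compare it with the regex as typed into Eclipse. The following is a small sketch; the class and constant names are assumptions chosen for illustration.

```java
class EscapeDemo {
    // The simple pattern from the answer, written as a Java string literal.
    static final String SIMPLE = "\"([^\"]*)\"";
    // The pattern from the question, written as a Java string literal.
    static final String ESCAPE_AWARE = "\"[^\"\\\\]*(\\\\.[^\"\\\\]*)*\"";

    public static void main(String[] args) {
        // Printing shows the characters the regex engine actually receives:
        // every \" in the source becomes a bare ", and every \\ becomes one \.
        System.out.println(SIMPLE);
        System.out.println(ESCAPE_AWARE);
        System.out.println(SIMPLE.length() + " and " + ESCAPE_AWARE.length() + " characters");
    }
}
```

If the printed text matches the regex that works in the Eclipse find dialog, the literal was escaped correctly and any remaining problem lies elsewhere.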

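The problem with the original code

There's actually nothing wrong with the pattern per se, but you're not capturing the right part. Group 1 (\1) does not capture the quoted text: it only wraps the escaped-character part, which repeats zero times when the text contains no backslashes, so m.group(1) returns null. Here is the pattern with the correct capturing group: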

String regex = "\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\"";
String bookText = "\"What's the matter?\" inquired Captain MacWhirr. \"Seems all right to me.\"";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(bookText);

while (m.find()) {
    System.out.println(m.group(1));
}

For visual comparison, here is the original pattern, as a Java string literal:

String regex = "\"[^\"\\\\]*(\\\\.[^\"\\\\]*)*\""
                            ^^^^^^^^^^^^^^^^^
                           why capture this part?

And here is the modified pattern:

String regex = "\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\""
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                    we want to capture this part!
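
To see the difference between the whole match and group 1 with the original pattern, a short experiment along these lines may help. It is a sketch only; the class name GroupDemo is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // The original pattern from the question, unchanged.
    static final Pattern ORIGINAL =
            Pattern.compile("\"[^\"\\\\]*(\\\\.[^\"\\\\]*)*\"");

    public static void main(String[] args) {
        Matcher m = ORIGINAL.matcher("\"Seems all right to me.\"");
        if (m.find()) {
            // group() is the whole match, including the surrounding quotes.
            System.out.println(m.group());
            // Group 1 sits inside (...)*, which matched zero times here,
            // so the group never participated and group(1) is null.
            System.out.println(m.group(1));
        }
    }
}
```

So the pattern does match the dialogue; the question's code simply asks for a group that never took part in the match.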

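However, as mentioned earlier, this complicated pattern isn't necessary for natural-language text, because such text is unlikely to contain escaped quotes.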
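
For input that does contain backslash-escaped quotes (source code or JSON-like text, say), the two patterns behave differently. The sketch below uses a made-up input string and hypothetical names (EscapedQuoteDemo, captures) to compare them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedQuoteDemo {
    static final Pattern SIMPLE = Pattern.compile("\"([^\"]*)\"");
    static final Pattern ESCAPE_AWARE =
            Pattern.compile("\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\"");

    // Hypothetical helper: collects group 1 of every match.
    static List<String> captures(Pattern p, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up input whose characters are: "He said \"stop\" and left." she wrote.
        String text = "\"He said \\\"stop\\\" and left.\" she wrote.";
        System.out.println(captures(SIMPLE, text));       // splits at the escaped quotes
        System.out.println(captures(ESCAPE_AWARE, text)); // keeps the quotation whole
    }
}
```

The simple pattern treats every double quote as a delimiter, so it cuts the quotation at the first escaped quote, while the escape-aware pattern skips over each backslash-plus-character pair and returns the quotation in one piece.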

Hm, this seems to also get everything in between the dialogue snippets... I just want the dialogue itself.

@sheldon: Please consult the accompanying Java code. I think you're confusing the regex pattern with the Java string literal.

Your pattern is too greedy. I think that's what @sheldon meant: it seems to get everything in between the snippets. The correct one would be

"([^"]*?)"

@Amargosh: The negated character class makes greedy versus reluctant irrelevant: there can be only one match.
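
The point about greediness can be checked with a quick experiment. The sketch below is illustrative only: the class name GreedyDemo and the helper captures are assumptions, and the dot-based pattern is included purely as a contrast because it does not appear in the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    // Hypothetical helper: collects group 1 of every match of regex in text.
    static List<String> captures(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String bookText = "\"What's the matter with the flag?\" inquired Captain MacWhirr. \"Seems all right to me.\"";
        // Negated character class: greedy and reluctant give the same snippets.
        System.out.println(captures("\"([^\"]*)\"", bookText));
        System.out.println(captures("\"([^\"]*?)\"", bookText));
        // Greedy dot: runs from the first quote to the last one.
        System.out.println(captures("\"(.*)\"", bookText));
    }
}
```

Because [^"] can never consume a quote, the closing quote is fixed regardless of the quantifier; only a pattern such as .* can run across the narration between snippets.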