Using a Java regular expression, how do I check whether a string contains any word from a set?

java regex

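I have a set of words, say: apple, orange, pear, banana, kiwi.

I want to check whether a sentence contains any of the words listed above and, if it does, find out which word matched. How can I do this with a regular expression?

I am currently calling String.indexOf() for each word in my set. I assume this is less efficient than a regex match?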

I don't think a regular expression will do any better in terms of performance, but you can use one as follows:

Pattern p = Pattern.compile("(apple|orange|pear)");
Matcher m = p.matcher(inputString);
while (m.find()) {
   String matched = m.group(1);
   // Do something
}
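
When the words live in a collection rather than a hard-coded string, the alternation can be assembled at runtime. The following is a minimal sketch, not part of the original answer: the class name WordSetMatcher and its two helper methods are hypothetical, and the sketch assumes a non-empty word list. Pattern.quote is used so that a word containing regex metacharacters is still matched literally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class, for illustration only.
final class WordSetMatcher {

    // Builds a pattern of the form (word1|word2|...) from the given words.
    // Assumes the collection is non-empty; an empty one would yield "()",
    // which matches the empty string at every position.
    static Pattern compile(Collection<String> words) {
        String alternation = words.stream()
                .map(Pattern::quote) // treat each word as a literal
                .collect(Collectors.joining("|", "(", ")"));
        return Pattern.compile(alternation);
    }

    // Returns every matched word in the order it occurs in the input.
    static List<String> findAll(Pattern pattern, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        Pattern p = compile(Arrays.asList("apple", "orange", "pear"));
        System.out.println(findAll(p, "An apple and a pear")); // [apple, pear]
    }
}
```

Compiling the pattern once and reusing it for every sentence avoids paying the compilation cost on each call.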

TL;DR: for simple substrings contains() is best, but for matching whole words only, a regular expression is probably better.

The best way to find out which method is more efficient is to test it.

You can use String.contains() instead of String.indexOf() to simplify your non-regex code.
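
To make the simplification concrete, here is a small sketch comparing the two calls. The class and method names are hypothetical and exist only for this illustration; both methods answer the same question, but contains() states the intent directly instead of comparing an index.

```java
// Hypothetical demo class, for illustration only.
final class ContainsVsIndexOf {

    // indexOf returns -1 when the word is absent, so the result must be compared.
    static boolean viaIndexOf(String sentence, String word) {
        return sentence.indexOf(word) >= 0;
    }

    // contains expresses the same check directly as a boolean.
    static boolean viaContains(String sentence, String word) {
        return sentence.contains(word);
    }

    public static void main(String[] args) {
        String sentence = "An apple is nice";
        System.out.println(viaIndexOf(sentence, "apple"));  // true
        System.out.println(viaContains(sentence, "apple")); // true
        System.out.println(viaContains(sentence, "kiwi"));  // false
    }
}
```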

To search for different words, the regular expression looks like this:

apple|orange|pear|banana|kiwi

The | character works as OR in a regular expression.

My very simple test code looks like this:

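import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestContains {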

   private static String containsWord(Set<String> words,String sentence) {
     for (String word : words) {
       if (sentence.contains(word)) {
         return word;
       }
     }

     return null;
   }

   private static String matchesPattern(Pattern p,String sentence) {
     Matcher m = p.matcher(sentence);

     if (m.find()) {
       return m.group();
     }

     return null;
   }

   public static void main(String[] args) {
     Set<String> words = new HashSet<String>();
     words.add("apple");
     words.add("orange");
     words.add("pear");
     words.add("banana");
     words.add("kiwi");

     Pattern p = Pattern.compile("apple|orange|pear|banana|kiwi");

     String noMatch = "The quick brown fox jumps over the lazy dog.";
     String startMatch = "An apple is nice";
     String endMatch = "This is a longer sentence with the match for our fruit at the end: kiwi";

     long start = System.currentTimeMillis();
     int iterations = 10000000;

     for (int i = 0; i < iterations; i++) {
       containsWord(words, noMatch);
       containsWord(words, startMatch);
       containsWord(words, endMatch);
     }

     System.out.println("Contains took " + (System.currentTimeMillis() - start) + "ms");
     start = System.currentTimeMillis();

     for (int i = 0; i < iterations; i++) {
       matchesPattern(p,noMatch);
       matchesPattern(p,startMatch);
       matchesPattern(p,endMatch);
     }

     System.out.println("Regular Expression took " + (System.currentTimeMillis() - start) + "ms");
   }
}

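Obviously the timings will vary with the number of words being searched for and the strings being searched, but for a simple search like this contains() seems to be about 10 times faster than a regular expression.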
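
Timings from a single loop like the one above can be skewed by JIT compilation kicking in partway through the measurement. The sketch below shows one way to reduce that effect by running a warm-up phase before timing. It is not the method used for the figures in this answer: the class TimingSketch, its timeNanos method and the sink field are all hypothetical, and a dedicated harness such as JMH is the more rigorous option.

```java
import java.util.function.Supplier;

// Hypothetical timing helper, for illustration only.
final class TimingSketch {

    // Accumulates results so the measured work is harder for the JIT to discard.
    static int sink;

    // Runs the task 'warmup' times untimed, then 'iterations' times timed,
    // and returns the elapsed time of the timed phase in nanoseconds.
    static long timeNanos(Supplier<String> task, int warmup, int iterations) {
        for (int i = 0; i < warmup; i++) {
            consume(task.get());
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            consume(task.get());
        }
        return System.nanoTime() - start;
    }

    private static void consume(String result) {
        if (result != null) {
            sink += result.length();
        }
    }

    public static void main(String[] args) {
        String sentence = "An apple is nice";
        long nanos = timeNanos(
                () -> sentence.contains("apple") ? "apple" : null,
                100_000, 1_000_000);
        System.out.println("contains: " + nanos / 1_000_000 + "ms");
    }
}
```

System.nanoTime() is used instead of System.currentTimeMillis() because it is intended for measuring elapsed time.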

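By using a regular expression to search for a string inside another string, you are using a sledgehammer to crack a nut, so I guess we shouldn't be surprised that it is slower. Save regular expressions for when the patterns you want to find are more complex.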

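One case where you may want to use a regular expression is when indexOf() and contains() won't do the job because you only want to match whole words and not just substrings, e.g. you want to match pear but not spears. Regular expressions handle this case well, because they have the concept of word boundaries. In that case we would change the pattern to: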

\b(apple|orange|pear|banana|kiwi)\b

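\b matches only at the start or end of a word, and the parentheses group the OR alternatives together. Note that when you define this pattern in code, you need to escape each backslash with another backslash: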

Pattern p = Pattern.compile("\\b(apple|orange|pear|banana|kiwi)\\b");
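
The difference between the plain alternation and the word-boundary version can be seen with the pear/spears example. This is a minimal sketch; the class WholeWordDemo and its method names are hypothetical and not part of the original answer.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
final class WholeWordDemo {

    // Matches any of the words anywhere, including inside longer words.
    static final Pattern SUBSTRING =
            Pattern.compile("apple|orange|pear|banana|kiwi");

    // Matches the same words only when delimited by word boundaries.
    static final Pattern WHOLE_WORD =
            Pattern.compile("\\b(apple|orange|pear|banana|kiwi)\\b");

    static boolean containsSubstring(String sentence) {
        return SUBSTRING.matcher(sentence).find();
    }

    static boolean containsWholeWord(String sentence) {
        return WHOLE_WORD.matcher(sentence).find();
    }

    public static void main(String[] args) {
        String spears = "He sharpened the spears";
        String pear = "A ripe pear";
        System.out.println(containsSubstring(spears)); // true: "pear" is inside "spears"
        System.out.println(containsWholeWord(spears)); // false: no word boundary before "pear"
        System.out.println(containsSubstring(pear));   // true
        System.out.println(containsWholeWord(pear));   // true
    }
}
```

Note that with word boundaries the plural "pears" no longer matches either, so inflected forms would have to be listed explicitly.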

Here is the simplest solution I found (matching with wildcards):

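Performance depends on the length of the regular expression. If it is under 1000 characters, go ahead. If it is longer, you need a different solution: for example, split the text into individual words and check them against a predefined hash table or set of known words.

@驱逐出境者 The point of the answer is to give a good hint on how to solve the problem, not to provide a perfect, polished, world-class solution. It can easily be improved. As for readability, if you have 200 strings (another reason not to use a regexp), you can use a for loop and concatenate them in a StringBuilder. I think my answer gives enough of the flavor.

You probably meant "(apple)|(orange)|(pear)". Otherwise you would be matching run-together combinations of the words rather than the individual words.

Um, no. Sorry, that is just how it works. Your solution works as well, but then you need a separate group for each word.

@GuillaumePolet Add word boundaries: "\\b(apple|orange|pear)\\b"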
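
The alternative suggested in the first comment above, splitting the text into words and checking each against a set, could look roughly like this. It is a sketch under stated assumptions: the class TokenLookup and the method knownWordsIn are hypothetical, tokens are separated on runs of non-word characters (ASCII letters, digits and underscore count as word characters by default), and the comparison is case-sensitive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, for illustration only.
final class TokenLookup {

    // Returns the tokens of 'text' that are present in 'known', in text order.
    static List<String> knownWordsIn(String text, Set<String> known) {
        List<String> found = new ArrayList<>();
        // Split on runs of non-word characters; a leading empty token is
        // harmless because the empty string is not expected in the set.
        for (String token : text.split("\\W+")) {
            if (known.contains(token)) {
                found.add(token);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Set<String> known = new HashSet<>(
                Arrays.asList("apple", "orange", "pear", "banana", "kiwi"));
        System.out.println(
                knownWordsIn("I ate a kiwi, then spears and an apple.", known));
        // [kiwi, apple]
    }
}
```

Each set lookup takes roughly constant time, so the cost depends on the number of tokens in the text rather than on the number of known words. Like the \b pattern, this approach matches whole words only, so "spears" is not reported as "pear".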