Removing strings from another string in Java

Suppose I have a list of words:

 String[] stopWords = new String[]{"i","a","and","about","an","are","as","at","be","by","com","for","from","how","in","is","it","not","of","on","or","that","the","this","to","was","what","when","where","who","will","with","the","www"};
Then I have some text:

 String text = "I would like to do a nice novel about nature AND people";
Is there a method that matches the stop words and removes them while ignoring case, something like this?

 String noStopWordsText = remove(text, stopWords);
Result:

 " would like do nice novel nature people"
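A regex would work fine if you know one, but I would prefer something like a commons solution that is more performance oriented.

By the way, this is the commons method I am using right now; it lacks proper case-insensitive handling: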

 private static final String[] stopWords = new String[]{"i", "a", "and", "about", "an", "are", "as", "at", "be", "by", "com", "for", "from", "how", "in", "is", "it", "not", "of", "on", "or", "that", "the", "this", "to", "was", "what", "when", "where", "who", "will", "with", "the", "www", "I", "A", "AND", "ABOUT", "AN", "ARE", "AS", "AT", "BE", "BY", "COM", "FOR", "FROM", "HOW", "IN", "IS", "IT", "NOT", "OF", "ON", "OR", "THAT", "THE", "THIS", "TO", "WAS", "WHAT", "WHEN", "WHERE", "WHO", "WILL", "WITH", "THE", "WWW"};
 private static final String[] blanksForStopWords = new String[]{"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""};

 noStopWordsText = StringUtils.replaceEach(text, stopWords, blanksForStopWords);     

You could create a regular expression that matches all the stop words [for example " a ", note the spaces here] and end up with

str.replaceAll(regexpression,"");

 String[] stopWords = new String[]{" i ", " a ", " and ", " about ", " an ", " are ", " as ", " at ", " be ", " by ", " com ", " for ", " from ", " how ", " in ", " is ", " it ", " not ", " of ", " on ", " or ", " that ", " the ", " this ", " to ", " was ", " what ", " when ", " where ", " who ", " will ", " with ", " the ", " www "};
 String text = " I would like to do a nice novel about nature AND people ";

 for (String stopword : stopWords) {
     text = text.replaceAll("(?i)"+stopword, " ");
 }
 System.out.println(text);
Output:

 would like do nice novel nature people 

There may be a better way.

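Create a regular expression from your stop words, make it case insensitive, and then use the matcher's replaceAll method to replace all matches with an empty string: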

import java.util.regex.*;

Pattern stopWords = Pattern.compile("\\b(?:i|a|and|about|an|are|...)\\b\\s*", Pattern.CASE_INSENSITIVE);
Matcher matcher = stopWords.matcher("I would like to do a nice novel about nature AND people");
String clean = matcher.replaceAll("");
The ... in the pattern is just me being lazy; continue the list of stop words there.
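Rather than typing the alternation by hand, the pattern could also be assembled from the word list. The following is only a sketch under assumptions: the class name StopWordRegex and the helpers compile and remove are hypothetical names, and only a handful of stop words are passed in for brevity.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class StopWordRegex {
    // Hypothetical helper: builds one case-insensitive pattern from a word list.
    static Pattern compile(String... words) {
        String alternation = Arrays.stream(words)
                .map(Pattern::quote) // escape any regex metacharacters in a word
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b\\s*",
                Pattern.CASE_INSENSITIVE);
    }

    // Hypothetical helper: removes every whole-word match and its trailing spaces.
    static String remove(String text, Pattern stopWords) {
        return stopWords.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Compile once, reuse for every text that needs cleaning.
        Pattern p = compile("i", "a", "and", "about", "to");
        String text = "I would like to do a nice novel about nature AND people";
        // prints: would like do nice novel nature people
        System.out.println(remove(text, p));
    }
}
```

Because the pattern is compiled once and reused, the compilation cost is not paid again for each text.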

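Another approach is to loop over all the stop words and use String's replaceAll method. The problem with this approach is that replaceAll compiles a new regular expression on every call, so it is not very efficient inside a loop. Also, String's replaceAll does not accept a flags argument for case insensitivity; the inline (?i) flag has to be embedded in the pattern itself.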
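To make the comparison concrete, here is a small sketch of the two routes to a case-insensitive replacement. The class name InlineFlagDemo and both method names are hypothetical. The String route embeds the (?i) flag in the regex and compiles it on each call; the Pattern route compiles once with Pattern.CASE_INSENSITIVE and can be reused.

```java
import java.util.regex.Pattern;

public class InlineFlagDemo {
    // Compiled once and reused across calls.
    private static final Pattern AND_WORD =
            Pattern.compile("\\band\\b\\s*", Pattern.CASE_INSENSITIVE);

    // String.replaceAll has no flags parameter, so the flag is inlined as (?i).
    static String viaString(String text) {
        return text.replaceAll("(?i)\\band\\b\\s*", "");
    }

    // Same replacement through the precompiled pattern.
    static String viaPattern(String text) {
        return AND_WORD.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String text = "nature AND people";
        System.out.println(viaString(text));  // prints: nature people
        System.out.println(viaPattern(text)); // prints: nature people
    }
}
```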


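Edit: I added \b around the pattern so that it matches whole words only. I also added \s* so that it swallows any whitespace that follows, which may not be necessary.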
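The effect of \b and \s* is easiest to see side by side. This is an illustrative sketch only; the class name WordBoundaryDemo, the helper strip, and the shortened two-word alternation are assumptions made for the example.

```java
import java.util.regex.Pattern;

public class WordBoundaryDemo {
    // Hypothetical helper: removes every case-insensitive match of regex from text.
    static String strip(String regex, String text) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE)
                .matcher(text)
                .replaceAll("");
    }

    public static void main(String[] args) {
        String text = "a nice novel about nature";

        // Without \b the alternation also matches inside longer words,
        // so the "a" in "nature" is removed too.
        System.out.println("[" + strip("(?:about|a)", text) + "]");
        // prints: [ nice novel  nture]

        // With \b only whole words match; \s* also consumes the spaces after them.
        System.out.println("[" + strip("\\b(?:about|a)\\b\\s*", text) + "]");
        // prints: [nice novel nature]
    }
}
```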

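Here is a solution that does not use regular expressions. I think it is inferior to my other answer because it is much longer and less clear, but if performance is really, really important, this one is O(n), where n is the length of the text.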

import java.util.HashSet;
import java.util.Set;

Set<String> stopWords = new HashSet<String>();
stopWords.add("a");
stopWords.add("and");
// and so on ...

String sampleText = "I would like to do a nice novel about nature AND people";
StringBuilder clean = new StringBuilder();
int index = 0;

while (index < sampleText.length()) {
  // the only word delimiter supported is space, if you want other
  // delimiters you have to do a series of indexOf calls and see which
  // one gives the smallest index, or use regex
  int nextIndex = sampleText.indexOf(" ", index);
  if (nextIndex == -1) {
    // no further delimiter: the last word runs to the end of the text
    nextIndex = sampleText.length();
  }
  String word = sampleText.substring(index, nextIndex);
  // stop words are stored in lower case, so compare in lower case
  if (!stopWords.contains(word.toLowerCase())) {
    clean.append(word);
    if (nextIndex < sampleText.length()) {
      // this adds the word delimiter, e.g. the following space
      clean.append(sampleText.substring(nextIndex, nextIndex + 1));
    }
  }
  index = nextIndex + 1;
}

System.out.println("Stop words removed: " + clean.toString());
Split the text on whitespace. Then loop through the resulting array and append each string to a StringBuilder only if it is not a stop word.
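A minimal sketch of that idea follows. The class name SplitStopWords and the method remove are hypothetical, and the stop-word set is assumed to hold lower-case entries. Note that this version collapses every run of whitespace into a single space and does not treat punctuation specially.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class SplitStopWords {
    // Hypothetical helper: keeps only the words that are not in stopWords.
    static String remove(String text, Set<String> stopWords) {
        StringBuilder clean = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            // Skip the empty token produced by blank input, and skip stop words.
            if (word.isEmpty() || stopWords.contains(word.toLowerCase(Locale.ROOT))) {
                continue;
            }
            if (clean.length() > 0) {
                clean.append(' '); // single space between kept words
            }
            clean.append(word);
        }
        return clean.toString();
    }

    public static void main(String[] args) {
        Set<String> stopWords =
                new HashSet<>(Arrays.asList("i", "a", "and", "about", "to"));
        String text = "I would like to do a nice novel about nature AND people";
        // prints: would like do nice novel nature people
        System.out.println(remove(text, stopWords));
    }
}
```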

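Is there any punctuation in the string?

Are there hard numbers showing that the regexp solution does not perform well enough, or is this just premature optimization? I mean, it is certainly not the most efficient solution, but unless this is all you do and you need to do it 10K times a second, I bet it is not a problem.

1) It does not handle the requirement that the method be case insensitive. 2) It does not remove stop words only: it removes the "no" from "novel".

Clever trick, I did not know that was possible. My only criticism is that replaceAll is inefficient: it compiles a one-off regexp pattern, so using it in a loop is not great.

Yes, it should. I made a mistake in the regexp: \b has to be written \\b in Java, and I forgot. It should work now.

Very true, I changed the break to nextIndex=sampleText.length, which should fix the problem.

Oh, that is what I tested, but I was sloppy when I changed the code. Thanks for pointing it out.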