Best way to find every phrase from a list in a string in Java


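I have searched a lot for this. Most posts discuss either finding the common strings between two ArrayLists, which can be done with Collection.retainAll, or comparing a text against an ArrayList of single words.

I have some text in Java that looks like this:

String text = "Get a placement right today by applying to our interviews and don't forget to email us your resume. This is a top job opportunity to get yourself acquainted with real world programming and skill building. Hurry! apply for placement now here";
I have an ArrayList containing two strings, "placement" and "job opportunity".

I want the result to be placement (2) and job opportunity (1). I currently have a few approaches, but I would like to know the best way to achieve this.

Approach 1
Maintain a counter for each word in the ArrayList. For each word in the ArrayList, call text.contains(word) and increment the corresponding counter if it returns true. What happens if the text has more words than the ArrayList, or the ArrayList has more words than the text? Is there an optimal or shorter way to achieve the same thing? My ArrayList may contain single words or phrases. Thanks in advance for your suggestions.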

A simple approach is to search for each word in the list using String.indexOf:

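// text is the string being searched; list holds the words and phrases to count
for (String word : list) {
  int prev = -1;
  int count = 0;
  do {
    // resume the search just past the previous match
    prev = text.indexOf(word, prev + 1);
    if (prev != -1 /* && check for word breaks */) {
      count++;
    }
  } while (prev != -1);
  System.out.println(word + " " + count);
}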
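However, apart from its simplicity, this is not optimal by any particular criterion.

Note that this does not check for word breaks, so it would find "foo" in "xfoox"; the condition shown above could be changed to check for them.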
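One way to add the word-break check is to let a regular expression do it. The following is a minimal sketch, not part of the original answer: the class name PhraseCounter and the method countPhrases are made up for illustration. It assumes a case-sensitive match and phrases that start and end with a letter or digit, since `\b` only marks a boundary next to a word character.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class PhraseCounter {

    /** Counts whole-word occurrences of each phrase in text (case-sensitive). */
    static Map<String, Integer> countPhrases(String text, List<String> phrases) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String phrase : phrases) {
            // Pattern.quote makes the phrase literal; \b demands a word boundary on both sides.
            Pattern p = Pattern.compile("\\b" + Pattern.quote(phrase) + "\\b");
            Matcher m = p.matcher(text);
            int count = 0;
            while (m.find()) {
                count++;
            }
            counts.put(phrase, count);
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "Get a placement right today by applying to our interviews and "
                + "don't forget to email us your resume. This is a top job opportunity "
                + "to get yourself acquainted with real world programming and skill "
                + "building. Hurry! apply for placement now here";
        List<String> list = Arrays.asList("placement", "job opportunity");
        for (Map.Entry<String, Integer> e : countPhrases(text, list).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
        // prints:
        // placement 2
        // job opportunity 1
    }
}
```

With this check, "foo" is no longer counted inside "xfoox". The text is still scanned once per phrase, so the cost grows with the size of the list.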


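If you need to handle a very large list of words, an algorithm designed to match many patterns at once would be more efficient, because it avoids checking every string in the list separately. It does require some preprocessing of the word list, although this can be implemented reasonably efficiently and can be done once, offline, for a given word list.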
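One well-known algorithm of this kind is Aho-Corasick; that it is the one meant here is an assumption. The sketch below is a hand-written illustration, not library code: the class MultiPhraseMatcher and its methods are hypothetical names. It builds a trie of the phrases once, then counts every occurrence in a single pass over the text. It counts plain substring matches, including overlapping ones, and does no word-break checking.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical class, for illustration only.
class MultiPhraseMatcher {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;                                       // node for the longest proper suffix present in the trie
        final List<String> outputs = new ArrayList<>();  // phrases that end at this node
    }

    private final Node root = new Node();
    private final List<String> phrases;

    /** Preprocessing: build the trie, then set failure links breadth-first. */
    MultiPhraseMatcher(List<String> phrases) {
        this.phrases = phrases;
        for (String phrase : phrases) {
            if (phrase.isEmpty()) continue;              // empty phrases are ignored
            Node node = root;
            for (char c : phrase.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            if (!node.outputs.contains(phrase)) node.outputs.add(phrase);
        }
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                Node child = e.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(e.getKey())) {
                    f = f.fail;
                }
                Node target = f.next.get(e.getKey());
                child.fail = (target != null) ? target : root;
                // inherit phrases that end at the suffix node
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    /** Single pass over the text; returns the occurrence count of each phrase. */
    Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String phrase : phrases) counts.put(phrase, 0);
        Node node = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            node = node.next.getOrDefault(c, root);
            for (String phrase : node.outputs) {
                counts.merge(phrase, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        MultiPhraseMatcher matcher =
                new MultiPhraseMatcher(Arrays.asList("placement", "job opportunity"));
        System.out.println(matcher.count(
                "a placement today, a top job opportunity, apply for placement now"));
        // prints {placement=2, job opportunity=1}
    }
}
```

The constructor does the preprocessing mentioned above, so one matcher can be built for a given word list and reused for many texts.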

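If I understand correctly, this is an instance of the pattern-matching problem. The best string-search algorithms are well documented, along with their average- and worst-case complexities. If I remember correctly, The Design and Analysis of Computer Algorithms by Alfred V. Aho, John E. Hopcroft and Jeffrey D. Ullman covers this in its chapter on pattern matching.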

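The following two seem to be the most efficient (preprocessing + matching, where m is the pattern length, n the text length and k the alphabet size):

  • Knuth-Morris-Pratt (time complexity Θ(m) + Θ(n))
  • Boyer-Moore (time complexity Θ(m+k) + O(n))

I found implementations of both in the code accompanying Algorithms, 4th Edition, by Robert Sedgewick and Kevin Wayne. I am also copying the files here in case the link breaks. Note that StdOut is just System.out.

    Backup of KMP: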

    /******************************************************************************
     *  Compilation:  javac KMP.java
     *  Execution:    java KMP pattern text
     *  Dependencies: StdOut.java
     *
     *  Reads in two strings, the pattern and the input text, and
     *  searches for the pattern in the input text using the
     *  KMP algorithm.
     *
     *  % java KMP abracadabra abacadabrabracabracadabrabrabracad
     *  text:    abacadabrabracabracadabrabrabracad 
     *  pattern:               abracadabra          
     *
     *  % java KMP rab abacadabrabracabracadabrabrabracad
     *  text:    abacadabrabracabracadabrabrabracad 
     *  pattern:         rab
     *
     *  % java KMP bcara abacadabrabracabracadabrabrabracad
     *  text:    abacadabrabracabracadabrabrabracad 
     *  pattern:                                   bcara
     *
     *  % java KMP rabrabracad abacadabrabracabracadabrabrabracad 
     *  text:    abacadabrabracabracadabrabrabracad
     *  pattern:                        rabrabracad
     *
     *  % java KMP abacad abacadabrabracabracadabrabrabracad
     *  text:    abacadabrabracabracadabrabrabracad
     *  pattern: abacad
     *
     ******************************************************************************/
    
    /**
     *  The <tt>KMP</tt> class finds the first occurrence of a pattern string
     *  in a text string.
     *  <p>
     *  This implementation uses a version of the Knuth-Morris-Pratt substring search
     *  algorithm. The version takes time and space proportional to
     *  <em>N</em> + <em>M R</em> in the worst case, where <em>N</em> is the length
     *  of the text string, <em>M</em> is the length of the pattern, and <em>R</em>
     *  is the alphabet size.
     *  <p>
     *  For additional documentation,
     *  see <a href="http://algs4.cs.princeton.edu/53substring">Section 5.3</a> of
     *  <i>Algorithms, 4th Edition</i> by Robert Sedgewick and Kevin Wayne.
     */
    public class KMP {
        private final int R;       // the radix
        private int[][] dfa;       // the KMP automaton
    
        private char[] pattern;    // either the character array for the pattern
        private String pat;        // or the pattern string
    
        /**
         * Preprocesses the pattern string.
         *
         * @param pat the pattern string
         */
        public KMP(String pat) {
            this.R = 256;
            this.pat = pat;
    
            // build DFA from pattern
            int M = pat.length();
            dfa = new int[R][M]; 
            dfa[pat.charAt(0)][0] = 1; 
            for (int X = 0, j = 1; j < M; j++) {
                for (int c = 0; c < R; c++) 
                    dfa[c][j] = dfa[c][X];     // Copy mismatch cases. 
                dfa[pat.charAt(j)][j] = j+1;   // Set match case. 
                X = dfa[pat.charAt(j)][X];     // Update restart state. 
            } 
        } 
    
        /**
         * Preprocesses the pattern string.
         *
         * @param pattern the pattern string
         * @param R the alphabet size
         */
        public KMP(char[] pattern, int R) {
            this.R = R;
            this.pattern = new char[pattern.length];
            for (int j = 0; j < pattern.length; j++)
                this.pattern[j] = pattern[j];
    
            // build DFA from pattern
            int M = pattern.length;
            dfa = new int[R][M]; 
            dfa[pattern[0]][0] = 1; 
            for (int X = 0, j = 1; j < M; j++) {
                for (int c = 0; c < R; c++) 
                    dfa[c][j] = dfa[c][X];     // Copy mismatch cases. 
                dfa[pattern[j]][j] = j+1;      // Set match case. 
                X = dfa[pattern[j]][X];        // Update restart state. 
            } 
        } 
    
        /**
         * Returns the index of the first occurrence of the pattern string
         * in the text string.
         *
         * @param  txt the text string
         * @return the index of the first occurrence of the pattern string
         *         in the text string; N if no such match
         */
        public int search(String txt) {
    
            // simulate operation of DFA on text
            int M = pat.length();
            int N = txt.length();
            int i, j;
            for (i = 0, j = 0; i < N && j < M; i++) {
                j = dfa[txt.charAt(i)][j];
            }
            if (j == M) return i - M;    // found
            return N;                    // not found
        }
    
        /**
         * Returns the index of the first occurrence of the pattern string
         * in the text string.
         *
         * @param  text the text string
         * @return the index of the first occurrence of the pattern string
         *         in the text string; N if no such match
         */
        public int search(char[] text) {
    
            // simulate operation of DFA on text
            int M = pattern.length;
            int N = text.length;
            int i, j;
            for (i = 0, j = 0; i < N && j < M; i++) {
                j = dfa[text[i]][j];
            }
            if (j == M) return i - M;    // found
            return N;                    // not found
        }
    
    
        /** 
         * Takes a pattern string and an input string as command-line arguments;
         * searches for the pattern string in the text string; and prints
         * the first occurrence of the pattern string in the text string.
         */
        public static void main(String[] args) {
            String pat = args[0];
            String txt = args[1];
            char[] pattern = pat.toCharArray();
            char[] text    = txt.toCharArray();
    
            KMP kmp1 = new KMP(pat);
            int offset1 = kmp1.search(txt);
    
            KMP kmp2 = new KMP(pattern, 256);
            int offset2 = kmp2.search(text);
    
            // print results
            StdOut.println("text:    " + txt);
    
            StdOut.print("pattern: ");
            for (int i = 0; i < offset1; i++)
                StdOut.print(" ");
            StdOut.println(pat);
    
            StdOut.print("pattern: ");
            for (int i = 0; i < offset2; i++)
                StdOut.print(" ");
            StdOut.println(pat);
        }
    }
    
    BoyerMoore.java
    
    
    /******************************************************************************
     *  Compilation:  javac BoyerMoore.java
     *  Execution:    java BoyerMoore pattern text
     *  Dependencies: StdOut.java
     *
     *  Reads in two strings, the pattern and the input text, and
     *  searches for the pattern in the input text using the
     *  bad-character rule part of the Boyer-Moore algorithm.
     *  (does not implement the strong good suffix rule)
     *
     *  % java BoyerMoore abracadabra abacadabrabracabracadabrabrabracad
     *  text:    abacadabrabracabracadabrabrabracad 
     *  pattern:               abracadabra
     *
     *  % java BoyerMoore rab abacadabrabracabracadabrabrabracad
     *  text:    abacadabrabracabracadabrabrabracad 
     *  pattern:         rab
     *
     *  % java BoyerMoore bcara abacadabrabracabracadabrabrabracad
     *  text:    abacadabrabracabracadabrabrabracad 
     *  pattern:                                   bcara
     *
     *  % java BoyerMoore rabrabracad abacadabrabracabracadabrabrabracad
     *  text:    abacadabrabracabracadabrabrabracad
     *  pattern:                        rabrabracad
     *
     *  % java BoyerMoore abacad abacadabrabracabracadabrabrabracad
     *  text:    abacadabrabracabracadabrabrabracad
     *  pattern: abacad
     *
     ******************************************************************************/
    
    /**
     *  The <tt>BoyerMoore</tt> class finds the first occurrence of a pattern string
     *  in a text string.
     *  <p>
     *  This implementation uses the Boyer-Moore algorithm (with the bad-character
     *  rule, but not the strong good suffix rule).
     *  <p>
     *  For additional documentation,
     *  see <a href="http://algs4.cs.princeton.edu/53substring">Section 5.3</a> of
     *  <i>Algorithms, 4th Edition</i> by Robert Sedgewick and Kevin Wayne.
     */
    public class BoyerMoore {
        private final int R;     // the radix
        private int[] right;     // the bad-character skip array
    
        private char[] pattern;  // store the pattern as a character array
        private String pat;      // or as a string
    
        /**
         * Preprocesses the pattern string.
         *
         * @param pat the pattern string
         */
        public BoyerMoore(String pat) {
            this.R = 256;
            this.pat = pat;
    
            // position of rightmost occurrence of c in the pattern
            right = new int[R];
            for (int c = 0; c < R; c++)
                right[c] = -1;
            for (int j = 0; j < pat.length(); j++)
                right[pat.charAt(j)] = j;
        }
    
        /**
         * Preprocesses the pattern string.
         *
         * @param pattern the pattern string
         * @param R the alphabet size
         */
        public BoyerMoore(char[] pattern, int R) {
            this.R = R;
            this.pattern = new char[pattern.length];
            for (int j = 0; j < pattern.length; j++)
                this.pattern[j] = pattern[j];
    
            // position of rightmost occurrence of c in the pattern
            right = new int[R];
            for (int c = 0; c < R; c++)
                right[c] = -1;
            for (int j = 0; j < pattern.length; j++)
                right[pattern[j]] = j;
        }
    
        /**
         * Returns the index of the first occurrence of the pattern string
         * in the text string.
         *
         * @param  txt the text string
         * @return the index of the first occurrence of the pattern string
         *         in the text string; N if no such match
         */
        public int search(String txt) {
            int M = pat.length();
            int N = txt.length();
            int skip;
            for (int i = 0; i <= N - M; i += skip) {
                skip = 0;
                for (int j = M-1; j >= 0; j--) {
                    if (pat.charAt(j) != txt.charAt(i+j)) {
                        skip = Math.max(1, j - right[txt.charAt(i+j)]);
                        break;
                    }
                }
                if (skip == 0) return i;    // found
            }
            return N;                       // not found
        }
    
    
        /**
         * Returns the index of the first occurrence of the pattern string
         * in the text string.
         *
         * @param  text the text string
         * @return the index of the first occurrence of the pattern string
         *         in the text string; N if no such match
         */
        public int search(char[] text) {
            int M = pattern.length;
            int N = text.length;
            int skip;
            for (int i = 0; i <= N - M; i += skip) {
                skip = 0;
                for (int j = M-1; j >= 0; j--) {
                    if (pattern[j] != text[i+j]) {
                        skip = Math.max(1, j - right[text[i+j]]);
                        break;
                    }
                }
                if (skip == 0) return i;    // found
            }
            return N;                       // not found
        }
    
    
        /**
         * Takes a pattern string and an input string as command-line arguments;
         * searches for the pattern string in the text string; and prints
         * the first occurrence of the pattern string in the text string.
         */
        public static void main(String[] args) {
            String pat = args[0];
            String txt = args[1];
            char[] pattern = pat.toCharArray();
            char[] text    = txt.toCharArray();
    
            BoyerMoore boyermoore1 = new BoyerMoore(pat);
            BoyerMoore boyermoore2 = new BoyerMoore(pattern, 256);
            int offset1 = boyermoore1.search(txt);
            int offset2 = boyermoore2.search(text);
    
            // print results
            StdOut.println("text:    " + txt);
    
            StdOut.print("pattern: ");
            for (int i = 0; i < offset1; i++)
                StdOut.print(" ");
            StdOut.println(pat);
    
            StdOut.print("pattern: ");
            for (int i = 0; i < offset2; i++)
                StdOut.print(" ");
            StdOut.println(pat);
        }
    }
    
    
    Copyright © 2002–2015, Robert Sedgewick and Kevin Wayne.
    Last updated: Sat Aug 29 11:16:30 EDT 2015.
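
The KMP and BoyerMoore classes above return only the index of the first match, and their lookup tables are sized for a 256-character alphabet, so a character with a code of 256 or more would index outside them. To get a count per phrase, as the question asks, every match has to be found. The following is a minimal sketch of that using the failure-table form of Knuth-Morris-Pratt rather than the DFA form above; the class KmpCounter and its methods are hypothetical names, not part of the algs4 code.

```java
// Hypothetical helper, for illustration only.
class KmpCounter {

    /** fail[i] = length of the longest proper prefix of pat[0..i] that is also a suffix of it. */
    static int[] failureTable(String pat) {
        int[] fail = new int[pat.length()];
        int k = 0;
        for (int i = 1; i < pat.length(); i++) {
            while (k > 0 && pat.charAt(i) != pat.charAt(k)) {
                k = fail[k - 1];
            }
            if (pat.charAt(i) == pat.charAt(k)) {
                k++;
            }
            fail[i] = k;
        }
        return fail;
    }

    /** Counts all occurrences of pat in text, overlapping ones included. */
    static int countOccurrences(String text, String pat) {
        if (pat.isEmpty()) {
            return 0;                       // an empty pattern is treated as never matching
        }
        int[] fail = failureTable(pat);
        int count = 0;
        int k = 0;                          // number of pattern characters matched so far
        for (int i = 0; i < text.length(); i++) {
            while (k > 0 && text.charAt(i) != pat.charAt(k)) {
                k = fail[k - 1];
            }
            if (text.charAt(i) == pat.charAt(k)) {
                k++;
            }
            if (k == pat.length()) {
                count++;
                k = fail[k - 1];            // fall back so that overlapping matches are found
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "a placement today, a top job opportunity, apply for placement now";
        System.out.println("placement " + countOccurrences(text, "placement"));
        System.out.println("job opportunity " + countOccurrences(text, "job opportunity"));
        // prints:
        // placement 2
        // job opportunity 1
    }
}
```

Like the indexOf loop, this scans the text once per phrase and does not check for word breaks.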
    