Java: a fast way to detect n-grams in a string?

java nlp
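I found this solution for detecting n-grams in a string:
import java.util.*;

public class Test {
    // The loop bounds, the concat helper's signature and the body of main were lost
    // in extraction and are reconstructed here; the sample text is a placeholder.
    public static List<String> ngrams(int n, String str) {
        List<String> ngrams = new ArrayList<String>();
        String[] words = str.split(" ");
        for (int i = 0; i < words.length - n + 1; i++)
            ngrams.add(concat(words, i, i + n));
        return ngrams;
    }
    public static String concat(String[] words, int start, int end) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < end; i++)
            sb.append((i > start ? " " : "") + words[i]);
        return sb.toString();
    }
    public static void main(String[] args) {
        for (int n = 1; n <= 4; n++)
            System.out.println(ngrams(n, "Your text goes here"));
    }
}
This bit of code takes by far the longest processing time: 28 seconds to detect 1-grams, 2-grams, 3-grams and 4-grams in my corpus (4 MB of raw text), compared with milliseconds for the other operations (stopword removal etc.).
Does anyone know of a solution in Java that is faster than the looping solution above? (I was thinking of multithreading, use of collections, or maybe a creative way to split the string...) Thanks!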
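Running about 5 MB of Lorem Ipsum text through the code you provided generally takes a little more than 7 seconds to detect 1- to 4-grams. I reworked the code to build a list of the longest n-grams first and then iterate over that list to produce lists of successively shorter n-grams. In testing, the same text took about 2.6 seconds. It also uses far less memory.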
import java.util.*;

public class Test {

    public static List<String> ngrams(int max, String val) {
        List<String> out = new ArrayList<String>(1000);
        String[] words = val.split(" ");
        for (int i = 0; i < words.length - max + 1; i++) {
            out.add(makeString(words, i,  max));
        }
        return out;
    }

    public static String makeString(String[] words, int start, int length) {
        StringBuilder tmp= new StringBuilder(100);
        for (int i = start; i < start + length; i++) {
            tmp.append(words[i]).append(" ");
        }
        return tmp.substring(0, tmp.length() - 1);
    }

    public static List<String> reduceNgrams(List<String> in, int size) {
        if (1 < size) {
            List<String> working = reduceByOne(in);
            in.addAll(working);
            for (int i = size -2 ; i > 0; i--) {
                working = reduceByOne(working);
                in.addAll(working);
            }
        }
        return in;
    }

    public static List<String> reduceByOne(List<String> in) {
        List<String> out = new ArrayList<String>(in.size() + 1);
        if (in.isEmpty()) {
            return out; // the text had fewer words than the requested n-gram size
        }
        int end;
        for (String s : in) {
            end = s.lastIndexOf(" ");
            out.add(s.substring(0, -1 == end ? s.length() : end));
        }
        // Dropping the last word yields every shorter n-gram except the final one,
        // so the last entry is reduced a second time by dropping its first word.
        String s = in.get(in.size() - 1);
        out.add(s.substring(s.indexOf(" ") + 1));
        return out;
    }

    public static void main(String[] args) {
        long start;
        start = System.currentTimeMillis();
        // Both 3s are the largest n-gram size; replace them with the depth you need.
        List<String> ngrams = ngrams(3, "Your text goes here, actual mileage may vary");
        reduceNgrams(ngrams, 3);
        System.out.println(System.currentTimeMillis() - start);
    }
}
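
The idea in this answer (build the longest n-grams once, then derive each shorter size from the previous one) can also be written so that the sizes stay separate instead of being appended to the input list. The following is only a sketch under that assumption: the class name `NgramsBySize` and the methods `allNgrams` and `dropLastWord` are hypothetical and not part of the answer, and words are assumed to be separated by single spaces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical variant: same derivation as above, but grouped by n-gram size.
final class NgramsBySize {

    // Returns size -> n-grams for every size from max down to 1.
    // If the text has fewer than max words, every list is empty.
    public static Map<Integer, List<String>> allNgrams(String text, int max) {
        Map<Integer, List<String>> bySize = new LinkedHashMap<>();
        String[] words = text.split(" ");
        List<String> current = new ArrayList<>();
        for (int i = 0; i + max <= words.length; i++) {
            current.add(String.join(" ", Arrays.copyOfRange(words, i, i + max)));
        }
        bySize.put(max, current);
        for (int size = max - 1; size >= 1; size--) {
            current = dropLastWord(current);
            bySize.put(size, current);
        }
        return bySize;
    }

    // Turns a list of k-grams (k >= 2) into the list of (k-1)-grams.
    static List<String> dropLastWord(List<String> grams) {
        List<String> out = new ArrayList<>(grams.size() + 1);
        if (grams.isEmpty()) {
            return out;
        }
        for (String g : grams) {
            out.add(g.substring(0, g.lastIndexOf(' ')));
        }
        // Dropping the last word misses the final (k-1)-gram, so recover it
        // by dropping the first word of the last k-gram instead.
        String last = grams.get(grams.size() - 1);
        out.add(last.substring(last.indexOf(' ') + 1));
        return out;
    }

    public static void main(String[] args) {
        // prints {3=[a b c, b c d], 2=[a b, b c, c d], 1=[a, b, c, d]}
        System.out.println(allNgrams("a b c d", 3));
    }
}
```

Returning a map avoids the question of whether the result lives in the input or the output list, at the cost of one extra collection.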

You could try something like this:
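import java.util.ArrayList;
import java.util.List;

public class NGram {

    private final int n;
    private final String text;

    // Start offsets of the words in the current window; indexes[0] is where the n-gram begins.
    private final int[] indexes;
    private int index = -1; // current scan position in text
    private int found = 0;  // number of delimiters seen so far

    public NGram(String text, int n) {
        this.text = text;
        this.n = n;
        indexes = new int[n];
    }

    // Advances to the next n-gram; returns false once the text is exhausted.
    private boolean seek() {
        if (index >= text.length()) {
            return false;
        }
        push();
        while (++index < text.length()) {
            if (text.charAt(index) == ' ') {
                found++;
                if (found < n) {
                    push(); // still filling the first window
                } else {
                    return true;
                }
            }
        }
        // End of text closes the last word; report an n-gram only if n words were seen.
        return found >= n - 1;
    }

    // Shifts the window left by one word and records where the next word starts.
    private void push() {
        for (int i = 0; i < n - 1; i++) {
            indexes[i] = indexes[i + 1];
        }
        indexes[n - 1] = index + 1;
    }

    // Collects every n-gram in the text.
    public List<String> list() {
        List<String> ngrams = new ArrayList<String>();
        while (seek()) {
            ngrams.add(get());
        }
        return ngrams;
    }

    // The current n-gram: from the first word of the window up to the current delimiter.
    private String get() {
        return text.substring(indexes[0], index);
    }
}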
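
The class above scans for delimiters and cuts each n-gram out of the original text with `substring`, rather than splitting into words and joining them again. A simplified illustration of that idea follows. It is a sketch, not the answer's implementation: it stores every word offset up front instead of keeping a sliding window of n offsets, the names `OffsetNgrams` and `splitNgrams` are hypothetical, and it assumes `n >= 1` and single-space delimiters with no leading or trailing spaces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper illustrating the offset idea.
final class OffsetNgrams {

    // Returns all n-grams of text, cutting them directly out of the original string.
    public static List<String> ngrams(String text, int n) {
        int words = 1;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == ' ') words++;
        }
        // starts[w] is the offset where word w begins (starts[0] is 0).
        int[] starts = new int[words + 1];
        int next = 1;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == ' ') starts[next++] = i + 1;
        }
        // Sentinel: where a word after a virtual trailing space would start.
        starts[words] = text.length() + 1;

        List<String> out = new ArrayList<>(Math.max(0, words - n + 1));
        for (int w = 0; w + n <= words; w++) {
            // From the start of word w to just before the delimiter preceding word w + n.
            out.add(text.substring(starts[w], starts[w + n] - 1));
        }
        return out;
    }

    // Split-and-join reference, used only to check the results.
    static List<String> splitNgrams(String text, int n) {
        String[] words = text.split(" ");
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= words.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "the quick brown fox";
        for (int n = 1; n <= 4; n++) {
            List<String> grams = ngrams(text, n);
            System.out.println(grams.equals(splitNgrams(text, n)) + " " + grams);
        }
    }
}
```

Each n-gram costs one `substring` call here, whereas the split-and-join version allocates every word plus a builder per n-gram; that difference is the likely source of the garbage-collection pressure discussed in the comments.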

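Have you tried profiling to see what takes the time? It looks like a lot of unneeded objects are created. Is it the split that takes time, or creating the n-gram objects, or inserting them into the list? Instead of first splitting into separate strings and then recombining them, I would scan for delimiters and just remember the indexes. So for 3-grams you keep track of delimiters n, n-1, n-2 and n-3. The 3-gram starts at n-3 and ends at n. Then move n forward (the n-3 slot now holds what was n-2, and so on).
Thanks @RogerLindsjö, this looks promising! I tried it with a Scanner, but I'm not sure I understood your approach correctly. If I keep track of the last 3 delimiters, how do I retrieve the corresponding 3 words when I reach n (after n-3, n-2, n-1)? AFAICS there is no Scanner method to get previous values (something like "scanner.previous()", if you will!). What am I missing?
Thanks again! Got it! I created a loop for (n=1; …
The last thing I tested was 3-grams. I should have changed it back to n or commented on the value.
Can you give me an example that produces different results? I ran the original code and my code on several different samples and got the same results. By the way, the order of the results differs between the two programs.
Karakuri, I haven't tested your code yet; my comment above about differing output was addressed to Roger. I'd be glad if you could post your version with n instead of 3! Also, I'm confused: is the list returned by your class "in" or "out"? It looks like "in" to me, but that would be a very counterintuitive name for an output.
When I inspect the generated lists, they are identical. Can you show some text for which they differ? When running with the text (all lines joined into one), I get almost 570,000 n-grams. The timings vary a bit (lots of GC), but my implementation takes about 100 ms versus 500 ms for yours. A lot of the time is spent in GC (the loop creates and discards a large number of strings).
Thanks karakuri! I don't quite understand your code. Why is "3" the argument to reduceNgrams if you are looking for 4-grams?
Thanks! The 3 is left over from testing. Replace it with the depth you need. I should have changed it to a variable or provided a comment.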

Loop 01 Code mine ngram 1 time 071ms ngrams 294121
Loop 01 Code orig ngram 1 time 534ms ngrams 294121
Loop 01 Code mine ngram 2 time 016ms ngrams 294120
Loop 01 Code orig ngram 2 time 360ms ngrams 294120
Loop 01 Code mine ngram 3 time 082ms ngrams 294119
Loop 01 Code orig ngram 3 time 319ms ngrams 294119
Loop 01 Code mine ngram 4 time 014ms ngrams 294118
Loop 01 Code orig ngram 4 time 439ms ngrams 294118

Loop 10 Code mine ngram 1 time 013ms ngrams 294121
Loop 10 Code orig ngram 1 time 268ms ngrams 294121
Loop 10 Code mine ngram 2 time 014ms ngrams 294120
Loop 10 Code orig ngram 2 time 323ms ngrams 294120
Loop 10 Code mine ngram 3 time 013ms ngrams 294119
Loop 10 Code orig ngram 3 time 412ms ngrams 294119
Loop 10 Code mine ngram 4 time 014ms ngrams 294118
Loop 10 Code orig ngram 4 time 423ms ngrams 294118
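
The harness that produced these numbers is not shown. Timings that are slower in loop 01 than in loop 10 are the pattern JVM warm-up (JIT compilation) typically produces, which is one reason to repeat a measurement several times. The following is a minimal sketch of such a repeated measurement, not a reconstruction of the original harness: the class `NgramTiming`, the helper `timeMillis`, the synthetic corpus and the output format are all assumptions, and the measured function is a plain split-and-join baseline.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

// Hypothetical timing harness; every name here is illustrative.
final class NgramTiming {

    // Simple split-and-join baseline, used here only as something to measure.
    static List<String> splitNgrams(String text, int n) {
        String[] words = text.split(" ");
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= words.length; i++) {
            StringBuilder sb = new StringBuilder(words[i]);
            for (int j = i + 1; j < i + n; j++) {
                sb.append(' ').append(words[j]);
            }
            out.add(sb.toString());
        }
        return out;
    }

    // Runs impl once and returns the elapsed milliseconds;
    // count[0] receives the number of n-grams produced.
    static long timeMillis(BiFunction<String, Integer, List<String>> impl,
                           String text, int n, int[] count) {
        long start = System.nanoTime();
        count[0] = impl.apply(text, n).size();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Synthetic corpus of 200,000 placeholder words.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200_000; i++) {
            if (i > 0) sb.append(' ');
            sb.append("word").append(i % 100);
        }
        String text = sb.toString();

        int[] count = new int[1];
        for (int loop = 1; loop <= 10; loop++) { // later loops tend to run JIT-compiled code
            for (int n = 1; n <= 4; n++) {
                long ms = timeMillis(NgramTiming::splitNgrams, text, n, count);
                System.out.printf("loop %d n=%d %d ms, %d n-grams%n", loop, n, ms, count[0]);
            }
        }
    }
}
```

For measurements that need to be trusted, a dedicated benchmarking tool is the safer choice; a hand-written loop like this only gives a rough comparison.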