Java: Unable to identify Spanish text with LingPipe

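A few days ago I was developing a Java server that stores a set of data and identifies its language, so I decided to use LingPipe for the task. However, I ran into a problem: after training the classifier and evaluating it with two languages (English and Spanish), I cannot identify Spanish text, although I got successful results with English and French.

The tutorial I followed to accomplish this task is:

Below are the steps I followed to train the language classifier:

~1. First, place the English and Spanish metadata in a folder named leipzig so it can be unpacked as shown below (note: the metadata and sentences come from):

~2. Second, unpack the compressed language metadata into the unpacked folder: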

unpacked                                    //Folder
    eng_news_2015_300K          //Folder with the english metadata 
        eng_news_2015_300K-co_n.txt
        eng_news_2015_300K-co_s.txt
        eng_news_2015_300K-import.sql
        eng_news_2015_300K-inv_so.txt
        eng_news_2015_300K-inv_w.txt
        eng_news_2015_300K-sources.txt
        eng_news_2015_300K-words.txt
        sentences.txt
    spa-hn_web_2015_300K                    //Folder with the spanish metadata 
        sentences.txt
        spa-hn_web_2015_300K-co_n.txt
        spa-hn_web_2015_300K-co_s.txt
        spa-hn_web_2015_300K-import.sql
        spa-hn_web_2015_300K-inv_so.txt
        spa-hn_web_2015_300K-inv_w.txt
        spa-hn_web_2015_300K-sources.txt
        spa-hn_web_2015_300K-words.txt
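~3. Then munge each sentence: strip the line numbers and tabs, and replace line breaks with a single space character. The output is written uniformly in UTF-8 (note: Munge.java from the LingPipe site).

~5. We evaluated the trained classifier and got the results below, which show a problem in the confusion matrix (note: EvalLanguageId.java from the LingPipe LanguageId tutorial).

~7. I also tried a 1M metadata file and got the same result, and changing the n-gram number gave the same result as well.
I would greatly appreciate your help.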

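After working in natural language processing for a few days, I found a way to determine the language of a text using OpenNLP. Here is the sample code:

Here is the training corpus used to create the model for language prediction:

I decided to use OpenNLP to solve the problem described in this question; the library actually offers a complete stack of features. Here is an example of training the model: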


I would try pulling out the Spanish classifier and making sure it is really getting trained. If it is still at baseline because it has never seen any data, then English looks more like Spanish than the baseline does.
/-----------------Command line----------------------------------------------/

javac -cp lingpipe-4.1.2.jar: Munge.java
java -cp lingpipe-4.1.2.jar: Munge /home/samuel/leipzig/unpacked /home/samuel/leipzig/munged
----------------------------------------Results-----------------------------
spa
reading from=/home/samuel/leipzig/unpacked/spa-hn_web_2015_300K/sentences.txt charset=iso-8859-1
writing to=/home/samuel/leipzig/munged/spa/spa.txt charset=utf-8
total length=43267166

eng
reading from=/home/samuel/leipzig/unpacked/eng_news_2015_300K/sentences.txt charset=iso-8859-1
writing to=/home/samuel/leipzig/munged/eng/eng.txt charset=utf-8
total length=35847257

/---------------------------------------------------------------/
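The munging step described in the question (strip the leading line number and tab from each line, then join the sentences with single spaces) can be sketched roughly as follows. This is not the tutorial's Munge.java: the class name `MungeSketch` and the method `munge` are hypothetical, and the sketch assumes each input line has the form `<number><TAB><sentence>`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MungeSketch {

    // Hypothetical helper, not part of LingPipe: removes a leading
    // "<digits><TAB>" prefix from each line and joins the remaining
    // sentences with a single space.
    public static String munge(List<String> lines) {
        List<String> sentences = new ArrayList<>();
        for (String line : lines) {
            // Drop the line number and the tab that follows it, if present.
            String sentence = line.replaceFirst("^\\d+\\t", "");
            // Collapse remaining tabs and runs of whitespace into one space.
            sentence = sentence.replaceAll("\\s+", " ").trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return String.join(" ", sentences);
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("1\tHola mundo.", "2\tHello\tworld.");
        System.out.println(munge(sample));
    }
}
```

When reading the real corpus files, the charset given to the reader has to match how the files are actually encoded. The log in the question shows iso-8859-1 being used for input, which may be worth double-checking against the corpus.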

<---------------------------------Folder------------------------------------->
munged                  //Folder
    eng                 //Folder containing the sentences.txt for English
        sentences.txt
    spa                 //Folder containing the sentences.txt for Spanish
        sentences.txt
<-------------------------------------------------------------------------->
/---------------Command line--------------------------------------------/

javac -cp lingpipe-4.1.2.jar: TrainLanguageId.java
java -cp lingpipe-4.1.2.jar: TrainLanguageId /home/samuel/leipzig/munged /home/samuel/leipzig/langid-leipzig.classifier 100000 5
-----------------------------------Results-----------------------------------
nGram=100000 numChars=5
Training category=eng
Training category=spa

Compiling model to file=/home/samuel/leipzig/langid-leipzig.classifier

/----------------------------------------------------------------------------/
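The training log prints `nGram=100000 numChars=5`. If those labels mean what they appear to mean (n-gram length and number of training characters per language), each model was trained on only five characters, which would be consistent with the degenerate evaluation reported in this question. That reading is an assumption based on the log labels alone. A small guard like the sketch below would flag such a combination before training; `TrainArgsCheck` and `validate` are hypothetical names, not part of LingPipe.

```java
public class TrainArgsCheck {

    // Returns null when the pair looks plausible, otherwise a short warning.
    // The rule "numChars should not be smaller than nGram" is a heuristic
    // chosen for this sketch, not a LingPipe requirement.
    public static String validate(int nGram, int numChars) {
        if (nGram < 1 || numChars < 1) {
            return "both values must be positive";
        }
        if (numChars < nGram) {
            return "numChars (" + numChars + ") is smaller than nGram (" + nGram
                + "); the arguments may be in the wrong order";
        }
        return null;
    }

    public static void main(String[] args) {
        // Values as printed in the training log of the question.
        System.out.println(validate(100000, 5));
        // The same two numbers in the opposite order.
        System.out.println(validate(5, 100000));
    }
}
```

If the check fires for the values actually used, rerunning the training with the two numeric arguments exchanged would be a cheap experiment.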
/------------------------Command line---------------------------------/

javac -cp lingpipe-4.1.2.jar: EvalLanguageId.java
java -cp lingpipe-4.1.2.jar: EvalLanguageId /home/samuel/leipzig/munged /home/samuel/leipzig/langid-leipzig.classifier 100000 50 1000
-------------------------------Results-------------------------------------

Reading classifier from file=/home/samuel/leipzig/langid-leipzig.classifier
Evaluating category=eng
Evaluating category=spa
TEST RESULTS
BASE CLASSIFIER EVALUATION
Categories=[eng, spa]
Total Count=2000
Total Correct=1000
Total Accuracy=0.5
95% Confidence Interval=0.5 +/- 0.02191346617949794
Confusion Matrix
reference \ response
  ,eng,spa
  eng,1000,0                                <---------- not diagonal sampling
  spa,1000,0
Macro-averaged Precision=NaN
Macro-averaged Recall=0.5
Macro-averaged F=NaN
Micro-averaged Results
         the following symmetries are expected:
           TP=TN, FN=FP
           PosRef=PosResp=NegRef=NegResp
           Acc=Prec=Rec=F
  Total=4000
  True Positive=1000
  False Negative=1000
  False Positive=1000
  True Negative=1000
  Positive Reference=2000
  Positive Response=2000
  Negative Reference=2000
  Negative Response=2000
  Accuracy=0.5
  Recall=0.5
  Precision=0.5
  Rejection Recall=0.5
  Rejection Precision=0.5
  F(1)=0.5
  Fowlkes-Mallows=2000.0
  Jaccard Coefficient=0.3333333333333333
  Yule's Q=0.0
  Yule's Y=0.0
  Reference Likelihood=0.5
  Response Likelihood=0.5
  Random Accuracy=0.5
  Random Accuracy Unbiased=0.5
  kappa=0.0
  kappa Unbiased=0.0
  kappa No Prevalence=0.0
  chi Squared=0.0
  phi Squared=0.0
  Accuracy Deviation=0.007905694150420948
Random Accuracy=0.5
Random Accuracy Unbiased=0.625
kappa=0.0
kappa Unbiased=-0.3333333333333333
kappa No Prevalence =0.0
Reference Entropy=1.0
Response Entropy=NaN
Cross Entropy=Infinity
Joint Entropy=1.0
Conditional Entropy=0.0
Mutual Information=0.0
Kullback-Liebler Divergence=Infinity
chi Squared=NaN
chi-Squared Degrees of Freedom=1
phi Squared=NaN
Cramer's V=NaN
lambda A=0.0
lambda B=NaN

ONE VERSUS ALL EVALUATIONS BY CATEGORY


CATEGORY[0]=eng VERSUS ALL

First-Best Precision/Recall Evaluation
  Total=2000
  True Positive=1000
  False Negative=0
  False Positive=1000
  True Negative=0
  Positive Reference=1000
  Positive Response=2000
  Negative Reference=1000
  Negative Response=0
  Accuracy=0.5
  Recall=1.0
  Precision=0.5
  Rejection Recall=0.0
  Rejection Precision=NaN
  F(1)=0.6666666666666666
  Fowlkes-Mallows=1414.2135623730949
  Jaccard Coefficient=0.5
  Yule's Q=NaN
  Yule's Y=NaN
  Reference Likelihood=0.5
  Response Likelihood=1.0
  Random Accuracy=0.5
  Random Accuracy Unbiased=0.625
  kappa=0.0
  kappa Unbiased=-0.3333333333333333
  kappa No Prevalence=0.0
  chi Squared=NaN
  phi Squared=NaN
  Accuracy Deviation=0.011180339887498949


CATEGORY[1]=spa VERSUS ALL

First-Best Precision/Recall Evaluation
  Total=2000
  True Positive=0
  False Negative=1000
  False Positive=0
  True Negative=1000
  Positive Reference=1000
  Positive Response=0
  Negative Reference=1000
  Negative Response=2000
  Accuracy=0.5
  Recall=0.0
  Precision=NaN
  Rejection Recall=1.0
  Rejection Precision=0.5
  F(1)=NaN
  Fowlkes-Mallows=NaN
  Jaccard Coefficient=0.0
  Yule's Q=NaN
  Yule's Y=NaN
  Reference Likelihood=0.5
  Response Likelihood=0.0
  Random Accuracy=0.5
  Random Accuracy Unbiased=0.625
  kappa=0.0
  kappa Unbiased=-0.3333333333333333
  kappa No Prevalence=0.0
  chi Squared=NaN
  phi Squared=NaN
  Accuracy Deviation=0.011180339887498949

/-----------------------------------------------------------------------/
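To see where the NaN values in the report come from, the per-category figures can be recomputed by hand from the confusion matrix. The sketch below is plain Java rather than LingPipe's evaluator; `ConfusionSketch` and its methods are hypothetical names. Rows are reference categories and columns are responses, as in the report.

```java
public class ConfusionSketch {

    // Fraction of all cases that lie on the diagonal.
    public static double accuracy(int[][] m) {
        double correct = 0;
        double total = 0;
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[i].length; j++) {
                total += m[i][j];
                if (i == j) {
                    correct += m[i][j];
                }
            }
        }
        return correct / total;
    }

    // Correct responses for category c divided by all reference cases of c (row sum).
    public static double recall(int[][] m, int c) {
        double rowSum = 0;
        for (int j = 0; j < m[c].length; j++) {
            rowSum += m[c][j];
        }
        return m[c][c] / rowSum;
    }

    // Correct responses for category c divided by all responses of c (column sum).
    public static double precision(int[][] m, int c) {
        double colSum = 0;
        for (int i = 0; i < m.length; i++) {
            colSum += m[i][c];
        }
        // With no responses for c this is 0 / 0.0, which is NaN in Java.
        return m[c][c] / colSum;
    }

    public static void main(String[] args) {
        // Matrix copied from the report: index 0 = eng, index 1 = spa.
        int[][] m = { {1000, 0}, {1000, 0} };
        System.out.println("accuracy=" + accuracy(m));
        System.out.println("eng precision=" + precision(m, 0) + " recall=" + recall(m, 0));
        System.out.println("spa precision=" + precision(m, 1) + " recall=" + recall(m, 1));
    }
}
```

The spa column is all zeros, meaning the classifier never answered spa. Precision for spa is therefore 0 divided by 0, and every metric derived from it is NaN as well.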
/-------------------Command line----------------------------------/

javac -cp lingpipe-4.1.2.jar: ClassifyLang.java
java -cp lingpipe-4.1.2.jar: ClassifyLang

/-------------------------------------------------------------------------/

<---------------------------------Result------------------------------------>
Text:   Yo soy una persona increíble y muy inteligente, me admiro a mi mismo lo que me hace sentir ansiedad de lo que viene, por que es algo grandioso lleno de cosas buenas y de ahora en adelante estaré enfocado y optimista aunque tengo que aclarar que no lo haré por querer algo, sino por que es mi pasión. 
Best    Language:   eng     <------------- Wrong Result

<----------------------------------------------------------------------->
import com.aliasi.classify.Classification;
import com.aliasi.classify.LMClassifier;

import com.aliasi.util.AbstractExternalizable;

import java.io.File;
import java.io.IOException;

public class ClassifyLang {

    // Spanish sample text that the trained model is expected to label as "spa".
    public static final String TEXT = "Yo soy una persona increíble y muy inteligente, me admiro a mi mismo"
                + " estoy ansioso de lo que viene, por que es algo grandioso lleno de cosas buenas"
                + " y de ahora en adelante estaré enfocado y optimista"
                + " aunque tengo que aclarar que no lo haré por querer algo, sino por que no es difícil serlo.";

    // Compiled classifier file written by TrainLanguageId.
    private static final File MODEL_FILE
        = new File("/home/samuel/leipzig/langid-leipzig.classifier");

    public static void main(String[] args) {

        System.out.println("Text:   " + TEXT);

        LMClassifier<?, ?> classifier;
        try {
            // Deserialize the compiled model from disk.
            classifier = (LMClassifier<?, ?>) AbstractExternalizable.readObject(MODEL_FILE);
        } catch (IOException | ClassNotFoundException ex) {
            // Stop here: continuing with a null classifier would throw a NullPointerException.
            System.err.println("Problem with the Model: " + ex.getMessage());
            return;
        }

        // Classify the text and print the highest-scoring category.
        Classification classification = classifier.classify(TEXT);
        String bestCategory = classification.bestCategory();
        System.out.println("Best    Language:   " + bestCategory);
    }
}