Java Weka text classifier: how to train the classifier correctly
I am trying to build a text classifier with Weka, but the class distribution returned by distributionForInstance is 1.0 for one class and 0.0 for all the others, so classifyInstance always returns the same class as its prediction. Something in the training is not working correctly.
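The problem is that the classifier predicts the wrong class for almost every message in testing.arff, because the class probabilities are not correct. training_set_prova_tent.arff contains the same number of messages for each class.
The example I am following uses a featureWords.dat file and associates 1.0 with every word that occurs in a message. Instead, I want to build my own dictionary from the words that occur in the training set plus the words that occur in the test set, and associate each word with its number of occurrences.
P.S.
I know this is exactly what the StringToWordVector filter does, but I have not found any example that explains how to use this filter with two files: one for the training set and one for the test set. So it seemed easier to adapt the code I found.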
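Thank you very much.

It seems you changed the source code at some key points, but not in a good way. I will try to outline what you are trying to do and the mistakes I found.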
What you (probably) wanted to do in extractFeature is:
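- Split each tweet into words (tokenize).
- Count how many times each of these words occurs.
- Create a feature vector that represents these word counts plus the class.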
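The first two steps above can be sketched in plain Java, without Weka. This is only an illustration: the class name `TweetWordCounts` and the method `countWords` are hypothetical, and the tokenization (whitespace split, `#` removed) simply mirrors what the question's code does. The point it shows is that the count map is created inside the method, so every tweet starts from empty counts.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical helper, not part of Weka or of the original tutorial.
class TweetWordCounts {

    // Splits a tweet on whitespace, removes '#' characters and counts each word.
    // The map is local to the call, so counts never carry over between tweets.
    public static Map<String, Integer> countWords(String tweet) {
        Map<String, Integer> counts = new TreeMap<>();
        StringTokenizer tokenizer = new StringTokenizer(tweet);
        while (tokenizer.hasMoreTokens()) {
            String word = tokenizer.nextToken().replaceAll("#", "");
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {Bravo=2, Garcia=1}
        System.out.println(countWords("Bravo Garcia #Bravo"));
        // The second call is independent of the first one.
        System.out.println(countWords("Allegri insulta la terna arbitrale"));
    }
}
```

Turning these counts into a feature vector still needs a fixed list of attributes, which is the third step.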
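What you overlooked in this method is:
You never reset your featureMap, so the word counts of each tweet also include the counts of all previous tweets.
This is also reflected in the fact that you build a new attribute list with every new tweet, instead of building it only once in initialize, which is bad for reasons already explained.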
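A minimal sketch of the alternative, again in plain Java and under stated assumptions: the vocabulary (the future attribute list) is built once from the training tweets, and each tweet is then encoded into a fresh count array against that fixed vocabulary. The names `FixedVocabularyVectors`, `buildVocabulary` and `toCounts` are hypothetical, and ignoring words that are missing from the vocabulary is a design choice of this sketch, not something the original code does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;
import java.util.TreeSet;

// Hypothetical helper, not part of Weka or of the original tutorial.
class FixedVocabularyVectors {

    // Same tokenization as in the question: whitespace split, '#' removed.
    public static List<String> tokenize(String tweet) {
        List<String> words = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(tweet);
        while (tokenizer.hasMoreTokens()) {
            String word = tokenizer.nextToken().replaceAll("#", "");
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    // Built once, before any feature vector is created.
    public static List<String> buildVocabulary(List<String> trainingTweets) {
        TreeSet<String> vocabulary = new TreeSet<>();
        for (String tweet : trainingTweets) {
            vocabulary.addAll(tokenize(tweet));
        }
        return new ArrayList<>(vocabulary);
    }

    // A new array per tweet: counts start at zero every time.
    // Words that are not in the vocabulary are ignored.
    public static double[] toCounts(String tweet, List<String> vocabulary) {
        double[] counts = new double[vocabulary.size()];
        for (String word : tokenize(tweet)) {
            int index = vocabulary.indexOf(word);
            if (index >= 0) {
                counts[index] += 1.0;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> vocabulary = buildVocabulary(
                Arrays.asList("Renzi Berlusconi Salvini Bersani", "Bravo Garcia"));
        // prints [Berlusconi, Bersani, Bravo, Garcia, Renzi, Salvini]
        System.out.println(vocabulary);
        // prints [0.0, 0.0, 2.0, 0.0, 1.0, 0.0]
        System.out.println(Arrays.toString(toCounts("Bravo Bravo Renzi Totti", vocabulary)));
    }
}
```

Because training and test tweets are encoded against the same vocabulary, every vector has the same length and the same meaning per position, which is what the classifier needs at prediction time.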
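There may be more, but as it stands your code cannot be compiled. What you want is much closer to the tutorial source code you modified than to your version.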
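Also, you should look into the StringToWordVector filter, because it seems to do exactly what you want:
Converts string attributes into a set of attributes representing word occurrence information (depending on the tokenizer).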
ARFF training file:
@relation test1
@attribute tweetmsg String
@attribute classValues {politica,sport,musicatvcinema,infogeneriche,fattidelgiorno,statopersonale,checkin,conversazione}
@DATA
"Renzi Berlusconi Salvini Bersani",politica
"Allegri insulta la terna arbitrale",sport
"Bravo Garcia",sport
Training method:
public void trainClassifier(final String INPUT_FILENAME) throws Exception
{
    getTrainingDataset(INPUT_FILENAME);
    //trainingInstances consists of feature vector of every input
    for(Instance currentInstance : inputDataset)
    {
        Instance currentFeatureVector = extractFeature(currentInstance);
        currentFeatureVector.setDataset(trainingInstances);
        trainingInstances.add(currentFeatureVector);
    }
    classifier = new NaiveBayes();
    try {
        //classifier training code
        classifier.buildClassifier(trainingInstances);
        //storing the trained classifier to a file for future use
        weka.core.SerializationHelper.write("NaiveBayes.model",classifier);
    } catch (Exception ex) {
        System.out.println("Exception in training the classifier."+ex);
    }
}
private Instance extractFeature(Instance inputInstance) throws Exception
{
    String tweet = inputInstance.stringValue(0);
    StringTokenizer defaultTokenizer = new StringTokenizer(tweet);
    List<String> tokens=new ArrayList<String>();
    while (defaultTokenizer.hasMoreTokens())
    {
        String t= defaultTokenizer.nextToken();
        tokens.add(t);
    }
    Iterator<String> a = tokens.iterator();
    while(a.hasNext())
    {
        String token=(String) a.next();
        String word = token.replaceAll("#","");
        if(featureWords.contains(word))
        {
            double cont=featureMap.get(featureWords.indexOf(word))+1;
            featureMap.put(featureWords.indexOf(word),cont);
        }
        else{
            featureWords.add(word);
            featureMap.put(featureWords.indexOf(word), 1.0);
        }
    }
    attributeList.clear();
    for(String featureWord : featureWords)
    {
        attributeList.add(new Attribute(featureWord));
    }
    attributeList.add(new Attribute("Class", classValues));
    int indices[] = new int[featureMap.size()+1];
    double values[] = new double[featureMap.size()+1];
    int i=0;
    for(Map.Entry<Integer,Double> entry : featureMap.entrySet())
    {
        indices[i] = entry.getKey();
        values[i] = entry.getValue();
        i++;
    }
    indices[i] = featureWords.size();
    values[i] = (double)classValues.indexOf(inputInstance.stringValue(1));
    trainingInstances = createInstances("TRAINING_INSTANCES");
    return new SparseInstance(1.0,values,indices,1000000);
}
private void getTrainingDataset(final String INPUT_FILENAME)
{
    try{
        ArffLoader trainingLoader = new ArffLoader();
        trainingLoader.setSource(new File(INPUT_FILENAME));
        inputDataset = trainingLoader.getDataSet();
    }catch(IOException ex)
    {
        System.out.println("Exception in getTrainingDataset Method");
    }
    System.out.println("dataset "+inputDataset.numAttributes());
}
private Instances createInstances(final String INSTANCES_NAME)
{
    //create an Instances object with initial capacity as zero
    Instances instances = new Instances(INSTANCES_NAME,attributeList,0);
    //sets the class index as the last attribute
    instances.setClassIndex(instances.numAttributes()-1);
    return instances;
}
public static void main(String[] args) throws Exception
{
    Classificatore wekaTutorial = new Classificatore();
    wekaTutorial.trainClassifier("training_set_prova_tent.arff");
    wekaTutorial.testClassifier("testing.arff");
}
public Classificatore()
{
    attributeList = new ArrayList<Attribute>();
    initialize();
}
private void initialize()
{
    featureWords= new ArrayList<String>();
    featureMap = new TreeMap<>();
    classValues= new ArrayList<String>();
    classValues.add("politica");
    classValues.add("sport");
    classValues.add("musicatvcinema");
    classValues.add("infogeneriche");
    classValues.add("fattidelgiorno");
    classValues.add("statopersonale");
    classValues.add("checkin");
    classValues.add("conversazione");
}
public void testClassifier(final String INPUT_FILENAME) throws Exception
{
    getTrainingDataset(INPUT_FILENAME);
    //testingInstances consists of the feature vector of every input
    Instances testingInstances = createInstances("TESTING_INSTANCES");
    for(Instance currentInstance : inputDataset)
    {
        //extractFeature method returns the feature vector for the current input
        Instance currentFeatureVector = extractFeature(currentInstance);
        //Make the currentFeatureVector to be added to the testingInstances
        currentFeatureVector.setDataset(testingInstances);
        testingInstances.add(currentFeatureVector);
    }
    try {
        //Classifier deserialization
        classifier = (Classifier) weka.core.SerializationHelper.read("NaiveBayes.model");
        //classifier testing code
        for(Instance testInstance : testingInstances)
        {
            double score = classifier.classifyInstance(testInstance);
            double[] vv= classifier.distributionForInstance(testInstance);
            for(int k=0;k<vv.length;k++){
                //these are the class probabilities; as a result I get 1.0 for one class and 0.0 for all the others
                System.out.println("distribution "+vv[k]);
            }
            System.out.println(testingInstances.attribute("Class").value((int)score));
        }
    } catch (Exception ex) {
        System.out.println("Exception in testing the classifier."+ex);
    }
}
Map<Integer,Double> featureMap = new TreeMap<>();
indices[i] = featureWords.size();
values[i] = (double) classValues.indexOf(inputInstance.stringValue(1));