Weka StringToWordVector filter not working

machine-learning nlp

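I am practicing Weka with the Reuters data. I use the StringToWordVector filter to convert my string data (shown below) so that I can analyze the articles and see which words predict the article type. The original dataset marks the article type as TRUE/FALSE, but I converted it to 0/1. However, Weka refuses to process this ARFF file when I apply the StringToWordVector filter to the "review" strings.

I used the following StringToWordVector filter with only the review attribute checked: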

weka.filters.unsupervised.attribute.StringToWordVector -R first-last -W 1000 -prune-rate -1.0 -N 0 -stemmer weka.core.stemmers.NullStemmer -M 1 -tokenizer "weka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\""

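With only "review" checked for the filter, I get this error: "Problem filtering instances: Attribute names are not unique. Cause: sentiment"

For context, here is the header of my dataset along with a few instances: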

@relation text_files
@attribute review string
@attribute sentiment {0, 1}
@data
"cocoa the the cocoa the early the levels its the the this the ended the mln against at the that cocoa the to crop cocoa to crop around mln sales at mln the to this cocoa export the their cocoa prices to to per to offer sales at to dlrs per to to crop sales to at dlrs at dlrs at dlrs per sales at at at at to dlrs at at dlrs the currency sales at to dlrs dlrs dlrs the currency sales at at dlrs at at dlrs at at sales at mln against the crop mln against the the to to the cocoa commission reuter", 0
"prices reserve the agriculture department reported the reserve price loan call price price wheat corn 1986 loan call price price reserves grain wheat per reuter", 0
"grain crop their products to to wheat export the export wheat oil oil reuter", 0
"inc the stock corp its dlrs oil to dlrs production its the company to its to profit to reuter", 0
"products stock split products inc its stock split its common shares shareholders the company its to to shareholders at the the stock mln to mln reuter", 0

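Does anyone know why this happens? I thought it might be a conflict caused by the data itself containing 0 and 1, since they occur naturally as part of the text. I also wondered whether an extra space is needed before the opening quote of each string that follows the previous one.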

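Hi, the problem is that the filter converts every term in the strings into an attribute. Somewhere in your data section there must be a term "review" or "sentiment", so the generated attribute duplicates the name of an attribute that already exists.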

So rename the two attributes to "myreview" and "mysentiment", or to any names that are unlikely to appear in the data. It should then work.
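The collision described in this answer can be illustrated with a small standalone sketch. It does not use Weka and is not Weka's actual implementation; it only assumes a simplified model in which each distinct whitespace-separated token becomes a new attribute next to the attributes the filter keeps. The class name `AttributeNameCollision`, the method `duplicateNames`, and the sample sentence are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class AttributeNameCollision {

    // Returns the names that would occur twice if every distinct token in the
    // text became a new attribute alongside the attributes that are kept.
    // Simplified model for illustration only, not Weka's code.
    static List<String> duplicateNames(List<String> keptAttributes, String text) {
        Set<String> names = new LinkedHashSet<>(keptAttributes);
        Set<String> tokens = new LinkedHashSet<>();
        for (String token : text.trim().split("\\s+")) {
            tokens.add(token); // each distinct word yields one attribute
        }
        List<String> duplicates = new ArrayList<>();
        for (String token : tokens) {
            if (!names.add(token)) { // name already taken by a kept attribute
                duplicates.add(token);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        String text = "market sentiment on cocoa prices"; // made-up sample text

        // The class attribute is called "sentiment" and the word also occurs in the text.
        System.out.println(duplicateNames(List.of("sentiment"), text));   // prints [sentiment]

        // After renaming the class attribute, no generated name clashes.
        System.out.println(duplicateNames(List.of("mysentiment"), text)); // prints []
    }
}
```

The point of the sketch is that the clash depends only on whether a kept attribute name also occurs as a word in the text, which is why renaming the attribute is enough.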

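I ran into the same problem because the word "domain" appeared in my data and collided with an attribute of the same name. My solution was to remove every occurrence of "domain" from the data and keep "domain" only in the @attribute declaration.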

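The simplest way to avoid these attribute name conflicts is to use a prefix for the generated attributes. The prefix can be supplied through the -P command-line option, the attributeNamePrefix option in the GenericObjectEditor, or the setAttributeNamePrefix method in Java code. See the filter's Javadoc for details.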
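As a rough illustration, the filter configuration from the question could be extended with a prefix as follows. The prefix value `w_` is an arbitrary choice made here for the example, and the position of `-P` within the option string is assumed not to matter; every other option is copied unchanged from the question.

```
weka.filters.unsupervised.attribute.StringToWordVector -R first-last -P w_ -W 1000 -prune-rate -1.0 -N 0 -stemmer weka.core.stemmers.NullStemmer -M 1 -tokenizer "weka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\""
```

With such a prefix, a word like "sentiment" in the text would presumably produce an attribute named "w_sentiment", which no longer clashes with the existing "sentiment" class attribute.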