Java: Why does my bag-of-words Naive Bayes classifier perform worse than Weka's StringToWordVector?


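I am trying to build a Naive Bayes classifier for the 1000 positive and 1000 negative labeled IMDB reviews (txt_sentoken), using the Weka API for Java.

I did not know about StringToWordVector, which basically provides a bag-of-words model and reaches 80% accuracy, so I built the vocabulary and the vectors myself. My version only reaches 75% accuracy :(

Now I am wondering why my solution performs so much worse.

1) From my 2000 reviews I build the bag of words: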

Pipeline<String, Void> bagOfWordsChain = Pipeline
        .start(Preprocessing.luceneTokenizer)
        .append(Preprocessing.stopwordFilter)
        .append(Preprocessing.vocabularyBuilder);
2) For each review, I create a vector containing the occurrence counts of these 1500 words:

{exception=1, nicely=0, crappy=0, unconvincing=0, desperate=0, awful=0, wreck=0, satan=0, fumbling=0, ted=0, protected=0, poor=0, wasted=0, legs=0, understanding=0, absent=0, neat=0, inept=0, ashamed=0, unlikely=0, solid=0, inviting=0, excellent=0, younger=0, opulent=0, trashy=0, raw=0, inspired=1, compassionate=0, charismatic=1, apparent=0, 0=0, 1=0, bollocks=0, amusing=0, placate=0, poorly=0, bogus=0, notable=0, vertiginous=0, alienating=0, sentimental=0, plausible=0, catastrophic=0, salt=0, superlative=0, i=0, artistic=0, neck=0, weird=0, stunt=0, destroyed=0, corny=0, exciting=0, obvious=0, dogged=0, sweet=0, novel=0, malaise=0, acceptable=0, ace=0, eager=0, correct=0, moved=0, melancholic=0, jerking=0, woeful=0, good=4, fortunately=0, wish=0, deadly=0, wise=0, tiresome=0, roughneck=0, faint=0, nonexistent=0, add=0, murder=0, unopposed=0, pat=0, fantasy=0, obligatory=0, vicious=0, cruel=0, befouled=0, gristly=0, respect=0, gone=0, faux=0, gorgeous=0, softhearted=0, success=0, indignant=0, wacky=0, smashing=0, cynical=0, trust=0, raging=0, searching=0, wonderment=0, paucity=0, fugly=0, nowhere=0, disturbing=1, sorry=0, spirited=0, happiness=0, responsible=0, hard=0, mistake=0, redneck=0, malevolence=0, sexy=0, caliber=0, lucrative=0, better=0, woefully=0, crap=0, alleviated=0, truth=0, well=0, detest=0, creepy=0, taking=1, terrifying=0, wanting=0, resentful=0, invisible=0, changing=0, moral=0, tried=0, disappointment=0, loved=0, strangely=0, cameo=0, struck=0, hate=0, darkness=0, pet=0, gory=0, protecting=0, disrespect=0, tough=0, loving=0, malnutrition=0, unhappy=0, flawed=0, charming=0, erotic=0, spots=0, demonic=0, animated=0, crazy=0, mighty=0, homage=0, other=2, magnificent=0, highbrow=0, swell=0, crude=0, frankly=0, surly=0, amiss=0, melodramatic=0, wail=0, unforgiving=0, energetic=0, shark=0, famous=0, thoroughly=0, stupidity=0, question=0, honestly=0, worrying=0, spirit=0, imponderable=0, intellectual=0, cheap=0, humbug=0, sickly=0, torturer=0, 
officious=0, infamous=0, heartbreaking=0, kudos=0, duke=0, cop=0, cheapjack=0, honor=0, supernatural=0, rowdy=0, nasty=0, respectability=0, all=7, terror=0, read=0, plenty=0, less=0, alas=0, adventure=0, idyllic=0, secular=0, scowl=0, shining=0, evil=0, inconvenience=0, infatuated=0, badly=0, shame=0, overjoyed=0, torn=0, chicken=0, entertaining=0, rob=0, interfering=0, assimilating=0, bush=0, elderly=0, financial=0, dumb=0, combat=0, respective=0, trick=0, maddening=0, times=0, extra=0, busy=0, talented=1, detrimental=0, hapless=0, floor=0, idealized=0, wounded=0, guilt=0, stinky=0, chief=0, satisfying=0, despite=4, indignity=0, super=0, groan=0, caring=0, botch=0, fantastic=0, spoken=0, interested=1, bitchiness=0, lame=0, clumsy=0, meddling=0, accurate=0, pity=0, flawless=0, infuriating=0, decided=0, beautiful=0, whiney=0, generalized=0, limitlessness=0, botched=0, ape=0, lovable=0, welcome=0, devoted=0, reek=0, cheesy=0, wanted=1, pathetic=0, untamed=0, difference=0, must=1, deserved=0, flash=0, unoriginal=0, sophisticated=0, perfectly=0, goofy=0, nudity=0, dandy=0, killing=1, penniless=0, singable=0, giving=0, accident=0, excuse=0, drunken=0, humour=0, knight=0, disabled=0, mournful=0, insane=0, worried=0, unappetizing=0, stench=0, pointless=0, triumph=0, perplexed=0, silly=0, black=0, bliss=0, lacking=0, fortunate=0, entirely=0, boring=1, mongrel=0, calm=0, crackerjack=0, classic=1, charm=0, tragedy=0, absolute=0, contrived=0, feelings=0, battered=0, shapely=0, surely=0, becoming=0, wealthy=0, genuinely=0, sterling=0, unable=0, disappointed=0, dispirited=0, dying=0, paying=0, bias=0, sinister=0, brutally=0, basically=0, menacing=0, uncommercial=0, imagine=0, attractive=0, egregious=0, definite=0, superb=2, flashy=0, insufficient=0, uncomfortable=0, unmarried=0, surprising=0, worse=1, camp=0, improving=0, warm=0, guilty=0, embarrassing=0, everywhere=0, worst=0, despondent=0, derogatory=0, blind=0, color=0, hidden=0, indigestion=0, impossible=0, soured=0, 
showdown=0, complaining=0, non=0, disaster=0, mono=0, negative=0, chilling=0, venomous=0, outrageous=0, painful=0, pain=0, learned=0, wan=0, yes=0, effectively=0, appropriately=0, manipulative=0, stylish=0, genius=0, detailed=1, hype=0, delightful=0, motormouth=0, paid=0, short=0, stranger=0, attempted=0, horrifying=0, fancy=0, notorious=0, innocence=0, fab=0, happy=1, pleaser=0, overacting=0, nearby=0, unfeasible=0, grime=0, struggling=0, specifically=0, controversial=0, truly=0, greater=0, promising=0, okay=0, swoon=0, shlock=0, epic=0, horrid=0, saving=0, rely=0, apparently=0, bungled=0, excessive=0, completely=0, suit=0, bastard=0, damage=0, boss=0, masterful=0, bright=0, harsh=0, clueless=0, alien=0, smart=0, anti=0, unknown=0, diaz=0, bleak=0, premium=0, frivolous=0, gloating=0, low=0, droll=0, tragic=0, amusingly=0, older=0, confusing=0, protect=0, levity=0, mischief=0, comical=0, touching=0, inane=0, unfortunately=0, freeman=0, great=2, wrong=0, beautifully=0, disembodied=0, impressively=0, constipated=0, incredulous=0, choice=0, grim=0, small=3, crushing=0, shut=0, fiction=1, doom=0, disability=0, amicable=0, straying=0, hip=0, wondering=0, totally=0, potential=0, unethical=0, otherwise=0, kind=1, repugnant=0, lifeless=0, important=0, veteran=0, nerve=0, absolutely=0, affection=0, campy=0, psycho=0, wondrous=0, game=0, mommy=0, mournfulness=0, unexpected=0, crucial=0, rocky=0, principal=0, joy=0, patient=0, sad=0, phony=0, imitation=0, visually=0, depressing=0, nostalgia=0, deserve=0, revenge=0, nostalgic=0, clear=0, banner=0, armageddon=0, craven=0, slapstick=0, momma=0, shine=0, favor=0, neither=0, further=0, stupid=0, bad=0, luckily=0, depraved=0, fit=0, crack=0, unsatisfying=0, disenchanted=0, honest=0, giant=0, asthmatic=0, bumbling=0, killer=0, pseudo=0, sure=0, otherworldly=0, going=0, shock=0, loyal=0, mild=0, opportunity=1, reconstructive=0, downright=0, astonishing=0, trying=0, finer=0, stinking=0, hurt=0, average=0, compare=0, unwilling=0, 
admire=0, dead=0, soil=0, eyes=0, amuse=0, sudden=0, fool=0, unlike=0, popularity=1, brag=0, topping=0, bully=0, crummy=0, outstanding=0, keeping=0, sex=0, emotional=0, outraged=0, right=1, possible=0, battle=0, awesome=0, fly=0, glowing=0, meet=0, complicated=0, masterpiece=0, jobless=0, lovelorn=0, hollywood=0, beauty=0, scare=0, woe=0, needless=0, wounding=0, wretched=0, outdated=0, absurd=0, accomplished=0, unworkable=0, won=0, forgotten=0, useless=0, warning=0, scary=0, ed=0, needs=0, disadvantage=0, sumptuous=0, unpredictable=0, intriguing=0, suspicious=0, confrontation=0, inventive=0, horrific=0, never=0, phantom=0, oddly=0, blame=0, macho=0, nude=0, confusion=0, little=1, lucky=0, some=6, virtual=0, subtlety=0, blank=0, waiting=0, importance=0, uma=0, worthy=0, lamentable=0, training=0, mistaken=0, fox=0, content=0, legendary=0, woozy=0, trouble=0, conceited=0, sin=0, just=0, bloody=0, remarry=0, over=2, sole=0, sold=0, brilliant=0, crazed=0, abusive=0, go=0, wearing=0, false=0, obviously=0, sleazy=0, kept=0, grand=1, insomnia=0, disconcerting=0, endearing=0, decidedly=0, fiendish=0, atrocious=0, ludicrous=0, elaborate=0, very=6, expert=0, irrefutable=0, deplorable=0, provoking=0, delayed=0, sick=0, foul=0, superficial=0, easily=0, model=0, believable=0, autistic=0, fear=0, bonnie=0, disbelief=0, understated=0, letdown=0, plaintive=0, lively=0, crotchety=0, whiny=0, annoying=1, sly=0, ornery=0, upset=0, alive=0, unpleasant=0, majestic=0, abhorrent=0, lugubrious=0, ruthless=0, thinking=0, world=1, known=1, handicapped=0, composed=0, mangled=0, prejudice=0, hopefully=0, ability=0, together=0, delight=0, sadly=0, missed=0, positive=0, obnoxious=0, joking=0, off=0, joke=0, virtuoso=0, scummy=0, troublesome=0, complete=0, undeniable=0, forged=0, constant=0, instance=0, dreck=0, liked=0, second=0, confused=0, esteemed=0, fine=0, find=0, patchy=0, international=0, regardless=0, terrible=0, untraditional=0, ideal=0, pleasant=0, hamming=0, difficult=0, fill=0, 
cheating=0, plus=0, convenient=0, background=0, true=0, uninspired=0, malice=0, nonetheless=0, handsome=0, dozens=0, dangerous=0, groveling=0, best=2, decent=0, nonsense=0, eerie=0, troubled=0, loser=0, ok=0, make=1, rescue=0, experienced=0, reprehensible=0, highly=0, certainly=0, unfortunate=0, interesting=0, enthralled=0, cringing=0, intelligent=0, master=0, fright=0, extraordinary=0, selfless=0, due=0, howling=0, evident=0, authentic=0, essentially=0, heroine=0, worthwhile=0, undependable=0, sitting=0, psychological=0, credible=0, threatening=0, moralizing=0, bullshit=0, danger=0, somewhere=0, firm=0, extremely=0, speaking=0, starred=0, clever=0, reputable=0, horrible=0, drawn=0, recent=0, xenophobic=0, inevitable=0, horribly=0, unrealistic=0, underdog=0, miserable=0, wonderful=0, received=1, cracking=0, remaining=1, quality=0, glory=0, disastrous=0, propelling=0, disingenuous=0, animal=0, consistently=0, psychotic=0, sub=0, nevertheless=0, entranced=0, workmanlike=0, cute=0, ravaging=0, fell=0, unfunny=0, frightening=0, wonderfully=0, lecherous=0, apart=0, dirty=0, offensive=0, bother=0, righteous=0, necessary=0, thorough=0, beloved=0, reverend=0, controlled=0, face=0, definitely=0, stab=0, afraid=0, marvelous=0, bum=0, respected=0, randy=0, separate=0, suffering=0, instinct=0, buy=0, reputation=0, express=0, zero=0, amazing=0, trustworthiness=0, instantly=0, climactic=0, awkward=0, reluctant=0, passion=0, redeeming=0, ruin=0, formidable=0, admirable=0, please=0, troubling=0, punk=0, hint=0, know=0, eccentric=0, rough=0, proper=0, ruffian=0, cold=0, beast=0, cole=0, gag=0, bitter=0, contemptible=0, bats=0, shoddy=0, interest=0, damaging=0, hurting=0, missing=0, wonder=0, ungodly=0, gay=0, successful=0, lovely=0, brutal=0, corrupt=0, slight=0, winning=0, tumultuous=0, discernable=0, dramatic=0, damn=0, mediocre=0, superior=0, incredibly=0, imaginary=0, base=0, immensely=0, mom=0, whole=0, tingle=0, fable=0, schlock=0, none=0, fair=0, hell=0, quirky=0, humor=0, 
problem=0, lost=1, depressed=0, still=0, researcher=0, worn=0, lose=0, matt=0, ironically=0, props=0, fail=0, enjoyment=0, enjoyable=0, unredeemable=0, irritating=0, love=0, enjoy=0, gem=0, out=0, laughable=0, seeing=0, dark=0, witty=0, suspenseful=0, gusto=0, rootless=0, entrapment=0, aged=1, fascinating=0, suspect=0, nice=0, stink=0, opinion=0, lots=0, elegance=0, inexorable=0, altogether=0, emotion=0, elements=1, ended=0, cutting=0, fake=0, remorseful=0, squalor=0, upsetting=0, insult=0, magic=0, regrettably=0, villain=0, bizarre=0, perfect=0, utter=0, heartbroken=0, prepared=0, sound=0, preferable=0, healing=0, utterly=0, spite=0, abysmally=0, plain=0, criminal=0, incongruity=0, smelly=0, proud=0, like=5, ill=0, enamored=0, ugly=0, paranoia=0, messy=0, condemning=0, cliched=0, jocular=0, paranoid=0, sheltered=0, safe=0, interrogator=0, honored=0, thrilling=0, trite=0, regret=0, steal=0, irreplaceable=0, congratulations=0, stereotypical=0, pong=0, weep=0, engaging=0, seemingly=0, aware=0, filthy=0, soiled=0, pneumonia=0, ready=0, walking=0, disappointing=0, greatest=0, haunting=0, advantage=0, fault=0, really=0, nifty=0, expensive=0, magical=0, refreshing=0, assured=0, dignity=0, comic=0}
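
As a rough illustration of step 2, the count vector can be sketched in plain Java without Weka. This is only a minimal sketch under assumptions: the class name `CountVectorSketch` and the method `countVector` are hypothetical and are not part of the pipeline shown in the question, and the tiny vocabulary is made up for the example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not the asker's Pipeline code.
final class CountVectorSketch {

    /** Counts how often each vocabulary word occurs in the tokens; other tokens are ignored. */
    static Map<String, Integer> countVector(List<String> vocabulary, List<String> tokens) {
        Map<String, Integer> vector = new LinkedHashMap<>();
        for (String word : vocabulary) {
            vector.put(word, 0); // every vocabulary word gets an entry, even if absent
        }
        for (String token : tokens) {
            // computeIfPresent only updates words that are part of the vocabulary
            vector.computeIfPresent(token, (word, count) -> count + 1);
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> vocabulary = Arrays.asList("good", "awful", "superb");
        List<String> tokens = Arrays.asList("good", "plot", "good", "cast", "superb");
        System.out.println(countVector(vocabulary, tokens)); // {good=2, awful=0, superb=1}
    }
}
```

Tokens outside the vocabulary ("plot", "cast") are dropped, which is why most entries in a real review vector are 0.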
3) I use Weka's NaiveBayes with 10-fold cross-validation (k=10):

Instances trainingSet = new Instances("Data", features.getAttributes(), 2000);         
trainingSet.setClassIndex(0); 


    // Process every review, extract the ~1500 feature values and add them to the training set.
    // A sequential forEach is used here because Instances.add is not thread-safe.
    rvw.getReviews().forEach((review) -> {
        Instance inst = new DenseInstance(features.getNumberOfFeatures()); // = 1501 features

        // get the word vector for this review
        HashMap<String,Integer> wordVector = stringToWordVectorChain.run(review.getReviewText());
        // set the sentiment class to the positive or negative label
        features.setClass(inst, review.getPositiveOrNegative());
        features.setFeatureValues(inst, wordVector); // for each feature it will do "setValue" on the instance

        trainingSet.add(inst);
    });


Classifier cModel = (Classifier) new NaiveBayes();
cModel.buildClassifier(trainingSet);

// Test the model
Evaluation eTest = new Evaluation(trainingSet);
eTest.crossValidateModel(cModel, trainingSet, 10, new Random(1));

// print results
String strSummary = eTest.toSummaryString();
System.out.println(strSummary);
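I experimented with different vocabulary sizes (StringToWordVector uses ~1200 words) and different polarity thresholds, but 75% is the highest accuracy my solution reaches.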

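Reading through Weka's StringToWordVector, there seem to be some implementation details that differ from yours. Here are the top two that I think could be responsible for the difference in performance: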

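  • By default, the resulting vectors appear to be boolean vectors (i.e. they record a word's presence rather than its number of occurrences)
  • If the class attribute is set before the text is vectorized, a separate dictionary is built for each class, and then all dictionaries are merged
While either of these (or other, more subtle differences) may be the culprit, my bet is on the second point.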
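
To make the difference between per-class dictionaries and a single dictionary concrete, here is a simplified plain-Java sketch. It is not Weka's actual implementation: the names `DictionarySketch`, `topWords`, `perClassDictionary` and `singleDictionary` are hypothetical, and choosing the top n words by raw frequency is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Simplified illustration only; Weka's real dictionary pruning differs in detail.
final class DictionarySketch {

    /** Returns the n most frequent words in the documents (ties broken alphabetically). */
    static Set<String> topWords(List<List<String>> documents, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> document : documents) {
            for (String token : document) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        Set<String> words = new TreeSet<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            words.add(entries.get(i).getKey());
        }
        return words;
    }

    /** One dictionary per class, then merged: each class contributes its own top n words. */
    static Set<String> perClassDictionary(Map<String, List<List<String>>> documentsByClass, int n) {
        Set<String> merged = new TreeSet<>();
        for (List<List<String>> documents : documentsByClass.values()) {
            merged.addAll(topWords(documents, n));
        }
        return merged;
    }

    /** A single dictionary: the top n words over all documents, ignoring the class label. */
    static Set<String> singleDictionary(Map<String, List<List<String>>> documentsByClass, int n) {
        List<List<String>> all = new ArrayList<>();
        for (List<List<String>> documents : documentsByClass.values()) {
            all.addAll(documents);
        }
        return topWords(all, n);
    }

    public static void main(String[] args) {
        Map<String, List<List<String>>> reviews = new HashMap<>();
        reviews.put("pos", Arrays.asList(
                Arrays.asList("great", "film", "great"),
                Arrays.asList("film", "superb")));
        reviews.put("neg", Arrays.asList(
                Arrays.asList("awful", "film"),
                Arrays.asList("film", "boring", "awful", "film")));

        System.out.println(perClassDictionary(reviews, 2)); // [awful, film, great]
        System.out.println(singleDictionary(reviews, 2));   // [awful, film]
    }
}
```

In this toy data the merged per-class dictionary keeps "great", a word that is frequent in only one class, while the single dictionary drops it. That kind of class-specific word is often exactly what a sentiment classifier needs.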

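The built-in class allows each of these options to be switched on and off. You could try re-running the 80% version with StringToWordVector, using the -C option to use occurrence counts instead of booleans, and -O to use a single dictionary across both classes.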

This would let you verify whether either of them is indeed the culprit.


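Edit: Regarding the first point, i.e. counting occurrences vs. noting word presence (also known as the multinomial and Bernoulli models, respectively), several academic papers from the 1990s studied the differences between them. While the multinomial model usually works better, there are cases where the opposite holds, depending on the corpus and the classification problem.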
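
The only difference between the two representations at the feature level is whether a count is kept or collapsed to 0/1. A minimal sketch, assuming the count vectors are maps like the one shown in the question (the class `PresenceVectorSketch` and the method `toPresence` are hypothetical names):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Weka or the asker's code.
final class PresenceVectorSketch {

    /** Turns occurrence counts into presence flags: 1 if the word occurs at all, else 0. */
    static Map<String, Integer> toPresence(Map<String, Integer> counts) {
        Map<String, Integer> presence = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            presence.put(entry.getKey(), entry.getValue() > 0 ? 1 : 0);
        }
        return presence;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("good", 4);
        counts.put("all", 7);
        counts.put("awful", 0);
        System.out.println(toPresence(counts)); // {good=1, all=1, awful=0}
    }
}
```

Feeding both variants to the same classifier is a cheap way to check whether the representation matters for a given corpus.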

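Thanks for your research! After switching from occurrence counts to word presence, I did not notice any difference. Using separate dictionaries, on the other hand, did improve the results by 2-4%, depending on the number of words included in the vocabulary. I also found another problem with my approach: I first thought it would be a good idea to add only words with a certain polarity, checked against SentiWordNet. However, this actually lowered the accuracy! Another thing that helped was Weka's AttributeSelection, which picks the top attributes with regard to redundancy and so on. The result is 83%.

So my bet was almost right :) Glad to hear you were able to beat the built-in baseline. It would be interesting to see whether this holds for 10K or 100K datasets.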