Lucene: word normalization with RDDs
Maybe this question is a bit strange... but I'll try to ask it anyway. Everyone who has written an application with the Lucene API has seen something like this:
public static String removeStopWordsAndGetNorm(String text, String[] stopWords, Normalizer normalizer) throws IOException
{
    TokenStream tokenStream = new ClassicTokenizer(Version.LUCENE_44, new StringReader(text));
    tokenStream = new StopFilter(Version.LUCENE_44, tokenStream, StopFilter.makeStopSet(Version.LUCENE_44, stopWords, true));
    tokenStream = new LowerCaseFilter(Version.LUCENE_44, tokenStream);
    tokenStream = new StandardFilter(Version.LUCENE_44, tokenStream);
    CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
    tokenStream.reset();
    StringBuilder result = new StringBuilder();
    while (tokenStream.incrementToken())
    {
        try
        {
            // normalizer.getNormalForm(...) - stemmer or lemmatizer
            result.append(normalizer.getNormalForm(token.toString())).append(" ");
        }
        catch (Exception e)
        {
            // if something went wrong, skip this token
        }
    }
    tokenStream.end();
    tokenStream.close();
    return result.toString();
}
Is it possible to rewrite this word normalization using RDDs? Maybe someone has an example of such a transformation, or can point to a web resource about it?
Thanks. I recently used a similar example in a talk. It shows how to remove stop words. It doesn't have a normalization phase, but if normalizer.getNormalForm comes from a reusable library, it should be easy to integrate.

This code could be a starting point:
// source text
val rdd = sc.textFile(...)
// stop words src
val stopWordsRdd = sc.textFile(...)
// bring stop words to the driver to broadcast => more efficient than rdd.subtract(stopWordsRdd)
val stopWords = stopWordsRdd.collect.toSet
val stopWordsBroadcast = sc.broadcast(stopWords)
val words = rdd.flatMap(line => line.split("\\W").map(_.toLowerCase))
val cleaned = words.mapPartitions { iterator =>
  val stopWordsSet = stopWordsBroadcast.value
  iterator.filter(elem => !stopWordsSet.contains(elem))
}
// plug the normalizer function here
val normalized = cleaned.map(normalForm(_))
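The "plug the normalizer function here" step could also go inside mapPartitions, so that any per-document or per-library setup for the normalizer runs once per partition rather than once per word. A minimal plain-Scala sketch of what one partition would compute, where `normalForm` is a toy suffix-stripping stand-in for a real stemmer or lemmatizer (both `normalForm` and `processPartition` are hypothetical names, not part of any library):

```scala
// Toy stand-in for normalizer.getNormalForm: a real job would call a
// stemmer/lemmatizer here (this name and behavior are placeholders).
def normalForm(token: String): String =
  token.stripSuffix("s")

// What one partition computes: split lines into lowercase words,
// drop stop words, then normalize each surviving token.
def processPartition(stopWords: Set[String])(lines: Iterator[String]): Iterator[String] =
  lines
    .flatMap(_.split("\\W+"))
    .map(_.toLowerCase)
    .filter(w => w.nonEmpty && !stopWords.contains(w))
    .map(normalForm)

val out = processPartition(Set("the", "a", "is"))(Iterator("The cats are running")).toList
// out == List("cat", "are", "running")
```

In the actual job this could be wired as `rdd.mapPartitions(it => processPartition(stopWordsBroadcast.value)(it))`, keeping the broadcast stop-word lookup and the normalizer setup per partition instead of per element.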
Note: this is from the perspective of a Spark job; I'm not familiar with Lucene.

Thanx man! I'll try it out and report the results!

Man, I need advice... Which approach do you think is more efficient: fetching the documents and spreading them across the nodes, then tokenizing and normalizing the words of each document; or fetching each document sequentially, tokenizing it, and normalizing its words by spreading them across the nodes, where each node holds a copy of the normalizer function? Thank you very much.