
Scala: How to apply word2vec to k-means clustering?

Tags: scala, cluster-analysis, text-mining, word2vec, dl4j

I'm new to word2vec. I'm trying to form some clusters based on the words that word2vec extracts from the abstracts of scientific publications. To do that, I first split the abstracts into sentences with Stanford CoreNLP and put each sentence on its own line in a text file. The text file was then ready for deeplearning4j's word2vec to process.
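The sentence-splitting step can be sketched roughly like this with Stanford CoreNLP's ssplit annotator (a minimal sketch; the abstracts collection and the file names are placeholders, not my actual setup):

    import java.io.{File, PrintWriter}
    import java.util.Properties
    import scala.collection.JavaConverters._
    import edu.stanford.nlp.ling.CoreAnnotations
    import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

    // Tokenization and sentence splitting are enough for this step.
    val props = new Properties()
    props.setProperty("annotators", "tokenize, ssplit")
    val pipeline = new StanfordCoreNLP(props)

    // `abstracts` is a hypothetical placeholder for the abstract texts.
    val abstracts: Seq[String] = Seq("...")

    val out = new PrintWriter(new File("filename.txt"))
    for (abstractText <- abstracts) {
      val doc = new Annotation(abstractText)
      pipeline.annotate(doc)
      // One sentence per line, as LineSentenceIterator expects later.
      doc.get(classOf[CoreAnnotations.SentencesAnnotation]).asScala
        .foreach(s => out.println(s.get(classOf[CoreAnnotations.TextAnnotation])))
    }
    out.close()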

Since the texts come from the scientific domain, they contain many mathematical terms and parentheses. See the example sentences below:

The meta-analysis showed statistically significant effects of pharmacopuncture compared to conventional treatment = 3.55 , P = .31 , I-2 = 16 % ) . 

90 asymptomatic hypertensive subjects associated with LVH , DM , or RI were randomized to receive D&G herbal capsules 1 gm/day , 2 gm/day , or identical placebo capsules in double-blind and parallel fashion for 12 months . 
Once the text file was ready, I ran word2vec as follows:

import java.io.File;

import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer;
import org.deeplearning4j.models.word2vec.Word2Vec;
import org.deeplearning4j.text.sentenceiterator.LineSentenceIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.sentenceiterator.SentencePreProcessor;
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

// One sentence per line, lower-cased before tokenization.
SentenceIterator iter = new LineSentenceIterator(new File(".../filename.txt"));
iter.setPreProcessor(new SentencePreProcessor() {
    @Override
    public String preProcess(String sentence) {
        return sentence.toLowerCase();
    }
});

// Split on white spaces in the line to get words.
TokenizerFactory t = new DefaultTokenizerFactory();
t.setTokenPreProcessor(new CommonPreprocessor());

log.info("Building model....");
Word2Vec vec = new Word2Vec.Builder()
        .minWordFrequency(5)
        .iterations(1)
        .layerSize(100)   // 100-dimensional word vectors
        .seed(42)
        .windowSize(5)
        .iterate(iter)
        .tokenizerFactory(t)
        .build();

log.info("Fitting Word2Vec model....");
vec.fit();

log.info("Writing word vectors to text file....");

// Write word vectors: one word per line, followed by its 100 values.
WordVectorSerializer.writeWordVectors(vec, "abs_terms.txt");
This run created a text file containing many words, one per line, each followed by its vector values, like this:

pills -4.559159278869629E-4 0.028691953048110008 0.023867368698120117 ...
tricuspidata -0.00431067543104291 -0.012515762820839882 0.0074045853689312935 ...
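Before clustering, a quick sanity check on the embeddings can be done by reloading the saved vectors and inspecting the nearest neighbours of a few known terms (a minimal sketch; the probe word "pills" is just an example, and loadTxtVectors has been deprecated in favour of readWord2VecModel in newer DL4J releases):

    import java.io.File
    import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer

    // Reload the vectors written by writeWordVectors above.
    val vectors = WordVectorSerializer.loadTxtVectors(new File("abs_terms.txt"))

    // If the nearest neighbours look topically related, the embeddings are plausible.
    println(vectors.wordsNearest("pills", 10))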

As a follow-up step, this text file was used to form some clusters via k-means in Spark. See the code below:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val rawData = sc.textFile("...abs_terms.txt")

    // Each line is "word v1 ... v100"; indices 1 to 100 hold the vector values.
    val extractedFeatureVector = rawData.map(
      s => Vectors.dense(s.split(' ').slice(1, 101).map(_.toDouble))).cache()

    val numberOfClusters = 10
    val numberOfIterations = 100

    // We use the KMeans object provided by MLlib to run k-means.
    val model = KMeans.train(extractedFeatureVector, numberOfClusters, numberOfIterations)

    model.clusterCenters.foreach(println)

    // Get the cluster index for each word (the first token on the line).
    val AltCompByCluster = rawData.map {
      row =>
        (model.predict(Vectors.dense(row.split(' ').slice(1, 101).map(_.toDouble))),
          row.split(' ')(0))
    }

    AltCompByCluster.foreach(println)
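For reference, a minimal sketch of how the (clusterIndex, word) pairs above can be grouped so that each cluster can be eyeballed as a whole:

    // Group words by cluster index and print each cluster on one line.
    AltCompByCluster.groupByKey()
      .collect()
      .foreach { case (clusterId, words) =>
        println(s"cluster $clusterId: ${words.mkString(", ")}")
      }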
With the Scala code above, I retrieved 10 clusters based on the word vectors word2vec produced. However, when I inspected the clusters, no obviously related words appeared together; in other words, I could not get reasonable clusters as I expected. Given this bottleneck, I have a few questions:

1) In some word2vec tutorials I saw that no data cleaning was done, i.e. prepositions and the like are left in the text. So what cleaning procedure should I apply when using word2vec? (See the sketch after these questions for the kind of filtering I have in mind.)

2) How can I visualize the clustering results in an interpretable way?

3) Can I use the word2vec word vectors as input to a neural network? If so, which neural network approach (convolutional, recurrent, recursive) would suit my goal better?

4) Does word2vec make sense for my goal at all?
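For question 1, a minimal sketch of the kind of pre-filtering I have in mind: dropping stop words and purely numeric or symbolic tokens from the sentence file before training (the stop-word list below is a tiny hypothetical placeholder, not a vetted list):

    import java.io.{File, PrintWriter}
    import scala.io.Source

    // Hypothetical, deliberately tiny stop-word list; a real one would be larger.
    val stopWords = Set("the", "a", "an", "of", "to", "in", "and", "or", "was", "were")

    // Keep only alphabetic tokens that are not stop words.
    def clean(line: String): String =
      line.split("\\s+")
        .filter(tok => tok.matches("[a-zA-Z-]+") && !stopWords.contains(tok.toLowerCase))
        .mkString(" ")

    val out = new PrintWriter(new File("filename_cleaned.txt"))
    Source.fromFile("filename.txt").getLines()
      .map(clean)
      .filter(_.nonEmpty) // drop lines that become empty after cleaning
      .foreach(out.println)
    out.close()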


Thanks in advance.

k-means is unlikely to work well in a high-dimensional space like this. You should also try ELKI's k-means, which is much better (and much faster) than Spark's. But interpreting the clusters will probably be the hardest part: you are likely to see unrelated words all over the place.
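One common mitigation for the dimensionality problem (and a way to get an interpretable 2-D picture, per question 2) is to project the vectors down before clustering, for example with Spark MLlib's PCA. A minimal sketch, assuming the extractedFeatureVector RDD from the question:

    import org.apache.spark.mllib.feature.PCA

    // Project the 100-dimensional word vectors onto 2 principal components.
    val pca = new PCA(2).fit(extractedFeatureVector)
    val projected = pca.transform(extractedFeatureVector)

    // The 2-D points can be plotted directly, or fed to KMeans.train
    // instead of the raw vectors.
    projected.take(5).foreach(println)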