Java: Getting the instances and topic sequences of all documents in MALLET [java, lda, topic-modeling, mallet]

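I'm using the MALLET library for topic modeling. My dataset is at the path filePath, and the CsvIterator appears to read it correctly, because model.getData() has about 27,000 rows, which matches the size of my dataset. I wrote a loop that prints the instance and topic sequence for the first 10 documents, but the length of tokens is 0. Where did I go wrong?

Further down, I want to display the top 5 words of each topic, together with the topic's proportion, for the first 10 documents, but the output is identical for every document.

Sample console output: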

---- document 0

0  0.200  com (1723) twitter (1225) http (871) cbr (688) canberra (626)

1  0.200  com (981) twitter (901) day (205) may (159) wed (156)

2  0.200  twitter (1068) com (947) act (433) actvcc (317) canberra (302)

3  0.200  http (1039) canberra (841) jobs (378) dlvr (313) com (228)

4  0.200  com (1185) www (1074) http (831) news (708) canberratimes (560)

---- document 1

0  0.200  com (1723) twitter (1225) http (871) cbr (688) canberra (626)

1  0.200  com (981) twitter (901) day (205) may (159) wed (156)

2  0.200  twitter (1068) com (947) act (433) actvcc (317) canberra (302)

3  0.200  http (1039) canberra (841) jobs (378) dlvr (313) com (228)

4  0.200  com (1185) www (1074) http (831) news (708) canberratimes (560)

As far as I understand, the LDA model generates each document and assigns its words to topics. So why is the result the same for every document?

    // Imports required by the fragment below
    import java.io.*;
    import java.util.*;
    import java.util.regex.Pattern;

    import cc.mallet.pipe.*;
    import cc.mallet.pipe.iterator.CsvIterator;
    import cc.mallet.topics.*;
    import cc.mallet.types.*;

    // Import pipeline: lowercase, tokenize, remove stopwords, map tokens to feature ids
    ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
    pipeList.add(new CharSequenceLowercase());
    pipeList.add(new CharSequence2TokenSequence(Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")));
    // stoplists/en.txt
    pipeList.add(new TokenSequenceRemoveStopwords(new File(pathStopWords), "UTF-8", false, false, false));
    pipeList.add(new TokenSequence2FeatureSequence());

    InstanceList instances = new InstanceList(new SerialPipes(pipeList));

    Reader fileReader = new InputStreamReader(new FileInputStream(new File(filePath)), "UTF-8");
    // header of my data set:
    // row,location,username,hashtaghs,text,retweets,date,favorites,numberOfComment
    CsvIterator csvIterator = new CsvIterator(fileReader,
            Pattern.compile("^(\\d+)[,]*[^,]*[,]*[^,]*[,]*[^,]*[,]*([^,]*)[,]*[^,]*[,]*[^,]*[,]*[^,]*[,]*[^,]*$"),
            2, 0, 1); // data, label, name fields
    instances.addThruPipe(csvIterator);

    int numTopics = 5;
    ParallelTopicModel model = new ParallelTopicModel(numTopics, 1.0, 0.01);

    model.addInstances(instances);
    model.setNumThreads(2);
    model.setNumIterations(50);
    model.estimate();

    Alphabet dataAlphabet = instances.getDataAlphabet();
    ArrayList<TopicAssignment> arrayTopics = model.getData();

    // Words of each topic sorted by weight; this is computed for the whole model
    ArrayList<TreeSet<IDSorter>> topicSortedWords = model.getSortedWords();

    for (int i = 0; i < 10; i++) {
        System.out.println("---- document " + i);
        FeatureSequence tokens = (FeatureSequence) arrayTopics.get(i).instance.getData();
        LabelSequence topics = arrayTopics.get(i).topicSequence;

        // Print every token of the document with the topic assigned to it
        Formatter out = new Formatter(new StringBuilder(), Locale.US);
        for (int position = 0; position < tokens.getLength(); position++) {
            out.format("%s-%d ", dataAlphabet.lookupObject(tokens.getIndexAtPosition(position)),
                    topics.getIndexAtPosition(position));
        }
        System.out.println(out);

        // Topic proportions of document i
        double[] topicDistribution = model.getTopicProbabilities(i);

        // For each topic: its proportion in document i and its top 5 words
        for (int topic = 0; topic < numTopics; topic++) {
            Iterator<IDSorter> iterator = topicSortedWords.get(topic).iterator();
            out = new Formatter(new StringBuilder(), Locale.US);
            out.format("%d\t%.3f\t", topic, topicDistribution[topic]);
            int rank = 0;
            while (iterator.hasNext() && rank < 5) {
                IDSorter idCountPair = iterator.next();
                out.format("%s (%.0f) ", dataAlphabet.lookupObject(idCountPair.getID()), idCountPair.getWeight());
                rank++;
            }
            System.out.println(out);
        }

        // Collect the top 5 words of topic 0 into a string
        StringBuilder topicZeroText = new StringBuilder();
        Iterator<IDSorter> iterator = topicSortedWords.get(0).iterator();

        int rank = 0;
        while (iterator.hasNext() && rank < 5) {
            IDSorter idCountPair = iterator.next();
            topicZeroText.append(dataAlphabet.lookupObject(idCountPair.getID()) + " ");
            rank++;
        }
    }

Topics are defined at the model level, not at the document level, so their top words should be the same for every document.
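The number that can differ between documents is the proportion column, and the identical 0.200 values in the question fit a document with no tokens. The sketch below does not call MALLET. It assumes the proportions are smoothed counts, (tokens assigned to the topic + alpha) / (document length + alphaSum), with a symmetric alpha = alphaSum / numTopics, which is my understanding of what getTopicProbabilities computes. The class and method names are made up for illustration.

```java
import java.util.Locale;

class SmoothedTopicProportions {

    // Assumed formula (not copied from MALLET):
    // (count of tokens in topic k + alpha) / (document length + alphaSum)
    static double[] proportions(int[] topicOfEachToken, int numTopics, double alphaSum) {
        double alpha = alphaSum / numTopics;   // symmetric prior
        double[] p = new double[numTopics];
        for (int topic : topicOfEachToken) {
            p[topic] += 1.0;                   // count tokens per topic
        }
        double total = topicOfEachToken.length + alphaSum;
        for (int k = 0; k < numTopics; k++) {
            p[k] = (p[k] + alpha) / total;     // smooth and normalize
        }
        return p;
    }

    static String format(double[] p) {
        StringBuilder sb = new StringBuilder();
        for (double v : p) {
            sb.append(String.format(Locale.US, "%.3f ", v));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Same settings as the question: 5 topics, alphaSum = 1.0
        // A document with no tokens gets the prior only:
        System.out.println(format(proportions(new int[0], 5, 1.0)));
        // prints: 0.200 0.200 0.200 0.200 0.200

        // A hypothetical 4-token document, three tokens in topic 0 and one in topic 3:
        System.out.println(format(proportions(new int[] {0, 0, 0, 3}, 5, 1.0)));
        // prints: 0.640 0.040 0.040 0.240 0.040
    }
}
```

Under that assumption, a uniform 0.200 across all five topics says the document reached the model empty, which matches the zero-length tokens reported in the question.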

It looks like all of the text consists of URLs. Adding a PrintInputPipe to the import sequence may help with debugging.
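A related check that needs no MALLET at all is to run the two regular expressions from the question over a few rows by hand and see what ends up in the data group. The sketch below does that with plain java.util.regex. Both sample rows are invented, since the real data is not shown, and the class and method names are hypothetical. It uses find() and lowercases the text before tokenizing, which only approximates what the pipeline does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CsvRegexCheck {

    // Both patterns are copied from the question.
    static final Pattern LINE = Pattern.compile(
            "^(\\d+)[,]*[^,]*[,]*[^,]*[,]*[^,]*[,]*([^,]*)[,]*[^,]*[,]*[^,]*[,]*[^,]*[,]*[^,]*$");
    static final Pattern TOKEN = Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}");

    // Returns the content of capture group 2 (the data group), or null if the line does not match.
    static String dataField(String line) {
        Matcher m = LINE.matcher(line);
        return m.find() ? m.group(2) : null;
    }

    // Lowercases the text and collects every substring that matches the token pattern.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(text.toLowerCase(Locale.ROOT));
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String[] rows = {
            // every field filled in (made-up row)
            "1,Canberra,someuser,#cbr,Jobs in Canberra today,3,2015-05-20,1,0",
            // same row with empty location and hashtags (made-up row)
            "2,,someuser,,Jobs in Canberra today,3,2015-05-20,1,0"
        };
        for (String row : rows) {
            String data = dataField(row);
            System.out.println("data = [" + data + "]  tokens = " + tokens(data));
        }
    }
}
```

For the first row the data group holds the tweet text and yields three tokens. For the second row, each `[,]*` swallows a run of consecutive commas, so the groups shift and the data group ends up holding the date, which contains no letters and yields no tokens. Whether the real file has empty fields like this is an assumption; printing a handful of real rows this way, or through PrintInputPipe, would show it.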
