Spell checking - noisy channel model with Java streams


I have a list of query logs whose entries look like this:

Session ID   Query
01           Movie atcor
01           Movie actor
02           Award winning axtor
02           Award winning actor
03           Soap opera axtor
03           Soap opera actor
...

.map(sessionId -> getLogsWithSameSessionId(sessionId)
        .stream()
        .filter(queryLog -> //queryLog.getQuery().equals(some other queryLog in the same session)
        .count()
).count();

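I need to determine the probability that a spelling suggestion is correct. For example, to determine the probability that "actor" is the correct spelling of "axtor", I would divide the number of sessions in which "axtor" was replaced by "actor" by the number of sessions in which "actor" is the correct spelling of any misspelled word.

In this case the probability would be 2/3, because there are two sessions in which "actor" replaces "axtor" and three sessions in which "actor" replaces a misspelling ("atcor" or "axtor").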
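Written out as a formula, the calculation above reads as follows. The notation is only an illustration and does not come from the logs themselves; the counts are the ones from the sample sessions shown earlier.

```latex
P(\text{axtor} \rightarrow \text{actor})
  = \frac{\#\{\text{sessions in which ``actor'' replaces ``axtor''}\}}
         {\#\{\text{sessions in which ``actor'' replaces any misspelling}\}}
  = \frac{2}{3}
```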

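I am trying to get more familiar with Java 8 streams, so I am attempting to solve this with streams.

This is what I have come up with so far. It is a step in the right direction, but I am still missing something.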

public int numberOfCorrections(String misspelledWord, String suggestedWord)
{
    return (int) sessionIdsWithWord(misspelledWord)
            .stream()
            .map(sessionId -> getLogsWithSameSessionId(sessionId)
                    .stream()
                    .filter(queryLog -> queryLog.queryContainsWord(suggestedWord))
                    .count()
            ).count();
}

public Set<String> sessionIdsWithWord(String word)
{
    return getQueryLogsThatContainWord(word)
            .stream()
            .map(QueryLog::getSessionId)
            .collect(Collectors.toSet());
}

public List<QueryLog> getQueryLogsThatContainWord(String word)
{
    return logs
            .stream()
            .filter(queryLog -> queryLog.queryContainsWord(word))
            .collect(Collectors.toList());
}

public Map<String, List<QueryLog>> getSessionIdMapping()
{
    return logs
            .stream()
            .collect(Collectors.groupingBy(QueryLog::getSessionId));
}

public List<QueryLog> getLogsWithSameSessionId(String sessionId)
{
    return getSessionIdMapping()
            .get(sessionId);
}

But I don't know whether there is a way to compare a queryLog against the other queryLog entries in the same session.

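Unless I can figure out how to filter based on whether a given query is similar to another query in the same session, I can't really move on to the second part of the probability.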

It is not easy to go through your approach step by step. Here is a simple solution:

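// Needs java.io.IOException, java.nio.file.Files, java.util.function.Function, java.util.stream.Stream,
// plus static imports of Collectors.collectingAndThen, Collectors.groupingBy and Collectors.counting.
public double countProbability(String misspelledWord, String suggestedWord) throws IOException {
    try (Stream<String> stream = Files.lines(logFilePath)) {
        return stream.skip(1) // skip the header line
                // label each line with the word it contains ("" if it contains neither)
                .map(line -> line.contains(misspelledWord) ? misspelledWord
                        : (line.contains(suggestedWord) ? suggestedWord : ""))
                .filter(w -> !w.isEmpty())
                // count the lines per word, then divide the misspelled count by the suggested count
                .collect(collectingAndThen(groupingBy(Function.identity(), counting()),
                        m -> m.size() < 2 ? 0d : m.get(misspelledWord).doubleValue() / m.get(suggestedWord)));
    }
}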
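To try the same idea without a log file, here is a hedged in-memory variant. The class name LineCountRatio and the method name ratio are made up for illustration, and the input is assumed to be the raw log lines with the header first. Note that, as far as I can tell, this counts matching lines rather than sessions, so it agrees with the session-based definition only when each word occurs at most once per session, as in the sample data.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class LineCountRatio {

    /** Ratio of lines containing misspelledWord to lines containing suggestedWord (header skipped). */
    static double ratio(List<String> lines, String misspelledWord, String suggestedWord) {
        Map<String, Long> counts = lines.stream()
                .skip(1) // skip the "Session ID Query" header
                // label each line with the word it contains ("" if it contains neither)
                .map(line -> line.contains(misspelledWord) ? misspelledWord
                        : line.contains(suggestedWord) ? suggestedWord : "")
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        // Both words must have been seen, otherwise there is nothing to divide.
        return counts.size() < 2
                ? 0d
                : counts.get(misspelledWord).doubleValue() / counts.get(suggestedWord);
    }
}
```

With the six sample lines from the question (plus the header), ratio(lines, "axtor", "actor") gives 2/3 and ratio(lines, "atcor", "actor") gives 1/3.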


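I may have misunderstood your question.

From your sample list of query logs, it looks like the correct word always appears after the misspelled word. Is that assumption correct? Also, maybe I don't fully understand what you are trying to do, but building a complete map and filtering it for every pair of words your method receives doesn't seem like the best approach...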
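If that assumption holds (the corrected query follows the misspelled one within the same session), a session-based count could be sketched roughly as below. This is only a sketch under stated assumptions: the QueryLog class here is a minimal hypothetical stand-in for the one in the question; SessionCorrections, replacedWord, replacedWords and probability are illustrative names; and "replaces" is assumed to mean that two queries in a session are identical except for one word. The logs are grouped by session once per call instead of once per session lookup.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class SessionCorrections {

    /** Hypothetical stand-in for the question's QueryLog: a session ID plus a query string. */
    static final class QueryLog {
        private final String sessionId;
        private final String query;

        QueryLog(String sessionId, String query) {
            this.sessionId = sessionId;
            this.query = query;
        }

        String getSessionId() { return sessionId; }
        String getQuery() { return query; }
    }

    /** If later equals earlier except for one word that became suggested, returns the replaced word. */
    static Optional<String> replacedWord(String earlier, String later, String suggested) {
        String[] a = earlier.split("\\s+");
        String[] b = later.split("\\s+");
        if (a.length != b.length) {
            return Optional.empty();
        }
        // Positions at which the two queries differ.
        List<Integer> diffs = IntStream.range(0, a.length)
                .filter(k -> !a[k].equals(b[k]))
                .boxed()
                .collect(Collectors.toList());
        return diffs.size() == 1 && b[diffs.get(0)].equals(suggested)
                ? Optional.of(a[diffs.get(0)])
                : Optional.empty();
    }

    /** All words that suggested replaces in one session, comparing each query with every later one. */
    static Set<String> replacedWords(List<QueryLog> session, String suggested) {
        return IntStream.range(0, session.size()).boxed()
                .flatMap(i -> IntStream.range(i + 1, session.size())
                        .mapToObj(j -> replacedWord(
                                session.get(i).getQuery(), session.get(j).getQuery(), suggested)))
                .filter(Optional::isPresent)
                .map(Optional::get)
                .collect(Collectors.toSet());
    }

    /** Sessions where suggested replaces misspelled, divided by sessions where it replaces anything. */
    static double probability(List<QueryLog> logs, String misspelled, String suggested) {
        // Group by session, then keep only sessions in which suggested replaced some word.
        List<Set<String>> replaced = logs.stream()
                .collect(Collectors.groupingBy(QueryLog::getSessionId))
                .values().stream()
                .map(session -> replacedWords(session, suggested))
                .filter(words -> !words.isEmpty())
                .collect(Collectors.toList());
        long numerator = replaced.stream()
                .filter(words -> words.contains(misspelled))
                .count();
        return replaced.isEmpty() ? 0d : (double) numerator / replaced.size();
    }
}
```

For the six sample queries in the question, probability(logs, "axtor", "actor") evaluates to 2/3 and probability(logs, "atcor", "actor") to 1/3. The comparison is exact and case-sensitive; a real log would probably need normalization first.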