Is there an easy way to generate a probable list of words from an unspaced sentence in Python?


python nlp


I have some text:

 s="Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"

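I'd like to split this into individual words. I took a quick look at enchant and nltk, but didn't see anything that looked immediately useful. If I had time to invest in this, I'd look into writing a dynamic program that uses enchant's ability to check whether a word is English. I would have thought something to do this would already exist online; am I wrong?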

Greedy approach using a trie. Try this using Biopython (pip install biopython):

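Result

Caveat: there are degenerate cases in English for which this will not work. You need to use backtracking to handle those, but this should get you started.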
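To make the backtracking remark concrete, here is a minimal sketch, not the answerer's code. It assumes a tiny in-memory vocabulary (the WORDS set and the segment function are illustrative names) instead of a real dictionary file. It still tries the longest word first, but falls back to a shorter one when the remainder cannot be split.

```python
from functools import lru_cache

# Tiny illustrative vocabulary (an assumption); a real run would load a full word list.
WORDS = frozenset({"the", "them", "theme", "me", "men", "mend", "end", "ends"})
MAX_LEN = max(map(len, WORDS))


def segment(text):
    """Return one list of dictionary words covering all of text, or None."""

    @lru_cache(maxsize=None)
    def solve(start):
        if start == len(text):
            return ()
        # Try the longest candidate first, as the greedy version does ...
        for end in range(min(len(text), start + MAX_LEN), start, -1):
            if text[start:end] in WORDS:
                rest = solve(end)
                # ... but fall back to a shorter word when the rest cannot be split.
                if rest is not None:
                    return (text[start:end],) + rest
        return None

    result = solve(0)
    return None if result is None else list(result)


print(segment("themends"))  # ['them', 'ends']
```

A purely greedy pass would take "theme" first and then get stuck on "nds"; the memoised search backs off to "them" and succeeds. The recursion is one level per word, so very long inputs would need an iterative rewrite.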

Obligatory test

This is a kind of problem that comes up often in Asian NLP. If you have a dictionary, you can use this (disclaimer: I wrote it, hope you don't mind).

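Note that the search space can be extremely large, because a sentence written in alphabetic English is surely many more characters long than one written in syllabic Chinese/Japanese.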
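As a rough illustration of how quickly the search space grows, the sketch below counts the possible segmentations of a string with a small dynamic program. The function name count_segmentations and both vocabularies are assumptions made up for this example.

```python
def count_segmentations(text, words):
    """Count the ways text can be split into words from the given set."""
    max_len = max(map(len, words), default=0)
    # ways[i] = number of ways to segment the first i characters
    ways = [1] + [0] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            if ways[start] and text[start:end] in words:
                ways[end] += ways[start]
    return ways[-1]


# A realistic fragment: only a handful of readings.
vocab = {"be", "roughly", "divide", "divided", "din", "in", "into", "to"}
print(count_segmentations("beroughlydividedinto", vocab))  # 3

# Hypothetical worst case: every run of "a" is a word, so the count doubles per character.
worst = {"a" * k for k in range(1, 21)}
for n in (5, 10, 20):
    print(n, count_segmentations("a" * n, worst))  # 16, 512, 524288
```

Counting is cheap because the table is filled once per position; enumerating every segmentation is what becomes expensive.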

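You could encode the words of a dictionary as a trie and use a greedy algorithm: pull out the longest word that matches, then move on to the next word, backtracking on failure. Probably not optimal. Try this for suggestions on data structures:

Interesting question. I'd guess the answer ("easy way") is going to be "no".

A similar question asked previously didn't have much luck:

For example, how would your algorithm know that it's not "be roughly divide din to"? They're all proper English words…

@timpietzcker: Because that isn't the greedy way to do it. "Greed, for lack of a better word, is good. Greed is right. Greed works."

Great. This is exactly what I wanted!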

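# Note: Bio.trie is only available in older Biopython releases.
from Bio import trie
import string


def get_trie(dictfile='/usr/share/dict/american-english'):
    """Load a word list (one word per line) into a trie mapping word -> length."""
    tr = trie.trie()
    with open(dictfile, encoding='utf-8') as f:
        for line in f:
            word = line.rstrip()
            try:
                word.encode('ascii')  # raises for entries with non-ASCII characters
            except UnicodeEncodeError:
                continue  # skip those entries
            if not word:
                continue
            tr[word] = len(word)
            assert tr.has_key(word), "Missing %s" % word
    return tr


def get_trie_word(tr, s):
    """Return (longest dictionary word that prefixes s, rest of s), or (None, s)."""
    for end in range(len(s), 0, -1):
        word = s[:end]
        if tr.has_key(word):
            return word, s[end:]
    return None, s


def main(s):
    tr = get_trie()
    while s:
        word, s = get_trie_word(tr, s)
        if word is None:
            # No dictionary word starts here: print the remainder and stop
            # instead of looping forever.
            print(s)
            break
        print(word)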

if __name__ == '__main__':
    s = "Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"
    s = s.strip(string.punctuation)
    s = s.replace(" ", '')
    s = s.lower()
    main(s)

>>> if __name__ == '__main__':
...     s = "Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"
...     s = s.strip(string.punctuation)
...     s = s.replace(" ", '')
...     s = s.lower()
...     main(s)
... 
image
classification
methods
can
be
roughly
divided
into
two
broad
families
of
approaches

>>> main("expertsexchange")
experts
exchange
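
For readers who cannot install Bio.trie, the same greedy longest-match idea can be sketched with a plain nested-dict trie. This is an illustrative stand-in, not the Biopython API: build_trie, longest_prefix, greedy_split, and the short word list are all assumptions made up for the example.

```python
END = ""  # sentinel key marking the end of a word; never equal to a one-character key


def build_trie(words):
    """Build a nested-dict trie from an iterable of words."""
    root = {}
    for word in words:
        if not word:
            continue  # ignore empty entries
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def longest_prefix(root, s):
    """Return the longest prefix of s that is a word in the trie, or None."""
    node, best = root, None
    for i, ch in enumerate(s, start=1):
        node = node.get(ch)
        if node is None:
            break
        if END in node:
            best = s[:i]
    return best


def greedy_split(root, s):
    """Repeatedly strip the longest matching word; keep any unmatched tail as one item."""
    out = []
    while s:
        word = longest_prefix(root, s)
        if word is None:
            out.append(s)
            break
        out.append(word)
        s = s[len(word):]
    return out


words = ["ex", "expert", "experts", "exchange", "change", "sex"]
print(greedy_split(build_trie(words), "expertsexchange"))  # ['experts', 'exchange']
```

A single walk down the trie finds the longest match in time proportional to the word length, which is the main reason to prefer a trie over testing every prefix against a set.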