Python NLTK: Bigrams, trigrams and fourgrams

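I have this example and I want to know how to get this result. I have some text, I tokenize it, and then I collect the bigrams, trigrams and fourgrams: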

from nltk import word_tokenize
from nltk.util import ngrams

text = "Hi How are you? i am fine and you"
token = word_tokenize(text)
# ngrams() returns a generator; list() materializes it so it can be printed
bigrams = list(ngrams(token, 2))

bigrams:

[('Hi', 'How'), ('How', 'are'), ('are', 'you'), ('you', '?'), ('?', 'i'), ('i', 'am'), ('am', 'fine'), ('fine', 'and'), ('and', 'you')]

trigrams:

[('Hi', 'How', 'are'), ('How', 'are', 'you'), ('are', 'you', '?'), ('you', '?', 'i'), ('?', 'i', 'am'), ('i', 'am', 'fine'), ('am', 'fine', 'and'), ('fine', 'and', 'you')]

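Any ideas would be very helpful.

If you apply some set theory (if I'm interpreting your question correctly), you'll see that the trigrams you want are simply elements [2:5], [4:7], [6:9], etc. of the token list.
You could generate them like this: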
>>> new_trigrams = []
>>> c = 2
>>> while c < len(token) - 2:
...     new_trigrams.append((token[c], token[c+1], token[c+2]))
...     c += 2
>>> print(new_trigrams)
[('are', 'you', '?'), ('?', 'i', 'am'), ('am', 'fine', 'and')]
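
The loop above can also be written as a single comprehension over a stepped range. This is only a sketch: the helper name `strided_ngrams` and its `start`/`step` parameters are made up for illustration, and the token list is hard-coded to what `word_tokenize` is expected to return for the question's text.

```python
def strided_ngrams(tokens, n, start=0, step=1):
    """Return n-grams as tuples, beginning at index `start` and
    advancing `step` tokens between consecutive n-grams."""
    return [tuple(tokens[i:i + n])
            for i in range(start, len(tokens) - n + 1, step)]

# Assumed tokenization of "Hi How are you? i am fine and you"
tokens = ["Hi", "How", "are", "you", "?", "i", "am", "fine", "and", "you"]

new_trigrams = strided_ngrams(tokens, 3, start=2, step=2)
print(new_trigrams)
# [('are', 'you', '?'), ('?', 'i', 'am'), ('am', 'fine', 'and')]
```

With `start=0, step=1` the same helper returns every trigram, so the stride is the only thing that distinguishes the "new" trigrams from the full list.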

I do it like this:
def words_to_ngrams(words, n, sep=" "):
    return [sep.join(words[i:i+n]) for i in range(len(words)-n+1)]

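This takes a list of words as input and returns a list of n-grams (for the given n), with the words of each n-gram joined by sep (a space in this case).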
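
A quick usage sketch for that function, with a made-up word list; note that it returns joined strings rather than tuples, and an empty list when n is larger than the input:

```python
def words_to_ngrams(words, n, sep=" "):
    # Slide a window of size n over the list and join each window with sep
    return [sep.join(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "Hi How are you".split()

bigram_strings = words_to_ngrams(words, 2)
trigram_strings = words_to_ngrams(words, 3, sep="_")
too_long = words_to_ngrams(words, 5)

print(bigram_strings)   # ['Hi How', 'How are', 'are you']
print(trigram_strings)  # ['Hi_How_are', 'How_are_you']
print(too_long)         # []
```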
Try everygrams:
from nltk import everygrams
list(everygrams('hello', 1, 5))

[out]:
[('h',),
 ('e',),
 ('l',),
 ('l',),
 ('o',),
 ('h', 'e'),
 ('e', 'l'),
 ('l', 'l'),
 ('l', 'o'),
 ('h', 'e', 'l'),
 ('e', 'l', 'l'),
 ('l', 'l', 'o'),
 ('h', 'e', 'l', 'l'),
 ('e', 'l', 'l', 'o'),
 ('h', 'e', 'l', 'l', 'o')]

With word tokens:
from nltk import everygrams

list(everygrams('hello word is a fun program'.split(), 1, 5))

[out]:
[('hello',),
 ('word',),
 ('is',),
 ('a',),
 ('fun',),
 ('program',),
 ('hello', 'word'),
 ('word', 'is'),
 ('is', 'a'),
 ('a', 'fun'),
 ('fun', 'program'),
 ('hello', 'word', 'is'),
 ('word', 'is', 'a'),
 ('is', 'a', 'fun'),
 ('a', 'fun', 'program'),
 ('hello', 'word', 'is', 'a'),
 ('word', 'is', 'a', 'fun'),
 ('is', 'a', 'fun', 'program'),
 ('hello', 'word', 'is', 'a', 'fun'),
 ('word', 'is', 'a', 'fun', 'program')]
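
For readers without NLTK at hand, the same idea can be sketched in plain Python. This is not NLTK's implementation: `all_ngrams` is a hypothetical helper that mimics the length-grouped ordering shown in the output above, and the token list is assumed to be what `word_tokenize` returns for the question's text.

```python
def all_ngrams(sequence, min_len=1, max_len=None):
    """Collect every n-gram with min_len <= n <= max_len,
    shortest first, as a list of tuples."""
    items = list(sequence)
    if max_len is None:
        max_len = len(items)
    result = []
    for n in range(min_len, max_len + 1):
        for i in range(len(items) - n + 1):
            result.append(tuple(items[i:i + n]))
    return result

# Assumed tokenization of "Hi How are you? i am fine and you"
tokens = ["Hi", "How", "are", "you", "?", "i", "am", "fine", "and", "you"]

# Bigrams, trigrams and fourgrams in one call
grams = all_ngrams(tokens, 2, 4)
print(len(grams))  # 24 (9 bigrams + 8 trigrams + 7 fourgrams)
print(grams[0])    # ('Hi', 'How')
print(grams[-1])   # ('am', 'fine', 'and', 'you')
```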

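I don't understand; it seems you have already generated the n-grams?
@Emre My question is how to get the new trigrams. I'm trying to find a function that searches inside the bigram elements, compares them with the trigram elements, and keeps only the values that differ.
There's an everygrams implementation now =)
Actually my question is related to another question; if you could take a look at it, perhaps you will understand the whole meaning.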