Python: Getting bigrams and trigrams in word2vec with Gensim

I am currently using unigrams in my word2vec model, as shown below:

def review_to_sentences( review, tokenizer, remove_stopwords=False ):
    # Returns a list of sentences, where each sentence is a list of words
    #
    # Use the NLTK tokenizer to split the paragraph into sentences
    raw_sentences = tokenizer.tokenize(review.strip())

    sentences = []
    for raw_sentence in raw_sentences:
        # If a sentence is empty, skip it
        if len(raw_sentence) > 0:
            # Otherwise, call review_to_wordlist to get a list of words
            sentences.append(review_to_wordlist(raw_sentence, remove_stopwords))
    #
    # Return the list of sentences (each sentence is a list of words,
    # so this returns a list of lists)
    return sentences
However, this misses important bigrams and trigrams in my dataset.

E.g.,
"team work" -> I am currently getting it as "team", "work"
"New York" -> I am currently getting it as "New", "York"
So I would like to capture the important bigrams, trigrams, etc. in my dataset and feed them into my word2vec model.


I am new to word2vec and am struggling with how to do this. Please help me.

First of all, you should use gensim's Phrases class to get bigrams, which works as pointed out in the documentation:

>>> bigram = Phraser(phrases)
>>> sent = [u'the', u'mayor', u'of', u'new', u'york', u'was', u'there']
>>> print(bigram[sent])
[u'the', u'mayor', u'of', u'new_york', u'was', u'there']
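The `phrases` object in that snippet is a `Phrases` model that has to be trained on your tokenized corpus first. A minimal sketch of that missing step (the `tokenized_sentences` variable and the `min_count`/`threshold` values are illustrative assumptions, not part of the original answer):

from gensim.models import Phrases
from gensim.models.phrases import Phraser

# tokenized_sentences: an iterable of token lists, e.g. the output of
# review_to_sentences() above (illustrative placeholder)
tokenized_sentences = [[u'the', u'mayor', u'of', u'new', u'york', u'was', u'there'],
                       [u'new', u'york', u'mayor', u'was', u'present']]

# Learn collocation statistics from the corpus; min_count and threshold
# control how aggressively word pairs are merged into phrases
phrases = Phrases(tokenized_sentences, min_count=1, threshold=2)

# Freeze the statistics into a lightweight Phraser for fast lookups
bigram = Phraser(phrases)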
To get trigrams and so on, you should use the bigram model that you already have and apply Phrases to it again, and so on. For example:

trigram_model = Phrases(bigram_sentences)
There is also a good notebook and video that explain how to use it.

The most important part of it is how to use it on real-life sentences, which goes like this:

# to create the bigrams
bigram_model = Phrases(unigram_sentences)

# apply the trained model to each sentence, collecting the results
bigram_sentences = []
for unigram_sentence in unigram_sentences:
    bigram_sentences.append(bigram_model[unigram_sentence])

# get a trigram model out of the bigram sentences
trigram_model = Phrases(bigram_sentences)
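Tying this back to the original question (feeding the phrases into word2vec), a minimal end-to-end sketch might look like the following. The toy corpus, the `min_count`/`threshold` values, and the Word2Vec parameters are illustrative assumptions, not part of the original answer:

from gensim.models import Phrases, Word2Vec
from gensim.models.phrases import Phraser

# unigram_sentences: a list of token lists, e.g. produced by review_to_sentences()
unigram_sentences = [["new", "york", "mayor", "was", "there"],
                     ["team", "work", "is", "important"],
                     ["new", "york", "is", "big"]]

# Learn bigrams, then trigrams on top of the bigram-transformed corpus
bigram_model = Phrases(unigram_sentences, min_count=1, threshold=1)
trigram_model = Phrases(bigram_model[unigram_sentences], min_count=1, threshold=1)

# Freeze both models for fast, memory-efficient transformation
bigram_phraser = Phraser(bigram_model)
trigram_phraser = Phraser(trigram_model)

# Transform every sentence and feed the result into Word2Vec
phrased_sentences = [trigram_phraser[bigram_phraser[s]] for s in unigram_sentences]
model = Word2Vec(phrased_sentences, min_count=1, sg=1)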
Hope this helps, but next time give us more information about what you are using, etc.


P.S. Now that you have edited your question: you are not doing anything to get bigrams, just splitting the text into words; you have to use Phrases in order to get words like "New York" as bigrams.

Phrases and Phraser are what you should look for.

from gensim.models import Phrases
from gensim.models.phrases import Phraser

documents = ["the mayor of new york was there",
             "machine learning can be useful sometimes",
             "new york mayor was present"]

# Tokenize each document into a list of words
sentence_stream = [doc.split(" ") for doc in documents]
print(sentence_stream)

# Learn bigram collocations; note that in gensim 4.x the delimiter must be a str, e.g. ' '
bigram = Phrases(sentence_stream, min_count=1, threshold=2, delimiter=b' ')

# Freeze the model into a Phraser for faster transformation
bigram_phraser = Phraser(bigram)
print(bigram_phraser)

# Apply the phraser to every sentence
for sent in sentence_stream:
    tokens_ = bigram_phraser[sent]
    print(tokens_)

import gensim

# data_words is assumed to be the tokenized corpus (a list of token lists)
bigram = gensim.models.Phrases(data_words, min_count=1, threshold=10)  # higher threshold, fewer phrases
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
Once you are done building the vocabulary, use Phraser for faster access and efficient memory usage. It is not mandatory, but useful.

bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
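Continuing that snippet, the frozen models can then be applied per document (bigram model first, trigram model on its output) to get the token lists you would train Word2Vec on; a minimal sketch, with `data_words` being the same assumed tokenized corpus as above:

# Apply the bigram model first, then the trigram model on its output
trigram_words = [trigram_mod[bigram_mod[doc]] for doc in data_words]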

Thanks,

- Provide some code and better examples. The example you show does not reflect the data you gave in the first line.
- Done! Updated the question. Please help me with this.
- Thank you for your valuable answer. But when I use bigram = Phraser(phrases), it says the names Phrases and Phraser are undefined. Do I need to import them?
- @Volka Yes, you need to import them; they are in gensim's models. I know the gensim docs are confusing sometimes.
- @nitheism If you know the answer to this question, please let me know. In general, it is better to remove stop words and apply stemming after you have created the n-gram dictionary.
- @user8566323 You need to import them as follows: from gensim.models import Phrases and from gensim.models.phrases import Phraser
- It would be good to see what the output of Phrases and Phraser looks like, and what bigram and bigram_phraser look like.
- What about Word2Vec with sg=1, i.e. skip-gram with negative sampling, and how do I use this with training and test data? I want to learn the phrases on the training data and then transform the test data. How do I do that?