如何在python中计算SkipGram?

Tags: python, nlp, n-gram, language-model

A k-skip-gram is a superset of all the n-grams and of each (k-i)-skip-gram down to (k-i) == 0 (which includes the 0-skip-grams, i.e. the plain n-grams). So how can these skipgrams be computed efficiently in Python?

Following is the code I tried, but it does not produce the expected results:

<pre>
input_list = ['all', 'this', 'happened', 'more', 'or', 'less']

def find_skipgrams(input_list, N, K):
    bigram_list = []
    nlist = []

    K = 1
    for k in range(K + 1):
        for i in range(len(input_list) - 1):
            if i + k + 1 < len(input_list):
                nlist = []
                for j in range(N + 1):
                    if i + k + j + 1 < len(input_list):
                        nlist.append(input_list[i + k + j + 1])
            bigram_list.append(nlist)
    return bigram_list
</pre>

How about using someone else's implementation, where

k = skip_size
n = ngram_order

from itertools import chain, combinations
import copy
from nltk.util import ngrams

def pad_sequence(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
    if pad_left:
        sequence = chain((pad_symbol,) * (n-1), sequence)
    if pad_right:
        sequence = chain(sequence, (pad_symbol,) * (n-1))
    return sequence

def skipgrams(sequence, n, k, pad_left=False, pad_right=False, pad_symbol=None):
    sequence_length = len(sequence)
    sequence = iter(sequence)
    sequence = pad_sequence(sequence, n, pad_left, pad_right, pad_symbol)

    if sequence_length + pad_left + pad_right < k:
        raise Exception("The length of sentence + padding(s) < skip")

    if n < k:
        raise Exception("Degree of Ngrams (n) needs to be bigger than skip (k)")    

    history = []
    nk = n+k

    # Return point for recursion.
    if nk < 1:
        return
    # If n+k is longer than the sequence, reduce k by 1 and recurse.
    elif nk > sequence_length:
        for ng in skipgrams(list(sequence), n, k - 1):
            yield ng
        return  # The sequence iterator is exhausted at this point.

    while nk > 1: # Collects the first instance of n+k length history
        history.append(next(sequence))
        nk -= 1

    # Iteratively drop the first item in history and pick up the next
    # while yielding skipgrams for each iteration.
    for item in sequence:
        history.append(item)
        current_token = history.pop(0)      
        # Iterates through the rest of the history and
        # picks out all combinations of the (n-1)-grams
        for idx in list(combinations(range(len(history)), n-1)):
            ng = [current_token]
            for _id in idx:
                ng.append(history[_id])
            yield tuple(ng)

    # Recursively yield the skipgrams for the rest of the sequence where
    # len(sequence) < n+k
    for ng in list(skipgrams(history, n, k-1)):
        yield ng
import numpy as np

def skipgram_ndarray(sent, k=1, n=2):
    """
    This is not exactly a vectorized version, because we are still
    using a for loop
    """
    tokens = sent.split()
    if len(tokens) < k + 2:
        raise Exception("REQ: length of sentence > skip + 2")
    matrix = np.zeros((len(tokens), k + 2), dtype=object)
    matrix[:, 0] = tokens
    matrix[:, 1] = tokens[1:] + ['']
    result = []
    for skip in range(1, k + 1):
        matrix[:, skip + 1] = tokens[skip + 1:] + [''] * (skip + 1)
    for index in range(1, k + 2):
        temp = matrix[:, 0] + ',' + matrix[:, index]
        map(result.append, temp.tolist())
    limit = (((k + 1) * (k + 2)) / 6) * ((3 * n) - (2 * k) - 6)
    return result[:limit]

def skipgram_list(sent, k=1, n=2):
    """
    Form skipgram features using list comprehensions
    """
    tokens = sent.split()
    tokens_n = ['tokens[index + j + {0}]'.format(index)
                for index in range(n - 1)]
    x = '(tokens[index], ' + ', '.join(tokens_n) + ')'
    query_part1 = 'result = [' + x + ' for index in range(len(tokens))'
    query_part2 = ' for j in range(1, k + 2) if index + j + n < len(tokens)]'
    exec(query_part1 + query_part2)
    return result
From the paper that the OP linked to, the following string:

Insurgents killed in ongoing fighting

yields:

2-skip-bi-grams = {Insurgents killed, Insurgents in, Insurgents ongoing, killed in, killed ongoing, killed fighting, in ongoing, in fighting, ongoing fighting}

2-skip-tri-grams = {Insurgents killed in, Insurgents killed ongoing, Insurgents killed fighting, Insurgents in ongoing, Insurgents in fighting, Insurgents ongoing fighting, killed in ongoing, killed in fighting, killed ongoing fighting, in ongoing fighting}
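These two sets can be reproduced mechanically. A minimal sketch (the helper name k_skip_ngrams is mine, not from the paper) that enumerates k-skip-n-grams by choosing index tuples whose span permits at most k skipped tokens:

```python
from itertools import combinations

def k_skip_ngrams(tokens, n, k):
    # Choose n increasing indices; the number of tokens skipped inside
    # the span is (last - first + 1) - n, which must be at most k.
    return [tuple(tokens[i] for i in idx)
            for idx in combinations(range(len(tokens)), n)
            if idx[-1] - idx[0] <= n - 1 + k]

sent = "Insurgents killed in ongoing fighting".split()
print(len(k_skip_ngrams(sent, 2, 2)))  # 9 skip-bigrams, as in the paper
print(len(k_skip_ngrams(sent, 3, 2)))  # 10 skip-trigrams
```

Note that ('Insurgents', 'fighting') is absent from the 2-skip-bigrams: its index span would require skipping three tokens.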

With a slight modification to NLTK's ngrams code:

But do note that if n+k > len(sequence), it will yield the same effects as skipgrams(sequence, n, k-1) (this is not a bug, it's a fail-safe feature), e.g.

This allows n == k but disallows n > k, as shown in the line:

for idx in list(combinations(range(len(history)), n-1)):
    pass # Do something
Given a list of unique items, combinations produces the following:

>>> sent = ['this', 'is', 'a', 'foo', 'bar']
>>> current_token = sent.pop(0) # i.e. 'this'
>>> range(len(sent))
[0, 1, 2, 3]
Since the indices of the token list are always unique, e.g.:

>>> n = 3
>>> list(combinations(range(len(sent)), n-1))
[(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
it's possible to compute the possible combinations of the range:

>>> [tuple(sent[id] for id in idx) for idx in combinations(range(len(sent)), 2)]
[('is', 'a'), ('is', 'foo'), ('is', 'bar'), ('a', 'foo'), ('a', 'bar'), ('foo', 'bar')]
And if we map the indices back to the token list:

>>> [tuple([current_token]) + tuple(sent[id] for id in idx) for idx in combinations(range(len(sent)), 2)]
[('this', 'is', 'a'), ('this', 'is', 'foo'), ('this', 'is', 'bar'), ('this', 'a', 'foo'), ('this', 'a', 'bar'), ('this', 'foo', 'bar')]
Then we concatenate with the current_token, and we get the skipgrams for the current token and its context + skip window:

def skipgrams(sequence, n, k, **kwargs):
    """
    Returns all possible skipgrams generated from a sequence of items, as an iterator.
    Skipgrams are ngrams that allow tokens to be skipped.
    Refer to http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf

    :param sequence: the source data to be converted into trigrams
    :type sequence: sequence or iter
    :param n: the degree of the ngrams
    :type n: int
    :param k: the skip distance
    :type  k: int
    :rtype: iter(tuple)
    """

    # Pads the sequence as desired by **kwargs.
    if 'pad_left' in kwargs or 'pad_right' in kwargs:
        sequence = pad_sequence(sequence, n, **kwargs)

    # Note when iterating through the ngrams, the pad_right here is not
    # the **kwargs padding, it's for the algorithm to detect the SENTINEL
    # object on the right pad to stop the inner loop.
    SENTINEL = object()
    for ngram in ngrams(sequence, n + k, pad_right=True, right_pad_symbol=SENTINEL):
        head = ngram[:1]
        tail = ngram[1:]
        for skip_tail in combinations(tail, n - 1):
            if skip_tail[-1] is SENTINEL:
                continue
            yield head + skip_tail
Then we move on to the next word.
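The walkthrough above (slide a head token forward, then pair it with every (n-1)-subset of its context + skip window) can be condensed into a short standalone sketch; the function name and the eager list output are my own choices, not NLTK's:

```python
from itertools import combinations

def skipgrams_sketch(tokens, n, k):
    # Pair each head token with every (n-1)-combination drawn from
    # the next n - 1 + k tokens (the context + skip window).
    out = []
    for i, head in enumerate(tokens):
        window = tokens[i + 1 : i + n + k]
        for tail in combinations(window, n - 1):
            out.append((head,) + tail)
    return out

sent = "Insurgents killed in ongoing fighting".split()
print(skipgrams_sketch(sent, 2, 2))
```

This reproduces the 2-skip-bigram list shown for the same sentence further down.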

The latest NLTK version 3.2.5 has skipgrams implemented. The skipgrams function above is the cleaner implementation by @jnothman from the NLTK repo.

import colibricore

#Prepare corpus data (will be encoded for efficiency)
corpusfile_plaintext = "somecorpus.txt" #input, one sentence per line
encoder = colibricore.ClassEncoder()
encoder.build(corpusfile_plaintext)
corpusfile = "somecorpus.colibri.dat" #corpus output
classfile = "somecorpus.colibri.cls" #class encoding output
encoder.encodefile(corpusfile_plaintext,corpusfile)
encoder.save(classfile)

#Set options for skipgram extraction (mintokens is the occurrence threshold, maxlength maximum ngram/skipgram length)
options = colibricore.PatternModelOptions(mintokens=2,maxlength=8,doskipgrams=True)

#Instantiate an empty pattern model 
model = colibricore.UnindexedPatternModel()

#Train the model on the encoded corpus file (this does the skipgram extraction)
model.train(corpusfile, options)

#Load a decoder so we can view the output
decoder = colibricore.ClassDecoder(classfile)

#Output all skipgrams
for pattern in model:
     if pattern.category() == colibricore.Category.SKIPGRAM:
         print(pattern.tostring(decoder))

Though this departs from your code entirely and defers it to an external library, you can use Colibri Core to extract skipgrams. It is a library written specifically for fast n-gram and skipgram extraction from large text corpora. The code base is in C++ (for speed/efficiency), but Python bindings are available.

You rightfully mention efficiency, as skipgram extraction quickly shows exponential complexity. This may not be a big issue if you only pass a single sentence as in your input_list, but it becomes problematic if you run it on large corpus data. To mitigate this you can set parameters such as an occurrence threshold, or demand that each skip of a skipgram be fillable by at least x distinct n-grams.
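To get a feel for that blow-up: ignoring sentence boundaries, each head token admits up to C(n-1+k, n-1) candidate tails (choose n-1 tail positions out of the next n-1+k). A rough sketch of the growth, not Colibri Core's actual accounting:

```python
from math import comb  # Python 3.8+

# Upper bound on candidate tails per head position for an n-gram
# window that allows up to k skipped tokens in total.
for n in (2, 3, 4):
    for k in (0, 2, 4, 8):
        print(f"n={n} k={k}: {comb(n - 1 + k, n - 1)} candidates")
```

An occurrence threshold such as mintokens prunes most of these candidates early, which is what keeps extraction tractable on real corpora.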

>>> sent = "Insurgents killed in ongoing fighting".split()
>>> list(skipgrams(sent, 2, 2))
[('Insurgents', 'killed'), ('Insurgents', 'in'), ('Insurgents', 'ongoing'), ('killed', 'in'), ('killed', 'ongoing'), ('killed', 'fighting'), ('in', 'ongoing'), ('in', 'fighting'), ('ongoing', 'fighting')]
There is a more extensive Python tutorial about all this on the website.

Disclaimer: I am the author of Colibri Core. See the full information there.

Its usage is already mentioned in the example below, and it works like a charm.


What have you tried? @msw Updated the question!!

No, it doesn't work: print skipgram_ndarray("What is your name") gives ['What,is', 'is,your', 'your,name', 'name,', 'What,your', 'is,name']. "name," is a unigram, and the other function is even more wrong. This implementation is hardcoded for k.

Nice work, but if the length is exceeded I would expect it to return the sentence itself. Can you answer this question: @stackit That's an entirely different NLP task, but I'll give it a try when I'm free =)

Regarding elif nk > sequence_length: for ng in skipgrams(list(sequence), n, k-1): yield ng, it basically yields the same way ngrams are generated normally. I would keep it as it is rather than returning a single list of strings.

Thanks for checking this question; it's surprising that such a common problem hasn't been solved yet. Cool, where's the link? Yes, I tried it before writing this question, but could not install colibri on Ubuntu. I improved the installation procedure and instructions last week; I hope it's less of a hassle now.

@proycon Would it be possible to create a duck type in colibri's Python wrapper so that the interface looks like NLTK, e.g. colibri.ngrams(text, n=3) or colibri.skipgram(text, n=3, k=2)? Or would it be easier to re-implement some colibri wrapper in the NLTK repo?

@alvas I'm afraid the extra overhead would come at a significant performance cost and could lead to inefficient code. Encoding and decoding Python strings to colibri's internal compressed representation...
>>> from nltk.util import skipgrams
>>> sent = "Insurgents killed in ongoing fighting".split()
>>> list(skipgrams(sent, 2, 2))
[('Insurgents', 'killed'), ('Insurgents', 'in'), ('Insurgents', 'ongoing'), ('killed', 'in'), ('killed', 'ongoing'), ('killed', 'fighting'), ('in', 'ongoing'), ('in', 'fighting'), ('ongoing', 'fighting')]
>>> list(skipgrams(sent, 3, 2))
[('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'), ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'), ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'), ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]