Python numpy - how to combine multiple indices (replace multiple one-by-one matrix accesses with a single access). Update: multiple occurrences of the same word and self co-occurrence were not taken into account.


For example, when stride = 2 and the word at the current position is W, the co-occurrence count for X needs +2 and the self co-occurrence of W needs +1:

X | Y | W | X | W
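
A tiny sketch of that counting rule (using letters instead of word ids, only to illustrate):

    from collections import Counter

    window = ["X", "Y", "W", "X", "W"]        # stride = 2, centre index = 2
    centre = 2
    context = window[:centre] + window[centre + 1:]
    print(Counter(context))                   # Counter({'X': 2, 'Y': 1, 'W': 1})
    # the second W in the window is what gives the +1 self co-occurrence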

Question: updating the m*m matrix, which is currently accessed row by row in a loop. The entire code is at the bottom.

How can I remove the loop and update multiple rows at the same time? I think there should be a way to combine the individual indices into one index matrix and replace the loop with a single vectorized update.

Please suggest possible approaches.
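
One detail that matters for any combined-index approach: plain fancy-index `+= 1` does not accumulate repeated index pairs, while `np.add.at` does. A small demonstration (toy numbers, not taken from the code below):

    import numpy as np

    m = np.zeros((3, 3), dtype=np.int32)
    rows = np.array([0, 0, 2, 0])           # the pair (0, 1) appears twice
    cols = np.array([1, 1, 2, 0])

    m[rows, cols] += 1                      # buffered: the repeated pair is counted once
    print(m[0, 1])                          # 1

    m[:] = 0
    np.add.at(m, (rows, cols), 1)           # unbuffered add: repeats accumulate
    print(m[0, 1])                          # 2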

Current implementation

  • Loop over a word-index sequence (a word index is the integer code of each word).
  • For each word at a position in the loop, check the words that co-occur on both sides within stride distance. This is an N-gram context window, shown as the purple box in the diagram: N = context_size = stride*2 + 1.
  • Increment the count of each co-occurring word in the co-occurrence matrix, following the blue lines in the diagram.
  • Tried: this seems to be a way to access multiple rows at the same time.

    x = np.array([[ 0,  1,  2],
                  [ 3,  4,  5],
                  [ 6,  7,  8],
                  [ 9, 10, 11]])
    rows = np.array([[0, 0],
                     [3, 3]], dtype=np.intp)
    columns = np.array([[0, 2],
                        [0, 2]], dtype=np.intp)
    x[rows, columns]
    ---
    array([[ 0,  2],
           [ 9, 11]])
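
    # Side note (an added illustration): the same selection can also be written with
    # broadcast index arrays; a (2, 1) row index against a (1, 2) column index
    # broadcasts to the same (2, 2) result as above.
    rows_col = np.array([[0], [3]], dtype=np.intp)   # shape (2, 1)
    cols_row = np.array([[0, 2]], dtype=np.intp)     # shape (1, 2)
    x[rows_col, cols_row]                            # -> array([[ 0,  2], [ 9, 11]])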
    
    Created a multi-dimensional index by combining each per-position index from the loop, but it raises an error. Please advise why and where the mistake is, or whether the attempt does not make sense in the first place.

        indices = np.array([
            [
                sequence[0],                                         # position  to the word
                sequence[max(0, 0-stride) : min((0+stride),n-1) +1]  # positions to co-occurrence words
            ]]
        )
        assert n > 1
        for position in range(1, n):
            co_occurrence_indices = np.array([
                [
                    sequence[position],                                                # position  to the word
                    sequence[max(0, position-stride) : min((position+stride),n-1) +1]  # positions to co-occurrence words
                ]]
            )
            indices = np.append(
                indices,
                co_occurrence_indices,
                axis=0
            )
    
        print("Updating the co_occurrence_matrix: indices \n{} \nindices.dtype {}".format(
            indices,
            indices.dtype
        ))
        co_ccurrence_matrix[  
            indices              <---- Error
        ] += 1
     
    
    Updating the co_occurrence_matrix: indices 
    [[0 array([0, 1])]
     [1 array([0, 1, 2])]
     [2 array([1, 2, 3])]
     [3 array([2, 3, 0])]
     [0 array([3, 0, 1])]
     [1 array([0, 1, 4])]
     [4 array([1, 4, 5])]
     [5 array([4, 5, 6])]
     [6 array([5, 6, 7])]
     [7 array([6, 7])]] 
    indices.dtype object
    
    <ipython-input-88-d9b081bf2f1a>:48: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
      indices = np.array([
    <ipython-input-88-d9b081bf2f1a>:56: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
      co_occurrence_indices = np.array([
    
    ---------------------------------------------------------------------------
    IndexError                                Traceback (most recent call last)
    <ipython-input-88-d9b081bf2f1a> in <module>
         84 sequence, word_to_id, id_to_word = preprocess(corpus)
         85 vocabrary_size = max(word_to_id.values()) + 1
    ---> 86 create_cooccurrence_matrix(sequence, vocabrary_size , 3)
    
    <ipython-input-88-d9b081bf2f1a> in create_cooccurrence_matrix(sequence, vocabrary_size, context_size)
         70         indices.dtype
         71     ))
    ---> 72     co_ccurrence_matrix[
         73         indices
         74     ] += 1
    
    IndexError: arrays used as indices must be of integer (or boolean) type
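
A likely reason for the warning and the traceback above: because the context window is clipped at the edges of the sequence, the per-position index rows have different lengths, so `np.array` builds a ragged array with `dtype=object`, and an object array is not a valid integer index. A minimal reproduction of the same error:

    import numpy as np

    x = np.zeros((4, 4), dtype=np.int32)
    ragged = np.array([np.array([0, 1]), np.array([0, 1, 2])], dtype=object)
    print(ragged.dtype)      # object, because the rows have different lengths
    x[ragged] += 1           # IndexError: arrays used as indices must be of
                             # integer (or boolean) type

Fancy indexing needs a rectangular integer array (like `rows` and `columns` in the earlier example) or flat, equal-length arrays of row and column indices.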
    
    import numpy as np
     
    def preprocess(text):
        """
        Args:
            text: A string including sentences to process. corpus
        Returns:
            sequence:
                A numpy array of word indices to every word in the original text as they appear in the text.
                The objective of corpus is to preserve the original text but as numerical indices.
            word_to_id: A dictionary to map a word to a word index
            id_to_word: A dictionary to map a word index to a word
        """
        text = text.lower()
        text = text.replace('.', ' .')
        words = text.split(' ')
     
        word_to_id = {}
        id_to_word = {}
        for word in words:
            if word not in word_to_id:
                new_id = len(word_to_id)
                word_to_id[word] = new_id
                id_to_word[new_id] = word
     
        sequence= np.array([word_to_id[w] for w in words])
     
        return sequence, word_to_id, id_to_word
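 
    # For example, for the corpus used at the bottom of this listing, preprocess() gives
    #   sequence   -> [0 1 2 3 0 1 4 5 6 7]
    #   word_to_id -> {'to': 0, 'be,': 1, 'or': 2, 'not': 3, 'that': 4,
    #                  'is': 5, 'the': 6, 'question': 7}
    # ('.' is split off as its own token, but ',' stays attached to the word).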
     
     
    def create_cooccurrence_matrix(sequence, vocabrary_size, context_size=3):
        """
        Args:
            sequence: word index sequence of the original corpus text
            vocabrary_size: number of words in the vocabrary (same with co-occurrence vector size)
            context_size: context (N-gram size N) within which to check co-occurrences.         
        """
        n = sequence_size = len(sequence)
        co_ccurrence_matrix = np.zeros((vocabrary_size, vocabrary_size), dtype=np.int32)
     
        stride = int((context_size - 1)/2 )
        assert(n > stride), "sequence_size {} is less than/equal to stride {}".format(
            n, stride
        )
     
        for position in range(0, n):       
            co_ccurrence_matrix[
                sequence[position],                                                # position  to the word
                sequence[max(0, position-stride) : min((position+stride),n-1) +1]  # positions to co-occurrence words
            ] += 1
     
        np.fill_diagonal(co_ccurrence_matrix, 0)
        return co_ccurrence_matrix
     
     
    corpus= "To be, or not to be, that is the question"
     
    sequence, word_to_id, id_to_word = preprocess(corpus)
    vocabrary_size = max(word_to_id.values()) + 1
    create_cooccurrence_matrix(sequence, vocabrary_size , 3)
    ---
    [[0 2 0 1 0 0 0 0]
     [2 0 1 0 1 0 0 0]
     [0 1 0 1 0 0 0 0]
     [1 0 1 0 0 0 0 0]
     [0 1 0 0 0 1 0 0]
     [0 0 0 0 1 0 1 0]
     [0 0 0 0 0 1 0 1]
     [0 0 0 0 0 0 1 0]]
    
Profile (using ptb.train.txt):

    Timer unit: 1e-06 s
    
    Total time: 23.0015 s
    File: <ipython-input-8-27f5e530d4ff>
    Function: create_cooccurrence_matrix at line 1
    
    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
         1                                           def create_cooccurrence_matrix(sequence, vocabrary_size, context_size=3):
         2                                               """
         3                                               Args: 
         4                                                   sequence: word index sequence of the original corpus text
         5                                                   vocabrary_size: number of words in the vocabrary (same with co-occurrence vector size)
         6                                                   context_size: context (N-gram size N) within to check co-occurrences.
         7                                               Returns:
         8                                                   co_occurrence matrix
         9                                               """
        10         1          4.0      4.0      0.0      n = sequence_size = len(sequence)
        11         1         98.0     98.0      0.0      co_occurrence_matrix = np.zeros((vocabrary_size, vocabrary_size), dtype=np.int32)
        12                                           
        13         1          5.0      5.0      0.0      stride = int((context_size - 1)/2 )
        14         1          1.0      1.0      0.0      assert(n > stride), "sequence_size {} is less than/equal to stride {}".format(
        15                                                   n, stride
        16                                               )
        17                                           
        18                                               """
        19                                               # Handle position=slice(0 : (stride-1) +1),       co-occurrences=slice(max(0, position-stride): min((position+stride),n-1) +1)
        20                                               # Handle position=slice((n-1-stride) : (n-1) +1), co-occurrences=slice(max(0, position-stride): min((position+stride),n-1) +1)
        21                                               indices = [*range(0, (stride-1) +1), *range((n-1)-stride +1, (n-1) +1)]
        22                                               #print(indices)
        23                                               
        24                                               for position in indices:
        25                                                   debug(sequence, position, stride, False)
        26                                                   co_occurrence_matrix[
        27                                                       sequence[position],                                             # position to the word
        28                                                       sequence[max(0, position-stride) : min((position+stride),n-1) +1]  # indices to co-occurance words 
        29                                                   ] += 1
        30                                           
        31                                               
        32                                               # Handle position=slice(stride, ((sequence_size-1) - stride) +1)
        33                                               for position in range(stride, (sequence_size-1) - stride + 1):        
        34                                                   co_occurrence_matrix[
        35                                                       sequence[position],                                 # position to the word
        36                                                       sequence[(position-stride) : (position + stride + 1)]  # indices to co-occurance words 
        37                                                   ] += 1
        38                                               """        
        39                                               
        40    929590    1175326.0      1.3      5.1      for position in range(0, n):        
        41   2788767   15304643.0      5.5     66.5          co_occurrence_matrix[
        42   1859178    2176964.0      1.2      9.5              sequence[position],                                                # position  to the word
        43    929589    3280181.0      3.5     14.3              sequence[max(0, position-stride) : min((position+stride),n-1) +1]  # positions to co-occurance words 
        44    929589    1062613.0      1.1      4.6          ] += 1
        45                                           
        46         1       1698.0   1698.0      0.0      np.fill_diagonal(co_occurrence_matrix, 0)
        47                                               
        48         1          2.0      2.0      0.0      return co_occurrence_matrix
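
For reference, a minimal sketch (function and variable names here are invented, not from the post) of one way to vectorize the update that dominates the profile above: loop over the 2*stride+1 window offsets instead of over every position, and let `np.add.at` accumulate all positions for a given offset in one call. It should produce the same counts as the loop version, since the diagonal is zeroed afterwards in both.

    import numpy as np

    def create_cooccurrence_matrix_addat(sequence, vocab_size, context_size=3):
        # `sequence` is the integer word-id array returned by preprocess().
        n = len(sequence)
        stride = (context_size - 1) // 2
        m = np.zeros((vocab_size, vocab_size), dtype=np.int32)
        # One scattered add per window offset instead of one update per position.
        for offset in range(-stride, stride + 1):
            lo, hi = max(0, -offset), min(n, n - offset)
            centers = sequence[lo:hi]                     # word at each position
            contexts = sequence[lo + offset:hi + offset]  # word `offset` steps away
            np.add.at(m, (centers, contexts), 1)          # repeated pairs accumulate
        np.fill_diagonal(m, 0)
        return m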
    
    #Definitions
    sentences, vocab, length, context_size = 100, 12, 15, 2
    
    #Create dummy corpus (label encoded)
    window = context_size*2+1
    corpus = np.random.randint(0, vocab, (sentences, length))  #(100, 15)
    
    #Create rolling window view of the sequences
    shape = corpus.shape[0], corpus.shape[1]-window+1, window  #(100, 11, 5) 
    stride = corpus.strides[0], corpus.strides[1], corpus.strides[1]  #(120, 8, 8)
    rolling_window = np.lib.stride_tricks.as_strided(corpus, shape=shape, strides=stride)  #(100, 11, 5)
    
    #Creating co-occurrence matrix based on context window
    center_idx = context_size
    #position = rolling_window[:,:,context_size]  #(100, 11)
    context = np.delete(rolling_window, center_idx, -1)  #(100, 11, 4)
    context_multihot = np.sum(np.eye(vocab)[context], axis=-2)  #(100, 11, 12)
    cooccurence = np.tensordot(context_multihot.transpose(0,2,1), context_multihot, axes=([0,2],[0,1]))  #(12, 12)
    np.fill_diagonal(cooccurence,0)  #(12, 12)
    print(cooccurence)
    
    [[  0.  94. 100. 114.  91.  92.  90. 128. 100. 114.  91.  84.]
     [ 94.   0.  78.  96.  90.  65.  76.  68.  76. 108.  58.  68.]
     [100.  78.   0. 125. 107.  93.  83.  84.  73.  84.  97. 110.]
     [114.  96. 125.   0.  84.  97.  76. 110.  80.  94. 117.  97.]
     [ 91.  90. 107.  84.   0.  84.  87. 103.  60. 127. 123.  97.]
     [ 92.  65.  93.  97.  84.   0.  67.  87.  72.  87.  74.  92.]
     [ 90.  76.  83.  76.  87.  67.   0.  83.  73. 118.  81. 108.]
     [128.  68.  84. 110. 103.  87.  83.   0.  72. 100. 115.  69.]
     [100.  76.  73.  80.  60.  72.  73.  72.   0.  83.  81. 100.]
     [114. 108.  84.  94. 127.  87. 118. 100.  83.   0. 109. 110.]
     [ 91.  58.  97. 117. 123.  74.  81. 115.  81. 109.   0. 104.]
     [ 84.  68. 110.  97.  97.  92. 108.  69. 100. 110. 104.   0.]]
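
If a recent NumPy is available (1.20 or later, an assumption not stated in the post), `numpy.lib.stride_tricks.sliding_window_view` builds the same rolling view as the `as_strided` call above without computing shapes and strides by hand, and the result is bounds-checked and read-only:

    import numpy as np

    sentences, vocab, length, context_size = 100, 12, 15, 2
    window = context_size*2 + 1
    corpus = np.random.randint(0, vocab, (sentences, length))

    # Same (100, 11, 5) rolling view as the manual as_strided construction.
    rolling_window = np.lib.stride_tricks.sliding_window_view(corpus, window, axis=1)
    print(rolling_window.shape)  # (100, 11, 5)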
    
    sentence = 'to be or not to be that is the question'
    corpus = np.array([[0, 1, 2, 3, 0, 1, 4, 5, 6, 7]])
    
    #Definitions
    vocab, context_size = 8, 2
    window = context_size*2+1
    
    #Create rolling window view of the sequences
    shape = corpus.shape[0], corpus.shape[1]-window+1, window
    stride = corpus.strides[0], corpus.strides[1], corpus.strides[1]
    rolling_window = np.lib.stride_tricks.as_strided(corpus, shape=shape, strides=stride)
    
    #Creating co-occurrence matrix based on context window
    center_idx = context_size
    #position = rolling_window[:,:,context_size]  
    context = np.delete(rolling_window, center_idx, -1)  
    context_multihot = np.sum(np.eye(vocab)[context], axis=-2)  
    cooccurence = np.tensordot(context_multihot.transpose(0,2,1), context_multihot, axes=([0,2],[0,1]))
    np.fill_diagonal(cooccurence,0)
    print(cooccurence)
    
    [[0. 5. 1. 3. 1. 2. 1. 0.]
     [5. 0. 3. 2. 2. 1. 2. 1.]
     [1. 3. 0. 1. 1. 0. 0. 0.]
     [3. 2. 1. 0. 2. 1. 0. 0.]
     [1. 2. 1. 2. 0. 1. 1. 1.]
     [2. 1. 0. 1. 1. 0. 1. 0.]
     [1. 2. 0. 0. 1. 1. 0. 1.]
     [0. 1. 0. 0. 1. 0. 1. 0.]]
    
    sentences, vocab, length, context_size = 100, 12, 15, 2
    window = context_size*2+1
    corpus = np.random.randint(0, vocab, (sentences, length))
    corpus[0:2]
    
    #top 2 sentences
    array([[ 9,  8,  9,  4,  2, 10,  9,  0,  7,  1, 11,  0,  7,  3,  1],
           [ 7,  9,  4,  0,  1,  9, 10,  7,  4,  2,  2,  3,  5,  8,  8]])
    
    #Create shape and stride definitions
    shape = corpus.shape[0], corpus.shape[1]-window+1, window
    stride = corpus.strides[0], corpus.strides[1], corpus.strides[1]
    print(shape, stride)
    
    #create view
    rolling_window = np.lib.stride_tricks.as_strided(corpus, shape=shape, strides=stride)  #(100, 11, 5)
    print('\nView for first sequence ->')
    print(rolling_window[0])
    
    (100, 11, 5) (120, 8, 8)
    
    View for first sequence ->
    [[ 9  8  9  4  2]
     [ 8  9  4  2 10]
     [ 9  4  2 10  9]
     [ 4  2 10  9  0]
     [ 2 10  9  0  7]
     [10  9  0  7  1]
     [ 9  0  7  1 11]
     [ 0  7  1 11  0]
     [ 7  1 11  0  7]
     [ 1 11  0  7  3]
     [11  0  7  3  1]]
    
    position = rolling_window[0][:,2]
    context = np.delete(rolling_window[0], 2, 1)
    context_multihot = np.sum(np.eye(vocab)[context], axis=1)
    cooccurence = context_multihot.T@context_multihot
    np.fill_diagonal(cooccurence,0)
    print(cooccurence)
    
    [[0. 3. 2. 1. 1. 0. 0. 5. 0. 2. 1. 4.]
     [3. 0. 0. 2. 0. 0. 0. 4. 0. 2. 1. 3.]
     [2. 0. 0. 0. 2. 0. 0. 1. 2. 3. 2. 0.]
     [1. 2. 0. 0. 0. 0. 0. 1. 0. 0. 0. 2.]
     [1. 0. 2. 0. 0. 0. 0. 0. 1. 4. 1. 0.]
     [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
     [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
     [5. 4. 1. 1. 0. 0. 0. 0. 0. 1. 2. 2.]
     [0. 0. 2. 0. 1. 0. 0. 0. 0. 2. 1. 0.]
     [2. 2. 3. 0. 4. 0. 0. 1. 2. 0. 4. 1.]
     [1. 1. 2. 0. 1. 0. 0. 2. 1. 4. 0. 0.]
     [4. 3. 0. 2. 0. 0. 0. 2. 0. 1. 0. 0.]]
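
The batched `np.tensordot` used earlier is this same per-sequence product summed over all sequences, i.e. the sum over b of context_multihot_b.T @ context_multihot_b. For a single sequence the two formulations agree; a small check, reusing `context_multihot` from the snippet above:

    single = context_multihot.T @ context_multihot
    batched_style = np.tensordot(context_multihot.T[None], context_multihot[None],
                                 axes=([0, 2], [0, 1]))
    print(np.allclose(single, batched_style))  # True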
    
    #Finally, creating the co-occurrence matrix over the full (100, 11, 5) batch, using the pieces above
    center_idx = context_size
    #position = rolling_window[:,:,context_size]  #(100, 11)
    context = np.delete(rolling_window, center_idx, -1)  #(100, 11, 4)
    context_multihot = np.sum(np.eye(vocab)[context], axis=-2)  #(100, 11, 12)
    cooccurence = np.tensordot(context_multihot.transpose(0,2,1), context_multihot, axes=([0,2],[0,1]))  #(12, 12)
    np.fill_diagonal(cooccurence,0)  #(12, 12)
    print(cooccurence)
    
    [[  0.  94. 100. 114.  91.  92.  90. 128. 100. 114.  91.  84.]
     [ 94.   0.  78.  96.  90.  65.  76.  68.  76. 108.  58.  68.]
     [100.  78.   0. 125. 107.  93.  83.  84.  73.  84.  97. 110.]
     [114.  96. 125.   0.  84.  97.  76. 110.  80.  94. 117.  97.]
     [ 91.  90. 107.  84.   0.  84.  87. 103.  60. 127. 123.  97.]
     [ 92.  65.  93.  97.  84.   0.  67.  87.  72.  87.  74.  92.]
     [ 90.  76.  83.  76.  87.  67.   0.  83.  73. 118.  81. 108.]
     [128.  68.  84. 110. 103.  87.  83.   0.  72. 100. 115.  69.]
     [100.  76.  73.  80.  60.  72.  73.  72.   0.  83.  81. 100.]
     [114. 108.  84.  94. 127.  87. 118. 100.  83.   0. 109. 110.]
     [ 91.  58.  97. 117. 123.  74.  81. 115.  81. 109.   0. 104.]
     [ 84.  68. 110.  97.  97.  92. 108.  69. 100. 110. 104.   0.]]