Python numpy - how to combine multiple indices (replace multiple one-by-one matrix accesses with a single access). Update: multiple occurrences of the same word and self co-occurrence were not taken into account.


For example, when stride = 2 and the word at the current position is W, the co-occurrence count for X needs +2 and the self co-occurrence of W needs +1:

X | Y | W | X | W
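
A tiny sketch of that counting rule (using letters instead of word ids, only to illustrate):

    from collections import Counter

    window = ["X", "Y", "W", "X", "W"]        # stride = 2, centre index = 2
    centre = 2
    context = window[:centre] + window[centre + 1:]
    print(Counter(context))                   # Counter({'X': 2, 'Y': 1, 'W': 1})
    # the second W in the window is what gives the +1 self co-occurrence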

Question: updating the m*m matrix, which is currently accessed row by row in a loop. The entire code is at the bottom.

How can I remove the loop and update multiple rows at the same time? I think there should be a way to combine the individual indices into one index matrix and replace the loop with a single vectorized update.

Please suggest possible approaches.
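
One detail that matters for any combined-index approach: plain fancy-index `+= 1` does not accumulate repeated index pairs, while `np.add.at` does. A small demonstration (toy numbers, not taken from the code below):

    import numpy as np

    m = np.zeros((3, 3), dtype=np.int32)
    rows = np.array([0, 0, 2, 0])           # the pair (0, 1) appears twice
    cols = np.array([1, 1, 2, 0])

    m[rows, cols] += 1                      # buffered: the repeated pair is counted once
    print(m[0, 1])                          # 1

    m[:] = 0
    np.add.at(m, (rows, cols), 1)           # unbuffered add: repeats accumulate
    print(m[0, 1])                          # 2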

Current implementation

  • Loop over a word-index sequence (a word index is the integer code of each word).
  • For each word at a position in the loop, check the words that co-occur on both sides within stride distance. This is an N-gram context window, shown as the purple box in the diagram: N = context_size = stride*2 + 1.
  • Increment the count of each co-occurring word in the co-occurrence matrix, following the blue lines in the diagram.
  • Tried: this seems to be a way to access multiple rows at the same time.

    x = np.array([[ 0,  1,  2],
                  [ 3,  4,  5],
                  [ 6,  7,  8],
                  [ 9, 10, 11]])
    rows = np.array([[0, 0],
                     [3, 3]], dtype=np.intp)
    columns = np.array([[0, 2],
                        [0, 2]], dtype=np.intp)
    x[rows, columns]
    ---
    array([[ 0,  2],
           [ 9, 11]])
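
    # Side note (an added illustration): the same selection can also be written with
    # broadcast index arrays; a (2, 1) row index against a (1, 2) column index
    # broadcasts to the same (2, 2) result as above.
    rows_col = np.array([[0], [3]], dtype=np.intp)   # shape (2, 1)
    cols_row = np.array([[0, 2]], dtype=np.intp)     # shape (1, 2)
    x[rows_col, cols_row]                            # -> array([[ 0,  2], [ 9, 11]])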
    
    Created a multi-dimensional index by combining each per-position index from the loop, but it raises an error. Please advise why and where the mistake is, or whether the attempt does not make sense in the first place.

        indices = np.array([
            [
                sequence[0],                                         # position  to the word
                sequence[max(0, 0-stride) : min((0+stride),n-1) +1]  # positions to co-occurrence words
            ]]
        )
        assert n > 1
        for position in range(1, n):
            co_occurrence_indices = np.array([
                [
                    sequence[position],                                                # position  to the word
                    sequence[max(0, position-stride) : min((position+stride),n-1) +1]  # positions to co-occurrence words
                ]]
            )
            indices = np.append(
                indices,
                co_occurrence_indices,
                axis=0
            )
    
        print("Updating the co_occurrence_matrix: indices \n{} \nindices.dtype {}".format(
            indices,
            indices.dtype
        ))
        co_ccurrence_matrix[  
            indices              <---- Error
        ] += 1
     
    
    Updating the co_occurrence_matrix: indices 
    [[0 array([0, 1])]
     [1 array([0, 1, 2])]
     [2 array([1, 2, 3])]
     [3 array([2, 3, 0])]
     [0 array([3, 0, 1])]
     [1 array([0, 1, 4])]
     [4 array([1, 4, 5])]
     [5 array([4, 5, 6])]
     [6 array([5, 6, 7])]
     [7 array([6, 7])]] 
    indices.dtype object
    
    <ipython-input-88-d9b081bf2f1a>:48: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
      indices = np.array([
    <ipython-input-88-d9b081bf2f1a>:56: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
      co_occurrence_indices = np.array([
    
    ---------------------------------------------------------------------------
    IndexError                                Traceback (most recent call last)
    <ipython-input-88-d9b081bf2f1a> in <module>
         84 sequence, word_to_id, id_to_word = preprocess(corpus)
         85 vocabrary_size = max(word_to_id.values()) + 1
    ---> 86 create_cooccurrence_matrix(sequence, vocabrary_size , 3)
    
    <ipython-input-88-d9b081bf2f1a> in create_cooccurrence_matrix(sequence, vocabrary_size, context_size)
         70         indices.dtype
         71     ))
    ---> 72     co_ccurrence_matrix[
         73         indices
         74     ] += 1
    
    IndexError: arrays used as indices must be of integer (or boolean) type
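
A likely reason for the warning and the traceback above: because the context window is clipped at the edges of the sequence, the per-position index rows have different lengths, so `np.array` builds a ragged array with `dtype=object`, and an object array is not a valid integer index. A minimal reproduction of the same error:

    import numpy as np

    x = np.zeros((4, 4), dtype=np.int32)
    ragged = np.array([np.array([0, 1]), np.array([0, 1, 2])], dtype=object)
    print(ragged.dtype)      # object, because the rows have different lengths
    x[ragged] += 1           # IndexError: arrays used as indices must be of
                             # integer (or boolean) type

Fancy indexing needs a rectangular integer array (like `rows` and `columns` in the earlier example) or flat, equal-length arrays of row and column indices.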
    
    import numpy as np
     
    def preprocess(text):
        """
        Args:
            text: A string including sentences to process. corpus
        Returns:
            sequence:
                A numpy array of word indices to every word in the original text as they appear in the text.
                The objective of corpus is to preserve the original text but as numerical indices.
            word_to_id: A dictionary to map a word to a word index
            id_to_word: A dictionary to map a word index to a word
        """
        text = text.lower()
        text = text.replace('.', ' .')
        words = text.split(' ')
     
        word_to_id = {}
        id_to_word = {}
        for word in words:
            if word not in word_to_id:
                new_id = len(word_to_id)
                word_to_id[word] = new_id
                id_to_word[new_id] = word
     
        sequence= np.array([word_to_id[w] for w in words])
     
        return sequence, word_to_id, id_to_word
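 
    # For example, for the corpus used at the bottom of this listing, preprocess() gives
    #   sequence   -> [0 1 2 3 0 1 4 5 6 7]
    #   word_to_id -> {'to': 0, 'be,': 1, 'or': 2, 'not': 3, 'that': 4,
    #                  'is': 5, 'the': 6, 'question': 7}
    # ('.' is split off as its own token, but ',' stays attached to the word).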
     
     
    def create_cooccurrence_matrix(sequence, vocabrary_size, context_size=3):
        """
        Args:
            sequence: word index sequence of the original corpus text
            vocabrary_size: number of words in the vocabrary (same with co-occurrence vector size)
            context_size: context (N-gram size N) within which to check co-occurrences.         
        """
        n = sequence_size = len(sequence)
        co_ccurrence_matrix = np.zeros((vocabrary_size, vocabrary_size), dtype=np.int32)
     
        stride = int((context_size - 1)/2 )
        assert(n > stride), "sequence_size {} is less than/equal to stride {}".format(
            n, stride
        )
     
        for position in range(0, n):       
            co_ccurrence_matrix[
                sequence[position],                                                # position  to the word
                sequence[max(0, position-stride) : min((position+stride),n-1) +1]  # positions to co-occurrence words
            ] += 1
     
        np.fill_diagonal(co_ccurrence_matrix, 0)
        return co_ccurrence_matrix
     
     
    corpus= "To be, or not to be, that is the question"
     
    sequence, word_to_id, id_to_word = preprocess(corpus)
    vocabrary_size = max(word_to_id.values()) + 1
    create_cooccurrence_matrix(sequence, vocabrary_size , 3)
    ---
    [[0 2 0 1 0 0 0 0]
     [2 0 1 0 1 0 0 0]
     [0 1 0 1 0 0 0 0]
     [1 0 1 0 0 0 0 0]
     [0 1 0 0 0 1 0 0]
     [0 0 0 0 1 0 1 0]
     [0 0 0 0 0 1 0 1]
     [0 0 0 0 0 0 1 0]]
    
Profile (using ptb.train.txt):

    Timer unit: 1e-06 s
    
    Total time: 23.0015 s
    File: <ipython-input-8-27f5e530d4ff>
    Function: create_cooccurrence_matrix at line 1
    
    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
         1                                           def create_cooccurrence_matrix(sequence, vocabrary_size, context_size=3):
         2                                               """
         3                                               Args: 
         4                                                   sequence: word index sequence of the original corpus text
         5                                                   vocabrary_size: number of words in the vocabrary (same with co-occurrence vector size)
         6                                                   context_size: context (N-gram size N) within to check co-occurrences.
         7                                               Returns:
         8                                                   co_occurrence matrix
         9                                               """
        10         1          4.0      4.0      0.0      n = sequence_size = len(sequence)
        11         1         98.0     98.0      0.0      co_occurrence_matrix = np.zeros((vocabrary_size, vocabrary_size), dtype=np.int32)
        12                                           
        13         1          5.0      5.0      0.0      stride = int((context_size - 1)/2 )
        14         1          1.0      1.0      0.0      assert(n > stride), "sequence_size {} is less than/equal to stride {}".format(
        15                                                   n, stride
        16                                               )
        17                                           
        18                                               """
        19                                               # Handle position=slice(0 : (stride-1) +1),       co-occurrences=slice(max(0, position-stride): min((position+stride),n-1) +1)
        20                                               # Handle position=slice((n-1-stride) : (n-1) +1), co-occurrences=slice(max(0, position-stride): min((position+stride),n-1) +1)
        21                                               indices = [*range(0, (stride-1) +1), *range((n-1)-stride +1, (n-1) +1)]
        22                                               #print(indices)
        23                                               
        24                                               for position in indices:
        25                                                   debug(sequence, position, stride, False)
        26                                                   co_occurrence_matrix[
        27                                                       sequence[position],                                             # position to the word
        28                                                       sequence[max(0, position-stride) : min((position+stride),n-1) +1]  # indices to co-occurance words 
        29                                                   ] += 1
        30                                           
        31                                               
        32                                               # Handle position=slice(stride, ((sequence_size-1) - stride) +1)
        33                                               for position in range(stride, (sequence_size-1) - stride + 1):        
        34                                                   co_occurrence_matrix[
        35                                                       sequence[position],                                 # position to the word
        36                                                       sequence[(position-stride) : (position + stride + 1)]  # indices to co-occurance words 
        37                                                   ] += 1
        38                                               """        
        39                                               
        40    929590    1175326.0      1.3      5.1      for position in range(0, n):        
        41   2788767   15304643.0      5.5     66.5          co_occurrence_matrix[
        42   1859178    2176964.0      1.2      9.5              sequence[position],                                                # position  to the word
        43    929589    3280181.0      3.5     14.3              sequence[max(0, position-stride) : min((position+stride),n-1) +1]  # positions to co-occurance words 
        44    929589    1062613.0      1.1      4.6          ] += 1
        45                                           
        46         1       1698.0   1698.0      0.0      np.fill_diagonal(co_occurrence_matrix, 0)
        47                                               
        48         1          2.0      2.0      0.0      return co_occurrence_matrix
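
For reference, a minimal sketch (function and variable names here are invented, not from the post) of one way to vectorize the update that dominates the profile above: loop over the 2*stride+1 window offsets instead of over every position, and let `np.add.at` accumulate all positions for a given offset in one call. It should produce the same counts as the loop version, since the diagonal is zeroed afterwards in both.

    import numpy as np

    def create_cooccurrence_matrix_addat(sequence, vocab_size, context_size=3):
        # `sequence` is the integer word-id array returned by preprocess().
        n = len(sequence)
        stride = (context_size - 1) // 2
        m = np.zeros((vocab_size, vocab_size), dtype=np.int32)
        # One scattered add per window offset instead of one update per position.
        for offset in range(-stride, stride + 1):
            lo, hi = max(0, -offset), min(n, n - offset)
            centers = sequence[lo:hi]                     # word at each position
            contexts = sequence[lo + offset:hi + offset]  # word `offset` steps away
            np.add.at(m, (centers, contexts), 1)          # repeated pairs accumulate
        np.fill_diagonal(m, 0)
        return m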
    
    #Definitions
    sentences, vocab, length, context_size = 100, 12, 15, 2
    
    #Create dummy corpus (label encoded)
    window = context_size*2+1
    corpus = np.random.randint(0, vocab, (sentences, length))  #(100, 15)
    
    #Create rolling window view of the sequences
    shape = corpus.shape[0], corpus.shape[1]-window+1, window  #(100, 11, 5) 
    stride = corpus.strides[0], corpus.strides[1], corpus.strides[1]  #(120, 8, 8)
    rolling_window = np.lib.stride_tricks.as_strided(corpus, shape=shape, strides=stride)  #(100, 11, 5)
    
    #Creating co-occurrence matrix based on context window
    center_idx = context_size
    #position = rolling_window[:,:,context_size]  #(100, 11)
    context = np.delete(rolling_window, center_idx, -1)  #(100, 11, 4)
    context_multihot = np.sum(np.eye(vocab)[context], axis=-2)  #(100, 11, 12)
    cooccurence = np.tensordot(context_multihot.transpose(0,2,1), context_multihot, axes=([0,2],[0,1]))  #(12, 12)
    np.fill_diagonal(cooccurence,0)  #(12, 12)
    print(cooccurence)
    
    [[  0.  94. 100. 114.  91.  92.  90. 128. 100. 114.  91.  84.]
     [ 94.   0.  78.  96.  90.  65.  76.  68.  76. 108.  58.  68.]
     [100.  78.   0. 125. 107.  93.  83.  84.  73.  84.  97. 110.]
     [114.  96. 125.   0.  84.  97.  76. 110.  80.  94. 117.  97.]
     [ 91.  90. 107.  84.   0.  84.  87. 103.  60. 127. 123.  97.]
     [ 92.  65.  93.  97.  84.   0.  67.  87.  72.  87.  74.  92.]
     [ 90.  76.  83.  76.  87.  67.   0.  83.  73. 118.  81. 108.]
     [128.  68.  84. 110. 103.  87.  83.   0.  72. 100. 115.  69.]
     [100.  76.  73.  80.  60.  72.  73.  72.   0.  83.  81. 100.]
     [114. 108.  84.  94. 127.  87. 118. 100.  83.   0. 109. 110.]
     [ 91.  58.  97. 117. 123.  74.  81. 115.  81. 109.   0. 104.]
     [ 84.  68. 110.  97.  97.  92. 108.  69. 100. 110. 104.   0.]]
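
If a recent NumPy is available (1.20 or later, an assumption not stated in the post), `numpy.lib.stride_tricks.sliding_window_view` builds the same rolling view as the `as_strided` call above without computing shapes and strides by hand, and the result is bounds-checked and read-only:

    import numpy as np

    sentences, vocab, length, context_size = 100, 12, 15, 2
    window = context_size*2 + 1
    corpus = np.random.randint(0, vocab, (sentences, length))

    # Same (100, 11, 5) rolling view as the manual as_strided construction.
    rolling_window = np.lib.stride_tricks.sliding_window_view(corpus, window, axis=1)
    print(rolling_window.shape)  # (100, 11, 5)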
    
    sentence = 'to be or not to be that is the question'
    corpus = np.array([[0, 1, 2, 3, 0, 1, 4, 5, 6, 7]])
    
    #Definitions
    vocab, context_size = 8, 2
    window = context_size*2+1
    
    #Create rolling window view of the sequences
    shape = corpus.shape[0], corpus.shape[1]-window+1, window
    stride = corpus.strides[0], corpus.strides[1], corpus.strides[1]
    rolling_window = np.lib.stride_tricks.as_strided(corpus, shape=shape, strides=stride)
    
    #Creating co-occurrence matrix based on context window
    center_idx = context_size
    #position = rolling_window[:,:,context_size]  
    context = np.delete(rolling_window, center_idx, -1)  
    context_multihot = np.sum(np.eye(vocab)[context], axis=-2)  
    cooccurence = np.tensordot(context_multihot.transpose(0,2,1), context_multihot, axes=([0,2],[0,1]))
    np.fill_diagonal(cooccurence,0)
    print(cooccurence)
    
    [[0. 5. 1. 3. 1. 2. 1. 0.]
     [5. 0. 3. 2. 2. 1. 2. 1.]
     [1. 3. 0. 1. 1. 0. 0. 0.]
     [3. 2. 1. 0. 2. 1. 0. 0.]
     [1. 2. 1. 2. 0. 1. 1. 1.]
     [2. 1. 0. 1. 1. 0. 1. 0.]
     [1. 2. 0. 0. 1. 1. 0. 1.]
     [0. 1. 0. 0. 1. 0. 1. 0.]]
    
    sentences, vocab, length, context_size = 100, 12, 15, 2
    window = context_size*2+1
    corpus = np.random.randint(0, vocab, (sentences, length))
    corpus[0:2]
    
    #top 2 sentences
    array([[ 9,  8,  9,  4,  2, 10,  9,  0,  7,  1, 11,  0,  7,  3,  1],
           [ 7,  9,  4,  0,  1,  9, 10,  7,  4,  2,  2,  3,  5,  8,  8]])
    
    #Create shape and stride definitions
    shape = corpus.shape[0], corpus.shape[1]-window+1, window
    stride = corpus.strides[0], corpus.strides[1], corpus.strides[1]
    print(shape, stride)
    
    #create view
    rolling_window = np.lib.stride_tricks.as_strided(corpus, shape=shape, strides=stride)  #(100, 11, 5)
    print('\nView for first sequence ->')
    print(rolling_window[0])
    
    (100, 11, 5) (120, 8, 8)
    
    View for first sequence ->
    [[ 9  8  9  4  2]
     [ 8  9  4  2 10]
     [ 9  4  2 10  9]
     [ 4  2 10  9  0]
     [ 2 10  9  0  7]
     [10  9  0  7  1]
     [ 9  0  7  1 11]
     [ 0  7  1 11  0]
     [ 7  1 11  0  7]
     [ 1 11  0  7  3]
     [11  0  7  3  1]]
    
    position = rolling_window[0][:,2]
    context = np.delete(rolling_window[0], 2, 1)
    context_multihot = np.sum(np.eye(vocab)[context], axis=1)
    cooccurence = context_multihot.T@context_multihot
    np.fill_diagonal(cooccurence,0)
    print(cooccurence)
    
    [[0. 3. 2. 1. 1. 0. 0. 5. 0. 2. 1. 4.]
     [3. 0. 0. 2. 0. 0. 0. 4. 0. 2. 1. 3.]
     [2. 0. 0. 0. 2. 0. 0. 1. 2. 3. 2. 0.]
     [1. 2. 0. 0. 0. 0. 0. 1. 0. 0. 0. 2.]
     [1. 0. 2. 0. 0. 0. 0. 0. 1. 4. 1. 0.]
     [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
     [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
     [5. 4. 1. 1. 0. 0. 0. 0. 0. 1. 2. 2.]
     [0. 0. 2. 0. 1. 0. 0. 0. 0. 2. 1. 0.]
     [2. 2. 3. 0. 4. 0. 0. 1. 2. 0. 4. 1.]
     [1. 1. 2. 0. 1. 0. 0. 2. 1. 4. 0. 0.]
     [4. 3. 0. 2. 0. 0. 0. 2. 0. 1. 0. 0.]]
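
The batched `np.tensordot` used earlier is this same per-sequence product summed over all sequences, i.e. the sum over b of context_multihot_b.T @ context_multihot_b. For a single sequence the two formulations agree; a small check, reusing `context_multihot` from the snippet above:

    single = context_multihot.T @ context_multihot
    batched_style = np.tensordot(context_multihot.T[None], context_multihot[None],
                                 axes=([0, 2], [0, 1]))
    print(np.allclose(single, batched_style))  # True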
    
    #Finally, creating the co-occurrence matrix over the full (100, 11, 5) batch, using the pieces above
    center_idx = context_size
    #position = rolling_window[:,:,context_size]  #(100, 11)
    context = np.delete(rolling_window, center_idx, -1)  #(100, 11, 4)
    context_multihot = np.sum(np.eye(vocab)[context], axis=-2)  #(100, 11, 12)
    cooccurence = np.tensordot(context_multihot.transpose(0,2,1), context_multihot, axes=([0,2],[0,1]))  #(12, 12)
    np.fill_diagonal(cooccurence,0)  #(12, 12)
    print(cooccurence)
    
    [[  0.  94. 100. 114.  91.  92.  90. 128. 100. 114.  91.  84.]
     [ 94.   0.  78.  96.  90.  65.  76.  68.  76. 108.  58.  68.]
     [100.  78.   0. 125. 107.  93.  83.  84.  73.  84.  97. 110.]
     [114.  96. 125.   0.  84.  97.  76. 110.  80.  94. 117.  97.]
     [ 91.  90. 107.  84.   0.  84.  87. 103.  60. 127. 123.  97.]
     [ 92.  65.  93.  97.  84.   0.  67.  87.  72.  87.  74.  92.]
     [ 90.  76.  83.  76.  87.  67.   0.  83.  73. 118.  81. 108.]
     [128.  68.  84. 110. 103.  87.  83.   0.  72. 100. 115.  69.]
     [100.  76.  73.  80.  60.  72.  73.  72.   0.  83.  81. 100.]
     [114. 108.  84.  94. 127.  87. 118. 100.  83.   0. 109. 110.]
     [ 91.  58.  97. 117. 123.  74.  81. 115.  81. 109.   0. 104.]
     [ 84.  68. 110.  97.  97.  92. 108.  69. 100. 110. 104.   0.]]