Python numpy-如何组合多个索引(将多个逐个矩阵访问替换为一个访问) 更新 没有考虑同一个词的多次出现和自词出现。
例如,当步幅=2且该位置的单词为W时,X的共现需要+2,W的自共现需要+1Python numpy-如何组合多个索引(将多个逐个矩阵访问替换为一个访问) 更新 没有考虑同一个词的多次出现和自词出现。,python,arrays,numpy,matrix-indexing,Python,Arrays,Numpy,Matrix Indexing,例如,当步幅=2且该位置的单词为W时,X的共现需要+2,W的自共现需要+1 X | Y | W | X | W 问题: 更新m*mmatrix(),当前使用循环逐行访问。整个代码在底部 如何删除循环并同时更新多行?我认为应该有一种方法将每个索引组合成一个矩阵,用一个矢量化更新替换循环 请建议可能的方法 当前实施 循环一组单词索引序列(单词索引是每个单词的整数代码) 对于循环中位置处的每个单词,检查跨步距离内两侧同时出现的单词这是一个N-gram上下文窗口,如图中的紫色框所示N=上下文大小=步幅*
X | Y | W | X | W
问题:
更新m*m
matrix(),当前使用循环逐行访问。整个代码在底部
如何删除循环并同时更新多行?我认为应该有一种方法将每个索引组合成一个矩阵,用一个矢量化更新替换循环
请建议可能的方法
当前实施
序列(单词索引是每个单词的整数代码)
位置
处的每个单词,检查跨步
距离内两侧同时出现的单词这是一个N-gram
上下文
窗口,如图中的紫色框所示<代码>N=上下文大小=步幅*2+1共现矩阵中每个共现字的计数
x = np.array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11]])
rows = np.array([[0, 0],
[3, 3]], dtype=np.intp)
columns = np.array([[0, 2],
[0, 2]], dtype=np.intp)
x[rows, columns]
---
array([[ 0, 2],
[ 9, 11]])
通过组合循环中的每个索引来创建多维索引,但它不会产生错误。请告知原因和错误,或者如果尝试没有意义
indices = np.array([
[
sequence[0], # position to the word
sequence[max(0, 0-stride) : min((0+stride),n-1) +1] # positions to co-occurrence words
]]
)
assert n > 1
for position in range(1, n):
co_occurrence_indices = np.array([
[
sequence[position], # position to the word
sequence[max(0, position-stride) : min((position+stride),n-1) +1] # positions to co-occurrence words
]]
)
indices = np.append(
indices,
co_occurrence_indices,
axis=0
)
print("Updating the co_occurrence_matrix: indices \n{} \nindices.dtype {}".format(
indices,
indices.dtype
))
co_ccurrence_matrix[
indices <---- Error
] += 1
轮廓
从中使用ptb.train.txt
定时器单元:1e-06秒
总时间:23.0015秒
文件:
功能:在第1行创建共现矩阵
行#每次命中的命中次数%时间行内容
==============================================================
1定义创建共现矩阵(顺序、词汇库大小、上下文大小=3):
2 """
3个参数:
4序列:原始语料库文本的单词索引序列
5词汇库大小:词汇库中的单词数(与共现向量大小相同)
6上下文大小:用于检查共现的上下文(N-gram大小N)。
7返回:
8共现矩阵
9 """
10 1 4.0 4.0 0 0.0 n=序列尺寸=长度(序列)
11 1 98.0 98.0 0 0.0共现矩阵=np.0((词汇库大小,词汇库大小),数据类型=np.int32)
12
13 1 5.0 5.0 0 0.0步幅=整数((上下文大小-1)/2)
14 1.0 1.0 0 0.0断言(n>stride),“序列大小{}小于/等于stride{}”。格式(
15牛顿,迈步
16 )
17
18 """
19#手柄位置=切片(0:(步幅-1)+1),共现=切片(最大(0,位置步幅):最小(位置+步幅),n-1)+1)
20#手柄位置=切片((n-1-步幅):(n-1)+1),共现=切片(最大值(0,位置步幅):最小值((位置+步幅),n-1)+1)
21个指数=[*范围(0,(步幅-1)+1),*范围(n-1)-步幅+1,(n-1)+1]
22#打印(索引)
23
24指数排名:
25调试(顺序、位置、步幅、错误)
26共现矩阵[
27序列[位置],#到单词的位置
28序列[max(0,位置步幅):min((位置+步幅),n-1)+1]#共现词索引
29 ] += 1
30
31
32#手柄位置=切片(步幅,((序列大小-1)-步幅)+1)
33对于范围内的位置(步幅,(顺序大小-1)-步幅+1):
34共现矩阵[
35序列[位置],#到单词的位置
36序列[(位置步幅):(位置+步幅+1)]#共现词索引
37 ] += 1
38 """
39
40 929590 1175326.0 1.3 5.1范围内的位置(0,n):
41 2788767 15304643.0 5.5 66.5共现矩阵[
42 1859178 2176964.0 1.2 9.5序列[位置],#到单词的位置
43 929589 3280181.0 3.5 14.3顺序[最大(0,位置步幅):最小(位置+st
indices = np.array([
[
sequence[0], # position to the word
sequence[max(0, 0-stride) : min((0+stride),n-1) +1] # positions to co-occurrence words
]]
)
assert n > 1
for position in range(1, n):
co_occurrence_indices = np.array([
[
sequence[position], # position to the word
sequence[max(0, position-stride) : min((position+stride),n-1) +1] # positions to co-occurrence words
]]
)
indices = np.append(
indices,
co_occurrence_indices,
axis=0
)
print("Updating the co_occurrence_matrix: indices \n{} \nindices.dtype {}".format(
indices,
indices.dtype
))
co_ccurrence_matrix[
indices <---- Error
] += 1
Updating the co_occurrence_matrix: indices
[[0 array([0, 1])]
[1 array([0, 1, 2])]
[2 array([1, 2, 3])]
[3 array([2, 3, 0])]
[0 array([3, 0, 1])]
[1 array([0, 1, 4])]
[4 array([1, 4, 5])]
[5 array([4, 5, 6])]
[6 array([5, 6, 7])]
[7 array([6, 7])]]
indices.dtype object
<ipython-input-88-d9b081bf2f1a>:48: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
indices = np.array([
<ipython-input-88-d9b081bf2f1a>:56: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
co_occurrence_indices = np.array([
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-88-d9b081bf2f1a> in <module>
84 sequence, word_to_id, id_to_word = preprocess(corpus)
85 vocabrary_size = max(word_to_id.values()) + 1
---> 86 create_cooccurrence_matrix(sequence, vocabrary_size , 3)
<ipython-input-88-d9b081bf2f1a> in create_cooccurrence_matrix(sequence, vocabrary_size, context_size)
70 indices.dtype
71 ))
---> 72 co_ccurrence_matrix[
73 indices
74 ] += 1
IndexError: arrays used as indices must be of integer (or boolean) type
import numpy as np
def preprocess(text):
"""
Args:
text: A string including sentences to process. corpus
Returns:
sequence:
A numpy array of word indices to every word in the original text as they appear in the text.
The objective of corpus is to preserve the original text but as numerical indices.
word_to_id: A dictionary to map a word to a word index
id_to_word: A dictionary to map a word index to a word
"""
text = text.lower()
text = text.replace('.', ' .')
words = text.split(' ')
word_to_id = {}
id_to_word = {}
for word in words:
if word not in word_to_id:
new_id = len(word_to_id)
word_to_id[word] = new_id
id_to_word[new_id] = word
sequence= np.array([word_to_id[w] for w in words])
return sequence, word_to_id, id_to_word
def create_cooccurrence_matrix(sequence, vocabrary_size, context_size=3):
"""
Args:
sequence: word index sequence of the original corpus text
vocabrary_size: number of words in the vocabrary (same with co-occurrence vector size)
context_size: context (N-gram size N) within which to check co-occurrences.
"""
n = sequence_size = len(sequence)
co_ccurrence_matrix = np.zeros((vocabrary_size, vocabrary_size), dtype=np.int32)
stride = int((context_size - 1)/2 )
assert(n > stride), "sequence_size {} is less than/equal to stride {}".format(
n, stride
)
for position in range(0, n):
co_ccurrence_matrix[
sequence[position], # position to the word
sequence[max(0, position-stride) : min((position+stride),n-1) +1] # positions to co-occurrence words
] += 1
np.fill_diagonal(co_ccurrence_matrix, 0)
return co_ccurrence_matrix
corpus= "To be, or not to be, that is the question"
sequence, word_to_id, id_to_word = preprocess(corpus)
vocabrary_size = max(word_to_id.values()) + 1
create_cooccurrence_matrix(sequence, vocabrary_size , 3)
---
[[0 2 0 1 0 0 0 0]
[2 0 1 0 1 0 0 0]
[0 1 0 1 0 0 0 0]
[1 0 1 0 0 0 0 0]
[0 1 0 0 0 1 0 0]
[0 0 0 0 1 0 1 0]
[0 0 0 0 0 1 0 1]
[0 0 0 0 0 0 1 0]]
Timer unit: 1e-06 s
Total time: 23.0015 s
File: <ipython-input-8-27f5e530d4ff>
Function: create_cooccurrence_matrix at line 1
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def create_cooccurrence_matrix(sequence, vocabrary_size, context_size=3):
2 """
3 Args:
4 sequence: word index sequence of the original corpus text
5 vocabrary_size: number of words in the vocabrary (same with co-occurrence vector size)
6 context_size: context (N-gram size N) within to check co-occurrences.
7 Returns:
8 co_occurrence matrix
9 """
10 1 4.0 4.0 0.0 n = sequence_size = len(sequence)
11 1 98.0 98.0 0.0 co_occurrence_matrix = np.zeros((vocabrary_size, vocabrary_size), dtype=np.int32)
12
13 1 5.0 5.0 0.0 stride = int((context_size - 1)/2 )
14 1 1.0 1.0 0.0 assert(n > stride), "sequence_size {} is less than/equal to stride {}".format(
15 n, stride
16 )
17
18 """
19 # Handle position=slice(0 : (stride-1) +1), co-occurrences=slice(max(0, position-stride): min((position+stride),n-1) +1)
20 # Handle position=slice((n-1-stride) : (n-1) +1), co-occurrences=slice(max(0, position-stride): min((position+stride),n-1) +1)
21 indices = [*range(0, (stride-1) +1), *range((n-1)-stride +1, (n-1) +1)]
22 #print(indices)
23
24 for position in indices:
25 debug(sequence, position, stride, False)
26 co_occurrence_matrix[
27 sequence[position], # position to the word
28 sequence[max(0, position-stride) : min((position+stride),n-1) +1] # indices to co-occurance words
29 ] += 1
30
31
32 # Handle position=slice(stride, ((sequence_size-1) - stride) +1)
33 for position in range(stride, (sequence_size-1) - stride + 1):
34 co_occurrence_matrix[
35 sequence[position], # position to the word
36 sequence[(position-stride) : (position + stride + 1)] # indices to co-occurance words
37 ] += 1
38 """
39
40 929590 1175326.0 1.3 5.1 for position in range(0, n):
41 2788767 15304643.0 5.5 66.5 co_occurrence_matrix[
42 1859178 2176964.0 1.2 9.5 sequence[position], # position to the word
43 929589 3280181.0 3.5 14.3 sequence[max(0, position-stride) : min((position+stride),n-1) +1] # positions to co-occurance words
44 929589 1062613.0 1.1 4.6 ] += 1
45
46 1 1698.0 1698.0 0.0 np.fill_diagonal(co_occurrence_matrix, 0)
47
48 1 2.0 2.0 0.0 return co_occurrence_matrix
#Definitions
sentences, vocab, length, context_size = 100, 12, 15, 2
#Create dummy corpus (label encoded)
window = context_size*2+1
corpus = np.random.randint(0, vocab, (sentences, length)) #(100, 15)
#Create rolling window view of the sequences
shape = corpus.shape[0], corpus.shape[1]-window+1, window #(100, 11, 5)
stride = corpus.strides[0], corpus.strides[1], corpus.strides[1] #(120, 8, 8)
rolling_window = np.lib.stride_tricks.as_strided(corpus, shape=shape, strides=stride) #(100, 11, 5)
#Creating co-occurence matrix based on context window
center_idx = context_size
#position = rolling_window[:,:,context_size] #(100, 11)
context = np.delete(rolling_window, center_idx, -1) #(100, 11, 4)
context_multihot = np.sum(np.eye(vocab)[context], axis=-2) #(100, 11, 12)
cooccurence = np.tensordot(context_multihot.transpose(0,2,1), context_multihot, axes=([0,2],[0,1])) #(12, 12)
np.fill_diagonal(cooccurence,0) #(12, 12)
print(cooccurence)
[[ 0. 94. 100. 114. 91. 92. 90. 128. 100. 114. 91. 84.]
[ 94. 0. 78. 96. 90. 65. 76. 68. 76. 108. 58. 68.]
[100. 78. 0. 125. 107. 93. 83. 84. 73. 84. 97. 110.]
[114. 96. 125. 0. 84. 97. 76. 110. 80. 94. 117. 97.]
[ 91. 90. 107. 84. 0. 84. 87. 103. 60. 127. 123. 97.]
[ 92. 65. 93. 97. 84. 0. 67. 87. 72. 87. 74. 92.]
[ 90. 76. 83. 76. 87. 67. 0. 83. 73. 118. 81. 108.]
[128. 68. 84. 110. 103. 87. 83. 0. 72. 100. 115. 69.]
[100. 76. 73. 80. 60. 72. 73. 72. 0. 83. 81. 100.]
[114. 108. 84. 94. 127. 87. 118. 100. 83. 0. 109. 110.]
[ 91. 58. 97. 117. 123. 74. 81. 115. 81. 109. 0. 104.]
[ 84. 68. 110. 97. 97. 92. 108. 69. 100. 110. 104. 0.]]
sentence = 'to be or not to be that is the question'
corpus = np.array([[0, 1, 2, 3, 0, 1, 4, 5, 6, 7]])
#Definitions
vocab, context_size = 8, 2
window = context_size*2+1
#Create rolling window view of the sequences
shape = corpus.shape[0], corpus.shape[1]-window+1, window
stride = corpus.strides[0], corpus.strides[1], corpus.strides[1]
rolling_window = np.lib.stride_tricks.as_strided(corpus, shape=shape, strides=stride)
#Creating co-occurence matrix based on context window
center_idx = context_size
#position = rolling_window[:,:,context_size]
context = np.delete(rolling_window, center_idx, -1)
context_multihot = np.sum(np.eye(vocab)[context], axis=-2)
cooccurence = np.tensordot(context_multihot.transpose(0,2,1), context_multihot, axes=([0,2],[0,1]))
np.fill_diagonal(cooccurence,0)
print(cooccurence)
[[0. 5. 1. 3. 1. 2. 1. 0.]
[5. 0. 3. 2. 2. 1. 2. 1.]
[1. 3. 0. 1. 1. 0. 0. 0.]
[3. 2. 1. 0. 2. 1. 0. 0.]
[1. 2. 1. 2. 0. 1. 1. 1.]
[2. 1. 0. 1. 1. 0. 1. 0.]
[1. 2. 0. 0. 1. 1. 0. 1.]
[0. 1. 0. 0. 1. 0. 1. 0.]]
sentences, vocab, length, context_size = 100, 12, 15, 2
window = context_size*2+1
corpus = np.random.randint(0, vocab, (sentences, length))
corpus[0:2]
#top 2 sentences
array([[ 9, 8, 9, 4, 2, 10, 9, 0, 7, 1, 11, 0, 7, 3, 1],
[ 7, 9, 4, 0, 1, 9, 10, 7, 4, 2, 2, 3, 5, 8, 8]])
#Create shape and stride definitions
shape = corpus.shape[0], corpus.shape[1]-window+1, window
stride = corpus.strides[0], corpus.strides[1], corpus.strides[1]
print(shape, stride)
#create view
rolling_window = np.lib.stride_tricks.as_strided(corpus, shape=shape, strides=stride) #(100, 11, 5)
print('\nView for first sequence ->')
print(rolling_window[0])
(100, 11, 5) (120, 8, 8)
View for first sequence ->
[[ 9 8 9 4 2]
[ 8 9 4 2 10]
[ 9 4 2 10 9]
[ 4 2 10 9 0]
[ 2 10 9 0 7]
[10 9 0 7 1]
[ 9 0 7 1 11]
[ 0 7 1 11 0]
[ 7 1 11 0 7]
[ 1 11 0 7 3]
[11 0 7 3 1]]
position = rolling_window[0][:,2]
context = np.delete(rolling_window[0], 2, 1)
context_multihot = np.sum(np.eye(vocab)[context], axis=1)
cooccurence = context_multihot.T@context_multihot
np.fill_diagonal(cooccurence,0)
print(cooccurence)
[[0. 3. 2. 1. 1. 0. 0. 5. 0. 2. 1. 4.]
[3. 0. 0. 2. 0. 0. 0. 4. 0. 2. 1. 3.]
[2. 0. 0. 0. 2. 0. 0. 1. 2. 3. 2. 0.]
[1. 2. 0. 0. 0. 0. 0. 1. 0. 0. 0. 2.]
[1. 0. 2. 0. 0. 0. 0. 0. 1. 4. 1. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[5. 4. 1. 1. 0. 0. 0. 0. 0. 1. 2. 2.]
[0. 0. 2. 0. 1. 0. 0. 0. 0. 2. 1. 0.]
[2. 2. 3. 0. 4. 0. 0. 1. 2. 0. 4. 1.]
[1. 1. 2. 0. 1. 0. 0. 2. 1. 4. 0. 0.]
[4. 3. 0. 2. 0. 0. 0. 2. 0. 1. 0. 0.]]
#Creating co-occurence matrix based on context window
center_idx = context_size
#position = rolling_window[:,:,context_size] #(100, 11)
context = np.delete(rolling_window, center_idx, -1) #(100, 11, 4)
context_multihot = np.sum(np.eye(vocab)[context], axis=-2) #(100, 11, 12)
cooccurence = np.tensordot(context_multihot.transpose(0,2,1), context_multihot, axes=([0,2],[0,1])) #(12, 12)
np.fill_diagonal(cooccurence,0) #(12, 12)
print(cooccurence)
[[ 0. 94. 100. 114. 91. 92. 90. 128. 100. 114. 91. 84.]
[ 94. 0. 78. 96. 90. 65. 76. 68. 76. 108. 58. 68.]
[100. 78. 0. 125. 107. 93. 83. 84. 73. 84. 97. 110.]
[114. 96. 125. 0. 84. 97. 76. 110. 80. 94. 117. 97.]
[ 91. 90. 107. 84. 0. 84. 87. 103. 60. 127. 123. 97.]
[ 92. 65. 93. 97. 84. 0. 67. 87. 72. 87. 74. 92.]
[ 90. 76. 83. 76. 87. 67. 0. 83. 73. 118. 81. 108.]
[128. 68. 84. 110. 103. 87. 83. 0. 72. 100. 115. 69.]
[100. 76. 73. 80. 60. 72. 73. 72. 0. 83. 81. 100.]
[114. 108. 84. 94. 127. 87. 118. 100. 83. 0. 109. 110.]
[ 91. 58. 97. 117. 123. 74. 81. 115. 81. 109. 0. 104.]
[ 84. 68. 110. 97. 97. 92. 108. 69. 100. 110. 104. 0.]]