Efficient way to selectively replace vectors of a tensor in PyTorch

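Given a batch of text sequences, the batch is converted into a tensor in which each word is represented by a word embedding (a 300-dimensional vector). I need to selectively replace the vectors of certain specific words with a new set of embeddings. Moreover, the replacement should not happen for every occurrence of those words, but only at random. Currently, I have the following code to achieve this. It uses two for loops to iterate over every word and checks whether the word is in a specified list, splIndices. Then, based on the True or False value in selected_, it checks whether the word needs to be replaced.

But can this be done in a more efficient way?

The code below may not be an MWE, but I have tried to simplify it by removing the specifics so as to focus on the problem. Please ignore the semantics or purpose of the code, since it may not be properly represented in this snippet. The question is about improving performance.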

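import torch
import torch.nn as nn

splIndices = [45, 62, 2983, 456, 762]  # vocabulary indices that need to be replaced
splFreqs = 2000  # assume the words in splIndices occur 2000 times
selected_ = torch.Tensor(2000).uniform_(0, 1) < 0.2  # tensor with about 20% of the entries True
replIndexCtr = 0  # counter into selected_

# Dictionary with the vectors to substitute. This is a dummy function.
# The original function depends on some property of the word.
diffVector = {45: torch.Tensor(300).uniform_(0, 1), ..., 762: torch.Tensor(300).uniform_(0, 1)}

embedding = nn.Embedding.from_pretrained(embedding_matrix, freeze=False)
tempVals = x  # shape [32, 41] - a batch of 32 sequences with 41 words each
x = embedding(x)  # shape [32, 41, 300] - vocab indices are now replaced by embeddings

# Iterate over the batch to get the sequences
for i, item in enumerate(x):
    # Iterate over the words of the sequence
    for j, stuff in enumerate(item):
        if tempVals[i][j].item() in splIndices:
            if selected_[replIndexCtr]:
                x[i, j] = diffVector[tempVals[i][j].item()]
            replIndexCtr += 1  # advance once per occurrence of a special word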
It can be vectorized as follows:
import torch
import torch.nn as nn
import torch.nn.functional as F

batch_size, sentence_size, vocab_size, emb_size = 3, 2, 15, 1

# Use a constant bias as a marker that identifies each embedder
embedder_1 = nn.Linear(vocab_size, emb_size)
embedder_1.weight.data.fill_(0)
embedder_1.bias.data.fill_(200)

embedder_2 = nn.Linear(vocab_size, emb_size)
embedder_2.weight.data.fill_(0)
embedder_2.bias.data.fill_(404)

# Here are the indices of words which need a different embedding
replace_list = [3, 5, 7, 9]

# Make a binary mask highlighting special words' indices
mask = torch.zeros(batch_size, sentence_size, vocab_size)
mask[..., replace_list] = 1

# Make random dataset
data_indices = torch.randint(0, vocab_size, (batch_size, sentence_size))
data_onehot = F.one_hot(data_indices, vocab_size)

# Check if onehot of a word collides with replace mask 
replace_mask = mask.long() * data_onehot
replace_mask = torch.sum(replace_mask, dim=-1).byte() # byte() is critical here

data_emb = torch.empty(batch_size, sentence_size, emb_size)

# Fill default embeddings
data_emb[1-replace_mask] = embedder_1(data_onehot[1-replace_mask].float())
if torch.max(replace_mask) != 0: # If not all zeros
    # Fill special embeddings
    data_emb[replace_mask] = embedder_2(data_onehot[replace_mask].float())

print(data_indices)
print(replace_mask)
print(data_emb.squeeze(-1).int())

Here is an example of a possible output:
# Word indices
tensor([[ 6,  9],
        [ 5, 10],
        [ 4, 11]])
# Embedding replacement mask
tensor([[0, 1],
        [1, 0],
        [0, 0]], dtype=torch.uint8)
# Resulting replacement
tensor([[200, 404],
        [404, 200],
        [200, 200]], dtype=torch.int32)
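
The same idea can be mapped back onto the setup in the question (an nn.Embedding layer plus random, per-occurrence replacement). The following is only a minimal sketch under stated assumptions: the sizes, the contents of spl_indices, the stand-in embedding_matrix and the repl_table lookup table are all hypothetical, and it relies on torch.isin, which is available in PyTorch 1.10 and later. It uses boolean masks, which recent PyTorch releases prefer over uint8 masks for indexing.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration.
vocab_size, emb_size = 1000, 300
batch_size, seq_len = 32, 41

# Stand-in for the pretrained matrix mentioned in the question.
embedding_matrix = torch.randn(vocab_size, emb_size)
embedding = nn.Embedding.from_pretrained(embedding_matrix, freeze=False)

# Vocabulary indices whose vectors may be replaced (hypothetical values).
spl_indices = torch.tensor([45, 62, 456, 762])

# One replacement vector per special word, stored in a full-size table so
# that it can be indexed with the same word indices as the main embedding.
repl_table = torch.zeros(vocab_size, emb_size)
repl_table[spl_indices] = torch.rand(len(spl_indices), emb_size)

# Random batch of word indices; force a few special words to be present.
x_idx = torch.randint(0, vocab_size, (batch_size, seq_len))
x_idx[0, :4] = spl_indices

# True where the token is one of the special words.
is_special = torch.isin(x_idx, spl_indices)
# Independent draw per position: True with probability 0.2.
coin = torch.rand(x_idx.shape) < 0.2
replace_mask = is_special & coin            # shape [32, 41]

x = embedding(x_idx)                        # shape [32, 41, 300]
# Take the replacement vector where the mask is set, else keep the original.
x = torch.where(replace_mask.unsqueeze(-1), repl_table[x_idx], x)

print(replace_mask.sum().item(), "positions replaced")
```

Here the random decision is drawn independently for every position instead of being read from a pre-drawn selected_ tensor with a counter, which removes both Python loops. The full-size repl_table trades memory for simplicity; with a very large vocabulary, a compact table indexed through a vocab-to-slot mapping might be preferable.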