Python: word embedding visualization with t-SNE is unclear

python nlp

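I have downloaded a pre-trained word embedding model from a website, and I want to visualize the embeddings of the words in my sentences. I have two sentences: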

sentence1 = "Four people died in an accident."

sentence2 = "4 men are dead from a collision"

sentence1 = ['Four', 'people', 'died', 'accident']
sentence2 = ['4', 'men', 'dead', 'collision']
sentences = list(set(sentence1)| set(sentence2))

I have a function that loads the embedding file from the link above:

def load_data(FileName='./EN-wform.w.5.cbow.neg10.400.subsmpl.txt'):
    """Load tab-separated word embeddings into a dict: word -> list of floats."""
    embeddings = {}
    print("Loading word embeddings first time")
    with open(FileName, 'r') as f:
        for line in f:
            # Each line is a word followed by its vector components,
            # separated by tabs; strip the trailing newline before splitting.
            tokens = line.rstrip().split('\t')
            # Keep every component after the word. Slicing with tokens[1:-1]
            # would silently drop the last dimension of each vector.
            embeddings[tokens[0]] = [float(t) for t in tokens[1:]]
    print("finished")
    return embeddings

e = load_data()

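From these two sentences I compute the lemmas of the words and ignore stop words and punctuation, so my sentences now become:

sentence1 = ['Four', 'people', 'died', 'accident']
sentence2 = ['4', 'men', 'dead', 'collision']

Now, when I try to visualize the embeddings using t-SNE (t-distributed stochastic neighbor embedding), I first store the labels and tokens for each sentence: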

# For each sentence, store the labels and embeddings in lists.
# Each entry in tokens holds the 400-dimensional vector for the matching label.
labels1 = []
tokens1 = []
for i in sentence1:
    if i in e:
        labels1.append(i)
        tokens1.append(e[i])
    else:
        print(i)  # word not found in the embeddings

labels2 = []
tokens2 = []
for i in sentence2:
    if i in e:
        labels2.append(i)
        tokens2.append(e[i])
    else:
        print(i)  # word not found in the embeddings

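For t-SNE:

My question is: why do synonyms such as "collision" and "accident", or "people" and "men", have different coordinates? If words are the same or are synonyms, shouldn't they be closer together?

distances = euclidean_distances(tokens1)  # returns an array of shape (8, 8)
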
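t-SNE has a non-convex cost function, so different initializations can give different results.
This means that when you reduce the dimensionality of word embeddings in separate runs, you are not guaranteed to get the same coordinates, and the coordinates from two separate fits are not comparable.
To solve this, concatenate the sentences and call fit_transform once instead of twice: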

sentence1 = "Four people died in an accident."
sentence2 = "4 men are dead from a collision"

sentence1 = ['Four', 'people', 'died', 'accident']
sentence2 = ['4', 'men', 'dead', 'collision']
sentences = list(set(sentence1) | set(sentence2))
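
A minimal sketch of the single-fit idea, assuming scikit-learn and NumPy are installed. The random vectors are stand-ins for the real embeddings so that the snippet runs without the embedding file; the `perplexity` and `random_state` values are illustrative choices, not something taken from the question.

```python
import numpy as np
from sklearn.manifold import TSNE

sentence1 = ['Four', 'people', 'died', 'accident']
sentence2 = ['4', 'men', 'dead', 'collision']

# Assumption: random 400-dimensional stand-ins for the real embeddings.
# With the real file, use e = load_data() instead.
rng = np.random.RandomState(0)
e = {word: rng.rand(400) for word in sentence1 + sentence2}

# Build one shared list of words so that every word is projected
# by the same t-SNE run and ends up in the same coordinate system.
labels = [word for word in sentence1 + sentence2 if word in e]
vectors = np.array([e[word] for word in labels])

# scikit-learn expects perplexity to be smaller than the number of
# samples (8 here), so the default of 30 is too large for this input.
model = TSNE(n_components=2, perplexity=3, random_state=0)
coords = model.fit_transform(vectors)  # one row of (x, y) per word

for word, (x, y) in zip(labels, coords):
    print(word, x, y)
```

Fixing `random_state` makes repeated runs reproducible, but the layout can still change with other seeds or perplexity values, so distances in the plot should be read with care.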

Edit: there is another bug in your code: you are printing the labels from the wrong list.
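I did it this way, but the synonyms are still not closer to each other. Do you know the word embeddings by Baroni et al.? Is there any other way to handle embeddings of sentences?

@Lucky Try using the Euclidean distance function from scikit-learn on the embeddings and see whether the distances between the words make sense.

Do you mean that once I get the coordinates from t-SNE, I should compute the Euclidean distance between each pair of words and check whether synonyms have the smallest distance?

@Lucky No, compute the Euclidean distances before t-SNE and print the distance matrix.

I did that and got an array of shape (8, 8). I used the distance matrix in model.fit, and when I plot it I don't see any difference. I have added this to my post. Please correct me if I'm wrong.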
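
The suggestion in the comments, checking distances in the original embedding space before running t-SNE, can be sketched with the standard library alone. The three-dimensional vectors below are made-up placeholders, not values from the real embedding file, and `euclidean_distance_matrix` is a hypothetical helper name used only for this illustration.

```python
import math

def euclidean_distance_matrix(vectors):
    """Return a square list of lists with the pairwise Euclidean distances."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return [[distance(a, b) for b in vectors] for a in vectors]

# Assumption: toy vectors standing in for the 400-dimensional embeddings.
labels = ['accident', 'collision', 'people']
vectors = [
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 4.0, 3.0],
]

distances = euclidean_distance_matrix(vectors)

# Print one row per word so that close and distant pairs are easy to spot.
for label, row in zip(labels, distances):
    print(label, [round(d, 3) for d in row])
```

If synonyms are already far apart in this matrix, the embeddings themselves do not place them close together, and no t-SNE setting will bring them together in the plot.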