Python: generate unique IDs for a list containing words

python python-3.x

I have a list of word pairs and want to represent the words by IDs. The IDs should range from 0 to len(set(words)). The list currently looks like this:

[['pluripotent', 'Scharte'],
 ['Halswirbel', 'präventiv'],
 ['Kleiber', 'Blauspecht'],
 ['Kleiber', 'Scheidung'],
 ['Nillenlutscher', 'Salzstangenlecker']]

The result should have the same format, but with IDs instead of words. For example:

[[0, 1],
 [2, 3],
 [4, 5],
 [4, 6],
 [7, 8]]

So far I have this, but it doesn't give me the right output:

def words_to_ids(labels):
  vocabulary = []
  word_to_id = {}
  ids = []
  for word1,word2 in labels:
      vocabulary.append(word1)
      vocabulary.append(word2)

  for i, word in enumerate(vocabulary):
      word_to_id [word] = i
  for word1,word2 in labels:
      ids.append([word_to_id [word1], word_to_id [word1]])
  print(ids)

Output:

[[0, 0], [2, 2], [6, 6], [6, 6], [8, 8]]

It repeats IDs in places where the words are unique.

You have two errors. First, you have a simple typo, here:

for word1,word2 in labels:
    ids.append([word_to_id [word1], word_to_id [word1]])

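You are adding the id for word1 twice there. Correct the second word1 to look up word2 instead.

Next, you never test whether you have seen a word before, so for 'Kleiber' you first give it the id 4, then overwrite that entry with 6 on the next iteration. You need to number the unique words, not all words: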

counter = 0
for word in vocabulary:
    if word not in word_to_id:
        word_to_id[word] = counter
        counter += 1
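The two fixes described above can be combined into one function. The following is only a minimal sketch: it keeps the asker's names (words_to_ids, vocabulary, word_to_id), but returning the list instead of printing it is an assumption made here so the result is easier to check.

```python
def words_to_ids(labels):
    # Collect every word in input order, duplicates included.
    vocabulary = []
    for word1, word2 in labels:
        vocabulary.append(word1)
        vocabulary.append(word2)

    # Fix 2: only assign a new number the first time a word is seen.
    word_to_id = {}
    counter = 0
    for word in vocabulary:
        if word not in word_to_id:
            word_to_id[word] = counter
            counter += 1

    # Fix 1: look up word2 for the second element, not word1 again.
    return [[word_to_id[word1], word_to_id[word2]] for word1, word2 in labels]


labels = [['pluripotent', 'Scharte'],
          ['Halswirbel', 'präventiv'],
          ['Kleiber', 'Blauspecht'],
          ['Kleiber', 'Scheidung'],
          ['Nillenlutscher', 'Salzstangenlecker']]

print(words_to_ids(labels))
# [[0, 1], [2, 3], [4, 5], [4, 6], [7, 8]]
```

Because 'Kleiber' keeps the id it received on first sight, both of its pairs start with 4.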
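Alternatively, you could simply not add a word to vocabulary if that word is already listed there.

Incidentally, you don't need a separate vocabulary list here. The separate loop doesn't buy you anything, so the following works too:

word_to_id = {}
counter = 0
for words in labels:
    for word in words:
        # only number a word the first time it is seen
        if word not in word_to_id:
            word_to_id[word] = counter
            counter += 1

You can simplify your code considerably by using collections.defaultdict together with itertools.count() to supply default values:

from collections import defaultdict
from itertools import count

def words_to_ids(labels):
    word_ids = defaultdict(count().__next__)
    return [[word_ids[w1], word_ids[w2]] for w1, w2 in labels]

The count() object gives you the next integer value in the series each time __next__ is called, and defaultdict() calls it each time you try to access a key that does not yet exist in the dictionary. Together they ensure a unique ID for each unique word.

There are two problems:

You made a typo by repeating the lookup of word1 in word_to_id.

When constructing the word_to_id dictionary, you need to consider only unique values.

For example, in Python 3.7+ you can take advantage of insertion-ordered dictionaries:

for i, word in enumerate(dict.fromkeys(vocabulary)):
    word_to_id[word] = i

for word1, word2 in labels:
    ids.append([word_to_id[word1], word_to_id[word2]])

For versions before 3.7, an ordered alternative such as collections.OrderedDict.fromkeys(vocabulary) can be used. If there is no ordering requirement, you can just use set(vocabulary).

Oh, thanks. The only problem is that now I get [[0, 1], [2, 3], [6, 5], [6, 7], [8, 9]], so I'm asking myself what happened to the number 4.

@NastjaKr: You gave Kleiber a number twice. I'll update.

Yes, some words appear in the list multiple times, but their id should be the same everywhere.

Given the output from your attempt, I assumed you want to start at 0. Feel free to correct those assumptions.

There is no requirement that the order of the numbers matches the input order, so set() would suffice.

@MartijnPieters, fair enough, set would do!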
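The ordering point discussed in the comments can be checked with a short sketch. The helper names ids_ordered and ids_unordered are hypothetical, introduced here for illustration only, and Python 3.7+ is assumed for the insertion-order guarantee of dict. Both variants give each unique word exactly one ID, but only the dict.fromkeys variant numbers the words in first-seen order, which is what the example output in the question happens to show.

```python
def ids_ordered(labels):
    # Hypothetical helper: dict.fromkeys drops duplicates and keeps first-seen order.
    vocabulary = [word for pair in labels for word in pair]
    word_to_id = {word: i for i, word in enumerate(dict.fromkeys(vocabulary))}
    return [[word_to_id[w1], word_to_id[w2]] for w1, w2 in labels]


def ids_unordered(labels):
    # Hypothetical helper: set() also drops duplicates, but its iteration order is arbitrary.
    vocabulary = [word for pair in labels for word in pair]
    word_to_id = {word: i for i, word in enumerate(set(vocabulary))}
    return [[word_to_id[w1], word_to_id[w2]] for w1, w2 in labels]


labels = [['pluripotent', 'Scharte'],
          ['Halswirbel', 'präventiv'],
          ['Kleiber', 'Blauspecht'],
          ['Kleiber', 'Scheidung'],
          ['Nillenlutscher', 'Salzstangenlecker']]

ordered = ids_ordered(labels)
unordered = ids_unordered(labels)

print(ordered)    # [[0, 1], [2, 3], [4, 5], [4, 6], [7, 8]]
print(unordered)  # IDs 0..8, each word consistent, but the assignment may differ between runs
```

If the exact numbering from the question matters, the ordered variant is the safer choice; if only uniqueness matters, either works.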