How to encode the words in a list in Python

python dictionary dataframe encoding

I have a dictionary in which each word is a key with a corresponding integer value, for example:

 {'me': 41, 'are': 21, 'the': 0}

I have a DataFrame with a column that contains lists of tokenized words, for example:

['I', 'liked', 'the', 'color', 'of', 'this', 'top']
['Just', 'grabbed', 'this', 'today', 'great', 'find']

How can I encode these words as their corresponding values from the dictionary? For example:

[56, 78, 5, 1197, 556, 991, 40]

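Using a dictionary and a list

The code below uses a dictionary (final_dictionary) to look up each word's id. This works well if you already have a preset id dictionary.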

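# Sample data taken from the question.
final_dictionary = {'me': 41, 'are': 21, 'the': 0}
tokens = ['I', 'liked', 'the', 'color', 'of', 'this', 'top']

def encode_tokens(tokens):
    # Work on a copy so the caller's list is left untouched.
    encoded_tokens = tokens[:]
    for i, token in enumerate(tokens):
        # Words missing from the dictionary are kept as they are.
        if token in final_dictionary:
            encoded_tokens[i] = final_dictionary[token]
    return encoded_tokens

print(encode_tokens(tokens))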
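
Note that the function above leaves words that are missing from the dictionary as strings, so the result can mix integers and words. A minimal sketch of a variant that always returns integers, assuming a reserved id for unknown words (UNK_ID and encode_tokens_strict are hypothetical names, and -1 is an arbitrary choice):

```python
# Hypothetical id reserved for words that are not in the dictionary.
UNK_ID = -1

def encode_tokens_strict(tokens, word_to_id, unk_id=UNK_ID):
    """Return a new list of ids; unknown words become unk_id."""
    return [word_to_id.get(token, unk_id) for token in tokens]

final_dictionary = {'me': 41, 'are': 21, 'the': 0}
tokens = ['I', 'liked', 'the', 'color', 'of', 'this', 'top']
encoded = encode_tokens_strict(tokens, final_dictionary)
print(encoded)
```

Passing the dictionary as an argument instead of reading a global also makes the function easier to reuse with a different vocabulary.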

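Adding and maintaining ids

If you are assigning ids dynamically, I would implement a class for it (see below). If you already have an id dictionary defined in advance, you can pass it in through the keyword argument di: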

token_words_1 = ['I', 'liked', 'the', 'color', 'of', 'this', 'top']
token_words_2 = ['I', 'liked', 'to', 'test', 'repeat', 'words']

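class AutoId:
    def __init__(self, **kwargs):
        # Optional preset id dictionary, passed as AutoId(di={...}).
        self.di = kwargs.get("di", {})
        # Start after the largest preset id so new ids never collide.
        self.loc = max(self.di.values(), default=-1) + 1
    def get(self, value):
        # Assign the next free id the first time a word is seen.
        if value not in self.di:
            self.di[value] = self.loc
            self.loc += 1
        return self.di[value]
    def get_list(self, li):
        return [self.get(word) for word in li]

encoding = AutoId()
print(encoding.get_list(token_words_1))  # [0, 1, 2, 3, 4, 5, 6]
print(encoding.get_list(token_words_2))  # [0, 1, 7, 8, 9, 10]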
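
To illustrate the di keyword argument mentioned above, here is a minimal sketch of starting from a predefined dictionary. It assumes new ids should continue after the largest existing one and that the caller's dictionary should not be modified; SeededAutoId is a hypothetical name, not the class from the answer.

```python
class SeededAutoId:
    """Assign integer ids to words, optionally starting from a preset mapping."""

    def __init__(self, di=None):
        # Copy so the caller's dictionary is not mutated.
        self.di = dict(di) if di else {}
        # Assumption: fresh ids start right after the largest preset id.
        self.loc = max(self.di.values(), default=-1) + 1

    def get(self, value):
        # Known words keep their id; new words get the next free one.
        if value not in self.di:
            self.di[value] = self.loc
            self.loc += 1
        return self.di[value]

    def get_list(self, li):
        return [self.get(word) for word in li]

preset = {'me': 41, 'are': 21, 'the': 0}
encoding = SeededAutoId(di=preset)
ids = encoding.get_list(['I', 'liked', 'the', 'color'])
print(ids)
```

Here 'the' keeps its preset id, while the unseen words receive ids that follow the largest preset value.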

How about:

word2key = {'me': 41, 'are': 21, 'the': 0}
words = ['Just', 'grabbed', 'this', 'today', 'great', 'find']
default = 'unknown'
output = [word2key.get(x, default) for x in words]

If you want 'Just' and 'just' to map to the same value, you may need to use x.lower().
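
A minimal sketch of that normalization, assuming the dictionary keys are already lowercase (the 'just' entry and its id 7 are made up for illustration):

```python
word2key = {'me': 41, 'are': 21, 'the': 0, 'just': 7}  # 'just': 7 is a made-up entry
words = ['Just', 'grabbed', 'The', 'just']
default = 'unknown'

# Lowercase each word before the lookup so 'Just' and 'just' share one id.
output = [word2key.get(x.lower(), default) for x in words]
print(output)
```

If the dictionary itself contains mixed-case keys, it would need the same normalization when it is built.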

Assuming the dict is in a variable named d and the list is named l:

d = {'me': 41, 'are': 21, 'the': 0}
l = ['I', 'liked', 'the', 'color', 'of', 'this', 'top']

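print(l)
c = 0
while c < len(l):
    try:
        # Replace the word in place with its id from the dictionary.
        l[c] = d[l[c]]
    except KeyError:
        # The word is not in the dictionary.
        l[c] = None
    c += 1

print(l)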
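
The question stores the token lists in a DataFrame column, so the lookup has to run once per row. A minimal sketch using pandas Series.apply, assuming the column is called tokens and that unknown words should become None (both are assumptions; the question does not name the column):

```python
import pandas as pd

d = {'me': 41, 'are': 21, 'the': 0}
df = pd.DataFrame({
    'tokens': [  # hypothetical column name
        ['I', 'liked', 'the', 'color', 'of', 'this', 'top'],
        ['Just', 'grabbed', 'this', 'today', 'great', 'find'],
    ]
})

def encode(tokens):
    # dict.get returns None for words that are missing from d.
    return [d.get(token) for token in tokens]

# apply calls encode once per row; the original column is left intact.
df['encoded'] = df['tokens'].apply(encode)
print(df['encoded'].tolist())
```

Writing the result to a new column keeps the original words available for checking the encoding.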

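ID system. I want each word to be represented by the integer defined in the dictionary.

@BiBi The original poster liked it; some people find a simple logical structure easier to follow than Python's plethora of built-ins.

I would agree if it meant using some obscure language feature, but in my view list comprehensions are at the core of Python and improve readability :).

Yes, the answer above is easy to understand; I prefer its implementation as a function.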