Enumerate and replace all tokens in a string file in Python


Dear Python enthusiasts, I have a question for you.

I have a corpus file that looks like this:

Ah , this is greasy .
I want to eat kimchee .
Is Chae Yoon 's coordinator in here ?
Excuse me , aren 't you Chae Yoon 's coordinator ? Yes . Me ?
-Chae Yoon is done singing .
This lady right next to me ... everyone knows who she is right ?
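
I want to assign a specific number to each token and replace each token in the file with its assigned number.

By "token" I mean, basically, every group of characters in the file that is separated by ' ' (a space). So, for example, ? is a token, and likewise Excuse is also a token.

My corpus file has more than 4 million lines like the ones above. Could you show me the fastest way to do what I want?

Thanks,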

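If you already have a specific dictionary for changing your values, you just need to map in the new values:

mapping = {'?': 1, 'Excuse': 2}  # ... and so on for every token
for k, v in mapping.items():  # use iteritems() on Python 2
    # str.replace() needs strings, and it replaces substrings, not whole tokens
    my_string = my_string.replace(k, str(v))

If you want to create a brand-new dictionary:

tokens = list(set(my_string.split(' ')))
# dict() takes parentheses; map each token to its index
mapping = dict((token, i) for i, token in enumerate(tokens))
for k, v in mapping.items():
    my_string = my_string.replace(k, str(v))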


Try the following: it assigns a number to each token and then replaces each token with its corresponding number.

a = """Ah , this is greasy .
I want to eat kimchee .
Is Chae Yoon 's coordinator in here ?
Excuse me , aren 't you Chae Yoon 's coordinator ? Yes . Me ?
-Chae Yoon is done singing .
This lady right next to me ... everyone knows who she is right ?""".split(" ")

# Note: split(" ") leaves the newlines attached to the neighbouring tokens.
key_map = {token: str(number) for number, token in enumerate(set(a))}
" ".join(key_map[token] for token in a)

I.e., first map each unique token to a number; you can then use key_map to assign the numeric value to each token.

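from collections import defaultdict
from itertools import count

# filename and output are the paths of the input and output files
with open(filename) as f, open(output, 'w') as out:
    c = count()
    # a token seen for the first time gets the next number from the counter
    d = defaultdict(c.__next__)
    for line in f:
        tokens = line.split()
        # the numbers are ints, so convert them before joining
        out.write(' '.join(str(d[token]) for token in tokens) + '\n')
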
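Using defaultdict, we can remember the tokens we have already seen. Each time we see a new token, we take the next number and assign it to that token. This writes the output to another file.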
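
As a small in-memory illustration of the same idea (a sketch only, not the file-based code from the answer; the `encode_lines` helper and the two sample lines are made up for this example), the first-seen numbering can be checked like this. It also shows that every occurrence of a token receives the same number:

```python
from collections import defaultdict
from itertools import count


def encode_lines(lines):
    """Replace each whitespace-separated token with a number assigned on first sight."""
    counter = count()
    # Python 3: looking up a missing key pulls the next number from the counter.
    ids = defaultdict(counter.__next__)
    encoded = [' '.join(str(ids[token]) for token in line.split()) for line in lines]
    return encoded, dict(ids)


encoded, ids = encode_lines(["Ah , this is greasy .", "Is this greasy ?"])
print(encoded)  # ['0 1 2 3 4 5', '6 2 4 7']
```

Note that "Is" and "is" are different tokens here, because the lookup is case-sensitive.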

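words = "super string".split(' ')
mapping = {}  # token -> number; a dict, and named so it does not shadow the built-in map
parts = []
for word in words:
    if word not in mapping:
        mapping[word] = len(mapping)
    parts.append(str(mapping[word]))
result = ' '.join(parts)
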
This avoids using my_string = my_string.replace(k, v), which makes it slow.
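
Besides costing one full pass over the text per distinct token, `str.replace` works on substrings rather than on tokens, so a short token can be rewritten inside a longer one. A minimal demonstration, with a sample string and mapping invented for illustration:

```python
text = "this is it"
mapping = {"is": "1", "this": "2", "it": "3"}

# Substring replacement: "is" is also rewritten inside "this".
replaced = text
for token, number in mapping.items():
    replaced = replaced.replace(token, number)

# Token-wise lookup: each token is looked up exactly once.
looked_up = " ".join(mapping[token] for token in text.split(" "))

print(replaced)   # th1 1 3
print(looked_up)  # 2 1 3
```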



It might be overkill, but you could write your own classifier:

# Python 3.x
class Classifier(dict):
    def __init__(self, args = None):
        '''args is an iterable of keys (only)'''
        self.n = 1
        super().__init__()
        if args:
            for thing in args:
                self[thing] = self.n
    def __setitem__(self, key, value = None):
##        print('setitem', key)
        if key not in self:
            super().__setitem__(key, self.n)
            self.n += 1
    def setdefault(self, key, default = None):
        increment = key not in self
        n = super().setdefault(key, self.n)
        self.n += int(increment)
##        print('setdefault', n)
        return n
    def update(self, other):
        for k, v in other:
            self.setdefault(k)
    def transpose(self):
        return {v:k for k, v in self.items()}

Usage:

c = Classifier()
with open('foo.txt') as infile, open('classified.txt', 'w+') as outfile:
    for line in infile:
        line = (str(c.setdefault(token)) for token in line.strip().split())
        outfile.write(' '.join(line))
        outfile.write('\n')

To reduce the number of writes, you could accumulate lines in a list and call writelines() once the list reaches some set length.
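
A rough sketch of that batching idea follows. The `write_batched` helper and its `batch_size` default are hypothetical names chosen for this example, and `io.StringIO` stands in for a real output file. Python file objects are already buffered, so it is worth timing this on your own data before assuming it helps:

```python
import io


def write_batched(lines, outfile, batch_size=10000):
    """Write lines to outfile in chunks of batch_size using writelines()."""
    batch = []
    for line in lines:
        batch.append(line + '\n')
        if len(batch) >= batch_size:
            outfile.writelines(batch)
            batch = []
    if batch:  # flush whatever is left over
        outfile.writelines(batch)


buf = io.StringIO()
write_batched(["1 2", "3 4", "5"], buf, batch_size=2)
print(repr(buf.getvalue()))  # '1 2\n3 4\n5\n'
```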

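If you have enough memory, you could read the whole file in, split it, and feed the result to the Classifier.

To reverse the classification: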

z = c.transpose()
with open('classified.txt') as f:
    for line in f:
        line = (z[int(n)] for n in line.strip().split())
        print(' '.join(line))

For Python 2.7, super() requires arguments: replace super() with super(Classifier, self).


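If you will mainly be working with the token numbers as strings, then the class should convert self.n to a string when saving it, so that you do not have to convert back and forth between strings and ints in your working code.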
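
One way to sketch that string-storing variant is shown below. This is not the Classifier class above: it is a simpler, hypothetical dict subclass (`StringIds`) that relies on `__missing__`, used here only to illustrate storing the numbers as strings so the calling code never converts:

```python
class StringIds(dict):
    """Hypothetical variant: hands out token numbers already converted to str."""

    def __missing__(self, token):
        number = str(len(self) + 1)  # numbering starts at 1, like Classifier
        self[token] = number
        return number


ids = StringIds()
line = "Chae Yoon is done singing . Chae Yoon"
encoded = ' '.join(ids[token] for token in line.split())
print(encoded)  # 1 2 3 4 5 6 1 2
```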



You could also use sklearn.


Thanks. But are you sure that I am not giving two different numbers to occurrences of the same token?
mapping = dict[(i,x) for i,x in enumerate(mapping)] ^ SyntaxError: invalid syntax
The square brackets need to be wrapped in parentheses.
AttributeError: 'itertools.count' object has no attribute 'next'
@yusuf Did you include the underscores, __next__? In that case, leave the underscores out: c.next
When you shadow a built-in (map), make sure you are aware that you did it.
map[word] = len(map) TypeError: list indices must be integers, not unicode
super().__init__() TypeError: super() takes at least 1 argument (0 given)
@yusuf Are you using Python 2.x or 3.x?