Enumerate and replace all tokens in a string file in Python


Dear Python lovers, I have a question for you.

I have a corpus file that looks like this:

Ah , this is greasy .
I want to eat kimchee .
Is Chae Yoon &apos;s coordinator in here ?
Excuse me , aren &apos;t you Chae Yoon &apos;s coordinator ? Yes . Me ?
-Chae Yoon is done singing .
This lady right next to me ... everyone knows who she is right ?

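I want to assign a specific number to each token and replace every token in the file with its assigned number.

By token I mean, basically, each group of characters in the file that is separated by " " (a space). So, for example, ? is a token, and likewise Excuse is a token.

My corpus file has more than 4 million lines like the ones above. Can you show me the fastest way to do what I want?

Thanks,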

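If you already have a specific dictionary for changing your values, you can simply map the new values:

mapping = {'?': 1, 'Excuse': 2}  # ... and so on for the remaining tokens
for k, v in mapping.items():  # use iteritems() on Python 2
    # str.replace needs a string, so convert the number first.
    # Note: replace matches substrings, not whole tokens.
    my_string = my_string.replace(k, str(v))

If you want to create a brand-new dictionary:

tokens = set(my_string.split(' '))
mapping = {token: i for i, token in enumerate(tokens)}  # token -> number
for k, v in mapping.items():
    my_string = my_string.replace(k, str(v))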
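One caveat with the str.replace approach above: it matches substrings, not whole tokens, so a short token such as "is" is also rewritten inside "this". A minimal sketch of a token-wise alternative, assuming tokens are separated by single spaces (the names `text`, `naive`, `mapping`, and `encoded` are illustrative and not part of the original answer):

```python
text = "Ah , this is greasy ."

# Naive substring replacement: "is" also matches inside "this".
naive = text.replace("is", "7")

# Token-wise lookup: split once, translate each token, join again.
tokens = text.split(" ")
mapping = {}
for token in tokens:
    # setdefault only stores a number the first time a token is seen.
    mapping.setdefault(token, str(len(mapping)))
encoded = " ".join(mapping[token] for token in tokens)

print(naive)
print(encoded)
```

Looking each token up once in a dict also avoids rescanning the whole string for every dictionary entry.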


Try the following: it assigns a number to each token and then replaces each token with its number.

a = """Ah , this is greasy .
I want to eat kimchee .
Is Chae Yoon &apos;s coordinator in here ?
Excuse me , aren &apos;t you Chae Yoon &apos;s coordinator ? Yes . Me ?
-Chae Yoon is done singing .
This lady right next to me ... everyone knows who she is right ?""".split(" ")

# Map each unique token to a number (stored as a string so it can be joined).
key_map = {token: str(i) for i, token in enumerate(set(a))}
" ".join(key_map[token] for token in a)

I.e. first map each unique token to a number; you can then use the key map to assign a numeric value to each token.
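Because split(" ") does not split on newlines, the last token of one line and the first token of the next line end up fused into a single token in the snippet above. A hedged variant that works line by line and numbers tokens in order of first appearance (the helper name `encode_lines` and the shortened sample are made up for illustration):

```python
def encode_lines(text):
    """Replace every whitespace-separated token with a number, line by line."""
    key_map = {}          # token -> number, in order of first appearance
    encoded_lines = []
    for line in text.splitlines():
        numbers = []
        for token in line.split():
            if token not in key_map:
                key_map[token] = len(key_map)
            numbers.append(str(key_map[token]))
        encoded_lines.append(" ".join(numbers))
    return "\n".join(encoded_lines), key_map


sample = "Ah , this is greasy .\nI want to eat kimchee ."
encoded, key_map = encode_lines(sample)
print(encoded)
```

Here the "." at the end of both lines receives the same number, and the line breaks of the input are preserved in the output.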


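from collections import defaultdict
from itertools import count

# filename and output are the paths of the input and output files (Python 3).
with open(filename) as f, open(output, 'w') as out:
    # A missing key triggers the factory, which returns the next integer.
    d = defaultdict(count().__next__)
    for line in f:
        tokens = line.split()
        # Convert the numbers to strings before joining; keep one line per line.
        out.write(' '.join(str(d[token]) for token in tokens) + '\n')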


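Using defaultdict, we can remember the tokens we have already seen. Each time we see a new token, we take the next number and assign it to that token. The output is written to another file.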
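To see how a counter-backed defaultdict hands out numbers, here is a small in-memory sketch for Python 3 (the sample lines and the names `ids` and `encoded` are illustrative only):

```python
from collections import defaultdict
from itertools import count

# Looking up a missing key calls the factory, which returns 0, 1, 2, ...
ids = defaultdict(count().__next__)

lines = ["Is Chae Yoon here ?", "Chae Yoon is done ."]
encoded = [" ".join(str(ids[token]) for token in line.split()) for line in lines]

print(encoded)
print(dict(ids))
```

Tokens that repeat across lines ("Chae", "Yoon") keep the number they received the first time, while "Is" and "is" are distinct tokens because the lookup is case-sensitive.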

words = "super string".split(' ')
mapping = {}   # token -> number; a dict, and named so it does not shadow map()
numbers = []
for word in words:
    if word not in mapping:
        mapping[word] = len(mapping)
    numbers.append(str(mapping[word]))
result = ' '.join(numbers)


This avoids my_string = my_string.replace(k, v), which is what makes it slow.


This may be overkill, but you could write your own classifier:

# Python 3.x
class Classifier(dict):
    '''A dict that assigns consecutive numbers (starting at 1) to new keys.'''
    def __init__(self, args=None):
        '''args is an iterable of keys (only)'''
        self.n = 1  # next number to hand out
        super().__init__()
        if args:
            for thing in args:
                self[thing] = self.n
    def __setitem__(self, key, value=None):
        # The supplied value is ignored; existing keys keep their number.
        if key not in self:
            super().__setitem__(key, self.n)
            self.n += 1
    def setdefault(self, key, default=None):
        # Return the key's number, assigning the next one if the key is new.
        increment = key not in self
        n = super().setdefault(key, self.n)
        self.n += int(increment)
        return n
    def update(self, other):
        # other is an iterable of (key, value) pairs; the values are ignored.
        for k, v in other:
            self.setdefault(k)
    def transpose(self):
        # Reverse mapping: number -> key.
        return {v: k for k, v in self.items()}

Usage:

c = Classifier()
with open('foo.txt') as infile, open('classified.txt', 'w+') as outfile:
    for line in infile:
        line = (str(c.setdefault(token)) for token in line.strip().split())
        outfile.write(' '.join(line))
        outfile.write('\n')

To reduce the number of writes, you could accumulate lines in a list and call writelines() once the list reaches some set length.
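A minimal sketch of that batching idea, using a plain dict in place of the Classifier class and in-memory streams in place of real files (the function name `encode_stream` and the `batch_size` parameter are assumptions made for illustration):

```python
import io


def encode_stream(infile, outfile, batch_size=1000):
    """Encode tokens line by line, writing the output in batches."""
    ids = {}       # token -> number, starting at 1
    batch = []     # encoded lines waiting to be written
    for line in infile:
        numbers = (str(ids.setdefault(tok, len(ids) + 1)) for tok in line.split())
        batch.append(" ".join(numbers) + "\n")
        if len(batch) >= batch_size:
            outfile.writelines(batch)
            batch.clear()
    if batch:      # write whatever is left over
        outfile.writelines(batch)
    return ids


src = io.StringIO("a b a\nc a b\nd\n")
dst = io.StringIO()
ids = encode_stream(src, dst, batch_size=2)
print(dst.getvalue())
```

Whether batching actually helps depends on the platform, since file objects are already buffered; it is worth measuring on the real corpus before committing to a batch size.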

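If you have enough memory, you could read the entire file in, split it, and feed the result to the Classifier.

To turn the numbers back into tokens: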

z = c.transpose()
with open('classified.txt') as f:
    for line in f:
        line = (z[int(n)] for n in line.strip().split())
        print(' '.join(line))
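The same round trip can be checked in isolation with a plain dict instead of the Classifier class. This is only a sketch to confirm that decoding restores the original tokens; the names `ids` and `tokens_by_id` are illustrative:

```python
line = "Excuse me , are you the coordinator ? Yes . Me ?"

# Encode: number tokens from 1 in order of first appearance.
ids = {}
encoded = " ".join(str(ids.setdefault(tok, len(ids) + 1)) for tok in line.split())

# Decode: invert the mapping (number -> token) and look each number up.
tokens_by_id = {number: token for token, number in ids.items()}
decoded = " ".join(tokens_by_id[int(n)] for n in encoded.split())

print(encoded)
print(decoded == line)
```

Note that decoding joins tokens with single spaces, so the text is restored exactly only when the input was separated by single spaces to begin with.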

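For Python 2.7, super() requires arguments: replace super() with super(Classifier, self).

If you will mainly be working with the token numbers as strings, then inside the class you should convert self.n to a string when saving it, so that you do not have to convert back and forth between strings and integers in your working code.

You could also use sklearn.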


Thanks. But are you sure I am not assigning two different numbers to occurrences of the same token?

mapping = dict[(i,x) for i,x in enumerate(mapping)] ^ SyntaxError: invalid syntax

You need parentheses around the square brackets.

AttributeError: 'itertools.count' object has no attribute '__next__'

@yusuf Did you include the underscores in __next__? In that case, leave the underscores out: c.next

When you shadow a built-in (map), make sure you are aware that you did.

map[word] = len(map) TypeError: list indices must be integers, not unicode

super().__init__() TypeError: super() takes at least 1 argument (0 given)

@yusuf Are you using Python 2.x or 3.x?