Enumerating and replacing all tokens in a text file in Python
Dear Python enthusiasts, I have a question for you. I have a corpus file that looks like this:
Ah , this is greasy .
I want to eat kimchee .
Is Chae Yoon 's coordinator in here ?
Excuse me , aren 't you Chae Yoon 's coordinator ? Yes . Me ?
-Chae Yoon is done singing .
This lady right next to me ... everyone knows who she is right ?
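I want to assign a specific number to each token and replace each token in the file with its assigned number.
By token I mean, basically, every group of characters in the file separated by ' ' (a space). So, for example, ? is a token, and Excuse is a token as well.
My corpus file has more than 4 million lines like the ones above. Could you show me the fastest way to do what I want?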
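Thanks.
If you already have a specific dictionary for changing your values, you only need to map the new values:
    mapping = {'?': 1, 'Excuse': 2, ...}
    # Python 3: use items(); replace() needs a string, so convert the number.
    # Note: replace() matches substrings, so 'is' would also match inside 'this'.
    for k, v in mapping.items():
        my_string = my_string.replace(k, str(v))
If you want to create a brand-new dictionary:
    tokens = set(my_string.split(' '))
    mapping = {token: i for i, token in enumerate(tokens)}
    for k, v in mapping.items():
        my_string = my_string.replace(k, str(v))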
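One thing to watch for with the replace() approach above: str.replace works on substrings, not on whole tokens. The following is a minimal sketch of the difference, using two hypothetical helper names (replace_naive and replace_by_token) and a made-up three-word sentence; it is only an illustration, not the answer's own code.

```python
def replace_naive(text, mapping):
    """Replace every key with its number using str.replace (substring-based)."""
    for token, number in mapping.items():
        text = text.replace(token, str(number))
    return text


def replace_by_token(text, mapping):
    """Replace whole space-separated tokens only."""
    return ' '.join(str(mapping[token]) for token in text.split(' '))


sentence = 'this is greasy'
# 'is' comes first here, so the naive version also rewrites the 'is' inside 'this'.
mapping = {'is': 0, 'this': 1, 'greasy': 2}

naive = replace_naive(sentence, mapping)
by_token = replace_by_token(sentence, mapping)
print(naive)     # th0 0 2
print(by_token)  # 1 0 2
```

Splitting once and looking each token up in the dictionary also avoids scanning the whole string once per dictionary entry, which matters for a corpus of several million lines.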
Try the following: it assigns a number to each token and then replaces each token with its number.
    a = """Ah , this is greasy .
    I want to eat kimchee .
    Is Chae Yoon 's coordinator in here ?
    Excuse me , aren 't you Chae Yoon 's coordinator ? Yes . Me ?
    -Chae Yoon is done singing .
    This lady right next to me ... everyone knows who she is right ?""".split(" ")
    # Note: split(" ") keeps newlines attached to tokens (e.g. ".\nI");
    # split each line separately if that matters.
    key_map = {token: str(i) for i, token in enumerate(set(a))}
    result = " ".join(key_map[x] for x in a)
That is, first map each unique token to a number; you can then use key_map to assign the numeric value to every token.
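    from collections import defaultdict
    from itertools import count

    # filename and output are the paths of the input and output files
    with open(filename) as f, open(output, 'w') as out:
        c = count()
        d = defaultdict(c.__next__)  # a new token gets the next number
        for line in f:
            tokens = line.split()
            # numbers must be converted to str before joining; keep one line per input line
            out.write(' '.join(str(d[token]) for token in tokens) + '\n')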
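With a defaultdict we can remember the tokens we have already seen. Every time we see a new token, we take the next number and assign it to that token. The output is written to another file.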
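To make the defaultdict idea concrete, here is a small self-contained sketch that runs on in-memory text instead of real files. The function name encode_lines and the use of io.StringIO are assumptions made for the example; with real files you would pass the opened file objects instead.

```python
import io
from collections import defaultdict
from itertools import count


def encode_lines(lines, out, ids=None):
    """Write each line with its tokens replaced by numbers; return the mapping."""
    if ids is None:
        # Looking up an unseen token calls count().__next__ and stores the result.
        ids = defaultdict(count().__next__)
    for line in lines:
        out.write(' '.join(str(ids[tok]) for tok in line.split()) + '\n')
    return ids


source = io.StringIO("Ah , this is greasy .\nI want to eat kimchee .\n")
target = io.StringIO()
ids = encode_lines(source, target)
print(target.getvalue())
# 0 1 2 3 4 5
# 6 7 8 9 10 5
```

The "." at the end of both lines gets the same number (5), because the second lookup finds the entry created by the first one.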
    split = "super string".split(' ')
    mapping = {}  # token -> number; not called "map", to avoid shadowing the built-in
    result = ''
    for word in split:
        if word not in mapping:
            mapping[word] = len(mapping)
        result += ' ' + str(mapping[word])
This avoids my_string = my_string.replace(k, v), which makes things slow.
It might be overkill, but you could write your own classifier:
    # Python 3.x
    class Classifier(dict):
        def __init__(self, args = None):
            '''args is an iterable of keys (only)'''
            self.n = 1
            super().__init__()
            if args:
                for thing in args:
                    self[thing] = self.n
        def __setitem__(self, key, value = None):
            ## print('setitem', key)
            if key not in self:
                super().__setitem__(key, self.n)
                self.n += 1
        def setdefault(self, key, default = None):
            increment = key not in self
            n = super().setdefault(key, self.n)
            self.n += int(increment)
            ## print('setdefault', n)
            return n
        def update(self, other):
            for k, v in other:
                self.setdefault(k)
        def transpose(self):
            return {v:k for k, v in self.items()}
Usage:
    c = Classifier()
    with open('foo.txt') as infile, open('classified.txt', 'w+') as outfile:
        for line in infile:
            line = (str(c.setdefault(token)) for token in line.strip().split())
            outfile.write(' '.join(line))
            outfile.write('\n')
To reduce the number of writes, you can accumulate lines in a list and call writelines() once it reaches a set length.
If you have enough memory, you can read the whole file, split it, and feed it to the classifier.
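The batching idea could look roughly like the sketch below. To keep it self-contained it uses a plain dict rather than the Classifier class; the function name encode_buffered and the batch_size parameter are assumptions for the example, and a good batch size would have to be found by measuring.

```python
import io


def encode_buffered(infile, outfile, token_ids, batch_size=10000):
    """Encode line by line, but write the results out in batches."""
    buffer = []
    for line in infile:
        # setdefault gives a new token the current dict size as its number
        numbers = (str(token_ids.setdefault(tok, len(token_ids)))
                   for tok in line.split())
        buffer.append(' '.join(numbers) + '\n')
        if len(buffer) >= batch_size:
            outfile.writelines(buffer)
            buffer.clear()
    if buffer:  # flush whatever is left over
        outfile.writelines(buffer)


token_ids = {}
out = io.StringIO()
encode_buffered(io.StringIO("a b a\nc a b\nb c d\n"), out, token_ids, batch_size=2)
print(out.getvalue())
```

File objects are already buffered by Python, so whether this is faster than calling write() per line is something to check on the real corpus.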
To de-classify (turn the numbers back into tokens):
    z = c.transpose()
    with open('classified.txt') as f:
        for line in f:
            line = (z[int(n)] for n in line.strip().split())
            print(' '.join(line))
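The decoding step only needs the inverted mapping, so it can be tried without any files. This is a small sketch with a hand-written dictionary; the names token_ids, id_tokens and decode_line are made up for the example.

```python
# Hypothetical mapping, numbered from 1 like the classifier above
token_ids = {'Ah': 1, ',': 2, 'this': 3, 'is': 4, 'greasy': 5, '.': 6}

# Invert it: number -> token (this is what transpose() returns)
id_tokens = {number: token for token, number in token_ids.items()}


def decode_line(line, id_tokens):
    """Turn a line of space-separated numbers back into tokens."""
    return ' '.join(id_tokens[int(n)] for n in line.split())


encoded = '1 2 3 4 5 6'
decoded = decode_line(encoded, id_tokens)
print(decoded)  # Ah , this is greasy .
```

The inversion is only safe because every token has its own number; if two tokens shared a number, one of them would be lost.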
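For Python 2.7, super() requires arguments: replace super() with super(Classifier, self).
If you will mostly be working with the token numbers as strings, you should convert self.n to a string inside the class when it is stored, so that you don't have to convert back and forth between strings and integers in your working code.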
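A variant that stores the numbers as strings might look like the sketch below. It is not the class from this answer: StringIds is a hypothetical, simplified helper that relies on dict's __missing__ hook, shown only to illustrate the idea of converting once at insertion time.

```python
class StringIds(dict):
    """Hypothetical helper: gives each new key the next number, stored as str."""

    def __init__(self):
        super().__init__()
        self.n = 1

    def __missing__(self, key):
        # Called by dict lookup when key is absent: store and return the next id.
        value = str(self.n)
        self[key] = value
        self.n += 1
        return value


ids = StringIds()
line = ' '.join(ids[token] for token in 'I want to eat kimchee .'.split())
print(line)  # 1 2 3 4 5 6
```

Because the values are already strings, they can be passed straight to ' '.join() without a str() call per token.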
You can also use sklearn.
Comments:
Thanks. But are you sure I'm not giving two different numbers to occurrences of the same token?
mapping = dict[(i,x) for i,x in enumerate(mapping)] ^ SyntaxError: invalid syntax
You need parentheses around the square brackets.
AttributeError: 'itertools.count' object has no attribute 'next'
@yusuf Did you include the underscores, __next__? In that case, don't use the underscores: c.next
When you shadow a built-in (map), make sure you realize that you did.
map[word] = len(map) TypeError: list indices must be integers, not unicode
super().__init__() TypeError: super() takes at least 1 argument (0 given)
@yusuf Are you using Python 2.x or 3.x?