Fast implementation of character n-grams for a word (Python 2.7)
I wrote the following code to compute character bigrams; its output is shown below. My question is: how can I get output that excludes the trailing single character (i.e. 't')? And is there a faster, more efficient way to compute character n-grams?
>>> b = 'student'
>>> y = []
>>> for x in range(len(b)):
...     n = b[x:x+2]
...     y.append(n)
...
>>> y
['st', 'tu', 'ud', 'de', 'en', 'nt', 't']
Here is the result I would like to get: ['st', 'tu', 'ud', 'de', 'en', 'nt']
Thanks in advance for your suggestions.

To generate bigrams:
In [8]: b='student'
In [9]: [b[i:i+2] for i in range(len(b)-1)]
Out[9]: ['st', 'tu', 'ud', 'de', 'en', 'nt']
To generalize to a different n:
In [10]: n=4
In [11]: [b[i:i+n] for i in range(len(b)-n+1)]
Out[11]: ['stud', 'tude', 'uden', 'dent']
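The slicing comprehension above can be wrapped in a small helper. This is only a minimal sketch: the name `char_ngrams`, the `ValueError` for non-positive n, and the choice to return an empty list when the word is shorter than n are assumptions, not part of the original answer.

```python
def char_ngrams(word, n=2):
    """Return the character n-grams of word as a list of strings."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # range() is empty when len(word) < n, so short words yield [].
    return [word[i:i + n] for i in range(len(word) - n + 1)]


print(char_ngrams('student'))       # ['st', 'tu', 'ud', 'de', 'en', 'nt']
print(char_ngrams('student', 4))    # ['stud', 'tude', 'uden', 'dent']
print(char_ngrams('hi', 3))         # []
```

Stopping the range at `len(word) - n + 1` is what drops the short trailing piece ('t' in the question).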
Try zip:
>>> def word2ngrams(text, n=3, exact=True):
...     """ Convert text into character ngrams. """
...     return ["".join(j) for j in zip(*[text[i:] for i in range(n)])]
...
>>> word2ngrams('foobarbarblacksheep')
['foo', 'oob', 'oba', 'bar', 'arb', 'rba', 'bar', 'arb', 'rbl', 'bla', 'lac', 'ack', 'cks', 'ksh', 'she', 'hee', 'eep']
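To see why the zip version works: it builds n copies of the text, each shifted one character further, and zip stops at the shortest copy, so no short trailing pieces are produced. The sketch below just exposes the intermediate values; the variable names `shifted` and `columns` are illustrative assumptions.

```python
text = 'student'
n = 3

# n copies of the text, each starting one character later.
shifted = [text[i:] for i in range(n)]
print(shifted)       # ['student', 'tudent', 'udent']

# zip pairs up characters column by column and stops at the shortest copy.
columns = list(zip(*shifted))
print(columns[0])    # ('s', 't', 'u')

# Joining each column tuple gives one n-gram.
ngrams = [''.join(chars) for chars in columns]
print(ngrams)        # ['stu', 'tud', 'ude', 'den', 'ent']
```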
But note that it is slower than the slicing version:
import string, random, time

def zip_ngrams(text, n=3, exact=True):
    return ["".join(j) for j in zip(*[text[i:] for i in range(n)])]

def nozip_ngrams(text, n=3):
    return [text[i:i+n] for i in range(len(text)-n+1)]

# Generate 10000 random strings of length 100.
words = [''.join(random.choice(string.ascii_uppercase) for j in range(100)) for i in range(10000)]

start = time.time()
x = [zip_ngrams(w) for w in words]
print time.time() - start

start = time.time()
y = [nozip_ngrams(w) for w in words]
print time.time() - start

print x==y
[out]:
0.314492940903
0.197558879852
True
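A single `time.time()` measurement can be noisy. As a hedged alternative, the standard `timeit` module can repeat the measurement. The function names below mirror the ones above; the smaller word list and the repeat counts are arbitrary assumptions, and absolute numbers will vary by machine and Python version.

```python
import random
import string
import timeit


def zip_ngrams(text, n=3):
    return ["".join(j) for j in zip(*[text[i:] for i in range(n)])]


def nozip_ngrams(text, n=3):
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# 1000 random uppercase strings of length 100 (sizes chosen arbitrarily).
words = [''.join(random.choice(string.ascii_uppercase) for _ in range(100))
         for _ in range(1000)]

# Both implementations should agree before their speed is compared.
assert all(zip_ngrams(w) == nozip_ngrams(w) for w in words)

# Take the best of several repeats to reduce noise from other processes.
zip_time = min(timeit.repeat(lambda: [zip_ngrams(w) for w in words],
                             repeat=3, number=5))
nozip_time = min(timeit.repeat(lambda: [nozip_ngrams(w) for w in words],
                               repeat=3, number=5))
print("zip:   %.4f s" % zip_time)
print("slice: %.4f s" % nozip_time)
```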
This function gives you the n-grams of every size from 1 to n:
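def getNgrams(sentences, n):
    """For each sentence, collect its n-grams of every size from 1 to n."""
    ngrams = []
    for sentence in sentences:
        _ngrams = []
        for _n in range(1, n+1):
            # Start at position 0 and include the last full-length window.
            for pos in range(len(sentence)-_n+1):
                _ngrams.append(sentence[pos:pos+_n])
        ngrams.append(_ngrams)
    return ngrams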
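For a single word, all sizes from 1 up to a maximum can also be collected with the same slicing idiom used earlier. A minimal sketch: `ngrams_up_to` is a hypothetical name, and returning one flat list ordered by n-gram size is an assumed output shape.

```python
def ngrams_up_to(word, max_n):
    """Return all character n-grams of word for sizes 1..max_n, shortest first."""
    return [word[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(word) - n + 1)]


print(ngrams_up_to('abc', 3))    # ['a', 'b', 'c', 'ab', 'bc', 'abc']
```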