Frequency of ngrams (strings) in tokenized text


I have a set of unique ngrams (a list called ngramlist) and an ngram-tokenized text (a list called ngrams). I want to construct a new vector, freqlist, where each element of freqlist is the fraction of ngrams that equals that element of ngramlist. I wrote the following code, which gives the correct output, but I'm wondering whether there is a way to optimize it:

freqlist = [
    sum(int(ngram == ngram_candidate)
        for ngram_candidate in ngrams) / len(ngrams)
    for ngram in ngramlist
]
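For concreteness, the line above can be exercised on toy data (the ngram values here are made up for illustration):

```python
# Toy data (made up for illustration): an ngram-tokenized text
# and the vocabulary of unique ngrams.
ngrams = ["a b", "b c", "a b", "c d"]
ngramlist = ["a b", "b c", "x y"]

freqlist = [
    sum(int(ngram == ngram_candidate)
        for ngram_candidate in ngrams) / len(ngrams)
    for ngram in ngramlist
]
print(freqlist)  # [0.5, 0.25, 0.0]
```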
I imagine there is a function in nltk or elsewhere that does this faster, but I'm not sure which one.

Thanks

Edit: it's worth mentioning that ngrams is generated as joined output, and ngramlist is just a list made up of all the ngrams that were found.

Edit 2:

Here is reproducible code to test the freqlist line (the rest of the code is not really what I care about)


You can probably speed this up by precomputing some quantities and using a Counter. This will be especially useful if most of the elements of ngramlist are contained in ngrams.

freqlist = [
    sum(int(ngram == ngram_candidate)
        for ngram_candidate in ngrams) / len(ngrams)
    for ngram in ngramlist
]
You certainly don't need to iterate over ngrams every time you check an ngram. A single pass over ngrams makes this an O(n) algorithm instead of the current O(n²) one. Keep in mind that shorter code is not necessarily better or more efficient code:

from collections import Counter
...

counter = Counter(ngrams)
size = len(ngrams)
freqlist = [counter.get(ngram, 0) / size for ngram in ngramlist]
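A quick sanity check of this Counter-based approach on made-up data (the ngram strings are hypothetical):

```python
from collections import Counter

# Hypothetical toy data, same shapes as in the question.
ngrams = ["a b", "b c", "a b", "c d"]
ngramlist = ["a b", "b c", "x y"]

counter = Counter(ngrams)
size = len(ngrams)
# counter.get(ngram, 0) returns the count, or 0 for ngrams never seen.
freqlist = [counter.get(ngram, 0) / size for ngram in ngramlist]
print(freqlist)  # [0.5, 0.25, 0.0]
```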
To use this properly, you have to write a def function rather than a lambda:

def count_ngrams(ngrams):
    counter = Counter(ngrams)
    size = len(ngrams)
    # ngramlist is the global vocabulary from the question.
    freqlist = [counter.get(ngram, 0) / size for ngram in ngramlist]
    return freqlist

df['freqlist'] = df['ngrams-3'].map(count_ngrams)
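This mapping pattern can be sketched end-to-end on a tiny DataFrame; the column name and all values are stand-ins for the real data:

```python
from collections import Counter

import pandas as pd

# Hypothetical vocabulary shared across all rows; count_ngrams reads it
# from the enclosing scope, just like in the answer above.
ngramlist = ["a b", "b c"]

def count_ngrams(row_ngrams):
    counter = Counter(row_ngrams)
    size = len(row_ngrams)
    return [counter.get(ngram, 0) / size for ngram in ngramlist]

# A stand-in for the 'ngrams-3' column of the real DataFrame.
df = pd.DataFrame({'ngrams-3': [["a b", "a b"], ["b c", "c d"]]})
df['freqlist'] = df['ngrams-3'].map(count_ngrams)
print(df['freqlist'].tolist())  # [[1.0, 0.0], [0.0, 0.5]]
```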

First, don't pollute imported functions by overwriting them and using them as variables. Keep the name ngrams for the function and use something else for the variable:

import time
from functools import partial
from itertools import chain
from collections import Counter

import wikipedia

import pandas as pd

from nltk import word_tokenize
from nltk.util import ngrams
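The shadowing problem can be seen in a minimal sketch; a stand-in ngrams function is defined locally so the snippet runs without nltk installed, but the failure mode is the same for the imported one:

```python
# Stand-in for nltk.util.ngrams, so this sketch runs without nltk installed.
def ngrams(sequence, n):
    return zip(*(sequence[i:] for i in range(n)))

tokens = ['a', 'b', 'c', 'd']
print(list(ngrams(tokens, 3)))  # [('a', 'b', 'c'), ('b', 'c', 'd')]

# Rebinding the name to a list shadows the function...
ngrams = list(ngrams(tokens, 2))

# ...so any later call to ngrams() raises a TypeError.
try:
    ngrams(tokens, 2)
except TypeError:
    print('ngrams is no longer callable')
```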
Next, the steps before the line you asked about in your original question may be a bit inefficient. You can clean them up, make them easier to read, and measure them:

# Downloading the articles.
titles = ['New York City','Moscow','Beijing']
start = time.time()
df = pd.DataFrame({'article':[wikipedia.page(title).content for title in titles]})
end = time.time()
print('Downloading wikipedia articles took', end-start, 'seconds')
Then:

# Tokenizing the articles
start = time.time()
df['tokens'] = df['article'].apply(word_tokenize)
end = time.time()
print('Tokenizing articles took', end-start, 'seconds')
Then:

# Extracting trigrams.
trigrams = partial(ngrams, n=3)
start = time.time()
# There's no need to flatten them to strings, you could just use list()
df['trigrams'] = df['tokens'].apply(lambda x: list(trigrams(x)))
end = time.time()
print('Extracting trigrams took', end-start, 'seconds')

Finally, onto the last line:

# Instead of a set, we use a Counter here because
# we can use an intersection between Counter objects later.
# see https://stackoverflow.com/questions/44012479/intersection-of-two-counters
all_trigrams = Counter(chain(*df['trigrams']))

# More often than not, you don't need to keep all the
# zeros in the vectors (aka dense vector),
# you could actually get the non-zero sparse vectors
# as a dict as such
df['trigrams_count'] = df['trigrams'].apply(lambda x: Counter(x) & all_trigrams)

# Now to normalize the count, simply do:
def featurize(list_of_ngrams):
    nonzero_features = Counter(list_of_ngrams) & all_trigrams
    total = len(list_of_ngrams)
    return {ng:count/total for ng, count in nonzero_features.items()}

df['trigrams_count_normalize'] = df['trigrams'].apply(featurize)

Comments:

Where did you get this function from? What is in the input ngramlist? What is the expected output? What is the code before the line you posted? Please edit with code to generate it. @alvas

How is that relevant? @Ilja

Unfortunately, the code you showed doesn't really help us reproduce your actual data, so it's meaningless for this question. Luckily, I don't think you really need it in the first place.

I hadn't considered Counter.get(). I like this solution, thanks! By the way, don't use ngrams as a variable; it's a function imported into the namespace =)

That solves everything I was missing. Nice, thanks for writing it up. 1. Yes, I agree about the article downloading; I don't know why I wrote it that way, I just wanted to get a snippet out as quickly as possible because people asked for it. 2. The partial trigrams line doesn't actually do what the original line does, because you need a ''.join() on the tuples (the real underlying code actually does something with the ngram strings). I suppose I could use a lambda for that, though. 3. Using a Counter instead of a set is interesting; can we guarantee that the order is preserved across each apply?

If you do something with the ngram strings and then vectorize, you're doing it inefficiently. What really matters in the end are the vector values (dense/sparse); the machine learning algorithm doesn't care about the features (keys) themselves, as long as they're consistent. IMHO, don't trust strings; strings are a higher-level construct, trust arrays. A string should be an array of characters ;P

Whoops, my mistake. Ngrams are tuples anyway, which are hashable. While sets of sets are unhashable, I actually do care about the order, so tuples are fine.
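The featurize function can be exercised without downloading any articles; a minimal sketch with made-up trigram tuples:

```python
from collections import Counter
from itertools import chain

# Made-up corpus: two documents, each a list of trigram tuples.
docs = [
    [('a', 'b', 'c'), ('b', 'c', 'd'), ('a', 'b', 'c')],
    [('b', 'c', 'd'), ('x', 'y', 'z')],
]

all_trigrams = Counter(chain(*docs))

def featurize(list_of_ngrams):
    # The Counter intersection keeps only trigrams seen in the vocabulary,
    # giving a sparse dict rather than a dense vector of mostly zeros.
    nonzero_features = Counter(list_of_ngrams) & all_trigrams
    total = len(list_of_ngrams)
    return {ng: count / total for ng, count in nonzero_features.items()}

print(featurize(docs[0]))  # {('a', 'b', 'c'): 2/3, ('b', 'c', 'd'): 1/3}
```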