Frequency of ngrams (strings) in tokenized text


I have a set of unique ngrams (a list called ngramlist) and an ngram-tokenized text (a list called ngrams). I want to construct a new vector, freqlist, where each element of freqlist is the fraction of ngrams that equals that element of ngramlist. I wrote the following code, which gives the correct output, but I'm wondering whether there is a way to optimize it:

freqlist = [
    sum(int(ngram == ngram_candidate)
        for ngram_candidate in ngrams) / len(ngrams)
    for ngram in ngramlist
]
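For concreteness, the line above can be exercised on toy data (the ngram values here are made up for illustration):

```python
# Toy data (made up for illustration): an ngram-tokenized text
# and the vocabulary of unique ngrams.
ngrams = ["a b", "b c", "a b", "c d"]
ngramlist = ["a b", "b c", "x y"]

freqlist = [
    sum(int(ngram == ngram_candidate)
        for ngram_candidate in ngrams) / len(ngrams)
    for ngram in ngramlist
]
print(freqlist)  # [0.5, 0.25, 0.0]
```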
I imagine there is a function in nltk or elsewhere that does this faster, but I'm not sure which one.

Thanks

Edit: it's worth mentioning that ngrams is generated as joined output, and ngramlist is just a list made up of all the ngrams that were found.

Edit 2:

Here is reproducible code to test the freqlist line (the rest of the code is not really what I care about)


You can probably speed this up by precomputing some quantities and using a Counter. This will be especially useful if most of the elements of ngramlist are contained in ngrams.

freqlist = [
    sum(int(ngram == ngram_candidate)
        for ngram_candidate in ngrams) / len(ngrams)
    for ngram in ngramlist
]
You certainly don't need to iterate over ngrams every time you check an ngram. A single pass over ngrams makes this an O(n) algorithm instead of the current O(n²) one. Keep in mind that shorter code is not necessarily better or more efficient code:

from collections import Counter
...

counter = Counter(ngrams)
size = len(ngrams)
freqlist = [counter.get(ngram, 0) / size for ngram in ngramlist]
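A quick sanity check of this Counter-based approach on made-up data (the ngram strings are hypothetical):

```python
from collections import Counter

# Hypothetical toy data, same shapes as in the question.
ngrams = ["a b", "b c", "a b", "c d"]
ngramlist = ["a b", "b c", "x y"]

counter = Counter(ngrams)
size = len(ngrams)
# counter.get(ngram, 0) returns the count, or 0 for ngrams never seen.
freqlist = [counter.get(ngram, 0) / size for ngram in ngramlist]
print(freqlist)  # [0.5, 0.25, 0.0]
```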
To use this properly, you have to write a def function rather than a lambda:

def count_ngrams(ngrams):
    counter = Counter(ngrams)
    size = len(ngrams)
    # ngramlist is the global vocabulary from the question.
    freqlist = [counter.get(ngram, 0) / size for ngram in ngramlist]
    return freqlist

df['freqlist'] = df['ngrams-3'].map(count_ngrams)
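This mapping pattern can be sketched end-to-end on a tiny DataFrame; the column name and all values are stand-ins for the real data:

```python
from collections import Counter

import pandas as pd

# Hypothetical vocabulary shared across all rows; count_ngrams reads it
# from the enclosing scope, just like in the answer above.
ngramlist = ["a b", "b c"]

def count_ngrams(row_ngrams):
    counter = Counter(row_ngrams)
    size = len(row_ngrams)
    return [counter.get(ngram, 0) / size for ngram in ngramlist]

# A stand-in for the 'ngrams-3' column of the real DataFrame.
df = pd.DataFrame({'ngrams-3': [["a b", "a b"], ["b c", "c d"]]})
df['freqlist'] = df['ngrams-3'].map(count_ngrams)
print(df['freqlist'].tolist())  # [[1.0, 0.0], [0.0, 0.5]]
```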

First, don't pollute imported functions by overwriting them and using them as variables. Keep the name ngrams for the function and use something else for the variable:

import time
from functools import partial
from itertools import chain
from collections import Counter

import wikipedia

import pandas as pd

from nltk import word_tokenize
from nltk.util import ngrams
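The shadowing problem can be seen in a minimal sketch; a stand-in ngrams function is defined locally so the snippet runs without nltk installed, but the failure mode is the same for the imported one:

```python
# Stand-in for nltk.util.ngrams, so this sketch runs without nltk installed.
def ngrams(sequence, n):
    return zip(*(sequence[i:] for i in range(n)))

tokens = ['a', 'b', 'c', 'd']
print(list(ngrams(tokens, 3)))  # [('a', 'b', 'c'), ('b', 'c', 'd')]

# Rebinding the name to a list shadows the function...
ngrams = list(ngrams(tokens, 2))

# ...so any later call to ngrams() raises a TypeError.
try:
    ngrams(tokens, 2)
except TypeError:
    print('ngrams is no longer callable')
```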
Next, the steps before the line you asked about in your original question may be a bit inefficient. You can clean them up, make them easier to read, and measure them:

# Downloading the articles.
titles = ['New York City','Moscow','Beijing']
start = time.time()
df = pd.DataFrame({'article':[wikipedia.page(title).content for title in titles]})
end = time.time()
print('Downloading wikipedia articles took', end-start, 'seconds')
Then:

# Tokenizing the articles
start = time.time()
df['tokens'] = df['article'].apply(word_tokenize)
end = time.time()
print('Tokenizing articles took', end-start, 'seconds')
Then:

# Extracting trigrams.
trigrams = partial(ngrams, n=3)
start = time.time()
# There's no need to flatten them to strings, you could just use list()
df['trigrams'] = df['tokens'].apply(lambda x: list(trigrams(x)))
end = time.time()
print('Extracting trigrams took', end-start, 'seconds')

Finally, onto the last line:

# Instead of a set, we use a Counter here because
# we can use an intersection between Counter objects later.
# see https://stackoverflow.com/questions/44012479/intersection-of-two-counters
all_trigrams = Counter(chain(*df['trigrams']))

# More often than not, you don't need to keep all the
# zeros in the vectors (aka dense vector),
# you could actually get the non-zero sparse vectors
# as a dict as such
df['trigrams_count'] = df['trigrams'].apply(lambda x: Counter(x) & all_trigrams)

# Now to normalize the count, simply do:
def featurize(list_of_ngrams):
    nonzero_features = Counter(list_of_ngrams) & all_trigrams
    total = len(list_of_ngrams)
    return {ng:count/total for ng, count in nonzero_features.items()}

df['trigrams_count_normalize'] = df['trigrams'].apply(featurize)

Comments:

Where did you get this function from? What is in the input ngramlist? What is the expected output? What is the code before the line you posted? Please edit with code to generate it. @alvas

How is that relevant? @Ilja

Unfortunately, the code you showed doesn't really help us reproduce your actual data, so it's meaningless for this question. Luckily, I don't think you really need it in the first place.

I hadn't considered Counter.get(). I like this solution, thanks! By the way, don't use ngrams as a variable; it's a function imported into the namespace =)

That solves everything I was missing. Nice, thanks for writing it up. 1. Yes, I agree about the article downloading; I don't know why I wrote it that way, I just wanted to get a snippet out as quickly as possible because people asked for it. 2. The partial trigrams line doesn't actually do what the original line does, because you need a ''.join() on the tuples (the real underlying code actually does something with the ngram strings). I suppose I could use a lambda for that, though. 3. Using a Counter instead of a set is interesting; can we guarantee that the order is preserved across each apply?

If you do something with the ngram strings and then vectorize, you're doing it inefficiently. What really matters in the end are the vector values (dense/sparse); the machine learning algorithm doesn't care about the features (keys) themselves, as long as they're consistent. IMHO, don't trust strings; strings are a higher-level construct, trust arrays. A string should be an array of characters ;P

Whoops, my mistake. Ngrams are tuples anyway, which are hashable. While sets of sets are unhashable, I actually do care about the order, so tuples are fine.
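The featurize function can be exercised without downloading any articles; a minimal sketch with made-up trigram tuples:

```python
from collections import Counter
from itertools import chain

# Made-up corpus: two documents, each a list of trigram tuples.
docs = [
    [('a', 'b', 'c'), ('b', 'c', 'd'), ('a', 'b', 'c')],
    [('b', 'c', 'd'), ('x', 'y', 'z')],
]

all_trigrams = Counter(chain(*docs))

def featurize(list_of_ngrams):
    # The Counter intersection keeps only trigrams seen in the vocabulary,
    # giving a sparse dict rather than a dense vector of mostly zeros.
    nonzero_features = Counter(list_of_ngrams) & all_trigrams
    total = len(list_of_ngrams)
    return {ng: count / total for ng, count in nonzero_features.items()}

print(featurize(docs[0]))  # {('a', 'b', 'c'): 2/3, ('b', 'c', 'd'): 1/3}
```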