Impression-based N-gram analysis in Python


Here is what my sample dataset looks like:

My goal is to find out how many impressions are associated with one word, two words, three words, four words, five words, and six words. I used to run an N-gram algorithm, but it only returns counts. Here is my current n-gram code:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def find_ngrams(text, n):
    word_vectorizer = CountVectorizer(ngram_range=(n, n), analyzer='word')
    sparse_matrix = word_vectorizer.fit_transform(text)
    frequencies = sum(sparse_matrix).toarray()[0]
    ngram = pd.DataFrame(frequencies,
                         index=word_vectorizer.get_feature_names(),
                         columns=['frequency'])
    ngram = ngram.sort_values(by=['frequency'], ascending=[False])
    return ngram

one = find_ngrams(df['query'],1)
bi = find_ngrams(df['query'],2)
tri = find_ngrams(df['query'],3)
quad = find_ngrams(df['query'],4)
pent = find_ngrams(df['query'],5)
hexx = find_ngrams(df['query'],6)
What I think I need to do is: 1. Split the queries into one-word through six-word n-grams. 2. Attach the impressions to the split words. 3. Regroup all the split words and sum up the impressions.

Taking the second query, "dog common diseases and how to treat them", as an example, it should be split as follows (a short sketch that reproduces this split appears after the list):

(1) 1-gram: dog, common, diseases, and, how, to, treat, them;
(2) 2-gram: dog common, common diseases, diseases and, and how, how to, to treat, treat them;
(3) 3-gram: dog common diseases, common diseases and, diseases and how, and how to, how to treat, to treat them;
(4) 4-gram: dog common diseases and, common diseases and how, diseases and how to, and how to treat, how to treat them;
(5) 5-gram: dog common diseases and how, common diseases and how to, diseases and how to treat, and how to treat them;
(6) 6-gram: dog common diseases and how to, common diseases and how to treat, diseases and how to treat them;
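This split can be reproduced with a few lines of plain Python (a minimal sketch, not part of the original question):

query = "dog common diseases and how to treat them"
words = query.split()
for n in range(1, 7):
    # all contiguous n-grams of length n
    print(f"{n}-gram:", [" ".join(words[i:i+n]) for i in range(len(words) - n + 1)])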

Here is one way to do it! Not the most efficient, but let's not optimize prematurely. The idea is to use apply to get a new pd.DataFrame with new columns for all the n-grams, join it with the old DataFrame, and do some stacking and grouping:

import pandas as pd

df = pd.DataFrame({
    "squery": ["how to feed a dog", "dog habits", "to cat or not to cat", "dog owners"],
    "count": [1000, 200, 100, 150]
})

def n_grams(txt):
    # generate every contiguous n-gram (of every length) of the query
    grams = list()
    words = txt.split(' ')
    for i in range(len(words)):
        for k in range(1, len(words) - i + 1):
            grams.append(" ".join(words[i:i+k]))
    return pd.Series(grams)

# one row per query, one column per n-gram, joined back onto the counts
counts = df.squery.apply(n_grams).join(df)

# stack the n-gram columns into rows, then sum impressions per unique n-gram
counts.drop("squery", axis=1).set_index("count").unstack()\
    .rename("ngram").dropna().reset_index()\
    .drop("level_0", axis=1).groupby("ngram")["count"].sum()
The last expression will return a pd.Series that looks like this:

    ngram
a                       1000
a dog                   1000
cat                      200
cat or                   100
cat or not               100
cat or not to            100
cat or not to cat        100
dog                     1350
dog habits               200
dog owners               150
feed                    1000
feed a                  1000
feed a dog              1000
habits                   200
how                     1000
how to                  1000
how to feed             1000
how to feed a           1000
how to feed a dog       1000
not                      100
not to                   100
not to cat               100
or                       100
or not                   100
or not to                100
or not to cat            100
owners                   150
to                      1200
to cat                   200
to cat or                100
to cat or not            100
to cat or not to         100
to cat or not to cat     100
to feed                 1000
to feed a               1000
to feed a dog           1000

Spiffy method

This one may be somewhat more efficient, but it still materializes the dense n-gram vectors from CountVectorizer. It multiplies the one-vector in each column by the number of impressions, then sums these columns to get the total impressions per n-gram. It gives the same result as above. One thing to note is that a query containing a repeated n-gram counts it twice as well.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(ngram_range=(1, 5))
ngrams = cv.fit_transform(df.squery)
# repeat the impression counts so each n-gram column can be weighted
mask = np.repeat(df['count'].values.reshape(-1, 1), repeats=len(cv.vocabulary_), axis=1)
# feature names ordered by their column index in the matrix
index = list(map(lambda x: x[0], sorted(cv.vocabulary_.items(), key=lambda x: x[1])))
pd.Series(np.multiply(mask, ngrams.toarray()).sum(axis=0), name="counts", index=index)
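As a side note on the densification concern: the same totals can be computed without calling .toarray(), by multiplying the transposed sparse matrix with the impression vector (a sketch reusing df, ngrams, and index from above; not part of the original answer):

# ngrams.T has shape (n_features, n_docs); dotting it with the impression
# vector sums impressions per n-gram without the dense intermediate
totals = pd.Series(ngrams.T.dot(df['count'].values), name="counts", index=index)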

How about something like this:

def find_ngrams(input, n):
    # from http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-in-python/
    return zip(*[input[i:] for i in range(n)])

def impressions_by_ngrams(data, ngram_max):
    from collections import defaultdict
    result = [defaultdict(int) for n in range(ngram_max)]
    for query, impressions in data:
        words = query.split()
        for n in range(ngram_max):
            for ngram in find_ngrams(words, n + 1):
                result[n][ngram] += impressions
    return result
For example:

>>> data = [('how to feed a dog', 10000),
...         ('see a dog run',     20000)]
>>> ngrams = impressions_by_ngrams(data, 3)
>>> ngrams[0]   # unigrams
defaultdict(<type 'int'>, {('a',): 30000, ('how',): 10000, ('run',): 20000, ('feed',): 10000, ('to',): 10000, ('see',): 20000, ('dog',): 30000})
>>> ngrams[1][('a', 'dog')]  # impressions for bigram 'a dog'
30000

Sorry, what is your question? Are you asking how to generate n-grams? What do you mean by "I used to run the N-gram algorithm, but it only returns counts"?

I need to find out the impressions associated with the n-grams. N-grams give how frequently a term occurs, but I need to know how many impressions are associated with it.

Counting what? It would help if you could provide your code.

I like this approach! We just need to group all identical queries, sum up their impressions, and then separate them by word count. Could you tell me how to store the pd.Series result as a DataFrame?

If s is a Series, s = pd.Series(), then s.to_frame() is a method that produces a one-column DataFrame.

Thanks. I renamed "squery" to "query" and changed the call to ngrams = cv.fit_transform(df.query), but it returns the error: 'method' object is not iterable. Do you know how to fix this?

It sounds like you forgot parentheses somewhere. Look at which line gives the error: somewhere on that line you are looping over something that is in fact not an iterable but a method. (Here df.query refers to the built-in DataFrame.query method rather than your column, so use df['query'] instead.)
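For reference, a minimal sketch of the to_frame() suggestion from the comments, reusing the groupby result from the first answer (the name ngram_df is illustrative):

result = counts.drop("squery", axis=1).set_index("count").unstack()\
    .rename("ngram").dropna().reset_index()\
    .drop("level_0", axis=1).groupby("ngram")["count"].sum()

ngram_df = result.to_frame()       # one-column DataFrame, n-grams as the index
ngram_df = ngram_df.reset_index()  # optionally move the n-grams into a column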