How to extract all n-grams from a text column of a pandas DataFrame, regardless of word order?

python pandas

Here is my input DataFrame:

id  description
1   **must watch avoid** **good acting**
2   average movie bad acting
3   good movie **acting good**
4   pathetic avoid
5   **avoid watch must**

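I want to extract n-grams (bigrams, trigrams and 4-grams) from the words that frequently occur together in the phrases. If we tokenize the phrases into words, can we still find the n-grams when the frequent words appear in a different order? For example, if the frequent words appear as "good movie" in one phrase and as "movie good" in another, can we still extract the bigram "good movie"? Here is a sample of the output I expect: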

ngram              frequency
must watch            2
acting good           2
must watch avoid      2
average               1

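As we can see, the frequently used words in the first sentence are "must watch", while the last sentence has "watch must", i.e. the order of the frequent words has changed. The bigram "must watch" is therefore extracted with a frequency of 2.

I need to extract n-grams/bigrams from the frequently used words in the phrases.

How can I do this with a pandas DataFrame in Python? Any help is much appreciated.

Thanks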

import pandas as pd
from collections import Counter
from itertools import chain

data = [
    {"sentence": "Run with dogs, or shoes, or dogs and shoes"},
    {"sentence": "Run without dogs, or without shoes, or without dogs or shoes"},
    {"sentence": "Hold this while I finish writing the python script"},
    {"sentence": "Is this python script written yet, hey, hold this"},
    {"sentence": "Can dogs write python, or a python script?"},
]

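def find_ngrams(input_list, n):
    # Zip n staggered copies of the list to get every window of n adjacent items.
    return list(zip(*[input_list[i:] for i in range(n)]))

df = pd.DataFrame.from_records(data)
# Split each sentence on spaces (punctuation stays attached) and build its bigrams.
df['bigrams'] = df['sentence'].map(lambda x: find_ngrams(x.split(" "), 2))
df.head()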

Now for the frequency counts:

# Bigram Frequency Counts
bigrams = df['bigrams'].tolist()
bigrams = list(chain(*bigrams))
bigrams = [(x.lower(), y.lower()) for x,y in bigrams]

bigram_counts = Counter(bigrams)
bigram_counts.most_common(10)

 [(('dogs,', 'or'), 2),
 (('shoes,', 'or'), 2),
 (('or', 'without'), 2),
 (('hold', 'this'), 2),
 (('python', 'script'), 2),
 (('run', 'with'), 1),
 (('with', 'dogs,'), 1),
 (('or', 'shoes,'), 1),
 (('or', 'dogs'), 1),
 (('dogs', 'and'), 1)]
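The counts above are order-sensitive: ("hold", "this") and ("this", "hold") would be counted as two different bigrams, whereas the question asks for word order to be ignored. One possible way to do that is to sort the tokens inside each window before counting. This is a minimal sketch, assuming that an n-gram is a window of n adjacent tokens and that two windows match when they contain the same words in any order. The helper name `unordered_ngram_counts` is hypothetical and not part of the original answer; the sample frame is the question's data with the emphasis markers removed.

```python
import pandas as pd
from collections import Counter

# Sample data taken from the question (emphasis markers removed).
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "description": [
        "must watch avoid good acting",
        "average movie bad acting",
        "good movie acting good",
        "pathetic avoid",
        "avoid watch must",
    ],
})

def find_ngrams(tokens, n):
    # Sliding window of n adjacent tokens, same idea as in the answer.
    return list(zip(*[tokens[i:] for i in range(n)]))

def unordered_ngram_counts(texts, n):
    # Hypothetical helper: count n-grams while ignoring word order.
    counts = Counter()
    for text in texts:
        tokens = str(text).lower().split()
        for gram in find_ngrams(tokens, n):
            # Sorting makes ("watch", "must") and ("must", "watch") the same key.
            counts[tuple(sorted(gram))] += 1
    return counts

bigram_counts = unordered_ngram_counts(df["description"], 2)
trigram_counts = unordered_ngram_counts(df["description"], 3)

print(bigram_counts.most_common(5))
print(trigram_counts.most_common(5))
```

Because each key is stored in sorted order, "must watch avoid" and "avoid watch must" both end up under `('avoid', 'must', 'watch')`. If the original word order matters for display, one option is to remember the first-seen ordering for each sorted key.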

When you say dataframe, do you mean string?

@Binyamin Even: I have a DataFrame with mixed objects, i.e. id with int dtype as one column and description with object dtype as the second column. The object column is a mix of numbers and strings, and I need to extract n-grams from it.

Try formatting your DataFrame as code in the question to make it readable.

I am looking for n-grams/bigrams built from the frequently used words in the phrases. In your example there are no repeated (i.e. frequently used) words.

Your question makes more sense now. Give me a bit of time.

Were you able to solve this? I am trying the same thing with my data. Should I remove stop words first?

@superdoophero - df['col'].tolist(), then flatten it (it will be a list of lists), then pass the result to the Counter class from the collections module.

@superdooperho - Yes, see the edited answer. I did not handle punctuation in the new example, but I think you will get the idea!
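The comments raise two preprocessing points: the example leaves punctuation attached to tokens (for instance 'dogs,'), and stop words may need to be removed first. The following is a rough sketch of one way to handle both, assuming a simple regex tokenizer and a tiny hand-picked stop-word set. `STOP_WORDS` and `tokenize` are illustrative names rather than anything from the answer, and a real project would probably use a fuller stop-word list.

```python
import re
from collections import Counter

# Illustrative stop-word set; an assumption, not a standard list.
STOP_WORDS = {"or", "and", "with", "without", "a", "the", "i", "is", "this"}

def tokenize(text, stop_words=STOP_WORDS):
    # Lowercase, keep runs of letters/digits/apostrophes, then drop stop words.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in stop_words]

def find_ngrams(tokens, n):
    # Sliding window of n adjacent tokens, as in the answer.
    return list(zip(*[tokens[i:] for i in range(n)]))

sentences = [
    "Run with dogs, or shoes, or dogs and shoes",
    "Can dogs write python, or a python script?",
]

bigram_counts = Counter(
    gram for s in sentences for gram in find_ngrams(tokenize(s), 2)
)
print(bigram_counts.most_common(5))
```

Note that dropping stop words before windowing makes words adjacent that were not adjacent in the original text (here "python, or a python" yields the bigram ('python', 'python')). Whether that is acceptable depends on what the n-grams are used for.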