Summing partial string (keyword) matches in a Python DataFrame


python pandas


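Say I have a list of keywords (around 300).

I want to go through an entire DataFrame (df1) column (Text) and find every instance where a keyword appears. My end goal is a total count for each keyword.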

Text                                Location     Date 
Police have just discovered a bomb. New York    4/30/2015, 23:54:27  
...

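I know I can use str.contains (see below) to find the total for each word one at a time, but I'm looking for a simple way to count the totals for all the words at once.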

# number of rows whose Text contains the keyword
word_count = df1['Text'].str.contains('Key Word').sum()

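I have also tried a script that splits all of the data in "Text" into individual words and sums the totals, but it does not account for any keywords that contain spaces (at least in its current form).

Any help is much appreciated.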

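It sounds like you want to split all of the text into a single list of words, then scan that list just once, using a dict to count occurrences. You could start with something like: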

word_list = (df1.Text + ' ').sum().split()

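This gives a single list of all the words in the column. Adding a space to each item prevents the last word of one row from being concatenated with the first word of the next. Then scan the list and count the keywords: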

word_count = dict((keyword, 0) for keyword in keywords)
for word in word_list:
    try:
        word_count[word] += 1
    except KeyError:
        pass  # word is not one of the keywords

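A dict lookup is O(1) and you only scan the word list once, so this is algorithmically reasonable. The only problem I can think of right now is multi-word keywords. However, you could simply treat the words that make up a key phrase as keywords themselves, count those, and then infer the frequency of the key phrase. It isn't perfect, but it will work if there is no overlap between the words that make up the key phrases, and it may still work depending on how they overlap. I imagine that would be enough, but I can't tell without seeing all the keywords.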
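The single-pass idea above can also be sketched with collections.Counter, which returns 0 for missing keys and so avoids the try/except. This is only a minimal sketch: the sample rows and the keyword list are made up for illustration, and lowercasing plus stripping punctuation is an added assumption (the loop above matches words exactly as they appear).

```python
from collections import Counter

import pandas as pd

# Hypothetical sample data; the column name 'Text' follows the question.
df1 = pd.DataFrame({'Text': ['Police have just discovered a bomb.',
                             'A second bomb was found by police']})
keywords = ['bomb', 'police', 'riot']  # illustrative keyword list

# Assumption: normalise case and strip surrounding punctuation so that
# 'bomb.' and 'Police' match the keywords 'bomb' and 'police'.
words = (w.strip('.,!?;:"\'').lower()
         for text in df1['Text'] for w in text.split())

# One pass over all words builds the full frequency table.
all_counts = Counter(words)

# Counter returns 0 for keys it has never seen, so no KeyError handling.
word_count = {k: all_counts[k] for k in keywords}
print(word_count)
```

Like the loop above, this only handles single-word keywords.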

Edit: I thought of a way to do the same thing with pandas:

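word_series = pd.Series((df1.Text + ' ').sum().split())
# reindex (rather than .loc) returns 0 instead of raising for keywords that never occur
word_series.value_counts().reindex(keywords, fill_value=0)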

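This gives you the number of occurrences of each keyword. It still doesn't solve the key phrase problem.

However, here is a solution that works for two-word key phrases: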

# a series of all consecutive pairs in the word_series
two_word_series = word_series + ' ' + word_series.shift(-1)
two_word_series.value_counts().reindex(two_word_key_phrases, fill_value=0)

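This can be generalized to n-word phrases, but it gets cumbersome after a while. How practical it is depends on the maximum length of your key phrases.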
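A rough sketch of that n-word generalization, using the same shift trick, might look like the following. The helper name ngram_counts and the sample text are hypothetical, introduced only for illustration. Note that, as with the two-word version, phrases that straddle the boundary between two rows are counted too, because all rows are joined into one word series.

```python
import pandas as pd

def ngram_counts(word_series, n):
    """Count every run of n consecutive words in word_series."""
    ngrams = word_series
    for k in range(1, n):
        # shift(-k) lines each word up with the word k positions later
        ngrams = ngrams + ' ' + word_series.shift(-k)
    # the last n-1 entries have no following word and come out as NaN
    return ngrams.dropna().value_counts()

# Hypothetical sample text, flattened into one series of words
text = pd.Series(['the air strike hit', 'an air strike hit again'])
word_series = pd.Series(' '.join(text).lower().split())

counts = ngram_counts(word_series, 3)
# reindex returns 0 for phrases that never occur instead of raising
result = counts.reindex(['air strike hit', 'no such phrase'], fill_value=0)
print(result)
```

You would call this once per phrase length that appears in the keyword list, which is where it starts to get cumbersome.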

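If you want a solution that includes specific phrases (known in advance) in the count, you can replace the spaces in each phrase with another character (say "_"). For example: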

import pandas as pd
from collections import Counter

df = pd.DataFrame(['Police have discovered an air bomb', 'Air strike the bomb', 'The air strike police are going on strike', 'Air bomb is full of hot air'], columns = ['text'])
keywords = ['bomb', 'police', 'air strike']
keyword_dict = {w:w.replace(' ', '_') for w in keywords}

corpus = ' '.join(df.text).lower()
for w,w2 in keyword_dict.items():
   corpus = corpus.replace(w,w2)

all_counts = Counter(corpus.split())
final_counts = {w:all_counts[w2] for w,w2 in keyword_dict.items()}
print(final_counts)
{'bomb': 3, 'police': 2, 'air strike': 2}

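For a more general solution (and probably better practice from a text mining point of view, since you don't have to know in advance which phrases you are looking for), you can extract all of the bigrams from the text and count everything together: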

corpus = ' '.join(df.text).lower()
words = corpus.split()
bigrams = [' '.join([words[i],words[i+1]]) for i in range(len(words) -1)]
print(Counter(words + bigrams))
Counter({'air': 5, 'bomb': 3, 'strike': 3, 'air strike': 2, 'police': 2, 'air bomb': 2, 'the': 2, 'discovered': 1, 'bomb is': 1, 'the bomb': 1, 'have discovered': 1, 'full': 1, 'bomb the': 1, 'going on': 1, 'are going': 1, 'are': 1, 'discovered an': 1, 'the air': 1, 'hot air': 1, 'is full': 1, 'hot': 1, 'on strike': 1, 'is': 1, 'strike the': 1, 'police have': 1, 'bomb air': 1, 'of': 1, 'strike police': 1, 'of hot': 1, 'an': 1, 'strike air': 1, 'on': 1, 'full of': 1, 'police are': 1, 'have': 1, 'going': 1, 'an air': 1})

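Thanks, Joe! This does solve my bigger problem, but the keyword list contains quite a few phrases, and there is a fair amount of overlap between the words that make up those phrases. Any ideas for a workaround?

Hard to say. Suppose we reduce all phrases to at most two words and treat every word as a keyword. I think you could probably infer the phrase counts from the word counts, since each occurrence has to be either a single word or part of a pair. For example, if one phrase is "bombs away" and another is "away in a manger", then two "away" + one "bombs" + one "manger" must mean one of each. But even then, it is a lot of work compared with just scanning the original column 300 times. A compromise would be to use my solution wherever possible and then use your original approach on the phrases that won't yield to it.

Also, if a keyword is contained in another key phrase, is it counted once or twice? It seems the keywords should be "more" unique than they sound.

Doesn't this involve scanning the string (the corpus, which is the concatenation of the text column) as many times as there are keywords? If so, it doesn't seem much better than the OP's original solution, which just scans the column for each keyword (phrase), and it may be slower because it isn't vectorized. It is also more complicated.

The more general solution works really well, thanks!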
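For comparison, the "just scan the column once per keyword" approach mentioned in the comments can be sketched with the vectorized Series.str.count. This is an illustrative sketch rather than anything posted in the thread: the extra keyword 'strike' and the word-boundary regex are assumptions added to show what happens when a keyword also appears inside a key phrase.

```python
import re

import pandas as pd

df = pd.DataFrame({'text': ['Police have discovered an air bomb',
                            'Air strike the bomb',
                            'The air strike police are going on strike',
                            'Air bomb is full of hot air']})
# 'strike' is added here (an assumption) to show overlap with 'air strike'
keywords = ['bomb', 'police', 'air strike', 'strike']

lowered = df['text'].str.lower()

# One vectorized pass over the column per keyword. re.escape guards
# against regex metacharacters; \b keeps 'air' from matching in 'hair'.
counts = {
    kw: int(lowered.str.count(r'\b' + re.escape(kw) + r'\b').sum())
    for kw in keywords
}
print(counts)
```

With this approach each keyword is counted independently, so a "strike" that is part of "air strike" contributes to both totals. Phrases are matched within a row only, never across row boundaries.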
