Python: count the frequency of words in a column and store the result in another column of the DataFrame


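I want to count the occurrences of each word in every row of one column ('Comment') and store the result in a new column ('word') of my DataFrame, which is named headlamp. I am trying the code below, but I get an error: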

for i in range(0,len(headlamp)):
    headlamp['word'].apply(lambda text: Counter(" ".join(headlamp['Comment'][i].astype(str)).split(" ")).items())
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-16-a0c20291b4f5> in <module>()
  1 for i in range(0,len(headlamp)):
  ----> 2     headlamp['word'].apply(lambda text: Counter("".join(headlamp['Comment'][i].astype(str)).split(" ")).items())

  C:\Users\Rafael\Anaconda2\envs\gl-env\lib\site-packages\pandas\core\frame.pyc in __getitem__(self, key)
  1995             return self._getitem_multilevel(key)
  1996         else:
  -> 1997             return self._getitem_column(key)
  1998 
  1999     def _getitem_column(self, key):

  C:\Users\Rafael\Anaconda2\envs\gl-env\lib\site-packages\pandas\core\frame.pyc in _getitem_column(self, key)
  2002         # get column
  2003         if self.columns.is_unique:
  -> 2004             return self._get_item_cache(key)
  2005 
  2006         # duplicate columns & possible reduce dimensionality

  C:\Users\Rafael\Anaconda2\envs\gl-env\lib\site-packages\pandas\core\generic.pyc in _get_item_cache(self, item)
  1348         res = cache.get(item)
  1349         if res is None:
  -> 1350             values = self._data.get(item)
   1351             res = self._box_item_values(item, values)
   1352             cache[item] = res

   C:\Users\Rafael\Anaconda2\envs\gl-env\lib\site-packages\pandas\core\internals.pyc in get(self, item, fastpath)
   3288 
   3289             if not isnull(item):
   -> 3290                 loc = self.items.get_loc(item)
   3291             else:
   3292                 indexer = np.arange(len(self.items))[isnull(self.items)]

   C:\Users\Rafael\Anaconda2\envs\gl-env\lib\site-packages\pandas\indexes\base.pyc in get_loc(self, key, method, tolerance)
   1945                 return self._engine.get_loc(key)
   1946             except KeyError:
   -> 1947                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   1948 
   1949         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

   pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4154)()

   pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4018)()

   pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12368)()

   pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12322)()

   KeyError: 'word'


Any help would be greatly appreciated.

You can try the following, which stores the number of words in each comment:

headlamp['word'] = headlamp['Comment'].apply(lambda x: len(x.split()))

Example:

import pandas as pd

headlamp = pd.DataFrame({'Comment': ['hello world','world','foo','foo and bar']})
print(headlamp)
       Comment
0  hello world
1        world
2          foo
3  foo and bar

headlamp['word'] = headlamp['Comment'].apply(lambda x: len(x.split()))
print(headlamp)
       Comment  word
0  hello world     2
1        world     1
2          foo     1
3  foo and bar     3

With this approach you can get what you want.
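The snippet above stores the total number of words in each comment. If the per-word frequency described in the question is what is needed, a minimal sketch along the same lines could apply `collections.Counter` to each row instead. The helper name `count_words` and the sample data are assumptions made for illustration. Assigning to `headlamp['word']` creates the column, so it is never read before it exists.

```python
from collections import Counter

import pandas as pd


def count_words(text):
    """Return a dict mapping each whitespace-separated word to its count."""
    return dict(Counter(str(text).split()))


# Hypothetical sample data standing in for the real 'Comment' column.
headlamp = pd.DataFrame({'Comment': ['hello world', 'world', 'foo', 'foo and foo']})

# Assignment creates the 'word' column; no loop over row indices is needed.
headlamp['word'] = headlamp['Comment'].apply(count_words)

print(headlamp['word'].tolist())
```

Each cell of `word` then holds a plain dict, so a lookup such as `headlamp['word'].iloc[3].get('foo', 0)` gives the count of one word in one comment.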

Feel free to use this code:

import pandas as pd
from collections import Counter

df = pd.DataFrame({'Comment': ['This has has words words words that are written twice twice', 'This is a comment without repetitions', 'This comment, has ponctuations!']}, index = [0, 1, 2])

# Count the word frequencies of each row and store them in a new column.
# Assigning the whole column at once avoids chained indexing
# (df['Words'][i] = ...), which can trigger SettingWithCopyWarning.
# Note: the order of words with equal counts can vary between Python versions.
df['Words'] = df['Comment'].apply(lambda row: str(Counter(row.split()).most_common()))

print(df)

Output:

                                             Comment  \
0  This has has words words words that are writte...   
1              This is a comment without repetitions   
2                    This comment, has ponctuations!   

                                               Words  
0  [('words', 3), ('twice', 2), ('has', 2), ('tha...  
1  [('a', 1), ('comment', 1), ('This', 1), ('is',...  
2  [('This', 1), ('comment,', 1), ('has', 1), ('p...
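In the output above, `'comment,'` is counted with its trailing comma because the text is split on whitespace only. If punctuation and letter case should be ignored, one possible variation is to extract the words with a regular expression before counting. This is only a sketch: the helper name `word_frequencies`, the `\w+` pattern, and the sample comments are assumptions, not part of the original answer.

```python
import re
from collections import Counter

import pandas as pd


def word_frequencies(text):
    """Count words after lowercasing and dropping punctuation."""
    # \w+ matches runs of letters, digits and underscores,
    # so 'comment,' is reduced to 'comment'.
    words = re.findall(r"\w+", str(text).lower())
    return Counter(words).most_common()


# Hypothetical sample comments.
df = pd.DataFrame({'Comment': ['This has has words words words',
                               'This comment, has punctuation!']})
df['Words'] = df['Comment'].apply(word_frequencies)

print(df['Words'].iloc[0])
```

Here the list of tuples is stored directly rather than its string form, which keeps the counts usable for later filtering.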

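Hi, what is the expected format of the column that stores each word's frequency? A dict, or one column per word? Can you post the head of your DataFrame? You get KeyError: 'word' because you try to look up the column headlamp['word'], which does not exist yet.

Thanks for your reply @rfw. I want to put the counts of every word in the 'Comment' column into a new column 'word', so this new column 'word' would be created. The reason is that I want to know how many times a given word appears in each comment, to identify problems related to the headlamp (a car part). Let me know if you still want me to post the DataFrame here.

Let me see if I got it: you have a 'Comment' column that holds a string such as 'this is a simple comment'. Then, for every row, you need to run a function that counts the occurrences of each word and write the resulting dict into a new column named 'words'. Right?

You got it exactly @rfw.

Thanks for your help @rfw. Anyway, I could not reach the result. I also imported Counter. However, when creating the new column I get a warning:

C:\Users\Rafael\Anaconda2\envs\gl-env\lib\site-packages\ipykernel\__main__.py:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead. See the caveats in the documentation:
  if __name__ == '__main__':

And then, after I run the code you wrote on my DataFrame:

      1 i = 0
      2 for row in headlamp['Comment']:
----> 3     headlamp['word'][i] = str(Counter(row.astype(str).split()).most_common())
      4     i+=1
      5 print headlamp['word']
AttributeError: 'str' object has no attribute 'astype'

I used the code:

i = 0
for row in headlamp['Comment']:
    headlamp['word'][i] = str(Counter(row.split()).most_common())
    i+=1
print headlamp['word']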
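The two messages in the last comment have different causes. `astype` is a method of a pandas Series, so calling it on a single string raises the AttributeError. The SettingWithCopyWarning usually appears when a column is added to a frame that was itself sliced out of another DataFrame. How `headlamp` was built is not shown in the thread, so the following is a sketch under the assumption that it was filtered from a larger frame; the `data` frame and its `Part` column are hypothetical.

```python
from collections import Counter

import pandas as pd

# Hypothetical larger frame; 'Part' and the filter below are assumptions.
data = pd.DataFrame({
    'Part': ['headlamp', 'mirror', 'headlamp'],
    'Comment': ['light is dim dim', 'cracked', None],
})

# .copy() makes headlamp an independent frame rather than a slice of data,
# so adding a column does not trigger SettingWithCopyWarning.
headlamp = data[data['Part'] == 'headlamp'].copy()

# astype(str) is a Series method: call it on the column, not on one string.
# Missing comments are replaced by empty strings first.
comments = headlamp['Comment'].fillna('').astype(str)

# Build the whole column in one assignment instead of writing cell by cell.
headlamp['word'] = comments.apply(lambda text: Counter(text.split()).most_common())

print(headlamp[['Comment', 'word']])
```

Building the column in a single assignment also removes the need for the manual counter `i`, which only matches the index labels when the index happens to be 0, 1, 2, and so on.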