Python: count the frequency of words in a column and store the result in another column of the DataFrame

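I want to count how many times each word occurs in every row of one column ('Comment') and store the result in a new column ('word') of my DataFrame, which is named headlamp. I am trying the code below, but I get an error: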

for i in range(0,len(headlamp)):
    headlamp['word'].apply(lambda text: Counter(" ".join(headlamp['Comment'][i].astype(str)).split(" ")).items())
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-16-a0c20291b4f5> in <module>()
  1 for i in range(0,len(headlamp)):
  ----> 2     headlamp['word'].apply(lambda text: Counter(" ".join(headlamp['Comment'][i].astype(str)).split(" ")).items())

  C:\Users\Rafael\Anaconda2\envs\gl-env\lib\site-packages\pandas\core\frame.pyc in __getitem__(self, key)
  1995             return self._getitem_multilevel(key)
  1996         else:
  -> 1997             return self._getitem_column(key)
  1998 
  1999     def _getitem_column(self, key):

  C:\Users\Rafael\Anaconda2\envs\gl-env\lib\site-packages\pandas\core\frame.pyc in _getitem_column(self, key)
  2002         # get column
  2003         if self.columns.is_unique:
  -> 2004             return self._get_item_cache(key)
  2005 
  2006         # duplicate columns & possible reduce dimensionality

  C:\Users\Rafael\Anaconda2\envs\gl-env\lib\site-packages\pandas\core\generic.pyc in _get_item_cache(self, item)
  1348         res = cache.get(item)
  1349         if res is None:
  -> 1350             values = self._data.get(item)
   1351             res = self._box_item_values(item, values)
   1352             cache[item] = res

   C:\Users\Rafael\Anaconda2\envs\gl-env\lib\site-packages\pandas\core\internals.pyc in get(self, item, fastpath)
   3288 
   3289             if not isnull(item):
   -> 3290                 loc = self.items.get_loc(item)
   3291             else:
   3292                 indexer = np.arange(len(self.items))[isnull(self.items)]

   C:\Users\Rafael\Anaconda2\envs\gl-env\lib\site-packages\pandas\indexes\base.pyc in get_loc(self, key, method, tolerance)
   1945                 return self._engine.get_loc(key)
   1946             except KeyError:
   -> 1947                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   1948 
   1949         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

   pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4154)()

   pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4018)()

   pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12368)()

   pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12322)()

   KeyError: 'word'
Any help would be greatly appreciated.

You can try the following:

headlamp['word'] = headlamp['Comment'].apply(lambda x: len(x.split()))
Example:

headlamp = pd.DataFrame({'Comment': ['hello world','world','foo','foo and bar']})
print(headlamp)
       Comment
0  hello world
1        world
2          foo
3  foo and bar

headlamp['word'] = headlamp['Comment'].apply(lambda x: len(x.split()))
print(headlamp)
       Comment  word
0  hello world     2
1        world     1
2          foo     1
3  foo and bar     3
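
Note that the example above stores the total number of words in each row. If the goal is the frequency of each individual word per row, as the question describes, one possible sketch is to apply `Counter` to each comment. The sample data and the choice to store a plain `dict` in each cell are assumptions made for illustration:

```python
import pandas as pd
from collections import Counter

# Illustrative data only; the asker's real DataFrame is not shown in the thread.
headlamp = pd.DataFrame({'Comment': ['hello world', 'world', 'foo', 'foo foo bar']})

# astype(str) converts every cell to a string first
# (note that a missing value would become the literal text 'nan').
headlamp['word'] = headlamp['Comment'].astype(str).apply(
    lambda text: dict(Counter(text.split()))
)

print(headlamp)
```

Each cell of 'word' then holds a mapping from word to count for that row only, rather than a single number.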
With this approach you can achieve what you want.

Feel free to use this code:

import pandas as pd
from collections import Counter

df = pd.DataFrame({'Comment': ['This has has words words words that are written twice twice', 'This is a comment without repetitions', 'This comment, has ponctuations!']}, index = [0, 1, 2])

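# you must create the new column before trying to assign any value
df['Words'] = ""

# counting frequencies; .at writes one cell by index label and avoids
# the chained assignment df['Words'][i] = ...
for idx, text in df['Comment'].items():
    df.at[idx, 'Words'] = str(Counter(text.split()).most_common())

print(df)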
Output:

                                             Comment  \
0  This has has words words words that are writte...   
1              This is a comment without repetitions   
2                    This comment, has ponctuations!   

                                               Words  
0  [('words', 3), ('twice', 2), ('has', 2), ('tha...  
1  [('a', 1), ('comment', 1), ('This', 1), ('is',...  
2  [('This', 1), ('comment,', 1), ('has', 1), ('p...  
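
In the output above, 'comment,' is counted together with its comma, and the counts are stored as a string. A minimal sketch of one way around both, assuming it is acceptable to lowercase the text and to treat any run of word characters as a word (the helper name `word_counts` is hypothetical):

```python
import re
import pandas as pd
from collections import Counter

df = pd.DataFrame({'Comment': [
    'This has has words words words that are written twice twice',
    'This comment, has ponctuations!',
]})

def word_counts(text):
    """Return a Counter of lowercase words, ignoring punctuation."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)

# Each cell holds a Counter object instead of its string representation.
df['Words'] = df['Comment'].apply(word_counts)

print(df['Words'].iloc[0].most_common(3))
```

Keeping the `Counter` itself in the column means individual counts can be looked up later without parsing a string.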

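Hi, what is the expected format of the column that stores the frequency of each word? A dict, or one column per word? Can you post the head of your DataFrame? You get KeyError: 'word' when you try to look up the column headlamp['word'].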
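
The point about the KeyError can be illustrated with a small sketch: reading a column that does not exist yet fails, while assigning to the same name creates it. The two-row DataFrame is a stand-in, since the asker's data is not shown:

```python
import pandas as pd

# Stand-in for the asker's DataFrame.
headlamp = pd.DataFrame({'Comment': ['hello world', 'foo and bar']})

# Reading a column that does not exist raises KeyError.
try:
    headlamp['word']
    read_failed = False
except KeyError:
    read_failed = True

# Assigning to a new column name creates the column.
headlamp['word'] = ""

print(read_failed)
print(list(headlamp.columns))
```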
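Thanks for the reply @rfw. I want to put the counts of every word in the 'Comment' column into a new column 'word', so this new column 'word' would be created. The reason is that I want to know how many times a given word appears in each comment, in order to find problems related to headlamps (the car part). Please let me know if you still want me to post the DataFrame here.

Let me see if I got it: you have a 'Comment' column holding a string such as "this is a simple comment". Then you need to run, on every row, a function that counts the occurrences of each word, and write this new dict into a new column called 'words'. Is that right?

You got it exactly @rfw, thanks for your help.

@rf Anyway, I could not get the result. I imported Counter as well. However, when creating the new column I get a warning:

C:\Users\Rafael\Anaconda2\envs\gl-env\lib\site-packages\ipykernel\__main__.py:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead. See the caveats in the documentation: if __name__ == '__main__':

Then, after running the code you wrote on my DataFrame:

1 i = 0
2 for row in headlamp['Comment']:
----> 3 headlamp['word'][i] = str(Counter(row.astype(str).split()).most_common())
4 i+=1
5 print headlamp['word']
AttributeError: 'str' object has no attribute 'astype'

I used the code:

i = 0
for row in headlamp['Comment']:
    headlamp['word'][i] = str(Counter(row.split()).most_common())
    i+=1
print headlamp['word']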
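
Both messages in the last comments have common causes. `astype` is a method of a pandas Series, so it has to be called on the column rather than on each string; and SettingWithCopyWarning typically appears when the DataFrame was itself produced by slicing another one. The sketch below assumes that is what happened here (the thread does not show how headlamp was built); the `raw` frame, its 'Part' column and the 'broken_count' column are invented for illustration:

```python
import pandas as pd
from collections import Counter

# Hypothetical source data; 'Part' is an invented column.
raw = pd.DataFrame({
    'Part': ['headlamp', 'mirror', 'headlamp'],
    'Comment': ['lamp broken broken', 'mirror loose', None],
})

# .copy() makes headlamp an independent DataFrame, so adding a column
# to it does not trigger SettingWithCopyWarning.
headlamp = raw[raw['Part'] == 'headlamp'].copy()

# fillna/astype are Series methods: apply them to the whole column.
comments = headlamp['Comment'].fillna('').astype(str)
headlamp['word'] = comments.apply(lambda text: Counter(text.split()))

# How often one particular word occurs in each comment;
# a Counter returns 0 for words it has not seen.
headlamp['broken_count'] = headlamp['word'].apply(lambda counts: counts['broken'])

print(headlamp[['Comment', 'broken_count']])
```

Looking up a single word this way matches the stated goal of finding how often a term appears in each comment.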