Python: counting individual words in a DataFrame

python pandas ipython

I am trying to count the individual words in one column of my DataFrame. It looks like the sample below; the real texts are tweets.

text
this is some text that I want to count
That's all I wan't
It is unicode text

From other Stack Overflow questions I found that I could use the approach below. My DataFrame is called result, and this is my code:

from collections import Counter
result2 = Counter(" ".join(result['text'].values.tolist()).split(" ")).items()
result2

I get the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-6-2f018a9f912d> in <module>()
      1 from collections import Counter
----> 2 result2 = Counter(" ".join(result['text'].values.tolist()).split(" ")).items()
      3 result2
TypeError: sequence item 25831: expected str instance, float found

The dtype of the text column is object, which, as far as I know, is correct for unicode text data.
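An object column can hold arbitrary Python objects, so the dtype alone does not guarantee that every entry is a string. A minimal sketch for checking what the column really contains is shown here. The small `result` frame is a hypothetical stand-in for the real data, and the assumption that the stray float is a missing value (NaN) is only a guess about the tweets in question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real `result` frame; the NaN is an assumed
# example of a float hiding in an object column.
result = pd.DataFrame(
    {"text": pd.Series(["this is some text", np.nan, "It is unicode text"], dtype=object)}
)

# True for every entry that is not a Python string.
not_str = ~result["text"].map(lambda v: isinstance(v, str))

# Rows that would make str.join() raise a TypeError.
print(result[not_str])

# How often each Python type occurs in the column.
type_counts = result["text"].map(lambda v: type(v).__name__).value_counts()
print(type_counts)
```

Looking at `result[not_str]` shows the offending rows directly, which also tells you whether they are missing values or genuine numbers.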

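The issue occurs because some of the values in your series (result['text']) are of type float. If you want to include them in the ' '.join(), you need to convert the floats to strings before passing them to str.join().

You can use Series.astype() to convert all the values to strings. Also, you do not really need .tolist(); you can pass the series directly to str.join(). Example: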

result2 = Counter(" ".join(result['text'].astype(str)).split(" ")).items()
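For reference, here is a self-contained sketch of the same idea that can be run as-is. The sample frame is made up, and it assumes the floats are missing values (NaN) that should be skipped rather than counted. How `astype(str)` treats missing values may differ between pandas versions (they can end up as the literal string 'nan'), so this variant drops them first.

```python
from collections import Counter

import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the tweets.
result = pd.DataFrame(
    {"text": pd.Series(["this is some text", np.nan, "It is unicode text"], dtype=object)}
)

# Drop missing values, then make sure every remaining entry is a string.
texts = result["text"].dropna().astype(str)

# Join all rows into one long string and count the space-separated words.
word_counts = Counter(" ".join(texts).split(" "))
print(word_counts.most_common(3))
```

If the floats are real numbers that should be counted as words, leave out the `dropna()` call.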

Demo:

In [60]: df = pd.DataFrame([['blah'],['asd'],[10.1]],columns=['A'])

In [61]: df
Out[61]:
      A
0  blah
1   asd
2  10.1

In [62]: ' '.join(df['A'])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-62-77e78c2ee142> in <module>()
----> 1 ' '.join(df['A'])

TypeError: sequence item 2: expected str instance, float found

In [63]: ' '.join(df['A'].astype(str))
Out[63]: 'blah asd 10.1'

In the end, I used the following code:

pd.set_option('display.max_rows', 100)
words = pd.Series(' '.join(result['text'].astype(str)).lower().split(" ")).value_counts()[:100]
words
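A variant of this that stays inside pandas string methods might look like the sketch below. It is not the code used above. It uses made-up sample data and assumes missing values should be ignored. `str.split()` with no argument splits on any run of whitespace, so repeated spaces do not produce empty "words".

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the tweets.
result = pd.DataFrame(
    {"text": pd.Series(["This is  some text", np.nan, "It IS unicode text"], dtype=object)}
)

words = (
    result["text"]
    .dropna()          # skip missing values instead of counting them
    .astype(str)       # make sure the remaining entries are strings
    .str.lower()       # count "IS" and "is" as the same word
    .str.split()       # one list of words per row, split on whitespace
    .explode()         # one row per word
    .value_counts()    # frequency of each word, most common first
    .head(100)
)
print(words)
```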

That said, the problem itself was solved by Anand S Kumar's answer.

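Apparently there are float values in the DataFrame. How do you want to handle them? Do you want to count them as well?

Since the texts should all be tweets, I want to count them too. If this column also contains float values, does that mean some of the tweets I collected are just numbers? (It makes me curious which ones are floats.)

Yes, that is possible.

Thanks, this seems to work. The output is now a dict. Would it make sense to move it back into a pandas DataFrame, or to somehow keep working in the df?

That depends on what you intend to do with it. But I guess that if you plan to do some kind of analysis, a DataFrame would be faster.

A general answer to a general question :D When I have a specific question, I will ask a new one. Thanks for your help!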
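On the question of moving the counts back into pandas, one possible sketch is shown here. The column names `word` and `count` are arbitrary choices for illustration, and the `Counter` is built from a made-up string rather than the real tweets.

```python
from collections import Counter

import pandas as pd

# Hypothetical counts; in practice this would be the Counter built from the tweets.
word_counts = Counter("this is some text it is unicode text".split())

# Turn the (word, count) pairs into a two-column DataFrame, most frequent first.
counts_df = (
    pd.DataFrame(list(word_counts.items()), columns=["word", "count"])
    .sort_values("count", ascending=False)
    .reset_index(drop=True)
)
print(counts_df.head())
```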