Python DataFrame: count the unique words in one column and return the count in another column

python pandas dataframe text

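I have a DataFrame with the following columns:

df['Album'] (contains the album names of artistX)

df['Tracks'] (contains the tracks on artistX's albums)

df['Lyrics'] (contains the lyrics of the tracks)

I am trying to count the number of words in df['Lyrics'] and return it in a new column called df['wordcount'], and also to count the number of unique words in df['Lyrics'] and return it in a new column called df['uniquewordcount'].

I have been able to get df['wordcount'] by counting every string in df['Lyrics'], excluding whitespace: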

totalscore = df.Lyrics.str.count(r'[^\s]+')  # count every word in a track (each run of non-whitespace characters)
df['wordcount'] = totalscore
df

I have also been able to count the unique words in df['Lyrics']:

import collections
from collections import Counter

results = Counter()
count_unique = df.Lyrics.str.lower().str.split().apply(results.update)
unique_counts = sum((results).values())
df['uniquewordcount'] = unique_counts

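This gives me a single count of the unique words across all of df['Lyrics'], which is what the code does, but I want the unique words in each track's lyrics. My Python isn't great at the moment, so the solution may be obvious to everyone but me. I hope someone can point me in the right direction and show me how to count the unique words for each track.

Expected output: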

Album    Tracks    Lyrics                      wordcount  uniquewordcount
 A         Ball   Ball is life and Ball is key       7           5
           Pass   Pass me the hookah Pass me the     7           4

What I get:

Album    Tracks    Lyrics                    wordcount  uniquewordcount
  A     Ball   Ball is life and Ball is key       7           9
        Pass   Pass me the hookah Pass me the     7           9

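Using only the standard library, you can indeed use collections.Counter. However, nltk is worth considering, since there are many edge cases you may be interested in, such as handling punctuation, plurals, etc.

Here is a step-by-step guide using Counter. Note that we go a step further than requested, since we also compute the count of each individual word. The data held in the Counter dictionaries is discarded when we drop df['LyricsCounter'].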

import pandas as pd
from collections import Counter

df = pd.DataFrame({'Lyrics': ['This is some life some collection of words',
                              'Lyrics abound lyrics here there eveywhere',
                              'Come fly come fly away']})

# convert to lowercase, split to list
df['LyricsList'] = df['Lyrics'].str.lower().str.split()

# for each set of lyrics, create a Counter dictionary
df['LyricsCounter'] = df['LyricsList'].apply(Counter)

# calculate length of list
df['LyricsWords'] = df['LyricsList'].apply(len)

# calculate number of Counter items for each set of lyrics
df['LyricsUniqueWords'] = df['LyricsCounter'].apply(len)

res = df.drop(['LyricsList', 'LyricsCounter'], axis=1)

print(res)

                                       Lyrics  LyricsWords  LyricsUniqueWords
0  This is some life some collection of words            8                  7
1   Lyrics abound lyrics here there eveywhere            6                  5
2                      Come fly come fly away            5                  3
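
The punctuation edge case mentioned in this answer can be handled without nltk for simple inputs. The following is only a minimal sketch, not part of the original answer: the helper name count_words and the token pattern (runs of lowercase letters, digits and apostrophes) are assumptions made for illustration. Splitting on whitespace alone would treat "fly," and "fly" as different words; the regular expression avoids that.

```python
import re
from collections import Counter


def count_words(text):
    """Return (total words, unique words) for one lyrics string.

    Hypothetical helper: a "word" is assumed to be a run of
    letters, digits or apostrophes, compared case-insensitively.
    """
    # Lowercase first so "FLY" and "fly" count as the same word
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # Counter keeps one key per distinct token
    counts = Counter(tokens)
    return len(tokens), len(counts)


total, unique = count_words("Come fly, come FLY away!")
print(total, unique)
```

If this fits your data, the helper could be applied row by row with df['Lyrics'].apply(count_words) and the resulting tuples unpacked into two columns.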

Here is an alternative solution:

import pandas as pd

df = pd.DataFrame({'Lyrics': ['This is some life some collection of words',
                              'Lyrics abound lyrics here there eveywhere',
                              'Come fly come fly away']})

# Split list into new series
lyrics = df['Lyrics'].str.lower().str.split()

# Get amount of unique words
df['LyricsCounter'] = lyrics.apply(set).apply(len)

# Get amount of words
df['LyricsWords'] = lyrics.apply(len)

print(df)

                                       Lyrics  LyricsCounter  LyricsWords
0  This is some life some collection of words              7            8
1   Lyrics abound lyrics here there eveywhere              5            6
2                      Come fly come fly away              3            5
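
As a check, the same set-based approach can be run against the two sample rows from the question. This is a sketch under stated assumptions: the column names Album, Tracks, Lyrics, wordcount and uniquewordcount are taken from the question's tables, and words are assumed to be whitespace-separated and compared case-insensitively.

```python
import pandas as pd

# Sample rows copied from the question's expected-output table
df = pd.DataFrame({
    'Album': ['A', 'A'],
    'Tracks': ['Ball', 'Pass'],
    'Lyrics': ['Ball is life and Ball is key',
               'Pass me the hookah Pass me the'],
})

# Lowercase and split each track's lyrics into a list of words
words = df['Lyrics'].str.lower().str.split()

# Total words per track
df['wordcount'] = words.apply(len)

# Distinct words per track: a set drops the duplicates within each row
df['uniquewordcount'] = words.apply(lambda w: len(set(w)))

print(df)
```

Because each row's set is built independently, the counts are per track (7 and 5 for "Ball", 7 and 4 for "Pass"), unlike a single Counter shared across all rows.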

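results should contain all the unique words you need. I don't understand what your actual problem is. Can you create one? How about sharing some data and the expected result?

Thanks for your reply. results returns the unique words and their counts; what I want is the count of unique words in each track.

Why not use nltk?

Thanks, this is exactly what I wanted :)

I see all sorts of errors. It won't account for: . I would probably use NLTK for something like this :)

@AntonvBR For what I am doing, I have already preprocessed my text and removed all special characters and punctuation, so they don't actually exist in my text. Thanks a lot for your reply. I am just getting into NLTK; I am actually using it to find the most common words of artistX.

@Sammie OK, in that case I'll upvote. However... I think we can do slightly "better" (more readable to me) without creating "extra" columns that then have to be dropped. If you like it, you can upvote mine. No need to change the accepted answer, though.

Upvoted. This solution is a more direct way to achieve the OP's goal.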