Count the total number of words in a corpus using NLTK's Conditional Frequency Distribution in Python (newbie)

python dataframe nlp

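I need to count the number of words (word occurrences) in some corpus using the NLTK package.

Here is my corpus:

corpus = PlaintextCorpusReader(r'C:\DeCorpus', '.*')

Here is how I tried to get the total number of words for each document: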

cfd_appr = nltk.ConditionalFreqDist(
    (textname, num_appr)
    for textname in corpus.fileids()
    for num_appr in [len(w) for w in corpus.raw(fileids=textname).replace("\r", " ").replace("\n", " ").split()])

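(I split the string into words by hand; for some reason this works better than corpus.words(), but the problem is the same either way, so that detail is not relevant.) In general, this does the same (wrong) job:

This is what I get by typing cfd_appr.tabulate():

But these are counts of words grouped by word length. What I need is just this (only one kind of item, the text, counted by number of words):

That is, the sum of all words of every length (or the sum of the columns computed with DataFrame(cfd_appr).transpose().sum(axis=1)).

(By the way, a way to set a name for that column would also be a solution, but .rename({None: 'W.appr.'}, axis='columns') does not work, and the solutions I have found are generally not clear enough.)

So, what I need is: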

                             1    
2022.12.06_Bild 2.txt    451.0
2022.12.06_Bild 3.txt    538.0
2022.12.06_Bild 4.txt    471.0
2022.12.06_Bild 5.txt    679.0
2022.12.06_Bild 6.txt    890.0
2022.12.06_Bild 8.txt      3.0

Thanks a lot for your help!

Let's first try to replicate your table, using a corpus with this directory structure:

/books_in_sentences
   books_large_p1.txt
   books_large_p2.txt

Code:

from nltk.corpus import PlaintextCorpusReader
from nltk import ConditionalFreqDist
from nltk import word_tokenize

from collections import Counter

import pandas as pd

corpus = PlaintextCorpusReader('books_in_sentences/', '.*')

cfd_appr = ConditionalFreqDist(
    (textname, num_appr)
    for textname in corpus.fileids()
    for num_appr in [len(w) for w in 
                     word_tokenize(corpus.raw(fileids=textname))])

Then the pandas munging part:

# Idiom to convert a FreqDist / ConditionalFreqDist into pd.DataFrame.
df = pd.DataFrame([dict(Counter(freqdist)) 
                   for freqdist in cfd_appr.values()], 
                 index=cfd_appr.keys())
# Fill in the not-applicable with zeros.
df = df.fillna(0).astype(int)

# If necessary, sort the columns (word lengths) in ascending order.
df = df.sort_index(axis=1)

# Sum all columns per row -> pd.Series
counts_per_row = df.sum(axis=1)

Finally, to access the indexed Series, e.g.:

print('books_large_p1.txt', counts_per_row['books_large_p1.txt'])

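I'd encourage using the solution above, so that you can manipulate the numbers further with the DataFrame. But if all you really need is the total per row, try the following.

If you need to avoid pandas and use the values in the CFD directly, you have to use ConditionalFreqDist.values() and iterate through it carefully.

If we do this: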

>>> list(cfd_appr.values())
[FreqDist({3: 6, 6: 5, 1: 5, 9: 4, 4: 4, 2: 3, 8: 2, 10: 2, 7: 1, 14: 1}),
 FreqDist({4: 10, 3: 9, 1: 5, 7: 4, 2: 4, 5: 3, 6: 3, 11: 1, 9: 1})]

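We see a list of FreqDist objects, each corresponding to one key (in this case, a file name):

>>> list(cfd_appr.keys())
['books_large_p1.txt', 'books_large_p2.txt']

Since FreqDist is a subclass of collections.Counter, summing the values of each Counter object gives: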

>>> [sum(fd.values()) for fd in cfd_appr.values()]
[33, 40]

These are the same values as counts_per_row above.

To put it all together:

>>> dict(zip(cfd_appr.keys(), [sum(fd.values()) for fd in cfd_appr.values()]))
{'books_large_p1.txt': 33, 'books_large_p2.txt': 40}
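The summing step above can be tried without a corpus on disk. The following is only a minimal sketch: the two in-memory documents and their names are made up for illustration, and a defaultdict of Counter objects stands in for ConditionalFreqDist (it is not the NLTK class itself). It shows that adding up the per-length counts of a document gives back its total number of words.

```python
from collections import Counter, defaultdict

# Hypothetical in-memory documents standing in for the corpus files.
docs = {
    "doc_a.txt": "the cat sat on the mat",
    "doc_b.txt": "a quick brown fox",
}

# Plain-Python stand-in for a conditional frequency distribution:
# condition (file name) -> Counter of samples (word lengths).
cfd = defaultdict(Counter)
for name, text in docs.items():
    for word in text.split():
        cfd[name][len(word)] += 1

# Summing the counts of every word length gives the total words per file.
totals = {name: sum(fd.values()) for name, fd in cfd.items()}

# The totals match a direct count of the whitespace-separated tokens.
for name, text in docs.items():
    assert totals[name] == len(text.split())

print(totals)
```

The word lengths are only an intermediate grouping here; they cancel out as soon as the values are summed.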

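Well, here is what was actually needed.

First, get the numbers of words of different lengths (just as I did before, by building cfd_appr as in the question).

Then import pandas as pd and append .to_frame(1) to the dtype: float64 Series that I got by summing the columns:

pd.DataFrame(cfd_appr).transpose().sum(axis=1).to_frame(1)

That's it. However, if somebody knows how to do the summing inside the definition of cfd_appr itself, that would be a more elegant solution.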
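One possible way to do the summing inside the definition is to skip the word lengths entirely and count one sample per word, keyed by file name. The sketch below uses collections.Counter and made-up in-memory documents as an assumption; since FreqDist is a subclass of Counter, the same generator expression could presumably be passed to nltk.FreqDist, iterating over corpus.fileids() and corpus.raw(fileids=textname) instead of the dictionary.

```python
from collections import Counter

# Hypothetical in-memory documents standing in for the corpus files.
docs = {
    "doc_a.txt": "the cat sat on the mat",
    "doc_b.txt": "a quick brown fox",
}

# Emit the file name once for every word it contains; the Counter then
# holds the total number of words per file, with no per-length breakdown.
word_totals = Counter(
    name
    for name, text in docs.items()
    for _word in text.split()
)

print(dict(word_totals))
```

This trades away the per-length table: if both the breakdown and the totals are needed, building the conditional distribution first and summing afterwards is still the simpler route.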
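Good question!! Munging a CFD or FD from NLTK into pandas should be a function in NLTK =) It would be nice if there were a pull request to NLTK so that we could call ConditionalFreqDist.to_pandas and get back a pd.DataFrame.

It is much more complicated than I expected =D I may be wrong, but the result is the same as the one from using .sum(axis=1); at least it seems so to me after trying this solution, with one difference (dtype: int64). Is there really no way to sum the results within the definition of cfd_appr? Apparently I did not express my question clearly enough, sorry... I think the problem arises only from my misunderstanding of Python syntax.

There is a way to sum the results of cfd_appr, but you won't be happy with its complexity =) Hence the suggested solution of casting to a DataFrame. Hint: if you don't want to force int64, you don't need the .astype(int) in the code.

Thanks a lot! I hope this is not off-topic. Two questions. First: how do I put that dictionary into a DataFrame? I get ValueError: If using all scalar values, you must pass an index. Second, counts_per_row is a DataFrame without a header. Is there a way to give its single column a name? .rename({None: 'W.execute.'}, axis='columns') does not work (obviously, since there is no header at all), and .rename_axis('W.execute.') renames the axis itself, not the column.

I believe in you, and I believe Google can surely help you find the answers =)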
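The two questions in the last comments can be sketched with pandas as follows. This is an illustration rather than the thread's own answer: the dictionary repeats the totals shown earlier, and the column name 'W.appr.' is borrowed from the question. A per-row sum such as df.sum(axis=1) is a pandas Series, which carries a single name rather than column labels, so the name is set on the Series or when it is converted to a frame.

```python
import pandas as pd

# Totals per file, as produced by the dict(zip(...)) expression earlier.
totals = {"books_large_p1.txt": 33, "books_large_p2.txt": 40}

# pd.DataFrame(totals) raises "ValueError: If using all scalar values,
# you must pass an index", because every value is a scalar.
# Building a named Series first avoids that; the keys become the index.
counts = pd.Series(totals, name="W.appr.")

# Converting to a one-column DataFrame keeps the Series name as the header.
frame = counts.to_frame()

# Alternative: build the DataFrame directly from the dictionary.
frame_alt = pd.DataFrame.from_dict(totals, orient="index", columns=["W.appr."])

print(frame)
```

For an existing unnamed Series, counts_per_row.rename('W.appr.') or counts_per_row.to_frame('W.appr.') should give the same kind of named result.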