How do I prepare my own dataset for NLTK's ConditionalFreqDist function?

NLTK ships with the Brown corpus, which contains text from many different genres:

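import nltk
from nltk.corpus import brown

# Pair every word with the genre it appears in: (condition, sample).
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))

# Tabulate how often each modal verb occurs in a few genres.
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)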

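The problem is that brown happens to be a corpus built into NLTK, with a convenient categories method, and I don't know how its input data is structured.

Suppose I have my own data, say 50 documents, each with its own "genre/category", and I want to use ConditionalFreqDist on it. How do I format the data so the function can consume it? Should it be one CSV, or a separate CSV per genre with one document per row? How should the input be laid out? An example with a small dataset would be great.

If this can be done over a database connection rather than flat files, that would be a plus.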

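If you go to the directory where your NLTK data is installed, you should be able to look at the files directly. My nltk_data directory is located in /home/$user.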

The files are plain text in a directory structure like this:

./nltk_data/corpora/brown/

A sample from one of the Brown files is tokenized, tagged text (each token written as word/tag), like this:

``/`` These/dts actions/nns should/md serve/vb to/to protect/vb in/in fact/nn and/cc in/in effect/nn the/at court's/nn$ wards/nns from/in undue/jj costs/nns and/cc its/pp$ appointed/vbn and/cc elected/vbn servants/nns from/in unmeritorious/jj criticisms/nns ''/'' ,/, the/at jury/nn said/vbd ./.

Regarding/in Atlanta's/np$ new/jj multi-million-dollar/jj airport/nn ,/, the/at jury/nn recommended/vbd ``/`` that/cs when/wrb the/at new/jj management/nn takes/vbz charge/nn Jan./np 1/cd the/at airport/nn be/be operated/vbn in/in a/at manner/nn that/wps will/md eliminate/vb political/jj influences/nns ''/'' ./.
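To make the word/tag layout concrete, here is a minimal sketch that splits a fragment of the sample above into (word, tag) pairs. It assumes `nltk.tag.str2tuple` with its default `/` separator; the variable names are only illustrative, and this is not necessarily how the Brown reader itself parses the files.

```python
from nltk.tag import str2tuple

# A short fragment copied from the Brown sample above.
line = "the/at jury/nn said/vbd ./."

# str2tuple splits on the last "/" and upper-cases the tag.
tagged = [str2tuple(token) for token in line.split()]

# Keep only the words if the tags are not needed.
words = [word for word, _tag in tagged]

print(tagged)
print(words)
```

Your own documents do not have to be tagged; untagged plain text is enough when you only need word counts.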

As for the categories, I think there are two relevant files:

categories.pickle
cats.txt

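The latter is a simple list of every file name with its category listed beside it. The pickle dump is a set of tuples carrying the same information (probably created from the .txt file).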

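You probably just need to create a pickle dump of the categories keyed by the names of your texts, then put that file in the same directory your files will be read from. (I have not done this myself, so apologies if I am missing something, but it seems consistent with how NLTK is organized.)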
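As a sketch of that idea using the plain-text category file rather than a pickle: the example below writes a few toy documents plus a category file (named `cats.txt` here) into a temporary directory and reads them back with NLTK's `CategorizedPlaintextCorpusReader`. The file names, texts, and categories are made up for illustration, and the one-line-per-file `<fileid> <category>` layout is an assumption that matches the reader's default space delimiter.

```python
import os
import tempfile

import nltk
from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Hypothetical toy corpus: file name -> (category, text).
docs = {
    "doc1.txt": ("news", "The jury said it will act . The mayor may resign ."),
    "doc2.txt": ("romance", "She could not stay . He might return ."),
    "doc3.txt": ("news", "The council must vote and it will vote soon ."),
}

root = tempfile.mkdtemp()
for name, (_category, text) in docs.items():
    with open(os.path.join(root, name), "w", encoding="utf8") as fh:
        fh.write(text)

# Category file: one line per document, "<fileid> <category>".
with open(os.path.join(root, "cats.txt"), "w", encoding="utf8") as fh:
    for name, (category, _text) in docs.items():
        fh.write(f"{name} {category}\n")

# The regex selects the document files only; cats.txt is not a document.
reader = CategorizedPlaintextCorpusReader(
    root, r"doc\d+\.txt", cat_file="cats.txt")

# Same construction as with brown: (condition, sample) pairs.
cfd = nltk.ConditionalFreqDist(
    (genre, word.lower())
    for genre in reader.categories()
    for word in reader.words(categories=genre))

cfd.tabulate(conditions=reader.categories(),
             samples=["could", "may", "might", "must", "will"])
```

With this layout the custom reader offers the same `categories()` and `words(categories=...)` calls that the question uses on brown, so the rest of the code can stay unchanged.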

In any case, once you find the nltk_data directory you can look at all the files and see how they are organized.
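On the database part of the question: ConditionalFreqDist only needs an iterable of (condition, sample) pairs, so the files are optional. The sketch below is one way to feed it from a database, assuming an in-memory SQLite table named `documents` with `category` and `body` columns; the table, the rows, and the whitespace tokenization are all hypothetical choices made for the example.

```python
import sqlite3

import nltk

# Hypothetical schema: one row per document, with its category and text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (category TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?)",
    [
        ("news", "the jury said it will act"),
        ("news", "the mayor may resign"),
        ("romance", "she could not stay and he might return"),
    ],
)


def category_word_pairs(connection):
    """Yield (category, word) pairs, one per word of every document."""
    query = "SELECT category, body FROM documents"
    for category, body in connection.execute(query):
        # Naive whitespace tokenization; swap in a real tokenizer if needed.
        for word in body.lower().split():
            yield category, word


cfd = nltk.ConditionalFreqDist(category_word_pairs(conn))
cfd.tabulate(conditions=["news", "romance"],
             samples=["could", "may", "might", "will"])
```

The same generator pattern would work for rows read from a CSV file, with one document per row and a column holding the category.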