Tokenizing a corpus of 10 documents in Python
I'm new to writing code in Python, so figuring out how to write more advanced operations is a challenge for me. My task is to compute the TF-IDF of 10 documents, but I'm stuck on how to tokenize the corpus and print out the number of tokens and the number of unique tokens. If anyone can help, or even point me in the right direction, I'd greatly appreciate it.

This might help. I have a collection of individual text files that I want to ingest and fit-transform with TfidfVectorizer. This walks through ingesting the files and using TfidfVectorizer. I went and grabbed some sample data of movie reviews; I used the negative reviews. For my purposes it doesn't matter what the data is, I just need some text data.

Import the required packages
import pandas as pd
import glob
from sklearn.feature_extraction.text import TfidfVectorizer
How these packages will be used:
- We'll use pandas to prepare the data for the TfidfVectorizer
- glob will be used to gather the file locations
- TfidfVectorizer is the star of the show
ls_documents = []
for name in glob.glob('/location/to/folder/with/document/files/*'):
    ls_documents.append(name)
This gives us a list of file locations

Read in the data from the first 10 files
ls_text = []
for document in ls_documents[:10]:
    with open(document, "r") as f:
        ls_text.append(f.read())
We now have a list of texts
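As an aside, the globbing and reading steps above can also be sketched with pathlib. The throwaway temp folder here is my stand-in for the real document folder, just so the snippet runs as-is:

```python
import tempfile
from pathlib import Path

# Stand-in corpus: a throwaway folder with three sample files
tmp = Path(tempfile.mkdtemp())
for i in range(3):
    (tmp / f'doc{i}.txt').write_text(f'review number {i}')

# Gather the file paths and read the first 10 files in one pass
ls_text = [p.read_text() for p in sorted(tmp.glob('*'))[:10]]
print(len(ls_text))  # 3
```

Path.read_text() opens and closes each file for you, so there are no file handles to manage.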
Load the texts into pandas
df_text = pd.DataFrame(ls_text)
Rename the column to make it easier to work with
df_text.columns = ['raw_text']
Clean the data by replacing any null values with empty strings
df_text['clean_text'] = df_text['raw_text'].fillna('')
You could optionally do some other cleaning here. Keeping the raw data and creating a separate "clean" column is very useful
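As a sketch of that optional extra cleaning - the lowercasing and punctuation stripping here are my additions, not something this answer prescribes:

```python
import pandas as pd

# Toy stand-in for the real reviews (the second row simulates a missing value)
df_text = pd.DataFrame({'raw_text': ['Great FILM!!', None, 'bad plot...']})

df_text['clean_text'] = (
    df_text['raw_text']
    .fillna('')                                # replace missing values with empty strings
    .str.lower()                               # normalize case
    .str.replace(r'[^\w\s]', ' ', regex=True)  # strip punctuation
    .str.replace(r'\s+', ' ', regex=True)      # collapse repeated whitespace
    .str.strip()
)
print(df_text['clean_text'].tolist())  # ['great film', '', 'bad plot']
```

Chaining the steps on a separate column leaves `raw_text` untouched, so you can always go back to the original.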
Create a tfidf object - I'll give it English stop words
tfidf = TfidfVectorizer(stop_words='english')
Fit and transform the clean_text we created above by passing the clean_text series to tfidf
tfidf_matrix = tfidf.fit_transform(df_text['clean_text'])
You can see the feature names from tfidf
tfidf.get_feature_names()  # use get_feature_names_out() on scikit-learn >= 1.0
You'll see something like this
['10',
'13',
'14',
'175',
'1960',
'1990s',
'1997',
'20',
'2001',
'20th',
'2176',
'60',
'80',
'8mm',
'90',
'90s',
'_huge_',
'aberdeen',
'able',
'abo',
'accent',
'accentuate',
'accident',
'accidentally',
'accompany',
'accurate',
'accused',
'acting',
'action',
'actor',
....
]
You can look at the shape of the matrix
tfidf_matrix.shape
In my case, I got
(10, 1733)
This roughly means that 1733 words (i.e. tokens) describe the 10 documents
If you're not sure what you want to do with it next, you might find these two articles useful:
- This article from DataCamp uses tfidf in a recommender system
- This article from DataCamp has some general NLP process tips
I took a fun approach to this. I'm using the same data that @the_good_pony provided, so I'll use the same path.

We'll use the os and re modules, since regular expressions are fun and challenging.
import os
import re
# Path to where our data is located
base_path = r'C:\location\to\folder\with\document\files'
# Instantiate an empty dictonary
ddict = {}
# We're going to walk our directory
for root, subdirs, filenames in os.walk(base_path):
    # For each subdirectory ('neg' and 'pos', in this case)
    for d in subdirs:
        # Create a NEW dictionary with the subdirectory name as key
        ddict[d] = {}
        # Create a path to the subdirectory
        subroot = os.path.join(root, d)
        # Get a list of files for the directory
        # Save time by creating a new path for each file
        file_list = [os.path.join(subroot, i) for i in os.listdir(subroot) if i.endswith('txt')]
        # For each file in the file list, open and read the file into the
        # subdictionary
        for f in file_list:
            # Basename = root name of the path to the file, i.e. the filename
            fkey = os.path.basename(f)
            # Read the file and set it as the subdictionary value
            # (the with-block closes the file automatically)
            with open(f, 'r') as fh:
                ddict[d][fkey] = fh.read()
Sample counts:
len(ddict.keys()) # 2 top-level subdirectories
len(ddict['neg'].keys()) # 1000 files in our 'neg' subdirectory
len(ddict['pos'].keys()) # 1000 files in our 'pos' subdirectory
# sample file content
# use two keys (subdirectory name and filename)
dirkey = 'pos'
filekey = 'cv000_29590.txt'
test1 = ddict[dirkey][filekey]
Output:
'films adapted from comic books have had plenty of success , whether they\'re about superheroes ( batman , superman , spawn ) , o [...]'
### Simple counter dictionary function
def val_counter(iterable, output_dict=None):
    # Instantiate a new dictionary
    if output_dict is None:
        output_dict = dict()
    # Check if the element is in the dictionary:
    # add 1 if yes, or set it to 1 if no
    for i in iterable:
        if i in output_dict:
            output_dict[i] += 1
        else:
            output_dict[i] = 1
    return output_dict
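For what it's worth, the standard library's collections.Counter does the same job as val_counter, including the incremental-update behavior of passing an existing dictionary back in:

```python
from collections import Counter

tokens = ['film', 'plot', 'film', 'actor', 'film']

counts = Counter(tokens)
print(counts['film'])         # 3
print(counts.most_common(1))  # [('film', 3)]

# Counter also updates in place, like passing output_dict back into val_counter
counts.update(['plot'])
print(counts['plot'])  # 2
```

most_common() also gives you the sorted-by-count view that dict_sort produces below.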
Using regular expressions (which I've covered in detail elsewhere), we can clean the text from each corpus and capture the alphanumeric items into a list. I added an option to include small words (1 character, in this case), but pulling in stopwords wouldn't be too hard.
def wordcounts(corpus, dirname='pos', keep_small_words=False, count_dict=None):
    if count_dict is None:
        count_dict = dict()
    get_words_pat = r'(?:\s*|\n*|\t*)?([\w]+)(?:\s*|\n*|\t*)?'
    p = re.compile(get_words_pat)

    def clean_corpus(x):
        # Replace all whitespace with a single space
        clear_ws_pat = r'\s+'
        # Find non-alphanumeric characters
        remove_punc_pat = r'[^\w+]'
        tmp1 = re.sub(remove_punc_pat, ' ', x)
        # Re-space the whitespace and return
        return re.sub(clear_ws_pat, ' ', tmp1)

    # List of our files from the subdirectory
    keylist = list(corpus[dirname])
    for k in keylist:
        cleaned = clean_corpus(corpus[dirname][k])
        # Tokenize based on size
        if keep_small_words:
            tokens = p.findall(cleaned)
        else:
            # Limit to results > 1 char in length
            tokens = [i for i in p.findall(cleaned) if len(i) > 1]
        for i in tokens:
            if i in count_dict:
                count_dict[i] += 1
            else:
                count_dict[i] = 1
    # Return the dictionary once complete
    return count_dict
### Dictionary sorted lambda function
dict_sort = lambda d, descending=True: dict(sorted(d.items(), key=lambda x: x[1], reverse=descending))
# Run our function for positive corpus values
pos_result_dict = wordcounts(ddict, 'pos')
pos_result_dict = dict_sort(pos_result_dict)
Final processing and printing:
# Create dictionary of how frequent each count value is
freq_dist = val_counter(pos_result_dict.values())
freq_dist = dict_sort(freq_dist)
# Stats functions
k_count = lambda x: len(x.keys())
sum_vals = lambda x: sum([v for k, v in x.items()])
calc_avg = lambda x: sum_vals(x) / k_count(x)
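Those three lambdas amount to the arithmetic mean of the dictionary's values, which statistics.mean computes directly; a quick sketch with made-up counts:

```python
from statistics import mean

# Made-up word counts standing in for pos_result_dict
word_counts = {'film': 120, 'plot': 40, 'actor': 20}

avg = mean(word_counts.values())
print(avg)  # 60
```

Either form works; the lambdas just make each step of the calculation explicit.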
# Get mean (arithmetic average) of word counts
mean_dict = calc_avg(pos_result_dict)
# Top half of the results. We could shrink this even further, if necessary
top_dict = {k:v for k, v in pos_result_dict.items() if v >= mean_dict}
# This is probably your TF-IDF part
tot_count = sum(top_dict.values())
for k, v in top_dict.items():
    pct_ = round(v / tot_count, 4)
    print('Word: ', k, ', count: ', v, ', %-age: ', pct_)
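Worth flagging: the percentages printed above are relative term frequencies, not TF-IDF yet. A minimal sketch of the full computation, using the smoothed IDF formula that scikit-learn defaults to (the two tiny documents are made up, and scikit-learn would additionally L2-normalize each row):

```python
import math

# Two made-up pre-tokenized documents
docs = [['film', 'great', 'film'], ['plot', 'thin', 'film']]
n_docs = len(docs)

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)              # term frequency within this doc
    df = sum(1 for d in docs if term in d)       # document frequency across the corpus
    idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed idf (scikit-learn's default)
    return tf * idf

print(round(tf_idf('film', docs[0]), 4))  # 0.6667 ('film' is in every doc, so idf == 1)
print(round(tf_idf('thin', docs[1]), 4))  # rarer term, so a higher idf boosts it
```

Terms that appear in every document get no idf boost, which is exactly what separates TF-IDF from the raw percentages above.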
Hey - providing more context around the question might help you get a better answer, but I've put together something quick that I think will get you started. There are two links at the bottom of the answer that may help with understanding tfidf in context. Hope it helps!