Tokenizing a corpus of 10 documents in Python
I'm new to writing code in Python, so figuring out how to write more advanced operations is a challenge for me. My task is to compute the TF-IDF of 10 documents, but I'm stuck on how to tokenize the corpus and print out the number of tokens and the number of unique tokens. If anyone can help, or even point me in the right direction, I'd greatly appreciate it.

This might help. I have a collection of individual text files that I want to ingest and fit-transform with TfidfVectorizer. This walks through ingesting the files and using TfidfVectorizer. I went and grabbed some sample data of movie reviews; I used the negative reviews. For my purposes it doesn't matter what the data is, I just need some text data.

Import the required packages
import pandas as pd
import glob
from sklearn.feature_extraction.text import TfidfVectorizer
How these packages will be used:
- We'll use pandas to prepare the data for the TfidfVectorizer
- glob will be used to gather the file locations
- TfidfVectorizer is the star of the show
ls_documents = []
for name in glob.glob('/location/to/folder/with/document/files/*'):
    ls_documents.append(name)
This gives us a list of file locations

Read in the data from the first 10 files
ls_text = []
for document in ls_documents[:10]:
    with open(document, "r") as f:
        ls_text.append(f.read())
We now have a list of texts
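As an aside, the globbing and reading steps above can also be sketched with pathlib. The throwaway temp folder here is my stand-in for the real document folder, just so the snippet runs as-is:

```python
import tempfile
from pathlib import Path

# Stand-in corpus: a throwaway folder with three sample files
tmp = Path(tempfile.mkdtemp())
for i in range(3):
    (tmp / f'doc{i}.txt').write_text(f'review number {i}')

# Gather the file paths and read the first 10 files in one pass
ls_text = [p.read_text() for p in sorted(tmp.glob('*'))[:10]]
print(len(ls_text))  # 3
```

Path.read_text() opens and closes each file for you, so there are no file handles to manage.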
Load the texts into pandas
df_text = pd.DataFrame(ls_text)
Rename the column to make it easier to work with
df_text.columns = ['raw_text']
Clean the data by replacing any null values with empty strings
df_text['clean_text'] = df_text['raw_text'].fillna('')
You could optionally do some other cleaning here. Keeping the raw data and creating a separate "clean" column is very useful
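As a sketch of that optional extra cleaning - the lowercasing and punctuation stripping here are my additions, not something this answer prescribes:

```python
import pandas as pd

# Toy stand-in for the real reviews (the second row simulates a missing value)
df_text = pd.DataFrame({'raw_text': ['Great FILM!!', None, 'bad plot...']})

df_text['clean_text'] = (
    df_text['raw_text']
    .fillna('')                                # replace missing values with empty strings
    .str.lower()                               # normalize case
    .str.replace(r'[^\w\s]', ' ', regex=True)  # strip punctuation
    .str.replace(r'\s+', ' ', regex=True)      # collapse repeated whitespace
    .str.strip()
)
print(df_text['clean_text'].tolist())  # ['great film', '', 'bad plot']
```

Chaining the steps on a separate column leaves `raw_text` untouched, so you can always go back to the original.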
Create a tfidf object - I'll give it English stop words
tfidf = TfidfVectorizer(stop_words='english')
Fit and transform the clean_text we created above by passing the clean_text series to tfidf
tfidf_matrix = tfidf.fit_transform(df_text['clean_text'])
You can see the feature names from tfidf
tfidf.get_feature_names()  # use get_feature_names_out() on scikit-learn >= 1.0
You'll see something like this
['10',
'13',
'14',
'175',
'1960',
'1990s',
'1997',
'20',
'2001',
'20th',
'2176',
'60',
'80',
'8mm',
'90',
'90s',
'_huge_',
'aberdeen',
'able',
'abo',
'accent',
'accentuate',
'accident',
'accidentally',
'accompany',
'accurate',
'accused',
'acting',
'action',
'actor',
....
]
You can look at the shape of the matrix
tfidf_matrix.shape
In my case, I got
(10, 1733)
This roughly means that 1733 words (i.e. tokens) describe the 10 documents
If you're not sure what you want to do with it next, you might find these two articles useful:
- This article from DataCamp uses tfidf in a recommender system
- This article from DataCamp has some general NLP process tips
I took a fun approach to this. I'm using the same data that @the_good_pony provided, so I'll use the same path.

We'll use the os and re modules, since regular expressions are fun and challenging.
import os
import re
# Path to where our data is located
base_path = r'C:\location\to\folder\with\document\files'
# Instantiate an empty dictonary
ddict = {}
# We're going to walk our directory
for root, subdirs, filenames in os.walk(base_path):
    # For each subdirectory ('neg' and 'pos', in this case)
    for d in subdirs:
        # Create a NEW dictionary with the subdirectory name as key
        ddict[d] = {}
        # Create a path to the subdirectory
        subroot = os.path.join(root, d)
        # Get a list of files for the directory
        # Save time by creating a new path for each file
        file_list = [os.path.join(subroot, i) for i in os.listdir(subroot) if i.endswith('txt')]
        # For each file in the file list, open and read the file into the
        # subdictionary
        for f in file_list:
            # Basename = root name of the path to the file, i.e. the filename
            fkey = os.path.basename(f)
            # Read the file and set it as the subdictionary value
            # (the with-block closes the file automatically)
            with open(f, 'r') as fh:
                ddict[d][fkey] = fh.read()
Sample counts:
len(ddict.keys()) # 2 top-level subdirectories
len(ddict['neg'].keys()) # 1000 files in our 'neg' subdirectory
len(ddict['pos'].keys()) # 1000 files in our 'pos' subdirectory
# sample file content
# use two keys (subdirectory name and filename)
dirkey = 'pos'
filekey = 'cv000_29590.txt'
test1 = ddict[dirkey][filekey]
Output:
'films adapted from comic books have had plenty of success , whether they\'re about superheroes ( batman , superman , spawn ) , o [...]'
### Simple counter dictionary function
def val_counter(iterable, output_dict=None):
    # Instantiate a new dictionary
    if output_dict is None:
        output_dict = dict()
    # Check if the element is in the dictionary:
    # add 1 if yes, or set it to 1 if no
    for i in iterable:
        if i in output_dict:
            output_dict[i] += 1
        else:
            output_dict[i] = 1
    return output_dict
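For what it's worth, the standard library's collections.Counter does the same job as val_counter, including the incremental-update behavior of passing an existing dictionary back in:

```python
from collections import Counter

tokens = ['film', 'plot', 'film', 'actor', 'film']

counts = Counter(tokens)
print(counts['film'])         # 3
print(counts.most_common(1))  # [('film', 3)]

# Counter also updates in place, like passing output_dict back into val_counter
counts.update(['plot'])
print(counts['plot'])  # 2
```

most_common() also gives you the sorted-by-count view that dict_sort produces below.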
Using regular expressions (which I've covered in detail elsewhere), we can clean the text from each corpus and capture the alphanumeric items into a list. I added an option to include small words (1 character, in this case), but pulling in stopwords wouldn't be too hard.
def wordcounts(corpus, dirname='pos', keep_small_words=False, count_dict=None):
    if count_dict is None:
        count_dict = dict()
    get_words_pat = r'(?:\s*|\n*|\t*)?([\w]+)(?:\s*|\n*|\t*)?'
    p = re.compile(get_words_pat)

    def clean_corpus(x):
        # Replace all whitespace with a single space
        clear_ws_pat = r'\s+'
        # Find non-alphanumeric characters
        remove_punc_pat = r'[^\w+]'
        tmp1 = re.sub(remove_punc_pat, ' ', x)
        # Re-space the whitespace and return
        return re.sub(clear_ws_pat, ' ', tmp1)

    # List of our files from the subdirectory
    keylist = list(corpus[dirname])
    for k in keylist:
        cleaned = clean_corpus(corpus[dirname][k])
        # Tokenize based on size
        if keep_small_words:
            tokens = p.findall(cleaned)
        else:
            # Limit to results > 1 char in length
            tokens = [i for i in p.findall(cleaned) if len(i) > 1]
        for i in tokens:
            if i in count_dict:
                count_dict[i] += 1
            else:
                count_dict[i] = 1
    # Return the dictionary once complete
    return count_dict
### Dictionary sorted lambda function
dict_sort = lambda d, descending=True: dict(sorted(d.items(), key=lambda x: x[1], reverse=descending))
# Run our function for positive corpus values
pos_result_dict = wordcounts(ddict, 'pos')
pos_result_dict = dict_sort(pos_result_dict)
Final processing and printing:
# Create dictionary of how frequent each count value is
freq_dist = val_counter(pos_result_dict.values())
freq_dist = dict_sort(freq_dist)
# Stats functions
k_count = lambda x: len(x.keys())
sum_vals = lambda x: sum([v for k, v in x.items()])
calc_avg = lambda x: sum_vals(x) / k_count(x)
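Those three lambdas amount to the arithmetic mean of the dictionary's values, which statistics.mean computes directly; a quick sketch with made-up counts:

```python
from statistics import mean

# Made-up word counts standing in for pos_result_dict
word_counts = {'film': 120, 'plot': 40, 'actor': 20}

avg = mean(word_counts.values())
print(avg)  # 60
```

Either form works; the lambdas just make each step of the calculation explicit.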
# Get mean (arithmetic average) of word counts
mean_dict = calc_avg(pos_result_dict)
# Top half of the results. We could shrink this even further, if necessary
top_dict = {k:v for k, v in pos_result_dict.items() if v >= mean_dict}
# This is probably your TF-IDF part
tot_count = sum(top_dict.values())
for k, v in top_dict.items():
    pct_ = round(v / tot_count, 4)
    print('Word: ', k, ', count: ', v, ', %-age: ', pct_)
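Worth flagging: the percentages printed above are relative term frequencies, not TF-IDF yet. A minimal sketch of the full computation, using the smoothed IDF formula that scikit-learn defaults to (the two tiny documents are made up, and scikit-learn would additionally L2-normalize each row):

```python
import math

# Two made-up pre-tokenized documents
docs = [['film', 'great', 'film'], ['plot', 'thin', 'film']]
n_docs = len(docs)

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)              # term frequency within this doc
    df = sum(1 for d in docs if term in d)       # document frequency across the corpus
    idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed idf (scikit-learn's default)
    return tf * idf

print(round(tf_idf('film', docs[0]), 4))  # 0.6667 ('film' is in every doc, so idf == 1)
print(round(tf_idf('thin', docs[1]), 4))  # rarer term, so a higher idf boosts it
```

Terms that appear in every document get no idf boost, which is exactly what separates TF-IDF from the raw percentages above.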
Hey - providing more context around the question might help you get a better answer, but I've put together something quick that I think will get you started. There are two links at the bottom of the answer that may help with understanding tfidf in context. Hope it helps!