scikit-learn: Using TfidfVectorizer, is it possible to use one corpus to obtain the idf information and another one for the actual index?
Tags: scikit-learn, tf-idf, text-classification

Using sklearn.feature_extraction.text.TfidfVectorizer, I would like to train a classifier on bag-of-words tf-idf data. I have a large unlabeled corpus and a small labeled corpus. I plan to use the labeled corpus to build a classifier based on a bag-of-words tf-idf model. However, I would prefer to use the full corpus (including the unlabeled data) to compute the idf statistics. Is this possible with sklearn?
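One solution I thought of is to build the model on the entire corpus and then drop the rows that belong to the unlabeled data. However, the corpus may be too large to hold in RAM.

If I understand correctly, you can fit the TF-IDF model on all of the data and then call transform on the smaller labeled corpus: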
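from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
# fit learns the vocabulary and the idf weights from the full corpus
vec.fit(alldata)
# transform applies those idf weights to the labeled documents only
tagged_data_tfidf = vec.transform(tagged_data)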
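To make the fit/transform split concrete, here is a rough pure-Python sketch of the idea. It is not how scikit-learn is implemented: it tokenizes on whitespace, skips the L2 normalization that TfidfVectorizer applies by default, and the names fit_idf and transform as well as the toy documents are made up for illustration. The idf formula is the smoothed form that TfidfVectorizer uses with its default smooth_idf=True.

```python
import math
from collections import Counter


def fit_idf(corpus):
    """Learn smoothed idf weights from an iterable of documents."""
    n_docs = 0
    doc_freq = Counter()
    for doc in corpus:
        n_docs += 1
        # Count each term at most once per document
        doc_freq.update(set(doc.lower().split()))
    # idf(t) = ln((1 + n) / (1 + df(t))) + 1
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def transform(docs, idf):
    """Weight the term counts of docs by previously learned idf values."""
    rows = []
    for doc in docs:
        # Terms that were never seen while fitting are dropped
        counts = Counter(t for t in doc.lower().split() if t in idf)
        rows.append({t: c * idf[t] for t, c in counts.items()})
    return rows


# Hypothetical toy data: a small labeled set next to a larger unlabeled one
unlabeled = ["the cat sat", "the dog barked", "the bird sang"]
labeled = ["the cat purred", "the dog slept"]
all_data = unlabeled + labeled

idf = fit_idf(all_data)                 # idf statistics come from the full corpus
labeled_rows = transform(labeled, idf)  # only the labeled documents are vectorized
```

The point is that the idf table is state learned once from everything, and the transform step only looks that state up, so the unlabeled rows never need to be materialized as a matrix.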
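Thanks @JAB, that is what I was looking for.

As for data that does not fit in RAM, you can use an iterator, or several iterators if the data is spread across different sources. In my case, the labeled data is stored in files, while my other data is stored in MongoDB.

File iterator: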
import os

class File2Doc(object):
    """Yields the contents of every .txt file found under top_dir."""
    def __init__(self, top_dir):
        self.top_dir = top_dir

    def __iter__(self):
        # Walk the directory tree lazily so only one file is in memory at a time
        for root, dirs, files in os.walk(self.top_dir):
            for fname in filter(lambda fname: fname.endswith('.txt'), files):
                with open(os.path.join(root, fname), encoding='utf8', errors='ignore') as file:
                    document = file.read()
                yield document
MongoDB iterator:
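class Mongo2Doc(object):
    """
    An iterable that runs a pymongo find() and yields the text field of each
    matching document in the MongoDB collection.
    """
    def __init__(self, query):
        # Assumes query is (collection, find_query, projection), the form
        # described in the MyDocIterator docstring.
        self.collection, self.find_query, self.projection = query
        # Assumes the first field enabled in the projection holds the text.
        self.text_field = next(k for k, v in self.projection.items() if v)

    def __iter__(self):
        # Open a new cursor on each pass so the corpus can be read more than once
        for document in self.collection.find(self.find_query, self.projection):
            yield document[self.text_field]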
Combining the two in a single iterator:
from itertools import chain

class MyDocIterator(object):
    """
    Expects a list of folders (paths) and/or a list of MongoDB queries.
    Each MongoDB query has the form (collection, {find_query}, {projection or text_field}).
    Example:
    mongo_query = [(mongo_client.db.collection, {'optional_query': 'some_value'}, {'text': 1})]
    """
    def __init__(self, folders=None, mongo_query=None):
        self.folders = folders
        self.mongo_query = mongo_query
        if self.folders is not None:
            assert isinstance(self.folders, list), 'folders should be a list'
        if self.mongo_query is not None:
            assert isinstance(self.mongo_query, list), 'Mongo query should be a list'
        if self.folders is None and self.mongo_query is None:
            raise TypeError('Please specify at least one folder or one mongo query')

    def __iter__(self):
        # Build one iterable per source, then chain them into a single document stream
        k = []
        if self.folders is not None:
            f = [File2Doc(folder) for folder in self.folders]
            k.extend(f)
        if self.mongo_query is not None:
            m = [Mongo2Doc(query) for query in self.mongo_query]
            k.extend(m)
        return chain.from_iterable(k)
Usage example:
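from sklearn.feature_extraction.text import CountVectorizer

# custom_text_preprocessor is a user-defined function that is not shown here
my_docs = MyDocIterator(['path_to_data'])
bow_vectorizer = CountVectorizer(preprocessor=custom_text_preprocessor, tokenizer=str.split)
bow_vectorizer.fit(my_docs)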
The same approach works with TfidfVectorizer.
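One detail worth spelling out: the sources are wrapped in classes with an __iter__ method rather than passed around as bare generators, because a class can be iterated again (for example once to fit and once more to transform), while a generator is exhausted after a single pass. Below is a small standard-library sketch of that difference. FolderDocs and ChainedDocs are hypothetical stand-ins for the classes in this answer, and a plain list plays the role of the database source.

```python
import os
import tempfile
from itertools import chain


class FolderDocs:
    """Re-iterable source: every call to __iter__ walks the folder again."""
    def __init__(self, top_dir):
        self.top_dir = top_dir

    def __iter__(self):
        for root, _dirs, files in os.walk(self.top_dir):
            for fname in sorted(files):
                if fname.endswith(".txt"):
                    with open(os.path.join(root, fname), encoding="utf8") as fh:
                        yield fh.read()


class ChainedDocs:
    """Chains several re-iterable sources and stays re-iterable itself."""
    def __init__(self, sources):
        self.sources = sources

    def __iter__(self):
        return chain.from_iterable(self.sources)


with tempfile.TemporaryDirectory() as tmp:
    for name, text in [("a.txt", "first file"), ("b.txt", "second file")]:
        with open(os.path.join(tmp, name), "w", encoding="utf8") as fh:
            fh.write(text)

    # A plain list stands in for a second source such as a database query
    docs = ChainedDocs([FolderDocs(tmp), ["in-memory doc"]])
    first_pass = list(docs)
    second_pass = list(docs)       # works again: nothing was exhausted

    one_shot = (d for d in docs)   # a bare generator is single-use
    list(one_shot)
    leftover = list(one_shot)      # empty, the generator is already consumed
```

If the corpus only ever needs one pass, a generator is enough, but the class-based form costs little and avoids silently fitting or transforming an empty stream on the second pass.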