
Python: displaying the k nearest neighbors for text classification


I have a CSV file (corpus.csv) containing graded abstracts (text) in the following format:

Institute,    Score,    Abstract
----------------------------------------------------------------------
UoM,    3.0,    Hello, this is abstract one
UoM,    3.2,    Hello, this is abstract two and yet counting.
UoE,    3.1,    Hello, yet another abstract but this is a unique one.
UoE,    2.2,    Hello, please no more abstract.

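I am trying to write a KNN classification program in Python that takes an abstract entered by the user, such as "This is a new unique abstract", finds the closest abstract in the corpus (CSV), and returns the predicted score/grade for the input. I have the following code: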

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
import numpy as np
import pandas as pd
from csv import reader,writer
import operator as op
import string
from sklearn import neighbors

# Read data from corpus, skipping the header row
abstract_list = []
score_list = []
institute_list = []
row_count = 0
# Translation table that deletes every punctuation character
# (str.translate needs a table, not the raw punctuation string)
strip_punctuation = str.maketrans('', '', string.punctuation)
with open('corpus.csv', 'r', newline='') as f:
    for row in list(reader(f))[1:]:
        institute, score = row[0].strip(), row[1]
        # Unquoted abstracts containing commas are split by the csv reader,
        # so rejoin everything after the second field
        abstract = ','.join(row[2:])
        if len(abstract.split()) > 0:
            institute_list.append(institute)
            score_list.append(float(score))
            abstract = abstract.translate(strip_punctuation).lower()
            abstract_list.append(abstract)
            row_count = row_count + 1

print("Total processed data: ", row_count)

# Vectorize (TF-IDF, ngrams 1-4, no stop words) using sklearn
# min_df=1 keeps every term; recent scikit-learn versions reject the integer 0
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1,4),
                     min_df = 1, stop_words = 'english', sublinear_tf=True)
response = vectorizer.fit_transform(abstract_list)
classes = score_list
# get_feature_names() was removed in scikit-learn 1.2
feature_names = vectorizer.get_feature_names_out()

clf = neighbors.KNeighborsRegressor(n_neighbors=1)
clf.fit(response, classes)
predicted = clf.predict(response)

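At this point, with the code above, `predicted` gives an output such as [3.2]. However, I would like the output to be [3.2, UoM, "Hello, this is abstract two and yet counting."].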


I want to display the k nearest neighbors (not only the score, but also the corresponding institute name and abstract). How can I do that?

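After fitting the model, you need to call the fitted estimator's kneighbors() method.

This returns two arrays: the first holds the distances and the second holds the indices of the nearest neighbors. To print the result in the format you want, look up the abstracts (and institutes) using the indices from the second array.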

Thanks for your answer. Could you show me how to use .kneighbors() in my use case?

>>> samples = [[0., 0., 0.], [0., .5, 0.], [1., 1., .5]]
>>> from sklearn.neighbors import NearestNeighbors
>>> neigh = NearestNeighbors(n_neighbors=1)
>>> neigh.fit(samples)
NearestNeighbors(n_neighbors=1)
>>> print(neigh.kneighbors([[1., 1., 1.]]))
(array([[0.5]]), array([[2]]))
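
Applied to the question's setup, the same call might look roughly like the sketch below. It is not the asker's exact program: the three lists are a hypothetical in-memory stand-in for the rows read from corpus.csv, `nearest_abstracts` is a helper name made up for illustration, and k = 2 is an arbitrary choice. The key point is that the query must be transformed with the same fitted vectorizer before it is passed to kneighbors(), and that the returned indices point back into the parallel lists.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical stand-in for the lists built while reading corpus.csv.
institute_list = ["UoM", "UoM", "UoE", "UoE"]
score_list = [3.0, 3.2, 3.1, 2.2]
abstract_list = [
    "hello this is abstract one",
    "hello this is abstract two and yet counting",
    "hello yet another abstract but this is a unique one",
    "hello please no more abstract",
]

vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 4),
                             stop_words="english", sublinear_tf=True)
X = vectorizer.fit_transform(abstract_list)

k = 2  # number of neighbors to report (assumed value)
clf = KNeighborsRegressor(n_neighbors=k)
clf.fit(X, score_list)


def nearest_abstracts(query):
    """Return (distance, score, institute, abstract) for the k nearest rows."""
    # Reuse the fitted vectorizer so the query lives in the same feature space.
    query_vec = vectorizer.transform([query.lower()])
    distances, indices = clf.kneighbors(query_vec)
    # Both arrays have one row per query; we passed a single query.
    return [
        (float(d), score_list[i], institute_list[i], abstract_list[i])
        for d, i in zip(distances[0], indices[0])
    ]


query = "this is a new unique abstract"
print("Predicted score:", clf.predict(vectorizer.transform([query]))[0])
for dist, score, institute, abstract in nearest_abstracts(query):
    print([score, institute, abstract], "distance:", round(dist, 3))
```

With k greater than 1, KNeighborsRegressor.predict() returns the mean of the neighbors' scores, so the predicted value need not equal any single neighbor's score; the per-neighbor rows from kneighbors() show which abstracts contributed.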