Python: labeling text with sklearn: IndexError

Tags: python, scikit-learn, text-classification

I am trying to assign labels to texts using clustering. To do that, I am following the steps in this link:

However, after selecting all the terms in the clusters (I have three clusters), like this:

order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
terms
where len(terms) = 2009, I get the following error:

IndexError: list index out of range

when I run this code:

for i in range(true_k):
    print("Cluster %d:" % i),
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='') # the error comes from here
    print()
where true_k = 3. This link, unfortunately, does not contain any information relevant to solving the problem. It seems that cluster 1 might be causing the issue, or possibly the [:10] slice.
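
One quick way to narrow it down (a small diagnostic sketch, assuming the order_centroids and terms objects above) is to compare the largest index in the slice against the vocabulary size:

import numpy as np

# terms[ind] raises IndexError as soon as ind >= len(terms). Since
# cluster_centers_ has one column per feature, a maximum index larger than
# len(terms) - 1 means the model and the vectorizer disagree on the feature space.
print(order_centroids.shape)                    # (n_clusters, n_features)
print(np.max(order_centroids[:, :10]), len(terms) - 1)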

Do you know why this happens and how to fix it? If you need more code or information, I will gladly update the question.

Sample

Doc

...

I have 130 rows.

Code

import re
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk
from nltk.corpus import stopwords

stop_words = stopwords.words('english')

def preprocessing(line):
    # keep letters only, lowercase, tokenize, drop stop words, lemmatize
    line = re.sub(r"[^a-zA-Z]", " ", line.lower())
    words = word_tokenize(line)
    words_lemm = [WordNetLemmatizer().lemmatize(w) for w in words if w not in stop_words]
    return words_lemm

true_k = 3
vect = TfidfVectorizer(tokenizer=preprocessing)
vectorized_text = vect.fit_transform(df['Doc'])  # df is the 130-row DataFrame above
kmeans = KMeans(n_clusters=true_k).fit(vectorized_text)

# Predict cluster
cl = kmeans.predict(vectorized_text)
df['Predicted Cluster'] = pd.Series(cl, index=df.index)

# Transform to plot
pipeline = Pipeline([('tfidf', TfidfVectorizer())])
X = pipeline.fit_transform(df['Doc']).todense()
pca = PCA(n_components=3).fit(X)
data2D = pca.transform(X)
kmeans.fit(X)  # refits kmeans on features from a second, different vectorizer
centers2D = pca.transform(kmeans.cluster_centers_)
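
One thing worth checking (an assumption on my part, not confirmed in the original post): kmeans.fit(X) refits the model on features produced by a second, differently configured TfidfVectorizer, so kmeans.cluster_centers_ may no longer have one column per entry of vect.get_feature_names(), which would make terms[ind] go out of range. A minimal sketch that keeps everything in the original 2009-feature space:

import numpy as np
from sklearn.decomposition import PCA

# Sanity check: the centroid matrix must have exactly one column per term.
terms = vect.get_feature_names()
assert kmeans.cluster_centers_.shape[1] == len(terms)

# Plot without refitting kmeans: fit PCA on the same matrix the model was
# trained on, then project both the points and the centroids.
X = vectorized_text.toarray()
pca = PCA(n_components=3).fit(X)
data2D = pca.transform(X)
centers2D = pca.transform(kmeans.cluster_centers_)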
(Comment: More code forming a complete minimal example would help. In short, something goes wrong either in the order_centroids[i, :10] lookup or in the terms[ind] lookup, but it is not clear which.)
(Reply: I updated the question. If you need more rows, let me know @AndrewH)

Top terms per cluster
order_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
terms = vect.get_feature_names()

for i in range(1):  # only the first cluster here; use range(true_k) for all three
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print('%s' % terms[ind])
    print()
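
And to close the loop on the original goal of labeling the texts, a small follow-on sketch (assuming the objects above; the 'Cluster Label' column name is made up for illustration):

# Turn each cluster's top-10 terms into a readable label, then attach it to
# every document through its predicted cluster.
cluster_labels = {
    i: ", ".join(terms[ind] for ind in order_centroids[i, :10])
    for i in range(true_k)
}
df['Cluster Label'] = df['Predicted Cluster'].map(cluster_labels)
print(df[['Doc', 'Predicted Cluster', 'Cluster Label']].head())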