Python: getting TF-IDF values
I have this TF-IDF code for my news dataset:
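import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Use this one
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(text)
# get_feature_names() was removed in scikit-learn 1.2; on older versions use it instead
terms = vectorizer.get_feature_names_out()
# sum the TF-IDF weight of each term across all documents
sums = vectors.sum(axis=0)
# pair each term with its summed weight
data = []
for col, term in enumerate(terms):
    data.append((term, sums[0, col]))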
ranking = pd.DataFrame(data, columns=['term','rank'])
#print(ranking.sort_values('rank', ascending=False))
tfrank = ranking.sort_values('rank', ascending=False)
tf = tfrank['term'].values.tolist()
tflist = tfrank.values.tolist()
tflist
The result looks like this:
[['kompascom', 1.4017648244641259],
['yang', 1.3134948993732996],
['lembaga', 0.9450441338264206],
['sumber', 0.9201203935242713],
['di', 0.8774768633619345],
['fakta', 0.7941379426583972],
['dan', 0.7820675768624364],
['ini', 0.7721892264143173],
['bisa', 0.7215355604434974],
['informasi', 0.7038273489379546],
['hoaks', 0.6443546898427824],
['ifcn', 0.6310537233704365],
['atau', 0.6094359873139008],
['penguji', 0.5945524698582002],
['internasional', 0.5945524698582002],
['rubrik', 0.5534905743539935],
['khusus', 0.5534905743539935],
['masyarakat', 0.5473499161901632],
['dalam', 0.5325014351825453],...]
I have three sets, and I take their intersection:
LDA_set = set(ldasort)
NMF_set = set(nmsort)
TFIDF_set = set(tf)
itrsect = LDA_set.intersection(NMF_set, TFIDF_set)
itrsect
The intersection result looks like this:
{'14',
'2018',
'23',
'49',
'acara',
'ada',
'adalah',
'agar',
'antara',
'atas',
'atau',
'awal',
'banjir',
'baru',
'belum',
'beredar',
'berisi',
'beritanya',
'berpartisipasi',...}
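What I actually want is for the intersection result to carry the TF-IDF scores as well. How do I do that with a loop?
The expected output would be, for example: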
[['14', 1.4017648244641259],
['2018', 1.3134948993732996],
['23', 0.9450441338264206],
['49', 0.9201203935242713],
['acara', 0.8774768633619345],
['ada', 0.7941379426583972],
['adalah', 0.7820675768624364],
['agar', 0.7721892264143173],
['atas', 0.7215355604434974],
['atau', 0.7038273489379546],
['awal', 0.6443546898427824],
['banjir', 0.6310537233704365],
['baru', 0.6094359873139008],
['belum', 0.5945524698582002],
['beredar', 0.5945524698582002],
['berisi', 0.5534905743539935],
['beritanya', 0.5534905743539935],
['berpartisipasi', 0.5473499161901632],...]
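What do you have in ldasort? set() cannot hold lists as elements. If it is a dictionary, set() only picks up its keys(), so you may need to convert it to a list of tuples manually with ldasort.items().
BTW: if one dataset has ("14", 1.0) and another has ("14", 0.1), set() treats them as two different elements and the intersection drops both. Maybe you should keep the current intersection and use it to pull the matching items from ldasort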
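To make the set() remark concrete, here is a minimal sketch. The names lda_pairs and tfidf_pairs and their values are made up for illustration; they are not the asker's data.

```python
# Hypothetical (term, score) pairs from two different models
lda_pairs = {("14", 1.0), ("atau", 0.5)}
tfidf_pairs = {("14", 0.1), ("atau", 0.5)}

# Tuples are compared as a whole, so ("14", 1.0) and ("14", 0.1) do not match
common_pairs = lda_pairs & tfidf_pairs

# Intersecting on the terms alone keeps "14" as well
common_terms = {term for term, _ in lda_pairs} & {term for term, _ in tfidf_pairs}

# Lists are unhashable, so they cannot be set elements at all
try:
    set([["14", 1.0]])
    list_in_set_ok = True
except TypeError:
    list_in_set_ok = False

print(common_pairs)
print(sorted(common_terms))
```

This is why intersecting plain terms first and attaching the scores afterwards tends to be the simpler route.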
@furas ldasort is a list.
Maybe use itrsect and a for-loop to get the scores from tfrank.
Would you mind posting detailed code? @furas
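One way the itrsect plus for-loop suggestion could look, as a sketch only. The tflist and itrsect values below are small stand-ins shaped like the ones in the question (two of the scores are invented), so swap in the real variables.

```python
# Stand-in for tflist from the question: [term, score] pairs,
# already sorted by descending TF-IDF score
tflist = [
    ["kompascom", 1.4017648244641259],
    ["yang", 1.3134948993732996],
    ["atau", 0.6094359873139008],
    ["banjir", 0.25],  # made-up score for illustration
    ["ada", 0.125],    # made-up score for illustration
]

# Stand-in for the intersection of the three sets
itrsect = {"ada", "atau", "banjir"}

# Keep only the pairs whose term is in the intersection;
# set membership tests are fast, and the TF-IDF ordering is preserved
scored = []
for term, score in tflist:
    if term in itrsect:
        scored.append([term, score])

print(scored)
```

The result stays in descending score order because tflist was sorted that way; sorted(scored) would give alphabetical order instead. If you would rather stay in pandas, tfrank[tfrank['term'].isin(itrsect)] should select the same rows, assuming tfrank is built as in the question.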