Python: find the top n terms with the highest TF-IDF score per class


Suppose I have a pandas dataframe with two columns, similar to the one below:

    text                                label
0   This restaurant was amazing         Positive
1   The food was served cold            Negative
2   The waiter was a bit rude           Negative
3   I love the view from its balcony    Positive
Then I use TfidfVectorizer from sklearn on this dataset.

What is the most efficient way to find the top n terms, in terms of TF-IDF score, for each class?

Obviously, my actual dataframe contains many more rows of data than the 4 above.

The point of my post is to find code that works for any dataframe resembling the one above, whether it is a 4-row dataframe or a 1M-row one.
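
For concreteness, a minimal sketch of the setup described above (the variable names df, vectorizer, and X are illustrative):

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    df = pd.DataFrame({
        "text": [
            "This restaurant was amazing",
            "The food was served cold",
            "The waiter was a bit rude",
            "I love the view from its balcony",
        ],
        "label": ["Positive", "Negative", "Negative", "Positive"],
    })

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(df["text"])  # sparse (n_docs, n_terms) matrix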

I believe my post is closely related to the following posts:


    • Below you can find a piece of code I wrote more than three years ago for a similar purpose. I am not sure whether it is the most efficient way to do what you are after, but as far as I can remember, it worked for me.

      # X: documents-terms matrix produced by the vectorizer (sparse)
      # y: targets (the data points' labels)
      # vectorizer: TF-IDF vectorizer created by sklearn
      # n: number of features that we want to list for each class
      # target_list: the list of all unique labels (for example, in my case I have two labels, 1 and -1, so target_list = [1, -1])
      # --------------------------------------------
      import operator

      import numpy as np

      # map feature indices back to their terms once, instead of scanning the vocabulary per term
      index_to_term = {idx: term for term, idx in vectorizer.vocabulary_.items()}

      # splitting the X vectors based on the target classes
      for label in target_list:
          # finding the indices of the rows (data points) of the current class
          indices = [i for i in range(X.shape[0]) if y[i] == label]

          # get the rows of the current class from the tf-idf matrix and average the feature values
          vectors = np.mean(X[indices, :], axis=0)

          # build a dictionary mapping each feature index to its mean tf-idf value
          current_dict = {i: vectors.item((0, i)) for i in range(X.shape[1])}

          # sort the dictionary by value, descending
          sorted_dict = sorted(current_dict.items(), key=operator.itemgetter(1), reverse=True)

          # print the top n features, with their textual and numeric values
          for rank, (feature_index, score) in enumerate(sorted_dict[:n], start=1):
              print(str(rank) + "\t" + index_to_term[feature_index] + "\t" + str(score))
      
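      For completeness, a sketch of how the variables above might be wired up from the question's dataframe; the concrete names (df, n = 3) are illustrative, not part of the original answer:

      from sklearn.feature_extraction.text import TfidfVectorizer

      vectorizer = TfidfVectorizer()
      X = vectorizer.fit_transform(df["text"])     # sparse documents-terms matrix
      y = df["label"].tolist()                     # labels aligned with the rows of X
      target_list = df["label"].unique().tolist()  # e.g. ["Positive", "Negative"]
      n = 3                                        # illustrative value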

      This will give you the top 5 terms for each document. Adjust as needed.

      The following code will do the job (thanks):

      Assume we have an input dataframe df whose structure matches yours.

      from sklearn.feature_extraction.text import TfidfVectorizer
      import pandas as pd

      # override scikit's tfidf-vectorizer in order to return a dataframe with feature names as columns
      class DenseTfIdf(TfidfVectorizer):

          def __init__(self, **kwargs):
              super().__init__(**kwargs)
              for k, v in kwargs.items():
                  setattr(self, k, v)

          def transform(self, x, y=None) -> pd.DataFrame:
              res = super().transform(x)
              # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
              df = pd.DataFrame(res.toarray(), columns=self.get_feature_names_out(), index=x.index)
              return df

          def fit_transform(self, x, y=None) -> pd.DataFrame:
              # run sklearn's fit_transform
              res = super().fit_transform(x, y=y)
              # convert the returned sparse documents-terms matrix into a dataframe for further
              # manipulation, keeping the original index so rows stay aligned with the input dataframe
              df = pd.DataFrame(res.toarray(), columns=self.get_feature_names_out(), index=x.index)
              return df
      
      Comments:

      • @tripleee: Unless you explicitly remove hapaxes, then by the TF-IDF definition, words that are unique to a single input document will receive the highest scores. If you have more than a few dozen words, a "top 3" will be rather meaningless, since all of the top n words will share the same maximal score, and they are often not particularly good indicators at all.
      • Asker: @tripleee, thanks for your comment. However, I think it is fairly obvious that the dataframe is only a small sample; my actual dataframe consists of roughly 100k rows of data. The point of my post is to find code that works for any similar dataframe, whether it has 4 rows or 1M rows. The same goes for whether it should be the top 3, the top 100, or anything else. So let's focus on the question at hand instead of stating the obvious.
      • @tripleee: But the (indeed obvious) answer to your question is "whatever occurs in only one sample". A more useful question would be, for example, "which tokens have a high DF (i.e. a low IDF) in one group but not in the other", but you are not asking that, and we cannot really guess from your post whether that is what you actually want.
      • Asker: Haha @tripleee, my question is not which terms would in general be the top n vocabulary (in terms of TF-IDF score) for each class, because that answer is obvious and is the one you stated. My question is what code to use to efficiently find the top n scoring terms (by TF-IDF) per class with sklearn's TfidfVectorizer. So I need code rather than the obvious textual answer.
      • @tripleee: If I understood what you want, I might post an answer; these comments are meant to get you to clarify what you are trying to achieve. So you are really not looking for, say, the top three most polarized terms? Would that be enough?
      • Asker: OK, thank you (upvoted)! This is a good start; I imagine someone else may come up with a more efficient version.

      Usage:

      # assume the texts are stored in column 'text' within the dataframe
      texts = df['text']
      df_docs_terms_corpus = DenseTfIdf(sublinear_tf=True,
                       max_df=0.5,
                       min_df=2,
                       encoding='ascii',
                       ngram_range=(1, 2),
                       lowercase=True,
                       max_features=1000,
                       stop_words='english'
                      ).fit_transform(texts)


      # keep the indexes aligned between the original dataframe and the resulting documents-terms dataframe
      df_class = df[df["label"] == "Class XX"]
      # use .loc (label-based), since the documents-terms dataframe carries the original index
      df_docs_terms_class = df_docs_terms_corpus.loc[df_class.index]
      # sum over the columns and take the top n keywords
      df_docs_terms_class.sum(axis=0).nlargest(n=50)
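
      The snippet above extracts the keywords of a single class ("Class XX"). To get the top n terms for every class, as the question asks, one can loop over the unique labels; a minimal sketch, with n chosen arbitrarily:

      n = 3  # illustrative value
      for label in df["label"].unique():
          class_index = df[df["label"] == label].index
          # sum the per-document tf-idf weights within the class and keep the n largest
          top_terms = df_docs_terms_corpus.loc[class_index].sum(axis=0).nlargest(n)
          print(label, top_terms.index.tolist())

      Note that summing favors the frequent terms of larger classes; using .mean(axis=0) instead (as the first answer does) normalizes for class size.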