
Python: How do I group text data by document similarity?

Tags: python, pandas, group-by, nltk, similarity

Consider a DataFrame like the one shown below:

import pandas as pd

df = pd.DataFrame({'Questions': ['What are you doing?', 'What are you doing tonight?',
                                 'What are you doing now?', 'What is your name?',
                                 'What is your nick name?', 'What is your full name?',
                                 'Shall we meet?', 'How are you doing?']})
I want to group the questions by similarity, so that

for _, i in df.groupby('similarity')['Questions']:
    print(i, '\n')

prints something like:

6    Shall we meet?
Name: Questions, dtype: object

3         What is your name?
4    What is your nick name?
5    What is your full name?
Name: Questions, dtype: object

0            What are you doing?
1    What are you doing tonight?
2        What are you doing now?
7             How are you doing?
Name: Questions, dtype: object
A similar question was asked before, but it was unclear and went unanswered. Here is a fairly heavyweight approach: compute the normalized similarity score between every pair of elements in the Series, then group on the resulting list of similarities converted to a string, i.e.

import numpy as np
import nltk
from nltk.corpus import wordnet as wn
import pandas as pd

def convert_tag(tag):
    """Convert a Penn Treebank POS tag to the matching WordNet POS tag."""
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    try:
        return tag_dict[tag[0]]
    except KeyError:
        return None

def doc_to_synsets(doc):
    """
    Returns a list of synsets in document.

    Tokenizes and tags the words in the document doc.
    Then finds the first synset for each word/tag combination.
    If a synset is not found for that combination it is skipped.

    Args:
        doc: string to be converted

    Returns:
        list of synsets

    Example:
        doc_to_synsets('Fish are nvqjp friends.')
        Out: [Synset('fish.n.01'), Synset('be.v.01'),
              Synset('friend.n.01')]
    """

    synsetlist = []
    tokens = nltk.word_tokenize(doc)
    pos = nltk.pos_tag(tokens)
    for tup in pos:
        try:
            # take the first synset for each word/POS combination
            synsetlist.append(wn.synsets(tup[0], convert_tag(tup[1]))[0])
        except IndexError:
            # no synset exists for this word/POS combination; skip it
            continue
    return synsetlist

def similarity_score(s1, s2):
    """
    Calculate the normalized similarity score of s1 onto s2

    For each synset in s1, finds the synset in s2 with the largest similarity value.
    Sum of all of the largest similarity values and normalize this value by dividing it by the number of largest similarity values found.

    Args:
        s1, s2: list of synsets from doc_to_synsets

    Returns:
        normalized similarity score of s1 onto s2

    Example:
        synsets1 = doc_to_synsets('I like cats')
        synsets2 = doc_to_synsets('I like dogs')
        similarity_score(synsets1, synsets2)
        Out: 0.73333333333333339
    """

    highscores = []
    for synset1 in s1:
        highest_yet = 0
        for synset2 in s2:
            simscore = synset1.path_similarity(synset2)
            # path_similarity returns None when the synsets are not connected
            if simscore is not None and simscore > highest_yet:
                highest_yet = simscore

        if highest_yet > 0:
            highscores.append(highest_yet)

    return sum(highscores) / len(highscores) if highscores else 0

def document_path_similarity(doc1, doc2):
    synsets1 = doc_to_synsets(doc1)
    synsets2 = doc_to_synsets(doc2)
    return (similarity_score(synsets1, synsets2) + similarity_score(synsets2, synsets1)) / 2


def similarity(x, df):
    """Similarity scores of question x against every question in the DataFrame."""
    sim_score = []
    for i in df['Questions']:
        sim_score.append(document_path_similarity(x, i))
    return sim_score
With the methods defined above, we can now do:

df['similarity'] = df['Questions'].apply(lambda x: similarity(x, df)).astype(str)

for _, i in df.groupby('similarity')['Questions']:
    print(i,'\n')
Output:

6    Shall we meet?
Name: Questions, dtype: object

3         What is your name?
4    What is your nick name?
5    What is your full name?
Name: Questions, dtype: object

0            What are you doing?
1    What are you doing tonight?
2        What are you doing now?
7             How are you doing?
Name: Questions, dtype: object
This is not the best way to solve the problem, and it is very slow. Any new approaches would be highly appreciated.
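For what it's worth, one possible speed-up, which is my own assumption and not part of the original answer, is to compute the synsets for each question once rather than re-tokenizing inside every pairwise call. A minimal sketch, reusing the functions defined above:

# Sketch: precompute synsets once per question (this optimization is an
# assumption, not part of the original answer).
synsets = [doc_to_synsets(q) for q in df['Questions']]

# Same symmetric score as document_path_similarity, but on cached synsets.
df['similarity'] = [
    str([(similarity_score(s1, s2) + similarity_score(s2, s1)) / 2 for s2 in synsets])
    for s1 in synsets
]

This drops the tokenization and POS-tagging work from once per pair to once per question; the pairwise scoring itself is still quadratic.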

You should first sort all of the names in the list / DataFrame column, and then run the similarity code on only n-1 rows, i.e. compare each row with the next element. If the two are similar, they can be labeled 1 or 0 and you can walk through the list, rather than comparing every row against every other element, which is n^2 comparisons.
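A minimal sketch of this suggestion, assuming the document_path_similarity function from the answer above and a hypothetical cut-off of 0.8 (the threshold value is an assumption, not part of the suggestion):

THRESHOLD = 0.8  # hypothetical cut-off for "similar enough"

# sort first, then make only n-1 adjacent comparisons instead of n^2
df_sorted = df.sort_values('Questions').reset_index(drop=True)

group_ids = [0]
for i in range(len(df_sorted) - 1):
    score = document_path_similarity(df_sorted['Questions'][i],
                                     df_sorted['Questions'][i + 1])
    # start a new group whenever an adjacent pair is not similar enough
    group_ids.append(group_ids[-1] if score >= THRESHOLD else group_ids[-1] + 1)

df_sorted['group'] = group_ids
for _, g in df_sorted.groupby('group')['Questions']:
    print(g, '\n')

The trade-off is that this only catches similar questions that also sort next to each other alphabetically; it gives up some recall in exchange for linear running time.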

Actually, for a problem like this, NLP / cosine similarity really is the best way forward.
Yes, I'm looking into that now; I'll definitely update once I get it working. Still a beginner. Your solution is good too :)
NLP, or fuzzywuzzy @Wen
FuzzyWuzzy sounds promising, but I haven't used it yet. Could you add a solution based on it?
Maybe do some more research based on your own data, since NLP approaches vary a lot and it all depends on the data you are working with.
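For reference, the cosine-similarity route mentioned in these comments could look roughly like the sketch below. The use of scikit-learn's TfidfVectorizer and the 0.3 threshold are both assumptions; neither appears in the thread.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# TF-IDF vector per question, then the full pairwise cosine-similarity matrix
tfidf = TfidfVectorizer().fit_transform(df['Questions'])
sim = cosine_similarity(tfidf)

# hypothetical 0.3 threshold: questions sharing the same pattern of neighbours
# land in the same group, mirroring the string-grouping trick used above
df['similarity'] = [str((row > 0.3).astype(int).tolist()) for row in sim]

for _, g in df.groupby('similarity')['Questions']:
    print(g, '\n')

A fuzzy-matching score such as fuzzywuzzy's fuzz.ratio could be swapped in as the pairwise measure in the same way.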