
Python: finding and ranking the words most similar to specific word lists in a corpus of documents


How can I count and score multiple word lists against a corpus of several documents, so that the results can be sorted in several different ways?

  • Find documents in the corpus, and find and rank the words from the lists that are most similar
  • Also be able to find the document closest to a given document
  • For example

    colors  = ['red', 'blue', 'yellow' , 'purple']
    things = ['apple', 'pickle', 'tomato' , 'rainbow', 'book']
    
    corpus = ['i ate a red apple.', 'There are so many colors in the rainbow.', 'the monster was purple and green.', 'the pickle is very green', 'the kid read the book the little red riding hood', 'in the book the wizard of oz there was a yellow brick road.', 'tom has a green thumb and likes working in a garden.' ]
    
    Do I make counters like this?

    # 0 'i ate a red apple.'
    {'red': 1, 'blue': 0, 'yellow': 0, 'purple': 0}
    {'apple': 1, 'pickle': 0, 'tomato': 0, 'rainbow': 0, 'book': 0}
    
    # 1 'There are so many colors in the rainbow.'
    {'red': 0, 'blue': 0, 'yellow': 0, 'purple': 0}
    {'apple': 0, 'pickle': 0, 'tomato': 0, 'rainbow': 1, 'book': 0}
    
    # 2 'the monster was purple and green.'
    {'red': 0, 'blue': 0, 'yellow': 0, 'purple': 1}
    {'apple': 0, 'pickle': 0, 'tomato': 0, 'rainbow': 0, 'book': 0}
    
    # 3 'the pickle is very green'
    {'red': 0, 'blue': 0, 'yellow': 0, 'purple': 0}
    {'apple': 0, 'pickle': 1, 'tomato': 0, 'rainbow': 0, 'book': 0}
    
    # 4 'the kid read the book the little red riding hood'
    {'red': 1, 'blue': 0, 'yellow': 0, 'purple': 0}
    {'apple': 0, 'pickle': 0, 'tomato': 0, 'rainbow': 0, 'book': 1}
    
    # 5 'in the book the wizard of oz there was a yellow brick road.'
    {'red': 0, 'blue': 0, 'yellow': 1, 'purple': 0}
    {'apple': 0, 'pickle': 0, 'tomato': 0, 'rainbow': 0, 'book': 1}
    
    # 6 'tom has a green thumb and likes working in a garden.'
    {'red': 0, 'blue': 0, 'yellow': 0, 'purple': 0}
    {'apple': 0, 'pickle': 0, 'tomato': 0, 'rainbow': 0, 'book': 0}
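As a minimal sketch of the per-document counters above, using only the standard library (`count_topic` is a name introduced here for illustration):

```python
import string
from collections import Counter

colors = ['red', 'blue', 'yellow', 'purple']
things = ['apple', 'pickle', 'tomato', 'rainbow', 'book']

def count_topic(doc, topic_words):
    # Lowercase, strip punctuation, and split into words.
    words = doc.lower().translate(str.maketrans('', '', string.punctuation)).split()
    counts = Counter(words)
    # Report a count for every topic word, including zeros.
    return {w: counts[w] for w in topic_words}

doc = 'i ate a red apple.'
print(count_topic(doc, colors))  # {'red': 1, 'blue': 0, 'yellow': 0, 'purple': 0}
print(count_topic(doc, things))  # {'apple': 1, 'pickle': 0, 'tomato': 0, 'rainbow': 0, 'book': 0}
```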
    
    Or one array for colors and one for things:

    # colors
             0    1    2    3    4    5    6
    red      1    0    0    0    1    0    0
    blue     0    0    0    0    0    0    0
    yellow   0    0    0    0    0    1    0
    purple   0    0    1    0    0    0    0
    
    Then find the most similar, or sort by the closest numbers.


    Or should I be using doc2vec, or something entirely different?

    You can get the counts by iterating over each line and grouping by word:

    import pandas as pd

    def words_counter(corpus_parameter, colors_par, things_par):
        """ Returns two dataframes with the occurrence of the words in colors_par & things_par
        corpus_parameter: list of strings, common language
        colors_par: list of words with no spaces or punctuation
        things_par: list of words with no spaces or punctuation
        """
        colors_count, things_count = [], [] # lists to collect intermediate series
        for i, line in enumerate(corpus_parameter):
            words = pd.Series(
                line
                .strip(' !?.') # remove spaces/punctuation from the left/right of the string
                .lower() # count 'red', 'Red', and 'RED' as the same word
                .split() # split on spaces (' ') by default; a different character can be provided
            ) # returns a clean series with all the words
            # print(words) # uncomment to see the series
            words = words.groupby(words).size() # words as index, counts as values
            # print(words) # uncomment to see the series
            colors_count.append(words.loc[words.index.isin(colors_par)])
            things_count.append(words.loc[words.index.isin(things_par)])

        colors_count = (
            pd.concat(colors_count, axis=1) # convert list of series to dataframe
            .reindex(colors_par) # include colors with zero occurrence
            .fillna(0) # get rid of NaNs
            .astype(int) # convert from default float to integer
        )
        things_count = pd.concat(things_count, axis=1).reindex(things_par).fillna(0).astype(int)

        print(colors_count)
        print(things_count)
        return colors_count, things_count
    
    Call it with:

    words_counter(corpus, colors, things)
    
    Output:

            0  1  2  3  4  5  6
    red     1  0  0  0  1  0  0
    blue    0  0  0  0  0  0  0
    yellow  0  0  0  0  0  1  0
    purple  0  0  1  0  0  0  0
    
             0  1  2  3  4  5  6
    apple    1  0  0  0  0  0  0
    pickle   0  0  0  1  0  0  0
    tomato   0  0  0  0  0  0  0
    rainbow  0  1  0  0  0  0  0
    book     0  0  0  0  1  1  0
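Once you have these count tables, one way to rank documents against a list is to sum each column and sort. This is a sketch, assuming a `colors_count` DataFrame in the same shape as the output above (topic words as rows, document indices as columns):

```python
import pandas as pd

# Hypothetical counts matching the output shown above.
colors_count = pd.DataFrame(
    [[1, 0, 0, 0, 1, 0, 0],
     [0, 0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 1, 0],
     [0, 0, 1, 0, 0, 0, 0]],
    index=['red', 'blue', 'yellow', 'purple'],
)

# Total color-word hits per document, most similar to the list first.
ranking = colors_count.sum().sort_values(ascending=False)
print(ranking)
```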
    

    IIUC, you have many topic sets, such as colors, things, moods, etc., each with some keywords. You want to find the similarity between sentences based on the co-occurrence of keywords from a given topic.

    You can do this with the following steps -

  • Fit a CountVectorizer to get word-occurrence counts for all unique words
  • Filter to only the keywords that appear in the topic
  • Take the dot product between the topic's word occurrences (sentences * topic) and its transpose (topic * sentences) to get a (sentences * sentences) matrix, which is the (un-normalized) cosine similarity between any two sentences for that topic
  • Go to a specific row and take the sentence with the highest similarity in that row (excluding the sentence itself)
  • As a next step, filter that matrix by topic (colors, things, etc.) and take the cosine similarity (normalized dot product). This can be done with the function below -

    def get_similary_table(topic):
        # cdf is the keyword-occurrence dataframe built with CountVectorizer (see the code further below)
        df = cdf.loc[cdf.index.isin(topic)]  # filter by topic
        cnd = df.values
        similarity = cnd.T@cnd  # take the dot product to get the similarity matrix
        dd = pd.DataFrame(similarity, index=corpus, columns=corpus)  # convert to a dataframe
        return dd

    get_similary_table(things)
    

    If you look at a row of this table, the column with the highest value is the most similar sentence. So if you want the single most similar one, just take the max; if you want the top 5, sort and take the top 5 values (and their corresponding columns).
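The top-k lookup on a row can be sketched with pandas' `nlargest`, shown here on a small hypothetical similarity DataFrame `dd`:

```python
import pandas as pd

sentences = ['s0', 's1', 's2', 's3']
# Hypothetical (sentence x sentence) similarity matrix.
dd = pd.DataFrame(
    [[3, 1, 2, 0],
     [1, 2, 0, 1],
     [2, 0, 4, 1],
     [0, 1, 1, 2]],
    index=sentences, columns=sentences,
)

# Drop the sentence itself, then take the 2 most similar columns.
top2 = dd.loc['s0'].drop('s0').nlargest(2)
print(top2)  # 's2' first (similarity 2), then 's1' (similarity 1)
```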

    Below is code to get the sentence most similar to a given sentence:

    import numpy as np

    def get_similar_review(s, topic):
        df = cdf.loc[cdf.index.isin(topic)]  # filter by topic
        cnd = df.values
        similarity = cnd.T@cnd  # take the dot product to get the similarity matrix
        np.fill_diagonal(similarity, 0)  # zero the diagonal so the sentence itself is not returned
        dd = pd.DataFrame(similarity, index=corpus, columns=corpus)  # convert to a dataframe
        return dd.loc[s].idxmax()  # select the sentence's row and get the column name with the max value
    

    If you don't want similarity restricted to a topic, you can skip most of these steps and directly use the CountVectorizer matrix cv: take its dot product to get the (sentences * sentences) matrix and, from that, the similarity matrix.
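A sketch of that all-words variant, using sklearn's `cosine_similarity` on the full CountVectorizer output (the variable names mirror the answer's code, but the exact setup here is an assumption):

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ['i ate a red apple.',
          'the kid read the book the little red riding hood',
          'the pickle is very green']

cv = CountVectorizer()
out = cv.fit_transform(corpus)  # (sentences x vocabulary) counts
sim = cosine_similarity(out)    # normalized (sentences x sentences) matrix
np.fill_diagonal(sim, 0)        # ignore self-similarity
dd = pd.DataFrame(sim, index=corpus, columns=corpus)
print(dd.loc[corpus[0]].idxmax())  # most similar sentence to the first one
```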

    Comments on the question:

  • Just to clarify - do you want the similarity between two documents based only on colors and things? Or do you want similar sentences based on the co-occurrence of all words? Or do you want similar sentences based on context (colors, crayons, blue, etc. share a similar context, as do apple, banana, fruit, salad, etc.)?
  • Each document as a whole; this is just a toy example. In real use the lists might be ten or so words representing moods, topics, and so on, like happy, sad, or others. I'm trying to get counts so I can find, and sort by, each document's similarity to the word lists.
  • So you have a large set of topics, and you are finding similarity based on the words in those specific topics? E.g. one topic set could be moods, and you want to find sentences with the mood anger?
  • Yes, but the sentiment analysis is already done; this is meant to work alongside filtering. I have a large number of reviews, all about people, and I want to find similarity based on the words in these specific word lists. I want to classify people by the similarity of their documents (each person has one document, or one column) to a given list of specific words, so I can sort by most similar to list 1 or list 2, or most similar to doc1 or a person's name.
  • Check my answer; your question contains several variations of the same problem, all modifications of the same approach.
  • Thanks for your help; some of the methods are much faster. I learned a lot and really appreciate it.
    from sklearn.feature_extraction.text import CountVectorizer

    cv = CountVectorizer()
    out = cv.fit_transform(corpus).toarray()  # apply CountVectorizer

    # For scalability (you can have many more topics, like moods etc.),
    # combine all topics first and filter by the given topic later
    combined = colors + things  # combine all your topics

    c = [(k, v) for k, v in cv.vocabulary_.items() if k in combined]  # get indexes for the items from all topics

    cdf = pd.DataFrame(out[:, [i[1] for i in c]], columns=[i[0] for i in c]).T  # filter the cv dataframe for those items

    print(cdf)

    # This results in a keyword-occurrence dataset with all keywords from all topics
             0  1  2  3  4  5  6
    red      1  0  0  0  1  0  0
    apple    1  0  0  0  0  0  0
    rainbow  0  1  0  0  0  0  0
    purple   0  0  1  0  0  0  0
    pickle   0  0  0  1  0  0  0
    book     0  0  0  0  1  1  0
    yellow   0  0  0  0  0  1  0
    
    s = 'i ate a red apple.'
    get_similar_review(s, colors)

    #'the kid read the book the little red riding hood'

    s = 'the kid read the book the little red riding hood'
    get_similar_review(s, things)

    #'in the book the wizard of oz there was a yellow brick road.'
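The answer's step list mentions normalized cosine similarity, while the functions above return the raw dot product. A sketch of normalizing the per-topic dot product (the tiny `cdf`-style matrix here is hypothetical, in the same word-by-sentence shape as above):

```python
import numpy as np
import pandas as pd

corpus = ['i ate a red apple.', 'There are so many colors in the rainbow.']
# Hypothetical per-topic count matrix (words x sentences), like cdf filtered by topic.
cdf = pd.DataFrame([[1, 0], [0, 1]], index=['red', 'rainbow'], columns=[0, 1])

cnd = cdf.values.astype(float)
similarity = cnd.T @ cnd                      # un-normalized dot product
norms = np.linalg.norm(cnd, axis=0)           # per-sentence vector lengths
norms[norms == 0] = 1                         # avoid division by zero for all-zero sentences
cosine = similarity / np.outer(norms, norms)  # normalized cosine similarity
dd = pd.DataFrame(cosine, index=corpus, columns=corpus)
print(dd)
```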