Python: counting all co-occurrences of a large list of nouns and verbs/adjectives in reviews

I have a dataframe with a large number of reviews, one large list of noun words (1000), and another large list of verbs/adjectives (1000).

Sample dataframe and lists:

import pandas as pd

data = {'reviews':['Very professional operation. Room is very clean and comfortable',
                    'Daniel is the most amazing host! His place is extremely clean, and he provides everything you could possibly want (comfy bed, guidebooks & maps, mini-fridge, towels, even toiletries). He is extremely friendly and helpful.',
                    'The room is very quiet, and well decorated, very clean.',
                    'He provides the room with towels, tea, coffee and a wardrobe.',
                    'Daniel is a great host. Always recomendable.',
                    'My friend and I were very satisfied with our stay in his apartment.']}

df = pd.DataFrame(data)

nouns = ['place','Amsterdam','apartment','location','host','stay','city','room','everything','time','house',
         'area','home','’','center','restaurants','centre','Great','tram','très','minutes','walk','space','neighborhood',
         'à','station','bed','experience','hosts','Thank','bien']

verbs_adj = ['was','is','great','nice','had','clean','were','recommend','stay','are','good','perfect','comfortable',
             'have','easy','be','quiet','helpful','get','beautiful',"'s",'has','est','located','un','amazing','wonderful',]
I want to create a dict of dicts to store all co-occurrences of nouns and verbs/adjectives in each review. For example, for the review

"操作非常专业,。房间非常干净舒适。”

I would expect {'room': {'is': 1, 'clean': 1, 'comfortable': 1}}

Using the following code:

from collections import Counter
from copy import deepcopy
from pprint import pprint

def count_co_occurences(reviews):
    # Iterate on each review and count
    occurences_per_review = {
        f"review_{i+1}": {
            noun: dict(Counter(review.lower().split(" ")))
            for noun in nouns
            if noun in review.lower()
        }
        for i, review in enumerate(reviews)
    }
    # Remove verb_adj not found in main list
    opr = deepcopy(occurences_per_review)
    for review, occurences in opr.items():
        for noun, counts in occurences.items():
            for verb_adj in counts.keys():
                if verb_adj not in verbs_adj:
                    del occurences_per_review[review][noun][verb_adj]
                    
    return occurences_per_review

pprint(count_co_occurences(data["reviews"]))

This works when the lists and the number of reviews are small, but when the function is run on the large lists / large number of reviews, my notebook crashes. How can I modify the code to handle this?

I think you may want to use a couple of libraries to make your life easier. In this case I am using nltk and collections, in addition to pandas of course:

import pandas as pd
import nltk
from collections import Counter

data = {'reviews':['Very professional operation. Room is very clean and comfortable',
                    'Daniel is the most amazing host! His place is extremely clean, and he provides everything you could possibly want (comfy bed, guidebooks & maps, mini-fridge, towels, even toiletries). He is extremely friendly and helpful.',
                    'The room is very quiet, and well decorated, very clean.',
                    'He provides the room with towels, tea, coffee and a wardrobe.',
                    'Daniel is a great host. Always recomendable.',
                    'My friend and I were very satisfied with our stay in his apartment.']}

df = pd.DataFrame(data)

nouns = ['place','Amsterdam','apartment','location','host','stay','city','room','everything','time','house',
         'area','home','’','center','restaurants','centre','Great','tram','très','minutes','walk','space','neighborhood',
         'à','station','bed','experience','hosts','Thank','bien']

verbs_adj = ['was','is','great','nice','had','clean','were','recommend','stay','are','good','perfect','comfortable',
             'have','easy','be','quiet','helpful','get','beautiful',"'s",'has','est','located','un','amazing','wonderful',]

def buildict(x):
    occurdict = {}
    # Tokenize the review and lower-case every token
    # (nltk.word_tokenize relies on the Punkt tokenizer data,
    # downloadable with nltk.download('punkt'))
    tokens = nltk.word_tokenize(x)
    tokenslower = list(map(str.lower, tokens))
    # Keep the tokens that appear in the noun list, and count the tokens
    # that appear in the verb/adjective list
    allnouns = [word for word in tokenslower if word in nouns]
    allverbs_adj = Counter(word for word in tokenslower if word in verbs_adj)
    # Every noun found in the review is paired with the same verb/adjective counts
    for noun in allnouns:
        occurdict[noun] = dict(allverbs_adj)
    return occurdict

df['words']=df['reviews'].apply(lambda x: buildict(x))
Output:

0   Very professional operation. Room is very clea...   {'room': {'is': 1, 'clean': 1, 'comfortable': 1}}
1   Daniel is the most amazing host! His place is ...   {'host': {'is': 3, 'amazing': 1, 'clean': 1, '...
2   The room is very quiet, and well decorated, ve...   {'room': {'is': 1, 'quiet': 1, 'clean': 1}}
3   He provides the room with towels, tea, coffee ...   {'room': {}}
4   Daniel is a great host. Always recomendable.    {'host': {'is': 1, 'great': 1}}
5   My friend and I were very satisfied with our s...   {'stay': {'were': 1, 'stay': 1}, 'apartment': ...
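
One further note on the original scaling concern: with noun and verb/adjective lists of roughly 1000 entries each, the word in nouns and word in verbs_adj membership tests inside buildict are linear scans over Python lists. A small variant (a sketch only, not part of the original answer, and assuming lookup cost is a meaningful part of the slowdown) is to convert both lists to sets once:

# Variant of buildict using sets for O(1) membership tests.
# Assumes the imports, nouns and verbs_adj lists from the answer above.
noun_set = set(nouns)
verb_adj_set = set(verbs_adj)

def buildict_fast(x):
    tokenslower = [t.lower() for t in nltk.word_tokenize(x)]
    found_nouns = {word for word in tokenslower if word in noun_set}
    verb_adj_counts = Counter(word for word in tokenslower if word in verb_adj_set)
    # Each noun found in the review is paired with the same verb/adjective counts
    return {noun: dict(verb_adj_counts) for noun in found_nouns}

df['words'] = df['reviews'].apply(buildict_fast)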

That is exactly what I wanted, thank you. Is it also possible to convert the dict of dicts into a dataframe, so that the rows are all the nouns and the columns are the verbs/adjectives?

That is possible; it is something like dfdict = pd.DataFrame(occurdict).transpose(), where occurdict is what the buildict function returns (the dict of dicts).

I tried your solution and similar code, but they all only output one row and one column for each dictionary value, and I do not know why that happens. But thanks for your help!
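
On the dataframe conversion discussed in the comments: pd.DataFrame(occurdict).transpose() applied to a single review's dict only contains that review's one or two nouns and their verb/adjective counts, which is why the result looks like a single row and column per value. One way to get a full nouns-by-verbs/adjectives table across all reviews (a sketch only; cooccurrence_table is a hypothetical helper, not something from the original thread) is to merge the per-review dicts first:

import pandas as pd
from collections import Counter, defaultdict

# Merge the per-review dict-of-dicts stored in df['words'] into one table:
# rows are nouns, columns are verbs/adjectives, counts summed across reviews.
def cooccurrence_table(word_dicts):
    totals = defaultdict(Counter)   # noun -> Counter of verb/adjective counts
    for per_review in word_dicts:
        for noun, verb_counts in per_review.items():
            totals[noun].update(verb_counts)
    # pd.DataFrame turns the outer keys (nouns) into columns, so transpose;
    # noun/verb pairs that never co-occur become 0
    return pd.DataFrame(totals).T.fillna(0).astype(int)

table = cooccurrence_table(df['words'])
print(table)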