Python: compare words and return dataframe entries (python, pandas, dataframe, nlp)


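I am planning to set up a simple function to check whether words from a wordlist can be found in the Pandas dataframe common_words. If there is a match, I would like to return the corresponding dataframe entry. The DF has the format

    life balance 14
    long term 9
    upper management 9

showing each word token together with its number of occurrences.

However, the code below currently only prints the search terms from the wordlist (i.e. life balance), not the dataframe entry that includes the occurrence count. So I need a way to return word rather than the wordlist element. Where is the error in my reasoning?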

The relevant part of the code is:

    # Check for matches between wordlist and Pandas dataframe
    def wordcheck():
        wordlist = ["work balance", "good management", "work life"]
        for x in wordlist:
            if df[i].str.contains(x).any():
                print('Group 1:', x)
    wordcheck()

The full code segment is as follows:

import json

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Loading and normalising the input file
with open("glassdoor_A.json", "r") as file:
    data = json.load(file)
df = pd.json_normalize(data)


# Datetime conversion
df['Date'] = pd.to_datetime(df['Date'])
# Adding of 'Quarter' column
df['Quarter'] = df['Date'].dt.to_period('Q')


# Word frequency analysis
def get_top_n_bigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]


# Analysis loops through different qualitative sections
for i in ['Text_Pro','Text_Con','Text_Main']:
    common_words = get_top_n_bigram(df[i], 500)
    for word, freq in common_words:
        print(word, freq)


    # Check for matches between wordlist and Pandas dataframe
    def wordcheck():
        wordlist = ["work balance", "good management", "work life"]
        for x in wordlist:
            if df[i].str.contains(x).any():
                print('Group 1:', x)
    wordcheck()

I may be misunderstanding, but isn't this because you only print the searched term? Would something like the following work better?

# Check for matches between wordlist and Pandas dataframe
def wordcheck():
    wordlist = ["work balance", "good management", "work life"]
    for x in wordlist:
        print('Group 1:', df[i][df[i].str.contains(x).any()])
wordcheck()
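
For anyone trying the row-filtering idea above: calling .any() collapses the boolean mask to a single True/False, so it can no longer be used to select rows. Below is a minimal sketch that keeps the full mask instead. The Series `reviews` and the helper `matching_rows` are made-up stand-ins for df[i] and are not part of the original code.

```python
import pandas as pd

# Hypothetical stand-in for one text column such as df['Text_Pro']
reviews = pd.Series([
    "great work life balance",
    "good management and pay",
    "long hours",
])


def matching_rows(texts, wordlist):
    """Return, per search term, the rows of `texts` that contain it."""
    matches = {}
    for term in wordlist:
        # One boolean per row; regex=False treats the term as a literal
        # substring and na=False counts missing values as non-matches.
        mask = texts.str.contains(term, regex=False, na=False)
        if mask.any():
            # Index with the whole mask, not with mask.any()
            matches[term] = texts[mask]
    return matches


result = matching_rows(reviews, ["work balance", "good management", "work life"])
for term, rows in result.items():
    print("Group 1:", term, rows.tolist())
```

This returns the matching review texts, not the bigram counts, so it only helps if the raw rows are what is wanted.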

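Thanks for your feedback. Yes, it currently returns the searched terms. I have tried your suggested code, but it leads to
print('Group 1:', df[i][df[i].str.contains(x).any()]), KeyError: True
You may need to add the column name, so
print('Group 1:', df[i][df[i]['column_name'].str.contains(x).any()])
Good idea. The value returns True, but not the word variable. I think it has to be linked directly to word, since only this variable connects the word token with its number of occurrences. So ideally something like
print('Group 1:', df[i].loc[df[i]['column_name']].str.contains(x).any())
? Thanks for your effort anyway.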
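
If the goal is the token together with its count, one possible approach is to search common_words directly, since get_top_n_bigram returns a plain list of (word, freq) tuples and not a dataframe. The sketch below is only an illustration under that assumption. The sample data is made up in the format shown in the question, and wordcheck is rewritten here to take its inputs as parameters, which the original does not do.

```python
def wordcheck(common_words, wordlist):
    """Return the (word, freq) pairs whose word appears in wordlist."""
    wanted = set(wordlist)
    # Exact match on the bigram; use `any(w in word for w in wanted)`
    # instead if substring matching is what is needed.
    return [(word, freq) for word, freq in common_words if word in wanted]


# Made-up sample in the same shape that get_top_n_bigram returns
common_words = [("life balance", 14), ("long term", 9), ("upper management", 9)]
wordlist = ["life balance", "good management", "work life"]

for word, freq in wordcheck(common_words, wordlist):
    print("Group 1:", word, freq)
```

Since CountVectorizer lowercases its input by default, the search terms in wordlist should be lowercase as well for an exact match to succeed.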