Python: trying to get meaningful words

python pandas

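Hi, I just discovered the tweepy API, and I can create a pandas DataFrame from the tweet objects it returns. I want to count how often each word appears in my tweets. Here is my code: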

freq_df = hastag_tweets_df["Tweet"].apply(lambda x: pd.value_counts(x.split(" "))).sum(axis=0).sort_values(ascending=False).reset_index().head(10)

freq_df.columns = ["Words", "freq"]

print('FREQ DF\n\n')

print(freq_df)
print('\n\n')

a = freq_df[freq_df.freq > freq_df.freq.mean() + freq_df.freq.std()]
# plotting
fig = a.plot.barh(x="Words", y="freq").get_figure()

The result is not what I want, because it always starts with a blank entry and words like "the":

                         Words   freq
0                               301.0
1    the                        164.0

So how can I get the data I need, without the blank row and without words like "the"?

Thanks
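The blank entry at the top of the output most likely comes from `x.split(" ")`: splitting on a single space yields empty strings wherever a tweet contains consecutive or trailing spaces, whereas `split()` with no argument splits on any run of whitespace and produces no empty tokens. A minimal sketch, using a made-up tweet string rather than real data:

```python
# Hypothetical tweet text with a double space and a trailing space
tweet = "the quick  brown fox "

# Splitting on a single space keeps empty strings as tokens
with_blanks = tweet.split(" ")

# Splitting with no argument collapses runs of whitespace
without_blanks = tweet.split()

print(with_blanks)
print(without_blanks)
```

Every empty string is counted as a "word" by `value_counts`, which would explain why the blank row ends up with the highest frequency.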

We can use the spaCy library for this. With it, you can easily remove words such as "the" and "a" (known as stop words):

It is easy to install:

pip install spacy
python -m spacy download en_core_web_sm

import spacy

# Requires the English model installed above (en_core_web_sm)
nlp = spacy.load("en_core_web_sm")


def process_text(text):
    """Return the number of tokens in `text` that are not stop words."""
    # Tokenize the text with spaCy
    doc = nlp(text)
    token_list = [t.text for t in doc]

    # Remove stop words
    filtered_sentence = []
    for word in token_list:
        lexeme = nlp.vocab[word]
        if not lexeme.is_stop:
            filtered_sentence.append(word)

    # Return the number of remaining words; return filtered_sentence
    # instead if you want the list itself.
    return len(filtered_sentence)


# df is assumed to be a DataFrame with a text column named 'comment'
df = df.assign(words_count=lambda d: d['comment'].apply(process_text))
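To get the frequency table from the question rather than a per-row word count, the same stop-word idea can be applied directly in pandas. The following is only a minimal sketch, not the answer author's method: it assumes the `Tweet` column from the question, uses a small hand-written stop-word set (`STOP_WORDS_SAMPLE`, a hypothetical name; spaCy's `STOP_WORDS` set could be substituted), and wraps the steps in a hypothetical helper called `top_words`. The sample data is made up.

```python
import pandas as pd

# Illustrative stop-word set (an assumption); swap in spaCy's STOP_WORDS for real use
STOP_WORDS_SAMPLE = {"the", "a", "an", "and", "is", "to", "of", "in"}


def top_words(tweets, n=10, stop_words=STOP_WORDS_SAMPLE):
    """Return a DataFrame with the n most frequent non-stop words."""
    words = (
        tweets.str.lower()
        .str.split()              # no argument: split on any whitespace, no blank tokens
        .explode()                # one row per word
        .dropna()                 # tweets with no words explode to NaN
        .reset_index(drop=True)
    )
    # Drop stop words, then count the remaining words (sorted descending)
    words = words[~words.isin(stop_words)]
    freq = words.value_counts().head(n)
    return pd.DataFrame({"Words": freq.index, "freq": freq.values})


# Made-up sample data standing in for hastag_tweets_df
sample = pd.DataFrame({"Tweet": ["the cat  sat", "The cat and the dog", "a dog  barked "]})
freq_df = top_words(sample["Tweet"])
print(freq_df)
```

Lowercasing before counting is a design choice here so that "The" and "the" are treated as the same word; drop `.str.lower()` if case matters for your tweets.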

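.value_counts() returns its results sorted in descending order, so the output is as expected. "How can I improve the code?" is a poor question here, because you got the expected result. To ask a better question, please add the desired output, so that the community understands what you want to achieve.

OK, thanks for the guidance, and thanks for the help.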