Python: how to find out whether a sentence is positive, neutral, or negative?

python machine-learning nlp nltk sentiment-analysis

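I want to create a script that can determine whether a sentence is positive, neutral, or negative.

Searching online, I found an example showing that this can be done with the NLTK library.

So I tried this code: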

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews


def extract_features(word_list):
    return dict([(word, True) for word in word_list])


if __name__ == '__main__':
    # Load positive and negative reviews
    positive_fileids = movie_reviews.fileids('pos')
    negative_fileids = movie_reviews.fileids('neg')

    features_positive = [(extract_features(movie_reviews.words(fileids=[f])),
                          'Positive') for f in positive_fileids]
    features_negative = [(extract_features(movie_reviews.words(fileids=[f])),
                          'Negative') for f in negative_fileids]

    # Split the data into train and test (80/20)
    threshold_factor = 0.8
    threshold_positive = int(threshold_factor * len(features_positive))
    threshold_negative = int(threshold_factor * len(features_negative))

    features_train = features_positive[:threshold_positive] + features_negative[:threshold_negative]
    features_test = features_positive[threshold_positive:] + features_negative[threshold_negative:]
    print("\nNumber of training datapoints:", len(features_train))
    print("Number of test datapoints:", len(features_test))

    # Train a Naive Bayes classifier
    classifier = NaiveBayesClassifier.train(features_train)
    print("\nAccuracy of the classifier:", nltk.classify.util.accuracy(classifier, features_test))

    print("\nTop 10 most informative words:")
    for item in classifier.most_informative_features()[:10]:
        print(item[0])

    # Sample input reviews
    input_reviews = [
        "Started off as the greatest series of all time, but had the worst ending of all time.",
        "Exquisite. 'Big Little Lies' takes us to an incredible journey with its emotional and intriguing storyline.",
        "I love Brooklyn 99 so much. It has the best crew ever!!",
        "The Big Bang Theory and to me it's one of the best written sitcoms currently on network TV.",
        "'Friends' is simply the best series ever aired. The acting is amazing.",
        "SUITS is smart, sassy, clever, sophisticated, timely and immensely entertaining!",
        "Cumberbatch is a fantastic choice for Sherlock Holmes-he is physically right (he fits the traditional reading of the character) and he is a damn good actor",
        # NOTE: the next string has no trailing comma, so Python joins it with
        # the one after it; the output below shows them as a single review.
        "What sounds like a typical agent hunting serial killer, surprises with great characters, surprising turning points and amazing cast."
        "This is one of the most magical things I have ever had the fortune of viewing.",
        "I don't recommend watching this at all!"
    ]

    print("\nPredictions:")
    for review in input_reviews:
        print("\nReview:", review)
        probdist = classifier.prob_classify(extract_features(review.split()))
        pred_sentiment = probdist.max()
        print("Predicted sentiment:", pred_sentiment)
        print("Probability:", round(probdist.prob(pred_sentiment), 2))

This is the output I got:

Number of training datapoints: 1600
Number of test datapoints: 400

Accuracy of the classifier: 0.735

Top 10 most informative words:
outstanding
insulting
vulnerable
ludicrous
uninvolving
avoids
astounding
fascination
affecting
seagal

Predictions:

Review: Started off as the greatest series of all time, but had the worst ending of all time.
Predicted sentiment: Negative
Probability: 0.64

Review: Exquisite. 'Big Little Lies' takes us to an incredible journey with its emotional and intriguing storyline.
Predicted sentiment: Positive
Probability: 0.89

Review: I love Brooklyn 99 so much. It has the best crew ever!!
Predicted sentiment: Negative
Probability: 0.51

Review: The Big Bang Theory and to me it's one of the best written sitcoms currently on network TV.
Predicted sentiment: Positive
Probability: 0.62

Review: 'Friends' is simply the best series ever aired. The acting is amazing.
Predicted sentiment: Positive
Probability: 0.55

Review: SUITS is smart, sassy, clever, sophisticated, timely and immensely entertaining!
Predicted sentiment: Positive
Probability: 0.82

Review: Cumberbatch is a fantastic choice for Sherlock Holmes-he is physically right (he fits the traditional reading of the character) and he is a damn good actor
Predicted sentiment: Positive
Probability: 1.0

Review: What sounds like a typical agent hunting serial killer, surprises with great characters, surprising turning points and amazing cast.This is one of the most magical things I have ever had the fortune of viewing.
Predicted sentiment: Positive
Probability: 0.95

Review: I don't recommend watching this at all!
Predicted sentiment: Negative
Probability: 0.53

Process finished with exit code 0

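The problem I am facing is that the dataset is very limited, so the accuracy of the output is very low. Is there a better library, resource, or anything else for checking whether a statement is positive, neutral, or negative?

More specifically, I want to apply this to everyday conversation.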

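The Amazon Customer Reviews dataset is a huge dataset of more than 130 million customer reviews. You can use it for sentiment analysis by pairing each review with its rating. The sheer volume of data also suits deep learning methods, which are extremely data-hungry.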
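As a rough illustration of pairing reviews with ratings, the sketch below turns a star rating into a sentiment label. The cut-offs (1-2 stars negative, 3 neutral, 4-5 positive) and the sample rows are assumptions made for this example, not something defined by the dataset itself.

```python
def label_from_rating(star_rating: int) -> str:
    """Map a 1-5 star rating to a coarse sentiment label (assumed cut-offs)."""
    if not 1 <= star_rating <= 5:
        raise ValueError(f"star rating out of range: {star_rating}")
    if star_rating <= 2:
        return "Negative"
    if star_rating == 3:
        return "Neutral"
    return "Positive"


# Hypothetical (review text, star rating) pairs standing in for real dataset rows.
rows = [
    ("Stopped working after two days.", 1),
    ("Does the job, nothing special.", 3),
    ("Exactly what I needed, works great.", 5),
]

# Each review becomes a (text, label) training example.
labeled = [(text, label_from_rating(stars)) for text, stars in rows]
for text, label in labeled:
    print(f"{label:8} {text}")
```

Treating 3-star reviews as neutral is one way to obtain the third class that the two-class movie review corpus does not provide, though mid-range ratings are often mixed rather than truly neutral.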

If you specifically want movie reviews, the Large Movie Review Dataset, which contains 50K+ IMDB reviews, is another option.

I would suggest strengthening your model with word embeddings rather than a one-hot encoded bag of words.
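To show why embeddings can help, here is a minimal sketch that compares the two representations. The three-dimensional vectors in `TOY_EMBEDDINGS` are invented for illustration; a real model would load pretrained vectors (for example word2vec or GloVe) with far more dimensions.

```python
# Toy 3-dimensional embeddings, invented for illustration only.
TOY_EMBEDDINGS = {
    "great":    [0.9, 0.1, 0.0],
    "amazing":  [0.8, 0.2, 0.1],
    "terrible": [-0.9, 0.1, 0.0],
    "awful":    [-0.8, 0.2, 0.1],
}


def bag_of_words(tokens):
    """Presence features as in the question: each word is an unrelated key."""
    return {token: True for token in tokens}


def mean_embedding(tokens, table=TOY_EMBEDDINGS, dim=3):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    vectors = [table[t] for t in tokens if t in table]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return dot / norm if norm else 0.0


a, b = ["great"], ["amazing"]
# The bag-of-words features share no keys, so the two inputs look unrelated.
print(set(bag_of_words(a)) & set(bag_of_words(b)))
# The averaged embeddings point in nearly the same direction.
print(round(cosine(mean_embedding(a), mean_embedding(b)), 2))
```

With bag-of-words features, a classifier that has only seen "great" learns nothing about "amazing"; with embeddings, similar words land close together, so what is learned for one carries over to the other.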

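There are already quite a few corpora available:

English:

1) Multi-Domain Sentiment Dataset

2) IMDB reviews

3) Stanford Sentiment Treebank

4) Sentiment140

5) Twitter US Airline Sentiment

and more.

Chinese:

7) THUCNews

8) Toutiao

9) SogouCA

10) SogouCS

and so on.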

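Once the dataset is large enough, you can use a discriminative model: on small datasets a generative model helps guard against overfitting, while on large datasets a discriminative model can capture dependencies that a generative model cannot.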
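For a concrete picture of the distinction: Naive Bayes, as used in the question, is a generative model, while logistic regression is a common discriminative counterpart that models P(label | features) directly. The following is a minimal pure-Python sketch on a hypothetical toy dataset; the sentences, learning rate, and epoch count are all assumptions chosen for illustration, and in practice a library implementation such as scikit-learn's `LogisticRegression` would be the usual choice.

```python
import math

# Hypothetical toy training data: (tokens, label) with 1 = positive, 0 = negative.
TRAIN = [
    ("i love this show".split(), 1),
    ("great acting and great story".split(), 1),
    ("best series ever".split(), 1),
    ("i hate this show".split(), 0),
    ("worst ending ever".split(), 0),
    ("boring story and bad acting".split(), 0),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, tokens):
    """P(positive | tokens) under a bag-of-words logistic regression."""
    score = bias + sum(weights.get(t, 0.0) for t in set(tokens))
    return sigmoid(score)


def train(data, epochs=200, lr=0.5):
    """Fit the weights by stochastic gradient descent on the log-loss."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for tokens, label in data:
            # Gradient of the log-loss w.r.t. the score is (label - prediction).
            error = label - predict_proba(weights, bias, tokens)
            bias += lr * error
            for t in set(tokens):
                weights[t] = weights.get(t, 0.0) + lr * error
    return weights, bias


weights, bias = train(TRAIN)
for sentence in ["i love the acting", "worst show ever"]:
    p = predict_proba(weights, bias, sentence.split())
    print(f"{sentence!r}: P(positive) = {p:.2f}")
```

Unlike Naive Bayes, the weights here are fitted jointly, so a word that appears in both classes (such as "acting") ends up with little influence instead of being counted as independent evidence.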

It has also been reported that tree-structured models capture sentiment better when data is limited, so that may be worth considering here.
There are plenty of sentiment analysis datasets online that you can use. Otherwise, you can scrape reviews from websites or collect them with the Twitter API.

Thanks for letting me know about the Twitter API... figured it out...

Thanks, I tried VADER sentiment analysis... and got better results than with the code above... So I just want to ask, which is better, TextBlob or VADER?

With a small dataset of only 2,000 records, the question is not which package or classifier is better. You can see that several of the "top 10 most informative words" you learned carry no sentiment at all: "seagal" is just the name of an actor/director, and "avoids", "fascination", and "insulting" are almost meaningless.

In your opinion, which sentiment analysis tool is good to start with: TextBlob, VADER, or something else?

That depends on your needs and on where you will use it in the real world. VADER uses a rule-based approach, which may outperform learning-based approaches when little data is available for your domain. TextBlob, on the other hand, offers a Naive Bayes classifier pretrained on movie reviews (its default analyzer is lexicon-based). Trained on a larger dataset, a classifier of that kind could work better than your code. So the best approach is to try different methods and see which gives the best results for your needs.
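For reference, a minimal sketch of the VADER route mentioned in the comments, using NLTK's `SentimentIntensityAnalyzer`. The ±0.05 cut-off on the compound score follows the convention suggested in VADER's own documentation; the function names are hypothetical, and the NLTK import is placed inside the function only so that the thresholding helper can be used on its own.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Turn a VADER compound score in [-1, 1] into one of three labels."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


def classify_with_vader(sentences):
    """Score sentences with NLTK's VADER; needs nltk and the vader_lexicon data."""
    # Imported here so the helper above stays usable without NLTK installed.
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)  # fetches the lexicon if missing
    analyzer = SentimentIntensityAnalyzer()
    results = []
    for sentence in sentences:
        compound = analyzer.polarity_scores(sentence)["compound"]
        results.append((sentence, compound, label_from_compound(compound)))
    return results


# Usage (requires NLTK, plus network access for the first lexicon download):
#   for sentence, compound, label in classify_with_vader(
#           ["I love Brooklyn 99 so much.", "The meeting is at 3 pm."]):
#       print(label, compound, sentence)
```

Because VADER is rule-based, it needs no training data and yields a neutral class directly, which fits the everyday-conversation use case better than a two-class model trained on movie reviews.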