Python NLTK and Pandas - text classifier (newbie) - importing my data in a format similar to the provided example

python pandas nlp


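I'm new to text classification, but I understand most of the concepts. In short, I have a list of restaurant reviews in an Excel dataset, and I want to use them as my training data. What I'm struggling with is the syntax for importing both the actual review and its classification (1 = pos, 0 = neg) as part of the training dataset. I know how to do this if I create the dataset by hand as a list of tuples (i.e., what I currently have under train, commented out below). Any help is appreciated.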

import nltk
from nltk.tokenize import word_tokenize
import pandas as pd

df = pd.read_excel("reviewclasses.xlsx")

# I want these to be the review texts in "train" below (e.g., "this is a negative review")
customerreview = df.customerreview.tolist()

# I also want these to be the labels in "train" below (e.g., 0)
reviewrating = df.reviewrating.tolist()

#train = [("Great place to be when you are in Bangalore.", "1"),
#  ("The place was being renovated when I visited so the seating was limited.", "0"),
#  ("Loved the ambiance, loved the food", "1"),
#  ("The food is delicious but not over the top.", "0"),
#  ("Service - Little slow, probably because too many people.", "0"),
#  ("The place is not easy to locate", "0"),
#  ("Mushroom fried rice was spicy", "1"),
#]

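# Vocabulary: every lower-cased token that occurs in any training review
dictionary = set(word.lower() for passage in train for word in word_tokenize(passage[0]))

# One (features, label) pair per review; the review is lower-cased before
# tokenizing so that its tokens can match the lower-cased vocabulary
t = [({word: (word in word_tokenize(x[0].lower())) for word in dictionary}, x[1]) for x in train]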

# Step 4 – the classifier is trained with sample data
classifier = nltk.NaiveBayesClassifier.train(t)

test_data = "The food sucked and I couldn't wait to leave the terrible restaurant."
test_data_features = {word.lower(): (word in word_tokenize(test_data.lower())) for word in dictionary}

print(classifier.classify(test_data_features))

I figured it out. I basically just needed to merge the two lists into a list of tuples.

def merge(customerreview, reviewrating):
    # Pair the i-th review with the i-th rating
    merged_list = [(customerreview[i], reviewrating[i]) for i in range(len(customerreview))]
    return merged_list

train = merge(customerreview, reviewrating)
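
The same pairing can also be written with the built-in zip, which is a common idiom for this. A minimal sketch, with two short hypothetical lists standing in for the columns read from the Excel file:

```python
# Stand-in lists (hypothetical); in the question these come from
# df.customerreview.tolist() and df.reviewrating.tolist().
customerreview = [
    "Loved the ambiance, loved the food",
    "The place is not easy to locate",
]
reviewrating = [1, 0]

# zip silently stops at the shorter input, so check the lengths first.
assert len(customerreview) == len(reviewrating)

# zip pairs the i-th review with the i-th rating; list() materialises the pairs.
train = list(zip(customerreview, reviewrating))

print(train)
```

The result has the same shape as the hand-written train list, so the rest of the script can stay unchanged.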

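It's almost certainly more efficient to keep the data in the DataFrame itself. Why do you need tuples? What exactly is the problem here?

If I create the data myself (i.e., the 6 reviews above), I can get the classifier to work, but I can't import a dataset containing 100 reviews plus their pos/neg classifications.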
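
For what it's worth, the (review, label) pairs can also be built straight from the DataFrame columns, without the intermediate lists. The sketch below rests on several assumptions: a small in-memory DataFrame stands in for reviewclasses.xlsx, the column names are the ones used in the question, and the dropna and astype steps are additions that are not in the original code. They are there because an empty Excel cell is read as a missing value, which a tokenizer cannot handle, and because numeric ratings would otherwise differ from the "1"/"0" string labels of the hand-written example.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_excel("reviewclasses.xlsx").
df = pd.DataFrame({
    "customerreview": [
        "Loved the ambiance, loved the food",
        "The place is not easy to locate",
        None,  # simulates an empty cell in the spreadsheet
    ],
    "reviewrating": [1, 0, 1],
})

# Drop rows where either the review or the rating is missing.
clean = df.dropna(subset=["customerreview", "reviewrating"])

# Convert ratings to the string labels used in the hand-written data.
labels = clean["reviewrating"].astype(int).astype(str)

# One (review, label) tuple per remaining row.
train = list(zip(clean["customerreview"], labels))

print(train)
```

Whether the labels should be strings or integers is a matter of choice; what matters is that the same type is used everywhere the labels are compared.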