Python: Classifying Tweet Topics with Logistic Regression


I have a question about using logistic regression. I am currently classifying the topics of tweets in Python. So far I have been able to read the training data from a MySQL table with pandas, clean the training tweets with NLTK, and build feature vectors with CountVectorizer. Here is the code:

import pandas as pd
from sqlalchemy import create_engine
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import re
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

#connect to database and get the training data
engine = create_engine('mysql+mysqlconnector://root:root@localhost:3306/machinelearning')
tweet = pd.read_sql_query('SELECT label, tweets FROM tweetstable', engine, index_col='label')

#TEXT PREPROCESSING (REMOVE HTML MARKUP, REMOVE PUNCTUATION, TOKENIZING, REMOVE STOP WORDS, STEMMING)

def preprocessing(pptweets):
    #lowercase the tweet
    pptweets = pptweets.lower()
    #note: this replaces everything from "https:" to the end of the string with ":", not just the URL itself
    urlrtweets = re.sub(r'https:.*$', ":", pptweets)
    rpptweets = urlrtweets.replace("_", " ")
    #tokenize on runs of word characters, which also strips punctuation
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(rpptweets)
    #remove English stop words
    filteredwords = [w for w in tokens if w not in stopwords.words('english')]
    #stem each remaining word
    stemmer = SnowballStemmer("english")
    stweets = [stemmer.stem(w) for w in filteredwords]
    return " ".join(stweets)

#initialize an empty list to hold the cleaned tweets
cleantweets = []

#loop over each tweet in the tweets column and clean it
for rawtweet in tweet["tweets"]:
    cleantweets.append(preprocessing(rawtweet))

#initialize the "CountVectorizer" object, which is scikit-learn's BoW tools
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)

#fit_transform() does two things: first, it fits the model
#and learns the vocabulary; second, it transforms our training data
#into feature vectors. The input to fit_transform should be a list of strings.
traindatafeatures = vectorizer.fit_transform(cleantweets)

#Numpy arrays are easy to work with, so convert the result to an array
traindatafeatures = traindatafeatures.toarray()
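
At this point it can help to sanity-check the features before training. A minimal sketch, using the variable names above (get_feature_names() is the CountVectorizer API of this scikit-learn vintage; newer releases renamed it get_feature_names_out()):

#inspect the feature matrix and the learned vocabulary
print(traindatafeatures.shape)          #(number of tweets, up to 5000 features)
vocab = vectorizer.get_feature_names()  #one vocabulary term per column
print(vocab[:10])                       #first few terms, in alphabetical order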
The problem I am facing now is that I don't know how to make logistic regression learn from the training data. Below is the code I use to fit the training data to the logistic regression classifier:

#train the model
logmodel = LogisticRegression()
logmodel.fit(traindatafeatures, tweet["label"])

#check trained model intercept

print(logmodel.intercept_)
#check trained model coefficients
print(logmodel.coef_)
I pass traindatafeatures as the input X and tweet["label"] as the label/class Y of each tweet to the logistic regression classifier so that it can learn from them, but when I run the full code I get an error like this:

Traceback (most recent call last):
  File "C:\Users\Indra\Anaconda3\lib\site-packages\pandas\indexes\base.py", line 1945, in get_loc
    return self._engine.get_loc(key)
  File "pandas\index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas\index.c:4154)
  File "pandas\index.pyx", line 159, in pandas.index.IndexEngine.get_loc (pandas\index.c:4018)
  File "pandas\hashtable.pyx", line 675, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12368)
  File "pandas\hashtable.pyx", line 683, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12322)
KeyError: 'label'
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Users/Indra/PycharmProjects/TextClassifier/textclassifier.py", line 52, in <module>
    logmodel.fit(traindatafeatures, tweet["label"])
  File "C:\Users\Indra\Anaconda3\lib\site-packages\pandas\core\frame.py", line 1997, in __getitem__
    return self._getitem_column(key)
  File "C:\Users\Indra\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2004, in _getitem_column
    return self._get_item_cache(key)
  File "C:\Users\Indra\Anaconda3\lib\site-packages\pandas\core\generic.py", line 1350, in _get_item_cache
    values = self._data.get(item)
  File "C:\Users\Indra\Anaconda3\lib\site-packages\pandas\core\internals.py", line 3290, in get
    loc = self.items.get_loc(item)
  File "C:\Users\Indra\Anaconda3\lib\site-packages\pandas\indexes\base.py", line 1947, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas\index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas\index.c:4154)
  File "pandas\index.pyx", line 159, in pandas.index.IndexEngine.get_loc (pandas\index.c:4018)
  File "pandas\hashtable.pyx", line 675, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12368)
  File "pandas\hashtable.pyx", line 683, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12322)
KeyError: 'label'

Can anyone help me solve this problem? (I have been searching for tutorials, but so far I haven't found anything.)

You are setting label as the index (as part of pd.read_sql_query), so it is no longer available as a column. Use tweet.index to get the labels, or remove the index_col argument.

I removed the index_col argument and was able to run the program. What I want to ask now is: what does using tweet.index mean? Could you show how to do it?

He means that if you want to keep label as the index (which is the result of using the index_col keyword with read_sql), then you can pass tweet.index instead of tweet['label'] to .fit(). Since you removed index_col, the former is no longer needed. Ah, I see now. Thanks for the help @Stefan :) Thanks for the help @AlexanderBauer :)
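
To make the fix concrete, here is a minimal sketch of both options, using the same variable names as in the question (the surrounding script is assumed unchanged):

#option 1: keep 'label' as the index and pass the index as y
tweet = pd.read_sql_query('SELECT label, tweets FROM tweetstable', engine, index_col='label')
logmodel = LogisticRegression()
logmodel.fit(traindatafeatures, tweet.index)

#option 2: drop index_col so 'label' stays a regular column
tweet = pd.read_sql_query('SELECT label, tweets FROM tweetstable', engine)
logmodel = LogisticRegression()
logmodel.fit(traindatafeatures, tweet["label"])

Either way, y must contain one label per row of traindatafeatures.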