Python: Classifying Tweet Topics with Logistic Regression


I have a question about using logistic regression. I am currently classifying the topics of tweets in Python. So far I have been able to read the training data from a MySQL table with pandas, clean the training tweets with NLTK, and build feature vectors with CountVectorizer. Here is the code:

import pandas as pd
from sqlalchemy import create_engine
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import re
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

#connect to database and get the training data
engine = create_engine('mysql+mysqlconnector://root:root@localhost:3306/machinelearning')
tweet = pd.read_sql_query('SELECT label, tweets FROM tweetstable', engine, index_col='label')

#TEXT PREPROCESSING (REMOVE HTML MARKUP, REMOVE PUNCTUATION, TOKENIZING, REMOVE STOP WORDS, STEMMING)

def preprocessing(pptweets):
    #lowercase the tweet
    pptweets = pptweets.lower()
    #note: this replaces everything from "https:" to the end of the string with ":", not just the URL itself
    urlrtweets = re.sub(r'https:.*$', ":", pptweets)
    rpptweets = urlrtweets.replace("_", " ")
    #tokenize on runs of word characters, which also strips punctuation
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(rpptweets)
    #remove English stop words
    filteredwords = [w for w in tokens if w not in stopwords.words('english')]
    #stem each remaining word
    stemmer = SnowballStemmer("english")
    stweets = [stemmer.stem(w) for w in filteredwords]
    return " ".join(stweets)

#initialize an empty list to hold the cleaned tweets
cleantweets = []

#loop over each tweet in the tweets column and clean it
for rawtweet in tweet["tweets"]:
    cleantweets.append(preprocessing(rawtweet))

#initialize the "CountVectorizer" object, which is scikit-learn's BoW tools
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)

#fit_transform() does two things: first, it fits the model
#and learns the vocabulary; second, it transforms our training data
#into feature vectors. The input to fit_transform should be a list of strings.
traindatafeatures = vectorizer.fit_transform(cleantweets)

#Numpy arrays are easy to work with, so convert the result to an array
traindatafeatures = traindatafeatures.toarray()
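
At this point it can help to sanity-check the features before training. A minimal sketch, using the variable names above (get_feature_names() is the CountVectorizer API of this scikit-learn vintage; newer releases renamed it get_feature_names_out()):

#inspect the feature matrix and the learned vocabulary
print(traindatafeatures.shape)          #(number of tweets, up to 5000 features)
vocab = vectorizer.get_feature_names()  #one vocabulary term per column
print(vocab[:10])                       #first few terms, in alphabetical order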
The problem I am facing now is that I don't know how to make logistic regression learn from the training data. Below is the code I use to fit the training data to the logistic regression classifier:

#train the model
logmodel = LogisticRegression()
logmodel.fit(traindatafeatures, tweet["label"])

#check trained model intercept

print(logmodel.intercept_)
#check trained model coefficients
print(logmodel.coef_)
I pass traindatafeatures as the input X and tweet["label"] as the label/class Y of each tweet to the logistic regression classifier so that it can learn from them, but when I run the full code I get an error like this:

Traceback (most recent call last):
  File "C:\Users\Indra\Anaconda3\lib\site-packages\pandas\indexes\base.py", line 1945, in get_loc
    return self._engine.get_loc(key)
  File "pandas\index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas\index.c:4154)
  File "pandas\index.pyx", line 159, in pandas.index.IndexEngine.get_loc (pandas\index.c:4018)
  File "pandas\hashtable.pyx", line 675, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12368)
  File "pandas\hashtable.pyx", line 683, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12322)
KeyError: 'label'
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Users/Indra/PycharmProjects/TextClassifier/textclassifier.py", line 52, in <module>
    logmodel.fit(traindatafeatures, tweet["label"])
  File "C:\Users\Indra\Anaconda3\lib\site-packages\pandas\core\frame.py", line 1997, in __getitem__
    return self._getitem_column(key)
  File "C:\Users\Indra\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2004, in _getitem_column
    return self._get_item_cache(key)
  File "C:\Users\Indra\Anaconda3\lib\site-packages\pandas\core\generic.py", line 1350, in _get_item_cache
    values = self._data.get(item)
  File "C:\Users\Indra\Anaconda3\lib\site-packages\pandas\core\internals.py", line 3290, in get
    loc = self.items.get_loc(item)
  File "C:\Users\Indra\Anaconda3\lib\site-packages\pandas\indexes\base.py", line 1947, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas\index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas\index.c:4154)
  File "pandas\index.pyx", line 159, in pandas.index.IndexEngine.get_loc (pandas\index.c:4018)
  File "pandas\hashtable.pyx", line 675, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12368)
  File "pandas\hashtable.pyx", line 683, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12322)
KeyError: 'label'

Can anyone help me solve this problem? (I have been searching for tutorials, but so far I haven't found anything.)

You are setting label as the index (as part of pd.read_sql_query), so it is no longer available as a column. Use tweet.index to get the labels, or remove the index_col argument.

I removed the index_col argument and was able to run the program. What I want to ask now is: what does using tweet.index mean? Could you show how to do it?

He means that if you want to keep label as the index (which is the result of using the index_col keyword with read_sql), then you can pass tweet.index instead of tweet['label'] to .fit(). Since you removed index_col, the former is no longer needed. Ah, I see now. Thanks for the help @Stefan :) Thanks for the help @AlexanderBauer :)
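
To make the fix concrete, here is a minimal sketch of both options, using the same variable names as in the question (the surrounding script is assumed unchanged):

#option 1: keep 'label' as the index and pass the index as y
tweet = pd.read_sql_query('SELECT label, tweets FROM tweetstable', engine, index_col='label')
logmodel = LogisticRegression()
logmodel.fit(traindatafeatures, tweet.index)

#option 2: drop index_col so 'label' stays a regular column
tweet = pd.read_sql_query('SELECT label, tweets FROM tweetstable', engine)
logmodel = LogisticRegression()
logmodel.fit(traindatafeatures, tweet["label"])

Either way, y must contain one label per row of traindatafeatures.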