Text encoding in a Python ML classifier

python machine-learning encoding scikit-learn

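I am trying to build an ML model. However, I am having difficulty understanding where to apply the encoding. Please see the steps and functions below, which reproduce the process I have been following.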

First, I split the dataset into train and test:

# Imports (pandas is needed for pd.concat and pd.DataFrame below)
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
import string
from nltk.corpus import stopwords
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer
from sklearn.utils import resample
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score
# Split into training and test sets

# Testing Count Vectorizer

X = df[['Text']] 
y = df['Label']


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)

# Returning to one dataframe
training_set = pd.concat([X_train, y_train], axis=1)

Now I apply (under)sampling:

# Separating classes
spam = training_set[training_set.Label == 1]
not_spam = training_set[training_set.Label == 0]

# Undersampling the majority
undersample = resample(not_spam, 
                       replace=True, 
                       n_samples=len(spam), #set the number of samples to equal the number of the minority class
                       random_state=40)
# Returning to new training set
undersample_train = pd.concat([spam, undersample])

I apply the chosen algorithm:

full_result = pd.DataFrame(columns = ['Preprocessing', 'Model', 'Precision', 'Recall', 'F1-score', 'Accuracy'])

X, y = BOW(undersample_train)
full_result = full_result.append(training_naive(X_train, X_test, y_train, y_test, 'Count Vectorize'), ignore_index = True)

where BOW is defined as follows:

def BOW(data):
    
    df_temp = data.copy(deep = True)
    df_temp = basic_preprocessing(df_temp)

    count_vectorizer = CountVectorizer(analyzer=fun)
    count_vectorizer.fit(df_temp['Text'])

    list_corpus = df_temp["Text"].tolist()
    list_labels = df_temp["Label"].tolist()
    
    X = count_vectorizer.transform(list_corpus)
    
    return X, list_labels

and basic_preprocessing is defined as follows:

def basic_preprocessing(df):
    
    df_temp = df.copy(deep = True)
    df_temp = df_temp.rename(index = str, columns = {'Clean_Titles_2': 'Text'})
    df_temp.loc[:, 'Text'] = [text_prepare(x) for x in df_temp['Text'].values]
    
    #le = LabelEncoder()
    #le.fit(df_temp['medical_specialty'])
    #df_temp.loc[:, 'class_label'] = le.transform(df_temp['medical_specialty'])
    
    tokenizer = RegexpTokenizer(r'\w+')
    df_temp["Tokens"] = df_temp["Text"].apply(tokenizer.tokenize)
    
    return df_temp

where text_prepare is:

def text_prepare(text):

    REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
    BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
    STOPWORDS = set(stopwords.words('english'))
    
    text = text.lower()
    text = REPLACE_BY_SPACE_RE.sub(' ', text) # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = BAD_SYMBOLS_RE.sub('', text) # delete symbols which are in BAD_SYMBOLS_RE from text
    words = text.split()
    i = 0
    while i < len(words):
        if words[i] in STOPWORDS:
            words.pop(i)
        else:
            i += 1
    text = ' '.join(map(str, words))# delete stopwords from text
    
    return text

As you can see, the order is as follows:

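define text_prepare for text cleaning
define basic_preprocessing
define BOW
split the dataset into train and test
apply the sampling
apply the algorithm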

What I do not understand is how to encode the text correctly so that the algorithm works. My dataset is called df, with the columns:

Label      Text                                 Year
1         bla bla bla                           2000
0         add some words                        2012
1         this is just an example               1998
0         unfortunately the code does not work  2018
0         where should I apply the encoding?    2000
0         What am I missing here?               2005

The order in which I apply BOW must be wrong, because I get the following error:

ValueError: could not convert string to float: 'Expect good results if ...'
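I followed the steps (and the code) from this link.
However, the sampling part there is wrong, because it should be done on the training set only, that is, after the split. The principle should be: (1) split into train/test; (2) apply resampling to the training set, so that the model is trained on balanced data; (3) apply the model to the test set and evaluate on it.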
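The three steps above could look roughly like this. It is only a minimal sketch on an invented toy frame: the helper name undersample_majority is hypothetical, the split is stratified for convenience, and the majority class is sampled without replacement, which is one common choice for undersampling.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced data, invented for illustration (1 = spam, 0 = not spam).
df = pd.DataFrame({
    "Text": [f"message {i}" for i in range(20)],
    "Label": [1] * 5 + [0] * 15,
})


def undersample_majority(train_df, label_col="Label", seed=40):
    """Hypothetical helper: shrink the larger class to the size of the smaller one."""
    counts = train_df[label_col].value_counts()
    minority_label = counts.idxmin()
    minority = train_df[train_df[label_col] == minority_label]
    majority = train_df[train_df[label_col] != minority_label]
    majority_down = resample(majority, replace=False,
                             n_samples=len(minority), random_state=seed)
    return pd.concat([minority, majority_down])


# (1) Split first; (2) resample only the training part.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=40,
                                     stratify=df["Label"])
balanced_train = undersample_majority(train_df)

# (3) test_df is left untouched and is used only for evaluation.
print(balanced_train["Label"].value_counts())
```

Because the resampling happens after the split, no row of the test set can leak into the balanced training set.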
I am happy to provide further information, data and/or code, but I think I have provided all the most relevant steps.
Thank you very much.
You need a test_BOW function that reuses the CountVectorizer model built during the training phase.
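To illustrate why the fitted vectorizer has to be reused, here is a minimal sketch with made-up toy sentences (the data and variable names are assumptions, not the asker's dataset). The vocabulary is learned from the training text only, and the same fitted object then transforms the test text, so both matrices end up with the same columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy data, invented for illustration only.
train_text = ["bla bla bla", "add some words", "this is just an example"]
test_text = ["some new words", "bla example"]

vectorizer = CountVectorizer()
# fit_transform learns the vocabulary from the training text only.
X_train_bow = vectorizer.fit_transform(train_text)
# transform reuses that vocabulary; words never seen in training
# (here "new") are simply ignored.
X_test_bow = vectorizer.transform(test_text)

print(X_train_bow.shape, X_test_bow.shape)
```

Fitting a second vectorizer on the test text would instead produce a different vocabulary, and the classifier would receive columns that no longer mean the same thing.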
Consider using a Pipeline to make the code less verbose.
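As a rough idea of what the Pipeline suggestion could look like, the sketch below chains CountVectorizer and MultinomialNB. The toy sentences, labels and step names ("bow", "nb") are invented for illustration; the asker's custom analyzer and preprocessing are left out to keep it short.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration only (1 = spam, 0 = not spam).
train_text = ["win money now", "cheap money offer",
              "meeting at noon", "lunch at noon tomorrow"]
train_labels = [1, 1, 0, 0]
test_text = ["cheap offer now", "meeting tomorrow"]

# The pipeline fits the vectorizer and the classifier together on the
# training text, and applies the same fitted vectorizer at predict time.
model = Pipeline([
    ("bow", CountVectorizer()),
    ("nb", MultinomialNB()),
])
model.fit(train_text, train_labels)
predictions = model.predict(test_text)
print(predictions)
```

With this layout the raw strings go straight into fit and predict, so there is no separate place where the test text could be passed to the classifier unencoded.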
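import re
import string
from io import StringIO

import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.utils import resample


def fun(text):
    # Custom analyzer: strip punctuation, then drop English stopwords
    remove_punc = [c for c in text if c not in string.punctuation]
    remove_punc = ''.join(remove_punc)
    cleaned = [w for w in remove_punc.split() if w.lower()
               not in stopwords.words('english')]
    return cleaned


# Testing Count Vectorizer
def BOW(data):
    df_temp = data.copy(deep=True)
    df_temp = basic_preprocessing(df_temp)

    count_vectorizer = CountVectorizer(analyzer=fun)
    count_vectorizer.fit(df_temp['Text'])

    list_corpus = df_temp["Text"].tolist()
    list_labels = df_temp["Label"].tolist()

    X = count_vectorizer.transform(list_corpus)

    # Also return the fitted vectorizer so it can be reused on the test set
    return X, list_labels, count_vectorizer


def test_BOW(data, count_vectorizer):
    df_temp = data.copy(deep=True)
    df_temp = basic_preprocessing(df_temp)

    list_corpus = df_temp["Text"].tolist()
    list_labels = df_temp["Label"].tolist()

    # Only transform here: the vocabulary comes from the training phase
    X = count_vectorizer.transform(list_corpus)

    return X, list_labels


def basic_preprocessing(df):
    df_temp = df.copy(deep=True)
    df_temp = df_temp.rename(index=str, columns={'Clean_Titles_2': 'Text'})
    df_temp.loc[:, 'Text'] = [text_prepare(x) for x in df_temp['Text'].values]

    tokenizer = RegexpTokenizer(r'\w+')
    df_temp["Tokens"] = df_temp["Text"].apply(tokenizer.tokenize)

    return df_temp


def text_prepare(text):
    REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
    BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
    STOPWORDS = set(stopwords.words('english'))

    text = text.lower()
    # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = REPLACE_BY_SPACE_RE.sub(' ', text)
    # delete symbols which are in BAD_SYMBOLS_RE from text
    text = BAD_SYMBOLS_RE.sub('', text)
    words = text.split()
    i = 0
    while i < len(words):
        if words[i] in STOPWORDS:
            words.pop(i)
        else:
            i += 1
    text = ' '.join(map(str, words))  # delete stopwords from text

    return text


s = """Label      Text                                 Year
1         bla bla bla                           2000
0         add some words                        2012
1         this is just an example               1998
0         unfortunately the code does not work  2018
0         where should I apply the encoding?    2000
0         What am I missing here?               2005"""

# Columns are separated by two or more spaces; a regex separator needs the python engine
df = pd.read_csv(StringIO(s), sep=r'\s{2,}', engine='python')

X = df[['Text']]
y = df['Label']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=40)

# Returning to one dataframe
training_set = pd.concat([X_train, y_train], axis=1)

# Separating classes
spam = training_set[training_set.Label == 1]
not_spam = training_set[training_set.Label == 0]

# Undersampling the majority
undersample = resample(not_spam,
                       replace=True,
                       # set the number of samples to equal the number of the minority class
                       n_samples=len(spam),
                       random_state=40)

# Returning to new training set
undersample_train = pd.concat([spam, undersample])

full_result = pd.DataFrame(columns=['Preprocessing', 'Model', 'Precision',
                                    'Recall', 'F1-score', 'Accuracy'])

# Fit the vectorizer on the training set, then reuse it on the test set
train_x, train_y, count_vectorizer = BOW(undersample_train)
testing_set = pd.concat([X_test, y_test], axis=1)
test_x, test_y = test_BOW(testing_set, count_vectorizer)


def training_naive(X_train_naive, X_test_naive, y_train_naive, y_test_naive, preproc):
    clf = MultinomialNB()  # Multinomial Naive Bayes
    clf.fit(X_train_naive, y_train_naive)

    res = pd.DataFrame(columns=['Preprocessing', 'Model', 'Precision', 'Recall', 'F1-score', 'Accuracy'])

    y_pred = clf.predict(X_test_naive)

    # scikit-learn metrics expect (y_true, y_pred) in that order
    f1 = f1_score(y_test_naive, y_pred, average='weighted')
    pres = precision_score(y_test_naive, y_pred, average='weighted')
    rec =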
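The listing above breaks off inside training_naive. As a self-contained illustration of how the four metrics could be collected into a results table, here is a hedged sketch; the helper name score_row and the labels are invented, and this is not the original answer's continuation. It uses pd.concat to build the table, since DataFrame.append was removed in pandas 2.0.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def score_row(y_true, y_pred, preprocessing, model_name):
    """Hypothetical helper: collect the four metrics in a one-row DataFrame."""
    return pd.DataFrame([{
        "Preprocessing": preprocessing,
        "Model": model_name,
        "Precision": precision_score(y_true, y_pred, average="weighted", zero_division=0),
        "Recall": recall_score(y_true, y_pred, average="weighted", zero_division=0),
        "F1-score": f1_score(y_true, y_pred, average="weighted", zero_division=0),
        "Accuracy": accuracy_score(y_true, y_pred),
    }])


# Invented labels and predictions, for illustration only.
y_true = [1, 0, 0, 1]
y_pred = [1, 0, 1, 1]

rows = [score_row(y_true, y_pred, "Count Vectorize", "Multinomial Naive Bayes")]
full_result = pd.concat(rows, ignore_index=True)
print(full_result)
```

Further preprocessing variants can be compared by appending more rows to the list before the concat.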