Python scikit classifiers have very low accuracy (Naive Bayes, DecisionTreeClassifier)

python machine-learning scikit-learn
I am using this dataset, and the documentation says the accuracy should be around 84%. Unfortunately, my program's accuracy is 25%.

To process the data, I did the following:
1. Loaded the .txt data file and converted it to a .csv
2. Removed data with missing values
3. Extracted the class values, <=50K and >50K, and converted them to 0 and 1 respectively
4. For each attribute, and for each string value of that attribute, I
   mapped it to an integer value. Example: att1{'cs':0, 'cs2':1},
   att2{'usa':0, 'greece':1} ... and so on
5. Called naive bayes on the new integer data set

The module that loads the csv:
import numpy as np

attributes = {  'Private':0, 'Self-emp-not-inc':1, 'Self-emp-inc':2, 'Federal-gov':3, 'Local-gov':4, 'State-gov':5, 'Without-pay':6, 'Never-worked':7,
            'Bachelors':0, 'Some-college':1, '11th':2, 'HS-grad':3, 'Prof-school':4, 'Assoc-acdm':5, 'Assoc-voc':6, '9th':7, '7th-8th':8, '12th':9, 'Masters':10, '1st-4th':11, '10th':12,                  'Doctorate':13, '5th-6th':14, 'Preschool':15,
            'Married-civ-spouse':0, 'Divorced':1, 'Never-married':2, 'Separated':3, 'Widowed':4, 'Married-spouse-absent':5, 'Married-AF-spouse':6,
            'Tech-support':0, 'Craft-repair':1, 'Other-service':2, 'Sales':3, 'Exec-managerial':4, 'Prof-specialty':5, 'Handlers-cleaners':6, 'Machine-op-inspct':7, 'Adm-clerical':8, 
            'Farming-fishing':9, 'Transport-moving':10, 'Priv-house-serv':11, 'Protective-serv':12, 'Armed-Forces':13,
            'Wife':0, 'Own-child':1, 'Husband':2, 'Not-in-family':3, 'Other-relative':4, 'Unmarried':5,
            'White':0, 'Asian-Pac-Islander':1, 'Amer-Indian-Eskimo':2, 'Other':3, 'Black':4,
            'Female':0, 'Male':1,
            'United-States':0, 'Cambodia':1, 'England':2, 'Puerto-Rico':3, 'Canada':4, 'Germany':5, 'Outlying-US(Guam-USVI-etc)':6, 'India':7, 'Japan':8, 'Greece':9, 'South':10, 'China':11,                   'Cuba':12, 'Iran':13, 'Honduras':14, 'Philippines':15, 'Italy':16, 'Poland':17, 'Jamaica':18, 'Vietnam':19, 'Mexico':20, 'Portugal':21, 'Ireland':22, 'France':23,                  'Dominican-Republic':24, 'Laos':25, 'Ecuador':26, 'Taiwan':27, 'Haiti':28, 'Columbia':29, 'Hungary':30, 'Guatemala':31, 'Nicaragua':32, 'Scotland':33, 'Thailand':34, 'Yugoslavia':35,                  'El-Salvador':36, 'Trinadad&Tobago':37, 'Peru':38, 'Hong':39, 'Holand-Netherlands':40
      }



def remove_field_num(a, i):
    # function to strip values
    names = list(a.dtype.names)
    new_names = names[:i] + names[i + 1:]
    b = a[new_names]
    return b

def remove_missing_values(data):
    temp = []
    for i in range(len(data)):
        for j in range(len(data[i])):
            # If a missing value '?' is encountered do not append the line to temp
            if data[i][j] == '?':
                break
            # Append the lines that do not contain '?'
            if j == (len(data[i]) - 1) and len(data[i]) == 15:
                temp.append(data[i])
    return temp

def create_labels(data):
    temp = [] 
    for i in range(len(data)):                                                                    #Iterate through the data
        j = len(data[i]) - 1                                                                      #Extract the labels
        if data[i][j] == '<=50K':
            temp.append(0)
        else:
            temp.append(1)
    return temp

def convert_to_int(data):

    my_lst = []
    for i in range(len(data)):
        lst = []
        for j in range(len(data[i])):
            key = data[i][j]
            if j in (1, 3, 5, 6, 7, 8, 9, 13, 14):
                lst.append(int(attributes[key]))
            else:
                lst.append(int(key))    
        my_lst.append(lst)

    temp = np.array(my_lst)
    return temp

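I think the problem is in the preprocessing. It is better to encode the categorical variables as one-hot vectors (vectors of zeros with a single 1 at the position corresponding to the value of that category) rather than as raw integers. Sklearn can help you with this. The pandas library also makes handling the categorical columns much more efficient.

The following shows how easily you can achieve this with the help of the pandas library. It also works very well alongside sklearn. This achieves an accuracy of 81.6% on a test set that is 20% of the entire data.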
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

import numpy as np
import pandas as pd


# Column names
cols = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
        'marital-status', 'occupation', 'relationship', 'race', 'sex',
        'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
        'target']

# numeric columns
numeric_cols = ['age', 'fnlwgt', 'education-num',
                'capital-gain', 'capital-loss', 'hours-per-week']

# Read the data into a pandas dataframe and assign the column names.
# header=None assumes the file has no header row (as in the raw UCI file).
df = pd.read_csv('adult.data.csv', header=None, names=cols)

# replace the target variable with 0 and 1 for <=50K and >50K
# (strip() removes the leading space left by the ", " separator)
df1 = df.copy()
df1['target'] = df1['target'].str.strip().map({'<=50K': 0, '>50K': 1})

# split the data into train and test
X_train, X_test, y_train, y_test = train_test_split(
    df1.drop('target', axis=1), df1['target'], test_size=0.2)


# numeric attributes

x_num_train = X_train[numeric_cols].to_numpy(dtype=float)
x_num_test = X_test[numeric_cols].to_numpy(dtype=float)

# scale to <0,1>, using only the training maxima

max_train = np.amax(x_num_train, 0)

x_num_train = x_num_train / max_train
x_num_test = x_num_test / max_train        # scale test by max_train

# labels or target attribute

y_train = y_train.astype(int)
y_test = y_test.astype(int)

# categorical attributes

cat_train = X_train.drop(numeric_cols, axis=1).fillna('NA')
cat_test = X_test.drop(numeric_cols, axis=1).fillna('NA')

# one dict per row, e.g. {'workclass': ' Private', 'sex': ' Male', ...}
x_cat_train = cat_train.to_dict(orient='records')
x_cat_test = cat_test.to_dict(orient='records')

# vectorize (encode as one hot)

vectorizer = DictVectorizer(sparse=False)
vec_x_cat_train = vectorizer.fit_transform(x_cat_train)
vec_x_cat_test = vectorizer.transform(x_cat_test)

# build the feature vector

x_train = np.hstack((x_num_train, vec_x_cat_train))
x_test = np.hstack((x_num_test, vec_x_cat_test))


# max_iter is raised so the default solver has room to converge
clf = LogisticRegression(max_iter=1000).fit(x_train, y_train)
pred = clf.predict(x_test)
print(classification_report(y_test, pred, digits=4))
print(accuracy_score(y_test, pred))

clf = DecisionTreeClassifier().fit(x_train, y_train)
pred = clf.predict(x_test)
print(classification_report(y_test, pred, digits=4))
print(accuracy_score(y_test, pred))

clf = GaussianNB().fit(x_train, y_train)
pred = clf.predict(x_test)
print(classification_report(y_test, pred, digits=4))
print(accuracy_score(y_test, pred))
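
To make the one-hot idea concrete, here is a minimal pure-Python sketch of what the encoding produces for a single categorical column. It is only an illustration, not how DictVectorizer is implemented; the helper name `one_hot` and the toy `race` list are made up for this example.

```python
def one_hot(values):
    """Return (categories, rows) with one 0/1 column per distinct value."""
    # Fix a column order: one column per distinct category
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}

    rows = []
    for value in values:
        row = [0] * len(categories)   # all zeros ...
        row[index[value]] = 1         # ... except the column of this value
        rows.append(row)
    return categories, rows


# Toy column (hypothetical sample values)
race = ['White', 'Black', 'Other', 'White']
categories, encoded = one_hot(race)

print(categories)  # ['Black', 'Other', 'White']
print(encoded)     # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

With integer codes such as 'White':0, 'Other':3, 'Black':4, a model that treats the feature as a number sees an ordering and distances between categories that do not exist in the data. With one column per category, no such ordering is implied.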
