Unable to get the correct prediction results after using BernoulliNB with sklearn in Python

Tags: python, pandas, scikit-learn, naive-bayes, multilabel-classification

I collected some code online for my research work and practice. I am working on the Denver crime dataset, which looks like this:

INCIDENT_ID               446399 non-null int64
OFFENSE_ID                446399 non-null int64
OFFENSE_CODE              446399 non-null int64
OFFENSE_CODE_EXTENSION    446399 non-null int64
OFFENSE_TYPE_ID           446399 non-null object
OFFENSE_CATEGORY_ID       446399 non-null object
FIRST_OCCURRENCE_DATE     446399 non-null object
LAST_OCCURRENCE_DATE      149714 non-null object
REPORTED_DATE             446399 non-null object
INCIDENT_ADDRESS          400668 non-null object
GEO_X                     442927 non-null float64
GEO_Y                     442927 non-null float64
GEO_LON                   442927 non-null float64
GEO_LAT                   442927 non-null float64
DISTRICT_ID               446399 non-null int64
PRECINCT_ID               446399 non-null int64
NEIGHBORHOOD_ID           446399 non-null object
IS_CRIME                  446399 non-null int64
IS_TRAFFIC                446399 non-null int64
dtypes: float64(4), int64(8), object(7)
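
For context, the summary above is the kind of output pandas' DataFrame.info() prints; a minimal sketch of how it can be reproduced, assuming the same crime.csv file used in the code below:

    import pandas as pd

    crime = pd.read_csv('crime.csv')  # file name taken from the code below
    crime.info()                      # prints the per-column non-null counts and dtypes shown above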

I applied the following code to it:

    # Imports used by the code below
    import numpy as np
    import pandas as pd
    from sklearn import preprocessing
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import BernoulliNB
    from sklearn.metrics import log_loss

    def normalize(data):  # feature normalization
        data = (data - data.mean()) / (data.max() - data.min())
        return data

    num2month = {1:'jan', 2:'feb', 3:'mar', 4:'apr', 5:'may', 6:'jun',
                 7:'jul', 8:'aug', 9:'sep', 10:'oct', 11:'nov', 12:'dec'}

    # Split the raw data, write the splits to disk, and read them back
    crime = pd.read_csv('crime.csv')
    train, test = train_test_split(crime, test_size=0.2)
    test.to_csv('test.csv')
    train.to_csv('train.csv')
    train = pd.read_csv('train.csv', parse_dates=['FIRST_OCCURRENCE_DATE'])
    test = pd.read_csv('test.csv', parse_dates=['FIRST_OCCURRENCE_DATE'])

    # For the training data: encode the target and derive date features
    le_crime = preprocessing.LabelEncoder()
    crime = le_crime.fit_transform(train.OFFENSE_CATEGORY_ID)

    train['FIRST_OCCURRENCE_DATE'] = pd.to_datetime(train['FIRST_OCCURRENCE_DATE'])
    train['FIRST_OCCURRENCE_DATE(DAYOFWEEK)'] = train['FIRST_OCCURRENCE_DATE'].dt.weekday_name  # dt.day_name() in newer pandas
    train['FIRST_OCCURRENCE_DATE(YEAR)'] = train['FIRST_OCCURRENCE_DATE'].dt.year
    train['FIRST_OCCURRENCE_DATE(MONTH)'] = train['FIRST_OCCURRENCE_DATE'].dt.month
    train['FIRST_OCCURRENCE_DATE(DAY)'] = train['FIRST_OCCURRENCE_DATE'].dt.day
    train['Year'] = train['FIRST_OCCURRENCE_DATE'].dt.year
    train['PdDistrict'] = train['OFFENSE_CATEGORY_ID']

    # Get binarized weekdays, districts, and hours
    train['Days'] = train['FIRST_OCCURRENCE_DATE(DAYOFWEEK)']
    days = pd.get_dummies(train.Days)
    district = pd.get_dummies(train.PdDistrict)
    month = pd.get_dummies(train.FIRST_OCCURRENCE_DATE.dt.month.map(num2month))
    hour = train.FIRST_OCCURRENCE_DATE.dt.hour
    submit = pd.read_csv('submit.csv')

    # Build the new feature array
    new_datatr = pd.concat([hour, month, days, district], axis=1)
    new_datatr['X'] = normalize(train.GEO_LON)
    new_datatr['Y'] = normalize(train.GEO_LAT)
    new_datatr['hour'] = normalize(train.FIRST_OCCURRENCE_DATE.dt.hour)
    new_datatr['crime'] = crime
    new_datatr['dark'] = train.FIRST_OCCURRENCE_DATE.dt.hour.apply(lambda x: 1 if (x >= 18 or x < 6) else 0)

    train_proc = new_datatr

    # ... and the same preprocessing is repeated for the test data set
    test_proc = new_datatr

    features = [1, 2,
                'jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec',
                'Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', 'Tuesday', 'Wednesday',
                #'X','Y'
                ]

    # Hold out part of the training data and evaluate with log loss
    training, validation = train_test_split(train_proc, train_size=.67)
    model = BernoulliNB()
    model.fit(training[features], training['crime'])
    predicted = np.array(model.predict_proba(validation[features]))
    log_loss(validation['crime'], predicted)

    # Refit on the full training set and predict class probabilities for the test set
    model = BernoulliNB()
    model.fit(train_proc[features], train_proc['crime'])
    predicted = model.predict_proba(test_proc[features])

    # One probability column per offense category goes into submit.csv
    le_crime = preprocessing.LabelEncoder()
    crime = le_crime.fit_transform(train.OFFENSE_CATEGORY_ID)
    result = pd.DataFrame(predicted, columns=le_crime.classes_)
    result.to_csv('submit.csv', index=True, index_label='Id')
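
For reference, `predict_proba` returns one class-membership probability per encoded offense category, so `result` has one column per class and each row of probabilities sums to 1. A quick check (a sketch reusing the `predicted` and `le_crime` names from above):

    # predicted has shape (n_test_rows, n_classes): one probability per offense category
    print(predicted.shape, len(le_crime.classes_))
    print(predicted[0].sum())  # each row of probabilities sums to 1
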
Finally, when I open the submit file, I find that every instance gets class-membership percentages for each category.

I would like to get a file that predicts the exact crime category id, rather than class-membership probabilities.
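
A minimal sketch of an output step that would give hard labels instead of probabilities, assuming the same fitted `model`, `features`, `test_proc`, and `le_crime` encoder as above (the output file name is only illustrative):

    labels = model.predict(test_proc[features])         # one encoded class id per row
    categories = le_crime.inverse_transform(labels)     # back to the OFFENSE_CATEGORY_ID strings
    result = pd.DataFrame({'OFFENSE_CATEGORY_ID': categories})
    result.to_csv('predicted_labels.csv', index=True, index_label='Id')  # hypothetical file name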

Could you add the first few rows of 'crime.csv'? Could you rephrase the output you want? Do you want to know which class Naive Bayes assigns to each observation?