Python and NumPy problem with machine-learning classification
Here is what I am trying to do. The original dataset has two columns: one holds a person's full name (e.g. Justine Davidson) and the other their ethnicity (e.g. English). I want to train a naive Bayes classifier to predict a person's ethnicity from name features. To extract features from a name, I break the full name into 3-character substrings (e.g. Justine Davidson => jus, ust, sti, etc.). Here is my code:
import pandas as pd
from pandas import DataFrame
import re
import numpy as np
import nltk
from nltk.classify import NaiveBayesClassifier as nbc
# Get csv file into data frame
data = pd.read_csv(r"C:\Users\KubiK\Desktop\OddNames_sampleData.csv")  # raw string so backslashes are not treated as escapes
frame = DataFrame(data)
frame.columns = ["name", "ethnicity"]
name = frame.name
ethnicity = frame.ethnicity
# Remove missing ethnicity data cases
index_missEthnic = frame.ethnicity.isnull()
index_missName = frame.name.isnull()
frame2 = frame.loc[~index_missEthnic, :]
frame3 = frame2.loc[~index_missName, :]
# Make all letters into lowercase
frame3.loc[:, "name"] = frame3["name"].str.lower()
frame3.loc[:, "ethnicity"] = frame3["ethnicity"].str.lower()
# Remove all non-alphabetical characters in Name
frame3.loc[:, "name"] = frame3["name"].str.replace(r'[^a-zA-Z\s\-]', '') # Retain space and hyphen
# Replace empty space as "#"
frame3.loc[:, "name"] = frame3["name"].str.replace(r'[\s]', '#')
# Find the longest name in the dataset
##frame3["name_length"] = frame3["name"].str.len()
##nameLength = frame3.name_length
##print nameLength.max() # Longest name has !!!40 characters!!! including spaces and hyphens
# Add "?" to fill spaces up to 43 characters
frame3["name_filled"] = frame3["name"].str.pad(side="right", width=43, fillchar="?")
# Split into three-character strings
for i in range(1, 41):
    substr = "substr" + str(i)
    frame3[substr] = frame3["name_filled"].str[i-1:i+2]
# Count number of letter characters
frame3["name_len"] = frame3["name"].map(lambda x : len(re.findall('[a-zA-Z]', x)))
# Count number of vowel letter
frame3["vowel_len"] = frame3["name"].map(lambda x : len(re.findall('[aeiouAEIOU]', x)))
# Count number of consonant letter
frame3["consonant_len"] = frame3["name"].map(lambda x : len(re.findall('[b-df-hj-np-tv-z]', x)))
# Count number of in-between-string (not any) spaces
frame3["space_len"] = frame3["name"].map(lambda x : len(re.findall('[#]', x)))
# Space-name ratio
frame3["SN_ratio"] = frame3["space_len"]/frame3["name_len"]
# Vowel-name ratio
frame3["VN_ratio"] = frame3["vowel_len"]/frame3["name_len"]
# Recategorize ethnicity
frame3["ethnicity2"] = ""
frame3["ethnicity2"][frame3["ethnicity"] == "chinese"] = "chinese"
frame3["ethnicity2"][frame3["ethnicity"] != "chinese"] = "non-chinese"
# Test outputs
##print frame3
# Run naive bayes
featuresets = [((substr1, substr2), ethnicity2) for index, (substr1, substr2, ethnicity2) in frame3.iterrows()]
train_set, test_set = featuresets[:400], featuresets[400:]
classifier = nbc.train(train_set)
# Predict
print classifier.classify(ethnic_features('Anderson Silva'))
Sample data:

Name                 Ethnicity
J-b'te Letourneau    Scotish
Jane Mc-earthar      French
Li Chen              Chinese
Amabil?? Bonneau     English
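As a minimal sketch of the substring extraction the question describes (using a made-up name, and plain string slicing instead of the pandas `.str` accessor): pad the lowercased, `#`-joined name to 43 characters, then take overlapping 3-character windows.

```python
# Pad the cleaned name to 43 characters, mirroring str.pad(..., fillchar="?")
# in the code above, then slice the same windows the loop builds.
name = "justine#davidson".ljust(43, "?")
substrings = [name[i - 1:i + 2] for i in range(1, 41)]
print(substrings[:4])  # ['jus', 'ust', 'sti', 'tin']
```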
When I run the program, I get two problems:
C:\Users\KubiK\Desktop\FamSeach_NameHandling4.py:57: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  frame3["space_len"] = frame3["name"].map(lambda x : len(re.findall('[#]', x)))
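The warning comes from `frame3` being a slice of `frame2`, so pandas cannot tell whether writing to it will reach the original frame. A minimal sketch (with made-up data) of one common fix, taking an explicit copy of the slice:

```python
import pandas as pd

# frame3 is built by boolean-slicing another frame; .copy() makes it an
# independent DataFrame, so later column assignments trigger no warning.
frame = pd.DataFrame({"name": ["li#chen", "jane#doe"],
                      "ethnicity": ["chinese", "english"]})
frame3 = frame.loc[frame["ethnicity"].notnull(), :].copy()
frame3["space_len"] = frame3["name"].map(lambda x: x.count("#"))
print(frame3["space_len"].tolist())  # [1, 1]
```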
Traceback (most recent call last):
  File "C:\Users\KubiK\Desktop\FamSeach_NameHandling4.py", line 71, in <module>
    featuresets = [((substr1, substr2), ethnicity2) for index, (substr1, substr2, ethnicity2) in frame3.iterrows()]
ValueError: too many values to unpack
The error occurs because frame3 has more than 3 columns. iterrows() is an iterator over (index, row) tuples, where row is a pd.Series whose index is the column names and whose values are all the values in that row. The frame3 DataFrame has many columns: name, ethnicity, name_filled, name_len, and so on. You are trying to unpack all of those values into just three variables, substr1, substr2 and ethnicity2, hence the "too many values to unpack" error. To fix it, select only the columns you need:
featuresets = [(substr1, ethnicity2) for index, (substr1, substr2, ethnicity2) in frame3[['substr1', 'substr2', 'ethnicity2']].iterrows()]
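A minimal sketch (with a made-up 4-column frame) reproducing the error and the fix: the unpack fails on the full frame but succeeds once the frame is narrowed to three columns.

```python
import pandas as pd

# 4 columns, as in frame3, which has more columns than the 3 being unpacked.
df = pd.DataFrame({"substr1": ["jus"], "substr2": ["ust"],
                   "ethnicity2": ["non-chinese"], "name_len": [15]})
try:
    [(s1, e2) for idx, (s1, s2, e2) in df.iterrows()]
except ValueError as err:
    print(err)  # too many values to unpack

# Selecting exactly three columns makes the row unpack cleanly.
pairs = [(s1, e2) for idx, (s1, s2, e2)
         in df[["substr1", "substr2", "ethnicity2"]].iterrows()]
print(pairs)  # [('jus', 'non-chinese')]
```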
Comments:
- You should post a sample of your initial DataFrame so we have some data to work with. Regarding your first problem, try frame3.loc[:, "space_len"] = frame3["name"].map(lambda x: len(re.findall('[#]', x)))
- Thanks for the suggestion; the sample data has been added above.
- Thanks. So what do you want featuresets to be a list of?
- Sorry for being unclear. I want the feature set to be substr1 through substr40, since each substr is a 3-letter substring of the full name. In the example above I only included substr1 and substr2, but still got the error.
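Since the asker wants all 40 substrings as features, here is a hedged sketch (with a made-up 2-substring frame standing in for the real 40) of building the (feature_dict, label) pairs that nltk's NaiveBayesClassifier.train, aliased nbc in the question, expects:

```python
import pandas as pd

frame3 = pd.DataFrame({"substr1": ["li#", "jus"],
                       "substr2": ["i#c", "ust"],
                       "ethnicity2": ["chinese", "non-chinese"]})
n_substr = 2  # would be 40 with the real padded names

# Each row becomes ({"substr1": ..., ..., "substrN": ...}, label).
featuresets = [({"substr%d" % i: row["substr%d" % i]
                 for i in range(1, n_substr + 1)}, row["ethnicity2"])
               for _, row in frame3.iterrows()]
print(featuresets[0])  # ({'substr1': 'li#', 'substr2': 'i#c'}, 'chinese')
```

These pairs can then be fed directly to nbc.train(featuresets).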