
Python and NumPy problem with machine-learning classification


Here is what I am trying to do. The original dataset has two columns: one holds a person's full name (e.g. Justine Davidson) and the other holds their ethnicity (e.g. English). I want to train a naive Bayes classifier to predict a person's ethnicity from name features. To extract the name features, I break the full name into 3-character substrings (e.g. Justine Davidson => jus, ust, sti, etc.). Here is my code:

import pandas as pd
from pandas import DataFrame
import re
import numpy as np
import nltk
from nltk.classify import NaiveBayesClassifier as nbc

# Get csv file into data frame
data = pd.read_csv("C:\Users\KubiK\Desktop\OddNames_sampleData.csv")
frame = DataFrame(data)
frame.columns = ["name", "ethnicity"]
name = frame.name
ethnicity = frame.ethnicity

# Remove missing ethnicity data cases
index_missEthnic = frame.ethnicity.isnull()
index_missName = frame.name.isnull()
frame2 = frame.loc[~index_missEthnic, :]
frame3 = frame2.loc[~index_missName, :]

# Make all letters into lowercase
frame3.loc[:, "name"] = frame3["name"].str.lower()
frame3.loc[:, "ethnicity"] = frame3["ethnicity"].str.lower()

# Remove all non-alphabetical characters in Name
frame3.loc[:, "name"] = frame3["name"].str.replace(r'[^a-zA-Z\s\-]', '') # Retain space and hyphen

# Replace empty space as "#"
frame3.loc[:, "name"] = frame3["name"].str.replace('[\s]', '#')

# Find the longest name in the dataset
##frame3["name_length"] = frame3["name"].str.len()
##nameLength = frame3.name_length
##print nameLength.max() # Longest name has !!!40 characters!!! including spaces and hyphens

# Add "?" to fill spaces up to 43 characters
frame3["name_filled"] = frame3["name"].str.pad(side="right", width=43, fillchar="?")

# Split into three-character strings
for i in range(1, 41):
    substr = "substr" + str(i)
    frame3[substr] = frame3["name_filled"].str[i-1:i+2]

# Count number of letter characters
frame3["name_len"] = frame3["name"].map(lambda x : len(re.findall('[a-zA-Z]', x)))

# Count number of vowel letter
frame3["vowel_len"] = frame3["name"].map(lambda x : len(re.findall('[aeiouAEIOU]', x)))

# Count number of consonant letter
frame3["consonant_len"] = frame3["name"].map(lambda x : len(re.findall('[b-df-hj-np-tv-z]', x)))

# Count number of in-between-string (not any) spaces
frame3["space_len"] = frame3["name"].map(lambda x : len(re.findall('[#]', x)))

# Space-name ratio
frame3["SN_ratio"] = frame3["space_len"]/frame3["name_len"]

# Vowel-name ratio
frame3["VN_ratio"] = frame3["vowel_len"]/frame3["name_len"]

# Recategorize ethnicity
frame3["ethnicity2"] = ""
frame3["ethnicity2"][frame3["ethnicity"] == "chinese"] = "chinese"
frame3["ethnicity2"][frame3["ethnicity"] != "chinese"] = "non-chinese"

# Test outputs
##print frame3

# Run naive bayes
featuresets = [((substr1, substr2), ethnicity2) for index, (substr1, substr2, ethnicity2) in frame3.iterrows()]
train_set, test_set = featuresets[:400], featuresets[400:]
classifier = nbc.train(train_set)

# Predict
print classifier.classify(ethnic_features('Anderson Silva'))

Sample data:

Name                 Ethnicity
J-b'te Letourneau    Scotish
Jane Mc-earthar      French
Li Chen              Chinese
Amabil?? Bonneau     English
When I run the program, it has two problems:

  • This is a non-fatal issue that shows up several times throughout the code, but the script keeps running without terminating:

    C:\Users\KubiK\Desktop\FamSeach_NameHandling4.py:57: SettingWithCopyWarning:
    A value is trying to be set on a copy of a slice from a DataFrame.
    Try using .loc[row_indexer,col_indexer] = value instead

    See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
      frame3["space_len"] = frame3["name"].map(lambda x : len(re.findall('[#]', x)))
    
  • This is a fatal issue (it terminates the program):

    Traceback (most recent call last):
      File "C:\Users\KubiK\Desktop\FamSeach_NameHandling4.py", line 71, in <module>
        featuresets = [((substr1, substr2), ethnicity2) for index, (substr1, substr2, ethnicity2) in frame3.iterrows()]
    ValueError: too many values to unpack
    

  • You get the error because frame3 has more than 3 columns.

    iterrows() is an iterator over (index, row) tuples, where row is a pd.Series whose index is the column names and whose values are all of the values in that row.

    The frame3 DataFrame has many columns: name, ethnicity, name_filled, name_len, and so on. You are trying to unpack all of those values into just three variables (substr1, substr2 and ethnicity2), hence the "too many values to unpack" error. To fix it, select only the columns you need:

    featuresets = [(substr1, ethnicity2) for index, (substr1, substr2, ethnicity2) in frame3[['substr1', 'substr2', 'ethnicity2']].iterrows()]
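    The one-liner above fixes the "too many values to unpack" error, but note that NLTK's NaiveBayesClassifier.train() expects each training item to be a (feature-dict, label) pair, so the feature part should be a dict rather than a bare string. A minimal sketch of building such dicts from all 40 substring columns, assuming frame3 already contains substr1 ... substr40 and ethnicity2 as created by the code above:

    # NLTK's NaiveBayesClassifier.train() wants (feature_dict, label) pairs,
    # so collect each row's substrings into a dict keyed by column name.
    # Assumes frame3 already holds the columns substr1 ... substr40 and ethnicity2.
    substr_cols = ["substr" + str(i) for i in range(1, 41)]

    featuresets = [
        (dict(zip(substr_cols, row[substr_cols])), row["ethnicity2"])
        for _, row in frame3[substr_cols + ["ethnicity2"]].iterrows()
    ]

    train_set, test_set = featuresets[:400], featuresets[400:]
    classifier = nbc.train(train_set)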
    

    Comments:

    • You should post a sample of your initial DataFrame so we have some data to work with. As for your first problem, try frame3.loc[:, "space_len"] = frame3["name"].map(lambda x: len(re.findall('[#]', x)))
    • Thanks for the suggestion, the sample data has been added.
    • So you want featureset to be a list of what?
    • Sorry for being unclear. I want the feature set to be substr1 through substr40, since each substr is a 3-letter substring of the full name. In the example above I only included substr1 and substr2, but still got the error.
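    One way to avoid the SettingWithCopyWarning the first comment refers to is to take an explicit copy after filtering, so that later column assignments write to a DataFrame that owns its data rather than to a view. A minimal sketch, using a hypothetical two-row frame in place of the CSV:

    import re
    import pandas as pd

    # Hypothetical stand-in for the CSV data
    frame = pd.DataFrame({"name": ["Li Chen", "Jane Mc-earthar", None],
                          "ethnicity": ["Chinese", "French", None]})

    # Drop rows with a missing name or ethnicity, then take an explicit copy
    # so frame3 owns its data instead of being a view of frame.
    frame3 = frame.dropna(subset=["name", "ethnicity"]).copy()

    frame3["name"] = frame3["name"].str.lower().str.replace(" ", "#")

    # Plain column assignment no longer triggers SettingWithCopyWarning.
    frame3["space_len"] = frame3["name"].map(lambda x: len(re.findall("#", x)))
    print(frame3)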
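    For the prediction step, classifier.classify(ethnic_features('Anderson Silva')) calls an ethnic_features() helper that is never defined in the question. It would have to repeat the same preprocessing used for training on a single raw name. A hypothetical minimal sketch that produces the same substr1 ... substr40 feature dict as the snippet above:

    import re

    def ethnic_features(full_name, width=43, n_substr=40):
        # Hypothetical helper: mirror the training preprocessing for one name.
        name = full_name.lower()
        name = re.sub(r"[^a-z\s\-]", "", name)   # keep letters, spaces, hyphens
        name = re.sub(r"\s", "#", name)          # mark word boundaries with '#'
        name = name.ljust(width, "?")            # pad to the fixed width
        # One 3-character window per feature slot, keyed like the training columns.
        return {"substr" + str(i): name[i - 1:i + 2] for i in range(1, n_substr + 1)}

    print(classifier.classify(ethnic_features("Anderson Silva")))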