Python: extracting features from training data


I have training data like the following, where all the information is in a single column. The dataset has more than 300,000 rows.

id         features                                                     label

1          name=John Matthew;age=25;1.=Post Graduate;2.=Football Player;    1
2          name=Mark clark;age=21;1.=Under Graduate;Interest=Video Games;   1
3          name=David;age=12;1:=High School;2:=Cricketer;native=america;    2
4          name=George;age=11;1:=High School;2:=Carpenter;married=yes       2
.
.

300000     name=Kevin;age=16;1:=High School;2:=Driver;Smoker=No             3
Now I need to transform this training data into the following form:

 id   name          age   1               2                Interest      married   Smoker
 1    John Matthew   25   Post Graduate   Football Player   Nan           Nan      Nan
 2    Mark clark     21   Under Graduate  Nan               Video Games   Nan      Nan
 .
 .
Is there an efficient way to do this? I tried the code below, but it took 3 hours to finish.

#Getting the proper features from the features column

    cols = {}
    for choices in set_label:
        collection_list = []
        array = train["features"][train["label"] == choices].values
        for i in range(1,len(array)):
            var_split = array[i].split(";")
            try :
                d = (dict(s.split('=') for s in var_split))
                for x in d.keys():
                    collection_list.append(x)
            except ValueError:
                Error = ValueError
        count = Counter(collection_list)
        for k , v in count.most_common(5):
            key = k.replace(":","").replace(" ","_").lower()
            cols[key] = v

    columns_add = list(cols.keys())
    train = train.reindex(columns = np.append( train.columns.values, columns_add))
    print (train.columns)
    print (train.shape)

#Adding the values for the newly created problem

    for row in train.itertuples():
        dummy_dic = {}
        new_dict={}
        value = train.loc[row.Index, 'features']
        v_split = value.split(";")
        try :
            dummy_dict = (dict(s.split('=') for s in v_split))
            for k, v in dummy_dict.items():
                new_key = k.replace(":","").replace(" ","_").lower()
                new_dict[new_key] = v
        except ValueError:
            Error = ValueError
        for k,v in new_dict.items():
            if k in train.columns:
                train.loc[row.Index, k] = v

Are there any useful functions for doing this kind of feature extraction efficiently?

Assume your data looks like this:

import pandas as pd

features = ["name=John Matthew;age=25;1:=Post Graduate;2:=Football Player;",
            "name=Mark clark;age=21;1:=Under Graduate;2:=Football Player;",
            "name=David;age=12;1:=High School;2:=Cricketer;",
            "name=George;age=11;1:=High School;2:=Carpenter;",
            "name=Kevin;age=16;1:=High School;2:=Driver;"]
df = pd.DataFrame({'features': features})
I would start by replacing all the field markers (name=, ;age=, ;1:=, ;2:=) with ; and then splitting on ;, using this function:

def replace_feature(x):
    for r in (("name=", ";"), (";age=", ";"), (';1:=', ';'), (';2:=', ";")):
        x = x.replace(*r)
    x = x.split(';')
    return x
df = df.assign(features= df.features.apply(replace_feature))
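After applying this function to df, every value in the features column is a list, and each attribute can be read by its position in that list. Then use four small functions to get the name, age, grade and job from each row. Note: a single function could do this more neatly.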

def get_name(df):
    return df['features'][1]
def get_age(df):
    return df['features'][2]
def get_grade(df):
    return df['features'][3]
def get_job(df):
    return df['features'][4]
Finally, apply these functions to the DataFrame:

df = df.assign(name = df.apply(get_name, axis=1),
         age = df.apply(get_age, axis=1),
         grade = df.apply(get_grade, axis=1),
         job = df.apply(get_job, axis=1))

Hopefully this will be fast.
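
As the note in this answer says, the four getters can be folded into a single step. What follows is only a minimal sketch, assuming every row holds exactly the same four fields in the same order, as in the sample data of this answer. The helper name split_values is made up for illustration.

```python
import pandas as pd

features = [
    "name=John Matthew;age=25;1:=Post Graduate;2:=Football Player;",
    "name=David;age=12;1:=High School;2:=Cricketer;",
]
df = pd.DataFrame({"features": features})


def split_values(text):
    # Hypothetical helper: keep only the part after '=' in each
    # non-empty 'key=value' chunk, in the order the chunks appear.
    return [chunk.split("=", 1)[1] for chunk in text.split(";") if "=" in chunk]


# Build all four columns in one pass instead of four row-wise apply calls.
# Assumption: every row yields exactly four values.
parsed = pd.DataFrame(
    df["features"].map(split_values).tolist(),
    columns=["name", "age", "grade", "job"],
    index=df.index,
)
df = df.join(parsed)
```

This relies on position only, so it breaks as soon as a row has a different number of fields. For rows with varying keys, a key-based approach is needed.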

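As far as I understand your code, the poor performance comes from filling the DataFrame element by element. It is better to create the whole DataFrame in one go from a list of dictionaries.

Let's recreate the input DataFrame: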

from io import StringIO
import pandas as pd

data = StringIO("""id         features                                                     label

1          name=John Matthew;age=25;1.=Post Graduate;2.=Football Player;    1
2          name=Mark clark;age=21;1.=Under Graduate;2.=Football Player;     1
3          name=David;age=12;1:=High School;2:=Cricketer;                   2
4          name=George;age=11;1:=High School;2:=Carpenter;                  2""")
# Columns are separated by runs of three or more spaces.
df = pd.read_table(data, sep=r'\s{3,}', engine='python')
We can check it:

print(df)
   id                                           features  label
0   1  name=John Matthew;age=25;1.=Post Graduate;2.=F...      1
1   2  name=Mark clark;age=21;1.=Under Graduate;2.=Fo...      1
2   3     name=David;age=12;1:=High School;2:=Cricketer;      2
3   4    name=George;age=11;1:=High School;2:=Carpenter;      2
Now we can build the required list of dictionaries with the following code:

feat=[]
for line in df['features']:
    line=line.replace(':','.')
    lsp=line.split(';')[:-1]
    feat.append(dict([elt.split('=') for elt in lsp]))
And the resulting DataFrame:

print(pd.DataFrame(feat))
               1.               2. age          name
0   Post Graduate  Football Player  25  John Matthew
1  Under Graduate  Football Player  21    Mark clark
2     High School        Cricketer  12         David
3     High School        Carpenter  11        George

(The column order may differ depending on the pandas version.)
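
The same list-of-dictionaries idea can be made a little more defensive for data like that in the question, where some rows have extra keys, some lack the trailing semicolon, and the numbered keys appear both as 1. and 1:. This is a rough sketch under those assumptions, not a tested solution for the full 300,000-row file. The function name parse_features is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 3, 4],
    "features": [
        "name=John Matthew;age=25;1.=Post Graduate;2.=Football Player;",
        "name=David;age=12;1:=High School;2:=Cricketer;native=america;",
        "name=George;age=11;1:=High School;2:=Carpenter;married=yes",
    ],
    "label": [1, 2, 2],
})


def parse_features(text):
    """Turn 'k1=v1;k2=v2;...' into a dict, skipping chunks without '='."""
    record = {}
    for chunk in text.split(";"):
        key, sep, value = chunk.partition("=")
        if sep:  # ignore empty or malformed chunks
            # Normalise '1.' and '1:' to '1' and lower-case the key.
            record[key.strip().rstrip(".:").lower()] = value.strip()
    return record


# One dict per row; keys missing from a row become NaN in the result.
wide = pd.DataFrame([parse_features(t) for t in df["features"]], index=df.index)
result = df[["id", "label"]].join(wide)
```

Because the DataFrame is built once from the complete list, no per-cell assignment with .loc is needed.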
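Create two DataFrames to match your conditions. In the first one, every data point has the same features. The second one is a modification of the first that introduces different features for some data points: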

import pandas as pd
import numpy as np
import random
import time
import itertools


# Create a DataFrame where all the keys for each datapoint in the "features" column are the same.
num = 300000


NAMES = ['John', 'Mark', 'David', 'George', 'Kevin']
AGES = [25, 21, 12, 11, 16]
FEATURES1 = ['Post Graduate', 'Under Graduate', 'High School']
FEATURES2 = ['Football Player', 'Cricketer', 'Carpenter', 'Driver']
LABELS = [1, 2, 3]



# Build the "features" and "label" columns in a single constructor call.
df = pd.DataFrame({
    "features": ["name={0};age={1};feature1={2};feature2={3}".format(
                     NAMES[np.random.randint(0, len(NAMES))],
                     AGES[np.random.randint(0, len(AGES))],
                     FEATURES1[np.random.randint(0, len(FEATURES1))],
                     FEATURES2[np.random.randint(0, len(FEATURES2))])
                 for i in range(num)],
    "label": [LABELS[np.random.randint(0, len(LABELS))] for i in range(num)],
})

print(df.head(20))



# Create a modified sample DataFrame from the previous one, where not all the keys are the same for each data point. 


mod_df = df.copy()  # copy, so that the original df is left unchanged
random_positions1 = random.sample(range(10), 5)
random_positions2 = random.sample(range(11, 20), 5)

INTERESTS = ['Basketball', 'Golf', 'Rugby']
SMOKING = ['Yes', 'No']

# Overwrite five rows with an "interest" key instead of feature1/feature2.
mod_df.loc[random_positions1, 'features'] = [
    "name={0};age={1};interest={2}".format(
        NAMES[np.random.randint(0, len(NAMES))],
        AGES[np.random.randint(0, len(AGES))],
        INTERESTS[np.random.randint(0, len(INTERESTS))])
    for i in range(len(random_positions1))]

# Overwrite five other rows with a "smoking" key.
mod_df.loc[random_positions2, 'features'] = [
    "name={0};age={1};smoking={2}".format(
        NAMES[np.random.randint(0, len(NAMES))],
        AGES[np.random.randint(0, len(AGES))],
        SMOKING[np.random.randint(0, len(SMOKING))])
    for i in range(len(random_positions2))]


print(mod_df.head(20))
Assume the original data is stored in a DataFrame named df.

Solution 1 (every data point has the same features). Edit: the one thing you need to do is edit the list accordingly.
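
For the case where every row has the same keys in the same order, one possible approach is to split the column with pandas string methods. This is a minimal sketch and not necessarily the code originally posted for this solution. COLUMNS is a hypothetical name for the list of expected keys, which would have to be edited to match the real data.

```python
import pandas as pd

df = pd.DataFrame({
    "features": [
        "name=John;age=25;feature1=Post Graduate;feature2=Driver",
        "name=Mark;age=21;feature1=High School;feature2=Carpenter",
    ],
    "label": [1, 2],
})

# Hypothetical list of the keys, in the order they appear in every row.
COLUMNS = ["name", "age", "feature1", "feature2"]

# Split once on ';' into one column per 'key=value' chunk.
parts = df["features"].str.split(";", expand=True)
parts.columns = COLUMNS

# Drop the 'key=' prefix from each column with a vectorised string split.
for col in COLUMNS:
    parts[col] = parts[col].str.split("=", n=1).str[1]

result = df[["label"]].join(parts)
```

Assigning parts.columns fails if the number of chunks differs from len(COLUMNS), which acts as a cheap check that the same-keys assumption holds.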


Solution 2 (the features of each data point may be the same or different).
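
When the keys vary from row to row, one option is to extract every key=value pair with a regular expression and pivot the keys into columns. The sketch below is an illustration under the assumption that no key is repeated within a row, and it is not necessarily the code originally posted for this solution. Whether it beats the dictionary approach on 300,000 rows would have to be measured.

```python
import pandas as pd

df = pd.DataFrame({
    "features": [
        "name=John;age=25;feature1=Post Graduate;feature2=Driver",
        "name=Mark;age=21;interest=Golf",
        "name=David;age=12;smoking=No",
    ],
    "label": [1, 2, 3],
})

# One row per 'key=value' pair; the index is (original row, match number).
pairs = df["features"].str.extractall(r"(?P<key>[^;=]+)=(?P<value>[^;]*)")

# Pivot keys into columns; keys that a row does not have become NaN.
# Assumption: a key occurs at most once per row, otherwise unstack raises.
wide = (
    pairs.droplevel("match")
         .set_index("key", append=True)["value"]
         .unstack("key")
         .reindex(df.index)
)
result = df[["label"]].join(wide)
```

The final reindex keeps rows that contain no key=value pair at all, so result always has as many rows as df.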
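The data has 300,000 rows, so how can we do the first step by copying the whole content into the data variable?

I guess your data is in a file, so you don't need the first step. Pass the file name instead of data in df = pd.read_table(filename, ...).

Thanks for your solution. But the rows are not limited to name, age, 1 and 2. Later rows contain many other features, and those also need to be converted into columns.

Yes, Espoir. Since there are 300,000 rows, we have more than 300 features.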