Python: converting numeric data to feature vectors


I want to normalize numeric data into feature vectors with the following code:

import numpy as np
import pandas as pd
import csv

def clearRegister():
    clear_register = []
    zero = 0
    for i in range(21):
        clear_register.append(0)
    return clear_register

def header():
    clear_register = []
    name = 'c'
    entry = 1
    for i in range(21):
        clear_register.append(name+str(entry))
        entry += 1
    return clear_register

def convert(filename):
    clear_dataset = []
    clear_dataset.append(header())
    with open(filename) as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            clear_register = clearRegister()
            clear_register[(int(row["blue1"])-1)] = 1
            clear_register[(int(row["blue2"])-1)] = 1
            clear_register[(int(row["blue3"])-1)] = 1
            clear_register[(int(row["red1"])+9)] = 1
            clear_register[(int(row["red2"])+9)] = 1
            clear_register[(int(row["red3"])+9)] = 1
Here is my csv input file:

row blue1 blue2 blue3 red1 red2 red3 lable
0 1 5 4 6 2 8 0
1 2 3 1 9 4 5 1
. . . . . . . .
3000 5 7 4 3 8 10 1
I expect output like the following (blue maps to c1-c10, red maps to c11-c20):

c11-c20 are the "red" counterparts of c1-c10, and the values are all unique: if c1, c5, and c10 are 1, then c11, c15, and c20 cannot be 1.

I tried calling it like this:

df = convert("dataset.csv")
df1 = pd.DataFrame(df)
print(df1)
I got this result:

Empty DataFrame
Columns: []
Index: []

Is there something wrong with the code, or is something missing?
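As posted, convert() fills clear_register for each row but never appends it to clear_dataset, and the function has no return statement. It therefore returns None, and pd.DataFrame(None) prints as an empty DataFrame. Below is a minimal sketch of a completed version, not necessarily what the asker intended. It assumes the values run from 1 to 10 and that the file is comma-separated, and it replaces the 21-slot register with 20 indicator slots plus the lable column. The N_VALUES constant and the in-memory sample are hypothetical.

```python
import csv
import io

import pandas as pd

N_VALUES = 10  # assumption: blue and red values each range over 1..10


def convert(csvfile):
    """Return one list per input row: 20 indicator values, then the label."""
    rows = []
    for row in csv.DictReader(csvfile):
        register = [0] * (2 * N_VALUES)
        for col in ("blue1", "blue2", "blue3"):
            register[int(row[col]) - 1] = 1              # slots for c1..c10
        for col in ("red1", "red2", "red3"):
            register[int(row[col]) - 1 + N_VALUES] = 1   # slots for c11..c20
        register.append(int(row["lable"]))
        rows.append(register)   # the posted code never stored the row
    return rows                 # ...and never returned the dataset


# Hypothetical two-row sample in comma-separated form.
sample = io.StringIO(
    "row,blue1,blue2,blue3,red1,red2,red3,lable\n"
    "0,1,5,4,6,2,8,0\n"
    "1,2,3,1,9,4,5,1\n"
)
columns = ["c" + str(i) for i in range(1, 2 * N_VALUES + 1)] + ["lable"]
df1 = pd.DataFrame(convert(sample), columns=columns)
print(df1)
```

With a real file, the same function can be called as convert(open("dataset.csv", newline="")), provided the file uses commas as delimiters.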

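Consider a pandas solution that uses loc to create the new c1-c20 columns iteratively, instead of manipulating the csv by hand. It is demonstrated below with random data: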

Data (random data for readers of the question only; the OP would load the actual csv instead)

import numpy as np
import pandas as pd

pd.set_option('display.width', 1000)
pd.set_option('display.max_columns', 25)

np.random.seed(5005)
df = pd.DataFrame({'row': range(3000),
                   'blue1': [np.random.randint(11) for _ in range(3000)],
                   'blue2': [np.random.randint(11) for _ in range(3000)],
                   'blue3': [np.random.randint(11) for _ in range(3000)],
                   'red1': [np.random.randint(11) for _ in range(3000)],
                   'red2': [np.random.randint(11) for _ in range(3000)],
                   'red3': [np.random.randint(11) for _ in range(3000)],
                   'lable': [0,1]*1500})

print(df.head())
# Output as posted; newer pandas versions keep the dict's insertion order for columns
#    blue1  blue2  blue3  lable  red1  red2  red3  row
# 0      4      5      5      0    10     0     8    0
# 1      7      2      2      1     3     8     8    1
# 2      2      4      0      0     8     1     7    2
# 3      4      5      8      1     9     8     1    3
# 4      0      1      5      0     5     6     9    4

Process

# FLAG c1-c10 WHERE ANY BLUE COLUMN EQUALS i, AND c11-c20 WHERE ANY RED COLUMN EQUALS i
for i in range(1, 11):
    df.loc[(df['blue1'] == i) | (df['blue2'] == i) | (df['blue3'] == i), 'c'+str(i)] = 1
    df.loc[(df['red1'] == i) | (df['red2'] == i) | (df['red3'] == i), 'c'+str(i+10)] = 1

# SELECT AND RE-ORDER COLUMNS, FILL IN NANs, CONVERT TO INT TYPE
df = df[['c'+str(i) for i in range(1,21)]+['lable']].fillna(0).astype(int)

print(df.head())
#    c1  c2  c3  c4  c5  c6  c7  c8  c9  c10  c11  c12  c13  c14  c15  c16  c17  c18  c19  c20  lable
# 0   0   0   0   1   1   0   0   0   0    0    0    0    0    0    0    0    0    1    0    1      0
# 1   0   1   0   0   0   0   1   0   0    0    0    0    1    0    0    0    0    1    0    0      1
# 2   0   1   0   1   0   0   0   0   0    0    1    0    0    0    0    0    1    1    0    0      0
# 3   0   0   0   1   1   0   0   1   0    0    1    0    0    0    0    0    0    1    1    0      1
# 4   1   0   0   0   1   0   0   0   0    0    0    0    0    0    1    1    0    0    1    0      0

Is it possible for blue1 = blue2 = blue3 (and likewise for red), so that what you actually need is a count? Or is the answer always binary?

Always binary. I forgot to mention that both datasets are non-repeating (unique), so if c1 has the value 1, then c11, the red counterpart of c1, will not have the same value.

Although the example given is 21x3000, my actual dataset conversion has 277 columns and 39500 rows, which makes execution very slow... Anyway, I really appreciate your help. Thank you very much!
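For the larger dataset mentioned in the comments (277 columns, 39500 rows), one option is to build the indicator columns with NumPy integer-array indexing rather than one loc assignment per value. The following is a sketch under assumptions, not a benchmarked replacement: the helper name indicator_block and the tiny sample frame are hypothetical, and values outside 1..n_values are silently ignored.

```python
import numpy as np
import pandas as pd


def indicator_block(df, cols, n_values):
    """0/1 matrix where out[r, v-1] == 1 if value v occurs in row r of cols."""
    values = df[cols].to_numpy()
    out = np.zeros((len(df), n_values), dtype=int)
    # Row index for every cell, in the same row-major order as values.ravel()
    row_idx = np.repeat(np.arange(len(df)), values.shape[1])
    flat = values.ravel()
    keep = (flat >= 1) & (flat <= n_values)   # skip out-of-range values such as 0
    out[row_idx[keep], flat[keep] - 1] = 1
    return out


# Small hypothetical frame using the column names from the question.
df = pd.DataFrame({
    "blue1": [1, 2], "blue2": [5, 3], "blue3": [4, 1],
    "red1": [6, 9], "red2": [2, 4], "red3": [8, 5],
    "lable": [0, 1],
})

blue = indicator_block(df, ["blue1", "blue2", "blue3"], 10)   # c1..c10
red = indicator_block(df, ["red1", "red2", "red3"], 10)       # c11..c20

features = pd.DataFrame(
    np.hstack([blue, red]),
    columns=["c" + str(i) for i in range(1, 21)],
    index=df.index,
)
features["lable"] = df["lable"]
print(features)
```

Each group of source columns is handled in a single indexed assignment, so the number of Python-level loop iterations no longer grows with the number of distinct values. Whether this is actually faster on the real data would need to be measured.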