Python: encoding multiple columns


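If a DataFrame has two or more columns holding numeric and text values, plus a label/target column, and I want to apply a model such as an SVM, how can I use only the columns I am interested in? For example:

Data                                    Num     Label/Target  No_Sense
What happens here?                      group1  1             Migrate
Customer Management                     group2  0             Change Stage
Life Cycle Stages                       group1  1             Restructure
Drop-down allows to select status type  group3  1             Restructure Status

and so on.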

The approach I took is:

1. Encode the "Num" column:

one_hot = pd.get_dummies(df['Num'])
df = df.drop('Num',axis = 1)
df = df.join(one_hot)

2. Encode the "Data" column:

def bag_words(df):
        
    df = basic_preprocessing(df)
    
    count_vectorizer = CountVectorizer()
    count_vectorizer.fit(df['Data'])
    
    list_corpus = df["Data"].tolist()
    list_labels = df["Label/Target"].tolist()
        
    X = count_vectorizer.transform(list_corpus)
        
    return X, list_labels

Then I apply bag_words to the dataset:

X, y = bag_words(df)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)
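
Am I missing anything in these steps? How can I select only the "Data" and "Num" features in my training dataset? (I think "No_Sense" is not very relevant to my purpose.)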

Edit: I have tried

def bag_words(df):
            
    df = basic_preprocessing(df)
        
    count_vectorizer = CountVectorizer()
    count_vectorizer.fit(df['Data'])
        
    list_corpus = df["Data"].tolist()+ df["group1"].tolist()+df["group2"].tolist()+df["group3"].tolist() #<----
    list_labels = df["Label/Target"].tolist()
            
    X = count_vectorizer.transform(list_corpus)
            
    return X, list_labels

I hope this helps:

import pandas as pd
import numpy as np
import re

from sklearn.feature_extraction.text import CountVectorizer

# This part only recreates your df from the string you posted.
# Remove it when working with your real data!

data="""
Data                        Num     Label/Target   No_Sense
What happens here?         group1         1          Migrate
Customer Management        group2         0          Change Stage
Life Cycle Stages          group1         1          Restructure
Drop-down allows to select status type  group3   1   Restructure Status
"""
# Split the block into lines; the first line holds the column names
lines = data.strip().split('\n')
# Fields are separated by two or more spaces
df = pd.DataFrame(np.array( [ re.split(r'\s{2,}', line) for line in lines[1:] ] ), 
                columns = lines[0].split())


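# What you want starts here:
# dtype=int keeps the dummies as 0/1 integers (recent pandas versions default to booleans)
one_hot = pd.get_dummies(df['Num'], dtype=int)
df = df.drop('Num',axis = 1)
df = df.join(one_hot)

# At this point you have 3 new features for the 'Num' variable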

def bag_words(df):

    count_vectorizer = CountVectorizer()
    count_vectorizer.fit(df['Data'])
    matrix = count_vectorizer.transform(df['Data'])

    # `encoded_df` has 15 new features, the result of fitting
    # the CountVectorizer to the 'Data' variable
    data_cols = ["Data"+str(i) for i in range(matrix.shape[1])]
    # Reuse df's index so the join below aligns row by row
    encoded_df = pd.DataFrame(data=matrix.toarray(), columns=data_cols, index=df.index)

    # Add them to the dataframe (join returns a new frame, so assign the result)
    df = df.join(encoded_df)

    # Get the numpy arrays that you can use in training
    X = df.loc[:, data_cols + ["group1", "group2", "group3"]].to_numpy()
    y = df.loc[:, ["Label/Target"]].to_numpy()

    return X, y

X, y = bag_words(df)
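
As an alternative to joining the encoded columns by hand, scikit-learn's ColumnTransformer can encode "Data" and "Num" in one step and drop everything else. The following is only a minimal sketch, not the code from the question: the toy DataFrame is rebuilt from the sample rows above, and LinearSVC is an assumption standing in for "an SVM-like model".

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import LinearSVC

# Toy frame mirroring the columns in the question
df = pd.DataFrame({
    "Data": ["What happens here?", "Customer Management", "Life Cycle Stages",
             "Drop-down allows to select status type"],
    "Num": ["group1", "group2", "group1", "group3"],
    "Label/Target": [1, 0, 1, 1],
    "No_Sense": ["Migrate", "Change Stage", "Restructure", "Restructure Status"],
})

features = ColumnTransformer(
    transformers=[
        # CountVectorizer expects a 1-D sequence of strings,
        # so the column is named with a plain string
        ("text", CountVectorizer(), "Data"),
        # OneHotEncoder expects 2-D input, so the column is named with a list
        ("num", OneHotEncoder(handle_unknown="ignore"), ["Num"]),
    ],
    remainder="drop",  # columns not listed (here No_Sense and the label) are discarded
)

# Feature matrix: bag-of-words columns followed by the one-hot columns
X = features.fit_transform(df)
y = df["Label/Target"].to_numpy()

# Chaining the encoder with the classifier lets fit/predict take the raw DataFrame
model = make_pipeline(features, LinearSVC())
model.fit(df, y)
predictions = model.predict(df)
```

With real data you would split the DataFrame with train_test_split first and fit the pipeline on the training part only, so the vocabulary and the categories are learned from training rows alone.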

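Hi 艾斯达里, thanks for your answer. I got the following error: KeyError: 'Passing list-likes to .loc or [] with any missing labels is no longer supported'
X=test.loc[:,["Titles"+str(i) for i in range(matrix.shape[1])]
I tried dropping the NaN values as follows:
test=df.dropna(subset=['Num','Data'])
but the error persists. Do you know how to fix it?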
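In
X=test.loc[:,["Titles"+str(i) for i in range(matrix.shape[1])
you are missing a closing
]
Apart from that, this error means that at least one element of the list
["Titles"+str(i) for i in range(matrix.shape[1])]
is not among the DataFrame's columns. I suggest running
print(["Titles"+str(i) for i in range(matrix.shape[1])])
and checking whether any value in the output is not a column name in your table.

As for the second question, I don't fully understand what you mean by initializing in this context; if you can clarify what you want, I may be able to help. I have edited the answer so that all the encoding related to the Data column happens inside the bag_words function. Is that what you need?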
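
The check suggested above can be sketched as follows. The frame `test` and its column names are hypothetical, made up only to reproduce the situation where the requested labels ("Titles0", "Titles1") do not match the columns that actually exist ("Data0", "Data1"); your real frame will differ.

```python
import pandas as pd

# Hypothetical frame: the encoded columns were created with the "Data" prefix
test = pd.DataFrame({"Data0": [1, 0], "Data1": [0, 1], "group1": [1, 0]})

# Labels requested with a different prefix, as in the failing .loc call
wanted = ["Titles" + str(i) for i in range(2)] + ["group1"]

# Labels that are requested but absent from the frame; any entry here
# is enough to make .loc raise the KeyError about missing labels
missing = [c for c in wanted if c not in test.columns]
print(missing)

# Selecting only the labels that are present avoids the error
present = [c for c in wanted if c in test.columns]
X = test.loc[:, present].to_numpy()
```

If `missing` is not empty, the usual fix is to make the prefix used when creating the columns match the one used when selecting them, and to check that the result of join was assigned back to the frame.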