Python can't accept input when using a function


python pandas machine-learning scikit-learn


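I'm working on the House Prices problem hosted on Kaggle. While building the model, I figured it made sense to reuse on the test set some of the code I wrote for the train dataset, so I put the code that performs those operations into a function definition. In this function I handle the missing values, and I use its return value to perform one-hot encoding and feed the result into a random forest regression. However, it raises the following error: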

Traceback (most recent call last):
  File "C:/Users/security/Downloads/AP/Boston-Kaggle/Model.py", line 56, in <module>
    sel.fit(x_train, y_train)
  File "C:\Users\security\AppData\Roaming\Python\Python37\site-packages\sklearn\feature_selection\from_model.py", line 196, in fit
    self.estimator_.fit(X, y, **fit_params)
  File "C:\Users\security\AppData\Roaming\Python\Python37\site-packages\sklearn\ensemble\forest.py", line 249, in fit
    X = check_array(X, accept_sparse="csc", dtype=DTYPE)
  File "C:\Users\security\AppData\Roaming\Python\Python37\site-packages\sklearn\utils\validation.py", line 542, in check_array
    allow_nan=force_all_finite == 'allow-nan')
  File "C:\Users\security\AppData\Roaming\Python\Python37\site-packages\sklearn\utils\validation.py", line 56, in _assert_all_finite
    raise ValueError(msg_err.format(type_err, X.dtype))
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

New error:

Reusing code is a good idea, but pay attention to how the scope of variables changes when you move code into a function.

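The error occurs because the array you feed to the random forest contains NaN values. In your feature_engineering_and_selection function you remove the NaN values, but the function never returns df, so the model ends up using the original, unmodified df.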
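
As a minimal sketch of that scoping issue (the function names and the tiny DataFrame are made up for illustration, not taken from the original script), reassigning `df` inside a function only rebinds the local name, so the caller sees no change unless the result is returned:

```python
import numpy as np
import pandas as pd


def clean_without_return(df):
    # Rebinds the local name only; the caller's DataFrame is untouched
    df = df.fillna(0)


def clean_with_return(df):
    # Returns the new DataFrame so the caller can use it
    return df.fillna(0)


# Illustrative data, not the real Kaggle file
raw = pd.DataFrame({"LotFrontage": [65.0, np.nan, 80.0]})

clean_without_return(raw)
still_has_nan = bool(raw["LotFrontage"].isna().any())

cleaned = clean_with_return(raw)
has_nan_after_return = bool(cleaned["LotFrontage"].isna().any())

print(still_has_nan, has_nan_after_return)
```

Only the second variant hands the cleaned data back, which is why the fixed functions in this answer all end with a `return`.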

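I'd suggest splitting the feature engineering and selection function into separate components. Here, I've created a function that only handles the NaNs (remove_nan, listed further down).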

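I recommend filling NaN values in numeric columns with 0 rather than the mean. In this data, three numeric columns contain NaN values: LotFrontage (linear feet of street connected to the property), MasVnrArea (masonry veneer area), and GarageYrBlt (year the garage was built). If a house has no garage, there is no year in which the garage was built, so setting the year to 0 instead of the mean year makes sense, and the same reasoning applies to the other two columns.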
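
To make the difference concrete, here is a small sketch comparing the two fill strategies. The three column names come from the dataset as described above, but the values are invented for illustration:

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names match the Kaggle data
houses = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0],
    "MasVnrArea": [196.0, 0.0, np.nan],
    "GarageYrBlt": [2003.0, np.nan, 1976.0],
})
numeric_nan_columns = ["LotFrontage", "MasVnrArea", "GarageYrBlt"]

# Fill with 0: a missing garage year stays clearly "no garage"
filled_zero = houses.fillna({col: 0 for col in numeric_nan_columns})

# Fill with the column mean: the missing garage gets a plausible-looking year
filled_mean = houses.fillna(houses[numeric_nan_columns].mean())

print(filled_zero.loc[1, "GarageYrBlt"], filled_mean.loc[1, "GarageYrBlt"])
```

With the mean, the house without a garage would appear to have one built in an average year, which is the situation the 0 fill avoids.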

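There is also some work to do on the one-hot encoder you have set up. Creating a one-hot encoding can be tricky, because the train and test data need to end up with the same columns. Suppose the training data has a House Type column containing Mansion and Ranch, while the test data's House Type column contains Mansion and Duplex.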

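If you then use pd.get_dummies, the train columns will be [house_type_Mansion, house_type_Ranch] and the test columns will be [house_type_Mansion, house_type_Duplex], which will not work. With sklearn, however, you can fit a OneHotEncoder on the train data (see train_one_hot_encoder further down). When it transforms the test dataset, it creates the same columns as for the train dataset. The handle_unknown parameter tells the encoder how to deal with Duplex in the test set: either 'ignore' or 'error'.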
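
The mismatch can be reproduced with a short sketch. The two frames below mirror the Mansion/Ranch/Duplex example; the column name `house_type` and the use of `.toarray()` (to avoid depending on the dense-output parameter, whose name differs between scikit-learn versions) are choices made for this illustration:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"house_type": ["Mansion", "Ranch"]})
test = pd.DataFrame({"house_type": ["Mansion", "Duplex"]})

# get_dummies builds columns from whatever values each frame contains,
# so train and test end up with different column sets
train_dummies = list(pd.get_dummies(train).columns)
test_dummies = list(pd.get_dummies(test).columns)

# Fit on train only; categories unseen during fit become all-zero rows
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# transform returns a sparse matrix by default, so convert it to dense
test_encoded = encoder.transform(test).toarray()

print(train_dummies, test_dummies)
print(test_encoded)
```

The encoded test data has one column per category seen in training, and the Duplex row is all zeros because that category was unknown when the encoder was fitted.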

To combine the categorical and non-categorical data, I again recommend creating a separate function:

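import pandas as pd

# One-hot encodes the categorical columns of the given dataframe
def one_hot_encode(df, categorical_columns, encoder):
    # Get dataframe with only categorical columns
    categorical_df = df[categorical_columns]
    # Column names for the encoded features; get_feature_names() was
    # replaced by get_feature_names_out() in newer scikit-learn releases
    if hasattr(encoder, "get_feature_names_out"):
        feature_names = encoder.get_feature_names_out()
    else:
        feature_names = encoder.get_feature_names()
    # Get one hot encoding, keeping df's index so the concat below aligns rows
    ohe_df = pd.DataFrame(encoder.transform(categorical_df),
                          columns=feature_names, index=df.index)
    # Get the remaining (non-categorical) columns
    float_df = df.drop(categorical_columns, axis=1)
    # Return the combined dataframe
    return pd.concat([float_df, ohe_df], axis=1)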

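Finally, your feature_engineering_and_selection function (named feature_selection_and_engineering in the code here) can call all of these functions: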

def feature_selection_and_engineering(df, encoder=None):
    df = remove_nan(df)
    categorical_columns = get_categorical_columns(df)
    # If there is no encoder, train one
    if encoder is None:
        encoder = train_one_hot_encoder(df, categorical_columns)
    # Encode data
    df = one_hot_encode(df, categorical_columns, encoder)
    # Return the encoded data AND the encoder
    return df, encoder
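
The function above calls `get_categorical_columns`, which is not shown in this answer. One possible version, offered purely as a hypothetical helper that treats every non-numeric column as categorical, could look like this (the sample frame is made up for illustration):

```python
import pandas as pd


def get_categorical_columns(df):
    # Hypothetical helper (not from the original answer):
    # treat every non-numeric column as categorical
    return [col for col in df.columns
            if not pd.api.types.is_numeric_dtype(df[col])]


# Illustrative frame with two numeric columns and one string column
sample = pd.DataFrame({
    "LotArea": [8450, 9600],
    "HouseStyle": ["2Story", "1Story"],
    "GarageYrBlt": [2003.0, 1976.0],
})
categorical = get_categorical_columns(sample)
print(categorical)
```

With such a helper in place, the intended call pattern is to let the first call fit the encoder on the training data, as in `x_train, encoder = feature_selection_and_engineering(x_train)`, and then pass that encoder to the second call, as in `x_test, _ = feature_selection_and_engineering(x_test, encoder)`, so that both frames get the same columns.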

To get the code to run I had to fix a few things; I've included the whole modified script in a gist.

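It looks like the error is caused by the handle_unknown parameter not being set correctly on this line: enc = OneHotEncoder(sparse=False, handle_unknown='ignore'). Is that line the same in your code? I don't get an error when I run the code from the gist.

I had handle_unknown='unknown', lol, so I fixed it. But now I get a ValueError saying the input contains NaN.

When I use your code exactly as it is, I get this: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Possibly because I have a newer version of pandas.

I get that warning too; I've updated the definition of remove_nan to fix it. As for your ValueError, my guess is that either your remove_nan function isn't working properly, or you aren't calling reset_index after train_test_split the way I do in the gist.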
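
The reset_index point from the comments can be illustrated with a small sketch. It assumes, as a plausible explanation rather than a confirmed diagnosis, that the NaNs appear when a frame with a shuffled index is concatenated column-wise with a freshly built one; the data and column names are invented:

```python
import numpy as np
import pandas as pd

# Rows as they might look after a shuffled split: the index is not 0..n-1
numeric_part = pd.DataFrame({"LotArea": [8450, 9600, 11250]}, index=[7, 3, 5])

# A frame built from a NumPy array gets a fresh 0..n-1 index
encoded_part = pd.DataFrame(np.eye(3), columns=["a", "b", "c"])

# concat(axis=1) aligns on index labels, so no rows line up and
# the result is padded with NaN
misaligned = pd.concat([numeric_part, encoded_part], axis=1)

# Resetting the index first makes the rows line up again
aligned = pd.concat(
    [numeric_part.reset_index(drop=True), encoded_part], axis=1
)

print(misaligned.shape, aligned.shape)
```

The misaligned result has six rows instead of three, each half filled with NaN, which is the kind of input that makes the random forest raise the ValueError.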

# Iterates through the columns and fixes any NaNs
def remove_nan(df):
    replace_dict = {}

    for col in df.columns:

        # If there are any NaN values in this column
        if pd.isna(df[col]).any():

            # Replace NaN in object columns with 'N/A'
            if df[col].dtypes == 'object':
                replace_dict[col] = 'N/A'

            # Replace NaN in float columns with 0
            elif df[col].dtypes == 'float64':
                replace_dict[col] = 0

    df = df.fillna(replace_dict)

    return df

Training data:

| House Type |
| ---------- |
| Mansion    |
| Ranch      |

Test data:

| House Type |
| ---------- |
| Mansion    |
| Duplex     |

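from sklearn.preprocessing import OneHotEncoder

# Fits an sklearn one hot encoder
def train_one_hot_encoder(df, categorical_columns):
    # Take the categorical columns only
    categorical_df = df[categorical_columns]
    # sparse=False returns a dense array; scikit-learn 1.2+ renamed this
    # parameter to sparse_output
    enc = OneHotEncoder(sparse=False, handle_unknown='ignore')
    return enc.fit(categorical_df)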
