Python: float32 value error in sklearn feature_selector with LightGBM

I keep getting this error and it is driving me crazy. It is definitely triggered by specific variables, because if I subset the list to a different set of columns it runs successfully... but I can't see why.

I don't have a good reproducible example here, but hopefully someone can suggest a way to test this, or help me pin down the problem.

I pass a dataframe called clean to a function that splits it into train and test and performs RFECV using LightGBM.

The error points to feature_selector.fit inside the function.

I can print out df.dtypes right before execution and prove that all columns are float32.

The error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

The traceback:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-56-81e642e89134> in <module>
----> 1 export = add_features(clean, TRAINING_FLAG)

<ipython-input-54-8d69f26acf60> in add_features(df, TRAINING_FLAG)
    255 
    256             while True: # e.g. "loop forever"
--> 257                 reduced_partial_feats = reduce_feats(df_new, partial_list,[t])
    258 
    259                 if len(partial_list) <= ceil(len(reduced_partial_feats) + (0.02 * partial_feat_count)):

<ipython-input-51-414034e1b855> in reduce_feats(df, inlist, target)
    210         lgb.LGBMClassifier(**params), step=step_size, scoring="roc_auc", cv=CROSSFOLDS, verbose=1
    211     )
--> 212     feature_selector.fit(x_train, y_train.values.ravel())
    213 
    214     selected_features = [f for f in x_train.columns[feature_selector.ranking_ == 1]]

/opt/conda/envs/py3/lib/python3.6/site-packages/sklearn/feature_selection/rfe.py in fit(self, X, y, groups)
    479             train/test set.
    480         """
--> 481         X, y = check_X_y(X, y, "csr", ensure_min_features=2)
    482 
    483         # Initialization

/opt/conda/envs/py3/lib/python3.6/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
    717                     ensure_min_features=ensure_min_features,
    718                     warn_on_dtype=warn_on_dtype,
--> 719                     estimator=estimator)
    720     if multi_output:
    721         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

/opt/conda/envs/py3/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    540         if force_all_finite:
    541             _assert_all_finite(array,
--> 542                                allow_nan=force_all_finite == 'allow-nan')
    543 
    544     if ensure_min_samples > 0:

/opt/conda/envs/py3/lib/python3.6/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan)
     54                 not allow_nan and not np.isfinite(X).all()):
     55             type_err = 'infinity' if allow_nan else 'NaN, infinity'
---> 56             raise ValueError(msg_err.format(type_err, X.dtype))
     57     # for object dtype data, we only check for NaNs (GH-13254)
     58     elif X.dtype == np.dtype('object') and not allow_nan:

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
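
For context, check_X_y runs _assert_all_finite on both X and y (the traceback shows y validated with force_all_finite=True as well), so the dtype being float32 is not the issue: the validator rejects NaN and +/-inf regardless of dtype, and inf is perfectly representable in a float32 column. A minimal sketch of my own (not from the question) that triggers the same ValueError:

import numpy as np
from sklearn.utils.validation import check_X_y

X = np.array([[1.0, np.inf],
              [2.0, 3.0]], dtype=np.float32)  # dtype is float32, but one cell is inf
y = np.array([0, 1])

check_X_y(X, y)  # raises ValueError: Input contains NaN, infinity or a value too large ...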
The function:

from math import ceil

import lightgbm as lgb
from sklearn.feature_selection import RFECV
from sklearn.model_selection import train_test_split

# STEPS_PER_FOLD, TEST_SIZE, RANDOM_SEED, IS_UNBALANCE and CROSSFOLDS are
# constants defined elsewhere in the notebook.

def reduce_feats(df, inlist, target):

    temp = df[inlist + target].copy()
    y = temp[target].iloc[:, 0].copy()       # first (only) target column, as a Series
    x = temp.drop(target, axis=1).fillna(0)  # replaces NaN, but not +/-inf

    step_size = ceil(len(x.columns) / STEPS_PER_FOLD)

    x_train, x_valid, y_train, y_valid = train_test_split(
        x, y, test_size=TEST_SIZE, random_state=RANDOM_SEED
    )

    print(x_train.dtypes)  # THIS SHOWS THAT EVERYTHING IS FLOAT32!

    params = {
        "objective": "binary",
        "metric": "auc",
        "boosting_type": "gbdt",
        "is_unbalance": IS_UNBALANCE,
        "boost_from_average": True,
        "n_estimators": 100,
        "num_threads": -1,
        "num_leaves": 200,
        "min_data_in_leaf": 25,
        "max_depth": -1,
        "learning_rate": 0.1,
        "step": step_size,  # not a LightGBM parameter; RFECV receives step separately below
    }

    feature_selector = RFECV(
        lgb.LGBMClassifier(**params), step=step_size, scoring="roc_auc",
        cv=CROSSFOLDS, verbose=1
    )

    feature_selector.fit(x_train, y_train.values.ravel())

    # keep every feature that RFECV ranks 1 (i.e. retains)
    selected_features = [f for f in x_train.columns[feature_selector.ranking_ == 1]]

    return selected_features
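
Since the question asks for a way to test: a hedged diagnostic sketch (using x_train and y_train as defined in reduce_feats above), which would narrow down which columns trip the validator. Checking dtypes can't catch this, because fillna(0) removes NaN but leaves +/-inf in place:

import numpy as np

# columns that still contain NaN or +/-inf after fillna(0)
bad_cols = x_train.loc[:, ~np.isfinite(x_train).all()].columns.tolist()
print(bad_cols)

# the validator checks the target too (see the traceback above)
print(np.isfinite(y_train.values).all())

If bad_cols is non-empty, those would be the specific variables that make one subset of columns fail while another succeeds.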