Scikit-learn random forest with warm_start raises an error (non-broadcastable output…)


I am trying to build an online random forest classifier. Inside a for loop I hit an error whose cause I cannot find.

clf = RandomForestClassifier(n_estimators=1, warm_start=True)

Inside the for loop, I increase the number of estimators each time I read a new batch of data:

clf.n_estimators = (clf.n_estimators + 1)
clf = clf.fit(data_batch, label_batch)

After 3 iterations of the loop, the following prediction call inside the loop fails:

predicted = clf.predict(data_batch)

I get the following error:

ValueError: non-broadcastable output operand with shape (500,1) doesn't match the broadcast shape (500,2)

The data has shape (500, 153) and the labels have shape (500,).

Here is a more complete version of the code:

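from sklearn.ensemble import RandomForestClassifier

# Start with a single tree; warm_start keeps the already-fitted trees on later fit() calls.
clf = RandomForestClassifier(n_estimators=1, warm_start=True)
clf = clf.fit(X_train, y_train)
predicted = clf.predict(X_test)

batch_size = 500

for i in range(batch_init_size, records, batch_size):
    from_ = (i + 1)
    to_ = (i + batch_size + 1)

    data_batch = data[from_:to_, :]
    label_batch = label[from_:to_]

    # Predict on the new batch before training on it.
    predicted = clf.predict(data_batch)

    # Add one more tree and fit it on the current batch.
    clf.n_estimators = (clf.n_estimators + 1)
    clf = clf.fit(data_batch, label_batch)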

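I found the cause of the problem: because the data is imbalanced, it is quite likely that all samples in some batch come from the same class. When that happens, the arrays that forest.py adds together during prediction have different numbers of columns, such as (500, 1) and (500, 2), and the addition fails. Here is the relevant code from forest.py in scikit-learn: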

def accumulate_prediction(predict, X, out, lock):
    prediction = predict(X, check_input=False)
    with lock:
        if len(out) == 1:
            out[0] += prediction
        else:
            for i in range(len(out)):
                out[i] += prediction[i]
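The in-place addition in the snippet above can be reproduced with NumPy alone. The sketch below is only an illustration: the array names mirror the snippet, the shapes are taken from the error message, and which of the two shapes belongs to the accumulator in the real forest is an assumption read off that message.

```python
import numpy as np

# Accumulator with a single column, as if only one class were known.
out = np.zeros((500, 1))
# Hypothetical per-tree class probabilities for two classes.
prediction = np.full((500, 2), 0.5)

try:
    # An in-place add cannot grow `out` to the broadcast shape (500, 2).
    out += prediction
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Because `+=` writes into the existing array, NumPy refuses to broadcast the result into a narrower output, which is consistent with the ValueError in the question.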

Yes, the error occurs because the batches do not all contain the same number of classes.

I solved it by using batches that contain all of the classes.
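One possible way to apply this fix is to check each batch before fitting and skip any batch that is missing a class. The following is a minimal sketch on synthetic data, not the original poster's code: the helper `has_all_classes`, the generated `X` and `y`, and the choice to skip incomplete batches are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced two-class data (assumption, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] > 1.0).astype(int)
y[500:1000] = 0  # force one batch to contain a single class

all_classes = np.unique(y)
batch_size = 500


def has_all_classes(labels, classes):
    """Return True if every class in `classes` occurs in `labels`."""
    return bool(np.isin(classes, labels).all())


clf = RandomForestClassifier(n_estimators=1, warm_start=True, random_state=0)
fitted_once = False
skipped = 0

for start in range(0, len(X), batch_size):
    X_batch = X[start:start + batch_size]
    y_batch = y[start:start + batch_size]

    if not has_all_classes(y_batch, all_classes):
        skipped += 1  # this batch would give the new tree fewer classes
        continue

    if fitted_once:
        clf.n_estimators += 1  # add one tree for the new batch
    clf.fit(X_batch, y_batch)
    fitted_once = True

predicted = clf.predict(X[:batch_size])
print(len(clf.estimators_), skipped, predicted.shape)
```

Skipping a batch throws its samples away. An alternative under the same assumption would be to hold the incomplete batch back and merge it into the next one, so that every call to `fit` still sees all classes.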

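Can you provide a reproducible sample dataset? I could not reproduce this problem with a randomly generated dataset… What are the values of batch_init_size and records? How many classes/labels do you have? Is your data sorted by label?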
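A quick way to answer these questions is to list which classes appear in each batch. The sketch below uses NumPy only; `classes_per_batch` and the example `labels` array are hypothetical names and data chosen for illustration, with labels deliberately sorted by class.

```python
import numpy as np


def classes_per_batch(labels, batch_size):
    """Return the sorted unique classes found in each consecutive batch."""
    return [
        np.unique(labels[start:start + batch_size]).tolist()
        for start in range(0, len(labels), batch_size)
    ]


# Hypothetical labels sorted by class: 1000 zeros followed by 500 ones.
labels = np.concatenate([np.zeros(1000, dtype=int), np.ones(500, dtype=int)])
summary = classes_per_batch(labels, batch_size=500)
print(summary)  # [[0], [0], [1]]
```

If any entry in the summary has fewer classes than the full label set, that batch is a candidate for triggering the shape mismatch.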