Using imbalanced-learn's make_pipeline: how to resolve "ValueError: Found input variables with inconsistent numbe

Symptoms and background

When using the make_pipeline function from the imbalanced-learn library, developers often run into the following error:

ValueError: Found input variables with inconsistent numbers of samples: [n_samples1, n_samples2]

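The error typically shows up in these scenarios:

A resampler (such as SMOTE or RandomUnderSampler) is combined with a classifier
Feature engineering steps handle the samples differently, for example one step drops rows and another does not
Steps in a multi-step pipeline disagree on the number of samples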

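Root cause analysis

Tracing the error back through the stack trace points to three issues:

Inconsistent data flow: a step changes the number of rows in X without applying the same change to y (or the reverse)
Failed length check: scikit-learn's check_consistent_length function finds that X and y have different numbers of samples
Resampling side effects: samplers such as SMOTE return a new X and a new y together from fit_resample, so keeping only one of the two leaves the other at its old length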
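
The length check mentioned above can be called directly, which is a quick way to see exactly what scikit-learn objects to. This is a small sketch with placeholder arrays (the shapes are arbitrary assumptions):

```python
import numpy as np
from sklearn.utils import check_consistent_length

X = np.zeros((100, 3))
y_ok = np.zeros(100)
y_bad = np.zeros(90)

# Same number of samples: returns None and raises nothing
check_consistent_length(X, y_ok)

# Different number of samples: raises the ValueError discussed in this article
try:
    check_consistent_length(X, y_bad)
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
```

Only the first dimension is compared, so a difference in the number of feature columns never triggers this particular error.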

Solutions and code examples

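Solution 1: Align samples explicitly

from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

# X and y are assumed to be defined already; they must have the same number of samples
assert len(X) == len(y)

# make_pipeline must come from imblearn.pipeline (not sklearn.pipeline)
# so that samplers are accepted as pipeline steps
pipeline = make_pipeline(
    SMOTE(sampling_strategy='auto'),
    RandomForestClassifier()
)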
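
The assertion only detects a mismatch; it does not prevent one. A sketch of one way to keep X and y aligned while cleaning data is to compute a single row mask and apply it to both. The helper name drop_rows_with_nan is hypothetical, and NaN removal is just one example of a row-dropping step.

```python
import numpy as np

def drop_rows_with_nan(X, y):
    """Drop rows of X that contain NaN, and the matching entries of y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    if X.shape[0] != y.shape[0]:
        raise ValueError(f"X has {X.shape[0]} samples but y has {y.shape[0]}")
    # One boolean mask, applied to both arrays, keeps them in lockstep
    keep = ~np.isnan(X).any(axis=1)
    return X[keep], y[keep]

X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0], [6.0, np.nan]])
y = np.array([0, 1, 0, 1])

X_aligned, y_aligned = drop_rows_with_nan(X, y)
print(X_aligned.shape, y_aligned.shape)
```

The same pattern applies to any filter (deduplication, outlier removal): derive the mask once and index X and y with it together.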

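Solution 2: Custom sampler adapter

from sklearn.base import BaseEstimator

class ResampleWrapper(BaseEstimator):
    """Thin adapter around a sampler; BaseEstimator makes it cloneable."""

    def __init__(self, sampler):
        self.sampler = sampler

    def fit_resample(self, X, y):
        # Delegate to the wrapped sampler, which resamples X and y together
        X_res, y_res = self.sampler.fit_resample(X, y)
        # Fail early if the two outputs ever get out of sync
        assert len(X_res) == len(y_res)
        return X_res, y_res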

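Solution 3: Combine features with FeatureUnion and keep the sampler as its own step

from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer, StandardScaler

pipeline = make_pipeline(
    # FeatureUnion only accepts transformers (fit/transform), which keep the row count;
    # a sampler such as SMOTE has no transform method and cannot be placed inside it
    FeatureUnion([
        ('scaled', StandardScaler()),          # illustrative feature branch
        ('identity', FunctionTransformer())    # passes the features through unchanged
    ]),
    # The sampler is a separate step, so X and y are resampled together
    SMOTE(),
    RandomForestClassifier()
)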

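Debugging tips and visualization tools

The following methods help diagnose the problem:

Tool	Usage	Example output
Pipeline visualization	`from sklearn.utils import estimator_html_repr`	Diagram of the pipeline structure in HTML
Sample tracing	`print(f"Step {i}: X.shape={X.shape}")`	Shape of X after each step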

Best practices

Always verify that X.shape[0] == y.shape[0] before building the pipeline
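For heterogeneous feature columns (for example mixed numeric and categorical data), consider ColumnTransformer, which transforms columns without changing the number of rows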
Use StratifiedKFold in cross-validation to preserve the class distribution in each fold