After exporting an exponential-kernel GPR model from the MATLAB Regression Learner app, I retrained it on the same dataset with the samples in a different order. Does this cause data leakage? My code is below.
Import data
% res = xlsread('数据集.xlsx');
res = table2array(data1);
% Set the random seed for reproducibility
rng(1000);
Split into training and test sets
n = size(data1,1);
temp = randperm(n);  % Shuffle the dataset by generating a random permutation of the row indices
n1 = round(0.8*n);
% TreeBagger expects samples as rows and features as columns
% The transpose here makes the later per-feature normalization easier
P_train = res(temp(1: n1), 1: 10)'; % features
T_train = res(temp(1: n1), 11)';    % target variable
M = size(P_train, 2); % number of training samples
P_test = res(temp(n1+1: end), 1: 10)';
T_test = res(temp(n1+1: end), 11)';
N = size(P_test, 2);  % number of test samples
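Normalize the data
% mapminmax performs min-max normalization. Its first output is the normalized data; its second output is a settings structure used to apply the same normalization to the test data later
[p_train, ps_input] = mapminmax(P_train, 0, 1);
p_test = mapminmax('apply', P_test, ps_input); % Scale the test set with the training-set normalization parameters to avoid data leakage
[t_train, ps_output] = mapminmax(T_train, 0, 1);
t_test = mapminmax('apply', T_test, ps_output);
Transpose to fit the model
% This transpose puts the data into the input layout the model requires (samples as rows)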
p_train = p_train'; p_test = p_test';
t_train = t_train'; t_test = t_test';
n_features = size(p_train, 2); 
% Create a 5-fold partition
cv = cvpartition(size(p_train,1), 'KFold', 5);
% Initialize the predictions and the per-fold R²
validationPredictions = zeros(size(t_train));
fold_R2 = zeros(1, cv.NumTestSets);
for i = 1:cv.NumTestSets
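    % Get the training/held-out indices for the current fold
    trainIdx = training(cv, i); % training indices (about 4/5 of the samples)
    testIdx = test(cv, i);     % held-out indices (about 1/5 of the samples)
    % Extract the current fold's data (samples as rows)
    X_train_fold = p_train(trainIdx, :); % features: training samples x 10
    Y_train_fold = t_train(trainIdx);    % labels: training samples x 1
    % Retrain the model within the current fold (using only that fold's training data)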
    regressionGP_fold = fitrgp( ...
        X_train_fold, ...
        Y_train_fold, ...
        'BasisFunction', 'constant', ...
        'KernelFunction', 'exponential', ...
        'Standardize', true...  % automatically standardize the current fold's features
    );
    % Predict the current fold's held-out set
    Y_pred_fold = predict(regressionGP_fold, p_train(testIdx, :));
    % Store the predictions (for later metric calculations)
    validationPredictions(testIdx) = Y_pred_fold;
    % Compute the current fold's R² (optional)
    SS_res = sum((t_train(testIdx) - Y_pred_fold).^2);
    SS_tot = sum((t_train(testIdx) - mean(t_train(testIdx))).^2);
    fold_R2(i) = 1 - SS_res / SS_tot;
end
% Compute the mean cross-validated R²
mean_cv_R2 = mean(fold_R2);
disp(['Mean 5-fold cross-validation R²: ', num2str(mean_cv_R2)]);
Mean 5-fold cross-validation R²: 0.86416
regressionGP_final = fitrgp( ...
        p_train, ...
        t_train, ...
        'BasisFunction', 'constant', ...
        'KernelFunction', 'exponential', ...
        'Standardize', true...  
    );
% Create the result struct with a predict function
predictorExtractionFcn = @(t) t;
gpPredictFcn = @(x) predict(regressionGP_final, x);
trainedModel.predictFcn = @(x) gpPredictFcn(predictorExtractionFcn(x));
% Add fields to the result struct
trainedModel.RequiredVariables = {'AN1', 'VR1', 'ARE1', 'EF2', 'E12', 'ST_1', 'St_1', 'CRR_1', 'AT_1', 'At_1'};
trainedModel.RegressionGP = regressionGP_final;
trainedModel.About = 'This struct is a trained model exported from Regression Learner R2024b.';
trainedModel.HowToPredict = sprintf('To make predictions on a new table, T, use: \n yfit = c.predictFcn(T) \nreplacing ''c'' with the name of the variable that is this struct, e.g. ''trainedModel''.\n \nThe table, T, must contain the variables returned by: \n c.RequiredVariables \nVariable formats (e.g. matrix/vector, datatype) must match the original training data.\nAdditional variables are ignored.\n \nFor more information, see <a href="matlab:helpview(fullfile(docroot, ''stats'', ''stats.map''), ''appregression_exportmodeltoworkspace'')">How to predict using an exported model</a>.');
Simulation test: prediction
% Predictions used for the error calculations below
t_sim1 = trainedModel.predictFcn(p_train);
t_sim2 = trainedModel.predictFcn(p_test);
%%  Reverse the normalization
T_sim1 = mapminmax('reverse', t_sim1, ps_output);
T_sim2 = mapminmax('reverse', t_sim2, ps_output);
%%  Root mean squared error (RMSE)
error1 = sqrt(sum((T_sim1' - T_train).^2) ./ M);
error2 = sqrt(sum((T_sim2' - T_test ).^2) ./ N);
Plotting
figure
plot(1: M, T_train, 'r-*', 1: M, T_sim1, 'b-o', 'LineWidth', 1)
legend('Actual', 'Predicted')
xlabel('Sample')
ylabel('Predicted value')
% Named titleText rather than string to avoid shadowing the built-in string function
titleText = {'Training set: predicted vs. actual'; ['RMSE=' num2str(error1)]};
title(titleText)
xlim([1, M])
grid
figure
plot(1: N, T_test, 'r-*', 1: N, T_sim2, 'b-o', 'LineWidth', 1)
legend('Actual', 'Predicted')
xlabel('Sample')
ylabel('Predicted value')
titleText = {'Test set: predicted vs. actual'; ['RMSE=' num2str(error2)]};
title(titleText)
xlim([1, N])
grid
% %%  Plot the error curve
% figure
% plot(1: trees, oobError(net), 'b-', 'LineWidth', 1)
% legend('Error curve')
% xlabel('Number of decision trees')
% ylabel('Error')
% xlim([1, trees])
% grid
% %%  Plot feature importance
% figure
% bar(importance)
% legend('Importance')
% xlabel('Feature')
% ylabel('Importance')
Compute metrics
% R2
R1 = 1 - norm(T_train - T_sim1')^2 / norm(T_train - mean(T_train))^2;
R2 = 1 - norm(T_test  - T_sim2')^2 / norm(T_test  - mean(T_test ))^2;
disp(['Training set R2: ', num2str(R1)])
Training set R2: 0.99726
disp(['Test set R2: ', num2str(R2)])
Test set R2: 0.91522
% % MAE
% mae1 = sum(abs(T_sim1' - T_train)) ./ M;
% mae2 = sum(abs(T_sim2' - T_test )) ./ N;
% 
% disp(['Training set MAE: ', num2str(mae1)])
% disp(['Test set MAE: ', num2str(mae2)])
% 
% % MBE
% mbe1 = sum(T_sim1' - T_train) ./ M ;
% mbe2 = sum(T_sim2' - T_test ) ./ N ;
% 
% disp(['Training set MBE: ', num2str(mbe1)])
% disp(['Test set MBE: ', num2str(mbe2)])
%%  Scatter plots
sz = 25;
c = 'b';
figure
scatter(T_train, T_sim1, sz, c)
hold on
min_val = min([T_train, T_sim1']) * 0.95;
max_val = max([T_train, T_sim1']) * 1.05;
plot([min_val max_val], [min_val max_val], 'k--', 'LineWidth', 1)
axis([min_val max_val min_val max_val]);
xlabel('Training set actual value');
ylabel('Training set predicted value');
title('Training set: predicted vs. actual')
figure
scatter(T_test, T_sim2, sz, c)
hold on
min_val = min([T_test, T_sim2']) * 0.95;
max_val = max([T_test, T_sim2']) * 1.05;
plot([min_val max_val], [min_val max_val], 'k--', 'LineWidth', 1)
axis([min_val max_val min_val max_val]);
plot(xlim, ylim, '--k')
xlabel('Test set actual value');
ylabel('Test set predicted value');
title('Test set: predicted vs. actual')
Answers (1)
Ronit
16 Jul 2025
Hello,
Shuffling the dataset before splitting it into training and test sets does not cause data leakage. In fact, randomly ordering the samples before the split is common and recommended, because it helps make both sets representative of the data.
Data leakage occurs when information from the test set influences training, for example when test-set statistics are used for normalization or model fitting. Your workflow avoids this: the normalization parameters are computed from the training data only and then applied to the test data.
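To make the normalization point concrete, here is a minimal sketch that contrasts leak-free and leaky min-max scaling. It uses synthetic data and plain arithmetic instead of `mapminmax`; the data and every variable name are assumptions made for illustration, not part of the original code.

```matlab
% Synthetic data (hypothetical, for illustration only): 100 samples x 3 features
rng(0);
X = randn(100, 3) .* [1 10 100];

% Shuffle, then split 80/20, as in the question
idx = randperm(100);
trainIdx = idx(1:80);
testIdx  = idx(81:end);

% Leak-free: the minimum and range come from the training rows only
loTrain    = min(X(trainIdx, :), [], 1);
rangeTrain = max(X(trainIdx, :), [], 1) - loTrain;
XtrainScaled = (X(trainIdx, :) - loTrain) ./ rangeTrain;
XtestScaled  = (X(testIdx, :)  - loTrain) ./ rangeTrain;  % can fall outside [0, 1]

% Leaky (avoid): the statistics use all rows, so test rows influence the scaling
loAll    = min(X, [], 1);
rangeAll = max(X, [], 1) - loAll;
XtrainLeaky = (X(trainIdx, :) - loAll) ./ rangeAll;
```

In the leak-free version, test values can land slightly outside [0, 1]. That is expected, since the test rows had no say in the scaling.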
Furthermore, the model is trained only on the training subset, so it is never exposed to the test data before predictions are made on the test set.
As a result, reordering the data before splitting does not introduce leakage, and it makes the model evaluation more robust.
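One way to act on the robustness point is to repeat the shuffle-and-split several times and look at the spread of the test score, rather than relying on a single ordering. The sketch below assumes synthetic stand-in data and an arbitrary list of seeds; only the `fitrgp` options are taken from the question. Reporting the mean and spread over all shuffles is the safer habit, because keeping only the best-looking shuffle after seeing its test score would give an optimistic estimate.

```matlab
% Synthetic stand-in data (hypothetical): 120 samples, 10 features, 1 target
rng(1);
X = rand(120, 10);
y = sin(2*pi*X(:, 1)) + 0.5*X(:, 2) + 0.05*randn(120, 1);

seeds  = [1 2 3 4 5];            % arbitrary seeds, one per shuffle
testR2 = zeros(size(seeds));

for k = 1:numel(seeds)
    rng(seeds(k));
    idx    = randperm(size(X, 1));   % new ordering for this repetition
    nTrain = round(0.8 * numel(idx));
    tr = idx(1:nTrain);
    te = idx(nTrain+1:end);

    % Same model settings as in the question; fitted on the training rows only
    mdl = fitrgp(X(tr, :), y(tr), ...
        'BasisFunction', 'constant', ...
        'KernelFunction', 'exponential', ...
        'Standardize', true);

    yhat = predict(mdl, X(te, :));
    testR2(k) = 1 - sum((y(te) - yhat).^2) / sum((y(te) - mean(y(te))).^2);
end

fprintf('Test R2 over %d shuffles: mean %.3f, std %.3f\n', ...
    numel(seeds), mean(testR2), std(testR2));
```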
I hope this clarifies your doubt.