
%% Train an Ordinary Least Squares Algorithm
% This example shows how to predict the fuel economy of a car by training
% an ordinary least squares algorithm. 
%%
% Many features of a car might affect its fuel economy, e.g., its weight,
% how aerodynamic it is, and whether the driver is running the heater. The
% Statistics and Machine Learning Toolbox(TM) data set |carsmall| contains
% a sample of 100 cars, and their fuel economy is the quantity of interest.
% Load the data set and display the sampled cars' features.
% 

% Copyright 2015 The MathWorks, Inc.

load carsmall
who
%%
% Any of these features might influence a car's fuel economy; perhaps the
% most influential is the car's weight.  This example proceeds by using
% |Weight| as the feature.
%%
% There are several missing fuel economies in the sample. Save the weights
% corresponding to them for prediction. Then, remove instances with missing
% fuel economies from the data for the analysis.
nanWeight = sort(Weight(isnan(MPG))); % Predict their fuel economies
Weight    = Weight(~isnan(MPG));
MPG       = MPG(~isnan(MPG));
n         = length(MPG); % Effective sample size
%%
% |nanWeight| contains the vehicle weights corresponding to missing fuel
% economies.  |Weight| and |MPG| contain the instances without missing
% data.
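%%
% As an optional check, count the cars with missing fuel economies; the
% effective sample size |n| equals 100 minus this count.
numel(nanWeight) % Number of cars with missing MPG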
%% 
% Randomly partition the data into two sets: a training set (70%) and a
% cross-validation set (30%).
rng(2); % For reproducibility
DataPartition = cvpartition(n,'holdout',0.30);
%%
% |DataPartition| contains the training and validation set indices.  Use
% |training(DataPartition)| to extract the indices corresponding to the
% training set.  Use |test(DataPartition)| to extract the indices
% corresponding to the validation set.
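%%
% For example, a quick check of the partition sizes (an optional snippet):
sum(training(DataPartition)) % Number of training observations, about 70% of n
sum(test(DataPartition))     % Number of validation observations, about 30% of n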
%%
% Plot fuel economy against weight to discover any patterns.
tMPG = MPG(training(DataPartition,1)); % Training MPGs
tWeight = Weight(training(DataPartition,1)); % Training weights
figure
plot(tWeight,tMPG,'o');
xlabel('Weight (lbs)')
ylabel('Fuel Economy (MPG)')
title('{\bf Car Fuel Economies and Weights}')
%%
% For the training set, |MPG| decreases as |Weight| increases, roughly
% linearly at first, but it seems to level off for heavier cars.  As a
% result, suitable prediction functions include
%
% * Model 1: $h_\theta(x_i) = \theta_0 + \theta_1x_i$
% * Model 2: $h_\theta(x_i) = \theta_0 + \theta_1x_i+\theta_2x_i^2,$
%
% where $x_i$ is the weight of car $i$.
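%%
% For illustration only, you can express these hypotheses as MATLAB
% anonymous functions (|h1|, |h2|, and the |theta| values are placeholders
% introduced here, not estimates from the data):
h1 = @(theta,x) theta(1) + theta(2)*x;                 % Model 1
h2 = @(theta,x) theta(1) + theta(2)*x + theta(3)*x.^2; % Model 2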
%%
% The scales of |Weight| and its square are large.  Suppose that
% |Acceleration|, which has a relatively small scale, is included in the
% model. Then the effects of |Weight| and its square dominate the
% algorithm. Therefore, it is best practice to scale your features before
% you pass them to the learning algorithm.  Although |Acceleration| is not
% in the model, normalize |tWeight| and its square (i.e., rescale the
% features by standardizing them).
tWeight2    = tWeight.^2;
tWeightBar  = mean(tWeight);
tWeight2Bar = mean(tWeight2);
stdTWeight  = std(tWeight);
stdTWeight2 = std(tWeight2);

nTWeight  = (tWeight-tWeightBar)/stdTWeight;      % Normalized Weight
nTWeight2 = (tWeight2 - tWeight2Bar)/stdTWeight2; % Normalized Weight2
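%%
% As an optional check, verify the standardization: each normalized feature
% should have mean 0 (up to floating-point rounding) and standard
% deviation 1.
[mean(nTWeight) std(nTWeight); mean(nTWeight2) std(nTWeight2)]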
%%
% Train the ordinary least squares algorithms by fitting both models to the data.
DS = table(nTWeight,nTWeight2,tMPG);
EstMdl1 = fitlm(DS(:,[1,3]));
EstMdl2 = fitlm(DS);
%%
% |fitlm| finds the curve that minimizes the residual sum of squares (i.e.,
% the sum of the squared vertical distances from each data point to the
% curve). |EstMdl1| and |EstMdl2| are estimated linear models, and represent
% trained algorithms for Models 1 and 2.
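%%
% As an optional check of this interpretation (this snippet is not part of
% the original analysis), recompute the residual sum of squares for Model 1
% by hand; it matches the fitted model's |SSE| property.
res1 = tMPG - predict(EstMdl1,nTWeight); % Raw residuals on the training set
RSS1 = sum(res1.^2)                      % Should equal EstMdl1.SSE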
%%
% Plot the fitted regression curves.
figure
plot(tWeight,tMPG,'o');
hold on;
plot(sort(tWeight),predict(EstMdl1,sort(nTWeight)),'r',...
    'LineWidth',2)
plot(sort(tWeight),predict(EstMdl2,...
    [sort(nTWeight) sort(nTWeight2)]),'k','LineWidth',2)
xlabel('Weight (lbs)')
ylabel('Fuel Economy (MPG)')
title('{\bf Car Fuel Economies and Weights}')
legend('Data','Regression Line','Quad. Regression Curve')
hold off
%%
% Both algorithms seem to fit the training set well, but Model 1 might not
% generalize well to heavy vehicles.
%%
% Validate each model using the validation set.
pred1 = @(XTrain,yTrain,xTest) predict(EstMdl1,...
    (xTest-tWeightBar)/stdTWeight);
pred2 = @(XTrain,yTrain,xTest) predict(EstMdl2,...
    bsxfun(@rdivide,bsxfun(@minus,xTest,...
    [tWeightBar tWeight2Bar]),[stdTWeight stdTWeight2]));

MSE1 = crossval('mse',Weight,MPG,'Predfun',pred1,...
    'Partition',DataPartition)
MSE2 = crossval('mse',[Weight Weight.^2],MPG,...
    'Predfun',pred2,'Partition',DataPartition)
%%
% The MSEs, which are the means of the squared residuals for the validation
% set, are very close.  Since Model 2 is more complex than Model 1, it
% should not be surprising that it has a lower MSE.  What sets Model 2
% apart is that it levels off the predicted fuel economy for heavier
% vehicles, which might describe the true association between the two
% variables.  Model 1 continues to decrease, which might produce
% unrealistic fuel economy predictions for heavier vehicles.
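%%
% As an optional check (not part of the original example), recompute Model
% 2's MSE by hand on the validation set; it should match |MSE2| returned by
% |crossval|.
vIdx   = test(DataPartition); % Logical indices of the validation set
vResid = MPG(vIdx) - pred2([],[],[Weight(vIdx) Weight(vIdx).^2]);
mean(vResid.^2) % Should equal MSE2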
%%
% To illustrate the potential problem with Model 1, predict the average
% fuel economy of a car that weighs 6000 lbs using both models.
heavyCarPred1 = pred1([],[],6000)
heavyCarPred2 = pred2([],[],[6000 6000^2])
%%
% A fuel economy of -2.8 MPG has no practical interpretation, whereas
% 7.2 MPG does, and might seem like a reasonable prediction.
%%
% Use the validated model (Model 2) to predict the fuel economies for the
% missing data.
nanMPG   = pred2([],[],[nanWeight nanWeight.^2]);
nanMerge = [nanWeight'; nanMPG'];
fprintf('\nWeight |  MPG\n')
fprintf('---------------\n')
fprintf('%5.0f  | %6.3f\n',nanMerge(:))