%% Train an Ordinary Least Squares Algorithm
% This example shows how to predict the fuel economy of a car by training
% an ordinary least squares algorithm.
%%
% You can imagine that there are many features of a car that might affect
% its fuel economy, e.g., its weight, how aerodynamic it is, and whether
% the driver is running the heater. The Statistics and Machine Learning
% Toolbox(TM) data set |carsmall| contains a sample of 100 cars; their fuel
% economy is of interest. Load the data set and display the sampled cars'
% features.
%
% Copyright 2015 The MathWorks, Inc.

load carsmall
who
%%
% Any of the features might influence a car's fuel economy. Perhaps the
% most influential is the weight of a car. This example proceeds by using
% |Weight| as the only feature.
%%
% Several fuel economies are missing from the sample. Save the weights
% corresponding to them for prediction. Then, remove the instances with
% missing fuel economies from the data for the analysis.

nanWeight = sort(Weight(isnan(MPG))); % Predict their fuel economies
Weight = Weight(~isnan(MPG));
MPG = MPG(~isnan(MPG));
n = length(MPG);                      % Effective sample size
%%
% |nanWeight| contains the vehicle weights corresponding to the missing
% fuel economies. |Weight| and |MPG| are the instances without missing
% data.
%%
% Randomly partition the data into two sets: a training set (70%) and a
% cross-validation set (30%).

rng(2); % For reproducibility
DataPartition = cvpartition(n,'holdout',0.30);
%%
% |DataPartition| contains the training and validation set indices. Use
% |training(DataPartition)| to extract the indices corresponding to the
% training set, and |test(DataPartition)| to extract the indices
% corresponding to the validation set.
%%
% Plot fuel economy against weight to discover any patterns.
tMPG = MPG(training(DataPartition,1));       % Training MPGs
tWeight = Weight(training(DataPartition,1)); % Training weights

figure
plot(tWeight,tMPG,'o');
xlabel('Weight (lbs)')
ylabel('Fuel Economy (MPG)')
title('{\bf Car Fuel Economies and Weights}')
%%
% For the training set, as |Weight| increases, |MPG| falls linearly, but
% seems to level off. As a result, suitable prediction functions include:
%
% * Model 1: $h_\theta(x_i) = \theta_0 + \theta_1x_i$
% * Model 2: $h_\theta(x_i) = \theta_0 + \theta_1x_i+\theta_2x_i^2,$
%
% where $x_i$ is the weight of car $i$.
%%
% The scales of |Weight| and its square are large. Suppose that
% |Acceleration|, which has a relatively small scale, were included in the
% model. Then the effects of |Weight| and its square would dominate the
% algorithm. Therefore, it is best practice to scale your features before
% you pass them to the learning algorithm. Although |Acceleration| is not
% in the model, normalize |tWeight| and its square (i.e., rescale the
% features by standardizing them).

tWeight2 = tWeight.^2;
tWeightBar = mean(tWeight);
tWeight2Bar = mean(tWeight2);
stdTWeight = std(tWeight);
stdTWeight2 = std(tWeight2);
nTWeight = (tWeight - tWeightBar)/stdTWeight;     % Normalized Weight
nTWeight2 = (tWeight2 - tWeight2Bar)/stdTWeight2; % Normalized Weight2
%%
% Train the ordinary least squares algorithms by fitting both models to
% the data.

DS = dataset(nTWeight,nTWeight2,tMPG);
EstMdl1 = fitlm(DS(:,[1,3]));
EstMdl2 = fitlm(DS);
%%
% |fitlm| finds the curve that minimizes the residual sum of squares (i.e.,
% the sum of the squared vertical distances from each data point to the
% curve). |EstMdl1| and |EstMdl2| are estimated linear models, and
% represent the trained algorithms for Models 1 and 2.
%%
% Plot the fitted regression curves.

figure
plot(tWeight,tMPG,'o');
hold on;
plot(sort(tWeight),predict(EstMdl1,sort(nTWeight)),'r',...
    'LineWidth',2)
plot(sort(tWeight),predict(EstMdl2,...
    [sort(nTWeight) sort(nTWeight2)]),'k','LineWidth',2)
xlabel('Weight (lbs)')
ylabel('Fuel Economy (MPG)')
title('{\bf Car Fuel Economies and Weights}')
legend('Data','Regression Line','Quad. Regression Curve')
hold off
%%
% Both algorithms seem to fit the training set well, but Model 1 might not
% generalize well for heavier vehicles.
%%
% Validate each model using the validation set.

pred1 = @(XTrain,yTrain,xTest) predict(EstMdl1,...
    (xTest - tWeightBar)/stdTWeight);
pred2 = @(XTrain,yTrain,xTest) predict(EstMdl2,...
    bsxfun(@rdivide,bsxfun(@minus,xTest,...
    [tWeightBar tWeight2Bar]),[stdTWeight stdTWeight2]));

MSE1 = crossval('mse',Weight,MPG,'Predfun',pred1,...
    'Partition',DataPartition)
MSE2 = crossval('mse',[Weight Weight.^2],MPG,...
    'Predfun',pred2,'Partition',DataPartition)
%%
% The MSEs, which are the means of the squared residuals for the
% validation set, are very close. Because Model 2 is more complex than
% Model 1, it is not surprising that it has a lower MSE. What sets
% Model 2 apart is that it levels off the predicted fuel economy for
% heavier vehicles, which might describe the true association between the
% two variables. Model 1 continues to decrease, which might produce
% implausible predicted fuel economies for heavier vehicles.
%%
% To illustrate the potential problem with Model 1, predict the average
% fuel economy of a car that weighs 6000 lbs using both models.

heavyCarPred1 = pred1([],[],6000)
heavyCarPred2 = pred2([],[],[6000 6000^2])
%%
% A fuel economy of -2.8 MPG does not have a practical interpretation,
% whereas 7.2 MPG does, and might seem like a reasonable prediction.
%%
% Use the validated model (Model 2) to predict the fuel economies for the
% missing data.

nanMPG = pred2([],[],[nanWeight nanWeight.^2]);
nanMerge = [nanWeight'; nanMPG'];
fprintf('\nWeight | MPG\n')
fprintf('---------------\n')
fprintf('%5.0f | %6.3f\n',nanMerge(:))
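%%
% The following cell is an optional sanity check, not part of the original
% workflow. |fitlm| solves the ordinary least squares problem
% $\hat{\theta} = \arg\min_\theta \sum_i \left(y_i - h_\theta(x_i)\right)^2$,
% whose closed-form solution is $\hat{\theta} = (X'X)^{-1}X'y$ for a design
% matrix $X$. In MATLAB, the backslash operator computes the same least
% squares solution in a numerically stable way, so the coefficient
% estimates below should match |EstMdl2.Coefficients.Estimate|. The
% holdout MSE can likewise be reproduced by hand, and should match |MSE2|
% from |crossval|. The variable names |X2|, |thetaHat2|, |vIdx|, |vPred|,
% and |mse2Manual| are introduced here for illustration only.

X2 = [ones(numel(tMPG),1) nTWeight nTWeight2]; % Model 2 design matrix
thetaHat2 = X2\tMPG                            % [theta0; theta1; theta2]

vIdx = test(DataPartition,1);                  % Validation-set indices
vPred = pred2([],[],[Weight(vIdx) Weight(vIdx).^2]);
mse2Manual = mean((MPG(vIdx) - vPred).^2)      % Should agree with MSE2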