Hands-On Linear Regression
Modelling MPG using Auto Features
Introduction
This report demonstrates the application of multiple linear regression to model miles per gallon (mpg) using predictors in the Auto data set. The analysis aims to explore relationships, assess predictor significance, and evaluate the regression model’s fit.
Variables in the Auto data set include:
mpg: miles per gallon
cylinders: number of cylinders, between 4 and 8
displacement: engine displacement (cu. inches)
horsepower: engine horsepower
weight: vehicle weight (lbs)
acceleration: time to accelerate from 0 to 60 mph (sec)
year: model year
origin: origin of car (1 = American, 2 = European, 3 = Japanese)
Exploratory Data Analysis
Scatterplot Matrix
The Auto data set, displayed in the scatterplot matrix, records variables for a number of vehicles. Each panel of the scatterplot matrix is a scatterplot for a pair of variables, with identities indicated by the corresponding row and column labels. For example, the scatterplot directly to the right of the word “mpg” depicts mpg versus cylinders, while the plot directly to the right of “cylinders” corresponds to cylinders versus displacement.
The scatterplot matrix reveals several notable patterns. A strong negative relationship is observed between mpg and predictors such as displacement, horsepower, and weight, indicating that higher values of these variables are associated with lower fuel efficiency. For example, the mpg versus weight panel shows a clear downward trend, suggesting heavier vehicles tend to have lower mpg. Additionally, relationships among predictors, such as the positive correlation between displacement and weight, suggest potential multicollinearity to investigate further in the regression analysis.
Correlation Matrix
The correlation matrix quantifies associations between the quantitative variables in the Auto data set.
variable | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin |
---|---|---|---|---|---|---|---|---|
mpg | 1.0000 | | | | | | | |
cylinders | -0.7776 | 1.0000 | | | | | | |
displacement | -0.8051 | 0.9508 | 1.0000 | | | | | |
horsepower | -0.7784 | 0.8430 | 0.8973 | 1.0000 | | | | |
weight | -0.8322 | 0.8975 | 0.9330 | 0.8645 | 1.0000 | | | |
acceleration | 0.4233 | -0.5047 | -0.5438 | -0.6892 | -0.4168 | 1.0000 | | |
year | 0.5805 | -0.3456 | -0.3699 | -0.4164 | -0.3091 | 0.2903 | 1.0000 | |
origin | 0.5652 | -0.5689 | -0.6145 | -0.4552 | -0.5850 | 0.2127 | 0.1815 | 1.0000 |
Correlation matrix for the Auto data.
The correlation matrix quantifies the strong negative relationships observed in the scatterplot matrix between mpg and key predictors such as weight (-0.832), displacement (-0.805), horsepower (-0.778), and cylinders (-0.778). This confirms that larger, heavier, and more powerful cars consistently achieve lower fuel efficiency and reinforces the downward trend between weight and mpg seen in the scatterplot. Additionally, the matrix highlights strong correlations among the predictors themselves, particularly between cylinders and displacement (0.951), displacement and weight (0.933), and displacement and horsepower (0.897). These variables collectively represent aspects of overall vehicle size and power and are a likely source of multicollinearity.
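As a reading aid, the squared correlation gives the share of mpg variance that a single predictor would explain in a one-predictor linear regression. The following is a minimal sketch in plain Python that uses only the mpg column of the correlation matrix; the variable names are illustrative and not part of the original analysis.

```python
# Correlations with mpg, copied from the mpg column of the correlation matrix.
corr_with_mpg = {
    "cylinders": -0.7776,
    "displacement": -0.8051,
    "horsepower": -0.7784,
    "weight": -0.8322,
    "acceleration": 0.4233,
    "year": 0.5805,
    "origin": 0.5652,
}

# In a regression with one predictor, R-squared equals the squared correlation.
r_squared_single = {name: r ** 2 for name, r in corr_with_mpg.items()}

# Rank predictors from strongest to weakest marginal association with mpg.
ranked = sorted(r_squared_single.items(), key=lambda kv: kv[1], reverse=True)
for name, r2 in ranked:
    print(f"{name:>12}: r = {corr_with_mpg[name]:+.4f}, r^2 = {r2:.3f}")
```

On these numbers, weight alone would account for roughly 69% of the variance in mpg, which is why it dominates the later models.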
Regression Model
To assess the relationship between mpg and the quantitative predictors in the Auto data set, a multiple linear regression model is fitted using all predictors (cylinders, displacement, horsepower, weight, acceleration, year, and origin) while excluding the qualitative name variable.
Term | Coefficient | Std. error | t-statistic | p-value |
---|---|---|---|---|
(Intercept) | -17.2184 | 4.6443 | -3.7 | 2e-04 |
cylinders | -0.4934 | 0.3233 | -1.5 | 0.1278 |
displacement | 0.0199 | 0.0075 | 2.6 | 0.0084 |
horsepower | -0.0170 | 0.0138 | -1.2 | 0.2196 |
weight | -0.0065 | 0.0007 | -9.9 | < 0.0001 |
acceleration | 0.0806 | 0.0988 | 0.8 | 0.4155 |
year | 0.7508 | 0.0510 | 14.7 | < 0.0001 |
origin | 1.4261 | 0.2781 | 5.1 | < 0.0001 |
Table 1.2 displays the multiple regression coefficient estimates when cylinders, displacement, horsepower, weight, acceleration, year, and origin are used to predict mpg using the Auto data.
The model explains 82.15% of the variance in mpg (R² = 0.8215), and the overall relationship is highly significant (F = 252.4, p < 2.2e-16). Weight, year, origin, and displacement are statistically significant at the 5% level, while cylinders, horsepower, and acceleration are not. Notably, the year coefficient (0.75) indicates that fuel efficiency improves by about 0.75 mpg per model year, holding the other predictors fixed.
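To make the coefficient interpretation concrete, the sketch below plugs the Table 1.2 estimates into the fitted equation for a hypothetical car and then advances the model year by one. The car's attribute values and the helper name predict_mpg are assumptions made for illustration. Because the table reports rounded coefficients, the prediction is approximate.

```python
# Coefficient estimates copied from Table 1.2 (rounded as reported).
coef = {
    "(Intercept)": -17.2184,
    "cylinders": -0.4934,
    "displacement": 0.0199,
    "horsepower": -0.0170,
    "weight": -0.0065,
    "acceleration": 0.0806,
    "year": 0.7508,
    "origin": 1.4261,
}


def predict_mpg(car):
    """Fitted mpg: intercept plus the sum of coefficient * predictor value."""
    return coef["(Intercept)"] + sum(coef[name] * value for name, value in car.items())


# Hypothetical car: illustrative values, not a row from the Auto data set.
car_1976 = {"cylinders": 4, "displacement": 120, "horsepower": 90,
            "weight": 2500, "acceleration": 15, "year": 76, "origin": 1}
car_1977 = dict(car_1976, year=77)  # the same car, one model year later

baseline = predict_mpg(car_1976)
year_effect = predict_mpg(car_1977) - baseline  # equals the year coefficient

print(f"fitted mpg for the hypothetical car: {baseline:.2f}")
print(f"change from one extra model year:    {year_effect:.4f}")
```

The difference between the two predictions is the year coefficient itself, which is what "holding the other predictors fixed" means in practice.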
Diagnostic Analysis
Diagnostic plots are generated to evaluate the fit of the multiple linear regression model and identify potential issues such as non-linearity, heteroscedasticity, outliers, or high-leverage points.
The diagnostic plots reveal several issues with the model fit. The Residuals vs Fitted plot (top left) displays a slight U-shaped pattern, with residuals trending downward for fitted values below 20 and upward above 25, suggesting the linear model may not fully capture the data’s underlying structure. The Q-Q Residuals plot (top right) shows deviations from the diagonal at the tails, particularly for standardized residuals beyond ±2, with observations 323, 326, and 390 appearing as unusually large outliers. The Scale-Location plot (bottom left) indicates heteroscedasticity, as the spread of standardized residuals increases with fitted values, especially above 25, violating the assumption of constant variance.
The Residuals vs Leverage plot (bottom right) identifies observation 140 with unusually high leverage, exceeding 0.15, and flags observations 323, 326, and 327 as outliers; however, none of these points exceed a Cook’s distance of 0.5, suggesting they have limited influence on the overall fit. These findings highlight areas for potential improvement in the model, to be explored through interaction effects and variable transformations in subsequent analyses.
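One way to judge whether a leverage of 0.15 is large is to compare it with the average leverage, which for a model with an intercept is always (p + 1) / n. The sketch below does that arithmetic. It assumes n = 392 complete observations, a figure the report does not state, and it treats 0.15 as a lower bound because the report says only that the value is exceeded.

```python
n_obs = 392        # assumption: complete rows in the Auto data (not stated in the report)
n_predictors = 7   # predictors in the base model (Table 1.2)

# For a least squares fit with an intercept, leverages average (p + 1) / n.
mean_leverage = (n_predictors + 1) / n_obs

reported_leverage = 0.15  # lower bound quoted for the high-leverage observation
ratio = reported_leverage / mean_leverage

print(f"mean leverage = {mean_leverage:.4f}")
print(f"ratio to mean = {ratio:.1f}x")
```

A common rule of thumb treats points above two or three times the mean leverage as worth inspecting, so a ratio this large supports the reading of the plot, even though the low Cook's distance indicates the point does not distort the fit much.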
Interaction Effects
To explore whether interactions between predictors improve the model fit, linear regression models incorporating interaction terms are fitted using the * and : symbols, building on the non-linearity observed in the diagnostic analysis.
Term | Coefficient | Std. error | t-statistic | p-value |
---|---|---|---|---|
(Intercept) | 2.8757 | 4.5106 | 0.6 | 0.5241 |
horsepower | -0.2313 | 0.0236 | -9.8 | < 0.0001 |
weight | -0.0112 | 0.0007 | -15.4 | < 0.0001 |
cylinders | -0.0296 | 0.2881 | -0.1 | 0.9184 |
displacement | 0.0059 | 0.0067 | 0.9 | 0.3786 |
acceleration | -0.0902 | 0.0886 | -1.0 | 0.3091 |
year | 0.7695 | 0.0449 | 17.1 | < 0.0001 |
origin | 0.8344 | 0.2513 | 3.3 | 0.001 |
horsepower:weight | 0.0001 | 0.0000 | 10.6 | < 0.0001 |
Table 1.2a displays the least squares coefficient estimates associated with the regression of mpg onto horsepower, weight, and other predictors, including an interaction term horsepower * weight.
For the first model with horsepower * weight, the interaction term horsepower:weight is highly significant (p < 2e-16), with a coefficient of 5.529e-05, suggesting that the combined effect of horsepower and weight on mpg is statistically meaningful, improving the model’s R-squared to 0.8618 from 0.8215 in the base model.
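With an interaction in the model, the horsepower coefficient is no longer a single slope: the change in fitted mpg per unit of horsepower depends on weight. The sketch below evaluates that slope at a few weights, using the horsepower estimate from Table 1.2a and the unrounded interaction coefficient quoted in the text. The chosen weights and the function name are illustrative assumptions.

```python
# Estimates for the horsepower * weight model.
b_horsepower = -0.2313      # main effect of horsepower, Table 1.2a
b_interaction = 5.529e-05   # horsepower:weight, unrounded value quoted in the text


def horsepower_slope(weight_lbs):
    """Change in fitted mpg per extra unit of horsepower at a given weight."""
    return b_horsepower + b_interaction * weight_lbs


# Illustrative weights in lbs (assumed for this example, not taken from the report).
for weight_lbs in (2000, 3000, 4000):
    slope = horsepower_slope(weight_lbs)
    print(f"weight {weight_lbs} lbs: {slope:+.4f} mpg per unit of horsepower")
```

On these estimates, extra horsepower costs less mpg in heavier cars: the slope is negative at all three weights but moves toward zero as weight increases.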
Term | Coefficient | Std. error | t-statistic | p-value |
---|---|---|---|---|
(Intercept) | -5.3892 | 4.3005 | -1.3 | 0.2109 |
displacement | -0.0684 | 0.0110 | -6.2 | < 0.0001 |
weight | -0.0106 | 0.0007 | -14.9 | < 0.0001 |
cylinders | 0.1175 | 0.2943 | 0.4 | 0.6899 |
horsepower | -0.0328 | 0.0124 | -2.6 | 0.0084 |
acceleration | 0.0672 | 0.0880 | 0.8 | 0.4455 |
year | 0.7852 | 0.0455 | 17.2 | < 0.0001 |
origin | 0.5610 | 0.2622 | 2.1 | 0.0331 |
displacement:weight | 0.0000 | 0.0000 | 10.1 | < 0.0001 |
Table 1.2b displays the least squares coefficient estimates associated with the regression of mpg onto displacement, weight, and other predictors, including an interaction term displacement * weight.
The second model with displacement * weight also shows a significant interaction (p < 2e-16, coefficient 2.269e-05), with an R-squared of 0.8588, indicating that engine size and weight together influence mpg.
Term | Coefficient | Std. error | t-statistic | p-value |
---|---|---|---|---|
(Intercept) | -118.5635 | 13.3765 | -8.9 | < 0.0001 |
year | 2.0841 | 0.1732 | 12.0 | < 0.0001 |
weight | 0.0304 | 0.0047 | 6.5 | < 0.0001 |
cylinders | -0.1218 | 0.3032 | -0.4 | 0.6881 |
displacement | 0.0129 | 0.0070 | 1.8 | 0.0663 |
horsepower | -0.0288 | 0.0129 | -2.2 | 0.0259 |
acceleration | 0.1447 | 0.0920 | 1.6 | 0.1164 |
origin | 1.1736 | 0.2597 | 4.5 | < 0.0001 |
year:weight | -0.0005 | 0.0001 | -8.0 | < 0.0001 |
Table 1.2c displays the least squares coefficient estimates associated with the regression of mpg onto year, weight, and other predictors, including an interaction term year * weight.
The third model with year * weight yields a significant interaction (p = 1.47e-14, coefficient -4.879e-04), with an R-squared of 0.847, suggesting that the effect of weight on mpg varies with model year.
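All three models fit better than the base model (F-statistics: 298.6, 291.1, and 265.1, respectively, vs. 252.4; R-squared 0.8618, 0.8588, and 0.847 vs. 0.8215). Their significant interaction terms (p < 0.05) indicate that the effect of one predictor in each pair depends on the level of the other, a departure from the purely additive base model.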
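The reported F-statistics can be cross-checked against the reported R-squared values, since the overall F-statistic is a function of R-squared, the number of predictor terms, and the sample size. The sketch below performs that check. It assumes n = 392 complete observations, which the report does not state, so small differences from the reported values reflect rounding of R-squared.

```python
n_obs = 392  # assumption: complete rows in the Auto data (not stated in the report)


def f_statistic(r_squared, n_terms, n=n_obs):
    """Overall F-statistic implied by R-squared for a model with an intercept."""
    explained_per_term = r_squared / n_terms
    unexplained_per_df = (1 - r_squared) / (n - n_terms - 1)
    return explained_per_term / unexplained_per_df


# (R-squared, number of predictor terms, reported F) for each model in the text.
models = {
    "base": (0.8215, 7, 252.4),
    "horsepower * weight": (0.8618, 8, 298.6),
    "displacement * weight": (0.8588, 8, 291.1),
    "year * weight": (0.847, 8, 265.1),
}

implied = {}
for name, (r2, n_terms, reported) in models.items():
    implied[name] = f_statistic(r2, n_terms)
    print(f"{name:>22}: implied F = {implied[name]:.1f}, reported F = {reported}")
```

Under that assumption, all four implied values fall within about 0.1 of the reported F-statistics, so the R-squared and F figures are mutually consistent.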
Variable Transformations
To further refine the model and address the non-linearity and heteroscedasticity identified in the diagnostic analysis, various transformations of the variables are applied, including log(X), √X, and X^2.
Term | Coefficient | Std. error | t-statistic | p-value |
---|---|---|---|---|
(Intercept) | 78.0386 | 9.9324 | 7.9 | < 0.0001 |
log(horsepower) | -21.2237 | 1.9837 | -10.7 | < 0.0001 |
weight | -0.0274 | 0.0028 | -9.7 | < 0.0001 |
displacement | 0.0046 | 0.0066 | 0.7 | 0.483 |
year | 0.7611 | 0.0450 | 16.9 | < 0.0001 |
origin | 0.8087 | 0.2515 | 3.2 | 0.0014 |
cylinders | -0.2713 | 0.2832 | -1.0 | 0.3387 |
acceleration | -0.1675 | 0.0966 | -1.7 | 0.0837 |
log(horsepower):weight | 0.0049 | 0.0006 | 8.4 | < 0.0001 |
Table 1.3a displays the least squares coefficient estimates associated with the regression of mpg onto log(horsepower), weight, and other predictors, including an interaction term log(horsepower) * weight, using the Auto data.
Term | Coefficient | Std. error | t-statistic | p-value |
---|---|---|---|---|
(Intercept) | 20.9353 | 4.7715 | 4.4 | < 0.0001 |
sqrt(weight) | -1.0617 | 0.0710 | -15.0 | < 0.0001 |
displacement | -0.1118 | 0.0183 | -6.1 | < 0.0001 |
horsepower | -0.0307 | 0.0125 | -2.5 | 0.0143 |
year | 0.7891 | 0.0456 | 17.3 | < 0.0001 |
origin | 0.5659 | 0.2630 | 2.2 | 0.032 |
cylinders | -0.0751 | 0.2918 | -0.3 | 0.7971 |
acceleration | 0.0547 | 0.0880 | 0.6 | 0.5344 |
sqrt(weight):displacement | 0.0021 | 0.0003 | 7.8 | < 0.0001 |
Table 1.3b displays the least squares coefficient estimates associated with the regression of mpg onto sqrt(weight), displacement, and other predictors, including an interaction term sqrt(weight) * displacement, using the Auto data.
Term | Coefficient | Std. error | t-statistic | p-value |
---|---|---|---|---|
(Intercept) | -13.7793 | 4.4883 | -3.1 | 0.0023 |
I(displacement^2) | 0.0001 | 0.0000 | 6.5 | < 0.0001 |
horsepower | -0.0425 | 0.0138 | -3.1 | 0.0023 |
weight | -0.0064 | 0.0006 | -11.0 | < 0.0001 |
year | 0.7644 | 0.0489 | 15.6 | < 0.0001 |
origin | 1.3374 | 0.2537 | 5.3 | < 0.0001 |
cylinders | -0.7083 | 0.2614 | -2.7 | 0.007 |
acceleration | 0.0747 | 0.0943 | 0.8 | 0.4288 |
Table 1.3c displays the least squares coefficient estimates associated with the regression of mpg onto a squared displacement term, I(displacement^2), and other predictors using the Auto data.
Across all models, the log(horsepower) * weight model fits best (R-squared 0.8625, residual standard error 2.924), slightly ahead of the best untransformed interaction model (R-squared 0.8618). The log transformation may help with the non-linearity and heteroscedasticity seen earlier, although the diagnostic plots would need to be regenerated to confirm this. The sqrt(weight) * displacement model (R-squared 0.8584) also improves on the base model, while the I(displacement^2) model (R-squared 0.8361) improves the least. The persistent non-significance of cylinders and displacement in some models suggests collinearity, which calls for further investigation.
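Because the compared models do not all have the same number of terms (eight with an interaction, seven without), plain R-squared slightly favours the larger ones. Adjusted R-squared is one standard correction. It is not reported in this analysis, so the sketch below is an added check rather than part of the original comparison, and it again assumes n = 392 complete observations.

```python
n_obs = 392  # assumption: complete rows in the Auto data (not stated in the report)


def adjusted_r_squared(r_squared, n_terms, n=n_obs):
    """R-squared penalised for the number of predictor terms in the model."""
    return 1 - (1 - r_squared) * (n - 1) / (n - n_terms - 1)


# (R-squared, number of predictor terms) as reported for each model.
models = {
    "log(horsepower) * weight": (0.8625, 8),
    "horsepower * weight": (0.8618, 8),
    "sqrt(weight) * displacement": (0.8584, 8),
    "I(displacement^2)": (0.8361, 7),
    "base": (0.8215, 7),
}

adjusted = {name: adjusted_r_squared(r2, k) for name, (r2, k) in models.items()}
for name, value in sorted(adjusted.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>28}: adjusted R-squared = {value:.4f}")
```

Under that assumption the ranking of the models is unchanged after the adjustment, so the extra interaction term does not account for the improvement on its own.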
Collinearity Assessment
The log(horsepower) * weight model (Table 1.3a) is selected as the final model due to its superior fit (R-squared 0.8625, residual standard error 2.924). To address persistent non-significance of predictors like cylinders and displacement, collinearity is assessed using Variance Inflation Factors (VIF).
Term | VIF |
---|---|
log(horsepower) | 21.21 |
weight | 261.22 |
displacement | 21.57 |
year | 1.26 |
origin | 1.88 |
cylinders | 10.67 |
acceleration | 3.25 |
log(horsepower):weight | 380.62 |
Table 1.4 displays the Variance Inflation Factors (VIF) for predictors in the final model to assess collinearity.
The VIF results reveal collinearity among several predictors. The interaction term log(horsepower):weight has an exceptionally high VIF of 380.62. This is expected, because the term is the product of log(horsepower) (VIF 21.21) and weight (VIF 261.22), and a product is strongly correlated with its own components; those two main effects have inflated VIFs for the same reason. Similarly, displacement (VIF 21.57) and cylinders (VIF 10.67) show elevated collinearity, consistent with their strong correlations with weight (0.933 and 0.898, respectively) and with each other (0.951) in the correlation matrix.
This explains their non-significance in the model, as their effects may be overshadowed by weight and the interaction. In contrast, acceleration (VIF 3.25), year (VIF 1.26), and origin (VIF 1.88) have low VIFs, indicating minimal collinearity with other predictors. Despite the high VIFs, the model’s strong predictive performance (R-squared 0.8625) and significant coefficients suggest that retaining all terms is justified, though dropping cylinders or displacement could simplify the model without substantial loss of explanatory power.
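A VIF is easier to read once converted back to the R-squared it comes from: VIF = 1 / (1 - R²), where R² is obtained by regressing that term on all the other terms in the model. The sketch below inverts this relationship for the values in Table 1.4 and flags terms above a VIF of 10. That cutoff is a common rule of thumb adopted here as an assumption, not a threshold stated in the report.

```python
# Variance inflation factors copied from Table 1.4.
vif = {
    "log(horsepower)": 21.21,
    "weight": 261.22,
    "displacement": 21.57,
    "year": 1.26,
    "origin": 1.88,
    "cylinders": 10.67,
    "acceleration": 3.25,
    "log(horsepower):weight": 380.62,
}


def implied_r_squared(vif_value):
    """R-squared of a term regressed on the other terms, from VIF = 1 / (1 - R^2)."""
    return 1 - 1 / vif_value


VIF_CUTOFF = 10  # assumption: a commonly used rule-of-thumb threshold

implied = {term: implied_r_squared(value) for term, value in vif.items()}
flagged = [term for term, value in vif.items() if value > VIF_CUTOFF]

for term, value in vif.items():
    marker = "high" if term in flagged else "ok"
    print(f"{term:>24}: VIF = {value:7.2f}, implied R^2 = {implied[term]:.4f} ({marker})")
```

Read this way, more than 99% of the variation in weight is reproduced by the other terms in the final model, while year shares only about a fifth of its variation with them.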
Conclusion
Overall, this analysis demonstrates a robust relationship between the predictors and mpg, with weight, year, and origin as key drivers, enhanced by interactions and transformations. However, residual outliers, mild heteroscedasticity, and multicollinearity suggest limitations. Future steps could include robust regression to handle outliers or further variable selection to reduce collinearity, building on the improved fit achieved here.