ML Regression: House Price Prediction
Machine Learning allows us to make predictions in the form of categories (Classification) or continuous values (Regression). This project is an example of the latter: given information on real-estate properties, our job is to predict the sale price for properties we haven't seen before.
1. Initial look at the data
The Kaggle Housing Dataset describes the sale of individual residential properties in Ames, Iowa from 2006 to 2010. It contains 2930 observations and a total of 80 variables (features). It was initially compiled by Professor Dean De Cock at Truman State University in 2011, who published this informative paper.
Kaggle.com offers a variety of interesting datasets. Some form the basis for competitions, as is the case with this one. As such, we only have access to 1460 instances where we know the final sale price. The goal is to accurately predict the sale price for the other instances.
Some vocabulary that will be used here:
Independent variable / feature: a measurable characteristic such as the number of rooms in a house or the lot size
Dependent variable / target / label: the information that we would like to predict using features
Model: this can be used to mean both the mathematical patterns that describe relationships within the data (and which allow us to make predictions) and the algorithms used to find those relationships.
Along with the dataset, we are given a text file with descriptions for all the variables. The complete file can be found in the notebook on Kaggle. Here are the first few entries:
MSSubClass: Identifies the type of dwelling involved in the sale.

        20   1-STORY 1946 & NEWER ALL STYLES
        30   1-STORY 1945 & OLDER
        40   1-STORY W/FINISHED ATTIC ALL AGES
        45   1-1/2 STORY - UNFINISHED ALL AGES
        50   1-1/2 STORY FINISHED ALL AGES
        60   2-STORY 1946 & NEWER
        70   2-STORY 1945 & OLDER
        75   2-1/2 STORY ALL AGES
        80   SPLIT OR MULTI-LEVEL
        85   SPLIT FOYER
        90   DUPLEX - ALL STYLES AND AGES
       120   1-STORY PUD (Planned Unit Development) - 1946 & NEWER
       150   1-1/2 STORY PUD - ALL AGES
       160   2-STORY PUD - 1946 & NEWER
       180   PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
       190   2 FAMILY CONVERSION - ALL STYLES AND AGES

MSZoning: Identifies the general zoning classification of the sale.

        A    Agriculture
        C    Commercial
        FV   Floating Village Residential
        I    Industrial
        RH   Residential High Density
        RL   Residential Low Density
        RP   Residential Low Density Park
        RM   Residential Medium Density

LotFrontage: Linear feet of street connected to property

LotArea: Lot size in square feet
An essential part of the Machine Learning process is to become acquainted with the dataset and check for missing values, data types, etc. An initial check reveals that there are 1460 samples available to us, 80 features and 1 target variable ('SalePrice'). There are missing values that we'll have to explore, as well as a mix of data types.
Four of the features have a high number (more than 50%) of missing values: Alley, PoolQC, Fence and MiscFeature.
Important variables such as SalePrice (which will be our target) and LotArea (which we suspect will be an influential feature) are heavily skewed to the right, meaning that a few values are much higher than the rest. Several histograms show a prevalence of the value 0, which we can assume represents the absence of certain property features. For example, properties with a "GarageArea" value of 0 don't have a garage. Similarly for "WoodDeckSF", "OpenPorchSF" and others.
2. Training and Holdout Sets
In order to properly evaluate and tune the models later on, we need to separate a subset of our training data. This subset, called the 'holdout' set, should not be available during Exploratory Data Analysis, Feature Engineering, Feature Selection, Model Selection and Model hyper-parameter Tuning, so as to avoid 'fitting' our model to it as much as possible.
We want the holdout set to contain a similar proportion of samples from different price-ranges so that the results contain meaningful information as to what changes we'll need to make to our model. We performed a Stratified Shuffle Split where the samples were randomly shuffled before splitting the dataset, while maintaining the different "strata" on SalePrice.
Because the dataset available to us is relatively small, we don't want to lose access to too many samples. We chose to create a holdout set using only 20% of the samples. We still have access then to 1168 samples for the next sections.
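As a rough sketch, the split could be done with scikit-learn's StratifiedShuffleSplit, binning SalePrice into price-range strata first. The DataFrame name `train`, the number of bins and the random seed are assumptions for illustration, not the exact notebook code:

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Assumed setup: the Kaggle training CSV loaded into a DataFrame.
train = pd.read_csv("train.csv")

# Bin SalePrice into a handful of price-range "strata" to stratify on.
train["price_bin"] = pd.qcut(train["SalePrice"], q=5, labels=False)

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, holdout_idx = next(splitter.split(train, train["price_bin"]))

train_set = train.iloc[train_idx].drop(columns="price_bin")
holdout_set = train.iloc[holdout_idx].drop(columns="price_bin")
```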
3. Exploratory Data Analysis
Something that stood out right away was that a few houses were much more expensive than the rest, or that they were exceptionally large. If our model learns patterns from these samples it may not generalize well to other data, as these are likely exceptional. We decided to drop any sample with 'GrLivArea' higher than 4000 as recommended by Professor Dean De Cock himself.
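That filter is a one-liner in pandas (assuming the training split lives in a DataFrame called `train_set`, as in the sketch above):

```python
# Drop the handful of very large homes (GrLivArea > 4000), per De Cock's recommendation.
train_set = train_set[train_set["GrLivArea"] <= 4000].copy()
```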
Numerical Features
Although the chosen model (Gradient Boosting) easily handles multi-collinearity, we still want to leave the door open to using other models in the future. Therefore, we analyze the relationships between features and evaluate how dependent they are on each other.
For example, if we want to be able to use Linear Regression in our model, then we have to be mindful of correlations between features. That is because in Linear Regression, we try to isolate the relationship between each independent variable and the dependent variable. In other words, a Linear Regression model describes how much the dependent variable 'moves' as an independent variable changes value, while holding all other variables constant. However, if two independent variables are correlated, then by changing one we are effectively changing the other.
To evaluate potential multi-collinearity, we observe the Pearson correlation coefficient between variables. A value of 1 means the variables are perfectly positively correlated (if one goes up, the other goes up in proportion). A value of -1 is similar but with movement in opposite directions. A value of 0 means there is no linear relationship between the variables.
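For reference, a minimal way to inspect these coefficients with pandas (again assuming the training split is in `train_set`; the notebook may plot a heatmap instead):

```python
# Pearson correlation matrix for the numerical features.
corr = train_set.select_dtypes(include="number").corr(method="pearson")

# Correlation of every numerical feature with the target, strongest first.
print(corr["SalePrice"].sort_values(ascending=False))
```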
Some strong relationships:
EnclosedPorch is negatively correlated with YearBuilt
BsmtFullBath is positively correlated with BsmtFinSF2 (if there is a finished basement, chances are there will be a full bathroom in it) and negatively with BsmtUnfSF (if the basement is unfinished, it will probably not have a full bathroom)
TotRmsAbvGrd is positively correlated with GrLivArea (above-ground living area's sqft)
TotRmsAbvGrd is positively correlated with BedroomAbvGr (bedrooms above ground)
GarageArea is positively correlated with GarageCars
TotalBsmtSF is strongly positively correlated with 1stFlrSF
Let's look at these relationships more closely:
We can now see that the following correlations are very strong (>0.5 or <-0.5):
GarageArea/GarageCars
TotRmsAbvGrd/GrLivArea
TotRmsAbvGrd/BedroomAbvGr
TotalBsmtSF/1stFlrSF
It's also informative to look at the linear relationship between each independent variable (our features) and the dependent variable ('SalePrice' in our case). If we decide to use a Linear Regression model, this could be helpful, as we'll be able to tell which features 'inform' our target.
Three numerical features contain missing values:
    LotFrontage    205
    MasVnrArea       8
    GarageYrBlt     61
LotFrontage: There are no values of 0, so we assume that the homes with no street frontage simply have an empty field in this column. We will insert a 0 in those fields.
MasVnrArea: Similarly, we assume that the 8 missing values are missing simply because there is no masonry veneer area, and fill them with 0 as well.
GarageYrBlt: Because inserting a 0 in this feature wouldn't make sense, we will instead insert the mean.
Categorical Features
MSSubClass: With the exception of “75 2-1/2 STORY ALL AGES”, the higher-priced categories correspond to newer construction. This is already captured, and with more precision, by YearBuilt.
MSZoning: RL (Residential Low Density) and FV (Floating Village Residential) are in general more expensive than the others. RL has many outliers, some very far from the inter-quartile range. The vast majority of the samples are RL (920 out of 1460), while RH (Residential High Density) and C (Commercial) are severely under-sampled: only 9 and 13 instances respectively.
Street: This category is severely imbalanced (6 ‘Gravel’ vs 1162 ‘Paved’) and there is no huge distinction in prices (the difference in medians looks to be around $40,000)
Alley: 1086 missing values: almost 93% of dataset. Homes with paved alleys are more expensive than those with gravel alleys with a difference in medians of about $100,000.
LotShape: Slightly Irregulars (IR1) tend to be more expensive than Regulars (Reg). Lots that are Moderately Irregular (IR2) or Irregular (IR3) are severely under-sampled and tend to have a similar distribution to IR1. We decided to consolidate IR1, IR2 and IR3 into one category.
LandContour: ‘Near Flat/Level’ (Lvl) category is severely oversampled: it accounts for almost 90% of instances. Hillside and Depressed land contours tend to correspond to higher sale prices, and Banked tend to be slightly cheaper.
Utilities: ‘All Public Utilities’ appears in all except 1 of the instances, which happens to land close to the ‘AllPub’ median anyway. We discard this feature.
LotConfig: "CulDSac" homes are in general more expensive, but we should be careful about generalizing since "Inside" and "Corner" have a lot of outliers. In fact, the most expensive properties fall into the "Corner" category. Part of the problem is again the under-sampling of some categories: "Inside" and "Corner", which have similar distributions, account for 1052 samples in total, while "CulDSac" only has 74 instances.
LandSlope: "Mod" and "Sev" are slightly more expensive in general than "Gtl", with medians of about $180,000 vs $160,000. "Gtl" has 1104/1168 samples, however, so we should be careful about generalizing.
Neighborhood: Categories are (mostly) evenly distributed and can be grouped into 3 broad ranges according to the median SalePrice:
$80,000 - $150,000: BrDale, BrkSide, Edwards, IDOTRR, MeadowV, NAmes, OldTown, SWISU, Sawyer
$150,000 - $250,000: Blmngtn, ClearCr, CollgCr, Crawfor, Gilbert, Mitchel, NWAmes, SawyerW, Somerst, Timber, Veenker
$250,000 +: NoRidge, NridgHt, StoneBr
Condition1: "Normal" accounts for 1010/1168 instances, which cover all the range of prices. "PosA" and to a lesser extent "PosN" are in general more expensive than others. They have only 4 and 15 instances respectively.
Condition2: "Normal" accounts for 1155 of the instances. We discard this feature.
BldgType: 1Fam (Single-family Detached) accounts for 973/1168 instances and covers the whole range of prices. TwnhsE, with 95 instances, corresponds to more expensive properties than 2fmCon, Duplex and Twnhs (here we assume that Twnhs and the description's TwnhsI are the same thing and it's just a typo). However, its distribution seems to align well with 1Fam.
HouseStyle: 2Story (359 instances) has a slightly more expensive range than the rest, with the exception of 2.5Fin (9 instances). We consolidate these two into their own category and the rest into another one.
RoofStyle: There is nothing conclusive about this one. First, there's a severe class imbalance: 4 out of the 6 categories have less than 10 instances each. Gable has 915/1168 instances, but it contains many outliers that might give Hip (the seemingly more expensive category) a run for its money.
RoofMatl: CompShg has 1150/1168 instances and covers the whole price range. We discard this feature.
Exterior1st: We see that AsbShng and AsphShn correspond to lower SalePrice than others, but together they only account for 18 samples! VinylSd has 416 samples and a higher median than most others, but the IQR is huge.
Exterior2nd: Same issue as with Exterior1st.
MasVnrType: 8 missing values (0.7% of total). Stone usually corresponds to higher prices, BrkCmn to lower prices, and BrkFace is all over the price range. None has most of the instances (685/1168) and spans the range of prices as well, but in general slightly lower than BrkFace.
ExterQual: One of the clearest separations of class. No missing values.
ExterCond: We expected this one to also have a clear correspondence with value, but TA (1033/1168 instances) spans the whole price range, while the others are unremarkable. Ex should theoretically correspond to higher prices, but we only have 2 samples.
Foundation: Clearly, houses with PConc (poured concrete) foundations tend to be pricier. BrkTil and CBlock look like they have a similar distribution. Slab and Stone are a little cheaper, but it's hard to tell... they have 21 and 5 instances respectively. Wood has only one instance, which falls within the PConc IQR. We consolidate into two broad categories: PConc & Wood, and the rest.
BsmtQual (Evaluates the height of the basement): 31 missing values (2.65% of total). No Poor or NA. The IQRs are clearly demarcated, with the expected Ex corresponding to more expensive houses and so on. Where to put the 31 NaNs? Gd and TA have pretty different distributions and they both share a similar proportion of instances (492 & 517 respectively).
BsmtCond (Evaluates the general condition of the basement): Same missing values as BsmtQual. Almost all entries are TA (1057/1168) and it covers the whole range of prices.
BsmtExposure (Refers to walkout or garden level walls): 32 missing values (2.74% of total), similar to the others. Although No has a lot of outliers on the expensive side, its IQR is clearly defined between around $130,000 and $190,000. It's the category with the most instances: 755/1168.
BsmtFinType1 (Rating of basement finished area): Same missing values as the other Bsmt categories. All the IQRs are fairly tight.
BsmtFinType2 (Rating of basement finished area, if multiple types): Unf includes most of the samples (1000/1168) and covers the whole range of prices.
Heating: Almost all instances (1144/1168) belong to GasA.
HeatingQC: 598 instances belong to Ex, which generally correspond to homes that are slightly more expensive (and also all of the very expensive ones, as seen in the boxplot). The other categories are pretty similar to each other, so we combine them into one category.
CentralAir: Homes with no central air are definitely cheaper. Even though there is a steep class imbalance (Y: 1096 vs N: 72), this feature clearly influences the price of a home.
Electrical: SBrkr includes most of the instances (1062/1168). The other categories are cheaper in general, and because there are so few instances in them, we combine them into one category. There is 1 missing value, which we assign to SBrkr (the most frequent category).
KitchenQual: 4 categories with distinct IQRs and no NaNs.
Functional (Home functionality; assume typical unless deductions are warranted): Typ includes 1087/1168 instances and covers the whole range of prices. Even though the median of this category is higher than the others, it's not by much. We discard this one.
FireplaceQu (Fireplace quality): Lots of NaNs (560, 48%), which we assume represent houses that simply don't have fireplaces. The categories look promising. We will label the missing values accordingly.
GarageType (Garage location): 67 missing (5.7%). Attchd contains most of the samples (695/1168) and covers almost the whole range of SalePrice. Even though it contains lots of outliers at the top, its IQR sits between $160,000 and $240,000. CarPort corresponds with lower prices, but there are only 8 instances to consider.
GarageFinish (Interior finish of the garage): This could be a better indicator of SalePrice than GarageType (incidentally, they have the same number of missing values). Finished > Rough Finished > Unfinished. We can consider the missing values to represent homes without a garage and assign a 'NA' or 'NoGarage' category.
GarageQual: TA (Typical/Average) has 1046/1168 samples. In the boxplot, Ex corresponds to higher prices, but it only has 2 instances in the dataset.
GarageCond (Garage condition): Similar to GarageQual.
PavedDrive (Paved driveway): Y (Paved) has 1073/1168 instances and its values cover almost the whole price range. However, we'd hesitate to discard this one: the other categories, P (Partial Pavement) and N (Dirt/Gravel), correspond to lower sale prices as expected.
PoolQC (Pool quality): 1163 missing values (99.6%).
Fence (Fence quality): 956 missing values (82%).
MiscFeature (Miscellaneous feature not covered in other categories): 1124 missing values (96%).
SaleType: Although some of the categories look promising (New clearly corresponds to higher values), the fact that 1015/1168 instances are in one category makes this less than ideal.
SaleCondition: Similar to SaleType.
The categorical features that contain missing values are:
    Alley           1089
    MasVnrType         8
    BsmtQual          27
    BsmtCond          27
    BsmtExposure      28
    BsmtFinType1      27
    BsmtFinType2      28
    Electrical         1
    FireplaceQu      545
    GarageType        61
    GarageFinish      61
    GarageQual        61
    GarageCond        61
    PoolQC          1161
    Fence            945
    MiscFeature     1123
4. Feature Engineering
Missing Values
We started the treatment of the different features by filling in the missing values according to the previous analysis:
LotFrontage and MasVnrArea: filled missing values with 0
GarageYrBlt: filled missing values with the mean
MasVnrType: filled with 'None'
BsmtQual, BsmtExposure, BsmtFinType1, FireplaceQu, GarageFinish: filled with 'NA'
Electrical: filled with 'SBrkr'
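A minimal pandas sketch of the imputation steps just listed (assuming the data lives in a DataFrame named `train_set`):

```python
# Numerical features: 0 where absence is implied, the mean for GarageYrBlt.
train_set["LotFrontage"] = train_set["LotFrontage"].fillna(0)
train_set["MasVnrArea"] = train_set["MasVnrArea"].fillna(0)
train_set["GarageYrBlt"] = train_set["GarageYrBlt"].fillna(train_set["GarageYrBlt"].mean())

# Categorical features.
train_set["MasVnrType"] = train_set["MasVnrType"].fillna("None")
for col in ["BsmtQual", "BsmtExposure", "BsmtFinType1", "FireplaceQu", "GarageFinish"]:
    train_set[col] = train_set[col].fillna("NA")
train_set["Electrical"] = train_set["Electrical"].fillna("SBrkr")
```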
Category consolidation
Per the analysis above, for the following categorical features we consolidate some of the values: LotShape, Neighborhood, HouseStyle, Foundation, HeatingQC, Electrical. See the notebook for details.
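As an illustration of the idea (the exact groupings are in the notebook), consolidating LotShape might look like this:

```python
# Collapse the three irregular lot-shape categories into a single one.
train_set["LotShape"] = train_set["LotShape"].replace({"IR1": "IR", "IR2": "IR", "IR3": "IR"})
```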
Feature creation
Fraction of finished basement
Total number of bathrooms
Total porch and deck area
Total basement and above-ground area (TotalSF)
Ratio of living area / total lot
Season in which sale took place
Binary features to represent the presence or absence of a characteristic, e.g. LotFrontage → HasLotFront
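A sketch of a few of these derived features, using the column names from the data description (the exact formulas in the notebook may differ slightly):

```python
# Total basement plus above-ground living area.
train_set["TotalSF"] = train_set["TotalBsmtSF"] + train_set["GrLivArea"]

# Total number of bathrooms, counting half-baths as 0.5.
train_set["TotalBath"] = (
    train_set["FullBath"] + 0.5 * train_set["HalfBath"]
    + train_set["BsmtFullBath"] + 0.5 * train_set["BsmtHalfBath"]
)

# Binary flag for the presence of lot frontage.
train_set["HasLotFront"] = (train_set["LotFrontage"] > 0).astype(int)
```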
Feature transformation
Year-valued features were transformed into age values, e.g. YearBuilt → HomeAge
Logarithmic transformation of LotArea and GrLivArea
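A sketch of these two transformations; using the sale year for the age calculation and log1p for the log transform are assumptions here:

```python
import numpy as np

# Age of the home at the time of sale.
train_set["HomeAge"] = train_set["YrSold"] - train_set["YearBuilt"]

# Log-transform the heavily right-skewed area features.
train_set["LotArea"] = np.log1p(train_set["LotArea"])
train_set["GrLivArea"] = np.log1p(train_set["GrLivArea"])
```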
5. Feature Selection
We dropped features according to the analysis above:
'YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'MSSubClass', 'Street', 'Alley', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Condition1', 'Condition2', 'BldgType', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'ExterCond', 'BsmtCond', 'BsmtFinType2', 'Heating', 'Functional', 'GarageType', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'MiscFeature', 'SaleType', 'SaleCondition'
There are several ways to perform feature selection:
Wrapper methods: train a model using different subsets of features and produce a score for each one of them
Filter methods: consider correlation between features and correlation between each feature and the target, among other non-model-specific properties of the dataset
Embedded methods: these occur organically as part of certain models' construction process. An example of this is Lasso Regression, which is a Linear Regression model that penalizes less-informative variables by pulling their coefficients towards 0. Another example is tree-based algorithms, which assign each feature an "importance" score.
During the Feature Engineering / Feature Selection / Model Selection process, as we tried different models and sets of features, one model was seen to perform better than the rest: Gradient Boosting Trees. This algorithm will be explained below, but we mention it here because, as an ensemble of decision trees, this model produces its own embedded feature selection. Wrapper and Filter methods were not needed in this case.
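To make the "embedded" part concrete, here is a sketch of reading the importance scores off a fitted tree ensemble; the feature matrix built below from `train_set` is a stand-in for the engineered features:

```python
import pandas as pd
from xgboost import XGBRegressor

# Stand-in feature matrix and target; the real X comes from the engineering steps above.
X = train_set.drop(columns="SalePrice").select_dtypes(include="number")
y = train_set["SalePrice"]

model = XGBRegressor(random_state=42)
model.fit(X, y)

# Embedded feature selection: importance scores come for free with the ensemble.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(20))
```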
6. Metric
In the next sections, we evaluate model performance and consider different hyper-parameter settings. However, we first needed to consider which performance metric was most appropriate.
There are many options for this, but in this case we decided to evaluate based on Root Mean Squared Error (RMSE). Put simply, this measures the typical distance between each prediction and the ground truth.
For example, say the known SalePrice value for a certain property is $250,000. If a model predicts that this property's price is $280,000, then the squared error is ($280,000 - $250,000)^2. Because of the 'squared' aspect, larger errors are penalized harder than small ones. We then take the average of these squared errors, also called the Mean Squared Error. This measurement is informative enough, but we wanted to bring the measurement back to dollar units. This is why we take the Root of this value, thus producing the Root Mean Squared Error (RMSE).
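Expressed in code with made-up numbers (including the $250,000 / $280,000 example above):

```python
import numpy as np

# Toy example: three known sale prices and three predictions.
y_true = np.array([250_000, 180_000, 320_000])
y_pred = np.array([280_000, 170_000, 300_000])

mse = np.mean((y_pred - y_true) ** 2)  # Mean Squared Error
rmse = np.sqrt(mse)                    # back to dollar units
print(f"RMSE: ${rmse:,.0f}")
```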
7. Model Selection
In order to compare different models, we used a method called cross-validation. The motivation for this is that it would not be informative to train a model on a certain dataset, only to evaluate its predictions on that same dataset. This would produce overly optimistic results on the training data but would give us no information on how the model performs on unseen data (an extreme example of overfitting the training data).
In cross-validation, a portion of the data is designated as a test set. The model is fitted to the remaining samples and evaluated on the test data. This is repeated several times, using a different "fold" for testing each time and, in each case, the remaining data for training. At the end, the average of the performance scores (and the standard deviation) is computed for the model.
In our case, we performed 5-fold cross-validation for a variety of models, including Linear Regression variants, Support Vector Machines and a few tree-based models.
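A sketch of how one of these scores might be computed (X and y as in the earlier sketches; sklearn reports negated MSE, so we flip the sign before taking the root):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation scored as RMSE, shown here for one candidate model.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error")
rmse_scores = np.sqrt(-scores)
print(f"Cross-val Mean: {rmse_scores.mean():.0f}  Cross-val StDev: {rmse_scores.std():.0f}")
```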
    Model                        Cross-val Mean          Cross-val StDev
    LinearRegression             30279.694447671907      2985.8303729401937
    SGDRegressor                 4137472377456351.0      1546483926819965.5
    Ridge                        30314.79745932803       2965.1493398621533
    Lasso                        30279.53304991037       2985.5849787440156
    ElasticNet                   32524.277803055513      2736.9451380807945
    SVR                          77596.15505735752       5129.075433413484
    RandomForestRegressor        27761.00887350167       2727.2517139428255
    ExtraTreesRegressor          27947.14246083494       1678.7589092291896
    GradientBoostingRegressor    27452.123542768455      1637.1689378334142
    XGBRegressor                 27393.467952903808      1655.0572044541734
    AdaBoostRegressor            31749.10560609706       1883.8981385385696
There are some exceptional results, which were brought into a more "normal" range with some fine-tuning. However, those models (SGDRegressor and SVR) still did not perform as well as some of the others.
8. Gradient Boosting Trees
As mentioned above, we selected Gradient Boosting Trees, based on a higher performance score through cross-validation on the training set.
Decision Trees
Gradient Boosting Trees consists of an ensemble of individual Decision Trees, so we will describe those first.
Whether used for classification (predicting categories such as Male or Female, Survived or Died) or regression, a Decision Tree consists of steps (nodes) at which it makes decisions based on features from the dataset. These decisions take into account which features provide the most information for dividing the samples. In our case, for example, a Decision Tree may consider the feature LotArea and split samples based on whether their LotArea value is below or above a certain number.
Here is a visualization of the first Decision Tree in the ensemble from our analysis:
The names 'f0' and 'f1' in this visualization represent the features at the 0th and 1st indices of the dataset respectively. The node "f1 < 6.5" divides samples by whether their value for "OverallQual" is less than 6.5 or not. At the next level, the node "f0 < 2057" further divides samples by whether "TotalSF" is less than 2057 or not. Similarly for "f0 < 3009", it splits samples by whether "TotalSF" is less than 3009 or not.
The last level consists of childless nodes, or leaves, which provide an average SalePrice value for the samples they contain.
Decision Trees are very flexible models since, unlike Linear Regression for example, they make few assumptions about the data. In Machine Learning / Statistics lingo, they have low bias. Also, their modelling varies quite a bit if the training data is changed even slightly: they're said to have high variance. Because of this, a single Decision Tree will usually overfit the data.
Ensemble
An ensemble of Decision Trees can be built to counteract the overfitting of a single predictor. In the case of Gradient Boosting Trees, only the first Decision Tree is fit to the entire training set, while each subsequent tree is fit to predict the errors of the previous one and learns to minimize them. In other words, the ensemble combines several "weak learners" to build a "strong" one.
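To make the boosting idea concrete, here is a toy two-stage sketch (not the library's actual implementation), where the second tree is fit to the residual errors of the first; X and y are the same stand-ins as in the earlier sketches:

```python
from sklearn.tree import DecisionTreeRegressor

# Stage 1: a shallow tree fit to the target itself.
tree1 = DecisionTreeRegressor(max_depth=3, random_state=42)
tree1.fit(X, y)

# Stage 2: another shallow tree fit to the residuals, i.e. the errors left by stage 1.
residuals = y - tree1.predict(X)
tree2 = DecisionTreeRegressor(max_depth=3, random_state=42)
tree2.fit(X, residuals)

# The ensemble prediction is the sum of the stages; a learning rate would scale stage 2.
ensemble_pred = tree1.predict(X) + tree2.predict(X)
```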
XGBoost
There are two implementations that we considered for this algorithm: Scikit-Learn's GradientBoostingRegressor and XGBoost's XGBRegressor. XGBRegressor tends to obtain similar results to sklearn's implementation but is computationally faster. Also, the standard deviation for XGBRegressor's cross-validation results was lower than GradientBoostingRegressor's.
9. Model Tuning
We now utilize the holdout set to evaluate the performance of the model on a dataset it hasn't seen before.
There are several "hyper-parameters" that we can tweak on XGBRegressor, depending on our goal. We first observed that the model, with its default configuration, severely overfitted the training set, meaning that it learned patterns in the training data that did not generalize well to unseen data.
We learned this from evaluating the model on the holdout set, which we set aside early in the process. These were the results:
RMSE on the training set: $18,070
RMSE on the holdout set: $34,603
Through tuning, we were able to set up a model that produced these results:
RMSE on the training set: $20,143
RMSE on the holdout set: $26,915
The worsening of performance on the training set was expected, as the goal was to decrease the model's flexibility (reducing its variance) so it wouldn't fit the training set so closely that it couldn't generalize to unseen data.
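For illustration only, a less flexible configuration might look like the one below; the parameter values are placeholders, not the exact settings used in the notebook:

```python
from xgboost import XGBRegressor

# Illustrative hyper-parameters: shallower trees, a lower learning rate with more
# estimators, and row/column subsampling all help reduce variance (overfitting).
tuned_model = XGBRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=3,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
)
tuned_model.fit(X, y)  # X, y: engineered training features and SalePrice target
```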
10. Pipeline
The pre-processing steps defined during Feature Selection and Feature Engineering, as well as the model fitting and prediction steps were streamlined into a data-processing pipeline. This allows us to accept new data and process it in a way that is consistent and easily customizable.
We can input the holdout dataset and the test set, for which we don't have ground truth, and the pipeline will execute all the pre-processing and output predictions that we can evaluate.
Pipelines allow easy customization of individual steps, for example by adjusting hyper-parameters of the predictor (XGBRegressor in this case) or even substituting it for a different one.
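A simplified sketch of such a pipeline with scikit-learn; the `preprocess` function below is a stand-in for the real feature engineering and selection steps described in sections 4 and 5, and `train_set` / `holdout_set` are the splits created in section 2:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from xgboost import XGBRegressor

def preprocess(df):
    """Stand-in for the feature engineering / selection steps described above."""
    out = df.copy()
    out["TotalSF"] = out["TotalBsmtSF"] + out["GrLivArea"]
    return out.select_dtypes(include="number").fillna(0)

pipeline = Pipeline(steps=[
    ("preprocess", FunctionTransformer(preprocess)),
    ("model", XGBRegressor(random_state=42)),
])

# Fit on the training split, then predict on data the model hasn't seen.
pipeline.fit(train_set.drop(columns="SalePrice"), train_set["SalePrice"])
holdout_predictions = pipeline.predict(holdout_set.drop(columns="SalePrice"))
```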