Calibrating Hedonic Pricing Model for Private Highrise Property with GWR Method

Getting Started

pacman::p_load(olsrr, ggstatsplot, ggpubr,
               sf, spdep, GWmodel, tmap,
               tidyverse, gtsummary, performance,
               see, sfdep)

Importing the data

URA Master Plan 2014 planning subzone boundary

condo_resale <- read_csv("data/aspatial/Condo_resale_2015.csv")

mpsz <- read_rds("data/rds/mpsz.rds")

condo_resale_sf <- read_rds(
  "data/rds/condo_resale_sf.rds")

Correlation Analysis - ggstatsplot methods
Instead of using the corrplot package, ggcorrmat() of the ggstatsplot package is used in the code chunk below.
ggcorrmat(condo_resale[, 5:23])
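For reference, a hedged sketch of the corrplot alternative mentioned above (it assumes the corrplot package is installed, since it is not part of the p_load() call, and that columns 5 to 23 are all numeric):

# Compute the correlation matrix of the same columns and plot it with
# corrplot; the method and type arguments are illustrative choices.
corrplot::corrplot(cor(condo_resale[, 5:23]),
                   method = "number",
                   type = "lower")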
Building a Hedonic Pricing Model by using Multiple Linear Regression Method

The code chunk below uses lm() to calibrate the multiple linear regression model.
condo_mlr <- lm(formula = SELLING_PRICE ~ AREA_SQM +
AGE + PROX_CBD + PROX_CHILDCARE +
PROX_ELDERLYCARE + PROX_URA_GROWTH_AREA +
PROX_HAWKER_MARKET + PROX_KINDERGARTEN +
PROX_MRT + PROX_PARK + PROX_PRIMARY_SCH +
PROX_TOP_PRIMARY_SCH + PROX_SHOPPING_MALL +
PROX_SUPERMARKET + PROX_BUS_STOP +
NO_Of_UNITS + FAMILY_FRIENDLY +
FREEHOLD + LEASEHOLD_99YR,
data=condo_resale_sf)
summary(condo_mlr)

Model Assessment: olsrr method
In this section, we would like to introduce you to a fantastic R package specially designed for performing OLS regression, called olsrr. It provides a collection of very useful methods for building better multiple linear regression models, a few of which are sketched right after this list:
- comprehensive regression output
- residual diagnostics
- measures of influence
- heteroskedasticity tests
- model fit assessment
- variable contribution assessment
- variable selection procedures
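As a hedged sketch of a few of these diagnostics beyond the ones used later in this section (function names are taken from olsrr's documented API; the exact output may vary by version):

ols_plot_resid_qq(condo_mlr)       # residual Q-Q plot (residual diagnostics)
ols_test_breusch_pagan(condo_mlr)  # Breusch-Pagan heteroskedasticity test
ols_plot_cooksd_bar(condo_mlr)     # Cook's distance bar plot (measures of influence)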
Generating tidy linear regression report
ols_regress(condo_mlr)

Multicollinearity
In the code chunk below, ols_vif_tol() of olsrr package is used to check for signs of multicollinearity.

ols_vif_tol(condo_mlr)
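A hedged follow-up sketch: since ols_vif_tol() returns a table of tolerance and VIF values, the conventional VIF > 5 screen can be applied directly (the VIF column name is assumed from olsrr's documented output):

# Keep only the variables whose VIF exceeds 5, i.e. potential
# multicollinearity candidates worth a closer look.
ols_vif_tol(condo_mlr) %>%
  filter(VIF > 5)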
Variable selection

condo_fw_mlr <- ols_step_forward_p(
condo_mlr,
p_val = 0.05,
details = FALSE)

plot(condo_fw_mlr)
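olsrr also provides backward and stepwise counterparts; below is a hedged sketch of the backward variant, assuming it accepts the same p_val interface as the forward version used above:

# Backward elimination: start from the full model and drop the least
# significant variable at each step.
condo_bw_mlr <- ols_step_backward_p(
  condo_mlr,
  p_val = 0.05,
  details = FALSE)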
Visualising model parameters

ggcoefstats(condo_mlr,
            sort = "ascending")

Test for Non-Linearity
In multiple linear regression, it is important for us to test the assumption of linearity and additivity of the relationship between the dependent and independent variables.
In the code chunk below, ols_plot_resid_fit() of olsrr package is used to perform the linearity assumption test.
ols_plot_resid_fit(condo_fw_mlr$model)

The figure above reveals that most of the data points are scattered around the 0 line, hence we can safely conclude that the relationships between the dependent variable and the independent variables are linear.
Test for Normality Assumption
Lastly, the code chunk below uses ols_plot_resid_hist() of olsrr package to perform the normality assumption test.

ols_plot_resid_hist(condo_fw_mlr$model)

The figure reveals that the residuals of the multiple linear regression model (i.e. condo_fw_mlr) resemble a normal distribution.
If you prefer formal statistical test methods, ols_test_normality() of olsrr package can be used as shown in the code chunk below.

ols_test_normality(condo_fw_mlr$model)

The summary table above reveals that the p-values of the four tests are way smaller than the alpha value of 0.05. Hence, we will reject the null hypothesis and infer that there is statistical evidence that the residuals are not normally distributed.
Testing for Spatial Autocorrelation
The hedonic model we are trying to build uses geographically referenced attributes, hence it is also important for us to visualise the residuals of the hedonic pricing model.
First, we will export the residuals of the hedonic pricing model and save them as a data frame.
mlr_output <- as.data.frame(condo_fw_mlr$model$residuals) %>%
rename(`FW_MLR_RES` = `condo_fw_mlr$model$residuals`)

Next, we will join the newly created data frame with the condo_resale_sf object.
condo_resale_sf <- cbind(condo_resale_sf,
mlr_output$FW_MLR_RES) %>%
rename(`MLR_RES` = `mlr_output.FW_MLR_RES`)

Next, we will use tmap package to display the distribution of the residuals on an interactive map.
The code chunk below will turn on the interactive mode of tmap.
tmap_mode("view")
tm_shape(mpsz)+
tmap_options(check.and.fix = TRUE) +
tm_polygons(alpha = 0.4) +
tm_shape(condo_resale_sf) +
tm_dots(col = "MLR_RES",
alpha = 0.6,
style="quantile")
tmap_mode("plot")The figure above reveal that there is sign of spatial autocorrelation.
Spatial stationarity test
To prove that our observation is indeed true, the Moran's I test will be performed.

Ho: The residuals are randomly distributed (also known as spatial stationarity)
H1: The residuals are spatially non-stationary
First, we will compute a k-nearest neighbours weights matrix by using st_knn() and st_weights() of sfdep package.
condo_resale_sf <- condo_resale_sf %>%
mutate(nb = st_knn(geometry, k=6,
longlat = FALSE),
wt = st_weights(nb,
style = "W"),
.before = 1)

Next, global_moran_perm() of sfdep is used to perform the global Moran permutation test.
global_moran_perm(condo_resale_sf$MLR_RES,
condo_resale_sf$nb,
condo_resale_sf$wt,
alternative = "two.sided",
nsim = 99)

The Global Moran's I test for residual spatial autocorrelation shows that its p-value is less than 0.00000000000000022, which is less than the alpha value of 0.05. Hence, we will reject the null hypothesis that the residuals are randomly distributed.

Since the observed Global Moran's I = 0.25586, which is greater than 0, we can infer that the residuals resemble a clustered distribution.
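As a hedged cross-check (not part of the original workflow), spdep's lm.morantest() applies the Moran's I test directly to an lm object; the sketch below rebuilds a comparable k = 6 nearest-neighbour listw object with spdep's knearneigh(), knn2nb() and nb2listw():

# Build a row-standardised k-nearest-neighbour weights list from the
# point coordinates, then test the regression residuals for spatial
# autocorrelation.
condo_lw <- nb2listw(
  knn2nb(knearneigh(st_coordinates(condo_resale_sf), k = 6)),
  style = "W")
lm.morantest(condo_fw_mlr$model, condo_lw)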
Building Hedonic Pricing Models using GWmodel
In this section, you are going to learn how to model hedonic pricing by using the geographically weighted regression (GWR) method. Two bandwidth schemes will be used: fixed and adaptive.
Building Fixed Bandwidth GWR Model
Computing fixed bandwidth

In the code chunk below, bw.gwr() of GWmodel package is used to determine the optimal fixed bandwidth to use in the model. Notice that the argument adaptive is set to FALSE, which indicates that we are interested in computing a fixed bandwidth.

There are two possible approaches that can be used to determine the stopping rule: the CV cross-validation approach and the AIC corrected (AICc) approach. We define the stopping rule using the approach argument.
bw_fixed <- bw.gwr(formula = SELLING_PRICE ~ AREA_SQM + AGE +
PROX_CBD + PROX_CHILDCARE +
PROX_ELDERLYCARE + PROX_URA_GROWTH_AREA +
PROX_MRT + PROX_PARK + PROX_PRIMARY_SCH +
PROX_SHOPPING_MALL + PROX_BUS_STOP +
NO_Of_UNITS + FAMILY_FRIENDLY + FREEHOLD,
data=condo_resale_sf,
approach="CV",
kernel="gaussian",
adaptive=FALSE,
longlat=FALSE)

The result shows that the recommended bandwidth is 971.3405 metres. (Quiz: Do you know why it is in metres?)
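For comparison, a hedged sketch of the AICc stopping rule mentioned above; GWmodel documents both approaches, although the accepted string may be spelled "AICc" or "AIC" depending on the package version:

# Same model and kernel, but let the corrected AIC drive the bandwidth search.
bw_fixed_aic <- bw.gwr(formula = SELLING_PRICE ~ AREA_SQM + AGE +
                         PROX_CBD + PROX_CHILDCARE +
                         PROX_ELDERLYCARE + PROX_URA_GROWTH_AREA +
                         PROX_MRT + PROX_PARK + PROX_PRIMARY_SCH +
                         PROX_SHOPPING_MALL + PROX_BUS_STOP +
                         NO_Of_UNITS + FAMILY_FRIENDLY + FREEHOLD,
                       data=condo_resale_sf,
                       approach="AICc",  # assumed spelling; some versions expect "AIC"
                       kernel="gaussian",
                       adaptive=FALSE,
                       longlat=FALSE)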
GWmodel method - fixed bandwidth

Now we can use the code chunk below to calibrate the GWR model using a fixed bandwidth and Gaussian kernel.
gwr_fixed <- gwr.basic(formula = SELLING_PRICE ~ AREA_SQM +
AGE + PROX_CBD + PROX_CHILDCARE +
PROX_ELDERLYCARE +PROX_URA_GROWTH_AREA +
PROX_MRT + PROX_PARK + PROX_PRIMARY_SCH +
PROX_SHOPPING_MALL + PROX_BUS_STOP +
NO_Of_UNITS + FAMILY_FRIENDLY + FREEHOLD,
data=condo_resale_sf,
bw=bw_fixed,
kernel = 'gaussian',
longlat = FALSE)

The output is saved in a list of class "gwrm". The code below can be used to display the model output.

gwr_fixed

The report shows that the AICc of the GWR model is 42263.61, which is significantly smaller than the AICc of the global multiple linear regression model, 42967.1.
Building Adaptive Bandwidth GWR Model
In this section, we will calibrate the GWR-based hedonic pricing model by using the adaptive bandwidth approach.
Computing the adaptive bandwidth
Similar to the earlier section, we will first use bw.gwr() to determine the recommended number of data points to use.
The code chunk used looks very similar to the one used to compute the fixed bandwidth, except that the adaptive argument is now set to TRUE.
bw_adaptive <- bw.gwr(formula = SELLING_PRICE ~ AREA_SQM + AGE +
PROX_CBD + PROX_CHILDCARE + PROX_ELDERLYCARE +
PROX_URA_GROWTH_AREA + PROX_MRT + PROX_PARK +
PROX_PRIMARY_SCH + PROX_SHOPPING_MALL + PROX_BUS_STOP +
NO_Of_UNITS + FAMILY_FRIENDLY + FREEHOLD,
data=condo_resale_sf,
approach="CV",
kernel="gaussian",
adaptive=TRUE,
longlat=FALSE)

The result shows that 30 is the recommended number of data points to be used.
Constructing the adaptive bandwidth GWR model
Now, we can go ahead to calibrate the GWR-based hedonic pricing model by using an adaptive bandwidth and Gaussian kernel, as shown in the code chunk below.
gwr_adaptive <- gwr.basic(formula = SELLING_PRICE ~ AREA_SQM + AGE +
PROX_CBD + PROX_CHILDCARE + PROX_ELDERLYCARE +
PROX_URA_GROWTH_AREA + PROX_MRT + PROX_PARK +
PROX_PRIMARY_SCH + PROX_SHOPPING_MALL + PROX_BUS_STOP +
NO_Of_UNITS + FAMILY_FRIENDLY + FREEHOLD,
data=condo_resale_sf,
bw=bw_adaptive,
kernel = 'gaussian',
adaptive=TRUE,
longlat = FALSE)

The code below can be used to display the model output.

gwr_adaptive

The report shows that the AICc of the adaptive distance GWR is 41982.22, which is even smaller than the AICc of the fixed distance GWR, 42263.61.
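A hedged sketch for pulling these AICc values programmatically instead of reading them off the printed reports; it assumes gwr.basic() stores its diagnostics in a GW.diagnostic element, as current GWmodel versions do:

# Tabulate the two calibrations side by side for a quick comparison.
data.frame(model = c("fixed", "adaptive"),
           AICc = c(gwr_fixed$GW.diagnostic$AICc,
                    gwr_adaptive$GW.diagnostic$AICc))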
Visualising GWR Output
In addition to regression residuals, the output feature class table includes fields for observed and predicted y values, condition number (cond), Local R2, residuals, and explanatory variable coefficients and standard errors:
Condition Number: this diagnostic evaluates local collinearity. In the presence of strong local collinearity, results become unstable. Results associated with condition numbers larger than 30 may be unreliable.
Local R2: these values range between 0.0 and 1.0 and indicate how well the local regression model fits observed y values. Very low values indicate the local model is performing poorly. Mapping the Local R2 values to see where GWR predicts well and where it predicts poorly may provide clues about important variables that may be missing from the regression model.
Predicted: these are the estimated (or fitted) y values computed by GWR.
Residuals: to obtain the residual values, the fitted y values are subtracted from the observed y values. Standardized residuals have a mean of zero and a standard deviation of 1. A cold-to-hot rendered map of standardized residuals can be produced by using these values (see the sketch after this list).
Coefficient Standard Error: these values measure the reliability of each coefficient estimate. Confidence in those estimates is higher when standard errors are small in relation to the actual coefficient values. Large standard errors may indicate problems with local collinearity.
All of these are stored in a SpatialPointsDataFrame or SpatialPolygonsDataFrame object, integrated with fit.points, GWR coefficient estimates, y values, predicted values, coefficient standard errors and t-values in its "data" slot, in an object called SDF of the output list.
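As a hedged illustration of the standardized residuals described above (assuming the data slot of SDF carries a residual field, as gwr.basic() output typically does):

# Standardize the GWR residuals: centre on the mean and divide by the
# standard deviation, giving mean 0 and standard deviation 1.
gwr_std_res <- as.vector(scale(gwr_adaptive$SDF$residual))
summary(gwr_std_res)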
Converting SDF into sf data.frame
To visualise the fields in SDF, we need to first convert it into an sf data.frame by using the code chunk below.
gwr_adaptive_output <- as.data.frame(
gwr_adaptive$SDF) %>%
select(-c(2:15))

gwr_sf_adaptive <- cbind(condo_resale_sf,
                         gwr_adaptive_output)

Next, glimpse() is used to display the content of the gwr_sf_adaptive sf data frame.
glimpse(gwr_sf_adaptive)

summary(gwr_adaptive$SDF$yhat)
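A hedged follow-up sketch: the yhat field summarised above holds the GWR fitted values, so a quick check of how closely they track the observed selling prices is (assuming yhat is indeed the fitted-value field):

# Correlation between observed selling prices and GWR fitted values.
cor(condo_resale_sf$SELLING_PRICE, gwr_adaptive$SDF$yhat)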
Visualising local R2

The code chunk below is used to create an interactive point symbol map.
tmap_mode("view")
tmap_options(check.and.fix = TRUE)
tm_shape(mpsz)+
tm_polygons(alpha = 0.1) +
tm_shape(gwr_sf_adaptive) +
tm_dots(col = "Local_R2",
border.col = "gray60",
border.lwd = 1) +
tm_view(set.zoom.limits = c(11,14))
tmap_mode("plot")Visualising coefficient estimates
The code chunks below is used to create an interactive point symbol map.
tmap_options(check.and.fix = TRUE)
tmap_mode("view")
AREA_SQM_SE <- tm_shape(mpsz)+
tm_polygons(alpha = 0.1) +
tm_shape(gwr_sf_adaptive) +
tm_dots(col = "AREA_SQM_SE",
border.col = "gray60",
border.lwd = 1) +
tm_view(set.zoom.limits = c(11,14))
AREA_SQM_TV <- tm_shape(mpsz)+
tm_polygons(alpha = 0.1) +
tm_shape(gwr_sf_adaptive) +
tm_dots(col = "AREA_SQM_TV",
border.col = "gray60",
border.lwd = 1) +
tm_view(set.zoom.limits = c(11,14))
tmap_arrange(AREA_SQM_SE, AREA_SQM_TV,
asp=1, ncol=2,
sync = TRUE)

tmap_mode("plot")

By URA Planning Region
tm_shape(mpsz[mpsz$REGION_N=="CENTRAL REGION", ])+
tm_polygons()+
tm_shape(gwr_sf_adaptive) +
tm_bubbles(col = "Local_R2",
size = 0.15,
border.col = "gray60",
border.lwd = 1)