
How much does a car's brand affect its price?


Introduction

As a student about to graduate from college, I was all ready to start #adulting and a part of that was getting a new car. Of course, I soon realised that

(a) I did not have any money, and

(b) cars in Singapore are prohibitively expensive.

So, I did what anyone in my situation would do - I decided to analyse the prices of cars in Singapore in place of actually buying one (note: it does not give the same satisfaction).

Principal Questions

This being one of my first independent data analytics projects, I had just two simple questions when embarking on it:

1. How much does make/brand affect the price of a car? My dad always talked about the “BMW premium” and I wanted to see whether this was true. Did a car being a ‘luxury’ car artificially inflate its value? And by how much?

2. How accurately can we predict the price of a car given certain parameters? A potential second part of this project might involve building a simple web app that gives a “fair value” for a car given certain parameters, so building a model to predict price was something I wanted to achieve.

tl;dr

I recognise at this juncture that some of you may not want to trawl through the entire article, so I’ll just dump my conclusions here (and you can scroll down if you want to know the technicals and see some nice graphs).

1. Make premium effect. We narrowed the specs of the cars and compared ‘normal-spec’ premium makes (BMW etc.) with non-premium makes (Toyota etc.). Make had the most significant interaction with the correlations of Top Speed and Fuel Consumption with price. This made sense: buyers of premium makes likely did not care much about fuel costs and would still buy a car with high fuel consumption.

2. Prediction model. The model had a predicted $R^{2}$ value of 0.837, not far off from the adjusted $R^{2}$ value of 0.857. This means the model explained ~84% of the variance in car prices using the given variables. The residual outliers were mainly supercars (like the Ferrari SF90), whose prices were likely inflated by the significant premium attached to them (above and beyond ‘normal’ luxury makes like BMW).

Envisioned Steps

Upon deciding to embark on this project, I figured that there were three key steps to take. These steps coincidentally matched up with those of a typical Data Analytics (wow, buzzwords) project:

  1. Data Collection
  2. Data Cleaning
  3. Data Analytics

With these steps in mind, I headed off into the great unknown!

Step 1: Data Collection (aka: is it done yet?)

The first step in any data project is always to have data to analyse. I realised I needed a good source that would provide all the essential information about cars. My dad (my main source of car knowledge) spoke frequently about the mystical SGCarMart.

Heading over to the site, I was pleasantly surprised to find that it contained a large repository of essential car information. Prices, types of car, features, engine size - SGCarMart had it all! The only issue now was getting the data into a nice, workable .csv file.

Information on a Webpage + Needing it in a .csv = Web Scraping

In my work/research experiences and in the online courses I had taken, the data had always come to me in a nice little .csv file and so I was faced, for the first time, with the need to get my own information.

Initially, I decided to go the usual route of scraping with rvest:

library(rvest)

#read the spec page for a single car listing
sgcar <- read_html("https://www.sgcarmart.com/new_cars/newcars_specs.php?CarCode=12117")
#pull the engine capacity cell out of the specs table
Engine_Capacity <- sgcar %>% html_node("#submodel_spec tr:nth-child(2) td+ td") %>% html_text()
Engine_Capacity
## [1] "1,199 cc"

However, I eventually decided to use Web Scraper since it came with a very convenient selector graph that allowed me to map out exactly how I wanted to trawl through the pages and scrape my data.

[Image: selector graph showing the exact scraping map - convenience!!]

Choosing a 2000ms delay (so I wouldn’t get blocked by SGCarMart - fingers crossed), away it went. I ran the scraper four times in total, picking out different variants of the same make/model (since SGCarMart organised their price/specs/feature tabs in different styles). About 4 hours later, I was done, and my oh my was the scraped data MESSY.

[Image: the raw scraped data - urghhh it’s so messy]

Step 2: Data Cleaning (aka: I’d rather clean up my room, I’m sorry mum)

According to Forbes, Data Scientists spend up to 60% of their time cleaning data - I knew I was in for a treat.

The first step was to merge the separate scraped files together and remove cars that did not have a price (some cars on the website were not yet available and hence had no price).

#import data files (dplyr powers the pipes and verbs below)
library(dplyr)

sgcarmart_1 <- read.csv("sgcarmart.csv")
sgcarmart_2 <- read.csv("sgcarmartm2.csv")
sgcarmart_3 <- read.csv("sgcarmart3.csv")

#join the files together, keeping all rows from both sides
sgcarmart_12 <- merge(sgcarmart_1, sgcarmart_2, all.x = TRUE, all.y = TRUE)
sgcarmart <- merge(sgcarmart_12, sgcarmart_3, all.x = TRUE, all.y = TRUE)

#drop index/link columns and remove rows without a price
avail_cars <- sgcarmart %>%
    select(-1, -2, -contains('href'), -contains('link'), -contains('electric')) %>%
    filter(!grepl('null', Current_Price))

Next, I needed to remove all the random artefacts from the dataset. For some reason, SGCarMart had included stray “ ” (whitespace) characters in some of the figures, so these had to be removed.

#clean up data and remove artefacts (stringr/stringi for the replacements)
library(stringr)
library(stringi)

clean_cars <- avail_cars %>%
  #remove dollar signs, commas, and units
  mutate_all(str_replace_all, "[$]|km/h|km/L|/yr|/mth|,|cc", "") %>%
  #weird character in Road_Tax so had to mutate separately
  mutate(Road_Tax = str_replace_all(Road_Tax, "[ /yr]", "")) %>%
  #strip the "(after VES rebate)" annotation, leaving just the digits
  mutate(ARF = str_replace_all(ARF, "[(after VES rebate)]", "")) %>%
  #needed the '.*%' since the loan percentages were not all the same
  mutate(Down_Payment = str_replace_all(Down_Payment, "(Maximum .*% loan)", "")) %>%
  mutate(Down_Payment = stri_replace_all_fixed(Down_Payment, "()", "")) %>%
  mutate(VES = str_replace_all(VES, "\\(.*\\)", "")) %>%
  mutate(tank_cap = str_replace_all(tank_cap, "L", ""))

With basic cleaning done, it was time for the next part of data cleaning… dummy encoding. I decided to encode some categories and not others, depending on the number of unique observations within each variable. Since I planned to fit a linear regression model and the lm function is capable of handling factors, I was comfortable leaving the rest as factors.

#encode dummy variables
cleandum_cars <- clean_cars %>%
  mutate(KeylessEntry = ifelse(KeylessEntry == "Yes", 1, 0)) %>%
  mutate(AutoHeadlights = ifelse(AutoHeadlights == "Yes", 1, 0)) %>%
  mutate(AutoWipers = ifelse(AutoWipers == "Yes", 1, 0)) %>%
  mutate(PaddleShift = ifelse(PaddleShift == "Yes", 1, 0)) %>%
  mutate(SmartKey = ifelse(SmartKey == "Yes", 1, 0)) %>%
  mutate(RemoteBoot = ifelse(RemoteBoot == "Yes", 1, 0)) %>%
  mutate(ReverseCam = ifelse(ReverseCam == "Yes", 1, 0)) %>%
  #extract the hp figure: the digits just before the closing parenthesis
  mutate(Power_hp = str_extract(Power, "\\d+(?=[^(]*\\))"))

#convert numbers from strings to numeric
intcleandum_cars <- cleandum_cars %>%
  mutate_at(vars(-one_of(c('categories','openindiv','Vehicle.Name','veh_type','engine_type'))), as.numeric)

# Add "features" together
final_cars <- intcleandum_cars %>% 
  mutate(totalFeatures = select(., 18:24) %>% rowSums())

#convert engine type to number of cylinders
final_cars <- final_cars %>% 
  mutate(cyl = if_else(str_detect(engine_type, "(4-cylinder)"),as.numeric(4),
    if_else(str_detect(engine_type, "(V6)"), as.numeric(6),
      if_else(str_detect(engine_type, "(V8)"), as.numeric(8),
        if_else(str_detect(engine_type, "(3-cylinder)"), as.numeric(3), 
          if_else(str_detect(engine_type, "(V10)"), as.numeric(10),
                  if_else(str_detect(engine_type, "(V12)"), as.numeric(12), 
                          if_else(str_detect(engine_type, "(6-cylinder)"), as.numeric(6), as.numeric(0)))))))))

#factoring in turbocharged
final_cars <- final_cars %>%
  mutate(turbo = if_else(str_detect(engine_type, "(turbo)|(turbocharged)|(supercharged)|(Turbocharged)|(Turbo)"), as.numeric(1), as.numeric(0)))

A note here: my encoding of cylinder types could definitely have been done more efficiently. However, I only realised that there were SO MANY different types of cylinders midway through typing out the code and decided to just grit through the rest of it. Poor effort, I know - I will do better next time (a case_when() sketch of what I should have done is below).
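For the record, here is a minimal sketch of the tidier version I had in mind, using dplyr’s case_when() in place of the nested if_else() calls (same logic and column names as above):

#tidier cylinder encoding with case_when (conditions are checked in order)
final_cars <- final_cars %>%
  mutate(cyl = case_when(
    str_detect(engine_type, "3-cylinder") ~ 3,
    str_detect(engine_type, "4-cylinder") ~ 4,
    str_detect(engine_type, "V6|6-cylinder") ~ 6,
    str_detect(engine_type, "V8") ~ 8,
    str_detect(engine_type, "V10") ~ 10,
    str_detect(engine_type, "V12") ~ 12,
    TRUE ~ 0
  ))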

Step 3: Data Analytics (aka: This is actually all people care about)

The data cleaning surprisingly did not take as long as I’d anticipated. With that out of the way, it was time for the sexy bit of the project - analysing the data! I did a quick dfSummary just to get an overview of the data. (I LOVE using dfSummary for a rough feel of the data.)
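The call itself is a one-liner; a minimal sketch, assuming the summarytools package named in the output footer below:

library(summarytools)
#render a per-variable overview of the cleaned dataset
print(dfSummary(final_cars), method = "render")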


Data Frame Summary: final_cars
Dimensions: 545 x 29
Duplicates: 4

1. categories [character] - 44 makes; top: BMW 45 (8.3%), Mercedes-Benz 40 (7.3%), Porsche 38 (7.0%), Audi 33 (6.1%), Nissan 23 (4.2%), Toyota 22 (4.0%), Volkswagen 21 (3.9%), Mazda 20 (3.7%), Hyundai 19 (3.5%), Honda 18 (3.3%); 34 others 266 (48.8%). Valid: 545 (100%), Missing: 0 (0%)

2. openindiv [character] - 333 models; top: Nissan NV350 5 (0.9%); Alfa Romeo Giulia, Audi A6 Sedan Mild Hybrid, BMW 2 Series Gran Tourer, BMW 3 Series Sedan, BMW 7 Series, BMW X2, BMW X3, BMW X4, BMW X7 each 3 (0.6%); 323 others 513 (94.1%). Valid: 545 (100%), Missing: 0 (0%)

3. Current_Price [numeric] - mean (sd): 256508.7 (217284.5); min < med < max: 50999 < 181888 < 1818000; IQR (CV): 195000 (0.8); 439 distinct values. Valid: 545 (100%), Missing: 0 (0%)

4. Depreciation [numeric] - mean (sd): 16242.1 (7974.3); min < med < max: 6700 < 14000 < 54600; IQR (CV): 7900 (0.5); 144 distinct values. Valid: 247 (45.32%), Missing: 298 (54.68%)

5. Down_Payment [numeric] - mean (sd): 82274.8 (57798.3); min < med < max: 5100 < 68355 < 359200; IQR (CV): 52597.5 (0.7); 346 distinct values. Valid: 411 (75.41%), Missing: 134 (24.59%)

6. Installment [numeric] - mean (sd): 1884.7 (1133.4); min < med < max: 653 < 1527 < 7662; IQR (CV): 1114.5 (0.6); 327 distinct values. Valid: 411 (75.41%), Missing: 134 (24.59%)

7. Road_Tax [numeric] - mean (sd): 1621.2 (1324.7); min < med < max: 392 < 1208 < 7856; IQR (CV): 1678 (0.8); 82 distinct values. Valid: 494 (90.64%), Missing: 51 (9.36%)

8. OMV [numeric] - mean (sd): 45776.3 (44967.8); min < med < max: 11607 < 33321 < 454730; IQR (CV): 21031 (1); 274 distinct values. Valid: 277 (50.83%), Missing: 268 (49.17%)

9. ARF [numeric] - mean (sd): 57652.3 (74605); min < med < max: 5000 < 37770.5 < 790514; IQR (CV): 30634.2 (1.3); 263 distinct values. Valid: 268 (49.17%), Missing: 277 (50.83%)

10. VES [numeric] - mean (sd): 9812.1 (8115.7); min < med < max: 0 < 10000 < 20000; IQR (CV): 20000 (0.8); values: 0: 162 (33.8%), 10000: 164 (34.2%), 20000: 153 (31.9%). Valid: 479 (87.89%), Missing: 66 (12.11%)

11. Vehicle.Name [character] - duplicates openindiv: Nissan NV350 5 (0.9%); nine models at 3 (0.6%) each; 323 others 513 (94.1%). Valid: 545 (100%), Missing: 0 (0%)

12. eng_cap [numeric] - mean (sd): 2188.6 (923.8); min < med < max: 998 < 1995 < 6498; IQR (CV): 1394.2 (0.4); 81 distinct values. Valid: 526 (96.51%), Missing: 19 (3.49%)

13. engine_type [character] - 131 types; top (labels truncated in the original output): 4-cylinder in-line 16-val… 94 (17.2%), 4-cylinder in-line Turboc… 37 (6.8%), V6 Turbocharged 18 (3.3%), 4-cylinder in-line 16-val… 16 (2.9%), 4-cylinder in-line 16-val… 15 (2.8%), V8 Twin-Turbocharged 15 (2.8%), 4-cylinder in-line 16-val… 14 (2.6%), 4-cylinder in-line 16-val… 12 (2.2%), 3-cylinder in-line 12-val… 11 (2.0%), 4-cylinder in-line DOHC 10 (1.8%); 121 others 303 (55.6%). Valid: 545 (100%), Missing: 0 (0%)

14. top_speed [numeric] - mean (sd): 221.8 (41.4); min < med < max: 100 < 215 < 350; IQR (CV): 57 (0.2); 120 distinct values. Valid: 507 (93.03%), Missing: 38 (6.97%)

15. Fuel_consumption [numeric] - mean (sd): 15.1 (5.3); min < med < max: 5.9 < 14.7 < 50; IQR (CV): 5.5 (0.4); 111 distinct values. Valid: 503 (92.29%), Missing: 42 (7.71%)

16. tank_cap [numeric] - mean (sd): 60.5 (14.3); min < med < max: 35 < 60 < 105; IQR (CV): 20 (0.2); 54 distinct values. Valid: 504 (92.48%), Missing: 41 (7.52%)

17. veh_type [character] - Commercial 36 (6.6%), Hatchback 53 (9.7%), Luxury Sedan 78 (14.3%), MPV 33 (6.1%), Sedan 35 (6.4%), Sports 105 (19.3%), Stationwagon 9 (1.6%), SUV 196 (36.0%). Valid: 545 (100%), Missing: 0 (0%)

18. KeylessEntry [numeric] - 0: 45 (8.3%), 1: 500 (91.7%). Valid: 545 (100%), Missing: 0 (0%)

19. AutoWipers [numeric] - 0: 116 (21.3%), 1: 429 (78.7%). Valid: 545 (100%), Missing: 0 (0%)

20. AutoHeadlights [numeric] - 0: 44 (8.1%), 1: 501 (91.9%). Valid: 545 (100%), Missing: 0 (0%)

21. PaddleShift [numeric] - 0: 211 (38.7%), 1: 334 (61.3%). Valid: 545 (100%), Missing: 0 (0%)

22. SmartKey [numeric] - 0: 54 (9.9%), 1: 491 (90.1%). Valid: 545 (100%), Missing: 0 (0%)

23. RemoteBoot [numeric] - 0: 144 (26.4%), 1: 401 (73.6%). Valid: 545 (100%), Missing: 0 (0%)

24. ReverseCam [numeric] - 0: 95 (17.4%), 1: 450 (82.6%). Valid: 545 (100%), Missing: 0 (0%)

25. Power [numeric] - all NA's. Valid: 0 (0%), Missing: 545 (100%)

26. Power_hp [numeric] - mean (sd): 240.3 (147.8); min < med < max: 59 < 180 < 789; IQR (CV): 178 (0.6); 149 distinct values. Valid: 496 (91.01%), Missing: 49 (8.99%)

27. totalFeatures [numeric] - mean (sd): 5.7 (1.7); min < med < max: 0 < 6 < 7; IQR (CV): 2 (0.3); values: 0: 13 (2.4%), 1: 16 (2.9%), 2: 10 (1.8%), 3: 21 (3.8%), 4: 42 (7.7%), 5: 66 (12.1%), 6: 130 (23.8%), 7: 247 (45.3%). Valid: 545 (100%), Missing: 0 (0%)

28. cyl [numeric] - mean (sd): 4.1 (2); min < med < max: 0 < 4 < 12; IQR (CV): 0 (0.5); values: 0: 53 (9.7%), 3: 56 (10.3%), 4: 313 (57.4%), 6: 80 (14.7%), 8: 38 (7.0%), 10: 2 (0.4%), 12: 3 (0.5%). Valid: 545 (100%), Missing: 0 (0%)

29. turbo [numeric] - 0: 173 (31.7%), 1: 372 (68.3%). Valid: 545 (100%), Missing: 0 (0%)

Generated by summarytools 0.9.6 (R version 4.0.2), 2020-09-29

From the dfSummary, it appeared that there was wide variability in the various engine parameters and features we were inspecting. There was also the issue of missing values in some of the columns. As there weren’t that many NAs, I decided to use na.approx() from the zoo package to fill in the missing values.

#interpolate the NA values (linear interpolation via zoo)
library(zoo)

tfinal <- final_cars %>%
  mutate(inter_fuelC = na.approx(Fuel_consumption)) %>%
  mutate(inter_tankC = na.approx(tank_cap)) %>%
  mutate(inter_topS = na.approx(top_speed)) %>%
  mutate(inter_hp = na.approx(Power_hp)) %>%
  mutate(inter_engc = na.approx(eng_cap))

Question 1: How does brand/make affect the car price?

With the first question asking how important the ‘status’ of a brand was to the car’s final price, I had to come up with categories to segment the cars into. After consulting numerous sources (aka some Googling and Reddit), I decided to split them into two main categories: standard and premium. I was primarily interested in knowing whether a monetary premium was placed on cars of a prestigious make, and if so, how much. (I also created an Others category for makes that just weren’t popular in Singapore and hence did not have a particular image attached to them.)

Category    Makes
Standard    Ford, Citroen, Honda, Hyundai, Jeep, Kia, Mazda, Mitsubishi, Subaru, Suzuki, Toyota, Volkswagen, Nissan
Premium     Audi, Ferrari, Jaguar, Lamborghini, Lexus, Maserati, McLaren, Mercedes-Benz, Porsche, BMW, Volvo
Others      All other makes

#split makes into categories of prestige (1 = standard, 2 = premium, 0 = others)
tfinal <- tfinal %>%
  mutate(make_status = ifelse(categories %in% c("Ford","Citroen","Honda","Hyundai","Jeep","Kia","Mazda","Mitsubishi","Subaru","Suzuki","Toyota","Volkswagen","Nissan"), as.numeric(1),
                               ifelse(categories %in% c("Audi","Ferrari","Jaguar","Lamborghini","Lexus","Maserati","McLaren","Mercedes-Benz","Porsche","BMW","BMW ALPINA","BMW M Series","Volvo"), as.numeric(2), as.numeric(0))))

Before examining the interactions of ‘status’, I first wanted to get an overview of the correlation between the price of the car and the various features that I intended to use in my Linear Regression model.
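The post doesn’t show the code behind this overview plot, but here is a hedged sketch of one way to produce it (corrplot is my assumption, not necessarily what was used):

library(corrplot)

#correlation of price against the candidate model features
num_vars <- tfinal %>%
  select(Current_Price, inter_engc, inter_topS, inter_hp, inter_fuelC, inter_tankC, cyl)
corrplot(cor(num_vars, use = "complete.obs"), method = "number")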

[Plot: correlation matrix of Current Price against the candidate features]

Looking at just the top row (Current Price), most of the features had a somewhat positive correlation with price (except fuel consumption, which had an obviously negative correlation - unless you like your car drinking fuel). This looked promising, and the next step was to determine the interaction of make on these correlations.

For this, I decided to go with a scatterplotMatrix. I filtered out car makes that were marked as “Others” to compare just the popular “Standard” and “Premium” car makers.
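A minimal sketch of that call, assuming the car package’s scatterplotMatrix and the variable names from above:

library(car)

#compare just Standard (1) vs Premium (2) makes, dropping "Others" (0)
compare_cars <- tfinal %>% filter(make_status %in% c(1, 2))

#one regression line per make_status group in each panel
scatterplotMatrix(
  ~ Current_Price + inter_engc + inter_topS + inter_hp + inter_fuelC + inter_tankC | make_status,
  data = compare_cars, smooth = FALSE)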

[Plot: scatterplot matrix of price vs specs, grouped by Standard vs Premium make]

(Looking again at just the top row, Current Price against the other specs.) Unsurprisingly, interaction was observed across all the features - this shows up as non-parallel regression lines, whereas a no-interaction model would have produced parallel lines.

“But wait,” you might hypothetically say, “of course there would be interaction! Premium car makers would naturally produce more spec-ed out cars, making the regression lines cross!” And you would be right. Realising this, I ran the plot again, this time restricting the ranges so that the influence of the V12 supercars etc. would be removed. I tightened the engine specs to a “normal” range, looking at cars with top speeds of 150-260 km/h, engine capacities of 1000-3500 cc, and horsepower of 200-500. I also tightened the fuel tank and fuel consumption variables to standardise the measures (a sketch of the filter is below).
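A sketch of that restriction (the exact fuel tank and fuel consumption cutoffs aren’t given in the post, so they are omitted here):

#restrict to a "normal" spec range before re-plotting
normal_cars <- compare_cars %>%
  filter(between(inter_topS, 150, 260),
         between(inter_engc, 1000, 3500),
         between(inter_hp, 200, 500))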

[Plot: scatterplot matrix for cars within the “normal” spec range]

Looking at the plot again, the Engine CC and Top Speed regression lines appeared roughly parallel. I guess the make premium maybe didn’t matter so much if you were looking at just speed or CC. (A side point: CC is particularly sensitive in the Singapore context, since Road Tax is calculated based on CC. This might explain the lack of a make effect here.)

Looking across the other variables, one that I found particularly interesting was Fuel Consumption. The “Standard” (blue) regression line dropped off fairly quickly (showing that ordinary folks probably weren’t too happy about cars with high fuel consumption). In contrast, the line for “Premium” (pink) stayed almost straight. I guess if you’re throwing down 200+k to buy a supercar, fuel price/consumption really doesn’t matter.

Conclusion: Honestly, the plots weren’t the most conclusive. It was initially clear that make did affect the correlation between variables. But when we did an “apples to apples” comparison (force-comparing lower-spec premium-make cars to similarly spec-ed standard-make cars), the effect of the interaction was noticeably reduced for certain factors. I would, however, venture the conclusion that make matters to an extent, but premium makes also tend to make more premium cars with better specs.

Question 2: Building a model to assess “fair value”

The above plots showed that car makes had an effect on the correlation between price and individual features. However, what about the effect of car makes on the correlation between price and ALL the features?

In order to find this out, I built a multivariate linear regression model (which would also allow me to later create a “fair value” model).

library(caret)
library(jtools)
#build the training and testing set
set.seed(2809)
training.samples <- ttfinal$Current_Price %>% createDataPartition(p = 0.8, list = FALSE)
train.data <- ttfinal[training.samples, ]
test.data <- ttfinal[-training.samples, ]

#build the model
model <- train.data %>% lm(formula = Current_Price~(inter_engc+inter_topS+inter_hp+inter_fuelC+inter_tankC+cyl)*make_status, data = .)

Having created the model, I then tested it against the test set to find the $R^{2}$ value.

#make predictions on model
predictions <- model %>% predict(test.data)
#see model performance
R2(predictions,test.data$Current_Price)
## [1] 0.8576692

The $R^{2}$ value of 0.857 meant that ~86% of the variance of price was explained by the model. In plain English (but honestly not the most accurate way to see it), this meant that the various features we put into the model explained 86% of why price was what it was.
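For reference, the definition being applied here is

$R^{2} = 1 - \frac{\sum_{i}(y_{i} - \hat{y}_{i})^{2}}{\sum_{i}(y_{i} - \bar{y})^{2}}$

where $y_{i}$ are the observed prices, $\hat{y}_{i}$ the model’s predictions, and $\bar{y}$ the mean price.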

The value was suspiciously high, though, and I suspected there might be some overfitting going on. The usual rule of thumb for regression models is 10-15 observations per variable. My data (~545 observations and 8 variables) fit this nicely, but there was still that itch I wanted to scratch.

A good way to test for overfitting is to compute the predicted $R^{2}$ value and compare it to the actual $R^{2}$ value. Overfitting is likely present if the two differ by a large amount. I did this following a helpful post by Thomas Hopper; for more on the theory behind predicted $R^{2}$, refer to that blog.
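A minimal sketch of that computation (the helper functions are adapted from Hopper’s approach, not code shown in this post):

#PRESS: sum of squared leave-one-out prediction errors, via the hat-matrix shortcut
PRESS <- function(fit) {
  sum((residuals(fit) / (1 - lm.influence(fit)$hat))^2)
}

#predicted R^2 = 1 - PRESS / total sum of squares
pred_r_squared <- function(fit) {
  1 - PRESS(fit) / sum(anova(fit)$"Sum Sq")
}

cat("Predicted Residual Error Sum of Squares: ", PRESS(model), "\n")
cat("Predicted R2 value: ", pred_r_squared(model), "\n")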

## Predicted Residual Error Sum of Squares:  3.435012e+12
## Predicted R2 value:  0.8374878

Pleasantly, the predicted $R^{2}$ value was in fact not that far off from the actual $R^{2}$ value, and my fears of overfitting were suitably assuaged. Going back to the issue of the make premium, I then created a summary table of the model to look at the specific interactions.

## MODEL INFO:
## Observations: 437
## Dependent Variable: Current_Price
## Type: OLS linear regression 
## 
## MODEL FIT:
## F(13,423) = 196.32, p = 0.00
## R² = 0.86
## Adj. R² = 0.85 
## 
## Standard errors: OLS
## ----------------------------------------------------------------------
##                                       Est.        S.E.   t val.      p
## ----------------------------- ------------ ----------- -------- ------
## (Intercept)                       -9297.84   111705.01    -0.08   0.93
## inter_engc                           17.28       22.15     0.78   0.44
## inter_topS                        -1112.06      380.70    -2.92   0.00
## inter_hp                            997.66      197.33     5.06   0.00
## inter_fuelC                        6225.33     2903.33     2.14   0.03
## inter_tankC                        1109.77      745.02     1.49   0.14
## cyl                                8050.62     7113.70     1.13   0.26
## make_status                     -211167.02    67772.15    -3.12   0.00
## inter_engc:make_status               20.75       13.81     1.50   0.13
## inter_topS:make_status             1435.46      247.78     5.79   0.00
## inter_hp:make_status               -125.14      111.01    -1.13   0.26
## inter_fuelC:make_status           -3290.09     1582.34    -2.08   0.04
## inter_tankC:make_status            -852.41      495.62    -1.72   0.09
## cyl:make_status                   -1212.66     4321.40    -0.28   0.78
## ----------------------------------------------------------------------

Looking at the output of the multivariate regression model (which takes into account the relationship between ALL the features and price), the effect of make was most significant on Top Speed and Fuel Consumption - the inter_topS:make_status (p = 0.00) and inter_fuelC:make_status (p = 0.04) interaction terms.

Hence, we can conclude that in the multivariate regression model, make continued to influence the correlation between price and the other variables, with the most significant influence on Top Speed and Fuel Consumption.

Question 2b: Normality of Residuals

I also wanted to explore the normality of the residuals of my model and this was easily done through a Q-Q plot:
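(A base-R sketch of the plot call, since the post doesn’t show the exact one used:)

#Q-Q plot of the model residuals against a normal distribution
qqnorm(residuals(model))
qqline(residuals(model))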

[Plot: Q-Q plot of the model residuals]

The residuals did not follow a normal distribution, and there were outliers at both tails. However, while the normality assumption is important, papers have argued that it matters less for large datasets (typically said to be >200 observations, though this of course depends on context) thanks to the Central Limit Theorem. With my dataset well past 200, I was more comfortable with the non-normality of the residuals.

Of course, I still wanted to explore the outliers and mapped the residuals back to the original dataset in order to figure out what these outliers were.

Make          Model                                   Price
Ferrari       Ferrari SF90 Stradale Plug-in Hybrid    1818000
Ferrari       Ferrari 812 GTS                         1560000
Lamborghini   Lamborghini Aventador                   1498000
Jaguar        Jaguar F-PACE                           393999
Land Rover    Land Rover Range Rover                  897999
BMW ALPINA    BMW ALPINA B5 Bi-Turbo Saloon           543800

As expected, most of the outliers were performance cars with incredible prices. The reason the prices of these sports cars were not captured by the model can probably be attributed to an (essentially unmeasurable) feature: buyer sentiment. The top two outliers are Ferraris. The price tag of a Ferrari likely goes beyond its engine CC and fuel consumption; it is probably also very much influenced by how much bigger your wallet looks and how much cooler you seem.

[Image: look at me, I am so cool]

Conclusion (aka: You probably just skipped here)

The conclusions of this little study are summed up in the tl;dr but I’ll repeat them here again just in case you missed it the first time!

1. Make premium effect. We narrowed the specs of the cars and compared ‘normal-spec’ premium makes (BMW etc.) with non-premium makes (Toyota etc.). Make had the most significant interaction with the correlations of Top Speed and Fuel Consumption with price. This made sense: buyers of premium makes likely did not care much about fuel costs and would still buy a car with high fuel consumption.

2. Prediction model. The model had a predicted $R^{2}$ value of 0.837, not far off from the adjusted $R^{2}$ value of 0.857. This means the model explained ~84% of the variance in car prices using the given variables. The residual outliers were mainly supercars (like the Ferrari SF90), whose prices were likely inflated by the significant premium attached to them (above and beyond ‘normal’ luxury makes like BMW).

Personal conclusions: for those who care to read all the way through

1. Gathering of data. I regretted not learning regex earlier. I learnt regex only after collecting the data with Web Scraper; I could probably have used regex within Web Scraper to pre-clean the data during collection, which might have sped up the data cleaning process.

2. Difficulty of visualising multivariate models. I spent the largest amount of time trying to figure out how to best present the multivariate regression model. In the end, I gave up and went with just presenting a summary table from jtools. I’m sure there are better ways to do this and will try to improve on that for the next project.

3. Assumption of normality. After building the model and plotting the Q-Q plot, I initially freaked out because I realised the residuals were super non-normal. This spurred me to read more about normality, and it was comforting to realise that normality is important but not that critical for the accuracy of the model. In the stats geeks’ words: “for moderate to large sample sizes, non-normality of residuals should not adversely affect the usual inferential procedures, as a result of the central limit theorem”.

Thanks for reading~ I hope it helps you make better decisions when you buy a new car in the future! And look out for (hopefully) part 2, where I try to use Shiny to create a web app that finds the fair value of a car you want to buy, using the model we created!