Introduction

Introduction

Abstract: This project analyzes the main factors that influence used car prices using a cleaned and filtered data set of used car listings. After key variables were converted to numeric form and extreme outliers were removed, graphs revealed clear trends, most notably that higher mileage lowers price and a newer model year raises it. A multiple linear regression model was then built to measure these effects. The results showed that mileage, model year, car model, and accident history all significantly affected price; brand was examined in the exploratory plots but is not a term in the final model. This analysis provides a foundation to help first-time buyers better understand fair pricing in the used car market.

Motivation: Many college students have either never bought a car or have never bought one without their parents' help. The used car market is a massive sea of confusing, inconsistent, and sometimes untruthful listings. As a result, first-time buyers risk overpaying or choosing a vehicle that may not be reliable. By building a data-driven regression model to predict used car prices, I aim to give young and first-time buyers in America a leg up in negotiation and an understanding of what a fair price should look like.

Dataset

Rows: 4,009
Columns: 12
$ brand        <chr> "Ford", "Hyundai", "Lexus", "INFINITI", "Audi", "Acura", …
$ model        <chr> "Utility Police Interceptor Base", "Palisade SEL", "RX 35…
$ model_year   <int> 2013, 2021, 2022, 2015, 2021, 2016, 2017, 2001, 2021, 202…
$ milage       <chr> "51,000 mi.", "34,742 mi.", "22,372 mi.", "88,900 mi.", "…
$ fuel_type    <chr> "E85 Flex Fuel", "Gasoline", "Gasoline", "Hybrid", "Gasol…
$ engine       <chr> "300.0HP 3.7L V6 Cylinder Engine Flex Fuel Capability", "…
$ transmission <chr> "6-Speed A/T", "8-Speed Automatic", "Automatic", "7-Speed…
$ ext_col      <chr> "Black", "Moonlight Cloud", "Blue", "Black", "Glacier Whi…
$ int_col      <chr> "Black", "Gray", "Black", "Black", "Black", "Ebony.", "Bl…
$ accident     <chr> "At least 1 accident or damage reported", "At least 1 acc…
$ clean_title  <chr> "Yes", "Yes", "", "Yes", "", "", "Yes", "Yes", "Yes", "Ye…
$ price        <chr> "$10,300", "$38,005", "$54,598", "$15,500", "$34,999", "$…

Research Questions

  • Which predictors influence used car prices most strongly?

  • Does including accident history and title status improve model performance?

  • Do exterior and interior colors affect used car prices, or are these mostly just visual differences?

Data Description

Source: Kaggle (Used Car Prediction Dataset)

Variables

  • Price (listed sale price)

  • Milage (mileage of the vehicle)

  • Model_year (year manufactured)

  • Brand (manufacturer)

  • Model (car model)

  • Accident (accident history)

  • Clean_title (clean or salvage title)

  • Exterior color

  • Interior color

Data Cleaning

Data Cleaning Steps

  • Removed non-numeric characters such as “mi.” and “$” from the price and mileage columns using gsub().

  • Filtered extreme outliers in price, mileage, and model_year using filter(), dropping vehicles listed at 200,000 USD or more, with 300,000 miles or more, or built before 1990.

  • Reduced categorical complexity by lumping rare brands and models into an overall “Other” category. The data set contained many unique brands and 1,898 unique model names, which made a clean-looking graph impossible. Using fct_lump_min() for brands (at least 50 listings) and fct_lump_n() for models (the 15 most common), I was able to reduce the noise on the x-axis and let the model focus on the more common and consistent brands and models.

  • Converted categorical variables to factors for regression modeling.

  • Created log-transformed price to improve normality and stabilize variance.

  • Created mileage in thousands (milage_k) for better interpretability and log-transformed mileage (log_milage = log(milage_k + 1)).

Missing Data Handling

       brand        model   model_year       milage    fuel_type       engine 
           0            0            0            0            0            0 
transmission      ext_col      int_col     accident  clean_title        price 
           0            0            0            0            0            0 

Summary Statistics

     price            milage          milage_k        log_milage     
 Min.   :  2000   Min.   :   100   Min.   :  0.10   Min.   :0.09531  
 1st Qu.: 17000   1st Qu.: 24285   1st Qu.: 24.29   1st Qu.:3.23021  
 Median : 30500   Median : 54000   Median : 54.00   Median :4.00733  
 Mean   : 37863   Mean   : 65415   Mean   : 65.42   Mean   :3.79615  
 3rd Qu.: 48000   3rd Qu.: 95000   3rd Qu.: 95.00   3rd Qu.:4.56435  
 Max.   :199998   Max.   :285000   Max.   :285.00   Max.   :5.65599  
   model_year  
 Min.   :1992  
 1st Qu.:2012  
 Median :2017  
 Mean   :2015  
 3rd Qu.:2020  
 Max.   :2024  

[Figures: Price Distribution (histogram); Log Price Distribution (histogram)]

EDA

[Figures: Price vs Mileage; Price vs Model Year; Price vs Brand; Price vs Model; Price vs Accident History; Price vs Title Status]

Exploratory Data Analysis

Price vs Mileage: The scatter plot showed a strong negative trend with visible curvature: cars with higher mileage tend to have lower prices, and the relationship appears nonlinear. With a regression coefficient of −7.62 × 10⁻⁶ per mile on the log-price scale, every additional 10,000 miles lowers the expected price by about 7.6% on the log scale (roughly 7.3% after exponentiating). Because the model reported under Methods adds a mileage-by-model-year interaction, its coefficients are not directly comparable to this figure. Mileage proved to be one of the strongest predictors of used car price.

Price vs Model Year: Newer vehicles tended to cost more than older ones. The regression coefficient for model year, 0.0467 on the log-price scale, means that for each additional year that the car is newer, the expected price increases by about 4.78%. This aligns with the EDA finding that newer cars are more expensive.

Price vs Title Status: The box plot comparing cars with and without a clean title showed that clean-title cars tended to have slightly higher prices, but there was a lot of overlap between the groups. The regression confirmed this: clean title (p = 0.64) was not a significant predictor, meaning that after controlling for predictors such as mileage and model year, title status does not explain much variation in price. It does not appear in the final model.

Price vs Model: The regression shows that a car’s model has a substantial impact on its price. High-end sports cars and luxury vehicles such as the Porsche 911 and BMW M series are associated with significantly higher prices than the baseline (reference) model. In contrast, more common vehicles such as the Mustang GT and Jeep Wrangler did not show a statistically significant difference in price from the baseline model.

Methods

Analytical Methods

  • Used multiple linear regression with log price as the response

  • Variables selected based on EDA and interpretability

  • Mileage scaled down by 1,000 (milage_k) and then log-transformed (log_milage) to stabilize coefficients

  • Diagnostics examined: linearity, heteroscedasticity, normality, influence


Call:
lm(formula = log_price ~ log_milage * model_year + model + accident, 
    data = cars_filtered)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.5291 -0.3054 -0.0227  0.2930  2.4901 

Coefficients:
                                                 Estimate Std. Error t value
(Intercept)                                    102.667776  13.220605   7.766
log_milage                                     -46.315981   2.946057 -15.721
model_year                                      -0.045057   0.006543  -6.886
model911 Carrera                                 0.952008   0.167060   5.699
model911 Carrera S                               0.965113   0.173454   5.564
modelCamaro 2SS                                  0.025878   0.164273   0.158
modelCorvette Base                               0.377141   0.156670   2.407
modelE-Class E 350                              -0.116591   0.176929  -0.659
modelE-Class E 350 4MATIC                       -0.263059   0.177061  -1.486
modelExplorer XLT                               -0.177626   0.173209  -1.026
modelF-150 Lariat                                0.235475   0.176797   1.332
modelF-150 XLT                                   0.101966   0.151505   0.673
modelF-250 Lariat                                0.637144   0.173255   3.677
modelM3 Base                                     0.695747   0.145563   4.780
modelM4 Base                                     0.405346   0.169766   2.388
modelM5 Base                                     0.591355   0.176856   3.344
modelModel Y Long Range                         -0.075433   0.164083  -0.460
modelMustang GT Premium                         -0.037559   0.167054  -0.225
modelWrangler Sport                              0.027882   0.164542   0.169
modelOther                                       0.067925   0.114958   0.591
accidentAt least 1 accident or damage reported  -0.136725   0.049414  -2.767
accidentNone reported                           -0.047997   0.047339  -1.014
log_milage:model_year                            0.022781   0.001459  15.615
                                               Pr(>|t|)    
(Intercept)                                    1.03e-14 ***
log_milage                                      < 2e-16 ***
model_year                                     6.65e-12 ***
model911 Carrera                               1.30e-08 ***
model911 Carrera S                             2.81e-08 ***
modelCamaro 2SS                                0.874834    
modelCorvette Base                             0.016120 *  
modelE-Class E 350                             0.509953    
modelE-Class E 350 4MATIC                      0.137440    
modelExplorer XLT                              0.305189    
modelF-150 Lariat                              0.182973    
modelF-150 XLT                                 0.500975    
modelF-250 Lariat                              0.000239 ***
modelM3 Base                                   1.82e-06 ***
modelM4 Base                                   0.017003 *  
modelM5 Base                                   0.000834 ***
modelModel Y Long Range                        0.645741    
modelMustang GT Premium                        0.822121    
modelWrangler Sport                            0.865450    
modelOther                                     0.554640    
accidentAt least 1 accident or damage reported 0.005685 ** 
accidentNone reported                          0.310699    
log_milage:model_year                           < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4851 on 3903 degrees of freedom
Multiple R-squared:  0.6215,    Adjusted R-squared:  0.6193 
F-statistic: 291.3 on 22 and 3903 DF,  p-value: < 2.2e-16

Regression Table

| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|:--|--:|--:|--:|--:|--:|--:|
| (Intercept) | 102.6677758 | 13.2206052 | 7.7657395 | 0.0000000 | 76.7478277 | 128.5877238 |
| log_milage | -46.3159810 | 2.9460568 | -15.7213469 | 0.0000000 | -52.0919374 | -40.5400246 |
| model_year | -0.0450566 | 0.0065430 | -6.8862777 | 0.0000000 | -0.0578846 | -0.0322287 |
| model911 Carrera | 0.9520083 | 0.1670602 | 5.6985945 | 0.0000000 | 0.6244748 | 1.2795418 |
| model911 Carrera S | 0.9651133 | 0.1734540 | 5.5640892 | 0.0000000 | 0.6250443 | 1.3051823 |
| modelCamaro 2SS | 0.0258782 | 0.1642727 | 0.1575322 | 0.8748336 | -0.2961903 | 0.3479467 |
| modelCorvette Base | 0.3771414 | 0.1566702 | 2.4072319 | 0.0161201 | 0.0699783 | 0.6843045 |
| modelE-Class E 350 | -0.1165913 | 0.1769294 | -0.6589710 | 0.5099533 | -0.4634741 | 0.2302914 |
| modelE-Class E 350 4MATIC | -0.2630586 | 0.1770609 | -1.4856956 | 0.1374404 | -0.6101992 | 0.0840820 |
| modelExplorer XLT | -0.1776262 | 0.1732088 | -1.0255033 | 0.3051893 | -0.5172145 | 0.1619621 |
| modelF-150 Lariat | 0.2354750 | 0.1767970 | 1.3318945 | 0.1829726 | -0.1111483 | 0.5820982 |
| modelF-150 XLT | 0.1019661 | 0.1515054 | 0.6730197 | 0.5009746 | -0.1950712 | 0.3990035 |
| modelF-250 Lariat | 0.6371443 | 0.1732547 | 3.6774998 | 0.0002387 | 0.2974659 | 0.9768227 |
| modelM3 Base | 0.6957474 | 0.1455631 | 4.7796949 | 0.0000018 | 0.4103604 | 0.9811344 |
| modelM4 Base | 0.4053456 | 0.1697663 | 2.3876680 | 0.0170030 | 0.0725065 | 0.7381847 |
| modelM5 Base | 0.5913546 | 0.1768565 | 3.3436982 | 0.0008345 | 0.2446148 | 0.9380944 |
| modelModel Y Long Range | -0.0754325 | 0.1640828 | -0.4597221 | 0.6457413 | -0.3971287 | 0.2462637 |
| modelMustang GT Premium | -0.0375592 | 0.1670542 | -0.2248325 | 0.8221214 | -0.3650810 | 0.2899626 |
| modelWrangler Sport | 0.0278820 | 0.1645420 | 0.1694519 | 0.8654500 | -0.2947146 | 0.3504785 |
| modelOther | 0.0679253 | 0.1149576 | 0.5908724 | 0.5546401 | -0.1574574 | 0.2933079 |
| accidentAt least 1 accident or damage reported | -0.1367249 | 0.0494141 | -2.7669221 | 0.0056855 | -0.2336047 | -0.0398450 |
| accidentNone reported | -0.0479968 | 0.0473394 | -1.0138874 | 0.3106993 | -0.1408091 | 0.0448155 |
| log_milage:model_year | 0.0227807 | 0.0014589 | 15.6152462 | 0.0000000 | 0.0199205 | 0.0256409 |

Model Fit

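| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|
| 0.6214801 | 0.6193465 | 0.4851262 | 291.2825 | 0 | 22 | -2719.362 | 5486.724 | 5637.333 | 918.5612 | 3903 | 3926 |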

Diagnostics

Diagnostics

Residuals vs Fitted: The Residuals vs Fitted plot helps assess whether two linear regression assumptions, linearity and constant variance, are satisfied. In this plot, the residuals are fairly centered around zero, but there is noticeable spread and some curvature in the red line, suggesting mild nonlinearity. The vertical spread appears mostly consistent, though there is slight funneling near larger fitted values, which hints at mild heteroscedasticity. A few points are labeled, indicating potential outliers or influential observations that deviate from the general pattern of the data.

Q-Q Residuals: The Q-Q plot assesses whether the residuals follow a normal distribution. If the residuals were perfectly normal, the points would lie along the diagonal line. In this model, the middle section follows the line very well, suggesting the bulk of the residuals are approximately normal. However, both tails deviate: the lower tail drops below the line, and the upper tail rises above it. The points near the upper end highlight extreme observations. Overall, the plot suggests that while normality is roughly met in the center of the distribution, there are departures in the tails that may affect inference.

Scale-Location: The Scale-Location plot checks whether the residuals exhibit constant variance across the fitted values. It plots the square root of the absolute standardized residuals against the fitted values from the model log_price ~ log_milage * model_year + model + accident. The points form a mostly level band, but the red smoothing line shows a slight downward trend, indicating that residual variance decreases for higher fitted values. This suggests mild heteroscedasticity, though it is not severe enough to invalidate the model outright.

Cook’s Distance: The Cook’s Distance plot identifies observations that exert an unusually large influence on the regression estimates. Three observations (321, 860, and 2781) stand out. These points should be investigated further to determine whether they are outliers that could affect the model estimates, though the overall model is not dominated by them.

[Figures: Residuals vs Fitted; Q-Q Residuals; Scale-Location; Cook’s Distance]

Conclusion

Conclusion

This project explored the primary factors that drive used car prices using a cleaned and filtered data set of online listings. Across the exploratory analysis and regression modeling, the two predictors that stood out were mileage and model year. Vehicles with higher mileage tended to sell for significantly less, while newer vehicles commanded higher prices. Brand and car model also played a substantial role, with luxury and performance vehicles at the top. Accident history showed small differences when other factors were held constant. In addition, exterior and interior colors were found to have little to no impact on price and were left out of the final model. The multiple linear regression model largely supported these findings and showed that a combination of model year, mileage, car model, and accident history explains about 62% of the variability in log price (R² = 0.62). While the model captures broad pricing patterns and provides reasonable insight for first-time car buyers, the diagnostics indicated signs of nonlinearity and influential outliers. As a result, the model should be interpreted as a baseline rather than a full predictive tool.

Limitations

While this analysis provided useful insight into predicting used car prices, several limitations should be acknowledged. First, the diagnostic plots showed signs of nonlinearity, suggesting that the regression model did not capture the full relationship in the data, especially for mileage at higher price levels. Additionally, the data contained several highly influential observations, which may represent unusually priced vehicles and can disproportionately affect the regression coefficients. The decision to lump brands and models into broader categories simplified the visualizations, but it also removed detail for less common vehicles, such as specific trims. The data also lacks factors such as location, trim level, and optional packages (luxury or performance) that may play a major role in real-world pricing. Finally, because the data comes from online listings rather than final sale prices, it may overstate what buyers actually pay: negotiation is common when buying a car and often lowers the price at least a little.

Personal Information

Jack Sarsen

B.S. Statistics, University of Dayton

Final Project - MTH 369: Regression and Linear Models

LinkedIn: linkedin.com/jacksarsen

---
title: "Used Car Price Determinants"
output:
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: fill
    theme: journal
    source_code: embed
---

```{r setup, include=FALSE}
pacman::p_load(flexdashboard, tidyverse, dplyr, ggplot2, forcats, broom, knitr)
```

Introduction
===
Row
-------
### Introduction
**Abstract:**
This project analyzes the main factors that influence used car prices using a cleaned and filtered data set of used car listings. After key variables were converted to numeric form and extreme outliers were removed, graphs revealed clear trends, most notably that higher mileage lowers price and a newer model year raises it. A multiple linear regression model was then built to measure these effects. The results showed that **mileage, model year, car model, and accident history all significantly affected price**; brand was examined in the exploratory plots but is not a term in the final model. This analysis provides a foundation to help first-time buyers better understand fair pricing in the used car market.

**Motivation:**
Many college students have either never bought a car or have never bought one without their parents' help. The used car market is a massive sea of confusing, inconsistent, and sometimes untruthful listings. As a result, first-time buyers risk overpaying or choosing a vehicle that may not be reliable.
By building a data-driven regression model to predict used car prices, I aim to give young and first-time buyers in America a leg up in negotiation and an understanding of what a fair price should look like.

### Dataset
```{r}
cars <- read.csv("./data/used_cars.csv")
glimpse(cars)
```

Column {.tabset data-width=350} 
-----------------------------------------------------------------------

### Research Questions

- Which predictors influence used car prices most strongly?

- Does including accident history and title status improve model performance?

- Do exterior and interior colors affect used car prices, or are these mostly just visual differences?

### Data Description {data-height=350}
**Source:** Kaggle (Used Car Prediction Dataset)

**Variables**

- Price (listed sale price)

- Milage (mileage of the vehicle)

- Model_year (year manufactured)

- Brand (manufacturer)

- Model (car model)

- Accident (accident history)

- Clean_title (clean or salvage title)

- Exterior color 

- Interior color

Data Cleaning
===
Column {.tabset data-width=350} 
-----------------------------------------------------------------------

### Data Cleaning Steps
- **Removed non-numeric characters** such as "mi." and "$" from the price and mileage columns using `gsub()`.

- **Filtered extreme outliers** in price, mileage, and model_year using `filter()`, dropping vehicles listed at 200,000 USD or more, with 300,000 miles or more, or built before 1990.

- **Reduced categorical complexity** by lumping rare brands and models into an overall "Other" category. The data set contained many unique brands and 1,898 unique model names, which made a clean-looking graph impossible. Using `fct_lump_min()` for brands (at least 50 listings) and `fct_lump_n()` for models (the 15 most common), I was able to reduce the noise on the x-axis and let the model focus on the more common and consistent brands and models.

- **Converted categorical variables to factors** for regression modeling.

- **Created log-transformed price** to improve normality and stabilize variance.

- **Created mileage in thousands** (`milage_k`) for better interpretability and **log-transformed mileage** (`log_milage = log(milage_k + 1)`).
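
The difference between the two lumping helpers is easiest to see on a toy factor. The sketch below uses made-up brand counts (the `brands` vector is hypothetical, not the project's data): `fct_lump_n()` keeps the `n` most frequent levels, while `fct_lump_min()` keeps every level that appears at least `min` times.

```r
library(forcats)

# Hypothetical brand counts: Ford 5, BMW 4, Audi 2, Saab 1, Lotus 1
brands <- factor(c(rep("Ford", 5), rep("BMW", 4), rep("Audi", 2), "Saab", "Lotus"))

# Keep the 2 most frequent levels; everything else becomes "Other"
top2 <- fct_lump_n(brands, n = 2)
levels(top2)     # "BMW" "Ford" "Other"

# Keep every level seen at least twice
min2 <- fct_lump_min(brands, min = 2)
levels(min2)     # "Audi" "BMW" "Ford" "Other"
```

With the default tie handling in `fct_lump_n()`, levels tied at the cutoff are all kept, so the result can contain more than `n` named levels.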

### Missing Data Handling
```{r}
colSums(is.na(cars))
```
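
`is.na()` only counts true `NA` values, so blank strings such as the empty `clean_title` entries visible in the `glimpse()` output are not flagged by the check above. The following is a minimal sketch on a toy data frame (the `toy` object is hypothetical) of how blanks could be counted and, if desired, recoded to `NA`.

```r
# Toy data frame mimicking the blank entries seen in glimpse()
toy <- data.frame(
  accident    = c("None reported", "", "At least 1 accident or damage reported"),
  clean_title = c("Yes", "", ""),
  stringsAsFactors = FALSE
)

na_counts    <- colSums(is.na(toy))   # blanks are not NA, so both counts are 0
blank_counts <- colSums(toy == "")    # accident: 1, clean_title: 2

# Optional: recode blanks so they are treated as missing from here on
toy[toy == ""] <- NA
na_after <- colSums(is.na(toy))       # accident: 1, clean_title: 2
```

An alternative under the same assumption is to pass `na.strings = c("", "NA")` to `read.csv()`. Either choice would change which rows and factor levels reach the regression, so the reported results would need to be refit.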

```{r, include=TRUE}
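# Strip everything except digits ("$", ",", " mi.") and convert to numeric
cars <- cars %>%
  mutate(
    milage = as.numeric(gsub("[^0-9]", "", milage)),
    price = as.numeric(gsub("[^0-9]", "", price))
  ) %>%
  drop_na(price, milage, model_year)

# Drop extreme listings, lump rare categories, and build the model variables
cars_filtered <- cars %>%
  filter(price < 200000, milage < 300000, model_year >= 1990) %>%
  mutate(
    brand = fct_lump_min(brand, min = 50),  # brands with < 50 listings -> "Other"
    model = fct_lump_n(model, n = 15),      # keep the 15 most common models
    clean_title = factor(clean_title),
    accident = factor(accident),
    log_price = log(price),
    milage_k = milage / 1000,               # mileage in thousands of miles
    log_milage = log(milage_k + 1)          # +1 keeps the log defined at zero mileage
  )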
```
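
To make the parsing step concrete, here is a small sketch that applies the same `gsub()` pattern to the first two price and mileage strings from the `glimpse()` output. The variable names (`raw_price`, `raw_milage`, and so on) are illustrative only.

```r
# First two raw values shown in the glimpse() output
raw_price  <- c("$10,300", "$38,005")
raw_milage <- c("51,000 mi.", "34,742 mi.")

# Same pattern as the cleaning chunk: delete every non-digit character
price  <- as.numeric(gsub("[^0-9]", "", raw_price))    # 10300 38005
milage <- as.numeric(gsub("[^0-9]", "", raw_milage))   # 51000 34742

# Derived variables used in the regression
milage_k   <- milage / 1000
log_milage <- log(milage_k + 1)   # equivalent to log1p(milage_k)

# Caveat: the pattern also removes decimal points
gsub("[^0-9]", "", "$10,300.50")  # "1030050"
```

The pattern is safe here only on the assumption that the listings use whole-dollar prices and whole-mile mileage, as the displayed values suggest.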

### Summary Statistics
```{r}
cars_filtered %>%
  select(price, milage, milage_k, log_milage, model_year) %>% 
  summary()
```

Row
-----------------------------------------------------------------------

### Price Distribution {data-height=400}
```{r}
ggplot(cars_filtered, aes(x = price)) +
  geom_histogram(fill = "lightblue", bins = 50) +
  scale_x_continuous(labels = scales::comma) +
  labs(title = "Distribution of Used Car Prices", x = "Price", y = "Count")
```

### Log Price Distribution {data-height=400}
```{r}
ggplot(cars_filtered, aes(x = log_price)) +
  geom_histogram(fill = "navyblue", bins = 50) +
  labs(title = "Distribution of Log-Transformed Price", x = "Log(Price)", y = "Count")
```


EDA
===

Column {.tabset data-width=450} 
------

### Price vs Mileage
```{r}
ggplot(cars_filtered, aes(milage_k, price)) +
  geom_point(alpha = 0.4) +
  labs(x = "Mileage (thousands)", y = "Price", title = "Price vs Mileage") +
  scale_y_continuous(labels = scales::comma)
```

### Price vs Model Year
```{r}
ggplot(cars_filtered, aes(model_year, price)) +
  geom_point(alpha = 0.5, color = "blue") +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Price vs Model Year", x = "Model Year", y = "Price")
```

### Price vs Brand
```{r}
ggplot(cars_filtered, aes(brand, price)) +
  geom_boxplot() +
  scale_y_continuous(labels = scales::comma) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Price by Brand", x = "Brand", y = "Price")
```

### Price vs Model
```{r}
ggplot(cars_filtered, aes(x = model, y = price)) +
  geom_boxplot() +
  scale_y_continuous(labels = scales::comma) +
  labs(
    x = "Model",
    y = "Price ($)",
    title = "Used Car Price by Model"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

### Price vs Accident History
```{r}
ggplot(cars_filtered, aes(accident, price)) +
  geom_boxplot() +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Price by Accident History", x = "Accident", y = "Price")
```

### Price vs Title Status
```{r}
ggplot(cars_filtered, aes(x = clean_title, y = price)) +
  geom_boxplot() +
  scale_y_continuous(labels = scales::comma) +
  labs(
    x = "Clean Title",
    y = "Price ($)",
    title = "Used Car Price by Clean Title Status"
  )
```

Row
-----------------------------------------------------

### Exploratory Data Analysis {data-height=400}

**Price vs Mileage:** The scatter plot showed a strong negative trend with visible curvature: cars with higher mileage tend to have lower prices, and the relationship appears nonlinear. With a regression coefficient of −7.62 × 10⁻⁶ per mile on the log-price scale, **every additional 10,000 miles lowers the expected price by about 7.6%** on the log scale (roughly 7.3% after exponentiating). Because the model reported under Methods adds a mileage-by-model-year interaction, its coefficients are not directly comparable to this figure. Mileage proved to be one of the strongest predictors of used car price.

**Price vs Model Year:** Newer vehicles tended to cost more than older ones. The regression coefficient for model year, 0.0467 on the log-price scale, means that **for each additional year that the car is newer, the expected price increases by about 4.78%**. This aligns with the EDA finding that newer cars are more expensive.

**Price vs Title Status:** The box plot comparing cars with and without a clean title showed that clean-title cars tended to have slightly higher prices, but there was a lot of overlap between the groups. The regression confirmed this: clean title (p = 0.64) was not a significant predictor, meaning that after controlling for predictors such as mileage and model year, title status does not explain much variation in price. It does not appear in the final model.

**Price vs Model:** The regression shows that a car's model has a substantial impact on its price. High-end **sports cars and luxury vehicles** such as the Porsche 911 and BMW M series are associated with significantly higher prices than the baseline (reference) model. In contrast, more common vehicles such as the Mustang GT and Jeep Wrangler did not show a statistically significant difference in price from the baseline model.
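
The percentage interpretations quoted for mileage and model year follow from exponentiating a coefficient on the log-price scale. A minimal sketch of that arithmetic, assuming the two coefficients are exactly the values quoted in the text (the helper `pct_change()` is an illustrative name, not part of the analysis code):

```r
# Coefficients as quoted in the text (log-price scale)
b_mileage <- -7.62e-6   # per mile
b_year    <- 0.0467     # per model year

# Percent change in expected price for a change of `delta` in the predictor
pct_change <- function(beta, delta = 1) 100 * (exp(beta * delta) - 1)

pct_per_10k_miles <- pct_change(b_mileage, 10000)
pct_per_year      <- pct_change(b_year)

round(pct_per_10k_miles, 2)  # -7.34
round(pct_per_year, 2)       # 4.78
```

For small coefficients the shortcut `100 * beta * delta` is close (7.62% versus 7.34% here), but the exponentiated form is the exact multiplicative change.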

Methods
===

Column {.tabset data-width=350} 
-----------------------------------------------------------------------

### Analytical Methods
- Used **multiple linear regression** with log price as the response

- Variables selected based on EDA and interpretability

- Mileage scaled down by 1,000 (`milage_k`) and then log-transformed (`log_milage`) to stabilize coefficients

- Diagnostics examined: linearity, heteroscedasticity, normality, influence


```{r}
model <- lm(log_price ~ log_milage * model_year + model + accident,
            data = cars_filtered)
summary(model)
```
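
Because the model includes the `log_milage:model_year` interaction, the two main-effect coefficients cannot be read on their own: each slope depends on the value of the other variable. The sketch below hard-codes the estimates as printed by `summary(model)` (so results are subject to rounding) and evaluates the implied slopes; `mileage_slope()` and `year_slope()` are illustrative helper names.

```r
# Estimates copied from the printed summary(model) output
b_log_milage  <- -46.315981
b_model_year  <- -0.045057
b_interaction <-  0.022781

# Slope of log(price) with respect to log_milage at a given model year
mileage_slope <- function(year) b_log_milage + b_interaction * year

# Slope of log(price) with respect to model_year at a given log_milage
year_slope <- function(log_milage) b_model_year + b_interaction * log_milage

mileage_slope(c(2005, 2015, 2020))  # about -0.64, -0.41, -0.30
year_slope(4.00733)                 # about 0.046 at the median log_milage
```

Under these numbers the mileage penalty is steeper for older cars, and at median mileage a one-year-newer car is listed roughly 4.7% higher. The negative `model_year` main effect applies only at `log_milage = 0`, which is essentially a zero-mileage car.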

Column {.tabset data-width=350} 
-----------------------------------------------------------------------
### Regression Table
```{r}
tidy(model, conf.int = TRUE) %>% kable()
```

### Model Fit
```{r}
glance(model) %>% kable()
```
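
One way to approach the research question about whether accident history improves the model is a nested-model F test with `anova()`. The following is a hedged sketch on simulated data (the `toy` data frame and its coefficients are invented for illustration and are not the project's results).

```r
set.seed(369)
n <- 200

# Simulated stand-in for the cleaned data
toy <- data.frame(
  log_milage = runif(n, 0, 5),
  accident   = factor(sample(c("None reported", "At least 1 accident"),
                             n, replace = TRUE))
)
toy$log_price <- 11 - 0.4 * toy$log_milage -
  0.15 * (toy$accident == "At least 1 accident") +
  rnorm(n, sd = 0.3)

# Reduced model without accident, full model with it
reduced <- lm(log_price ~ log_milage, data = toy)
full    <- lm(log_price ~ log_milage + accident, data = toy)

# F test of whether the extra term reduces residual variance
cmp <- anova(reduced, full)
cmp
```

Applied to the real data, the same pattern would compare the fitted model against versions with `accident` removed or `clean_title` added, alongside the adjusted R², AIC, and BIC that `glance()` already reports.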

Diagnostics
===

Column {.tabset data-width=350} 
-----------------------------------------------------------------------
### Diagnostics

**Residuals vs Fitted:**
The Residuals vs Fitted plot helps assess whether two linear regression assumptions, linearity and constant variance, are satisfied. In this plot, the residuals are fairly centered around zero, but there is noticeable spread and some curvature in the red line, suggesting mild nonlinearity. The vertical spread appears mostly consistent, though there is slight funneling near larger fitted values, which hints at mild heteroscedasticity. A few points are labeled, indicating potential outliers or influential observations that deviate from the general pattern of the data.

**Q-Q Residuals:**
The Q-Q plot assesses whether the residuals follow a normal distribution. If the residuals were perfectly normal, the points would lie along the diagonal line. In this model, the middle section follows the line very well, suggesting the bulk of the residuals are approximately normal. However, both tails deviate: the lower tail drops below the line, and the upper tail rises above it. The points near the upper end highlight extreme observations. Overall, the plot suggests that while normality is roughly met in the center of the distribution, there are departures in the tails that may affect inference.

**Scale-Location:**
The Scale-Location plot checks whether the residuals exhibit constant variance across the fitted values. It plots the square root of the absolute standardized residuals against the fitted values from the model `log_price ~ log_milage * model_year + model + accident`. The points form a mostly level band, but the red smoothing line shows a slight downward trend, indicating that residual variance decreases for higher fitted values. This suggests mild heteroscedasticity, though it is not severe enough to invalidate the model outright.

**Cook's Distance:**
The Cook's Distance plot identifies observations that exert an unusually large influence on the regression estimates. Three observations (321, 860, and 2781) stand out. These points should be investigated further to determine whether they are outliers that could affect the model estimates, though the overall model is not dominated by them.
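
To follow up on flagged observations, the Cook's distances can be extracted directly rather than read off the plot. This is a minimal sketch on simulated data with one planted influential point; the `4 / n` cutoff is a common rule of thumb assumed here, not a threshold used in the analysis above.

```r
set.seed(1)

# Simulated data with one planted high-leverage, badly fitting point
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100, sd = 0.5)
x[100] <- 6
y[100] <- -4

fit <- lm(y ~ x)
cd  <- cooks.distance(fit)

# Three most influential observations, largest first
top3 <- head(sort(cd, decreasing = TRUE), 3)
top3

# Observations above the rule-of-thumb cutoff
flagged <- which(cd > 4 / length(cd))
```

On the real model, `cooks.distance(model)` would return one value per row used in the fit, and refitting without the flagged rows shows how much the coefficients move.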

Column {.tabset data-width=350} 
-----------------------------------------------------------------------

### Residuals vs Fitted
```{r}
plot(model, which = 1)
```

### Q-Q Residuals
```{r}
plot(model, which = 2)
```

### Scale-Location
```{r}
plot(model, which = 3)
```

### Cook's Distance
```{r}
plot(model, which = 4)
```

Conclusion
===

Row
-------
### **Conclusion**
This project explored the primary factors that drive used car prices using a cleaned and filtered data set of online listings. Across the exploratory analysis and regression modeling, the two predictors that stood out were mileage and model year. Vehicles with higher mileage tended to sell for significantly less, while newer vehicles commanded higher prices. Brand and car model also played a substantial role, with luxury and performance vehicles at the top. Accident history showed small differences when other factors were held constant. In addition, exterior and interior colors were found to have little to no impact on price and were left out of the final model.
The multiple linear regression model largely supported these findings and showed that a combination of model year, mileage, car model, and accident history explains about 62% of the variability in log price (R² = 0.62). While the model captures broad pricing patterns and provides reasonable insight for first-time car buyers, the diagnostics indicated signs of nonlinearity and influential outliers. As a result, the model should be interpreted as a baseline rather than a full predictive tool.

### **Limitations**
While this analysis provided useful insight into predicting used car prices, several limitations should be acknowledged. First, the **diagnostic plots showed signs of nonlinearity**, suggesting that the regression model did not capture the full relationship in the data, especially for mileage at higher price levels. Additionally, **the data contained several highly influential observations**, which may represent unusually priced vehicles and can disproportionately affect the regression coefficients. The decision to **lump brands and models** into broader categories simplified the visualizations, but it also removed detail for less common vehicles, such as specific trims. The data also lacks factors such as **location, trim level, and optional packages (luxury or performance)** that may play a major role in real-world pricing. Finally, because the data comes from **online listings rather than final sale prices**, it may overstate what buyers actually pay: negotiation is common when buying a car and often lowers the price at least a little.


Column {.tabset data-width=350} 
-----------------------------------------------------------------------

### Personal Information

Jack Sarsen

B.S. Statistics, University of Dayton

Final Project - MTH 369: Regression and Linear Models

LinkedIn: linkedin.com/jacksarsen

### References

Dataset - https://www.kaggle.com/datasets/taeefnajib/used-car-price-prediction-dataset