Abstract: This project analyzes the main factors that influence used car prices using a cleaned and filtered data set of used car listings. After key variables were converted to numeric form and extreme outliers were removed, graphs revealed clear trends, most notably that higher mileage lowers price and a newer model year raises it. A multiple linear regression model was then built to measure these effects. The results showed that mileage, model year, model, and accident history were all statistically significant predictors of price; brand was explored graphically but was not included in the final model. This analysis provides a foundation to help first-time buyers better understand fair pricing in the used car market.
Motivation: Many college students have either never bought a car or have never bought one without their parents' help. The used car market is a massive sea of confusing, inconsistent, and untruthful listings. As a result, first-time buyers risk overpaying or choosing a vehicle that may not be reliable. By building a data-driven regression model to predict used car prices, I aim to give young and first-time buyers in America a leg up in negotiation and an understanding of what a fair price should look like.
Rows: 4,009
Columns: 12
$ brand <chr> "Ford", "Hyundai", "Lexus", "INFINITI", "Audi", "Acura", …
$ model <chr> "Utility Police Interceptor Base", "Palisade SEL", "RX 35…
$ model_year <int> 2013, 2021, 2022, 2015, 2021, 2016, 2017, 2001, 2021, 202…
$ milage <chr> "51,000 mi.", "34,742 mi.", "22,372 mi.", "88,900 mi.", "…
$ fuel_type <chr> "E85 Flex Fuel", "Gasoline", "Gasoline", "Hybrid", "Gasol…
$ engine <chr> "300.0HP 3.7L V6 Cylinder Engine Flex Fuel Capability", "…
$ transmission <chr> "6-Speed A/T", "8-Speed Automatic", "Automatic", "7-Speed…
$ ext_col <chr> "Black", "Moonlight Cloud", "Blue", "Black", "Glacier Whi…
$ int_col <chr> "Black", "Gray", "Black", "Black", "Black", "Ebony.", "Bl…
$ accident <chr> "At least 1 accident or damage reported", "At least 1 acc…
$ clean_title <chr> "Yes", "Yes", "", "Yes", "", "", "Yes", "Yes", "Yes", "Ye…
$ price <chr> "$10,300", "$38,005", "$54,598", "$15,500", "$34,999", "$…
Which predictors most strongly influence used car prices?
Does including accident history and title status improve model performance?
Do exterior and interior colors affect used car prices, or are these mostly just visual differences?
Source: Kaggle (Used Car Prediction Dataset)
Variables - Price (listed sale price)
Milage (mileage of the vehicle)
Model_year (year manufactured)
Brand (manufacturer)
Model (car model)
Accident (accident history)
Clean_title (clean or salvage title)
Exterior color
Interior color
Removed non-numeric characters such as “mi.” and “$” from the price and mileage columns using gsub().
Filtered extreme outliers in price, mileage, and model_year using filter(), removing all vehicles priced at 200,000 USD or more, with 300,000 miles or more, or produced before 1990.
Reduced categorical complexity by lumping rare brands and models into an overall “Other” category. The data set contained many unique brands and 1898 unique model names, which made a readable graph impossible. Using fct_lump_min() for brands and fct_lump_n() for models reduced the clutter on the x-axis and let the model focus on the most common brands and models.
Converted categorical variables to factors for regression modeling.
Created log-transformed price to improve normality and stabilize variance.
Created mileage in thousands (milage_k) for better interpretability, and log-transformed mileage as log(milage_k + 1).
brand model model_year milage fuel_type engine
0 0 0 0 0 0
transmission ext_col int_col accident clean_title price
0 0 0 0 0 0
price milage milage_k log_milage
Min. : 2000 Min. : 100 Min. : 0.10 Min. :0.09531
1st Qu.: 17000 1st Qu.: 24285 1st Qu.: 24.29 1st Qu.:3.23021
Median : 30500 Median : 54000 Median : 54.00 Median :4.00733
Mean : 37863 Mean : 65415 Mean : 65.42 Mean :3.79615
3rd Qu.: 48000 3rd Qu.: 95000 3rd Qu.: 95.00 3rd Qu.:4.56435
Max. :199998 Max. :285000 Max. :285.00 Max. :5.65599
model_year
Min. :1992
1st Qu.:2012
Median :2017
Mean :2015
3rd Qu.:2020
Max. :2024
Price vs Mileage: The scatter plot showed a strong negative trend: cars with higher mileage tend to have lower prices, and the pattern appears nonlinear. A regression coefficient of −7.62 × 10⁻⁶ (per mile, on the log-price scale) means that for every additional 10,000 miles, the expected price decreases by about 7.3%. Mileage proved to be one of the strongest predictors of used car price.
Price vs Model Year: Newer vehicles tended to have higher prices than older vehicles. The regression coefficient for model year (0.0467) means that for each additional year newer, the expected price increases by about 4.78%. This aligns with the EDA finding that newer cars are more expensive.
Price vs Title Status: The box plot comparing cars with clean titles and non-clean titles showed that clean-title cars tended to have slightly higher prices, but there was a lot of overlap between the groups. This was confirmed in the regression. Clean title (p = 0.64) was not a significant predictor, meaning that after controlling for predictors such as mileage and model year, title status does not explain much variation in price.
Price vs Model: The regression shows that a car’s model has a substantial impact on its price. High-end sports cars and luxury vehicles such as the Porsche 911 and BMW M series are associated with significantly higher prices relative to the regression’s reference model. In contrast, more common vehicles such as the Mustang GT and Jeep Wrangler did not show a statistically significant difference in price from the reference model.
Used multiple linear regression with log price as the response
Variables selected based on EDA and interpretability
Mileage scaled to thousands (milage_k) and log-transformed (log_milage) to stabilize coefficients
Diagnostics examined: linearity, heteroscedasticity, normality, influence
Call:
lm(formula = log_price ~ log_milage * model_year + model + accident,
data = cars_filtered)
Residuals:
Min 1Q Median 3Q Max
-2.5291 -0.3054 -0.0227 0.2930 2.4901
Coefficients:
Estimate Std. Error t value
(Intercept) 102.667776 13.220605 7.766
log_milage -46.315981 2.946057 -15.721
model_year -0.045057 0.006543 -6.886
model911 Carrera 0.952008 0.167060 5.699
model911 Carrera S 0.965113 0.173454 5.564
modelCamaro 2SS 0.025878 0.164273 0.158
modelCorvette Base 0.377141 0.156670 2.407
modelE-Class E 350 -0.116591 0.176929 -0.659
modelE-Class E 350 4MATIC -0.263059 0.177061 -1.486
modelExplorer XLT -0.177626 0.173209 -1.026
modelF-150 Lariat 0.235475 0.176797 1.332
modelF-150 XLT 0.101966 0.151505 0.673
modelF-250 Lariat 0.637144 0.173255 3.677
modelM3 Base 0.695747 0.145563 4.780
modelM4 Base 0.405346 0.169766 2.388
modelM5 Base 0.591355 0.176856 3.344
modelModel Y Long Range -0.075433 0.164083 -0.460
modelMustang GT Premium -0.037559 0.167054 -0.225
modelWrangler Sport 0.027882 0.164542 0.169
modelOther 0.067925 0.114958 0.591
accidentAt least 1 accident or damage reported -0.136725 0.049414 -2.767
accidentNone reported -0.047997 0.047339 -1.014
log_milage:model_year 0.022781 0.001459 15.615
Pr(>|t|)
(Intercept) 1.03e-14 ***
log_milage < 2e-16 ***
model_year 6.65e-12 ***
model911 Carrera 1.30e-08 ***
model911 Carrera S 2.81e-08 ***
modelCamaro 2SS 0.874834
modelCorvette Base 0.016120 *
modelE-Class E 350 0.509953
modelE-Class E 350 4MATIC 0.137440
modelExplorer XLT 0.305189
modelF-150 Lariat 0.182973
modelF-150 XLT 0.500975
modelF-250 Lariat 0.000239 ***
modelM3 Base 1.82e-06 ***
modelM4 Base 0.017003 *
modelM5 Base 0.000834 ***
modelModel Y Long Range 0.645741
modelMustang GT Premium 0.822121
modelWrangler Sport 0.865450
modelOther 0.554640
accidentAt least 1 accident or damage reported 0.005685 **
accidentNone reported 0.310699
log_milage:model_year < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4851 on 3903 degrees of freedom
Multiple R-squared: 0.6215, Adjusted R-squared: 0.6193
F-statistic: 291.3 on 22 and 3903 DF, p-value: < 2.2e-16
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 102.6677758 | 13.2206052 | 7.7657395 | 0.0000000 | 76.7478277 | 128.5877238 |
| log_milage | -46.3159810 | 2.9460568 | -15.7213469 | 0.0000000 | -52.0919374 | -40.5400246 |
| model_year | -0.0450566 | 0.0065430 | -6.8862777 | 0.0000000 | -0.0578846 | -0.0322287 |
| model911 Carrera | 0.9520083 | 0.1670602 | 5.6985945 | 0.0000000 | 0.6244748 | 1.2795418 |
| model911 Carrera S | 0.9651133 | 0.1734540 | 5.5640892 | 0.0000000 | 0.6250443 | 1.3051823 |
| modelCamaro 2SS | 0.0258782 | 0.1642727 | 0.1575322 | 0.8748336 | -0.2961903 | 0.3479467 |
| modelCorvette Base | 0.3771414 | 0.1566702 | 2.4072319 | 0.0161201 | 0.0699783 | 0.6843045 |
| modelE-Class E 350 | -0.1165913 | 0.1769294 | -0.6589710 | 0.5099533 | -0.4634741 | 0.2302914 |
| modelE-Class E 350 4MATIC | -0.2630586 | 0.1770609 | -1.4856956 | 0.1374404 | -0.6101992 | 0.0840820 |
| modelExplorer XLT | -0.1776262 | 0.1732088 | -1.0255033 | 0.3051893 | -0.5172145 | 0.1619621 |
| modelF-150 Lariat | 0.2354750 | 0.1767970 | 1.3318945 | 0.1829726 | -0.1111483 | 0.5820982 |
| modelF-150 XLT | 0.1019661 | 0.1515054 | 0.6730197 | 0.5009746 | -0.1950712 | 0.3990035 |
| modelF-250 Lariat | 0.6371443 | 0.1732547 | 3.6774998 | 0.0002387 | 0.2974659 | 0.9768227 |
| modelM3 Base | 0.6957474 | 0.1455631 | 4.7796949 | 0.0000018 | 0.4103604 | 0.9811344 |
| modelM4 Base | 0.4053456 | 0.1697663 | 2.3876680 | 0.0170030 | 0.0725065 | 0.7381847 |
| modelM5 Base | 0.5913546 | 0.1768565 | 3.3436982 | 0.0008345 | 0.2446148 | 0.9380944 |
| modelModel Y Long Range | -0.0754325 | 0.1640828 | -0.4597221 | 0.6457413 | -0.3971287 | 0.2462637 |
| modelMustang GT Premium | -0.0375592 | 0.1670542 | -0.2248325 | 0.8221214 | -0.3650810 | 0.2899626 |
| modelWrangler Sport | 0.0278820 | 0.1645420 | 0.1694519 | 0.8654500 | -0.2947146 | 0.3504785 |
| modelOther | 0.0679253 | 0.1149576 | 0.5908724 | 0.5546401 | -0.1574574 | 0.2933079 |
| accidentAt least 1 accident or damage reported | -0.1367249 | 0.0494141 | -2.7669221 | 0.0056855 | -0.2336047 | -0.0398450 |
| accidentNone reported | -0.0479968 | 0.0473394 | -1.0138874 | 0.3106993 | -0.1408091 | 0.0448155 |
| log_milage:model_year | 0.0227807 | 0.0014589 | 15.6152462 | 0.0000000 | 0.0199205 | 0.0256409 |
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.6214801 | 0.6193465 | 0.4851262 | 291.2825 | 0 | 22 | -2719.362 | 5486.724 | 5637.333 | 918.5612 | 3903 | 3926 |
Residuals vs Fitted The Residuals vs Fitted plot helps assess whether the linear regression assumptions of linearity and constant variance are satisfied. In this plot, the residuals are fairly centered around zero, but there is noticeable spread and some curvature in the red line, suggesting mild nonlinearity. The vertical spread appears mostly consistent, though there is slight funneling near larger fitted values, which hints at mild heteroscedasticity. A few points are labeled, indicating potential outliers or influential observations that deviate from the general pattern of the data.
Q-Q Residuals The Q-Q plot assesses whether the residuals follow a normal distribution. If the residuals were perfectly normal, the points would lie along the diagonal line. In this model, the middle section follows the line very well, suggesting the bulk of the residuals are approximately normal. However, both tails deviate: the lower tail drops below the line and the upper tail rises above it, which indicates heavier tails than a normal distribution. The points near the upper end highlight extreme observations. Overall, the plot suggests that while normality is roughly met in the center of the distribution, there are departures in the tails that may affect inference.
Scale-Location The Scale-Location plot checks whether the residuals exhibit constant variance across levels of the fitted values. In this plot, the square root of the absolute standardized residuals is plotted against the fitted values from the model log_price ~ log_milage * model_year + model + accident. The residuals appear mostly clustered around the horizontal axis, but the red smoothing line shows a slight downward trend, indicating that residual variance decreases for higher fitted values. This suggests mild heteroscedasticity, though it is not severe enough to invalidate the model.
Cook’s Distance The Cook’s Distance plot identifies observations that exert an unusually large influence on the regression estimates. Several observations in the model (321, 860, and 2781) stand out. These points should be investigated further to determine whether they represent potential outliers that could affect the model estimates, though the overall model is not dominated by these observations.
This project explored the primary factors that drive used car prices using a cleaned and filtered data set of online listings. Across the exploratory analysis and regression modeling, the two predictors that stood out were mileage and model year. Vehicles with higher mileage tended to be priced significantly lower, while newer vehicles commanded higher prices. Brand and car model also played a substantial role, with luxury and performance vehicles priced highest. Accident history showed small differences when holding other factors constant. In addition, exterior and interior colors were found to have little to no impact on the price of a used vehicle. The multiple linear regression model largely supported these findings and revealed that a combination of model year, mileage, car model, and accident history explains about 62% of the variability in log price (R² = 0.62). While the model captured broad pricing patterns and provides reasonable insight for first-time car buyers, the diagnostics indicated signs of nonlinearity and influential outliers. As a result, the model should be interpreted as a baseline rather than a full-fledged predictive tool.
While this analysis provided useful insight into predicting used car prices, several limitations should be acknowledged. First, the diagnostic plots showed signs of nonlinearity, suggesting that the regression model did not capture the full relationship in the data, especially for mileage at the higher price levels. Additionally, the data contained several highly influential outliers, which may represent unusually priced vehicles and can disproportionately affect the regression coefficients. The decision to lump brands and models into broader categories helped simplify the visualizations, but it also removed detail about less common vehicles, for example specific trims. The data also lacks factors such as location, trim level, and optional packages (luxury or performance) that may play a major role in real-world pricing. Finally, because the data comes from online listings rather than final sale prices, it may overstate what buyers actually pay, since negotiation is common when buying a car and often lowers the price at least slightly.
Jack Sarsen
B.S. Statistics, University of Dayton
Final Project - MTH 369: Regression and Linear Models
Linkedin: linkedin.com/jacksarsen
Dataset - https://www.kaggle.com/datasets/taeefnajib/used-car-price-prediction-dataset
---
title: "Used Car Price Determinants"
output:
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: fill
    theme: journal
    source_code: embed
---
```{r setup, include=FALSE}
pacman::p_load(flexdashboard, tidyverse, dplyr, ggplot2, forcats, broom, knitr)
```
Introduction
===
Row
-------
### Introduction
**Abstract:**
This project analyzes the main factors that influence used car prices using a cleaned and filtered data set of used car listings. After key variables were converted to numeric form and extreme outliers were removed, graphs revealed clear trends, most notably that higher mileage lowers price and a newer model year raises it. A multiple linear regression model was then built to measure these effects. The results showed that **mileage, model year, model, and accident history were all statistically significant predictors of price**; brand was explored graphically but was not included in the final model. This analysis provides a foundation to help first-time buyers better understand fair pricing in the used car market.
**Motivation:**
Many college students have either never bought a car or have never bought one without their parents' help. The used car market is a massive sea of confusing, inconsistent, and untruthful listings. As a result, first-time buyers risk overpaying or choosing a vehicle that may not be reliable.
By building a data-driven regression model to predict used car prices, I aim to give young and first-time buyers in America a leg up in negotiation and an understanding of what a fair price should look like.
### Dataset
```{r}
cars <- read.csv("./data/used_cars.csv")
glimpse(cars)
```
Column {.tabset data-width=350}
-----------------------------------------------------------------------
### Research Questions
- Which predictors most strongly influence used car prices?
- Does including accident history and title status improve model performance?
- Do exterior and interior colors affect used car prices, or are these mostly just visual differences?
### Data Description {data-height=350}
**Source:** Kaggle (Used Car Prediction Dataset)
**Variables**
- Price (listed sale price)
- Milage (mileage of the vehicle)
- Model_year (year manufactured)
- Brand (manufacturer)
- Model (car model)
- Accident (accident history)
- Clean_title (clean or salvage title)
- Exterior color
- Interior color
Data Cleaning
===
Column {.tabset data-width=350}
-----------------------------------------------------------------------
### Data Cleaning Steps
- **Removed non-numeric characters** such as "mi." and "$" from the price and mileage columns using `gsub()`.
- **Filtered extreme outliers** in price, mileage, and model_year using `filter()`, removing all vehicles priced at 200,000 USD or more, with 300,000 miles or more, or produced before 1990.
- **Reduced categorical complexity** by lumping rare brands and models into an overall "Other" category. The data set contained many unique brands and 1898 unique model names, which made a readable graph impossible. Using `fct_lump_min()` for brands and `fct_lump_n()` for models reduced the clutter on the x-axis and let the model focus on the most common brands and models.
- **Converted categorical variables to factors** for regression modeling.
- **Created log-transformed price** to improve normality and stabilize variance.
- **Created mileage in thousands** (`milage_k`) for better interpretability, and **log-transformed mileage** as `log(milage_k + 1)`.
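As a small illustration of the first cleaning step, the sketch below applies the same `gsub("[^0-9]", "", x)` pattern to a few strings in the format of the raw `milage` and `price` columns. The helper name `to_number` is hypothetical and only used here for illustration.

```r
# Toy strings in the same format as the raw `milage` and `price` columns
raw_milage <- c("51,000 mi.", "34,742 mi.")
raw_price  <- c("$10,300", "$38,005")

# Hypothetical helper (not part of the project code):
# drop every non-digit character, then convert to numeric
to_number <- function(x) as.numeric(gsub("[^0-9]", "", x))

clean_milage <- to_number(raw_milage)
clean_price  <- to_number(raw_price)

print(clean_milage)
print(clean_price)
```

Note that this pattern also removes decimal points and minus signs, so it is only appropriate when the values are whole, non-negative numbers, as they appear to be in this data set.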
### Missing Data Handling
```{r}
colSums(is.na(cars))
```
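One caveat with the check above: `is.na()` does not count blank strings. The `glimpse()` output shows `""` entries in `clean_title`, and the regression output has a third `accident` level besides the two labeled ones, which suggests blanks there as well. The following is a minimal sketch on a made-up data frame (`toy` is hypothetical) showing how blanks can be counted alongside `NA`.

```r
# Toy data frame mimicking read.csv() output, where blank cells arrive as ""
toy <- data.frame(
  clean_title = c("Yes", "", "Yes", ""),
  accident    = c("None reported", "",
                  "At least 1 accident or damage reported", "None reported"),
  model_year  = c(2013L, 2021L, NA, 2015L),
  stringsAsFactors = FALSE
)

# Counts only true NA values; blank strings are not detected
na_only <- colSums(is.na(toy))

# Counts NA values and blank strings together
na_or_blank <- colSums(is.na(toy) | toy == "")

print(na_only)
print(na_or_blank)
```

An alternative is to pass `na.strings = c("", "NA")` to `read.csv()` so that blanks are read as `NA` from the start. That would change which rows and factor levels reach the model, so the reported results would need to be re-run.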
```{r, include=TRUE}
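# Strip everything except digits, e.g. "51,000 mi." -> 51000 and "$10,300" -> 10300
cars <- cars %>%
  mutate(
    milage = as.numeric(gsub("[^0-9]", "", milage)),
    price  = as.numeric(gsub("[^0-9]", "", price))
  ) %>%
  drop_na(price, milage, model_year)

# Drop extreme listings, lump rare levels, and build the modeling variables
cars_filtered <- cars %>%
  filter(price < 200000, milage < 300000, model_year >= 1990) %>%
  mutate(
    brand       = fct_lump_min(brand, min = 50),  # brands with fewer than 50 listings -> "Other"
    model       = fct_lump_n(model, n = 15),      # keep the 15 most common models, rest -> "Other"
    clean_title = factor(clean_title),
    accident    = factor(accident),
    log_price   = log(price),
    milage_k    = milage / 1000,                  # mileage in thousands
    log_milage  = log(milage_k + 1)               # +1 keeps the log defined near zero mileage
  )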
```
### Summary Statistics
```{r}
cars_filtered %>%
  select(price, milage, milage_k, log_milage, model_year) %>%
  summary()
```
Row
-----------------------------------------------------------------------
### Price Distribution {data-height=400}
```{r}
ggplot(cars_filtered, aes(x = price)) +
  geom_histogram(fill = "lightblue", bins = 50) +
  scale_x_continuous(labels = scales::comma) +
  labs(title = "Distribution of Used Car Prices", x = "Price", y = "Count")
```
### Log Price Distribution {data-height=400}
```{r}
ggplot(cars_filtered, aes(x = log_price)) +
  geom_histogram(fill = "navyblue", bins = 50) +
  labs(title = "Distribution of Log-Transformed Price", x = "Log(Price)", y = "Count")
```
EDA
===
Column {.tabset data-width=450}
------
### Price vs Mileage
```{r}
ggplot(cars_filtered, aes(milage_k, price)) +
  geom_point(alpha = 0.4) +
  labs(x = "Mileage (thousands)", y = "Price", title = "Price vs Mileage") +
  scale_y_continuous(labels = scales::comma)
```
### Price vs Model Year
```{r}
ggplot(cars_filtered, aes(model_year, price)) +
  geom_point(alpha = 0.5, color = "blue") +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Price vs Model Year")
```
### Price vs Brand
```{r}
ggplot(cars_filtered, aes(brand, price)) +
  geom_boxplot() +
  scale_y_continuous(labels = scales::comma) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Price by Brand", x = "Brand", y = "Price")
```
### Price vs Model
```{r}
ggplot(cars_filtered, aes(x = model, y = price)) +
  geom_boxplot() +
  scale_y_continuous(labels = scales::comma) +
  labs(
    x = "Model",
    y = "Price ($)",
    title = "Used Car Price by Model"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
### Price vs Accident History
```{r}
ggplot(cars_filtered, aes(accident, price)) +
  geom_boxplot() +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Price by Accident History", x = "Accident", y = "Price")
```
### Price vs Title Status
```{r}
ggplot(cars_filtered, aes(x = clean_title, y = price)) +
  geom_boxplot() +
  scale_y_continuous(labels = scales::comma) +
  labs(
    x = "Clean Title",
    y = "Price ($)",
    title = "Used Car Price by Clean Title Status"
  )
```
Row
-----------------------------------------------------
### Exploratory Data Analysis {data-height=400}
**Price vs Mileage:** The scatter plot showed a strong negative trend: cars with higher mileage tend to have lower prices, and the pattern appears nonlinear. A regression coefficient of −7.62 × 10⁻⁶ (per mile, on the log-price scale) means that **for every additional 10,000 miles, the expected price decreases by about 7.3%**. Mileage proved to be one of the strongest predictors of used car price.
**Price vs Model Year:** Newer vehicles tended to have higher prices than older vehicles. The regression coefficient for model year (0.0467) means that **for each additional year newer, the expected price increases by about 4.78%**. This aligns with the EDA finding that newer cars are more expensive.
**Price vs Title Status:** The box plot comparing cars with clean titles and non-clean titles showed that clean-title cars tended to have slightly higher prices, but there was a lot of overlap between the groups. This was confirmed in the regression. Clean title (p = 0.64) was not a significant predictor, meaning that after controlling for predictors such as mileage and model year, title status does not explain much variation in price.
**Price vs Model:** The regression shows that a car's model has a substantial impact on its price. High-end **sports cars and luxury vehicles** such as the Porsche 911 and BMW M series are associated with significantly higher prices relative to the regression's reference model. In contrast, more common vehicles such as the Mustang GT and Jeep Wrangler did not show a statistically significant difference in price from the reference model.
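The percentage interpretations above rely on back-transforming coefficients from the log-price scale. A minimal sketch of that arithmetic follows; `pct_change` is a hypothetical helper, and the mileage coefficient is assumed to be expressed per mile.

```r
# Hypothetical helper: percent change in price implied by a log-price
# coefficient `beta` when the predictor increases by `delta` units
pct_change <- function(beta, delta = 1) 100 * (exp(beta * delta) - 1)

b_mileage <- -7.62e-6  # coefficient quoted above; assumed to be per mile
b_year    <- 0.0467    # coefficient quoted above; per model year

mileage_effect <- pct_change(b_mileage, delta = 10000)  # per 10,000 miles
year_effect    <- pct_change(b_year)                    # per one year newer

print(round(c(mileage = mileage_effect, year = year_effect), 2))
```

The exact back-transform gives roughly a 7.3% decrease per 10,000 miles and a 4.78% increase per model year. The shortcut `100 * beta * delta` gives 7.6% for mileage, which is a close linear approximation when the change on the log scale is small.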
Methods
===
Column {.tabset data-width=350}
-----------------------------------------------------------------------
### Analytical Methods
- Used **multiple linear regression** with log price as the response
- Variables selected based on EDA and interpretability
- Mileage scaled to thousands (`milage_k`) and log-transformed (`log_milage`) to stabilize coefficients
- Diagnostics examined: linearity, heteroscedasticity, normality, influence
```{r}
model <- lm(log_price ~ log_milage * model_year + model + accident,
            data = cars_filtered)
summary(model)
```
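Because the model includes the interaction `log_milage * model_year`, the main-effect coefficients for `log_milage` and `model_year` are slopes only when the other variable equals zero, which is far outside the data. The sketch below computes the implied slopes at realistic values, using coefficients copied from the reported regression table and the median `log_milage` from the summary statistics. The function names are hypothetical.

```r
# Coefficients copied from the reported regression table
b_logm  <- -46.3159810  # log_milage
b_year  <- -0.0450566   # model_year
b_inter <-  0.0227807   # log_milage:model_year

# Slope of log price with respect to log_milage at a given model year
slope_logm <- function(year) b_logm + b_inter * year

# Slope of log price with respect to model_year at a given log_milage
slope_year <- function(log_milage) b_year + b_inter * log_milage

print(round(slope_logm(c(2000, 2015, 2020)), 3))
print(round(slope_year(4.00733), 4))  # 4.00733 is the median log_milage
```

Under these numbers the mileage slope is negative for every model year in the data and shrinks toward zero for newer cars, and at the median `log_milage` the model-year slope is about 0.046 per year.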
Column {.tabset data-width=350}
-----------------------------------------------------------------------
### Regression Table
```{r}
tidy(model, conf.int = TRUE) %>% kable()
```
### Model Fit
```{r}
glance(model) %>% kable()
```
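To help read the fit statistics, the following sketch reproduces the adjusted R-squared from the reported values and translates the residual standard error from the log scale into a price factor. All inputs are copied from the model fit table; nothing is re-estimated.

```r
# Values copied from the reported model fit table
r2    <- 0.6214801  # r.squared
sigma <- 0.4851262  # residual standard error on the log-price scale
n     <- 3926       # nobs
p     <- 22         # df: number of coefficients excluding the intercept

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# One residual standard deviation on the log scale corresponds to
# multiplying or dividing the predicted price by this factor
typical_factor <- exp(sigma)

print(round(adj_r2, 4))
print(round(typical_factor, 2))
```

A factor of about 1.62 means a listing one residual standard deviation from its prediction is priced roughly 62% above or 38% below the fitted value, which is consistent with treating the model as a baseline rather than a precise pricing tool.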
Diagnostics
===
Column {.tabset data-width=350}
-----------------------------------------------------------------------
### Diagnostics
**Residuals vs Fitted**
The Residuals vs Fitted plot helps assess whether the linear regression assumptions of linearity and constant variance are satisfied. In this plot, the residuals are fairly centered around zero, but there is noticeable spread and some curvature in the red line, suggesting mild nonlinearity. The vertical spread appears mostly consistent, though there is slight funneling near larger fitted values, which hints at mild heteroscedasticity. A few points are labeled, indicating potential outliers or influential observations that deviate from the general pattern of the data.
**Q-Q Residuals**
The Q-Q plot assesses whether the residuals follow a normal distribution. If the residuals were perfectly normal, the points would lie along the diagonal line. In this model, the middle section follows the line very well, suggesting the bulk of the residuals are approximately normal. However, both tails deviate: the lower tail drops below the line and the upper tail rises above it, which indicates heavier tails than a normal distribution. The points near the upper end highlight extreme observations. Overall, the plot suggests that while normality is roughly met in the center of the distribution, there are departures in the tails that may affect inference.
**Scale-Location**
The Scale-Location plot checks whether the residuals exhibit constant variance across levels of the fitted values. In this plot, the square root of the absolute standardized residuals is plotted against the fitted values from the model `log_price ~ log_milage * model_year + model + accident`. The residuals appear mostly clustered around the horizontal axis, but the red smoothing line shows a slight downward trend, indicating that residual variance decreases for higher fitted values. This suggests mild heteroscedasticity, though it is not severe enough to invalidate the model.
**Cook's Distance**
The Cook's Distance plot identifies observations that exert an unusually large influence on the regression estimates. Several observations in the model (321, 860, and 2781) stand out. These points should be investigated further to determine whether they represent potential outliers that could affect the model estimates, though the overall model is not dominated by these observations.
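One way to follow up on influential observations is to list the points with the largest Cook's distance and refit without them. The sketch below uses the built-in `mtcars` data as a stand-in so it runs on its own; the `4 / n` cutoff is a common rule of thumb and an assumption here, not something taken from this analysis.

```r
# Built-in mtcars data stands in for cars_filtered so the sketch is self-contained
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Cook's distance for every observation, named by row
cd <- cooks.distance(fit)

# Rule-of-thumb cutoff (an assumption, not part of the original analysis)
cutoff <- 4 / nrow(mtcars)

# Observations above the cutoff, largest first
flagged <- sort(cd[cd > cutoff], decreasing = TRUE)
print(flagged)

# Refit without the flagged rows and compare coefficients side by side
keep     <- !(rownames(mtcars) %in% names(flagged))
fit_trim <- lm(mpg ~ wt + hp, data = mtcars[keep, ])
print(round(cbind(full = coef(fit), trimmed = coef(fit_trim)), 4))
```

Applied to the dashboard's `model` object, the same pattern would show whether removing observations such as 321, 860, and 2781 changes the coefficients enough to matter.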
Column {.tabset data-width=350}
-----------------------------------------------------------------------
### Residuals vs Fitted
```{r}
plot(model, which = 1)
```
### Q-Q Residuals
```{r}
plot(model, which = 2)
```
### Scale-Location
```{r}
plot(model, which = 3)
```
### Cook's Distance
```{r}
plot(model, which = 4)
```
Conclusion
===
Row
-------
### **Conclusion**
This project explored the primary factors that drive used car prices using a cleaned and filtered data set of online listings. Across the exploratory analysis and regression modeling, the two predictors that stood out were mileage and model year. Vehicles with higher mileage tended to be priced significantly lower, while newer vehicles commanded higher prices. Brand and car model also played a substantial role, with luxury and performance vehicles priced highest. Accident history showed small differences when holding other factors constant. In addition, exterior and interior colors were found to have little to no impact on the price of a used vehicle.
The multiple linear regression model largely supported these findings and revealed that a combination of model year, mileage, car model, and accident history explains about 62% of the variability in log price (R² = 0.62). While the model captured broad pricing patterns and provides reasonable insight for first-time car buyers, the diagnostics indicated signs of nonlinearity and influential outliers. As a result, the model should be interpreted as a baseline rather than a full-fledged predictive tool.
### **Limitations**
While this analysis provided useful insight into predicting used car prices, several limitations should be acknowledged. First, the **diagnostic plots showed signs of nonlinearity**, suggesting that the regression model did not capture the full relationship in the data, especially for mileage at the higher price levels. Additionally, **the data contained several highly influential outliers**, which may represent unusually priced vehicles and can disproportionately affect the regression coefficients. The decision to **lump brands and models** into broader categories helped simplify the visualizations, but it also removed detail about less common vehicles, for example specific trims. The data also lacks factors such as **location, trim level, and optional packages (luxury or performance)** that may play a major role in real-world pricing. Finally, because the data comes from **online listings rather than final sale prices**, it may overstate what buyers actually pay, since negotiation is common when buying a car and often lowers the price at least slightly.
Column {.tabset data-width=350}
-----------------------------------------------------------------------
### Personal Information
Jack Sarsen
B.S. Statistics, University of Dayton
Final Project - MTH 369: Regression and Linear Models
Linkedin: linkedin.com/jacksarsen
### References
Dataset - https://www.kaggle.com/datasets/taeefnajib/used-car-price-prediction-dataset