Let's step back from our feature matrix X and look at the complete dataset: our car sales data, with all variables, across 153 rows. Before we split the data into features and target, we need to handle one more preprocessing step: identifying and removing outliers that could skew the model's fit.
We'll focus on the two variables where extreme values are most problematic: engine size and price. Both contain a few data points that lie far from the bulk of the data, and those points can pull the fitted model toward them and hurt its ability to generalize to new data.
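One way to spot such points is to look at summary statistics and the largest values in each column. The sketch below assumes a pandas DataFrame named car_sales with columns Price_in_thousands and Engine_size; those names, and the six rows of data, are made up for illustration and are not the real 153-row dataset.

```python
import pandas as pd

# Synthetic stand-in rows (NOT the real data); column names are assumed.
car_sales = pd.DataFrame({
    "Price_in_thousands": [21.5, 28.4, 42.0, 85.5, 33.9, 82.6],
    "Engine_size": [1.8, 3.2, 3.5, 5.0, 8.0, 5.5],
})

# Summary statistics: a max far above the 75th percentile hints at outliers.
summary = car_sales[["Price_in_thousands", "Engine_size"]].describe()
print(summary)

# The largest values in each column are the candidates to inspect by hand.
top_prices = car_sales["Price_in_thousands"].nlargest(2)
top_engines = car_sales["Engine_size"].nlargest(1)
print(top_prices)
print(top_engines)
```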
We'll begin the outlier removal by filtering on price. We reassign our car sales dataset so that it includes only records where the price in thousands is less than or equal to 80, that is, $80,000.
This first filter removes two clear outliers: luxury vehicles priced above the $80,000 threshold, which would otherwise distort the relationship the model learns between price and the other variables. Our dataset now contains 151 rows.
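The price filter can be sketched with a pandas boolean mask. As before, the DataFrame name car_sales, the column names, and the sample rows are assumptions for illustration; on the real data the same filter takes the row count from 153 to 151.

```python
import pandas as pd

# Synthetic stand-in rows (NOT the real data); column names are assumed.
car_sales = pd.DataFrame({
    "Price_in_thousands": [21.5, 28.4, 42.0, 85.5, 33.9, 82.6],
    "Engine_size": [1.8, 3.2, 3.5, 5.0, 8.0, 5.5],
})

rows_before = len(car_sales)

# Keep only rows priced at or below 80 (thousand dollars).
car_sales = car_sales[car_sales["Price_in_thousands"] <= 80]

rows_after = len(car_sales)
print(rows_before, "->", rows_after)
```

Comparing the row count before and after each filter is a cheap check that the filter removed exactly as many rows as we expected.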
Next, we'll apply a complementary filter to address engine size anomalies. While some overlap may exist between price and engine size outliers, this secondary filter ensures we capture any remaining edge cases that could impact model performance.
By restricting the dataset to vehicles with engine sizes of 7 liters or less, we remove one additional outlier. The row count drops from 151 to 150, which confirms that this filter caught a data point the price filter alone missed.
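Applied one after the other, the two filters look roughly like this. The names and sample rows are again hypothetical; the synthetic data is arranged so that one large-engine car survives the price filter and is only caught by the engine size filter, mirroring the 151 to 150 step.

```python
import pandas as pd

# Synthetic stand-in rows (NOT the real data); column names are assumed.
car_sales = pd.DataFrame({
    "Price_in_thousands": [21.5, 28.4, 42.0, 85.5, 33.9, 82.6],
    "Engine_size": [1.8, 3.2, 3.5, 5.0, 8.0, 5.5],
})

# Step 1: price filter (at or below 80 thousand dollars).
after_price = car_sales[car_sales["Price_in_thousands"] <= 80]

# Step 2: engine size filter (at or below 7 liters) on what remains.
after_engine = after_price[after_price["Engine_size"] <= 7]

print(len(car_sales), "->", len(after_price), "->", len(after_engine))

# Equivalent single step: combine both conditions with & (note the parentheses).
combined = car_sales[
    (car_sales["Price_in_thousands"] <= 80) & (car_sales["Engine_size"] <= 7)
]
```

The order of the two filters does not change the final result; applying them separately just makes it easy to see how many rows each one removes.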
With our cleaned dataset of 150 records, we've removed three extreme data points while keeping the underlying patterns our model needs to learn. A model trained without them is less likely to be dominated by a handful of unusual cars and more likely to perform consistently on typical ones.
Now we're ready to use this filtered dataset in the next phase of our analysis. We'll redeclare our feature matrix X and target variable Y from the filtered data, split them into training and testing sets, retrain our model, and check how much performance improves once the outliers are gone.
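Those steps can be sketched with scikit-learn as follows. Everything specific here is an assumption made for illustration: the feature columns (Engine_size, Horsepower), price as the target, a linear regression model, an 80/20 split, and the randomly generated 150-row dataset standing in for our filtered data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic 150-row stand-in for the filtered data (NOT the real dataset).
rng = np.random.default_rng(0)
n = 150
engine = rng.uniform(1.0, 6.0, n)
horsepower = 60 + 40 * engine + rng.normal(0, 10, n)
price = 5 + 4 * engine + 0.05 * horsepower + rng.normal(0, 2, n)
car_sales = pd.DataFrame({
    "Engine_size": engine,
    "Horsepower": horsepower,
    "Price_in_thousands": price,
})

# Redeclare the feature matrix and target from the filtered data.
X = car_sales[["Engine_size", "Horsepower"]]
y = car_sales["Price_in_thousands"]

# Hold out 20% of the rows for testing; fix the seed for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Retrain and score on the held-out rows (score returns R^2 for regressors).
model = LinearRegression().fit(X_train, y_train)
r2 = model.score(X_test, y_test)
print(f"R^2 on held-out data: {r2:.3f}")
```

To judge the effect of outlier removal, compare this score with the one obtained from the same model and split settings on the unfiltered data.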