Now comes the crucial question: how accurate were our predictions? To answer it, we'll use the model's score method (model.score), a built-in evaluation tool for measuring how closely predictions match actual outcomes.

The score method uses a different metric depending on the type of model. For linear regression it returns an R-squared score, which measures the proportion of variance in the dependent variable that is predictable from the independent variables. A classifier's score, by contrast, is typically plain accuracy, so a regression score should not be read as a percentage of correct predictions.

The implementation is straightforward: we call score = model.score(X_test, y_test), passing in our test features and the matching true values. Essentially, we're presenting the model with unseen data and asking, "How do your predictions compare to the actual outcomes?" Because the model never encountered this test data during training, the result is a fairer estimate of how it will perform on new data than a score computed on the training set.
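The call described above can be sketched end to end. This is a minimal example under stated assumptions: the text does not name the library, so scikit-learn's LinearRegression is assumed here, and the data is synthetic rather than the dataset used in this walkthrough.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): the target depends
# linearly on two features, plus random noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=200)

# Hold out 25% of the rows so the score is computed on data
# the model never saw during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression()
model.fit(X_train, y_train)

# For a regressor, score() returns R-squared on the data passed in.
score = model.score(X_test, y_test)
print(f"R-squared on the test set: {score:.2f}")
```

The exact number printed depends on the data and the split, so it will not match the 69% discussed here.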

Our result? Approximately 0.69, or 69%. That is a solid result for many real-world problems, though what counts as good depends heavily on the domain and on how noisy the data is.

However, this percentage requires careful interpretation. It doesn't mean 69% of predictions were exactly correct—in fact, with continuous variables, virtually zero predictions will match reality to the decimal point. The nature of regression problems involves predicting precise numerical values, making perfect accuracy nearly impossible and frankly unnecessary for practical applications.

So what does this 69% actually measure? It quantifies how much better our model performs than the simplest possible baseline: predicting the mean value every single time. More precisely, it is the fraction of that baseline's squared error that our model eliminates. Think of the baseline as the "lazy statistician" approach: someone who looks at all the data, calculates the average, and uses that same number for every prediction.
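To make the comparison concrete, here is a small sketch that computes R-squared by hand from its standard definition: one minus the ratio of the model's squared error to the mean baseline's squared error. The function name r_squared and the sample values are hypothetical, chosen only for illustration.

```python
def r_squared(y_true, y_pred):
    """Return 1 - (model's squared error / mean baseline's squared error)."""
    mean_y = sum(y_true) / len(y_true)
    # Squared error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical values, not the data from this walkthrough.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]

print(round(r_squared(y_true, y_pred), 3))
```

Here the model's squared error is 26 and the baseline's is 500, so the score comes out at roughly 0.948: the model removes about 95% of the baseline's error.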

We can easily calculate this baseline ourselves. Taking our y_test values—which form our ground truth dataset—we sum all values and divide by the count. This gives us the mean value that serves as our comparison benchmark.

In our case, that mean is 29.47. Imagine if our model simply returned 29.47 for every prediction, regardless of input features. "What's the prediction for this complex set of variables?" 29.47. "How about this completely different scenario?" 29.47 again. Such a model would score exactly zero, because its error is precisely the yardstick that every other model is measured against.
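The baseline calculation is easy to reproduce. The sketch below uses a short list of hypothetical values standing in for y_test (so the mean is not the 29.47 from our data), predicts that mean for every row, and scores the result with the usual R-squared formula.

```python
# Hypothetical ground-truth values standing in for y_test.
y_test = [21.0, 35.5, 28.0, 40.5, 25.0]

# The baseline: sum the values and divide by the count.
baseline = sum(y_test) / len(y_test)

# The "lazy statistician" returns that same number for every row.
baseline_preds = [baseline] * len(y_test)

# R-squared = 1 - (squared error of predictions / squared error of the mean).
ss_res = sum((t - p) ** 2 for t, p in zip(y_test, baseline_preds))
ss_tot = sum((t - baseline) ** 2 for t in y_test)
baseline_score = 1 - ss_res / ss_tot

print(baseline, baseline_score)  # prints: 30.0 0.0
```

The two sums are identical by construction, which is why the mean predictor always lands on exactly zero.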

The score can also be negative, which means the model's predictions on this data are worse than the naive baseline's. This scenario, while embarrassing, provides valuable diagnostic information: it points to problems with feature selection, model choice, or data quality, or to test data that looks very different from the training data.
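A negative score is easy to produce on purpose. In this sketch, with hypothetical values, the predictions run in the opposite direction from the true values, so they miss by more than the mean would.

```python
# Hypothetical true values and a set of predictions that trend the wrong way.
y_test = [10.0, 20.0, 30.0, 40.0]
bad_preds = [40.0, 30.0, 20.0, 10.0]

mean_y = sum(y_test) / len(y_test)

# Squared error of the bad predictions versus always predicting the mean.
ss_res = sum((t - p) ** 2 for t, p in zip(y_test, bad_preds))
ss_tot = sum((t - mean_y) ** 2 for t in y_test)
bad_score = 1 - ss_res / ss_tot

print(bad_score)  # prints: -3.0
```

The bad predictions accumulate a squared error of 2000 against the baseline's 500, so the score is 1 - 4 = -3.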

Fortunately, we're performing significantly better. A score of 69% means the model's squared error on the test set is only about 31% of the baseline's, so it is capturing meaningful patterns in the data. Whether that is accurate enough to act on depends on the application; there is no universal threshold that separates a useful score from a useless one.

In the next section, we'll explore techniques for pushing this score higher, including feature engineering, hyperparameter tuning, and ensemble methods, which on some datasets can lift R-squared above 80%.