With our data formatted and the initial analysis complete, we can apply our domain expertise to feature selection. This is an iterative process that data scientists should keep refining: evaluating which columns provide predictive value, experimenting with different feature combinations, preprocessing data to handle outliers, and applying the analytical techniques we'll explore throughout this series. Our primary focus here, however, is demonstrating the key differences between logistic regression and linear regression in practice.
For our feature matrix X, we'll select columns that our exploratory analysis suggests have predictive power: the engineered categorical variables low, medium, and high, along with satisfaction_level and average_monthly_hours (preserving the original dataset's column naming convention). We'll also include promotions_last_5years, the number of promotions received in the past five years, which can serve as an indicator of employee engagement and career trajectory.
Our target variable y represents our binary outcome: whether an employee left the organization (1) or remained (0). This binary classification problem is precisely where logistic regression excels, because it models the probability of class membership, whereas linear regression predicts a continuous value.
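The selection of X and y described above can be sketched as follows. This is a minimal illustration rather than the original notebook code: the tiny DataFrame is made-up stand-in data, and the target column name `left` is an assumption, since the text does not name that column.

```python
import pandas as pd

# Hypothetical stand-in for the prepared DataFrame from the earlier steps.
# The values are invented for illustration; "left" is an assumed column name.
df = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11, 0.72, 0.37, 0.41],
    "average_monthly_hours": [157, 262, 272, 223, 159, 153],
    "promotions_last_5years": [0, 0, 0, 1, 0, 0],
    "low": [1, 0, 0, 1, 1, 0],
    "medium": [0, 1, 1, 0, 0, 0],
    "high": [0, 0, 0, 0, 0, 1],
    "left": [1, 1, 1, 0, 1, 0],
})

# Columns used as predictors, as listed in the text.
feature_columns = [
    "low",
    "medium",
    "high",
    "satisfaction_level",
    "average_monthly_hours",
    "promotions_last_5years",
]

X = df[feature_columns]  # feature matrix: one row per employee
y = df["left"]           # binary target: 1 = left, 0 = remained

print(X.shape, y.shape)
```

Keeping the column names in a single list makes it easy to experiment with different feature combinations later.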
Next, we'll partition our data into training and testing sets using the standard 80/20 split. This leaves most of the data for model training while reserving a holdout set, which the model never sees during training, for performance evaluation. The train_test_split function returns a tuple that we unpack into X_train, X_test, y_train, and y_test. Holding out test data in this way is a fundamental practice in supervised learning workflows.
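A minimal sketch of the 80/20 split, using placeholder arrays so that it runs on its own. The `random_state` and `stratify` arguments are optional additions that the text does not mention: the first makes the split reproducible, and the second keeps the class proportions similar in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows, 6 features, and a deterministic binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = np.array([0] * 76 + [1] * 24)

# test_size=0.2 reserves 20% of the rows for the holdout set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```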
Feature scaling becomes critical at this stage because our numerical variables sit on very different scales. Consider the contrast: average_monthly_hours ranges from approximately 150 to nearly 300, while promotions_last_5years typically contains small integers (0, 1, or 2). Without standardization, features with larger magnitudes can disproportionately influence the fitted model, particularly because scikit-learn's LogisticRegression applies regularization by default, and they can slow the solver's convergence. StandardScaler addresses this by subtracting each feature's mean and dividing by its standard deviation, so every feature ends up with zero mean and unit variance. To keep the test set unbiased, the scaler should learn those statistics from the training set only and then apply them to the test set.
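The effect of standardization can be seen on a toy example. The sketch below uses two invented columns standing in for average_monthly_hours and promotions_last_5years. It fits the scaler on the training rows only and reuses those statistics for the test rows, which is a common convention assumed here rather than something shown in the original code.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data. Columns: average_monthly_hours, promotions_last_5years.
X_train = np.array([
    [150.0, 0.0],
    [200.0, 1.0],
    [250.0, 0.0],
    [300.0, 2.0],
])
X_test = np.array([
    [180.0, 0.0],
    [290.0, 1.0],
])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data
X_test_scaled = scaler.transform(X_test)        # apply the same mean/std to test data

print(scaler.mean_)                 # per-feature means learned from X_train
print(X_train_scaled.mean(axis=0))  # approximately 0 for each feature
print(X_train_scaled.std(axis=0))   # approximately 1 for each feature
```

After scaling, both columns are expressed in standard deviations from their training mean, so neither dominates simply because of its units.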
The model instantiation mirrors our previous linear regression approach, with one crucial distinction: we're now using LogisticRegression instead of LinearRegression. This seemingly simple change alters the underlying mathematics. Logistic regression passes a linear combination of the features through the sigmoid function, which maps any real-valued input to a probability between 0 and 1, making it well suited to binary classification tasks.
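To make the sigmoid mapping concrete, here is a small sketch. The `sigmoid` helper is written out for illustration only; scikit-learn handles this internally. The model is instantiated with default settings, which is an assumption about the configuration used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """Map real-valued scores to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Large negative scores map near 0, large positive scores map near 1,
# and a score of exactly 0 maps to 0.5.
scores = np.array([-4.0, 0.0, 4.0])
probabilities = sigmoid(scores)
print(probabilities)

# Instantiation looks just like LinearRegression(); only the class changes.
model = LogisticRegression()
```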
Finally, we train our model by calling the fit method on our scaled training data. During fitting, the algorithm iteratively adjusts the coefficients to minimize the logistic loss function, learning patterns that distinguish employees likely to leave from those likely to stay. With training complete, we're ready to evaluate our model's predictive performance on the holdout test set.
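The training step can be sketched as follows. To keep the example runnable on its own, it uses synthetic, already-scaled features with a made-up labeling rule rather than the employee data. The variable name `X_train_scaled` is an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for scaled training data: 200 rows, 2 features.
rng = np.random.default_rng(0)
X_train_scaled = rng.normal(size=(200, 2))

# Made-up rule for illustration: label is 1 when feature 0 exceeds feature 1.
y_train = (X_train_scaled[:, 0] - X_train_scaled[:, 1] > 0).astype(int)

model = LogisticRegression()
model.fit(X_train_scaled, y_train)  # iteratively minimizes the logistic loss

# One coefficient per feature, plus an intercept.
print(model.coef_, model.intercept_)

# predict_proba returns [P(class 0), P(class 1)] for each row.
proba = model.predict_proba(X_train_scaled[:3])
print(proba)
```

The signs of the learned coefficients reflect the labeling rule: positive for the first feature and negative for the second.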