Preparing Titanic Dataset: Splitting and Scaling Techniques

Now we'll proceed with data splitting, though our approach differs slightly from typical machine learning workflows. Since Kaggle has already provided separate training and test datasets, we're working exclusively with pre-designated training data—eliminating the need for our usual train-test-validation split.

This pre-split structure is common in competitive data science and mirrors real-world scenarios where test data remains sequestered until final model evaluation. Keeping the test set untouched helps prevent data leakage and gives a more trustworthy estimate of model performance.

Based on our exploratory data analysis and domain expertise, we'll define our feature matrix X_train using the most predictive variables from our Titanic dataset: passenger class (Pclass), embarkation port, sex, age, fare, number of siblings/spouses aboard, and number of parents/children aboard. These features represent a carefully curated selection that balances predictive power with data quality—each chosen for its statistical significance and logical relationship to survival outcomes.

Our target variable Y_train consists of the 'Survived' column—a binary classification where 1 indicates survival and 0 indicates death. This straightforward labeling makes our supervised learning task clearly defined: predict passenger survival based on demographic and ticket information.
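The feature and target selection described above can be sketched as follows. This is a minimal illustration, not the lesson's exact code: the four-row DataFrame stands in for Kaggle's train.csv, and the column names are assumptions. 'Pclass' and 'Survived' follow the text, 'age' and 'fare' are lowercase to match the indexing expression used later in this lesson, and the remaining names are hypothetical (Kaggle's raw file capitalizes them, for example Age, Fare, SibSp, Parch).

```python
import pandas as pd

# Tiny stand-in for the Kaggle training file; in practice this would come
# from something like pd.read_csv("train.csv") (file name is an assumption).
train = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "embarked": ["S", "C", "S", "S"],
    "sex": ["male", "female", "female", "male"],
    "age": [22.0, 38.0, 26.0, 35.0],
    "fare": [7.25, 71.2833, 7.925, 13.0],
    "sibsp": [1, 1, 0, 0],   # siblings/spouses aboard
    "parch": [0, 0, 0, 0],   # parents/children aboard
})

# Hypothetical column list mirroring the seven features named in the text.
feature_cols = ["Pclass", "embarked", "sex", "age", "fare", "sibsp", "parch"]

# .copy() gives X_train its own data, so later in-place edits
# do not trigger chained-assignment warnings.
X_train = train[feature_cols].copy()
Y_train = train["Survived"]  # 1 = survived, 0 = died
```

Keeping the column list in one variable makes it easy to apply the identical selection to the Kaggle test file later.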

Before feeding our data into a machine learning algorithm, we should address feature scaling. In the Kaggle training set, age ranges from infants under one year old to a passenger aged 80, while fare runs from zero to just over 500 (fares are recorded in pounds, not dollars). Without scaling, algorithms that rely on distances or gradient-based optimization can weight fare more heavily simply because its numerical values are larger: a classic case of letting measurement units, rather than actual predictive relationships, drive model decisions.
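To make the scale problem concrete, here is a small sketch using three made-up passengers described only by (age, fare). It assumes a distance-based method such as k-nearest neighbors purely for illustration; the numbers are hypothetical, not taken from the dataset.

```python
import numpy as np

# Hypothetical passengers as (age, fare) pairs on their raw scales.
a = np.array([25.0, 10.0])
b = np.array([60.0, 12.0])    # very different age, similar fare
c = np.array([26.0, 250.0])   # similar age, very different fare

dist_ab = np.linalg.norm(a - b)  # Euclidean distance, dominated by the age gap
dist_ac = np.linalg.norm(a - c)  # Euclidean distance, dominated by the fare gap

# Share of the squared a-to-c distance that comes from fare alone.
fare_share = (a[1] - c[1]) ** 2 / dist_ac ** 2

print(f"a-b distance: {dist_ab:.2f}")
print(f"a-c distance: {dist_ac:.2f}")
print(f"fare share of a-c squared distance: {fare_share:.4f}")
```

On raw values, passenger c looks far more "different" from a than b does, almost entirely because fare is measured on a wider numerical range.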

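We'll use scikit-learn's StandardScaler, a widely used choice for feature standardization, which transforms each feature to have a mean of zero and a standard deviation of one. This z-score standardization puts features on a common scale so that none dominates distance or gradient computations purely because of its units. Note that random forests, the model we use next, split on per-feature thresholds rather than distances, so they are largely insensitive to this kind of rescaling; scaling does no harm there, and it keeps the same prepared data usable with scale-sensitive models such as k-nearest neighbors, support vector machines, or regularized logistic regression.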
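As a quick check of what StandardScaler computes, the sketch below compares its output with the z-score formula applied by hand. The five ages are hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical ages as a single-column 2D array (scikit-learn expects 2D input).
ages = np.array([[2.0], [22.0], [38.0], [54.0], [80.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(ages)

# The same transformation by hand: subtract the column mean and divide by
# the population standard deviation (ddof=0, which is NumPy's default).
manual = (ages - ages.mean(axis=0)) / ages.std(axis=0)

print(np.allclose(scaled, manual))
print(scaler.mean_, scaler.scale_)
```

The fitted scaler stores the learned mean and standard deviation in its mean_ and scale_ attributes, which is what allows the same transformation to be reapplied to other data later.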

Using pandas' label-based .loc indexer, we can scale only the continuous variables (age and fare) while leaving the categorical variables unchanged. The expression X_train.loc[:, ['age', 'fare']] selects those two columns across all rows; assigning the output of the scaler's fit_transform method back to that selection overwrites them with their standardized values in one step. (The term "fancy indexing" properly refers to NumPy's array-based indexing, although it is sometimes used loosely for this pattern.)
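The selective scaling described above might look like the following. This is a sketch on a four-row stand-in DataFrame; the values are illustrative, and the column names 'age' and 'fare' follow the expression quoted in the text.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small stand-in for X_train with one categorical-style column and two continuous ones.
X_train = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "age": [22.0, 38.0, 26.0, 35.0],
    "fare": [7.25, 71.2833, 7.925, 13.0],
})

num_cols = ["age", "fare"]
scaler = StandardScaler()

# fit_transform returns a NumPy array with one column per selected feature;
# assigning it through .loc overwrites only those columns, for all rows.
X_train.loc[:, num_cols] = scaler.fit_transform(X_train[num_cols])

print(X_train)
```

Pclass is left exactly as it was, while age and fare now hold standardized values.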

After transformation, age and fare are each centered at zero with unit variance, placing them on comparable scales. This step matters most for models that are sensitive to feature magnitudes, and it is a good habit to build into any preprocessing workflow.
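Because the Kaggle test set is kept separate, one practical consequence is worth sketching: the scaler should learn its mean and standard deviation from the training data only, and the test data should then be transformed with those same statistics. The example below assumes a hypothetical X_test with the same 'age' and 'fare' columns; all values are made up.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

num_cols = ["age", "fare"]

# Hypothetical training and test features.
X_train = pd.DataFrame({"age": [22.0, 38.0, 26.0, 35.0],
                        "fare": [7.25, 71.2833, 7.925, 13.0]})
X_test = pd.DataFrame({"age": [34.5, 47.0],
                       "fare": [7.8292, 7.0]})

# Learn the scaling statistics from the training data only.
scaler = StandardScaler().fit(X_train[num_cols])

# Apply the same fitted scaler to both sets; the test set is never refit.
X_train.loc[:, num_cols] = scaler.transform(X_train[num_cols])
X_test.loc[:, num_cols] = scaler.transform(X_test[num_cols])

print(X_test)
```

The scaled test columns will generally not have a mean of exactly zero, which is expected: they are expressed relative to the training distribution.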

With our features scaled and our data prepared, we're ready to implement our chosen algorithm: the random forest classifier, a robust ensemble method that handles a mix of numeric and encoded categorical features well and tends to give reliable predictions even with limited feature engineering.
