Machine learning algorithms operate on numbers, so categorical data has to be converted from human-readable labels into a numerical representation. Our linear regression examples worked with inherently numerical datasets. Categorical variables like salary level need more care, because the encoding has to keep the categories distinct without implying numeric distances between them.

The challenge with categories like "low," "medium," and "high" is that they carry no measurable intervals or ratios. We can calculate a mean of actual salaries across employees, but not a meaningful mean of these labels. Encoding "medium" as 2 and "high" as 3 would tell a model that "high" is exactly 50% larger than "medium," a relationship that simply doesn't exist in the data.
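
To make the problem concrete, here is a minimal sketch in plain Python. The integer codes and the short list of labels are invented for illustration and are not taken from our dataset. It shows that the ratio between "high" and "medium" depends entirely on which arbitrary codes we happen to pick.

```python
# Two arbitrary integer codings for the same three labels.
# Both are assumptions made up for this illustration.
codes_a = {"low": 1, "medium": 2, "high": 3}
codes_b = {"low": 10, "medium": 20, "high": 40}

# A toy list of salary labels (not the real dataset).
salaries = ["low", "high", "low", "medium"]

# Under coding A the "average salary level" is 1.75, a number with no label attached to it.
encoded_a = [codes_a[s] for s in salaries]
mean_a = sum(encoded_a) / len(encoded_a)

# The implied ratio between "high" and "medium" changes with the coding we chose.
ratio_a = codes_a["high"] / codes_a["medium"]  # 1.5
ratio_b = codes_b["high"] / codes_b["medium"]  # 2.0

print(mean_a, ratio_a, ratio_b)
```

Both codings preserve the labels equally well, yet they would give a model different arithmetic to work with. That arbitrariness is what we want to avoid.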

The solution lies in binary representation: converting each category into a series of zeros and ones. Rather than assigning arbitrary numerical values, we create separate binary indicators for each possible category. This approach preserves the categorical nature of the data while making it computationally accessible.

Here's how this binary transformation works in practice: we add three new columns to the dataset, one for "low," one for "medium," and one for "high." Each row receives exactly one 1, in the column corresponding to its category, and 0 in the other two columns. Every data point keeps its categorical identity, and no false numerical relationship between categories is introduced.
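
The transformation described above can be sketched by hand in a few lines of plain Python. The helper name one_hot and the sample labels are hypothetical and exist only for this illustration. In practice we will let Pandas do this work.

```python
# The three possible categories, in the column order we want.
categories = ["low", "medium", "high"]

# A toy list of labels (not the real dataset).
salaries = ["low", "high", "medium", "low"]


def one_hot(label, categories):
    """Return a list with 1 at the position of `label` and 0 everywhere else."""
    if label not in categories:
        raise ValueError(f"unknown category: {label!r}")
    return [1 if label == category else 0 for category in categories]


# One encoded row per original row.
rows = [one_hot(label, categories) for label in salaries]

for label, row in zip(salaries, rows):
    print(f"{label:<6} -> {row}")
```

Each encoded row sums to 1, which is where the name "one-hot" comes from: exactly one position is "hot" at a time.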


This encoding gives machine learning algorithms unambiguous inputs. The algorithm doesn't need to understand what "high salary" means conceptually. It simply looks for patterns in the binary columns. It might discover, for instance, that rows with a 1 in the high salary column correlate strongly with employee retention, while rows with a 1 in the low salary column show higher turnover. The binary format makes these patterns available without imposing artificial mathematical relationships between categories.
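
As a rough illustration of the kind of pattern an algorithm could pick up, the sketch below compares turnover across the indicator columns. Everything in it is an assumption: the DataFrame name toy, the column name left (1 means the employee left), and all of the values are made up, so the resulting rates say nothing about the real dataset.

```python
import pandas as pd

# Made-up toy data for illustration only; `left` is a hypothetical column
# where 1 means the employee left the company.
toy = pd.DataFrame({
    "salary": ["low", "low", "low", "medium", "medium", "high", "high", "high"],
    "left":   [1,     1,     0,     1,        0,        0,      0,      0],
})

# One indicator column per salary category.
encoded = pd.get_dummies(toy["salary"], dtype=int)

# For each indicator column, the share of leavers among rows where it equals 1.
leave_rate = {
    column: float(toy.loc[encoded[column] == 1, "left"].mean())
    for column in encoded.columns
}

print(leave_rate)
```

A real model would learn weights for these columns rather than compute group means, but the idea is the same: each indicator column can be related to the outcome on its own terms.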

This technique is known as one-hot encoding, and it is a standard way to handle categorical variables in machine learning workflows. Pandas provides the get_dummies function for this transformation. It converts a categorical column into binary indicator columns in a single call.

The name "get dummies" comes from statistics, where "dummy variables" are binary indicators used in regression modeling. The name may seem odd at first, but the function is a widely used tool for categorical data preprocessing in Python's data science ecosystem.


Let's implement this transformation on our salary data. We'll create a new DataFrame called salary_OHE (one-hot encoded) using Pandas' get_dummies function. The function takes our original salary column, which contains string values like "low," "medium," and "high," and converts it into one binary column per category. We'll specify dtype=int so that the output contains integer 0 and 1 values rather than boolean flags, which are the default in pandas 2.0 and later.
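
A minimal sketch of this step follows. The DataFrame name hr_data and its handful of rows are stand-ins invented for this example, since the real DataFrame is loaded elsewhere. Only the salary column name, the salary_OHE name, and the dtype=int argument come from the text above.

```python
import pandas as pd

# Hypothetical stand-in for the tutorial's DataFrame; only the `salary`
# column matters for this step.
hr_data = pd.DataFrame({"salary": ["low", "medium", "high", "low", "medium"]})

# One indicator column per distinct salary value, stored as integers.
salary_OHE = pd.get_dummies(hr_data["salary"], dtype=int)

print(salary_OHE)
```

Note that get_dummies orders the new columns by sorted category name, so they come out as high, low, medium rather than in the low-to-high order we might expect.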

Examining our resulting salary_OHE DataFrame shows the transformation in action. The output displays the first and last five rows, and the row count matches our original dataset, which is a crucial validation step. Notice that each row contains exactly one 1 and two 0 values: employees with low salaries show 1 in the low column, medium-salary employees have 1 in the medium column, and high earners have 1 in the high column. This binary representation captures our categorical information in a machine-readable format.
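
The visual checks described above can also be done programmatically. The sketch below is one way to do it, again using a small made-up hr_data DataFrame as a stand-in for the real one. The variable names for the three checks are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in data; the real DataFrame is loaded elsewhere.
hr_data = pd.DataFrame(
    {"salary": ["low", "medium", "high", "low", "medium", "high", "low"]}
)
salary_OHE = pd.get_dummies(hr_data["salary"], dtype=int)

# Check 1: encoding must not add or drop rows.
same_row_count = len(salary_OHE) == len(hr_data)

# Check 2: every row has exactly one 1, so each row sums to 1.
one_hot_per_row = bool((salary_OHE.sum(axis=1) == 1).all())

# Check 3: the column holding the 1 names the original category.
round_trip = bool((salary_OHE.idxmax(axis=1) == hr_data["salary"]).all())

print(same_row_count, one_hot_per_row, round_trip)
```

If any of the three values is False, the encoded frame no longer lines up with the original data and should not be merged back.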

The final step is to integrate this one-hot encoded data back into our primary DataFrame. By appending these binary columns to our existing dataset, we keep the original categorical column for human interpretation alongside the binary encoding for algorithmic processing.
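
One common way to append the columns is pd.concat with axis=1, sketched below. This is an assumption about how the step could be done, not necessarily the exact call used later in this walkthrough. The names hr_data, employee_id, and combined are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in data; `employee_id` is an invented extra column
# to show that existing columns are kept.
hr_data = pd.DataFrame({
    "employee_id": [101, 102, 103, 104],
    "salary": ["low", "medium", "high", "low"],
})
salary_OHE = pd.get_dummies(hr_data["salary"], dtype=int)

# axis=1 places the indicator columns side by side with the original ones.
# Rows are matched on the index, which both frames share.
combined = pd.concat([hr_data, salary_OHE], axis=1)

print(combined)
```

Because concat aligns on the index, this only works cleanly when salary_OHE was built from the same DataFrame without reordering or filtering rows in between.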