Creating a DataFrame with Iris Dataset

Let's transform our iris dataset into a structured DataFrame for proper analysis. We'll begin by examining the target classifications—Setosa, Versicolor, and Virginica—which represent the three iris species we're working with. The feature names provide our measurement dimensions: Sepal Length, Sepal Width, Petal Length, and Petal Width.

These features will serve as our column headers, creating a professional dataset structure. We'll create our DataFrame with `iris_dataframe = pd.DataFrame(data=iris_data.data, columns=iris_data.feature_names)`. This establishes our foundational data structure with proper column naming conventions.
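The construction step can be sketched as follows. This assumes `iris_data` is the object returned by scikit-learn's `load_iris()`, which the text does not state explicitly; if your data comes from elsewhere, the attribute names may differ.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: iris_data is the bundle returned by scikit-learn's load_iris().
iris_data = load_iris()

# Measurements become the rows; feature names become the column headers.
iris_dataframe = pd.DataFrame(data=iris_data.data, columns=iris_data.feature_names)

print(iris_dataframe.shape)             # (rows, columns)
print(iris_dataframe.columns.tolist())  # the four feature names
print(iris_dataframe.head())            # first five specimens
```

Note that the scikit-learn feature names include units, for example `sepal length (cm)`, so the column labels will carry them too.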

The resulting DataFrame has four columns and 150 rows, with each row holding the four measurements for a single iris specimen. The columns are labeled with the feature names (sepal length, sepal width, petal length, petal width), so every value can be referenced by name rather than by position. Labeled columns make later filtering, plotting, and modeling steps easier to read and less error-prone.

Now we need to incorporate our target classifications. Currently, we can't distinguish which specimens are Setosa, Versicolor, or Virginica. Before adding our target column, let's examine `iris_data.target`, which contains an array of numerical encodings: zeros, ones, and twos.

These numerical values correspond directly to our species: Setosa (0), Versicolor (1), and Virginica (2). We'll add this classification data by creating a new column: `iris_dataframe['target'] = iris_data.target`. This gives us our target variable for machine learning applications.
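A minimal sketch of inspecting the encodings and attaching them as a column, again assuming `iris_data` comes from scikit-learn's `load_iris()`. The use of `np.unique` for the inspection is an illustrative choice, not something the text prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: iris_data is the bundle returned by scikit-learn's load_iris().
iris_data = load_iris()
iris_dataframe = pd.DataFrame(data=iris_data.data, columns=iris_data.feature_names)

# Inspect the numeric encodings and how often each one occurs.
values, counts = np.unique(iris_data.target, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))

# Attach the encodings as a new column, one code per specimen.
iris_dataframe['target'] = iris_data.target
print(iris_dataframe['target'].value_counts().sort_index())
```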

While numerical encoding works perfectly for algorithms, it's problematic for human interpretation and data exploration. Remembering which number corresponds to which species creates unnecessary cognitive overhead and potential for errors in analysis and reporting.

The solution is creating a human-readable `species` column that translates these numerical codes into meaningful species names. We need to map each target value to its corresponding entry in `iris_data.target_names`.

The target_names array contains our species in order: ['setosa', 'versicolor', 'virginica'], indexed as 0, 1, 2 respectively. We'll use this mapping to convert numerical codes to descriptive labels. For each target value, we'll look up the corresponding species name using array indexing.
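The lookup itself is ordinary array indexing. A short sketch, assuming `iris_data` is the result of scikit-learn's `load_iris()`:

```python
from sklearn.datasets import load_iris

# Assumption: iris_data is the bundle returned by scikit-learn's load_iris().
iris_data = load_iris()

# target_names is an array whose positions match the numeric codes.
print(iris_data.target_names)

# Indexing with a code returns the species name stored at that position.
for code in (0, 1, 2):
    print(code, '->', iris_data.target_names[code])
```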

This transformation uses the Pandas `apply` method. Called on a single column (a Series), `apply` runs a function on each value and collects the results into a new Series. I'll demonstrate two approaches, a named function and a lambda expression, so you can choose the style that best fits your coding preferences and team standards.

Let's start with an explicit function approach, which offers better readability and debugging capabilities. We'll create a function called `get_flower_name` that accepts a target number and returns the corresponding species name. The function uses `iris_data.target_names[target_number]` to perform the lookup: when `target_number` is 0, it returns 'setosa'; when it's 1, it returns 'versicolor'; and when it's 2, it returns 'virginica'.

Here's our implementation: the function captures the species name from the target_names array using the numerical index, stores it as `flower_name`, and returns the string value. This approach provides clear, maintainable code that's easy to debug and modify.

Now we apply this function across our entire target column: `iris_dataframe['species'] = iris_dataframe['target'].apply(get_flower_name)`. This creates our new species column by transforming every numerical target into its corresponding species name. Let's verify our results with `iris_dataframe.sample(10)` to examine a random subset.
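Put together, the named-function approach might look like the sketch below. It assumes `iris_data` comes from scikit-learn's `load_iris()` and calls the DataFrame `iris_dataframe`, the name used when it was created. The `str(...)` conversion is an addition here so that the function returns a plain Python string.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: iris_data is the bundle returned by scikit-learn's load_iris().
iris_data = load_iris()
iris_dataframe = pd.DataFrame(data=iris_data.data, columns=iris_data.feature_names)
iris_dataframe['target'] = iris_data.target


def get_flower_name(target_number):
    """Return the species name for a numeric target code (0, 1, or 2)."""
    flower_name = iris_data.target_names[target_number]
    return str(flower_name)


# Run the function on every value in the target column.
iris_dataframe['species'] = iris_dataframe['target'].apply(get_flower_name)

# Spot-check a random subset; the rows shown differ from run to run.
print(iris_dataframe.sample(10))
```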

The transformation works as intended: the function converts each target code into a readable species name. We can see 'versicolor' for target 1, 'setosa' for target 0, and 'virginica' for target 2, in lowercase exactly as they are stored in `iris_data.target_names`. This human-readable format makes the data easier to interpret and reduces the chance of mislabeling a species during analysis.

For those comfortable with Python lambda expressions, we can achieve the same result more concisely. Instead of defining a separate function, we can use: `iris_dataframe['species'] = iris_dataframe['target'].apply(lambda target_number: iris_data.target_names[target_number])`. This one-liner does the same job with less code.
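The sketch below runs the lambda version and compares it against a named function, under the same assumption that `iris_data` comes from scikit-learn's `load_iris()`. The variable names `by_function` and `by_lambda` are illustrative only.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: iris_data is the bundle returned by scikit-learn's load_iris().
iris_data = load_iris()
iris_dataframe = pd.DataFrame(data=iris_data.data, columns=iris_data.feature_names)
iris_dataframe['target'] = iris_data.target


def get_flower_name(target_number):
    """Named-function version of the lookup, kept here for comparison."""
    return iris_data.target_names[target_number]


# Hypothetical helper names: one Series per approach.
by_function = iris_dataframe['target'].apply(get_flower_name)
by_lambda = iris_dataframe['target'].apply(
    lambda target_number: iris_data.target_names[target_number]
)

# Both approaches should label every row the same way.
print((by_function == by_lambda).all())

iris_dataframe['species'] = by_lambda
print(iris_dataframe['species'].value_counts())
```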

Both approaches yield identical results—the choice depends on your coding style, team preferences, and maintainability requirements. Named functions offer better debugging and documentation, while lambdas provide conciseness for simple transformations. Regardless of your chosen method, you now have a fully human-readable dataset ready for comprehensive analysis and modeling.
