Now we'll build a compact three-layer neural network, then examine the architecture, behavior, and parameters that make it work. We're building a Keras Sequential model, a straightforward architecture in which data flows through each layer in order. That simplicity makes it a good fit for beginners and for many production applications.
Our first layer is the input layer, implemented as a TensorFlow Keras flatten operation. This layer takes our 28×28 pixel images and reshapes them into a format the network can process efficiently. The input_shape parameter of (28, 28) tells the network to expect square images of this dimension—a standard format for the MNIST handwritten digit dataset we're working with.
The second layer is the heart of our network: a dense (fully connected) hidden layer with 128 neurons. That may sound modest, but every one of those neurons connects to all 784 inputs, so the layer's true scale only becomes clear once we count its connections later in this section. Each neuron uses ReLU (Rectified Linear Unit) activation, a choice that has become the default in modern neural networks for reasons we'll examine shortly.
Our final layer is the output layer—technically another dense layer, but functionally distinct in its purpose. It contains exactly 10 neurons because we're classifying 10 possible outcomes: digits 0 through 9. This layer employs softmax activation, which transforms raw neural outputs into probability distributions that sum to 1.0, giving us interpretable confidence scores for each digit class.
When we pass these layers to Keras, TensorFlow constructs the complete neural network architecture in milliseconds. But understanding what happens beneath the surface requires examining each component in detail.
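The three layers described above can be sketched in a few lines. This is a minimal version that assumes TensorFlow 2.x is installed and imported as `tf`; the variable name `model` is our own choice, and recent Keras releases may emit a warning recommending an explicit `Input` object in place of `input_shape`.

```python
import tensorflow as tf

# Layers are listed in the order the data flows through them.
model = tf.keras.Sequential([
    # Input layer: reshape each 28x28 image into a 784-value vector.
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    # Hidden layer: 128 fully connected neurons with ReLU activation.
    tf.keras.layers.Dense(128, activation="relu"),
    # Output layer: one neuron per digit class, normalized by softmax.
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Print each layer's output shape and parameter count.
model.summary()
```

The summary should report 100,480 parameters for the hidden layer and 1,290 for the output layer; both figures include one bias per neuron in addition to the weights.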
Let's start with the flatten layer and why it's essential for our architecture. "Flattening" is a fundamental preprocessing step that converts multidimensional arrays into one-dimensional vectors. Our 28×28 image matrix becomes a single array of 784 values, maintaining the exact same pixel data in the same sequence.
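To make the reshaping concrete, here is a small NumPy sketch. The array `image` is a hypothetical stand-in for one MNIST picture, filled with position indices rather than real pixel intensities so the ordering is easy to check.

```python
import numpy as np

# Stand-in 28x28 "image": each cell holds its own position index.
image = np.arange(28 * 28).reshape(28, 28)

# Flatten row by row into a single vector of 784 values.
flat = image.reshape(-1)

print(flat.shape)                    # (784,)
# The pixel at row 3, column 5 lands at index 3 * 28 + 5.
print(flat[3 * 28 + 5] == image[3, 5])  # True
```

No values are changed or dropped; only the shape of the container differs.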
While the 2D grid structure helps humans visualize and interpret images, a dense layer expects each example as a flat vector of numbers. Flattening discards the explicit notion of which pixels are neighbors, so this network does not model spatial relationships directly (convolutional networks are designed for that). Instead, it learns to weight each of the 784 individual pixel values according to its importance for digit classification. This lets the network discover patterns that might not be obvious to human observers, including relationships between pixels that aren't spatially adjacent but are statistically informative.
A flat array of numerical values also maps directly onto the matrix operations that form the backbone of neural network calculations: an entire layer can be evaluated as a single matrix multiplication.
The second layer is where the real complexity, and power, emerges. This dense (hidden) layer is the part practitioners often describe as a "black box": its learned weights are fully visible, but what they collectively represent is hard to interpret, even for the people who built the model. Here the computational scale becomes impressive: our 784 input values connect to 128 neurons, creating 100,352 individual weighted connections (784 × 128). Each neuron also has a bias term, so the layer holds 100,480 trainable parameters in total.
Each connection has its own weight parameter that the network adjusts during training. In effect, the network picks up on patterns such as "when pixel 247 has a high value and pixel 156 has a low value, the digit is more likely to be a 5." It never stores such rules explicitly; they are encoded implicitly in the weight values. These weights are learned through exposure to thousands of training examples, with training gradually adjusting each connection to improve classification accuracy.
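The hidden layer's arithmetic can be sketched with NumPy as one matrix multiplication followed by ReLU. The weights here are random placeholders rather than trained values, the input `x` is a made-up flattened image, and the bias vector reflects the Keras default of one bias per neuron.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.random(784)                       # hypothetical flattened image
W = rng.normal(0.0, 0.05, size=(784, 128))  # one weight per connection (untrained)
b = np.zeros(128)                         # one bias per hidden neuron

z = x @ W + b            # weighted sum arriving at each of the 128 neurons
h = np.maximum(0, z)     # ReLU: negative sums become zero

print(W.size)            # 100352 connections
print(W.size + b.size)   # 100480 trainable parameters
print(h.shape)           # (128,)
```

Training would replace the random entries of `W` and `b` with values that make the downstream classification more accurate.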
This is why even major tech companies like Google and Meta sometimes can't fully explain individual decisions made by their neural networks. The models can work exceptionally well, in some cases matching or exceeding human performance on specific tasks, but the reasoning is spread across hundreds of thousands or millions of weighted connections that don't translate into human-interpretable logic. Large-scale search, recommendation, and language systems built on neural networks share this trade-off: empirical effectiveness over explicable reasoning.
The output layer brings everything together with 10 neurons representing our digit classes. Each neuron receives weighted inputs from all 128 hidden layer neurons (another 1,280 connections), producing raw scores that indicate the network's confidence for each digit. Before we can interpret these scores, they pass through the softmax activation function, which normalizes them into a probability distribution.
The result is a set of probabilities that sum to 1.0, answering questions like: "Is this a 3? 89% confident. An 8? 7% confident. A 5? 2% confident." The highest probability wins, but having access to the full distribution provides valuable information about the network's uncertainty and alternative hypotheses.
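Reading a prediction from such a distribution might look like the sketch below. The vector `probs` is invented to match the percentages quoted above (index 0 is digit 0, index 9 is digit 9); it is not output from a trained model.

```python
import numpy as np

# Hypothetical softmax output for one image, one entry per digit 0-9.
probs = np.array([0.002, 0.003, 0.004, 0.89, 0.001,
                  0.02, 0.005, 0.003, 0.07, 0.002])

predicted = int(np.argmax(probs))      # digit with the highest probability
top3 = np.argsort(probs)[::-1][:3]     # three most likely digits, best first

print(predicted)   # 3
print(top3)        # [3 8 5]
```

Keeping the runners-up around is useful when a downstream step wants to flag low-confidence predictions for review.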
Understanding activation functions is crucial for neural network design. ReLU (Rectified Linear Unit) has become the dominant choice in modern architectures, despite—or perhaps because of—its elegant simplicity. The function simply returns max(0, n): if the input is positive, it passes through unchanged; if negative, it becomes zero.
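As a quick illustration, ReLU can be written in one line of NumPy. The function name `relu` and the sample inputs are our own.

```python
import numpy as np

def relu(x):
    """Return max(0, x) element-wise."""
    return np.maximum(0, x)

values = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(values))   # negatives become 0; 1.5 and 3.0 pass through unchanged
```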
This seemingly basic operation solves several critical problems. First, because ReLU zeroes out negative values, a neuron whose pattern is absent from the input simply stays silent instead of passing a negative signal forward; it contributes nothing to the next layer rather than actively pushing against other classifications. Second, ReLU mitigates the vanishing gradient problem that plagued earlier activation functions like sigmoid and tanh. Its gradient is exactly 1 for any positive input, whereas the sigmoid's gradient never exceeds 0.25, so error signals shrink far less as they propagate back through many layers.
ReLU's computational efficiency also matters at scale. Unlike sigmoid, which requires computing an exponential, ReLU needs only a comparison with zero. When you're processing millions of parameters across thousands of training iterations, this saving compounds. Alternatives such as Leaky ReLU, ELU, and Swish have emerged, but ReLU remains a reliable default for most applications.
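A rough back-of-the-envelope sketch shows why the vanishing gradient matters. It is deliberately simplified: it looks only at the activation function's own derivative, ignores the weights, and assumes a hypothetical 10-layer stack, which is deeper than the network built here.

```python
import math

def sigmoid_grad(z):
    """Derivative of the sigmoid function at z."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU at z (taken as 0 at z <= 0)."""
    return 1.0 if z > 0 else 0.0

depth = 10  # assumed number of stacked layers, for illustration only

# Best case for sigmoid: its derivative peaks at 0.25 when z = 0.
sigmoid_factor = sigmoid_grad(0.0) ** depth
# For an active ReLU neuron the derivative is 1 at every layer.
relu_factor = relu_grad(1.0) ** depth

print(sigmoid_factor)  # about 9.5e-07
print(relu_factor)     # 1.0
```

Even in sigmoid's best case, the signal reaching the earliest layer is scaled down by roughly a million, while an active ReLU path leaves it untouched.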
Softmax activation serves a different but equally important role in the output layer. Raw neural network outputs can be any real number—positive, negative, large, or small. Softmax transforms these raw logits into a proper probability distribution where all values fall between 0 and 1 and sum to exactly 1.0.
The mathematical elegance of softmax lies in its ability to amplify differences between competing classes while maintaining a probabilistic interpretation. A raw output of [2.1, 1.8, 0.3] becomes approximately [0.52, 0.39, 0.09] after softmax: the ordering of the classes is preserved, the two leading candidates stay close, and the weakest score is pushed toward zero. This makes softmax indispensable for multi-class classification problems where you need both a decision and a confidence measure.
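Softmax is short enough to compute by hand or in a few lines of NumPy. The sketch below uses the three-value logit vector from this paragraph; subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # avoids overflow for large logits
    exps = np.exp(shifted)
    return exps / exps.sum()

probs = softmax(np.array([2.1, 1.8, 0.3]))
print(np.round(probs, 2))   # [0.52 0.39 0.09]
print(probs.sum())          # 1.0, up to floating-point rounding
```

Because the exponential grows quickly, a gap of 1.5 between the second and third logits turns into a more than fourfold gap in probability.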
With our architecture defined and its components understood, we're ready to move beyond static structure into dynamic training. The next phase involves compiling the model with optimization algorithms and loss functions, then feeding it data to learn from—transforming our carefully designed but untrained network into a working digit classifier.