Convolutions to CNN

Dev Rishi Khare

We will study convolution, which is a popular array operation that is used in various forms in signal processing, digital recording, image processing, video processing, and computer vision. In these application areas, convolution is often performed as a filter that transforms signals and pixels into more desirable values.

In high-performance computing, the convolution pattern is often referred to as stencil computation. Convolution typically involves a significant number of arithmetic operations on each data element. Each output data element can be calculated independently of the others, a desirable trait for parallel computing. On the other hand, there is a substantial level of input data sharing among output data elements, with somewhat challenging boundary conditions. This makes convolution an important use case for sophisticated tiling and input data staging methods.

Convolution is an array operation where each output data element is a weighted sum of a collection of neighboring input elements. The weights used in the weighted sum calculation are defined by an input mask array, commonly referred to as the convolution kernel. We will refer to these mask arrays as convolution masks. The same convolution mask is typically used for all elements of the array.

Below is shown a convolution example for 1D data where a 5-element convolution mask array M is applied to a 7-element input array N.

1D Convolution
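The example above can be sketched directly in code. This is a minimal NumPy version, assuming out-of-range input elements (the "ghost cells" at the boundaries) are treated as zero so the output has the same length as the input; the names `conv1d`, `N`, `M`, and `P` are illustrative.

```python
import numpy as np

# Sketch of 1D convolution: each output P[i] is a weighted sum of the
# input elements around position i. Out-of-range elements are treated
# as zero (ghost cells), so the output matches the input length.
def conv1d(N, M):
    n, m = len(N), len(M)
    r = m // 2                      # mask radius
    P = np.zeros(n)
    for i in range(n):              # each output element is independent
        for j in range(m):
            k = i - r + j           # input index covered by mask element j
            if 0 <= k < n:          # skip ghost cells (implicitly zero)
                P[i] += N[k] * M[j]
    return P

N = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)   # 7-element input
M = np.array([3, 4, 5, 4, 3], dtype=float)         # 5-element mask
print(conv1d(N, M))
```

Note that this computes a sliding weighted sum without flipping the mask, which is how "convolution" is normally defined in image processing and CNNs (in signal-processing terms, a cross-correlation).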

For image processing and computer vision, input data are typically two-dimensional arrays, with pixels in an x-y space. Image convolutions are therefore 2D convolutions. The mask does not have to be a square array.

2D Convolution
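The 2D case is the same loop extended to rows and columns. Below is a sketch under the same zero-padding assumption; the box-blur mask and the function name `conv2d` are illustrative.

```python
import numpy as np

# Sketch of 2D convolution: the 1D loop, extended to rows and columns.
# Out-of-range pixels are again treated as zero; the mask need not be square.
def conv2d(N, M):
    H, W = N.shape
    mh, mw = M.shape
    rh, rw = mh // 2, mw // 2       # mask radii in y and x
    P = np.zeros((H, W))
    for y in range(H):
        for x in range(W):
            for j in range(mh):
                for i in range(mw):
                    yy, xx = y - rh + j, x - rw + i
                    if 0 <= yy < H and 0 <= xx < W:
                        P[y, x] += N[yy, xx] * M[j, i]
    return P

N = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 image
box = np.ones((3, 3)) / 9.0                    # simple box-blur mask
print(conv2d(N, box))
```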


1. From 1D to 2D convolution

You know 1D convolution: each output P[i] is a weighted sum of the input elements around position i, P[i] = sum_j N[i - r + j] * M[j], where r is the mask radius.

For images, the input is 2D (height × width), so the mask is also 2D and slides over both rows and columns: P[y][x] = sum over (j, i) of N[y - ry + j][x - rx + i] * M[j][i].

This is just the same idea you already know, but applied along two axes at once: every output pixel is still a weighted sum of its local neighborhood.


2. Multiple channels (e.g., RGB)

Real images are not just 2D, they often have channels: an RGB image, for example, is an H × W × 3 array, one 2D plane per color channel.

Convolution step: the kernel has the same channel depth as the input, and each output value sums the element-wise products over the spatial window and over all channels at once.

So conceptually: 3D convolution over space + channels, but still just “weighted sum of neighbors”.
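The multi-channel step can be sketched as follows. This is a minimal "valid" (no padding) version, assuming an H × W × C input and a kernel of matching depth; all names and shapes are illustrative.

```python
import numpy as np

# Sketch of convolution over an H x W x C input (e.g. C=3 for RGB).
# The kernel has the same channel depth; each output value sums over
# the spatial window AND all channels, producing a single 2D output map.
def conv2d_multichannel(x, k):
    H, W, C = x.shape
    kh, kw, _ = k.shape              # kernel depth must equal C
    Ho, Wo = H - kh + 1, W - kw + 1  # "valid" convolution: no padding
    out = np.zeros((Ho, Wo))
    for y in range(Ho):
        for xx in range(Wo):
            out[y, xx] = np.sum(x[y:y+kh, xx:xx+kw, :] * k)
    return out

x = np.random.rand(8, 8, 3)          # toy RGB-like input
k = np.random.rand(3, 3, 3)          # one 3x3x3 kernel
print(conv2d_multichannel(x, k).shape)   # (6, 6)
```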


3. Filters (kernels) and feature maps

In CNNs, you don’t use just one kernel; you use many: a layer holds F different filters, and each filter produces its own 2D output, called a feature map.

So if the input shape is H × W × C, each filter has shape k × k × C, and applying all F filters yields an output of shape H′ × W′ × F.

Each of those F output channels captures a different kind of pattern: one filter may respond to vertical edges, another to a texture, another to a particular color blob.

Think:
“A CNN layer = apply many different convolutions (filters) over the input.”
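A whole layer, then, is just a bank of filters. This sketch stacks F single-filter convolutions into one H′ × W′ × F output; the shapes and the name `conv_layer` are illustrative.

```python
import numpy as np

# Sketch of a conv layer as a bank of F filters, each of shape kh x kw x C.
# Each filter yields one 2D feature map; stacking them gives Ho x Wo x F.
def conv_layer(x, filters):
    H, W, C = x.shape
    F, kh, kw, _ = filters.shape
    Ho, Wo = H - kh + 1, W - kw + 1
    out = np.zeros((Ho, Wo, F))
    for f in range(F):
        for y in range(Ho):
            for xx in range(Wo):
                out[y, xx, f] = np.sum(x[y:y+kh, xx:xx+kw, :] * filters[f])
    return out

x = np.random.rand(8, 8, 3)
filters = np.random.rand(4, 3, 3, 3)   # F=4 filters, each 3x3x3
print(conv_layer(x, filters).shape)    # (6, 6, 4)
```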


4. Stride and padding

Two more array-related details: stride, the step size by which the window slides, and padding, extra (usually zero) border elements added around the input.

Mathematically, it’s still just sliding windows and dot products, but with control over the output size and over which input positions the window visits.
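The effect of stride and padding on the output size follows the standard formula out = (in + 2·pad − kernel) // stride + 1, sketched below; the function name is illustrative.

```python
# Sketch of the standard output-size formula for a strided, padded convolution:
#   out = (in + 2*pad - kernel) // stride + 1
def conv_output_size(n, k, stride=1, pad=0):
    return (n + 2 * pad - k) // stride + 1

print(conv_output_size(7, 3))                   # no padding, stride 1 -> 5
print(conv_output_size(7, 3, pad=1))            # "same" padding -> 7
print(conv_output_size(7, 3, stride=2, pad=1))  # stride 2 -> 4
```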


5. Nonlinearity: adding activation functions

So far it’s all linear: weighted sums. A CNN layer adds nonlinearity by passing every output value through an activation function, most commonly ReLU(x) = max(0, x).

So a single conv layer is:

  1. Convolution (linear weighted sums).
  2. Activation (nonlinear).

This combination lets CNNs learn complex, non-linear mappings.
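The two steps above can be sketched with ReLU as the activation; the pre-activation values `z` here stand in for a conv output and are purely illustrative.

```python
import numpy as np

# Sketch of step 2 of a conv layer: the nonlinearity. ReLU(x) = max(0, x)
# is applied elementwise to the linear conv output (the pre-activation).
def relu(x):
    return np.maximum(0.0, x)

z = np.array([[-2.0, 0.5],
              [3.0, -0.1]])   # pretend conv output (pre-activation)
a = relu(z)                   # feature map after the nonlinearity
print(a)                      # negatives clipped to 0
```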


6. Stacking layers → deep feature hierarchies

A full CNN is just many layers of convolution followed by activation (often with pooling in between), each layer consuming the feature maps produced by the previous one.

Intuition: early layers detect simple local patterns such as edges; deeper layers combine those into textures, parts, and eventually whole objects.

All of this is still built from the same primitive you understand:
“take local neighborhoods and compute weighted sums.”


7. Connection to fully connected (dense) layers

A fully connected layer is also just weighted sums: every output is a weighted sum of every input element.

Difference: a convolutional layer is local (each output looks only at a small neighborhood) and shares its weights across positions, while a dense layer connects everything to everything, with its own weight per connection.

So you can see CNNs as: fully connected networks with a strong structural prior, namely locality and weight sharing.

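For contrast, here is a fully connected layer as a bare weighted sum, with no locality or weight sharing; the shapes (a flattened feature map feeding 5 outputs) are illustrative.

```python
import numpy as np

# Sketch of a dense (fully connected) layer: one big weighted sum over
# ALL inputs per output, with no locality or weight sharing.
def dense(x, W, b):
    return W @ x + b            # every output uses every input element

x = np.random.rand(12)          # e.g. a flattened feature map
W = np.random.rand(5, 12)       # 5 outputs, each with its own 12 weights
b = np.zeros(5)
print(dense(x, W, b).shape)     # (5,)
```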

Conclusion: From Simple Operation to Powerful Networks

If you understand convolution as:

“At each position, take neighbors and compute a weighted sum with a kernel,”

then you’re already standing on the core idea behind CNNs.

CNNs take that simple building block and: stack it in layers, interleave it with nonlinearities, and learn the kernel weights from data instead of hand-designing them.

The magic of CNNs isn’t a new kind of math: it’s the composition of many small, local convolution operations into a deep architecture that can learn powerful visual representations.