CNNs consist of multiple layers including convolutional layers, pooling layers, and fully connected layers. The convolutional layers apply filters to the input image to create feature maps. Pooling layers downsample these maps to reduce dimensionality and computational complexity. Finally, fully connected layers classify the features extracted by the convolutional and pooling layers.