Handwritten digit recognition is a fundamental challenge in computer vision that has applications ranging from digitizing historical documents to automating postal services. While Latin digit recognition has been thoroughly explored, Bangla (Bengali) digit recognition presents unique challenges due to the script’s distinctive features and the relative scarcity of large, standardized datasets.
In this blog post, we’ll explore a fascinating project that implements a Convolutional Neural Network (CNN) from scratch—without relying on deep learning frameworks—to recognize handwritten Bangla digits. We’ll delve into the neural network architecture, the mathematics behind it, the implementation details, and the results achieved.
The primary goal of this project is to build a convolutional neural network from the ground up that can accurately recognize handwritten Bangla digits. What makes this project particularly interesting is that it doesn’t use standard deep learning libraries like TensorFlow or PyTorch for the core neural network implementation. Instead, it constructs the entire CNN architecture using only NumPy, providing valuable insights into the inner workings of neural networks.
The project achieves an impressive 95.87% accuracy on the test dataset, demonstrating the effectiveness of the custom implementation.
The project uses the NumtaDB dataset, which contains handwritten Bangla digit images. You can find more information about this dataset at NumtaDB GitHub Repository.
The dataset is organized into multiple partitions (training-a through training-e). These partitions are combined and split into training and validation sets during development, with training-d often reserved for testing.
Each image in the dataset is a grayscale representation of a handwritten Bangla digit, and the corresponding labels are stored in CSV files that match the image filenames with their digit values (0-9).
Bangla digits have distinct visual properties compared to Latin digits: intricate curves and loops, delicate stroke connections, and several pairs of numerals that closely resemble one another.
These characteristics make Bangla digit recognition a more challenging task that benefits from robust preprocessing and well-designed neural network architectures.
Before feeding images into the neural network, several preprocessing steps are applied to enhance the model’s performance:
```python
# Convert to grayscale
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Convert the image to binary using Otsu's thresholding
img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

# Check if the background is black or white;
# if the background is white, invert the image
if np.mean(img) > 128:
    img = 255 - img

# Dilate and erode to remove noise
kernel = np.ones((2, 2), np.uint8)
img = cv2.dilate(img, kernel, iterations=1)
img = cv2.erode(img, kernel, iterations=1)
```
These operations ensure that every image is binary, has a consistent polarity (white digit on a black background), and is free of small specks of noise.
Additionally, the code removes empty rows and columns around the digit (where pixel intensity is near zero) and then resizes the image back to 28×28 pixels:
```python
# Find rows and columns with very low mean pixel intensity
row_drop = []
for i in range(img.shape[0]):
    if 0 <= np.mean(img[i, :]) <= 5:
        row_drop.append(i)

col_drop = []
for i in range(img.shape[1]):
    if 0 <= np.mean(img[:, i]) <= 5:
        col_drop.append(i)

# Drop the rows and columns
img = np.delete(img, row_drop, axis=0)
img = np.delete(img, col_drop, axis=1)

# Resize the image back to 28x28
img = cv2.resize(img, (28, 28), interpolation=cv2.INTER_AREA)
```
This centering process helps the model focus on the actual digit and ignore irrelevant empty space. For a deeper understanding of these image preprocessing techniques, you can visit OpenCV’s documentation on image processing.
The preprocessing pipeline is particularly effective for Bangla digits because:
Binarization with Otsu’s thresholding automatically determines the optimal threshold value, which is crucial for handling the varying stroke widths common in Bangla handwriting.
Background normalization ensures all digits have a consistent representation (white digits on black background), addressing the common inconsistency in how people write Bangla digits.
Morphological operations (dilation followed by erosion) help close small gaps in strokes that might otherwise break the connectivity of Bangla digits, which often contain delicate curves and connections.
Empty space removal centers the digit and maximizes its size within the frame, enhancing the model’s ability to detect the distinctive features of each Bangla numeral.
Final resizing to 28×28 standardizes all inputs for the network. Because the cropped region is stretched to fill the frame, every digit occupies the full input area at a consistent size, which helps the model compare like with like across writing styles.
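For reference, the empty-space removal above can be packaged as a small, testable NumPy helper (a sketch; `crop_to_content` is a hypothetical name, not from the project):

```python
import numpy as np

def crop_to_content(img, low=5):
    """Drop rows and columns whose mean intensity is at most `low`,
    mirroring the empty-space removal in the preprocessing code."""
    row_drop = [i for i in range(img.shape[0]) if np.mean(img[i, :]) <= low]
    col_drop = [j for j in range(img.shape[1]) if np.mean(img[:, j]) <= low]
    img = np.delete(img, row_drop, axis=0)
    return np.delete(img, col_drop, axis=1)

img = np.zeros((6, 6))
img[2:4, 2:4] = 255.0              # a bright 2x2 "digit" in the middle
print(crop_to_content(img).shape)  # (2, 2)
```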
The project implements two different CNN architectures.
The second architecture, which appears to be inspired by the classic LeNet-5 architecture, is the one used in the final implementation shown in the training code. You can learn more about the original LeNet architecture at Yann LeCun’s website.
The LeNet-inspired architecture is particularly suitable for Bangla digit recognition for several reasons:
Hierarchical Feature Extraction: The gradual increase in filter numbers (6 → 16) allows the network to capture increasingly complex patterns, which is essential for Bangla digits that contain intricate shapes and curves.
Appropriate Field of View: The 5×5 convolution kernels provide a sufficient receptive field to capture the distinctive strokes and curves of Bangla digits without missing important details or including too much irrelevant information.
Dimensionality Reduction: The max pooling layers progressively reduce the spatial dimensions while preserving the most important features, making the model more computationally efficient and helping it focus on the most discriminative aspects of each digit.
Multi-stage Fully Connected Layers: The sequence of fully connected layers (120 → 84 → 10) provides sufficient complexity to learn the mapping from abstract features to specific digit classes, handling the subtleties that distinguish similar-looking Bangla digits.
Balanced Capacity: With approximately 61,706 trainable parameters (much smaller than modern deep networks), this architecture strikes a good balance between having enough capacity to learn complex patterns and avoiding overfitting on the limited dataset size.
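The 61,706 figure checks out arithmetically. Assuming one input channel and a 5×5×16 = 400-element flattened feature map feeding the first fully connected layer (an inference from the total, not stated in the source), the per-layer counts sum as follows:

```python
# Per-layer trainable parameters (weights + biases) for the LeNet-style network,
# assuming 1 input channel and a 5x5x16 = 400-element flattened map before fc1
conv1 = 5 * 5 * 1 * 6 + 6    # 156
conv2 = 5 * 5 * 6 * 16 + 16  # 2,416
fc1 = 400 * 120 + 120        # 48,120
fc2 = 120 * 84 + 84          # 10,164
fc3 = 84 * 10 + 10           # 850

total = conv1 + conv2 + fc1 + fc2 + fc3
print(total)  # 61706
```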
The heart of this project is the implementation of the CNN from scratch using only NumPy. Let’s explore the key components:
The project implements two activation functions:
ReLU (Rectified Linear Unit):
```python
class ReLU():
    def f(self, x):
        return np.maximum(0, x)

    def df(self, x, cached_y=None):
        return np.where(x <= 0, 0, 1)
```
ReLU passes positive values unchanged and sets negative values to zero, allowing for efficient training of deep networks while mitigating the vanishing gradient problem. For a deeper mathematical understanding of activation functions, check out CS231n’s notes on activation functions.
Softmax:
```python
class SoftMax():
    def f(self, x):
        y = np.exp(x - np.max(x, axis=1, keepdims=True))
        return y / np.sum(y, axis=1, keepdims=True)

    def df(self, x, cached_y=None):
        # Note: this is not the true softmax Jacobian; in this codebase the
        # softmax gradient is effectively handled together with the
        # cross-entropy loss during backpropagation
        return np.where(x <= 0, 0, 1)
```
The Softmax function converts a vector of real numbers into a probability distribution, making it ideal for the output layer in classification tasks. Note that subtracting the maximum value before taking the exponential is a numerical stability trick to prevent overflow.
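The stability trick is easy to demonstrate: logits around 1000 would overflow `np.exp` directly, but the shifted version stays finite (a standalone check mirroring the `f` method above):

```python
import numpy as np

x = np.array([[1000.0, 1001.0, 1002.0]])       # np.exp(1000) alone overflows to inf
shifted = x - np.max(x, axis=1, keepdims=True)  # largest logit becomes 0
y = np.exp(shifted)
probs = y / np.sum(y, axis=1, keepdims=True)
print(probs.sum())  # sums to 1 (up to float rounding) -- a valid distribution
```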
The project uses the Softmax Cross-Entropy cost function:
```python
class SoftmaxCrossEntropy():
    # epsilon is a small constant (e.g. 1e-8) defined elsewhere in the module
    def f(self, a_last, y):
        batch_size = y.shape[0]
        cost = -1 / batch_size * (y * np.log(np.clip(a_last, epsilon, 1.0))).sum()
        return cost

    def grad(self, a_last, y):
        return -np.divide(y, np.clip(a_last, epsilon, 1.0))
```
This function computes the cross-entropy loss between the predicted probabilities and the true labels. The np.clip call prevents numerical instability by avoiding the logarithm of very small numbers. For more on cross-entropy loss, see Understanding Categorical Cross-Entropy Loss.
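A quick numeric check of the loss on a one-hot target (standalone; the epsilon value here is chosen only for illustration):

```python
import numpy as np

epsilon = 1e-8                        # illustrative clipping constant
a_last = np.array([[0.7, 0.2, 0.1]])  # predicted class probabilities
y = np.array([[1.0, 0.0, 0.0]])       # one-hot true label
batch_size = y.shape[0]

cost = -1 / batch_size * (y * np.log(np.clip(a_last, epsilon, 1.0))).sum()
print(round(cost, 4))  # 0.3567, i.e. -ln(0.7): only the true class contributes
```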
Two optimization algorithms are implemented:
Gradient Descent:
```python
class GradientDescent():
    def __init__(self, trainable_layers):
        self.trainable_layers = trainable_layers

    def initialize(self):
        pass

    def update(self, learning_rate, w_grads, b_grads, step):
        for layer in self.trainable_layers:
            layer.update_params(dw=learning_rate * w_grads[layer],
                                db=learning_rate * b_grads[layer])
```
This is the simplest optimization algorithm that updates weights by moving in the direction of the negative gradient scaled by the learning rate. Learn more about gradient descent at Sebastian Ruder’s optimization overview.
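To see the update rule in isolation, here is a minimal example (not from the project) that minimizes a one-dimensional quadratic with the same `w -= learning_rate * gradient` step:

```python
# Minimize f(w) = (w - 3)^2 by gradient descent; df/dw = 2 * (w - 3)
w, learning_rate = 0.0, 0.1
for _ in range(100):
    grad = 2 * (w - 3)
    w -= learning_rate * grad
print(round(w, 4))  # 3.0, the minimizer
```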
Adam (Adaptive Moment Estimation):
```python
class Adam():
    def __init__(self, trainable_layers, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.trainable_layers = trainable_layers
        self.v = {}
        self.s = {}
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon

    # ... (initialization and update methods)
```
Adam is a more sophisticated algorithm that adapts the learning rate for each parameter based on estimates of first and second moments of the gradients. It’s generally more efficient than standard gradient descent. For an in-depth explanation of the Adam optimizer, check out the original paper by Kingma and Ba.
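The elided update methods follow the standard Adam rule. A standalone sketch of that rule for a single parameter array (an illustration of the algorithm, not the project's exact code; `adam_step` is a hypothetical name):

```python
import numpy as np

def adam_step(w, grad, v, s, t, lr=0.005, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a parameter array w; returns (w_new, v, s)."""
    v = beta1 * v + (1 - beta1) * grad       # first moment (mean of gradients)
    s = beta2 * s + (1 - beta2) * grad ** 2  # second moment (uncentered variance)
    v_hat = v / (1 - beta1 ** t)             # bias correction for step t
    s_hat = s / (1 - beta2 ** t)
    w_new = w - lr * v_hat / (np.sqrt(s_hat) + eps)
    return w_new, v, s

# The first bias-corrected step has magnitude ~lr, independent of gradient scale
w1, _, _ = adam_step(0.0, -6.0, 0.0, 0.0, t=1, lr=0.1)
print(round(w1, 4))  # 0.1
```

This scale invariance of the step size is one reason Adam is often easier to tune than plain gradient descent.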
The most complex part of this project is the custom implementation of the convolutional layer:
```python
class Conv():
    def __init__(self, kernel_size, stride, n_c, padding='valid', activation=relu):
        # ... initialization code ...

    def forward(self, a_prev, training):
        batch_size = a_prev.shape[0]
        a_prev_padded = Conv.zero_pad(a_prev, self.pad)
        out = np.zeros((batch_size, self.n_h, self.n_w, self.n_c))

        # Convolve: slide the kernel over every output position
        for i in range(self.n_h):
            v_start = i * self.stride
            v_end = v_start + self.kernel_size
            for j in range(self.n_w):
                h_start = j * self.stride
                h_end = h_start + self.kernel_size
                out[:, i, j, :] = np.sum(
                    a_prev_padded[:, v_start:v_end, h_start:h_end, :, np.newaxis] *
                    self.w[np.newaxis, :, :, :], axis=(1, 2, 3))

        z = out + self.b
        a = self.activation.f(z)
        # ... caching for backward pass ...
        return a
```
This implementation zero-pads the input when padding is requested, slides a window over every output position, and computes each output as a broadcasted element-wise product summed over the kernel height, width, and input channels, processing the whole mini-batch and all output channels at once. It then adds the bias and applies the activation function.
Why this custom implementation is particularly effective:
Direct Control: By implementing the convolution operation directly, the code has complete control over how features are extracted, which allows for custom optimizations specific to Bangla digits.
Transparency: The explicit loop-based implementation makes it clear exactly how the convolution operates, without any “magic” happening inside library functions.
Educational Value: This approach forces a deep understanding of how CNNs work, which can lead to more thoughtful architecture design.
Flexibility: The implementation can be easily modified to handle specific challenges of Bangla digit recognition if needed.
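To make the sliding-window pattern concrete, here is a stripped-down single-channel version using the same start/end slicing scheme (an illustrative helper; `conv2d_single` is not part of the project):

```python
import numpy as np

def conv2d_single(a, w, stride=1):
    """Valid cross-correlation of a 2-D input with a 2-D kernel,
    using the same slicing pattern as Conv.forward (single channel)."""
    k = w.shape[0]
    n_h = (a.shape[0] - k) // stride + 1
    n_w = (a.shape[1] - k) // stride + 1
    out = np.zeros((n_h, n_w))
    for i in range(n_h):
        for j in range(n_w):
            patch = a[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(patch * w)
    return out

# A 2x2 all-ones kernel over a 3x3 all-ones input sums 4 pixels per position
print(conv2d_single(np.ones((3, 3)), np.ones((2, 2))))
```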
The backward pass through the convolutional layer is equally important:
```python
def backward(self, da):
    batch_size = da.shape[0]
    a_prev, z, a = (self.cache[key] for key in ('a_prev', 'z', 'a'))
    a_prev_pad = Conv.zero_pad(a_prev, self.pad) if self.pad != 0 else a_prev

    da_prev = np.zeros((batch_size, self.n_h_prev, self.n_w_prev, self.n_c_prev))
    da_prev_pad = Conv.zero_pad(da_prev, self.pad) if self.pad != 0 else da_prev

    dz = da * self.activation.df(z, cached_y=a)
    db = 1 / batch_size * dz.sum(axis=(0, 1, 2))
    dw = np.zeros((self.kernel_size, self.kernel_size, self.n_c_prev, self.n_c))

    # 'Convolve' back
    for i in range(self.n_h):
        v_start = self.stride * i
        v_end = v_start + self.kernel_size
        for j in range(self.n_w):
            h_start = self.stride * j
            h_end = h_start + self.kernel_size
            da_prev_pad[:, v_start:v_end, h_start:h_end, :] += \
                np.sum(self.w[np.newaxis, :, :, :, :] *
                       dz[:, i:i+1, j:j+1, np.newaxis, :], axis=4)
            dw += np.sum(a_prev_pad[:, v_start:v_end, h_start:h_end, :, np.newaxis] *
                         dz[:, i:i+1, j:j+1, np.newaxis, :], axis=0)

    dw /= batch_size
    # ... handle padding for da_prev ...
    return da_prev, dw, db
```
This implementation:

- Computes the gradient with respect to the pre-activation output (dz)
- Accumulates the gradients of the weights (dw) and biases (db), averaged over the batch
- Computes the gradient with respect to the layer's input (da_prev), which is passed back to the previous layer

For a more detailed explanation of backpropagation in CNNs, see CS231n's notes on backpropagation.
The custom CNN implementation achieves an impressive 95.87% accuracy on Bangla digit recognition for several key reasons:
The LeNet-inspired architecture used in this project is particularly well-suited for digit recognition tasks. Originally designed for handwritten digit recognition (albeit for Latin digits), LeNet’s structure provides an excellent starting point for Bangla digits as well. The sequence of convolutional layers, pooling layers, and fully connected layers creates a hierarchical feature extraction process that can identify the distinctive patterns in Bangla numerals.
The preprocessing steps are carefully designed to enhance the quality of the input images: binarization, background normalization, noise removal, empty-space cropping, and resizing, as detailed earlier.
The implementation incorporates several techniques to ensure numerical stability: subtracting the maximum logit before exponentiation in the softmax, and clipping predicted probabilities with np.clip before taking their logarithm in the loss.
The use of mini-batch training (with a batch size of 200) provides a form of regularization and helps the model generalize better. It also allows for more frequent weight updates, which can lead to faster convergence.
The learning rate (0.005) and number of epochs (30) have been carefully chosen to balance training speed with accuracy. The relatively small learning rate ensures stable training, while the number of epochs provides sufficient time for the model to converge without overfitting.
Using multiple evaluation metrics (accuracy, F1 score, and confusion matrix) provides a comprehensive view of the model’s performance, helping to identify and address potential issues with specific digits.
The implementation employs several techniques to optimize performance:
The weights are initialized using the He initialization method, which is particularly well-suited for ReLU activations:
```python
self.w = np.random.randn(self.kernel_size, self.kernel_size, self.n_c_prev, self.n_c) * \
         np.sqrt(2 / (self.kernel_size * self.kernel_size * self.n_c_prev))
```
This initialization strategy helps ensure that activations neither vanish nor explode at the beginning of training, allowing for faster convergence. For more on weight initialization strategies, see Understanding Xavier Initialization in Deep Neural Networks.
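The effect can be checked empirically: with He scaling, pre-activations keep variance on the order of 1 (a standalone check using the same scaling formula; the layer sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in = 5 * 5 * 6  # kernel_size * kernel_size * n_c_prev, as in the formula above
n_units = 10_000

# He-scaled weights and ReLU-activated inputs
w = rng.standard_normal((fan_in, n_units)) * np.sqrt(2 / fan_in)
x = np.maximum(0, rng.standard_normal(fan_in))

z = x @ w
print(z.var())  # roughly 1: pre-activations neither vanish nor explode
```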
The model is trained using mini-batches rather than full-batch or stochastic gradient descent:
```python
cnn.train(x_train, y_train,
          mini_batch_size=200,
          learning_rate=0.005,
          num_epochs=30,
          validation_data=(x_test, y_test))
```
Mini-batch training strikes a balance between computational efficiency and update stability. It also introduces a form of regularization that can help the model generalize better.
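Internally, a mini-batch loop of this kind typically shuffles the data each epoch and slices it into fixed-size chunks. Here is a sketch (the project's actual loop may differ; `mini_batches` is a hypothetical name):

```python
import numpy as np

def mini_batches(x, y, batch_size, rng):
    """Yield shuffled (x, y) mini-batches covering the whole dataset."""
    idx = rng.permutation(len(x))
    for start in range(0, len(x), batch_size):
        sel = idx[start:start + batch_size]
        yield x[sel], y[sel]

rng = np.random.default_rng(0)
x = np.arange(1000).reshape(1000, 1)
y = np.arange(1000)
batches = list(mini_batches(x, y, 200, rng))
print(len(batches))  # 5 mini-batches per epoch at batch size 200
```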
The learning rate (0.005) has been carefully chosen to be small enough for stable, non-divergent updates, yet large enough for the model to converge within the 30 training epochs.
To improve the efficiency of backpropagation, the implementation caches activations and intermediate values during the forward pass:
```python
if training:
    # Cache for backward pass
    self.cache.update({'a_prev': a_prev, 'z': z, 'a': a})
```
This eliminates the need to recompute these values during the backward pass, significantly improving computational efficiency.
This Bangla handwritten digit recognition system has numerous practical applications:
Document Digitization: Converting handwritten Bangla documents into digital text, preserving cultural and historical materials.
Postal Automation: Automating the sorting and routing of mail based on handwritten Bangla postal codes and addresses.
Educational Tools: Creating interactive learning systems to help students practice writing Bangla numerals with real-time feedback.
Banking Systems: Processing handwritten Bangla digits on checks, forms, and other financial documents.
Form Processing: Automating the entry of handwritten numerical data from forms in government offices, healthcare facilities, and businesses.
Beyond these applications, several future directions stand out. Extending to Bangla Characters: The system could be extended to recognize the full Bangla script, not just digits.
Real-time Recognition: Optimizing the implementation for real-time recognition on mobile devices or embedded systems.
Multi-script Recognition: Developing a system that can handle multiple scripts (Bangla, Latin, Devanagari, etc.) simultaneously.
Integration with OCR Systems: Incorporating this digit recognition system into a full Optical Character Recognition pipeline for Bangla documents.
Transfer Learning Approaches: Exploring how pre-trained models on Latin digits could be adapted for Bangla digits with minimal retraining.
This project demonstrates that it’s possible to build a highly effective neural network from scratch using only NumPy, without relying on deep learning frameworks. By achieving an accuracy of 95.87% on Bangla handwritten digit recognition, it shows that a well-designed architecture with appropriate preprocessing can perform exceptionally well even without the advanced optimizations provided by modern deep learning libraries.
The implementation provides valuable insights into the inner workings of convolutional neural networks, including forward and backward propagation, various layer types, and optimization algorithms. It serves as both a practical solution for Bangla digit recognition and an educational resource for those wanting to understand the fundamentals of neural networks.
As AI applications continue to expand globally, projects like this that focus on languages beyond English make technology more inclusive and accessible to diverse populations. The techniques and insights from this project can inspire similar efforts for other scripts and languages, contributing to a more equitable technological landscape.