No doubt you have noticed a common theme this year in how people interact with technology: it has finally reached a stage where our interaction with the digital world can be less artificial. Examples include gesture-controlled interfaces (such as the Kinect, LeapMotion, and Intel’s Perceptual Computing kit), touch, eye tracking, and voice. All fall under a category of Human Computer Interaction (HCI) called Natural User Interfaces (NUI), whose goal is to make interfacing with devices, as the name suggests, natural.
The one we’ll look at in depth in this article is user input via a stylus, with a focus on developing a simple handwritten Optical Character Recognition (OCR) system that interprets numbers and symbols to solve simple mathematical equations. Drawing and writing are old skills, mastered many years ago, which no doubt helped humans survive by allowing us to pass lessons down from generation to generation. They are skills we dedicate a lot of time to practising at school, and with the current demand for tablets and phablets (some with built-in styluses) we will no doubt see this become a common form of HCI.
In this post we present a brief overview of how a simple OCR solution can be built using OpenCV; the goal is to build a simple calculator that can interpret the user’s writing and provide the answer when prompted (via the ‘=’ symbol). Our aim is not to focus on the code (please ask if this is of interest) but rather on the overall approach. Below is a screenshot of the end result in action:
If you’re not familiar with OpenCV, it’s worth checking out if you ever need a toolset for interpreting and understanding images or for machine learning. OpenCV (Open Source Computer Vision Library) is an open source computer vision and machine learning software library available (and optimised) on most major platforms (Android, iOS, Mac, Windows, …), and it is the library used here.
Let’s break down our task into manageable chunks and work our way through each of them.
- In order to understand anything we must build up a knowledge base to refer to, i.e. we need to teach (or train) the system that the shape ‘+’ means addition. We also need a way to access this bank of knowledge so that we can reference it when trying to understand the user’s input.
- Next we need to ‘watch’ what the user writes. Two possible approaches are to interpret gestures (i.e. build up a library of gestures for each character and symbol) or to view the user’s input as a whole, i.e. interpret the whole image, or a subset of it, when the user lifts their pen. The approach taken here was to interpret the image as a whole, as this offers the greatest flexibility: input can come from the stylus or from a captured image.
- Because we’re taking in a whole image (or a region of an image), we need to break it down into segments so that each individual character/symbol can be interpreted one at a time.
- Once we have these segments we need to convert each one into a readable/comparable format so that we can match it against our taught knowledge bank.
- Now we can provide our best guess of what each segment is likely to be.
Our first task is to ‘train’ the system on what each digit and symbol is. To do this we need to devise an approach that is comparable when matching, i.e. the process used for training must be the same as the one used when interpreting each segment. The idea is to present the system with many variants of each digit/symbol and, for each one, create a signature and an associated label (e.g. ‘+’ or ‘1’) before saving it into what is essentially a lookup table.
This training data can come from a library of images or from the user. Our prototype used both approaches: it started with a base set of data and allowed users to train it with their own handwriting.
The following diagram shows the process visually; remember that this same approach is used to iterate through each segment when trying to interpret the user’s handwriting.
The main job here is to create a signature that is consistent (and therefore comparable) across all segments read. Because we’re training our system here, we store the signature along with its associated label. In most cases, the more training data you have the better the results, although there is a trade-off between memory and lookup time to consider.
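As a rough illustration of the idea, the sketch below assumes a very simple signature: the segment is divided into a small grid and the fraction of ‘on’ pixels in each cell is recorded. The names `make_signature` and `train`, the grid size, and the plain list used as a lookup table are all hypothetical choices made for this example rather than the prototype’s actual code.

```python
from typing import List, Tuple

Image = List[List[int]]   # binary image: rows of pixels, 0 = off, 255 = on
Signature = List[float]   # one ink-density value per grid cell


def make_signature(segment: Image, grid: int = 4) -> Signature:
    """Assumed signature: fraction of 'on' pixels in each cell of a grid x grid lattice."""
    height, width = len(segment), len(segment[0])
    signature = []
    for gy in range(grid):
        # Rows covered by this cell (always at least one row).
        y0 = gy * height // grid
        y1 = max(y0 + 1, (gy + 1) * height // grid)
        for gx in range(grid):
            # Columns covered by this cell (always at least one column).
            x0 = gx * width // grid
            x1 = max(x0 + 1, (gx + 1) * width // grid)
            cell = [segment[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            signature.append(sum(1 for p in cell if p) / len(cell))
    return signature


def train(store: List[Tuple[Signature, str]], segment: Image, label: str) -> None:
    """Add one labelled example to the lookup table."""
    store.append((make_signature(segment), label))


# Hypothetical usage: teach the system one tiny, hand-made 'minus' sign.
lookup_table: List[Tuple[Signature, str]] = []
minus = [
    [0, 0, 0, 0],
    [255, 255, 255, 255],
    [255, 255, 255, 255],
    [0, 0, 0, 0],
]
train(lookup_table, minus, "-")
```

Because the grid has a fixed size, a large ‘7’ and a small ‘7’ produce signatures of the same length, which is what makes them comparable.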
The above step has taken care of training and most of the work of classification, but in order to classify what the user has written we must first segment out the individual characters, i.e. passing in a whole image won’t give us a signature that is comparable with anything in our trained data.
As we are taking input from the screen we could optimise this step, but a generalised approach is used here so that ‘external’ input (i.e. images from the camera etc.) could easily be used.
The first step in segmentation is, normally, to convert the input image to greyscale. Remember that images are made of pixels, each (normally) containing 3 channels of information to describe the colour (Red, Green, and Blue), or 4 if you include alpha. To make the image more manageable we convert it to greyscale so that each pixel is described by 1 channel in the range 0-255.
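To make the conversion concrete, here is a minimal pure-Python sketch. It assumes the image is a list of rows of (R, G, B) or (R, G, B, A) tuples and uses the common 0.299 / 0.587 / 0.114 luminance weighting; the function name `to_greyscale` is made up for this example. In OpenCV itself the same job is done by `cv2.cvtColor`.

```python
from typing import List, Sequence


def to_greyscale(image: List[List[Sequence[int]]]) -> List[List[int]]:
    """Collapse each RGB(A) pixel to a single 0-255 brightness value."""
    grey = []
    for row in image:
        grey_row = []
        for pixel in row:
            r, g, b = pixel[:3]  # ignore the alpha channel if present
            # Weighted sum: the eye is most sensitive to green, least to blue.
            grey_row.append(int(round(0.299 * r + 0.587 * g + 0.114 * b)))
        grey.append(grey_row)
    return grey
```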
Once we have converted it, we threshold the image. Thresholding means looking at each pixel in the image: if it doesn’t meet a certain criterion you set it to 0, otherwise you set it to 255 (essentially creating a binary image with each pixel either on or off).
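A sketch of a fixed-cutoff threshold is shown below. The cutoff of 128 and the `invert` flag are assumptions for illustration: with dark ink on a light background it is convenient to invert, so that the ink ends up ‘on’ (255). OpenCV offers the same behaviour through `cv2.threshold` with `THRESH_BINARY` or `THRESH_BINARY_INV`.

```python
from typing import List


def threshold(grey: List[List[int]], cutoff: int = 128, invert: bool = True) -> List[List[int]]:
    """Turn a greyscale image into a binary one (every pixel 0 or 255)."""
    binary = []
    for row in grey:
        if invert:
            # Dark pixels (ink) become 'on', light pixels (paper) become 'off'.
            binary.append([255 if p <= cutoff else 0 for p in row])
        else:
            binary.append([255 if p > cutoff else 0 for p in row])
    return binary
```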
Now that we have a binary image we run a morphological filter (dilation and erosion) to remove noise and fill in any small holes.
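The two operations can be sketched in a few lines, assuming a 3x3 neighbourhood and ignoring neighbours that fall outside the image. Dilating then eroding fills small holes; eroding then dilating removes small specks. The helper names here are hypothetical; OpenCV provides `cv2.dilate`, `cv2.erode`, and `cv2.morphologyEx` for the same purpose.

```python
from typing import Callable, List

Image = List[List[int]]  # binary image: 0 = off, 255 = on


def _morph(binary: Image, pick: Callable[[List[int]], int]) -> Image:
    """Replace each pixel with pick() of its 3x3 neighbourhood."""
    h, w = len(binary), len(binary[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            neighbours = [
                binary[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
            ]
            out[y][x] = pick(neighbours)
    return out


def dilate(binary: Image) -> Image:
    """A pixel turns on if any neighbour is on (shapes grow)."""
    return _morph(binary, max)


def erode(binary: Image) -> Image:
    """A pixel stays on only if all neighbours are on (shapes shrink)."""
    return _morph(binary, min)


def fill_holes(binary: Image) -> Image:
    """Dilate then erode: closes small gaps inside strokes."""
    return erode(dilate(binary))


def remove_specks(binary: Image) -> Image:
    """Erode then dilate: deletes isolated noise pixels."""
    return dilate(erode(binary))
```

Note that `remove_specks` will also delete strokes thinner than the neighbourhood, so how (and whether) to apply each step depends on the pen width.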
At this stage we are left with an image of separated islands (aka segments), hopefully each belonging to a separate digit/symbol. The last step in our segmentation process is to find all the contours (a contour is a lasso defined by a set of points, each encompassing an individual island). From these we can calculate a bounding box for each island, filter out any that are too small or too large, and clip the remainder to run through our classifier.
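The sketch below shows the same idea without OpenCV. Note that it swaps contour tracing for a simple flood fill over connected ‘on’ pixels, which yields the same bounding boxes for separate islands; in OpenCV you would typically reach for `cv2.findContours` and `cv2.boundingRect` instead. The function names and the area limits are assumptions for illustration.

```python
from collections import deque
from typing import List, Tuple

Image = List[List[int]]          # binary image: 0 = off, 255 = on
Box = Tuple[int, int, int, int]  # (x, y, width, height)


def find_bounding_boxes(binary: Image, min_area: int = 2, max_area: int = 10_000) -> List[Box]:
    """Return one bounding box per island, ordered left to right."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not binary[y][x] or seen[y][x]:
                continue
            # Flood fill this island, tracking its extremes as we go.
            queue = deque([(x, y)])
            seen[y][x] = True
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for ny in range(max(0, cy - 1), min(h, cy + 2)):
                    for nx in range(max(0, cx - 1), min(w, cx + 2)):
                        if binary[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            bw, bh = x1 - x0 + 1, y1 - y0 + 1
            # Drop islands that are too small (noise) or too large.
            if min_area <= bw * bh <= max_area:
                boxes.append((x0, y0, bw, bh))
    return sorted(boxes)


def crop(binary: Image, box: Box) -> Image:
    """Clip one segment out of the image."""
    x, y, bw, bh = box
    return [row[x:x + bw] for row in binary[y:y + bh]]
```

One thing this simple version does not handle is a symbol made of several strokes: the two bars of ‘=’, for example, come out as two separate islands and would need merging before classification.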
Finally we create a signature for each segment (as described above) and compare it against our library of trained data. As with most problems, there is more than one way of finding a match. One of the more basic options is the k-nearest neighbours algorithm (k-NN), which runs through all your training data calculating the distance between each stored signature and the segment’s signature. Once finished it returns the label with the strongest association, along with a level of confidence.
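A minimal k-NN classifier might look like the following. It assumes signatures are equal-length lists of numbers compared by Euclidean distance, and it reports confidence as the share of the k nearest neighbours that voted for the winning label; both are illustrative choices, as is the name `classify`.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def classify(
    signature: Sequence[float],
    training_data: List[Tuple[Sequence[float], str]],
    k: int = 3,
) -> Tuple[str, float]:
    """Return (label, confidence) for the closest matches in the training data."""
    if not training_data:
        raise ValueError("no training data to compare against")
    # Sort every trained example by its distance to the query signature.
    nearest = sorted(training_data, key=lambda entry: math.dist(signature, entry[0]))[:k]
    # Each of the k nearest examples votes for its own label.
    votes = Counter(label for _, label in nearest)
    label, count = votes.most_common(1)[0]
    return label, count / len(nearest)


# Hypothetical usage with made-up two-value signatures.
training = [
    ([0.0, 0.0], "1"),
    ([0.0, 1.0], "1"),
    ([10.0, 10.0], "+"),
    ([10.0, 11.0], "+"),
    ([9.0, 10.0], "+"),
]
label, confidence = classify([10.0, 10.0], training)  # ('+', 1.0)
```

A linear scan like this is fine for a small lookup table, which is where the memory and lookup-time trade-off from training shows up.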
This is of course a bare-bones OCR system, but given the processing power of today’s portable devices and readily available access to remote services, interpreting the user’s handwriting is now a feasible input mechanism. It can reduce barriers to learning and increase user efficiency, so we think it’s definitely worth considering when thinking about how you would like to engage with your users.