This month I started competing in my very first Kaggle competition, Denoising Dirty Documents. I was first introduced to Kaggle a few years ago by Xavier Conort, an insurance industry colleague who also lives here in Singapore. But I had been passive with my Kaggle membership, and hadn’t even considered competing.
This year two things changed. Firstly, I joined IntelliM, an image processing, machine learning, and software development house, and I needed to get out into the real world, make business connections, and start adding value in these fields. Secondly, Kaggle opened the Denoising Dirty Documents competition, which is about pre-processing scanned documents so that they are suitable for optical character recognition; it requires both image processing skills and machine learning skills. So this competition looked like a great match for me, and hopefully an easy way to build some experience within Kaggle.
Although I am an actuary by training, I have not always stayed within the traditional bounds of actuarial work. Back in the 1990s I first started playing with machine learning, using neural networks to predict which customers would renew their insurance policies. Then, inspired by Kim and Nelson’s book, I developed a state space regime switching model for predicting periods of massive builder insolvencies. That model has subsequently been adapted for cancer research, to measure the timing of genes switching off and on. In the 2000s I started getting involved in image processing, firstly creating optical character recognition for a web scraper software package, and later developing COPR, license plate recognition software. Over the past decade I have been using machine learning for customer analytics and insurance pricing.
So I thought that just doing some pre-processing for optical character recognition would be quick and easy. When I looked at the examples (see one example above), my eyes could quickly see what the answer should look like even before I peeked at the example cleaned image. I was so wrong…
Lesson: Avoid Artificial Stupidity
Machine learning is sometimes called artificial intelligence. After all, aren’t neural networks based upon the architecture of the human brain?
My first competition submission was a pure machine learning solution. I modelled the target image one pixel at a time. For predictors, I got the raw pixel brightnesses for a region around each pixel location. This is a brute force approach that I have used in the past for optical character recognition. I figured that the machine learning algorithm would learn what the character strokes looked like, and thereby know which pixels should be background.
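A minimal sketch of this brute-force setup in Python with NumPy (my own illustration, not the competition code): it assumes grayscale images as 2-D arrays with brightness in [0, 1], and the window radius and toy data are purely illustrative.

```python
import numpy as np

def pixel_features(img, radius=2):
    """For each pixel, gather the brightnesses of the (2r+1) x (2r+1)
    neighbourhood around it into one feature row."""
    padded = np.pad(img, radius, mode="edge")
    h, w = img.shape
    cols = []
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            cols.append(padded[dy:dy + h, dx:dx + w].ravel())
    return np.column_stack(cols)  # shape: (h*w, (2r+1)**2)

# Toy stand-ins for a dirty scan and its cleaned target
rng = np.random.default_rng(0)
dirty = rng.random((20, 20))
clean = dirty * 0.8 + 0.1  # target is just a brightness/contrast shift here

# Fit one linear model across all pixels via least squares
X = np.column_stack([pixel_features(dirty), np.ones(dirty.size)])
coef, *_ = np.linalg.lstsq(X, clean.ravel(), rcond=None)
pred = (X @ coef).reshape(dirty.shape)
```

On this toy target the fitted model simply recovers the brightness/contrast mapping (a weight of about 0.8 on the centre pixel and an intercept of about 0.1), which foreshadows exactly what happened in the competition: without informative features, the model learns little more than a tone adjustment.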
What really happened was that the machine learning algorithm simply adjusted the brightness and contrast of the image, to better match the required solution. So I scored 8.58%, giving me 24th ranking, much higher than I was expecting, and much closer to some naive benchmarks than I was comfortable with.
I wanted a top ten placing, but I was a long way away from it. So I fine-tuned the model hyperparameters. This moderately improved the score, and only moved me up 3 ranks. My next competition submission actually scored far worse than my preceding two submissions! I needed to rethink my approach because I was going backwards, and the better submissions were almost an order of magnitude better than mine.
The reason my submission scored so poorly was because I was asking the machine learning model to learn complex interactions between pixels, without any guidance from me. There are heuristics about text images that I intuitively know, but I hadn’t passed on any of that knowledge to the machine learning algorithm, either via predictors or model structure.
My algorithm wasn’t artificially intelligent; it was artificially stupid!
So I stopped making submissions to the competition, and started looking at the raw images and cleaned images, and I applied some common image processing algorithms. I asked myself these questions:
- what is it about the text that is different to the background?
- what are the typical characteristics of text?
- what are the typical characteristics of stains?
- what are the typical characteristics of folded or crinkled paper?
- how does a dark stain differ from dark text?
- what does the output from an image processing algorithm tell me about whether a pixel is text or background?
- what are the shortcomings of a particular image processing algorithm?
- what makes an image processing algorithm drop out some of the text?
- what makes an image processing algorithm think that a stain is text?
- what makes an image processing algorithm think that a paper fold is text?
- which algorithms have opposing types of classification errors?
For example, in the image above, the algorithm thins out the text too much, does not remove the outer edges of stains, and does not remove small stains. That prompted me to think that maybe an edge finding algorithm would complement this algorithm.
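To illustrate what such an edge-based complement could look like, here is a standard 3x3 Sobel gradient filter in pure NumPy (a generic sketch of edge detection, not necessarily the algorithm used in the competition). Crisp text boundaries produce strong gradient responses, while the flat interior of a broad stain gives almost none:

```python
import numpy as np

def sobel_magnitude(img):
    """Gradient magnitude via a 3x3 Sobel filter (pure NumPy)."""
    kx = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=float)
    ky = kx.T
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for i in range(3):
        for j in range(3):
            window = padded[i:i + h, j:j + w]
            gx += kx[i, j] * window
            gy += ky[i, j] * window
    return np.hypot(gx, gy)

# A hard vertical edge (like the boundary of a character stroke)
# lights up strongly; flat regions give zero response.
img = np.zeros((10, 10))
img[:, 5:] = 1.0
mag = sobel_magnitude(img)
```

Feeding edge responses like these in as extra predictors is one way to hand the model a heuristic it cannot easily discover from raw pixel brightnesses alone.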
After a week of experimentation and feature extraction, I finally made a new competition submission, and it jumped me up in the rankings. Then I started fine tuning my model, and split the one all-encompassing machine learning model into multiple specialist models. At the time of writing this blog I am ranked 4th in the competition, and after looking at the scores of the top 3 competitors, I realise that I will have to do more than just fine tune my algorithm. It’s time for me to get back into research mode and find a new feature that identifies the blob stain at the end of the first paragraph in this image:
Kaggle is addictive. I can’t wait to solve this problem!
Hi Colin! Excellent post. I actually just started this competition myself because I’m very interested in these kinds of OCR problems. However, my background has been in web app development so I’m a little rusty when it comes to approaching these problems and where to start. I’m familiar with Python and R so I’m not too concerned about that part, just mostly about the preprocessing steps. For instance, the questions you posed in your blog are great, but I would have no idea how to venture out to find the answers (programmatically). Any guidance would be much appreciated!
Hi Bobby,
Thanks for reading my blog 🙂
If you’d like, I can start to write some blogs that give an introduction to doing machine vision in R.
Colin
That would be fantastic!! You don’t even have to shy away from any mathematics, I tend to be a quick study. In the meantime, do you have any suggestions on any reading material that could get me started in this endeavor?
Hi Bobby,
OK I will start a series of blogs where I take people through the steps to do this Kaggle challenge in R, and I will post some R scripts on Kaggle (most of the existing posted scripts for this challenge are written in Python – have you read through any of them for ideas?).
For background reading I suggest you wade your way through this textbook, which I use http://www.amazon.com/Computer-Machine-Vision-Fourth-Practicalities/dp/0123869080/ref=sr_1_1?ie=UTF8&qid=1438327062&sr=8-1&keywords=computer+and+machine+vision
Colin
I’ll take a closer look at the Python scripts. My first step is to try and get an overview of the workflow needed to get some results from this. For example, how do you know which filters are most appropriate, and after filtering, how do you go about picking which pixels are more important in order to get the most from your machine learning algorithm of choice? Lots of questions lol. I look forward to your post! And I’ll take a stab at that book, thanks for the direction.
Muchas gracias. ¿Cómo puedo iniciar sesión? [Thank you very much. How can I log in?]
Sorry, I only speak English. What login are you asking about?
Thanks a lot Colin for your blog.
It is really helping me a lot.
Please keep writing more about machine learning techniques.
Hi Nawal,
I’m glad that you are finding it helpful 🙂 I’m travelling on business at the moment, but will resume my blog posts next week.
Colin