Keeping Up With The Latest Techniques

~ brief insights

Category Archives: Machine Learning

Tutorial: Sentiment Analysis of Airlines Using the syuzhet Package and Twitter

Sunday, 30 Apr 2017

Posted by Colin Priest in R, Sentiment Analysis, Social Media, Text Mining, Twitter


Tags

R, Sentiment Analysis, Text Mining, Twitter

In my last job, I was a frequent flyer. Each week I flew between 2 or 3 countries, briefly returning for 24 hours on the weekend to get a change of clothes. My favourite airlines were Cathay Pacific, Emirates and Singapore Air. Now, unless you have been living in a cave, you’d be well aware of the recent news story of how United Airlines removed David Dao from an aircraft. I wondered how that incident had affected United’s brand value, and being a data scientist I decided to do sentiment analysis of United versus my favourite airlines.

Way back on 4th July 2015, almost two years ago, I wrote a blog entitled Tutorial: Using R and Twitter to Analyse Consumer Sentiment. Even though that blog post is one of my earliest, it continues to be the most popular, attracting just as many readers per day as when I first wrote it.

Since the sentiment package, upon which that blog was based, is no longer supported by CRAN, and many readers have had problems with the manual and technical process of installing an obsolete package from an archive, I have written a new blog using a different, live CRAN package. The syuzhet package was published only a few weeks ago, and offers a range of different sentiment analysis models. So I’ve started to try it out.

I have collected tweets for 4 airlines:

  1. Cathay Pacific
  2. Emirates
  3. Singapore Air
  4. United Airlines

The tweet data starts at 01-Jan-2015 and goes up to mid-April 2017.

Step 1: Load the tweets and load the relevant packages

library(foreign)
library(syuzhet)
library(lubridate)
library(plyr)
library(ggplot2)
library(tm)
library(wordcloud)

# get the data for the tweets
dataURL = 'https://s3-ap-southeast-1.amazonaws.com/colinpriest/tweets.zip'
if (! file.exists('tweets.zip')) download.file(dataURL, 'tweets.zip')
if (! file.exists('tweets.dbf')) unzip('tweets.zip')
tweets = read.dbf('tweets.dbf', as.is = TRUE)

 

I’ve stored the tweets in a dbf file and zipped it. The zip file is 68MB in size, and the dbf file is 353MB. The code shown above downloads the zip file, extracts the dbf and then reads the dbf file into a data.frame.

Step 2: Do Sentiment Scoring using the syuzhet package


# function to get various sentiment scores, using the syuzhet package
scoreSentiment = function(tab)
{
 tab$syuzhet = get_sentiment(tab$Text, method="syuzhet")
 tab$bing = get_sentiment(tab$Text, method="bing")
 tab$afinn = get_sentiment(tab$Text, method="afinn")
 tab$nrc = get_sentiment(tab$Text, method="nrc")
 # add the eight nrc emotion counts plus the positive and negative counts as columns
 emotions = get_nrc_sentiment(tab$Text)
 n = names(emotions)
 for (nn in n) tab[, nn] = emotions[, nn]
 return(tab)
}

# get the sentiment scores for the tweets
tweets = scoreSentiment(tweets)
tweets = tweets[tweets$TimeStamp < as.Date('19-04-2017', format = '%d-%m-%Y'),]

The syuzhet package offers a few different algorithms, each taking a different approach to sentiment scoring. It also does emotion scoring based upon the nrc algorithm. The code above calculates scores using the syuzhet, bing, afinn and nrc algorithms, adding columns with the scores from each algorithm.
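To get a feel for how the methods differ, here is a minimal sketch (using a made-up example sentence rather than one of the collected tweets) showing the four scoring calls plus the nrc emotion scores:

# score a single example sentence with each of the four methods
example = "the flight was delayed for three hours but the crew were wonderful"
get_sentiment(example, method = "syuzhet")
get_sentiment(example, method = "bing")
get_sentiment(example, method = "afinn")
get_sentiment(example, method = "nrc")
# the nrc emotion counts: anger, anticipation, disgust, fear, joy, sadness, surprise, trust, plus negative and positive
get_nrc_sentiment(example)

Each method uses a different lexicon and weighting, so the absolute scores are not directly comparable between methods; the comparison in Step 4 below looks at the rankings instead.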

Step 3: Visualise the Sentiment Scores


# function to find the week in which a date occurs
round_weeks <- function(x)
{
require(data.table)
dt = data.table(i = 1:length(x), day = x, weekday = weekdays(x))
offset = data.table(weekday = c('Sunday', 'Monday', 'Tuesday', 'Wednesday',
'Thursday', 'Friday', 'Saturday'),
offset = -(0:6))
dt = merge(dt, offset, by="weekday")
dt[ , day_adj := day + offset]
setkey(dt, i)
return(dt[ , day_adj])
}
# get daily summaries of the results
daily = ddply(tweets, ~ Airline + TimeStamp, summarize, num_tweets = length(positive), ave_sentiment = mean(bing),
 ave_negative = mean(negative), ave_positive = mean(positive), ave_anger = mean(anger))

# plot the daily sentiment
ggplot(daily, aes(x=TimeStamp, y=ave_sentiment, colour=Airline)) + geom_line() +
 ggtitle("Airline Sentiment") + xlab("Date") + ylab("Sentiment") + scale_x_date(date_labels = '%d-%b-%y')

# assign each tweet to the week in which it occurs
tweets$week = round_weeks(tweets$TimeStamp)

# get weekly summaries of the results
weekly = ddply(tweets, ~ Airline + week, summarize, num_tweets = length(positive), ave_sentiment = mean(bing),
 ave_negative = mean(negative), ave_positive = mean(positive), ave_anger = mean(anger))

# plot the weekly sentiment
ggplot(weekly, aes(x=week, y=ave_sentiment, colour=Airline)) + geom_line() +
 ggtitle("Airline Sentiment") + xlab("Date") + ylab("Sentiment") + scale_x_date(date_labels = '%d-%b-%y')

The code above summarises the sentiment for each airline across time. The first plot shows the daily sentiment values for each airline:

20170429 plot 01 daily sentiment

Based upon the bing sentiment algorithm, United has the poorest sentiment, and Singapore has the best sentiment. United usually has negative sentiment. Day-to-day random fluctuations in sentiment make this a cluttered graph, so I decided to summarise the sentiment weekly instead of daily:

20170429 plot 02 weekly sentiment

Now it’s easier to see the differences in sentiment between the four airlines. While Emirates and Cathay Pacific have similar levels of sentiment, the values for Emirates are more stable. This, however, may be due to the sheer volume of tweets about Emirates versus the smaller number of tweets about Cathay Pacific.
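A quick way to check the relative volumes, assuming the tweets data.frame loaded in Step 1, is to tabulate the number of tweets collected for each airline:

# count the number of tweets collected for each airline
table(tweets$Airline)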

20170429 plot 03 positive sentiment

Step 4: Compare the Sentiment Algorithms

The sentiment scores above use the bing algorithm, but we should check whether the different algorithms produce different results.


# compare the sentiment for across the algorithms
algorithms = tweets[rep(1, nrow(tweets) * 4), c("week", "syuzhet", "Airline", "Airline")]
names(algorithms) = c("TimeStamp", "Sentiment", "Algorithm", "Airline")
algorithms$Algorithm = "syuzhet"
algorithms[seq_len(nrow(tweets)), c("TimeStamp", "Sentiment", "Airline")] = tweets[,c("TimeStamp", "syuzhet", "Airline")]
algorithms[nrow(tweets) + seq_len(nrow(tweets)), c("TimeStamp", "Sentiment", "Airline")] = tweets[,c("TimeStamp", "bing", "Airline")]
algorithms$Algorithm[nrow(tweets) + seq_len(nrow(tweets))] = "bing"
algorithms[2 * nrow(tweets) + seq_len(nrow(tweets)), c("TimeStamp", "Sentiment", "Airline")] = tweets[,c("TimeStamp", "afinn", "Airline")]
algorithms$Algorithm[2 * nrow(tweets) + seq_len(nrow(tweets))] = "afinn"
algorithms[3 * nrow(tweets) + seq_len(nrow(tweets)), c("TimeStamp", "Sentiment", "Airline")] = tweets[,c("TimeStamp", "nrc", "Airline")]
algorithms$Algorithm[3 * nrow(tweets) + seq_len(nrow(tweets))] = "nrc"

# get the algorithm averages for each airline
averages = ddply(algorithms, ~ Airline + Algorithm, summarize, ave_sentiment = mean(Sentiment))
averages$ranking = 1
for (alg in c("syuzhet", "bing", "afinn", "nrc")) averages$ranking[averages$Algorithm == alg] = 5 - rank(averages$ave_sentiment[averages$Algorithm == alg])
averages = averages[order(averages$Airline, averages$Algorithm), ]

The code above was a bit clumsy – I probably should have used reshape.
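For example, a minimal reshape sketch (assuming the reshape2 package is installed) that produces the same long-format table in a single call would be:

# melt the four algorithm scores into a long table: one row per tweet per algorithm
library(reshape2)
algorithms = melt(tweets[, c("TimeStamp", "Airline", "syuzhet", "bing", "afinn", "nrc")],
                  id.vars = c("TimeStamp", "Airline"),
                  variable.name = "Algorithm", value.name = "Sentiment")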

20170429 plot 08 sentiment algorithm comparisons

The different algorithms give similar rankings between the airlines, with one big exception: the nrc algorithm is surprisingly positive about United and unusually negative about Singapore Air compared to the other algorithms. This goes to show that sentiment analysis isn’t just a plug-and-play technique. It also means that a warning applies to the emotion analysis shown in Step 5 below, because it is based upon the nrc algorithm!

Step 5: Emotion Analysis

Noting the warning from the previous section, let’s compare the emotions between the airlines and between tweets, using the nrc algorithm.


ggplot(weekly, aes(x=week, y=ave_negative, colour=Airline)) + geom_line() +
ggtitle("Airline Sentiment (Negative Only)") + xlab("Date") + ylab("Sentiment") + scale_x_date(date_labels = '%d-%b-%y')

ggplot(weekly, aes(x=week, y=ave_positive, colour=Airline)) + geom_line() +
ggtitle("Airline Sentiment (Positive Only)") + xlab("Date") + ylab("Sentiment") + scale_x_date(date_labels = '%d-%b-%y')

ggplot(weekly, aes(x=week, y=ave_anger, colour=Airline)) + geom_line() +
ggtitle("Airline Sentiment (Anger Only)") + xlab("Date") + ylab("Sentiment") + scale_x_date(date_labels = '%d-%b-%y')

# function to make the text suitable for analysis
clean.text = function(x)
{
# tolower
x = tolower(x)
# remove the retweet marker "rt" (as a whole word only)
x = gsub("\\brt\\b", "", x)
# remove at
x = gsub("@\\w+", "", x)
# remove punctuation
x = gsub("[[:punct:]]", "", x)
# remove numbers
x = gsub("[[:digit:]]", "", x)
# remove links http
x = gsub("http\\w+", "", x)
# remove tabs
x = gsub("[ |\t]{2,}", "", x)
# remove blank spaces at the beginning
x = gsub("^ ", "", x)
# remove blank spaces at the end
x = gsub(" $", "", x)
return(x)
}

# emotion analysis: anger, anticipation, disgust, fear, joy, sadness, surprise, trust
# put everything in a single vector
all = c(
paste(tweets$Text[tweets$anger > 0], collapse=" "),
paste(tweets$Text[tweets$anticipation > 0], collapse=" "),
paste(tweets$Text[tweets$disgust > 0], collapse=" "),
paste(tweets$Text[tweets$fear > 0], collapse=" "),
paste(tweets$Text[tweets$joy > 0], collapse=" "),
paste(tweets$Text[tweets$sadness > 0], collapse=" "),
paste(tweets$Text[tweets$surprise > 0], collapse=" "),
paste(tweets$Text[tweets$trust > 0], collapse=" ")
)
# clean the text
all = clean.text(all)
# remove stop-words
# adding extra domain specific stop words
all = removeWords(all, c(stopwords("english"), 'singapore', 'singaporeair',
'emirates', 'united', 'airlines', 'unitedairlines',
'cathay', 'pacific', 'cathaypacific', 'airline',
'airlinesunited', 'emiratesemirates', 'pacifics'))
#
# create corpus
corpus = Corpus(VectorSource(all))
#
# create term-document matrix
tdm = TermDocumentMatrix(corpus)
#
# convert as matrix
tdm = as.matrix(tdm)
#
# add column names
colnames(tdm) = c('anger', 'anticipation', 'disgust', 'fear', 'joy', 'sadness', 'surprise', 'trust')
#
# Plot comparison wordcloud
layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(0, 4))
plot.new()
text(x=0.5, y=0.5, 'Emotion Comparison Word Cloud')
comparison.cloud(tdm, random.order=FALSE,
colors = c("#00B2FF", "red", "#FF0099", "#6600CC", "green", "orange", "blue", "brown"),
title.size=1.5, max.words=250)


The code above plots the emotions across time for each airline.

20170429 plot 05 anger
20170429 plot 03 positive sentiment
20170429 plot 04 negative sentiment

United Airlines attracts more angry tweets, and this has spiked in April 2017 following the David Dao incident. But United Airlines also attracts more positive tweets than the other airlines. This might explain the ranking differences between the algorithms – maybe the algorithms weight positive tweets differently to negative tweets.
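One rough way to check that idea would be to split the tweets by whether they contain any nrc positive or negative words (using the emotion columns already added in Step 2) and compare each algorithm's average score within each group, for example:

# split tweets by whether they contain any nrc positive / negative words,
# then compare the average score from each algorithm within each group
tweets$hasPositive = tweets$positive > 0
tweets$hasNegative = tweets$negative > 0
ddply(tweets, ~ hasPositive + hasNegative, summarize,
      syuzhet = mean(syuzhet), bing = mean(bing), afinn = mean(afinn), nrc = mean(nrc))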

Then the code creates a comparison word cloud, to show the different words in airline tweets that are associated with each emotion.

20170429 plot 09 emotions

Step 6: Compare the Different Tweeting Behaviour of Different Twitter Users

Are some users more positive than others? Is this user behaviour different between the airlines? Do people who tweet more have a different sentiment to those who tweet about airlines less frequently? Are particular users dragging the average up or down? To answer these questions, I have tracked the 100 users who tweeted the most about these airlines.


# get the user summaries of the results
users = ddply(tweets, ~ Airline + UserName, summarize, num_tweets = length(positive), ave_sentiment = mean(bing),
ave_negative = mean(negative), ave_positive = mean(positive), ave_anger = mean(anger))
sizeSentiment = ddply(users, ~ num_tweets, summarize, ave_sentiment = mean(ave_sentiment),
ave_negative = mean(ave_negative), ave_positive = mean(ave_positive), ave_anger = mean(ave_anger))
sizeSentiment$num_tweets = as.numeric(sizeSentiment$num_tweets)

# plot users positive versus negative with bubble plot
cutoff = sort(users$num_tweets, decreasing = TRUE)[100]
ggplot(users[users$num_tweets > cutoff,], aes(x = ave_positive, y = ave_negative, size = num_tweets, fill = Airline)) +
geom_point(shape = 21) +
ggtitle("100 Most Prolific Tweeters About Airlines") +
labs(x = "Positive Sentiment", y = "Negative Sentiment")
#
ggplot(sizeSentiment, aes(x = num_tweets, y = ave_sentiment)) + geom_point() + stat_smooth(method = "loess", size = 1, span = 0.35) +
ggtitle("Number of Tweets versus Sentiment") + scale_x_log10() +
labs(x = "Number of Tweets", y = "Average Sentiment")

Firstly let’s look at the behaviour of individual users:

20170429 plot 06 top 100 tweeters

Top user sentiment is quite different by airline. Emirates has a number of frequent tweeters who are unemotional, on average posting neither positive nor negative sentiment. United Airlines attracts more emotional posts. Singapore Air and Cathay Pacific each have a few prolific users who post a large share of the tweets about them.

20170429 plot 07 sentiment versus tweet count

However, on average, frequent tweeters post a similar balance of positive and negative content to users who tweet infrequently.

Step 7: Compare The Words Used to Describe Each Airline

In order to explain the differences in sentiment, we can create a word cloud that contrasts the words used in posts about each airline.


# Join texts in a vector for each company
txt1 = paste(tweets$Text[tweets$Airline == 'United'], collapse=" ")
txt2 = paste(tweets$Text[tweets$Airline == 'SingaporeAir'], collapse=" ")
txt3 = paste(tweets$Text[tweets$Airline == 'Emirates'], collapse=" ")
txt4 = paste(tweets$Text[tweets$Airline == 'Cathay Pacific'], collapse=" ")
#
# put everything in a single vector
all = c(clean.text(txt1), clean.text(txt2), clean.text(txt3), clean.text(txt4))
#
# remove stop-words
# adding extra domain specific stop words
all = removeWords(all, c(stopwords("english"), 'singapore', 'singaporeair',
'emirates', 'united', 'airlines', 'unitedairlines',
'cathay', 'pacific', 'cathaypacific', 'airline',
'airlinesunited', 'emiratesemirates', 'pacifics'))
#
# create corpus
corpus = Corpus(VectorSource(all))
#
# create term-document matrix
tdm = TermDocumentMatrix(corpus)
#
# convert as matrix
tdm = as.matrix(tdm)
#
# add column names
colnames(tdm) = c('United', 'Singapore Air', 'Emirates', 'Cathay Pacific')
#
# Plot comparison wordcloud
layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(0, 4))
plot.new()
text(x=0.5, y=0.5, 'Word Comparison by Airline')
comparison.cloud(tdm, random.order=FALSE,
colors = c("#00B2FF", "red", "#FF0099", "#6600CC"),
title.size=1.5, max.words=250)
#
# Plot commonality cloud
layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(0, 4))
plot.new()
text(x=0.5, y=0.5, 'Word Commonality by Airline')
commonality.cloud(tdm, random.order=FALSE,
colors = brewer.pal(8, "Dark2"),
title.size=1.5, max.words=250)

The code above is quite similar to that in the previous step, except that this time we are comparing airlines instead of emotions.

20170429 plot 10 airline word contrast
20170429 plot 11 airline word commonality

Emirates includes “aniston”, presumably in reference to the marketing campaign involving Jennifer Aniston, while United includes “CEO” due to a number of news stories about United’s CEOs, including a resignation and a heart transplant.

 

 


Second Annual Data Science Bowl – Part 3 – Automatically Finding the Heart Location in an MRI Image

Tuesday, 08 Mar 2016

Posted by Colin Priest in Automation, Convolutional Neural Networks, Deep Learning, Image Processing, Kaggle, Machine Learning, Medical Imaging, Python


Tags

Automation, Deep Learning, Image Processing, Kaggle, Machine Learning, Medical Imaging

My last blog wasn’t so sexy, what with all the data cleansing, and no predictive modelling. But in this blog I do something really cool – I train a machine learning model to find the left ventricle of the heart in an MRI image. And I couldn’t have done it without all of that boring data cleansing. #kaggle @kaggle

Aside from being a really cool thing to do, there is a purpose to this modelling. I want to find the boundaries of the heart chamber, and that is much easier and faster to do when I remove distractions. Once I have found the location of the heart chamber, I can crop the image to a much smaller square.

The input to the model will be a set of images. In order to simplify what the model learns, I only gave it training images from sax locations near the centre of the heart.

20160308-image01
20160308-image02
20160308-image03
20160308-image04

The output from the model will be the row number and column number of the centroid of the left ventricle heart chamber (the red dot in the images above).

I had to manually define those centroid locations for a training set of a few hundred of the images. This was laborious and time consuming, even after I automated some of the process. But it needed to be done, because otherwise the machine learning algorithm has no way of knowing what the true answers should be.

Even though I am much more comfortable coding in R than in Python, I used Python for this step because I wanted to use Daniel Nouri‘s nolearn library, which sits on top of lasagne and theano, and these libraries are not available in R. The convolutional neural network architecture was based upon the architecture in Daniel Nouri’s tutorial for the Facial Keypoints Detection competition in Kaggle.

Step 1: Importing of all the Required Libraries

OK, so this part isn’t all that sexy either. But it’s the engine for all the cool modelling that is about to be done.


import numpy as np
import csv
import random
import math
import os
import sys  # needed by printQ below for sys.stdout.flush()
import cv2
import itertools
import matplotlib.pyplot as plt
import pandas as pd

from lasagne import layers
from lasagne.updates import nesterov_momentum
from lasagne.nonlinearities import softmax
from lasagne.nonlinearities import sigmoid
from nolearn.lasagne import BatchIterator
from nolearn.lasagne import NeuralNet
from nolearn.lasagne import TrainSplit
from nolearn.lasagne import PrintLayerInfo
from nolearn.lasagne.visualize import plot_loss
from nolearn.lasagne.visualize import plot_conv_weights
from nolearn.lasagne.visualize import plot_conv_activity
from nolearn.lasagne.visualize import plot_occlusion

%pylab inline
from lasagne.layers import DenseLayer
from lasagne.layers import InputLayer
from lasagne.layers import DropoutLayer
from lasagne.layers import Conv2DLayer
from lasagne.layers import MaxPool2DLayer
from lasagne.nonlinearities import softmax
from lasagne.updates import adam
from lasagne.layers import get_all_params
from nolearn.lasagne import NeuralNet
from nolearn.lasagne import TrainSplit
from nolearn.lasagne import objective

import theano
import theano.tensor as T

 

Step 2: Defining the Helper Functions

I used jupyter notebook as the development environment to set up and run my Python scripts. While there’s a lot to like about jupyter, one thing that annoys me is that print commands run in jupyter don’t immediately show text on the screen. But here’s a trick to work around that:


def printQ(s):
 print(s)
 sys.stdout.flush()

Using this helper function instead of the print function results in text immediately appearing in the output. This is particularly helpful for progress messages on long training runs.

I like to use R’s expand.grid function, but it isn’t built in to Python. So I wrote my own helper function in Python that mimics the functionality:


def product2(*args, repeat=1):
 # product('ABCD', 'xy') --> Ax Ay Bx By Cx Cy Dx Dy
 # product(range(2), repeat=3) --> 000 001 010 011 100 101 110 111
 pools = [tuple(pool) for pool in args] * repeat
 result = [[]]
 for pool in pools:
   result = [x + [y] for x in result for y in pool]
 for prod in result:
   yield tuple(prod)

def expand_grid(dictionary):
 return pd.DataFrame([row for row in product2(*dictionary.values())],
 columns=dictionary.keys())

This next function reads a cleaned up image, checking that it has the correct aspect ratio, then resizing it to 96 x 96 pixels. The resizing is done to reduce memory usage in my GPU, and to speed up the training and scoring.


def load_image(path):
 # check that the file exists
 if not os.path.isfile(path):
   printQ('BAD PATH: ' + path)
 image = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
 # ensure portrait aspect ratio
 if (image.shape[0] < image.shape[1]):
   image = cv2.transpose(image)
 # resize to 96x96
 resized_image = cv2.resize(image, (96, 96))
 # check that the image isn't empty
 s = sum(sum(resized_image))
 if np.isnan(s):
   print(path)
 return resized_image

My initial models were not performing as well as I would like. So to force the model to generalise, I added some image transformations (rotation and reflection) to the training data. This required helper functions:


def rotate_image(img, degrees):
 rows,cols = img.shape
 M = cv2.getRotationMatrix2D((cols/2,rows/2),degrees,1)
 dst = cv2.warpAffine(img,M,(cols,rows))
 return dst

def transform_image(img, equalise, gammaAdjust, reflection, rotation):
 if equalise == 1:
   img = cv2.equalizeHist(img)
 if gammaAdjust != 1:
   img = pow(img / 255.0, gammaAdjust) * 255.0
 if reflection == 0 or reflection == 1:
   img = cv2.flip(img, int(reflection))
 if rotation != 0:
   img = rotate_image(img, rotation)
 return img

def transform_xy(x, y, equalise, gammaAdjust, reflection, rotation):
 # if equalise, then no change to x and y
 # if gamma adjustment, then no change to x and y
 # reflection
 if reflection == 0:
   y = 1.0 - y
 if reflection == 1:
   x = 1.0 - x
 if rotation == 180:
   x = 1.0 - x
   y = 1.0 - y
 if rotation != 0 and rotation != 180:
   x1 = x - 0.5
   y1 = y - 0.5
   theta = rotation / 180 * pi
   x2 = x1 * cos(theta) + y1 * sin(theta)
   y2 = -x1 * sin(theta) + y1 * cos(theta)
   x = x2 + 0.5
   y = y2 + 0.5
 return numpy.array([x, y])

# set up the image adjustments
#equalise, gammaAdjust, reflection
dAdjustShort = { 'equalise': [0],
 'gammaAdjust': [1],
 'reflection': [-1],
 'rotation': [0]}
dAdjust = { 'equalise': [0, 1],
 'gammaAdjust': [1, 0.75, 1.5],
 'reflection': [-1, 0, 1],
 'rotation': [0, 180, 3, -3]}
# 'rotation': [0, 180, 3, -3, 10, -10]}
imgAdj = expand_grid(dAdjust)
#imgAdj = expand_grid(dAdjustShort)
print (imgAdj.shape)
# can't have both reflection AND rotation by 180 degrees
imgAdj = imgAdj.query('not (reflection > -1 and rotation == 180)')
# can't have equalise AND gamma adjust
imgAdj = imgAdj.query('not (equalise == 1 and gammaAdjust != 1)')

There are two feature sets that enable the machine learning model to find the heart:

  1. shape in the image – by looking at shape in the image it can find the heart
  2. movement between images at different points of time – the heart is moving but most of the chest is stationary

So I set up the network architecture to use two channels. The first channel is the image, and the second channel is the difference between the image at this point in time and the image several time periods later (plusTime in the code below).


def load_train_set():
 #numTimes = 8 # how many time periods to use
 plusTime = 5 # gaps between time periods for comparison
 xs = []
 ys = []
 ids = []
 all_y = '/home/colin/data/Second-Annual-Data-Science-Bowl/working/centroids-20160218-R.csv'
 with open(all_y) as f:
   rows = csv.reader(f, delimiter=',', quotechar='"')
   iRow = 1
   for line in rows:
     # first line is column headers, so ignore it
     if iRow > 1:
       # parse the line
       patient = line[0]
       x = float(line[1])
       y = float(line[2])
       sax = int(line[4])
       firstTime = int(line[5])
       if int(patient) % 25 == 0 and firstTime == 1:
         printQ(patient)
       # enhance the training data with rotations, reflections, NOT histogram equalisation
       for index, row in imgAdj.iterrows():
         # append the target values
         xy = transform_xy(x, y, row['equalise'], row['gammaAdjust'], row['reflection'], row['rotation'])
         ys.append(xy.astype('float32').reshape((1, 2)))
         #
         # read the images
         folder = '/home/colin/data/Second-Annual-Data-Science-Bowl/train-cleaned/' + patient + '/study/sax_0' + str(sax) + '/'
         xm = np.zeros([1, 2, 96, 96])
         #
         #
         # current frame
         path = folder + 'image-' + ('%0*d' % (2, firstTime)) + '.png'
         img = load_image(path)
         # transform the image - rotation, reflection etc
         img = transform_image(img, row['equalise'], row['gammaAdjust'], row['reflection'], row['rotation'])
         # get the pixels into the range [-1, 1]
         img = (img / 128.0 - 1.0).astype('float32').reshape((96, 96))
         xm[0, 0, :, :] = img
         #
         #
         # find movement of current frame to future frame
         path = folder + 'image-' + ('%0*d' % (2, firstTime + plusTime)) + '.png'
         img = load_image(path)
         # transform the image - rotation, reflection etc
         img = transform_image(img, row['equalise'], row['gammaAdjust'], row['reflection'], row['rotation'])
         # get the pixels into the range [-1, 1]
         img = (img / 128.0 - 1.0).astype('float32').reshape((96, 96))
         # first time is the complete image at time 1
         # subsequent frames are the differences between frames
         xm[0, 1, :, :] = img - xm[0, 0, :, :]
         xs.append(xm.astype('float32').reshape((1, 2, 96, 96)))
         ids.append(patient)

     iRow = iRow + 1
 return np.vstack(xs), np.vstack(ys), np.vstack(ids)

I used early stopping to help reduce overfitting:


class EarlyStopping(object):
 def __init__(self, patience=100):
   self.patience = patience
   self.best_valid = np.inf
   self.best_valid_epoch = 0
   self.best_weights = None

 def __call__(self, nn, train_history):
   current_valid = train_history[-1]['valid_loss']
   current_epoch = train_history[-1]['epoch']
   if current_valid < self.best_valid:
     self.best_valid = current_valid
     self.best_valid_epoch = current_epoch
     self.best_weights = nn.get_all_params_values()
   elif self.best_valid_epoch + self.patience < current_epoch:
     print('Early stopping')
     print('Best valid loss was {:.6f} at epoch {}.'.format(
       self.best_valid, self.best_valid_epoch))
     nn.load_params_from(self.best_weights)
     raise StopIteration()

Step 3: Read the Training Data

As well as reading the training data, I shuffled the order of the data. This allowed me to use batch training.


# read the training data

printQ('reading the training data')
train_x, train_y, train_id = load_train_set()

printQ ('shuffling training rows')
np.random.seed(1234)
# permute the row order (a true shuffle: each training row is used exactly once)
rows = np.random.permutation(train_x.shape[0])
t_x = train_x[rows,:,:,:]
t_y = train_y[rows,:]
t_id = train_id[rows]

printQ('finished')

Step 4: Train the Model

The network architecture used deep convolutional layers to find features in the image, then fully connected layers to convert these features into the centroid location:

20160308-image05


# fit the models

# set up the model
printQ ('setting up the model structure')
layers0 = [
 # layer dealing with the input data
 (InputLayer, {'shape': (None, 2, 96, 96)}),

# first stage of our convolutional layers
 (Conv2DLayer, {'num_filters': 32, 'filter_size': 5}),
 #(DropoutLayer, {'p': 0.2}),
 (Conv2DLayer, {'num_filters': 32, 'filter_size': 3}),
 #(DropoutLayer, {'p': 0.2}),
 (Conv2DLayer, {'num_filters': 32, 'filter_size': 3}),
 #(DropoutLayer, {'p': 0.2}),
 (Conv2DLayer, {'num_filters': 32, 'filter_size': 3}),
 #(DropoutLayer, {'p': 0.2}),
 (Conv2DLayer, {'num_filters': 32, 'filter_size': 3}),
 (MaxPool2DLayer, {'pool_size': 2}),
 (DropoutLayer, {'p': 0.2}),

# second stage of our convolutional layers
 (Conv2DLayer, {'num_filters': 64, 'filter_size': 3}),
 #(DropoutLayer, {'p': 0.3}),
 (Conv2DLayer, {'num_filters': 64, 'filter_size': 3}),
 #(DropoutLayer, {'p': 0.3}),
 (Conv2DLayer, {'num_filters': 64, 'filter_size': 3}),
 (MaxPool2DLayer, {'pool_size': 2}),
 (DropoutLayer, {'p': 0.3}),

# two dense layers with dropout
 (DenseLayer, {'num_units': 128}),
 (DropoutLayer, {'p': 0.5}),
 (DenseLayer, {'num_units': 128}),

# the output layer
 (DenseLayer, {'num_units': 2, 'nonlinearity': sigmoid}),
]

printQ ('creating and training the networks architectures')
numNets = 1
NNs = list()
for iNet in arange(numNets):
 nn = NeuralNet(
 layers = layers0,
 max_epochs = 2000,
 update=adam,
 update_learning_rate=0.0002,
 regression=True, # flag to indicate we're dealing with regression problem
 batch_iterator_train=BatchIterator(batch_size=100),
 on_epoch_finished=[EarlyStopping(patience=10),],
 train_split=TrainSplit(eval_size=0.25),
 verbose=1,
 )
 result = nn.fit(t_x, t_y)
 NNs.append(nn)

printQ('finished')

Based upon how quickly the training converged, the network could possibly have been simplified, reducing the number of layers, or using fewer neurons in the fully connected layers. But I didn’t have time to experiment with different architectures.

20160308-image06

The GPU quickly ran out of RAM unless I used the batch iterator. I found the batch size via trial and error. Large batch sizes caused the GPU to run out of RAM. Small batch sizes ran much slower.

Step 5: Review the Training Errors

Just like humans, all models make mistakes. The heart chamber segmentation algorithms I used later in this project were sensitive to how well the heart chamber was centred in the image. But as long as the model output was a centroid that was inside the heart chamber, things usually went OK. Early versions of my model made mistakes that placed the centroid outside the heart chamber, sometimes even far away from the heart. Tweaks to the training data (especially enhancing the data with rotation and reflection) and the architecture (especially dropout layers) improved the performance.


def getHeartLocation(trainX):
  # get the heart locations from each network
  heartLocs = zeros(numNets * trainX.shape[0] * 2).reshape((numNets, trainX.shape[0], 2))
  for j in arange(numNets):
    nn = NNs[j]
    heartLocs[j, :, :] = nn.predict(trainX)

  # use median as an ensembler
  heartLocsMedian = zeros(trainX.shape[0] * 2).reshape((trainX.shape[0], 2))
  heartLocsMedian[:,0] = median(heartLocs[:,:,0], axis = 0)
  heartLocsMedian[:,1] = median(heartLocs[:,:,1], axis = 0)

  # use a 'max distance from centre' ensembler
  heartLocsDist = zeros(trainX.shape[0] * 2).reshape((trainX.shape[0], 2))
  distance = abs(heartLocs - 0.5)
  am0 = distance[:,:,0].argmax(0)
  am1 = distance[:,:,1].argmax(0)
  heartLocsDist[:,0] = heartLocs[am0, arange(trainX.shape[0]), 0]
  heartLocsDist[:,1] = heartLocs[am1, arange(trainX.shape[0]), 1]

  # combine the two using an arithmetic average
  heartLocations = 0.5 * heartLocsMedian + 0.5 * heartLocsDist

  return heartLocations

heartLocations = getHeartLocation(train_x)

# review the training errors to check for model improvements
def plot_sample(x, y, predicted, axis):
  img = x[0, :, :].reshape(96, 96)
  axis.imshow(img, cmap='gray')
  axis.scatter(y[0::2] * 96, y[1::2] * 96, marker='x', s=10)
  axis.scatter(predicted[0::2] * 96, predicted[1::2] * 96, marker='x', s=10, color='red')

nTrain = train_x.shape[0]
errors = np.zeros(nTrain)

for i in arange(0, nTrain):
  errors[i] = sqrt( square(heartLocations[i, 0] - train_y[i, 0]) + square(heartLocations[i, 1] - train_y[i, 1]) )

print('Prob(error > 0.05): ' + str(mean(errors > 0.05)))
print('Mean: ' + str(mean(errors)))
print('Percentiles: ' + str(percentile(errors, [50, 75, 90, 95, 99, 100])))

for i in arange(0, nTrain):
  error = sqrt( square(heartLocations[i, 0] - train_y[i, 0]) + square(heartLocations[i, 1] - train_y[i, 1]) )
  if (error > 0.04):
    if train_id[i] != train_id[i-1]:
      #print(i)
      print(train_id[i]) # only errors on the original images - not the altered images
      fig = pyplot.figure(figsize=(6, 3))
      ax = fig.add_subplot(1, 2, 1, xticks=[], yticks=[])
      plot_sample(train_x[i,:,:,:], train_y[i, :], heartLocations[i, :], ax)
      pyplot.show()

print('error review completed')

20160308-image07

After many failed models, I was excited when the two worst training errors were still close to the centre of the heart chamber 🙂

Step 6: Find the Left Ventricle Locations for the Submission Data

The main point of building a heart finder machine learning model is to automate the process of finding the left ventricle in the test images that will be used as part of the competition submission. These are images that the model has never seen before.


def load_submission_set():
  numTimes = 2 # how many time periods to use
  plusTime = 5
  xs = []
  ys = []
  ids = []
  paths = []
  times = []
  saxes = []
  all_y = '/home/colin/data/Second-Annual-Data-Science-Bowl/working/centroids-submission-R.csv'
  with open(all_y) as f:
    rows = csv.reader(f, delimiter=',', quotechar='"')
    iRow = 1
    for line in rows:
      # first line is column headers
      if iRow > 1:
        # parse the line
        patient = line[0]
        x = 0
        y = 0
        sax = int(line[1])
        firstTime = int(line[2])
        # save the targets
        xy = np.asarray([x, y])
        ys.append(xy.astype('float32').reshape((1, 2)))
        # read the images
        folder = '/home/colin/data/Second-Annual-Data-Science-Bowl/validate-cleaned/' + patient + '/study/sax_0' + str(sax) + '/'
        xm = np.zeros([1, 2, 96, 96])
        #
        #
        # current frame
        path0 = folder + 'image-' + ('%0*d' % (2, firstTime)) + '.png'
        img = load_image(path0)
        # transform the image - rotation, reflection etc
        #img = transform_image(img, row['equalise'], row['gammaAdjust'], row['reflection'], row['rotation'])
        # get the pixels into the range [-1, 1]
        img = (img / 128.0 - 1.0).astype('float32').reshape((96, 96))
        xm[0, 0, :, :] = img
        #
        #
        # find movement of current frame to future frame
        path5 = folder + 'image-' + ('%0*d' % (2, firstTime + plusTime)) + '.png'
        img = load_image(path5)
        # transform the image - rotation, reflection etc
        #img = transform_image(img, row['equalise'], row['gammaAdjust'], row['reflection'], row['rotation'])
        # get the pixels into the range [-1, 1]
        img = (img / 128.0 - 1.0).astype('float32').reshape((96, 96))
        # first time is the complete image at time 1
        # subsequent frames are the differences between frames
        xm[0, 1, :, :] = img - xm[0, 0, :, :]
        xs.append(xm.astype('float32').reshape((1, numTimes, 96, 96)))
        ids.append(patient)
        paths.append(path0)
        times.append(firstTime)
        saxes.append(sax)
      iRow = iRow + 1
  return np.vstack(xs), np.vstack(ids), np.vstack(paths), np.vstack(times), np.vstack(saxes)

printQ('reading the submission data')
test_x, test_ids, test_paths, test_times, test_sax = load_submission_set()

printQ('getting the predictions')
predicted_y = getHeartLocation(test_x)

printQ('creating the output table')
fullIDs = []
fullPaths = []
fullX = []
fullY = []
fullSax = []
fullTime = []
nTest = test_x.shape[0]
iTime = 1
for i in arange(0, nTest):
  patient = (test_ids[i])[0]
  sax = int((test_sax[i])[0])
  path = (test_paths[i])[0]
  iTime = int((test_times[i])[0])
  fullIDs.append(patient)
  fullPaths.append(path)
  fullX.append(predicted_y[i, 0])
  fullY.append(predicted_y[i, 1])
  fullSax.append(sax)
  fullTime.append(iTime)
  outPath = '/home/colin/data/Second-Annual-Data-Science-Bowl/predicted-heart-location-submission/'
  outPath = outPath + patient
  outPath = outPath + '-' + str(sax)
  outPath = outPath + '-' + str(iTime) + '.png'
  img = load_image256x192(path)
  x = int(round(predicted_y[i, 0] * 192))
  y = int(round(predicted_y[i, 1] * 256))
  img[y, x] = 255
  img[y-1, x-1] = 255
  img[y-1, x+1] = 255
  img[y+1, x-1] = 255
  img[y+1, x+1] = 255
  write_image(img, outPath)

fullIDs = array(fullIDs)
fullPaths = array(fullPaths)
fullX = array(fullX)
fullY = array(fullY)
fullSax = array(fullSax)
fullTime = array(fullTime)

printQ('saving results table')
d = { 'patient': fullIDs, 'path' : fullPaths, 'x' : fullX, 'y' : fullY, 'iTime' : fullTime, 'sax' : fullSax}
import pandas as pd
d = pd.DataFrame(d)
d.to_csv('/home/colin/data/Second-Annual-Data-Science-Bowl/working/heartfinderV4b-centroids-submission.csv', index = False)

In the animated gif below, you can see the left ventricle centroid location that has been automatically fitted, displayed as a dark rectangle moving around near the centre of the heart chamber. The machine learning algorithm was not trained on this patient’s images – so what you see here is artificial intelligence in action!

20160308-submission-images


An Even Dozen – Denoising Dirty Documents: Part 12

Sunday, 15 Nov 2015

Posted by Colin Priest in Image Processing, Kaggle, Machine Learning, R, Stacking, XGBoost


Tags

Image Processing, Kaggle, Machine Learning, R, Stacking, XGBoost

Over the past 11 blogs in this series, I have discussed how to build machine learning models for Kaggle’s Denoising Dirty Documents competition.

dozeneggs

The final blog in this series brings the count to an even dozen, and will achieve two aims:

  1. ensemble the models that we have built
  2. take advantage of the second information leakage in the competition

Ensembling, the combining of individual models into a single model, performs best when the individual models have errors that are not strongly correlated. For example, if four models have statistically independent errors and similar accuracy, then the average of their predictions will have half the RMSE of any individual model (see the short simulation sketch after the list below). One way to increase the statistical independence of the models is to use different feature sets and / or different types of models for each. I therefore chose the following combination of models:

  1. deep learning – thresholding based features
  2. deep learning – edge based features
  3. deep learning – median based features
  4. images with backgrounds removed using information leakage
  5. xgboost – wide selection of features
  6. convolutional neural network – using raw images without background removal pre-processing
  7. convolutional neural network – using images with backgrounds removed using information leakage
  8. deep convolutional neural network – using raw images without background removal pre-processing
  9. deep convolutional neural network – using images with backgrounds removed using information leakage

20151115 ensemble structure
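As a sanity check on the claim above that averaging models with independent errors halves the RMSE, here is a short simulation sketch using synthetic data (not competition data):

# synthetic check: the average of 4 models with independent errors has ~half the RMSE of one model
set.seed(1)
n = 100000
truth = runif(n)
preds = sapply(1:4, function(i) truth + rnorm(n, sd = 0.1)) # 4 models, independent errors
rmse = function(a, b) sqrt(mean((a - b)^2))
rmse(preds[, 1], truth)       # close to 0.10 for a single model
rmse(rowMeans(preds), truth)  # close to 0.05 for the simple average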

It turned out that some of these models had errors that were correlated with the errors of other models. But I was rushing to improve my leaderboard score in the final 48 hours of the competition, so I didn’t have time to experiment.

I didn’t experiment much with different ensemble models. However, I did test xgboost against both a simple average and a least-squares linear regression, and it outperformed both. Maybe an elastic net could have done a good job.

Here is the R code for my ensemble:


.libPaths(c(.libPaths(), "./rlibs"))
library(png)
library(data.table)
library(xgboost)

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

cleanFolder = "./data/train_cleaned"
inFolder1 = "./threshold based model/training data"
inFolder2 = "./edge based model/training data"
inFolder3 = "./median based model/training data"
inFolder4 = "./foreground/train foreground"
inFolder5 = "./submission 11/train_postprocessed"
inFolder6 = "./convnet/train_predicted"
inFolder7 = "./cnn_leakage/train_predicted"
inFolder8 = "./CNN based model/training"
inFolder9 = "./deep CNN/train_predicted"

outPath = "./stacked/stacking.csv"

filenames = list.files(cleanFolder)
for (f in filenames)
{
print(f)
imgX1 = readPNG(file.path(inFolder1, f))
imgX2 = readPNG(file.path(inFolder2, f))
imgX3 = readPNG(file.path(inFolder3, f))
imgX4 = readPNG(file.path(inFolder4, f))
imgX5 = readPNG(file.path(inFolder5, f))
imgX6 = readPNG(file.path(inFolder6, f))
imgX7 = readPNG(file.path(inFolder7, f))
imgX8 = readPNG(file.path(inFolder8, f))
imgX9 = readPNG(file.path(inFolder9, f))
imgY = readPNG(file.path(cleanFolder, f))

# turn the images into vectors
y = img2vec(imgY)
x1 = img2vec(imgX1)
x2 = img2vec(imgX2)
x3 = img2vec(imgX3)
x4 = img2vec(imgX4)
x5 = img2vec(imgX5)
x6 = img2vec(imgX6)
x7 = img2vec(imgX7)
x8 = img2vec(imgX8)
x9 = img2vec(imgX9)

dat = data.table(cbind(y, x1, x2, x3, x4, x5, x6, x7, x8, x9))
setnames(dat,c("y", "threshold", "edge", "median", "foreground", "submission11", "convnet", "cnn_leakage", "CNN", "deepCNN"))
write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit an xgboost model to a subset of the data
set.seed(1)
#rows = sample(nrow(dat), 15000000)
dat[is.na(dat)] = 0
#dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
dtrain <- xgb.DMatrix(as.matrix(dat[,-1]), label = as.matrix(dat[,1]))
#
nThreads = 30
# do cross validation first
#xgb.tab = xgb.cv(data = dtrain, nthread = nThreads, eval_metric = "rmse", nrounds = 1000, early.stop.round = 15, nfold = 4, print.every.n = 10)
# what is the best number of rounds?
#min.error.idx = which.min(xgb.tab[, test.rmse.mean])
# now fit an xgboost model
min.error.idx = 300 # was 268
xgb.mod = xgboost(data = dtrain, nthread = nThreads, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)

dat_predicted = predict(xgb.mod, newdata=as.matrix(dat[,-1]))
sqrt( mean( (dat$y - dat_predicted) ^ 2 )) # 0.00759027

save (xgb.mod, file = "./model/xgb.rData")

#####################################################################################################################################

imgFolder = "./data/test"
inFolder1 = "./threshold based model/test data"
inFolder2 = "./edge based model/test data"
inFolder3 = "./median based model/test data"
inFolder4 = "./foreground/test foreground"
inFolder5 = "./submission 11/test_postprocessed"
inFolder6 = "./convnet/test_predicted"
inFolder7 = "./cnn_leakage/test_predicted"
inFolder8 = "./CNN based model/test"
inFolder9 = "./deep CNN/test_predicted"

outFolder = "./stacked/test data"
outFolder2 = "./stacked/test images"

filenames = list.files(imgFolder)
for (f in filenames)
{
print(f)
imgX1 = readPNG(file.path(inFolder1, f))
imgX2 = readPNG(file.path(inFolder2, f))
imgX3 = readPNG(file.path(inFolder3, f))
imgX4 = readPNG(file.path(inFolder4, f))
imgX5 = readPNG(file.path(inFolder5, f))
imgX6 = readPNG(file.path(inFolder6, f))
imgX7 = readPNG(file.path(inFolder7, f))
imgX8 = readPNG(file.path(inFolder8, f))
imgX9 = readPNG(file.path(inFolder9, f))

# turn the images into vectors
x1 = img2vec(imgX1)
x2 = img2vec(imgX2)
x3 = img2vec(imgX3)
x4 = img2vec(imgX4)
x5 = img2vec(imgX5)
x6 = img2vec(imgX6)
x7 = img2vec(imgX7)
x8 = img2vec(imgX8)
x9 = img2vec(imgX9)

dat = data.table(cbind(x1, x2, x3, x4, x5, x6, x7, x8, x9))
setnames(dat,c("threshold", "edge", "median", "foreground", "submission11", "convnet", "cnn_leakage", "CNN", "deepCNN"))
yHat = predict(xgb.mod, newdata=as.matrix(dat))
yHat[yHat < 0] = 0
yHat[yHat > 1] = 1
imgY = matrix(yHat, nrow(imgX1), ncol(imgX1))
writePNG(imgY, file.path(outFolder2, f))
save(imgY, file = file.path(outFolder, gsub(".png", ".rData", f)))
}

Ensembling materially improved my leaderboard score versus any of the individual models. I feel that was due to the use of different features across my 3 deep learning models. So now I had a set of images that looked quite good:

20151115 output 1

20151115 output 2

To my eyes, my predicted images were indistinguishable from the clean images in the training data. In a real world situation I would have stopped model development here, because the image quality exceeds the minimum requirements for OCR. However, since this was a competition, I wanted the best score I could get.

So I took advantage of the second data leakage in the competition – the fact that the cleaned images were repeated across the dataset. This meant that I could compare a cleaned image to other cleaned images that appeared to have the same text and the same font, and clean up any pixels that were different across the set of images. I experimented with using the mean of the pixel brightness across the images, but using the median performed better.


library(png)
library(data.table)

inFolder = "./stacked/test data"
outFolder = "./information leakage/data"
outFolder2 = "./information leakage/images"

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

filenames = list.files(inFolder, pattern = "\\.rData$")
for (f in filenames)
{
print(f)

load(file.path(inFolder, f))
imgX = imgY

# look for the closest matched images
scores = matrix(1, length(filenames))
for (i in 1:length(filenames))
{
load(file.path(inFolder, filenames[i]))
rmse = 1
if (nrow(imgY) >= nrow(imgX) && ncol(imgY) >= ncol(imgX))
{
imgY = imgY[1:nrow(imgX), 1:ncol(imgX)]
rmse = sqrt(mean( (imgX - imgY)^2 ))
}
scores[i] = rmse
}

dat = matrix(1, ncol(imgX) * nrow(imgX), 4)
for (i in 1:4)
{
f2 = filenames[order(scores)][i]
load(file.path(inFolder, f2))
dat[,i] = img2vec(imgY)
}

dat2 = apply(dat, 1, median)
#dat2 = apply(dat, 1, mean)

imgOut = matrix(dat2, nrow(imgX), ncol(imgX))
writePNG(imgOut, file.path(outFolder2, gsub(".rData", ".png", f)))
save(imgOut, file = file.path(outFolder, f))
}

This information leakage halved the RMSE, and I suspect that it was what allowed the top two competitors to obtain RMSE scores less than 1%.

So that’s it for this series of blogs. I learned a lot from my first Kaggle competition. Competing against others, and sharing ideas is a fun way to learn.


Denoising Dirty Documents: Part 11

Sunday, 08 Nov 2015

Posted by Colin Priest in Adaptive Thresholding, Background Removal, Deep Learning, Edge Detection, h2o, Image Processing, Kaggle, Machine Learning, Median Filter, Morphology, R


Tags

Adaptive Thresholding, Background Removal, Deep Learning, Edge Detection, h2o, Image Processing, Kaggle, Machine Learning, Median Filter, Morphology, R

In my last blog I showed how to use convolutional neural networks to build a model that removed stains from an image. While convolutional neural networks seem to be well suited for image processing, in this competition I found that deep neural networks performed better. In this blog I show how to build these models.

warnH022 - deep water

Since I wanted to use R, have limited RAM and don’t have a powerful GPU, I chose to use h2o to build the models. That way I could do the feature engineering in R, pass the data to h2o, let h2o build a model, then get the predicted values back into R. The memory management would be done in h2o, whose deep learning implementation works within the RAM limit that you give it. So I guess this combination of deep learning and h2o could be called “deep water” 😉

For my final competition submission I used an ensemble of models, including 3 deep learning models built with R and h2o. Each of the 3 deep learning models used different feature engineering:

  • median based feature engineering
  • edge based feature engineering
  • threshold based feature engineering

This blog shows the details of the median based model. I leave it to the reader to implement the edge based and threshold based models using the image processing scripts from my earlier blogs in this series.

If you don’t already have h2o installed on your computer, then you can install it directly from R. At the time of writing this blog, you could install h2o using the following script:


# The following two commands remove any previously installed H2O packages for R.
if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }

# Next, we download packages that H2O depends on.
if (! ("methods" %in% rownames(installed.packages()))) { install.packages("methods") }
if (! ("statmod" %in% rownames(installed.packages()))) { install.packages("statmod") }
if (! ("stats" %in% rownames(installed.packages()))) { install.packages("stats") }
if (! ("graphics" %in% rownames(installed.packages()))) { install.packages("graphics") }
if (! ("RCurl" %in% rownames(installed.packages()))) { install.packages("RCurl") }
if (! ("jsonlite" %in% rownames(installed.packages()))) { install.packages("jsonlite") }
if (! ("tools" %in% rownames(installed.packages()))) { install.packages("tools") }
if (! ("utils" %in% rownames(installed.packages()))) { install.packages("utils") }

# Now we download, install and initialize the H2O package for R.
install.packages("h2o", type="source", repos=(c("http://h2o-release.s3.amazonaws.com/h2o/rel-tibshirani/3/R")))

That script will need to be changed as new versions of h2o are released. So use the latest instructions shown here.

Once h2o is installed, you can interface with h2o from R using the CRAN package.


install.packages("h2o")
library(h2o)

Median based image processing is used for feature engineering in this example, but you could use any combination of image processing techniques for your feature engineering. I got better performance using separate deep learning models for different types of image processing, but that may be because I had limited computing resources. If you have more computing resources than me, then maybe you will be successful with a single large model that uses all of the image processing techniques to create features.


# a function to turn a matrix image into a vector
img2vec = function(img)
{
 return (matrix(img, nrow(img) * ncol(img), 1))
}
 
median_Filter = function(img, filterWidth)
{
 pad = floor(filterWidth / 2)
 padded = matrix(NA, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
 padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img
 
 tab = matrix(0, nrow(img) * ncol(img), filterWidth * filterWidth)
 k = 1
 for (i in seq_len(filterWidth))
 {
 for (j in seq_len(filterWidth))
 {
 tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
 k = k + 1
 }
 }
 
 filtered = unlist(apply(tab, 1, function(x) median(x[!is.na(x)])))
 return (matrix(filtered, nrow(img), ncol(img)))
}
 
# a function that uses median filter to get the background then finds the dark foreground
background_Removal = function(img)
{
 w = 5
 p = 1.39
 th = 240
 
 # the background is found via a median filter
 background = median_Filter(img, w)
 
 # the foreground is darker than the background
 foreground = img / background
 foreground[foreground > 1] = 1
 
 foreground2 = foreground ^ p
 foreground2[foreground2 >= (th / 255)] = 1
 
 return (matrix(foreground2, nrow(img), ncol(img)))
} 

img2tab = function(imgX, f)
{
 median5 = img2vec(median_Filter(imgX, 5))
 median17 = img2vec(median_Filter(imgX, 17))
 median25 = img2vec(median_Filter(imgX, 25))
 backgroundRemoval = img2vec(background_Removal(imgX))
 foreground = readPNG(file.path(foregroundFolder, f))
 
 # pad out imgX
 padded = matrix(0, nrow(imgX) + padding * 2, ncol(imgX) + padding * 2)
 offsets = expand.grid(seq_len(2*padding+1), seq_len(2*padding+1))
 
 # raw pixels window
 padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = imgX
 x = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))
 
 # x2 window
 padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = median5
 x2 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))
 
 # x3 window
 padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = median17
 x3 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))
 
 # x4 window
 padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = median25
 x4 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))
 
 # x5 window
 padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = backgroundRemoval
 x5 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))
 
 # x6 window
 padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = foreground
 x6 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

 dat = data.table(cbind(x, x2, x3, x4, x5, x6))
 setnames(dat,c(
 paste("x", seq_len((2*padding+1)^2), sep=""), 
 paste("median5", seq_len((2*padding+1)^2), sep=""),
 paste("median17", seq_len((2*padding+1)^2), sep=""),
 paste("median25", seq_len((2*padding+1)^2), sep=""),
 paste("backgroundRemoval", seq_len((2*padding+1)^2), sep=""),
 paste("foreground", seq_len((2*padding+1)^2), sep="")
 ))
 
 return (dat)
}

If you’ve been following my blog, then you will see that there’s nothing new in the two image processing functions shown above.

To build the model you will need to start h2o, import the data and tell h2o to create a deep learning model.

h2oServer = h2o.init(nthreads = 6, max_mem_size = "10G")

trainData = h2o.importFile(h2oServer, path = outPath)
testData = h2o.importFile(h2oServer, path = outPath2)

model.dl.median <- h2o.deeplearning(x = 2:ncol(trainData), y = 1, training_frame = trainData, validation_frame = testData,
 score_training_samples = 0, 
 overwrite_with_best_model = TRUE,
 activation = "Rectifier", seed = 1,
 hidden = c(200, 200,200), epochs = 15,
 adaptive_rate = TRUE, initial_weight_distribution = "UniformAdaptive", loss = "MeanSquare",
 fast_mode = T, diagnostics = T, ignore_const_cols = T,
 force_load_balance = T)


You should change the h2o.init parameters according to the hardware on your computer. I’m running my model on a PC with 8 CPUs and 16GB of RAM, so I left a couple of CPUs free to do the user interface and core operating system functionality, plus some RAM for the operating system. Scale these parameters up or down if your PC specifications are more or less powerful than mine.

The model may take a few hours to fit. During that time R will not do anything. So if you want to see how the model is progressing, then point your browser to your localhost (port 54321 on my PC, but maybe a different port on yours) and use the h2o web interface to see what is happening.

You can get the predicted values using the following script:


filenames = list.files(dirtyFolder)
for (f in filenames)
{
 print(f)
 imgX = readPNG(file.path(dirtyFolder, f))

dat = img2tab(imgX, f)

x.h2o = as.h2o(h2oServer, dat)
 predict.dl = as.data.frame(h2o.predict(model.dl.median, newdata = x.h2o))
 imgOut = matrix(as.numeric(predict.dl$predict), nrow(imgX), ncol(imgX))
 
 # correct the pixel brightnesses that are out of bounds
 imgOut[imgOut > 1] = 1
 imgOut[imgOut < 0] = 0

writePNG(imgOut, file.path(outFolder, f))
}

h2o.shutdown()

Running predictions is as simple as creating a data file, importing it to h2o, and then asking h2o to give you the predicted values from your already fitted model. I found that some of the raw predicted values were out of the [0, 1] range, and improved my leaderboard score by limiting the predicted values to lie within this range.

You do not need to shut down h2o after you finish running a model. In fact you may wish to leave it running so that you can do model diagnostics or run more predictions.

If you wish to save a copy of your model, for later reuse, then you can use the following syntax:


modelPath = h2o.saveModel(model.dl.median, dir = "./model", name = "model_dnn_median", force = TRUE)

Just remember that h2o needs to be running when you save models or load previously saved models.
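For example, a later session could reload the saved model with something like the sketch below (the exact arguments of h2o.loadModel have changed between h2o versions, so check the documentation for the version you installed):

# restart h2o, then reload the previously saved deep learning model from disk
library(h2o)
h2oServer = h2o.init(nthreads = 6, max_mem_size = "10G")
model.dl.median = h2o.loadModel(modelPath)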

In my next, and final, blog in this series, I will show how to take advantage of the second information leakage in the competition.

For those who want the entire R script to try out for themselves, here it is:


install.packages("h2o")
library(h2o)
library(png)
library(data.table)

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

median_Filter = function(img, filterWidth)
{
pad = floor(filterWidth / 2)
padded = matrix(NA, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

tab = matrix(0, nrow(img) * ncol(img), filterWidth * filterWidth)
k = 1
for (i in seq_len(filterWidth))
{
for (j in seq_len(filterWidth))
{
tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
k = k + 1
}
}

filtered = unlist(apply(tab, 1, function(x) median(x[!is.na(x)])))
return (matrix(filtered, nrow(img), ncol(img)))
}

# a function that uses median filter to get the background then finds the dark foreground
background_Removal = function(img)
{
w = 5
p = 1.39
th = 240

# the background is found via a median filter
background = median_Filter(img, w)

# the foreground is darker than the background
foreground = img / background
foreground[foreground > 1] = 1

foreground2 = foreground ^ p
foreground2[foreground2 >= (th / 255)] = 1

return (matrix(foreground2, nrow(img), ncol(img)))
}

dirtyFolder = "./data/train"
cleanFolder = "./data/train_cleaned"
outFolder = "./model"
foregroundFolder = "./foreground/train foreground"

outPath = file.path(outFolder, "trainingdata.csv")
outPath2 = file.path(outFolder, "testdata.csv")
filenames = list.files(dirtyFolder)
padding = 2
set.seed(1)
library(h2o)
h2oServer = h2o.init(nthreads = 15, max_mem_size = "110G")

trainData = h2o.importFile(h2oServer, path = outPath)
testData = h2o.importFile(h2oServer, path = outPath2)

model.dl.median <- h2o.deeplearning(x = 2:ncol(trainData), y = 1, training_frame = trainData, validation_frame = testData,
score_training_samples = 0,
overwrite_with_best_model = TRUE,
activation = "Rectifier", seed = 1,
hidden = c(200, 200,200), epochs = 15,
adaptive_rate = TRUE, initial_weight_distribution = "UniformAdaptive", loss = "MeanSquare",
fast_mode = T, diagnostics = T, ignore_const_cols = T,
force_load_balance = T)

summary(model.dl.median)

modelPath = h2o.saveModel(model.dl.median, dir = "./model", name = "model_dnn_median", force = TRUE)

outFolder = "./model/training data"

img2tab = function(imgX, f)
{
median5 = img2vec(median_Filter(imgX, 5))
median17 = img2vec(median_Filter(imgX, 17))
median25 = img2vec(median_Filter(imgX, 25))
backgroundRemoval = img2vec(background_Removal(imgX))
foreground = readPNG(file.path(foregroundFolder, f))

# pad out imgX
padded = matrix(0, nrow(imgX) + padding * 2, ncol(imgX) + padding * 2)
offsets = expand.grid(seq_len(2*padding+1), seq_len(2*padding+1))

# raw pixels window
padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = imgX
x = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

# x2 window
padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = median5
x2 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

# x3 window
padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = median17
x3 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

# x4 window
padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = median25
x4 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

# x5 window
padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = backgroundRemoval
x5 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

# x6 window
padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = foreground
x6 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

dat = data.table(cbind(x, x2, x3, x4, x5, x6))
setnames(dat,c(
paste("x", seq_len((2*padding+1)^2), sep=""),
paste("median5", seq_len((2*padding+1)^2), sep=""),
paste("median17", seq_len((2*padding+1)^2), sep=""),
paste("median25", seq_len((2*padding+1)^2), sep=""),
paste("backgroundRemoval", seq_len((2*padding+1)^2), sep=""),
paste("foreground", seq_len((2*padding+1)^2), sep="")
))

return (dat)
}

dirtyFolder = "./data/test"
outFolder = "./model/test data"
foregroundFolder = "./foreground/test foreground"
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))

dat = img2tab(imgX, f)

x.h2o = as.h2o(h2oServer, dat)
predict.dl = as.data.frame(h2o.predict(model.dl.median, newdata = x.h2o))
imgOut = matrix(as.numeric(predict.dl$predict), nrow(imgX), ncol(imgX))

# correct the pixel brightnesses that are out of bounds
imgOut[imgOut > 1] = 1
imgOut[imgOut < 0] = 0

writePNG(imgOut, file.path(outFolder, f))
}

h2o.shutdown()


Denoising Dirty Documents – Part 10

01 Sunday Nov 2015

Posted by Colin Priest in Convolutional Neural Networks, Deep Learning, Image Processing, Kaggle, Machine Learning, Python

≈ 27 Comments

Tags

Convolutional Neural Networks, Deep Learning, Image Processing, Kaggle, Machine Learning, Python

In my last blog, I explained how to take advantage of an information leakage regarding the repeated backgrounds in Kaggle’s Denoising Dirty Documents competition. The result of that process was a fairly good job of removing the background, but the resulting score was not good enough for a strong placing. We need to do some further processing of the image to improve the score.

Today we will use an approach that does not require me to do any feature engineering – convolutional neural networks, which are neural networks where the first few layers repeatedly apply the same weights across overlapping regions of the input data. One intuitive way of thinking about this is that it is like applying an edge detection filter (much like I described here) where the algorithm finds the appropriate weights for several different edge filters.

_images/mylenet.png

I’m told that convolutional neural networks are inspired by how vision works in the natural world. So if I test whether convolutional neural networks work well, am I giving them a robot eye test?

robot eye test

While I am comfortable coding in R, there is little support for convolutional neural networks in R, and I had to code this in Python, using the Theano library. I chose Theano because neural network model fitting can be quite time consuming, and Theano supports GPU based processing, which can be orders of magnitude faster than CPU based calculations. To simplify the code, I am using Daniel Nouri’s nolearn library, which sits on top of the lasagne library, which in turn sits on top of Theano. This is the first time I have coded in Python and the first time I have used convolutional neural networks, so it was a good learning experience.

Since my PCs don’t have top-of-the-line graphics cards with GPU processing support, I decided to run my code in the cloud on a virtual machine with GPU support. And since I didn’t want to go through the effort of setting up a Linux machine and installing all of the libraries and compilers, I used Domino Data Labs to host my analysis. You can find my project here.

Before I progress to coding the model, I have to set up the environment.

import os
import shutil
 
def setup_theano():
	destfile = "/home/ubuntu/.theanorc"
	open(destfile, 'a').close()
	shutil.copyfile("/mnt/.theanorc", destfile)
 
	print "Finished setting up Theano"

The Python script shown above creates a function that copies the Theano settings file into the appropriate folder in the virtual machine, so that Theano knows to use GPU processing rather than CPU processing. This function gets called from my main script.

In order to reduce the number of calculations, I used a network architecture that inputs an image and outputs an image. The suggestion for this architecture comes from ironbar, a great guy who placed third in the competition. This is unlike all of the examples I found online, which have just one or two outputs, usually because the online example is a classification problem identifying objects appearing within the image. But there are two issues with this architecture:

  1. it doesn’t allow for fully connected layers before the output, and
  2. the target images are different sizes.

I chose to ignore the first problem, although if I had time I would have tried out a more traditional architecture that included fully connected layers but which only models one target pixel at a time.

def image_matrix(img):
    """
    The output value has shape (<number of pixels>, <number of rows>, <number of columns>)
    """
    # 420 x 540 or 258 x 540?
    if img.shape[0] == 258:
        return (img[0:258, 0:540] / 255.0).astype('float32').reshape((1, 1, 258, 540))
    if img.shape[0] == 420:
        result = []
        result.append((img[0:258, 0:540] / 255.0).astype('float32').reshape((1, 1, 258, 540)))
        result.append((img[162:420, 0:540] / 255.0).astype('float32').reshape((1, 1, 258, 540)))
        result = np.vstack(result).astype('float32').reshape((2, 1, 258, 540))
        return result

For the second problem, I used the script shown above to split the larger images into two smaller images that were the same size as the other small images in the data, thereby standardising the output dimensions.

def load_train_set(file_list):
    xs = []
    ys = []
    for fname in file_list:
        x = image_matrix(load_image(os.path.join('./train_foreground/', fname)))
        y = image_matrix(load_image(os.path.join('./train_cleaned/', fname)))
        for i in range(0, x.shape[0]):
            xs.append(x[i, :, :, :].reshape((1, 1, 258, 540)))
            ys.append(y[i, :, :, :].reshape((1, 1, 258, 540)))
    return np.vstack(xs), np.vstack(ys)

Theano uses tensors (multi-dimensional matrices) to store the training data and outputs. The first dimension is the index of the training data item. The second dimension is the colourspace information e.g. RGB would be 3 dimensions. Since our images are greyscale, this dimension has a size of only 1. The remaining dimensions are the dimensions of the input data / output data. The script shown above reshapes the data to meet this criteria.

The nolearn library simplifies the process of defining the architecture of a neural network. I used 3 hidden convolutional layers, each with 25 image filters. The script below shows how this was achieved.

net2 = NeuralNet(
layers = [
('input', layers.InputLayer),
('conv1', layers.Conv2DLayer),
('conv2', layers.Conv2DLayer),
('conv3', layers.Conv2DLayer),
('output', layers.FeaturePoolLayer),
],
#layer parameters:
input_shape = (None, 1, 258, 540),
conv1_num_filters = 25, conv1_filter_size = (7, 7), conv1_pad = 'same',
conv2_num_filters = 25, conv2_filter_size = (7, 7), conv2_pad = 'same',
conv3_num_filters = 25, conv3_filter_size = (7, 7), conv3_pad = 'same',
output_pool_size = 25,
output_pool_function = T.sum,
y_tensor_type=T.tensor4,

#optimization parameters:
update = nesterov_momentum,
update_learning_rate = 0.005,
update_momentum = 0.9,
regression = True,
max_epochs = 200,
verbose = 1,
batch_iterator_train=BatchIterator(batch_size=25),
on_epoch_finished=[EarlyStopping(patience=20),],
train_split=TrainSplit(eval_size=0.25)
)

Due to the unique nature of the problem versus the online examples, my first attempt at this script was not successful. Here are the key changes that I needed to make:

  • set y_tensor_type=T.tensor4 because the target is 2 dimensional
  • your graphics card almost certainly doesn’t have enough RAM to process all of the images at once, so you need to use a batch iterator and experiment to find a suitable batch size e.g. batch_iterator_train=BatchIterator(batch_size=25)

plot_loss(net2)
plt.savefig("./results/plotloss.png")

I also wanted to plot the loss across the iterations. So I added the two lines above, giving me the graph below.

During the first several iterations, the neural network is balancing out the weights so that the pixels are the correct magnitude, and after that the serious work of image processing begins.


plot_conv_weights(net2.layers_[1], figsize=(4, 4))
plt.savefig("./results/convweights.png")

I wanted to see what some of the convolutional filters looked like. So I added the two lines shown above, giving me the set of images below.

These filters look like small parts of images of letters, which makes some sense because we are trying to identify whether a pixel sits on the stroke of a letter.

The output looks reasonable, although not as good as what I achieve using a combination of image processing techniques and deep learning.

In the output image above you can see some imperfections.

I think that if I had changed the network architecture to include fully connected layers then I would have achieved a better result. Maybe one day when I have enough time I will experiment with that architecture.

The full Python script is shown below:


import random
import numpy as np
import cv2
import os
import itertools
import math
import matplotlib.pyplot as plt

from setup_GPU import setup_theano
setup_theano()

from lasagne import layers
from lasagne.updates import nesterov_momentum
from lasagne.nonlinearities import softmax
from lasagne.nonlinearities import sigmoid
from nolearn.lasagne import BatchIterator
from nolearn.lasagne import NeuralNet
from nolearn.lasagne import TrainSplit
from nolearn.lasagne import PrintLayerInfo
from nolearn.lasagne.visualize import plot_loss
from nolearn.lasagne.visualize import plot_conv_weights
from nolearn.lasagne.visualize import plot_conv_activity
from nolearn.lasagne.visualize import plot_occlusion

import theano.tensor as T

def load_image(path):
    return cv2.imread(path, cv2.IMREAD_GRAYSCALE)

def write_image(img, path):
    return cv2.imwrite(path, img)

def image_matrix(img):
    """
    The output value has shape (<number of pixels>, <number of rows>, <number of columns>)
    """
    # 420 x 540 or 258 x 540?
    if img.shape[0] == 258:
        return (img[0:258, 0:540] / 255.0).astype('float32').reshape((1, 1, 258, 540))
    if img.shape[0] == 420:
        result = []
        result.append((img[0:258, 0:540] / 255.0).astype('float32').reshape((1, 1, 258, 540)))
        result.append((img[162:420, 0:540] / 255.0).astype('float32').reshape((1, 1, 258, 540)))
        result = np.vstack(result).astype('float32').reshape((2, 1, 258, 540))
        return result

def load_train_set(file_list):
    xs = []
    ys = []
    for fname in file_list:
        x = image_matrix(load_image(os.path.join('./train_foreground/', fname)))
        y = image_matrix(load_image(os.path.join('./train_cleaned/', fname)))
        for i in range(0, x.shape[0]):
            xs.append(x[i, :, :, :].reshape((1, 1, 258, 540)))
            ys.append(y[i, :, :, :].reshape((1, 1, 258, 540)))
    return np.vstack(xs), np.vstack(ys)

def load_test_file(fname, folder):
    xs = []
    x = image_matrix(load_image(os.path.join(folder, fname)))
    for i in range(0, x.shape[0]):
        xs.append(x[i, :, :, :].reshape((1, 1, 258, 540)))
    return np.vstack(xs)

def list_images(folder):
    included_extentions = ['jpg', 'bmp', 'png', 'gif']
    results = [fn for fn in os.listdir(folder) if any([fn.endswith(ext) for ext in included_extentions])]
    return results

def do_test(inFolder, outFolder, nn):
    test_images = list_images(inFolder)
    nTest = len(test_images)
    for x in range(0, nTest):
        fname = test_images[x]
        x1 = load_test_file(fname, inFolder)
        x1 = x1 - 0.5
        pred_y = nn.predict(x1)
        tempImg = []
        if pred_y.shape[0] == 1:
            tempImg = pred_y[0, 0, :, :].reshape(258, 540)
        if pred_y.shape[0] == 2:
            tempImg1 = pred_y[0, 0, :, :].reshape(258, 540)
            tempImg2 = pred_y[1, 0, :, :].reshape(258, 540)
            tempImg = np.empty((420, 540))
            tempImg[0:258, 0:540] = tempImg1
            tempImg[162:420, 0:540] = tempImg2
        tempImg[tempImg < 0] = 0
        tempImg[tempImg > 1] = 1
        tempImg = np.asarray(tempImg*255.0, dtype=np.uint8)
        write_image(tempImg, (os.path.join(outFolder, fname)))

class EarlyStopping(object):
    def __init__(self, patience=100):
        self.patience = patience
        self.best_valid = np.inf
        self.best_valid_epoch = 0
        self.best_weights = None

    def __call__(self, nn, train_history):
        current_valid = train_history[-1]['valid_loss']
        current_epoch = train_history[-1]['epoch']
        if current_valid < self.best_valid:
            self.best_valid = current_valid
            self.best_valid_epoch = current_epoch
            self.best_weights = nn.get_all_params_values()
        elif self.best_valid_epoch + self.patience < current_epoch:
            print("Early stopping.")
            print("Best valid loss was {:.6f} at epoch {}.".format(
                self.best_valid, self.best_valid_epoch))
            nn.load_params_from(self.best_weights)
            raise StopIteration()

def main():
    random.seed(1234)

    training_images = list_images("./train_foreground/")
    random.shuffle(training_images)
    nTraining = len(training_images)
    TRAIN_IMAGES = training_images

    train_x, train_y = load_train_set(TRAIN_IMAGES)
    test_x = train_x
    test_y = train_y

    # centre on zero - has already been divided by 255
    train_x = train_x - 0.5

    net2 = NeuralNet(
        layers = [
            ('input', layers.InputLayer),
            ('conv1', layers.Conv2DLayer),
            ('conv2', layers.Conv2DLayer),
            ('conv3', layers.Conv2DLayer),
            ('output', layers.FeaturePoolLayer),
        ],
        # layer parameters:
        input_shape = (None, 1, 258, 540),
        conv1_num_filters = 25, conv1_filter_size = (7, 7), conv1_pad = 'same',
        conv2_num_filters = 25, conv2_filter_size = (7, 7), conv2_pad = 'same',
        conv3_num_filters = 25, conv3_filter_size = (7, 7), conv3_pad = 'same',
        output_pool_size = 25,
        output_pool_function = T.sum,
        y_tensor_type = T.tensor4,

        # optimization parameters:
        update = nesterov_momentum,
        update_learning_rate = 0.005,
        update_momentum = 0.9,
        regression = True,
        max_epochs = 200,
        verbose = 1,
        batch_iterator_train = BatchIterator(batch_size=25),
        on_epoch_finished = [EarlyStopping(patience=20),],
        train_split = TrainSplit(eval_size=0.25)
    )

    net2.fit(train_x, train_y)

    plot_loss(net2)
    plt.savefig("./results/plotloss.png")
    plot_conv_weights(net2.layers_[1], figsize=(4, 4))
    plt.savefig("./results/convweights.png")

    #layer_info = PrintLayerInfo()
    #layer_info(net2)

    import cPickle as pickle
    with open('results/net2.pickle', 'wb') as f:
        pickle.dump(net2, f, -1)

    y_pred2 = net2.predict(test_x)
    print "The accuracy of this network is: %0.2f" % (abs(y_pred2 - test_y)).mean()

    do_test("./train_foreground/", './train_predicted/', net2)
    do_test("./test_foreground/", './test_predicted/', net2)

if __name__ == '__main__':
    main()


Denoising Dirty Documents: Part 8

02 Friday Oct 2015

Posted by Colin Priest in Image Processing, Kaggle, Machine Learning, R

≈ 4 Comments

Tags

Image Processing, Kaggle, Machine Learning, R

In this blog we will engineer a new feature to go into our model. So far we have predominantly been using localised features – information about pixels that are located nearby the pixel whose brightness we are predicting. In this blog we will consider the structure of a document, and use that to improve our model.

20150801 - after

The image consists of multiple lines of text, arranged into one or more paragraphs with multiple sentences in each paragraph.

ducks lined up

If we get our ducks in a row, we can use the fact that the text (like our metaphorical ducks) is arranged into lines, and that there are gaps between those lines of text. In particular, if we can find the gaps between the lines, we can ensure that the predicted value within those gaps is always the background colour.

Let’s look at the horizontal profile of a sample input image:


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png)

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"

f = "135.png"
imgX = readPNG(file.path(dirtyFolder, f))

hProfileX = unlist(apply(imgX, 1, mean))

x = 1:nrow(imgX)
plot(x, hProfileX, col="red", type="l", xlab = "row", ylab = "brightness")

20150930 output 01

The rows with low brightness show where text occurs at regular intervals. But the absolute level of the row brightness varies, according to the noise and stains, so much that sometimes the gaps between the lines of text are darker than the rows in which text occurs. This will make it more difficult to determine which rows are which.

But what if we used the predicted images from our stacked model from the last blog as the inputs to this predictor?


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png)

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\stacked"

f = "135.png"
imgX = readPNG(file.path(dirtyFolder, f))

hProfileX = unlist(apply(imgX, 1, mean))

x = 1:nrow(imgX)
plot(x, hProfileX, col="red", type="l", xlab = "row", ylab = "brightness")

20150930 output 02

Now it’s much easier to find the gaps between the lines of text. To do this we need estimate three parameters:

  1. cycle length: the number of rows from one line of text to the next line of text
  2. gap length: the height of the gap between lines of text
  3. starting location of the first cycle

The constraints that we can use for determining these parameters are:

  • cycle length is a positive number
  • gap length is a positive number
  • gap length is less than cycle length
  • gap rows are white
  • most letters (e.g. weruioaszxcvnm) are short and these letters are located in the darkest rows, which we will call the “core rows”
  • some letters are taller (e.g. tdfhklb) and create slightly darker rows above the core rows
  • some letters are taller (e.g. qypgj) and create slightly darker rows below the core rows
  • if we cluster the values, the darker cluster contains the core of the letters (the core rows), but these rows may be lighter when a paragraph starts or ends
  • the in-between valued rows next to the white rows are possibly where the taller letters are located and their darkness will vary according to the number of tall letters

Let’s start by clustering the row brightnesses, and checking where the rows switch from one cluster to another, and using this to estimate cycle length.


km = kmeans(hProfileX, 2)
cutoff = (max(hProfileX[km$cluster == which.min(km$centers)]) + min(hProfileX[km$cluster == which.max(km$centers)])) / 2
plot(hProfileX, type="l")
lines(hProfileX * 0 + cutoff, col="red")

starts = (2:length(hProfileX))[sapply(2:length(hProfileX), function(x) hProfileX[x] < cutoff & hProfileX[x-1] > cutoff)]
cycleLength = mean(diff(starts))

20150930 output 03

20150930 output 04

Next we will estimate the starting location of the first cycle. But before we do that, we need to define the “start” of the cycle. Let’s define the cycle as starting at the row in which the first letter starts. That would be the first non-white row before the first core rows.

cycleStarts = starts[1]
while (hProfileX[cycleStarts] < 0.99)
cycleStarts = cycleStarts - 1

The first cycle starts at row 3.

To estimate the gap length, we should measure the number of white rows before each cycle starts.


gaps = matrix(0, length(starts) - 1)
for (i in 2:length(starts))
{
 cycleStarts = starts[i]
 while (hProfileX[cycleStarts] < 0.99)
 cycleStarts = cycleStarts - 1
 gapStarts = cycleStarts - 1
 while (hProfileX[gapStarts] > 0.99)
 gapStarts = gapStarts - 1
 gapLength = cycleStarts - gapStarts
 gaps[i - 1] = gapLength
}
gaps

The gaps are (13, 12, 17, 13, 12, 11, 12, 18, 12, 12). The two outliers occur after rows of text that do not contain any of the characters (q, y, p, g, j). After removing these two outliers, the gaps do not exceed 13 rows. So we can write code to allow for this as follows:


mGaps = median(gaps)
gapLength = mean(gaps[abs(gaps - mGaps) <= 2])

This gives us a gap length estimate of 13.125 and completes our line location model for this image. The quick way to use this model is to white out the gap rows. A more complex model could use the newly estimated row types (core, gap, other) as predictors.


imgCleaned = imgX
numCycles = length(starts)
for (i in 1:numCycles)
{
cycleStart = starts[i] - min(gaps[abs(gaps - mGaps) <= 2])
imgCleaned[cycleStart:(starts[i]-1), ] = 1
}

The script above would clean up any minor image blemishes that exist between the lines of text.
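As a starting point for that more complex model, here is a rough sketch of how the estimated row types could be turned into an extra predictor column. The names rowType and rowTypeFeature are hypothetical, and I have only split the rows into "gap" versus "text" rather than the full core / gap / other split:

# classify each row as gap or text, using the estimated cycle starts and gap length
rowType = rep("text", nrow(imgX))
for (i in seq_along(starts))
{
 gapRows = (starts[i] - round(gapLength)):(starts[i] - 1)
 gapRows = gapRows[gapRows >= 1]
 rowType[gapRows] = "gap"
}

# repeat the row classification once per column, so that it lines up with img2vec's column-major ordering
rowTypeFeature = rep(as.numeric(rowType == "gap"), ncol(imgX))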


Denoising Dirty Documents: Part 7

23 Wednesday Sep 2015

Posted by Colin Priest in Image Processing, Kaggle, Machine Learning, R, Stacking

≈ 3 Comments

Tags

Image Processing, Kaggle, Machine Learning, R, Stacking

By the time I’d finished building the model in my last blog, I’d started to overload my computer’s RAM and CPU, so much so that I couldn’t add any more features. One solution could be for me to upgrade my hardware, or rent out a cluster in the cloud, but I’m trying to save money at the moment. So today I’m going to restructure my predictive solution into separate predictive models, each of which do not individually overload my computer, but which are combined via stacking to give an overall solution that is more predictive than any of the individual models.

stacking

Stacking would also allow us to answer Bobby’s question:

I thought it was interesting how the importance chart didn’t show any one individual pixel as being significant. Is there another similar importance chart or metric that can show the importance of a combination of pixels?

So let’s start by breaking up the current monolithic model into discrete chunks. Once we have done this, it will be easier for us to add new features (which I will do in the next blog). I will store the predicted outputs from each model in image format with separate folders for each model.

20150923 diagram 1

The R script for creating the individual models is primarily recycled from my past blogs. Since images contain integer values for pixel brightness, one may argue that storing the predicted values as images loses some of the information content, but the rounding of floating point values to integer values only affects the precision by a maximum of 0.2%, and the noise in the predicted values is an order of magnitude greater than that. If you want that fractional extra predictive power, then I leave it to you to adapt the script to write the floating point predictions into a data.table.
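For anyone who wants to check that 0.2% figure, it is just the worst-case rounding error of an 8-bit greyscale PNG:

# a [0, 1] brightness stored as an 8-bit integer can be off by at most half a level
0.5 / 255
# 0.00196..., i.e. roughly 0.2% of the full brightness range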

Linear Transformation

The linear transformation model was developed here, and just involves a linear transformation, then constraining the output to be in the [0, 1] range.

20150801 output 6


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png)

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"

# predict using linear transformation
outModel1 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model1"
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))

# turn the images into vectors
x = matrix(imgX, nrow(imgX) * ncol(imgX), 1)

# linear transformation
yHat = -0.126655 + 1.360662 * x

# constrain the range to be in [0, 1]
yHat = sapply(yHat, function(x) max(min(x, 1),0))

# turn the vector into an image
imgYHat = matrix(yHat, nrow(imgX), ncol(imgX))

# save the predicted value
writePNG(imgYHat, file.path(outModel1, f))
}

Thresholding

The thresholding model was developed here, and involves a combination of adaptive thresholding processes, each with different sizes of localisation ranges, plus a global thresholding process. But this time we will use xgboost instead of gbm.

20150815 output 3


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, data.table, xgboost)

if (!require("EBImage"))
{
source("http://bioconductor.org/biocLite.R")
biocLite("EBImage")
}

# a function to do k-means thresholding
kmeansThreshold = function(img)
{
# fit 3 clusters
v = img2vec(img)
km.mod = kmeans(v, 3)
# allow for the random ordering of the clusters
oc = order(km.mod$centers)
# the higher threshold is the halfway point between the top of the middle cluster and the bottom of the highest cluster
hiThresh = 0.5 * (max(v[km.mod$cluster == oc[2]]) + min(v[km.mod$cluster == oc[3]]))

# using upper threshold
imgHi = v
imgHi[imgHi <= hiThresh] = 0
imgHi[imgHi > hiThresh] = 1

return (imgHi)
}

# a function that applies adaptive thresholding
adaptiveThresholding = function(img)
{
img.eb <- Image(t(img))
img.thresholded.3 = thresh(img.eb, 3, 3)
img.thresholded.5 = thresh(img.eb, 5, 5)
img.thresholded.7 = thresh(img.eb, 7, 7)
img.thresholded.9 = thresh(img.eb, 9, 9)
img.thresholded.11 = thresh(img.eb, 11, 11)
img.kmThresh = kmeansThreshold(img)

# combine the adaptive thresholding
ttt.1 = cbind(img2vec(Image2Mat(img.thresholded.3)), img2vec(Image2Mat(img.thresholded.5)), img2vec(Image2Mat(img.thresholded.7)), img2vec(Image2Mat(img.thresholded.9)), img2vec(Image2Mat(img.thresholded.11)), img2vec(kmeansThreshold(img)))
ttt.2 = apply(ttt.1, 1, max)
ttt.3 = matrix(ttt.2, nrow(img), ncol(img))
return (ttt.3)
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

# a function to convert an Image into a matrix
Image2Mat = function(Img)
{
m1 = t(matrix(Img, nrow(Img), ncol(Img)))
return(m1)
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outPath = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model2.csv"
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))
imgY = readPNG(file.path(cleanFolder, f))

# turn the images into vectors
x = img2vec(imgX)
y = img2vec(imgY)

# threshold the image
x2 = kmeansThreshold(imgX)

# adaptive thresholding
x3 = img2vec(adaptiveThresholding(imgX))

dat = data.table(cbind(y, x, x2, x3))
setnames(dat,c("y", "raw", "thresholded", "adaptive"))
write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit an xgboost model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 2000000)
dat[is.na(dat)] = 0
dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
# do cross validation first
xgb.tab = xgb.cv(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = 1000, early.stop.round = 50, nfold = 5, print.every.n = 10)
# what is the best number of rounds?
min.error.idx = which.min(xgb.tab[, test.rmse.mean])
# now fit an xgboost model
xgb.mod = xgboost(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)

# get the predicted result for each image
outModel2 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model2"
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))

# turn the images into vectors
x = img2vec(imgX)

# threshold the image
x2 = kmeansThreshold(imgX)

# adaptive thresholding
x3 = img2vec(adaptiveThresholding(imgX))

dat = data.table(cbind(x, x2, x3))
setnames(dat,c("raw", "thresholded", "adaptive"))

# predicted values
yHat = predict(xgb.mod, newdata=as.matrix(dat))

# constrain the range to be in [0, 1]
yHat = sapply(yHat, function(x) max(min(x, 1),0))

# turn the vector into an image
imgYHat = matrix(yHat, nrow(imgX), ncol(imgX))

# save the predicted value
writePNG(imgYHat, file.path(outModel2, f))
}

Edges

The edge model was developed here, and involves a combination of canny edge detection and image morphology.

20150822 output 2


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, data.table, xgboost, biOps)

# a function to do canny edge detector
cannyEdges = function(img)
{
img.biOps = imagedata(img * 255)
img.canny = imgCanny(img.biOps, 0.7)
return (matrix(img.canny / 255, nrow(img), ncol(img)))
}

# a function combining canny edge detector with morphology
cannyDilated1 = function(img)
{
img.biOps = imagedata(img * 255)
img.canny = imgCanny(img.biOps, 0.7)
# do some morphology on the edges to fill the gaps between them
mat <- matrix (0, 3, 3)
mask <- imagedata (mat, "grey", 3, 3)
img.dilation = imgBinaryDilation(img.canny, mask)
img.erosion = imgBinaryErosion(img.dilation, mask)
return(matrix(img.erosion / 255, nrow(img), ncol(img)))
}

# a function combining canny edge detector with morphology
cannyDilated2 = function(img)
{
img.biOps = imagedata(img * 255)
img.canny = imgCanny(img.biOps, 0.7)
# do some morphology on the edges to fill the gaps between them
mat <- matrix (0, 3, 3)
mask <- imagedata (mat, "grey", 3, 3)
img.dilation = imgBinaryDilation(img.canny, mask)
img.erosion = imgBinaryErosion(img.dilation, mask)
img.erosion.2 = imgBinaryErosion(img.erosion, mask)
img.dilation.2 = imgBinaryDilation(img.erosion.2, mask)
return(matrix(img.dilation.2 / 255, nrow(img), ncol(img)))
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outPath = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model3.csv"
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))
imgY = readPNG(file.path(cleanFolder, f))

# turn the images into vectors
x = img2vec(imgX)
y = img2vec(imgY)

# canny edge detector and related features
x4 = img2vec(cannyEdges(imgX))
x5 = img2vec(cannyDilated1(imgX))
x6 = img2vec(cannyDilated2(imgX))

dat = data.table(cbind(y, x, x4, x5, x6))
setnames(dat,c("y", "raw", "canny", "cannyDilated1", "cannyDilated2"))
write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit an xgboost model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 2000000)
dat[is.na(dat)] = 0
dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
# do cross validation first
xgb.tab = xgb.cv(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = 1000, early.stop.round = 50, nfold = 5, print.every.n = 10)
# what is the best number of rounds?
min.error.idx = which.min(xgb.tab[, test.rmse.mean])
# now fit an xgboost model
xgb.mod = xgboost(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)

# get the predicted result for each image
outModel3 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model3"
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))

# turn the images into vectors
x = img2vec(imgX)

# canny edge detector and related features
x4 = img2vec(cannyEdges(imgX))
x5 = img2vec(cannyDilated1(imgX))
x6 = img2vec(cannyDilated2(imgX))

dat = data.table(cbind(x, x4, x5, x6))
setnames(dat,c("raw", "canny", "cannyDilated1", "cannyDilated2"))

# predicted values
yHat = predict(xgb.mod, newdata=as.matrix(dat))

# constrain the range to be in [0, 1]
yHat = sapply(yHat, function(x) max(min(x, 1),0))

# turn the vector into an image
imgYHat = matrix(yHat, nrow(imgX), ncol(imgX))

# save the predicted value
writePNG(imgYHat, file.path(outModel3, f))
}

Background Removal

The background removal model was developed here, and involves the use of median filters. An xgboost model is used to combine the median filter results.

20150829 output 1


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, data.table, xgboost)
# a function to do a median filter
median_Filter = function(img, filterWidth)
{
pad = floor(filterWidth / 2)
padded = matrix(NA, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

tab = matrix(NA, nrow(img) * ncol(img), filterWidth ^ 2)
k = 1
for (i in seq_len(filterWidth))
{
for (j in seq_len(filterWidth))
{
tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
k = k + 1
}
}

filtered = unlist(apply(tab, 1, function(x) median(x, na.rm = TRUE)))
return (matrix(filtered, nrow(img), ncol(img)))
}

# a function that uses median filter to get the background then finds the dark foreground
background_Removal = function(img)
{
w = 5

# the background is found via a median filter
background = median_Filter(img, w)

# the foreground is darker than the background
foreground = img - background
foreground[foreground > 0] = 0
m1 = min(foreground)
m2 = max(foreground)
foreground = (foreground - m1) / (m2 - m1)

return (matrix(foreground, nrow(img), ncol(img)))
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outPath = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model4.csv"
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))
imgY = readPNG(file.path(cleanFolder, f))

# turn the images into vectors
x = img2vec(imgX)
y = img2vec(imgY)

# median filter and related features
x7a = img2vec(median_Filter(imgX, 5))
x7b = img2vec(median_Filter(imgX, 11))
x7c = img2vec(median_Filter(imgX, 17))
x8 = img2vec(background_Removal(imgX))

dat = data.table(cbind(y, x, x7a, x7b, x7c, x8))
setnames(dat,c("y", "raw", "median5", "median11", "median17", "background"))
write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit an xgboost model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 2000000)
dat[is.na(dat)] = 0
dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
# do cross validation first
xgb.tab = xgb.cv(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = 1000, early.stop.round = 50, nfold = 5, print.every.n = 10)
# what is the best number of rounds?
min.error.idx = which.min(xgb.tab[, test.rmse.mean])
# now fit an xgboost model
xgb.mod = xgboost(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)

# get the predicted result for each image
outModel4 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model4"
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))

# turn the images into vectors
x = img2vec(imgX)

# median filter and related features
x7a = img2vec(median_Filter(imgX, 5))
x7b = img2vec(median_Filter(imgX, 11))
x7c = img2vec(median_Filter(imgX, 17))
x8 = img2vec(background_Removal(imgX))

dat = data.table(cbind(x, x7a, x7b, x7c, x8))
setnames(dat,c("raw", "median5", "median11", "median17", "background"))

# predicted values
yHat = predict(xgb.mod, newdata=as.matrix(dat))

# constrain the range to be in [0, 1]
yHat = sapply(yHat, function(x) max(min(x, 1),0))

# turn the vector into an image
imgYHat = matrix(yHat, nrow(imgX), ncol(imgX))

# save the predicted value
writePNG(imgYHat, file.path(outModel4, f))
}

Nearby Pixels

The spacial model was developed here, and involves the use of a sliding window and xgboost.

20150904 output 7


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, data.table, xgboost)

# a function that groups together the pixels contained within a sliding window around each pixel of interest
proximalPixels = function(img)
{
 pad = 2
 width = 2 * pad + 1
 padded = matrix(median(img), nrow(img) + 2 * pad, ncol(img) + 2 * pad)
 padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img
 
 tab = matrix(1, nrow(img) * ncol(img), width ^ 2)
 k = 1
 for (i in seq_len(width))
 {
 for (j in seq_len(width))
 {
 tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
 k = k + 1
 }
 }

 return (tab)
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
 return (matrix(img, nrow(img) * ncol(img), 1))
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outPath = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model5.csv"
filenames = list.files(dirtyFolder)
for (f in filenames)
{
 print(f)
 imgX = readPNG(file.path(dirtyFolder, f))
 imgY = readPNG(file.path(cleanFolder, f))
 
 # turn the images into vectors
 #x = img2vec(imgX)
 y = img2vec(imgY)
 
 # surrounding pixels
 x9 = proximalPixels(imgX)

 dat = data.table(cbind(y, x9))
 setnames(dat,append("y", paste("x", 1:25, sep="")))
 write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit an xgboost model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 2000000)
dat[is.na(dat)] = 0
dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
# do cross validation first
xgb.tab = xgb.cv(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = 1000, early.stop.round = 50, nfold = 5, print.every.n = 10)
# what is the best number of rounds?
min.error.idx = which.min(xgb.tab[, test.rmse.mean]) 
# now fit an xgboost model 
xgb.mod = xgboost(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)


# get the predicted result for each image
outModel5 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model5"
for (f in filenames)
{
 print(f)
 imgX = readPNG(file.path(dirtyFolder, f))
 
 # turn the images into vectors
 #x = img2vec(imgX) 
 
 # surrounding pixels
 x9 = proximalPixels(imgX)

 dat = data.table(x9)
 setnames(dat,paste("x", 1:25, sep=""))
 
 # predicted values
 yHat = predict(xgb.mod, newdata=as.matrix(dat))
 
 # constrain the range to be in [0, 1]
 yHat = sapply(yHat, function(x) max(min(x, 1),0))
 
 # turn the vector into an image
 imgYHat = matrix(yHat, nrow(imgX), ncol(imgX))
 
 # save the predicted value
 writePNG(imgYHat, file.path(outModel5, f))
}

Combining the Individual Models

There are many ways that the predictions of the models could be ensembled together. I have used an xgboost model because I want to allow for the complexity of the problem that is being solved.
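Before reaching for xgboost, it is worth noting that the simplest possible ensemble is just a pixel-by-pixel average of the five model outputs. Here is a minimal sketch, reusing the inPath1 to inPath5 folders defined in the script below and writing to a hypothetical averaged folder:

# a naive baseline ensemble: average the five model predictions pixel by pixel
library(png)

inPaths = c(inPath1, inPath2, inPath3, inPath4, inPath5)
# hypothetical output folder for the averaged images
outAveraged = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\averaged"

for (f in list.files(inPath1))
{
 imgs = lapply(inPaths, function(p) readPNG(file.path(p, f)))
 imgAvg = Reduce("+", imgs) / length(imgs)
 writePNG(imgAvg, file.path(outAveraged, f))
}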

# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, data.table, xgboost)

# a function to turn a matrix image into a vector
img2vec = function(img)
{
 return (matrix(img, nrow(img) * ncol(img), 1))
}

inPath1 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model1"
inPath2 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model2"
inPath3 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model3"
inPath4 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model4"
inPath5 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model5"

cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outPath = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\stacking.csv"

filenames = list.files(inPath1)
for (f in filenames)
{
 print(f)
 imgX1 = readPNG(file.path(inPath1, f))
 imgX2 = readPNG(file.path(inPath2, f))
 imgX3 = readPNG(file.path(inPath3, f))
 imgX4 = readPNG(file.path(inPath4, f))
 imgX5 = readPNG(file.path(inPath5, f))
 imgY = readPNG(file.path(cleanFolder, f))
 
 # turn the images into vectors
 #x = img2vec(imgX)
 y = img2vec(imgY)
 
 # contributing models
 x1 = img2vec(imgX1)
 x2 = img2vec(imgX2)
 x3 = img2vec(imgX3)
 x4 = img2vec(imgX4)
 x5 = img2vec(imgX5)

 dat = data.table(cbind(y, x1, x2, x3, x4, x5))
 setnames(dat,c("y", "linear", "thresholding", "edges", "backgroundRemoval", "proximal"))
 write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit an xgboost model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 5000000)
dat[is.na(dat)] = 0
dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
# do cross validation first
xgb.tab = xgb.cv(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = 1000, early.stop.round = 50, nfold = 5, print.every.n = 10)
# what is the best number of rounds?
min.error.idx = which.min(xgb.tab[, test.rmse.mean]) 
# now fit an xgboost model 
xgb.mod = xgboost(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)

# get the trained model
model = xgb.dump(xgb.mod, with.stats=TRUE)
# get the feature real names
names = names(dat)[-1]
# compute feature importance matrix
importance_matrix = xgb.importance(names, model=xgb.mod)
# plot the variable importance
gp = xgb.plot.importance(importance_matrix)
print(gp)

# get the predicted result for each image
outModel = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\stacked"
for (f in filenames)
{
 print(f)
 imgX1 = readPNG(file.path(inPath1, f))
 imgX2 = readPNG(file.path(inPath2, f))
 imgX3 = readPNG(file.path(inPath3, f))
 imgX4 = readPNG(file.path(inPath4, f))
 imgX5 = readPNG(file.path(inPath5, f))
 
 # contributing models
 x1 = img2vec(imgX1)
 x2 = img2vec(imgX2)
 x3 = img2vec(imgX3)
 x4 = img2vec(imgX4)
 x5 = img2vec(imgX5)

 dat = data.table(cbind(x1, x2, x3, x4, x5))
 setnames(dat,c("linear", "thresholding", "edges", "backgroundRemoval", "proximal"))
 
 # predicted values
 yHat = predict(xgb.mod, newdata=as.matrix(dat))
 
 # constrain the range to be in [0, 1]
 yHat = sapply(yHat, function(x) max(min(x, 1),0))
 
 # turn the vector into an image
 imgYHat = matrix(yHat, nrow(imgX1), ncol(imgX1))
 
 # save the predicted value
 writePNG(imgYHat, file.path(outModel, f))
}

20150923 output 1
The graph above answers Bobby’s question – when considered together, the nearby pixels provide most of the additional predictive power beyond that of the raw pixel brightness. It seems that the key to separating dark text from noise is to consider how the pixel fits into the local region within the image.
Now that we have set up a stacking model structure, we can recommence the process of adding more image processing and features to the model. And that’s what the next blog will do…


Denoising Dirty Documents: Part 6

07 Monday Sep 2015

Posted by Colin Priest in Image Processing, Kaggle, Machine Learning, Median Filter, R, Variable Importance, XGBoost

≈ 7 Comments

Tags

Image Processing, Kaggle, Machine Learning, Median Filter, R, Variable Importance, XGBoost

So far in this series of blogs we have used image processing techniques to improve the images, and then ensembled together the results of that image processing using GBM or XGBoost. But I have noticed that some competitors have achieved reasonable results using purely machine learning approaches. While these pure machine learning approaches aren’t enough for the competitors to get to the top of the leader board, they have outperformed some of the models that I have presented in this series of blogs. However these scripts were invariably written in Python and I thought that it would be great to see how to use R to build a similar type of model (except better, because we will include all of the image processing predictors that we have developed so far). So today we will add a brute-force machine learning approach to our model.

At the time of writing this blog, the top-ranked competitor who has shared their script is placed 17th with an RMSE of 2.6%. While I am not very experienced in Python, I can usually figure out what a Python script is doing if I read it. So here’s what I think that script is doing:

  1. pad out each image by an extra 2 pixels (see my last blog for how to pad out an image in R)
  2. run a 3×3 sliding window along the image pixels (see my last blog for how to create a sliding window in R)
  3. use all 9 pixels within the sliding window as predictors for the pixel in the centre of the sliding window
  4. use a random forest model to predict the pixel brightnesses

This is a pure machine learning approach because it doesn’t do any image processing to pre-process the image. It simply says that if you want to predict a particular pixel brightness, then you should probably look around at the brightnesses of the pixels either side of that pixel.
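To make that recipe concrete in R terms, here is a rough sketch of the pure machine learning approach for a single training image, using the same sample image that appears below. It is only meant to mirror the Python approach; in practice you would loop over every training image and manage RAM carefully, and as I explain below I switch to a 5×5 window and XGBoost instead:

# sketch of the pure machine learning approach: a 3x3 sliding window feeding a random forest
library(png)
library(randomForest)

img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")
target = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned\\3.png")

# pad the image by 1 pixel, then collect the 9 pixels around each pixel as predictors
pad = 1
padded = matrix(median(img), nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

tab = matrix(0, nrow(img) * ncol(img), (2 * pad + 1) ^ 2)
k = 1
for (i in seq_len(2 * pad + 1))
 for (j in seq_len(2 * pad + 1))
 {
  tab[, k] = as.vector(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
  k = k + 1
 }

# fit a random forest on a subsample of the pixels to keep memory usage manageable
rf = randomForest(x = tab, y = as.vector(target), ntree = 50, sampsize = 20000)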

canstockphoto16547059

While I don’t want to fuel the R versus Python language wars, I do want to create a model in R that can outperform my competitors. So instead of using a 3×3 sliding window, I will use a 5×5 sliding window. Because the random forest implementation in R tends to run out of RAM on my computer, I will use XGBoost.

More_Power_HomeImprovementO

Here is how I create a 5×5 sliding window.


# show the predicted result for a sample image
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")

# create a padded image within which we will embed our source image
pad = 2
width = 2 * pad + 1
padded = matrix(1, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

# create a matrix of predictor values - each column is a pixel from the sliding window
tab = NULL
for (i in seq_len(width))
{
for (j in seq_len(width))
{
if (i == 1 && j == 1)
{
tab = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
} else {
tab = cbind(tab, img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))]))
}
}
}

head(tab[,1:4])

20150904 output 1

When I noticed that this script ran a bit slowly, I looked at how I had written the loops. Explicit loops are slow in R, and repeatedly appending data is even worse, because R copies the whole object each time it grows. So I rewrote the script to pre-allocate space for all of the cells rather than appending each column as it was calculated.


pad = 2
width = 2 * pad + 1
padded = matrix(1, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

tab = matrix(1, nrow(img) * ncol(img), width ^ 2)
k = 1
for (i in seq_len(width))
{
for (j in seq_len(width))
{
tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
k = k + 1
}
}

This modification gave me a tenfold improvement in speed! It’s a reminder that I shouldn’t be lazy when writing R scripts.
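If you want to see the difference on your own machine, a rough way to time the two versions is sketched below (the image size here just matches the sample image; any matrix of similar size will do):

# compare appending columns with pre-allocating the whole matrix
img = matrix(runif(420 * 540), 420, 540)
pad = 2
width = 2 * pad + 1
padded = matrix(1, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

# version 1: grow the matrix by appending each column as it is calculated
system.time({
 tab1 = NULL
 for (i in seq_len(width))
  for (j in seq_len(width))
   tab1 = cbind(tab1, as.vector(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))]))
})

# version 2: pre-allocate space for all of the columns up front
system.time({
 tab2 = matrix(1, nrow(img) * ncol(img), width ^ 2)
 k = 1
 for (i in seq_len(width))
  for (j in seq_len(width))
  {
   tab2[, k] = as.vector(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
   k = k + 1
  }
})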

There’s one small issue to consider before we put this all together for a new more powerful predictive model. In the code above I have used a brightness value of 1 when padding out the image, but that doesn’t look natural. The image below shows a clear white border around the image. This could confuse the machine learning algorithm, as it could waste time learning what to do with pure white pixels.

20150904 output 2

So instead it makes sense to pad out the image with a background brightness. One way to do this is to replicate the edge of the original image.


pad = 2
width = 2 * pad + 1
padded = matrix(1, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img
# fill in the padded edges
padded[,1:pad] = padded[,pad + 1]
padded[,ncol(img) + pad + 1:pad] = padded[,ncol(padded) - pad]
# rep(..., each = pad) keeps the edge row values aligned when R recycles them column-major
padded[1:pad,] = rep(padded[pad + 1,], each = pad)
padded[nrow(img) + pad + 1:pad,] = rep(padded[nrow(padded) - pad,], each = pad)

20150904 output 3

This looks more natural, except where there is writing at the edge of the image. Another way is to pad out the pixels using the median of the entire image.


pad = 2
width = 2 * pad + 1
padded = matrix(median(img), nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

20150904 output 4

I haven’t tested which approach works best. So I leave it to the reader to compare the results from the three different padding approaches, and use whichever gives the best result (although I suspect that there won’t be much difference between the second and third approaches).
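If you do want to compare them, it may help to wrap the three options in a single helper so that only one argument changes between runs. Here is a minimal sketch (padImage is a hypothetical name, not used elsewhere in my scripts):

# pad an image using one of the three approaches discussed above
padImage = function(img, pad = 2, method = c("white", "replicate", "median"))
{
 method = match.arg(method)
 fill = ifelse(method == "median", median(img), 1)
 padded = matrix(fill, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
 padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img
 if (method == "replicate")
 {
  # copy the nearest edge of the original image into the border
  padded[, 1:pad] = padded[, pad + 1]
  padded[, ncol(img) + pad + 1:pad] = padded[, ncol(padded) - pad]
  padded[1:pad, ] = rep(padded[pad + 1, ], each = pad)
  padded[nrow(img) + pad + 1:pad, ] = rep(padded[nrow(padded) - pad, ], each = pad)
 }
 return (padded)
}

# e.g. padded = padImage(img, 2, "median")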

Let’s pull it all together, and add the surrounding pixels as a predictor to a more complete model.


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, data.table, gbm, foreach, doSNOW, biOps, xgboost, Ckmeans.1d.dp)

if (!require("EBImage"))
{
source("http://bioconductor.org/biocLite.R")
biocLite("EBImage")
}

# a function to do k-means thresholding
kmeansThreshold = function(img)
{
# fit 3 clusters
v = img2vec(img)
km.mod = kmeans(v, 3)
# allow for the random ordering of the clusters
oc = order(km.mod$centers)
# the higher threshold is the halfway point between the top of the middle cluster and the bottom of the highest cluster
hiThresh = 0.5 * (max(v[km.mod$cluster == oc[2]]) + min(v[km.mod$cluster == oc[3]]))

# using upper threshold
imgHi = v
imgHi[imgHi <= hiThresh] = 0
imgHi[imgHi > hiThresh] = 1

return (imgHi)
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

# a function to convert an Image into a matrix
Image2Mat = function(Img)
{
m1 = t(matrix(Img, nrow(Img), ncol(Img)))
return(m1)
}

# a function that applies adaptive thresholding
adaptiveThresholding = function(img)
{
img.eb <- Image(t(img))
img.thresholded.3 = thresh(img.eb, 3, 3)
img.thresholded.5 = thresh(img.eb, 5, 5)
img.thresholded.7 = thresh(img.eb, 7, 7)
img.thresholded.9 = thresh(img.eb, 9, 9)
img.thresholded.11 = thresh(img.eb, 11, 11)
img.kmThresh = kmeansThreshold(img)

# combine the adaptive thresholding
ttt.1 = cbind(img2vec(Image2Mat(img.thresholded.3)), img2vec(Image2Mat(img.thresholded.5)), img2vec(Image2Mat(img.thresholded.7)), img2vec(Image2Mat(img.thresholded.9)), img2vec(Image2Mat(img.thresholded.11)), img2vec(kmeansThreshold(img)))
ttt.2 = apply(ttt.1, 1, max)
ttt.3 = matrix(ttt.2, nrow(img), ncol(img))
return (ttt.3)
}

# the canny edge functions (cannyEdges, cannyDilated1, cannyDilated2) and median_Filter
# are unchanged from the earlier scripts in this series, so copy them in from above

# a function that uses median filter to get the background then finds the dark foreground
background_Removal = function(img)
{
w = 5

# the background is found via a median filter
background = median_Filter(img, w)

# the foreground is darker than the background
foreground = img - background
foreground[foreground > 0] = 0
m1 = min(foreground)
m2 = max(foreground)
foreground = (foreground - m1) / (m2 - m1)

return (matrix(foreground, nrow(img), ncol(img)))
}

# a function that groups together the pixels contained within a sliding window around each pixel of interest
proximalPixels = function(img)
{
pad = 2
width = 2 * pad + 1
padded = matrix(median(img), nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

tab = matrix(1, nrow(img) * ncol(img), width ^ 2)
k = 1
for (i in seq_len(width))
{
for (j in seq_len(width))
{
tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
k = k + 1
}
}

return (tab)
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_predicted"

outPath = file.path(outFolder, "trainingdata_blog6.csv")
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))
imgY = readPNG(file.path(cleanFolder, f))

# turn the images into vectors
x = matrix(imgX, nrow(imgX) * ncol(imgX), 1)
y = matrix(imgY, nrow(imgY) * ncol(imgY), 1)

# threshold the image
x2 = kmeansThreshold(imgX)

# adaptive thresholding
x3 = img2vec(adaptiveThresholding(imgX))

# canny edge detector and related features
x4 = img2vec(cannyEdges(imgX))
x5 = img2vec(cannyDilated1(imgX))
x6 = img2vec(cannyDilated2(imgX))

# median filter and related features
x7 = img2vec(median_Filter(imgX, 17))
x8 = img2vec(background_Removal(imgX))

# surrounding pixels
x9 = proximalPixels(imgX)

dat = data.table(cbind(y, x, x2, x3, x4, x5, x6, x7, x8, x9))
setnames(dat,append(c("y", "raw", "thresholded", "adaptive", "canny", "cannyDilated1", "cannyDilated2", "median17", "backgroundRemoval"), paste("x", 1:25, sep="")))
write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit an xgboost model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 2000000)
dat[is.na(dat)] = 0
dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
# do cross validation first
xgb.tab = xgb.cv(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = 1000, early.stop.round = 50, nfold = 5, print.every.n = 10)
# what is the best number of rounds?
min.error.idx = which.min(xgb.tab[, test.rmse.mean])
# now fit an xgboost model
xgb.mod = xgboost(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)

# get the feature importances (used for the variable importance plots below)
model = xgb.dump(xgb.mod, with.stats=TRUE)
names = names(dat)[-1]
importance_matrix = xgb.importance(names, model=xgb.mod)

# show the predicted result for a sample image
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")
x = img2vec(img)
x2 = kmeansThreshold(img)
x3 = img2vec(adaptiveThresholding(img))
x4 = img2vec(cannyEdges(img))
x5 = img2vec(cannyDilated1(img))
x6 = img2vec(cannyDilated2(img))
x7 = img2vec(median_Filter(img, 17))
x8 = img2vec(background_Removal(img))
x9 = proximalPixels(img)
dat = data.table(cbind(x, x2, x3, x4, x5, x6, x7, x8, x9))
setnames(dat,append(c("raw", "thresholded", "adaptive", "canny", "cannyDilated1", "cannyDilated2", "median17", "backgroundRemoval"), paste("x", 1:25, sep="")))
yHatImg = predict(xgb.mod, newdata=as.matrix(dat))
yHatImg[yHatImg < 0] = 0
yHatImg[yHatImg > 1] = 1
imgOut = matrix(yHatImg, nrow(img), ncol(img))
writePNG(imgOut, "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")
plot(raster(imgOut))

20150904 output 5

None of the individual pixels that we added have strong predictive powers, yet our RMSE on the training data dropped from 2.4% for the last blog’s model to 1.4% for this model! That’s because it is the combination of nearby pixels that are predictive, not just any individual pixel. I haven’t scored this model on the test set, but I suspect that it will get you a good ranking.
20150904 output 6

There is hardly any trace of the coffee cup stain remaining. This is a pretty good candidate model.
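If you want to check that training RMSE figure yourself, a minimal sketch looks like this. It assumes you have written each predicted training image into the train_predicted folder (the outFolder defined in the script above):

# training RMSE: compare each predicted image with its cleaned counterpart
library(png)
predFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_predicted"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"

sumSq = 0
n = 0
for (f in list.files(cleanFolder))
{
 yHat = readPNG(file.path(predFolder, f))
 y = readPNG(file.path(cleanFolder, f))
 sumSq = sumSq + sum((yHat - y) ^ 2)
 n = n + length(y)
}
sqrt(sumSq / n)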

In order to understand the effect of the nearby pixels, it is useful to visualise their variable importance in a grid.

pixelNames = paste("x", 1:25, sep="")
pixelNames[13] = "raw"
grid = t(matrix(sapply(1:25, FUN = function(x) importance_matrix$Gain[importance_matrix$Feature == pixelNames[x]]), 5, 5))
grid[3, 3] = NA
plot(raster(grid))

20150904 output 7
This graphic shows that we didn’t need all of the surrounding pixels to create a good predictive model. The pixels that don’t lie on the same row or column as the target pixel aren’t as important. If we wanted to expand beyond the 3×3 sliding window used by our competitors, then we didn’t need to add all of the extra pixels. We could have just added the pixels at (1, 3) and (3, 1) and (3, 5) and (5, 3).

The model that we developed in this blog is pushing the boundaries of what my PC can do. Without access to very powerful computers and / or a cluster, we need to use a different approach if we want to improve on this blog’s model.

Denoising Dirty Documents : Part 4

21 Friday Aug 2015

Posted by Colin Priest in Edge Detection, Image Processing, Kaggle, Machine Learning, Morphology, R

≈ 18 Comments

Tags

Edge Detection, Image Processing, Kaggle, Machine Learning, Morphology, R

At the end of the last blog, we had made some progress in removing the coffee cup stain from the image, but we needed to do more.

20150815 output 5

Adaptive thresholding has started to separate the writing from the stain, but it has also created a speckled pattern within the stains. We need to engineer a feature that can tell apart a stroke of writing from a speckled local maxima i.e. distinguish a ridge from a peak in the 3D surface.

In image processing, we do this via edge detection, which is the process of calculating the slope of the 3D surface of the image, and retaining lines where the slope is high. There are several different standard algorithms to do edge detection, and today we will use the canny edge detector.
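
To see the “slope of the surface” idea in action without any extra packages, here is a rough sketch of my own (the function name and the 0.1 cutoff are assumptions): it computes a finite-difference gradient magnitude and keeps only the strong slopes. It is not the canny detector we use below, which also smooths the image, thins the edges via non-maximum suppression and applies hysteresis thresholding.

# a rough finite-difference edge sketch in base R (my own illustration; the cutoff
# of 0.1 is an assumed value, and this is not the canny detector used below)
simpleEdges = function(img, cutoff = 0.1)
{
  gx = cbind(img[, -1] - img[, -ncol(img)], 0)   # slope in the horizontal direction
  gy = rbind(img[-1, ] - img[-nrow(img), ], 0)   # slope in the vertical direction
  g = sqrt(gx ^ 2 + gy ^ 2)                      # steepness of the 3D surface
  return ((g > cutoff) * 1)                      # keep only the strong slopes
}

# e.g. plot(raster(simpleEdges(img))), with img loaded via readPNG as elsewhere in this post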

The biOps package, which has an implementation of the canny edge detector, has been removed from CRAN. It has been migrated to Google Code. You will need to follow the installation instructions that can be found here. For example, since I am using 64 bit Windows, I mostly followed these instructions from the web site:

Windows 64 bit

    1. Note: we dropped jpeg and tiff IO functions.
    2. Download (go to the DLL’s page then download the raw file) libfftw3-3.dll, libfftw3f-3.dll, libfftw3l-3.dll, and zlib1.dll to C:\Program Files\R\R-3.x.x\bin\x64 (x.x needs to be edited) or somewhere present in the PATH variables. Make sure that the downloaded dll files are megabytes (MB) in size, not just a few kilobytes.
    3. Download biOps_0.2.2.zip (go to the DLL’s page then download the raw file). Make sure that the file size is around 700KB.
    4. Run 64 bit R.
    5. Choose biOps_0.2.2.zip from Packages>Install package(s)…
    6. Load the library.
> library(biOps)

However, for step 5 I used the following R code:


install.packages("C:/Program Files/R/R-3.1.3/bin/biOps_0.2.2.zip")

That’s because I had downloaded biOps_0.2.2.zip into the C:/Program Files/R/R-3.1.3/bin folder. You should substitute the folder path that you downloaded the zip file into.

Update!!!: Google has switched off Google Code. But I have found the dlls inside an archive that you can download from here. Warning: the download is large (613.91MB).

Now we can start to experiment with edge detection. Note that biOps images have pixel brightnesses from 0 to 255 rather than from 0 to 1. So we have to rescale whenever we switch packages.


if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, biOps)

# read in the coffee cup stain image
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")
plot(raster(img))

# convert the image to a biOps image
img.biOps = imagedata(img * 255)
plot.imagedata(img.biOps)

# canny edge detection
img.canny = imgCanny(img.biOps, 0.7)
plot.imagedata(img.canny)

20150822 output 1

One of the things I like about this particular edge detection algorithm is that it has automatically thresholded the edges so that we are only left with the strong edges. So here are my observations about the behaviour of the edges:

  • they surround the writing
  • they surround the stains
  • there are few edges within a stain, so a lack of edges in a region may be a useful feature for removing the speckles within a stain
  • writing has a pair of parallel edges around each stroke, while the boundary of the stain has only a single edge, so the presence of a pair of edges may be a useful feature for separating stains from writing

To take advantage of these observations we shall use image morphology. Dilation is the process of making a line or blob thicker by expanding its boundary by one pixel. Erosion is the opposite, removing a one pixel thick layer from the boundary of an object. If we dilate the edges, then the pair of edges around each stroke of writing will expand to include the writing inside, and the edge of the stain will also expand.
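
As a toy illustration of dilation and erosion (a sketch of my own in base R, rather than the biOps functions this post actually uses), both operations can be written as a max or min over each pixel’s 3×3 neighbourhood:

# a toy 3 x 3 dilation / erosion on a binary 0/1 image (illustration only)
# op = max gives dilation (a pixel is on if any neighbour is on)
# op = min gives erosion (a pixel stays on only if all neighbours are on)
morph3x3 = function(img, op = max)
{
  padded = matrix(0, nrow(img) + 2, ncol(img) + 2)
  padded[1 + seq_len(nrow(img)), 1 + seq_len(ncol(img))] = img
  out = img
  for (i in seq_len(nrow(img)))
    for (j in seq_len(ncol(img)))
      out[i, j] = op(padded[i:(i + 2), j:(j + 2)])
  return (out)
}

# dilating and then eroding (a morphological closing) fills the gap between the
# pair of parallel edges around each pen stroke, e.g.
# closed = morph3x3(morph3x3(edges, max), min)

The biOps version of the same idea follows.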


# do some morphology on the edges to fill the gaps between them
mat <- matrix (0, 3, 3)
mask <- imagedata (mat, "grey", 3, 3)
img.dilation = imgBinaryDilation(img.canny, mask)
plot.imagedata(img.dilation)

20150822 output 3

The writing is all black, whereas most of the stain is white. This will probably be a useful feature for removing the coffee cup stain. But we can do more: now that we have dilated the edges, we can erode them, which removes the regions where we started with only a single edge.

20150822 output 2

This looks pretty good: all of the writing is black, but only a small part of the stain remains. The stain is left as a thin line, while the writing forms thick lines. So we can erode once, then dilate once, and the thin lines will disappear.


# do some morphology to remove the stain lines
mat <- matrix (0, 3, 3)
mask <- imagedata (mat, "grey", 3, 3)
img.dilation = imgBinaryDilation(img.canny, mask)
img.erosion = imgBinaryErosion(img.dilation, mask)
img.erosion.2 = imgBinaryErosion(img.erosion, mask)
img.dilation.2 = imgBinaryDilation(img.erosion.2, mask)
img.dilation.3 = imgBinaryDilation(img.dilation.2, mask)
plot.imagedata(img.dilation.3)

20150822 output 4

The stain is now almost completely removed! But unfortunately some of the writing has been removed too. So it is an imperfect feature for removing the coffee cup stain.

Let’s put it all together with the existing features that we have developed over the past few blogs, by adding canny edges and the dilated / eroded edges to the gradient boosted model:


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, data.table, gbm, foreach, doSNOW, biOps)

if (!require("EBImage"))
{
source("http://bioconductor.org/biocLite.R")
biocLite("EBImage")
}

# a function to do k-means thresholding
kmeansThreshold = function(img)
{
# fit 3 clusters
v = img2vec(img)
km.mod = kmeans(v, 3)
# allow for the random ordering of the clusters
oc = order(km.mod$centers)
# the higher threshold is the halfway point between the top of the middle cluster and the bottom of the highest cluster
hiThresh = 0.5 * (max(v[km.mod$cluster == oc[2]]) + min(v[km.mod$cluster == oc[3]]))

# using upper threshold
imgHi = v
imgHi[imgHi <= hiThresh] = 0
imgHi[imgHi > hiThresh] = 1

return (imgHi)
}

# a function that applies adaptive thresholding
adaptiveThresholding = function(img)
{
img.eb <- Image(t(img))
img.thresholded.3 = thresh(img.eb, 3, 3)
img.thresholded.5 = thresh(img.eb, 5, 5)
img.thresholded.7 = thresh(img.eb, 7, 7)
img.thresholded.9 = thresh(img.eb, 9, 9)
img.thresholded.11 = thresh(img.eb, 11, 11)
img.kmThresh = kmeansThreshold(img)

# combine the adaptive thresholding
ttt.1 = cbind(img2vec(Image2Mat(img.thresholded.3)), img2vec(Image2Mat(img.thresholded.5)), img2vec(Image2Mat(img.thresholded.7)), img2vec(Image2Mat(img.thresholded.9)), img2vec(Image2Mat(img.thresholded.11)), img2vec(kmeansThreshold(img)))
ttt.2 = apply(ttt.1, 1, max)
ttt.3 = matrix(ttt.2, nrow(img), ncol(img))
return (ttt.3)
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

# a function to convert an Image into a matrix
Image2Mat = function(Img)
{
m1 = t(matrix(Img, nrow(Img), ncol(Img)))
return(m1)
}

# a function to do canny edge detector
cannyEdges = function(img)
{
img.biOps = imagedata(img * 255)
img.canny = imgCanny(img.biOps, 0.7)
return (matrix(img.canny / 255, nrow(img), ncol(img)))
}

# a function combining canny edge detector with morphology
cannyDilated1 = function(img)
{
img.biOps = imagedata(img * 255)
img.canny = imgCanny(img.biOps, 0.7)
# do some morphology on the edges to fill the gaps between them
mat <- matrix (0, 3, 3)
mask <- imagedata (mat, "grey", 3, 3)
img.dilation = imgBinaryDilation(img.canny, mask)
img.erosion = imgBinaryErosion(img.dilation, mask)
return(matrix(img.erosion / 255, nrow(img), ncol(img)))
}

# a function combining canny edge detector with morphology
cannyDilated2 = function(img)
{
img.biOps = imagedata(img * 255)
img.canny = imgCanny(img.biOps, 0.7)
# do some morphology on the edges to fill the gaps between them
mat <- matrix (0, 3, 3)
mask <- imagedata (mat, "grey", 3, 3)
img.dilation = imgBinaryDilation(img.canny, mask)
img.erosion = imgBinaryErosion(img.dilation, mask)
img.erosion.2 = imgBinaryErosion(img.erosion, mask)
img.dilation.2 = imgBinaryDilation(img.erosion.2, mask)
return(matrix(img.dilation.2 / 255, nrow(img), ncol(img)))
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_predicted"

outPath = file.path(outFolder, "trainingdata.csv")
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))
imgY = readPNG(file.path(cleanFolder, f))

# turn the images into vectors
x = matrix(imgX, nrow(imgX) * ncol(imgX), 1)
y = matrix(imgY, nrow(imgY) * ncol(imgY), 1)

# threshold the image
x2 = kmeansThreshold(imgX)

# adaptive thresholding
x3 = img2vec(adaptiveThresholding(imgX))

# canny edge detector and related features
x4 = img2vec(cannyEdges(imgX))
x5 = img2vec(cannyDilated1(imgX))
x6 = img2vec(cannyDilated2(imgX))

dat = data.table(cbind(y, x, x2, x3, x4, x5, x6))
setnames(dat,c("y", "raw", "thresholded", "adaptive", "canny", "cannyDilated1", "cannyDilated2"))
write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit a model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 1000000)
gbm.mod = gbm(y ~ raw + thresholded + adaptive + canny + cannyDilated1 + cannyDilated2, data = dat[rows,], n.trees = 10000, train.fraction = 0.5, interaction.depth = 5)
best.iter <- gbm.perf(gbm.mod)

s = summary(gbm.mod)

# get the predictions - using parallel processing to save time
numCores = 6 # change the 6 to your number of CPU cores (or fewer if RAM is limited)
cl = makeCluster(numCores)
registerDoSNOW(cl)
num_splits = numCores
split_testing = sort(rank(1:nrow(dat))%%numCores)
yHat = foreach(i=unique(split_testing),.combine=c,.packages=c("gbm")) %dopar% {
as.numeric(predict(gbm.mod, newdata=dat[split_testing==i,], n.trees = best.iter))
}
stopCluster(cl)
yHat[yHat < 0] = 0
yHat[yHat > 1] = 1
# what score do we get on the training data?
rmse = sqrt(mean( (yHat - dat$y) ^ 2 ))
print(rmse) # 4.1%

# show the predicted result for a sample image
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")
x = data.table(matrix(img, nrow(img) * ncol(img), 1), kmeansThreshold(img), img2vec(adaptiveThresholding(img)), img2vec(cannyEdges(img)), img2vec(cannyDilated1(img)), img2vec(cannyDilated2(img)))
setnames(x, c("raw", "thresholded", "adaptive", "canny", "cannyDilated1", "cannyDilated2"))
yHatImg = predict(gbm.mod, newdata=x, n.trees = best.iter)
yHatImg[yHatImg < 0] = 0
yHatImg[yHatImg > 1] = 1
imgOut = matrix(yHatImg, nrow(img), ncol(img))
writePNG(imgOut, "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")
plot(raster(imgOut))

20150822 output 5

The first dilation operation that we tried in this blog is the most powerful predictor that we developed today, even more powerful than the adaptive thresholding that we used in the last blog. We have improved the RMSE score on the training set from 5.4% to 4.1%.

20150822 sample

The model doesn’t completely remove the coffee cup stain, but it has faded the stain enough that we have a good chance at removing it later. In the next blog in this series, we will do more to clean up the coffee cup stain.

Denoising Dirty Documents: Part 3

14 Friday Aug 2015

Posted by Colin Priest in Adaptive Thresholding, GBM, Gradient Boosting Machine, Image Processing, Kaggle, Machine Learning, R

≈ 9 Comments

Tags

Adaptive Thresholding, GBMs, Image Processing, Kaggle, Machine Learning, R

In my last blog, I discussed how to threshold an image, and we ended up doing quite a good job of cleaning up the image with the creased paper. But then I finished the blog by showing that our algorithm had a problem: it didn’t cope well with coffee cup stains.

coffee stain cropped

The problem is that coffee cup stains are dark, so our existing algorithm does not distinguish between dark writing and dark stains. We need to find some features that distinguish between dark stains and dark writing.

20150808 plot 7

Looking at the image, we can see that even though the stains are dark, the writing is darker. The stains are dark compared to the overall image, but lighter than the writing within them.

So we can hypothesise that we might be able to separate the writing from the stains by using thresholding that looks at the pixel brightnesses over a localised set of pixels rather than across all of the image. That way we might separate the darker writing from the dark background stain.
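
Before reaching for specialised packages, the idea can be sketched directly in base R. The function below is my own illustration (the window half-width of 5 and the offset of 0.05 are assumed values, and the double loop is far too slow for real use); it marks a pixel as writing only if it is clearly darker than the mean of a square window of pixels around it:

# a rough local-mean thresholding sketch (illustration only; w and k are assumed values)
localThreshold = function(img, w = 5, k = 0.05)
{
  out = matrix(1, nrow(img), ncol(img))
  for (i in seq_len(nrow(img)))
  {
    for (j in seq_len(ncol(img)))
    {
      rows = max(1, i - w):min(nrow(img), i + w)
      cols = max(1, j - w):min(ncol(img), j + w)
      # mark as writing (0) only the pixels clearly darker than their neighbourhood
      if (img[i, j] < mean(img[rows, cols]) - k) out[i, j] = 0
    }
  }
  return (out)
}

# e.g. plot(raster(localThreshold(img))) for the coffee cup stain image loaded below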

While I am an R programmer and a fan of R, I have to admit that R can be a frustrating platform for image processing. For example, I cannot get the adimpro package to work at all, despite much experimentation and tinkering, and googling for solutions. Then I discovered that the biOps package was removed from CRAN, and the archived versions refuse to install on my PC. But eventually I found ways to get image processing done in R.

First, I tried the EbayesThresh package.


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, EbayesThresh)

# sample image containing coffee cup stain
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")


# using EbayesThresh
et = ebayesthresh(img)
plot(raster(et))
range(et)

20150815 output 1
All of the values are zero! That isn’t very helpful.

Second, I tried the treethresh package.


# experiment with treethresh
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, treethresh)

# sample image containing coffee cup stain
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")

# using tree thresholding
tt = treethresh(img)
tt.pruned <- prune(tt)
tt.thresholded <- thresh(tt.pruned)
plot(raster(tt.thresholded))
range(tt.thresholded)

Once again I only got zeroes.

Finally, I tried wavelet thresholding, using the wavethresh package.

# experiment with wavelet thresholding
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, treethresh, wavethresh)

# sample image containing coffee cup stain
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")

# using wavelet thresholding
# need a square image whose width is a power of 2
# compute the wavelet transform
wt.square <- imwd(img[1:256, 1:256])
# perform the thresholding
wt.thresholded = threshold(wt.square)
# compute the inverse wavelet transform
wt.denoised <- imwr(wt.thresholded)
plot(raster(wt.denoised))

20150815 output 2
It wasn’t what I was looking for, but at least it’s not just a matrix of zeroes!

After these disappointing experiments with CRAN packages, I decided that I needed to look beyond CRAN and try out the EBImage package on Bioconductor.


if (!require("EBImage"))
{
source("http://bioconductor.org/biocLite.R")
biocLite("EBImage")
}

# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster)

# sample image containing coffee cup stain
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")

# using adaptive thresholding
img.eb = readImage("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")
img.thresholded.3 = thresh(img.eb, 3, 3)
display(img.thresholded.3)
img.thresholded.5 = thresh(img.eb, 5, 5)
display(img.thresholded.5)
img.thresholded.7 = thresh(img.eb, 7, 7)
display(img.thresholded.7)
img.thresholded.9 = thresh(img.eb, 9, 9)
display(img.thresholded.9)

20150815 output 3

These images show that we are on the right track. The adaptive thresholding is good at separating the letters from the background, but it also adds black to the background. While the adaptive thresholding may not be perfect, any feature extraction that adds to our knowledge will probably improve our ensemble model.

Before we add adaptive thresholding to our model, we will do a bit more pre-processing. Based on the observation that adaptive thresholding is keeping the letters but adding some unwanted black to the background, we can combine the adaptive thresholding images with the normal thresholding image, and take the maximum pixel brightness across that set of images.


# a function to do k-means thresholding
kmeansThreshold = function(img)
{
# fit 3 clusters
v = img2vec(img)
km.mod = kmeans(v, 3)
# allow for the random ordering of the clusters
oc = order(km.mod$centers)
# the higher threshold is the halfway point between the top of the middle cluster and the bottom of the highest cluster
hiThresh = 0.5 * (max(v[km.mod$cluster == oc[2]]) + min(v[km.mod$cluster == oc[3]]))

# using upper threshold
imgHi = v
imgHi[imgHi <= hiThresh] = 0
imgHi[imgHi > hiThresh] = 1

return (imgHi)
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}
img.thresholded.3 = thresh(img.eb, 3, 3)
img.thresholded.5 = thresh(img.eb, 5, 5)
img.thresholded.7 = thresh(img.eb, 7, 7)
img.thresholded.9 = thresh(img.eb, 9, 9)
img.thresholded.11 = thresh(img.eb, 11, 11)

# a function to convert an Image into a matrix
Image2Mat = function(Img)
{
m1 = t(matrix(Img, nrow(Img), ncol(Img)))
return(m1)
}

# combine the adaptive thresholding
ttt.1 = cbind(img2vec(Image2Mat(img.thresholded.3)), img2vec(Image2Mat(img.thresholded.5)), img2vec(Image2Mat(img.thresholded.7)),
img2vec(Image2Mat(img.thresholded.9)), img2vec(Image2Mat(img.thresholded.11)), img2vec(kmeansThreshold(img)))
ttt.2 = apply(ttt.1, 1, max)
ttt.3 = matrix(ttt.2, nrow(img), ncol(img))
plot(raster(ttt.3))

20150815 output 4

It’s starting to look much better. The coffee cup stain isn’t removed, but we are starting to successfully clean it up. This suggests that adaptive thresholding will be an important predictor in our ensemble.

Let’s bring it all together, and build a new ensemble model, extending the model from my last blog on this subject.


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, data.table, gbm, foreach, doSNOW)

if (!require("EBImage"))
{
source("http://bioconductor.org/biocLite.R")
biocLite("EBImage")
}

# a function to do k-means thresholding
kmeansThreshold = function(img)
{
# fit 3 clusters
v = img2vec(img)
km.mod = kmeans(v, 3)
# allow for the random ordering of the clusters
oc = order(km.mod$centers)
# the higher threshold is the halfway point between the top of the middle cluster and the bottom of the highest cluster
hiThresh = 0.5 * (max(v[km.mod$cluster == oc[2]]) + min(v[km.mod$cluster == oc[3]]))

# using upper threshold
imgHi = v
imgHi[imgHi <= hiThresh] = 0
imgHi[imgHi > hiThresh] = 1

return (imgHi)
}

# a function that applies adaptive thresholding
adaptiveThresholding = function(img)
{
img.eb <- Image(t(img))
img.thresholded.3 = thresh(img.eb, 3, 3)
img.thresholded.5 = thresh(img.eb, 5, 5)
img.thresholded.7 = thresh(img.eb, 7, 7)
img.thresholded.9 = thresh(img.eb, 9, 9)
img.thresholded.11 = thresh(img.eb, 11, 11)
img.kmThresh = kmeansThreshold(img)

# combine the adaptive thresholding
ttt.1 = cbind(img2vec(Image2Mat(img.thresholded.3)), img2vec(Image2Mat(img.thresholded.5)), img2vec(Image2Mat(img.thresholded.7)), img2vec(Image2Mat(img.thresholded.9)), img2vec(Image2Mat(img.thresholded.11)), img2vec(kmeansThreshold(img)))
ttt.2 = apply(ttt.1, 1, max)
ttt.3 = matrix(ttt.2, nrow(img), ncol(img))
return (ttt.3)
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

# a function to convert an Image into a matrix
Image2Mat = function(Img)
{
m1 = t(matrix(Img, nrow(Img), ncol(Img)))
return(m1)
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_predicted"

outPath = file.path(outFolder, "trainingdata.csv")
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))
imgY = readPNG(file.path(cleanFolder, f))

# turn the images into vectors
x = matrix(imgX, nrow(imgX) * ncol(imgX), 1)
y = matrix(imgY, nrow(imgY) * ncol(imgY), 1)

# threshold the image
x2 = kmeansThreshold(imgX)

# adaptive thresholding
x3 = img2vec(adaptiveThresholding(imgX))

dat = data.table(cbind(y, x, x2, x3))
setnames(dat,c("y", "raw", "thresholded", "adaptive"))
write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit a model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 1000000)
gbm.mod = gbm(y ~ raw + thresholded + adaptive, data = dat[rows,], n.trees = 7500, cv.folds = 3, train.fraction = 0.5, interaction.depth = 5)
best.iter <- gbm.perf(gbm.mod, method="cv", oobag.curve = FALSE)

s = summary(gbm.mod)

# get the predictions - using parallel processing to save time
numCores = 6 # change the 6 to your number of CPU cores (or fewer if RAM is limited)
cl = makeCluster(numCores)
registerDoSNOW(cl)
num_splits = numCores
split_testing = sort(rank(1:nrow(dat))%%numCores)
yHat = foreach(i=unique(split_testing),.combine=c,.packages=c("gbm")) %dopar% {
as.numeric(predict(gbm.mod, newdata=dat[split_testing==i,], n.trees = best.iter))
}
stopCluster(cl)

# what score do we get on the training data?
rmse = sqrt(mean( (yHat - dat$y) ^ 2 ))
print(rmse)

# show the predicted result for a sample image
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")
x = data.table(matrix(img, nrow(img) * ncol(img), 1), kmeansThreshold(img), img2vec(adaptiveThresholding(img)))
setnames(x, c("raw", "thresholded", "adaptive"))
yHat = predict(gbm.mod, newdata=x, n.trees = best.iter)
imgOut = matrix(yHat, nrow(img), ncol(img))
writePNG(imgOut, "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")
plot(raster(imgOut))

20150815 output 6

The adaptive thresholding gets a relative importance score of 2.4%, but full image thresholding loses all of its importance in the model.

20150815 output 5

The result may not be perfect, but you can see how the coffee cup stain is starting to be erased. We will need to find some more features to continue the clean up, but that will be left to future blogs.

This blog’s model has improved the RMSE on the training data from 6.5% reported in my last blog to 5.4% in this latest version. There’s a long way to go to have a top 10 performing model, but this process is steadily moving us in the right direction.
