Tutorial: Sentiment Analysis of Airlines Using the syuzhet Package and Twitter

Tags

, , ,

In my last job, I was a frequent flyer. Each week I flew between 2 or 3 countries, briefly returning for 24 hours on the weekend to get a change of clothes. My favourite airlines were Cathay Pacific, Emirates and Singapore Air. Now, unless you have been living in a cave, you’d be well aware of the recent news story of how United Airlines removed David Dao from an aircraft. I wondered how that incident had affected United’s brand value, and being a data scientist I decided to do sentiment analysis of United versus my favourite airlines.

Way back on 4th July 2015, almost two years ago, I wrote a blog entitled Tutorial: Using R and Twitter to Analyse Consumer Sentiment. Even though that blog post is one of my earliest, it continues to be the most popular, attracting just as many readers per day as when I first wrote it.

Since the sentiment package, upon which that blog was based, is no longer supported by CRAN, and many readers have problems with the manual and technical process of installing an obsolete package from an archive, I have written a new blog using a different, live CRAN package. The syuzhet package was published only several weeks ago, and offers a range of different sentiment analysis models. So I’ve started to try it out.

I have collected tweets for 4 airlines:

  1. Cathay Pacific
  2. Emirates
  3. Singapore Air
  4. United Airlines

The tweet data starts at 01-Jan-2015 and go up to mid-April 2017.

Step 1: Load the tweets and load the relevant packages

library(foreign)
library(syuzhet)
library(lubridate)
library(plyr)
library(ggplot2)
library(tm)
library(wordcloud)

# get the data for the tweets
dataURL = 'https://s3-ap-southeast-1.amazonaws.com/colinpriest/tweets.zip'
if (! file.exists('tweets.zip')) download.file(dataURL, 'tweets.zip')
if (! file.exists('tweets.dbf')) unzip('tweets.zip')
tweets = read.dbf('tweets.dbf', as.is = TRUE)

 

I’ve stored the tweets in a dbf file and zipped it. The zip file is 68MB in size, and the dbf file is 353MB. The code shown above downloads the zip file, extracts the dbf and then reads the dbf file into a data.frame.

Step 2: Do Sentiment Scoring using the syuzhet package


# function to get various sentiment scores, using the syuzhet package
scoreSentiment = function(tab)
{
 tab$syuzhet = get_sentiment(tab$Text, method="syuzhet")
 tab$bing = get_sentiment(tab$Text, method="bing")
 tab$afinn = get_sentiment(tab$Text, method="afinn")
 tab$nrc = get_sentiment(tab$Text, method="nrc")
 emotions = get_nrc_sentiment(tab$Text)
 n = names(emotions)
 for (nn in n) tab[, nn] = emotions[nn]
 return(tab)
}

# get the sentiment scores for the tweets
tweets = scoreSentiment(tweets)
tweets = tweets[tweets$TimeStamp < as.Date('19-04-2017', format = '%d-%m-%Y'),]

The syuzhet package offers a few different algorithms, each taking a different approach to sentiment scoring. It also does emotion scoring based upon the nrc algorithm. The code above calculates scores using the syuzhet, bing, afinn and nrc algorithms, adding columns with the scores from each algorithm.

Step 3: Visualise the Sentiment Scores


# function to find the week in which a date occurs
round_weeks <- function(x)
{
require(data.table)
dt = data.table(i = 1:length(x), day = x, weekday = weekdays(x))
offset = data.table(weekday = c('Sunday', 'Monday', 'Tuesday', 'Wednesday',
'Thursday', 'Friday', 'Saturday'),
offset = -(0:6))
dt = merge(dt, offset, by="weekday")
dt[ , day_adj := day + offset]
setkey(dt, i)
return(dt[ , day_adj])
}
# get daily summaries of the results
daily = ddply(tweets, ~ Airline + TimeStamp, summarize, num_tweets = length(positive), ave_sentiment = mean(bing),
 ave_negative = mean(negative), ave_positive = mean(positive), ave_anger = mean(anger))

# plot the daily sentiment
ggplot(daily, aes(x=TimeStamp, y=ave_sentiment, colour=Airline)) + geom_line() +
 ggtitle("Airline Sentiment") + xlab("Date") + ylab("Sentiment") + scale_x_date(date_labels = '%d-%b-%y')

# get weekly summaries of the results
weekly = ddply(tweets, ~ Airline + week, summarize, num_tweets = length(positive), ave_sentiment = mean(bing),
 ave_negative = mean(negative), ave_positive = mean(positive), ave_anger = mean(anger))

# plot the weekly sentiment
ggplot(weekly, aes(x=week, y=ave_sentiment, colour=Airline)) + geom_line() +
 ggtitle("Airline Sentiment") + xlab("Date") + ylab("Sentiment") + scale_x_date(date_labels = '%d-%b-%y')

The code above summarises the sentiment for each airline across time. The first plot shows the daily sentiment values for each airline:

20170429 plot 01 daily sentiment

Based upon the bing sentiment algorithm, United has the poorest sentiment, and Singapore has the best sentiment. United usually has negative sentiment. Daily to day random fluctuations in sentiment make this a cluttered graph, so I decided to summarise the sentiment weekly instead of daily:

20170429 plot 02 weekly sentiment

Now it’s easier to see the differences in sentiment between the four airlines. While Emirates and Cathay Pacific have similar levels of sentiment, the values for Emirates are more stable. This, however, may be due to the sheer volume of tweets about Emirates versus the smaller number of tweets about Cathay Pacific.

20170429 plot 03 positive sentiment

Step 4: Compare the Sentiment Algorithms

The sentiment scores above use the bing algorithm, but we should check whether the different algorithms produce different results.


# compare the sentiment for across the algorithms
algorithms = tweets[rep(1, nrow(tweets) * 4), c("week", "syuzhet", "Airline", "Airline")]
names(algorithms) = c("TimeStamp", "Sentiment", "Algorithm", "Airline")
algorithms$Algorithm = "syuzhet"
algorithms[seq_len(nrow(tweets)), c("TimeStamp", "Sentiment", "Airline")] = tweets[,c("TimeStamp", "syuzhet", "Airline")]
algorithms[nrow(tweets) + seq_len(nrow(tweets)), c("TimeStamp", "Sentiment", "Airline")] = tweets[,c("TimeStamp", "bing", "Airline")]
algorithms$Algorithm[nrow(tweets) + seq_len(nrow(tweets))] = "bing"
algorithms[2 * nrow(tweets) + seq_len(nrow(tweets)), c("TimeStamp", "Sentiment", "Airline")] = tweets[,c("TimeStamp", "afinn", "Airline")]
algorithms$Algorithm[2 * nrow(tweets) + seq_len(nrow(tweets))] = "afinn"
algorithms[3 * nrow(tweets) + seq_len(nrow(tweets)), c("TimeStamp", "Sentiment", "Airline")] = tweets[,c("TimeStamp", "nrc", "Airline")]
algorithms$Algorithm[3 * nrow(tweets) + seq_len(nrow(tweets))] = "nrc"

# get the algorithm averages for each airline
averages = ddply(algorithms, ~ Airline + Algorithm, summarize, ave_sentiment = mean(Sentiment))
averages$ranking = 1
for (alg in c("syuzhet", "bing", "afinn", "nrc")) averages$ranking[averages$Algorithm == alg] = 5 - rank(averages$ave_sentiment[averages$Algorithm == alg])
averages = averages[order(averages$Airline, averages$Algorithm), ]

The code above was a bit clumsy – I probably should have used reshape.

20170429 plot 08 sentiment algorithm comparisons

The different algorithms give similar rankings between the airlines with one big exception: the nrc algorithm is surprisingly positive about United and unusually negative about Singapore Air compared to the other algorithms. This goes to show that sentiment analysis isn’t just a plug and play technique and also means that a warning should be applied to the emotion analysis shown in Step 5 below, as it is based upon the nrc algorithm!

Step 5: Emotion Analysis

Noting the warning, from the previous section, let’s compare the emotions between the airlines and between tweets, using the nrc algorithm.


ggplot(weekly, aes(x=week, y=ave_negative, colour=Airline)) + geom_line() +
ggtitle("Airline Sentiment (Positive Only)") + xlab("Date") + ylab("Sentiment") + scale_x_date(date_labels = '%d-%b-%y')

ggplot(weekly, aes(x=week, y=ave_positive, colour=Airline)) + geom_line() +
ggtitle("Airline Sentiment (Negative Only)") + xlab("Date") + ylab("Sentiment") + scale_x_date(date_labels = '%d-%b-%y')

ggplot(weekly, aes(x=week, y=ave_anger, colour=Airline)) + geom_line() +
ggtitle("Airline Sentiment (Anger Only)") + xlab("Date") + ylab("Sentiment") + scale_x_date(date_labels = '%d-%b-%y')

# function to make the text suitable for analysis
clean.text = function(x)
{
# tolower
x = tolower(x)
# remove rt
x = gsub("rt", "", x)
# remove at
x = gsub("@\\w+", "", x)
# remove punctuation
x = gsub("[[:punct:]]", "", x)
# remove numbers
x = gsub("[[:digit:]]", "", x)
# remove links http
x = gsub("http\\w+", "", x)
# remove tabs
x = gsub("[ |\t]{2,}", "", x)
# remove blank spaces at the beginning
x = gsub("^ ", "", x)
# remove blank spaces at the end
x = gsub(" $", "", x)
return(x)
}

# emotion analysis: anger, anticipation, disgust, fear, joy, sadness, surprise, trust
# put everything in a single vector
all = c(
paste(tweets$Text[tweets$anger > 0], collapse=" "),
paste(tweets$Text[tweets$anticipation > 0], collapse=" "),
paste(tweets$Text[tweets$disgust > 0], collapse=" "),
paste(tweets$Text[tweets$fear > 0], collapse=" "),
paste(tweets$Text[tweets$joy > 0], collapse=" "),
paste(tweets$Text[tweets$sadness > 0], collapse=" "),
paste(tweets$Text[tweets$surprise > 0], collapse=" "),
paste(tweets$Text[tweets$trust > 0], collapse=" ")
)
# clean the text
all = clean.text(all)
# remove stop-words
# adding extra domain specific stop words
all = removeWords(all, c(stopwords("english"), 'singapore', 'singaporeair',
'emirates', 'united', 'airlines', 'unitedairlines',
'cathay', 'pacific', 'cathaypacific', 'airline',
'airlinesunited', 'emiratesemirates', 'pacifics'))
#
# create corpus
corpus = Corpus(VectorSource(all))
#
# create term-document matrix
tdm = TermDocumentMatrix(corpus)
#
# convert as matrix
tdm = as.matrix(tdm)
#
# add column names
colnames(tdm) = c('anger', 'anticipation', 'disgust', 'fear', 'joy', 'sadness', 'surprise', 'trust')
#
# Plot comparison wordcloud
layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(0, 4))
plot.new()
text(x=0.5, y=0.5, 'Emotion Comparison Word Cloud')
comparison.cloud(tdm, random.order=FALSE,
colors = c("#00B2FF", "red", "#FF0099", "#6600CC", "green", "orange", "blue", "brown"),
title.size=1.5, max.words=250)

&nbsp;

The code above plots the emotions across time for each airline.

United Airlines attracts more angry tweets, and this has spiked in April 2017 following the David Dao incident. But United Airlines also attracts more positive tweets than the other airlines. This might explain the ranking differences between the algorithms – maybe the algorithms weight positive tweets differently to negative tweets.

Then the code creates a comparison word cloud, to show the different words in airline tweets that are associated with each emotion.

20170429 plot 09 emotions

Step 6: Compare the Different Tweeting Behaviour of Different Twitter Users

Are some users more positive than others? Is this user behaviour different between the airlines? Do people who tweet more have a different sentiment to those who tweet about airlines less frequently? Are particular users dragging the average up or down? To answer these questions, I have tracked the 100 users who tweeted the most about these airlines.


# get the user summaries of the results
users = ddply(tweets, ~ Airline + UserName, summarize, num_tweets = length(positive), ave_sentiment = mean(bing),
ave_negative = mean(negative), ave_positive = mean(positive), ave_anger = mean(anger))
sizeSentiment = ddply(users, ~ num_tweets, summarize, ave_sentiment = mean(ave_sentiment),
ave_negative = mean(ave_negative), ave_positive = mean(ave_positive), ave_anger = mean(ave_anger))
sizeSentiment$num_tweets = as.numeric(sizeSentiment$num_tweets)

# plot users positive versus negative with bubble plot
cutoff = sort(users$num_tweets, decreasing = TRUE)[100]
ggplot(users[users$num_tweets > cutoff,], aes(x = ave_positive, y = ave_negative, size = num_tweets, fill = Airline)) +
geom_point(shape = 21) +
ggtitle("100 Most Prolific Tweeters About Airlines") +
labs(x = "Positive Sentiment", y = "Negative Sentiment")
#
ggplot(sizeSentiment, aes(x = num_tweets, y = ave_sentiment)) + geom_point() + stat_smooth(method = "loess", size = 1, span = 0.35) +
ggtitle("Number of Tweets versus Sentiment") + scale_x_log10() +
labs(x = "Positive Sentiment", y = "Negative Sentiment")

Firstly let’s look at the behaviour of individual users:

20170429 plot 06 top 100 tweeters

Top user sentiment is quite different by airline. Emirates has a number of frequent tweeters who are unemotional, who on average post neither positive nor negative sentiment. United Airlines attracts more emotional posts. Singapore Air and Cathay Pacific have big users that post a lot of tweets about them.

20170429 plot 07 sentiment versus tweet count

However, on average bigger frequent tweeters post a similar balance of positive and negative content to smaller users who tweet infrequently.

Step 7: Compare The Words Used to Describe Each Airline

In order to explain the differences in sentiment, we can create a word cloud that contrasts the words used in posts about each airline.


# Join texts in a vector for each company
txt1 = paste(tweets$Text[tweets$Airline == 'United'], collapse=" ")
txt2 = paste(tweets$Text[tweets$Airline == 'SingaporeAir'], collapse=" ")
txt3 = paste(tweets$Text[tweets$Airline == 'Emirates'], collapse=" ")
txt4 = paste(tweets$Text[tweets$Airline == 'Cathay Pacific'], collapse=" ")
#
# put everything in a single vector
all = c(clean.text(txt1), clean.text(txt2), clean.text(txt3), clean.text(txt4))
#
# remove stop-words
# adding extra domain specific stop words
all = removeWords(all, c(stopwords("english"), 'singapore', 'singaporeair',
'emirates', 'united', 'airlines', 'unitedairlines',
'cathay', 'pacific', 'cathaypacific', 'airline',
'airlinesunited', 'emiratesemirates', 'pacifics'))
#
# create corpus
corpus = Corpus(VectorSource(all))
#
# create term-document matrix
tdm = TermDocumentMatrix(corpus)
#
# convert as matrix
tdm = as.matrix(tdm)
#
# add column names
colnames(tdm) = c('United', 'Singapore Air', 'Emirates', 'Cathay Pacific')
#
# Plot comparison wordcloud
layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(0, 4))
plot.new()
text(x=0.5, y=0.5, 'Word Comparison by Airline')
comparison.cloud(tdm, random.order=FALSE,
colors = c("#00B2FF", "red", "#FF0099", "#6600CC"),
title.size=1.5, max.words=250)
#
# Plot commonality cloud
layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(0, 4))
plot.new()
text(x=0.5, y=0.5, 'Word Commonality by Airline')
commonality.cloud(tdm, random.order=FALSE,
colors = brewer.pal(8, "Dark2"),
title.size=1.5, max.words=250)

The code above is quite similar to that in the previous step, except that this time we are comparing airlines instead of emotions.

Emirates includes “aniston”, presumably in reference to the marketing campaign involving Jennifer Aniston, while United includes “CEO” due to a number of news stories about United CEO’s including a resignation and a heart transplant.

 

 

Second Annual Data Science Bowl – Part 3 – Automatically Finding the Heart Location in an MRI Image

Tags

, , , , ,

My last blog wasn’t so sexy, what with all the data cleansing, and no predictive modelling. But in this blog I do something really cool – I train a machine learning model to find the left ventricle of the heart in an MRI image. And I couldn’t have done it without all of that boring data cleansing. #kaggle @kaggle

Aside from being a really cool thing to do, there is a purpose to this modelling. I want to find the boundaries of the heart chamber, and that is much easier and faster to do when I remove distractions. Once I have found the location of the heart chamber, I can crop the image to a much smaller square.

The input to the model will be a set of images. In order to simply what the model learns, I only gave it training images from sax locations near the centre of the heart.

The output from the model will be the row number and column number of the centroid of the left ventricle heart chamber (the red dot in the images above).

I had to manually define those centroid locations for a training set of a few hundred of the images. This was laborious and time consuming, even after I automated some of the process. But it needed to be done, because otherwise the machine learning algorithm has no way of knowing what the true answers should be.

Even though I am much more comfortable coding in R than in Python, I used Python for this step because I wanted to use Daniel Nouri‘s nolearn library, which sits above lasagne and theano, and these libraries are not available in R. The convolution neural network architecture was based upon the architecture in Daniel Nouri’s tutorial for the Facial Keypoints Detection competition in Kaggle.

Step 1: Importing of all the Required Libraries

OK, so this part isn’t all that sexy either. But it’s the engine for all the cool modelling that is about to be done.


import numpy as np
import csv
import random
import math
import os
import cv2
import itertools
import math
import matplotlib.pyplot as plt
import pandas as pd
import itertools

from lasagne import layers
from lasagne.updates import nesterov_momentum
from lasagne.nonlinearities import softmax
from lasagne.nonlinearities import sigmoid
from nolearn.lasagne import BatchIterator
from nolearn.lasagne import NeuralNet
from nolearn.lasagne import TrainSplit
from nolearn.lasagne import PrintLayerInfo
from nolearn.lasagne.visualize import plot_loss
from nolearn.lasagne.visualize import plot_conv_weights
from nolearn.lasagne.visualize import plot_conv_activity
from nolearn.lasagne.visualize import plot_occlusion

%pylab inline
from lasagne.layers import DenseLayer
from lasagne.layers import InputLayer
from lasagne.layers import DropoutLayer
from lasagne.layers import Conv2DLayer
from lasagne.layers import MaxPool2DLayer
from lasagne.nonlinearities import softmax
from lasagne.updates import adam
from lasagne.layers import get_all_params
from nolearn.lasagne import NeuralNet
from nolearn.lasagne import TrainSplit
from nolearn.lasagne import objective

import theano
import theano.tensor as T

 

Step 2: Defining the Helper Functions

I used jupyter notebook as the development environment to set up and run my Python scripts. While there’s a lot to like about jupyter, one thing that annoys me is that print commands run in jupyter don’t immediately show text on the screen. But here’s a trick to work around that:


def printQ(s):
 print(s)
 sys.stdout.flush()

Using this helper function instead of the print function results in text immediately appearing in the output. This is particular helpful for progress messages on long training runs.

I like to use R’s expand.grid function, but it isn’t built in to Python. So I wrote my own helper function in Python that mimics the functionality:


def product2(*args, repeat=1):
 # product('ABCD', 'xy') --> Ax Ay Bx By Cx Cy Dx Dy
 # product(range(2), repeat=3) -> 000 001 010 011 100 101 110 111
 pools = [tuple(pool) for pool in args] * repeat
 result = [[]]
 for pool in pools:
 result = [x+[y] for x in result for y in pool]
 for prod in result:
 yield tuple(prod)

def expand_grid(dictionary):
 return pd.DataFrame([row for row in product2(*dictionary.values())],
 columns=dictionary.keys())

This next function reads a cleaned up image, checking that it has the correct aspect ratio, then resizing it to 96 x 96 pixels. The resizing is done to reduce memory usage in my GPU, and to speed up the training and scoring.


def load_image(path):
 # check that the file exists
 if not os.path.isfile(path):
   printQ('BAD PATH: ' + path)
 image = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
 # ensure portrait aspect ratio
 if (image.shape[0] < image.shape[1]):
   image = cv2.transpose(image)
 # resize to 96x96
 resized_image = cv2.resize(image, (96, 96))
 # check that the image isn't empty
 s = sum(sum(resized_image))
 if np.isnan(s):
   print(path)
 return resized_image

My initial models were not performing as well as I would like. So to force the the model to generalise, I added some image transformations (rotation and reflection) to the training data. This required helper functions:


def rotate_image(img, degrees):
 rows,cols = img.shape
 M = cv2.getRotationMatrix2D((cols/2,rows/2),degrees,1)
 dst = cv2.warpAffine(img,M,(cols,rows))
 return dst

def transform_image(img, equalise, gammaAdjust, reflection, rotation):
 if equalise == 1:
   img = cv2.equalizeHist(img)
 if gammaAdjust != 1:
   img = pow(img / 255.0, gammaAdjust) * 255.0
 if reflection == 0 or reflection == 1:
   img = cv2.flip(img, int(reflection))
 if rotation != 0:
   img = rotate_image(img, rotation)
 return img

def transform_xy(x, y, equalise, gammaAdjust, reflection, rotation):
 # if equalise, then no change to x and y
 # if gamma adjustment, then no change to x and y
 # reflection
 if reflection == 0:
   y = 1.0 - y
 if reflection == 1:
   x = 1.0 - x
 if rotation == 180:
   x = 1.0 - x
   y = 1.0 - y
 if rotation != 0 and rotation != 180:
   x1 = x - 0.5
   y1 = y - 0.5
   theta = rotation / 180 * pi
   x2 = x1 * cos(theta) + y1 * sin(theta)
   y2 = -x1 * sin(theta) + y1 * cos(theta)
   x = x2 + 0.5
   y = y2 + 0.5
 return numpy.array([x, y])

# set up the image adjustments
#equalise, gammaAdjust, reflection
dAdjustShort = { 'equalise': [0],
 'gammaAdjust': [1],
 'reflection': [-1],
 'rotation': [0]}
dAdjust = { 'equalise': [0, 1],
 'gammaAdjust': [1, 0.75, 1.5],
 'reflection': [-1, 0, 1],
 'rotation': [0, 180, 3, -3]}
# 'rotation': [0, 180, 3, -3, 10, -10]}
imgAdj = expand_grid(dAdjust)
#imgAdj = expand_grid(dAdjustShort)
print (imgAdj.shape)
# can't have both reflection AND rotation by 180 degrees
imgAdj = imgAdj.query('not (reflection > -1 and rotation == 180)')
# can't have equalise AND gamma adjust
imgAdj = imgAdj.query('not (equalise == 1 and gammaAdjust != 1)')

There are two feature sets that enable the machine learning model to find the heart:

  1. shape in the image – by looking at shape in the image it can find the heart
  2. movement between images at different points of time – the heart is moving but most of the chest is stationary

So I set up the network architecture to use two channels. The first channel is the image, and the second channel is the difference between the image at this point of time and the image 8 time periods in the future.


def load_train_set():
 #numTimes = 8 # how many time periods to use
 plusTime = 5 # gaps between time periods for comparison
 xs = []
 ys = []
 ids = []
 all_y = '/home/colin/data/Second-Annual-Data-Science-Bowl/working/centroids-20160218-R.csv'
 with open(all_y) as f:
   rows = csv.reader(f, delimiter=',', quotechar='"')
   iRow = 1
   for line in rows:
     # first line is column headers, so ignore it
     if iRow > 1:
       # parse the line
       patient = line[0]
       x = float(line[1])
       y = float(line[2])
       sax = int(line[4])
       firstTime = int(line[5])
       if int(patient) % 25 == 0 and firstTime == 1:
         printQ(patient)
       # enhance the training data with rotations, reflections, NOT histogram equalisation
       for index, row in imgAdj.iterrows():
         # append the target values
         xy = transform_xy(x, y, row['equalise'], row['gammaAdjust'], row['reflection'], row['rotation'])
         ys.append(xy.astype('float32').reshape((1, 2)))
         #
         # read the images
         folder = '/home/colin/data/Second-Annual-Data-Science-Bowl/train-cleaned/' + patient + '/study/sax_0' + str(sax) + '/'
         xm = np.zeros([1, 2, 96, 96])
         #
         #
         # current frame
         path = folder + 'image-' + ('%0*d' % (2, firstTime)) + '.png'
         img = load_image(path)
         # transform the image - rotation, reflection etc
         img = transform_image(img, row['equalise'], row['gammaAdjust'], row['reflection'], row['rotation'])
         # get the pixels into the range [-1, 1]
         img = (img / 128.0 - 1.0).astype('float32').reshape((96, 96))
         xm[0, 0, :, :] = img
         #
         #
         # find movement of current frame to future frame
         path = folder + 'image-' + ('%0*d' % (2, firstTime + plusTime)) + '.png'
         img = load_image(path)
         # transform the image - rotation, reflection etc
         img = transform_image(img, row['equalise'], row['gammaAdjust'], row['reflection'], row['rotation'])
         # get the pixels into the range [-1, 1]
         img = (img / 128.0 - 1.0).astype('float32').reshape((96, 96))
         # first time is the complete image at time 1
         # subsequent frames are the differences between frames
         xm[0, 1, :, :] = img - xm[0, 0, :, :]
         xs.append(xm.astype('float32').reshape((1, 2, 96, 96)))
         ids.append(patient)

     iRow = iRow + 1
 return np.vstack(xs), np.vstack(ys), np.vstack(ids)

I used early stopping to help reduce overfitting:


class EarlyStopping(object):
 def __init__(self, patience=100):
   self.patience = patience
   self.best_valid = np.inf
   self.best_valid_epoch = 0
   self.best_weights = None

def __call__(self, nn, train_history):
 current_valid = train_history[-1]['valid_loss']
 current_epoch = train_history[-1]['epoch']
 if current_valid < self.best_valid:
   self.best_valid = current_valid
   self.best_valid_epoch = current_epoch
   self.best_weights = nn.get_all_params_values()
 elif self.best_valid_epoch + self.patience < current_epoch:
   print('Early stopping')
   print('Best valid loss was {:.6f} at epoch {}.'.format(
   self.best_valid, self.best_valid_epoch))
   nn.load_params_from(self.best_weights)
   raise StopIteration()

Step 3: Read the Training Data

As well as reading the training data, I shuffled the order of the data. This allowed me to use batch training.


# read the training data

printQ('reading the training data')
train_x, train_y, train_id = load_train_set()

printQ ('shuffling training rows')
random.seed(1234)
rows = random.choice(arange(0, train_x.shape[0]), train_x.shape[0])
t_x = train_x[rows,:,:,:]
t_y = train_y[rows,:]
t_id = train_id[rows]

printQ('finished')

Step 4: Train the Model

The network architecture used deep convolutional layers to find features in the image, then fully connected layers to convert these features into the centroid location:

20160308-image05


# fit the models

# set up the model
printQ ('setting up the model structure')
layers0 = [
 # layer dealing with the input data
 (InputLayer, {'shape': (None, 2, 96, 96)}),

# first stage of our convolutional layers
 (Conv2DLayer, {'num_filters': 32, 'filter_size': 5}),
 #(DropoutLayer, {'p': 0.2}),
 (Conv2DLayer, {'num_filters': 32, 'filter_size': 3}),
 #(DropoutLayer, {'p': 0.2}),
 (Conv2DLayer, {'num_filters': 32, 'filter_size': 3}),
 #(DropoutLayer, {'p': 0.2}),
 (Conv2DLayer, {'num_filters': 32, 'filter_size': 3}),
 #(DropoutLayer, {'p': 0.2}),
 (Conv2DLayer, {'num_filters': 32, 'filter_size': 3}),
 (MaxPool2DLayer, {'pool_size': 2}),
 (DropoutLayer, {'p': 0.2}),

# second stage of our convolutional layers
 (Conv2DLayer, {'num_filters': 64, 'filter_size': 3}),
 #(DropoutLayer, {'p': 0.3}),
 (Conv2DLayer, {'num_filters': 64, 'filter_size': 3}),
 #(DropoutLayer, {'p': 0.3}),
 (Conv2DLayer, {'num_filters': 64, 'filter_size': 3}),
 (MaxPool2DLayer, {'pool_size': 2}),
 (DropoutLayer, {'p': 0.3}),

# two dense layers with dropout
 (DenseLayer, {'num_units': 128}),
 (DropoutLayer, {'p': 0.5}),
 (DenseLayer, {'num_units': 128}),

# the output layer
 (DenseLayer, {'num_units': 2, 'nonlinearity': sigmoid}),
]

printQ ('creating and training the networks architectures')
numNets = 1
NNs = list()
for iNet in arange(numNets):
 nn = NeuralNet(
 layers = layers0,
 max_epochs = 2000,
 update=adam,
 update_learning_rate=0.0002,
 regression=True, # flag to indicate we're dealing with regression problem
 batch_iterator_train=BatchIterator(batch_size=100),
 on_epoch_finished=[EarlyStopping(patience=10),],
 train_split=TrainSplit(eval_size=0.25),
 verbose=1,
 )
 result = nn.fit(t_x, t_y)
 NNs.append(nn)

printQ('finished')

Based upon how quickly the training converged, the network could possibly have been simplified, reducing the number of layers, or using fewer neurons in the fully connected layers. But I didn’t have time to experiment with different architectures.

20160308-image06

The GPU quickly ran out of RAM unless I used the batch iterator. I found the batch size via trial and error. Large batch sizes caused the GPU to run out of RAM. Small batch sizes ran much slower.

Step 5: Review the Training Errors

Just like humans, all models make mistakes. The heart chamber segmentation algorithms I used later in this project were sensitive to how well the heart chamber was centred in the image. But as long as the model output was a centroid that was inside the heart chamber, things usually went OK. Early versions of my model made mistakes that placed the centroid outside the heart chamber, sometimes even far away from the heart.Tweaks to the training data (especially enhancing the data with rotation and reflection) and the architecture (especially dropout layers) improved the performance.


def getHeartLocation(trainX):
  # get the heart locations from each network
  heartLocs = zeros(numNets * trainX.shape[0] * 2).reshape((numNets, trainX.shape[0], 2))
  for j in arange(numNets):
    nn = NNs[j]
    heartLocs[j, :, :] = nn.predict(trainX)

    # use median as an ensembler
    heartLocsMedian = zeros(trainX.shape[0] * 2).reshape((trainX.shape[0], 2))
    heartLocsMedian[:,0] = median(heartLocs[:,:,0], axis = 0)
    heartLocsMedian[:,1] = median(heartLocs[:,:,1], axis = 0)

    # use a 'max distance from centre' ensembler
    heartLocsDist = zeros(trainX.shape[0] * 2).reshape((trainX.shape[0], 2))
    distance = abs(heartLocs - 0.5)
    am0 = distance[:,:,0].argmax(0)
    am1 = distance[:,:,1].argmax(0)
    heartLocsDist[:,0] = heartLocs[am0, arange(trainX.shape[0]), 0]
    heartLocsDist[:,1] = heartLocs[am1, arange(trainX.shape[0]), 1]

    # combine the two using an arithmetic average
    heartLocations = 0.5 * heartLocsMedian + 0.5 * heartLocsDist

    return heartLocations

heartLocations = getHeartLocation(train_x)

# review the training errors to check for model improvements
def plot_sample(x, y, predicted, axis):
  img = x[0, :, :].reshape(96, 96)
  axis.imshow(img, cmap='gray')
  axis.scatter(y[0::2] * 96, y[1::2] * 96, marker='x', s=10)
  axis.scatter(predicted[0::2] * 96, predicted[1::2] * 96, marker='x', s=10, color='red')

nTrain = train_x.shape[0]
errors = np.zeros(nTrain)

for i in arange(0, nTrain):
  errors[i] = sqrt( square(heartLocations[i, 0] - train_y[i, 0]) + square(heartLocations[i, 1] - train_y[i, 1]) )

print('Prob(error > 0.05)' + str(mean(errors > 0.05)))
print('Mean: ' + str(mean(errors)))
print('Percentiles: ' + str(percentile(errors, [50, 75, 90, 95, 99, 100])))

for i in arange(0, nTrain):
  error = sqrt( square(heartLocations[i, 0] - train_y[i, 0]) + square(heartLocations[i, 1] - train_y[i, 1]) )
  if (error > 0.04):
    if train_id[i] != train_id[i-1]:
      #print(i)
      print(train_id[i]) # only errors on the original images - not the altered images
      fig = pyplot.figure(figsize=(6, 3))
      ax = fig.add_subplot(1, 2, 1, xticks=[], yticks=[])
      plot_sample(train_x[i,:,:,:], train_y[i, :], heartLocations[i, :], ax)
      pyplot.show()

print('error review completed')

20160308-image07

After many failed models, I was excited when the two worst training errors were still close to the centre of the heart chamber 🙂

Step 6: Find the Left Ventricle Locations for the Submission Data

The main point of building a heart finder machine learning model is to automate the process of finding the left ventricle in the test images that will be used as part of the competition submission. These are images that the model has never seen before.


def load_submission_set():
  numTimes = 2 # how many time periods to use
  plusTime = 5
  xs = []
  ys = []
  ids = []
  paths = []
  times = []
  saxes = []
  all_y = '/home/colin/data/Second-Annual-Data-Science-Bowl/working/centroids-submission-R.csv'
  with open(all_y) as f:
    rows = csv.reader(f, delimiter=',', quotechar='"')
    iRow = 1
    for line in rows:
      # first line is column headers
      if iRow > 1:
        # parse the line
        patient = line[0]
        x = 0
        y = 0
        sax = int(line[1])
        firstTime = int(line[2])
        # save the targets
        xy = np.asarray([x, y])
        ys.append(xy.astype('float32').reshape((1, 2)))
        # read the images
        folder = '/home/colin/data/Second-Annual-Data-Science-Bowl/validate-cleaned/' + patient + '/study/sax_0' + str(sax) + '/'
        xm = np.zeros([1, 2, 96, 96])
        #
        #
        # current frame
        path0 = folder + 'image-' + ('%0*d' % (2, firstTime)) + '.png'
        img = load_image(path0)
        # transform the image - rotation, reflection etc
        #img = transform_image(img, row['equalise'], row['gammaAdjust'], row['reflection'], row['rotation'])
        # get the pixels into the range [-1, 1]
        img = (img / 128.0 - 1.0).astype('float32').reshape((96, 96))
        xm[0, 0, :, :] = img
        #
        #
        # find movement of current frame to future frame
        path5 = folder + 'image-' + ('%0*d' % (2, firstTime + plusTime)) + '.png'
        img = load_image(path5)
        # transform the image - rotation, reflection etc
        #img = transform_image(img, row['equalise'], row['gammaAdjust'], row['reflection'], row['rotation'])
        # get the pixels into the range [-1, 1]
        img = (img / 128.0 - 1.0).astype('float32').reshape((96, 96))
        # first time is the complete image at time 1
        # subsequent frames are the differences between frames
        xm[0, 1, :, :] = img - xm[0, 0, :, :]
        xs.append(xm.astype('float32').reshape((1, numTimes, 96, 96)))
        ids.append(patient)
        paths.append(path0)
        times.append(firstTime)
        saxes.append(sax)
      iRow = iRow + 1
  return np.vstack(xs), np.vstack(ids), np.vstack(paths), np.vstack(times), np.vstack(saxes)

printQ('reading the submission data')
test_x, test_ids, test_paths, test_times, test_sax = load_submission_set()

printQ('quot;getting the predictions')
predicted_y = getHeartLocation(test_x)

printQ('creating the output table')
fullIDs = []
fullPaths = []
fullX = []
fullY = []
fullSax = []
fullTime = []
nTest = test_x.shape[0]
iTime = 1
for i in arange(0, nTest):
  patient = (test_ids[i])[0]
  sax = int((test_sax[i])[0])
  path = (test_paths[i])[0]
  iTime = int((test_times[i])[0])
  fullIDs.append(patient)
  fullPaths.append(path)
  fullX.append(predicted_y[i, 0])
  fullY.append(predicted_y[i, 1])
  fullSax.append(sax)
  fullTime.append(iTime)
  outPath = '/home/colin/data/Second-Annual-Data-Science-Bowl/predicted-heart-location-submission/'
  outPath = outPath + patient
  outPath = outPath + '-' + str(sax)
  outPath = outPath + '-' + str(iTime) + '.png'
  img = load_image256x192(path)
  x = int(round(predicted_y[i, 0] * 192))
  y = int(round(predicted_y[i, 1] * 256))
  img[y, x] = 255
  img[y-1, x-1] = 255
  img[y-1, x+1] = 255
  img[y+1, x-1] = 255
  img[y+1, x+1] = 255
  write_image(img, outPath)

fullIDs = array(fullIDs)
fullPaths = array(fullPaths)
fullX = array(fullX)
fullY = array(fullY)
fullSax = array(fullSax)
fullTime = array(fullTime)

printQ('saving results table')
d = { 'patient': fullIDs, 'path' : fullPaths, 'x' : fullX, 'y' : fullY, 'iTime' : fullTime, 'sax' : fullSax}
import pandas as pd
d = pd.DataFrame(d)
d.to_csv('/home/colin/data/Second-Annual-Data-Science-Bowl/working/heartfinderV4b-centroids-submission.csv', index = False)

In the animated gif below, you can see the left ventricle centroid location that has been automatically fitted, displayed as a dark rectangle moving around near the centre of the heart chamber. The machine learning algorithm was not trained on this patient’s images – so what you see here is artificial intelligence in action!

20160308-submission-images

Second Annual Data Science Bowl – Part 2

Tags

, , , ,

In Part 1 of this blog series, I described how to fix the brightness and contrast of the MRI images. In this blog we finish cleaning up the input data.

Other than brightness and contrast, we need to fix up the following problems:

  • different image sizes
  • different image rotations – portrait versus landscape
  • different pixel spacing
  • different short axis slice spacing
  • image sets from duplicate locations
  • sax sets aren’t in the same order as their locations
  • some sax image sets have multiple locations

Different Image Sizes and Rotations

There are approximately a dozen different image sizes, include rotated images. Not all of the image sizes scale to the same 4:3 aspect ratio that is the most common across the training set. Some of the machine learning algorithms I used later need fixed dimension images, so I compromised and decided to use a standard sizing of 256 x 192 pixels portrait aspect ratio. This meant that I wasn’t always scaling the x-axis and the y-axis by the same amount, and occasionally I was even upscaling an image.


library(pacman)
pacman::p_load(EBImage)
rescaleImage = function(img)
{
 imgOut = img
 # check for landscape aspect ratio and correct
 if (nrow(img) > ncol(img)) imgOut = t(img)
 imgOut = resize(imgOut, 256, 192) # standardise image size to 256 x 192
 return (imgOut)
}

20160307image0120160307image02

Note that I used matrix transpose to rotate the image 90 degrees. I could also have used the rotate function in EBImage. Either way could work, but I felt that matrix transpose would be a faster operation.

Image Locations and Spacing

The DICOM images contain information about image slice location, pixel spacing (how far apart are adjacent pixel centroids), and slice spacing (how far apart are adjacent sax slices).


library(pacman)
pacman::p_load(oro.dicom)

# function to extract the dicom header info
getDicomHeaderInfo = function(path)
{
 img = readDICOMFile(path)
 width = ncol(img$img)
 height = nrow(img$img)
 headers = img$hdr
 patientID = extractHeader(img$hdr, 'PatientID', numeric = TRUE)
 patientAge = as.integer(substr(extractHeader(img$hdr, 'PatientsAge', numeric = FALSE), 1, 3))
 patientGender = as.character(extractHeader(img$hdr, 'PatientsSex', numeric = FALSE))
 ps = extractHeader(img$hdr, 'PixelSpacin', numeric = FALSE)
 pixelSpacingX = as.numeric(unlist(strsplit(ps, ' '))[1])
 pixelSpacingY = as.numeric(unlist(strsplit(ps, ' '))[2])
 seriesNum = extractHeader(img$hdr, 'SeriesNumber', numeric = TRUE)
 location = round(extractHeader(img$hdr, 'SliceLocation', numeric = TRUE), digits = 1)
 pathSplit = unlist(strsplit(path, '/&amp;amp;', fixed = TRUE))
 filename = pathSplit[length(pathSplit)]
 prefix = unlist(strsplit(filename, '.', fixed = TRUE))[1]
 frame = as.integer(unlist(strsplit(prefix, '-'))[3])
 sliceNum = 0
 if (length(unlist(strsplit(prefix, '-'))) &amp;gt; 3) sliceNum = as.integer(unlist(strsplit(prefix, '-'))[4])
 time = extractHeader(img$hdr, 'InstanceCreationTime', numeric = FALSE)
 #
 return (list(
 id = patientID,
 age = patientAge,
 gender = patientGender,
 pixelSpacingX = pixelSpacingX,
 pixelSpacingY = pixelSpacingY,
 width = width,
 height = height,
 series = seriesNum,
 location = location,
 frame = frame,
 sliceNum = sliceNum,
 time = time,
 path = path
 ))
}

I’m not a medical expert. So when there are images from duplicate locations, I don’t know which image sets are the best to use. Therefore I just assumed that the medical specialist repeated the MRI scans until the image quality was adequate. This meant that I chose the image set with the latest time stamps (from the header information inside the DICOM files, not the file system date).

Then I just

  1. searched for DICOM images in all of the subfolders for each patient
  2. filtered to use only sax slices, ignoring those images that did not come from a folder that contained the substring “sax”
  3. read the DICOM header information from each image to find its location and time
  4. searched for the high time stamp for each location and kept that image
  5. wrote out a new folder structure where sax_01, sax_02, … were the sax slice image sets for that patient, order by their location along the long axis of the heart

The R script to do this isn’t too difficult. I’ve pasted it below:


# this script creates a cleaner version of the image data
# plus an extract of the file headers
# 1) removes duplicate images
# 2) reorders images by location

# read a poor image and translate the pixel brightnesses
rebalanceImage = function(badImage)
{
v = matrix(badImage, nrow(badImage) * ncol(badImage))
o2 = order(v)
vIn = sample(vAll, nrow(badImage) * ncol(badImage))
vIn = vIn[order(vIn)]
v2 = v
v2[o2] = vIn
cleanImage = matrix(v2, nrow(badImage), ncol(badImage))
return (cleanImage)
}

# check whether a folder exists and make it if it doesn't exist
checkFolder = function(mainDir, subDir)
{
if(!dir.exists(file.path(mainDir, subDir)))
dir.create(file.path(mainDir, subDir))
}

# function to extract the dicom header info
getDicomHeaderInfo = function(path)
{
img = readDICOMFile(path)
width = ncol(img$img)
height = nrow(img$img)
headers = img$hdr
patientID = extractHeader(img$hdr, 'PatientID', numeric = TRUE)
patientAge = as.integer(substr(extractHeader(img$hdr, 'PatientsAge', numeric = FALSE), 1, 3))
patientGender = as.character(extractHeader(img$hdr, 'PatientsSex', numeric = FALSE))
ps = extractHeader(img$hdr, 'PixelSpacing', numeric = FALSE)
pixelSpacingX = as.numeric(unlist(strsplit(ps, ' '))[1])
pixelSpacingY = as.numeric(unlist(strsplit(ps, ' '))[2])
seriesNum = extractHeader(img$hdr, 'SeriesNumber', numeric = TRUE)
location = round(extractHeader(img$hdr, 'SliceLocation', numeric = TRUE), digits = 1)
pathSplit = unlist(strsplit(path, '/', fixed = TRUE))
filename = pathSplit[length(pathSplit)]
prefix = unlist(strsplit(filename, '.', fixed = TRUE))[1]
frame = as.integer(unlist(strsplit(prefix, '-'))[3])
sliceNum = 0
if (length(unlist(strsplit(prefix, '-'))) &amp;gt; 3) sliceNum = as.integer(unlist(strsplit(prefix, '-'))[4])
time = extractHeader(img$hdr, 'InstanceCreationTime', numeric = FALSE)
#
return (list(
id = patientID,
age = patientAge,
gender = patientGender,
pixelSpacingX = pixelSpacingX,
pixelSpacingY = pixelSpacingY,
width = width,
height = height,
series = seriesNum,
location = location,
frame = frame,
sliceNum = sliceNum,
time = time,
path = path
))
}
############################################################################################################################

library(pacman)
pacman::p_load(oro.dicom, raster, data.table, png, flexclust, foreach, doParallel, snowfall)

# whether this run is for the training set (FALSE) or the validation set (TRUE)
useValidation = FALSE
#useValidation = TRUE

# create a benchmark histogram from an exemplar image
dicomBenchmark = readDICOM('/home/colin/data/Second-Annual-Data-Science-Bowl/train/1/study/sax_13')
images = dicomBenchmark[[2]]
img = images[[1]]
vAll = unname(unlist(images))
vAll = vAll[order(vAll)]

# loop through all of the patients
rootFolder = '/home/colin/data/Second-Annual-Data-Science-Bowl/train'
outFolder = '/home/colin/data/Second-Annual-Data-Science-Bowl/train-cleaned'
if (useValidation)
{
rootFolder = '/home/colin/data/Second-Annual-Data-Science-Bowl/validate'
outFolder = '/home/colin/data/Second-Annual-Data-Science-Bowl/validate-cleaned'
}
cases = list.dirs(rootFolder, recursive=FALSE)
#cases = cases[grep('123', cases)]
simpleData = matrix(0, 500, 3)
simpleFeatures = data.frame(id = rep(0, 500), age = rep(0, 500), gender = rep('U', 500),
pixelSpacingX = integer(500), pixelSpacingY = integer(500),
width = integer(500), height = integer(500),
numSlices = rep(0, 500), stringsAsFactors = FALSE)
allImages = NULL

sfInit(parallel=TRUE, cpus=14)
sfLibrary(oro.dicom)

for (patient in cases)
{
patientFolder = paste0(patient, '/study')
imgSequences = list.dirs(patientFolder, recursive=FALSE)
# filter for 'sax' folders
imgSequences = imgSequences[grep('sax', imgSequences, fixed = TRUE)]
nMax = 10000
imgTable = data.table(id = integer(nMax), age = integer(nMax), gender = character(nMax),
pixelSpacingX = integer(nMax), pixelSpacingY = integer(nMax),
width = integer(nMax), height = integer(nMax),
location = numeric(nMax), frame = integer(nMax), series = integer(nMax),
sliceNum = integer(nMax), time = numeric(nMax),
path = character(nMax))
for (imgFolder in imgSequences)
{
# find the first file in imgs
imgFiles = list.files(imgFolder)
if (length(imgFiles) &lt; 30)
{
print(paste0('length = ', length(imgFiles), '! in ', imgFolder))
} else {
if (length(imgFiles) %% 30 != 0)
{
print(paste0('length = ', length(imgFiles), '! in ', imgFolder))
}
}
# do the next part regardless of the number of images
paths = unlist(lapply(imgFiles, function(x) return (paste0(imgFolder, '/', x))))
result <- sfLapply(paths, getDicomHeaderInfo)
for (row in result)
{
n = sum(imgTable$id > 0) + 1
imgTable$id[n] = row$id
imgTable$age[n] = row$age
imgTable$gender[n] = row$gender
imgTable$pixelSpacingX[n] = row$pixelSpacingX
imgTable$pixelSpacingY[n] = row$pixelSpacingY
imgTable$width[n] = row$width
imgTable$height[n] = row$height
imgTable$series[n] = row$series
imgTable$location[n] = row$location
imgTable$frame[n] = row$frame
imgTable$sliceNum[n] = row$sliceNum
imgTable$time[n] = row$time
imgTable$path[n] = row$path
}
}
# remove surplus records from table
imgTable= imgTable[imgTable$id > 0]
# grab the latest image for each location and frame
latestImages = imgTable[order(id, age, gender, pixelSpacingX, pixelSpacingY, width, height, location, frame, time, sliceNum, series), .SD[c(.N)], by=c(&amp;amp;amp;quot;id&amp;amp;amp;quot;, &amp;amp;amp;quot;age&amp;amp;amp;quot;, &amp;amp;amp;quot;gender&amp;amp;amp;quot;, &amp;amp;amp;quot;pixelSpacingX&amp;amp;amp;quot;, &amp;amp;amp;quot;pixelSpacingY&amp;amp;amp;quot;, &amp;amp;amp;quot;width&amp;amp;amp;quot;, &amp;amp;amp;quot;height&amp;amp;amp;quot;, &amp;amp;amp;quot;location&amp;amp;amp;quot;, &amp;amp;amp;quot;frame&amp;amp;amp;quot;)]
#
print(paste0('Patient: ', latestImages$id[1]))
uniqueLocs = sort(unique(latestImages$location[abs(latestImages$location) > 0.000001]))
simpleFeatures[latestImages$id[1], ] = list(latestImages$id[1], latestImages$age[1], latestImages$gender[1],
latestImages$pixelSpacingX[1], latestImages$pixelSpacingY[1],
latestImages$width[1], latestImages$height[1],
length(uniqueLocs))
# create the cleaned-up images for CNN training
imgTable$cleanPath = '---'
iSax = 1
patientID = latestImages$id[1]
for (loc in uniqueLocs)
{
imgset = latestImages$path[latestImages$location == loc]
iImage = 1
for (imgPath in imgset)
{
dicomImage = readDICOMFile(imgPath)
# fix the contrast and brightness
fixedImage = rebalanceImage(dicomImage$img)
# get the details of this image
###patientID = extractHeader(dicomImage$hdr, 'PatientID', numeric = TRUE)
sliceID = iSax
imageID = iImage
outPath = paste0(outFolder, '/', patientID, '/study/sax_', formatC(sliceID, width=2, flag='0'), '/image-', formatC(imageID, width=2, flag='0'), '.png')
#print(outPath)
#plot(raster(fixedImage))
checkFolder(outFolder, as.character(patientID))
checkFolder(paste0(outFolder, '/', patientID), 'study')
checkFolder(paste0(outFolder, '/', patientID, '/study'), paste0('sax_', formatC(sliceID, width=2, flag='0')))
writePNG(fixedImage / max(fixedImage), outPath)
imgTable$cleanPath[imgTable$path == imgPath] = outPath
#
iImage = iImage + 1
}
iSax = iSax + 1
}

if (patient == cases[1])
{
allImages = imgTable
} else
{
allImages = data.frame(rbind(allImages, imgTable))
}
}

sfStop()

I stored all of the DICOM header information in a table. Some of this will be required later, when calculating the volume of the left ventricle chamber.

The next step, which will be described in my next blog, is to design a convolutional neural network that will automatically find the left ventricle in an image.

Second Annual Data Science Bowl – Part 1

Tags

, , , ,

I’m currently competing in the Second Annual Data Science Bowl at Kaggle. This is by far the most difficult competition that I have entered to date. At the time of writing I am placed 62nd out of 755 entries, with only a day remaining to lock down my methodology. There’s a lot more I’d like to do to improve my model, but alas, I don’t have the time!

Here’s the problem that we are solving:

  1. We are given a set of medical images taken by MRI, across 30 time periods, and a variable number of location slices through the body.
  2. We are also given the volume of the left ventricle of the heart at times of diastole and systole.
  3. Our task is to design an automatic algorithm that inputs DICOM images and outputs a cumulative density function of the likelihood of different volumes at both diastole and systole.

The medical image files are in DICOM format, containing information about the patient (e.g. age and gender) and a set of monochrome images for each patient giving a 4 dimensional view of that patient’s chest. The key images are the “sax” (short axis) images, a set of slices perpendicular to the line that passes through the length of the heart (a heart isn’t circular, but more ovoid in shape), and there are typically 30 images for each sax, each being 1/30th the time period of a heartbeat, showing one complete cycle of the heart. There are a varying number of sax images for each patient, depending upon the length of the patient’s heart, and sometimes there are also repeated sax sets, where the scanning was repeated in an attempt to improve the image quality.

The image quality varies greatly between patients, with differing image resolutions, brightness, contrast and aspect ratio / rotation.

As you can see in the animated gifs above, some of the images are such poor quality that it is difficult for the human eye to discern the details. So my first challenge was to improve the brightness and contrast. One way to do this is to do a linear transformation on each image so that its pixels have a preset mean and standard deviation.

As you can see, while this approach helped, it did not work well enough for the problem images. I also tried non-linear transformations without much more success. No transformation function was flexible enough for the wide range of image qualities. Frequently part of the problem image gets washed out.

At the National Heart Centre Singapore, I spoke with Assistant Professor Calvin Chin about how doctors use imaging to assess heart volume. He explained that the very bright, washed out sections in some of the images are the result of fat deposits within the patient’s body. He also explained how to find the left ventricle chamber in an image (it is round with a thick lining surrounding it) and what to do about the dark patches inside the chamber (include them in the area of the chamber because they are blood vessels). This was really helpful. It pays to bring in some domain knowledge to a machine learning problem.

What I wanted was for all images to have similar brightness histograms. After much experimentation, I couldn’t find a transformation function that achieved this for me. But then I realised that I didn’t need to use a function – I could just use empirical histograms as my target function, mapped to my original image via the brightness ranking of each pixel. All I needed to do was select an exemplar image (or multiple exemplar images) and then order the pixel brightnesses, then map across. Here’s how I did it using R:


library(pacman)
pacman::p_load(oro.dicom)

# a function to turn the image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

# read a poor image and translate the pixel brightnesses
rebalanceImage = function(badImage)
{
# get the pixel brightnesses and get an index that sorts them
v = img2vec(badImage)
o2 = order(v)

# get a target histogram, allowing for the size of the bad image
vIn = sample(vAll, nrow(badImage) * ncol(badImage))
vIn = vIn[order(vIn)]

#
v2 = v
v2[o2] = vIn

# turn the piuxel vector back into an image
cleanImage = matrix(v2, nrow(badImage), ncol(badImage))

return (cleanImage)
}
# create a benchmark histogram from an exemplar image
dicomBenchmark = readDICOM('C:/Users/Colin/Dropbox/blogging/20160306 Second Annual Data Science Bowl Part 1/SADSB/1/study/sax_13')
images = dicomBenchmark[[2]]
img = images[[1]]
vAll = unname(unlist(images))
vAll = vAll[order(vAll)]

# read the raw image
dicomImage = readDICOMFile('C:/Users/Colin/Dropbox/blogging/20160306 Second Annual Data Science Bowl Part 1/SADSB/1/study/sax_8/IM-4560-0001.dcm')
# fix the contrast and brightness
fixedImage = rebalanceImage(dicomImage$img)
# read the raw image
dicomImage2 = readDICOMFile('C:/Users/Colin/Dropbox/blogging/20160306 Second Annual Data Science Bowl Part 1/SADSB/6/study/sax_8/IM-9548-0001.dcm')
# fix the contrast and brightness
fixedImage2 = rebalanceImage(dicomImage2$img)

This gave me fairly consistent brightnesses and contrast, regardless of the quality of the original images, and also prevented washed out regions. You can see the results below:

Standardising the input data helps machine learning algorithms perform better because they don’t have to waste resources figuring out how to adjust for varying inputs where that variation is not a predictive feature.

In my next blog I will describe how I used the DICOM header information to further improve the model inputs, and to create extra features for my final model.

An Even Dozen – Denoising Dirty Documents: Part 12

Tags

, , , , ,

Over the past 11 blogs in this series, I have discussed how to build machine learning models for Kaggle’s Denoising Dirty Documents competition.

dozeneggs

The final blog in this series brings the count to an even dozen, and will achieve two aims:

  1. ensemble the models that we have built
  2. take advantage of the second information leakage in the competition

Ensembling, the combining of individual models into a single model, performs best when the individual models have errors that are not strongly correlated. For example, if each model has statistically independent errors, and each model performs with similar accuracy, then the average prediction across the 4 models will have half the RMSE score of the individual models. One way to increase the statistical independence of the models is to use different feature sets and / or types of models on each. I therefore chose the following combination of models:

  1. deep learning – thresholding based features
  2. deep learning – edge based features
  3. deep learning – median based features
  4. images with backgrounds removed using information leakage
  5. xgboost – wide selection of features
  6. convolutional neural network – using raw images without background removal pre-processing
  7. convolutional neural network – using images with backgrounds removed using information leakage
  8. deep convolutional neural network – using raw images without background removal pre-processing
  9. deep convolutional neural network – using images with backgrounds removed using information leakage

20151115 ensemble structure

It turned out that some of these models had errors that weren’t strongly independent to other models. But I was rushing to improve my leaderboard score in the final 48 hours of the competition, so I didn’t have time to experiment.

I didn’t experiment much with different ensemble models. However I did test xgboost versus a simple average or a least square linear regression, and it outperformed both. Maybe an elastic net could have done a good job.

Here is the R code for my ensemble:


.libPaths(c(.libPaths(), "./rlibs"))
library(png)
library(data.table)
library(xgboost)

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

cleanFolder = "./data/train_cleaned"
inFolder1 = "./threshold based model/training data"
inFolder2 = "./edge based model/training data"
inFolder3 = "./median based model/training data"
inFolder4 = "./foreground/train foreground"
inFolder5 = "./submission 11/train_postprocessed"
inFolder6 = "./convnet/train_predicted"
inFolder7 = "./cnn_leakage/train_predicted"
inFolder8 = "./CNN based model/training"
inFolder9 = "./deep CNN/train_predicted"

outPath = "./stacked/stacking.csv"

filenames = list.files(cleanFolder)
for (f in filenames)
{
print(f)
imgX1 = readPNG(file.path(inFolder1, f))
imgX2 = readPNG(file.path(inFolder2, f))
imgX3 = readPNG(file.path(inFolder3, f))
imgX4 = readPNG(file.path(inFolder4, f))
imgX5 = readPNG(file.path(inFolder5, f))
imgX6 = readPNG(file.path(inFolder6, f))
imgX7 = readPNG(file.path(inFolder7, f))
imgX8 = readPNG(file.path(inFolder8, f))
imgX9 = readPNG(file.path(inFolder9, f))
imgY = readPNG(file.path(cleanFolder, f))

# turn the images into vectors
y = img2vec(imgY)
x1 = img2vec(imgX1)
x2 = img2vec(imgX2)
x3 = img2vec(imgX3)
x4 = img2vec(imgX4)
x5 = img2vec(imgX5)
x6 = img2vec(imgX6)
x7 = img2vec(imgX7)
x8 = img2vec(imgX8)
x9 = img2vec(imgX9)

dat = data.table(cbind(y, x1, x2, x3, x4, x5, x6, x7, x8, x9))
setnames(dat,c("y", "threshold", "edge", "median", "foreground", "submission11", "convnet", "cnn_leakage", "CNN", "deepCNN"))
write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit an xgboost model to a subset of the data
set.seed(1)
#rows = sample(nrow(dat), 15000000)
dat[is.na(dat)] = 0
#dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
dtrain <- xgb.DMatrix(as.matrix(dat[,-1]), label = as.matrix(dat[,1]))
#
nThreads = 30
# do cross validation first
#xgb.tab = xgb.cv(data = dtrain, nthread = nThreads, eval_metric = "rmse", nrounds = 1000, early.stop.round = 15, nfold = 4, print.every.n = 10)
# what is the best number of rounds?
#min.error.idx = which.min(xgb.tab[, test.rmse.mean])
# now fit an xgboost model
min.error.idx = 300 # was 268
xgb.mod = xgboost(data = dtrain, nthread = nThreads, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)

dat_predicted = predict(xgb.mod, newdata=as.matrix(dat[,-1]))
sqrt( mean( (dat$y - dat_predicted) ^ 2 )) # 0.00759027

save (xgb.mod, file = "./model/xgb.rData")

#####################################################################################################################################

imgFolder = "./data/test"
inFolder1 = "./threshold based model/test data"
inFolder2 = "./edge based model/test data"
inFolder3 = "./median based model/test data"
inFolder4 = "./foreground/test foreground"
inFolder5 = "./submission 11/test_postprocessed"
inFolder6 = "./convnet/test_predicted"
inFolder7 = "./cnn_leakage/test_predicted"
inFolder8 = "./CNN based model/test"
inFolder9 = "./deep CNN/test_predicted"

outFolder = "./stacked/test data"
outFolder2 = "./stacked/test images"

filenames = list.files(imgFolder)
for (f in filenames)
{
print(f)
imgX1 = readPNG(file.path(inFolder1, f))
imgX2 = readPNG(file.path(inFolder2, f))
imgX3 = readPNG(file.path(inFolder3, f))
imgX4 = readPNG(file.path(inFolder4, f))
imgX5 = readPNG(file.path(inFolder5, f))
imgX6 = readPNG(file.path(inFolder6, f))
imgX7 = readPNG(file.path(inFolder7, f))
imgX8 = readPNG(file.path(inFolder8, f))
imgX9 = readPNG(file.path(inFolder9, f))

# turn the images into vectors
x1 = img2vec(imgX1)
x2 = img2vec(imgX2)
x3 = img2vec(imgX3)
x4 = img2vec(imgX4)
x5 = img2vec(imgX5)
x6 = img2vec(imgX6)
x7 = img2vec(imgX7)
x8 = img2vec(imgX8)
x9 = img2vec(imgX9)

dat = data.table(cbind(x1, x2, x3, x4, x5, x6, x7, x8, x9))
setnames(dat,c("threshold", "edge", "median", "foreground", "submission11", "convnet", "cnn_leakage", "CNN", "deepCNN"))
yHat = predict(xgb.mod, newdata=as.matrix(dat))
yHat[yHat < 0] = 0
yHat[yHat > 1] = 1
imgY = matrix(yHat, nrow(imgX1), ncol(imgX1))
writePNG(imgY, file.path(outFolder2, f))
save(imgY, file = file.path(outFolder, gsub(".png", ".rData", f)))
}

Ensembling materially improved my leaderboard score versus any of the individual models. I feel that was due to the use of different features across my 3 deep learning models. So now I had a set of images that looked quite good:

20151115 output 1

20151115 output 2

To my eyes, my predicted images were indistinguishable from the clean images in the training data. In a real world situation I would have stopped model development here, because the image quality exceeds the minimum requirements for OCR. However, since this was a competition, I wanted the best score I could get.

So I took advantage of the second data leakage in the competition – the fact that the cleaned images were repeated across the dataset. This meant that I could compare a cleaned images to other cleaned images that appeared to have the same text and the same font, and clean up any pixels that were different across the set of images. I experimented with using the mean of the pixel brightness across the images, but using the median performed better.


library(png)
library(data.table)

inFolder = "./stacked/test data"
outFolder = "./information leakage/data"
outFolder2 = "./information leakage/images"

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

filenames = list.files(inFolder, pattern = "\\.rData$")
for (f in filenames)
{
print(f)

load(file.path(inFolder, f))
imgX = imgY

# look for the closest matched images
scores = matrix(1, length(filenames))
for (i in 1:length(filenames))
{
load(file.path(inFolder, filenames[i]))
rmse = 1
if (nrow(imgY) >= nrow(imgX) && ncol(imgY) >= ncol(imgX))
{
imgY = imgY[1:nrow(imgX), 1:ncol(imgX)]
rmse = sqrt(mean( (imgX - imgY)^2 ))
}
scores[i] = rmse
}

dat = matrix(1, ncol(imgX) * nrow(imgX), 4)
for (i in 1:4)
{
f2 = filenames[order(scores)][i]
load(file.path(inFolder, f2))
dat[,i] = img2vec(imgY)
}

dat2 = apply(dat, 1, median)
#dat2 = apply(dat, 1, mean)

imgOut = matrix(dat2, nrow(imgX), ncol(imgX))
writePNG(imgOut, file.path(outFolder2, gsub(".rData", ".png", f)))
save(imgOut, file = file.path(outFolder, f))
}

This information leakage halved the RMSE, and I suspect that it was what allowed the top two competitors to obtain RMSE scores less than 1%.

So that’s it for this series of blogs. I learned a lot from my first Kaggle competition. Competing against others, and sharing ideas is a fun way to learn.

Denoising Dirty Documents: Part 11

Tags

, , , , , , , , , ,

In my last blog I showed how to use convolutional neural networks to build a model that removed stains from an image. While convolutional neural networks seem to be well suited for image processing, in this competition I found that deep neural networks performed better. In this blog I show how to build these models.

warnH022 - deep water

Since I wanted to use R, have limited RAM and I don’t have a powerful GPU, I chose to use h2o to build the models. That way I could do the feature engineering in R, pass the data to h2o, let h2o build a model, then get the predicted values back in R. The memory management would be done in h2o, which uses deep learning algorithms that adjust the RAM constraints. So I guess this combination of deep learning and h2o could be called “deep water” 😉

For my final competition submission I used an ensemble of models, including 3 deep learning models built with R and h2o. Each of the 3 deep learning models used different feature engineering:

  • median based feature engineering
  • edge based feature engineering
  • threshold based feature engineering

This blog shows the details of the median based model. I leave it to the reader to implement the edge based and threshold based models using the image processing scripts from my earlier blogs in this series.

If you don’t already have h2o installed on your computer, then you can install it directly from R. At the time of writing this blog, you could install h2o using the following script:


# The following two commands remove any previously installed H2O packages for R.
if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }

# Next, we download packages that H2O depends on.
if (! ("methods" %in% rownames(installed.packages()))) { install.packages("methods") }
if (! ("statmod" %in% rownames(installed.packages()))) { install.packages("statmod") }
if (! ("stats" %in% rownames(installed.packages()))) { install.packages("stats") }
if (! ("graphics" %in% rownames(installed.packages()))) { install.packages("graphics") }
if (! ("RCurl" %in% rownames(installed.packages()))) { install.packages("RCurl") }
if (! ("jsonlite" %in% rownames(installed.packages()))) { install.packages("jsonlite") }
if (! ("tools" %in% rownames(installed.packages()))) { install.packages("tools") }
if (! ("utils" %in% rownames(installed.packages()))) { install.packages("utils") }

# Now we download, install and initialize the H2O package for R.
install.packages("h2o", type="source", repos=(c("http://h2o-release.s3.amazonaws.com/h2o/rel-tibshirani/3/R")))

That script will need to be changed as new versions of h2o are released. So use the latest instructions shown here.

Once h2o is installed, you can interface with h2o from R using the CRAN package.


install.packages("h2o")
library(h2o)

Median based image processing is used for feature engineering in this example, but you could use any combination of image processing techniques for your feature engineering. I got better performance using separate deep learning models for different types of image processing, but that may be because I had limited computing resources. If you have more computing resources than me, then maybe you will be successful with a single large model that uses all of the image processing techniques to create features.


# a function to turn a matrix image into a vector
img2vec = function(img)
{
 return (matrix(img, nrow(img) * ncol(img), 1))
}
 
median_Filter = function(img, filterWidth)
{
 pad = floor(filterWidth / 2)
 padded = matrix(NA, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
 padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img
 
 tab = matrix(0, nrow(img) * ncol(img), filterWidth * filterWidth)
 k = 1
 for (i in seq_len(filterWidth))
 {
 for (j in seq_len(filterWidth))
 {
 tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
 k = k + 1
 }
 }
 
 filtered = unlist(apply(tab, 1, function(x) median(x[!is.na(x)])))
 return (matrix(filtered, nrow(img), ncol(img)))
}
 
# a function that uses median filter to get the background then finds the dark foreground
background_Removal = function(img)
{
 w = 5
 p = 1.39
 th = 240
 
 # the background is found via a median filter
 background = median_Filter(img, w)
 
 # the foreground is darker than the background
 foreground = img / background
 foreground[foreground > 1] = 1
 
 foreground2 = foreground ^ p
 foreground2[foreground2 >= (th / 255)] = 1
 
 return (matrix(foreground2, nrow(img), ncol(img)))
} 

img2tab = function(imgX, f)
{
 median5 = img2vec(median_Filter(imgX, 5))
 median17 = img2vec(median_Filter(imgX, 17))
 median25 = img2vec(median_Filter(imgX, 25))
 backgroundRemoval = img2vec(background_Removal(imgX))
 foreground = readPNG(file.path(foregroundFolder, f))
 
 # pad out imgX
 padded = matrix(0, nrow(imgX) + padding * 2, ncol(imgX) + padding * 2)
 offsets = expand.grid(seq_len(2*padding+1), seq_len(2*padding+1))
 
 # raw pixels window
 padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = imgX
 x = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))
 
 # x2 window
 padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = median5
 x2 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))
 
 # x3 window
 padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = median17
 x3 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))
 
 # x4 window
 padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = median25
 x4 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))
 
 # x5 window
 padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = backgroundRemoval
 x5 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))
 
 # x6 window
 padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = foreground
 x6 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

 dat = data.table(cbind(x, x2, x3, x4, x5, x6))
 setnames(dat,c(
 paste("x", seq_len((2*padding+1)^2), sep=""), 
 paste("median5", seq_len((2*padding+1)^2), sep=""),
 paste("median17", seq_len((2*padding+1)^2), sep=""),
 paste("median25", seq_len((2*padding+1)^2), sep=""),
 paste("backgroundRemoval", seq_len((2*padding+1)^2), sep=""),
 paste("foreground", seq_len((2*padding+1)^2), sep="")
 ))
 
 return (dat)
}

If you’ve been following my blog, then you will see that there’s nothing new in the two image processing functions shown above.

To build the model you will need to start h2o, import the data and tell h2o to create a deep learning model.

h2oServer = h2o.init(nthreads = 6, max_mem_size = "10G")

trainData = h2o.importFile(h2oServer, path = outPath)
testData = h2o.importFile(h2oServer, path = outPath2)

model.dl.median <- h2o.deeplearning(x = 2:ncol(trainData), y = 1, training_frame = trainData, validation_frame = testData,
 score_training_samples = 0, 
 overwrite_with_best_model = TRUE,
 activation = "Rectifier", seed = 1,
 hidden = c(200, 200,200), epochs = 15,
 adaptive_rate = TRUE, initial_weight_distribution = "UniformAdaptive", loss = "MeanSquare",
 fast_mode = T, diagnostics = T, ignore_const_cols = T,
 force_load_balance = T)


You should change the h2o.init parameters according to the hardware on your computer. I’m running my model on a PC with 8 CPUs and 16GB of RAM, so I left a couple of CPUs free to do the user interface and core operating system functionality, plus some RAM for the operating system. Scale these parameters up or down if your PC specifications are more or less powerful than mine.

The model may take a few hours to fit. During that time R will not do anything. So if you want to see how the model is progressing, then point your browser to your localhost (port 54321 on my PC, but maybe a different port on yours) and use the h2o web interface to see what is happening.

You can get the predicted values using the following script:


filenames = list.files(dirtyFolder)
for (f in filenames)
{
 print(f)
 imgX = readPNG(file.path(dirtyFolder, f))

dat = img2tab(imgX, f)

x.h2o = as.h2o(h2oServer, dat)
 predict.dl = as.data.frame(h2o.predict(model.dl.median, newdata = x.h2o))
 imgOut = matrix(as.numeric(predict.dl$predict), nrow(imgX), ncol(imgX))
 
 # correct the pixel brightnesses that are out of bounds
 imgOut[imgOut > 1] = 1
 imgOut[imgOut < 0] = 0

writePNG(imgOut, file.path(outFolder, f))
}

h2o.shutdown()

Running predictions is as simple as creating a data file, importing it to h2o, and then asking h2o to give you the predicted values from your already fitted model. I found that some of the raw predicted values were out of the [0, 1] range, and improved my leaderboard score by limiting the predicted values to lie within this range.

You do not need to shut down h2o after you finish running a model. In fact you may wish to leave it running so that you can do model diagnostics or run more predictions.

If you wish to save a copy of your model, for later reuse, then you can use the following syntax:


modelPath = h2o.saveModel(model.dl.median, dir = "./model", name = "model_dnn_median", force = TRUE)

Just remember that h2o needs to be running when you save models or load previously saved models.

In my next, and final, blog in this series, I will show how to take advantage of the second information leakage in the competition.

For those who want the entire R script to try out for themselves, here it is:


install.packages("h2o")
library(h2o)
library(png)
library(data.table)

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

median_Filter = function(img, filterWidth)
{
pad = floor(filterWidth / 2)
padded = matrix(NA, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

tab = matrix(0, nrow(img) * ncol(img), filterWidth * filterWidth)
k = 1
for (i in seq_len(filterWidth))
{
for (j in seq_len(filterWidth))
{
tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
k = k + 1
}
}

filtered = unlist(apply(tab, 1, function(x) median(x[!is.na(x)])))
return (matrix(filtered, nrow(img), ncol(img)))
}

# a function that uses median filter to get the background then finds the dark foreground
background_Removal = function(img)
{
w = 5
p = 1.39
th = 240

# the background is found via a median filter
background = median_Filter(img, w)

# the foreground is darker than the background
foreground = img / background
foreground[foreground > 1] = 1

foreground2 = foreground ^ p
foreground2[foreground2 >= (th / 255)] = 1

return (matrix(foreground2, nrow(img), ncol(img)))
}

dirtyFolder = "./data/train"
cleanFolder = "./data/train_cleaned"
outFolder = "./model"
foregroundFolder = "./foreground/train foreground"

outPath = file.path(outFolder, "trainingdata.csv")
outPath2 = file.path(outFolder, "testdata.csv")
filenames = list.files(dirtyFolder)
padding = 2
set.seed(1)
library(h2o)
h2oServer = h2o.init(nthreads = 15, max_mem_size = "110G")

trainData = h2o.importFile(h2oServer, path = outPath)
testData = h2o.importFile(h2oServer, path = outPath2)

model.dl.median <- h2o.deeplearning(x = 2:ncol(trainData), y = 1, training_frame = trainData, validation_frame = testData,
score_training_samples = 0,
overwrite_with_best_model = TRUE,
activation = "Rectifier", seed = 1,
hidden = c(200, 200,200), epochs = 15,
adaptive_rate = TRUE, initial_weight_distribution = "UniformAdaptive", loss = "MeanSquare",
fast_mode = T, diagnostics = T, ignore_const_cols = T,
force_load_balance = T)

summary(model.dl)

modelPath = h2o.saveModel(model.dl.median, dir = "./model", name = "model_dnn_median", force = TRUE)

outFolder = "./model/training data"

img2tab = function(imgX, f)
{
median5 = img2vec(median_Filter(imgX, 5))
median17 = img2vec(median_Filter(imgX, 17))
median25 = img2vec(median_Filter(imgX, 25))
backgroundRemoval = img2vec(background_Removal(imgX))
foreground = readPNG(file.path(foregroundFolder, f))

# pad out imgX
padded = matrix(0, nrow(imgX) + padding * 2, ncol(imgX) + padding * 2)
offsets = expand.grid(seq_len(2*padding+1), seq_len(2*padding+1))

# raw pixels window
padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = imgX
x = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

# x2 window
padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = median5
x2 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

# x3 window
padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = median17
x3 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

# x4 window
padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = median25
x4 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

# x5 window
padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = backgroundRemoval
x5 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

# x6 window
padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = foreground
x6 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

dat = data.table(cbind(x, x2, x3, x4, x5, x6))
setnames(dat,c(
paste("x", seq_len((2*padding+1)^2), sep=""),
paste("median5", seq_len((2*padding+1)^2), sep=""),
paste("median17", seq_len((2*padding+1)^2), sep=""),
paste("median25", seq_len((2*padding+1)^2), sep=""),
paste("backgroundRemoval", seq_len((2*padding+1)^2), sep=""),
paste("foreground", seq_len((2*padding+1)^2), sep="")
))

return (dat)
}

dirtyFolder = "./data/test"
outFolder = "./model/test data"
foregroundFolder = "./foreground/test foreground"
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))

dat = img2tab(imgX, f)

x.h2o = as.h2o(h2oServer, dat)
predict.dl = as.data.frame(h2o.predict(model.dl.median, newdata = x.h2o))
imgOut = matrix(as.numeric(predict.dl$predict), nrow(imgX), ncol(imgX))

# correct the pixel brightnesses that are out of bounds
imgOut[imgOut > 1] = 1
imgOut[imgOut < 0] = 0

writePNG(imgOut, file.path(outFolder, f))
}

h2o.shutdown()

Denoising Dirty Documents – Part 10

Tags

, , , , ,

In my last blog, I explained how to take advantage of an information leakage regarding the repeated backgrounds in Kaggle’s Denoising Dirty Documents competition. The result of that process was that we had done a fairly good job of removing the background. But the score from doing this was not good enough to get a good placing. We need to do some processing on the image to improve the score.

Today we will use an approach that does not require me to do any feature engineering – convolutional neural networks, which are neural networks where the first few layers repeatedly apply the same weights across overlapping regions of the input data. One intuitive way of thinking about this is that it is like applying an edge detection filter (much like I described here) where the algorithm finds the appropriate weights for several different edge filters.

_images/mylenet.png

I’m told that convolutional neural networks are inspired by how vision works in the natural world. So if I test whether convolutional neural networks work well, am I giving them a robot eye test?

robot eye test

While I am comfortable coding in R, there is little support for convolutional neural networks in R, and I had to code this in Python, using the Theano library. The reason that I chose Theano is because neural network model fitting can be quite time consuming, and Theano supports GPU based processing, which can be orders of magnitude faster than CPU based calculations. To simply the code, I am using Daniel Nouri’s nolearn library, which sits over the lasagne library, which sits over the Theano library. This is the first time I have coded in Python and the first time I have used convolutional neural networks, so it was a good learning experience.

Since my PCs don’t have top of the line graphics cards with GPU processing support, I decided to run my code in a cloud on a virtual machine with GPU support. And since I didn’t want go through the effort of setting up a Linux machine and installing all of the libraries and compilers, I used Domino Data Labs to host my analysis. You can find my project here.

Before I progress to coding the model, I have to set up the environment.

import os
import shutil
 
def setup_theano():
	destfile = "/home/ubuntu/.theanorc"
	open(destfile, 'a').close()
	shutil.copyfile("/mnt/.theanorc", destfile)
 
	print "Finished setting up Theano"

The Python script shown above creates a function that copies the Theano settings file into the appropriate folder in the virtual machine, so that Theano knows to use GPU processing rather than CPU processing. This function gets called from my main script.

In order to reduce the number of calculations, I used a network architecture that inputs an image and outputs an image. The suggestion for this architecture comes from ironbar, a great guy who placed third in the competition. This is unlike all of the examples I found online, which have just one or two outputs, usually because the online example is a classification problem identifying objects appearing within the image. But there are two issues with this architecture:

  1. it doesn’t allow for fully connected layers before the output, and
  2. the target images are different sizes.

I chose to ignore the first problem, although if I had time I would have tried out a more traditional architecture that included fully connected layers but which only models one target pixel at a time.

def image_matrix(img):
 """
 The output value has shape (<number of pixels>, <number of rows>, <number of columns>)
 """
 # 420 x 540 or 258 x 540?
 if img.shape[0] == 258:
 return (img[0:258, 0:540] / 255.0).astype('float32').reshape((1, 1, 258, 540))
 if img.shape[0] == 420:
 result = []
 result.append((img[0:258, 0:540] / 255.0).astype('float32').reshape((1, 1, 258, 540)))
 result.append((img[162:420, 0:540] / 255.0).astype('float32').reshape((1, 1, 258, 540)))
 result = np.vstack(result).astype('float32').reshape((2, 1, 258, 540))
 return result

For the second problem, I used the script shown above to split the larger images into two smaller images that were the same size as the other small images in the data, thereby standardising the output dimensions.

def load_train_set(file_list):
 xs = []
 ys = []
 for fname in file_list:
 x = image_matrix(load_image(os.path.join('./train_foreground/', fname)))
 y = image_matrix(load_image(os.path.join('./train_cleaned/', fname)))
 for i in range(0, x.shape[0]):
 xs.append(x[i, :, :, :].reshape((1, 1, 258, 540)))
 ys.append(y[i, :, :, :].reshape((1, 1, 258, 540)))
 return np.vstack(xs), np.vstack(ys)

Theano uses tensors (multi-dimensional matrices) to store the training data and outputs. The first dimension is the index of the training data item. The second dimension is the colourspace information e.g. RGB would be 3 dimensions. Since our images are greyscale, this dimension has a size of only 1. The remaining dimensions are the dimensions of the input data / output data. The script shown above reshapes the data to meet this criteria.
The nolearn library simplifes the process of defining the architecture of a neural network. I used 3 hidden convolutional layers, each with 25 image filters. The script below shows how this was achieved.

net2 = NeuralNet(
layers = [
('input', layers.InputLayer),
('conv1', layers.Conv2DLayer),
('conv2', layers.Conv2DLayer),
('conv3', layers.Conv2DLayer),
('output', layers.FeaturePoolLayer),
],
#layer parameters:
input_shape = (None, 1, 258, 540),
conv1_num_filters = 25, conv1_filter_size = (7, 7), conv1_pad = 'same',
conv2_num_filters = 25, conv2_filter_size = (7, 7), conv2_pad = 'same',
conv3_num_filters = 25, conv3_filter_size = (7, 7), conv3_pad = 'same',
output_pool_size = 25,
output_pool_function = T.sum,
y_tensor_type=T.tensor4,

#optimization parameters:
update = nesterov_momentum,
update_learning_rate = 0.005,
update_momentum = 0.9,
regression = True,
max_epochs = 200,
verbose = 1,
batch_iterator_train=BatchIterator(batch_size=25),
on_epoch_finished=[EarlyStopping(patience=20),],
train_split=TrainSplit(eval_size=0.25)
)

Due to the unique nature of the problem versus the online examples, my first attempt at this script was not successful. Here are the key changes that I needed to make:

  • set y_tensor_type=T.tensor4 because the target is 2 dimensional
  • your graphics card almost certainly doesn’t have enough RAM to process all of the images at once, so you need to use a batch iterator and experiment to find a suitable batch size e.g. batch_iterator_train=BatchIterator(batch_size=25)

plot_loss(net2)
plt.savefig("./results/plotloss.png")

I also wanted to plot the loss across the iterations. So I added the two lines above, giving me the graph below.

During the first several iterations, the neural network is balancing out the weights so that the pixels are the correct magnitude, and after that the serious work of image processing begins.


plot_conv_weights(net2.layers_[1], figsize=(4, 4))
plt.savefig("./results/convweights.png")

I wanted to see what some of the convolutional filters looked like. So I added the two lines shown above, giving me the set of images below.

These filters look like small parts of images of letters, which makes some sense because we are trying to identify whether a pixel sits on the stroke of a letter.

The output looks reasonable, although not as good as what I achieve using a combination of image processing techniques and deep learning.

In the output image above you can see some imperfections.

I think that if I had changed the network architecture to include fully connected layers then I would have achieved a better result. Maybe one day when I have enough time I will experiment with that architecture.

The full Python script is shown below:


import random
import numpy as np
import cv2
import os
import itertools
import math
import matplotlib.pyplot as plt

from setup_GPU import setup_theano
setup_theano()

from lasagne import layers
from lasagne.updates import nesterov_momentum
from lasagne.nonlinearities import softmax
from lasagne.nonlinearities import sigmoid
from nolearn.lasagne import BatchIterator
from nolearn.lasagne import NeuralNet
from nolearn.lasagne import TrainSplit
from nolearn.lasagne import PrintLayerInfo
from nolearn.lasagne.visualize import plot_loss
from nolearn.lasagne.visualize import plot_conv_weights
from nolearn.lasagne.visualize import plot_conv_activity
from nolearn.lasagne.visualize import plot_occlusion

import theano.tensor as T

def load_image(path):
return cv2.imread(path, cv2.IMREAD_GRAYSCALE)

def write_image(img, path):
return cv2.imwrite(path, img)

def image_matrix(img):
"""
The output value has shape (<number of pixels>, <number of rows>, <number of columns>)
"""
# 420 x 540 or 258 x 540?
if img.shape[0] == 258:
return (img[0:258, 0:540] / 255.0).astype('float32').reshape((1, 1, 258, 540))
if img.shape[0] == 420:
result = []
result.append((img[0:258, 0:540] / 255.0).astype('float32').reshape((1, 1, 258, 540)))
result.append((img[162:420, 0:540] / 255.0).astype('float32').reshape((1, 1, 258, 540)))
result = np.vstack(result).astype('float32').reshape((2, 1, 258, 540))
return result

def load_train_set(file_list):
xs = []
ys = []
for fname in file_list:
x = image_matrix(load_image(os.path.join('./train_foreground/', fname)))
y = image_matrix(load_image(os.path.join('./train_cleaned/', fname)))
for i in range(0, x.shape[0]):
xs.append(x[i, :, :, :].reshape((1, 1, 258, 540)))
ys.append(y[i, :, :, :].reshape((1, 1, 258, 540)))
return np.vstack(xs), np.vstack(ys)

def load_test_file(fname, folder):
xs = []
x = image_matrix(load_image(os.path.join(folder, fname)))
for i in range(0, x.shape[0]):
xs.append(x[i, :, :, :].reshape((1, 1, 258, 540)))
return np.vstack(xs)

def list_images(folder):
included_extentions = ['jpg','bmp','png','gif' ]
results = [fn for fn in os.listdir(folder) if any([fn.endswith(ext) for ext in included_extentions])]
return results

def do_test(inFolder, outFolder, nn):
test_images = list_images(inFolder)
nTest = len(test_images)
for x in range(0, nTest):
fname = test_images[x]
x1 = load_test_file(fname, inFolder)
x1 = x1 - 0.5
pred_y = nn.predict(x1)
tempImg = []
if pred_y.shape[0] == 1:
tempImg = pred_y[0, 0, :, :].reshape(258, 540)
if pred_y.shape[0] == 2:
tempImg1 = pred_y[0, 0, :, :].reshape(258, 540)
tempImg2 = pred_y[1, 0, :, :].reshape(258, 540)
tempImg = np.empty((420, 540))
tempImg[0:258, 0:540] = tempImg1
tempImg[162:420, 0:540] = tempImg2
tempImg[tempImg < 0] = 0
tempImg[tempImg > 1] = 1
tempImg = np.asarray(tempImg*255.0, dtype=np.uint8)
write_image(tempImg, (os.path.join(outFolder, fname)))

class EarlyStopping(object):
def __init__(self, patience=100):
self.patience = patience
self.best_valid = np.inf
self.best_valid_epoch = 0
self.best_weights = None

def __call__(self, nn, train_history):
current_valid = train_history[-1]['valid_loss']
current_epoch = train_history[-1]['epoch']
if current_valid < self.best_valid:
self.best_valid = current_valid
self.best_valid_epoch = current_epoch
self.best_weights = nn.get_all_params_values()
elif self.best_valid_epoch + self.patience < current_epoch:
print("Early stopping.")
print("Best valid loss was {:.6f} at epoch {}.".format(
self.best_valid, self.best_valid_epoch))
nn.load_params_from(self.best_weights)
raise StopIteration()

def main():
random.seed(1234)

training_images = list_images("./train_foreground/")
random.shuffle(training_images)
nTraining = len(training_images)
TRAIN_IMAGES = training_images

train_x, train_y = load_train_set(TRAIN_IMAGES)
test_x = train_x
test_y = train_y

# centre on zero - has already been divided by 255
train_x = train_x - 0.5

net2 = NeuralNet(
layers = [
('input', layers.InputLayer),
('conv1', layers.Conv2DLayer),
('conv2', layers.Conv2DLayer),
('conv3', layers.Conv2DLayer),
('output', layers.FeaturePoolLayer),
],
#layer parameters:
input_shape = (None, 1, 258, 540),
conv1_num_filters = 25, conv1_filter_size = (7, 7), conv1_pad = 'same',
conv2_num_filters = 25, conv2_filter_size = (7, 7), conv2_pad = 'same',
conv3_num_filters = 25, conv3_filter_size = (7, 7), conv3_pad = 'same',
output_pool_size = 25,
output_pool_function = T.sum,
y_tensor_type=T.tensor4,

#optimization parameters:
update = nesterov_momentum,
update_learning_rate = 0.005,
update_momentum = 0.9,
regression = True,
max_epochs = 200,
verbose = 1,
batch_iterator_train=BatchIterator(batch_size=25),
on_epoch_finished=[EarlyStopping(patience=20),],
train_split=TrainSplit(eval_size=0.25)
)

net2.fit(train_x, train_y)

plot_loss(net2)
plt.savefig("./results/plotloss.png")
plot_conv_weights(net2.layers_[1], figsize=(4, 4))
plt.savefig("./results/convweights.png")

#layer_info = PrintLayerInfo()
#layer_info(net2)

import cPickle as pickle
with open('results/net2.pickle', 'wb') as f:
pickle.dump(net2, f, -1)

y_pred2 = net2.predict(test_x)
print "The accuracy of this network is: %0.2f" % (abs(y_pred2 - test_y)).mean()

do_test("./train_foreground/", './train_predicted/', net2)
do_test("./test_foreground/", './test_predicted/', net2)

if __name__ == '__main__':
main()

Denoising Dirty Documents: Part 9

Tags

, , ,

Now that Kaggle’s Denoising Dirty Documents Competition has closed, it’s time to start posting the secrets to getting a very good score in this competition. In this blog, I describe how to take advantage of the first of two information leakages that I used.

512px-broken-water-pipe-clip-art-380417

Information leakage occurs in predictive modelling when the training and test data includes values that would not be known at the time a prediction was being made. For example, I once worked on a direct marketing sales propensity modelling project where our predictive model fitted the training data far too well, making us suspicious. We eventually tracked it down to an incorrectly designed data extract that used status values as at the data extraction date, instead of as at the date at which a prediction would have been run. The sale of the product to the customer changed the value of that data field, so the data needed to be the value of the data field before the sale occurred. In real life projects, information leakage is a bad thing because it overstates the model accuracy. Therefore you need to ensure that it does not occur; otherwise your predictive model will not perform well. In data science competitions, information leakage is something to be taken advantage of. It enables you to obtain a higher score.

The first information leakage in this competition comes from how the training data was created. There are only 8 different page backgrounds. There are 2 coffee cup stains, 2 folded pages, 2 watermarks and 2 crumpled pages.

20151015 output 1

This gives us a huge advantage in removing the background and the stains. Instead of using an uncertain estimate based upon only a single image, we can group together the images so that each group has the same background. Then, for each pixel location, calculate the brightest pixel across all of the images in that group, and use the brightest value as the pixel brightness in the background. You can find a couple of scripts on Kaggle that do this.

The problem is that you can’t do the same for the test images because they have different backgrounds, and they have only four different backgrounds. You can’t have known this without manually looking at the test images, and that break the rules prohibiting manual processing of the test images.

20151015 output 2

You can also run into troubles if your model just assumes that the test images are arranged so that there are 8 different background images that repeat every 8 images.

A better solution is to let your model decide which images can be grouped together, and let your model decide which background belongs with each image. The way to do this is to apply a median filter to each image, to obtain an estimate of the background of each image, and then group the median images together according to similarity.


library(png)
library(raster)
library(data.table)

median_Filter = function(img, filterWidth)
{
pad = floor(filterWidth / 2)
padded = matrix(NA, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

tab = matrix(0, nrow(img) * ncol(img), filterWidth * filterWidth)
k = 1
for (i in seq_len(filterWidth))
{
for (j in seq_len(filterWidth))
{
tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
k = k + 1
}
}

filtered = unlist(apply(tab, 1, function(x) median(x[!is.na(x)])))
return (matrix(filtered, nrow(img), ncol(img)))
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

# training data
dirtyFolder = "C:/Users/Colin/dropbox/Kaggle/Denoising Dirty Documents/data/train"
outPath = "D:/CNN with background removal/train median25"
filenames = list.files(dirtyFolder)
# use a 25x25 median filter to get the background
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))

median25 = median_Filter(imgX, 25)

outFile = file.path(outPath, f)
writePNG(median25, outFile)
}

The script above calculates the median filter for each image and stores it in a folder. I have used a filter size of 25 pixels because I want broad patterns of the background, and I want the text removed.

Now that I have the median images, I can iterate through each training image and link it to other images that have similar median images. I have used RMSE as a measure of similarity. The cutoff RMSE of 2.5% was determined by looking at the range of RMSE values and looking for a natural break.

20151015 output 3


# find the images with matching background
outPath2 = "D:/CNN with background removal/train background"
outPath3 = "D:/CNN with background removal/train foreground"
for (i in 1:length(filenames))
{
f = filenames[i]
print(f)

imgX = readPNG(file.path(outPath, f))

scores = matrix(1, length(filenames))
for (j in 1:length(filenames))
{
imgY = readPNG(file.path(outPath, filenames[j]))
rmse = 1
if (nrow(imgY) >= nrow(imgX) & ncol(imgY) >= ncol(imgX)) rmse = sqrt(mean((imgX - imgY[1:nrow(imgX), 1:ncol(imgX)]) ^ 2))
scores[j] = rmse
}

sameStains = filenames[scores <= 0.025]
nImages = length(sameStains)

rawData = matrix(0, ncol(imgX) * nrow(imgX), nImages)
for (j in 1:nImages)
{
imgY = readPNG(file.path(dirtyFolder, sameStains[j]))
rawData[,j] = img2vec(imgY[1:nrow(imgX), 1:ncol(imgX)])
}

background = matrix(unlist(apply(rawData,1,max)), nrow(imgX), ncol(imgX)) # background is defined as the lightest pixel of images with similar median transformations
plot(raster(background))
writePNG(background, file.path(outPath2, f))

imgX = readPNG(file.path(dirtyFolder, f))
foreground = (imgX - background) / background
r = range(foreground)
foreground = (foreground - r[1]) * (r[2] - r[1])
#plot(raster(foreground))
writePNG(foreground, file.path(outPath3, f))
}

One of the tricks in the script above was that I didn’t simply subtract the background from the image. If I had done that, then the result would not be consistent in areas where the stains occur:

20151015 output 4

20151015 foreground-bad

Notice that in the image above, the writing is faded where the stain was dark. That’s because it was created with the code:


foreground = (imgX - background)
r = range(foreground)
foreground = (foreground - r[1]) * (r[2] - r[1])

Where the background is dark, there is less opportunity for the writing to contrast against the background. To fix this, I changed the above script to be:


foreground = (imgX - background) / background
r = range(foreground)
foreground = (foreground - r[1]) * (r[2] - r[1])

By dividing the difference by the background brightness, I have rescaled the contrast to allow for the limitation on the maximum contrast at this location. The result is shown below:
20151015 foreground-good
This time the writing has consistent darkness across the entire image, regardless of the brightness of the background pixels.

My competition submission consisted of 4 stages, and this leakage-based background removal was the first stage. The final stage also took advantage of an information leakage. But you will have to wait for a couple of blogs to see what that was…

Denoising Dirty Documents: Part 8

Tags

, , ,

In this blog we will engineer a new feature to go into our model. So far we have predominantly been using localised features – information about pixels that are located nearby the pixel whose brightness we are predicting. In this blog we will consider the structure of a document, and use that to improve our model.

20150801 - after

The image consists of multiple lines of text, arranged into one or more paragraphs with multiple sentences in each paragraph.

ducks lined up

If we get our ducks in a row, we can use the fact that the text (like our metaphorical ducks) is arranged into lines, and that there are gaps between those lines of text. In particular, if we can find the gaps between the lines, we can ensure that the predicted value within those gaps is always the background colour.

Let’s look at the horizontal profile of a sample input image:


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png)

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"

f = "135.png"
imgX = readPNG(file.path(dirtyFolder, f))

hProfileX = unlist(apply(imgX, 1, mean))
hProfileY = unlist(apply(imgY, 1, mean))

x = 1:nrow(imgX)
plot(x, hProfileX, col="red", type="l", xlab = "row", ylab = "brightness")

20150930 output 01

The rows with low brightness show where text occurs at regular intervals. But the absolute level of the row brightness varies, according to the noise and stains, so much that sometimes the gaps between the lines of text are darker than the rows in which text occurs. This will make it more difficult to determine which rows are which.

But what if we used the predicted images from our stacked model from the last blog as the inputs to this predictor?


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png)

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\stacked"

f = "135.png"
imgX = readPNG(file.path(dirtyFolder, f))

hProfileX = unlist(apply(imgX, 1, mean))
hProfileY = unlist(apply(imgY, 1, mean))

x = 1:nrow(imgX)
plot(x, hProfileX, col="red", type="l", xlab = "row", ylab = "brightness")

20150930 output 02

Now it’s much easier to find the gaps between the lines of text. To do this we need estimate three parameters:

  1. cycle length: the number of rows from one line of text to the next line of text
  2. gap length: the height of the gap between lines of text
  3. starting location of the first cycle

The constraints that we can use for determining these parameters are:

  • cycle length is a positive number
  • gap length is a positive number
  • gap length is less than cycle length
  • gap rows are white
  • most letters (e.g. weruioaszxcvnm) are short and these letters are located in the darkest rows, which we will call the “core rows”
  • some letters are taller (e.g. tdfhklb) and create slightly darker rows above the core rows
  • some letters are taller (e.g. qypgj) and create slightly darker rows below the core rows
  • if we cluster the values, the darker cluster contains the core of the letters (the core rows), but these rows may be lighter when a paragraph starts or ends
  • the in-between valued rows next to the white rows are possibly where the taller letters are located and their darkness will vary according to the number of tall letters

Let’s start by clustering the row brightnesses, and checking where the rows switch from one cluster to another, and using this to estimate cycle length.


km = kmeans(hProfileX, 2)
cutoff = (max(hProfileX[km$cluster == which.min(km$centers)]) + min(hProfileX[km$cluster == which.max(km$centers)])) / 2
plot(hProfileX, type="l")
lines(hProfileX * 0 + cutoff, col="red")

starts = (2:length(hProfileX))[sapply(2:length(hProfileX), function(x) hProfileX[x] < cutoff & hProfileX[x-1] > cutoff)]
cycleLength = mean(diff(starts))

20150930 output 03

20150930 output 04

Next we will estimate the starting location of the first cycle. But before we do that, we need to define the “start” of the cycle. Let’s define the cycle as starting at the row in which the first letter starts. That would be the first non-white row before the first core rows.

cycleStarts = starts[1]
while (hProfileX[cycleStarts] < 0.99)
cycleStarts = cycleStarts – 1

The first cycle starts at row 3.

To estimate the gap length, we should measure the number of white rows before each cycle starts.


gaps = matrix(0, length(starts) - 1)
for (i in 2:length(starts))
{
 cycleStarts = starts[i]
 while (hProfileX[cycleStarts] < 0.99)
 cycleStarts = cycleStarts - 1
 gapStarts = cycleStarts - 1
 while (hProfileX[gapStarts] > 0.99)
 gapStarts = gapStarts - 1
 gapLength = cycleStarts - gapStarts
 gaps[i - 1] = gapLength
}
gaps

The gaps are (13, 12, 17, 13, 12, 11, 12, 18, 12, 12). The two outliers occur after rows of text that do not contain any of the characters (q, y, p, g, j). After removing these two outliers, the gaps do not exceed 13 rows. So we can write code to allow for this as follows:


mGaps = median(gaps)
gapLength = mean(gaps[abs(gaps - mGaps) <= 2])

This gives us a gap length estimate of 13.125 and completes our line location model for this image. The quick way to use this model is to white out the gap rows. A more complex model could use the newly estimated row types (core, gap, other) as predictors.


imgCleaned = imgX
numCycles = length(starts)
for (i in 1:numCycles)
{
cycleStart = starts[i] - min(gaps[abs(gaps - mGaps) <= 2])
imgCleaned[cycleStart:(starts[i]-1)] = 1
}

The script above would clean up any minor image blemishes that exist between the lines of text.

Denoising Dirty Documents: Part 7

Tags

, , , ,

By the time I’d finished building the model in my last blog, I’d started to overload my computer’s RAM and CPU, so much so that I couldn’t add any more features. One solution could be for me to upgrade my hardware, or rent out a cluster in the cloud, but I’m trying to save money at the moment. So today I’m going to restructure my predictive solution into separate predictive models, each of which do not individually overload my computer, but which are combined via stacking to give an overall solution that is more predictive than any of the individual models.

stacking

Stacking would also allow us to answer Bobby’s question:

I thought it was interesting how the importance chart didn’t show any one individual pixel as being significant. Is there another similar importance chart or metric that can show the importance of a combination of pixels?

So let’s start by breaking up the current monolithic model into discrete chunks. Once we have done this, it will be easier for us to add new features (which I will do in the next blog). I will store the predicted outputs from each model in image format with separate folders for each model.

20150923 diagram 1

The R script for creating the individual models is primarily recycled from my past blogs. Since images contain integer values for pixel brightness, one may argue that storing the predicted values as images loses some of the information content, but the rounding of floating point values to integer values only affects the precision by a maximum of 0.2%, and the noise in the predicted values is an order of magnitude greater than that. If you want that fractional extra predictive power, then I leave it to you to adapt the script to write the floating point predictions into a data.table.

Linear Transformation

The linear transformation model was developed here, and just involves a linear transformation, then constraining the output to be in the [0, 1] range.

20150801 output 6


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png)

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"

# predict using linear transformation
outModel1 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model1"
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))

# turn the images into vectors
x = matrix(imgX, nrow(imgX) * ncol(imgX), 1)

# linear transformation
yHat = -0.126655 + 1.360662 * x

# constrain the range to be in [0, 1]
yHat = sapply(yHat, function(x) max(min(x, 1),0))

# turn the vector into an image
imgYHat = matrix(yHat, nrow(imgX), ncol(imgX))

# save the predicted value
writePNG(imgYHat, file.path(outModel1, f))
}

Thresholding

The thresholding model was developed here, and involves a combination of adaptive thresholding processes, each with different sizes of localisation ranges, plus a global thresholding process. But this time we will use xgboost instead of gbm.

20150815 output 3


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, data.table, xgboost)

if (!require("EBImage"))
{
source("http://bioconductor.org/biocLite.R")
biocLite("EBImage")
}

# a function to do k-means thresholding
kmeansThreshold = function(img)
{
# fit 3 clusters
v = img2vec(img)
km.mod = kmeans(v, 3)
# allow for the random ordering of the clusters
oc = order(km.mod$centers)
# the higher threshold is the halfway point between the top of the middle cluster and the bottom of the highest cluster
hiThresh = 0.5 * (max(v[km.mod$cluster == oc[2]]) + min(v[km.mod$cluster == oc[3]]))

# using upper threshold
imgHi = v
imgHi[imgHi <= hiThresh] = 0 imgHi[imgHi > hiThresh] = 1

return (imgHi)
}

# a function that applies adaptive thresholding
adaptiveThresholding = function(img)
{
img.eb <- Image(t(img))
img.thresholded.3 = thresh(img.eb, 3, 3)
img.thresholded.5 = thresh(img.eb, 5, 5)
img.thresholded.7 = thresh(img.eb, 7, 7)
img.thresholded.9 = thresh(img.eb, 9, 9)
img.thresholded.11 = thresh(img.eb, 11, 11)
img.kmThresh = kmeansThreshold(img)

# combine the adaptive thresholding
ttt.1 = cbind(img2vec(Image2Mat(img.thresholded.3)), img2vec(Image2Mat(img.thresholded.5)), img2vec(Image2Mat(img.thresholded.7)), img2vec(Image2Mat(img.thresholded.9)), img2vec(Image2Mat(img.thresholded.11)), img2vec(kmeansThreshold(img)))
ttt.2 = apply(ttt.1, 1, max)
ttt.3 = matrix(ttt.2, nrow(img), ncol(img))
return (ttt.3)
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

# a function to convert an Image into a matrix
Image2Mat = function(Img)
{
m1 = t(matrix(Img, nrow(Img), ncol(Img)))
return(m1)
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outPath = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model2.csv"
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))
imgY = readPNG(file.path(cleanFolder, f))

# turn the images into vectors
x = img2vec(imgX)
y = img2vec(imgY)

# threshold the image
x2 = kmeansThreshold(imgX)

# adaptive thresholding
x3 = img2vec(adaptiveThresholding(imgX))

dat = data.table(cbind(y, x, x2, x3))
setnames(dat,c("y", "raw", "thresholded", "adaptive"))
write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit an xgboost model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 2000000)
dat[is.na(dat)] = 0
dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
# do cross validation first
xgb.tab = xgb.cv(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = 1000, early.stop.round = 50, nfold = 5, print.every.n = 10)
# what is the best number of rounds?
min.error.idx = which.min(xgb.tab[, test.rmse.mean])
# now fit an xgboost model
xgb.mod = xgboost(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)

# get the predicted result for each image
outModel2 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model2"
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))

# turn the images into vectors
x = img2vec(imgX)

# threshold the image
x2 = kmeansThreshold(imgX)

# adaptive thresholding
x3 = img2vec(adaptiveThresholding(imgX))

dat = data.table(cbind(x, x2, x3))
setnames(dat,c("raw", "thresholded", "adaptive"))

# predicted values
yHat = predict(xgb.mod, newdata=as.matrix(dat))

# constrain the range to be in [0, 1]
yHat = sapply(yHat, function(x) max(min(x, 1),0))

# turn the vector into an image
imgYHat = matrix(yHat, nrow(imgX), ncol(imgX))

# save the predicted value
writePNG(imgYHat, file.path(outModel2, f))
}

Edges

The edge model was developed here, and involves a combination of canny edge detection and image morphology.

20150822 output 2


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, data.table, xgboost, biOps)

# a function to do canny edge detector
cannyEdges = function(img)
{
img.biOps = imagedata(img * 255)
img.canny = imgCanny(img.biOps, 0.7)
return (matrix(img.canny / 255, nrow(img), ncol(img)))
}

# a function combining canny edge detector with morphology
cannyDilated1 = function(img)
{
img.biOps = imagedata(img * 255)
img.canny = imgCanny(img.biOps, 0.7)
# do some morphology on the edges to fill the gaps between them
mat <- matrix (0, 3, 3)
mask <- imagedata (mat, "grey", 3, 3)
img.dilation = imgBinaryDilation(img.canny, mask)
img.erosion = imgBinaryErosion(img.dilation, mask)
return(matrix(img.erosion / 255, nrow(img), ncol(img)))
}

# a function combining canny edge detector with morphology
cannyDilated2 = function(img)
{
img.biOps = imagedata(img * 255)
img.canny = imgCanny(img.biOps, 0.7)
# do some morphology on the edges to fill the gaps between them
mat <- matrix (0, 3, 3)
mask <- imagedata (mat, "grey", 3, 3)
img.dilation = imgBinaryDilation(img.canny, mask)
img.erosion = imgBinaryErosion(img.dilation, mask)
img.erosion.2 = imgBinaryErosion(img.erosion, mask)
img.dilation.2 = imgBinaryDilation(img.erosion.2, mask)
return(matrix(img.dilation.2 / 255, nrow(img), ncol(img)))
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outPath = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model3.csv"
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))
imgY = readPNG(file.path(cleanFolder, f))

# turn the images into vectors
x = img2vec(imgX)
y = img2vec(imgY)

# canny edge detector and related features
x4 = img2vec(cannyEdges(imgX))
x5 = img2vec(cannyDilated1(imgX))
x6 = img2vec(cannyDilated2(imgX))

dat = data.table(cbind(y, x, x4, x5, x6))
setnames(dat,c("y", "raw", "canny", "cannyDilated1", "cannyDilated2"))
write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit an xgboost model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 2000000)
dat[is.na(dat)] = 0
dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
# do cross validation first
xgb.tab = xgb.cv(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = 1000, early.stop.round = 50, nfold = 5, print.every.n = 10)
# what is the best number of rounds?
min.error.idx = which.min(xgb.tab[, test.rmse.mean])
# now fit an xgboost model
xgb.mod = xgboost(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)

# get the predicted result for each image
outModel3 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model3"
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))

# turn the images into vectors
x = img2vec(imgX)

# canny edge detector and related features
x4 = img2vec(cannyEdges(imgX))
x5 = img2vec(cannyDilated1(imgX))
x6 = img2vec(cannyDilated2(imgX))

dat = data.table(cbind(x, x4, x5, x6))
setnames(dat,c("raw", "canny", "cannyDilated1", "cannyDilated2"))

# predicted values
yHat = predict(xgb.mod, newdata=as.matrix(dat))

# constrain the range to be in [0, 1]
yHat = sapply(yHat, function(x) max(min(x, 1),0))

# turn the vector into an image
imgYHat = matrix(yHat, nrow(imgX), ncol(imgX))

# save the predicted value
writePNG(imgYHat, file.path(outModel3, f))
}

Background Removal

The background removal model was developed here, and involves the use of median filters. An xgboost model is used to combine the median filter results.

20150829 output 1


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, data.table, xgboost)
# a function to do a median filter
median_Filter = function(img, filterWidth)
{
pad = floor(filterWidth / 2)
padded = matrix(NA, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

tab = matrix(NA, nrow(img) * ncol(img), filterWidth ^ 2)
k = 1
for (i in seq_len(filterWidth))
{
for (j in seq_len(filterWidth))
{
tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
k = k + 1
}
}

filtered = unlist(apply(tab, 1, function(x) median(x, na.rm = TRUE)))
return (matrix(filtered, nrow(img), ncol(img)))
}

# a function that uses median filter to get the background then finds the dark foreground
background_Removal = function(img)
{
w = 5

# the background is found via a median filter
background = median_Filter(img, w)

# the foreground is darker than the background
foreground = img - background
foreground[foreground > 0] = 0
m1 = min(foreground)
m2 = max(foreground)
foreground = (foreground - m1) / (m2 - m1)

return (matrix(foreground, nrow(img), ncol(img)))
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outPath = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model4.csv"
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))
imgY = readPNG(file.path(cleanFolder, f))

# turn the images into vectors
x = img2vec(imgX)
y = img2vec(imgY)

# median filter and related features
x7a = img2vec(median_Filter(imgX, 5))
x7b = img2vec(median_Filter(imgX, 11))
x7c = img2vec(median_Filter(imgX, 17))
x8 = img2vec(background_Removal(imgX))

dat = data.table(cbind(y, x, x7a, x7b, x7c, x8))
setnames(dat,c("y", "raw", "median5", "median11", "median17", "background"))
write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit an xgboost model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 2000000)
dat[is.na(dat)] = 0
dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
# do cross validation first
xgb.tab = xgb.cv(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = 1000, early.stop.round = 50, nfold = 5, print.every.n = 10)
# what is the best number of rounds?
min.error.idx = which.min(xgb.tab[, test.rmse.mean])
# now fit an xgboost model
xgb.mod = xgboost(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)

# get the predicted result for each image
outModel4 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model4"
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))

# turn the images into vectors
x = img2vec(imgX)

# median filter and related features
x7a = img2vec(median_Filter(imgX, 5))
x7b = img2vec(median_Filter(imgX, 11))
x7c = img2vec(median_Filter(imgX, 17))
x8 = img2vec(background_Removal(imgX))

dat = data.table(cbind(x, x7a, x7b, x7c, x8))
setnames(dat,c("raw", "median5", "median11", "median17", "background"))

# predicted values
yHat = predict(xgb.mod, newdata=as.matrix(dat))

# constrain the range to be in [0, 1]
yHat = sapply(yHat, function(x) max(min(x, 1),0))

# turn the vector into an image
imgYHat = matrix(yHat, nrow(imgX), ncol(imgX))

# save the predicted value
writePNG(imgYHat, file.path(outModel4, f))
}

Nearby Pixels

The spacial model was developed here, and involves the use of a sliding window and xgboost.

20150904 output 7


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, data.table, xgboost)

# a function that groups together the pixels contained within a sliding window around each pixel of interest
proximalPixels = function(img)
{
 pad = 2
 width = 2 * pad + 1
 padded = matrix(median(img), nrow(img) + 2 * pad, ncol(img) + 2 * pad)
 padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img
 
 tab = matrix(1, nrow(img) * ncol(img), width ^ 2)
 k = 1
 for (i in seq_len(width))
 {
 for (j in seq_len(width))
 {
 tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
 k = k + 1
 }
 }

 return (tab)
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
 return (matrix(img, nrow(img) * ncol(img), 1))
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outPath = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model5.csv"
filenames = list.files(dirtyFolder)
for (f in filenames)
{
 print(f)
 imgX = readPNG(file.path(dirtyFolder, f))
 imgY = readPNG(file.path(cleanFolder, f))
 
 # turn the images into vectors
 #x = img2vec(imgX)
 y = img2vec(imgY)
 
 # surrounding pixels
 x9 = proximalPixels(imgX)

 dat = data.table(cbind(y, x9))
 setnames(dat,append("y", paste("x", 1:25, sep="")))
 write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit an xgboost model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 2000000)
dat[is.na(dat)] = 0
dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
# do cross validation first
xgb.tab = xgb.cv(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = 1000, early.stop.round = 50, nfold = 5, print.every.n = 10)
# what is the best number of rounds?
min.error.idx = which.min(xgb.tab[, test.rmse.mean]) 
# now fit an xgboost model 
xgb.mod = xgboost(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)


# get the predicted result for each image
outModel5 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model5"
for (f in filenames)
{
 print(f)
 imgX = readPNG(file.path(dirtyFolder, f))
 
 # turn the images into vectors
 #x = img2vec(imgX) 
 
 # surrounding pixels
 x9 = proximalPixels(imgX)

 dat = data.table(x9)
 setnames(dat,paste("x", 1:25, sep=""))
 
 # predicted values
 yHat = predict(xgb.mod, newdata=as.matrix(dat))
 
 # constrain the range to be in [0, 1]
 yHat = sapply(yHat, function(x) max(min(x, 1),0))
 
 # turn the vector into an image
 imgYHat = matrix(yHat, nrow(imgX), ncol(imgX))
 
 # save the predicted value
 writePNG(imgYHat, file.path(outModel5, f))
}

Combining the Individual Models

There are many ways that the predictions of the models could be ensembled together. I have used an xgboost model because I want to allow for the complexity of the problem that is being solved.

# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, data.table, xgboost)

# a function to turn a matrix image into a vector
img2vec = function(img)
{
 return (matrix(img, nrow(img) * ncol(img), 1))
}

inPath1 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model1"
inPath2 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model2"
inPath3 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model3"
inPath4 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model4"
inPath5 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model5"

cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outPath = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\stacking.csv"

filenames = list.files(inPath1)
for (f in filenames)
{
 print(f)
 imgX1 = readPNG(file.path(inPath1, f))
 imgX2 = readPNG(file.path(inPath2, f))
 imgX3 = readPNG(file.path(inPath3, f))
 imgX4 = readPNG(file.path(inPath4, f))
 imgX5 = readPNG(file.path(inPath5, f))
 imgY = readPNG(file.path(cleanFolder, f))
 
 # turn the images into vectors
 #x = img2vec(imgX)
 y = img2vec(imgY)
 
 # contributing models
 x1 = img2vec(imgX1)
 x2 = img2vec(imgX2)
 x3 = img2vec(imgX3)
 x4 = img2vec(imgX4)
 x5 = img2vec(imgX5)

 dat = data.table(cbind(y, x1, x2, x3, x4, x5))
 setnames(dat,c("y", "linear", "thresholding", "edges", "backgroundRemoval", "proximal"))
 write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit an xgboost model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 5000000)
dat[is.na(dat)] = 0
dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
# do cross validation first
xgb.tab = xgb.cv(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = 1000, early.stop.round = 50, nfold = 5, print.every.n = 10)
# what is the best number of rounds?
min.error.idx = which.min(xgb.tab[, test.rmse.mean]) 
# now fit an xgboost model 
xgb.mod = xgboost(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)

# get the trained model
model = xgb.dump(xgb.mod, with.stats=TRUE)
# get the feature real names
names = names(dat)[-1]
# compute feature importance matrix
importance_matrix = xgb.importance(names, model=xgb.mod)
# plot the variable importance
gp = xgb.plot.importance(importance_matrix)
print(gp)

# get the predicted result for each image
outModel = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\stacked"
for (f in filenames)
{
 print(f)
 imgX1 = readPNG(file.path(inPath1, f))
 imgX2 = readPNG(file.path(inPath2, f))
 imgX3 = readPNG(file.path(inPath3, f))
 imgX4 = readPNG(file.path(inPath4, f))
 imgX5 = readPNG(file.path(inPath5, f))
 
 # contributing models
 x1 = img2vec(imgX1)
 x2 = img2vec(imgX2)
 x3 = img2vec(imgX3)
 x4 = img2vec(imgX4)
 x5 = img2vec(imgX5)

 dat = data.table(cbind(x1, x2, x3, x4, x5))
 setnames(dat,c("linear", "thresholding", "edges", "backgroundRemoval", "proximal"))
 
 # predicted values
 yHat = predict(xgb.mod, newdata=as.matrix(dat))
 
 # constrain the range to be in [0, 1]
 yHat = sapply(yHat, function(x) max(min(x, 1),0))
 
 # turn the vector into an image
 imgYHat = matrix(yHat, nrow(imgX1), ncol(imgX1))
 
 # save the predicted value
 writePNG(imgYHat, file.path(outModel, f))
}

20150923 output 1
The graph above answers Bobby’s question – when considered together, the nearby pixels provide most of the additional predictive power beyond that of the raw pixel brightness. It seems that the key to separating dark text from noise is to consider how the pixel fits into the local region within the image.
Now that we have set up a stacking model structure, we can recommence the process of adding more image processing and features to the model. And that’s what the next blog will do…