Keeping Up With The Latest Techniques

~ brief insights


Tag Archives: Image Processing

Second Annual Data Science Bowl – Part 3 – Automatically Finding the Heart Location in an MRI Image

08 Tuesday Mar 2016

Posted by Colin Priest in Automation, Convolutional Neural Networks, Deep Learning, Image Processing, Kaggle, Machine Learning, Medical Imaging, Python

≈ 10 Comments

Tags

Automation, Deep Learning, Image Processing, Kaggle, Machine Learning, Medical Imaging

My last blog wasn’t so sexy, what with all the data cleansing, and no predictive modelling. But in this blog I do something really cool – I train a machine learning model to find the left ventricle of the heart in an MRI image. And I couldn’t have done it without all of that boring data cleansing. #kaggle @kaggle

Aside from being a really cool thing to do, there is a purpose to this modelling. I want to find the boundaries of the heart chamber, and that is much easier and faster to do when I remove distractions. Once I have found the location of the heart chamber, I can crop the image to a much smaller square.
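As a rough sketch of that cropping step (this is not the code I actually used; the function name, window size and variable names are just placeholders), once the centroid is known the crop is simple array slicing:

def crop_around_centroid(img, cx, cy, half_width=32):
    # img is a 2-D numpy array (greyscale image);
    # cx, cy are the predicted centroid column and row in pixel coordinates;
    # clamp the window so that the square stays inside the image
    rows, cols = img.shape
    x0 = int(min(max(cx - half_width, 0), cols - 2 * half_width))
    y0 = int(min(max(cy - half_width, 0), rows - 2 * half_width))
    return img[y0:y0 + 2 * half_width, x0:x0 + 2 * half_width]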

The input to the model will be a set of images. In order to simplify what the model learns, I only gave it training images from sax locations near the centre of the heart.

20160308-image01
20160308-image02
20160308-image03
20160308-image04

The output from the model will be the row and column location of the centroid of the left ventricle heart chamber (the red dot in the images above), expressed as fractions of the image height and width.

I had to manually define those centroid locations for a training set of a few hundred of the images. This was laborious and time consuming, even after I automated some of the process. But it needed to be done, because otherwise the machine learning algorithm has no way of knowing what the true answers should be.

Even though I am much more comfortable coding in R than in Python, I used Python for this step because I wanted to use Daniel Nouri's nolearn library, which sits above lasagne and theano, and these libraries are not available in R. The convolutional neural network architecture was based upon the architecture in Daniel Nouri's tutorial for the Facial Keypoints Detection competition in Kaggle.

Step 1: Importing of all the Required Libraries

OK, so this part isn’t all that sexy either. But it’s the engine for all the cool modelling that is about to be done.


import sys
import numpy as np
import csv
import random
import math
import os
import cv2
import itertools
import matplotlib.pyplot as plt
import pandas as pd

from lasagne import layers
from lasagne.updates import nesterov_momentum
from lasagne.nonlinearities import softmax
from lasagne.nonlinearities import sigmoid
from nolearn.lasagne import BatchIterator
from nolearn.lasagne import NeuralNet
from nolearn.lasagne import TrainSplit
from nolearn.lasagne import PrintLayerInfo
from nolearn.lasagne.visualize import plot_loss
from nolearn.lasagne.visualize import plot_conv_weights
from nolearn.lasagne.visualize import plot_conv_activity
from nolearn.lasagne.visualize import plot_occlusion

%pylab inline
from lasagne.layers import DenseLayer
from lasagne.layers import InputLayer
from lasagne.layers import DropoutLayer
from lasagne.layers import Conv2DLayer
from lasagne.layers import MaxPool2DLayer
from lasagne.nonlinearities import softmax
from lasagne.updates import adam
from lasagne.layers import get_all_params
from nolearn.lasagne import NeuralNet
from nolearn.lasagne import TrainSplit
from nolearn.lasagne import objective

import theano
import theano.tensor as T

 

Step 2: Defining the Helper Functions

I used jupyter notebook as the development environment to set up and run my Python scripts. While there’s a lot to like about jupyter, one thing that annoys me is that print commands run in jupyter don’t immediately show text on the screen. But here’s a trick to work around that:


def printQ(s):
    print(s)
    sys.stdout.flush()

Using this helper function instead of the print function results in text immediately appearing in the output. This is particularly helpful for progress messages on long training runs.

I like to use R’s expand.grid function, but it isn’t built in to Python. So I wrote my own helper function in Python that mimics the functionality:


def product2(*args, repeat=1):
    # behaves like itertools.product
    # product('ABCD', 'xy') --> Ax Ay Bx By Cx Cy Dx Dy
    # product(range(2), repeat=3) --> 000 001 010 011 100 101 110 111
    pools = [tuple(pool) for pool in args] * repeat
    result = [[]]
    for pool in pools:
        result = [x + [y] for x in result for y in pool]
    for prod in result:
        yield tuple(prod)

def expand_grid(dictionary):
    return pd.DataFrame([row for row in product2(*dictionary.values())],
                        columns=dictionary.keys())
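For example (with hypothetical values), the function produces every combination of the supplied lists, just like expand.grid in R:

grid = expand_grid({'equalise': [0, 1], 'rotation': [0, 180]})
print(grid)
#    equalise  rotation
# 0         0         0
# 1         0       180
# 2         1         0
# 3         1       180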

This next function reads a cleaned-up image, ensures that it is in portrait orientation, then resizes it to 96 x 96 pixels. The resizing is done to reduce memory usage in my GPU, and to speed up the training and scoring.


def load_image(path):
    # check that the file exists
    if not os.path.isfile(path):
        printQ('BAD PATH: ' + path)
    image = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # ensure portrait aspect ratio
    if image.shape[0] < image.shape[1]:
        image = cv2.transpose(image)
    # resize to 96x96
    resized_image = cv2.resize(image, (96, 96))
    # check that the image isn't empty
    s = sum(sum(resized_image))
    if np.isnan(s):
        print(path)
    return resized_image

My initial models were not performing as well as I would like. So to force the model to generalise, I added some image transformations (rotation and reflection) to the training data. This required helper functions:


def rotate_image(img, degrees):
    rows, cols = img.shape
    M = cv2.getRotationMatrix2D((cols / 2, rows / 2), degrees, 1)
    dst = cv2.warpAffine(img, M, (cols, rows))
    return dst

def transform_image(img, equalise, gammaAdjust, reflection, rotation):
    if equalise == 1:
        img = cv2.equalizeHist(img)
    if gammaAdjust != 1:
        img = pow(img / 255.0, gammaAdjust) * 255.0
    if reflection == 0 or reflection == 1:
        img = cv2.flip(img, int(reflection))
    if rotation != 0:
        img = rotate_image(img, rotation)
    return img

def transform_xy(x, y, equalise, gammaAdjust, reflection, rotation):
    # if equalise, then no change to x and y
    # if gamma adjustment, then no change to x and y
    # reflection
    if reflection == 0:
        y = 1.0 - y
    if reflection == 1:
        x = 1.0 - x
    if rotation == 180:
        x = 1.0 - x
        y = 1.0 - y
    if rotation != 0 and rotation != 180:
        x1 = x - 0.5
        y1 = y - 0.5
        theta = rotation / 180.0 * math.pi
        x2 = x1 * math.cos(theta) + y1 * math.sin(theta)
        y2 = -x1 * math.sin(theta) + y1 * math.cos(theta)
        x = x2 + 0.5
        y = y2 + 0.5
    return np.array([x, y])

# set up the image adjustments
# equalise, gammaAdjust, reflection, rotation
dAdjustShort = {'equalise': [0],
                'gammaAdjust': [1],
                'reflection': [-1],
                'rotation': [0]}
dAdjust = {'equalise': [0, 1],
           'gammaAdjust': [1, 0.75, 1.5],
           'reflection': [-1, 0, 1],
           'rotation': [0, 180, 3, -3]}
# 'rotation': [0, 180, 3, -3, 10, -10]}
imgAdj = expand_grid(dAdjust)
#imgAdj = expand_grid(dAdjustShort)
print(imgAdj.shape)
# can't have both reflection AND rotation by 180 degrees
imgAdj = imgAdj.query('not (reflection > -1 and rotation == 180)')
# can't have equalise AND gamma adjust
imgAdj = imgAdj.query('not (equalise == 1 and gammaAdjust != 1)')

There are two feature sets that enable the machine learning model to find the heart:

  1. shape within the image – the left ventricle chamber and its thick lining have a distinctive round shape
  2. movement between images at different points of time – the heart is moving but most of the chest is stationary

So I set up the network architecture to use two channels. The first channel is the image, and the second channel is the difference between the image at this point of time and the image a few time periods later (5 in the code below).
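In essence, each training example is a 2 x 96 x 96 array built like the distilled sketch below (assuming img_now and img_later are 96 x 96 arrays already scaled to [-1, 1]); the full loader that follows does this for every labelled image and every augmentation:

xm = np.zeros((1, 2, 96, 96), dtype='float32')
xm[0, 0, :, :] = img_now                # channel 1: the image itself
xm[0, 1, :, :] = img_later - img_now    # channel 2: movement between frames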


def load_train_set():
 #numTimes = 8 # how many time periods to use
 plusTime = 5 # gaps between time periods for comparison
 xs = []
 ys = []
 ids = []
 all_y = '/home/colin/data/Second-Annual-Data-Science-Bowl/working/centroids-20160218-R.csv'
 with open(all_y) as f:
   rows = csv.reader(f, delimiter=',', quotechar='"')
   iRow = 1
   for line in rows:
     # first line is column headers, so ignore it
     if iRow > 1:
       # parse the line
       patient = line[0]
       x = float(line[1])
       y = float(line[2])
       sax = int(line[4])
       firstTime = int(line[5])
       if int(patient) % 25 == 0 and firstTime == 1:
         printQ(patient)
       # enhance the training data with rotations, reflections, NOT histogram equalisation
       for index, row in imgAdj.iterrows():
         # append the target values
         xy = transform_xy(x, y, row['equalise'], row['gammaAdjust'], row['reflection'], row['rotation'])
         ys.append(xy.astype('float32').reshape((1, 2)))
         #
         # read the images
         folder = '/home/colin/data/Second-Annual-Data-Science-Bowl/train-cleaned/' + patient + '/study/sax_0' + str(sax) + '/'
         xm = np.zeros([1, 2, 96, 96])
         #
         #
         # current frame
         path = folder + 'image-' + ('%0*d' % (2, firstTime)) + '.png'
         img = load_image(path)
         # transform the image - rotation, reflection etc
         img = transform_image(img, row['equalise'], row['gammaAdjust'], row['reflection'], row['rotation'])
         # get the pixels into the range [-1, 1]
         img = (img / 128.0 - 1.0).astype('float32').reshape((96, 96))
         xm[0, 0, :, :] = img
         #
         #
         # find movement of current frame to future frame
         path = folder + 'image-' + ('%0*d' % (2, firstTime + plusTime)) + '.png'
         img = load_image(path)
         # transform the image - rotation, reflection etc
         img = transform_image(img, row['equalise'], row['gammaAdjust'], row['reflection'], row['rotation'])
         # get the pixels into the range [-1, 1]
         img = (img / 128.0 - 1.0).astype('float32').reshape((96, 96))
         # first time is the complete image at time 1
         # subsequent frames are the differences between frames
         xm[0, 1, :, :] = img - xm[0, 0, :, :]
         xs.append(xm.astype('float32').reshape((1, 2, 96, 96)))
         ids.append(patient)

     iRow = iRow + 1
 return np.vstack(xs), np.vstack(ys), np.vstack(ids)

I used early stopping to help reduce overfitting:


class EarlyStopping(object):
    def __init__(self, patience=100):
        self.patience = patience
        self.best_valid = np.inf
        self.best_valid_epoch = 0
        self.best_weights = None

    def __call__(self, nn, train_history):
        current_valid = train_history[-1]['valid_loss']
        current_epoch = train_history[-1]['epoch']
        if current_valid < self.best_valid:
            self.best_valid = current_valid
            self.best_valid_epoch = current_epoch
            self.best_weights = nn.get_all_params_values()
        elif self.best_valid_epoch + self.patience < current_epoch:
            print('Early stopping')
            print('Best valid loss was {:.6f} at epoch {}.'.format(
                self.best_valid, self.best_valid_epoch))
            nn.load_params_from(self.best_weights)
            raise StopIteration()

Step 3: Read the Training Data

As well as reading the training data, I shuffled the order of the data. This allowed me to use batch training.


# read the training data

printQ('reading the training data')
train_x, train_y, train_id = load_train_set()

printQ ('shuffling training rows')
random.seed(1234)
rows = random.choice(arange(0, train_x.shape[0]), train_x.shape[0])
t_x = train_x[rows,:,:,:]
t_y = train_y[rows,:]
t_id = train_id[rows]

printQ('finished')

Step 4: Train the Model

The network architecture used deep convolutional layers to find features in the image, then fully connected layers to convert these features into the centroid location:

20160308-image05


# fit the models

# set up the model
printQ ('setting up the model structure')
layers0 = [
 # layer dealing with the input data
 (InputLayer, {'shape': (None, 2, 96, 96)}),

# first stage of our convolutional layers
 (Conv2DLayer, {'num_filters': 32, 'filter_size': 5}),
 #(DropoutLayer, {'p': 0.2}),
 (Conv2DLayer, {'num_filters': 32, 'filter_size': 3}),
 #(DropoutLayer, {'p': 0.2}),
 (Conv2DLayer, {'num_filters': 32, 'filter_size': 3}),
 #(DropoutLayer, {'p': 0.2}),
 (Conv2DLayer, {'num_filters': 32, 'filter_size': 3}),
 #(DropoutLayer, {'p': 0.2}),
 (Conv2DLayer, {'num_filters': 32, 'filter_size': 3}),
 (MaxPool2DLayer, {'pool_size': 2}),
 (DropoutLayer, {'p': 0.2}),

# second stage of our convolutional layers
 (Conv2DLayer, {'num_filters': 64, 'filter_size': 3}),
 #(DropoutLayer, {'p': 0.3}),
 (Conv2DLayer, {'num_filters': 64, 'filter_size': 3}),
 #(DropoutLayer, {'p': 0.3}),
 (Conv2DLayer, {'num_filters': 64, 'filter_size': 3}),
 (MaxPool2DLayer, {'pool_size': 2}),
 (DropoutLayer, {'p': 0.3}),

# two dense layers with dropout
 (DenseLayer, {'num_units': 128}),
 (DropoutLayer, {'p': 0.5}),
 (DenseLayer, {'num_units': 128}),

# the output layer
 (DenseLayer, {'num_units': 2, 'nonlinearity': sigmoid}),
]

printQ ('creating and training the networks architectures')
numNets = 1
NNs = list()
for iNet in arange(numNets):
 nn = NeuralNet(
 layers = layers0,
 max_epochs = 2000,
 update=adam,
 update_learning_rate=0.0002,
 regression=True, # flag to indicate we're dealing with regression problem
 batch_iterator_train=BatchIterator(batch_size=100),
 on_epoch_finished=[EarlyStopping(patience=10),],
 train_split=TrainSplit(eval_size=0.25),
 verbose=1,
 )
 result = nn.fit(t_x, t_y)
 NNs.append(nn)

printQ('finished')

Based upon how quickly the training converged, the network could possibly have been simplified, reducing the number of layers, or using fewer neurons in the fully connected layers. But I didn’t have time to experiment with different architectures.

20160308-image06

The GPU quickly ran out of RAM unless I used the batch iterator. I found a workable batch size by trial and error: large batch sizes caused the GPU to run out of RAM, while small batch sizes trained much more slowly.

Step 5: Review the Training Errors

Just like humans, all models make mistakes. The heart chamber segmentation algorithms I used later in this project were sensitive to how well the heart chamber was centred in the image. But as long as the model output was a centroid that was inside the heart chamber, things usually went OK. Early versions of my model made mistakes that placed the centroid outside the heart chamber, sometimes even far away from the heart. Tweaks to the training data (especially enhancing the data with rotation and reflection) and the architecture (especially dropout layers) improved the performance.


def getHeartLocation(trainX):
    # get the heart locations from each network
    heartLocs = zeros(numNets * trainX.shape[0] * 2).reshape((numNets, trainX.shape[0], 2))
    for j in arange(numNets):
        nn = NNs[j]
        heartLocs[j, :, :] = nn.predict(trainX)

    # use median as an ensembler
    heartLocsMedian = zeros(trainX.shape[0] * 2).reshape((trainX.shape[0], 2))
    heartLocsMedian[:, 0] = median(heartLocs[:, :, 0], axis=0)
    heartLocsMedian[:, 1] = median(heartLocs[:, :, 1], axis=0)

    # use a 'max distance from centre' ensembler
    heartLocsDist = zeros(trainX.shape[0] * 2).reshape((trainX.shape[0], 2))
    distance = abs(heartLocs - 0.5)
    am0 = distance[:, :, 0].argmax(0)
    am1 = distance[:, :, 1].argmax(0)
    heartLocsDist[:, 0] = heartLocs[am0, arange(trainX.shape[0]), 0]
    heartLocsDist[:, 1] = heartLocs[am1, arange(trainX.shape[0]), 1]

    # combine the two using an arithmetic average
    heartLocations = 0.5 * heartLocsMedian + 0.5 * heartLocsDist

    return heartLocations

heartLocations = getHeartLocation(train_x)

# review the training errors to check for model improvements
def plot_sample(x, y, predicted, axis):
  img = x[0, :, :].reshape(96, 96)
  axis.imshow(img, cmap='gray')
  axis.scatter(y[0::2] * 96, y[1::2] * 96, marker='x', s=10)
  axis.scatter(predicted[0::2] * 96, predicted[1::2] * 96, marker='x', s=10, color='red')

nTrain = train_x.shape[0]
errors = np.zeros(nTrain)

for i in arange(0, nTrain):
  errors[i] = sqrt( square(heartLocations[i, 0] - train_y[i, 0]) + square(heartLocations[i, 1] - train_y[i, 1]) )

print('Prob(error > 0.05)' + str(mean(errors > 0.05)))
print('Mean: ' + str(mean(errors)))
print('Percentiles: ' + str(percentile(errors, [50, 75, 90, 95, 99, 100])))

for i in arange(0, nTrain):
  error = sqrt( square(heartLocations[i, 0] - train_y[i, 0]) + square(heartLocations[i, 1] - train_y[i, 1]) )
  if (error > 0.04):
    if train_id[i] != train_id[i-1]:
      #print(i)
      print(train_id[i]) # only errors on the original images - not the altered images
      fig = pyplot.figure(figsize=(6, 3))
      ax = fig.add_subplot(1, 2, 1, xticks=[], yticks=[])
      plot_sample(train_x[i,:,:,:], train_y[i, :], heartLocations[i, :], ax)
      pyplot.show()

print('error review completed')

20160308-image07

After many failed models, I was excited when the two worst training errors were still close to the centre of the heart chamber 🙂

Step 6: Find the Left Ventricle Locations for the Submission Data

The main point of building a heart finder machine learning model is to automate the process of finding the left ventricle in the test images that will be used as part of the competition submission. These are images that the model has never seen before.


def load_submission_set():
  numTimes = 2 # how many time periods to use
  plusTime = 5
  xs = []
  ys = []
  ids = []
  paths = []
  times = []
  saxes = []
  all_y = '/home/colin/data/Second-Annual-Data-Science-Bowl/working/centroids-submission-R.csv'
  with open(all_y) as f:
    rows = csv.reader(f, delimiter=',', quotechar='"')
    iRow = 1
    for line in rows:
      # first line is column headers
      if iRow > 1:
        # parse the line
        patient = line[0]
        x = 0
        y = 0
        sax = int(line[1])
        firstTime = int(line[2])
        # save the targets
        xy = np.asarray([x, y])
        ys.append(xy.astype('float32').reshape((1, 2)))
        # read the images
        folder = '/home/colin/data/Second-Annual-Data-Science-Bowl/validate-cleaned/' + patient + '/study/sax_0' + str(sax) + '/'
        xm = np.zeros([1, 2, 96, 96])
        #
        #
        # current frame
        path0 = folder + 'image-' + ('%0*d' % (2, firstTime)) + '.png'
        img = load_image(path0)
        # transform the image - rotation, reflection etc
        #img = transform_image(img, row['equalise'], row['gammaAdjust'], row['reflection'], row['rotation'])
        # get the pixels into the range [-1, 1]
        img = (img / 128.0 - 1.0).astype('float32').reshape((96, 96))
        xm[0, 0, :, :] = img
        #
        #
        # find movement of current frame to future frame
        path5 = folder + 'image-' + ('%0*d' % (2, firstTime + plusTime)) + '.png'
        img = load_image(path5)
        # transform the image - rotation, reflection etc
        #img = transform_image(img, row['equalise'], row['gammaAdjust'], row['reflection'], row['rotation'])
        # get the pixels into the range [-1, 1]
        img = (img / 128.0 - 1.0).astype('float32').reshape((96, 96))
        # first time is the complete image at time 1
        # subsequent frames are the differences between frames
        xm[0, 1, :, :] = img - xm[0, 0, :, :]
        xs.append(xm.astype('float32').reshape((1, numTimes, 96, 96)))
        ids.append(patient)
        paths.append(path0)
        times.append(firstTime)
        saxes.append(sax)
      iRow = iRow + 1
  return np.vstack(xs), np.vstack(ids), np.vstack(paths), np.vstack(times), np.vstack(saxes)

printQ('reading the submission data')
test_x, test_ids, test_paths, test_times, test_sax = load_submission_set()

printQ('getting the predictions')
predicted_y = getHeartLocation(test_x)

printQ('creating the output table')
fullIDs = []
fullPaths = []
fullX = []
fullY = []
fullSax = []
fullTime = []
nTest = test_x.shape[0]
iTime = 1
for i in arange(0, nTest):
  patient = (test_ids[i])[0]
  sax = int((test_sax[i])[0])
  path = (test_paths[i])[0]
  iTime = int((test_times[i])[0])
  fullIDs.append(patient)
  fullPaths.append(path)
  fullX.append(predicted_y[i, 0])
  fullY.append(predicted_y[i, 1])
  fullSax.append(sax)
  fullTime.append(iTime)
  outPath = '/home/colin/data/Second-Annual-Data-Science-Bowl/predicted-heart-location-submission/'
  outPath = outPath + patient
  outPath = outPath + '-' + str(sax)
  outPath = outPath + '-' + str(iTime) + '.png'
  img = load_image256x192(path)
  x = int(round(predicted_y[i, 0] * 192))
  y = int(round(predicted_y[i, 1] * 256))
  img[y, x] = 255
  img[y-1, x-1] = 255
  img[y-1, x+1] = 255
  img[y+1, x-1] = 255
  img[y+1, x+1] = 255
  write_image(img, outPath)

fullIDs = array(fullIDs)
fullPaths = array(fullPaths)
fullX = array(fullX)
fullY = array(fullY)
fullSax = array(fullSax)
fullTime = array(fullTime)

printQ('saving results table')
d = { 'patient': fullIDs, 'path' : fullPaths, 'x' : fullX, 'y' : fullY, 'iTime' : fullTime, 'sax' : fullSax}
import pandas as pd
d = pd.DataFrame(d)
d.to_csv('/home/colin/data/Second-Annual-Data-Science-Bowl/working/heartfinderV4b-centroids-submission.csv', index = False)
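The helper functions load_image256x192 and write_image used above aren't shown in this post. Based on how they are called, they would look something like the sketch below (a guess using OpenCV; the exact implementation is an assumption):

def load_image256x192(path):
    # assumed implementation: read the cleaned PNG as greyscale,
    # force portrait orientation, then resize to 256 rows x 192 columns
    image = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if image.shape[0] < image.shape[1]:
        image = cv2.transpose(image)
    return cv2.resize(image, (192, 256))  # cv2.resize takes (width, height)

def write_image(img, out_path):
    # assumed implementation: write the greyscale image to disk
    cv2.imwrite(out_path, img)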

In the animated gif below, you can see the left ventricle centroid location that has been automatically fitted, displayed as a dark rectangle moving around near the centre of the heart chamber. The machine learning algorithm was not trained on this patient’s images – so what you see here is artificial intelligence in action!

20160308-submission-images


Second Annual Data Science Bowl – Part 2

07 Monday Mar 2016

Posted by Colin Priest in Automation, Image Processing, Kaggle, Medical Imaging, R

≈ 3 Comments

Tags

Automation, Image Processing, Kaggle, Medical Imaging, R

In Part 1 of this blog series, I described how to fix the brightness and contrast of the MRI images. In this blog we finish cleaning up the input data.

Other than brightness and contrast, we need to fix up the following problems:

  • different image sizes
  • different image rotations – portrait versus landscape
  • different pixel spacing
  • different short axis slice spacing
  • image sets from duplicate locations
  • sax sets aren’t in the same order as their locations
  • some sax image sets have multiple locations

Different Image Sizes and Rotations

There are approximately a dozen different image sizes, including rotated images. Not all of the image sizes scale to the same 4:3 aspect ratio that is the most common across the training set. Some of the machine learning algorithms I used later need fixed dimension images, so I compromised and decided to use a standard sizing of 256 x 192 pixels portrait aspect ratio. This meant that I wasn't always scaling the x-axis and the y-axis by the same amount, and occasionally I was even upscaling an image.


library(pacman)
pacman::p_load(EBImage)
rescaleImage = function(img)
{
 imgOut = img
 # check for landscape aspect ratio and correct
 if (nrow(img) > ncol(img)) imgOut = t(img)
 imgOut = resize(imgOut, 256, 192) # standardise image size to 256 x 192
 return (imgOut)
}

20160307image01
20160307image02

Note that I used a matrix transpose to switch the image from landscape to portrait orientation (strictly speaking, a transpose is a 90 degree rotation combined with a mirror flip, but that doesn't matter for this purpose). I could also have used the rotate function in EBImage. Either way could work, but I felt that matrix transpose would be a faster operation.

Image Locations and Spacing

The DICOM images contain information about image slice location, pixel spacing (how far apart are adjacent pixel centroids), and slice spacing (how far apart are adjacent sax slices).


library(pacman)
pacman::p_load(oro.dicom)

# function to extract the dicom header info
getDicomHeaderInfo = function(path)
{
 img = readDICOMFile(path)
 width = ncol(img$img)
 height = nrow(img$img)
 headers = img$hdr
 patientID = extractHeader(img$hdr, 'PatientID', numeric = TRUE)
 patientAge = as.integer(substr(extractHeader(img$hdr, 'PatientsAge', numeric = FALSE), 1, 3))
 patientGender = as.character(extractHeader(img$hdr, 'PatientsSex', numeric = FALSE))
 ps = extractHeader(img$hdr, 'PixelSpacing', numeric = FALSE)
 pixelSpacingX = as.numeric(unlist(strsplit(ps, ' '))[1])
 pixelSpacingY = as.numeric(unlist(strsplit(ps, ' '))[2])
 seriesNum = extractHeader(img$hdr, 'SeriesNumber', numeric = TRUE)
 location = round(extractHeader(img$hdr, 'SliceLocation', numeric = TRUE), digits = 1)
 pathSplit = unlist(strsplit(path, '/', fixed = TRUE))
 filename = pathSplit[length(pathSplit)]
 prefix = unlist(strsplit(filename, '.', fixed = TRUE))[1]
 frame = as.integer(unlist(strsplit(prefix, '-'))[3])
 sliceNum = 0
 if (length(unlist(strsplit(prefix, '-'))) > 3) sliceNum = as.integer(unlist(strsplit(prefix, '-'))[4])
 time = extractHeader(img$hdr, 'InstanceCreationTime', numeric = FALSE)
 #
 return (list(
 id = patientID,
 age = patientAge,
 gender = patientGender,
 pixelSpacingX = pixelSpacingX,
 pixelSpacingY = pixelSpacingY,
 width = width,
 height = height,
 series = seriesNum,
 location = location,
 frame = frame,
 sliceNum = sliceNum,
 time = time,
 path = path
 ))
}

I’m not a medical expert. So when there are images from duplicate locations, I don’t know which image sets are the best to use. Therefore I just assumed that the medical specialist repeated the MRI scans until the image quality was adequate. This meant that I chose the image set with the latest time stamps (from the header information inside the DICOM files, not the file system date).

Then I just

  1. searched for DICOM images in all of the subfolders for each patient
  2. filtered to use only sax slices, ignoring those images that did not come from a folder that contained the substring “sax”
  3. read the DICOM header information from each image to find its location and time
  4. searched for the highest time stamp for each location and kept that image
  5. wrote out a new folder structure where sax_01, sax_02, … were the sax slice image sets for that patient, ordered by their location along the long axis of the heart

The R script to do this isn’t too difficult. I’ve pasted it below:


# this script creates a cleaner version of the image data
# plus an extract of the file headers
# 1) removes duplicate images
# 2) reorders images by location

# read a poor image and translate the pixel brightnesses
rebalanceImage = function(badImage)
{
v = matrix(badImage, nrow(badImage) * ncol(badImage))
o2 = order(v)
vIn = sample(vAll, nrow(badImage) * ncol(badImage))
vIn = vIn[order(vIn)]
v2 = v
v2[o2] = vIn
cleanImage = matrix(v2, nrow(badImage), ncol(badImage))
return (cleanImage)
}

# check whether a folder exists and make it if it doesn't exist
checkFolder = function(mainDir, subDir)
{
if(!dir.exists(file.path(mainDir, subDir)))
dir.create(file.path(mainDir, subDir))
}

# function to extract the dicom header info
getDicomHeaderInfo = function(path)
{
img = readDICOMFile(path)
width = ncol(img$img)
height = nrow(img$img)
headers = img$hdr
patientID = extractHeader(img$hdr, 'PatientID', numeric = TRUE)
patientAge = as.integer(substr(extractHeader(img$hdr, 'PatientsAge', numeric = FALSE), 1, 3))
patientGender = as.character(extractHeader(img$hdr, 'PatientsSex', numeric = FALSE))
ps = extractHeader(img$hdr, 'PixelSpacing', numeric = FALSE)
pixelSpacingX = as.numeric(unlist(strsplit(ps, ' '))[1])
pixelSpacingY = as.numeric(unlist(strsplit(ps, ' '))[2])
seriesNum = extractHeader(img$hdr, 'SeriesNumber', numeric = TRUE)
location = round(extractHeader(img$hdr, 'SliceLocation', numeric = TRUE), digits = 1)
pathSplit = unlist(strsplit(path, '/', fixed = TRUE))
filename = pathSplit[length(pathSplit)]
prefix = unlist(strsplit(filename, '.', fixed = TRUE))[1]
frame = as.integer(unlist(strsplit(prefix, '-'))[3])
sliceNum = 0
if (length(unlist(strsplit(prefix, '-'))) > 3) sliceNum = as.integer(unlist(strsplit(prefix, '-'))[4])
time = extractHeader(img$hdr, 'InstanceCreationTime', numeric = FALSE)
#
return (list(
id = patientID,
age = patientAge,
gender = patientGender,
pixelSpacingX = pixelSpacingX,
pixelSpacingY = pixelSpacingY,
width = width,
height = height,
series = seriesNum,
location = location,
frame = frame,
sliceNum = sliceNum,
time = time,
path = path
))
}
############################################################################################################################

library(pacman)
pacman::p_load(oro.dicom, raster, data.table, png, flexclust, foreach, doParallel, snowfall)

# whether this run is for the training set (FALSE) or the validation set (TRUE)
useValidation = FALSE
#useValidation = TRUE

# create a benchmark histogram from an exemplar image
dicomBenchmark = readDICOM('/home/colin/data/Second-Annual-Data-Science-Bowl/train/1/study/sax_13')
images = dicomBenchmark[[2]]
img = images[[1]]
vAll = unname(unlist(images))
vAll = vAll[order(vAll)]

# loop through all of the patients
rootFolder = '/home/colin/data/Second-Annual-Data-Science-Bowl/train'
outFolder = '/home/colin/data/Second-Annual-Data-Science-Bowl/train-cleaned'
if (useValidation)
{
rootFolder = '/home/colin/data/Second-Annual-Data-Science-Bowl/validate'
outFolder = '/home/colin/data/Second-Annual-Data-Science-Bowl/validate-cleaned'
}
cases = list.dirs(rootFolder, recursive=FALSE)
#cases = cases[grep('123', cases)]
simpleData = matrix(0, 500, 3)
simpleFeatures = data.frame(id = rep(0, 500), age = rep(0, 500), gender = rep('U', 500),
pixelSpacingX = integer(500), pixelSpacingY = integer(500),
width = integer(500), height = integer(500),
numSlices = rep(0, 500), stringsAsFactors = FALSE)
allImages = NULL

sfInit(parallel=TRUE, cpus=14)
sfLibrary(oro.dicom)

for (patient in cases)
{
patientFolder = paste0(patient, '/study')
imgSequences = list.dirs(patientFolder, recursive=FALSE)
# filter for 'sax' folders
imgSequences = imgSequences[grep('sax', imgSequences, fixed = TRUE)]
nMax = 10000
imgTable = data.table(id = integer(nMax), age = integer(nMax), gender = character(nMax),
pixelSpacingX = integer(nMax), pixelSpacingY = integer(nMax),
width = integer(nMax), height = integer(nMax),
location = numeric(nMax), frame = integer(nMax), series = integer(nMax),
sliceNum = integer(nMax), time = numeric(nMax),
path = character(nMax))
for (imgFolder in imgSequences)
{
# find the first file in imgs
imgFiles = list.files(imgFolder)
if (length(imgFiles) < 30)
{
print(paste0('length = ', length(imgFiles), '! in ', imgFolder))
} else {
if (length(imgFiles) %% 30 != 0)
{
print(paste0('length = ', length(imgFiles), '! in ', imgFolder))
}
}
# do the next part regardless of the number of images
paths = unlist(lapply(imgFiles, function(x) return (paste0(imgFolder, '/', x))))
result <- sfLapply(paths, getDicomHeaderInfo)
for (row in result)
{
n = sum(imgTable$id > 0) + 1
imgTable$id[n] = row$id
imgTable$age[n] = row$age
imgTable$gender[n] = row$gender
imgTable$pixelSpacingX[n] = row$pixelSpacingX
imgTable$pixelSpacingY[n] = row$pixelSpacingY
imgTable$width[n] = row$width
imgTable$height[n] = row$height
imgTable$series[n] = row$series
imgTable$location[n] = row$location
imgTable$frame[n] = row$frame
imgTable$sliceNum[n] = row$sliceNum
imgTable$time[n] = row$time
imgTable$path[n] = row$path
}
}
# remove surplus records from table
imgTable= imgTable[imgTable$id > 0]
# grab the latest image for each location and frame
latestImages = imgTable[order(id, age, gender, pixelSpacingX, pixelSpacingY, width, height, location, frame, time, sliceNum, series), .SD[c(.N)], by=c("id", "age", "gender", "pixelSpacingX", "pixelSpacingY", "width", "height", "location", "frame")]
#
print(paste0('Patient: ', latestImages$id[1]))
uniqueLocs = sort(unique(latestImages$location[abs(latestImages$location) > 0.000001]))
simpleFeatures[latestImages$id[1], ] = list(latestImages$id[1], latestImages$age[1], latestImages$gender[1],
latestImages$pixelSpacingX[1], latestImages$pixelSpacingY[1],
latestImages$width[1], latestImages$height[1],
length(uniqueLocs))
# create the cleaned-up images for CNN training
imgTable$cleanPath = '---'
iSax = 1
patientID = latestImages$id[1]
for (loc in uniqueLocs)
{
imgset = latestImages$path[latestImages$location == loc]
iImage = 1
for (imgPath in imgset)
{
dicomImage = readDICOMFile(imgPath)
# fix the contrast and brightness
fixedImage = rebalanceImage(dicomImage$img)
# get the details of this image
###patientID = extractHeader(dicomImage$hdr, 'PatientID', numeric = TRUE)
sliceID = iSax
imageID = iImage
outPath = paste0(outFolder, '/', patientID, '/study/sax_', formatC(sliceID, width=2, flag='0'), '/image-', formatC(imageID, width=2, flag='0'), '.png')
#print(outPath)
#plot(raster(fixedImage))
checkFolder(outFolder, as.character(patientID))
checkFolder(paste0(outFolder, '/', patientID), 'study')
checkFolder(paste0(outFolder, '/', patientID, '/study'), paste0('sax_', formatC(sliceID, width=2, flag='0')))
writePNG(fixedImage / max(fixedImage), outPath)
imgTable$cleanPath[imgTable$path == imgPath] = outPath
#
iImage = iImage + 1
}
iSax = iSax + 1
}

if (patient == cases[1])
{
allImages = imgTable
} else
{
allImages = data.frame(rbind(allImages, imgTable))
}
}

sfStop()

I stored all of the DICOM header information in a table. Some of this will be required later, when calculating the volume of the left ventricle chamber.

The next step, which will be described in my next blog, is to design a convolutional neural network that will automatically find the left ventricle in an image.


Second Annual Data Science Bowl – Part 1

06 Sunday Mar 2016

Posted by Colin Priest in Automation, Image Processing, Kaggle, Medical Imaging, R

≈ 3 Comments

Tags

Automation, Image Processing, Kaggle, Medical Imaging, R

I’m currently competing in the Second Annual Data Science Bowl at Kaggle. This is by far the most difficult competition that I have entered to date. At the time of writing I am placed 62nd out of 755 entries, with only a day remaining to lock down my methodology. There’s a lot more I’d like to do to improve my model, but alas, I don’t have the time!

Here’s the problem that we are solving:

  1. We are given a set of medical images taken by MRI, across 30 time periods, and a variable number of location slices through the body.
  2. We are also given the volume of the left ventricle of the heart at times of diastole and systole.
  3. Our task is to design an automatic algorithm that inputs DICOM images and outputs a cumulative distribution function of the likelihood of different volumes at both diastole and systole.

The medical image files are in DICOM format, containing information about the patient (e.g. age and gender) and a set of monochrome images for each patient giving a 4 dimensional view of that patient’s chest. The key images are the “sax” (short axis) images, a set of slices perpendicular to the line that passes through the length of the heart (a heart isn’t circular, but more ovoid in shape), and there are typically 30 images for each sax, each being 1/30th the time period of a heartbeat, showing one complete cycle of the heart. There are a varying number of sax images for each patient, depending upon the length of the patient’s heart, and sometimes there are also repeated sax sets, where the scanning was repeated in an attempt to improve the image quality.

The image quality varies greatly between patients, with differing image resolutions, brightness, contrast and aspect ratio / rotation.

sax_8
sax_8

As you can see in the animated gifs above, some of the images are such poor quality that it is difficult for the human eye to discern the details. So my first challenge was to improve the brightness and contrast. One way to do this is to do a linear transformation on each image so that its pixels have a preset mean and standard deviation.
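In symbols, a transformation of that kind maps each pixel value p of an image with mean \mu and standard deviation \sigma onto a preset target mean \mu_t and standard deviation \sigma_t:

p' = \mu_t + \frac{\sigma_t}{\sigma}\,(p - \mu)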

20160306image01
20160306image02
20160306image03
20160306image04

As you can see, while this approach helped, it did not work well enough for the problem images. I also tried non-linear transformations without much more success. No transformation function was flexible enough for the wide range of image qualities. Frequently part of the problem image gets washed out.

At the National Heart Centre Singapore, I spoke with Assistant Professor Calvin Chin about how doctors use imaging to assess heart volume. He explained that the very bright, washed out sections in some of the images are the result of fat deposits within the patient’s body. He also explained how to find the left ventricle chamber in an image (it is round with a thick lining surrounding it) and what to do about the dark patches inside the chamber (include them in the area of the chamber because they are blood vessels). This was really helpful. It pays to bring in some domain knowledge to a machine learning problem.

20060306image05
20060306image06

What I wanted was for all images to have similar brightness histograms. After much experimentation, I couldn’t find a transformation function that achieved this for me. But then I realised that I didn’t need to use a function – I could just use empirical histograms as my target function, mapped to my original image via the brightness ranking of each pixel. All I needed to do was select an exemplar image (or multiple exemplar images) and then order the pixel brightnesses, then map across. Here’s how I did it using R:


library(pacman)
pacman::p_load(oro.dicom)

# a function to turn the image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

# read a poor image and translate the pixel brightnesses
rebalanceImage = function(badImage)
{
# get the pixel brightnesses and get an index that sorts them
v = img2vec(badImage)
o2 = order(v)

# get a target histogram, allowing for the size of the bad image
vIn = sample(vAll, nrow(badImage) * ncol(badImage))
vIn = vIn[order(vIn)]

#
v2 = v
v2[o2] = vIn

# turn the pixel vector back into an image
cleanImage = matrix(v2, nrow(badImage), ncol(badImage))

return (cleanImage)
}
# create a benchmark histogram from an exemplar image
dicomBenchmark = readDICOM('C:/Users/Colin/Dropbox/blogging/20160306 Second Annual Data Science Bowl Part 1/SADSB/1/study/sax_13')
images = dicomBenchmark[[2]]
img = images[[1]]
vAll = unname(unlist(images))
vAll = vAll[order(vAll)]

# read the raw image
dicomImage = readDICOMFile('C:/Users/Colin/Dropbox/blogging/20160306 Second Annual Data Science Bowl Part 1/SADSB/1/study/sax_8/IM-4560-0001.dcm')
# fix the contrast and brightness
fixedImage = rebalanceImage(dicomImage$img)
# read the raw image
dicomImage2 = readDICOMFile('C:/Users/Colin/Dropbox/blogging/20160306 Second Annual Data Science Bowl Part 1/SADSB/6/study/sax_8/IM-9548-0001.dcm')
# fix the contrast and brightness
fixedImage2 = rebalanceImage(dicomImage2$img)

This gave me fairly consistent brightnesses and contrast, regardless of the quality of the original images, and also prevented washed out regions. You can see the results below:

20160306image07
20160306image08
20160306image09
20160306image10

Standardising the input data helps machine learning algorithms perform better because they don’t have to waste resources figuring out how to adjust for varying inputs where that variation is not a predictive feature.

In my next blog I will describe how I used the DICOM header information to further improve the model inputs, and to create extra features for my final model.


An Even Dozen – Denoising Dirty Documents: Part 12

15 Sunday Nov 2015

Posted by Colin Priest in Image Processing, Kaggle, Machine Learning, R, Stacking, XGBoost

≈ 3 Comments

Tags

Image Processing, Kaggle, Machine Learning, R, Stacking, XGBoost

Over the past 11 blogs in this series, I have discussed how to build machine learning models for Kaggle’s Denoising Dirty Documents competition.

dozeneggs

The final blog in this series brings the count to an even dozen, and will achieve two aims:

  1. ensemble the models that we have built
  2. take advantage of the second information leakage in the competition

Ensembling, the combining of individual models into a single model, performs best when the individual models have errors that are not strongly correlated. For example, if each model has statistically independent errors, and each model performs with similar accuracy, then the average prediction across four such models will have half the RMSE score of the individual models (see the short derivation after the diagram below). One way to increase the statistical independence of the models is to use different feature sets and / or types of models on each. I therefore chose the following combination of models:

  1. deep learning – thresholding based features
  2. deep learning – edge based features
  3. deep learning – median based features
  4. images with backgrounds removed using information leakage
  5. xgboost – wide selection of features
  6. convolutional neural network – using raw images without background removal pre-processing
  7. convolutional neural network – using images with backgrounds removed using information leakage
  8. deep convolutional neural network – using raw images without background removal pre-processing
  9. deep convolutional neural network – using images with backgrounds removed using information leakage

20151115 ensemble structure
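To make the "half the RMSE" claim concrete, here is the short calculation it rests on: if each of n models has zero-mean errors e_i with the same variance \sigma^2, and those errors are independent, then the error of the averaged prediction is the average of the individual errors, so

\mathrm{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} e_i\right) = \frac{\sigma^2}{n},
\qquad
\mathrm{RMSE}_{\text{avg}} = \frac{\sigma}{\sqrt{n}},

and with n = 4 the ensemble RMSE is half that of a single model.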

It turned out that some of these models had errors that were fairly strongly correlated with the errors of other models. But I was rushing to improve my leaderboard score in the final 48 hours of the competition, so I didn't have time to experiment.

I didn't experiment much with different ensemble models. However I did test xgboost against both a simple average and a least squares linear regression, and it outperformed both. Maybe an elastic net could have done a good job.

Here is the R code for my ensemble:


.libPaths(c(.libPaths(), "./rlibs"))
library(png)
library(data.table)
library(xgboost)

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

cleanFolder = "./data/train_cleaned"
inFolder1 = "./threshold based model/training data"
inFolder2 = "./edge based model/training data"
inFolder3 = "./median based model/training data"
inFolder4 = "./foreground/train foreground"
inFolder5 = "./submission 11/train_postprocessed"
inFolder6 = "./convnet/train_predicted"
inFolder7 = "./cnn_leakage/train_predicted"
inFolder8 = "./CNN based model/training"
inFolder9 = "./deep CNN/train_predicted"

outPath = "./stacked/stacking.csv"

filenames = list.files(cleanFolder)
for (f in filenames)
{
print(f)
imgX1 = readPNG(file.path(inFolder1, f))
imgX2 = readPNG(file.path(inFolder2, f))
imgX3 = readPNG(file.path(inFolder3, f))
imgX4 = readPNG(file.path(inFolder4, f))
imgX5 = readPNG(file.path(inFolder5, f))
imgX6 = readPNG(file.path(inFolder6, f))
imgX7 = readPNG(file.path(inFolder7, f))
imgX8 = readPNG(file.path(inFolder8, f))
imgX9 = readPNG(file.path(inFolder9, f))
imgY = readPNG(file.path(cleanFolder, f))

# turn the images into vectors
y = img2vec(imgY)
x1 = img2vec(imgX1)
x2 = img2vec(imgX2)
x3 = img2vec(imgX3)
x4 = img2vec(imgX4)
x5 = img2vec(imgX5)
x6 = img2vec(imgX6)
x7 = img2vec(imgX7)
x8 = img2vec(imgX8)
x9 = img2vec(imgX9)

dat = data.table(cbind(y, x1, x2, x3, x4, x5, x6, x7, x8, x9))
setnames(dat,c("y", "threshold", "edge", "median", "foreground", "submission11", "convnet", "cnn_leakage", "CNN", "deepCNN"))
write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit an xgboost model to a subset of the data
set.seed(1)
#rows = sample(nrow(dat), 15000000)
dat[is.na(dat)] = 0
#dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
dtrain <- xgb.DMatrix(as.matrix(dat[,-1]), label = as.matrix(dat[,1]))
#
nThreads = 30
# do cross validation first
#xgb.tab = xgb.cv(data = dtrain, nthread = nThreads, eval_metric = "rmse", nrounds = 1000, early.stop.round = 15, nfold = 4, print.every.n = 10)
# what is the best number of rounds?
#min.error.idx = which.min(xgb.tab[, test.rmse.mean])
# now fit an xgboost model
min.error.idx = 300 # was 268
xgb.mod = xgboost(data = dtrain, nthread = nThreads, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)

dat_predicted = predict(xgb.mod, newdata=as.matrix(dat[,-1]))
sqrt( mean( (dat$y - dat_predicted) ^ 2 )) # 0.00759027

save (xgb.mod, file = "./model/xgb.rData")

#####################################################################################################################################

imgFolder = "./data/test"
inFolder1 = "./threshold based model/test data"
inFolder2 = "./edge based model/test data"
inFolder3 = "./median based model/test data"
inFolder4 = "./foreground/test foreground"
inFolder5 = "./submission 11/test_postprocessed"
inFolder6 = "./convnet/test_predicted"
inFolder7 = "./cnn_leakage/test_predicted"
inFolder8 = "./CNN based model/test"
inFolder9 = "./deep CNN/test_predicted"

outFolder = "./stacked/test data"
outFolder2 = "./stacked/test images"

filenames = list.files(imgFolder)
for (f in filenames)
{
print(f)
imgX1 = readPNG(file.path(inFolder1, f))
imgX2 = readPNG(file.path(inFolder2, f))
imgX3 = readPNG(file.path(inFolder3, f))
imgX4 = readPNG(file.path(inFolder4, f))
imgX5 = readPNG(file.path(inFolder5, f))
imgX6 = readPNG(file.path(inFolder6, f))
imgX7 = readPNG(file.path(inFolder7, f))
imgX8 = readPNG(file.path(inFolder8, f))
imgX9 = readPNG(file.path(inFolder9, f))

# turn the images into vectors
x1 = img2vec(imgX1)
x2 = img2vec(imgX2)
x3 = img2vec(imgX3)
x4 = img2vec(imgX4)
x5 = img2vec(imgX5)
x6 = img2vec(imgX6)
x7 = img2vec(imgX7)
x8 = img2vec(imgX8)
x9 = img2vec(imgX9)

dat = data.table(cbind(x1, x2, x3, x4, x5, x6, x7, x8, x9))
setnames(dat,c("threshold", "edge", "median", "foreground", "submission11", "convnet", "cnn_leakage", "CNN", "deepCNN"))
yHat = predict(xgb.mod, newdata=as.matrix(dat))
yHat[yHat < 0] = 0
yHat[yHat > 1] = 1
imgY = matrix(yHat, nrow(imgX1), ncol(imgX1))
writePNG(imgY, file.path(outFolder2, f))
save(imgY, file = file.path(outFolder, gsub(".png", ".rData", f)))
}

Ensembling materially improved my leaderboard score versus any of the individual models. I feel that was due to the use of different features across my 3 deep learning models. So now I had a set of images that looked quite good:

20151115 output 1

20151115 output 2

To my eyes, my predicted images were indistinguishable from the clean images in the training data. In a real world situation I would have stopped model development here, because the image quality exceeds the minimum requirements for OCR. However, since this was a competition, I wanted the best score I could get.

So I took advantage of the second data leakage in the competition – the fact that the cleaned images were repeated across the dataset. This meant that I could compare a cleaned image to other cleaned images that appeared to have the same text and the same font, and clean up any pixels that were different across the set of images. I experimented with using the mean of the pixel brightness across the images, but using the median performed better.


library(png)
library(data.table)

inFolder = "./stacked/test data"
outFolder = "./information leakage/data"
outFolder2 = "./information leakage/images"

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

filenames = list.files(inFolder, pattern = "\\.rData$")
for (f in filenames)
{
print(f)

load(file.path(inFolder, f))
imgX = imgY

# look for the closest matched images
scores = matrix(1, length(filenames))
for (i in 1:length(filenames))
{
load(file.path(inFolder, filenames[i]))
rmse = 1
if (nrow(imgY) >= nrow(imgX) && ncol(imgY) >= ncol(imgX))
{
imgY = imgY[1:nrow(imgX), 1:ncol(imgX)]
rmse = sqrt(mean( (imgX - imgY)^2 ))
}
scores[i] = rmse
}

dat = matrix(1, ncol(imgX) * nrow(imgX), 4)
for (i in 1:4)
{
f2 = filenames[order(scores)][i]
load(file.path(inFolder, f2))
dat[,i] = img2vec(imgY)
}

dat2 = apply(dat, 1, median)
#dat2 = apply(dat, 1, mean)

imgOut = matrix(dat2, nrow(imgX), ncol(imgX))
writePNG(imgOut, file.path(outFolder2, gsub(".rData", ".png", f)))
save(imgOut, file = file.path(outFolder, f))
}

This information leakage halved the RMSE, and I suspect that it was what allowed the top two competitors to obtain RMSE scores less than 1%.

So that’s it for this series of blogs. I learned a lot from my first Kaggle competition. Competing against others, and sharing ideas is a fun way to learn.


Denoising Dirty Documents: Part 11

08 Sunday Nov 2015

Posted by Colin Priest in Adaptive Thresholding, Background Removal, Deep Learning, Edge Detection, h2o, Image Processing, Kaggle, Machine Learning, Median Filter, Morphology, R

≈ 1 Comment

Tags

Adaptive Thresholding, Background Removal, Deep Learning, Edge Detection, h2o, Image Processing, Kaggle, Machine Learning, Median Filter, Morphology, R

In my last blog I showed how to use convolutional neural networks to build a model that removed stains from an image. While convolutional neural networks seem to be well suited for image processing, in this competition I found that deep neural networks performed better. In this blog I show how to build these models.

warnH022 - deep water

Since I wanted to use R, had limited RAM and didn't have a powerful GPU, I chose to use h2o to build the models. That way I could do the feature engineering in R, pass the data to h2o, let h2o build a model, then get the predicted values back in R. The memory management would be done in h2o, which uses deep learning algorithms that adapt to the available RAM. So I guess this combination of deep learning and h2o could be called “deep water” 😉

For my final competition submission I used an ensemble of models, including 3 deep learning models built with R and h2o. Each of the 3 deep learning models used different feature engineering:

  • median based feature engineering
  • edge based feature engineering
  • threshold based feature engineering

This blog shows the details of the median based model. I leave it to the reader to implement the edge based and threshold based models using the image processing scripts from my earlier blogs in this series.

If you don’t already have h2o installed on your computer, then you can install it directly from R. At the time of writing this blog, you could install h2o using the following script:


# The following two commands remove any previously installed H2O packages for R.
if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }

# Next, we download packages that H2O depends on.
if (! ("methods" %in% rownames(installed.packages()))) { install.packages("methods") }
if (! ("statmod" %in% rownames(installed.packages()))) { install.packages("statmod") }
if (! ("stats" %in% rownames(installed.packages()))) { install.packages("stats") }
if (! ("graphics" %in% rownames(installed.packages()))) { install.packages("graphics") }
if (! ("RCurl" %in% rownames(installed.packages()))) { install.packages("RCurl") }
if (! ("jsonlite" %in% rownames(installed.packages()))) { install.packages("jsonlite") }
if (! ("tools" %in% rownames(installed.packages()))) { install.packages("tools") }
if (! ("utils" %in% rownames(installed.packages()))) { install.packages("utils") }

# Now we download, install and initialize the H2O package for R.
install.packages("h2o", type="source", repos=(c("http://h2o-release.s3.amazonaws.com/h2o/rel-tibshirani/3/R")))

That script will need to be changed as new versions of h2o are released. So use the latest instructions shown here.

Once h2o is installed, you can interface with h2o from R using the CRAN package.


install.packages("h2o")
library(h2o)

Median based image processing is used for feature engineering in this example, but you could use any combination of image processing techniques for your feature engineering. I got better performance using separate deep learning models for different types of image processing, but that may be because I had limited computing resources. If you have more computing resources than me, then maybe you will be successful with a single large model that uses all of the image processing techniques to create features.


# a function to turn a matrix image into a vector
img2vec = function(img)
{
 return (matrix(img, nrow(img) * ncol(img), 1))
}
 
median_Filter = function(img, filterWidth)
{
 pad = floor(filterWidth / 2)
 padded = matrix(NA, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
 padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img
 
 tab = matrix(0, nrow(img) * ncol(img), filterWidth * filterWidth)
 k = 1
 for (i in seq_len(filterWidth))
 {
 for (j in seq_len(filterWidth))
 {
 tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
 k = k + 1
 }
 }
 
 filtered = unlist(apply(tab, 1, function(x) median(x[!is.na(x)])))
 return (matrix(filtered, nrow(img), ncol(img)))
}
 
# a function that uses median filter to get the background then finds the dark foreground
background_Removal = function(img)
{
 w = 5
 p = 1.39
 th = 240
 
 # the background is found via a median filter
 background = median_Filter(img, w)
 
 # the foreground is darker than the background
 foreground = img / background
 foreground[foreground > 1] = 1
 
 foreground2 = foreground ^ p
 foreground2[foreground2 >= (th / 255)] = 1
 
 return (matrix(foreground2, nrow(img), ncol(img)))
} 

img2tab = function(imgX, f)
{
 median5 = img2vec(median_Filter(imgX, 5))
 median17 = img2vec(median_Filter(imgX, 17))
 median25 = img2vec(median_Filter(imgX, 25))
 backgroundRemoval = img2vec(background_Removal(imgX))
 foreground = readPNG(file.path(foregroundFolder, f))
 
 # pad out imgX
 padded = matrix(0, nrow(imgX) + padding * 2, ncol(imgX) + padding * 2)
 offsets = expand.grid(seq_len(2*padding+1), seq_len(2*padding+1))
 
 # raw pixels window
 padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = imgX
 x = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))
 
 # x2 window
 padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = median5
 x2 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))
 
 # x3 window
 padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = median17
 x3 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))
 
 # x4 window
 padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = median25
 x4 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))
 
 # x5 window
 padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = backgroundRemoval
 x5 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))
 
 # x6 window
 padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = foreground
 x6 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

 dat = data.table(cbind(x, x2, x3, x4, x5, x6))
 setnames(dat,c(
 paste("x", seq_len((2*padding+1)^2), sep=""), 
 paste("median5", seq_len((2*padding+1)^2), sep=""),
 paste("median17", seq_len((2*padding+1)^2), sep=""),
 paste("median25", seq_len((2*padding+1)^2), sep=""),
 paste("backgroundRemoval", seq_len((2*padding+1)^2), sep=""),
 paste("foreground", seq_len((2*padding+1)^2), sep="")
 ))
 
 return (dat)
}

If you’ve been following my blog, then you will see that there’s nothing new in the two image processing functions shown above.

To build the model you will need to start h2o, import the data and tell h2o to create a deep learning model.

h2oServer = h2o.init(nthreads = 6, max_mem_size = "10G")

trainData = h2o.importFile(h2oServer, path = outPath)
testData = h2o.importFile(h2oServer, path = outPath2)

model.dl.median <- h2o.deeplearning(x = 2:ncol(trainData), y = 1, training_frame = trainData, validation_frame = testData,
 score_training_samples = 0, 
 overwrite_with_best_model = TRUE,
 activation = "Rectifier", seed = 1,
 hidden = c(200, 200,200), epochs = 15,
 adaptive_rate = TRUE, initial_weight_distribution = "UniformAdaptive", loss = "MeanSquare",
 fast_mode = T, diagnostics = T, ignore_const_cols = T,
 force_load_balance = T)


You should change the h2o.init parameters according to the hardware on your computer. I’m running my model on a PC with 8 CPUs and 16GB of RAM, so I left a couple of CPUs free to do the user interface and core operating system functionality, plus some RAM for the operating system. Scale these parameters up or down if your PC specifications are more or less powerful than mine.
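For example, on a more modest machine you might scale it down to something like the sketch below (the numbers here are illustrative only):

# a smaller h2o configuration - leave a couple of CPU cores and some RAM
# free for the operating system and other applications
h2oServer = h2o.init(nthreads = 2, max_mem_size = "4G")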

The model may take a few hours to fit, and during that time the R session will appear to be unresponsive because it is waiting for h2o to finish. If you want to see how the model is progressing, point your browser to localhost (port 54321 on my PC, but maybe a different port on yours) and use the h2o web interface to watch the training.

You can get the predicted values using the following script:


filenames = list.files(dirtyFolder)
for (f in filenames)
{
 print(f)
 imgX = readPNG(file.path(dirtyFolder, f))

dat = img2tab(imgX, f)

x.h2o = as.h2o(h2oServer, dat)
 predict.dl = as.data.frame(h2o.predict(model.dl.median, newdata = x.h2o))
 imgOut = matrix(as.numeric(predict.dl$predict), nrow(imgX), ncol(imgX))
 
 # correct the pixel brightnesses that are out of bounds
 imgOut[imgOut > 1] = 1
 imgOut[imgOut < 0] = 0

writePNG(imgOut, file.path(outFolder, f))
}

h2o.shutdown()

Running predictions is as simple as creating a data file, importing it to h2o, and then asking h2o to give you the predicted values from your already fitted model. I found that some of the raw predicted values were out of the [0, 1] range, and improved my leaderboard score by limiting the predicted values to lie within this range.

You do not need to shut down h2o after you finish running a model. In fact you may wish to leave it running so that you can do model diagnostics or run more predictions.

If you wish to save a copy of your model, for later reuse, then you can use the following syntax:


modelPath = h2o.saveModel(model.dl.median, dir = "./model", name = "model_dnn_median", force = TRUE)

Just remember that h2o needs to be running when you save models or load previously saved models.
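To reload that saved model in a later session, something along the lines of the sketch below should work (the exact argument names depend on your h2o version, so check the help for h2o.loadModel):

library(h2o)
h2oServer = h2o.init(nthreads = 6, max_mem_size = "10G")
# modelPath is the value that was returned by h2o.saveModel
model.dl.median = h2o.loadModel(h2oServer, path = modelPath)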

In my next, and final, blog in this series, I will show how to take advantage of the second information leakage in the competition.

For those who want the entire R script to try out for themselves, here it is:


install.packages("h2o")
library(h2o)
library(png)
library(data.table)

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

median_Filter = function(img, filterWidth)
{
pad = floor(filterWidth / 2)
padded = matrix(NA, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

tab = matrix(0, nrow(img) * ncol(img), filterWidth * filterWidth)
k = 1
for (i in seq_len(filterWidth))
{
for (j in seq_len(filterWidth))
{
tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
k = k + 1
}
}

filtered = unlist(apply(tab, 1, function(x) median(x[!is.na(x)])))
return (matrix(filtered, nrow(img), ncol(img)))
}

# a function that uses median filter to get the background then finds the dark foreground
background_Removal = function(img)
{
w = 5
p = 1.39
th = 240

# the background is found via a median filter
background = median_Filter(img, w)

# the foreground is darker than the background
foreground = img / background
foreground[foreground > 1] = 1

foreground2 = foreground ^ p
foreground2[foreground2 >= (th / 255)] = 1

return (matrix(foreground2, nrow(img), ncol(img)))
}

dirtyFolder = "./data/train"
cleanFolder = "./data/train_cleaned"
outFolder = "./model"
foregroundFolder = "./foreground/train foreground"

outPath = file.path(outFolder, "trainingdata.csv")
outPath2 = file.path(outFolder, "testdata.csv")
filenames = list.files(dirtyFolder)
padding = 2
set.seed(1)
library(h2o)
h2oServer = h2o.init(nthreads = 15, max_mem_size = "110G")

trainData = h2o.importFile(h2oServer, path = outPath)
testData = h2o.importFile(h2oServer, path = outPath2)

model.dl.median <- h2o.deeplearning(x = 2:ncol(trainData), y = 1, training_frame = trainData, validation_frame = testData,
score_training_samples = 0,
overwrite_with_best_model = TRUE,
activation = "Rectifier", seed = 1,
hidden = c(200, 200,200), epochs = 15,
adaptive_rate = TRUE, initial_weight_distribution = "UniformAdaptive", loss = "MeanSquare",
fast_mode = T, diagnostics = T, ignore_const_cols = T,
force_load_balance = T)

summary(model.dl.median)

modelPath = h2o.saveModel(model.dl.median, dir = "./model", name = "model_dnn_median", force = TRUE)

outFolder = "./model/training data"

img2tab = function(imgX, f)
{
median5 = img2vec(median_Filter(imgX, 5))
median17 = img2vec(median_Filter(imgX, 17))
median25 = img2vec(median_Filter(imgX, 25))
backgroundRemoval = img2vec(background_Removal(imgX))
foreground = readPNG(file.path(foregroundFolder, f))

# pad out imgX
padded = matrix(0, nrow(imgX) + padding * 2, ncol(imgX) + padding * 2)
offsets = expand.grid(seq_len(2*padding+1), seq_len(2*padding+1))

# raw pixels window
padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = imgX
x = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

# x2 window
padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = median5
x2 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

# x3 window
padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = median17
x3 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

# x4 window
padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = median25
x4 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

# x5 window
padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = backgroundRemoval
x5 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

# x6 window
padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = foreground
x6 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

dat = data.table(cbind(x, x2, x3, x4, x5, x6))
setnames(dat,c(
paste("x", seq_len((2*padding+1)^2), sep=""),
paste("median5", seq_len((2*padding+1)^2), sep=""),
paste("median17", seq_len((2*padding+1)^2), sep=""),
paste("median25", seq_len((2*padding+1)^2), sep=""),
paste("backgroundRemoval", seq_len((2*padding+1)^2), sep=""),
paste("foreground", seq_len((2*padding+1)^2), sep="")
))

return (dat)
}

dirtyFolder = "./data/test"
outFolder = "./model/test data"
foregroundFolder = "./foreground/test foreground"
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))

dat = img2tab(imgX, f)

x.h2o = as.h2o(h2oServer, dat)
predict.dl = as.data.frame(h2o.predict(model.dl.median, newdata = x.h2o))
imgOut = matrix(as.numeric(predict.dl$predict), nrow(imgX), ncol(imgX))

# correct the pixel brightnesses that are out of bounds
imgOut[imgOut > 1] = 1
imgOut[imgOut < 0] = 0

writePNG(imgOut, file.path(outFolder, f))
}

h2o.shutdown()


Denoising Dirty Documents – Part 10

01 Sunday Nov 2015

Posted by Colin Priest in Convolutional Neural Networks, Deep Learning, Image Processing, Kaggle, Machine Learning, Python

≈ 27 Comments

Tags

Convolutional Neural Networks, Deep Learning, Image Processing, Kaggle, Machine Learning, Python

In my last blog, I explained how to take advantage of an information leakage regarding the repeated backgrounds in Kaggle’s Denoising Dirty Documents competition. The result of that process was that we had done a fairly good job of removing the background. But the score from doing this was not good enough to get a good placing. We need to do some processing on the image to improve the score.

Today we will use an approach that does not require me to do any feature engineering – convolutional neural networks, which are neural networks where the first few layers repeatedly apply the same weights across overlapping regions of the input data. One intuitive way of thinking about this is that it is like applying an edge detection filter (much like I described here) where the algorithm finds the appropriate weights for several different edge filters.

_images/mylenet.png
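Although the modelling in this blog is done in Python, here is a small R sketch (in the spirit of that earlier edge detection blog) of what a single convolutional filter does: the same small grid of weights is applied at every location in the image, and the network learns the weights rather than us hand-picking them.

# apply one 3x3 filter to an image by taking a weighted sum at every location
# (the image edges are skipped to keep the sketch short)
convolve3x3 = function(img, w)
{
 out = matrix(0, nrow(img) - 2, ncol(img) - 2)
 for (i in seq_len(nrow(out)))
 for (j in seq_len(ncol(out)))
 out[i, j] = sum(img[i:(i + 2), j:(j + 2)] * w)
 return (out)
}

# a hand-picked vertical edge filter; a convolutional layer learns many such grids
w = matrix(c(-1, -1, -1, 0, 0, 0, 1, 1, 1), 3, 3)
# edges = convolve3x3(imgX, w)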

I’m told that convolutional neural networks are inspired by how vision works in the natural world. So if I test whether convolutional neural networks work well, am I giving them a robot eye test?

robot eye test

While I am comfortable coding in R, there is little support for convolutional neural networks in R, so I had to code this in Python, using the Theano library. I chose Theano because neural network model fitting can be quite time consuming, and Theano supports GPU based processing, which can be orders of magnitude faster than CPU based calculations. To simplify the code, I am using Daniel Nouri’s nolearn library, which sits over the lasagne library, which sits over the Theano library. This is the first time I have coded in Python and the first time I have used convolutional neural networks, so it was a good learning experience.

Since my PCs don’t have top of the line graphics cards with GPU processing support, I decided to run my code in the cloud on a virtual machine with GPU support. And since I didn’t want to go through the effort of setting up a Linux machine and installing all of the libraries and compilers, I used Domino Data Labs to host my analysis. You can find my project here.

Before I progress to coding the model, I have to set up the environment.

import os
import shutil
 
def setup_theano():
	destfile = "/home/ubuntu/.theanorc"
	open(destfile, 'a').close()
	shutil.copyfile("/mnt/.theanorc", destfile)
 
	print "Finished setting up Theano"

The Python script shown above creates a function that copies the Theano settings file into the appropriate folder in the virtual machine, so that Theano knows to use GPU processing rather than CPU processing. This function gets called from my main script.

In order to reduce the number of calculations, I used a network architecture that inputs an image and outputs an image. The suggestion for this architecture comes from ironbar, a great guy who placed third in the competition. This is unlike all of the examples I found online, which have just one or two outputs, usually because those online examples are classification problems identifying objects appearing within the image. But there are two issues with this architecture:

  1. it doesn’t allow for fully connected layers before the output, and
  2. the target images are different sizes.

I chose to ignore the first problem, although if I had time I would have tried out a more traditional architecture that included fully connected layers but which only models one target pixel at a time.

def image_matrix(img):
    """
    The output value has shape (<number of images>, 1, <number of rows>, <number of columns>)
    """
    # 420 x 540 or 258 x 540?
    if img.shape[0] == 258:
        return (img[0:258, 0:540] / 255.0).astype('float32').reshape((1, 1, 258, 540))
    if img.shape[0] == 420:
        result = []
        result.append((img[0:258, 0:540] / 255.0).astype('float32').reshape((1, 1, 258, 540)))
        result.append((img[162:420, 0:540] / 255.0).astype('float32').reshape((1, 1, 258, 540)))
        result = np.vstack(result).astype('float32').reshape((2, 1, 258, 540))
        return result

For the second problem, I used the script shown above to split the larger images into two smaller images that were the same size as the other small images in the data, thereby standardising the output dimensions.

def load_train_set(file_list):
    xs = []
    ys = []
    for fname in file_list:
        x = image_matrix(load_image(os.path.join('./train_foreground/', fname)))
        y = image_matrix(load_image(os.path.join('./train_cleaned/', fname)))
        for i in range(0, x.shape[0]):
            xs.append(x[i, :, :, :].reshape((1, 1, 258, 540)))
            ys.append(y[i, :, :, :].reshape((1, 1, 258, 540)))
    return np.vstack(xs), np.vstack(ys)

Theano uses tensors (multi-dimensional matrices) to store the training data and outputs. The first dimension is the index of the training data item. The second dimension is the colourspace information e.g. RGB would be 3 dimensions. Since our images are greyscale, this dimension has a size of only 1. The remaining dimensions are the dimensions of the input data / output data. The script shown above reshapes the data to meet these requirements.

The nolearn library simplifies the process of defining the architecture of a neural network. I used 3 hidden convolutional layers, each with 25 image filters. The script below shows how this was achieved.

net2 = NeuralNet(
    layers = [
        ('input', layers.InputLayer),
        ('conv1', layers.Conv2DLayer),
        ('conv2', layers.Conv2DLayer),
        ('conv3', layers.Conv2DLayer),
        ('output', layers.FeaturePoolLayer),
    ],
    # layer parameters:
    input_shape = (None, 1, 258, 540),
    conv1_num_filters = 25, conv1_filter_size = (7, 7), conv1_pad = 'same',
    conv2_num_filters = 25, conv2_filter_size = (7, 7), conv2_pad = 'same',
    conv3_num_filters = 25, conv3_filter_size = (7, 7), conv3_pad = 'same',
    output_pool_size = 25,
    output_pool_function = T.sum,
    y_tensor_type = T.tensor4,

    # optimization parameters:
    update = nesterov_momentum,
    update_learning_rate = 0.005,
    update_momentum = 0.9,
    regression = True,
    max_epochs = 200,
    verbose = 1,
    batch_iterator_train = BatchIterator(batch_size=25),
    on_epoch_finished = [EarlyStopping(patience=20)],
    train_split = TrainSplit(eval_size=0.25)
)

Due to the unique nature of the problem versus the online examples, my first attempt at this script was not successful. Here are the key changes that I needed to make:

  • set y_tensor_type=T.tensor4 because the target is 2 dimensional
  • your graphics card almost certainly doesn’t have enough RAM to process all of the images at once, so you need to use a batch iterator and experiment to find a suitable batch size e.g. batch_iterator_train=BatchIterator(batch_size=25)

plot_loss(net2)
plt.savefig("./results/plotloss.png")

I also wanted to plot the loss across the iterations. So I added the two lines above, giving me the graph below.

During the first several iterations, the neural network is balancing out the weights so that the pixels are the correct magnitude, and after that the serious work of image processing begins.


plot_conv_weights(net2.layers_[1], figsize=(4, 4))
plt.savefig("./results/convweights.png")

I wanted to see what some of the convolutional filters looked like. So I added the two lines shown above, giving me the set of images below.

These filters look like small parts of images of letters, which makes some sense because we are trying to identify whether a pixel sits on the stroke of a letter.

The output looks reasonable, although not as good as what I achieve using a combination of image processing techniques and deep learning.

In the output image above you can see some imperfections.

I think that if I had changed the network architecture to include fully connected layers then I would have achieved a better result. Maybe one day when I have enough time I will experiment with that architecture.

The full Python script is shown below:


import random
import numpy as np
import cv2
import os
import itertools
import math
import matplotlib.pyplot as plt

from setup_GPU import setup_theano
setup_theano()

from lasagne import layers
from lasagne.updates import nesterov_momentum
from lasagne.nonlinearities import softmax
from lasagne.nonlinearities import sigmoid
from nolearn.lasagne import BatchIterator
from nolearn.lasagne import NeuralNet
from nolearn.lasagne import TrainSplit
from nolearn.lasagne import PrintLayerInfo
from nolearn.lasagne.visualize import plot_loss
from nolearn.lasagne.visualize import plot_conv_weights
from nolearn.lasagne.visualize import plot_conv_activity
from nolearn.lasagne.visualize import plot_occlusion

import theano.tensor as T

def load_image(path):
    return cv2.imread(path, cv2.IMREAD_GRAYSCALE)

def write_image(img, path):
    return cv2.imwrite(path, img)

def image_matrix(img):
    """
    The output value has shape (<number of images>, 1, <number of rows>, <number of columns>)
    """
    # 420 x 540 or 258 x 540?
    if img.shape[0] == 258:
        return (img[0:258, 0:540] / 255.0).astype('float32').reshape((1, 1, 258, 540))
    if img.shape[0] == 420:
        result = []
        result.append((img[0:258, 0:540] / 255.0).astype('float32').reshape((1, 1, 258, 540)))
        result.append((img[162:420, 0:540] / 255.0).astype('float32').reshape((1, 1, 258, 540)))
        result = np.vstack(result).astype('float32').reshape((2, 1, 258, 540))
        return result

def load_train_set(file_list):
    xs = []
    ys = []
    for fname in file_list:
        x = image_matrix(load_image(os.path.join('./train_foreground/', fname)))
        y = image_matrix(load_image(os.path.join('./train_cleaned/', fname)))
        for i in range(0, x.shape[0]):
            xs.append(x[i, :, :, :].reshape((1, 1, 258, 540)))
            ys.append(y[i, :, :, :].reshape((1, 1, 258, 540)))
    return np.vstack(xs), np.vstack(ys)

def load_test_file(fname, folder):
    xs = []
    x = image_matrix(load_image(os.path.join(folder, fname)))
    for i in range(0, x.shape[0]):
        xs.append(x[i, :, :, :].reshape((1, 1, 258, 540)))
    return np.vstack(xs)

def list_images(folder):
    included_extentions = ['jpg', 'bmp', 'png', 'gif']
    results = [fn for fn in os.listdir(folder) if any([fn.endswith(ext) for ext in included_extentions])]
    return results

def do_test(inFolder, outFolder, nn):
    test_images = list_images(inFolder)
    nTest = len(test_images)
    for x in range(0, nTest):
        fname = test_images[x]
        x1 = load_test_file(fname, inFolder)
        x1 = x1 - 0.5
        pred_y = nn.predict(x1)
        tempImg = []
        if pred_y.shape[0] == 1:
            tempImg = pred_y[0, 0, :, :].reshape(258, 540)
        if pred_y.shape[0] == 2:
            tempImg1 = pred_y[0, 0, :, :].reshape(258, 540)
            tempImg2 = pred_y[1, 0, :, :].reshape(258, 540)
            tempImg = np.empty((420, 540))
            tempImg[0:258, 0:540] = tempImg1
            tempImg[162:420, 0:540] = tempImg2
        tempImg[tempImg < 0] = 0
        tempImg[tempImg > 1] = 1
        tempImg = np.asarray(tempImg*255.0, dtype=np.uint8)
        write_image(tempImg, (os.path.join(outFolder, fname)))

class EarlyStopping(object):
    def __init__(self, patience=100):
        self.patience = patience
        self.best_valid = np.inf
        self.best_valid_epoch = 0
        self.best_weights = None

    def __call__(self, nn, train_history):
        current_valid = train_history[-1]['valid_loss']
        current_epoch = train_history[-1]['epoch']
        if current_valid < self.best_valid:
            self.best_valid = current_valid
            self.best_valid_epoch = current_epoch
            self.best_weights = nn.get_all_params_values()
        elif self.best_valid_epoch + self.patience < current_epoch:
            print("Early stopping.")
            print("Best valid loss was {:.6f} at epoch {}.".format(
                self.best_valid, self.best_valid_epoch))
            nn.load_params_from(self.best_weights)
            raise StopIteration()

def main():
    random.seed(1234)

    training_images = list_images("./train_foreground/")
    random.shuffle(training_images)
    nTraining = len(training_images)
    TRAIN_IMAGES = training_images

    train_x, train_y = load_train_set(TRAIN_IMAGES)
    test_x = train_x
    test_y = train_y

    # centre on zero - has already been divided by 255
    train_x = train_x - 0.5

    net2 = NeuralNet(
        layers = [
            ('input', layers.InputLayer),
            ('conv1', layers.Conv2DLayer),
            ('conv2', layers.Conv2DLayer),
            ('conv3', layers.Conv2DLayer),
            ('output', layers.FeaturePoolLayer),
        ],
        # layer parameters:
        input_shape = (None, 1, 258, 540),
        conv1_num_filters = 25, conv1_filter_size = (7, 7), conv1_pad = 'same',
        conv2_num_filters = 25, conv2_filter_size = (7, 7), conv2_pad = 'same',
        conv3_num_filters = 25, conv3_filter_size = (7, 7), conv3_pad = 'same',
        output_pool_size = 25,
        output_pool_function = T.sum,
        y_tensor_type = T.tensor4,

        # optimization parameters:
        update = nesterov_momentum,
        update_learning_rate = 0.005,
        update_momentum = 0.9,
        regression = True,
        max_epochs = 200,
        verbose = 1,
        batch_iterator_train = BatchIterator(batch_size=25),
        on_epoch_finished = [EarlyStopping(patience=20)],
        train_split = TrainSplit(eval_size=0.25)
    )

    net2.fit(train_x, train_y)

    plot_loss(net2)
    plt.savefig("./results/plotloss.png")
    plot_conv_weights(net2.layers_[1], figsize=(4, 4))
    plt.savefig("./results/convweights.png")

    #layer_info = PrintLayerInfo()
    #layer_info(net2)

    import cPickle as pickle
    with open('results/net2.pickle', 'wb') as f:
        pickle.dump(net2, f, -1)

    y_pred2 = net2.predict(test_x)
    print "The accuracy of this network is: %0.2f" % (abs(y_pred2 - test_y)).mean()

    do_test("./train_foreground/", './train_predicted/', net2)
    do_test("./test_foreground/", './test_predicted/', net2)

if __name__ == '__main__':
    main()


Denoising Dirty Documents: Part 9

15 Thursday Oct 2015

Posted by Colin Priest in Background Removal, Image Processing, Kaggle, R

≈ 3 Comments

Tags

Background Removal, Image Processing, Kaggle, R

Now that Kaggle’s Denoising Dirty Documents Competition has closed, it’s time to start posting the secrets to getting a very good score in this competition. In this blog, I describe how to take advantage of the first of two information leakages that I used.

512px-broken-water-pipe-clip-art-380417

Information leakage occurs in predictive modelling when the training and test data includes values that would not be known at the time a prediction was being made. For example, I once worked on a direct marketing sales propensity modelling project where our predictive model fitted the training data far too well, making us suspicious. We eventually tracked it down to an incorrectly designed data extract that used status values as at the data extraction date, instead of as at the date at which a prediction would have been run. The sale of the product to the customer changed the value of that data field, so the data needed to be the value of the data field before the sale occurred. In real life projects, information leakage is a bad thing because it overstates the model accuracy. Therefore you need to ensure that it does not occur; otherwise your predictive model will not perform well. In data science competitions, information leakage is something to be taken advantage of. It enables you to obtain a higher score.

The first information leakage in this competition comes from how the training data was created. There are only 8 different page backgrounds. There are 2 coffee cup stains, 2 folded pages, 2 watermarks and 2 crumpled pages.

20151015 output 1

This gives us a huge advantage in removing the background and the stains. Instead of using an uncertain estimate based upon only a single image, we can group together the images so that each group has the same background. Then, for each pixel location, calculate the brightest pixel across all of the images in that group, and use the brightest value as the pixel brightness in the background. You can find a couple of scripts on Kaggle that do this.
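Here is a minimal sketch of that idea, assuming groupFiles is a character vector of training file names that share the same background (and that all of those images have the same dimensions):

library(png)
# read every image in the group
imgs = lapply(groupFiles, function(f) readPNG(file.path(dirtyFolder, f)))
# the background estimate is the brightest value at each pixel location
background = Reduce(pmax, imgs)
# writePNG(background, file.path(backgroundFolder, "background.png")) # backgroundFolder is hypothetical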

The problem is that you can’t do the same for the test images, because their backgrounds are different from the training backgrounds, and there are only four of them. You couldn’t have known this without manually looking at the test images, and that breaks the rules prohibiting manual processing of the test images.

20151015 output 2

You can also run into trouble if your model just assumes that the test images are arranged so that there are 8 different background images that repeat every 8 images.

A better solution is to let your model decide which images can be grouped together, and let your model decide which background belongs with each image. The way to do this is to apply a median filter to each image, to obtain an estimate of the background of each image, and then group the median images together according to similarity.


library(png)
library(raster)
library(data.table)

median_Filter = function(img, filterWidth)
{
pad = floor(filterWidth / 2)
padded = matrix(NA, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

tab = matrix(0, nrow(img) * ncol(img), filterWidth * filterWidth)
k = 1
for (i in seq_len(filterWidth))
{
for (j in seq_len(filterWidth))
{
tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
k = k + 1
}
}

filtered = unlist(apply(tab, 1, function(x) median(x[!is.na(x)])))
return (matrix(filtered, nrow(img), ncol(img)))
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

# training data
dirtyFolder = "C:/Users/Colin/dropbox/Kaggle/Denoising Dirty Documents/data/train"
outPath = "D:/CNN with background removal/train median25"
filenames = list.files(dirtyFolder)
# use a 25x25 median filter to get the background
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))

median25 = median_Filter(imgX, 25)

outFile = file.path(outPath, f)
writePNG(median25, outFile)
}

The script above calculates the median filter for each image and stores it in a folder. I have used a filter size of 25 pixels because I want broad patterns of the background, and I want the text removed.

Now that I have the median images, I can iterate through each training image and link it to other images that have similar median images. I have used RMSE as a measure of similarity. The cutoff RMSE of 2.5% was determined by looking at the range of RMSE values and looking for a natural break.

20151015 output 3
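A minimal sketch of how to eyeball that natural break, assuming scores holds the RMSE of one median image against every other median image (as calculated in the loop below):

# sort the RMSE values and look for the jump that separates images with the
# same background from images with a different background
plot(sort(scores), type = "b", xlab = "image rank", ylab = "RMSE versus other median images")
abline(h = 0.025, col = "red") # the 2.5% cutoff chosen at the natural break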


# find the images with matching background
outPath2 = "D:/CNN with background removal/train background"
outPath3 = "D:/CNN with background removal/train foreground"
for (i in 1:length(filenames))
{
f = filenames[i]
print(f)

imgX = readPNG(file.path(outPath, f))

scores = matrix(1, length(filenames))
for (j in 1:length(filenames))
{
imgY = readPNG(file.path(outPath, filenames[j]))
rmse = 1
if (nrow(imgY) >= nrow(imgX) & ncol(imgY) >= ncol(imgX)) rmse = sqrt(mean((imgX - imgY[1:nrow(imgX), 1:ncol(imgX)]) ^ 2))
scores[j] = rmse
}

sameStains = filenames[scores <= 0.025]
nImages = length(sameStains)

rawData = matrix(0, ncol(imgX) * nrow(imgX), nImages)
for (j in 1:nImages)
{
imgY = readPNG(file.path(dirtyFolder, sameStains[j]))
rawData[,j] = img2vec(imgY[1:nrow(imgX), 1:ncol(imgX)])
}

background = matrix(unlist(apply(rawData,1,max)), nrow(imgX), ncol(imgX)) # background is defined as the lightest pixel of images with similar median transformations
plot(raster(background))
writePNG(background, file.path(outPath2, f))

imgX = readPNG(file.path(dirtyFolder, f))
foreground = (imgX - background) / background
r = range(foreground)
foreground = (foreground - r[1]) / (r[2] - r[1])
#plot(raster(foreground))
writePNG(foreground, file.path(outPath3, f))
}

One of the tricks in the script above was that I didn’t simply subtract the background from the image. If I had done that, then the result would not be consistent in areas where the stains occur:

20151015 output 4

20151015 foreground-bad

Notice that in the image above, the writing is faded where the stain was dark. That’s because it was created with the code:


foreground = (imgX - background)
r = range(foreground)
foreground = (foreground - r[1]) / (r[2] - r[1])

Where the background is dark, there is less opportunity for the writing to contrast against the background. To fix this, I changed the above script to be:


foreground = (imgX - background) / background
r = range(foreground)
foreground = (foreground - r[1]) / (r[2] - r[1])

By dividing the difference by the background brightness, I have rescaled the contrast to allow for the limitation on the maximum contrast at this location. The result is shown below:
20151015 foreground-good
This time the writing has consistent darkness across the entire image, regardless of the brightness of the background pixels.

My competition submission consisted of 4 stages, and this leakage-based background removal was the first stage. The final stage also took advantage of an information leakage. But you will have to wait for a couple of blogs to see what that was…


Denoising Dirty Documents: Part 8

02 Friday Oct 2015

Posted by Colin Priest in Image Processing, Kaggle, Machine Learning, R

≈ 4 Comments

Tags

Image Processing, Kaggle, Machine Learning, R

In this blog we will engineer a new feature to go into our model. So far we have predominantly been using localised features – information about pixels that are located nearby the pixel whose brightness we are predicting. In this blog we will consider the structure of a document, and use that to improve our model.

20150801 - after

The image consists of multiple lines of text, arranged into one or more paragraphs with multiple sentences in each paragraph.

ducks lined up

If we get our ducks in a row, we can use the fact that the text (like our metaphorical ducks) is arranged into lines, and that there are gaps between those lines of text. In particular, if we can find the gaps between the lines, we can ensure that the predicted value within those gaps is always the background colour.

Let’s look at the horizontal profile of a sample input image:


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png)

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"

f = "135.png"
imgX = readPNG(file.path(dirtyFolder, f))

hProfileX = unlist(apply(imgX, 1, mean))

x = 1:nrow(imgX)
plot(x, hProfileX, col="red", type="l", xlab = "row", ylab = "brightness")

20150930 output 01

The rows with low brightness show where text occurs at regular intervals. But the absolute level of the row brightness varies, according to the noise and stains, so much that sometimes the gaps between the lines of text are darker than the rows in which text occurs. This will make it more difficult to determine which rows are which.

But what if we used the predicted images from our stacked model from the last blog as the inputs to this predictor?


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png)

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\stacked"

f = "135.png"
imgX = readPNG(file.path(dirtyFolder, f))

hProfileX = unlist(apply(imgX, 1, mean))

x = 1:nrow(imgX)
plot(x, hProfileX, col="red", type="l", xlab = "row", ylab = "brightness")

20150930 output 02

Now it’s much easier to find the gaps between the lines of text. To do this we need estimate three parameters:

  1. cycle length: the number of rows from one line of text to the next line of text
  2. gap length: the height of the gap between lines of text
  3. starting location of the first cycle

The constraints that we can use for determining these parameters are:

  • cycle length is a positive number
  • gap length is a positive number
  • gap length is less than cycle length
  • gap rows are white
  • most letters (e.g. weruioaszxcvnm) are short and these letters are located in the darkest rows, which we will call the “core rows”
  • some letters are taller (e.g. tdfhklb) and create slightly darker rows above the core rows
  • some letters are taller (e.g. qypgj) and create slightly darker rows below the core rows
  • if we cluster the values, the darker cluster contains the core of the letters (the core rows), but these rows may be lighter when a paragraph starts or ends
  • the in-between valued rows next to the white rows are possibly where the taller letters are located and their darkness will vary according to the number of tall letters

Let’s start by clustering the row brightnesses, and checking where the rows switch from one cluster to another, and using this to estimate cycle length.


km = kmeans(hProfileX, 2)
cutoff = (max(hProfileX[km$cluster == which.min(km$centers)]) + min(hProfileX[km$cluster == which.max(km$centers)])) / 2
plot(hProfileX, type="l")
lines(hProfileX * 0 + cutoff, col="red")

starts = (2:length(hProfileX))[sapply(2:length(hProfileX), function(x) hProfileX[x] < cutoff & hProfileX[x-1] > cutoff)]
cycleLength = mean(diff(starts))

20150930 output 03

20150930 output 04

Next we will estimate the starting location of the first cycle. But before we do that, we need to define the “start” of the cycle. Let’s define the cycle as starting at the row in which the first letter starts. That would be the first non-white row before the first core rows.

cycleStarts = starts[1]
while (hProfileX[cycleStarts] < 0.99)
 cycleStarts = cycleStarts - 1

The first cycle starts at row 3.

To estimate the gap length, we should measure the number of white rows before each cycle starts.


gaps = matrix(0, length(starts) - 1)
for (i in 2:length(starts))
{
 cycleStarts = starts[i]
 while (hProfileX[cycleStarts] < 0.99)
 cycleStarts = cycleStarts - 1
 gapStarts = cycleStarts - 1
 while (hProfileX[gapStarts] > 0.99)
 gapStarts = gapStarts - 1
 gapLength = cycleStarts - gapStarts
 gaps[i - 1] = gapLength
}
gaps

The gaps are (13, 12, 17, 13, 12, 11, 12, 18, 12, 12). The two outliers occur after rows of text that do not contain any of the characters (q, y, p, g, j). After removing these two outliers, the gaps do not exceed 13 rows. So we can write code to allow for this as follows:


mGaps = median(gaps)
gapLength = mean(gaps[abs(gaps - mGaps) <= 2])

This gives us a gap length estimate of 12.125 and completes our line location model for this image. The quick way to use this model is to white out the gap rows. A more complex model could use the newly estimated row types (core, gap, other) as predictors, as sketched at the end of this blog.


imgCleaned = imgX
numCycles = length(starts)
for (i in 1:numCycles)
{
cycleStart = max(1, starts[i] - min(gaps[abs(gaps - mGaps) <= 2]))
imgCleaned[cycleStart:(starts[i]-1), ] = 1
}

The script above would clean up any minor image blemishes that exist between the lines of text.
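As a minimal sketch of that more complex idea, reusing hProfileX, cutoff, starts, gaps and mGaps from the code above, you could label each row and then repeat the labels once per pixel, ready to be joined onto the pixel-level data as a categorical predictor:

# label each row of the image as "core" (dark text rows), "gap" (white rows
# between lines of text) or "other"
rowType = rep("other", nrow(imgX))
rowType[hProfileX < cutoff] = "core"
typicalGap = round(median(gaps[abs(gaps - mGaps) <= 2]))
for (i in seq_along(starts))
{
 gapStart = max(1, starts[i] - typicalGap)
 if (gapStart < starts[i]) rowType[gapStart:(starts[i] - 1)] = "gap"
}
# one label per pixel, matching the column-major order used to vectorise the image
rowTypePixel = rep(rowType, times = ncol(imgX))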


Denoising Dirty Documents: Part 7

23 Wednesday Sep 2015

Posted by Colin Priest in Image Processing, Kaggle, Machine Learning, R, Stacking

≈ 3 Comments

Tags

Image Processing, Kaggle, Machine Learning, R, Stacking

By the time I’d finished building the model in my last blog, I’d started to overload my computer’s RAM and CPU, so much so that I couldn’t add any more features. One solution could be for me to upgrade my hardware, or rent out a cluster in the cloud, but I’m trying to save money at the moment. So today I’m going to restructure my predictive solution into separate predictive models, each of which do not individually overload my computer, but which are combined via stacking to give an overall solution that is more predictive than any of the individual models.

stacking

Stacking would also allow us to answer Bobby’s question:

I thought it was interesting how the importance chart didn’t show any one individual pixel as being significant. Is there another similar importance chart or metric that can show the importance of a combination of pixels?

So let’s start by breaking up the current monolithic model into discrete chunks. Once we have done this, it will be easier for us to add new features (which I will do in the next blog). I will store the predicted outputs from each model in image format with separate folders for each model.

20150923 diagram 1

The R script for creating the individual models is primarily recycled from my past blogs. Since images contain integer values for pixel brightness, one may argue that storing the predicted values as images loses some of the information content, but the rounding of floating point values to integer values only affects the precision by a maximum of 0.2%, and the noise in the predicted values is an order of magnitude greater than that. If you want that fractional extra predictive power, then I leave it to you to adapt the script to write the floating point predictions into a data.table.
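For what it's worth, a minimal sketch of that adaptation would replace the writePNG call inside each model's per-image loop with something like the code below (the output file name and column names are just placeholders):

library(data.table)
# keep the unrounded predictions instead of writing an 8-bit PNG
dat = data.table(file = f, pixel = seq_along(yHat), yHat = yHat)
write.table(dat, file = "model1_predictions.csv", append = (f != filenames[1]), sep = ",",
 row.names = FALSE, col.names = (f == filenames[1]), quote = FALSE)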

Linear Transformation

The linear transformation model was developed here, and just involves a linear transformation, then constraining the output to be in the [0, 1] range.

20150801 output 6


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png)

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"

# predict using linear transformation
outModel1 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model1"
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))

# turn the images into vectors
x = matrix(imgX, nrow(imgX) * ncol(imgX), 1)

# linear transformation
yHat = -0.126655 + 1.360662 * x

# constrain the range to be in [0, 1]
yHat = sapply(yHat, function(x) max(min(x, 1),0))

# turn the vector into an image
imgYHat = matrix(yHat, nrow(imgX), ncol(imgX))

# save the predicted value
writePNG(imgYHat, file.path(outModel1, f))
}

Thresholding

The thresholding model was developed here, and involves a combination of adaptive thresholding processes, each with different sizes of localisation ranges, plus a global thresholding process. But this time we will use xgboost instead of gbm.

20150815 output 3


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, data.table, xgboost)

if (!require("EBImage"))
{
source("http://bioconductor.org/biocLite.R")
biocLite("EBImage")
}

# a function to do k-means thresholding
kmeansThreshold = function(img)
{
# fit 3 clusters
v = img2vec(img)
km.mod = kmeans(v, 3)
# allow for the random ordering of the clusters
oc = order(km.mod$centers)
# the higher threshold is the halfway point between the top of the middle cluster and the bottom of the highest cluster
hiThresh = 0.5 * (max(v[km.mod$cluster == oc[2]]) + min(v[km.mod$cluster == oc[3]]))

# using upper threshold
imgHi = v
imgHi[imgHi <= hiThresh] = 0
imgHi[imgHi > hiThresh] = 1

return (imgHi)
}

# a function that applies adaptive thresholding
adaptiveThresholding = function(img)
{
img.eb <- Image(t(img))
img.thresholded.3 = thresh(img.eb, 3, 3)
img.thresholded.5 = thresh(img.eb, 5, 5)
img.thresholded.7 = thresh(img.eb, 7, 7)
img.thresholded.9 = thresh(img.eb, 9, 9)
img.thresholded.11 = thresh(img.eb, 11, 11)
img.kmThresh = kmeansThreshold(img)

# combine the adaptive thresholding
ttt.1 = cbind(img2vec(Image2Mat(img.thresholded.3)), img2vec(Image2Mat(img.thresholded.5)), img2vec(Image2Mat(img.thresholded.7)), img2vec(Image2Mat(img.thresholded.9)), img2vec(Image2Mat(img.thresholded.11)), img2vec(kmeansThreshold(img)))
ttt.2 = apply(ttt.1, 1, max)
ttt.3 = matrix(ttt.2, nrow(img), ncol(img))
return (ttt.3)
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

# a function to convert an Image into a matrix
Image2Mat = function(Img)
{
m1 = t(matrix(Img, nrow(Img), ncol(Img)))
return(m1)
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outPath = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model2.csv"
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))
imgY = readPNG(file.path(cleanFolder, f))

# turn the images into vectors
x = img2vec(imgX)
y = img2vec(imgY)

# threshold the image
x2 = kmeansThreshold(imgX)

# adaptive thresholding
x3 = img2vec(adaptiveThresholding(imgX))

dat = data.table(cbind(y, x, x2, x3))
setnames(dat,c("y", "raw", "thresholded", "adaptive"))
write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit an xgboost model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 2000000)
dat[is.na(dat)] = 0
dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
# do cross validation first
xgb.tab = xgb.cv(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = 1000, early.stop.round = 50, nfold = 5, print.every.n = 10)
# what is the best number of rounds?
min.error.idx = which.min(xgb.tab[, test.rmse.mean])
# now fit an xgboost model
xgb.mod = xgboost(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)

# get the predicted result for each image
outModel2 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model2"
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))

# turn the images into vectors
x = img2vec(imgX)

# threshold the image
x2 = kmeansThreshold(imgX)

# adaptive thresholding
x3 = img2vec(adaptiveThresholding(imgX))

dat = data.table(cbind(x, x2, x3))
setnames(dat,c("raw", "thresholded", "adaptive"))

# predicted values
yHat = predict(xgb.mod, newdata=as.matrix(dat))

# constrain the range to be in [0, 1]
yHat = sapply(yHat, function(x) max(min(x, 1),0))

# turn the vector into an image
imgYHat = matrix(yHat, nrow(imgX), ncol(imgX))

# save the predicted value
writePNG(imgYHat, file.path(outModel2, f))
}

Edges

The edge model was developed here, and involves a combination of canny edge detection and image morphology.

20150822 output 2


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, data.table, xgboost, biOps)

# a function to do canny edge detector
cannyEdges = function(img)
{
img.biOps = imagedata(img * 255)
img.canny = imgCanny(img.biOps, 0.7)
return (matrix(img.canny / 255, nrow(img), ncol(img)))
}

# a function combining canny edge detector with morphology
cannyDilated1 = function(img)
{
img.biOps = imagedata(img * 255)
img.canny = imgCanny(img.biOps, 0.7)
# do some morphology on the edges to fill the gaps between them
mat <- matrix (0, 3, 3)
mask <- imagedata (mat, "grey", 3, 3)
img.dilation = imgBinaryDilation(img.canny, mask)
img.erosion = imgBinaryErosion(img.dilation, mask)
return(matrix(img.erosion / 255, nrow(img), ncol(img)))
}

# a function combining canny edge detector with morphology
cannyDilated2 = function(img)
{
img.biOps = imagedata(img * 255)
img.canny = imgCanny(img.biOps, 0.7)
# do some morphology on the edges to fill the gaps between them
mat <- matrix (0, 3, 3)
mask <- imagedata (mat, "grey", 3, 3)
img.dilation = imgBinaryDilation(img.canny, mask)
img.erosion = imgBinaryErosion(img.dilation, mask)
img.erosion.2 = imgBinaryErosion(img.erosion, mask)
img.dilation.2 = imgBinaryDilation(img.erosion.2, mask)
return(matrix(img.dilation.2 / 255, nrow(img), ncol(img)))
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outPath = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model3.csv"
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))
imgY = readPNG(file.path(cleanFolder, f))

# turn the images into vectors
x = img2vec(imgX)
y = img2vec(imgY)

# canny edge detector and related features
x4 = img2vec(cannyEdges(imgX))
x5 = img2vec(cannyDilated1(imgX))
x6 = img2vec(cannyDilated2(imgX))

dat = data.table(cbind(y, x, x4, x5, x6))
setnames(dat,c("y", "raw", "canny", "cannyDilated1", "cannyDilated2"))
write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit an xgboost model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 2000000)
dat[is.na(dat)] = 0
dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
# do cross validation first
xgb.tab = xgb.cv(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = 1000, early.stop.round = 50, nfold = 5, print.every.n = 10)
# what is the best number of rounds?
min.error.idx = which.min(xgb.tab[, test.rmse.mean])
# now fit an xgboost model
xgb.mod = xgboost(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)

# get the predicted result for each image
outModel3 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model3"
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))

# turn the images into vectors
x = img2vec(imgX)

# canny edge detector and related features
x4 = img2vec(cannyEdges(imgX))
x5 = img2vec(cannyDilated1(imgX))
x6 = img2vec(cannyDilated2(imgX))

dat = data.table(cbind(x, x4, x5, x6))
setnames(dat,c("raw", "canny", "cannyDilated1", "cannyDilated2"))

# predicted values
yHat = predict(xgb.mod, newdata=as.matrix(dat))

# constrain the range to be in [0, 1]
yHat = sapply(yHat, function(x) max(min(x, 1),0))

# turn the vector into an image
imgYHat = matrix(yHat, nrow(imgX), ncol(imgX))

# save the predicted value
writePNG(imgYHat, file.path(outModel3, f))
}

Background Removal

The background removal model was developed here, and involves the use of median filters. An xgboost model is used to combine the median filter results.

20150829 output 1


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, data.table, xgboost)
# a function to do a median filter
median_Filter = function(img, filterWidth)
{
pad = floor(filterWidth / 2)
padded = matrix(NA, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

tab = matrix(NA, nrow(img) * ncol(img), filterWidth ^ 2)
k = 1
for (i in seq_len(filterWidth))
{
for (j in seq_len(filterWidth))
{
tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
k = k + 1
}
}

filtered = unlist(apply(tab, 1, function(x) median(x, na.rm = TRUE)))
return (matrix(filtered, nrow(img), ncol(img)))
}

# a function that uses median filter to get the background then finds the dark foreground
background_Removal = function(img)
{
w = 5

# the background is found via a median filter
background = median_Filter(img, w)

# the foreground is darker than the background
foreground = img - background
foreground[foreground > 0] = 0
m1 = min(foreground)
m2 = max(foreground)
foreground = (foreground - m1) / (m2 - m1)

return (matrix(foreground, nrow(img), ncol(img)))
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outPath = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model4.csv"
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))
imgY = readPNG(file.path(cleanFolder, f))

# turn the images into vectors
x = img2vec(imgX)
y = img2vec(imgY)

# median filter and related features
x7a = img2vec(median_Filter(imgX, 5))
x7b = img2vec(median_Filter(imgX, 11))
x7c = img2vec(median_Filter(imgX, 17))
x8 = img2vec(background_Removal(imgX))

dat = data.table(cbind(y, x, x7a, x7b, x7c, x8))
setnames(dat,c("y", "raw", "median5", "median11", "median17", "background"))
write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit an xgboost model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 2000000)
dat[is.na(dat)] = 0
dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
# do cross validation first
xgb.tab = xgb.cv(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = 1000, early.stop.round = 50, nfold = 5, print.every.n = 10)
# what is the best number of rounds?
min.error.idx = which.min(xgb.tab[, test.rmse.mean])
# now fit an xgboost model
xgb.mod = xgboost(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)

# get the predicted result for each image
outModel4 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model4"
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))

# turn the images into vectors
x = img2vec(imgX)

# median filter and related features
x7a = img2vec(median_Filter(imgX, 5))
x7b = img2vec(median_Filter(imgX, 11))
x7c = img2vec(median_Filter(imgX, 17))
x8 = img2vec(background_Removal(imgX))

dat = data.table(cbind(x, x7a, x7b, x7c, x8))
setnames(dat,c("raw", "median5", "median11", "median17", "background"))

# predicted values
yHat = predict(xgb.mod, newdata=as.matrix(dat))

# constrain the range to be in [0, 1]
yHat = sapply(yHat, function(x) max(min(x, 1),0))

# turn the vector into an image
imgYHat = matrix(yHat, nrow(imgX), ncol(imgX))

# save the predicted value
writePNG(imgYHat, file.path(outModel4, f))
}

Nearby Pixels

The spatial model was developed here, and involves the use of a sliding window and xgboost.

20150904 output 7


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, data.table, xgboost)

# a function that groups together the pixels contained within a sliding window around each pixel of interest
proximalPixels = function(img)
{
 pad = 2
 width = 2 * pad + 1
 padded = matrix(median(img), nrow(img) + 2 * pad, ncol(img) + 2 * pad)
 padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img
 
 tab = matrix(1, nrow(img) * ncol(img), width ^ 2)
 k = 1
 for (i in seq_len(width))
 {
 for (j in seq_len(width))
 {
 tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
 k = k + 1
 }
 }

 return (tab)
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
 return (matrix(img, nrow(img) * ncol(img), 1))
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outPath = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model5.csv"
filenames = list.files(dirtyFolder)
for (f in filenames)
{
 print(f)
 imgX = readPNG(file.path(dirtyFolder, f))
 imgY = readPNG(file.path(cleanFolder, f))
 
 # turn the images into vectors
 #x = img2vec(imgX)
 y = img2vec(imgY)
 
 # surrounding pixels
 x9 = proximalPixels(imgX)

 dat = data.table(cbind(y, x9))
 setnames(dat,append("y", paste("x", 1:25, sep="")))
 write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit an xgboost model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 2000000)
dat[is.na(dat)] = 0
dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
# do cross validation first
xgb.tab = xgb.cv(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = 1000, early.stop.round = 50, nfold = 5, print.every.n = 10)
# what is the best number of rounds?
min.error.idx = which.min(xgb.tab[, test.rmse.mean]) 
# now fit an xgboost model 
xgb.mod = xgboost(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)


# get the predicted result for each image
outModel5 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model5"
for (f in filenames)
{
 print(f)
 imgX = readPNG(file.path(dirtyFolder, f))
 
 # turn the images into vectors
 #x = img2vec(imgX) 
 
 # surrounding pixels
 x9 = proximalPixels(imgX)

 dat = data.table(x9)
 setnames(dat,paste("x", 1:25, sep=""))
 
 # predicted values
 yHat = predict(xgb.mod, newdata=as.matrix(dat))
 
 # constrain the range to be in [0, 1]
 yHat = sapply(yHat, function(x) max(min(x, 1),0))
 
 # turn the vector into an image
 imgYHat = matrix(yHat, nrow(imgX), ncol(imgX))
 
 # save the predicted value
 writePNG(imgYHat, file.path(outModel5, f))
}

Combining the Individual Models

There are many ways that the predictions of the models could be ensembled together. I have used an xgboost model because I want to allow for the complexity of the problem that is being solved.
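For comparison, the simplest possible ensemble is just to average the five predicted images pixel by pixel; a minimal sketch of that (reusing the inPath folder variables and filenames defined in the script below, with outAverage as a hypothetical output folder) looks like this:

# average the five model outputs pixel by pixel
for (f in filenames)
{
 imgs = lapply(c(inPath1, inPath2, inPath3, inPath4, inPath5),
 function(p) readPNG(file.path(p, f)))
 imgYHat = Reduce("+", imgs) / length(imgs)
 writePNG(imgYHat, file.path(outAverage, f))
}

An xgboost ensemble can do better than a straight average because it can weight the models differently and capture interactions between them.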

# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, data.table, xgboost)

# a function to turn a matrix image into a vector
img2vec = function(img)
{
 return (matrix(img, nrow(img) * ncol(img), 1))
}

inPath1 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model1"
inPath2 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model2"
inPath3 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model3"
inPath4 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model4"
inPath5 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model5"

cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outPath = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\stacking.csv"

filenames = list.files(inPath1)
for (f in filenames)
{
 print(f)
 imgX1 = readPNG(file.path(inPath1, f))
 imgX2 = readPNG(file.path(inPath2, f))
 imgX3 = readPNG(file.path(inPath3, f))
 imgX4 = readPNG(file.path(inPath4, f))
 imgX5 = readPNG(file.path(inPath5, f))
 imgY = readPNG(file.path(cleanFolder, f))
 
 # turn the images into vectors
 #x = img2vec(imgX)
 y = img2vec(imgY)
 
 # contributing models
 x1 = img2vec(imgX1)
 x2 = img2vec(imgX2)
 x3 = img2vec(imgX3)
 x4 = img2vec(imgX4)
 x5 = img2vec(imgX5)

 dat = data.table(cbind(y, x1, x2, x3, x4, x5))
 setnames(dat,c("y", "linear", "thresholding", "edges", "backgroundRemoval", "proximal"))
 write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit an xgboost model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 5000000)
dat[is.na(dat)] = 0
dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
# do cross validation first
xgb.tab = xgb.cv(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = 1000, early.stop.round = 50, nfold = 5, print.every.n = 10)
# what is the best number of rounds?
min.error.idx = which.min(xgb.tab[, test.rmse.mean]) 
# now fit an xgboost model 
xgb.mod = xgboost(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)

# get the trained model
model = xgb.dump(xgb.mod, with.stats=TRUE)
# get the feature real names
names = names(dat)[-1]
# compute feature importance matrix
importance_matrix = xgb.importance(names, model=xgb.mod)
# plot the variable importance
gp = xgb.plot.importance(importance_matrix)
print(gp)

# get the predicted result for each image
outModel = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\stacked"
for (f in filenames)
{
 print(f)
 imgX1 = readPNG(file.path(inPath1, f))
 imgX2 = readPNG(file.path(inPath2, f))
 imgX3 = readPNG(file.path(inPath3, f))
 imgX4 = readPNG(file.path(inPath4, f))
 imgX5 = readPNG(file.path(inPath5, f))
 
 # contributing models
 x1 = img2vec(imgX1)
 x2 = img2vec(imgX2)
 x3 = img2vec(imgX3)
 x4 = img2vec(imgX4)
 x5 = img2vec(imgX5)

 dat = data.table(cbind(x1, x2, x3, x4, x5))
 setnames(dat,c("linear", "thresholding", "edges", "backgroundRemoval", "proximal"))
 
 # predicted values
 yHat = predict(xgb.mod, newdata=as.matrix(dat))
 
 # constrain the range to be in [0, 1]
 yHat = sapply(yHat, function(x) max(min(x, 1),0))
 
 # turn the vector into an image
 imgYHat = matrix(yHat, nrow(imgX1), ncol(imgX1))
 
 # save the predicted value
 writePNG(imgYHat, file.path(outModel, f))
}

20150923 output 1
The graph above answers Bobby’s question – when considered together, the nearby pixels provide most of the additional predictive power beyond that of the raw pixel brightness. It seems that the key to separating dark text from noise is to consider how the pixel fits into the local region within the image.
Now that we have set up a stacking model structure, we can recommence the process of adding more image processing and features to the model. And that’s what the next blog will do…


Denoising Dirty Documents: Part 6

07 Monday Sep 2015

Posted by Colin Priest in Image Processing, Kaggle, Machine Learning, Median Filter, R, Variable Importance, XGBoost

≈ 7 Comments

Tags

Image Processing, Kaggle, Machine Learning, Median Filter, R, Variable Importance, XGBoost

So far in this series of blogs we have used image processing techniques to improve the images, and then ensembled together the results of that image processing using GBM or XGBoost. But I have noticed that some competitors have achieved reasonable results using purely machine learning approaches. While these pure machine learning approaches aren't enough to get to the top of the leaderboard, they have outperformed some of the models that I have presented in this series of blogs. However, these scripts were invariably written in Python, and I thought that it would be great to show how to use R to build a similar type of model (except better, because we will include all of the image processing predictors that we have developed so far). So today we will add a brute-force machine learning approach to our model.

At the time of writing this blog, the top-ranked competitor who has shared their script is placed 17th, with an RMSE of 2.6%. While I am not very experienced in Python, I can usually figure out what a Python script is doing if I read it. So here's what I think that script is doing:

  1. pad out each image by an extra 2 pixels (see my last blog for how to pad out an image in R)
  2. run a 3×3 sliding window along the image pixels (see my last blog for how to create a sliding window in R)
  3. use all 9 pixels within the sliding window as predictors for the pixel in the centre of the sliding window
  4. use a random forest model to predict the pixel brightnesses

This is a pure machine learning approach because it doesn't do any image processing to pre-process the image. It simply says that if you want to predict a particular pixel's brightness, then you should look at the brightnesses of the pixels surrounding that pixel.
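To make that concrete, here is a minimal R sketch of that brute-force baseline for a single training image. It assumes the img2vec helper from the earlier posts, and it fits randomForest on a subsample of pixels so that it fits in memory; it is my own illustrative translation, not the competitor's actual script.

# illustrative only: a 3x3 window plus random forest, for one training image
library(png)
library(randomForest)

img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")
imgClean = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned\\3.png")

# pad the dirty image by 1 pixel on each side (2 extra pixels per dimension)
pad = 1
width = 2 * pad + 1
padded = matrix(1, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

# each column of tab is one of the 9 pixels in the 3x3 sliding window
tab = matrix(0, nrow(img) * ncol(img), width ^ 2)
k = 1
for (i in seq_len(width))
{
for (j in seq_len(width))
{
tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
k = k + 1
}
}

# fit a random forest on a subsample of pixels to keep the memory usage down
y = img2vec(imgClean)
rows = sample(length(y), 20000)
rf.mod = randomForest(x = tab[rows,], y = y[rows], ntree = 50)
yHat = predict(rf.mod, tab)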

canstockphoto16547059

While I don’t want to fuel the R versus Python language wars, I do want to create a model in R that can outperform my competitors. So instead of using a 3×3 sliding window, I will use a 5×5 sliding window. Because the random forest implementation in R tends to run out of RAM on my computer, I will use XGBoost.

More_Power_HomeImprovementO

Here is how I create a 5×5 sliding window.


# read in a sample image on which to build the sliding window
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")

# create a padded image within which we will embed our source image
pad = 2
width = 2 * pad + 1
padded = matrix(1, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

# create a matrix of predictor values - each column is a pixel from the sliding window
tab = NULL
for (i in seq_len(width))
{
for (j in seq_len(width))
{
if (i == 1 && j == 1)
{
tab = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
} else {
tab = cbind(tab, img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))]))
}
}
}

head(tab[,1:4])

20150904 output 1

When I noticed that this script ran a bit slowly, I looked at how I wrote the loops. R does not like manual looping, and it is very inefficient at appending data. So I rewrote the script to pre-allocate space for all of the cells rather than appending each column as it was calculated.


pad = 2
width = 2 * pad + 1
padded = matrix(1, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

tab = matrix(1, nrow(img) * ncol(img), width ^ 2)
k = 1
for (i in seq_len(width))
{
for (j in seq_len(width))
{
tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
k = k + 1
}
}

This modification gave me a tenfold improvement in speed! It’s a reminder that I shouldn’t be lazy when writing R scripts.
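If you want to see the effect for yourself, a quick synthetic comparison of the two patterns (not the exact script above, just append-a-column versus pre-allocate) looks something like this:

# appending one column at a time forces R to copy the growing matrix repeatedly
system.time({
tab = NULL
for (k in seq_len(25)) tab = cbind(tab, rnorm(140000))
})

# pre-allocating the full matrix avoids all of that copying
system.time({
tab = matrix(0, 140000, 25)
for (k in seq_len(25)) tab[,k] = rnorm(140000)
})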

There’s one small issue to consider before we put this all together for a new more powerful predictive model. In the code above I have used a brightness value of 1 when padding out the image, but that doesn’t look natural. The image below shows a clear white border around the image. This could confuse the machine learning algorithm, as it could waste time learning what to do with pure white pixels.

20150904 output 2

So instead it makes sense to pad out the image with a background brightness. One way to do this is to replicate the edge of the original image.


pad = 2
width = 2 * pad + 1
padded = matrix(1, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img
# fill in the padded edges
padded[,1:pad] = padded[,pad + 1]
padded[,ncol(img) + pad + 1:pad] = padded[,ncol(padded) - pad]
# wrap the replicated rows in matrix() so that they repeat row-wise rather than being recycled column-wise
padded[1:pad,] = matrix(padded[pad + 1,], pad, ncol(padded), byrow=TRUE)
padded[nrow(img) + pad + 1:pad,] = matrix(padded[nrow(padded) - pad,], pad, ncol(padded), byrow=TRUE)

20150904 output 3

This looks more natural, except where there is writing at the edge of the image. Another way is to pad out the pixels using the median of the entire image.


pad = 2
width = 2 * pad + 1
padded = matrix(median(img), nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

20150904 output 4

I haven’t tested which approach works best. So I leave it to the reader to compare the results from the three different padding approaches, and use whichever gives the best result (although I suspect that there won’t be much difference between the second and third approaches).

Let’s pull it all together, and add the surrounding pixels as a predictor to a more complete model.


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, data.table, gbm, foreach, doSNOW, biOps, xgboost, Ckmeans.1d.dp)

if (!require("EBImage"))
{
source("http://bioconductor.org/biocLite.R")
biocLite("EBImage")
}

# a function to do k-means thresholding
kmeansThreshold = function(img)
{
# fit 3 clusters
v = img2vec(img)
km.mod = kmeans(v, 3)
# allow for the random ordering of the clusters
oc = order(km.mod$centers)
# the higher threshold is the halfway point between the top of the middle cluster and the bottom of the highest cluster
hiThresh = 0.5 * (max(v[km.mod$cluster == oc[2]]) + min(v[km.mod$cluster == oc[3]]))

# using upper threshold
imgHi = v
imgHi[imgHi <= hiThresh] = 0
imgHi[imgHi > hiThresh] = 1

return (imgHi)
}

# the image processing helpers from the earlier posts in this series are reused
# here: adaptiveThresholding, cannyEdges, cannyDilated1, cannyDilated2 and
# median_Filter are all defined in those posts and are called unchanged below

# a function that removes the background by subtracting a median-filtered
# version of the image, keeping only the pixels darker than the background
background_Removal = function(img)
{
w = 5
# the background is estimated with a small median filter (width 5 here)
background = median_Filter(img, w)
# the foreground is whatever is darker than the background
foreground = img - background
foreground[foreground > 0] = 0
# rescale the foreground to the range [0, 1]
m1 = min(foreground)
m2 = max(foreground)
foreground = (foreground - m1) / (m2 - m1)

return (matrix(foreground, nrow(img), ncol(img)))
}

# a function that groups together the pixels contained within a sliding window around each pixel of interest
proximalPixels = function(img)
{
pad = 2
width = 2 * pad + 1
padded = matrix(median(img), nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

tab = matrix(1, nrow(img) * ncol(img), width ^ 2)
k = 1
for (i in seq_len(width))
{
for (j in seq_len(width))
{
tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
k = k + 1
}
}

return (tab)
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_predicted"

outPath = file.path(outFolder, "trainingdata_blog6.csv")
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))
imgY = readPNG(file.path(cleanFolder, f))

# turn the images into vectors
x = matrix(imgX, nrow(imgX) * ncol(imgX), 1)
y = matrix(imgY, nrow(imgY) * ncol(imgY), 1)

# threshold the image
x2 = kmeansThreshold(imgX)

# adaptive thresholding
x3 = img2vec(adaptiveThresholding(imgX))

# canny edge detector and related features
x4 = img2vec(cannyEdges(imgX))
x5 = img2vec(cannyDilated1(imgX))
x6 = img2vec(cannyDilated2(imgX))

# median filter and related features
x7 = img2vec(median_Filter(imgX, 17))
x8 = img2vec(background_Removal(imgX))

# surrounding pixels
x9 = proximalPixels(imgX)

dat = data.table(cbind(y, x, x2, x3, x4, x5, x6, x7, x8, x9))
setnames(dat,append(c("y", "raw", "thresholded", "adaptive", "canny", "cannyDilated1", "cannyDilated2", "median17", "backgroundRemoval"), paste("x", 1:25, sep="")))
write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit an xgboost model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 2000000)
dat[is.na(dat)] = 0
dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
# do cross validation first
xgb.tab = xgb.cv(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = 1000, early.stop.round = 50, nfold = 5, print.every.n = 10)
# what is the best number of rounds?
min.error.idx = which.min(xgb.tab[, test.rmse.mean])
# now fit an xgboost model
xgb.mod = xgboost(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)

# compute and plot the feature importance
importance_matrix = xgb.importance(names(dat)[-1], model=xgb.mod)
gp = xgb.plot.importance(importance_matrix)
print(gp)

# show the predicted result for a sample image
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")
x = img2vec(img)
x2 = kmeansThreshold(img)
x3 = img2vec(adaptiveThresholding(img))
x4 = img2vec(cannyEdges(img))
x5 = img2vec(cannyDilated1(img))
x6 = img2vec(cannyDilated2(img))
x7 = img2vec(median_Filter(img, 17))
x8 = img2vec(background_Removal(img))
x9 = proximalPixels(img)
dat = data.table(cbind(x, x2, x3, x4, x5, x6, x7, x8, x9))
setnames(dat,append(c("raw", "thresholded", "adaptive", "canny", "cannyDilated1", "cannyDilated2", "median17", "backgroundRemoval"), paste("x", 1:25, sep="")))

# predicted values, constrained to the range [0, 1]
yHatImg = predict(xgb.mod, newdata=as.matrix(dat))
yHatImg[yHatImg < 0] = 0
yHatImg[yHatImg > 1] = 1
imgOut = matrix(yHatImg, nrow(img), ncol(img))
writePNG(imgOut, "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")
plot(raster(imgOut))

20150904 output 5

None of the individual pixels that we added has strong predictive power on its own, yet our RMSE on the training data dropped from 2.4% for the last blog's model to 1.4% for this model! That's because it is the combination of nearby pixels that is predictive, not any individual pixel. I haven't scored this model on the test set, but I suspect that it will get you a good ranking.
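As a sanity check, the training RMSE can be computed directly from the stacked predictions, assuming y and yHat hold the actual and predicted brightnesses for all of the training pixels:

# root mean squared error of the predicted brightnesses
rmse = sqrt(mean((yHat - y) ^ 2))
print(rmse)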
20150904 output 6

There is hardly any trace of the coffee cup stain remaining. This is a pretty good candidate model.

In order to understand the effect of the nearby pixels, it is useful to visualise their variable importance in a grid.

pixelNames = paste("x", 1:25, sep="")
pixelNames[13] = "raw"
grid = t(matrix(sapply(1:25, FUN = function(x) importance_matrix$Gain[importance_matrix$Feature == pixelNames[x]]), 5, 5))
grid[3, 3] = NA
plot(raster(grid))

20150904 output 7
This graphic shows that we didn't need all of the surrounding pixels to create a good predictive model. The pixels that don't lie on the same row or column as the target pixel aren't as important, so if we wanted to expand beyond the 3×3 sliding window used by our competitors, we wouldn't need to add all of the extra pixels. We could just add the pixels at (1, 3), (3, 1), (3, 5) and (5, 3).
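To act on that observation, a cut-down version of proximalPixels could keep only the offsets on the centre row and centre column of the 5×5 window. crossPixels below is a hypothetical variant of my own, not something that I tested for this post.

# a hypothetical 'cross-shaped' window: only the centre row and centre column offsets
crossPixels = function(img, pad = 2)
{
padded = matrix(median(img), nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

# window offsets along the centre row and the centre column only
offsets = unique(rbind(cbind(seq_len(2 * pad + 1), pad + 1), cbind(pad + 1, seq_len(2 * pad + 1))))

tab = matrix(1, nrow(img) * ncol(img), nrow(offsets))
for (k in seq_len(nrow(offsets)))
{
i = offsets[k, 1]
j = offsets[k, 2]
tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
}

return (tab)
}

This gives 9 predictors per pixel instead of 25, which should train noticeably faster.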

The model that we developed in this blog is pushing the boundaries of what my PC can do. Without access to very powerful computers and / or a cluster, we need to use a different approach if we want to improve on this blog’s model.

