Keeping Up With The Latest Techniques

~ brief insights


Tutorial: Sentiment Analysis of Airlines Using the syuzhet Package and Twitter

Sunday, 30 Apr 2017

Posted by Colin Priest in R, Sentiment Analysis, Social Media, Text Mining, Twitter


In my last job, I was a frequent flyer. Each week I flew between 2 or 3 countries, briefly returning for 24 hours on the weekend to get a change of clothes. My favourite airlines were Cathay Pacific, Emirates and Singapore Air. Now, unless you have been living in a cave, you’d be well aware of the recent news story of how United Airlines removed David Dao from an aircraft. I wondered how that incident had affected United’s brand value, and being a data scientist I decided to do sentiment analysis of United versus my favourite airlines.

Way back on 4th July 2015, almost two years ago, I wrote a blog entitled Tutorial: Using R and Twitter to Analyse Consumer Sentiment. Even though that blog post is one of my earliest, it continues to be the most popular, attracting just as many readers per day as when I first wrote it.

Since the sentiment package, upon which that blog was based, is no longer supported by CRAN, and many readers have had problems with the manual and technical process of installing an obsolete package from an archive, I have written a new blog using a different, live CRAN package. The syuzhet package was published only a few weeks ago, and offers a range of different sentiment analysis models. So I’ve started to try it out.

I have collected tweets for 4 airlines:

  1. Cathay Pacific
  2. Emirates
  3. Singapore Air
  4. United Airlines

The tweet data starts on 01-Jan-2015 and goes up to mid-April 2017.

Step 1: Load the tweets and load the relevant packages

library(foreign)
library(syuzhet)
library(lubridate)
library(plyr)
library(ggplot2)
library(tm)
library(wordcloud)

# get the data for the tweets
dataURL = 'https://s3-ap-southeast-1.amazonaws.com/colinpriest/tweets.zip'
if (! file.exists('tweets.zip')) download.file(dataURL, 'tweets.zip')
if (! file.exists('tweets.dbf')) unzip('tweets.zip')
tweets = read.dbf('tweets.dbf', as.is = TRUE)

 

I’ve stored the tweets in a dbf file and zipped it. The zip file is 68MB in size, and the dbf file is 353MB. The code shown above downloads the zip file, extracts the dbf and then reads the dbf file into a data.frame.

Step 2: Do Sentiment Scoring using the syuzhet package


# function to get various sentiment scores, using the syuzhet package
scoreSentiment = function(tab)
{
 tab$syuzhet = get_sentiment(tab$Text, method="syuzhet")
 tab$bing = get_sentiment(tab$Text, method="bing")
 tab$afinn = get_sentiment(tab$Text, method="afinn")
 tab$nrc = get_sentiment(tab$Text, method="nrc")
 emotions = get_nrc_sentiment(tab$Text)
 n = names(emotions)
 for (nn in n) tab[, nn] = emotions[nn]
 return(tab)
}

# get the sentiment scores for the tweets
tweets = scoreSentiment(tweets)
tweets = tweets[tweets$TimeStamp < as.Date('19-04-2017', format = '%d-%m-%Y'),]

The syuzhet package offers a few different algorithms, each taking a different approach to sentiment scoring. It also does emotion scoring based upon the nrc algorithm. The code above calculates scores using the syuzhet, bing, afinn and nrc algorithms, adding columns with the scores from each algorithm.

Step 3: Visualise the Sentiment Scores


# function to find the week in which a date occurs
round_weeks <- function(x)
{
require(data.table)
dt = data.table(i = 1:length(x), day = x, weekday = weekdays(x))
offset = data.table(weekday = c('Sunday', 'Monday', 'Tuesday', 'Wednesday',
'Thursday', 'Friday', 'Saturday'),
offset = -(0:6))
dt = merge(dt, offset, by="weekday")
dt[ , day_adj := day + offset]
setkey(dt, i)
return(dt[ , day_adj])
}
# compute the week in which each tweet occurred (used for the weekly summaries below)
tweets$week = round_weeks(tweets$TimeStamp)

# get daily summaries of the results
daily = ddply(tweets, ~ Airline + TimeStamp, summarize, num_tweets = length(positive), ave_sentiment = mean(bing),
 ave_negative = mean(negative), ave_positive = mean(positive), ave_anger = mean(anger))

# plot the daily sentiment
ggplot(daily, aes(x=TimeStamp, y=ave_sentiment, colour=Airline)) + geom_line() +
 ggtitle("Airline Sentiment") + xlab("Date") + ylab("Sentiment") + scale_x_date(date_labels = '%d-%b-%y')

# get weekly summaries of the results
weekly = ddply(tweets, ~ Airline + week, summarize, num_tweets = length(positive), ave_sentiment = mean(bing),
 ave_negative = mean(negative), ave_positive = mean(positive), ave_anger = mean(anger))

# plot the weekly sentiment
ggplot(weekly, aes(x=week, y=ave_sentiment, colour=Airline)) + geom_line() +
 ggtitle("Airline Sentiment") + xlab("Date") + ylab("Sentiment") + scale_x_date(date_labels = '%d-%b-%y')

The code above summarises the sentiment for each airline across time. The first plot shows the daily sentiment values for each airline:

[Plot: daily sentiment by airline]

Based upon the bing sentiment algorithm, United has the poorest sentiment, and Singapore has the best sentiment. United usually has negative sentiment. Day-to-day random fluctuations in sentiment make this a cluttered graph, so I decided to summarise the sentiment weekly instead of daily:

[Plot: weekly sentiment by airline]

Now it’s easier to see the differences in sentiment between the four airlines. While Emirates and Cathay Pacific have similar levels of sentiment, the values for Emirates are more stable. This, however, may be due to the sheer volume of tweets about Emirates versus the smaller number of tweets about Cathay Pacific.
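A quick way to check that volume hypothesis is to count the tweets for each airline, along these lines:

# total number of tweets collected for each airline
table(tweets$Airline)
# or broken down by week
ddply(tweets, ~ Airline + week, summarize, num_tweets = length(Text))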


Step 4: Compare the Sentiment Algorithms

The sentiment scores above use the bing algorithm, but we should check whether the different algorithms produce different results.


# compare the sentiment for across the algorithms
algorithms = tweets[rep(1, nrow(tweets) * 4), c("week", "syuzhet", "Airline", "Airline")]
names(algorithms) = c("TimeStamp", "Sentiment", "Algorithm", "Airline")
algorithms$Algorithm = "syuzhet"
algorithms[seq_len(nrow(tweets)), c("TimeStamp", "Sentiment", "Airline")] = tweets[,c("TimeStamp", "syuzhet", "Airline")]
algorithms[nrow(tweets) + seq_len(nrow(tweets)), c("TimeStamp", "Sentiment", "Airline")] = tweets[,c("TimeStamp", "bing", "Airline")]
algorithms$Algorithm[nrow(tweets) + seq_len(nrow(tweets))] = "bing"
algorithms[2 * nrow(tweets) + seq_len(nrow(tweets)), c("TimeStamp", "Sentiment", "Airline")] = tweets[,c("TimeStamp", "afinn", "Airline")]
algorithms$Algorithm[2 * nrow(tweets) + seq_len(nrow(tweets))] = "afinn"
algorithms[3 * nrow(tweets) + seq_len(nrow(tweets)), c("TimeStamp", "Sentiment", "Airline")] = tweets[,c("TimeStamp", "nrc", "Airline")]
algorithms$Algorithm[3 * nrow(tweets) + seq_len(nrow(tweets))] = "nrc"

# get the algorithm averages for each airline
averages = ddply(algorithms, ~ Airline + Algorithm, summarize, ave_sentiment = mean(Sentiment))
averages$ranking = 1
for (alg in c("syuzhet", "bing", "afinn", "nrc")) averages$ranking[averages$Algorithm == alg] = 5 - rank(averages$ave_sentiment[averages$Algorithm == alg])
averages = averages[order(averages$Airline, averages$Algorithm), ]

The code above was a bit clumsy – I probably should have used reshape.
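For what it’s worth, here is a sketch of how the same long table could be built more cleanly with the reshape2 package (illustrative only, not what I actually ran):

library(reshape2)
# melt the four algorithm columns into a single Sentiment column, labelled by Algorithm
algorithms2 = melt(tweets[, c("TimeStamp", "Airline", "syuzhet", "bing", "afinn", "nrc")],
                   id.vars = c("TimeStamp", "Airline"),
                   variable.name = "Algorithm", value.name = "Sentiment")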

[Plot: average sentiment by airline for each algorithm]

The different algorithms give similar rankings between the airlines, with one big exception: the nrc algorithm is surprisingly positive about United and unusually negative about Singapore Air compared to the other algorithms. This goes to show that sentiment analysis isn’t just a plug-and-play technique. It also means that a caveat applies to the emotion analysis shown in Step 5 below, because it is based upon the nrc algorithm!
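If you want to see those rankings side by side, a one-liner along these lines (using reshape2 and the averages data frame from above) lays them out with one row per airline:

library(reshape2)
# one row per airline, one column per algorithm, showing each algorithm's ranking of that airline
dcast(averages, Airline ~ Algorithm, value.var = "ranking")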

Step 5: Emotion Analysis

Noting the warning from the previous section, let’s compare the emotions between the airlines and between tweets, using the nrc algorithm.


ggplot(weekly, aes(x=week, y=ave_negative, colour=Airline)) + geom_line() +
ggtitle("Airline Sentiment (Negative Only)") + xlab("Date") + ylab("Sentiment") + scale_x_date(date_labels = '%d-%b-%y')

ggplot(weekly, aes(x=week, y=ave_positive, colour=Airline)) + geom_line() +
ggtitle("Airline Sentiment (Positive Only)") + xlab("Date") + ylab("Sentiment") + scale_x_date(date_labels = '%d-%b-%y')

ggplot(weekly, aes(x=week, y=ave_anger, colour=Airline)) + geom_line() +
ggtitle("Airline Sentiment (Anger Only)") + xlab("Date") + ylab("Sentiment") + scale_x_date(date_labels = '%d-%b-%y')

# function to make the text suitable for analysis
clean.text = function(x)
{
# tolower
x = tolower(x)
# remove retweet marker (note: this also strips "rt" wherever it appears inside a word)
x = gsub("rt", "", x)
# remove at
x = gsub("@\\w+", "", x)
# remove punctuation
x = gsub("[[:punct:]]", "", x)
# remove numbers
x = gsub("[[:digit:]]", "", x)
# remove links http
x = gsub("http\\w+", "", x)
# remove runs of spaces and tabs
x = gsub("[ |\t]{2,}", "", x)
# remove blank spaces at the beginning
x = gsub("^ ", "", x)
# remove blank spaces at the end
x = gsub(" $", "", x)
return(x)
}

# emotion analysis: anger, anticipation, disgust, fear, joy, sadness, surprise, trust
# put everything in a single vector
all = c(
paste(tweets$Text[tweets$anger > 0], collapse=" "),
paste(tweets$Text[tweets$anticipation > 0], collapse=" "),
paste(tweets$Text[tweets$disgust > 0], collapse=" "),
paste(tweets$Text[tweets$fear > 0], collapse=" "),
paste(tweets$Text[tweets$joy > 0], collapse=" "),
paste(tweets$Text[tweets$sadness > 0], collapse=" "),
paste(tweets$Text[tweets$surprise > 0], collapse=" "),
paste(tweets$Text[tweets$trust > 0], collapse=" ")
)
# clean the text
all = clean.text(all)
# remove stop-words
# adding extra domain specific stop words
all = removeWords(all, c(stopwords("english"), 'singapore', 'singaporeair',
'emirates', 'united', 'airlines', 'unitedairlines',
'cathay', 'pacific', 'cathaypacific', 'airline',
'airlinesunited', 'emiratesemirates', 'pacifics'))
#
# create corpus
corpus = Corpus(VectorSource(all))
#
# create term-document matrix
tdm = TermDocumentMatrix(corpus)
#
# convert as matrix
tdm = as.matrix(tdm)
#
# add column names
colnames(tdm) = c('anger', 'anticipation', 'disgust', 'fear', 'joy', 'sadness', 'surprise', 'trust')
#
# Plot comparison wordcloud
layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(0, 4))
plot.new()
text(x=0.5, y=0.5, 'Emotion Comparison Word Cloud')
comparison.cloud(tdm, random.order=FALSE,
colors = c("#00B2FF", "red", "#FF0099", "#6600CC", "green", "orange", "blue", "brown"),
title.size=1.5, max.words=250)


The code above plots the emotions across time for each airline.

[Plot: weekly anger by airline]
[Plot: weekly positive sentiment by airline]
[Plot: weekly negative sentiment by airline]

United Airlines attracts more angry tweets, and this has spiked in April 2017 following the David Dao incident. But United Airlines also attracts more positive tweets than the other airlines. This might explain the ranking differences between the algorithms – maybe the algorithms weight positive tweets differently to negative tweets.
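One rough way to test that idea is to split the tweets by whether the nrc emotion counts flag them as more positive than negative, and then compare each algorithm’s average score on the two groups. A sketch:

# flag tweets where the nrc positive count exceeds the nrc negative count
tweets$mostlyPositive = tweets$positive > tweets$negative
# average score from each algorithm within the two groups
ddply(tweets, ~ mostlyPositive, summarize,
      syuzhet = mean(syuzhet), bing = mean(bing), afinn = mean(afinn), nrc = mean(nrc))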

Then the code creates a comparison word cloud, to show the different words in airline tweets that are associated with each emotion.

[Plot: emotion comparison word cloud]

Step 6: Compare the Different Tweeting Behaviour of Different Twitter Users

Are some users more positive than others? Is this user behaviour different between the airlines? Do people who tweet more have a different sentiment to those who tweet about airlines less frequently? Are particular users dragging the average up or down? To answer these questions, I have tracked the 100 users who tweeted the most about these airlines.


# get the user summaries of the results
users = ddply(tweets, ~ Airline + UserName, summarize, num_tweets = length(positive), ave_sentiment = mean(bing),
ave_negative = mean(negative), ave_positive = mean(positive), ave_anger = mean(anger))
sizeSentiment = ddply(users, ~ num_tweets, summarize, ave_sentiment = mean(ave_sentiment),
ave_negative = mean(ave_negative), ave_positive = mean(ave_positive), ave_anger = mean(ave_anger))
sizeSentiment$num_tweets = as.numeric(sizeSentiment$num_tweets)

# plot users positive versus negative with bubble plot
cutoff = sort(users$num_tweets, decreasing = TRUE)[100]
ggplot(users[users$num_tweets > cutoff,], aes(x = ave_positive, y = ave_negative, size = num_tweets, fill = Airline)) +
geom_point(shape = 21) +
ggtitle("100 Most Prolific Tweeters About Airlines") +
labs(x = "Positive Sentiment", y = "Negative Sentiment")
#
ggplot(sizeSentiment, aes(x = num_tweets, y = ave_sentiment)) + geom_point() + stat_smooth(method = "loess", size = 1, span = 0.35) +
ggtitle("Number of Tweets versus Sentiment") + scale_x_log10() +
labs(x = "Number of Tweets", y = "Average Sentiment")

Firstly let’s look at the behaviour of individual users:

[Plot: positive versus negative sentiment for the 100 most prolific tweeters]

Top-user sentiment differs quite a bit by airline. Emirates has a number of frequent tweeters who are unemotional, posting content that on average is neither positive nor negative. United Airlines attracts more emotional posts. Singapore Air and Cathay Pacific each have a few heavy users who post a large number of tweets about them.

[Plot: average sentiment versus number of tweets per user]

However, on average the most frequent tweeters post a similar balance of positive and negative content to users who tweet about these airlines only occasionally.

Step 7: Compare The Words Used to Describe Each Airline

In order to explain the differences in sentiment, we can create a word cloud that contrasts the words used in posts about each airline.


# Join texts in a vector for each company
txt1 = paste(tweets$Text[tweets$Airline == 'United'], collapse=" ")
txt2 = paste(tweets$Text[tweets$Airline == 'SingaporeAir'], collapse=" ")
txt3 = paste(tweets$Text[tweets$Airline == 'Emirates'], collapse=" ")
txt4 = paste(tweets$Text[tweets$Airline == 'Cathay Pacific'], collapse=" ")
#
# put everything in a single vector
all = c(clean.text(txt1), clean.text(txt2), clean.text(txt3), clean.text(txt4))
#
# remove stop-words
# adding extra domain specific stop words
all = removeWords(all, c(stopwords("english"), 'singapore', 'singaporeair',
'emirates', 'united', 'airlines', 'unitedairlines',
'cathay', 'pacific', 'cathaypacific', 'airline',
'airlinesunited', 'emiratesemirates', 'pacifics'))
#
# create corpus
corpus = Corpus(VectorSource(all))
#
# create term-document matrix
tdm = TermDocumentMatrix(corpus)
#
# convert as matrix
tdm = as.matrix(tdm)
#
# add column names
colnames(tdm) = c('United', 'Singapore Air', 'Emirates', 'Cathay Pacific')
#
# Plot comparison wordcloud
layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(0, 4))
plot.new()
text(x=0.5, y=0.5, 'Word Comparison by Airline')
comparison.cloud(tdm, random.order=FALSE,
colors = c("#00B2FF", "red", "#FF0099", "#6600CC"),
title.size=1.5, max.words=250)
#
# Plot commonality cloud
layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(0, 4))
plot.new()
text(x=0.5, y=0.5, 'Word Commonality by Airline')
commonality.cloud(tdm, random.order=FALSE,
colors = brewer.pal(8, "Dark2"),
title.size=1.5, max.words=250)

The code above is quite similar to that in the previous step, except that this time we are comparing airlines instead of emotions.

[Plot: word comparison cloud by airline]
[Plot: word commonality cloud across airlines]

Emirates includes “aniston”, presumably in reference to the marketing campaign involving Jennifer Aniston, while United includes “CEO” due to a number of news stories about United’s CEOs, including a resignation and a heart transplant.

 

 


Second Annual Data Science Bowl – Part 2

Monday, 07 Mar 2016

Posted by Colin Priest in Automation, Image Processing, Kaggle, Medical Imaging, R


In Part 1 of this blog series, I described how to fix the brightness and contrast of the MRI images. In this blog we finish cleaning up the input data.

Other than brightness and contrast, we need to fix up the following problems:

  • different image sizes
  • different image rotations – portrait versus landscape
  • different pixel spacing
  • different short axis slice spacing
  • image sets from duplicate locations
  • sax sets aren’t in the same order as their locations
  • some sax image sets have multiple locations

Different Image Sizes and Rotations

There are approximately a dozen different image sizes, including rotated images. Not all of the image sizes scale to the same 4:3 aspect ratio that is the most common across the training set. Some of the machine learning algorithms I used later need fixed-dimension images, so I compromised and decided to use a standard size of 256 x 192 pixels in portrait aspect ratio. This meant that I wasn’t always scaling the x-axis and the y-axis by the same amount, and occasionally I was even upscaling an image.


library(pacman)
pacman::p_load(EBImage)
rescaleImage = function(img)
{
 imgOut = img
 # check for landscape aspect ratio and correct
 if (nrow(img) > ncol(img)) imgOut = t(img)
 imgOut = resize(imgOut, 256, 192) # standardise image size to 256 x 192
 return (imgOut)
}


Note that I used matrix transpose to rotate the image 90 degrees. I could also have used the rotate function in EBImage. Either way could work, but I felt that matrix transpose would be a faster operation.
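For completeness, the EBImage version would look something like this sketch (I haven’t used it in my pipeline, and note that rotate() re-samples the image rather than simply transposing it):

library(EBImage)
rescaleImage2 = function(img)
{
 imgOut = img
 # rotate landscape images by 90 degrees instead of transposing them
 if (nrow(img) > ncol(img)) imgOut = rotate(img, 90)
 imgOut = resize(imgOut, 256, 192) # standardise image size to 256 x 192
 return (imgOut)
}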

Image Locations and Spacing

The DICOM images contain information about image slice location, pixel spacing (how far apart are adjacent pixel centroids), and slice spacing (how far apart are adjacent sax slices).


library(pacman)
pacman::p_load(oro.dicom)

# function to extract the dicom header info
getDicomHeaderInfo = function(path)
{
 img = readDICOMFile(path)
 width = ncol(img$img)
 height = nrow(img$img)
 headers = img$hdr
 patientID = extractHeader(img$hdr, 'PatientID', numeric = TRUE)
 patientAge = as.integer(substr(extractHeader(img$hdr, 'PatientsAge', numeric = FALSE), 1, 3))
 patientGender = as.character(extractHeader(img$hdr, 'PatientsSex', numeric = FALSE))
 ps = extractHeader(img$hdr, 'PixelSpacing', numeric = FALSE)
 pixelSpacingX = as.numeric(unlist(strsplit(ps, ' '))[1])
 pixelSpacingY = as.numeric(unlist(strsplit(ps, ' '))[2])
 seriesNum = extractHeader(img$hdr, 'SeriesNumber', numeric = TRUE)
 location = round(extractHeader(img$hdr, 'SliceLocation', numeric = TRUE), digits = 1)
 pathSplit = unlist(strsplit(path, '/', fixed = TRUE))
 filename = pathSplit[length(pathSplit)]
 prefix = unlist(strsplit(filename, '.', fixed = TRUE))[1]
 frame = as.integer(unlist(strsplit(prefix, '-'))[3])
 sliceNum = 0
 if (length(unlist(strsplit(prefix, '-'))) > 3) sliceNum = as.integer(unlist(strsplit(prefix, '-'))[4])
 time = extractHeader(img$hdr, 'InstanceCreationTime', numeric = FALSE)
 #
 return (list(
 id = patientID,
 age = patientAge,
 gender = patientGender,
 pixelSpacingX = pixelSpacingX,
 pixelSpacingY = pixelSpacingY,
 width = width,
 height = height,
 series = seriesNum,
 location = location,
 frame = frame,
 sliceNum = sliceNum,
 time = time,
 path = path
 ))
}

I’m not a medical expert. So when there are images from duplicate locations, I don’t know which image sets are the best to use. Therefore I just assumed that the medical specialist repeated the MRI scans until the image quality was adequate. This meant that I chose the image set with the latest time stamps (from the header information inside the DICOM files, not the file system date).

Then I just

  1. searched for DICOM images in all of the subfolders for each patient
  2. filtered to use only sax slices, ignoring those images that did not come from a folder that contained the substring “sax”
  3. read the DICOM header information from each image to find its location and time
  4. searched for the latest time stamp for each location and kept that image
  5. wrote out a new folder structure where sax_01, sax_02, … were the sax slice image sets for that patient, ordered by their location along the long axis of the heart

The R script to do this isn’t too difficult. I’ve pasted it below:


# this script creates a cleaner version of the image data
# plus an extract of the file headers
# 1) removes duplicate images
# 2) reorders images by location

# read a poor image and translate the pixel brightnesses
rebalanceImage = function(badImage)
{
v = matrix(badImage, nrow(badImage) * ncol(badImage))
o2 = order(v)
vIn = sample(vAll, nrow(badImage) * ncol(badImage))
vIn = vIn[order(vIn)]
v2 = v
v2[o2] = vIn
cleanImage = matrix(v2, nrow(badImage), ncol(badImage))
return (cleanImage)
}

# check whether a folder exists and make it if it doesn't exist
checkFolder = function(mainDir, subDir)
{
if(!dir.exists(file.path(mainDir, subDir)))
dir.create(file.path(mainDir, subDir))
}

# function to extract the dicom header info
getDicomHeaderInfo = function(path)
{
img = readDICOMFile(path)
width = ncol(img$img)
height = nrow(img$img)
headers = img$hdr
patientID = extractHeader(img$hdr, 'PatientID', numeric = TRUE)
patientAge = as.integer(substr(extractHeader(img$hdr, 'PatientsAge', numeric = FALSE), 1, 3))
patientGender = as.character(extractHeader(img$hdr, 'PatientsSex', numeric = FALSE))
ps = extractHeader(img$hdr, 'PixelSpacing', numeric = FALSE)
pixelSpacingX = as.numeric(unlist(strsplit(ps, ' '))[1])
pixelSpacingY = as.numeric(unlist(strsplit(ps, ' '))[2])
seriesNum = extractHeader(img$hdr, 'SeriesNumber', numeric = TRUE)
location = round(extractHeader(img$hdr, 'SliceLocation', numeric = TRUE), digits = 1)
pathSplit = unlist(strsplit(path, '/', fixed = TRUE))
filename = pathSplit[length(pathSplit)]
prefix = unlist(strsplit(filename, '.', fixed = TRUE))[1]
frame = as.integer(unlist(strsplit(prefix, '-'))[3])
sliceNum = 0
if (length(unlist(strsplit(prefix, '-'))) > 3) sliceNum = as.integer(unlist(strsplit(prefix, '-'))[4])
time = extractHeader(img$hdr, 'InstanceCreationTime', numeric = FALSE)
#
return (list(
id = patientID,
age = patientAge,
gender = patientGender,
pixelSpacingX = pixelSpacingX,
pixelSpacingY = pixelSpacingY,
width = width,
height = height,
series = seriesNum,
location = location,
frame = frame,
sliceNum = sliceNum,
time = time,
path = path
))
}
############################################################################################################################

library(pacman)
pacman::p_load(oro.dicom, raster, data.table, png, flexclust, foreach, doParallel, snowfall)

# whether this run is for the training set (FALSE) or the validation set (TRUE)
useValidation = FALSE
#useValidation = TRUE

# create a benchmark histogram from an exemplar image
dicomBenchmark = readDICOM('/home/colin/data/Second-Annual-Data-Science-Bowl/train/1/study/sax_13')
images = dicomBenchmark[[2]]
img = images[[1]]
vAll = unname(unlist(images))
vAll = vAll[order(vAll)]

# loop through all of the patients
rootFolder = '/home/colin/data/Second-Annual-Data-Science-Bowl/train'
outFolder = '/home/colin/data/Second-Annual-Data-Science-Bowl/train-cleaned'
if (useValidation)
{
rootFolder = '/home/colin/data/Second-Annual-Data-Science-Bowl/validate'
outFolder = '/home/colin/data/Second-Annual-Data-Science-Bowl/validate-cleaned'
}
cases = list.dirs(rootFolder, recursive=FALSE)
#cases = cases[grep('123', cases)]
simpleData = matrix(0, 500, 3)
simpleFeatures = data.frame(id = rep(0, 500), age = rep(0, 500), gender = rep('U', 500),
pixelSpacingX = integer(500), pixelSpacingY = integer(500),
width = integer(500), height = integer(500),
numSlices = rep(0, 500), stringsAsFactors = FALSE)
allImages = NULL

sfInit(parallel=TRUE, cpus=14)
sfLibrary(oro.dicom)

for (patient in cases)
{
patientFolder = paste0(patient, '/study')
imgSequences = list.dirs(patientFolder, recursive=FALSE)
# filter for 'sax' folders
imgSequences = imgSequences[grep('sax', imgSequences, fixed = TRUE)]
nMax = 10000
imgTable = data.table(id = integer(nMax), age = integer(nMax), gender = character(nMax),
pixelSpacingX = integer(nMax), pixelSpacingY = integer(nMax),
width = integer(nMax), height = integer(nMax),
location = numeric(nMax), frame = integer(nMax), series = integer(nMax),
sliceNum = integer(nMax), time = numeric(nMax),
path = character(nMax))
for (imgFolder in imgSequences)
{
# find the first file in imgs
imgFiles = list.files(imgFolder)
if (length(imgFiles) < 30)
{
print(paste0('length = ', length(imgFiles), '! in ', imgFolder))
} else {
if (length(imgFiles) %% 30 != 0)
{
print(paste0('length = ', length(imgFiles), '! in ', imgFolder))
}
}
# do the next part regardless of the number of images
paths = unlist(lapply(imgFiles, function(x) return (paste0(imgFolder, '/', x))))
result <- sfLapply(paths, getDicomHeaderInfo)
for (row in result)
{
n = sum(imgTable$id > 0) + 1
imgTable$id[n] = row$id
imgTable$age[n] = row$age
imgTable$gender[n] = row$gender
imgTable$pixelSpacingX[n] = row$pixelSpacingX
imgTable$pixelSpacingY[n] = row$pixelSpacingY
imgTable$width[n] = row$width
imgTable$height[n] = row$height
imgTable$series[n] = row$series
imgTable$location[n] = row$location
imgTable$frame[n] = row$frame
imgTable$sliceNum[n] = row$sliceNum
imgTable$time[n] = row$time
imgTable$path[n] = row$path
}
}
# remove surplus records from table
imgTable= imgTable[imgTable$id > 0]
# grab the latest image for each location and frame
latestImages = imgTable[order(id, age, gender, pixelSpacingX, pixelSpacingY, width, height, location, frame, time, sliceNum, series), .SD[c(.N)], by=c("id", "age", "gender", "pixelSpacingX", "pixelSpacingY", "width", "height", "location", "frame")]
#
print(paste0('Patient: ', latestImages$id[1]))
uniqueLocs = sort(unique(latestImages$location[abs(latestImages$location) > 0.000001]))
simpleFeatures[latestImages$id[1], ] = list(latestImages$id[1], latestImages$age[1], latestImages$gender[1],
latestImages$pixelSpacingX[1], latestImages$pixelSpacingY[1],
latestImages$width[1], latestImages$height[1],
length(uniqueLocs))
# create the cleaned-up images for CNN training
imgTable$cleanPath = '---'
iSax = 1
patientID = latestImages$id[1]
for (loc in uniqueLocs)
{
imgset = latestImages$path[latestImages$location == loc]
iImage = 1
for (imgPath in imgset)
{
dicomImage = readDICOMFile(imgPath)
# fix the contrast and brightness
fixedImage = rebalanceImage(dicomImage$img)
# get the details of this image
###patientID = extractHeader(dicomImage$hdr, 'PatientID', numeric = TRUE)
sliceID = iSax
imageID = iImage
outPath = paste0(outFolder, '/', patientID, '/study/sax_', formatC(sliceID, width=2, flag='0'), '/image-', formatC(imageID, width=2, flag='0'), '.png')
#print(outPath)
#plot(raster(fixedImage))
checkFolder(outFolder, as.character(patientID))
checkFolder(paste0(outFolder, '/', patientID), 'study')
checkFolder(paste0(outFolder, '/', patientID, '/study'), paste0('sax_', formatC(sliceID, width=2, flag='0')))
writePNG(fixedImage / max(fixedImage), outPath)
imgTable$cleanPath[imgTable$path == imgPath] = outPath
#
iImage = iImage + 1
}
iSax = iSax + 1
}

if (patient == cases[1])
{
allImages = imgTable
} else
{
allImages = data.frame(rbind(allImages, imgTable))
}
}

sfStop()

I stored all of the DICOM header information in a table. Some of this will be required later, when calculating the volume of the left ventricle chamber.
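Since that header table gets reused later, it is worth saving it at the end of the run. Something as simple as this would do (the file name below is just illustrative):

# save the DICOM header table for later use (e.g. pixel spacing when computing ventricle volumes)
write.csv(allImages, file.path(outFolder, 'dicom-headers.csv'), row.names = FALSE)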

The next step, which will be described in my next blog, is to design a convolutional neural network that will automatically find the left ventricle in an image.


Second Annual Data Science Bowl – Part 1

Sunday, 06 Mar 2016

Posted by Colin Priest in Automation, Image Processing, Kaggle, Medical Imaging, R


I’m currently competing in the Second Annual Data Science Bowl at Kaggle. This is by far the most difficult competition that I have entered to date. At the time of writing I am placed 62nd out of 755 entries, with only a day remaining to lock down my methodology. There’s a lot more I’d like to do to improve my model, but alas, I don’t have the time!

Here’s the problem that we are solving:

  1. We are given a set of medical images taken by MRI, across 30 time periods, and a variable number of location slices through the body.
  2. We are also given the volume of the left ventricle of the heart at times of diastole and systole.
  3. Our task is to design an automatic algorithm that inputs DICOM images and outputs a cumulative distribution function of the likelihood of different volumes at both diastole and systole.

The medical image files are in DICOM format, containing information about the patient (e.g. age and gender) and a set of monochrome images for each patient giving a 4 dimensional view of that patient’s chest. The key images are the “sax” (short axis) images, a set of slices perpendicular to the line that passes through the length of the heart (a heart isn’t circular, but more ovoid in shape), and there are typically 30 images for each sax, each being 1/30th the time period of a heartbeat, showing one complete cycle of the heart. There are a varying number of sax images for each patient, depending upon the length of the patient’s heart, and sometimes there are also repeated sax sets, where the scanning was repeated in an attempt to improve the image quality.

The image quality varies greatly between patients, with differing image resolutions, brightness, contrast and aspect ratio / rotation.

[Animated GIFs: two sax_8 image sequences]

As you can see in the animated gifs above, some of the images are such poor quality that it is difficult for the human eye to discern the details. So my first challenge was to improve the brightness and contrast. One way to do this is to do a linear transformation on each image so that its pixels have a preset mean and standard deviation.
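Here is a minimal sketch of that kind of linear rescaling; the target mean and standard deviation below are illustrative, not the values I settled on:

# shift and scale the pixel brightnesses so every image has the same mean and standard deviation
normaliseImage = function(img, targetMean = 0.5, targetSD = 0.2)
{
 imgOut = (img - mean(img)) / sd(img) # standardise to mean 0, sd 1
 imgOut = imgOut * targetSD + targetMean # move to the target mean and standard deviation
 imgOut[imgOut < 0] = 0 # clip out-of-range pixel values
 imgOut[imgOut > 1] = 1
 return (imgOut)
}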

[Images: example slices after the linear transformation]

As you can see, while this approach helped, it did not work well enough for the problem images. I also tried non-linear transformations without much more success. No transformation function was flexible enough for the wide range of image qualities. Frequently part of the problem image gets washed out.

At the National Heart Centre Singapore, I spoke with Assistant Professor Calvin Chin about how doctors use imaging to assess heart volume. He explained that the very bright, washed out sections in some of the images are the result of fat deposits within the patient’s body. He also explained how to find the left ventricle chamber in an image (it is round with a thick lining surrounding it) and what to do about the dark patches inside the chamber (include them in the area of the chamber because they are blood vessels). This was really helpful. It pays to bring in some domain knowledge to a machine learning problem.


What I wanted was for all images to have similar brightness histograms. After much experimentation, I couldn’t find a transformation function that achieved this for me. But then I realised that I didn’t need to use a function: I could just use empirical histograms as my target, mapped to my original image via the brightness ranking of each pixel. All I needed to do was select an exemplar image (or multiple exemplar images), order the pixel brightnesses, and then map across. Here’s how I did it using R:


library(pacman)
pacman::p_load(oro.dicom)

# a function to turn the image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

# read a poor image and translate the pixel brightnesses
rebalanceImage = function(badImage)
{
# get the pixel brightnesses and get an index that sorts them
v = img2vec(badImage)
o2 = order(v)

# get a target histogram, allowing for the size of the bad image
vIn = sample(vAll, nrow(badImage) * ncol(badImage))
vIn = vIn[order(vIn)]

#
v2 = v
v2[o2] = vIn

# turn the pixel vector back into an image
cleanImage = matrix(v2, nrow(badImage), ncol(badImage))

return (cleanImage)
}
# create a benchmark histogram from an exemplar image
dicomBenchmark = readDICOM('C:/Users/Colin/Dropbox/blogging/20160306 Second Annual Data Science Bowl Part 1/SADSB/1/study/sax_13')
images = dicomBenchmark[[2]]
img = images[[1]]
vAll = unname(unlist(images))
vAll = vAll[order(vAll)]

# read the raw image
dicomImage = readDICOMFile('C:/Users/Colin/Dropbox/blogging/20160306 Second Annual Data Science Bowl Part 1/SADSB/1/study/sax_8/IM-4560-0001.dcm')
# fix the contrast and brightness
fixedImage = rebalanceImage(dicomImage$img)
# read the raw image
dicomImage2 = readDICOMFile('C:/Users/Colin/Dropbox/blogging/20160306 Second Annual Data Science Bowl Part 1/SADSB/6/study/sax_8/IM-9548-0001.dcm')
# fix the contrast and brightness
fixedImage2 = rebalanceImage(dicomImage2$img)

This gave me fairly consistent brightness and contrast, regardless of the quality of the original images, and also prevented washed-out regions. You can see the results below:

[Images: example slices after the histogram-based rebalancing]

Standardising the input data helps machine learning algorithms perform better because they don’t have to waste resources figuring out how to adjust for varying inputs where that variation is not a predictive feature.

In my next blog I will describe how I used the DICOM header information to further improve the model inputs, and to create extra features for my final model.


An Even Dozen – Denoising Dirty Documents: Part 12

Sunday, 15 Nov 2015

Posted by Colin Priest in Image Processing, Kaggle, Machine Learning, R, Stacking, XGBoost


Over the past 11 blogs in this series, I have discussed how to build machine learning models for Kaggle’s Denoising Dirty Documents competition.


The final blog in this series brings the count to an even dozen, and will achieve two aims:

  1. ensemble the models that we have built
  2. take advantage of the second information leakage in the competition

Ensembling, the combining of individual models into a single model, performs best when the individual models have errors that are not strongly correlated. For example, if each of 4 models has statistically independent errors and similar accuracy, then the average prediction across those 4 models will have half the RMSE of the individual models (a short simulation illustrating this appears after the list below). One way to increase the statistical independence of the models is to use different feature sets and / or types of models on each. I therefore chose the following combination of models:

  1. deep learning – thresholding based features
  2. deep learning – edge based features
  3. deep learning – median based features
  4. images with backgrounds removed using information leakage
  5. xgboost – wide selection of features
  6. convolutional neural network – using raw images without background removal pre-processing
  7. convolutional neural network – using images with backgrounds removed using information leakage
  8. deep convolutional neural network – using raw images without background removal pre-processing
  9. deep convolutional neural network – using images with backgrounds removed using information leakage

[Diagram: ensemble structure]
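Before getting into the ensemble code, here is a quick simulation of the independent-errors claim above. It is purely illustrative, using made-up noise:

# simulate 4 models whose errors are independent and of similar size,
# then compare the RMSE of a single model against the RMSE of their average
set.seed(1)
truth = rnorm(100000)
preds = sapply(1:4, function(i) truth + rnorm(100000, sd = 0.1))
rmse = function(a, b) sqrt(mean((a - b)^2))
rmse(preds[, 1], truth) # roughly 0.10 for a single model
rmse(rowMeans(preds), truth) # roughly 0.05 for the average of the 4 models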

It turned out that some of these models had errors that weren’t strongly independent of the other models’ errors. But I was rushing to improve my leaderboard score in the final 48 hours of the competition, so I didn’t have time to experiment.

I didn’t experiment much with different ensemble models. However, I did test xgboost against a simple average and against a least-squares linear regression, and it outperformed both. Maybe an elastic net could have done a good job.
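Here is a sketch of how that comparison can be scored, using the stacked training table dat that gets built in the script below (column 1 is y, and the remaining columns are the individual model predictions):

# RMSE of a simple average of the individual models
pred_avg = rowMeans(dat[, -1])
sqrt(mean((dat$y - pred_avg) ^ 2))
# RMSE of a least-squares linear blend
lin = lm(y ~ ., data = dat)
sqrt(mean((dat$y - predict(lin, dat)) ^ 2))
# compare these against the in-sample xgboost RMSE calculated near the end of the training script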

Here is the R code for my ensemble:


.libPaths(c(.libPaths(), "./rlibs"))
library(png)
library(data.table)
library(xgboost)

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

cleanFolder = "./data/train_cleaned"
inFolder1 = "./threshold based model/training data"
inFolder2 = "./edge based model/training data"
inFolder3 = "./median based model/training data"
inFolder4 = "./foreground/train foreground"
inFolder5 = "./submission 11/train_postprocessed"
inFolder6 = "./convnet/train_predicted"
inFolder7 = "./cnn_leakage/train_predicted"
inFolder8 = "./CNN based model/training"
inFolder9 = "./deep CNN/train_predicted"

outPath = "./stacked/stacking.csv"

filenames = list.files(cleanFolder)
for (f in filenames)
{
print(f)
imgX1 = readPNG(file.path(inFolder1, f))
imgX2 = readPNG(file.path(inFolder2, f))
imgX3 = readPNG(file.path(inFolder3, f))
imgX4 = readPNG(file.path(inFolder4, f))
imgX5 = readPNG(file.path(inFolder5, f))
imgX6 = readPNG(file.path(inFolder6, f))
imgX7 = readPNG(file.path(inFolder7, f))
imgX8 = readPNG(file.path(inFolder8, f))
imgX9 = readPNG(file.path(inFolder9, f))
imgY = readPNG(file.path(cleanFolder, f))

# turn the images into vectors
y = img2vec(imgY)
x1 = img2vec(imgX1)
x2 = img2vec(imgX2)
x3 = img2vec(imgX3)
x4 = img2vec(imgX4)
x5 = img2vec(imgX5)
x6 = img2vec(imgX6)
x7 = img2vec(imgX7)
x8 = img2vec(imgX8)
x9 = img2vec(imgX9)

dat = data.table(cbind(y, x1, x2, x3, x4, x5, x6, x7, x8, x9))
setnames(dat,c("y", "threshold", "edge", "median", "foreground", "submission11", "convnet", "cnn_leakage", "CNN", "deepCNN"))
write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit an xgboost model to a subset of the data
set.seed(1)
#rows = sample(nrow(dat), 15000000)
dat[is.na(dat)] = 0
#dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
dtrain <- xgb.DMatrix(as.matrix(dat[,-1]), label = as.matrix(dat[,1]))
#
nThreads = 30
# do cross validation first
#xgb.tab = xgb.cv(data = dtrain, nthread = nThreads, eval_metric = "rmse", nrounds = 1000, early.stop.round = 15, nfold = 4, print.every.n = 10)
# what is the best number of rounds?
#min.error.idx = which.min(xgb.tab[, test.rmse.mean])
# now fit an xgboost model
min.error.idx = 300 # was 268
xgb.mod = xgboost(data = dtrain, nthread = nThreads, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)

dat_predicted = predict(xgb.mod, newdata=as.matrix(dat[,-1]))
sqrt( mean( (dat$y - dat_predicted) ^ 2 )) # 0.00759027

save (xgb.mod, file = "./model/xgb.rData")

#####################################################################################################################################

imgFolder = "./data/test"
inFolder1 = "./threshold based model/test data"
inFolder2 = "./edge based model/test data"
inFolder3 = "./median based model/test data"
inFolder4 = "./foreground/test foreground"
inFolder5 = "./submission 11/test_postprocessed"
inFolder6 = "./convnet/test_predicted"
inFolder7 = "./cnn_leakage/test_predicted"
inFolder8 = "./CNN based model/test"
inFolder9 = "./deep CNN/test_predicted"

outFolder = "./stacked/test data"
outFolder2 = "./stacked/test images"

filenames = list.files(imgFolder)
for (f in filenames)
{
print(f)
imgX1 = readPNG(file.path(inFolder1, f))
imgX2 = readPNG(file.path(inFolder2, f))
imgX3 = readPNG(file.path(inFolder3, f))
imgX4 = readPNG(file.path(inFolder4, f))
imgX5 = readPNG(file.path(inFolder5, f))
imgX6 = readPNG(file.path(inFolder6, f))
imgX7 = readPNG(file.path(inFolder7, f))
imgX8 = readPNG(file.path(inFolder8, f))
imgX9 = readPNG(file.path(inFolder9, f))

# turn the images into vectors
x1 = img2vec(imgX1)
x2 = img2vec(imgX2)
x3 = img2vec(imgX3)
x4 = img2vec(imgX4)
x5 = img2vec(imgX5)
x6 = img2vec(imgX6)
x7 = img2vec(imgX7)
x8 = img2vec(imgX8)
x9 = img2vec(imgX9)

dat = data.table(cbind(x1, x2, x3, x4, x5, x6, x7, x8, x9))
setnames(dat,c("threshold", "edge", "median", "foreground", "submission11", "convnet", "cnn_leakage", "CNN", "deepCNN"))
yHat = predict(xgb.mod, newdata=as.matrix(dat))
yHat[yHat < 0] = 0
yHat[yHat > 1] = 1
imgY = matrix(yHat, nrow(imgX1), ncol(imgX1))
writePNG(imgY, file.path(outFolder2, f))
save(imgY, file = file.path(outFolder, gsub(".png", ".rData", f)))
}

Ensembling materially improved my leaderboard score versus any of the individual models. I feel that was due to the use of different features across my 3 deep learning models. So now I had a set of images that looked quite good:

[Image: example of the ensemble's cleaned output]

[Image: another example of the ensemble's cleaned output]

To my eyes, my predicted images were indistinguishable from the clean images in the training data. In a real world situation I would have stopped model development here, because the image quality exceeds the minimum requirements for OCR. However, since this was a competition, I wanted the best score I could get.

So I took advantage of the second data leakage in the competition – the fact that the cleaned images were repeated across the dataset. This meant that I could compare a cleaned image to other cleaned images that appeared to have the same text and the same font, and clean up any pixels that were different across the set of images. I experimented with using the mean of the pixel brightness across the images, but using the median performed better.


library(png)
library(data.table)

inFolder = "./stacked/test data"
outFolder = "./information leakage/data"
outFolder2 = "./information leakage/images"

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

filenames = list.files(inFolder, pattern = "\\.rData$")
for (f in filenames)
{
print(f)

load(file.path(inFolder, f))
imgX = imgY

# look for the closest matched images
scores = matrix(1, length(filenames))
for (i in 1:length(filenames))
{
load(file.path(inFolder, filenames[i]))
rmse = 1
if (nrow(imgY) >= nrow(imgX) && ncol(imgY) >= ncol(imgX))
{
imgY = imgY[1:nrow(imgX), 1:ncol(imgX)]
rmse = sqrt(mean( (imgX - imgY)^2 ))
}
scores[i] = rmse
}

dat = matrix(1, ncol(imgX) * nrow(imgX), 4)
for (i in 1:4)
{
f2 = filenames[order(scores)][i]
load(file.path(inFolder, f2))
dat[,i] = img2vec(imgY)
}

dat2 = apply(dat, 1, median)
#dat2 = apply(dat, 1, mean)

imgOut = matrix(dat2, nrow(imgX), ncol(imgX))
writePNG(imgOut, file.path(outFolder2, gsub(".rData", ".png", f)))
save(imgOut, file = file.path(outFolder, f))
}

This information leakage halved the RMSE, and I suspect that it was what allowed the top two competitors to obtain RMSE scores less than 1%.

So that’s it for this series of blogs. I learned a lot from my first Kaggle competition. Competing against others, and sharing ideas is a fun way to learn.


Denoising Dirty Documents: Part 11

Sunday, 08 Nov 2015

Posted by Colin Priest in Adaptive Thresholding, Background Removal, Deep Learning, Edge Detection, h2o, Image Processing, Kaggle, Machine Learning, Median Filter, Morphology, R


In my last blog I showed how to use convolutional neural networks to build a model that removed stains from an image. While convolutional neural networks seem to be well suited for image processing, in this competition I found that deep neural networks performed better. In this blog I show how to build these models.


Since I wanted to use R, had limited RAM and didn’t have a powerful GPU, I chose to use h2o to build the models. That way I could do the feature engineering in R, pass the data to h2o, let h2o build a model, then get the predicted values back in R. The memory management would be done in h2o, whose deep learning algorithms adjust to the available RAM. So I guess this combination of deep learning and h2o could be called “deep water” 😉

For my final competition submission I used an ensemble of models, including 3 deep learning models built with R and h2o. Each of the 3 deep learning models used different feature engineering:

  • median based feature engineering
  • edge based feature engineering
  • threshold based feature engineering

This blog shows the details of the median based model. I leave it to the reader to implement the edge based and threshold based models using the image processing scripts from my earlier blogs in this series.

If you don’t already have h2o installed on your computer, then you can install it directly from R. At the time of writing this blog, you could install h2o using the following script:


# The following two commands remove any previously installed H2O packages for R.
if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }

# Next, we download packages that H2O depends on.
if (! ("methods" %in% rownames(installed.packages()))) { install.packages("methods") }
if (! ("statmod" %in% rownames(installed.packages()))) { install.packages("statmod") }
if (! ("stats" %in% rownames(installed.packages()))) { install.packages("stats") }
if (! ("graphics" %in% rownames(installed.packages()))) { install.packages("graphics") }
if (! ("RCurl" %in% rownames(installed.packages()))) { install.packages("RCurl") }
if (! ("jsonlite" %in% rownames(installed.packages()))) { install.packages("jsonlite") }
if (! ("tools" %in% rownames(installed.packages()))) { install.packages("tools") }
if (! ("utils" %in% rownames(installed.packages()))) { install.packages("utils") }

# Now we download, install and initialize the H2O package for R.
install.packages("h2o", type="source", repos=(c("http://h2o-release.s3.amazonaws.com/h2o/rel-tibshirani/3/R")))

That script will need to be changed as new versions of h2o are released. So use the latest instructions shown here.

Once h2o is installed, you can interface with h2o from R using the CRAN package.


install.packages("h2o")
library(h2o)

Median based image processing is used for feature engineering in this example, but you could use any combination of image processing techniques for your feature engineering. I got better performance using separate deep learning models for different types of image processing, but that may be because I had limited computing resources. If you have more computing resources than me, then maybe you will be successful with a single large model that uses all of the image processing techniques to create features.


# a function to turn a matrix image into a vector
img2vec = function(img)
{
 return (matrix(img, nrow(img) * ncol(img), 1))
}
 
median_Filter = function(img, filterWidth)
{
 pad = floor(filterWidth / 2)
 padded = matrix(NA, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
 padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img
 
 tab = matrix(0, nrow(img) * ncol(img), filterWidth * filterWidth)
 k = 1
 for (i in seq_len(filterWidth))
 {
 for (j in seq_len(filterWidth))
 {
 tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
 k = k + 1
 }
 }
 
 filtered = unlist(apply(tab, 1, function(x) median(x[!is.na(x)])))
 return (matrix(filtered, nrow(img), ncol(img)))
}
 
# a function that uses median filter to get the background then finds the dark foreground
background_Removal = function(img)
{
 w = 5
 p = 1.39
 th = 240
 
 # the background is found via a median filter
 background = median_Filter(img, w)
 
 # the foreground is darker than the background
 foreground = img / background
 foreground[foreground > 1] = 1
 
 foreground2 = foreground ^ p
 foreground2[foreground2 >= (th / 255)] = 1
 
 return (matrix(foreground2, nrow(img), ncol(img)))
} 

img2tab = function(imgX, f)
{
 median5 = img2vec(median_Filter(imgX, 5))
 median17 = img2vec(median_Filter(imgX, 17))
 median25 = img2vec(median_Filter(imgX, 25))
 backgroundRemoval = img2vec(background_Removal(imgX))
 foreground = readPNG(file.path(foregroundFolder, f))
 
 # pad out imgX
 padded = matrix(0, nrow(imgX) + padding * 2, ncol(imgX) + padding * 2)
 offsets = expand.grid(seq_len(2*padding+1), seq_len(2*padding+1))
 
 # raw pixels window
 padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = imgX
 x = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))
 
 # x2 window
 padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = median5
 x2 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))
 
 # x3 window
 padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = median17
 x3 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))
 
 # x4 window
 padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = median25
 x4 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))
 
 # x5 window
 padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = backgroundRemoval
 x5 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))
 
 # x6 window
 padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = foreground
 x6 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

 dat = data.table(cbind(x, x2, x3, x4, x5, x6))
 setnames(dat,c(
 paste("x", seq_len((2*padding+1)^2), sep=""), 
 paste("median5", seq_len((2*padding+1)^2), sep=""),
 paste("median17", seq_len((2*padding+1)^2), sep=""),
 paste("median25", seq_len((2*padding+1)^2), sep=""),
 paste("backgroundRemoval", seq_len((2*padding+1)^2), sep=""),
 paste("foreground", seq_len((2*padding+1)^2), sep="")
 ))
 
 return (dat)
}

If you’ve been following my blog, then you will see that there’s nothing new in the two image processing functions shown above.

To build the model you will need to start h2o, import the data and tell h2o to create a deep learning model.

h2oServer = h2o.init(nthreads = 6, max_mem_size = "10G")

trainData = h2o.importFile(h2oServer, path = outPath)
testData = h2o.importFile(h2oServer, path = outPath2)

model.dl.median <- h2o.deeplearning(x = 2:ncol(trainData), y = 1, training_frame = trainData, validation_frame = testData,
 score_training_samples = 0, 
 overwrite_with_best_model = TRUE,
 activation = "Rectifier", seed = 1,
 hidden = c(200, 200,200), epochs = 15,
 adaptive_rate = TRUE, initial_weight_distribution = "UniformAdaptive", loss = "MeanSquare",
 fast_mode = T, diagnostics = T, ignore_const_cols = T,
 force_load_balance = T)


You should change the h2o.init parameters according to the hardware on your computer. I’m running my model on a PC with 8 CPUs and 16GB of RAM, so I left a couple of CPUs free to do the user interface and core operating system functionality, plus some RAM for the operating system. Scale these parameters up or down if your PC specifications are more or less powerful than mine.
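If you want the script to adapt itself to whatever machine it runs on, something along these lines works (a sketch; choose max_mem_size to suit your own RAM):

library(h2o)
# leave a couple of cores free for the operating system and user interface
nCores = max(1, parallel::detectCores() - 2)
h2oServer = h2o.init(nthreads = nCores, max_mem_size = "12G")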

The model may take a few hours to fit. During that time R will not do anything. So if you want to see how the model is progressing, then point your browser to your localhost (port 54321 on my PC, but maybe a different port on yours) and use the h2o web interface to see what is happening.

You can get the predicted values using the following script:


filenames = list.files(dirtyFolder)
for (f in filenames)
{
 print(f)
 imgX = readPNG(file.path(dirtyFolder, f))

dat = img2tab(imgX, f)

x.h2o = as.h2o(h2oServer, dat)
 predict.dl = as.data.frame(h2o.predict(model.dl.median, newdata = x.h2o))
 imgOut = matrix(as.numeric(predict.dl$predict), nrow(imgX), ncol(imgX))
 
 # correct the pixel brightnesses that are out of bounds
 imgOut[imgOut > 1] = 1
 imgOut[imgOut < 0] = 0

writePNG(imgOut, file.path(outFolder, f))
}

h2o.shutdown()

Running predictions is as simple as creating a data file, importing it to h2o, and then asking h2o to give you the predicted values from your already fitted model. I found that some of the raw predicted values were out of the [0, 1] range, and improved my leaderboard score by limiting the predicted values to lie within this range.

You do not need to shut down h2o after you finish running a model. In fact you may wish to leave it running so that you can do model diagnostics or run more predictions.

If you wish to save a copy of your model, for later reuse, then you can use the following syntax:


modelPath = h2o.saveModel(model.dl.median, dir = "./model", name = "model_dnn_median", force = TRUE)

Just remember that h2o needs to be running when you save models or load previously saved models.
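Loading it back in a later session is the mirror image (again, h2o must already be running):

# reload the previously saved deep learning model
model.dl.median = h2o.loadModel(modelPath)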

In my next, and final, blog in this series, I will show how to take advantage of the second information leakage in the competition.

For those who want the entire R script to try out for themselves, here it is:


install.packages("h2o")
library(h2o)
library(png)
library(data.table)

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

median_Filter = function(img, filterWidth)
{
pad = floor(filterWidth / 2)
padded = matrix(NA, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

tab = matrix(0, nrow(img) * ncol(img), filterWidth * filterWidth)
k = 1
for (i in seq_len(filterWidth))
{
for (j in seq_len(filterWidth))
{
tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
k = k + 1
}
}

filtered = unlist(apply(tab, 1, function(x) median(x[!is.na(x)])))
return (matrix(filtered, nrow(img), ncol(img)))
}

# a function that uses median filter to get the background then finds the dark foreground
background_Removal = function(img)
{
w = 5
p = 1.39
th = 240

# the background is found via a median filter
background = median_Filter(img, w)

# the foreground is darker than the background
foreground = img / background
foreground[foreground > 1] = 1

foreground2 = foreground ^ p
foreground2[foreground2 >= (th / 255)] = 1

return (matrix(foreground2, nrow(img), ncol(img)))
}

dirtyFolder = "./data/train"
cleanFolder = "./data/train_cleaned"
outFolder = "./model"
foregroundFolder = "./foreground/train foreground"

outPath = file.path(outFolder, "trainingdata.csv")
outPath2 = file.path(outFolder, "testdata.csv")
filenames = list.files(dirtyFolder)
padding = 2
set.seed(1)
library(h2o)
h2oServer = h2o.init(nthreads = 15, max_mem_size = "110G")

trainData = h2o.importFile(h2oServer, path = outPath)
testData = h2o.importFile(h2oServer, path = outPath2)

model.dl.median <- h2o.deeplearning(x = 2:ncol(trainData), y = 1, training_frame = trainData, validation_frame = testData,
score_training_samples = 0,
overwrite_with_best_model = TRUE,
activation = "Rectifier", seed = 1,
hidden = c(200, 200,200), epochs = 15,
adaptive_rate = TRUE, initial_weight_distribution = "UniformAdaptive", loss = "MeanSquare",
fast_mode = T, diagnostics = T, ignore_const_cols = T,
force_load_balance = T)

summary(model.dl.median)

modelPath = h2o.saveModel(model.dl.median, dir = "./model", name = "model_dnn_median", force = TRUE)

outFolder = "./model/training data"

img2tab = function(imgX, f)
{
median5 = img2vec(median_Filter(imgX, 5))
median17 = img2vec(median_Filter(imgX, 17))
median25 = img2vec(median_Filter(imgX, 25))
backgroundRemoval = img2vec(background_Removal(imgX))
foreground = readPNG(file.path(foregroundFolder, f))

# pad out imgX
padded = matrix(0, nrow(imgX) + padding * 2, ncol(imgX) + padding * 2)
offsets = expand.grid(seq_len(2*padding+1), seq_len(2*padding+1))

# raw pixels window
padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = imgX
x = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

# x2 window
padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = median5
x2 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

# x3 window
padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = median17
x3 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

# x4 window
padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = median25
x4 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

# x5 window
padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = backgroundRemoval
x5 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

# x6 window
padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = foreground
x6 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

dat = data.table(cbind(x, x2, x3, x4, x5, x6))
setnames(dat,c(
paste("x", seq_len((2*padding+1)^2), sep=""),
paste("median5", seq_len((2*padding+1)^2), sep=""),
paste("median17", seq_len((2*padding+1)^2), sep=""),
paste("median25", seq_len((2*padding+1)^2), sep=""),
paste("backgroundRemoval", seq_len((2*padding+1)^2), sep=""),
paste("foreground", seq_len((2*padding+1)^2), sep="")
))

return (dat)
}

dirtyFolder = "./data/test"
outFolder = "./model/test data"
foregroundFolder = "./foreground/test foreground"
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))

dat = img2tab(imgX, f)

x.h2o = as.h2o(h2oServer, dat)
predict.dl = as.data.frame(h2o.predict(model.dl.median, newdata = x.h2o))
imgOut = matrix(as.numeric(predict.dl$predict), nrow(imgX), ncol(imgX))

# correct the pixel brightnesses that are out of bounds
imgOut[imgOut > 1] = 1
imgOut[imgOut < 0] = 0

writePNG(imgOut, file.path(outFolder, f))
}

h2o.shutdown()


Denoising Dirty Documents: Part 9

15 Thursday Oct 2015

Posted by Colin Priest in Background Removal, Image Processing, Kaggle, R

≈ 3 Comments

Tags

Background Removal, Image Processing, Kaggle, R

Now that Kaggle’s Denoising Dirty Documents Competition has closed, it’s time to start posting the secrets to getting a very good score in this competition. In this blog, I describe how to take advantage of the first of two information leakages that I used.


Information leakage occurs in predictive modelling when the training and test data includes values that would not be known at the time a prediction was being made. For example, I once worked on a direct marketing sales propensity modelling project where our predictive model fitted the training data far too well, making us suspicious. We eventually tracked it down to an incorrectly designed data extract that used status values as at the data extraction date, instead of as at the date at which a prediction would have been run. The sale of the product to the customer changed the value of that data field, so the data needed to be the value of the data field before the sale occurred. In real life projects, information leakage is a bad thing because it overstates the model accuracy. Therefore you need to ensure that it does not occur; otherwise your predictive model will not perform well. In data science competitions, information leakage is something to be taken advantage of. It enables you to obtain a higher score.

The first information leakage in this competition comes from how the training data was created. There are only 8 different page backgrounds. There are 2 coffee cup stains, 2 folded pages, 2 watermarks and 2 crumpled pages.

20151015 output 1

This gives us a huge advantage in removing the background and the stains. Instead of using an uncertain estimate based upon only a single image, we can group together the images so that each group has the same background. Then, for each pixel location, calculate the brightest pixel across all of the images in that group, and use the brightest value as the pixel brightness in the background. You can find a couple of scripts on Kaggle that do this.
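Here is a minimal sketch (not from those Kaggle scripts) of that per-pixel maximum idea, assuming groupFiles holds the file names of one group of training images, folder points at the folder containing them, and all images within a group have the same dimensions:

library(png)

# estimate the shared background of a group as the per-pixel maximum (brightest value) across the group's images
estimate_group_background = function(groupFiles, folder)
{
 imgs = lapply(groupFiles, function(f) readPNG(file.path(folder, f)))
 return (Reduce(pmax, imgs))
}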

The problem is that you can't do the same for the test images, because their backgrounds differ from the training backgrounds and there are only four different backgrounds among them. You couldn't have known this without manually looking at the test images, and that would break the rules prohibiting manual processing of the test images.

20151015 output 2

You can also run into trouble if your model just assumes that the test images are arranged like the training images, with 8 different background images repeating every 8 images.

A better solution is to let your model decide which images can be grouped together, and let your model decide which background belongs with each image. The way to do this is to apply a median filter to each image, to obtain an estimate of the background of each image, and then group the median images together according to similarity.


library(png)
library(raster)
library(data.table)

median_Filter = function(img, filterWidth)
{
pad = floor(filterWidth / 2)
padded = matrix(NA, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

tab = matrix(0, nrow(img) * ncol(img), filterWidth * filterWidth)
k = 1
for (i in seq_len(filterWidth))
{
for (j in seq_len(filterWidth))
{
tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
k = k + 1
}
}

filtered = unlist(apply(tab, 1, function(x) median(x[!is.na(x)])))
return (matrix(filtered, nrow(img), ncol(img)))
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

# training data
dirtyFolder = "C:/Users/Colin/dropbox/Kaggle/Denoising Dirty Documents/data/train"
outPath = "D:/CNN with background removal/train median25"
filenames = list.files(dirtyFolder)
# use a 25x25 median filter to get the background
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))

median25 = median_Filter(imgX, 25)

outFile = file.path(outPath, f)
writePNG(median25, outFile)
}

The script above calculates the median filter for each image and stores it in a folder. I have used a filter size of 25 pixels because I want broad patterns of the background, and I want the text removed.

Now that I have the median images, I can iterate through each training image and link it to other images that have similar median images. I have used RMSE as a measure of similarity. The cutoff RMSE of 2.5% was determined by looking at the range of RMSE values and looking for a natural break.

20151015 output 3


# find the images with matching background
outPath2 = "D:/CNN with background removal/train background"
outPath3 = "D:/CNN with background removal/train foreground"
for (i in 1:length(filenames))
{
f = filenames[i]
print(f)

imgX = readPNG(file.path(outPath, f))

scores = matrix(1, length(filenames))
for (j in 1:length(filenames))
{
imgY = readPNG(file.path(outPath, filenames[j]))
rmse = 1
if (nrow(imgY) >= nrow(imgX) & ncol(imgY) >= ncol(imgX)) rmse = sqrt(mean((imgX - imgY[1:nrow(imgX), 1:ncol(imgX)]) ^ 2))
scores[j] = rmse
}

sameStains = filenames[scores <= 0.025]
nImages = length(sameStains)

rawData = matrix(0, ncol(imgX) * nrow(imgX), nImages)
for (j in 1:nImages)
{
imgY = readPNG(file.path(dirtyFolder, sameStains[j]))
rawData[,j] = img2vec(imgY[1:nrow(imgX), 1:ncol(imgX)])
}

background = matrix(unlist(apply(rawData,1,max)), nrow(imgX), ncol(imgX)) # background is defined as the lightest pixel of images with similar median transformations
plot(raster(background))
writePNG(background, file.path(outPath2, f))

imgX = readPNG(file.path(dirtyFolder, f))
foreground = (imgX - background) / background
r = range(foreground)
foreground = (foreground - r[1]) / (r[2] - r[1]) # rescale to [0, 1]
#plot(raster(foreground))
writePNG(foreground, file.path(outPath3, f))
}

One of the tricks in the script above was that I didn’t simply subtract the background from the image. If I had done that, then the result would not be consistent in areas where the stains occur:

20151015 output 4

20151015 foreground-bad

Notice that in the image above, the writing is faded where the stain was dark. That’s because it was created with the code:


foreground = (imgX - background)
r = range(foreground)
foreground = (foreground - r[1]) / (r[2] - r[1]) # rescale to [0, 1]

Where the background is dark, there is less opportunity for the writing to contrast against the background. To fix this, I changed the above script to be:


foreground = (imgX - background) / background
r = range(foreground)
foreground = (foreground - r[1]) / (r[2] - r[1]) # rescale to [0, 1]

By dividing the difference by the background brightness, I have rescaled the contrast to allow for the limitation on the maximum contrast at this location. The result is shown below:
20151015 foreground-good
This time the writing has consistent darkness across the entire image, regardless of the brightness of the background pixels.

My competition submission consisted of 4 stages, and this leakage-based background removal was the first stage. The final stage also took advantage of an information leakage. But you will have to wait for a couple of blogs to see what that was…


Denoising Dirty Documents: Part 8

02 Friday Oct 2015

Posted by Colin Priest in Image Processing, Kaggle, Machine Learning, R

≈ 4 Comments

Tags

Image Processing, Kaggle, Machine Learning, R

In this blog we will engineer a new feature to go into our model. So far we have predominantly been using localised features – information about pixels that are located nearby the pixel whose brightness we are predicting. In this blog we will consider the structure of a document, and use that to improve our model.

20150801 - after

The image consists of multiple lines of text, arranged into one or more paragraphs with multiple sentences in each paragraph.


If we get our ducks in a row, we can use the fact that the text (like our metaphorical ducks) is arranged into lines, and that there are gaps between those lines of text. In particular, if we can find the gaps between the lines, we can ensure that the predicted value within those gaps is always the background colour.

Let’s look at the horizontal profile of a sample input image:


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png)

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"

f = "135.png"
imgX = readPNG(file.path(dirtyFolder, f))

hProfileX = unlist(apply(imgX, 1, mean))

x = 1:nrow(imgX)
plot(x, hProfileX, col="red", type="l", xlab = "row", ylab = "brightness")

20150930 output 01

The rows with low brightness show where text occurs at regular intervals. But the absolute level of the row brightness varies, according to the noise and stains, so much that sometimes the gaps between the lines of text are darker than the rows in which text occurs. This will make it more difficult to determine which rows are which.

But what if we used the predicted images from our stacked model from the last blog as the inputs to this predictor?


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png)

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\stacked"

f = "135.png"
imgX = readPNG(file.path(dirtyFolder, f))

hProfileX = unlist(apply(imgX, 1, mean))

x = 1:nrow(imgX)
plot(x, hProfileX, col="red", type="l", xlab = "row", ylab = "brightness")

20150930 output 02

Now it's much easier to find the gaps between the lines of text. To do this we need to estimate three parameters:

  1. cycle length: the number of rows from one line of text to the next line of text
  2. gap length: the height of the gap between lines of text
  3. starting location of the first cycle

The constraints that we can use for determining these parameters are:

  • cycle length is a positive number
  • gap length is a positive number
  • gap length is less than cycle length
  • gap rows are white
  • most letters (e.g. weruioaszxcvnm) are short and these letters are located in the darkest rows, which we will call the “core rows”
  • some letters are taller (e.g. tdfhklb) and create slightly darker rows above the core rows
  • some letters are taller (e.g. qypgj) and create slightly darker rows below the core rows
  • if we cluster the values, the darker cluster contains the core of the letters (the core rows), but these rows may be lighter when a paragraph starts or ends
  • the in-between valued rows next to the white rows are possibly where the taller letters are located and their darkness will vary according to the number of tall letters

Let’s start by clustering the row brightnesses, and checking where the rows switch from one cluster to another, and using this to estimate cycle length.


km = kmeans(hProfileX, 2)
cutoff = (max(hProfileX[km$cluster == which.min(km$centers)]) + min(hProfileX[km$cluster == which.max(km$centers)])) / 2
plot(hProfileX, type="l")
lines(hProfileX * 0 + cutoff, col="red")

starts = (2:length(hProfileX))[sapply(2:length(hProfileX), function(x) hProfileX[x] < cutoff & hProfileX[x-1] > cutoff)]
cycleLength = mean(diff(starts))

20150930 output 03

20150930 output 04

Next we will estimate the starting location of the first cycle. But before we do that, we need to define the “start” of the cycle. Let’s define the cycle as starting at the row in which the first letter starts. That would be the first non-white row before the first core rows.

cycleStarts = starts[1]
while (hProfileX[cycleStarts] < 0.99)
 cycleStarts = cycleStarts - 1

The first cycle starts at row 3.

To estimate the gap length, we should measure the number of white rows before each cycle starts.


gaps = matrix(0, length(starts) - 1)
for (i in 2:length(starts))
{
 cycleStarts = starts[i]
 while (hProfileX[cycleStarts] < 0.99)
 cycleStarts = cycleStarts - 1
 gapStarts = cycleStarts - 1
 while (hProfileX[gapStarts] > 0.99)
 gapStarts = gapStarts - 1
 gapLength = cycleStarts - gapStarts
 gaps[i - 1] = gapLength
}
gaps

The gaps are (13, 12, 17, 13, 12, 11, 12, 18, 12, 12). The two outliers occur after rows of text that do not contain any of the characters (q, y, p, g, j). After removing these two outliers, the gaps do not exceed 13 rows. So we can write code to allow for this as follows:


mGaps = median(gaps)
gapLength = mean(gaps[abs(gaps - mGaps) <= 2])

This gives us a gap length estimate of 12.125 and completes our line location model for this image. The quick way to use this model is to white out the gap rows. A more complex model could use the newly estimated row types (core, gap, other) as predictors, as sketched after the next code block.


imgCleaned = imgX
numCycles = length(starts)
for (i in 1:numCycles)
{
cycleStart = starts[i] - min(gaps[abs(gaps - mGaps) <= 2])
imgCleaned[cycleStart:(starts[i] - 1), ] = 1 # white out the whole gap rows
}

The script above would clean up any minor image blemishes that exist between the lines of text.
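If you wanted to try the more complex option mentioned above, here is a minimal sketch (not from my competition code) that labels each row as "gap", "core" or "other", assuming the hProfileX profile and the k-means cutoff computed earlier; the labels are then replicated in column-major order so that they line up with the pixel vectors used throughout this series.

rowType = rep("other", length(hProfileX))
rowType[hProfileX > 0.99] = "gap" # white rows between the lines of text
rowType[hProfileX < cutoff] = "core" # the darkest rows, containing the cores of the letters
# one label per pixel, in the same column-major order as img2vec, ready to use as a candidate predictor
pixelRowType = factor(rep(rowType, times = ncol(imgX)))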


Denoising Dirty Documents: Part 7

23 Wednesday Sep 2015

Posted by Colin Priest in Image Processing, Kaggle, Machine Learning, R, Stacking

≈ 3 Comments

Tags

Image Processing, Kaggle, Machine Learning, R, Stacking

By the time I'd finished building the model in my last blog, I'd started to overload my computer's RAM and CPU, so much so that I couldn't add any more features. One solution could be for me to upgrade my hardware, or rent a cluster in the cloud, but I'm trying to save money at the moment. So today I'm going to restructure my predictive solution into separate predictive models, none of which individually overloads my computer, but which are combined via stacking to give an overall solution that is more predictive than any of the individual models.


Stacking would also allow us to answer Bobby’s question:

I thought it was interesting how the importance chart didn’t show any one individual pixel as being significant. Is there another similar importance chart or metric that can show the importance of a combination of pixels?

So let’s start by breaking up the current monolithic model into discrete chunks. Once we have done this, it will be easier for us to add new features (which I will do in the next blog). I will store the predicted outputs from each model in image format with separate folders for each model.

20150923 diagram 1

The R script for creating the individual models is primarily recycled from my past blogs. Since images contain integer values for pixel brightness, one may argue that storing the predicted values as images loses some of the information content, but the rounding of floating point values to integer values only affects the precision by a maximum of 0.2%, and the noise in the predicted values is an order of magnitude greater than that. If you want that fractional extra predictive power, then I leave it to you to adapt the script to write the floating point predictions into a data.table.

Linear Transformation

The linear transformation model was developed here, and just involves a linear transformation, then constraining the output to be in the [0, 1] range.

20150801 output 6


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png)

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"

# predict using linear transformation
outModel1 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model1"
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))

# turn the images into vectors
x = matrix(imgX, nrow(imgX) * ncol(imgX), 1)

# linear transformation
yHat = -0.126655 + 1.360662 * x

# constrain the range to be in [0, 1]
yHat = sapply(yHat, function(x) max(min(x, 1),0))

# turn the vector into an image
imgYHat = matrix(yHat, nrow(imgX), ncol(imgX))

# save the predicted value
writePNG(imgYHat, file.path(outModel1, f))
}

Thresholding

The thresholding model was developed here, and involves a combination of adaptive thresholding processes, each with different sizes of localisation ranges, plus a global thresholding process. But this time we will use xgboost instead of gbm.

20150815 output 3


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, data.table, xgboost)

if (!require("EBImage"))
{
source("http://bioconductor.org/biocLite.R")
biocLite("EBImage")
}

# a function to do k-means thresholding
kmeansThreshold = function(img)
{
# fit 3 clusters
v = img2vec(img)
km.mod = kmeans(v, 3)
# allow for the random ordering of the clusters
oc = order(km.mod$centers)
# the higher threshold is the halfway point between the top of the middle cluster and the bottom of the highest cluster
hiThresh = 0.5 * (max(v[km.mod$cluster == oc[2]]) + min(v[km.mod$cluster == oc[3]]))

# using upper threshold
imgHi = v
imgHi[imgHi <= hiThresh] = 0
imgHi[imgHi > hiThresh] = 1

return (imgHi)
}

# a function that applies adaptive thresholding
adaptiveThresholding = function(img)
{
img.eb <- Image(t(img))
img.thresholded.3 = thresh(img.eb, 3, 3)
img.thresholded.5 = thresh(img.eb, 5, 5)
img.thresholded.7 = thresh(img.eb, 7, 7)
img.thresholded.9 = thresh(img.eb, 9, 9)
img.thresholded.11 = thresh(img.eb, 11, 11)
img.kmThresh = kmeansThreshold(img)

# combine the adaptive thresholding
ttt.1 = cbind(img2vec(Image2Mat(img.thresholded.3)), img2vec(Image2Mat(img.thresholded.5)), img2vec(Image2Mat(img.thresholded.7)), img2vec(Image2Mat(img.thresholded.9)), img2vec(Image2Mat(img.thresholded.11)), img2vec(kmeansThreshold(img)))
ttt.2 = apply(ttt.1, 1, max)
ttt.3 = matrix(ttt.2, nrow(img), ncol(img))
return (ttt.3)
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

# a function to convert an Image into a matrix
Image2Mat = function(Img)
{
m1 = t(matrix(Img, nrow(Img), ncol(Img)))
return(m1)
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outPath = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model2.csv"
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))
imgY = readPNG(file.path(cleanFolder, f))

# turn the images into vectors
x = img2vec(imgX)
y = img2vec(imgY)

# threshold the image
x2 = kmeansThreshold(imgX)

# adaptive thresholding
x3 = img2vec(adaptiveThresholding(imgX))

dat = data.table(cbind(y, x, x2, x3))
setnames(dat,c("y", "raw", "thresholded", "adaptive"))
write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit an xgboost model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 2000000)
dat[is.na(dat)] = 0
dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
# do cross validation first
xgb.tab = xgb.cv(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = 1000, early.stop.round = 50, nfold = 5, print.every.n = 10)
# what is the best number of rounds?
min.error.idx = which.min(xgb.tab[, test.rmse.mean])
# now fit an xgboost model
xgb.mod = xgboost(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)

# get the predicted result for each image
outModel2 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model2"
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))

# turn the images into vectors
x = img2vec(imgX)

# threshold the image
x2 = kmeansThreshold(imgX)

# adaptive thresholding
x3 = img2vec(adaptiveThresholding(imgX))

dat = data.table(cbind(x, x2, x3))
setnames(dat,c("raw", "thresholded", "adaptive"))

# predicted values
yHat = predict(xgb.mod, newdata=as.matrix(dat))

# constrain the range to be in [0, 1]
yHat = sapply(yHat, function(x) max(min(x, 1),0))

# turn the vector into an image
imgYHat = matrix(yHat, nrow(imgX), ncol(imgX))

# save the predicted value
writePNG(imgYHat, file.path(outModel2, f))
}

Edges

The edge model was developed here, and involves a combination of canny edge detection and image morphology.

20150822 output 2


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, data.table, xgboost, biOps)

# a function to do canny edge detector
cannyEdges = function(img)
{
img.biOps = imagedata(img * 255)
img.canny = imgCanny(img.biOps, 0.7)
return (matrix(img.canny / 255, nrow(img), ncol(img)))
}

# a function combining canny edge detector with morphology
cannyDilated1 = function(img)
{
img.biOps = imagedata(img * 255)
img.canny = imgCanny(img.biOps, 0.7)
# do some morphology on the edges to fill the gaps between them
mat <- matrix (0, 3, 3)
mask <- imagedata (mat, "grey", 3, 3)
img.dilation = imgBinaryDilation(img.canny, mask)
img.erosion = imgBinaryErosion(img.dilation, mask)
return(matrix(img.erosion / 255, nrow(img), ncol(img)))
}

# a function combining canny edge detector with morphology
cannyDilated2 = function(img)
{
img.biOps = imagedata(img * 255)
img.canny = imgCanny(img.biOps, 0.7)
# do some morphology on the edges to fill the gaps between them
mat <- matrix (0, 3, 3)
mask <- imagedata (mat, "grey", 3, 3)
img.dilation = imgBinaryDilation(img.canny, mask)
img.erosion = imgBinaryErosion(img.dilation, mask)
img.erosion.2 = imgBinaryErosion(img.erosion, mask)
img.dilation.2 = imgBinaryDilation(img.erosion.2, mask)
return(matrix(img.dilation.2 / 255, nrow(img), ncol(img)))
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outPath = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model3.csv"
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))
imgY = readPNG(file.path(cleanFolder, f))

# turn the images into vectors
x = img2vec(imgX)
y = img2vec(imgY)

# canny edge detector and related features
x4 = img2vec(cannyEdges(imgX))
x5 = img2vec(cannyDilated1(imgX))
x6 = img2vec(cannyDilated2(imgX))

dat = data.table(cbind(y, x, x4, x5, x6))
setnames(dat,c("y", "raw", "canny", "cannyDilated1", "cannyDilated2"))
write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit an xgboost model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 2000000)
dat[is.na(dat)] = 0
dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
# do cross validation first
xgb.tab = xgb.cv(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = 1000, early.stop.round = 50, nfold = 5, print.every.n = 10)
# what is the best number of rounds?
min.error.idx = which.min(xgb.tab[, test.rmse.mean])
# now fit an xgboost model
xgb.mod = xgboost(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)

# get the predicted result for each image
outModel3 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model3"
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))

# turn the images into vectors
x = img2vec(imgX)

# canny edge detector and related features
x4 = img2vec(cannyEdges(imgX))
x5 = img2vec(cannyDilated1(imgX))
x6 = img2vec(cannyDilated2(imgX))

dat = data.table(cbind(x, x4, x5, x6))
setnames(dat,c("raw", "canny", "cannyDilated1", "cannyDilated2"))

# predicted values
yHat = predict(xgb.mod, newdata=as.matrix(dat))

# constrain the range to be in [0, 1]
yHat = sapply(yHat, function(x) max(min(x, 1),0))

# turn the vector into an image
imgYHat = matrix(yHat, nrow(imgX), ncol(imgX))

# save the predicted value
writePNG(imgYHat, file.path(outModel3, f))
}

Background Removal

The background removal model was developed here, and involves the use of median filters. An xgboost model is used to combine the median filter results.

20150829 output 1


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, data.table, xgboost)
# a function to do a median filter
median_Filter = function(img, filterWidth)
{
pad = floor(filterWidth / 2)
padded = matrix(NA, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

tab = matrix(NA, nrow(img) * ncol(img), filterWidth ^ 2)
k = 1
for (i in seq_len(filterWidth))
{
for (j in seq_len(filterWidth))
{
tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
k = k + 1
}
}

filtered = unlist(apply(tab, 1, function(x) median(x, na.rm = TRUE)))
return (matrix(filtered, nrow(img), ncol(img)))
}

# a function that uses median filter to get the background then finds the dark foreground
background_Removal = function(img)
{
w = 5

# the background is found via a median filter
background = median_Filter(img, w)

# the foreground is darker than the background
foreground = img - background
foreground[foreground > 0] = 0
m1 = min(foreground)
m2 = max(foreground)
foreground = (foreground - m1) / (m2 - m1)

return (matrix(foreground, nrow(img), ncol(img)))
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outPath = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model4.csv"
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))
imgY = readPNG(file.path(cleanFolder, f))

# turn the images into vectors
x = img2vec(imgX)
y = img2vec(imgY)

# median filter and related features
x7a = img2vec(median_Filter(imgX, 5))
x7b = img2vec(median_Filter(imgX, 11))
x7c = img2vec(median_Filter(imgX, 17))
x8 = img2vec(background_Removal(imgX))

dat = data.table(cbind(y, x, x7a, x7b, x7c, x8))
setnames(dat,c("y", "raw", "median5", "median11", "median17", "background"))
write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit an xgboost model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 2000000)
dat[is.na(dat)] = 0
dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
# do cross validation first
xgb.tab = xgb.cv(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = 1000, early.stop.round = 50, nfold = 5, print.every.n = 10)
# what is the best number of rounds?
min.error.idx = which.min(xgb.tab[, test.rmse.mean])
# now fit an xgboost model
xgb.mod = xgboost(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)

# get the predicted result for each image
outModel4 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model4"
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))

# turn the images into vectors
x = img2vec(imgX)

# median filter and related features
x7a = img2vec(median_Filter(imgX, 5))
x7b = img2vec(median_Filter(imgX, 11))
x7c = img2vec(median_Filter(imgX, 17))
x8 = img2vec(background_Removal(imgX))

dat = data.table(cbind(x, x7a, x7b, x7c, x8))
setnames(dat,c("raw", "median5", "median11", "median17", "background"))

# predicted values
yHat = predict(xgb.mod, newdata=as.matrix(dat))

# constrain the range to be in [0, 1]
yHat = sapply(yHat, function(x) max(min(x, 1),0))

# turn the vector into an image
imgYHat = matrix(yHat, nrow(imgX), ncol(imgX))

# save the predicted value
writePNG(imgYHat, file.path(outModel4, f))
}

Nearby Pixels

The spatial model was developed here, and involves the use of a sliding window and xgboost.

20150904 output 7


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, data.table, xgboost)

# a function that groups together the pixels contained within a sliding window around each pixel of interest
proximalPixels = function(img)
{
 pad = 2
 width = 2 * pad + 1
 padded = matrix(median(img), nrow(img) + 2 * pad, ncol(img) + 2 * pad)
 padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img
 
 tab = matrix(1, nrow(img) * ncol(img), width ^ 2)
 k = 1
 for (i in seq_len(width))
 {
 for (j in seq_len(width))
 {
 tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
 k = k + 1
 }
 }

 return (tab)
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
 return (matrix(img, nrow(img) * ncol(img), 1))
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outPath = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model5.csv"
filenames = list.files(dirtyFolder)
for (f in filenames)
{
 print(f)
 imgX = readPNG(file.path(dirtyFolder, f))
 imgY = readPNG(file.path(cleanFolder, f))
 
 # turn the images into vectors
 #x = img2vec(imgX)
 y = img2vec(imgY)
 
 # surrounding pixels
 x9 = proximalPixels(imgX)

 dat = data.table(cbind(y, x9))
 setnames(dat,append("y", paste("x", 1:25, sep="")))
 write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit an xgboost model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 2000000)
dat[is.na(dat)] = 0
dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
# do cross validation first
xgb.tab = xgb.cv(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = 1000, early.stop.round = 50, nfold = 5, print.every.n = 10)
# what is the best number of rounds?
min.error.idx = which.min(xgb.tab[, test.rmse.mean]) 
# now fit an xgboost model 
xgb.mod = xgboost(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)


# get the predicted result for each image
outModel5 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model5"
for (f in filenames)
{
 print(f)
 imgX = readPNG(file.path(dirtyFolder, f))
 
 # turn the images into vectors
 #x = img2vec(imgX) 
 
 # surrounding pixels
 x9 = proximalPixels(imgX)

 dat = data.table(x9)
 setnames(dat,paste("x", 1:25, sep=""))
 
 # predicted values
 yHat = predict(xgb.mod, newdata=as.matrix(dat))
 
 # constrain the range to be in [0, 1]
 yHat = sapply(yHat, function(x) max(min(x, 1),0))
 
 # turn the vector into an image
 imgYHat = matrix(yHat, nrow(imgX), ncol(imgX))
 
 # save the predicted value
 writePNG(imgYHat, file.path(outModel5, f))
}

Combining the Individual Models

There are many ways that the predictions of the models could be ensembled together. I have used an xgboost model because I want to allow for the complexity of the problem that is being solved.
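For comparison, here is a minimal sketch (not part of my submission) of a much simpler ensemble: once the stacking data table dat has been built by the script below, the five model predictions can be blended with an ordinary linear model instead of xgboost.

# blend the five component models with a linear model, then constrain the predictions to [0, 1]
blend = lm(y ~ linear + thresholding + edges + backgroundRemoval + proximal, data = dat)
yHatBlend = pmin(pmax(predict(blend, newdata = dat), 0), 1)
sqrt(mean((yHatBlend - dat$y) ^ 2)) # training RMSE of the simple blend

A linear blend cannot capture interactions between the component models, which is the sort of complexity that the xgboost stacker is intended to allow for.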

# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, data.table, xgboost)

# a function to turn a matrix image into a vector
img2vec = function(img)
{
 return (matrix(img, nrow(img) * ncol(img), 1))
}

inPath1 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model1"
inPath2 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model2"
inPath3 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model3"
inPath4 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model4"
inPath5 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model5"

cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outPath = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\stacking.csv"

filenames = list.files(inPath1)
for (f in filenames)
{
 print(f)
 imgX1 = readPNG(file.path(inPath1, f))
 imgX2 = readPNG(file.path(inPath2, f))
 imgX3 = readPNG(file.path(inPath3, f))
 imgX4 = readPNG(file.path(inPath4, f))
 imgX5 = readPNG(file.path(inPath5, f))
 imgY = readPNG(file.path(cleanFolder, f))
 
 # turn the images into vectors
 #x = img2vec(imgX)
 y = img2vec(imgY)
 
 # contributing models
 x1 = img2vec(imgX1)
 x2 = img2vec(imgX2)
 x3 = img2vec(imgX3)
 x4 = img2vec(imgX4)
 x5 = img2vec(imgX5)

 dat = data.table(cbind(y, x1, x2, x3, x4, x5))
 setnames(dat,c("y", "linear", "thresholding", "edges", "backgroundRemoval", "proximal"))
 write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit an xgboost model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 5000000)
dat[is.na(dat)] = 0
dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
# do cross validation first
xgb.tab = xgb.cv(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = 1000, early.stop.round = 50, nfold = 5, print.every.n = 10)
# what is the best number of rounds?
min.error.idx = which.min(xgb.tab[, test.rmse.mean]) 
# now fit an xgboost model 
xgb.mod = xgboost(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)

# get the trained model
model = xgb.dump(xgb.mod, with.stats=TRUE)
# get the feature real names
names = names(dat)[-1]
# compute feature importance matrix
importance_matrix = xgb.importance(names, model=xgb.mod)
# plot the variable importance
gp = xgb.plot.importance(importance_matrix)
print(gp)

# get the predicted result for each image
outModel = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\stacked"
for (f in filenames)
{
 print(f)
 imgX1 = readPNG(file.path(inPath1, f))
 imgX2 = readPNG(file.path(inPath2, f))
 imgX3 = readPNG(file.path(inPath3, f))
 imgX4 = readPNG(file.path(inPath4, f))
 imgX5 = readPNG(file.path(inPath5, f))
 
 # contributing models
 x1 = img2vec(imgX1)
 x2 = img2vec(imgX2)
 x3 = img2vec(imgX3)
 x4 = img2vec(imgX4)
 x5 = img2vec(imgX5)

 dat = data.table(cbind(x1, x2, x3, x4, x5))
 setnames(dat,c("linear", "thresholding", "edges", "backgroundRemoval", "proximal"))
 
 # predicted values
 yHat = predict(xgb.mod, newdata=as.matrix(dat))
 
 # constrain the range to be in [0, 1]
 yHat = sapply(yHat, function(x) max(min(x, 1),0))
 
 # turn the vector into an image
 imgYHat = matrix(yHat, nrow(imgX1), ncol(imgX1))
 
 # save the predicted value
 writePNG(imgYHat, file.path(outModel, f))
}

20150923 output 1
The graph above answers Bobby’s question – when considered together, the nearby pixels provide most of the additional predictive power beyond that of the raw pixel brightness. It seems that the key to separating dark text from noise is to consider how the pixel fits into the local region within the image.
Now that we have set up a stacking model structure, we can recommence the process of adding more image processing and features to the model. And that’s what the next blog will do…


Denoising Dirty Documents: Part 6

07 Monday Sep 2015

Posted by Colin Priest in Image Processing, Kaggle, Machine Learning, Median Filter, R, Variable Importance, XGBoost

≈ 7 Comments

Tags

Image Processing, Kaggle, Machine Learning, Median Filter, R, Variable Importance, XGBoost

So far in this series of blogs we have used image processing techniques to improve the images, and then ensembled together the results of that image processing using GBM or XGBoost. But I have noticed that some competitors have achieved reasonable results using purely machine learning approaches. While these pure machine learning approaches aren’t enough for the competitors to get to the top of the leader board, they have outperformed some of the models that I have presented in this series of blogs. However these scripts were invariably written in Python and I thought that it would be great to see how to use R to build a similar type of model (except better, because we will include all of the image processing predictors that we have developed so far). So today we will add a brute-force machine learning approach to our model.

At the time of writing this blog, the top-ranked competitor who has shared their script is placed 17th with an RMSE of 2.6%. While I am not very experienced in Python, I can usually figure out what a Python script is doing if I read it. So here's what I think that script is doing:

  1. pad out each image by an extra 2 pixels (see my last blog for how to pad out an image in R)
  2. run a 3×3 sliding window along the image pixels (see my last blog for how to create a sliding window in R)
  3. use all 9 pixels within the sliding window as predictors for the pixel in the centre of the sliding window
  4. use a random forest model to predict the pixel brightnesses

This is a pure machine learning approach because it doesn’t do any image processing to pre-process the image. It simply says that if you want to predict a particular pixel brightness, then you should probably look around at the brightnesses of the pixels either side of that pixel.


While I don’t want to fuel the R versus Python language wars, I do want to create a model in R that can outperform my competitors. So instead of using a 3×3 sliding window, I will use a 5×5 sliding window. Because the random forest implementation in R tends to run out of RAM on my computer, I will use XGBoost.


Here is how I create a 5×5 sliding window.


# read in a sample image
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")

# create a padded image within which we will embed our source image
pad = 2
width = 2 * pad + 1
padded = matrix(1, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

# create a matrix of predictor values - each column is a pixel from the sliding window
tab = NULL
for (i in seq_len(width))
{
for (j in seq_len(width))
{
if (i == 1 && j == 1)
{
tab = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
} else {
tab = cbind(tab, img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))]))
}
}
}

head(tab[,1:4])

20150904 output 1

When I noticed that this script ran a bit slowly, I looked at how I wrote the loops. R does not like manual looping, and it is very inefficient at appending data. So I rewrote the script to pre-allocate space for all of the cells rather than appending each column as it was calculated.


pad = 2
width = 2 * pad + 1
padded = matrix(1, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

tab = matrix(1, nrow(img) * ncol(img), width ^ 2)
k = 1
for (i in seq_len(width))
{
for (j in seq_len(width))
{
tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
k = k + 1
}
}

This modification gave me a tenfold improvement in speed! It’s a reminder that I shouldn’t be lazy when writing R scripts.
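If you want to see the difference for yourself, here is a minimal sketch (not from the original script) that times the two loop styles with system.time(), assuming img and the img2vec helper from the earlier blogs are already loaded:

pad = 2
width = 2 * pad + 1
padded = matrix(1, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

# appending one column at a time
system.time({
tab1 = NULL
for (i in seq_len(width))
 for (j in seq_len(width))
 tab1 = cbind(tab1, img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))]))
})

# pre-allocating the full matrix
system.time({
tab2 = matrix(1, nrow(img) * ncol(img), width ^ 2)
k = 1
for (i in seq_len(width))
 for (j in seq_len(width))
 {
 tab2[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
 k = k + 1
 }
})

identical(tab1, tab2) # same result, very different run time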

There’s one small issue to consider before we put this all together for a new more powerful predictive model. In the code above I have used a brightness value of 1 when padding out the image, but that doesn’t look natural. The image below shows a clear white border around the image. This could confuse the machine learning algorithm, as it could waste time learning what to do with pure white pixels.

20150904 output 2

So instead it makes sense to pad out the image with a background brightness. One way to do this is to replicate the edge of the original image.


pad = 2
width = 2 * pad + 1
padded = matrix(1, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img
# fill in the padded edges
padded[,1:pad] = padded[,pad + 1]
padded[,ncol(img) + pad + 1:pad] = padded[,ncol(padded) - pad]
padded[1:pad,] = padded[pad + 1,]
padded[nrow(img) + pad + 1:pad,] = padded[nrow(padded) - pad,]

20150904 output 3

This looks more natural, except where there is writing at the edge of the image. Another way is to pad out the pixels using the median of the entire image.


pad = 2
width = 2 * pad + 1
padded = matrix(median(img), nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

20150904 output 4

I haven’t tested which approach works best. So I leave it to the reader to compare the results from the three different padding approaches, and use whichever gives the best result (although I suspect that there won’t be much difference between the second and third approaches).

Let’s pull it all together, and add the surrounding pixels as a predictor to a more complete model.


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, data.table, gbm, foreach, doSNOW, biOps, xgboost, Ckmeans.1d.dp)

if (!require("EBImage"))
{
source("http://bioconductor.org/biocLite.R")
biocLite("EBImage")
}

# a function to do k-means thresholding
kmeansThreshold = function(img)
{
# fit 3 clusters
v = img2vec(img)
km.mod = kmeans(v, 3)
# allow for the random ordering of the clusters
oc = order(km.mod$centers)
# the higher threshold is the halfway point between the top of the middle cluster and the bottom of the highest cluster
hiThresh = 0.5 * (max(v[km.mod$cluster == oc[2]]) + min(v[km.mod$cluster == oc[3]]))

# using upper threshold
imgHi = v
imgHi[imgHi <= hiThresh] = 0
imgHi[imgHi > hiThresh] = 1

return (imgHi)
}

# a function that applies adaptive thresholding
adaptiveThresholding = function(img)
{
 img.eb <- Image(t(img))
 img.thresholded.3 = thresh(img.eb, 3, 3)
 img.thresholded.5 = thresh(img.eb, 5, 5)
 img.thresholded.7 = thresh(img.eb, 7, 7)
 img.thresholded.9 = thresh(img.eb, 9, 9)
 img.thresholded.11 = thresh(img.eb, 11, 11)
 img.kmThresh = kmeansThreshold(img)

 # combine the adaptive thresholding
 ttt.1 = cbind(img2vec(Image2Mat(img.thresholded.3)), img2vec(Image2Mat(img.thresholded.5)), img2vec(Image2Mat(img.thresholded.7)), img2vec(Image2Mat(img.thresholded.9)), img2vec(Image2Mat(img.thresholded.11)), img2vec(kmeansThreshold(img)))
 ttt.2 = apply(ttt.1, 1, max)
 ttt.3 = matrix(ttt.2, nrow(img), ncol(img))
 return (ttt.3)
}

# the other helper functions used below (img2vec, Image2Mat, cannyEdges, cannyDilated1,
# cannyDilated2, median_Filter and background_Removal) are the same as those defined in
# the earlier blogs in this series (see Part 5 below)

# a function that groups together the pixels contained within a sliding window around each pixel of interest
proximalPixels = function(img)
{
pad = 2
width = 2 * pad + 1
padded = matrix(median(img), nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

tab = matrix(1, nrow(img) * ncol(img), width ^ 2)
k = 1
for (i in seq_len(width))
{
for (j in seq_len(width))
{
tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
k = k + 1
}
}

return (tab)
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_predicted"

outPath = file.path(outFolder, "trainingdata_blog6.csv")
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))
imgY = readPNG(file.path(cleanFolder, f))

# turn the images into vectors
x = matrix(imgX, nrow(imgX) * ncol(imgX), 1)
y = matrix(imgY, nrow(imgY) * ncol(imgY), 1)

# threshold the image
x2 = kmeansThreshold(imgX)

# adaptive thresholding
x3 = img2vec(adaptiveThresholding(imgX))

# canny edge detector and related features
x4 = img2vec(cannyEdges(imgX))
x5 = img2vec(cannyDilated1(imgX))
x6 = img2vec(cannyDilated2(imgX))

# median filter and related features
x7 = img2vec(median_Filter(imgX, 17))
x8 = img2vec(background_Removal(imgX))

# surrounding pixels
x9 = proximalPixels(imgX)

dat = data.table(cbind(y, x, x2, x3, x4, x5, x6, x7, x8, x9))
setnames(dat,append(c("y", "raw", "thresholded", "adaptive", "canny", "cannyDilated1", "cannyDilated2", "median17", "backgroundRemoval"), paste("x", 1:25, sep="")))
write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit an xgboost model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 2000000)
dat[is.na(dat)] = 0
dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
# do cross validation first
xgb.tab = xgb.cv(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = 1000, early.stop.round = 50, nfold = 5, print.every.n = 10)
# what is the best number of rounds?
min.error.idx = which.min(xgb.tab[, test.rmse.mean])
# now fit an xgboost model
xgb.mod = xgboost(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)

# get the predictions and score the model on the training data
dtrainFull <- xgb.DMatrix(as.matrix(dat[,-1]), label = as.matrix(dat[,1]))
yHat = predict(xgb.mod, newdata=dtrainFull)
rmse = sqrt(mean( (yHat - dat$y) ^ 2 ))
print(rmse)

# compute the feature importance matrix
model = xgb.dump(xgb.mod, with.stats=TRUE)
names = names(dat)[-1]
importance_matrix = xgb.importance(names, model=xgb.mod)

# show the predicted result for a sample image
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")
x = data.table(cbind(img2vec(img), kmeansThreshold(img), img2vec(adaptiveThresholding(img)), img2vec(cannyEdges(img)), img2vec(cannyDilated1(img)), img2vec(cannyDilated2(img)), img2vec(median_Filter(img, 17)), img2vec(background_Removal(img)), proximalPixels(img)))
setnames(x, append(c("raw", "thresholded", "adaptive", "canny", "cannyDilated1", "cannyDilated2", "median17", "backgroundRemoval"), paste("x", 1:25, sep="")))
yHatImg = predict(xgb.mod, newdata=as.matrix(x))
yHatImg[yHatImg < 0] = 0
yHatImg[yHatImg > 1] = 1
imgOut = matrix(yHatImg, nrow(img), ncol(img))
writePNG(imgOut, "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")
plot(raster(imgOut))

20150904 output 5

None of the individual pixels that we added have strong predictive powers, yet our RMSE on the training data dropped from 2.4% for the last blog’s model to 1.4% for this model! That’s because it is the combination of nearby pixels that are predictive, not just any individual pixel. I haven’t scored this model on the test set, but I suspect that it will get you a good ranking.
20150904 output 6

There is hardly any trace of the coffee cup stain remaining. This is a pretty good candidate model.

In order to understand the effect of the nearby pixels, it is useful to visualise their variable importance in a grid.

pixelNames = paste("x", 1:25, sep="")
pixelNames[13] = "raw"
grid = t(matrix(sapply(1:25, FUN = function(x) importance_matrix$Gain[importance_matrix$Feature == pixelNames[x]]), 5, 5))
grid[3, 3] = NA
plot(raster(grid))

20150904 output 7
This graphic shows that we didn’t need all of the surrounding pixels to create a good predictive model. The pixels that don’t lie on the same row or column as the target pixel aren’t as important. If we wanted to expand beyond the 3×3 sliding window used by our competitors, then we didn’t need to add all of the extra pixels. We could have just added the pixels at (1, 3) and (3, 1) and (3, 5) and (5, 3).
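Here is a minimal sketch (not something I used in the competition) of such a leaner feature set, keeping only the pixels in the same row or column as the target pixel; it assumes the img2vec helper and the median padding shown earlier, and crossPixels is just a name I have made up.

crossPixels = function(img, pad = 2)
{
 padded = matrix(median(img), nrow(img) + 2 * pad, ncol(img) + 2 * pad)
 padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

 # offsets relative to the centre pixel: same column (i varies) and same row (j varies)
 offsets = rbind(expand.grid(i = -pad:pad, j = 0), expand.grid(i = 0, j = (-pad:pad)[-(pad + 1)]))
 tab = sapply(seq_len(nrow(offsets)), function(k) img2vec(padded[pad + offsets$i[k] + seq_len(nrow(img)), pad + offsets$j[k] + seq_len(ncol(img))]))
 return (tab)
}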

The model that we developed in this blog is pushing the boundaries of what my PC can do. Without access to very powerful computers and / or a cluster, we need to use a different approach if we want to improve on this blog’s model.


Denoising Dirty Documents : Part 5

28 Friday Aug 2015

Posted by Colin Priest in Background Removal, Kaggle, Median Filter, R, Variable Importance

≈ 12 Comments

Tags

Background Removal, Kaggle, Median Filter, R, Variable Importance

In my last blog we had faded the coffee cup stains, but there was more work to be done. So far we had used adaptive thresholding and edge detection. Today we will use median filters and background removal.


A median filter is an image filter that replaces a pixel with the median value of the pixels surrounding it. In doing this, it smoothes the image, and the result is often thought of as the “background” of the image, since it tends to wipe away small features, but maintains broad features.

While the biOps package has a median filter implementation, it isn’t difficult to write a function to do that ourselves, and it can be quite instructive to see how a median filter works.

median_Filter = function(img, filterWidth)
{
 pad = floor(filterWidth / 2)
 padded = matrix(NA, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
 padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img
 
 tab = NULL
 for (i in seq_len(filterWidth))
 {
 for (j in seq_len(filterWidth))
 {
 if (i == 1 && j == 1)
 {
 tab = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
 } else {
 tab = cbind(tab, img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))]))
 }
 }
 }
 
 filtered = unlist(apply(tab, 1, function(x) median(x[!is.na(x)])))
 return (matrix(filtered, nrow(img), ncol(img)))
}

# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster)

# read in the coffee cup stain image
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")

# use the median filter and save the result
filtered = median_Filter(img, 17)
writePNG(filtered, "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")

20150829 output 1

What we get after applying a median filter is something that looks like the background of the image. It contains the coffee cup stains and also the shade of the paper upon which the writing appears. With a filter width of 17, the writing has almost entirely faded away.
I didn’t choose that filter width of 17 randomly. It was the result of running models with several different filter widths and seeing which had the best predictive powers.
Median filters are not fast, especially once the filter width increases. This function runs with similar speed to that in the biOps package. I shall leave the task of writing a parallel processing version of the median filter function to the more impatient readers.
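For those impatient readers, here is a minimal sketch (not from the original post) of one way to parallelise it with the parallel package: build the same sliding-window table, then spread the row-wise medians across a cluster. parallel_median_Filter and nCores are names I have made up, and img2vec is assumed to be defined as in the earlier blogs.

library(parallel)

parallel_median_Filter = function(img, filterWidth, nCores = 4)
{
 pad = floor(filterWidth / 2)
 padded = matrix(NA, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
 padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

 tab = matrix(NA, nrow(img) * ncol(img), filterWidth ^ 2)
 k = 1
 for (i in seq_len(filterWidth))
 {
 for (j in seq_len(filterWidth))
 {
 tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
 k = k + 1
 }
 }

 # compute the median of each sliding window in parallel across the cluster
 cl = makeCluster(nCores)
 filtered = parApply(cl, tab, 1, function(x) median(x, na.rm = TRUE))
 stopCluster(cl)
 return (matrix(filtered, nrow(img), ncol(img)))
}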

While we now have the background, what we really wanted was the foreground – the writing, without the coffee cup stains. The foreground is the difference between the original image and the background. But in this case we know that the writing is always darker than the background, so our foreground should only show pixels that are darker than the background. I have also rescaled the result to lie in the interval [0, 1]. Here is an R script to implement background removal.

# a function that uses median filter to get the background then finds the dark foreground
background_Removal = function(img)
{
 w = 5

 # the background is found via a median filter
 background = median_Filter(img, w)

 # the foreground is darker than the background
 foreground = img - background
 foreground[foreground > 0] = 0
 m1 = min(foreground)
 m2 = max(foreground)
 foreground = (foreground - m1) / (m2 - m1)
 
 return (matrix(foreground, nrow(img), ncol(img)))
}

# run the background removal function and save the result
foreground = background_Removal(img)
writePNG(foreground, "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")

20150829 output 2

While it’s not perfect, the resulting filtered image has done a reasonably good job of separating the writing from the background. It is reasonable to expect that it will be a useful predictor in our model.
This time a filter width of 5 was chosen purely for the purpose of speed. You could have used a grid search to find the “best” filter width parameter.

# read in the coffee cup stain image
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")
imgClean = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned\\3.png")

bestRMSE = 1
bestWidth = 5

widths = c(3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25)

for (w in widths)
{
 # the background is found via a median filter
 background = median_Filter(img, w)

 # the foreground is darker than the background
 foreground = img - background
 foreground[foreground > 0] = 0
 m1 = min(foreground)
 m2 = max(foreground)
 foreground = (foreground - m1) / (m2 - m1)
 
 # score the result
 rmse = sqrt(mean( (foreground - imgClean) ^ 2 ))
 if (rmse < bestRMSE)
 {
 bestRMSE = rmse
 bestWidth = w
 print(c(bestWidth, rmse))
 }
}

In the past few blogs I have used the GBM package to create a predictive model. But as we have added more predictors or features, it has started to take a long time to fit the model and to calculate the predictions. So today I’m switching to the xgboost package because the top Kagglers are using it (Owen Zhang, Kaggle’s top ranked competitor says “when in doubt use xgboost“), and because it runs much faster than GBM. I’d never used xgboost until this week, and I must say that I’m quite impressed with its speed.

Here is an R script that fits an xgboost model, using all of the features that we have come up with over the 5 blogs to date.

# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, data.table, gbm, foreach, doSNOW, biOps, xgboost, Ckmeans.1d.dp)

if (!require("EBImage"))
{
 source("http://bioconductor.org/biocLite.R")
 biocLite("EBImage")
}

# a function to do k-means thresholding
kmeansThreshold = function(img)
{
 # fit 3 clusters
 v = img2vec(img)
 km.mod = kmeans(v, 3)
 # allow for the random ordering of the clusters
 oc = order(km.mod$centers)
 # the higher threshold is the halfway point between the top of the middle cluster and the bottom of the highest cluster
 hiThresh = 0.5 * (max(v[km.mod$cluster == oc[2]]) + min(v[km.mod$cluster == oc[3]]))
 
 # using upper threshold
 imgHi = v
 imgHi[imgHi <= hiThresh] = 0
 imgHi[imgHi > hiThresh] = 1
 
 return (imgHi)
}

# a function that applies adaptive thresholding
adaptiveThresholding = function(img)
{
 img.eb <- Image(t(img))
 img.thresholded.3 = thresh(img.eb, 3, 3)
 img.thresholded.5 = thresh(img.eb, 5, 5)
 img.thresholded.7 = thresh(img.eb, 7, 7)
 img.thresholded.9 = thresh(img.eb, 9, 9)
 img.thresholded.11 = thresh(img.eb, 11, 11)
 img.kmThresh = kmeansThreshold(img)

 # combine the adaptive thresholding
 ttt.1 = cbind(img2vec(Image2Mat(img.thresholded.3)), img2vec(Image2Mat(img.thresholded.5)), img2vec(Image2Mat(img.thresholded.7)), img2vec(Image2Mat(img.thresholded.9)), img2vec(Image2Mat(img.thresholded.11)), img2vec(kmeansThreshold(img)))
 ttt.2 = apply(ttt.1, 1, max)
 ttt.3 = matrix(ttt.2, nrow(img), ncol(img))
 return (ttt.3)
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
 return (matrix(img, nrow(img) * ncol(img), 1))
}

# a function to convert an Image into a matrix
Image2Mat = function(Img)
{
 m1 = t(matrix(Img, nrow(Img), ncol(Img)))
 return(m1)
}

# a function to do canny edge detector
cannyEdges = function(img)
{
 img.biOps = imagedata(img * 255)
 img.canny = imgCanny(img.biOps, 0.7)
 return (matrix(img.canny / 255, nrow(img), ncol(img)))
}

# a function combining canny edge detector with morphology
cannyDilated1 = function(img)
{
 img.biOps = imagedata(img * 255)
 img.canny = imgCanny(img.biOps, 0.7)
 # do some morphology on the edges to fill the gaps between them
 mat <- matrix (0, 3, 3)
 mask <- imagedata (mat, "grey", 3, 3)
 img.dilation = imgBinaryDilation(img.canny, mask)
 img.erosion = imgBinaryErosion(img.dilation, mask)
 return(matrix(img.erosion / 255, nrow(img), ncol(img)))
}

# a function combining canny edge detector with morphology
cannyDilated2 = function(img)
{
 img.biOps = imagedata(img * 255)
 img.canny = imgCanny(img.biOps, 0.7)
 # do some morphology on the edges to fill the gaps between them
 mat <- matrix (0, 3, 3)
 mask <- imagedata (mat, "grey", 3, 3)
 img.dilation = imgBinaryDilation(img.canny, mask)
 img.erosion = imgBinaryErosion(img.dilation, mask)
 img.erosion.2 = imgBinaryErosion(img.erosion, mask)
 img.dilation.2 = imgBinaryDilation(img.erosion.2, mask)
 return(matrix(img.dilation.2 / 255, nrow(img), ncol(img)))
}

# a function to do a median filter
median_Filter = function(img, filterWidth)
{
 pad = floor(filterWidth / 2)
 padded = matrix(NA, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
 padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

 tab = NULL
 for (i in seq_len(filterWidth))
 {
 for (j in seq_len(filterWidth))
 {
 if (i == 1 && j == 1)
 {
 tab = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
 } else {
 tab = cbind(tab, img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))]))
 }
 }
 }

 filtered = unlist(apply(tab, 1, function(x) median(x[!is.na(x)])))
 return (matrix(filtered, nrow(img), ncol(img)))
}

# a function that uses median filter to get the background then finds the dark foreground
background_Removal = function(img)
{
 w = 5

 # the background is found via a median filter
 background = median_Filter(img, w)

 # the foreground is darker than the background
 foreground = img - background
 foreground[foreground > 0] = 0
 m1 = min(foreground)
 m2 = max(foreground)
 foreground = (foreground - m1) / (m2 - m1)
 
 return (matrix(foreground, nrow(img), ncol(img)))
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_predicted"
 
outPath = file.path(outFolder, "trainingdata_blog5.csv")
filenames = list.files(dirtyFolder)
for (f in filenames)
{
 print(f)
 imgX = readPNG(file.path(dirtyFolder, f))
 imgY = readPNG(file.path(cleanFolder, f))
 
 # turn the images into vectors
 x = matrix(imgX, nrow(imgX) * ncol(imgX), 1)
 y = matrix(imgY, nrow(imgY) * ncol(imgY), 1)
 
 # threshold the image
 x2 = kmeansThreshold(imgX)

 # adaptive thresholding
 x3 = img2vec(adaptiveThresholding(imgX))
 
 # canny edge detector and related features
 x4 = img2vec(cannyEdges(imgX))
 x5 = img2vec(cannyDilated1(imgX))
 x6 = img2vec(cannyDilated2(imgX))
 
 # median filter and related features
 x7 = img2vec(median_Filter(imgX, 17))
 x8 = img2vec(background_Removal(imgX))

 dat = data.table(cbind(y, x, x2, x3, x4, x5, x6, x7, x8))
 setnames(dat,c("y", "raw", "thresholded", "adaptive", "canny", "cannyDilated1", "cannyDilated2", "median17", "backgroundRemoval"))
 write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit an xgboost model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 2000000)
dat[is.na(dat)] = 0
dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
# do cross validation first
xgb.tab = xgb.cv(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = 10000, early.stop.round = 50, nfold = 5, print.every.n = 10)
# what is the best number of rounds?
min.error.idx = which.min(xgb.tab[, test.rmse.mean]) 
# now fit an xgboost model 
xgb.mod = xgboost(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)

# get the predictions
dtrainFull <- xgb.DMatrix(as.matrix(dat[,-1]), label = as.matrix(dat[,1]))
yHat = predict(xgb.mod, newdata=dtrainFull)
# what score do we get on the training data?
rmse = sqrt(mean( (yHat - dat$y) ^ 2 ))
print(rmse) # 2.4% vs 4.1%

# get the trained model
model = xgb.dump(xgb.mod, with.stats=TRUE)
# get the feature real names
names = names(dat)[-1]
# compute feature importance matrix
importance_matrix = xgb.importance(names, model=xgb.mod)
# plot the variable importance
gp = xgb.plot.importance(importance_matrix)
print(gp)


# show the predicted result for a sample image
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")
x = data.table(matrix(img, nrow(img) * ncol(img), 1), kmeansThreshold(img), img2vec(adaptiveThresholding(img)), img2vec(cannyEdges(img)), 
 img2vec(cannyDilated1(img)), img2vec(cannyDilated2(img)),img2vec(median_Filter(img, 17)), img2vec(background_Removal(img)) )
setnames(x, c("raw", "thresholded", "adaptive", "canny", "cannyDilated1", "cannyDilated2", "median17", "backgroundRemoval"))
yHatImg = predict(xgb.mod, newdata=as.matrix(x))
yHatImg[yHatImg < 0] = 0
yHatImg[yHatImg > 1] = 1
imgOut = matrix(yHatImg, nrow(img), ncol(img))
writePNG(imgOut, "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")
plot(raster(imgOut))

[Figure: xgboost variable importance scores for the candidate features]

Both the background removal and the median filter features are important, with even higher importance scores than the canny edge detector features from the last blog.
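
If you prefer numbers to a chart, here is a minimal sketch that prints the same information from the importance_matrix computed above (xgb.importance returns one row per feature, with its gain, cover and frequency):

# numeric view of the importance chart (a sketch; importance_matrix comes from xgb.importance() above)
library(data.table)
print(importance_matrix[order(-Gain)])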

[Figure: predicted (cleaned) version of a sample training image]

The coffee cup stain has been mostly removed, especially the speckles that we saw after applying the model in the last blog.
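
For a quick visual comparison, a sketch along these lines (reusing img and imgOut from the code above, and assuming the raster package is installed) plots the dirty input next to the model output:

# side-by-side visual check (a sketch; img and imgOut come from the code above)
library(raster)
par(mfrow = c(1, 2))
plot(raster(img), main = "dirty input", col = grey(seq(0, 1, length.out = 256)))
plot(raster(imgOut), main = "model output", col = grey(seq(0, 1, length.out = 256)))
par(mfrow = c(1, 1))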

The result represents an improvement in RMSE score on the training data from 4.1% in the last blog to 2.4% in this blog. In the next blog we will further improve on this score.
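
To put those percentages in context, here is a minimal sketch that compares the model's training RMSE with the RMSE you would get by submitting the raw dirty pixels unchanged, using the dat and rmse objects created above:

# baseline: RMSE of the raw dirty pixels against the clean target (a sketch using dat from above)
rmseRaw = sqrt(mean((dat$raw - dat$y) ^ 2))
print(rmseRaw)
# compare with the model's training RMSE computed earlier
print(rmse)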
