Tags

Automation, Image Processing, Kaggle, Medical Imaging, R

In Part 1 of this blog series, I described how to fix the brightness and contrast of the MRI images. In this blog we finish cleaning up the input data.

Other than brightness and contrast, we need to fix up the following problems:

different image sizes
different image rotations – portrait versus landscape
different pixel spacing
different short axis slice spacing
image sets from duplicate locations
sax sets aren’t in the same order as their locations
some sax image sets have multiple locations

Different Image Sizes and Rotations

There are approximately a dozen different image sizes, include rotated images. Not all of the image sizes scale to the same 4:3 aspect ratio that is the most common across the training set. Some of the machine learning algorithms I used later need fixed dimension images, so I compromised and decided to use a standard sizing of 256 x 192 pixels portrait aspect ratio. This meant that I wasn’t always scaling the x-axis and the y-axis by the same amount, and occasionally I was even upscaling an image.


library(pacman)
pacman::p_load(EBImage)
rescaleImage = function(img)
{
 imgOut = img
 # check for landscape aspect ratio and correct
 if (nrow(img) > ncol(img)) imgOut = t(img)
 imgOut = resize(imgOut, 256, 192) # standardise image size to 256 x 192
 return (imgOut)
}

Note that I used matrix transpose to rotate the image 90 degrees. I could also have used the rotate function in EBImage. Either way could work, but I felt that matrix transpose would be a faster operation.

Image Locations and Spacing

The DICOM images contain information about image slice location, pixel spacing (how far apart are adjacent pixel centroids), and slice spacing (how far apart are adjacent sax slices).


library(pacman)
pacman::p_load(oro.dicom)

# function to extract the dicom header info
getDicomHeaderInfo = function(path)
{
 img = readDICOMFile(path)
 width = ncol(img$img)
 height = nrow(img$img)
 headers = img$hdr
 patientID = extractHeader(img$hdr, 'PatientID', numeric = TRUE)
 patientAge = as.integer(substr(extractHeader(img$hdr, 'PatientsAge', numeric = FALSE), 1, 3))
 patientGender = as.character(extractHeader(img$hdr, 'PatientsSex', numeric = FALSE))
 ps = extractHeader(img$hdr, 'PixelSpacin', numeric = FALSE)
 pixelSpacingX = as.numeric(unlist(strsplit(ps, ' '))[1])
 pixelSpacingY = as.numeric(unlist(strsplit(ps, ' '))[2])
 seriesNum = extractHeader(img$hdr, 'SeriesNumber', numeric = TRUE)
 location = round(extractHeader(img$hdr, 'SliceLocation', numeric = TRUE), digits = 1)
 pathSplit = unlist(strsplit(path, '/&amp;amp;', fixed = TRUE))
 filename = pathSplit[length(pathSplit)]
 prefix = unlist(strsplit(filename, '.', fixed = TRUE))[1]
 frame = as.integer(unlist(strsplit(prefix, '-'))[3])
 sliceNum = 0
 if (length(unlist(strsplit(prefix, '-'))) &amp;gt; 3) sliceNum = as.integer(unlist(strsplit(prefix, '-'))[4])
 time = extractHeader(img$hdr, 'InstanceCreationTime', numeric = FALSE)
 #
 return (list(
 id = patientID,
 age = patientAge,
 gender = patientGender,
 pixelSpacingX = pixelSpacingX,
 pixelSpacingY = pixelSpacingY,
 width = width,
 height = height,
 series = seriesNum,
 location = location,
 frame = frame,
 sliceNum = sliceNum,
 time = time,
 path = path
 ))
}

I’m not a medical expert. So when there are images from duplicate locations, I don’t know which image sets are the best to use. Therefore I just assumed that the medical specialist repeated the MRI scans until the image quality was adequate. This meant that I chose the image set with the latest time stamps (from the header information inside the DICOM files, not the file system date).

Then I just

searched for DICOM images in all of the subfolders for each patient
filtered to use only sax slices, ignoring those images that did not come from a folder that contained the substring “sax”
read the DICOM header information from each image to find its location and time
searched for the high time stamp for each location and kept that image
wrote out a new folder structure where sax_01, sax_02, … were the sax slice image sets for that patient, order by their location along the long axis of the heart

The R script to do this isn’t too difficult. I’ve pasted it below:


# this script creates a cleaner version of the image data
# plus an extract of the file headers
# 1) removes duplicate images
# 2) reorders images by location

# read a poor image and translate the pixel brightnesses
rebalanceImage = function(badImage)
{
v = matrix(badImage, nrow(badImage) * ncol(badImage))
o2 = order(v)
vIn = sample(vAll, nrow(badImage) * ncol(badImage))
vIn = vIn[order(vIn)]
v2 = v
v2[o2] = vIn
cleanImage = matrix(v2, nrow(badImage), ncol(badImage))
return (cleanImage)
}

# check whether a folder exists and make it if it doesn't exist
checkFolder = function(mainDir, subDir)
{
if(!dir.exists(file.path(mainDir, subDir)))
dir.create(file.path(mainDir, subDir))
}

# function to extract the dicom header info
getDicomHeaderInfo = function(path)
{
img = readDICOMFile(path)
width = ncol(img$img)
height = nrow(img$img)
headers = img$hdr
patientID = extractHeader(img$hdr, 'PatientID', numeric = TRUE)
patientAge = as.integer(substr(extractHeader(img$hdr, 'PatientsAge', numeric = FALSE), 1, 3))
patientGender = as.character(extractHeader(img$hdr, 'PatientsSex', numeric = FALSE))
ps = extractHeader(img$hdr, 'PixelSpacing', numeric = FALSE)
pixelSpacingX = as.numeric(unlist(strsplit(ps, ' '))[1])
pixelSpacingY = as.numeric(unlist(strsplit(ps, ' '))[2])
seriesNum = extractHeader(img$hdr, 'SeriesNumber', numeric = TRUE)
location = round(extractHeader(img$hdr, 'SliceLocation', numeric = TRUE), digits = 1)
pathSplit = unlist(strsplit(path, '/', fixed = TRUE))
filename = pathSplit[length(pathSplit)]
prefix = unlist(strsplit(filename, '.', fixed = TRUE))[1]
frame = as.integer(unlist(strsplit(prefix, '-'))[3])
sliceNum = 0
if (length(unlist(strsplit(prefix, '-'))) &amp;gt; 3) sliceNum = as.integer(unlist(strsplit(prefix, '-'))[4])
time = extractHeader(img$hdr, 'InstanceCreationTime', numeric = FALSE)
#
return (list(
id = patientID,
age = patientAge,
gender = patientGender,
pixelSpacingX = pixelSpacingX,
pixelSpacingY = pixelSpacingY,
width = width,
height = height,
series = seriesNum,
location = location,
frame = frame,
sliceNum = sliceNum,
time = time,
path = path
))
}
############################################################################################################################

library(pacman)
pacman::p_load(oro.dicom, raster, data.table, png, flexclust, foreach, doParallel, snowfall)

# whether this run is for the training set (FALSE) or the validation set (TRUE)
useValidation = FALSE
#useValidation = TRUE

# create a benchmark histogram from an exemplar image
dicomBenchmark = readDICOM('/home/colin/data/Second-Annual-Data-Science-Bowl/train/1/study/sax_13')
images = dicomBenchmark[[2]]
img = images[[1]]
vAll = unname(unlist(images))
vAll = vAll[order(vAll)]

# loop through all of the patients
rootFolder = '/home/colin/data/Second-Annual-Data-Science-Bowl/train'
outFolder = '/home/colin/data/Second-Annual-Data-Science-Bowl/train-cleaned'
if (useValidation)
{
rootFolder = '/home/colin/data/Second-Annual-Data-Science-Bowl/validate'
outFolder = '/home/colin/data/Second-Annual-Data-Science-Bowl/validate-cleaned'
}
cases = list.dirs(rootFolder, recursive=FALSE)
#cases = cases[grep('123', cases)]
simpleData = matrix(0, 500, 3)
simpleFeatures = data.frame(id = rep(0, 500), age = rep(0, 500), gender = rep('U', 500),
pixelSpacingX = integer(500), pixelSpacingY = integer(500),
width = integer(500), height = integer(500),
numSlices = rep(0, 500), stringsAsFactors = FALSE)
allImages = NULL

sfInit(parallel=TRUE, cpus=14)
sfLibrary(oro.dicom)

for (patient in cases)
{
patientFolder = paste0(patient, '/study')
imgSequences = list.dirs(patientFolder, recursive=FALSE)
# filter for 'sax' folders
imgSequences = imgSequences[grep('sax', imgSequences, fixed = TRUE)]
nMax = 10000
imgTable = data.table(id = integer(nMax), age = integer(nMax), gender = character(nMax),
pixelSpacingX = integer(nMax), pixelSpacingY = integer(nMax),
width = integer(nMax), height = integer(nMax),
location = numeric(nMax), frame = integer(nMax), series = integer(nMax),
sliceNum = integer(nMax), time = numeric(nMax),
path = character(nMax))
for (imgFolder in imgSequences)
{
# find the first file in imgs
imgFiles = list.files(imgFolder)
if (length(imgFiles) &lt; 30)
{
print(paste0('length = ', length(imgFiles), '! in ', imgFolder))
} else {
if (length(imgFiles) %% 30 != 0)
{
print(paste0('length = ', length(imgFiles), '! in ', imgFolder))
}
}
# do the next part regardless of the number of images
paths = unlist(lapply(imgFiles, function(x) return (paste0(imgFolder, '/', x))))
result <- sfLapply(paths, getDicomHeaderInfo)
for (row in result)
{
n = sum(imgTable$id > 0) + 1
imgTable$id[n] = row$id
imgTable$age[n] = row$age
imgTable$gender[n] = row$gender
imgTable$pixelSpacingX[n] = row$pixelSpacingX
imgTable$pixelSpacingY[n] = row$pixelSpacingY
imgTable$width[n] = row$width
imgTable$height[n] = row$height
imgTable$series[n] = row$series
imgTable$location[n] = row$location
imgTable$frame[n] = row$frame
imgTable$sliceNum[n] = row$sliceNum
imgTable$time[n] = row$time
imgTable$path[n] = row$path
}
}
# remove surplus records from table
imgTable= imgTable[imgTable$id > 0]
# grab the latest image for each location and frame
latestImages = imgTable[order(id, age, gender, pixelSpacingX, pixelSpacingY, width, height, location, frame, time, sliceNum, series), .SD[c(.N)], by=c(&amp;amp;amp;quot;id&amp;amp;amp;quot;, &amp;amp;amp;quot;age&amp;amp;amp;quot;, &amp;amp;amp;quot;gender&amp;amp;amp;quot;, &amp;amp;amp;quot;pixelSpacingX&amp;amp;amp;quot;, &amp;amp;amp;quot;pixelSpacingY&amp;amp;amp;quot;, &amp;amp;amp;quot;width&amp;amp;amp;quot;, &amp;amp;amp;quot;height&amp;amp;amp;quot;, &amp;amp;amp;quot;location&amp;amp;amp;quot;, &amp;amp;amp;quot;frame&amp;amp;amp;quot;)]
#
print(paste0('Patient: ', latestImages$id[1]))
uniqueLocs = sort(unique(latestImages$location[abs(latestImages$location) > 0.000001]))
simpleFeatures[latestImages$id[1], ] = list(latestImages$id[1], latestImages$age[1], latestImages$gender[1],
latestImages$pixelSpacingX[1], latestImages$pixelSpacingY[1],
latestImages$width[1], latestImages$height[1],
length(uniqueLocs))
# create the cleaned-up images for CNN training
imgTable$cleanPath = '---'
iSax = 1
patientID = latestImages$id[1]
for (loc in uniqueLocs)
{
imgset = latestImages$path[latestImages$location == loc]
iImage = 1
for (imgPath in imgset)
{
dicomImage = readDICOMFile(imgPath)
# fix the contrast and brightness
fixedImage = rebalanceImage(dicomImage$img)
# get the details of this image
###patientID = extractHeader(dicomImage$hdr, 'PatientID', numeric = TRUE)
sliceID = iSax
imageID = iImage
outPath = paste0(outFolder, '/', patientID, '/study/sax_', formatC(sliceID, width=2, flag='0'), '/image-', formatC(imageID, width=2, flag='0'), '.png')
#print(outPath)
#plot(raster(fixedImage))
checkFolder(outFolder, as.character(patientID))
checkFolder(paste0(outFolder, '/', patientID), 'study')
checkFolder(paste0(outFolder, '/', patientID, '/study'), paste0('sax_', formatC(sliceID, width=2, flag='0')))
writePNG(fixedImage / max(fixedImage), outPath)
imgTable$cleanPath[imgTable$path == imgPath] = outPath
#
iImage = iImage + 1
}
iSax = iSax + 1
}

if (patient == cases[1])
{
allImages = imgTable
} else
{
allImages = data.frame(rbind(allImages, imgTable))
}
}

sfStop()

I stored all of the DICOM header information in a table. Some of this will be required later, when calculating the volume of the left ventricle chamber.

The next step, which will be described in my next blog, is to design a convolutional neural network that will automatically find the left ventricle in an image.

3 thoughts on “Second Annual Data Science Bowl – Part 2”

Pingback: Second Annual Data Science Bowl – Part 3 – Automatically Finding the Heart Location in an MRI Image | Keeping Up With The Latest Techniques
Jennifer Cao said:

September 7, 2016 at 7:14 pm

Fantastic!

I tried the codes. But the loop through codes didn’t work for me. R log says Error: unexpected ‘&’ in ” . Can you explain how the &quot etc macro variables works? Thank you!

# loop through all of the patients
rootFolder = "/home/colin/data/Second-Annual-Data-Science-Bowl/train"
outFolder = "/home/colin/data/Second-Annual-Data-Science-Bowl/train-cleaned"
if (useValidation)

LikeLike

- Colin Priest said:
  
  September 8, 2016 at 12:07 am
  
  Hi Jennifer.
  Sorry about that. WordPress mangled my code when it published. I have just edited the code online, and hopefully this time WordPress won’t mangle it again.
  Colin
  
  LikeLike

Keeping Up With The Latest Techniques

~ brief insights

Second Annual Data Science Bowl – Part 2

Different Image Sizes and Rotations

Image Locations and Spacing

3 thoughts on “Second Annual Data Science Bowl – Part 2”

Leave a comment Cancel reply

Different Image Sizes and Rotations

Image Locations and Spacing

Share this:

Related

3 thoughts on “Second Annual Data Science Bowl – Part 2”

Leave a comment Cancel reply