Tags

, , , ,

In Part 1 of this blog series, I described how to fix the brightness and contrast of the MRI images. In this blog we finish cleaning up the input data.

Other than brightness and contrast, we need to fix up the following problems:

  • different image sizes
  • different image rotations – portrait versus landscape
  • different pixel spacing
  • different short axis slice spacing
  • image sets from duplicate locations
  • sax sets aren’t in the same order as their locations
  • some sax image sets have multiple locations

Different Image Sizes and Rotations

There are approximately a dozen different image sizes, include rotated images. Not all of the image sizes scale to the same 4:3 aspect ratio that is the most common across the training set. Some of the machine learning algorithms I used later need fixed dimension images, so I compromised and decided to use a standard sizing of 256 x 192 pixels portrait aspect ratio. This meant that I wasn’t always scaling the x-axis and the y-axis by the same amount, and occasionally I was even upscaling an image.


library(pacman)
pacman::p_load(EBImage)
rescaleImage = function(img)
{
 imgOut = img
 # check for landscape aspect ratio and correct
 if (nrow(img) > ncol(img)) imgOut = t(img)
 imgOut = resize(imgOut, 256, 192) # standardise image size to 256 x 192
 return (imgOut)
}

20160307image0120160307image02

Note that I used matrix transpose to rotate the image 90 degrees. I could also have used the rotate function in EBImage. Either way could work, but I felt that matrix transpose would be a faster operation.

Image Locations and Spacing

The DICOM images contain information about image slice location, pixel spacing (how far apart are adjacent pixel centroids), and slice spacing (how far apart are adjacent sax slices).


library(pacman)
pacman::p_load(oro.dicom)

# function to extract the dicom header info
getDicomHeaderInfo = function(path)
{
 img = readDICOMFile(path)
 width = ncol(img$img)
 height = nrow(img$img)
 headers = img$hdr
 patientID = extractHeader(img$hdr, 'PatientID', numeric = TRUE)
 patientAge = as.integer(substr(extractHeader(img$hdr, 'PatientsAge', numeric = FALSE), 1, 3))
 patientGender = as.character(extractHeader(img$hdr, 'PatientsSex', numeric = FALSE))
 ps = extractHeader(img$hdr, 'PixelSpacin', numeric = FALSE)
 pixelSpacingX = as.numeric(unlist(strsplit(ps, ' '))[1])
 pixelSpacingY = as.numeric(unlist(strsplit(ps, ' '))[2])
 seriesNum = extractHeader(img$hdr, 'SeriesNumber', numeric = TRUE)
 location = round(extractHeader(img$hdr, 'SliceLocation', numeric = TRUE), digits = 1)
 pathSplit = unlist(strsplit(path, '/&', fixed = TRUE))
 filename = pathSplit[length(pathSplit)]
 prefix = unlist(strsplit(filename, '.', fixed = TRUE))[1]
 frame = as.integer(unlist(strsplit(prefix, '-'))[3])
 sliceNum = 0
 if (length(unlist(strsplit(prefix, '-'))) > 3) sliceNum = as.integer(unlist(strsplit(prefix, '-'))[4])
 time = extractHeader(img$hdr, 'InstanceCreationTime', numeric = FALSE)
 #
 return (list(
 id = patientID,
 age = patientAge,
 gender = patientGender,
 pixelSpacingX = pixelSpacingX,
 pixelSpacingY = pixelSpacingY,
 width = width,
 height = height,
 series = seriesNum,
 location = location,
 frame = frame,
 sliceNum = sliceNum,
 time = time,
 path = path
 ))
}

I’m not a medical expert. So when there are images from duplicate locations, I don’t know which image sets are the best to use. Therefore I just assumed that the medical specialist repeated the MRI scans until the image quality was adequate. This meant that I chose the image set with the latest time stamps (from the header information inside the DICOM files, not the file system date).

Then I just

  1. searched for DICOM images in all of the subfolders for each patient
  2. filtered to use only sax slices, ignoring those images that did not come from a folder that contained the substring “sax”
  3. read the DICOM header information from each image to find its location and time
  4. searched for the high time stamp for each location and kept that image
  5. wrote out a new folder structure where sax_01, sax_02, … were the sax slice image sets for that patient, order by their location along the long axis of the heart

The R script to do this isn’t too difficult. I’ve pasted it below:


# this script creates a cleaner version of the image data
# plus an extract of the file headers
# 1) removes duplicate images
# 2) reorders images by location

# read a poor image and translate the pixel brightnesses
rebalanceImage = function(badImage)
{
v = matrix(badImage, nrow(badImage) * ncol(badImage))
o2 = order(v)
vIn = sample(vAll, nrow(badImage) * ncol(badImage))
vIn = vIn[order(vIn)]
v2 = v
v2[o2] = vIn
cleanImage = matrix(v2, nrow(badImage), ncol(badImage))
return (cleanImage)
}

# check whether a folder exists and make it if it doesn't exist
checkFolder = function(mainDir, subDir)
{
if(!dir.exists(file.path(mainDir, subDir)))
dir.create(file.path(mainDir, subDir))
}

# function to extract the dicom header info
getDicomHeaderInfo = function(path)
{
img = readDICOMFile(path)
width = ncol(img$img)
height = nrow(img$img)
headers = img$hdr
patientID = extractHeader(img$hdr, 'PatientID', numeric = TRUE)
patientAge = as.integer(substr(extractHeader(img$hdr, 'PatientsAge', numeric = FALSE), 1, 3))
patientGender = as.character(extractHeader(img$hdr, 'PatientsSex', numeric = FALSE))
ps = extractHeader(img$hdr, 'PixelSpacing', numeric = FALSE)
pixelSpacingX = as.numeric(unlist(strsplit(ps, ' '))[1])
pixelSpacingY = as.numeric(unlist(strsplit(ps, ' '))[2])
seriesNum = extractHeader(img$hdr, 'SeriesNumber', numeric = TRUE)
location = round(extractHeader(img$hdr, 'SliceLocation', numeric = TRUE), digits = 1)
pathSplit = unlist(strsplit(path, '/', fixed = TRUE))
filename = pathSplit[length(pathSplit)]
prefix = unlist(strsplit(filename, '.', fixed = TRUE))[1]
frame = as.integer(unlist(strsplit(prefix, '-'))[3])
sliceNum = 0
if (length(unlist(strsplit(prefix, '-'))) > 3) sliceNum = as.integer(unlist(strsplit(prefix, '-'))[4])
time = extractHeader(img$hdr, 'InstanceCreationTime', numeric = FALSE)
#
return (list(
id = patientID,
age = patientAge,
gender = patientGender,
pixelSpacingX = pixelSpacingX,
pixelSpacingY = pixelSpacingY,
width = width,
height = height,
series = seriesNum,
location = location,
frame = frame,
sliceNum = sliceNum,
time = time,
path = path
))
}
############################################################################################################################

library(pacman)
pacman::p_load(oro.dicom, raster, data.table, png, flexclust, foreach, doParallel, snowfall)

# whether this run is for the training set (FALSE) or the validation set (TRUE)
useValidation = FALSE
#useValidation = TRUE

# create a benchmark histogram from an exemplar image
dicomBenchmark = readDICOM('/home/colin/data/Second-Annual-Data-Science-Bowl/train/1/study/sax_13')
images = dicomBenchmark[[2]]
img = images[[1]]
vAll = unname(unlist(images))
vAll = vAll[order(vAll)]

# loop through all of the patients
rootFolder = '/home/colin/data/Second-Annual-Data-Science-Bowl/train'
outFolder = '/home/colin/data/Second-Annual-Data-Science-Bowl/train-cleaned'
if (useValidation)
{
rootFolder = '/home/colin/data/Second-Annual-Data-Science-Bowl/validate'
outFolder = '/home/colin/data/Second-Annual-Data-Science-Bowl/validate-cleaned'
}
cases = list.dirs(rootFolder, recursive=FALSE)
#cases = cases[grep('123', cases)]
simpleData = matrix(0, 500, 3)
simpleFeatures = data.frame(id = rep(0, 500), age = rep(0, 500), gender = rep('U', 500),
pixelSpacingX = integer(500), pixelSpacingY = integer(500),
width = integer(500), height = integer(500),
numSlices = rep(0, 500), stringsAsFactors = FALSE)
allImages = NULL

sfInit(parallel=TRUE, cpus=14)
sfLibrary(oro.dicom)

for (patient in cases)
{
patientFolder = paste0(patient, '/study')
imgSequences = list.dirs(patientFolder, recursive=FALSE)
# filter for 'sax' folders
imgSequences = imgSequences[grep('sax', imgSequences, fixed = TRUE)]
nMax = 10000
imgTable = data.table(id = integer(nMax), age = integer(nMax), gender = character(nMax),
pixelSpacingX = integer(nMax), pixelSpacingY = integer(nMax),
width = integer(nMax), height = integer(nMax),
location = numeric(nMax), frame = integer(nMax), series = integer(nMax),
sliceNum = integer(nMax), time = numeric(nMax),
path = character(nMax))
for (imgFolder in imgSequences)
{
# find the first file in imgs
imgFiles = list.files(imgFolder)
if (length(imgFiles) < 30)
{
print(paste0('length = ', length(imgFiles), '! in ', imgFolder))
} else {
if (length(imgFiles) %% 30 != 0)
{
print(paste0('length = ', length(imgFiles), '! in ', imgFolder))
}
}
# do the next part regardless of the number of images
paths = unlist(lapply(imgFiles, function(x) return (paste0(imgFolder, '/', x))))
result <- sfLapply(paths, getDicomHeaderInfo)
for (row in result)
{
n = sum(imgTable$id > 0) + 1
imgTable$id[n] = row$id
imgTable$age[n] = row$age
imgTable$gender[n] = row$gender
imgTable$pixelSpacingX[n] = row$pixelSpacingX
imgTable$pixelSpacingY[n] = row$pixelSpacingY
imgTable$width[n] = row$width
imgTable$height[n] = row$height
imgTable$series[n] = row$series
imgTable$location[n] = row$location
imgTable$frame[n] = row$frame
imgTable$sliceNum[n] = row$sliceNum
imgTable$time[n] = row$time
imgTable$path[n] = row$path
}
}
# remove surplus records from table
imgTable= imgTable[imgTable$id > 0]
# grab the latest image for each location and frame
latestImages = imgTable[order(id, age, gender, pixelSpacingX, pixelSpacingY, width, height, location, frame, time, sliceNum, series), .SD[c(.N)], by=c(&amp;amp;amp;quot;id&amp;amp;amp;quot;, &amp;amp;amp;quot;age&amp;amp;amp;quot;, &amp;amp;amp;quot;gender&amp;amp;amp;quot;, &amp;amp;amp;quot;pixelSpacingX&amp;amp;amp;quot;, &amp;amp;amp;quot;pixelSpacingY&amp;amp;amp;quot;, &amp;amp;amp;quot;width&amp;amp;amp;quot;, &amp;amp;amp;quot;height&amp;amp;amp;quot;, &amp;amp;amp;quot;location&amp;amp;amp;quot;, &amp;amp;amp;quot;frame&amp;amp;amp;quot;)]
#
print(paste0('Patient: ', latestImages$id[1]))
uniqueLocs = sort(unique(latestImages$location[abs(latestImages$location) > 0.000001]))
simpleFeatures[latestImages$id[1], ] = list(latestImages$id[1], latestImages$age[1], latestImages$gender[1],
latestImages$pixelSpacingX[1], latestImages$pixelSpacingY[1],
latestImages$width[1], latestImages$height[1],
length(uniqueLocs))
# create the cleaned-up images for CNN training
imgTable$cleanPath = '---'
iSax = 1
patientID = latestImages$id[1]
for (loc in uniqueLocs)
{
imgset = latestImages$path[latestImages$location == loc]
iImage = 1
for (imgPath in imgset)
{
dicomImage = readDICOMFile(imgPath)
# fix the contrast and brightness
fixedImage = rebalanceImage(dicomImage$img)
# get the details of this image
###patientID = extractHeader(dicomImage$hdr, 'PatientID', numeric = TRUE)
sliceID = iSax
imageID = iImage
outPath = paste0(outFolder, '/', patientID, '/study/sax_', formatC(sliceID, width=2, flag='0'), '/image-', formatC(imageID, width=2, flag='0'), '.png')
#print(outPath)
#plot(raster(fixedImage))
checkFolder(outFolder, as.character(patientID))
checkFolder(paste0(outFolder, '/', patientID), 'study')
checkFolder(paste0(outFolder, '/', patientID, '/study'), paste0('sax_', formatC(sliceID, width=2, flag='0')))
writePNG(fixedImage / max(fixedImage), outPath)
imgTable$cleanPath[imgTable$path == imgPath] = outPath
#
iImage = iImage + 1
}
iSax = iSax + 1
}

if (patient == cases[1])
{
allImages = imgTable
} else
{
allImages = data.frame(rbind(allImages, imgTable))
}
}

sfStop()

I stored all of the DICOM header information in a table. Some of this will be required later, when calculating the volume of the left ventricle chamber.

The next step, which will be described in my next blog, is to design a convolutional neural network that will automatically find the left ventricle in an image.