Tags
In Part 1 of this blog series, I described how to fix the brightness and contrast of the MRI images. In this blog we finish cleaning up the input data.
Other than brightness and contrast, we need to fix up the following problems:
- different image sizes
- different image rotations – portrait versus landscape
- different pixel spacing
- different short axis slice spacing
- image sets from duplicate locations
- sax sets aren’t in the same order as their locations
- some sax image sets have multiple locations
Different Image Sizes and Rotations
There are approximately a dozen different image sizes, include rotated images. Not all of the image sizes scale to the same 4:3 aspect ratio that is the most common across the training set. Some of the machine learning algorithms I used later need fixed dimension images, so I compromised and decided to use a standard sizing of 256 x 192 pixels portrait aspect ratio. This meant that I wasn’t always scaling the x-axis and the y-axis by the same amount, and occasionally I was even upscaling an image.
library(pacman) pacman::p_load(EBImage) rescaleImage = function(img) { imgOut = img # check for landscape aspect ratio and correct if (nrow(img) > ncol(img)) imgOut = t(img) imgOut = resize(imgOut, 256, 192) # standardise image size to 256 x 192 return (imgOut) }
Note that I used matrix transpose to rotate the image 90 degrees. I could also have used the rotate function in EBImage. Either way could work, but I felt that matrix transpose would be a faster operation.
Image Locations and Spacing
The DICOM images contain information about image slice location, pixel spacing (how far apart are adjacent pixel centroids), and slice spacing (how far apart are adjacent sax slices).
library(pacman) pacman::p_load(oro.dicom) # function to extract the dicom header info getDicomHeaderInfo = function(path) { img = readDICOMFile(path) width = ncol(img$img) height = nrow(img$img) headers = img$hdr patientID = extractHeader(img$hdr, 'PatientID', numeric = TRUE) patientAge = as.integer(substr(extractHeader(img$hdr, 'PatientsAge', numeric = FALSE), 1, 3)) patientGender = as.character(extractHeader(img$hdr, 'PatientsSex', numeric = FALSE)) ps = extractHeader(img$hdr, 'PixelSpacin', numeric = FALSE) pixelSpacingX = as.numeric(unlist(strsplit(ps, ' '))[1]) pixelSpacingY = as.numeric(unlist(strsplit(ps, ' '))[2]) seriesNum = extractHeader(img$hdr, 'SeriesNumber', numeric = TRUE) location = round(extractHeader(img$hdr, 'SliceLocation', numeric = TRUE), digits = 1) pathSplit = unlist(strsplit(path, '/&', fixed = TRUE)) filename = pathSplit[length(pathSplit)] prefix = unlist(strsplit(filename, '.', fixed = TRUE))[1] frame = as.integer(unlist(strsplit(prefix, '-'))[3]) sliceNum = 0 if (length(unlist(strsplit(prefix, '-'))) > 3) sliceNum = as.integer(unlist(strsplit(prefix, '-'))[4]) time = extractHeader(img$hdr, 'InstanceCreationTime', numeric = FALSE) # return (list( id = patientID, age = patientAge, gender = patientGender, pixelSpacingX = pixelSpacingX, pixelSpacingY = pixelSpacingY, width = width, height = height, series = seriesNum, location = location, frame = frame, sliceNum = sliceNum, time = time, path = path )) }
I’m not a medical expert. So when there are images from duplicate locations, I don’t know which image sets are the best to use. Therefore I just assumed that the medical specialist repeated the MRI scans until the image quality was adequate. This meant that I chose the image set with the latest time stamps (from the header information inside the DICOM files, not the file system date).
Then I just
- searched for DICOM images in all of the subfolders for each patient
- filtered to use only sax slices, ignoring those images that did not come from a folder that contained the substring “sax”
- read the DICOM header information from each image to find its location and time
- searched for the high time stamp for each location and kept that image
- wrote out a new folder structure where sax_01, sax_02, … were the sax slice image sets for that patient, order by their location along the long axis of the heart
The R script to do this isn’t too difficult. I’ve pasted it below:
# this script creates a cleaner version of the image data # plus an extract of the file headers # 1) removes duplicate images # 2) reorders images by location # read a poor image and translate the pixel brightnesses rebalanceImage = function(badImage) { v = matrix(badImage, nrow(badImage) * ncol(badImage)) o2 = order(v) vIn = sample(vAll, nrow(badImage) * ncol(badImage)) vIn = vIn[order(vIn)] v2 = v v2[o2] = vIn cleanImage = matrix(v2, nrow(badImage), ncol(badImage)) return (cleanImage) } # check whether a folder exists and make it if it doesn't exist checkFolder = function(mainDir, subDir) { if(!dir.exists(file.path(mainDir, subDir))) dir.create(file.path(mainDir, subDir)) } # function to extract the dicom header info getDicomHeaderInfo = function(path) { img = readDICOMFile(path) width = ncol(img$img) height = nrow(img$img) headers = img$hdr patientID = extractHeader(img$hdr, 'PatientID', numeric = TRUE) patientAge = as.integer(substr(extractHeader(img$hdr, 'PatientsAge', numeric = FALSE), 1, 3)) patientGender = as.character(extractHeader(img$hdr, 'PatientsSex', numeric = FALSE)) ps = extractHeader(img$hdr, 'PixelSpacing', numeric = FALSE) pixelSpacingX = as.numeric(unlist(strsplit(ps, ' '))[1]) pixelSpacingY = as.numeric(unlist(strsplit(ps, ' '))[2]) seriesNum = extractHeader(img$hdr, 'SeriesNumber', numeric = TRUE) location = round(extractHeader(img$hdr, 'SliceLocation', numeric = TRUE), digits = 1) pathSplit = unlist(strsplit(path, '/', fixed = TRUE)) filename = pathSplit[length(pathSplit)] prefix = unlist(strsplit(filename, '.', fixed = TRUE))[1] frame = as.integer(unlist(strsplit(prefix, '-'))[3]) sliceNum = 0 if (length(unlist(strsplit(prefix, '-'))) &gt; 3) sliceNum = as.integer(unlist(strsplit(prefix, '-'))[4]) time = extractHeader(img$hdr, 'InstanceCreationTime', numeric = FALSE) # return (list( id = patientID, age = patientAge, gender = patientGender, pixelSpacingX = pixelSpacingX, pixelSpacingY = pixelSpacingY, width = width, height = height, series = seriesNum, location = location, frame = frame, sliceNum = sliceNum, time = time, path = path )) } ############################################################################################################################ library(pacman) pacman::p_load(oro.dicom, raster, data.table, png, flexclust, foreach, doParallel, snowfall) # whether this run is for the training set (FALSE) or the validation set (TRUE) useValidation = FALSE #useValidation = TRUE # create a benchmark histogram from an exemplar image dicomBenchmark = readDICOM('/home/colin/data/Second-Annual-Data-Science-Bowl/train/1/study/sax_13') images = dicomBenchmark[[2]] img = images[[1]] vAll = unname(unlist(images)) vAll = vAll[order(vAll)] # loop through all of the patients rootFolder = '/home/colin/data/Second-Annual-Data-Science-Bowl/train' outFolder = '/home/colin/data/Second-Annual-Data-Science-Bowl/train-cleaned' if (useValidation) { rootFolder = '/home/colin/data/Second-Annual-Data-Science-Bowl/validate' outFolder = '/home/colin/data/Second-Annual-Data-Science-Bowl/validate-cleaned' } cases = list.dirs(rootFolder, recursive=FALSE) #cases = cases[grep('123', cases)] simpleData = matrix(0, 500, 3) simpleFeatures = data.frame(id = rep(0, 500), age = rep(0, 500), gender = rep('U', 500), pixelSpacingX = integer(500), pixelSpacingY = integer(500), width = integer(500), height = integer(500), numSlices = rep(0, 500), stringsAsFactors = FALSE) allImages = NULL sfInit(parallel=TRUE, cpus=14) sfLibrary(oro.dicom) for (patient in cases) { patientFolder = paste0(patient, '/study') imgSequences = list.dirs(patientFolder, recursive=FALSE) # filter for 'sax' folders imgSequences = imgSequences[grep('sax', imgSequences, fixed = TRUE)] nMax = 10000 imgTable = data.table(id = integer(nMax), age = integer(nMax), gender = character(nMax), pixelSpacingX = integer(nMax), pixelSpacingY = integer(nMax), width = integer(nMax), height = integer(nMax), location = numeric(nMax), frame = integer(nMax), series = integer(nMax), sliceNum = integer(nMax), time = numeric(nMax), path = character(nMax)) for (imgFolder in imgSequences) { # find the first file in imgs imgFiles = list.files(imgFolder) if (length(imgFiles) < 30) { print(paste0('length = ', length(imgFiles), '! in ', imgFolder)) } else { if (length(imgFiles) %% 30 != 0) { print(paste0('length = ', length(imgFiles), '! in ', imgFolder)) } } # do the next part regardless of the number of images paths = unlist(lapply(imgFiles, function(x) return (paste0(imgFolder, '/', x)))) result <- sfLapply(paths, getDicomHeaderInfo) for (row in result) { n = sum(imgTable$id > 0) + 1 imgTable$id[n] = row$id imgTable$age[n] = row$age imgTable$gender[n] = row$gender imgTable$pixelSpacingX[n] = row$pixelSpacingX imgTable$pixelSpacingY[n] = row$pixelSpacingY imgTable$width[n] = row$width imgTable$height[n] = row$height imgTable$series[n] = row$series imgTable$location[n] = row$location imgTable$frame[n] = row$frame imgTable$sliceNum[n] = row$sliceNum imgTable$time[n] = row$time imgTable$path[n] = row$path } } # remove surplus records from table imgTable= imgTable[imgTable$id > 0] # grab the latest image for each location and frame latestImages = imgTable[order(id, age, gender, pixelSpacingX, pixelSpacingY, width, height, location, frame, time, sliceNum, series), .SD[c(.N)], by=c(&amp;amp;quot;id&amp;amp;quot;, &amp;amp;quot;age&amp;amp;quot;, &amp;amp;quot;gender&amp;amp;quot;, &amp;amp;quot;pixelSpacingX&amp;amp;quot;, &amp;amp;quot;pixelSpacingY&amp;amp;quot;, &amp;amp;quot;width&amp;amp;quot;, &amp;amp;quot;height&amp;amp;quot;, &amp;amp;quot;location&amp;amp;quot;, &amp;amp;quot;frame&amp;amp;quot;)] # print(paste0('Patient: ', latestImages$id[1])) uniqueLocs = sort(unique(latestImages$location[abs(latestImages$location) > 0.000001])) simpleFeatures[latestImages$id[1], ] = list(latestImages$id[1], latestImages$age[1], latestImages$gender[1], latestImages$pixelSpacingX[1], latestImages$pixelSpacingY[1], latestImages$width[1], latestImages$height[1], length(uniqueLocs)) # create the cleaned-up images for CNN training imgTable$cleanPath = '---' iSax = 1 patientID = latestImages$id[1] for (loc in uniqueLocs) { imgset = latestImages$path[latestImages$location == loc] iImage = 1 for (imgPath in imgset) { dicomImage = readDICOMFile(imgPath) # fix the contrast and brightness fixedImage = rebalanceImage(dicomImage$img) # get the details of this image ###patientID = extractHeader(dicomImage$hdr, 'PatientID', numeric = TRUE) sliceID = iSax imageID = iImage outPath = paste0(outFolder, '/', patientID, '/study/sax_', formatC(sliceID, width=2, flag='0'), '/image-', formatC(imageID, width=2, flag='0'), '.png') #print(outPath) #plot(raster(fixedImage)) checkFolder(outFolder, as.character(patientID)) checkFolder(paste0(outFolder, '/', patientID), 'study') checkFolder(paste0(outFolder, '/', patientID, '/study'), paste0('sax_', formatC(sliceID, width=2, flag='0'))) writePNG(fixedImage / max(fixedImage), outPath) imgTable$cleanPath[imgTable$path == imgPath] = outPath # iImage = iImage + 1 } iSax = iSax + 1 } if (patient == cases[1]) { allImages = imgTable } else { allImages = data.frame(rbind(allImages, imgTable)) } } sfStop()
I stored all of the DICOM header information in a table. Some of this will be required later, when calculating the volume of the left ventricle chamber.
The next step, which will be described in my next blog, is to design a convolutional neural network that will automatically find the left ventricle in an image.
Pingback: Second Annual Data Science Bowl – Part 3 – Automatically Finding the Heart Location in an MRI Image | Keeping Up With The Latest Techniques
Fantastic!
I tried the codes. But the loop through codes didn’t work for me. R log says Error: unexpected ‘&’ in ” . Can you explain how the " etc macro variables works? Thank you!
# loop through all of the patients
rootFolder = "/home/colin/data/Second-Annual-Data-Science-Bowl/train"
outFolder = "/home/colin/data/Second-Annual-Data-Science-Bowl/train-cleaned"
if (useValidation)
LikeLike
Hi Jennifer.
Sorry about that. WordPress mangled my code when it published. I have just edited the code online, and hopefully this time WordPress won’t mangle it again.
Colin
LikeLike