MRI Segmentation of Musculoskeletal Components Using U-Net: Preliminary Results

Recent advances in medical imaging and computer vision offer unprecedented potential for objective, automated, and personalized diagnosis and treatment in healthcare. A pivotal area where this potential is yet to be fully harnessed lies in the processing of MRI data for the extraction of musculoskeletal features, crucial for patient-specific musculoskeletal modeling. Such models hold significance for assessing neuromusculoskeletal diseases and analyzing human movement. This manuscript presents our initial efforts in developing a method that utilizes deep learning to segment specific anatomical structures, namely osseous and myeloid tissues, from MRI scans, with minimal annotated data. We place particular emphasis on a convolutional neural network (CNN) approach, utilizing the U-Net architecture. Our work elaborates on the segmentation process, demonstrates results on individual MRI slices, and proposes a method for volumetric analysis. We also explore potential enhancements for achieving more precise segmentations and robust feature extraction. The promising initial findings advocate for a future where the segmentation of intricate anatomical structures becomes more accessible, efficient, and rapid.


INTRODUCTION
This study employs Magnetic Resonance Imaging (MRI), a fundamental tool in medical diagnostics and research, to segment and calculate the volumes of musculoskeletal components. MRI facilitates the examination of various anatomical features by providing detailed cross-sectional images of the human body. Despite the availability of other imaging modalities, such as CT scans and X-rays, which have their own pivotal roles in medical diagnostics, MRI scans were selected for this research due to their inherent ability to capture intricate soft-tissue details and contrast, which are highly valuable for effective deep learning-based segmentation. By leveraging MRI scans and with diligent annotation of training data, the model developed in this study could potentially be expanded to encompass other structures, such as muscles or other organs, as the research progresses.
Traditionally, the analysis of MRI scans has relied heavily on manual processing by radiologists. They visually inspect the scans to identify and diagnose conditions, a task that can vary in accuracy among professionals [9] and requires significant expertise. Even when converting MRI data to computational models for automatic processing, manual intervention is necessary to identify, segment, mark, and measure regions of interest, processes which are time-consuming and prone to human error. While specialist imaging techniques and software exist that focus on segmentation tasks and volumetric assessments [16], they are still not fully automated, often requiring significant time investment and manual oversight.
Within this context, deep learning segmentation techniques present a promising approach [17]. These automated methods promise consistent, objective, swift, and more accessible volumetric analyses, potentially transforming medical image processing and inference.
Accurate volumetric analyses are pivotal for a myriad of medical endeavors. For example, they could enable effective disease monitoring and diagnosis, particularly in tracking the progression of diseases such as osteoporosis, osteoarthritis, and tumors, or in analyzing human movement [20], where changes in bone volume can signify variations in bone density. Furthermore, the outcomes of treatments can be objectively evaluated by observing alterations in volume or appearance. The significance also extends to designing custom prosthetics or implants, where meticulous volume and shape assessments would guarantee optimal fit and function, and even to identifying atrophy. The possibilities seem endless. This paper delves into the methodologies employed, the challenges encountered, and the potential applications of the findings.

BACKGROUND
The emergence of deep learning methods, especially Convolutional Neural Networks (CNNs), has significantly advanced medical image analysis in recent years. CNNs excel in pattern recognition, thereby enhancing the accuracy of automated MRI scan evaluations [19].
Despite this promise, several challenges hinder the universal applicability of deep learning models for MRI analysis. The variability in MRI image quality, differences in scanning equipment, and the complex nature of human anatomy present substantial obstacles. Additionally, the scarcity of pre-existing annotated training data, particularly data labeled by medical professionals, poses a further challenge [1,21]. Deep learning architectures typically require extensive training data to achieve meaningful classifications or segmentations, and procuring such MRI datasets incurs significant costs. Initially, this research aimed to build a model from scratch with limited training data; however, insights from the training phase revealed that this approach was infeasible.
The primary objective of this study is to harness supervised deep learning for effective multiclass semantic segmentation (see appendix) of MRI scans, initially focusing on bones (osseous tissue) and bone marrow (myeloid tissue). To circumvent the challenges associated with limited resources, a methodology referred to as 'transfer learning' (see appendix) was employed [14,15]. While the current focus is on segmenting bones and bone marrow, similar methods could also be applied to different muscles or other anatomical structures. The subsequent sections describe the specific deep learning methods explored, data sourcing and pre-processing pipelines, the results obtained, and potential avenues for further refinement.

METHODS
This section delves into data acquisition, annotation, the approach employed, evaluation metrics, and volume calculations from MRI scans.

Data Acquisition
Acquiring high-quality and precise data is crucial in deep learning. The intricacies and subtle variances within MRI scans necessitate careful selection and preparation processes. A challenging aspect of this research has been working with limited data, as anonymized data has proven difficult to procure. Efforts are ongoing to source additional data and take the research further.
Emphasis is placed on using T1-weighted scans. A brief differentiation between T1- and T2-weighted scans [6] explains the rationale behind this choice.
Two parameters distinguish T1 from T2 weighting in MRI: TR (Repetition Time) is the time between successive pulse sequences applied to the same slice, and TE (Echo Time) is the duration between transmitting the radio pulse and receiving the echoed signal. T1-weighted MRI images are produced using short TE and TR times, while T2-weighted images are captured using longer TR and TE times. In general, T1-weighted MRI images have better contrast for visualizing anatomical structures and are suitable for identifying the anatomy, such as bones, cartilage, ligaments, and tendons. On a T1-weighted image, trabecular bone appears bright, while cortical bone and soft tissues such as muscles, tendons, and ligaments appear dark. T2-weighted MRI images are generally better for detecting pathological conditions such as inflammation, edema, and fluid accumulation. On a T2-weighted image, fluid appears bright, which makes T2-weighted images particularly useful for detecting fluid-filled cysts, tumors, and other soft-tissue abnormalities.
We used the MRI sequence from a prior study [12]. The sequence was captured on a Siemens scanner with a MagneticFieldStrength of 1.5 T. As inferred from the DICOM (see appendix) metadata, the sequence has an average PixelSpacing of (0.833, 0.833) mm, a SliceThickness of 1 mm, a SamplesPerPixel of 1 (grayscale image), a RepetitionTime of 440 ms, an EchoTime of 39 ms, and a total of 1728 images.
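For reference, these attributes can be read directly from the DICOM headers. A minimal sketch using the pydicom library is shown below; the file path is a hypothetical placeholder.

```python
# Minimal sketch: reading the DICOM attributes cited above with pydicom.
# The file path is a hypothetical placeholder.
import pydicom

ds = pydicom.dcmread("slice_0001.dcm")
print(ds.MagneticFieldStrength)  # e.g., 1.5 (Tesla)
print(ds.PixelSpacing)           # e.g., [0.833, 0.833] (mm per pixel)
print(ds.SliceThickness)         # e.g., 1 (mm)
print(ds.SamplesPerPixel)        # 1 for grayscale
print(ds.RepetitionTime)         # TR, e.g., 440 (ms)
print(ds.EchoTime)               # TE, e.g., 39 (ms)
```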
Given the aim of segmenting specific regions of interest, only scans that prominently highlighted the bones, bone marrow, and musculature were retained. Scans with noticeable artifacts or a lack of clarity were promptly excluded from the dataset.

Data Annotation and Preparation
The datasets were manually annotated under the guidance of a radiologist to ensure that the labels were consistent with the actual anatomical structures. After exploring various labeling tools, the open-source LabelMe [22] and ITK-SNAP [24] were selected for their user-friendly interfaces and their capability to annotate intricate structures in MRI scans by drawing polygons.
MRI scans often exhibit variations when sourced from multiple scanners or facilities due to differences in hardware and acquisition protocols, a crucial consideration when aiming to create a generalized model. To mitigate such inconsistencies and enhance the robustness and accuracy of segmentation, a pre-processing pipeline was developed. The pipeline exports the original DICOM images to 2D PNG slices and then adjusts brightness and contrast to ensure a consistent visual output across the dataset. In addition, the pipeline separates images with left and right views into distinct cropped views. Experiments showed that cropping these into individual images enhances the model's accuracy, likely because uninformative background pixels are removed and the areas of interest are highlighted more prominently, which in turn minimizes class imbalance during training.
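As an illustration, the sketch below outlines one plausible form of this pipeline, assuming pydicom and Pillow for I/O, min-max normalization for the brightness/contrast adjustment, and a simple vertical midline split for the left/right crops; the actual pipeline's details may differ.

```python
# Illustrative sketch of the pre-processing pipeline (assumed details:
# min-max intensity normalization and a vertical midline split).
import numpy as np
import pydicom
from PIL import Image

def dicom_to_png_halves(dicom_path, out_prefix):
    ds = pydicom.dcmread(dicom_path)
    pixels = ds.pixel_array.astype(np.float32)

    # Normalize intensities to 0-255 for a consistent visual output.
    pixels -= pixels.min()
    if pixels.max() > 0:
        pixels = pixels / pixels.max() * 255.0
    img = pixels.astype(np.uint8)

    # Separate left and right views into distinct cropped images.
    mid = img.shape[1] // 2
    for side, half in (("left", img[:, :mid]), ("right", img[:, mid:])):
        Image.fromarray(half).save(f"{out_prefix}_{side}.png")
```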
Data augmentation was not used in this research due to the mixed effects observed with these techniques. While data augmentation is a popular strategy in deep learning for artificially expanding datasets, and this research could have benefited from it given the limited annotated data, experience with augmentation throughout the study was varied. Methods from Keras [4] (a deep learning framework) and albumentations [3] (an augmentation library) were tested, and care was taken to apply the same augmentations to both the original images and the annotated RGB training masks for the CNN. However, the edges of the annotated classes on the training masks were inadvertently blurred, introducing false classes through pixel values other than the pure RGB class colors. To mitigate this, a Python function leveraging OpenCV [2] was written to threshold the pixel values and normalize the brightness of each channel, keeping the pixel values in check. The results remained suboptimal, and the model exhibited spiking and inconsistent training and validation losses. A hypothesis for this unexpected behavior is that certain augmentations introduced unwanted interpolations, leading to a loss of ground-truth data. Thus, data augmentation was set aside for the time being.
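For illustration, the sketch below shows one way such mask clean-up can be performed, snapping each mask pixel back to the nearest pure class color; the palette is hypothetical, and the study's actual OpenCV-based thresholding function may differ.

```python
# Illustrative sketch: snap blurred mask pixels back to pure class colors.
# The palette is a hypothetical example, not the study's actual class colors.
import numpy as np

PALETTE = np.array([
    [0, 0, 0],      # background
    [255, 0, 0],    # class 1 (e.g., cortical bone)
    [0, 255, 0],    # class 2 (e.g., bone marrow)
    [0, 0, 255],    # class 3 (e.g., muscle tissue)
], dtype=np.float32)

def snap_mask_to_palette(mask_rgb):
    """Assign every pixel to its nearest palette color, removing the
    interpolated in-between values introduced by augmentation."""
    flat = mask_rgb.reshape(-1, 3).astype(np.float32)
    # Squared distance from each pixel to each palette color.
    dists = ((flat[:, None, :] - PALETTE[None, :, :]) ** 2).sum(axis=2)
    nearest = dists.argmin(axis=1)
    return PALETTE[nearest].astype(np.uint8).reshape(mask_rgb.shape)
```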
Lastly, the dataset contains about 200 axial training images with the annotated classes listed in Table 1, with 10% of the images held out for testing. Although this is a small number of images for training a deep learning model, efforts are being made to construct a larger dataset. All images have been resized to a consistent 128 x 128 pixels before being fed to the CNN; this uniformity simplifies input handling and ensures that the model receives standardized inputs.
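A minimal sketch of this standardization step is shown below, assuming OpenCV for resizing and scikit-learn for the split; variable names are illustrative.

```python
# Illustrative sketch: resize image/mask pairs to 128 x 128 and hold out
# 10% of the data for testing. Variable names are placeholders.
import cv2
import numpy as np
from sklearn.model_selection import train_test_split

def prepare_inputs(images, masks, size=(128, 128)):
    X = np.array([cv2.resize(im, size) for im in images])
    # Nearest-neighbor interpolation preserves discrete class labels.
    y = np.array([cv2.resize(mk, size, interpolation=cv2.INTER_NEAREST)
                  for mk in masks])
    return train_test_split(X, y, test_size=0.10, random_state=42)
```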

Model Architecture and Training
CNNs, particularly the U-Net architecture [18], have been recognized for their proficiency in medical image processing. The U-Net's symmetric structure distinguishes it by adeptly capturing both granular details and broader contextual information, a precision considered essential for medical imaging tasks such as segmentation. This study leverages transfer learning by employing a deep learning model anchored on the ResNet architecture [8]. The ResNet50 backbone, characterized by its deep layers and residual connections, is widely acknowledged for its robust performance. Its implementation was sourced from the segmentation_models library [10], which provides ready-made U-Net configurations.
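A minimal sketch of this construction, using the segmentation_models library with a ResNet50 encoder pre-trained on ImageNet, is shown below; the class count is an illustrative placeholder.

```python
# Illustrative sketch: U-Net with a ResNet50 backbone via segmentation_models.
import segmentation_models as sm

sm.set_framework('tf.keras')

model = sm.Unet(
    backbone_name='resnet50',
    encoder_weights='imagenet',  # transfer learning from ImageNet
    classes=6,                   # illustrative class count (see Table 1)
    activation='softmax',
)
```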
The primary goal was to ensure that diverse medical images are processed effectively without a decline in performance. To this end, a 10% dropout was integrated into the model to prevent overfitting, and to achieve a higher degree of generalization, the training metric was shifted from accuracy to IoU (see appendix). In cases of class imbalance, accuracy can provide a misleadingly optimistic view of the model's performance, as it may predominantly measure the most frequent class. Thus, while accuracy is intuitive, it may not capture the nuances of segmentation tasks, especially when object regions are sparse or imbalanced relative to the background. Noteworthy is the superior performance exhibited by the ResNet50 backbone; a likely catalyst was the availability of pre-trained encoder weights derived from the ImageNet dataset [5]. Given the limited dataset at hand, these pre-trained weights proved invaluable, obviating the need for extensive retraining.
Moreover, the model's loss function was strategically chosen to be a combination of Dice Loss and Focal Loss instead of the conventional Categorical Cross-Entropy. This decision was influenced by the dataset's inherent class imbalances, with a predominant representation of, for instance, background-class pixels. The combination of Dice Loss and Focal Loss demonstrated prowess in addressing such imbalances, ensuring a refined model training process [11].
During the model preparation phase, the training images and their corresponding masks were converted into NumPy [7] arrays, facilitating their incorporation into the neural network for training. A label encoder was then employed, and one-hot encoding was applied to the masks to format them suitably for segmentation. The chosen activation function was Softmax, paired with the Adam optimizer [13] for refining model parameters. Class weights were also assigned to address the prevalent class imbalance, and a combination of DiceLoss and CategoricalFocalLoss served as the loss function. For evaluation metrics, Intersection over Union (IoU) and F-Score, both with a threshold of 0.5, were adopted. The training process experimented with batch sizes of 1 to 5 and between 150 and 200 epochs. Following training, predictions were made, yielding the segmented images as outcomes. Figure 1 highlights the various steps in the approach employed.
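Continuing the sketches above, the following shows how such a configuration might be assembled with the segmentation_models losses and metrics; the class weights and hyperparameters are illustrative placeholders, not the study's exact values.

```python
# Illustrative sketch of the training configuration (continues the model and
# data sketches above; class weights and hyperparameters are placeholders).
import segmentation_models as sm

class_weights = [0.5, 1.0, 1.0, 1.0, 1.0, 1.0]  # hypothetical per-class weights
total_loss = (sm.losses.DiceLoss(class_weights=class_weights)
              + sm.losses.CategoricalFocalLoss())

metrics = [sm.metrics.IOUScore(threshold=0.5),
           sm.metrics.FScore(threshold=0.5)]

model.compile(optimizer='adam', loss=total_loss, metrics=metrics)

# Small batches (1-5) and 150-200 epochs, per the ranges reported above.
history = model.fit(X_train, y_train_onehot,
                    validation_data=(X_test, y_test_onehot),
                    batch_size=2, epochs=150)
```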
Efforts were also made to segment individual bones in the lower leg, namely the Tibia, Fibula, and Femur (cortex), along with the bone marrow (myeloid tissue) and muscle tissue. Three sample segmentation results using the proposed methods are shown in Figure 2, Figure 3, and Figure 4, respectively.

Volume calculation
The following approach has been proposed to derive volumetric data from segmented MRI scans. The overarching strategy is first to segment the MRI scans into distinct regions of interest and then compute the volume of each segmented region. The method centers around converting the segmented output generated by the CNN into pixel data [23].
Pixel Data to Area Calculation: Each 2D slice derived from an MRI scan provides an area represented in pixel data. To determine the physical area this pixel data represents, the spatial metadata accompanying the MRI scan is crucial. For instance, if the MRI metadata specifies a pixel-to-physical-area ratio via PixelSpacing (e.g., 1 pixel = 'x' sq. mm), this allows an accurate transformation of the segmented region's area from pixel space to real-world space. Let P be the number of pixels in a segmented region of a 2D MRI slice, and let R be the pixel-to-physical-area ratio given by the MRI metadata (e.g., in sq. mm per pixel). The area of the segmented region in a 2D slice is then:

A = P × R

2D Area to 3D Volume: Once the area of a 2D slice is known, the volume of a slice can be obtained by multiplying the area by the thickness (height) of that slice. In MRI, each slice has a defined SliceThickness, typically provided in the scan's DICOM metadata. For example, if the area of the segment on a slice measures 200 sq. mm and the slice thickness is 1 mm, then the volume of the segment on that slice is 200 cubic mm. Let T be the thickness of the MRI slice (e.g., in mm). The volume of the segmented region in a single slice is then:

V_slice = A × T = P × R × T

3D Image Stacking: Two predominant methods can be adopted to create a comprehensive 3D representation of the regions of interest from individually segmented 2D slices.
• Programmatic Stacking: This method leverages computational tools, such as array operations in Python's NumPy, to layer individual 2D slices sequentially and form a contiguous 3D structure.
• Software Stacking: Dedicated software solutions offer capabilities to stack and visualize 2D slices in a 3D reconstruction.
Summing the volumes of all individual slices gives the total volume of the region of interest. Let N be the total number of MRI slices that contain the segmented region, and let P_i be the segmented pixel count in slice i. More formally, the total volume of the 3D segmented region is:

V_total = Σ_{i=1}^{N} P_i × R × T
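A minimal sketch of this computation is given below, assuming the segmented output is available as a stack of per-slice label masks; function and variable names are illustrative.

```python
# Illustrative sketch: total volume of one class from a stack of label masks.
# PixelSpacing and SliceThickness defaults match the values reported earlier.
import numpy as np

def segment_volume(mask_stack, class_id,
                   pixel_spacing=(0.833, 0.833), slice_thickness=1.0):
    """mask_stack: (N, H, W) integer array of per-slice segmentation labels.
    Returns the class volume in cubic mm."""
    R = pixel_spacing[0] * pixel_spacing[1]        # sq. mm per pixel
    P = (mask_stack == class_id).sum(axis=(1, 2))  # pixel count P_i per slice
    return float(P.sum() * R * slice_thickness)    # sum_i P_i * R * T
```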

RESULTS
The following section details the results, specific findings, and evaluation metrics. The IoU scores in Table 1 illustrate the model's performance, as of this writing, on the current annotations. Lastly, focusing on axial slices alone could be adequate for generating sagittal and coronal views; moreover, it is inherently easier to label intricate structures axially.

DISCUSSION
The following section discusses possible improvements and potential limitations. The motivation behind this research has been to contribute to the domain of medical science by leveraging existing scientific computation methods and perhaps improving upon them.

Model Optimization
Ensemble models harnessing various backbones could potentially elevate segmentation accuracy. While the 2D U-Net offers prowess in segmenting isotropic volumes, MRI data might benefit from a 3D U-Net owing to its inherent anisotropy. The segmentation_models_3D library promises a facile implementation of the same.
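As a speculative sketch only: the segmentation_models_3D library mirrors the 2D segmentation_models API, so a volumetric U-Net might be instantiated roughly as follows (the input shape and class count are illustrative assumptions).

```python
# Speculative sketch: a 3D U-Net via segmentation_models_3D (API assumed to
# mirror the 2D library; shape and class count are illustrative).
import segmentation_models_3D as sm3d

model_3d = sm3d.Unet(
    backbone_name='resnet50',
    input_shape=(64, 64, 64, 3),  # (depth, height, width, channels)
    classes=6,
    encoder_weights='imagenet',
)
```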

Limitations and Safe Interpretation of Results
A prudent approach necessitates the examination of segmented volumes alongside additional medical data to prevent therapeutic errors. Current data limitations arise from its sourcing and non-specialist annotations, hinting at the invaluable potential of expert collaborations. In clinical and research settings, particularly where precision is paramount, the segmentation accuracy of the model must consistently align with medical standards; any deviation can have significant implications. Comparisons with established medical practices and calibrated devices would be essential, using a healthy-patient baseline for reference and even for training the model.

Future Work
The absence of publicly available annotated training datasets remains a challenge. By amalgamating data across sources, a more holistic segmentation model beckons. Intensity adjustments, resizing, and contrast corrections could significantly help consolidate multi-source data for training. Further, collaborations with medical professionals promise more generalized and accurately annotated datasets for obtaining quantitative and qualitative data. Existing atlas-based segmentation methods could also be leveraged to provide a decent starting point for annotations and to speed up the process of obtaining high-quality training data. Deep learning's prowess in MRI segmentation holds vast potential in medical imaging and in bridging the gap to applications such as physical therapy. As research in this domain progresses, the horizon only expands, promising a fusion of technology and medicine like never before.

Figure 1: A flow chart summarizing the key points discussed in the above sections.

Table 1: An overview of the model's performance (IoU scores) in segmenting the different classes across the appended figures. Trabecular bone achieved high IoU scores across all figures, while cortical bone in the Tibia and Femur also demonstrated strong performance. The muscle tissue and background classes showed variability in their scores, hinting at the need for more annotated training data for these classes. The table lists the figures' names as rows and all prediction classes as columns; where a class does not appear in a figure, the corresponding cell has intentionally been left blank.