A Computer Vision Based Colonoscopy Support System for Real-Time Monitoring of Bowel Preparation and Colonic Anatomical Localization

The high prevalence of late-stage colorectal cancer underscores the need for robust detection systems capable of mitigating its progression during its early stages. While routine colonoscopies have been the industry standard for identifying signs of early colorectal cancer, it is crucial to uphold several key quality benchmarks to ensure their effectiveness and precision. These quality indices include factors such as the scope withdrawal rate and bowel preparation, among others. Our approach leverages image processing and deep learning to establish a supportive system that highlights areas requiring improvement during scope procedures for clinical practitioners. We demonstrate this via a fine-tuned ResNet-50 architecture that assesses bowel preparation with 98.5% average accuracy, and a curvature-tracking based approach to colonic anatomical localization for precise monitoring of the withdrawal speed and bowel preparation. We show a pilot iteration of this integrated system on pre-recorded colonoscopy videos, and propose steps for further clinical testing.


INTRODUCTION
Colorectal cancer stands as one of the most prevalent cancers affecting both males and females, ranking third in incidence among men and second among women [1]. Fortunately, its prognosis is significantly improved when detected early, and patients can generally expect a high chance of full recovery. In particular, pre-cancerous polyps pose a risk of growing into more advanced adenomas and possible malignancies [2]. The removal of these adenomas has been shown to correlate with reduced cancer incidence and, thereby, mortality from colorectal cancer [3]. Colonoscopies are well-established procedures and have been shown to lead to early detection of these pre-cancerous polyps when executed appropriately. However, they are susceptible to human error, and deficiencies in quality control during their execution can result in overlooked lesions and subsequent interval cancers. Specifically, quality indicators encompassing the scope withdrawal time and the segment-specific cleanliness and bowel preparation play a critical role in determining the procedure's efficacy. Spending adequate time inspecting each segment of the colon during the withdrawal phase has been linked to a higher adenoma pick-up rate due to more meticulous examination of the mucosa [4]. Moreover, ensuring a properly cleaned bowel offers a clearer visual field, which is pivotal for practitioners identifying pre-cancerous polyps. Therefore, precise documentation and monitoring of these quality indices hold the potential to assist medical professionals in maintaining good adenoma detection rates, consequently improving the probability of early-stage colorectal cancer detection and facilitating prompt medical intervention.
The medical domain has seen an accelerated convergence of machine learning and computer vision methodologies, serving as a framework for prognostic and predictive systems in diagnosing various diseases and conditions. These systems also find utility in refining decision-making processes. Deep learning in particular has been identified as an ideal candidate for these applications, due to its prior use in aiding medical decision-making and its rapid acceleration in recent years. It has been frequently used in numerous applications such as image reconstruction, classification of diseases, and image segmentation, among others. A main advantage of deep learning methods is their ability to improve efficiency in decision-making by reducing the time that would otherwise be spent manually, such as in drug discovery [5]. Classical computer vision and processing techniques also hold an important place within these applications, specifically via feature extraction and automation of image-analysis procedures, and are extensively used across various medical domains such as pathology, cardiology, and endoscopy [6]. A key application of such computational methods is in augmenting existing systems or assisting clinical practitioners by providing a kind of "second opinion" to compare against, and to use as a benchmark to standardise decision-making. Within the realm of colonoscopy assistance, as described above, deep learning and computer vision techniques therefore offer an ideal means to augment clinical practice and provide clinicians with a valuable additional tool.
In this paper, we present a methodology to simultaneously interpret the bowel quality in each detected segment of the colon and monitor the scope withdrawal rate. Such an approach has the ability to enhance the efficacy of routine colonoscopy procedures by monitoring and advising on the previously mentioned quality indices. The task of interpreting the bowel quality is accomplished through a transfer learning approach for a multi-class classification model using a ResNet-50 neural network backbone. We propose a simple curvature-tracking based method to infer the segment of the colon being scoped at a particular time. Withdrawal time, which refers to the time spent by the practitioner during the withdrawal of the scope from the caecum, is monitored by presenting the bowel quality prediction output to the practitioner every 30 seconds and advising on the speed of the scope based on camera motion capture.

RELATED WORK

Bowel quality measure
The "quality" of the bowel prior to and during a colonoscopy routine is pivotal in assessing the effectiveness of the procedure.The Boston Bowel Preparation Scale (BBPS) is a 4-point scoring system that is used to ascertain the preparation quality of a bowel.A score of 0 indicates poor bowel preparation and a score of 3 indicates high quality bowel preparation.In clinical practice, the BBPS is calculated for 3 main segments of the colon and are aggregated to produce a final score out of 9, which determines the overall cleanliness of the bowel [7].A score below 6 is typically regarded as insufficient bowel preparation [8].

Deep learning for medical diagnosis
Machine learning techniques in endoscopy augmentation, and more specifically in the inference of bowel quality, have made considerable headway in recent years. Zhou et al. presented ENDOANGEL, a system that classifies images of the colon by their BBPS score using a deep convolutional neural network, and report an average accuracy of 91.89% [9]. Nam et al. used a similar convolutional neural network architecture to evaluate small-bowel preparation and achieved an accuracy of 93% [10]. In other related research, various model architectures and techniques have been shown to be effective in medical diagnostics. For example, Almalik et al. explored vision transformers for the classification of chest X-ray images and reported an accuracy of 97.64% [11]. Using transfer learning, Alzubaidi et al. achieved an accuracy of 97.51% for breast cancer classification [12]. Similarly, numerous studies also explore the application of deep learning for various medical diagnostics related to cardiac, respiratory, endocrine, and cranial diseases, among others.

Colonic anatomical localization
The colon is divided into seven key locations that can be thought of as "colonic segments": the rectum, sigmoid colon, descending colon, splenic flexure, hepatic flexure, ascending colon, and the caecum. Broadly, we can combine the hepatic and splenic flexure into one umbrella term: the transverse colon. Precise localization, which refers to identifying which segment of the colon the scope is in during the colonoscopy routine, is important to promote targeted treatment if a pre-cancerous polyp is detected. During the procedure, physicians typically use visual markers as well as the movement of the scope or the shape the scope moves in to determine the scope location.
Automating this procedure is a non-trivial task, and not many approaches to this localization problem have been explored in recent literature. Most commonly, deep learning has been used to classify images of the bowel to their respective locations. Saito et al. [13] used a pre-trained convolutional neural network to classify bowel images into seven anatomical locations, reporting an overall accuracy of 66.6% on the test data after training the model on ∼10000 images. Houwen et al. [14] used a similar method but used images from magnetic endoscope imaging to train a pre-trained classifier, reporting an overall accuracy of 63%. Hence, using deep learning to infer the location given an image of a section of the bowel not only requires a large amount of data, but also does not show remarkable results. This can be attributed to the complexity of the images, and it is often difficult for practitioners to identify the location using solely visual biomarkers. A different method proposed by Herp et al. [15] describes a feature-based tracking approach using an endoscopic pill that models the shape of the colon as the pill is ingested. This method achieved an average accuracy of 86% in reconstructing the colon shape and subsequently labeling the anatomical regions. None of these approaches suggest a real-time application, which poses a significant gap in the deployment of such algorithms in clinical practice. Hence, not only should a new approach accurately predict the anatomical region, but it should also do so feasibly in real time.

METHODS
Our proposed system can be broken down into three blocks: a classifier that is used to infer the Boston Bowel Preparation Scale (BBPS) score and hence provide a measure of the quality of the bowel, a localizer that predicts the current location of the scope in the bowel, and a movement monitor that advises on the scope speed and stability. The aggregated BBPS score is shown every 30 seconds, with 2 readings per location (approximately 1 minute in total per predicted segment). The speed and stability are monitored continuously throughout the procedure.

Classifier and BBPS Score Prediction
We used the publicly available Nerthus dataset [16] containing 5525 labelled images taken from 21 videos of colonoscopy routines. These images were labelled according to their BBPS score, i.e., 0, 1, 2 or 3. The dataset was randomly split into 70% training, 20% validation and 10% test sets. Further testing was also done on a dataset curated by the National University Hospital.
The training dataset was subjected to various pre-processing and augmentation techniques including resizing, rotating, flipping and normalization to enhance the variability of data in each class.
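The augmentation pipeline itself would typically be built with a standard library; purely as an illustration of the listed transforms, a minimal numpy sketch might look like the following (the function name, parameters, and nearest-neighbour resize are our simplifications, not the paper's pipeline):

```python
import numpy as np

def augment(image: np.ndarray, quarter_turns: int = 1, flip: bool = True,
            size: int = 224) -> np.ndarray:
    """Apply a resize, rotation, flip, and per-channel normalization.

    `image` is an H x W x 3 uint8 array. The resize uses nearest-neighbour
    sampling and rotations are restricted to 90-degree steps for brevity.
    """
    h, w = image.shape[:2]
    rows = np.linspace(0, h - 1, size).astype(int)
    cols = np.linspace(0, w - 1, size).astype(int)
    out = image[rows][:, cols].astype(np.float32)   # nearest-neighbour resize
    out = np.rot90(out, k=quarter_turns)            # rotation in 90-degree steps
    if flip:
        out = out[:, ::-1]                          # horizontal flip
    mean = out.mean(axis=(0, 1), keepdims=True)
    std = out.std(axis=(0, 1), keepdims=True) + 1e-8
    return (out - mean) / std                       # per-channel normalization
```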
We used transfer learning by fine-tuning the last few layers of a ResNet-50 backbone, hereafter called ScopeNet, and used focal loss as an alternative to cross-entropy to account for the imbalance in image classes.
Focal loss is defined as FL(p_t) = −(1 − p_t)^γ log(p_t), where p_t is the predicted probability of the true class. The term (1 − p_t)^γ is a modulating factor introduced in the original paper [17]. It under-weighs the loss from easy, well-classified examples, i.e., when p_t > 0.5, since the factor converges to 0 as p_t approaches 1. This decreases the influence of easy classes on the overall loss. γ is the focusing parameter, which adjusts the rate at which easy classifications are under-weighed. The ResNet-50 backbone consists of 48 convolutional layers, 1 max pooling layer, and 1 average pooling layer. Specifically, it leverages skip connections to solve the problem of vanishing gradients. The Nerthus dataset is small, and training a very deep model from scratch would lead to overfitting. By using pre-trained weights and unfreezing only the last few layers of the ResNet-50 model, and with the inherent skip connections that further reduce the vanishing gradients problem, we retain the depth the model provides while still preventing potential overfitting.
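A minimal numpy sketch of the focal loss described above, assuming the standard formulation from [17] (the function name and implementation are ours, not the paper's training code):

```python
import numpy as np

def focal_loss(p_t: np.ndarray, gamma: float = 2.0) -> np.ndarray:
    """Focal loss FL(p_t) = -(1 - p_t)^gamma * log(p_t), where p_t is the
    predicted probability of the true class. With gamma = 0 this reduces
    to the standard cross-entropy -log(p_t); larger gamma down-weighs
    easy, well-classified examples more aggressively."""
    p_t = np.clip(p_t, 1e-12, 1.0)   # guard against log(0)
    return -((1.0 - p_t) ** gamma) * np.log(p_t)
```

For a confidently correct prediction (p_t = 0.9) with γ = 2, the modulating factor (1 − 0.9)² = 0.01 shrinks the cross-entropy contribution a hundredfold, which is exactly the class-imbalance behaviour motivated above.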

Colonic Anatomical Localization
To track the motion of the colonoscope during the procedure, we propose a simple curvature-based tracking methodology. As seen in Figure 4a, there are three key turns in the colon, which occur between the ascending colon and the hepatic flexure, the splenic flexure and the descending colon, and the descending colon and the sigmoid. Hence, given a priori knowledge that the starting point is at the caecum (from the withdrawal phase), by monitoring which turn the scope takes, we can determine the current location in the colon. The camera attached to the scope offers a wealth of valuable information. In particular, when the camera peers through an opening, like the colon, we are able to perceive the overall direction of the scope's movement and anticipate the next destination. This is perhaps best understood through the analogy of peering into a tunnel. Similar to how we perceive the direction of the tunnel by observing its darkest portion, the same concept applies when we look at an image of the bowel through the scope camera. Just as the darkest point in the tunnel indicates its continuing path, the darkest part of the bowel image serves as a visual cue, allowing us to discern the direction or curvature of the scope's movement. Figure 4b demonstrates this idea.
We can identify this point of minimum intensity in the image using 2D wavelet analysis. The Daubechies wavelet of order 4 maximizes efficiency and is suitable for edge detection, which is critical for depth estimation. On performing wavelet decomposition and thresholding, the resulting image has areas of high and low intensities. The deepest point in the image can be thought of as the location with the lowest intensity value in the thresholded image. We follow the coordinate of maximum depth at each timestep in the colonoscopy routine to visualize the rough path of the scope. On this path, we perform curvature analysis to determine the critical turning points which indicate a change in the location in the colon. This is done using the Frenet-Serret formulae [19], a set of mathematical equations that describe the behavior of a curve in 3-dimensional Euclidean space. Given a parameterized curve r(t) in 3D space, the Frenet-Serret formulae allow us to compute the tangent vector T, normal vector N, and binormal vector B at any point along the curve. The tangent vector represents the direction of motion of the curve at that point, the normal vector represents the direction of curvature, and the binormal vector represents the direction of twist. Using these, we can compute the curvature κ = |r′(t) × r″(t)| / |r′(t)|³ at any point along the curve, which represents the rate at which the curve changes direction.
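As an illustration of the curvature step, a discrete version of κ = |r′ × r″| / |r′|³ can be computed from the tracked path with finite differences (a sketch under our own assumptions; the paper does not specify its implementation):

```python
import numpy as np

def curvature(path: np.ndarray) -> np.ndarray:
    """Discrete curvature of a 3D path (an N x 3 array of points) via the
    Frenet-Serret relation kappa = |r' x r''| / |r'|^3, with derivatives
    estimated by central finite differences along the path."""
    d1 = np.gradient(path, axis=0)        # r'(t)
    d2 = np.gradient(d1, axis=0)          # r''(t)
    num = np.linalg.norm(np.cross(d1, d2), axis=1)
    den = np.linalg.norm(d1, axis=1) ** 3
    return num / np.maximum(den, 1e-12)   # guard against stationary points
```

A sanity check is a circle of radius R, whose curvature is 1/R everywhere; the estimate is parameterization-invariant, so the sampling rate of the path does not affect it.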
For the pilot study, we focus on running tests to find the appropriate threshold for the curvature measure at which a "turn" to the next location is indicated. To implement this localization mechanism in a real-time system, whenever the κ value, measured at every point relative to the past coordinates in the path, surpasses the set threshold, the location predicted by the system is updated.
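The threshold-based location update could be sketched as follows (the segment labels, rising-edge debouncing, and function name are illustrative assumptions, not the authors' implementation):

```python
# Withdrawal-phase segment sequence, starting from the caecum.
SEGMENTS = ["caecum/ascending colon", "hepatic flexure/transverse colon",
            "splenic flexure/descending colon", "sigmoid", "rectum"]

def track_location(kappas, threshold: float):
    """Advance through the segment sequence each time the curvature rises
    above the threshold (i.e., a 'turn' is detected). Only rising edges
    count, so one sustained turn advances the location exactly once.
    Returns the predicted segment label at each timestep."""
    idx, labels, above = 0, [], False
    for k in kappas:
        crossing = k > threshold
        if crossing and not above and idx < len(SEGMENTS) - 1:
            idx += 1          # rising edge: a new turn was taken
        above = crossing
        labels.append(SEGMENTS[idx])
    return labels
```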

Motion Analysis
An additional feature we propose to enhance the effectiveness of the colonoscopy procedure is motion analysis and advising on the scope speed. If the practitioner moves the scope too fast during the procedure, the quality of the images passed to the neural network and the localizer would be substandard, which would negatively impact the performance of the system. Additionally, monitoring the speed can be useful in ensuring the colon is adequately scoped. We use Fast Fourier Transform (FFT) analysis to evaluate the blurriness of the images being passed to the pipeline. Particularly, by analyzing the magnitude of the FFT of the image and applying the inverse FFT, we can estimate the degree of motion blur and determine if the scope is moving too fast. Further testing in a clinical setting is required to ascertain the exact threshold to determine when the scope is moving too fast based on the FFT results.
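One common way to realize such an FFT-based blur check, sketched here as an assumption rather than the authors' exact method, is to score the fraction of spectral energy at high spatial frequencies; motion blur suppresses these, so low scores flag fast scope movement:

```python
import numpy as np

def high_freq_ratio(gray: np.ndarray, radius_frac: float = 0.25) -> float:
    """Fraction of spectral energy outside a low-frequency disc of the
    2D FFT. Lower values indicate a blurrier frame, since motion blur
    attenuates high spatial frequencies."""
    spectrum = np.fft.fftshift(np.fft.fft2(gray))   # DC moved to the centre
    energy = np.abs(spectrum) ** 2
    h, w = gray.shape
    yy, xx = np.ogrid[:h, :w]
    r = np.hypot(yy - h / 2, xx - w / 2)
    low = r <= radius_frac * min(h, w) / 2          # low-frequency disc
    return float(energy[~low].sum() / energy.sum())
```

A frame would be flagged as too fast when this ratio falls below a threshold, which, as noted above, must be calibrated in a clinical setting.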

EXPERIMENTS

Boston Bowel Preparation Scale Score Prediction
We ran the ScopeNet model using standard parameters, i.e., 10 epochs, a learning rate of 0.01, a batch size of 32, a gamma of 0.1 and a step size of 7. We then used Bayesian optimisation to tune these parameters, with the final set being 15 epochs, a learning rate of 0.025, a batch size of 80 and a step size of 10. Our model performed better than other architectures, as seen in Table 1.
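As a self-contained stand-in for the Bayesian optimisation step (which would model the objective and propose promising configurations rather than sample uniformly), a plain random search over a hypothetical space built around the reported hyperparameters looks like this; the search space and objective are illustrative only:

```python
import random

# Hypothetical discrete search space around the values reported above.
SPACE = {
    "epochs": [10, 15, 20],
    "lr": [0.001, 0.01, 0.025, 0.05],
    "batch_size": [32, 64, 80, 128],
    "step_size": [7, 10, 14],
}

def random_search(objective, n_trials: int = 20, seed: int = 0):
    """Evaluate `objective` (e.g. validation accuracy) on uniformly
    sampled configurations and return the best one found. A Bayesian
    optimiser would replace the uniform sampling with model-guided
    proposals but share this evaluate-and-track-best loop."""
    rng = random.Random(seed)
    best_cfg, best_val = None, float("-inf")
    for _ in range(n_trials):
        cfg = {k: rng.choice(v) for k, v in SPACE.items()}
        val = objective(cfg)
        if val > best_val:
            best_cfg, best_val = cfg, val
    return best_cfg, best_val
```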

Colonic Anatomical Localization
Real images obtained from colonoscopy videos exhibit complexity, including blurry frames and specular noise. As a result, the curvature-based method is initially tested on a publicly available dataset of synthetic colon images, generated from CT-colonography [21]. This dataset comprises 16,016 RGB images depicting different segments of the colon; however, they lack location annotations. Thus, this dataset is solely utilized to assess the viability of the proposed method. Additionally, we used the dataset curated by the National University Hospital (NUH) for further testing of the localization methodology for the pilot study.
For the initial testing of the curvature analysis approach, 364 images were used from the CT-colonography dataset. Following the pipeline, the 3D visualization of the entire path flow is shown in Figure 4c. Figures 4d and 4e show the curvature analysis for a short segment of this path.
From the curvature graph in Figure 4e, we can see that the maxima of curvature occur at ∼20s, ∼40s and ∼60s from the first frame. This corroborates the 3D plot, implying that the method can work on the synthetic data. On the NUH dataset, due to the images being more complex, we evaluate the performance on shorter segments. In particular, since the movement of the scope is rougher due to its back-and-forth motion, there is a lot more variability in the path. In addition, the real data has much more noise in the images than the synthetic dataset, specifically motion blur and specular noise. Hence, we apply a median filter and evaluate the performance with varying kernel sizes to determine the best results. We observe that a bigger kernel size generally improves the prediction of the path. This is particularly because of the removal of noise, which affects the estimation of the point of greatest depth. However, increasing the kernel size also results in the loss of information which may be useful in later stages, so the optimal kernel size is worth investigating.
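The median filtering step, with the kernel size k as the swept parameter, can be sketched as follows (an illustrative pure-numpy version; in practice a library routine such as scipy.ndimage.median_filter would be used):

```python
import numpy as np

def median_filter(img: np.ndarray, k: int) -> np.ndarray:
    """Simple 2D median filter with a k x k window (k odd), using edge
    padding so the output matches the input size. Isolated specular
    spikes are replaced by the local median of their neighbourhood."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.empty(img.shape, dtype=np.float64)
    h, w = img.shape
    for y in range(h):
        for x in range(w):
            out[y, x] = np.median(padded[y:y + k, x:x + k])
    return out
```

Sweeping k trades noise suppression (larger windows remove more specular spikes) against the loss of fine detail noted above.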
Based on the 3D path graphs, we can make estimations about the current location from our prior knowledge that the withdrawal phase begins from the caecum. Therefore, we assume that the initial point of maximum curvature indicates a turn from the ascending colon to the hepatic flexure, the second point of maximum curvature signifies a turn from the splenic flexure to the descending colon, the third point indicates a turn from the descending colon to the sigmoid, and finally, that the last point denotes a turn from the sigmoid to the rectum. It should be noted that differentiating between the hepatic and splenic flexure is not feasible with this method. However, as outlined in the following section, deep learning can be employed in conjunction with this approach to identify these anatomical regions. Furthermore, due to the complexities present in real data, resulting in numerous disturbances in the path, accurately modeling the complete path at the current stage has proven to be challenging and requires further research. Figure 5 provides an annotated path from the descending colon to the splenic flexure to demonstrate the method's validity as a proof-of-concept.

BBPS Score Prediction
In the experiments carried out as described, the main evaluation metrics used were accuracy, precision, and recall (for each individual class). After Bayesian optimisation, we found that the ResNet-50 backbone provides the best performance overall and can be used for deployment. ResNet-50 particularly outperforms VGG and DenseNet due to its architecture of shortcut connections and residual functions, which reduce the training loss while still maintaining the complexity in terms of the number of stacked layers [22]. With 48 convolutional layers, 1 max pooling layer, and 1 average pooling layer, ResNet-50 facilitates this behavior. Moreover, the inclusion of residual functions mitigates the issue of vanishing gradients, a common limitation of the VGG architecture. To support our findings, further tests should be conducted on more challenging test sets. An interesting observation is the poor performance of the vision transformers. One particular reason why this might have occurred is the limited training set, since we were training the vision transformer from scratch. Additionally, vision transformers typically do not perform as well on images that mainly differ in texture and other finer details, as seen in the images [23].

Colonic Anatomical Localization
Although this method establishes a baseline for anatomical location prediction, further refinement and investigation are necessary to enhance its rigor. One significant limitation lies in the depth estimation stage, where the estimated point can be incorrect due to the perspective of the camera. By applying perspective transformations to the images, those captured at the same or similar timepoints can be standardized, resulting in reduced fluctuations within short time periods. Another challenge tackled in this study relates to the usability of frames in the colonoscopy video. Since some frames are not usable, there are often various "jumps" between frames, where some images within certain time periods are not used. This discontinuity in the images adds to the fluctuation and can cause the exact moment of location change to be missed. To overcome this, we increased the number of frames taken per second, which resulted in slightly more images being classified as usable. However, since this did not solve the problem completely, in the 3D visualization we used an interpolated spline to smoothen the path representation. Further work on overcoming this problem should be done to make the system more robust. Lastly, while this method works in some cases on poorly prepped bowels, it does not generalize well, limiting its scope at this stage to predicting anatomical regions in well-prepped bowels. This system can be enhanced by combining it with a visual learner such as a CNN to improve the confidence of the prediction as well. However, this integration presents a significant challenge, as it requires a substantial amount of well-annotated data.

INTEGRATION TO A REAL-TIME SUPPORT SYSTEM
The application of the BBPS predictor and the colonic anatomical localizer is real-time monitoring and advising during the colonoscopy routine. We demonstrate a pilot iteration on a pre-recorded colonoscopy video obtained from NUH to showcase the methodology. The colonic anatomical localizer is not explicitly shown; however, this video was cropped to show only a homogeneous segment of the colon. The demonstration can be found here. In our video, we show the score prediction every 5 seconds (in longer videos, it is every 30 seconds), a real-time histogram recording the scores, and a running motion analysis (the methodology of which was described in the Methods section).

CONCLUSIONS AND FURTHER WORK
In this paper, we presented a methodology to augment colonoscopy outcomes and discussed a proof of viable translation to clinical settings. We presented ScopeNet, a transfer-learning based architecture to predict the Boston Bowel Preparation Scale score with an average accuracy of 98.5% from images of the bowel, a segment-wise colonic localizer, and additional tools such as speed and motion analysis.
The integration of such systems has the ability to reduce the burden on practitioners in real time and improve the early detection of pre-cancerous polyps during colonoscopy routines. However, this research will vastly benefit from further work on the subcomponents of this system. Studies performing a benchmark analysis against human physicians are vital to evaluate the necessity and efficacy of the BBPS predictor we described. Furthermore, innovations in data pre-processing to exploit the visual characteristics of bowel images could promote new methods and insights into medical computer vision. The colonic anatomical localizer presented will also require further extensive testing with various types of colonoscopy videos, and new improvements and methodologies should be proposed to encapsulate abnormal cases of colonoscopies. In addition to improvements in these phases of the project, further research in the realm of augmenting colonoscopy outcomes using machine learning and automated systems could include volumetric analysis through 3D renderings of the colon. This would help visualize the scope routine in more detail, promote more precise targeted treatment, and make the system more complete.

Figure 2 :
Figure 2: a Images are passed into the ScopeNet model (described in the Methods section) sequentially. The model outputs the BBPS score for the image. b Simultaneously, the location of the scope in the colon is predicted (the segment of the colon the scope is in). c Additional features such as motion analysis are computed via blur detection using FFT.

Figure 4 :
Figure 4: a Anatomical segmentation of the colon [20]. b Visualisation of determining the next location. c 3D visualisation of the path taken by the colonoscope during a colonoscopy. d Visualisation of a short path segment during a colonoscopy. e Curvature analysis calculated using the Frenet-Serret equations.

Figure 5 :
Figure 5: Annotated path from descending colon to splenic flexure and corresponding curvature analysis.

Table 1 :
Results of Comparative Experiments