Original Article | Clinical Endoscopy | Volume 97, Issue 2, P335-346, February 2023

# Novel deep learning–based computer-aided diagnosis system for predicting inflammatory activity in ulcerative colitis

∗ Drs Fan and Xu and Mr Mu contributed equally to this article.
Published: August 17, 2022

### Background and Aims

Endoscopy is increasingly performed for evaluating patients with ulcerative colitis (UC). However, its diagnostic accuracy is largely affected by the subjectivity of endoscopists’ experience and scoring methods, and scoring of selected endoscopic images cannot reflect the inflammation of the entire intestine. We aimed to develop an automatic scoring system using deep-learning technology for consistent and objective scoring of endoscopic images and full-length endoscopic videos of patients with UC.

### Methods

We collected 5875 endoscopic images and 20 full-length videos from 332 patients with UC who underwent colonoscopy between January 2017 and March 2021. We trained the artificial intelligence (AI) scoring system using these images, which was then used for full-length video scoring. To more accurately assess and visualize the full-length intestinal inflammation, we divided the large intestine into a fixed number of “areas” (cecum, 20; transverse colon, 20; descending colon, 20; sigmoid colon, 15; rectum, 10). The scoring system automatically scored inflammatory severity of 85 areas from every video and generated a visualized result of full-length intestinal inflammatory activity.

### Results

Compared with endoscopist scoring, the trained convolutional neural network achieved 86.54% accuracy in the Mayo-scored task, whereas the kappa coefficient was .813 (95% confidence interval [CI], .782-.844). The metrics of the Ulcerative Colitis Endoscopic Index of Severity–scored task were encouraging, with accuracies of 90.7%, 84.6%, and 77.7% and kappa coefficients of .822 (95% CI, .788-.855), .784 (95% CI, .744-.823), and .702 (95% CI, .612-.793) for vascular pattern, erosions and ulcers, and bleeding, respectively. The AI scoring system predicted each bowel segment’s score and displayed distribution of inflammatory activity in the entire large intestine using a 2-dimensional colorized image.

### Conclusions

We established a novel deep learning–based scoring system to evaluate endoscopic images from patients with UC, which can also accurately describe the severity and distribution of inflammatory activity through full-length intestinal endoscopic videos.

#### Abbreviations:

AI (artificial intelligence), CADx (computer-assisted diagnosis), CNN (convolutional neural network), MMES (modified Mayo endoscopic score), UC (ulcerative colitis), UCEIS (Ulcerative Colitis Endoscopic Index of Severity)
Ulcerative colitis (UC) is a chronic idiopathic inflammatory bowel disease.
A previous study reported that endoscopic evaluations can determine suitable treatment methods according to a reliable index—the Mayo endoscopic score—in which the severity of UC is divided into 4 grades (0, normal; 1, mild; 2, moderate; 3, severe).
In addition, the extent and severity of UC are evaluated by the Ulcerative Colitis Endoscopic Index of Severity (UCEIS) score, which is calculated as a summation of 3 subscores: vascular pattern (0-2 points), bleeding (0-3 points), and erosions and ulcers (0-3 points).
However, evaluations often differ between endoscopists, with kappa values ranging from .45 to .53 for beginners and .71 to .74 for experts.
Procedures to reduce this bias incur considerable delays and costs.
Artificial intelligence (AI) tools have shown great potential in the field of medical image recognition and data integration.
After training with a large number of endoscopic images, AI can provide faster and more reliable automated estimation of UC endoscopy images. Previous research has achieved high accuracy in UC endoscopy image classification tasks. These existing AI systems focused on image and video classification based on Mayo and UCEIS scores and demonstrated results equivalent to those of doctors.
However, most existing techniques score only the mucosal region with the most severe lesions on single endoscopic images, which cannot reflect the lesion range of UC or the distribution of inflammatory activity across intestinal segments.
Gottlieb et al
considered full-length endoscopy videos using recurrent neural networks and provided a consolidated score representing the overall condition of the entire intestine. However, tools to quantitatively evaluate inflammatory activity (distribution and severity) in the entire intestine (ascending colon, transverse colon, descending colon, sigmoid colon, and rectum) are limited.
To address these technical limitations, we sought to develop a computer-assisted diagnosis (CADx) scoring system by deep learning technology.
The goal of this retrospective study was to determine whether an AI system can achieve evaluation accuracy equivalent to that of novice and experienced endoscopists. The study also aimed to validate a scoring system that could predict the distribution and severity of inflammation and visualize inflammatory activity in each segment and in the entire intestine.

## Methods

An overview of the CADx system architecture is presented in Figure 1.

### Image and video data preparation

#### Image labeling

Standard colonoscopes (EC450, EC580, EC600, and CF-EC760; Fuji Medical Systems, Tokyo, Japan) with white light were used. The images and videos were obtained between January 2017 and July 2019 from 332 patients who were treated for UC at Zhongshan Hospital of Xiamen University. All patients underwent total colonoscopy, and endoscopic images were obtained of all 5 intestinal segments. The current study included only clear white-light endoscopic images without image-enhanced endoscopy, stool, blur, or halation. All images were labeled with Mayo and UCEIS scores by 4 endoscopists with 30, 11, 4, and 6 years of experience, respectively. The study was approved by the local Ethical Review Board of Zhongshan Hospital Affiliated to Xiamen University (IRB2022144).
The endoscopists scored each image after mutual discussion. We defined an image as having a “clean label” when at least 3 of the 4 endoscopists reached the same result. Some images received varied labels because the endoscopists, with their differing experience, arrived at different conclusions. Li et al demonstrated how to use noisy labels to improve classification accuracy in AI visual recognition tasks. We therefore included some “noisy label” images in the training samples, selected as follows. First, the 4 endoscopists must have produced 2 pairs of similar labels. For example, if 2 of them labeled an image as Mayo score 0 and the other 2 labeled it as Mayo score 1 (represented as 0, 0, 1, 1), the image was kept; results such as (0, 0, 1, 2) or (0, 0, 2, 2) were excluded. Second, we chose the more severe result as the training label: Mayo score 1 for the results (0, 0, 1, 1) and Mayo score 3 for the results (2, 2, 3, 3).
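The selection rule above can be sketched as a small function; the requirement that the two pairs of labels be adjacent grades is inferred from the worked examples (the kept (0, 0, 1, 1) vs the excluded (0, 0, 2, 2)):

```python
from collections import Counter

def select_training_label(labels):
    """Assign a training label from 4 endoscopists' scores.

    - "clean": at least 3 of the 4 endoscopists agree; use the majority score.
    - "noisy": two pairs of adjacent scores, e.g. (0, 0, 1, 1); use the more
      severe score for training.
    - otherwise the image is excluded from training.
    """
    counts = Counter(labels)
    score, votes = counts.most_common(1)[0]
    if votes >= 3:
        return score, "clean"
    if sorted(counts.values()) == [2, 2]:
        low, high = sorted(counts)
        if high - low == 1:
            return high, "noisy"  # train on the more severe score
    return None, "excluded"
```

For example, `select_training_label((2, 2, 3, 3))` yields Mayo score 3 as a noisy training label, while `(0, 0, 1, 2)` is excluded.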
The numbers of clean-label and noisy-label images in each classification task are shown in Table 1. Noisy-label images accounted for ≥10% of each dataset (21.61% in the Mayo score, 13.39% in the UCEIS erosions and ulcers score, 10.44% in the UCEIS bleeding score, and 25.01% in the UCEIS vascular pattern score). We used specially designed algorithmic modules to process these noisy-label images, and the effective volume of training data increased significantly after applying the distillation method to them. For validation, we selected 30% of the total number of images in each category; any noisy-label images were excluded from the validation set and replaced with reselected clean-label images.
Table 1. Detailed information of the endoscopic image dataset

| Subitem | Total | 0 | 1 | 2 | 3 |
|---|---|---|---|---|---|
| Mayo | 5875 | 1064 | 895 | 1855 | 2061 |
| &nbsp;&nbsp;Clean label | 4831 | 874 | 736 | 1526 | 1695 |
| &nbsp;&nbsp;Noisy label | 1044 | 190 | 159 | 329 | 366 |
| Erosions and ulcers | 4251 | 1397 | 1146 | 1342 | 366 |
| &nbsp;&nbsp;Clean label | 3749 | 1231 | 1010 | 1183 | 325 |
| &nbsp;&nbsp;Noisy label | 502 | 166 | 136 | 159 | 41 |
| Bleeding | 920 | 336 | 268 | 172 | 144 |
| &nbsp;&nbsp;Clean label | 833 | 304 | 242 | 155 | 130 |
| &nbsp;&nbsp;Noisy label | 87 | 32 | 26 | 17 | 14 |
| Vascular pattern | 5099 | 942 | 864 | 3293 | — |
| &nbsp;&nbsp;Clean label | 4079 | 753 | 691 | 2635 | — |
| &nbsp;&nbsp;Noisy label | 1020 | 189 | 173 | 658 | — |

—, Not applicable.

#### Video collection

Twenty videos collected from 18 patients were processed at Zhongshan Hospital Affiliated to Xiamen University from January 2019 to August 2019 using EPX-4450HD/ELUXEO 7000 processors (Fuji Medical Systems, Tokyo, Japan). We excluded endoscopic videos that were too long or too short, ensuring a similar recording time for each video. Each frame was converted to a 1280 × 1024 red–green–blue endoscopic image. All videos included the 5 intestinal segments, and the beginning and end times of each segment were marked by endoscopists. Each segment received a baseline modified Mayo endoscopic score (MMES) from 1 expert endoscopist.

### AI algorithm

In our work, the main AI structure can be divided into the following modules: an image classification framework, a video processing pipeline, and a weighted scoring system. ResNet50 pretrained on ImageNet was selected as the main convolutional neural network (CNN) framework because of its reliability and validity.

ResNet50, a widely used convolutional feature extractor, formed the first step of the training protocol; the trained CNN then produced, for each image, a probability score ranging from 0 to 1 indicating the likelihood that the image belonged to each Mayo or UCEIS grade.
A specialized CNN model was designed for the video processing pipeline, containing a visual clarity module and an image-similarity determination module. This model preprocessed the highly complex video data to obtain an image sequence as input for the scoring module. The visual clarity module first screened all frames to remove low-clarity images. The similarity determination procedure then judged the similarity of each frame against its 4 adjacent frames and removed frames too similar to the previous and subsequent frames, preventing these images from being scored repeatedly.
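The two-stage preprocessing can be sketched as below. The paper does not name its clarity metric or thresholds; the Laplacian-variance blur test, the mean-absolute-difference similarity test, and the threshold values here are all stand-in assumptions:

```python
import numpy as np

def sharpness(gray: np.ndarray) -> float:
    """Variance of a 4-neighbour Laplacian; low values suggest a blurry frame."""
    lap = (-4.0 * gray[1:-1, 1:-1]
           + gray[:-2, 1:-1] + gray[2:, 1:-1]
           + gray[1:-1, :-2] + gray[1:-1, 2:])
    return float(lap.var())

def filter_frames(frames, clarity_thresh=1.0, sim_thresh=5.0):
    """Drop low-clarity frames, then drop frames too similar to the 4 most
    recently kept frames (the paper compares each frame with 4 adjacent
    images; thresholds here are hypothetical)."""
    clear = [f for f in frames if sharpness(f) > clarity_thresh]
    kept = []
    for f in clear:
        if all(np.abs(f - k).mean() > sim_thresh for k in kept[-4:]):
            kept.append(f)
    return kept
```

In this sketch a constant (fully blurred) frame has zero Laplacian variance and is removed, and an exact duplicate of a kept frame has zero mean difference and is removed.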

### Scoring model development

The scoring system was built with 2 parts: a baseline scoring model and a weighted scoring model.

#### Baseline scoring model: MMES

The MMES provides a quick sketch of the full-length intestinal situation and is expressed as follows:

$$\mathrm{MMES} = \sum_{i=0}^{5} \frac{M_i \times \mathrm{length}}{N}$$

where $M_i$ (0-3) is the clinician's approximate estimate of lesion severity in a particular intestinal segment, length is the length of the inflamed part, and $N$ (0-5) is the number of intestinal segments with inflammation. The values obtained with this equation are determined subjectively by each endoscopist based on experience and may therefore differ considerably.
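The baseline calculation can be sketched as follows; whether the inflamed length is measured per segment (as assumed here) and its units are not specified in the text, and the segment data in the usage example are hypothetical:

```python
def mmes(inflamed_segments):
    """Baseline MMES sketch.

    inflamed_segments: (M_i, length_i) pairs, one per intestinal segment
    with inflammation; N = number of such segments.
    """
    n = len(inflamed_segments)
    if n == 0:
        return 0.0  # no inflamed segments
    return sum(m * length for m, length in inflamed_segments) / n
```

For example, two inflamed segments scored (3, 2.0) and (2, 3.0) give (6 + 6) / 2 = 6.0.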

#### Weighted scoring model

A weighted scoring model was built and then applied to the image sequence. Based on clinical experience, we divided the intestine into a fixed number of “areas” (cecum, 20; transverse colon, 20; descending colon, 20; sigmoid colon, 15; rectum, 10). Each area contains a certain number of intestinal images. As illustrated in Figure 2, taking the descending colon as an example, we listed the basic dividing and scoring steps. The system calculated the area score according to the following formula:

where avg is the average predicted Mayo score over the area's image sequence, Num_x is the proportion of images in the area predicted as Mayo score x, and PPV_x (positive predictive value) is the statistical analysis value of the AI algorithm's predictions in the Mayo classification task, measuring the ratio of true-positive cases to all cases classified as positive. Clinicians focus on high-severity inflammation during clinical diagnosis; thus, in the area-score calculation, we took the Mayo score (the values 1, 2, and 3 representing mild, moderate, and severe inflammation, respectively) as the weight coefficient.
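The area-score formula itself did not survive into this text. One reconstruction consistent with the definitions above (an assumption, not the published equation) is

$$\text{AreaScore} = \mathrm{avg} \times \sum_{x=1}^{3} x \cdot \mathrm{Num}_x \cdot \mathrm{PPV}_x$$

This form also reproduces the score clusters reported in the Results: an area of pure Mayo 3 predictions gives $3 \times 3 \times 1 \times 0.9206 \approx 8.3$, inside the (8.0, 8.4) interval observed for severe areas, and pure Mayo 1 gives $1 \times 1 \times 1 \times 0.7532 \approx 0.75$, inside the (.72, 1.12) interval, using the PPV values of Table 2.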
After determining the scores for each area of the intestinal segment, as shown in the blue box in Figure 3, the segment score was generated by a linear combination as follows:

where ratio and L represent the percentage proportion of each inflammation severity and the number of image sequences of this intestinal segment, respectively.

### Statistical analysis

Primary outcome measures were defined as the accuracy of the AI algorithm for predicting Mayo and UCEIS scores. The agreement between endoscopists and AI was evaluated using the kappa coefficient.
Moreover, the harmonic mean of the positive predictive value and sensitivity—the F1 score—was selected as a basis for evaluating model performance. The secondary outcome measures were the outputs of the scoring system, including the area score and segment score. We discuss the value distribution of the area score. To verify the validity of the segment score, the validation focused on the correlation between the weighted scoring system and the human observers' baseline scoring model. Consistency was evaluated with a confusion matrix and other metrics, including positive predictive value, negative predictive value, sensitivity, and specificity.
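The agreement statistic can be computed directly from a confusion matrix; a minimal sketch of the unweighted Cohen's kappa is shown below (the study's cited methodology concerns weighted kappa, so the published analysis may use a weighted variant):

```python
import numpy as np

def cohens_kappa(cm) -> float:
    """Unweighted Cohen's kappa from a confusion matrix
    (rows: endoscopist labels, columns: AI predictions)."""
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    p_observed = np.trace(cm) / total
    # chance agreement from the marginal label frequencies
    p_expected = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / total**2
    return (p_observed - p_expected) / (1.0 - p_expected)
```

Perfect agreement yields 1.0, and agreement no better than chance yields approximately 0.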

## Results

### CNN output for predicting Mayo and UCEIS scores

After excluding colonoscopy images associated with perforations or other adverse events, 5875 clear white-light images were analyzed in this study. The numbers of validation images in each Mayo and UCEIS category were as follows: Mayo, 1762 images; UCEIS erosions and ulcers subitem, 1275 images; UCEIS bleeding subitem, 276 images; and UCEIS vascular pattern subitem, 1529 images. The trained CNN achieved high performance on each primary outcome measure, with 86.54% accuracy in the Mayo-scored classification task and a kappa coefficient of .813 (95% confidence interval [CI], .782-.844). The accuracies of the UCEIS-scored tasks for vascular pattern, erosions and ulcers, and bleeding were 90.7%, 84.6%, and 77.7%, respectively, and the kappa coefficients were .822 (95% CI, .788-.855), .784 (95% CI, .744-.823), and .702 (95% CI, .612-.793), respectively.
As shown in Table 2, the AI algorithm identified the Mayo categories (0, 1, 2, 3) with positive predictive values of 86.78% (95% CI, .8405-.8951), 75.32% (95% CI, .7184-.7880), 84.74% (95% CI, .8184-.8764), and 92.06% (95% CI, .8988-.9424), respectively; negative predictive values of 96.88% (95% CI, .9548-.9828), 95.05% (95% CI, .9330-.9680), 94.42% (95% CI, .9257-.9627), and 95.62% (95% CI, .9397-.9727); sensitivities of 87.50% (95% CI, .8476-.9024), 69.05% (95% CI, .6572-.7238), 87.50% (95% CI, .8455-.9045), and 92.06% (95% CI, .8988-.9424); specificities of 96.68% (95% CI, .9523-.9813), 96.33% (95% CI, .9481-.9785), 93.06% (95% CI, .9101-.9511), and 95.62% (95% CI, .9397-.9727); and F scores of 87.14%, 72.05%, 86.10%, and 92.06%. To ensure the credibility of the AI results, we compared the probability estimate output with the results labeled by the doctors (samples are shown in Fig. 4). Regarding the UCEIS scores, the vascular pattern subitem achieved F scores of 87.83%, 76.92%, and 95.15% for subscores (0, 1, 2), respectively. The erosions and ulcers subitem achieved F scores of 89.30%, 78.48%, 85.61%, and 83.58% for subscores (0, 1, 2, 3), respectively. The bleeding subitem achieved F scores of 88.14%, 71.11%, 70.83%, and 78.26% for subscores (0, 1, 2, 3), respectively.
Table 2. Predicting metrics with the Mayo and UCEIS scores

| Score | Positive predictive value (%) | Negative predictive value (%) | Sensitivity (%) | Specificity (%) | F score (%) |
|---|---|---|---|---|---|
| **Mayo score** | | | | | |
| 0 | 86.78 (.8405-.8951) | 96.88 (.9548-.9828) | 87.50 (.8476-.9024) | 96.68 (.9523-.9813) | 87.14 |
| 1 | 75.32 (.7184-.7880) | 95.05 (.9330-.9680) | 69.05 (.6572-.7238) | 96.33 (.9481-.9785) | 72.05 |
| 2 | 84.74 (.8184-.8764) | 94.42 (.9257-.9627) | 87.50 (.8455-.9045) | 93.06 (.9101-.9511) | 86.10 |
| 3 | 92.06 (.8988-.9424) | 95.62 (.9397-.9727) | 92.06 (.8988-.9424) | 95.62 (.9397-.9727) | 92.06 |
| **UCEIS vascular pattern** | | | | | |
| 0 | 88.30 (.8550-.9109) | 97.11 (.9565-.9856) | 87.37 (.8448-.9026) | 97.34 (.9594-.9874) | 87.83 |
| 1 | 75.58 (.7185-.7931) | 95.74 (.9399-.9750) | 78.31 (.7473-.8189) | 95.07 (.9319-.9695) | 76.92 |
| 2 | 95.44 (.9363-.9725) | 90.56 (.8802-.9310) | 94.86 (.9294-.9678) | 91.57 (.8916-.9398) | 95.15 |
| **UCEIS erosions and ulcers** | | | | | |
| 0 | 90.98 (.8820-.9376) | 94.14 (.9186-.9642) | 87.68 (.8449-.9087) | 95.79 (.9384-.9774) | 89.30 |
| 1 | 76.23 (.7209-.8037) | 92.69 (.9016-.9522) | 80.87 (.7705-.8469) | 90.58 (.8774-.9342) | 78.48 |
| 2 | 84.06 (.8050-.8762) | 94.04 (.9174-.9634) | 87.22 (.8398-.9046) | 92.41 (.8984-.9498) | 85.61 |
| 3 | 93.33 (.9091-.9575) | 97.71 (.9626-.9916) | 75.68 (.7151-.7984) | 99.48 (.9878-.9999) | 83.58 |
| **UCEIS bleeding** | | | | | |
| 0 | 89.66 (.8366-.9567) | 94.29 (.8972-.9886) | 86.67 (.7997-.9337) | 95.65 (.9163-.9967) | 88.14 |
| 1 | 72.73 (.6396-.8150) | 90.91 (.8525-.9657) | 69.57 (.6051-.7863) | 92.11 (.8680-.9742) | 71.11 |
| 2 | 70.83 (.6188-.7978) | 90.67 (.8494-.9640) | 70.83 (.6188-.7978) | 90.67 (.8494-.9640) | 70.83 |
| 3 | 75.00 (.6647-.8353) | 94.67 (.9085-.9909) | 81.82 (.7422-.8942) | 92.21 (.8693-.9749) | 78.26 |

Values in parentheses are 95% confidence intervals.
UCEIS, Ulcerative Colitis Endoscopic Index of Severity.

### Scoring system results of the endoscopic videos

#### Results of the area score

We evaluated 20 full-length endoscopic videos from 1 center with a total of 1700 slice areas (20 × (20 + 20 + 20 + 15 + 10)). As shown in the histogram in Figure 5, 91.47% (1555/1700) of the area scores predicted by the automatic scoring system were distributed in 4 intervals: 852 cases in (0, .4), 267 cases in (.72, 1.12), 292 cases in (3.2, 3.6), and 144 cases in (8.0, 8.4), accounting for 50.1%, 15.7%, 17.2%, and 8.5%, respectively. This is consistent with the 4-grade classification of the Mayo score by physicians' global assessment (0, normal; 1, mild; 2, moderate; and 3, severe). After carefully analyzing the numerical distribution of the area score results, we chose the following 4 ranges as the severity division of the area score: normal, (.0, .4); mild, (.4, 1.2); moderate, (1.2, 4.0); and severe, (4.0, 8.4) (Fig. 2).
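The interval-to-severity mapping above can be written as a small helper; the half-open boundary handling is an assumption, since the text gives only the interval endpoints:

```python
def area_severity(score: float) -> str:
    """Map an area score to the 4 severity bins chosen in this study:
    normal (.0, .4); mild (.4, 1.2); moderate (1.2, 4.0); severe (4.0, 8.4)."""
    if score < 0.4:
        return "normal"
    if score < 1.2:
        return "mild"
    if score < 4.0:
        return "moderate"
    return "severe"
```

For example, an area score of 3.4 (inside the (3.2, 3.6) cluster) maps to "moderate".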

#### Validity of the segment score

To verify the effectiveness of the segment score, we defined a comparison method based on the consistency of intestinal segment inflammation. First, in the MMES system, the endoscopist assigns an overall Mayo score M (0, 1, 2, 3) to each intestinal segment, which served as our intestinal segment inflammation baseline. Second, as shown in Figure 3, the automatic scoring system calculated the proportions of areas at each inflammation severity, and the severity with the highest proportion was compared with the baseline. The experts identified 65 abnormal segments (Mayo 1, 21; Mayo 2, 30; Mayo 3, 14), and both methods were applied to obtain severity results. The confusion matrix in Figure 6 displays the comprehensive comparison: in 54 of the 65 segments, the highest-proportion severity predicted by the scoring method corresponded to the endoscopists' overall Mayo score. Figure 7 shows the exact comparison between the human observer and the scoring system.

### Visualized results of intestinal inflammatory activities

We present qualitative results of the intestinal inflammatory activity simulation in Figure 8. The scoring results of 9 patients with mild, moderate, and severe symptoms are listed in the left half of Figure 8. The CADx system predicted each bowel segment's score and depicted it with 2-dimensional colorized bowel images. The scoring system also predicted the severity score of intestinal inflammatory activity; a detailed scoring instance is shown in Figure 7, row b.
In addition, we present 2 qualitative results from patients with UC who received medical treatment. These patients underwent endoscopies and were monitored by endoscopists from pretreatment to post-treatment; the visualized results are presented in Figure 8. One endoscopist performed the clinical evaluation of patient J's and patient K's full-length videos using the MMES scoring system. Before and after treatment, patient J's MMES were 12.0 and 5.5, respectively, whereas patient K's were 11.20 and 2.10. Our scoring system predicted patient J's scores as 5.13 before and 1.96 after treatment and patient K's as 10.62 and 1.72, respectively; the trends of the 2 scoring methods were similar. The MMES method gave the 2 patients nearly identical scores before treatment, but retrospective analysis of their full-length endoscopic videos revealed that the 2 patients' inflammatory areas and severities were distinct. The CADx system could visualize these differences in inflammatory activity, including the inflammatory areas and severities.

## Discussion

In this study, we developed a CADx system that reliably completes the Mayo and UCEIS visual classification tasks and automatically scores the inflammatory activity of full-length endoscopic videos. A desirable UC endoscopy scoring method should classify the severity of intestinal inflammatory activity in a quantitative, simple, and repeatable manner. Although an oversimplified scoring system performs poorly in risk stratification, a complicated system may cause inconsistencies between operators and even inconsistent judgments by the same operator at different times.
Additionally, conventional endoscopic evaluations of disease severity vary because of the considerable differences between endoscopists based on their clinical experiences, which results in considerable biases. In contrast, our trained deep-learning model can achieve an endoscopy estimation equivalent to that of an experienced expert with high efficiency, accuracy, and good repeatability. With the help of this CADx scoring system, the distribution range and severity of UC inflammation can be determined, and a weighted score for both intestinal segments and the entire intestine can be calculated. Furthermore, a simulation graphic can be used to intuitively display the distribution and severity proportion of inflammatory activity (Fig. 9).
In contrast to prior studies on Mayo and UCEIS scoring of single endoscopic images, our study focused on the integrity of the entire intestinal situation and considered the evaluation of specific intestinal segments. Compared with the MMES system, we first divided the intestine into a fixed number of areas, scored each area separately, and then calculated the inflammatory activity score for each intestinal segment. Figure 7 compares the MMES from a human endoscopy expert with the results of the AI automatic scoring system.
Our CADx system's predictions of UC inflammatory activity were consistent with some clinical features as well. UC is a continuous intestinal inflammation, usually manifested as backwash ileitis (lesions on the left side of the colon are worse than on the right side).
As shown in Figure 8, the results of most samples are consistent with these clinical features. Moreover, the CADx system showed inflammatory characteristics consistent with mild proctitis and discontinuous lesions in multiple intestinal segments in these cases.
The visualized results from 2 patients with UC who received medical treatment are presented in Figure 8J and K. After treatment, both patients' scores predicted by our scoring system decreased, and the visualized images showed that the severity of their inflammatory activity had also been greatly reduced. This suggests that our scoring system could provide a method to compare patients' inflammatory activity before and after treatment; gastroenterologists could thus use this AI scoring system to estimate treatment responses in patients with UC. In the future, we intend to collect more samples from patients who received medical treatment to complete the validation of this application.
There were some study limitations. First, the prediction ability of the deep-learning network was inadequate for some categories, including Mayo 1 and the UCEIS bleeding subitem. In these categories, the F1 score was lower than in categories with more training samples, indicating that our algorithm's performance was limited by the imbalance of data samples. We hope to collect more data in these categories in future research and improve the training algorithm to guarantee the prediction ability of the CNN model in all categories. Second, by analyzing the simulation results of the scoring system, we found that some intestinal segments had considerably high proportions of high-severity lesions. A typical example is shown in Figure 10, where the AI scoring system considered the proportion of severe inflammation in the patient's descending colon and sigmoid colon to be 100%; similar behavior was observed in other cases (>50% of local areas predicted as severe). These cases had multiple discontinuous severe lesions in the same intestinal segment (another manifestation of skip lesions). In addition, the preprocessing network failed to completely remove repeated severe and moderate inflammation images, resulting in excessively high scores for these local areas after weighted calculation. This shows that a gap remains between our scoring system and the most desirable clinical conditions. Finally, our scoring system does not run in real time: when analyzing intestinal videos, endoscopists must mark the time points at which intestinal segments change, so scoring during clinical diagnosis is difficult, which means the system does not make full use of the advantages of machine learning.
In future research, improving the time performance of the proposed scoring system will be an important research direction. Although this novel weighted scoring model of endoscopic videos can reconstruct the endoscopic distribution range and severity of UC, a large prospective study is needed to validate its usefulness in clinical decisions and monitoring after treatment. Furthermore, the definition of endoscopic mucosal healing for this new scoring system remains unclear; hence, the validation of this model in a multicenter study is the next step for elucidating its correlation with the clinical indices, laboratory measurement of active disease, and patient-defined remission.
In conclusion, we developed a CADx system based on deep-learning technology. This CADx system reliably completes the Mayo and UCEIS visual classification tasks and automatically scores full-length endoscopic videos. Its classification of Mayo and UCEIS scores for single endoscopic images is equivalent to that of clinical experts, and its objective, visualized evaluation of the UC inflammatory environment of the entire intestine fills a technical gap in this field. In future work, we hope to include more cases from multiple centers to further verify and improve our CADx system.

## Acknowledgments

We thank the Department of Endoscopy Center, Zhongshan Hospital of Xiamen University, for their assistance with the human endoscopy study and the participants for their time.

## References

• Le Berre C.
• Ananthakrishnan A.N.
• Danese S.
• et al.
Ulcerative colitis and Crohn’s disease have similar burden and goals for treatment.
Clin Gastroenterol Hepatol. 2020; 18: 14-23
• Schroeder K.W.
• Tremaine W.J.
• Ilstrup D.M.
Coated oral 5-aminosalicylic acid therapy for mildly to moderately active ulcerative colitis.
N Engl J Med. 1987; 317: 1625-1629
• Travis S.P.
• Schnell D.
• Krzeski P.
• et al.
Developing an instrument to assess the endoscopic severity of ulcerative colitis: the Ulcerative Colitis Endoscopic Index of Severity (UCEIS).
Gut. 2012; 61: 535-542
• Vashist N.M.
• Samaan M.
• Mosli M.H.
• et al.
Endoscopic scoring indices for evaluation of disease activity in ulcerative colitis.
Cochrane Database Syst Rev. 2018; 1 (CD011450)
• Silver D.
• Huang A.
• et al.
Mastering the game of Go with deep neural networks and tree search.
Nature. 2016; 529: 484-489
• Gulshan V.
• Peng L.
• Coram M.
• et al.
Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs.
JAMA. 2016; 316: 2402-2410
• Takenaka K.
• Ohtsuka K.
• Fujii T.
• et al.
Development and validation of a deep neural network for accurate evaluation of endoscopic images from patients with ulcerative colitis.
Gastroenterology. 2020; 158: 2150-2157
• Ozawa T.
• Ishihara S.
• Fujishiro M.
• et al.
Novel computer-assisted diagnosis system for endoscopic disease activity in patients with ulcerative colitis.
Gastrointest Endosc. 2019; 89: 416-421
• de Jong D.C.
• Löwenberg M.
• Koumoutsos I.
• et al.
Validation and investigation of the operating characteristics of the ulcerative colitis endoscopic index of severity.
Inflamm Bowel Dis. 2019; 25: 937-944
• Gottlieb K.
• Requa J.
• Karnes W.
• et al.
Central reading of ulcerative colitis clinical trial videos using neural networks.
Gastroenterology. 2021; 160: 710-719
• Li X.
• Liang D.
• Meng J.
• et al.
Development and validation of a novel computed-tomography enterography radiomic approach for characterization of intestinal fibrosis in Crohn’s disease.
Gastroenterology. 2021; 160: 2303-2316
• Li Y, Yang J, Song Y, et al. Learning from noisy labels with distillation. 2017 IEEE International Conference on Computer Vision (ICCV). IEEE 2017. p. 1928-1936.
• He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE 2016. p. 770-778.
• Deng J, Dong W, Socher R, et al. ImageNet: a large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE 2009. p. 248-255.

• Brenner H.
• Kliebsch U.
Dependence of weighted kappa coefficients on the number of categories.
Epidemiology. 1996; 7: 199-202
• Daperno M.
• Comberlato M.
• Bossa F.
• et al.
Training programs on endoscopic scoring systems for inflammatory bowel disease lead to a significant increase in interobserver agreement among community gastroenterologists.
J Crohns Colitis. 2017; 11: 556-561
• Heuschen U.A.
• Hinz U.
• Allemeyer E.H.
• et al.
Backwash ileitis is strongly associated with colorectal carcinoma in ulcerative colitis.
Gastroenterology. 2001; 120: 841-847
• Naganuma M.
• Aoyama N.
• et al.
Complete mucosal healing of distal lesions induced by twice-daily budesonide 2-mg foam promoted clinical remission of mild-to-moderate ulcerative colitis with distal active inflammation: double-blind, randomized study.
J Gastroenterol. 2018; 53: 494-506
• Joo M.
• Odze R.D.
Rectal sparing and skip lesions in ulcerative colitis: a comparative study of endoscopic and histologic findings in patients who underwent proctocolectomy.
Am J Surg Pathol. 2010; 34: 689-696
• Rombaoa C.
• Kalra A.
• Dao T.
• et al.
Automated insertion time, cecal intubation, and withdrawal time during live colonoscopy using convolutional neural networks—a video validation study [abstract].
Gastrointest Endosc. 2019; 89: AB619
• Karnes W.
• Requa J.
• Dao T.
• et al.
Automated documentation of multiple colonoscopy quality measures in real-time with convolutional neural networks.
Am J Gastroenterol. 2018; 113: S1532