Detection of elusive polyps using a large-scale artificial intelligence system (with videos)

Background and Aims: Colorectal cancer is a leading cause of death. Colonoscopy is the criterion standard for detection and removal of precancerous lesions and has been shown to reduce mortality. The polyp miss rate during colonoscopies is 22% to 28%. DEEP DEtection of Elusive Polyps (DEEP 2) is a new polyp detection system based on deep learning that alerts the operator in real time to the presence and location of polyps. The primary outcome was the performance of DEEP 2 on the detection of elusive polyps. Methods: The DEEP 2 system was trained on 3611 hours of colonoscopy videos derived from 2 sources and was validated on a set comprising 1393 hours from a third unrelated source. Ground truth labeling was provided by offline gastroenterologist annotators who were able to watch the video in slow motion and pause and rewind as required. To assess applicability, stability, and user experience and to obtain some preliminary data on performance in a real-life scenario, a preliminary prospective clinical validation study was performed comprising 100 procedures. Results: DEEP 2 achieved a sensitivity of 97.1% at 4.6 false alarms per video for all polyps and of 88.5% and 84.9% for polyps in the field of view for less than 5 and 2 seconds, respectively. DEEP 2 was able to detect polyps

Conclusions: DEEP 2 has a high sensitivity for polyp detection and was effective in increasing the detection of polyps both in colonoscopy videos and in real procedures with a low number of false alarms. (Clinical trial registration number: NCT04693078.) (Gastrointest Endosc 2021;94:1099-109.)
Colorectal cancer is the second leading cause of cancer death worldwide, 1 resulting in an estimated 900,000 deaths per year. 2 Colonoscopy is the criterion standard for detection and removal of precancerous lesions; 19 million colonoscopies are performed in the United States annually. 3 Colonoscopic removal of polyps has been shown to reduce mortality. However, colonoscopy is performed with variable efficiency. 4,5 Tandem colonoscopy studies showed that 22% to 28% of polyps are missed by the performing endoscopist; 20% to 24% are histologically confirmed adenomas, 6 and missed lesions may turn into interval cancers. 7 A variety of factors leads to missed polyps: operator fatigue, distraction, and skill level are prime among them. 8 An algorithmic solution seems to be an attractive option to deal with these factors, potentially reducing the polyp miss rate. In particular, we refer to an automated real-time polyp detection system that runs during the colonoscopy and alerts the operator to the presence of polyps.
Because the endoscope outputs a standard video stream, it is natural to apply computer vision algorithms. In terms of families of computer vision algorithms, those based on artificial intelligence (AI) are most appropriate because of their dominance in the realm of object detection. 9-14 AI is already used fairly widely in colonoscopy for optical biopsy sampling, 15-19 navigation, 20-25 and polyp detection. 26-38 We propose a new AI-based system for polyp detection, which we dub DEEP 2 : DEEP DEtection of Elusive Polyps. This new system has 2 principal advantages over existing platforms. The first advantage pertains to the detection performance on elusive polyps, those polyps that are particularly difficult for endoscopists to detect. We identified 2 types of elusive polyps: fleeting polyps, which appear in the field of view (FOV) for a very brief time, and subtle polyps, those that escape detection by the endoscopist during the procedure and also by initial offline annotators. We quantify the performance of DEEP 2 on both types of elusive polyps, thereby showing the system's ability to improve detection rates. The second advantage pertains to the very low false-positive rate that DEEP 2 exhibits. This low false-positive rate carries strong clinical implications: Systems that have fewer false positives are more likely to be adopted in the clinic because they are easier to use and provide a more pleasant experience for the user. Finally, with the aim of assessing the applicability, stability, and user experience and to obtain some preliminary data on performance in a real-life scenario, we performed a preliminary prospective clinical validation study (clinicaltrial.gov ID: NCT04693078).

Dataset
The dataset was gathered from screening colonoscopy procedures performed in 3 Israeli hospitals. Each case consisted of a single video of an entire procedure (including the insertion phase) recorded at 30 frames per second. The following endoscope models were used: CF-H180AL, CF-HQ190L, and PCF-Q180AL (Olympus, Tokyo, Japan); EC-760R-V/L, EG-760R, and EC-530LP (Fujifilm, Tokyo, Japan); and EC-3890LK (Pentax, Tokyo, Japan). All videos and metadata were deidentified, according to the Health Insurance Portability and Accountability Act Safe Harbor. For both train and validation sets, procedures represented a sampling of the procedures performed at each institution over a period of time. The training dataset was obtained from 2 different university hospitals; the validation data were obtained from a third unrelated community hospital. Then we performed a small prospective clinical validation study in 100 patients that represented new videos never "seen" by DEEP 2 before.

Annotation procedure
Each video was annotated offline by board-certified gastroenterologists, drawn from a pool of 20 from a variety of hospitals in Israel and India (Supplementary Table 1, available online at www.giejournal.org). They were paid on an hourly basis, and their pay was not in any way based on the results they provided. Each video (and still image) was labeled by 2 separate offline gastroenterologists and verified by a third whose role was to unify the 2 annotations (Fig. 1). Offline labeling enables the annotators to watch the video more slowly and to freeze and rewind, allowing the labeling of polyps over and above those found by the performing endoscopist. To aid in annotation, a specialized labeling tool was used (Supplementary Fig. 1, available online at www.giejournal.org). (See more details of the annotation procedure in Appendix 1, available online at www.giejournal.org.)

Neural network architecture
Details of the neural network architecture are provided in Appendix 1. Two different types of neural networks were trained and tested: RetinaNet 10 and LSTM-SSD. 11 The overall framework for polyp detection is illustrated in Supplementary Figure 2.

Neural network training
We trained both neural networks (RetinaNet and LSTM-SSD) using standard stochastic gradient descent type techniques for minimizing detection loss. Parameters of the training procedure are given in Supplementary Table 2 (available online at www.giejournal.org). From here on we focus on RetinaNet architecture; training of the LSTM-SSD architecture is quite similar, with some minor modifications.
Positive frames (ie, frames that contain polyps) are generated as the entire set of video frames annotated as polyps by the offline annotators as previously described, combined with the still images that contain polyps. As noted in Supplementary Table 3 (available online at www.giejournal.org), there are 189,994 video frames, 14,693 still images, for a total of 204,687 positive frames. Negative frames can be any frames that do not contain polyps; because of the large size of the train set, this can be up to 80 million frames. To not overwhelm the training process with too many negatives as compared with positives, we limited the number of negatives to 1 million frames, selected at random. It should be noted that caution was exercised in choosing these frames: Given that the annotators do not label all positive frames but rather only a sampling of them, sampling purely randomly over the rest would occasionally yield positive frames. However, this issue is easily resolved because part of the annotation involves marking the first and last frames in which the polyp appears in the FOV; thus, one can simply avoid choosing negative frames from within the range between the first and last frames of each polyp. Given the 1 million frames so sampled, we trained an initial version of our detector.
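The negative-frame sampling rule described above (sample at random, but never from within the first-to-last frame range of any annotated polyp) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `sample_negative_frames`, the use of Python's `random` module, and the toy frame counts are all assumptions made for the example.

```python
import random


def sample_negative_frames(num_frames, polyp_intervals, k, seed=0):
    """Pick up to k frame indices at random, skipping annotated polyp ranges.

    polyp_intervals holds (first_frame, last_frame) pairs, both inclusive,
    as marked by the annotators. All names here are hypothetical.
    """
    excluded = set()
    for first, last in polyp_intervals:
        # Every frame between first and last may contain the polyp,
        # even if it was not one of the annotated representative frames.
        excluded.update(range(first, last + 1))
    candidates = [i for i in range(num_frames) if i not in excluded]
    rng = random.Random(seed)
    return sorted(rng.sample(candidates, min(k, len(candidates))))


# Toy video of 100 frames with two polyps in view.
negatives = sample_negative_frames(100, [(10, 19), (50, 59)], k=30)
```

At the scale reported in the text (tens of millions of frames), one would presumably sample per video rather than materialize a full candidate list; the sketch favors clarity over efficiency.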
The next phase involves the selection of so-called hard negatives. In particular, we ran this initial version of the detector over the remaining 79 million or so negatives that were not part of the training set. We then computed the maximal bounding box probabilities for each of these frames and sorted the frames from highest to lowest. The hard negatives are the first 1 million of these frames. The idea is simple: These hard negatives are the most challenging for the detector to produce the correct prediction (ie, a prediction of no bounding boxes), and so we would like the detector to see these during training. We then ran a second round of training, including these hard negatives in our new larger training set. (Thus, although 2 million negative frames are actually fed to the neural network training procedure, it is fair to say that all 80 million are used in training, because all 80 million are checked as possible candidates to be hard negatives.) This process of inclusion of hard negatives makes the network much more robust, in particular leading to many fewer false positives.
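The hard-negative selection step (score every remaining negative frame with the initial detector, rank by maximal bounding box probability, keep the top of the list) might look like the sketch below. The data layout (a dictionary from a frame identifier to that frame's box probabilities) and the name `select_hard_negatives` are assumptions for illustration only; the paper does not specify its implementation.

```python
import heapq


def select_hard_negatives(frame_box_probs, k):
    """Return the k negative frames on which the detector is most confident.

    frame_box_probs maps a frame id to the list of bounding box
    probabilities the initial detector produced on that (negative) frame.
    A frame with no boxes gets a score of 0.0.
    """
    max_prob = {fid: max(probs, default=0.0)
                for fid, probs in frame_box_probs.items()}
    # nlargest returns frame ids sorted from highest to lowest max score.
    return heapq.nlargest(k, max_prob, key=max_prob.get)


# Toy example: four negative frames, keep the two hardest.
scores = {"a": [0.1, 0.9], "b": [], "c": [0.5], "d": [0.95]}
hard = select_hard_negatives(scores, k=2)  # ["d", "a"]
```

The selected frames would then be added to the training set for the second round of training, as described in the text.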
No domain adaptation was performed to adapt the network from train to validation. The network, trained as described above, was applied as is to the validation set.

Detector evaluation
Evaluation of the detectors was performed in terms of 2 metrics. The first metric is polyp sensitivity, in which a polyp is considered detected if the detector has declared its existence in at least 1 frame during its duration in the FOV, as marked by the annotators. The second metric is the number of false alarms. A false alarm is considered to have occurred if the detector has declared a detection in 1 frame in which there was no ground truth annotation of a polyp. To evaluate instances in which there was a false alarm but indeed it was a polyp that was missed by human observers, we performed a reanalysis of 200 randomly selected procedures that were labeled normal (ie, without polyps; see the discussion under Subtle polyp evaluation below).
Figure 1. Annotation process. Two annotators were given the task of independently annotating the video. These 2 annotations were then sent for a merging stage to a third annotator whose role was to unify the 2 annotations. In particular, where there was disagreement between the 2 annotators, the third annotator sought to choose the correct annotation. For example, this annotator might choose to either remove or include a frame where 1 annotator believed there was a polyp and the other did not. Finally, the annotations were examined by a nonphysician to simply ensure no obvious errors had occurred in the labeling process. If such errors were suspected, the annotation was sent back to the third annotator to verify whether there were indeed errors and, if so, to fix these errors.
When polyp sensitivity and the false alarm rate are used, a performance curve can be traced out by varying the detection threshold; as the threshold increases, the polyp sensitivity will decrease, but so will the false alarms. In discussions with several gastroenterologists, 5 false alarms per procedure was defined as the region of particular interest on the performance curve. To compare our algorithm with metrics reported in Wang et al 32 and Urban et al, 37 we used per-frame image-level metrics as well: per-frame sensitivity, per-frame specificity, and the area under the curve for the image-level performance curve.
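The two event-level metrics and the resulting performance curve can be illustrated with a short sketch. It assumes a simplified data layout that is not taken from the paper: each video is a pair of (detections, polyps), where detections are `(frame_index, score)` tuples and polyps are inclusive `(first_frame, last_frame)` ranges. Every above-threshold detection frame outside all polyp ranges is counted as one false alarm, which is a simplification of whatever event grouping the authors used.

```python
def evaluate(videos, threshold):
    """Return (polyp sensitivity, false alarms per video) at one threshold."""
    detected = total = false_alarms = 0
    for detections, polyps in videos:
        fired = [f for f, score in detections if score >= threshold]
        for first, last in polyps:
            total += 1
            # A polyp counts as detected if any alert lands inside its range.
            detected += any(first <= f <= last for f in fired)
        # An alert outside every polyp range is a false alarm.
        false_alarms += sum(
            not any(first <= f <= last for first, last in polyps)
            for f in fired
        )
    sensitivity = detected / total if total else float("nan")
    return sensitivity, false_alarms / len(videos)


def performance_curve(videos, thresholds):
    """Trace the sensitivity versus false-alarm trade-off over thresholds."""
    return [(t, *evaluate(videos, t)) for t in thresholds]


# Two toy videos, one polyp each.
videos = [
    ([(5, 0.9), (40, 0.6), (80, 0.3)], [(0, 10)]),
    ([(25, 0.4)], [(20, 30)]),
]
curve = performance_curve(videos, [0.2, 0.5, 0.95])
```

On this toy input, raising the threshold lowers both sensitivity and false alarms per video, which is the trade-off the performance curve is meant to expose.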

Fleeting polyp evaluation
A key aspect for evaluating DEEP 2 's performance on fleeting polyps (ie, polyps that appear in the FOV for a very brief time) was having a baseline of endoscopist performance on these polyps. To obtain this metric, annotators were asked to examine each polyp they had annotated that was in the FOV for 5 seconds and decide whether the endoscopist had detected the polyp or not. To make the decision, annotators were told to note if the polyp was in focus and at the center of the frame, but ultimately they were expected to make this decision based on their own endoscopic experience. Then, the endoscopist's sensitivity was computed as a gross number for all procedures or broken down for a particular range of polyp durations.

Subtle polyp evaluation
By definition, any detection of a subtle polyp (ie, a polyp that defied detection by both the endoscopist performing the procedure and the initial offline annotators) is initially classified as a false alarm. Therefore, we performed a reanalysis of the detector's false alarms in 200 randomly selected procedures that were deemed normal (without polyps). All of the detector's false alarms in these procedures were collected and sent to the annotators, who reexamined those specific frames. These annotators were instructed to classify the false alarms according to the classes shown below (see Results, section "Performance on elusive polyps, Subtle polyps").

Clinical study details
The clinical validation study was a prospective, nonblinded, pilot study of 100 consecutive routine screening or surveillance colonoscopies conducted at Shaare Zedek Medical Center in Jerusalem, Israel (clinicaltrial.gov ID: NCT04693078). Patients with a history of surgery involving the colon or rectum, known diagnosis of colorectal cancer, previous history of inflammatory bowel disease, and suspicion or diagnosis of genetic polyposis syndromes were excluded. During each procedure, DEEP 2 was run in real time, with its output presented on a secondary screen. A single off-the-shelf consumer-grade Nvidia 2080Ti GPU (Nvidia, Santa Clara, Calif, USA) was sufficient to enable real-time performance. Each system alert (green bounding box and sound alert) was considered to be a detection and was reviewed by the physician in real time. Endoscopists were requested to report how many polyps were discovered by DEEP 2 that they themselves may have missed and how many false alarms were produced by DEEP 2 . After each procedure, endoscopists were debriefed as to how well or poorly the system functioned and whether they subjectively assessed that the AI system had an impact on outcomes of the colonoscopy. Important events that stemmed from each DEEP 2 detection were logged in the study electronic case report form to be further explored offline later. Logged parameters were apparent type of polyp, timing and location of false alarms, whether the polyp was detected or missed by the performing endoscopist, lesion management, Boston Bowel Preparation Scale score, and age and sex of the patient. Because of regulation and patient privacy laws and by request of the ethical review board, all clinical data were deidentified such that no linking could be made with pathology reports; thus, no information regarding histology of polyps is available. Late adverse events monitoring was under the responsibility of the performing endoscopist. 
The primary outcomes for the design of the clinical study were number of additional polyps detected by DEEP 2 in real time and safety. Secondary outcomes were polyp detection rate (ie, percentage of colonoscopies where ≥1 polyp was detected), rate of false alarms per colonoscopy, and user experience on a 5-point scale.
The protocol was approved by the ethical review board of Shaare Zedek Medical Center and was conducted in accordance with the Good Clinical Practice guidelines of the International Conference on Harmonization and the provisions of the Declaration of Helsinki. All patients provided written informed consent.

Statistical analysis
To compute confidence intervals (CIs), we used the standard normal approximation for properly normalized sums of independent random variables. To compute P values, we used both standard techniques and an upper bound on the complementary cumulative distribution function of the normal distribution, $Q(t) < \exp(-t^2/2) / (2\pi t)^{1/2}$ (eg, see Wikipedia 39). The latter was used when the standard technique was beyond the computer's numerical tolerance (leading to P = 0) and yields an upper bound on P.
Figure 2. A, Performance curve for DEEP 2 as implemented on the RetinaNet architecture, illustrating the trade-off between sensitivity and false alarms per video; the false alarms per minute are also indicated on the x-axis in parentheses. B, Selected points from the performance curve in A; all points are chosen to be under the threshold of 5 false alarms per procedure, a threshold that a consensus of gastroenterologists agreed would make the system usable in practice. C and D, Performance curve for DEEP 2 as implemented on the LSTM-SSD architecture and corresponding selected points; despite changing the type of neural network used, the results remain largely the same. E, Example detections on a variety of different types of polyps. See supplementary videos 1-14 (available online at www.giejournal.org) for further detail. F, Performance of the algorithm as a function of training set size using the RetinaNet architecture. The performance curve for DEEP 2 as implemented on the RetinaNet architecture on a per-image basis (rather than a video-event basis, as shown here) is illustrated in Supplementary Fig. 5, available online at www.giejournal.org.
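The two statistical tools named under Statistical analysis can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the CI helper treats the quantity as a simple proportion, the bound uses the $(2\pi t)^{1/2}$ normalization as printed in the text (it is looser than the textbook bound $\exp(-t^2/2)/(t\sqrt{2\pi})$ when t > 1, so it remains an upper bound there), and evaluating the bound in log space is an added convenience that the paper does not mention.

```python
import math


def normal_ci(successes, n, z=1.96):
    """95% CI for a proportion via the normal approximation (assumed form)."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


def q_upper_bound(t):
    """Upper bound on the normal tail Q(t), as written in the text.

    Assumed to be used for t >= 1. Underflows to 0.0 for very large t.
    """
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi * t)


def log10_q_upper_bound(t):
    """Base-10 logarithm of the same bound, computed without underflow."""
    return (-t * t / 2 - 0.5 * math.log(2 * math.pi * t)) / math.log(10)


low, high = normal_ci(20, 200)      # roughly (.06, .14)
exact_tail = 0.5 * math.erfc(3 / math.sqrt(2))
bound_ok = q_upper_bound(3) > exact_tail
```

For example, `normal_ci(20, 200)` gives approximately .06 to .14, and for a large statistic such as t = 40 the direct evaluation returns 0.0 while `log10_q_upper_bound(40)` still yields a finite exponent, which is the situation the text describes when the standard technique reports P = 0.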

Data
The training data collected from 2 academic hospitals consisted of 3611 procedures, equivalent to 796 hours of video, or 86 million frames. To test the system's generalizability, the validation data were collected from a third unrelated community hospital and consisted of 1393 colonoscopy procedures, equivalent to 310 hours of video, or 33 million frames (Supplementary Table 3). Each video in both the training and validation data was annotated by 3 gastroenterologists in a consensus-based approach; the annotators were drawn from a pool of 20 gastroenterologists with 4 to 40 years of experience (mean, 12.6 years; median, 8.0 years) and 400 to 3000 colonoscopies performed per year (mean, 1424 colonoscopies; median, 1200 colonoscopies).
Figure 3. To estimate the endoscopist performance, offline annotators examined the colonoscopy video and labeled whether the performing endoscopist detected the polyp (and possibly ignored it) or missed the polyp. For each row of the table, we provide a 2 × 2 table showing the number of polyps when either DEEP 2 (shortened to "AI") makes a detection or not (columns) and the endoscopist (shortened to "GI") makes a detection or not (rows). Note in particular the large numbers when DEEP 2 detects and the GI does not versus the opposite case when the GI detects and DEEP 2 does not. B, Performance breakdown by a 30-second duration threshold; greater than 30 seconds is taken to be a rough proxy for "histologically confirmed polyps" because these will be examined for some time before resection. DEEP 2 has a nearly identical perfect performance as the endoscopists for the polyps with durations above 30 seconds while giving clearly superior performance on the polyps with durations less than 30 seconds. C, Clinical significance of elusive polyps. Based on the labeling of the annotators, it can be seen that a major portion (56.0%) of polyps appearing for less than 5 seconds are clinically significant (ie, either adenomatous or malignant). FOV, Field of view; CI, confidence interval; AI, artificial intelligence.
Figure 4. [...] among these false alarms: adenomas, hyperplastic polyps, and those deemed worthy of a "second look," a designation indicating that the annotator cannot tell from the video whether the detection is indeed a polyp but it would be suspicious enough to warrant further investigation during a live procedure. The subsequent rows show false alarms that were indeed false alarms/misdetections on reanalysis; interestingly, many of these are "natural" errors, including inflammation, bubbles, and irregularities of haustral folds.
The entire performance curve of DEEP 2 is shown in Figure 2, illustrating the trade-off between sensitivity and false alarms per procedure and displaying a table with a number of points drawn from this curve. The point of particular clinical relevance is a sensitivity of 97.1% (95% CI, 95.8%-98.4%) versus 4.6 false alarms per procedure (95% CI, 4.0-5.2). This number of false alarms corresponds to a per-frame false alarm rate of .23%. These results use the LSTM-SSD neural network architecture (Fig. 2A and B; results for the RetinaNet architecture were similar [Fig. 2C and D]). All points on the performance curves are given in Supplementary Tables 4 and 5 (available online at www.giejournal.org). Qualitative results are shown in Figure 2E. The effect of training set size on DEEP 2 performance showed a decline in the performance as the training set size decreased (Fig. 2F). Example videos are included in the Supplementary Material (see supplementary videos 1-14 available online at www.giejournal.org). Comparing DEEP 2 with other available systems 32,37 showed a significant decrease in the false alarm rate (see Appendix 1, available online at www.giejournal.org); however, because these AI tools were evaluated based on different validation datasets, this finding must be interpreted with caution.

Performance on elusive polyps
Fleeting polyps. To investigate whether polyps that appear briefly would be more difficult for the algorithm to detect, we compared the algorithm's sensitivity on these polyps with the fraction of these polyps detected by the performing endoscopist as determined by offline annotators. Results are reported in Figures 3A and B and Videos 1 to 5 (available online at www.giejournal.org). We note that for polyps appearing for less than 5 seconds in the FOV, DEEP 2 had a sensitivity of 88.5% (95% CI, 84.6%-92.4%) compared with 31.7% (95% CI, 26.0%-37.5%) for endoscopists ($P < 10^{-83}$). This comparison was even starker for polyps appearing for less than 2 seconds, for a sensitivity of 84.9% (95% CI, 79.3%-90.5%) versus 18.9% (95% CI, 12.8%-24.9%), respectively ($P < 10^{-100}$). The system retains a similar performance level for much shorter durations: On polyps that appear for less than half a second, the sensitivity was 88.7% (95% CI, 80.8%-96.6%), and on those appearing for less than a tenth of a second, 87.5% (95% CI, 71.3%-100.0%).
Conversely, the algorithm rarely missed polyps appearing for a longer duration. Histologically confirmed polyps required by definition removal of the lesion; therefore, they generally appeared in the FOV for more than 30 seconds. For these polyps, the system's detection rate was 99.8% (95% CI, 99.4%-100.0%).
Subtle polyps. DEEP 2 's ability to detect subtle polyps is reported in Figure 4 and Videos 6 to 9 (available online at www.giejournal.org). By definition, these polyps were initially identified as false alarms. In a reanalysis of the detector's false alarms from 200 randomly selected procedures that were deemed normal (without polyps), 1087 false alarms were recorded, of which 44 had clinical significance: 20 were identified as having endoscopic features compatible with adenomas and 24 with hyperplastic polyps. A further 42 were deemed worthy of a "second look," a designation indicating that the finding would warrant further investigation during a live procedure. On a per-procedure basis, these numbers translated to .1 adenomas (95% CI, .06-.14), .12 hyperplastic polyps (95% CI, .07-.17), and .21 second looks (95% CI, .15-.27), for a total of .43 extra "events" per procedure (95% CI, .35-.50). Information on the other types of false alarms is reported in Figure 4.

Clinical significance of elusive polyps
In an attempt to quantify the clinical significance of elusive polyps in the absence of histopathology reports, the annotators labeled elusive polyps as belonging to 1 of 3 classes: hyperplastic, adenomatous, or malignant. Polyps were considered to be of potential clinical significance if they were labeled as adenomatous or malignant. The results are reported in Figure 3C. For polyps appearing for less than 5 seconds in the FOV, the proportion of clinically significant polyps was 56.0% (95% CI, 49.8%-62.1%).

Clinical study
In the clinical validation study, all procedures were performed by 1 of 3 board-certified endoscopists previously trained in the use of DEEP 2 , with experience ranging from 7 to 35 years (mean, 17.0 years; median, 9.0 years). Mean patient age was 60 years (95% CI, 58-63), and 44% were women. The mean Boston Bowel Preparation Scale score was 7.45 (95% CI, 7.11-7.80). The overall polyp detection rate was 74%.

DISCUSSION
For a system like DEEP 2 to be clinically relevant, it is crucial to know how many false positives are tolerable. DEEP 2 detected 97.1% of polyps, while producing an average of 4.6 false alarms per procedure, which translated to .47 false alarms per minute of the procedure. This compared very favorably with the 2.4 false alarms per minute of withdrawal time of the commercial GI-Genius system (Medtronic, Minneapolis, Minn, USA) 40 ; however, the results from the Medtronic GI-Genius system were obtained from a real-time randomized control trial. Furthermore, our per-frame false-positive rate was .23% compared with the rate of .9% in Hassan et al, 38 which represents a decrease by a factor of 4. Given the ability of DEEP 2 to detect all but 2.9% of polyps while producing a tolerable number of false positives, we believe the system can have immediate value as a decision support tool in the clinic. DEEP 2 's performance on elusive polyps, displaying an extremely high sensitivity for polyps appearing in the FOV for a very short time, is in our opinion where the system really shows its benefit as a decision support tool. Previous studies have restricted themselves to validation on histologically confirmed polyps. This approach seems to be sensible but has a critical drawback: The only polyps with histologic confirmation are, by definition, those discovered and removed. An automated system that is validated only on polyps that were already discovered by the endoscopist does not offer much added value. By contrast, a system that can detect polyps that were missed by the performing endoscopist offers immediate and obvious benefit. Remarkably, the system was able to detect polyps that were missed by the performing endoscopist and offline annotators (ie, subtle polyps). The performance of DEEP 2 on subtle polyps suggests its generalizability: The system has learned to detect items that were initially missed by all who viewed the procedure. 
In computing the performance of DEEP 2 , we made sure to validate on a dataset that was from a different institution. In particular, a community hospital that often focuses on achieving a significant volume of colonoscopies is incentivized toward a decrease in procedure duration. DEEP 2 's ability to generalize to a community hospital is noteworthy because one might reasonably argue it is precisely the endoscopist in such a setting that would benefit the most.
It is also interesting to note that DEEP 2 is fairly insensitive to the choice of neural network architecture. We used 2 architectures: RetinaNet and LSTM-SSD. RetinaNet is a leading technique for object detection on static images (applied to video by applying it to frames in a consecutive fashion). It is a single-stage detector with no initial region proposal phase; rather, it is based on a fully convolutional network that takes the image as input and directly outputs all detections in the image at once. RetinaNet is a top performer on a variety of benchmarks given a fixed computational budget; it is known for balancing speed of computation with accuracy. LSTM-SSD is a true video object detection architecture, which can explicitly account for the temporal character of the video (eg, temporal consistency of detections, ability to deal with blur and fast motion). The SSD part of the network is a single-stage detector like RetinaNet, whereas the LSTM part incorporates recurrent units based on ConvLSTM layers, which allow for the easy integration of both spatial and temporal information. LSTM-SSD is known for being robust and very computationally lightweight and can therefore run on less expensive processors. Comparable results were attained also on the much heavier Faster R-CNN architecture 9 and on the very recent EfficientDet architecture. 14 The fact that results are similar across different architectures implies that one can choose the network meeting the available hardware specifications. This indicates the potential for deployment in less developed areas, where cost control is paramount.
Our clinical study was small, but the aim was to assess the applicability, stability, safety, and user experience and to obtain some preliminary data on performance in a real-life scenario. Our results support the conclusions described above. In particular, using DEEP 2 as a real-time decision support tool increased the polyps detected per colonoscopy by 54% at a rate of 3.87 false alarms per procedure, without adverse events and with an overall positive user experience. A larger controlled study including information on the histology of polyps is necessary to fully quantify the improvement in quality measures that DEEP 2 may provide.
We recognize that the current study has limitations. First, the annotation procedure is not precisely similar to the standard consensus-based approaches used in AI studies dealing with static images (eg, in radiology and ophthalmology). The reasons for this have been discussed and justified in the section explaining the annotation procedure. Second, the annotation relied on 70 to 100 representative frames for a given polyp. It is possible that the use of more frames might lead to an increase in performance. Third, the finding of "subtle polyps" leads to the question of whether there are "hidden" polyps that are missed by all 3: endoscopists, annotators, and algorithm. Fourth, to better understand and describe computer-aided diagnostic systems with the ultimate goal of developing an "ideal" AI polyp detection algorithm, the performance of the computer-aided diagnostic system should be evaluated for different morphologic subtypes of colorectal lesions 41 as classified by the Paris classification (eg, flat or depressed-type neoplasms). 42,43 In the current study we were unable to perform this analysis; rather, we measured all types of polyps as a single group. The same is true for different histopathologic subtypes (eg, adenomatous lesions, serrated polyps, hyperplastic polyps) and grades of dysplasia. In practice, this could increase procedure time and cost because more diminutive polyps might be found and not characterized/predicted as adenoma or hyperplastic. Finally, our clinical study is admittedly small, noncontrolled, and without histologic information. In addition, because endoscopists had to report how many polyps were discovered by DEEP 2 that they themselves may have missed, reporting bias may be an issue; however, because endoscopists do not like to admit they are missing polyps, we believe that reporting bias in this setting would weaken the performance of the system in favor of the performance of the endoscopists. The results here are encouraging, but clearly a larger, randomized study is warranted. More broadly, there are several limitations relating to the practical deployment of any AI-based system for automated polyp detection. These include, inter alia, uptake attitudes of endoscopists, deskilling of endoscopists, over-reliance on the AI system, and concerns that nonphysicians might replace physicians in the performance of endoscopic procedures. 32,44-48
In conclusion, we have presented DEEP 2 , a system for automatic polyp detection that attains state-of-the-art performance. DEEP 2 is very sensitive on the population of polyps as a whole and in particular on the subpopulation of elusive polyps, those polyps that human operators have the most difficulty in detecting and for which AI can provide the most added value. DEEP 2 also displays a low rate of false positives, leading to a potentially more pleasant experience for the user. A small clinical study reinforces these conclusions.

ACKNOWLEDGMENTS
We acknowledge Yiwen Luo, Huy Doan, and Quang Duong for software infrastructure support for data collection. We also thank the many gastroenterologists who helped in the large-scale annotation effort that was necessary for this study.
The dataset was gathered from colonoscopy procedures performed in 3 Israeli hospitals. In all 3 cases, the data gathered were used under license for the current study and so are not publicly available. Figures 1, 2, and 4 and Supplementary Figures 1, 2, 3, and 4 contain raw images from Shaare Zedek Medical Center. The supplementary videos are also from Shaare Zedek Medical Center.
The code used for training the models has a large number of dependencies on internal tooling, infrastructure, and hardware, and its release is therefore not feasible. However, all experiments and implementation details are described in sufficient detail in Methods and/or in Appendix 1 to allow independent replication with nonproprietary libraries. Major components of our work are available in open source repositories: TensorFlow (https://www.tensorflow.org) and TensorFlow Object Detection API (https://github.com/tensorflow/models/tree/master/research/object_detection).

Supplementary annotation methods
The challenge of ground truth acquisition in video annotation is that the videos are long; therefore, one cannot expect to label each frame in the video. Instead, the goals of the annotation procedure were to identify every polyp that occurred within a given video and, for each such identified polyp, to annotate a number of representative frames with bounding boxes around the polyp. The number of representative frames was taken to be in the range of 70 to 100. This set of frames always contained both the first and the last frame during which the polyp was in the field of view and the remaining frames were roughly equally spaced. For polyps that appeared for 25 seconds (ie, 750 frames), 70 to 100 frames only represent 10% to 15% of the frames; nevertheless, this is more than sufficient to capture the appearance diversity of the polyp, because at 30 frames per second this represents sampling the polyp about every quarter of a second. The rest of the frames (ie, frames without polyps) were used to train the system as negative frames (see Neural network training in Methods).
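The frame-selection rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `representative_frames` and the use of simple rounding for spacing are hypothetical and are not taken from the annotation tool used in the study.

```python
def representative_frames(first: int, last: int, count: int = 100) -> list[int]:
    """Pick up to `count` roughly equally spaced frame indices in [first, last].

    The first and last frames in which the polyp is visible are always kept,
    mirroring the rule described in the text. (Hypothetical helper.)
    """
    if last < first:
        raise ValueError("last must be >= first")
    if count < 2:
        raise ValueError("count must be at least 2 to keep both endpoints")
    total = last - first + 1
    if total <= count:
        # Short appearance: every frame is a representative frame.
        return list(range(first, last + 1))
    step = (last - first) / (count - 1)
    return [first + round(i * step) for i in range(count)]


# A polyp visible for 25 seconds at 30 frames per second spans 750 frames.
frames = representative_frames(0, 749, count=100)
fraction_labeled = len(frames) / 750          # about 13% of the frames
mean_gap_seconds = (749 / 99) / 30            # about a quarter of a second
```

With 100 frames the mean spacing is about 7.6 frames, which matches the "about every quarter of a second" figure quoted above; with 70 frames the spacing is closer to a third of a second.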
To aid in annotation, a specialized labeling tool was used, shown in Supplementary Figure 1. This tool allowed the annotator to easily pause the video as well as rewind and use various playback speeds. The tool also allowed for the labeling of still images. As is common practice, the aim was to have more than 1 gastroenterologist annotate each video to provide a more reliable, consensus-based labeling. However, this is considerably more difficult to achieve for video annotation than for image annotation, because individual annotators may differ in both the temporal and spatial locations of their respective bounding boxes. The procedure illustrated in Figure 1 was therefore used.
Two annotators were given the task of independently annotating the video, using the procedure described above. These 2 annotations were then sent for a refinement stage to a third annotator whose role was to unify the 2 annotations. In terms of unification, there are 2 separate scenarios: when a given frame has been labeled by both of the initial annotators and when a given frame has been labeled by only 1 of the initial annotators. Note that the latter case often happens, because each annotator only labels 70 to 100 representative frames, which is less than the total number of frames; therefore, their annotations often do not overlap. In the first case, the third annotator must choose the more appropriate of the 2 annotators' bounding boxes (or, when multiple polyps are present, the more appropriate set of boxes). In the second case, there is no such decision to be made. In both cases, the third annotator can slightly refine the location of the chosen bounding box, if so desired. Finally, for negative frames (ie, frames which both initial annotators agreed were devoid of polyps), the third annotator had the ability to disagree by proposing a new polyp and sending it back to the initial annotators for agreement.
Finally, the annotations were examined by a nonphysician to simply ensure that no obvious errors had occurred in the labeling process. If such errors were suspected, then the annotation was sent back to the third annotator to verify whether there were indeed errors and, if so, to fix these errors. This procedure was used for the construction of both the training and validation sets.
Full description of the neural network architecture RetinaNet. The overall framework for polyp detection is illustrated in Supplementary Figure 2. Instantiation of this system involves choosing a particular neural network architecture for the block labeled "CNN for Detection" and optionally for the block labeled "Memory (State)." We begin by describing a simpler architecture in which only the block labeled "CNN for Detection" is used.
We used the RetinaNet architecture for object detection, 1 illustrated in Supplementary Figure 3, which works as follows. A large set of candidate object locations, referred to as "anchor boxes," is sampled across the image (at about 100,000 locations). This set densely covers a grid of spatial positions, with 3 scales and 3 aspect ratios for each position. The ResNet-50 network 2 is applied directly to the image to extract features. These ResNet features are then combined across multiple resolution levels using a feature pyramid network.
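To make the anchor layout concrete, the following sketch enumerates the 3 scales x 3 aspect ratios = 9 anchor boxes at a single grid position. The specific scale and ratio values, the `base_size` parameter, and the function name `anchors_at` are assumptions chosen for illustration; the text specifies only that 3 scales and 3 aspect ratios are used.

```python
import math


def anchors_at(cx, cy, base_size,
               scales=(1.0, 2 ** (1 / 3), 2 ** (2 / 3)),   # assumed values
               ratios=(0.5, 1.0, 2.0)):                     # assumed height/width ratios
    """Return the anchor boxes (x_min, y_min, x_max, y_max) centered at (cx, cy).

    One box is produced per (scale, ratio) pair, so 3 scales and 3 ratios
    give 9 anchors per spatial position.
    """
    boxes = []
    for s in scales:
        area = (base_size * s) ** 2        # area is fixed for a given scale
        for r in ratios:
            w = math.sqrt(area / r)        # width such that w * h == area
            h = w * r                      # and h / w == r
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes


boxes = anchors_at(cx=64.0, cy=64.0, base_size=32.0)
print(len(boxes))  # 9 anchors at this position
```

Repeating this at every position of every feature pyramid level is what produces the dense set of candidate locations described above.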
The features from each feature pyramid network level are then further fed into 2 subnetworks. First, the classification subnet predicts the probability of object presence at each spatial position for each of the A = 9 anchors per position and K object classes. The subnet is a fully convolutional network, terminating in a convolutional layer with KA filters. Note that we can predict the type of polyp here (eg, adenomatous vs hyperplastic) by taking K > 1 classes; in practice, we do not do so. Second, the box regression subnet is a fully convolutional network that predicts an offset from each anchor box to a nearby ground-truth object, if one exists. It is identical in structure to the classification subnet except that it terminates in 4A linear outputs per spatial location. The top predictions from all levels are merged, and nonmaximum suppression with a threshold is applied to yield the final detections.
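The final nonmaximum suppression step can be illustrated with a plain greedy implementation. This is a generic sketch of the standard technique, not the code used in DEEP 2; the threshold value of 0.5 and the function names are assumptions, because the text does not state the threshold that was used.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy nonmaximum suppression; returns indices of the boxes that are kept.

    Boxes are visited from highest to lowest score, and a box is dropped when
    it overlaps an already kept box by more than `iou_threshold` (assumed value).
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # the second box overlaps the first and is suppressed
```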
Long Short-Term Memory Single Shot Detector (LSTM-SSD). Returning to Supplementary Figure 2, it is possible to use a neural network architecture that incorporates the block labeled "Memory (State)." The purpose of such a block is to maintain and pass state between successive image frames; this recursive structure allows the model to make more confident predictions relative to a single-frame model in difficult scenarios such as blur and occlusion. One such architecture is the LSTM-SSD architecture, shown in Supplementary Figure 4, which we now describe.
The base of the LSTM-SSD architecture uses a single shot multibox detector, or SSD 3 ; this is a single-frame detector that is similar to the RetinaNet architecture described above. The first modification made to the standard SSD setup is to replace all convolutional layers by depth-wise separable convolutions 4,5 ; this enables faster computations. The second modification is to insert recurrent units between consecutive convolutional layers. These recurrent units, which are ConvLSTM layers, 6,7 allow for information from both the current frame and the previous frames to be incorporated.
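The speed benefit of depth-wise separable convolutions can be seen from a simple weight count. The sketch below compares a standard convolution with its depth-wise separable counterpart (a per-channel spatial filter followed by a 1 x 1 pointwise convolution); the layer sizes are hypothetical and are not taken from the DEEP 2 model, and bias terms are ignored.

```python
def standard_conv_weights(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a standard kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def separable_conv_weights(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a depth-wise separable convolution (bias ignored):
    one kernel x kernel filter per input channel, then a 1 x 1 pointwise layer."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Hypothetical layer: 3 x 3 kernel, 256 input and 256 output channels.
standard = standard_conv_weights(3, 256, 256)    # 589824
separable = separable_conv_weights(3, 256, 256)  # 67840
reduction = standard / separable                 # roughly 8.7 times fewer weights
```

The number of multiply-add operations per output position scales in the same way, which is why the substitution reduces computation.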

Temporal logic layer
Referring to Supplementary Figure 2, the output of the detection network is refined by passing through a temporal logic layer. The purpose of this layer is to take the bounding boxes output by the network and to refine them based on bounding boxes detected in previous frames. In particular, we find the following very simple form of temporal logic to work well. We examine bounding boxes from the previous n frames, where this set of n includes the current frame. If out of this set of n frames, k of them, including the current frame, have at least 1 detection, then we declare detection for the current frame; otherwise, we do not.
This very simple layer is effective in filtering out false positives. It is parameterized by only 2 values, namely k and n. These numbers will impact the latency of the detections, because one effectively has to wait n frames to make a detection. Thus, for practical reasons concerned with latency, one may choose to make n small.
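A minimal sketch of this k-of-n rule is given below. It assumes one reading of the description: a detection is declared for the current frame only if the current frame itself has at least 1 detection and at least k of the last n frames (current frame included) do. The class name `TemporalLogic` and the example values of n and k are hypothetical.

```python
from collections import deque


class TemporalLogic:
    """k-of-n temporal filter over per-frame detection flags (illustrative only)."""

    def __init__(self, n: int, k: int):
        if not 1 <= k <= n:
            raise ValueError("require 1 <= k <= n")
        self.k = k
        self.window = deque(maxlen=n)   # flags for the last n frames

    def update(self, has_detection: bool) -> bool:
        """Feed the current frame's flag; return whether to declare a detection."""
        self.window.append(bool(has_detection))
        # Assumed reading: the current frame must fire, and at least k of the
        # last n frames (including the current one) must have fired.
        return self.window[-1] and sum(self.window) >= self.k


layer = TemporalLogic(n=3, k=2)
raw = [True, False, True, True, False]
filtered = [layer.update(flag) for flag in raw]
# filtered == [False, False, True, True, False]
```

In this example the isolated detection in the first frame is suppressed, which is how the layer filters out short-lived false positives at the cost of some latency.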

Supplementary comparison of DEEP DEtection of Elusive Polyps with other published systems
Comparing artificial intelligence tools that were evaluated based on different validation data sets is problematic; hence, the following results must be viewed with caution. Nevertheless, and taking into consideration this limitation, the aim of these comparisons is to better illustrate the performance of DEEP DEtection of Elusive Polyps (DEEP 2 ) relative to current available systems, especially the significantly fewer false positives. To illustrate the system's performance in this regard, we compare it with 2 leading artificial intelligence systems for polyp detection reported in the literature. 8,9 Comparison with Urban et al. 8 For a comparison with the method of Urban et al, we used the same validation methodology, which is different from ours in that it is more forgiving in terms of what constitutes a false positive. Specifically, Urban et al stated that "all false-positive findings with duration of at least 1 second are counted" (see Table 5 caption in Urban et al).
The performance for Urban et al is computed based on their Table 5. We use the column corresponding to "11 Challenging Videos" to compute the sensitivity = 68 / 73 = 93.2%. The false alarms per procedure is computed as 46 / 11 = 4.2. Although that system incurs 4.18 false alarms per procedure when operating at a sensitivity of 93.2%, DEEP 2 incurs .14 false alarms per procedure when operating at a sensitivity of 93.8%, a decrease in the false positives by a factor of almost 30.
An alternative computation may be based on combining the 2 columns of their Table 5 to include the somewhat easier "9 Videos" dataset in addition to the "11 Challenging Videos" dataset. In this case, their sensitivity increases to 113 / 118 = 95.8%, whereas their false alarms per procedure also increases to 127 / 20 = 6.4. One may compare this latter case with the performance of DEEP 2 which achieves a sensitivity of 96.1% versus .38 false alarms per procedure.
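The arithmetic behind these comparisons can be reproduced directly from the counts quoted above. The helper name `summarize` is hypothetical; the counts (68 of 73 polyps and 46 false alarms in 11 videos; 113 of 118 polyps and 127 false alarms in 20 videos) and the DEEP 2 operating point of .14 false alarms per procedure are the figures stated in the text.

```python
def summarize(detected: int, total_polyps: int, false_alarms: int, videos: int):
    """Return (sensitivity, false alarms per procedure) from raw counts."""
    return detected / total_polyps, false_alarms / videos


# "11 Challenging Videos" column.
sens_hard, fa_hard = summarize(68, 73, 46, 11)
# Both columns combined ("9 Videos" plus "11 Challenging Videos").
sens_all, fa_all = summarize(113, 118, 127, 20)

print(f"challenging: {sens_hard:.1%}, {fa_hard:.2f} false alarms per procedure")
print(f"combined:    {sens_all:.1%}, {fa_all:.2f} false alarms per procedure")

# Ratio against the DEEP 2 operating point of .14 false alarms per procedure.
reduction_factor = fa_hard / 0.14   # just under 30
```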
Comparison with Wang et al. 9 The method of Wang et al focuses on purely image-level metrics. The performance for Wang et al for the purpose of these comparisons is computed based on their Table 2. For an equivalent per-frame sensitivity of 94.4%, DEEP 2 achieves a 2.5% absolute increase in per-frame specificity, from 95.9% to 98.4%, equivalent to 2.6 times fewer false alarms.
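The "2.6 times fewer false alarms" figure follows from converting each specificity to a per-frame false-positive rate (1 minus specificity) and taking the ratio. A short check, using only the two specificity values quoted above (the function name is hypothetical):

```python
def false_positive_rate(specificity: float) -> float:
    """Per-frame false-positive rate implied by a per-frame specificity."""
    return 1.0 - specificity


fpr_wang = false_positive_rate(0.959)    # about 4.1% of negative frames
fpr_deep2 = false_positive_rate(0.984)   # about 1.6% of negative frames
ratio = fpr_wang / fpr_deep2             # about 2.6
```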
Other comparisons. On the flip side, the algorithm rarely misses polyps with longer duration. Histologically confirmed polyps generally appear for more than 30 seconds in the field of view, as they are being interrogated by the endoscopist (and later, sometimes resected). On these polyps, the system's detection rate is 99.8% (95% confidence interval, 99.4%-100.0%). We note that this is comparable with the 99.7% rate on histologically confirmed polyps reported in Hassan et al 10 ; however, our per-frame false alarm rate of .23% is about 4 times lower than the rate of .9% reported in Hassan et al. DEEP 2 detects 97.1% of polyps while producing an average of 4.6 false alarms per procedure. Note that the latter number translates to .47 false alarms per minute of the procedure; this compares very favorably with the 2.4 false alarms per minute of withdrawal time of the commercial Medtronic GI-Genius system tested in Hassan et al. 11 Note that the difference is probably even more dramatic, because the 2.4 false alarms per minute is only for the withdrawal; Hassan et al 11 explicitly stated that they limited their analysis to the withdrawal phase of the colonoscopy videos, ignoring the insertion phase, in which a high number of false positives can be triggered by collapsing folds when there is little if any insufflation of the lumen. Furthermore, our per-frame false-positive rate is .23%, as compared with the .9% reported in Hassan et al, 10 which represents a decrease of a factor of 4.
Regarding comparisons of the used datasets, both the training and validation of DEEP 2 were done on a considerably different kind of dataset from any yet reported in the literature. Note that Urban et al 8 10 the datasets are of comparable size; the key difference lies in the fact that the current study uses a more diverse population of polyps. In particular, Hassan et al 10 used only polyps that were detected by the performing endoscopist, whereas our dataset also includes all polyps discovered by offline gastroenterologist annotators, a considerably larger set. The use of this type of data is what enables the performance on elusive polyps. Beyond this, it is important to describe the critical way in which the data were used in training. In particular, we extracted 2 million negative frames to represent the diversity of the nonpolyp background of the colon. Of these data, 1 million frames were sampled randomly and used to train the initial model; the remaining 1 million frames were so-called hard negatives (ie, negatives that tend to give the detector trouble) in the sense that they produced false positives in earlier iterations of the training. These 1 million hard negatives are drawn from a population of over 80 million frames; thus, the effective size of the training set is considerably larger. The unparalleled diversity of both polyps and background is what leads to the system's robust performance. Indeed, we have validated the effect of training set size in Figure 2F. The clear increase in performance as the dataset grows in size justifies the collection of such a large dataset.
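The idea of hard-negative selection can be sketched as follows. This is a generic illustration, not the training pipeline used in the study: the function name `mine_hard_negatives`, the `score_fn` callback standing in for an earlier model iteration, and the 0.5 threshold are all assumptions.

```python
def mine_hard_negatives(frame_ids, score_fn, threshold=0.5, limit=None):
    """Select polyp-free frames on which the current detector fires anyway.

    frame_ids: identifiers of frames known to contain no polyp.
    score_fn:  hypothetical callback returning the detector's top confidence
               for a frame (a stand-in for an earlier model iteration).
    threshold: assumed score above which the frame counts as a false positive.
    limit:     optional cap on the number of hard negatives returned.
    """
    scored = [(score_fn(f), f) for f in frame_ids]
    hard = [(s, f) for s, f in scored if s >= threshold]
    hard.sort(key=lambda pair: pair[0], reverse=True)   # most confusing first
    selected = [f for _, f in hard]
    return selected if limit is None else selected[:limit]


# Toy example with made-up scores for five negative frames.
toy_scores = {"f1": 0.05, "f2": 0.92, "f3": 0.61, "f4": 0.10, "f5": 0.77}
hard_negatives = mine_hard_negatives(toy_scores, toy_scores.get, limit=2)
# hard_negatives == ["f2", "f5"]
```

The selected frames would then be added to the negative set for the next training iteration, so that the model sees the background patterns it currently confuses with polyps.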
Supplementary Figure 4. LSTM-SSD architecture. The base of the LSTM-SSD architecture uses a single shot multibox detector (SSD), a single-frame detector, that is similar to the RetinaNet architecture. The first modification made to the standard SSD setup is to replace all convolutional layers by depthwise separable convolutions; this enables faster computations. The second modification is to insert recurrent units between consecutive convolutional layers. These recurrent units allow for information from both the current frame and the previous frames to be incorporated. In particular, the recurrent units are ConvLSTM layers, which allow for the easy integration of both spatial and temporal information. For speed purposes, an efficient Bottleneck-LSTM unit based again on depth-wise separable convolutions is used to reduce the computational cost significantly compared with a regular LSTM unit.
Supplementary Figure 5. Performance curve for images. The performance curve for the DEEP DEtection of Elusive Polyps system as implemented on the RetinaNet architecture, on a per-image basis (rather than a video-event basis, as shown in Figure 2). This curve illustrates the trade-off between per-image sensitivity and per-image specificity, with selected points on the curve shown in the
In the summary of video data, the gross statistics for training set vs validation set are reported. In all training data, statistics for both still image data and video data are reported, broken down by polyp vs nonpolyp. These still images were taken by the endoscopist during the colonoscopy, generally when an event of interest, either a polyp or an anatomic landmark, was encountered. The number of unique polyps for all data on both train and validation sets are reported.
All points from the performance curve for the RetinaNet architecture, as shown in Figure 2A. That is, Figure 2B is a sampling of this table. In addition to reporting the sensitivity, false alarms per video, and false alarms per minute, we also report the detector threshold used, as well as the window size. With regard to the detector threshold, note that as the threshold goes down, the sensitivity increases, as do the number of false alarms, as one would expect. The window size corresponds to the parameter n in the temporal logic layer; higher n indicates a longer detection latency.
All points from the performance curve for the LSTM-SSD architecture, as shown in Figure 2C. That is, Figure 2C is a sampling of this table. In addition to reporting the sensitivity, false alarms per video, and false alarms per minute, we also report the detector threshold used, as well as the window size. With regard to the detector threshold, note that as the threshold goes down, the sensitivity increases, as do the number of false alarms, as one would expect. The window size corresponds to the parameter n in the temporal logic layer; higher n indicates a longer detection latency.