Detection of elusive polyps using a large-scale artificial intelligence system (with videos)

Open Access. Published: June 29, 2021. DOI: https://doi.org/10.1016/j.gie.2021.06.021

      Background and Aims

      Colorectal cancer is a leading cause of death. Colonoscopy is the criterion standard for detection and removal of precancerous lesions and has been shown to reduce mortality. The polyp miss rate during colonoscopies is 22% to 28%. DEEP DEtection of Elusive Polyps (DEEP2) is a new polyp detection system based on deep learning that alerts the operator in real time to the presence and location of polyps. The primary outcome was the performance of DEEP2 on the detection of elusive polyps.

      Methods

      The DEEP2 system was trained on 3611 colonoscopy procedures (796 hours of video) derived from 2 sources and was validated on a set comprising 1393 procedures (310 hours) from a third unrelated source. Ground truth labeling was provided by offline gastroenterologist annotators, who were able to watch the video in slow motion and pause and rewind as required. To assess applicability, stability, and user experience and to obtain some preliminary data on performance in a real-life scenario, a preliminary prospective clinical validation study comprising 100 procedures was performed.

      Results

      DEEP2 achieved a sensitivity of 97.1% at 4.6 false alarms per video for all polyps and sensitivities of 88.5% and 84.9% for polyps in the field of view for less than 5 and 2 seconds, respectively. DEEP2 detected polyps that were not seen by live real-time endoscopists or offline annotators, averaging .22 such polyps per sequence. In the clinical validation study, the system detected an average of .89 additional polyps per procedure. No adverse events occurred.

      Conclusions

      DEEP2 has a high sensitivity for polyp detection and was effective in increasing the detection of polyps both in colonoscopy videos and in real procedures with a low number of false alarms. (Clinical trial registration number: NCT04693078.)

      Graphical abstract

      Abbreviations:

      AI (artificial intelligence), DEEP2 (DEEP DEtection of Elusive Polyps), FOV (field of view)
      Colorectal cancer is the second leading cause of cancer death worldwide (Guren, The global challenge of colorectal cancer), resulting in an estimated 900,000 deaths per year (Ferlay et al, Global cancer observatory: cancer today). Colonoscopy is the criterion standard for detection and removal of precancerous lesions; 19 million colonoscopies are performed in the United States annually (Kim, iData Research). Colonoscopic removal of polyps has been shown to reduce mortality. However, colonoscopy is performed with variable efficiency (Mathews et al; Lam et al). Tandem colonoscopy studies showed that 22% to 28% of polyps are missed by the performing endoscopist, of which 20% to 24% are histologically confirmed adenomas (Leufkens et al), and missed lesions may turn into interval cancers (Anderson et al). A variety of factors leads to missed polyps: operator fatigue, distraction, and skill level are prime among them (Forbes et al). An algorithmic solution seems to be an attractive option to deal with these factors, potentially reducing the polyp miss rate. In particular, we refer to an automated real-time polyp detection system that runs during the colonoscopy and alerts the operator to the presence of polyps.
      Because the endoscope outputs a standard video stream, it is natural to apply computer vision algorithms. In terms of families of computer vision algorithms, those based on artificial intelligence (AI) are most appropriate because of their dominance in the realm of object detection (Ren et al, Faster R-CNN; Lin et al, Focal loss for dense object detection; Liu and Zhu, Mobile video object detection; Wu et al, Sequence level semantics aggregation; Zhou et al, Objects as points; Tan et al, EfficientDet).
      AI is already used fairly widely in colonoscopy for optical biopsy sampling (Byrne et al, Real-time differentiation of adenomatous and hyperplastic diminutive colorectal polyps; Byrne et al, Real-time artificial intelligence "full colonoscopy workflow"; Zhu et al; Sánchez-Montes et al; Zhou et al, Diagnostic evaluation of a deep learning model), navigation (Ma et al; Turan et al, Unsupervised odometry and depth learning; Turan et al, Deep EndoVO; Rau et al; Chen et al; Freedman et al, Detecting deficient coverage in colonoscopies), and polyp detection (Bernal et al, Comparative validation of polyp detection methods; Mohammed et al, Y-Net; Brandao et al; Wichakam et al; Angermann et al; Bernal et al, Towards automatic polyp detection; Wang et al, Development and validation of a deep-learning algorithm; Badrinarayanan et al, SegNet; Wang et al, Real-time automatic detection system; Zhou et al, A real-time automatic deep learning polyp detection system; Wang et al, CADe-DB trial; Urban et al; Hassan et al, New artificial intelligence system).
      We propose a new AI-based system for polyp detection, which we dub DEEP2: DEEP DEtection of Elusive Polyps. This new system has 2 principal advantages over existing platforms. The first advantage pertains to the detection performance on elusive polyps, those polyps that are particularly difficult for endoscopists to detect. We identified 2 types of elusive polyps: fleeting polyps, which appear in the field of view (FOV) for a very brief time, and subtle polyps, those that escape detection by the endoscopist during the procedure and also by initial offline annotators. We quantify the performance of DEEP2 on both types of elusive polyps, thereby showing the system's ability to improve detection rates. The second advantage pertains to the very low false-positive rate that DEEP2 exhibits. This low false-positive rate carries strong clinical implications: Systems that have fewer false positives are more likely to be adopted in the clinic because they are easier to use and provide a more pleasant experience for the user. Finally, with the aim of assessing the applicability, stability, and user experience and to obtain some preliminary data on performance in a real-life scenario, we performed a preliminary prospective clinical validation study (clinicaltrial.gov ID: NCT04693078).

      Methods

      Dataset

      The dataset was gathered from screening colonoscopy procedures performed in 3 Israeli hospitals. Each case consisted of a single video of an entire procedure (including the insertion phase) recorded at 30 frames per second. The following endoscope models were used: CF-H180AL, CF-HQ190L, and PCF-Q180AL (Olympus, Tokyo, Japan); EC-760R-V/L, EG-760R, and EC-530LP (Fujifilm, Tokyo, Japan); and EC-3890LK (Pentax, Tokyo, Japan). All videos and metadata were deidentified according to the Health Insurance Portability and Accountability Act Safe Harbor. For both the training and validation sets, procedures represented a sampling of the procedures performed at each institution over a period of time. The training dataset was obtained from 2 different university hospitals; the validation data were obtained from a third unrelated community hospital. Finally, we performed a small prospective clinical validation study in 100 patients, representing new videos never "seen" by DEEP2 before.

      Annotation procedure

      Each video was annotated offline by board-certified gastroenterologists, drawn from a pool of 20 from a variety of hospitals in Israel and India (Supplementary Table 1, available online at www.giejournal.org). They were paid on an hourly basis, and their pay was not in any way based on the results they provided. Each video (and still image) was labeled by 2 separate offline gastroenterologists and verified by a third whose role was to unify the 2 annotations (Fig. 1). Offline labeling enables the annotators to watch the video more slowly and to freeze and rewind, allowing the labeling of polyps over and above those found by the performing endoscopist. To aid in annotation, a specialized labeling tool was used (Supplementary Fig. 1, available online at www.giejournal.org). (See more details of the annotation procedure in Appendix 1, available online at www.giejournal.org.)
      Figure 1. Annotation process. Two annotators were given the task of independently annotating the video. These 2 annotations were then sent for a merging stage to a third annotator, whose role was to unify the 2 annotations. In particular, where there was disagreement between the 2 annotators, the third annotator sought to choose the correct annotation; for example, this annotator might choose to either remove or include a frame where 1 annotator believed there was a polyp and the other did not. Finally, the annotations were examined by a nonphysician simply to ensure that no obvious errors had occurred in the labeling process. If such errors were suspected, the annotation was sent back to the third annotator to verify whether there were indeed errors and, if so, to fix them.

      Neural network architecture

      Details of the neural network architecture are provided in Appendix 1. Two different types of neural networks were trained and tested: RetinaNet (Lin et al, Focal loss for dense object detection) and LSTM-SSD (Liu and Zhu, Mobile video object detection with temporally-aware feature maps). The overall framework for polyp detection is illustrated in Supplementary Figure 2, the RetinaNet architecture in Supplementary Figure 3, and the LSTM-SSD architecture in Supplementary Figure 4 (all available online at www.giejournal.org).

      Neural network training

      We trained both neural networks (RetinaNet and LSTM-SSD) using standard stochastic gradient descent-type techniques for minimizing the detection loss. Parameters of the training procedure are given in Supplementary Table 2 (available online at www.giejournal.org). From here on we focus on the RetinaNet architecture; training of the LSTM-SSD architecture is quite similar, with some minor modifications.
      Positive frames (ie, frames that contain polyps) are generated as the entire set of video frames annotated as polyps by the offline annotators, as previously described, combined with the still images that contain polyps. As noted in Supplementary Table 3 (available online at www.giejournal.org), there are 189,994 video frames and 14,693 still images, for a total of 204,687 positive frames. Negative frames can be any frames that do not contain polyps; because of the large size of the training set, this can be up to 80 million frames. To avoid overwhelming the training process with too many negatives relative to positives, we limited the number of negatives to 1 million frames, selected at random. Caution was exercised in choosing these frames: Because the annotators do not label all positive frames but rather only a sampling of them, sampling purely at random over the rest would occasionally yield positive frames. This issue is easily resolved, however, because part of the annotation involves marking the first and last frames in which the polyp appears in the FOV; thus, one can simply avoid choosing negative frames from within the range between the first and last frames of each polyp. Given the 1 million frames so sampled, we trained an initial version of our detector.
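      The following is a minimal sketch of this span-aware negative sampling, under the assumption that the annotations expose each polyp's first and last frame; the function and variable names are ours, and a production version would sample lazily rather than materialize all 80 million candidates.
      import random

      def sample_negative_frames(frames_per_video, polyp_spans, budget=1_000_000, seed=0):
          """Randomly sample candidate negative frames, skipping every frame that
          falls inside an annotated [first, last] polyp span (spans inclusive).

          frames_per_video: dict video_id -> total frame count
          polyp_spans: dict video_id -> list of (first, last) frame indices
          """
          rng = random.Random(seed)
          candidates = []
          for vid, n in frames_per_video.items():
              excluded = set()
              for first, last in polyp_spans.get(vid, []):
                  excluded.update(range(first, last + 1))
              candidates.extend((vid, f) for f in range(n) if f not in excluded)
          rng.shuffle(candidates)  # uniform random choice over all safe negatives
          return candidates[:budget]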
      The next phase involves the selection of so-called hard negatives. In particular, we ran this initial version of the detector over the remaining 79 million or so negatives that were not part of the training set. We then computed the maximal bounding box probabilities for each of these frames and sorted the frames from highest to lowest. The hard negatives are the first 1 million of these frames. The idea is simple: These hard negatives are the most challenging for the detector to produce the correct prediction (ie, a prediction of no bounding boxes), and so we would like the detector to see these during training. We then ran a second round of training, including these hard negatives in our new larger training set. (Thus, although 2 million negative frames are actually fed to the neural network training procedure, it is fair to say that all 80 million are used in training, because all 80 million are checked as possible candidates to be hard negatives.) This process of inclusion of hard negatives makes the network much more robust, in particular leading to many fewer false positives.
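      The mining step itself reduces to scoring and taking a top-k; a sketch follows, with detector_score standing in (our naming) for the maximal bounding box probability the initial detector assigns to a frame.
      import heapq

      def mine_hard_negatives(detector_score, negative_frames, k=1_000_000):
          """Keep the k negatives on which the initial detector is most confident
          a polyp is present, ie, the frames with the highest maximal bounding
          box probability; these are the hardest for it to classify correctly."""
          # heapq.nlargest streams over the ~79 million remaining negatives
          # without holding all scores in memory at once.
          return heapq.nlargest(k, negative_frames, key=detector_score)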
      No domain adaptation was performed to adapt the network from train to validation. The network, trained as described above, was applied as is to the validation set.

      Detector evaluation

      Evaluation of the detectors was performed in terms of 2 metrics. The first metric is polyp sensitivity, in which a polyp is considered detected if the detector has declared its existence in at least 1 frame during its duration in the FOV, as marked by the annotators. The second metric is the number of false alarms. A false alarm is considered to have occurred if the detector has declared a detection in a frame in which there was no ground truth annotation of a polyp. To identify false alarms that were in fact polyps missed by human observers, we performed a reanalysis of 200 randomly selected procedures that were labeled normal (ie, without polyps; see the discussion under Subtle polyp evaluation below).
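      To make these definitions concrete, the following is a minimal sketch of the 2 metrics (our naming and data layout, not the study's code), assuming each ground-truth polyp is represented by its frame span in the FOV.
      def evaluate(detections, polyp_spans, num_videos):
          """detections: set of (video_id, frame_idx) pairs the detector flagged.
          polyp_spans: list of (video_id, first_frame, last_frame) polyps.
          A polyp counts as detected if any flagged frame falls inside its span;
          a flagged frame outside every span counts as 1 false alarm."""
          detected, covered = 0, set()
          for vid, first, last in polyp_spans:
              span = {(vid, f) for f in range(first, last + 1)}
              if span & detections:
                  detected += 1
              covered |= span
          sensitivity = detected / len(polyp_spans)
          false_alarms_per_video = len(detections - covered) / num_videos
          return sensitivity, false_alarms_per_video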
      When polyp sensitivity and the false alarm rate are used, a performance curve can be traced out by varying the detection threshold; as the threshold increases, the polyp sensitivity decreases, but so do the false alarms. In discussions with several gastroenterologists, 5 false alarms per procedure was defined as the region of particular interest on the performance curve. To compare our algorithm with metrics reported in Wang et al and Urban et al, we used per-frame image-level metrics as well: per-frame sensitivity, per-frame specificity, and the area under the curve for the image-level performance curve.
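      The per-frame metrics admit an equally short sketch; scikit-learn is used here for the area under the curve as an illustrative choice (the article does not specify tooling).
      import numpy as np
      from sklearn.metrics import roc_auc_score

      def per_frame_metrics(scores, labels, threshold):
          """scores: per-frame maximal box probability; labels: 1 if the frame
          contains an annotated polyp, else 0."""
          scores, labels = np.asarray(scores), np.asarray(labels)
          flagged = scores >= threshold
          sensitivity = flagged[labels == 1].mean()
          specificity = (~flagged)[labels == 0].mean()
          auc = roc_auc_score(labels, scores)  # image-level performance curve
          return sensitivity, specificity, auc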

      Fleeting polyp evaluation

      A key aspect for evaluating DEEP2’s performance on fleeting polyps (ie, polyps that appear in the FOV for a very brief time) was having a baseline of endoscopist performance on these polyps. To obtain this metric, annotators were asked to examine each polyp they had annotated that was in the FOV for ≤5 seconds and decide whether the endoscopist had detected the polyp or not. To make the decision, annotators were told to note if the polyp was in focus and at the center of the frame, but ultimately they were expected to make this decision based on their own endoscopic experience. Then, the endoscopist’s sensitivity was computed as a gross number for all procedures or broken down for a particular range of polyp durations.

      Subtle polyp evaluation

      By definition, any detection of a subtle polyp (ie, a polyp that defied detection by both the endoscopist performing the procedure and the initial offline annotators) is initially classified as a false alarm. Therefore, we performed a reanalysis of the detector's false alarms from 200 randomly selected procedures that were deemed normal (without polyps). All of the detector's false alarms were collected and sent to the annotators, who re-examined those specific frames and classified each false alarm according to the classes shown below (see Results, "Performance on elusive polyps: Subtle polyps").
      Figure 2. Overall performance of DEEP DEtection of Elusive Polyps (DEEP2). A, Performance curve for DEEP2 as implemented on the RetinaNet architecture, illustrating the trade-off between sensitivity and false alarms per video; the false alarms per minute are also indicated on the x-axis in parentheses. B, Selected points from the performance curve in A; all points are chosen to be under the threshold of 5 false alarms per procedure, a threshold that a consensus of gastroenterologists agreed would make the system usable in practice. C and D, Performance curve for DEEP2 as implemented on the LSTM-SSD architecture and corresponding selected points; despite changing the type of neural network used, the results remain largely the same. E, Example detections on a variety of different types of polyps; see the Supplementary Material (available online at www.giejournal.org) for further detail. F, Performance of the algorithm as a function of training set size using the RetinaNet architecture. The performance curve for DEEP2 as implemented on the RetinaNet architecture on a per-image basis (rather than a video-event basis, as shown here) is illustrated in Supplementary Figure 5 (available online at www.giejournal.org). FA, False alarm.
      Figure 3. Performance of DEEP DEtection of Elusive Polyps (DEEP2) on fleeting polyps. A, Sensitivity by duration of polyp in the FOV: performance of DEEP2 versus endoscopist performance for short durations. The 95% confidence intervals are shown in each case. To estimate endoscopist performance, offline annotators examined the colonoscopy video and labeled whether the performing endoscopist detected the polyp (and possibly ignored it) or missed it. For each row of the table, we provide a 2 × 2 table showing the number of polyps when DEEP2 (shortened to "AI") makes a detection or not (columns) and the endoscopist (shortened to "GI") makes a detection or not (rows). Note in particular the large numbers when DEEP2 detects and the GI does not, versus the opposite case when the GI detects and DEEP2 does not. B, Performance breakdown by a 30-second duration threshold; greater than 30 seconds is taken as a rough proxy for "histologically confirmed polyps" because these will be examined for some time before resection. DEEP2 matches the near-perfect performance of the endoscopists on polyps with durations above 30 seconds while giving clearly superior performance on polyps with durations below 30 seconds. C, Clinical significance of elusive polyps. Based on the labeling of the annotators, a major portion (56.0%) of polyps appearing for less than 5 seconds are clinically significant (ie, either adenomatous or malignant). FOV, Field of view; CI, confidence interval; AI, artificial intelligence.
      Figure 4. Performance of DEEP DEtection of Elusive Polyps (DEEP2) on subtle polyps. The offline gastroenterologist annotators reanalyzed the false alarms of DEEP2 on a set of 200 randomly chosen sequences in which neither the performing endoscopist nor the original offline annotators detected any polyps. The top 3 rows show clinically relevant detections among these false alarms: adenomas, hyperplastic polyps, and those deemed worthy of a "second look," a designation indicating that the annotator cannot tell from the video whether the detection is indeed a polyp but that it is suspicious enough to warrant further investigation during a live procedure. The subsequent rows show false alarms that were indeed false alarms/misdetections on reanalysis; interestingly, many of these are "natural" errors, including inflammation, bubbles, and irregularities of haustral folds.

      Clinical study details

      The clinical validation study was a prospective, nonblinded, pilot study of 100 consecutive routine screening or surveillance colonoscopies conducted at Shaare Zedek Medical Center in Jerusalem, Israel (clinicaltrial.gov ID: NCT04693078). Patients with a history of surgery involving the colon or rectum, a known diagnosis of colorectal cancer, a previous history of inflammatory bowel disease, or suspected or diagnosed genetic polyposis syndromes were excluded. During each procedure, DEEP2 was run in real time, with its output presented on a secondary screen. A single off-the-shelf consumer-grade Nvidia 2080Ti GPU (Nvidia, Santa Clara, Calif, USA) was sufficient to enable real-time performance. Each system alert (green bounding box and sound alert) was considered to be a detection and was reviewed by the physician in real time. Endoscopists were requested to report how many polyps were discovered by DEEP2 that they themselves may have missed and how many false alarms were produced by DEEP2. After each procedure, endoscopists were debriefed as to how well or poorly the system functioned and whether they subjectively assessed that the AI system had an impact on outcomes of the colonoscopy. Important events that stemmed from each DEEP2 detection were logged in the study electronic case report form to be further explored offline later. Logged parameters were apparent type of polyp, timing and location of false alarms, whether the polyp was detected or missed by the performing endoscopist, lesion management, Boston Bowel Preparation Scale score, and age and sex of the patient. Because of regulation and patient privacy laws and by request of the ethical review board, all clinical data were deidentified such that no linking could be made with pathology reports; thus, no information regarding histology of polyps is available. Monitoring for late adverse events was the responsibility of the performing endoscopist. The primary outcomes for the design of the clinical study were the number of additional polyps detected by DEEP2 in real time and safety. Secondary outcomes were polyp detection rate (ie, percentage of colonoscopies where ≥1 polyp was detected), rate of false alarms per colonoscopy, and user experience on a 5-point scale.
      The protocol was approved by the ethical review board of Shaare Zedek Medical Center and was conducted in accordance with the Good Clinical Practice guidelines of the International Conference on Harmonization and the provisions of the Declaration of Helsinki. All patients provided written informed consent.

      Statistical analysis

      To compute confidence intervals (CIs), we used the standard normal approximation for properly normalized sums of independent random variables. To compute P values, we used both standard techniques and an upper bound on the complementary cumulative distribution function of the normal distribution, Q(t) < exp(-t^2/2) / (√(2π) t) for t > 0 (see, eg, Wikipedia). The latter was used when the standard technique was beyond the computer's numerical tolerance (leading to P = 0) and yields an upper bound on P.
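      Both ingredients are simple to express in code; the following is a minimal sketch (helper names are ours) of the normal-approximation CI for a proportion and the tail bound quoted above.
      import math

      def proportion_ci(p_hat, n, z=1.96):
          """95% CI for a proportion via the standard normal approximation."""
          half = z * math.sqrt(p_hat * (1 - p_hat) / n)
          return p_hat - half, p_hat + half

      def q_upper_bound(t):
          """Upper bound on Q(t) = P(Z > t) for t > 0:
          exp(-t^2/2) / (sqrt(2*pi) * t). Used when the exact tail probability
          underflows to 0 numerically, yielding an upper bound on P."""
          return math.exp(-t * t / 2) / (math.sqrt(2 * math.pi) * t)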

      Results

      Data

      The training data collected from 2 academic hospitals consisted of 3611 procedures, equivalent to 796 hours of video, or 86 million frames. To test the system's generalizability, the validation data were collected from a third unrelated community hospital and consisted of 1393 colonoscopy procedures, equivalent to 310 hours of video, or 33 million frames (Supplementary Table 3). Each video in both the training and validation data was annotated by 3 gastroenterologists in a consensus-based approach; the annotators were drawn from a pool of 20 gastroenterologists with 4 to 40 years of experience (mean, 12.6 years; median, 8.0 years) and 400 to 3000 colonoscopies performed per year (mean, 1424 colonoscopies; median, 1200 colonoscopies).

      DEEP2 performance

      The entire performance curve of DEEP2 is shown in Figure 2, illustrating the trade-off between sensitivity and false alarms per procedure, along with a table of selected points drawn from this curve. The point of particular clinical relevance is a sensitivity of 97.1% (95% CI, 95.8%-98.4%) at 4.6 false alarms per procedure (95% CI, 4.0-5.2). This number of false alarms corresponds to a per-frame false alarm rate of .23%. These results use the LSTM-SSD neural network architecture (Fig. 2A and B); results for the RetinaNet architecture were similar (Fig. 2C and D). All points on the performance curves are given in Supplementary Tables 4 and 5 (available online at www.giejournal.org); the performance curve for DEEP2 as implemented on the RetinaNet architecture on a per-image basis, rather than a video-event basis, is illustrated in Supplementary Figure 5 (available online at www.giejournal.org). Qualitative results are shown in Figure 2E. Performance declined as the training set size decreased (Fig. 2F). Example videos are included in the Supplementary Material (Videos 1-14, available online at www.giejournal.org). Comparing DEEP2 with other available systems (Wang et al; Urban et al) showed a significant decrease in the false alarm rate (see Appendix 1, available online at www.giejournal.org); however, because these AI tools were evaluated on different validation datasets, this finding must be interpreted with caution.

      Performance on elusive polyps

      Fleeting polyps

      To investigate whether polyps that appear only briefly are more difficult for the algorithm to detect, we compared the fraction of these polyps detected by DEEP2 with the fraction detected by the performing endoscopist, as determined by the offline annotators. Results are reported in Figure 3A and B and Videos 1 to 5 (available online at www.giejournal.org). For polyps appearing for less than 5 seconds in the FOV, DEEP2 had a sensitivity of 88.5% (95% CI, 84.6%-92.4%) compared with 31.7% (95% CI, 26.0%-37.5%) for endoscopists (P < 10^-83). The comparison was even starker for polyps appearing for less than 2 seconds, with sensitivities of 84.9% (95% CI, 79.3%-90.5%) versus 18.9% (95% CI, 12.8%-24.9%), respectively (P < 10^-100). The system retains a similar performance level for much shorter durations: On polyps that appear for less than half a second, the sensitivity was 88.7% (95% CI, 80.8%-96.6%), and on those appearing for less than a tenth of a second, 87.5% (95% CI, 71.3%-100.0%).
      Conversely, the algorithm rarely missed polyps appearing for a longer duration. Histologically confirmed polyps by definition required removal of the lesion; therefore, they generally appeared in the FOV for more than 30 seconds. For these polyps, the system's detection rate was 99.8% (95% CI, 99.4%-100.0%).

      Subtle polyps

      DEEP2's ability to detect subtle polyps is reported in Figure 4 and Videos 6 to 9 (available online at www.giejournal.org). By definition, these polyps were initially identified as false alarms. In a reanalysis of the detector's false alarms from 200 randomly selected procedures that were deemed normal (without polyps), 1087 false alarms were recorded, of which 44 had clinical significance: 20 were identified as having endoscopic features compatible with adenomas and 24 with hyperplastic polyps. A further 42 were deemed worthy of a "second look," a designation indicating that the finding would warrant further investigation during a live procedure. On a per-procedure basis, these numbers translated to .1 adenomas (95% CI, .06-.14), .12 hyperplastic polyps (95% CI, .07-.17), and .21 second looks (95% CI, .15-.27), for a total of .43 extra "events" per procedure (95% CI, .35-.50). Information on the other types of false alarms is reported in Figure 4.

      Clinical significance of elusive polyps

      In an attempt to quantify the clinical significance of elusive polyps in the absence of histopathology reports, the annotators labeled elusive polyps as belonging to 1 of 3 classes: hyperplastic, adenomatous, or malignant. Polyps were considered to be of potential clinical significance if they were labeled as adenomatous or malignant. The results are reported in Figure 3C. For polyps appearing for less than 5 seconds in the FOV, the proportion of clinically significant polyps was 56.0% (95% CI, 49.8%-62.1%).

      Clinical study

      In the clinical validation study, all procedures were performed by 1 of 3 board-certified endoscopists previously trained in the use of DEEP2, with experience ranging from 7 to 35 years (mean, 17.0 years; median, 9.0 years). Mean patient age was 60 years (95% CI, 58-63), and 44% of patients were women. The mean Boston Bowel Preparation Scale score was 7.45 (95% CI, 7.11-7.80). The overall polyp detection rate was 74%.
      On a per-procedure basis, endoscopists detected a mean of 1.63 polyps per colonoscopy (95% CI, 1.19-2.06), of which DEEP2 detected all but .08 (95% CI, .02-.14). However, DEEP2 discovered a further .89 polyps per colonoscopy (95% CI, .66-1.12) that, by their own assessment, were not detected by the endoscopists (P < 10^-5), an increase of 54%. There were 3.87 false positives per procedure (95% CI, 3.34-4.40). User experience was very positive; endoscopists gave the system an average score of 3.9 of 5 (95% CI, 3.7-4.1). Samples are shown in Videos 10 to 13 (available online at www.giejournal.org).

      Discussion

      For a system like DEEP2 to be clinically relevant, it is crucial to know how many false positives are tolerable. DEEP2 detected 97.1% of polyps while producing an average of 4.6 false alarms per procedure, which translates to .47 false alarms per minute of the procedure. This compares very favorably with the 2.4 false alarms per minute of withdrawal time of the commercial GI-Genius system (Medtronic, Minneapolis, Minn, USA) reported by Hassan et al (Computer-aided detection-assisted colonoscopy: classification and relevance of false positives); note, however, that the GI-Genius results were obtained from a real-time randomized controlled trial. Furthermore, our per-frame false-positive rate was .23% compared with the rate of .9% in Hassan et al (New artificial intelligence system: first validation study versus experienced endoscopists for colorectal polyp detection), which represents a decrease by a factor of 4. Given the ability of DEEP2 to detect all but 2.9% of polyps while producing a tolerable number of false positives, we believe the system can have immediate value as a decision support tool in the clinic.
      DEEP2’s performance on elusive polyps, displaying an extremely high sensitivity for polyps appearing in the FOV for a very short time, is in our opinion where the system really shows its benefit as a decision support tool. Previous studies have restricted themselves to validation on histologically confirmed polyps. This approach seems to be sensible but has a critical drawback: The only polyps with histologic confirmation are, by definition, those discovered and removed. An automated system that is validated only on polyps that were already discovered by the endoscopist does not offer much added value. By contrast, a system that can detect polyps that were missed by the performing endoscopist offers immediate and obvious benefit. Remarkably, the system was able to detect polyps that were missed by the performing endoscopist and offline annotators (ie, subtle polyps). The performance of DEEP2 on subtle polyps suggests its generalizability: The system has learned to detect items that were initially missed by all who viewed the procedure. In computing the performance of DEEP2, we made sure to validate on a dataset that was from a different institution. In particular, a community hospital that often focuses on achieving a significant volume of colonoscopies is incentivized toward a decrease in procedure duration. DEEP2’s ability to generalize to a community hospital is noteworthy because one might reasonably argue it is precisely the endoscopist in such a setting that would benefit the most.
      It is also interesting to note that DEEP2 is fairly insensitive to the choice of neural network architecture. We used 2 architectures: RetinaNet and LSTM-SSD. RetinaNet is a leading technique for object detection on static images (applied to video by running it on consecutive frames). It is a single-stage detector with no initial region proposal phase; rather, it is based on a fully convolutional network that takes the image as input and directly outputs all detections in the image at once. RetinaNet is a top performer on a variety of benchmarks given a fixed computational budget; it is known for balancing speed of computation with accuracy. LSTM-SSD is a true video object detection architecture, which can explicitly account for the temporal character of the video (eg, temporal consistency of detections, ability to deal with blur and fast motion). The SSD part of the network is a single-stage detector like RetinaNet, whereas the LSTM part incorporates recurrent units based on ConvLSTM layers, which allow for the easy integration of both spatial and temporal information. LSTM-SSD is known for being robust and very computationally lightweight and can therefore run on less-expensive processors. Comparable results were also attained on the much heavier Faster R-CNN architecture (Ren et al) and on the very recent EfficientDet architecture (Tan et al).
      The fact that results are similar across different architectures implies that one can choose the network meeting the available hardware specifications. This indicates the potential for deployment in less developed areas, where cost control is paramount.
      Our clinical study was small, but the aim was to assess the applicability, stability, safety, and user experience and to obtain some preliminary data on performance in a real-life scenario. Our results support the conclusions described above. In particular, using DEEP2 as a real-time decision support tool increased the polyps detected per colonoscopy by 54% at a rate of 3.87 false alarms per procedure, without adverse events and with an overall positive user experience. A larger controlled study including information on the histology of polyps is necessary to fully quantify the improvement in quality measures that DEEP2 may provide.
      We recognize that the current study has limitations. First, the annotation procedure is not precisely similar to the standard consensus-based approaches used in AI studies dealing with static images (eg, in radiology and ophthalmology); the reasons for this have been discussed and justified in the section explaining the annotation procedure. Second, the annotation relied on 70 to 100 representative frames for a given polyp; it is possible that the use of more frames might lead to an increase in performance. Third, the finding of "subtle polyps" raises the question of whether there are "hidden" polyps that are missed by all 3: endoscopists, annotators, and algorithm. Fourth, to better understand and describe computer-aided diagnostic systems, with the ultimate goal of developing an "ideal" AI polyp detection algorithm, the performance of the computer-aided diagnostic system should be evaluated for different morphologic subtypes of colorectal lesions (Kudo et al) as classified by the Paris classification (eg, flat or depressed-type neoplasms) (Participants in the Paris Workshop; Endoscopic Classification Review Group).
      In the current study we were unable to perform this analysis; rather, we measured all types of polyps as a single group. The same is true for different histopathologic subtypes (eg, adenomatous lesions, serrated polyps, hyperplastic polyps) and grades of dysplasia. In practice, this could increase procedure time and cost, because more diminutive polyps might be found and not characterized/predicted as adenoma or hyperplastic. Finally, our clinical study is admittedly small, noncontrolled, and without histologic information. In addition, because endoscopists had to report how many polyps were discovered by DEEP2 that they themselves may have missed, reporting bias may be an issue; however, because endoscopists do not like to admit that they miss polyps, we believe that reporting bias in this setting would weaken the measured performance of the system in favor of the performance of the endoscopists. The results here are encouraging, but clearly a larger, randomized study is warranted. More broadly, there are several limitations relating to the practical deployment of any AI-based system for automated polyp detection. These include, inter alia, uptake attitudes of endoscopists, deskilling of endoscopists, over-reliance on the AI system, and concerns that nonphysicians might replace physicians in the performance of endoscopic procedures (Wang et al; Le Berre et al; Ahmad et al; Misawa et al; Wadhwa et al; McNeil and Gross).
      In conclusion, we have presented DEEP2, a system for automatic polyp detection that attains state-of-the-art performance. DEEP2 is very sensitive on the population of polyps as a whole and in particular on the subpopulation of elusive polyps—those polyps that human operators have the most difficulty in detecting and for which AI can provide the most added value. DEEP2 also displays a low rate of false positives, leading to a potentially more pleasant experience for the user. A small clinical study reinforces these conclusions.

      Acknowledgments

      We acknowledge Yiwen Luo, Huy Doan, and Quang Duong for software infrastructure support for data collection. We also thank the many gastroenterologists who helped in the large-scale annotation effort that was necessary for this study.
      The dataset was gathered from colonoscopy procedures performed in 3 Israeli hospitals. In all 3 cases, the data gathered were used under license for the current study and so are not publicly available. Figures 1, 2, and 4 and Supplementary Figures 1, 2, 3, and 4 contain raw images from Shaare Zedek Medical Center. The supplementary videos are also from Shaare Zedek Medical Center.
      The code used for training the models has a large number of dependencies on internal tooling, infrastructure, and hardware, and its release is therefore not feasible. However, all experiments and implementation details are described in sufficient detail in Methods and/or in Appendix 1 to allow independent replication with nonproprietary libraries. Major components of our work are available in open source repositories: Tensorflow (https://www.tensorflow.org) and Tensorflow Object Detection API (https://github.com/tensorflow/models/tree/master/research/object_detection).

      Supplementary data

      Appendix 1

      Supplementary annotation methods

      The challenge of ground truth acquisition in video annotation is that the videos are long; therefore, one cannot expect to label each frame in the video. Instead, the goals of the annotation procedure were to identify every polyp that occurred within a given video and, for each such identified polyp, to annotate a number of representative frames with bounding boxes around the polyp. The number of representative frames was taken to be in the range of 70 to 100. This set of frames always contained both the first and the last frame during which the polyp was in the field of view, and the remaining frames were roughly equally spaced. For a polyp that appeared for 25 seconds (ie, 750 frames), 70 to 100 frames represent only 10% to 15% of the frames; nevertheless, this is more than sufficient to capture the appearance diversity of the polyp, because at 30 frames per second it amounts to sampling the polyp about every quarter of a second. The rest of the frames (ie, frames without polyps) were used to train the system as negative frames (see Neural network training in Methods).
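      A minimal sketch of one way to choose such a set of representative frames (the helper name is ours):
      import numpy as np

      def representative_frames(first, last, k=100):
          """Up to k roughly equally spaced frame indices in [first, last],
          always containing both endpoints."""
          k = min(k, last - first + 1)
          return np.unique(np.linspace(first, last, num=k).round().astype(int))

      # eg, a 25-second polyp (frames 0-749) sampled with k=85 steps by
      # roughly 9 frames, ie, about every .3 seconds at 30 frames per second.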
      To aid in annotation, a specialized labeling tool was used, shown in Supplementary Figure 1. This tool allowed the annotator to easily pause the video as well as rewind and use various playback speeds. The tool also allowed for the labeling of still images. As is common practice, the aim was to have more than 1 gastroenterologist annotate each video to provide a more reliable, consensus-based labeling. However, this is considerably more difficult to achieve for video annotation than for image annotation, because individual annotators may differ in both the temporal and spatial locations of their respective bounding boxes. The procedure illustrated in Figure 1 was therefore used.
      Two annotators were given the task of independently annotating the video, using the procedure described above. These 2 annotations were then sent for a refinement stage to a third annotator, whose role was to unify the 2 annotations. In terms of unification, there are 2 separate scenarios: a given frame has been labeled by both of the initial annotators, or a given frame has been labeled by only 1 of them. The latter case happens often, because each annotator labels only 70 to 100 representative frames, which is fewer than the total number of frames; therefore, their annotations often do not overlap. In the first case, the third annotator must choose the more appropriate bounding box of the 2 (or the more appropriate boxes, in the case of multiple polyps). In the second case, there is no such decision to be made. In both cases, the third annotator can slightly refine the location of the chosen bounding box, if so desired. Finally, for negative frames (ie, frames that both initial annotators agreed were devoid of polyps), the third annotator had the ability to disagree by proposing a new polyp and sending it back to the initial annotators for agreement.
      Finally, the annotations were examined by a nonphysician to simply ensure that no obvious errors had occurred in the labeling process. If such errors were suspected, then the annotation was sent back to the third annotator to verify whether there were indeed errors and, if so, to fix these errors. This procedure was used for the construction of both the training and validation sets.

      Full description of the neural network architecture

      RetinaNet

      The overall framework for polyp detection is illustrated in Supplementary Figure 2. Instantiation of this system involves choosing a particular neural network architecture for the block labeled "CNN for Detection" and optionally for the block labeled "Memory (State)." We begin by describing a simpler architecture in which only the block labeled "CNN for Detection" is used.
      We used the RetinaNet architecture for object detection (Lin et al. Focal loss for dense object detection. ICCV 2017), illustrated in Supplementary Figure 3, which works as follows. A large set of candidate object locations, referred to as "anchor boxes," is sampled across the image (at about 100,000 locations). This set densely covers a grid of spatial positions, with 3 scales and 3 aspect ratios for each position. The ResNet-50 network (He et al, Deep residual learning for image recognition) is applied directly to the image to extract features. These ResNet features are then combined across multiple resolution levels using a feature pyramid network.
      The features from each feature pyramid network level are then further fed into 2 subnetworks. First, the classification subnet predicts the probability of object presence at each spatial position for each of the A = 9 anchors per position and K object classes. The subnet is a fully convolutional network, terminating in a convolutional layer with KA filters. Note that we can predict the type of polyp here (eg, adenomatous vs hyperplastic) by taking K > 1 classes; in practice, we do not do so. Second, the box regression subnet is a fully convolutional network that predicts an offset from each anchor box to a nearby ground-truth object, if 1 exists. It is identical in structure to the classification subnet except that it terminates in 4A linear outputs per spatial location. The top predictions from all levels are merged, and nonmaximum suppression with a threshold is applied to yield the final detections.
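      For illustration, here is a minimal sketch of the anchor layout at a single pyramid level (3 scales × 3 aspect ratios, so A = 9 boxes per position); the scale and ratio values shown are the standard RetinaNet defaults, assumed rather than taken from this article.
      import numpy as np

      def anchors_for_level(feat_h, feat_w, stride, base_size,
                            scales=(1.0, 2 ** (1 / 3), 2 ** (2 / 3)),
                            ratios=(0.5, 1.0, 2.0)):
          """(cx, cy, w, h) boxes: 3 scales x 3 aspect ratios centered at
          every spatial position of one feature pyramid level."""
          boxes = []
          for y in range(feat_h):
              for x in range(feat_w):
                  cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
                  for s in scales:
                      for r in ratios:  # r = height/width; area stays constant
                          boxes.append((cx, cy,
                                        base_size * s / np.sqrt(r),
                                        base_size * s * np.sqrt(r)))
          return np.asarray(boxes, dtype=np.float32)

      Summed over the positions of all pyramid levels of a high-resolution frame, this layout yields on the order of the 100,000 candidate locations mentioned above.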

      Long Short-Term Memory Single Shot Detector (LSTM-SSD)

      Returning to Supplementary Figure 2, it is possible to use a neural network architecture that incorporates the block labeled "Memory (State)." The purpose of such a block is to maintain and pass state between successive image frames; this recursive structure allows the model to make more confident predictions relative to a single-frame model in difficult scenarios such as blur and occlusion. One such architecture is the LSTM-SSD architecture, shown in Supplementary Figure 4, which we now describe.
      The base of the LSTM-SSD architecture uses a single shot multibox detector, or SSD (Liu et al, SSD: single shot multibox detector); this is a single-frame detector that is similar to the RetinaNet architecture described above. The first modification made to the standard SSD setup is to replace all convolutional layers by depth-wise separable convolutions (Howard et al, MobileNets; Chollet, Xception); this enables faster computations. The second modification is to insert recurrent units between consecutive convolutional layers. These recurrent units, which are ConvLSTM layers (Shi et al; Patraucean et al), allow for information from both the current frame and the previous frames to be incorporated.
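      As a concrete illustration of these 2 modifications, here is a minimal sketch in TensorFlow (the stack named in the Acknowledgments); the spatial size, channel counts, and kernel sizes are illustrative assumptions, not the network's actual dimensions.
      import tensorflow as tf

      def sep_conv(filters):
          # depth-wise separable convolution, applied independently per frame
          return tf.keras.layers.TimeDistributed(
              tf.keras.layers.SeparableConv2D(filters, 3, padding="same",
                                              activation="relu"))

      inputs = tf.keras.Input(shape=(None, 64, 64, 32))  # (time, H, W, C)
      x = sep_conv(32)(inputs)
      x = tf.keras.layers.ConvLSTM2D(32, 3, padding="same",
                                     return_sequences=True)(x)  # recurrent unit
      x = sep_conv(32)(x)
      model = tf.keras.Model(inputs, x)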

      Temporal logic layer

      Referring to Supplementary Figure 2, the output of the detection network is refined by passing it through a temporal logic layer. The purpose of this layer is to take the bounding boxes output by the network and refine them based on bounding boxes detected in previous frames. In particular, we find the following very simple form of temporal logic to work well. We examine bounding boxes from the most recent n frames, a window that includes the current frame. If at least k of these n frames, including the current frame, have at least 1 detection, then we declare a detection for the current frame; otherwise, we do not.
      This very simple layer is effective in filtering out false positives. It is parameterized by only 2 values, namely k and n. These numbers will impact the latency of the detections, because one effectively has to wait n frames to make a detection. Thus, for practical reasons concerned with latency, one may choose to make n small.
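      A sketch of this k-of-n rule as a small stateful filter; the class name and the default k and n shown are ours (placeholders, not the deployed values).
      from collections import deque

      class TemporalLogicFilter:
          """Declare a detection at the current frame only if the current frame
          has at least 1 raw detection and at least k of the last n frames
          (current frame included) do as well."""

          def __init__(self, k=3, n=5):  # k and n are the only parameters
              self.k = k
              self.window = deque(maxlen=n)

          def update(self, has_detection):
              self.window.append(bool(has_detection))
              return has_detection and sum(self.window) >= self.k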

      Supplementary comparison of DEEP DEtection of Elusive Polyps with other published systems

      Comparing artificial intelligence tools that were evaluated on different validation datasets is problematic; hence, the following results must be viewed with caution. Nevertheless, and taking this limitation into consideration, the aim of these comparisons is to better illustrate the performance of DEEP DEtection of Elusive Polyps (DEEP2) relative to currently available systems, especially its significantly fewer false positives. To illustrate the system's performance in this regard, we compare it with 2 leading artificial intelligence systems for polyp detection reported in the literature (Urban et al, Deep learning localizes and identifies polyps in real time with 96% accuracy in screening colonoscopy; Wang et al, Development and validation of a deep-learning algorithm for the detection of polyps during colonoscopy).

      Comparison with Urban et al.

      For a comparison with the method of Urban et al, we used the same validation methodology, which is different from ours in that it is more forgiving in terms of what constitutes a false positive. Specifically, Urban et al stated that “all false-positive findings with duration of at least 1 second are counted” (see Table 5 caption in Urban et al).
The performance for Urban et al is computed from their Table 5. Using the column corresponding to “11 Challenging Videos,” the sensitivity is 68 / 73 = 93.2%, and the number of false alarms per procedure is 46 / 11 = 4.18. Whereas that system incurs 4.18 false alarms per procedure when operating at a sensitivity of 93.2%, DEEP2 incurs .14 false alarms per procedure when operating at a sensitivity of 93.8%, a decrease in false positives by a factor of almost 30.
An alternative computation combines the 2 columns of their Table 5, adding the somewhat easier “9 Videos” dataset to the “11 Challenging Videos” dataset. In this case, their sensitivity increases to 113 / 118 = 95.8%, but their false alarms per procedure also increase to 127 / 20 = 6.4. This latter case may be compared with DEEP2, which achieves a sensitivity of 96.1% at .38 false alarms per procedure.
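The arithmetic behind these figures can be reproduced directly from the counts in Table 5 of Urban et al (our illustration):

```python
# Counts from Table 5 of Urban et al: detected polyps / total polyps, and
# false alarms / number of videos.
sens_challenging = 68 / 73   # 0.932 -> 93.2% sensitivity
fa_challenging = 46 / 11     # 4.18 false alarms per procedure
sens_pooled = 113 / 118      # 0.958 -> 95.8% (both datasets combined)
fa_pooled = 127 / 20         # 6.35, i.e., ~6.4 false alarms per procedure
print(f"{sens_challenging:.1%}, {fa_challenging:.2f} FA/procedure")
print(f"{sens_pooled:.1%}, {fa_pooled:.2f} FA/procedure")
```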

      Comparison with Wang et al.
      • Wang P.
      • Xiao X.
      • Glissen Brown J.R.
      • et al.
      Development and validation of a deep-learning algorithm for the detection of polyps during colonoscopy.

The method of Wang et al focuses on purely image-level metrics. The performance for Wang et al for the purpose of these comparisons is computed from their Table 2. At an equivalent per-frame sensitivity of 94.4%, DEEP2 achieves a 2.5% absolute increase in per-frame specificity, from 95.9% to 98.4%; because the per-frame false alarm rate is 1 minus the specificity, this is equivalent to 2.6 times fewer false alarms (4.1% vs 1.6% of frames).
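A quick check of how the 2.6-fold figure follows from the two specificities (our illustration):

```python
# The per-frame false alarm rate is 1 - specificity; the ratio of the two
# rates gives the reduction factor.
spec_wang, spec_deep2 = 0.959, 0.984
rate_wang = 1 - spec_wang      # 4.1% of frames
rate_deep2 = 1 - spec_deep2    # 1.6% of frames
print(f"reduction factor = {rate_wang / rate_deep2:.1f}")  # ~2.6
```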

      Other comparisons

Conversely, the algorithm rarely misses polyps that remain longer in view. Histologically confirmed polyps generally appear in the field of view for more than 30 seconds, because they are interrogated by the endoscopist (and later, sometimes resected). On these polyps, the system's detection rate is 99.8% (95% confidence interval, 99.4%-100.0%). We note that this is comparable with the 99.7% rate on histologically confirmed polyps reported in Hassan et al
• Hassan C.
• Wallace M.B.
• Sharma P.
• et al.
New artificial intelligence system: first validation study versus experienced endoscopists for colorectal polyp detection.
; however, our per-frame false alarm rate of .23% is about 4 times lower than the rate of .9% reported in that study. DEEP2 detects 97.1% of polyps while producing an average of 4.6 false alarms per procedure. The latter number translates to .47 false alarms per minute of the procedure; this compares very favorably with the 2.4 false alarms per minute of withdrawal time of the commercial Medtronic GI-Genius system tested in Hassan et al.
• Hassan C.
• Badalamenti M.
• Maselli R.
• et al.
Computer-aided detection-assisted colonoscopy: classification and relevance of false positives.
Note that the difference is probably even more dramatic, because the 2.4 false alarms per minute covers only the withdrawal; Hassan et al
• Hassan C.
• Badalamenti M.
• Maselli R.
• et al.
Computer-aided detection-assisted colonoscopy: classification and relevance of false positives.
explicitly stated that they limited their analysis to the withdrawal phase of the colonoscopy videos, ignoring the insertion phase, in which a high number of false positives can be triggered by collapsing folds and little if any insufflation of the lumen.
Regarding comparisons of the datasets used, both the training and validation of DEEP2 were done on a considerably different kind of dataset from any yet reported in the literature. Note that Urban et al
• Urban G.
• Tripathi P.
• Alkayali T.
• et al.
Deep learning localizes and identifies polyps in real time with 96% accuracy in screening colonoscopy.
used 20 videos totaling 5 hours and .5M frames, whereas Wang et al
• Wang P.
• Xiao X.
• Glissen Brown J.R.
• et al.
Development and validation of a deep-learning algorithm for the detection of polyps during colonoscopy.
      used 192 videos, totaling 12.6 hours (many of the videos are shorter, consisting of only a single polyp) and 1.4M frames. As can be seen from Supplementary Table 3, our video data are about 2 orders of magnitude larger than the data from either of these previous studies, for both training and validation. When comparing with Hassan et al,
      • Hassan C.
      • Wallace M.B.
      • Sharma P.
      • et al.
      New artificial intelligence system: first validation study versus experienced endoscopists for colorectal polyp detection.
      the datasets are of comparable size; the key difference lies in the fact that the current study uses a more diverse population of polyps. In particular, Hassan et al
      • Hassan C.
      • Wallace M.B.
      • Sharma P.
      • et al.
      New artificial intelligence system: first validation study versus experienced endoscopists for colorectal polyp detection.
      used only polyps that were detected by the performing endoscopist, whereas our dataset also includes all polyps discovered by offline gastroenterologist annotators, a considerably larger set. The use of this type of data is what enables the performance on elusive polyps.
      Beyond this, it is important to describe the critical way in which the data were used in training. In particular, we extracted 2 million negative frames to represent the diversity of the nonpolyp background of the colon. Of these data, 1 million frames were sampled randomly and used to train the initial model; the remaining 1 million frames were so-called hard negatives (ie, negatives that tend to give the detector trouble) in the sense that they produce false positives in earlier iterations of the training. These 1 million hard negatives are drawn from a population of over 80 million frames; thus, the effective size of the training set is considerably larger. The unparalleled diversity of both polyps and background is what leads to the system's robust performance. Indeed, we have validated the effect of training set size in Figure 2F. The clear increase in performance as the dataset grows in size justifies the collection of such a large dataset.
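The hard-negative mining procedure described here can be summarized by the following minimal sketch (our reconstruction, not the authors' pipeline; `train_fn` and `model.score` are hypothetical placeholders, and the pool is assumed to be an in-memory list for illustration):

```python
# Two-stage training with hard-negative mining: train an initial detector on
# random negatives, then add negatives the detector falsely flags and retrain.
import random

def mine_hard_negatives(model, negative_pool, budget, threshold=0.5):
    """Collect up to `budget` non-polyp frames on which the detector fires."""
    hard = []
    for frame in negative_pool:              # the paper's pool: >80M frames
        if model.score(frame) > threshold:   # a false positive -> "hard" negative
            hard.append(frame)
            if len(hard) == budget:
                break
    return hard

def train_with_hard_negatives(train_fn, positives, negative_pool):
    # Stage 1: initial model on 1M randomly sampled negatives.
    random_negs = random.sample(negative_pool, 1_000_000)
    model = train_fn(positives, random_negs)
    # Stage 2: add 1M negatives that the stage-1 model gets wrong.
    hard_negs = mine_hard_negatives(model, negative_pool, 1_000_000)
    return train_fn(positives, random_negs + hard_negs)
```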
Supplementary Figure 1. Annotation tool. A polyp is shown as marked via a bounding box in a particular frame of the video. On the right is shown a box for this specific polyp, indicating various frames with timestamps that have been annotated, along with other information that can be marked, such as the Paris classification and whether the polyp is sessile serrated. Multiple polyps can be marked per video, including multiple annotations per frame where relevant; in this instance, 2 separate polyp annotations can be seen in the box on the right, corresponding to the purple and green text boxes. The purple polyp occurs in the current frame, whereas the green polyp occurs in other frames during the video and hence is not seen in the image.
Supplementary Figure 2. The algorithmic framework for the DEEP DEtection of Elusive Polyps system. Black arrows indicate computational flow; green dashed arrows indicate optional steps. In the simpler setup, the current frame of the endoscopic video is passed to a convolutional neural network (CNN), which performs object detection. Specifically, it outputs a list of bounding boxes, each of which contains a polyp; each bounding box is characterized by the box coordinates and by a detection probability, the probability the CNN assigns to the given box containing a polyp. This list of bounding boxes is then passed through a temporal logic layer, which aggregates the current set of bounding boxes with those from previous frames and on this basis may filter out any of the current boxes. The boxes that pass the filter are then output and overlaid on the current frame. A video composed of these frames with box overlays may then be shown to the endoscopist performing the colonoscopy, thereby acting as a decision support tool. The green arrows indicate optional steps, in which knowledge from previous frames is stored in a kind of memory, or state, which can then aid in the current detection. An example of the simpler setup is the RetinaNet object detection network,
• Lin T.-Y.
• Goyal P.
• Girshick R.
• et al.
Focal loss for dense object detection.
whereas an example of the more complex setup is the LSTM-SSD object detection network.
• Liu M.
• Zhu M.
Mobile video object detection with temporally-aware feature maps.
Both network architectures are used in the current study.
Supplementary Figure 3. RetinaNet architecture. A large set of candidate object locations, referred to as “anchor boxes,” is sampled across the image (at about 100,000 locations). This set densely covers a grid of spatial positions, with 3 scales and 3 aspect ratios for each position. The ResNet-50 network is applied directly to the image to extract features. These ResNet features are then combined across multiple resolution levels via a feature pyramid network. The features from each feature pyramid network level are then further fed into 2 subnetworks. First, the classification subnet predicts the probability of object presence at each spatial position for each of the A = 9 anchors per position and K object classes. The subnet is a fully convolutional network that applies four 3 × 3 convolutional layers, each with C = 256 filters and each followed by a rectified linear unit (ReLU) activation, followed by a 3 × 3 convolutional layer with KA filters. Note that we could predict the type of polyp here (eg, adenomatous vs hyperplastic) by taking K >1 classes; in practice, we do not do so. Second, the box regression subnet is a fully convolutional network that predicts an offset from each anchor box to a nearby ground-truth object, if 1 exists. It is identical in structure to the classification subnet except that it terminates in 4A linear outputs per spatial location. (The factor 4 arises because a bounding box can be described by 4 numbers.) The top predictions from all levels are merged, and nonmaximum suppression with a threshold is applied to yield the final detections.
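The structure of the 2 subnets lends itself to a compact sketch (ours, in tf.keras; not the authors' code), with K = 1 polyp class and A = 9 anchors as in the legend:

```python
# Minimal sketch of the RetinaNet classification and box regression subnets:
# four 3x3 convolutions with C = 256 filters and ReLU activations, then a
# final 3x3 convolution with K*A (classification) or 4*A (box regression)
# outputs per spatial position.
import tensorflow as tf

def retinanet_subnet(outputs_per_anchor: int, A: int = 9, C: int = 256):
    layers = [tf.keras.layers.Conv2D(C, 3, padding="same", activation="relu")
              for _ in range(4)]
    # One prediction per anchor at every spatial position of the feature map.
    layers.append(tf.keras.layers.Conv2D(outputs_per_anchor * A, 3, padding="same"))
    return tf.keras.Sequential(layers)

K = 1                                         # a single "polyp" class, as here
classification_subnet = retinanet_subnet(K)   # K*A class logits per position
box_regression_subnet = retinanet_subnet(4)   # 4*A box offsets per position
```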
Supplementary Figure 4. LSTM-SSD architecture. The base of the LSTM-SSD architecture uses a single shot multibox detector (SSD), a single-frame detector similar to the RetinaNet architecture. The first modification made to the standard SSD setup is to replace all convolutional layers by depth-wise separable convolutions; this enables faster computations. The second modification is to insert recurrent units between consecutive convolutional layers. These recurrent units allow for information from both the current frame and the previous frames to be incorporated. In particular, the recurrent units are ConvLSTM layers, which allow for the easy integration of both spatial and temporal information. For speed purposes, an efficient Bottleneck-LSTM unit, based again on depth-wise separable convolutions, is used to reduce the computational cost significantly compared with a regular LSTM unit.
Supplementary Figure 5. Performance curve for images. The performance curve for the DEEP DEtection of Elusive Polyps system as implemented on the RetinaNet architecture, on a per-image basis (rather than a video-event basis). This curve illustrates the trade-off between per-image sensitivity and per-image specificity, with selected points on the curve shown in the table to the right. The area under the curve is .994.
Supplementary Table 1. List of annotators
Annotator   Country based in   Years of experience   Volume of colonoscopies
1           India              6                     3000
2           India              8                     2000
3           India              5                     1200
4           India              7                     1000
5           India              15                    2000
6           India              7                     2000
7           India              25                    1200
8           India              15                    2000
9           India              6                     1000
10          India              7                     1000
11          India              9                     1500
12          Israel             7                     600
13          Israel             7                     1250
14          Israel             9                     400
15          Israel             4                     650
16          Israel             25                    NR
17          Israel             30                    800
18          Israel             8                     2600
19          Israel             40                    NR
20          Israel             11                    NR
Mean                           12.6                  1424
Median                         8                     1200
      NR, Not recorded.
Supplementary Table 2. Hyperparameters for the training of the detection system
Hyperparameter                 Value
Optimizer hyperparameters
 No. of steps                  1,500,000
 Learning rate schedule        Steps 0-400K: .0002; steps 400K-800K: .00002; steps 800K-1500K: .000002
 Optimization algorithm        Momentum with value .9
 Batch size                    8
 Weight decay                  .0004
 Batch norm decay              .997
 Maximum no. of boxes          100
 Loss function                 Sigmoid focal cross-entropy loss (α = .25, γ = 2.0)
Augmentation hyperparameters
 Image size, pixels            640 × 640
 Flipping                      Horizontal, vertical
 Random cropping               Aspect ratio ∈ [.75, 3]; proportion over the original image ∈ [.75, 1]; minimum overlap with any polyp = .0
      The detection systems are based on deep learning; these hyperparameters describe aspects of the training of the neural networks.
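For reference, the sigmoid focal cross-entropy loss named in the table, with the tabulated α = .25 and γ = 2.0, can be spelled out as follows (a minimal sketch; library implementations of focal loss also exist):

```python
# Per-anchor sigmoid focal cross-entropy loss: -alpha_t * (1 - p_t)^gamma * log(p_t).
import tensorflow as tf

def sigmoid_focal_loss(labels, logits, alpha=0.25, gamma=2.0):
    """labels in {0., 1.}; logits are the raw subnet outputs."""
    p = tf.sigmoid(logits)
    ce = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
    p_t = labels * p + (1 - labels) * (1 - p)              # prob of the true class
    alpha_t = labels * alpha + (1 - labels) * (1 - alpha)  # class balancing
    # (1 - p_t)^gamma down-weights easy, well-classified anchors, which
    # dominate when nearly all of the ~100,000 anchors per image are background.
    return alpha_t * tf.pow(1.0 - p_t, gamma) * ce
```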
Supplementary Table 3. Summary of the data
Summary of video data
                    Train        Validation
Videos              3611         1393
Hours               796          310
Frames              86 million   33 million
Unique patients     2487         1181
All training data: video + still images
                    Polyp        Nonpolyp
Still images        14,693       158,646
Frames from video   189,994      80 million
Total               204,687      80 million
No. of unique polyps
                    Train        Validation
Still images        6241         2724
Video               2230         956
Total               8471         3680
In the summary of video data, the gross statistics for the training set vs the validation set are reported. In all training data, statistics for both still-image data and video data are reported, broken down by polyp vs nonpolyp. These still images were taken by the endoscopist during the colonoscopy, generally when an event of interest, either a polyp or an anatomic landmark, was encountered. The number of unique polyps is reported for both the train and validation sets.
Supplementary Table 4. RetinaNet architecture: complete performance results
Sensitivity   False alarms per video   False alarms per minute   Threshold   Temporal window size
.8211         .2154                    .0223                     .6147       9
.8316         .2462                    .0254                     .6147       7
.8316         .2462                    .0254                     .6040       9
.8441         .2769                    .0286                     .5927       9
.8515         .2846                    .0294                     .5813       9
.8525         .3308                    .0342                     .5927       7
.8598         .3385                    .0350                     .5684       9
.8692         .3846                    .0397                     .5553       9
.8766         .4538                    .0469                     .5421       9
.8808         .5538                    .0572                     .5421       7
.8849         .5692                    .0588                     .5276       9
.8954         .6692                    .0692                     .5123       9
.8975         .7231                    .0747                     .5276       7
.9069         .8154                    .0843                     .5123       7
.9079         .9308                    .0962                     .5276       5
.9142         .9538                    .0986                     .4761       9
.9153         .9769                    .1009                     .4947       7
.9226         1.1538                   .1192                     .4761       7
.9289         1.1846                   .1224                     .4543       9
.9383         1.4462                   .1494                     .4306       9
.9446         1.8769                   .1939                     .4306       7
.9498         1.9923                   .2059                     .4021       9
.9550         2.5615                   .2647                     .4306       5
.9571         2.5769                   .2663                     .4021       7
.9613         2.7538                   .2846                     .3700       9
.9623         3.4308                   .3545                     .4021       5
.9634         3.5692                   .3688                     .3700       7
.9665         4.2000                   .4340                     .3299       9
.9676         4.9154                   .5079                     .3700       5
.9686         5.3692                   .5548                     .3299       7
.9707         7.3231                   .7567                     .3299       5
.9749         11.1615                  1.1534                    .3299       3
.9780         19.5692                  2.0222                    .3700       1
.9833         27.5538                  2.8472                    .3299       1
All points from the performance curve for the RetinaNet architecture, as shown in Figure 2A; Figure 2B is a sampling of this table. In addition to the sensitivity, false alarms per video, and false alarms per minute, we also report the detector threshold used, as well as the window size. With regard to the detector threshold, note that as the threshold goes down, the sensitivity increases, as does the number of false alarms, as one would expect. The window size corresponds to the parameter n in the temporal logic layer; higher n indicates a longer detection latency.
Supplementary Table 5. LSTM-SSD architecture: complete performance results
Sensitivity   False alarms per video   False alarms per minute   Threshold   Temporal window size
.8333         .1846                    .0191                     .9947       9
.8439         .1923                    .0199                     .9947       7
.8576         .2154                    .0223                     .9930       9
.8742         .2462                    .0254                     .9914       9
.8803         .2769                    .0286                     .9914       7
.8848         .3231                    .0334                     .9930       5
.8909         .3846                    .0397                     .9914       5
.8924         .4538                    .0469                     .9892       5
.8970         .4692                    .0485                     .9807       9
.9015         .5308                    .0548                     .9722       9
.9106         .6154                    .0636                     .9558       9
.9152         .7692                    .0795                     .9722       5
.9212         .7769                    .0803                     .9228       9
.9273         .9462                    .0978                     .9228       7
.9318         1.1231                   .1161                     .7828       9
.9348         1.2154                   .1256                     .9228       5
.9394         1.3462                   .1391                     .7828       7
.9409         1.4692                   .1518                     .8629       5
.9439         1.6231                   .1677                     .5602       9
.9485         1.9538                   .2019                     .5602       7
.9530         2.4517                   .2534                     .3067       9
.9606         2.8224                   .2917                     .3067       7
.9636         3.3991                   .3513                     .1638       9
.9667         3.4479                   .3563                     .3067       5
.9682         4.0202                   .4154                     .1638       7
.9712         4.5868                   .4740                     .3067       3
.9727         6.0615                   .6264                     .5602       1
.9758         6.4975                   .6714                     .1638       3
.9788         7.4056                   .7653                     .0901       5
.9803         8.5574                   .8843                     .3067       1
.9818         9.6126                   .9934                     .0901       3
.9848         12.0682                  1.2471                    .1638       1
.9864         17.1372                  1.7709                    .0901       1
All points from the performance curve for the LSTM-SSD architecture, as shown in Figure 2C, which presents a sampling of this table. In addition to the sensitivity, false alarms per video, and false alarms per minute, we also report the detector threshold used, as well as the window size. With regard to the detector threshold, note that as the threshold goes down, the sensitivity increases, as does the number of false alarms, as one would expect. The window size corresponds to the parameter n in the temporal logic layer; higher n indicates a longer detection latency.

      References

        • Guren M.G.
        The global challenge of colorectal cancer.
        Lancet Gastroenterol Hepatol. 2019; 4: 894-895
        • Ferlay J.
        • Ervik M.
        • Lam F.
        • et al.
        Global cancer observatory: cancer today.
International Agency for Research on Cancer, Lyon, France; 2020 (Available at:)
        http://gco.iarc.fr/today/home
        Date accessed: April 8, 2021
        • Kim A.
        An astounding 19 million colonoscopies are performed annually in The United States. iData Research.
        (Available at:)
        • Mathews S.C.
        • Zhao N.
        • Holub J.L.
        • et al.
        Improvement in colonoscopy quality metrics in clinical practice from 2000 to 2014.
        Gastrointest Endosc. 2019; 90: 651-655
        • Lam A.Y.
        • Li Y.
        • Gregory D.L.
        • et al.
        Association between improved adenoma detection rates and interval colorectal cancer rates after a quality improvement program.
        Gastrointest Endosc. 2020; 92: 355-364
        • Leufkens A.M.
        • van Oijen M.G.H.
        • Vleggaar F.P.
        • et al.
        Factors influencing the miss rate of polyps in a back-to-back colonoscopy study.
        Endoscopy. 2012; 44: 470-475
        • Anderson R.
        • Burr N.E.
        • Valori R.
        Causes of post-colonoscopy colorectal cancers based on World Endoscopy Organization system of analysis.
        Gastroenterology. 2020; 158: 1287-1299
        • Forbes N.
        • Boyne D.J.
        • Mazurek M.S.
        • et al.
        Association between endoscopist annual procedure volume and colonoscopy quality: systematic review and meta-analysis.
        Clin Gastroenterol Hepatol. 2020; 18: 2192-2208
        • Ren S.
        • He K.
        • Girshick R.
        • et al.
        Faster R-CNN: towards real-time object detection with region proposal networks.
        IEEE Trans Pattern Anal Machine Intell. 2017; 39: 1137-1149
        • Lin T.-Y.
        • Goyal P.
        • Girshick R.
        • et al.
        Focal loss for dense object detection.
        (Available at:)
        • Liu M.
        • Zhu M.
        Mobile video object detection with temporally-aware feature maps.
        (Available at:)
        • Wu H.
        • Chen Y.
        • Wang N.
        • et al.
        Sequence level semantics aggregation for video object detection.
        (Available at:)
        http://arxiv.org/abs/1907.06390
        Date accessed: January 13, 2021
        • Zhou X.
        • Wang D.
        • Krähenbühl P.
        Objects as points.
        (Available at:)
        http://arxiv.org/abs/1904.07850
        Date accessed: January 13, 2021
        • Tan M.
        • Pang R.
        • Le Q.V.
        EfficientDet: scalable and efficient object detection.
        (Available at:)
        http://arxiv.org/abs/1911.09070
        Date accessed: January 13, 2021
        • Byrne M.F.
        • Chapados N.
        • Soudan F.
        • et al.
        Real-time differentiation of adenomatous and hyperplastic diminutive colorectal polyps during analysis of unaltered videos of standard colonoscopy using a deep learning model.
        Gut. 2019; 68: 94-100
        • Byrne M.F.
        • Soudan F.
        • Henkel M.
        • et al.
        Real-time artificial intelligence “full colonoscopy workflow” for automatic detection followed by optical biopsy of colorectal polyps [abstract].
        Gastrointest Endosc. 2018; 87: AB475
        • Zhu X.
        • Nemoto D.
        • Wang Y.
        • et al.
        Detection and diagnosis of sessile serrated adenoma/polyps using convolutional neural network (artificial intelligence) [abstract].
        Gastrointest Endosc. 2018; 87: AB251
        • Sánchez-Montes C.
        • Bernal J.
        • García-Rodríguez A.
        • et al.
        Review of computational methods for the detection and classification of polyps in colonoscopy imaging.
        Gastroenterol Hepatol. 2020; 43: 222-232
        • Zhou D.
        • Tian F.
        • Tian X.
        • et al.
        Diagnostic evaluation of a deep learning model for optical diagnosis of colorectal cancer.
        Nat Commun. 2020; 11: 2961
        • Ma R.
        • Wang R.
        • Pizer S.
        • et al.
        Real-time 3D reconstruction of colonoscopic surfaces for determining missing regions.
in: Medical image computing and computer assisted intervention—MICCAI 2019. Springer, Cham, Germany; 2019: 573-582 (Available at:)
        • Turan M.
        • Ornek E.P.
        • Ibrahimli N.
        • et al.
        Unsupervised odometry and depth learning for endoscopic capsule robots.
        (Available at:)
        http://arxiv.org/abs/1803.01047
        Date accessed: January 13, 2021
        • Turan M.
        • Almalioglu Y.
        • Araujo H.
        • et al.
        Deep EndoVO: a recurrent convolutional neural network (RCNN) based visual odometry approach for endoscopic capsule robots.
        Neurocomputing. 2018; 275: 1861-1870
        • Rau A.
        • Edwards P.J.E.
        • Ahmad O.F.
        • et al.
        Implicit domain adaptation with conditional generative adversarial networks for depth prediction in endoscopy.
        Int J Comput Assist Radiol Surg. 2019; 14: 1167-1176
        • Chen R.J.
        • Bobrow T.L.
        • Athey T.
        • et al.
        SLAM Endoscopy enhanced by adversarial depth prediction.
        (Available at:)
        http://arxiv.org/abs/1907.00283
        Date accessed: January 13, 2021
        • Freedman D.
        • Blau Y.
        • Katzir L.
        • et al.
        Detecting deficient coverage in colonoscopies.
        IEEE Trans Med Imag. 2020; 39: 3451-3462
        • Bernal J.
        • Tajkbaksh N.
        • Sanchez F.J.
        • et al.
        Comparative validation of polyp detection methods in video colonoscopy: results from the MICCAI 2015 endoscopic vision challenge.
        IEEE Trans Med Imag. 2017; 36: 1231-1249
        • Mohammed A.
        • Yildirim S.
        • Farup I.
        • et al.
        Y-Net: a deep convolutional neural network for polyp detection.
        (Available at:)
        http://arxiv.org/abs/1806.01907
        Date accessed: January 13, 2021
        • Brandao P.
        • Mazomenos E.
        • Ciuti G.
        • et al.
        Fully convolutional neural networks for polyp segmentation in colonoscopy.
        (Available at:)
        • Wichakam I.
        • Panboonyuen T.
        • Udomcharoenchaikit C.
        • et al.
        Real-time polyps segmentation for colonoscopy video frames using compressed fully convolutional network.
in: Schoeffmann K. Chalidabhongse T.H. Ngo C.W. MultiMedia modeling. Springer International Publishing, Cham; 2018: 393-404
        • Angermann Q.
        • Bernal J.J.
        • Sánchez-Montes C.
        • et al.
Real-time polyp detection in colonoscopy videos: adapting still frame-based methodologies for video sequences analysis.
in: Cardoso M.J. Arbel T. Luo X. Computer assisted and robotic endoscopy and clinical image-based procedures. Springer International Publishing, Cham; 2017: 29-41 (Intl J Comput-Assist Radiol Surg 2017)
        • Bernal J.
        • Sánchez J.
        • Vilariño F.
        Towards automatic polyp detection with a polyp appearance model.
        Pattern Recogn. 2012; 45: 3166-3182
        • Wang P.
        • Xiao X.
        • Glissen Brown J.R.
        • et al.
        Development and validation of a deep-learning algorithm for the detection of polyps during colonoscopy.
        Nat Biomed Eng. 2018; 2: 741-748
        • Badrinarayanan V.
        • Kendall A.
        • Cipolla R.
        SegNet: a deep convolutional encoder-decoder architecture for image segmentation.
        IEEE Trans Pattern Anal Machine Intell. 2017; 39: 2481-2495
        • Wang P.
        • Berzin T.M.
        • Glissen Brown J.R.
        • et al.
        Real-time automatic detection system increases colonoscopic polyp and adenoma detection rates: a prospective randomised controlled study.
        Gut. 2019; 68: 1813-1819
        • Zhou G.
        • Liu X.
        • Berzin T.M.
        • et al.
        A real-time automatic deep learning polyp detection system increases polyp and adenoma detection during colonoscopy: a prospective double-blind randomized study.
        Gastroenterology. 2019; 156 (S-1511)
        • Wang P.
        • Liu X.
        • Berzin T.M.
        • et al.
        Effect of a deep-learning computer-aided detection system on adenoma detection during colonoscopy (CADe-DB trial): a double-blind randomised study.
        Lancet Gastroenterol Hepatol. 2020; 5: 343-351
        • Urban G.
        • Tripathi P.
        • Alkayali T.
        • et al.
        Deep learning localizes and identifies polyps in real time with 96% accuracy in screening colonoscopy.
        Gastroenterology. 2018; 155: 1069-1078
        • Hassan C.
        • Wallace M.B.
        • Sharma P.
        • et al.
        New artificial intelligence system: first validation study versus experienced endoscopists for colorectal polyp detection.
        Gut. 2020; 69: 799-800
        • Wikipedia
        Q-function.
        (Available at:)
        • Hassan C.
        • Badalamenti M.
        • Maselli R.
        • et al.
        Computer-aided detection-assisted colonoscopy: classification and relevance of false positives.
        Gastrointest Endosc. 2020; 92: 900-904
        • Kudo S.-E.
        • Mori Y.
        • Misawa M.
        • et al.
        Artificial intelligence and colonoscopy: current status and future perspectives.
        Dig Endosc. 2019; 31: 363-371
        • Participants in the Paris Workshop
        The Paris endoscopic classification of superficial neoplastic lesions: esophagus, stomach, and colon: November 30 to December 1, 2002.
        Gastrointest Endosc. 2003; 58: S3-S43
        • Endoscopic Classification Review Group
        Update on the Paris classification of superficial neoplastic lesions in the digestive tract.
        Endoscopy. 2005; 37: 570-578
        • Le Berre C.
        • Sandborn W.J.
        • Aridhi S.
        • et al.
        Application of artificial intelligence to gastroenterology and hepatology.
        Gastroenterology. 2020; 158: 76-94
        • Ahmad O.F.
        • Stoyanov D.
        • Lovat L.B.
        Barriers and pitfalls for artificial intelligence in gastroenterology: ethical and regulatory issues.
        Techn Innov Gastrointest Endosc. 2020; 22: 80-84
        • Misawa M.
        • Kudo S.-E.
        • Mori Y.
        • et al.
        Artificial intelligence-assisted polyp detection for colonoscopy: initial experience.
        Gastroenterology. 2018; 154: 2027-2029
        • Wadhwa V.
        • Alagappan M.
        • Gonzalez A.
        • et al.
        Physician sentiment toward artificial intelligence (AI) in colonoscopic practice: a survey of US gastroenterologists.
        Endosc Int Open. 2020; 8: E1379-E1384
        • McNeil M.B.
        • Gross S.A.
        Siri here, cecum reached, but please wash that fold: Will artificial intelligence improve gastroenterology?.
        Gastrointest Endosc. 2020; 91: 425-427