Application of optical character recognition with natural language processing for large-scale quality metric data extraction in colonoscopy reports

Published: September 02, 2020


      Background and Aims

      Colonoscopy is commonly performed for colorectal cancer screening in the United States. Reports are often generated in a non-standardized format and are not always integrated into electronic health records. Thus, this information is not readily available for streamlining quality management, participating in endoscopy registries, or reporting of patient- and center-specific risk factors predictive of outcomes. We aim to demonstrate the use of a new hybrid approach that applies natural language processing to charts first converted to machine-readable text by optical character recognition (OCR/NLP hybrid) to obtain relevant clinical information from scanned colonoscopy and pathology reports, a technology co-developed by Cleveland Clinic and eHealth Technologies (West Henrietta, NY, USA).


      Methods

      This was a retrospective study conducted at Cleveland Clinic, Cleveland, Ohio, and the University of Minnesota, Minneapolis, Minnesota. A randomly sampled list of outpatient screening colonoscopy procedures and pathology reports was selected. Desired variables were then collected. Two researchers first manually reviewed the reports for the desired variables. Then, the OCR/NLP algorithm was used to obtain the same variables from 3 electronic health records in use at our institution: Epic (Verona, Wisc, USA), ProVation (Minneapolis, Minn, USA) used for endoscopy reporting, and Sunquest PowerPath (Tucson, Ariz, USA) used for pathology reporting.
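As a rough illustration of what a rule-based extraction step applied to OCR output might look like, the sketch below flags a few variables in free-text report sentences. It is not the algorithm used in this study, whose internals are not described here; the function name `extract_findings`, the keyword patterns, the simple negation rule, and the sample sentences are all hypothetical, and the input is assumed to be plain text already produced by an OCR step.

```python
import re

# Hypothetical keyword patterns, for illustration only.
PATTERNS = {
    "polyp": re.compile(r"\bpolyps?\b", re.IGNORECASE),
    "adenoma": re.compile(r"\badenomas?\b", re.IGNORECASE),
    "inadequate_prep": re.compile(
        r"\b(?:inadequate|poor)\s+(?:bowel\s+)?prep(?:aration)?\b", re.IGNORECASE
    ),
}

# Very simple negation cue: a negating word earlier in the same sentence.
NEGATION = re.compile(r"\b(?:no|without|negative for)\b[^.]*$", re.IGNORECASE)


def extract_findings(report_text):
    """Return {variable: bool}, ignoring mentions preceded by a negation cue."""
    findings = {name: False for name in PATTERNS}
    # Split into sentences so a negation only affects its own sentence.
    for sentence in re.split(r"(?<=[.!?])\s+", report_text):
        for name, pattern in PATTERNS.items():
            for match in pattern.finditer(sentence):
                preceding = sentence[: match.start()]
                if not NEGATION.search(preceding):
                    findings[name] = True
    return findings


if __name__ == "__main__":
    sample = (
        "A 5 mm polyp was removed from the sigmoid colon. "
        "Pathology: tubular adenoma. Bowel preparation was adequate."
    )
    print(extract_findings(sample))
```

Real clinical text would need far more careful handling of negation, abbreviations, and OCR misreads than this keyword approach provides.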


      Results

      Compared with manual data extraction, the accuracy of the hybrid OCR/NLP approach to detect polyps was 95.8%, adenomas 98.5%, sessile serrated polyps 99.3%, advanced adenomas 98%, inadequate bowel preparation 98.4%, and failed cecal intubation 99%. Comparison of the dataset collected via NLP alone with that collected using the hybrid OCR/NLP approach showed that the accuracy for almost all variables was >99%.
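The abstract does not spell out how accuracy was calculated. Assuming it means the proportion of reports in which the automated label agrees with the manual review (treated as the reference standard), a per-variable calculation could be sketched as follows. The function name `accuracy` and the example labels are hypothetical and are not study data.

```python
def accuracy(manual, automated):
    """Fraction of reports where the automated label equals the manual label."""
    if len(manual) != len(automated):
        raise ValueError("label lists must have the same length")
    if not manual:
        raise ValueError("at least one report is required")
    agreements = sum(m == a for m, a in zip(manual, automated))
    return agreements / len(manual)


if __name__ == "__main__":
    # Made-up labels for one variable (e.g. "polyp detected") across 4 reports.
    manual_labels = [True, False, True, True]
    automated_labels = [True, False, False, True]
    print(f"{accuracy(manual_labels, automated_labels):.1%}")
```

Under this definition the calculation would be repeated once per variable (polyps, adenomas, and so on), each against the same manually reviewed reports.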


      Conclusions

      Our study is the first to validate the use of a unique hybrid OCR/NLP technology to extract desired variables from scanned procedure and pathology reports contained in image format with an accuracy >95%.


      Abbreviations: ACG (American College of Gastroenterology), ADR (adenoma detection rate), ASGE (American Society for Gastrointestinal Endoscopy), CRC (colorectal cancer), EHR (electronic health record), NLP (natural language processing), OCR (optical character recognition), PPV (positive predictive value), SQL (Structured Query Language), SSP (sessile serrated polyp)



      Linked Article

      • Will machines decipher colonoscopy quality from endoscopists’ notes?
        Gastrointestinal Endoscopy, Vol. 93, Issue 3
        • Preview
          Colonoscopy has been shown to reduce incidence and mortality of colorectal cancer; however, its effectiveness is highly dependent on the quality.1-3 Therefore, it is widely recognized that quality assessment, assurance, and improvement tools in colonoscopy are essential to ensure its effectiveness. There are clearly defined and measurable quality indicators such as the endoscopist’s adenoma detection rate (ADR), rate of adequate bowel preparation, cecal intubation rate, and mean withdrawal time, which have been proved to be associated with important outcomes for patients.