1969: The first colonoscopy is performed, and humanity collectively agrees this is nobody's favorite Tuesday activity.
2014: Deep learning algorithms start spotting polyps in endoscopic images, because apparently training a neural network on thousands of colon photos is a perfectly normal career choice.
2022: ChatGPT arrives, and everyone wonders if it can write poetry, pass the bar exam, and maybe - just maybe - spot a suspicious lump during your next colonoscopy.
2026: Researchers from Italy, Japan, Norway, and a half-dozen other countries publish a study in Gut putting GPT-4o and Gemini 1.5 Pro to the ultimate test: finding colorectal polyps in colonoscopy videos. The results? Let's just say the chatbots showed up to the exam, but maybe didn't study as hard as the specialist.
The Polyp Problem Nobody Talks About at Dinner
Here's a statistic that should make you uncomfortable: colonoscopies miss between 6% and 27% of polyps (Leufkens et al., 2012). That's not a rounding error. That's a "we really need a second pair of eyes" situation. And since colorectal cancer is the second leading cause of cancer death worldwide, those missed polyps aren't just an academic concern - they're the kind of thing that ruins lives.
Enter computer-aided detection, or CADe. These purpose-built AI systems have been trained on massive datasets of labeled colonoscopy images, and they're genuinely good at their job. Studies show CADe systems significantly boost adenoma detection rates, the single most important quality metric in colonoscopy (Corley et al., 2014). Every 1% increase in adenoma detection rate correlates with a 3% decrease in colorectal cancer risk. That's not nothing.
So, Can ChatGPT Do Colonoscopy Now?
This is where Carlini and colleagues decided to ask a question that sounds like a dare: what if you just... showed colonoscopy videos to a large language model and asked it to find polyps? No specialized training. No curated medical image datasets. Just the same AI that helps people write wedding speeches and debug Python code.
They used the SUN colonoscopy video database - 100 polyps across 99 patients, plus 13 polyp-free videos - and compared three contestants: a commercial CADe system (the purpose-built specialist), OpenAI's GPT-4o (the overachieving generalist), and Google's Gemini 1.5 Pro (the other overachieving generalist).
The Scoreboard
The results read like a sports league table where the professional athlete predictably outperforms the enthusiastic amateurs:
- CADe system: 99% sensitivity, 100% specificity, 99.1% accuracy. The boring, reliable workhorse.
- GPT-4o: 87% sensitivity, 100% specificity, 88.5% accuracy. Respectable, honestly. Missed 13 polyps but never cried wolf on a clean colon.
- Gemini 1.5 Pro: 68% sensitivity, 92.3% specificity, 70.8% accuracy. Showed up, tried its best, got a participation trophy.
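The scoreboard falls straight out of the study's raw counts: 100 polyp videos and 13 polyp-free videos, 113 cases total. A quick sketch (counts inferred from the percentages reported above) of how the sensitivity, specificity, and accuracy figures are derived:

```python
def metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # share of polyps detected
    specificity = tn / (tn + fp)            # share of clean videos correctly cleared
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

# CADe: 99/100 polyps found, 13/13 clean videos cleared
print(metrics(99, 1, 13, 0))    # (0.99, 1.0, 0.991...)
# GPT-4o: 87/100 polyps found, 13/13 clean videos cleared
print(metrics(87, 13, 13, 0))   # (0.87, 1.0, 0.885)
# Gemini 1.5 Pro: 68/100 polyps found, 12/13 clean videos cleared
print(metrics(68, 32, 12, 1))   # (0.68, 0.923..., 0.708)
```

Note how the 13 clean videos carry outsized weight in specificity: a single false positive, as with Gemini, costs nearly 8 percentage points.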
For context, GPT-4o achieving 87% sensitivity with zero false positives on a task it was never specifically trained for is roughly equivalent to your Labrador retriever correctly sorting your mail 87% of the time. Impressive? Absolutely. Ready to replace the postal worker? Not quite.
Why This Actually Matters
The tempting takeaway is "LLMs aren't as good as dedicated systems, case closed." But that misses the point entirely. These models weren't trained on a single colonoscopy image. They absorbed the internet's collective medical knowledge through text and general-purpose vision training, and then detected polyps at rates that would have been considered cutting-edge for purpose-built systems just a few years ago.
The authors note that these results are "comparable with the much more mature and labor-intensive deep learning algorithms," which is academic-speak for "this is weirdly impressive for something we didn't even train." A recent study comparing vision language models against traditional CNNs found that top-performing VLMs like GPT-4.1 achieved F1 scores matching ResNet50 for polyp detection (Bilal et al., 2025). The gap is closing.
The real promise isn't replacement - it's accessibility. Purpose-built CADe systems require expensive hardware, regulatory approval, and institutional procurement. LLMs are already everywhere. In resource-limited settings where a dedicated CADe system is a luxury, an LLM-based screening assistant could be the difference between catching a precancerous polyp and missing it entirely.
The Fine Print
Before anyone gets too excited: LLMs hallucinate. They're inconsistent. They can be misled by on-image text and annotations. Gemini falsely flagged one clean colon as having a polyp, which in clinical terms is the kind of mistake that leads to unnecessary procedures and anxious patients. And 87% sensitivity, while remarkable for a general-purpose model, means 13 out of 100 polyps walked right past the bouncer unnoticed.
There's also the uncomfortable reality that these models process frames at roughly one per second, whereas dedicated CADe systems analyze video in real time. Asking GPT-4o to provide live colonoscopy assistance today would be like asking a book reviewer to do play-by-play sports commentary. Different skill set, different tempo.
What Comes Next
The trajectory here is clear, even if the destination isn't. Multimodal LLMs are getting better at medical imaging tasks at a rate that should make anyone in the CADe business pay attention. The question isn't whether LLMs will become clinically useful for endoscopic analysis - it's when, and in what configuration. Perhaps as a second reader, a triage tool, or a training assistant for endoscopists learning what to look for.
For now, your colonoscopist's trained eyes and their dedicated AI sidekick remain the gold standard. But somewhere in a server farm, a language model that also writes haikus and explains quantum physics is getting quietly better at spotting the tiny cellular rebels that could ruin your day.
And honestly, that's a timeline worth paying attention to.
References:
- Carlini L, Massimi D, Mori Y, et al. Large language models for detecting colorectal polyps in endoscopic images. Gut. 2026;75(5):854-856. DOI: 10.1136/gutjnl-2025-335091
- Corley DA, Jensen CD, Marks AR, et al. Adenoma detection rate and risk of colorectal cancer and death. N Engl J Med. 2014;370(14):1298-1306. DOI: 10.1056/NEJMoa1309086
- Spadaccini M, Nastro RA, Massimi D, et al. Accuracy of multi-modal large language models for endoscopic detection of colorectal neoplasia. Endosc Int Open. 2025;13:a25318169. DOI: 10.1055/s-0045-1805508
- Bilal M, et al. Vision language models versus machine learning models performance on polyp detection and classification in colonoscopy images. Sci Rep. 2025;15:8933. DOI: 10.1038/s41598-025-29566-2
- Leufkens AM, van Oijen MG, Vleggaar FP, Siersema PD. Factors influencing the miss rate of polyps in a back-to-back colonoscopy study. Endoscopy. 2012;44(5):470-475. DOI: 10.1055/s-0031-1291666