Abstract:
We demonstrate that a pre-trained, frozen Vision Transformer paired with a linear classifier achieves highly competitive performance on endoscopic image classification. Our central contribution is a systematic, layer-wise analysis that identifies where the most discriminative features arise, challenging the common heuristic of probing only the final layer. We uncover a distinct "peak-before-the-end" phenomenon: a late-intermediate layer yields a more generalizable representation for the downstream medical task. On the Kvasir and HyperKvasir benchmarks, this parameter-light approach achieves strong accuracy while drastically reducing computational overhead. The work provides a practical roadmap for efficiently leveraging general-purpose foundation models in clinical environments.
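The layer-wise probing described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: a tiny randomly initialized transformer stack and synthetic token inputs stand in for the pre-trained ViT backbone and endoscopic images, and the layer index, pooling choice, and class count (8, as in Kvasir) are assumptions for the example.

```python
# Sketch: layer-wise linear probing on a frozen transformer encoder.
# The tiny encoder and synthetic data are placeholders (assumptions)
# standing in for a pre-trained ViT and endoscopic images.
import torch
import torch.nn as nn

torch.manual_seed(0)

depth, dim = 6, 32
encoder = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
    for _ in range(depth)
)
for p in encoder.parameters():
    p.requires_grad = False  # frozen backbone: no gradient updates

def layerwise_features(tokens):
    """Return one mean-pooled representation after every block."""
    feats, x = [], tokens
    with torch.no_grad():
        for block in encoder:
            x = block(x)
            feats.append(x.mean(dim=1))  # pool over the token axis
    return feats  # list of (batch, dim) tensors, one per layer

# Synthetic "images" already tokenised: (batch, tokens, dim).
x = torch.randn(16, 8, dim)
per_layer = layerwise_features(x)

# A separate linear probe would be fit on each layer's features and the
# layer with the best validation accuracy selected, rather than
# defaulting to the final layer.
probe = nn.Linear(dim, 8)  # e.g. 8 Kvasir classes
logits = probe(per_layer[depth - 2])  # a late-intermediate layer
print(len(per_layer), tuple(logits.shape))
```

In practice the frozen backbone is run once over the dataset, the per-layer features are cached, and only the small linear heads are trained, which is where the computational savings over full fine-tuning come from.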