Abstract
This paper studies using Vision Transformers (ViT) in class incremental
learning. Surprisingly, naive application of ViT to replace convolutional
neural networks (CNNs) results in performance degradation. Our analysis reveals
three issues with naively using ViT: (a) ViT converges very slowly when the
number of classes is small, (b) ViT shows more bias towards new classes than
CNN-based models, and (c) the proper learning rate of ViT is too low to learn a
good classifier. Based on this analysis, we show that these issues can be
addressed simply with existing techniques: a convolutional stem, balanced
finetuning to correct bias, and a higher learning rate for the classifier. Our
simple solution, named ViTIL (ViT for Incremental Learning), achieves a new
state of the art on all three class incremental learning setups by a clear
margin, providing a strong baseline for the research community. For instance,
on ImageNet-1000, our ViTIL achieves 69.20% top-1 accuracy for the protocol of
500 initial classes with 5 incremental steps (100 new classes for each),
outperforming LUCIR+DDE by 1.69%. For the more challenging protocol of 10
incremental steps (100 new classes), our method outperforms PODNet by 7.27%
(65.13% vs. 57.86%).
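
To make the first protocol concrete, the following minimal sketch shows how 1000 classes could be partitioned into one initial phase of 500 classes followed by 5 incremental steps of 100 new classes each. The function name `class_incremental_splits` and the use of contiguous class indices are assumptions made for illustration only; they are not taken from the ViTIL implementation, which may order or sample classes differently.

```python
def class_incremental_splits(num_classes, initial_classes, num_steps):
    """Partition class indices into an initial phase plus equal-sized steps.

    Hypothetical helper for illustration; contiguous indices are an assumption.
    """
    remaining = num_classes - initial_classes
    if num_steps <= 0 or remaining < 0 or remaining % num_steps != 0:
        raise ValueError("remaining classes must divide evenly across steps")
    per_step = remaining // num_steps

    # Phase 0 holds the initial classes; each later phase adds per_step classes.
    splits = [range(0, initial_classes)]
    start = initial_classes
    for _ in range(num_steps):
        splits.append(range(start, start + per_step))
        start += per_step
    return splits


# ImageNet-1000 protocol from the abstract: 500 initial classes, 5 steps.
splits = class_incremental_splits(1000, 500, 5)
for phase, classes in enumerate(splits):
    print(f"phase {phase}: classes [{classes.start}, {classes.stop}), "
          f"{len(classes)} new classes")
```

After the last phase the model has seen all 1000 classes, and the reported top-1 accuracy is measured under this incremental schedule rather than under joint training on all classes at once.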