Photo by Charles Deluvio on Unsplash

In this post we cover the high-level concepts behind using Transformers for vision tasks, known as the Vision Transformer (ViT). We follow the contours of the ICLR 2021 paper by Google Brain, “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”. First we introduce ViT at a high level. Then we do a quick recap of Transformers in general. Finally, we look at some implementation-level details of ViT.

This post is divided into three parts:

  1. Introduction
  2. Background of Transformers
  3. Vision Transformer

1. Introduction

Nimish Sanghi


