Transformers have played a crucial role in natural language processing over the last decade. Their success is attributed mainly to their ability to extract and exploit contextual information from sequences.
When a method works well in one domain, it is natural to expect studies that try to carry it over to others. This was the case with transformers as well, and the domain was computer vision. Introducing transformers to vision tasks was a huge success, spawning numerous follow-up studies.
The vision transformer (ViT) was proposed in 2020 and outperformed its convolutional neural network (CNN) counterparts on image classification tasks. Its benefits are most pronounced at large scale, since transformers lack the built-in inductive biases of CNNs and therefore require more data or stronger regularisation to compete.
ViT inspired many researchers to dive deeper into the rabbit hole of transformers and see how far they could go on different tasks. Most focused on image-related tasks and obtained promising results. However, applying ViTs to the video domain remained, more or less, an open problem.
When you think about it, transformers, or more precisely attention-based architectures, look like the perfect fit for video. They are the intuitive choice for modeling dependencies in natural language and extracting contextual relationships between words. A video exhibits the same kind of sequential, contextual structure, so why not use a transformer to process videos? This is the question the authors of ViViT asked, and they came up with an answer.
Most state-of-the-art video solutions use 3D convolutional networks, but their complexity makes it challenging to achieve good performance on commodity devices. Some studies have therefore injected the self-attention mechanism of transformers into 3D CNNs to better capture long-range dependencies within the video.
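To make the general recipe concrete, below is a minimal, hypothetical PyTorch sketch of a pure transformer over video: carve a clip into spatio-temporal patches, treat each patch as a token, and run a standard transformer encoder over the resulting sequence. This is an illustration of the concept, not ViViT's actual architecture; the class name, layer sizes, and patch shape are assumptions chosen for brevity.

```python
# Illustrative sketch only (not ViViT's exact design): a video classifier that
# tokenizes a clip into spatio-temporal patches and applies self-attention.
import torch
import torch.nn as nn

class ToyVideoTransformer(nn.Module):
    def __init__(self, num_classes=400, dim=256, depth=4, heads=8,
                 patch=(2, 16, 16), frames=16, size=224, channels=3):
        super().__init__()
        t, h, w = patch
        num_tokens = (frames // t) * (size // h) * (size // w)
        # A 3D convolution with stride == kernel size flattens each
        # non-overlapping spatio-temporal patch into a single token.
        self.to_tokens = nn.Conv3d(channels, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))            # class token
        self.pos = nn.Parameter(torch.zeros(1, num_tokens + 1, dim))  # learned positions
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video):                      # video: (B, C, T, H, W)
        x = self.to_tokens(video)                  # (B, dim, T', H', W')
        x = x.flatten(2).transpose(1, 2)           # (B, N, dim) token sequence
        cls = self.cls.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos  # prepend class token, add positions
        x = self.encoder(x)                        # joint space-time self-attention
        return self.head(x[:, 0])                  # classify from the class token

# Usage: a batch of two 16-frame RGB clips at 224x224 resolution.
model = ToyVideoTransformer()
logits = model(torch.randn(2, 3, 16, 224, 224))    # -> shape (2, 400)
```

Note that joint attention over every space-time token is the simplest variant, and its cost grows quadratically with the number of tokens, which is exactly why more efficient attention designs become attractive for longer clips.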
Continue reading… “Google Research Proposes an Artificial Intelligence (AI) Model to Utilize Vision Transformers on Videos”
