
ICCV 2023 Survival Guide

10 Computer Vision Papers You Won’t Want to Miss

The annual IEEE/CVF International Conference on Computer Vision (ICCV) is just a week away! The Conference, which is set to take place on October 4th-6th in Paris, France, is shaping up to be something special. 

Ah, Paris! The city of love, croissants, and—perhaps romantic strolls along the Seine with fervent debates on transformer models? While some folks are trying to capture the perfect Eiffel Tower selfie, we’re here capturing multi-dimensional arrays of pixel intensity values. Talk about capturing the moment, n’est-ce pas?

In the land where Impressionism was born, it’s only fitting that we gather to discuss how computers perceive the world. And with more than 1,000 papers being presented, there’s a lot to discuss! 

But have no fear! We’ve done the heavy lifting (and scrolling, and skimming) for you. We’ve combed through every paper to bring you the crème de la crème of ICCV 2023 so you can focus on what really matters, whether that’s finding the right session to spark your next big idea or sipping on a fine Bordeaux as you ponder the intricacies of object detection. 

Without further ado, here are our top ten picks in alphabetical order:

  1. DEVA: Tracking Anything with Decoupled Video Segmentation
  2. Effective Whole-body Pose Estimation with Two-stages Distillation
  3. EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction
  4. FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization
  5. LightGlue: Local Feature Matching at Light Speed
  6. ProPainter: Improving Propagation and Transformer for Video Inpainting
  7. Segment Anything
  8. Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models
  9. Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators
  10. ViperGPT: Visual Inference via Python Execution for Reasoning

DEVA: Tracking Anything with Decoupled Video Segmentation

Original source: DEVA Paper

  • Links: (Arxiv | Code | Project Page)
  • Authors: Ho Kei Cheng, Seoung Wug Oh, Brian Price, Alexander Schwing, Joon-Young Lee

TL;DR: DEVA extends Segment Anything (see below!) to video for open-world video segmentation

Traditionally, training video segmentation models has been a time-intensive and costly undertaking. In large part, this owes to the fact that end-to-end video segmentation models require densely annotated video data, which means segmentation masks for each frame. While this has not proved prohibitive for specific video segmentation tasks, it has meant that equivalent efforts need to be made for each desired task. In other words, the time spent densely annotating a dataset for video object segmentation would not translate to a video panoptic segmentation task, which would require its own annotation.

Decoupled video segmentation flips the script by splitting the problem into two parts: first, a task-specific image-level segmentation step, and second, a task-agnostic bi-directional temporal propagation step. In plain English, this means you take an image segmentation model and apply it to the frames of your video. The “hypothesis” segmentation masks from different frames are then fused into a temporally coherent result. The temporal propagation module only needs to be trained once, and can then be used in conjunction with any image segmentation model — even open-vocabulary models like Segment Anything!
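To make the decoupling concrete, here is a minimal sketch of that recipe in Python. The `image_segmenter` and `propagator` interfaces are illustrative stand-ins, not DEVA’s actual API; the real model fuses masks via in-clip consensus and a learned memory-based propagation module.

```python
import numpy as np

# Decoupled recipe: any per-frame segmenter + a single task-agnostic propagator.
def segment_video(frames, image_segmenter, propagator):
    carried, results = None, []
    for frame in frames:
        hypotheses = image_segmenter(frame)              # task-specific, per-frame masks
        fused = hypotheses if carried is None else propagator.fuse(carried, hypotheses)
        carried = propagator.step(frame, fused)          # carry masks forward in time
        results.append(fused)
    return results

class NaivePropagator:
    """Stand-in: real propagation modules merge masks by overlap and warp them with a learned memory."""
    def fuse(self, carried, hypotheses):                 # merge propagated and new masks
        return np.maximum(carried, hypotheses)
    def step(self, frame, masks):                        # here, just pass masks along unchanged
        return masks

frames = [np.zeros((64, 64), dtype=np.uint8) for _ in range(4)]
per_frame_segmenter = lambda frame: (frame > 0).astype(np.uint8)
masks = segment_video(frames, per_frame_segmenter, NaivePropagator())
print(len(masks), masks[0].shape)                        # 4 (64, 64)
```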

Effective Whole-body Pose Estimation with Two-stages Distillation

Original source: DWPose Paper

  • Links: (Arxiv | Code)
  • Authors: Zhendong Yang, Ailing Zeng, Chun Yuan, Yu Li

TL;DR: State of the art 2D Human Pose Estimation

Pose estimation is the task of identifying, localizing, and orienting various key points on a person’s body. The “whole body” variant of the task involves simultaneous detection and localization of keypoints from head to toe, and is a crucial component in many downstream computer vision applications, from synthetic motion generation to tracking humans in virtual or augmented reality environments.

While very accurate whole-body pose estimation models like RTMPose already exist, one of the current challenges in the space is achieving similar levels of accuracy at much lower latency. In this paper, the authors present DWPose, a two-stage knowledge distillation approach for whole-body pose estimation. First, the student model learns from a larger teacher model. Then the student’s backbone is frozen and only its head is updated: the model learns from itself! All told, DWPose achieves state-of-the-art performance on the COCO-WholeBody benchmark.
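Here is a toy sketch of that two-stage recipe in PyTorch. The tiny networks and MSE losses are stand-ins for illustration only (DWPose distills RTMPose-style pose models with its own losses), but the flow (teacher-to-student distillation, then freezing the backbone and training a fresh head on the model’s own predictions) mirrors the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stage 1: the student (small backbone + head) mimics a frozen teacher's predictions.
teacher = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 34))  # 17 keypoints * (x, y)
backbone = nn.Sequential(nn.Linear(128, 64), nn.ReLU())
head = nn.Linear(64, 34)

x = torch.randn(32, 128)                                  # stand-in batch of features
opt = torch.optim.Adam([*backbone.parameters(), *head.parameters()], lr=1e-3)
with torch.no_grad():
    teacher_target = teacher(x)
loss = F.mse_loss(head(backbone(x)), teacher_target)
opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: freeze the backbone and train a freshly initialized head against the
# stage-1 model's own predictions, so the student "learns from itself".
new_head = nn.Linear(64, 34)
for p in backbone.parameters():
    p.requires_grad = False
opt_head = torch.optim.Adam(new_head.parameters(), lr=1e-3)
with torch.no_grad():
    feats = backbone(x)
    self_target = head(feats)
loss = F.mse_loss(new_head(feats), self_target)
opt_head.zero_grad(); loss.backward(); opt_head.step()
```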

EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction

Original source: EfficientViT Paper

  • Links: (Arxiv | Code)
  • Authors: Xinyu Liu, Houwen Peng, Ningxin Zheng, Yuqing Yang, Han Hu, Yixuan Yuan

TL;DR: Improved memory efficiency and reduced redundancy 👉 blazing fast vision transformers

Transformer models are all the rage these days, and the Vision Transformer (ViT) has taken over the top spot in many areas of computer vision. One of the downsides of the standard vision transformer, however, is the heavy computational cost incurred during inference. For the most part, this has thus far prevented their adoption in real-time computer vision applications. That may be about to change.

With EfficientViT, the researchers from The Chinese University of Hong Kong and Microsoft Research make two welcome improvements to the ViT architecture. First, they reduce communication overhead between feature channels by replacing some memory-bound self-attention layers with far more memory-efficient feed-forward network (FFN) layers. Second, with a new Cascaded Group Attention (CGA) module, they minimize redundant computation across attention heads. Combined with some clever reallocation of network parameters, these changes take ViTs into new territory. As just one example, EfficientViT-L0 for Segment Anything can process 1,000+ images per second on an A100 GPU, more than 3x the throughput of MobileSAM, without sacrificing mean Intersection over Union (mIoU).
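As a rough illustration of the cascaded-head idea, here is a simplified PyTorch sketch of group attention with a cascade between heads. It is not the authors’ implementation; it only shows the core pattern of splitting channels across heads and feeding each head’s output into the next so the heads stop recomputing the same thing.

```python
import torch
import torch.nn as nn

class CascadedGroupAttention(nn.Module):
    """Simplified sketch: each head attends over its own channel group,
    and its output is added to the next head's input (the 'cascade')."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # one qkv projection per head, acting only on that head's channel slice
        self.qkv = nn.ModuleList(
            nn.Linear(self.head_dim, 3 * self.head_dim) for _ in range(num_heads)
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                  # x: (batch, tokens, dim)
        chunks = x.chunk(self.num_heads, dim=-1)
        outputs, carry = [], 0
        for head, chunk in zip(self.qkv, chunks):
            inp = chunk + carry                            # cascade: reuse the previous head's output
            q, k, v = head(inp).chunk(3, dim=-1)
            attn = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
            out = attn.softmax(dim=-1) @ v
            outputs.append(out)
            carry = out
        return self.proj(torch.cat(outputs, dim=-1))

x = torch.randn(2, 196, 256)                               # e.g. 14x14 tokens, 256 channels
print(CascadedGroupAttention(256)(x).shape)                # torch.Size([2, 196, 256])
```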

FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization

Original source: FastViT Paper

  • Links: (Arxiv | Code)
  • Authors: Pavan Kumar Anasosalu Vasu, James Gabriel, Jeff Zhu, Oncel Tuzel, Anurag Ranjan

TL;DR: Removing skip connections with structural reparameterization in a hybrid transformer model 👉 Robust models with a state-of-the-art tradeoff between accuracy and latency

With FastViT, Apple researchers set out to solve a problem similar to EfficientViT’s: the high computational cost of vision transformers. Their approach, however, combines the strengths of the up-and-coming vision transformers with those of the convolutional neural networks of old to form a fast hybrid transformer.

In designing FastViT, the team combined three central principles. First, they remove skip connections, whose high memory access cost contributes significantly to inference latency, by folding them into convolutions via structural reparameterization. Second, they factorize dense convolutions, reducing the number of parameters, and recover representational “capacity” using a technique called train-time overparameterization. Finally, they replace some of the computationally intensive self-attention layers at early stages with large convolutional kernels. Altogether, these design decisions result in an architecture that outperforms competitive models on both accuracy and latency across tasks from image classification to 3D mesh regression.
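Structural reparameterization is easy to see in a toy example: train with an explicit skip connection, then fold that skip into the convolution kernel at inference so the memory-hungry branch disappears. The snippet below is a minimal PyTorch sketch of that idea (a depthwise convolution plus identity), not FastViT’s actual RepMixer block.

```python
import torch
import torch.nn as nn

# Train-time block: y = x + depthwise_conv(x)   (explicit skip connection)
# Inference-time:   y = reparam_conv(x)         (skip folded into the kernel)
dim, k = 64, 3
train_conv = nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim, bias=True)

reparam = nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim, bias=True)
with torch.no_grad():
    identity_kernel = torch.zeros_like(train_conv.weight)   # shape (dim, 1, k, k)
    identity_kernel[:, 0, k // 2, k // 2] = 1.0              # a Dirac kernel acts as identity
    reparam.weight.copy_(train_conv.weight + identity_kernel)
    reparam.bias.copy_(train_conv.bias)

x = torch.randn(1, dim, 56, 56)
# The reparameterized conv matches the skip-connected block exactly (up to float error).
assert torch.allclose(x + train_conv(x), reparam(x), atol=1e-5)
```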

LightGlue: Local Feature Matching at Light Speed

Original source: LightGlue Paper

  • Links: (Arxiv | Code)
  • Authors: Philipp Lindenberger, Paul-Edouard Sarlin, Marc Pollefeys

TL;DR: Faster, more accurate sparse feature matching that adapts to the problem difficulty

Image matching is the task of associating select points in one image with points in another image, so as to “align” the two. It is used in a variety of downstream applications, from camera tracking to 3D reconstruction. At CVPR 2020, Magic Leap unveiled SuperGlue, a transformer-based graph neural network which met with resounding success both at matching images from sparse sets of points and at rejecting outliers — points with no true correspondence in the other image. However, the computational intensity of the transformer-based model prevented its utilization in low-latency applications. (If you’re noticing a trend, you’re not alone!)

LightGlue revisits some of SuperGlue’s design decisions, making changes that collectively improve memory usage, computation, accuracy, and trainability. The model simultaneously computes a “matchability score” (how confident it is that a point in one image can be matched to a point in the second image) and a “pairwise similarity” for pairs of points. My favorite part: the model includes a classifier that allows it to stop the matching process early once it is highly confident. This means that the easier the matching job, the quicker the matching process terminates.
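The early-exit idea can be sketched as a loop over attention layers with a lightweight confidence head that decides when to stop. The module below is a hypothetical illustration (generic transformer layers, a made-up threshold), not LightGlue’s actual architecture.

```python
import torch
import torch.nn as nn

class AdaptiveMatcher(nn.Module):
    """Hypothetical adaptive-depth matcher: after each layer, a small classifier
    scores how confident the matcher is; confident-enough inputs exit early."""
    def __init__(self, dim=256, num_layers=9, exit_threshold=0.95):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(num_layers)
        )
        self.exit_head = nn.Linear(dim, 1)                 # per-point "am I done?" score
        self.exit_threshold = exit_threshold

    def forward(self, desc0, desc1):                       # (B, N, dim) descriptors per image
        feats = torch.cat([desc0, desc1], dim=1)
        for depth, layer in enumerate(self.layers):
            feats = layer(feats)
            confidence = torch.sigmoid(self.exit_head(feats)).mean()
            if confidence > self.exit_threshold:           # easy pair: stop refining
                break
        # matchability and pairwise similarity would be computed from `feats` here
        return feats, depth + 1

matcher = AdaptiveMatcher()
feats, depth_used = matcher(torch.randn(1, 512, 256), torch.randn(1, 512, 256))
print(f"exited after {depth_used} layers")
```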

ProPainter: Improving Propagation and Transformer for Video Inpainting

Original source: ProPainter Project Page

  • Links: (Arxiv | Code | Project Page)
  • Authors: Shangchen Zhou, Chongyi Li, Kelvin C.K. Chan, Chen Change Loy

TL;DR: Faster, more computationally efficient video inpainting by dual propagation (both features and images)

Much like video segmentation, video inpainting is an extension of an image-based task (in this case inpainting) to videos, in a spatially and temporally coherent and consistent manner. Approaches to video inpainting typically fall into two buckets: image propagation and feature propagation. Image propagation on its own can result in “unpleasant artifacts” and “texture misalignment”. On the other hand, feature propagation, which uses a Transformer architecture, is typically limited to short sequences and lower resolution videos as a result of the Transformer’s memory and compute constraints. At this point, I know I sound like a broken record!

ProPainter (ProPagation and an efficient Transformer) combines the strengths of image propagation and feature propagation to achieve dual-domain propagation. The implementation involves a bevy of improvements, including moving CPU-intensive processes onto the GPU, using an efficient recurrent network to complete the optical flows, and discarding unnecessary or redundant tokens from the Transformer’s query and key/value spaces. All told, the model significantly outperforms the prior state of the art in peak signal-to-noise ratio (PSNR) while at the same time reducing memory consumption!

Segment Anything

Original source: Segment Anything Model GitHub repo

  • Links: (Arxiv | Code | Project Page | Tutorial)
  • Authors: Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, Ross Girshick

TL;DR: Game changing foundation vision model for prompted and unprompted segmentation tasks, plus largest-ever segmentation dataset

With Segment Anything (SAM), Meta AI has taken image segmentation to the next level. Open source and easy to use, SAM brings high-quality zero-shot segmentation to a wide range of domains. The Vision Transformer-based model, which comes in three sizes, supports multiple modes of segmentation. In automatic segmentation mode, it generates predicted segmentation masks for all things (distinct entities) and stuff (concepts like “sky”). Alternatively, you can prompt the model with bounding boxes and/or key points!
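Prompted inference with the open-source `segment-anything` package looks roughly like this; the checkpoint path, the stand-in image, and the click coordinates below are placeholders.

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load a SAM checkpoint (placeholder path) and wrap it in a predictor.
sam = sam_model_registry["vit_b"](checkpoint="path/to/sam_vit_b_checkpoint.pth")
predictor = SamPredictor(sam)

image = np.zeros((480, 640, 3), dtype=np.uint8)        # stand-in for an RGB image (HWC, uint8)
predictor.set_image(image)

# Prompt with a single foreground click; boxes can be passed via the `box` argument instead.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),               # (x, y) pixel coordinates
    point_labels=np.array([1]),                        # 1 = foreground, 0 = background
)
print(masks.shape, scores)                             # by default, 3 candidate masks with scores
```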

To train SAM, the team constructed SA-1B, the largest segmentation dataset to date, consisting of 1B masks across 11M images. The annotation of the dataset and the training of the model were performed in a loop: the model was used to assist human annotators, and these annotations were used to retrain the model!

It has already sired a slew of smaller segmentation models like FastSAM, MobileSAM, and NanoSAM, via distillation or by virtue of the SA-1B dataset, and has inspired a seemingly endless parade of related “<Insert Task> Anything” models, from Track Anything to Inpaint Anything. Meta AI’s FACET benchmark (see below) is also adapted from a subset of SA-1B.
💡SAM is natively supported by the computer vision library FiftyOne, making segmenting your data easier than ever!

Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models

Original source: Text2Room Project Page

  • Links: (Arxiv | Code | Project Page)
  • Authors: Lukas Höllein, Ang Cao, Andrew Owens, Justin Johnson, Matthias Nießner

TL;DR: 2D images from diffusion models ➕ rendering from novel viewpoint ➕ fusing the images  👉 room-scale 3D meshes

Mesh representations of three-dimensional scenes are useful in computer graphics and in the creation of 3D assets for augmented and virtual reality environments. Generating these 3D meshes, however, poses multiple challenges. Chief among them is the scarcity of high-quality 3D data to train on: because collecting 3D data is costlier and slower than collecting 2D data, the datasets are typically smaller, or consist primarily of simple objects and scenes. And while neural radiance fields (NeRFs) hold promise for 3D scene generation, extending them to room-level scales presents its own difficulties.

Text2Room bypasses these problems by harnessing the 2D image generation capabilities of text-to-image diffusion models and cleverly combining the 2D images into realistic 3D scenes. The method starts by generating a single image of a scene. Monocular depth estimation is used to backproject the scene into three dimensions, and an initial 3D mesh is generated from this. From there, the mesh is iteratively rendered from novel viewpoints, any holes are inpainted, and the images are fused together. In tests, Text2Room outperformed competitors on a slate of quantitative metrics, from perceptual quality (PQ) to 3D structure completeness (3DS).
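The backprojection step is standard pinhole-camera geometry: each pixel is lifted into 3D using its predicted depth and the camera intrinsics. Below is a minimal NumPy sketch, with made-up intrinsics and a random depth map standing in for a monocular depth prediction.

```python
import numpy as np

# Back-project a depth map into a 3D point cloud with a pinhole camera model.
def backproject(depth, fx, fy, cx, cy):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))     # pixel grid
    z = depth
    x = (u - cx) * z / fx                              # invert the projection u = fx * x / z + cx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3) # (H*W, 3) points

depth = np.random.uniform(1.0, 5.0, size=(480, 640))   # stand-in for a predicted depth map
points = backproject(depth, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(points.shape)                                    # (307200, 3)
```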

Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators

Original source: Text2Video-Zero Paper

  • Links: (Arxiv | Code | Hugging Face | Project Page)
  • Authors: Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, Humphrey Shi

TL;DR: Motion dynamics ➕ cross-frame attention 👉 off-the-shelf text-to-image models can be applied to generate videos

Whereas Text2Room leverages the power of 2D text-to-image models to generate high-quality 3D meshes, Text2Video-Zero harnesses these same 2D models to perform low-cost zero-shot text-to-video generation. That is, Text2Video-Zero sets forth an approach to video generation from text prompts that does not necessitate fine-tuning or optimization.

Instead of randomly sampling latent codes (the inputs to diffusion models) for each frame independently, the latent codes are constructed through an iterative warping process. This encourages temporal consistency across the generated frames, but on its own it is an insufficient constraint. On top of this, Text2Video-Zero replaces the self-attention layers in the diffusion model with cross-frame attention layers that connect each frame to the first frame.
The most exciting part of Text2Video-Zero is that the basic approach also works for other video tasks such as conditional video generation and even instruction-guided video editing! You can run the model in text-to-video mode, or in either of these two additional modes with Hugging Face’s diffusers library!
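Running the text-to-video mode with diffusers looks roughly like the snippet below, which follows the pattern from the diffusers documentation; the Stable Diffusion model ID and prompt are examples, and a CUDA GPU is assumed.

```python
import torch
import imageio
from diffusers import TextToVideoZeroPipeline

# Any Stable Diffusion checkpoint can back the pipeline; this ID is an example.
model_id = "runwayml/stable-diffusion-v1-5"
pipe = TextToVideoZeroPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

prompt = "a panda playing a guitar by the Seine"
result = pipe(prompt=prompt).images                    # list of frames as float arrays in [0, 1]
frames = [(frame * 255).astype("uint8") for frame in result]
imageio.mimsave("video.mp4", frames, fps=4)            # write the generated clip to disk
```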

ViperGPT: Visual Inference via Python Execution for Reasoning

Original source: ViperGPT GitHub repo

TL;DR: Using code generation models to compose vision-language models 👉 state-of-the-art performance on visual inference tasks

ViperGPT (so named because it executes Python code) takes a modular approach to visual inference. Instead of relying on end-to-end models, which bundle visual processing and reasoning together but lack interpretability, ViperGPT uses GPT-3 Codex to generate Python code that executes specific subroutines to answer a query. In one example from the project, ViperGPT employs object detection models to determine how many muffins there are, and then uses this result to reason about the right answer to the query.

By defining simple, task-specific APIs, ViperGPT is able to leverage the code generation and reasoning capabilities of existing models without fine-tuning. Additionally, because the output of the code-generation model is, well, code, it is more interpretable than an end-to-end model. To top things off, ViperGPT achieves state-of-the-art performance on multiple zero-shot tasks, including general visual question answering, grounded question answering, and even referring expression tasks!
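To give a flavor of the approach, here is a hypothetical example of the kind of short program a code-generation model might write for the muffin query. The `find` primitive is a stub standing in for an object detector; the names and numbers are illustrative, not the paper’s actual API.

```python
from typing import List

def find(image, category: str) -> List[dict]:
    """Stub detector: would return one dict per detected instance of `category`."""
    fake_detections = {"muffin": 8, "kid": 2}          # stand-in results for the demo query
    return [{"category": category}] * fake_detections.get(category, 0)

# "How many muffins can each kid have for it to be fair?"
def execute_command(image) -> int:
    muffins = find(image, "muffin")                    # vision module call
    kids = find(image, "kid")                          # vision module call
    return len(muffins) // len(kids)                   # plain Python reasoning over the outputs

print(execute_command(image=None))                     # -> 4 muffins per kid
```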

💡If you like ViperGPT, you should check out:

  • HuggingGPT: LLM dispatcher for CV tasks
  • VisProg: Visual reasoning without training (CVPR 2023 best paper)
  • VoxelGPT: Text-to-query for CV datasets

Speed Round

Here are twenty more cool ICCV 2023 projects you should check out!

A Generalist Framework for Panoptic Segmentation of Images and Videos

  • Links: (Arxiv)
  • Authors: Ting Chen, Lala Li, Saurabh Saxena, Geoffrey Hinton, David J. Fleet

TL;DR: Google Research applies Geoffrey Hinton et al.’s Bit Diffusion model to panoptic segmentation 👉 more general architecture and loss function competitive with specialized methods

Chupa: Carving 3D Clothed Humans from Skinned Shape Priors using 2D Diffusion Probabilistic Models

  • Links: (Arxiv | Code | Project Page)
  • Authors: Byungjun Kim, Patrick Kwon, Kwangho Lee, Myunggi Lee, Sookwan Han, Daesik Kim, Hanbyul Joo

TL;DR: Decomposing the task of 3D human mesh generation into (1) pose-conditional normal map generation with diffusion models, and (2) carving the 3D mesh utilizing the normal maps 👉 realistic 3D human meshes

Dense Text-to-Image Generation with Attention Modulation

TL;DR: The layout of objects in images generated by diffusion models is related to the model’s self-attention and cross-attention maps. Modulating these maps 👉 better layout control and better performance for dense text prompts

DOLCE: A Model-Based Probabilistic Diffusion Framework for Limited-Angle CT Reconstruction

  • Links: (Arxiv | Project Page)
  • Authors: Jiaming Liu, Rushil Anirudh, Jayaraman J. Thiagarajan, Stewart He, K. Aditya Mohan, Ulugbek S. Kamilov, Hyojin Kim

TL;DR: Conditional diffusion models can reduce artifacts in CT images reconstructed from severe undersampling scenarios

Doppelgangers: Learning to Disambiguate Images of Similar Structures

  • Links: (Arxiv | Code | Dataset | Project Page)
  • Authors: Ruojin Cai, Joseph Tung, Qianqian Wang, Hadar Averbuch-Elor, Bharath Hariharan, Noah Snavely

TL;DR: The task of distinguishing images of the same or different 3D surfaces (visual disambiguation) is treated as binary classification on pairs of images. This is enabled by a new Doppelgangers dataset

Equivariant Similarity for Vision-Language Foundation Models

  • Links: (Arxiv | Code)
  • Authors: Tan Wang, Kevin Lin, Linjie Li, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, Lijuan Wang

TL;DR: New benchmark EqBen for assessing how faithfully the notion of similarity holds up in vision-language models under semantic changes

FACET: Fairness in Computer Vision Evaluation Benchmark 

  • Links: (Dataset | Paper | Project Page | Tutorial)
  • Authors: Laura Gustafson, Chloe Rolland, Nikhila Ravi, Quentin Duval, Aaron Adcock, Cheng-Yang Fu, Melissa Hall, Candace Ross

TL;DR: Meta AI releases a diverse benchmark dataset for evaluating fairness, bias, and disparity across protected attributes

GlueStick: Robust Image Matching by Sticking Points and Lines Together

  • Links: (Arxiv | Code | Project Page)
  • Authors: Rémi Pautrat, Iago Suárez, Yifan Yu, Marc Pollefeys, Viktor Larsson

TL;DR: Treating points, lines, and their descriptors as combined “wireframes” and applying graph neural nets 👉 state-of-the-art matching for both points and line segments

HyperDiffusion: Generating Implicit Neural Fields with Weight-Space Diffusion

  • Links: (Arxiv | Code | Project Page)
  • Authors: Ziya Erkoç, Fangchang Ma, Qi Shan, Matthias Nießner, Angela Dai

TL;DR: Overfitting multi-layer perceptrons (MLPs) on individual neural implicit fields and training a diffusion model to denoise these weights 👉 realistic 3D shapes and 4D mesh animations

Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing

  • Links: (Arxiv | Code)
  • Authors: Alberto Baldrati, Davide Morelli, Giuseppe Cartella, Marcella Cornia, Marco Bertini, Rita Cucchiara

TL;DR: Denoising diffusion network conditioned on multiple modalities and trained on multimodal extensions to Dress Code and VITON-HD 👉 fashion image editing that can be guided by text prompts, poses, or sketches

Neural Haircut: Prior-Guided Strand-Based Hair Reconstruction

  • Links: (Arxiv | Code | Project Page)
  • Authors: Vanessa Sklyarova, Jenya Chelishev, Andreea Dogaru, Igor Medvedev, Victor Lempitsky, Egor Zakharov

TL;DR: Samsung researchers create two-stage method for accurately reconstructing hair at the strand level from monocular videos or multiview images

Prompt-aligned Gradient for Prompt Tuning

  • Links: (Arxiv | Code)
  • Authors: Beier Zhu, Yulei Niu, Yucheng Han, Yue Wu, Hanwang Zhang

TL;DR: ProGrad, a principled approach to prompt tuning in vision-language models without forgetting their general knowledge, shows strong few-shot generalization

Reference-guided Controllable Inpainting of Neural Radiance Fields

  • Links: (Arxiv | Project Page)
  • Authors: Ashkan Mirzaei, Tristan Aumentado-Armstrong, Marcus A. Brubaker, Jonathan Kelly, Alex Levinshtein, Konstantinos G. Derpanis, Igor Gilitschenski

TL;DR: Monocular depth estimation to back-project an image to 3D coordinates ➕ new rendering technique 👉 consistent NeRF inpainting from original NeRF and a single inpainted image

ReMoDiffuse: Retrieval-Augmented Motion Diffusion Model

  • Links: (Arxiv | Code | Project Page)
  • Authors: Mingyuan Zhang, Xinying Guo, Liang Pan, Zhongang Cai, Fangzhou Hong, Huirong Li, Lei Yang, Ziwei Liu

TL;DR: Retrieving reference motion sequences that are semantically and kinematically relevant and selectively incorporating this knowledge into the motion generation process 👉 diverse motion with state-of-the-art motion quality and text-motion consistency

Robo3D: Towards Robust and Reliable 3D Perception against Corruptions

  • Links: (Arxiv | Code | Project Page)
  • Authors: Lingdong Kong, Youquan Liu, Xin Li, Runnan Chen, Wenwei Zhang, Jiawei Ren, Liang Pan, Kai Chen, Ziwei Liu

TL;DR: Benchmark evaluation suite for detection and segmentation models in out-of-distribution autonomous driving scenarios

ScanNet++: A High-Fidelity Dataset of 3D Indoor Scenes

  • Links: (Arxiv | Project Page)
  • Authors: Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, Angela Dai

TL;DR: Higher-fidelity successor to the popular ScanNet dataset, captured with sub-millimeter-resolution laser scans and 33-megapixel DSLR images, designed for novel view synthesis of indoor scenes

SegGPT: Segmenting Everything In Context

TL;DR: Generalist vision model built on top of Painter that is capable of many segmentation tasks, including semantic segmentation, video object segmentation, and panoptic segmentation, but does not set out to achieve state-of-the-art performance on any single task

SHIFT3D: Synthesizing Hard Inputs For Tricking 3D Detectors

  • Links: (Arxiv)
  • Authors: Hongge Chen, Zhao Chen, Gregory P. Meyer, Dennis Park, Carl Vondrick, Ashish Shrivastava, Yuning Chai

TL;DR: New approach to generating plausible yet challenging 3D shapes in order to probe the failure modes of 3D object detectors for autonomous driving and other mission-critical tasks

Text2Performer: Text-Driven Human Video Generation

  • Links: (Arxiv | Code | Project Page)
  • Authors: Yuming Jiang, Shuai Yang, Tong Liang Koh, Wayne Wu, Chen Change Loy, Ziwei Liu

TL;DR: Generate temporally coherent human-centered videos from text input by decomposing the video into appearance (shared across all frames) and pose (continuously changing) representations, and predicting pose embeddings with a diffusion model

Tracking Everything Everywhere All at Once

  • Links: (Arxiv | Code | Project Page)
  • Authors: Qianqian Wang, Yen-Yu Chang, Ruojin Cai, Zhengqi Li, Bharath Hariharan, Aleksander Holynski, Noah Snavely

TL;DR: Estimate dense, long-range motion for every pixel in a video by optimizing a globally consistent, quasi-3D canonical representation (OmniMotion) per video, enabling tracking through occlusions