I’m excited to attend CVPR 2024! There is A LOT of awesome research again this year! Gearing up for the event, I made a short list of papers I find interesting and would like to explore more, especially as they relate to my work on open-source FiftyOne. 📄

Here’s a summary of my LinkedIn posts from this week – a paper per day – in reverse order. 🙃

Also, visit the Voxel51 booth #1519 at CVPR and chat with me and the rest of the team about visual AI, data-centric ML, or whatever excites you! 👋

🔥 CVPR 2024 Paper Spotlight: CoDeF 🔥

Recent progress in video editing/translation has been driven by techniques like Tune-A-Video and FateZero, which utilize text-to-image generative models. Because a generative model (with inherent randomness) is applied to each frame in input videos, these methods are susceptible to breaks in temporal consistency.

Content Deformation Fields (CoDeF) overcome this challenge by representing any video with a flattened canonical image, which captures the textures in the video, and a deformation field, which describes how each frame in the video is deformed relative to the canonical image. This allows for image algorithms like image translation to be “lifted” to the video domain, applying the algorithm to the canonical image and propagating the effect to each frame using the deformation field.
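To make the "lifting" idea concrete, here is a toy sketch in plain Python. It is not CoDeF's implementation: the real method learns the canonical image and the deformation field with neural fields, whereas this sketch assumes both are already given as small grids, uses nearest-neighbor lookup, and all function names are hypothetical.

```python
# Toy sketch: a video stored as one canonical image plus per-frame deformation fields.
# Assumption: deformation[y][x] holds the (row, col) in the canonical image
# that frame pixel (y, x) reads from. Names are hypothetical, not CoDeF's API.

def render_frame(canonical, deformation):
    """Rebuild one frame by looking up each pixel in the canonical image."""
    return [[canonical[cy][cx] for (cy, cx) in row] for row in deformation]


def lift_to_video(image_fn, canonical, deformations):
    """Apply an image algorithm once, then propagate the result to every frame."""
    edited = image_fn(canonical)
    return [render_frame(edited, d) for d in deformations]


def invert(image):
    """Stand-in for an image algorithm such as image translation."""
    return [[255 - p for p in row] for row in image]


canonical = [[10, 20, 30],
             [40, 50, 60]]

# Frame 0 matches the canonical image; frame 1 is shifted right by one pixel
# (clamped at the left border).
identity = [[(y, x) for x in range(3)] for y in range(2)]
shifted = [[(y, max(x - 1, 0)) for x in range(3)] for y in range(2)]

video = [render_frame(canonical, d) for d in (identity, shifted)]
edited_video = lift_to_video(invert, canonical, (identity, shifted))
```

Because the edit is applied once to the canonical image, every frame sees exactly the same edited texture, which is where the cross-frame consistency comes from.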

Through lifting image translation algorithms, CoDeF achieves unprecedented cross-frame consistency in video-to-video translation. CoDeF can also be applied for point-based tracking (even with non-rigid entities like water), segmentation-based tracking, and video super-resolution!

Arxiv: https://arxiv.org/abs/2308.07926
Project page: https://qiuyu96.github.io/CoDeF/
GitHub: https://github.com/qiuyu96/CoDeF
My post on LinkedIn: https://www.linkedin.com/posts/jacob-marks_cvpr2024-computervision-ml-activity-7207366220457598977-wKBh/

🔥 CVPR 2024 Paper Spotlight: Depth Anything 🔥

How do you estimate depth using just a single image? Technically, calculating 3D characteristics of objects like depth requires comparing images from multiple perspectives — humans, for instance, perceive depth by merging images from two eyes.

Computer vision applications, however, are often constrained to a single camera. In these scenarios, deep learning models are used to estimate depth from one vantage point. Convolutional neural networks (CNNs) and, more recently, transformers and diffusion models employed for this task typically need to be trained on highly specific data.

Depth Anything revolutionizes relative and absolute depth estimation. Like Meta AI’s Segment Anything, Depth Anything is trained on an enormous quantity and diversity of data (62 million unlabeled images), giving the model unparalleled generality and robustness for zero-shot depth estimation, as well as state-of-the-art fine-tuned performance on datasets like NYUv2 and KITTI. (The video shows raw footage, MiDaS as the previous best, and Depth Anything.)
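A quick note on relative versus absolute depth: a relative depth map only gets the ordering and proportions right, up to an unknown scale and shift, while absolute depth is in real units like meters. A common way to compare the two is to fit that scale and shift with least squares. The sketch below assumes, for simplicity, that the prediction is linearly related to true depth (in practice this alignment is often done in inverse-depth space), and the function name and sample values are made up for illustration.

```python
# Minimal sketch: align a relative depth prediction to metric depth with a
# least-squares scale and shift. The values below are illustrative, not real data.

def align_scale_shift(relative, metric):
    """Return (scale, shift) minimizing sum((scale * r + shift - m) ** 2)."""
    n = len(relative)
    mean_r = sum(relative) / n
    mean_m = sum(metric) / n
    var_r = sum((r - mean_r) ** 2 for r in relative)
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(relative, metric))
    scale = cov / var_r
    shift = mean_m - scale * mean_r
    return scale, shift


relative_depth = [0.0, 0.25, 0.5, 1.0]   # unitless model output
metric_depth = [2.0, 3.0, 4.0, 6.0]      # hypothetical ground truth in meters

scale, shift = align_scale_shift(relative_depth, metric_depth)
aligned = [scale * r + shift for r in relative_depth]
```

After alignment, the relative prediction can be scored against ground truth with the usual metric-depth errors.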

The model uses a Dense Prediction Transformer (DPT) architecture and is already integrated into Hugging Face's Transformers library and FiftyOne!

Arxiv: https://arxiv.org/abs/2401.10891
Project page: https://depth-anything.github.io/
GitHub: https://github.com/LiheYoung/Depth-Anything
Depth Anything Transformers Docs: https://huggingface.co/docs/transformers/model_doc/depth_anything
Monocular Depth Estimation Tutorial: https://medium.com/towards-data-science/how-to-estimate-depth-from-a-single-image-7f421d86b22d
Depth Anything FiftyOne Integration: https://docs.voxel51.com/tutorials/monocular_depth_estimation.html#Hugging-Face-Transformers-Integration
My post on LinkedIn: https://www.linkedin.com/posts/jacob-marks_cvpr2024-computervision-ml-activity-7207003799486357504-o6e1/

🔥 CVPR 2024 Paper Spotlight: YOLO-World 🔥

Over the past few years, object detection has been cleanly divided into two camps.

1️⃣ Real-time closed-vocabulary detection:
Single-stage detection models like those from the You-Only-Look-Once (YOLO) family made it possible to detect objects from a pre-set list of classes in mere milliseconds on GPUs.

2️⃣ Open-vocabulary object detection:
Transformer-based models like Grounding DINO and Owl-ViT brought open-world knowledge to detection tasks, giving you the power to detect objects from arbitrary text prompts, at the expense of speed.

YOLO-World bridges this gap!

YOLO-World uses a YOLO backbone for rapid detection and introduces semantic information via a CLIP text encoder. The two are connected through a new lightweight module called a Re-parameterizable Vision-Language Path Aggregation Network.
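Here is a rough sketch of the open-vocabulary part: text prompts are embedded once, and each detected region is labeled by whichever prompt embedding it is most similar to. This is a toy under stated assumptions, not YOLO-World's actual code. The three-dimensional embeddings are invented, and `classify_regions` and its threshold are hypothetical stand-ins for what the CLIP text encoder and the detection head do.

```python
import math

# Toy sketch of open-vocabulary matching via cosine similarity.
# Embeddings are hand-written for illustration; a real model would get them
# from a text encoder and a detection backbone.

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def classify_regions(region_embeddings, text_embeddings, threshold=0.5):
    """Label each region with the best-matching prompt, or None if nothing is close."""
    # The vocabulary is embedded and normalized once, before any image is seen.
    vocab = {label: normalize(v) for label, v in text_embeddings.items()}
    labels = []
    for region in region_embeddings:
        r = normalize(region)
        scores = {label: sum(a * b for a, b in zip(r, t)) for label, t in vocab.items()}
        best = max(scores, key=scores.get)
        labels.append(best if scores[best] >= threshold else None)
    return labels


text_embeddings = {"dog": [1.0, 0.0, 0.0], "bicycle": [0.0, 1.0, 0.0]}
region_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.0, 1.0]]

labels = classify_regions(region_embeddings, text_embeddings)
```

Changing the vocabulary only means swapping the prompt embeddings; nothing about the detector itself needs retraining.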

What you get is a family of strong zero-shot detection models that can process up to 74 images per second!

YOLO-World is already integrated into Ultralytics (along with YOLOv5, YOLOv8, and YOLOv9) and FiftyOne!

Arxiv: https://arxiv.org/abs/2401.17270
Project page: https://www.yoloworld.cc/
GitHub: https://github.com/AILab-CVC/YOLO-World?tab=readme-ov-file
YOLO-World Ultralytics Docs: https://docs.ultralytics.com/models/yolo-world/
YOLO-World FiftyOne Docs: https://docs.voxel51.com/integrations/ultralytics.html#open-vocabulary-detection
My post on LinkedIn: https://www.linkedin.com/feed/update/urn:li:activity:7206641438845992960/

🔥 CVPR 2024 Paper Spotlight: DeepCache 🔥

Diffusion models dominate the discourse regarding visual genAI these days: Stable Diffusion, Midjourney, DALL-E 3, and Sora are just a few of the diffusion-based models that produce stunning visuals.

If you’ve ever tried to run a diffusion model locally, you’ve probably seen for yourself how these models can be pretty slow. This is because diffusion models iteratively try to denoise an image (or other state), meaning that many sequential forward passes through the model must be made.

DeepCache accelerates diffusion model inference by up to 10x with minimal quality drop-off. The technique is training-free and works by leveraging the fact that high-level features are fairly consistent throughout the diffusion denoising process. By caching these features and reusing them in subsequent steps, DeepCache avoids recomputing them at every step.
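The caching idea can be sketched in a few lines. This is a toy loop, not the DeepCache implementation: `deep_fn` and `shallow_fn` are hypothetical stand-ins for the expensive and cheap parts of the denoising network, and `cache_interval` is an assumed knob for how often the expensive part is refreshed.

```python
# Toy sketch of feature caching across denoising steps.
# deep_fn: expensive high-level features (recomputed only every few steps).
# shallow_fn: cheap per-step update that reuses the cached features.

def denoise_with_cache(x, num_steps, deep_fn, shallow_fn, cache_interval=3):
    cached = None
    for step in range(num_steps):
        if step % cache_interval == 0:
            cached = deep_fn(x, step)      # refresh the cache
        x = shallow_fn(x, cached, step)    # every step still runs the cheap path
    return x


calls = {"deep": 0, "shallow": 0}


def toy_deep(x, step):
    calls["deep"] += 1
    return 0.1 * x


def toy_shallow(x, deep_features, step):
    calls["shallow"] += 1
    return x - deep_features


result = denoise_with_cache(1.0, num_steps=10, deep_fn=toy_deep, shallow_fn=toy_shallow)
```

With 10 steps and an interval of 3, the expensive function runs only at steps 0, 3, 6, and 9, while the cheap update still runs every step. A larger interval saves more compute but reuses staler features.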

Arxiv: https://arxiv.org/abs/2312.00858
Project page: https://horseee.github.io/Diffusion_DeepCache/
GitHub: https://github.com/horseee/DeepCache?tab=readme-ov-file
DeepCache Diffusers Docs: https://huggingface.co/docs/diffusers/main/en/optimization/deepcache
My post on LinkedIn: https://www.linkedin.com/posts/jacob-marks_cvpr2024-computervision-ml-activity-7206279082433478656-E5QC/

🔥 CVPR 2024 Paper Spotlight: PhysGaussian 🔥

I’m a sucker for some physics-based machine learning, and this new approach from researchers at UCLA, Zhejiang University, and the University of Utah is pretty insane.

3D Gaussian splatting is a rasterization technique that generates realistic new views of a scene from a set of photos or an input video. It has rapidly risen to prominence because it is simple, trains relatively quickly, and can synthesize novel views in real time.

However, to simulate dynamics (which involves motion synthesis), views generated by Gaussian splatting had to be converted into meshes before physical simulation and final rendering could be performed.

PhysGaussian cuts through these intermediate steps by embedding physical concepts like stress, plasticity, and elasticity into the model itself. At a high level, the model leverages the deep relationships between physical behavior and visual appearance, following Nvidia's “what you see is what you simulate” (WS2) approach.
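To give a flavor of what "simulating the Gaussians directly" can look like, here is a small sketch of how a single Gaussian would move and stretch under a local deformation. It assumes the deformation is locally affine, described by a deformation gradient `F` plus a translation, so the mean is mapped through it and the covariance becomes `F Σ Fᵀ`. The function names are hypothetical, the example is 2D for readability, and the actual physics (stress, plasticity, elasticity) that would produce `F` is left out entirely.

```python
# Sketch: deform one Gaussian with a local affine map (assumed, not simulated).
# mean' = F @ mean + translation, cov' = F @ cov @ F^T

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def deform_gaussian(mean, cov, F, translation):
    dim = len(mean)
    new_mean = [sum(F[i][k] * mean[k] for k in range(dim)) + translation[i]
                for i in range(dim)]
    new_cov = matmul(matmul(F, cov), transpose(F))
    return new_mean, new_cov


mean = [1.0, 2.0]
cov = [[1.0, 0.0],
       [0.0, 1.0]]
F = [[2.0, 0.0],      # stretch by 2x along the first axis
     [0.0, 1.0]]
translation = [0.5, 0.0]

new_mean, new_cov = deform_gaussian(mean, cov, F, translation)
```

The same Gaussians that get rendered are the ones being deformed, so there is no mesh conversion step in between.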

Very excited to see where this line of work goes!

Arxiv: https://arxiv.org/abs/2311.12198
Project page: https://xpandora.github.io/PhysGaussian/
My post on LinkedIn: https://www.linkedin.com/posts/jacob-marks_cvpr2024-computervision-ml-activity-7205916642499780608-sxti/
