Patterns and trends in computer vision from CVPR papers
The annual IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) is just around the corner. Every year, thousands of computer vision researchers and engineers from across the globe come together to take part in this monumental event. The prestigious conference, which can trace its origin back to 1983, represents the pinnacle of progress in computer vision. With CVPR playing host to some of the field’s most pioneering projects and painstakingly crafted papers, it’s no wonder that the conference has the fourth highest h5-index of any conference or publication, trailing only Nature, Science, and The New England Journal of Medicine.
This year, the conference will take place in Vancouver, Canada, from June 18th – June 22nd. With 2359 accepted papers, 100 workshops, 33 tutorials, and Flash Sessions happening in the Expo (including two by Voxel51!), CVPR will have something for everyone. Such a large volume of high-quality content can be overwhelming, but it also provides us with invaluable insight into the current state of affairs in computer vision, what’s hot, and where the field is going.
To help you navigate the rapidly changing world of computer vision in 2023, we gathered, scraped, cleaned, and dug into the data. We even used AI to quantify how creative authors were! In this blog post, we’ll share what we learned. And stay tuned for our upcoming CVPR 2023 Survival Guide, including the top ten papers you won’t want to miss!
- Digging into the CVPR paper data
- What’s trending in computer vision
- Analyzing creativity in computer vision with AI
- Where to find Voxel51 at CVPR
Digging into the data
The starting point for all of our analyses was the list of all accepted papers. We scraped the web to find the actual papers corresponding to the titles in the initial list. If there was a version of the paper on Arxiv, we grabbed the author list, title (for many papers, the title on Arxiv differed slightly from the title posted on the CVPR website), and abstract. If the paper was not on Arxiv, but was available elsewhere on the web, we attempted to capture that information instead.
Note: this data was scraped on April 20th, 2023. Any papers that have been uploaded to the Arxiv in the intervening weeks may be excluded from this analysis.
- 2359 papers accepted (out of 9155 submissions)
- 1724 papers with versions on Arxiv
- 68 additional papers found elsewhere
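Matching the CVPR title list against Arxiv is fiddlier than it sounds, because the two sources often disagree on casing and punctuation. A minimal normalization helper along these lines (a sketch, not our exact pipeline) makes near-identical titles compare equal:

```python
import re

def normalize_title(title: str) -> str:
    """Lowercase a title, strip punctuation, and collapse whitespace
    so near-identical titles from different sources compare equal."""
    title = title.lower()
    title = re.sub(r"[^\w\s]", "", title)  # drop punctuation
    return re.sub(r"\s+", " ", title).strip()

# The CVPR site and Arxiv often differ only in casing or punctuation
assert normalize_title("Why Is the Winner the Best?") == \
       normalize_title("Why is the winner the best")
```

With titles normalized this way, a simple equality check (or fuzzy match on the normalized strings) is enough to pair most CVPR entries with their Arxiv counterparts.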
Authors per paper
- The average CVPR paper had approximately 5.4 authors.
- The paper with the most authors was Why is the winner the best?, which had 125 authors. All 125 of them can share the honor.
- 13 papers had only one author.
Primary Arxiv category
Of the 1724 papers with versions on Arxiv, 1545, or just under 90%, listed cs.CV as their primary category. cs.LG was second, with 101, while eess.IV (26) and cs.RO (16) also claimed a piece of the pie. Other primary categories with CVPR papers include cs.HC, cs.AR, cs.DC, cs.NE, cs.SD, cs.CL, cs.IT, cs.CR, cs.AI, cs.MM, cs.GR, eess.SP, eess.AS, math.OC, math.NT, physics.data-an, and stat.ML.
- The words “dataset” and “model” appeared jointly in 567 abstracts. “Dataset” appeared on its own in 265 abstracts, while “model” appeared on its own in 613. Only 16.2% of CVPR accepted paper abstracts contained neither word.
- According to CVPR paper abstracts, the most popular datasets this year were ImageNet (105), COCO (94), KITTI (55), and CIFAR (36).
- 28 papers introduce a new “benchmark”.
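For the curious, tallies like the “dataset”/“model” breakdown above take only a few lines of Python. Here is a sketch; the abstracts below are toy strings for illustration, not real CVPR abstracts:

```python
from collections import Counter

def keyword_breakdown(abstracts, kw1="dataset", kw2="model"):
    """Tally abstracts mentioning kw1 only, kw2 only, both, or neither."""
    counts = Counter()
    for text in abstracts:
        low = text.lower()
        has1, has2 = kw1 in low, kw2 in low
        key = "both" if (has1 and has2) else kw1 if has1 else kw2 if has2 else "neither"
        counts[key] += 1
    return counts

# Toy abstracts for illustration -- not real CVPR abstracts
abstracts = [
    "We introduce a new dataset and train a model on it.",
    "Our model achieves state-of-the-art results.",
    "We present a new benchmark for tracking.",
]
counts = keyword_breakdown(abstracts)
print(counts["both"], counts["model"], counts["neither"])  # 1 1 1
```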
It seems that you can’t have machine learning projects without acronyms. Out of the 2359 papers, the titles of 1487 had acronyms or compound words with multiple capital letters in them. That’s 63%!
Some of these acronyms are catchy – they are easy to read and roll off the tongue:
- CLAMP: Prompt-based Contrastive Learning for Connecting Language and Animal Pose
- PATS: Patch Area Transportation with Subdivision for Local Feature Matching
- CIRCLE: Capture In Rich Contextual Environments
Some are more difficult to speak aloud than others:
- SIEDOB: Semantic Image Editing by Disentangling Object and Background
- FJMP: Factorized Joint Multi-Agent Motion Prediction over Learned Directed Acyclic Interaction Graphs
Some of them seemed to use creative license on acronym construction:
- SCOTCH and SODA: A Transformer Video Shadow Detection Framework
- EXCALIBUR: Encouraging and Evaluating Embodied Exploration
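Counting these required deciding what counts as an acronym in the first place. One rough heuristic, sketched below (our exact rule may have differed), is to flag any title containing a word with at least two capital letters:

```python
import re

# A word with at least two capital letters, e.g. "CLAMP" or "DynaMask"
ACRONYM = re.compile(r"\b\w*[A-Z]\w*[A-Z]\w*\b")

def has_acronym(title: str) -> bool:
    """Heuristic: does the title contain an acronym or compound-capital word?"""
    return bool(ACRONYM.search(title))

print(has_acronym("CLAMP: Prompt-based Contrastive Learning"))  # True
print(has_acronym("Tracking Every Thing in the Wild"))          # False
```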
In addition to the 2023 paper titles, we scraped the titles for all 2022 accepted CVPR papers. From these two lists, we calculated the relative frequency of various keywords, giving high-level insight into what is trending up and what is trending down. Here are the highlights, presented in terms of percent change relative to 2022 frequencies.
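The percent-change numbers can be computed as sketched below. The title lists here are toy examples, not the real data, and the real analysis matched keywords the same case-insensitive way:

```python
def percent_change(keyword, titles_prev, titles_curr):
    """Year-over-year % change in the fraction of titles containing keyword."""
    f_prev = sum(keyword.lower() in t.lower() for t in titles_prev) / len(titles_prev)
    f_curr = sum(keyword.lower() in t.lower() for t in titles_curr) / len(titles_curr)
    return 100 * (f_curr - f_prev) / f_prev

# Toy title lists -- the real analysis used all accepted titles from each year
titles_2022 = ["A CNN Approach", "Transformers for Detection",
               "CNN Segmentation", "GAN Inversion"]
titles_2023 = ["Diffusion Models for Editing", "A CNN Study",
               "NeRF Editing", "Diffusion Synthesis"]
print(percent_change("cnn", titles_2022, titles_2023))  # -50.0
```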
It should be no surprise to see diffusion models trending upward, with image generation models like Stable Diffusion and Midjourney going viral. Diffusion models are also finding applications in denoising, image editing, and style transfer. Add all of this up, and you get by far the biggest winner across all categories, with a 573% increase year-over-year.
Neural Radiance Fields, or NeRFs, have also grown in popularity, as exhibited by an 80% increase in usage of the word radiance, and a 39% increase for NeRF. NeRFs have moved beyond proof of concept to editing, applications, and training process optimization.
The dips for “Transformer” and “ViT” are less a sign that transformer models are going out of style than a reflection of how dominant these models were in 2022. In 2021, the word “transformer” only appeared in 37 paper titles. In 2022, that number skyrocketed to 201. Transformers are not going away any time soon.
Changing of the guard
On the other hand, CNNs appear to be falling out of favor, with mentions down 68%. Once the darling of computer vision, the CNN seems to have lost its edge in 2023. Many titles that mention CNNs also mention other models. For instance, these papers mention both CNNs and transformers:
- Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation
- Learned Image Compression with Mixed Transformer-CNN Architectures
Traditional discriminative tasks like detection, classification, and segmentation are not falling out of favor, but their share of the computer vision mind-space is shrinking due to a flurry of advances in generative CV applications, as evidenced by upticks for “Editing”, “Synthesis,” and of course “Generation”.
The keyword “mask” saw a 263% increase year-over-year, appearing 92 times in the 2023 accepted paper titles – sometimes twice in a single title. Some of these occurrences are still in the context of segmentation:
- SIM: Semantic-aware Instance Mask Generation for Box-Supervised Instance Segmentation
- DynaMask: Dynamic Mask Selection for Instance Segmentation
But the majority (64%) actually refer to “masked” tasks, including 8 instances of “masked image modeling”, and 15 “masked autoencoder” tasks. Additionally, there are 8 occurrences of “masking”.
It is also worth noting that 3 paper titles with the word “mask” actually refer to “mask-free” tasks.
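Disambiguating these “mask” contexts boils down to counting sub-phrases. A sketch (with toy titles, not the real lists):

```python
from collections import Counter

def phrase_counts(titles, phrases):
    """Count how many titles contain each phrase (case-insensitive)."""
    counts = Counter()
    for title in titles:
        low = title.lower()
        for phrase in phrases:
            counts[phrase] += phrase in low  # bool counts as 0 or 1
    return counts

# Toy titles for illustration
titles = [
    "DynaMask: Dynamic Mask Selection for Instance Segmentation",
    "Masked Autoencoders for Point Clouds",
    "Mask-Free Instance Segmentation",
]
counts = phrase_counts(titles, ["mask", "masked autoencoder", "mask-free"])
print(counts["mask"], counts["masked autoencoder"], counts["mask-free"])  # 3 1 1
```

Note that a bare substring match for “mask” also catches “masked” and “mask-free,” so the finer-grained phrases have to be tallied separately, as above.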
Zero vs Few
“Zero-shot” learning is gaining traction, with the rise of transfer learning, generative approaches, prompting, and general purpose models. At the same time, “few-shot” learning is down from last year. In terms of raw numbers, however, “few-shot” (45) maintains a slight edge over “zero-shot” (35), at least for now.
The boundaries blur
While the frequency of traditional computer vision keywords like “image” and “video” is relatively unchanged, “text”/”language” and “audio” are appearing more often. Even if the word “multi-modal” itself is not cropping up in paper titles, it is hard to deny that computer vision is trending towards a multi-modal future.
This is especially pronounced for vision-language tasks, as evidenced by the sharp upticks for “Open”, “Prompt,” and “Vocabulary”. The most extreme example of this is the compound term “Open-vocabulary”, which occurred just 3 times in 2022, but shows up 18 times in 2023.
Point cloud 9
Three-dimensional computer vision applications are moving away from inferring 3D information from 2D images (“Depth” and “Stereo”). Instead, computer vision systems are being trained to work directly on 3D point cloud data.
How abstract was your abstract? Creativity in computer vision
No attempt at comprehensive ML-related coverage in 2023 would be complete without bringing ChatGPT into the mix. We decided to make things interesting and use ChatGPT to find the most creative titles from CVPR 2023.
For each paper with a draft on Arxiv, we scraped the abstract and asked ChatGPT (via the GPT-3.5 API) to generate a title for the corresponding CVPR paper. We then used OpenAI’s `text-embedding-ada-002` model to embed both the ChatGPT-generated title and the actual title, and computed the cosine similarity between the two embedding vectors.
What can this tell us? The closer ChatGPT’s title came to the actual title, the more predictable the paper’s name was. In other words, the further off ChatGPT’s prediction was, the more “creative” the authors were in naming their paper. Embeddings plus cosine similarity gives us an interesting, albeit far from perfect, way to quantify this.
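The similarity computation itself is simple once you have the embedding vectors. Here is a minimal pure-Python version; the comment shows roughly how the vectors were obtained from OpenAI (the actual request code may have differed):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# In the real pipeline, vectors came from OpenAI's embedding endpoint, roughly:
#   openai.Embedding.create(model="text-embedding-ada-002", input=title)
# Orthogonal vectors score 0; identical titles embed identically and score ~1.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```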
We sorted the papers according to this metric. Without further ado, here are the most creative titles:
Actual: Tracking Every Thing in the Wild
Predicted: Disentangling Classification from Tracking: Introducing TETA for Comprehensive Benchmarking of Multi-Category Multiple Object Tracking
Actual: Learning to Bootstrap for Combating Label Noise
Predicted: Learnable Loss Objective for Joint Instance and Label Reweighting in Deep Neural Networks
Actual: Seeing a Rose in Five Thousand Ways
Predicted: Learning Object Intrinsics from Single Internet Images for Superior Visual Rendering and Synthesis
Actual: Why is the winner the best?
Predicted: Analyzing Winning Strategies in International Benchmarking Competitions for Image Analysis: Insights from a Multi-Center Study of IEEE ISBI and MICCAI 2021
As for the least creative titles, around 40 title pairs were either exact matches or differed only in punctuation! Want to see which titles? Try this analysis out and share what you find in the FiftyOne community Slack 🙂
Visit Voxel51 at CVPR!
Want to discuss the state of computer vision, talk about data-centric AI, or learn how the open source computer vision toolkit FiftyOne can help you overcome data quality issues and build higher quality models? Come by our booth #1618 at CVPR. Not only will we be available to discuss all things computer vision with you, we’d also simply love to meet fellow members of the CV community, and swag you up with some of our latest and greatest threads.
Want to make your dataset easily accessible to the fastest growing community in machine learning and computer vision? Add your dataset to the FiftyOne Dataset Zoo so anyone can load it with a single line of code.
Reach out to me on Linkedin! I’d love to discuss how we can work together to help bring your research to a wider audience 🙂
Join the FiftyOne community!
Join the thousands of engineers and data scientists already using FiftyOne to solve some of the most challenging problems in computer vision today!