Last week Voxel51 hosted the January 2023 Computer Vision Meetup. In this blog post you’ll find the playback recordings, highlights from the presentations and Q&A, as well as the upcoming Meetup schedule so that you can join us at a future event. Hope to see you soon!
First, Thanks for Voting for Your Favorite Charity!
In lieu of swag, we gave Meetup attendees the opportunity to help guide our monthly donation to charitable causes. The charity that received the highest number of votes this month was Foundation Fighting Blindness. We are pleased to be making a donation of $200 to them on behalf of the computer vision community!
Computer Vision Meetup Recap at a Glance
Cameron R. Wolfe // Hyperparameter Scheduling for Computer Vision
Julien Simon // An Intro to Computer Vision with Hugging Face Transformers
Hyperparameter Scheduling for Computer Vision
Cameron R. Wolfe, research scientist at Alegion and Ph.D. student at Rice University, shares some of his research on the topic of hyperparameter scheduling for computer vision. To begin, Cameron explains that hyperparameter tuning is important because deep learning can be computationally expensive. When you’re training over a dataset, you go through all of the data for multiple epochs, and at the end you check whether or not your model performs well. If it doesn’t, then you have to go back to square one. This creates a loop where, if you don’t select your hyperparameters properly, you’re continually retraining your network trying to get one that performs well. While many people attribute the computational expense of deep learning to big datasets and large models, the problem gets worse with incorrect hyperparameters because you have to incur the expense of training your deep neural network multiple times.
In his presentation, Cameron describes how to use hyperparameter schedules, how to set them properly, and practical takeaways for hyperparameters across three main areas: learning rate, training precision, and video deep learning.
The learning rate part of the presentation is based on a paper Cameron wrote with Rice – REX: Revisiting Budgeted Training with an Improved Schedule. The idea behind it is that when you consider different budget amounts for training a deep neural network, certain learning rate schedules work better than others.
Why does budget matter? Maybe you have a computational and/or monetary budget and need to train within it; or maybe you have a deadline and can’t spend too much time training your network. One of the most effective ways for working within a budget is to simply reduce your number of training epochs. Instead of training a model for 200 epochs on ImageNet, for example, you can shorten it down to 90 and you can probably get pretty good performance if you set your hyperparameters correctly. Cameron’s research looks at how to properly set the learning rate when training neural networks in a budget-aware setting.
To explore different learning rate options in a budget-aware setting, Cameron decomposes learning rate schedules into two components: profile (the continuous function that models the decay of your learning rate; classic examples are cosine, linear, exponential, and step schedules) and sampling rate (how frequently you update your learning rate from this profile).
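To make the profile and sampling rate split concrete, here is a minimal sketch in plain Python. The function names (`linear_profile`, `cosine_profile`, `scheduled_lr`) and the `sample_every` parameter are hypothetical names chosen for illustration; they are not taken from Cameron's code or from any library.

```python
import math

# A profile maps training progress z in [0, 1] to a multiplier on the
# initial learning rate. These two are standard textbook decay shapes.
def linear_profile(z):
    return 1.0 - z

def cosine_profile(z):
    return 0.5 * (1.0 + math.cos(math.pi * z))

def scheduled_lr(step, total_steps, base_lr, profile, sample_every):
    """Learning rate at `step`, re-sampling the profile every `sample_every` steps.

    `sample_every` plays the role of the sampling rate (hypothetical name):
    1 means the learning rate follows the profile at every step, while a
    larger value holds it constant between samples, giving a staircase.
    """
    sampled_step = (step // sample_every) * sample_every
    z = sampled_step / total_steps
    return base_lr * profile(z)

# Same linear profile, two sampling rates.
smooth = [scheduled_lr(s, 100, 0.1, linear_profile, 1) for s in range(100)]
stairs = [scheduled_lr(s, 100, 0.1, linear_profile, 10) for s in range(100)]
```

With `sample_every=10`, the linear profile turns into a staircase that only changes every ten steps, which is one way to see how a single profile can yield quite different schedules depending on how often it is sampled.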
To set the scene, Cameron shows two classic examples – a step schedule and a linear schedule.
What Cameron’s research contributes is a new profile for learning rate decay: reverse exponential or REX, which is a learning rate schedule that keeps the learning rate high for a while and then decays it to a lower learning rate at the end of training.
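For a feel of what "high for a while, then decays at the end" looks like, here is a small sketch of a REX-style profile. The formula below is the form of the REX profile as I recall it from the paper, so treat it as an assumption and check it against the paper before relying on it. In this form the multiplier reaches zero at the very end of training; the function names are hypothetical.

```python
def rex_profile(z):
    """REX-style decay multiplier for training progress z in [0, 1].

    Assumed form: (1 - z) / (1/2 + 1/2 * (1 - z)).
    It starts at 1, stays above a linear decay throughout, and only
    drops off quickly near the end of training.
    """
    remaining = 1.0 - z
    return remaining / (0.5 + 0.5 * remaining)

def rex_lr(step, total_steps, base_lr):
    """Learning rate at `step`, sampling the profile at every step."""
    return base_lr * rex_profile(step / total_steps)

# Compare against a plain linear decay at a few points in training.
for z in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"z={z:.2f}  linear={1.0 - z:.3f}  rex={rex_profile(z):.3f}")
```

Halfway through training, the linear multiplier is already at 0.5 while this profile is still at roughly two thirds, which matches the description of keeping the learning rate high for longer.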
Now the question is: how do the profile and sampling rate pairings perform? To find out, Cameron ran experiments for each profile to find the optimal sampling rate across seven different domains.
From this, the research takes the optimal profile and sampling rate pairings and sees how they compare across six different learning rate schedules. The main findings:
- Step schedules (commonly used in computer vision) only perform well in the high-budget regime
- REX performs really well across domains and budgets
- Many choices exist for learning rate schedules; choose the best one based on your settings – number of epochs, domain, or problem (classification, detection, etc.)
Because training deep neural networks is computationally expensive, another way to reduce costs is to find better schedules for low precision training. Specifically, in this part of the presentation, Cameron covers his ongoing research on cyclical precision training (CPT), where, rather than staying fixed, the precision used for the neural network varies cyclically throughout the training process.
The idea behind low precision training is that instead of training your neural network at the normal 32 bit precision, you can lower this to 16 bits, 8 bits, or maybe even lower, which saves compute costs. Here’s how it works: in the forward pass, you quantize your activations and weights to a lower precision before performing the matrix multiplication, which makes it faster. You can do the same thing in the backward pass, which has two matrix multiplications (one to compute your weight update and one to propagate the gradient to the previous layer), which saves twice as much compute.
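The quantize-then-multiply step can be sketched in a few lines of plain Python. This is a simulation only: it uses simple uniform symmetric quantization (my choice for illustration, not necessarily the scheme used in Cameron's research) and still computes in ordinary floats, so it shows the loss of precision but not the speed-up, which depends on hardware support. All names are hypothetical.

```python
def quantize(values, num_bits):
    """Round each float to the nearest point on a uniform symmetric grid.

    With num_bits bits there are 2**(num_bits - 1) - 1 positive levels,
    e.g. 127 for 8 bits. The grid is scaled to the largest magnitude.
    """
    levels = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in values) or 1.0  # avoid dividing by zero
    scale = max_abs / levels
    return [round(v / scale) * scale for v in values]

def quantized_linear(x, weights, num_bits):
    """Forward pass y = W x with activations and weights quantized first.

    `x` is a list of floats and `weights` is a list of rows.
    """
    n = len(x)
    xq = quantize(x, num_bits)
    flat = quantize([w for row in weights for w in row], num_bits)
    rows = [flat[i * n:(i + 1) * n] for i in range(len(weights))]
    return [sum(w * a for w, a in zip(row, xq)) for row in rows]

x = [0.3, -1.2, 0.7]
W = [[0.5, -0.1, 0.9], [-0.4, 0.8, 0.2]]
for bits in (16, 8, 4):
    print(bits, quantized_linear(x, W, bits))
```

Running it at 16, 8, and 4 bits shows the outputs drifting further from the full precision result as the bit width drops, which is the accuracy side of the trade-off.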
The question Cameron’s research addresses is: what are the best hyperparameter schedules for CPT to achieve gains (reduced cost or increased performance)? This research follows a similar approach to the REX research by decomposing the precision schedules into parts (in this case, three).
Cameron notes that the harder part is choosing repeated or triangular schedules. To better understand what is meant by repeated or triangular schedules, Cameron shows the set of 10 (repeated and reflected), which you can see below.
Cameron’s research collapses the 10 schedules into three groups (large schedules, medium schedules, and small schedules) based on the impact they have on the amount of compute savings you get. Here’s the mapping: large = RR and RTH; medium = LR, LT, CR, CT, RTV, ETV; small = ER, ETH. (You may want to come back to these mappings when you review the precision takeaways later in the post.)
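As a rough illustration of the difference between a repeated and a triangular cyclical schedule, here is a sketch that uses a linear profile only. The direction of the ramp (low to high precision within a cycle), the parameter names, and the rounding to whole bit widths are all my assumptions; the ten schedules from the talk also use other profiles and reflections that are not reproduced here.

```python
def cyclic_precision(step, total_steps, num_cycles, min_bits, max_bits,
                     triangular=False):
    """Bit width to use at `step` under a linear cyclical schedule.

    Repeated: every cycle ramps from min_bits up to max_bits, then restarts.
    Triangular: odd-numbered cycles are mirrored, so precision ramps up,
    then back down, with no sudden jump between cycles.
    """
    cycle_len = total_steps / num_cycles
    cycle_idx = min(int(step // cycle_len), num_cycles - 1)
    z = (step - cycle_idx * cycle_len) / cycle_len  # progress within the cycle
    if triangular and cycle_idx % 2 == 1:
        z = 1.0 - z
    return round(min_bits + (max_bits - min_bits) * z)

repeated = [cyclic_precision(s, 100, 4, 3, 8) for s in range(100)]
triangle = [cyclic_precision(s, 100, 4, 3, 8, triangular=True) for s in range(100)]
print(repeated[20:30])
print(triangle[20:30])
```

Printing the values around a cycle boundary shows the repeated schedule dropping straight back to the lowest precision, while the triangular one turns around and descends gradually.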
Cameron walks us through some of the highlights from the exploration:
- There tends to be a correlation between the model performance and the amount of training compute that we’re using. Cameron adds “if we use less compute or have more computational savings, we tend to get a little bit worse performance and vice versa. So even though we can use these alternative precision schedules, in a lot of cases we’re not getting these compute benefits for free. If we want our training to go by faster, we’re probably going to get a model that performs not quite as good.”
- In certain cases using alternative schedules (beyond just cosine schedules) can be beneficial. For example, when training a ResNet-18 on ImageNet, the best performance actually comes with the exponential schedule.
- Some settings are sensitive to heavy quantization, so you can’t always train a network with really low precision; it depends on the domain. Be careful about how low you set your training precision.
Key takeaways for training precision are:
- You can gain a lot by exploring alternative hyperparameter schedules for cyclic precision training.
- Here’s Cameron’s guidance on how to choose one:
  - Use small schedules to minimize training costs
  - Use large schedules to maximize model performance
  - Use medium schedules to find a balance
Now for a twist: none of these can actually be used practically right now because current hardware only supports low precision training at certain levels. Cameron recommends “the best thing to use for low precision training right now is just something called Apex,” which implements 16 bit precision training with minimal code changes required. For more information, check out Cameron’s article: Quantized Training with Deep Networks.
Video deep learning
The last part of Cameron’s presentation is about video deep learning and how cyclical batch sizes can speed up wall-clock training time. In this section, Cameron walks us through some background found in the SlowFast Networks for Video Recognition paper. The idea is that instead of having just a single 3D CNN that we convolve over the entire input video over multiple layers, we separate our network into two different modules: the Slow Pathway, which is in charge of capturing spatial or semantic features; and the Fast Pathway that’s in charge of capturing motion features.
So how can you speed up the training of video models like these? The answer proposed in the paper, A Multigrid Method for Efficiently Training Video Models, is to vary the mini-batch size according to a hyperparameter schedule.
In the image below on the right, you can see the results for training over a Kinetics dataset on a single GPU using this multigrid training method. This leads to improved efficiency – the same performance in just two days instead of nearly an entire week!
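Here is a small sketch of the batch size idea, based on my understanding of the multigrid paper: when clips are sampled at a lower temporal or spatial resolution, the mini-batch can grow so that the work per iteration stays roughly constant. The shapes, the staged schedule, and the function names below are hypothetical and are not the settings used in the paper.

```python
def scaled_batch_size(base_batch, base_shape, new_shape):
    """Batch size that keeps batch * frames * height * width about constant.

    Shapes are (frames, height, width) tuples.
    """
    base_t, base_h, base_w = base_shape
    t, h, w = new_shape
    return max(1, (base_batch * base_t * base_h * base_w) // (t * h * w))

def batch_schedule(epoch, total_epochs, base_batch, base_shape, shapes):
    """Pick a clip shape by training progress, then scale the batch to match.

    `shapes` is ordered from coarse (early training) to fine (late training).
    """
    idx = min(epoch * len(shapes) // total_epochs, len(shapes) - 1)
    shape = shapes[idx]
    return shape, scaled_batch_size(base_batch, base_shape, shape)

base_shape = (16, 224, 224)
shapes = [(8, 112, 112), (16, 112, 112), (16, 160, 160), (16, 224, 224)]
for epoch in (0, 30, 60, 90):
    print(epoch, batch_schedule(epoch, 100, 8, base_shape, shapes))
```

Early epochs in this sketch train on small, short clips with a batch eight times larger than the base, and the final stage returns to the full resolution and the base batch size.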
Hyperparameter schedules are fundamental. You can apply them in different scenarios and they end up being useful for finding a benefit (reduced compute, increased performance, reduced training time, etc.).
Here’s a recap of the live Q&A from this presentation during the virtual Computer Vision Meetup:
Q: Any comment on REX in regards to transfer learning?
A: We didn’t test that in this case, but that would be a really interesting test to run. A lot of times what I’ve found is that when you’re fine tuning a network on a downstream dataset, if you start with too high of a learning rate, that might cause problems. So my guess would be that REX is fine for transfer learning but you have to be careful with setting the initial learning rate to make sure that it’s not too high. But again, that would be pretty cool to test.
Q: In addition to problem domain and budget, do you see that just the dataset itself (size, diversity, etc.) can cause one LR to work better or worse?
A: Yes, all of these things are factors that you have to consider when choosing hyperparameters. So we can boil this down to the question of if you’re training a neural network on one dataset and you switch it and try to train it on another dataset, are you going to use the same hyperparameters out of the box? Probably not. You’re going to have to test a few things and see what works. When you change domains or datasets, you’re going to have to do more tuning and figure out what’s best for your new dataset.
Q: When do you think we will see convex optimizers for arbitrary neural networks?
A: Neural networks are by nature nonconvex. But at the same time, even though the optimization problem with neural networks is nonconvex, for a lot of optimization algorithms, all of the proofs are in convex settings or simplified settings. Analyzing nonconvex convergence rates is a lot harder than the convex proofs. So typically it seems like for deep learning we can take intuitions from convex optimization and see whether they’ll work. And in the case of SGD, for example, it works really well.
Q: How are we initiating REX?
A: With REX, you choose an initial learning rate, you have a schedule by which that learning rate is varied, and then from the beginning to the end of your training, you decay the learning rate from the initial choice to something like 10x lower than that initial choice, or maybe 100x, according to that schedule. So there’s no initiation needed, it’s just a fixed function profile by which we decay our learning rate.
Q: In addition to randomizing training sets, do you insert intentional “disturbances”?
A: No, really all we do here is pick a bunch of different domains, a bunch of different publicly available datasets, and then run training across all of these to see which learning rate schedules work best. All of the setups are pretty standard I would say; we’re not doing anything extra just running training over public datasets.
Q: Low Precision training, is that also called Quantization Aware Training?
A: Quantization aware training refers to methods that make your network perform better when you make it smaller at the end so that you can deploy it onto an edge device, for example. The purpose of that is training the network in a way that when you quantize it, it will still perform well at the edge because when you convert it to a low precision representation, that uses less memory so that you can deploy it. The difference between that and what I’m doing here is that cyclical precision training (CPT) is not focused on trying to create a smaller neural network at the end, it’s focused upon trying to reduce training costs in general. We’re performing the training process in a low precision manner so that we can save compute costs during training. We’re not necessarily trying to quantize the network to make it smaller at the end. Though, there could be a relationship between low precision training and quantization aware training such that performing low precision training like this could be a good method of quantization aware training to where these networks are easier to quantize at the end.
Q: Is CPT theoretical (i.e., implemented entirely in software)?
A: Correct, there’s no hardware support for this yet. I hope that hardware will catch up with it eventually, but right now the best thing to use is Apex because you can train it at floating point 16 precision, it’s easy to implement, and it’ll increase speed without losing any performance, though check to make sure.
Q: Are there trade offs/disadvantages for Low Training Precision?
A: Yes. There’s a correlation between model performance and training compute. So if you’re lowering the precision too much, you’re going to pay the cost: your network is going to deteriorate in performance. This is relative to training with a fixed, but lower precision level. But if you adopt a different strategy that doesn’t quantize quite as aggressively, you might actually improve model performance while saving compute. In the talk I shared an example of an ImageNet case where the exponential precision schedules both reduced compute and improved performance compared to the baselines. So it depends; there is a trade off if you quantize too much, but there are certain cases where you can actually get better performance and save training costs. It just depends on the scenario.
Check out the additional resources on the presentation:
- Talk transcript
- Presentation links (to papers, articles, etc.)
- All links (newsletter, Twitter, company, lab, etc.)
Thank you Cameron on behalf of the entire Computer Vision Meetup community for sharing your research and helping us better understand how to approach the tuning of hyperparameters!
An Intro to Computer Vision with Hugging Face Transformers
Julien Simon’s talk provides an overview of what it means to work with computer vision models, especially those hosted on the Hugging Face Hub. Julien starts by sharing a brief history of computer vision in deep learning (in fact, what got him into deep learning was computer vision!).
For a long time, working on computer vision applications meant working with convolutional neural networks (CNNs). But, Julien notes, “the golden age of CNNs is probably behind us.” Why? Enter the Vision Transformer (ViT), originally published by Google in 2021 and described in the paper: An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale.
Julien summarizes how the ViT works: ViT breaks an image into patches, which are flattened and treated as a sequence of tokens. This results in several key benefits. First, Julien shares a benchmark from the research paper that compares the vision transformer to ResNet-152, which was state of the art at the time; the ViT performed quite a bit better. Also in that paper, training the ViT was four times less compute intensive than training ResNet-152. Additionally, transfer learning is built in, meaning you can train your model initially on one dataset, learn the embeddings, and then specialize that model for downstream tasks. Finally, ViT offers state-of-the-art accuracy.
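The patch step is easy to picture with a toy example. The sketch below splits a single-channel image, stored as a list of rows, into non-overlapping square patches and flattens each one into a token. It covers only the patching; in the actual ViT each flattened patch is then passed through a learned linear projection and combined with position embeddings, which is omitted here. The function name is hypothetical.

```python
def image_to_patch_tokens(image, patch_size):
    """Split an H x W image (list of rows) into flattened square patches.

    Patches are read left to right, top to bottom, and each patch is
    flattened row by row into one list of pixel values (one "token").
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            tokens.append(patch)
    return tokens

# A 4x4 toy image with 2x2 patches gives 4 tokens of 4 values each.
toy = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
print(image_to_patch_tokens(toy, 2))
```

At real scale, a 224×224 image cut into 16×16 patches yields 14 × 14 = 196 tokens, which is where the "16×16 words" in the paper title comes from.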
CV Transformers are gaining in popularity, but historically have been difficult to work with. This opened up an opportunity for Hugging Face to build open source libraries, a website, and commercial tools that make it very easy to find, share, and work with state-of-the-art models, even if you’re not an expert.
Here’s a sampling of Hugging Face by the numbers:
- 120,000+ models
- 18,000+ datasets
- 25+ ML libraries including Keras, scikit-learn, fastai, and more
Zooming in on computer vision, Julien notes that Hugging Face hosts about 4,000 computer vision models covering a variety of tasks across four broad categories.
Julien then heads over to the Hugging Face Hub to show us how easy it is to download, train, and deploy models. He also shows the inference widget where you can predict images in the Hub, as well as links to datasets that the model has been trained on and Spaces, where you can quickly build, host, and share your ML apps.
Live demo time! Starting at ~16:49 in the presentation, Julien shows an app he built in Spaces that predicts an image (prediction happens with just two lines of Python code!) with 10+ models and displays the results.
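For readers who want to try this locally, prediction with the Transformers library can look roughly like the sketch below. This is not the code from Julien's demo. It assumes the `transformers` library plus a backend such as PyTorch and Pillow are installed, and it uses `google/vit-base-patch16-224` as an example model ID; the wrapper function and constant are hypothetical names. The model is downloaded from the Hub on first use.

```python
DEFAULT_MODEL = "google/vit-base-patch16-224"  # example ViT checkpoint on the Hub

def classify_image(image_path_or_url, model_id=DEFAULT_MODEL):
    """Return the top predictions for one image as a list of dicts.

    Each dict has a "label" and a "score". The import is done inside the
    function so this file can be loaded without transformers installed.
    """
    from transformers import pipeline

    # These are the "two lines": build the pipeline, then call it.
    classifier = pipeline("image-classification", model=model_id)
    return classifier(image_path_or_url)

# Example usage (needs network access and the libraries above):
# for prediction in classify_image("path/to/your_image.jpg"):
#     print(prediction["label"], round(prediction["score"], 3))
```

Swapping `model_id` for another image classification model from the Hub is all it takes to compare models on the same image.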
But what about more advanced models? Julien picks three (image captioning, zero shot segmentation using a text prompt, and audio classification using spectrograms) and shows several interactive live demos on Spaces. Then he covers multi-modal transformers and diffusers, all with live demos and code snippets. Experimenting with models has never been easier!
In addition to working with inferences and models, you can also train and deploy with Hugging Face. Julien walks us through the “family picture” that includes all the goodies available from Hugging Face today.
Before opening up for Q&A, Julien shares some links to get you started:
- Tasks: https://huggingface.co/tasks
- Training course: https://huggingface.co/course/
- Docs: https://huggingface.co/docs
- GitHub: https://github.com/huggingface
- Forum: https://discuss.huggingface.co/
- Support: https://huggingface.co/support
Q: Is ViT already being used in production somewhere?
A: Yes, we have customers who use a vision transformer in production. If you’re wondering about potential challenges in production, I can see two. The first is fine tuning the model on your own data, but that’s easy enough to do. The second challenge you might run into is inference latency. If you have really low latency requirements or if you’re deploying models at the edge, then transformers may still be a little bit too big (as compared to CNNs). Although, in the classifier space, you’ll find models that are just a few megabytes; they have small mobile compatible versions of vision transformers.
Q: Can we target edge devices for Hugging Face models?
A: Yes. Although the hardware requirements for some transformer models can still be overwhelming for edge devices, there’s a lot of work going on there to shrink the models (e.g., you can find models that are 2-3 megabytes). Reductions are happening through smaller architectures, techniques like quantization and pruning, and optimization tools. In addition, if you need to consider transformer models at the edge, look for a hardware solution that will meet your needs and fit within your hardware budget.
Q: When playing with applications in Hugging Face Spaces, are user uploaded images ultimately archived, or deleted, or something else?
A: Nothing is stored in a Space; there’s no data storage, there’s no data usage, and none of the data you send either on Spaces or on inference widgets, etc. is stored. Unless you want your Space to store something there somewhere, there’s no other code than the code you’re able to see in the Space.
Q: You mentioned 4x better efficiency for ViT compared to ResNet. Is that training? If so, is that training from scratch or transfer learning from a pretrained model, or both?
A: Yes, in the research paper, they reported a 4x training speed up compared to ResNet-152, and that was training from scratch. Generally, there seems to be some consensus, based on what I read in multiple papers, that transformers are a fit when you have a ton of data. For example, if you have 10M+ images and you train from scratch, you’ll get better results with transformers than CNNs. On the other hand, if you have only, let’s say, a hundred thousand images to train initially, CNNs tend to do better. However, one of the big benefits of transformers is transfer learning. So, unless you have a super exotic domain, I would encourage you to start from one of those pre-trained models, evaluate them on your own data, and fine tune a little bit.
Q: Can you compare the size of transformers to equivalent CNNs?
A: The vision transformer (going from memory) is about 300 to 400 megabytes, which is pretty large. But like I said, you can find some downscaled versions, even down to tens of megabytes or less; small enough that you could fit on a small device.
A big thank you to Julien on behalf of the entire Computer Vision Meetup community for getting us up-to-speed with Hugging Face Transformers!
Computer Vision Meetup Locations
Computer Vision Meetup membership has grown to more than 2,500 members in just a few months! The goal of the meetups is to bring together communities of data scientists, machine learning engineers, and open source enthusiasts who want to share and expand their knowledge of computer vision and complementary technologies.
New Meetup Alert – We just added a Computer Vision Meetup location in Singapore! Join one of the (now) 13 Meetup locations closest to your timezone.
- Ann Arbor
- New York
- San Francisco
- Silicon Valley
Upcoming Computer Vision Meetup Speakers & Schedule
We have an exciting lineup of speakers already signed up for February and March. Become a member of the Meetup closest to you, then register for the Zoom for the Meetups of your choice.
- Breaking the Bottleneck of AI Deployment at the Edge with OpenVINO — Paula Ramos, PhD (Intel)
- Understanding Speech Recognition with OpenAI’s Whisper Model — Vishal Rajput (AI-Vision Engineer)
- Zoom Link
- Lighting up Images in the Deep Learning Era — Soumik Rakshit, ML Engineer (Weights & Biases)
- Training and Fine Tuning Vision Transformers Efficiently with Colossal AI — Sumanth P (ML Engineer)
- Zoom Link
There are a lot of ways to get involved in the Computer Vision Meetups. Reach out if you identify with any of these:
- You’d like to speak at an upcoming Meetup
- You have a physical meeting space in one of the Meetup locations and would like to make it available for a Meetup
- You’d like to co-organize a Meetup
- You’d like to co-sponsor a Meetup
Reach out to Meetup co-organizer Jimmy Guerrero on Meetup.com or ping him over LinkedIn to discuss how to get you plugged in.
The Computer Vision Meetup network is sponsored by Voxel51, the company behind the open source FiftyOne computer vision toolset. FiftyOne enables data science teams to improve the performance of their computer vision models by helping them curate high quality datasets, evaluate models, find mistakes, visualize embeddings, and get to production faster. It’s easy to get started, in just a few minutes.