
Recapping the Computer Vision Meetup — December 2022

Last week Voxel51 hosted the December 2022 Computer Vision Meetup. What a wonderful event! Our amazing speakers shared insightful presentations, the virtual room was packed, and the Q&A was vibrant! In this blog post we provide the recordings, recap the presentation highlights and Q&A, and share the upcoming Meetup schedule so that you can join us at a future event. Hope to see you soon!

First, Thanks for Voting for Your Favorite Charity!

In lieu of swag, we gave Meetup attendees the opportunity to help guide our monthly donation to charitable causes. The charity that received the highest number of votes was Children International. We are pleased to be making a donation of $200 to them on behalf of the computer vision community!


Meetup Recap at a Glance

Talk #1: Wearable Vision Sensors Summary

Kris Kitani, associate research professor at the Robotics Institute at Carnegie Mellon University, shares some of the research and projects his lab is working on centered around the concept of wearable vision sensors. Kris explains that computer vision is currently undergoing a paradigm shift from third-person data, such as pictures somebody took and uploaded, to first-person data, where the sensors themselves are on the user. He then shows three early prototype systems based on wearable sensors.

First, Kris walks through a learning based model where the goal is to accurately estimate a person’s hand pose using a smartwatch. This model’s inputs include hand and motion history images, and the output is 19 different joint positions and their angles to describe the joints on a hand. Then, using inverse kinematics to solve for what the hand might look like, the estimated hand pose can be visualized using computer graphics.
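To make the inverse-kinematics step concrete, here is a minimal sketch (not Kris's actual pipeline) that solves closed-form IK for a simplified planar two-joint "finger": given a target fingertip position, recover the joint angles, then verify them with forward kinematics. The function names and link lengths are illustrative.

```python
import math

def two_link_ik(x, y, l1, l2):
    """Closed-form inverse kinematics for a planar two-joint 'finger'.
    Returns (theta1, theta2) in radians, or None if the target is unreachable."""
    d2 = x * x + y * y
    c2 = (d2 - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    if not -1.0 <= c2 <= 1.0:
        return None  # target lies outside the reachable workspace
    theta2 = math.acos(c2)
    theta1 = math.atan2(y, x) - math.atan2(
        l2 * math.sin(theta2), l1 + l2 * math.cos(theta2)
    )
    return theta1, theta2

def forward(theta1, theta2, l1, l2):
    """Forward kinematics: joint angles -> fingertip position."""
    x = l1 * math.cos(theta1) + l2 * math.cos(theta1 + theta2)
    y = l1 * math.sin(theta1) + l2 * math.sin(theta1 + theta2)
    return x, y

# Recover joint angles that place the fingertip at (1.2, 0.8)
angles = two_link_ik(1.2, 0.8, l1=1.0, l2=1.0)
tip = forward(*angles, l1=1.0, l2=1.0)
```

A full hand model solves this jointly for 19 joints in 3D, but the idea is the same: angles that reproduce the observed pose.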

Kris moves onto the next model, which uses a chest mounted camera for pose estimation. This model centers around three streams: one that estimates how the chest camera is moving, another that is looking at how the pose is changing relative to the camera (joints, angles, etc.), and a third stream that is focused on which way the head is looking in 3D. Kris shows a demo video visualizing a few scenes reconstructed from a chest mounted camera and this model.

The last model Kris describes uses a head mounted camera for pose estimation. In this model, the input is the egocentric, forward-facing camera, and a detected object. The output is an object-aware human pose estimate of the person wearing the camera. Kris shows demo videos of what the camera is seeing, alongside a computer-generated reconstruction of what that person is doing. Learn more and see it all in action in the video replay.

Talk #1 Video Replay

Talk #1 Q&A Recap

Here’s a recap of the live Q&A from this presentation during the virtual Computer Vision Meetup:

Can you use hand pose estimation to determine if someone pulled a trigger or not?

When you’re holding an object with a trigger, there’s only so much that your hand can do, relative to other examples where the hand has total free motion. So I would assume that hand pose estimation would actually work pretty well in this situation. Incorporating auditory information would also increase recognition.

Have you tested hand pose estimation out for different sized people?

We have not done a full, robust study that would be needed to make this a commercial product. So far, we’ve just tried it on a small group of graduate students. However, I think it’s fairly robust and with enough data, yes, it should be able to be generalized to more people if done properly.

What minimum camera resolution do you need?

We haven’t run that experiment. So far we’ve been using GoPro cameras for most of these experiments, so that will give you the range of the resolutions that we’ve been working with.

To keep ethical AI in perspective, would a robust study also include people of different ethnicity (aka skin color)?

Yes, of course. One of the questions earlier kind of alluded to that as well. Different body sizes, different skin colors would definitely need to be taken into consideration for commercial development.

Can the chest mounted camera be used to know if a person is following you?

As originally scoped, I would think not, but there might be an interesting way to do that.

How can you hallucinate or predict what an arm is doing that is out of the field of view?

That’s essentially what we’re doing with the physics simulation: we try to hallucinate what the other limbs are doing even though we can’t see them.

How are these simulated environments generated?

You could use any physics simulator that you want, but the way we’re generating it is that we’re running object detection. In newer methods, we’re doing a 3D reconstruction and then inserting these synthetic objects into the scene and then we use them when we’re running the physics simulation. There’s a whole stream of computer vision work on just scene understanding and scene reconstruction, which could be leveraged, but we’re not going that deep into it at the moment.

Is the inferencing and computation done on an edge device or on a server?

All of these are running in real time, using the device and also a desktop computer. Because they run in real time, I could conceivably run them on the embedded device, but we are not running it like that right now.

Have you thought about predicting the next move of the user instead of just recognizing the motion in the current timestamp?

Yes, I have a stream of work on trajectory forecasting and activity forecasting, which I didn’t talk about today. We do have an ACCV paper and an ICCV paper on using a wearable camera to try to predict what people are doing.

The emphasis on moving from third to first-person view is fabulous, but so much of vision for the last 20–30 years has been third-person view. In your experience so far, what is most different about doing computer vision in ego (first person) versus exo (third person)?

With a wearable first-person camera, we have a much higher resolution video of people with their hands interacting with objects, which opens up a whole new area of what it means to understand people when they’re interacting with objects. Grasp taxonomies, pressure points, detailed manipulation — it’s actually kind of intersecting with a lot of robotics research. So not only is it different, it’s opening up a whole new world of research topics because of this new perspective that we have.

Talk #1 Additional Resources

Check out this additional resource on the presentation:

Thank you Kris on behalf of the entire Computer Vision Meetup community for sharing your research and early prototype systems that will inspire and influence commercial developments yet to come!

Talk #2: The Future of Data Annotation: Trends, Problems & Solutions Summary

Anna Petrovicheva presented on the future of data annotation. The data annotation market is growing at 26% per year, a rapid growth rate that reflects the importance of categorizing and labeling data in an AI-driven world. In the talk, Anna shared five major trends related to data annotation and what innovations we can expect in the upcoming years:

  1. Working with datasets is an iterative process. In the past, you mostly focused on a dataset one time. Now, people typically think of a "first chunk of the data, and follow-on chunks; they're different," said Anna. The first chunk you want to be big, meaningful, and diverse; it typically covers 70% of your use case. When you deploy in production, the follow-on chunks cover the 30% of the business case that you didn't predict beforehand.
  2. Going beyond 2D. Now, there is an abundance of LiDAR data available, as well as radar data from satellites, and medical imaging from MRI scans. You will see data annotation moving to support more and more types of data, including a variety of 3D data types, so that they are easy to work with.
  3. Data annotator is a profession. When the AI boom started, data scientists were annotating their data. Now there are in-house teams, data agencies, and even crowdsourcing options available for data annotation.
  4. Data is moving to open source tools. One trend Anna sees is that open source tools will be as prevalent as, or even more popular than, closed source solutions. Examples: CVAT and FiftyOne are both open source and available on GitHub, and there are many more.
  5. There will be more data flows and flexible integrations. The industry is moving from manual dataset management to automated, reliable data flows available anytime. Data flows vary by company, but there are starting to be common tools underpinning data flows, such as tools like CVAT and FiftyOne, that can be connected via APIs.
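The last trend can be sketched in a few lines. This is a hypothetical flow, not an actual CVAT or FiftyOne API: `fetch_completed_tasks` stands in for a real annotation-tool endpoint, and the "dataset" is a plain dict standing in for a curation tool.

```python
# Hypothetical sketch of an automated annotation data flow: a labeling tool
# (e.g. CVAT) exports finished tasks, and a curation tool ingests them.
# fetch_completed_tasks() is a stand-in for a real API call.

def fetch_completed_tasks():
    """Stub for an annotation-tool API endpoint returning finished label jobs."""
    return [
        {"image": "img_001.jpg", "labels": [{"class": "chair", "bbox": [10, 20, 50, 80]}]},
        {"image": "img_002.jpg", "labels": [{"class": "table", "bbox": [5, 5, 120, 60]}]},
    ]

def merge_into_dataset(dataset, tasks):
    """Idempotently merge newly annotated images into the working dataset."""
    for task in tasks:
        dataset[task["image"]] = task["labels"]
    return dataset

dataset = {}
dataset = merge_into_dataset(dataset, fetch_completed_tasks())
```

The point is the shape of the pipeline: annotations move between tools automatically via APIs, rather than being exported and imported by hand.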

Talk #2 Video Replay

Talk #2 Q&A Recap

What are the most popular annotation formats?

There are a lot of them and they very much depend on the type of the data and type of the annotation that you are working with. For example, if we are talking about bounding boxes, probably the most popular format is COCO format, which encodes the coordinates of the bounding boxes on each image. There are other formats for data segmentation, for polygon annotation, or annotation in 3D. So, it depends on the data. You can check the supported formats in CVAT in our README on GitHub.
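For reference, a minimal COCO-style detection file looks like the following. The `bbox` field is `[x, y, width, height]` in pixels, and annotations link to images and categories by id; the file name and values here are made up for illustration.

```python
import json

# Minimal COCO-style detection annotations: three top-level arrays,
# with bounding boxes stored as [x, y, width, height] in pixels.
coco = {
    "images": [{"id": 1, "file_name": "street.jpg", "width": 640, "height": 480}],
    "categories": [{"id": 1, "name": "car"}],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1, "bbox": [100.0, 150.0, 80.0, 40.0]}
    ],
}

# Round-trip through JSON, then convert one bbox to corner form (x1, y1, x2, y2)
loaded = json.loads(json.dumps(coco))
x, y, w, h = loaded["annotations"][0]["bbox"]
corners = (x, y, x + w, y + h)
```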

Is there version control for annotations?

Yes, there are several tools for data versioning. My favorite one is called DVC (Data Version Control). It's also a fellow open source tool.

Talk #2 Additional Resources

Check out these additional resources:

A big thank you to Anna on behalf of the entire Computer Vision Meetup community for sharing the important trends in data annotation and the AI industry more broadly!

Talk #3: Using Similarity Learning to Improve Data Quality Summary

Kacper Łukawski, developer advocate at Qdrant, gave a talk about similarity learning and how it can be used to improve data quality, in particular in the context of image-based tasks.

Real-world data is a living structure: it grows day by day, changes a lot, and becomes harder to work with. The process of splitting or labeling it is error-prone, and these errors can prove very costly. Similarity learning is not only a great tool for data classification, even in the case of extreme classification, but also a surprisingly good way to improve the quality of your data.

Kacper takes us on a journey to understand the inner workings of similarity learning, starting with defining traditional neural networks that have been designed to solve classification problems. He uses this as a foundation to thoroughly compare and contrast similarity learning.

Kacper then shows similarity learning in action with a web furniture marketplace that includes images of products, along with their names and the categories they belong to. In this example, the marketplace wants to automate the process of assigning a product to a category, both to determine whether a new category should be created for a better user experience and to sell a little more by identifying and correcting products with wrong categories. Kacper shows various ways similarity learning can automate this process and surface the items that are assigned incorrectly. You can dive into all of the details in the replay.

Talk #3 Video Replay

Talk #3 Q&A Recap

Can you please elaborate on differences between fine tuning the last layer and performing a clustering on embedding space based on distance?

Actually, these are two steps in the same process. First, if you take a pre-trained model, let’s say trained on ImageNet, and you want to use it in a different domain, then the original embeddings won’t be able to capture the similarities between those new objects that well. So the goal of fine tuning the last layer is to adjust the original embeddings in a way that those similar points will be closer to each other than the originals would be. Comparing the distance could be done on those original embeddings, but they do not provide the best possible quality out of the box. So, in this approach, we started with fine tuning and then compared the embeddings after that.

What are the advantages of Similarity Learning by looking at the embedding space rather than the final classification (of the network) after softmax?

You could probably use the final output of that network. But since it was done to solve a classification problem, those vectors won’t capture the regularities in your image data; instead, they will be trying to model the probability distribution of your classes. In the example I shared, we are not trying to solve the classification problem that the pre-trained model was trained for, instead we want to determine the content of the image, what it represents. Those previous vector representations are of a lower dimensionality, and they have a different data representation. This is where Similarity Learning comes in.
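A small numerical sketch of this point, with invented embeddings and classifier weights: two inputs can receive nearly identical softmax outputs (both confidently the same class) while their embeddings still differ, so distances in embedding space preserve structure that the class probabilities discard.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two hypothetical image embeddings that are visibly different in
# embedding space, yet both map confidently to "class 0".
e1 = np.array([1.0, 0.0, 0.0, 0.0])
e2 = np.array([0.9, 0.05, 0.2, 0.3])

# Toy 2-class classifier head (weights invented for illustration)
W = np.array([[10.0, 0.0, 0.0, 0.0],
              [0.0, 10.0, 0.0, 0.0]])

p1, p2 = softmax(W @ e1), softmax(W @ e2)
softmax_gap = float(np.abs(p1 - p2).sum())  # near zero: softmax saturates
embedding_sim = cosine(e1, e2)              # clearly below 1: structure kept
```

The softmax outputs are nearly indistinguishable, while the embedding-space similarity still reflects the difference between the two inputs.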

Is similarity learning compatible with transfer learning using pretrained models?

Similarity learning is similar to transfer learning, which takes an existing model and applies it to a different task or domain. The only difference here is that transfer learning doesn’t require any training. For example, transfer learning could be used to recognize the class on an object. But in our example, we don’t want to recognize a class, we want to have a data representation so that two objects that are similar in reality will also be similar in the embeddings space using a distance function, which is why we need to undergo a training effort and use similarity learning.

How would you optimize the similarity comparison to find outliers? Comparing every point with the other is computationally demanding with large data.

That would be an issue if we want to calculate the distance metrics to all the points, which is why we typically suggest focusing on specific subsets of the data. But of course, if you want to find some outliers in the whole set, you could approach it in a different way. Let’s say you have all your embeddings already collected, and if you want to find some outliers, you could use a vector database, such as Qdrant, and simply try to calculate the distance of every single point you have in a collection to some different points. A point which is an outlier will have the closest point quite far from it. But if the distances are quite low, then it may indicate there is a cluster in your data.

And if you have a vector database like Qdrant that is already optimized, thanks to HNSW graphs you will be able to find the closest points with their distances quite easily, which will no longer require calculation of the distance metrics.
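A brute-force sketch of this nearest-neighbour scoring idea, with synthetic data: a point whose closest neighbour is far away is a candidate outlier. A vector database with HNSW indexing performs the same lookup approximately at scale, without computing all pairwise distances.

```python
import numpy as np

# Synthetic embeddings: a dense cluster plus one far-away point
rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=0.1, size=(50, 8))
outlier = np.full((1, 8), 5.0)
points = np.vstack([cluster, outlier])  # index 50 is the planted outlier

# Pairwise distances; mask the diagonal so a point is not its own neighbour
d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
np.fill_diagonal(d, np.inf)
nn_dist = d.min(axis=1)  # each point's distance to its nearest neighbour

# Highest nearest-neighbour distance = most isolated point
suspects = np.argsort(nn_dist)[::-1][:1]
```

Points inside the cluster have tiny nearest-neighbour distances, while the planted outlier's nearest neighbour is far away, so it tops the suspect list.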

How is the anchor selected for the anchor-based similarity measure?

If you want to encode the category name, then this is already an anchor; we want to capture the points that have some encodings that are not close to the category name encoding. But with diversity search, we can select it randomly, because the assumption is that our outliers should be far away from every single item from the dataset that is not the outlier.

What are the differences between Qdrant and Milvus?

In a nutshell, Qdrant and Milvus solve the same problem. At Qdrant we performed some benchmarks, which are available on our website. From my point of view, using Qdrant with some additional metadata stored for every single vector is way more efficient as compared to Milvus. But this is just my experience. I’m not an expert in Milvus personally. If you want to discuss more, please feel free to reach out and we can do a more detailed comparison.

Talk #3 Additional Resources

Check out these additional resources on this talk:

Thank you to Kacper on behalf of the entire Computer Vision Meetup community for the very thorough exploration of using similarity learning to improve data quality.

Computer Vision Meetup Locations

Computer Vision Meetup membership has grown to more than 2,000 members in just a few months! The goal of the meetups is to bring together a community of data scientists, machine learning engineers, and open source enthusiasts who want to share and expand their knowledge of computer vision and complementary technologies. If that's you, we invite you to join the Meetup closest to your timezone:

Upcoming Computer Vision Meetup Speakers & Schedule

We recently announced an exciting lineup of speakers for January, February, and March. Become a member of the Meetup closest to you, then register for the Zoom sessions for the Meetups of your choice.

January 12

  • Hyperparameter Scheduling for Computer Vision — Cameron Wolfe (Alegion/Rice University)
  • An introduction to computer vision with Hugging Face transformers — Julien Simon (Hugging Face)
  • Zoom Link

February 9

  • Breaking the Bottleneck of AI Deployment at the Edge with OpenVINO — Paula Ramos, PhD (Intel)
  • Understanding Speech Recognition with OpenAI’s Whisper Model — Vishal Rajput (AI-Vision Engineer)
  • Zoom Link

March 9

  • Lighting up Images in the Deep Learning Era — Soumik Rakshit, ML Engineer (Weights & Biases)
  • Training and Fine Tuning Vision Transformers Efficiently with Colossal AI — Sumanth P (ML Engineer)
  • Zoom Link

Get Involved!

There are a lot of ways to get involved in the Computer Vision Meetups. Reach out if you identify with any of these:

  • You’d like to speak at an upcoming Meetup
  • You have a physical meeting space in one of the Meetup locations and would like to make it available for a Meetup
  • You’d like to co-organize a Meetup
  • You’d like to co-sponsor a Meetup

Reach out to Meetup co-organizer Jimmy Guerrero, or ping him over LinkedIn, to discuss how to get plugged in.

The Computer Vision Meetup network is sponsored by Voxel51, the company behind the open source FiftyOne computer vision toolset. FiftyOne enables data science teams to improve the performance of their computer vision models by helping them curate high quality datasets, evaluate models, find mistakes, visualize embeddings, and get to production faster. It's easy to get started in just a few minutes.