Audio and AI Meetup - August 6, 2026

Start: 2026-08-06
End: 2026-08-06

Aug 06, 2026

9:00 AM - 11:00 AM PT

Online. Register for the Zoom!

About this event

Join our virtual meetup to hear talks from experts on cutting-edge topics across AI, ML, and computer vision.

Schedule

Curating, Searching, and Evaluating Audio Datasets in FiftyOne

In this talk, we'll start with the ESC-50 environmental-sound dataset to show how FiftyOne represents audio: browsing clips in the tabular view, rendering spectrograms directly in the sample grid with a custom renderer, and turning sounds into searchable vectors with CLAP embeddings. Then we'll demo a similarity-search panel that lets you query an entire audio collection by example clip or a natural-language prompt to quickly find matching sounds.
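
As a rough illustration of the query-by-example idea described above: once every clip has an embedding vector, similarity search can be reduced to ranking clips by cosine similarity to the query vector. The sketch below is not the speakers' code and does not use the FiftyOne or CLAP APIs. The function names, clip names, and three-dimensional toy vectors are hypothetical stand-ins for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query, embeddings, k=3):
    """Return the k (name, score) pairs most similar to the query vector."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in embeddings.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings; real audio embeddings have hundreds of dimensions.
toy_embeddings = {
    "dog_bark": [0.9, 0.1, 0.0],
    "rain": [0.0, 0.8, 0.6],
    "door_knock": [0.7, 0.0, 0.7],
}

results = top_k([1.0, 0.0, 0.0], toy_embeddings, k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

A text prompt can be handled the same way, assuming the text and audio encoders map into a shared embedding space: the prompt is embedded and used as the query vector.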

We'll conclude with a live research problem: Audio Moment Retrieval from the DCASE 2026 Challenge, where the goal is to localize the exact moment in a long recording that matches a text query. We'll frame this as temporal detection, evaluate predictions, and visualize ground-truth vs. predicted moments on an interactive timeline to intuitively expose model failure modes.
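
One common way to score a predicted moment against a ground-truth moment in temporal detection is temporal intersection over union (IoU) of the two time intervals. The sketch below assumes this metric purely for illustration. The abstract does not say which metric the talk or the DCASE challenge uses, and the function names and the 0.5 threshold are hypothetical.

```python
def temporal_iou(pred, truth):
    """IoU of two (start, end) intervals, in seconds."""
    pred_start, pred_end = pred
    true_start, true_end = truth
    intersection = max(0.0, min(pred_end, true_end) - max(pred_start, true_start))
    union = (pred_end - pred_start) + (true_end - true_start) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_iou(predictions, ground_truths, threshold=0.5):
    """Fraction of queries whose single prediction reaches the IoU threshold.

    Assumes one prediction and one ground-truth interval per query,
    paired by position in the two lists.
    """
    if not ground_truths:
        return 0.0
    hits = sum(
        1
        for pred, truth in zip(predictions, ground_truths)
        if temporal_iou(pred, truth) >= threshold
    )
    return hits / len(ground_truths)


# A prediction of 10-20 s against a true moment of 15-25 s overlaps for 5 s
# out of a 15 s union, giving an IoU of 1/3.
print(temporal_iou((10.0, 20.0), (15.0, 25.0)))
```

Low-IoU pairs are the cases a timeline view would show as visibly offset or mis-sized predictions.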

Attendees will leave with a concrete blueprint and open code for applying visual data-centric AI practices to their own audio and multimodal datasets.

Do Speech Models Actually Understand Speech? Evaluating Speech LLMs Under Realistic Spoken Instruction Conditions

Speech Large Language Models (SLLMs) are increasingly capable, but are we evaluating them the right way? Most benchmarks rely on text prompts, yet real users interact with these systems through speech, a modality that introduces noise, disfluencies, and stylistic variation that text simply doesn't capture.

In this talk, we present findings from a systematic study across 11 tasks, 12 languages, and five prompt styles, examining how prompt modality, language, and task type shape SLLM performance.

AI-Based Audio Forensics

In this presentation, attendees will discover several modules developed by Gradiant for the detection and analysis of synthetically generated or manipulated audio. The session will be delivered by one of the developers involved in the design and implementation of these technologies, providing first-hand insight into their capabilities and underlying methodology.

The presentation will cover the traceability module, which helps identify the origin of AI-generated content. It will also cover the segment detection tool, designed to locate manipulated regions within an audio recording, as well as the complete audio detection tool, which assesses whether an entire recording has been synthetically generated.

Real-Time ASR at 4x on Consumer Hardware: The Meetily Architecture

This talk covers the engineering behind Meetily, an open-source meeting assistant that runs Whisper and NVIDIA Parakeet transcription entirely on-device. We'll walk through how we got Parakeet to roughly 4x real-time on consumer hardware, and the specific points where it still falls over.
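
To make "4x real-time" concrete: the figure is usually read as audio duration divided by processing time, so one hour of audio would take about 15 minutes to transcribe. The sketch below assumes that reading. The abstract does not define the measurement, and the function names are hypothetical.

```python
def speed_factor(audio_seconds, processing_seconds):
    """How many seconds of audio are transcribed per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def processing_time(audio_seconds, factor):
    """Expected transcription time for a given speed factor."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return audio_seconds / factor


# A 60-minute meeting at roughly 4x real-time takes about 15 minutes.
minutes = processing_time(60 * 60, 4.0) / 60
print(f"{minutes:.0f} minutes")
```

Any factor above 1x means transcription keeps up with live audio, and the margin above 1x is the headroom left for everything else running on the machine.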

We'll also get into the honest trade-offs between local and cloud inference: latency, accuracy, cost, and what you actually give up by choosing one over the other. Wrapping ML inference in a Rust/Tauri desktop app came with its own costs, which we'll unpack as well.

Finally, we'll look at what "fully local" really means at an architecture level, where that boundary sits, and how easily it leaks once you add model downloads, integrations, or a pluggable LLM backend.