
Voxel51 Filtered Views Newsletter – September 20, 2024


Welcome to Voxel51’s weekly digest of the latest trending AI, machine learning and computer vision news, events and resources! Subscribe to the email version.

📰 The Industry Pulse


ViewCrafter Synthesizes High-Fidelity Novel Views Like Magic!

ViewCrafter synthesizes high-fidelity novel views of generic scenes using single or sparse images, video diffusion models, and point-based 3D representations. This approach seems better than previous neural 3D reconstruction methods that rely on dense multi-view captures.

But how does it do what it does?

  • Point Cloud Creation: Builds a point cloud from a single reference image or sparse image set.
  • Video Diffusion Model: Trains a point-conditioned model to generate high-fidelity views based on coarse point cloud renders.
  • Iterative Synthesis: Progressively moves cameras, generates novel views, and updates the point cloud.
  • 3D-GS Optimization: Uses the completed dense point cloud and synthesized views to initialize and train 3D-GS for more consistent results.
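
For the curious, here's a minimal sketch of how those four steps might fit together in code. The function names below are placeholders of my own invention (they don't come from the ViewCrafter repo); think of it as pseudocode for the loop rather than the actual implementation.

```python
# Hypothetical orchestration of ViewCrafter's iterative synthesis loop.
# The heavy-lifting components are passed in as callables, since the real
# point-cloud builder, point-conditioned diffusion model, and 3D-GS trainer
# live in the ViewCrafter codebase.
def synthesize_scene(ref_images, camera_trajectory,
                     build_point_cloud, render_points,
                     diffusion_refine, train_3dgs,
                     views_per_iter=5):
    point_cloud = build_point_cloud(ref_images)            # 1. initial point cloud
    synthesized_views = list(ref_images)

    for start in range(0, len(camera_trajectory), views_per_iter):
        cameras = camera_trajectory[start:start + views_per_iter]
        # 2. coarse point-cloud renders from the new camera poses
        coarse_renders = [render_points(point_cloud, cam) for cam in cameras]
        # 3. the point-conditioned video diffusion model turns them into
        #    high-fidelity novel views
        refined_views = diffusion_refine(coarse_renders, cameras)
        synthesized_views.extend(refined_views)
        # ...and the point cloud is updated with the newly synthesized views
        point_cloud = build_point_cloud(synthesized_views)

    # 4. use the dense cloud and synthesized views to initialize and train 3D-GS
    return train_3dgs(point_cloud, synthesized_views)
```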

Learn more about it on the project page.

Robo-researchers are infiltrating academia

Ah, the wonders of modern academia! Who needs rigorous peer review when you’ve got ChatGPT churning out “scientific” papers faster than you can say “publish or perish”?

A new study has uncovered 139 questionable papers on Google Scholar with suspected deceptive use of LLM tools. It seems no platform is safe from the invasion of robo-researchers: most were in non-indexed journals or posted as working papers, turning up on ResearchGate, ORCiD, and even X (because who needs academic standards when you’ve got hypebeasts on social media).

Still, some appeared in established journals and conferences.

This trend doesn’t seem to be confined to any one field in particular: the study found these papers in areas like environmental studies, health studies, and computing. It seems like LLMs have it all covered.

Why bother with years of painstaking research when you can have a transformer spit out a paper in seconds?

Read more here.

OpenAI is considering changing its corporate structure in the coming year.

The company is currently in talks to raise $6.5 billion at a $150 billion pre-money valuation, but this deal is contingent on removing the profit cap for investors. CEO Sam Altman has reportedly informed employees that OpenAI’s structure will likely change in 2025, moving closer to a traditional for-profit model. The current structure, where a nonprofit controls the for-profit arm, appears to be a point of contention for potential investors.

Despite these potential changes, OpenAI has stated that the nonprofit aspect will remain central to its mission. The company emphasizes its commitment to developing AI that benefits everyone while positioning itself for success in achieving its goals.

If they make this change, I think “ForProfitAI” would make for a suitable new name.

Whatever the case, I firmly believe it’s Open Source for the win. 

Read more here.

💎 GitHub Gems


Reader-LM: Get this, it’s an LLM that converts HTML to Markdown. It’s included here somewhat ironically, as a thing that nobody asked for and few will likely ever use. Why write regex to clean your dataset when you can throw a 500M-parameter (~1GB) LLM and unnecessary GPU power at it instead!
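
If you do want to poke at it anyway, loading it through Hugging Face transformers would look roughly like this. The model id and the chat-style prompting below are assumptions based on my reading of the model card, so double-check them before running.

```python
# Rough sketch: HTML -> Markdown with Reader-LM via transformers.
# The model id ("jinaai/reader-lm-0.5b") and prompt format are assumptions;
# verify against the model card before use.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jinaai/reader-lm-0.5b"  # assumed; a larger variant also exists
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

html = "<html><body><h1>Hello</h1><p>Some <b>messy</b> markup.</p></body></html>"

# The raw HTML goes in as the user message; the model replies with Markdown.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": html}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```

For the record, a plain-CPU library like markdownify (or a few lines of BeautifulSoup) does the same job in milliseconds, which is rather the point.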

Hi3D: A novel video diffusion-based approach for generating high-resolution, multi-view consistent images with detailed textures from a single input image. Hi3D leverages the temporal consistency knowledge in video diffusion models to achieve geometry consistency across multiple views, and employs a 3D-aware video-to-video refiner to scale up the multi-view images while preserving high-resolution texture details.

Prompt2Fashion: This repo introduces a dataset of automatically generated fashion images created using the methodology presented in the “AutoFashion” paper. The dataset focuses on personalization, incorporating a variety of requirements like gender, body type, occasions, and styles, and their combinations, to generate fashion images without human intervention in designing the final outfit or the conditioning prompt for the Diffusion Model.
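
The general idea is easy to picture: enumerate attribute combinations and turn each one into a conditioning prompt for a text-to-image diffusion model. The sketch below is my own illustration of that pattern; the attribute values, prompt template, and base model are stand-ins, not the repo’s actual configuration.

```python
# Hypothetical sketch of attribute-driven fashion image generation.
# Attribute values, prompt template, and base model are stand-ins,
# not the Prompt2Fashion configuration.
from itertools import product
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

attributes = {
    "gender": ["woman", "man"],
    "body_type": ["petite", "athletic", "plus-size"],
    "occasion": ["office party", "wedding", "hiking trip"],
    "style": ["minimalist", "streetwear"],
}

for gender, body, occasion, style in product(*attributes.values()):
    # Build the conditioning prompt automatically from the attribute combination.
    prompt = (f"full-body photo of a {body} {gender} wearing a {style} outfit "
              f"for a {occasion}, fashion photography, studio lighting")
    image = pipe(prompt).images[0]
    image.save(f"{gender}_{body}_{occasion}_{style}.png".replace(" ", "-"))
```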

Lexicon3D: This framework extracts features from various foundation models, constructs 3D feature embeddings as scene embeddings, and evaluates them on multiple downstream tasks. The paper presents a novel approach to representing complex indoor scenes using a combination of 2D and 3D modalities, such as posed images, videos, and 3D point clouds. The extracted feature embeddings from image- and video-based models are projected into 3D space using a multi-view 3D projection module for subsequent 3D scene evaluation tasks.
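
That projection step is the most concrete part to picture. Here’s a generic sketch of lifting per-pixel image features into a shared 3D point cloud using depth maps and camera parameters; it illustrates the general multi-view lifting idea rather than Lexicon3D’s actual projection module.

```python
import numpy as np

def lift_features_to_3d(feature_maps, depth_maps, intrinsics, cam_to_world):
    """Generic multi-view feature lifting (illustrative, not Lexicon3D's code).

    feature_maps : list of (H, W, C) arrays, one per posed view
    depth_maps   : list of (H, W) arrays of metric depth
    intrinsics   : list of (3, 3) camera intrinsic matrices
    cam_to_world : list of (4, 4) camera-to-world extrinsics
    Returns (N, 3) world-space points and their (N, C) features.
    """
    all_points, all_feats = [], []
    for feats, depth, K, T in zip(feature_maps, depth_maps, intrinsics, cam_to_world):
        H, W, C = feats.shape
        u, v = np.meshgrid(np.arange(W), np.arange(H))   # pixel coordinates
        # Back-project each pixel into camera space using its depth.
        x = (u - K[0, 2]) * depth / K[0, 0]
        y = (v - K[1, 2]) * depth / K[1, 1]
        pts_cam = np.stack([x, y, depth, np.ones_like(depth)], axis=-1).reshape(-1, 4)
        # Move the points into world coordinates with the camera pose.
        pts_world = (pts_cam @ T.T)[:, :3]
        all_points.append(pts_world)
        all_feats.append(feats.reshape(-1, C))
    return np.concatenate(all_points), np.concatenate(all_feats)
```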

📖 Good Read: Founder Mode

It came out of nowhere. All of a sudden, my entire X and LinkedIn feeds were filled with posts (and plenty of memes) using the words “Founder Mode.”

Eventually, I learned the origin of the term: a Paul Graham essay titled Founder Mode.

After hearing Airbnb’s Brian Chesky speak at a Y Combinator event, Paul Graham, tech guru and startup whisperer, had an epiphany: it turns out, the conventional wisdom of “hire good people and let them do their thing” is about as effective as trying to put out a fire with gasoline. 

Inspired by this revelation, Graham furiously penned an essay and hit “publish.” The impact was immediate and powerful. Since then, the world of startups hasn’t been the same. 

As founders read his words, a collective “Aha!” moment swept through the startup world. Suddenly, tech founders everywhere were nodding in recognition, realizing they’d been fighting the same uphill battle in their own companies. Graham’s essay struck a chord that reverberated through the tech industry, uniting founders in their shared struggles and sparking a new conversation about what it really takes to build a successful startup.

A tale of two modes

  • Manager Mode: The MBA-approved method of treating your org chart like a game of “Don’t Touch the Lava.” Graham thinks this leads to hiring “professional fakers” who excel at driving companies into the ground.
  • Founder Mode: The mystical art of knowing what’s happening in your company. Revolutionary, right?

Founder Mode

Graham argues that Founder Mode is more complex but ultimately more effective than Manager Mode, and is characterized by:

  • Direct involvement: Founders understand their company’s operations deeply, often bypassing traditional management hierarchies.
  • Skip-level interactions: Regular communication with employees at various levels, not just direct reports.
  • Intuitive decision-making: Relying on the founder’s instincts rather than solely on conventional management practices.

Basically, Founder Mode is like Manager Mode, but with 100% more founder intuition and 50% less delegation. Results may vary, batteries not included.

Graham admits we know about as much about Founder Mode as we do about dark matter. But fear not! He predicts that once we figure it out, founders will achieve even greater heights – like building rockets to Mars or creating social networks that definitely won’t cause any problems whatsoever. 

He suggests that:

  • The specifics of founder mode may vary between companies and over time
  • Many successful founders have already been implementing aspects of founder mode, often being viewed as eccentric for doing so
  • There’s a need for more research and understanding of founder mode
  • Once founder mode is better understood, it could lead to even greater achievements by founders

🎙️ Good Listens: A full breakdown of the Reflection-70B fiasco


🍿 You might wanna grab some popcorn for this one!

Earlier this month (September 2024), Reflection 70B was making waves on X as THE new open-source LLM. Released by HyperWrite, a New York startup, it claimed to be the world’s top open-source model. Yet soon after its release, Reflection 70B’s performance was called into question, and accusations of potential fraud followed.

Initial Announcement and Claims

Thursday, September 5, 2024:

– Matt Shumer, co-founder and CEO of OthersideAI (HyperWrite), releases Reflection 70B on Hugging Face.

– Shumer claims it’s “the world’s top open-source model” and posts impressive benchmark results.

– The model is said to be a variant of Meta’s Llama 3.1, trained using Glaive AI’s synthetic data generation platform.

– Shumer attributes the performance to “Reflection Tuning,” allowing the model to self-assess and refine responses.

Skepticism and Investigations

Friday, September 6 – Monday, September 9, 2024:

– Independent evaluators and the open-source AI community begin questioning the model’s performance.

– Attempts to replicate the impressive results fail.

– Some responses indicate a possible connection to Anthropic’s Claude 3.5 Sonnet model.

– Artificial Analysis posts on X that its tests yield significantly lower scores than initially claimed.

– It’s revealed that Shumer is invested in Glaive AI, which he didn’t disclose when releasing Reflection 70B.

– Shumer attributes discrepancies to issues during the model’s upload to Hugging Face and promises to correct the weights.

On September 8, X user Shin Megami Boson openly accused Shumer of “fraud in the AI research community.”

Silence and Response

Sunday, September 8 – Monday, September 9, 2024:

– Shumer goes silent on Sunday evening.

Tuesday, September 10, 2024:

Shumer breaks his silence, apologizing and claiming he “Got ahead of himself.” He states a team is working to understand what happened and promises transparency. Sahil Chaudhary, founder of Glaive AI, posts that the benchmark scores shared with Shumer haven’t been reproducible.

Yuchen Jin, CTO of Hyperbolic Labs, details his efforts to host Reflection 70B and expresses disappointment in Shumer’s lack of communication.

Ongoing Skepticism

Post-September 10, 2024:

The AI community remains skeptical of Shumer’s and Chaudhary’s explanations. Many are calling for more detailed explanations of the discrepancies and the true nature of Reflection 70B. The situation continues to evolve, with the AI community awaiting further clarification and evidence from Shumer and his team regarding the true capabilities and origins of Reflection 70B.

Here’s a good recap of the entire fiasco on YouTube, and you can also read more about this here.

🦸 Good Research: A survey on comic understanding


Comics present unique challenges for models due to their combination of visual and textual narratives, creative variations in style, non-linear storytelling, and distinctive compositional elements. 

While vision-language models have advanced significantly, their application to Comics Understanding is still developing and faces several challenges. In particular, the field lacks a comprehensive framework for categorizing and understanding the various tasks involved. This survey introduces a novel framework called the Layer of Comics Understanding (LoCU), categorizing tasks based on input/output modalities and spatio-temporal dimensions.

The LoCU framework aims to guide researchers through the intricacies of Comics Understanding, from basic recognition to advanced synthesis tasks.

Holy frameworks, Batman! With LoCU, we’re ready to take on the comic book universe!

The framework organizes tasks from simple classification up to complex generation and synthesis. Each layer builds upon the previous ones, increasing in complexity and abstraction:

  • Layer 1 deals with basic recognition and enhancement tasks.
  • Layer 2 focuses on more detailed analysis and identification within comic panels.
  • Layer 3 involves searching and modifying comic content.
  • Layer 4 requires deeper comprehension of comic narratives and context.
  • Layer 5 represents the most advanced tasks, involving the creation of new comic-like content.

Here’s what the layers do in more detail:

Layer 1: Tagging and Augmentation

1. Tagging:

  • Tasks: Page Stream Segmentation, Image Classification, Emotion Classification, Action Detection
  • Focus: Basic labeling and categorization of comic elements

2. Augmentation:

  • Tasks: Image Super-Resolution, Style Transfer, Vectorization, Depth Estimation
  • Focus: Enhancing or transforming the visual aspects of comics

Layer 2: Grounding, Analysis, and Segmentation

1. Grounding:

  • Tasks: Object Detection, Character Re-identification, Free Vocabulary Grounding, Text-Character Association
  • Focus: Identifying and localizing specific elements within comic panels

2. Analysis:

  • Tasks: Dialog Transcription, Translation
  • Focus: Extracting and processing textual information from comics

3. Segmentation:

  • Tasks: Instance Segmentation
  • Focus: Detailed delineation of individual objects or characters within panels

Layer 3: Retrieval and Modification

1. Retrieval:

  • Tasks: Image-Text Retrieval, Text-Image Retrieval, Composed Image Retrieval
  • Focus: Finding relevant comic images or text based on queries

2. Modification:

  • Tasks: Image Inpainting, Image Editing via Text Instructions
  • Focus: Altering or manipulating comic images based on specific inputs

Layer 4: Understanding

  • Tasks: Visual Question Answering, Visual Reasoning, Visual Dialog
  • Focus: Comprehending and interpreting the content of comics at a deeper level, including narrative elements and context

Layer 5: Generation and Synthesis

1. Generation:

  • Tasks: Image-to-Text Generation, Grounded Image Captioning, Text-to-Image Generation, Scene Graph Generation for Captioning, 3D Character Generation
  • Focus: Creating new textual or visual content based on comic inputs

2. Synthesis:

  • Tasks: Story Description, Multi-page Story Generation, Story-to-Video Generation
  • Focus: Producing complex, multi-modal outputs that capture the essence of comic storytelling

Thanks to Vivoli et al., we have the Layer of Comics Understanding (LoCU) framework to guide us through this maze of panels, bubbles, and superhero capes. 

From basic tagging to full-blown narrative synthesis, it’s like having a superhero team of AI models ready to tackle every comic challenge. Time to turn those pixels into epic stories and make our favorite comics come alive in ways we never thought possible. Let’s get our capes on and save the day, one panel at a time!

🗓️ Upcoming Events


Check out these upcoming AI, machine learning and computer vision events! View the full calendar and register for an event.