The Multimodal LLM Wars

PLUS: AI Devices Hit The Market

Hello readers,

Welcome to another edition of This Week in the Future! Google and OpenAI are waging a war over multimodal LLMs, AI devices are starting to hit the market, and LLMs are running at the edge. It seems a new class of consumer products is inevitable.

As always, thanks for being a subscriber! We hope you enjoy this week’s content — for a video breakdown, check out the episode on YouTube.

Let’s get into it!

The Dawn of LMMs

Large multimodal models (LMMs) appear to be the next phase of the AI era, and both Google and OpenAI have recently published papers on the fusion of LLMs with vision and other extended capabilities.

First up is RT-X by Google DeepMind. This new model showcases the power of massive data integration in robotics applications, marking a significant milestone in robotics research and development.

A Leap in Robotics

DeepMind amassed a large collection of robotics demonstration videos from universities around the globe. This comprehensive dataset, spanning a diverse range of robot embodiments and real-world tasks, served as the bedrock for RT-X's training.

Key Outcome

Training RT-X on this extensive dataset resulted in significant improvements in the model's generalized capability to tackle novel tasks. RT-X significantly outperforms its predecessor, RT-2.

Open X-Embodiment

Google has also open-sourced the colossal dataset under the name Open X-Embodiment. By making the dataset publicly available, Google is inviting further advancements in robotics models.

Why This Matters

Google has demonstrated that applying LLMs to vision tasks in robotics results in vastly better performance and emergent capabilities. This integration of LLMs has put the robotics industry on pace to deliver a truly generalized robotic agent, a prerequisite for robots in the home or in the factory.

OpenAI’s Findings

OpenAI recently released an extensive report on GPT-4V, offering a deep dive into the model's abilities and challenges. This 160-page document sheds light on the potential of LMMs and reinforces the principles of RT-X. Some of GPT-4V’s capabilities include:

Visual pointing and referring prompting: The model can follow visual pointers drawn directly on an image, such as arrows or circled regions, and identify the specific objects they indicate.

Few-shot learning: Building on behavior first observed in large language models, GPT-4V improves when given a handful of in-context examples in the prompt. A cited example shows how this approach helped the model correctly read a speedometer after an initial zero-shot misinterpretation.

Recognition abilities: GPT-4V can identify celebrities, notable landmarks, and even specific food dishes. Additionally, it possesses the capability to interpret human emotions based on facial images.

Embodied agent approach: In a fascinating use case, the model could virtually navigate a home's interior, moving from room to room upon instruction, mimicking a robot's journey.

Browser interaction: GPT-4V is adept at understanding browser interfaces and can recommend actions to complete specific online workflows. Sam Altman even backed a startup working on this very application.
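The few-shot behavior above is a prompting technique rather than retraining: worked examples are interleaved into the conversation ahead of the real query. A minimal sketch of how such a prompt might be assembled, using the message schema from OpenAI's public chat API (the image URLs, speedometer scenario, and helper function are illustrative placeholders, not the report's actual prompts):

```python
# Sketch of a few-shot visual prompt in OpenAI's chat-message format.
# The image URLs and speedometer example are illustrative placeholders,
# not the actual prompts from the GPT-4V report.

def build_few_shot_messages(examples, query_image_url, question):
    """Interleave (image, answer) examples before the final, real query."""
    messages = [{"role": "system",
                 "content": "You read analog gauges from photos."}]
    for image_url, answer in examples:
        messages.append({
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": question},
            ],
        })
        messages.append({"role": "assistant", "content": answer})
    # Final, unanswered query -- the model completes this one.
    messages.append({
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": query_image_url}},
            {"type": "text", "text": question},
        ],
    })
    return messages

msgs = build_few_shot_messages(
    examples=[("https://example.com/dial1.jpg", "The needle reads 40 mph."),
              ("https://example.com/dial2.jpg", "The needle reads 70 mph.")],
    query_image_url="https://example.com/dial3.jpg",
    question="What speed does the speedometer show?",
)
print(len(msgs))  # system + 2 answered examples (user + assistant each) + final query = 6
```

The model then completes the final unanswered turn, conditioned on the answered examples that precede it.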

Our Take

While the model still hallucinates, the future is promising. OpenAI has already hinted at its next venture, a large multimodal model named Gobi designed to process both videos and images. These LMM advancements from Google and OpenAI signal that AI-driven robotic agents could enter our daily lives sooner than anticipated.

AI Products and AI at the Edge

Stability AI’s release of Stable LM 3B brings a high-performance language model to smart devices, and it has us thinking about how compressing these models onto ever-smaller hardware will enable a new class of consumer products, perhaps even ushering in the smartphone killer. A few AI products have already hit the market:

Rewind Pendant: This pendant, designed to be worn around the neck, acts as a personal audio recorder. It captures and transcribes every word you utter, creating a textual archive you can revisit later.

AI-Powered Fashionable Pin: Details about this device remain shrouded in mystery, but it's undeniably a fusion of fashion and technology. Touted as a potential smartphone alternative, it boasts AI-driven optical recognition and a laser-projected display.

Ray-Ban Meta Smart Glasses: A stark contrast to the infamous Google Glass, these smart glasses look like normal eyewear while running Meta’s AI assistant.

Pixel 8 On-Device AI: Google's new Pixel 8 phones are notable for executing generative AI models directly on the device rather than relying on cloud-based computations.
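A common thread in these devices is squeezing model weights into phone-class memory. A back-of-the-envelope sketch of why quantization matters for a nominally 3-billion-parameter model like Stable LM 3B (weights only; activations, KV cache, and runtime overhead are ignored, and 3e9 is a rough round-number assumption):

```python
# Rough weights-only memory footprint for a ~3B-parameter model
# at common numeric precisions (ignores activations, KV cache, overhead).

PARAMS = 3e9  # nominal parameter count for a "3B" model

def footprint_gb(bits_per_param):
    """Bytes needed to store the weights, expressed in gigabytes."""
    return PARAMS * bits_per_param / 8 / 1e9

for label, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{label}: ~{footprint_gb(bits):.1f} GB")  # ~6.0, ~3.0, ~1.5 GB
```

Dropping from 16-bit to 4-bit weights cuts the footprint from roughly 6 GB to roughly 1.5 GB, which helps explain why quantized builds are what typically ship on phone-class hardware.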

🔥 Rapid Fire

🎙️ The AI For All Podcast

This week’s episode featured Jean-Simon Venne, the CTO and co-founder of BrainBox AI, who discussed the current state of enterprise AI adoption and how enterprises are reducing costs and driving sustainability with AI-operated smart buildings!

📖 What We’re Reading

In keeping with the theme of AI devices and AI at the edge, Qualcomm has compiled insights on edge AI and the next-gen products it will enable. Plus, if you want to take advantage of open-source LLMs, GitHub provides a good starting guide.

AI on the Edge: The latest on-device AI insights and trends (link)

“On-device generative AI brings many exciting advantages including cost, privacy, performance, and personalization. And this puts generative AI at the center of your latest and greatest smartphone, personal computer, extended reality device, wearable — and even your next car.”

Source: Qualcomm OnQ

A developer’s guide to open source LLMs and generative AI (link)

“Over the past year, there has been an explosion of open source generative AI projects on GitHub: by our count, more than 8,000. They range from commercially backed large language models (LLMs) like Meta’s LLaMA to experimental open source applications.”

Source: GitHub

💻️ AI Tools and Platforms

  • Atera → AI-powered IT management platform

  • Nexusflow → Generative AI cybersecurity copilot

  • Helios → Supply chain disruption prediction with AI

  • Datatonic → Leading cloud data + AI consultancy

  • Unordinary → AI, cloud, and IT services