The Multimodal LLM Wars
PLUS: AI Devices Hit The Market
Hello readers,
Welcome to another edition of This Week in the Future! Google and OpenAI are waging a war over multimodal LLMs, AI devices are starting to hit the market, and LLMs are running at the edge. It seems a new class of consumer products is inevitable.
As always, thanks for being a subscriber! We hope you enjoy this week’s content — for a video breakdown, check out the episode on YouTube.
Let’s get into it!
The Dawn of LMMs
Large multimodal models (LMMs) appear to be the next phase of the AI era, and both Google and OpenAI have recently published papers on the fusion of LLMs with vision and other extended capabilities.
First up is RT-X by Google DeepMind. This new model showcases the power of massive data integration in robotics applications, marking a significant milestone in robotics research and development.
A Leap in Robotics
DeepMind amassed a wealth of robotics videos from universities around the globe. This comprehensive data compilation, encapsulating a diverse range of robotics models and real-world applications, served as the bedrock for RT-X's training.
Key Outcome
Training RT-X on this extensive dataset markedly improved the model's generalized capability to tackle novel tasks; RT-X significantly outperforms its predecessor, RT-2.

Open X-Embodiment
Google has also open-sourced the colossal dataset under the name Open X-Embodiment. By making the dataset publicly available, Google is inviting further advancements in robotics models.
Why This Matters
Google has demonstrated that applying LLMs to vision tasks in robotics results in vastly better performance and emergent capabilities. This integration of LLMs has put the robotics industry on pace to deliver a truly generalized robotic agent, a prerequisite for robots in the home or in the factory.
OpenAI’s Findings
OpenAI recently released an extensive report on GPT-4V, offering a deep dive into the model's abilities and challenges. This 160-page document sheds light on the potential of LMMs and reinforces the principles of RT-X. Some of GPT-4V’s capabilities include:
Visual pointing and referring prompting: The model can interpret visual pointers, like edges and angles, and identify specific objects in a scene.
Few-shot learning: Building on behavior observed in large language models, GPT-4V can pick up a task from just a few in-context examples rather than requiring retraining. A cited example demonstrates how this approach helped the model correctly identify a reading on a speedometer after an initial misinterpretation.
Recognition abilities: GPT-4V can identify celebrities, notable landmarks, and even specific food dishes. Additionally, it possesses the capability to interpret human emotions based on facial images.
Embodied Agent Approach: In a fascinating use case, the model could virtually navigate a home's interior, moving from room to room upon instruction, mimicking a robot's journey.
Browser Interaction: GPT-4V is adept at understanding browser interfaces and can recommend actions to complete specific online workflows. Sam Altman even backed a startup working on this very application.
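To make the few-shot idea above concrete, here is a minimal sketch of how an image-grounded, few-shot chat payload can be assembled for a vision-capable model like GPT-4V. The image URLs and helper names are purely illustrative, and the message schema is an assumption based on OpenAI's published chat format for vision models; no API call is made here.

```python
# Sketch: assembling a few-shot, multimodal chat payload.
# The URLs below are placeholders; the content-part schema mirrors
# OpenAI's chat format for vision-capable models.

def image_part(url: str) -> dict:
    """Wrap an image URL as a multimodal content part."""
    return {"type": "image_url", "image_url": {"url": url}}

def text_part(text: str) -> dict:
    """Wrap plain text as a multimodal content part."""
    return {"type": "text", "text": text}

def few_shot_messages(examples: list[tuple[str, str]], query_url: str) -> list[dict]:
    """Interleave (image, answer) demonstrations, then the real query.

    Each demonstration becomes a user turn (image + question) followed
    by an assistant turn (the desired answer), so the model can imitate
    the pattern on the final, unanswered image.
    """
    messages = [{"role": "system",
                 "content": "Read the speedometer in each image."}]
    for url, answer in examples:
        messages.append({"role": "user",
                         "content": [image_part(url),
                                     text_part("What speed is shown?")]})
        messages.append({"role": "assistant", "content": answer})
    # The final user turn carries the image we actually want answered.
    messages.append({"role": "user",
                     "content": [image_part(query_url),
                                 text_part("What speed is shown?")]})
    return messages

msgs = few_shot_messages(
    [("https://example.com/dash1.jpg", "About 45 mph"),
     ("https://example.com/dash2.jpg", "About 80 mph")],
    "https://example.com/dash3.jpg",
)
```

The demonstrations give the model a pattern to imitate, which is exactly the mechanism behind the speedometer example in OpenAI's report.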
Our Take
While the model does hallucinate, the future is promising. OpenAI has already hinted at its next venture, a large multimodal model named Gobi, designed to process both videos and images. These LMM advancements by Google and OpenAI signal that AI-driven robotic agents could enter our daily lives sooner than anticipated.
AI Products and AI at the Edge
Stability AI's release of Stable LM 3B brings a high-performance language model to smart devices, and it has us thinking about how the compression of these models onto smaller and smaller hardware will enable a new class of consumer products, perhaps even ushering in the smartphone killer. A few AI products have already hit the market:
Rewind Pendant: This pendant, designed to be worn around the neck, acts as a personal audio recorder. It captures and transcribes every word you utter, creating a textual archive you can revisit later.
AI-Powered Fashionable Pin: Details about this device remain shrouded in mystery, but it's undeniably a fusion of fashion and technology. Touted as a potential smartphone alternative, it boasts AI-driven optical recognition and a laser-projected display.
Ray-Ban Meta Smart Glasses: A stark contrast to the infamous Google Glass, these smart glasses look like normal eyewear, all while running Meta's AI assistant.
Pixel 8 On-Device AI: Google's new Pixel 8 phones are notable for executing generative AI models directly on the device rather than relying on cloud-based computations.
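The compression story above comes down to simple arithmetic: weight precision dominates a model's memory footprint. A rough back-of-envelope sketch, using 3 billion parameters as a stand-in for a Stable LM 3B-class model (real deployments also need memory for activations and the KV cache, which this ignores):

```python
# Back-of-envelope memory footprint for a 3-billion-parameter model
# at different weight precisions, illustrating why quantization makes
# on-device LLMs practical.

def model_size_gb(params: float, bits_per_weight: int) -> float:
    """Approximate weight-storage size in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

PARAMS = 3e9  # a Stable LM 3B-class model

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{model_size_gb(PARAMS, bits):.1f} GB")
# 32-bit weights come to ~12 GB; 4-bit weights to ~1.5 GB
```

At 4-bit precision the weights fit in roughly 1.5 GB, comfortably within a modern phone's RAM, which is why quantized small models are what you see shipping on devices like the Pixel 8.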
🔥 Rapid Fire
NexGen Cloud is building a $1 billion Supercloud for AI in Europe
Armilla offers insurance for enterprises using AI
BrainChip makes 2nd-gen Akida platform available
Microsoft releases AutoGen
NSA is opening AI security center
Okta introduces Okta AI
Arc Browser releases Arc Max AI assistant
LinkedIn reimagines hiring and learning with new AI features
Visa launches $100 million generative AI initiative
Meta introduces gen AI features for advertisers
Google is adding Bard to Assistant
Samsung is developing AI chips with startup Tenstorrent
🎙️ The AI For All Podcast
This week’s episode featured Jean-Simon Venne, the CTO and co-founder of BrainBox AI, who discussed the current state of enterprise AI adoption and how enterprises are reducing costs and driving sustainability with AI-operated smart buildings!
📖 What We’re Reading
In keeping with the theme of AI devices and AI at the edge, Qualcomm has compiled insights on edge AI and the next-gen products it will enable. Plus, if you want to take advantage of open-source LLMs, GitHub provides a good starting guide.
AI on the Edge: The latest on-device AI insights and trends (link)
“On-device generative AI brings many exciting advantages including cost, privacy, performance, and personalization. And this puts generative AI at the center of your latest and greatest smartphone, personal computer, extended reality device, wearable — and even your next car.”
A developer’s guide to open source LLMs and generative AI (link)
“Over the past year, there has been an explosion of open source generative AI projects on GitHub: by our count, more than 8,000. They range from commercially backed large language models (LLMs) like Meta’s LLaMA to experimental open source applications.”
💻️ AI Tools and Platforms
Atera → AI-powered IT management platform
Nexusflow → Generative AI cybersecurity copilot
Helios → Supply chain disruption prediction with AI
Datatonic → Leading cloud data + AI consultancy
Unordinary → AI, cloud, and IT services