Direct and sustained engagement with foundational AI models is crucial for two main reasons: staying current with the latest updates and, more importantly, developing an intuition for the emergent capabilities of the most advanced versions.
Updates
March and April 2024 marked new milestones for the leading three foundation models.
Evaluating the relative capabilities of AI models is very hard, and the LMSYS “Chatbot Arena” is the best idea we have: anyone can cast a blind vote for the better of two anonymous chatbots at chat.lmsys.org. More than 600,000 votes have been cast, and the results are scored with an Elo-style rating system, much like chess player rankings.
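Those chess-style rankings are, at heart, Elo ratings: each blind vote nudges the two models’ scores, and an upset moves them further than an expected result. Here is a minimal sketch of a single Elo update using the classic logistic formula; the live leaderboard’s exact methodology may differ, so treat this as an illustration of the idea rather than LMSYS’s implementation:

```python
def elo_update(r_winner, r_loser, k=32):
    """One Elo update after a single head-to-head vote.

    The expected score follows a logistic curve on the rating gap,
    and K scales how far one result moves the ratings.
    """
    expected_win = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1 - expected_win)
    return r_winner + delta, r_loser - delta

print(elo_update(1500, 1500))  # evenly matched: (1516.0, 1484.0)
print(elo_update(1400, 1600))  # upset: each rating moves more than 16 points
```

Note that rating points are conserved: whatever the winner gains, the loser gives up, so the Arena ranking is purely relative.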
First, Anthropic announced Claude 3 on 4 March 2024, and it almost immediately unseated GPT-4 in the Chatbot Arena rankings. For a brief period, Claude 3 appeared to genuinely surpass GPT-4 Turbo in performance, the first time any other AI model had outperformed a member of the GPT-4 family.
Claude 3’s coup was followed on 9 April 2024 by substantial updates to GPT-4 and Gemini, and GPT-4 regained its Arena lead. The full list of current Arena rankings is worth a look, and may have changed in the past few hours.
When that new GPT-4 Turbo was released, OpenAI announced only that it was a "majorly improved" version. It appears to integrate the multimodal GPT-4 Vision capability, allowing the model to recognize and reason about image content directly. OpenAI later added, still without specifics, that the model has improved performance in logical reasoning, coding, and mathematical tasks, as well as a knowledge cutoff of April 2024.
On the same day as OpenAI’s announcement, Google introduced its Gemini 1.5 Pro model, with some genuinely interesting features: a one-million-token context window, native audio processing, and a JSON mode for structured data extraction.
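JSON mode is the kind of feature worth building an intuition for: the model is constrained to emit valid JSON, so structured data can be parsed directly instead of scraped out of free-form prose. The request side is SDK-specific (in Google’s Python SDK it is, as of this writing, a generation config with `response_mime_type="application/json"`, but check the current docs); the consuming side looks like this, with a hypothetical response standing in for a real API call:

```python
import json

# Hypothetical raw text returned by a JSON-mode call; with JSON mode
# enabled, the model's output is guaranteed to be parseable JSON.
raw_response = (
    '{"company": "Google", "model": "Gemini 1.5 Pro",'
    ' "context_window_tokens": 1000000}'
)

# No regex scraping or fragile string-splitting: parse it directly.
data = json.loads(raw_response)
print(data["context_window_tokens"])  # → 1000000
```

Without JSON mode, the same extraction would typically mean prompting the model to "respond only with JSON" and hoping it complies.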
Intuitions for the capabilities
However, the models are updated constantly. To choose a foundation model to engage with, you need to know that the tangible differences between AI foundation models come from a blend of the oceans of data used in training, deliberate design choices by their creators, and unexpected emergent behaviors that even their programmers do not fully understand.
Famously, when Gemini came out it was very ‘woke’. You could ask it for a picture of the U.S. Founding Fathers or of soldiers in Nazi Germany, and it would return an unusually multicultural group of people. That woke-ness came directly from explicit ‘system prompts’, or guardrails, established by the developers at Google.
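The mechanism behind those guardrails is simple to sketch. In the widely used chat-API message schema, a developer-written system message is silently prepended ahead of every user request, and the model’s reply is conditioned on it first. The guardrail text below is hypothetical, for illustration only, not Google’s actual prompt:

```python
# Hypothetical guardrail layered in front of user input, using the
# common "system"/"user" chat-message schema. One line of developer
# text shifts outputs across all user requests, because every reply
# is conditioned on the system message first.
messages = [
    {
        "role": "system",  # set by the developer; invisible to the end user
        "content": (
            "When depicting groups of people, show diverse backgrounds "
            "unless the request is historically specific."
        ),
    },
    {"role": "user", "content": "Show me the U.S. Founding Fathers."},
]

print(messages[0]["role"])  # → system
```

The user never sees the system message, which is why guardrail-driven behavior can feel so mysterious from the outside.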
But all the models also exhibit a range of strengths, weaknesses, and unique traits that emerge from a deeper, more intricate interplay between their training processes and the oceans of data they learn from. This interplay produces behaviors that resemble human-like personality, and theories about why vary widely. It is precisely because no one yet understands what’s going on inside the box that it’s so important to build early intuitions with what we have available.
I keep going back to Ethan Mollick’s interview with Ezra Klein, where Professor Mollick underscored the distinct features and personalities of the three leading AI models, Claude 3, GPT-4, and Gemini:
Anthropic’s Claude 3
[I]f you like sort of intellectual challenge, I think Claude 3 is the most intellectual of the models …. Generally, I think Claude 3 is the most likely to freak you out right now.
… Claude 3 is currently the warmest of the models. And it is the most allowed, by its creators, Anthropic, I think, to act like a person. So it’s more willing to give you its personal views, such as they are. And again, those aren’t real views. Those are views to make you happy.... And it’s a beautiful writer, very good at writing, kind of clever — closest to humor, I’ve found, of any of the AIs. Less dad jokes, and more, actual, almost, jokes.
OpenAI’s GPT-4:
The biggest capability set right now is GPT-4, so if you do any math or coding work, it does [the] coding for you. It has some really interesting interfaces. That’s what I would use — and because GPT-5 is coming out, that’s [going to be] fairly powerful.… GPT-4 is probably the most likely to be super useful right now.
… GPT-4 feels like a workhorse at this point. It is the most neutral of the approaches. It wants to get stuff done for you. And it will happily do that. It doesn’t have a lot of time for chitchat.
Google’s Gemini:
Google’s [Gemini] is probably the most accessible, and plugged into the Google ecosystem.… Google’s Gemini feels like it really, really wants to help.
… [In my university classes, we] build these scenarios [for the students] where the A.I. actually acts like a counterparty in a negotiation.… But when we try and get [Gemini] to do that [counterparty negotiation], it keeps leaping in on the part of the students, to try and correct them and say, “No, you didn’t really want to say this. You wanted to say that.” And “I’ll play out the scenario as if it went better.”
It really wants to, kinda, make things good for you.