This past week has been really busy, with a wave of new artificial intelligence model releases: Gemini 2.0, OpenAI o1 pro, Llama 3.3, Cognition Labs’ Devin, Grok’s Aurora and OpenAI’s Sora.
I leave you with a fantastic summary by Mathew Bergman of all these developments, plus some others related to quantum computing. By the way, I will cover quantum computing in some of my next posts, as I am very interested in the leap it could bring: a computing capacity that is impossible to imagine with current systems.
1. Gemini 2.0: The Jewel of Google
Gemini 2.0 is Google’s latest AI model, launched on December 11, 2024. It introduces significant advances in AI capabilities that put it at the top of the market.
It is a multimodal model capable of working with text, images, audio and video, which means that it can understand and generate content in several formats at the same time.
What makes this model so special? Here I explain it to you:
- Total multimodality: For example, you can ask it to analyze an image, generate a text description, or even create multilingual audio based on a video. In the above AIGrid video you can see how the model’s vision capability, together with its ability to “explain” what it is seeing through audio, opens up endless possibilities. We already saw something similar with OpenAI’s GPT-4o model in the demonstration that Mira Murati gave with the OpenAI team.
- AI agent: Also known as “Project Astra”, the model can act as an intelligent system, using memory, reasoning and planning to complete tasks under user supervision. This functionality is truly impressive.
- Take control of your Chrome browser: Also known as “Project Mariner”, this feature lets the model control Google ecosystem applications from the browser. For example, it can search the Internet for the addresses of a list of companies that you keep in a Google Sheets spreadsheet. This functionality is very similar to RPA, but powered by AI. We have already seen this capability in other level 2.5 or level 3 agent systems that I covered in another article that you can find here.
- Speed and efficiency: The Gemini 2.0 Flash version is not only faster, but also more economical compared to previous models. Gemini 2.0 processes requests twice as fast as its predecessor, with improved time to first token (TTFT).
- Unique capabilities: It can execute code from natural language instructions. Imagine saying “write a program to organize my photos by date” and having it do it automatically…
Everyday uses of an agent like this could include looking up information, operating a household appliance, analyzing handwritten texts, having the system describe a point of interest you are visiting, getting shopping recommendations, or receiving directions to a location. Combined with vision devices such as glasses, it could also give us very precise answers to any type of question we ask, without our having to use a smartphone. Can you imagine?
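To make the “organize my photos by date” request mentioned above more concrete, here is a minimal sketch of the kind of program such an instruction might produce. It is my own illustration, not actual Gemini output. The function name `organize_photos_by_date`, the list of extensions and the `YYYY/YYYY-MM-DD` folder layout are all assumptions, and it uses each file’s last-modified time as a stand-in for the date the photo was taken (reading the real EXIF capture date would require an extra library).

```python
import shutil
from datetime import datetime
from pathlib import Path

# Assumption: which file extensions count as "photos" for this sketch.
PHOTO_EXTENSIONS = {".jpg", ".jpeg", ".png", ".heic"}


def organize_photos_by_date(source: Path, destination: Path) -> list[Path]:
    """Move photos from `source` into destination/YYYY/YYYY-MM-DD folders.

    The date comes from the file's last-modified time (local time zone),
    which only approximates when the photo was actually taken.
    Returns the new paths of the files that were moved.
    """
    moved = []
    # Only top-level files in `source` are considered; subfolders are skipped.
    for photo in sorted(source.iterdir()):
        if not photo.is_file() or photo.suffix.lower() not in PHOTO_EXTENSIONS:
            continue
        modified = datetime.fromtimestamp(photo.stat().st_mtime)
        target_dir = destination / f"{modified:%Y}" / f"{modified:%Y-%m-%d}"
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / photo.name
        if target.exists():
            continue  # never overwrite a file that is already there
        shutil.move(str(photo), str(target))
        moved.append(target)
    return moved
```

You could try it with something like `organize_photos_by_date(Path("unsorted"), Path("sorted"))`, where both folder names are placeholders. Test it on a copy of your photos first, since the files are moved, not copied.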
2. OpenAI ChatGPT o1 pro
This update to o1-preview introduces advanced reasoning capabilities, focusing on using additional computational time to solve complex problems.
o1 pro spends more time thinking, which yields better answers to technical problems. It is also faster than its predecessor, o1-preview, with a 50% improvement in reasoning speed.
It excels in mathematical reasoning, logic, and complex problem-solving. However, it still has limitations in vision tasks and multimodal analysis.
Priced at $200 per month, it is aimed at professionals who require advanced programming or logical reasoning solutions.
3. Llama 3.3: Meta's Open Source AI
Meta has introduced Llama 3.3, an affordable model designed so that developers of all types can use it without large costs or heavy infrastructure. With only 70 billion parameters, it achieves results comparable to much larger models.
- Affordable and efficient: Perfect for startups or researchers with limited resources.
- Open collaboration: Its license allows it to be freely modified and used, which has led to more than 650 million downloads.
- Sustainable progress: Meta is building giant data centers, such as a 2 GW facility in Louisiana, to train future models.
4. Devin: The Standalone Agent for Developers
If you are a programmer or work with software, Devin can change your life.
Devin is an innovative artificial intelligence (AI) developed by Cognition Labs, which presents itself as the first autonomous software engineer. This system can perform programming tasks from conceptualization to implementation, including writing, debugging, and testing code, as well as training other AI models.
- Full integration with GitHub: It can review your code, identify problems, and fix them automatically.
- Time optimization: Imagine having an “assistant” to take care of the technical details while you concentrate on design or strategy.
It is priced from USD 500 per month, making it ideal for companies looking to maximize productivity through the use of autonomous AI-driven engineers.
5. Grok's Aurora: The Revolution in Image Generation
Aurora, a Grok model specializing in images, is making a difference in visual creativity.
- Detail and realism: It generates hyper-realistic images from textual descriptions. For example, you could ask for “a scene of a forest at dawn with fog” and get something impressive.
- Versatility: In addition to creating images, it lets you modify them in real time, adjusting them to your needs.
This model is ideal for creative industries such as advertising, design, and content production. It is the most permissive model in the generation of images, being able to use celebrity faces to make new creations. As you can see, both the main image of this post and the one below of Charlize Theron have been generated through a simple prompt in Grok.
6. OpenAI Sora
After 10 months of waiting, Sora debuts as an advanced tool for generating realistic and creative videos based on textual prompts.
If you are a ChatGPT Plus user, you will be able to generate up to 50 videos per month, in 720p and with a watermark. If you want to start creating content more seriously, however, you must pay about $200/month to enjoy unrestricted access, Full HD resolution and no watermarks.
The main functions of this model are:
- Multimodal generation: Use images or prompts to create customized videos.
- Advanced editing: Tools such as Remix, Loop, Blend and Storyboard offer great creative control.
- Visual Styles: Option to apply effects such as “Film Noir” or “Papercraft” to vary the aesthetics.
In this first case the prompt was very simple, with no complex instructions or storyboard. The result is quite good.
In this case, a simple prompt was enough to get the text SALVA written among cumulonimbus clouds in a blue sky. A pretty good result.
As an Aikido practitioner, I was interested in how the model could interpret complex movements. Here, I have used a more elaborate prompt, although the result, as you can see, is far from reality and presents numerous “artifacts” or hallucinations.
Undoubtedly, this has been a very intense week, full of announcements of both new versions of existing models and entirely new ones, all of which will surely make our lives easier.
And in your case, how do you think these technologies could impact your work or personal projects?
Are you noticing how fast this is going? It is both exciting and, without a doubt, disturbing, don’t you think?
Leave me your comments and share your opinion.
Have a good week!