Chatbot laughs, jokes and sings: OpenAI presents GPT-4o
The new AI model from OpenAI is designed to be a personal AI companion. Using the smartphone camera, the software interprets its surroundings and can react to them.
OpenAI presented its new language model GPT-4o on Monday evening. In the livestream, three employees demonstrated several of the model's features.
GPT-4o builds on the capabilities of the current AI model GPT-4, but now also attempts to recognise and express emotions. It uses video and audio input to react to its environment, which should enable the AI to hold in-depth conversations. Its reaction times are also said to have improved. According to OpenAI, this makes conversations feel more natural.
Learning help, real-time translation and personal assistance
On the OpenAI website there are several demo videos showing the capabilities of GPT-4o: the AI acts as a real-time translator in multilingual conversations, gives a student helpful hints on maths homework, reacts to events in the camera's field of view and interacts with other GPT-4o clients. One striking example shows an AI without camera access talking to another AI with a camera and asking it about its surroundings. The AI can also moderate conversations or games such as rock, paper, scissors.
In this video you can see how two AIs interact based on human instructions:
OpenAI wants to position GPT-4o as a fully-fledged conversation partner that reacts to its environment and to the course of a dialogue. You can tell that the model has been trained for conversations: the AI responds animatedly, laughs and cracks little jokes unprompted. It responds to good news with excitement and joy, and reacts sensitively to depressed moods. It can also be made to respond only sarcastically, to speak faster or slower or with dramatic emphasis, or even to sing.
Here OpenAI presents how GPT-4o can help a student learn maths:
One AI model for all processes
It was already possible to talk to ChatGPT, but this required three separate AI models. OpenAI describes the process on the company website as follows: a speech-to-text model transcribed the spoken words and passed the text to GPT, the actual brain of the AI. GPT's reply, again in text form, was then rendered as speech by a text-to-speech model.
Because GPT only had the plain text at its disposal, a lot of information was lost along the way: the AI could not pick up the speaker's tone of voice or background noise, or tell whether several speakers were involved. Conversely, GPT could not express emotions, laugh or sing through the mouthpiece of a text-to-speech model.
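To make the old approach concrete, the three-stage pipeline could be sketched roughly like this with the OpenAI Python SDK. This is a minimal illustration, not a reproduction of how ChatGPT's voice mode was actually built; the model names ("whisper-1", "gpt-4-turbo", "tts-1"), the voice and the file names are assumptions for the example.

```python
# Illustrative sketch of the old three-model voice pipeline
# (assumption: OpenAI Python SDK >= 1.0; model names and voice are placeholders).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1) Speech-to-text: transcribe the user's spoken question.
with open("question.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1", file=audio_file
    )

# 2) Text in, text out: GPT only ever sees the plain transcript,
#    so tone of voice and background noise are already lost at this point.
chat = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": transcript.text}],
)
reply = chat.choices[0].message.content

# 3) Text-to-speech: a separate model reads the answer aloud.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply)
with open("reply.mp3", "wb") as f:
    f.write(speech.content)
```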
The major innovation of GPT-4o is that a single AI model takes on all tasks. It can process, interpret and react to multimodal input, i.e. speech, text, images and audio.
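The single-model idea can be illustrated with one multimodal request instead of a pipeline. Note the caveat: at the time of writing, the new audio and video capabilities were not generally available via the API, so this sketch uses the SDK's image-input message format purely as an illustration; the image URL is a placeholder.

```python
# Illustrative sketch: one multimodal request to a single model
# (assumption: OpenAI Python SDK image-input message format;
# audio/video access was still limited to selected partners at launch).
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is happening in this picture?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/living-room.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```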
Only available to a few for now
At the moment, GPT-4o is not yet available to the general public. According to the OpenAI website, a "small group of trusted partners" will initially be given access to the AI's new audio and video capabilities. In the medium term, users with a free ChatGPT account will also be able to use GPT-4o, but with limited performance compared to the Plus account.
There will also be price adjustments for developers: OpenAI states that GPT-4o is twice as fast and costs half as much as the previous flagship model GPT-4 Turbo.