
ChatGPT-o1 vs. ChatGPT-4o – Part 2
By Mark Willoughby
In the first blog article of our two-part ChatGPT-o1 vs ChatGPT-4o series, we reported on the advanced technology in o1 and looked at the differences in estimating a sales price and the number of tennis balls that fit in a Boeing. Now it’s time to get down to business and test ChatGPT-o1 and its predecessor on four more tasks from a wide variety of subject areas:
- ChatGPT-o1 vs ChatGPT-4o, what’s new?
- Task 1: Investment return suggestion on a house purchase
- Task 2: Estimate the maximum number of tennis balls that can fit into a Boeing 737 aircraft
- Task 3: Translation from English into the Yoruba language
- Task 4: Poem Writing with Fixed Word Count
- Task 5: Calculating the ratio of dynamic pressure of different liquids (mathematical question)
- Task 6: Adding two integers without arithmetic operations (programming question)
- Our comparative conclusion
Task 3: Translation from English into Yoruba
Yoruba is a tonal language: the pitch used when pronouncing a word can completely change its meaning, which makes the language difficult for beginners to learn.
In addition, words with similar spellings can have entirely different meanings depending on the tone. Yoruba is one of the major languages of West Africa, spoken above all in Nigeria, and this linguistic complexity makes it a rich testing ground for a model’s ability to handle tonal languages. As a native speaker, I was thrilled to put the models to the test.
For example, the word ‘apẹ̀rẹ̀’ could mean different things based on its tone, underscoring the importance of understanding tonal distinctions in the language:
- apẹ̀rẹ̀ – basket
- apẹẹrẹ – example
Similarly, the word “koko” can have several meanings:
- Kókò – cocoyam
- Kókó – point (the most important part of a discussion)
- Kòkò – pot
- Koóko – grass
Below, we test o1 and GPT-4o on a short piece of wordplay built around the word “koko”.
Result ChatGPT-o1 for task 3

Result ChatGPT-4o for task 3

Comparison of results and facts:
ChatGPT-o1 did an impressive job of translating the statement correctly, especially for keywords such as “basket”, which it rendered as apẹrẹ, and “cocoyam”, which it rendered as kókó.
GPT-4o, in comparison, translated “basket” as apo, which actually refers to a bag, and wrongly translated “cocoyam” as isu ekó.
My conclusion at this point: ChatGPT-o1 performs better.
Task 4: Writing poetry with a fixed number of words
For the next task, the instruction is to write a poem on the subject of “work-life balance” in exactly 100 words. Since we give the algorithms no leeway on the word count, this is a particularly tough nut to crack.
Result ChatGPT-o1 for task 4

Result ChatGPT-4o for task 4

Our conclusion: GPT-4o was only six words off the mark and performs better in this task than its successor o1.
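If you want to verify the word counts yourself, a few lines of Python are enough. The snippet below is only a minimal sketch of such a check; poem is a placeholder for the generated text:

```python
# Minimal sketch of a word-count check for Task 4.
# "poem" is a placeholder – paste the generated poem here.
poem = "Work and life, a balance sought ..."

words = poem.split()  # split on whitespace
print(f"{len(words)} words")

if len(words) == 100:
    print("Constraint met: exactly 100 words.")
else:
    print(f"Off by {abs(len(words) - 100)} word(s).")
```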
Task 5: Our math question
Our first bonus question targets the mathematical abilities of both algorithms. To keep things from getting too easy, we have chosen a task that is reportedly among the 15 most difficult (digital) SAT questions (a solution path is available for those interested).
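As the overview above indicates, the question revolves around the ratio of the dynamic pressures of different liquids. Assuming it relies on the standard textbook definition of dynamic pressure, the relationship behind the question looks like this:

```latex
% Dynamic pressure of a fluid with density \rho moving at velocity v
q = \tfrac{1}{2}\,\rho v^{2}

% Ratio of the dynamic pressures of two liquids (indices 1 and 2)
\frac{q_1}{q_2}
  = \frac{\tfrac{1}{2}\,\rho_1 v_1^{2}}{\tfrac{1}{2}\,\rho_2 v_2^{2}}
  = \frac{\rho_1 v_1^{2}}{\rho_2 v_2^{2}}
```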

Result ChatGPT-o1 for task 5


Result ChatGPT-4o for task 5

Conclusion: Since only o1 arrived at the correct result, this task goes to o1.
Task 6: Our programming task
As listed in the overview above, the task is to add two integers without using arithmetic operations.
Result ChatGPT-o1 for task 6


Result ChatGPT-4o for task 6

Comparison of results and facts:
Good news: the code from both algorithms ran in the respective tests. However, there was one caveat: o1’s solution was more robust and more accurate.
While GPT-4o handled some cases with negative numbers correctly, its solution ran into an infinite loop for certain inputs, e.g., when adding -1 + 2. The reason is that Python integers have arbitrary precision, so without an explicit bit mask the carry keeps propagating indefinitely when one of the operands is negative.
As the code output shows, o1 needed 14 ms to run the three test cases, while GPT-4o got stuck in an infinite loop on the negative example. The sketch below illustrates the masking pattern that avoids this.
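For readers who want to reproduce this behaviour, here is a minimal sketch of the usual bitwise-addition approach. It is our own illustration, not the output of either model, and it uses a 32-bit mask, the standard way to stop the endless carry propagation described above; the test values (apart from -1 + 2 from the article) are purely illustrative:

```python
def add_without_arithmetic(a: int, b: int) -> int:
    """Add two integers using only bitwise operations.

    The 32-bit mask keeps intermediate values bounded. Without it,
    Python's arbitrary-precision integers let the carry of a negative
    operand propagate forever – the infinite loop seen with -1 + 2.
    """
    MASK = 0xFFFFFFFF      # keep results within 32 bits
    MAX_INT = 0x7FFFFFFF   # largest positive 32-bit value

    while b != 0:
        carry = (a & b) << 1    # bits that overflow into the next position
        a = (a ^ b) & MASK      # sum without carry, truncated to 32 bits
        b = carry & MASK        # carry, truncated to 32 bits

    # Reinterpret the 32-bit pattern as a signed Python integer.
    return a if a <= MAX_INT else ~(a ^ MASK)


print(add_without_arithmetic(3, 5))    # 8
print(add_without_arithmetic(-1, 2))   # 1 – the input that looped forever in the GPT-4o version
print(add_without_arithmetic(-7, -9))  # -16
```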
Conclusion: The quality of the o1 result is better, so this task goes to o1.
Our comparative conclusion
The advantages of o1, and the possibilities they open up, were particularly impressive for complex translations from less common languages such as Yoruba, where a deeper understanding of context is important. o1 also demonstrated its capabilities in the programming and math tasks.
These advantages are, of course, offset by the longer running time that you (currently) still have to accept. Perhaps the golden mean is a combination of both models: initial prompts to GPT-4o, followed by more specific ones to o1. I am excited to see where the journey with o1 will take us.

Mark Willoughby
Data Scientist