In the first article of this two-part ChatGPT-o1 vs. ChatGPT-4o series, we reported on the advanced technology in o1 and compared how the two models estimate a sales price and the number of tennis balls that fit in a Boeing. Now it’s time to get down to business and test ChatGPT-o1 and its predecessor on four more tasks from a wide variety of subject areas:

Task 3: Translation from English into Yoruba

We have noticed that the o1 model achieves better performance on several language benchmarks. Here we test it on Yoruba, which presents unique challenges due to its tonal nature.

Yoruba is a tonal language: the pitch used when pronouncing a word can completely change its meaning, so words with similar spellings can mean entirely different things. This makes it difficult for beginners to learn.

Yoruba is one of the major languages spoken in West Africa, especially in Nigeria, and its linguistic complexity offers rich ground for evaluating a model's capacity to handle tonal languages, where subtle shifts in pronunciation can lead to drastic changes in meaning.
As a native speaker, I was thrilled to test the models' translation capabilities.

For example, the word ‘apẹ̀rẹ̀’ could mean different things based on its tone, underscoring the importance of understanding tonal distinctions in the language:
  • apẹ̀rẹ̀ – Basket
  • apẹẹrẹ – Example

Similarly, the word “koko” can have several meanings:
  • Kókò – Cocoyam
  • Kókó – Point (the most important part of a discussion)
  • Kòkò – Pot
  • Koóko – Grass
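To make the role of the tone marks concrete, here is a minimal Python sketch (not part of the original tests). It assumes the tone marks are stored as Unicode accents and uses the standard `unicodedata` module to strip them; the helper name `strip_tone_marks` and the `meanings` dictionary are illustrative only. Once the marks are removed, three of the words listed above become indistinguishable.

```python
import unicodedata

def strip_tone_marks(word: str) -> str:
    # Decompose accented letters into base letter + combining mark (NFD),
    # then drop every combining mark (Unicode category "Mn").
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# Three of the "koko" variants from the list above (illustrative data).
meanings = {"Kókò": "cocoyam", "Kókó": "point", "Kòkò": "pot"}

# Without tone marks, all three collapse to the same spelling.
collapsed = {strip_tone_marks(word).lower() for word in meanings}
print(collapsed)  # {'koko'}
```

A translator, human or machine, therefore has to recover the intended tone from context whenever the marks are missing or inconsistent.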

Below we test o1 and GPT-4o for the word ‘koko’ with some wordplay.

Result ChatGPT-o1 for task 3

Result ChatGPT-4o for task 3

Comparison of results and facts:

ChatGPT-o1 did an impressive job of translating the statement correctly, especially keywords such as “basket”, which it translated as apẹrẹ, and “cocoyams”, which it translated as kókó.

In comparison, GPT-4o translated “basket” as apo, which refers to a bag, and wrongly translated “cocoyam” as isu ekó.

My conclusion at this point: ChatGPT-o1 performs better.

Task 4: Writing poetry with a fixed number of words

For the next task, the instruction is to write a poem on the subject of “work-life balance” with exactly 100 words. As we don’t give the algorithms any leeway on the number of words, this is a particularly tough nut to crack.

Result ChatGPT-o1 for task 4

Result ChatGPT-4o for task 4

Using a word counter to analyze the results, we find that neither model hit exactly 100 words. ChatGPT-o1’s poem has 118 words, which is definitely too long. Interestingly, GPT-4o created a poem with 94 words, which is too short.
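The article does not say which word counter was used. As a rough sketch, assuming words are simply whitespace-separated tokens, a count like this can be reproduced in a few lines of Python. The function name `count_words` and the sample sentence are illustrative, not the poems from the test.

```python
def count_words(text: str) -> int:
    # Assumption: a "word" is any run of non-whitespace characters,
    # so "work-life" and "balance." each count as one word.
    return len(text.split())

sample = "Work and life, a work-life balance."
print(count_words(sample))  # 6
```

Other counters may treat hyphenated words or punctuation differently, so counts can vary slightly between tools.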

Our conclusion: GPT-4o was only six words off the mark and performs better in this task than its successor.

Task 5: Our math question

Our first bonus question targets the mathematical abilities of both algorithms. So that it doesn’t get too easy, we have chosen a task from the supposedly 15 most difficult (digital) SAT questions (for those interested, there is this solution path).

Result ChatGPT-o1 for task 5

Result ChatGPT-4o for task 5

Result comparison: This example demonstrates the key strength of o1. The correct answer is 2.25, or 9/4, which o1-preview calculated correctly, while GPT-4o gave an answer of 1.125, or 9/8. Even though GPT-4o started the calculation and the equation on a good note, the model got confused when computing q2, the dynamic pressure at a velocity of 1.5υ.
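The question itself is not reproduced here, so the following is only a sanity check under stated assumptions: that dynamic pressure follows the standard formula q = ½ · density · velocity², and that the task asks for the ratio of the dynamic pressure at 1.5υ to the one at υ for the same fluid. The function name `dynamic_pressure` and the unit values are placeholders.

```python
from fractions import Fraction

def dynamic_pressure(density, velocity):
    # Standard formula (assumed to match the SAT question): q = 1/2 * density * velocity^2
    return Fraction(1, 2) * density * velocity ** 2

density = Fraction(1)   # placeholder value, cancels out in the ratio
v = Fraction(1)         # placeholder value for the slower velocity

q1 = dynamic_pressure(density, v)
q2 = dynamic_pressure(density, Fraction(3, 2) * v)

print(q2 / q1)         # 9/4
print(float(q2 / q1))  # 2.25
```

Under these assumptions the ratio is 9/4 regardless of density. With the placeholder values, q2 on its own evaluates to 9/8, but whether that is where GPT-4o went wrong is not something the test output confirms.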

Conclusion: Since only o1 arrived at the correct result, this task goes to o1.

Task 6: Our programming task

Our second bonus question is intended to reveal how well the algorithms support the (supposed) everyday work of a data scientist, or of anyone else who programs in Python. The task is as follows: “Write a program in Python that adds two integers, without arithmetic operation” (if that’s not a classic data science challenge from everyday life 😀 ).

Result ChatGPT-o1 for task 6

Result ChatGPT-4o for task 6

Comparison of results and facts:

Good news: the code from both algorithms ran in the respective tests. However, there was one catch: o1’s solution was more robust and accurate.
While GPT-4o's solution handled some cases with negative numbers correctly, it ran into an infinite loop for certain inputs, e.g., when adding -1 + 2. The cause is that Python integers have arbitrary precision, so the carry bits keep propagating indefinitely when negative numbers are involved.
As shown in the code output, o1's solution took 14 ms to run the three cases, while GPT-4o's ended in an infinite loop on the negative example.
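The models' actual code is not reproduced here. As an illustration of how the carry problem is commonly avoided, the sketch below restricts the computation to 32 bits with a mask. The 32-bit width, the function name `add_without_arithmetic`, and the constants are assumptions for this example, not o1's solution.

```python
MASK = 0xFFFFFFFF     # keep only the lowest 32 bits (assumed word size)
MAX_INT = 0x7FFFFFFF  # largest positive 32-bit signed integer

def add_without_arithmetic(a: int, b: int) -> int:
    # Work on the 32-bit two's-complement representation of both inputs.
    a &= MASK
    b &= MASK
    while b != 0:
        carry = (a & b) << 1  # bits that overflow into the next position
        a = (a ^ b) & MASK    # sum of the bits without the carry
        b = carry & MASK      # masking drops carries beyond bit 31, so the loop ends
    # Values above MAX_INT represent negative numbers; convert them back.
    return a if a <= MAX_INT else ~(a ^ MASK)

print(add_without_arithmetic(-1, 2))  # 1
print(add_without_arithmetic(-5, 2))  # -3
print(add_without_arithmetic(3, 4))   # 7
```

Without the mask, the carry for -1 + 2 keeps shifting left forever, because a negative Python integer behaves as if it had an unlimited number of leading 1 bits. The trade-off of this sketch is that results outside the 32-bit range wrap around.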

Conclusion: The quality of the o1 result is better, so this task goes to o1.

Our comparative conclusion

In my opinion, ChatGPT-o1 solved four of the six tasks better, or at least in a way that is easier for humans to follow. The predecessor ChatGPT-4o only did better in the poem task with a fixed number of words. Estimating the number of tennis balls ended in a tie.

The advantages of o1 and the possibilities they open up were particularly impressive for complex translations involving less common languages, such as Yoruba in our example, where a deeper understanding of the context is important. o1 also demonstrated its capabilities in the programming and math tasks.

The advantages of o1 are of course offset by the longer running time that you (currently) still have to accept. Perhaps the golden mean is a combination of both models: initial prompts with GPT-4o, followed by more specific ones for o1. I am excited to see where the journey for o1 will take us.
