
ChatGPT-o1 vs. ChatGPT-4o
- By Mark Willoughby
In the world of artificial intelligence, speed counts. While ChatGPT 4 and its successor ChatGPT-4o are still winning users over, OpenAI has had the next heavyweight in the ring since mid-September 2024: ChatGPT-o1 (o1). As a beta user, I recently had the opportunity to put the latest model through its paces and, of course, to pit it against GPT-4o. Both candidates had to prove their general problem-solving ability and perform classic tasks from the everyday work of a data scientist. In this two-part blog, I present the results and my personal conclusions:
- ChatGPT-o1 vs ChatGPT-4o, what’s new?
- Task 1: Investment return suggestion on a house purchase
- Task 2: Estimate the maximum number of tennis balls that can fit into a Boeing 737 aircraft
- Task 3: Translation from English into the Yoruba language
- Task 4: Poem Writing with Fixed Word Count
- Task 5: Calculating the ratio of dynamic pressure of different liquids (mathematical question)
- Task 6: Adding two integers without arithmetic operations (programming question)
ChatGPT-o1 vs ChatGPT-4o, what's new?
OpenAI promises to improve the quality of answers with o1 by having the model mimic a human thought process (which may take a little more time) rather than answering directly. This approach makes it more effective at tackling complex tasks, especially in areas such as programming and math. On a technical level, OpenAI has added a layer around the actual model which, in the spirit of metacognition, first considers how the problem can best be broken down into subtasks instead of attempting to solve it in one pass.
Once the subtasks are solved, the model re-evaluates whether the results are useful or whether steps need to be repeated, following the “tree of thoughts” (ToT) approach. Andrew Ng’s thoughts on “Agentic Design Patterns” are also recommended reading.
A human being would not simply write an essay in one go either, but would plan it, draft it, evaluate it paragraph by paragraph and rewrite it where necessary.
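The plan, solve and review cycle described above can be illustrated with a deliberately simplified sketch. It is not OpenAI's implementation, whose details are not public, and it is a linear retry loop rather than a full tree-of-thoughts search. The function `solve_with_review` and the toy `toy_propose` and `toy_review` helpers are hypothetical stand-ins for model calls.

```python
from typing import Callable, List


def solve_with_review(
    subtasks: List[str],
    propose: Callable[[str, int], str],
    is_acceptable: Callable[[str, str], bool],
    max_attempts: int = 3,
) -> List[str]:
    """Solve each subtask, retrying when the review step rejects a draft."""
    results = []
    for subtask in subtasks:
        draft = ""
        for attempt in range(max_attempts):
            draft = propose(subtask, attempt)
            if is_acceptable(subtask, draft):
                break  # keep the first draft that passes review
        results.append(draft)  # falls back to the last draft if none passed
    return results


# Toy stand-ins for the model calls (purely illustrative assumptions).
def toy_propose(subtask: str, attempt: int) -> str:
    return f"{subtask}: draft v{attempt + 1}"


def toy_review(subtask: str, draft: str) -> bool:
    # Pretend the outline is fine immediately and everything else
    # needs a second pass.
    return subtask == "outline" or draft.endswith("v2")


essay_steps = ["outline", "introduction", "conclusion"]
drafts = solve_with_review(essay_steps, toy_propose, toy_review)
print(drafts)
```

In the essay analogy, each paragraph is a subtask that is drafted, checked and rewritten until the review accepts it or the attempt budget runs out.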
ChatGPT-o1 with fewer AI hallucinations
By incorporating advanced reasoning, the model has the potential to reduce hallucination, a common issue in which AI models confidently state incorrect or nonsensical information that is not grounded in reality.
However, there are some caveats. Because of its reasoning step, the model takes longer to generate responses than previous versions. In addition, unlike models such as GPT-4o, the o1-preview model cannot currently accept files and images directly as input, and its data analysis functionality is more limited.
Compared with GPT-4o, one notable difference is o1-preview's emphasis on human-like reasoning, which improves its ability to handle complex, multi-step tasks.
Given that the o1-preview model boasts enhanced reasoning capabilities, we decided to test it with questions from various categories that require a higher level of reasoning. The model has already demonstrated noticeable improvements in reasoning tasks and language benchmark tests compared to previous versions. Therefore, we will be further testing its functionality to explore its potential in more complex scenarios and use cases.
Task 1: Investment return suggestion on a house purchase
I specifically stated that the apartment is located in Hamburg, hoping the models would take the local market into account. By setting the time period to 10 years, I also hoped they would consider factors such as inflation and growth in real estate prices.

We found that ChatGPT-o1 took almost 28 seconds to think about this question, making realistic assumptions about the inflation rate and property value growth.
The comparison of results shows that both models provide information on inflation and the development of real estate prices, even if they make different assumptions for the latter. Both models also attempt to justify their approach and the final calculated price.
Result ChatGPT-o1 for task 1


Result ChatGPT-4o for task 1


Comparison of results and facts:
ChatGPT-o1 | Criterion | ChatGPT-4o |
---|---|---|
2% p.a. | Assumed inflation rate | 2% p.a. |
4% p.a. | Assumed increase in value | 4% p.a. |
Yes | Disclosure of the calculation details | No |
Yes | Justification for calculation steps | Yes |
Selling price between 370 000€ and 430 000€ | Final price proposal | Selling price of 407 323€ |
Conclusion: Both models started from reasonable assumptions, but o1's approach of presenting a price range with a rationale feels more human, and I personally prefer it.
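To make the role of the two assumptions concrete, here is a minimal sketch of the underlying arithmetic: compound growth at 4% p.a. over 10 years, then deflating by 2% p.a. inflation. The purchase price used here is a hypothetical placeholder, since the original prompt figure is not reproduced in this post, and the helper names are my own. It is not the calculation either model actually performed.

```python
def future_value(price: float, annual_growth: float, years: int) -> float:
    """Nominal value after compounding annual growth for the given years."""
    return price * (1 + annual_growth) ** years


def real_value(nominal: float, inflation: float, years: int) -> float:
    """Value expressed in today's purchasing power."""
    return nominal / (1 + inflation) ** years


PURCHASE_PRICE = 300_000.0  # hypothetical, NOT the figure from the prompt
GROWTH = 0.04               # assumed increase in value (from the table)
INFLATION = 0.02            # assumed inflation rate (from the table)
YEARS = 10

nominal = future_value(PURCHASE_PRICE, GROWTH, YEARS)
real = real_value(nominal, INFLATION, YEARS)

print(f"Nominal value after {YEARS} years: {nominal:,.0f} EUR")
print(f"Inflation-adjusted value:          {real:,.0f} EUR")
```

Varying `GROWTH` by a percentage point in either direction is one simple way to arrive at a price range rather than a single figure.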
Task 2: Estimate the maximum number of tennis balls that can fit into a Boeing 737 aircraft
Result ChatGPT-o1 for task 2

Result ChatGPT-4o for task 2

Conclusion: Because o1 also included the cargo volume, it expects almost 100K more balls. However, since we did not provide any information on the volume and both models made plausible assumptions and estimates, there is no winner here: it is a draw. We will probably have to wait a little longer for the real answer.
The first two questions have been asked and answered. We will report on how the two giants fared in the other four tasks in Part 2.
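An estimate of this kind typically boils down to usable volume times packing density, divided by the volume of one ball. The sketch below shows that calculation. The cabin and cargo volumes are hypothetical placeholders, not the figures either model used; the ball diameter of 6.7 cm sits in the middle of the official size range, and 0.64 is the usual approximation for randomly packed spheres.

```python
import math


def balls_that_fit(
    volume_m3: float,
    ball_diameter_m: float = 0.067,
    packing_density: float = 0.64,
) -> int:
    """Estimate how many spheres fit into a volume at a given packing density."""
    ball_volume = (math.pi / 6) * ball_diameter_m ** 3
    return int(volume_m3 * packing_density / ball_volume)


CABIN_M3 = 150.0  # hypothetical cabin volume
CARGO_M3 = 25.0   # hypothetical cargo hold volume

cabin_only = balls_that_fit(CABIN_M3)
with_cargo = balls_that_fit(CABIN_M3 + CARGO_M3)

print(f"Cabin only:       {cabin_only:,} balls")
print(f"Cabin plus cargo: {with_cargo:,} balls")
print(f"Difference:       {with_cargo - cabin_only:,} balls")
```

The result scales linearly with the assumed volume, which is why including or omitting the cargo hold shifts the answer so noticeably.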

Mark Willoughby
Data Scientist