ChatGPT-o1 vs. ChatGPT-4o


In the world of artificial intelligence, speed counts. While ChatGPT-4 and its successor ChatGPT-4o are still winning over users, OpenAI has been sending the next giant into the ring since mid-September 2024: ChatGPT-o1 (o1). As a beta user, I recently had the opportunity to put the latest model through its paces and, of course, to pit it against GPT-4o. Both candidates had to prove their general problem-solving ability and perform classic tasks from the everyday life of a data scientist. In this two-part blog, I present the results and my personal conclusions:

  • ChatGPT-o1 vs ChatGPT-4o, what’s new?
  • Task 1: Investment return suggestion on a house purchase
  • Task 2: Estimate the maximum number of tennis balls that can fit into a Boeing 737 aircraft
  • Task 3: Translation from English into the Yoruba language
  • Task 4: Poem writing with fixed word count
  • Task 5: Calculating the ratio of dynamic pressure of different liquids (mathematical question)
  • Task 6: Programming: add two integers without arithmetic operations (programming question)

ChatGPT-o1 vs ChatGPT-4o, what's new?

OpenAI promises to improve the quality of answers with o1 by having the model mimic a human thought process (which may take a little more time) rather than providing answers directly. This approach makes it more effective at tackling complex tasks, especially in areas such as programming and math. On a technical level, OpenAI has therefore added a layer around the actual model which, in the spirit of metacognition, first considers how the problem can best be broken down into subtasks instead of answering it in a single pass.

Once the subtasks have been solved, the model re-evaluates whether the results are useful or whether a step needs to be repeated. This follows the “tree of thoughts” (ToT) approach; Andrew Ng’s thoughts on “Agentic Design Patterns” are also recommended reading.

Tree of thoughts, by Yao et al. (2023)

Even as a human being, you wouldn’t write an essay in one go, for example. You would plan, draft and evaluate it paragraph by paragraph and rewrite it if necessary.
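To make the plan, evaluate and retry idea more concrete, here is a minimal sketch of a tree-of-thoughts style search in Python. It is not OpenAI's implementation, which has not been published; it is only a generic beam search over partial solutions. The function names (`tree_of_thoughts`, `propose`, `score`, `is_solution`) and the toy problem (reach a target sum with steps of 1, 3 or 4) are assumptions made for illustration. In a real system, `propose` and `score` would be calls to a language model.

```python
from typing import Callable, List, Optional, Tuple

State = Tuple[int, ...]  # a partial solution: the steps chosen so far


def tree_of_thoughts(
    root: State,
    propose: Callable[[State], List[State]],
    score: Callable[[State], float],
    is_solution: Callable[[State], bool],
    beam_width: int = 3,
    max_depth: int = 5,
) -> Optional[State]:
    """Breadth-first search that keeps only the most promising partial solutions."""
    frontier = [root]
    for _ in range(max_depth):
        # 1. Expand: every kept state proposes its possible next "thoughts".
        candidates = [child for state in frontier for child in propose(state)]
        if not candidates:
            return None
        # 2. Check: stop as soon as one candidate solves the problem.
        for candidate in candidates:
            if is_solution(candidate):
                return candidate
        # 3. Evaluate and prune: keep the best few, discard the rest.
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]
    return None


# Toy problem (hypothetical): reach exactly TARGET by adding steps of 1, 3 or 4.
TARGET = 10


def propose(state: State) -> List[State]:
    return [state + (step,) for step in (1, 3, 4)]


def score(state: State) -> float:
    # Higher is better: partial sums closer to the target score higher.
    return -abs(TARGET - sum(state))


def is_solution(state: State) -> bool:
    return sum(state) == TARGET


result = tree_of_thoughts((), propose, score, is_solution)
print("steps found:", result)
```

The pruning step is what separates this from simply writing an answer down: weak intermediate results are dropped and the search continues from the stronger ones.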

ChatGPT-o1 with fewer AI hallucinations

By incorporating advanced reasoning functionality, the model has the potential to reduce hallucination, a common issue in AI where models confidently state incorrect or nonsensical information that isn’t grounded in reality.

However, there are some caveats. Because of its reasoning step, the model takes more time to generate responses than previous versions. Additionally, unlike GPT-4o, the o1-preview model cannot currently accept files and images as direct input, and its data analysis functionality is more limited.

When comparing o1-preview to GPT-4o, one notable difference is the emphasis on human-like reasoning in o1-preview, which improves its ability to handle more complex, multi-step tasks.

Given that the o1-preview model boasts enhanced reasoning capabilities, we decided to test it with questions from various categories that require a higher level of reasoning. The model has already demonstrated noticeable improvements over previous versions in reasoning tasks and language benchmark tests, so we wanted to explore its potential in more complex scenarios and use cases.

Task 1: Investment return suggestion on a house purchase

I specifically stated that the apartment is located in Hamburg, hoping the models would take information about the local market into account. By setting the time period to 10 years, I also hoped the models would consider factors such as inflation or growth in real estate prices.

Prompt for ChatGPT-o1 and ChatGPT-4o: selling a house at a profit

We found that ChatGPT-o1 took almost 28 seconds to think about this question and made realistic assumptions about the inflation rate and property value growth.

The comparison of results shows that both models provide information on inflation and the development of real estate prices, even if they make different assumptions for the latter. Both models also attempt to justify their approach and the final calculated price.
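The kind of calculation both models have to carry out can be sketched in a few lines of Python. This is only an illustration of compound growth and inflation adjustment, not a reconstruction of either model's answer. The purchase price is a hypothetical value chosen for the example (the original prompt is not reproduced here), so the output is not comparable with the prices the models proposed. The function names are likewise made up for this sketch.

```python
def future_value(price_today: float, annual_growth: float, years: int) -> float:
    """Nominal value after compounding a yearly growth rate."""
    return price_today * (1 + annual_growth) ** years


def real_value(nominal: float, annual_inflation: float, years: int) -> float:
    """Value expressed in today's purchasing power."""
    return nominal / (1 + annual_inflation) ** years


PURCHASE_PRICE = 300_000.0  # hypothetical purchase price in EUR, not from the article
GROWTH = 0.04               # assumed yearly increase in property value
INFLATION = 0.02            # assumed yearly inflation rate
YEARS = 10

nominal_price = future_value(PURCHASE_PRICE, GROWTH, YEARS)
real_price = real_value(nominal_price, INFLATION, YEARS)

print(f"Nominal value after {YEARS} years: {nominal_price:,.0f} EUR")
print(f"In today's purchasing power:     {real_price:,.0f} EUR")
print(f"Real gain over purchase price:   {real_price - PURCHASE_PRICE:,.0f} EUR")
```

Separating the nominal from the inflation-adjusted value shows why a sale price that merely matches the purchase price after 10 years would be a loss in real terms.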

Result ChatGPT-o1 for task 1

Screenshots: ChatGPT-o1 result for task 1 (two parts)

Result ChatGPT-4o for task 1

Screenshot: ChatGPT-4o result for task 1

Comparison of results and facts:

Criterion                             ChatGPT-o1                                     ChatGPT-4o
Assumed inflation rate                2% p.a.                                        2% p.a.
Assumed increase in value             4% p.a.                                        4% p.a.
Disclosure of calculation details     Yes                                            No
Justification for calculation steps   Yes                                            Yes
Final price proposal                  Sales price between 370 000€ and 430 000€      Selling price of 407 323€

The main difference between ChatGPT-o1 and GPT-4o is that o1 preferred to give a recommended range for the sale price of the apartment, along with a detailed explanation, while GPT-4o offered a fixed value within the range suggested by o1. In addition, o1’s response explains the individual calculation steps in detail at each point, whereas GPT-4o states the facts but does the aggregation in the background.

Conclusion: Both models started from reasonable assumptions, but presenting a price range with a rationale feels more human, so o1’s answer is my personal preference.

Task 2: Estimate the maximum number of tennis balls that can fit into a Boeing 737 aircraft

The next question takes us even further into human reasoning with a typical interview question: the models are asked to roughly estimate how many tennis balls can fit into a Boeing 737 aircraft. We deliberately did not state the dimensions of a tennis ball or of the aircraft, hoping the models would work these out themselves.

Result ChatGPT-o1 for task 2

Result ChatGPT-4o for task 2

Comparison of results and facts: First of all, it is interesting to see that both models evidently use different sources for the cabin dimensions, even if these differ only in the decimal places. o1 even went one step further and took the cargo space into account in addition to the cabin volume (fair, as we had not provided any information on this). The two models do not differ in their approach: first they determine the aircraft volume and then the volume of a tennis ball, this time with the same tennis ball size. o1 first applies the correction for packing inefficiency and then derives the maximum number of balls, while GPT-4o first calculates the maximum number and then corrects it. However, as only the final number of balls is relevant for us, the order of calculation is irrelevant.
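The shared approach (aircraft volume, ball volume, packing correction) can be written down as a short Python sketch. All numbers here are illustrative assumptions and not the figures used by either model: the cabin and cargo volumes are hypothetical round values, the ball diameter of 6.7 cm lies within the range permitted for tennis balls, and 0.64 is the approximate density of randomly packed spheres. The function names are invented for this example.

```python
import math


def ball_volume_m3(diameter_m: float) -> float:
    """Volume of a sphere with the given diameter, in cubic metres."""
    radius = diameter_m / 2
    return 4 / 3 * math.pi * radius ** 3


def estimate_ball_count(space_m3: float, diameter_m: float, packing: float) -> int:
    """Number of balls that fit if only `packing` of the space is filled by balls."""
    return math.floor(space_m3 * packing / ball_volume_m3(diameter_m))


CABIN_VOLUME_M3 = 230.0  # hypothetical cabin volume, not taken from either model
CARGO_VOLUME_M3 = 25.0   # hypothetical cargo hold volume, not taken from either model
BALL_DIAMETER_M = 0.067  # assumed tennis ball diameter (6.7 cm)
PACKING = 0.64           # approximate density of randomly packed spheres

cabin_only = estimate_ball_count(CABIN_VOLUME_M3, BALL_DIAMETER_M, PACKING)
with_cargo = estimate_ball_count(
    CABIN_VOLUME_M3 + CARGO_VOLUME_M3, BALL_DIAMETER_M, PACKING
)

print(f"Cabin only:       {cabin_only:,} balls")
print(f"Cabin plus cargo: {with_cargo:,} balls")
```

Because the estimate is a single product and quotient, applying the packing factor before or after dividing by the ball volume gives the same count, which is why the order of the two models' steps does not matter.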

Conclusion: Ultimately, the additional cargo volume means that o1 expects almost 100K more balls. However, since we did not specify the volume and both models made reasonable, plausible assumptions and estimates, there is no winner here: it is a draw. We will probably have to wait a little longer for the real answer.

The first two questions have been asked and answered. We will report on how the two giants fared in the other four tasks in Part 2.

Mark Willoughby

Data Scientist
