Pressure Testing GPT-4 & Claude 2.1 Long Context


Longer context, better retrieval?

"Needle In A Haystack" Analysis With Large Context Models

Welcome to the 155 people who have joined us since last week! If you aren’t subscribed, join 3,034 fun AI fans. View this post online.

Picture this, I’m sitting at TheGP in San Francisco watching OpenAI’s Dev Day.

Sam Altman just announced GPT-4 with 128K tokens of context. This means you can fit nearly 300 pages of text into a prompt.

Every developer gets excited. No longer do they have to split their prompts into pieces and send them to GPT-4 one by one.

But I’m sitting there wondering, "longer context sounds great, but is there a performance impact?"

I decided to run a small test.

I took a single page of text and placed a random fact in it. I gave GPT-4 that page and asked it to retrieve the fact. Then I did the same thing again, but this time with 300 pages of context. Can you guess what happened?

With a single page, GPT-4 performed perfectly. It was able to recall the random fact. But with 300 pages, GPT-4 wasn’t able to do it.

I wondered where the breaking point was. So I decided to find out.

In the end, the results for both GPT-4 and Claude 2.1 went viral on Twitter with 2.5M views.

What’s the big deal with context lengths?

Around the time that ChatGPT was launched, developers were stuck with ~4K tokens of context. With such a small window, we had to be selective with the text we put in it.

This limitation set off alarms for every LLM marketing & product team. The hunt for longer contexts was on.

Since then we’ve seen progressively larger contexts released. OpenAI just launched GPT-4 Turbo (128K tokens of context), and Anthropic followed with Claude 2.1 (200K tokens of context). That’s 486 pages of context!

Increased context means we can do larger analyses, larger summaries, and loosen up our efforts on retrieval.

As far as marketing was concerned, the bigger the context, the better the model.

I wanted to find out if this was actually true. So I put GPT-4 and Claude 2.1 to the test.

Needle In A Haystack

I’m a simple person, I like milk with my cereal, butter with my bread, and easy, practical ways to test LLMs.

What made most sense to me was placing a random statement in the middle of a really long background context to see if the model could pull it out. After seeing GPT-4 break down on my first test, I decided to iterate through progressively larger context lengths to see where the breaking point was.
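
Here’s a minimal sketch of that core operation: trim a long background text to a target token count, then splice the “needle” in at a chosen position (the position will matter in a moment). The function name and the use of tiktoken for token counting are my own illustration, not necessarily how the published test was implemented:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

def build_haystack(background: str, needle: str, context_length: int, depth_percent: float) -> str:
    """Trim background to ~context_length tokens, then insert the needle depth_percent of the way in."""
    tokens = enc.encode(background)[:context_length]
    insert_at = int(len(tokens) * depth_percent / 100)
    spliced = tokens[:insert_at] + enc.encode(needle) + tokens[insert_at:]
    return enc.decode(spliced)
```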

But then I remembered that where your fact was placed in the document had an impact on retrieval. This was made popular by the “lost in the middle” paper. The authors found that facts placed at the top and bottom of a document had better recall than those in the middle.

So I decided that I would also iterate through document depths (the % downward the fact was placed).
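
Put together, the sweep is just a two-dimensional grid: one axis for context length, one for needle depth. Here’s a sketch of the loop, reusing build_haystack from above. The needle text, file name, prompt wording, and model string are illustrative stand-ins, not necessarily what the published test used:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

NEEDLE = "The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day."
QUESTION = "What is the best thing to do in San Francisco?"
BACKGROUND_TEXT = open("essays.txt").read()  # any long filler text to act as the haystack

def ask_model(context: str, question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[
            {"role": "system", "content": "Answer the question using only the provided context."},
            {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

# One axis for context length, one for needle depth
context_lengths = np.linspace(1_000, 128_000, num=15, dtype=int)  # tokens
depth_percents = np.linspace(0, 100, num=15)                      # % down the document

results = []
for ctx in context_lengths:
    for depth in depth_percents:
        prompt = build_haystack(BACKGROUND_TEXT, NEEDLE, int(ctx), float(depth))
        results.append({"context": int(ctx), "depth": float(depth), "answer": ask_model(prompt, QUESTION)})
```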

I would use LangChain evals to quickly judge whether or not the fact was retrieved correctly.
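
The grading step can be as simple as handing the question, the expected answer, and the model’s answer to a judge LLM. A sketch using LangChain’s QAEvalChain (import paths and output keys shift between LangChain versions, so treat this as illustrative):

```python
from langchain.chat_models import ChatOpenAI
from langchain.evaluation.qa import QAEvalChain

judge = ChatOpenAI(model="gpt-4", temperature=0)  # GPT-4 as the grader
eval_chain = QAEvalChain.from_llm(llm=judge)

for r in results:  # results from the sweep sketched above
    graded = eval_chain.evaluate(
        examples=[{"query": QUESTION, "answer": NEEDLE}],  # what we expected
        predictions=[{"result": r["answer"]}],             # what the model said
    )
    r["grade"] = graded[0]  # contains a CORRECT / INCORRECT judgment
```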

The GPT-4 test was paid for out of my own pocket (~$215), so I only did a 15-by-15 grid of searches, with the larger contexts evaluated twice for more data (~337 API calls in total).

After I tweeted out the results of the GPT-4 test, I was contacted by Anthropic, who wanted me to run the test for them as well. Knowing that the test would get pricey, they offered credits to run it. (They didn’t bias the results at all, just gave credits.)

Since my spending limit was raised, I decided to do a 35-by-35 grid (~1,225 API calls). Never ask a data person if they want more data; the answer is always yes 😈. Then I published those results.

I was asked multiple times, “Can you send me the white paper on this?” They were looking for me to send arXiv links, but they got a long tweet instead.

What did I learn?

Model Retrieval

  • At the largest token lengths, neither GPT-4 nor Claude 2.1 could reliably retrieve placed facts
  • Facts at the very top and very bottom of the document were recalled with nearly 100% accuracy
  • For some reason, facts positioned in the top half of the document were recalled less reliably than those in the bottom half

So what does this mean for you?

  • Prompt Engineering Matters - It’s worth tinkering with your prompt and running A/B tests to measure retrieval accuracy
  • No Guarantees - Your facts are not guaranteed to be retrieved. Don’t bake the assumption they will into your applications
  • Less context = more accuracy - This is well known, but when possible, reduce the amount of context you send to the model to increase its ability to recall. RAG is still extremely important (see the sketch after this list)
  • Position Matters - Also well known, but facts placed at the very beginning and in the second half of the document seem to be recalled better
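
One common way to act on the “less context” point is to retrieve only the most relevant chunks before prompting, instead of stuffing the whole document in. A minimal sketch using OpenAI embeddings and cosine similarity; the function name, chunking, and model choice are my own illustration:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def top_k_chunks(question: str, chunks: list[str], k: int = 5) -> list[str]:
    """Embed the question and every chunk, return the k chunks most similar to the question."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=[question] + chunks)
    vecs = np.array([d.embedding for d in resp.data])
    q, docs = vecs[0], vecs[1:]
    sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]
```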

Predictions

  • Eventually recalling simple facts from long contexts won’t be a problem
  • Companies will put less emphasis on large context windows
  • In-context retrieval benchmarks will become the norm. We are already seeing the pressure ramp up

Greg Kamradt

Twitter / LinkedIn / Youtube / Work With Me
