
Lessons learned with 500 million GPT tokens


[CSDN Editor's Note] Over the last six months, our company has shipped several features built on large language models. What I read about LLMs on Hacker News is very different from what I've actually encountered, so I want to share some of the lessons I've learned after processing roughly 500 million tokens.

Original link: https://kenkantzer.com/lessons-after-a-half-billion-gpt-tokens/

Reproduction without permission is prohibited!

Author | KEN KANTZER    Editor | Xia Meng

Translator | Wan Yue    Produced by | CSDN (ID: CSDNnews)

First, some background:

  • We are using OpenAI models.
  • The usage split is GPT-4: 85%, GPT-3.5: 15%.
  • We deal exclusively with text, so no gpt-4-vision, Sora, Whisper, etc.
  • Our use case is B2B, focused on summarizing/analyzing/extracting.
  • 500,000,000 tokens is not as large as it sounds: roughly 750,000 pages of text.

Lesson 1: When it comes to prompts, less is more

We've found that when something is common knowledge, the prompt doesn't need to enumerate an exact list, and results are better without excessive explanation. GPT isn't stupid; being overly specific actually confuses it.

This is fundamentally different from writing code, where everything has to be explicit.

Here's an example to illustrate the problem:

Part of our pipeline reads blocks of text and asks GPT to classify them as relating to one of the 50 US states or the federal government. The task itself isn't hard; we could have used string matching or regexes, but there were enough weird edge cases that it would have taken longer. So our first attempt was roughly:

Here's a block of text. One field should be "locality_id", and it should be the ID of one of the 50 states, or federal, using this list:

[{"locality": "Alabama", "locality_id": 1}, {"locality": "Alaska", "locality_id": 2} ... ]

This prompt would work most of the time (I'd estimate over 98%), but it failed often enough that we had to dig deeper.

While investigating, we noticed that another field, "name", consistently returned the full name of the state, and the correct one at that, even though we hadn't explicitly asked for it.

So we switched to a simple string search on "name" to find the state, and it has worked ever since.

I think the better approach in general is to tell GPT: "You obviously know the 50 states, so just give me the full name of the state this pertains to, or Federal if this pertains to the US government."

Wild, right? GPT's quality and generalization both improve when the prompt is vaguer: quintessential higher-order delegation and thinking.
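To make the comparison concrete, here is a minimal sketch of this second approach, assuming the official openai Python SDK; the prompt wording, ID table, and function name are illustrative, not the author's actual code:

```python
# Hypothetical sketch: ask GPT for the full state name, then map it to an
# internal ID with a plain string lookup instead of making GPT pick the ID.
import json
from openai import OpenAI

client = OpenAI()

# Illustrative ID table; the real one would cover all 50 states.
LOCALITY_IDS = {"Alabama": 1, "Alaska": 2, "Federal": 51}

def classify_locality(text: str) -> int | None:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                "You obviously know the 50 US states. Return JSON with a "
                '"name" field containing the full name of the state this '
                'text pertains to, or "Federal" if it pertains to the US '
                "government.\n\n" + text
            ),
        }],
    )
    name = json.loads(resp.choices[0].message.content).get("name", "")
    return LOCALITY_IDS.get(name.strip())  # simple string lookup at the end
```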


Lesson 2: You don't need LangChain. You probably don't even need anything else OpenAI has released in the past year. The chat API is all you need.

LangChain is a perfect example of premature abstraction. At first we thought we had to use it, because that's what the internet said. Instead, millions of tokens later, with 3 to 4 very distinct LLM features in production, our openai_service file still contains just one 40-line function:

[Code screenshot in the original post: the ~40-line openai_service function]
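The screenshot doesn't survive here, but based on the description (chat API only, always extracting JSON, no system prompt), a minimal sketch of such a wrapper might look like this; this is an assumption about its shape, not the author's code:

```python
import json
from openai import OpenAI

client = OpenAI()

def send_prompt(prompt: str, model: str = "gpt-4") -> dict:
    """Send one user message via the chat API and parse the JSON reply."""
    resp = client.chat.completions.create(
        model=model,  # when gpt-4-turbo shipped, only this string changed
        messages=[{"role": "user", "content": prompt}],  # no system prompt
    )
    return json.loads(resp.choices[0].message.content)
```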

The only API we use is chat. We always extract JSON. We don't need JSON schemas, we don't need function calling, and we don't need assistants (though we do all of those things). We don't even use system prompts (maybe we should...). When gpt-4-turbo was released, we only had to update a single string in the codebase.

This is the beauty of a powerful, general-purpose model: less is more.

Of the roughly 40 lines in that function, most deal with the OpenAI API's routine 500 errors and socket closures.

We've built in some auto-truncation, so we don't have to worry about context-length limits. We have our own proprietary token-length estimator. Here's the code:

[Code screenshot in the original post: the token-length estimator]
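Again the code itself was a screenshot; here is a plausible sketch, assuming the roughly-3-characters-per-token rule the next paragraph mentions (constants are illustrative, not the author's values):

```python
MAX_INPUT_TOKENS = 100_000   # illustrative budget, not a real model limit
CHARS_PER_TOKEN = 3          # rough rule of thumb for English prose

def estimate_tokens(text: str) -> int:
    """Crude token estimate: assume ~3 characters per token."""
    return len(text) // CHARS_PER_TOKEN

def truncate_to_budget(text: str, max_tokens: int = MAX_INPUT_TOKENS) -> str:
    """Auto-truncate input so the estimated token count fits the window."""
    return text[: max_tokens * CHARS_PER_TOKEN]
```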

In edge cases, such as text with lots of periods or numbers (where the ratio drops below 3 characters per token), this code fails. So we have some other proprietary try/catch retry logic:

[Code screenshot in the original post: the try/catch retry logic]
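The retry logic is likewise only described, not shown; a sketch of what it might look like, reusing the hypothetical send_prompt helper from above:

```python
import time

def send_with_retry(prompt: str, retries: int = 3) -> dict:
    """Retry on API failures; shrink the input when the estimate was wrong."""
    for attempt in range(retries):
        try:
            return send_prompt(prompt)
        except Exception:
            # Covers the routine 500s / socket closures, plus the rare case
            # where the 3-chars-per-token estimate undershot and the request
            # blew the context window: back off and truncate harder.
            time.sleep(2 ** attempt)
            prompt = prompt[: int(len(prompt) * 0.8)]
    raise RuntimeError("GPT call failed after retries")
```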

This approach has gotten us a long way and is flexible enough for our needs.


Lesson 3: Improving latency with streaming APIs and showing users variable-speed typed words is actually a major UX innovation of ChatGPT

We thought it was just a gimmick, but user feedback on seeing characters typed out at variable speed (like someone typing letter by letter) has been so positive that it feels like AI's mouse/cursor UX moment.
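For reference, streaming with the chat API only requires the stream=True flag; a minimal sketch using the openai Python SDK (printing stands in for a real UI):

```python
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Summarize this text: ..."}],
    stream=True,  # yields chunks as the model generates them
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)  # show characters as they arrive
```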


Lesson 4: GPT is really bad at producing the null hypothesis

The most error-prone prompt language we've come across is: "Return an empty output if you don't find anything." GPT often returns something made up instead of nothing, and the instruction also makes it under-confident, returning blanks more often than it should.

Most of our prompts take the form:

“Here’s a block of text that’s making a statement about a company, I want you to output JSON that extracts these companies. If there’s nothing relevant, return a blank. Here’s the text: [block of text]”

"Here's a statement about a company that I want you to output and extract the JSON of those companies. If there is no relevant content, return blank. The text reads as follows: [text content]".

For a while, we had a bug where [block of text] could be empty, and GPT would return some inexplicable text. Incidentally, GPT loves bakeries; we got names like:

  • Sunshine Bakery
  • Golden Grain Bakery
  • Bliss Bakery

Luckily, the fix was to patch the bug and skip sending a prompt when there was no text. But things get messy when "empty" is hard to define programmatically and you really need GPT to make the call itself.
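A sketch of that fix, reusing the hypothetical send_prompt helper from earlier; the guard handles the "empty" case in code so GPT never has to produce the null hypothesis:

```python
def extract_companies(text: str) -> list[str]:
    # The fix: if there is no text, return the empty result ourselves
    # instead of sending a prompt and hoping GPT says "nothing found".
    if not text.strip():
        return []
    result = send_prompt(
        "Here's a block of text that's making a statement about a company. "
        'Output JSON with a "companies" array listing the companies it '
        "mentions. If there's nothing relevant, return an empty array.\n\n"
        + text
    )
    return result.get("companies", [])
```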


Lesson 5: "Context window" is a misnomer; only the input window keeps growing, not the output

A little-known fact: GPT-4's input window is 128K tokens, but its output window is still only 4K! The term "context window" is misleading.

But the actual problem is worse. We often ask GPT to return a list of JSON objects. Nothing complicated: think of it as an array of JSON tasks, each with a name and a label.

Yet GPT really cannot return more than about 10 items. Ask it for 15, and maybe it succeeds 15% of the time.

At first we assumed the 4K output window was the cause, but when we checked, 10 items came to only 700 to 800 tokens, and GPT would still just stop.

Of course, you can swap the prompt around: ask for one task at a time, then send (prompt + task) to get the next, and so on (as sketched below). But now you're playing a game of telephone with GPT, and you have to deal with things like LangChain.
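A sketch of that one-at-a-time workaround (again reusing the hypothetical send_prompt helper), which trades one large request for many small ones:

```python
import json

def extract_tasks(text: str, max_tasks: int = 30) -> list[dict]:
    """Work around the ~10-item output ceiling: ask for one task per call."""
    tasks: list[dict] = []
    while len(tasks) < max_tasks:
        result = send_prompt(
            "Here's a block of text and the tasks extracted from it so far. "
            "Return JSON for the NEXT task only (fields: name, label), or "
            '{"done": true} if there are no more tasks.\n\n'
            f"Text: {text}\n\nTasks so far: {json.dumps(tasks)}"
        )
        if result.get("done"):
            break
        tasks.append(result)
    return tasks
```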


Lesson 6: Vector databases and RAG/embeddings are mostly useless for us mere mortals

I've tried, I really have. But every time I think I have a killer use case for RAG/embeddings, I'm confounded.

In my opinion, vector databases/RAG are really meant for search, and search alone, the kind of search Google and Bing do. Here's why:

  1. There's no cutoff for relevance. Some solutions try to create relevance cutoffs with heuristics, but they simply aren't reliable. In my opinion this breaks RAG: you always risk retrieving irrelevant results, or you get too conservative and miss important ones.
  2. Why put your vectors in a specialized, proprietary database, far away from all your other data? Unless you're operating at Google or Bing scale, the loss of context is definitely not worth it.
  3. Unless you're doing truly open-ended search, say across the whole internet, users generally don't like semantic search, because it returns things they didn't type. For most search within business applications, users are domain experts; they don't need you to guess what they might want, they'll tell you explicitly.

It seems to me (pure speculation) that for most search use cases, a better use of large language models is a normal completion prompt that converts the user's search into a faceted search, or even a more complex query (even SQL!). But that isn't RAG at all.
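As a sketch of what that could look like (pure illustration; the schema and prompt are made up, and send_prompt is the hypothetical helper from earlier), one completion call turns the user's search into a structured query, with no embeddings or vector database involved:

```python
def search_to_sql(user_query: str) -> str:
    # Convert a free-text search into SQL against a known schema.
    # In production you'd validate the output before executing it.
    result = send_prompt(
        "Given a products table with columns (name, category, price, "
        "in_stock), write one SELECT statement that answers this search. "
        'Return JSON in the form {"sql": "..."}.\n\n'
        f"Search: {user_query}"
    )
    return result["sql"]
```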


Lesson 7: There are basically no hallucinations

Our use cases are essentially "here's a block of text, extract something from it." Generally speaking, if you ask GPT to extract company names from a block of text, it won't hand you a random company (unless there are no companies in the text, which is the null-hypothesis problem above!).

Again, if you're an engineer, I'm sure you've noticed this: GPT doesn't invent variables or randomly introduce typos when rewriting a block of code you send it. It does hallucinate the existence of standard library functions when you ask it for something, but I see that as the null-hypothesis problem again: it doesn't know how to say "I don't know."

But if your use case is "here are the full details, now analyze/summarize/extract," it's very reliable. I think a lot of recently released products emphasize exactly this use case.

So the key is: high-quality data in, high-quality GPT output back.


Summary

Here are answers to some frequently asked questions:

Q: Will we achieve artificial general intelligence?

A: No. At least not the transformers + internet data + billion-dollar infrastructure approach.

Q: Is GPT-4 actually useful, or is it all marketing?

A: 100% useful. Today's AI is like the early days of the internet.

Q: Will AI cause everyone to lose their jobs?

A: No. I think AI has just lowered the barrier to entry for the average person to use machine learning/AI.