
Exclusive | From Chaos to Clarity: Streamlining Data Cleansing Using Large Language Models

Source: Data-pie THU
Author: Naresh Ram | Translation: Wang Chuang (Chuck) | Proofreading: zrx
This article is about 6,600 words; a 10-minute read. Through a worked example, it shows how impressive large language models (LLMs) can be at data cleansing. It details how to clean data at low cost using prompts, API calls, and automation code, and demonstrates the technique's potential for improving data quality and surfacing insights.

Tags: data science, data cleansing, large language models, LLM

We use OpenAI's GPT model to clean up questionnaire feedback. The full code is on GitHub (https://github.com/aaxis-nram/data-cleanser-llm-node).


The image was generated by DALL·E 2 and modified by the author

In the digital age, accurate and reliable data is essential for businesses. It powers personalized experiences and informed decisions [1]. However, the sheer volume and complexity of data makes processing it a significant challenge, often requiring a lot of tedious manual work. Enter large language models (LLMs): a transformative technology whose natural language processing and pattern recognition capabilities promise to revolutionize data cleansing and make data more usable.

In a data scientist's toolbox, LLMs are like a wrench and screwdriver, able to reshape how we improve data quality. Put to work on messy data, they can reveal actionable insights that ultimately pave the way for a better customer experience.

Now, let's jump right into today's case.


Image uploaded by Scott Graham to Unsplash

Case

When surveying students, a free-form text field is the worst possible choice! You can imagine some of the answers we received.

Just kidding. One of our customers is StudyFetch (https://www.studyfetch.com/), an AI-powered platform that leverages course materials to create personalized, all-in-one study suites for students. They surveyed college students and received more than 10,000 responses. However, their CEO and co-founder, Esan Durrani, ran into a small problem. It turns out that the "Major" field in the survey was a free-form text box, meaning respondents could enter anything. As data scientists, we know this is definitely not a smart choice for statistical analysis. The raw survey data looked like this...

[Image: a sample of the raw "Major" survey responses]

Oh my, get your Excel ready! Be prepared to spend an hour, or even three, on an adventure wrangling these messy values.

But don't worry, we have a large language model (LLM) hammer.

As the old saying goes, if all you have is a hammer, every problem looks like a nail. And isn't data cleansing the perfect nail for this hammer?

We just need our friendly large language model to classify the entries into known categories. Specifically, OpenAI's Generative Pre-trained Transformer (GPT) models are the LLMs behind the popular ChatGPT application. GPT models use up to 175 billion parameters and were trained on 2.6 billion stored web pages from the public Common Crawl dataset. In addition, through a technique called reinforcement learning from human feedback (RLHF), trainers can nudge the model toward more accurate and useful responses [2].

For our purposes, 175 billion parameters should be more than enough, as long as we give the right prompt.


Image uploaded by Kelly Sikkema to Unsplash

The key is the prompt

Ryan and Esan, who run an AI company and write great prompts for a living, provided the first version of our prompt. It worked very well using language inference [3], but there were two areas for improvement:

  • First, it applied to only a single record at a time.
  • Second, it used the 'Completion' endpoint of the Davinci model (my bank account started panicking the moment I mentioned it).

This made the costs exorbitant, which we could not accept. So Ryan and I each rewrote the prompt for bulk operation using 'gpt-3.5-turbo'. OpenAI's prompt best practices (https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api) and the ChatGPT Prompt Engineering for Developers course (https://www.deeplearning.ai/short-courses/chatgpt-prompt-engineering-for-developers/) were very helpful to me. After a series of iterations of thinking, implementing, analyzing, and improving, we ended up with a working version.

Now, here is the second revised prompt:

[Image: the second revised prompt]

The LLM's response to this prompt is:

[Image: the LLM's response]

This approach more or less works. But I didn't like the repetitive, long-winded program names. With LLMs, text is tokens, and tokens are real money. You see, my programming skills were honed in the fiery abyss of the dot-com bust. Let me tell you, I never miss an opportunity to save costs.

So I modified the prompt slightly in the "Expected format" section, asking the model to output only the ordinal number of the survey response (e.g., 1 for "Drama" above) and the ordinal number of the program (e.g., 1 for "Literature and Humanities"). Then Ryan suggested that I ask for JSON output instead of CSV, for easier parsing. He also suggested adding a "Sample output" section, which was excellent advice.

The final prompt is as follows (simplified for clarity):

[Image: the final prompt]

The full prompt we used can be viewed here on GitHub (https://github.com/aaxis-nram/data-cleanser-llm-node).
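The prompt itself appears only as an image here. As a rough sketch of how such a bulk prompt might be assembled in code (the function name buildPrompt, the category list, and the exact wording are my own illustration, not the authors' actual prompt):

```javascript
// Hypothetical sketch of assembling a bulk classification prompt.
// The category list and wording are illustrative, not the real prompt.
const CATEGORIES = [
  "Literature and Humanities",
  "Business and Economics",
  "Engineering and Technology",
  "Natural Sciences",
  "Social Sciences",
];

function buildPrompt(responses) {
  // Number the categories and responses so the model can answer in ordinals
  const categoryList = CATEGORIES.map((c, i) => `${i + 1}. ${c}`).join("\n");
  const responseList = responses.map((r, i) => `${i + 1}. ${r}`).join("\n");
  return [
    "Classify each survey response below into exactly one of these categories:",
    categoryList,
    "",
    "Expected format: a JSON array of [responseNumber, categoryNumber] pairs.",
    "Sample output: [[1, 1], [2, 3]]",
    "",
    "Survey responses:",
    responseList,
  ].join("\n");
}

// buildPrompt(["Drama", "Finance"]) yields a numbered prompt whose
// first response line is "1. Drama".
```

Numbering both lists is what lets the model answer with cheap ordinal pairs instead of repeating the long program names.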

The output of the model is:

[Image: the model's output]

As discussed, the model's output is a mapping between the survey responses and the categories we defined. Take the first line as an example: 1, 1. That means response number 1 maps to program number 1. Survey response 1 is "Drama" and mapped program 1 is "Literature and Humanities." This looks correct! Drama takes its rightful place at #1, in the spotlight as always.

Although at first glance the output looks like embeddings (as used for clustering and dimensionality reduction), it is just the same mapping information expressed as ordinal positions. Besides the cost advantage in token usage, these numbers are easier to parse.

We can now turn the raw survey responses into meaningful disciplines, aggregate them, and gain valuable, actionable insights.

But wait, I'm not going to sit in front of a computer, paste every survey response into the browser, and tabulate the mappings. Besides being mind-numbing, the error rate would be unacceptable.

What we need is some good automation tools. Let's take a look at the API...


Image uploaded by Laura Ockel to Unsplash

APIs to the rescue

As you may already know, application programming interfaces (APIs) let our programs interact efficiently with third-party services. While many people have achieved impressive results with ChatGPT itself, the true potential of language models lies in using APIs to integrate natural language capabilities seamlessly into applications, so that users never notice them. Like the incredible science behind the phone or computer you're using to read this article.

If you don't have API access yet, you can apply here: https://openai.com/blog/openai-api [4]. Once you register and get your API key, the specification can be found here (https://platform.openai.com/docs/api-reference/chat). Some very helpful examples and code samples are here (https://platform.openai.com/examples). Before committing anything to code, the playground (https://platform.openai.com/playground) is a great place to test prompts under various settings [5].

We will use REST to call the chat completion API. An example of a call is as follows:

[Image: a sample REST call to the chat completions API]
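The call is shown only as a screenshot in the original, so here is a hedged sketch of an equivalent request using Node's built-in fetch (Node 18+). The endpoint and body fields (model, temperature, n, messages) are the real chat completions API; the helper names and prompt text are my own illustration:

```javascript
// Sketch of a chat completions request. buildChatRequest and callChat
// are hypothetical helper names; the body fields are real API parameters.
function buildChatRequest(prompt) {
  return {
    model: "gpt-3.5-turbo",
    temperature: 0, // deterministic output: the same mapping every time
    n: 1,           // one completion is enough for classification
    messages: [
      { role: "system", content: "You classify survey responses into categories." },
      { role: "user", content: prompt },
    ],
  };
}

async function callChat(prompt, apiKey) {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify(buildChatRequest(prompt)),
  });
  const data = await res.json();
  // The completion text lives in choices[0].message.content
  return data.choices[0].message.content;
}
```

The parameters in the body are exactly the ones discussed below.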

Let's take a quick look at the parameters and their effects.

model

At the time, the only chat completion model open to the public was gpt-3.5-turbo. Esan had access to the GPT-4 model, which I'm very jealous of. Although GPT-4 is more accurate and hallucinates less [2], it costs about 20 times more than gpt-3.5-turbo, which is more than enough for our needs, thank you very much.

Temperature

After the prompt itself, temperature is one of the most important settings we can pass to the model. According to the API documentation, it can be set to a value between 0 and 2. It has a significant effect [6] because it controls the randomness of the output, a bit like the amount of caffeine in your system before you start writing. A guide to suitable values for each kind of application is here [7].

For our use case, we want no variation at all: the engine should give us the exact same mapping every time. So we used a value of 0.

N value

This is how many chat completions (choices) to generate for each prompt. If we were doing creative writing and wanted multiple options, we could use 2 or 3. For our case, n=1 (the default) is fine.

role

The role of each message can be system, user, or assistant. The system role provides instructions and sets the context. The user role carries the end user's prompt. The assistant role holds responses based on the conversation history. These roles structure the conversation and enable users and the AI assistant to interact effectively.

Model maximum tokens

This is not exactly a parameter we pass in the request, although the related max_tokens parameter limits the total length of the response from the chat.

First, a token can be thought of as a piece of a word. One token is about 4 characters in English. For example, the quote "The best way to predict the future is to create it," attributed to Abraham Lincoln and others, contains 11 tokens.


Image from Open AI Tokenizer, generated by the author

If you think one token is exactly one word, here is a 64-token example showing that it's not that simple.


Image from Open AI Tokenizer, generated by the author

Get ready for a shocking truth: each emoji you use in your message can add up to 6 tokens of cost. That's right, your favorite smileys and winks are sneaky little token thieves!

The maximum token window of a model is a hard technical limit. Your prompt (including any extra data in it) plus the answer must fit within it. In a chat completion, the content, the roles, and all previous messages consume tokens. If a message is dropped from the input or output (the assistant messages), the model loses all knowledge of it [8]. Like Dory looking for... Chico? No. Fabio? No. Bingo? No. Harpo? No. Elmo? ... Nemo!

For gpt-3.5-turbo, the model maximum is 4,096 tokens, or roughly 16,000 characters. In our example, the prompt takes about 2,000 characters, each survey response averages about 20 characters, and each mapping in the response is about 7 characters. So, if we put N survey responses in each prompt, the character count must satisfy:

2000 + 20N + 7N < 16000

Solving this gives an N just under 518, call it 500. Technically, we could put 500 survey responses in each request and process our data in 20 calls. Instead, we chose to put 50 responses in each request and make 200 calls, because with more than 50 responses per request we occasionally received erratic responses. Sometimes the service just has a bad day! We're not sure whether it's a systemic issue or whether we were simply unlucky.
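The back-of-the-envelope arithmetic above can be sketched in a few lines (the character counts are the article's estimates; the helper name maxBatchSize is mine):

```javascript
// Rough batch sizing from the article's estimates:
// prompt ≈ 2000 chars, each response ≈ 20 chars, each mapping ≈ 7 chars,
// within a ~16,000-character budget (≈ 4,096 tokens at ~4 chars/token).
function maxBatchSize(budgetChars, promptChars, perResponseChars, perMappingChars) {
  return Math.floor((budgetChars - promptChars) / (perResponseChars + perMappingChars));
}

maxBatchSize(16000, 2000, 20, 7); // → 518, though 50 per request proved more reliable in practice
```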

So, how do we use the APIs we have? Let's get to the highlight part, the code.


Image uploaded by Markus Spiske to Unsplash

The code

Node.js is a JavaScript runtime environment [9]. We will write a Node.js/JavaScript program that will perform the actions described in this flowchart:


Flowchart of the program, drawn by the author

My JavaScript skills are not that great. I write better Java, PHP, Julia, Go, C#, and even Python. But Esan insisted on Node, so JavaScript it is.

The full code, prompts, and sample input can be found at this GitHub link (https://github.com/aaxis-nram/data-cleanser-llm-node). But let's look at the most interesting parts first:

First, let's see how we can use the "csv-parser" Node library to read CSV files.

[Image: reading the CSV file with csv-parser]
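The actual code in the repo uses the csv-parser package. As a dependency-free sketch of the same idea (parseCsv is a hypothetical name, and real-world CSV with quoted fields or embedded commas needs csv-parser or similar):

```javascript
// Minimal CSV reader for simple, unquoted files.
// The repo's actual code uses the csv-parser package instead.
function parseCsv(text) {
  const [headerLine, ...lines] = text.trim().split(/\r?\n/);
  const headers = headerLine.split(",");
  return lines.map((line) => {
    const values = line.split(",");
    // Build one object per row, keyed by the header names
    return Object.fromEntries(headers.map((h, i) => [h, values[i] ?? ""]));
  });
}

const rows = parseCsv("name,major\nAlice,Drama\nBob,Finance");
// rows[0] is { name: "Alice", major: "Drama" }
```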

Next, we call the classifier to generate the map.

[Image: calling the classifier to generate the mapping]

We then construct the prompt from the categories, the main prompt text, and the data from the CSV. Next, we send the prompt to the service using the OpenAI Node library.

[Image: sending the prompt via the OpenAI Node library]

Finally, when all iterations are complete, we convert the srcCol text (the survey responses) into targetCol (the normalized program names) and write out the CSV.

[Image: writing the mapped results back to CSV]
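Since that code also appears only as a screenshot, here is a hedged sketch of the mapping step (applyMapping, toCsv, and the column names are my own illustration of the idea, not the repo's exact code):

```javascript
// Hypothetical sketch: apply the model's [response#, category#] pairs
// back onto the rows, then serialize to CSV. Names are illustrative.
function applyMapping(rows, pairs, categories, targetCol) {
  for (const [responseNum, categoryNum] of pairs) {
    // The model's numbers are 1-based ordinals
    rows[responseNum - 1][targetCol] = categories[categoryNum - 1];
  }
  return rows;
}

function toCsv(rows) {
  const headers = Object.keys(rows[0]);
  const lines = rows.map((r) => headers.map((h) => r[h] ?? "").join(","));
  return [headers.join(","), ...lines].join("\n");
}

const surveyRows = [
  { major: "Drama", program: "" },
  { major: "Finance", program: "" },
];
applyMapping(surveyRows, [[1, 1], [2, 2]], ["Literature and Humanities", "Business"], "program");
// toCsv(surveyRows) now begins with the header line "major,program"
```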

The JavaScript turned out less complicated than I expected and was done in two to three hours. I guess everything looks intimidating until you start doing it.

So, now that we've got the code ready, it's time for the final execution...


Image uploaded by Alexander Grey to Unsplash

Running the program

Now we need somewhere to run the code. After debating whether to run the load on a cloud instance, I did some quick math and realized I could run it on my laptop in under an hour. Not too bad.

We started a test round and noticed that in roughly 1 out of 10 requests, the service echoed back the data it was given instead of the mapping, leaving us with just a list of survey responses. Since no mapping was found, those responses ended up as empty strings in the CSV file.

Rather than add detection and retry logic to the code, I decided to simply rerun the script, but only process records whose target column was empty.

On the first run, the script fills the target column with normalized program names for every row it can. Because of the erratic responses, some rows remain unmapped and their target column stays empty. On the second run, the script builds prompts only for the responses that weren't processed the first time. We ran the program a few times until everything was mapped.
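The rerun trick boils down to a simple filter over the rows, where "unprocessed" means an empty target column (the names here are illustrative, not the repo's exact code):

```javascript
// Only rows whose target column is still empty need another pass.
function unprocessed(rows, targetCol) {
  return rows.filter((r) => !r[targetCol]);
}

const example = [
  { major: "Drama", program: "Literature and Humanities" },
  { major: "Finance", program: "" }, // mapping failed on an erratic response
];
unprocessed(example, "program").length; // → 1
```

Because rerunning is idempotent, the loop converges: each pass can only shrink the set of unmapped rows.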

The multiple runs took about 30 minutes in total and didn't require much supervision. Here is a selection of the more interesting mappings from the model:


Sample mapping between input and program name, diagram drawn by the author

Most look right. Not sure if Organizational Behavior is a social science or business? I guess either one will do.

Each request of about 50 records used roughly 800 tokens in total. The whole exercise cost 40 cents, plus maybe 10 cents on testing and reruns. So: about 50 cents total, about 2.5 hours of coding/testing, half an hour of runtime, and the job was done.

Total cost: Approximately less than $1

Total time: about 3 hours

Perhaps with Excel, sorting, regular expressions, and drag-and-drop copying, we could have done it manually in about the same time and saved a few cents. But this way was more fun, we learned something, we ended up with a repeatable script/process, and we got an article out of it. Besides, I think StudyFetch can afford the 50 cents.

This was a task we accomplished efficiently and cheaply, but what else can LLMs be used for?


Image uploaded by Marcel Strauß to Unsplash

Explore more use cases

There are probably more use cases for adding language capabilities to your application than the one above. Here are more use cases related to the survey data we just processed:

Data parsing and standardization: LLMs can help parse and standardize data by identifying and extracting relevant information from unstructured or semi-structured sources, like the survey data we just saw.

Data deduplication: LLMs can help identify duplicate records by comparing various data points. For example, we could compare names, majors, and universities in the survey data to flag potential duplicates.

Data summarization: LLMs can summarize records to give an overview of the answers. For example, for the question "What was your biggest challenge in learning?", a large language model could summarize multiple responses from the same major and university to surface patterns. We could then put all the summaries into a single request to get an overall summary, though a per-segment summary would probably be more useful.

Sentiment analysis: LLMs can analyze responses to determine sentiment and extract valuable insights. For the question "Are you willing to pay for services that help you learn?", an LLM can score sentiment from 0 (very negative) to 5 (very positive). We can then use the scores, by segment, to gauge students' interest in paid services.
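As an illustration of that last idea, a sentiment prompt could be built the same way as the classification prompt earlier (the wording, scale, and function name here are my own sketch, not something from the article's repo):

```javascript
// Hypothetical sketch: a 0-5 sentiment-scoring prompt, assembled like
// the bulk classification prompt. Wording is illustrative.
function buildSentimentPrompt(responses) {
  const list = responses.map((r, i) => `${i + 1}. ${r}`).join("\n");
  return [
    'For each response to "Are you willing to pay for services that help you learn?",',
    "rate sentiment from 0 (very negative) to 5 (very positive).",
    "Expected format: a JSON array of [responseNumber, score] pairs.",
    "",
    "Responses:",
    list,
  ].join("\n");
}

// buildSentimentPrompt(["Absolutely!", "No way."]) numbers each response
// so the scores can be joined back to rows, as with the category mapping.
```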

Although student reviews are just a small example, the technology has multiple applications in a wider range of fields. At my company, AAXIS, we specialize in enterprise and consumer digital commerce solutions. Our work includes migrating large amounts of data from existing legacy systems to new systems with different data structures. To ensure data consistency, we analyze the source data using a variety of data tools. The techniques presented in this article are very helpful for this goal.

Other digital commerce use cases include checking product catalogs for errors, writing product descriptions, scanning review responses, and generating product review summaries, among others. Writing code for these use cases is much simpler than asking students about their majors.

However, it is important to note that while LLMs are powerful tools for data cleansing, they should be used in combination with other techniques and human supervision. Data cleansing often requires domain expertise, contextual understanding, and human review to make informed decisions and maintain data integrity. LLMs are not inference engines [10]; they are just next-word predictors. And they often deliver false information (hallucinations) with great confidence and persuasion [2][11]. In our tests, since we were mainly doing classification, we did not encounter any hallucinations.

If you tread carefully and understand the pitfalls, LLMs can be a powerful tool in your toolbox.


Image uploaded by Paul Szewczyk to Unsplash

Conclusion

In this article, we looked at a specific data cleansing use case: normalizing free-text questionnaire responses to a fixed set of values, so the responses can be grouped and mined for insights. To categorize the responses, we used OpenAI's gpt-3.5-turbo, a powerful LLM. We covered the prompt in detail, how to call it through the API, and the code that automates the process. In the end, we brought all the pieces together and got the job done for less than a dollar.

Did we grab the legendary LLM hammer and find a perfectly shiny nail in free-text survey responses? Maybe. More likely, we took out a Swiss Army knife, filleted a fish with it, and enjoyed some delicious sushi. Although LLMs were not built specifically for this task, they are still very practical. And Esan really, really loves sushi.

So, what use cases do you have for LLM? We'd love to hear your thoughts!

Acknowledgement

The bulk of the work for this article was done by me, Esan Durrani, and Ryan Trattner, co-founders of StudyFetch. StudyFetch is an AI-based platform that leverages course materials to create personalized, all-in-one study suites for students.

I would like to thank my colleagues Prashant Mishra, Rajeev Hans, Israel Moura and Andy Wagner for their review and suggestions for this article.

I would also like to thank my friend of 30 years, Kiran Bondalapati, VP of Engineering at TRM Labs, for his early leadership in generative AI and for reviewing this article.

At the same time, I would like to especially thank my editor, Megan Polstra, who, as always, adds a professional and refined style to the article.

Resources

1. Temu Raitaluoto, "The importance of personalized marketing in the digital age", MarketTailor Blog, May 2023, https://www.markettailor.io/blog/importance-of-personalized-marketing-in-digital-age

2. Ankur A. Patel, Bryant Linton and Dina Sostarec, GPT-4, GPT-3, and GPT-3.5 Turbo: A Review Of OpenAI’s Large Language Models, Apr 2023, Ankur’s Newsletter,https://www.ankursnewsletter.com/p/gpt-4-gpt-3-and-gpt-35-turbo-a-review

3. Alexandra Mendes, Ultimate ChatGPT prompt engineering guide for general users and developers, Jun 2023, Imaginary Cloud Blog,https://www.imaginarycloud.com/blog/chatgpt-prompt-engineering/

4. Sebastian, How to Use OpenAI’s ChatGPT API in Node.js, Mar 2023, Medium — Coding the Smart Way,https://medium.com/codingthesmartway-com-blog/how-to-use-openais-chatgpt-api-in-node-js-3f01c1f8d473

5. Tristan Wolff, Liberate Your Prompts From ChatGPT Restrictions With The OpenAI API Playground, Feb 2023, Medium — Tales of Tomorrow,https://medium.com/tales-of-tomorrow/liberate-your-prompts-from-chatgpt-restrictions-with-the-openai-api-playground-a0ac92644c6f

6. AlgoWriting, A simple guide to setting the GPT-3 temperature, Nov 2020, Medium,https://algowriting.medium.com/gpt-3-temperature-setting-101-41200ff0d0be

7. Kane Hooper, Mastering the GPT-3 Temperature Parameter with Ruby, Jan 2023, Plain English,https://plainenglish.io/blog/mastering-the-gpt-3-temperature-parameter-with-ruby

8. OpenAI Authors, GPT Guide — Managing tokens, 2023, OpenAI Documentation,https://platform.openai.com/docs/guides/gpt/managing-tokens

9. Priyesh Patel, What exactly is Node.js?, Apr 2018, Medium — Free Code Camp,https://medium.com/free-code-camp/what-exactly-is-node-js-ae36e97449f5

10. Ben Dickson, Large language models have a reasoning problem, June 2022, Tech Talks Blog,https://bdtechtalks.com/2022/06/27/large-language-models-logical-reasoning/

11. Frank Neugebauer, Understanding LLM Hallucinations, May 2023, Towards Data Science,https://towardsdatascience.com/llm-hallucinations-ec831dcd7786

Original title: From Chaos to Clarity: Streamlining Data Cleansing Using Large Language Models

Original link: https://towardsdatascience.com/from-chaos-to-clarity-streamlining-data-cleansing-using-large-language-models-a539fa0b2d90
