The Father of Redis on Programming with Large Language Models

Author: CSDN

Antirez, the father of Redis, shares in his blog how LLM-assisted programming tools such as ChatGPT save him the time he used to spend digging through documentation: he can ask GPT directly for explanations or code, and for throwaway code he simply lets the LLM write it. He concludes: "There is no reason for programmers not to use LLM-assisted programming."

Original link: http://antirez.com/news/140

Translation link: https://baoyu.io/translations/llm/llms-and-programming-in-the-first-days-of-2024

This article is reprinted with permission from Mr. Baoyu's personal blog (Weibo: @Baoyuxp).

Author | Antirez    Editor | 梦依丹

Source | baoyu.io

First of all, I want to make it clear that this article is not intended to be a retrospective on large language models. Clearly, 2023 was an extraordinary year for AI, and there is little need to say it again. This article is about sharing my personal experience with them as a programmer.

From the arrival of ChatGPT to the large language models I now run locally, I have been using this new technology extensively. My goal is not just to code faster; more importantly, I don't want to waste effort on the parts of programming that don't deserve it. I don't want to spend hours hunting through specialized, boring documentation; I don't want to struggle to learn overcomplicated APIs that I often don't even need; and I don't want to write throwaway code that I'll discard within a few hours. Especially now that Google has become a sea of spam, where we can only hope to fish out the few useful pieces of information.

At the same time, I am no newcomer to programming. I am perfectly capable of writing code without any assistance, and I often do. Over time, though, I have come to use large language models a great deal when writing high-level code, especially Python, and much less when programming in C. In my experience with large language models, I have learned when to use them and when they would only slow me down. I have also learned that they are somewhat like Wikipedia or the dazzling array of video lessons on YouTube: they are of great benefit to those who are willing, capable, and self-disciplined, and of limited help to everyone else. I worry that, at least initially, they will mostly benefit those who already have an advantage. But let's take it one step at a time!

Omniscient or parroting?

One of the most worrying aspects of this new wave of machine learning is how difficult it is, even for AI experts, to accept the limits of their own knowledge. Humans invented neural networks and, more importantly, an algorithm that automatically optimizes the parameters of a neural network. As hardware became capable of training ever larger models, and through continuous trial and error, designs emerged that work better than other architectures, exploiting statistical knowledge about the data (that is, prior knowledge). However you look at it, neural networks remain quite complex and opaque.

Faced with new, unexplained abilities of large language models, cautious scientists have instead underestimated them. Many dismiss large language models as nothing more than slightly more advanced Markov chains, capable at best of regurgitating limited variations of their training set. There is growing evidence that this view is probably wildly wrong. At the same time, many onlookers exaggerate their capabilities, attributing to them supernatural powers that do not exist in reality. In truth, large language models can at best interpolate within the space represented by their training data, and even that ability would be quite remarkable. If today's large language models could interpolate freely across the space of all the code they have seen, then even without producing anything genuinely new they would be able to replace 99% of programmers. The reality is less rosy. Large language models can indeed write programs they have never seen before, showing some ability to merge different ideas that appear in their training data, but this ability is currently very limited: whenever subtle reasoning is required, they often fail badly. Still, there seems to be little doubt that they represent the greatest achievement since the dawn of artificial intelligence.

Ignorant but knowledgeable

Large language models can only perform rudimentary reasoning, often inaccurate and sometimes mixed with hallucinations about facts that do not exist. But they possess an enormous amount of knowledge. In the world of programming, as in other fields where high-quality data is available, large language models are like people of limited understanding but vast knowledge. Pair programming with such a partner (pair programming is already a pain for me in itself) could be terrible: they would come up with nonsensical ideas, and we would have to fight to push through our own. But if this learned fool is at our command, following instructions and answering every question we ask, everything changes. Existing large language models cannot take us beyond the paths already explored, but if we want to tackle a topic we barely know, they can often carry us from knowing nothing to knowing enough to move forward on our own.

In the world of programming, these abilities might not have mattered much twenty or thirty years ago. Back then, you needed to master a couple of programming languages, the classical algorithms, and a handful of fundamental libraries. The rest depended on your own intelligence, expertise, and design ability; with those qualities you could become an expert programmer able to handle just about anything. Over time, however, we have witnessed an explosion of frameworks, programming languages, and libraries of every kind. This complexity is often unnecessary and even self-serving, but that is how things are. In such a situation, an idiot who knows everything becomes a precious ally.

Let me give an example: my experiments with machine learning were originally carried out with Keras and lasted at least a year. Later, for various reasons, I switched to PyTorch. I already knew what an embedding or a residual network (ResNet) is, but I had no desire to work through PyTorch's documentation step by step (the way I had learned Keras, back when ChatGPT did not exist). With a large language model, writing Python code that uses Torch becomes very easy. All I need is a clear idea of the model I want to build and the ability to ask the right questions.
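As a concrete illustration (this snippet is mine, not from the original post, and every name and size in it is invented), this is the kind of small Torch model you can describe to an LLM in a couple of plain sentences and have turned into working code:

import torch
import torch.nn as nn

# A tiny classifier: an embedding layer, one hidden layer with a simple
# residual connection, and a linear output head.
class TinyResidualClassifier(nn.Module):
    def __init__(self, vocab_size=1000, dim=64, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)   # token ids -> vectors
        self.hidden = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, num_classes)

    def forward(self, token_ids):
        x = self.embed(token_ids).mean(dim=1)        # average the token embeddings
        x = x + torch.relu(self.hidden(x))           # residual connection
        return self.out(x)

model = TinyResidualClassifier()
batch = torch.randint(0, 1000, (8, 16))              # 8 sequences of 16 token ids
print(model(batch).shape)                            # torch.Size([8, 2])

Describing a model like this in words and letting the LLM produce the boilerplate is exactly the workflow described above.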

Use cases

I'm not talking about trivial questions like "hey, how does class X do Y?" If that were all, anyone skeptical of large language models would have a point. What the more capable models can do goes far beyond that; a few years ago it would have seemed like pure magic. I can tell GPT4: look, this is the neural network model I implemented in PyTorch, and these are my data batches. I want to resize these tensors so that they are compatible with the network's input, and I want them arranged in this particular way. GPT4 writes the code, and all I have to do is test, in the Python command line interface, whether the tensors really have the sizes I need and whether the data layout is correct.

Here's another example. Not long ago I needed to implement a Bluetooth Low Energy (BLE) client for ESP32-based devices. After some research I found that most multi-platform Bluetooth programming interfaces are not easy to use. The solution was simple: write the code in Objective-C using macOS's native API. But that left me with two problems: learning Objective-C's convoluted BLE API, full of design choices that I consider needless complexity (as a minimalist, they are the opposite of what I regard as "good design"), and remembering how to program in Objective-C at all. The last time I wrote a program in Objective-C was ten years ago, and I no longer remembered many of the details, such as event loops, memory management, and so on.

The end result is this code. It isn't pretty, but it does what it has to do, and I wrote it in a very short time. Otherwise it simply would not have been possible.

https://github.com/antirez/freakwan/blob/main/osx-bte-cli/SerialBTE.m

The code was written mostly by copy-pasting into ChatGPT the features I wanted to implement but wasn't sure how to approach, so at first it didn't work correctly. Then the large language model pointed out the problems and told me how to fix them. While most of the code wasn't written directly by the LLM, it sped up the work considerably. Could I have done it without ChatGPT? Yes, but the important thing is not just that it would have taken me much longer: in truth, I wouldn't even have tried, because it wouldn't have seemed worth it. That is the crucial point. For my project, the effort of writing such a marginal program would have been out of proportion to its benefits. As a bonus, the process produced a side effect more useful than the program itself: for this project I reworked LineNoise (a line-editing library I use) so that it can run in multiplexed environments.

Here is another example, one that involves data interpretation more than code writing. I wanted to set up a Python script using a convolutional neural network I had found online, but the network came with little documentation. One nice thing about it is that it uses the ONNX (Open Neural Network Exchange) format, which let me easily identify the network's inputs and outputs and their corresponding names. All I knew about this convolutional network was that it detects certain features in an image. But I didn't know the required format and size of the input image, and the output was far more complex than I expected (I had assumed it was a binary classifier telling whether the observed image was normal or problematic; I thought there were only two outputs, but there were actually hundreds).

I started by copy-pasting the ONNX network's metadata output into ChatGPT and explaining what little I knew about the network. ChatGPT speculated about how the inputs were organized and suggested that the outputs were probably normalized boxes indicating potentially flawed parts of the image, along with other outputs giving the likelihood of those flaws. After a few minutes of back and forth, I had a Python script that could run inference on the network, plus the code needed to turn the starting image into a tensor suitable as input. What impressed me most was that ChatGPT finally "understood" how the network worked once it saw the raw output values on a test image (basically logits): a series of floating-point numbers provided the context to pin down the exact details of the output and its normalization, such as whether the boxes were centered or specified by their top-left corner.
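For readers who want a feel for that workflow, here is a minimal sketch of the same steps, written by me rather than by ChatGPT; it assumes the onnxruntime and Pillow packages, and the file names, input size, and preprocessing are hypothetical placeholders rather than details of the actual network:

import numpy as np
import onnxruntime as ort
from PIL import Image

# Load the network and inspect what it expects and what it produces.
session = ort.InferenceSession("model.onnx")
for meta in session.get_inputs():
    print("input:", meta.name, meta.shape, meta.type)
for meta in session.get_outputs():
    print("output:", meta.name, meta.shape, meta.type)

# Turn the starting image into a tensor of the assumed shape (1, 3, 224, 224).
img = Image.open("sample.jpg").convert("RGB").resize((224, 224))
tensor = np.asarray(img, dtype=np.float32) / 255.0   # HWC, values in [0, 1]
tensor = tensor.transpose(2, 0, 1)[np.newaxis, ...]  # NCHW, batch of one

# Run inference and dump the raw outputs (the "logits" mentioned above).
outputs = session.run(None, {session.get_inputs()[0].name: tensor})
for meta, out in zip(session.get_outputs(), outputs):
    print(meta.name, np.asarray(out).shape)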

One-off programs

I could recount many more similar cases, but there would be little point: the story is always more or less the same. My typical problem is that I need to learn something quickly, in situations where I can verify whether what the large language model tells me is nonsense. In those cases, I use large language models to speed up my acquisition of knowledge.

There are other cases, though, where I let the large language model write all the code: whenever I need a program I can afford to throw away. For example, I needed to visualize the loss curve during the training of a small neural network. I showed GPT4 the format of the CSV file my PyTorch program produces during training and then asked: when several CSV files are given on the command line, instead of showing the training and validation loss curves of a single experiment, compare the validation loss curves of the different experiments. GPT4 produced the result in about thirty seconds.

Similarly, I needed a program that reads AirBnB's CSV reports and groups my apartments by month and year, then computes the average nightly rate for each month, taking cleaning fees and the number of nights per booking into account. The program was very useful to me, but writing it would have been deadly boring: there was nothing new to learn in the process. So I copy-pasted part of the CSV file into GPT4 and told the large language model what problem I wanted to solve. The program worked on the first try. Here it is in full.

import pandas as pd

pd.set_option('display.max_rows', None)
df = pd.read_csv('listings.csv')
reservations = df[df['Type'] == 'Reservation']
reservations['Start Date'] = pd.to_datetime(reservations['Start Date'])
reservations['Year'] = reservations['Start Date'].dt.year
reservations['Month'] = reservations['Start Date'].dt.month
reservations['Nightly Rate'] = (reservations['Amount'] - reservations['Cleaning Fee']) / reservations['Nights']

all_listings = reservations['Listing'].unique()
all_years = reservations['Year'].unique()
all_months = range(1, 13)

index = pd.MultiIndex.from_product([all_listings, all_years, all_months], names=['Listing', 'Year', 'Month'])
all_data = pd.DataFrame(index=index).reset_index()

merged_data = pd.merge(all_data, reservations, on=['Listing', 'Year', 'Month'], how='left')
average_nightly_rates = merged_data.groupby(['Listing', 'Year', 'Month'])['Nightly Rate'].mean().fillna(0)

Working out exactly how to group this otherwise scattered and messy data required some simple logical reasoning: it is not something a large language model can solve just by repeating language patterns it has seen. GPT4 certainly observed similar programs during training, but it almost certainly never saw a program that had to group data from a CSV file in that exact format, in the exact way I asked. This shows (at least in my opinion) that large language models can, within the boundaries sketched by the different programs in their training set, innovate and reason to some degree. In my personal opinion, writing this kind of program myself would be an unwise use of my time. Having a large language model do the work clearly improves my programming efficiency: it means I can concentrate on the things that really matter.
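For completeness, here is a minimal sketch of the loss-curve comparison script described at the start of this section. It is my own illustration rather than GPT4's actual output, and it assumes each CSV file has "epoch" and "val_loss" columns, which is purely an assumption:

import sys
import pandas as pd
import matplotlib.pyplot as plt

# One validation-loss curve per CSV file given on the command line.
for path in sys.argv[1:]:
    df = pd.read_csv(path)
    plt.plot(df['epoch'], df['val_loss'], label=path)

plt.xlabel('epoch')
plt.ylabel('validation loss')
plt.legend()
plt.show()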

When Things Don't Go as Planned: The Challenges of Systems Programming

Despite my successes with large language models and programming, when I write programs in C I usually use LLMs only as a more convenient form of documentation. I am an expert in systems programming, and in that field LLMs are useful to me only if they are capable of complex reasoning, and in this they often fall short. Let's look at an experiment. I asked for the following: outline the requirements for a good, small, but complete C implementation of a Bloom filter, focusing on the important aspects of the hashing; then write it in good-quality C, considering that the example should be able to store 100,000 elements with a false-positive probability of no more than 5%, and that the elements added are null-terminated strings. Here is the output:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

#define N 100000 // Number of elements
#define P 0.05   // False positive probability
#define M (int)(-(N * log(P)) / (log(2) * log(2))) // Size of bit array
#define K (int)(M / N * log(2))                    // Number of hash functions

unsigned char bitArray[M / 8 + 1];

void setBit(int n) {
    bitArray[n / 8] |= 1 << (n % 8);
}

int getBit(int n) {
    return bitArray[n / 8] & (1 << (n % 8));
}

// Basic hash functions (for demonstration purposes)
int hash1(char *str) {
    int hash = 5381;
    int c;
    while ((c = *str++))
        hash = ((hash << 5) + hash) + c;
    return abs(hash % M);
}

int hash2(char *str) {
    unsigned long hash = 0;
    int c;
    while ((c = *str++))
        hash = c + (hash << 6) + (hash << 16) - hash;
    return abs(hash % M);
}

void add(char *str) {
    setBit(hash1(str));
    setBit(hash2(str));
    // Add more hash functions if needed
}

int is_in(char *str) {
    return getBit(hash1(str)) && getBit(hash2(str));
    // Check more hash functions if used
}

int main() {
    char *testStr = "testString";
    add(testStr);

    if (is_in(testStr)) {
        printf("%s is probably in the set.\n", testStr);
    } else {
        printf("%s is definitely not in the set.\n", testStr);
    }

    return 0;
}

The Bloom filter GPT4 produced is not really up to scratch. Bloom filters are widespread; they are not some exotic data structure. Writing a decent one, however, requires a higher level of abstraction: for example, you need an effective way to hash the same string N times while keeping the hashes sufficiently uncorrelated with one another. Yet if you explicitly ask GPT4 to modify the hash function so that it produces N uncorrelated outputs, the solution it proposes is noticeably more reasonable:

unsigned int hash(const char* str, unsigned int hash_id) {
    unsigned int hash = 5381;
    int c;
    while ((c = *str++)) {
        hash = ((hash << 5) + hash) + c + hash_id;
    }
    return hash % BF_SIZE;
}

If GPT4 had been able to come up with a scheme like this on its own, it would have written the Bloom filter differently, perhaps using a single hash function to set several bits at once. GPT4 was perfectly able, in isolation, to write a suitable and more general hash function; but in the larger task of writing the Bloom filter it did not show sufficient reasoning, and instead simply spat out two different yet similar hash functions, the kind of result you would expect from a low-quality article about Bloom filters. The weak reasoning abilities of large language models, combined with a possible scarcity of good sources on the topic (and perhaps even an abundance of bad ones), together produce this meager result. And this is not an isolated case: my attempts to use LLMs for algorithmic or systems programming usually go poorly, even when the reasoning required is no greater than what LLMs seem to display when writing Python programs.

At the same time, GPT-4 is able, in a separate session, not only to decompile the function it produced but also to understand exactly what it is for: https://chat.openai.com/share/f54b64d7-d19a-4be4-8f10-08553aa812f1 So it is clearly inappropriate to say that GPT-4 is useless for systems programming.

Interestingly, in these cases the gap between small and large models is significant. Let's see how two local models compare. For example, Mixtral, Mistral's mixture-of-experts (MoE) model, responds to the same question as follows:

unsigned int hash(const char* str, unsigned int hash_id) {
    unsigned int hash = 5381;
    int c;
    while ((c = *str++)) {
        hash = ((hash << 5) + hash) + c;
    }
    return hash + hash_id;
}

Clearly, just appending hash_id at the end is a very poor solution. Mixtral is an excellent model for many uses, but as far as the already limited reasoning abilities of large language models are concerned, the rule so far is clear: bigger is better. Here is my conversation with deepseek-coder, a large language model I run locally. Because of limited RAM, I could only run the model quantized to 4 bits. Even so, as a huge model with 34 billion parameters, its reasoning ability on this problem seems quite remarkable. Here is the transcript of the chat (run with llama.cpp):

I have a hash function that should output N different hashes for the same data. However, I found that when I change the value of hash_id, the resulting hash distribution is very poor:
unsigned int hash(const char* str, unsigned int hash_id) {
    unsigned int hash = 5381;
    int c;
    while ((c = *str++)) {
        hash = ((hash << 5) + hash) + c;
    }
    return hash + hash_id;
}

Could it be that the salting strategy I'm using isn't good enough? How can I improve it?

[Deep-seek coder] A possible weakness of your implementation is that you simply add hash_id at the end of the hash. This can lead to an uneven distribution, especially when hash_id is too large or not evenly distributed itself. Consider using bitwise operations to mix hash_id in. Here is an example using an exclusive OR (XOR) operation:

unsigned int hash(const char* str, unsigned int hash_id) {
    unsigned int hash = 5381;
    int c;
    while ((c = *str++)) {
        hash = ((hash << 5) + hash) ^ c + hash_id; // Using XOR to mix in the hash_id
    }
    return (hash ^ hash_id); // Mixing the hash_id one more time at the end
}

This distributes hash_id more effectively than simple addition, resulting in a more even distribution of the hashes. [omitted... the actual reply is more detailed]

This model, running on my MacBook M1 Max, does quite well here. It successfully mixed addition and XOR. In this case the model was certainly helped by the clues I gave about the problem, but it was the model itself that identified the real issue and proposed a valid alternative. Nothing like that is within the reach of any book, any documentation, or any Google search. However you regard the model, whether as a primitive, interpolation-based result or otherwise, it has exhibited some form of reasoning, if by reasoning we mean identifying the root of a problem and finding a potential solution. However we choose to think about large language models, asserting that they are useless to programmers would be extremely rash.

And yet, in my experience over the past few months, for systems programming large language models rarely provide acceptable solutions if you are already an expert programmer. Let me show you another real-world example. My latest project, ggufflib, involves writing a library that reads and writes GGUF files, the format llama.cpp uses to load quantized models. Initially, to understand how the quantization formats work (the quantization bits of each quant are stored in particular ways, for speed), I tried using ChatGPT, but in the end I chose to reverse engineer the llama.cpp code, which was much faster. A large language model that could genuinely assist a systems programmer should be able to reconstruct the documentation of a data format from the structure declarations and the decoding functions. The functions in llama.cpp are short enough to fit entirely within GPT4's context, yet its output was useless. In cases like this, we do things the old-fashioned way: pen and paper in hand, read the code, and work out where the bits extracted by the decoder end up. So that you can better appreciate the case, you can try it yourself if you like. Here is a structure from the llama.cpp implementation.

// 6-bit quantization
// weight is represented as x = a * q
// 16 blocks of 16 elements each
// Effectively 6.5625 bits per weight
typedef struct {
    uint8_t ql[QK_K/2];      // quants, lower 4 bits
    uint8_t qh[QK_K/4];      // quants, upper 2 bits
    int8_t  scales[QK_K/16]; // scales, quantized with 8 bits
    ggml_fp16_t d;           // super-block scale
} block_q6_K;

Then there is the function used to perform the dequantization:

void dequantize_row_q6_K(const block_q6_K * restrict x, float * restrict y, int k) {
    assert(k % QK_K == 0);
    const int nb = k / QK_K;

    for (int i = 0; i < nb; i++) {
        const float d = GGML_FP16_TO_FP32(x[i].d);
        const uint8_t * restrict ql = x[i].ql;
        const uint8_t * restrict qh = x[i].qh;
        const int8_t  * restrict sc = x[i].scales;

        for (int n = 0; n < QK_K; n += 128) {
            for (int l = 0; l < 32; ++l) {
                int is = l/16;
                const int8_t q1 = (int8_t)((ql[l +  0] & 0xF) | (((qh[l] >> 0) & 3) << 4)) - 32;
                const int8_t q2 = (int8_t)((ql[l + 32] & 0xF) | (((qh[l] >> 2) & 3) << 4)) - 32;
                const int8_t q3 = (int8_t)((ql[l +  0] >>  4) | (((qh[l] >> 4) & 3) << 4)) - 32;
                const int8_t q4 = (int8_t)((ql[l + 32] >>  4) | (((qh[l] >> 6) & 3) << 4)) - 32;
                y[l +  0] = d * sc[is + 0] * q1;
                y[l + 32] = d * sc[is + 2] * q2;
                y[l + 64] = d * sc[is + 4] * q3;
                y[l + 96] = d * sc[is + 6] * q4;
            }
            y  += 128;
            ql += 64;
            qh += 32;
            sc += 8;
        }
    }
}

When I asked GPT4 to write an outline of the format used, it struggled to explain clearly how the lower and upper 4-bit nibbles of "ql" are stored, depending on the position of the weight. While writing this blog post, I also tried asking it to write a simplified version of the function that shows how the data is stored (maybe it's hard to explain in words but it can be expressed in code). The function it produced had a number of problems, such as wrong indexing, incorrect sign extension from 6 bits to 8 bits (it simply cast to uint8_t), and so on. By the way, here is the code I ended up writing myself:

} else if (tensor->type == GGUF_TYPE_Q6_K) {
    uint8_t *block = (uint8_t*)tensor->weights_data;
    uint64_t i = 0; // i-th weight to dequantize.
    while(i < tensor->num_weights) {
        float super_scale = from_half(*((uint16_t*)(block+128+64+16)));
        uint8_t *L = block;
        uint8_t *H = block+128;
        int8_t *scales = (int8_t*)block+128+64;
        for (int cluster = 0; cluster < 2; cluster++) {
            for (uint64_t j = 0; j < 128; j++) {
                f[i] = (super_scale * scales[j/16]) *
                       ((int8_t)
                        ((((L[j%64] >> (j/64*4)) & 0xF) |
                         (((H[j%32] >> (j/32*2)) & 3) << 4)))-32);
                i++;
                if (i == tensor->num_weights) return f;
            }
            L += 64;
            H += 32;
            scales += 8;
        }
        block += 128+64+16+2; // Go to the next block.
    }
}

From the function above I have stripped out its real contribution: a long comment documenting the exact format used by the Q6_K encoding in llama.cpp. It would be enormously helpful if GPT could produce that for me, and I believe it is only a matter of time: this kind of task is feasible without any technological breakthrough, just with a reasonable amount of scaling.
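As a rough stand-in for that missing comment, here is a small Python sketch I wrote by transliterating the C code above; treat it as an illustration of the Q6_K block layout (210 bytes encoding 256 weights) rather than as authoritative documentation:

import struct

def dequant_q6_k_block(block: bytes):
    # One Q6_K super-block: 128 bytes of low nibbles, 64 bytes of high 2-bit
    # pairs, 16 signed 8-bit scales, and a final fp16 super-block scale.
    assert len(block) == 128 + 64 + 16 + 2
    L = block[0:128]
    H = block[128:192]
    scales = struct.unpack('16b', block[192:208])
    d = struct.unpack('<e', block[208:210])[0]

    weights = []
    for cluster in range(2):                      # two clusters of 128 weights
        lo, ho, so = cluster * 64, cluster * 32, cluster * 8
        for j in range(128):
            low4 = (L[lo + j % 64] >> (j // 64 * 4)) & 0xF
            hi2 = (H[ho + j % 32] >> (j // 32 * 2)) & 3
            q = (low4 | (hi2 << 4)) - 32          # signed 6-bit quant
            weights.append(d * scales[so + j // 16] * q)
    return weights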

Revisiting programming

It has to be said: most programming today consists of retouching the same things over and over, in slightly different forms. Work of this kind does not require much reasoning, and large language models do it very well, even though their abilities remain strongly limited by context length. This should make programmers reflect: is it really worth writing programs of this kind? Sure, you can earn a good living at it, but if a large language model can do part of the work, it may not be the best place to be in five or ten years.

Then there is the question of whether these models really reason at all. Sometimes they only appear to reason because, as semioticians would say, the "signifier" creates an impression of meaning that does not actually exist. Anyone who has worked with large language models long enough knows that the opposite is also true: their ability to weave together what they have previously seen goes far beyond simple word repetition. Their pre-training, focused on predicting the next token, forces the model to build some kind of abstract model. That model may be fragile, patchy, and imperfect, but judging from what we observe, it exists. Where mathematical certainty is lacking and the top experts in the field are divided, trusting your own intuition seems wise.

Finally, what reason is there today not to use large language models to help with programming? Asking a large language model the right questions is a key skill: the less it is practiced, the less you will be able to use AI to improve your work. And describing a problem clearly matters just as much whether you are talking to a large language model or to a human being. Poor communication is a serious handicap, and many programmers communicate terribly despite being very capable in their own specialty. Besides, even Google is no longer what it used to be, so using large language models even just as a compressed form of documentation is a good idea. As for me, I will continue to use them a great deal. I have never liked digging into the details of some obscure communication protocol or the convolutions of a library method written by someone who wants to show off their skill. It feels like "junk knowledge" to me. Large language models free me from all of that, and I feel the help they bring every single day.
