
Use llama.cpp to run LLMs quickly on the CPU

Large Language Models (LLMs) are becoming more and more popular, but running them is computationally expensive. Many researchers are working on this problem; for example, HuggingFace has added support for loading models in 4-bit and 8-bit precision, but those approaches still require a GPU. It is possible to run LLMs directly on the CPU, but plain CPU inference has not been fast enough for most practical needs. Recent work by Georgi Gerganov changes that: his llama.cpp library provides high-speed CPU inference for a variety of LLMs.

The original llama.cpp project focused on running models locally from the shell. That offers little flexibility and makes it hard to take advantage of the large ecosystem of Python libraries when building applications. Python bindings, provided by the llama-cpp-python package (which frameworks such as LangChain also build on), now make it possible to use llama.cpp from Python.

In this post, we will show how to use the llama.cpp library in Python via the llama-cpp-python package. We will also use it to run the Vicuna LLM.

llama-cpp-python

pip install llama-cpp-python

For more detailed installation instructions, see the llama-cpp-python documentation: https://github.com/abetlen/llama-cpp-python#installation-from-pypi-recommended.
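
As a quick sanity check after installation, you can try importing the package. This is only a minimal sketch; it does not load any model yet:

from llama_cpp import Llama

# If this import succeeds, the package and its compiled llama.cpp backend are installed
print("llama-cpp-python imported successfully")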

Use an LLM with llama-cpp-python

As long as a language model has been converted to the GGML format, it can be loaded and used by llama.cpp, and GGML versions of most popular LLMs are already available.

It is important to note that the original models are quantized when they are converted to GGML format. The benefit of quantization is that it reduces the memory required to run these large models without significantly degrading their performance. For example, a 7-billion-parameter model that takes about 13 GB in 16-bit precision can be loaded in less than 4 GB of RAM.
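
As a rough illustration of where these numbers come from, here is a back-of-the-envelope calculation (the byte counts are approximations, not exact figures for any specific model file):

# Approximate memory needed for a 7-billion-parameter model
params = 7_000_000_000
fp16_gb = params * 2 / 1024**3    # 16-bit floats: ~2 bytes per parameter
q4_gb = params * 0.5 / 1024**3    # 4-bit quantization: ~0.5 bytes per parameter
print(f"fp16: ~{fp16_gb:.1f} GB, 4-bit: ~{q4_gb:.1f} GB")
# Prints roughly: fp16: ~13.0 GB, 4-bit: ~3.3 GB
# Actual quantized files are slightly larger because they also store scaling factors.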

In this article, we use the GGML version of Vicuna-7B, which can be downloaded from HuggingFace: https://huggingface.co/CRD716/ggml-vicuna-1.1-quantized.

Download the GGML file and load the LLM

You can download the model using the following code. The code also checks if the file already exists before attempting to download it.

import os
import urllib.request


def download_file(file_link, filename):
    # Check if the file already exists before downloading
    if not os.path.isfile(filename):
        urllib.request.urlretrieve(file_link, filename)
        print("File downloaded successfully.")
    else:
        print("File already exists.")


# Download the GGML model from HuggingFace
ggml_model_path = "https://huggingface.co/CRD716/ggml-vicuna-1.1-quantized/resolve/main/ggml-vicuna-7b-1.1-q4_1.bin"
filename = "ggml-vicuna-7b-1.1-q4_1.bin"

download_file(ggml_model_path, filename)

The next step is to load the model:

from llama_cpp import Llama

llm = Llama(model_path="ggml-vicuna-7b-1.1-q4_1.bin", n_ctx=512, n_batch=126)

When loading a model, you should set two important parameters.

n_ctx: Used to set the maximum context size for the model. The default value is 512 tokens.

The context size is the sum of the number of tokens in the input prompt and the maximum number of tokens that the model can generate. Models with smaller context sizes generate text much faster than models with larger context sizes.

n_batch: Used to set the maximum number of prompt tokens to batch when generating text. The default value is 512 tokens.

The n_batch parameter should be set carefully. Lowering n_batch can help speed up text generation on multithreaded CPUs, but lowering it too much may cause text generation to deteriorate significantly.
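
In addition to n_ctx and n_batch, the Llama constructor accepts other options; for example, n_threads controls how many CPU threads are used for inference. A sketch with illustrative values (these are not tuned recommendations; adjust them for your machine):

from llama_cpp import Llama

llm = Llama(
    model_path="ggml-vicuna-7b-1.1-q4_1.bin",
    n_ctx=512,      # maximum context size (prompt tokens + generated tokens)
    n_batch=126,    # number of prompt tokens processed per batch
    n_threads=4,    # number of CPU threads used for inference
)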

Use the LLM to generate text

The following code defines a simple wrapper function for generating text with the LLM.

def generate_text(
    prompt="Who is the CEO of Apple?",
    max_tokens=256,
    temperature=0.1,
    top_p=0.5,
    echo=False,
    stop=["#"],
):
    output = llm(
        prompt,
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        echo=echo,
        stop=stop,
    )
    output_text = output["choices"][0]["text"].strip()
    return output_text

The call to the llm object takes several important parameters:

prompt: The input prompt for the model. The text is tokenized and passed to the model.

max_tokens: The maximum number of tokens the model is allowed to generate; it controls the length of the generated text. The default value is 128 tokens.

temperature: A value between 0 and 1. A higher value, such as 0.8, makes the output more random, while a lower value, such as 0.2, makes the output more focused and deterministic. The default value is 1.

top_p: An alternative to temperature sampling called nucleus sampling, where the model considers only the tokens comprising the top_p probability mass. A value of 0.1 means that only the tokens making up the top 10% of probability mass are considered.

echo: Controls whether the model returns (echoes) the prompt at the beginning of the generated text.

stop: A list of strings used to stop text generation. If the model encounters any of these strings, generation stops at that token. This helps limit hallucination and prevents the model from generating unnecessary text.
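
To see how these parameters fit together, here is a sketch that calls the wrapper defined above with explicit values (the prompt and stop strings are just examples):

# Example call: keep the answer short and deterministic, and stop at a new question or newline
answer = generate_text(
    prompt="Q: Who is the CEO of Apple? A:",
    max_tokens=64,
    temperature=0.1,    # low temperature -> more deterministic output
    top_p=0.5,
    echo=False,         # do not repeat the prompt in the output
    stop=["Q:", "\n"],  # stop generating when a new question or a newline appears
)
print(answer)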

The llm object returns a dictionary object of the following form:

{
    "id": "xxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",  # text generation id
    "object": "text_completion",              # object name
    "created": 1679561337,                    # timestamp
    "model": "./models/7B/ggml-model.bin",    # model path
    "choices": [
        {
            "text": "Q: Name the planets in the solar system? A: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune and Pluto.",  # generated text
            "index": 0,
            "logprobs": None,
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 14,      # number of tokens in the prompt
        "completion_tokens": 28,  # number of tokens in the generated text
        "total_tokens": 42
    }
}

The generated text can be extracted from this dictionary using output["choices"][0]["text"].
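
For example, assuming output is the dictionary returned by a call to the llm object, the generated text and the token counts can be read like this:

output = llm("Q: Name the planets in the solar system? A:", max_tokens=64, stop=["Q:"])
generated_text = output["choices"][0]["text"].strip()
total_tokens = output["usage"]["total_tokens"]
print(generated_text)
print(f"Total tokens used: {total_tokens}")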

Sample code for generating text using Vicuna-7B

import os
import urllib.request

from llama_cpp import Llama


def download_file(file_link, filename):
    # Check if the file already exists before downloading
    if not os.path.isfile(filename):
        urllib.request.urlretrieve(file_link, filename)
        print("File downloaded successfully.")
    else:
        print("File already exists.")


# Download the GGML model from HuggingFace
ggml_model_path = "https://huggingface.co/CRD716/ggml-vicuna-1.1-quantized/resolve/main/ggml-vicuna-7b-1.1-q4_1.bin"
filename = "ggml-vicuna-7b-1.1-q4_1.bin"

download_file(ggml_model_path, filename)

llm = Llama(model_path="ggml-vicuna-7b-1.1-q4_1.bin", n_ctx=512, n_batch=126)


def generate_text(
    prompt="Who is the CEO of Apple?",
    max_tokens=256,
    temperature=0.1,
    top_p=0.5,
    echo=False,
    stop=["#"],
):
    output = llm(
        prompt,
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        echo=echo,
        stop=stop,
    )
    output_text = output["choices"][0]["text"].strip()
    return output_text


generate_text(
    "Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions.",
    max_tokens=356,
)

The resulting text is as follows:

Hawaii is a state located in the United States of America that is known for its beautiful beaches, lush landscapes, and rich culture. It is made up of six islands: Oahu, Maui, Kauai, Lanai, Molokai, and Hawaii (also known as the Big Island). Each island has its own unique attractions and experiences to offer visitors.
One of the most interesting cultural experiences in Hawaii is visiting a traditional Hawaiian village or ahupuaa. An ahupuaa is a system of land use that was used by ancient Hawaiians to manage their resources sustainably. It consists of a coastal area, a freshwater stream, and the surrounding uplands and forests. Visitors can learn about this traditional way of life at the Polynesian Cultural Center in Oahu or by visiting a traditional Hawaiian village on one of the other islands.
Another must-see attraction in Hawaii is the Pearl Harbor Memorial. This historic site commemorates the attack on Pearl Harbor on December 7, 1941, which led to the United States' entry into World War II. Visitors can see the USS Arizona Memorial, a memorial that sits above the sunken battleship USS Arizona and provides an overview of the attack. They can also visit other museums and exhibits on the site to learn more about this important event in American history.
Hawaii is also known for its beautiful beaches and crystal clear waters, which are perfect for swimming, snorkeling, and sunbathing.           

Summary

In this post, we have shown how to use the llama.cpp library and the llama-cpp-python package in Python. Together, these tools make it possible to run LLMs efficiently on the CPU.

Llama.cpp is updated almost daily: inference keeps getting faster, and the community regularly adds support for new models. llama.cpp also ships a convert.py script that can help you convert a PyTorch model to the GGML format.

The llama.cpp library and the llama-cpp-python package provide a robust solution for running LLMs efficiently on the CPU. If you are interested in incorporating LLMs into your applications, I recommend taking a closer look at this package.

This article was written by Ashwin Mathur