This guide provides a comprehensive overview of the llama.cpp library and how to use it to develop custom applications.
- 1. Introduction
- 2. Getting Started
- 3. Advanced Topics
- 4. Chat Applications
- 5. Backends
- 6. Error Handling
- 7. Performance Tips
- 8. Conclusion
llama.cpp is a C/C++ library for running Large Language Models (LLMs) locally. It is designed for high performance and portability, with support for various hardware backends (CPU, GPU) and operating systems.
This guide will walk you through the key concepts and APIs of the library, with code examples to help you get started.
To use llama.cpp in your own project, you need to include the llama.h header file and link against the llama library.
The best way to understand how to use the library is to look at the examples provided in the examples directory. The simple example is a good starting point.
Before you can use the library, you need to initialize the backend.
```cpp
#include "llama.h"

int main() {
    llama_backend_init();
    // ...
    llama_backend_free();
    return 0;
}
```

The llama_backend_init() function initializes the backend. You can control the NUMA (Non-Uniform Memory Access) behavior with llama_numa_init(). The llama_backend_free() function frees the resources used by the backend.
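If your machine has multiple NUMA nodes, you may want to call llama_numa_init() right after llama_backend_init(). A minimal sketch, assuming the GGML_NUMA_STRATEGY_DISTRIBUTE strategy from ggml suits your hardware (other strategies exist; pick the one that matches your setup):

```cpp
#include "llama.h"

int main() {
    llama_backend_init();

    // Optional: distribute model data across NUMA nodes.
    // GGML_NUMA_STRATEGY_DISTRIBUTE is one of the ggml_numa_strategy values.
    llama_numa_init(GGML_NUMA_STRATEGY_DISTRIBUTE);

    // ... load the model, create a context, run inference ...

    llama_backend_free();
    return 0;
}
```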
To load a model, you need to use the llama_model_load_from_file() function. This function takes the path to the model file and a llama_model_params struct as input.
```cpp
llama_model_params model_params = llama_model_default_params();
model_params.n_gpu_layers = 99; // Offload all layers to the GPU

llama_model * model = llama_model_load_from_file("path/to/model.gguf", model_params);
if (!model) {
    fprintf(stderr, "error: unable to load model\n");
    return 1;
}
```

The llama_model_default_params() function returns a llama_model_params struct with default values. You can modify this struct to customize the model loading process. For example, you can set the number of layers to offload to the GPU.
Once you have loaded a model, you need to create a context. The context holds the state of the model and is used for inference.
```cpp
llama_context_params ctx_params = llama_context_default_params();
ctx_params.n_ctx = 2048; // Context size

llama_context * ctx = llama_init_from_model(model, ctx_params);
if (!ctx) {
    fprintf(stderr, "error: failed to create the llama_context\n");
    return 1;
}
```

The llama_context_default_params() function returns a llama_context_params struct with default values. You can modify this struct to customize the context creation process. For example, you can set the context size (n_ctx).
Before you can run inference, you need to tokenize the input prompt. The llama_tokenize() function can be used for this purpose.
```cpp
const llama_vocab * vocab = llama_model_get_vocab(model);

std::vector<llama_token> tokens = common_tokenize(vocab, "Hello, world!", true);
```

The common_tokenize() function is a helper from common.h that simplifies tokenization.
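If you prefer not to depend on the common helpers, you can call llama_tokenize() directly. A hedged sketch of the usual two-pass pattern (a negative return value indicates how many tokens the buffer would need):

```cpp
#include "llama.h"
#include <string>
#include <vector>

// Tokenize a prompt with the raw C API (sketch; error handling kept minimal).
static std::vector<llama_token> tokenize_prompt(const llama_vocab * vocab, const std::string & text) {
    // Start with a generous guess for the buffer size.
    std::vector<llama_token> tokens(text.size() + 2);
    int n = llama_tokenize(vocab, text.c_str(), (int) text.size(),
                           tokens.data(), (int) tokens.size(),
                           /*add_special=*/true, /*parse_special=*/false);
    if (n < 0) {
        // The buffer was too small: -n is the required number of tokens.
        tokens.resize(-n);
        n = llama_tokenize(vocab, text.c_str(), (int) text.size(),
                           tokens.data(), (int) tokens.size(),
                           /*add_special=*/true, /*parse_special=*/false);
    }
    tokens.resize(n);
    return tokens;
}
```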
To run inference, you need to use the llama_decode() function. This function takes a llama_batch as input.
```cpp
llama_batch batch = llama_batch_init(tokens.size(), 0, 1);
for (size_t i = 0; i < tokens.size(); i++) {
    common_batch_add(batch, tokens[i], i, { 0 }, false);
}
batch.logits[batch.n_tokens - 1] = true; // We want the logits for the last token

if (llama_decode(ctx, batch) != 0) {
    fprintf(stderr, "llama_decode() failed\n");
    return 1;
}
```

The llama_batch_init() function initializes a llama_batch. The common_batch_add() function is a helper to add a token to the batch. The logits field of the batch determines for which tokens the logits will be returned.
After running inference, you can use a sampler to sample the next token from the logits.
```cpp
llama_sampler * smpl = llama_sampler_chain_init(llama_sampler_chain_default_params());
llama_sampler_chain_add(smpl, llama_sampler_init_greedy());

const llama_token new_token_id = llama_sampler_sample(smpl, ctx, -1);
```

The llama_sampler_chain_init() function initializes a sampler chain. You can add different samplers to the chain, such as greedy sampling, top-k sampling, and top-p sampling. The llama_sampler_sample() function samples the next token.
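To go beyond greedy decoding, several samplers can be chained together. A sketch using the top-k, top-p, temperature, and dist samplers from llama.h (the parameter values below are arbitrary examples, not recommendations):

```cpp
llama_sampler * smpl = llama_sampler_chain_init(llama_sampler_chain_default_params());

// Samplers are applied in the order they are added to the chain.
llama_sampler_chain_add(smpl, llama_sampler_init_top_k(40));       // keep the 40 most likely tokens
llama_sampler_chain_add(smpl, llama_sampler_init_top_p(0.95f, 1)); // nucleus sampling, keep at least 1 token
llama_sampler_chain_add(smpl, llama_sampler_init_temp(0.8f));      // temperature scaling
llama_sampler_chain_add(smpl, llama_sampler_init_dist(LLAMA_DEFAULT_SEED)); // final random pick

const llama_token new_token_id = llama_sampler_sample(smpl, ctx, -1);

// Free the whole chain when you are done with it.
llama_sampler_free(smpl);
```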
By repeatedly running inference and sampling, you can generate text.
```cpp
for (int i = 0; i < n_predict; ++i) {
    const llama_token new_token_id = llama_sampler_sample(smpl, ctx, -1);
    if (llama_vocab_is_eog(vocab, new_token_id)) {
        break;
    }
    printf("%s", common_token_to_piece(ctx, new_token_id).c_str());

    common_batch_clear(batch);
    common_batch_add(batch, new_token_id, n_past + i, { 0 }, true);

    if (llama_decode(ctx, batch) != 0) {
        fprintf(stderr, "failed to decode\n");
        return 1;
    }
}
```

llama.cpp supports batching to improve performance. The batched example shows how to use batching to process multiple sequences in parallel.
The key idea is to create a llama_batch with a larger size and add tokens from different sequences to it. The llama_decode() function will then process the entire batch in a single call.
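As a rough sketch of that idea (the prompts, their lengths, and the number of sequences are made up for illustration), tokens from different sequences are tagged with their sequence ID and decoded in one call; note that the context must be created with n_seq_max large enough for the number of sequences you use:

```cpp
// Two prompts, already tokenized as shown earlier.
std::vector<llama_token> prompt0 = common_tokenize(vocab, "Hello",   true);
std::vector<llama_token> prompt1 = common_tokenize(vocab, "Bonjour", true);

// One batch that is large enough to hold both sequences.
llama_batch batch = llama_batch_init(prompt0.size() + prompt1.size(), 0, 1);

// Sequence 0
for (size_t i = 0; i < prompt0.size(); i++) {
    common_batch_add(batch, prompt0[i], i, { 0 }, i == prompt0.size() - 1);
}
// Sequence 1
for (size_t i = 0; i < prompt1.size(); i++) {
    common_batch_add(batch, prompt1[i], i, { 1 }, i == prompt1.size() - 1);
}

// A single call processes both sequences; logits were requested for the last
// token of each sequence, so both can be sampled afterwards.
if (llama_decode(ctx, batch) != 0) {
    fprintf(stderr, "llama_decode() failed\n");
    return 1;
}
```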
The parallel example demonstrates how to use llama.cpp to simulate a server with multiple clients. Each client has its own sequence, and the server processes the requests in parallel.
This example uses a single context and a single batch, but it uses different sequence IDs to distinguish between the clients.
The embedding example shows how to use llama.cpp to generate embeddings for a given text.
To get embeddings, you need to set the embedding parameter to true in the common_params. The llama_get_embeddings() function can then be used to get the embeddings.
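A rough sketch of obtaining a pooled embedding directly through the llama.h API, assuming the embeddings and pooling_type fields of llama_context_params and the llama_get_embeddings_seq() / llama_model_n_embd() functions from recent versions of the header (check your llama.h if the names differ):

```cpp
llama_context_params ctx_params = llama_context_default_params();
ctx_params.embeddings   = true;                    // ask the context to produce embeddings
ctx_params.pooling_type = LLAMA_POOLING_TYPE_MEAN; // one pooled vector per sequence

llama_context * ctx = llama_init_from_model(model, ctx_params);

// ... tokenize the input text and llama_decode() it as sequence 0 ...

const int     n_embd = llama_model_n_embd(model);
const float * embd   = llama_get_embeddings_seq(ctx, 0); // pooled embedding for sequence 0
if (embd) {
    for (int i = 0; i < n_embd; i++) {
        printf("%f ", embd[i]);
    }
    printf("\n");
}
```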
The save-load-state example shows how to save and load the state of the model and context.
The llama_state_get_size() and llama_state_get_data() functions can be used to get the state data. The llama_state_set_data() function can be used to set the state data.
This is useful for saving the state of a long-running generation and resuming it later.
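A minimal sketch of saving and restoring the state in memory (file I/O and error handling are left out; the context must be created from the same model when restoring):

```cpp
// Save: query the state size, then copy the state into a buffer.
const size_t state_size = llama_state_get_size(ctx);
std::vector<uint8_t> state(state_size);
llama_state_get_data(ctx, state.data(), state.size());

// ... later, possibly after recreating the context from the same model ...

// Restore: copy the buffer back into the context.
llama_state_set_data(ctx, state.data(), state.size());
```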
The KV cache stores the key-value pairs for the tokens that have been processed. This allows the model to reuse the computations for the tokens that have already been seen.
llama.cpp provides several functions for managing the KV cache:
- llama_kv_cache_seq_cp(): Copies the KV cache entries of one sequence to another.
- llama_kv_cache_seq_rm(): Removes a range of tokens from the KV cache for a specific sequence.
- llama_kv_cache_seq_div(): Divides the positions of a range of tokens in a sequence by a factor (used for context-extension techniques).
These functions can be used to implement more advanced generation strategies, such as speculative decoding and tree-based search.
The parallel example shows how to use llama_kv_cache_seq_cp to share the system prompt's KV cache among multiple sequences.
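A sketch of that pattern, sharing a system prompt's KV cache across several client sequences (the function names follow the llama_kv_cache_seq_* API listed above; recent releases expose the same operations under renamed functions, so check your llama.h; n_clients and n_sys_tokens are hypothetical values):

```cpp
// The shared system prompt has already been decoded as sequence 0
// and occupies positions [0, n_sys_tokens).
const int n_clients    = 4;
const int n_sys_tokens = 32; // hypothetical length of the system prompt

// Copy the system prompt's KV cache entries to each client sequence,
// so the shared prefix is not re-processed per client.
for (int seq = 1; seq <= n_clients; seq++) {
    llama_kv_cache_seq_cp(ctx, 0, seq, 0, n_sys_tokens);
}

// When a client finishes, drop its tokens after the shared prefix
// so the slot can be reused (p1 = -1 means "to the end").
llama_kv_cache_seq_rm(ctx, /*seq_id=*/1, /*p0=*/n_sys_tokens, /*p1=*/-1);
```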
The simple-chat example shows how to build a simple chat application using llama.cpp.
This example uses a loop to get input from the user, generate a response, and then print the response. It also shows how to use a chat template to format the conversation.
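A sketch of the template step, assuming the llama_model_chat_template() and llama_chat_apply_template() functions from recent versions of llama.h:

```cpp
// The conversation so far.
std::vector<llama_chat_message> messages = {
    { "system", "You are a helpful assistant." },
    { "user",   "Hello!"                       },
};

// Use the chat template embedded in the GGUF model (may be null if the model has none).
const char * tmpl = llama_model_chat_template(model, /*name=*/nullptr);

// Format the messages into a single prompt string; add_ass appends the assistant prefix.
std::vector<char> buf(4096);
int32_t n = llama_chat_apply_template(tmpl, messages.data(), messages.size(),
                                      /*add_ass=*/true, buf.data(), buf.size());
if (n > (int32_t) buf.size()) {
    // The buffer was too small: n is the required size, so retry.
    buf.resize(n);
    n = llama_chat_apply_template(tmpl, messages.data(), messages.size(), true, buf.data(), buf.size());
}
std::string prompt(buf.data(), n);
// prompt can now be tokenized and decoded as shown earlier.
```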
llama.cpp supports different backends for running inference on different hardware.
The CPU backend is the default backend. It is optimized for performance on a wide range of CPUs.
llama.cpp supports NVIDIA GPUs via CUDA and AMD GPUs via HIP. To use the GPU backend, you need to set the n_gpu_layers parameter in the llama_model_params to a value greater than 0.
The build.md file in the docs directory provides instructions on how to build llama.cpp with GPU support.
It is important to handle errors when using llama.cpp. Most functions in the library return an error code if something goes wrong.
For example, the llama_model_load_from_file() function returns a nullptr if it fails to load the model. The llama_decode() function returns a non-zero value if it fails to decode the batch.
```cpp
if (llama_decode(ctx, batch) != 0) {
    fprintf(stderr, "llama_decode() failed\n");
    return 1;
}
```

You should always check the return values of the functions you call and handle errors appropriately.
Here are some tips for optimizing the performance of llama.cpp:
- Choose the right parameters: The performance of llama.cpp is affected by the parameters you choose. For example, the n_ctx and n_batch parameters influence memory usage and inference speed. Experiment with different values to find what works best for your application (see the sketch after this list).
- Use batching effectively: Batching can significantly improve the performance of llama.cpp. Try to process as many tokens as possible in a single batch. The batched and parallel examples show how to use batching effectively.
- Use the GPU backend: If you have a supported GPU, you can use the GPU backend to accelerate inference. To do so, set the n_gpu_layers parameter in the llama_model_params to a value greater than 0.
- Use the right build options: When building llama.cpp, you can enable different build options to optimize for performance. For example, enabling the CUDA backend (the GGML_CUDA option in current builds, formerly LLAMA_CUBLAS) uses NVIDIA GPUs for matrix multiplication.
This guide has provided a comprehensive overview of the llama.cpp library. By following the examples and the documentation, you should be able to use llama.cpp to develop your own custom applications.