Introducing my rust-llama.cpp fork


Have you been looking for an easy way to use large language models (LLMs) within the Rust ecosystem so that you don’t have to leave the warm embrace of Rust’s borrow checker? Well then, do I have a fork for you!


Georgi Gerganov’s llama.cpp project has continued to progress at insane speeds, powering a huge number of open-source projects due to the ease with which you can link its C++ code into other projects.

Due to the breakneck pace, it’s not a sustainable idea to rewrite its functionality in Rust. So if you want to use ‘GGUF’ LLM neural networks for text inference from within Rust, the easiest and sanest path is to call the C++ library using Rust’s Foreign Function Interface features.

There are multiple ways to achieve this, which basically boil down to creating either a high-level or a low-level wrapper. A common tool for the job is bindgen. With bindgen, you can automatically generate FFI wrappers for C/C++ code by writing a single file that controls the process. In this file, you can tell it to scan the C/C++ header files and create wrappers for everything it finds, while customizing what to ignore. This works well for C, but it gets tricky with C++ because much of what involves templates is unsupported or only partially supported. Wrapping just the plain C/C++ functions would leave you with a ‘low-level’ binding, anyway.
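To make the low-level versus high-level distinction concrete, here is a minimal sketch. The function fake_count_words is a hypothetical stand-in for something a C/C++ library would export; it is written in Rust only so the example compiles by itself, and it is not part of llama.cpp or any real binding. The safe count_words function on top of it is what a ‘high-level’ wrapper adds.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;

// Hypothetical stand-in for a function exported by a C/C++ library. In a real
// binding this would be declared in an `extern "C"` block (often generated by
// bindgen) and implemented on the C++ side.
unsafe extern "C" fn fake_count_words(text: *const c_char) -> usize {
    // SAFETY: the caller promises `text` points at a valid NUL-terminated string.
    let s = unsafe { CStr::from_ptr(text) };
    s.to_string_lossy().split_whitespace().count()
}

/// High-level wrapper: callers pass a `&str` and never touch raw pointers.
pub fn count_words(text: &str) -> Option<usize> {
    // CString::new fails if the input contains an interior NUL byte.
    let c_text = CString::new(text).ok()?;
    // SAFETY: `c_text` is a valid NUL-terminated string that outlives the call.
    Some(unsafe { fake_count_words(c_text.as_ptr()) })
}

fn main() {
    println!("{:?}", count_words("hello from the safe side"));
}
```

A low-level binding stops at the raw function and leaves the pointer handling and unsafe blocks to every caller; the high-level wrapper does that work once.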

I wanted the convenience of a higher-level binding, so I looked at the rust-llama.cpp wrappers listed in the llama.cpp README file. Text inference only worked when short-circuiting the token count to obscene numbers due to memory allocation errors. The embedding functions had no chance of working because they never updated the Rust Vector’s size. All around, it looked like a rough port of the Go wrappers mentioned in the README file, with relatively little evidence that it was actually used for anything real.
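The Vector size problem is a common FFI trap, so here is a minimal sketch of that class of bug. This is not the code from the original wrapper: fake_fill_embeddings is a hypothetical stand-in for a C/C++ function that writes floats into a caller-provided buffer, written in Rust so the example compiles by itself.

```rust
// Hypothetical stand-in for a C/C++ function that writes `n` floats into a
// caller-provided buffer.
unsafe extern "C" fn fake_fill_embeddings(out: *mut f32, n: usize) {
    for i in 0..n {
        // SAFETY: the caller guarantees `out` has room for `n` floats.
        unsafe { out.add(i).write(i as f32 * 0.5) };
    }
}

/// Buggy shape: the buffer is filled, but the Vec still reports a length of 0.
fn embeddings_len_never_updated(n: usize) -> Vec<f32> {
    let mut buf: Vec<f32> = Vec::with_capacity(n);
    unsafe { fake_fill_embeddings(buf.as_mut_ptr(), n) };
    buf // len() is still 0, so callers see an empty vector
}

/// Fixed shape: tell the Vec how many elements were initialized.
fn embeddings_fixed(n: usize) -> Vec<f32> {
    let mut buf: Vec<f32> = Vec::with_capacity(n);
    unsafe {
        fake_fill_embeddings(buf.as_mut_ptr(), n);
        // SAFETY: the call above initialized exactly `n` elements, and `n`
        // does not exceed the capacity reserved above.
        buf.set_len(n);
    }
    buf
}

fn main() {
    println!("buggy: {} values", embeddings_len_never_updated(4).len());
    println!("fixed: {:?}", embeddings_fixed(4));
}
```

Reserving capacity only allocates memory. Until the length is updated, the data written on the other side of the FFI boundary is invisible to Rust code.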

After investigating what I’d need to do to fix things up and implement the features I wanted, it looked like I was going to have to invest a fair amount of time into this. For that reason, I decided to fork the repo instead of submitting PR after PR. This would also mean I wouldn’t risk having features I wanted get turned down.

And so my own high-level Rust wrapper fork was born:

Feature Highlights

You can see the README file for the running list of changes, but I want to call attention to these specifically:

  • More up-to-date llama.cpp pinned as a git submodule.
  • Beyond the newer pinned submodule, the binding.cpp code was reworked to look similar to llama.cpp’s own main example (the main executable that project builds for the command line).
  • Integration tests for basic feature coverage! (They double as examples)
  • Added cuda and metal features to the library for hardware accelerated builds.
  • Silenced llama.cpp output by adding a logfile feature to the library and directing all of the output there instead. If the feature isn’t enabled, the llama.cpp output is dropped. Note: some GGML spam still exists at publication time.
  • In-memory prompt caching support so that repeat requests with the exact same prompt do not have to ingest the prompt again. This is a huge speed-up in cases where you’re running a model that is not fully offloaded to the GPU.
  • Added support for GBNF Grammars so that output of the LLM can be constrained to a known format.

Admittedly, the last two features listed are possible largely due to the binding.cpp rework to use llama.cpp’s included sampler functions. And the hack used to get the metal feature working for hardware acceleration on Apple silicon came from the Rustformers llm repository.

And of course, there are a few memory-management fixes, like freeing models properly so you can switch them out at runtime, as well as other miscellaneous fixes, like actually getting the embeddings functions working.

As an example of what’s possible, I had some fun with the grammar integration test and had it generate JSON for video game character descriptions. This was the first output Airoboros-L2-13B Q8_0 gave me:


How To Use The Wrappers In A New Project

All of the integration tests show how easy it can be to get results, but to be more helpful to newer Rustaceans, here is a basic rundown:

Create A New App With Cargo

To create the shell of the project, use this cargo command:

cargo new hello_rusty_llama 

It will tell you it’s creating a binary application.

Add The Rust-llama.cpp Dependency

My rust-llama.cpp fork is added by git reference. When you do this in a Rust project, it pins the build to that specific git commit so that the build remains reproducible.

cd hello_rusty_llama
cargo add --git llama_cpp_rs

That automatically adds the project to your dependency list in the Cargo.toml file. At this point, you should be able to execute cargo build and get a compiled “Hello, world!” output with cargo run, but we want our own locally sourced and freshly cooked text!

Download A Model

To keep things dead simple, we’re going to use an ‘old’ model: Airoboros-L2-7B-2.2.1. This is also considered a ‘small’ model, but despite that, you’ll need 8 GB of VRAM on your graphics card to run it completely accelerated. If you have less, you can run it completely, or partially, on the CPU. In the project directory, use wget to download the model, or download it by another means and move it there.


Write Some Code!

Finally the fun part, right?! Open up the src/ file in your editor of choice and paste in the following block. It looks long but that’s just because I commented it heavily for illustration purposes.

use llama_cpp_rs::{
    options::{ModelOptions, PredictOptions},
    LLama,
};

fn main() {
    // setup the parameters for loading the large language model file
    let model_params = ModelOptions {
        // the airoboros 7b model reference in this example can handle
        // a context size of 4096, but more VRAM constrained machines
        // might not be able to cope. Smaller context means less VRAM
        // used, and we won't need this much for this example.
        context_size: 2048,

        // this determines how many of the 'layers' in the LLM get sent
        // to the GPU for processing. specifying a large number just
        // ensures all layers are offloaded. if you want to run on the
        // CPU, comment this parameter out or set to 0.
        n_gpu_layers: 9999,

        // leave every other option at the library default
        ..Default::default()
    };

    // load the model we downloaded with the above parameters. it will need
    // to be `mut` because running Predict can update internal cached data.
    let llama_filepath_str = "airoboros-l2-7b-2.2.1.Q4_K_S.gguf".to_string();
    let mut llm_model = match LLama::new(llama_filepath_str, &model_params) {
        Ok(m) => m,
        Err(err) => panic!("Failed to load model: {err}"),
    };

    // setup the parameters that control the text prediction. only some
    // of the possibilities are shown here.
    let predict_options = PredictOptions {
        // predict at most 256 tokens, which will hard stop the prediction
        // at that point if the model doesn't finish earlier.
        tokens: 256,

        // this is how big the 'prompt' chunks should be for processing.
        // if using only CPU, you may wish to set this much lower, to
        // something like 8.
        batch: 512,

        // the following three options are 'sampler' settings to control
        // how the text is predicted. play with the possibilities here...
        temperature: 1.3,
        min_p: 0.05,
        penalty: 1.03,

        // we define our callback with a closure and just have it print
        // the generated tokens to stdout.
        token_callback: Some(std::sync::Arc::new(move |token| {
            print!("{}", token);
            let _ = std::io::Write::flush(&mut std::io::stdout());
            true // returning true means text prediction should continue
        })),

        // leave every other option at the library default
        ..Default::default()
    };

    // this is the 'prompt' to send the LLM that is basically telling it
    // what to write. different models have different formats and this
    // is the format for Airoboros 2.2.1.
    let prompt = "A chat\n\
        USER: Write the start to the next movie collaboration between \
        Quentin Tarantino and Robert Rodriguez.\n\
        ASSISTANT:";

    // finally, call predict() to generate the text. the callback in
    // predict_options will get each token as it happens, but the final
    // result will be in the (String, LLamaPredictTimings) tuple of
    // the Result object returned.
    let _result = llm_model.predict(prompt.to_string(), &predict_options);
}

The flow of the code should be well explained by the comments. To see it run accelerated on your GPU, just run it with cargo:

cargo run --features "llama_cpp_rs/cuda"

Note: If you’re running on Apple silicon, the feature you need is actually "llama_cpp_rs/metal". And if you’re running on CPU only, you can omit the features parameter altogether and just use cargo run.
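The token callback can do more than print. Because it returns a bool, it can also ask the prediction to stop early. Here is a minimal sketch of that idea. The TokenCallback type is an assumption about the callback's shape based on the closure in the example, and fake_predict is a hypothetical stand-in for the library's prediction loop so the code runs without a model.

```rust
use std::sync::{Arc, Mutex};

// Assumed callback shape: receives each token, returns `true` to continue.
type TokenCallback = Arc<dyn Fn(String) -> bool + Send + Sync>;

// Hypothetical stand-in for the library's predict loop: it feeds canned
// tokens to the callback and stops as soon as the callback returns false.
fn fake_predict(tokens: &[&str], callback: &TokenCallback) -> usize {
    let mut delivered = 0;
    for token in tokens {
        delivered += 1;
        if !callback(token.to_string()) {
            break;
        }
    }
    delivered
}

/// Builds a callback that appends tokens to `sink` and asks for a stop once
/// the collected text contains `stop_text`.
fn collect_until(sink: Arc<Mutex<String>>, stop_text: &'static str) -> TokenCallback {
    Arc::new(move |token: String| {
        let mut text = sink.lock().unwrap();
        text.push_str(&token);
        !text.contains(stop_text)
    })
}

fn main() {
    let sink = Arc::new(Mutex::new(String::new()));
    let callback = collect_until(Arc::clone(&sink), "USER:");
    let delivered = fake_predict(&["INT.", " BAR", "\n", "USER:", " more"], &callback);
    println!("{} tokens delivered: {:?}", delivered, sink.lock().unwrap());
}
```

Stopping when the model starts writing a new USER: turn is a common trick for chat-style prompts, since the model would otherwise happily play both sides of the conversation.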

What Next?

That’s basically how easy it is to start with the rust-llama.cpp wrapper. You can play with some PredictOptions fields to change the samplers. You could set the prompt_cache_in_memory field to true and then call predict multiple times with the same prompt but with different sampler settings. You could pull one of the sample grammars from llama.cpp into a string and then set the PredictOptions grammar string to that to constrain its output. You could import reqwest, pull content from the web, push it into a prompt and ask the LLM to summarize it … The world’s your oyster.
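For the summarization idea, most of the work is just string handling. Here is a minimal sketch of building the prompt, assuming the same "A chat / USER:" layout as the example prompt and a closing ASSISTANT: turn marker. Both helper functions are hypothetical and not part of the wrapper, and the character limit is only a crude stand-in for a real token budget.

```rust
/// Wraps a user request in the chat layout used by the example prompt.
/// Hypothetical helper; the trailing "ASSISTANT:" marker is an assumption
/// about the model's prompt format.
fn build_prompt(user_message: &str) -> String {
    format!("A chat\nUSER: {}\nASSISTANT:", user_message.trim())
}

/// Builds a summarization request, keeping at most `max_chars` characters of
/// the source text so the prompt does not grow without bound.
fn summarize_prompt(article: &str, max_chars: usize) -> String {
    let clipped: String = article.chars().take(max_chars).collect();
    build_prompt(&format!("Summarize the following text:\n{}", clipped))
}

fn main() {
    let article = "Rust is a systems programming language focused on safety.";
    println!("{}", summarize_prompt(article, 40));
}
```

The resulting String can be handed to predict() the same way the hard-coded prompt was.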

Hell, if you’re ambitious enough, you can write your own terminal UI chat application and create a character based on Flynn from the movie ‘Tron’ and chat with him for game development advice … but more on that later …


Feel free to discuss this post in my Lemmy community over here.