Running LLMs fast and efficiently on my own Mac with a 2 MB app

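This article presents a solution based on Rust and WebAssembly (Wasm) for fast, portable inference of Llama2 models on heterogeneous edge computing devices.

Compared with Python, the Rust + Wasm application can be 1/100 of the size and 100x the speed, and it runs securely with full hardware acceleration without any change to the binary code.

The work builds on the llama.cpp project created by Georgi Gerganov and adapts the original C++ program to run on Wasm.

Setup consists of installing WasmEdge with the GGML plugin, downloading the prebuilt Wasm app and a model, and then using WasmEdge to run the Wasm inference app with a model file in GGUF format.

The article also lists the command-line options that configure how you interact with the model.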

Original article: https://www.secondstate.io/articles/fast-llm-inference/

Translator | 明明如月

Editor | 梦依丹

Produced by | CSDN (ID: CSDNnews)

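Compared with Python, a Rust + Wasm app can be 1/100 of the size and 100x the speed. Most importantly, it runs securely on a wide range of hardware accelerators without any change to the binary code. See also: Why did Elon Musk say that Rust is the language of AGI?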

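We created a very simple Rust program that runs inference on llama2 models at native speed. When compiled to Wasm, the binary application (only 2 MB) is fully portable across devices with heterogeneous hardware accelerators. The Wasm runtime (WasmEdge) also provides a safe and secure execution environment for the cloud. In fact, the WasmEdge runtime works seamlessly with container tools, so this portable application can be deployed and executed on many different devices.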

Chatting with the llama2 model on my MacBook

This work is based on the llama.cpp project created by Georgi Gerganov. We adapted the original C++ program to run on Wasm. It works with model files in the GGUF format.

Step 1. Install WasmEdge and the GGML plugin

Use the following command on a Linux or Mac device to install everything.

curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugin wasi_nn-ggml

Step 2. Download the prebuilt Wasm app and a model

curl -LO https://github.com/second-state/llama-utils/raw/main/chat/llama-chat.wasm

You also need a llama2 model in GGUF format. The example below downloads a llama2 7B chat model quantized to 5-bit weights.

curl -LO https://huggingface.co/wasmedge/llama2/resolve/main/llama-2-7b-chat-q5_k_m.gguf

Step 3. Run it

Use WasmEdge to run the Wasm inference application and load the GGUF model at the same time. You can now type questions to chat with the model.

wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf llama-chat.wasm
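The value passed to --nn-preload in the command above has four colon-separated fields. Reading it alongside the options list and the Rust program in this article, these appear to be a model alias (default), an inference backend (GGML), an execution target (AUTO), and the model file path. The following is only a sketch of how such a value could be split; the PreloadSpec type and parse_preload function are hypothetical names made up for illustration and are not part of WasmEdge.

```rust
/// The four colon-separated fields of a `--nn-preload` value.
/// Hypothetical type for illustration; not WasmEdge's own parser.
#[derive(Debug, PartialEq)]
struct PreloadSpec {
    alias: String,   // name the Wasm app uses to look up the model
    backend: String, // inference backend, e.g. "GGML"
    target: String,  // execution target, e.g. "AUTO"
    path: String,    // path to the model file
}

fn parse_preload(spec: &str) -> Option<PreloadSpec> {
    // splitn(4, ..) leaves any further ':' characters inside the path field.
    let mut parts = spec.splitn(4, ':');
    let alias = parts.next()?;
    let backend = parts.next()?;
    let target = parts.next()?;
    let path = parts.next()?;
    if alias.is_empty() || path.is_empty() {
        return None;
    }
    Some(PreloadSpec {
        alias: alias.to_string(),
        backend: backend.to_string(),
        target: target.to_string(),
        path: path.to_string(),
    })
}

fn main() {
    let spec = parse_preload("default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf");
    println!("{:?}", spec);
}
```

The alias is the piece that ties the command line to the application: the app asks the runtime for a model by alias instead of opening the file itself.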

Configuring model behavior

You can use command-line options to configure how you interact with the model.

Options:
  -m, --model-alias <ALIAS>            Model alias [default: default]
  -c, --ctx-size <CTX_SIZE>            Size of the prompt context [default: 4096]
  -n, --n-predict <N_PRDICT>           Number of tokens to predict [default: 1024]
  -g, --n-gpu-layers <N_GPU_LAYERS>    Number of layers to run on the GPU [default: 100]
  -b, --batch-size <BATCH_SIZE>        Batch size for prompt processing [default: 4096]
  -r, --reverse-prompt <REVERSE_PROMPT>  Halt generation at PROMPT, return control.
  -s, --system-prompt <SYSTEM_PROMPT>  System prompt message string [default: "[Default system message for the prompt template]"]
  -p, --prompt-template <TEMPLATE>     Prompt template. [default: llama-2-chat] [possible values: llama-2-chat, codellama-instruct, mistral-instruct-v0.1, mistrallite, openchat, belle-llama-2-chat, vicuna-chat, chatml]
      --log-prompts                    Print prompt strings to stdout
      --log-stat                       Print statistics to stdout
      --log-all                        Print all log information to stdout
      --stream-stdout                  Print the output to stdout in the streaming way
  -h, --help                           Print help
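To make the defaults in the list above concrete, here is a minimal sketch of how a handful of these options could be parsed with only the Rust standard library. It covers just -c, -n, --log-stat and --stream-stdout, using the default values shown in the help text. ChatOptions, parse_options and parse_value are hypothetical names; the real llama-chat app may parse its arguments quite differently.

```rust
/// A small subset of the chat options, with the defaults from the help text.
/// Hypothetical type for illustration only.
#[derive(Debug, PartialEq)]
struct ChatOptions {
    ctx_size: u32,       // -c, --ctx-size
    n_predict: u32,      // -n, --n-predict
    log_stat: bool,      // --log-stat
    stream_stdout: bool, // --stream-stdout
}

impl Default for ChatOptions {
    fn default() -> Self {
        ChatOptions { ctx_size: 4096, n_predict: 1024, log_stat: false, stream_stdout: false }
    }
}

/// Parse the numeric value that must follow a flag such as `-c`.
fn parse_value(flag: &str, value: Option<&&str>) -> Result<u32, String> {
    let raw = value.ok_or_else(|| format!("missing value for {}", flag))?;
    raw.parse::<u32>()
        .map_err(|e| format!("invalid value for {}: {}", flag, e))
}

fn parse_options(args: &[&str]) -> Result<ChatOptions, String> {
    let mut opts = ChatOptions::default();
    let mut it = args.iter();
    while let Some(&arg) = it.next() {
        match arg {
            "-c" | "--ctx-size" => opts.ctx_size = parse_value(arg, it.next())?,
            "-n" | "--n-predict" => opts.n_predict = parse_value(arg, it.next())?,
            "--log-stat" => opts.log_stat = true,
            "--stream-stdout" => opts.stream_stdout = true,
            other => return Err(format!("option not handled in this sketch: {}", other)),
        }
    }
    Ok(opts)
}

fn main() {
    let opts = parse_options(&["-c", "2048", "-n", "512", "--log-stat", "--stream-stdout"]);
    println!("{:?}", opts);
}
```

Options that are omitted keep their defaults, so an empty argument list yields a 4096-token context and up to 1024 predicted tokens.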

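For example, the following command sets a context length of 2048 tokens and a maximum of 512 tokens per response. It also tells WasmEdge to stream the model's response to stdout token by token. The program generates about 25 tokens per second on a low-end M2 MacBook.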

wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf \
  llama-chat.wasm -c 2048 -n 512 --log-stat --stream-stdout

[USER]:
Who is the "father of the atomic bomb"?

---------------- [LOG: STATISTICS] -----------------
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 1024.00 MB
llama_new_context_with_model: compute buffer total size = 630.14 MB
llama_new_context_with_model: max tensor size = 102.54 MB
[2023-11-10 17:52:12.768] [info] [WASI-NN] GGML backend: llama_system_info: AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
The "father of the atomic bomb" is a term commonly associated with physicist J. Robert Oppenheimer. Oppenheimer was the director of the Manhattan Project, the secret research and development project that produced the atomic bomb during World War II. He is widely recognized as the leading figure in the development of the atomic bomb and is often referred to as the "father of the atomic bomb."
llama_print_timings: load time = 15643.70 ms
llama_print_timings: sample time = 2.60 ms / 83 runs ( 0.03 ms per token, 31886.29 tokens per second)
llama_print_timings: prompt eval time = 7836.72 ms / 54 tokens ( 145.12 ms per token, 6.89 tokens per second)
llama_print_timings: eval time = 3198.24 ms / 82 runs ( 39.00 ms per token, 25.64 tokens per second)
llama_print_timings: total time = 18852.93 ms
----------------------------------------------------
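The "about 25 tokens per second" figure comes from the eval time line of the log above: 82 generated tokens in 3198.24 ms. The short sketch below simply redoes that arithmetic; the function names are made up for illustration.

```rust
/// Throughput in tokens per second, given a token count and elapsed milliseconds.
fn tokens_per_second(tokens: u32, elapsed_ms: f64) -> f64 {
    tokens as f64 / (elapsed_ms / 1000.0)
}

/// Average latency per token in milliseconds.
fn ms_per_token(tokens: u32, elapsed_ms: f64) -> f64 {
    elapsed_ms / tokens as f64
}

fn main() {
    // Numbers copied from the `eval time` line of the M2 MacBook log.
    let (tokens, elapsed_ms) = (82, 3198.24);
    println!("{:.2} ms per token", ms_per_token(tokens, elapsed_ms));
    println!("{:.2} tokens per second", tokens_per_second(tokens, elapsed_ms));

    // Numbers copied from the `prompt eval time` line of the same log.
    println!("{:.2} prompt tokens per second", tokens_per_second(54, 7836.72));
}
```

Note that generation speed (eval time) and prompt processing speed (prompt eval time) are reported separately, and that the load time is not included in either.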

Here is an example of it running on an Nvidia A10G machine at about 50 tokens per second.

wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf \
  llama-chat.wasm -c 2048 -n 512 --log-stat

[USER]:
Who is the "father of the atomic bomb"?

---------------- [LOG: STATISTICS] -----------------
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 86.04 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 4474.93 MB
..................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 1024.00 MB
llama_new_context_with_model: kv self size = 1024.00 MB
llama_new_context_with_model: compute buffer total size = 630.14 MB
llama_new_context_with_model: VRAM scratch buffer: 624.02 MB
llama_new_context_with_model: total VRAM used: 6122.95 MB (model: 4474.93 MB, context: 1648.02 MB)
[2023-11-11 00:02:22.402] [info] [WASI-NN] GGML backend: llama_system_info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
llama_print_timings: load time = 2601.44 ms
llama_print_timings: sample time = 2.63 ms / 84 runs ( 0.03 ms per token, 31987.81 tokens per second)
llama_print_timings: prompt eval time = 203.90 ms / 54 tokens ( 3.78 ms per token, 264.84 tokens per second)
llama_print_timings: eval time = 1641.84 ms / 83 runs ( 19.78 ms per token, 50.55 tokens per second)
llama_print_timings: total time = 4254.95 ms
----------------------------------------------------

[ASSISTANT]:
The "father of the atomic bomb" is a term commonly associated with physicist J. Robert Oppenheimer. Oppenheimer was the director of the Manhattan Project, the secret research and development project that produced the first atomic bomb during World War II. He is widely recognized as the leading figure in the development of the atomic bomb and is often referred to as the "father of the atomic bomb."

Note: the original material is in English; any Chinese text in parentheses was added by the translator.
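The memory figures in the A10G log can be cross-checked with a little arithmetic. The sketch below reproduces the 1024.00 MB key/value cache size and the 6122.95 MB VRAM total. Two inputs are assumptions that do not appear in the article: an embedding width of 4096 for the llama2 7B model, and 2 bytes (16-bit floats) per cached value. The 32 layers and the 2048-token context are taken from the log and the command line.

```rust
/// Size of the key/value cache in MiB: one key and one value vector
/// per layer and per context position.
fn kv_cache_mib(n_layers: u64, n_ctx: u64, n_embd: u64, bytes_per_elem: u64) -> f64 {
    let bytes = 2 * n_layers * n_ctx * n_embd * bytes_per_elem;
    bytes as f64 / (1024.0 * 1024.0)
}

fn main() {
    // 32 layers and n_ctx = 2048 come from the log.
    // n_embd = 4096 and 2 bytes per element are assumptions (see lead-in).
    let kv = kv_cache_mib(32, 2048, 4096, 2);
    println!("kv cache: {:.2} MB", kv);

    // The remaining figures are copied from the log.
    let scratch = 624.02; // VRAM scratch buffer
    let model = 4474.93; // VRAM used by the model weights
    let context = kv + scratch;
    println!("context: {:.2} MB", context);
    println!("total VRAM: {:.2} MB", model + context);
}
```

Because the cache grows linearly with the context size, halving -c also halves this part of the memory footprint, which matters on small edge devices.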

LLM agents and apps

We have also built an OpenAI-compatible API service using Rust and WasmEdge. It lets you use any OpenAI-compatible development tool, such as flows.network, to create LLM agents and apps.
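"OpenAI-compatible" means that clients send requests in the same shape as OpenAI's chat completions API: a JSON object with a model name and a list of role/content messages. As a rough illustration, the sketch below builds such a request body using only the Rust standard library. The model name is a placeholder, the helper names are hypothetical, and nothing here is taken from the service's own code; how and where the request is sent is left out because the article does not describe it.

```rust
/// Escape a string so that it can be embedded in a JSON string literal.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Any other control character becomes a \u00XX escape.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Build a chat completions request body from (role, content) pairs.
fn chat_request_body(model: &str, messages: &[(&str, &str)]) -> String {
    let items: Vec<String> = messages
        .iter()
        .map(|(role, content)| {
            format!(
                "{{\"role\":\"{}\",\"content\":\"{}\"}}",
                escape_json(role),
                escape_json(content)
            )
        })
        .collect();
    format!(
        "{{\"model\":\"{}\",\"messages\":[{}]}}",
        escape_json(model),
        items.join(",")
    )
}

fn main() {
    // "llama-2-7b-chat" is a placeholder model name for this sketch.
    let body = chat_request_body(
        "llama-2-7b-chat",
        &[
            ("system", "You are a helpful assistant."),
            ("user", "Who is the \"father of the atomic bomb\"?"),
        ],
    );
    println!("{}", body);
}
```

In a real application a JSON library would normally replace the hand-written escaping; it is spelled out here only to keep the example free of dependencies.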

Llama on the edge. Image generated by Midjourney.

Why not Python?

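LLMs (large language models) such as llama2 are typically trained in Python (e.g., PyTorch, Tensorflow, and JAX). But Python is a poor fit for inference applications, which account for about 95% of the computing in AI.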

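Python packages have complex dependencies and are hard to set up and use.
Python dependencies are huge. A Docker image for Python or PyTorch is typically several GB or even tens of GB. This is especially problematic for AI inference on edge servers or devices.
Python is a very slow language, up to 35,000x slower than compiled languages such as C, C++, and Rust.
Because Python is slow, most of the actual work must be delegated to native shared libraries beneath the Python wrappers. That makes Python inference apps great for demos but very hard to modify to meet business requirements.
The heavy reliance on native libraries, together with complex dependency management, makes Python AI programs hard to port across devices while still taking advantage of each device's unique hardware features.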

Commonly used Python packages in the LLM toolchain conflict with each other

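Chris Lattner, the well-known software engineer behind LLVM, Tensorflow, and the Swift language, gave a great interview on the This Week in Startup podcast. He discussed why Python is great for model training but not a good choice for inference applications.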

Advantages of Rust + Wasm

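The Rust + Wasm stack provides a unified cloud computing infrastructure that spans devices, the edge cloud, on-premises servers, and the public cloud. It is a strong alternative to the Python stack for AI inference applications. Elon Musk has also described Rust as the ideal language for AGI (artificial general intelligence).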

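Ultra lightweight. The inference application is only 2 MB with all dependencies included, less than 1% of the size of a typical PyTorch container.
Very fast. Every part of the inference application, including preprocessing, tensor computation, and postprocessing, runs at native C/Rust speed.
Portable. The same Wasm bytecode application runs on all major computing platforms, with support for heterogeneous hardware acceleration.
Easy to set up, develop, and deploy. There are no more complex dependencies. Build a single Wasm file with standard tools on your laptop and deploy it everywhere.
Safe and cloud-ready. The Wasm runtime is designed to isolate untrusted user code. It can be managed by container tools and easily deployed on cloud-native platforms.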

The Rust inference program

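The core of our demo inference program is written in Rust. Its source code is very compact, only 40 lines, and it is compiled to Wasm. The Rust program manages the user input, tracks the conversation history, transforms the text into llama2's chat template, and runs the inference operations through the WASI NN API.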

use std::env;

// `read_input` is a helper that reads one line from stdin; it is not shown in this excerpt.
fn main() {
    let args: Vec<String> = env::args().collect();
    let model_name: &str = &args[1];

    // Load the model that was preloaded by WasmEdge under the given alias.
    let graph =
        wasi_nn::GraphBuilder::new(wasi_nn::GraphEncoding::Ggml, wasi_nn::ExecutionTarget::AUTO)
            .build_from_cache(model_name)
            .unwrap();
    let mut context = graph.init_execution_context().unwrap();

    let system_prompt = String::from("<<SYS>>You are a helpful, respectful and honest assistant. Always answer as short as possible, while being safe. <</SYS>>");
    let mut saved_prompt = String::new();

    loop {
        println!("Question:");
        let input = read_input();
        if saved_prompt == "" {
            saved_prompt = format!("[INST] {} {} [/INST]", system_prompt, input.trim());
        } else {
            saved_prompt = format!("{} [INST] {} [/INST]", saved_prompt, input.trim());
        }

        // Convert the prompt string to bytes and set it as the model's input tensor.
        let tensor_data = saved_prompt.as_bytes().to_vec();
        context
            .set_input(0, wasi_nn::TensorType::U8, &[1], &tensor_data)
            .unwrap();

        // Execute the inference.
        context.compute().unwrap();

        // Retrieve the output.
        let mut output_buffer = vec![0u8; 1000];
        let output_size = context.get_output(0, &mut output_buffer).unwrap();
        let output = String::from_utf8_lossy(&output_buffer[..output_size]).to_string();
        println!("Answer:\n{}", output.trim());

        // Append the answer so that the next turn sees the whole conversation.
        saved_prompt = format!("{} {} ", saved_prompt, output.trim());
    }
}
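The conversation tracking in the program above comes down to string concatenation in the llama2 chat template. To show what the accumulated prompt looks like after a couple of turns, the sketch below pulls the two format! calls out into standalone functions that run without a model. The function names are hypothetical, and the system prompt is shortened for readability.

```rust
/// Append a user turn to the saved prompt, mirroring the `if`/`else` in the loop.
fn add_user_turn(saved_prompt: &str, system_prompt: &str, input: &str) -> String {
    if saved_prompt.is_empty() {
        // The first turn carries the system prompt inside the [INST] block.
        format!("[INST] {} {} [/INST]", system_prompt, input.trim())
    } else {
        format!("{} [INST] {} [/INST]", saved_prompt, input.trim())
    }
}

/// Append the model's answer, mirroring the last line of the loop.
fn add_answer(saved_prompt: &str, output: &str) -> String {
    format!("{} {} ", saved_prompt, output.trim())
}

fn main() {
    // A shortened system prompt, for illustration only.
    let system = "<<SYS>>Be brief.<</SYS>>";

    let prompt = add_user_turn("", system, "Hi\n");
    println!("turn 1: {:?}", prompt);

    // Pretend the model answered "Hello!".
    let prompt = add_answer(&prompt, " Hello! ");
    let prompt = add_user_turn(&prompt, system, "Who are you?\n");
    println!("turn 2: {:?}", prompt);
}
```

Every turn resends the entire history, so the prompt grows with the conversation until it reaches the context size set with -c.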

To build the application yourself, just install the Rust compiler and its wasm32-wasi compiler target.

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup target add wasm32-wasi

Then check out the source project and run the cargo command to build the Wasm file from the Rust source.

# Clone the source project
git clone https://github.com/second-state/llama-utils
cd llama-utils/chat/
# Build
cargo build --target wasm32-wasi --release
# The resulting wasm file
cp target/wasm32-wasi/release/llama-chat.wasm .

Running in the cloud or on the edge

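Once you have the Wasm bytecode file, you can deploy it on any device that supports the WasmEdge runtime. You only need WasmEdge installed with the GGML plugin. The GGML plugins we currently provide support several operating systems, including generic Linux and Ubuntu Linux, on x86 and ARM CPUs, Nvidia GPUs, and Apple M1/M2/M3.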

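Because it is based on llama.cpp, the WasmEdge GGML plugin automatically takes advantage of any hardware acceleration on the device to run llama2 models. For example, if your device has an Nvidia GPU, the installer automatically installs a CUDA-optimized version of the GGML plugin. For Mac devices, we built a GGML plugin specifically for macOS that uses the Metal API to run inference workloads on the GPU built into M1/M2/M3 chips. The Linux CPU build of the GGML plugin uses the OpenBLAS library to automatically detect and use advanced computing features on modern CPUs, such as AVX and SIMD.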
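The llama_system_info line that appears in the logs earlier in this article shows which of these hardware features the backend detected on a given machine. As a small illustration, the sketch below extracts the enabled features from such a line. The enabled_features function is a hypothetical helper written for this article, not part of WasmEdge or llama.cpp.

```rust
/// Return the names of all features reported as `= 1` in a
/// `llama_system_info` line such as "AVX = 0 | NEON = 1 | ...".
fn enabled_features(system_info: &str) -> Vec<String> {
    system_info
        .split('|')
        .filter_map(|field| {
            // Fields without '=' (such as the empty tail after the last '|') are skipped.
            let (name, value) = field.split_once('=')?;
            if value.trim() == "1" {
                Some(name.trim().to_string())
            } else {
                None
            }
        })
        .collect()
}

fn main() {
    // Copied from the M2 MacBook log earlier in the article.
    let m2 = "AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | ";
    println!("{:?}", enabled_features(m2));
}
```

Running the same function on the A10G log line would list the x86 features (AVX, AVX2, FMA, F16C, BLAS, SSE3, SSSE3) instead, even though the Wasm file being executed is identical on both machines.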

With Rust and Wasm, AI models can be made portable across heterogeneous hardware and platforms without sacrificing performance.