
llamafile: Run Large Models with One Click

References

Model Introduction

llamafile model collection (one file is all you need to run a model). llamafile is a way of deploying (or, more precisely, running) large AI models. What sets it apart from other approaches is that it packages the model and its runtime environment into a single self-contained executable, which greatly simplifies deployment: you just download and run that one file, with no runtime environment or dependency libraries to install, making large language models far more convenient to use. This lowers the barrier to entry, and the same file runs on macOS, Windows, Linux, FreeBSD, OpenBSD, and NetBSD, so more people can easily deploy and use large language models.

Deployment

Windows

Note: Windows does not allow a single .exe file larger than 4 GB, so for models above 4 GB you need to download llamafile and the GGUF model separately and run them together. Alternatively, you can run models under WSL (Windows Subsystem for Linux), which also bypasses the 4 GB limit.

Models larger than 4 GB
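
For models above 4 GB, follow the note above: download the llamafile runtime and the GGUF model as two separate files and point the runtime at the model with -m. A minimal sketch; it assumes the llamafile release binary has been downloaded and renamed to llamafile.exe, and the GGUF filename is illustrative only:

    REM Run from cmd; pass the separately downloaded GGUF model with -m
    .\llamafile.exe -m .\Qwen-7B-Chat-q4_0.gguf

Then open http://127.0.0.1:8080 in a browser, the same as for smaller models.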

Models smaller than 4 GB

  • Rename the downloaded model file by adding an .exe suffix, e.g. Yi-6B-Chat-q4.exe
  • Run .\Yi-6B-Chat-q4.exe from cmd
  • Open http://127.0.0.1:8080 in a browser to start chatting

Linux and macOS

  • Download the model file

  • Run it in a terminal

    # Make the file executable
    chmod +x ./Yi-6B-Chat-q4.llamafile
    # Run it
    ./Yi-6B-Chat-q4.llamafile
    
  • Open http://127.0.0.1:8080 in a browser to start chatting

Note: macOS may ask for authorization; go to [Settings] → [Privacy & Security] and click [Open Anyway].
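
If you prefer to check the server from the command line instead of the browser, the built-in server also exposes an OpenAI-compatible API (documented in the manual below). A minimal sketch; the model name in the request is just a label for the local server:

    # Optional sanity check from another terminal
    curl -s http://127.0.0.1:8080/v1/chat/completions \
         -H "Content-Type: application/json" \
         -d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Hello"}]}'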

Model List

| Model | Size | Download | Example | Notes |
| Qwen-7B-Chat | 4.23 GB | Qwen-7B-Chat-q4_0.llamafile | Example | Chinese/English chat |
| Qwen-14B-Chat | 7.65 GB / 14.06 GB | Qwen-14B-Chat-q4_0.llamafile / Qwen-14B-Chat-q8_0.llamafile | - | Chinese/English chat |
| Qwen1.5-7B-Chat | 5.18 GB | qwen1_5-7b-chat-q5_k_m.llamafile | - | Chinese/English chat |
| Qwen1.5-14B-Chat | 9.84 GB | qwen1_5-14b-chat-q5_k_m.llamafile | - | Chinese/English chat |
| Qwen1.5-0.5B-Chat | 420.57 MB | qwen1_5-0_5b-chat-q4_k_m.llamafile | - | Chinese/English chat |
| Qwen1.5-1.8B-Chat | 1.17 GB | qwen1_5-1_8b-chat-q4_k_m.llamafile | - | Chinese/English chat |
| Baichuan-13B-Chat | 7.06 GB | Baichuan-13B-Chat-q4_0.llamafile | - | Chinese/English chat |
| OrionStar-Yi-34B-Chat | 19.27 GB | OrionStar-Yi-34B-Chat-q4_0.llamafile | - | Chinese/English chat |
| CodeLlama-7b-Instruct | 3.59 GB | CodeLlama-7b-Instruct-q4_0.llamafile | Example | Chinese/English chat, good at writing code |
| CodeFuse-QWen-14B | 7.65 GB | CodeFuse-QWen-14B-q4_0.llamafile | - | Chinese/English chat, good at writing code |
| Yi-6B-Chat (link broken) | 3.45 GB | Yi-6B-Chat-q4.llamafile | Example | Chinese/English chat |
| LLaVA-1.5-7B | 3.99 GB | llava-v1.5-7b-q4.llamafile | Example | English visual multimodal |
| Yi-VL-6B | 3.61 GB | Yi-VL-6B-q4_0.llamafile | - | Chinese/English visual multimodal |
| Phi-2 | 2.78 GB | phi-2.Q8_0.llamafile | - | English chat, from Microsoft |
| TinyLlama-1.1B-Chat-v1.0 | 1.12 GB | TinyLlama-1.1B-Chat-v1.0.Q8_0.llamafile | - | English chat |
| Meta-Llama-3-8B-Instruct | 4.95 GB | Meta-Llama-3-8B-Instruct-Q4_K_M.llamafile | - | English chat |

Optional Parameters

  • -ngl 999: how many of the model's layers to offload to the GPU, with the rest running on the CPU. If there is no GPU, set -ngl 0. The default is 999, i.e. all layers run on the GPU (requires the GPU driver and CUDA runtime to be installed).
  • --host 0.0.0.0: hostname the web service listens on. If you only need local access, set --host 127.0.0.1. The default is 0.0.0.0, i.e. the service is reachable over the network via the machine's IP.
  • --port 8080: port of the web service. The default is 8080 and can be changed with this parameter.
  • -t 16: number of threads. When running on the CPU, set this according to the number of CPU cores (a combined example follows below).
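
Putting these flags together, a typical launch might look like the following sketch (using the model file from the deployment example above):

    # CPU only, 16 threads, local access only, custom port
    ./Yi-6B-Chat-q4.llamafile -ngl 0 -t 16 --host 127.0.0.1 --port 8081

    # Offload all layers to the GPU (driver + CUDA required) and allow LAN access
    ./Yi-6B-Chat-q4.llamafile -ngl 999 --host 0.0.0.0 --port 8080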

Integration with ChatGPT-Next-Web

ChatGPT-Next-Web: deploy your own private ChatGPT web app for free with one click
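
Because the llamafile server exposes an OpenAI-compatible completions endpoint (see the manual below), ChatGPT-Next-Web can be pointed at it as if it were the OpenAI API. The sketch below assumes Docker, plus the image name and environment variables (BASE_URL, OPENAI_API_KEY) described in the ChatGPT-Next-Web README; double-check them against that project before use:

    # 1. Start llamafile in server mode so the container can reach it
    ./Yi-6B-Chat-q4.llamafile --server --host 0.0.0.0 --port 8080

    # 2. Start ChatGPT-Next-Web pointed at the llamafile endpoint.
    #    host.docker.internal works with Docker Desktop; on Linux use the host's IP.
    #    The API key is a placeholder; llamafile does not require a real one by default.
    docker run -d -p 3000:3000 \
      -e BASE_URL=http://host.docker.internal:8080 \
      -e OPENAI_API_KEY=sk-placeholder \
      yidadaa/chatgpt-next-web

Then open http://localhost:3000 and chat through the web UI.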

Official Manual

LLAMAFILE(1) BSD General Commands Manual LLAMAFILE(1)

NAME
 llamafile — large language model runner

SYNOPSIS

 llamafile [--server] [flags...] -m model.gguf [--mmproj vision.gguf]
 llamafile [--cli] [flags...] -m model.gguf -p prompt
 llamafile [--cli] [flags...] -m model.gguf --mmproj vision.gguf --image
           graphic.png -p prompt

DESCRIPTION

 llamafile is a large language model tool. It has use cases such as:

 -   Code completion
 -   Prose composition
 -   Chatbot that passes the Turing test
 -   Text/image summarization and analysis

OPTIONS

 The following options are available:

 --version
         Print version and exit.

 -h, --help
         Show help message and exit.

 --cli   Puts program in command line interface mode. This flag is implied
         when a prompt is supplied using either the -p or -f flags.

 --server
         Puts program in server mode. This will launch an HTTP server on a
         local port. This server has both a web UI and an OpenAI API
         compatible completions endpoint. When the server is run on a
         desktop system, a browser tab will be launched automatically that
         displays the web UI.  This --server flag is implied if no prompt
         is specified, i.e. neither the -p nor -f flags are passed.

 -m FNAME, --model FNAME
         Model path in the GGUF file format.

         Default: models/7B/ggml-model-f16.gguf

 --mmproj FNAME
         Specifies path of the LLaVA vision model in the GGUF file format.
         If this flag is supplied, then the --model and --image flags
         should also be supplied.

 -s SEED, --seed SEED
         Random Number Generator (RNG) seed. A random seed is used if this
         is less than zero.

         Default: -1

 -t N, --threads N
         Number of threads to use during generation.

         Default: $(nproc)/2

 -tb N, --threads-batch N
         Set the number of threads to use during batch and prompt process‐
         ing. In some systems, it is beneficial to use a higher number of
         threads during batch processing than during generation. If not
         specified, the number of threads used for batch processing will
         be the same as the number of threads used for generation.

         Default: Same as --threads

 -td N, --threads-draft N
         Number of threads to use during generation for the draft model
         (speculative decoding).

         Default: Same as --threads

 -tbd N, --threads-batch-draft N
         Number of threads to use during batch and prompt processing for
         the draft model (speculative decoding).

         Default: Same as --threads-draft

 --in-prefix-bos
         Prefix BOS to user inputs, preceding the --in-prefix string.

 --in-prefix STRING
         This flag is used to add a prefix to your input; primarily, this
         is used to insert a space after the reverse prompt. Here's an ex‐
         ample of how to use the --in-prefix flag in conjunction with the
         --reverse-prompt flag:

               ./main -r "User:" --in-prefix " "

         Default: empty

 --in-suffix STRING
         This flag is used to add a suffix after your input. This is use‐
         ful for adding an "Assistant:" prompt after the user's input.
         It's added after the new-line character (\n) that's automatically
         added to the end of the user's input. Here's an example of how to
         use the --in-suffix flag in conjunction with the --reverse-prompt
         flag:

               ./main -r "User:" --in-prefix " " --in-suffix "Assistant:"

         Default: empty

 -n N, --n-predict N
         Number of tokens to predict.

         -   -1 = infinity
         -   -2 = until context filled

         Default: -1

 -c N, --ctx-size N
         Set the size of the prompt context. A larger context size helps
         the model to better comprehend and generate responses for longer
         input or conversations. The LLaMA models were built with a con‐
         text of 2048, which yields the best results on longer input / in‐
         ference.

         -   0 = loaded automatically from model

         Default: 512

 -b N, --batch-size N
         Batch size for prompt processing.

         Default: 512

 --top-k N
         Top-k sampling.

         -   0 = disabled

         Default: 40

 --top-p N
         Top-p sampling.

         -   1.0 = disabled

         Default: 0.9

 --min-p N
         Min-p sampling.

         -   0.0 = disabled

         Default: 0.1

 --tfs N
         Tail free sampling, parameter z.

         -   1.0 = disabled

         Default: 1.0

 --typical N
         Locally typical sampling, parameter p.

         -   1.0 = disabled

         Default: 1.0

 --repeat-last-n N
         Last n tokens to consider for penalize.

         -   0 = disabled
         -   -1 = ctx_size

         Default: 64

 --repeat-penalty N
         Penalize repeat sequence of tokens.

         -   1.0 = disabled

         Default: 1.1

 --presence-penalty N
         Repeat alpha presence penalty.

         -   0.0 = disabled

         Default: 0.0

 --frequency-penalty N
         Repeat alpha frequency penalty.

         -   0.0 = disabled

         Default: 0.0

 --mirostat N
         Use Mirostat sampling. Top K, Nucleus, Tail Free and Locally
         Typical samplers are ignored if used.

         -   0 = disabled
         -   1 = Mirostat
         -   2 = Mirostat 2.0

         Default: 0

 --mirostat-lr N
         Mirostat learning rate, parameter eta.

         Default: 0.1

 --mirostat-ent N
         Mirostat target entropy, parameter tau.

         Default: 5.0

 -l TOKEN_ID(+/-)BIAS, --logit-bias TOKEN_ID(+/-)BIAS
         Modifies the likelihood of token appearing in the completion,
         i.e.  --logit-bias 15043+1 to increase likelihood of token
         ' Hello', or --logit-bias 15043-1 to decrease likelihood of token
         ' Hello'.

 -md FNAME, --model-draft FNAME
         Draft model for speculative decoding.

         Default: models/7B/ggml-model-f16.gguf

 --cfg-negative-prompt PROMPT
         Negative prompt to use for guidance.

         Default: empty

 --cfg-negative-prompt-file FNAME
         Negative prompt file to use for guidance.

         Default: empty

 --cfg-scale N
         Strength of guidance.

         -   1.0 = disable

         Default: 1.0

 --rope-scaling {none,linear,yarn}
         RoPE frequency scaling method, defaults to linear unless speci‐
         fied by the model.

 --rope-scale N
         RoPE context scaling factor, expands context by a factor of N
         where N is the linear scaling factor used by the fine-tuned
         model. Some fine-tuned models have extended the context length by
         scaling RoPE. For example, if the original pre-trained model has
         a context length (max sequence length) of 4096 (4k) and the fine-
         tuned model has 32k, that is a scaling factor of 8, and it should
         work by setting the above --ctx-size to 32768 (32k) and
         --rope-scale to 8.
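
         For instance, the 4k-to-32k case above corresponds to the
         following sketch (the model filename is illustrative):

               llamafile -m model-32k.gguf -c 32768 --rope-scale 8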

 --rope-freq-base N
         RoPE base frequency, used by NTK-aware scaling.

         Default: loaded from model

 --rope-freq-scale N
         RoPE frequency scaling factor, expands context by a factor of 1/N

 --yarn-orig-ctx N
         YaRN: original context size of model.

         Default: 0 = model training context size

 --yarn-ext-factor N
         YaRN: extrapolation mix factor.

         -   0.0 = full interpolation

         Default: 1.0

 --yarn-attn-factor N
         YaRN: scale sqrt(t) or attention magnitude.

         Default: 1.0

 --yarn-beta-slow N
         YaRN: high correction dim or alpha.

         Default: 1.0

 --yarn-beta-fast N
         YaRN: low correction dim or beta.

         Default: 32.0

 --ignore-eos
         Ignore end of stream token and continue generating (implies
         --logit-bias 2-inf)

 --no-penalize-nl
         Do not penalize newline token.

 --temp N
         Temperature.

         Default: 0.8

 --logits-all
         Return logits for all tokens in the batch.

         Default: disabled

 --hellaswag
         Compute HellaSwag score over random tasks from datafile supplied
         with -f

 --hellaswag-tasks N
         Number of tasks to use when computing the HellaSwag score.

         Default: 400

 --keep N
         This flag allows users to retain the original prompt when the
         model runs out of context, ensuring a connection to the initial
         instruction or conversation topic is maintained, where N is the
         number of tokens from the initial prompt to retain when the model
         resets its internal context.

         -   0 = no tokens are kept from initial prompt
         -   -1 = retain all tokens from initial prompt

         Default: 0

 --draft N
         Number of tokens to draft for speculative decoding.

         Default: 16

 --chunks N
         Max number of chunks to process.

         -   -1 = all

         Default: -1

 -ns N, --sequences N
         Number of sequences to decode.

         Default: 1

 -pa N, --p-accept N
         speculative decoding accept probability.

         Default: 0.5

 -ps N, --p-split N
         Speculative decoding split probability.

         Default: 0.1

 --mlock
         Force system to keep model in RAM rather than swapping or com‐
         pressing.

 --no-mmap
         Do not memory-map model (slower load but may reduce pageouts if
         not using mlock).

 --numa  Attempt optimizations that help on some NUMA systems. If the
         model was previously run without this flag, it is recommended to
         drop the system page cache before using it. See
         https://github.com/ggerganov/llama.cpp/issues/1437.

 --recompile
         Force GPU support to be recompiled at runtime if possible.

 --nocompile
         Never compile GPU support at runtime.

         If the appropriate DSO file already exists under ~/.llamafile/
         then it'll be linked as-is without question. If a prebuilt DSO is
         present in the PKZIP content of the executable, then it'll be ex‐
         tracted and linked if possible. Otherwise, llamafile will skip
         any attempt to compile GPU support and simply fall back to using
         CPU inference.

 --gpu GPU
         Specifies which brand of GPU should be used. Valid choices are:

         -   AUTO: Use any GPU if possible, otherwise fall back to CPU in‐
             ference (default)

         -   APPLE: Use Apple Metal GPU. This is only available on MacOS
             ARM64. If Metal could not be used for any reason, then a fa‐
             tal error will be raised.

         -   AMD: Use AMD GPUs. The AMD HIP ROCm SDK should be installed
             in which case we assume the HIP_PATH environment variable has
             been defined. The set of gfx microarchitectures needed to run
             on the host machine is determined automatically based on the
             output of the hipInfo command. On Windows, llamafile release
             binaries are distributed with a tinyBLAS DLL so it'll work
             out of the box without requiring the HIP SDK to be installed.
             However, tinyBLAS is slower than rocBLAS for batch and image
             processing, so it's recommended that the SDK be installed
             anyway. If an AMD GPU could not be used for any reason, then
             a fatal error will be raised.

         -   NVIDIA: Use NVIDIA GPUs. If an NVIDIA GPU could not be used
             for any reason, a fatal error will be raised. On Windows,
             NVIDIA GPU support will use our tinyBLAS library, since it
             works on stock Windows installs. However, tinyBLAS goes
             slower for batch and image processing. It's possible to use
             NVIDIA's closed-source cuBLAS library instead. To do that,
             both MSVC and CUDA need to be installed and the llamafile
             command should be run once from the x64 MSVC command prompt
             with the --recompile flag passed. The GGML library will then
             be compiled and saved to ~/.llamafile/ so the special process
             only needs to happen a single time.

         -   DISABLE: Never use GPU and instead use CPU inference. This
             setting is implied by -ngl 0.
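
         As a sketch of the NVIDIA cuBLAS recompile flow described above
         (run once from the x64 MSVC command prompt with CUDA installed;
         the model filename is illustrative):

               llamafile --recompile --gpu NVIDIA -ngl 35 -m model.gguf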

 -ngl N, --n-gpu-layers N
         Number of layers to store in VRAM.

 -ngld N, --n-gpu-layers-draft N
         Number of layers to store in VRAM for the draft model.

 -sm SPLIT_MODE, --split-mode SPLIT_MODE
         How to split the model across multiple GPUs, one of:
         -   none: use one GPU only
         -   layer (default): split layers and KV across GPUs
         -   row: split rows across GPUs

 -ts SPLIT, --tensor-split SPLIT
         When using multiple GPUs this option controls how large tensors
         should be split across all GPUs.  SPLIT is a comma-separated list
         of non-negative values that assigns the proportion of data that
         each GPU should get in order. For example, "3,2" will assign 60%
         of the data to GPU 0 and 40% to GPU 1. By default the data is
         split in proportion to VRAM but this may not be optimal for
         performance. Requires cuBLAS.

 -mg i, --main-gpu i
         The GPU to use for scratch and small tensors.

 -nommq, --no-mul-mat-q
         Use cuBLAS instead of custom mul_mat_q CUDA kernels. Not recom‐
         mended since this is both slower and uses more VRAM.

 --verbose-prompt
         Print prompt before generation.

 --simple-io
         Use basic IO for better compatibility in subprocesses and limited
         consoles.

 --lora FNAME
         Apply LoRA adapter (implies --no-mmap)

 --lora-scaled FNAME S
         Apply LoRA adapter with user defined scaling S (implies
         --no-mmap)

 --lora-base FNAME
         Optional model to use as a base for the layers modified by the
         LoRA adapter

 --unsecure
         Disables pledge() sandboxing on Linux and OpenBSD.

 --samplers
         Samplers that will be used for generation in the order, separated
         by semicolon, for example: top_k;tfs;typical;top_p;min_p;temp

 --samplers-seq
         Simplified sequence for samplers that will be used.

 -cml, --chatml
         Run in chatml mode (use with ChatML-compatible models)

 -dkvc, --dump-kv-cache
         Verbose print of the KV cache.

 -nkvo, --no-kv-offload
         Disable KV offload.

 -ctk TYPE, --cache-type-k TYPE
         KV cache data type for K.

 -ctv TYPE, --cache-type-v TYPE
         KV cache data type for V.

 -gan N, --grp-attn-n N
         Group-attention factor.

         Default: 1

 -gaw N, --grp-attn-w N
         Group-attention width.

         Default: 512

 -bf FNAME, --binary-file FNAME
         Binary file containing multiple choice tasks.

 --winogrande
         Compute Winogrande score over random tasks from datafile supplied
         by the -f flag.

 --winogrande-tasks N
         Number of tasks to use when computing the Winogrande score.

         Default: 0

 --multiple-choice
         Compute multiple choice score over random tasks from datafile
         supplied by the -f flag.

 --multiple-choice-tasks N
         Number of tasks to use when computing the multiple choice score.

         Default: 0

 --kl-divergence
         Computes KL-divergence to logits provided via the
         --kl-divergence-base flag.

 --save-all-logits FNAME, --kl-divergence-base FNAME
         Save logits to filename.

 -ptc N, --print-token-count N
         Print token count every N tokens.

         Default: -1

CLI OPTIONS

 The following options may be specified when llamafile is running in --cli
 mode.

 -e, --escape
         Process prompt escape sequences (\n, \r, \t, \', \", \\)

 -p STRING, --prompt STRING
         Prompt to start text generation. Your LLM works by auto-complet‐
         ing this text. For example:

               llamafile -m model.gguf -p "four score and"

         Stands a pretty good chance of printing Lincoln's Gettysburg Ad‐
         dress.  Prompts can take on a structured format too. Depending on
         how your model was trained, it may specify in its docs an in‐
         struction notation. With some models that might be:

               llamafile -p "[INST]Summarize this: $(cat file)[/INST]"

         In most cases, simply colons and newlines will work too:

               llamafile -e -p "User: What is best in life?\nAssistant:"

 -f FNAME, --file FNAME
         Prompt file to start generation.

 --grammar GRAMMAR
         BNF-like grammar to constrain which tokens may be selected when
         generating text. For example, the grammar:

               root ::= "yes" | "no"

         will force the LLM to only output yes or no before exiting. This
         is useful for shell scripts when the --no-display-prompt flag is
         also supplied.

 --grammar-file FNAME
         File to read grammar from.

 --prompt-cache FNAME
         File to cache prompt state for faster startup.

         Default: none

 --prompt-cache-all
         If specified, saves user input and generations to cache as well.
         Not supported with --interactive or other interactive options.

 --prompt-cache-ro
         If specified, uses the prompt cache but does not update it.

 --random-prompt
         Start with a randomized prompt.

 --image IMAGE_FILE
         Path to an image file. This should be used with multimodal mod‐
         els.  Alternatively, it's possible to embed an image directly
         into the prompt instead; in which case, it must be base64 encoded
         into an HTML img tag URL with the image/jpeg MIME type. See also
         the --mmproj flag for supplying the vision model.

 -i, --interactive
         Run the program in interactive mode, allowing users to engage in
         real-time conversations or provide specific instructions to the
         model.

 --interactive-first
         Run the program in interactive mode and immediately wait for user
         input before starting the text generation.

 -ins, --instruct
         Run the program in instruction mode, which is specifically de‐
         signed to work with Alpaca models that excel in completing tasks
         based on user instructions.

         Technical details: The user's input is internally prefixed with
         the reverse prompt (or "### Instruction:" as the default), and
         followed by "### Response:" (except if you just press Return
         without any input, to keep generating a longer response).

         By understanding and utilizing these interaction options, you can
         create engaging and dynamic experiences with the LLaMA models,
         tailoring the text generation process to your specific needs.

 -r PROMPT, --reverse-prompt PROMPT
         Specify one or multiple reverse prompts to pause text generation
         and switch to interactive mode. For example, -r "User:" can be
         used to jump back into the conversation whenever it's the user's
         turn to speak. This helps create a more interactive and conversa‐
         tional experience. However, the reverse prompt doesn't work when
         it ends with a space. To overcome this limitation, you can use
         the --in-prefix flag to add a space or any other characters after
         the reverse prompt.

 --color
         Enable colorized output to visually distinguish between prompts,
         user input, and generated text.

 --no-display-prompt, --silent-prompt
         Don't echo the prompt itself to standard output.

 --multiline-input
         Allows you to write or paste multiple lines without ending each
         in '\'.

SERVER OPTIONS

 The following options may be specified when llamafile is running in
 --server mode.

 --port PORT
         Port to listen on.

         Default: 8080

 --host IPADDR
         IP address to listen on.

         Default: 127.0.0.1

 -to N, --timeout N
         Server read/write timeout in seconds.

         Default: 600

 -np N, --parallel N
         Number of slots for processing requests.

         Default: 1

 -cb, --cont-batching
         Enable continuous batching (a.k.a. dynamic batching).

         Default: disabled

 -spf FNAME, --system-prompt-file FNAME
         Set a file from which to load a system prompt (the initial prompt
         of all slots); this is useful for chat applications.

 -a ALIAS, --alias ALIAS
         Set an alias for the model. This will be added as the model field
         in completion responses.

 --path PUBLIC_PATH
         Path from which to serve static files.

         Default: /zip/llama.cpp/server/public

 --embedding
         Enable embedding vector output.

         Default: disabled

 --nobrowser
         Do not attempt to open a web browser tab at startup.

 -gan N, --grp-attn-n N
         Set the group attention factor to extend context size through
         self-extend. The default value is 1 which means disabled. This
         flag is used together with --grp-attn-w.

 -gaw N, --grp-attn-w N
         Set the group attention width to extend context size through
         self-extend. The default value is 512.  This flag is used to‐
         gether with --grp-attn-n.

LOG OPTIONS

 The following log options are available:

 -ld LOGDIR, --logdir LOGDIR
         Path under which to save YAML logs (no logging if unset)

 --log-test
         Run simple logging test

 --log-disable
         Disable trace logs

 --log-enable
         Enable trace logs

 --log-file
         Specify a log filename (without extension)

 --log-new
         Create a separate new log file on start. Each log file will have
         unique name: <name>.<ID>.log

 --log-append
         Don't truncate the old log file.

EXAMPLES

 Here's an example of how to run llama.cpp's built-in HTTP server. This
 example uses LLaVA v1.5-7B, a multimodal LLM that works with llama.cpp's
 recently-added support for image inputs.

       llamafile \
         -m llava-v1.5-7b-Q8_0.gguf \
         --mmproj llava-v1.5-7b-mmproj-Q8_0.gguf \
         --host 0.0.0.0

 Here's an example of how to generate code for a libc function using the
 llama.cpp command line interface, utilizing WizardCoder-Python-13B
 weights:

       llamafile \
         -m wizardcoder-python-13b-v1.0.Q8_0.gguf --temp 0 -r '}\n' -r '```\n' \
         -e -p '```c\nvoid *memcpy(void *dst, const void *src, size_t size) {\n'

 Here's a similar example that instead utilizes Mistral-7B-Instruct
 weights for prose composition:

       llamafile \
         -m mistral-7b-instruct-v0.2.Q5_K_M.gguf \
         -p '[INST]Write a story about llamas[/INST]'

 Here's an example of how llamafile can be used as an interactive chatbot
 that lets you query knowledge contained in training data:

       llamafile -m llama-65b-Q5_K.gguf -p '
       The following is a conversation between a Researcher and their helpful AI
       assistant Digital Athena which is a large language model trained on the
       sum of human knowledge.
       Researcher: Good morning.
       Digital Athena: How can I help you today?
       Researcher:' --interactive --color --batch_size 1024 --ctx_size 4096 \
       --keep -1 --temp 0 --mirostat 2 --in-prefix ' ' --interactive-first \
       --in-suffix 'Digital Athena:' --reverse-prompt 'Researcher:'

 Here's an example of how you can use llamafile to summarize HTML URLs:

       (
         echo '[INST]Summarize the following text:'
         links -codepage utf-8 \
               -force-html \
               -width 500 \
               -dump https://www.poetryfoundation.org/poems/48860/the-raven |
           sed 's/   */ /g'
         echo '[/INST]'
       ) | llamafile \
             -m mistral-7b-instruct-v0.2.Q5_K_M.gguf \
             -f /dev/stdin \
             -c 0 \
             --temp 0 \
             -n 500 \
             --no-display-prompt 2>/dev/null

 Here's how you can use llamafile to describe a jpg/png/gif/bmp image:

       llamafile --temp 0 \
         --image lemurs.jpg \
         -m llava-v1.5-7b-Q4_K.gguf \
         --mmproj llava-v1.5-7b-mmproj-Q4_0.gguf \
         -e -p '### User: What do you see?\n### Assistant: ' \
         --no-display-prompt 2>/dev/null

 If you wanted to write a script to rename all your image files, you could
 use the following command to generate a safe filename:

       llamafile --temp 0 \
           --image ~/Pictures/lemurs.jpg \
           -m llava-v1.5-7b-Q4_K.gguf \
           --mmproj llava-v1.5-7b-mmproj-Q4_0.gguf \
           --grammar 'root ::= [a-z]+ (" " [a-z]+)+' \
           -e -p '### User: The image has...\n### Assistant: ' \
           --no-display-prompt 2>/dev/null |
         sed -e's/ /_/g' -e's/$/.jpg/'
       three_baby_lemurs_on_the_back_of_an_adult_lemur.jpg

 Here's an example of how to make an API request to the OpenAI API compat‐
 ible completions endpoint when your llamafile is running in the back‐
 ground in --server mode.

       curl -s http://localhost:8080/v1/chat/completions \
            -H "Content-Type: application/json" -d '{
         "model": "gpt-3.5-turbo",
         "stream": true,
         "messages": [
           {
             "role": "system",
             "content": "You are a poetic assistant."
           },
           {
             "role": "user",
             "content": "Compose a poem that explains FORTRAN."
           }
         ]
       }' | python3 -c '
       import json
       import sys
       json.dump(json.load(sys.stdin), sys.stdout, indent=2)
        print()
        '

PROTIP

 The -ngl 35 flag needs to be passed in order to use GPUs made by NVIDIA
 and AMD.  It's not enabled by default since it sometimes needs to be
 tuned based on the system hardware and model architecture, in order to
 achieve optimal performance, and avoid compromising a shared display.
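
 For example, a server launch with GPU offload might look like this
 (model filename illustrative):

       llamafile --server -ngl 35 -m model.gguf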

SEE ALSO

 llamafile-quantize(1), llamafile-perplexity(1), llava-quantize(1),
 zipalign(1), unzip(1)

Mozilla Ocho January 1, 2024 Mozilla Ocho
