Simple Conversation Program¶
Program Identification¶
Author: Saul Emre Erdog
Maintainer: Saul Emre Erdog
Email: erdog@mbasesoftware.com
Name: mbase_simple_conversation
Version: v0.1.0
Type: Example, Utility
Network Usage: No
Lib Depends: mbase-std mbase-inference
Repo location: https://github.com/Emreerdog/mbase/tree/main/examples/simple-conversation
Synopsis¶
mbase_simple_conversation model_path *[option [value]]
mbase_simple_conversation model.gguf
mbase_simple_conversation model.gguf -gl 80
mbase_simple_conversation model.gguf -gl 80 -sys 'You are a helpful assistant.'
Description¶
This is a simple conversation program written to demonstrate what can be built with MBASE and to showcase its capabilities.
The program is a simple executable in which you hold a dialogue with the LLM you provide.
Using the options, you can adjust the sampling parameters and thread count, or supply a system prompt either from a file or from a string.
At the end of the program, it will print useful information about its performance, such as prompt processing (pp) and token generation (tg) rates, as well as the model load delay.
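For instance, assuming a local model file named model.gguf and a prompt file named system_prompt.txt (both names are placeholders), a conversation could be started with the options described below:

mbase_simple_conversation model.gguf -gl 80 -t 8 -c 8192 -fsys system_prompt.txt
mbase_simple_conversation model.gguf -gl 80 -sys 'You are a terse assistant.'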
Options¶
- --help¶
Print program information.
- -v, --version¶
Print program version.
- -sys prompt, --system prompt¶
LLM system prompt. If this option is given after -fsys, it will overwrite it. (default="")
- -fsys file, --system-prompt-file file¶
Text file that contains the LLM system prompt. If this option is given after -sys, it will overwrite it. (default="")
- -t count, --thread-count count¶
Number of threads to use for token generation. (default=16)
- -bt count, --batch-thread-count count¶
Number of threads to use for initial batch processing. (default=8)
- -c length, --context-length length¶
Total context length of the conversation, which includes the special tokens and the LLM's response. (default=8192)
- -b length, --batch-length length¶
The input is processed in batches in the processor's decode loop. This is the maximum batch length to be processed in a single iteration. (default=4096)
- -gl count, --gpu-layers count¶
Number of layers to offload to the GPU. Ignored if no GPU is present. (default=999)
- -tk k, --top-k k¶
Number of most probable tokens (top-k) to pick from during the sampling phase. (default=20, min=1, max=<model_vocabulary>)
- -tp p, --top-p p¶
Cumulative probability threshold (top-p) used during the sampling phase, with values in (0.0, 1.0]; the higher the 'p', the bigger the candidate pool. (default=1.0)
- -mp p, --min-p p¶
Minimum token probability required for a token to remain in the candidate pool during the sampling phase, with values in (0.0, 1.0]; the higher the 'p', the smaller the pool. (default=0.3)
- -pn n, --penalty-n n¶
Apply the repetition penalty to the last 'n' tokens. (default=64)
- -pr frequency, --penalty-repeat frequency¶
Discourages repeating tokens that have already appeared. The higher the 'frequency', the lower the repetition. (default=1.3, min=1.0, max=2.0)
- -temp n, --temperature n¶
Sampling temperature; higher values increase randomness. (default=0.1, min=0.01, max=1.4)
- -gr, --greedy¶
Ignore all sampling techniques and always pick the most probable token, i.e. greedy decoding. (default=false) Example invocations combining these options are shown below.
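As a sketch of how the sampling options compose, the invocations below tune the sampler or switch to greedy decoding; model.gguf is a placeholder model path and the parameter values are illustrative, not recommendations:

mbase_simple_conversation model.gguf -gl 80 -temp 0.7 -tk 40 -tp 0.9 -mp 0.05 -pn 64 -pr 1.1
mbase_simple_conversation model.gguf -gl 80 -gr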