Benchmark T2T

Program Identification

Synopsis

mbase_benchmark_t2t model_path *[option [value]]
mbase_benchmark_t2t model.gguf -uc 1 -fps 500 -jout .
mbase_benchmark_t2t model.gguf -uc 1 -fps 500 -jout . -mdout .

Description

This is a utility program for measuring the performance of a given T2T LLM.

The program runs inference based on the given context size, batch length, and n-predict values, for multiple users in parallel.

While the processors run in parallel (their number is specified by the -uc, --user-count option), the program keeps a main loop spinning idly at the rate specified by the -fps, --frame-per-second option and displays the measured FPS every second. With this, the user can examine the effect of the inference operation on the main application loop.
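
For example, a call such as the following (the model path is a placeholder) benchmarks four parallel users while capping the main loop at 500 FPS, so any FPS drop printed during the run reflects the load the inference puts on the loop:

mbase_benchmark_t2t model.gguf -uc 4 -fps 500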

At the end of the inference, it will display the following metrics along with the model and session information:

  • Total elapsed time in seconds.

  • Average FPS.

  • For each processor:

    • The time it took to initialize the context in milliseconds.

    • Prompt processing tokens per second (pp t/s).

    • Token generation tokens per second (tg t/s).

Formatted Output

Along with the terminal display of the benchmark output, the user can specify output directories through the -jout, --json-output-path and -mdout, --markdown-output-path options, in which case the program writes the files mbase_bench.json and mbase_bench.md to the given directories.

The next two sections show the benchmark results of the model “Phi 3 Mini 128k Instruct Q4_0” on the following hardware, with a single user (-uc 1):

CPU: Intel(R) Core(TM) i7-9750HF

GPU: NVIDIA GeForce RTX 2060

Program call:

.\mbase_benchmark_t2t.exe path_to_model -jout . -mdout . -uc 1 -b 2048 -np 512

JSON Output Example

mbase_bench.json
{
  "model_information": {
    "embedding_length": 3072,
    "head_count": 32,
    "layer_count": 32,
    "model_size_gb": 2.0260353,
    "name": "Phi 3 Mini 128k Instruct",
    "quantization": "Q4_0"
  },
  "processor_diagnostics": [{
    "load_delay_ms": 41,
    "pp tokens per sec": 1698.1758,
    "tg tokens per sec": 76.92307
  }],
  "session_information": {
    "batch_proc_threads": 8,
    "batch_size": 2048,
    "compute_devices": [{
      "device_name": "NVIDIA GeForce RTX 2060",
      "type": "GPU"
    }, {
      "device_name": "Intel(R) Core(TM) i7-9750HF CPU @ 2.60GHz",
      "type": "CPU"
    }],
    "context_length": 2048,
    "flash_attention": true,
    "generation_threads": 16,
    "gpu_layers": 999,
    "predict": 512,
    "prompt_length": 1024,
    "user_count": 1
  },
  "useful_metrics": {
    "average_fps": 456.16666,
    "total_elapsed_time_seconds": 9.748
  }
}
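
Since the JSON output is machine-readable, it can be post-processed by other tools. As a sketch (jq is not part of the MBASE SDK; this assumes it is installed), the per-processor throughput figures from the file above could be extracted with:

jq '.processor_diagnostics[] | {load_delay_ms, pp: ."pp tokens per sec", tg: ."tg tokens per sec"}' mbase_bench.json

Note the quoted key names, since they contain spaces.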

Markdown Output Example

mbase_bench.md
### Model Information
__Name__: Phi 3 Mini 128k Instruct<br>
__Model size__: 2.03 GB <br>
__Quantization__: Q4_0<br>
__Embedding length__: 3072<br>
__Head count__: 32<br>
__Layer count__: 32<br>
### Session Information
__Context length__: 2048<br>
__Batch size__: 2048<br>
__Prompt length__: 1024<br>
__Batch processing threads__: 8<br>
__Generation threads__: 16<br>
__User count__: 1<br>
__Flash attention__: Enabled<br>
__GPU offload layers__: 999<br>
__N Predict__: 512<br>
__Compute devices__:
- NVIDIA GeForce RTX 2060
- Intel(R) Core(TM) i7-9750HF CPU @ 2.60GHz
### Useful Metrics
__Total elapsed time in seconds__: 9.75<br>
__Average FPS__: 456<br>
### Performance Table
| Load delay ms | pp t/s | tg t/s |
| ------------- | ------ | ------ |
| 41 | 1698.18 | 76.92 |

Options

-h, --help

Print program information.

-v, --version

Print program version.

-dfa, --disable-flash-attention

Disables flash attention, which is enabled by default. Disabling it may decrease performance.

-t count, --thread-count count

Number of threads to use for token generation. (default=16)

-bt count, --batch-thread-count count

Number of threads to use for initial batch processing. (default=8)

-c length, --context-length length

Total context length of the conversation, which includes the special tokens and the LLM's response. (default=2048)

-b length, --batch-length length

The input is processed in batches in the processor's decode loop. This is the maximum batch length to be processed in a single iteration. (default=1024)

-pr length, --prompt-length length

Pseudo-prompt length; it can't exceed the context length. Higher numbers may result in a more precise pp t/s measurement. (default=1024)

-gl count, --gpu-layers count

Number of layers to offload to the GPU. Ignored if no GPU is present. (default=999)

-np n, --n-predict n

Number of tokens to predict. Higher numbers may result in a more precise tg t/s measurement. (default=256)

-uc count, --user-count count

Number of users to be processed in parallel. (default=1)

-fps n, --frame-per-second n

Maximum FPS of the main loop. This is for measuring the effect of the inference engine on the main application loop. (default=500, min=10, max=1000)

-jout directory_path, --json-output-path directory_path

If the JSON output path is specified, the result will be written there to the file “mbase_bench.json”. (default=””)

-mdout directory_path, --markdown-output-path directory_path

If the Markdown output path is specified, the result will be written there to the file “mbase_bench.md”. (default=””)