OpenAI Server

Program Identification

mbase_openai_server - an OpenAI API compatible HTTP/HTTPS server for serving LLMs

Synopsis

mbase_openai_server [option [value]]...
mbase_openai_server --hostname "127.0.0.1" -jsdesc description.json
mbase_openai_server --hostname "127.0.0.1" --port 8080 -jsdesc description.json
mbase_openai_server --hostname "127.0.0.1" --port 8080 --ssl-pub public_key_file --ssl-key private_key_file -jsdesc description.json

Description

An OpenAI API compatible HTTP/HTTPS server for serving LLMs. The program provides a chat completions API for TextToText models and an embeddings API for embedder models.

The mbase_openai_server can host either a single model or multiple models and serve its clients simultaneously; the number of clients served concurrently per model is specified by the processor_count key in the provided JSON description file.

In order to use the mbase_openai_server program, you must supply a JSON file describing the behavior of the server.

Along with the JSON description file, you can specify the hostname (default=127.0.0.1) and the port (default=8080) to listen on. The specified hostname and port must be configured properly so that the application can listen.

JSON Description Usage

You must write a JSON description file and pass its path with the -jsdesc option:

mbase_openai_server --hostname "127.0.0.1" -jsdesc description.json

In the description file, you specify parameters such as the path of the model file, the number of concurrent users, sampler settings, and so on.

Format and Parameters

The JSON file contains an array of objects, one object per model, with the following keys and values:

  • model_path: Path of the model file. It must be a valid GGUF file.

  • processor_count (default=4): Number of users the server will serve the LLM to concurrently.

  • thread_count (default=8): Number of threads the inference engine uses to generate tokens.

  • batch_thread_count (default=8): Number of threads to use for initial batch processing.

  • context_length (default=2048): Context length of each processor. The inference engine will allocate a context for each processor.

  • batch_length (default=512): The inference engine processes the user's input in batches. A higher value improves performance but significantly increases RAM usage. This value can't exceed the context length.

  • gpu_layers (default=80): Number of layers to offload to the GPU if there are any GPU devices in your system. Ignored if there are no GPUs.

If you are hosting a TextToText model, the following samplers may also be specified.

  • samplers.top_k

  • samplers.top_p

  • samplers.min_p

  • samplers.temp

  • samplers.mirostat_v2.tau

  • samplers.mirostat_v2.eta

  • samplers.repetition.penalty_n

  • samplers.repetition.penalty_repeat

If you don't specify any sampling parameters, greedy sampling is applied by default.
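
For instance, the sketch below enables the mirostat v2 sampler for a model. The nested object layout mirrors the repetition object in the full example further down; the tau and eta values here are illustrative placeholders, not recommended defaults:

description.json
[
    {
        "model_path" : "model.gguf",
        "samplers" :
        {
            "mirostat_v2" :
            {
                "tau" : 5.0,
                "eta" : 0.1
            }
        }
    }
]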

Single Model Hosting Example

description.json
[
    {
        "model_path" : "model.gguf"
    }
]

For 8 concurrent accesses with a context length of 4096 each:

description.json
[
    {
        "model_path" : "model.gguf",
        "processor_count" : 8,
        "context_length" : 4096
    }
]

Specifying all parameters and some samplers:

description.json
[
    {
        "model_path" : "model.gguf",
        "processor_count" : 8,
        "context_length" : 4096,
        "thread_count" : 8,
        "batch_thread_count" : 8,
        "batch_length" : 512,
        "gpu_layers" : 80,
        "samplers" :
        {
            "top_k" : 40,
            "top_p" : 1.0,
            "min_p" : 0.3,
            "temp" : 0.8,
            "repetition" :
            {
                "penalty_n" : 64,
                "penalty_repeat" : 1.2
            }
        }
    }
]

Multi Model Hosting Example

description.json
[
    {
        "model_path" : "model.gguf"
    },
    {
        "model_path" : "model1.gguf"
    },
    {
        "model_path" : "model2.gguf"
    }
]

REST API Usage Example

After you create your JSON description file and run the mbase_openai_server program, you can send requests to the server using the OpenAI API.

Chat Completion

Important

You can list the names of the hosted models by sending a GET request to the /v1/models endpoint.
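
For example, to list the hosted models of a server running locally on the default port:

curl "http://localhost:8080/v1/models" \
-H "Authorization: Bearer $OPENAI_API_KEY"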

Using CURL:

curl "http://localhost:8080/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-d '{
    "model": "$MODEL_NAME",
    "messages": [
        {
            "role": "developer",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "How are you?"
        }
    ]
}'

Using Python:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="OPENAI_API_KEY"
)

completion = client.chat.completions.create(
    model="MODEL_NAME",
    messages=[
        {"role": "developer", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": "How are you?"
        }
    ]
)

print(completion.choices[0].message)
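
Embeddings

If you are hosting an embedder model, the embeddings API can be called the same way. Below is a minimal sketch using the OpenAI Python client; "EMBEDDER_MODEL_NAME" is a placeholder, so query /v1/models for the actual name:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="OPENAI_API_KEY"
)

# Request an embedding vector for a single input string.
embedding = client.embeddings.create(
    model="EMBEDDER_MODEL_NAME",
    input="How are you?"
)

print(embedding.data[0].embedding)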

Options

--help

Print program information.

-v, --version

Show program version.

--api-key key

API key to be checked by the server.
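
For example, the following invocation (the key value is a placeholder) requires clients to present my_secret_key; clients send it as a Bearer token in the Authorization header, as in the curl examples above:

mbase_openai_server --api-key my_secret_key --hostname "127.0.0.1" --port 8080 -jsdesc description.json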

-h host, --hostname host

Hostname to listen to. (default=127.0.0.1)

-p port, --port port

Port to listen on. (default=8080)

--ssl-public file

SSL public key file for HTTPS support.

--ssl-key file

SSL private key file for HTTPS support.

-jsdesc description_file

JSON description file for the OpenAI server program.