Processor Object in Detail

The processor object is an abstract object declared as InfProcessorBase in the header file mbase/inference/inf_processor.h.

It is responsible for multiple things, listed below:

  • Abstracting over the llama.cpp C context SDK.

  • Creating a context for inference.

  • Providing non-blocking model inference methods for the target model.

  • Providing the tokenizer.

When an LLM is loaded into program memory using the methods of the model object, the model object waits for processors to be registered. When one or more processor objects are registered with the model object, their states are updated every time the model object's state is updated by calling the model object's update method. Thus, the model object acts as a scheduler and synchronization point for multiple processor objects while those processors perform the requested inference operations in parallel.

The model object also handles the case in which the model is destroyed during inference: it destroys the contexts of the registered processors, unregisters all of them automatically, and gracefully unloads the model.

However, if the processor is only initialized and not registered with the model object, the user must update the processor state manually by calling its update method, and must take care not to perform inference operations while the model object is not valid.

Since it also inherits from the logical_processor object, it is assumed to be a signal-driven parallel state machine.
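
As a rough illustration, this pattern can be mocked in plain C++. The sketch below is illustrative only, not the actual mbase::logical_processor implementation: an expensive operation runs on a worker thread, raises a state signal when it finishes, and the owner consumes that signal by calling update().

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Minimal mock of a signal-driven parallel state machine.
// Illustrative only; the real base class is mbase::logical_processor.
class MockProcessor {
public:
    // Non-blocking: schedules initialization on a worker thread.
    void initialize() {
        mInitializing = true;
        mWorker = std::thread([this] {
            std::this_thread::sleep_for(std::chrono::milliseconds(20)); // expensive setup
            mInitializing = false;
            mStateInitializing = true; // "finished; owner should call update()"
        });
    }
    bool signal_initializing() const { return mInitializing; }            // still running?
    bool signal_state_initializing() const { return mStateInitializing; } // finished, pending update()?
    // Consumes pending state signals on the caller's thread.
    void update() {
        if (mStateInitializing) {
            mStateInitializing = false;
            mInitialized = true; // a real implementation would fire on_initialize() here
        }
    }
    bool is_initialized() const { return mInitialized; }
    ~MockProcessor() { if (mWorker.joinable()) { mWorker.join(); } }
private:
    std::atomic_bool mInitializing{false};
    std::atomic_bool mStateInitializing{false};
    std::atomic_bool mInitialized{false};
    std::thread mWorker;
};
```

The owner's loop polls signal_state_initializing() and calls update() when it becomes true; the expensive work itself never blocks that loop.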

Naming

It is named a processor because it literally processes your input and generates the response using the LLM.

Derived classes are expected to inherit from this object and name themselves as follows:

  • InfProcessorTextToText

  • InfProcessorEmbedder

  • InfProcessorSpeechToText (not implemented)

  • InfProcessorImageTextToText (not implemented)

Currently, InfProcessorTextToText and InfProcessorEmbedder have been implemented; InfProcessorTextToText is the main subject of this page.

Identifying the Expensive Operations

Operations such as context creation/deletion, input batch processing, and output token generation can be considered expensive operations that would block the main application thread for a long period of time.

Of all the above, the most expensive is input batch processing, in which your input is supplied to the inference engine as batches and all matrix calculations are applied during this period. The output token generation speed is the same regardless of your input. Both are highly affected by the model's parameter count and quantization type.

Tokenization is not an expensive operation, so it is handled synchronously.

Here is a list of methods that correspond to expensive operations:

  • initialize: Creates and initializes the context.

  • initialize_sync: Synchronized initialize.

  • destroy: Destroys the context.

  • destroy_sync: Synchronized destroy.

  • declare_lora_assign: Assigns a LoRA adapter to the context.

  • declare_lora_remove: Removes an assigned LoRA adapter from the context.

  • execute_input: Signals the parallel state machine to batch process your input.

  • execute_input_sync: Synchronized execute.

  • next: Signals the parallel state machine to compute the next token.

  • next_sync: Synchronized next.
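
As the list suggests, each expensive method comes in a non-blocking form and a *_sync form. Conceptually, a synchronized variant is just the non-blocking signal followed by waiting for the state signal and updating. The names and structure below are a hypothetical sketch of that layering, not the real mbase implementation:

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical sketch: how a *_sync method can be layered on top of a
// non-blocking signal + update() pair. Names are illustrative, not mbase API.
struct AsyncOp {
    std::atomic_bool state_ready{false}; // work finished, update() pending
    std::atomic_bool done{false};        // update() has consumed the result
    std::thread worker;

    void start() { // non-blocking, like execute_input()
        worker = std::thread([this] {
            std::this_thread::sleep_for(std::chrono::milliseconds(20)); // batch processing
            state_ready = true;
        });
    }
    void update() { // consume the state signal, like update()
        if (state_ready) { state_ready = false; done = true; }
    }
    void start_sync() { // blocking variant, like execute_input_sync()
        start();
        while (!done) { // spin until the parallel work signals completion
            update();
            std::this_thread::yield();
        }
    }
    ~AsyncOp() { if (worker.joinable()) { worker.join(); } }
};
```

The blocking variant trades main-thread responsiveness for simplicity: start_sync() returns only once the parallel work has been consumed.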

Essential Callbacks

Important

LoRA related callbacks are not mentioned here.

There are essential callbacks that derived classes must implement in order to catch events related to the processor object.

Those callbacks are as follows:

  • on_initialize: This is called if the context creation is successful.

  • on_initialize_fail(last_fail_code out_code): This is called if context creation fails in the parallel thread; the failure reason is stored in the out argument out_code.

  • on_destroy: This is called if the processor context is destroyed.

Essential Signals

Important

LoRA related signals are not mentioned here.

The user can observe signals on the processor object to see whether an operation is still running in parallel. Here are the essential signals that can be observed in the program loop:

  • signal_state_initializing(): It is true if the processor has been initialized and the processor object should be updated by calling update().

  • signal_state_destroying(): It is true if the processor has been destroyed and the processor object should be updated by calling update().

  • signal_initializing(): It is true if the processor is actively being initialized in parallel.

  • signal_destroying(): It is true if the processor is actively being destroyed in parallel.

mbase/inference/inf_processor.h
class MBASE_API InfProcessorBase : public mbase::logical_processor {
public:
   ...
   MBASE_ND(MBASE_OBS_IGNORE) bool signal_state_initializing() const noexcept;
   MBASE_ND(MBASE_OBS_IGNORE) bool signal_state_destroying() const noexcept;
   MBASE_ND(MBASE_OBS_IGNORE) bool signal_initializing() const noexcept;
   MBASE_ND(MBASE_OBS_IGNORE) bool signal_destroying() const noexcept;
   ...
protected:
   ...
};
  • signal_state_input_process(): It is true if the initial input batch processing is finished and the processor object should be updated by calling update().

  • signal_state_decode_process(): It is true if the next token is calculated by the LLM and the processor object should be updated by calling update().

  • signal_state_kv_locked_process(): It is true if the input caching on processor is finished and the processor object should be updated by calling update().

  • signal_input_process(): It is true if the initial input batch processing is active in parallel.

  • signal_decode_process(): It is true if the next token calculation processing is active in parallel.

  • signal_kv_locked_process(): It is true if the input caching processing is active in parallel.

mbase/inference/inf_t2t_processor.h
class MBASE_API InfProcessorTextToText : public mbase::InfProcessorBase {
public:
   ...
   bool signal_state_lora_operate() const;
   bool signal_state_input_process() const;
   bool signal_state_decode_process() const;
   bool signal_state_kv_locked_process() const;
   bool signal_lora_operate_process() const;
   bool signal_input_process() const;
   bool signal_decode_process() const;
   bool signal_kv_locked_process() const;
   ...
private:
   ...
};
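
Putting the pair together: in the program loop, signal_input_process() tells you the work is still running, while signal_state_input_process() tells you it has finished and update() should be called. A stub-based sketch of that loop follows (illustrative shape only; the real class is InfProcessorTextToText):

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Stub with the same signal-pair shape as the processor object.
// Illustrative only; not the mbase implementation.
struct StubProcessor {
    std::atomic_bool active{false};      // like signal_input_process()
    std::atomic_bool state_ready{false}; // like signal_state_input_process()
    bool processed = false;              // set once update() consumes the state
    std::thread worker;

    void execute_input() { // non-blocking batch processing
        active = true;
        worker = std::thread([this] {
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
            active = false;
            state_ready = true;
        });
    }
    bool signal_input_process() const { return active; }
    bool signal_state_input_process() const { return state_ready; }
    void update() {
        if (state_ready) { state_ready = false; processed = true; }
    }
    ~StubProcessor() { if (worker.joinable()) { worker.join(); } }
};

// Typical program-loop usage of the signal pair.
inline void run_loop(StubProcessor& proc) {
    proc.execute_input();
    for (;;) {
        if (proc.signal_input_process()) {
            // batch processing still running in parallel; keep the UI responsive
        }
        if (proc.signal_state_input_process()) {
            proc.update(); // consume the finished state
        }
        if (proc.processed) { break; }
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
}
```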

TextToText Execution Flow

  1. In order to do an inference operation using the processor, you first need to register the processor with the model object. See Processor Registration Example.

  2. Then you need to register your TextToText client to the processor.

  3. Then you need to tokenize your input using the tokenization methods. See Message Preparation.

  4. Then you will execute your input.

  5. Then you will compute and generate tokens.
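
For orientation, the five steps above can be mocked end-to-end in plain C++. All names below are illustrative stand-ins; the real registration, tokenizer, execution, and generation methods are declared in mbase/inference/inf_t2t_processor.h with different signatures.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Mock of the TextToText execution flow. Hypothetical names throughout;
// not the real mbase API.
struct MockClient {
    std::string generated;
    void on_write(const std::string& token) { generated += token; } // token callback
};

struct MockT2TProcessor {
    MockClient* client = nullptr;
    std::vector<std::string> pending; // tokenized input
    size_t cursor = 0;

    void set_inference_client(MockClient& c) { client = &c; }    // step 2: register client
    std::vector<std::string> tokenize(const std::string& text) { // step 3: tokenize input
        std::istringstream iss(text);
        std::vector<std::string> out;
        for (std::string w; iss >> w;) { out.push_back(w); }
        return out;
    }
    void execute_input(std::vector<std::string> tokens) {        // step 4: execute input
        pending = std::move(tokens);
        cursor = 0;
    }
    bool next() {                                                // step 5: generate a token
        if (cursor >= pending.size()) { return false; }
        if (client) { client->on_write(pending[cursor] + " "); } // echo stands in for generation
        ++cursor;
        return true;
    }
};
```

A caller would register the client, tokenize a prompt, execute it, and then call next() in a loop until generation ends; in the real library, steps 4 and 5 are the non-blocking expensive operations described earlier.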

Tip

In the documentation, there is a fully implemented example. See Single-Prompt Example.

Client Registration Example

Input Execution and Token Generation

Advanced

Decode Behavior Description

// GENERATION BEHAVIOR MANIP

Manual Caching

// FOR PROMPT CACHING

Context Shifting

// FOR INFINITE TOKEN GENERATION