Initializing Processor
After initializing the model, we need to initialize our processor object and register it into the model. The processor is an essential concept in MBASE Inference: it is the object where most of the computation happens.
Note
For detailed information about the processor, refer to Processor Object in Detail.
About Processor Object
The processor object is responsible for performing the inference itself. Since a processor operates on a model, a model object must be initialized first. The processor can then either be registered into the model object or process the model without registering itself.
Before we proceed, let's identify the expensive operations of the processor object:
- Creating the context, through the methods register_context_process or initialize
- Destroying the context, through the method destroy
- Tokenization, through the method tokenize_input
- Batch processing (also called prompt processing), through the method execute_input
- Token generation, through the method next
Depending on the size of the model, the size of the context, and the input, each of these operations can take a considerable amount of time. Fortunately, all of these methods are non-blocking.
As mentioned in the model initialization chapter, when one of these methods is called, it first validates its arguments. If the arguments are valid and the operation can start, the method returns a flag notifying the user that the operation (for example, creating the context) has started on another thread.
While the processor performs the expensive operation on that other thread, the user calls the update method of the model or processor object on every iteration of the program's loop. If the processor is registered into the model, the processor's update method is called whenever the model's update method is called.
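The non-blocking pattern described above can be sketched in plain C++. This is a minimal illustration under our own assumptions, not MBASE's implementation: ToyProcessor, ToyFlag, create_context, and the flag values are all hypothetical names. The call validates its argument, starts the work on a std::thread, and returns a flag immediately; the callback only fires later, on the caller's thread, once update() observes that the work has finished.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical status flags in the spirit of the flag-returning methods
// described above; these are NOT MBASE's actual flag names.
enum class ToyFlag { OPERATION_STARTED, INVALID_ARGUMENTS, BUSY };

class ToyProcessor {
public:
    ~ToyProcessor()
    {
        if (mWorker.joinable()) mWorker.join();  // never leave the worker thread running
    }

    // Non-blocking: validate, start the expensive work on another thread,
    // and return a flag right away.
    ToyFlag create_context(unsigned contextSize)
    {
        if (contextSize == 0) return ToyFlag::INVALID_ARGUMENTS;
        if (mBusy.load()) return ToyFlag::BUSY;
        if (mWorker.joinable()) mWorker.join();  // reap a previously finished worker
        mBusy = true;
        mWorker = std::thread([this] {
            std::this_thread::sleep_for(std::chrono::milliseconds(50));  // stand-in for the expensive work
            mFinished = true;
            mBusy = false;
        });
        return ToyFlag::OPERATION_STARTED;
    }

    // Called on every iteration of the user's loop; runs the callback on
    // the calling thread once the background work has finished.
    void update()
    {
        if (mFinished.exchange(false)) on_initialize();
    }

    bool isInitialized = false;

private:
    void on_initialize() { isInitialized = true; }

    std::atomic<bool> mBusy{false};
    std::atomic<bool> mFinished{false};
    std::thread mWorker;
};
```

A caller would check the returned flag, then keep calling update() from its loop (with a short sleep between iterations) until the callback has run. Running the callback inside update() rather than on the worker thread keeps user code single-threaded.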
You can register as many processors as you want, as long as your system has enough resources to run them in parallel.
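The forwarding relationship between a model and its registered processors can be pictured with the sketch below. ToyModel, ToyProcessor, and register_processor are hypothetical stand-ins, not MBASE types; the point is only that one update() call on the model drives the update() of every registered processor, however many there are.

```cpp
#include <vector>

// Hypothetical stand-ins, not MBASE classes.
class ToyProcessor {
public:
    // A real processor would poll its background work here and fire callbacks.
    void update() { ++updateCount; }
    int updateCount = 0;
};

class ToyModel {
public:
    // Stores a non-owning pointer; in this sketch the caller must keep the
    // processor alive while it is registered.
    void register_processor(ToyProcessor* processor) { mProcessors.push_back(processor); }

    // A single call on the model drives every registered processor.
    void update()
    {
        for (ToyProcessor* processor : mProcessors) processor->update();
    }

private:
    std::vector<ToyProcessor*> mProcessors;
};
```

Because the sketch stores non-owning pointers, keeping each processor as a member of the model class is one simple way to make sure it lives at least as long as its registration.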
Processor Initialization in Action
MBASE has a base InfProcessor class from which InfProcessorTextToText derives. In the near future, other inference types such as speech-to-text (S2T) and text-to-image (T2I) will also be implemented on top of it.
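The following sketch illustrates this kind of hierarchy in plain C++: a generic base class declares virtual lifecycle hooks, a modality-specific class derives from it, and user code derives once more to override the hooks. All class names here are hypothetical; they only mirror the roles played by InfProcessor, InfProcessorTextToText, and the ProcessorObject class defined in this chapter.

```cpp
#include <string>

// Hypothetical generic base: declares the hooks every processor type shares.
class ToyProcessorBase {
public:
    virtual ~ToyProcessorBase() = default;
    virtual std::string modality() const = 0;
    virtual void on_initialize() {}
    virtual void on_destroy() {}
};

// Hypothetical modality-specific layer (text-to-text here).
class ToyTextToTextProcessor : public ToyProcessorBase {
public:
    std::string modality() const override { return "T2T"; }
};

// User code derives from the modality-specific class and overrides hooks.
class MyProcessor : public ToyTextToTextProcessor {
public:
    void on_initialize() override { initialized = true; }
    bool initialized = false;
};
```

Code that only holds a ToyProcessorBase reference can still trigger the user's override, which is what lets a library call user-defined callbacks without knowing the concrete type.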
In our main.cpp, we include the header <mbase/inference/inf_t2t_processor.h> and derive our own class, named ProcessorObject, from InfProcessorTextToText:
```cpp
#include <mbase/inference/inf_device_desc.h>
#include <mbase/inference/inf_t2t_model.h>
#include <mbase/inference/inf_t2t_processor.h>
#include <iostream>
#include <mbase/vector.h>

bool gIsRunning = true;

class ProcessorObject;
class ModelObject;

class ProcessorObject : public mbase::InfProcessorTextToText {
public:
    void on_initialize_fail(last_fail_code out_code) override
    {
        std::cout << "Processor initialization failed." << std::endl;
        gIsRunning = false;
    }

    void on_initialize() override
    {
        std::cout << "Processor is initialized." << std::endl;
    }

    void on_destroy() override {}
};
```
Next, we register our processor into the model once the model has been initialized.
First, we make ProcessorObject a member of the ModelObject class:
```cpp
class ModelObject : public mbase::InfModelTextToText {
public:
    void on_initialize_fail(init_fail_code out_fail_code) override
    {
        std::cout << "Model initialization failed." << std::endl;
        gIsRunning = false;
    }

    void on_initialize() override
    {
        std::cout << "Model is initialized." << std::endl;
    }

    void on_destroy() override {}

private:
    ProcessorObject processorObject; // the processor we will register into this model
};
```
Tip
processorObject does not have to be a member of ModelObject in order to be registered; we do it this way for easier management.
Now, we call the register_context_process method of ModelObject inside its on_initialize method to register our processor:
```cpp
class ModelObject : public mbase::InfModelTextToText {
    ...
    void on_initialize() override
    {
        std::cout << "Model is initialized." << std::endl;
        uint32_t contextSize = 4096;
        uint32_t batchSize = 1024;
        uint32_t procThreadCount = 16;
        uint32_t genThreadCount = 8;
        bool isFlashAttention = true;
        mbase::inf_sampling_set samplingSet;

        ModelObject::flags registrationStatus = this->register_context_process(
            &processorObject,
            contextSize,
            batchSize,
            genThreadCount,
            procThreadCount,
            isFlashAttention,
            samplingSet
        );

        // Registration is itself non-blocking: this flag only says it has started
        if(registrationStatus != ModelObject::flags::INF_MODEL_INFO_REGISTERING_PROCESSOR)
        {
            std::cout << "Registration unable to proceed." << std::endl;
            gIsRunning = false;
        }
    }
    ...
};
```
Although the parameters are mostly self-explanatory, it is worth spelling out what each one represents:
- contextSize: The context length of the session.
- batchSize: The input is processed in batches; this is the maximum batch size passed to the context in a single iteration.
- procThreadCount: The number of threads used for batch processing.
- genThreadCount: The number of threads used for token generation.
- isFlashAttention: Whether flash attention is enabled. Enabling it generally improves performance.
- samplingSet: The samplers used when predicting the next token. If it is empty, greedy sampling is applied by default.
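To make the batchSize description concrete, here is a sketch of how an input could be split into batches. It assumes the batch size is counted in tokens, and split_into_batches is a hypothetical helper, not part of MBASE. With batchSize = 1024, a 3000-token input would be processed in three iterations of 1024, 1024, and 952 tokens.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: splits an input of `tokenCount` tokens into
// consecutive batches of at most `batchSize` tokens and returns the size
// of each batch, in processing order.
std::vector<std::size_t> split_into_batches(std::size_t tokenCount, std::size_t batchSize)
{
    std::vector<std::size_t> batches;
    if (batchSize == 0) return batches;  // invalid batch size: nothing to schedule
    for (std::size_t offset = 0; offset < tokenCount; offset += batchSize) {
        // The last batch holds whatever remains.
        batches.push_back(std::min(batchSize, tokenCount - offset));
    }
    return batches;
}
```

A larger batch size means fewer iterations for the same input, at the cost of more work per iteration.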
Note
Detailed explanations can be found in Processor Object in Detail.
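Since an empty samplingSet falls back to greedy sampling, it may help to see what greedy sampling does: at each step it simply picks the token with the highest score. The function below is an illustrative sketch with a hypothetical name (greedy_sample), not MBASE's sampler; it takes the model's logits for the next token and returns the index of the largest one.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical greedy sampler: returns the index (token id) of the largest
// logit. Precondition: `logits` is non-empty. On ties, the lowest index wins.
std::size_t greedy_sample(const std::vector<float>& logits)
{
    std::size_t best = 0;
    for (std::size_t i = 1; i < logits.size(); ++i) {
        if (logits[i] > logits[best]) best = i;
    }
    return best;
}
```

Because no randomness is involved, greedy sampling always produces the same output for the same input.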
After registration succeeds, “Processor is initialized.” is printed to the terminal.
Here is the complete main.cpp file:
```cpp
#include <mbase/inference/inf_device_desc.h>
#include <mbase/inference/inf_t2t_model.h>
#include <mbase/inference/inf_t2t_processor.h>
#include <iostream>
#include <mbase/vector.h>

bool gIsRunning = true;

class ModelObject;
class ProcessorObject;

class ProcessorObject : public mbase::InfProcessorTextToText {
public:
    void on_initialize_fail(last_fail_code out_code) override
    {
        std::cout << "Processor initialization failed." << std::endl;
        gIsRunning = false;
    }

    void on_initialize() override
    {
        std::cout << "Processor is initialized." << std::endl;
    }

    void on_destroy() override {}
};

class ModelObject : public mbase::InfModelTextToText {
public:
    void on_initialize_fail(init_fail_code out_fail_code) override
    {
        std::cout << "Model initialization failed." << std::endl;
        gIsRunning = false;
    }

    void on_initialize() override
    {
        std::cout << "Model is initialized." << std::endl;
        uint32_t contextSize = 4096;
        uint32_t batchSize = 1024;
        uint32_t procThreadCount = 16;
        uint32_t genThreadCount = 8;
        bool isFlashAttention = true;
        mbase::inf_sampling_set samplingSet; // We are setting greedy sampler by supplying empty sampling set

        ModelObject::flags registrationStatus = this->register_context_process(
            &processorObject,
            contextSize,
            batchSize,
            genThreadCount,
            procThreadCount,
            isFlashAttention,
            samplingSet
        );

        if(registrationStatus != ModelObject::flags::INF_MODEL_INFO_REGISTERING_PROCESSOR)
        {
            std::cout << "Registration unable to proceed." << std::endl;
            gIsRunning = false;
        }
    }

    void on_destroy() override {}

private:
    ProcessorObject processorObject;
};

int main()
{
    // List the available devices
    mbase::vector<mbase::InfDeviceDescription> deviceDesc = mbase::inf_query_devices();
    for(mbase::vector<mbase::InfDeviceDescription>::iterator It = deviceDesc.begin(); It != deviceDesc.end(); It++)
    {
        std::cout << It->get_device_description() << std::endl;
    }

    ModelObject modelObject;
    uint32_t totalContextLength = 32000;
    int32_t gpuLayersToUse = 80;
    bool isMmap = true;
    bool isMLock = true;
    if (modelObject.initialize_model_ex(
        L"<path_to_your_model>",
        totalContextLength,
        gpuLayersToUse,
        isMmap,
        isMLock,
        deviceDesc
    ) != ModelObject::flags::INF_MODEL_INFO_INITIALIZING_MODEL)
    {
        std::cout << "Unable to start initializing the model." << std::endl;
        return 1;
    }

    // Updating the model also updates the processor registered into it
    while(gIsRunning)
    {
        modelObject.update();
        mbase::sleep(2);
    }
    return 0;
}
```
In the next section, we will register our client to the processor.