One library that can directly use the Gemma models (in a specific format) is MediaPipe. In the documentation for LLM inference you can find a wealth of information about converting specific LLMs and using them on Android, Web and iOS.
Gemma comes in many formats, such as Keras, PyTorch, Transformers, C++, TensorRT, TensorFlow Lite and others. For our QnA task we can directly download the TensorFlow Lite format from the Kaggle website, but it is worth mentioning how you can convert the Transformers model, which is in the .safetensors format, into a .bin file that can be used on a mobile device. If you check the link above you will see that we are using the int8 model, which is more appropriate for our difficult task than the int4 one.
You can use Google Colaboratory for the conversion procedure, which consists of the following steps:
1. Import the libraries
import ipywidgets as widgets
from IPython.display import display

# Capture the noisy pip output inside a widget so it can be cleared afterwards.
install_out = widgets.Output()
display(install_out)

with install_out:
    !pip install mediapipe
    !pip install huggingface_hub
    import os
    from huggingface_hub import hf_hub_download
    from mediapipe.tasks.python.genai import converter

install_out.clear_output()
with install_out:
    print("Setup done.")
2. Log in to Hugging Face (you should have an account and an HF token)
from huggingface_hub import notebook_login
notebook_login()
3. Create dropdown lists
model = widgets.Dropdown(
    options=["Gemma 2B", "Falcon 1B", "StableLM 3B", "Phi 2"],
    value='Falcon 1B',
    description='model',
    disabled=False,
)
backend = widgets.Dropdown(
    options=["cpu", "gpu"],
    value='cpu',
    description='backend',
    disabled=False,
)
token = widgets.Password(
    value='',
    placeholder='huggingface token',
    description='HF token:',
    disabled=False
)

# Show the token field (and its label) only when a Gemma model is selected.
def on_change_model(change):
    if change["new"] != 'Gemma 2B':
        token_description.layout.display = "none"
        token.layout.display = "none"
    else:
        token_description.layout.display = "flex"
        token.layout.display = "flex"
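The function above references a token_description label and has to be attached to the model dropdown, but the snippet does not show how these widgets are wired up and displayed. One possible wiring, with an illustrative label text, is the following sketch:

# Assumed wiring, not shown in the snippet above: the label for the token field,
# the observer that toggles its visibility, and the display of the widgets.
token_description = widgets.HTML(value="Gemma requires a Hugging Face token:")  # illustrative label
model.observe(on_change_model, names="value")
display(model, backend, token_description, token)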
4. Define the repository and the files to download

def gemma_download(token):
    REPO_ID = "google/gemma-1.1-2b-it"
    FILENAMES = ["tokenizer.json", "tokenizer_config.json", "model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"]
    os.environ['HF_TOKEN'] = token
    # 'out' is an ipywidgets Output widget used to show the download progress
    # (defined later in the notebook; see the driver sketch after step 5).
    with out:
        for filename in FILENAMES:
            hf_hub_download(repo_id=REPO_ID, filename=filename, local_dir="./gemma-2b-it")
5. Convert the .safetensors checkpoint to the TensorFlow Lite format (the driver sketch below shows how the configuration is used)

def gemma_convert_config(backend):
    input_ckpt = '/content/gemma-2b-it/'
    vocab_model_file = '/content/gemma-2b-it/'
    output_dir = '/content/intermediate/gemma-2b-it/'
    output_tflite_file = f'/content/converted_models/gemma_{backend}.tflite'
    return converter.ConversionConfig(
        input_ckpt=input_ckpt,
        ckpt_format='safetensors',
        model_type='GEMMA_2B',
        backend=backend,
        output_dir=output_dir,
        combine_file_only=False,
        vocab_model_file=vocab_model_file,
        output_tflite_file=output_tflite_file,
    )
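Putting the steps together, a minimal driver sketch could look like the following, assuming the widgets and functions defined above; converter.convert_checkpoint is the call that performs the actual conversion:

# Minimal driver sketch: download the checkpoint (Gemma needs an HF token), then convert it.
out = widgets.Output()   # progress output used inside gemma_download
display(out)

if model.value == "Gemma 2B":
    gemma_download(token.value)

config = gemma_convert_config(backend.value)
converter.convert_checkpoint(config)   # writes /content/converted_models/gemma_<backend>.tflite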
You can use the Colab file, which has been slightly altered from the official one so that you can log in directly with your Hugging Face account.
After converting the model, we can use the MediaPipe LLM Inference solution. The procedure consists of:
1. Adding the library dependency

dependencies {
    implementation 'com.google.mediapipe:tasks-genai:0.10.11'
}
2. Pushing the converted .bin model to a specific path on the device (see the adb commands below)
/data/local/tmp/llm/model.bin
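During development, one way to place the file at that path is adb; the local file name below is just an example taken from the conversion step:

adb shell mkdir -p /data/local/tmp/llm/
adb push gemma_cpu.tflite /data/local/tmp/llm/model.bin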
3. Setting up the library inside the project
class InferenceModel private constructor(context: Context) {
    private var llmInference: LlmInference

    // True when the converted model has been pushed to the expected path on the device.
    private val modelExists: Boolean
        get() = File(MODEL_PATH).exists()

    private val _partialResults = MutableSharedFlow<Pair<String, Boolean>>(
        extraBufferCapacity = 1,
        onBufferOverflow = BufferOverflow.DROP_OLDEST
    )
    // Stream of (partial text, done) pairs emitted while the model generates a response.
    val partialResults: SharedFlow<Pair<String, Boolean>> = _partialResults.asSharedFlow()

    init {
        if (!modelExists) {
            throw IllegalArgumentException("Model not found at path: $MODEL_PATH")
        }

        val options = LlmInference.LlmInferenceOptions.builder()
            .setModelPath(MODEL_PATH)
            .setMaxTokens(1024)
            .setTemperature(0.0f)
            .setResultListener { partialResult, done ->
                _partialResults.tryEmit(partialResult to done)
            }
            .build()

        llmInference = LlmInference.createFromOptions(context, options)
    }

    fun generateResponseAsync(prompt: String) {
        llmInference.generateResponseAsync(prompt)
    }

    companion object {
        private const val MODEL_PATH = "/data/local/tmp/llm/model.bin"
        private var instance: InferenceModel? = null

        // Lazily create a single instance of the model wrapper.
        fun getInstance(context: Context): InferenceModel {
            return if (instance != null) {
                instance!!
            } else {
                InferenceModel(context).also { instance = it }
            }
        }
    }
}
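To give an idea of how the class above can be consumed from the UI layer, here is a hypothetical sketch; InferenceViewModel, uiState and ask are illustrative names and not part of MediaPipe or the official sample:

import android.app.Application
import androidx.lifecycle.AndroidViewModel
import androidx.lifecycle.viewModelScope
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.launch

class InferenceViewModel(application: Application) : AndroidViewModel(application) {
    private val inferenceModel = InferenceModel.getInstance(application)

    // Accumulated answer text exposed to the UI.
    val uiState = MutableStateFlow("")

    init {
        viewModelScope.launch {
            // Each emission is a (partial text, done) pair from the result listener.
            inferenceModel.partialResults.collect { (partial, done) ->
                uiState.value += partial   // 'done' is true on the last chunk
            }
        }
    }

    fun ask(question: String) {
        uiState.value = ""
        inferenceModel.generateResponseAsync(question)
    }
}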
Pay attention to the following line in the code above:
.setTemperature(0.0f)
This is used so that the model is deterministic, meaning we want the exact answer rather than a creative one. Other parameters that can be used are summarized here.
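For illustration, the documented options for this API also include topK and randomSeed; assuming your version of the library exposes them on the same builder, a sketch with placeholder values would look like this:

// Placeholder values, not recommendations; tune them for your own task.
val options = LlmInference.LlmInferenceOptions.builder()
    .setModelPath(MODEL_PATH)
    .setMaxTokens(1024)      // upper bound on the number of tokens handled per call
    .setTopK(40)             // sample only among the 40 most likely tokens
    .setTemperature(0.0f)    // 0 = deterministic, higher = more creative
    .setRandomSeed(0)        // seed used when sampling randomly
    .build()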