Train a tiny Llama model to help with a specific domain task. | by George Soloupis

The target domain is a small assistant that translates the user’s spoken language into specific commands and adjusts the volume of the mobile phone automatically. In previous work we showcased offline speech-to-text with TensorFlow and the Whisper model, and the use of the Whisper .tflite model inside an Android application. After the Whisper model generated the text, the application previously forwarded it to either the ChatGPT or the Gemini API, which condensed it into commands. In the current implementation we have integrated a Llama model written in C, which performs the same task offline!

Screenshot of mobile phone with options.

To train the model you can use the Colab notebook I have prepared here. First, clone Karpathy’s GitHub repository and install the requirements used for training. Among them is the SentencePiece library, which is the heart of the tokenization procedure in this example.

!git clone https://github.com/karpathy/llama2.c.git
%cd llama2.c
!pip install -r requirements.txt

The original repository uses the TinyStories dataset. I have altered the code so you can use whatever .txt file you have. For example, for a QnA task you can use the ‘natural_questions/longt5’ dataset from TensorFlow Datasets and create a .txt file in the following form:

'what is the definition of the name tiffany = Epiphany'
'what is the maximum depth of the atlantic ocean = 8 , 486 m  (  27 , 841 ft  ) '
'what is the population of nashville tennessee metropolitan area = 1 , 865 , 298'
'when was the first animal sent into space = 1947'

To convert the dataset to a .txt file use the last cells of the Colab notebook:


import tensorflow as tf
import tensorflow_datasets as tfds

# Load the dataset; each element is a dict with 'question' and 'answer' keys
nqa = tfds.load('natural_questions/longt5', as_supervised=False)
print(nqa['train'])
prefetchdataset = nqa['train']
print(len(prefetchdataset))

def remove_first_character(string):
    # Strip the leading b' and the trailing ' left by str() on a bytes object
    return string[2:-1]

all_qna = []
n = 0
samples = 307373  # maximum number of question/answer pairs to keep
for element in prefetchdataset:
    # Skip examples that have no answer
    if "NULL" in str(element['answer'].numpy()):
        continue
    tensordata = element['question'] + " = " + element['answer']
    stringdata = remove_first_character(str(tensordata.numpy()))
    all_qna.append(stringdata)
    n += 1
    if n == samples:
        break
print(all_qna)

# Name of the text file
file_name = "output.txt"
# Open the file in write mode and write each string to a new line
with open(file_name, 'w') as file:
    for string in all_qna:
        file.write(f"{string}\n")
print(f"Strings written to {file_name}")
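Before tokenizing, it can be worth checking that every line of the generated file has the `question = answer` shape shown earlier. The following is a minimal sketch and not part of the notebook; the helper name `check_qna_lines` and the sample lines are assumptions made for illustration.

```python
def check_qna_lines(lines, separator=" = "):
    """Split lines into (valid, invalid) lists.

    A line is valid when it contains the separator and has non-empty
    text on both sides of its first occurrence.
    """
    valid, invalid = [], []
    for line in lines:
        text = line.strip()
        question, sep, answer = text.partition(separator)
        if sep and question.strip() and answer.strip():
            valid.append(text)
        else:
            invalid.append(text)
    return valid, invalid


# Hypothetical sample; in practice read the lines from output.txt
sample = [
    "when was the first animal sent into space = 1947",
    "what is the definition of the name tiffany = Epiphany",
    "a line without a separator",
    " = answer without a question",
]
valid, invalid = check_qna_lines(sample)
print(f"{len(valid)} valid, {len(invalid)} invalid")
```

To check the real file, pass `open("output.txt").readlines()` instead of `sample` and inspect the `invalid` list before training the vocabulary.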

With the .txt file ready, we have to train the vocabulary and pretokenize the dataset before training the model. I have created the pre_training_script.py file, which can be used as follows:

import time
!python pre_training_script.py train_vocab --vocab_size=1200 --path_to_text=/content/llama2.c/TinyStories-train.txt
time.sleep(5)
!python pre_training_script.py pretokenize --vocab_size=1200 --path_to_text=/content/llama2.c/TinyStories-train.txt

You can alter the vocab size as per your preference based on the task and the size of the dataset.

Three files will be generated. You can convert tok1200.model into a .bin file that is ready to be used on embedded devices with:

!python tokenizer.py --tokenizer-model=/content/llama2.c/data/tok1200.model

Then with the train_abstract.py script file you can start the training:

!python train_abstract.py --vocab_source=custom --vocab_size=1200

At the top of the training script you can see all the parameters you can alter so that the final model is efficient yet small. You can consult the last pages of the Chinchilla paper when choosing the parameters of the model.
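To get a feeling for how these parameters affect model size, here is a rough sketch that estimates the parameter count of a llama2.c-style model. It assumes the layer layout of the upstream llama2.c model (embedding shared with the output layer, `n_kv_heads` equal to `n_heads`) and uses `dim=288`, `n_layers=6` as example values; the settings in train_abstract.py may differ. The factor of 20 training tokens per parameter is a commonly quoted simplification of the Chinchilla result, not an exact prescription.

```python
def llama_param_count(dim, n_layers, vocab_size, multiple_of=32):
    """Approximate parameter count of a llama2.c-style model.

    Assumes n_kv_heads == n_heads and that the token embedding is
    shared with the output classifier.
    """
    # Feed-forward hidden size: 2/3 of 4*dim, rounded up to multiple_of
    hidden_dim = int(2 * (4 * dim) / 3)
    hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)

    attention = 4 * dim * dim            # wq, wk, wv, wo
    feed_forward = 3 * dim * hidden_dim  # w1, w2, w3
    norms = 2 * dim                      # two RMSNorm weights per layer
    per_layer = attention + feed_forward + norms

    embedding = vocab_size * dim
    final_norm = dim
    return n_layers * per_layer + embedding + final_norm


# Example values (assumed), with the vocabulary size used in this article
params = llama_param_count(dim=288, n_layers=6, vocab_size=1200)
tokens = 20 * params  # rule-of-thumb token budget
print(f"parameters: {params:,}")
print(f"rule-of-thumb training tokens: {tokens:,}")
```

With a vocabulary this small, the embedding table is a minor part of the total, so most of the size comes from `dim` and `n_layers`.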

After the training you can create an executable file:

!make runfast

and use the model and the tokenizer .bin files:

model_file = '/content/llama2.c/out/model.bin'
tokenizer = '/content/llama2.c/data/tok1200.bin'

# Generate args
max_token = 96 #@param {type:"slider", min:32, max:1024, step:32}
temperature = 0 #@param {type:"slider", min:0.0, max:1, step:0.05}
top_p = 0.9 #@param {type:"slider", min:0.0, max:1.0, step:0.05}
prompt = "the music" #@param {type:"string"}
print(f"model: {model_file}, max_token: {max_token}, temperature: {temperature}, top_p: {top_p}, prompt: {prompt}")
print(f"----------------------------\n")
cmd = f'./run {model_file} -z {tokenizer} -t {temperature} -p {top_p} -n {max_token} -i "{prompt}"'
!{cmd}
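The f-string command above breaks if the prompt itself contains a double quote. Outside a notebook, one option is to build the invocation as a list of arguments, which can be handed to `subprocess.run` without shell quoting. This is a sketch under assumptions: the helper name `build_run_command` and the example prompt are hypothetical, while the flags are the ones used in the cell above.

```python
import shlex


def build_run_command(model_file, tokenizer, prompt,
                      temperature=0.0, top_p=0.9, max_token=96):
    """Return the ./run invocation as a list of arguments."""
    return [
        "./run", model_file,
        "-z", tokenizer,
        "-t", str(temperature),
        "-p", str(top_p),
        "-n", str(max_token),
        "-i", prompt,
    ]


args = build_run_command(
    "/content/llama2.c/out/model.bin",
    "/content/llama2.c/data/tok1200.bin",
    'turn "the music" down',  # hypothetical prompt containing quotes
)
# shlex.join quotes each argument so the line is safe to paste into a shell
print(shlex.join(args))
```

Once the executable has been built, `subprocess.run(args, check=True)` runs the same command from plain Python.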

These two files can then be used inside the Android project with the Native Development Kit (NDK).
