t2t-tuner

Convenient Text-to-Text Training for Transformers

pip install t2t-tuner

Requires PyTorch: either follow PyTorch installation instructions or use a PyTorch container.

Features

Based on the wonderful HuggingFace Transformers library. Tested on T5-type and GPT-type models. In theory, it should also work with other models that support AutoModelForSeq2SeqLM or AutoModelForCausalLM.

The Trainer in this library is a higher-level interface built on HuggingFace’s run_translation.py script for text-to-text generation tasks. I wanted a more convenient interface for training and inference, along with access to features such as gradient checkpointing and model parallelism to fit larger models; these already exist in the HuggingFace library but are not exposed in the script. I also added some features that I wanted (prompt tuning, model summary), integrated autoregressive LM training, and wrapped everything into a single library that can be pip installed.

Examples

Training Models

import t2t

trainer_arguments = t2t.TrainerArguments(model_name_or_path="t5-small",
                                         train_file=YOUR_DATASET)

trainer = t2t.Trainer(arguments=trainer_arguments)

# train without validation
trainer.train(valid=False)

For more concrete examples, check out the notebooks linked below:

Data Format

Seq2Seq Training

{"translation": {"s": "TEXT", "t": "LABEL"}}
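As an illustration of the record format above, here is a minimal sketch that writes source/target pairs to a training file. It assumes the file is in JSON Lines layout (one record per line), as accepted by the HuggingFace run_translation.py script this library is based on; the helper name write_seq2seq_jsonl, the toy data, and the file name train.json are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_seq2seq_jsonl(pairs, path):
    """Write (source, target) pairs as one JSON record per line.

    Hypothetical helper, not part of t2t-tuner.
    """
    with open(path, "w", encoding="utf-8") as f:
        for source, target in pairs:
            # "s" holds the input text, "t" holds the label/target text.
            record = {"translation": {"s": source, "t": target}}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Toy data for illustration only.
pairs = [
    ("translate English to German: Hello", "Hallo"),
    ("translate English to German: Thank you", "Danke"),
]

out_path = Path(tempfile.gettempdir()) / "train.json"
write_seq2seq_jsonl(pairs, out_path)

# Read the file back to confirm every line parses as a standalone record.
with open(out_path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f]
```

The resulting path could then be passed as train_file in place of YOUR_DATASET in the training example, assuming that argument takes a file path.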

Autoregressive LM Training

Training Large Models

This section outlines how to train large language models (more than 1 billion parameters) on relatively simple setups.

Some notes for the configurations reported below:

GPT Models

Some GPT configurations that were tested and found to train on a single RTX 3090 (24GB) card (without DeepSpeed):

Model    Params  Precision  Optimizer  InputLen  BatchSize  Other
gpt2     1.5b    FP16       Adafactor  128       4          None
gpt2     1.5b    FP16       Adafactor  512       1          None
gpt2     1.5b    FP16       Adafactor  1024      4          GradCheckpoint
gpt-neo  1.3b    FP16       Adafactor  1024      1          None
gpt-neo  1.3b    FP16       Adafactor  2048      4          GradCheckpoint
gpt-neo  2.7b    FP16       Adafactor  2048      4          GradCheckpoint,FreezeEmbeds

T5 Models

Some T5 configurations that were tested and found to train on a single RTX 3090 (24GB) card (without DeepSpeed):

Model  Params  Precision  Optimizer  Seq2SeqLen  BatchSize  Other
t5     3b      FP32       Adafactor  128->128    1          FreezeEmbeds
t5     3b      FP32       Adafactor  128->128    1          GradCheckpoint
t5     3b      FP32       Adafactor  128->128    128        GradCheckpoint,FreezeEmbeds
t5     3b      FP32       Adafactor  512->512    32         GradCheckpoint,FreezeEmbeds

Model Parallelism for T5-11b models

Using this library, you can also fine-tune the t5-11b checkpoints quite easily on a single node with the following settings (without DeepSpeed):

Model parallel T5-11b

Note that depending on your system, the loading time for the checkpoint (46GB) can be very long. You’ll need ample CPU RAM (at least ~90GB) to load it successfully.
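As a rough sanity check on these sizes, the sketch below estimates the memory taken by model weights alone from the parameter count and numeric precision. It is back-of-the-envelope arithmetic, not a measurement: the parameter counts are the nominal sizes from the model names, the helper weight_memory_gb is hypothetical, and gradients, optimizer state, and activations are ignored.

```python
FP32_BYTES = 4  # bytes per parameter in 32-bit floating point
FP16_BYTES = 2  # bytes per parameter in 16-bit floating point


def weight_memory_gb(num_params, bytes_per_param):
    """Approximate size of the weights alone, in decimal gigabytes.

    Hypothetical helper for illustration; ignores gradients,
    optimizer state, and activations.
    """
    return num_params * bytes_per_param / 1e9


# Nominal parameter counts taken from the model names (assumption).
t5_11b_fp32 = weight_memory_gb(11e9, FP32_BYTES)
t5_3b_fp32 = weight_memory_gb(3e9, FP32_BYTES)
gpt_neo_2_7b_fp16 = weight_memory_gb(2.7e9, FP16_BYTES)

print(f"t5-11b   FP32 weights: ~{t5_11b_fp32:.1f} GB")
print(f"t5-3b    FP32 weights: ~{t5_3b_fp32:.1f} GB")
print(f"gpt-neo  2.7b FP16 weights: ~{gpt_neo_2_7b_fp16:.1f} GB")
```

The t5-11b estimate lands close to the 46GB checkpoint size quoted above, which is consistent with the checkpoint being stored in 32-bit floats. The t5-3b estimate shows that FP32 weights alone take about half of a 24GB card before any training state is counted.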

ONNX RT

ONNX RT works with some models (not T5, yet) and can provide a small boost in speed.

Install ORT, then set TrainingArguments.torch_ort=True

pip install torch-ort -f https://onnxruntimepackages.z14.web.core.windows.net/onnxruntime_stable_torch190.cu111.html

python -m torch_ort.configure

Development

Building Package

python3 -m pip install --upgrade build twine
python3 -m build
python3 -m twine upload dist/*

Disclaimers

This library was developed as a personal project for my own use. Please feel free to fork it or use it for your own purposes as well. I will not take responsibility for any mishaps that occur as a result of this library’s usage.

Note for 3090 FE cards: if your fans hit 100%, it means your VRAM temperatures are high (>100 deg C). Training for long hours at these temperatures should in theory be fine, but if you want peace of mind (like me), you can lower the power limit at a minor cost to training speed. As long as your fans never hit 100%, your VRAM temperatures should be fine. For example, to lower the power limit to 300W (from 350W):

sudo nvidia-smi -pl 300