Language Model Fine-Tuning with LoRA
This page explores Low-Rank Adaptation (LoRA) (Hu, Edward J., et al., 2021) as a method for fine-tuning pre-trained language models, and demonstrates how to apply it using the open-source HuggingFace PEFT library.
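As a quick preview of what this looks like in practice, the sketch below wraps a small GPT-2 checkpoint with a LoRA adapter using PEFT. The model name, rank, and target modules are illustrative choices for this sketch, not values prescribed by this page.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

# Load a pre-trained base model (GPT-2 is used here purely as a small example)
base_model = AutoModelForCausalLM.from_pretrained("gpt2")

# Configure LoRA: rank, scaling, dropout, and which modules receive adapters
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor applied to the LoRA update
    lora_dropout=0.1,           # dropout applied to the LoRA layers
    target_modules=["c_attn"],  # GPT-2's fused attention projection (illustrative)
)

# Wrap the base model; only the small LoRA matrices are left trainable
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```

Running the last line prints the number of trainable parameters relative to the full model, which is the core appeal of LoRA: the vast majority of the pre-trained weights stay frozen.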
Language Model Pre-Training
Language models like BERT, BART, and GPT are pre-trained on vast amounts of unlabelled text data, such as the entirety of English Wikipedia, every question and answer on the StackExchange network, or The Pile (Gao, Leo, et al., 2020), a massive 825 GB corpus of English text which combines the previous two datasets with data gathered from 20 other sources.
Pre-training is performed on large clusters of costly, specialized hardware like GPUs and TPUs, and can take days or weeks to complete. The table below, which shows the hardware used to pre-train a few different language models, gives some idea of the cost and scale of this process:
| Model | Pre-Training Cluster | Estimated Daily Cost |
|---|---|---|
| XLM-RoBERTa[4] | 500 Nvidia V100 GPUs | $29,760 |
| OPT-175B[5] | 992 Nvidia A100 GPUs | $93,500 |
| LLaMA[6] | 2,048 Nvidia A100 GPUs | $193,000 |
| PaLM[7] | 6,144 Google v4 TPUs | $475,000 |
Note: daily cost estimates are based on the public GPU and TPU pricing published by Google Cloud at the time of writing.
During pre-training, each model is trained on a general-purpose, self-supervised objective, such as predicting the next word in a sentence (causal language modeling) or predicting randomly masked words within a piece of text (masked language modeling).
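To make those two objectives concrete, the short sketch below uses HuggingFace transformers pipelines to run a masked-word prediction with a BERT checkpoint and a next-word continuation with GPT-2; the specific model names and prompt are illustrative.

```python
from transformers import pipeline

# Masked-word objective (BERT-style): predict the token hidden behind [MASK]
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The capital of France is [MASK]."))

# Next-word objective (GPT-style): continue the text one token at a time
generate = pipeline("text-generation", model="gpt2")
print(generate("The capital of France is", max_new_tokens=5))
```

Both objectives require no human labels, since the "answer" is always recoverable from the raw text itself, which is what makes pre-training on web-scale corpora feasible.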