3.8 Preparing Data for Fine-Tuning Foundation Models

1. Preparing Training Data:

  • Many publicly available datasets and prompt template libraries are used to train large language models (LLMs); these libraries provide prompt templates for a wide range of tasks.
  • Once the instruction dataset is ready, divide it into three splits (see the sketch after this list):
    • Training dataset: used to fine-tune the model.
    • Validation dataset: used to evaluate the model during training.
    • Test dataset: used for the final performance evaluation.
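
A minimal sketch of such a split in Python, assuming the instruction dataset is a list of prompt/completion records; the 80/10/10 ratio and field names are illustrative, not prescribed:

```python
import random

def split_dataset(records, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle and split instruction records into train/validation/test."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder (~10%)
    return train, val, test

# Hypothetical instruction records with prompt/completion fields.
records = [{"prompt": f"Summarize text {i}", "completion": f"Summary {i}"}
           for i in range(1000)]
train, val, test = split_dataset(records)
print(len(train), len(val), len(test))  # 800 100 100
```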

2. Fine-Tuning Process:

  • During fine-tuning, you (see the sketch after this list):
    • Select prompts from the training dataset.
    • Pass the prompts to the LLM to generate completions.
    • Compare the model's output distribution over tokens with the reference completion (the label) to compute the loss.
    • Use the loss to update the model's weights via backpropagation.
    • Repeat over many batches; the accumulated weight updates improve the model's performance on the task.
  • Evaluation:
    • Use the validation dataset for intermediate evaluations during training.
    • After fine-tuning, use the test dataset for the final evaluation to obtain test accuracy.
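
The loop above can be sketched as a single training step with PyTorch and Hugging Face Transformers (assumed here purely for illustration; gpt2 stands in for any causal LLM, and in practice the prompt tokens are usually masked out of the loss):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for any causal LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

prompt = "Summarize: The quick brown fox jumps over the lazy dog."
completion = " A fox jumps over a dog."

# Concatenate prompt and reference completion; setting labels=input_ids
# makes the model compute cross-entropy between its predicted token
# distribution and the reference tokens.
inputs = tokenizer(prompt + completion, return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])

loss = outputs.loss   # cross-entropy vs. the training label
loss.backward()       # gradients w.r.t. the model's weights
optimizer.step()      # weight update
optimizer.zero_grad()
```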

3. Data Preparation in AWS:

  • Data preparation involves collecting, pre-processing, and organizing raw data for model use.
  • Low-code data preparation: Use Amazon SageMaker Canvas to define data flows with minimal coding.
  • Scalable data preparation: Use open-source frameworks like Apache Spark, Apache Hive, or Presto, integrated with Amazon SageMaker Studio Classic and Amazon EMR.
  • Serverless data preparation: Use AWS Glue for serverless, Apache Spark-based data transformation.
  • SQL data preparation: Use JupyterLab in Amazon SageMaker Studio for SQL-based data processing.
  • Feature discovery: Use Amazon SageMaker Feature Store to store, share, and manage feature data in a centralized repository.
  • Bias detection: Use Amazon SageMaker Clarify to detect bias in your data, such as imbalances across gender, race, or age (see the sketch after this list).
  • Data labeling: Use Amazon SageMaker Ground Truth to manage data labeling workflows for training datasets.
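
As a sketch of the bias-detection step, the following uses the SageMaker Python SDK's Clarify processor to compute pre-training bias metrics on a tabular dataset; the bucket, role ARN, and column names are hypothetical placeholders:

```python
from sagemaker import Session, clarify

session = Session()

# Hypothetical IAM role and instance settings.
clarify_processor = clarify.SageMakerClarifyProcessor(
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)

# Hypothetical dataset location and columns.
data_config = clarify.DataConfig(
    s3_data_input_path="s3://my-bucket/training-data.csv",
    s3_output_path="s3://my-bucket/clarify-output/",
    label="approved",  # target column
    headers=["age", "gender", "income", "approved"],
    dataset_type="text/csv",
)

bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],  # favorable outcome value
    facet_name="gender",            # attribute checked for imbalance
)

# Runs pre-training bias metrics (e.g., class imbalance) on the dataset.
clarify_processor.run_pre_training_bias(
    data_config=data_config,
    data_bias_config=bias_config,
)
```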

4. Continuous Pre-training:

  • Non-deterministic output: Generative AI models produce varied outputs for the same input, which makes them hard to validate.
  • To improve models, use metrics, benchmarks, and curated datasets to evaluate model capabilities and ensure safe outputs.
  • Continuous pre-training helps models accumulate broader domain knowledge and improve performance over time.
  • In Amazon Bedrock, you can continue pre-training models such as Amazon Titan Text Express and Amazon Titan Text Lite on unlabeled data in a secure environment (see the sketch below).
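
A sketch of starting such a continued pre-training job with boto3; the job name, role ARN, S3 paths, and hyperparameter values are hypothetical placeholders:

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Hypothetical names, ARNs, and S3 locations for illustration.
response = bedrock.create_model_customization_job(
    jobName="titan-continued-pretraining-job",
    customModelName="my-titan-text-express-cpt",
    roleArn="arn:aws:iam::123456789012:role/BedrockCustomizationRole",
    baseModelIdentifier="amazon.titan-text-express-v1",
    customizationType="CONTINUED_PRE_TRAINING",
    # Unlabeled corpus for continued pre-training.
    trainingDataConfig={"s3Uri": "s3://my-bucket/unlabeled-corpus/"},
    outputDataConfig={"s3Uri": "s3://my-bucket/bedrock-output/"},
    hyperParameters={
        "epochCount": "1",
        "batchSize": "1",
        "learningRate": "0.00001",
    },
)
print(response["jobArn"])
```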