3.8 Preparing Data for Fine-Tuning Foundation Models

1. Preparing Training Data:

  • Many publicly available datasets and prompt template libraries are used to train large language models (LLMs); these libraries provide prompt templates for a wide range of tasks.
  • Once the instruction dataset is ready, divide it into three splits (see the sketch after this list):
    • Training dataset: used to fine-tune the model.
    • Validation dataset: used to evaluate the model during training.
    • Test dataset: used for the final performance evaluation.
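
A minimal sketch of such a split in Python, assuming the instruction dataset is a list of prompt/completion records; the 80/10/10 ratio and field names are illustrative, not prescribed:

```python
import random

def split_dataset(records, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle and split instruction records into train/validation/test."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder (~10%)
    return train, val, test

# Hypothetical instruction records with prompt/completion fields.
records = [{"prompt": f"Summarize text {i}", "completion": f"Summary {i}"}
           for i in range(1000)]
train, val, test = split_dataset(records)
print(len(train), len(val), len(test))  # 800 100 100
```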

2. Fine-Tuning Process:

  • During fine-tuning, you (see the sketch after this list):
    • Select prompts from the training dataset.
    • Pass the prompts to the LLM to generate completions.
    • Compare the model's output distribution over tokens with the reference completion (the label) to compute the loss.
    • Use the loss to update the model's weights via backpropagation.
    • Repeat over many batches; the accumulated weight updates improve the model's performance on the task.
  • Evaluation:
    • Use the validation dataset for intermediate evaluations during training.
    • After fine-tuning, use the test dataset for the final evaluation to obtain test accuracy.
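
The loop above can be sketched as a single training step with PyTorch and Hugging Face Transformers (assumed here purely for illustration; gpt2 stands in for any causal LLM, and in practice the prompt tokens are usually masked out of the loss):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for any causal LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

prompt = "Summarize: The quick brown fox jumps over the lazy dog."
completion = " A fox jumps over a dog."

# Concatenate prompt and reference completion; setting labels=input_ids
# makes the model compute cross-entropy between its predicted token
# distribution and the reference tokens.
inputs = tokenizer(prompt + completion, return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])

loss = outputs.loss   # cross-entropy vs. the training label
loss.backward()       # gradients w.r.t. the model's weights
optimizer.step()      # weight update
optimizer.zero_grad()
```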

3. Data Preparation in AWS:

  • Data preparation involves collecting, pre-processing, and organizing raw data for model use.
  • Low-code data preparation: Use Amazon SageMaker Canvas to define data flows with minimal coding.
  • Scalable data preparation: Use open-source frameworks like Apache Spark, Apache Hive, or Presto, integrated with Amazon SageMaker Studio Classic and Amazon EMR.
  • Serverless data preparation: Use AWS Glue for serverless, Apache Spark-based data transformation.
  • SQL data preparation: Use JupyterLab in Amazon SageMaker Studio for SQL-based data processing.
  • Feature discovery: Use Amazon SageMaker Feature Store to store, share, and manage feature data in a centralized repository.
  • Bias detection: Use Amazon SageMaker Clarify to detect bias in your data, such as imbalances across gender, race, or age (see the sketch after this list).
  • Data labeling: Use Amazon SageMaker Ground Truth to manage data labeling workflows for training datasets.
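
As a sketch of the bias-detection step, the following uses the SageMaker Python SDK's Clarify processor to compute pre-training bias metrics on a tabular dataset; the bucket, role ARN, and column names are hypothetical placeholders:

```python
from sagemaker import Session, clarify

session = Session()

# Hypothetical IAM role and instance settings.
clarify_processor = clarify.SageMakerClarifyProcessor(
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)

# Hypothetical dataset location and columns.
data_config = clarify.DataConfig(
    s3_data_input_path="s3://my-bucket/training-data.csv",
    s3_output_path="s3://my-bucket/clarify-output/",
    label="approved",  # target column
    headers=["age", "gender", "income", "approved"],
    dataset_type="text/csv",
)

bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],  # favorable outcome value
    facet_name="gender",            # attribute checked for imbalance
)

# Runs pre-training bias metrics (e.g., class imbalance) on the dataset.
clarify_processor.run_pre_training_bias(
    data_config=data_config,
    data_bias_config=bias_config,
)
```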

4. Continuous Pre-training:

  • Non-deterministic output: Generative AI models produce varied outputs for the same input, which makes them hard to validate.
  • To improve models, use metrics, benchmarks, and curated datasets to evaluate model capabilities and ensure safe outputs.
  • Continuous pre-training helps models accumulate broader domain knowledge and improve performance over time.
  • In Amazon Bedrock, you can continue pre-training models such as Amazon Titan Text Express and Amazon Titan Text Lite on unlabeled data in a secure environment (see the sketch below).
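
A sketch of starting such a continued pre-training job with boto3; the job name, role ARN, S3 paths, and hyperparameter values are hypothetical placeholders:

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Hypothetical names, ARNs, and S3 locations for illustration.
response = bedrock.create_model_customization_job(
    jobName="titan-continued-pretraining-job",
    customModelName="my-titan-text-express-cpt",
    roleArn="arn:aws:iam::123456789012:role/BedrockCustomizationRole",
    baseModelIdentifier="amazon.titan-text-express-v1",
    customizationType="CONTINUED_PRE_TRAINING",
    # Unlabeled corpus for continued pre-training.
    trainingDataConfig={"s3Uri": "s3://my-bucket/unlabeled-corpus/"},
    outputDataConfig={"s3Uri": "s3://my-bucket/bedrock-output/"},
    hyperParameters={
        "epochCount": "1",
        "batchSize": "1",
        "learningRate": "0.00001",
    },
)
print(response["jobArn"])
```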