1. Preparing Training Data:
- Publicly available datasets and prompt template libraries are commonly used to build instruction datasets for fine-tuning large language models (LLMs); the template libraries provide ready-made prompt formats for a variety of tasks.
- Once the instruction dataset is ready, divide it into three splits (a minimal split sketch follows this list):
- Training dataset: Used to fine-tune the model.
- Validation dataset: Used for evaluating the model during training.
- Test dataset: Used for final performance evaluation.
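
Below is a minimal sketch of this three-way split in plain Python. The `split_dataset` helper, the prompt/completion record schema, and the 80/10/10 ratios are illustrative assumptions, not prescribed by the source.

```python
import random

def split_dataset(records, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle and split instruction records into train/validation/test."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle for reproducibility
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder held out for final evaluation
    return train, validation, test

# Hypothetical instruction records built from a prompt template.
records = [{"prompt": f"Summarize article {i}.", "completion": f"Summary {i}."}
           for i in range(100)]
train, validation, test = split_dataset(records)
print(len(train), len(validation), len(test))  # 80 10 10
```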
2. Fine-Tuning Process:
- During fine-tuning, you:
- Select prompts from the training dataset.
- Pass the prompts to the LLM to generate completions.
- Compare the model's predicted token distribution with the training label (the ground-truth completion) to calculate the loss, typically cross-entropy.
- Use the loss to update the model's weights through backpropagation.
- Over many batches and epochs, these weight updates improve the model's performance on the task (a training-step sketch follows this section).
- Evaluation:
- Use the validation dataset for intermediate evaluations.
- After fine-tuning, evaluate on the held-out test dataset to report final metrics such as test accuracy.
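
As a concrete illustration of the loop above, here is a hedged sketch of a single fine-tuning step using PyTorch and Hugging Face Transformers. The `gpt2` checkpoint, the sentiment prompt, and the learning rate are assumptions for illustration, not details from the source.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# One training example: prompt plus its ground-truth completion.
batch = tokenizer(
    "Classify this review: I loved this film!\nSentiment: positive",
    return_tensors="pt",
)
# For causal LMs, passing labels=input_ids makes the library compute the
# shifted cross-entropy loss between predicted distributions and labels.
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()   # backpropagate the loss
optimizer.step()          # apply the weight update
optimizer.zero_grad()
print(float(outputs.loss))
```

In practice this step runs over many batches from the training dataset, with periodic evaluation on the validation dataset.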
3. Data Preparation in AWS:
- Data preparation involves collecting, pre-processing, and organizing raw data for model use.
- Low-code data preparation: Use Amazon SageMaker Canvas to define data flows with minimal coding.
- Scalable data preparation: Use open-source frameworks like Apache Spark, Apache Hive, or Presto, integrated with Amazon SageMaker Studio Classic and Amazon EMR.
- Serverless data preparation: Use AWS Glue for serverless, Apache Spark-based data transformation.
- SQL data preparation: Use JupyterLab in SageMaker Studio for SQL-based data processing.
- Feature discovery: Use Amazon SageMaker Feature Store to manage feature data in a centralized repository (see the first sketch after this list).
- Bias detection: Use Amazon SageMaker Clarify to detect biases such as gender, race, or age imbalances in your data (see the second sketch after this list).
- Data labeling: Use Amazon SageMaker Ground Truth to manage data labeling workflows for training datasets.
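
A hedged sketch of registering and ingesting features with the SageMaker Python SDK's Feature Store classes follows. The feature group name, DataFrame schema, S3 URI, and role ARN are illustrative assumptions.

```python
import time
import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()
df = pd.DataFrame({
    "customer_id": ["c-001", "c-002"],
    "avg_spend": [42.0, 13.5],
    "event_time": [1718000000.0, 1718000000.0],  # required event-time feature
})

feature_group = FeatureGroup(name="customer-features", sagemaker_session=session)
feature_group.load_feature_definitions(data_frame=df)  # infer feature types from the DataFrame
feature_group.create(
    s3_uri="s3://my-bucket/feature-store",  # assumed offline-store location
    record_identifier_name="customer_id",
    event_time_feature_name="event_time",
    role_arn="arn:aws:iam::123456789012:role/MySageMakerRole",  # assumed role
    enable_online_store=True,
)
while feature_group.describe()["FeatureGroupStatus"] == "Creating":
    time.sleep(5)  # creation is asynchronous; wait before ingesting
feature_group.ingest(data_frame=df, max_workers=1, wait=True)
```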
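
And a hedged sketch of a pre-training bias check with SageMaker Clarify. The dataset columns, the `gender` facet, the S3 paths, and the role ARN are assumptions for illustration.

```python
from sagemaker import Session, clarify

session = Session()
processor = clarify.SageMakerClarifyProcessor(
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # assumed role ARN
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)
data_config = clarify.DataConfig(
    s3_data_input_path="s3://my-bucket/train.csv",    # assumed dataset location
    s3_output_path="s3://my-bucket/clarify-output",
    label="approved",                                 # assumed target column
    headers=["gender", "age", "income", "approved"],  # assumed column names
    dataset_type="text/csv",
)
bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],  # the favorable outcome value
    facet_name="gender",            # column checked for imbalance
)
# Runs pre-training bias metrics (e.g., class imbalance) on the raw data.
processor.run_pre_training_bias(data_config=data_config, bias_config=bias_config)
```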
4. Continuous Pre-training:
- Non-deterministic output: Generative AI models produce varied outputs for the same input, which makes them hard to validate.
- To improve models, use metrics, benchmarks, and datasets to evaluate model capabilities and ensure safe outputs.
- Continuous pre-training helps models accumulate broader knowledge and improve performance over time.
- In Amazon Bedrock, you can continue pre-training models such as Amazon Titan Text Express and Amazon Titan Text Lite on your own unlabeled data in a secure environment (a sketch follows).
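
Here is a hedged sketch of launching a continued pre-training job in Amazon Bedrock with boto3. The job name, role ARN, S3 URIs, hyperparameter values, and the exact base-model identifier are illustrative assumptions; check the Bedrock console for the identifiers available in your region.

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")
response = bedrock.create_model_customization_job(
    jobName="titan-continued-pretraining",            # assumed job name
    customModelName="my-titan-express-cpt",
    roleArn="arn:aws:iam::123456789012:role/MyBedrockRole",  # assumed role
    baseModelIdentifier="amazon.titan-text-express-v1",      # assumed identifier
    customizationType="CONTINUED_PRE_TRAINING",
    # Unlabeled corpus for continued pre-training (assumed location).
    trainingDataConfig={"s3Uri": "s3://my-bucket/unlabeled-corpus/"},
    outputDataConfig={"s3Uri": "s3://my-bucket/cpt-output/"},
    hyperParameters={"epochCount": "1", "batchSize": "1", "learningRate": "0.00001"},
)
print(response["jobArn"])  # track the customization job by its ARN
```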