Below are the steps for the Data Collection and Preparation phase of the machine learning lifecycle.
Identify Data Requirements
- Determine the type of data (streaming or batch) and its storage location.
- Plan for frequent re-training with new data, ensuring the data collection process is repeatable.
Data Collection
- Use ETL (Extract, Transform, Load) processes to gather data from multiple sources and store it in a centralized repository such as Amazon S3 (a minimal sketch follows).
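
As an illustration, the Python sketch below extracts a CSV export, applies a small transform with pandas, and loads the result into S3 with boto3. The file paths and bucket name are hypothetical placeholders; at scale, AWS Glue (listed below) automates the same pattern.

```python
# Minimal ETL sketch: extract a CSV, apply a simple transform, load to S3.
# File paths and the bucket name are hypothetical placeholders.
import boto3
import pandas as pd

# Extract: read raw data from a source system (here, a local CSV export).
raw = pd.read_csv("exports/sales_raw.csv")

# Transform: normalize column names and drop exact duplicate rows.
raw.columns = [c.strip().lower().replace(" ", "_") for c in raw.columns]
clean = raw.drop_duplicates()

# Load: write the cleaned file, then upload it to the central S3 repository.
clean.to_csv("sales_clean.csv", index=False)
s3 = boto3.client("s3")
s3.upload_file("sales_clean.csv", "my-ml-data-lake", "curated/sales/sales_clean.csv")
```
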
Labeling Data
- If the data is not labeled, consider a tool like SageMaker Ground Truth, which combines automated (active learning) labeling with human annotators (see the sketch below).
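
Below is a hedged sketch of launching a Ground Truth labeling job through boto3's create_labeling_job call. Every name, S3 URI, and ARN is a placeholder, and the pre-processing and consolidation Lambda ARNs must match the built-in task type and region you actually use.

```python
# Sketch: launch a SageMaker Ground Truth labeling job with boto3.
# All names, S3 URIs, and ARNs below are hypothetical placeholders.
import boto3

sm = boto3.client("sagemaker")
sm.create_labeling_job(
    LabelingJobName="product-images-labeling",
    LabelAttributeName="category",
    InputConfig={
        "DataSource": {
            "S3DataSource": {"ManifestS3Uri": "s3://my-ml-data-lake/manifests/input.manifest"}
        }
    },
    OutputConfig={"S3OutputPath": "s3://my-ml-data-lake/labels/"},
    RoleArn="arn:aws:iam::123456789012:role/GroundTruthExecutionRole",
    LabelCategoryConfigS3Uri="s3://my-ml-data-lake/labels/categories.json",
    HumanTaskConfig={
        "WorkteamArn": "arn:aws:sagemaker:us-east-1:123456789012:workteam/private-crowd/my-team",
        "UiConfig": {"UiTemplateS3Uri": "s3://my-ml-data-lake/templates/image-classification.liquid"},
        # Pre/post-processing Lambdas must match the chosen built-in task type.
        "PreHumanTaskLambdaArn": "arn:aws:lambda:us-east-1:432418664414:function:PRE-ImageMultiClass",
        "AnnotationConsolidationConfig": {
            "AnnotationConsolidationLambdaArn": "arn:aws:lambda:us-east-1:432418664414:function:ACS-ImageMultiClass"
        },
        "TaskTitle": "Classify product images",
        "TaskDescription": "Choose the category that best matches each image",
        "NumberOfHumanWorkersPerDataObject": 3,
        "TaskTimeLimitInSeconds": 300,
    },
)
```
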
Data Preprocessing
- Data Wrangling: Clean and format the data, e.g., handle missing values, outliers, and anomalies (see the sketch after this list).
- Use tools like AWS Glue or AWS Glue DataBrew to automate data transformation and preparation.
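
For example, a minimal pandas wrangling pass, assuming hypothetical price and region columns, might impute missing values and clip outliers like this:

```python
# Minimal data-wrangling sketch: impute missing values and clip outliers.
# The input path and column names are hypothetical.
import pandas as pd

df = pd.read_csv("sales_clean.csv")

# Fill missing numeric values with the column median (robust to outliers).
df["price"] = df["price"].fillna(df["price"].median())

# Fill missing categorical values with an explicit "unknown" category.
df["region"] = df["region"].fillna("unknown")

# Clip extreme values to the 1st/99th percentiles to limit outlier influence.
low, high = df["price"].quantile([0.01, 0.99])
df["price"] = df["price"].clip(lower=low, upper=high)
```
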
Feature Engineering
- Identify the features most predictive of the target in order to minimize model error.
- Drop redundant or uninformative features to reduce memory use and compute cost during training.
- Use SageMaker Feature Store to store, manage, and share features (see the sketch after this list).
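
A sketch of registering and ingesting engineered features with the SageMaker Python SDK follows; the feature group name, S3 URI, role ARN, and columns are all assumptions for illustration.

```python
# Sketch: register and ingest engineered features into SageMaker Feature Store.
# Feature group name, S3 URI, role ARN, and columns are hypothetical.
import time
import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()

# Engineered features; Feature Store needs a record identifier and event time.
features = pd.DataFrame({
    "customer_id": ["c1", "c2"],
    "avg_order_value": [42.0, 17.5],
    "orders_last_30d": [3, 1],
})
features["customer_id"] = features["customer_id"].astype("string")
features["event_time"] = time.time()

fg = FeatureGroup(name="customer-features", sagemaker_session=session)
fg.load_feature_definitions(data_frame=features)  # infer types from the DataFrame
fg.create(
    s3_uri="s3://my-ml-data-lake/feature-store/",
    record_identifier_name="customer_id",
    event_time_feature_name="event_time",
    role_arn="arn:aws:iam::123456789012:role/FeatureStoreRole",
    enable_online_store=True,
)

# Creation is asynchronous; wait until the feature group is ready.
while fg.describe()["FeatureGroupStatus"] == "Creating":
    time.sleep(5)

fg.ingest(data_frame=features, max_workers=2, wait=True)
```
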
Data Splitting
- Split the dataset into three subsets (a common 80/10/10 split; see the sketch after this list):
  - Training (80%): used to fit the model.
  - Validation (10%): used during development to tune hyperparameters and compare candidate models.
  - Testing (10%): held out for the final evaluation before deployment.
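
A common way to produce this split with scikit-learn is to call train_test_split twice, as in this sketch (the input path is hypothetical):

```python
# Sketch: an 80/10/10 train/validation/test split with scikit-learn.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("sales_prepared.csv")  # prepared feature table (hypothetical)

# First carve off 20% as a holdout, then split the holdout in half.
train_df, holdout_df = train_test_split(df, test_size=0.20, random_state=42)
val_df, test_df = train_test_split(holdout_df, test_size=0.50, random_state=42)

# Resulting proportions: 80% train, 10% validation, 10% test.
print(len(train_df), len(val_df), len(test_df))
```
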
Tools for Data Collection and Transformation
- SageMaker Feature Store: A centralized repository for storing, managing, and sharing features across ML development and inference.
- AWS Glue: A fully managed ETL service for data extraction, transformation, and loading.
- AWS Glue DataBrew: A visual data preparation tool for cleaning and transforming data without writing code.
- SageMaker Ground Truth: A data labeling service that uses active learning and human workers to create high-quality labeled datasets.