Below are the steps for the Data Collection and Preparation phase of the machine learning lifecycle.
Identify Data Requirements
- Determine the type of data (streaming or batch) and its storage location.
- Plan for frequent re-training with new data, ensuring the data collection process is repeatable.
Data Collection
- Use ETL (Extract, Transform, Load) processes to gather data from multiple sources and store it in a centralized repository such as Amazon S3 (a minimal sketch follows).
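
As an illustration, the Python sketch below extracts a CSV export, applies a small transform with pandas, and loads the result into S3 with boto3. The file paths and bucket name are hypothetical placeholders; at scale, AWS Glue (listed below) automates the same pattern.

```python
# Minimal ETL sketch: extract a CSV, apply a simple transform, load to S3.
# File paths and the bucket name are hypothetical placeholders.
import boto3
import pandas as pd

# Extract: read raw data from a source system (here, a local CSV export).
raw = pd.read_csv("exports/sales_raw.csv")

# Transform: normalize column names and drop exact duplicate rows.
raw.columns = [c.strip().lower().replace(" ", "_") for c in raw.columns]
clean = raw.drop_duplicates()

# Load: write the cleaned file, then upload it to the central S3 repository.
clean.to_csv("sales_clean.csv", index=False)
s3 = boto3.client("s3")
s3.upload_file("sales_clean.csv", "my-ml-data-lake", "curated/sales/sales_clean.csv")
```
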
Labeling Data
- If the data is not labeled, consider a tool like SageMaker Ground Truth, which combines automated (active learning) labeling with human annotators (see the sketch below).
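
Below is a hedged sketch of launching a Ground Truth labeling job through boto3's create_labeling_job call. Every name, S3 URI, and ARN is a placeholder, and the pre-processing and consolidation Lambda ARNs must match the built-in task type and region you actually use.

```python
# Sketch: launch a SageMaker Ground Truth labeling job with boto3.
# All names, S3 URIs, and ARNs below are hypothetical placeholders.
import boto3

sm = boto3.client("sagemaker")
sm.create_labeling_job(
    LabelingJobName="product-images-labeling",
    LabelAttributeName="category",
    InputConfig={
        "DataSource": {
            "S3DataSource": {"ManifestS3Uri": "s3://my-ml-data-lake/manifests/input.manifest"}
        }
    },
    OutputConfig={"S3OutputPath": "s3://my-ml-data-lake/labels/"},
    RoleArn="arn:aws:iam::123456789012:role/GroundTruthExecutionRole",
    LabelCategoryConfigS3Uri="s3://my-ml-data-lake/labels/categories.json",
    HumanTaskConfig={
        "WorkteamArn": "arn:aws:sagemaker:us-east-1:123456789012:workteam/private-crowd/my-team",
        "UiConfig": {"UiTemplateS3Uri": "s3://my-ml-data-lake/templates/image-classification.liquid"},
        # Pre/post-processing Lambdas must match the chosen built-in task type.
        "PreHumanTaskLambdaArn": "arn:aws:lambda:us-east-1:432418664414:function:PRE-ImageMultiClass",
        "AnnotationConsolidationConfig": {
            "AnnotationConsolidationLambdaArn": "arn:aws:lambda:us-east-1:432418664414:function:ACS-ImageMultiClass"
        },
        "TaskTitle": "Classify product images",
        "TaskDescription": "Choose the category that best matches each image",
        "NumberOfHumanWorkersPerDataObject": 3,
        "TaskTimeLimitInSeconds": 300,
    },
)
```
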
Data Preprocessing
- Data Wrangling: Clean and format the data, e.g., handle missing values, outliers, and anomalies (see the sketch after this list).
- Use tools like AWS Glue or AWS Glue DataBrew to automate data transformation and preparation.
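
For example, a minimal pandas wrangling pass, assuming hypothetical price and region columns, might impute missing values and clip outliers like this:

```python
# Minimal data-wrangling sketch: impute missing values and clip outliers.
# The input path and column names are hypothetical.
import pandas as pd

df = pd.read_csv("sales_clean.csv")

# Fill missing numeric values with the column median (robust to outliers).
df["price"] = df["price"].fillna(df["price"].median())

# Fill missing categorical values with an explicit "unknown" category.
df["region"] = df["region"].fillna("unknown")

# Clip extreme values to the 1st/99th percentiles to limit outlier influence.
low, high = df["price"].quantile([0.01, 0.99])
df["price"] = df["price"].clip(lower=low, upper=high)
```
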
Feature Engineering
- Identify the features most predictive of the target in order to minimize model error.
- Drop redundant or uninformative features to reduce memory use and compute cost during training.
- Use SageMaker Feature Store to store, manage, and share features (see the sketch after this list).
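
A sketch of registering and ingesting engineered features with the SageMaker Python SDK follows; the feature group name, S3 URI, role ARN, and columns are all assumptions for illustration.

```python
# Sketch: register and ingest engineered features into SageMaker Feature Store.
# Feature group name, S3 URI, role ARN, and columns are hypothetical.
import time
import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()

# Engineered features; Feature Store needs a record identifier and event time.
features = pd.DataFrame({
    "customer_id": ["c1", "c2"],
    "avg_order_value": [42.0, 17.5],
    "orders_last_30d": [3, 1],
})
features["customer_id"] = features["customer_id"].astype("string")
features["event_time"] = time.time()

fg = FeatureGroup(name="customer-features", sagemaker_session=session)
fg.load_feature_definitions(data_frame=features)  # infer types from the DataFrame
fg.create(
    s3_uri="s3://my-ml-data-lake/feature-store/",
    record_identifier_name="customer_id",
    event_time_feature_name="event_time",
    role_arn="arn:aws:iam::123456789012:role/FeatureStoreRole",
    enable_online_store=True,
)

# Creation is asynchronous; wait until the feature group is ready.
while fg.describe()["FeatureGroupStatus"] == "Creating":
    time.sleep(5)

fg.ingest(data_frame=features, max_workers=2, wait=True)
```
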
Data Splitting
- Split the dataset into three subsets (a common 80/10/10 split; see the sketch after this list):
  - Training (80%): used to fit the model.
  - Validation (10%): used during development to tune hyperparameters and compare candidate models.
  - Testing (10%): held out for the final evaluation before deployment.
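
A common way to produce this split with scikit-learn is to call train_test_split twice, as in this sketch (the input path is hypothetical):

```python
# Sketch: an 80/10/10 train/validation/test split with scikit-learn.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("sales_prepared.csv")  # prepared feature table (hypothetical)

# First carve off 20% as a holdout, then split the holdout in half.
train_df, holdout_df = train_test_split(df, test_size=0.20, random_state=42)
val_df, test_df = train_test_split(holdout_df, test_size=0.50, random_state=42)

# Resulting proportions: 80% train, 10% validation, 10% test.
print(len(train_df), len(val_df), len(test_df))
```
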
Tools for Data Collection and Transformation
- SageMaker Feature Store: A centralized repository for storing, managing, and sharing features across ML development and inference.
- AWS Glue: A fully managed ETL service for data extraction, transformation, and loading.
- AWS Glue DataBrew: A visual data preparation tool for cleaning and transforming data without writing code.
- SageMaker Ground Truth: A data labeling service that uses active learning and human workers to create high-quality labeled datasets.