ML Lifecycle - Data Collection and Preparation

Below are the steps for Data Collection and Preparation in the Machine Learning Lifecycle.

Identify Data Requirements

  • Determine the type of data (streaming or batch) and its storage location.
  • Plan for frequent re-training with new data, ensuring the data collection process is repeatable.

Data Collection

  • Use ETL (Extract, Transform, Load) processes to gather data from multiple sources and store it in a centralized repository (e.g., AWS S3).
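The ETL pattern can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the field names are made up, and the load step writes to a local file standing in for a central repository such as Amazon S3 (where you would typically use boto3's `upload_file` instead).

```python
import csv
import io

def extract(raw_csv):
    """Extract: parse rows from a raw CSV source."""
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(rows):
    """Transform: normalize values and drop incomplete rows."""
    cleaned = []
    for row in rows:
        if not row["price"]:
            continue  # skip rows with a missing price
        cleaned.append({"item": row["item"].strip().lower(),
                        "price": float(row["price"])})
    return cleaned

def load(rows, path):
    """Load: write cleaned records to a central store (a local file
    here, standing in for a repository such as Amazon S3)."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["item", "price"])
        writer.writeheader()
        writer.writerows(rows)

raw = "item,price\n Widget ,9.50\nGadget,\n"  # illustrative source data
records = transform(extract(raw))
```

Keeping extract, transform, and load as separate functions makes the pipeline repeatable, which matters when the same collection process must run before every re-training cycle.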

Labeling Data

  • If data is not labeled, consider using tools like SageMaker Ground Truth for labeling via machine learning or human workers.
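A Ground Truth labeling job reads its unlabeled inputs from a manifest in JSON Lines format, one object per line pointing at an item in S3. A minimal sketch of building such a manifest (the bucket and keys are illustrative):

```python
import json

def build_input_manifest(s3_uris):
    """Build a Ground Truth input manifest: one JSON object per line,
    each referencing an unlabeled object in S3 via "source-ref"."""
    return "\n".join(json.dumps({"source-ref": uri}) for uri in s3_uris)

manifest = build_input_manifest([
    "s3://my-bucket/images/img-001.jpg",  # hypothetical bucket/keys
    "s3://my-bucket/images/img-002.jpg",
])
```

The labeling job itself is then created through the SageMaker API or console, with this manifest as its input data source.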

Data Preprocessing

  • Data Wrangling: Clean and format the data (e.g., handle missing values, outliers, or anomalies).
  • Use tools like AWS Glue or AWS Glue DataBrew for automating data transformation and preparation.
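As a small illustration of the wrangling step, the sketch below imputes missing values with the median and clips outliers using the common 1.5 × IQR rule. The column values and thresholds are made up for the example; managed tools like Glue DataBrew offer these transformations without code.

```python
import statistics

def wrangle(values):
    """Clean a numeric column: impute missing values (None) with the
    median, then clip outliers outside 1.5 * IQR of the quartiles."""
    present = [v for v in values if v is not None]
    median = statistics.median(present)
    filled = [median if v is None else v for v in values]
    q1, _, q3 = statistics.quantiles(filled, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [min(max(v, lo), hi) for v in filled]

raw = [10.0, 12.0, None, 11.0, 9.0, 500.0]  # None = missing, 500 = outlier
clean = wrangle(raw)
```

Clipping rather than dropping outliers keeps the row count stable, which is often preferable when every record carries a label.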

Feature Engineering

  • Identify the most relevant features for minimizing model error.
  • Reduce unnecessary features to optimize memory and computing power during training.
  • Use SageMaker Feature Store to store and manage features.
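One simple way to reduce unnecessary features is variance filtering: a near-constant column consumes memory and compute during training while carrying little signal. A minimal sketch (feature names and the threshold are illustrative):

```python
import statistics

def drop_low_variance(features, threshold=1e-3):
    """Keep only features whose population variance exceeds the
    threshold; near-constant columns add cost but little signal."""
    return {name: col for name, col in features.items()
            if statistics.pvariance(col) > threshold}

features = {
    "age":    [23.0, 35.0, 41.0, 29.0],
    "height": [1.70, 1.71, 1.70, 1.70],  # nearly constant
    "flag":   [1.0, 1.0, 1.0, 1.0],      # constant
}
selected = drop_low_variance(features)
```

The surviving features are what you would then register in a store such as SageMaker Feature Store for reuse across training runs.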

Data Splitting

  • Split the dataset into three subsets:
      • Training (80%): used to fit the model.
      • Validation (10%): used to tune hyperparameters and evaluate performance during training.
      • Testing (10%): held out for the final evaluation before deployment.
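The 80/10/10 split above can be sketched as follows; shuffling with a fixed seed keeps the split reproducible across re-training runs (the seed value is arbitrary):

```python
import random

def split_dataset(rows, seed=42):
    """Shuffle with a fixed seed, then split 80/10/10 into
    train / validation / test subsets."""
    rows = rows[:]  # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * 0.8)
    n_val = int(len(rows) * 0.1)
    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(list(range(100)))
```

Shuffling before splitting matters: if the source data is ordered (for example by date or class), a straight slice would give the three subsets different distributions.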

Tools for Data Collection and Transformation

SageMaker Feature Store: Centralized feature storage for managing features during ML development.

AWS Glue: A fully managed ETL service for data extraction, transformation, and loading.

AWS Glue DataBrew: A visual data preparation tool that lets you clean and transform data without writing code.

SageMaker Ground Truth: A data-labeling service that combines active learning with human annotators to create high-quality labeled datasets.
