1. Model Performance Issues
- Inference: The output of a trained machine learning model — a prediction or classification made from new, unseen data.
- Overfitting:
- Definition: The model performs well on training data but poorly on new, unseen data. It becomes too specialized to the training set, failing to generalize.
- Cause: The model fits the training data too closely and may emphasize irrelevant details (noise).
- Solution: Train with more diverse data, and stop training once the model reaches the “sweet spot” where it still generalizes well to new data.
- Underfitting:
- Definition: The model is too simple and cannot capture meaningful patterns in the data, leading to poor performance on both training and new data.
- Cause: Insufficient training, a dataset that is too small, or a model that is too simple.
- Solution: Train for longer or use a more complex model if needed.
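The overfitting/underfitting contrast above can be sketched with a small polynomial-regression example (illustrative only — the sine data, noise level, and degrees are made up): a degree-1 fit is too simple to capture the curve (underfitting), while a degree-15 fit chases the noise and typically scores worse on held-out data (overfitting).

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of an underlying sine curve (stand-in for "real" data).
x_train = np.linspace(0, 3, 20)
y_train = np.sin(x_train) + rng.normal(0, 0.2, size=x_train.shape)
x_test = np.linspace(0.05, 2.95, 50)
y_test = np.sin(x_test) + rng.normal(0, 0.2, size=x_test.shape)

def fit_and_score(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

for degree in (1, 4, 15):  # underfit, reasonable fit, overfit
    train_mse, test_mse = fit_and_score(degree)
    print(f"degree {degree:2d}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

The high-degree model drives its training error toward zero but not its test error — the "sweet spot" is the middle model, where both errors are low.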
2. Bias in Machine Learning
- Bias: Occurs when a model shows disparities in performance for different groups, leading to skewed results that favor or disadvantage certain classes.
- Example: A loan approval model trained on data that doesn’t include enough diverse applicants could become biased against certain groups (e.g., women in specific locations).
- Cause: The training data may not be representative of the diversity of real-world scenarios, leading to skewed predictions.
- Solution:
- Diverse Data: Ensure the training data is representative of all relevant groups to avoid bias.
- Feature Weighting: Remove or down-weight features that introduce bias (e.g., gender or age).
- Fairness Constraints: Identify and address potential biases (like age or sex discrimination) early in the process.
- Ongoing Evaluation: Continuously evaluate models for fairness and adjust if necessary.
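One minimal way to put "ongoing evaluation" into practice for the loan-approval example is to compare approval rates across groups. The sketch below uses made-up decisions and group names, and computes a disparate-impact ratio (the lowest group's approval rate over the highest's — values well below 1.0 flag a group that is approved far less often):

```python
from collections import defaultdict

# Hypothetical model decisions for a loan-approval model: (group, approved).
decisions = [
    ("group_a", True), ("group_a", True), ("group_a", False), ("group_a", True),
    ("group_b", True), ("group_b", False), ("group_b", False), ("group_b", False),
]

def selection_rates(decisions):
    """Approval rate per group: approvals / total applications."""
    totals, approvals = defaultdict(int), defaultdict(int)
    for group, approved in decisions:
        totals[group] += 1
        approvals[group] += approved
    return {g: approvals[g] / totals[g] for g in totals}

rates = selection_rates(decisions)
# Disparate-impact ratio: min rate / max rate across groups.
impact_ratio = min(rates.values()) / max(rates.values())
print(rates, round(impact_ratio, 2))
```

Here group_b is approved at 0.25 versus 0.75 for group_a, so the ratio is about 0.33 — a signal to inspect the training data and features before trusting the model.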
3. Ensuring Model Fairness
- Quality of Data: The quality and quantity of the training data directly affect model accuracy and fairness.
- Bias Detection: Inspect and evaluate the training data to check for potential biases before building the model.
- Continuous Monitoring: Periodically evaluate the model’s output to ensure it remains fair across different demographic groups.
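Continuous monitoring can be as simple as recomputing per-group accuracy on each batch of production data and flagging the model for review when the gap between the best- and worst-served groups grows too large. The records, group names, and 10-point threshold below are all illustrative assumptions:

```python
def accuracy_by_group(records):
    """Per-group accuracy from (group, predicted_label, true_label) records."""
    correct, total = {}, {}
    for group, pred, truth in records:
        total[group] = total.get(group, 0) + 1
        correct[group] = correct.get(group, 0) + (pred == truth)
    return {g: correct[g] / total[g] for g in total}

def fairness_alert(records, max_gap=0.10):
    """True when the accuracy gap between the best- and worst-served
    groups exceeds max_gap, i.e. the model needs review."""
    acc = accuracy_by_group(records)
    return max(acc.values()) - min(acc.values()) > max_gap

# Hypothetical monitoring batch: the model is noticeably less accurate
# for group_b, so the check fires.
batch = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 1), ("group_a", 1, 0),
    ("group_b", 1, 0), ("group_b", 0, 1), ("group_b", 1, 1), ("group_b", 0, 0),
]
print(accuracy_by_group(batch), fairness_alert(batch))
```

In practice the same check would run on a schedule against fresh labeled data, with the threshold chosen to fit the application's fairness requirements.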
Key Terms to Remember:
- Inference: The output or prediction made by a trained machine learning model.
- Overfitting: When a model performs well on training data but poorly on new data, due to being too tailored to the training set.
- Underfitting: When a model is too simplistic and fails to capture the underlying patterns in the data.
- Bias: Disparities in model performance across different groups, leading to skewed results.
- Fairness: Ensuring that models do not discriminate against any particular group based on factors like age, gender, or location.
- Feature Weighting: Adjusting or removing biased features in a model to improve fairness.