5.10 Data Quality and Lifecycle Management Overview

Data Quality Management

Data quality management addresses the data quality issues uncovered by profiling or other checks. Resolving them means identifying each issue's root cause, which requires both business and technical knowledge of the data and of its role in the initiatives that depend on it.

Reporting Data Quality Issues

If an issue originates in the source system, alert the data steward and the business users responsible for that source so they can correct it there and keep data quality from degrading again.

Data Integration

Data integration is the process of collecting and merging data from multiple sources. Done well, it lets records from different systems be linked coherently, which gives a more complete view of the business.
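As a minimal sketch of the merging step, the example below joins records from two hypothetical sources on a shared key. The source names (`crm`, `billing`), the field names, and the `integrate` helper are assumptions made for illustration and do not come from any particular tool.

```python
from collections import defaultdict


def integrate(sources, key="customer_id"):
    """Merge rows from several sources into one record per key value.

    Behaves like an outer join: a key that appears in only one source
    still produces a (partial) record.
    """
    merged = defaultdict(dict)
    for rows in sources:
        for row in rows:
            merged[row[key]].update(row)
    return dict(merged)


# Hypothetical sample data from two systems.
crm = [
    {"customer_id": 1, "name": "Ana"},
    {"customer_id": 2, "name": "Ben"},
]
billing = [
    {"customer_id": 1, "balance": 40.0},
    {"customer_id": 3, "balance": 12.5},
]

combined = integrate([crm, billing])
```

Customer 1 ends up with both a name and a balance, while customers 2 and 3 keep only the fields their single source provided. In practice those gaps are where data quality follow-up usually starts.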

Master Data Management

Master data, such as customer, supplier, and product data, needs special attention. Master data management ensures consistency across systems by reconciling data like names or email addresses, helping avoid discrepancies.
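One way to picture the reconciliation step is sketched below: names and email addresses are normalized, and records that share a normalized email are collapsed into a single "golden" record. The matching rule (exact match on normalized email) and all helper and field names are simplifying assumptions. Real master data management tools use much richer matching.

```python
def normalize_email(email):
    """Trim whitespace and lowercase so trivially different emails match."""
    return email.strip().lower()


def normalize_name(name):
    """Collapse repeated spaces and apply consistent capitalization."""
    return " ".join(name.split()).title()


def reconcile(records):
    """Group records by normalized email into one golden record each."""
    golden = {}
    for rec in records:
        key = normalize_email(rec["email"])
        entry = golden.setdefault(
            key,
            {"email": key, "name": normalize_name(rec["name"]), "sources": []},
        )
        # Track which systems contributed to this golden record.
        entry["sources"].append(rec["source"])
    return golden


# Hypothetical records for the same customer held in two systems.
records = [
    {"source": "crm", "name": "  jane   DOE", "email": "Jane.Doe@Example.com "},
    {"source": "billing", "name": "Jane Doe", "email": "jane.doe@example.com"},
]
golden = reconcile(records)
```

Both input records resolve to the same key, so `golden` holds one entry that lists `crm` and `billing` as its sources.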

AWS Glue Data Catalog

AWS Glue Data Catalog stores metadata about data sources, including location and schema information. You can populate the catalog manually or automatically by running an AWS Glue crawler.
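The automatic route can be sketched with boto3, the AWS SDK for Python. The crawler name, IAM role ARN, database name, and S3 path below are placeholders, and the helper functions are illustrative. Only `create_crawler` and `start_crawler` are actual Glue client calls, and running them requires AWS credentials and an existing role.

```python
def crawler_request(name, role_arn, database, s3_path):
    """Build the arguments for the Glue create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,  # catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(request):
    """Create the crawler and run it once (needs AWS credentials)."""
    import boto3  # imported here so the request can be built without AWS access

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Placeholder values, assumed for illustration only.
request = crawler_request(
    name="sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    database="sales_db",
    s3_path="s3://example-bucket/sales/",
)
```

Calling `create_and_start_crawler(request)` would then scan the S3 path and write the inferred table definitions into the `sales_db` catalog database.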

AWS Glue Data Quality

AWS Glue Data Quality evaluates objects in the AWS Glue Data Catalog. It allows users to define and apply data quality rules, and automatically detects anomalies with machine learning. The results show which rules were met and which were not.
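Rules for AWS Glue Data Quality are written in the Data Quality Definition Language (DQDL). The sketch below assembles a small ruleset string and shows how it might be registered against a catalog table with boto3. The column names, the table and database names, and both helper functions are assumptions for illustration.

```python
def build_ruleset(rules):
    """Format a list of DQDL rule strings as a 'Rules = [...]' block."""
    return "Rules = [\n    " + ",\n    ".join(rules) + "\n]"


def register_ruleset(name, ruleset, database, table):
    """Attach the ruleset to a Data Catalog table (needs AWS credentials)."""
    import boto3  # imported here so the ruleset can be built without AWS access

    glue = boto3.client("glue")
    glue.create_data_quality_ruleset(
        Name=name,
        Ruleset=ruleset,
        TargetTable={"DatabaseName": database, "TableName": table},
    )


# Hypothetical columns on a hypothetical customers table.
ruleset = build_ruleset([
    'IsComplete "email"',                     # no missing emails
    'IsUnique "customer_id"',                 # no duplicate identifiers
    'ColumnValues "age" between 0 and 120',   # plausible value range
])
```

After an evaluation run, each of these rules is reported as passed or failed, which matches the per-rule results described above.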

Data Security

Data security involves defining who can access data and when. Data stewards help enable role-based and temporary access, guided by policies set by data owners.

Compliance

Compliance requires understanding and adhering to government regulations. Data owners collaborate with security and legal teams to make decisions about sensitive data, interpreting the rules and aligning them with business needs.

Data Lifecycle Management

Data lifecycle management focuses on storing data efficiently for easy access and optimized costs. The goal is to ensure that data is available to the right people and applications, while maintaining the right balance of control and access.

Balancing Control and Access

Data governance ensures a balance between control and access. Too much control locks data in silos, hindering innovation, while too little control exposes the business to risks.

AWS Lake Formation

AWS Lake Formation manages fine-grained access control for data lakes stored in Amazon S3. It enforces permissions at column, row, and cell levels, allowing data to be shared across different analytics and machine learning services.
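A column-level grant can be sketched with the boto3 Lake Formation client as follows. The principal ARN, database, table, and column names are placeholders, and the helper functions are illustrative. `grant_permissions` is the actual API call and requires AWS credentials plus a table already registered with Lake Formation.

```python
def column_grant(principal_arn, database, table, columns):
    """Build a SELECT grant restricted to the listed columns."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


def apply_grant(grant):
    """Send the grant to Lake Formation (needs AWS credentials)."""
    import boto3  # imported here so the grant can be built without AWS access

    boto3.client("lakeformation").grant_permissions(**grant)


# Placeholder values: an analyst role that may read two non-sensitive columns.
grant = column_grant(
    principal_arn="arn:aws:iam::123456789012:role/ExampleAnalystRole",
    database="sales_db",
    table="customers",
    columns=["customer_id", "country"],
)
```

With this grant in place, a query from the analyst role could read `customer_id` and `country` but not the other columns of the table.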

Data Lake Setup

With AWS Lake Formation, data from Amazon S3, relational, and NoSQL databases can be moved to a data lake. Data is cataloged and permissions are set, ensuring users have appropriate access to the data.

Training Data Storage and Management

Training data for models is typically stored in Amazon S3 buckets. Even when the data is no longer actively used, it may need to be retained for compliance purposes. Amazon S3 offers several storage classes that help optimize costs based on how often the data is accessed.

S3 Storage Classes

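S3 storage classes help manage costs for different access patterns. They include S3 Standard for frequently accessed data, S3 Standard-IA for infrequently accessed data, and the S3 Glacier classes for long-term archive, which differ in retrieval time and cost. S3 Intelligent-Tiering moves objects between access tiers automatically when access patterns are unknown or changing.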
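The storage class is chosen per object at upload time. The sketch below maps a few class names to the identifiers the S3 API expects and builds the arguments for a boto3 `put_object` call. The bucket name, object key, and helper functions are placeholders for illustration.

```python
# Storage class names and the identifiers used by the S3 API.
STORAGE_CLASS_IDS = {
    "S3 Standard": "STANDARD",
    "S3 Standard-IA": "STANDARD_IA",
    "S3 Intelligent-Tiering": "INTELLIGENT_TIERING",
    "S3 Glacier Flexible Retrieval": "GLACIER",  # formerly named S3 Glacier
    "S3 Glacier Deep Archive": "DEEP_ARCHIVE",
}


def upload_kwargs(bucket, key, body, storage_class_name):
    """Build put_object arguments that store the object in the given class."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "StorageClass": STORAGE_CLASS_IDS[storage_class_name],
    }


def upload(kwargs):
    """Upload the object (needs AWS credentials)."""
    import boto3  # imported here so the arguments can be built without AWS access

    boto3.client("s3").put_object(**kwargs)


# Placeholder bucket and key for an older, rarely read training set.
kwargs = upload_kwargs(
    "example-training-data", "2021/train.csv", b"id,label\n", "S3 Standard-IA"
)
```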

Lifecycle Rules for S3 Buckets

S3 lifecycle rules automate the transition of data between storage classes based on object age. For example, objects can move from S3 Standard to S3 Standard-IA after a set number of days and later to S3 Glacier for long-term storage, and a rule can also expire (delete) objects once they are no longer needed.
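The example transition path can be expressed as a lifecycle configuration. The rule below uses object age in days as the trigger, which is how lifecycle transitions are configured. The bucket name, prefix, rule ID, day counts, and helper functions are assumptions for illustration. `put_bucket_lifecycle_configuration` is the actual boto3 call and requires AWS credentials.

```python
def lifecycle_config(rule_id, prefix, ia_days=30, glacier_days=90, expire_days=None):
    """Build a lifecycle configuration: Standard -> Standard-IA -> Glacier."""
    rule = {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},  # only objects under this prefix
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_days, "StorageClass": "GLACIER"},
        ],
    }
    if expire_days is not None:
        # Optionally delete objects once they reach this age.
        rule["Expiration"] = {"Days": expire_days}
    return {"Rules": [rule]}


def apply_lifecycle(bucket, config):
    """Attach the configuration to a bucket (needs AWS credentials)."""
    import boto3  # imported here so the config can be built without AWS access

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )


# Placeholder values: archive training data, delete it after about seven years.
config = lifecycle_config("archive-training-data", "training/", expire_days=2555)
```

Leaving `expire_days` unset keeps archived objects indefinitely, which may suit data that must be retained for compliance.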
