Data Quality Management
Data quality management focuses on resolving data quality issues found through profiling or other methods. It involves identifying the root cause of each issue, which requires both business and technical knowledge of the data and of its role in business initiatives.
Reporting Data Quality Issues
If a problem originates at the data source, alert the data steward and the business users so that they can correct it at the source and keep data quality from degrading again.
Data Integration
Data integration is the process of collecting and merging data from multiple sources. This ensures that data from different systems links coherently, providing a more complete view.
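As a minimal sketch of the merging step, the function below combines rows from two sources into one record per key. The source names (crm, billing), the customer_id key, and the "keep the first non-empty value" rule are assumptions chosen for illustration, not a prescribed integration method.

```python
def integrate(crm_rows, billing_rows, key="customer_id"):
    """Merge rows from two hypothetical sources into one record per key."""
    merged = {}
    for source_name, rows in (("crm", crm_rows), ("billing", billing_rows)):
        for row in rows:
            # Create the combined record the first time a key is seen.
            record = merged.setdefault(row[key], {key: row[key], "sources": []})
            record["sources"].append(source_name)
            for field, value in row.items():
                # Keep the first non-empty value seen for each field.
                if field != key and value not in (None, "") and field not in record:
                    record[field] = value
    return list(merged.values())
```

The sources list on each record keeps track of which systems contributed to it, which helps when a data quality issue later has to be traced back to its origin.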
Master Data Management
Master data, such as customer, supplier, and product data, needs special attention. Master data management ensures consistency across systems by reconciling data like names or email addresses, helping avoid discrepancies.
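The reconciliation described above can be sketched as follows. This is a simplified illustration that assumes each record carries hypothetical id and email fields and that records with the same normalized email address refer to the same customer; real master data management tools use richer matching rules.

```python
from collections import defaultdict


def normalize_email(email):
    """Trim surrounding whitespace and ignore letter case."""
    return email.strip().lower()


def normalize_name(name):
    """Collapse repeated whitespace and ignore letter case."""
    return " ".join(name.split()).casefold()


def find_duplicates(records):
    """Group record IDs that share the same normalized email address."""
    groups = defaultdict(list)
    for rec in records:
        groups[normalize_email(rec["email"])].append(rec["id"])
    # Only emails that appear on more than one record need reconciling.
    return {email: ids for email, ids in groups.items() if len(ids) > 1}
```

Each group returned by find_duplicates is a candidate set of records to review and merge into a single master record.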
AWS Glue Data Catalog
AWS Glue Data Catalog stores metadata about data sources, including location and schema information. You can populate the catalog manually or automatically by running an AWS Glue crawler.
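Populating the catalog automatically might look like the sketch below, which uses the create_crawler and start_crawler operations of the boto3 Glue client. The function name register_crawler and all argument values are hypothetical, and the client is passed in so the function can be exercised without an AWS account.

```python
def register_crawler(glue_client, name, role_arn, database, s3_path):
    """Create a crawler for one S3 path and start it.

    glue_client is expected to behave like boto3.client("glue").
    """
    glue_client.create_crawler(
        Name=name,
        Role=role_arn,            # IAM role the crawler assumes (assumed to exist)
        DatabaseName=database,    # catalog database that receives the tables
        Targets={"S3Targets": [{"Path": s3_path}]},
    )
    # Running the crawler infers schemas and writes table metadata to the catalog.
    glue_client.start_crawler(Name=name)
```

With boto3 installed and credentials configured, a call could pass boto3.client("glue") as the first argument, along with a role and bucket path from your own account.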
AWS Glue Data Quality
AWS Glue Data Quality evaluates objects in the AWS Glue Data Catalog. It allows users to define and apply data quality rules, and automatically detects anomalies with machine learning. The results show which rules were met and which were not.
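Rules for AWS Glue Data Quality are written in the Data Quality Definition Language (DQDL). The sketch below assembles a small ruleset string; the helper name build_ruleset and the column names customer_id and email are assumptions made for the example.

```python
def build_ruleset(rules):
    """Join individual DQDL rules into a 'Rules = [ ... ]' ruleset string."""
    if not rules:
        raise ValueError("a ruleset needs at least one rule")
    body = ",\n".join(f"    {rule}" for rule in rules)
    return f"Rules = [\n{body}\n]"


# Hypothetical rules for a customer table.
CUSTOMER_RULES = [
    "RowCount > 0",                  # the table must not be empty
    'IsComplete "customer_id"',      # no missing customer IDs
    'IsUnique "customer_id"',        # no duplicate customer IDs
    'Completeness "email" > 0.95',   # at least 95% of rows have an email
]

ruleset = build_ruleset(CUSTOMER_RULES)
```

The resulting string could then be supplied as the Ruleset argument when creating a ruleset, for example through the boto3 Glue client's create_data_quality_ruleset operation.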
Data Security
Data security involves defining who can access data and when. Data stewards help enable role-based and temporary access, guided by policies set by data owners.
Compliance
Compliance requires understanding and adhering to government regulations. Data owners collaborate with security and legal teams to make decisions about sensitive data, interpreting the rules and aligning them with business needs.
Data Lifecycle Management
Data lifecycle management focuses on storing data efficiently for easy access and optimized costs. The goal is to ensure that data is available to the right people and applications, while maintaining the right balance of control and access.
Balancing Control and Access
Data governance ensures a balance between control and access. Too much control locks data in silos, hindering innovation, while too little control exposes the business to risks.
AWS Lake Formation
AWS Lake Formation manages fine-grained access control for data lakes stored in Amazon S3. It enforces permissions at column, row, and cell levels, allowing data to be shared across different analytics and machine learning services.
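A column-level grant might be expressed as in the sketch below, which uses the grant_permissions operation of the boto3 Lake Formation client. The function name grant_column_select and every argument value are hypothetical, and the client is passed in so the function can be tried against a stub.

```python
def grant_column_select(lf_client, principal_arn, database, table, columns):
    """Allow a principal to SELECT only the listed columns of one table.

    lf_client is expected to behave like boto3.client("lakeformation").
    """
    lf_client.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": principal_arn},
        Resource={
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),  # columns not listed stay hidden
            }
        },
        Permissions=["SELECT"],
    )
```

Granting access to only the columns an analyst needs, such as order totals but not customer email addresses, is one way to share data while keeping sensitive fields restricted.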
Data Lake Setup
With AWS Lake Formation, data from Amazon S3 and from relational and NoSQL databases can be moved into a data lake. The data is then cataloged and permissions are set, so that users have appropriate access to it.
Training Data Storage and Management
Training data for models is typically stored in Amazon S3 buckets. Even when the data is no longer actively used, it may still need to be retained for compliance purposes. Amazon S3 offers several storage classes to optimize costs based on access frequency.
S3 Storage Classes
S3 storage classes help manage costs for different access patterns. The classes include S3 Standard for frequently accessed data, S3 Standard-IA for infrequently accessed data, and the S3 Glacier storage classes for long-term archives, which differ in retrieval time and cost.
Lifecycle Rules for S3 Buckets
S3 lifecycle rules automate the transition of data between storage classes based on object age. For example, data can move from S3 Standard to S3 Standard-IA, then to S3 Glacier for long-term storage, and finally be expired (deleted) once it no longer needs to be retained.
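A lifecycle rule of this kind could be written as the configuration sketched below, in the shape accepted by the boto3 S3 client's put_bucket_lifecycle_configuration operation. Transitions are keyed on the number of days since an object was created. The function name, the rule ID, the prefix, and the day thresholds are all illustrative assumptions rather than recommended values.

```python
def build_lifecycle_configuration(prefix, ia_after_days=30,
                                  glacier_after_days=90, expire_after_days=365):
    """Build a lifecycle configuration: Standard -> Standard-IA -> Glacier -> delete."""
    if not ia_after_days < glacier_after_days < expire_after_days:
        raise ValueError("thresholds must increase: IA < Glacier < expiration")
    return {
        "Rules": [
            {
                "ID": "archive-training-data",   # hypothetical rule name
                "Filter": {"Prefix": prefix},    # only objects under this prefix
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
                # Delete objects once the retention period has passed.
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


config = build_lifecycle_configuration("training-data/")
```

With boto3 installed and credentials configured, the dictionary could be passed as the LifecycleConfiguration argument of put_bucket_lifecycle_configuration, together with the name of your own bucket. If data must be kept for compliance, the expiration threshold should be set no shorter than the required retention period.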