data governance frameworks and compliance requirements.
Key Responsibilities:
Design & Develop Data Pipelines:
- Architect and implement end-to-end data pipelines using AWS S3, EMR, Glue, Step Functions, Apache NiFi, and Spark.
- Manage data ingestion from AWS S3, ensuring secure and efficient data transfer.
- Implement initial data routing, validation, and transformations using Apache NiFi processors and Spark engines.
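The routing-and-validation step above is what NiFi processors such as ValidateRecord/RouteOnAttribute perform; as a minimal sketch of the same logic (the field names here are placeholders, not the actual schema):

```python
# Assumed required fields for an incoming record; the real schema
# would come from the pipeline's data contract.
REQUIRED_FIELDS = {"id", "timestamp", "payload"}

def route_records(records):
    """Split incoming records into 'valid' and 'invalid' routes,
    mirroring a NiFi validate-then-route flow."""
    routes = {"valid": [], "invalid": []}
    for rec in records:
        # A record is valid only if all required fields are present
        # and the id is non-empty.
        if REQUIRED_FIELDS.issubset(rec) and rec.get("id"):
            routes["valid"].append(rec)
        else:
            routes["invalid"].append(rec)
    return routes
```

Invalid records would typically be quarantined to a separate S3 prefix for inspection rather than dropped.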
Data Processing & Transformation:
- Integrate AWS EMR, Apache NiFi, and Spark to perform complex data transformations and analytics.
- Optimize Spark jobs for processing large-scale datasets with a focus on performance and resource utilization.
- Handle both historical and incremental data loads, ensuring data consistency and integrity.
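A common way to keep historical and incremental loads consistent is watermark-based slicing: each run processes only records newer than the last stored watermark, so re-runs stay idempotent. A minimal sketch, assuming an `updated_at` field (a placeholder for whatever change-tracking column the source provides):

```python
def incremental_slice(records, last_watermark):
    """Return records newer than the stored watermark, plus the new
    watermark to persist for the next run."""
    fresh = [r for r in records if r["updated_at"] > last_watermark]
    # If nothing new arrived, keep the old watermark unchanged.
    new_watermark = max((r["updated_at"] for r in fresh),
                       default=last_watermark)
    return fresh, new_watermark
```

A full historical load is then just the same function with the watermark set to its minimum value.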
Data Storage & Management:
- Define and implement data storage strategies across S3, RDS, and Redshift, adhering to business requirements.
- Manage data catalog creation and schema management using AWS Glue.
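Glue catalog entries can be created either by crawlers or explicitly via the API. The helper below builds the `TableInput` payload that boto3's `glue_client.create_table(DatabaseName=..., TableInput=...)` expects for an external Parquet table on S3; it only constructs the dict (no AWS call), and the table name, location, and columns are illustrative:

```python
def glue_table_input(name, location, columns):
    """Build a Glue TableInput dict for an external Parquet table.

    columns: list of (name, hive_type) tuples, e.g. ("id", "string").
    """
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [{"Name": c, "Type": t} for c, t in columns],
            "Location": location,
            # Standard Hive Parquet formats/serde used by Glue tables.
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet."
                           "MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet."
                            "MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io."
                                        "parquet.serde.ParquetHiveSerDe",
            },
        },
    }
```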
Automation & Orchestration:
- Develop and manage workflows using Apache Airflow and AWS Step Functions to automate data processing tasks.
- Implement monitoring, error handling, and retries within the orchestration framework.
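The retry behavior described above (as configured via Airflow task retries or a Step Functions `Retry` block) boils down to re-running a task with exponential backoff. A minimal, framework-free sketch of that policy:

```python
import time

def with_retries(task, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run task(), retrying on failure with exponential backoff.

    sleep is injectable so tests can skip real waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                # Out of attempts: surface the error to the
                # orchestrator's failure/alerting path.
                raise
            # Delays of base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

In practice the orchestrator owns this loop; the sketch just shows the semantics being configured.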
Security & Compliance:
- Ensure data security with encryption (AES-256, TLS) and IAM role-based access controls.
- Implement data governance policies using AWS Glue Data Catalog to ensure compliance with regulatory requirements.
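Role-based access control is expressed as IAM policy documents. The helper below builds a least-privilege, read-only policy for a single S3 prefix; the bucket and prefix names are placeholders, and the dict is the standard IAM policy JSON that would be attached to a role:

```python
def s3_read_policy(bucket, prefix):
    """Build an IAM policy granting read-only access to one S3 prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Read objects only under the given prefix.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            },
            {
                # Allow listing, but only within that prefix.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {
                    "StringLike": {"s3:prefix": [f"{prefix}/*"]}
                },
            },
        ],
    }
```

Encryption in transit (TLS) and at rest (AES-256 / SSE) is configured separately on the bucket and endpoints; the policy controls only who may read.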
Performance Monitoring & Optimization:
- Use AWS CloudWatch to monitor the performance of EMR clusters, NiFi flows, and data storage.
- Continuously optimize Spark job configurations and NiFi data flows for maximum throughput and minimal latency.
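Custom pipeline metrics (e.g. flow throughput) can be pushed to CloudWatch via `put_metric_data`. The sketch below only builds the argument dict for boto3's `cloudwatch.put_metric_data(**payload)`; the namespace, dimension, and metric names are illustrative, not a fixed convention:

```python
def throughput_metric(namespace, flow_name, records, seconds):
    """Build a put_metric_data payload reporting records/second
    for one NiFi flow."""
    return {
        "Namespace": namespace,
        "MetricData": [
            {
                "MetricName": "RecordsPerSecond",
                # Dimension lets CloudWatch separate metrics per flow.
                "Dimensions": [{"Name": "Flow", "Value": flow_name}],
                "Value": records / seconds,
                "Unit": "Count/Second",
            }
        ],
    }
```

Alarms on such metrics (and on built-in EMR cluster metrics) then feed the tuning loop for Spark and NiFi configurations.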