Senior Data Engineer

Houghton Mifflin Harcourt Technology Private Limit

All India, Pune 3 to 7 Yrs 1 month ago

Job Description

Role Overview:

Software Engineering at HMH develops software that addresses the needs of teachers and learners and supports a wide array of next-generation learning experiences. As part of our team, you will design and build custom applications and services used by millions. We are seeking passionate and innovative software professionals to join our small, self-contained development teams in creating high-quality products and services. Embracing a variety of technologies, we are building a microservices platform to make our learning tools and content more accessible to all customers. If you are enthusiastic about making a positive impact in the education sector and have the skills to deliver high-quality software, we invite you to explore this opportunity further.

Key Responsibilities:

Design, build, and maintain ETL/ELT data pipelines from diverse data sources such as databases, APIs, event streams, and files.
Develop and manage data warehouse/lake solutions using platforms like Snowflake, BigQuery, Redshift, or Databricks.
Implement data quality checks, validation, and monitoring procedures to ensure high data reliability.
Optimize queries and pipelines for performance, scalability, and cost efficiency.
Collaborate with stakeholders to understand data requirements and translate them into technical solutions.
Document data models, pipelines, and systems.
Implement data governance, security, and privacy standards including access control and PII handling.
Participate in code reviews and design discussions, and help improve data engineering standards and tooling.
Troubleshoot and resolve data-related issues in production environments.

Qualifications Required:

Bachelor's degree in Computer Science, Engineering, Information Systems, Mathematics, or equivalent practical experience.
3 to 5 years of relevant experience.
Proficiency in SQL with experience in complex joins, window functions, and performance tuning.
Hands-on experience with at least one programming language for data engineering, preferably Python.
Familiarity with ETL/ELT tools or frameworks like Airflow, dbt, Luigi, Kafka Streams, Flink, or custom pipelines.
Experience with relational databases such as PostgreSQL, MySQL, or SQL Server, and with large datasets.
Experience with at least one cloud platform (AWS, GCP, or Azure) and its data services like S3/GCS/ADLS, Redshift/BigQuery/Synapse, EMR/Dataproc.
Understanding of data modeling, warehousing, and orchestration concepts.
Knowledge of version control (Git) and CI/CD practices for data code.
Strong problem-solving skills and ability to work with incomplete or ambiguous requirements.
Good communication skills and ability to collaborate effectively in cross-functional teams.

Technology Stack:

Languages: SQL, Python, JavaScript
IaC: Terraform
Orchestration: dbt
Warehousing/Lake: Snowflake
Storage: S3
Streaming: Pub/Sub
Infra/DevOps: Docker, GitHub/GitLab, CI/CD
