Software Engineer with 4+ years of experience designing / implementing ETL / ELT processes for on-prem databases / cloud warehouses using a combination of Python, Spark, SQL scripting & data engineering tools (Azure Cloud)
Solid background in database concepts and ETL architectures / strategies / administration, with the ability to design compute-efficient data pipelines
Prior experience building large-scale (big data) processing pipelines / engines using PySpark (see the sketch after this list)
Azure Data Factory, Databricks, Data Lake Storage
Good communication skills
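As a rough illustration of the PySpark pipeline experience described above, the sketch below reads raw records from a hypothetical landing zone, applies basic cleaning / typing, and writes a curated output. The storage paths, column names, and rules are illustrative assumptions only, not part of any specific project stack.

```python
# Minimal PySpark ETL sketch; paths, columns, and rules are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders_etl_sketch").getOrCreate()

# Extract: read raw CSV files from an assumed landing zone in Data Lake Storage.
raw = (
    spark.read
    .option("header", True)
    .csv("abfss://landing@examplelake.dfs.core.windows.net/orders/")
)

# Transform: deduplicate, cast types, and drop records that fail basic checks.
cleaned = (
    raw.dropDuplicates(["order_id"])
       .withColumn("order_ts", F.to_timestamp("order_ts"))
       .withColumn("order_date", F.to_date("order_ts"))
       .withColumn("amount", F.col("amount").cast("double"))
       .filter(F.col("amount").isNotNull() & (F.col("amount") >= 0))
)

# Load: write curated, partitioned Parquet for downstream modeling and analysis.
(
    cleaned.write
    .mode("overwrite")
    .partitionBy("order_date")
    .parquet("abfss://curated@examplelake.dfs.core.windows.net/orders/")
)
```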
Good to Have:
Snowflake Cloud
dbt
Familiarity with / experience in SWE best practices for unit testing, code modularization, and QA (see the sketch after this list)
Coursework / past projects / GitHub repos illustrating familiarity with data warehousing best practices
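To illustrate the unit-testing and code-modularization practices mentioned above, here is a minimal pytest-style sketch. The transformation function and test records are hypothetical and exist only to show the pattern of keeping transformations in small, independently testable units.

```python
# Hypothetical example of unit-testing a small, modular transformation with pytest.

def normalize_amounts(records):
    """Drop rows with missing amounts and cast amounts to float."""
    cleaned = []
    for row in records:
        amount = row.get("amount")
        if amount in (None, ""):
            continue  # skip unusable rows instead of failing the whole batch
        cleaned.append({**row, "amount": float(amount)})
    return cleaned


def test_normalize_amounts_drops_missing_values():
    records = [{"order_id": 1, "amount": "10.5"}, {"order_id": 2, "amount": None}]
    assert normalize_amounts(records) == [{"order_id": 1, "amount": 10.5}]


def test_normalize_amounts_casts_strings_to_float():
    assert normalize_amounts([{"order_id": 3, "amount": "7"}])[0]["amount"] == 7.0
```

Keeping each transformation as a small pure function like this is what makes the QA and modularization practices above straightforward to apply and verify.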
Job Responsibilities
Responsible for the design, development, and implementation of data integration processes using SQL / Python Scripting / Azure Data Flow / Other Cloud Platforms & Tools
Responsible for setting up end-to-end data pipelines covering importing, cleaning, transforming, validating, and analyzing data to support data modeling, data integration, and decision making (a rough outline is sketched at the end of this list)
Collaborate with business users, analyze user requirements, and translate & apply business rules to data transformations
Lead and mentor junior developers in information integration standards, best practices, etc.
Create functional & technical documentation, e.g. data integration architecture flows, source-to-target mappings, ETL specification documents, run books, and test plans
Perform data collection, profiling, validation, cleansing, analysis, and reporting; test, debug, and document ETL processes, SQL queries, and stored procedures
Initiate analysis to identify and resolve data inconsistencies in order to raise the level of data integrity
Analyze data volumes, data types, and content to support the design of data architecture solutions
Work with Data Warehouse Architects on source system analysis, identification of key data issues, data profiling, and development of normalized and star / snowflake physical schemas
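As a compact illustration of the profiling and validation responsibilities above (and the end-to-end flow outlined earlier in this list), the sketch below computes simple column-level profile metrics and flags basic rule violations with PySpark. The table name, columns, and rules are hypothetical assumptions for illustration only.

```python
# Minimal PySpark sketch of data profiling and basic validation checks.
# The table name, columns, and rules are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("profiling_sketch").getOrCreate()

df = spark.table("curated.orders")  # assumed curated table

# Profile: overall row count plus null and distinct counts per column.
profile = df.select(
    F.count(F.lit(1)).alias("row_count"),
    *[F.sum(F.col(c).isNull().cast("int")).alias(f"{c}_nulls") for c in df.columns],
    *[F.countDistinct(c).alias(f"{c}_distinct") for c in df.columns],
)
profile.show(truncate=False)

# Validate: surface records that break simple business rules for follow-up.
violations = df.filter(F.col("order_id").isNull() | (F.col("amount") < 0))
print(f"rule violations found: {violations.count()}")
```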