24x7x365 on-call support (in rotation) to manage and execute the Incident Management process.
Fast and effective response to service failure Alerts and Notifications from a range of systems.
Impact and severity assessments of service failures for both internal and external stakeholders.
Management of Bank/PG downtime and other service issues against SLA targets. Escalation of downtime within the Bank/PG, as well as internally.
Accurately tracking progress and escalations on issues in internal ticketing systems.
Updating merchants/internal stakeholders on the status of any service outage, either directly by phone and email or via the ticketing tool.
Notifying merchants via email of any planned maintenance, either internal or Bank/PG.
Managing the outcomes of Reason for Outage (RFO/RCA) and Major Incident Reports (MIR) both internally and externally.
Hands-on experience with databases (SQL).
Hands-on experience with Python and shell scripting.
Software development for automating repeatable operations tasks (toil).
SRE metrics and monitoring strategy (SLI, SLO, etc.).
Scheduling and leading all continuous improvement activities, including incident reviews, change implementation reviews, and identification of candidate areas for toil automation.
Optimizing the Software Development Life Cycle (SDLC) based on post-incident reviews to boost service reliability.
Documenting the knowledge gained to ensure a seamless flow of information between teams.
Must have:
Excellent communication skills.
Patient and friendly attitude with excellent interpersonal skills.
Ability to work on own initiative and meet tight deadlines.
Flexible and able to work rotational shifts within a 24x7x365 shift pattern.
Advanced knowledge of SQL/Linux and Python/shell scripting, or equivalent coding expertise.
Knowledge of payment flows and processes.
2-5 years of experience.
SRE background with a strong focus on automation.