Modernizing IT Operations for System Reliability for a Big Four Bank - Altimetrik
Background
The financial industry is increasingly relying on technology to provide services to customers. As a result, banks need to ensure that their systems are reliable, scalable, and efficient. Additionally, they operate in a highly regulated industry where downtime or outages can result in significant financial losses or regulatory penalties.
Site Reliability Engineering (SRE) is transforming global financial services companies by providing a set of platforms and practices that enable them to deliver more reliable and scalable services to their customers. SRE practices focus on building and operating software systems that are highly reliable, scalable, and efficient, reducing the likelihood of outages or downtime.
Our client, a global investment bank and financial services conglomerate, operates in 160 countries, providing payments, cards, cash management, working capital, and trade solutions to companies, governments, and other large institutions. With over $13 trillion in assets under custody, it also integrates markets capabilities, with trading floors in more than 80 countries. Because its technology operations are large and span several geographies, the company wanted a comprehensive strategy to simplify monitoring, enable system tracing, and fully automate its tech operations.
The end goal was real-time, 360-degree insight into overall system health, service management activities, product quality index, missed revenue, and total cost of ownership.
Challenge
Altimetrik’s SRE Transformation team engaged with the client to help define and deploy a scalable site reliability engineering framework, along with policies and procedures, to modernize its IT operations. In our discovery phase, we mapped the client’s current way of working and identified several challenges affecting system reliability, such as:
- Scattered dashboards with noisy reactive alerts (~1000 alerts/day).
- No end-to-end traceability.
- Tech deliverables fell short of expectations and were not aligned with the client’s business priorities.
- Multiple ticket SLA breaches, with delayed root cause analyses (RCAs).
- Fragmented SRE tool adoption that resulted in high maintenance cost.
Solution
Our team outlined a 12-month transformation roadmap, assisted in adopting SRE foundational services, optimized observability capabilities, and developed an automated suite to provide self-healing capabilities.
1. Cockpit Controller
We developed a self-service automation platform to support critical flow remediation and introduced system throttling to manage event queues and transition of payment methods.
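The case study does not describe how the throttling was implemented. As a minimal sketch of one common way to throttle an event queue, the example below uses a token bucket: events are released at a sustained rate, with short bursts allowed up to a fixed capacity. The names `TokenBucket` and `drain`, and the rate and capacity values, are hypothetical and chosen only for illustration.

```python
import time
from collections import deque


class TokenBucket:
    """Token-bucket throttle (illustrative, not the client's implementation).

    Allows bursts of up to `capacity` events and a sustained rate of
    `rate` events per second.
    """

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum tokens held at once
        self.clock = clock            # injectable clock, useful for testing
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        """Return True and consume a token if an event may proceed now."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def drain(queue, bucket):
    """Release events from the front of `queue` while the bucket allows it.

    Events that are not released stay queued for a later call.
    """
    released = []
    while queue and bucket.allow():
        released.append(queue.popleft())
    return released


if __name__ == "__main__":
    events = deque(f"event-{i}" for i in range(10))
    bucket = TokenBucket(rate=2, capacity=3)
    print(drain(events, bucket), "remaining:", len(events))
```

A token bucket is only one option; a fixed-window counter or a leaky bucket would serve a similar purpose with different burst behavior.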
2. ISDMS Integration
Simplified monitoring and built dashboards that capture the average requests served on active nodes, and set up a Prometheus and HAProxy integration to highlight unused or underutilized nodes.
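To illustrate the idea of flagging unused and underutilized nodes, the sketch below classifies nodes from their average request rates. It assumes the per-node rates have already been collected (for example, from a metrics system) into a plain dictionary; the function name `classify_nodes`, the node names, and the 25% threshold are hypothetical and not taken from the engagement.

```python
from statistics import mean


def classify_nodes(request_rates, low_fraction=0.25):
    """Split nodes into idle and underutilized groups.

    request_rates: dict mapping node name -> average requests per second.
    low_fraction:  a node is "underutilized" when its rate is below this
                   fraction of the mean rate across active nodes
                   (assumed threshold, tune to the environment).
    Returns (idle_nodes, underutilized_nodes), each sorted by name.
    """
    active = [rate for rate in request_rates.values() if rate > 0]
    if not active:
        # No node is serving traffic, so every node counts as idle.
        return sorted(request_rates), []

    threshold = low_fraction * mean(active)
    idle = sorted(n for n, r in request_rates.items() if r == 0)
    under = sorted(n for n, r in request_rates.items() if 0 < r < threshold)
    return idle, under


if __name__ == "__main__":
    rates = {"node-a": 120.0, "node-b": 100.0, "node-c": 10.0, "node-d": 0.0}
    idle, under = classify_nodes(rates)
    print("idle:", idle)
    print("underutilized:", under)
```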
3. Mission Control
Consolidated monitoring and logging services into a single interface that provides a comprehensive view of live traffic. Also optimized logging and monitoring with distributed tracing to reduce mean time to detect (MTTD).
Results
We optimized observability capabilities and developed automated suites that provide self-healing capabilities, reducing the client’s operational toil.
- SLO-based alerting.
- Automated service management.
- Eliminated development team overhead in supporting production events.
- Reduced operational and maintenance cost for SRE tools.
- Predictive capacity planning and management.
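For readers unfamiliar with SLO-based alerting, the sketch below shows one widely used form of it: alerting on the error budget burn rate across two time windows, rather than on individual noisy events. This is a generic illustration and not the client's actual alerting rules; the function names, the sample counts, and the 14.4 threshold (a commonly cited value for a one-hour window against a 30-day SLO) are assumptions.

```python
def burn_rate(errors, total, slo_target):
    """Return how fast the error budget is being consumed.

    A value of 1.0 means errors arrive at exactly the rate the SLO
    allows; higher values mean the budget will run out early.
    """
    if total == 0:
        return 0.0  # no traffic, so no budget consumed
    error_budget = 1.0 - slo_target   # e.g. 0.001 for a 99.9% SLO
    return (errors / total) / error_budget


def should_page(long_window, short_window, slo_target, threshold=14.4):
    """Page only when both windows burn budget faster than `threshold`.

    long_window, short_window: (errors, total) request counts.
    Requiring the short window as well keeps the alert from firing
    long after a problem has already stopped.
    """
    return (burn_rate(*long_window, slo_target) >= threshold
            and burn_rate(*short_window, slo_target) >= threshold)


if __name__ == "__main__":
    slo = 0.999  # assumed 99.9% availability target
    # Sustained problem: both windows show a high error ratio.
    print(should_page((200, 10_000), (30, 1_000), slo))
    # Recovered: the long window is still elevated, the short one is not.
    print(should_page((200, 10_000), (1, 1_000), slo))
```

Compared with per-event threshold alerts, this approach ties paging to user-facing impact, which is one way to cut down on alert noise of the kind described in the Challenge section.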