The Client is a large, IoT-focused enterprise in the agricultural industry that provides digital and hardware solutions for monitoring and managing distributed devices and for streamlining production processes through automation, machine learning and big data analysis.
The Client initially requested support in SRE areas of competence in order to quickly finalize several internal projects and initiatives related to observability (such as implementing tracing for serverless applications or a centralized logging solution), as well as governance and security. As a result, 2 senior SRE engineers joined an existing, established team and helped accelerate several epics and projects.
The company had an extensive backlog and plans to scale up its platform by several orders of magnitude. We knew immediately that Site Reliability principles would help the business executives grow and expand with control and safety. Over time, thanks to great communication, a strong skill set and effective delivery, the Relout team was scaled to 5 engineers (including a team leader), and a second project was opened for Backend Engineers with a tech lead position.
The development team focused solely on augmenting the existing team that develops the core part of the system: supporting it in its daily duties, improving the team’s capabilities and skills, and enhancing the reliability and quality of the code through observability and quality control. The SRE team continued to work on high-level, industry-wide standards and principles, helping all other teams improve reliability through standardization, centralized solutions for logs and monitoring, governance and automation frameworks.
Both teams were a mix of the Client’s employees and other vendors, and were led by Relout leaders.
Challenges
Observability in a distributed, microfrontend-based system. Thousands of AWS Lambda functions running in production needed a well-established tracing and monitoring framework to help troubleshoot and debug issues.
Business goals and objectives (SLOs) needed to be defined across a large organization through an efficient and easy-to-adopt process.
Training and knowledge sharing were needed to understand and implement Site Reliability principles and to learn new tools.
A centralized view of the system’s health was needed.
High cost-per-customer needed to be optimized to allow an effective and rapid scale-up of the product on the market.
Goals
Extend the business units and teams to support them in processing the backlog of initiatives and projects
Implement SLOs and SLIs across the entire organization
Centralize logging for Serverless applications to simplify debugging and optimize costs
Optimize cloud costs to decrease cost-per-customer
Ensure company governance and security policies are applied to all teams and services (through CI/CD)
Our Approach
A product based on a distributed, 100% serverless solution with edge computing and IoT devices needs to place a strong priority on Site Reliability values and practices in order to be stable, scalable and reliable. Observability was the key to this success. We therefore applied the following principles:
Ensure a blameless culture – through post-mortems, a culture of support and knowledge sharing, we helped the teams increase their overall productivity and resolve the most urgent problems and backlog items in a short time
Implement the SLODLC framework – enabling SRE in a large organization is a challenge. One way to simplify adoption was to apply a framework. Relout engineers supplied a customized SLODLC template to all teams, along with Grafana training, to ensure simple implementation of SLOs and SLIs across the company.
Build effective, high-performing and well-integrated teams – by applying trust, taking care of the integration and well-being of our engineers and ensuring good communication flow, we knew the teams’ performance will
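To make the SLO and SLI terms used in the principles above more concrete, here is a minimal Python sketch of how an SLI and the remaining error budget can be derived from request counts. It is an illustration only: the function name `evaluate_slo`, the `SloReport` type and the example numbers are hypothetical and are not taken from the Client's implementation or from the SLODLC template.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float               # observed ratio of good events to all events
    allowed_bad: float       # bad events the SLO target tolerates in this window
    budget_remaining: float  # share of the error budget still unused (negative if overspent)


def evaluate_slo(good_events: int, total_events: int, slo_target: float) -> SloReport:
    """Compute the SLI and the remaining error budget for one time window.

    All inputs are hypothetical examples; in practice they would come
    from a monitoring system.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    if not 0 <= good_events <= total_events:
        raise ValueError("good_events must be between 0 and total_events")

    sli = good_events / total_events
    # The error budget is the share of events allowed to fail under the target.
    allowed_bad = (1 - slo_target) * total_events
    actual_bad = total_events - good_events
    budget_remaining = 1 - actual_bad / allowed_bad
    return SloReport(sli=sli, allowed_bad=allowed_bad, budget_remaining=budget_remaining)


if __name__ == "__main__":
    # Example window: 100,000 requests, 50 of them failed, 99.9% availability target.
    report = evaluate_slo(good_events=99_950, total_events=100_000, slo_target=0.999)
    print(report)
```

A report like this is the kind of figure a team can place on a dashboard: the SLI shows current behaviour, while the remaining budget shows how much room is left before the objective is breached.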
The project has been ongoing since October 2022. As of today, after 6 full months of cooperation, both teams continue to provide great business value by working in 10-week increments, performing migrations and modernizations, and ensuring that standards and well-established observability are in place.
The development team, driven by a Relout leader, consists of 4 Senior Backend Engineers who drive business development and delivery of a microservice- and microfrontend-based solution for core services within the product portfolio.
The SRE team, driven by a Relout leader, consists of 5 Mid and Senior SRE engineers who ensure standards are in place, bring governance and security into the organization, implement observability and incident management tools and processes and, most importantly, keep an eye on all AWS accounts to prevent cost spikes, decrease the cost-per-customer footprint and provide automation support for all other teams to speed up business delivery.
Services provided
Team Extension
9 FTE engineers joined the teams responsible for maintaining and developing core product functionality, maintaining AWS accounts and enabling SRE principles and standards