A few months ago, I became the Lead Site Reliability Engineer (SRE) in Visma e-conomic. This new position is allowing me to work with brilliant people! In our industry, brilliant colleagues often have a hard time collaborating, and communication is very hard. In this post, I want to talk about the start of our journey from an Operational (Ops) team to embracing SRE using kindness and compassion as a way: Trust and forgiveness.
A few years ago, the SRE team in e-conomic was called the Ops team. Then to modernize, and do things in the cloud, the team name was changed from Operations Team to DevOps. Still, the way the team interacted with the rest of the organization was always the same: The expectation was that the DevOps team would solve any issue possible at any time.
Even if the DevOps team had everything as Code and allowed developers to work independently for changes in the architecture, it looked like there was no incentive to do self-service or to delegate and trust other teams with any change.
The team preferred to work on specific tasks that could have been delegated, as no other developers knew what DevOps is or wanted to edit YAML. The other teams liked to push their tasks to the DevOps team. When there was too much in the backlog, it was easier to desperately satisfy them instantly, avoiding the backlog and refinment.
The DevOps team was still an Operations team. Very few people knew what DevOps meant. Managers preferred to hire brand new “DevOps engineers” in their teams and then dump Ops tasks on their desks, as they started seeing how the DevOps team was not collaborative or available enough.
As a result, important infrastructure decisions where either taken by other teams or took way longer to be prepared by the DevOps team, as the [toil] was growing and growing.
The DevOps team was either super kind or a blocker. We either ignored requests or punished colleagues for running their own tasks not in a compliant way making the position of any DevOps Engineer vague and confuse. What does a DevOps engineer does? Support?
Kindness to others doesn’t scale
I was reading this article, and even though it is very focused on USA West vs. East coasts, it sparked some random thoughts. Being nice is not the same as being kind to yourself and your colleagues. Even if it is not the dictionary definitions of those words, I like this distinction.
The DevOps team was kind for sure, but not to themselfs. We were useful mostly for others. The engineers on-call would always answer questions, solve issues, addressing the needs for others. In one way, the DevOps are kind.
Our interface with the rest of the org was a ServiceDesk-Like slack channel, where we would receive messages. One of these requests could have been running database queries: that we needed a specific procedure to follow to be compliant with the auditors. This was a manual and exhausting procedure that could last from 30 min to hours, depending on the request. Executing DB queries for other teams was pure toil.
Any of these slack messages would result in an engineer stopping working and focusing entirely on the new task for somebody else. It is addressing the needs, but not the SRE needs: scaling.
Toil does not scale. If we were kind to other teams and not to ourselfs we would keep toil in the backlog for ages before understandign that we can empower automation and teamwork with the rest of the organization.
The team has been working agile for long time, but the time spent to address the team’s needs was very little. To solve this we switched to being fully a Site Reliabilty Engineering team… and I added a special focus on eliminating toil first.
Rebranding Site Reliability Engineers
We started a journey to re-shape the interface of comunication with the team: from the service-desk style of comunication to a community-driven approach so that team could be more autonomous and we would find the time to help each-other defining SLOs, writing better post-mortems, building and improving the underlying infrastracture and more.
We started saying “no”, reprioritizing and being kind to ourselfs. We addressed our own needs in the SRE way: We automated it! Our team goal was to remove toil during the first quarter and automate the slowest/most demanding tasks. This gave time for the SRE team to find a new identity, our corner for documentation.
All of a sudden, with automation in place we can afford to delegate tasks to other teams and let them run their own show! We can trust developers now, and we can forgive and give guidance if there are any issues later.
For example, instead of running SQL queries for other people, they can now use an auditable and compliant automated system. This reduced our manual tasks and added a little more trust to others.
My personal learnings
I think this mind shift, undersetanding kindness, and move from a Ops to SRE in the organization will help more than just following a bunch of books. Some people still call us OPS. I say Oops instead 🤭
We are still serving the organization, but trusting other teams and being kind to our own team will work better in the next years. The next step would be to forbid manual changes and this might be a pill hard to swallow as a form of kindness.
From a non technical point of view I learnt that not everybody understands kindness and compassion. Some people prefer a hierarchical structure. They don’t like people being open and warm. Some have a hard time starting a meta-dialog about mental health, kindness, and compassion. The metrics that would impact their own goals are more important than knowing what an SLO is or what CI/CD means. Being kind is seen as a weakness for some people. That is ok: everybody is different, and people also perform the SRE-way without being aware of it! Just push more on observability and people will see it!
Following the SRE practices in a kind way shows my team’s benefits. It will show it to the rest of the organization too. Being kind to ourselves rather than to everybody else helped way more. We started addressing our own needs rather than being too kind and nice to other people. This does not mean that we are rude! We use more empathy, kindness, and compassion to make sure teams understand why we are doing things or not!