Tips To Eliminate Toil

There are several ways to eliminate toil that everyone should be aware of. Eliminating toil should be a principle that everyone follows in both letter and spirit. It can drastically improve productivity and build long-term sustainability.

Author: Katharine Tate
Reviewer: Karan Emery
Aug 09, 2022
When we talk about site reliability engineering, more commonly known as SRE, it is imperative to know how we can eliminate toil.
Industry experts widely recommend it, but it is not easy to implement.
We will discuss in detail what eliminating toil means and some of the nuances attached to it.

Definition Of Toil

Toil is not simply work we do not enjoy doing.
Nor is it the same thing as grungy or administrative work.
Individual preferences for rewarding and enjoyable work differ, and some people even like manual, repetitive work.
Additionally, there are administrative tasks that must be completed but are not toil; rather, they are overhead.

Difference Between The Overhead And Toil

Team meetings, setting and grading goals, creating snippets, filling out HR paperwork, and other activities that are not directly related to running a production service are examples of overhead.
A cartoonic image sitting on his desk with PC, laptop and mobile phone with pop-up notifications
Sometimes grungy work is not toil at all, because it has long-term value.
It may be unpleasant, but cleaning up your service's overall alerting configuration and clearing out clutter is not toil.
What then is toil?
Toil is the type of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, lacking in lasting value, and that scales linearly as a service grows.
Not every task deemed toil has all of these attributes, but the more closely a task matches one or more of the following descriptions, the more likely it is to be toil.
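The characteristics listed above can be read as a rough checklist. The sketch below is only an illustration of that idea: the trait names, the `toil_traits_matched` helper, and the two sample tasks are hypothetical and not part of any SRE tooling. It counts how many of the traits a task matches, on the assumption that more matches mean the task is more likely to be toil.

```python
# Hypothetical trait names mirroring the characteristics in the text.
TOIL_TRAITS = (
    "manual",
    "repetitive",
    "automatable",
    "tactical",
    "no_lasting_value",
    "scales_linearly",
)


def toil_traits_matched(task: dict) -> list:
    """Return the toil traits that the task is flagged with."""
    return [trait for trait in TOIL_TRAITS if task.get(trait, False)]


# Illustrative tasks (assumed, not taken from real data).
restart_stuck_job = {trait: True for trait in TOIL_TRAITS}
design_review = {"manual": True}

for name, task in [("restart stuck job", restart_stuck_job),
                   ("design review", design_review)]:
    matched = toil_traits_matched(task)
    print(f"{name}: {len(matched)} of {len(TOIL_TRAITS)} traits matched")
```

A count like this is a conversation starter rather than a verdict; whether a task needs human judgment still has to be decided by a person.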

Manual And Repetitive

This includes work such as manually running a script that automates some task.
Running a script may be faster than performing each step by hand, but the hands-on time a person spends running that script (not the elapsed time) is still toil time.
If you are doing something for the first time, or even the second time, it is not toil.
Toil is work you do over and over.
If you are creating a new solution to a problem, the work is not toil.

Automatable And Tactical

A task is toil if a machine could perform it just as well as a human could, or if the need for it could be designed away.
If the task requires human judgment, it is likely not toil.
Site Reliability Engineering written on blue background and chip on the other side of the picture
Toil is interrupt-driven and reactive rather than strategy-driven and proactive.
Handling pager alerts is toil.
We may never be able to eliminate this kind of work completely, but we must always strive to reduce it.
If your service is still in the same state after you have completed a task, the task was probably toil.
If the task produced a permanent improvement in your service, it probably was not toil, even if it involved some tedious work such as cleaning up legacy code and configurations.
A task is likely to be toil if the effort it requires scales linearly with service size, traffic volume, or user count.
Apart from some one-time efforts to add resources, an ideally managed and designed service can grow by at least one order of magnitude with no additional work.

Less Toil Leads Towards More Productivity

The stated goal of our SRE organization is to keep operational work (that is, toil) below 50% of each SRE's time.
Each SRE should devote at least 50% of their time to engineering projects that will either reduce future toil or add new service features.
Feature development typically focuses on improving reliability, performance, or utilization, and it often reduces toil as a secondary benefit.
Because toil tends to grow if left unchecked and can quickly fill 100% of everyone's time, we all share this 50% goal.
The "Engineering" in site reliability engineering refers to the work of reducing toil and scaling up services.
Engineering work is what enables the SRE organization to scale up sublinearly with service size and to manage services more effectively than either a pure Dev team or other alternatives.
Additionally, we cite the 50 percent rule to incoming SREs to emphasize that SRE is not a traditional Ops organization.
We need to uphold that commitment by preventing any SRE team or division from turning into an operations team.

Toil Calculations

If we aim to cap the time an SRE spends on toil at 50%, how is that time spent?
There is a floor on the amount of toil any SRE has to handle if they are on-call.
A person sitting on his workstation with his PC and laptop
Each cycle for a typical SRE includes one week of primary on-call and one week of secondary on-call (for discussion of primary versus secondary on-call shifts, see Being On-Call).
In a 6-person rotation, at least 2 of every 6 weeks are devoted to on-call shifts and interrupt handling, so the lower bound on potential toil is 2/6 = 33% of an SRE's time.
The lower bound in an 8-person rotation is 2/8 = 25%.
Consistent with this data, SREs report that interrupts (that is, non-urgent service-related messages and emails) are their main source of toil.
The next leading source is on-call (urgent) response, followed by releases and pushes.
Although we typically handle our release and push procedures with a considerable amount of automation, there is still plenty of room for improvement in this area.
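The rotation arithmetic above generalizes to any team size. As a minimal sketch, assuming each engineer serves two on-call weeks (one primary, one secondary) per cycle and that a cycle lasts as many weeks as there are people in the rotation, the lower bound can be computed as follows. The function name `oncall_toil_lower_bound` is made up for this illustration.

```python
def oncall_toil_lower_bound(rotation_size: int, oncall_weeks: int = 2) -> float:
    """Fraction of an engineer's time spent on-call.

    Assumes a cycle of `rotation_size` weeks in which each engineer
    is on-call for `oncall_weeks` of them (primary plus secondary).
    """
    if rotation_size < oncall_weeks:
        raise ValueError("rotation is too small for the on-call weeks")
    return oncall_weeks / rotation_size


for size in (6, 8):
    bound = oncall_toil_lower_bound(size)
    print(f"{size}-person rotation: at least {bound:.0%} toil")
# 6-person rotation: at least 33% toil
# 8-person rotation: at least 25% toil
```

The result is a floor, not an estimate: interrupts, releases, and pushes outside the on-call weeks add to it.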

Above Par Performance On Eliminating Toil

According to quarterly surveys of Google's SREs, the average time spent on toil is roughly 33 percent, so we perform significantly better than our overall target of 50%.
The average, however, hides outliers: some SREs report 0% toil (pure development projects with no on-call work), while others report 80% toil.
When individual SREs report excessive toil, managers should distribute the workload more evenly across the team and encourage those SREs to find rewarding engineering projects.
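To see how an average can hide outliers, the sketch below summarizes a set of survey responses. The numbers are invented for illustration (they are not Google's survey data) and were chosen so that the mean lands near 33% while one response sits at 80%. The `summarize` helper is likewise hypothetical.

```python
from statistics import mean

# Invented survey responses: fraction of time each SRE reports as toil.
reported_toil = [0.0, 0.10, 0.25, 0.30, 0.35, 0.50, 0.80]

TOIL_CAP = 0.50  # the 50% target discussed in the text


def summarize(fractions, cap=TOIL_CAP):
    """Report the mean alongside the extremes and the count above the cap."""
    return {
        "mean": mean(fractions),
        "min": min(fractions),
        "max": max(fractions),
        "over_cap": sum(1 for f in fractions if f > cap),
    }


summary = summarize(reported_toil)
print(f"mean {summary['mean']:.0%}, "
      f"range {summary['min']:.0%} to {summary['max']:.0%}, "
      f"{summary['over_cap']} above the cap")
```

Reporting the range and the number of people above the cap alongside the mean makes it easier to spot who needs their workload rebalanced.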

Toil Is Not Always Bad

Toil does not make everyone unhappy all the time, especially in modest amounts. Work that is predictable and repetitive can be quite calming.
Such tasks lead to quick wins and a sense of accomplishment.
They can be low-stress, low-risk activities.
Some people naturally gravitate toward tasks involving toil and may even enjoy that kind of work.
Everyone must understand that some amount of toil is unavoidable in the SRE role and, in fact, in practically any engineering role.
Toil is not always and invariably bad.
Small amounts are acceptable, and if you are content with them, toil is not a problem.
Toil becomes toxic when experienced in large quantities.
If you are burdened with too much toil, you should be very concerned and speak up about it.

Strategies For Eliminating Toil

The first step is to introduce automation and follow it in both letter and spirit.
Since toil is automatable, automation is a natural focus for SRE organizations.

Using Automation To Eliminate Toil: An Expert’s Perspective

According to Google, a task is considered toil if a machine could complete it just as well as a human could, or if the need for it could be eliminated.
A task is probably not toil if it requires human judgment.
To cut down on toil, Credit Suisse introduced no-code robotic process automation (RPA).
Within a year, they had raised the share of automated work from 10% to 45%.
At the time of the webinar (May 2021), they were already at 50% automation, and their goal for 2021 was to reach 55% automation.
This was made possible by their commitment to integrating automation into each team's workflow.

Reuse And Improve

Repeated tasks are common.
As a result, when engineers find a fix for a task, they should reuse it often, even on a different part of the platform.
To reduce toil, create a library of scripts that can be called when needed.
More and more SRE tools now include pre-built libraries that address the most common issues.
Bad code leads to more issues, which means more toil.
Improve the quality of your initial code by using an integrated DevOps strategy that includes thorough testing, automatic feedback loops between operations and development, and clear signals about which issues to address first.

Measurement Is Imperative

Measurement is crucial as you continue to reduce toil.
Once you make changes, examine the results and compare them to your baseline.
Are teams producing work that is as valuable as before, or more so?
Has the effort to reduce toil resulted in a shift in culture and morale?
As automation is introduced, there will also be growing pains, because teams must lay the foundation for it to function.
Things will not necessarily change immediately, but over time you will start to notice increased productivity and morale, which eventually help in meeting corporate objectives.
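One simple way to compare against a baseline is to track the fraction of time spent on toil in each period. The sketch below assumes a team records toil hours and engineering hours per period; the `Period` class, the `toil_change` helper, and the sample hours are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Period:
    toil_hours: float
    engineering_hours: float

    @property
    def toil_fraction(self) -> float:
        """Share of recorded hours that went to toil (0.0 if nothing recorded)."""
        total = self.toil_hours + self.engineering_hours
        return self.toil_hours / total if total else 0.0


def toil_change(baseline: Period, current: Period) -> float:
    """Change in toil share in percentage points; negative means less toil."""
    return (current.toil_fraction - baseline.toil_fraction) * 100


# Sample hours, assumed for illustration only.
baseline = Period(toil_hours=24, engineering_hours=16)  # 60% toil
current = Period(toil_hours=16, engineering_hours=24)   # 40% toil

print(f"toil share changed by {toil_change(baseline, current):+.1f} points")
```

A number like this only covers the productivity side; the questions about culture and morale still need surveys or conversations.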

Focus On Engineering Work Instead

You want your employees to spend most of their time on value-adding engineering work rather than on toil.
Drawing on Vivek Rau's descriptions, we can characterize engineering work as creative and innovative work that requires human judgment, has lasting value, and can be used by others.
A high engineering work-to-toil ratio in an organization gives the impression that everyone is metaphorically swimming in the same direction.
Working for a company with a low engineering work-to-toil ratio makes you feel more like you are floating or, in the worst-case scenario, sinking.

People Also Ask

What Is Eliminating Toil?

Google came up with the term "toil" to characterize the tiresome, repetitive chores involved in maintaining a production system. To maximize the time spent on engineering and innovation, Site Reliability Engineering (SRE) teams strive to minimize or even completely remove toil.

What Qualifies As Toil?

The type of work known as toil is typically manual, repetitive, automatable, tactical, lacking in long-term value, and scaling linearly as service expands.

Why Should Toil Be Limited To A Bounded Part Of The SRE Role?

Toil must be kept within bounds in the SRE role. If toil consumes all of an SRE's time, they end up performing the traditional sysadmin duties that DevOps discourages. Capping toil at 50% of their time leaves SREs room to work on projects that support your engineering and reliability objectives.

Conclusion

Eliminating toil is imperative when we talk about SRE.
It also applies in other contexts.
The crux of the whole argument is that we should focus on engineering work rather than toil, because engineering work has lasting value and improves both productivity and costs.