Nat Welch is a leading expert on Site Reliability Engineering, and we are lucky to have him leading the charge on SRE at Time by Ping, where he helps us scale the platform for safe, reliable, and performant data handling. SRE applies an engineering mindset to operational problems and, in doing so, looks after employee wellbeing, something that is vital to us at Time by Ping.
This conversation has been edited for length and clarity.
Healthy paranoia. A lot of my career has been thinking about how to deal with the worst, but with a positive attitude. Figuring out how to deal with those failures in a positive way, instead of everyone running around with their heads on fire, comes in handy.
I’ve been doing professional software development since I was 17. My first job, I was working for Sonoma County, California, building things like status pages and eventually people’s personal websites over the summer. From then on, I pretty much was always either self-employed or working with some organization or another. During school at Cal Poly, I was working 20 hours a week. I was able to see what I actually wanted to do as a career and that I was good at it.
It was then that I started working for Punchd (this is where I met Niket, TBP’s VP of Product) for basically nothing. After we were acquired by Google, I worked directly for them full-time for four years. First, I worked in Mountain View, then San Francisco, and finally in London. It was a great learning experience, but then I got really burned out living alone in London.
I was accepted to a program in New York called the Recurse Center, which is essentially a writers’ workshop for programmers. I did that for three months, during which I fell in love with NYC and took a job at a startup.
Shortly after, I joined Hillary Clinton’s presidential campaign doing Site Reliability Engineering (SRE) and Infrastructure, which was interesting. That experience cemented in my mind that there’s a lot to teach other people about SRE.
Oh, it was a blast. And it ended sadly. But I learned a lot.
I was hired in January, getting ready for the primaries.
I had a friend who was already working on the campaign. When he described their problems to me, they sounded like problems I had encountered before and knew how to attack. I met my now-wife through the campaign; she was on the research team. We’ve now been together for about four and a half years because of that.
I learned a lot. It’s probably the hardest I’ve ever worked for anything — basically seven days a week for a year and change.
My last day was December 31 of 2016. I helped with the shutdown, too.
After I helped with the shutdown, I went and worked for a news organization for about two years, which was fun. During that time, I wrote a book, which is when I realized I wanted to do more teaching.
I went back to Google and worked on their Customer Reliability Engineering team, which essentially goes out to Google Cloud customers and teaches them how to run large-scale systems and operations in the cloud. For two years, I was this weird mix of a consultant, an educator, an engineer, and a travelling salesman.
During that time, I realized I missed being close to a product. Google Cloud is gigantic. I really enjoyed that educating process, but when Ryan approached me and asked if I wanted to get Time by Ping’s infrastructure in shape, it sounded fun!
Niket still had the same high energy that I feel like I’ve never been able to keep up with. He’s grown into more of a leader, and he’s flying less by the seat of his pants. He has more plans, whereas 10 years ago he was much more reactionary.
For Michael, we worked together just a little bit before Google. He went off to do a ton of stuff and now that we’re working together at Time by Ping, he’s very much a product designer. I get a much different sort of seriousness from him than I did back in 2011 when he was doing cool design work, like Chrome ads and video art direction.
"Reliability tends to be reactionary. So it’s exciting to be at a small company that’s actually thinking about reliability and security early on…"
A lot of it is just working with a small company. Often companies don’t approach reliability until they have a gigantic, business-destroying outage.
Many times, I have worked with friends who have sent me texts like, “We just launched, and lost our production database.” They had some bug that just deleted everything.
Reliability tends to be reactionary. So it’s exciting to be at a small company that’s actually thinking about reliability and security early on, prioritizing it, and putting money behind it. It makes the engineering problems different from what I’ve dealt with before.
"It’s really a way for an organization to approach the mental health of its employees in a safe way, so there are a lot of practices around figuring out what sort of strain our employees are dealing with as they interact with these systems."
Site Reliability Engineering comes from the idea of attacking operations problems with a software engineering mindset. As a software engineer that’s attractive, because I love systems and how things fit together, and I love writing code.
It’s really a way for an organization to approach the mental health of its employees in a safe way, so there are a lot of practices around figuring out what sort of strain our employees are dealing with as they interact with these systems. Let’s say my job is responding to tickets or alerts. You always have to have your phone with you. If your boss calls you and says there’s this crazy thing in the news, you have to go deal with it. The time or hour of the day doesn’t matter.
"SRE is all about stopping feeding humans to repeatable tasks, and instead investing in their mental health and using computers to solve the problem."
With engineering operations, the person calling you is usually not a human; it’s usually a computer. One of the questions is, “Why is just one person dealing with this?” You can spread tasks across many people. If I were still living in London and I had a team in Seattle, you could split it so that when the team in Seattle is awake, they get the notifications, and when I’m awake in London, I get the notifications.
Let’s say Nat is doing a shift for a week. We can see how many times we had to alert him during his day. Every time he’s alerted, it takes a certain amount of time out of his day. You can start developing metrics for determining at what point we’re putting someone in a high-stress situation. As the percentage of the day taken up by alerts increases, I stop gaining satisfaction from my job. I start being upset and looking to quit. Losing an employee is a very expensive operation for the company.
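As a rough illustration of the kind of metric described here, the sketch below computes the share of an on-call shift that went to handling alerts. The `Alert` class, the `interrupted_fraction` function, and the sample numbers are all hypothetical; nothing here is taken from Time by Ping’s actual tooling.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    # Hypothetical field: minutes the engineer spent handling this alert.
    minutes_to_handle: float


def interrupted_fraction(alerts, shift_minutes):
    """Return the fraction of a shift consumed by handling alerts."""
    if shift_minutes <= 0:
        raise ValueError("shift_minutes must be positive")
    return sum(a.minutes_to_handle for a in alerts) / shift_minutes


# Assumed shift: one week of on-call, counting 8 working hours on 5 days.
shift_minutes = 5 * 8 * 60
alerts = [Alert(30), Alert(45), Alert(120), Alert(15), Alert(90)]

fraction = interrupted_fraction(alerts, shift_minutes)
print(f"{fraction:.1%} of the shift went to alerts")  # 12.5% of the shift went to alerts
```

Tracking this number per person and per week is one simple way to notice when the alert load on someone is creeping up.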
SRE is all about stopping feeding humans to repeatable tasks, and instead investing in their mental health and using computers to solve the problem.
We use a metric usually called toil, which covers operations that are manual, repeatable, and linear. As the number of customers grows, toil tasks grow as well. If an engineer is spending more than 50% of her time doing toil-type work, she’s going to burn out and quit. So we put a hard cap on that so that she gets time to do things that are more interesting to her, like creative work or projects to help limit the amount of toil coming in.
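The 50% cap can be expressed as a very small check. This is only a sketch under assumptions: the function names and the idea of measuring toil in hours per week are illustrative, not a description of how any particular team tracks it.

```python
TOIL_CAP = 0.50  # the 50% cap mentioned above


def toil_fraction(toil_hours, total_hours):
    """Return the share of working time spent on toil."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


def over_toil_cap(toil_hours, total_hours, cap=TOIL_CAP):
    """Return True when toil takes up more than the allowed share of time."""
    return toil_fraction(toil_hours, total_hours) > cap


# Hypothetical week: 24 of 40 hours went to manual, repeatable work.
print(over_toil_cap(24, 40))  # True
print(over_toil_cap(16, 40))  # False
```

The comparison is strict, so an engineer at exactly 50% is still within the cap, matching the "more than 50%" wording.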
Usually, there’s someone inside of leadership fighting for reliability, engineering, productivity, and efficiency. Let’s say a reliability target is going to be 99%. Many websites have numbers much higher than that, but just for the sake of argument, we can say, “1% of every month, we can be unreliable.” That space gives you room for failure and room for growth. It also makes reliability measurable, so we can look at it over time.
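To make the 99% example concrete, the sketch below turns a reliability target into a monthly allowance of "unreliable" minutes, often called an error budget. The function names, the 30-day month, and the 120 minutes of downtime are assumptions chosen for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(target, days=30):
    """Return the minutes per period that may be unreliable under the target."""
    if not 0 < target <= 1:
        raise ValueError("target must be in (0, 1]")
    return (1 - target) * days * MINUTES_PER_DAY


def budget_remaining(target, downtime_minutes, days=30):
    """Return the unused error budget after the given downtime."""
    return error_budget_minutes(target, days) - downtime_minutes


budget = error_budget_minutes(0.99)
print(f"Budget: {budget:.0f} minutes")  # Budget: 432 minutes

# Hypothetical month with 120 minutes of downtime so far.
remaining = budget_remaining(0.99, downtime_minutes=120)
print(f"Remaining: {remaining:.0f} minutes")  # Remaining: 312 minutes
```

With a 99% target over 30 days, the budget is roughly 7.2 hours; a negative remaining value would mean the target was missed for that month.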
In the very literal sense, reliability should be argued for because we’re tracking the livelihood of other people. When we’re unreliable or inconsistently reliable, people won’t get paid. That’s how you pay for healthcare and food and everything you need in life. Since lawyers bill in six-minute increments, it helps them sanctify that time in reliable ways. It’s valuable.
It gets interesting if you’re unreliable, because that often means you’re insecure. That is a concern for the law firm and, potentially, for other organizations if we track time for them in the future. The security of people’s data, and reliable access to it, are important.
For most of my career, especially getting started before college, my ability to learn was due to open availability of code. I wouldn’t have been able to get started and learn as much at an early age if Linux and open source software weren’t as freely available.
Continuing code accessibility is really important to me. I like sharing knowledge and mentoring because it’s important to be able to share. Ultimately, we want to do more external knowledge sharing at Time by Ping, so it’s something I like to push for.
"We make sure that if we collect data, we collect it in a way that is useful to the person whose data we are collecting. I don’t want to collect every little piece of data and later find out we provided spying reports to companies"
It’s not just our customers’ work, it’s also their customers’ work. The care required is magnitudes larger. We think about it in terms of who can access data, making sure that we have audit trails, and ways to know if data is being misused. Then, we can deal with that, and also tell the appropriate people.
Also, we make sure that if we collect data, we collect it in a way that is useful to the person whose data we are collecting. I don’t want to collect every little piece of data and later find out we provided spying reports to companies. It’s a fine line, from both an ethical and moral standpoint, and it is something that we as a company are constantly thinking about.
Relentless focus on mission. Something Ryan and all of our leadership say very frequently is, “the goal is to sanctify time.” Just this idea that time is important is something I care a lot about. An individual’s time is probably the most important thing to them; it’s a truly finite resource. It’s the most uncontrollable thing we deal with as humans.