The blameless podcast

Resilience in Action E13:

What LinkedIn Learned About Hiring and Training Site Reliability Engineers

2/23/2022
Kurt Andersen

Kurt Andersen is a practitioner and an active thought leader in the SRE community. He speaks at major DevOps & SRE conferences and publishes his work through O'Reilly in quintessential SRE books such as Seeking SRE, What is SRE?, and 97 Things Every SRE Should Know. Before joining Blameless, Kurt was a Sr. Staff SRE at LinkedIn, implementing SLOs (reliability metrics) at scale across the board for thousands of independently deployable services. Kurt is a member of the USENIX Board of Directors and part of the steering committee for the worldwide SREcon conferences.

Podcast Transcript

Learn more about LinkedIn's School of SRE program on their website. You can also read their blog that gives an introduction to the program or watch their video announcement. Additionally, you are welcome to join the online LinkedIn Group for School of SRE.

Transcript

Kurt Andersen:

Hello. I'm Kurt Andersen and welcome back to Resilience in Action. Today, I have the double pleasure of talking to two of the SREs from LinkedIn who were instrumental in creating their open-sourced School of SRE materials, Akbar and Kalyan. Welcome to the program. By way of introduction, could you each tell our listeners how you came to be SREs? Akbar, maybe.

Akbar:

Yep. First of all, thank you so much, Kurt, for inviting us here. This is a great platform for us to spread the goodness of the School of SRE, right? By way of introduction, I would call myself an accidental SRE. I never imagined being an SRE or in DevOps in my career. But I ended up being an SRE, right, and I've been thoroughly enjoying it for the last 15 years. I started my career as an engineer responsible for managing and supporting the company's web servers and mail servers. I was part of a bigger organization, and I was just a part of the supporting team there, right? I didn't know anything about SRE, and I hadn't even heard the term SRE at that point in time.

Akbar:

Then I was fortunate enough to move to Yahoo, where I really started hearing the term SRE, right? What does SRE mean? And what does an SRE actually do? There I had the privilege of working with Yahoo Mail, Yahoo Messenger, and the Yahoo caching system, where I was mainly responsible for keeping the site up and running for these critical services. And I could clearly see the value that an SRE can add to the system, to the members, right? So since then, I realized that this is my area, right? This is my passion. I started enjoying my SRE role and responsibilities.

Akbar:

Then in 2012, I joined LinkedIn, and it is the best place for an SRE, right? I started working with the best SREs in the industry I have ever seen, right? In the last nine and a half years, I've been on multiple teams. I have had the flexibility to work as an IC and as a leader, and I'm thoroughly enjoying it, right? So, that's pretty much it from my side.

Kurt Andersen:

Wow. Excellent. That sounds like an awesome career arc. Kalyan, how about you? How'd you get into SRE?

Kalyan:

Hey all, I'm Kalyan. I'm currently working at LinkedIn as a staff engineer, site reliability. I'm part of a team called Espresso. Espresso is LinkedIn's NoSQL database cluster fleet. So all the data, be it your profile's data or data related to the news feed, messaging, everything in LinkedIn gets stored in the Espresso clusters. So next time you see some consistency issues with the data that you see in LinkedIn, you know which team to blame: it's Espresso. Having said that, I've been with LinkedIn for three years. Before LinkedIn, I was working as an SRE at Media.net, a contextual advertising company. There I was dabbling with web server performance and certain core infrastructure components, like config management and setting up and managing the private cloud of Media.net, and stuff like that. I joined Media.net right from college as an SRE, and I was with Media.net for six years before I joined LinkedIn. Yeah. So I'm deeply passionate about distributed systems and system internals, like Linux system internals. That's a brief about me. Yeah.

Kurt Andersen:

Great. Okay. So it's long been said in the industry that it's terribly difficult to hire SREs because they're unicorns amongst the unicorns. And so how does the industry look? How does the talent pool look in India, for example, Akbar, where the two of you are based? Is it any easier to hire SREs there?

Akbar:

Short answer is no, right. See, there is no challenge in the talent pool itself. If you take the total available talent market, we have a vast number of engineers graduating every year, right? Approximately 700K engineers graduate every year from India alone, right? That's a huge number. And the number of people we are hiring for SRE is actually very limited, but still we are struggling, and "we" means not only LinkedIn, every organization is struggling, to get the right SREs onto the teams. What is the challenge? Right? The major challenge there is awareness. Awareness on the part of the companies, and awareness on the part of the candidates who are coming for this SRE role, right? I will briefly talk about that, right? First of all, traditionally in India, when companies visit the campuses, the majority of them, or a hundred percent of them, recruit for the SWE route.

Akbar:

So the campuses and the students aren't aware there's a role for SRE. And traditionally people think SRE is just an operations thing, right? They consider it a second-level team, kind of, to do the work, right? So when they even hear about SRE, it's "Hey, that's an SRE role, I'm not interested," right? So that's a challenge from the student side. At the same time, on the company side, we don't have a standardized definition, unlike for SWEs, right? Some companies just give the name SRE to their CI/CD team, which just runs the deployment system, right? So different organizations have their own version of SRE. Now, combining these two challenges, we are finding it very difficult to find the right talent for the SRE role, right? Title dilution is the bigger problem, and candidates' awareness of SRE expectations is a different problem, which I can talk about.

Kurt Andersen:

Yeah. That area of awareness because most CS programs don't do any sort of education around reliability, that's a big one. Definitely. So when you were trying to recruit entry level talent for your SRE teams, what works well and what are the challenges?

Kalyan:

Okay, let me take that question. So, like you touched upon, Kurt, there is definitely a gap in the academic stream for the SRE role. Whatever is taught in college is not comprehensive for an SRE role in general. There are individual components, like operating systems, networks, and databases, and everything is covered in academia, but there is a very small proportion of courses that cover how these things work in tandem. And I think there is less importance given to building reliable architecture as well.

Kalyan:

So these things are definitely challenges. In fact, rather than challenges, these things don't give the candidate exposure to thinking about problems from the perspective of an SRE. That's the primary challenge we have when a candidate attends our interview. For example, since I'm a person who joined SRE from college, I was not aware of this. I could write a program that I could run with a runtime, like Java or Python, but I was not aware how this was going to get deployed across tons of servers. How are the dependencies getting sorted out? How are the runtime and the system dependencies all standardized?

Kalyan:

This is never taught in college. And, as per my intro, I'm at least a decade out of college, so this was before the whole Docker world. So maintaining all the dependencies and stuff was something I was very curious about. So this whole role has come to a point where, unless people themselves venture out and explore it, they might not be aware of SRE roles in general. That's not the case with the traditional software development engineer role. So one of the key goals of LinkedIn as a company is to make sure the opportunity is available to everybody. Given an opportunity to explore the role, somebody who's curious or interested can pick it up and start exploring it.

Kalyan:

That's the whole idea behind how School of SRE started. Having talked about all these challenges, we should also cover why entry level talent is important, because if there are challenges, why should we even worry about entry level talent for the SRE role? One, there is a general myth, in fact even within our community when we talk, that "SRE is probably something that can only be learned by experience, and it's something that comes on its own as you mature."

Kalyan:

Based on my experience, I was pushed into this with just curiosity, and I was able to learn all through this role. So that's something I personally consider a myth. Other than that, having entry level talent brings diversity to the team. It's not just for SRE; for any role, a person who is new and curious to learn can come up with out-of-the-box, innovative solutions to the problems at hand, more so than somebody who always views the problem through the lens of the existing technology and tools that are available. So that's a great positive point for us in having entry level talent.

Kalyan:

And the second thing that I have to call out is that there is definitely more need for SREs. That's primarily because more people are coming online, and with the pandemic, more use cases are getting digitized. So we need to have reliable systems; only then will people's trust in digitization hold. That means we need more site reliability engineers at work. With all of this, we can bridge the gap only if we have entry level talent taking up site reliability engineering. So this is what made us start School of SRE.

Kurt Andersen:

Yeah, that makes great sense. And diversity is not just gender diversity, but that years of experience diversity is also an interesting factor that many people don't think about. So thank you for highlighting that one. One of the things you touched on is this concept of the perspective of SRE. And can you expand upon that just a little bit to highlight what the difference is between a SWE and an SRE kind of perspective?

Kalyan:

Okay. So, from a college graduate's perspective, when a candidate prepares for a SWE role, they mostly think about efficiency. The problem they solve is supposed to be solved in O(log n), O(n), O(n^2), or something of that sort. So efficiency plays a key role there. But from a reliability perspective, there are other questions. Even given that the program runs the most efficient algorithm, what is a prudent amount of memory to give the program? How could we scale the program? If we hit saturation, should we just throw more hardware at it, or should we horizontally scale it? And if we horizontally scale it, how are we going to keep all of this synchronized?

Kalyan:

These are all just challenges with respect to performance. The other tenet for an SRE is part of the name itself: reliability. I have a very controversial opinion that an SRE is somewhat of a pessimist. An SRE always looks at an architecture like, "This is deployed on one system; what if that system fails? Okay, let's deploy it on multiple systems." If you go and say this to an SRE, the SRE would say, "What if the whole data center goes off? What if the whole region gets an earthquake? What if there's a fire?" So you have to deploy an architecture across multiple sites so that you are still reliable. It's not a brick and mortar shop that we could close due to an earthquake or any of these disasters. With that being said, though the SRE's thought process is pessimistic, they're not somebody who's afraid of failures. They're somebody who embraces failures. They want to come up with solutions that can be automated, so the failure can be handled gracefully, and that too with less toil.
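The multi-site reasoning above can be made concrete with a little arithmetic. The sketch below is not from the episode or from any LinkedIn tooling; it assumes sites fail independently (a strong simplification, since correlated failures are exactly what the earthquake example worries about) and that any single healthy site can serve all traffic. The function name `combined_availability` is hypothetical.

```python
def combined_availability(site_availability: float, sites: int) -> float:
    """Probability that at least one of `sites` independent sites is up.

    Assumes independent failures and that one healthy site is enough
    to serve traffic. Both are simplifying assumptions.
    """
    if not 0.0 <= site_availability <= 1.0:
        raise ValueError("site_availability must be between 0 and 1")
    if sites < 1:
        raise ValueError("sites must be at least 1")
    # The service is down only when every site is down at the same time.
    return 1.0 - (1.0 - site_availability) ** sites


if __name__ == "__main__":
    # Each site is assumed to be up 99% of the time.
    for n in (1, 2, 3):
        print(n, f"{combined_availability(0.99, n):.6f}")
```

Under these assumptions, each extra site multiplies the chance of a total outage by the single-site failure probability, which is why spreading across sites pays off so quickly on paper and why correlated failures deserve the pessimism described above.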

Kalyan:

So I think in that case, though it's pessimistic to think of all the edge cases, it's good, as long as all of these cases are covered, so we can build reliable systems. So that's what I meant by the perspective of an SRE. The perspective of an SRE is thinking about performance beyond the algorithm, with the system internals, and thinking about reliability not just from a single site but across sites. What happens when something untoward happens in one of the regions? How would you still serve people? Because it's very important. Take a cab pickup company: if they couldn't serve, they're leaving some people stranded. LinkedIn, for example: if we couldn't serve, we are not letting people connect to job opportunities. Each of these organizations has some responsibilities, and keeping them up and running, I think that perspective itself gives a good kind of satisfaction to everybody who picks SRE.

Kurt Andersen:

Nice. Yeah. That fits very well with the principles of resilience engineering. Erik Hollnagel has a great article where he talks about levels of company resilience, and that's his third level: being able to anticipate the failures and prepare in advance so that they don't take you out. So that covers the perspective and how you recruit folks. But when you get people in, once you've convinced them that it's not just an old-fashioned operations role and they can actually contribute meaningfully to the system, what kinds of challenges did you encounter with the new grads once they came in to LinkedIn? Akbar, maybe.

Akbar:

Yeah. So I think I can talk about pre-School of SRE and post-School of SRE, right? One of the common problems we have seen across the board, right, is when we hire ELTs, and ELT means entry level talent, right? When we hire them, they go through the normal onboarding process, right? And that onboarding process is mostly focused around the team they're in, right? For example, in LinkedIn itself, I think we have around 30 to 35 SRE teams. And they're each focused on a certain aspect of what that particular domain is supposed to do, right? Most of the time what happens is people forget this new engineer is coming from campus and doesn't have that kind of exposure to the industry at all. And suddenly we are putting that person into a problem space where they have to go from scratch to get to where we are, right?

Akbar:

Now that's a bigger challenge. And that is a challenge not only for LinkedIn but for other organizations, when we talk about the time spent in the initial few months, right? It may be three months or six months, right? I'm just guessing at the time. That is the time we have to give to this entry level talent, right? And that is an investment; first of all, is the organization ready to put that in for the individual, right, to get a successful SRE? So before the School of SRE, how were we doing it? We started conducting classroom trainings, right? For the classroom trainings, we call on multiple experienced SREs from different teams and have classroom sessions for two, two and a half weeks, right?

Akbar:

Where we literally go through from the basics, from the basics of Linux. In that training itself we'll have some preparation docs, preparation material, and go through some practical sessions for these engineers, where that person would be able to get the feeling: if I'm running an ls command or a ps command, what does that mean? What is happening behind the scenes? Right? So we just try to get that feeling from there, to get to know the fundamentals from scratch, right? So that was pre-School of SRE. And that itself led to the idea of School of SRE, right? Now, this was something we started doing in LinkedIn, and we realized that multiple organizations also have the same kind of challenge, right? Now, how could we solve this problem across the board, right? Is that something we could do? Right? And that is the starting point of School of SRE, right?

Kurt Andersen:

I love that you've got a perspective on how can we scale this solution? That's so SRE. Continue, go on, yeah.

Akbar:

Right, so then what we decided was to call on these experienced folks, a few of them, including you, right? A few of us are at LinkedIn. Some of them have left LinkedIn. We spoke to multiple people across the globe to see what we could do, right? And the first thing that came out was, "Hey, we need to come up with a structured curriculum for these people that they can follow." Right? We went through multiple iterations and came up with a list of topics, which include networking, systems, DBMS and NoSQL, whatever you name it, right, and system design, fundamental system design. We came up with all those topics and started giving trainings based on that work, right? So that was the first step that led into the School of SRE.

Akbar:

Now, in the initial iteration, when the pandemic hit us, we didn't have the luxury of turning this into two and a half weeks of classroom training. Otherwise people would go crazy with Zoom, right? Like six hours of training. So we decided not to go down that path, and what we started instead was more recordings and all those sorts of things. And again, to make it more scalable, again in the SRE way, we started redesigning this content and converted the entire content into more of a transcript form, so the engineers can read it in a self-paced way, in a more asynchronous mode of learning, right?

Akbar:

Now, we started sharing this content with the new hires, and we started conducting it asynchronously. We have active Slack channels where all the trainers are part of the group, and each engineer gets assigned time, like two or three weeks. This time is allocated for fundamental training, the School of SRE training, right? And we have all the experienced engineers in this group itself, ready to answer any questions at any point in time. In some cases we will go for a Zoom call, where all these people get into the meeting and we discuss there as well, right? So that's the way we are doing entry level training at LinkedIn right now.

Kurt Andersen:

Nice. So, yeah. Having two or three weeks of solid Zoom calls sounds horrible, as you mentioned, going to a remote approach. So, Kalyan, can you tell us a bit more about some of the materials in School of SRE, some of the curriculum points?

Kalyan:

Okay. Sure. So with respect to School of SRE, we started with fundamental courses. One of our key principles is that we want to keep it as foundational as possible. Second, we should not be specific to any tools. We should just cover the principles: the what, why, and how, and not which tool to use to solve the problem. That's left open; we cover only the reasoning behind why this is a problem and how to approach it. Currently School of SRE has two phases, level 101 and level 102. With respect to level 101, which started initially, when we came up with the first round of the curriculum, we had figured out a bunch of courses. One, everybody in SRE, irrespective of the [inaudible 00:22:50], needs to have a fundamental understanding of how the Linux system works.

Kalyan:

They need to have an understanding of how Git works, because we are going to do code collaboration anyway. They need to know at least one scripting language. So these were all foundational. Then we added basic networking so that people can understand how packets get routed and how DNS works. This has its own reasoning behind it: you could run your discovery services with DNS, and so on. Everything is built on top of those fundamentals. So we talked about those fundamentals, and we had courses on databases, like RDBMS and NoSQL, and a brief intro to Big Data, which you can process offline. But at this point it's still something similar to college. We just introduced them to the individual pieces, but we never said how all of this is going to work in tandem. So we also incorporated a course on system design, where all of these work together and you design systems. You design systems with more weight given to performance, availability, and fault tolerance.

Kalyan:

So we wanted all our SREs, especially the entry level talent, once they join, to look at every design doc in terms of how availability is covered here, how fault tolerance is covered here, how the system scales. That's what we track as a success metric: once we have this course, we could have every design doc going through this process. When we were about to finalize the course, one of the key things that came to mind was that customer data is important for LinkedIn, and privacy of data is very important. So we can't leave out security as a topic. So security, how you deal with data, a basic course on security, was also added to level 101 itself. And that's how our initial rounds of classroom training started in School of SRE.

Kurt Andersen:

Okay. So, you'd said initially it was a classroom two or three weeks. And then with the change to remote you pivoted to having transcripts and maybe some video classes. It's one thing to just sort of read about these concepts, but how do people actually get practical experience that brings these pieces together? How do you accomplish that with school of SRE?

Kalyan:

So one of the things we do is [foreign language 00:25:38].

Kurt Andersen:

So how do you get people to have sort of the real fingertip feel, the practical experience, even beyond just sort of reading about these things?

Kalyan:

Yeah. So reading is one part, which we did in a self-paced way. Apart from reading and the asynchronous communication on the Slack channel, we also had dedicated Zoom sessions. This is to avoid Zoom fatigue, because we didn't have the whole classroom training, but we have dedicated POCs for each of these courses. The candidates will get into a session with a POC where, for each of these courses, they could go through a problem within LinkedIn, an outage within LinkedIn, and how the course material helped to solve it. Or we could map it to some of the existing tools within LinkedIn and how they solve it. For example, if we talk about NoSQL databases, we talk about the whole CAP philosophy there. But as long as they don't know how this actually works in the real world, it doesn't make much sense.

Kalyan:

So Espresso would be taken as a live specimen. We could say that there is a conflict that's going to happen, and how do we resolve the conflict here? That's how people actually get to know how these concepts are applied in the real world. And some of these courses, yes, they do have sandbox versions to test them out. But I think beyond the sandbox, the thing that really helps is to map it to something that runs in prod and then understand how these problems come up and how they're actually handled in production.
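The episode does not say how Espresso resolves write conflicts, so the following is not Espresso's method. As a minimal sketch of one widely used strategy, last-writer-wins, assume each write carries a timestamp and the id of the replica that accepted it. The `Version` type, its fields, and `resolve` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    timestamp: int  # hypothetical logical or wall-clock timestamp of the write
    replica: str    # hypothetical id of the replica that accepted the write


def resolve(a: Version, b: Version) -> Version:
    """Last-writer-wins: keep the version with the newer timestamp.

    Ties are broken by replica id so that every replica picks the same
    winner no matter which order it sees the two versions in.
    """
    return max(a, b, key=lambda v: (v.timestamp, v.replica))


if __name__ == "__main__":
    east = Version("headline: SRE", timestamp=10, replica="east")
    west = Version("headline: Staff SRE", timestamp=12, replica="west")
    print(resolve(east, west).value)
```

The trade-off is that the losing write is silently discarded, which is one reason real systems may prefer other approaches such as version vectors or application-level merges.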

Kurt Andersen:

Nice. Now, are you seeing, as school of SRE gets used by other companies, they would have their own case studies, if you like that are particular to their technology stack, are you seeing other companies and hearing from other companies that they're doing similar kinds of supplementation over and above the materials within the GitHub repo?

Kalyan:

Do you want to take that?

Akbar:

Yeah, so we haven't seen those specific cases, but there are requests from across the globe, specifically for more translations and all those sorts of things, like Chinese and Japanese, because we have a lot of users coming from China and Japan, right? We haven't really seen that aspect yet, but that is something we are looking forward to: other organizations contributing, right, or seeing how they could leverage this content and map it to the technology they're actually using, right?

Kurt Andersen:

Nice. Have you added any sort of sandboxes or used the cloud provider of your choice and spin this stuff up and figure out this problem? Have you added any kind of hands on exercises to the school of SRE?

Kalyan:

Not yet.

Kurt Andersen:

They can be used generically?

Kalyan:

No, we don't have any hands-on at this point. I think we have some in the Kubernetes session. We do have how to work with minikube and stuff like that, but I don't think we have a standardized School of SRE platform somewhere where people can do it. I think one of the reasons is we don't want it to be tied to a specific tech stack or something of the sort. So we haven't talked about any of these implementations. Even in our database courses, in NoSQL for example, we haven't talked about Mongo, etc. We just talked about the philosophies, and they can then be extended by each of these companies or organizations based on their use cases. But if there is a request from the community to do something of that sort, we could do Docker images so that they can immediately pull them down and get a sandbox on their system. So yeah, if there is a request, we'd love to collaborate with people and do that.

Kurt Andersen:

Okay. So you talked about the, you've got a 101 set of materials and then you've got a 201 set of materials. What's in the 201 set?

Kalyan:

Okay. So once we were done with level 101, we actually didn't anticipate a very great response from the community. We gave our GitHub repository to the community and, to be honest, one fine day we woke up and saw we were on the Hacker News front page, and we had quite a good number of stars on our repository. It gave us a really good feeling that we contributed something that people were looking forward to. So then we had an ask-me-anything session on LinkedIn. We have a LinkedIn group for School of SRE, and we had a session there. We wanted to know what people were looking for and how they were using the content. And we got some suggestions from folks in that session. We also made sure that level 101 is for someone who wants to experience what reliability is, and we are not covering anything in more detail.

Kalyan:

And level 102 is something somebody can take a stab at only if they have completed level 101 and they're still with us. If SRE doesn't interest them, then it's not necessary for them to jump to level 102. Level 102 is the logical evolution of most of these things. For example, we talked about Linux, but we never talked about how a single system can be made highly available, even with Linux. For example, we could have RAID for disk high availability. You can have bonding on your network NICs. All sorts of things came into level 102. We talked about Linux networking in 101. But networking is not just a singular thing. Networking also goes hand in hand with security. You can have a DMZ. You can have all sorts of fencing, DDoS protection, everything.

Kalyan:

And networking also comes with high availability: how do you increase resiliency for your top-of-rack switches? It also goes with performance: how do you load balance traffic, and so on? So those aspects are covered in networking. Instead of treating networking as a singular unit, we try to cover how it goes hand in hand with other components. And we covered certain aspects of system design that were left vague in the first iteration, for people who are more curious, who want to understand how problems are solved at scale. So that's level 102. Apart from that, from the community, there are requests for how monitoring and metric collection are done.
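To make the load balancing question concrete, here is a minimal sketch of round-robin selection that skips backends marked unhealthy. It is a generic illustration, not the algorithm of any particular load balancer or of the School of SRE material; the class `RoundRobinBalancer` and its method names are hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out backends in rotation, skipping ones marked as down."""

    def __init__(self, backends):
        self._backends = list(backends)
        if not self._backends:
            raise ValueError("at least one backend is required")
        self._healthy = set(self._backends)
        self._ring = cycle(self._backends)

    def mark_down(self, backend):
        self._healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self._backends:
            self._healthy.add(backend)

    def pick(self):
        # Look at each backend at most once per call so that a fully
        # unhealthy pool fails fast instead of looping forever.
        for _ in range(len(self._backends)):
            candidate = next(self._ring)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy backends available")


if __name__ == "__main__":
    lb = RoundRobinBalancer(["app1", "app2", "app3"])
    lb.mark_down("app2")
    print([lb.pick() for _ in range(4)])
```

Round robin ignores how loaded each backend is; strategies such as least-connections or weighted selection address that at the cost of tracking more state.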

Kalyan:

So again, we didn't jump into any of these technologies, any of the TSDB technologies that are available, but we covered what needs to be monitored and what needs to be measured. And some general principles, like how alert fatigue can have bad effects on the system's health, and things like that. And the different ways to monitor: you could collect metrics, you could measure from the logs, or even measure from outside, like the latency that clients see. All sorts of those principles we covered there.
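One common way to reduce alert fatigue is to alert on a sustained breach instead of on every spike. The sketch below is an illustration under assumptions, not something taken from the School of SRE material: it computes a nearest-rank 95th percentile over latency samples and fires only when the most recent windows all exceed a threshold. The names `p95` and `should_alert`, the threshold, and the window count are all hypothetical.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer form of ceil(0.95 * n), which avoids float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def should_alert(window_p95s, threshold_ms, sustained_windows=3):
    """Fire only if the last `sustained_windows` windows all breach.

    A single slow window is treated as noise, which keeps one-off
    spikes from paging anyone.
    """
    if len(window_p95s) < sustained_windows:
        return False
    recent = window_p95s[-sustained_windows:]
    return all(value > threshold_ms for value in recent)


if __name__ == "__main__":
    history = [120, 310, 305, 330]  # p95 latency per window, in ms
    print(should_alert(history, threshold_ms=250))
```

The same pattern applies whichever source the latencies come from, whether emitted metrics, parsed logs, or probes run from outside as clients would see them.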

Kalyan:

And CI/CD is one of the other most-asked-for things. How do you integrate and continuously deploy? So we had some material covering that. Containers are one more thing that was asked for. So we covered what containers are, right from the basics to a brief intro to Kubernetes. So the whole idea is we should... Kubernetes is not something that could be covered in a transcript... So this is like a starting point, and if somebody really has a use case and wants to know why it came about, then they could start from there and explore what Kubernetes is, what containers are, how they differ from virtualization, stuff like that.

Kalyan:

One more common thing that's asked about is how incident management is done. That's something where we figured out there's a gap in how war rooms are conducted between a company that is starting up and evolving and companies that have fixed processes. Both have their pros and cons. That's something in our pipeline that we haven't released yet. But yeah, these are some of the things that we got as feedback from the community. And we made sure they were part of level 102 when we released the advanced course.

Kurt Andersen:

Nice. Okay. Excellent. So I think the initial release was in 2020 sometime. And you talked about getting on the front page of Hacker News. Akbar, can you tell us anything more about how the program has been evolving since the initial release?

Akbar:

Yeah, so just to echo what Kalyan was mentioning, right? We never anticipated this kind of reach. One fine day, when we woke up the day after the public release of the first School of SRE, we were on the front page, first line: School of SRE content, something LinkedIn open-sourced, right? And at that point in time, we hadn't even published the engineering blog for School of SRE. We had just published the School of SRE to the GitHub repo. So that was amazing to see. Ever since then, we have received around 100,000 unique visitors so far to the static website we host, right? And on average, we are receiving around 150 to 180 people visiting the School of SRE static website, right? And that is across the globe, right? And that is again why, as I was saying, some of the requests came in for language translation.

Akbar:

I know that if we can have that translation done, the reach would be even more, right? And in terms of the number of stars, we already have almost 5.3K stars on the Git repo, right? It's one of the topmost from LinkedIn itself, right? And the most important thing, right, and this is the most satisfying thing for each one who worked behind the School of SRE: it completely aligns with LinkedIn's vision and mission. We have received multiple responses and testimonials from people who said they cracked their interview and got their dream job as an SRE using the School of SRE, right? And that is a proud moment for each one who worked behind the School of SRE, right?

Akbar:

And so that's the testament we have. Everybody is super happy to receive this kind of testimonial. And beyond that, we also have the school of SRE LinkedIn group, which is close to 3K members. And it's an active group where people are posting good articles about different technologies, what is happening in the SRE space, and all those sorts of things. And it has people from the entry level to the most experienced persons in the industry, right. They're actually learning from each other and having the kind of community we wanted, right. So that's what has been happening since the school of SRE was released in 2020.

Kurt Andersen:

Yeah. Congratulations. That's awesome news. And it's great to hear how lively the community is. And 100 to 150 unique visitors every day, that's great. Well, as we try to draw this conversation to a close, tell us a bit about your future plans for school of SRE. How are you looking to evolve as time goes forward?

Kalyan:

Okay. So I would take that. So before the future plans, I think we would also like to take the opportunity of this podcast to thank the whole team of school of SRE. It's not just Akbar and me who are part of the team. It's a completely voluntary effort driven by a lot of people. There are people who are experienced in our team, and there are people who joined as SREs from college and who said, what's the gap that we had to bridge. So everybody contributed to it. And everybody made sure that they could contribute back to the community to build reliable systems. So kudos to all of them. With that said, our future plans: yes, one of the common things in GitHub is requests for translations. We also see some unofficial translations, nothing to do with school of SRE, in Chinese and certain other languages.

Kalyan:

We would be happy to coordinate with anybody who is interested; they could reach out to us. There is a contributing page on how people can reach out. If somebody's interested in any of these translations, I think there's one for German, there's one request for Chinese and Japanese; if somebody's interested in picking up any of this, even pieces of it, we are more than happy to collaborate with them. We are looking for more community contributions from outside LinkedIn. We have got one: the CI/CD module in level 102 is a contribution from outside LinkedIn. We'll be happy to have more contributions. At this point, for contributions, we still have to have the transcript, and it goes through a kind of peer review. So a mentor, or rather than a mentor, a buddy is assigned to each contributor, because it's a big transcript and it's difficult unless a person is assigned to evaluate it or go through it and give solutions for it.

Kalyan:

But we are open to all contributions. Please do open an issue in our GitHub repository, and we will look into it. And then we will assign a buddy, and then we can come up with the transcripts. So we are open to all community contributions. A community contribution need not be a transcript. There could be issues in the existing content. We have seen really nice people helping us to pinpoint certain mistakes we overlooked in the content. So if there is something, please do help us fix it; go ahead, raise a PR. And there could be other ways to help us as well. You could come up with an explainer or you could come up with a hands-on, and once you attach it to us, we could see how we can incorporate it into our existing content. And another plan that we have: one of the things we started this talk with is saying that the academic streams, the courses taught on campuses, do have some differences from what's needed in an SRE, so we are looking forward to engaging with campuses.

Kalyan:

We are taking baby steps in this direction. Right now it's not clearly hashed out, but we would be happy if any campus would be willing to take some course as an elective, or to tie up with student organizations like [Next 00:41:27] user group within the campus. But it could be voluntary, where students can look at how to incorporate this, how to explore the school of SRE contents. That's one more aspect we're looking at. And we have started initial levels of conversation with companies in India that have the SRE culture, to see how we can leverage school of SRE to standardize the onboarding process and hiring process within the SRE community. Like how competitive programming is at least one level of standardization in most of the SWE roles, we could have something similar for SRE, so that the preparation for an SRE role and the initial inertia for an SRE role would be less from a student point of view. So that's pretty much what we have as the future plans for school of SRE.

Kurt Andersen:

OK. Akbar, can you expand upon the use of school of SRE across other companies a bit just as a wrap up and how can other companies get engaged with making use of these materials for their own onboarding purposes?

Akbar:

Yeah, like I mentioned in the beginning, many organizations are actually seeing the challenge of getting that entry level talent training done, right, because of the amount of time they need to spend on training and getting those people onboarded. Now with the school of SRE, many of those things are actually being sorted out. We believe that it's being sorted out, and we have seen the immediate result in terms of LinkedIn SRE hiring, right.

Akbar:

So, one thing that organizations could do, and this is an opportunity: they could use this school of SRE content as the onboarding platform for their people. They can actually make use of it and use their existing experienced engineers to contribute back to the content, right. That way we could scale across organizations to standardize this process and actually make this more successful, right. It is not for school of SRE, but it is for the future generation of SREs, right? So we need to have more experienced people coming to this role. And that would be the success for the school of SRE. So for organizations, the ask here is to actually use this content for onboarding their own ELTs, or entry level talents, and to use their existing experienced engineers to contribute back to the content. So we could actually use it in a successful and meaningful way, right. So that's the ask.

Kurt Andersen:

Excellent. Okay. Well, I love your vision to improve the profession worldwide and make it easier for companies to onboard SREs because everybody wants to do SRE these days or many people want to do SRE. So you're doing a great service to the community here. And I congratulate you for the success that you've had so far and look forward to how things will grow in the future. We will have links on the show page for the podcast, for anybody who wants to see and find the repo, aside from Googling it. You can look for school of SRE or you can follow the links on the show page. And if you want to contribute, please do look up the contributions page there and make your improvement to the system, so that everyone can benefit in the community. Akbar and Kalyan, thank you very much for joining us on Resilience in Action today.

Kalyan:

Thank you, Kurt, it's great talking to you.

Akbar:

Yeah. Great talking to you.

