Resilience in Action E8: Vanessa Yiu on Crafting Enterprise Architecture

Resilience in Action is a podcast about all things resilience, from SRE to software engineering, to how it affects our personal lives, and more. Resilience in Action is hosted by Kurt Andersen. Kurt is a practitioner and an active thought leader in the SRE community. He speaks at major DevOps & SRE conferences and publishes his work through O'Reilly in quintessential SRE books such as Seeking SRE, What is SRE?, and 97 Things Every SRE Should Know.

In our eighth episode, Kurt chats with Vanessa Yiu, Head of Enterprise Architecture at Goldman Sachs. Vanessa shares her perspective on enterprise architecture, experience in operating enterprise-scale platforms, chairing the first global SREcon, advocating for women in STEM, and how enterprises can embark on the journey of making reliability more important.

See the full transcript of their conversation below, which has been lightly edited for length and clarity.

Transcript:

Kurt Andersen: Hello. I'm Kurt Andersen. Welcome back to Resilience in Action. Today, we're talking with Vanessa Yiu, who is a senior engineer and head of enterprise architecture at Goldman Sachs. She has over 15 years of experience dealing with both the systems and the people teams associated with enterprise-scale systems. She's also one of the co-chairs for this year's SREcon conference, which we will talk about later. So by way of introduction, Vanessa, you said that you practice jewelry making as a hobby-

Vanessa Yiu: Mm-hmm (affirmative). That's right.

Kurt Andersen: Outside of work. And I wanted to see, are there lessons for SRE that we can take from your hobby or anything you've learned in that area?

Vanessa Yiu: Yeah, first of all, hi everyone, and thanks for having me, Kurt. And I will call out that this is a very insightful question because I don't think anyone has ever asked me this question to date, but between SRE and jewelry making, there are actually some parallels. So let me talk about some of those. So first of all, the importance of good design, right? I think we all know as SREs that if you don't think upfront about your non-functional requirements, what your security controls are, how your system is going to scale, all these aspects, then doing that later on, retrofitting or whatnot, is going to be much more costly and it's going to take so much longer. The same with jewelry making. It's actually quite precise. Let's say you're making a ring and you're trying to build the setting for a specific stone. The measurements have to be precise, the design.

Vanessa Yiu: How you want the overall piece to look at the end, the finish, et cetera, you really have to think through that right at the beginning. You can't do that in the midst of the crafting process. So I think that's definitely one element. There's no way you can go back and retrofit some of this stuff. And actually anyone who's done this for a while will know that sometimes despite all of the careful planning that you do upfront, bad things will still happen to the actual crafting process. So to give you an example, obviously, if you're firing a jewelry piece, you're soldering or whatnot, sometimes you will have something called a fire stain. So this is really when you're firing a precious metal, oxygen oxidizes the metal that you're firing and you end up with a permanent stain in the metal.

Kurt Andersen: That doesn't sound good.

Vanessa Yiu: No, and you can't avoid that. I mean, in manufacturing processes, they actually try to extract the oxygen as they're firing, but if you're just your usual hobbyist at a workbench, obviously you can't really do that. So what do you do? The last time this happened to me, I was making a ring. It was meant to be for the ring finger, and of course the only way you can fix it is at the polishing stage, where you try and polish off as much of that stain as possible, but then the ring ends up being much bigger than it was intended to be. Sometimes you just need to think about, "Okay, how are you going to react to some of these things that you know are not really in your control?" And that's similar to when bad things happen in production.

Kurt Andersen: Yeah, being able to adapt on the fly.

Vanessa Yiu: Correct. That's right. Oh, yeah.

Kurt Andersen: That makes good sense. So you joined Blameless for a webinar back in March, and we talked about a bunch of different things. One of them was that you mentioned this role of enterprise architecture, besides what we had known you as in your SRE role.

Vanessa Yiu: That's right.

Kurt Andersen: So for our listeners, can you give us some idea of what enterprise architecture is?

Vanessa Yiu: Yeah. So first and foremost, I'm not actually an EA practitioner by trade. So I think about EA from an SRE perspective, right? First of all, apologies to anyone in your audience who's an EA purist, but there are actually lots of different definitions of what enterprise architecture is across the industry. And that I think is similar to, say, definitions of DevOps or SRE. Now the one that I like the most is this one, and I'm just going to read out the formal definition, which is: enterprise architecture is the continuous practice of describing the essential elements of a socio-technical organization, their relationships to each other and to the environment, in order to understand complexity and manage change. And this is a definition that came out of the EA research forum way back in 2009.

Kurt Andersen: Wow. That seems a lot more forward thinking than what I associated in my mind, enterprise architecture being this old stodgy thing.

Vanessa Yiu: Yeah, that's right. And I'm sure we're going to talk about some of this later, right? The perceptions of the function, of what it means, but yeah. So the reason I like this definition is that, first of all, it talks about EA as being a capability within the organization. It's not any kind of one-off effort. It's not like just this one thing that we do. Then there is the fact that in any organization there is a combination of social and technical elements, and often as engineers, we only think about the systems and the technical piece and not the social and the people piece. So the definition calls that out explicitly. And then the last bit, which I think is the most relevant and interesting part, at least for me as an engineer, is the fact that the purpose of the function is really to manage complexity and change in the organization. And I think, again, that will resonate with a lot of SREs because that's really a core part of the job as an SRE.

...the purpose of the function is really to manage complexity and change in the organization.

Kurt Andersen: Yeah. It strikes me as interesting that the definition you read came from 2009. 2009 was when DevOps was coined as a term, and as for some of these concepts, SRE predates that by a few years at Google. And it's really interesting to me that this concept of the importance of the socio-technical organization, understanding the environment and complexity and change, there were obviously thinkers who were paying attention to these things. It's unfortunate that it didn't get broader exposure. It must've stayed locked in the enterprise architecture space to a certain extent.

Vanessa Yiu: Yeah. In a very specific... Only practitioners of it at the time were really-

Kurt Andersen: Siloed, if we like.

Vanessa Yiu: That's right. Yeah. I mean, having spent most of my career to date managing different types of production systems and services, the way I think about EA is that it's more of a macro view rather than a vertical type view. And really when we talk about enterprise architecture as a noun, it is really a complex system of systems within an organization, and it is really the job of EA to understand how these elements are linked to each other. And this is almost similar to urban planning, right? As the urban planner, you have to figure out what the blueprint for your towns and cities looks like and figure out how it's going to develop over time. So that's what I think EA is about. And then how does it tie to reliability?

Kurt Andersen: A leading question.

Vanessa Yiu: Yeah, exactly. I think first of all, before we get to that point, we have to talk about what most people associate EA with. And I think you had already alluded to this right at the beginning, which is almost like this old thing, governance related.

Kurt Andersen: Architecture review boards, change review boards-

Vanessa Yiu: Absolutely.

Kurt Andersen: ... all the anti-patterns that DevOps is trying to smash.

Vanessa Yiu: Exactly. So you're right. Traditional EA frameworks and implementations are very documentation heavy, right? It's largely focused on collecting artifacts for things like reference data architectures, your business processes, your application landscapes, and then the enterprise architects act as gatekeepers, if you like. If you want to build something or change something within the ecosystem, you have to go to that architecture review board, jump through all the hoops and get something signed off. And of course, anyone who's worked in any large scale enterprise would know that this is not grounded in reality. That's not really how organizations work. And no one wants to follow these processes.

Kurt Andersen: Not really. No.

Vanessa Yiu: Exactly. The more we move towards agile development practices, the harder this becomes. It's just not going to scale. So then how do we make this useful to developers so that they don't see it as an ivory tower type function and it's not a hindrance to their day-to-day? And then we said, "Okay, let's think about this like an SRE," which is to apply software engineering principles to the problem. I know for SRE it's about how we manage operations, but let's think about how we apply the same principles to enterprise architecture, and really what are the key problems that we're trying to solve for? And the way I see it, there are two key things we need to solve.

Vanessa Yiu: One is to really understand what is in the enterprise ecosystem, right? Because this helps us understand capabilities, and it helps us assess and manage the impact of changes. And when I say understanding what's in the ecosystem, I mean all the way through, from, say, the service level, business services, technology services, application assets, infrastructure assets, et cetera. The whole topology. And then the second one is about technical governance, and it's about how we apply the right guard rails for technical governance and controls and the right architectural biases from the get-go as part of the platform.

Kurt Andersen: How do you approach governance as an enabling capability then as opposed to a hoop to jump through?

Vanessa Yiu: Yeah. So the way we think about solving this problem is really around golden or happy paths. So if we define-

Kurt Andersen: Explain that for our listeners, please. Yeah.

Vanessa Yiu: Yeah. So this is almost like having a default path for say, if you do something, let's say, I don't know, like building data pipelines, right? The golden path for doing it would be the way that follows the platform technology and the architectural biases that you've defined for your organization. Through the scaffolding of how you onboard and build that pipeline, it already has the right controls in place.

Kurt Andersen: Ah, okay. All right. Compliance as a service and some of the other automated approaches that are becoming more common is what you're suggesting?

Vanessa Yiu: That's right. And it's not done as a separate process, it's not done as an afterthought, it is just the default behavior if you are to use my offering. And then this becomes easy. What you then do is also to put additional governance processes in place only if people want to deviate from that default, from that relevant path because then you're also incentivizing the correct behavior, because the default will be easy and if you don't want to follow it, it will be difficult. That's the way we're thinking about how to incentivize people to follow the right governance without making it like a rubber stamping red tape approvals type process.

"...we're thinking about how to incentivize people to follow the right governance without making it like a rubber stamping red tape approvals type process."
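As a rough illustration of the golden path idea described above, the sketch below shows a scaffold that hands out a pipeline specification with the governed controls already switched on, and that asks for extra review only when someone deviates from those defaults. It is a minimal sketch under stated assumptions: the control names, `PipelineSpec`, and `scaffold_pipeline` are hypothetical and are not taken from the conversation or from any real platform.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical default controls for the golden path (illustrative names only).
GOLDEN_PATH_CONTROLS: Dict[str, bool] = {
    "encryption_at_rest": True,
    "access_logging": True,
    "owner_tag_required": True,
}


@dataclass
class PipelineSpec:
    """A data pipeline request, pre-populated with the default controls."""

    name: str
    controls: Dict[str, bool] = field(
        default_factory=lambda: dict(GOLDEN_PATH_CONTROLS)
    )

    def deviations(self) -> List[str]:
        """Return the controls whose values differ from the golden-path defaults."""
        return sorted(
            key
            for key, expected in GOLDEN_PATH_CONTROLS.items()
            if self.controls.get(key) != expected
        )

    def needs_review(self) -> bool:
        """Extra governance applies only when the spec leaves the golden path."""
        return bool(self.deviations())


def scaffold_pipeline(name: str, **overrides: bool) -> PipelineSpec:
    """Create a spec on the golden path, applying any explicit overrides."""
    spec = PipelineSpec(name=name)
    spec.controls.update(overrides)
    return spec


if __name__ == "__main__":
    default_spec = scaffold_pipeline("orders")
    custom_spec = scaffold_pipeline("adhoc-export", access_logging=False)
    print(default_spec.needs_review())  # the default path needs no extra process
    print(custom_spec.deviations())     # the deviation is what triggers review
```

The design choice mirrors the incentive in the conversation: following the default costs nothing extra, while a deviation is the only thing that creates additional process.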

Kurt Andersen: Right. Right. Well, governance as an integrated pre-baked thing is valuable. I mean, one of the things that has made the Government Digital Service in the U.K., or some of the work at the USDS, successful in transforming government processes is having these pre-approved frameworks, for instance, for going into the cloud. And then all of the government agencies can bypass all of the contracting paperwork and all the other associated paperwork of rolling their own, and they can just take advantage of these more modern platforms out of the box.

Vanessa Yiu: That's right.

Kurt Andersen: Is that basically the golden path?

Vanessa Yiu: That's right. Yeah. That's the approach that we're trying to adopt for the organization. Now, of course, for any enterprise, it's easier said than done. I mean, you really have to think, for every service we provide, what does good look like? And then how do we make sure that we do automate and provide all these kinds of capabilities out of the box? And often, of course, services are not stand-alone. They're often tied to other things. So how you manage and scale this is another interesting problem space, but that's really the angle we're driving at.

Kurt Andersen: Okay. So the other side that you mentioned was understanding what's in your ecosystem. How do you do that on a continuous basis? How do you overcome the problem of work and ecosystem as imagined versus what's actually out there on the servers or sitting at the desks?

Vanessa Yiu: Yeah. So this is really about provenance and about dependency mapping, and this needs to happen at multiple layers. For example, at the service level, you really need to be able to track who is consuming my service. For example, with API calls, how do I map back to my end users? Which business do they belong to? And then which are the applications that provide that service capability? And then even at the application level, think about, "Okay, well, if I'm building a piece of software, where are those libraries, where are those modules being pushed out to in your environment? Which hosts?" And it's not just about where it gets deployed to, it's also where it is going to be instantiated. Just having the software doesn't mean it is really running in production.

Vanessa Yiu: The way we think about it is you have to build that tracking and that type of automation and collect all of that data at multiple levels to really build a view of what is truly out there in the enterprise ecosystem at any given time and what is actually being used. But having that data is really powerful because you often have to answer questions like, "Okay, well, what is this particular business truly dependent on from a platform perspective?" If we care about resilience and recovery capabilities as a business function, which are the other systems that I need to care about for business function A versus B? And unless you have that view and that data, you're not going to be able to answer that accurately. No set of reference models is going to be able to answer that for you in production.

"you have to build that tracking and that type of automation and collect all of that data at multiple levels to really build a view of what is truly out there in the enterprise ecosystem at any given time and what is actually being used"
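The question raised above, what a particular business truly depends on, can be sketched as a walk over observed dependency data. The example below is only an illustration under stated assumptions: the edge list stands in for whatever tracking data an organization actually collects (API call logs, deployment records, running-process inventories), and every node name and helper function in it is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_graph(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build a consumer -> providers map from observed (consumer, provider) pairs."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for consumer, provider in edges:
        graph[consumer].add(provider)
    return graph


def dependencies_of(graph: Dict[str, Set[str]], start: str) -> Set[str]:
    """Return everything reachable from `start`, i.e. its transitive dependencies."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for dep in graph.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    seen.discard(start)  # a cycle back to the start is not a dependency on itself
    return seen


if __name__ == "__main__":
    # Hypothetical observations across layers: business -> service -> app -> host.
    observed = [
        ("business:payments", "service:ledger-api"),
        ("service:ledger-api", "app:ledger"),
        ("app:ledger", "host:h1"),
        ("business:research", "service:search-api"),
        ("service:search-api", "app:search"),
        ("app:search", "host:h2"),
    ]
    graph = build_graph(observed)
    print(sorted(dependencies_of(graph, "business:payments")))
```

Because the edges come from what was observed rather than from a reference model, the answer reflects what is actually in use, which is the distinction drawn in the conversation.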

Kurt Andersen: Right. Right. Yeah. Okay. So we've talked a lot about the software, the technical side of the socio-technical systems.

Vanessa Yiu: Yes.

Kurt Andersen: What does EA do regarding the socio side of the ecosystem?

Vanessa Yiu: Yeah, I think, first of all, you need to acknowledge that there are functions in organizations that are performed by humans. There are going to be services that are just not provided by machines. Those also have to be cataloged and tracked: which teams provide what services to the end client? And then also, where are they distributed? Geolocation matters to us in terms of our recovery capabilities, so we map that as well. So those are some of the things that are important to EA too. And then also, okay, well, if there are things that are being done manually today, how do we drive automation of those things?

Kurt Andersen: Right. Okay. Makes sense. And then one of the things that's been interesting in the last year, as we have all dealt with the emergency caused by the pandemic, is the ability of organizations to adapt. We also see this occur when there's a cybersecurity incident, for example, and I'm thinking of the Maersk issue with NotPetya hitting them and the fortunate accident, if you like, that they were only able to recover because they had a server that happened to be offline for maintenance at the time that the rest of the network got wiped out. Does EA have anything to contribute to understanding how an organization can adapt quickly when all of a sudden everybody's got to work from home, for example?

Vanessa Yiu: Yeah, I think so. I think if you understand the capabilities that you're providing and who provides those services, then that data can help drive, "Okay. Well, if this population is now working from home, do I still have enough coverage? Do I have any concerns around...?" I mean, there are certain business functions in my organization that can only be performed in certain locations. Maybe regulatory reasons or-

Kurt Andersen: Yeah, geographic or legal.

Vanessa Yiu: Exactly. Or data type storage or privacy type requirements. So then working remotely, how is that going to impact that? And again, I think this is more of the social and the people element, right? How does this impact my capability? And again, that data helps us confirm whether that's okay or whether that's not okay.

Kurt Andersen: Okay. That makes sense. So let's shift now to talk a little bit about SREcon. You've been a participant at SREcon now for a few years. It's been great having you and your team make various presentations. Tell us a little bit about what is interesting and exciting about the conference to you.

Vanessa Yiu: Mm-hmm (affirmative). The conference, I mean, is obviously a place where SREs from different companies meet. And I always find, for example, the corridor conversations are super interesting. I've only really ever worked in one organization as an SRE. Understanding other practices, how people actually do SRE on the ground, how it applies in different industries, I think to me is the most insightful part of attending SREcon. And I mean, the talks are great as well, right? You get different perspectives, even on the same topic, and you hear completely different things even in the same conference, and I find that really, really interesting.

"Understanding other practices, how people actually do SRE on the ground, how it applies in different industries, I think to me is the most insightful part of attending SREcon."

Kurt Andersen: Okay. So tell us a little bit about this year's theme. I don't want to turn this into a complete ad for SREcon, as much as I'd like to, but tell us a little bit about this theme and how people can get involved.

Vanessa Yiu: Yeah. So this year's theme is how emerging technologies are influencing and shaping the world of SRE. So I think we touched on this right at the beginning, but SRE has now been around for some time and I think there are lots of practices and principles that we love and we apply in our day-to-day lives. As we were discussing with the co-chairs, we were like, "Okay, but what next? What are the new things that are coming out of SRE that perhaps people are working on in their respective companies or industries that we should share with each other?" And I think this is really how we formed the theme for this year. And I know there were also some discussions around, obviously, the pandemic.

Vanessa Yiu: We just touched on remote working, and I think last year's theme was more focused around that. How have we gone about scaling our environments, et cetera, during a pandemic? Are there learnings from this that we've now discovered, where we say, "Okay, we want to turn this into some sort of formal practice or make this more permanent across the industry"? So I think these are some of the thoughts behind this year's theme for SREcon. We want it to be forward-thinking.

Kurt Andersen: Very good. Very good. And how do people find out more information? I can throw in... We can attach the link, by the way, to the podcast when it comes out.

Vanessa Yiu: Yes, that would be great. Yes, the call for participation is now open. It is published on the USENIX SREcon website, and we are accepting submissions until the 30th of June. So pitch us, everyone, and please submit talks for SREcon '21.

Kurt Andersen: Awesome. And is there anything besides the theme for the year? Is there anything that's particularly different about this year's SREcon from previous ones?

Vanessa Yiu: Yeah. So it's a fully virtual conference, a global one. For those of you who are familiar with SREcon, historically, we would run regional in-person conferences. There would be one for APAC, one for EMEA, one for the Americas, or two for the Americas. So this year, we're running one global virtual conference. We're also looking to host talks across different time zone blocks to make sure we're as inclusive as possible in terms of attendees and also locations for the participants.

Kurt Andersen: Awesome. Well, thanks. And as I said, we'll attach the link to the podcast for people to find the information on the USENIX website. Before we finish up the podcast, you mentioned in your introductory notes, in your bio, that you are very passionate about education, about opening up this field to underrepresented minorities, as well as women, who do fall into that category, and about how to get students interested in this field, as opposed to maybe just straight software development. Do you want to talk about some of the things you've done there and where you've seen success, and what activities work to open this field to new participants?

Vanessa Yiu: Yeah. I think there's a range of things here, right? And some have more direct, measurable impact and some do not. So I'll just talk to both. I mean, from encouraging people to get into STEM, thinking about STEM as a potential career option, something they want to study in higher education, that you have to definitely start early, right? So there are initiatives that I've been participating in where you are really going out to schools, talking to girls or kids in classrooms, et cetera, making sure that they are aware what STEM is, what is the range of careers that are available out there and get them thinking, because I think at that stage, if you only go to talk to someone when they've already made those kinds of decisions, then it's possibly too late. And I think those are the things that are hard to measure direct impact, as you go and be a participant and you go out and talk to school children, but actually very rewarding. So let me give you a little bit of a background of how I got into STEM.

Kurt Andersen: Sure.

Vanessa Yiu: So I hadn't really considered doing computer science at all when I was in school, but I was very interested in the arts, probably as a hobby, and also in music. My first interaction with a computer was actually for composing. So I used my dad's computer to compose my tunes for the piano and then suddenly one day I was like, "Okay, well, the computer is not a piano and yet I can compose and I can play music on it. So why does it allow me to do that?" And I think this was really the trigger for when I went from thinking about technology and computers as a user to thinking almost like an engineer. And I was like, "Well, I want to know why this thing allows me to do this."

Vanessa Yiu: This is how I got interested in computers. The more I learned, I was like, "Okay, well, this is really interesting." So I did IT through school and then I ended up doing computer science in university. And I think it is that almost that spark and that trigger. Sometimes when I go out and I meet kids and it's clear that they hadn't thought about technology in that sense. Because you will go into a room, you say, "Well, how many of you use Instagram?" And they all put their hands up. But then if you ask the question like, "Well, do you want to be the person that builds this platform, this application and allows people to use it?" Then you see eyes light up in the room sometimes, right? And I think this is where you realize, "Okay, people had not considered that possibility."

Vanessa Yiu: And through going out there and talking to them, you are exposing them to technology and what it could potentially mean. And I think that to me is quite powerful and I really enjoyed that element of it. And then of course there are initiatives where you can probably see more of a direct link. Recruitment type initiatives, attending conferences, talking to attendees and actually helping them with recruitment, mentoring them, career development, doing that through different organizations, et cetera. It's easier to measure the impact of that. If someone is able to get a role in a company doing software development off the back of those programs, then you can measure the success there. But there's a whole range of things I participate in.

Kurt Andersen: So I know in the United States, the Grace Hopper conference, named after the computer science pioneer Admiral Grace Hopper, is a key focal point for women who want to get into the industry or who are already in it. Are there any key conferences that you'd like to point people to or mention for people in the U.K., because you're based in London, I believe?

Vanessa Yiu: Yeah. So there are a couple. So first of all, Grace Hopper is actually coming to Europe for the first time this year.

Kurt Andersen: Wow.

Vanessa Yiu: So we will actually be present there as well. So I hope to meet participants of that conference. There are also a couple of others, like Women of Silicon Roundabout. It's a big event. There is a One Tech World conference, which just happened about two weeks ago. So yeah, there are numerous organizations that run conferences or other events, or even just hackathons, and they happen regularly throughout the year. So lots of different channels. For those who don't come from a tech background but want to get into tech, there are also organizations like Code First Girls. They offer free coding lessons for people to sign up to. We recently participated in a Python workshop with that organization. So there are many different routes.

Kurt Andersen: So if some of our listeners maybe are excited about this avenue of outreach, but perhaps their organization doesn't have a history or a track record of doing any kind of outreach, what would you suggest? How would you find it...? How would you advise them in getting their organization to become more active in reaching out to women, minorities, and others that may not have considered the field?

Vanessa Yiu: So they definitely need to get across the message about the importance of diversity, right? The organization needs to buy into that, first and foremost. They need to recognize that it is important to their organization. And then I would say be proactive, find out the routes and then say, "Okay, well, these are the key organizations." It could be as simple as sending someone to participate and then starting to build that profile and that network, getting to know the organizers and seeing how you can tap into those channels. I think sometimes it is as simple or as grassroots as that. And I think there have been many initiatives where it's grassroots driven and networks have sprung up and then they have grown over time. And often those are also the most impactful because the people that get involved are truly passionate about that topic or that focus area. So I've also seen success from that.

Kurt Andersen: Awesome. Okay, so Vanessa, we've covered all kinds of interesting things in talking for this podcast, and in conclusion, I wanted to ask you one last question. If somebody is listening to this podcast and maybe they work at a large organization, we'll characterize it as an enterprise, and they know that their enterprise wants to start on this SRE journey, they want to make reliability more important, how would you suggest they go about starting that journey?

Vanessa Yiu: So a couple of things I will highlight. The first one is that you definitely need to understand the key problems that you need to solve for your organization, because this will vary depending on the industry you're in, how mature your organization is, et cetera. And, to be honest, I will say this applies more broadly, right? Not just in terms of starting SRE, but even product development or anything you're looking to do. Identify the key problems. And then the second thing really is that you do need to get sponsorship at the executive level, right? To help you get the right level of focus across the organization to prioritize this and help make it happen. Then I would say, once you do have that... I think we could have a whole conversation about how to get sponsorship as well.

Kurt Andersen: Yeah, I know, that was a question going around in my head: how do you find that executive sponsor? But I think we're running out of time.

Vanessa Yiu: Yeah, we can have another podcast on that, on SRE in enterprises, at some point.

Vanessa Yiu: But then I'll say, once you do have the sponsorship, you need to define some measurable goals and deliverables, right? And focus on some quick wins that help you build the case that this can be successful, and also build trust with your stakeholders that this is something that could work. And I would say there are definitely some pieces of advice that I would give on how to make this happen, right? The first one is really to be pragmatic about it. When I say measurable, focus on, like I say, the easy wins and be pragmatic about your approach. Everyone's read the Google SRE book, but most organizations are not Google, right? So trying to do something that's too aspirational, or trying to put in place stuff that is too mature in terms of service maturity for what you need, is not going to work for you. So just be pragmatic. What are the simple things you can do to start? Recognize that your organization will probably have legacy stuff out there, right? You probably need to focus on eliminating toil to begin with. Just be pragmatic about what you can achieve in the space of time that you have. Don't boil the ocean, because that's never going to work.

"To improve reliability well, focus on the easy wins and be pragmatic about your approach. Don't boil the ocean because that's never going to work."

Kurt Andersen: I think it's fair, when you mention the SRE book, to note that even Google didn't write that until they had been implementing SRE for something like 12 years. So people need to cut themselves a little slack when they are trying to evaluate what they're aiming at.

Vanessa Yiu: Yeah, I totally agree. I'd say the other important piece of advice is to get feedback often and fail fast, right? Some things are just not going to work. Let's say you embark on trying to fix something and then you realize, okay, well, there are complexities around here, there's stuff that we just didn't know, and it turns out to be pretty difficult. You're going to have to course correct, and you'd much rather get feedback from your stakeholders, get feedback from the engineers on the ground about what's working and what's not working, fail fast, and adapt your approach. I think that's super critical in an enterprise.

Kurt Andersen: Awesome. Well, thank you so much for talking with me and with our wider listening audience today, Vanessa. It's been awesome to have you join us and provide your insights.

Vanessa Yiu: Thank you so much for having me. Super fun, thank you.

Kurt Andersen: All right, and on behalf of Resilience in Action, I'm Kurt Andersen. Thank you for joining us and for having joined our guest, Vanessa Yiu from Goldman Sachs.

About the Author
Blameless Community