Building Disaster Muscle Memory and Collaborative Resilience in DevOps Teams with Matt Lea

Podcast

About This Episode

In this episode, hosts Rachael Lyon and Jonathan Knepher are joined by Matt Lea, creator of Cloud War Games, to explore how teams can prepare for real-world cloud outages and cyber incidents before they occur.

Matt explains how immersive cloud disaster simulations help DevOps and security teams develop muscle memory, reduce single points of failure, and stay calm under pressure. The conversation covers incident response culture, credential leaks, least-privilege access, and how to differentiate between misconfigurations and active attacks. They also explore emerging risks related to AI agents, bot traffic, and multi-cloud decision-making.

Whether you’re leading a startup team or supporting large-scale cloud environments, this episode offers practical advice for building resilient teams capable of acting decisively when it matters most.

Welcome, Matt Lea

Rachael Lyon:
Hello, everyone. Welcome to this week's episode of the Point Podcast. I'm Rachael Lyon, here with my co-host, Jon Knepher. Jon, I've missed you. It's been a couple of weeks in your worldly travels.

Jonathan Knepher:
Exactly. I'm glad to be back home, though, back from visiting the kids out in Granada, where everywhere you go out, you get tapas with all of your drinks.

Rachael Lyon:
Isn't that wonderful? You spend like €5, right, on a beer, and then you get like this delicious buffet meal.

Jonathan Knepher:
Yeah, exactly.

Matt Lea:
It was fantastic.

Rachael Lyon:
I love that. I am all about that. Well, we have another awesome guest this week, you guys. I am so excited to welcome Matt Lea. He's the creator of Cloud War Games, an online training platform designed to help cloud engineers and DevOps professionals develop their problem-solving skills by fixing realistic issues in simulated cloud environments. And my favorite thing: please go to his LinkedIn profile, where he talks about helping CTOs running on AWS sleep better at night. Which is something we all want, right? We all absolutely want that. So welcome to the podcast, Matt.

Matt Lea:
Thanks for having me. Yeah, I'm glad you guys took an interest.

Jonathan Knepher:
Yeah, Matt, thanks for joining us. Let's kick it right off, though, here.

 

[01:36] What is 'Cloud War Games'?

Jonathan Knepher:
Outages and service delivery problems are not only costly, but they get a lot of attention. Customers get upset. What do you do and how do you help responding teams stay calm and make the right decisions when things go badly?

Matt Lea:
So it's an excellent question. I'll start off, I guess, with where the inspiration for Cloud War Games came from. I was training up these more junior cloud professionals, DevOps-type people. My biggest client is a big e-commerce company. If they go down, they lose about $100,000 an hour. So you can imagine how many guys in ties or C-suite type people are breathing down your neck right there. And so I saw these junior guys. I try and give them a chance to handle some of the outage, but after a certain point, you have to step in and move. And they were having trouble, just getting stage fright of sorts. You know, "I don't know what switches to push," or anything like that. So I thought, wouldn't it be great if there was a way we could simulate this disaster, you know, in our staging environment or testing environment, or rerun the same disaster we just hit?

Matt Lea:
So over the years, I'm also a journaler. I write down tons of things. And so I actually have a stack of all these 3 a.m. problems. The big headaches. Hard-coded read replicas and stuff like that. Or someone hard-coded the write replica, and it couldn't switch over. Oh, just the things that eat hours of your time, and you hate it. But, you know, if you get lemons, you might as well make some lemonade. So then I started designing these simulations. For the first iteration, I just used a big whiteboard and actually drew out a network diagram, kind of Dungeons and Dragons, the whole thing. And then after a while, I'm like, why not use real infrastructure? I can design something that could run for a couple pennies and simulate the whole disaster.

Matt Lea:
And so that kind of expanded. And then more people wanted to be the inframaster, as we say, you know, like a dungeon master. And so it kind of kept expanding from there, and it became a fun little thing to do. And eventually, I was talking to a couple of colleagues, and they're like, "Why don't you make this into a business?" And there we go. Cloud War Games was born.

Jonathan Knepher:
That's awesome. And you know, something that I noticed in some of these roles, too, was allowing the first people working on things to be more enabled and empowered to take action. How does this work together in enabling a fast response?

Matt Lea:
Well, that's a really interesting part there. So you might think it pays to be competitive. You know, you want to have that one super engineer, and he's like, "I can fix anything," but no one else knows what he's doing. So when we design these scenarios, most of the time I try and make them more collaborative. I want to see someone say, "Okay, you go check the DNS records, make sure that's all going. You go check database metrics." They should be shouting out and communicating. That way, you've got multiple people attacking the problem from multiple different angles.

Matt Lea:
You start with one guy from the back end, another guy from the front end, another person checking somewhere in the middle, where the application layer is, and you get more efficiency there. It's not really intuitive off the bat, but if you can coordinate and rehearse this type of stuff, it becomes much more intuitive.

 

[05:04] Building Incident Response Muscle Memory

Rachael Lyon:
I love this idea, right? Because, you know, with incident response planning, if you're lucky, you're doing kind of a tabletop exercise once a year. But something like this is much more immersive, I believe. And also scenario planning, right? Because scenarios change, particularly when we look at AI and all of these other things. It seems like this would be a great way for folks to start building muscle memory, versus "Oh snap, where did I put that plan? And are these people even still here?" I'd love to hear a little bit more on your perspective, kind of this eye-opening moment perhaps that your clients are seeing.

Matt Lea:
Yeah, it really opens up their eyes to data silos, I'd say, which you don't really notice when you're just doing it on paper. Sometimes it pays to take whoever your lead engineer is, the one where you're like, "Oh, if this guy gets hit by a bus or takes a week off, it's not gonna be a big deal." Have him sit on his hands, and then have the people reporting to him try and solve it. First, without him on the call. And second, assuming the juniors don't figure it out, you bring in the person, but still have them sit on their hands. They can talk through it. Everybody else is making notes. And so you really find out where the gaps in your knowledge base are. So many customers or clients have single points of failure around a key person, and they don't even realize how bad it is. That's one of the biggest revelations I see. And if you have that person, I strongly suggest you take their keyboard away and break something, and then see how long it takes to come back, in dev or staging or something of that nature.

Jonathan Knepher:
Yeah. We had an executive here many years ago, and his view was if you ever had a person who was a single point of failure, you had to move them, move them into a different team or something to remove that. And that was kind of the extreme example of that.

Jonathan Knepher:
But kind of bring us to the next step on the team dynamics. Are there cultural factors or team dynamics that you see that maybe they need changing or they need other training and practice, and how does that lead to hopefully making something like this sustainable?

Matt Lea:
Yeah, culturally, a lot of times I see a fear of failure, which is interesting. I deal more with early-stage startups, but a lot of this gets trained in academia. Like, okay, you submit the test, and that's it. If you got the answers right, you're right. If you got the answers wrong, you're wrong. But we don't live in that environment. We live in an environment where you can iterate extremely fast. And the big thing is, it's good to be able to know which switches to flip quickly, and you have to know which ones are irreversible.

Matt Lea:
So, you know, deleting a production database, that's irreversible. Don't do that. But knocking over a handful of ECS tasks, Docker tasks running, those will automatically reboot from the same image, assuming we haven't deleted the image or something. That's a finite thing. Even if you delete the image, could you run the build pipeline and get it back? So you have to know which switches to flip and be willing to flip them fast. That's a culture thing. And that's one of the biggest things I see people just freeze up on: "Can I scale this down to zero, run another deploy?" Possibly. Go for it. So that's one of the things I guess I'm good at, and try to teach: know which switches to flip. And that's something you could document or drill, you know, "Under no circumstances should you scale production to zero," something of that nature.
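
One way to make the reversible-versus-irreversible distinction drillable is to write it down as data before an incident. The sketch below is illustrative only: the action names, the `RUNBOOK` table, and the `can_flip_fast` helper are hypothetical, chosen to mirror the examples in the conversation rather than any real tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    reversible: bool
    recovery: str = ""  # how you get back if the action goes wrong


# Hypothetical runbook entries modeled on the examples mentioned above.
RUNBOOK = {
    a.name: a
    for a in (
        Action("restart_ecs_tasks", True, "service relaunches tasks from the same image"),
        Action("delete_container_image", True, "re-run the build pipeline"),
        Action("delete_production_database", False),
    )
}


def can_flip_fast(action_name: str) -> bool:
    """True only for known, reversible actions; anything else needs a second look."""
    action = RUNBOOK.get(action_name)
    return action is not None and action.reversible


if __name__ == "__main__":
    for name in RUNBOOK:
        print(name, "->", "go" if can_flip_fast(name) else "stop and ask")
```

Treating unknown actions as "stop and ask" is a deliberate choice here: a switch nobody has classified yet is exactly the kind a drill should surface.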

Rachael Lyon:
But yeah, so I imagine, I mean, something like this, obviously, I think to learn the most and be the most effective, it's one thing to say, okay, hey, next Friday we're gonna do this thing versus we just put it out in the wild, make it happen. And you gotta respond. Obviously, that requires a certain level of buy-in at the executive level or things like that. But I mean, how are you seeing, particularly I guess with startups, are they open to, you know, we got to be disruptive. If we're going to learn anything and have lasting learning, we got to be disruptive, and we got to do it in the moment, versus we're not going to know when something's happening. I mean, how are you seeing that being taken by the companies you're working with?

Matt Lea:
I'll say this: most of the time, if I just approach a company that hasn't seen any cybersecurity issues, hasn't had an attack or a massive outage, it's not a high priority. But the day after that outage, it becomes a priority. And that's usually when I get the phone call. Like, that AWS outage just a few weeks ago, that's when I got the phone calls. Before that, you know, it was a little quiet. I deal with early-stage startups. They don't have a lot of cyber attacks on them. They've got just the generic ones, but not the very focused ones.

Matt Lea:
But you get that first focused one, or you get the first time a junior pushes credentials that can send email out to a public repository, and you send out 15 million emails in about eight minutes. True story. Then you start thinking about these things. So I wish I could approach them sooner, but at least once they see that, they're generally like, "Okay, we don't want that to happen again. How can we drill for that? How can we set policies?"

Jonathan Knepher:
So you brought up cyber attack versus other issues. What types of things do you train folks to look at in order to quickly differentiate? Right, like, are we down? Did something break? Is this a misconfiguration? Or are we under attack, and something outside of our control is triggering some issue?

Matt Lea:
Yeah, WAF is a big one for DDoS: AWS WAF, the web application firewall, and the logs in there. Access logs, of course. Metrics. You're looking for fluctuations in traffic. If you're not seeing fluctuations in request counts or external traffic, then chances are it's not a DDoS. But that doesn't mean it couldn't be a SQL injection attack or some other type of attack. So, starting with "are we getting a massive increase in traffic," I basically create dashboards starting at the external layer: Route 53 (sorry, not RDS, Route 53). We've got all our metrics there, and we're looking for spikes anywhere there. Then it hits API Gateway, and we're looking for spikes in there.

Matt Lea:
It hits the ALBs, and we're looking for spikes. It hits the ECS tasks running, and we're looking for CPU usage spikes. And so we can see the layers as you go through, and I just look for the discrepancies in there. That gives me a pretty good idea of where to start. Okay, the discrepancy starts here for some reason. Oh, we've got some batch job that's pounding it right now. We're in trouble.

Matt Lea:
So I try and get people to stack it visually. I'm huge on diagramming too, and drawing. I don't know if you've seen my YouTube videos with my pixel art animated network diagramming software, but I like being able to make it tangible. It's such an intangible thing, the cloud, but every little part of it talks to everything else. So if you can see the map and then equate that to where the metrics or the logs deviate from the baseline, then you can start tracking it. So it's a long way of saying you start on the outside and work your way back.
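
The outside-in walk described here can be sketched in a few lines. This is a minimal illustration under stated assumptions: the layer names, the sample numbers, and the 50% tolerance are all made up, and a real dashboard would pull these values from monitoring rather than from dictionaries.

```python
# Layers ordered from the outside of the stack to the inside (assumed order).
LAYERS = ["route53", "api_gateway", "alb", "ecs_tasks", "database"]


def first_anomalous_layer(baseline, current, layers=LAYERS, tolerance=0.5):
    """Return the outermost layer whose metric deviates from its baseline
    by more than `tolerance` (as a fraction), or None if all look normal."""
    for layer in layers:
        base = baseline[layer]
        now = current[layer]
        if base == 0:
            deviates = now != 0
        else:
            deviates = abs(now - base) / base > tolerance
        if deviates:
            return layer
    return None


# Hypothetical numbers: request counts for the outer layers, CPU % for the rest.
baseline = {"route53": 1000, "api_gateway": 1000, "alb": 1000, "ecs_tasks": 40, "database": 30}
batch_job = {"route53": 1050, "api_gateway": 1020, "alb": 990, "ecs_tasks": 45, "database": 95}
flood = {"route53": 5000, "api_gateway": 4800, "alb": 4700, "ecs_tasks": 98, "database": 90}

if __name__ == "__main__":
    print(first_anomalous_layer(baseline, batch_job))
    print(first_anomalous_layer(baseline, flood))
```

In the `batch_job` sample, external traffic is flat and only the database deviates, which points away from a DDoS and toward something internal. In the `flood` sample, the discrepancy already shows at the outermost layer.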

Jonathan Knepher:
I guess it sounds like too, the key is being prepared ahead of time with the metrics and the log collections.

Matt Lea:
Yeah, absolutely. You don't want to do that at game time. I have had to build out metrics at game time, because sometimes they don't call me until there is a problem, and then it's just, okay, we're off to the races. But if at all possible, you absolutely want to have those dashboards and your CloudWatch Insights queries set to go, your AWS WAF logs and your WAF rules all in place. But, you know, I especially deal with early stage. We don't always get dealt a perfect hand of cards. So play with what you've got.

 

[13:11] Credential Leaks: What To Do 

Rachael Lyon:
Never a dull moment, right? That's why we love staying in the cyber industry. You just never know what the day may bring. But I'd love to come back on, you know, unfortunate, you know, kind of credential leaking, you know, in that kind of scenario, I mean, what kind of steps would a company take to try to contain something of that magnitude? I mean, you can't really pull it back. So, how would you manage through that?

Matt Lea:
See, in that case, we disabled the credentials rather than deleting them, because deleting could kill off some other major service for third parties. If it were inside AWS's bubble, you'd use roles, not IAM creds. So right there, we disabled the creds. And at that point, you've got to do some math. Did that break some big third party? Say it's our inventory service, and for some reason they're pulling every order using those credentials because it's a third party, and now all of a sudden we're losing $100,000 an hour because we can't ship orders. Well, then we've got to turn that back on and disable the email role in that case, or the email IAM permission. So you've got to be ready to do some quick math.

Matt Lea:
If you just delete it right away, then you've got to go to your, you know, in this case, inventory provider. It wasn't, but let's just say it was the people that handle all the shipping for us. You have to get them to redo those creds, and for how long is that down and broken? So it's always a delicate balance of math. And I think that's something a lot of engineers don't think of right away. They're very ones and zeros. I say this all the time: engineers think in ones and zeros, and in the C-suite, the language of business is dollars and cents. So you've always got to be doing that math if you want to sell.

Matt Lea:
This is kind of career advice I give people in cloud: if you can think in dollars and cents as well as ones and zeros, you'll go far.
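
The "disable, check what broke, then narrow the permission" sequence can be written down as a small decision helper. This is a sketch under assumptions: the conversation does not say which email service was involved, so the policy below assumes Amazon SES and its `ses:SendEmail` and `ses:SendRawEmail` actions, and the step wording and function names are hypothetical.

```python
import json


def deny_email_policy():
    """IAM policy document that blocks outbound email for whoever it is attached to.
    Assumes the email service is Amazon SES, which the source does not state."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutboundEmail",
                "Effect": "Deny",
                "Action": ["ses:SendEmail", "ses:SendRawEmail"],
                "Resource": "*",
            }
        ],
    }


def containment_steps(breaks_third_party):
    """Ordered steps after a key leak. Deactivating is reversible; deleting is not."""
    steps = ["deactivate the leaked access key (do not delete it)"]
    if breaks_third_party:
        steps += [
            "attach the deny-email policy to the key's IAM user",
            "reactivate the key so the third party keeps working",
            "schedule a credential rotation with the third party",
        ]
    else:
        steps.append("keep the key inactive while you investigate")
    return steps


if __name__ == "__main__":
    print(json.dumps(deny_email_policy(), indent=2))
    for step in containment_steps(breaks_third_party=True):
        print("-", step)
```

The sketch attaches the deny policy before reactivating the key so that the key never comes back with email rights. That ordering is a design choice of this example, not a detail from the story.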

Jonathan Knepher:
Can you talk some more about some of the specific controls that you would suggest for keeping guardrails around things that could go badly?

Matt Lea:
Principle of least privilege. Okay, your junior guy that shows up day one, don't let him have admin. There's a lot of things, you know, there's absolutely.

Jonathan Knepher:
Don't give root to everybody.

Matt Lea:
Yeah, yeah. I mean, even with credentials, they shouldn't have access. Use something like Secrets Manager to make sure that they only have access to the sandbox credentials for all these services, not, you know, the important stuff. Never commit those big, important banking credentials into the code base. Those should be in Secrets Manager or AWS Parameter Store. And if it's not clear, obviously, I'm a bit of an AWS fanboy here. I'm sure there are equivalent password vaults everywhere. But definitely restrict the juniors to what they need to get the job done, and don't give them access to everything on day one.

Matt Lea:
A lot of times, it's not their fault. Even just a couple weeks ago, I had a junior push a code base up to their own personal GitHub publicly. And it happens more often than you'd think. I had one time... I can't say the specifics on the client, but do you have time for a story?

Rachael Lyon:
Always.

Matt Lea:
Always. Okay. So we were working with this hardware company, a smart home device company. And they were like, "We've got to be super secure with the code base for what lives on the device." So much so that I had to fly one of my right-hand guys there because they wouldn't put it across the Internet. I had to fly him across the country with a hard drive to try and get the code base. They still wouldn't give it to him. He ended up coming back, and a month later, we found it publicly available on the Internet with keys intact.

Matt Lea:
And it was just like, oh my gosh. We went through all this hassle, and there it is, publicly on someone's GitHub with the keys.
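
Leaks like the ones in these stories are often caught by scanning files for key-shaped strings before they are pushed. The sketch below is a minimal, assumption-laden example: it only looks for the documented AWS access key ID prefixes (`AKIA` for long-term keys, `ASIA` for temporary ones) followed by 16 uppercase letters or digits, and the `find_key_ids` helper is hypothetical. Dedicated secret scanners cover far more patterns.

```python
import re

# AWS access key IDs: a 4-character prefix plus 16 uppercase alphanumerics.
ACCESS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")


def find_key_ids(text):
    """Return (line_number, match) pairs for anything shaped like an access key ID."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in ACCESS_KEY_ID.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


# AKIAIOSFODNN7EXAMPLE is the placeholder key ID used in AWS documentation.
SAMPLE = """[default]
aws_access_key_id = "AKIAIOSFODNN7EXAMPLE"
region = us-east-1
"""

if __name__ == "__main__":
    for lineno, key_id in find_key_ids(SAMPLE):
        print(f"line {lineno}: possible access key ID {key_id}")
```

A check like this could run as a pre-commit step, so the mistake is caught on the junior's laptop rather than on a public repository.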

Jonathan Knepher:
Well, it sounds like their paranoia was well-founded, just not well executed.

Matt Lea:
Yeah, I guess. What is it, the hard outside shell but the soft candy inside? Yeah. Oh my gosh. So you're always being attacked, but sometimes it's the intern that makes a mistake. You know, they've got to learn from the mistakes, but don't let them make the big ones.

Jonathan Knepher:
No, totally. Okay, so you brought up the whole hardened outside, soft inside. What types of things should folks be doing to help protect the inside?

Matt Lea:
Sure. One of the most basic things I see: you talked about IAM roles. Granular IAM roles are very important. You should never have an API key or AWS access key that you're using for API access with administrative rights. That's insane. Don't do that. Then security groups, of course. I'm pretty nitpicky on security groups. You can get away most of the time with just being picky with your ingress rules. One of the biggest things I see there is cross-environment contamination.

Matt Lea:
If you don't have good security groups and you keep environments in the same AWS account, all of a sudden some staging jobs write into production, and you've got duplicate data, and you're like, why? So definitely lock that down. You don't ever want the public to be able to access your databases. I usually do multiple layers of security. Of course, you've got your passwords, but people can brute-force those. Second, you've got your security groups, which are basically firewall rules. And third would be making it so the subnets are actually not able to be hit, so Internet traffic can't reach the subnet. It would have to go through a bastion in the public subnet, and that would have to poke a hole through to the private subnet.

Matt Lea:
So even if somehow you screwed up the password and you screwed up the firewall rules, you're still in a private subnet that's virtually untouchable by the Internet. So those are some of the basic ones. And most of my basic scenarios actually revolve around those kinds of pieces of the puzzle.
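
Two of the problems named here, databases reachable from the Internet and staging traffic allowed into production, are easy to check for mechanically. The sketch below uses a simplified, hypothetical rule format (plain dictionaries, not the shape of any AWS API response) and assumes the default MySQL and PostgreSQL ports.

```python
DB_PORTS = {3306, 5432}  # MySQL and PostgreSQL defaults (assumed)
ANYWHERE = "0.0.0.0/0"


def audit_ingress(rules):
    """Return (group, problem) pairs for ingress rules that break the layered model."""
    findings = []
    for rule in rules:
        if rule["port"] in DB_PORTS and rule.get("source_cidr") == ANYWHERE:
            findings.append((rule["group"], "database port open to the Internet"))
        source_env = rule.get("source_env")
        if source_env is not None and source_env != rule["env"]:
            findings.append((rule["group"], f"ingress allowed from {source_env}"))
    return findings


# Hypothetical rules: one public database rule, one cross-environment rule, one fine.
RULES = [
    {"group": "prod-db", "env": "prod", "port": 5432, "source_cidr": ANYWHERE},
    {"group": "prod-db", "env": "prod", "port": 5432, "source_env": "staging"},
    {"group": "prod-db", "env": "prod", "port": 5432, "source_env": "prod"},
]

if __name__ == "__main__":
    for group, problem in audit_ingress(RULES):
        print(f"{group}: {problem}")
```

This only covers the security group layer. The private subnet and the bastion are separate layers that would need their own checks.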

 

[18:52] Attack Trends on the Frontline

Rachael Lyon:
Since you're on the front line for attacks, I'd be interested in... and obviously, we're not going to name names or be super specific. But I'd be curious about any trends that you've been seeing kind of bubbling up in the last year, kind of anomalies that maybe are becoming more common as we look to like 20. You know, getting ahead, I don't know we're ever going to get there, but how do we at least start thinking about these things?

Matt Lea:
My client's got e-commerce stuff. They want to move product. It really doesn't matter if it's a bot putting the credit card number in or a human. But you know, as long as it, you know, everybody's happy at the end, you know, the client gets their product. So we're at this very interesting spot where we're spotting what we can tell is bot traffic. But, you know, in order to try and make sure that they still buy a product, we're letting them through. And so that water is murky right now, is what I'm saying. And it's, and these bots, a lot of times they're, they're not the best at pacing themselves.

Rachael Lyon:
Exactly.

Matt Lea:
So, you know, we're putting in more thresholds and more tools to tell them to, you know, relax. I can't remember the exact status code, and we're not returning it, but what we've proposed is this fake status code 420, which is like "chill out" or something like that. So we have to make sure we're feeding these bots the information they need very efficiently, but at the same time make sure they're not bombarding us.
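
For reference, the standard way to tell a client to slow down is HTTP 429 (Too Many Requests), usually with a `Retry-After` header. 420 is not a registered status code. The sketch below shows one common way to decide when to send it, a token bucket per client. The class, its parameters, and the rates are illustrative assumptions, not what Matt's team runs.

```python
import math


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)
        self.updated = now

    def check(self, now):
        """Return (status_code, headers) for a request arriving at time `now`."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = max(self.updated, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return 200, {}
        # Tell the bot how long to wait before it has a token again.
        wait_seconds = math.ceil((1.0 - self.tokens) / self.rate)
        return 429, {"Retry-After": str(wait_seconds)}


if __name__ == "__main__":
    bucket = TokenBucket(rate=1, capacity=2)
    for t in (0.0, 0.0, 0.0, 1.0):
        print(t, bucket.check(t))
```

A well-behaved bot that honors `Retry-After` still gets to buy the product. It just does so at a pace the site can absorb.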

Jonathan Knepher:
What types of things, though, are you seeing? Kind of from the agentic AI stuff, kind of separate from this. Right. Like, are there, are you seeing security implications as well as volumetric things?

Matt Lea:
It depends. Yes, I do see some security implications. I see people rushing to put "AI" in their product. And I'm doing air quotes for those of you on the audio version. And 90% of the time, what they end up doing is slapping a chatbot in there. And if they're lucky, they figure out how to give it tool-call access. But what I see them doing is giving this bot a bit more decision power than they probably should. The bot that can issue a refund is a dangerous bot.

Matt Lea:
Okay. I don't think I'd ever put a refund in there without a human in the loop. I'd have something on the screen like, "This bot is not authorized to issue you refunds, even if it says so." So keep the human in the loop on anything that's remotely important: refunds, changing shipping addresses, things like that. That's something I'm seeing that scares me a bit. And I've got to talk people off the ledge, saying it's not magic. They do hallucinate. I bet you I could jailbreak the thing if you gave me an hour.

Matt Lea:
So make sure every tool call has some type of logging and auditing, and put restrictions on important tool calls and decisions. The bot shouldn't make decisions; it should make recommendations, which are then reviewed by somebody on your internal customer service team. That's my thoughts on it. So that's where I see us shooting ourselves in the foot. If you meant attacks from the outside, that's interesting as well, because we have seen attacks that mutate.

Matt Lea:
The bot is smart enough to change the input. It's not just running through a loop and blasting you like the old-school DDoS. It's changing the inputs up enough that the signature is rather difficult to discern, and it's becoming more and more difficult to track. We do find patterns occasionally. I found a very interesting one just yesterday. Actually, I should say my colleague found it. I don't want to take credit from him. But it's possible that was just a scheduled AI assistant, too. So do we block it?
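
Going back to the guardrails for a company's own chatbot, the "log every tool call, recommend rather than decide" idea can be sketched as a small gate in front of the tools. Everything named here is hypothetical: the tool names, the `SENSITIVE_TOOLS` set, and the review queue are stand-ins for whatever a real support system uses.

```python
import logging

logger = logging.getLogger("tool_calls")

# Tools the bot may only recommend; a human on the support team approves them.
SENSITIVE_TOOLS = {"issue_refund", "change_shipping_address"}


def handle_tool_call(tool, args, tools, review_queue):
    """Log every requested call, queue sensitive ones for review, run the rest."""
    logger.info("tool call requested: %s %r", tool, args)
    if tool in SENSITIVE_TOOLS:
        review_queue.append({"tool": tool, "args": args})
        return {"status": "pending_review"}
    if tool not in tools:
        logger.warning("unknown tool rejected: %s", tool)
        return {"status": "rejected"}
    return {"status": "done", "result": tools[tool](**args)}


def lookup_order(order_id):
    # Stand-in for a read-only lookup against an order system.
    return {"order_id": order_id, "state": "shipped"}


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    queue = []
    tools = {"lookup_order": lookup_order}
    print(handle_tool_call("lookup_order", {"order_id": "A1"}, tools, queue))
    print(handle_tool_call("issue_refund", {"order_id": "A1", "amount": 50}, tools, queue))
    print("waiting for a human:", queue)
```

The sensitive check runs before the tool lookup on purpose, so a refund request is queued for a person even if someone later registers a refund function by mistake.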

Rachael Lyon:
You know, just tangentially, because Jon knows I like to do these things. I was reading a Wired article. We talk about AI and agents and workforce stuff, keeping a human in the loop. In this Wired article, the author wanted to stand up a new company with an AI workforce, and he was the only human in the loop. And as he went through whatever the product was, one of the agents reached out to him and gave him an update. But it was all lies. It absolutely lied about all the progress that they were making. "Oh, yes, we completed that cycle. We did this, we sent email."

Rachael Lyon:
And the fellow's like, "You're literally lying to me." And the bot's like, "Yeah, well, my bad. Won't do that again." But that's truly frightening, you know, when you talk about them kind of evolving, learning, and making decisions without a human in the loop. That just kind of stopped me in my tracks a little bit as we think about these things. Fascinating article and very humorously written, but I'd be interested in your thoughts there.

Matt Lea:
Yeah, well, earlier this year, I tore my bicep, and so I had to get bicep surgery. So I said I'm going to embrace vibe coding for a little bit. And it's funny with those LLMs: after working with them for quite a while, I came to the conclusion that they're basically like having an intern that lies to you. So I do use it for pixel pushing occasionally, since I'm a horrible graphic designer. But at the same time, no, you absolutely need to double-check everything with those.

Matt Lea:
I can see different models working. You could train classifiers or use tools like vector DBs and embeddings to work as a classifier, and those can be extremely effective. But to assume you can just type in some text and it'll be able to do a classification task really well, I think saying that right now is a bit optimistic. But I don't want to downplay every model either, you know, if you have the right tool for the job. I've seen incredible stuff done with Petko, classification models.

Matt Lea:
I mean, honestly, GANs for images and stuff like that, that's pretty impressive too. But of course, they succeed where there's no right answer. You know, say, "Give me a cool picture that's completely made up." It'll make up a story all day long. That's not where we need accuracy.

Rachael Lyon:
So like the images where the person has like seven fingers on one hand, that kind of.

Matt Lea:
Yeah, they've gotten so much better at hands. I used to have to explain to my mom, you know, "Mom, this is AI, look at the hands. This is something that's generated." And now they're getting so good you can barely tell.

Rachael Lyon:
Oh, it's wonderful.

Rachael Lyon:
Finally.

 

[26:26] AI Strategy and Model Training

Jonathan Knepher:
Yeah, they're definitely scary, how good they are now. It's. It's brutal. Okay, so you mentioned some of the stuff on AI. When do you think it makes sense for customers to be doing their own thing, training their own models, versus using some of the existing models that are out there? There seem to be so many good options today.

Matt Lea:
Good question. Again, I deal with early-stage companies, and I also have friends with PhDs who are experts in this stuff. Basically, if you don't have a budget for someone with a very strong background in training models, probably somebody with a PhD or something of that nature, you probably don't want to train one from scratch. You're a small startup. Unless you're doing something that's incredibly intricate and you've got all the training data in the world for it, and no one else does... and even then, I probably would fine-tune. Fine-tuning is kind of that middle ground. You can go off the shelf: if you're a one-person shop and you're just trying to get some code done, then use a managed service, you know, something serverless like Bedrock Converse if you want an LLM, which is the equivalent of ChatGPT's API anyway. But if you want to take it a step further and fine-tune it for your specific usage, then fine-tuning a model, I'm okay with.

Matt Lea:
Still, at a pretty early stage, you wouldn't need a PhD. I don't have a PhD, and I've fine-tuned a lot of different models to fit my purposes. But again, training from scratch is a pretty tall order. You'd better have a lot of money and time on your hands because you're going to spend a lot on compute as well.

Rachael Lyon:
I'm kind of interested. When we look at the cloud landscape, there's a simple path and there are complex paths, like multi-cloud. How do you find startups are navigating that path forward as they grow and accelerate their business? I mean, are you helping to advise them on how to do this in a very thoughtful, strategic way?

Matt Lea:
Absolutely, yeah. I mean, multi-cloud comes up, especially after the recent outage with Amazon. But it's not cheap. And the smaller you are, the less I'm worried about vendor lock-in, because if you're pre-validation, if you haven't got anybody to give you dollars, boot it up as cheap as you can and just go until you've got people repeatedly giving you money. But once you're making a million dollars a day, and if the site goes down, you're losing that in opportunity cost, then it makes sense to explore multi-region and multi-cloud. So what I usually do is a cost estimate: hey, if you go down, it's going to cost you $10,000. If we go multi-region, it's going to cost you $10,000 a month. Then it's like, how often do you go down, or does AWS go down in that case? Well, almost never. So is it really worth it to do that?

Matt Lea:
Now, let's just say you go down and you lose a million dollars or $10 million. Now that investment of an extra $10,000 a month is not the end of the world. You've also got to evaluate whether you've got technologies that need to run differently on GCP than they do on AWS or Azure. So that's a little extra that you've got to think about: the engineering hours. One of the most overlooked things I always have to deal with is engineers who forget to count that their hours cost the company money. "Oh, it's exciting. I really want to learn how to run this on GCP. It's only going to cost me about three months of my life." In payroll.

Matt Lea:
It's like, we're paying you for that, right?

Jonathan Knepher:
That's real money.

Matt Lea:
Yeah, you got to factor that in.
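
The cost estimate Matt describes can be sketched as a small break-even check. This is only an illustration under stated assumptions, not his actual method: the function names are hypothetical, the outage frequency of once per year is an assumed figure (the episode only says "almost never"), and the sketch assumes redundancy removes the outage loss entirely. The dollar amounts are the ones quoted in the conversation.

```python
def expected_monthly_outage_cost(cost_per_outage: float, outages_per_year: float) -> float:
    """Average outage loss per month, spreading the yearly outages evenly."""
    return cost_per_outage * outages_per_year / 12.0


def redundancy_pays_off(
    cost_per_outage: float,
    outages_per_year: float,
    redundancy_monthly_cost: float,
    engineering_monthly_cost: float = 0.0,
) -> bool:
    """True when the expected outage loss exceeds the full monthly cost of redundancy.

    engineering_monthly_cost covers the payroll hours spent building and
    running the extra region or cloud, the item Matt says is often forgotten.
    Simplifying assumption: redundancy avoids the outage loss completely.
    """
    total_monthly_cost = redundancy_monthly_cost + engineering_monthly_cost
    return expected_monthly_outage_cost(cost_per_outage, outages_per_year) > total_monthly_cost


if __name__ == "__main__":
    ASSUMED_OUTAGES_PER_YEAR = 1.0  # hypothetical; not a figure from the episode
    for loss in (10_000, 1_000_000):
        worth_it = redundancy_pays_off(loss, ASSUMED_OUTAGES_PER_YEAR, 10_000)
        print(f"loss per outage ${loss:,}: multi-region worth it? {worth_it}")
```

Under these assumptions the $10,000 outage does not justify $10,000 a month of redundancy, while the $1 million outage does, unless the engineering hours push the monthly total above the expected loss.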

Jonathan Knepher:
But then how do you account, though, for the technical debt side of it, right? Like the idea of running multi-region or multi-cloud, there are early decisions you make in architecture, and undoing those things is sometimes very difficult.

Matt Lea:
Completely agree that it's moving faster. Docker is a technology from above. Thank goodness for Docker, because that allows us to swap things out much quicker. But technical debt is an interesting thing. I work with so many startups, and the part of me that's a purist engineer says do it right the first time, even if we've got to take a little bit longer. But I've seen time and time again some brilliant entrepreneurs, novice programmers, leverage technical debt the way you'd take on debt for real estate, and they leverage it and make money off it. It's like, oh yeah, we took on a little bit, and it does grow, and sometimes it becomes a pain. But as long as your income is growing 10x the debt you're taking on, it can work. They still haven't paid off some of the debt that they've had for six years or so, but it doesn't hurt them. They're growing so fast, it doesn't matter. So I look at it like a real estate investment. I'm taking out a loan, and if the interest on the loan is growing faster than the value of the real estate, then we've got a big problem. But if it's the other way around, they can keep throwing money at it for a while, and they do. It's not optimal, but who am I to judge?

 

[31:16] Matt Lea's Path to Cyber

Rachael Lyon:
Definitely. So I would love to. Jon knows where I like to go here. As we kind of look at our time, I'm always interested in the path to cyber. Everyone has such a unique path of how they got to where they are today. And I'd love, if you wouldn't mind, for you to share with our listeners your journey, how you found yourself where you are today. Is this something you've always wanted to do, or just kind of happenstance? Right. Life just kind of brought you on this twisting, turning path to where you are.

Matt Lea:
So it's funny, I started programming in, I think, the sixth grade, and the way I learned to program was by taking apart video games, hacking into them, basically. And by hacking I mean text files, INI files, breaking them and all that stuff. Then you work your way up to web technologies, and of course you get Hacker News and all that different stuff, and you're like, oh, somebody else broke it this way, or, ah, if I use the counter I can just extract the auto-suggested password, and now I've got your password in plain text there. So little script kiddie things like that, and you work your way up. And then I find myself at various startups where you don't have big budgets for someone else to do security or DevOps or all that stuff. So I end up jumping into those roles, and of course the bigger the startup, the bigger your attack surface. So we've gotten it, and I've probably made the mistake before, but you start to learn, and that's where I kind of got into it. And if you're advising at the level I am, up at the C-suite, you're getting all of it. I spent a good chunk of the morning doing a FinOps write-up, but at the same time I spent a good chunk of yesterday doing cybersecurity, so you've got to have good cross-training on it.

Matt Lea:
Am I the best cybersecurity ninja in the world? No, don't think that. But just out of necessity it's become something I've had to deal with. I've had the pleasure of working with some of the best cybersecurity people out there, I'm sure. So I don't want to take anything away from them. I'm kind of, you know, a liaison to the C-suite a lot of times. But yeah, that's my journey, I guess. Started by taking apart video games.

Rachael Lyon:
I love it. I love it. That's a great way to start, too, right? I mean, as you figure things out, and then how that translates, surprisingly, at higher and higher levels where those kinds of tactics still work. So I know we're at time, but I did want to revisit your YouTube channel, because I would love to give a shout-out to folks. What's the name of your channel, so folks can go there and visit? We'll make sure to include it in the show notes as well.

Matt Lea:
Sure. The name of my consulting company, and the name you'll find me under all across the Internet, is Schematical. It's the word "schematic" with "al" at the end. So YouTube.com/schematical, schematical.com. If you want to see my comic strips, they're at schematical.com/comics. So I do comics on all this stuff that you might find amusing. I have a whole series on the lone wolf programmer, which is the knowledge silo you don't want.

Rachael Lyon:
I love it. Okay, we'll be sure to include that in the show notes, everyone. Well, Matt, thank you. This has been a fun conversation. I learned so much. I was saying we've never had a cloud demolition expert on before. So thank you. Thank you for sharing your insights with our listeners today.

Matt Lea:
Pleasure. And if you guys ever want to compete in a cloud war game, you know where to find me.

Jonathan Knepher:
Sounds like fun.

Rachael Lyon:
That's right. I think you got a signer here. Well, so to all of our listeners, as always, thank you for joining us this week for another amazing guest. And as always, don't forget to... drumroll, please, Jonathan.

Jonathan Knepher:
Smash that subscribe button.

Rachael Lyon:
That's right. And you get a fresh episode every single Tuesday right to your ear inbox. So everyone, until next time, stay secure. 

 

About Our Guest

Matt Lea is the creator of Cloud War Games, an online training platform designed to help cloud engineers and DevOps professionals develop their problem-solving skills by fixing realistic issues in simulated cloud environments. The platform evolved from a tabletop exercise and uses actual AWS infrastructure to create scenarios based on real-world problems, with adjustable difficulty levels. Lea also runs a company called Schematical, which specializes in helping businesses on AWS manage costs, scaling, and security. And be sure to check out Schematical on YouTube and schematical.com/comics.