Episode #001 - Databases and DevOps with Silvia Botros
The Database: the final frontier in the DevOps journey. Losing your company’s data would suck, but hand-crafted, artisanal database servers also suck. What do you do? This episode’s guest is Silvia Botros, Principal DBA at SendGrid, who joins Mike to talk about the DBA silo, better tooling, the woes of schema management, and more.
About the Guest
Silvia is a Principal Engineer at SendGrid, a cloud email provider with household name clients like eBay, Spotify, Pandora and Airbnb. She is an avid distributed systems and databases tester and spends a lot of her day trolling her Ops team. You can hear more from Silvia on her blog https://blog.dbsmasher.com and on Twitter at @dbsmasher
Key Takeaways
- Involve your DBAs in the engineering process early on to prevent problems in the future.
- Enabling engineering teams to manage their own schema is critical to reducing day-to-day toil of the DBA team and empowering Engineering.
- If you’re wary of automating your databases, set up guard rails to add some safety to it first.
Links Referenced
- SendGrid
- Andy Jassy’s 2018 Keynote at AWS re:Invent
- Stories From the DBA Trenches - Silvia Botros (Velocity London 2018)
- Database Reliability Engineering by Laine Campbell and Charity Majors (affiliate link)
- SendGrid blog
- State of DevOps Report by DORA
- Tool: slowquerydigest (pt-query-digest)
- Tool: Anemometer
- Tool: ProxySQL
- Tool: Vitess
- Auditing Databases at The Grid by Silvia Botros
Transcript
Mike: Running infrastructure at scale is hard. It's messy, it's complicated, and it has a tendency to go sideways in the middle of the night. Rather than talk about the idealized versions of things, we are going to talk about the rough edges. We are going to talk about what it's really like running infrastructure at scale. Welcome to the Real World DevOps podcast. I'm your host, Mike Julian, editor of the Monitoring Weekly Newsletter, author of O’Reilly’s Practical Monitoring, and a DevOps consultant/analyst.
Mike: Hi everyone, I'm Mike Julian, I'm here with Silvia Botros, Principal DBA at SendGrid. Welcome to the show, Silvia.
Silvia: Thank you for having me.
Mike: You work at SendGrid. There's a lot of people that have no idea what SendGrid is. Monitoring Weekly is actually a customer as well. I use it for all of my contact forms. It's wonderful. But for those who don't know, what does SendGrid do?
Silvia: First, thank you for being a customer.
Mike: Absolutely.
Silvia: I like to hear that. What we do is basically an email service in the cloud. A lot of companies realized years ago that they need as part of their engagement with their customers whether it's onboarding new customers or communication with existing users, that they need to send a lot of email. It turns out that sending email is a complicated thing. There's a lot of rules around it. SPF, whitelisting IPs, dedicated IP or not. Block lists that the inbox providers maintain. There's magic at times, depending on the provider, in whether email looks fine or it looks spam-y, or it looks fishy. Turns out there's a lot of logistics around that, that these companies all across the board don't really have the skills to deal with and don't want to spend resources on.
Silvia: Customers of ours include a large swath of companies large and small. Some of our big high profile customers include eBay, Spotify, Uber, Pandora, so you get the gist. You sign up for an account, you need a confirmation email. You have to click a link. They want that to come to the inbox, not to the spam folder and they want it to go to the inbox fast. That is basically the service we provide to them. We also now, as you mentioned have your contact list, we also have a marketing email, a marketing campaigns product that we provide to feed into this and provide the more advanced, the more geared towards marketing and that crowd. Not just the day-to-day transaction stuff.
Mike: Yep. Yeah, it is a surprisingly huge pain in the ass to actually manage all that stuff. I know several companies that just use their corporate email accounts to do it and surprise, surprise everything goes to spam.
Silvia: Yes. I will fully admit I did not know this was a thing, that this was a viable business model before I got my job at SendGrid. But once I started learning about all the details it dawned on me, how much it is ... Why would you build that yourself? If you are eBay, and you're focusing on bids or getting traffic or if you're Pandora and focusing on getting people to listen to playlists, why would you worry about spam lists and how every individual inbox provider decides how to judge your emails?
Mike: Yeah. That's a rabbit hole that seemingly never ends.
Silvia: True.
Mike: You're a DBA for them which is kind of weird. Those are still a thing? I thought when we all started rubbing DevOps on everything, that DBAs just kind of went away. We didn't really need them anymore because now the developers are handling it.
Silvia: I know, like we had thought DevOps was going to eliminate the sysadmin, didn't we?
Mike: Right and it turns out we still need those, so I suppose we probably still need the DBA. What does a DBA actually do? What's your role now? What does it actually look like?
Silvia: A DBA in a company that is honestly trying to apply DevOps practices, which is a whole other rabbit hole we can go into. A DBA in this environment needs to be involved in setting standards of practice, that's helping engineers debug things early on. Helping design. A lot of what I do is early cycle, early life cycle of software where they're still building a new thing and I get to give the input on whether this thing has the potential of growing really fast and what does that provide as far as implications as to the data store. Is the data store they're using appropriate for the use case? A lot of guidance, a lot of things like that. There's also still a lot of day to day things in production. For example, we are SOC2 compliant and we are now a public company so we also have to cover Sarbanes-Oxley. There's a lot of compliance needs that don't necessarily fall into any feature work but still have to be covered because every year we have auditors, we have fiduciary requirements that we have to cover and so the database ops team, which is what we call my team right now, is responsible for doing the work that provides the evidence that we actually are compliant with all the things that we have to be compliant with.
Mike: That's kind of interesting. What compliance stuff actually surrounds the database? I know GDPR probably has a big component in here doesn't it?
Silvia: It does. That was definitely a big thing and it covers not just relational databases, it covers a lot of data sources so in practice it involves more than just the DB ops team. To provide context, currently my team is solely focused on relational databases so we have a pretty large MySQL footprint and that's where our focus is.
Mike: Yeah, every time I think about DBAs, I think about just relational databases but once I started thinking about that more and about you and I chatting, I realized maybe that's not actually true because there are a lot of other non-relational databases. Do you work with those as well? Like Mongo and Cassandra, and things like that?
Silvia: Specifically at SendGrid right now my team’s focus, my team’s scope is specifically MySQL relational databases. But you are correct in the larger picture in the industry. I talked to a number of DBAs and data storage teams in other companies and the trend that I'm seeing is you're starting to become more of a data storage team, or a data platform team. Names are hard so you'll find that the names will mean different things in different places, but yeah, the trend I'm seeing is that it'll be a single team that is in charge of all of the things that handle state.
Silvia: Within that team they will have varying expertise where some of the team are better at relational and some of the team are better at the NoSQL. Some of the team are better at the pubsub and message queue storage. Within the team that cross pollination of knowledge starts happening, so that is definitely a trend. We're not there yet at SendGrid, although it's a conversation definitely to be had.
Mike: Yeah, absolutely. You've spoken at a conference. I was watching a talk of yours earlier this week and I forget where you talked about it. But you were talking about DBA as a silo and it's largely seen that DBA is the last silo left in a technical group, so everyone else is on board with DevOps except there's the DBA off to the side going like, "Hey folks, what about me?" What's going on there? Why is that?
Silvia: It's funny you mentioned that because in a recent keynote by Andy Jassy who's the CEO of AWS, that's the thing he definitely beat on. Where he basically called DBAs the old guard and that “those guys”, and I quote those guys, are basically sitting in the way of engineers building things. That's definitely one approach that a lot of people take to it. There's a lot to be said to the context in which it's said because obviously when it's someone who's trying to sell you managed databases it makes sense to make the case that you don't need the DBA team. I take a slightly different approach to that. There's definitely a lot of growth to happen, that needs to happen in the DBA community in how we think about our job. For a very long time the job has revolved around that: we protect the database, that's your job. We protect the database and, in fact, in that same talk that you mentioned and I think you're talking about the one I gave in London in Velocity.
Mike: Yes, that one.
Silvia: Yes, and that one is up on YouTube if someone wants to find it.
Mike: Yeah, we'll put it in the show notes.
Silvia: Yeah, I've likened in talks that I've given and in conversations. I've likened the old DBA stereotype to Milton from “Office Space” where the database is that red stapler and nobody can touch it and you can't take it, and that's definitely an old point of view. It doesn't work anymore. Not only are companies starting to look at having people work closely together much earlier in the life cycle of software delivery. But also it's unlikely now where you have a shop where you have the database. Like it's one database, it's one host sitting somewhere and that's what we protect. Polyglot data storage is a thing that everybody is starting to get involved in. In a lot of shops it's no longer a thing where it's a single database. The old school LAMP stack is not a thing anymore so it just does not make sense for the people who have certain duties that relate directly to the data storage of the business's most important asset to be that protective. Along with that old mindset was a delay in the DBA community of taking up things like automation, configuration management, writing down things. A lot of people will consider what the DBA does is like magical. Like they just go in a corner, do things, and out comes a database that nobody can change anything on. That's definitely a thing that I try to hit on in most of my public speaking in that it's no longer okay and that's not the way to continue to look at the job. The job needs to grow the same way the rest of operations has grown in the last few years.
Mike: Yup, absolutely. I've definitely worked with companies and with people that viewed the database as like this holy thing. You cannot touch this database, this is mine. There is some element of protection around it but there is ... The protection is not there just because they wanted to have their small empire, it was more that everyone was so afraid of it. What I was starting to see is there's a lot of engineers, there's a lot of sys-admins who don't really understand relational databases or anything that goes along with it.
Mike: Yeah, maybe they can write some queries and that's great, but how do you tune a database? How do you troubleshoot it when it starts having issues? What does replication lag actually mean? How do we do this at any sort of scale and then people get afraid and like, "Let's hire a DBA." I think part of this is that as DevOps has become a huge thing, we start looking at state versus non-state. A stateful system versus a stateless one and DevOps, when I've got a bunch of web servers is pretty easy. Like, yeah, throw up some Docker and we'll call it a day. We're good. Then, we're like, "But wait, what about all the session storage? What do we do about that? What about the data? We can't practice DevOps on that because moving fast and breaking things and losing all the customer data is not so acceptable." What does it mean to be practicing DevOps when you do have these state concerns? When you have actual customer data that you can't lose? You can't move fast and break things on it. What does it mean there?
Silvia: This is where I push back a little bit at the idea that DevOps means move fast and break things. There has been many a Twitter argument over what it really means for DevOps.
Mike: Absolutely.
Silvia: Channeling my Andrew Clay Shafer now. It's not about breaking things, it's about having a shorter feedback loop, and it's about having visibility into everything. If it's approached like that you realize that there is a lot to be done in database land that could benefit from the mindset of DevOps. One of the things, I mean, like everything in DevOps it's not only about the tools or not only about the people. One of the reasons DBAs were last to the party is because state is hard. Databases are complicated beasts. You can’t take all of the knowledge from MySQL and apply it to Oracle and MS-SQL or Mongo. There's foundational things that make sense but when it comes to a practice and when it comes to within a product context, finding the sharp edges of this specific data storage tool you're using, that becomes very specialized and that has been a big barrier of entry for a lot of people. There's also this part where DevOps has ... Sometimes the DevOps community has lost its way a little bit towards the shiny toy syndrome. Where it was first all of the config management and then it was Docker and now it's Kube. Actually, no, last week it was Kube. This week it's serverless.
Mike: What will it be next week? Stay tuned.
Silvia: I don't know, like we're going to wait for the next re:Invent and whatever comes out of that. That's going to be the thing. But that's not what the DevOps thing is about in the first place. It's about providing value faster. Providing value faster doesn't have to involve breaking things. Breaking things is a thing we should tolerate and it happens whether we're going fast or not. One of the important publications that came out in 2018 that I think shows this very well, and I was very happy to see it include the database as part of the conversation as well, was The State of DevOps Report. Because the focus of the DORA team and you know Dr Forsgren has done amazing work in that. In that she shows based on hard data, it's about providing value. It doesn't matter whether you're doing it using onprem servers or using the cloud, or using whatever it is. Now there's pointers to which of them provides value faster. But if you're not doing it right, it won't matter at the end of the day. To come back to how to do DevOps for DBAs, it becomes about including everybody in the conversation.
Silvia: Figuring out whether something is slow on the database in like day to day operations, doesn't have to be a magical thing that only a DBA can do. MySQL specifically, and I'm going to pick on MySQL because that's where I've built my chops but this probably applies to a lot of those databases. I mean, Oracle is definitely a good example of that. Observability is difficult. You have to actually know specific server related variables to know exactly what to look for when someone comes up to your desk and says, "We think the database is slow." It's like, it's a very vague problem and then when you add to it the complexity of what data storage is like it becomes really difficult so for a very long time, there was not a whole lot of tooling that actually made observability of a data storage layer's performance a thing that everybody can look at.
Mike: Yeah, I remember when I first started my system administration career 10 plus years ago, troubleshooting database performance was a really hard problem in that there was just nothing there. Like, what do you look at? Sure I could look at CPU, memory, and all these OS metrics but that doesn't actually help because inevitably ...
Silvia: It doesn't tell you what's actually the thing that's running, yup.
Mike: Right, inevitably it's some long running query or some unoptimized query, but how do I find that out? When you have a system that's more than five queries an hour like it gets a little hard.
Silvia: Exactly, it gets even harder when it's not about even the number of queries per hour but also are you a single tenant database or a multi-tenant database? If you're multi-tenant, who's crowding whom? Is this a problem of a single query that's holding on a lock? Like, you wish it was sometimes because if it's a lock you can see that. If it's something that's not caching well, so a query that is super fast in a single instance but is running a million times in one hour, it's actually accruing enough processing time that it's slowing everybody else down. That's a real tricky one to find out. For the longest time the only way to do this was to start doing things like tcpdump the MySQL port. Dump all of that to files and go at them with tools like slowquerydigest or even pipe it into something. There was a very old tool by Box called Anemometer that you could feed these things to. But it was a manual process. You have to, still at the end of the day, you have to know what you're looking for to be able to tell what's wrong and that was ultimately the biggest impediment for people who are not DBA trained to be able to do this.
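Editor's sketch (not from the conversation): the "fast query run a million times" problem is easier to see with a small example. The code below is only a rough illustration of the grouping idea behind query digest tools, not how any particular tool is implemented. The `fingerprint` and `digest` helpers, the table names, and the timings are all hypothetical.

```python
import re
from collections import defaultdict


def fingerprint(sql):
    """Collapse literal values so queries that differ only in values group together."""
    sql = sql.strip().lower()
    sql = re.sub(r"'[^']*'", "?", sql)   # quoted string literals
    sql = re.sub(r"\b\d+\b", "?", sql)   # bare numeric literals
    return re.sub(r"\s+", " ", sql)      # normalize whitespace


def digest(samples):
    """samples: iterable of (sql, seconds). Returns (fingerprint, count, total_seconds)
    rows sorted by total time, largest first."""
    totals = defaultdict(lambda: [0, 0.0])
    for sql, seconds in samples:
        entry = totals[fingerprint(sql)]
        entry[0] += 1
        entry[1] += seconds
    rows = [(fp, count, total) for fp, (count, total) in totals.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical capture: one slow report query and one "fast" lookup run 10,000 times.
samples = [("SELECT * FROM users WHERE id = %d" % i, 0.002) for i in range(10000)]
samples.append(("SELECT COUNT(*) FROM events WHERE day = '2018-11-23'", 4.0))

report = digest(samples)
for fp, count, total in report:
    print(f"{total:8.2f}s  {count:6d}x  {fp}")
```

Sorted by total time rather than per-call time, the 2 ms lookup ranks above the 4 second report query, which is the pattern Silvia describes as hard to spot by eye.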
Silvia: One of the things that have helped us a lot to make this a lower barrier of entry was using a tool called VividCortex. The founder is Baron Schwartz who is pretty well known in the MySQL community. He has written tools in the past that have helped bridge the gap a little bit but they were mostly like Perl and local to the hosts. But with VividCortex, they took this to a completely different level where you run an agent on the database and you get a UI in a web portal that you can show everybody what to do on, and it tries to explain to them in English. As much English as possible what is actually happening. That has been very helpful because when you start talking about DevOps and even more so right now when you start talking about SRE and saying that delivery teams have to own the reliability of their service, you can't do that to them if they cannot have good visibility into everything that's actually supporting the service. The database has to be part of that.
Mike: Yeah. Are there any other tools that you reach for when you're trying to troubleshoot performance problems?
Silvia: That tends to be our number one. There's definitely, like that space has been approached by all of the existing hosted monitoring solutions like Datadog and New Relic, and whatnot. I've personally found VividCortex to be the best. I think, in part, because the founder is very intimately familiar with MySQL and also Postgres. He was a Postgres DBA before coming to MySQL. He's on my end. Having watched the tool actually show me that it was looking at the right things gave me the confidence to teach my engineers across the SendGrid org how to look at it and know that it will not give them false flags or send them down wrong paths. Because, ultimately if a tool is not actually providing that value it's hard to make the case for them to use it. Engineers are very quick to lose trust in new tools. Once that happens it's hard to come back from so it was important to show that it was going to really provide the value.
Mike: Yeah, it's like once the tool gets in my way, or once the tool is not doing what I expect it to do, I just don't trust it as doing what it should be doing. Yeah, I quickly move on. I completely lose all faith in it.
Silvia: Exactly.
Mike: Tools are hard, like even outside of technology. Why are there 50 brands of hammers at the hardware store? It's like, "Well, some of them are better than others and some of them are better than others at certain things." Trying to use a roofing hammer to drive a nail into the ground is not really going to work too well for you and vice versa. Tools are hard, but tools are also very much built for whatever task you're trying to do with it.
Silvia: Yeah, I found also, so going back to the old DBA stereotype. One of the first things, the values that I have reaped from having something like VividCortex help the rest of the team debug things is that, the first instinct can be, "Well, then we don't need the DBA." Actually, not true. Once we had that at hand, I started to have actual time on my hands to be able to go and review design documentation, for example. I was able to go and start working with our compliance team and our information security team to prepare for audits. We introduced VividCortex before, long before we went public so the long road of compliance had just started and if I didn't have VividCortex, it would've been very difficult to find the cycles to also focus on the long term benefits of the business that I needed to provide. My job at that point, I had realized that it doesn't, my job doesn't have to be day-to-day just whack-a-mole-ing badly performing queries in production. It needed to be something that is long term value and that included things like helping the engineering teams design things that don't even have problems as much as possible in production, helping the teams clean up tech debt and identify things that they need to clean up before we hit high volume days; helping the business get certifications for compliance that will help the business and help our sales team make more sales. These are the things I needed to focus on and not day-to-day grabbing slow logs and poring through them with a magnifying glass to figure out what's wrong.
Mike: Right. I think that's an incredible point you're making. That one of the ... I don't know if I'd go so far as call it a tenet. I mean, I guess I would. Dr Forsgren mentioned this in the State of DevOps Report. That DevOps, one of the biggest things about this is business value. What you're saying is the tool didn't make DBAs less valuable, it actually made them more valuable because now you're not stuck doing minutia because that's all you have time to do. Now you're actually looking at, "How can I use my skills leveraging these new tools I have to make the business do better to make my teams better?" The DBA's actually significantly more valuable as a result.
Silvia: Exactly, and it becomes a thing. It turns the entire idea of the old guard or the walled garden DBA on its head. It becomes, "No, I don't need this." People can figure out on their own which query they're running was causing the slowdown because they can look at the dashboard and immediately go and do a call, do a PR and do a deploy that will cache this data better. Use the index that's in place, or just send me a PR that wants to add the index. Then we've actually made this faster. Now the next thing that we're trying to tackle, and that was clearly called out in the State of DevOps Report, is schema management. That's the next thing, I think.
Mike: Okay, talk to me about that. I'm not really familiar with that topic.
Silvia: In most shops and even right now at SendGrid, that is still a thing that we have to grapple with. Where doing schema changes to support future features is still a scary thing. There's tooling in the MySQL space, at least, that helps you do schema changes in an online manner so you no longer have to go and do an alter that blocks all writes to a table for like, I don't know how long and you will wait for it. That doesn't have to be a thing, but it's still a little scary. You're going in and changing data. Engineers have focused for a very long time now and a big part of the DevOps mindset is CI/CD where you don't touch things in production. You have a pipeline that does the test for you and once they pass it goes and does the deploy. But just like everything that is stateless versus stateful that is a hard one to do when it's schema changes because a rollback with software deploys is, I wouldn't say easy because that would be too, what's the word? That would be pretending that a problem is not that big. I mean, software deploys itself can go on for a while but it's definitely a lot harder with state because there's no rollback in databases. That's not a thing. This is still a space that I don't feel has seen enough traction. There's some companies out there that do it but they focus mostly on enterprise, on very large, I find it funny to call ... When it's Oracle, an enterprise, because we are a pretty large company and we don't run any Oracle or MS-SQL, but in any case, the solutions that are out there right now tend to focus on MS-SQL, Oracle, DB2. There's not a whole lot of focus on the more open source databases like MySQL where you make schema changes, A, less scary and, B, more observable. Where you incorporate version control of what your schema's supposed to look like with actually making it true in production. That is a space that is still, I think, lacking. The solutions I see for it have been very context heavy to the shop that built them.
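Editor's sketch (not from the conversation): one common way to tie version-controlled schema to what is actually applied is a numbered list of migrations plus a table recording which ones have run. This is a minimal illustration under stated assumptions, not the approach SendGrid uses. It runs against in-memory SQLite as a stand-in for MySQL, the `MIGRATIONS` list and the `schema_migrations` table name are hypothetical, and it is forward-only, with no rollback.

```python
import sqlite3

# Hypothetical migrations, as they might be checked in next to application code.
MIGRATIONS = [
    (1, "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)"),
    (2, "ALTER TABLE users ADD COLUMN created_at TEXT"),
    (3, "CREATE INDEX idx_users_email ON users (email)"),
]


def applied_versions(conn):
    """Return the set of migration versions this database has already seen."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (version INTEGER PRIMARY KEY)"
    )
    return {row[0] for row in conn.execute("SELECT version FROM schema_migrations")}


def migrate(conn, migrations=MIGRATIONS):
    """Apply pending migrations in version order; return the versions applied."""
    done = applied_versions(conn)
    newly_applied = []
    for version, statement in sorted(migrations):
        if version in done:
            continue  # already applied, so skip it
        conn.execute(statement)
        conn.execute("INSERT INTO schema_migrations (version) VALUES (?)", (version,))
        conn.commit()
        newly_applied.append(version)
    return newly_applied


conn = sqlite3.connect(":memory:")
first_run = migrate(conn)    # applies everything on an empty database
second_run = migrate(conn)   # nothing left to do
columns = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
print(first_run, second_run, columns)
```

Because the recorded versions live in the database itself, rerunning the pipeline is a no-op, which is what makes it reasonable to run from CI. On a large production table the `ALTER` itself would still need an online schema change tool of the kind Silvia mentions.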
Mike: Right, so for example Rails has built-in schema migration support. That's pretty cool but what if you're not using ... Most ORMs, Object-Relational Mappers I think is the acronym. Like Rails and Django and Flask through SQLAlchemy. A lot of them have this migration support where you can version schema changes and it's pretty cool but there's a lot of people not using ORMs so they don't have access to any of these tools and that sucks.
Silvia: Exactly, and here's the thing. I personally, I mean even in my talks when I talk to people. I advise against using ORMs past a certain scale because it can really hinder you ...
Mike: Really?
Silvia: Yeah. Once you've hit a certain level, most ORMs first of all do not support functional sharding. Functional sharding means you don't have one database, you have individual databases separated by concern. Sometimes you even have to do what is called horizontal sharding which means that you have multiple database clusters that have the same schema but the data inside is different. For example, instead of having one large database for all of the people who are in Slack's workspaces, you start having a different cluster per workspace in Slack. But on the inside the actual structure of the tables is the same but it's just the data inside it is different. ORMs are not good at this and the black box attitude in ORMs where you use the object-oriented language to tell it what you want to get and it decides what the SQL's going to look like, it works up to a certain point. Which is true for everything in technology. Once you start hitting a certain level of throughput that you need to achieve to accomplish the SLA you've provided on the product level, it's very likely the ORM is not going to help you there.
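Editor's sketch (not from the conversation): the two kinds of sharding can be pictured as a small routing table. Functional sharding picks a cluster by concern, and horizontal sharding picks one of several same-schema clusters by a key such as a workspace ID. The host names, the `CLUSTERS` map, and the hash-modulo routing rule below are all assumptions made for illustration.

```python
import hashlib

# Hypothetical topology. Each concern gets its own cluster (functional sharding);
# "workspaces" is further split across same-schema clusters (horizontal sharding).
CLUSTERS = {
    "billing": ["billing-db.internal"],
    "workspaces": [
        "workspaces-shard-0.internal",
        "workspaces-shard-1.internal",
        "workspaces-shard-2.internal",
    ],
}


def cluster_for(concern, shard_key=None, clusters=CLUSTERS):
    """Pick the database cluster that should serve a query."""
    shards = clusters[concern]
    if len(shards) == 1:
        return shards[0]  # functionally sharded only
    if shard_key is None:
        raise ValueError(f"{concern!r} is horizontally sharded; a shard key is required")
    # A stable hash keeps a given key on the same shard across processes and restarts.
    digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]


print(cluster_for("billing"))
print(cluster_for("workspaces", shard_key="acme-corp"))
```

Hash-modulo routing is the simplest rule to show, but it reshuffles most keys whenever the shard count changes. A lookup table from key to shard is a common alternative when shards need to be moved or split.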
Mike: What happens after the ORM?
Silvia: After the ORM you need to start having a closer look at what the queries look like. You need to have something, you need to have abstraction layers that are more proxy-like than completely abstracting out the data structure.
Mike: Okay.
Silvia: Things like that. Right now, one thing we use heavily in SendGrid that we're trying to move everything to is ProxySQL. By the name, and it's written by a former DBA as well, its strengths are in tracking performance of the databases it's talking to, in managing the client side connections to the application versus the connections open to the database so that if your application traffic spikes up, it can basically play the role of a barrier to try and reuse connections in a way that doesn't also DoS your database. It focuses more on the operational side of things rather than, "Hey, I'm just going to make it super easy for the engineers so they can write the code," but the implications of that when it comes down to the database are whatever. These are tools that are very useful once you've hit a large growth rate and you're starting to have to split your data out into individual clusters, whether by concern or just horizontally. ProxySQL, and also Vitess, which is a fairly recent player in the market that's been out for a couple years now, support things like that, so the focus becomes more on scaling and less on easy abstractions that mask the concerns that you might have when you're actually in production.
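Editor's sketch (not from the conversation): the "barrier" idea, where many client requests share a small fixed number of backend connections, can be shown with a bounded pool. This illustrates the general technique only and makes no claim about how ProxySQL is implemented. `BoundedPool`, `simulate_spike`, and the placeholder connection objects are hypothetical.

```python
import queue
import threading
import time


class BoundedPool:
    """Share a fixed number of backend connections among many callers."""

    def __init__(self, connect, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(connect())

    def run(self, work):
        conn = self._idle.get()  # blocks while every backend connection is busy
        try:
            return work(conn)
        finally:
            self._idle.put(conn)  # hand the connection to the next caller


def simulate_spike(clients, backend_connections):
    """Fire `clients` concurrent requests; return (served, peak backend use)."""
    # object() stands in for a real database connection in this sketch.
    pool = BoundedPool(connect=object, size=backend_connections)
    lock = threading.Lock()
    stats = {"active": 0, "peak": 0, "served": 0}

    def query(conn):
        with lock:
            stats["active"] += 1
            stats["peak"] = max(stats["peak"], stats["active"])
        time.sleep(0.001)  # pretend the database is doing work
        with lock:
            stats["active"] -= 1
            stats["served"] += 1

    threads = [threading.Thread(target=pool.run, args=(query,)) for _ in range(clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stats["served"], stats["peak"]


served, peak = simulate_spike(clients=50, backend_connections=4)
print(f"served {served} requests using at most {peak} backend connections")
```

All 50 requests complete, but the database never sees more than four connections in use at once. The trade-off is that callers queue during a spike instead of opening new connections.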
Mike: Yeah, okay. That makes a lot more sense. ProxySQL, man, I could've used that a few times. You were talking about your next step at SendGrid is schema management and now that I actually know what schema management is (thank you for that), what does that mean to SendGrid?
Silvia: One of the important things there right now is we're trying as much as possible to take out the DB ops team as a human gatekeeper of things. That's one of the remaining things that we have at hand right now. One of the focuses that we have is we want to be able to enable the engineering teams to manage their own schema. Especially in cases where it's a brand new thing, it doesn't have a lot of traffic on it yet and they're iterating on the code real fast. But the iteration on the code comes along with iterations on the schema. It does not make sense for them to have to keep coming back to us so we can apply those changes for them. It really comes down to just enabling those engineers to be able to do these things, but at the same time enabling them with guard rails where the tool will maybe not change a gigantic table without getting some approval first. Things like that, so that's the space we're trying to figure out a solution for right now among many things.
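Editor's sketch (not from the conversation): a guard rail of the kind Silvia describes could be as simple as a check that lets schema changes on small tables through and holds changes to very large tables until someone signs off. The threshold, the `SchemaChange` type, the `review` function, and the table names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

ROW_LIMIT = 1_000_000  # hypothetical threshold; each shop would pick its own


@dataclass
class SchemaChange:
    table: str
    statement: str
    approved_by: Optional[str] = None  # set once a DBA signs off


def review(change, row_counts, row_limit=ROW_LIMIT):
    """Decide whether a schema change may run without a human in the loop."""
    rows = row_counts.get(change.table)
    if rows is None:
        return "blocked: unknown table"
    if rows <= row_limit or change.approved_by:
        return "apply"
    return "needs approval"


# Hypothetical table sizes, for example gathered from database statistics.
row_counts = {"feature_flags": 1_200, "email_events": 4_000_000_000}

small = SchemaChange("feature_flags", "ALTER TABLE feature_flags ADD COLUMN owner VARCHAR(64)")
big = SchemaChange("email_events", "ALTER TABLE email_events ADD COLUMN region VARCHAR(16)")
signed_off = SchemaChange("email_events", big.statement, approved_by="dbops")

decisions = [review(change, row_counts) for change in (small, big, signed_off)]
print(decisions)
```

The point of the design is that the common case (a new, low-traffic table) needs no gatekeeper, while the risky case still gets a human review.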
Mike: Sure.
Silvia: But, from a larger than SendGrid perspective from conversations I've had with other DBAs and from trying to solve this myself, it's definitely a part of DevOps with a database that is not super well solved yet.
Mike: Yeah, that sounds like a pretty hard problem.
Silvia: It is. You can always apply a lot of due diligence. One of the books that has some really good guidelines on how to do this but without being super specific to a database is the Database Reliability Engineering book, which came out last year. It's by Laine Campbell and Charity Majors which is a really good combination of authors, but they have dedicated a full ...
Mike: Yes, very good book.
Silvia: Exactly, and I love the fact that it's like Laine has spent all of her life doing relational databases. She founded one of the first DBA consultant outfits, all of whom are great people that I've worked with, and then Charity, like, the whole Parse thing and getting MongoDB to become a real serious database away from its old engine days. Between the two of them, it's far less about which database to use and far more about how to do databases safely and how to actually make this job a higher value than the old guard that used to, that was the stereotype. Part of that, they dedicate an entire chapter in that book about how to do schema management in a CI/CD manner. Because there's not a whole lot of tooling out there, it is not a thing that you can just throw money at just yet. A lot of it comes down to the context of your data. You have to decide within your shop like which databases can be the low-risk guinea pigs to try new things on and which ones are like, "No, that needs ..." Once we've fine-tuned everything then we can go to that one right there. These decisions are also part of the value of the DBA team because a lot of times the engineers won't have the full picture.
Mike: Yeah, absolutely. Well that's pretty awesome. I mean, good luck to you on that. It sounds like it is a pretty hard problem. You are charting a little bit new territory it sounds like. At least you have some direction to go.
Silvia: I mean, if I look back on my time at SendGrid, I've been with the company for almost seven years now and unbeknownst to me I had to do that anyway. One story is that when I started at SendGrid just like everybody, other shops at the time (it's like 2012), everybody was building the databases by hand. Even shops that were already using Chef or Puppet at the time, which were the biggest players in that market. They would do everything with Chef and Puppet but the databases not so much.
Mike: I mean, I'm guilty of that myself. I recently helped a company do exactly that.
Silvia: Number two ... Exactly. After I started, it was around the two-year mark when I realized that we were about to start ... We were growing at a 30-40% growth rate every year and with the growth, the dataset would increase. There was a lot of functional sharding to do just to split out IO load and to provide a reasonable throughput of queries on all the data stores so that the database doesn't stall mid-delivery, and part of that was, "No, this data cannot all live in one database because you're now going to have to start ordering custom hardware." Very quickly it dawned on me that we will no longer be able to... I can't keep making artisanal databases. I will spend all of my life just setting databases up by hand and that wasn't acceptable. It was like five years ago when I started learning Chef and I was determined to not just use Chef to build the databases but to have Chef run on a schedule, including on the primaries, managing the configuration of all the databases. Apparently that wasn't a thing back then. Five years ago, it was not a common thing to have the databases running a thing that can just go change configurations without someone watching it like a hawk.
Mike: I would say it's barely a thing now.
Silvia: I think so. Sometimes I wonder if I'm just like ... I'm starting to create my own bubble of the people I talk to, and that's why I think that barrier is breaking. But I think you're right. Most shops still don't do that because it just sounds so scary.
Mike: Right, it absolutely is.
Silvia: It was. You have to build the guard rails. You have to make decisions about what configurations need to happen now and what configurations can be placed in a file and then you start automating, you start separately planning rolling restarts. A lot of it, especially about five years ago, was just the fact that MySQL at the time had ... Was not as strong in the automation. You need the database to work with you with these things and MySQL five years ago was not as good at hands-free setup. You had to hand hold a lot of things. It's gotten a lot better now, for sure, but that was a thing as well. I get the mindset of now I can do it, but I found myself in a place where I was this little DBA. We had a consultant outfit with us for that whole time but their primary focus was on call and not letting me burn out with on call pages. Automating these things needed to happen and so you find yourself stuck and you get lemons, you make lemonade. That's how it happened. I have, yeah, if I wasn't at a place like SendGrid specifically with this fast growth, I probably would not have been compelled to do it.
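[Editor's note: The split Silvia mentions, between configuration that has to take effect now and configuration that can be written to a file and picked up by a separately planned rolling restart, might look something like the sketch below. It is an assumption-laden illustration, not her Chef code: `plan_config_changes` is a hypothetical helper, and the set of settings treated as changeable at runtime is only an example that should be checked against your own MySQL version's documentation.]

```python
def plan_config_changes(current, desired, dynamic_settings):
    """Split pending changes into 'apply now' and 'needs a rolling restart'.

    current / desired: dicts of setting name -> value.
    dynamic_settings: names assumed to be changeable on a running server.
    """
    apply_now = {}
    needs_restart = {}
    for name, value in desired.items():
        if current.get(name) == value:
            continue  # already in the desired state, nothing to do
        if name in dynamic_settings:
            apply_now[name] = value
        else:
            # Written to the config file only; takes effect at the next
            # planned restart rather than being forced immediately.
            needs_restart[name] = value
    return apply_now, needs_restart


if __name__ == "__main__":
    # Illustrative values only.
    current = {"max_connections": 500, "innodb_log_file_size": "512M"}
    desired = {"max_connections": 1000, "innodb_log_file_size": "1G"}
    now, later = plan_config_changes(current, desired, {"max_connections"})
    print("apply now:", now)
    print("rolling restart:", later)
```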
Mike: Yeah, absolutely. I think my takeaway on that one is, it sounds absolutely terrifying, but it's not actually a dumb idea. In fact, it's a really good idea.
Silvia: I'm not going to sit here and claim that it was completely without incident. I have caused my own share of incidents at SendGrid and this is where I go back to DevOps. Whether it's just DevOps for sys admins or for DBAs. It doesn't happen with just the tools, so using Chef is a great thing. Using something like VividCortex is a great thing. But there's still a large and non-negligible factor of people, and that's one of the things, to loop back to the State of DevOps report. Like psychological safety in a workplace is an important thing. If SendGrid was more of an old school, "No, you're not allowed to ever make mistakes" thing, I think I probably would've been fired within the first two months. No kidding, because I had caused a major outage within the first two months. It wasn't related to Chef. But, I mean, all DBAs I know have done that. Psychological safety needs to be a thing and it needs to apply to everybody equally. To SendGrid’s credit, that has been the culture back then and up until now: being able to do things and having the team support you, and when it goes sideways we all help each other out in fixing it. That's super important as well. You can't break the old DBA silo in an environment that is all blame-y and points fingers, and, "No that was your fault," or, "That was your fault." It just doesn't happen.
Mike: Yup, absolutely. We've talked quite a bit about theory and general practice and related stuff like this. What is something that people can do today or this week to help them with improving how they work with databases? Whether they have a DBA or whether they don't, what can we do? What's actionable? What we can actually do today?
Silvia: My advice here would be two-fold because there's the low attention span of trying new things and so you want something that will give value fast, but there's also the long-term benefits. In the context of quick and fast return on value (we're talking investment), look at the observability of the databases. If whatever metrics or observability solution you have does not have the databases as part of it, and not just as part of it but as part of the application dashboard, like you have an application dashboard and it shows metrics from the services but nothing from the database that the services talk to, fix that. It can be VividCortex. I'd like to plug them a lot because they've been incredibly useful for us. But it doesn't have to be. If it's New Relic or Datadog, or any of those providers, make sure that agent runs on the databases and add those metrics to the same dashboard that everybody looks at. Overlay them, have them as a part of the screen that's up on a TV in your office. Things like that, but start having people be comfortable with looking at a dashboard that comes out of a database and try to at least go, "Hey, why's that line looking like a flat line up in the ceiling right now? What happened there?" Questions like this will get people interested in knowing more. It will hopefully, at least, improve the hunger to learn how the databases are doing and start trying to demystify what's going on there.
Silvia: Now, as far as like long term benefits, incorporate the DBA. If you have dedicated DBA staff, incorporate them in your planning phase. It is a very big red flag if the DBAs are finding out about a thing when it's already in production. Now, it might still go fine because the thing that happened in production is small scale, but it's a sign of bad communication and you can't get around the idea of people talking to each other. Some places, and we do this at SendGrid. In fact, I think we have a blog post about it up on our company blog from like a few months ago. We follow a blueprint process where a new thing that needs to be built to solve some sort of problem for the business has to be described in a design doc, with documentation, and have diagrams and web sequence diagrams and explain why we're doing this and what the flow of the data is. Having someone who is closely involved with managing the data storage layer of the company involved in reviewing such documents and having a say early on is very, very important.
Mike: Yeah, that's all fantastic advice. Silvia, I have one last question. Where can people find out more about you and your work?
Silvia: I have a blog that I will occasionally post to, although I haven't in a few months now. But every quarter or so I'll try to post something on it. It's blog.dbsmasher.com and I think that's going to be in the show notes.
Mike: Yup.
Silvia: Also, if you want a live stream of consciousness of the things that happen that sometimes get me irritated around databases or DevOps, you can definitely follow me on Twitter. It is not always primo content but follow at your own peril.
Mike: Your Twitter feed is really awesome.
Silvia: I was told recently that I have coworkers who have set notifications for when I tweet and that just terrified me even though these are dear friends I've worked with a long time. Strangers following it also gives me a little bit of allergic reaction, but I do use it to vent sometimes so you might find it amusing sometimes.
Mike: Alright, well. Silvia, thank you so much for joining us. This has been a fantastic conversation.
Silvia: Thank you, Julian.
Mike: Alright.
2019 Duckbill Group, LLC