Podcast transcript: How to scale a tech platform

This automatically-generated transcript is taken from the IT Pro Podcast episode ‘How to scale a tech platform’. To listen to the full episode, click here. We apologise for any errors.

Adam Shepherd

Hi, I'm Adam Shepherd.

Sabina Weston

And I'm Sabina Weston.

Adam

And you're listening to the IT Pro Podcast, where this week we're looking at the challenges of scale.

Sabina

Building an application from scratch is a challenging proposition in itself. But for a startup looking to make its name, that's only half the battle. The real challenge comes when you have to scale that platform, expanding its scope and capabilities at a rapid pace without overextending your resources, or burning out your staff.

Adam

It's a task that many businesses are forced to grapple with. And there are many factors for a CTO to take into account, including which technology choices to make, how best to structure the application, and how to build in redundancy. This week, we're joined by one technology leader who has been through this process: Avinash Gangadharan, CTO of car sharing company Turo. Avinash, thanks for coming on the show.

Avinash Gangadharan

It's a pleasure, Adam. Thank you for inviting me to the show.

Sabina

So to set things up a little, could you walk us through what Turo's infrastructure looks like?

Avinash

Oh, well, so we are cloud native. You know, since the inception of Turo, we have been primarily hosted on AWS; that's to start with. So the majority of the services run on AWS. We are containerized, our apps are containerized. Our back end runs mostly on Java Spring. Our data tier, you know, is again AWS services: MySQL RDS, we use Redis for caching, a bit of Memcached. On our web front end, we are primarily a React.js shop. And our apps are native, built on Java/Kotlin on Android, and obviously Swift and Objective-C; we have a bit of Objective-C left on our apps as well. That's, you know, our programming language stack. Going back to the cloud side, you know, besides just real-time databases, OLTP on RDS, our data warehouse is also primarily on AWS; we use Redshift as our primary data lake. We have a bunch of Jenkins jobs that run ETL. We are actively migrating those to Spark EMR. Yeah, that's precisely what our data warehouse looks like. Visualisation is on Domo mode primarily, and there is work going on to move to Tableau. So that was, you know, the data platform, and a bit about our application stack. On the network security layer, we've got Cloudflare, you know, that protects us; it has been working really well. On observability, we've got New Relic, and logging, you know, as well on New Relic. And data science, you know, which is a critical part of Turo: most of our models are built, tested and run on DataRobot. We have a few that run in house, but that's, at a very high level, what our stack looks like.

Adam

So that's quite a diverse stack. How quickly has Turo had to scale up from a back end perspective over the last several years?

Avinash

Well, the last two years, I'd say, have been the most adventurous for us. But over time Turo has scaled, the business has scaled, you know, significantly. In fact, in the last 10 years, the business has scaled 100x.

Adam

Oh, wow.

Avinash

Yeah. So from that perspective, you know, the scale has more, you know, been gradual, from the beginning to, like, mid or Q3 of 2020. And starting Q4 of 2020, you know, late Q3 or Q4 of 2020, that's when we saw massive growth, massive growth in our volumes, you know, magnitudes that we were not necessarily ready for. We were not prepared for it. We knew that we were going to grow. We had plans to grow, you know, our platform, but our plans were like 12 to 18 month plans. We had to expedite that and squeeze that into like a three month, four month plan.

Adam

Oh wow.

Avinash

It was a fun and adventurous time for us.

Adam

So what kind of considerations did you have to make when it came to dealing with that kind of really rapid need for scaling at that level?

Avinash

You know, if you think about it, a stack that's cloud native should have the capability to scale horizontally fairly easily. Pour money in, you know, provision more hardware, and you should be able to scale horizontally. Theoretically, that's how it works. Practically, I don't think the majority of people out there who start building a product for a startup build it in a way where this is actually true. You know, you can elastically scale horizontally, in theory, to some extent, but your bottlenecks are still primarily your data tiers. That's what happened with us. So on our application tier, we could very well scale horizontally. But our bottleneck was our data tier, you know, primarily our MySQL RDS instance. Like any other startup out there, we also have a monolith that we're trying to break down. And, you know, so we focused on our data tiers; that was our biggest focus. And we knew that over the years, we had done things that got us out there in terms of future timelines, in terms of time to market, really quick, but there was a lot of tech debt that we had accumulated. Off the bat, we actually clearly understood that we should focus on the data tiers and, you know, just put fundamental patterns in place: read/write splits, offloading certain things to off hours. Those are the kinds of things we did; tune our SQLs. It might sound very fundamental, but we put a lot of effort into tuning our SQLs. We identified, you know, these multiway joins that we had written over and over again, over many years, just to get things done quickly. And we clearly identified how these were soon becoming bottlenecks. So to summarise, we focused on our data tier. And we knew that that was a bottleneck very clearly. And we put in an effort to offload, you know, a bunch of workload from our primary source to read replicas where we could, and to tune the SQLs.
Just by this effort, and it was a few months of effort, we created more than 98% capacity on our database, basically. And that helped. So it was fundamentally, you know, resolving some of our tech debt, but in a way which was quick. We did implement, you know, patterns like read/write splits, but there's a long way to go; we have a lot more to do. At some point, we will hit that bottleneck again, and we are actively working on that now. And another learning that we had as part of all of this was, you know, we really had to put a lot of focus on our tech platform. I believe that engineering is successful only if the business is successful. There is no question about that; you know, if the business does great, the success is translated to engineering success as well. So all this time, we were really making sure that, you know, Turo as a business is growing healthily, year over year. And that's what our effort was. But obviously, at the back of our mind, we knew that, you know, there's tech debt that's getting accumulated. We started getting more and more conscious about it, I'd say, you know, late 2018, early 2019, and we put in a lot of effort to modernise. We just didn't anticipate that, you know, the end of COVID, the end of the first set of COVID, would bring in volumes that would surprise us, and, you know, we'd be putting some of these things in place to expedite some of this effort. We moved a lot of our effort towards platform scaling. And that just did it for us.
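
The read/write split Avinash describes can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not Turo's implementation (their back end is described as Java Spring): the `RoutingPool` class, the connection names and the regex-based statement classifier are all hypothetical. The idea is simply that writes go to the primary database while plain reads are spread across read replicas.

```python
import itertools
import re

# Hypothetical classifier: treat statements that start with these
# keywords as read-only. Real systems usually decide this from an
# explicit annotation or transaction flag rather than by parsing SQL.
_READ_ONLY = re.compile(r"^\s*(select|show|explain)\b", re.IGNORECASE)


class RoutingPool:
    """Send writes to the primary and spread reads across replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Fall back to the primary when no replicas are configured.
        self._replicas = itertools.cycle(replicas or [primary])

    def target_for(self, sql, in_transaction=False):
        # Reads inside a transaction stay on the primary so they can
        # see that transaction's own uncommitted writes.
        if in_transaction or not _READ_ONLY.match(sql):
            return self.primary
        # Round-robin over the replicas for everything else.
        return next(self._replicas)


# Hypothetical connection names, for illustration only.
pool = RoutingPool("primary-db", ["replica-1", "replica-2"])
```

A sketch like this ignores replication lag and locking reads such as `SELECT ... FOR UPDATE`, both of which would need to go to the primary in practice.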

Sabina

So what has been the biggest challenge of this process?

Avinash

Well, if I think about it, I think the time we had on hand was the biggest challenge. We identified that we were getting volumes of traffic back on Turo fairly quickly, but the time from then, when we said that yes, we're going to get, you know, a lot of volume now in, I'd say, Q3 of 2020, and we identified that, you know, I guess early Q2; the time that we had on hand, and the projection of the volumes, made it very adventurous. So basically, very challenging. You know, the plans that we had, that spanned across many months, had to be squeezed into, like, you know, three, four months of effort. So the time that we had on hand to be able to manage the volumes was the biggest challenge. Besides that, you know, I've seen in previous places that I've worked the challenge in managing tech debt. And resolving tech debt has always been the challenge of getting work prioritised. You know, you're working on a bunch of product features, business features, and then you have a lot of this tech platform work that you've got to do. How do you make a convincing argument that this is the right time? You know, we at Turo have a great culture from that perspective; everybody understands tech, everybody understands the challenge of tech. So that part of the equation was a lot easier. There was no convincing required; everybody kind of got it, there's a lot of trust in each other. And we just, you know, focused on getting this done. We spent a lot of effort, actually the majority of our effort was spent on platform stability; we really dialled down a bit on our feature development work. And that focus really helped. So there was a company-wide focus. In fact, we made, you know, the availability, security and performance of our platform the top three goals of the company. And since then, it has been like that.

Adam

So speaking of the people side of the equation, then, have you found talent acquisition to be a barrier to scale at all? I know a lot of organisations have struggled over the last couple of years, in particular, with recruitment of skilled engineers and developers as a kind of particular pain point. Is that something that you've experienced at all?

Avinash

Oh, absolutely. I think, you know, the scale of business comes with two challenges. One, how does the platform scale with the business? Two, how does the team scale with the business? So scale, you know, has a people impact as well as a platform impact. So the ability to hire fast enough and build the team, onboard them, get them familiar with the codebase and platform, and get them up and running is a huge challenge. We had those too. It became bigger because of how, you know, the style of work was evolving because of the pandemic. You know, people, I feel, started looking at what remote is for them. They got comfortable with remote; we're doing this podcast remotely. And that change actually was a bit of a difficult one for us. You know, Turo really values its culture a lot. We are a company where we feel that, you know, human interaction is critical for the success of our business. It's part of our culture, it's the core element of our culture. We, you know, were nervous about going fully remote. And the talent base out there was moving much faster than we were on this aspect. So it did pose a challenge. Initially, we were like, well, you know, we've got to go to hybrid, we're not necessarily going to go fully remote, or make coming to the office an optional thing. And that posed a bit of a challenge for us. So initially, we started with, you know, trying to find people from within areas where our business, you know, existed in terms of being set up for payroll, and that was a limiter. We realised it, you know, early 2022, I'd say. We missed our 2021 hiring goals, to be very honest, you know, and we missed them significantly. But it was a stark learning for us. We, you know, changed course; we felt that we've got to be adjusting towards where, you know, the talent is. And in the last, I'd say, you know, three months, we've seen a huge difference.
So, one was, of course, you know, the challenge posed by how we felt we wanted to be as a team, you know, remote, hybrid, in person, the whole mix of things, you know, how we thought about it. And the second thing was, you know, the inflation and competition in the tech talent market overall has been crazy. You know, there's a global talent shortage when it comes to tech, right?

Sabina

Yes, that's happening in the UK as well.

Avinash

Yeah, exactly. It's everywhere. It's not just, you know, the US, it's everywhere. I see this across the globe. And so hiring has been a challenge. So we've done very creative things to figure out, you know, how to attract talent. You know, and we have lots of competition; Turo has to compete with the Metas of the world, the Googles of the world, and, you know, we don't have as deep pockets as these companies have yet. So it's definitely a challenge; it takes a lot of convincing. We knew, you know, that we have to really focus on the right kind of sourcing, targeting the right people, who are, you know, not looking at just compensation from a short term perspective; they're looking at being part of a mission, and they're really, you know, investing in a longer term future. So, yeah, to answer your question, you know, it has been challenging to scale. But I think we are now at a place where, you know, we feel very confident about growing it. Just to give you some numbers, you know, over the last 12 months or so, we've almost tripled the size of our team.

Sabina

Wow, that is amazing, especially given the talent shortage and pandemic. And I was just going to ask, because hiring is definitely a challenge nowadays, but there's another worrying statistic, which I came across recently when we covered, I think, some Gartner research. The report found that less than a third, around 29%, of surveyed IT workers said that they are planning to stay with their current employer, which means that 71% were planning to move on in the near future. So can I just ask, when hiring is also a challenge, what steps do you take to retain this talent when you manage to employ them?

Avinash

Great question. I think attrition has never been a huge challenge for us. And that's a testament to what our culture is about. I think once, you know, we get people in and they work with us as a team, we see retention not being a huge challenge. Attrition, in my past four years here at Turo, has not been a huge challenge for us. And the same thing, you know, continued over the pandemic as well; even with, you know, a lot of attrition across the board, we were fairly, you know, healthy in that aspect.

Sabina

That's good to hear.

Avinash

Yeah, and, you know, our team is small, though. You know, we're not talking about hundreds and thousands of engineers, like other big places. At the moment, our team is roughly about 110, I'd say.

Sabina

That's still pretty, very good.

Avinash

Yeah. So, attrition has never been, you know, a huge challenge for us in engineering, and across the board, you know, at Turo. You know, a lot of things play into that; we are very transparent as a company, you know, we are really a people-focused company. And we value our culture, you know, a lot, and therefore, I think, once we see people joining Turo, they really fall in love with Turo. You can see that from, you know, our Glassdoor reviews. And you know, lately we've been ranked number one in the Best Companies to Work For in various kinds of surveys.

Sabina

Wow. So maybe I should apply as well.

Avinash

We're always hiring, Sabina.

Sabina

No, I don't think you'd take me. I lack the qualifications, definitely.

Adam

Let's talk a little bit about some of the learnings you've taken from this experience. Was there anything that you found particularly surprising about both operating at scale and about reaching that level of scale in the first place?

Avinash

Yeah, I think there were many. And if I, you know, prioritise my answers right, I think the thing that surprised me the most was how attractive we were to the bots out there, and to these, you know, request-based attacks that happened to us. You know, I was very surprised: why would someone try to, you know, DoS Turo or do a request-based attack on us? You know, what's going on? And that surprised us for quite a bit. We were not necessarily sophisticated in that area, you know, 12 months ago, actually, sorry, not 12, about 18 months ago, I'd say. So that was, I think, my biggest surprise. You know, keeping our incoming traffic in control, in check, was something that, you know, was challenging earlier.

Adam

How did you respond to those problems?

Avinash

Yeah, you know, not very well initially, to be honest. Initially, you know, we had a lot of downtime. This was, you know, in 2019. A bit in 2019, a bit in, not early 2020, late in the second quarter of 2020. We faced challenges, and we were not necessarily, you know, doing the things we should have, thinking back.

Adam

Such as?

Avinash

Such as, you know, really securing our network traffic through products like, you know, Cloudflare out there on the edge, or AWS Shield, you know, like edge security. You know, we didn't put a lot of focus on securing our traffic on the edges. We would take it all in, you know, onto our clusters. And we would put all kinds of, you know, rules, policies, logic on our app tiers to basically say, all right, bad traffic, you know, do a no-op. But it's too late if you do that, you know, if the traffic makes it all the way to a pod that's running your application. And that's one thing which surprised us, you know, and we didn't even necessarily feel that we could be a target for these request-based attacks.
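
One common building block of the kind of request filtering Avinash is talking about is a per-client rate limit, applied before traffic reaches the application. The following is a minimal Python sketch of a token bucket, offered only as an illustration of the idea. It is not how Cloudflare or AWS Shield are configured or implemented, and the class names, rates and client identifiers are hypothetical.

```python
import time


class TokenBucket:
    """Allow about `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start with a full bucket
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill in proportion to the time elapsed, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0  # spend one token for this request
            return True
        return False  # bucket empty: reject or challenge the request


class PerClientLimiter:
    """Keep one bucket per client identifier (for example an IP address)."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}

    def allow(self, client_id, now=None):
        bucket = self.buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, now)
            self.buckets[client_id] = bucket
        return bucket.allow(now)
```

The `now` parameter exists only so the behaviour can be checked with fixed timestamps; in normal use it is omitted and the monotonic clock is used. The point Avinash makes still applies: a check like this is most useful when it runs at the edge, because a request rejected inside the application pod has already consumed cluster resources.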

Adam

I suppose it's that thing where you kind of don't expect as a business to be hit until you, until you're, quote unquote, successful. Right. But I think what businesses class as successful for themselves is not necessarily what an attacker would class as successful enough to make it a legitimate target.

Avinash

Exactly. And that was a big learning for us. But, you know, we had an incident where, purely because of request flooding, you know, we took downtime. I'd say, you know, over the period of three hours of that attack, we were down for about 45 minutes, and that was horrible. So, you know, that's the time we decided that we've got to just go all in on edge security. And we did, and that helped us significantly. And then, you know, immediately after that we saw massive efforts on account takeovers. So that was the next surprise that came in. All right, you know, there's a lot of interest in taking over Turo accounts. And we saw all kinds of credential stuffing attacks, many, many, you know, different kinds of attacks to take over Turo accounts. So last year, I'd say somewhere around the middle of last year, we put in a lot of effort on account security; account takeover, specifically. And I think what we now have is something close to world class, I'd say, when it comes to securing Turo accounts. Besides these, you know, the other challenges are more usual in terms of scaling platforms: you know, identifying bottlenecks, putting in the right kind of monitoring, you know, and alerting, like observability overall. You know, observability, I feel, if you ask any head of engineering out there in a business of our size, I think the answer would always be, we could do a little better on observability. You know, you're always, like, a tad short of where you want to be, I feel. And we were always a tad short. We kept on improving; we felt, oh, there's a little bit more that we've got to do. And even last night, you know, I was talking to one of our infrastructure engineers, and we were like, all right, you know, there's flakiness on search, and maybe we could do a little bit more on our liveness and readiness probes, you know.
So, yes, those were the other challenges. We felt that, you know, observability could be better. And we, again, invested a lot in that; you know, we consolidated our logging and monitoring platform. I think that helped us a lot.

Adam

Is that one of the things that you would kind of class as being a key enabler of scaling up, that kind of observability piece?

Avinash

Oh, yeah, you know, I think that it's very up there. You absolutely need to, I feel, have world-class observability on your platform. You need to know what's happening. You need to be able to reduce your, you know, mean time to detect issues significantly. If you can detect issues quick, you know, and when I say quick, I'm talking about minutes, in fact, minutes going down to single digits for your most gnarly problems; if you can detect, you know, incidents, the root causes of the incidents, in single-digit minutes, I think you're doing great as a business of our size. Obviously, when you grow larger, you know, you've got to reduce MTTD significantly, and you've got to take it to sub-minute. And then it also helps you, you know, resolve things; MTTR, mean time to resolve. I feel that, you know, both of these are very critical for you to keep the platform up, you know, stay on top of availability. And it's critical; you know, a few minutes of downtime can have a huge impact on, you know, how you are perceived by your customers. So observability plays a key role for you to identify these areas and understand where your challenges are as a platform, and clearly identify those bottlenecks, basically.
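
The two metrics Avinash mentions, MTTD and MTTR, are simple averages over incident timestamps. As a minimal sketch, assuming each incident is recorded with the time it started, the time it was detected and the time it was resolved, they could be computed as below. The incident data is invented for illustration, and measuring MTTR from the start of the incident (rather than from detection) is an assumption; teams define it both ways.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident log: when the fault began, when monitoring
# detected it, and when service was restored.
incidents = [
    {"started": "2021-03-01T10:00", "detected": "2021-03-01T10:04",
     "resolved": "2021-03-01T10:30"},
    {"started": "2021-04-12T22:15", "detected": "2021-04-12T22:23",
     "resolved": "2021-04-12T23:00"},
]


def minutes_between(start, end):
    """Elapsed minutes between two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 60


def mttd(records):
    """Mean time to detect: average of (detected - started), in minutes."""
    return mean(minutes_between(r["started"], r["detected"]) for r in records)


def mttr(records):
    """Mean time to resolve: average of (resolved - started), in minutes."""
    return mean(minutes_between(r["started"], r["resolved"]) for r in records)
```

With the sample data above, the detection times are 4 and 8 minutes, which falls in the single-digit range Avinash describes as a good target for a business of Turo's size.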

Adam

So were there any technology choices that you made early on in Turo's lifespan that you felt kind of maybe tripped you up a little bit as you got further down the road and needed to scale more rapidly?

Avinash

Yeah, I've been here for about four years, and one of the things that, you know, I was pleasantly surprised and very happy about was how modern our stack was from the get go. There was, you know, a huge emphasis on being simple, being nimble, but still being fast. So, besides the huge bottlenecks on our data tiers, and a few basic, you know, things about splitting code into more meaningful chunks so that they can be deployed in more isolated forms, like clusters, based on workloads or whatnot, you know, besides some of those simple things, we've always been fairly good in terms of keeping the stack modern. Now, one area where I feel that, you know, we could have done better is our velocity in moving away from Objective-C. And that, you know, I think we underestimated a little bit, but over the last, I'd say, 12 months, we've really sped that effort up. And we knew that, you know, hiring individuals with Objective-C in our job requirements is not necessarily an easy thing; you know, it's not very attractive. Obviously, the modern day iOS developer wants to be working on Swift; they've moved on. And that's one area, I'd say, where we could have done a little better, but over the last 12 months, we've done really well.

Sabina

So, what advice would you have for other organisations looking at scaling up their infrastructure?

Avinash

Yeah, I think, you know, I feel that when you are part of a budding business, you know, the scale of the business is fairly small, you're focused on really growing the business, and you tend to make decisions which completely undermine the fundamentals of, you know, platform design, basically. Some of these fundamentals are not very difficult or tough ones for you to go and get from the beginning. It's just fear, I guess, you know, or maybe shortcuts, that makes you think that, all right, you know, I'm not going to architect my data tiers so that the data is organised, you know, in a better structure; put everything on one instance, you know, write some SQLs and get stuff done.

Adam

Yeah, just get it done now and sort it out later. You never, you'll never do that.

Avinash

Yeah, I think, you know, there's always an argument about how important time to market is when you're an early stage startup; I completely understand. But these days, given how modern, you know, distributed computing has become, the effort involved for you to identify and work on these fundamental principles of scale is not as much as it was 10 years, 15 years ago. Yes, I'd say, you know, people are making these wrong choices now. It might appear to be a bit arrogant, but it's short sighted, I feel. You know, you've got to be putting in these efforts from the get go, because it's easy. It's not challenging anymore.

Adam

Well, on that note, I think that's about all we've got time for this week. But I'd like to thank Turo's Avinash Gangadharan, for joining us.

Avinash

Thank you, Adam.

Sabina

You can find links to all of the topics we've spoken about today in the show notes, and even more on our website at itpro.co.uk.

Adam

You can also follow us on social media as well as subscribe to our daily newsletter.

Sabina

And don't forget to subscribe to the IT Pro Podcast wherever you find podcasts. And if you're enjoying the show, leave us a rating and review.

Adam

We'll be back next week with more analysis from the world of IT, but until then, goodbye.

Sabina

Bye.
