Abandoning Microservices?

Recently Prime Video published a blog post that seems to have set the tech world bouncing up and down. In the article, they discussed their tooling for audio/video analysis. The tool was developed in-house but was never intended to run at massive scale. As Prime Video has grown, the tool has been used more and more, resulting in performance problems and high costs.

The system’s design is interesting: they used a distributed model with AWS Step Functions. The implementation seems logical on the surface. They used a fully serverless design that allowed them to build quickly and deliver their services.

AWS Architecture

What went wrong?

As the use of the tool grew, a bottleneck emerged in the orchestration layer. AWS Step Functions uses a state machine to manage the process as it triggers different functions. The problem emerged because there were multiple state changes every second, and they quickly reached the account limits.

The second problem was that they used S3 to store temporary files. The Lambda functions needed to download those files, and that much file transfer was expensive in S3. There were lots of files being downloaded every second.
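To get a feel for how quickly these two costs add up, here is a rough back-of-envelope sketch. Every number and name in it is a made-up assumption for illustration; none of them come from the Prime Video article or from AWS limits or pricing.

```python
def per_second_load(streams: int, samples_per_second: int,
                    steps_per_sample: int, reads_per_sample: int) -> dict:
    """Estimate orchestration and storage load for a distributed pipeline.

    All parameters are hypothetical inputs, not figures from the article.
    """
    samples = streams * samples_per_second
    return {
        # Each sample walks through several workflow steps, and every step
        # is a state change the orchestrator has to record.
        "state_transitions": samples * steps_per_sample,
        # Each sample is fetched from remote storage by several functions.
        "storage_reads": samples * reads_per_sample,
    }


# Hypothetical workload: 100 streams, one sample per second each,
# five workflow steps and three remote reads per sample.
load = per_second_load(streams=100, samples_per_second=1,
                       steps_per_sample=5, reads_per_sample=3)
print(load)  # {'state_transitions': 500, 'storage_reads': 300}
```

The point is that both figures grow in step with the number of streams, so a design that is cheap for a handful of streams becomes expensive as usage grows.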

What did they do?

They analysed their architecture and found the problem was in the distributed design of the system. They reimplemented their architecture to run as a monolith and manage all the state changes within a single process. This could then be run in containers and scaled horizontally. The new design removed the need to manage distributed components and an S3 bucket full of temporary files. Everything was one process, and files could be managed locally.
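As a minimal sketch of what "everything in one process" can look like, the example below runs a chain of checks as plain function calls over data held in memory. The `Frame` type, the detector names and the brightness thresholds are all hypothetical stand-ins, not the Prime Video team's actual code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Frame:
    index: int
    brightness: float  # hypothetical stand-in for real pixel data


# A detector is just a function: no network hop, no remote file download.
Detector = Callable[[Frame], List[str]]


def detect_dark_frame(frame: Frame) -> List[str]:
    return ["dark_frame"] if frame.brightness < 0.1 else []


def detect_overexposed(frame: Frame) -> List[str]:
    return ["overexposed"] if frame.brightness > 0.9 else []


def analyse(frames: List[Frame], detectors: List[Detector]) -> Dict[int, List[str]]:
    """Run every detector on every frame inside a single process."""
    findings: Dict[int, List[str]] = {}
    for frame in frames:
        # The "state change" between steps is an ordinary function call.
        issues = [issue for detect in detectors for issue in detect(frame)]
        if issues:
            findings[frame.index] = issues
    return findings


frames = [Frame(0, 0.5), Frame(1, 0.05), Frame(2, 0.95)]
result = analyse(frames, [detect_dark_frame, detect_overexposed])
print(result)  # {1: ['dark_frame'], 2: ['overexposed']}
```

In a design like this, scaling horizontally means running more copies of the whole process, each handling its own share of the streams.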

AWS Architecture

The result of this change was a 90% saving in their costs and the ability to scale further than before. They had removed the limits that were holding them back and created a new architecture that allowed them to grow.

What’s wrong with microservices?

This post has received much attention because it flies in the face of industry wisdom. Most developers have worked on projects where the client asked for a microservice architecture so they could scale and grow, but never really needed to. There is an assumption that to future-proof the architecture, it’s best to use microservices because they enable reuse and scaling.

While the benefits of microservices are real, there are also drawbacks. Any distributed system will need to manage state somewhere, files will need to be stored remotely, and services will need to talk to each other over unreliable networks. The risk of failure is always higher with distributed systems, so more testing and defensive programming are needed. It’s hard to get right.

We should also address the benefits. Yes, a microservice architecture can be helpful with scaling if you have components that can genuinely scale independently and you have the traffic to warrant it. Yes, a microservice architecture can promote reuse if the team building the services has the time and resources to make something that others want to use and can rely on. Often companies think having microservices will solve these problems automatically. They won’t, unless there is a real problem to solve and the team has the support it needs to make it happen.

Microservices and distributed systems solve specific problems around scale and design, but they add new ones. Sometimes they’re great, but often, a simple monolith will deliver just as much value. In the end, the user will never see the approach, so only use microservices if you need to.

What should the Prime Video team have done differently?

Nothing.

They needed to build a service quickly and launch it. They did not expect it to need to scale as it did. They could have spent longer designing a system to scale better, but it would have been a waste if that capacity was never needed.

As it turned out, they did need to scale. The business grew in unexpected ways, so requirements changed. It happens in successful companies, and we need to handle it.

They realised there was a problem, did some technical analysis and found the bottleneck. They saw that the issue was the distributed design, and they changed it.

It’s easy to look at it afterwards and say that a distributed system was the wrong choice, but that seems unfair. They launched something that worked for years. If the problem had been a need to add functionality instead of scale, then the distributed approach would have been great: just add more steps to the process. It’s difficult to know what problems will be faced in the future.

Conclusion

A lot of focus has been given to the debate about microservices vs monoliths. That’s a good debate and should happen more, but we don’t think it’s the lesson here. This is an example of delivery done well. The team designed a solution that could be delivered quickly and launched it. That contributed to the growth of the platform, and then they had a new technical problem. Make changes, fix it, and wait for the next problem to arise.

We can spend a long time on system design and try to predict our future needs, but it’s almost always a waste. Get something that works and launch it. If you’re really lucky, you will have a scaling problem in the future, but only because you launched and people used it. Then you’re allowed to worry about scaling.