Last time I discussed the performance aspect of software economics: how a faster application can directly impact income, and how little data exists on the cost of building a faster solution. This time I’ll explore the scalability side of the equation and how better scalability holds down operational costs.
With Moore’s law finally breaking down, scaling will become an even larger concern than it is today. Where I’m working today, we’ve ridden Moore’s law to solve a lot of scaling problems for us: hardware was getting faster and cheaper quickly enough to absorb our growth for years. That allowed us to invest heavily in building out new features and dominating the market. Now it’s catching up to us, and I doubt we’re the only company with this issue.
In general, you need a certain base level of users to justify the fixed cost of building a system. But variable costs like servers and storage go up as the user base grows, and many algorithms scale memory and/or CPU usage at a rate worse than linear. So, in theory, at some point growth hurts your ability to actually make a profit. Yet, as an industry, we continue to scale up and seek out network effects to get even bigger. How do these two ideas support each other despite seeming contradictory?
In part, the cost of individual resources is so low that it doesn’t matter unless you’ve chosen an architecture that is scale up rather than scale out. The other part is that even in a scale up situation, what you can get today will generally still scale to the level of millions of users, as long as you are willing to pay for it.
This is where the idea that it is easier to get a 10x improvement than a 10% one comes from. Tuning the runtime of a solution only takes you so far: eventually you hit the lower bound of its complexity class. Instead of seeking scalability with the same algorithms, you can rebuild the infrastructure and algorithms to be more efficient in their runtime, e.g., improving from n^2 to n log n. Or you can scale against a slower-growing aspect of the problem and move the bottleneck, e.g., scaling against memory instead of CPU. The problem is that most of the time, the dimension you are scaling with is your only option. In many business problems you’re scaling against the needs of the business itself: if you need to send emails, your resource usage is going to scale with the number of emails being sent, and there isn’t a way around that. If you’re at the theoretical bounds, you need to solve the business problem through other, more clever means.
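To make the n^2 versus n log n contrast concrete, here’s a minimal sketch (not from the original post; the function names are my own) of the same duplicate check done both ways:

```python
def has_duplicates_quadratic(items):
    # Compare every pair: O(n^2) time, O(1) extra space.
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False


def has_duplicates_nlogn(items):
    # Sort once, then scan adjacent elements: O(n log n) time,
    # O(n) extra space for the sorted copy.
    ordered = sorted(items)
    return any(a == b for a, b in zip(ordered, ordered[1:]))
```

Both return the same answer; the difference only shows up in how the cost grows as the input does.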
Those more clever means usually involve things like caching or estimating. Caching lets you skip an entire computation, along with its complexity bound: it trades memory for CPU, but is sometimes the best option. Estimating, on the other hand, is about understanding the degree of precision a calculation truly requires. If the question is “How many players are online right now?” you need to know whether saying “a million” is good enough, or whether you need to know it was exactly 1,002,783. For a counter showing the real-time player total for monitoring purposes, a rough count might well be fine. Then you can use a known technique like HyperLogLog, which counts unique entries in far less than n space.
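As an illustration of the estimating approach, here is a toy HyperLogLog sketch; the class name and parameters are my own, and a real system would use a hardened library rather than this. The idea is that the maximum number of leading zero bits seen across many hashed items reveals roughly how many distinct items there were, using a fixed, tiny amount of memory:

```python
import hashlib


class HyperLogLog:
    """Toy HyperLogLog: m = 2**p registers, each storing the max
    'rank' (position of the first 1-bit) seen among hashes routed to it."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # 64 bits of a stable hash of the item.
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                  # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        # Raw HLL estimate with the standard alpha correction for large m.
        alpha = 0.7213 / (1 + 1.079 / self.m)
        indicator = sum(2.0 ** -r for r in self.registers)
        return alpha * self.m * self.m / indicator
```

With p=12 this uses 4,096 registers regardless of input size and typically lands within a couple of percent of the true cardinality for large counts, which is exactly the “a million is good enough” trade.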
With n = 1,000,000 the n^2 algorithm costs roughly 50,000 times as much as the n log n solution, and with n = 10,000,000 roughly 430,000 times as much. Regardless of how cheap resources are, that will add up pretty quickly. Using an approximation can be an amazing way to cut back resource utilization, but it requires a more nuanced understanding of how the system is going to be used, which can create other sorts of fixed costs.
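Those ratios can be sanity-checked directly; I'm assuming the logarithm is base 2, which matches the figures quoted:

```python
import math


def cost_ratio(n):
    # Ratio of an n^2 algorithm's cost to an n*log2(n) one's,
    # which simplifies to n / log2(n).
    return (n * n) / (n * math.log2(n))


print(f"{cost_ratio(1_000_000):,.0f}")   # on the order of 50,000
print(f"{cost_ratio(10_000_000):,.0f}")  # on the order of 430,000
```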
The moral of the story is that more data should eventually force you to go back and reconsider how the system is built. This is true for scale out architectures as well as scale up. Resource consumption always carries some overhead, and as the system gets bigger, that overhead only grows as a proportion of the whole.