We often get asked to encapsulate our experience into a top 10 list for CTOs and CEOs. As in golf, success in technology depends as much on keeping your bad hits (aka blunders, mistakes, and failures) recoverable as on nailing your great hits or successes. We are all going to have failures in our careers, but avoiding the really big pitfalls will help ensure that we keep our companies and our products on the right growth path.

So, without further ado, and in keeping with our high standards of “raising the bar”, here are the top 20 things (rather than 10 and in no particular order) we believe are most important to avoid when developing platforms:

1) Failing to design for rollback

We said these were in no particular order, but right out of the gate we are going to make an exception to the rule. If you are developing a SaaS platform and you can only make one change to your current process, make it so that you can always roll back any of your code changes. Yes, we know that it takes additional engineering work and additional testing to make nearly any change backwards compatible, but in our experience that work has the greatest ROI of any work you can do. It only takes one really bad code roll, in which your site performance is significantly degraded for several hours or even days while you attempt to “fix forward”, for you to agree that this is of the utmost importance. The one thing that is most likely to give you an opportunity to find other work (i.e. “get fired”) is to roll a product that destroys your business. In other words, if you are new to your job, DO THIS BEFORE ANYTHING ELSE; if you have been in your job for a while and have not done this, DO THIS TOMORROW.

2) Confusing product release with product success

Do you have “release” parties? Stop it! You are sending your team the wrong message! A release has nothing to do with creating shareholder value, and very often it is not even the end of your work with a specific product offering or set of features. Align your celebrations with achieving specific business objectives, like a release increasing signups by 10%, increasing checkouts by 15%, increasing the average sale price of all checkouts by 12%, or increasing click-through rates by 22%. See #10 below on incenting a culture of excellence. The point here is that you are paid to increase shareholder wealth, so have success parties when you achieve objectives specifically tied to that wealth creation. Don’t celebrate the cessation of work – celebrate achieving the success that makes shareholders wealthy.

3) Insular product development/engineering

How often does one of your engineering teams complain about not “being in the loop” or “being surprised” by a change? Does your operations team get surprised by some new feature and its associated load on a database? Does engineering get surprised by some new firewall or routing infrastructure resulting in dropped connections? Do not let your teams design in a vacuum and “throw things over the wall” to another group. Use best practices like teaming or a process that we will discuss later called Joint Applications Development. We are not arguing that designs should be done by committee, but rather that collaborative designs with a clear owner and decision maker are better than designing without input or checks and balances.

4) Over engineering the solution

Your job is to maximize shareholder value as cost effectively as possible. To that end, one of your mottos should be “simple solutions to complex problems”. The simpler the solution, the lower the cost and the more likely it is that it will be easily and cost effectively maintained. If you get blank stares from peers or within your organization when you explain a design, do not assume that you have a team of idiots – assume that you have made the solution overly complex and ask for assistance in reducing the complexity.

5) Allowing history to repeat itself

Organizations do not spend enough time looking at past failures. In the engineering world, a failure to look back into the past and find the most commonly repeated mistakes is a failure to maximize shareholder value and grounds for dismissal. In the operations world, a failure to correlate past site incidents and find thematically related root causes should be a cause for termination. The best and easiest way to improve our future performance is to track our past failures, group them by cause, and treat the root cause rather than the symptoms. Keep incident logs and review them monthly and quarterly for repeating issues so that you can improve your performance. Perform post mortems of projects and site incidents and review them quarterly for themes.
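As a minimal sketch of what that monthly or quarterly review could look like in code, the example below counts incidents by root cause and surfaces the causes that repeat. The `Incident` record, the `repeating_causes` helper, and the sample log entries are all hypothetical and invented for illustration; a real incident log would have its own fields and categories.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Incident:
    day: date
    summary: str
    root_cause: str  # category assigned during the post mortem (hypothetical field)


def repeating_causes(incidents, min_count=2):
    """Return (root_cause, count) pairs seen at least min_count times, most frequent first."""
    counts = Counter(incident.root_cause for incident in incidents)
    return [(cause, n) for cause, n in counts.most_common() if n >= min_count]


# Made-up sample log, for illustration only.
LOG = [
    Incident(date(2024, 1, 4), "checkout latency spike", "untested config change"),
    Incident(date(2024, 1, 19), "search unavailable", "capacity exhaustion"),
    Incident(date(2024, 2, 2), "login failures", "untested config change"),
    Incident(date(2024, 2, 20), "slow page loads", "capacity exhaustion"),
    Incident(date(2024, 3, 11), "dropped connections", "untested config change"),
    Incident(date(2024, 3, 15), "email delays", "vendor outage"),
]

for cause, count in repeating_causes(LOG):
    print(f"{count}x {cause}")
```

The point of the grouping is that the review agenda becomes the repeated causes rather than the individual incidents.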

6) Scaling through third parties

Every vendor has a quick fix for your scale issues. If you are a hyper-growth SaaS site, however, you do not want to be locked into a vendor for your future business viability; rather, you want to make sure that the scalability of your site is a core competency and that it is built into your architecture. See our articles on database scalability and platform scalability. This is not to say that after you design your system to scale horizontally you will not rely upon some technology to help you; rather, once you define how you can scale horizontally, you want to be able to use any of a number of different commodity systems to meet your needs. As an example, most popular databases provide log shipping to keep read or standby databases in sync with the primary. Per our discussion in technology agnostic design, define how your platform scales through your efforts, not through the systems that a third-party vendor or open source software company provides. If you say “we use ACME database clusters to scale our database”, we would argue you have the wrong solution. If, on the other hand, you say “we split our databases into read and write systems and further split them by customer id”, you are attacking the problem appropriately.
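The read/write split and the split by customer id can be expressed as routing logic that you own, independent of any particular database product. The sketch below is one possible illustration, not a recommended implementation: the host names, the `Shard` type, and the modulo-based placement are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    primary: str     # receives all writes for its customers
    replicas: tuple  # read-only copies, e.g. kept current by log shipping


# Hypothetical hosts; any commodity database could sit behind these names.
SHARDS = (
    Shard("db0-primary", ("db0-read-a", "db0-read-b")),
    Shard("db1-primary", ("db1-read-a", "db1-read-b")),
)


def shard_for(customer_id: int) -> Shard:
    """Split by customer id: each customer lives on exactly one shard."""
    return SHARDS[customer_id % len(SHARDS)]


def route(customer_id: int, is_write: bool) -> str:
    """Split by read/write: writes go to the primary, reads to a replica."""
    shard = shard_for(customer_id)
    if is_write:
        return shard.primary
    # Spread reads across replicas deterministically by customer id.
    return shard.replicas[(customer_id // len(SHARDS)) % len(shard.replicas)]


print(route(7, is_write=True))
print(route(7, is_write=False))
```

Simple modulo placement moves customers between shards whenever the shard count changes, so a lookup table from customer id to shard is a common alternative. Either way, the scaling decision lives in your code and not in a vendor feature.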

7) Relying on QA to find your mistakes

You cannot test quality into a system, and it is mathematically impossible to test all possibilities within complex systems to guarantee the correctness of a platform or feature. QA is a risk mitigation function and it should be treated as such. Defects are an engineering problem, and engineering is where they should be addressed. If you are finding a large number of bugs in QA, do not reward QA – figure out how to fix the problem in engineering. Consider implementing test-driven design as part of your PDLC. If you find problems in production, do not punish QA; figure out how you created them in engineering. None of this is to say that QA should not be held responsible for helping to mitigate risk – they should – but your quality problems are an engineering issue and should be treated within engineering.

8) Revolutionary or “big bang” fixes

In our experience, complete re-writes or re-architecture efforts end up somewhere on the spectrum from not returning the expected ROI to complete and disastrous failure. Nine times out of ten they are simply not warranted and should be avoided. The best projects we have seen with the greatest returns have been evolutionary rather than revolutionary in design. That is not to say that your end vision should not be to end up in a place significantly different from where you are now, but rather that the path to get there should not include “and then we turn off version 1.0 and completely cut over to version 2.0”. Go ahead and paint that vivid description of the ideal future, but approach it as a series of small (but potentially rapid) steps to get to that future. And if you do not have architects who can help paint that roadmap from here to there, go find some new architects.

9) The Multiplicative Effect of Failure

Every time you have one service call another service in a synchronous fashion, you are lowering your theoretical availability. If each of your services is designed to be 99.999% available, where a service is a database, application server, application, webserver, etc., then the product of the availabilities of all the services in the call chain is your theoretical availability. Five calls gives (.99999)^5, or about 99.995% availability. Eliminate synchronous calls wherever possible and create fault-isolative architectures to help you identify problems quickly.
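The arithmetic is easy to check for longer call chains. The sketch below assumes, as the paragraph does, that every service has the same availability and that failures are independent. The function names and the choice of chain lengths are ours, for illustration only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def serial_availability(per_service: float, calls: int) -> float:
    """Theoretical availability of a chain of synchronous calls.

    Assumes independent failures and identical availability per service,
    so the chain is up only when every service in it is up.
    """
    return per_service ** calls


def downtime_minutes_per_year(availability: float) -> float:
    """Expected unavailable minutes in a 365-day year."""
    return (1.0 - availability) * MINUTES_PER_YEAR


for calls in (1, 5, 10, 50):
    chain = serial_availability(0.99999, calls)
    print(
        f"{calls:>2} calls: {chain * 100:.4f}% available, "
        f"about {downtime_minutes_per_year(chain):.1f} minutes down per year"
    )
```

Each synchronous hop multiplies in another factor below 1, so expected downtime grows roughly in proportion to the length of the chain.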

10) Failing to create and incent a culture of excellence

Bring in the right people and hold them to high standards. You will never know what your team can do unless you find out how far they can go. Set aggressive yet achievable goals and motivate them with your vision. Understand that people make mistakes and that we will all ultimately fail somewhere, but expect that no failure will happen twice. If you do not expect excellence and lead by example, you will get less than excellence and you will fail in your mission of maximizing shareholder wealth. Read our article on being a leader.

11) Under-engineering for scale

The time to think about scale is when you are first developing your platform. If you did not do it then, the time to think about scaling for the future is right now. That is not to say that you have to implement everything on the day you launch, but that you should have thought about how it is that you are going to scale your application services and your database services. You should have made conscious decisions about tradeoffs between speed to market and scalability and you should have ensured that the code will not preclude any of the concepts we have discussed in our scalability postings. Hold quarterly scalability meetings where you discuss what you need to do to scale to 10x your current volume and create projects out of the action items. Approach your scale needs in evolutionary rather than revolutionary fashion as in #8 above.
