TrackVia’s Data Center Infrastructure Evolution: Part 3 of 3

TrackVia’s Evolution in our Data Center Infrastructure:
Launching a New Product in a Pure-Cloud Environment

When I kicked off this series about the evolution of TrackVia’s data center strategy, I mentioned some exciting news: the launch of our new product, TrackVia Express. This product represents not only a platform for dozens of new and exciting capabilities that we can put in our customers’ hands, helping them to solve their custom application needs, but also a key culmination of our data center strategy: TrackVia Express is our first purely cloud-based product. In this article I’ll cover some of the decisions behind that move, and the benefits TrackVia and our customers are seeing as a result.

Enhance our software development methodology

In building our newest product we made sure to fully commit to several best practices in our software development methodology. While there is a lot more to that topic than what I’ll cover here, some that relate directly to our use of cloud environments include:

  • Automated builds and deploys
  • Automated test suites

While both of these can be done without a cloud-based data center model, they are improved by the addition of a cloud platform. All of the following examples reflect real problems we experienced in the past, but no longer do:

  • Most production Web teams have seen what happens when you make a live change to some production configuration (usually in the middle of a firefight), and it isn’t reflected in your repository: when the next release comes around, that change is lost, and an old problem reappears. If you recreate new cloud server instances during your automated deploys, you maintain a much cleaner production environment.
  • By creating new instances regularly, you have a better chance of migrating to newer underlying hardware at your cloud service provider, thereby gaining some “free” performance upgrades.
  • Want to run a full, automated regression test suite against a particular branch of development, but have your QA environments tied up with other testing? Create a new set of server instances in a few minutes, run your tests, and then delete the instances when you’re done — much less time spent coordinating environments in the development process, much less opportunity for team members to step on each other and slow the process down.
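
The last point can be sketched in a few lines of Python. This is a minimal illustration rather than our actual tooling: `FakeCloud`, `ephemeral_environment`, and `run_regression_suite` are hypothetical stand-ins for whatever provisioning API and test runner you use. The idea it shows is that wrapping the environment in a context manager guarantees the instances are deleted even if the test run fails.

```python
from contextlib import contextmanager


class FakeCloud:
    """Hypothetical in-memory stand-in for a cloud provider's API."""

    def __init__(self):
        self.running = set()
        self._next_id = 0

    def create_instance(self, role):
        self._next_id += 1
        instance_id = f"{role}-{self._next_id}"
        self.running.add(instance_id)
        return instance_id

    def delete_instance(self, instance_id):
        self.running.discard(instance_id)


@contextmanager
def ephemeral_environment(cloud, roles):
    """Create one instance per role, and always delete them afterwards."""
    instances = []
    try:
        for role in roles:
            instances.append(cloud.create_instance(role))
        yield instances
    finally:
        # Runs on success and on failure, so nothing is left behind.
        for instance_id in instances:
            cloud.delete_instance(instance_id)


def run_regression_suite(instances):
    """Placeholder for a real test run against the given instances."""
    return len(instances) > 0


cloud = FakeCloud()
with ephemeral_environment(cloud, ["app", "db"]) as env:
    passed = run_regression_suite(env)

print(passed, len(cloud.running))
```

With a real provider, `create_instance` and `delete_instance` would wrap that provider's API calls; the cleanup logic stays the same.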

Handle traffic spikes better

Consumer-facing websites have long faced (and generally solved) the problem of sudden traffic spikes, typically driven by news events. There are two ways of planning for unexpected spikes in the use of your website or Web application: dedicate excess capacity beyond what you need on a normal basis, or rapidly add capacity as demand increases. The first is expensive, and the second is quite difficult outside of a cloud-based model. With nearly every common cloud service provider, however, the second is easy. As a result, we can plan for and handle traffic spikes and effectively manage our data center costs at the same time. As an aside, the traffic spike problem isn’t limited to consumer-facing products: our customers may choose at any time to create new applications and upload hundreds of thousands of records (that’s the beauty of our application platform), and we now handle that much more effectively than in the past.
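
To make the cost tradeoff between the two approaches concrete, here is a rough back-of-the-envelope comparison. Every number in it (the hourly price, the server counts, the hours spent at peak) is an assumption chosen for illustration, not a figure from our environment.

```python
def monthly_cost(servers_by_hour, price_per_server_hour):
    """Total cost given the number of servers running in each hour."""
    return sum(servers_by_hour) * price_per_server_hour


HOURS_IN_MONTH = 720      # assumption: 30-day month
PRICE = 0.10              # assumption: dollars per server-hour
BASELINE, PEAK = 4, 12    # assumption: servers needed normally / in a spike
SPIKE_HOURS = 20          # assumption: hours per month spent at peak

# Strategy 1: always run enough servers for the peak.
always_peak = [PEAK] * HOURS_IN_MONTH

# Strategy 2: run the baseline, and add servers only during spikes.
on_demand = [PEAK] * SPIKE_HOURS + [BASELINE] * (HOURS_IN_MONTH - SPIKE_HOURS)

print(f"dedicated excess capacity: ${monthly_cost(always_peak, PRICE):.2f}")
print(f"capacity added on demand:  ${monthly_cost(on_demand, PRICE):.2f}")
```

The gap between the two totals grows as spikes become rarer or taller, which is exactly when dedicating excess capacity hurts the most.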

Product offering flexibility

Note: We aren’t yet offering the features I’ll talk about in this section, but may in the future. Interested in one or more of them? Let us know in the comments or directly through sales or support!

Like most SaaS (Software as a Service) products, we run what’s called a multi-tenant environment, which means our customers use a shared operational environment instead of environments dedicated to each customer. This underlies one of the key benefits of nearly all SaaS products: cost efficiency. Our responsibility, which we take very seriously, is to insulate customers from each other in all respects. Most relevant to today’s discussion, that includes insulating customers from the performance impact of other customers’ activities. We aim to meet that responsibility every day, and our ability to handle varying levels of traffic automatically helps us do so.

Occasionally, however, certain customers may have specific requirements, such as a dedicated private environment, or (perhaps even temporarily) an extra-high performance environment. Adding such services to our product offering is now much easier: we would leverage the infrastructure we’ve built to stand up new environments with full automation and limited demands on our operations staff, so we could create these niche product offerings much more easily and cost-effectively.

Tools to make your team faster

Every engineering leader wants more from their team: more output, more efficiency, more quality, more of lots of things. Some of those gains will come from processes (such as the automation described above), some from making better tools available to your team. Today’s cloud service providers, such as Amazon Web Services (AWS), Rackspace Cloud, and several others, provide very rich environments with many tools that make your team more efficient.

Our latest product, TrackVia Express, operates within the AWS environment, which is particularly rich in tools for engineering and operations staff. Some of the tools we leverage are described here. There are many others; this is just a sampling.

CloudFormation

  • Using CloudFormation, we have fully automated the creation of any environment (development, QA, demo, production, or others). We automatically create a virtual private cloud, create and assign new load balancers, servers, and other resources to it, create security policies around it, deploy a version of our application to it, configure monitoring, and start up the applications. Time from nothing to a fully operating service: about 20 minutes, only 3-4 of which require attention from a team member.
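
For readers who haven’t seen CloudFormation, a template is a JSON (or YAML) document describing the resources to create. The sketch below builds a deliberately tiny one in Python: a VPC, one subnet, and a security group that allows inbound HTTPS. It is not our production template; the logical names (`AppVpc`, `AppSubnet`, `AppSecurityGroup`) and the CIDR ranges are assumptions chosen for illustration, and a real environment would add load balancers, servers, and monitoring.

```python
import json


def build_template(vpc_cidr="10.0.0.0/16", subnet_cidr="10.0.1.0/24"):
    """Return a minimal CloudFormation template as a Python dict."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative environment skeleton (hypothetical)",
        "Resources": {
            "AppVpc": {
                "Type": "AWS::EC2::VPC",
                "Properties": {"CidrBlock": vpc_cidr},
            },
            "AppSubnet": {
                "Type": "AWS::EC2::Subnet",
                "Properties": {
                    # Ref ties the subnet to the VPC declared above.
                    "VpcId": {"Ref": "AppVpc"},
                    "CidrBlock": subnet_cidr,
                },
            },
            "AppSecurityGroup": {
                "Type": "AWS::EC2::SecurityGroup",
                "Properties": {
                    "GroupDescription": "Allow inbound HTTPS only",
                    "VpcId": {"Ref": "AppVpc"},
                    "SecurityGroupIngress": [
                        {
                            "IpProtocol": "tcp",
                            "FromPort": 443,
                            "ToPort": 443,
                            "CidrIp": "0.0.0.0/0",
                        }
                    ],
                },
            },
        },
    }


template = build_template()
print(json.dumps(template, indent=2))
```

Because the template is just data, it can live in version control next to the application code, so every environment is created from the same reviewed definition.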

Auto-scaling

  • As part of our solution for handling the traffic spikes described above, we make use of auto-scaling groups. Need more app servers? They’ll automatically be added to handle the extra traffic. Need more data servers in your cluster? Add them automatically. Want to reduce costs during low usage periods? Automatically delete instances. (Note: as with many powerful tools, there are some dangers in fully automating your capacity management. You need to protect against automatically scaling down to too few instances, make sure new capacity ramps up in time to meet demand, and so on. You still need good engineering capacity planning, but you have some additional tools at your disposal now.)
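
The guardrails mentioned in that note can be sketched as a small function. This is not the AWS auto-scaling API, only an illustration of the decision logic under assumed values: `desired_capacity`, the 60% CPU target, and the minimum and maximum bounds are all hypothetical.

```python
import math


def desired_capacity(current, avg_cpu, target_cpu=60.0, minimum=2, maximum=20):
    """Suggest an instance count that brings average CPU near target_cpu.

    The result is clamped to [minimum, maximum] so automation can never
    scale down to too few instances or run away on the upside.
    """
    if current < 1:
        raise ValueError("current must be at least 1")
    wanted = math.ceil(current * avg_cpu / target_cpu)
    return max(minimum, min(maximum, wanted))


print(desired_capacity(4, 90.0))    # traffic spike: scale out
print(desired_capacity(4, 10.0))    # quiet period: scale in, but not below minimum
print(desired_capacity(10, 200.0))  # extreme input: capped at maximum
```

The clamp is the important part: the formula alone would happily suggest a single instance during a quiet period, which leaves no headroom when traffic returns.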

Downsides? Yes, there always are

Most advances with big advantages have some disadvantages. Our experience in building out a pure cloud-based model is no different. The big, overriding disadvantage we’ve seen is this: it is really easy to create as many cloud server instances as you need – for testing purposes, demo purposes, capacity purposes, etc. That sounds like an advantage, and it is, but the downside is that you’ll be paying for those instances as long as they’re active, even if you’re not using them. You (or your team) may forget about them until the bill comes due.

The ease with which new networks, new servers, and new storage can be provisioned needs to be paired with tools that monitor usage and kill off resources you no longer need.
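
As a rough idea of what such a tool might check, the sketch below flags instances that have sat idle too long. Everything here is hypothetical: the `Instance` record, the three-day idle limit, the protected `production` purpose, and the sample fleet are assumptions for illustration, and a real tool would pull this data from your provider’s API or billing reports.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Instance:
    instance_id: str
    purpose: str          # e.g. "production", "demo", "test"
    last_used: datetime


def find_stale(instances, now, max_idle=timedelta(days=3), protected=("production",)):
    """Return instances idle longer than max_idle, skipping protected purposes."""
    return [
        inst
        for inst in instances
        if inst.purpose not in protected and now - inst.last_used > max_idle
    ]


now = datetime(2024, 1, 10)  # arbitrary fixed date so the example is reproducible
fleet = [
    Instance("i-prod-1", "production", now - timedelta(days=30)),
    Instance("i-demo-1", "demo", now - timedelta(days=7)),
    Instance("i-test-1", "test", now - timedelta(hours=2)),
]

for inst in find_stale(fleet, now):
    print(f"candidate for deletion: {inst.instance_id}")
```

Reporting candidates first, and deleting only after someone confirms, is a safer starting point than fully automatic cleanup.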

Conclusions

Over the past few articles, I’ve attempted to summarize our evolution from managed, dedicated hardware to hybrid dedicated and cloud environments, and then to a fully cloud-based production environment. Along the way I hope to have highlighted some of the decisions and tradeoffs involved, and some pros and cons of the different approaches.

We have been very happy with the results we’ve seen throughout this evolution. Providing a highly available, 24/7 service to thousands of customers ranging from small to massive is always a challenge, but I believe our progress along this evolutionary path has resulted in a better and better service.