Amazon’s Aurora & TrackVia

EDITORIAL NOTE: TrackVia was recently invited by the good folks at Amazon to contribute an article about our experience transitioning to Amazon’s Aurora database engine to further increase performance and throughput. Below is the article. 

**************************

Business challenge:

TrackVia supports thousands of customers located all over the world. Many are headquartered in the United States but operate globally in countries such as China, Korea, and Australia. Moreover, many of our customers rely on TrackVia to manage critical business operations, so we need to provide maximum availability and reliability 24×7, 365 days a year. In 2015, we achieved 99.999% uptime. The environment runs 100% in AWS.

Why Aurora:

With the goal of delivering 5-nines of availability, we chose Aurora for its high availability, automatic handoffs, and near-instantaneous failover capabilities. Additional requirements that Aurora met included no global DDL locks against a server (a must for a multi-tenant environment) and data dictionary management at scale.

In addition to these requirements, Aurora provides several system scalability features and capabilities, such as:  

  • Automatic provisioning and growth of storage up to 64 TB per cluster
  • The ability to auto scale compute resources with sub-second response times
  • Automatic management of snapshot backups across availability zones
  • Load-balanced reads across all replicas in the cluster via the MariaDB Connector/J driver

We spent the last year trying, and ruling out, various other high-availability solutions. MySQL Cluster has a limit of 20,320 database objects per cluster; our clusters run close to 500,000 objects. Galera Cluster, at the time, had node-level DDL locks, and in a multi-tenant environment one customer can't block another from running DDL at the same time. MySQL Fabric had only recently been released at the time, so it was also ruled out as a solution.

Before Aurora, our database environment looked just like a typical MySQL installation: a master database with two replicas using standard MySQL replication. With the implementation of MySQL GTIDs (global transaction identifiers) and the use of MySQL Utilities such as mysqlrpladmin, failover was reduced to a single command. However, it still needed manual intervention to trigger the failover. We managed multiple clusters to minimize customer impact in the event of a master failure.

When Amazon announced Aurora during re:Invent 2014, our curiosity was piqued, and we eagerly participated in the early beta program. The first step was to move our QA environment to Aurora. We created a toolbox of automated processes to migrate customers (databases) from our existing environment to Aurora. The general process was: put a database into maintenance mode, dump it, scrub the dump, import it, and take the database out of maintenance mode. A single cluster typically houses 500-800 databases; during testing, I loaded a single Aurora cluster with up to 3,000 databases with no issues.
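For illustration, here is a minimal Java sketch of that per-database loop. The host names, database name, and the maintenance-mode stub are placeholders for this example, not our production tooling, and the single sed expression stands in for the several we actually run.

```java
// Hypothetical migration loop: dump from the old master, scrub, import into
// Aurora. Host/database names are placeholders; credentials are assumed to
// come from ~/.my.cnf on the machine running the migration.
import java.io.IOException;

public class MigrateOneDatabase {

    public static void main(String[] args) throws IOException, InterruptedException {
        String db     = "customer_db_42";                                       // placeholder
        String source = "old-master.internal";                                  // placeholder
        String target = "aurora-seg1.cluster-xxxx.us-west-2.rds.amazonaws.com"; // placeholder

        setMaintenanceMode(db, true);   // stop app traffic to this database

        // dump | scrub | import as one shell pipeline; the sed here is a
        // simplified stand-in for the full scrub described later in the article
        String pipeline = String.join(" ",
                "mysqldump --single-transaction --routines --triggers -h", source, db,
                "| sed -e 's/DEFINER=`[^`]*`@`[^`]*`//g'",
                "| mysql -h", target, db);
        Process p = new ProcessBuilder("sh", "-c", pipeline).inheritIO().start();
        if (p.waitFor() != 0) {
            throw new IOException("migration failed for " + db);
        }

        setMaintenanceMode(db, false);  // point the app at Aurora and resume
    }

    // Stub: in practice a flag in our global metadata store gates traffic.
    static void setMaintenanceMode(String db, boolean on) {
        System.out.println((on ? "entering" : "leaving") + " maintenance: " + db);
    }
}
```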

So what does our Aurora implementation look like?

We are running several Aurora clusters, each with one writer and two replicas on db.r3.4xlarge instances. Each cluster houses approximately 600 databases.

A view of our environment before Aurora, 100% managed by our in-house DevOps team:

[Diagram: the TrackVia database environment before Aurora]

Now, everything in blue is managed by Aurora/AWS: high availability with automatic handoffs, near-instantaneous failover, automated snapshot backups, and load-balanced reads across all nodes in a cluster.

[Diagram: the TrackVia database environment on Aurora]

The trick to getting load-balanced reads across all nodes in a cluster was replacing the regular MySQL Connector/J driver with the MariaDB Connector/J driver.
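Swapping drivers is mostly a connection-string change. Below is a minimal sketch, assuming a placeholder endpoint, schema, and credentials; in the Connector/J "aurora" failover mode, the driver discovers the replicas behind the cluster endpoint and sends read-only connections to them.

```java
// Minimal read/write example with MariaDB Connector/J's "aurora" mode.
// The endpoint, schema, and credentials below are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class AuroraReadBalancing {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mariadb:aurora://"
                + "aurora-seg1.cluster-xxxx.us-west-2.rds.amazonaws.com/appdb";

        try (Connection conn = DriverManager.getConnection(url, "app_user", "secret")) {
            conn.setReadOnly(true);   // aurora mode routes read-only work to a replica
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("SELECT @@innodb_read_only")) {
                rs.next();
                System.out.println("served by a replica? " + rs.getString(1));
            }
            conn.setReadOnly(false);  // writes go back to the cluster writer
        }
    }
}
```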

Inside the Global DB Layer is a centralized metadata store that the app servers use to configure the endpoints of the database clusters. The metadata tells each app server which cluster a particular customer's databases live on.

The metadata tables look like this:

cluster_endpoint table

ce_id  description       endpoint
1      Aurora segment 1  aurora-seg1-us-west-2a.cluster-aakdudyta.us-west-2.rds.amazonaws.com
2      Aurora segment 2  aurora-seg2-us-west-2a.cluster-aakdudyta.us-west-2.rds.amazonaws.com
3      Aurora segment 3  aurora-seg3-us-west-2a.cluster-aakdudyta.us-west-2.rds.amazonaws.com

instance_endpoint table

ie_id  ce_id  endpoint
1      1      aurora-seg1-us-west-2b.cluster-aakdudyta.us-west-2.rds.amazonaws.com
2      1      aurora-seg1-us-west-2c.cluster-aakdudyta.us-west-2.rds.amazonaws.com
3      2      aurora-seg2-us-west-2b.cluster-aakdudyta.us-west-2.rds.amazonaws.com
4      2      aurora-seg2-us-west-2c.cluster-aakdudyta.us-west-2.rds.amazonaws.com
5      3      aurora-seg3-us-west-2b.cluster-aakdudyta.us-west-2.rds.amazonaws.com
6      3      aurora-seg3-us-west-2c.cluster-aakdudyta.us-west-2.rds.amazonaws.com

Today, we can swap a cluster in and out of the environment with zero app reconfiguration (and zero downtime) by updating the cluster_endpoint table. The instance_endpoint table stores the endpoint of each replica; those endpoints are passed to the MariaDB Connector/J driver to distribute read traffic. When an app server fires up, it reads this information from the global database and configures the MariaDB Connector/J driver and its database connection pools.
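A minimal sketch of that startup step, assuming the table and column names shown above (the connection handling and pool wiring are illustrative, not our actual code):

```java
// Reads the global metadata store and builds an aurora-mode JDBC URL for one
// cluster segment. Table and column names match the examples above; everything
// else (credentials, pool setup) is assumed for illustration.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class ClusterEndpoints {

    /** Builds the JDBC URL for the cluster identified by ce_id. */
    static String urlForCluster(Connection globalDb, int ceId) throws SQLException {
        List<String> hosts = new ArrayList<>();

        // cluster endpoint first: the writer entry point for the segment
        try (PreparedStatement ps = globalDb.prepareStatement(
                "SELECT endpoint FROM cluster_endpoint WHERE ce_id = ?")) {
            ps.setInt(1, ceId);
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) hosts.add(rs.getString(1));
            }
        }

        // replica endpoints, handed to the driver for read load balancing
        try (PreparedStatement ps = globalDb.prepareStatement(
                "SELECT endpoint FROM instance_endpoint WHERE ce_id = ?")) {
            ps.setInt(1, ceId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) hosts.add(rs.getString(1));
            }
        }

        return "jdbc:mariadb:aurora://" + String.join(",", hosts) + "/";
    }
}
```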

Along the way, we learned a few other tricks that were necessary when migrating from a MySQL environment to Aurora. Users in Aurora, as in the rest of RDS, do not have SUPER privileges, so you have to scrub your dump files for portability between MySQL and Aurora:

  • Set log_bin_trust_function_creators = 1 in the cluster's DB parameter group. The default is 0; setting it to 1 allows users without SUPER privileges to create triggers and procedures.
  • I have several sed commands to remove things like DEFINER clauses on triggers and procedures, which throw errors in an RDS/Aurora environment (a Java equivalent is sketched after this list).
  • Check your socket timeout on the MariaDB Connector/J driver. The default is to time out connections on the replicas after 10 seconds.
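For the DEFINER cleanup, a hedged Java equivalent of those sed commands might look like the following; the single regex is a simplification of the several expressions we actually run, and the file names are placeholders.

```java
// Strips DEFINER=`user`@`host` clauses from a dump file so it will replay
// without SUPER privileges. One regex as a stand-in for several sed passes.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

public class DumpScrubber {
    public static void main(String[] args) throws IOException {
        Path in  = Paths.get("dump.sql");           // placeholder file names
        Path out = Paths.get("dump.scrubbed.sql");

        List<String> scrubbed = Files.readAllLines(in).stream()
                .map(line -> line.replaceAll("DEFINER=`[^`]+`@`[^`]+`", ""))
                .collect(Collectors.toList());

        Files.write(out, scrubbed);
    }
}
```

On the driver side, the replica socket timeout is a URL parameter; for example, appending ?socketTimeout=30000 to the JDBC URL raises it from the 10-second default.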

In the grand scheme of migrating from one environment to another, this was one of the easiest migrations, with the most benefits, that we've experienced. CloudWatch metrics provide detailed views into what is happening on each cluster, and alarms let us know when limits are being reached and when we need to bring a new cluster online.
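As one example of the alerting side, here is a minimal sketch using the AWS SDK for Java; the alarm name, threshold, cluster identifier, and SNS topic are placeholders, not our production configuration.

```java
// Creates a CloudWatch alarm on a cluster's connection count using the AWS
// SDK for Java. All names and thresholds below are illustrative placeholders.
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClient;
import com.amazonaws.services.cloudwatch.model.ComparisonOperator;
import com.amazonaws.services.cloudwatch.model.Dimension;
import com.amazonaws.services.cloudwatch.model.PutMetricAlarmRequest;
import com.amazonaws.services.cloudwatch.model.Statistic;

public class ClusterAlarm {
    public static void main(String[] args) {
        AmazonCloudWatchClient cw = new AmazonCloudWatchClient();
        cw.putMetricAlarm(new PutMetricAlarmRequest()
                .withAlarmName("aurora-seg1-connections-high")     // placeholder
                .withNamespace("AWS/RDS")
                .withMetricName("DatabaseConnections")
                .withDimensions(new Dimension()
                        .withName("DBClusterIdentifier")
                        .withValue("aurora-seg1"))                 // placeholder
                .withStatistic(Statistic.Average)
                .withPeriod(300)              // five-minute windows
                .withEvaluationPeriods(3)     // sustained for 15 minutes
                .withThreshold(4000.0)        // placeholder limit
                .withComparisonOperator(ComparisonOperator.GreaterThanThreshold)
                .withAlarmActions("arn:aws:sns:us-west-2:123456789012:ops-alerts"));
    }
}
```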

We are very pleased with the results thus far, and look forward to a bright future ahead with Aurora.
