Part-2: Creating a Highly Available Architecture
- Amit Dhanik

- Jun 22, 2021
- 11 min read
Updated: Jun 24, 2021
If you have not read Part-1, it is recommended that you read it first; you can find it here.
Here we will continue from NAT Gateways, where we last left off. Hope you enjoy reading!!

NAT Gateway - Network Address Translation
A NAT Gateway is a highly available gateway that allows instances in your private subnet to reach the internet without becoming public. NAT prevents hosts on the internet from initiating a connection with those instances. Well, why would we need this?
Instances in your private subnet might occasionally need OS upgrades, patches, and other outbound communication with the internet in general. For this purpose, NAT Gateways are absolutely essential.
Functioning of NAT Gateways
Let's take a simple example. Consider the router in your home. The router is assigned a public IP by the service provider, and this address can be seen by anyone. Each device in your home has a private IP address, which it uses to communicate with the other devices. Whenever a device wants to communicate with the outside world, the router handles it: it keeps track of which private IP address (your device) requested what traffic and makes sure the response packets are routed back to the right device. Similarly, NAT works in such a way that when traffic goes out to the internet, the source IPv4 address is replaced with the NAT device's address, and when the response traffic comes back, the NAT device translates the address back to those instances' private IPv4 addresses. Keep in mind that NAT Gateways only deal with IPv4 traffic; if you are dealing with IPv6 traffic, you must use an egress-only internet gateway. NAT Gateways support the UDP, TCP, and ICMP protocols.
IMPORTANT - You have to update your route table to route traffic from the private subnets to the NAT Gateway. In our setup this route lives in the Main route table, which is the one that does not have any IGW attached to it and holds the subnet associations of the private subnets. You can see this in the diagram discussed in Part-1. The Main route table and the Custom route table have different connections and hence serve different purposes.
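To make that concrete, here is a minimal sketch (using boto3, with placeholder IDs - your route table and NAT Gateway IDs will differ) of adding the 0.0.0.0/0 route that sends private-subnet traffic through the NAT Gateway:

```python
import boto3

ec2 = boto3.client("ec2")

# Placeholder IDs - replace with the IDs from your own VPC.
private_route_table_id = "rtb-0123456789abcdef0"  # route table associated with the private subnets
nat_gateway_id = "nat-0123456789abcdef0"          # NAT Gateway in the public subnet of the same A.Z

# Send all internet-bound traffic from the private subnets through the NAT Gateway.
ec2.create_route(
    RouteTableId=private_route_table_id,
    DestinationCidrBlock="0.0.0.0/0",
    NatGatewayId=nat_gateway_id,
)
```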

A NAT Gateway lives in a single subnet, and hence a single A.Z, and is associated with an Elastic IP address. In our architecture, we can see that we have NAT Gateways defined in both Availability Zones (in our public subnets). If you have resources defined in multiple A.Z.s and they all share one NAT Gateway, that is a sign of poor architecture: in the event of one A.Z going down due to some disaster, resources in the other A.Z's private subnet would also lose access to the internet. You should always try to create an A.Z-independent architecture because, in AWS, there is a golden saying - everything fails, all the time. Hence, we should create a NAT Gateway in each A.Z and configure routing so that resources keep functioning even in case of a disaster.
Now, if we look at our architecture, we have our VPC partially ready (the public subnet services are complete). It spans 2 A.Z.s, and each A.Z has a public and a private subnet associated with it for high availability. We also have our load balancers, IGW, and Route 53 enabled across our application. Now we just have to discuss the functionality of our private subnets and the applications associated with them.
PRIVATE SUBNET AND ITS FUNCTIONING
In our architecture, we have an internal load balancer routing all the traffic from the web servers to the application servers in the private subnets. The application servers in our private subnet are EC2 instances, and for them to receive traffic, they must be registered in the target groups of the internal load balancer. The load balancer also monitors the health of each registered target and, as soon as it detects an unhealthy target, stops routing to it. This you already know, so let's discuss our first service, ElastiCache.
ELASTICACHE
ElastiCache is a web service that improves the performance of web applications by letting you retrieve information from fast in-memory caches instead of relying on slower disk-based databases. For example, the top ten items on Amazon can be cached in memory rather than fetched from the database on every request.
ElastiCache is a very important service for read-heavy application workloads, as it can help improve latency significantly. Examples include social networking, gaming, media-sharing websites, etc. In-memory caching improves application performance by storing frequently accessed, critical pieces of data in memory for low-latency access.
ElastiCache is protocol-compliant with two open-source in-memory caching engines -
Memcached
Redis
We can see in our architecture that we are making use of the Redis caching engine. Redis stores data in RAM, and hence we get very high write and read speeds. Redis and Memcached both offer high performance, but Redis provides some extra features which many users find more attractive (again, it depends on the needs of your application). If you are working with big data, multi-threading might provide an advantage, and that is available only in Memcached. Snapshots, replication, advanced data structures, and transactions are some of the features where Redis has the upper hand.

Interesting Fact - Both Redis and Memcached can also be used as data storage systems and support data-type operations. Both are NoSQL in-memory data stores, with Redis supporting five data types (String, Hash, List, Set, and Sorted Set) while Memcached stores only simple string values.
So, in our architecture, we can see that both of our application servers (EC2 instances) first communicate with Redis, as this is a gaming application and players might request the same data again and again. Only the primary Redis node serves this traffic, while the second one is kept for high availability. Once we have enabled Multi-AZ, ElastiCache monitors the health and connectivity of the nodes. If the primary node fails, ElastiCache selects the read replica with the lowest replication lag and makes it the new primary. This is done automatically and requires no manual effort. The failover process is triggered when:
Loss of Availability in the primary's A.Z
Loss of Network connectivity to the primary
Failure of the primary.
ElastiCache also creates a new read replica on its own after it promotes a read replica to the primary node.
In case our application server does not find the query results in Redis (the cache expired), the request is redirected to the database servers (a cache miss), and the result is then cached in Redis. Here, our application is making use of Amazon Aurora. There are different caching strategies with which you can populate and maintain your cache in ElastiCache. Here is a nice article if you want to read about caching strategies.
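As a rough illustration of this cache-aside flow, here is a minimal sketch using the redis-py client; the endpoint, key names, and db_lookup function are assumptions for illustration, not part of the original architecture:

```python
import json
import redis  # redis-py client

# Hypothetical ElastiCache Redis endpoint - replace with your cluster's primary endpoint.
cache = redis.Redis(host="my-redis.example.cache.amazonaws.com", port=6379)

def get_player_profile(player_id, db_lookup, ttl_seconds=300):
    """Cache-aside: try Redis first, fall back to the database on a miss."""
    key = f"player:{player_id}"
    cached = cache.get(key)
    if cached is not None:                 # cache hit - served from memory
        return json.loads(cached)

    profile = db_lookup(player_id)         # cache miss - query Aurora (or any DB)
    cache.setex(key, ttl_seconds, json.dumps(profile))  # populate the cache with a TTL
    return profile
```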
So, a quick question. Your DB is overloaded. What two steps can you take to improve the performance of your DB?
Use ElastiCache
Add a Read Replica (redirect read queries to the Read Replica - discussed later)
AMAZON AURORA
In our application architecture, if Redis does not have the response for the request made by the application servers, the request is sent to our databases. We have Aurora in both Availability Zones, with Read Replicas also present in each.
Amazon Aurora is a MySQL- and PostgreSQL-compatible relational database engine that combines the speed and availability of high-end commercial databases with the cost-effectiveness of open-source databases. It is fully managed by RDS, which helps in automating patching and backups. There are some important points regarding Aurora which I must mention -
Aurora is designed to handle the loss of up to 2 copies of data without affecting the DB's write availability and up to 3 copies without affecting read availability (6 copies of data in total).
Aurora storage is self-healing: data blocks and disks are continuously scanned for errors and repaired automatically. Such a cool feature !!
Aurora delivers up to five times the throughput of standard MySQL databases and up to three times the throughput of standard PostgreSQL databases.
Aurora stores copies of the data in the DB cluster across multiple A.Z.s regardless of whether the instances in the DB cluster span multiple A.Z.s.
2 copies of your data are kept in each A.Z, across a minimum of 3 A.Z.s. Hence, when data is written to the primary DB instance, Aurora synchronously replicates it to the six storage nodes associated with your cluster volume - a total of 6 copies of your data!!! This makes ours a highly available architecture, as it helps protect the DB against the loss of an A.Z and other failures. You should always distribute the primary instance and reader instances in your DB cluster over multiple Availability Zones to improve the availability of your DB cluster, as shown in our architecture diagram.
Aurora automatically fails over to a new instance if the primary instance fails, by promoting an existing Read Replica to be the new primary. Hence, to increase the availability of our DB cluster, we should always create one or more Aurora Replicas in two or more different Availability Zones, as shown in the architecture (a small provisioning sketch follows this list).
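For illustration, here is a minimal, hedged sketch (boto3; all identifiers, the instance class, and the AZs are assumptions, and networking/security settings are omitted) of creating an Aurora MySQL cluster with instances placed in different Availability Zones:

```python
import boto3

rds = boto3.client("rds")

# Cluster-level settings (identifiers and credentials are illustrative only).
rds.create_db_cluster(
    DBClusterIdentifier="game-aurora-cluster",
    Engine="aurora-mysql",
    MasterUsername="admin",
    MasterUserPassword="change-me-use-secrets-manager",
)

# One instance per A.Z; Aurora designates one as the writer and the others as readers.
for identifier, az in [("game-aurora-1", "us-east-1a"), ("game-aurora-2", "us-east-1b")]:
    rds.create_db_instance(
        DBInstanceIdentifier=identifier,
        DBClusterIdentifier="game-aurora-cluster",
        DBInstanceClass="db.r5.large",
        Engine="aurora-mysql",
        AvailabilityZone=az,
    )
```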
A good excerpt is also given on the AWS site for a better understanding of why Multi-AZ is important. Below I have taken a snippet from that article.
Suppose that the primary instance in your cluster is unavailable because of an outage that affects an entire AZ. In this case, the way to bring a new primary instance online depends on whether your cluster uses a multi-AZ configuration. If the cluster contains any reader instances in other AZs, Aurora uses the failover mechanism to promote one of those reader instances to be the new primary instance. If your provisioned cluster only contains a single DB instance, or if the primary instance and all reader instances are in the same AZ, you must manually create one or more new DB instances in another AZ. If your cluster uses Aurora Serverless, Aurora automatically creates a new DB instance in another AZ. However, this process involves a host replacement and thus takes longer than a failover.
(Taken from AWS Aurora page https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/CHAP_AuroraOverview.html)
Read Replicas
Read Replicas are primarily used for read-heavy database workloads. They help increase performance by creating a read-only copy of your production DB. This is achieved using asynchronous replication from the primary RDS instance to the Read Replicas; read queries from our applications can then be routed directly to a Read Replica, reducing the load on the primary DB instance. Hence, Read Replicas are advised for read-heavy databases (a small example of creating one programmatically is sketched after the notes below).
Note - A Read Replica operates as a DB instance that allows only read-only connections, no writes. RDS uses asynchronous replication and updates the Read Replica whenever there is a change to the primary database.

Some important points to keep in mind for Read Replicas are -
Read Replicas are supported for MariaDB, Microsoft SQL Server, MySQL, Oracle, and PostgreSQL and are primarily used for scaling, not for disaster recovery. Although we can promote a Read Replica to a standalone instance if the primary DB instance fails, Read Replicas are mainly used for improving the performance of our DB instance.
You must have automated backups turned on for your database instance in order to deploy a Read Replica.
You can also create Read Replicas of Read Replicas.
Each Read Replica has its own DNS endpoint.
Read Replicas can be promoted to be their own DB. This breaks the replication, and the Read Replica becomes a standalone DB instance.
You can have Read Replicas in multiple A.Z.s, in the same Region, and in different Regions as well.
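As promised above, here is a minimal sketch (boto3; the identifiers and the A.Z are placeholders) of creating a Read Replica in a different A.Z and looking up its own DNS endpoint:

```python
import boto3

rds = boto3.client("rds")

# Identifiers and the A.Z are placeholders; automated backups must be enabled on the source.
rds.create_db_instance_read_replica(
    DBInstanceIdentifier="mydb-read-replica-1",
    SourceDBInstanceIdentifier="mydb-primary",
    AvailabilityZone="us-east-1b",   # place the replica in a different A.Z than the primary
)

# Once the replica is available, it gets its own DNS endpoint for read-only queries.
replica = rds.describe_db_instances(DBInstanceIdentifier="mydb-read-replica-1")
endpoint = replica["DBInstances"][0].get("Endpoint", {}).get("Address")
print(endpoint)
```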
RDS supports other DB engines as well. It all depends on what your application needs: you can use MySQL, MariaDB, PostgreSQL, Oracle, or Microsoft SQL Server DB engines.
DynamoDB
DynamoDB is a fully managed NoSQL database service that supports key-value and document data models and is suited for applications that need consistent, single-digit-millisecond latency at any scale. It provides encryption at rest by encrypting all of our data using encryption keys stored in AWS KMS. "AWS owned CMK" is the default encryption type, in which the key is owned by DynamoDB. You can also create and manage your own keys using a "Customer managed CMK".
Note - DynamoDB is suitable for OLTP (Online Transaction Processing) workloads and is not suitable for OLAP (Online Analytical Processing) implementations. For OLAP you should use a relational database or a data warehouse such as Amazon Redshift.
DynamoDB works on four basic components (a small usage sketch follows this list) -
Tables - collections of data
Items - groups of attributes that are uniquely identifiable
Attributes - each item is composed of one or more attributes. For example, an item in a People table contains attributes like first_name and last_name
Primary Key - each item has a unique identifier that is specific to that item only. With the help of the primary key, we can easily distinguish/uniquely identify each item in a table. DynamoDB supports two kinds of primary keys - a simple primary key (partition key only) and a composite primary key (partition key plus sort key). A composite primary key gives you additional flexibility when querying data.
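Here is the small usage sketch mentioned above - a hypothetical Players table with a composite primary key (player_id as partition key, game_id as sort key), written and read with boto3; the table and attribute names are assumptions for illustration:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Players")  # hypothetical table with a composite primary key

# Item identified by the partition key (player_id) plus the sort key (game_id).
table.put_item(Item={
    "player_id": "p-123",   # partition key
    "game_id": "g-456",     # sort key
    "first_name": "Amit",
    "high_score": 9001,
})

# Fetching the same item back requires the full primary key.
response = table.get_item(Key={"player_id": "p-123", "game_id": "g-456"})
item = response.get("Item")
```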
DynamoDB has another cool feature known as Streams, which is used to capture data-modification events in DynamoDB tables. For example, if you add a new item to the table, update an existing item, or delete an item, the stream captures an image of the entire item (including all of its attributes). These stream records have a lifetime of 24 hours and are automatically removed thereafter. Here is a real-life example of how DynamoDB is being utilized at DAZN Media.
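If you want to try Streams out, a minimal sketch (again assuming the hypothetical Players table) of turning them on for an existing table might look like this:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Capture the full before/after image of every insert, update, and delete.
dynamodb.update_table(
    TableName="Players",
    StreamSpecification={
        "StreamEnabled": True,
        "StreamViewType": "NEW_AND_OLD_IMAGES",
    },
)
```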
DynamoDB - How is High Availability achieved?
DynamoDB is a regional service (as you can see in the architecture diagram, it is placed outside of the VPC). All of our data is stored on solid-state drives (SSDs) and is automatically replicated across multiple A.Z.s, which is what gives us high availability. But since it is a regional service, you might wonder how data is replicated across multiple A.Z.s. DynamoDB is designed to automatically partition data and incoming traffic across multiple partitions, and these partitions are stored on numerous backend servers distributed across three Availability Zones within a single Region.

CROSS-REGION REPLICATION
With Global Tables, we can deploy our application across multiple Regions and keep the tables in sync across AWS Regions. This way, if a Region goes down due to some mishap, we can still have our services functioning. It also helps reduce latency, as users can be served data directly from the geographically closest table replica. A write performed on any one of the configured global tables is replicated to all replica tables of the same name.
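For illustration, here is a hedged sketch of wiring two Regions together with the original (2017.11.29) Global Tables API via boto3; it assumes a table named Players already exists, is empty, and has Streams (NEW_AND_OLD_IMAGES) enabled in both Regions:

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Link the existing per-Region tables into one Global Table; writes to either
# replica are propagated to the other.
dynamodb.create_global_table(
    GlobalTableName="Players",
    ReplicationGroup=[
        {"RegionName": "us-east-1"},
        {"RegionName": "eu-west-1"},
    ],
)
```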
How does one access Dynamo DB if it is not present inside of our VPC?
In our architecture diagram, we can see that the NAT Gateways are being used to communicate with DynamoDB. Though they can be used, the most common way to access DynamoDB from a VPC is with VPC endpoints. Well, why were VPC endpoints needed? Why not just use NAT Gateways? Can we not connect to DynamoDB directly over the internet?
Yes, as a matter of fact, you can communicate with DynamoDB over the internet (via the IGW) using HTTPS. But, as it turns out, many corporations don't want their data sent and received over the public internet. Well, then you might say that you can make use of a VPN to route all of the network traffic. This is also possible, but with the constraint that latency might become an issue. We have one more option - NAT Gateways. We can use them, and they have none of the concerns mentioned above, except that we incur data-processing charges on the NAT Gateway when accessing DynamoDB.
So, VPC endpoints were introduced. Using a VPC endpoint, EC2 instances in our VPC can access DynamoDB using only their private IPs, without having to traverse the public internet. There is no need for an IGW or VPN, as the traffic never leaves the Amazon network. Also, there are no charges for using Gateway VPC endpoints. In our architecture, for some reason or another, NAT Gateways are being used, but if you don't want to pay a hefty price, you should go with endpoints.
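A minimal sketch of creating such a Gateway endpoint for DynamoDB with boto3 (the VPC ID, route table ID, and Region are placeholders) might look like this:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Placeholder IDs. A Gateway endpoint installs a route for DynamoDB's prefix list
# into the chosen route tables, so private-subnet traffic never leaves the AWS network.
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.dynamodb",
    RouteTableIds=["rtb-0123456789abcdef0"],  # route tables used by the private subnets
)
```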

Additional Tip -
If you are confused about whether or not you should use DynamoDB in your architecture, here is an interesting article which you should go through.
RDS or DynamoDB - which one to choose?
I had this question on my mind initially, and I found the following answer on Stack Overflow. You decide which one suits your application's needs better.
In short, if you have mainly Lookup queries (and not Join queries), DynamoDB (and other NoSQL DB) is better. If you need to handle a lot of data, you will be limited when using MySQL (and other RDBMS).
You can't reuse your MySQL queries nor your data schema, but if you spend the effort to learn NoSQL, you will add an important tool to your toolbox. There are many cases where DynamoDB is giving the simplest solution.
Phew!! That was a lot for this part. I hope I added some valuable knowledge. Thanks for reading till here. We will next be discussing Amazon Storage Services in the upcoming Part-3 of the blog. Till then, stay tuned!!
I will be releasing the last and final part soon!!
I hope you all found the post useful and start using different AWS services as well. If you have any queries, you can always reach out to me. Feel free to provide your feedback in the comment section. Leave a like if you enjoyed reading. Thanks !!
Connect with me on LinkedIn - Amit Dhanik.
Credits - This post was successful because of the following people - ACloudguru, AWS resources, pythoholic, and many others.





