About the Client

Goibibo is one of the largest online hotel and air-travel booking websites in India. It is
also the number one ranked mobile app under the Travel category. In 2015, their hotel-booking volumes
grew 5x YoY, with 70% of the bookings coming from their mobile app.

Business Problem:

Goibibo was setting up a disaster recovery (DR) environment for their production
workload on the AWS cloud in a geographically isolated region. Since their primary environment was already on AWS,
the intention was to keep as many data sources as possible in sync and available at the DR location to handle
traffic in case of a disaster.

The client maintains a set of Amazon ElastiCache clusters that need to be available in the DR region in
the event of a disaster. At the time of writing, Amazon ElastiCache did not support multi-region replicas, so the approach had to be
a traditional snapshot and restore at scheduled intervals in the DR region. The client was also maintaining a
set of Amazon ElastiCache Memcached instances, which had to be recreated at the click of a button.


We are using AWS Lambda, Amazon S3, and AWS CloudFormation to automate the Disaster Recovery (DR) setup.


Let us walk through the details of each component:

  • Amazon ElastiCache: It’s an in-memory storage layer that sits between the application code and the database and supports the Memcached and Redis engines. It helps speed up access to data that would otherwise be read from the database.
    • Memcached: It’s an open-source, high-performance, distributed memory object caching system developed in 2003 with the initial goal of speeding up dynamic web applications by alleviating database load.
    • Redis: It’s an open-source, in-memory data structure store launched in 2009. It can be used as a cache, a message broker, and a database, with built-in replication, atomic operation support, various levels of on-disk persistence, and high availability via Redis Cluster.
  • Amazon Simple Storage Service (S3): It's an object storage web service offered by AWS. Keeping backups in Amazon S3 helps minimize the impact of an outage. It stores data as objects within resources called "buckets". A user can store as many objects as they want within a bucket, and write, read, and delete objects in their bucket. Objects can be up to 5 terabytes in size.
  • AWS CloudFormation: It gives developers and systems administrators an easy way to create and manage a collection of related AWS resources, provisioning and updating them in an orderly and predictable fashion.
  • AWS Lambda: AWS Lambda was introduced by AWS in 2014. It helps in building small, on-demand applications that run in response to events. It lets developers write functions on a pay-per-use basis without having to provision storage or compute resources themselves. As of 2017, it supported Python, Node.js, C#, and Java. One can also automate AMI backups and cleanups (serverless) using AWS Lambda.
    The high-level representation of the workflow is shown below:

Amazon ElastiCache Redis DR:

  • Redis supports persisting data to disk. Hence, we can take a data backup and restore it to a new cluster. The components shown in the above diagram are discussed as follows:
AWS Lambda:
  • We have Amazon ElastiCache (Redis) production clusters configured in the Mumbai region.
  • We have two AWS Lambda Python scripts. The first creates a snapshot of every Redis cluster, with the current date as a suffix in the snapshot name. The second describes the Redis clusters, iterates through each cluster, describes the snapshots associated with it, and copies the snapshots whose names contain the current date to the Amazon S3 bucket (ec-backup-mumbai) in the same region.
  • After that, it describes all the snapshots associated with each cluster, compares the date in each snapshot name against a retention period expressed as a time delta in days, retains the snapshots that fall within that period, and deletes the older ones.
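
The two scripts are not reproduced here, but their logic can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the original code: `client` is assumed to be a boto3 ElastiCache client (for example `boto3.client("elasticache")`), snapshot names are assumed to end in a dd-mm-yyyy date, and the function names and the `retention_days` parameter are hypothetical.

```python
import re
from datetime import date, datetime, timedelta

DATE_FORMAT = "%d-%m-%Y"  # assumed from the "18-04-2017" style used in this post
DATE_SUFFIX = re.compile(r"-(\d{2}-\d{2}-\d{4})$")


def snapshot_name(cluster_id, day):
    """Build a snapshot name suffixed with the given date."""
    return f"{cluster_id}-{day.strftime(DATE_FORMAT)}"


def snapshot_date(name):
    """Return the date encoded at the end of a snapshot name, or None."""
    match = DATE_SUFFIX.search(name)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), DATE_FORMAT).date()
    except ValueError:
        return None


def create_daily_snapshot(client, cluster_id, today):
    """First script: start a snapshot named after the current date."""
    name = snapshot_name(cluster_id, today)
    client.create_snapshot(CacheClusterId=cluster_id, SnapshotName=name)
    return name


def export_and_prune(client, cluster_id, bucket, today, retention_days):
    """Second script: copy today's snapshot to S3, delete expired ones.

    Pagination of describe_snapshots is ignored to keep the sketch short.
    """
    cutoff = today - timedelta(days=retention_days)
    response = client.describe_snapshots(CacheClusterId=cluster_id)
    exported, deleted = [], []
    for snap in response["Snapshots"]:
        name = snap["SnapshotName"]
        taken = snapshot_date(name)
        if taken is None:
            continue  # not one of our dated snapshots (e.g. automatic ones)
        if taken == today and snap.get("SnapshotStatus") == "available":
            # Passing TargetBucket exports the snapshot to S3.
            client.copy_snapshot(
                SourceSnapshotName=name,
                TargetSnapshotName=name,
                TargetBucket=bucket,
            )
            exported.append(name)
        elif taken < cutoff:
            client.delete_snapshot(SnapshotName=name)
            deleted.append(name)
    return exported, deleted
```

Because snapshot creation is asynchronous, the export step only copies snapshots whose status is already `available`, which is one reason to keep the two steps in separate, separately scheduled functions.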

Amazon S3:

  • We have versioning enabled on the source-region Amazon S3 bucket (ec-backup-mumbai), and we have enabled cross-region replication (CRR) to the DR-region Amazon S3 bucket (ec-backup-singapore) to sync the snapshots.
  • We have created an Amazon S3 lifecycle rule on both the prod and DR Amazon S3 buckets to delete objects older than 7 days in order to save on storage cost.
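
A 7-day expiry rule of this kind could be expressed as in the sketch below. It is an illustration under assumptions rather than the rule actually deployed: `s3_client` is assumed to be a boto3 S3 client for the bucket's region, the rule ID and function names are hypothetical, and the noncurrent-version expiry is added because versioned buckets otherwise keep old versions after the current one expires.

```python
def expiry_lifecycle(days=7):
    """Build a lifecycle configuration that expires objects after `days`."""
    return {
        "Rules": [
            {
                "ID": f"expire-after-{days}-days",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix: apply to every object
                "Expiration": {"Days": days},
                # On a versioned bucket, also clean up superseded versions.
                "NoncurrentVersionExpiration": {"NoncurrentDays": days},
            }
        ]
    }


def apply_lifecycle(s3_client, bucket, days=7):
    """Attach the expiry rule to one bucket."""
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=expiry_lifecycle(days),
    )
```

The same function would be called once for ec-backup-mumbai and once for ec-backup-singapore, each with a client for the matching region.
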
AWS CloudFormation:
  • We have defined two AWS CloudFormation templates for the Redis clusters, with all the production configurations:
  • One for clusters with a read replica, as shown below:

"Resources": {
    "replicaredis1": {
        "Type": "AWS::ElastiCache::ReplicationGroup",
        "Properties": {
            "ReplicationGroupId": "dr-replicaredis1",
            "ReplicationGroupDescription": "For replicaredis1 Prod Web Caching",
            "AutoMinorVersionUpgrade": "true",
            "CacheParameterGroupName": "dr-redis-pg",
            "CacheSubnetGroupName": "ec-subnet",
            "AutomaticFailoverEnabled": "true",
            "ReplicasPerNodeGroup": 1,
            "PreferredCacheClusterAZs": [ "ap-southeast-1a", "ap-southeast-1b" ],
            "CacheNodeType": "cache.r3.large",
            "SecurityGroupIds": [ "sg-8XXXXX" ],
            "Engine": "redis",
            "EngineVersion": "2.8.24",
            "Port": "6379",
            "PreferredMaintenanceWindow": "sun:01:00-sun:03:00",
            "NotificationTopicArn": "arn:aws:sns:ap-southeast-1:XXXXX:sns-notif",
            "SnapshotRetentionLimit": "1",
            "SnapshotWindow": "23:00-01:00",
            "SnapshotArns": [
                {
                    "Fn::Sub": [
                        "arn:aws:s3:::ec-backup-singapore/replicaredis1-001-${snapshot_date}-0001.rdb",
                        { "snapshot_date": { "Ref": "SnapshotDate" } }
                    ]
                }
            ]
        }
    }

...so on

  • One for clusters without a read replica, as shown below:

"Resources": {
    "noreplicaredis1": {
        "Type": "AWS::ElastiCache::CacheCluster",
        "Properties": {
            "ClusterName": "dr-noreplicaredis1",
            "AutoMinorVersionUpgrade": "true",
            "CacheParameterGroupName": "default.redis3.2",
            "CacheSubnetGroupName": "ec-subnet",
            "AZMode": "single-az",
            "PreferredAvailabilityZones": [ "ap-southeast-1a" ],
            "NumCacheNodes": 1,
            "CacheNodeType": "cache.r3.large",
            "VpcSecurityGroupIds": [ "sg-dXXXX" ],
            "Engine": "redis",
            "EngineVersion": "3.2.4",
            "Port": "6379",
            "PreferredMaintenanceWindow": "sat:21:30-sat:22:30",
            "NotificationTopicArn": "arn:aws:sns:ap-south-1:XXXXXXX:sns-notif",
            "SnapshotArns": [
                {
                    "Fn::Sub": [
                        "arn:aws:s3:::ec-backup-singapore/noreplicaredis1-${snapshot_date}-0001.rdb",
                        { "snapshot_date": { "Ref": "SnapshotDate" } }
                    ]
                }
            ]
        }
    }

...so on

  • While launching the AWS CloudFormation stack, we need to specify the latest snapshot date available in the DR Amazon S3 bucket:

"Parameters": {
    "SnapshotDate": {
        "Description": "Default it is \"18-04-2017\". Specify the latest date of ElastiCacheRedis Snapshot being created to restore from S3 bucket \"ec-backup-singapore\". EXAMPLE:Specify \"19-04-2017\" then it will take the snapshot \"noreplicaredis1-001-19-04-2017-0001.rdb\" to restore",
        "Type": "String",
        "Default": "18-04-2017",
        "AllowedPattern": "^\\d{2}-\\d{2}-\\d{4}$",
        "ConstraintDescription": "Date of elasticacheredis snapshot creation to restore"
    }
}

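Picking the value for SnapshotDate can itself be scripted. The sketch below is a hypothetical helper, not part of the original setup: it assumes the object keys in the DR bucket follow the `<name>-dd-mm-yyyy-0001.rdb` pattern used in the SnapshotArns entries, and that the list of keys has already been fetched (for instance with an S3 list call). The function names are made up for illustration.

```python
import re
from datetime import datetime

DATE_FORMAT = "%d-%m-%Y"
KEY_PATTERN = re.compile(r"-(\d{2}-\d{2}-\d{4})-0001\.rdb$")


def latest_snapshot_date(keys):
    """Return the most recent dd-mm-yyyy date found among snapshot keys."""
    dates = []
    for key in keys:
        match = KEY_PATTERN.search(key)
        if match:
            # Parse to a real date: dd-mm-yyyy strings do not sort
            # chronologically when compared as text.
            dates.append(datetime.strptime(match.group(1), DATE_FORMAT))
    if not dates:
        raise ValueError("no snapshot objects found")
    return max(dates).strftime(DATE_FORMAT)


def stack_parameters(snapshot_date):
    """Build the CloudFormation Parameters list for the SnapshotDate value."""
    if not re.fullmatch(r"\d{2}-\d{2}-\d{4}", snapshot_date):
        raise ValueError(f"unexpected date format: {snapshot_date!r}")
    return [{"ParameterKey": "SnapshotDate", "ParameterValue": snapshot_date}]


def snapshot_arn(bucket, prefix, snapshot_date):
    """Mirror what the Fn::Sub expression in the template resolves to."""
    return f"arn:aws:s3:::{bucket}/{prefix}-{snapshot_date}-0001.rdb"
```

For example, `snapshot_arn("ec-backup-singapore", "replicaredis1-001", "19-04-2017")` yields the ARN that the read-replica template would restore from when SnapshotDate is set to 19-04-2017.
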
Amazon ElastiCache Memcached DR:

  • Memcached doesn’t support persisting data to disk. Hence, we need to create a new cluster from AWS CloudFormation templates that match the production cluster configurations.

"Resources": {
    "estreamingcache": {
        "Type": "AWS::ElastiCache::CacheCluster",
        "Properties": {
            "ClusterName": "dr-memcache1",
            "AutoMinorVersionUpgrade": "true",
            "CacheParameterGroupName": "default.memcached1.4",
            "CacheSubnetGroupName": "ec-subnet",
            "AZMode": "single-az",
            "PreferredAvailabilityZones": [ "ap-southeast-1a" ],
            "NumCacheNodes": 1,
            "CacheNodeType": "cache.t2.micro",
            "VpcSecurityGroupIds": [ "sg-15XXXXXX" ],
            "Engine": "memcached",
            "EngineVersion": "1.4.24",
            "Port": "11211",
            "PreferredMaintenanceWindow": "mon:00:00-mon:01:00",
            "NotificationTopicArn": "arn:aws:sns:ap-southeast-1:XXXXXXX:sns-notif"
        }
    }

...so on
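
Launching such a template "at the click of a button" might look like the sketch below. It is only an illustration: `cfn_client` is assumed to be a boto3 CloudFormation client created for the DR region, and the function name, stack name, and template file name in the usage note are hypothetical.

```python
def launch_dr_stack(cfn_client, stack_name, template_body, parameters=None, wait=False):
    """Create a CloudFormation stack from a template and return its stack ID."""
    response = cfn_client.create_stack(
        StackName=stack_name,
        TemplateBody=template_body,
        Parameters=parameters or [],
    )
    if wait:
        # Block until CloudFormation reports the stack as created.
        cfn_client.get_waiter("stack_create_complete").wait(StackName=stack_name)
    return response["StackId"]
```

A call could then be as simple as `launch_dr_stack(cfn_client, "dr-memcached", open("memcached.json").read())`. The Memcached template takes no parameters, whereas a Redis template would also receive a SnapshotDate entry in `parameters`.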

  • Important Note:
    • The Amazon VPC, subnet groups, parameter groups, and security groups should be recreated in the DR region to match production.
    • The AWS CloudFormation templates should be kept up to date with the production cluster configurations.
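
One way to notice when a template has fallen behind production is to compare a few properties periodically. The sketch below is an assumption-laden illustration, not something the original setup is known to include: `prod_cluster` is assumed to be one entry from the `CacheClusters` list returned by a boto3 `describe_cache_clusters` call, `template_properties` is the `Properties` object of the matching template resource, and the choice of compared keys is hypothetical.

```python
COMPARED_KEYS = ("Engine", "EngineVersion", "CacheNodeType", "NumCacheNodes")


def config_drift(prod_cluster, template_properties, keys=COMPARED_KEYS):
    """Return {key: (production value, template value)} for mismatched keys."""
    drift = {}
    for key in keys:
        prod_value = prod_cluster.get(key)
        template_value = template_properties.get(key)
        # Compare as strings, since templates often quote numeric values.
        if str(prod_value) != str(template_value):
            drift[key] = (prod_value, template_value)
    return drift
```

An empty result means the compared properties agree. Anything else lists what to update in the template before the next DR drill.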

Nagarjuna D. N, DevOps Engineer

Nagarjuna D N is AWS SysOps certified, with 3+ years of experience in IT infrastructure, and currently works as a DevOps Engineer at BluePi.
Key areas of interest include cloud computing, databases, open-source technologies, Infrastructure as Code, data center migrations, and server automation.
