
AWS Auto Scaling Instances Based on RabbitMQ Custom Metrics

In this article, we will go over how to auto-scale AWS instances in and out based on RabbitMQ metrics, using Terraform.

Prerequisite:

In our domain-driven design architecture, we use RabbitMQ to publish messages.
In RabbitMQ, each subscriber is a queue bound to different events from different domains.

We have RabbitMQ and workers deployed across multiple EC2 instances.
We wanted to implement auto-scaling that adds machines to burn down queues holding many messages. This is critical for the business: it achieves fast eventual consistency and reduces the operational load on RabbitMQ.

Problem:

Handling an event can range from quick and light to a CPU-intensive operation that takes several minutes. AWS auto-scaling groups make it simple to scale based on CloudWatch metrics.

However, none of the default metrics indicate when additional worker nodes are needed, so we had nothing to drive scaling the workers in and out.

Solution:

To scale in/out our worker machines, we created custom CloudWatch metrics and alerts based on them.

Please follow this article for that part: Ingesting and monitoring custom metrics in CloudWatch with AWS Lambda.
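If you keep that metric-publishing Lambda in the same Terraform codebase, it can be triggered on a schedule with EventBridge. The sketch below assumes a Lambda resource named `aws_lambda_function.queue_metrics` and a one-minute schedule; adjust both to match your setup from the linked article:

```hcl
# Hypothetical names: align with the Lambda from the linked article.
resource "aws_cloudwatch_event_rule" "publish_queue_metrics" {
  name                = "publish-queue-metrics"
  schedule_expression = "rate(1 minute)"
}

resource "aws_cloudwatch_event_target" "publish_queue_metrics" {
  rule = aws_cloudwatch_event_rule.publish_queue_metrics.name
  arn  = aws_lambda_function.queue_metrics.arn
}

# EventBridge needs explicit permission to invoke the function.
resource "aws_lambda_permission" "allow_eventbridge" {
  statement_id  = "AllowExecutionFromEventBridge"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.queue_metrics.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.publish_queue_metrics.arn
}
```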

The final step is to configure the auto-scaling group to scale in and out based on this metric using Terraform.

The launch template looks like this:

resource "aws_launch_template" "launch_template" {
  name_prefix = "worker-node"
  image_id    = var.ami_id

  iam_instance_profile {
    arn = aws_iam_instance_profile.instance_profile.arn
  }
  monitoring {
    enabled = true
  }
  instance_type                        = var.instance_type
  instance_initiated_shutdown_behavior = "terminate"
  key_name                             = var.key_name
  vpc_security_group_ids               = [aws_security_group.instance.id]

  tag_specifications {
    resource_type = "instance"

    tags = {
      Name = "Worker-server"
    }
  }
  lifecycle {
    create_before_destroy = true
  }
}
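The template above assumes the worker process starts automatically from the AMI. If it does not, a `user_data` script can bootstrap it at launch; this is a sketch, and the `worker` service name below is a placeholder:

```hcl
  # Add inside the aws_launch_template block.
  # "worker" is a placeholder systemd unit name.
  user_data = base64encode(<<-EOT
    #!/bin/bash
    systemctl start worker
  EOT
  )
```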

Create the auto-scaling group as:

resource "aws_autoscaling_group" "asg" {
  name_prefix      = "worker-node"
  min_size         = var.min_size
  max_size         = var.max_size
  desired_capacity = var.desired_capacity
  launch_template {
    id      = aws_launch_template.launch_template.id
    version = aws_launch_template.launch_template.latest_version
  }
  vpc_zone_identifier = var.private_subnets
  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 50
    }
  }
  lifecycle {
    create_before_destroy = true
  }

  tag {
    key                 = "Name"
    value               = "Worker-server"
    propagate_at_launch = true
  }
  force_delete = true
}

Creating the scale-up policy and CloudWatch alarm:

resource "aws_autoscaling_policy" "scale_up_using_q" {
  name                   = "worker-node-scale_up_using_q"
  autoscaling_group_name = aws_autoscaling_group.asg.name
  adjustment_type        = "ChangeInCapacity"
  scaling_adjustment     = 1
  cooldown               = 300
}
resource "aws_cloudwatch_metric_alarm" "scale_up_using_q" {
  alarm_name          = "worker-node-scale_up_using_q"
  alarm_description   = "Monitors RabbitMQ queue size for server ASG"
  alarm_actions       = [aws_autoscaling_policy.scale_up_using_q.arn]
  comparison_operator = "GreaterThanOrEqualToThreshold"
  namespace           = "QueueMetrics"
  metric_name         = "TotalMessages"
  threshold           = var.kpi / var.avg_processing_time
  evaluation_periods  = "1"
  period              = "300"
  statistic           = "Average"
}
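If your Lambda publishes the metric with a dimension, for example a per-queue `QueueName` dimension (an assumption; it depends on how you implemented the linked article), the alarm must declare the same dimension or CloudWatch will not match the datapoints:

```hcl
  # Add inside aws_cloudwatch_metric_alarm, only if the metric is
  # published with this dimension. "orders" is a placeholder queue name.
  dimensions = {
    QueueName = "orders"
  }
```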

Creating the scale-down policy and CloudWatch alarm:

#scale down using queue size
resource "aws_autoscaling_policy" "scale_down_using_q" {
  name                   = "worker-node-scale_down_using_q"
  autoscaling_group_name = aws_autoscaling_group.asg.name
  adjustment_type        = "ChangeInCapacity"
  scaling_adjustment     = -1
  cooldown               = 300
}
resource "aws_cloudwatch_metric_alarm" "scale_down_using_q" {
  alarm_name          = "worker-node-scale_down_using_q"
  alarm_description   = "Monitors RabbitMQ queue size for server ASG"
  alarm_actions       = [aws_autoscaling_policy.scale_down_using_q.arn]
  comparison_operator = "LessThanThreshold"
  namespace           = "QueueMetrics"
  metric_name         = "TotalMessages"
  threshold           = var.kpi / var.avg_processing_time
  evaluation_periods  = "1"
  period              = "300"
  statistic           = "Average"
}

You will have to supply values for kpi (the longest acceptable latency for a message) and avg_processing_time (the average time a worker takes to process one message). Their ratio is the queue-size threshold: once the queue holds more messages than a single worker can clear within the KPI, the alarm fires and a worker node is added.
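A minimal sketch of the variable definitions, with example values assumed here for illustration (a 600-second KPI and 2 seconds per message would put the threshold at 300 messages):

```hcl
variable "kpi" {
  description = "Longest acceptable latency for a message, in seconds"
  type        = number
  default     = 600
}

variable "avg_processing_time" {
  description = "Average time a worker takes to process one message, in seconds"
  type        = number
  default     = 2
}
```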


Set up the necessary permissions and IAM roles, then run terraform init and terraform apply. You’ll have instances that scale in and out based on your RabbitMQ metrics.
