Secure web service case study
The Web Developer Security Checklist highlights some of the more important issues to consider when creating a secure web application. To illustrate those issues, this case study describes how the PowerDown web application is implemented.
PowerDown is a cloud optimizer application that lowers cloud costs by powering down idle cloud resources such as servers, databases and containers. PowerDown transparently combines the resource availability requirements of all DevOps staff to create a dynamic schedule for when cloud resources can be safely powered up and down. If you want an overview of PowerDown, please read Announcing PowerDown.
PowerDown Site Composition
The PowerDown service is implemented on an AWS EC2 cluster running over multiple availability zones. The site runs in a single AWS region and uses AWS CloudFront as a CDN.
The service uses CloudFlare in bypass mode as a DNS provider. During a Denial of Service (DoS) attack, we switch CloudFlare to proxy mode so that it provides a good measure of DoS protection for the duration of the attack.
We use AWS Auto Scaling and custom monitoring to create a reliable, master-less Docker cluster from EC2 instances. We chose not to use AWS ECS or other Docker cluster providers, as we wanted more control over the scale and placement of our services. As these providers mature their offerings, we may use them in the future.
PowerDown server instances use AWS instance IAM roles that define and constrain the abilities of the services running on those instances. All instances use PowerDown and CloudWatch Logs for log capture. We actively monitor CloudTrail for account changes.
The PowerDown service is decomposed into micro-services implemented via Docker containers. We don’t believe in very small micro-services. However, some of our services are currently a little larger than we would like and we will probably split them up a little more in the future.
The PowerDown micro-services are:
- PowerDown Web App — supports the browser based PowerDown application.
- PowerDown Scheduler — responsible for starting and stopping cloud resources.
- PowerDown TimeKeeper — calculates cloud resource costs and availability metrics.
- PowerDown Admin — performs account and database maintenance.
Each of these micro-services runs on multiple server instances in our EC2 cluster, spread across several availability zones for reliability and availability.
The PowerDown database uses MySQL hosted via an AWS Aurora cluster running in multiple availability zones. We chose Aurora for its strong availability and performance characteristics coupled with excellent MySQL compatibility.
We use Aurora encryption at rest for physical security (such as for when drives are decommissioned). To protect sensitive data against breaches via remote database access, we use column level encryption for important data such as access tokens, billing details and email addresses. Passwords are hashed using bcrypt.
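As an illustration of column level encryption, here is a minimal sketch that encrypts a single column value with AES-256-GCM using the Node `crypto` module. The helper names `encryptColumn` and `decryptColumn`, the choice of AES-256-GCM and the storage layout (IV, tag and ciphertext packed into one base64 string) are assumptions made for this example, not a description of the PowerDown implementation. Key management and bcrypt password hashing are out of scope here.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

const ALGORITHM = "aes-256-gcm";
const IV_BYTES = 12;  // nonce size commonly used with GCM
const TAG_BYTES = 16; // default GCM authentication tag size

// Hypothetical helper: encrypt one column value with a 32-byte key.
function encryptColumn(plaintext: string, key: Buffer): string {
    const iv = randomBytes(IV_BYTES); // fresh IV for every value
    const cipher = createCipheriv(ALGORITHM, key, iv);
    const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
    const tag = cipher.getAuthTag();
    // Pack IV + tag + ciphertext so the stored value is self-contained.
    return Buffer.concat([iv, tag, ciphertext]).toString("base64");
}

// Hypothetical helper: reverse encryptColumn; throws if the value was tampered with
// or the wrong key is supplied.
function decryptColumn(stored: string, key: Buffer): string {
    const raw = Buffer.from(stored, "base64");
    const iv = raw.subarray(0, IV_BYTES);
    const tag = raw.subarray(IV_BYTES, IV_BYTES + TAG_BYTES);
    const ciphertext = raw.subarray(IV_BYTES + TAG_BYTES);
    const decipher = createDecipheriv(ALGORITHM, key, iv);
    decipher.setAuthTag(tag);
    return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}
```

With a design like this, someone who can read the database rows remotely sees only opaque base64 values unless they also obtain the key, which is the threat that encryption at rest alone does not cover.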
PowerDown App Portal
The PowerDown app is the primary user interface to the PowerDown service and it provides a security status overview and manager interface. It is the primary security dashboard to monitor and manage the service.
Nginx + NodeJS + TypeScript + Express + Aurelia
The PowerDown application is a NodeJS Express application written in TypeScript and running in a Docker container on an EC2 cluster.
The Nginx server is responsible for serving static content and for proxying requests to the Node applications. Static files are then cached via the AWS CloudFront CDN.
Content is minified and pre-gzipped as part of the build process. We use the Expansive Static Site Generator for static content preparation.
The Nginx servers run in Docker containers behind an AWS ALB that terminates TLS client connections. The containers are scaled using Auto Scaling.
The Nginx server and Node application define the following HTTP security headers to minimize the degrees of freedom for clients:
- Set-Cookie with the SameSite, HttpOnly and Secure attributes
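To make the cookie attributes concrete, the following sketch builds a Set-Cookie header value by hand. The function name `buildSessionCookie`, the `Path=/` attribute and the default of `SameSite=Strict` are assumptions for illustration; the source does not say which SameSite value PowerDown uses.

```typescript
type SameSite = "Strict" | "Lax" | "None";

// Hypothetical helper: build a Set-Cookie header value carrying the three
// attributes named above.
function buildSessionCookie(cookieName: string, value: string, sameSite: SameSite = "Strict"): string {
    return [
        `${cookieName}=${encodeURIComponent(value)}`,
        "Path=/",
        "HttpOnly",              // not readable from page JavaScript
        "Secure",                // only sent over HTTPS
        `SameSite=${sameSite}`,  // restricts cross-site sending
    ].join("; ");
}
```

In an Express application the same attributes can be set with `res.cookie(name, value, { httpOnly: true, secure: true, sameSite: "strict" })`.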
Node and Express
The Node application is written in TypeScript using an ES2017 subset. We use the async/await pattern extensively and have found that it dramatically simplifies Node programs. In our case, the performance cost is well worth it, and we expect async/await to get faster as implementations are optimized. We believe the simpler calling sequence, which avoids callback hell, results in a more secure and transparent application.
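The difference in calling sequence can be seen in a small side-by-side sketch. All of the function names and the data they return are hypothetical stand-ins for real database or API calls; they are not taken from the PowerDown code.

```typescript
type Callback<T> = (err: Error | null, result?: T) => void;

// Hypothetical callback-style lookups.
function findAccountCb(accountId: number, done: Callback<string>): void {
    setTimeout(() => done(null, `account-${accountId}`), 0);
}

function findResourcesCb(account: string, done: Callback<string[]>): void {
    setTimeout(() => done(null, [`${account}/server-1`]), 0);
}

// Callback style: every step nests one level deeper and repeats its own error check.
function listResourcesCb(accountId: number, done: Callback<string[]>): void {
    findAccountCb(accountId, (err, account) => {
        if (err || account === undefined) {
            done(err ?? new Error("account not found"));
            return;
        }
        findResourcesCb(account, (err2, resources) => {
            if (err2) {
                done(err2);
                return;
            }
            done(null, resources);
        });
    });
}

// Hypothetical promise-returning equivalents of the same lookups.
async function findAccount(accountId: number): Promise<string> {
    return `account-${accountId}`;
}

async function findResources(account: string): Promise<string[]> {
    return [`${account}/server-1`];
}

// async/await style: the steps read top to bottom, and a rejected promise
// propagates to the caller like an ordinary exception.
async function listResources(accountId: number): Promise<string[]> {
    const account = await findAccount(accountId);
    return findResources(account);
}
```

The flat version leaves fewer places to forget an error check, which is the kind of transparency the paragraph above refers to.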
We use a limited and audited set of packages and we closely monitor our dependencies. Instead of using an ORM package, we use a custom ORM that also performs extensive data validation and encryption services. It also handles JSON object conversions.
We apply rate limiting on slow APIs using express-rate-limit and we apply canary checks on our APIs to detect illegal or abnormal requests.
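The idea behind rate limiting can be sketched with a small fixed-window counter. This is a self-contained illustration of the technique and not the express-rate-limit implementation; the class name `FixedWindowLimiter` and its parameters are assumptions made for this example.

```typescript
interface WindowEntry {
    count: number;
    windowStart: number; // timestamp in milliseconds
}

// Hypothetical fixed-window limiter: at most `maxRequests` per client in each
// window of `windowMs` milliseconds.
class FixedWindowLimiter {
    private readonly maxRequests: number;
    private readonly windowMs: number;
    private readonly hits = new Map<string, WindowEntry>();

    constructor(maxRequests: number, windowMs: number) {
        this.maxRequests = maxRequests;
        this.windowMs = windowMs;
    }

    // Returns true if the request from `clientKey` (for example an IP address)
    // is allowed at time `now`.
    allow(clientKey: string, now: number = Date.now()): boolean {
        const entry = this.hits.get(clientKey);
        if (!entry || now - entry.windowStart >= this.windowMs) {
            // First request from this client, or the previous window has expired.
            this.hits.set(clientKey, { count: 1, windowStart: now });
            return true;
        }
        entry.count += 1;
        return entry.count <= this.maxRequests;
    }
}
```

In an Express application, express-rate-limit packages comparable behaviour as middleware configured with options such as `windowMs` and `max`, so a slow API route can be limited without hand-written bookkeeping.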
Log files from all the micro-services are captured and stored centrally in CloudWatch Logs. The log files are captured from the Docker daemon and sent to CloudWatch.
We keep log data for one year. Conveniently, CloudWatch automatically purges old log events after they expire.
Building the Site
We believe strongly in the benefits of Immutable Infrastructure. This means that once a server is deployed, it is never modified, patched or upgraded. It is simply replaced with a new, updated instance when required. Our EC2 instances and Docker containers are immutable. The primary benefit of this approach is that we can immediately detect unauthorized modifications to our infrastructure. A secondary benefit is that it greatly simplifies our implementation: we never need to do live patching or upgrading.
All infrastructure is created via Terraform. We do not use the AWS console for creating or modifying any cloud configuration. Infrastructure should be defined as “code” and should be created at the push of a button.
The Terraform configuration files define:
- VPC networks, peers and routing tables
- Security Groups
- IAM users, roles and policies
- Databases and Redis clusters
- EC2 Auto Scaling groups and launch configurations
- ALB load balancers and target groups
- EC2 instances
- SNS topics
- CloudFlare DNS endpoints
By using an immutable, infrastructure-as-code paradigm, we can audit our cloud configuration for any changes and rapidly regenerate any component without fear of human error.
Using Terraform makes it trivial to replicate production environments for staging and test. To reduce cost, we spin these up and down as required. We automatically turn off all unused servers after hours.
We do not expose SSH endpoints on any servers. If SSH access is really required, we temporarily add a “Support” security group to a specific instance. The PowerDown service will notice this and elevate the security status for as long as the group remains. In any case, if this SSH access is forgotten, the next time we deploy and Terraform runs, it will automatically remove this security group from the instance.
Our Security Incident Plan
We have one ;-)