Friday, June 26, 2015

Automatically launching and configuring an EC2 instance with ansible

Ansible makes it easy to handle an EC2 instance from soup to nuts, from launching the instance to configuring it.  Here's a complete playbook I use for this purpose:

$ cat ec2-launch-instance-api.yml
---
- name: Create a new api EC2 instance
  hosts: localhost
  gather_facts: False
  vars:
    keypair: api
    instance_type: t2.small
    security_group: api-core
    image: ami-5189a661
    region: us-west-2
    vpc_subnet: subnet-xxxxxxx
    name_tag: api01
  tasks:
    - name: Launch instance
      ec2:
         key_name: "{{ keypair }}"
         group: "{{ security_group }}"
         instance_type: "{{ instance_type }}"
         image: "{{ image }}"
         wait: true
         region: "{{ region }}"
         vpc_subnet_id: "{{ vpc_subnet }}"
         assign_public_ip: yes
         instance_tags:
           Name: "{{ name_tag }}"
      register: ec2

    - name: Add Route53 DNS record for this instance (overwrite if needed)
      route53:
         command: create
         zone: mycompany.com
         record: "{{name_tag}}.mycompany.com"
         type: A
         ttl: 3600
         value: "{{item.private_ip}}"
         overwrite: yes
      with_items: ec2.instances

    - name: Add new instance to proper ansible group
      add_host: hostname={{name_tag}} groupname=api-servers ansible_ssh_host={{ item.private_ip }} ansible_ssh_user=ubuntu ansible_ssh_private_key_file=/Users/grig.gheorghiu/.ssh/api.pem
      with_items: ec2.instances

    - name: Wait for SSH to come up
      wait_for: host={{ item.private_ip }} port=22  search_regex=OpenSSH delay=210 timeout=420 state=started
      with_items: ec2.instances

- name: Configure api EC2 instance
  hosts: api-servers
  sudo: True
  gather_facts: True
  roles:
    - base
    - tuning
    - postfix
    - monitoring
    - nginx
    - api


The first play in this playbook launches a new EC2 instance, adds or updates its Route53 DNS A record, adds it to an ansible group and waits for it to be accessible via ssh. The second play then configures the instance by applying a handful of roles to it. That's it.

Some things to note:

1) Ansible uses boto under the covers, so you need that installed on your local host, and you also need a ~/.boto configuration file with your AWS credentials:

[Credentials]
aws_access_key_id = xxxxx
aws_secret_access_key = yyyyyyyyyy

2) When launching an EC2 instance via the ansible ec2 module, the hosts variable should point to localhost and gather_facts should be set to False, since the launch task runs locally against the EC2 API rather than on a remote host.

3) The various parameters expected by the EC2 API (keypair name, instance type, VPC subnet, security group, instance name tag etc.) can be set in the vars section and then used in the tasks section in the ec2 stanza.

4) I used the ansible route53 module for managing DNS. This module has a handy property called overwrite, which when set to yes will update a DNS record in place if it exists, or will create it if it doesn't exist.
 
5) The add_host task is very useful in that it adds the newly created instance to a hosts group, in my case api-servers. This host group has a group_vars/api-servers configuration file already, where I set various ansible variables used in different roles (mostly secret-type variables such as API keys, user names, passwords etc). The group_vars directory is NOT checked in.
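
As an illustration, a group_vars/api-servers file along these lines makes those variables available to every role applied to the group (the variable names below are made up, since the real ones are secrets):

$ cat group_vars/api-servers
---
# secret-type variables shared by all roles applied to api-servers
api_db_user: api
api_db_password: s3cr3t
sendgrid_api_key: SG.xxxxxxxx
newrelic_license_key: yyyyyyyy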

6) In the final play of the playbook, the [api-servers] group (which consists of only the newly created EC2 instance) gets the respective roles applied to it. Why does this group consist only of the newly created EC2 instance? Because when I run the playbook with ansible-playbook, I indicate an empty hosts file to make sure this group starts out empty:

$ ansible-playbook -i hosts/myhosts.empty ec2-launch-instance-api.yml

If instead I wanted to also apply the specified roles to my existing EC2 instances in that group, I would specify a hosts file that already has those instances defined in the [api-servers] group.
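
For what it's worth, the 'empty' hosts file can be as simple as the group header with no hosts under it:

$ cat hosts/myhosts.empty
[api-servers]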

Thursday, June 25, 2015

Deploying monitoring tools with ansible

At my previous job, I used Chef for configuration management. New job, new tools, so I decided to use ansible, which I had played with before. Part of the reason was that I got sick of tools based on Ruby. Managing all the gem dependencies and migrating from one Ruby version to another was a nightmare that I didn't want to go through again. That's also one reason why at my new job we settled on Go as the main language for our backend API layer.

Back to ansible. Since it's written in Python, it's already got good marks in my book. Plus it doesn't need a server and it's fairly easy to wrap your head around. I've been very happy with it so far.

For external monitoring, we use Pingdom because it works and it's cheap. We also use New Relic for application performance monitoring, but it's very expensive, so I've been looking at ways to supplement it with Open Source tools.

An announcement about telegraf drew my attention the other day: here was a metrics collection tool written in Go and sending its data to InfluxDB, which is a scalable database also written in Go and designed to receive time-series data. Seemed like a perfect fit. I just needed a way to display the data from InfluxDB. Cool, it looked like Grafana supports InfluxDB! It turns out however that Grafana support for the latest InfluxDB version 0.9.0 is experimental, i.e. doesn't really work. Plus telegraf itself has some rough edges in the way it tags the data it sends to InfluxDB. Long story short, after a whole day of banging my head against the telegraf/InfluxDB/Grafana wall, I decided to abandon this toolset.

Instead, I reached again for trusty old Graphite and its loyal companion statsd. I had problems with Graphite not scaling well before, but for now we're not sending it such a huge amount of metrics, so it will do. I also settled on collectd as the OS metric collector. It's small, easy to configure, and very stable. The final piece of the puzzle was a process monitoring and alerting tool. I chose monit for this purpose. Again: simple, serverless, small footprint, widely used, stable, easy to configure.

This seems like a lot of tools, but it's not really that bad if you have a good solid configuration management system in place -- ansible in my case.

Here are some tips and tricks specific to ansible for dealing with multiple monitoring tools that need to be deployed across various systems.

Use roles extensively

This is of course recommended no matter what configuration management system you use. With ansible, it's easy to use the command 'ansible-galaxy init rolename' to create the directory structure for a new role. My approach is to create a new role for each major application or tool that I want to deploy. Here are some of the roles I created:

  • a base role that adds users, deals with ssh keys and sudoers.d files, creates directory structures common to all servers, etc.
  • a tuning role that mostly configures TCP-related parameters in sysctl.conf
  • a postfix role that installs and configures postfix to use Amazon SES
  • a go role that installs golang from source and configures GOPATH and GOROOT
  • an nginx role that installs nginx and deploys self-signed SSL certs for development/testing purposes
  • a collectd role that installs collectd and deploys (as an ansible template) a collectd.conf configuration file common to all systems, which sends data to graphite (the system name is customized as {{inventory_hostname}} in the write_graphite plugin; see the template excerpt after this list)
  • a monit role that installs monit and deploys (again as an ansible template) a monitrc file that monitors resource metrics such as CPU, memory, disk etc. common to all systems
  • an api role that does the heavy lifting for the installation and configuration of the packages that are required by our API layer
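
To make the collectd item above more concrete, here is a rough excerpt of the collectd.conf template deployed by the collectd role. This is a sketch of the relevant bits rather than the actual template: the graphite endpoint is a placeholder and the real file configures more plugins:

$ cat roles/collectd/templates/collectd.conf.j2
# excerpt - the real template configures more plugins
Hostname "{{ inventory_hostname }}"

LoadPlugin write_graphite
<Plugin write_graphite>
  <Node "graphite">
    Host "graphite.mycompany.com"
    Port "2003"
    Protocol "tcp"
    Prefix "collectd."
  </Node>
</Plugin>

# per-role drop-in configs land here (see below)
Include "/etc/collectd/collectd.conf.d/*.conf"
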
Use an umbrella 'monitoring' role

At first I was configuring each ansible playbook to use both the monit role and the collectd role. I realized that it's a bit clearer and also easier to maintain if playbooks instead use a more generic monitoring role, which does nothing but list monit and collectd as dependencies in its meta/main.yml file:


dependencies:
  - { role: monit }
  - { role: collectd }

Customize monitoring-related configuration files in other roles

A nice thing about both monit and collectd, and a main reason I chose them, is that they read configuration files from a drop-in directory: /etc/monit/conf.d for monit and /etc/collectd/collectd.conf.d for collectd. This makes it easy for each role to add its own configuration files. For example, the api role adds 2 files as custom checks in /etc/monit/conf.d: check_api and check_nginx. It also adds 2 files as custom metric collectors in /etc/collectd/collectd.conf.d: nginx.conf and memcached.conf. The api role does this via a tasks/monitoring.yml file which gets included in tasks/main.yml.
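
Here is a minimal sketch of what the api role's tasks/monitoring.yml can look like (the file names match the ones above; the notify handlers are just an assumption about how the role restarts monit and collectd):

$ cat roles/api/tasks/monitoring.yml
---
# custom monit checks for the api layer
- name: Deploy monit checks for api and nginx
  copy: src={{ item }} dest=/etc/monit/conf.d/{{ item }} owner=root group=root mode=0644
  with_items:
    - check_api
    - check_nginx
  notify: restart monit

# custom collectd metric collectors for the api layer
- name: Deploy collectd configs for nginx and memcached
  copy: src={{ item }} dest=/etc/collectd/collectd.conf.d/{{ item }} owner=root group=root mode=0644
  with_items:
    - nginx.conf
    - memcached.conf
  notify: restart collectd

In roles/api/tasks/main.yml, this file is then pulled in with a simple '- include: monitoring.yml' task.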

As another example, the nginx role also adds its own check_nginx configuration file to /etc/monit/conf.d via a tasks/monitoring.yml file.
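
For completeness, check_nginx itself is a standard monit process check along these lines (the pid file path and init script are distribution-dependent, so treat this as a sketch):

$ cat roles/nginx/files/check_nginx
check process nginx with pidfile /var/run/nginx.pid
  start program = "/etc/init.d/nginx start"
  stop program = "/etc/init.d/nginx stop"
  if failed host 127.0.0.1 port 80 protocol http then restart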

The rule of thumb I arrived at is this: each low-level role such as monit and collectd installs the common configuration files needed by all other roles, whereas each higher-level role such as api installs its own custom checks and metric collectors via a monitoring.yml task file. This way, it's easy to see at a glance what each high-level role does for monitoring: just look in its monitoring.yml task file.

To wrap this post up, here is an example of a playbook I use to build API servers:

$ cat api-servers.yml
---

- hosts: api-servers
  sudo: yes
  roles:
    - base
    - tuning
    - postfix
    - monitoring
    - nginx
    - api

I call this playbook with:

$ ansible-playbook -i hosts/myhosts api-servers.yml

To make this work, the hosts/myhosts file has a section similar to this:

[api-servers]
api01 ansible_ssh_host=api01.mydomain.com
api02 ansible_ssh_host=api02.mydomain.com




