Using Amazon EC2 as a development server

I do more and more data analysis and engineering work these days on my local machine, a 2017 MacBook Pro with 16GB of RAM. Ever since the new MacBooks were released I’ve been tempted to upgrade - especially when crunching data in Python or R (in projects like Fog of War, for instance). Running intensive analysis, training ML models, or running data transformation jobs locally tends to lock up my computer for tens of minutes at a time, during which I can’t really use the machine for anything else.

The new MacBooks are shiny, with impressive CPU benchmarks, but as far as I can tell my actual limiting factor is RAM. The datasets I work with are rarely large enough to justify investing in actual-big-data infrastructure - they should fit in 16GB of RAM more often than not. The problem is that between memory-hungry Electron apps like Spotify and Slack, several Firefox tabs, VSCode, Docker, and macOS itself, I rarely have all 16GB available for use. My machine locks up because it starts using the SSD as swap once a Python or R process exceeds 3-4GB of memory on top of everything else!

I would love to upgrade my laptop to 32GB, but a new 32GB MacBook runs upward of $3000. Of course, I’d be getting the upgraded M1 chip as well, but since I don’t do much video editing or heavy CPU work in general, I’m not sure that’s an efficient use of the money. Windows/Linux might provide better value for a development machine, but I also value the Mac’s better creative application support - audio tools like Ableton and Max/MSP run much more reliably - and I don’t want to juggle multiple laptops.

I began to consider renting cloud compute from Amazon. Thanks to git and Docker it shouldn’t require much setup to get code running on a remote server, and all I really need is the ability to ssh in from the command line and edit the remote filesystem with VSCode. An EC2 t3.xlarge instance comes with 4 vCPUs and 16GB of RAM, and costs approximately 16 cents per hour of usage. Left running around the clock, that comes out to about $1400 per year, which means a new MacBook would provide more value after only about two years.

However, the game changes because Amazon only charges for EC2 while the machine is in the “running” state. If I shut it down and start it up only when needed, the costs drop dramatically. Realistically, I need the extra resources only occasionally - many tasks can still be done on my local machine, and again Docker and git make the self-collaboration process quite smooth. A conservative estimate of 40 hours of usage per week brings the cost down to under $400 per year, which means the new MacBook would take closer to 7 years to break even. In addition, the t3.xlarge instance comes with a blazing fast 5 Gbps network connection, which is incredibly useful for any bulk uploading/downloading work - a benefit a new laptop can’t provide.
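
The back-of-the-envelope arithmetic, assuming the roughly $0.16/hour on-demand rate (actual pricing varies by region):

# Rough yearly cost of a t3.xlarge at ~$0.16/hour of on-demand usage
echo "24 * 365 * 0.16" | bc   # always on:     1401.60 dollars/year
echo "40 * 52 * 0.16" | bc    # 40 hours/week:  332.80 dollars/year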

The biggest obstacle I ran into is that Amazon assigns a new public hostname every time the machine is stopped and restarted. This makes it slightly nontrivial to restart the machine and reconnect the necessary tools. However, it’s nothing a little bit of shell scripting can’t handle - the AWS CLI provides everything needed to automate the process.

# Look up the instance's current public DNS name from its (stable) instance ID.
get_ec2_name() {
  aws ec2 describe-instances --instance-ids "$EC2_ID" \
    --query 'Reservations[].Instances[].PublicDnsName' \
    | grep amazon | tail -n 1 | tr -d '" ,'
}

start-ec2() {
  # Re-issue the start request until the instance leaves the "pending" state.
  pending=$(aws ec2 start-instances --instance-ids "$EC2_ID" | grep pending);
  while [ -n "$pending" ]
  do
    sleep 0.5;
    pending=$(aws ec2 start-instances --instance-ids "$EC2_ID" | grep pending);
  done
  name=$(get_ec2_name);
  # Overwrite the line immediately after the matching Host alias entry.
  awk_cmd="/$EC2_HOST_ALIAS/{ n=NR+1 } NR==n{ \$0=\"    HostName $name\" }1";
  cp ~/.ssh/config ~/.ssh/config.bak;
  # Read from the backup - redirecting output into the same file awk is
  # reading would truncate it before awk gets a chance to run.
  awk "$awk_cmd" ~/.ssh/config.bak > ~/.ssh/config;
  ssh $EC2_HOST_ALIAS;
}

The get_ec2_name function uses the instance identifier $EC2_ID - which doesn’t change - to query for the public DNS name. The aws ec2 command returns an array of DNS names, some of them empty strings. The result is piped through grep to filter for valid DNS names, tail’d to pick the most recent name, and stripped of quotes, commas, and spaces with tr.
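
For illustration, the raw JSON output of the query looks something like this (the hostname here is hypothetical), which is why the grep/tail/tr cleanup is needed:

[
    "",
    "ec2-203-0-113-25.compute-1.amazonaws.com"
]

(Adding --output text to the aws command would avoid the quote-stripping, but the JSON pipeline works fine.)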

The start-ec2 function uses the AWS CLI to issue a start-up request for the instance ID; the command conveniently returns a JSON object describing the instance’s status. During the startup process the status is pending, and once the machine is ready for use it switches to running. The script simply issues repeated start-instances commands at half-second intervals until the instance is ready. Next, the public hostname is fetched, and awk writes the new hostname into my SSH config file. The SSH config file describes settings for remote connections, and you can use aliases/labels to organize different connections! In this case, the awk script looks for the SSH config entry with the label $EC2_HOST_ALIAS and replaces the line immediately following it with the new hostname. I also set up the other relevant information in the SSH config, including the username and the private key file location.

Host <EC2_HOST_ALIAS>
    HostName <HOSTNAME>
    User <USERNAME>
    IdentityFile <PATH_TO_PRIVATE_KEY>
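
For example, with hypothetical values filled in (the alias, user, and key path here are all placeholders), the entry might look like this after a restart:

Host my-ec2
    HostName ec2-203-0-113-25.compute-1.amazonaws.com
    User ubuntu
    IdentityFile ~/.ssh/my-ec2-key.pem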

Once the SSH config file is set up properly, all that’s left is to call ssh $EC2_HOST_ALIAS to connect using the settings in the config. Even better, VSCode’s remote development setup can reference connections in the SSH config file, so it’s easy to refresh/reconnect VSCode even as the hostname changes between sessions.

My workday now starts with a simple start-ec2 command in the terminal and a handful of mouse clicks in VSCode to reconnect my last session. At the end of the day I log out and shut down by issuing sudo shutdown -h now on the connected instance. I can connect from anywhere, enjoy industrial-grade network speeds, the commitment is low, the setup is easy, and I pay a reasonable price that correlates directly with how much I need the extra resources.
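
The instance can also be stopped from the local shell instead; here’s a minimal counterpart sketch (the stop-ec2 name is my own, reusing the same $EC2_ID variable):

stop-ec2() {
  # Request a stop; billing for the instance pauses once it reaches "stopped".
  aws ec2 stop-instances --instance-ids "$EC2_ID";
}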
