It’s been a few years since I last commented on my deployed Kubernetes-based workbench. Over that time I’ve learned a lot and gained more experience with it and other solutions - it’s time to revisit the setup and decide whether to upgrade or replace it.
Deployed System
I never really completed the initial series that was intended to give step-by-step instructions for replicating my environment, largely because there were too many steps to realistically expect anyone else to follow. I tried to wrap things up in Python scripts to make them more user-friendly, but really doing it justice would have taken more time than it was worth. As I used the system I decided that others really shouldn’t replicate it, which further demotivated me from continuing the series.
The general overview of what was achieved is:
- Kubernetes cluster with small, always-on masters
- SSL-secured LDAP server running in k8s
- EFS as shared storage (a ReadWriteMany file system, as compared to block storage, which is writable by only one pod at a time)
- A small pod running in k8s that could be ssh-ed into, as a “control pod”
- A “bastion” jump host deployed to allow ssh-tunneling from the general internet into my k8s cluster (the cluster was otherwise locked down)
- The control pod could be used to scale the cluster up or down via KOPS (sketched just after this list)
- Pods launched onto new nodes were LDAP-connected, so they knew users and user IDs and played nicely with the shared storage
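To give a flavor of the “scale up or down” step, here’s a minimal Python sketch of the kind of wrapper the control pod ran - it assumes a kops binary on the PATH, KOPS_STATE_STORE and KOPS_CLUSTER_NAME set in the environment, and the default instance group name “nodes”; the function names are illustrative, not my actual scripts.

```python
import subprocess
import tempfile

import yaml  # PyYAML


def kops(*args, **kwargs):
    """Run a kops subcommand, relying on KOPS_STATE_STORE / KOPS_CLUSTER_NAME."""
    return subprocess.run(["kops", *args], check=True, text=True, **kwargs)


def scale_nodes(count, instance_group="nodes"):
    """Resize a kops instance group and apply the change to the cluster."""
    # Pull the current instance-group spec and bump its size.
    out = kops("get", "ig", instance_group, "-o", "yaml", capture_output=True)
    spec = yaml.safe_load(out.stdout)
    spec["spec"]["minSize"] = count
    spec["spec"]["maxSize"] = count

    # Push the edited spec back and apply it (adjusts the AWS autoscaling group).
    with tempfile.NamedTemporaryFile("w", suffix=".yaml") as f:
        yaml.safe_dump(spec, f)
        f.flush()
        kops("replace", "-f", f.name)
    kops("update", "cluster", "--yes")


if __name__ == "__main__":
    scale_nodes(5)    # scale up for an analysis session
    # scale_nodes(1)  # ...and back down when done
```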
I’d hoped to achieve a state where I could use a very low-cost laptop (e.g. a Chromebook) as a thin client, primarily making use of Cloud9 as an IDE, and scaling up as needed to complete analyses. As described below, this never really came to fruition with this setup.
Problems
Over the course of several years I found the setup useful, but it had several problems that really took away from its value.
Problem 1: Cost
The minimum running cost was a fairly large problem for the system, given my usage pattern. With no operations ongoing, the cluster cost roughly $50/mo (described more here) - and that’s with very small nodes and only Cloud9 as a development environment.
If I had been using the cluster heavily or had many more users then this cost would have been fairly minor, but for something intended just to be available when I had some spare cycles at night or on a weekend (rare, with 2 kids and a full-time job!), it was a significant downer.
Problem 2: Complexity
Overall the system was complex, non-standard, and fragile. It could certainly have benefited from a refactor, but as it was really wrapping three moving interfaces (AWS, KOPS, and Let’s Encrypt) there was only so far it could go.
Certificates
When I was deploying this, Let’s Encrypt had disabled automated renewal for servers behind a firewall (which mine were, as I wanted my k8s infrastructure separated from the internet). This might not have been such a big deal if the certs lasted a year, but since they expire every 3 months it was a constant annoyance.
Worsening the situation - the primary reason I was using SSL in the first place was that it was required for my LDAP connection… this meant that if I forgot to renew the certs, I was locked out of ssh-ing in with LDAP credentials, as the ssh nodes would no longer trust the LDAP server. I could always get in through the back - ssh to the EC2 instances, exec into the Docker containers, and go from there - but it certainly wasn’t ideal…
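In hindsight, even a tiny expiry check run from cron would have caught this before the lockout. A hedged sketch (the hostname is hypothetical; 636 is the standard LDAPS port):

```python
import datetime
import socket
import ssl


def days_until_expiry(host, port=636, timeout=5):
    """Return how many days remain on the TLS cert served at host:port."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # If the cert is already expired (or untrusted) the handshake raises,
        # which is itself a useful signal for a cron-driven check.
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    not_after = datetime.datetime.utcfromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"])
    )
    return (not_after - datetime.datetime.utcnow()).days


if __name__ == "__main__":
    remaining = days_until_expiry("ldap.example.com")  # hypothetical hostname
    print(f"LDAPS certificate expires in {remaining} days")
    if remaining < 14:
        raise SystemExit("Renew the cert before LDAP logins start failing!")
```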
In the ~3 years since I launched the cluster, better tools have emerged for automating and monitoring the cert process (e.g. cert-manager), but this just keeps getting farther and farther from the goal of the project - doing data science!
KOPS
KOPS was (and is) a decent k8s deployer, but now that you can get free k8s hosting on GCE and relatively cheap hosting on AWS it seems less appealing. Especially as I used it infrequently, remembering all of its options to expand or tear down the cluster was cumbersome.
General user experience
My goal had been to have a quick and scalable system that I could access from a low-specification laptop and use to cheaply run powerful analytics. It was scalable but not quick or cheap, and it turned out that the low-spec laptop accessibility was never realized as fully as I would have liked.
A major problem was that I’d really hoped to use Cloud9 (or perhaps Eclipse Che) as a cloud-native IDE, but never really got comfortable in that environment. Compared to PyCharm, Cloud9 just doesn’t have the true “IDE” features I’d expect, especially for a sizeable Python project (e.g. refactoring, rapid running of tests or other configurations, definition lookup and navigation, …), not to mention any hooks into IPython or visualization libraries.
I’d started to consider running a headless X display in a k8s pod (I’ve done this before…), which I’d ssh-tunnel into and attach a remote VNC session to in order to use PyCharm and other software running there. While this is something I could do, it’s not something I’d want to have to do frequently, nor something I’d want to have to write instructions for so that others could try to repeat the setup.
Problem 3: Day 2 Operations
“Day 2 Operations” includes everything after deployment until system end of life. For me, that primarily meant:
- Upgrades - KOPS and k8s, as well as Docker images and LDAP
- Backups - User data, LDAP directory data, Postgres images, etc.
- Monitoring - Are certs expiring, are nodes down, are charges getting out of hand, …
- Logging - To figure out what went wrong, when, and why
- Security - Several k8s issues were made known after I deployed - luckily my network architecture limited my attack surface (only port 22 ssh was open), but I still needed to deal with them.
- User administration - Annoying to go through the LDAP UI to add ssh keys, make sure a new user had the correct LDAP schema, etc. Could have wrapped it (yet another thing to wrap)…
- Certificate management - Let’s Encrypt, described above. Could have switched to self-signed certs with longer lifespans, but that’s yet another thing to manage.
Because I was doing everything myself and not making use of any AWS managed services, everything really had to be done from scratch. I’ve done that before, but I don’t really want to do it in my free time…
Problem 4: Overall usability
Aside from administration, general usability wasn’t what I wanted, either.
Cloud9 really hasn’t been suitable for me as a primary analytics environment. I’m heavily habituated to a workflow that makes extensive use of git, conda environments, and Python - all of which PyCharm handles very well, and Cloud9 completely ignores. It pretty much does syntax highlighting and file browsing, and that’s about it.
In order for the system to be usable, my local laptop needed significant configuration: a good ssh client that could tunnel and handle keys well, plus Chrome plug-ins to selectively route traffic through the ssh tunnel (e.g. to reach Jupyter running in the cluster). This made it difficult for me to switch between computers, and impractical for others to try to reproduce.
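For reference, the tunneling piece was plain OpenSSH dynamic (SOCKS) forwarding - roughly the sort of wrapper shown below, assuming “bastion” is an alias for the jump host in ~/.ssh/config and the local port is arbitrary; the browser proxy plug-in then points at localhost:1080 so traffic to in-cluster services goes through the tunnel.

```python
import subprocess


def open_socks_tunnel(jump_host="bastion", local_port=1080):
    """Open a SOCKS proxy on localhost that routes through the jump host.

    Point a browser proxy plug-in at localhost:<local_port> to reach
    services (e.g. Jupyter) running inside the cluster.
    """
    # -N: no remote command, -D: dynamic (SOCKS) forwarding on the local port.
    return subprocess.Popen(["ssh", "-N", "-D", str(local_port), jump_host])


if __name__ == "__main__":
    tunnel = open_socks_tunnel()
    try:
        tunnel.wait()  # keep the proxy up until interrupted
    except KeyboardInterrupt:
        tunnel.terminate()
```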
Problem 5: The world moved on
Since I deployed the system, AWS and GCE have continued to improve their tooling around Kubernetes, making much of what I was doing obsolete. In particular, free or cheap hosting that keeps the k8s substrate secure and takes care of upgrades and maintenance would go a long way toward improving usability, in a way that wrappers around KOPS never could.
Conclusion
While it was a good project for 2016 and I got a fair amount of use out of it, the time has come to look around and re-evaluate the options - stay tuned!