Something that I spend a lot of my time working on at the moment is containerising workloads. We run a lot of this stuff in AWS, which has some nice orchestration facilities as a part of ECS - but I've been looking for a way to have the same sort of facility in our own datacentres.
Enter Kubernetes, or k8s for short. It's a container/cluster management system from Google, which handles scheduling and placement of containers across a fleet of servers. It's a cool system but, being new, it suffers from a lack of maturity in documentation and support; it's not unusual to find a getting-started guide that either peters out into a sea of TODOs or suggests running a bunch of deprecated commands.
This is mostly possible to work around (Kubernetes itself is usually quite good at telling you that "minions" are now "nodes" and so on, making this much less of a stumbling block than it could be), but we did have a problem setting up a cluster that managed to completely stump us:
- Both master and nodes were coming online without any obvious errors
- kubectl get nodes reported the nodes as being "Ready"
- Pods could be created...
- ... but kubectl get pods reported all pods stuck in a status of "Pending"
A lot of searching for pods stuck in pending turned up the usual suspects: trouble communicating with Docker, nodes reporting a different hostname or address from the one the master knew them by, or traffic not making it over the network. We had none of these, and no obvious errors in the logs, which made the failure of pods to run completely mystifying.
Of course, like so many problems which appear baffling and complex, it turned out to be an utterly basic bit of user error. The solution?
If kube-scheduler is not running on the master, Kubernetes will report everything as okay but will be unable to start containers in a pod.
Hence all our pods being stuck in pending: with no scheduler, nothing ever assigns a pod to a node, so no node is ever asked to start its containers. Some idiot (me) had made a typo in the script which set up the various services, failed to notice, then wondered why nothing was working. It didn't help that the documentation is still catching up with a moving target, and often leaves out the critical point that in current versions of Kubernetes the scheduler needs to be running for anything meaningful to happen.
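A check like the one below would have caught my typo straight away. It's only a minimal sketch in Python, and it assumes the master components run as ordinary processes called kube-apiserver, kube-controller-manager and kube-scheduler; if yours run in containers or under another binary name, the list needs adjusting. The function names are made up for illustration.

```python
import subprocess

# Daemons we expect to find on the master. These names are an assumption
# about the setup; change them if your components are named differently.
EXPECTED = ("kube-apiserver", "kube-controller-manager", "kube-scheduler")


def missing_components(command_lines, expected=EXPECTED):
    """Return the expected daemons that appear in none of the command lines."""
    found = set()
    for line in command_lines:
        parts = line.split()
        if not parts:
            continue
        # Compare on the executable's base name, ignoring path and arguments.
        executable = parts[0].rsplit("/", 1)[-1]
        if executable in expected:
            found.add(executable)
    return [name for name in expected if name not in found]


def running_command_lines():
    """Return the command line of every running process, via POSIX ps."""
    result = subprocess.run(
        ["ps", "-e", "-o", "args="],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()
```

Run on the master, `missing_components(running_command_lines())` should come back as an empty list. In our broken cluster it would have contained kube-scheduler.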
Anyway - if you have pods stuck in pending, no obvious errors in the logs, and no log events, a missing scheduler could be the reason. Failing that, the debugging FAQ is one of the better resources; but make sure you go through it from the start and check every step. (A lot of advice suggests starting halfway down in a specific section, and it's easy to miss the basics doing this.)
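If you'd rather check for the symptom than eyeball it, something along these lines may help. It's a rough Python sketch that assumes kubectl is on the path and already pointed at the cluster. It relies on the fact that a pod nobody has scheduled sits in phase Pending with no spec.nodeName set. Again, the function names are just for illustration.

```python
import json
import subprocess


def unscheduled_pods(pod_list):
    """Return the names of Pending pods that have no node assigned.

    pod_list is the parsed output of `kubectl get pods -o json`.
    """
    names = []
    for pod in pod_list.get("items", []):
        phase = pod.get("status", {}).get("phase")
        node = pod.get("spec", {}).get("nodeName")
        if phase == "Pending" and not node:
            names.append(pod["metadata"]["name"])
    return names


def fetch_pods():
    """Fetch the pods in the current namespace as parsed JSON."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-o", "json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)
```

If `unscheduled_pods(fetch_pods())` lists every pod you've created, nothing is being scheduled at all, which points at the scheduler rather than at Docker or the network. A pod that is pending but does have a node assigned is a different problem (a slow image pull, for instance).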