Tuesday, June 13, 2017

OpenStack Rebooted... Part Deux

(spoiler alert - this ends badly)

Day 4 - Node Prep

OK. As mentioned, there have been some hardware updates. The biggest changes are the addition of a couple of R510s loaded up with drives to act as Ceph nodes and a couple of R620s to increase our compute node count.

The first time we did day 4 it became obvious that using the Ironic pxe_drac driver wasn't all that great for Gen11 servers, even though it was recommended. There's a good slide deck from Red Hat on troubleshooting Ironic (http://dtantsur.github.io/talks/fosdem2016/#/6) that has a great quote on this:
Ironic has different drivers. Some hardware is supported by more than one of them.
Make sure to carefully choose a driver to use: vendor-specific drivers (like pxe_drac or pxe_ilo) are usually preferred over more generic ones (like pxe_ipmitool). Unless they don't work :)
So there's that. They are preferred if they work.  Since I'm throwing a few Gen12 nodes into the mix I tried the pxe_drac driver on them and it seems to have worked so far (knock on silicon). Everything else I've left as pxe_ipmitool.
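
For reference, the driver choice lives in the pm_type field of each node's entry in instackenv.json. A rough sketch of what mine looks like, with every name, address, credential and MAC swapped for placeholders (field names per the TripleO baremetal docs):

$ cat instackenv.json
{
  "nodes": [
    {
      "name": "gen12-compute-0",
      "pm_type": "pxe_drac",
      "pm_addr": "10.1.1.11",
      "pm_user": "root",
      "pm_password": "CHANGEME",
      "mac": ["aa:bb:cc:dd:ee:01"]
    },
    {
      "name": "gen11-ceph-0",
      "pm_type": "pxe_ipmitool",
      "pm_addr": "10.1.1.21",
      "pm_user": "root",
      "pm_password": "CHANGEME",
      "mac": ["aa:bb:cc:dd:ee:02"]
    }
  ]
}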

The 'openstack baremetal import' command is deprecated now. The new hotness is:

$ openstack overcloud node import instackenv.json
Waiting for messages on queue 'e5a76db8-d9d3-4563-a6d0-e4487cfd60ea' with no timeout.
Successfully registered node UUID d4fb130b-84e2-49de-af8a-70649412d9d3
Successfully registered node UUID e33069b8-e757-44b5-89cc-9b6fd51c2d47
Successfully registered node UUID d625ea11-4f67-4e29-958e-9b7c6e55790e
Successfully registered node UUID c5acfdfa-993b-482e-9f58-a403bf1fc976
Successfully registered node UUID 13d99f7f-567d-496c-8892-57066f23fcc2
Successfully registered node UUID 42dc02a2-ebe7-461d-95d6-a821248b4a33
Successfully registered node UUID 4dbf1ed1-864e-4fbb-886b-38c473d3a371
Successfully registered node UUID 3d9be490-7a86-4bd5-b299-3377b790ef8a
Successfully registered node UUID 62c7b754-5b52-40e8-9656-69c102a273ff

[stack@ostack-director ~]$ openstack baremetal node list

+--------------------------------------+---------+---------------+-------------+--------------------+-------------+
| UUID                                 | Name    | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+---------+---------------+-------------+--------------------+-------------+
| d4fb130b-84e2-49de-af8a-70649412d9d3 | TAG-203 | None          | power off   | manageable         | False       |
| e33069b8-e757-44b5-89cc-9b6fd51c2d47 | TAG-201 | None          | power off   | manageable         | False       |
| d625ea11-4f67-4e29-958e-9b7c6e55790e | TAG-625 | None          | power off   | manageable         | False       |
| c5acfdfa-993b-482e-9f58-a403bf1fc976 | TAG-202 | None          | power off   | manageable         | False       |
| 13d99f7f-567d-496c-8892-57066f23fcc2 | TAG-626 | None          | power off   | manageable         | False       |
| 42dc02a2-ebe7-461d-95d6-a821248b4a33 | TAG-627 | None          | power off   | manageable         | False       |
| 4dbf1ed1-864e-4fbb-886b-38c473d3a371 | TAG-628 | None          | power off   | manageable         | False       |
| 3d9be490-7a86-4bd5-b299-3377b790ef8a | TAG-629 | None          | power off   | manageable         | False       |
| 62c7b754-5b52-40e8-9656-69c102a273ff | TAG-630 | None          | power off   | manageable         | False       |
+--------------------------------------+---------+---------------+-------------+--------------------+-------------+


And just like the first time through... so far so good.

Starting with OpenStack Newton it became possible to add an '--enroll' flag to the import so that nodes enter the 'enroll' state rather than the 'manageable' state. This, in turn, lets you selectively move some nodes to the 'manageable' state for introspection. You can also one-shot it during import with '--introspect --provide', which runs introspection and sets the final state to 'available'.
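
For the record, those variants look something like this (flags per the Ocata tripleoclient; treat it as a sketch):

$ openstack overcloud node import --enroll instackenv.json
$ openstack overcloud node import --introspect --provide instackenv.json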

On to Introspection.

Day 5 - Introspection

This actually brings us into 2017 which feels like progress.

This should "just work" but I have a few doubts:
1. I didn't wipe the drives on a couple of nodes from the previous setup; will they PXE boot properly?
2. The R510s are all using UEFI instead of BIOS. Is that even an issue?
3. The Ceph nodes have multiple drives. The TripleO docs have a warning: "If you don't specify the root device explicitly, any device may be picked. Also the device chosen automatically is NOT guaranteed to be the same across rebuilds. Make sure to wipe the previous installation before rebuilding in this case." So there's that. That falls under Advanced Deployment (root device hints; see the sketch just after this list): https://docs.openstack.org/developer/tripleo-docs/advanced_deployment/root_device.html#root-device
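
For the multi-drive R510s, the root device hint from that doc boils down to a single property on the node. A sketch, with a made-up serial number (you'd pull the real one from the introspection data):

$ openstack baremetal node set {UUID} --property root_device='{"serial": "61866da04f380d00"}'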

None of our nodes are in maintenance mode (that column is all false). All are listed as 'manageable'.

Ironic Inspector sets up a DHCP+iPXE server listening for requests from bare metal nodes.
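
On the undercloud that listener has its own dnsmasq instance, which is worth a sanity check when PXE misbehaves. Paths and service names below are as I find them on an Ocata undercloud; adjust if yours differ:

$ sudo cat /etc/ironic-inspector/dnsmasq.conf
$ sudo systemctl status openstack-ironic-inspector-dnsmasq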

Also new with Ocata, you can run a suite of pre-introspection validations:

$ openstack workflow execution create tripleo.validations.v1.run_groups '{"group_names": ["pre-introspection"]}'

Getting results from this is a little more involved. In my opinion the easiest way is:

$ openstack task execution list | grep RUNNING

When that returns no results, the workflow is finished and we can look for ERRORs:

$ openstack task execution list | grep run_validation | grep ERROR

If there are no errors, you win, move along. If there are errors we can take a closer look.

$ mistral task-get-result {ID}

{ID} is the first column of the task execution list. This should point you in the right direction.
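
If more than one validation failed, a quick-and-dirty loop dumps all of the results at once (assuming the default table output, where the ID is the first column and therefore awk's second field after the leading pipe):

$ for id in $(openstack task execution list | grep run_validation | grep ERROR | awk '{print $2}'); do mistral task-get-result $id; done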

Back to introspection. The 'bulk start' approach we used previously is also gone. We have a couple of options. We can stick with bulk introspection:

$ openstack overcloud node introspect --all-manageable

This does exactly what you think. It runs introspection on all nodes in the 'manageable' provisioning state. Optionally we can slap a '--provide' on the end to automatically put nodes in the 'available' state once introspection finishes. (Nodes have to be 'available' before they can be deployed into the overcloud.)

Alternatively, we can do them one node at a time, which we'd do if we're paranoid about individual nodes making it through. Still another option is to run the bulk shot and then re-do some nodes individually. To do individual nodes:

$ openstack baremetal node manage {UUID/NAME}
$ openstack baremetal introspection start {UUID}
$ openstack baremetal introspection status {UUID}
$ openstack baremetal node provide {UUID}

I'm going to bulk run it and then troubleshoot the failures. This is a Mistral workflow, so you can use a technique similar to monitoring the validation workflow to see some progress. I'd do it in another shell so as not to potentially interrupt anything.
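
Something like this in the second shell is enough to watch it tick over; it just reuses the task-list trick from the validations above:

$ watch -n 30 "openstack task execution list | grep -c RUNNING"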

$ openstack overcloud node introspect --all-manageable --provide
Started Mistral Workflow tripleo.baremetal.v1.introspect_manageable_nodes. Execution ID: c9c0b86a-cb3c-49dd-8d80-ec91798b00bb
Waiting for introspection to finish...
Waiting for messages on queue '1ee5d201-012c-4a3a-8332-f63c49b655f3' with no timeout.
.
.

And... everything fails.

OK. So, troubleshooting some IPMI? Expected. Troubleshooting a bit of PXE? Yep. But here the introspection image just keeps cycling and claiming it has no network.

*sigh* So... reality time. So far on OpenStack deployments with TripleO I've spent 95% of the actual frustration time on IPMI, PXE and assorted Ironic/bare metal problems. And here's the thing... none of that has anything to do with my final OpenStack cloud! I really wanted to work it this way because it seems somehow elegant... clever. But it really isn't worth the time and effort spent beating down problems in technologies that aren't bringing anything to the final solution.

Decision time: continue on with TripleO, trying the 'Deployed Server' methodology, or switch to a different deployment method entirely?

Deployed Server essentially means I pre-install CentOS, pre-set the networking (no undercloud providing Neutron-based DHCP), install the package repositories, install the python-heat-agent packages and then invoke 'openstack overcloud deploy'... sort of. The documentation gets a bit sketchy at this point about where the various IP addresses get specified. And in the end... what will the undercloud be bringing to the party?
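
For what it's worth, the per-node prep the deployed-server docs describe boils down to roughly this on each pre-installed CentOS box, after pointing it at the right repositories (package glob as I read the Ocata docs):

$ sudo yum -y install python-heat-agent*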

Alternate install options: OpenStack-Ansible (looks like it prefers Ubuntu and utilizes LXC), Puppet (I know Puppet pretty well), Kolla-Ansible, or going it manually.

Decisions, decisions... what I do know is that TripleO has too many moving parts and not enough soup-to-nuts walkthroughs. When it works, it just works. When it doesn't, you're left clueless.

