Travel back in time with Junos snapshots

Traveling back in time is not yet a reality…unless you look at a Junos device 🙂

Imagine something really bad happened to your router and you would love to go back to a functioning scenario. In that case, the way to go is relying on snapshots!

Snapshots are pretty much like a time machine. They take a “picture” of the router at a given moment and allow you to restore that exact moment when needed.

Including snapshots in your daily device management is fundamental!

Moreover, it is important to understand how snapshots work and which types of snapshots we have available…yep, there are different kinds of snapshots.

Let’s start from that. There are two types of snapshots:

  • non-recovery snapshots
  • recovery snapshots

Non-recovery snapshots are probably the ones most people are familiar with. They are stored within the junos volume (/dev/gpt/junos), the one Junos boots and runs from.
When taken, a non-recovery snapshot references the set of packages and the configuration present at creation time.
It is possible to take multiple non-recovery snapshots. We might see them as the equivalent of the “VM snapshots” we have in ESXi or KVM.
We can instruct Junos to reboot and boot from one of these snapshots.

On the other hand, the recovery snapshot is stored in a totally different volume: the OAM volume.
It also references the set of packages and the configuration present when it was taken.
However, there are some differences.
First, we can only have one recovery snapshot, not multiple ones.
Second, as already mentioned, it is stored in a different location.

This second aspect is key to understanding the difference between recovery and non-recovery snapshots.

Non-recovery snapshots reside in the “normal” Junos volume, on the router SSD. That is the volume the router uses by default to load Junos and function.

The recovery snapshot, instead, resides on a different medium. We will not find it on the SSD but on a separate flash memory. Roughly speaking, the recovery snapshot is a disk dump of the junos volume on another medium: the OAM volume.
This type of snapshot is a sort of last resort in case something really bad happens. By really bad, we mean scenarios where the SSD gets damaged and Junos can no longer start. The SSD can get damaged in different ways, physical or logical. No matter the exact fault, upon that kind of failure, the router will mount the OAM volume and boot from it, using the recovery snapshot.

For this reason, it is important to keep the recovery snapshot updated. By that, I mean that after a release upgrade, we should also take a new recovery snapshot so that it contains the new release.

Keeping the recovery snapshot out of sync with the installed Junos release is risky. Let’s assume the device comes to your lab with a recovery snapshot holding release X. Then, you upgrade Junos to release Y but you do not create a new recovery snapshot. This means the recovery snapshot still holds the older release. Let’s assume the new release Y allows you to use a new MPC card that was unsupported with release X. Now, a severe power outage causes your router to go down and, when powering up again, it boots from the OAM volume. As a result, the router will run Junos release X, which cannot make the new MPC work properly. This means that all the interfaces of that card will be down, leading to massive network issues.
All of this could have been avoided simply by having the recovery snapshot aligned with the current release.
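
To make the “release X versus release Y” check concrete, here is a minimal Python sketch that compares two release strings. It is only an illustration: the helper names are made up, and the regular expression assumes standard “R” releases written like 16.1R7-S4.1 or 16.1R6.7; other release types would need their own handling.

```python
import re

# Assumption: only standard "R" releases, e.g. 16.1R7-S4.1 or 16.1R6.7.
_RELEASE_RE = re.compile(r"^(\d+)\.(\d+)R(\d+)(?:-S(\d+))?(?:\.(\d+))?$")


def parse_release(release):
    """Turn a release string into a comparable tuple (hypothetical helper)."""
    match = _RELEASE_RE.match(release.strip())
    if match is None:
        raise ValueError(f"unsupported release format: {release!r}")
    # A missing service-release or build number counts as 0.
    return tuple(int(part) if part else 0 for part in match.groups())


def recovery_is_behind(running, recovery):
    """Return True when the recovery snapshot holds an older release."""
    return parse_release(recovery) < parse_release(running)
```

For instance, recovery_is_behind("18.4R1-S7.1", "16.1R7-S4.1") returns True, which is exactly the situation where a fresh recovery snapshot should be taken.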

A non-recovery snapshot, instead, might be used to simply restore a previous state; there is no need to face failures like power outages, hardware faults and so on 🙂 For example, if a release upgrade did not go well, we can restore the system to its pre-upgrade state by loading a non-recovery snapshot.

If you think about it, at least in my opinion, having meaningful recovery snapshots becomes fundamental!

Let’s see how to work with snapshots.

The following command shows all the available snapshots:

root@router> show system snapshot

Non-recovery snapshots:
Snapshot snap.20180911.122327:
Location: /packages/sets/snap.20180911.122327
Creation date: Sep 11 12:23:27 2018
Junos version: 16.1R6.7

Snapshot snap.20181115.152401:
Location: /packages/sets/snap.20181115.152401
Creation date: Nov 15 15:24:01 2018
Junos version: 16.1R7.7

Snapshot snap.20200615.141312:
Location: /packages/sets/snap.20200615.141312
Creation date: Jun 15 14:13:12 2020
Junos version: 16.1R7-S4.1

Snapshot snap.20200615.152129:
Location: /packages/sets/snap.20200615.152129
Creation date: Jun 15 15:21:29 2020
Junos version: 18.4R1-S7.1

Total non-recovery snapshots: 4

Recovery Snapshots:
Snapshots available on the OAM volume:
recovery.ufs
Date created: Mon Jun 15 14:17:47 CEST 2020
Junos version: 16.1R7-S4.1

Total recovery snapshots: 1

The output lists both non-recovery (we can have more than one) and recovery (we can only have one) snapshots.
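
If you collect this output with a script, it can be handy to turn it into structured data. The following Python sketch parses text laid out like the output above. The function name and the dictionary keys are my own choices, and the parser assumes the exact layout shown here, which may differ on other releases.

```python
SAMPLE = """\
Non-recovery snapshots:
Snapshot snap.20200615.141312:
Location: /packages/sets/snap.20200615.141312
Creation date: Jun 15 14:13:12 2020
Junos version: 16.1R7-S4.1

Snapshot snap.20200615.152129:
Location: /packages/sets/snap.20200615.152129
Creation date: Jun 15 15:21:29 2020
Junos version: 18.4R1-S7.1

Total non-recovery snapshots: 2

Recovery Snapshots:
Snapshots available on the OAM volume:
recovery.ufs
Date created: Mon Jun 15 14:17:47 CEST 2020
Junos version: 16.1R7-S4.1

Total recovery snapshots: 1
"""


def parse_snapshots(output):
    """Split 'show system snapshot' text into non-recovery and recovery entries."""
    result = {"non_recovery": [], "recovery": []}
    section = None
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        if not line:
            continue
        lower = line.lower()
        if lower.startswith("non-recovery snapshots"):
            section, current = "non_recovery", None
        elif lower.startswith("recovery snapshots"):
            section, current = "recovery", None
        elif lower.startswith("total "):
            current = None  # summary line closes the current entry
        elif section == "non_recovery" and line.startswith("Snapshot "):
            current = {"name": line[len("Snapshot "):].rstrip(":")}
            result[section].append(current)
        elif section == "recovery" and line.endswith(".ufs"):
            current = {"name": line}
            result[section].append(current)
        elif current is not None and ":" in line:
            # "Key: value" attribute of the entry being read
            key, value = line.split(":", 1)
            current[key.strip()] = value.strip()
    return result


parsed = parse_snapshots(SAMPLE)
print(parsed["recovery"][0]["Junos version"])
```

SAMPLE is a trimmed copy of the output above (two non-recovery snapshots instead of four), so the last line prints the release held by the recovery snapshot.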

It is possible to delete a non-recovery snapshot:

root@router> request system snapshot delete snap.20200615.141312
NOTICE: Snapshot 'snap.20200615.141312' deleted successfully

A key command is the one that creates the recovery snapshot. The suggestion is to create it on both routing engines (if you have a dual-RE system):

root@router> request system snapshot recovery routing-engine both
re0:
--------------------------------------------------------------------------
Creating image ...
Compressing image ...
Image size is 2682MB
Recovery snapshot created successfully

re1:
--------------------------------------------------------------------------
Creating image ...
Compressing image ...
Image size is 2682MB
Recovery snapshot created successfully

If you need to load the recovery snapshot, simply run:

root@router> request system recover oam-volume

It might happen that snapshot creation fails with this error:

ERROR: The OAM volume is too small to store a snapshot

In this case, start a shell and check the following folder:

root@MX1-NAT44-RE0:/var/home/admin # cd /packages/sets/active/optional/
root@MX1-NAT44-RE0:/packages/sets/active/optional # ls -alth
total 12
drwxr-xr-x  3 root  wheel   512B Jun 15  2020 .
lrwxr-xr-x  1 root  wheel    73B Jun 15  2020 jpfe-wrlinux9 -> /packages/db/jpfe-wrlinux9-x86-32-20200513.174938_builder_junos_184_r1_s7
drwxr-xr-x  4 root  wheel   2.0K Jun 15  2020 ..
lrwxr-xr-x  1 root  wheel    71B Jun 15  2020 jpfe-MXSPC3 -> /packages/db/jpfe-MXSPC3-x86-32-20200513.174938_builder_junos_184_r1_s7
lrwxr-xr-x  1 root  wheel    75B Jun 15  2020 junos-appidd-mx -> /packages/db/junos-appidd-mx-x86-32-20200513.174938_builder_junos_184_r1_s7
lrwxr-xr-x  1 root  wheel    67B Jun 15  2020 jail-runtime -> /packages/db/jail-runtime-x86-32-20200430.3cd74ef_builder_stable_11
lrwxr-xr-x  1 root  wheel    40B Jun 15  2020 junos-install-mx-x86-64 -> /packages/db/junos-mx-x86-64-18.4R1-S7.1
lrwxr-xr-x  1 root  wheel    68B Jun 15  2020 sflow-mx -> /packages/db/sflow-mx-x86-32-20200513.174938_builder_junos_184_r1_s7
lrwxr-xr-x  1 root  wheel    74B Jun 15  2020 junos-secintel -> /packages/db/junos-secintel-x86-32-20200513.174938_builder_junos_184_r1_s7
lrwxr-xr-x  1 root  wheel    76B Jun 15  2020 junos-runtime-mx -> /packages/db/junos-runtime-mx-x86-32-20200513.174938_builder_junos_184_r1_s7
drwxr-xr-x  2 root  wheel   512B Jun 15  2020 boot
lrwxr-xr-x  1 root  wheel    77B Jun 15  2020 junos-net-mtx-prd -> /packages/db/junos-net-mtx-prd-x86-64-20200513.174938_builder_junos_184_r1_s7
lrwxr-xr-x  1 root  wheel    76B Jun 15  2020 junos-modules-mx -> /packages/db/junos-modules-mx-x86-64-20200513.174938_builder_junos_184_r1_s7
lrwxr-xr-x  1 root  wheel    73B Jun 15  2020 junos-libs-mx -> /packages/db/junos-libs-mx-x86-64-20200513.174938_builder_junos_184_r1_s7
lrwxr-xr-x  1 root  wheel    82B Jun 15  2020 junos-libs-compat32-mx -> /packages/db/junos-libs-compat32-mx-x86-64-20200513.174938_builder_junos_184_r1_s7
lrwxr-xr-x  1 root  wheel    87B Jun 15  2020 junos-dp-crypto-support-mtx -> /packages/db/junos-dp-crypto-support-mtx-x86-32-20200513.174938_builder_junos_184_r1_s7
lrwxr-xr-x  1 root  wheel    76B Jun 15  2020 junos-daemons-mx -> /packages/db/junos-daemons-mx-x86-64-20200513.174938_builder_junos_184_r1_s7
lrwxr-xr-x  1 root  wheel    36B Jun 15  2020 jsdn -> /packages/db/jsdn-x86-32-18.4R1-S7.1
lrwxr-xr-x  1 root  wheel    69B Jun 15  2020 jpfe-X960 -> /packages/db/jpfe-X960-x86-32-20200513.174938_builder_junos_184_r1_s7
lrwxr-xr-x  1 root  wheel    66B Jun 15  2020 jpfe-X -> /packages/db/jpfe-X-x86-32-20200513.174938_builder_junos_184_r1_s7

There, delete any files left over from old releases (e.g. packages from a 15/16 release).
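
Spotting leftovers by eye in a long listing is error-prone, so here is a small Python sketch that lists the links whose target does not carry any marker of the current release. It deletes nothing. The function name is hypothetical and the markers are simply the two strings visible in the listing above (junos_184_r1_s7 and 18.4R1-S7.1); anything the sketch reports still has to be reviewed by hand.

```python
def links_to_review(ls_lines, expected_markers):
    """Return (name, target) pairs whose target carries none of the markers."""
    flagged = []
    for line in ls_lines:
        if " -> " not in line:
            continue  # not a symlink line (e.g. '.', '..', 'boot')
        left, target = line.rsplit(" -> ", 1)
        name = left.split()[-1]  # last column before the arrow is the link name
        if not any(marker in target for marker in expected_markers):
            flagged.append((name, target.strip()))
    return flagged


# Three lines copied from the listing above.
LINES = [
    "lrwxr-xr-x  1 root  wheel    36B Jun 15  2020 jsdn -> /packages/db/jsdn-x86-32-18.4R1-S7.1",
    "lrwxr-xr-x  1 root  wheel    67B Jun 15  2020 jail-runtime -> /packages/db/jail-runtime-x86-32-20200430.3cd74ef_builder_stable_11",
    "drwxr-xr-x  2 root  wheel   512B Jun 15  2020 boot",
]
# Assumption: these two strings identify the current release in package names.
MARKERS = ("junos_184_r1_s7", "18.4R1-S7.1")

for name, target in links_to_review(LINES, MARKERS):
    print(name, "->", target)
```

Note that jail-runtime is reported only because its name carries no release marker, which does not prove it is stale. That is why the output is a review list and not a delete list.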

Just as we can load the recovery snapshot, we can load a non-recovery snapshot:

root@router> request system snapshot load <name>

As said before, the recovery snapshot is stored on different media: the OAM volume.

Let’s see how we can locate it.

First, we run a shell as root:

root@router> start shell user root
Password:
root@router:/var/home/admin #

Next, we mount the OAM volume and look for the snapshot file:

root@router:/var/home/admin # mount /dev/gpt/oam /oam

root@router:/var/home/admin # ls -la /oam
total 36
drwxr-xr-x   9 root  wheel   512 Jan 22 15:21 .
drwxr-xr-x  23 root  wheel   512 Jun 15  2020 ..
drwxr-xr-x   4 root  wheel  1024 Jan 22 15:22 boot
dr-xr-xr-x   2 root  wheel   512 Sep 10  2018 dev
dr-xr-xr-x   2 root  wheel   512 Sep 10  2018 etc
drwxr-xr-x   2 root  wheel   512 Sep 10  2018 mnt
drwxr-xr-x   2 root  wheel   512 Jan 22 15:23 snapshot
drwxrwxrwt   2 root  wheel   512 Sep 10  2018 tmp
drwxr-xr-x   2 root  wheel   512 Sep 10  2018 var

root@router:/var/home/admin # ls -la /oam/snapshot/
total 2747692
drwxr-xr-x  2 root  wheel         512 Jan 22 15:23 .
drwxr-xr-x  9 root  wheel         512 Jan 22 15:21 ..
-rw-r--r--  1 root  wheel          12 Jan 22 15:23 VERSION
-rwxr-xr-x  1 root  wheel  2812899328 Jan 22 15:22 recovery.ufs.uzip

At the end, remember to unmount the OAM volume:

root@router:/var/home/admin # umount /dev/gpt/oam

Finally, let’s think about how snapshots might be included in our maintenance/management procedures.

When upgrading the release we might follow these stages:

  • prepare new release packages
  • take non-recovery snapshot
  • take recovery snapshot
  • upgrade release
  • verify everything is working (if not you can load the previous non-recovery snapshot)
  • take non-recovery snapshot
  • take recovery snapshot
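
The stages above can be written down as an ordered checklist, for example to feed a runbook or an automation tool. The Python sketch below is only that, a checklist builder. Beyond the commands shown in this post, it assumes that a plain "request system snapshot" takes a non-recovery snapshot and that the upgrade is done with "request system software add"; the package path in the usage example is hypothetical. It also ignores the extra care a dual-RE upgrade needs.

```python
def upgrade_plan(package_path, dual_re=False):
    """Return the upgrade stages as (stage, CLI command or None) pairs."""
    recovery = "request system snapshot recovery"
    if dual_re:
        recovery += " routing-engine both"
    # Assumption: with no options, this takes a non-recovery snapshot.
    non_recovery = "request system snapshot"
    return [
        ("prepare new release packages", None),  # manual step
        ("take non-recovery snapshot", non_recovery),
        ("take recovery snapshot", recovery),
        ("upgrade release", f"request system software add {package_path}"),
        ("verify everything is working", None),  # manual step
        ("take non-recovery snapshot", non_recovery),
        ("take recovery snapshot", recovery),
    ]


# Hypothetical package path, for illustration only.
for stage, command in upgrade_plan("/var/tmp/new-release.tgz", dual_re=True):
    print(f"{stage}: {command or '(manual)'}")
```

Keeping the manual stages in the list, with no command attached, makes it harder to skip the verification before the post-upgrade snapshots are taken.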

During normal operations and daily routines, we might think of:

  • taking recovery snapshots regularly (once a week, along with another tool backing up configuration)
  • taking snapshots upon any hardware change (e.g. new cards)
  • taking snapshots upon the introduction of new services
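
To check the “once a week” habit automatically, a script can look at the “Date created” value of the recovery snapshot and tell whether it is older than a threshold. Here is a minimal Python sketch; the function names and the seven-day default are my assumptions, and the time zone name is simply dropped, so the comparison is approximate by a few hours at most.

```python
from datetime import datetime, timedelta


def parse_created(date_text):
    """Parse 'Mon Jun 15 14:17:47 CEST 2020', ignoring the time zone name."""
    parts = date_text.split()
    if len(parts) == 6:
        del parts[4]  # drop the zone abbreviation (e.g. CEST)
    return datetime.strptime(" ".join(parts), "%a %b %d %H:%M:%S %Y")


def is_stale(date_text, now, max_age=timedelta(days=7)):
    """Return True when the snapshot is older than max_age at time 'now'."""
    return now - parse_created(date_text) > max_age


created = "Mon Jun 15 14:17:47 CEST 2020"
print(is_stale(created, now=datetime(2020, 6, 30)))
```

Passing "now" explicitly keeps the function easy to test; in a real script it would typically be datetime.now().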

The key concept behind all those considerations is “try to keep your snapshots as close as possible to the current situation of your router so that, upon failures, you can restore your device to a status which is close to the target one”.

This is important for at least two reasons:

  • even after booting from the OAM volume, the device and its configured services should work
  • it will not require a lot of effort to bring the device to the desired status (this is easier if additional procedures like “regular configuration backup” are in place, as suggested above)

So, what now? Simple, take snapshots!

Ciao
IoSonoUmberto