Now, as we are getting close to the dawn of a new year, I will finally present one of the most fun exploit chains I ever had the privilege to discover and develop.
As you know, the Switch has seen many developments on the hacking front since its release and I'm proud to have taken part in a large number of them alongside SciresM, plutoo, derrek, naehrwert, yellows8, TuxSH, fincs, WinterMute and many others.
But before we reached the comfortable plateau we are on now, there were many complex attempts to defeat the Switch's security and today we will be looking into one of the earliest successful attempts to exploit the system from a bottom-up approach: the nvhax exploit.
To fully understand the context of this write-up we must go back to April 2017, a mere month after the Switch's international release.
Back then, everybody was trying to push the limits of the not-so-secret Internet Browser that came bundled with the Switch's OS. You may recall that this browser was vulnerable to a public and very popular bug known as CVE-2016-4657, notorious for being part of the Pegasus exploit chain. This allowed us to take over the browser's process with minimal effort within a week of the Switch's release.
The next logical step would be to escalate outside the browser's sandbox and attempt to exploit other, more privileged, parts of the system and this is how our story begins...
Exploiting the browser became a very pleasant task after the release of PegaSwitch (https://github.com/reswitched/pegaswitch), a JS based toolkit that leverages vulnerabilities in the Switch's browser to achieve complex ROP chains.
Shortly after dumping the browser's binary from memory, documentation on the IPC system began to take form on the SwitchBrew wiki (https://switchbrew.org/wiki/Main_Page).
Before the generic IPC marshalling system now implemented in PegaSwitch, plenty of us began writing our own ways of talking with the Switch. One such example is in rop-rpc (https://github.com/xyzz/rop-rpc), another toolkit designed for running ROP chains on the Switch's browser.
I decided to write my own functions on top of the very first release of PegaSwitch as you can see below (please note that this is severely outdated):
Using this and the "bridge" system (PegaSwitch's way of calling arbitrary functions within the browser's memory space) I could now talk with other services accessible to the browser.
Since this was before smhax was discovered, I didn't know I could just bypass the restrictions imposed by the browser's NPDM so I focused exclusively on the services that the browser itself would normally use.
From these, nvservices immediately caught my attention due to the large amount of symbols left inside the browser's binary which, in turn, made black box analysis of the nvservices system module much easier. This also allowed me to document everything on the SwitchBrew wiki fairly quickly (see https://switchbrew.org/wiki/NV_services).
The nvservices system module provides a high level interface for the GPU (and a few other engines), abstracting away all the low level stuff that a regular application doesn't need to bother with. Its most important part is the nvdrv service family which, as the name implies, provides a communication channel for the NVIDIA drivers inside the nvservices system module.
You can easily see the parallels with L4T's (Linux for Tegra) source code but, for obvious reasons, in the Switch's OS the graphics drivers are isolated in a system module instead of being implemented in the kernel.
So, with a combination of reverse engineering and studying Tegra's source code I could steadily document the nvdrv command interface and, more importantly, how to reach the ioctl system that the driver revolves around. There are many ioctl commands for each device interface so this sounded like the perfect attack surface for exploiting nvservices.
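For readers unfamiliar with the ioctl system, here is a minimal sketch of how Linux-style ioctl request numbers are packed. It assumes the generic Linux layout (command number, magic byte, argument size, direction) carries over to the nvdrv interface, which follows from the L4T heritage but is an assumption here. The magic 'X' and command number 1 are hypothetical and do not refer to any real command.

```python
# Generic Linux ioctl request layout (assumed to carry over to nvdrv):
#   bits 0-7   command number
#   bits 8-15  magic byte identifying the driver
#   bits 16-29 size of the argument struct
#   bits 30-31 transfer direction
IOC_WRITE = 1  # caller -> driver
IOC_READ = 2   # driver -> caller


def ioc(direction, magic, nr, size):
    """Pack the four fields into a 32-bit ioctl request number."""
    assert 0 <= nr < 0x100 and 0 <= size < 0x4000
    return (direction << 30) | (size << 16) | (ord(magic) << 8) | nr


def iowr(magic, nr, size):
    """Equivalent of the _IOWR() macro: data flows in both directions."""
    return ioc(IOC_READ | IOC_WRITE, magic, nr, size)


def decode(request):
    """Split a request number back into (direction, magic, nr, size)."""
    return (
        (request >> 30) & 0x3,
        chr((request >> 8) & 0xFF),
        request & 0xFF,
        (request >> 16) & 0x3FFF,
    )


# Hypothetical command: magic 'X', number 1, 8-byte argument struct.
example = iowr("X", 1, 8)
print(hex(example))  # 0xc0085801
```

Decoding a request number this way is handy when fuzzing, since the size field tells you how large the argument buffer is expected to be.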
Over several weeks I did nothing but fuzz as many ioctl commands as I could reach and, eventually, I found the bugs that would form the core of what would become the nvhax exploit.
The very first bug I found was in the /dev/nvmap device interface. This interface's purpose is to provide a way for creating and managing memory containers that serve as backing memory for many other parts of the GPU system.
From the browser's perspective, accessing this device interface consists of the following steps:
- Open a service session with nvdrv:a (a variation of nvdrv, available only to applets such as the browser).
- Call the IPC command Initialize which supplies memory allocated by the browser to the nvservices system module.
- Call the IPC command Open on the /dev/nvmap interface.
- Submit ioctl commands by using the IPC command Ioctl.
- Close the interface with the IPC command Close.
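The steps above can be sketched with a toy session object. This is only a model of the call ordering, not the real IPC layer: the class MockNvdrvSession, its method names, the request number 0x1234 and the transfer memory size are all hypothetical.

```python
class MockNvdrvSession:
    """Toy stand-in for an nvdrv:a session; it only enforces call ordering."""

    def __init__(self):
        self.initialized = False
        self.open_fds = {}  # fd -> device path
        self.next_fd = 1
        self.log = []

    def initialize(self, transfer_memory_size):
        # The real command hands a transfer memory handle to nvservices.
        self.initialized = True
        self.log.append(("Initialize", transfer_memory_size))

    def open(self, path):
        if not self.initialized:
            raise RuntimeError("Initialize must be called first")
        fd = self.next_fd
        self.next_fd += 1
        self.open_fds[fd] = path
        self.log.append(("Open", path))
        return fd

    def ioctl(self, fd, request, payload):
        if fd not in self.open_fds:
            raise RuntimeError("unknown fd")
        self.log.append(("Ioctl", self.open_fds[fd], request))
        return bytes(len(payload))  # the mock just returns a zeroed buffer

    def close(self, fd):
        path = self.open_fds.pop(fd)
        self.log.append(("Close", path))


session = MockNvdrvSession()
session.initialize(0x100000)                 # hypothetical size
fd = session.open("/dev/nvmap")
reply = session.ioctl(fd, 0x1234, bytes(8))  # hypothetical request number
session.close(fd)
print([entry[0] for entry in session.log])
```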
In this case, while messing around with the /dev/nvmap interface I found a bug in the NVMAP_IOC_FREE ioctl command that would leak back a memory pointer from the nvservices memory space:
I later found out that a few others had also stumbled upon this bug, so I stashed it away for a while.
A few days later I was messing around with the /dev/nvhost-ctrl-gpu interface and got a weird crash in one of its ioctl commands: NVGPU_GPU_IOCTL_WAIT_FOR_PAUSE.
Reverse engineering the browser's code revealed that this ioctl was indeed present, but no code path could be taken to call it under normal circumstances. Furthermore, I was able to observe that this particular ioctl command would only take a struct with a single u64 as its argument.
After finding it in Tegra's source code (see https://github.com/arter97/android_kernel_nvidia_shieldtablet/blob/master/include/uapi/linux/nvgpu.h#L315) I was able to deduce that it was expecting a nvgpu_gpu_wait_pause_args struct which contains a single field: a u64 pwarpstate.
Turns out, pwarpstate is a pointer to a warpstate struct which contains 3 u64 fields: valid_warps, trapped_warps and paused_warps.
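As a quick sanity check on the sizes involved, the two structs can be mirrored with ctypes. The field names come from the description above; the Python class names are just for illustration.

```python
import ctypes


class Warpstate(ctypes.Structure):
    """Mirror of the warpstate struct: three u64 fields."""
    _fields_ = [
        ("valid_warps", ctypes.c_uint64),
        ("trapped_warps", ctypes.c_uint64),
        ("paused_warps", ctypes.c_uint64),
    ]


class NvgpuGpuWaitPauseArgs(ctypes.Structure):
    """Mirror of nvgpu_gpu_wait_pause_args: a single u64 holding a pointer."""
    _fields_ = [("pwarpstate", ctypes.c_uint64)]


print(ctypes.sizeof(Warpstate))              # 24
print(ctypes.sizeof(NvgpuGpuWaitPauseArgs))  # 8
```

So the ioctl argument itself is only 8 bytes, while each warpstate struct the pointer refers to is 24 bytes.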
Without going into much detail on how the GM20B (Tegra X1's GPU) works:
- The GM20B has 1 GPC (Graphics Processing Cluster).
- Each GPC has 2 TPCs (Texture Processing Clusters or Thread Processing Clusters, depending on context).
- Each TPC has 2 SMs (Streaming Multiprocessors) and each contains 8 processor cores.
- Each SM can run up to 128 "warps".
- A "warp" is a group of 32 parallel threads that runs inside a SM.
You can find the implementation of this ioctl command in Tegra's source code here: https://github.com/arter97/android_kernel_nvidia_shieldtablet/blob/master/drivers/gpu/nvgpu/gk20a/ctrl_gk20a.c#L398
As you can see, it builds a warpstate struct with the information and calls copy_to_user using pwarpstate as the destination address.
Since dealing directly with memory pointers from other processes is incompatible with the Switch's OS design, most ioctl commands were modified to take "in-line" arguments instead. However, NVGPU_GPU_IOCTL_WAIT_FOR_PAUSE was somehow forgotten and kept the original memory pointer based approach.
What this means in practice is that we now have an ioctl command that is trying to copy data directly using a memory pointer provided by the browser! However, since any pointer we pass from the browser is only valid to the browser's memory space, we need to leak memory from the nvservices system module to turn this into something even remotely useful.
Upon realizing this, I recalled the other bug I had found and tried to pass the pointer it leaked to NVGPU_GPU_IOCTL_WAIT_FOR_PAUSE. As expected, I no longer had a crash; instead, the command completed successfully!
Unfortunately, NVMAP_IOC_FREE leaks a pointer to a memory container allocated by nvservices using transfer memory. This means that, while the leaked address is valid in nvservices's memory space, it is impossible to find out where the actual system module's sections are located because transfer memory is also subject to ASLR.
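The difference between the two outcomes can be illustrated with a toy model of per-process address spaces. Everything here is hypothetical: the AddressSpace class, the addresses, and the 0xAA filler standing in for whatever the service actually writes. The only details taken from the text are that the service writes to a client-supplied pointer and that it writes 2 warpstate structs.

```python
class AddressSpace:
    """Toy model of one process's virtual memory: a list of mapped regions."""

    def __init__(self, name):
        self.name = name
        self.regions = []  # (base address, backing bytearray)

    def map(self, base, size):
        self.regions.append((base, bytearray(size)))

    def write(self, addr, data):
        for base, mem in self.regions:
            if base <= addr and addr + len(data) <= base + len(mem):
                offset = addr - base
                mem[offset:offset + len(data)] = data
                return
        raise MemoryError(f"{self.name}: unmapped address {addr:#x}")


def wait_for_pause(service_space, pwarpstate):
    """Mimics the unported ioctl: the service writes 2 warpstate structs
    (2 * 24 bytes) to whatever address the client put in pwarpstate."""
    service_space.write(pwarpstate, b"\xAA" * 48)


# All addresses below are made up for illustration.
nvservices = AddressSpace("nvservices")
nvservices.map(0x7100000000, 0x1000)  # stands in for memory mapped in nvservices

BROWSER_POINTER = 0x2000000000  # only meaningful inside the browser
LEAKED_POINTER = 0x7100000100   # stands in for the NVMAP_IOC_FREE leak

try:
    wait_for_pause(nvservices, BROWSER_POINTER)
    crashed = False
except MemoryError:
    crashed = True  # the service faults on the foreign pointer

wait_for_pause(nvservices, LEAKED_POINTER)  # completes without faulting
print(crashed)
```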
At this point, I decided to share the bug with plutoo and we began discussing potential ways to use it for further exploitation.
As I mentioned before, every time a session is initiated with the nvdrv service family, the client must call the IPC command Initialize and this command requires the client to allocate and submit a kind of memory container that the Switch calls Transfer Memory.
Transfer memory is allocated with the SVC 0x15 (svcCreateTransferMemory) which returns a handle that the client process can send over to other processes which in turn can use it to map that memory in their own memory space. When this is done, the memory range that backs the transfer memory becomes inaccessible to the client process until the other process releases it.
A few days pass and plutoo has an idea: what if you destroy the service session with nvdrv:a and dump the memory range that backs the transfer memory sent over with the Initialize command?
And that's how the Transfer Memory leak or transfermeme bug was found. You can find a more detailed write-up on this bug from daeken, who also found the same bug independently, here: https://daeken.svbtle.com/nintendo-switch-nvservices-info-leak
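The essence of the leak can be modeled in a few lines. This is a toy simulation under stated assumptions, not the real kernel or service behavior: the classes, the offset 0x40 and the stored pointer value are all invented for illustration.

```python
class TransferMemory:
    """Toy model of a transfer memory region owned by the client."""

    def __init__(self, size):
        self.backing = bytearray(size)
        self.borrowed = False  # True while the other process has it mapped

    def client_read(self):
        if self.borrowed:
            raise PermissionError("region is inaccessible while borrowed")
        return bytes(self.backing)


class VulnerableService:
    """Stands in for nvservices before the fix described later in this post."""

    def initialize(self, tmem):
        tmem.borrowed = True
        self.tmem = tmem

    def do_work(self):
        # The service uses the region as working memory; here it stores a
        # made-up 64-bit pointer value at a made-up offset.
        self.tmem.backing[0x40:0x48] = (0x7100001000).to_bytes(8, "little")

    def close_session(self):
        # The region goes back to the client without being cleared.
        self.tmem.borrowed = False


tmem = TransferMemory(0x100)
service = VulnerableService()
service.initialize(tmem)
service.do_work()

try:
    tmem.client_read()
    readable_while_borrowed = True
except PermissionError:
    readable_while_borrowed = False

service.close_session()
dump = tmem.client_read()
leaked = int.from_bytes(dump[0x40:0x48], "little")
print(hex(leaked))  # 0x7100001000
```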
With this new memory leak in hand I tried to blindly pass some pointers to NVGPU_GPU_IOCTL_WAIT_FOR_PAUSE and got mixed results (crashes, device interfaces not working anymore, etc.). But when I tried to pass the pointer leaked by NVMAP_IOC_FREE and dump the transfer memory afterwards, I could see that some data had changed.
As it turns out, the pointer leaked by NVMAP_IOC_FREE belongs to the transfer memory region and we can now use this to find out exactly what is being written by NVGPU_GPU_IOCTL_WAIT_FOR_PAUSE and where.
As expected, a total of 48 bytes were being written which, if you recall, make up the total space used by 2 warpstate structs (one for each SM). However, to my surprise, the contents had nothing to do with the "warps" and they kept changing on subsequent calls.
That's right, the warpstate structs were not initialized on the nvservices' side so now we have a 48 byte stack leak as well!
While this may sound convenient, it ended up being a massive pain in the ass due to how unstable the stack contents could be. But, of course, when there's a will there's a way...
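Comparing a transfer memory dump taken before the ioctl call with one taken after it is what reveals where the write landed. A minimal sketch of that comparison, using synthetic dumps and a hypothetical helper name, could look like this:

```python
def changed_ranges(before, after):
    """Return (offset, length) runs where two equally sized dumps differ."""
    runs = []
    start = None
    n = min(len(before), len(after))
    for i in range(n):
        if before[i] != after[i]:
            if start is None:
                start = i  # a new run of differing bytes begins here
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, n - start))
    return runs


# Synthetic example: 48 bytes change at a made-up offset of 0x80.
before = bytes(0x100)
after = bytearray(before)
after[0x80:0x80 + 48] = bytes(range(1, 49))
print(changed_ranges(before, bytes(after)))  # [(128, 48)]
```

With real dumps, a written byte that happens to equal the old value splits a run in two, so nearby runs may need to be merged by eye.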
Exploiting these bugs was very tricky...
The first idea I came up with was to try coping with the unreliable contents of the semi-arbitrary write from NVGPU_GPU_IOCTL_WAIT_FOR_PAUSE and just corrupt different objects that I could see inside the leaked transfer memory region. This had very limited results and led nowhere.
Luckily, by now, plutoo, derrek, naehrwert and yellows8 had succeeded in exploiting the Switch using a top-down approach: the famous glitch attack that went on to be presented at 34C3. So, with the actual nvservices binary now in hand, we could finally plan a proper exploit chain.
Working with plutoo, we found out that other ioctl commands could change the stack contents semi-predictably and we came up with this:
As you can see, we use the NVGPU_GPU_IOCTL_WAIT_FOR_PAUSE bug as-is (due to the last byte being almost always 0) to overwrite the RMOS_SET_PRODUCTION_MODE flag. This allows the browser to access the debug only /dev/nvhost-dbg-gpu and /dev/nvhost-prof-gpu device interfaces and use previously inaccessible ioctl commands.
This was particularly useful to gain access to NVGPU_DBG_GPU_IOCTL_REG_OPS which could be used to plant a nice ROP chain inside the transfer memory. However, pivoting the stack still required some level of control over the stack contents for the NVGPU_GPU_IOCTL_WAIT_FOR_PAUSE bug to work.
Many other similar methods can be used as well, but this always ended up with the same issue: the NVGPU_GPU_IOCTL_WAIT_FOR_PAUSE bug was just too unreliable.
Some weeks pass and SciresM comes up with an insane yet brilliant way of exploiting this:
With a combination of mass memory allocations to manipulate the transfer memory's base address, some clever null byte writes and object overlaps we are now able to build very powerful read and write primitives using the transfer memory region and thus gain the ability to copy memory between the browser process and nvservices. Achieving ROP is now way easier and surprisingly stable with the exploit chain working practically 9 out of 10 times.
By now, smhax had already been discovered, which is why nvdrv:t is being used in the code instead. However, this was purely experimental (to understand the different services' access levels) and is not a requirement. The exploit chain works without taking advantage of smhax or any other already patched bugs.
We finally escaped the browser and have full control over nvservices so, what should we do next? How about taking the entire system down? ;)
You may recall from gspwn (see https://www.3dbrew.org/wiki/3DS_System_Flaws#Standalone_Sysmodules) on the 3DS that GPUs are often a great place to look into when exploiting a system. Having this in mind since the beginning motivated me to attack nvservices in the first place and, fortunately, the Switch was no exception when it comes to properly securing a GPU.
After SciresM's incredible work on maturing ROP for nvservices, I began looking into what could be done with the GM20B inside the Switch. A large amount of research took place over the following weeks, combining the publicly available Tegra's source code and TRM with the envytools/nouveau project's code and my own reverse engineering of the nvservices system module. It was at this time that this incredibly enlightening quote from the Tegra X1's TRM was found:
If you watched the 34C3 Switch Security talk you probably remember this. If not, then I highly recommend at least re-reading the slides over here: https://switchbrew.github.io/34c3-slides/
I/O devices inside the Tegra X1's SoC are subject to what ARM calls the SMMU (System Memory Management Unit). The SMMU is simply a memory management unit that stands between a DMA capable input/output bus and the main memory chip. In the Tegra X1's and, consequently, the Switch's case, it is the SMMU that is responsible for translating accesses from the APB (Advanced Peripheral Bus) to the DRAM chips. Properly configuring the MC (Memory Controller) and locking the SMMU's page table behind the kernel effectively prevents any peripheral device from accessing more memory than it should.
Side note: on firmware version 1.0.0 it was actually possible to access the MC's MMIO region and thus completely disable the SMMU. This attack was dubbed mchammer and was presented at the 34C3 Switch Security talk by plutoo, derrek and naehrwert.
So, we now know that the GPU has its own MMU (accordingly named GMMU) and that it is capable of bypassing the SMMU entirely. How do we even access it?
There are many different ways to achieve this but, at the time and using the limited documentation available for the Maxwell GPU family, this is what I came up with:
- Scan nvservices' memory space using svcQueryMemory and find all memory blocks with the attribute IsDeviceMapped set. Odds are that any such block is mapped to an I/O device and thus the GPU.
- Locate the 0xFACE magic inside these blocks. This magic word is used as a signature for GPU channels so, if we find it, we found a GPU channel structure.
- Find the GPU channel's page table. Each GPU channel contains a page directory that we can traverse to locate and replace page table entries. Remember that these entries are used by the GMMU and not the SMMU, which means that not only do they follow a different format, but the addresses used by the GMMU also represent virtual GPU memory addresses.
- Patch GPU channel's page table entries with any memory address we want. The important part here is setting bit 31 in each page table entry as this tells the GMMU to access physical memory directly instead of going through the SMMU as usual.
- Get the GPU to access our target memory address via GMMU. I decided to use a rather obscure engine inside the GPU that envytools/nouveau calls the PEEPHOLE (see https://envytools.readthedocs.io/en/latest/hw/memory/peephole.html). This engine can be programmed by poking some GPU MMIO registers and provides a small, single word, window to read/write virtual memory covered by a particular GPU channel. Since we control the channel's page table and we've set bit 31 on each entry, any read or write going through the PEEPHOLE engine will access any DRAM address we want!
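The first steps of the list above can be sketched over a synthetic list of memory blocks. Several details are assumptions made only for this illustration: the bit position used for IsDeviceMapped, the magic being stored as a 16-bit little-endian value, and the MemoryBlock type itself. The page table entry is treated as an opaque integer, since only the role of bit 31 is described in the text.

```python
import struct
from dataclasses import dataclass

ATTR_DEVICE_MAPPED = 1 << 2                # assumed bit position of IsDeviceMapped
CHANNEL_MAGIC = struct.pack("<H", 0xFACE)  # assumed 16-bit little-endian encoding


@dataclass
class MemoryBlock:
    """Hypothetical record of one block as reported by a memory query."""
    base: int
    attributes: int
    data: bytes


def find_channel_candidates(blocks):
    """Return addresses of the magic inside device-mapped blocks only."""
    hits = []
    for block in blocks:
        if not block.attributes & ATTR_DEVICE_MAPPED:
            continue
        offset = block.data.find(CHANNEL_MAGIC)
        while offset != -1:
            hits.append(block.base + offset)
            offset = block.data.find(CHANNEL_MAGIC, offset + 1)
    return hits


def bypass_smmu(pte):
    """Set bit 31 so the GMMU accesses physical memory directly."""
    return pte | (1 << 31)


# Synthetic blocks: only the first one is device-mapped.
payload = bytes(16) + CHANNEL_MAGIC + bytes(14)
blocks = [
    MemoryBlock(base=0x1000, attributes=ATTR_DEVICE_MAPPED, data=payload),
    MemoryBlock(base=0x9000, attributes=0, data=payload),
]
print([hex(addr) for addr in find_channel_candidates(blocks)])  # ['0x1010']
print(hex(bypass_smmu(0x1234)))                                 # 0x80001234
```

A 2-byte magic will produce false positives in real memory, so each hit would still need to be validated against the surrounding structure.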
After all this, we now have a way to dump the entire DRAM... well, sort of.
An additional layer of protection is enforced at the MC level and that is the memory carveouts. These are physical memory ranges that can be completely isolated from direct memory access.
Since I first implemented all this in firmware version 2.0.0, dumping the entire DRAM only gave me every non-built-in system module and applet that was loaded into memory. However, SciresM later tried it on 1.0.0 and realized we could dump the built-in system modules there!
Turns out, while the kernel has always been protected by a generalized memory carveout, the built-in system modules were loaded outside of this carveout in firmware 1.0.0.
Additionally, the kernel itself would start allocating memory outside of the carveout region if necessary. So, by exhausting some kernel resource (such as service handles) up to the point where kernel objects would start showing up outside of the carveout region, SciresM was able to corrupt an object and take over the kernel as well.
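The effect of a carveout on direct memory access can be modeled as a simple range check. This is a toy sketch: the carveout bounds and the placement of the kernel and built-in modules are invented numbers that only mirror the relative layout described above.

```python
def overlaps_carveout(addr, size, carveouts):
    """True if [addr, addr + size) touches any (start, end) carveout range."""
    end = addr + size
    return any(addr < hi and end > lo for lo, hi in carveouts)


def dma_read(dram, addr, size, carveouts):
    """Model of a direct memory access that the MC blocks inside carveouts."""
    if overlaps_carveout(addr, size, carveouts):
        raise PermissionError(f"access to {addr:#x} blocked by carveout")
    return dram[addr:addr + size]


# Made-up layout: one carveout covering offsets 0x1000-0x2000 of a tiny DRAM.
CARVEOUTS = [(0x1000, 0x2000)]
dram = bytearray(0x4000)
dram[0x1800:0x1806] = b"KERNEL"  # kernel, inside the carveout
dram[0x3000:0x3007] = b"BUILTIN"  # built-in modules placed outside (1.0.0-like)

print(bytes(dma_read(dram, 0x3000, 7, CARVEOUTS)))  # b'BUILTIN'
try:
    dma_read(dram, 0x1800, 6, CARVEOUTS)
    kernel_readable = True
except PermissionError:
    kernel_readable = False
print(kernel_readable)  # False
```

In this model, any object that ends up outside the carveout range is reachable by direct memory access.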
From that point on, we could use jamais vu (see https://www.reddit.com/r/SwitchHacks/comments/7rq0cu/jamais_vu_a_100_trustzone_code_execution_exploit/) to defeat the Secure Monitor and later use warmboothax to defeat the entire boot chain. But, that wasn't the end...
I wanted to reach the built-in system modules on recent firmware versions as well, so SciresM and I cooked up a plan:
- Use smhax to load the creport system module in waiting state.
- Use gmmuhax to find and patch creport directly in DRAM.
- Launch a patched creport system module that would call svcDebugActiveProcess, svcGetDebugEvent and svcReadDebugProcessMemory with arguments controlled by us.
- Use the debug SVCs to dump all running built-in system modules.
This worked for getting all built-in system modules (except boot, which only runs once and ends up overwritten in DRAM)! Naturally, we didn't know about nspwn back then so all this was moot.
However, firmware version 5.0.0 fixed nspwn and suddenly all this was relevant again. We had to work around the fact that smhax was no longer available which required some very convoluted tricks using DRAM access to hijack other system modules.
To make matters worse, Switch units with new fuse patches for the well known RCM exploit were being shipped so the need for nvhax was now very real.
While the GMMU attack itself cannot be fixed (since it's a hardware flaw), some mitigations have been implemented such as creating a separate memory pool for nvservices.
As for nvhax, all 3 bugs mentioned in this write-up have now been fixed in firmware versions 6.0.0 (Transfer Memory leak) and 6.2.0 (NVGPU_GPU_IOCTL_WAIT_FOR_PAUSE and NVMAP_IOC_FREE bugs).
The fix for the Transfer Memory leak consists of simply tracking the address and size of the transfer memory region inside nvservices; every time a service handle is closed, the entire region is cleared before becoming available to the client again.
The NVGPU_GPU_IOCTL_WAIT_FOR_PAUSE bug was fixed by changing its implementation to match every other ioctl command: have the command take "in-line" parameters instead of a memory pointer. The command now returns the 2 warpstate structs directly which are now also properly initialized.
As for the NVMAP_IOC_FREE bug, it now takes into account the case where the client doesn't supply its own memory and prevents the nvservices' transfer memory pointer from being leaked back.
This was definitely one of the most fun exploits I ever worked on, but the absolute best part was having the opportunity to develop it alongside such talented individuals as plutoo, SciresM and many others.
Working on this was a blast and knowing how long it managed to remain unpatched was surprisingly amusing.
As promised, an updated version of this exploit chain written for firmware versions 4.1.0, 5.x and 6.0.0 will be progressively merged into the PegaSwitch project.
As usual, have fun!