Friday, 2 March 2012

A Partial Technique Against ASLR - Multiple O/Ss

Overview

Since the advent of Address Space Layout Randomization (ASLR), finding new techniques that weaken its effectiveness has been a constant game of cat and mouse. Historically, successful attacks against full ASLR implementations have involved one of the following:
  • Flaws in the implementation resulting in address bias.
  • A secondary memory revelation vulnerability leaking the addresses that are randomized by ASLR.
  • Partial overwrites leading to the ability to do relative addressing.
We're happy to reveal a small bit of internal research which shows a potential technique, at least in the lab, which works across multiple operating systems. 

For this technique to be useful, you will need:
  • A 32bit ASLR enabled binary in which all libraries are also randomized.
  • The ability to cause excessive memory allocations either via normal operation or a bug.
  • The ability to cause a dynamically linked (Windows) or shared (Linux) library to load at your time of choosing.
  • Minimal other activity in the application - although some can be tolerated.
  • Sufficient RAM or swap to allocate a majority of user space for the target process (64bit operating systems can help here due to typically increased available physical RAM).
We recognize this is quite a list, where 'the moon on a stick' wouldn't be out of place. As a result, we think our findings fall into the 'interesting but contrived' bucket.

Research

To show how we discovered this particular corner case and what it is, take the following Windows test case:

#include "stdafx.h"
#include <Windows.h>
#include <stdio.h>

int _tmain(int argc, _TCHAR* argv[])
{
    bool bShown=false;
    int intCount=0;
    LPVOID intAddress=0;
    LPVOID intLastAddress=0;
    while(1){
        // fixed-size request from the default process heap, no flags
        intAddress=HeapAlloc(GetProcessHeap(),0,3096);
        if(intAddress == NULL){
            // allocation failed: print the last good address, the error and the count
            fprintf(stdout,"0x%p - %lu (%d)\n",intLastAddress,GetLastError(),intCount);
            return 0;
        }
        if(!bShown){
            // first pass only: heap, stack and code addresses
            fprintf(stdout,"0x%p 0x%p 0x%p\n",intAddress,(void *)&bShown,(void *)&_tmain);
        }
        bShown=true;
        intCount++;
        intLastAddress=intAddress;
    }
    return 0;
}

and the following test case on Linux (note: we use MAP_NORESERVE because we were running within a virtual machine with limited real memory; we recognize that dlmalloc etc. may skew the results):

#include <stdlib.h>
#include <stdio.h>
#include <stdbool.h>
#include <errno.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>

int main(void){
    void *vdFoo=0;
    void *vdLast=0;
    bool bShown=false;
    int intCount=0;
    while(1){
        // reserve address space only: no access rights, no swap reservation
        vdFoo=mmap((void *)NULL,3096,PROT_NONE,MAP_PRIVATE|MAP_NORESERVE|MAP_ANON,-1,0);
        if(bShown==false){
            // first pass only: mapping, stack and code addresses
            fprintf(stdout," %p,%p,%p\n",vdFoo,(void *)&bShown,(void *)&main);
            bShown=true;
        }
        if(vdFoo==MAP_FAILED){
            // mmap failed: print the last good address, errno and the count
            fprintf(stdout,"%p - %d (%d)\n",vdLast,errno,intCount);
            return 0;
        } else {
            vdLast=vdFoo;
        }
        intCount++;
    }
}

While contrived, they allow us to demonstrate the initial indicator. Now if we compile as a 32bit process and run the Windows version on a 64bit operating system a number of times we get the following:


The same test on Linux returns:

If you review the output from both platforms, you can observe:
  • Stack, heap and code locations are all randomized between runs as expected.
  • The number of allocations prior to failing to allocate are variable.
  • The last address allocated prior to failing to allocate is the same.
Yes, you read that last point correctly: the last heap address prior to allocations failing (i.e. exhaustion of the process's virtual address space) is always the same. Although we got terribly excited by this behaviour, we recognized these test cases were not representative of the real world. What application only ever allocates the same size? We should also point out that on Windows, while the last address is consistent across runs, it changes across reboots. Still, it convinced us to investigate the idea further.
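
For anyone wanting to see the shape of this observation without driving a machine into exhaustion, a capped variant is sketched below. It is not the code used for the results above: probe_mappings and release_mappings are hypothetical helper names, and the cap is an assumption added for safety. Only with the cap removed on a 32bit build would a run reach the final address discussed above.

```c
#define _DEFAULT_SOURCE  /* exposes MAP_ANONYMOUS and MAP_NORESERVE on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: reserve up to `max` regions of `len` bytes each and
 * store the returned addresses in `out`. Returns how many succeeded. */
size_t probe_mappings(void **out, size_t max, size_t len)
{
    size_t n = 0;
    assert(out != NULL && len > 0);
    while (n < max) {
        void *p = mmap(NULL, len, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (p == MAP_FAILED)
            break;              /* stop at the first failure */
        out[n++] = p;
    }
    return n;
}

/* Release everything probe_mappings() handed out. */
void release_mappings(void **maps, size_t count, size_t len)
{
    for (size_t i = 0; i < count; i++)
        munmap(maps[i], len);
}
```

Calling probe_mappings(maps, 1000, 3096) fills maps with up to 1000 addresses; printing maps[0] and the last entry with %p across several runs shows the randomized starting point.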

So we next modified our test case to use random allocation sizes. On Windows the test case became:

int intSize = rand() % (5000 - 3000 + 1);  // evaluates to a size between 0 and 2000 bytes
intAddress=HeapAlloc( GetProcessHeap(),NULL,intSize);

and on Linux:

int intSize=rand() % (5000 - 3000 + 1);  
  vdFoo=mmap((void *)NULL,intSize,PROT_NONE,MAP_PRIVATE|MAP_NORESERVE|MAP_ANON,-1,0);  

We also made a slight adjustment in the Linux case, as we were getting an invalid parameter error quite regularly and thus an early failure when testing. On Windows we didn't experience this, as we would always receive ERROR_NOT_ENOUGH_MEMORY (0x08) in response to our allocation if it failed. To resolve this problem on Linux we added some logic around the testing for failure conditions to only catch memory exhaustion. The full modified Linux (POSIX) test case looked like this:

#include <stdlib.h>
#include <stdio.h>
#include <stdbool.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>

int main(int argc, char **argv){
    void *vdFoo=0;
    void *vdLast=0;
    bool bShown=false;
    int intCount=0;
    while(1){
        // evaluates to a size between 0 and 2000 bytes
        int intSize=rand() % (5000 - 3000 + 1);
        vdFoo=mmap((void *)NULL,intSize,PROT_NONE,MAP_PRIVATE|MAP_NORESERVE|MAP_ANON,-1,0);
        if(bShown==false){
            // first pass only: mapping, stack and code addresses
            fprintf(stdout,"%p,%p,%p\n",vdFoo,(void *)&bShown,(void *)&main);
            bShown=true;
        }
        if(vdFoo==MAP_FAILED){
            // sometimes mmap fails with an invalid argument error because of
            // the random size (a zero length is rejected); this hacks around
            // that by stopping only on errno 1, which we treat as exhaustion
            if(errno==1){
                fprintf(stdout,"%p - %d (%d)\n",vdLast,errno,intCount);
                return 0;
            }
        } else {
            vdLast=vdFoo;
            intCount++;
        }
    }
}
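
The invalid argument failures are consistent with the size expression: rand() % (5000 - 3000 + 1) evaluates to a value between 0 and 2000, and Linux rejects a zero-length mmap request with EINVAL. The sketch below shows one way to keep sizes inside a 3000 to 5000 byte range, assuming that range was the aim, and to hand errno back to the caller so it can be compared by name. size_in_range and try_reserve are hypothetical helpers, not part of the test cases above.

```c
#define _DEFAULT_SOURCE  /* exposes MAP_ANONYMOUS and MAP_NORESERVE on glibc */
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/mman.h>

/* Hypothetical helper: pick a size in [lo, hi] inclusive, so never zero
 * as long as lo > 0. */
size_t size_in_range(size_t lo, size_t hi)
{
    assert(lo > 0 && lo <= hi);
    return lo + (size_t)rand() % (hi - lo + 1);
}

/* Hypothetical helper: attempt one reservation. On failure return NULL and
 * store errno in *err so the caller can tell EINVAL from other failures. */
void *try_reserve(size_t len, int *err)
{
    void *p = mmap(NULL, len, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED) {
        *err = errno;
        return NULL;
    }
    *err = 0;
    return p;
}
```

With these, try_reserve(0, &err) returns NULL with err set to EINVAL, while try_reserve(size_in_range(3000, 5000), &err) never passes a zero length.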

These examples are hopefully slightly more representative of realistic scenarios. Using these test cases we started seeing variation in the last successfully allocated heap address on Windows, close to the expected range, with 134 different heap addresses prior to failure. On Linux we continued to see that the last successfully allocated address was 0x00010000. The Windows result, while initially disappointing, didn't preclude the possibility that entropy is heavily reduced in low-memory situations when late loading a dynamically linked or shared library, while on Linux, based on these results, it should have been a given.

So we started thinking how could this potentially help us in the real world? We thought if the following criteria could be satisfied then it might head towards a practical application:
  • A process can be crashed then re-spawned or forked fresh OR details of current total memory use obtained.
  • Memory can be allocated in a semi controlled fashion.
  • You know the rough number of allocations required to exhaust nearly all memory from your known state.
  • You can cause or trigger a library to be loaded or bound at a point of your choosing.
The code below satisfies those requirements and serves as an example on Windows:

#include "stdafx.h"
#include <Windows.h>
#include <stdio.h>
#include <stdlib.h>

// usage: SprayDontPray.exe [Allocations] [DLL]
int _tmain(int argc, _TCHAR* argv[])
{
    bool bShown=false;
    int intCount=0;
    LPVOID intAddress=0;
    LPVOID intLastAddress=0;
    while(1){
        // evaluates to a size between 0 and 2000 bytes
        int intSize = rand() % (5000 - 3000 + 1);
        intAddress=HeapAlloc(GetProcessHeap(),0,intSize);
        if(intAddress == NULL){
            // allocation failed: print the last good address, the error and the count
            fprintf(stdout,"0x%p - %lu (%d)\n",intLastAddress,GetLastError(),intCount);
            return 0;
        }
        if(!bShown){
            // first pass only: heap, stack and code addresses
            fprintf(stdout,"0x%p 0x%p 0x%p\n",intAddress,(void *)&bShown,(void *)&_tmain);
        }
        bShown=true;
        intCount++;
        if(argc > 2){
            // once the requested number of allocations is reached, load the DLL late
            if(intCount==_wtoi(argv[1])){
                HMODULE hModule = LoadLibrary(argv[2]);
                if(hModule != NULL){
                    // print the address of the export named "Function"
                    VOID *vdProc = (VOID *)GetProcAddress(hModule,"Function");
                    fprintf(stdout,"0x%p\n",vdProc);
                } else {
                    fwprintf(stdout,L"couldn't load %s - %lu\n",argv[2],GetLastError());
                }
            }
        }
        intLastAddress=intAddress;
    }
    return 0;
}

We ran the test case on Windows three times (rebooting in between to satisfy Windows' once-per-boot library randomization), using a variety of allocation counts before loading our DLL. We got the following (SprayDontPray.exe [Allocations] [DLL]):


Addresses Across Reboots and Variable Allocations Before Delayed Loading of a DLL
We're not claiming these results are statistically significant, but across this small data set the function was at the same address when doing late loading while approaching the limit of available virtual address space. It's also worth noting that the test cases failed after between 2,077,554 and 2,077,930 allocations. Helpfully for real world application of this technique, the window is wide: the function ended up at the same address when the DLL was loaded after anywhere between ~1,500,000 and ~2,070,000 allocations.
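
To put those figures in proportion, the arithmetic can be sketched as below. The constants are the counts quoted above; the helper names are hypothetical. Taking the lowest failure count as the reference, the window runs from roughly 72% to 99.6% of the way to exhaustion.

```c
#include <assert.h>

/* Counts quoted in the text for the Windows runs. */
#define FAIL_MIN   2077554L  /* lowest allocation count at which a run failed */
#define WINDOW_LO  1500000L  /* approximate lowest count giving the stable address */
#define WINDOW_HI  2070000L  /* approximate highest count giving the stable address */

/* Hypothetical helper: how far towards exhaustion a given count is, in percent. */
double percent_of_exhaustion(long count)
{
    assert(count >= 0);
    return 100.0 * (double)count / (double)FAIL_MIN;
}

/* Hypothetical helper: non-zero if loading at `count` falls inside the window. */
int in_load_window(long count)
{
    return count >= WINDOW_LO && count <= WINDOW_HI;
}
```

In practice this means an attacker's estimate of the allocation count can be off by several hundred thousand and still land in the window, at least on the data set above.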

On Linux we modified our test case to be the following:

#include <stdlib.h>
#include <stdio.h>
#include <stdbool.h>
#include <dlfcn.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>

// usage: ./aslr [allocations] [shared library]
int main(int argc, char **argv){
    void *vdFoo=0;
    void *vdLast=0;
    bool bShown=false;
    int intCount=0;
    while(1){
        // evaluates to a size between 0 and 2000 bytes
        int intSize=rand() % (5000 - 3000 + 1);
        vdFoo=mmap((void *)NULL,intSize,PROT_NONE,MAP_PRIVATE|MAP_NORESERVE|MAP_ANON,-1,0);
        if(bShown==false){
            // first pass only: mapping, stack and code addresses
            fprintf(stdout,"%p,%p,%p\n",vdFoo,(void *)&bShown,(void *)&main);
            bShown=true;
        }
        if(vdFoo==MAP_FAILED){
            // sometimes mmap fails with an invalid argument error because of
            // the random size (a zero length is rejected); this hacks around
            // that by stopping only on errno 1, which we treat as exhaustion
            if(errno==1){
                fprintf(stdout,"%p - %d (%d)\n",vdLast,errno,intCount);
                return 0;
            }
        } else {
            vdLast=vdFoo;
            intCount++;
        }
        if(argc > 2){
            // once the requested number of mappings is reached, load the library late
            if(intCount==atoi(argv[1])){
                void *hModule = dlopen(argv[2], RTLD_NOW);
                if(hModule != NULL){
                    // print the address of the symbol named "test"
                    void *vdProc = dlsym(hModule, "test");
                    char *error = dlerror();
                    if(error!= NULL){
                        fprintf(stdout,"! %s\n",error);
                    } else {
                        fprintf(stdout,"%p\n",vdProc);
                    }
                } else {
                    fprintf(stdout,"couldn't load %s - %s\n",argv[2],dlerror());
                }
            }
        }
    }
}

The results on Linux were more surprising. Using the above test code and a loop (while (true); do ./aslr 750000 ./libourlib.so >> ./fos.txt; done;) we were able to iterate the case over 100 times. From this run we saw the following breakdown:

Linux 2.6.38-8 Address Obtained for a Function in Shared Library Loaded Late
We found that reducing the number of allocations before attempting to load the shared library increased the number of possible addresses. Increasing the number of allocations before attempting to load increased the number of failures to allocate memory during the run. So, based on our small sample, this technique appears less reliable on Linux than on Windows.
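
A rough way to choose a starting count is to estimate how many mappings fit in the address space. The sketch below assumes 4 KiB pages and about 3 GiB of user address space for a 32bit process, neither of which is stated above, and it ignores whatever the binary and its libraries already occupy. Each mmap request of up to 2000 bytes then consumes one page, giving an upper bound of 786,432 mappings, in the same region as the 750000 used in the loop. mappings_to_fill is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: how many mappings fit in `space_bytes` when each
 * request of `request_bytes` is rounded up to whole pages. */
uint64_t mappings_to_fill(uint64_t space_bytes, uint64_t request_bytes,
                          uint64_t page_bytes)
{
    assert(request_bytes > 0 && page_bytes > 0);
    uint64_t pages_per_request = (request_bytes + page_bytes - 1) / page_bytes;
    return space_bytes / (pages_per_request * page_bytes);
}
```

Under the same assumptions, requests of 5000 bytes would need two pages each and the bound would halve, so the count has to be re-estimated whenever the allocation size changes.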

People are no doubt asking at this point if we tested this on Mac OS X / iOS. In short, we would have, but our POSIX-compatible test case (above) just causes a kernel panic on a fully patched Mac OS X (10.7.3). It doesn't look exploitable, as it's actually the kernel panicking itself when it runs out of a certain type of resource.

Conclusions

So, in conclusion, what does this buy us? If you can't heap spray, either because the just-in-time compiler is secure or because non-executable memory is used, and you don't possess any information leaks or the ability to do a partial overwrite, then the described method may just yield the return-oriented programming gadgets or ret2lib payload at a known address that you are looking for.
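
The reason a single known function address is enough is that everything else in the same module sits at a fixed offset from it. A minimal sketch of that arithmetic follows; the helper names and the example values in the note after it are hypothetical, and real offsets would come from inspecting the library file offline.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: recover the module load address from one resolved
 * symbol and that symbol's offset within the library. */
uint32_t module_base(uint32_t symbol_address, uint32_t symbol_offset)
{
    assert(symbol_address >= symbol_offset);
    return symbol_address - symbol_offset;
}

/* Hypothetical helper: absolute address of anything else in the module. */
uint32_t rebase(uint32_t base, uint32_t offset)
{
    return base + offset;
}
```

For example, with made-up values, a function observed at 0x6f801234 that sits 0x1234 bytes into the library gives a base of 0x6f800000, and an item at offset 0x2abc would then be at 0x6f802abc.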

Also as a final caveat we didn't look at PaX and how it adds to the mix on Linux.

Mobile Device Special Mention

Mobile devices deserve a special mention, as people will no doubt wonder what the implications are for Apple (iOS) and Android (Linux), among others. Because devices today, in our experience, don't ship with enough physical RAM to allow user land memory exhaustion, and because they don't support swap, we don't believe this approach will yield much (if anything) on these platforms in the short term. This does however come with some caveats:
  • Changes in the physical RAM profiles on mobile devices will obviously change the risk of this attack becoming practical.
  • Shared libraries that are backed by a single physical RAM instance can be used to consume virtual address space.
  • Applications that use MAP_NORESERVE with mmap, or memory mapped files, can be leveraged to consume a process's virtual address space without consuming actual RAM.

Other Applications of Similar Techniques

While we've focused on the late loading of libraries in this post we also foresee other potential applications for this technique. These other applications include targeting JIT compilers that produce native code. These engines could be similarly targeted to potentially produce the required gadgets at known addresses even where mitigations exist against traditional spraying techniques.

Windows 8

With the release of the Windows 8 Consumer Preview we took it for a spin to see if the technique would still work. The set-up was slightly different from the Windows 7 test environment, but not sufficiently so that we believe it would impact the results. The Windows 8 machine was the 32bit version running inside VirtualBox. It seems Microsoft are ahead of us here and have managed to mitigate this anomalous behaviour in Windows 8. So in short this technique won't be valid in the future...

Vendor Notifications

We did let a number of OS vendors know about this research prior to publication including Microsoft (Windows) and Google (Linux for Chrome OS). In the case of Microsoft we also worked with them to answer any questions they had and to ensure they didn't feel we were going to cause a cyber apocalypse by releasing this.

We also reported the kernel crash to Apple!
