Research on generating a CodeQL database for closed-source applications

Motivation

I adore CodeQL, even though I have been working with it for less than 3 months. It’s so powerful at analyzing applications, and it can save a tremendous amount of time during code review. However, nothing is perfect. CodeQL has a few downsides.

First, it only works when you have the source code, and as we know, not every application in this world is open source.

Second, it does not support PHP. According to its developers, there are already many PHP scanning tools out there, so they just didn’t feel like making it PHP-inclusive. That’s understandable, though.

Java is supported by CodeQL, and in my experience, decompiled Java code can be almost as good as the source code; sometimes I can even recompile the decompiled code without getting any errors. That motivated me to figure out a way to create a CodeQL database entirely from decompiled Java code.

From SeeBug’s paper, I also knew it was possible, though it might require some additional tinkering and luck.

Challenges

Compiling Process

Nowadays, the most commonly used Java build tools are Gradle, Maven, and Ant. But for a decompiled codebase, I needed a more general and trivial solution. That’s when I decided to study javac a bit.

The syntax for javac is quite simple:

javac <Java files to compile> -cp <external class path>

There is a lot more to javac, but for a simple, small project this is enough. With large projects, though, there are bound to be many dependencies, and javac does not seem smart enough to sort them all out for us.
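To make the -cp part concrete, here is a minimal sketch of how the class path could be assembled from a directory full of jars. The helper names build_classpath and javac_command are mine, not part of any tool, and the sketch only builds the argument list without running javac.

```python
import os
from pathlib import Path


def build_classpath(lib_dir):
    """Join every .jar under lib_dir into one class path string for -cp."""
    jars = sorted(str(p) for p in Path(lib_dir).rglob("*.jar"))
    # os.pathsep is ':' on Linux/macOS and ';' on Windows, which is the
    # separator javac expects between class path entries.
    return os.pathsep.join(jars)


def javac_command(sources, lib_dir):
    """Return the javac argument list; nothing is executed here."""
    return ["javac", "-cp", build_classpath(lib_dir), *sources]
```

The returned list can be handed to subprocess.run as-is, which avoids shell quoting problems when a jar path contains spaces.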

Database Creation

That brings up the second challenge: in order to generate a CodeQL database, we probably need the compilation to succeed. That is pretty hard with a large decompiled codebase, because there are bound to be many syntax errors.

As early as April, I tried to look for a way to solve those problems, but it wasn’t until mid-June, when I shifted my focus back to CodeQL, that I had an idea.

My attempts

CodeQL does not care about bytecode that has already been generated; it cares about the code being compiled while the database is created. It is also possible for the compilation to fail but for the database to still be generated successfully. So my hypothesis at this point was: if I can keep the number of compilation errors as low as possible, maybe I can successfully create the database.

If I can instruct javac to compile files in dependency order, then I can keep compilation errors to a minimum, right? So I went off and did some research on dependency handling.

And I found it. If @source.txt is passed to javac, it compiles all the Java files listed in source.txt in the given order. Note that the file name does not matter, but each Java file listed inside source.txt has to be given as a full path.

But… how do I figure out the order…? This means I have to sort one out myself, right? So be it.

Now my focus was on how to rank the compile order by dependencies. Here is my thought process:

  1. I need to find the files which have the fewest dependencies.
  2. Once files with few dependencies are compiled, other files can depend on them, and everything should work.
  3. I can’t just throw everything in as an external library, because some of it may be crucial to the dataflow and I don’t want to miss that.
  4. I need a way to distinguish between public libraries and internal libraries. By public I mean common libraries like Apache Commons Collections; internal ones are those whose dataflow may be interesting to us.

I figured I could sort public and internal libraries by their package names. For example, libraries provided by Apache may have package names starting with org.apache, while internal ones may start with something like com.example, so we can tell them apart. For dependency ranking, I want to ignore public dependencies, as we can just provide them using the -cp flag of javac.
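The package-name check can be sketched in a few lines. The function name is_public and the short prefix tuple are only for illustration; a real list would be tailored to the target application.

```python
# Illustrative prefixes only; extend this for the application being analyzed.
PUBLIC_PREFIXES = ("org.apache", "org.springframework", "javax")


def is_public(import_name, prefixes=PUBLIC_PREFIXES):
    """True when import_name equals a public prefix or lives below it."""
    return any(
        import_name == prefix or import_name.startswith(prefix + ".")
        for prefix in prefixes
    )
```

Matching on prefix + "." rather than on the bare prefix keeps a package such as org.apachefoo from being mistaken for org.apache.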

Lastly, I want to add the files with the fewest dependencies to my result first, then check the file list again to see whether some files’ dependencies are now satisfied: in theory those dependencies get compiled first, so javac already knows about their classes and methods. For dependency problems we can’t solve (e.g. two files that depend on each other, which doesn’t make sense, but it happens), we need to make sure we don’t end up in an infinite loop.
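The ranking described above is close to a textbook topological sort. As a point of comparison only, here is a minimal sketch using Kahn's algorithm; it is not the script I actually ran, and the name rank_compile_order and the shape of the deps mapping are assumptions made for this example. Files caught in a circular dependency are simply appended at the end, so the loop always terminates.

```python
from collections import deque


def rank_compile_order(deps):
    """Order files so each file comes after the internal files it imports.

    deps maps a file name to the set of internal files it depends on.
    Files stuck in a cycle are appended at the end in sorted order.
    """
    # Keep only dependencies that are part of the codebase, and drop self-imports.
    pending = {f: {d for d in ds if d in deps and d != f} for f, ds in deps.items()}

    # Reverse edges: which files are waiting on a given file.
    dependents = {f: [] for f in deps}
    for f, ds in pending.items():
        for d in ds:
            dependents[d].append(f)

    # Start with the files that have no internal dependencies at all.
    ready = deque(sorted(f for f, ds in pending.items() if not ds))
    order = []
    while ready:
        f = ready.popleft()
        order.append(f)
        for g in dependents[f]:
            pending[g].discard(f)
            if not pending[g]:
                ready.append(g)

    # Whatever is left sits in a cycle; emit it anyway so no file is lost.
    placed = set(order)
    order.extend(sorted(f for f in deps if f not in placed))
    return order
```

For example, with A.java importing B.java and C.java importing both, the result starts with B.java, then A.java, then C.java.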

In the end, it all adds up to an easy-to-medium algorithm problem on LeetCode.

And here it is: below is the kick-ass algorithm I wrote. By the way, it only works on Linux, nobody knows why.

#!/usr/bin/env python3

import os
import sys


publicLib = [
   'org.springframework',
   'org.thymeleaf',
   'org.flywaydb',
   'org.apache',
   'org.slf4j',
   'org.asciidoctor',
   'org.eclipse',
   'javax',
   'com.fasterxml',
   'com.beust',
   'org.joda',
   'com.thoughtworks',
   'com.nulabinc',
   'io.jsonwebtoken',
   'org.jsoup',
   'org.w3c',

]



def getParent(pname):
   return '.'.join(pname.split('.')[:-1])



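def parselib(l, p):
   # Map each file to its imports, split into public and internal dependencies.
   d = {}
   # Reverse lookup: fully qualified class name -> file that declares it.
   classToFile = {v: k for (k, v) in p.items()}

   for line in l:
      try:
         # Each grep line looks like "<file>:import <name>;"
         fname, importData = line.split(':')
         libname = importData.split('import ')[1].replace(';\n', '')
      except (ValueError, IndexError):
         # Skip lines that do not match the expected grep output.
         continue

      if fname not in d:
         d[fname] = {'dependency': [], 'publicDependency': [], 'internalDependency': []}

      if not libname.startswith('java.'):
         d[fname]['dependency'].append(libname)

         if libname in classToFile:
            d[fname]['internalDependency'].append(classToFile[libname])
         else:
            d[fname]['publicDependency'].append(libname)

   print(f'[+] Parsed libraries, total files which have dependencies: {len(d)}')

   return d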


def parsepackage(p):
   pd = {}

   for line in p:
      if len(line) == 0:
         continue
      fname, pname = line.split(':')
      pname = pname.split('package ')[1].replace(';\n', '') + '.' + os.path.basename(fname).split('.')[0]
      pd[fname] = pname
   print(f'[+] Parsed packages, total packages: {len(pd.keys())}')
   return pd


def doCompileChain(libs, packages):
   print('[*] Starting to rank compile priority.')
   # Files without any import line have no dependencies, so they go first.
   res = []
   for k in packages.keys():
      if k not in libs.keys():
         res.append(k)

   # lock counts down on every pass and resets on progress; when it hits 0
   # (e.g. circular imports), one file is forced into the list to break the tie.
   lock = 10
   while (len(res) < len(packages)) and (lock > 0):
      lock -= 1
      for (k, v) in libs.items():
         if k in res:
            continue

         # A file is ready once every internal dependency is already ranked.
         ready = all(dep in res for dep in v['internalDependency'])

         if ready or lock == 0:
            res.append(k)
            lock = 10

   return res


def generateFile(libs, packages):
   rankedpackages = doCompileChain(libs, packages)
   # The with block closes the file, so no explicit close() is needed.
   with open('source.txt', 'w') as f:
      for p in rankedpackages:
         f.write(p + '\n')



try:
   pwd = os.path.realpath(sys.argv[1])

except IndexError:
   print('invalid path')
   sys.exit(1)


# Raw f-strings keep the backslashes intact for the shell; grep -v '\$' drops
# any matching line that contains a dollar sign.
l = os.popen(rf"grep -ER 'import .*;' --include \*.java {pwd} | grep -v '\$'").readlines()

p = os.popen(rf"grep -ER 'package .*;' --include \*.java {pwd} | grep -v '\$'").readlines()

packages = parsepackage(p)
libs = parselib(l, packages)

generateFile(libs, packages)

Excuse my coding skills, but it works, so who cares. The script writes the compile order to source.txt. With that, we can try to create a database with:

codeql database create -l java -s <project path> -c 'javac -cp <path to public library directory>  @source.txt'

I used the WebGoat project by OWASP for testing. But isn’t that an open source project? Yes, and that is the point: it gives me a reference when I compare results later. I downloaded the release version of WebGoat-server from the repository and decompiled it with the fernflower decompiler. Then I ran my Python script and the codeql command above.

I put all the needed public libraries in the libs directory and started generating the database.

As you can tell, errors immediately appeared after compilation. And, yeah…

That’s when I started to panic, because I had been so sure I got it. Then this happened, and I didn’t know what to do.

Luckily for me, shortly after my failure, testanull posted his write-up on this very matter. Big shout-out to him; I couldn’t have done it without his blog.

In his blog, he mentions many of the problems I encountered, but one thing stood out that I had never noticed before: the codeql dataset import command.

I can import into a database using this command! We also need to provide a trap directory and a dbscheme argument for the command to work.

A trap file essentially holds the data CodeQL collects while compiling. Although the database was not successfully created because of too many compilation errors, CodeQL could still collect a lot of useful information, all thanks to the dependency ordering we did earlier to minimize the errors.

So I can still create a more or less working database from the trap files.
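As far as I know, the import command takes the dbscheme through --dbscheme and, after a -- separator, the dataset directory followed by the trap directory. Here is a minimal sketch that only builds the argument list. The helper name dataset_import_command is mine and the three paths are placeholders, so check the layout against codeql dataset import --help for your CLI version before relying on it.

```python
def dataset_import_command(dataset_dir, trap_dir, dbscheme):
    """Build the argument list for `codeql dataset import`.

    All three paths are supplied by the caller; this sketch does not
    guess where the CLI or the failed database keeps them.
    """
    return [
        "codeql", "dataset", "import",
        f"--dbscheme={dbscheme}",
        "--",          # everything after this is positional
        dataset_dir,   # dataset to create or extend
        trap_dir,      # directory holding the collected trap files
    ]
```

Passing the result to subprocess.run with <dataset directory>, <trap directory> and <dbscheme file> filled in would run the import.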

And it worked! db-java, which is the database directory, was created successfully. Now we can analyze the database as usual.

Side-by-Side Comparison

As I said earlier, I chose the open source WebGoat instead of some random closed-source project for a reason: I wanted to see how good the result is, so that I could decide whether to keep improving my project.

I made a website to represent the result visually a while ago and here is the side-by-side comparison of the result.

On the left is the result based on compiling the source code, and on the right is the result from the database generated from the decompiled codebase.

I only used the security QL pack in the official CodeQL repository for both, and with source code, CodeQL was able to find 14 vulnerabilities, while with decompiled code, it was able to find 13 of them.

In some severe to critical categories, the results are actually pretty close:

Vulnerability Category    With Source Code    With Decompiled Code
SQL Injection             19                  18
Path Traversal            6                   4
Unsafe Deserialization    1                   1
XML External Entity       3                   3
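To put a number on "pretty close", the counts in the table can be turned into a rough retention rate. This is a small sketch that compares counts only; it does not check that the decompiled run reported the very same findings, and the function name retention is mine.

```python
# Finding counts copied from the table above: (with source, with decompiled code).
findings = {
    "SQL Injection": (19, 18),
    "Path Traversal": (6, 4),
    "Unsafe Deserialization": (1, 1),
    "XML External Entity": (3, 3),
}


def retention(counts):
    """Fraction of source-code findings also counted on decompiled code."""
    total_source = sum(src for src, _ in counts.values())
    total_decompiled = sum(dec for _, dec in counts.values())
    return total_decompiled / total_source


for name, (src, dec) in findings.items():
    print(f"{name}: {dec}/{src} ({dec / src:.0%})")
print(f"Overall: {retention(findings):.0%}")
```

Across these four categories that is 26 of 29 findings, or roughly 90%.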

In general, when analyzing recompiled decompiled code, CodeQL was able to do a pretty good job without missing too many findings. But that is only in comparison with the result from source code. In reality there are probably more vulnerabilities than CodeQL is able to find, and catching them may require some additional QL rules. I assume the results in real-life practice would be similarly close.

Conclusion

Who doesn’t like traps?
