After my unsuccessful attempt to fuzz Apache httpd (did not find any crashes) I found this: https://hackerone.com/reports/1549636 . Basically, you pass the string "((?(1)x | x | )x)+" to Ruby's Regexp.new function and it crashes. (See https://github.com/ruby/ruby/commit/052ec6d2585c3ace95671013d336f5543624ef3d and https://hackerone.com/reports/1549636 .) The reporter got a juicy bounty from it, and making a regex fuzzer should not be that hard, right?
As usual, I jumped in without doing any prior reading on the subject, but I don't think that will cause me any problems down the road. I downloaded the Ruby source and then configured it with:
export ASAN_OPTIONS="halt_on_error=0:use_sigaltstack=0:detect_leaks=0"
CC="/home/cyberhacker/Asioita/newaflfuzz/AFLplusplus/afl-gcc-fast" ../configure --prefix=/home/cyberhacker/Asioita/Hakkerointi/Rubyregex/final/ruby/build/output_fuzz cppflags="-O3 -fsanitize=address -fno-omit-frame-pointer" optflags=-O3 LDFLAGS="-fsanitize=address -fno-omit-frame-pointer"
and it compiled successfully.
Now, we just need to compile the fuzzing harness. I could just run ruby on a Ruby script which takes the regex string from stdin and passes it to Regexp.new, but I think Ruby parsing the script source on every execution adds too much overhead. Instead I decided to use the Ruby C API and write a C program which calls the Ruby regex parser directly. Here it is:
#include <ruby.h>
#include "ruby/re.h"
#include <string.h>
#include <unistd.h>

#define MAX_INPUT_SIZE 120

/* Error callback; unused with rb_protect, which reports errors through `state`. */
VALUE handle_error(VALUE obj1) {
    return 0;
}

VALUE dangerous_func(VALUE x)
{
    /* code that could raise an exception */
    return rb_reg_regcomp(x);
}

int main(int argc, char** argv) {
    VALUE x;
    VALUE result;
    int state = 0;
    char string[MAX_INPUT_SIZE];

    ruby_setup();
    ruby_init();
    ruby_init_loadpath();

    while (__AFL_LOOP(1000)) {
        state = 0;
        memset(string, 0, MAX_INPUT_SIZE);
        //fgets(string, MAX_INPUT_SIZE, stdin);
        read(0, string, MAX_INPUT_SIZE);
        if (string[MAX_INPUT_SIZE-2]) {
            /* Input is too long to be guaranteed NUL-terminated; skip it. */
            return 0;
        }
        x = rb_str_new_cstr(string);
        /* rb_protect catches any Ruby exception raised while compiling the
           regex, so invalid patterns do not abort the harness. */
        result = rb_protect(dangerous_func, x, &state);
        //printf("Result: %d\n", state);
    }
    //printf("%d\n", state);
    return 0;
}
Then we just need a corpus. The same GitHub repository which I used for the Apache fuzzing also has a corpus for regexes: https://github.com/dvyukov/go-fuzz-corpus/tree/master/regexp . I plag… erm.. I mean imported that into my fuzzing setup.
Now, with this simple setup I get around 6000 corpus files in roughly two days of fuzzing. Not bad, but I think that with certain strategies we can do even better.
Now, afl-fuzz supports custom mutators, which is really handy for us. The only interesting characters for Ruby regexes are basically [".", "*", "+", "?", "^", "|", "-", "$", "\\"], and in addition to those the bracket pairs [], {} and (). Everything else is basically useless. The default mutator which afl-fuzz uses is designed for generic binary files, but we only care about a very specific subset of characters. I tested the Ruby regex code and it treats every character other than those mentioned as an ordinary literal, regardless of whether it is printable or not; 0xff gets treated the same as "A", for example. This means the fuzzer wastes a lot of time on mutations which aren't really useful for our purpose of fuzzing the regex parser, so we need to make our own mutator. I chose Python as the language to implement my custom mutator library since it is the programming language which I know best. (You can take a look at the code here: https://github.com/personnumber3377/rubyregexmutator )
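For context, AFL++ loads a Python custom mutator as an ordinary module (its name is passed in the AFL_PYTHON_MODULE environment variable) and calls its fuzz() function for every mutation. What follows is only a minimal sketch of that interface with a placeholder mutation, not the actual mutator from the repository above:

# sketch_mutator.py - minimal sketch of the AFL++ Python custom mutator interface
import random

ALLOWED_CHARS = [".", "*", "+", "?", "^", "|", "-", "$", "\\",
                 "[", "]", "{", "}", "(", ")", "a", "0"]

def init(seed):
    # afl-fuzz passes its random seed here once at startup
    random.seed(seed)

def fuzz(buf, add_buf, max_size):
    # buf is the current test case (a bytearray); return the mutated input
    data = bytes(buf)
    pos = random.randrange(len(data) + 1)
    data = data[:pos] + random.choice(ALLOWED_CHARS).encode() + data[pos:]
    return bytearray(data[:max_size])

def deinit():
    pass

Setting AFL_CUSTOM_MUTATOR_ONLY=1 additionally disables afl-fuzz's built-in mutation stages, so only the custom mutator runs.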
The regex generator which I wrote basically builds expressions recursively: it probabilistically adds subexpressions, which can be plain characters or (), [] or {} groups. Not all of the regexes it generates are valid, but a lot more of them are valid than the ones produced by default afl-fuzz. The actual mutator uses this generator and either appends a newly generated expression of random length or splices it into a random place in the input. In addition to this it may remove a run of sequential bytes from the input.
Using this custom mutator initially yields worse results than plain afl-fuzz, but that is because plain afl-fuzz hits many error-handling functions which this mutator never reaches, since it does not do binary bit flips or havoc, for example. Even though the results initially look worse by the corpus counter, the mutator is actually better, because the paths we are hitting are the interesting ones involved in actually parsing the regex. This hypothesis is supported by the fact that after running the custom mutator for a while and then switching over to the regular mutator in afl-fuzz, we get much better coverage than if we had used plain afl-fuzz from the start. As of writing this I have roughly 6200 corpus files after a bit less than 5 hours of fuzzing, whereas with just the regular mutator it took almost two whole days to reach that amount.
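To make the idea concrete, here is a simplified sketch of such a recursive generator (this is not the exact code from the repository linked above; the character sets and probabilities are placeholders):

import random

METACHARS = ".*+?^|-$\\"   # the interesting regex metacharacters
LITERALS = "ab01"          # a few ordinary characters

def generate_regex(depth=3):
    # Builds a regex fragment as bytes; depth limits how deeply groups nest.
    out = b""
    for _ in range(random.randrange(1, 6)):
        roll = random.random()
        if depth == 0 or roll < 0.4:
            out += random.choice(METACHARS + LITERALS).encode()
        elif roll < 0.6:
            out += b"(" + generate_regex(depth - 1) + b")"
        elif roll < 0.8:
            out += b"[" + generate_regex(depth - 1) + b"]"
        else:
            # a {m,n} repetition with small random bounds
            out += b"{%d,%d}" % (random.randrange(0, 4), random.randrange(0, 9))
    return out

The mutator described above then splices fragments like these into existing corpus entries (or appends them) and occasionally deletes a run of bytes.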
One thing which I would really like to add to my custom mutator is removing and replacing random characters in the input string. Right now it just splices the randomly generated regex expression into the input at some point. I am going to call the function tweak, because it just slightly tweaks the buffer:
import random

# TWEAK_PER_CHARACTER and all_allowed_chars are defined elsewhere in the mutator

def tweak(buffer):
    # basically replace some characters with another and add some random characters and delete random characters
    for counter in range(len(buffer)):
        if random.random() < TWEAK_PER_CHARACTER:
            tweak_type = random.randrange(1, 4)
            if tweak_type == 1: # replace
                # can not assign into a bytes object by index so we have to slice instead :)
                # thing[:3]+b"A"+thing[4:]
                buffer = buffer[:counter] + bytes(random.choice(all_allowed_chars), encoding="utf-8") + buffer[counter+1:] # +1 because we want to replace the char
            if tweak_type == 2: # add
                buffer = buffer[:counter] + bytes(random.choice(all_allowed_chars), encoding="utf-8") + buffer[counter:]
            if tweak_type == 3: # delete
                buffer = buffer[:counter] + buffer[counter+1:]
    if len(buffer) == 0: # the buffer has to be at least one character long; deleting random characters from a short buffer may leave it as "".
        buffer = bytes("a", encoding="utf-8")
    return buffer
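To show where this would fit, the mutator's fuzz() entry point could combine tweak() with the generator along these lines (again only a sketch, assuming the tweak() function above and a generate_regex() helper like the one sketched earlier):

def fuzz(buf, add_buf, max_size):
    data = bytes(buf)
    if random.random() < 0.5:
        # small character-level edits
        data = tweak(data)
    else:
        # splice a freshly generated expression into a random position
        pos = random.randrange(len(data) + 1)
        data = data[:pos] + generate_regex() + data[pos:]
    return bytearray(data[:max_size])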