I always wondered how something as complex as a regex engine actually works under the hood. How does it compile the regex, and how does it then match the strings fed to it? These questions motivated me to implement my own regex engine in python3. This is my journey from the very start of implementing it; I assume that you already have decent programming experience. This is a documentation of how I implemented it, not a handholding tutorial, so I will not be explaining every concept needed to implement one. This is just to document some questions I had about some parts of implementing an engine that had me confused. I will roughly be following the guide described here: https://www.abstractsyntaxseed.com/blog/regex-engine/introduction
Ok so regex engines are basically finite automata. “What is that?” you may ask. Well, to be honest I didn’t really know myself at first. In short, a finite automaton is a formalization of a computer: it reads input one symbol at a time and changes its current state based on that symbol. They are most commonly depicted as a bunch of circles (states) and arrows (transitions), and each arrow usually has a symbol beside it to signify which input makes the state machine take that route. The difference between deterministic and non-deterministic finite automata is that in a deterministic automaton (DFA) there is at most one valid transition out of a state for a given input symbol, while a non-deterministic automaton (NFA) can have several, and can even transition without consuming any input at all. “Finite” just means that there is a finite number of states. Regexes are usually compiled into non-deterministic automata, because constructs like alternation naturally create several possible transitions; hopefully this will clear up more when I start actually writing it.
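To make the deterministic case concrete, here is a toy DFA written as a plain transition table. This is just my own sketch for illustration; the state names and the dictionary format are made up, not anything from the blog:

```python
# A deterministic finite automaton (DFA) accepting exactly the string "ab".
# For every (state, character) pair there is AT MOST ONE next state,
# which is what makes it deterministic.
TRANSITIONS = {
    ("q0", "a"): "q1",
    ("q1", "b"): "q2",
}
ACCEPTING = {"q2"}

def dfa_accepts(s: str) -> bool:
    state = "q0"
    for ch in s:
        key = (state, ch)
        if key not in TRANSITIONS:
            return False  # no valid transition: the "dead state"
        state = TRANSITIONS[key]
    return state in ACCEPTING

print(dfa_accepts("ab"))   # True
print(dfa_accepts("abb"))  # False
```

In an NFA the value for a `(state, character)` key would be a *list* of possible next states instead of a single one.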
In the blog post, the guy decided to use NFAs for his implementation of a regex engine. This I think is a bit of a bad move, because choosing NFAs over DFAs basically means that we can not write a fully optimized regex engine, because minimizing an NFA is NP-hard. The guy even admits it himself: “It would have been easy for me to optimize the NFA by hand, but I wanted to show the same one that will be generated by our engine. There are efficient minimization algorithms for DFA, but sadly, for NFA it’s NP-hard”. Maybe I can write a DFA-based regex engine later, but first, I will follow the guide and implement it as an NFA.
Ok, so let’s actually get to writing code. How do we start? Well, I am going to first write a functional NFA generator, instead of an optimized one. Then maybe I should add a testsuite to test it.
I am going to roughly copy the template the guy did. Maybe change some things, but I will keep it roughly the same. This is because I don’t really know what I am doing.
Let’s create a file called NFA.py. I am first going to try to implement this myself, instead of just copying what the guy is doing. So we need a current state (duh), which basically means where we are in the state machine. Then we will also need a list of possible states for our NFA (basically a list of nodes). We also need a way to transition between states based on the input to the state machine; I think this should be implemented as a function. Finally, we need a way to construct the machine, aka we need to provide an api to connect the nodes.
Here is my current skeleton for the state machine:
class NFAEngine:
    def __init__(self):
        self.states = []  # all possible states. This will be added to later on.
        self.current_states = []  # This is the current node(s) which we are at.
        self.input_string = ""  # This is our input to the state machine.
        self.where_we_are_in_input = 0  # This is the index in the input string where we are.
        self.alphabet = ""  # This is the list of valid characters. If the input character is not in this, then automatically go to the dead state.

    def update_input(self, string: str) -> None:  # Update the input to the state machine.
        self.input_string = string
        self.where_we_are_in_input = 0

    def consume_next_char(self) -> str:
        char = self.input_string[self.where_we_are_in_input]
        self.where_we_are_in_input += 1
        return char

    def transition_state(self):  # This is a dummy as of now.
        # TODO: First get the input character.
        # Then look at the current node and see which transition(s) to take.
        # Take the transition(s)
        pass
Then we also need a way to represent a node in the state machine, so let’s add a class for that. I think a good way to model a state transition is a tuple: the first element is the character and the second element is a reference to the next node object. Or we could make the transition arrow itself another object, a transition object of sorts. But now that I think about it, why don’t we define the transitions as a dictionary? See, the key would be the character and the value would be the reference to the next node(s).
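The dictionary idea can be sketched quickly. Since an NFA can have several destinations for the same character, the value should be a list of nodes rather than a single one (the names here are just for illustration, not the engine's real API):

```python
# Sketch of the transitions-as-dictionary idea:
# character -> list of destination node names.
transitions: dict[str, list[str]] = {}

def add_transition(char: str, dest: str) -> None:
    # setdefault initializes the list on first use, then we append to it,
    # so the same character can lead to several nodes (non-determinism).
    transitions.setdefault(char, []).append(dest)

add_transition("a", "q1")
add_transition("a", "q2")  # a second destination for the same character
print(transitions)  # {'a': ['q1', 'q2']}
```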
Here is my node class:
class Node:
    def __init__(self, name: str):  # This is a node in the regex engine.
        self.transitions = {}  # This will be added to later on.
        self.name = name
and here is my engine class with a couple of added methods:
class NFAEngine:
    def __init__(self):
        self.states = []  # all possible states. This will be added to later on.
        self.nodes = {}  # This is a dictionary with the name of the node as key and the reference to the node as a value
        self.current_states = []  # This is the current node(s) which we are at.
        self.current_nodes = []  # This is a list of names
        self.input_string = ""  # This is our input to the state machine.
        self.where_we_are_in_input = 0  # This is the index in the input string where we are.
        self.alphabet = ""  # This is the list of valid characters. If the input character is not in this, then automatically go to the dead state.

    def update_input(self, string: str) -> None:  # Update the input to the state machine.
        self.input_string = string
        self.where_we_are_in_input = 0

    def consume_next_char(self) -> str:
        char = self.input_string[self.where_we_are_in_input]
        self.where_we_are_in_input += 1
        return char

    def get_current_nodes(self) -> list:
        out = []
        for node in self.current_nodes:
            out.append(self.nodes[node])
        return out  # This returns a list of all of the node objects. This will be used in the transition function when we transition all of the nodes forward.

    def add_node(self, name: str) -> None:  # This adds a node into our machine.
        new_node = Node(name)
        self.nodes[name] = new_node  # add to the nodes dict

    def add_transition(self, name1: str, name2: str, condition: str):  # defines a transition. The condition is a string of length one aka a character.
        assert len(condition) == 1 and isinstance(condition, str)
        # get destination node object
        dest_node = self.nodes[name2]
        # get source node
        source_node = self.nodes[name1]
        if condition in source_node.transitions:  # If there is already a transition for this character, then add another destination to it.
            source_node.transitions[condition].append(dest_node)
        else:
            source_node.transitions[condition] = [dest_node]  # There isn't a transition already defined for this character so initialize the list.

    def set_start_node(self, name: str) -> None:  # Sets the group of nodes where we could be at to a list containing only the start node aka we are now at the very start.
        self.current_nodes = [name]

    def transition_state(self):  # This is a dummy as of now.
        # TODO: First get the input character.
        # Then look at the current node and see which transition(s) to take.
        # Take the transition(s)
        pass
There are unused variables everywhere, but I am going to remove those later. Let’s implement the transition_states function, which is the real meat and potatoes of the engine.
Here:
def transition_states(self):
    # First get the input character.
    # Then look at the current nodes and see which transition(s) to take.
    # Take the transition(s).
    if self.terminated:
        print("Can not advance an already terminated state machine!")
        return
    char = self.consume_next_char()
    # Go over each possible current node.
    new_nodes = []  # These are the new nodes when we are done.
    for node in self.current_nodes:
        node_obj = self.nodes[node]  # The actual node object
        if char in node_obj.transitions:  # Check if the character is in the possible transitions. If it isn't, then go to the dead state. Going to the dead state is synonymous with just removing this path altogether, since it is not possible to transition from the dead state to any other state.
            nodes_we_transitioned_into = node_obj.transitions[char]  # This is a list, since an NFA can have several destinations for one character.
            for n in nodes_we_transitioned_into:
                new_nodes.append(n.name)  # add the names to the new nodes. We will set our current nodes to these nodes later on.
    # Check if we have reached the end node
    if self.end_node in new_nodes:
        print("We have reached the end node!")
        self.terminated = True
        return
    if new_nodes == []:  # Every path died, so we can never reach the end node.
        self.terminated = True
        print("We did NOT reach the end node")
        return
    # Set all of the possible current states to the new states.
    self.current_nodes = new_nodes
    return
Now, let’s write a driver function for this engine and see how it fares.
I noticed that there was a flaw in my code. There can be many end nodes, so I updated my NFA code:
class NFAEngine:
    def __init__(self):
        self.states = []  # all possible states. This will be added to later on.
        self.nodes = {}  # This is a dictionary with the name of the node as key and the reference to the node as a value
        self.current_states = []  # This is the current node(s) which we are at.
        self.current_nodes = []  # This is a list of names
        self.input_string = ""  # This is our input to the state machine.
        self.where_we_are_in_input = 0  # This is the index in the input string where we are.
        self.alphabet = ""  # This is the list of valid characters. If the input character is not in this, then automatically go to the dead state.
        self.end_nodes = []

    def update_input(self, string: str) -> None:  # Update the input to the state machine.
        self.input_string = string
        self.where_we_are_in_input = 0

    def consume_next_char(self) -> str:
        char = self.input_string[self.where_we_are_in_input]
        self.where_we_are_in_input += 1
        return char

    def get_current_nodes(self) -> list:
        out = []
        for node in self.current_nodes:
            out.append(self.nodes[node])
        return out  # This returns a list of all of the node objects. This will be used in the transition function when we transition all of the nodes forward.

    def add_node(self, name: str) -> None:  # This adds a node into our machine.
        new_node = Node(name)
        self.nodes[name] = new_node  # add to the nodes dict

    def add_transition(self, name1: str, name2: str, condition: str):  # defines a transition. The condition is a string of length one aka a character.
        assert len(condition) == 1 and isinstance(condition, str)
        # get destination node object
        dest_node = self.nodes[name2]
        # get source node
        source_node = self.nodes[name1]
        if condition in source_node.transitions:  # If there is already a transition for this character, then add another destination to it.
            source_node.transitions[condition].append(dest_node)
        else:
            source_node.transitions[condition] = [dest_node]  # There isn't a transition already defined for this character so initialize the list.

    def set_start_node(self, name: str) -> None:  # Sets the group of nodes where we could be at to a list containing only the start node aka we are now at the very start.
        self.current_nodes = [name]

    def set_end_nodes(self, name: list) -> None:
        self.end_nodes = names

    def get_end_nodes(self) -> list:
        return [self.nodes[name] for name in self.end_nodes]

    def check_termination(self, new_nodes: list) -> bool:
        if len(new_nodes) > self.end_nodes:
            checked_list = self.end_nodes
            other_list = new_nodes
        else:
            checked_list = new_nodes
            other_list = self.end_nodes
        for elem in checked_list:
            if elem in other_list:
                return True
        return False

    def transition_states(self):
        if self.terminated:
            print("Can not advance an already terminated state machine!")
            return
        char = self.consume_next_char()
        # Go over each possible current node.
        new_nodes = []  # These are the new nodes when we are done.
        for node in self.current_nodes:
            node_obj = self.nodes[node]  # The actual node object
            if char in node_obj.transitions:  # If the character has no transition, this path just dies (the dead state).
                nodes_we_transitioned_into = node_obj.transitions[char]  # This is a list of destination nodes.
                for n in nodes_we_transitioned_into:
                    new_nodes.append(n.name)  # add the names to the new nodes.
        # Check if we have reached an end node
        if self.check_termination(new_nodes):
            print("We have reached the end node!")
            self.terminated = True
            return
        if new_nodes == []:  # Every path died.
            self.terminated = True
            print("We did NOT reach the end node")
            return
        # Set all of the possible current states to the new states.
        self.current_nodes = new_nodes
        return
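By the way, check_termination is really just asking whether two lists share an element, and in Python that is a set intersection. Here is a standalone one-liner sketch (a hypothetical helper taking the end nodes as a parameter, not a drop-in method):

```python
def check_termination_sketch(new_nodes: list, end_nodes: list) -> bool:
    # True if any newly reached node is one of the accepting (end) nodes.
    # set(a) & set(b) is the intersection of the two collections.
    return bool(set(new_nodes) & set(end_nodes))

print(check_termination_sketch(["q1", "q2"], ["q2"]))  # True
print(check_termination_sketch(["q1"], ["q2"]))        # False
```

The length comparison in my version is just picking the shorter list to iterate over; with sets that bookkeeping disappears.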
and here is my driver code:
from NFA import *

def main() -> int:
    # Basically synonymous with this:
    '''
    const nfa = new EngineNFA();
    nfa.declareStates("q0", "q1");
    nfa.setInitialState("q0");
    nfa.setEndingStates(["q1"]);
    nfa.addTransition("q0", "q1", new CharacterMatcher("a"));
    console.log(nfa.compute("a"));
    '''
    engine = NFAEngine()
    engine.update_input("a")
    engine.add_node("q0")
    engine.add_node("q1")
    engine.set_start_node("q0")
    engine.set_end_nodes(["q1"])
    # add_transition(self, name1: str, name2: str, condition: str)
    engine.add_transition("q0", "q1", "a")
    engine.transition_states()  # Run one step!
    return 0

if __name__ == "__main__":
    exit(main())
aaaaaaannnnnddddd…
self.end_nodes = names
^^^^^
NameError: name 'names' is not defined. Did you mean: 'name'?
whoops. After fixing a couple of typos:
class Node:
    def __init__(self, name: str):  # This is a node in the regex engine.
        self.transitions = {}  # This will be added to later on.
        self.name = name

class NFAEngine:
    def __init__(self):
        self.states = []  # all possible states. This will be added to later on.
        self.nodes = {}  # This is a dictionary with the name of the node as key and the reference to the node as a value
        self.current_states = []  # This is the current node(s) which we are at.
        self.current_nodes = []  # This is a list of names
        self.input_string = ""  # This is our input to the state machine.
        self.where_we_are_in_input = 0  # This is the index in the input string where we are.
        self.alphabet = ""  # This is the list of valid characters. If the input character is not in this, then automatically go to the dead state.
        self.end_nodes = []
        self.terminated = False

    def update_input(self, string: str) -> None:  # Update the input to the state machine.
        self.input_string = string
        self.where_we_are_in_input = 0

    def consume_next_char(self) -> str:
        char = self.input_string[self.where_we_are_in_input]
        self.where_we_are_in_input += 1
        return char

    def get_current_nodes(self) -> list:
        out = []
        for node in self.current_nodes:
            out.append(self.nodes[node])
        return out  # This returns a list of all of the node objects.

    def add_node(self, name: str) -> None:  # This adds a node into our machine.
        new_node = Node(name)
        self.nodes[name] = new_node  # add to the nodes dict

    def add_transition(self, name1: str, name2: str, condition: str):  # defines a transition. The condition is a string of length one aka a character.
        assert len(condition) == 1 and isinstance(condition, str)
        # get destination node object
        dest_node = self.nodes[name2]
        # get source node
        source_node = self.nodes[name1]
        if condition in source_node.transitions:  # If there is already a transition for this character, then add another destination to it.
            source_node.transitions[condition].append(dest_node)
        else:
            source_node.transitions[condition] = [dest_node]  # There isn't a transition already defined for this character so initialize the list.

    def set_start_node(self, name: str) -> None:  # Sets the group of nodes where we could be at to a list containing only the start node.
        self.current_nodes = [name]

    def set_end_nodes(self, names: list) -> None:
        self.end_nodes = names

    def get_end_nodes(self) -> list:
        return [self.nodes[name] for name in self.end_nodes]

    def check_termination(self, new_nodes: list) -> bool:
        if len(new_nodes) > len(self.end_nodes):
            checked_list = self.end_nodes
            other_list = new_nodes
        else:
            checked_list = new_nodes
            other_list = self.end_nodes
        for elem in checked_list:
            if elem in other_list:
                return True
        return False

    def transition_states(self):
        if self.terminated:
            print("Can not advance an already terminated state machine!")
            return
        char = self.consume_next_char()
        # Go over each possible current node.
        new_nodes = []  # These are the new nodes when we are done.
        for node in self.current_nodes:
            node_obj = self.nodes[node]  # The actual node object
            if char in node_obj.transitions:  # If the character has no transition, this path just dies (the dead state).
                nodes_we_transitioned_into = node_obj.transitions[char]
                for n in nodes_we_transitioned_into:
                    new_nodes.append(n.name)  # add the names to the new nodes.
        # Check if we have reached an end node
        if self.check_termination(new_nodes):
            print("We have reached the end node!")
            self.terminated = True
            return
        if new_nodes == []:  # Every path died.
            self.terminated = True
            print("We did NOT reach the end node")
            return
        # Set all of the possible current states to the new states.
        self.current_nodes = new_nodes
        return
Now our code prints “We have reached the end node!”. And if we feed it “b” instead of “a”, we of course shouldn’t reach the end node, so the program should print “We did NOT reach the end node”. And it does. Great!
So now I have a (very convoluted) non-deterministic state machine. What now? Well, there are still a couple of problems we need to solve. Firstly, I should have actually implemented a “Matcher” class instead of keying the transitions directly on characters. Of course I could reserve specific characters to mean epsilon, but a proper matcher object is cleaner. (An epsilon transition, by the way, is one that can be taken without consuming an input character.) Because some matchers do not consume a character, we need a list of string offsets, where each offset is the place where we are in the string for each possible route. We also need to handle epsilon loops, where the machine could go around a cycle of epsilon transitions forever. And if we reach the terminal state while there are still unconsumed characters, then we may not actually have matched the whole string. Finally, the guy implemented what’s called an iterative backtracking implementation, where the state machine follows one path to the end, and if it did not match, goes back to a state where there was a branch (two possible transitions for the same character) and tries the other route. I implemented a non-backtracking implementation, which basically follows each path in parallel, aka executes each path forward one step at a time and checks if any has terminated. A non-backtracking implementation is a perfectly valid implementation, and I personally felt it was more intuitive, but according to the article it causes some problems later when implementing addons to the regex engine, so for the sake of following the blog, let’s change our implementation to the backtracking one.
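For reference, the non-backtracking (parallel) simulation I described can be sketched in a few lines. This is a standalone toy with a made-up transition-table format, not the engine's actual code:

```python
def parallel_match(transitions, start, ends, s):
    # transitions: (state, char) -> list of next states.
    # Advance every currently-possible state in lockstep, one input
    # character at a time.
    states = {start}
    for ch in s:
        states = {dst for st in states
                      for dst in transitions.get((st, ch), [])}
        if not states:
            return False  # every path died (the dead state)
    return bool(states & set(ends))

# NFA for "ab" with a redundant branch on 'a', to show the non-determinism:
t = {
    ("q0", "a"): ["q1", "q2"],  # two choices for the same character
    ("q1", "b"): ["q3"],
}
print(parallel_match(t, "q0", ["q3"], "ab"))  # True
print(parallel_match(t, "q0", ["q3"], "ac"))  # False
```

Note how a path simply drops out of the set when it has no valid transition; that is exactly the "dead state" behavior from before.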
So, let’s change our implementation to a backtracking one… and also change the matching stuff to a matcher class.
Here is the skeleton class for a matcher:
class Matcher:
    def matches(self, char: str) -> bool:
        return False  # Placeholder.

    def isEpsilon(self):  # Check if this matcher is an epsilon matcher. (An epsilon transition is taken without consuming the input character.)
        return None  # Placeholder

    def getLabel(self):
        return "not-named"  # Placeholder
I actually have a bit of an embarrassing thing to admit: I don’t really know how inheritance works in python3. So let’s learn it!
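A minimal standalone refresher (toy class names, nothing from the project): a subclass names its base class in parentheses, overrides whatever methods it wants, and its instances count as instances of the base class too.

```python
class Base:
    def greet(self) -> str:
        return "hello from Base"  # default behaviour

class Child(Base):  # Base in parentheses makes Child a subclass
    def __init__(self, name: str) -> None:
        self.name = name

    def greet(self) -> str:  # overrides the base method
        return "hello from " + self.name

c = Child("Child")
print(c.greet())            # hello from Child
print(isinstance(c, Base))  # True: subclass instances are instances of the base
```

The `isinstance` behavior is the important bit for the matchers: one `isinstance(x, Matcher)` check covers every matcher subclass.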
Here:
class CharacterMatcher(Matcher):
    def __init__(self, c: str) -> None:
        # This sets the character for this state
        self.c = c

    def matches(self, char: str) -> bool:
        return self.c == char

    def getLabel(self) -> str:
        return self.c

    def isEpsilon(self) -> bool:
        return False

class EpsilonMatcher(Matcher):
    def matches(self, char: str) -> bool:
        return True  # An epsilon transition can always be taken.

    def getLabel(self) -> str:
        return "epsilon"

    def isEpsilon(self) -> bool:
        return False
There is a bit of a problem. See, in the javascript code there is this: matcher.matches(input, i)
and I do not really understand why it takes both the input and the index i; shouldn’t the function only take one parameter? Probably some matchers (anchors, for example) need to know the position in the string as well. Maybe we will use that later on, so let’s just roll with it.
Uh oh….
from matchers import *

def main() -> int:
    char_matcher = CharacterMatcher("a")
    assert char_matcher.matches("a", 0)
    return 0

if __name__ == "__main__":
    exit(main())
results in this:
assert char_matcher.matches("a", 0)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: CharacterMatcher.matches() takes 2 positional arguments but 3 were given
So let’s just add a dummy variable to it:
class Matcher:
    def matches(self, char: str, i: int) -> bool:
        return False  # Placeholder.

    def isEpsilon(self) -> bool:  # Check if this matcher is an epsilon matcher. (An epsilon transition is taken without consuming the input character.)
        return False  # Placeholder

    def getLabel(self) -> str:
        return "not-named"  # Placeholder

class CharacterMatcher(Matcher):
    def __init__(self, c: str) -> None:
        # This sets the character for this state
        self.c = c

    def matches(self, char: str, i: int) -> bool:
        return self.c == char

    def getLabel(self) -> str:
        return self.c

    def isEpsilon(self) -> bool:
        return False

class EpsilonMatcher(Matcher):
    def matches(self, char: str, i: int) -> bool:
        return True  # An epsilon transition can always be taken.

    def getLabel(self) -> str:
        return "epsilon"

    def isEpsilon(self) -> bool:
        return False
Now the assert passes without erroring. Next, let’s add the matchers to my engine.
Here is my refactored code:
from matchers import *

class Node:
    def __init__(self, name: str):  # This is a node in the regex engine.
        self.transitions = []  # This will be added to later on.
        self.name = name

    def add_transition(self, destnode, matcher) -> None:
        self.transitions.append([matcher, destnode])

class NFAEngine:
    def __init__(self):
        self.states = {}  # all possible states. This will be added to later on.
        self.current_states = []  # This is the current node(s) which we are at.
        self.current_nodes = []  # This is a list of names
        self.input_string = ""  # This is our input to the state machine.
        self.where_we_are_in_input = 0  # This is the index in the input string where we are.
        self.alphabet = ""  # This is the list of valid characters. If the input character is not in this, then automatically go to the dead state.
        self.end_nodes = []
        self.terminated = False
        self.start_state = None

    def update_input(self, string: str) -> None:  # Update the input to the state machine.
        self.input_string = string
        self.where_we_are_in_input = 0

    def consume_next_char(self) -> str:
        char = self.input_string[self.where_we_are_in_input]
        self.where_we_are_in_input += 1
        return char

    def get_current_nodes(self) -> list:
        out = []
        for node in self.current_nodes:
            out.append(self.states[node])
        return out  # This returns a list of all of the node objects.

    def add_state(self, name: str) -> None:  # This adds a node into our machine.
        new_state = Node(name)
        self.states[name] = new_state  # add to the nodes dict

    def add_transition(self, name1: str, name2: str, condition):  # defines a transition. The condition is now a Matcher object.
        '''
        addTransition(fromState, toState, matcher) {
            this.states[fromState].addTransition(this.states[toState], matcher);
        }
        '''
        assert isinstance(condition, Matcher)  # CharacterMatcher and EpsilonMatcher are subclasses of Matcher, so this covers them too.
        self.states[name1].add_transition(self.states[name2], condition)
        return

    def set_start_node(self, name: str) -> None:  # Sets the group of nodes where we could be at to a list containing only the start node.
        self.current_nodes = [name]
        self.start_state = name

    def set_end_nodes(self, names: list) -> None:
        self.end_nodes = names

    def get_end_nodes(self) -> list:
        return [self.states[name] for name in self.end_nodes]

    def check_termination(self, new_nodes: list) -> bool:
        if len(new_nodes) > len(self.end_nodes):
            checked_list = self.end_nodes
            other_list = new_nodes
        else:
            checked_list = new_nodes
            other_list = self.end_nodes
        for elem in checked_list:
            if elem in other_list:
                return True
        return False

    def transition_states(self) -> bool:
        if self.terminated:
            print("Can not advance an already terminated state machine!")
            return False
        stack = [[0, self.states[self.start_state]]]
        while len(stack):
            # Pop off the state.
            i, state = stack.pop(-1)  # pop from the end of the stack
            if state.name in self.end_nodes:
                print("Match found!")
                return True
            if i >= len(self.input_string):  # This path ran out of input without reaching an end node.
                continue
            input_char = self.input_string[i]
            for t in range(len(state.transitions) - 1, -1, -1):  # Go from the last transition to the first, so the first one ends up on top of the stack. (A separate loop variable, so we don't clobber the string index i.)
                matcher, destination_state = state.transitions[t]
                if matcher.matches(input_char, i):  # if it matches the character
                    if matcher.isEpsilon():  # An epsilon matcher does not consume a character, so the string index stays put.
                        stack.append([i, destination_state])
                    else:
                        stack.append([i + 1, destination_state])
        print("Match NOT found!")
        return False
I am now in this commit: 62085c3b09095f12c714b1485642eab2da9064cf . You can follow my progress on my github: https://github.com/personnumber3377/Regexengine
There is also the problem that the finite automaton matches partial strings. But it is actually a feature, not a bug. The guy even addresses it: “It seems to work, but there’s a problem, the machine returns true for strings that partially match that input.” and then later on:
This is clearly wrong and easily fixed by adding a condition in the return statement, BUT since we are implementing a modern regex engine, we don't really want that. Modern regex allow this unless you specify the end of string anchor $. For example in Javascript:
const regex = new RegExp("ab+");
console.log(regex.exec("abbc")); //Array [ "abb" ]
So it's not a bug, it's a feature. When we implement capturing groups it'll be more clear how much text the engine has actually matched.
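Python’s built-in re module behaves the same way, so this is easy to sanity-check:

```python
import re

# re.match anchors at the *start* of the string but not at the end,
# so it happily matches just a prefix, like our engine does.
m = re.match("ab+", "abbc")
print(m.group())  # 'abb': the trailing 'c' is simply left unmatched

# Adding the $ anchor forces the match to reach the end of the string.
print(re.match("ab+$", "abbc"))  # None
```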
When traversing over the input string, we can get stuck in an infinite epsilon loop. Remember that epsilon transitions are taken without consuming an input character, so even if the input string is finite, we can still go around a cycle of epsilon transitions forever. We can circumvent this problem by keeping a memory of the states we have already visited via epsilon transitions, and clearing this memory whenever we take a non-epsilon transition (consuming a character breaks the loop). The formal definition of a finite state machine says that the machine has no memory besides its current state, so from now on we will actually be breaking that definition slightly.
Let’s implement it!
Here is my first try:
        stack = [[0, self.states[self.start_state], set()]]  # The last element is the epsilon memory.
        while len(stack):
            # Pop off the state.
            i, state, epsilon_memory = stack.pop(-1)  # pop from the end of the stack
            if state.name in self.end_nodes:
                print("Match found!")
                return True
            if i >= len(self.input_string):  # This path ran out of input.
                continue
            input_char = self.input_string[i]
            for t in range(len(state.transitions) - 1, -1, -1):  # Go from the last transition to the first. (Separate loop variable, so the string index i is not clobbered.)
                matcher, destination_state = state.transitions[t]
                if matcher.matches(input_char, i):  # if it matches the character
                    if not matcher.isEpsilon():  # If NOT an epsilon matcher, then advance the string for this path.
                        next_i = i + 1
                        new_memory = set()  # Reset the memory, since consuming a character breaks any epsilon loop.
                    else:
                        next_i = i
                        # Check if the current state is already in the epsilon memory. If it is, then we are looping, so do not follow this path.
                        if state.name in epsilon_memory:
                            continue
                        new_memory = epsilon_memory | {state.name}  # Add the current name to a copy of the epsilon memory, so paths don't share one set.
                    stack.append([next_i, destination_state, new_memory])
and then after adding this driver code:
engine = NFAEngine()
engine.add_state("q0")
engine.add_state("q1")
engine.add_state("q2")
engine.set_start_node("q0")
engine.set_end_nodes(["q2"])
engine.add_transition("q0", "q1", CharacterMatcher("a"))
engine.add_transition("q1", "q1", EpsilonMatcher())
engine.add_transition("q1", "q2", CharacterMatcher("b"))
engine.update_input("ab")
engine.transition_states()
my regex engine hangs. :(
This is probably why it didn’t work: (in matchers.py)
def isEpsilon(self) -> bool:
    return False
That is in the epsilon matcher. After replacing that False with True, I no longer get the hang. Great!
I am currently in the 837b11f3748be859bc2832c03d7f9773cf7e23d5 commit.
Ok, so now we have implemented a non-deterministic finite automaton, but we want to use regexes. Our goal is to convert a regex string to an NFA. We will accomplish this by first parsing the regex into an abstract syntax tree and then using that tree to generate the NFA.
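As a toy illustration of the idea (the node classes here are hypothetical stand-ins I made up, not the ones from the blog's ast.js), a regex like a|b becomes a little tree, and a generator then walks that tree to emit NFA states:

```python
from dataclasses import dataclass

# Hypothetical, simplified AST node types, for illustration only.
@dataclass
class Char:
    c: str

@dataclass
class Alternative:  # models something like "a|b"
    options: list

def describe(node) -> str:
    # Walk the tree recursively, the same way an NFA generator would,
    # except here we just pretty-print instead of emitting states.
    if isinstance(node, Char):
        return node.c
    return "(" + "|".join(describe(o) for o in node.options) + ")"

tree = Alternative([Char("a"), Char("b")])
print(describe(tree))  # (a|b)
```

The real generator would do the same kind of recursive walk, but produce states and transitions instead of a string.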
Yeah, they actually use antlr4 for the syntax tree building. Let’s hope that there is a python package for antlr4. There is! Now, antlr works in quite a peculiar way: you first run antlr on a “grammar” file, and it spits out a python file which you then import.
Let’s try it out. I just blatantly copied the grammar from the tutorial into my project.
Ok, so I ran antlr4 but I did not get the RegexVisitor.py file which I expected to be generated. This is because I didn’t pass -visitor in the command line arguments. Now I have that file.
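For reference, the invocation that finally generates the visitor looks roughly like this (the grammar filename is an assumption on my part; use whatever your grammar file is called):

```shell
# Generate a Python3 lexer/parser AND the visitor class.
# -visitor is the flag I was missing; Regex.g4 is a placeholder filename.
antlr4 -Dlanguage=Python3 -visitor Regex.g4
```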
So it seems I have to make my own parser, or rather port the guy’s parser to python, which I will do. Maybe I’ll learn something from it, or maybe not. Let’s see.
Ok, so as it turns out, this takes quite a while. Here is my translated version of ast.js (aka now it is ast.py):
'''
class Regex {
constructor(subpatterns) {
this.subpatterns = subpatterns;
this.groupName = null;
this._capturing = true;
this._atomic = false;
}
isCapturingGroup() {
return this._capturing;
}
nonCapturing() {
this._capturing = false;
return this;
}
isAtomic() {
return this._atomic;
}
atomic() {
this._atomic = true;
this._capturing = false;
return this;
}
}
'''
class Regex:
def __init__(self, subpatterns) -> None:
self.subpatterns = subpatterns
self.groupName = None
self._capturing = True
self._atomic = False
def isCapturingGroup(self) -> bool:
return self._capturing
def nonCapturing(self) -> Regex:
self._capturing = False
return self
def isAtomic(self) -> bool:
return self._atomic
def atomic(self) -> Regex:
self._atomic = True
self._capturing = False
return self
'''
class Expression {
constructor(quantifier, child) {
this.quantifier = quantifier;
this.child = child;
}
}
'''
class Expression:
def __init__(self, quantifier, child) -> None:
self.quantifier = quantifier
self.child = child
'''
class RegexAlternative {
/**
*
* @param {*} groupName Null if it's not a named group.
* @param {...any} alternatives
*/
constructor(...alternatives) {
this.groupName = null;
this._capturing = true;
this.alternatives = alternatives;
this._atomic = false;
}
isCapturingGroup() {
return this._capturing;
}
nonCapturing() {
this._capturing = false;
return this;
}
isAtomic() {
return this._atomic;
}
atomic() {
this._atomic = true;
this._capturing = false;
return this;
}
}
'''
class RegexAlternative:
def __init__(self, alternatives) -> None:
self.groupName = None
self._capturing = True
self.alternatives = alternatives
self._atomic = False
def isCapturingGroup(self) -> bool:
return self._capturing
def nonCapturing(self) -> RegexAlternative:
self._capturing = False
return self
def isAtomic(self):
return self._atomic
def atomic(self):
self._atomic = True
self._capturing = False
return self
'''
class AtomicPattern {
constructor(char) {
this.char = char;
}
}
class DotPattern {
}
'''
class AtomicPattern:
def __init__(self, char):
self.char = char
class DotPattern:
    pass  # no data needed; the class itself marks the "." pattern
'''
class ComplexClassRange {
constructor(start, end) {
this.start = start;
this.end = end;
}
matches(c) {
return c >= this.start && this.end >= c;
}
}
class ComplexClass {
constructor(individialChars, ranges, name, negated) {
this.chars = individialChars;
this.ranges = ranges;
this.name = name;
this.negated = negated;
}
matches(c) {
const base = this.chars.includes(c) || this.ranges.some(x => x.matches(c));
return this.negated ? !base : base;
}
}
'''
class ComplexClassRange:
def __init__(self, start, end) -> None:
self.start = start
self.end = end
def matches(self, c) -> bool:
return c >= self.start && self.end >= c
class ComplexClass:
def __init__(self, individualChars, ranges, name, negated) -> None:
self.chars = individualChars
self.ranges = ranges
self.name = name
self.negated = negated
    def matches(self, c) -> bool:
        # also check the ranges, like the JS version does
        base = c in self.chars or any(r.matches(c) for r in self.ranges)
        if self.negated:
            return not base
        else:
            return base
'''
class DollarAnchor {
}
class CaretAnchor {
}
'''
class DollarAnchor:
    pass
class CaretAnchor:
    pass
'''
class Backreference {
constructor(group) {
this.group = group;
}
}
'''
class Backreference:
def __init__(self, group) -> None:
self.group = group
Now it is time to translate astBuilder.js to astBuilder.py!
Now after looking at the code for a while I found this:
visitBackreference(ctx) {
const [_, group] = /\\([0-9]+)/.exec(ctx.getText());
return new Backreference(Number(group));
}
which seems quite peculiar. We are essentially using regexes to implement a regex engine. I found this quite humorous and thought to share.
Now I am almost done rewriting the entire thing, but it occurs to me that maybe I should have used a JavaScript-to-Python transpiler instead of doing this by hand. Such transpilers exist, but they probably don't work for something as complex as this. I also have another embarrassing confession to make: I don't know how to use regexes in Python. I am going to make a quick test file.
Doing something like this:
import re
regex_string = "\\([0-9]+)"
matcher = re.compile(regex_string)
print(matcher.match("\\10"))
results in this error:
File "/usr/local/lib/python3.12/re/_parser.py", line 975, in parse
raise source.error("unbalanced parenthesis")
re.error: unbalanced parenthesis at position 8
Ok, so we need to change our regexes a bit and use raw strings instead. This seems to work:
import re
regex_string = r"\\([0-9]+)"
matcher = re.compile(regex_string)
print(matcher.match("\\10").group())
So let’s use that.
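To make the escaping rules concrete, here is a tiny standalone check (using the same backreference pattern as above): Python processes backslash escapes in the string literal first, and only then does `re` see the pattern.

```python
import re

plain = "\\([0-9]+)"   # Python collapses \\ to one backslash: re sees \([0-9]+)
raw = r"\\([0-9]+)"    # a raw string keeps both characters: re sees \\([0-9]+)

# In `plain`, \( is an ESCAPED literal '(' -- so the trailing ')' is unbalanced.
# In `raw`, \\ is a literal backslash and ([0-9]+) is a real capture group.
assert plain == r"\([0-9]+)"
match = re.match(raw, "\\10")   # the input string is: backslash, '1', '0'
assert match.group(1) == "10"   # group(1) captures just the digits
```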
After a very tedious session of just typing (almost) mindlessly, I now have this:
import re
from grammar.regexVisitor import regexVisitor
from ast import * # our own ast.py, not the stdlib module (the local file shadows it)
'''
const EPSILON = Symbol("epsilon");
const ASTERISK = Symbol("*");
const PLUS = Symbol("+");
const OPTIONAL = Symbol("?");
const LAZY_ASTERISK = Symbol("*?");
const LAZY_PLUS = Symbol("+?");
const LAZY_OPTIONAL = Symbol("??");
'''
EPSILON = "epsilon"
ASTERISK = "*"
PLUS = "+"
OPTIONAL = "?"
LAZY_ASTERISK = "*?"
LAZY_PLUS = "+?"
LAZY_OPTIONAL = "??"
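One thing worth flagging: a JS `Symbol("*")` is only ever equal to itself, while the Python string `"*"` compares equal to any literal `*` character coming from the input. A quick sketch of the difference (the sentinel approach is my suggestion, not what I actually used):

```python
# The string constant collides with a literal '*' character from the input;
# a unique sentinel object (closer to a JS Symbol) cannot collide.
ASTERISK_AS_STRING = "*"
ASTERISK_AS_SENTINEL = object()   # only equal to itself

literal_star = "*"   # e.g. an escaped '*' taken from the regex text
assert literal_star == ASTERISK_AS_STRING      # accidental collision
assert literal_star != ASTERISK_AS_SENTINEL    # no collision possible
```

For now the string constants mostly work, because nothing compares a single pattern character against a multi-character string like "epsilon", but it is a latent footgun.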
'''
const SHORTHAND_CHARACTER_CLASSES = {
"\\d": new ComplexClass([], [new ComplexClassRange("0", "9")], "\d", false),
"\\D": new ComplexClass([], [new ComplexClassRange("0", "9")], "\D", true),
"\\w": new ComplexClass(["_"], [new ComplexClassRange("A", "Z"), new ComplexClassRange("a", "z"), new ComplexClassRange("0", "9")], "\w", false),
"\\W": new ComplexClass(["_"], [new ComplexClassRange("A", "Z"), new ComplexClassRange("a", "z"), new ComplexClassRange("0", "9")], "\W", true),
"\\s": new ComplexClass( [" ", "\f", "\n", "\r", "\t", "\v", "\u00a0", "\u1680", "\u2028","\u2029","\u202f", "\u205f", "\u3000", "\ufeff"],
[new ComplexClassRange("\u2000", "\u200a")], "\s", false),
"\\S": new ComplexClass( [" ", "\f", "\n", "\r", "\t", "\v", "\u00a0", "\u1680", "\u2028","\u2029","\u202f", "\u205f", "\u3000", "\ufeff"],
[new ComplexClassRange("\u2000", "\u200a")], "\S", true),
};
'''
SHORTHAND_CHARACTER_CLASSES = {"\\d": ComplexClass([], [ComplexClassRange("0", "9")], "\d", False),
"\\D": ComplexClass([], [ComplexClassRange("0", "9")], "\D", True),
"\\w": ComplexClass(["_"], [ComplexClassRange("A", "Z"), ComplexClassRange("a", "z"), ComplexClassRange("0", "9")], "\w", False),
"\\W": ComplexClass(["_"], [ComplexClassRange("A", "Z"), ComplexClassRange("a", "z"), ComplexClassRange("0", "9")], "\W", True),
"\\s": ComplexClass([" ", "\f", "\n", "\r", "\t", "\v", "\u00a0", "\u1680", "\u2028", "\u2029", "\u202f", "\u205f", "\u3000", "\ufeff"],
[ComplexClassRange("\u2000", "\u200a")], "\s", False),
"\\S": ComplexClass([" ", "\f", "\n", "\r", "\t", "\v", "\u00a0", "\u1680", "\u2028", "\u2029", "\u202f", "\u205f", "\u3000", "\ufeff"],
[ComplexClassRange("\u2000", "\u200a")], "\S", True),
}
ESPECIAL_ESCAPE_CHARACTERS = {"\\n": "\n", "\\b": "\b", "\\t": "\t"}
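To sanity-check how the shorthand table is supposed to behave, here is a standalone demo using minimal stand-in copies of the ast.py classes (stand-ins only so the snippet runs on its own):

```python
# Stand-in copies of ComplexClassRange / ComplexClass from ast.py.
class Range:
    def __init__(self, start, end):
        self.start, self.end = start, end

    def matches(self, c):
        return self.start <= c <= self.end

class CharClass:
    def __init__(self, chars, ranges, name, negated):
        self.chars, self.ranges = chars, ranges
        self.name, self.negated = name, negated

    def matches(self, c):
        # a char matches if it is listed individually or falls in a range
        base = c in self.chars or any(r.matches(c) for r in self.ranges)
        return not base if self.negated else base

digit = CharClass([], [Range("0", "9")], r"\d", False)      # like "\\d"
non_digit = CharClass([], [Range("0", "9")], r"\D", True)   # like "\\D"
assert digit.matches("5") and not digit.matches("x")
assert non_digit.matches("x") and not non_digit.matches("5")
```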
'''
class AstBuilder extends RegexVisitor {
'''
class AstBuilder(regexVisitor):
'''
visitMain(ctx) {
return this.visit(ctx.regex());
}
'''
def visitMain(self, ctx):
return self.visit(ctx.regex()) # I have absolutely no idea what this does
'''
visitRegex(ctx) {
if (ctx.alternative().length === 0)
return new Regex(this.visit(ctx.expr()));
else {
const main = ctx.expr().length === 0 ? this._epsilonAlternative() : new Regex(this.visit(ctx.expr()));
const alternatives = this.visit(ctx.alternative());
main.nonCapturing();
alternatives.forEach(x => x.nonCapturing());
return new RegexAlternative(main, ...alternatives);
}
}
'''
def visitRegex(self, ctx):
if len(ctx.alternative()) == 0:
return Regex(self.visit(ctx.expr()))
else:
if len(ctx.expr()) == 0:
main = self._epsilonAlternative()
else:
main = Regex(self.visit(ctx.expr()))
alternatives = self.visit(ctx.alternative())
main.nonCapturing()
for x in alternatives:
x.nonCapturing()
return RegexAlternative([main] + alternatives) # flatten main plus the alternatives, like the JS spread
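The JS spread `new RegexAlternative(main, ...alternatives)` flattens the array into separate arguments; if the Python constructor takes one list, the equivalent is list concatenation. A tiny check of the shape difference:

```python
main = "main"
alternatives = ["alt1", "alt2"]

nested = [main, alternatives]   # [main, alternatives] builds a NESTED list
flat = [main] + alternatives    # concatenation matches the JS spread

assert nested == ["main", ["alt1", "alt2"]]
assert flat == ["main", "alt1", "alt2"]
```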
'''
visitAlternative(ctx) {
if (ctx.expr().length !== 0)
return new Regex(this.visit(ctx.expr()));
else
return this._epsilonAlternative();
}
'''
def visitAlternative(self, ctx):
if len(ctx.expr()) != 0:
return Regex(self.visit(ctx.expr()))
else:
return self._epsilonAlternative()
'''
_epsilonAlternative() {
return new Regex([new Expression(null,new AtomicPattern(EPSILON))]);
}
'''
def _epsilonAlternative(self):
return Regex([Expression(None, AtomicPattern(EPSILON))])
'''
visitExpr(ctx) {
return new Expression(ctx.quantifier() ? this.visit(ctx.quantifier()) : null, this.visit(ctx.subexpr()));
}
'''
def visitExpr(self, ctx):
if ctx.quantifier():
return Expression(self.visit(ctx.quantifier()), self.visit(ctx.subexpr()))
else:
return Expression(None, self.visit(ctx.subexpr()))
'''
visitAtomicPattern(ctx) {
return new AtomicPattern(ctx.getText());
}
'''
def visitAtomicPattern(self, ctx):
return AtomicPattern(ctx.getText())
'''
visitEscapedReservedAtomicPattern(ctx) {
const char = ctx.getText();
const pattern = char in ESPECIAL_ESCAPE_CHARACTERS ? ESPECIAL_ESCAPE_CHARACTERS[char] : char.substring(1);
return new AtomicPattern(pattern);
}
'''
def visitEscapedReservedAtomicPattern(self, ctx):
char = ctx.getText()
if char in ESPECIAL_ESCAPE_CHARACTERS:
pattern = ESPECIAL_ESCAPE_CHARACTERS[char]
else:
# Thanks to https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/substr
pattern = char[1:]
return AtomicPattern(pattern)
'''
visitCharacterClass(ctx) {
const txt = ctx.getText();
return SHORTHAND_CHARACTER_CLASSES[txt];
}
'''
    def visitCharacterClass(self, ctx):
txt = ctx.getText()
assert txt in SHORTHAND_CHARACTER_CLASSES
return SHORTHAND_CHARACTER_CLASSES[txt]
'''
visitComplexCharacterClass(ctx) {
const negated = Boolean(ctx.negated);
const children = this.visit(ctx.complexCCPiece());
const single = [];
const ranges = [];
for (const c of children) {
single.push(...c.singles);
ranges.push(...c.ranges);
}
return new ComplexClass(single, ranges, ctx.getText(), negated);
}
'''
    def visitComplexCharacterClass(self, ctx):
        negated = bool(ctx.negated)
        children = self.visit(ctx.complexCCPiece())
        single = []
        ranges = []
        for c in children:
            single.extend(c[0])  # the singles list of each piece
            ranges.extend(c[1])  # the ranges list of each piece
        return ComplexClass(single, ranges, ctx.getText(), negated)
'''
visitCcPiece_Respone(ctx) {
const piece = this.visit(ctx.allowedCharInCharacterClass());
const base = {singles: [], ranges: []};
if (piece.length === 1)
base.singles.push(piece[0]);
else
base.ranges.push(new ComplexClassRange(piece[0], piece[1]));
return base;
}
'''
    def visitCcPiece_Respone(self, ctx):
        piece = self.visit(ctx.allowedCharInCharacterClass())
        base = [[], []]  # [singles, ranges], mirroring the JS object
        if len(piece) == 1:
            base[0].append(piece[0])
        else:
            base[1].append(ComplexClassRange(piece[0], piece[1]))
        return base
'''
visitCcPiece_Escape(ctx) {
const txt = ctx.getText();
const base = {singles: [], ranges: []};
base.ranges.push(SHORTHAND_CHARACTER_CLASSES[txt]);
return base;
}
'''
def visitCcPiece_Escape(self, ctx):
txt = ctx.getText()
base = [[],[]]
base[1].append(SHORTHAND_CHARACTER_CLASSES[txt])
return base
'''
visitDotPattern() {
return new ComplexClass(["\n", "\r"], [], true, ".");
}
visitDollarAnchor() {
return new DollarAnchor();
}
visitCaretAnchor() {
return new CaretAnchor();
}
visitComplexClass(ctx) {
return this.visit(ctx.complexCharacterClass());
}
'''
def visitDotPattern(self):
return ComplexClass(["\n", "\r"], [], ".", True) # the blog's JS passes name and negated swapped; this order matches the constructor
def visitDollarAnchor(self):
return DollarAnchor()
def visitCaretAnchor(self):
return CaretAnchor()
def visitComplexClass(self, ctx):
return self.visit(ctx.complexCharacterClass())
'''
visitBackreference(ctx) {
const [_, group] = /\\([0-9]+)/.exec(ctx.getText());
return new Backreference(Number(group));
}
visitAllowedCharInCharacterClass(ctx) {
const txt = ctx.getText();
return txt[0] === "\\" ? txt.substring(1) : txt;
}
'''
    def visitBackreference(self, ctx):
        match = re.match(r"\\([0-9]+)", ctx.getText())
        group = match.group(1)  # group(1) is just the digits, no backslash
        return Backreference(int(group))
def visitAllowedCharInCharacterClass(self, ctx):
txt = ctx.getText()
if txt[0] == "\\":
return txt[1:]
else:
return txt
'''
visitAsteriskQuantifier() {
return ASTERISK;
}
visitQuestionQuantifier() {
return OPTIONAL;
}
visitLazyAsteriskQuantifier() {
return LAZY_ASTERISK;
}
visitLazyPlusQuantifier() {
return LAZY_PLUS;
}
visitLazyQuestionQuantifier() {
return LAZY_OPTIONAL;
}
visitGroupPattern(ctx) {
return this.visit(ctx.regexGroup());
}
'''
def visitAsteriskQuantifier(self):
return ASTERISK
def visitQuestionQuantifier(self):
return OPTIONAL
def visitLazyAsteriskQuantifier(self):
return LAZY_ASTERISK
def visitLazyPlusQuantifier(self):
return LAZY_PLUS
def visitLazyQuestionQuantifier(self):
return LAZY_OPTIONAL
def visitGroupPattern(self, ctx):
return self.visit(ctx.regexGroup())
'''
visitRegexGroup(ctx) {
const alternative = this.visit(ctx.regex());
alternative.groupName = ctx.name.map(x => x.text).join("");
if (ctx.nonCapture) alternative.nonCapturing();
if (ctx.atomic) alternative.atomic();
return alternative;
}
visitPlusQuantifier() {
return PLUS;
}
'''
def visitRegexGroup(self, ctx):
alternative = self.visit(ctx.regex())
final_name = ""
for thing in ctx.name:
final_name += thing.text
alternative.groupName = final_name
if ctx.nonCapture:
alternative.nonCapturing()
if ctx.atomic:
alternative.atomic()
return alternative
def visitPlusQuantifier(self):
return PLUS
Now just convert parser.js to parser.py aaanddd.. Uh oh..
Exception: Could not deserialize ATN with version (expected 4)
Ok, so I had to change the antlr4 version number, as described in this issue: https://github.com/antlr/antlr4/issues/3997
Now I get this error:
return c >= self.start && self.end >= c
^
SyntaxError: invalid syntax
This is easily fixable.
And apparently this is not how typehints work:
def nonCapturing(self) -> Regex:
NameError: name 'Regex' is not defined
After a couple of small changes now I can call the file by doing python3 parser.py , but the file doesn’t actually do anything yet.
Fuck! We were so close yet so far:
chars == abcdefg
Type of chars: <class 'antlr4.InputStream.InputStream'>
Lexer: <grammar.regexLexer.regexLexer object at 0x7f1dfc1350a0>
parser == <grammar.regexParser.regexParser object at 0x7f1dfb7e0160>
AstBuilder() == <astbuilder.AstBuilder object at 0x7f1dfb7f22b0>
Traceback (most recent call last):
File "parser.py", line 64, in <module>
exit(main())
File "parser.py", line 59, in main
parseRegex("abcdefg")
File "parser.py", line 56, in parseRegex
return AstBuilder().visit(tree)
File "/home/cyberhacker/.local/lib/python3.8/site-packages/antlr4/tree/Tree.py", line 34, in visit
return tree.accept(self)
File "/home/cyberhacker/Asioita/Ohjelmointi/Python/Regexengine/grammar/regexParser.py", line 144, in accept
return visitor.visitMain(self)
File "/home/cyberhacker/Asioita/Ohjelmointi/Python/Regexengine/astbuilder.py", line 60, in visitMain
return self.visit(ctx.regex()) # I have absolutely no idea what this does
File "/home/cyberhacker/.local/lib/python3.8/site-packages/antlr4/tree/Tree.py", line 34, in visit
return tree.accept(self)
File "/home/cyberhacker/Asioita/Ohjelmointi/Python/Regexengine/grammar/regexParser.py", line 202, in accept
return visitor.visitRegex(self)
File "/home/cyberhacker/Asioita/Ohjelmointi/Python/Regexengine/astbuilder.py", line 78, in visitRegex
return Regex(self.visit(ctx.expr()))
File "/home/cyberhacker/.local/lib/python3.8/site-packages/antlr4/tree/Tree.py", line 34, in visit
return tree.accept(self)
AttributeError: 'list' object has no attribute 'accept'
Here is my current script:
import antlr4
from grammar.regexLexer import *
from grammar.regexParser import *
from astbuilder import *
'''
This file is basically just this:
const antlr4 = require('antlr4');
const { ErrorListener } = require('antlr4/error');
const { AstBuilder } = require('./astBuilder');
const {regexLexer: RegexLexer} = require('./generated/regexLexer');
const {regexParser: RegexParser} = require('./generated/regexParser');
class CustomErrorListener extends ErrorListener {
syntaxError(recognizer, offendingSymbol, line, column, msg, e) {
console.log(msg);
throw Error(msg);
}
}
function parseRegex(regex) {
const chars = new antlr4.InputStream(regex)
const lexer = new RegexLexer(chars)
const tokens = new antlr4.CommonTokenStream(lexer)
const parser = new RegexParser(tokens)
parser.buildParseTrees = true;
parser.addErrorListener(new CustomErrorListener());
const tree = parser.main();
return new AstBuilder().visit(tree);
}
module.exports = parseRegex;
'''
def parseRegex(regex: str):
chars = antlr4.InputStream(regex)
print("chars == "+str(chars))
print("Type of chars: "+str(type(chars)))
lexer = regexLexer(chars)
print("Lexer: "+str(lexer))
# const tokens = new antlr4.CommonTokenStream(lexer)
tokens = antlr4.CommonTokenStream(lexer)
parser = regexParser(tokens)
print("parser == "+str(parser))
# parser.buildParseTrees = true;
parser.buildParseTrees = True
# const tree = parser.main();
tree = parser.main()
# return new AstBuilder().visit(tree);
print("AstBuilder() == "+str(AstBuilder()))
return AstBuilder().visit(tree)
def main() -> int:
parseRegex("abcdefg")
return 0
if __name__=="__main__":
exit(main())
I added debug statements to the code (including the antlr4 library) and here are the results:
chars == abcdefg
Type of chars: <class 'antlr4.InputStream.InputStream'>
Lexer: <grammar.regexLexer.regexLexer object at 0x7f61c1a070d0>
parser == <grammar.regexParser.regexParser object at 0x7f61c10d20a0>
AstBuilder() == <astbuilder.AstBuilder object at 0x7f61c10de3d0>
tree == []
type of tree: <class 'grammar.regexParser.regexParser.MainContext'>
tree == [22]
type of tree: <class 'grammar.regexParser.regexParser.RegexContext'>
oof == [<grammar.regexParser.regexParser.ExprContext object at 0x7f61c10d2b20>, <grammar.regexParser.regexParser.ExprContext object at 0x7f61c10d2c40>, <grammar.regexParser.regexParser.ExprContext object at 0x7f61c10d2d60>, <grammar.regexParser.regexParser.ExprContext object at 0x7f61c10d2d00>, <grammar.regexParser.regexParser.ExprContext object at 0x7f61c10de790>, <grammar.regexParser.regexParser.ExprContext object at 0x7f61c10de0d0>, <grammar.regexParser.regexParser.ExprContext object at 0x7f61c10de220>]
tree == [<grammar.regexParser.regexParser.ExprContext object at 0x7f61c10d2b20>, <grammar.regexParser.regexParser.ExprContext object at 0x7f61c10d2c40>, <grammar.regexParser.regexParser.ExprContext object at 0x7f61c10d2d60>, <grammar.regexParser.regexParser.ExprContext object at 0x7f61c10d2d00>, <grammar.regexParser.regexParser.ExprContext object at 0x7f61c10de790>, <grammar.regexParser.regexParser.ExprContext object at 0x7f61c10de0d0>, <grammar.regexParser.regexParser.ExprContext object at 0x7f61c10de220>]
type of tree: <class 'list'>
Now, this is quite a predicament. Maybe I should actually do some research on how to use antlr in python? That possibly could work. :D
Ok, so I have managed to somewhat narrow down the problem, see when there is only one character in the regex, aka “a” for example, then I get this:
chars == a
Type of chars: <class 'antlr4.InputStream.InputStream'>
Lexer: <grammar.regexLexer.regexLexer object at 0x7fe5f78880d0>
parser == <grammar.regexParser.regexParser object at 0x7fe5f6f35130>
(main (regex (expr (subexpr (atomicChar a)))) <EOF>)
AstBuilder() == <astbuilder.AstBuilder object at 0x7fe5f6f35d00>
tree == []
type of tree: <class 'grammar.regexParser.regexParser.MainContext'>
thing == [22]
tree == [22]
type of tree: <class 'grammar.regexParser.regexParser.RegexContext'>
oof == [<grammar.regexParser.regexParser.ExprContext object at 0x7fe5f6f35bb0>]
tree == [<grammar.regexParser.regexParser.ExprContext object at 0x7fe5f6f35bb0>]
type of tree: <class 'list'>
and then if I have “ab” for example, I get this:
chars == ab
Type of chars: <class 'antlr4.InputStream.InputStream'>
Lexer: <grammar.regexLexer.regexLexer object at 0x7f74f9a400d0>
parser == <grammar.regexParser.regexParser object at 0x7f74f90ed130>
(main (regex (expr (subexpr (atomicChar a))) (expr (subexpr (atomicChar b)))) <EOF>)
AstBuilder() == <astbuilder.AstBuilder object at 0x7f74f90ede20>
tree == []
type of tree: <class 'grammar.regexParser.regexParser.MainContext'>
thing == [22]
tree == [22]
type of tree: <class 'grammar.regexParser.regexParser.RegexContext'>
oof == [<grammar.regexParser.regexParser.ExprContext object at 0x7f74f90edbb0>, <grammar.regexParser.regexParser.ExprContext object at 0x7f74f90edd90>]
tree == [<grammar.regexParser.regexParser.ExprContext object at 0x7f74f90edbb0>, <grammar.regexParser.regexParser.ExprContext object at 0x7f74f90edd90>]
type of tree: <class 'list'>
Ok, so oof is actually a list, but the function expects some other type. Let's run the guy's program and check the type of the variable in his code…
Here is the node output of oof:
Here is the actual thing...
[
ExprContext {
parentCtx: RegexContext {
parentCtx: [MainContext],
invokingState: 22,
ruleIndex: 1,
children: [Array],
start: [CommonToken],
stop: [CommonToken],
exception: null,
parser: [regexParser]
},
invokingState: 25,
ruleIndex: 3,
children: [ [AtomicPatternContext] ],
start: CommonToken {
source: [Array],
type: 20,
channel: 0,
start: 0,
stop: 0,
tokenIndex: 0,
line: 1,
column: 0,
_text: null
},
stop: CommonToken {
source: [Array],
type: 20,
channel: 0,
start: 0,
stop: 0,
tokenIndex: 0,
line: 1,
column: 0,
_text: null
},
exception: null,
parser: regexParser {
_listeners: [Array],
_interp: [ParserATNSimulator],
_stateNumber: -1,
_input: [CommonTokenStream],
_errHandler: [DefaultErrorStrategy],
_precedenceStack: [Array],
_ctx: null,
buildParseTrees: true,
_tracer: null,
_parseListeners: null,
_syntaxErrors: 0,
ruleNames: [Array],
literalNames: [Array],
symbolicNames: [Array]
}
},
ExprContext {
parentCtx: RegexContext {
parentCtx: [MainContext],
invokingState: 22,
ruleIndex: 1,
children: [Array],
start: [CommonToken],
stop: [CommonToken],
exception: null,
parser: [regexParser]
},
invokingState: 25,
ruleIndex: 3,
children: [ [GroupPatternContext] ],
start: CommonToken {
source: [Array],
type: 6,
channel: 0,
start: 1,
stop: 1,
tokenIndex: 1,
line: 1,
column: 1,
_text: null
},
stop: CommonToken {
source: [Array],
type: 7,
channel: 0,
start: 3,
stop: 3,
tokenIndex: 3,
line: 1,
column: 3,
_text: null
},
exception: null,
parser: regexParser {
_listeners: [Array],
_interp: [ParserATNSimulator],
_stateNumber: -1,
_input: [CommonTokenStream],
_errHandler: [DefaultErrorStrategy],
_precedenceStack: [Array],
_ctx: null,
buildParseTrees: true,
_tracer: null,
_parseListeners: null,
_syntaxErrors: 0,
ruleNames: [Array],
literalNames: [Array],
symbolicNames: [Array]
}
},
ExprContext {
parentCtx: RegexContext {
parentCtx: [MainContext],
invokingState: 22,
ruleIndex: 1,
children: [Array],
start: [CommonToken],
stop: [CommonToken],
exception: null,
parser: [regexParser]
},
invokingState: 25,
ruleIndex: 3,
children: [ [AtomicPatternContext] ],
start: CommonToken {
source: [Array],
type: 20,
channel: 0,
start: 4,
stop: 4,
tokenIndex: 4,
line: 1,
column: 4,
_text: null
},
stop: CommonToken {
source: [Array],
type: 20,
channel: 0,
start: 4,
stop: 4,
tokenIndex: 4,
line: 1,
column: 4,
_text: null
},
exception: null,
parser: regexParser {
_listeners: [Array],
_interp: [ParserATNSimulator],
_stateNumber: -1,
_input: [CommonTokenStream],
_errHandler: [DefaultErrorStrategy],
_precedenceStack: [Array],
_ctx: null,
buildParseTrees: true,
_tracer: null,
_parseListeners: null,
_syntaxErrors: 0,
ruleNames: [Array],
literalNames: [Array],
symbolicNames: [Array]
}
}
]
Ok, so it prints like a list because it really is one, and the JavaScript version gets a list too, yet there it works. After looking at the source code for the JavaScript antlr4 runtime I found this:
ParseTreeVisitor.prototype.visit = function(ctx) {
if (Array.isArray(ctx)) {
return ctx.map(function(child) {
return child.accept(this);
}, this);
} else {
return ctx.accept(this);
}
};
Well, there’s why it doesn’t work.
After doing some modding of the antlr library:
class ParseTreeVisitor(object):
# This actually doesn't work
#def visit(self, tree):
# return tree.accept(self)
'''
ParseTreeVisitor.prototype.visit = function(ctx) {
if (Array.isArray(ctx)) {
return ctx.map(function(child) {
return child.accept(this);
}, this);
} else {
return ctx.accept(this);
}
};
'''
def visit(self, tree):
if isinstance(tree, list):
# Array.map basically.
out = []
for x in tree:
out.append(x.accept(self))
return out
else:
tree.accept(self)
Now it works for a simple regex such as "a", but it doesn't work for a more complicated regex such as "a|b". Still, this is a good sign, because we are at least moving in the right direction.
More debugging!
Well, the reason was that I forgot to put a “return” in the fix:
else:
tree.accept(self) # <--- HERE!
After putting in return tree.accept(self), it now works for more complex regexes.
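Patching the installed antlr4 package works, but the fix disappears on every reinstall. A less invasive alternative (my own idea, not from the blog) would be to override visit in our own visitor subclass instead. Sketch with stand-in classes so it runs on its own:

```python
# Stand-in for antlr4's ParseTreeVisitor; the real one lives in Tree.py.
class ParseTreeVisitorStub:
    def visit(self, tree):
        return tree.accept(self)

class ListAwareVisitor(ParseTreeVisitorStub):
    def visit(self, tree):
        # Mirror the JS behaviour: map over lists, delegate otherwise.
        if isinstance(tree, list):
            return [child.accept(self) for child in tree]
        return tree.accept(self)  # don't forget the return!

class FakeNode:  # stand-in for a parse-tree node
    def accept(self, visitor):
        return "visited"

v = ListAwareVisitor()
assert v.visit(FakeNode()) == "visited"
assert v.visit([FakeNode(), FakeNode()]) == ["visited", "visited"]
```

This keeps the library untouched, at the cost of diverging slightly from how the blog structures things.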
I copied all of the testcases from the blog into my program, but they don't all pass. It fails for the "a+" case.
I get this output:
chars == a+
Type of chars: <class 'antlr4.InputStream.InputStream'>
Lexer: <grammar.regexLexer.regexLexer object at 0x7fd464bd20d0>
parser == <grammar.regexParser.regexParser object at 0x7fd464283100>
(main (regex (expr (subexpr (atomicChar a)) (quantifier +))) <EOF>)
AstBuilder() == <astbuilder.AstBuilder object at 0x7fd464290d00>
thing == [22]
ctx.alternative() == []
oof == [<grammar.regexParser.regexParser.ExprContext object at 0x7fd464283b80>]
type(oof) == <class 'list'>
Traceback (most recent call last):
File "parser.py", line 98, in <module>
exit(main())
File "parser.py", line 89, in main
parseRegex("a+")
File "parser.py", line 86, in parseRegex
return AstBuilder().visit(tree)
File "/home/cyberhacker/.local/lib/python3.8/site-packages/antlr4/tree/Tree.py", line 58, in visit
return tree.accept(self)
File "/home/cyberhacker/Asioita/Ohjelmointi/Python/Regexengine/grammar/regexParser.py", line 144, in accept
return visitor.visitMain(self)
File "/home/cyberhacker/Asioita/Ohjelmointi/Python/Regexengine/astbuilder.py", line 62, in visitMain
return self.visit(thing) # I have absolutely no idea what this does
File "/home/cyberhacker/.local/lib/python3.8/site-packages/antlr4/tree/Tree.py", line 58, in visit
return tree.accept(self)
File "/home/cyberhacker/Asioita/Ohjelmointi/Python/Regexengine/grammar/regexParser.py", line 202, in accept
return visitor.visitRegex(self)
File "/home/cyberhacker/Asioita/Ohjelmointi/Python/Regexengine/astbuilder.py", line 86, in visitRegex
poopoo = self.visit(oof)
File "/home/cyberhacker/.local/lib/python3.8/site-packages/antlr4/tree/Tree.py", line 55, in visit
out.append(x.accept(self))
File "/home/cyberhacker/Asioita/Ohjelmointi/Python/Regexengine/grammar/regexParser.py", line 331, in accept
return visitor.visitExpr(self)
File "/home/cyberhacker/Asioita/Ohjelmointi/Python/Regexengine/astbuilder.py", line 131, in visitExpr
return Expression(self.visit(ctx.quantifier()), self.visit(ctx.subexpr()))
File "/home/cyberhacker/.local/lib/python3.8/site-packages/antlr4/tree/Tree.py", line 58, in visit
return tree.accept(self)
File "/home/cyberhacker/Asioita/Ohjelmointi/Python/Regexengine/grammar/regexParser.py", line 1281, in accept
return visitor.visitPlusQuantifier(self)
TypeError: visitPlusQuantifier() takes 1 positional argument but 2 were given
After doing this modification:
def accept(self, visitor:ParseTreeVisitor):
if hasattr( visitor, "visitPlusQuantifier" ):
print("self inside accept: "+str(self))
#return visitor.visitPlusQuantifier(self)
return visitor.visitPlusQuantifier()
else:
return visitor.visitChildren(self)
It now works, and I made the same change to all the other quantifier visitors as well. I am 100% sure that this will bite me in the ass later, but that is a problem for future me.
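For the record, the less invasive fix (my own suggestion, not from the blog) goes the other way: leave the generated accept(self, visitor) untouched and give every visitor method the ctx parameter the generated code passes, even when it is unused. That way regenerating the parser from the grammar doesn't wipe out the fix. Sketch with a stand-in context object:

```python
PLUS = "+"

class AstBuilderSketch:
    # Accept the ctx argument the generated accept() passes, and ignore it.
    def visitPlusQuantifier(self, ctx):
        return PLUS

class FakeCtx:  # stand-in for the generated PlusQuantifierContext
    pass

assert AstBuilderSketch().visitPlusQuantifier(FakeCtx()) == PLUS
```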
Ok, so now that we have the parsing shit out of the way, it is actually time to write the program which constructs the NFA from a regex string (or rather, from the AST that was generated from the regex string).
Ok, so reading the blog post, the guy decided to use Thompson construction. I looked it up on Wikipedia: https://en.wikipedia.org/wiki/Thompson%27s_construction and it turns out we can actually convert an NFA to a DFA first, which can then be minimized. The guy said that minimizing an NFA is NP-hard, but minimizing a DFA is relatively easy! But let's not get ahead of ourselves.
So the algorithm works on the basis that each regular expression is broken into subexpressions.
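As a warm-up, here is a toy illustration of the two most basic gluing steps in Thompson's construction: a one-step NFA for a single symbol, and the union of two sub-NFAs via epsilon transitions. This uses plain dicts, not the engine's real classes.

```python
import itertools

EPS = "epsilon"

def one_step(counter, symbol):
    # Two states and a single transition between them: a --symbol--> b
    a, b = f"q{next(counter)}", f"q{next(counter)}"
    return {"start": a, "end": b, "trans": {(a, symbol): [b]}}

def union(counter, n1, n2):
    # New start/end states, wired to both branches with epsilon transitions.
    s, e = f"q{next(counter)}", f"q{next(counter)}"
    trans = {**n1["trans"], **n2["trans"]}
    trans[(s, EPS)] = [n1["start"], n2["start"]]
    trans.setdefault((n1["end"], EPS), []).append(e)
    trans.setdefault((n2["end"], EPS), []).append(e)
    return {"start": s, "end": e, "trans": trans}

c = itertools.count()
a_or_b = union(c, one_step(c, "a"), one_step(c, "b"))  # toy NFA for a|b
assert a_or_b["trans"][(a_or_b["start"], EPS)] == ["q0", "q2"]
```

Concatenation and the star work the same way: take the sub-NFAs for the pieces and wire them together with fresh states and epsilon transitions.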
Sun Jan 28 05:26:38 PM EET 2024
Ok so it has been roughly a month since I last looked at this project. (Time flies.)
The blog post is written in a weird style, because it first writes the functions and only later the class which encompasses them. I am going to write this in "reverse" order, where I first write the skeleton class and then the functions.
Here is the NFABuilder class, which will be the main class we use to build NFAs from our regexes (we will eventually make a wrapper class which also handles converting the regex string to an AST):
# This file is actually the program, which constructs the NFA from an AST generated by antlr4 from the regex string.
from ast import *
from astBuilder import *
from NFA import *
class NFABuilder:
def __init__(self):
self.stateNumber = 0
def newState(self):
c = 'q'+str(self.stateNumber)
self.stateNumber += 1
return c
    def resetStateNumbers(self):
        self.stateNumber = 0  # start handing out state names from q0 again
def regexToNFA(self, regexAST): # This converts the regex abstract syntax tree to an NFA
self.resetStateNumbers()
return self._regexToNFA(regexAST)
def _regexToNFA(self, regexAST):
if isinstance(regexAST, RegexAlternative): # Check for "|" or similar.
return self._alternativeToNFA(regexAST) # Not yet defined
else:
return self._singleRegexToNFA(regexAST) # Also not yet defined
Let's first write some helpers and abstractions. (In the blog post there is this line: "In code we'll see this pattern (two states with a single transition) multiple times, so I'll make a small abstraction called _oneStepNFA.") We will also see a lot of empty epsilon transitions, so let's create a small shorthand function for those. Since the empty epsilon transition uses the _oneStepNFA abstraction, let's write _oneStepNFA first.
# snip
def _oneStepNFA(self, matcher): # This is used to create a transition where there are two states and a singular transition between them.
nfa = NFAEngine() # From NFA.py
# Get the two states required for this transition
a = self.newState()
b = self.newState()
nfa.addState(a)
nfa.addState(b)
nfa.addTransition(a,b,matcher) # The matcher will be passed to this function.
return nfa # Return the NFA, where there are only two states and a singular transition between them.
# and then let's write _emptyExpression
def _emptyExpression(self, character):
return self._oneStepNFA(EpsilonMatcher()) # from matchers.py
# snip
I don't know why there is the character argument to _emptyExpression.
Then, just as in the blog post, we implement a transition for a single character.
def _atomicPatternNFA(self, character):
matcher = EpsilonMatcher() if character == EPSILON else CharacterMatcher(character) # Check for epsilon
return self._oneStepNFA(matcher)
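The builder code above leans on NFA.py and matchers.py, which I haven't shown yet. Here is a minimal sketch of the interface it assumes; the names (NFAEngine, addState, addTransition, EpsilonMatcher, CharacterMatcher) come from the snippets above, but the bodies are my own guesses.

```python
class EpsilonMatcher:
    # An epsilon transition consumes no input; the engine handles it specially.
    def matches(self, c):
        return True

    def isEpsilon(self):
        return True

class CharacterMatcher:
    def __init__(self, char):
        self.char = char

    def matches(self, c):
        return c == self.char

    def isEpsilon(self):
        return False

class NFAEngine:
    def __init__(self):
        self.states = set()
        self.transitions = []  # list of (fromState, toState, matcher)

    def addState(self, name):
        self.states.add(name)

    def addTransition(self, fromState, toState, matcher):
        self.transitions.append((fromState, toState, matcher))

# The shape _oneStepNFA builds: q0 --'a'--> q1
nfa = NFAEngine()
nfa.addState("q0")
nfa.addState("q1")
nfa.addTransition("q0", "q1", CharacterMatcher("a"))
assert len(nfa.transitions) == 1
```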
Ok, so now it is time to actually implement the “gluing together” of NFA:s.