If you don’t know what ANTLR is, here is what the official site says about it:
ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files. It’s widely used to build languages, tools, and frameworks. From a grammar, ANTLR generates a parser that can build and walk parse trees.
I’m writing this article because I couldn’t find any good article/blog explaining how to use it with Python. The book for ANTLR 4, called “The Definitive ANTLR 4 Reference” is written by the ANTLR 4’s creator Terence Parr, but it’s written for Java, and I don’t like Java.
Examples in this article are from Debian GNU/Linux. The example shown here can be followed for macOS and other Linux distributions as well, with no or minor changes.
ANTLR and Python Setup
ANTLR has two parts
- ANTLR tool, which will generate a parser from grammar. This is written in JAVA, and hence, you need to have java installed on your machine.
- ANTLR runtime library, the ANTLR tool can generate parser in many languages (Python in this case). To use the parser in our programs, we need a runtime library.
Installing ANTLR tool
To get ANTLR tool, we just need to download the .jar
file.
As we can see in above diagram, we pass our grammar as an input to the ANTLR tool, and it generated parser for many different languages.
Execute the following two command, this will download the antlr-4.13.1-complete.jar
into /usr/local/lib
directory.
You can store the .jar
anywhere you want, but then you need to adjust the next commands.
$ cd /usr/local/lib
$ curl -O http://www.antlr.org/download/antlr-4.13.1-complete.jar
Add that path in the CLASSPATH
, so that Java can find that .jar
i.e. ANTLR tool.
$ export CLASSPATH="/usr/local/lib/antlr-4.13.1-complete.jar:$CLASSPATH"
Check if everything executed correctly. If you get the same response from both the following commands, then it means that ANTLR tool installed correctly.
$ java -jar /usr/local/lib/antlr-4.13.1-complete.jar
ANTLR Parser Generator Version 4.13.1
-o ___ specify output directory where all output is generated
-lib ___ specify location of grammars, tokens files
-atn generate rule augmented transition network diagrams
-encoding ___ specify grammar file encoding; e.g., euc-jp
...
...
$ java org.antlr.v4.Tool
ANTLR Parser Generator Version 4.13.1
-o ___ specify output directory where all output is generated
-lib ___ specify location of grammars, tokens files
-atn generate rule augmented transition network diagrams
-encoding ___ specify grammar file encoding; e.g., euc-jp
...
...
To use ANTLR tool, we have to use any of the above commas, and none of them is easy to type. So, let’s create an alias for that.
$ alias antlr4=' java -jar /usr/local/lib/antlr-4.13.1-complete.jar'
After creating alias, the command antlr4
should give the same output as above.
Installing ANTLR4 Python Runtime
Installing runtime is easy, just download python package from PyPI
$ pip install antlr4-python3-runtime
That’s all we have to do here.
Hello ANTLR4
Let’s write a simple grammar, which recognizes the language “Hello <any_string>”, and create generate parser for that language in Python3.
The example is taken from the book The Definitive ANTLR 4 Reference.
The grammars for ANTLR are written in the file with extension .g4
Create a file names Hello.g4
, and write following grammar in it.
grammar Hello;
r : 'hello' ID ;
ID : [a-z]+ ;
WS : [ \t\r\n]+ -> skip;
Language Explanation
- Parsing rules are written in small alphabets. Example
r
which says that any string which is in the form ofhello <ID>
belongs to language. - Lexical rules are written in capital alphabets. Example,
ID
, which saysID
is any string which composed of 1 or more letters between a to z . - Lexical rules are represented using regular expressions.
- The lexical rule,
WS
says that, skip one or more spaces, tabs, returns, and newlines.
Using Language
Let’s generate a parser for our small Hello
language in Python3.
The parameter -Dlanguage=Python3
tells ANTLR4 that we want our parser in Python3.
$ antlr4 -Dlanguage=Python3 Hello.g4
Now, if you list the files, you should have your parser in Python3.
$ ls
Hello.g4 Hello.interp HelloLexer.interp HelloLexer.py HelloLexer.tokens HelloListener.py HelloParser.py Hello.tokens main.py
We shall understand how to use all the files later, for now, let’s write a python program which parses the Hello
language.
Create main.py
file in the same directory.
from antlr4 import *
from HelloLexer import HelloLexer
from HelloParser import HelloParser
input_text = input("> ")
lexer = HelloLexer(InputStream(input_text))
stream = CommonTokenStream(lexer)
parser = HelloParser(stream)
tree = parser.r()
print(tree.toStringTree(recog=parser))
In the line tree = parser.r()
, the r
is there because we have used the r
as initial rule in our language.
Let’s try to run it.
$ python3 main.py
> hello world
(r hello world)
$ python3 main.py
> hello abc
(r hello abc)
Our languages are getting recognized.
Now let’s see what happens if we add an invalid string. If we remember correctly, only small alphabets are allowed in place of ID. So, if we any capital alphabet, our parser gives error.
$ python3 main.py
> hello World
line 1:6 token recognition error at: 'W'
(r hello orld)
Even if we misspell hello
we shall get error.
$ python3 main.py
> Hello world
line 1:0 token recognition error at: 'H'
line 1:1 missing 'hello' at 'ello'
(r <missing 'hello'> ello)
If we look closely, the parser is able to know where exactly the error is. This is how our compilers/linters give us line number where there is an error in our program.
Creating Morse Code Translator
Now let’s try to create a translator using ANTLR. This translator will convert Morse code to the string.
Create Morese.g4
file and paste the following grammar:
grammar Morse;
// Parser rules
morse_code : (letter | digit | WS)* ;
letter : A | B | C | D | E | F | G | H | I | J | K | L | M |
N | O | P | Q | R | S | T | U | V | W | X | Y | Z ;
digit : ZERO | ONE | TWO | THREE | FOUR | FIVE | SIX | SEVEN | EIGHT | NINE ;
// Lexer rules
A : '.-' ;
B : '-...' ;
C : '-.-.' ;
D : '-..' ;
E : '.' ;
F : '..-.' ;
G : '--.' ;
H : '....' ;
I : '..' ;
J : '.---' ;
K : '-.-' ;
L : '.-..' ;
M : '--' ;
N : '-.' ;
O : '---' ;
P : '.--.' ;
Q : '--.-' ;
R : '.-.' ;
S : '...' ;
T : '-' ;
U : '..-' ;
V : '...-' ;
W : '.--' ;
X : '-..-' ;
Y : '-.--' ;
Z : '--..' ;
ZERO : '-----' ;
ONE : '.----' ;
TWO : '..---' ;
THREE : '...--' ;
FOUR : '....-' ;
FIVE : '.....' ;
SIX : '-....' ;
SEVEN : '--...' ;
EIGHT : '---..' ;
NINE : '----.' ;
WS : [ \t\r\n]+ -> skip ;
Understanding grammar
Grammar is pretty easy to understand.
We have the following parser rules.
morse_code
which recognizes any number, or letters or digits.letter
can be any one of 26 characters.digit
can be any one of 10 digits.
We have defines Morse code rules for all characters and digits.
Generate Parser
Generate parser for Python3, using following command.
$ antlr4 -Dlanguage=Python3 Morse.g4
Use Parser
Now that we have generated parser, let’s use it.
from antlr4 import *
from MorseLexer import MorseLexer
from MorseParser import MorseParser
from MorseListener import MorseListener
input_text = input('> ')
lexer = MorseLexer(InputStream(input_text))
stream = CommonTokenStream(lexer)
parser = MorseParser(stream)
tree = parser.morse_code()
print(tree.toStringTree(recog=parser))
Let’s try some input.
$ python3 main.py
> .... . .-.. .-.. ---
(morse_code (letter ....) (letter .) (letter .-..) (letter .-..) (letter ---))
We shall explore more about ANTLR in Python in some other articles.
Ok, it’s recognizing our language.
Write Listener
If you observe carefully, there is another file generated along with lexer and parser Python file.
If we see the MorseListener.py
file, there is class MorseListener
. In this class, there are different methods prefixed with enter
and exit
.
For example, enterMorse_code
is executed when the morse_code
rule in the grammar starts executing, and exitMorse_code
function will be executed once that morse_code
rule finished. This works similarly for all other rules like enterLetter
, enterDigit
.
We can override these methods and implement whatever we want in there.
Create a class inheriting MorseListener
, and override enterMorse_code
, exitMorse_code
, enterLetter
, enterDigit
.
class MorseToPythonString(MorseListener):
def enterMorse_code(self, ctx:MorseParser.Morse_codeContext):
print('"', end="")
def exitMorse_code(self, ctx:MorseParser.Morse_codeContext):
print('"', end="")
def enterLetter(self, ctx:MorseParser.LetterContext):
for child in ctx.getChildren():
if child.symbol.type == MorseParser.A:
print("a", end="")
if child.symbol.type == MorseParser.B:
print("b", end="")
if child.symbol.type == MorseParser.C:
print("c", end="")
if child.symbol.type == MorseParser.D:
print("d", end="")
if child.symbol.type == MorseParser.E:
print("e", end="")
if child.symbol.type == MorseParser.F:
print("f", end="")
if child.symbol.type == MorseParser.G:
print("g", end="")
if child.symbol.type == MorseParser.H:
print("h", end="")
if child.symbol.type == MorseParser.I:
print("i", end="")
if child.symbol.type == MorseParser.J:
print("j", end="")
if child.symbol.type == MorseParser.K:
print("k", end="")
if child.symbol.type == MorseParser.L:
print("l", end="")
if child.symbol.type == MorseParser.M:
print("m", end="")
if child.symbol.type == MorseParser.N:
print("n", end="")
if child.symbol.type == MorseParser.O:
print("o", end="")
if child.symbol.type == MorseParser.P:
print("p", end="")
if child.symbol.type == MorseParser.Q:
print("q", end="")
if child.symbol.type == MorseParser.R:
print("r", end="")
if child.symbol.type == MorseParser.S:
print("s", end="")
if child.symbol.type == MorseParser.T:
print("t", end="")
if child.symbol.type == MorseParser.U:
print("u", end="")
if child.symbol.type == MorseParser.V:
print("v", end="")
if child.symbol.type == MorseParser.W:
print("w", end="")
if child.symbol.type == MorseParser.X:
print("x", end="")
if child.symbol.type == MorseParser.Y:
print("y", end="")
if child.symbol.type == MorseParser.Z:
print("z", end="")
def enterDigit(self, ctx:MorseParser.LetterContext):
for child in ctx.getChildren():
if child.symbol.type == MorseParser.ZERO:
print("0", end="")
if child.symbol.type == MorseParser.ONE:
print("1", end="")
if child.symbol.type == MorseParser.TWO:
print("2", end="")
if child.symbol.type == MorseParser.THREE:
print("3", end="")
if child.symbol.type == MorseParser.FOUR:
print("4", end="")
if child.symbol.type == MorseParser.FIVE:
print("5", end="")
if child.symbol.type == MorseParser.SIX:
print("6", end="")
if child.symbol.type == MorseParser.SEVEN:
print("7", end="")
if child.symbol.type == MorseParser.EIGHT:
print("8", end="")
if child.symbol.type == MorseParser.NINE:
print("9", end="")
enterMorse_code
will be executed at the start. It will print the double quote"
character.exitMorse_code
will be executed at the end, and it will print the double quote"
.- The
enterLetter
will be executed once any letter if encountered. In this function, we’re checking which token it has encountered, and according to that, we’re printing the character. - The
enterDigit
will be executed once any digit if encountered. Similar to theenterLetter
, we’re printing the digit according to the token.
Update the code
Use Walker
Let’s now create an object of ParseTreeWalker
. The walk
method is used to walk the tree, to that method, pass the listener class and the tree. Listener class will be out new listener class MorseToPythonString
.
input_text = input('> ')
lexer = MorseLexer(InputStream(input_text))
stream = CommonTokenStream(lexer)
parser = MorseParser(stream)
tree = parser.morse_code()
walker = ParseTreeWalker()
walker.walk(MorseToPythonString(), tree)
Try Out
Now that we have walker in place, let’s try executing it.
$ python3 main.py
> .... . .-.. .-.. ---
"hello"
As we can see, our program translating the input correctly.