Compiler Design: Scanner, Parser and Analysis

A compiler is a translator between a human-readable high-level language and a computer-readable low-level language: it translates a program from a source language into a target language. Why do we need compilers? Because for human beings, programming in a machine-level language such as assembly is highly inefficient and time consuming.

Different compilers do the translation differently. For example, a Java compiler compiles Java into JVM byte code stored in class files, while a C compiler compiles C into an executable file. However, they all follow a similar process and are designed on similar principles. Understanding how a compiler works is an important step toward understanding how programming languages are designed and how they are compiled into executable code.

A compiler usually has three major components: a front-end component, a middle component, and of course, a back-end component. We will start from the front end, which mainly includes the scanner, the parser, and the semantic analysis modules.

  • Scanner

A scanner is also called a tokenizer. As the name indicates, it reads the source file and converts the text into a stream of known objects called tokens, which are then sent to the parser for parsing. There are different elements in the source text, such as variables, operators, operands, and strings; the job of lexical analysis is to identify their types.

In lexical analysis, the scanner reads one character at a time, removes whitespace and comments from the text, then forms <token-type, value> tuples, for example <type: Operator, value: +> or <type: Var, value: numberOfDays>. In this phase, the scanner also checks whether each token is a legal string. The definition of a legal string is usually based on regular expressions, which we will discuss in the future.

  • Parser

A parser matches the token stream against the rules of a grammar and builds a representation of the code. The following is a simple set of grammar rules:

expression = atom   | list
atom       = number | symbol    
number     = [+-]?['0'-'9']+
symbol     = ['A'-'Z']['A'-'Z''0'-'9'].*
list       = '(', expression*, ')'

When the parser receives the token stream from the scanner, it tries to match the tokens against the rules in the grammar set. If a token is a number, it checks whether the number is valid; if it is an expression, it checks whether it is an atom expression or a list expression. Low-level grammar rules can be combined and composed into higher-level ones.
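As a sketch of this matching process, a toy recursive-descent parser for the list/atom grammar above might look like this (it assumes the scanner has already produced a list of token strings; the node shapes are my own illustration):

```python
# Toy recursive-descent parser for the grammar above:
#   expression = atom | list
#   list       = '(', expression*, ')'
# An atom is a number or a symbol.
def parse_expression(tokens, pos=0):
    token = tokens[pos]
    if token == "(":
        # list rule: consume '(' then expressions until ')'
        items = []
        pos += 1
        while tokens[pos] != ")":
            node, pos = parse_expression(tokens, pos)
            items.append(node)
        return ("list", items), pos + 1   # consume ')'
    if token.lstrip("+-").isdigit():
        # number rule: optional sign followed by digits
        return ("number", int(token)), pos + 1
    return ("symbol", token), pos + 1
```

Calling parse_expression(["(", "1", "foo", ")"]) returns a "list" node containing a number atom and a symbol atom.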

  • Syntax and Semantic Analysis

The parser will do two types of analysis: syntax analysis and semantic analysis. The difference is subtle: syntax analysis checks whether a sentence is well formed, while semantics refers to the meaning of the sentence. For example:

int x;
printf("%d", x);

This example is syntactically valid but semantically invalid, because x is used before it is assigned a value. In the semantic check, the parser understands the semantics by triggering a set of semantic actions. For example, the compiler maintains a symbol table to track the scope and status of each variable. When the parser first sees a variable, it enters its value and scope into the symbol table; when it sees the variable in use, it checks that the variable has been defined and is in the correct scope, and throws an error otherwise. This variable bookkeeping is one example of a semantic action; there are many more.

Java Memory Management

Java Memory Model

What’s in the memory of a running Java application?

Before we look into the details, we can guess that the following components are needed for an application to run:

  1. The code: Java code is not compiled into native code; it is converted into class files and executed through the execution engine. So the code must be loaded into memory to execute.
  2. JVM: the Java Virtual Machine itself.
  3. The stack of each thread: it stores the current status of the execution, the running code, the local variables, and the trace of the execution.
  4. Data objects: when Java code runs, objects need to be instantiated to hold the data. They should not live in the stack area, so another area should be reserved for the data objects.

With the above reasoning we are already pretty close to the actual Java memory model, which has the following areas:

  • Method area

This area stores all Java class-level information, including the class name and its immediate parent class name. There is one method area per JVM, shared by the different threads.

  • Heap area

This area is used for all the objects that are instantiated. There is also one heap area per JVM. For example, string literals (interned strings) live on the heap so that they can be referred to and shared.

  • Stack area

There is one running stack for each thread; a new stack frame is created each time a function is invoked.

  • PC Register

This area stores the address of the instruction a thread is currently executing. There is one PC register per thread.

Java GC

How does Java GC work?

A Java program runs in the Java Virtual Machine, and the JVM manages memory allocation and recycling for the program. In C/C++ programming, the program needs to call malloc() and free() to allocate and release memory. The JVM handles allocation and recycling on the program's behalf: it introduces a garbage collector to recycle unused memory.

The most critical question is: how does Java know when a memory area is ready to be recycled? The Java GC scans from the GC roots, such as the references in the stack area. If an object in the heap is reachable from the stack, it is still being referred to and should not be recycled; otherwise, its area can be recycled.

What are some of the algorithms that the Java GC uses?

Mark and Sweep

In phase I of this algorithm, the Java GC traverses the heap area, starting from the roots, to decide which objects are still reachable.

In phase II, the Java GC marks the rest of the area as recyclable, and these areas will be overwritten the next time an allocation request arrives.

Does the Java GC compact the memory to improve later allocation performance? Yes, it does. The mark-compact variant moves the referenced areas to the start of the heap, so that the free memory forms a continuous chunk and new allocations are faster.

The problem with the mark-and-sweep algorithm is that it can become quite inefficient. If the heap becomes big or the object graph becomes complex, traversing the graph and cleaning it up could take a long time.

Generational Garbage Collection

Generational garbage collection is built on top of the mark-and-sweep algorithm. Most Java objects are short lived, and the objects that do survive are likely to live for a long time.

Generational GC divides the Heap into a few areas: the younger generation, the old generation and the Permanent generation.

Objects are first allocated in the young generation, which contains the Eden area and two survivor areas (S0 and S1). When it fills up, a minor GC is triggered to recycle the dead objects, and surviving objects are gradually promoted to the old generation.

The old generation is used to store long-lived objects: only when an object in the young generation reaches a certain age is it moved to the old generation. A major GC is run to perform GC on the old generation.

The permanent generation is used to store the metadata required for JVM to run, for example, the class file.

From Java 8 on, the permanent generation is removed and replaced with the Metaspace, and the method area is part of this space.

Java Memory Tuning

Java GC types

Java provides different types of GC:

  • The Serial GC

This GC runs in a single thread and performs the different GC tasks serially. It was the default collector for client-class machines.

  • The parallel GC

This GC runs in multiple threads and performs the GC tasks in parallel.

  • The Concurrent Mark Sweep Collector

This GC collects the tenured (old) generation, doing most of its marking concurrently with the application threads.

How do we tune Java memory?

  • -Xmx: set the maximum heap size
  • -Xms: set the starting heap size
  • -XX:MaxPermSize: set the permanent generation size (Java 7 and below)
  • -XX:MetaspaceSize: set the Metaspace size (Java 8 and above)
  • -verbose:gc: print to the console when a garbage collection is run
  • -Xmn: set the size of the young generation
  • -XX:+HeapDumpOnOutOfMemoryError: create a heap dump file on an out-of-memory error

Java Stream

What is a stream?

A stream is an abstraction over data operations. It takes input from collections, arrays, or I/O channels.

From imperative to declarative

For example, given a list of people, find the first 10 people with age less than or equal to 18.

The following solution is the imperative approach:

public void imperativeApproach() throws IOException {
    List<Person> people = MockData.getPeople();

    List<Person> peopleUnder18 = new ArrayList<>();
    for (Person person : people) {
        if (person.getAge() <= 18 && peopleUnder18.size() < 10) {
            peopleUnder18.add(person);
        }
    }

    for (Person person : peopleUnder18) {
        System.out.println(person);
    }
}

The following is the declarative approach:

public void declarativeApproach() throws IOException {
    List<Person> people = MockData.getPeople();

    List<Person> peopleUnder18 =
            // a lambda function as a filter
            .filter(person -> person.getAge() <= 18)
            .limit(10)
            .collect(Collectors.toList());
}

We mentioned that a stream is an abstraction over data manipulation. It abstracts the pipeline in the following way:

  • Concrete: the source can be Sets, Lists, Maps, etc.
  • Stream: operations such as filter, map, etc.
  • Concrete: collect the data to make it concrete again.

Intermediate and Terminal Operations

Java streams have two kinds of operations:

  • Intermediate operators: map, filter, sorted
  • Terminal operators: collect, forEach, reduce

Each intermediate operation is lazily executed and returns a stream; nothing runs until a terminal operation is met.


With IntStream.range(), you can create a stream over a fixed range of integers, for example:

    public void rangeIteratingLists() throws Exception {
        List<Person> people = MockData.getPeople();

        // Use an int stream to loop through the first 10 elements
        // of the list and print each object.
        IntStream.range(0, 10).forEach(i -> System.out.println(people.get(i)));
    }
You can also apply a function iteratively a given number of times:

    public void intStreamIterate() throws Exception {
        // This is very much like the fold function in Kotlin:
        // it keeps iterating based on the function you provide.
        IntStream.iterate(0, operand -> operand + 1).limit(10).forEach(System.out::println);
    }

Max, Min and Comparators

Java streams provide built-in min/max functions that support customized comparators. For example:

    public void min() throws Exception {
        final List<Integer> numbers = ImmutableList.of(1, 2, 3, 100, 23, 93, 99);

        int min =
                .min(Comparator.naturalOrder())
                .get();
        System.out.println(min); // 1
    }



Sometimes we would like to get the distinct elements from the stream; for that we can use the distinct API of the stream:

  public void distinct() throws Exception {
    final List<Integer> numbers = ImmutableList.of(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 9, 9, 9);

    List<Integer> distinctNumbers =
            .distinct()
            .collect(Collectors.toList());
  }


Filtering and Transformation

The stream filter API enables you to filter the elements of a stream, for example:

    public void understandingFilter() throws Exception {
        ImmutableList<Car> cars = MockData.getCars();

        // A Predicate is an assertion that returns true or false
        final Predicate<Car> carPredicate = car -> car.getPrice() < 20000;

        List<Car> carsFiltered =
                .filter(carPredicate)
                .collect(Collectors.toList());
    }

And the map API enables you to transform the elements. For example, we could define another object (a DTO) and transform the given stream into a stream of that type:

    public void ourFirstMapping() throws Exception {
        // transform from one data type to another
        List<Person> people = MockData.getPeople();

        List<PersonDTO> dtos =
                .map(p -> new PersonDTO(p.getId(), p.getFirstName(), p.getAge()))
                .collect(Collectors.toList());
    }

Group Data

One common operation in SQL queries is data grouping, for example GROUP BY with COUNT.

Java streams provide similar functionality:

  public void groupingAndCounting() throws Exception {
    // The names here are placeholder data for illustration.
    ArrayList<String> names = Lists.newArrayList("John", "John", "Mariam", "Alex", "Mariam");

    Map<String, Long> counting =
        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

    counting.forEach((name, count) -> System.out.println(name + " > " + count));
  }

Reduce and Flatmap

Stream reduction is very similar to a Hadoop Map/Reduce job, where map takes care of transforming the data, while reduce collects the data and does the final computation. For example:

  public void reduce() throws Exception {
    Integer[] integers = {1, 2, 3, 4, 99, 100, 121, 1302, 199};

    // Compute the sum of the elements, with 0 as the initial value
    int sum =, (a, b) -> a + b);

    // The same, using a function reference
    int sum2 =, Integer::sum);
  }


flatMap is different from the map function in that it flattens the nested structure first.

For example:

List<List<String>> list = Arrays.asList(
        Arrays.asList("a", "b"),
        Arrays.asList("c", "d"));

List<String> flat =
        .flatMap(List::stream)
        .collect(Collectors.toList());

The result of the stream is a flat list of strings: ["a", "b", "c", "d"].

Lambda Expression in Java/Kotlin

Higher-order function

In computer science, a higher-order function is a function that does at least one of the following:

  • Takes one or more functions as arguments
  • Returns a function as its result.

All other functions are first-order functions.

Anonymous class

In Java, an anonymous class enables you to declare and instantiate a class at the same time. If you only need to use a local class once, you should use an anonymous class. For example, suppose you wish to define a Runnable to execute a task:

Executors.newSingleThreadExecutor().execute(new Runnable() {
    public void run() {
        // Your task execution.
    }
});
As you can see in the example above, Runnable is an interface with one function, run. The anonymous class implements the interface.

Lambda Expression

Besides anonymous classes, Java also supports anonymous functions, called lambda expressions.

While an anonymous class enables you to declare a new class in a statement, it is sometimes not concise enough when there is only one function in the class.

For the example in the section above, we can simplify the implementation with a lambda expression:

Executors.newSingleThreadExecutor().execute(() -> { /* Your task execution */ });

Lambda expressions provide a few capabilities:

  • They let you treat functionality as a method argument, or code as data.
  • A function can be created without belonging to any class.
  • A lambda expression can be passed around as if it were an object and executed on demand.

Meanwhile, functions are first-class citizens in Kotlin. They can be stored in variables and data structures, passed as arguments, and returned from other higher-order functions.

A Kotlin lambda expression follows this syntax:

  • It is always surrounded by curly braces.
  • Parameter declarations in full syntactic form go inside the curly braces and have optional type annotations.
  • The body goes after an -> sign.
  • If the expression's return type is not Unit, the last expression inside the body is treated as the return value.

As you can tell, a lambda expression can't specify its return type. If you wish to declare the return type explicitly, you can use an alternative: the anonymous function.

fun(x: Int, y: Int): Int = x + y

A major difference between Kotlin and Java is that Kotlin treats functions as first-class values. Kotlin has dedicated function types, for example:

val initFunction: (Int) -> Int

The above declaration means that initFunction is of a function type: a function that takes an integer and returns an integer.

Such a function can be written as a lambda:

val a = { i: Int -> i + 1 }

Python Notes: Global Interpreter Lock

Why Python uses the GIL

Python uses reference counting for memory management. Each object in Python has a reference count variable that keeps track of the number of references to the object. When the reference count goes down to 0, Python releases the object from memory.

However, in a multithreading scenario, multiple threads might access the same object, and the reference count can be changed incorrectly under race conditions. Objects that should be released could then stay in memory, and in the worst case, objects that should not be released are incorrectly freed.

To solve this problem, Python introduced the GIL, a global lock at the interpreter level. The rule is that any Python bytecode has to acquire this lock to be executed. You might ask: why not add one lock to each object? Fine-grained locks could result in deadlocks, as well as extra locking overhead.

In this way, Python guarantees that only one thread at a time can change an object's reference count.

Problem of GIL

The GIL solution, however, has a problem: Python code cannot utilize multiple CPUs. If your code is CPU intensive, Python multithreading will not help you at all. However, if your program is not CPU intensive but I/O intensive, for example a network application, Python threads are still a good choice.

Solution to this problem?

Are there solutions to this problem? Yes, there are; the Python community has tried many times to solve it. The GIL is not really tied to the Python language itself but to the CPython interpreter. So as long as we change the underlying interpreter, Python can support true multithreading; Jython, for example, is implemented in Java and has no GIL.

However, another important reason why the GIL has not been removed is that Python has many extension libraries written in C. Those libraries work well with Python precisely because they don't need to worry about multithreaded memory models; the GIL model is really easy to integrate with. Moving those libraries to other interpreters is hard work.

Another solution is to use multiple processes instead of multiple threads. Python has well-designed libraries that support multiprocessing. However, process management has more overhead than thread management for the operating system, so multiprocess programs tend to pay a higher cost than multithreaded ones.

Python Notes: Decorator

By definition, a decorator is a function that takes another function and extends the behavior of the latter function without explicitly modifying it. Built-in decorators such as @staticmethod, @classmethod, and @property work in the same way.

How does a decorator work? Let's take a look at the following example:

def my_decorator(some_func):
    def wrapper():
        some_func()
        print("Some function is being called.")

    return wrapper

def another_func():
    print("Another function is being called.")

foo = my_decorator(another_func)
foo()


You would see the following print on the screen:

"Another function is being called."
"Some function is being called."

As you can see, we pass some_func into a closure, do something before or after calling it without modifying its original behavior, and return the wrapper function. As we already learned, Python functions are first-class objects just like any other Python objects, so the returned function can be called like any other function.


The above example is already very close to a decorator. The difference is that a decorator usually comes with the @ symbol. This is an example of Python syntactic sugar, which refers to syntax in a programming language that aims to make things easier to read or express. For example, the following applies the same decorator with the @ syntax:

@my_decorator
def another_func():
    print("Another function is being called.")

Decorator that takes any argument

In Python, we can use *args and **kwargs to represent arbitrary positional and keyword arguments. The following example shows how to accept arbitrary arguments in a decorator:

def proxy(func):
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

Decorator with parameters

Sometimes we want the decorator itself to take parameters. In that case, we can implement it with one more level of nesting:

def decorator(argument):
    def real_decorator(function):
        def wrapper(*args, **kwargs):
            print("Decorator argument:", argument)
            result = function(*args, **kwargs)
            return result
        return wrapper
    return real_decorator

Decorator tips

One practical tip when defining decorators is to use functools.wraps. This helper keeps the metadata of the original function, including its signature and docstring:

import functools

def uppercase(func):
    @functools.wraps(func)
    def wrapper():
        return func().upper()
    return wrapper

Python Notes: function as first class object

Per the History of Python blog, everything in Python is a first-class object, meaning that all objects that can be named in the language (e.g., integers, strings, functions, classes, modules, methods, etc.) have equal status. That is, they can be assigned to variables, placed in lists, stored in dictionaries, passed as arguments, and so forth.

Essentially, functions return a value based on the given arguments. In Python, functions are first-class objects as well. This means that functions can be passed around and used as arguments, just like any other value (e.g., string, int, float).

Internally, Python uses a common C data structure, used everywhere in the interpreter, to represent all objects, whether a Python function or an integer.

However, when it comes to functions as first-class objects, there are subtle design issues to think about.

Think about the following function definition:

class A:
    def __init__(self, x):
        self.x = x

    def foo(self, bar):
        print(self.x, bar)

What would happen if you assign to a variable, b = The first argument of the function would still have to be the instance itself. To handle this, Python 2 returns an unbound method, a wrapper around the original function that restricts the first argument to be an instance of A: a = A(1), then b(a, 'bar'). In Python 3, this restriction was removed because the authors found it not very useful: b is just the plain function.

Now think about the second case, when you have an instance of the class: a = A(1), b = In this case, Python returns a bound method, a thin wrapper around the original function. The bound method stores the instance as an internal object, and that instance becomes the implicit first argument when the method is called.

Python Notes: Iterator, Generator and Co-routine


Python supports iteration, for example, iterating over a list:

for elem in [1, 2, 3]:
    print(elem)

Iterating over a dict:

for key in {'Google': 'G',
            'Yahoo': 'Y',
            'Microsoft': 'M'}:
    print(key)

Iterating over a file:

with open('path/to/file.csv', 'r') as file:
    for line in file:
        print(line)

We use iterable objects in many ways: for example, reductions (sum(s), min(s)), constructors (list(s)), and the in operator (item in s).

The reason we can iterate over these objects is the iteration protocol: any object that supports __iter__() and __next__() is iterable. For example, we can define an iterable object in the following way:

class count:

    def __init__(self, start):
        self.count = start

    def __iter__(self):
        return self

    def __next__(self):
        if self.count <= 0:
            raise StopIteration
        r = self.count
        self.count -= 1
        return r

We can use the above example in this way:

c = count(5)
for i in c:
    print(i)
# 5, 4, 3, 2, 1


So what is a generator? By definition: a generator is a function that produces a sequence of results instead of a single value.

So a generator is a function, different from other functions in that it generates a sequence of results instead of a single value. A generator function behaves very differently from a normal function: calling it creates a generator but does not execute it until next() is called. The following is an example of a generator:

def count(n):
    while n > 0:
        yield n
        n -= 1

c = count(5)

Note that when we first call count, the body does not execute. Only when we first call next(c) does the generator start executing, and it suspends at the yield statement until the next call.

In other words, a generator is a convenient way of writing an iterator, without having to worry about the iteration protocol.

Besides yield-based generator functions, Python also supports generator expressions:

a = [1, 2, 3, 4]
b = [x*2 for x in a]
c = (x*2 for x in a)

b is still a regular list, while c is a generator.
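One practical difference worth noting: the list is built eagerly and can be iterated many times, while the generator is lazy and can only be consumed once. A quick sketch:

```python
a = [1, 2, 3, 4]
b = [x * 2 for x in a]   # list comprehension: evaluated eagerly
c = (x * 2 for x in a)   # generator expression: evaluated lazily

assert b == [2, 4, 6, 8]
assert list(c) == [2, 4, 6, 8]   # consuming the generator produces the values
assert list(c) == []             # ...but it is now exhausted
```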


A Python coroutine is very similar to a generator. Think about the following pattern:

def receive_count():
        while True:
            n = (yield)  # yield expression
            print("T-minus", n)
    except GeneratorExit:
        print("Exit from generator.")

The above form of generator is called a coroutine. A coroutine is different from a generator in that it receives data instead of generating data; think of it as a consumer or receiver.

To use a Python coroutine, you need to call next() on it first so that execution advances to the yield expression; then you can use send() to send values into the function. For example:

    c = receive_count()
    next(c)     # advance to the yield expression
    c.send(1)   # send 1 to the coroutine
    # prints "T-minus 1"

A common helper, often written as a @consumer decorator (it is not part of the standard library), runs this initial next() call for you, so the coroutine can be used directly.

Then the question is: why not just declare the consumer as a regular function that you call with the value directly, instead of relying on the yield expression? The examples given here indeed don't fully justify coroutines. More often, coroutines are used to implement application-level multitasking; I will introduce more about this later.

Python Notes: Context management

Python supports context management, which is often used when handling resources such as files or network connections. The with statement helps make sure resources are cleaned up or released. There are two major functions behind context management: __enter__ and __exit__.

__enter__ is triggered when the with statement starts, while __exit__ is triggered when the with block finishes execution.

One very common usage of the with statement is opening files:

    with open('/path/to/file', 'r') as file:
        for line in file:
            print(line)

The with statement on this example will automatically close the file descriptor no matter how this with block exits.

The Python with statement also supports nesting, for example:

    with open('/open/to/file', 'r') as infile:
        with open('write/to/file', 'w') as outfile:
            for line in infile:
                if line.startswith('Test'):
                    outfile.write(line)
If you want your own class to support with-statement-based context management, just implement these two methods, for example:

class Resource:
    def __init__(self, res):
        self.res = res

    def __enter__(self):
        # acquire the resource here, then return it
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # release the resource here
        pass

There is another way to support context management, which is to use contextlib.contextmanager; we will introduce it later.


Python Notes: Closure


The following is an example of a Python closure.

def outer_function(outer_arg):
    closure_var = outer_arg

    def inner_function(inner_arg):
        print("{} {}".format(closure_var, inner_arg))

    return inner_function

# Usage of a python closure
closure_func = outer_function("X")
closure_func("Y1")  # prints "X Y1"
closure_func("Y2")  # prints "X Y2"

Variables and nested function

Python has two types of variables: local and global. Variables defined within a function have local scope, while variables defined outside a function have global scope.

When a function is defined within another function, it is called a nested function, and it can access the outer function's variables. In the example above, the outer function defines a variable called closure_var, initialized from the input argument. The inner function can access this variable and use it in its own body.

There are two steps to using a closure: call the outer function and assign the result to a variable, then call that variable with an argument to invoke the inner function.

When to use closure?

  • Reduce the use of global variables

Closures can hide variables inside the function; in this way we reduce the use of global variables.

  • Simplify single-function classes.
    For example, the closure above is equivalent to the following single-function class:
class OuterClass:
    def __init__(self, outer_arg):
        self.arg = outer_arg

    def inner_function(self, inner_arg):
        print("{} {}".format(self.arg, inner_arg))

In general, a closure call runs slightly faster than the equivalent instance method call, since it avoids the attribute lookup on self.
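As an illustration of the first point above, a closure can keep state private instead of exposing a global variable; this counter is a minimal sketch:

```python
def make_counter():
    count = 0                 # hidden inside the closure, not a global
    def counter():
        nonlocal count
        count += 1
        return count
    return counter

next_id = make_counter()
assert next_id() == 1
assert next_id() == 2

other_id = make_counter()     # each closure gets its own private state
assert other_id() == 1
```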