
Week 1: HSQL

I intended Week 1 to touch up any big issues that I’d noticed in the work I’d done so far – error management, and refining the AST and Variable Table.

Error Management and Imports

So, compilers generally have multiple stages, and each stage may throw errors or warnings. Along with that, there are often issues which aren't strictly compilation issues (e.g. maybe the compiler wasn't able to read the file). A common error is a misused variable – e.g. we can't do a select query from a single integer, we need a table!

Consider this example file sample.hsql, which we are viewing in an IDE:

import a;
b = a;

For a compiler that has to check types, what is the type of b? Of course, it is whatever the type of a is.

Now a is only understandable if I parse and process the whole of a.hsql and get its types out. However, if there's some kind of error in a.hsql, we need to show it on the editor page for a.hsql, not sample.hsql.

Here’s the relevant method signature (one of the few that I use):

ErrorManager.push(e: TranslationError): void

TranslationError is a class that neatly wraps up where the issue is, what kind of issue it is (Error/Warning/…) and what kind of error it is (Syntactic/Semantic/IO/…). One easy way is to add a file: string member to TranslationError and hope for the best. As soon as I tried it, I realised there was a big issue: I have called .push() all over the codebase; there is no way I can expect every object and function to sanely track the file and report the issue accordingly.

So, the ErrorManager object itself has to track it.

One thing to realise is that the AST generation function is recursive: it will eventually call itself to resolve the other file (which happens when I'm trying to understand an import – more on that in a while).

Seeing recursion, an immediate thought is – a stack. A stack can help deal with recursive things without requiring explicit recursion, and a fileStack: string[] is good enough to act as one (all hail Javascript). Now, to mirror the AST calls, I decided a nice and easy way was to push the file context (the current filename) onto the stack at the beginning of the function, and then pop it at the end.

getAST(fileName: string = this.mainFile) {
    errorManager.pushFile(fileName);
    // generate the AST of the file; errors may be reported here, and this
    // calls getAST again (recursively) whenever the file has an import
    errorManager.popFile();
}

This simple trick ensures that the top of fileStack will always be whatever file is currently being referred to (unless, of course, we haven't even started referring to a file yet!).
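To make that concrete, here is a minimal sketch of the idea; the exact members are my illustration, not HSQLC's real API:

class TranslationError {
    file?: string; // filled in by the ErrorManager
    constructor(
        public msg: string,
        public severity: 'error' | 'warning',
        public kind: 'syntactic' | 'semantic' | 'io'
    ) {}
}

class ErrorManager {
    private fileStack: string[] = [];
    private errors: TranslationError[] = [];

    pushFile(fileName: string): void {
        this.fileStack.push(fileName);
    }

    popFile(): void {
        this.fileStack.pop();
    }

    push(e: TranslationError): void {
        // stamp the error with whichever file is on top of the stack,
        // so callers never need to track the current file themselves
        e.file = this.fileStack[this.fileStack.length - 1];
        this.errors.push(e);
    }
}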

And, Bingo!

Error messages with file locations!

File Extension management

So, the HSQL (trans/com)piler also needs to be able to rename files by extension easily. Node's path module is perfect for this, but there is one important point to note: pathnames are handled worry-free as long as we create and consume them on the same system. If we use a Windows-based system to create a path, e.g. C:\My\File.txt, it may not work properly if we try to consume it on a POSIX (Linux/Mac/…) system. Of course, the API provides ways around that, but we don't need to worry about this edge case, as all the pathnames that are entered during runtime are consumed by the same host.

Writing some small support code, we can use path.parse and path.format to easily rename files (TypeScript does not like us doing it, but it's a valid approach according to the documentation), and voila!

input.hsql -> input.ecl !
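For reference, the support code can look something like this (the helper name changeExtension is my own):

import * as path from 'path';

// rename a file by extension, e.g. 'input.hsql' -> 'input.ecl'
function changeExtension(fileName: string, newExt: string): string {
    const parsed = path.parse(fileName);
    // path.format() prefers `base` over `name` + `ext`, so `base` must be
    // dropped for the new extension to take effect (documented behaviour,
    // even if the types make it awkward)
    return path.format({ ...parsed, base: undefined, ext: newExt });
}

changeExtension('input.hsql', '.ecl'); // => 'input.ecl'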

AST, AST and more AST

ASTs have been one of the primary goals I have been working towards.

Internally, of course, it'll be a data structure, but we can represent it graphically, something like this:

Here’s a simple idea of what an AST could look like graphically

Note how much simpler it is than a parse tree. This gives me the added benefit of simplicity in the later stages; but it's important to remember that the real star of the show is the little table on the right – a detailed type for every variable. More on that some other time, but this representation should be mutable (something ANTLR parse trees don't like being), so we can work with it a lot more easily.

Variable table lookups

The variable table, as of now, contains a map of variables, their scopes and their types. There are two ways to introduce data into a program:

  1. Direct assignment – creating a variable from some combination of literals, e.g. a = 5. Understanding the type of a is trivial(ish) in this case.
  2. Imports – an important way to introduce data into a program, e.g. import a. Figuring out the type of a is a bit more complex here. Without any more context, we can only say that a is a module, no more and no less; we have no further information about the contents of a, although for computation's sake we may assume it has any and all members that are requested of it. However, if it is another HSQL file, a.hsql, we can parse it and get the variables out of it. But what about ECL?

ECL imports are a little tricky, as we can't really get types out of ECL. So the best we can do is to see if a definition file is present (say, a.ecl has an accompanying a.dhsql) that can give us more information on what a is.
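The lookup order can be sketched roughly like this (a simplification; the function and its return values are illustrative):

import * as fs from 'fs';

// resolving `import a;`: probe for sources of type information
function resolveImport(name: string): 'hsql' | 'dhsql' | 'opaque-module' {
    if (fs.existsSync(`${name}.hsql`)) return 'hsql'; // parse it and pull its variables out
    if (fs.existsSync(`${name}.dhsql`)) return 'dhsql'; // a definition file describing the ECL
    return 'opaque-module'; // assume it has any member that is requested of it
}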

Once data is in the system, it can be propagated with assignments, and then exported or given as an output.

Assignment ASTs

So, assignments are the core of this language.

y = f(x)

This is a nice piece of pseudocode for how assignments work in general.

Given f(x) is a function on input x, we can try to figure out what y will be shaped like. As in,

If we know what x is (a table? a singular value?) and we know what f is and how it transforms x, we can figure out what the shape of y is.

Eg.

a = b;

Here we don't really have a special modifying function, just an assignment. Whatever b is, a is definitely the same type.

Now consider,

a = select c1,c2 from b;

Now, here there is a transformation function: if b is a table, a becomes a table with columns c1 and c2. If b is a singular value, then, well, a is just invalid :P.

Carrying this knowledge over: assuming evaluating f(x) returns its data shape along with its AST node, our Eq node just has to create a new variable according to what the LHS has been defined as.
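In code-ish terms, the shape propagation for that select example can be sketched like so (the DataShape type is my illustration, not HSQLC's actual representation):

type DataShape =
    | { kind: 'table'; cols: string[] }
    | { kind: 'scalar' }
    | { kind: 'module' };

// shape of `select c1,c2 from b`: only valid when b is a table
function shapeOfSelect(source: DataShape, cols: string[]): DataShape {
    if (source.kind !== 'table') throw new Error('SELECT needs a table source');
    return { kind: 'table', cols };
}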

So to start with, I created the AST for only direct assignments.

To do this, it's easy to find the type of f(x) = x here, as it's just a lookup into the variable table, which we discussed earlier.

Putting this in terms of code is really easy for the first part. The assignment part need only take the data type and create a new variable as per the LHS, trusting that the parse tree visitor has already created and validated the AST node for the RHS (which returns its AST node and data type).
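That step of the visitor might look roughly like this (illustrative only: the context classes are ANTLR-generated, and the real HSQLC names will differ):

visitAssignment(ctx: AssignmentContext): EqNode {
    // the RHS subtree validates itself and reports back its AST node and data type
    const rhs = this.visit(ctx.expression());
    const lhsName = ctx.identifier().text;
    // record the new variable with the same shape as the RHS
    this.variableTable.add(lhsName, rhs.dataType);
    return new EqNode(lhsName, rhs);
}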

An AST (notice the additional information) and stmts, the root AST node's children

Of course the image may not show it, but the RHS is contained as a child node of the Assignment node.

Call stack and a lesson on stack traces

Nice! ASTs are conceptually working. But when I try to generate code, I see this:

Yikes! Max call stack size exceeded!

All right, looking at the stack trace, it becomes obvious what happened here: visit keeps getting called. And since I haven't yet added code generation for the equal statement, the mistake/oversight becomes obvious – if a visit<something>() isn't defined, it will call visit() as a fallback; but visit() calls accept(), which calls the required visit<something>() again (the loop is sketched below). The week is getting scarily close to the end, so after finding a fix that will work for me, I decided to pick it up first thing next week 😛.
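The loop itself looks roughly like this (a simplified picture of the visitor fallback, not ANTLR's actual generated code):

interface ASTNode {
    accept(v: Visitor): unknown;
}

abstract class Visitor {
    visit(node: ASTNode): unknown {
        return node.accept(this); // dispatches to the matching visitEqStatement(), etc.
    }
    visitEqStatement(node: ASTNode): unknown {
        return this.visit(node); // fallback when not overridden - and round we go again
    }
}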

Winding up and getting ready for next week

So, the first week was interesting for me, as I had to get a few things ready and transfer over some existing work. But once that was done, we were set up for some quality work. With all this over, the main focus next week is to –

  1. Get code generation fixed – This will require some redesign of the code generation class.
  2. Implement at least a basic Output statement – Output statements can help us get to testing faster.
  3. Look at VSCode extensions – Try to get a reasonable extension started!
  4. Syntax and workflow ideas – There can never be enough looking at syntax and seeing what the best syntax for doing something is!

Week 0.5: Development Environment

I run Linux for most of my development work (as do most college students for college work). I specifically prefer Manjaro (or any Arch-based distribution, to be honest) because of how well it has worked for me.

Honestly, stock Manjaro is good enough sometimes

However, if you head over to the HPCC Systems download page, you may notice a stark absence of Manjaro and/or Arch Linux. The supported operating systems (as of writing this) are CentOS 7/8 and various flavours of Ubuntu. Thankfully, all is not lost: HPCC Systems is open source, and one of the best outcomes of that is that it can be built from source.

Scouring the manual and the GitHub repository brought me to a build instructions page with instructions for building the client tools from source. This gave me a great idea for how to set up the environment.

A development environment is simple enough: a client computer making requests to a single-node cluster (which can be your computer again).

This reminded me of how useful Docker can be here. I've been using Docker a lot recently, in cases where the resource overhead of a VM does not feel justified but it still makes sense to keep a server's processes contained.

So this was going to be my two part installation:

  1. The HPCC Systems Client tools – compiled from source
  2. The HPCC Systems Cluster – on Docker as a separate container

Client Tools

Getting the client tools to work was simple. I installed all the dependencies, plus sccache (really helpful, as it caches your compilation results and speeds up recompiles). I combined the instructions from sccache's page and the build instructions to build only the client tools and to use sccache as the compiler cache. I typed in make -j6 and then took a small break.

Once that was all done and dusted, it seemed weird that make package was trying to install the package onto my system, skipping the package creation that usually happens on other distros. So I skipped straight ahead to sudo make install, and then I had a working client tools install.

Woo it worked!

HPCC Systems

Although native installs are the best, I have found that VMs are a good way to contain things in case something goes wrong (botched configuration, resource overuse) and are a breeze to clean up. Recently I have started to use Docker, and it's a nice replacement for VMs where the extra level of virtualization is not required. I also found some resources on building HPCC Systems images for Docker, but they were for v7, while I was using v8. Using them as a starting point, I made some changes and set up a nice Dockerfile and docker-compose.yml to create and use a single-node cluster. Adding a restart: always to the Compose configuration allows the container to come back up on machine restart (which is a really nice feature I've needed).

Runs pretty well!

For me, Docker had already forwarded its ports to localhost (:8010, …). If not, there are environment variables ECL_WATCH_IP and ECL_WATCH_PORT that can be set for the client tools to connect to that specific cluster.

And all done!

And here we have the ECL Watch page!

With a development environment ready to go, I can start testing ECL and how translations between HSQL and ECL can work!


Week 0: About HPCC Systems and HSQL

HPCC Systems is a high-performance, enterprise-ready, open-source supercomputing platform (check out their Official Page and their GitHub organization).

Maybe a better way of putting it: it provides a really easy way to perform data cleaning, transformation and aggregation (and most other tasks you may need with big data) in an efficient and distributed fashion. But of course, that description only touches on its capabilities and the work being done with it.

HPCC Systems and ECL

So how would one use this entire system? Here's one way: you can write up what you need the system to do in a language called ECL. It's a really powerful declarative language that allows you to write fast, expressive code that deals with data (or rather, describes how to deal with data). Here's a nice example that shows how you may use ECL:

// The layout of your data
myDataLayout := RECORD
    UNSIGNED id;
    STRING16 firstName;
    STRING16 lastName;
END;
// Load up your data from '~myscope::names.csv', a (logical) file that is distributed/sprayed onto the system
myData := DATASET('~myscope::names.csv',myDataLayout,CSV);

// sort the data by firstName
sortedData := SORT(myData,firstName);

// output only the first ten for display's sake; the whole output could also be saved somewhere else entirely
OUTPUT(CHOOSEN(sortedData,10));

The language is also very data-centric, as you may notice from the example above. This stress on the flow of data allows programs written in ECL to be aggressively optimized and parallelized.

(H)SQL

ECL as a language targets powerful data processing and transformation, but some people may prefer the simplicity and familiarity of SQL. In fact, SQL is already present as a usable language in HPCC Systems. However, it is present as an embedded language, and still requires wrapping ECL.

So here, our idea has been to present HSQL, a SQL-like language that should serve as a nice entry point for data analysts and newcomers to HPCC Systems. Here's a brief look at what the syntax is designed to look like:

IMPORT xyz; // imports should be a familiar concept from other languages
myPeopleData = SELECT * from xyz.peopleData;

sortedFirstPart = SELECT * from myPeopleData ORDER BY firstName LIMIT 10;
OUTPUT myPeopleData;

For many people who are used to SQL and other languages, this may seem more natural and easier to grasp. This SQL-inspired language is intended to translate completely to ECL, so most code can interface with it no problem, and people can work on projects using both ECL and HSQL. Of course, as they get used to the power and effectiveness of ECL, they may choose to shift over to using ECL more, but the journey should be relatively smooth. (There is some more syntax that is ML-specific, but I'll leave that for later.)

HSQL – Last year

I’ve been working on HSQL since last year; it started as a project from LexisNexis Risk Solutions, and we’ve been working hard since then.

Here’s how it was working back then:

How HSQL becomes ECL

We used ANTLR; the way it works is that you enter your grammar, and it generates lexers (programs that scan input text and chop it up into tokens) and parsers (which take these tokens and try to find some structure in them). Our project was based on Javascript, as HSQL is also intended to work in cloud environments. Here's a good example of what parse trees from ANTLR look like (I use the ANTLR extension from the VSCode Marketplace for visualization):

This is a statement: plot from table1 title 'graph';

ANTLR gives us a nice way to traverse these parse trees (called listeners and visitors, the latter essentially being an implementation of the Visitor design pattern). Using this, we can check whether the program entered is semantically valid, and then try to generate the ECL code for it. Wrapping up this whole process in a nice-looking tool, and wrapping that up further in an IDE extension, we were able to use it as a good base for how HSQL can work.

HSQL

Somewhere around last year, I submitted a proposal to the HPCC Systems Summer Internship Program to work further on HSQL, bringing in key changes and enhancements that should make it a more extensible language. I was really happy when they let me know that they had accepted my proposal and were offering me an internship for the summer period. Here are some of the key points that I plan to work on:

  1. Introduce an AST building phase where semantic validation can be done. This is really important as ANTLR does not like tree-rewriting (as of writing this) and can misbehave. Additionally, this also splits out code generation as a separate phase of the translation, which is useful especially when used in an IDE.
  2. Improve the interface of HSQLC for use in IDEs and as a standalone CLI tool.
  3. Add in some nice syntax for merges and filtering.

Working on HSQL

I have been working alongside my mentor to get things ready for a while, and this is a brief summary of the changes I've made so far:

  1. Migrate to Typescript: Coming off the project last year, I realized that a big project like HSQLC requires documentation and a strong emphasis on types. Having a strongly-typed foundation helps you keep some sense of the project when you refer to your own code a few months down the line.
  2. Introduce an AST generation stage: Splitting what was a one-shot code generation process into an AST pass with some semantic validation, followed by code generation, should simplify processing and also make it more IDE-friendly (in IDEs, we don't need to proceed beyond the AST generation stage, as code generation is not really useful there). Although this is in progress, it should be helpful.
  3. A slightly better testing framework: Last time around we had focused on testing only the final ECL output. Although that is a fair kind of test, this time around I believe it is better to test individual components as they are built up.
  4. Slightly more natural CLI tool: Use commands from yargs instead of flags; this resembles the way programs like compilers are normally invoked (a sketch follows below).
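For a picture of what command-style parsing looks like, here is a rough yargs sketch; the command name and options are illustrative, not HSQLC's actual interface:

import yargs from 'yargs';
import { hideBin } from 'yargs/helpers';

yargs(hideBin(process.argv))
    // `hsqlc compile <file>` rather than `hsqlc --compile file`
    .command(
        'compile <file>',
        'compile a .hsql file to .ecl',
        y => y.positional('file', { type: 'string', describe: 'the input file' }),
        argv => {
            // hand off to the compiler here
            console.log(`compiling ${argv.file}...`);
        }
    )
    .demandCommand(1)
    .parse();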

This should serve as a good base for my 12-week period of work on HSQL. And now, it is time to start!