
Week 7: WRITEs, DISTRIBUTE and Bugfixes

This week's work was on getting the WRITE statement to work, and on adding a DISTRIBUTE extension to SELECT. The rest of the time went into bugfixes and reading.

Write

The OUTPUT statement in HSQL is intended to be a way to output values to a workunit. In ECL, however, OUTPUT can write to a workunit as well as to a logical file. For HSQL, it is better to bifurcate this into two separate statements. The idea behind the WRITE statement, then, is to allow writing datasets to logical files.

The syntax for WRITE was designed as:

WRITE <id> [TO [FILE]] [TYPE] [JSON|THOR|CSV|XML] <STRING> [OVERWRITE];

ECL can support outputs of four types – THOR, CSV, JSON and XML. TypeScript enums are a great way of supporting this:

export enum FileOutputType {
    THOR,
    CSV,
    JSON,
    XML,
}
// get the enum value
const y = FileOutputType.THOR;
// y is 0

// get the string
const z = FileOutputType[y];
// z is 'THOR'

The enums provide a two-way conversion between strings and integers, which is convenient for keeping the readable representation around while still being efficient to store and compare.
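As an illustration, a small helper can map a parsed type keyword back to the enum – a minimal sketch, not necessarily the exact code in HSQLT:

// Hypothetical helper: map a parsed type keyword (eg. 'csv') to the enum.
function toFileOutputType(token: string): FileOutputType {
    const value = FileOutputType[token.toUpperCase() as keyof typeof FileOutputType];
    if (typeof value !== 'number') {
        throw new Error(`Unknown file output type: ${token}`);
    }
    return value;
}
// toFileOutputType('csv') is FileOutputType.CSV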

Adding in the AST and Code generation stages, we can test out some examples:

-- hsqlt make Commands/write_command/write.hsql
import %.a;

P = select * from a.t1 order by c1;

-- THOR/JSON/XML/CSV
-- syntax:
-- WRITE <id> [TO [FILE]] [TYPE] [JSON|THOR|CSV|XML] <STRING> [OVERWRITE];
write p to file type CSV '~output::writetofile.csv';

-- to thor files
write p to file type THOR '~output::writetofile.thor';

-- to xml
write p to file XML '~output::writetofile.xml';

This gives us a nice way to write results to logical files, so they can be used outside HSQL too.

DISTRIBUTE extension of SELECT

DISTRIBUTE in ECL is used to distribute data across the nodes of the cluster for further computation; it is really helpful for managing data skew in the computations that follow.

The idea is for it to be added as an extension to the HSQL SELECT statement, with the grammar being:

DISTRIBUTE BY <col1>[,<col2>,...]

This will add a DISTRIBUTE element to the end of the SELECT pipeline which will aid in managing data skew.

a = select * from marksmodule.marksds where marks>40 distribute by marks,subid;

The output for this expression looks like this:

__r_action_0 := FUNCTION
__r_action_1 := marksmodule.marksds( marks > 40);
__r_action_2 := TABLE(__r_action_1,{ __r_action_1 });
__r_action_3 := DISTRIBUTE(__r_action_2,HASH32(marks,subid));
RETURN __r_action_3;
END;
a := __r_action_0;

__r_action_3 is the computation that obtains the distributed definition, and it is what the function returns. The documentation recommends HASH32 when distributing by expressions/columns, and it was chosen as the most common way of distributing that still maintains an even spread.

Bugfixes for the extension

It turns out the extension was lagging behind HSQLT. This dates from the previous update that added in file system reading: as file locations now have to be provided separately, the extension had not yet been configured for it. Additionally, the extension's client and server work out of different folders; due to this, the server could successfully locate files whereas the client failed to. This required support for the compiler to refer to another directory 'offset' from the current directory (added to the CLI tool as a -b flag).
Adding this, plus support for %, allows HSQL to be used with the extension a bit more easily.

Writes, clean outputs and no more ‘Warnings’!

Demo

As a demo was given at the end of this week, various examples were created to show how HSQL may be used. Currently, they do not showcase actual use cases, but only the syntax and usability of the language. Nearly all supported features have been enumerated, and the plan is to have a nice set of examples for each statement going ahead. The demo also surfaced some good feedback, and the plan is to work towards more statements and features as we go along.

Wrapping up

As Week 7 wraps up, the plan is to work on these for Week 8:

  1. Add support for exporting definitions
  2. Add support for bundling definitions into modules in HSQL
  3. Look into procedure syntaxes in HSQL
  4. Evaluate the existing feature set and make some examples to work on performance and correctness.

Week 6: Plots and TextMate Grammars

This week's plans were essentially to work on and implement the PLOT statement, and then to see how to update the TextMate grammar to add in the new syntax.

Plots

Continuing from last week, we have auto-imports and an AST for plots. With a little code generation, we can have a functioning PLOT statement.

Here's the implementation I ended up choosing out of the three possible ways:

PARALLEL(OUTPUT(,,NAMED('<name>')),<module>.<function template>);

PARALLEL allows the actions to be executed in parallel – or rather, in whatever ordering the ECL compiler finds best.

The code generation stage was added in using the above idea, and care was taken to not emit the plot statement (and to raise an error instead) if the plot method was not found – as in such a case no template for the function call would exist.

Let's take this sample code for reference:

import marks;

-- lets get all the average marks of each subject
counting = select DISTINCT subid,AVG(marks) as avgmarks from marks.marks group by subid;
// join this to get the subject names
marksJoined = select * from counting join marks.subs on counting.subid=marks.subs.subid;

output counting title marksAvg;
// filter out what we need for plotting

marksPlottable = select subname, avgmarks from marksJoined;
// some different plots
PLOT from marksPlottable title hello type bar;
PLOT from marksPlottable title hello2 type bubble;

Doing some sample codegeneration after implementation, we get this translation:

IMPORT Visualizer;
IMPORT marks;
__r_action_0 := FUNCTION
__r_action_1 := TABLE(marks.marks,{ subid,avgmarks := AVE(GROUP,marks) },subid);
__r_action_2 := DEDUP(__r_action_1,ALL);
RETURN __r_action_2;
END;
counting := __r_action_0;
__r_action_3 := FUNCTION
somesource := JOIN(counting,marks.subs,LEFT.subid = RIGHT.subid);
__r_action_4 := TABLE(somesource,{ somesource });
RETURN __r_action_4;
END;
marksJoined := __r_action_3;
OUTPUT(counting,,NAMED('marksAvg'));
__r_action_5 := FUNCTION
__r_action_6 := TABLE(marksJoined,{ subname,avgmarks });
RETURN __r_action_6;
END;
marksPlottable := __r_action_5;
PARALLEL(OUTPUT(marksPlottable,,NAMED('hello')),Visualizer.MultiD.bar('hello',,'hello'));
PARALLEL(OUTPUT(marksPlottable,,NAMED('hello2')),Visualizer.TwoD.Bubble('hello2',,'hello2'));

Submitting this job to the local cluster, we can browse the Visualizations by going over to the workunit’s ‘Resources’ tab.

And, we have plots!

TextMate Grammars

The current grammar being used is from the JavaScript version, and it is a bit finicky. When it was being set up, I used an approach similar to modelling a CFG – I mapped the ANTLR rules over to syntax-mapping rules. That's not the best idea.

I looked up and read a few grammars for existing languages, giving some extra time to SQL. Turns out, most of them do not use the method above (not a surprise to anyone).

When doing syntax highlighting, the developer also needs feedback as the code is being typed. If you wait for the whole statement to match, the whole statement stays plain white until the ending ';'.

A much better way is to recognize specific parts of the language, fast, and colour them accordingly.

Since HSQL has a lot of distinct tokens, this works well, and syntax highlighting can be done pretty effectively. Adding in the tokens for the language, we see some good highlighting, which the user can take advantage of as they type.
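For illustration, a single token-level rule in such a grammar could look like this (the pattern and scope name here are made up for the example, not taken from the shipped grammar):

{
    "name": "keyword.other.hsql",
    "match": "(?i)\\b(import|select|from|where|group|order|by|write|plot|output|distribute)\\b"
}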

That took a while to get right, and as a result, it solves two issues:

  1. Performance and complexity: The regexes and captures were long, difficult and annoying to read. By just identifying and colouring the tokens, the grammar is much simpler and easier to read (and maintain).
  2. Tying into the ANTLR grammar: As the earlier TextMate grammar identified whole statements, the two grammars needed to be completely in sync, which is difficult considering that the language is being actively worked on. Identifying token-wise simplifies things: as we are no longer dependent on the ANTLR grammar, it can be worked on and the syntax highlighting will still work very well. Only every now and then do new keywords need to be added to the TextMate grammar.

Wrapping up

Week 6 completes, and so do the midterm goals that we had planned for. The plan for the next week is roughly:

  1. Decide on the plan ahead
  2. Fix VSCode extension related bugs: Due to some breakage from the addition of the File Providers in earlier weeks, quite a few bugfixes have to be done for the HSQL extension.
  3. Add in DISTRIBUTE BY to Select – This would correspond to DISTRIBUTE in ECL
  4. Investigate grammar for a WRITE TO variant of the OUTPUT statement, for writing to files.

Week 5: Imports, more imports and plots

This week’s plan is to work on imports, as a step leading to the implementation of the PLOT statement. The idea is to be able to locate ECL imports, and hint to the compiler what is contained in an ECL file.

DHSQL

Inspired by the relationship between JavaScript and TypeScript, the idea is to have "type definition" files for ECL, for when we need to state what is contained in those files.

Let’s consider an example for an ECL file, a.ecl:

export a:= MODULE
  export t1:= DATASET([{1,2},{3,4},{4,5}],{integer c1,integer c2});
  export t2:=DATASET([{1,3},{3,5},{3,5}],{integer c3,integer c4});
END;

Essentially, two datasets being exported. Currently, HSQL resolves this to an AnyModule, meaning that it won't offer any type suggestions for the module when imported. To solve this, we can add an a.dhsql in the same folder, which is intended to look something like this:

declare t1 as TABLE (integer c1,integer c2);
declare t2 as TABLE (integer c3,integer c4);

Now, when doing an import a;, HSQL can look up this definition file and use it to get an idea of what has been defined in the ECL file. Implementing this, however, requires another piece of work – import resolution and priority.

Import resolution and priority

ECL has an interesting property where a folder may also be imported; it is treated as a module, where all its child files/folders are members of that module. Although beyond our scope (and priority) as of now, it's important to remember this, as many modules use this structure too.

Now, considering all this, when we say import a;, it can refer to four different sources:

  1. HSQL – a.hsql
  2. DHSQL – a.dhsql
  3. ECL – a.ecl
  4. A plain folder – a/...

ECL files seem to be preferred over folders, and for our case, we need to refer to the DHSQL file if present (and of course .hsql files get the most priority). So this gives us a nice order in which to look up files in our system.

Adding to this, the ECL compiler itself has an order for scanning directories: it searches the standard libraries, then the bundles, and then the current directory, as required. As we need to replicate this order too, the idea is to first query eclcc for these paths. Once we have them, we can start looking at the paths in this general order:

  1. Standard Library includes – This essentially includes the standard library.
  2. ECL Bundles – This includes any bundles that have been installed by the user.
  3. Current directory – This includes imports from the current directory.

This gives us a good way to search for files. However, we also need to add some additional layers to cater to some important features (a rough sketch of the layered lookup follows the list):

  1. A standard library type definitions layer: placed right at the top, the idea is to add definitions for standard library functions in this way.
  2. Memory map: Placed above the current directory file map (or replacing it), this can serve as –
    1. An override for on-disk files. This is common for files that are open in an IDE but unsaved; in this case, the in-memory contents take priority.
    2. An in-memory store of files – This is useful if HSQL needs to be used as a compile server.
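As a minimal sketch of that layering (the type names here are assumptions, not HSQLT's actual interfaces), the lookup can be modelled as an ordered list of providers where the first hit wins:

// Hypothetical shapes for illustration.
type Source = { path: string; contents: string };

interface FileProvider {
    resolve(importName: string): Source | undefined;
}

// Earlier layers (stdlib definitions, memory map) shadow later ones (disk).
function resolveImport(layers: FileProvider[], importName: string): Source | undefined {
    for (const layer of layers) {
        const hit = layer.resolve(importName);
        if (hit !== undefined) return hit;
    }
    return undefined;
}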

DHSQL works!

After adding in the AST generator, we have to take slight care that output generation is skipped for DHSQL files (because they are declaration files). With this added in, we can have much better queries. Last week's examples were shown with patched-in code, and would have shown warnings otherwise. Let's see what this addition brings us:

Before: Warnings about types – HSQL can’t help in detecting errors
After: Now that we have a DHSQL File, HSQL can tell what’s wrong and right about the table

Nice! Now, when a select statement doesn't give any errors, it should ideally work without any issues in ECL too, given that the DHSQL file is correct.

DHSQL – Declaring Plots

As a precursor to making the plot statement, the idea is to have DHSQL files declare methods for plots. Let's take a look at the syntax I have in mind.

Say we are making Visualizer.dhsql; the idea is to make it somewhat like this:

declare Bubble as plot on 'TwoD.Bubble(\'{0}\',,\'{0}\')';

The statement itself is fairly self-explanatory given some familiarity with the Visualizer bundle. The idea is to supply HSQL with a little bit of information – a macro for what it should call. {0} is the plot name here, and will be supplied to it automatically.
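A minimal sketch of that substitution (the helper name is mine, assuming the template is stored as shown above):

// Fill every '{0}' placeholder in a stored plot template with the plot name.
function fillPlotTemplate(template: string, plotName: string): string {
    return template.replace(/\{0\}/g, plotName);
}
// fillPlotTemplate("TwoD.Bubble('{0}',,'{0}')", 'hello2')
// -> "TwoD.Bubble('hello2',,'hello2')"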

Picking up from the previous topic, we can add in plot methods by injecting a Visualizer.dhsql.

Lots of plot methods!

Now, when we import Visualizer, it should pull these methods into our variable table. But before that, there's one more thing to do.

Auto Imports

Auto imports are a great way to simplify a language. Here, the idea is to automatically pull in Visualizer (and later ML_Core) as their respective statements are used. Last time around, we set values in the parser module and retrieved them; this had the disadvantage of needing to pull them out of the parser and push them back into the visitor, where the processing happens. An easier way is to use ANTLR locals. These are dynamically scoped variables that can be accessed by any child rule (read as: really, really useful if you just need to set some flags).
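Roughly, in grammar notation, it looks like this (the rule shapes are illustrative and not the actual HSQL grammar; Java-style declarations shown, the TypeScript target is analogous):

program
    locals [boolean hasPlot = false]  // becomes a member of ProgramContext
    : stmt* EOF;

plotStatement
    : PLOT FROM IDENTIFIER
      { $program::hasPlot = true; }   // dynamically scoped write from a child rule
    ;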

Locals, declaring and referencing

As these are members of the ProgramContext type itself, it becomes super easy to reference them and get the values out during visiting (skipping some hoops we had before).

Where does this bring us? To auto imports. If the flags are set by the time we start visiting, we can be assured that there's a plot statement somewhere in the program (because parsing is over, and the flag is set during parsing).

The culmination of the above work

Finally, it became possible to test the workings.
Although the AST and codegen are not present at this point, that simply means the plot statement code will not be generated yet.

Using a debugger, we can break just before the statements are pushed out from the AST Generator. And the results are promising!

Look at all the plot methods!

With this, we have a general idea of how the PLOT statement will be represented in the AST. Here I stopped, to continue in the next week.

Wrapping up – Week 6

Week 6 marks the midpoint. The goal was to set up at least two plot statements, but I can comfortably say the whole set we showed above will be possible. So, for Week 6, the idea is to finish:

  1. Plot codegen and output – This involves using the template and generating the PLOT statement.
  2. Add the new PLOT and OUTPUT statements to the VSCode TextMate grammar. This requires some research into how TextMate grammars are actually made in VSCode (we've been doing a lot of straight implementations so far)

Week 4: Working towards SELECT

This week's work is planned exclusively around getting SELECT to work; it's a huge part of the language. The decision was to use FUNCTIONs to emulate a SELECT. Why? Because functions in ECL have their own scoping, and local definitions declared there are not visible outside the scope of the function. However, a function cannot shadow definitions that already exist. Selects have aliases, which can be matched perfectly by local function-scoped definitions, and this helps skip the entire process of maintaining "formal" (visible variable name) and "behind the scenes" variable names.

The idea

The idea is a simple 6 step process:

  1. Create a function and assign it to a value
  2. Enumerate all the sources that have been changed, or need processing – JOINS, select in select and aliases.
    1. JOINS – do a join
    2. select in select – follow this procedure and assign it to its alias
    3. Aliases – assign the original value to the alias
  3. Now, take all the sources, and join them all.
  4. Take last generated variable and then do the following in order
    1. Where – apply filter
    2. sort – sort
    3. group and column filters – table
    4. limit offset – choosen
    5. distinct – dedup all
  5. Wrap this whole block of statements in a function, returning the last generated variable.
  6. The variable assigned to the function in step 5 can now be the result (assign it, or just use it as an action).

From

The current syntax supports three kinds of data sources:

  1. Select statements – The statements can have a SELECT statement as part of a table source. To ease our worries and pain, SQL mandates that such sources have a defined alias. Whew. The idea for these is to run this algorithm recursively (read as: thank god recursion exists) to generate the inner select statement, and assign the result to its alias.
  2. Table definitions – The statements can refer to table definitions, which may also have an alias. This is a simple case of a redefinition if a reassignment happens.
  3. Joins – Joins are a part of SQL where there is a specific join type and an expression that drives the join. Unfortunately, joins do not have an alias, and this brings us to an annoyance – we need to force-create an alias and let the user know in that case.

The multiple datasets are then joined, and the result variable is put on a stack. What happens from here on is based on the order above: the variable at the top of the stack is used to generate a new variable, which is then placed on top. This keeps the optional parts actually optional and generates the definitions only as necessary.
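A rough sketch of that stack, with the emitted definitions written in the __r_action_N style seen earlier (this is an illustration, not HSQLT's actual code):

// Each step reads the variable on top of the stack and pushes a new definition.
let counter = 1;
const defs: string[] = [];
const stack: string[] = ['__r_action_0']; // the joined sources

function emit(makeEcl: (src: string) => string): void {
    const name = `__r_action_${counter++}`;
    defs.push(`${name} := ${makeEcl(stack[stack.length - 1])};`);
    stack.push(name);
}

// eg. for: select c1 from ... where x > 5 order by x limit 10
emit(src => `${src}( x > 5 )`);       // WHERE -> filter
emit(src => `SORT(${src},x)`);        // ORDER BY -> SORT
emit(src => `TABLE(${src},{ c1 })`);  // column filters/grouping -> TABLE
emit(src => `CHOOSEN(${src},10)`);    // LIMIT/OFFSET -> CHOOSEN
// DISTINCT would add: emit(src => `DEDUP(${src},ALL)`);
console.log(defs.join('\n') + `\nRETURN ${stack[stack.length - 1]};`);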

Implementing the from part, we get some interesting outputs that work well!

Selecting from multiple tables joins them
Alias reassigns them accordingly

From here, the rest of the statements can be added in. Column filtering is important, as it is the only other mandatory part of select.

The rest of SELECT

The rest of the select uses the last generated variable and creates a new one.

Making some of them, here’s what we get:

The last example, but completed
Columns must always have an alias, as ECL’s naming rules are a bit too difficult for us.
What about sorting? Covered as well!
An error: we track the data types as we create them, so since we know t1 and t2 don't have y6, the combined result can't have it either (I'm monkeypatching the data types for a1 and a2 for testing here; proper inference will be implemented soon)

Looking at the grammar, ORDER BY, GROUP BY and WHERE were added in. That's a lot of querying capability as part of one statement.

The HAVING query segment was considered, but dropped, as it can be done with two queries (another WHERE after the GROUP BY).

Winding up

This took a good part of the week. Adding the support code did take a bit of time, but this gives us the most important part of SQL. The plan for the upcoming week is as follows:

  1. Start looking at import resolution. Imports need to be vetted to make sure that there exists at least a folder, a .ecl file or an HSQL file.
  2. HSQL might need a way to do auto imports for certain core modules such as Visualizer and ML_Core (for plots and for machine learning related statements). This should help with ease of use.
  3. Plots – The plot statement is another important objective, as visualization is an important use case for HSQL. The idea is to support the basic and important visualization methods, and to find out if there's a way for other modules to integrate into HSQL and add more plot methods.


Week 3: Extension, scopes, and planning Select

I started work on the extension first. I needed two primary features: syntax checking, and compiling to ECL. So, I decided to go in order. For syntax checking, the basic process is:

  1. The language client keeps sending info about the text documents in the IDE to the language server.
  2. The language server can act on these, and at its discretion, can send diagnostics.

There is a text document manager provided by the example implementation, but it does not emit any information about incremental updates; rather, it provides the whole document each time.

Thankfully, enough configurability is present to make your own text document manager. Using the standard document manager as reference, I added the ability to attach listeners for incremental updates.

From there on out, I could check whether the incremental updates warranted a change in diagnostics (currently I always update anyway) and then push the document to HSQLT to check. Receiving the results, one can map the issues to their files and diagnostic severity levels; and then, done! Pretty errors!
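As a minimal sketch of that flow with the stock text document manager (checkSyntax here stands in for the call into hsqlt, and the shapes are simplified):

import {
    createConnection, ProposedFeatures, TextDocuments, DiagnosticSeverity, Range,
} from 'vscode-languageserver/node';
import { TextDocument } from 'vscode-languageserver-textdocument';

// hypothetical: hsqlt checks the text and returns located issues
declare function checkSyntax(text: string): { range: Range; message: string; isError: boolean }[];

const connection = createConnection(ProposedFeatures.all);
const documents = new TextDocuments(TextDocument);

documents.onDidChangeContent(change => {
    const issues = checkSyntax(change.document.getText());
    connection.sendDiagnostics({
        uri: change.document.uri,
        diagnostics: issues.map(i => ({
            range: i.range,
            message: i.message,
            severity: i.isError ? DiagnosticSeverity.Error : DiagnosticSeverity.Warning,
        })),
    });
});

documents.listen(connection);
connection.listen();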

Syntax highlighting

The second part was rather simple: a compile command. Adding in a simple UI to confirm whether the user wants outputs in case of warnings or errors, and writing the result to disk in that case, we get a nice little compile command (just remember to force VSCode to save all the files).

Can now use the command to compile straight away!

We can now pause most of the work on the extension, as this will work for testing the majority of HSQL. Once further progress is made on the language side, we can try working on ECL integration or some datatype highlighting.

Pretty diagrams

So I finally got around to updating the general architecture of what the current version of HSQL looks like, and here it is:

Pretty pretty

Lots of arrows and pretty ideas aside, the current repository is meant to be usable as an executable as well as a library. This should make it easy to create extensions based on the system, or even extend it easily later.

Packaging

Both hsqlt and the VSCode extension are intended to be packaged before use. hsqlt has been configured to use pkg, and producing executables is very easy.

The VSCode extension, though, is another story. I run vsce package and am greeted with:

But why?

Why is it trying to access the other project? I thought it was under .vscodeignore. Here's some context: the extension and the compiler repo are located under a parent folder, and the compiler is linked to my global npm scope, and then linked back into this repo.

Digging further into the code, I opened a JavaScript Debug Terminal and saw that the error comes from the ignore module. The module attempts to index which files to ignore, and follows symlinks. Adding to that, it also does not like accepting files which are outside the current folder (which we have here). And voila, the error. I have filed an issue under vsce and hope to see some way around this. Worst-case scenario, I can unlink the repo before packaging (which might even work).

Select – Playing with scope

Select is a bit of a complex statement in SQL. The fact that it can call itself scares me, and having to deal with aliases is also scary. With aliases in mind, I got an idea – scoping. My mentor had mentioned earlier to think about procedures and syntax, and I had been working on scoping, knowing that I'd have to use it eventually. Interestingly, SQL table aliases behave like locally scoped definitions; so perhaps table aliases can be mimicked with scoping.

Now, how to enforce scoping? Functions come to mind. ECL functions are similar to the usual program functions, save one critical difference – no variable shadowing. If a definition is declared, it cannot be shadowed in an easy way. So, time to go about it in a step-by-step way. How can I emulate a select query in ECL? I came up with this program flow for what the ECL should be shaped like:

1. From - Create a gigantic source table from all the sources.
  a. Enumerate all the sources - the usual code generation
  b. apply aliases - redefine them
  c. Join them - JOIN
2. where - Apply a filter on the above table
3. sort - sort the data from above
4. Column filters and grouping - Use TABLE to do grouping and column filtering
5. apply limit offset from SQL - Use CHOOSEN on the above result
6. DISTINCT - DEDUP the above result

This method seems rather useful, as there is a natural "use the result from above" flow to it. Additionally, with this, there is no way that we will refer to data that has been deleted by a previous step. Honing this, I came up with this simple pseudo-algorithm –

1. Create a function and assign it to a value
2. Enumerate all the sources that have been changed, or need processing - JOINS, select in select and aliases.
  a. JOINS - do a join
  b. select in select - follow this procedure and assign it to its alias
  c. Aliases - assign the original value to the alias
3. Now, take _all_ the sources, and join them all. 
4. Take last generated variable and then do the following in order
  a. Where - apply filter
  b. sort - sort
  c. group and column filters - table
  d. limit offset - choosen
  e. distinct - dedup all
5. return the last generated variable.
6. The variable assigned to the function on step 1, is now the result

Let’s check this with an example

Here’s a SQL statement:

select *,TRIM(c1) from table1 as table2;

Here's an idea of what the generated ECL would look like:

__action_1 := function
    // inspect the sources - define all aliases
    table2:= table1;
    // the base select query is done - * from SELECT becomes the referenced table name
    __action_2 := TABLE(table2,{table2,TRIM(c1)});
    // return the last generated variable - __action_2
    return __action_2;
end;
__action_1; // this gives the output of the function

Seems pretty strange, yes. But I’ll be working on this next week, and we shall see how far things go.

Wrapping up for the week

With quite a bit of work done this week around, the plan is to pick up the following on the next week:

  1. Select AST+Codegeneration. Getting them ready as soon as possible is very important to getting things to a usable state. This one point alone is probably really large, as it involves getting a lot of components right to function completely. (In fact, I expect this to take quite a while.)
  2. Looking at syntaxes from other SQL/-like languages.


Week 2: Codegen, AST and VSCode!

My first task this week was picking up from the codegen issue left over from last week.

To summarize, I needed some good changes to code generation as a whole. A while ago, I had reused some concepts from yacc (or rather, any parser): the idea was to have a stack property, which each statement would modify accordingly.

code:ECLCode[];

However, this is difficult to visualize – each node of the parse tree has some assumptions about the stack. For example, after visiting a definition statement, a select statement will assume the definition statement has put its code on the stack, and that it can be used from there. This works, but it really does complicate the whole process.

So, it's easier to have each visit() call return an instance of ECLCode, ie. everything that it has translated. With last week's bug fixed and the new parse tree visitor set up, we can move on to the next step!
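In sketch form (the shapes here are assumptions, not HSQLT's exact classes), each visit hands its translation up to its parent instead of mutating a shared stack:

// Hypothetical shapes for illustration.
class ECLCode {
    constructor(public readonly code: string) {}
}

class CodeGenerator {
    // every node translates itself and returns the result upwards...
    definition(name: string, rhs: ECLCode): ECLCode {
        return new ECLCode(`${name} := ${rhs.code};`);
    }
    // ...so a parent composes its children without stack bookkeeping
    program(stmts: ECLCode[]): ECLCode {
        return new ECLCode(stmts.map(s => s.code).join('\n'));
    }
}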

Output

Taking a detour, OUTPUT is a very important statement. Most programs process some input and give some output and, well, the OUTPUT statement allows for just that.

Working out the particular syntax, we can see one issue from the beginning: ECL has an oddity with OUTPUT statements –

// this works
OUTPUT(someSingularValue,NAMED('hello1'));
// this works
OUTPUT(someTableValue,NAMED('hello2'));

//this does not work
OUTPUT(someSingularValue,,NAMED('hello3'));
//this works
OUTPUT(someTableValue,,NAMED('hello4'));

// this should not work
OUTPUT(someSingularValue,'~output::hello5');
// but this doesn't work either!
OUTPUT(someTableValue,'~output::hello6');

// Although this works! What
OUTPUT(someTableValue,,'~output::hello7');
// this fails to compile too
OUTPUT(someTableValue,NAMED('hello8'),'~output::hello8');
// but this works!
OUTPUT(someTableValue,,NAMED('hello9'),'~output::hello9');

So, here's the OUTPUT statement documentation, and on reading it, it becomes rather obvious what is happening.

Tables require a record expression, which is optional in the sense that it may be left empty. For the first table output (hello2), the NAMED option gets recognized as an expression and passes as correct syntax. But for the latter ones, only the table record variant can apply, and the syntaxes with just one comma fail.

For reference, we can look at SQL. SQL deals mostly in tables only. So, following that, I decided the easiest (and maybe the laziest) option is to not support singular (aka non-table) data outputs. Working that out has two parts:

  1. Error on trying to output singular values
  2. Warnings on trying to output possibly singular values – eg. values typed as any.

All right, with all that done, we can get around to getting OUTPUT working.

ASTs working!

Codegeneration

With the codegeneration issues out of the way, it was full speed ahead. And in no time at all, it's done: OUTPUT and assignments.

A nice little warning too! (Because we don’t infer types on imports yet)

VSCode

All right, this is the meat of the HSQL experience: an easy way to deal with HSQL code. And usually, that means a not-so-easy way for us to deal with HSQL code. I'd done some language server work before (shameless plug here), and one of the interesting things was the recommendation to use Webpack to bundle the extension. The idea of the language server/client is very simple. There are two components:

  1. Language Client – It's the extension that plugs into the IDE. It feeds the language server the IDE contents and the actions the user may be taking, and then takes any language server feedback and feeds it into the IDE.
  2. Language Server – A separate program that runs (possibly independently), listens, and communicates with the client over an API.

So, the way the extension is coded in VSCode is that you just start up the language server (or connect to it) and then communicate with it (over JSON RPC).

It was apparent that what I needed was two repositories under a parent repo, where the parent repo has all the extension-related content, and the child repos hold the language server and the client. With that, I had to create a webpack config to transpile the TypeScript projects and generate two outputs: client.js, which would be executed as part of the extension, and server.js, which could be executed separately but was mainly going to be bundled with the client.

I wrote up a nice webpack config, and it threw an error. Oh no.

This took way too long – the majority of the day, in fact; I was knee-deep in trying to interpret what the compilation errors meant before the fix became apparent:

One entry.

transpileOnly: true

The trick was to tell ts-loader to not bother itself with anything other than transpiling, and finally, it worked!
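For context, the relevant bit of the webpack config looks roughly like this (a sketch, not the exact config used here):

// webpack.config.js (sketch)
module.exports = {
    module: {
        rules: [
            {
                test: /\.ts$/,
                loader: 'ts-loader',
                // only transpile; skip full type-checking during bundling
                options: { transpileOnly: true },
            },
        ],
    },
};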

Pretending that didn't happen, I pulled in older resources such as the TextMate grammar for HSQL (it works pretty well for now), added in some support code and voila, a working extension (I mean, it didn't do much yet, but at least it compiled).

It knows when files are changing too! And syntax highlighting!

What was really interesting is that while the whole project consumed at least 100MB of space, Webpack easily pulled it down to less than 10MB for the two JS files combined. This is a good indicator of why bundling is important in JavaScript.

Wrapping up

With all this, I plan to move into the next week with the following tasks directly lined up:

  1. Have a compilation command for the HSQL extension.
  2. Explore some webpack optimizations? Perhaps like chunk splitting since the client and server will share the hsqlt module in common.
  3. Have some syntax checking support put in. This will remain a good way to let people test as the compiler tool evolves.
  4. Start work on SELECT's AST. This has been long pending, and it's time to start the possibly biggest section of HSQL.


Week 1: HSQL

I intended Week 1 to touch up any big issues that I’d noticed in the work I’d done so far – error management, and refining the AST and Variable Table.

Error Management and Imports

So, all compilers generally have multiple stages, and each stage may throw errors/warnings. Along with that, there may often be issues which aren't explicitly compilation issues (eg. maybe the compiler wasn't able to read the file). A common error is when a variable is misused – eg. we can't do a select query from a single integer; we need a table!

Consider this example file sample.hsql, which we are viewing in an IDE:

import a;
b = a;

For a compiler that has to check types, what is the type of b? Of course, it is whatever the type of a is.

Now, a is only understandable if I parse and process the whole of a.hsql and get its types out. However, if there's some kind of error in a.hsql, we need to show it on the editor page for a.hsql, not sample.hsql.

Here’s the relevant method signature (one of the few that I use):

ErrorManager.push(e: TranslationError): void

TranslationError is a class that neatly wraps up where the issue is, what kind of issue it is (Error/Warning/…) and what kind of error it is (Syntactic/Semantic/IO/…). One easy way is to add a file:string as a member of TranslationError and hope for the best. As soon as I tried it, I realised there was a big issue: I have called .push() all over the codebase; there is no way I can expect every object and function to sanely track the current file and report the issue accordingly.

So, the ErrorManager object itself has to track it.

One thing to realise is that the AST generation function is, notably, recursive: it will eventually call itself to resolve another file (which happens when trying to understand an import – more on that in a while).

Seeing recursion, an immediate thought is – a stack. A stack can help deal with recursive things without requiring explicit recursion. So, a fileStack:string[] is good enough to act as a stack (all hail JavaScript). Now, to mirror the AST calls, I decided a nice and easy way was to push the file context (the current filename) onto the stack at the beginning of the function, and pop it at the end.

getAST(fileName:string = this.mainFile){
    errorManager.pushFile(fileName);
    // generate the AST of a file, errors may happen here - it calls itself eventually too if there is an import
    errorManager.popFile();
}

This simple trick ensures that the top of fileStack will always be whatever file is currently being referred to (of course, unless we haven't even started referring to a file yet!).
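A minimal sketch of the idea, with member names assumed from the signatures above:

// Hypothetical shape of the error manager with a file-context stack.
interface TranslationError {
    message: string;
    file?: string;
}

class ErrorManager {
    private fileStack: string[] = [];
    private issues: TranslationError[] = [];

    pushFile(fileName: string): void {
        this.fileStack.push(fileName);
    }
    popFile(): void {
        this.fileStack.pop();
    }
    push(e: TranslationError): void {
        // attribute the issue to whichever file is currently being processed
        e.file = this.fileStack[this.fileStack.length - 1];
        this.issues.push(e);
    }
}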

And, Bingo!

Error messages with file locations!

File Extension management

So, the HSQL (trans/com)piler also needs to rename files by extension easily. Node's path module is perfect for this, but there is one important point to note: pathnames are handled worry-free as long as we create them and consume them on the same system. If we use a Windows-based system to create a path, eg. C:\My\File.txt, it may not work properly when consumed on a POSIX (Linux/Mac/…) system. There are ways around that provided by the API, but we don't need to worry about this edge case, as all the pathnames entered during runtime are consumed by the same host.

Writing some small support code, we can use path.parse and path.format to easily rename files (TypeScript does not like us doing it, but it's a valid way according to the documentation), and voila!
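A small sketch of how that rename works (the helper name is mine, not the actual support code):

import * as path from 'path';

// Change a file's extension using parse/format.
function changeExtension(file: string, ext: string): string {
    const parsed = path.parse(file);
    // format() prefers `base` over name+ext, so drop it – this is the bit
    // TypeScript dislikes, but the documentation allows it
    delete (parsed as { base?: string }).base;
    parsed.ext = ext;
    return path.format(parsed);
}
// changeExtension('input.hsql', '.ecl') -> 'input.ecl'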

input.hsql -> input.ecl !

AST, AST and more AST

ASTs were one of the primary goals that I have been working towards.

Internally of course it’ll be a data structure, but we can represent it something like this:

Here’s a simple idea of what an AST could look like graphically

Note how much simpler it is than a parse tree. This gives me the added benefit of simplicity in the later stages; but it's important to remember that the real star of the show is the little table on the right – a detailed type for every variable. More on that some other time, but this representation should be mutable (something ANTLR parse trees don't like being), and we can work with it much more easily.

Variable table lookups

The variable table, as of now, contains a map of variables, their scope, and their types. There are two ways to introduce data into a program:

  1. Direct assignment – Creating a variable from some combination of literals, eg. a = 5. Understanding the type of a is trivial(ish) in this case.
  2. Imports – Imports are an important way to introduce data into a program, eg. import a. Figuring out the type of a is a bit more complex here. Without any more context, we can only say that a is a module, no more and no less. We do not have any more information about the contents of a, although for computation's sake we may assume that it has any and all members that are requested of it. However, if it's another HSQL file, ie. a.hsql, we can parse it and get the variables out of it. But what about ECL?

ECL imports are a little tricky, as we can't really get types out of ECL. So, the best we can do is check whether a definition file is present (let's say a.ecl has an accompanying a.dhsql) that can give us more information on what a is.

Once data is in the system, it can be propagated with assignments, and then exported or given as an output.

Assignment ASTs

So, assignments are the core of this language.

y = f(x)

This is nice pseudocode for how assignments work in general.

Given f(x) is a function on input x, we can try to figure out what y will be shaped like. As in,

If we know what x is (a table? a singular value?) and we know what f is and how it transforms x, we can figure out what the shape of y is.

Eg.

a = b;

This is where we don't really have a special modifying function, just an assignment. Whatever b is, a is definitely the same type.

Now consider,

a = select c1,c2 from b;

Now, here is a transformation function: if b is a table, a becomes a table with c1 and c2. If b is a singular value, then, well, a is just invalid :P.

Carrying this knowledge over, we can say that, assuming f(x) returns its data shape along with its AST node, our Eq node just has to create a new variable according to what the LHS has been defined as.

So to start with, I created the AST for only direct assignments.

To do this, it's easy to find the type – f(x) = x here, so it's just a lookup into the variable table, which we understood earlier.

Putting this in terms of code is really easy for the first part. The assignment handler need only take the data type and create a new variable as per the LHS, trusting that the parse tree visitor has already created and validated the AST node for the RHS (which returns its AST node and data type).
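In code-sketch form (hypothetical names; the real visitor differs), that assignment handling is roughly:

type DataType = 'table' | 'singular' | 'any';
interface ASTNode { children: ASTNode[] }

class EqualNode implements ASTNode {
    children: ASTNode[];
    constructor(public lhs: string, rhs: ASTNode) {
        this.children = [rhs]; // the RHS hangs off the assignment node
    }
}

// The RHS visit has already returned its AST node and data type;
// the LHS simply becomes a new variable of that type.
function visitAssignment(
    varTable: Map<string, DataType>,
    lhs: string,
    rhs: { node: ASTNode; dataType: DataType }
): EqualNode {
    varTable.set(lhs, rhs.dataType);
    return new EqualNode(lhs, rhs.node);
}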

An AST (notice the additional information) and the stmts, which is the root AST node’s children

Of course the image may not show it, but the RHS is contained as a child node of the Assignment section.

Call stack and a lesson on stack traces

Nice! ASTs are conceptually working. But I try to generate code, and I see this:

Yikes! Max call stack size exceeded!

All right, looking at the stack trace, it becomes obvious what happened here: visit keeps getting called. Since I hadn't yet added code generation for the equals statement, the mistake/oversight becomes obvious – if a visit<something>() isn't defined, it calls visit() as a fallback. But visit() calls accept(), which calls the required visit<something>() again. The week is getting scarily close to the end, so after finding a fix that will work for me, I decided to pick it up first thing next week 😛.

Winding up and getting ready for next week

So, the first week was interesting for me, as I had to get a few things ready and transfer over some earlier work. But as soon as that was done, we were ready for some quality work. With all this over, the main focus next week is to –

  1. Get Codegeneration fixed – This will require some redesign of the codegeneration class
  2. Implement at least a basic Output statement – Output statements can help us get to testing faster.
  3. Look at VSCode extensions – Try to get a reasonable extension started!
  4. Syntax and workflow ideas – There can never be enough looking at syntax and seeing what is the best syntax for doing something!

Week 0.5: Development Environment

I run Linux for most of my development work (as would most college-level students for college work). I specifically prefer Manjaro (or, to be honest, any Arch-based distribution) because of how well it has worked for me.

Honestly, stock Manjaro is good enough sometimes

However, if you head over to the HPCC Systems download page, you may notice a stark absence of Manjaro and/or Arch Linux. The supported operating systems (as of writing this) are CentOS 7/8 and various flavours of Ubuntu. Thankfully, all is not lost: HPCC Systems is open source, and one of the best outcomes of that is that it can be built from source.

Scouring the manual and the GitHub repository brought me to the build instructions page, where instructions were provided for building the client tools from source. This gave me a good idea of how to set up the environment.

A development environment is simple enough: a client computer making requests to a single-node cluster (which can be your computer again).

This reminded me of how useful Docker can be here. I've been using Docker a lot recently; I tend to use it in cases where the resource overhead of VMs does not feel justified, but it still makes sense to keep a server's processes contained.

So this was going to be my two part installation:

  1. The HPCC Systems Client tools – compiled from source
  2. The HPCC Systems Cluster – on Docker as a separate container

Client Tools

Getting the client tools to work was simple. I installed all the dependencies, plus sccache (really helpful, as it caches your compilation results and helps when you recompile). I combined the instructions from sccache's page and the build instructions to build only the client tools and to use sccache as the cache tool. I typed in make -j6 and took a small break.

Once that was all done and dusted, it seemed weird that make package was trying to install the package to my system, skipping the package creation that usually happens on other distros. I skipped straight ahead to sudo make install, and then I had a working Client Tools install.

Woo it worked!

HPCC Systems

Although native installs are the best, I have found that VMs are a good way to manage things in case something goes wrong (botched configuration, resource overuse), and they are a breeze to clean up. Recently I have started to use Docker, and it's a nice replacement for VMs where the extra level of virtualization is not required. I also found some resources on building HPCC Systems images for Docker, but they were for v7, whereas I was using v8. Using them as a base, I ended up making some changes and setting up a nice Dockerfile and docker-compose.yml to create and use a single-node cluster. Adding restart: always to the Compose configuration allows the container to restart on machine restart (which is a really nice feature I've needed).
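The Compose file ends up looking roughly like this (a sketch; the build details and ports are assumptions for a single-node setup):

# docker-compose.yml (sketch)
services:
  hpcc:
    build: .            # the Dockerfile that builds the HPCC Systems image
    restart: always     # bring the cluster back up on machine restart
    ports:
      - "8010:8010"     # ECL Watch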

Runs pretty well!

For me, Docker had already forwarded its ports to localhost (:8010, …). If not, there are the environment variables ECL_WATCH_IP and ECL_WATCH_PORT that can be set for the client tools to connect to that specific cluster.

And all done!

And here we have the ECL Watch page!

With a development environment ready to go, I can start testing ECL and how translations between HSQL and ECL can work!


Week 0: About HPCC Systems and HSQL

HPCC Systems is a high-performance, enterprise-ready, open-source supercomputing platform (check out their official page and their GitHub organization).

Maybe a better way of putting it: it provides a really easy way to perform data cleaning, transformation and aggregation (and most other tasks you may need with big data) in an efficient and distributed fashion. But of course, that description only touches on its capabilities and the work being done with it.

HPCC Systems and ECL

So how would one use this entire system? Here's one way: you write up what you need the system to do in a language called ECL. It's a really powerful declarative language that allows you to write fast, expressive code that deals with data (or rather, describes how to deal with data). Here's a nice example that shows how you may use ECL:

// The layout of your data
myDataLayout := RECORD
    UNSIGNED id;
    STRING16 firstName;
    STRING16 lastName;
END;
// Load up your data from '~myscope::names.csv', a (logical) file that is distributed/sprayed onto the system
myData := DATASET('~myscope::names.csv',myDataLayout);

// sort the data by firstName
sortedNames := SORT(myData,firstName);

// output only the first ten for display's sake; some people might save the whole output somewhere else completely
OUTPUT(CHOOSEN(sortedNames,10));

The language is also very data-centric, as you may notice from the example above. This stress on the flow of data allows programs made with ECL to be aggressively optimized and parallelized.

(H)SQL

ECL as a language targets powerful data processing and transformation, but some people may prefer the simplicity and familiarity of SQL. In fact, SQL is already usable in HPCC Systems; however, it is present as an embedded language, and still requires wrapping ECL.

So here, our idea has been to present HSQL, a SQL-like language that should serve as a nice entry point for data analysts and newcomers to HPCC Systems. Here's a brief look at what the syntax is designed to look like:

IMPORT xyz; // imports should be a familiar concept from other languages
myPeopleData = SELECT * from xyz.peopleData;

sortedFirstPart = SELECT * from myPeopleData ORDER BY firstName LIMIT 10;
OUTPUT myPeopleData;

For many people who are used to SQL and other languages, this may seem more natural and easier to grasp. This SQL-inspired language is intended to translate completely to ECL, so most code can interface with it with no problem, and people can work on projects using both ECL and HSQL. Of course, as they get used to the power and effectiveness of ECL, they may choose to shift over to using ECL more, but the journey should be relatively smooth. (There are some more ML-specific syntaxes, but I'll leave them for later.)

HSQL – Last year

I've been working on HSQL since last year; it started as a project from LexisNexis Risk Solutions, and we've been working hard on it since then.

Here’s how it was working back then:

How HSQL becomes ECL

We used ANTLR: you enter your grammar, and it generates a lexer (a program that scans input text and chops it up into tokens) and a parser (which takes these tokens and tries to get some structure out of them). Our project was based on Javascript, as HSQL is also intended to work in cloud environments. Here's a good example of what parse trees from ANTLR look like (I use the ANTLR extension from the VSCode Marketplace for visualization):

This is a statement: plot from table1 title 'graph';

ANTLR offers a nice way to traverse these parse trees (listeners and visitors, the latter essentially being an implementation of the Visitor design pattern). Using this, we can check whether the entered program is semantically right, and then try to generate the ECL code for it. Wrapping this whole process up in a nice-looking tool, and wrapping that up further in an IDE extension, we were able to use it as a good base for how HSQL can work.

HSQL

Somewhere around last year, I submitted a proposal to the HPCC Systems Summer Internship Program to work further on HSQL, to bring in key changes and enhancements that should make it a more extensible language. I was really happy when they let me know that they had accepted my proposal and were offering me an internship for the summer period. Here are some of the key points I plan to work on:

  1. Introduce an AST building phase where semantic validation can be done. This is really important as ANTLR does not like tree-rewriting (as of writing this) and can misbehave. Additionally, this also splits out code generation as a separate phase of the translation, which is useful especially when used in an IDE.
  2. Improve the interface of HSQLC for use in IDEs and as a standalone CLI tool.
  3. Add in some nice syntax for merges and filtering.

Working on HSQL

I have been working alongside my mentor to get things ready for a while, and this is a brief summary of the changes I've made till now:

  1. Migrate to TypeScript: Coming off from finishing the project, I realized that a big project like HSQLC requires documentation and a strong emphasis on types. Having a strongly-typed foundation helps you keep some sense of the project when you refer back to your own code a few months down the line.
  2. Introduce an AST generation stage: Splitting the previously one-shot code generation into an AST pass with some semantic validation, followed by code generation, should simplify processing and also make the tool more IDE-friendly (in IDEs, we don't need to proceed beyond AST generation, as code generation is not really useful there). Although this is in progress, it should be helpful.
  3. A slightly better testing framework: Last time around, we focused on testing only the final ECL output. Although that is a fair kind of test, this time around I believe it is better to test the individual components as they are built up.
  4. A slightly more natural CLI tool: Use commands from yargs instead of flags; this resembles the way programs like compilers are normally invoked.

This should serve as a good base for my 12-week period of work on HSQL. And now, it is time to start!