
Week 9: Lots of things!

This week has been a big one: extending SELECT to load datasets from logical files, adding modules and functions, and running programs from a CLI. There's a lot to cover!

Select from dataset

As layouts have now been introduced, one handy use of SELECT can be added: loading up a dataset. The idea behind it is similar to how join works in HSQL – a new aliased variable that contains the necessary table. The shape of the table can be worked out from the layout supplied to the command. Consider the following example:

Select * from '~file::file.thor' layout l1;

The above example selects from the logical file using the layout l1. To see how this would translate to ECL, here is what we want to achieve –

__r_action_x := FUNCTION
  src1 := DATASET('~file::file.thor',l1,THOR);
  res := TABLE(src1,{src1});
  RETURN res;
END;
__r_action_x; // this is the end result -> assign it if needed

Taking a look at SELECT, we can add support for ECL's four supported file types – XML, JSON, CSV and THOR. THOR is relatively easy to work with, but CSV files very commonly need their headers stripped (CSV(HEADING(1)), more precisely, when writing the ECL code). This might also be a good point for SELECT's offset clause to fit in. Currently it can only work alongside limit, but that restriction can be lifted later (by falling back to array subscripting).
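
As a rough illustration, this mapping could live in a small code generation helper. Here is a TypeScript sketch with hypothetical names (not the actual HSQLT internals):

// Hypothetical helper: emit the ECL DATASET(...) source for a SELECT from a file.
// CSV gets HEADING(1) to strip the header row; XML/JSON may additionally need
// a row path in practice.
function emitDatasetSource(path: string, layout: string, fileType: 'THOR' | 'CSV' | 'XML' | 'JSON'): string {
    switch (fileType) {
        case 'CSV':
            return `DATASET('${path}',${layout},CSV(HEADING(1)))`;
        case 'XML':
            return `DATASET('${path}',${layout},XML)`;
        case 'JSON':
            return `DATASET('${path}',${layout},JSON)`;
        default:
            return `DATASET('${path}',${layout},THOR)`;
    }
}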

Adding this SELECT Dataset as a separate node for the AST, we can treat it just like a join, and voila!

We can select from a dataset!

Modules

Modules are a great way to represent and structure code. Thankfully, having worked through select, the modelling style is similar here. The variable table entry for modules already exists, so it's just a matter of adding in the right syntax. A good one to start with would be:

x = create module(
  export k = 65;
  shared j = 66;
  l = 67;
);

Currently the grammar still includes ECL’s three visibility modes, but we will look at that later.

When creating an AST node, we can create a new scope, put our new variables in to create the module, and then pop them off to create the module entry. While I thought this would be easy, it was more a case of “easier said than done”. Thankfully, it didn't take much longer since, as mentioned, the data types were already in place.
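
For illustration, the push/pop idea could be sketched like this (the class and method names are hypothetical, not the actual HSQLT variable table API):

type DataType = { kind: string };

class VariableTable {
    private scopes: Map<string, DataType>[] = [new Map()];
    pushScope(): void { this.scopes.push(new Map()); }
    // popping hands back the scope's variables, ready to become module members
    popScope(): Map<string, DataType> { return this.scopes.pop() ?? new Map(); }
    add(name: string, type: DataType): void { this.scopes[this.scopes.length - 1].set(name, type); }
}

// Visiting `x = create module(...)`: push a scope, add members while visiting
// the body, then pop the scope and wrap it up as the module entry for `x`.
const vt = new VariableTable();
vt.pushScope();
vt.add('k', { kind: 'integer' });
const moduleMembers = vt.popScope();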

And… modules!

The variable table (or symbol table, in regular compiler literature) can already resolve dot notation, so type checking works seamlessly and the code translates identically to ECL without any problems.

Functions

Functions are intended to be a nice way to reuse code. Implementation-wise, they look pretty intimidating. Where to start?

Well, let's start with the grammar:

create function <name> ([args]){
  [body];
  return <definition>;
};

This is something to start with; it looks simple to work with and implement in the grammar. Going into the specifics, we can look at the arguments – there can be two kinds: the usual primitive arguments, and tables. For the former, the standard C-style type rules work; for layouts, we can define them somewhat like this example:

create function yz(layout l1 x,int y){
  //...
};

An important step here is to actually resolve this layout. Once we resolve the layout, we can approach AST generation for the function in a way similar to the select statement – we push a new scope for the function and, thankfully, as ECL supports variable shadowing for functions, we can push the arguments in as variables. Once that is over, we pop the scope and the resultant variables are gone. The arguments and the result type need to be stored in the datatype though, and a function needs to be represented in the variable table too.
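
As a sketch, a function's entry in the variable table might carry a shape along these lines (illustrative names, assuming a simple DataType representation):

type DataType = { kind: string };

interface FunctionArg {
    name: string;
    type: DataType; // primitives follow C-like rules; tables carry their layout
}

interface FunctionType {
    args: FunctionArg[];
    returnType: DataType; // what a call to this function evaluates to
}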

Adding all this in only does half the job. Now that there are functions, there needs to be a way to call them.

Function call

Function calls are rather important if functions are to actually be executed. The easiest approach is to model them after most programming languages:

x = somefunction(y,z);

This function call can be checked during AST generation with the following steps:

  1. Locate the function from the variable table. If it cannot be found, throw an error.
  2. Match the function arguments. Do the argument count and types match? If not, throw (lots of) errors.
  3. The function call node evaluates to the return type of the function now.

From here, the assignment node takes care of the returned type.
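
A minimal sketch of that checking logic, with hypothetical types standing in for the real ones:

type DataType = string;
interface FunctionType { args: DataType[]; returnType: DataType; }

function checkCall(fn: FunctionType | undefined, argTypes: DataType[], errors: string[]): DataType {
    // 1. locate the function - the caller passes undefined if lookup failed
    if (!fn) {
        errors.push('function not found');
        return 'any';
    }
    // 2. match argument count and types
    if (fn.args.length !== argTypes.length) {
        errors.push(`expected ${fn.args.length} arguments, got ${argTypes.length}`);
    }
    fn.args.forEach((expected, i) => {
        if (i < argTypes.length && argTypes[i] !== expected) {
            errors.push(`argument ${i + 1}: expected ${expected}, got ${argTypes[i]}`);
        }
    });
    // 3. the call node evaluates to the function's return type
    return fn.returnType;
}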

Adding all these steps, we can get something marvelous –

Functions and function calls, finally! (The example also shows passing a select statement's results into the first argument.)

Running

HSQLT already consults eclcc for getting paths; this is a good opportunity to extend that functionality and use ecl to compile and submit the resultant .ecl program. After checking that the previous code generation stage produced its outputs on the filesystem, we can take the main file, which stores the entry point for parsing, and submit that file's ECL program to the cluster using the client tools. Using some nice stream processing, we can forward the stdout and stderr of the child ecl process and get this nice result:

An easy way to run the programs now!
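
For reference, the child-process handling might look roughly like this (a sketch; the exact ecl flags, target and server values depend on the setup):

import { spawn } from 'child_process';

// compile and submit the generated ECL via the client tools,
// forwarding the child process output as it streams in
const child = spawn('ecl', ['run', 'thor', 'output.ecl', '--server=localhost']);
child.stdout.pipe(process.stdout);
child.stderr.pipe(process.stderr);
child.on('close', code => process.exit(code ?? 0));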

And there’s still more to go. Going over to the extension:

Code completion – Snippets

Snippets are an easy way to work with programs, and VSCode snippets are a first step towards code completion. In the example below, typing a simple write fills in the whole line. Using TAB, the statement can be navigated and the appropriate sections filled in.

Write Snippets

Code completion – Variable Resolving

Variable resolution can be combined with snippets to allow the user to write programs faster and more effectively. Pressing Ctrl+Space currently brings up the current program's variable table for autocompletion. This can be added by storing the last compilation's variable table, and using it to find the possible variable entries.
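
A sketch of how the stored table could be surfaced through the language server (assuming a simple name-to-type-label map; the real HSQLT structures differ):

import { CompletionItem, CompletionItemKind } from 'vscode-languageserver';

// kept up to date after every successful compilation
let lastVariableTable = new Map<string, string>();

function getCompletions(): CompletionItem[] {
    return [...lastVariableTable.entries()].map(([name, typeLabel]) => ({
        label: name,
        kind: CompletionItemKind.Variable,
        detail: typeLabel,
    }));
}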

We can see the variables!

Wrapping up

Oh dear, that was a lot of things in one week. The plans for the next week can roughly be summed up as –

  1. Look into implementing TRAIN and PREDICT statements for Machine Learning.
  2. Look into code generation, fix any glaring issues that result in incorrect code generation.

Week 7: WRITEs, DISTRIBUTE and Bugfixes

This week's work was on getting the WRITE statement to work, and adding a DISTRIBUTE extension to SELECT. The rest was bugfixes and reading.

Write

The OUTPUT statement in HSQL is intended as a way to output values to a workunit. In ECL however, OUTPUT can write to a workunit as well as to a logical file. In HSQL, it is better to bifurcate this into two separate commands. With that, the idea behind the WRITE statement is to allow writing datasets to logical files.

The syntax for WRITE was designed as:

WRITE <id> [TO [FILE]] [TYPE] [JSON|THOR|CSV|XML] <STRING> [OVERWRITE];

ECL can support outputs of four types – XML, JSON, THOR and CSV. TypeScript enums are a great way of supporting this:

export enum FileOutputType {
    THOR,
    CSV,
    JSON,
    XML,
}
// get the enum value
const y = FileOutputType.THOR;
// y is 0

// get the string
const z = FileOutputType[y];
// z is 'THOR'

The enums provide a two-way conversion between strings and integers, which is convenient for keeping both representations around while still being cheap computationally.

Adding in the AST and Code generation stages, we can test out some examples:

-- hsqlt make Commands/write_command/write.hsql
import %.a;

P = select * from a.t1 order by c1;

-- THOR/JSON/XML/CSV
-- syntax:
-- WRITE <id> [TO [FILE]] [TYPE] [JSON|THOR|CSV|XML] <STRING> [OVERWRITE];
write p to file type CSV '~output::writetofile.csv';

-- to thor files
write p to file type THOR '~output::writetofile.thor';

-- to xml
write p to file XML '~output::writetofile.xml';

This gives a nice way to write datasets out to files so they can be used outside too.
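
As an illustration, the code generation for WRITE could boil down to a helper like this sketch, reusing the FileOutputType enum from above (the helper itself is hypothetical):

enum FileOutputType { THOR, CSV, JSON, XML }

function emitWrite(id: string, type: FileOutputType, path: string, overwrite: boolean): string {
    const typePart = type === FileOutputType.THOR ? '' : `,${FileOutputType[type]}`;
    const overwritePart = overwrite ? ',OVERWRITE' : '';
    return `OUTPUT(${id},,'${path}'${typePart}${overwritePart});`;
}

// emitWrite('p', FileOutputType.CSV, '~output::writetofile.csv', false)
// gives: OUTPUT(p,,'~output::writetofile.csv',CSV);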

DISTRIBUTE extension of SELECT

DISTRIBUTE in ECL is used to distribute data across nodes for further computation; it is really helpful for managing data skew as we go along with further computations.

The idea for it is to be added as an extension in the HSQL select statement with the grammar being:

DISTRIBUTE BY <col1>[,<col2>,...]

This will add a DISTRIBUTE element to the end of the SELECT pipeline which will aid in managing data skew.

a = select * from marksmodule.marksds where marks>40 distribute by marks,subid;

The output for this expression looks like this:

__r_action_0 := FUNCTION
__r_action_1 := marksmodule.marksds( marks > 40);
__r_action_2 := TABLE(__r_action_1,{ __r_action_1 });
__r_action_3 := DISTRIBUTE(__r_action_2,HASH32(marks,subid));
RETURN __r_action_3;
END;
a := __r_action_0;

__r_action_3 is the computation that obtains the distributed dataset, and it is returned in that context. HASH32 is recommended by the documentation when distributing by expressions/columns, and was chosen as the most common way of distributing that still maintains an even distribution.

Bugfixes for the extension

Seems like the extension was lagging behind HSQLT. This dates back to the previous update that added in file system reading. As the file locations now have to be provided separately, the extension had not yet been configured for it. Additionally, the extension's client and server work in different folders; due to this, the server could successfully locate files whereas the client failed to. This required support for the compiler to refer to another directory 'offset' from the current directory (added to the CLI tool as a -b flag).
Adding this, along with support for %, allows HSQL to be used with the extension a bit more easily.

Writes, clean outputs and no more ‘Warnings’!

Demo

As there was a demo at the end of this week, various examples were created to show how HSQL may be used. Currently, they do not showcase actual use cases, only the syntax and usability of the language. Nearly all supported features have been enumerated, and the plan is to have a nice set of examples for each statement going ahead. The demo revealed some good-to-have feedback, and the plan is to work towards getting more statements and features in as we go along.

Wrapping up

As Week 7 comes to a close, the plan is to work on these for Week 8:

  1. Add support for exporting definitions
  2. Add support for bundling definitions into modules in HSQL
  3. Look into procedure syntaxes in HSQL
  4. Evaluate existing featureset and make some examples to work on performance and correctness.

Week 6: Plots and TextMate Grammars

This week’s plans were to essentially work on and implement the PLOT statement, and then to see how to update the TextMate Grammar and add in the new syntax.

Plots

Continuing from last week, we have auto-imports and an AST for plots. With a little code generation, we can have a functioning PLOT statement.

Here’s the implementation I ended up choosing out of the 3 possible ways:

PARALLEL(OUTPUT(<dataset>,,NAMED('<name>')),<module>.<function template>);

PARALLEL allows for the actions to be executed in parallel, or rather, parallel given the best ordering that the ECL Compiler finds.

The code generation section was added using this idea, and care was taken to not emit the plot statement if the plot method was not found (and to raise an error as well) – as in such a case no template for the function call would be found.
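
For illustration, the emit step could be sketched like this (the names and the template format are hypothetical, not the actual HSQLT code):

// look up the plot method's call template; bail out with an error if missing
function emitPlot(dataset: string, title: string, template: string | undefined, errors: string[]): string | null {
    if (template === undefined) {
        errors.push(`Unknown plot type for '${title}'`);
        return null; // no PLOT statement is emitted
    }
    // e.g. template = "Visualizer.MultiD.bar('{0}',,'{0}')"
    const call = template.replace(/\{0\}/g, title);
    return `PARALLEL(OUTPUT(${dataset},,NAMED('${title}')),${call});`;
}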

Let’s take this sample code for reference

import marks;

-- lets get all the average marks of each subject
counting = select DISTINCT subid,AVG(marks) as avgmarks from marks.marks group by subid;
// join this to get the subject names
marksJoined = select * from counting join marks.subs on counting.subid=marks.subs.subid;

output counting title marksAvg;
// filter out what we need for plotting

marksPlottable = select subname, avgmarks from marksJoined;
// some different plots
PLOT from marksPlottable title hello type bar;
PLOT from marksPlottable title hello2 type bubble;

Doing some sample code generation after implementation, we get this translation:

IMPORT Visualizer;
IMPORT marks;
__r_action_0 := FUNCTION
__r_action_1 := TABLE(marks.marks,{ subid,avgmarks := AVE(GROUP,marks) },subid);
__r_action_2 := DEDUP(__r_action_1,ALL);
RETURN __r_action_2;
END;
counting := __r_action_0;
__r_action_3 := FUNCTION
somesource := JOIN(counting,marks.subs,LEFT.subid = RIGHT.subid);
__r_action_4 := TABLE(somesource,{ somesource });
RETURN __r_action_4;
END;
marksJoined := __r_action_3;
OUTPUT(counting,,NAMED('marksAvg'));
__r_action_5 := FUNCTION
__r_action_6 := TABLE(marksJoined,{ subname,avgmarks });
RETURN __r_action_6;
END;
marksPlottable := __r_action_5;
PARALLEL(OUTPUT(marksPlottable,,NAMED('hello')),Visualizer.MultiD.bar('hello',,'hello'));
PARALLEL(OUTPUT(marksPlottable,,NAMED('hello2')),Visualizer.TwoD.Bubble('hello2',,'hello2'));

Submitting this job to the local cluster, we can browse the Visualizations by going over to the workunit’s ‘Resources’ tab.

And, we have plots!

TextMate Grammars

The grammar currently in use is from the JavaScript version, and is a bit finicky. When it was being set up, I used an approach similar to modelling a CFG – I mapped the ANTLR statements over to syntax mapping rules. That's not the best idea.

I looked up and read a few grammars for existing languages, giving some extra time to SQL. Turns out, most of them do not use the method above (not a surprise to anyone).

When doing syntax highlighting, the developer also needs feedback as the code is being typed. If you wait for the whole statement to match, the whole statement stays plain white until the ending ';'.

A much better way is to recognize specific parts of the language quickly and colour them accordingly.

Since HSQL has a lot of tokens, syntax highlighting can be done pretty effectively this way. Adding in the tokens for the language, we see some good highlighting, which the user can take advantage of as they type.

That took a while to get right, and as a result it solves two issues:

  1. Performance and complexity: The regexes and captures were difficult, long and annoying to read. By just identifying and colouring the tokens, the grammar is much simpler, and easy to read (and maintain).
  2. Tying into the ANTLR grammar: As the earlier TextMate grammar identified whole statements, the two grammars needed to be completely in sync, which is difficult considering that the language is being actively worked on. Identifying token-wise simplifies things, and as we are no longer dependent on the ANTLR grammar, it can be worked on and the syntax highlighting will still work very well. Only every now and then do new keywords that become part of the language need to be added into the TextMate grammar.

Wrapping up

Week 6 completes, and so do the midterm goals that we had planned for. The plan for the next week is roughly:

  1. Decide on the plan ahead
  2. Fix VSCode extension related bugs: Due to some breakage from the addition of the File Providers in earlier weeks, quite a few bugfixes have to be done for the HSQL extension.
  3. Add in DISTRIBUTE BY to Select – This would correspond to DISTRIBUTE in ECL
  4. Investigate grammar for WRITE TO variant of the OUTPUT statement for writing to files.

Week 3: Extension, scopes, and planning Select

I started work on the extension first. I needed two primary features: syntax checking, and compiling to ECL. So, I decided to go in order. For syntax checking, the basic process is:

  1. The language client keeps sending info about the text documents on the IDE, to the language server.
  2. The language server can act on these, and at its discretion, can send diagnostics.

There is a text document manager provided by the example implementation, but it does not emit any information about incremental updates; rather, it provides the whole document itself.

Thankfully, enough configurability is present to make your own text document manager. Using the standard document implementation as a reference, I added the ability to attach listeners for incremental updates.

From there on out, I could check whether the incremental updates warranted a change in diagnostics (which I currently set to always update anyway) and then push the document to HSQLT for checking. Receiving the results, one can map the issues to the files and diagnostic severity levels; and then done! Pretty errors!
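
For illustration, mapping an HSQLT issue onto an LSP diagnostic might look like this (the issue shape here is hypothetical):

import { Diagnostic, DiagnosticSeverity } from 'vscode-languageserver';

interface HsqlIssue { line: number; colStart: number; colEnd: number; msg: string; isError: boolean; }

function toDiagnostic(issue: HsqlIssue): Diagnostic {
    return {
        // LSP positions are zero-based
        range: {
            start: { line: issue.line - 1, character: issue.colStart },
            end: { line: issue.line - 1, character: issue.colEnd },
        },
        severity: issue.isError ? DiagnosticSeverity.Error : DiagnosticSeverity.Warning,
        message: issue.msg,
        source: 'hsql',
    };
}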

Syntax highlighting

The second part, a compile command, was rather simple. Adding a simple UI for confirming whether the user wants outputs in case of warnings or errors, and writing the result to disk in that case, we get a nice little compile command (just remember to force VSCode to save all the files).

Can now use the command to compile straight away!

We can now pause most of the work on the extension, as this will work for testing the majority of HSQL. Once further progress is made on the language side, we can try working on ECL integration or some datatype highlighting.

Pretty diagrams

So I finally got around to updating the general architecture diagram of what the current version of HSQL looks like, and here it is:

Pretty pretty

Lots of arrows and pretty ideas aside, the current repository is meant to be usable as an executable as well as a library. This should make it easy to create extensions based on the system, or even extend it easily later.

Packaging

Both hsqlt and the vscode extension are intended to be packaged before use. hsqlt has been configured to use pkg, and producing executables is very easy.

The VSCode extension, though, is another story. I run vsce package and am greeted with:

But why?

Why is it trying to access the other project? I thought it was covered by .vscodeignore. Here's some context: the extension and the compiler repo are located under a parent folder, and the compiler is linked to my global npm scope, and then linked back into this repo.

Digging further into the code, I opened a JavaScript Debug Terminal and saw that the error comes from the ignore module. The module attempts to index which files to ignore, and follows symlinks. On top of that, it does not accept files that are outside the current folder (which is exactly what we have here). And voila, the error. I have filed an issue with vsce and hope to see some way around this. Worst-case scenario, I can unlink the repo before packaging (which might even work).

Select – Playing with scope

Select is a bit of a complex statement in SQL. The fact that it can nest itself scares me, and having to deal with aliases is also scary. With aliases in mind, I got an idea – scoping. My mentor had mentioned earlier to think about procedures and syntax, and I had been working on scoping, knowing that I'd have to use it eventually. Interestingly, SQL table aliases behave like locally scoped definitions; so perhaps table aliases can be mimicked with scoping.

Now, how to enforce scoping? Functions come to mind. ECL functions are similar to the usual program functions, save one critical difference – no variable shadowing. If a definition is declared, it cannot be shadowed in an easy way. So, time to go about it step by step. How can I emulate a select query in ECL? I came up with this program flow for what the ECL should be shaped like:

1. From - Create a gigantic source table from all the sources.
  a. Enumerate all the sources - the usual code generation
  b. apply aliases - redefine them
  c. Join them - JOIN
2. where - Apply a filter on the above table
3. sort - sort the data from above
4. Column filters and grouping - Use TABLE to do grouping and column filtering
5. apply limit offset from SQL - Use CHOOSEN on the above result
6. DISTINCT - DEDUP the above result

This method seems rather useful, as there is a natural “use the result from above” flow to it. Additionally, with this, there is no way we will be referring to data that has been deleted by a previous step. Honing this, I came up with this simple pseudo-algorithm –

1. Create a function and assign it to a value
2. Enumerate all the sources that have been changed, or need processing - JOINS, select in select and aliases.
  a. JOINS - do a join
  b. select in select - follow this procedure and assign it to its alias
  c. Aliases - assign the original value to the alias
3. Now, take _all_ the sources, and join them all. 
4. Take last generated variable and then do the following in order
  a. Where - apply filter
  b. sort - sort
  c. group and column filters - table
  d. limit offset - choosen
  e. distinct - dedup all
5. return the last generated variable.
6. The variable assigned to the function in step 1 is now the result

Let’s check this with an example

Here’s a SQL statement:

select *,TRIM(c1) from table1 as table2;

Here's an idea of what the generated ECL would look like:

__action_1 := FUNCTION
    // inspect the sources - define all aliases
    table2 := table1;
    // the base select query is done - * from SELECT becomes the referenced table name
    __action_2 := TABLE(table2,{table2,TRIM(c1)});
    // return the last generated variable - __action_2
    RETURN __action_2;
END;
__action_1; // this gives the output of the function

Seems pretty strange, yes. But I’ll be working on this next week, and we shall see how far things go.

Wrapping up for the week

With quite a bit of work done this week, the plan is to pick up the following next week:

  1. Select AST+Codegeneration. Getting them ready as soon as possible is very important to getting things to a usable state. This one point alone is probably really large, as it involves getting a lot of components right to function completely. (In fact, I expect this to take quite a while.)
  2. Looking at syntaxes from other SQL/-like languages.


Week 2: Codegen, AST and VSCode!

My first task this week is picking up from the codegen issue last week.

So to summarize, I needed some good changes for code generation as a whole. A while ago, I had reused some concepts from yacc (or rather, any parser): the idea was to have a stack property, which would be modified accordingly by each statement.

code:ECLCode[];

However, this is difficult to reason about – each node of the parse tree will have some assumptions about the stack. For example, after visiting a definition statement, the select statement will assume the definition statement has put its code on the stack, so it can be used from there. This works, but it really complicates the whole process.

So, it's easier to have each visit() method return an instance of ECLCode, i.e. everything that it has translated. Once last week's bug was fixed and the new parse tree visitor was set up, we could move on to the next step!
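
A sketch of the return-based approach (illustrative names, not the actual HSQLT classes):

class ECLCode {
    constructor(public code: string = '') {}
    append(next: ECLCode): ECLCode {
        return new ECLCode(this.code.length > 0 ? `${this.code}\n${next.code}` : next.code);
    }
}

// each visit method returns the ECL it produced; the parent stitches results
// together instead of every node reasoning about a shared stack
function visitProgram(stmtCodes: ECLCode[]): ECLCode {
    return stmtCodes.reduce((acc, c) => acc.append(c), new ECLCode());
}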

Output

Taking a detour, OUTPUT is a very important statement. Most programs process some input and give some output and, well, the OUTPUT statement allows for just that.

Working out the particular syntax, we can see one issue from the beginning: ECL has an oddity with OUTPUT statements –

// this works
OUTPUT(someSingularValue,NAMED('hello1'));
// this works
OUTPUT(someTableValue,NAMED('hello2'));

//this does not work
OUTPUT(someSingularValue,,NAMED('hello3'));
//this works
OUTPUT(someTableValue,,NAMED('hello4'));

// this should not work
OUTPUT(someSingularValue,'~output::hello5');
// but this doesn't work either!
OUTPUT(someTableValue,'~output::hello6');

// Although this works! What
OUTPUT(someTableValue,,'~output::hello7');
// this fails to compile too
OUTPUT(someTableValue,NAMED('hello8'),'~output::hello8');
// but this works!
OUTPUT(someTableValue,,NAMED('hello9'),'~output::hello9');

So, here’s the OUTPUT statement documentation, and seeing it, it becomes rather obvious what is happening.

Tables require a record expression, which is optional in the sense that it may be left empty. For the first table output (hello2), the table gets recognized as an expression and passes as correct syntax. But for the latter ones, only the table-with-record variant can apply, and the syntaxes with only one comma fail.

For reference, we can look at SQL. SQL deals mostly in tables. So, following that, I decided that the easiest (and maybe the laziest) option is to not support singular (i.e. non-table) data outputs. Working that out has two parts:

  1. Error on trying to output singular values
  2. Warnings on trying to output possibly singular values – e.g. values of type any.

All right, with all that done, we can get around to getting OUTPUT working.

ASTs working!

Code generation

With the code generation issues out of the way, it was a full-speed-ahead moment. And in no time at all, it was done: OUTPUT and assignments.

A nice little warning too! (Because we don’t infer types on imports yet)

VSCode

All right, this is the meat of HSQL: an easy way to deal with HSQL code. And usually that means a not-so-easy way for us to deal with HSQL code. I'd worked with language servers before (shameless plug here), and one of the interesting things was the recommendation to use Webpack to bundle the extension. The idea of the language server/client is very simple. There are two components:

  1. Language client – the extension that plugs into the IDE. It feeds the language server the IDE contents and the actions the user may be taking, and then takes any language server feedback and feeds it into the IDE.
  2. Language server – a separate program that runs (maybe independently), listening and communicating with the client over an API.

So, the way the extension in VSCode is coded, you just start up the language server (or connect to it) and then communicate with it (over JSON-RPC).

It was apparent that what I needed was two projects under a parent repo, where the parent repo has all the extension-related content, and the child projects hold the language server and the client. With that, I had to create a webpack config to transpile the TypeScript projects and generate two outputs: client.js, which would be executed as part of the extension, and server.js, which could be executed separately but was mainly going to be bundled with the client.

I wrote up a nice webpack config, and it threw an error. Oh no.

This took way too long – the majority of the day, in fact; and I was knee-deep trying to interpret what the compilation errors meant before the fix became apparent:

One entry.

transpileOnly: true

The trick was to tell ts-loader to not bother itself with anything other than transpiling, and finally, it worked!
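
For reference, here is roughly where that option slots into the webpack config (a sketch of the relevant rule only):

module.exports = {
    module: {
        rules: [
            {
                test: /\.ts$/,
                exclude: /node_modules/,
                // skip type checking; only transpile
                use: [{ loader: 'ts-loader', options: { transpileOnly: true } }],
            },
        ],
    },
};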

Pretending that didn't happen, I pulled in older resources such as the TextMate grammar for HSQL (it works pretty well for now), added in some support code, and voila, a working extension (I mean, it didn't do much yet, but at least it compiled).

It knows when files are changing too! And syntax highlighting!

What was really interesting is that while the whole project consumed at least 100MB of space, Webpack easily pulled it down to less than 10MB for the two JS files (combined). This is a good indicator of why bundling is important in Javascript.

Wrapping up

With all this, I plan to move into the next week with the following tasks lined up:

  1. Have a compilation command for the HSQL extension.
  2. Explore some webpack optimizations? Perhaps like chunk splitting since the client and server will share the hsqlt module in common.
  3. Get some syntax checking support in. This will remain a good way to allow people to test as the compiler tool evolves.
  4. Start work on SELECT's AST. This has been long pending, and it's time to start the possibly biggest section of HSQL.