Intro
TODO
Note: Code snippets in this documentation are simplified examples and may not represent the actual codebase.
Tokenizers
Tokenizers are responsible for converting a buffer string into a list of tokens. Each token has a Tag enum that represents its type, such as equal for the = symbol, and a Loc struct with start and end indices that represent its position in the buffer.
All tokenizers work similarly and are based on the Zig tokenizer. They have two main methods: next, which returns the next token, and getTokenSlice, which returns the slice of the buffer that represents the token.
Here's an example of how to use a tokenizer:
var toker = Tokenizer.init(buff); // var, because next() mutates the tokenizer
const token = toker.next();
std.debug.print("{s}", .{toker.getTokenSlice(token)});
Tokenizers are often used in a loop until the end tag is reached. In each iteration, the next token is retrieved and processed based on its tag. Here's a simple example:
var toker = Tokenizer.init(buff);
var token = toker.next();
while (token.tag != .end) : (token = toker.next()) switch (token.tag) {
    .equal => std.debug.print("{s}", .{toker.getTokenSlice(token)}),
    else => {},
};
Available Tokenizers
There are three different tokenizers in ZipponDB:
- ZiQL: Tokenizer for the query language.
- cli: Tokenizer for the CLI commands.
- schema: Tokenizer for the schema file.
Each tokenizer has its own set of tags and parsing rules, but they all work similarly.
Parser
Parsers are the next step after tokenization. They take tokens and perform actions or raise errors. There are three parsers in ZipponDB: one for ZiQL, one for schema files, and one for CLI commands.
A parser has a State enum and a Tokenizer instance as members, and a parse method that processes tokens until the end state is reached.
Here's an example of how a parser works:
var state: State = .start;
var token = self.toker.next();
while (state != .end) : (token = self.toker.next()) switch (state) {
    .start => switch (token.tag) {
        .identifier => self.addStruct(token),
        else => printError("Error: Expected a struct name.", token),
    },
    else => {},
};
The parser's state is updated based on the combination of the current state and token tag. This process continues until the end state is reached.
The ZiQL parser uses different methods for parsing:
- parse: The main parsing method that calls other methods.
- parseFilter: Creates a filter tree from the query.
- parseCondition: Creates a condition from a part of the query.
- parseAdditionalData: Populates additional data from the query.
- parseNewData: Returns a string map with key-value pairs from the query.
- parseOption: Not implemented yet.
File parsing
File parsing is done through a small library I wrote named ZipponData.
It is minimal and fast: it can parse 1_000_000 entities in 0.3s on one thread of a Ryzen 7 7800X3D at around 4.5GHz, with a Samsung SSD 980 PRO 2TB (up to 7,000/5,100 MB/s read/write speed).
To read a file, you create an iterator for a single file and then iterate with .next(), which returns an array of Data. This makes everything very easy to use.
const std = @import("std");

// Assuming the ZipponData module is available under this name;
// adjust the import to match your build setup.
const zid = @import("ZipponData");
const Data = zid.Data;
const DataWriter = zid.DataWriter;
const DataIterator = zid.DataIterator;
const DType = zid.DType;
const createFile = zid.createFile;
const deleteFile = zid.deleteFile;

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // 0. Make a temporary directory
    try std.fs.cwd().makeDir("tmp");
    const dir = try std.fs.cwd().openDir("tmp", .{});

    // 1. Create a file
    try createFile("test", dir);

    // 2. Create some Data
    const data = [_]Data{
        Data.initInt(1),
        Data.initFloat(3.14159),
        Data.initInt(-5),
        Data.initStr("Hello world"),
        Data.initBool(true),
        Data.initUnix(2021),
    };

    // 3. Create a DataWriter
    var dwriter = try DataWriter.init("test", dir);
    defer dwriter.deinit(); // This just closes the file

    // 4. Write some data
    try dwriter.write(&data);
    try dwriter.write(&data);
    try dwriter.flush(); // Don't forget to flush!

    // 5. Create a schema
    // A schema is how the iterator will parse the file.
    // If it is wrong, the iterator will return wrong/random data,
    // and most likely an error while iterating in the while loop.
    const schema = &[_]DType{
        .Int,
        .Float,
        .Int,
        .Str,
        .Bool,
        .Unix,
    };

    // 6. Create a DataIterator
    var iter = try DataIterator.init(allocator, "test", dir, schema);
    defer iter.deinit();

    // 7. Iterate over data
    while (try iter.next()) |row| {
        std.debug.print("Row: {any}\n", .{row});
    }

    // 8. Delete the file (optional, of course)
    try deleteFile("test", dir);
    try std.fs.cwd().deleteDir("tmp");
}
Engines
ZipponDB segregates responsibilities into Engines.
For example, the FileEngine is the only place where files are used, for both writing and reading. This simplifies refactoring, testing, etc.
DBEngine
This is just a wrapper around all other Engines to keep them in the same place. It doesn't do anything except store the other Engines.
It can be found in main.zig, in the main function.
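As a rough sketch, assuming these member names (the actual fields may differ):
const DBEngine = struct {
    file_engine: FileEngine,
    schema_engine: SchemaEngine,
    thread_engine: ThreadEngine,
};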
FileEngine
The FileEngine is responsible for managing files, including reading and writing.
Most methods parse all files of a struct and evaluate each entity with a filter, doing something if it returns true. For example, parseEntities parses all entities and, if the filter returns true, writes a JSON object with the entity's data using the writer argument.
These methods are usually separated into two: the main one and a OneFile version, e.g. parseEntitiesOneFile. The main one spawns a thread for each file, each running the OneFile version.
This is how multi-threading is done.
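Here is a minimal sketch of this split; the names, signatures, and thread-pool member are assumptions, not the actual ZipponDB code:
fn parseEntities(self: *FileEngine, filter: Filter) !void {
    // One task per .zid file; each runs the OneFile version on its own file.
    for (0..self.file_count) |file_index| {
        try self.thread_pool.spawn(parseEntitiesOneFile, .{ self, file_index, filter });
    }
    // Wait for all threads, then concatenate the per-thread buffers and send.
}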
SchemaEngine
The SchemaEngine manages everything related to schemas.
It is mostly used to store a list of SchemaStruct, which represents one struct as defined in the schema, with all member names, data types, links, etc.
This is also where I store the UUIDFileIndex, a map of UUID to file index, so I can quickly check whether a UUID exists and in which file it is stored.
This works well but uses a bit too much memory for my taste, around 220MB for 1_000_000 entities. I tried a radix trie, but it didn't use much less memory; maybe I made a mistake somewhere.
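As a hedged sketch, a SchemaStruct might hold something like this (field names are assumptions):
const SchemaStruct = struct {
    name: []const u8, // Struct name as defined in the schema, e.g. "User"
    members: [][]const u8, // Member names
    types: []DataType, // One data type per member
    links: std.StringHashMap([]const u8), // Relationship member -> target struct name
};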
ThreadEngine
The ThreadEngine manages the thread pool of the database.
This is also where the ThreadSyncContext is stored; it is used by each OneFile version of the parsing methods in the FileEngine. These are the only atomic values currently used in the database.
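Based on the description in the Multi-threading section below, a minimal sketch could be (field names are assumptions):
const ThreadSyncContext = struct {
    found_count: std.atomic.Value(u64) = std.atomic.Value(u64).init(0), // Number of found structs
    finished_count: std.atomic.Value(u64) = std.atomic.Value(u64).init(0), // Number of finished threads
};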
Multi-threading
ZipponDB uses multi-threading to improve performance. Each struct is saved in multiple .zid files, and a thread pool is used to process files concurrently. Each thread has its own buffered writer, and the results are concatenated and sent once all threads finish.
The only shared atomic values between threads are the number of found structs and the number of finished threads. This approach keeps things simple and easy to implement, avoiding parallel threads accessing the same file.
Data Structures
AdditionalData
AdditionalData keeps what is between []. It is composed of two structs: AdditionalData and AdditionalDataMember.
AdditionalDataMember has the name of the member, its position in the schema file, and an AdditionalData.
AdditionalData has a limit (the first part, e.g. the 100 in [100]) and a list of AdditionalDataMember.
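A rough sketch of the two structs, with assumed field names:
const AdditionalData = struct {
    limit: usize, // E.g. 100 for [100]
    members: std.ArrayList(AdditionalDataMember),
};
const AdditionalDataMember = struct {
    name: []const u8, // Member name
    index: usize, // Position in the schema file
    additional_data: AdditionalData, // Nested [] for relationships
};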
Filters
A filter is a series of conditions. It uses a tree approach: the filter has a root node that is either a condition or has two other nodes (left and right).
For example, the filter {name = 'Bob'} has one root node that is the condition, so when I evaluate the struct, I just check this condition.
Now for something like {name = 'Bob' AND age > 0}, the root node has the condition name = 'Bob' as its left node and the condition age > 0 as its right node, like this:
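            AND
           /   \
name = 'Bob'   age > 0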
Condition
A condition is part of a filter; it is one 'unit of condition'. For example, name = 'Bob' is one condition, while name = 'Bob' AND age > 0 is two conditions and one filter. A condition is created inside parseCondition in the ziqlParser.
A condition holds the following information:
- value: ConditionValue. E.g. 32
- operation: ComparisonOperator. E.g. equal or in
- data_type: DataType. E.g. int or str
- data_index: usize. The index of the member in the data returned by the zid DataIterator when parsing.
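Put together, a sketch of the struct (assumed layout):
const Condition = struct {
    value: ConditionValue,
    operation: ComparisonOperator,
    data_type: DataType,
    data_index: usize, // Index in the row returned by the zid DataIterator
};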
NewData
NewData is a map with the member name as key and a ConditionValue as value. It is created when parsing and is used to add data into a file. I transform each ConditionValue into a zid Data. Maybe I could directly use a map of member name -> zid Data?
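In other words, something like this, assuming a string-keyed hash map from the standard library:
const NewData = std.StringHashMap(ConditionValue);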
RelationMap
A RelationMap is used when I need to return relationships. Let's say we have this query: GRAB User [orders [date]].
The RelationMap has a struct_name (here Order), a member name (here orders), and a map with UUID as key and a string as value.
When I first init the map, I am parsing the first struct (here User), so it populates the map with an empty string for each entity that I want to return; here, those are the UUIDs of Order.
Then I parse the Order files and add the string to the right UUID, skipping UUIDs that are not in the map.
Once that is done, I parse the JSON response that I generated when parsing User. Where a relationship should be, there is {<|[16]u8|>}, where [16]u8 is the UUID of the entity that should be there.
So now I can just replace it with the string stored for that UUID in the map.
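A hedged sketch of the structure, with assumed field names:
const RelationMap = struct {
    struct_name: []const u8, // E.g. "Order"
    member_name: []const u8, // E.g. "orders"
    map: std.AutoHashMap([16]u8, []const u8), // UUID -> generated JSON to substitute
};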
EntityWriter
This is responsible for transforming the raw Data into JSON, a table, or another output format to send to the end user. Basically the last step before sending.