Incremental Parser Combinators In Javascript
Parsers are often run against data streams that are not fully available. Maybe you want to parse user input to provide realtime feedback and improve application responsiveness, or perhaps you need to parse web socket data incrementally to limit user interface blocking.
This post overviews the problem of incremental parsing in Javascript and details the implementation used by Parse to incrementally run monadic parser combinators.
Goals
- Functional style implementation.
- Input agnostic.
- Operate on unmodified Parse.js parser combinators (with a few exceptions).
- Correctly handle backtracking and end of file.
- Easy to use API.
Besides allowing parsers to be run incrementally, the implementation allows a few other interesting operations:
- Cache partially completed incremental parser states.
- Branch incremental parsers by feeding a state different inputs.
- Access the working output value of a incremental parser.
Links
- Parse - Base parser combinator library and complete incremental parsing implementation.
- Khepri - ECMAScript derived language used for implementation.
Problem Overview
Incremental parsing require solving two subproblems: creating resuming parsers and expressing incremental streams.
Incrementally Applying Parse Parsers
Behind its monadic parser combinator interface, Parse parsers are implemented with continuations. A parser takes four continuations, and calls one based on its completion condition: cok
when input is consumed and the parser succeeds, cerr
when input is consumed and the parser fails, eok
when no input is consumed and the parser succeeds, and cerr
when input is consumed and the parser fails.
// The always parser does not consume any input and
// completes successfully with a value.
always = \x ->
\state, m, cok, cerr, eok, eerr ->
eok(x, state, m);
During execution, each continuation contains the rest of the parser execution for its completion condition. By capturing a continuation and abortively returning it (not calling any continuation), parsers can easily be manually paused and resumed at set points during execution:
// Abortively complete with a value by
// not calling a continuation.
var abort = \x -> \state, m, cok, cerr, eok, eerr -> x;
/// Reify `eok` continuation and abort with it.
var pause = \state, m, cok, cerr, eok, eerr ->
\x -> eok(x, state, m);
var resumable = sequence(
character('a'),
pause,
character('b'));
var partial = parse.run(resumable, 'ab');
partial(); // ‘b’
We can even feed values into the paused parser and evaluate the continuation multiple times with different values:
var resumable = eager <| enumeration(
character(‘a’),
pause,
character(‘b’));
var partial = parse.run(resumable, ‘ab’);
partial('x'); // ['a', 'x', 'b']
partial('y'); // ['a', 'y', 'b']
partial(); // [‘a’, undefined, ‘b’]
The other continuations can be captured the same way, although eok
is the most useful.
Incremental Input
Parse.js parsers operates on Nu streams of input. Nu streams consist of a head value and a function that generates the rest of the stream, allowing lazily generated, potential infinite streams to be defined.
// Stream of [1, 2, 3]
var s =
stream(1, \() ->
stream(2, \() ->
stream(3, \() -> NIL)));
However, this definition requires every element of a valid Nu stream be accessible at construction. There is no way to define a Nu stream of something like user input because the stream cannot refer to values that do not yet exist (the keys a users will enter) when creating the initial stream. It is not clear what the head value should be, and updating the stream in response to user input would require mutation.
A good functional language can interface such streams with a parser (see the CC-delcont resumable parsing example), but I did not find any of these approaches suitable for Javascript. I wanted the incremental parsing API to be clear and easy to use with common Javascript programming patterns.
Incremental Streams
Incremental input streams consist of two parts: a stream of available data and some remainder that will contain the rest of the data in the input stream. A finished incremental stream has only available data.
The most simple incremental stream is a standard Nu stream. For a stream s
, the available data is s
and the remainder is the empty stream.
Incremental parsers run on the available section. In the simple stream case, this requires waiting until the entire input stream is available before any parsing takes place, which is how Parse normally operate.
Chunks
Instead of requiring the entire stream be available before starting parsing, we can break the stream it into chunks. As chunks become available, we shift them from remainder to the available stream and run the parser against chunks as they become available. (The idea for chunking the input was adapted from A Parsing Trifecta with influences from CC-delcont resumable parsing).
A chunk is a complete Nu stream of zero or more elements. Chunk size may vary and can be determined by data availability or desired behavior. The input stream consists of zero or more chunks, each uniquely identifiable by an id.
Given a chunk id, the incremental parsers must be able to determine the chunk of id of the next chunk in the sequence. A simple counter is used for this.
Backtracking
Backtracking presents one complication. Consider the parser:
var aa_or_ab = either(
attempt <| sequence(
character(‘a’),
character(‘a’)),
sequence(
character(‘a’),
character(‘b’)))
attempt
ensures that input like ’ab’
is handled correctly by saving and restoring the parser state. Without attempt
, input ’ab’
would fail since some input is consumed before the first option in either
fails.
If two chunks are feed to the parser:
// Begin parsing
var r = runInc(aa_or_ab);
// Provide chunk 0
var r2 = provideString(r, ‘a’);
// Provide chunk 1
var r3 = provideString(r2, ‘b’);
// End parsing and get result
finish(r3); // ‘b’
the parser backtracks when ’b’
is encountered. This restores a parser state at chunk 0. Instead of requesting another chunk 1 be provided, the parser should use the chunk 1 that was provided to complete the parsing.
Therefore, a chunk map must be stored external to the parser state to ensure the input stream is consistent even when backtracking. Otherwise, the incremental parser would require a new chunk 1 values for each backtracking.
Implementation Overview
A custom parser state for incremental parsing chunked input is defined. The incremental state tracks the id of the working chunk. When it runs out of data for a chunk, it reifies the current parser execution and abortively returns a request for the next chunk. An Session
data structure stores chunks. The provide
operation provides chunks for requests while finish
signals the end of file.
Structures
IncrementalState
- ParserState
used for incremental parsing. Wraps an internal ParserState
state.
Request
- Request for a chunk.
Session
- Partially applied parser state.
Core Operations
parseIncState
- Begin incremental parsing. Returns a Session
.
provide
- Feed a new chunk to a Session. Returns a Session
.
finish
- Signal that an incremental parser is complete. Return the result from parsing.
Code Implementation
The following code is adapted from Parse’s incremental implementation
Request and Session
The Request
and Session
data structures are straightforward:
var Request = function(chunk, k) {
this.chunk = chunk; // Requested chunk identifier
this.k = k; // Parsing continuation
};
var Session = function(done, k, chunks) {
this.done = done; // Is parsing complete?
this.k = k; // Parsing continuation
// Array mapping chunk identifiers to chunk data.
this.chunks = chunks;
};
Session.prototype.addChunk = \c ->
new Session(
this.done,
this.k,
this.chunks.concat(c));
Session.prototype.hasChunk = \id -> (id < this.chunks.length);
Session.prototype.getChunk = \id -> this.chunks[id];
IncrementalState
IncrementalState
wraps an internal parser state and forwards operations to it. It tracks the current chunk and generates requests for the next chunk in the input stream.
var IncrementalState = function(chunk, state) {
this.chunk = chunk; // Working chunk id
this.state = state; // Inner state
};
All the standard operations for input and position are forwarded to the inner state:
IncrementalState.prototype.eq = \other ->
(other && other.chunk === this.chunk &&
this.state.eq(other.state));
Object.defineProperties(IncrementalState.prototype, {
'input': { 'get': \() -> this.state.input },
'position': { 'get': \() -> this.state.position },
'userState': { get': \() -> this.state.userState }
});
IncrementalState.prototype.setInput = \input ->
new IncrementalState(
this.chunk,
this.state.setInput(input));
IncrementalState.prototype.setPosition = \position ->
new IncrementalState(
this.chunk,
this.state.setPosition(position));
IncrementalState.prototype.setUserState = \userState ->
new IncrementalState(
this.chunk,
this.state.setUserState(userState));
The inner state has no knowledge of chunks but operates on the working chunk’s data stream. Stream functions are also forwarded to the inner state:
IncrementalState.prototype.isEmpty = \() ->
this.state.isEmpty();
IncrementalState.prototype.first = \() -> this.state.first();
Chunks are handled in the next
function. next
takes a consumed token x
and returns a parser for the remainder of the input. IncrementalState.prototype.next
gets the inner state for the consumed token. When the inner state is empty (there is no more data for the current chunk), IncrementalState
requests the next chunk of data.
IncrementalState.prototype.next = \x -> {
var chunk = this.chunk;
// Get the next inner state from feeding the
// inner state the consumed token.
return bind(
next(this.state.next(x), getParserState),
\innerState -> {
// Check if the next inner state has any more data
// for the current chunk
if (innerState.isEmpty())
// Return a parser that abortively completes
// with a request for the next chunk.
// The request continuation takes the
// next chunk stream `i` and continues
// parsing with a new incremental parser
// state with inner state for input `i`.
return \_, m, cok ->
new Request(
chunk + 1,
\i -> cok(x, new IncrementalState(chunk + 1, innerState.setInput(i)), m))
// Otherwise, the working chunk was not empty.
// Continue parsing the rest of the data for
// the current chunk.
return \_, m, cok ->
cok(x, new IncrementalState(chunk, innerState), m)
});
};
One important observation is that requests are made when the next inner state is empty, not the current inner state. This allows the isEmpty
and first
methods of IncrementalState
to work correctly, as the current inner state always contains the stream of the current chunk. Only when the true eof is reached will the inner state contain an empty stream.
Operations
Provide
provide
passes data for a new chunk c
to Session r
. forceProvide
is the internal method that contains the provide logic. forceProvide
adds all chunks fed to it, even empty ones that provide
ignores.
var forceProvide = \r, c -> {
// Feeding data to a complete session is a noop.
if (r.done) return r;
// Add the chunk and get a new session.
var r2 = r.addChunk(c);
// Fulfill the initial request.
// trampoline handles the internal tail call implementation.
var result = r2.k(c) |> trampoline;
// While requests are returned, and we have chunks
// for these requests, fulfill these requests as well.
while (result instanceof Request && r2.hasChunk(result.chunk))
result = result.k(r2.getChunk(result.chunk)) |> trampoline;
return (result instanceof Request ?
new Session(false, result.k, r2.chunks) :
result);
};
provide = \r, c ->
(isEmpty(c) ?
r :
forceProvide(r, c));
Type checks are required to discriminate between a regular abortively returned value and an abortively returned request.
provide
always parses as much as possible before returning. Backtracking may restore states for prior chunks, so multiple requests for existing chunks may be made before a new chunk is requested. The while loop handles these requests.
A provideString
helper function is also useful for providing strings directly:
provideString = \r, input -> provide(r, streamFrom(input));
finish
finish
signals that no more input is coming. It finishes the current parser and runs the outermost continuations, returning the result.
finish = let
complete = \r -> r.k()
in
\r -> complete(forceProvide(r, NIL));
In order to handle backtracking with EOF correctly, finish must register an empty chunk for the end of the file. This is why forceProvide
is required.
Top level continuations are evaluated when finish
is called, not when the parser completes. If they do something non functional-style, such as throwing an exception, this prevents this behavior from occurring until finish
is called.
Starting Parsing
parseIncState
and parseInc
begin a new incremental parsing of a parser p
and return a new Session. parseInc
supplies a default inner state while parseIncState
allows a custom one to be provided.
parseIncState = \p, state, ok, err -> let
suc = \x, s -> new Session(true, (ok, x, s)),
fail = \x, s -> new Session(true, (err, x, s)),
k = \i -> parseState(
p,
new IncrementalState(0, state.setInput(i)),
suc,
fail)
in
provide(
new Session(false, k, []),
state.input);
parseInc = \p, ud, ok, err ->
parseIncState(
p,
new ParserState(NIL, Position.initial, ud),
ok,
err);
suc
and fail
are the outermost callbacks, suc
for success and fail
for error. Both return complete Sessions. The continuation of these Sessions is the actual ok
or err
callback. This ensures the ok
and err
are only evaluated when finish
is called and not when the parser completes.
The continuation of the returned session, k
, begins parsing when the first chunk stream i
is provided. The initial IncrementalState
is created by setting the input on the initial inner state.
The outer provide
in parseIncState
handles cases where the initial state has some input. It is a noop for cases like parseInc
where the input of the ParserState
is NIL
.
Examples
Simple Incremental Parser
var r = runInc(string('abc'));
// returns 'abc'
finish(provideString(r, 'abc'));
// returns 'abc'
finish(
provideString(
provideString(r, 'ab'),
'c'));
// throws Expected 'c' found 'x'
finish(
provideString(
provideString(r, 'ab'),
'x'));
// throws Expected 'c' found end of input
finish(provideString(r, 'ab'));
EOF
var r = runInc(
then(string('abc'), eof)));
// returns 'abc'
finish(provideString(r, 'abc'));
// throws Expected eof found 'd'
finish(
provideString(
provideString(r, 'ab'),
'cd'));
Inspecting Working Values
Incremental parsers can be finished multiple times. This is useful for providing feedback during parsing.
var r = runInc(
eager <| many(character('a')));
var result;
do {
r = provideString(r, 'a');
result = finish(r);
} while (result.length < 5);
result; // ['a', 'a', 'a', 'a', 'a']
External examples
- Parse-PN - Example incremental polish notation evaluator.
- Parse ECMA Incremental - Incremental ECMAScript lexer.
Limitations
The major limitation of this approach is that a custom parser state is used. Most application should never touch the parser state directly, but certain cases do require custom parser states.
parse.modifyParserState, parse.getParserState, and parse.setParserState
All of these operate on the IncrementalState
instead of the real, inner parser state. If the state is only interacted with using the standard API, this is not a problem.
Custom Parser States
Custom parser states with custom logic and properties will not work properly. Only the standard properties and operations are forwarded to the inner state.
parse.getInput and parse.setInput
parse.getInput
returns the input for just the current chunk. setInput
will only set the input for the current chunk and does not effect the value of the chunk stored in a Session
.
Abortive Parsers
Parse does not include any abortive parsers, but a custom one may be defined. Abortive results from such a parser are returned directly from provide
instead of a Session
. I felt this was the correct behavior.
Closing Thoughts
Even with the noted drawbacks, this approach works well and supports almost all unmodified Parse parsers. The solution was designed for Javascript, but could be adapted to other languages, although more appropriate solutions may exist for your language of choice.