Scanner, parser & AST
Before any bytecode or machine code exists, V8 must turn source text into a structured form. That is the job of the scanner (tokenizer), the parser (builds the AST), and, critically for startup performance, the PreParser (checks function bodies you may never run without building an AST for them). This frontend is where V8's startup-latency engineering lives.
::: info Ubiquitous language
Scanner: tokenizer/lexer. AST: abstract syntax tree. PreParser: a parser that validates syntax/scopes but builds no tree. Lazy parsing: deferring the full parse of a function until it is first called. Zone: a region allocator.
:::
Scanner: streaming UTF-16
The scanner consumes a UTF-16 character stream with cheap single-character lookahead:
inline base::uc32 Peek() {
  if (V8_LIKELY(buffer_cursor_ < buffer_end_)) {
    return static_cast<base::uc32>(*buffer_cursor_);
  } else if (ReadBlockChecked(pos())) {
    return static_cast<base::uc32>(*buffer_cursor_);
  } else {
    return kEndOfInput;
  }
}
— src/parsing/scanner.h#L40-L72
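To make the fast-path/slow-path split in Peek() concrete, here is a minimal standalone sketch of a chunk-buffered stream. It is not V8 code: the class name ChunkedStream, the tiny kChunkSize, and the use of std::u16string as the backing source are assumptions chosen for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>

// Hypothetical, simplified stand-in for a buffered UTF-16 character stream.
// All names here are illustrative; none of them are V8's.
class ChunkedStream {
 public:
  static constexpr int32_t kEndOfInput = -1;

  explicit ChunkedStream(std::u16string source) : source_(std::move(source)) {}

  // Fast path: one bounds check and a load. Slow path: refill the buffer.
  int32_t Peek() {
    if (cursor_ < end_) return static_cast<int32_t>(buffer_[cursor_]);
    if (ReadBlock()) return static_cast<int32_t>(buffer_[cursor_]);
    return kEndOfInput;
  }

  // Returns the current code unit and moves past it.
  int32_t Advance() {
    int32_t c = Peek();
    if (c != kEndOfInput) ++cursor_;
    return c;
  }

  // Absolute position in the source, independent of buffer refills.
  size_t pos() const { return buffer_start_pos_ + cursor_; }

 private:
  // Deliberately tiny so that even short inputs exercise the refill path.
  static constexpr size_t kChunkSize = 4;

  // Copies the next chunk of the source into the buffer.
  // Returns false, leaving the state untouched, when the source is exhausted.
  bool ReadBlock() {
    assert(cursor_ == end_);  // only called once the buffer is used up
    size_t start = pos();
    if (start >= source_.size()) return false;
    size_t n = source_.size() - start;
    if (n > kChunkSize) n = kChunkSize;
    for (size_t i = 0; i < n; ++i) buffer_[i] = source_[start + i];
    buffer_start_pos_ = start;
    cursor_ = 0;
    end_ = n;
    return true;
  }

  std::u16string source_;
  char16_t buffer_[kChunkSize] = {};
  size_t buffer_start_pos_ = 0;
  size_t cursor_ = 0;
  size_t end_ = 0;
};
```

Calling Peek() repeatedly returns the same code unit; only Advance() moves the cursor, which is what makes single-character lookahead cheap.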
The stream is specialized by string representation: external one-byte strings get a buffered stream, external two-byte strings stream directly, and on-heap strings that may move during GC get a relocating stream.
— src/parsing/scanner-character-streams.cc#L874-L912
Picking the right stream per source type avoids needless copying — relevant because scanning runs over all loaded script, including code that never executes.
Lazy by default: the PreParser
This is the key performance idea of the frontend. A web page or Node app loads far more JavaScript than it runs on any given path. Fully parsing every function — building an AST for each — would waste time and memory on functions that may never be called. So V8 is lazy by default: when it meets a function it is not about to run, it pre-parses the body instead of parsing it.
The PreParser builds no tree — just enough to validate syntax and record scope structure:
// Whereas the Parser generates AST during the recursive descent,
// the PreParser doesn't create a tree. Instead, it passes around minimal
// data objects (PreParserExpression, PreParserIdentifier etc.) which contain
// just enough data for the upper layer functions.
— src/parsing/preparser.h#L19-L25
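The idea of walking a function body without building a tree can be sketched in a few lines. The following is a deliberately crude illustration, not the PreParser's algorithm: it only balances braces and counts nested `function` keywords, and it ignores strings, comments, regular expressions, and template literals. SkippedBody and SkipFunctionBody are hypothetical names, and the code assumes C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical summary of a skipped function body (not V8's PreparseData).
struct SkippedBody {
  size_t end_pos;       // index one past the closing '}'
  int inner_functions;  // nested `function` keywords seen along the way
};

// Walks the body that starts at `open_brace` without allocating any nodes.
// Returns std::nullopt when braces are unbalanced, standing in for a syntax
// error. Simplification: any occurrence of the letters "function" is counted,
// even inside a longer identifier.
std::optional<SkippedBody> SkipFunctionBody(std::string_view src,
                                            size_t open_brace) {
  assert(open_brace < src.size() && src[open_brace] == '{');
  constexpr std::string_view kFunction = "function";
  int depth = 0;
  int inner = 0;
  for (size_t i = open_brace; i < src.size(); ++i) {
    if (src[i] == '{') {
      ++depth;
    } else if (src[i] == '}') {
      if (--depth == 0) return SkippedBody{i + 1, inner};
    } else if (src.compare(i, kFunction.size(), kFunction) == 0) {
      ++inner;
      i += kFunction.size() - 1;  // skip the rest of the keyword
    }
  }
  return std::nullopt;  // ran out of input before the body closed
}
```

The result is a small value type rather than a tree, which mirrors the "minimal data objects" described in the comment above; the real PreParser additionally validates full syntax and tracks scopes.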
The parser switches between modes:
bool parse_lazily() const { return mode_ == PARSE_LAZILY; }
enum Mode { PARSE_LAZILY, PARSE_EAGERLY };
— src/parsing/parser.h#L196-L197
and decides per function whether it can be pre-parsed:
bool can_preparse = impl()->parse_lazily() &&
                    eager_compile_hint == FunctionLiteral::kShouldLazyCompile;
bool is_lazy_top_level_function =
    can_preparse && impl()->AllowsLazyParsingWithoutUnresolvedVariables();
— src/parsing/parser-base.h#L5143-L5148
The pre-parse output (scope info, inner-function positions) is stored compactly as
PreparseData and attached to the function's
SharedFunctionInfo. When the function is
finally called, the full parse runs — and the recorded data lets it skip
re-discovering scopes.
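The "record a summary now, do the full parse on first call" flow described above can be sketched as follows. This is a toy model under stated assumptions: PreparseSummary, CompiledCode, and LazyFunction are hypothetical types, and "compiling" just copies the body text. It shows only the control flow, not how V8 stores or consumes PreparseData.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical stand-ins; none of these are V8 types.
struct PreparseSummary {
  size_t body_start;  // index of the opening '{'
  size_t body_end;    // index one past the closing '}'
};

struct CompiledCode {
  std::string body;  // placeholder for real bytecode
};

class LazyFunction {
 public:
  // `source` must outlive this object.
  LazyFunction(const std::string& source, PreparseSummary summary)
      : source_(source), summary_(summary) {
    assert(summary.body_start <= summary.body_end);
    assert(summary.body_end <= source.size());
  }

  bool is_compiled() const { return code_ != nullptr; }
  int full_parses() const { return full_parses_; }

  // The first call pays for the full parse; later calls reuse the result.
  const CompiledCode& Call() {
    if (!code_) {
      // The recorded summary says exactly which slice of the source to
      // process, so nothing has to be rediscovered here.
      code_ = std::make_unique<CompiledCode>(CompiledCode{source_.substr(
          summary_.body_start, summary_.body_end - summary_.body_start)});
      ++full_parses_;
    }
    return *code_;
  }

 private:
  const std::string& source_;
  PreparseSummary summary_;
  std::unique_ptr<CompiledCode> code_;
  int full_parses_ = 0;
};
```

A function that is never called never reaches the expensive branch, which is the whole point of being lazy by default.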
::: tip The (function(){…})() trick, explained
You may have heard that wrapping a function in parentheses and calling it
immediately makes V8 compile it eagerly. That's this heuristic: the parser treats
a function expression that follows an opening parenthesis as likely to be called
right away and gives it an eager compile hint, so the can_preparse check above
fails and the pre-parse → full-parse round trip is skipped. It is a real,
source-level behavior, not folklore, though its practical impact is small and
version-dependent.
:::
The AST and the Zone
The parser produces an AST whose nodes are allocated in a Zone, not the GC heap:
// Nodes are allocated in a separate zone, which allows faster
// allocation and constant-time deallocation of the entire syntax tree.
Every node is a ZoneObject with a compact type field:
class AstNode : public ZoneObject {
 public:
  enum NodeType : uint8_t { /* generated from AST_NODE_LIST */ };
  NodeType node_type() const { return NodeTypeField::decode(bit_field_); }
};
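As a rough illustration of why a compact type tag is useful, the sketch below stores a node's type in a few bits of a shared bit field and dispatches on it with a switch instead of virtual calls. The BitField helper, the two node kinds, and Evaluate are assumptions made for this example; V8's actual node list and bit-field utilities are more extensive.

```cpp
#include <cassert>
#include <cstdint>

// Minimal bit-field helper, loosely modeled on the encode/decode idiom above.
template <typename T, int kShift, int kSize>
struct BitField {
  static constexpr uint32_t kMask = ((uint32_t{1} << kSize) - 1) << kShift;
  static constexpr uint32_t encode(T value) {
    return static_cast<uint32_t>(value) << kShift;
  }
  static constexpr T decode(uint32_t bits) {
    return static_cast<T>((bits & kMask) >> kShift);
  }
};

// Hypothetical two-node AST; real ASTs have dozens of node types.
class Node {
 public:
  enum NodeType : uint8_t { kLiteral, kBinaryOperation };
  NodeType node_type() const { return TypeField::decode(bit_field_); }

 protected:
  using TypeField = BitField<NodeType, 0, 6>;
  explicit Node(NodeType type) : bit_field_(TypeField::encode(type)) {}
  uint32_t bit_field_;  // remaining bits are free for per-node flags
};

class Literal : public Node {
 public:
  explicit Literal(double value) : Node(kLiteral), value_(value) {}
  double value() const { return value_; }

 private:
  double value_;
};

class BinaryOperation : public Node {
 public:
  BinaryOperation(char op, const Node* left, const Node* right)
      : Node(kBinaryOperation), op_(op), left_(left), right_(right) {}
  char op() const { return op_; }
  const Node* left() const { return left_; }
  const Node* right() const { return right_; }

 private:
  char op_;  // '+' means addition; anything else is treated as '*'
  const Node* left_;
  const Node* right_;
};

// Dispatches on the stored tag: no vtable and no RTTI are needed.
double Evaluate(const Node* node) {
  switch (node->node_type()) {
    case Node::kLiteral:
      return static_cast<const Literal*>(node)->value();
    case Node::kBinaryOperation: {
      const auto* binop = static_cast<const BinaryOperation*>(node);
      double lhs = Evaluate(binop->left());
      double rhs = Evaluate(binop->right());
      return binop->op() == '+' ? lhs + rhs : lhs * rhs;
    }
  }
  assert(false && "unknown node type");
  return 0;
}
```

Because the nodes have no virtual functions and no destructors that need to run, they are also a natural fit for bulk allocation and bulk release.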
Zones
A Zone is a bump-pointer region allocator. You allocate fast and free everything at once by destroying the zone — perfect for the AST, which is thrown away after bytecode generation:
template <typename TypeTag>
void* Allocate(size_t size) {
  size = RoundUp(size, kAlignmentInBytes);
  if (V8_UNLIKELY(size > limit_ - position_)) Expand(size);
  void* result = reinterpret_cast<void*>(position_);
  position_ += size;  // bump
  return result;
}
No per-node free, no GC tracing of compiler scratch data, and segments are sized
(8–32 KB) for cache-friendly reuse via the
AccountingAllocator.
Zones are used pervasively — the parser, Maglev, and
TurboFan all build their IRs in zones. If you write a
compiler, this pattern (arena/region allocation for short-lived graphs) is one of
the highest-value techniques you can adopt.
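For readers who want to try the pattern, here is a minimal standalone arena in the same spirit. It is a sketch, not V8's Zone: the Arena class, the fixed 8 KB segment size, the 8-byte alignment, and the use of malloc/free in place of an accounting allocator are all assumptions made to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical minimal arena; V8's Zone is considerably more elaborate.
class Arena {
 public:
  Arena() = default;
  Arena(const Arena&) = delete;
  Arena& operator=(const Arena&) = delete;

  // Everything is released at once; there is no per-object free.
  ~Arena() {
    for (void* segment : segments_) std::free(segment);
  }

  void* Allocate(size_t size) {
    assert(size > 0);
    size = RoundUp(size);
    if (size > static_cast<size_t>(limit_ - position_)) Expand(size);
    void* result = reinterpret_cast<void*>(position_);
    position_ += size;  // bump
    return result;
  }

  // Constructs a T inside the arena. Destructors never run, so T must be
  // trivially destructible.
  template <typename T, typename... Args>
  T* New(Args&&... args) {
    static_assert(std::is_trivially_destructible<T>::value,
                  "arena objects are never destroyed individually");
    static_assert(alignof(T) <= kAlignment, "over-aligned type");
    return new (Allocate(sizeof(T))) T{std::forward<Args>(args)...};
  }

  size_t segment_count() const { return segments_.size(); }

 private:
  static constexpr size_t kAlignment = 8;
  static constexpr size_t kSegmentSize = 8 * 1024;

  static size_t RoundUp(size_t n) {
    return (n + kAlignment - 1) & ~(kAlignment - 1);
  }

  // Starts a fresh segment large enough for `size` bytes. The unused tail of
  // the previous segment is simply abandoned.
  void Expand(size_t size) {
    size_t bytes = size > kSegmentSize ? size : kSegmentSize;
    void* segment = std::malloc(bytes);
    if (segment == nullptr) throw std::bad_alloc();
    segments_.push_back(segment);
    position_ = reinterpret_cast<uintptr_t>(segment);
    limit_ = position_ + bytes;
  }

  uintptr_t position_ = 0;
  uintptr_t limit_ = 0;
  std::vector<void*> segments_;
};
```

Allocation is a round-up, a comparison, and an addition in the common case; teardown cost depends on the number of segments, not on the number of objects allocated.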
See also
Bytecode generation — what consumes the AST.
Ignition — what runs the resulting bytecode.