V8pedia

Scanner, parser & AST

Before any bytecode or machine code exists, V8 must turn source text into a structured form. That is the job of the scanner (tokenizer), the parser (builds the AST), and — critically for startup performance — the PreParser (skips function bodies you may never run). This frontend is where V8's startup-latency engineering lives.

::: info Ubiquitous language
- Scanner: the tokenizer (lexer).
- AST: abstract syntax tree.
- PreParser: a parser that validates syntax and scopes but builds no tree.
- Lazy parsing: deferring the full parse of a function until it is first called.
- Zone: a region allocator.
:::

Scanner: streaming UTF-16

The scanner consumes a UTF-16 character stream with cheap single-character lookahead:

inline base::uc32 Peek() {
  if (V8_LIKELY(buffer_cursor_ < buffer_end_)) {
    return static_cast<base::uc32>(*buffer_cursor_);
  } else if (ReadBlockChecked(pos())) {
    return static_cast<base::uc32>(*buffer_cursor_);
  } else { return kEndOfInput; }
}

src/parsing/scanner.h#L40-L72
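To make the fast-path/slow-path split concrete, here is a minimal sketch of a block-buffered stream with the same Peek shape. ToyCharStream, its four-unit buffer, and the Advance and ReadBlock helpers are hypothetical names for illustration and are not V8's classes; the sketch assumes the whole source is already in memory as a UTF-16 string.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>

// Hypothetical toy stream (not V8's implementation). It refills a small
// fixed-size buffer from a backing string whenever the cursor runs off the end.
class ToyCharStream {
 public:
  static constexpr int32_t kEndOfInput = -1;

  explicit ToyCharStream(std::u16string source) : source_(std::move(source)) {}

  // Look at the next code unit without consuming it.
  int32_t Peek() {
    if (cursor_ < end_) return buffer_[cursor_];  // fast path: data is buffered
    if (ReadBlock()) return buffer_[cursor_];     // slow path: refill, then read
    return kEndOfInput;
  }

  // Consume and return the next code unit.
  int32_t Advance() {
    int32_t c = Peek();
    if (c != kEndOfInput) ++cursor_;
    return c;
  }

  // Absolute index in the source of the next unread code unit.
  size_t pos() const { return buffer_pos_ + cursor_; }

 private:
  static constexpr size_t kBufferSize = 4;  // tiny on purpose, to force refills

  // Copy the next block of the source into the buffer. Returns false at end.
  bool ReadBlock() {
    size_t position = pos();
    if (position >= source_.size()) return false;
    size_t n = source_.size() - position;
    if (n > kBufferSize) n = kBufferSize;
    for (size_t i = 0; i < n; ++i) buffer_[i] = source_[position + i];
    buffer_pos_ = position;
    cursor_ = 0;
    end_ = n;
    return true;
  }

  std::u16string source_;
  char16_t buffer_[kBufferSize] = {};
  size_t buffer_pos_ = 0;  // source index of buffer_[0]
  size_t cursor_ = 0;      // next unread slot in buffer_
  size_t end_ = 0;         // number of valid slots in buffer_
};
```

Calling Peek() repeatedly returns the same code unit; only Advance() moves the cursor, and the refill cost is paid once per block rather than once per character.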

The stream is specialized by string representation: one-byte strings go through a buffered stream that widens each block to UTF-16, external two-byte strings are read in place with no extra buffering, and on-heap two-byte strings, which the GC may move, get a relocating stream:

src/parsing/scanner-character-streams.cc#L874-L912

Picking the right stream per source type avoids needless copying — relevant because scanning runs over all loaded script, including code that never executes.

Lazy by default: the PreParser

This is the key performance idea of the frontend. A web page or Node app loads far more JavaScript than it runs on any given path. Fully parsing every function — building an AST for each — would waste time and memory on functions that may never be called. So V8 is lazy by default: when it meets a function it is not about to run, it pre-parses the body instead of parsing it.

The PreParser builds no tree. It tracks just enough to validate syntax and record scope structure:

// Whereas the Parser generates AST during the recursive descent,
// the PreParser doesn't create a tree. Instead, it passes around minimal
// data objects (PreParserExpression, PreParserIdentifier etc.) which contain
// just enough data for the upper layer functions.

src/parsing/preparser.h#L19-L25

The parser switches between modes:

bool parse_lazily() const { return mode_ == PARSE_LAZILY; }
enum Mode { PARSE_LAZILY, PARSE_EAGERLY };

src/parsing/parser.h#L196-L197

and decides per function whether it can be pre-parsed:

bool can_preparse = impl()->parse_lazily() &&
                    eager_compile_hint == FunctionLiteral::kShouldLazyCompile;
bool is_lazy_top_level_function =
    can_preparse && impl()->AllowsLazyParsingWithoutUnresolvedVariables();

src/parsing/parser-base.h#L5143-L5148

The pre-parse output (scope info, inner-function positions) is stored compactly as PreparseData and attached to the function's SharedFunctionInfo. When the function is finally called, the full parse of its body runs, and the recorded data lets the parser skip over inner functions instead of pre-parsing them again to rediscover their scope information.
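The idea of "skip the body, remember where it was" can be sketched in a few lines. This is a deliberately crude illustration, not how the PreParser works: the real one performs a full syntax check and scope analysis, whereas this toy only matches braces and ignores strings, comments, and regex literals. SkippedFunction and ToyPreParse are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record of what the toy pre-parse learned about one function.
struct SkippedFunction {
  size_t body_start;  // index of the opening '{'
  size_t body_end;    // index of the matching '}'
};

// Toy "pre-parse": finds each outermost `function` keyword, skips its body by
// brace matching, and records only the extent of that body. No tree is built.
inline std::vector<SkippedFunction> ToyPreParse(const std::string& src) {
  std::vector<SkippedFunction> skipped;
  const std::string keyword = "function";
  size_t i = 0;
  while ((i = src.find(keyword, i)) != std::string::npos) {
    size_t open = src.find('{', i);
    if (open == std::string::npos) break;
    int depth = 0;
    size_t j = open;
    for (; j < src.size(); ++j) {
      if (src[j] == '{') ++depth;
      if (src[j] == '}' && --depth == 0) break;
    }
    if (j == src.size()) break;  // unbalanced braces: give up
    skipped.push_back({open, j});
    i = j + 1;  // resume after the body, so inner functions are not visited
  }
  return skipped;
}
```

A later full parse could use the recorded extents to jump straight to a body when the function is first called, which is the role the recorded positions play in the description above.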

::: tip The (function(){…})() trick, explained
You may have heard that wrapping a function in parentheses and calling it immediately makes V8 compile it eagerly. That's this heuristic: the parser treats a parenthesized function expression as "about to run" and gives it an eager-compile hint, so can_preparse above is false and the pre-parse → full-parse round trip is skipped. It is a real, source-level behavior, not folklore, though its practical impact is small and version-dependent.
:::
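As a rough illustration of how cheap such a heuristic can be, the sketch below guesses a compile hint from the character in front of a function keyword. ToyCompileHint and ToyHintFor are hypothetical, and the check is much cruder than a real parser's: it works on raw text and cannot tell grouping parentheses from call-argument parentheses.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical hint type for this sketch only.
enum class ToyCompileHint { kShouldLazyCompile, kShouldEagerCompile };

// `function_pos` is the index of a `function` keyword in `src`. If the nearest
// preceding non-whitespace character is '(', guess that the function is about
// to be called and ask for an eager compile. Assumption: a text-level check is
// good enough for illustration; a real parser decides this on tokens.
inline ToyCompileHint ToyHintFor(const std::string& src, size_t function_pos) {
  size_t i = function_pos;
  while (i > 0 && std::isspace(static_cast<unsigned char>(src[i - 1]))) --i;
  bool parenthesized = i > 0 && src[i - 1] == '(';
  return parenthesized ? ToyCompileHint::kShouldEagerCompile
                       : ToyCompileHint::kShouldLazyCompile;
}
```

The decision is made before the body is read, which is why the hint can steer the parser away from the pre-parse path at no extra cost.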

The AST and the Zone

The parser produces an AST whose nodes are allocated in a Zone, not the GC heap:

// Nodes are allocated in a separate zone, which allows faster
// allocation and constant-time deallocation of the entire syntax tree.

src/ast/ast.h#L32-L39

Every node is a ZoneObject with a compact type field:

class AstNode: public ZoneObject {
  enum NodeType : uint8_t { /* generated from AST_NODE_LIST */ };
  NodeType node_type() const { return NodeTypeField::decode(bit_field_); }
};

src/ast/ast.h#L145-L152
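The compact type field works by packing the node type and other small flags into one integer. The following is a simplified sketch of that pattern under stated assumptions: ToyBitField and ToyNode are hypothetical stand-ins, and the field widths (6 bits for the type, 1 bit for a flag) are chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified bit-field helper: stores a value of type T in
// `kSize` bits starting at bit `kShift` of a uint32_t.
template <typename T, int kShift, int kSize>
struct ToyBitField {
  static_assert(kShift + kSize < 32, "field must fit below bit 32");
  static constexpr uint32_t kMask = ((uint32_t{1} << kSize) - 1) << kShift;

  static constexpr uint32_t encode(T value) {
    return (static_cast<uint32_t>(value) << kShift) & kMask;
  }
  static constexpr T decode(uint32_t bits) {
    return static_cast<T>((bits & kMask) >> kShift);
  }
  static constexpr uint32_t update(uint32_t bits, T value) {
    return (bits & ~kMask) | encode(value);
  }
};

// Hypothetical node: a source position plus one packed word of metadata.
class ToyNode {
 public:
  enum NodeType : uint8_t { kLiteral, kBinaryOperation, kCall };

  ToyNode(NodeType type, int position)
      : position_(position), bit_field_(NodeTypeField::encode(type)) {}

  NodeType node_type() const { return NodeTypeField::decode(bit_field_); }
  int position() const { return position_; }

  // An extra flag stored in the bit directly above the type tag.
  bool is_parenthesized() const {
    return ParenthesizedField::decode(bit_field_);
  }
  void mark_parenthesized() {
    bit_field_ = ParenthesizedField::update(bit_field_, true);
  }

 private:
  using NodeTypeField = ToyBitField<NodeType, 0, 6>;
  using ParenthesizedField = ToyBitField<bool, 6, 1>;

  int position_;
  uint32_t bit_field_;
};
```

Because the tag lives in a plain integer, checking a node's type is a mask and a shift with no virtual call, and extra flags share the same word without growing the node.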

Zones

A Zone is a bump-pointer region allocator. You allocate fast and free everything at once by destroying the zone — perfect for the AST, which is thrown away after bytecode generation:

template <typename TypeTag>
void* Allocate(size_t size) {
  size = RoundUp(size, kAlignmentInBytes);
  if (V8_UNLIKELY(size > limit_ - position_)) Expand(size);
  void* result = reinterpret_cast<void*>(position_);
  position_ += size;           // bump
  return result;
}

src/zone/zone.h#L54-L77

There is no per-node free and no GC tracing of compiler scratch data. Segments are sized between 8 KB and 32 KB and are obtained through the AccountingAllocator. Zones are used pervasively: the parser, Maglev, and TurboFan all build their data structures in zones. If you write a compiler, this pattern (arena/region allocation for short-lived graphs) is one of the highest-value techniques you can adopt.
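For readers who want to try the pattern, here is a minimal arena sketch built on malloc. ToyZone, its fixed 8 KB segment size, and the New helper are assumptions made for this example and do not mirror V8's Zone API beyond the bump-pointer idea; objects are restricted to trivially destructible types because the arena never runs destructors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical bump-pointer arena. Everything is freed at once in ~ToyZone().
class ToyZone {
 public:
  ToyZone() = default;
  ToyZone(const ToyZone&) = delete;
  ToyZone& operator=(const ToyZone&) = delete;
  ~ToyZone() {
    for (void* segment : segments_) std::free(segment);  // one pass, no per-object free
  }

  // Returns `size` bytes aligned to kAlignment.
  void* Allocate(size_t size) {
    size = RoundUp(size);
    if (size > limit_ - position_) Expand(size);
    void* result = reinterpret_cast<void*>(position_);
    position_ += size;  // bump
    return result;
  }

  // Constructs a T inside the arena. Destructors are never run.
  template <typename T, typename... Args>
  T* New(Args&&... args) {
    static_assert(std::is_trivially_destructible<T>::value,
                  "the arena never calls destructors");
    static_assert(alignof(T) <= kAlignment, "over-aligned type");
    return new (Allocate(sizeof(T))) T(std::forward<Args>(args)...);
  }

  size_t segment_count() const { return segments_.size(); }

 private:
  static constexpr size_t kAlignment = 8;
  static constexpr size_t kSegmentSize = 8 * 1024;  // assumed fixed size

  static size_t RoundUp(size_t n) {
    return (n + kAlignment - 1) & ~(kAlignment - 1);
  }

  // Starts a new segment large enough for `size`; the tail of the previous
  // segment is simply abandoned.
  void Expand(size_t size) {
    size_t bytes = size > kSegmentSize ? size : kSegmentSize;
    void* segment = std::malloc(bytes);
    if (segment == nullptr) throw std::bad_alloc();
    segments_.push_back(segment);
    position_ = reinterpret_cast<uintptr_t>(segment);
    limit_ = position_ + bytes;
  }

  std::vector<void*> segments_;
  uintptr_t position_ = 0;
  uintptr_t limit_ = 0;
};
```

A typical use is to create one ToyZone per compilation job, allocate every node with zone.New<T>(...), and let the zone go out of scope when the job finishes. The trade-off is that nothing can be freed individually, which is acceptable only when all objects share a lifetime.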

See also