Documentation

Tokenizer
in package Application

The HTML5 tokenizer.

The tokenizer's role is reading data from the scanner and gathering it into semantic units. From the tokenizer, data is emitted to an event handler, which may (for example) create a DOM tree.

The HTML5 specification has a detailed explanation of tokenizing HTML5. We follow that specification to the maximum extent that we can. If you find a discrepancy that is not documented, please file a bug and/or submit a patch.

This tokenizer is implemented as a recursive descent parser.

Within the API documentation, you may see references to the specific section of the HTML5 spec that the code attempts to reproduce. Example: 8.2.4.1. This refers to section 8.2.4.1 of the HTML5 CR specification.

Constants

CONFORMANT_HTML = 'html'
CONFORMANT_XML = 'xml'

Properties

$carryOn : mixed
$events : mixed
$mode : mixed
$scanner : mixed
$text : mixed: Buffer for text.
$textMode : mixed
$tok : mixed
$untilTag : mixed

Methods

__construct() : mixed: Create a new tokenizer.
parse() : mixed: Begin parsing.
setTextMode() : mixed: Set the text mode for the character data reader.
attribute() : bool: Parse attributes from inside of a tag.
attributeValue() : string|null: Consume an attribute value. See section 8.2.4.37 and after.
bogusComment() : bool: Consume malformed markup as if it were a comment.
buffer() : mixed: Add text to the temporary buffer.
cdataSection() : bool: Handle a CDATA section.
characterData() : mixed: Parse anything that looks like character data.
comment() : bool: Read a comment.
consumeData() : mixed: Consume a character and make a move.
decodeCharacterReference() : string: Decode a character reference and return the string.
doctype() : bool: Parse a DOCTYPE.
endTag() : mixed: Consume an end tag. See section 8.2.4.9.
eof() : mixed: If the document has been fully read, emit an EOF event.
flushBuffer() : mixed: Send a TEXT event with the contents of the text buffer.
is_alpha() : bool: Checks whether a (single-byte) character is an ASCII letter or not.
isCommentEnd() : bool: Check if the scanner has reached the end of a comment.
isTagEnd() : mixed: Check if the scanner has reached the end of a tag.
markupDeclaration() : mixed: Look for markup.
parseError() : string: Emit a parse error.
processingInstruction() : bool: Handle a processing instruction.
quotedAttributeValue() : string: Get an attribute value string.
quotedString() : mixed: Utility for reading a quoted string.
rawText() : bool: Read text in RAW mode.
rcdata() : bool: Read text in RCDATA mode.
readUntilSequence() : string: Read from the input stream until we get to the desired sequence or hit the end of the input stream.
sequenceMatches() : bool: Check if upcoming chars match the given sequence.
tagName() : mixed: Consume a tag name and body. See section 8.2.4.10.
text() : bool: This buffers the current token as character data.
unquotedAttributeValue() : mixed

CONFORMANT_HTML

    public mixed CONFORMANT_HTML = 'html'

CONFORMANT_XML

    public mixed CONFORMANT_XML = 'xml'

$carryOn

    protected mixed $carryOn = true

$events

    protected mixed $events

$mode

    protected mixed $mode = self::CONFORMANT_HTML

$scanner

    protected mixed $scanner

$text

Buffer for text.

    protected mixed $text = ''

$textMode

    protected mixed $textMode = 0

$tok

    protected mixed $tok

$untilTag

    protected mixed $untilTag = null

__construct()

Create a new tokenizer.


    public
                    __construct(Scanner $scanner, EventHandler $eventHandler[, string $mode = self::CONFORMANT_HTML ]) : mixed

Typically, parsing a document involves creating a new tokenizer, giving it a scanner (input) and an event handler (output), and then calling the Tokenizer::parse() method.

Parameters

$scanner : Scanner: A scanner initialized with an input stream.
$eventHandler : EventHandler: An event handler, initialized and ready to receive events.
$mode : string = self::CONFORMANT_HTML
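
The workflow described for the constructor (input in, events out, then parse()) can be sketched with small stand-in classes. SketchScanner, SketchEvents, and SketchTokenizer are hypothetical names invented for this illustration; they are not the package's Scanner, EventHandler, or Tokenizer, and the stand-in tokenizer only emits plain text followed by EOF.

```php
<?php
// Hypothetical stand-ins for illustration only; these are NOT the
// package's Scanner, EventHandler, or Tokenizer classes.

final class SketchScanner
{
    private $data;
    private $pos = 0;

    public function __construct(string $data)
    {
        $this->data = $data;
    }

    // Return the next character, or null once the input is exhausted.
    public function next(): ?string
    {
        return $this->pos < strlen($this->data) ? $this->data[$this->pos++] : null;
    }
}

final class SketchEvents
{
    public $log = [];

    public function text(string $data): void
    {
        $this->log[] = ['text', $data];
    }

    public function eof(): void
    {
        $this->log[] = ['eof'];
    }
}

final class SketchTokenizer
{
    private $scanner;
    private $events;

    public function __construct(SketchScanner $scanner, SketchEvents $events)
    {
        $this->scanner = $scanner;
        $this->events = $events;
    }

    // Read the whole input, emit it as one text event, then signal EOF.
    public function parse(): void
    {
        $text = '';
        while (($char = $this->scanner->next()) !== null) {
            $text .= $char;
        }
        if ($text !== '') {
            $this->events->text($text);
        }
        $this->events->eof();
    }
}

// Input in, events out: create the tokenizer, then call parse().
$events = new SketchEvents();
$tokenizer = new SketchTokenizer(new SketchScanner('Hello'), $events);
$tokenizer->parse();
```

After parse() returns, everything the tokenizer recognized is available from the event handler side, which is where a DOM builder would normally sit.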

parse()

Begin parsing.


    public
                    parse() : mixed

This will begin scanning the document, tokenizing as it goes. Tokens are emitted into the event handler.

Tokenizing will continue until the document is completely read. Errors are emitted into the event handler, but the parser will attempt to continue parsing until the entire input stream is read.

setTextMode()

Set the text mode for the character data reader.


    public
                    setTextMode(int $textmode[, string $untilTag = null ]) : mixed

HTML5 defines three different modes for reading text:

Normal: Read until a tag is encountered.
RCDATA: Read until a tag is encountered, but skip a few otherwise-special characters.
Raw: Read until a special closing tag is encountered (e.g. script, style)

This allows those modes to be set.

Normally, setting is done by the event handler via a special return code on startTag(), but it can also be set manually using this function.

Parameters

$textmode : int: One of Elements::TEXT_*.
$untilTag : string = null: The tag that should stop RAW or RCDATA mode. Normal mode does not use this indicator.
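
As a rough illustration of how a start tag might select one of the three modes, the sketch below maps a tag name to a mode. The SKETCH_TEXT_* constants and sketchTextModeFor() are hypothetical; the real values are the Elements::TEXT_* constants, which this page does not list. The tag lists follow the HTML5 spec's raw text elements (script, style) and RCDATA elements (textarea, title).

```php
<?php
// Hypothetical mode constants; the real values live in Elements::TEXT_*
// and are not shown in this documentation.
const SKETCH_TEXT_NORMAL = 0;
const SKETCH_TEXT_RAW = 1;
const SKETCH_TEXT_RCDATA = 2;

// Pick a text mode for the content that follows a start tag.
function sketchTextModeFor(string $tagName): int
{
    $tagName = strtolower($tagName);
    if (in_array($tagName, ['script', 'style'], true)) {
        return SKETCH_TEXT_RAW;
    }
    if (in_array($tagName, ['textarea', 'title'], true)) {
        return SKETCH_TEXT_RCDATA;
    }

    return SKETCH_TEXT_NORMAL;
}
```

An event handler could return a value like this from startTag(), and the same tag name would then serve as the $untilTag argument that ends RAW or RCDATA mode.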

attribute()

Parse attributes from inside of a tag.


    protected
                    attribute(array<string|int, string> &$attributes) : bool

Parameters

$attributes : array<string|int, string>

Return values

bool

attributeValue()

Consume an attribute value. See section 8.2.4.37 and after.


    protected
                    attributeValue() : string|null

Return values

string|null

bogusComment()

Consume malformed markup as if it were a comment.


    protected
                    bogusComment([string $leading = '' ]) : bool

See section 8.2.4.44.

The spec requires that the ENTIRE tag-like thing be enclosed inside of the comment, so the emitted comment carries all of the malformed markup.

Parameters

$leading : string = '': Prepend any leading characters. This essentially negates the need to backtrack, but it's sort of a hack.

Return values

bool
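
A minimal sketch of the idea, working on a plain string instead of the scanner: everything from the current position up to the next > is treated as comment text, with $leading prepended so that already-consumed characters are not lost. sketchBogusComment() and its $offset parameter are hypothetical, and it is an assumption of this sketch that the closing > itself is left out of the comment text.

```php
<?php
// Sketch of the bogus-comment idea: everything up to the next '>' becomes
// comment text. Function name and parameters are hypothetical.
function sketchBogusComment(string $input, int $offset, string $leading = ''): string
{
    $end = strpos($input, '>', $offset);
    // With no closing '>', the comment runs to the end of the input.
    $body = $end === false
        ? substr($input, $offset)
        : substr($input, $offset, $end - $offset);

    return $leading . $body;
}

// Suppose "</" was already consumed before the markup turned out to be malformed.
$comment = sketchBogusComment('</+foo>bar', 2, '</');
```

Passing the consumed characters back in through $leading is what removes the need to rewind the input.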

buffer()

Add text to the temporary buffer.


    protected
                    buffer(string $str) : mixed

Parameters

$str : string

cdataSection()

Handle a CDATA section.


    protected
                    cdataSection() : bool

Return values

bool

characterData()

Parse anything that looks like character data.


    protected
                    characterData() : mixed

Different rules apply based on the current text mode.

comment()

Read a comment.


    protected
                    comment() : bool

Expects the first token to be inside the comment.

Return values

bool

consumeData()

Consume a character and make a move.


    protected
                    consumeData() : mixed

See section 8.2.4.1 of the HTML5 spec.

decodeCharacterReference()

Decode a character reference and return the string.


    protected
                    decodeCharacterReference([bool $inAttribute = false ]) : string

If $inAttribute is set to true, a bare & will be returned as-is.

Parameters

$inAttribute : bool = false: Set to true if the text is inside of an attribute value. false otherwise.

Return values

string
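
To show the kind of result a decoded reference produces, the sketch below uses PHP's built-in html_entity_decode() as a stand-in. It is not the tokenizer's own decoding logic, and sketchDecodeReference() is a hypothetical helper; it only illustrates that named and numeric references become characters while a bare & passes through unchanged.

```php
<?php
// Stand-in using PHP's built-in decoder, not the tokenizer's own method.
function sketchDecodeReference(string $reference): string
{
    return html_entity_decode($reference, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

$named = sketchDecodeReference('&amp;');    // named reference
$numeric = sketchDecodeReference('&#x41;'); // hexadecimal numeric reference
$bare = sketchDecodeReference('&');         // bare ampersand, nothing to decode
```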

doctype()

Parse a DOCTYPE.


    protected
                    doctype() : bool

Parse a DOCTYPE declaration. The result strongly influences whether quirks mode is enabled on the event handler.

Return values

bool

endTag()

Consume an end tag. See section 8.2.4.9.


    protected
                    endTag() : mixed

eof()

If the document has been fully read, emit an EOF event.


    protected
                    eof() : mixed

flushBuffer()

Send a TEXT event with the contents of the text buffer.


    protected
                    flushBuffer() : mixed

This emits an EventHandler::text() event with the current contents of the temporary text buffer. (The buffer is used to group as much PCDATA as we can instead of emitting lots and lots of TEXT events.)
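
The buffer-then-flush pattern can be sketched as follows. SketchTextBuffer is a hypothetical class, the callback stands in for the event handler's text() method, and skipping the event when the buffer is empty is an assumption of this sketch.

```php
<?php
// Hypothetical sketch of buffer-then-flush; not the package's code.
final class SketchTextBuffer
{
    private $text = '';
    private $onText;

    // $onText stands in for the event handler's text() method.
    public function __construct(callable $onText)
    {
        $this->onText = $onText;
    }

    // Append text without emitting anything yet.
    public function buffer(string $str): void
    {
        $this->text .= $str;
    }

    // Emit one event for everything buffered so far, then reset.
    public function flushBuffer(): void
    {
        if ($this->text === '') {
            return; // assumption: an empty buffer produces no event
        }
        ($this->onText)($this->text);
        $this->text = '';
    }
}

$emitted = [];
$buffer = new SketchTextBuffer(function (string $text) use (&$emitted): void {
    $emitted[] = $text;
});
$buffer->buffer('Hello, ');
$buffer->buffer('world');
$buffer->flushBuffer(); // one event instead of two
$buffer->flushBuffer(); // nothing buffered, so nothing is emitted
```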

is_alpha()

Checks whether a (single-byte) character is an ASCII letter or not.


    protected
                    is_alpha(string $input) : bool

Parameters

$input : string: A single-byte string

Return values

bool: True if it is a letter, false otherwise.
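
One way to implement such a check is to compare the byte value against the two ASCII letter ranges. sketchIsAlpha() is a hypothetical free function written for illustration; the empty-string guard is an addition of this sketch.

```php
<?php
// Sketch only: compares the byte value against the two ASCII letter ranges.
function sketchIsAlpha(string $input): bool
{
    if ($input === '') {
        return false;
    }
    $code = ord($input); // ord() inspects the first byte only

    return ($code >= 65 && $code <= 90)    // 'A'..'Z'
        || ($code >= 97 && $code <= 122);  // 'a'..'z'
}
```

Because only one byte is inspected, the first byte of a multi-byte UTF-8 character falls outside both ranges and is reported as a non-letter.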

isCommentEnd()

Check if the scanner has reached the end of a comment.


    protected
                    isCommentEnd() : bool

Return values

bool

isTagEnd()

Check if the scanner has reached the end of a tag.


    protected
                    isTagEnd(mixed &$selfClose) : mixed

Parameters

$selfClose : mixed

markupDeclaration()

Look for markup.


    protected
                    markupDeclaration() : mixed

parseError()

Emit a parse error.


    protected
                    parseError(string $msg) : string

A parse error always returns false because it never consumes any characters.

Parameters

$msg : string

Return values

string

processingInstruction()

Handle a processing instruction.


    protected
                    processingInstruction() : bool

XML processing instructions are supposed to be ignored in HTML5 and treated as "bogus comments". However, since we're not a user agent, we allow them. We consume until ?> and then issue an EventHandler::processingInstruction() event.

Return values

bool

quotedAttributeValue()

Get an attribute value string.


    protected
                    quotedAttributeValue(string $quote) : string

Parameters

$quote : string: IMPORTANT: This is a series of characters, any one of which terminates the attribute's value. E.g. "\"'" will stop at either ' or ".

Return values

string: The attribute value.
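
The "any one of these characters terminates the value" behavior of $quote can be illustrated with PHP's strcspn(). sketchReadUntilAny() and its parameters are hypothetical, and the sketch works on a plain string and does not decode character references.

```php
<?php
// Hypothetical sketch: read from $offset until any character in $stopChars.
function sketchReadUntilAny(string $input, int $offset, string $stopChars): string
{
    // strcspn() returns the length of the run that contains none of $stopChars.
    $length = strcspn($input, $stopChars, $offset);

    return substr($input, $offset, $length);
}

// The input is: a='it"s'
// Both quote characters act as terminators, so reading stops at the first one found.
$value = sketchReadUntilAny('a=\'it"s\'', 3, "\"'");
```

With only ' as the stop character, the same call would read past the double quote and return it"s instead.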

quotedString()

Utility for reading a quoted string.


    protected
                    quotedString(string $stopchars) : mixed

Parameters

$stopchars : string: Characters (in addition to a close-quote) that should stop the string. E.g. sometimes '>' is higher precedence than '"' or "'".

Return values

mixed: String if one is found (quotations omitted).

rawText()

Read text in RAW mode.


    protected
                    rawText(string $tok) : bool

Parameters

$tok : string: The current token.

Return values

bool

rcdata()

Read text in RCDATA mode.


    protected
                    rcdata(string $tok) : bool

Parameters

$tok : string: The current token.

Return values

bool

readUntilSequence()

Read from the input stream until we get to the desired sequence or hit the end of the input stream.


    protected
                    readUntilSequence(string $sequence) : string

Parameters

$sequence : string

Return values

string
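
A minimal sketch of the same idea over a plain string, using strpos(). sketchReadUntilSequence() and the by-reference $offset are hypothetical; the tokenizer reads from its scanner instead. Leaving the position at the start of the matched sequence, without consuming it, is an assumption of this sketch.

```php
<?php
// Hypothetical sketch over a plain string.
// Returns the text before $sequence and advances $offset to where reading stopped.
function sketchReadUntilSequence(string $input, int &$offset, string $sequence): string
{
    $found = strpos($input, $sequence, $offset);
    // If the sequence never appears, read through to the end of the input.
    $end = $found === false ? strlen($input) : $found;
    $read = substr($input, $offset, $end - $offset);
    $offset = $end;

    return $read;
}

$offset = 0;
$script = sketchReadUntilSequence('var a = 1;</script><p>', $offset, '</script>');
```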

sequenceMatches()

Check if upcoming chars match the given sequence.


    protected
                    sequenceMatches(string $sequence[, bool $caseSensitive = true ]) : bool

This will read the stream for the $sequence. If it's found, this will return true. If not, return false. Since this unconsumes any chars it reads, the caller will still need to read the next sequence, even if this returns true.

Example: $this->scanner->sequenceMatches('</script>') will see if the input stream is at the start of a '</script>' string.

Parameters

$sequence : string
$caseSensitive : bool = true

Return values

bool
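
The non-consuming check can be sketched over a plain string. sketchSequenceMatches() and its $offset parameter are hypothetical; because $offset is passed by value, the caller's position is unchanged afterwards, which mirrors the "unconsume" behavior described above.

```php
<?php
// Hypothetical sketch: peek at the input without moving the read position.
function sketchSequenceMatches(
    string $input,
    int $offset,
    string $sequence,
    bool $caseSensitive = true
): bool {
    // Take as many upcoming bytes as the sequence is long.
    $upcoming = (string) substr($input, $offset, strlen($sequence));

    if ($caseSensitive) {
        return $upcoming === $sequence;
    }

    // strcasecmp() returns 0 when the strings are equal ignoring ASCII case.
    return strcasecmp($upcoming, $sequence) === 0;
}
```

Since the position does not move, a caller that gets true still has to read past the sequence itself before continuing.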

tagName()

Consume a tag name and body. See section 8.2.4.10.


    protected
                    tagName() : mixed

text()

This buffers the current token as character data.


    protected
                    text(string $tok) : bool

Parameters

$tok : string: The current token.

Return values

bool

unquotedAttributeValue()


    protected
                    unquotedAttributeValue() : mixed
