Tokenizer
The HTML5 tokenizer.
The tokenizer's role is reading data from the scanner and gathering it into semantic units. From the tokenizer, data is emitted to an event handler, which may (for example) create a DOM tree.
The HTML5 specification has a detailed explanation of tokenizing HTML5. We follow that specification to the maximum extent that we can. If you find a discrepancy that is not documented, please file a bug and/or submit a patch.
This tokenizer is implemented as a recursive descent parser.
Within the API documentation, you may see references to the specific section of the HTML5 spec that the code attempts to reproduce. Example: 8.2.4.1. This refers to section 8.2.4.1 of the HTML5 CR specification.
Table of Contents
Constants
- CONFORMANT_HTML = 'html'
- CONFORMANT_XML = 'xml'
Properties
- $carryOn : mixed
- $events : mixed
- $mode : mixed
- $scanner : mixed
- $text : mixed
- Buffer for text.
- $textMode : mixed
- $tok : mixed
- $untilTag : mixed
Methods
- __construct() : mixed
- Create a new tokenizer.
- parse() : mixed
- Begin parsing.
- setTextMode() : mixed
- Set the text mode for the character data reader.
- attribute() : bool
- Parse attributes from inside of a tag.
- attributeValue() : string|null
- Consume an attribute value. See section 8.2.4.37 and after.
- bogusComment() : bool
- Consume malformed markup as if it were a comment.
- buffer() : mixed
- Add text to the temporary buffer.
- cdataSection() : bool
- Handle a CDATA section.
- characterData() : mixed
- Parse anything that looks like character data.
- comment() : bool
- Read a comment.
- consumeData() : mixed
- Consume a character and make a move.
- decodeCharacterReference() : string
- Decode a character reference and return the string.
- doctype() : bool
- Parse a DOCTYPE.
- endTag() : mixed
- Consume an end tag. See section 8.2.4.9.
- eof() : mixed
- If the document is read, emit an EOF event.
- flushBuffer() : mixed
- Send a TEXT event with the contents of the text buffer.
- is_alpha() : bool
- Checks whether a (single-byte) character is an ASCII letter or not.
- isCommentEnd() : bool
- Check if the scanner has reached the end of a comment.
- isTagEnd() : mixed
- Check if the scanner has reached the end of a tag.
- markupDeclaration() : mixed
- Look for markup.
- parseError() : string
- Emit a parse error.
- processingInstruction() : bool
- Handle a processing instruction.
- quotedAttributeValue() : string
- Get an attribute value string.
- quotedString() : mixed
- Utility for reading a quoted string.
- rawText() : bool
- Read text in RAW mode.
- rcdata() : bool
- Read text in RCDATA mode.
- readUntilSequence() : string
- Read from the input stream until we get to the desired sequence or hit the end of the input stream.
- sequenceMatches() : bool
- Check if upcoming chars match the given sequence.
- tagName() : mixed
- Consume a tag name and body. See section 8.2.4.10.
- text() : bool
- This buffers the current token as character data.
- unquotedAttributeValue() : mixed
Constants
CONFORMANT_HTML
public
mixed
CONFORMANT_HTML
= 'html'
CONFORMANT_XML
public
mixed
CONFORMANT_XML
= 'xml'
Properties
$carryOn
protected
mixed
$carryOn
= true
$events
protected
mixed
$events
$mode
protected
mixed
$mode
= self::CONFORMANT_HTML
$scanner
protected
mixed
$scanner
$text
Buffer for text.
protected
mixed
$text
= ''
$textMode
protected
mixed
$textMode
= 0
$tok
protected
mixed
$tok
$untilTag
protected
mixed
$untilTag
= null
Methods
__construct()
Create a new tokenizer.
public
__construct(Scanner $scanner, EventHandler $eventHandler[, string $mode = self::CONFORMANT_HTML ]) : mixed
Typically, parsing a document involves creating a new tokenizer, giving it a scanner (input) and an event handler (output), and then calling the Tokenizer::parse() method.
Parameters
- $scanner : Scanner
-
A scanner initialized with an input stream.
- $eventHandler : EventHandler
-
An event handler, initialized and ready to receive events.
- $mode : string = self::CONFORMANT_HTML
parse()
Begin parsing.
public
parse() : mixed
This will begin scanning the document, tokenizing as it goes. Tokens are emitted into the event handler.
Tokenizing will continue until the document is completely read. Errors are emitted into the event handler, but the parser will attempt to continue parsing until the entire input stream is read.
setTextMode()
Set the text mode for the character data reader.
public
setTextMode(int $textmode[, string $untilTag = null ]) : mixed
HTML5 defines three different modes for reading text:
- Normal: Read until a tag is encountered.
- RCDATA: Read until a tag is encountered, but skip a few otherwise-special characters.
- Raw: Read until a special closing tag is encountered (viz. pre, script).
This allows those modes to be set.
Normally, setting is done by the event handler via a special return code on startTag(), but it can also be set manually using this function.
Parameters
- $textmode : int
-
One of Elements::TEXT_*.
- $untilTag : string = null
-
The tag that should stop RAW or RCDATA mode. Normal mode does not use this indicator.
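The role of $untilTag in raw mode can be illustrated with a minimal, self-contained sketch. The function name readRawText is hypothetical and this is not the library's code; it assumes the closing tag is matched case-insensitively. It shows that a bare < inside raw text is kept as text and that reading stops only at the named closing tag.

```php
<?php
// Hypothetical illustration of raw-text reading; not the library's code.
// Returns everything before the first "</$untilTag", or the whole input
// if that closing tag never appears.
function readRawText(string $input, string $untilTag): string
{
    // Assumption: the closing tag is matched case-insensitively.
    $end = stripos($input, '</' . $untilTag);

    return $end === false ? $input : substr($input, 0, $end);
}

// The "<" in "a < b" does not start a tag while in raw mode.
$text = readRawText('if (a < b) { go(); }</script><p>', 'script');
```

In normal mode the same input would be split at the first <, which is why script-like content needs a mode of its own.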
attribute()
Parse attributes from inside of a tag.
protected
attribute(array<string|int, string> &$attributes) : bool
Parameters
- $attributes : array<string|int, string>
Return values
bool
attributeValue()
Consume an attribute value. See section 8.2.4.37 and after.
protected
attributeValue() : string|null
Return values
string|null
bogusComment()
Consume malformed markup as if it were a comment.
protected
bogusComment([string $leading = '' ]) : bool
8.2.4.44.
The spec requires that the ENTIRE tag-like thing be enclosed inside of the comment. So this will generate comments like:
<!--</+foo>-->
Parameters
- $leading : string = ''
-
Prepend any leading characters. This essentially negates the need to backtrack, but it's sort of a hack.
Return values
bool
buffer()
Add text to the temporary buffer.
protected
buffer(string $str) : mixed
Parameters
- $str : string
cdataSection()
Handle a CDATA section.
protected
cdataSection() : bool
Return values
bool
characterData()
Parse anything that looks like character data.
protected
characterData() : mixed
Different rules apply based on the current text mode.
comment()
Read a comment.
protected
comment() : bool
Expects the first tok to be inside of the comment.
Return values
bool
consumeData()
Consume a character and make a move.
protected
consumeData() : mixed
HTML5 8.2.4.1.
decodeCharacterReference()
Decode a character reference and return the string.
protected
decodeCharacterReference([bool $inAttribute = false ]) : string
If $inAttribute is set to true, a bare & will be returned as-is.
Parameters
- $inAttribute : bool = false
-
Set to true if the text is inside of an attribute value. false otherwise.
Return values
string
doctype()
Parse a DOCTYPE.
protected
doctype() : bool
Parse a DOCTYPE declaration. This method has a strong bearing on whether quirks mode is enabled on the event handler.
Return values
bool
endTag()
Consume an end tag. See section 8.2.4.9.
protected
endTag() : mixed
eof()
If the document is read, emit an EOF event.
protected
eof() : mixed
flushBuffer()
Send a TEXT event with the contents of the text buffer.
protected
flushBuffer() : mixed
This emits an EventHandler::text() event with the current contents of the temporary text buffer. (The buffer is used to group as much PCDATA as we can instead of emitting lots and lots of TEXT events.)
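The buffering idea can be sketched in isolation. The TextBuffer class and its callback are hypothetical stand-ins, not the library's classes, and skipping the emit when the buffer is empty is an assumption made for this sketch.

```php
<?php
// Hypothetical sketch: collect text fragments and emit them as one event.
final class TextBuffer
{
    /** @var string Accumulated text that has not been emitted yet. */
    private $text = '';

    /** @var callable Receives the buffered text when it is flushed. */
    private $onText;

    public function __construct(callable $onText)
    {
        $this->onText = $onText;
    }

    // Append a fragment without emitting anything.
    public function buffer(string $str): void
    {
        $this->text .= $str;
    }

    // Emit the accumulated text once, then reset the buffer.
    public function flushBuffer(): void
    {
        if ($this->text === '') {
            return; // Assumption: an empty buffer produces no event.
        }
        ($this->onText)($this->text);
        $this->text = '';
    }
}

$events = [];
$buf = new TextBuffer(function (string $t) use (&$events): void {
    $events[] = $t;
});
$buf->buffer('Hello');
$buf->buffer(', ');
$buf->buffer('world');
$buf->flushBuffer();
$buf->flushBuffer(); // Nothing buffered, so no second event.
```

Three fragments reach the callback as a single string, which is the effect the description above attributes to the temporary buffer.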
is_alpha()
Checks whether a (single-byte) character is an ASCII letter or not.
protected
is_alpha(string $input) : bool
Parameters
- $input : string
-
A single-byte string
Return values
bool —True if it is a letter, False otherwise
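One way such a check could be written is a byte-range comparison. The sketch below is an assumption about the approach, not the library's implementation, and the name isAsciiLetter is hypothetical.

```php
<?php
// Hypothetical stand-alone helper; not the library's actual code.
function isAsciiLetter(string $input): bool
{
    // Only single-byte input is considered; multi-byte characters fail here.
    if (strlen($input) !== 1) {
        return false;
    }

    $code = ord($input);

    // 'A'..'Z' is 65..90 and 'a'..'z' is 97..122 in ASCII.
    return ($code >= 65 && $code <= 90) || ($code >= 97 && $code <= 122);
}
```

Comparing byte values avoids locale-dependent behavior, which matters when the input may contain non-ASCII bytes.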
isCommentEnd()
Check if the scanner has reached the end of a comment.
protected
isCommentEnd() : bool
Return values
bool
isTagEnd()
Check if the scanner has reached the end of a tag.
protected
isTagEnd(mixed &$selfClose) : mixed
Parameters
- $selfClose : mixed
markupDeclaration()
Look for markup.
protected
markupDeclaration() : mixed
parseError()
Emit a parse error.
protected
parseError(string $msg) : string
A parse error always returns false because it never consumes any characters.
Parameters
- $msg : string
Return values
string
processingInstruction()
Handle a processing instruction.
protected
processingInstruction() : bool
XML processing instructions are supposed to be ignored in HTML5 and treated as "bogus comments". However, since we're not a user agent, we allow them. We consume until ?> and then issue an EventHandler::processingInstruction() event.
Return values
bool
quotedAttributeValue()
Get an attribute value string.
protected
quotedAttributeValue(string $quote) : string
Parameters
- $quote : string
-
IMPORTANT: This is a series of chars! Any one of which will be considered termination of an attribute's value. E.g. "\"'" will stop at either ' or ".
Return values
string —The attribute value.
quotedString()
Utility for reading a quoted string.
protected
quotedString(string $stopchars) : mixed
Parameters
- $stopchars : string
-
Characters (in addition to a close-quote) that should stop the string. E.g. sometimes '>' is higher precedence than '"' or "'".
Return values
mixed —String if one is found (quotations omitted).
rawText()
Read text in RAW mode.
protected
rawText(string $tok) : bool
Parameters
- $tok : string
-
The current token.
Return values
bool
rcdata()
Read text in RCDATA mode.
protected
rcdata(string $tok) : bool
Parameters
- $tok : string
-
The current token.
Return values
bool
readUntilSequence()
Read from the input stream until we get to the desired sequence or hit the end of the input stream.
protected
readUntilSequence(string $sequence) : string
Parameters
- $sequence : string
Return values
string
sequenceMatches()
Check if upcoming chars match the given sequence.
protected
sequenceMatches(string $sequence[, bool $caseSensitive = true ]) : bool
This reads ahead in the stream looking for $sequence and returns true if it is found, false otherwise. Because it unconsumes any chars it reads, the caller still needs to consume the sequence itself, even when this returns true.
Example: $this->scanner->sequenceMatches('</script>') will see if the input stream is at the start of a '</script>' string.
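The peek-then-consume pattern described here can be sketched with a string and a read position. PeekingReader, consume() and position() are hypothetical names introduced for illustration; they are not part of the documented API.

```php
<?php
// Hypothetical stand-in for a scanner: a string plus a read position.
final class PeekingReader
{
    /** @var string */
    private $data;

    /** @var int Offset of the next unread byte. */
    private $pos = 0;

    public function __construct(string $data)
    {
        $this->data = $data;
    }

    // Compare the upcoming bytes with $sequence without moving the position.
    public function sequenceMatches(string $sequence, bool $caseSensitive = true): bool
    {
        $upcoming = (string) substr($this->data, $this->pos, strlen($sequence));

        if ($caseSensitive) {
            return $upcoming === $sequence;
        }

        return strcasecmp($upcoming, $sequence) === 0;
    }

    // Advance past $count bytes; the caller does this after a successful match.
    public function consume(int $count): void
    {
        $this->pos = min($this->pos + $count, strlen($this->data));
    }

    public function position(): int
    {
        return $this->pos;
    }
}

$reader = new PeekingReader('</SCRIPT>rest');
$matched = $reader->sequenceMatches('</script>', false);
$before = $reader->position(); // Still at the start: the match consumed nothing.
if ($matched) {
    $reader->consume(strlen('</script>'));
}
```

Keeping the match side-effect free lets a caller test several candidate sequences in a row and consume only the one that applies.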
Parameters
- $sequence : string
- $caseSensitive : bool = true
Return values
bool
tagName()
Consume a tag name and body. See section 8.2.4.10.
protected
tagName() : mixed
text()
This buffers the current token as character data.
protected
text(string $tok) : bool
Parameters
- $tok : string
-
The current token.
Return values
bool
unquotedAttributeValue()
protected
unquotedAttributeValue() : mixed