The Zorba XQuery processor implements the XQuery and XPath Full Text 1.0 specification, which, among other things, requires that a string be tokenized into a sequence of tokens.
By default, Zorba uses the ICU library for tokenization. For Roman alphabets, Zorba (ICU) treats only alphanumeric sequences of characters as part of a token; whitespace and punctuation characters are not part of tokens and instead separate them. However, sequences matching the regular expression [0-9][.,][0-9] are retained as part of a token; for example, "98.6" and "1,432.58" are each a single token.
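The rule above can be approximated with a short, self-contained sketch. It is not how Zorba or ICU implement tokenization: it handles ASCII only, and the names sketch_tokenize and TokenCallback are hypothetical stand-ins introduced here for illustration.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical stand-in for a per-token callback; not Zorba's Callback class.
using TokenCallback = std::function<void(const std::string&)>;

static bool is_alnum(char c) {
  return std::isalnum(static_cast<unsigned char>(c)) != 0;
}

static bool is_digit(char c) {
  return std::isdigit(static_cast<unsigned char>(c)) != 0;
}

// ASCII-only approximation of the default rule: alphanumeric runs form
// tokens, and '.' or ',' is kept only when it sits directly between digits.
void sketch_tokenize(const std::string& s, const TokenCallback& callback) {
  assert(callback != nullptr);  // an empty std::function cannot be called
  std::string current;          // the token being accumulated
  for (std::size_t i = 0; i < s.size(); ++i) {
    const char c = s[i];
    const bool joins_digits =
        (c == '.' || c == ',') && i > 0 && i + 1 < s.size() &&
        is_digit(s[i - 1]) && is_digit(s[i + 1]);
    if (is_alnum(c) || joins_digits) {
      current += c;
    } else if (!current.empty()) {
      callback(current);  // a separator ends the current token
      current.clear();
    }
  }
  if (!current.empty()) callback(current);  // flush the final token
}
```

Under these assumptions, the input "Temp: 98.6, cost 1,432.58." produces the four tokens "Temp", "98.6", "cost", and "1,432.58": the comma after "98.6" and the final period are not between two digits, so they act as separators.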
Alternatively, you can implement your own tokenizer by deriving from the Tokenizer class.
The Tokenizer class is:
For details about the ptr type, the destroy() function, and why the destructor is protected, see the Memory Management document.
The State struct is created by Zorba and passed to your constructor. It simply keeps track of the current token, sentence, and paragraph numbers.
To implement a Tokenizer, you need to implement the tokenize_string() function where:
utf8_s | A pointer to the UTF-8 byte sequence comprising the string to be tokenized. |
utf8_len | The number of bytes in the string to be tokenized. |
lang | The language of the string. |
wildcards | If true, allows XQuery wildcard syntax characters to be part of tokens. |
callback | The Callback to call once per token. |
item | The Item whence this token came. If the token occurred within an element, the Item is the text node. If the token occurred within an attribute, the Item is the attribute node. |
A complete implementation of tokenize_string() is non-trivial and therefore an example is beyond the scope of this API documentation. However, the things a tokenizer should take into consideration include:
The task of iterating over an XML element's child nodes is done by tokenize_node_impl(). Its default implementation treats XML elements, comments, and processing instructions as token separators. (See Properties.) If you want to change that, you need to override tokenize_node_impl().
By default, Zorba increments the current paragraph number once for each XML element encountered. However, this doesn't work well for mixed content. For example, in the XHTML:
all the tokens are in the same sentence and the same paragraph, but by default Zorba will consider them to be in 3 paragraphs.
Your tokenizer can take control over when the paragraph number is incremented by overriding the item() function. The item() function is passed the Item of the current XML element and whether the item is being entered or exited.
For example, the item() function for tokenizing XHTML would be along the lines of:
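As a self-contained sketch of that idea (not Zorba's actual code), the following increments the paragraph number only when a block-level element is entered. SketchState, its field names, is_block_element(), and sketch_item() are hypothetical stand-ins for Zorba's State, Item, and item(); the list of block-level elements is deliberately partial.

```cpp
#include <cassert>
#include <set>
#include <string>

// Stand-in for the counters tracked by State; field names are assumptions.
struct SketchState {
  unsigned token = 0;
  unsigned sent = 0;
  unsigned para = 0;
};

// Hypothetical helper: a partial list of XHTML block-level elements.
bool is_block_element(const std::string& local_name) {
  static const std::set<std::string> block = {
      "blockquote", "div", "h1", "h2", "h3", "h4", "h5", "h6",
      "li", "p", "pre", "td", "th"};
  return block.count(local_name) != 0;
}

// Plays the role of item(): called once when an element is entered and once
// when it is exited. Inline elements such as em never change the paragraph.
void sketch_item(SketchState& state, const std::string& local_name,
                 bool entering) {
  assert(!local_name.empty());  // an element always has a name
  if (entering && is_block_element(local_name)) ++state.para;
}
```

With this rule, entering a p element that contains an em element counts as a single paragraph, because only the p element is block-level.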
To implement a Tokenizer, you also need to implement the properties() function, which fills in the Properties struct where:
comments_separate_tokens | If true, XML comments separate tokens. For example, net<!-- -->work would be 2 tokens instead of 1. |
elements_separate_tokens | If true, XML elements separate tokens. For example, <b>B</b>old would be 2 tokens instead of 1. |
processing_instructions_separate_tokens | If true, XML processing instructions separate tokens. For example, net<?PI pi?>work would be 2 tokens instead of 1. |
languages | The list of languages supported by the tokenizer. |
uri | The URI that uniquely identifies the Tokenizer. |
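To make the fields above concrete, here is a minimal sketch of a properties()-style function. SketchProperties is a hypothetical stand-in that reuses the field names from the table; the field types (strings for the languages and the URI), the chosen values, and the example URI are assumptions, not Zorba's declarations.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for the Properties struct; the member types are assumptions.
struct SketchProperties {
  bool comments_separate_tokens = false;
  bool elements_separate_tokens = false;
  bool processing_instructions_separate_tokens = false;
  std::vector<std::string> languages;  // assumed here to be language codes
  std::string uri;
};

// Plays the role of properties(): fills in the caller-supplied struct.
void sketch_properties(SketchProperties* props) {
  assert(props != nullptr);
  props->comments_separate_tokens = false;  // net<!-- -->work stays 1 token
  props->elements_separate_tokens = true;   // <b>B</b>old becomes 2 tokens
  props->processing_instructions_separate_tokens = true;
  props->languages = {"en"};                // this sketch supports English only
  props->uri = "http://example.com/sketch-tokenizer";  // hypothetical URI
}
```

Whether elements should separate tokens depends on the vocabulary being tokenized; the values here are only one plausible choice.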
In addition to a Tokenizer, you must also implement a TokenizerProvider that, given a language, provides a Tokenizer for that language:
Specifically, you need to implement the getTokenizer() function where:
lang | The language to tokenize. |
state | The State to use. If null, t is not set. |
t | If not null, set to point to a Tokenizer for lang. |
A simple TokenizerProvider for our tokenizer can be implemented as:
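The following self-contained sketch mirrors the getTokenizer() contract described above using stand-in types. SketchState, SketchTokenizer, and SketchTokenizerProvider are hypothetical; the boolean return value (whether the language is supported), the string language code, and the use of std::unique_ptr in place of Zorba's ptr type are all assumptions made for illustration.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Stand-in for State; field names are assumptions.
struct SketchState {
  unsigned token = 0;
  unsigned sent = 0;
  unsigned para = 0;
};

// Stand-in tokenizer that only remembers the state it was given.
class SketchTokenizer {
public:
  explicit SketchTokenizer(SketchState* state) : state_(state) {
    assert(state_ != nullptr);  // a tokenizer always needs a state
  }
  SketchState& state() const { return *state_; }

private:
  SketchState* state_;
};

class SketchTokenizerProvider {
public:
  // Assumed contract: returns true if lang is supported. A tokenizer is
  // created only when both state and t are non-null.
  bool getTokenizer(const std::string& lang, SketchState* state = nullptr,
                    std::unique_ptr<SketchTokenizer>* t = nullptr) const {
    if (lang != "en") return false;  // this sketch supports English only
    if (state != nullptr && t != nullptr) t->reset(new SketchTokenizer(state));
    return true;
  }
};
```

Calling getTokenizer("en") with no other arguments only asks whether English is supported; passing a state and an output pointer additionally creates a tokenizer bound to that state.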
To enable your tokenizer to be used, you need to register it with the XmlDataManager: