A Tokenizer breaks a string into a stream of word tokens.
#include <zorba/tokenizer.h>
Classes | |
class | Callback |
A Callback is called once per token. | |
struct | Numbers |
A Numbers contains the current token, sentence, and paragraph numbers. | |
Public Types | |
enum | ElementTraceOptions { trace_none = 0x0, trace_begin = 0x1, trace_end = 0x2 } |
Trace options for XML elements combined via bitwise-or. | |
typedef std::unique_ptr < Tokenizer, internal::ztd::destroy_delete < Tokenizer > > | ptr |
typedef unsigned | size_type |
Public Member Functions | |
virtual void | destroy () const =0 |
Destroys this Tokenizer. | |
virtual void | element (Item const &qname, int trace_options) |
This function is called whenever an XML element is entered during tokenization. | |
Numbers const & | numbers () const |
Gets this Tokenizer's associated Numbers. | |
Numbers & | numbers () |
Gets this Tokenizer's associated Numbers. | |
virtual void | tokenize (char const *utf8_s, size_type utf8_len, locale::iso639_1::type lang, bool wildcards, Callback &callback, void *payload=0)=0 |
Tokenizes the given string. | |
int | trace_options () const |
Gets the trace options. | |
Protected Member Functions | |
Tokenizer (Numbers &numbers, int trace_options=trace_none) | |
Constructs a Tokenizer. | |
virtual | ~Tokenizer ()=0 |
Destroys a Tokenizer. |
A Tokenizer breaks a string into a stream of word tokens.
Each token is assigned a token, sentence, and paragraph number.
A Tokenizer determines word and sentence boundaries automatically, but must be told when to increment the paragraph number.
Definition at line 39 of file tokenizer.h.
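typedef std::unique_ptr < Tokenizer, internal::ztd::destroy_delete < Tokenizer > > zorba::Tokenizer::ptr |
Definition at line 42 of file tokenizer.h.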
typedef unsigned zorba::Tokenizer::size_type |
Definition at line 44 of file tokenizer.h.
enum zorba::Tokenizer::ElementTraceOptions |
Trace options for XML elements combined via bitwise-or.
trace_none |
Trace no elements. |
trace_begin |
Trace the beginning of elements. |
trace_end |
Trace the ending of elements. |
Definition at line 111 of file tokenizer.h.
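Because the trace options are bit flags, a caller can combine them with bitwise-or and test them with bitwise-and. The following is a minimal sketch of that pattern. The enum is a stand-in that mirrors the enumerators documented above, and describe_trace() is a hypothetical helper that is not part of the Zorba API.

```cpp
#include <string>

// Stand-in mirroring the documented enumerators; a real build would use
// zorba::Tokenizer::ElementTraceOptions from <zorba/tokenizer.h> instead.
enum ElementTraceOptions {
  trace_none  = 0x0,
  trace_begin = 0x1,
  trace_end   = 0x2
};

// Hypothetical helper: names the element events selected by a mask.
std::string describe_trace( int trace_options ) {
  if ( trace_options == trace_none )
    return "none";
  std::string result;
  if ( trace_options & trace_begin )      // bit 0x1 set?
    result += "begin";
  if ( trace_options & trace_end ) {      // bit 0x2 set?
    if ( !result.empty() )
      result += "+";
    result += "end";
  }
  return result;
}
```

For example, describe_trace( trace_begin | trace_end ) yields a string naming both events, since the two flags occupy distinct bits.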
zorba::Tokenizer::Tokenizer (Numbers &numbers, int trace_options = trace_none) [protected]
Constructs a Tokenizer.
numbers | The Numbers to use. |
trace_options | The bitwise-or of the available trace options, if any. |
virtual zorba::Tokenizer::~Tokenizer () [protected, pure virtual]
Destroys a Tokenizer.
virtual void zorba::Tokenizer::destroy () const [pure virtual]
Destroys this Tokenizer.
This function is called by Zorba when the Tokenizer is no longer needed.
If your TokenizerProvider dynamically allocates Tokenizer objects, then the implementation can simply be (and usually is) delete this.
If your TokenizerProvider returns a pointer to a static Tokenizer object, then the implementation should do nothing.
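The two destroy() strategies described above can be sketched as follows. This is an illustration under stated assumptions, not Zorba's own code: TokenizerBase is a stand-in that models only the destroy() part of the interface, HeapTokenizer, StaticTokenizer, and shared_tokenizer() are hypothetical names, and DestroyDelete shows what a deleter such as internal::ztd::destroy_delete presumably does (call destroy() instead of delete).

```cpp
#include <memory>

// Stand-in for the destroy() part of zorba::Tokenizer's interface.
class TokenizerBase {
public:
  virtual void destroy() const = 0;
protected:
  virtual ~TokenizerBase() {}
};

// Counts live heap-allocated tokenizers, for illustration only.
static int live_heap_tokenizers = 0;

// Strategy 1: the provider allocates with new, so destroy() deletes.
class HeapTokenizer : public TokenizerBase {
public:
  HeapTokenizer() { ++live_heap_tokenizers; }
  void destroy() const override { delete this; }
private:
  ~HeapTokenizer() override { --live_heap_tokenizers; }
};

// Strategy 2: the provider hands out one static object, so destroy()
// does nothing.
class StaticTokenizer : public TokenizerBase {
public:
  void destroy() const override { /* nothing to release */ }
};

// Hypothetical accessor returning the shared static instance.
StaticTokenizer& shared_tokenizer() {
  static StaticTokenizer instance;
  return instance;
}

// Assumed shape of a destroy()-calling deleter for use with unique_ptr.
struct DestroyDelete {
  void operator()( TokenizerBase const *p ) const {
    if ( p )
      p->destroy();
  }
};
typedef std::unique_ptr<TokenizerBase, DestroyDelete> TokenizerPtr;
```

Making the HeapTokenizer destructor private is a design choice of this sketch: it forces callers to go through destroy() rather than delete or stack allocation.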
virtual void zorba::Tokenizer::element (Item const &qname, int trace_options) [virtual]
This function is called whenever an XML element is entered during tokenization.
Note that this function is called only if trace_options() returns non-zero.
qname | The element's QName. |
trace_options | The bitwise-or of the trace option(s) in effect for a particular call. |
Tokenizer::Numbers const & zorba::Tokenizer::numbers () const [inline]
Gets this Tokenizer's associated Numbers.
Definition at line 197 of file tokenizer.h.
Tokenizer::Numbers & zorba::Tokenizer::numbers () [inline]
Gets this Tokenizer's associated Numbers.
Definition at line 193 of file tokenizer.h.
virtual void zorba::Tokenizer::tokenize (char const *utf8_s, size_type utf8_len, locale::iso639_1::type lang, bool wildcards, Callback &callback, void *payload = 0) [pure virtual]
Tokenizes the given string.
utf8_s | The UTF-8 string to tokenize. It need not be null-terminated. |
utf8_len | The number of bytes in the string to be tokenized. |
lang | The language of the string. |
wildcards | If true, allows XQuery wildcard syntax characters to be part of tokens. |
callback | The Callback to call once per token. |
payload | Optional user-defined data. |
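To make the length-bounded, one-callback-per-token contract concrete, here is a deliberately simplified sketch that splits on ASCII whitespace only. Everything in it is an assumption for illustration: the Callback stand-in and its token() signature are hypothetical (the real Callback is declared in zorba/tokenizer.h and may differ), and the lang and wildcards parameters are omitted. A real implementation would also need proper word and sentence boundary detection.

```cpp
#include <cctype>
#include <string>
#include <vector>

typedef unsigned size_type;

// Hypothetical stand-in for Tokenizer::Callback.
struct Callback {
  virtual ~Callback() {}
  virtual void token( char const *utf8_s, size_type utf8_len,
                      size_type token_no, void *payload ) = 0;
};

// Example callback that copies each token into a vector.
struct CollectingCallback : Callback {
  std::vector<std::string> tokens;
  void token( char const *utf8_s, size_type utf8_len,
              size_type, void* ) override {
    tokens.push_back( std::string( utf8_s, utf8_len ) );
  }
};

// Splits on ASCII whitespace. Reads at most utf8_len bytes, so the input
// need not be null-terminated.
void tokenize_on_whitespace( char const *utf8_s, size_type utf8_len,
                             Callback &callback, void *payload = 0 ) {
  size_type token_no = 0;
  size_type i = 0;
  while ( i < utf8_len ) {
    // Skip any run of whitespace.
    while ( i < utf8_len &&
            std::isspace( static_cast<unsigned char>( utf8_s[i] ) ) )
      ++i;
    size_type const start = i;
    // Consume a run of non-whitespace bytes.
    while ( i < utf8_len &&
            !std::isspace( static_cast<unsigned char>( utf8_s[i] ) ) )
      ++i;
    if ( i > start )
      callback.token( utf8_s + start, i - start, token_no++, payload );
  }
}
```

Passing the token as a pointer and a byte count, rather than as a copied string, lets the callback decide whether a copy is needed at all.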
int zorba::Tokenizer::trace_options () const [inline]
Gets the trace options.
If the value is trace_none, then the paragraph number will be incremented upon entering an XML element; if the value is anything other than trace_none, then the tokenizer assumes responsibility for incrementing the paragraph number.
Definition at line 125 of file tokenizer.h.
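When the trace options are anything other than trace_none, the tokenizer itself must decide when the paragraph number advances. One possible policy is sketched below, assuming stand-in types throughout: the Numbers field names (token, sent, para) are guesses, ParagraphTracker is a hypothetical class rather than a real Tokenizer subclass, the QName is modeled as a plain std::string instead of an Item, and the choice of "p" and "div" as paragraph-starting elements is arbitrary.

```cpp
#include <string>

// Stand-in mirroring the documented enumerators.
enum ElementTraceOptions { trace_none = 0x0, trace_begin = 0x1, trace_end = 0x2 };

// Stand-in for Tokenizer::Numbers; the field names are assumptions.
struct Numbers {
  unsigned token, sent, para;
  Numbers() : token( 0 ), sent( 0 ), para( 0 ) {}
};

// Hypothetical class that takes over paragraph numbering.
class ParagraphTracker {
public:
  explicit ParagraphTracker( Numbers &numbers, int trace_options = trace_begin )
    : numbers_( numbers ), trace_options_( trace_options ) {}

  int trace_options() const { return trace_options_; }

  // Mirrors the shape of Tokenizer::element(): advances the paragraph
  // number only when a paragraph-starting element begins.
  void element( std::string const &qname, int trace_options ) {
    if ( ( trace_options & trace_begin ) && starts_paragraph( qname ) )
      ++numbers_.para;
  }

private:
  // Arbitrary example policy: only these element names start a paragraph.
  static bool starts_paragraph( std::string const &qname ) {
    return qname == "p" || qname == "div";
  }

  Numbers &numbers_;
  int trace_options_;
};
```

With this policy, inline elements such as "b" leave the paragraph number unchanged, which is the kind of control that a non-zero trace option is meant to allow.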