public class NGramTokenizer extends CharacterDelimitedTokenizer
-delimiters <value> The delimiters to use (default ' \r\n\t.,;:'"()?!').
-max <int> The max size of the Ngram (default = 3).
-min <int> The min size of the Ngram (default = 1).
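To make the three options concrete, here is a minimal, self-contained sketch of what a tokenizer configured with a delimiter set and a min/max N-gram size might produce. It does not call the Weka class. `NGramSketch` and `ngrams` are hypothetical names, and both the ordering (largest N first) and the single-space join between words are assumptions, not documented behaviour.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical illustration; not the Weka implementation.
class NGramSketch {

    /** Splits text on any delimiter character, then builds every n-gram with min <= n <= max. */
    static List<String> ngrams(String text, String delimiters, int min, int max) {
        // Split into words; StringTokenizer never returns empty tokens.
        List<String> words = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delimiters);
        while (st.hasMoreTokens()) {
            words.add(st.nextToken());
        }

        List<String> result = new ArrayList<>();
        // Assumed order: longest n-grams first, then progressively shorter ones.
        // Sizes below 1 are ignored in this sketch.
        for (int n = max; n >= Math.max(1, min); n--) {
            for (int start = 0; start + n <= words.size(); start++) {
                result.add(String.join(" ", words.subList(start, start + n)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Uses the documented defaults: min = 1, max = 3, default delimiter set.
        String defaultDelimiters = " \r\n\t.,;:'\"()?!";
        for (String gram : ngrams("the quick brown fox", defaultDelimiters, 1, 3)) {
            System.out.println(gram);
        }
    }
}
```

With four words and sizes 1 to 3, the sketch yields two 3-grams, three 2-grams and four 1-grams.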
Modifier and Type | Field and Description |
---|---|
protected int | m_CurrentPosition: the current position for returning elements |
protected int | m_MaxPosition: the number of strings available |
protected int | m_N: the current length of the N-grams |
protected int | m_NMax: the maximum value of N |
protected int | m_NMin: the minimum value of N |
protected String[] | m_SplitString: all the available grams |
Fields inherited from class CharacterDelimitedTokenizer: m_Delimiters
Constructor and Description |
---|
NGramTokenizer() |
Modifier and Type | Method and Description |
---|---|
protected void | filterOutEmptyStrings(): Filters out empty strings in m_SplitString and replaces m_SplitString with the cleaned version. |
int | getNGramMaxSize(): Gets the max N of the NGram. |
int | getNGramMinSize(): Gets the min N of the NGram. |
String[] | getOptions(): Gets the current option settings for the OptionHandler. |
String | getRevision(): Returns the revision string. |
String | globalInfo(): Returns a string describing the tokenizer. |
boolean | hasMoreElements(): Returns true if there are more elements available. |
Enumeration | listOptions(): Returns an enumeration of all the available options. |
static void | main(String[] args): Runs the tokenizer with the given options and strings to tokenize. |
Object | nextElement(): Returns N-grams and also (N-1)-grams and .... |
String | NGramMaxSizeTipText(): Returns the tip text for this property. |
String | NGramMinSizeTipText(): Returns the tip text for this property. |
void | setNGramMaxSize(int value): Sets the max size of the Ngram. |
void | setNGramMinSize(int value): Sets the min size of the Ngram. |
void | setOptions(String[] options): Parses a given list of options. |
void | tokenize(String s): Sets the string to tokenize. |
Methods inherited from class CharacterDelimitedTokenizer: delimitersTipText, getDelimiters, setDelimiters
Methods inherited from class Tokenizer: runTokenizer, tokenize
protected int m_NMax
protected int m_NMin
protected int m_N
protected int m_MaxPosition
protected int m_CurrentPosition
protected String[] m_SplitString
public String globalInfo()
globalInfo
in class Tokenizer
public Enumeration listOptions()
listOptions
in interface OptionHandler
listOptions
in class CharacterDelimitedTokenizer
public String[] getOptions()
getOptions
in interface OptionHandler
getOptions
in class CharacterDelimitedTokenizer
public void setOptions(String[] options) throws Exception
-delimiters <value> The delimiters to use (default ' \r\n\t.,;:'"()?!').
-max <int> The max size of the Ngram (default = 3).
-min <int> The min size of the Ngram (default = 1).
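The flag/value layout of the options array can be sketched as follows. This is a standalone illustration under stated assumptions, not Weka's parser: `OptionSketch` and its fields are hypothetical, it expects every flag to be followed by exactly one value, and it throws an unchecked `IllegalArgumentException` where the documented method declares `Exception`.

```java
// Hypothetical illustration of "-flag value" option handling; not Weka's own code.
class OptionSketch {
    int min = 1;                               // documented default for -min
    int max = 3;                               // documented default for -max
    String delimiters = " \r\n\t.,;:'\"()?!";  // documented default for -delimiters

    /** Reads "-flag value" pairs; unsupported flags or missing values are rejected. */
    void setOptions(String[] options) {
        for (int i = 0; i < options.length; i += 2) {
            String flag = options[i];
            boolean known = flag.equals("-min") || flag.equals("-max") || flag.equals("-delimiters");
            if (!known) {
                throw new IllegalArgumentException("Unsupported option: " + flag);
            }
            if (i + 1 >= options.length) {
                throw new IllegalArgumentException("Missing value for " + flag);
            }
            String value = options[i + 1];
            if (flag.equals("-min")) {
                min = Integer.parseInt(value);
            } else if (flag.equals("-max")) {
                max = Integer.parseInt(value);
            } else {
                delimiters = value;
            }
        }
    }

    /** Returns the current settings in the same flag/value layout. */
    String[] getOptions() {
        return new String[] {
            "-delimiters", delimiters,
            "-max", String.valueOf(max),
            "-min", String.valueOf(min)
        };
    }

    public static void main(String[] args) {
        OptionSketch sketch = new OptionSketch();
        sketch.setOptions(new String[] {"-max", "2", "-min", "2"});
        System.out.println(String.join(" ", sketch.getOptions()));
    }
}
```

Feeding the array returned by `getOptions()` back into `setOptions` leaves the settings unchanged in this sketch, which is a convenient property for saving and restoring a configuration.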
setOptions
in interface OptionHandler
setOptions
in class CharacterDelimitedTokenizer
options
- the list of options as an array of strings
Exception
- if an option is not supported
public int getNGramMaxSize()
public void setNGramMaxSize(int value)
value
- the size of the NGram.
public String NGramMaxSizeTipText()
public void setNGramMinSize(int value)
value
- the size of the NGram.
public int getNGramMinSize()
public String NGramMinSizeTipText()
public boolean hasMoreElements()
hasMoreElements
in interface Enumeration
hasMoreElements
in class Tokenizer
public Object nextElement()
nextElement
in interface Enumeration
nextElement
in class Tokenizer
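The `hasMoreElements()` / `nextElement()` pair follows the standard `java.util.Enumeration` contract. The sketch below shows one way a cursor over N-grams could be kept with a current length and a current position, loosely echoing the documented fields `m_N` and `m_CurrentPosition`. The class name `NGramEnumerationSketch`, the largest-first ordering, the space separator and the capping of N at the word count are all assumptions of this sketch, not statements about the Weka source.

```java
import java.util.Arrays;
import java.util.Enumeration;
import java.util.NoSuchElementException;

// Hypothetical illustration of an Enumeration over N-grams; not the Weka implementation.
class NGramEnumerationSketch implements Enumeration<String> {
    private final String[] words;
    private final int nMin;
    private int n;         // current N-gram length, starts at the maximum
    private int position;  // index of the first word of the next N-gram

    NGramEnumerationSketch(String[] words, int nMin, int nMax) {
        this.words = words;
        this.nMin = Math.max(1, nMin);
        // Assumption: an N-gram cannot be longer than the number of words.
        this.n = Math.min(nMax, words.length);
        this.position = 0;
    }

    @Override
    public boolean hasMoreElements() {
        return n >= nMin && position + n <= words.length;
    }

    @Override
    public String nextElement() {
        if (!hasMoreElements()) {
            throw new NoSuchElementException();
        }
        String gram = String.join(" ", Arrays.copyOfRange(words, position, position + n));
        position++;
        if (position + n > words.length) {
            // All N-grams of this length have been returned; continue with (N-1)-grams.
            position = 0;
            n--;
        }
        return gram;
    }

    public static void main(String[] args) {
        Enumeration<String> grams =
            new NGramEnumerationSketch(new String[] {"a", "b", "c"}, 1, 2);
        while (grams.hasMoreElements()) {
            System.out.println(grams.nextElement());
        }
    }
}
```

Callers typically drive such an object with a `while (hasMoreElements())` loop, as in `main` above.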
protected void filterOutEmptyStrings()
m_SplitString
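As a rough sketch of the cleaning step described for `filterOutEmptyStrings()`, the following standalone code drops empty entries from a string array and returns the cleaned copy. `EmptyFilterSketch` and `withoutEmptyStrings` are hypothetical names, and the use of `String.split` to produce the empty entries is only an assumption made to get a realistic input.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the Weka implementation.
class EmptyFilterSketch {

    /** Returns a copy of the array with all empty strings removed, order preserved. */
    static String[] withoutEmptyStrings(String[] split) {
        List<String> kept = new ArrayList<>();
        for (String s : split) {
            if (!s.isEmpty()) {
                kept.add(s);
            }
        }
        return kept.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Adjacent or trailing delimiters leave empty strings when limit is negative.
        String[] raw = "a,,b,".split(",", -1);
        String[] cleaned = withoutEmptyStrings(raw);
        System.out.println(raw.length + " entries before, " + cleaned.length + " after");
    }
}
```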
public void tokenize(String s)
public String getRevision()
public static void main(String[] args)
args
- the commandline options and strings to tokenize
Copyright © 2015 University of Waikato, Hamilton, NZ. All rights reserved.