org::apache::uima::internal::util::TextStringTokenizer Class Reference

Public Member Functions

void addSeparators (String chars)
void addToEndOfSentenceChars (String chars)
void addWhitespaceChars (String chars)
void addWordChars (String chars)
int getCharType (char c)
String getToken ()
int getTokenEnd ()
int getTokenStart ()
int getTokenType ()
boolean isValid ()
void setEndOfSentenceChars (String chars)
void setSeparators (String chars)
void setShowSeparators (boolean b)
void setShowWhitespace (boolean b)
void setToFirst ()
void setToNext ()
void setWhitespaceChars (String chars)
void setWordChars (String chars)
 TextStringTokenizer (String string)

Static Public Attributes

static final int EOS = 0
static final int SEP = 1
static final int WCH = 3
static final int WSP = 2

Static Private Member Functions

static final char[] addToSortedList (String s, char[] list)
static final char[] makeSortedList (String s)

Private Attributes

final int end
char[] eosDels = new char[0]
boolean nextComputed = false
int nextTokenEnd
int nextTokenStart
int nextTokenType
int pos
char[] separators = new char[0]
boolean showSeparators = true
boolean showWhitespace = true
final String text
char[] whitespace = new char[0]
char[] wordChars = new char[0]

Detailed Description

An implementation of a text tokenizer for whitespace-separated natural language text.

The tokenizer knows about four different character classes: regular word characters, whitespace characters, sentence delimiters and separator characters. Tokens can consist of

The character classes are completely user definable. By default, whitespace characters are the Unicode whitespace characters. All other characters are word characters. The two separator classes are empty by default. The different classes may have non-empty intersections. When determining the class of a character, the user defined classes are considered in the following order: end-of-sentence delimiter before other separators before whitespace before word characters. That is, if a character is defined to be both a separator and a whitespace character, it will be considered to be a separator.
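
The lookup order described above can be sketched as follows. This is a standalone illustration, not the UIMA source: the class name CharClassSketch and its constructor are hypothetical, the constant values are the ones listed under Static Public Attributes, and the sketch assumes that user-defined word characters are checked before the default fallback and that Character.isWhitespace stands in for "Unicode whitespace".

```java
import java.util.Arrays;

// Standalone sketch, not the UIMA source: it only illustrates the lookup order.
final class CharClassSketch {
    // Constant values as listed under "Static Public Attributes".
    static final int EOS = 0;
    static final int SEP = 1;
    static final int WSP = 2;
    static final int WCH = 3;

    private final char[] eosDels;
    private final char[] separators;
    private final char[] whitespace;
    private final char[] wordChars;

    // Hypothetical constructor: one string of characters per class.
    CharClassSketch(String eos, String sep, String wsp, String word) {
        eosDels = sorted(eos);
        separators = sorted(sep);
        whitespace = sorted(wsp);
        wordChars = sorted(word);
    }

    // Sorted copy of the characters in s, so binary search can be used.
    private static char[] sorted(String s) {
        char[] chars = s.toCharArray();
        Arrays.sort(chars);
        return chars;
    }

    private static boolean contains(char[] sortedChars, char c) {
        return Arrays.binarySearch(sortedChars, c) >= 0;
    }

    // First matching class wins: EOS, then SEP, then WSP, then WCH.
    int getCharType(char c) {
        if (contains(eosDels, c)) return EOS;
        if (contains(separators, c)) return SEP;
        if (contains(whitespace, c)) return WSP;
        if (contains(wordChars, c)) return WCH;
        // Assumed default: whitespace per Character.isWhitespace, otherwise word character.
        return Character.isWhitespace(c) ? WSP : WCH;
    }

    public static void main(String[] args) {
        // '.' is both a sentence delimiter and a separator; ',' is both a separator and whitespace.
        CharClassSketch cls = new CharClassSketch(".!?", ".,;", ", ", "");
        System.out.println(cls.getCharType('.')); // 0 (EOS wins over SEP)
        System.out.println(cls.getCharType(',')); // 1 (SEP wins over WSP)
        System.out.println(cls.getCharType(' ')); // 2 (WSP)
        System.out.println(cls.getCharType('a')); // 3 (WCH)
    }
}
```

Keeping each class as a sorted array mirrors the private char[] attributes and the makeSortedList helper listed above, although whether the real class uses binary search on them is an assumption.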

By default, the tokenizer will return all tokens, including whitespace. That is, appending the sequence of tokens will recover the original input text. This behavior can be changed so that whitespace and/or separator tokens are skipped.
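
The round-trip property can be illustrated with a small standalone sketch. It is not the UIMA source: the class RoundTripSketch and its tokenize method are hypothetical, and the token rule is deliberately simplified to two classes (a token is a maximal run of whitespace or a maximal run of non-whitespace characters), with the showWhitespace parameter playing the role of setShowWhitespace.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch, not the UIMA source. Assumed token rule: a token is a
// maximal run of whitespace or a maximal run of non-whitespace characters.
final class RoundTripSketch {
    static List<String> tokenize(String text, boolean showWhitespace) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            boolean ws = Character.isWhitespace(text.charAt(pos));
            int start = pos;
            // Extend the token while the character class stays the same.
            while (pos < text.length() && Character.isWhitespace(text.charAt(pos)) == ws) {
                pos++;
            }
            // Whitespace tokens are only reported when requested.
            if (!ws || showWhitespace) {
                tokens.add(text.substring(start, pos));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "Hello  world.\nBye";
        // With all tokens shown, concatenation recovers the input.
        System.out.println(String.join("", tokenize(input, true)).equals(input));
        // With whitespace skipped, only the word tokens remain.
        System.out.println(tokenize(input, false));
    }
}
```

Because every character of the input ends up in exactly one token, the concatenation check holds for any input as long as no token class is skipped.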

A tokenizer provides a standard iterator interface similar to StringTokenizer. The validity of the iterator can be queried with hasNext(), and the next token can be queried with nextToken(). In addition, getNextTokenType() returns the type of the token as an integer. Note that you need to call getNextTokenType() before calling nextToken(), since calling nextToken() will advance the iterator.
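
The call order matters, as the following standalone sketch shows. It is not the UIMA source: TokenIteratorSketch is a hypothetical class that borrows the method names used in this description (the member list above declares isValid(), setToNext(), getToken() and getTokenType() instead, and how those map onto the described calls is not stated here). Only the WSP and WCH token types are modelled.

```java
import java.util.NoSuchElementException;

// Standalone sketch, not the UIMA source. Method names follow the description;
// only two token types (WSP, WCH) are modelled, which is a simplification.
final class TokenIteratorSketch {
    static final int WSP = 2;
    static final int WCH = 3;

    private final String text;
    private int pos = 0; // start of the token that nextToken() will return

    TokenIteratorSketch(String text) {
        this.text = text;
    }

    boolean hasNext() {
        return pos < text.length();
    }

    // Type of the token that nextToken() would return; does not advance.
    int getNextTokenType() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return Character.isWhitespace(text.charAt(pos)) ? WSP : WCH;
    }

    // Returns the next token and advances the iterator past it.
    String nextToken() {
        int type = getNextTokenType();
        int start = pos;
        while (pos < text.length() && getNextTokenType() == type) {
            pos++;
        }
        return text.substring(start, pos);
    }

    public static void main(String[] args) {
        TokenIteratorSketch it = new TokenIteratorSketch("to be");
        while (it.hasNext()) {
            int type = it.getNextTokenType(); // query the type first ...
            String token = it.nextToken();    // ... because this call advances
            System.out.println(type + " [" + token + "]");
        }
    }
}
```

Swapping the two calls inside the loop would report the type of the following token, or throw once the last token has been consumed.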

TextStringTokenizer.java,v 1.6 2003/04/07 14:50:11 goetz Exp

Definition at line 67 of file TextStringTokenizer.java.

Generated by Doxygen 1.6.0