Package edu.berkeley.nlp.lm
Interface WordIndexer<W>
- Type Parameters:
W - A type representing words in the language. Can be a String, or something more complex if needed.
- All Superinterfaces:
Serializable
- All Known Implementing Classes:
StringWordIndexer
Enumerates words in the vocabulary of a language model. Stores a two-way
mapping between integers and words.
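As a rough illustration of the two-way mapping described above, the following sketch keeps a hash map for word-to-index lookups and a list for index-to-word lookups. TwoWayWordIndex is a hypothetical toy class, not part of edu.berkeley.nlp.lm; it assumes String words and assumes indices are handed out in insertion order starting from 0, which the interface itself does not promise.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy class for illustration only; not the library's implementation.
class TwoWayWordIndex {
    // word -> index
    private final Map<String, Integer> wordToIndex = new HashMap<>();
    // index -> word
    private final List<String> indexToWord = new ArrayList<>();

    // Returns the existing index for a word, or assigns the next free one.
    int getOrAddIndex(String word) {
        Integer existing = wordToIndex.get(word);
        if (existing != null) {
            return existing;
        }
        int index = indexToWord.size();
        wordToIndex.put(word, index);
        indexToWord.add(word);
        return index;
    }

    // Reverse direction: index -> word.
    String getWord(int index) {
        return indexToWord.get(index);
    }

    // Number of distinct words added so far.
    int numWords() {
        return indexToWord.size();
    }
}
```

Keeping both structures in step means a lookup in either direction takes expected constant time, at the cost of storing each word reference twice.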
- Author:
- adampauls
-
Nested Class Summary
Nested Classes -
Method Summary
| Modifier and Type | Method | Description |
|---|---|---|
| W | getEndSymbol() | Returns the end symbol (usually something like </s>). |
| int | getIndexPossiblyUnk(W word) | Should never add to the vocabulary, and should return the index of the unk symbol (see getUnkSymbol()) if the word is not in the vocabulary. |
| int | getOrAddIndex(W word) | Gets the index for a word, adding if necessary. |
| int | getOrAddIndexFromString(String word) | |
| W | getStartSymbol() | Returns the start symbol (usually something like <s>). |
| W | getUnkSymbol() | Returns the unk symbol (usually something like <unk>). |
| W | getWord(int index) | Gets the word object for an index. |
| int | numWords() | Number of words that have been added so far. |
| void | setEndSymbol(W sym) | |
| void | setStartSymbol(W sym) | |
| void | setUnkSymbol(W sym) | |
| void | trimAndLock() | Informs the implementation that no more words can be added to the vocabulary. |
-
Method Details
-
getOrAddIndex
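int getOrAddIndex(W word)
Gets the index for a word, adding the word to the vocabulary if necessary.
- Parameters:
word - the word to look up or add
- Returns:
the index of the word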
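A typical use of getOrAddIndex is to build the vocabulary while reading training text. The sketch below shows that pattern under stated assumptions: VocabularyBuilderSketch and its indexTokens helper are hypothetical names, and a plain HashMap stands in for a real WordIndexer implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of vocabulary building; a HashMap stands in for a WordIndexer.
class VocabularyBuilderSketch {
    private final Map<String, Integer> wordToIndex = new HashMap<>();

    // Same contract as the documented method: look up, adding if necessary.
    int getOrAddIndex(String word) {
        Integer existing = wordToIndex.get(word);
        if (existing != null) {
            return existing;
        }
        int index = wordToIndex.size(); // assumption: next free index
        wordToIndex.put(word, index);
        return index;
    }

    // Maps every token to its index, growing the vocabulary as new words appear.
    int[] indexTokens(List<String> tokens) {
        int[] indices = new int[tokens.size()];
        for (int i = 0; i < tokens.size(); i++) {
            indices[i] = getOrAddIndex(tokens.get(i));
        }
        return indices;
    }
}
```

Repeated tokens map to the same index, so the resulting array can be compared or counted without touching the original word objects.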
-
getOrAddIndexFromString
int getOrAddIndexFromString(String word)
-
getIndexPossiblyUnk
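int getIndexPossiblyUnk(W word)
Looks up the index for a word without modifying the vocabulary. Implementations should never add the word, and should return the index of the unk symbol (see getUnkSymbol()) if the word is not in the vocabulary.
- Parameters:
word - the word to look up
- Returns:
the index of the word, or the index of the unk symbol if the word is not in the vocabulary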
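The read-only lookup with an unk fallback might be sketched as follows. UnkLookupSketch is a hypothetical class, and the choice of -1 as the fallback before any unk symbol has been set is an assumption of this sketch rather than documented behavior.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; a real implementation may differ.
class UnkLookupSketch {
    private final Map<String, Integer> wordToIndex = new HashMap<>();
    // Assumption: -1 until an unk symbol has been registered.
    private int unkIndex = -1;

    int getOrAddIndex(String word) {
        Integer existing = wordToIndex.get(word);
        if (existing != null) {
            return existing;
        }
        int index = wordToIndex.size();
        wordToIndex.put(word, index);
        return index;
    }

    // Registers the unk symbol and remembers its index for later fallbacks.
    void setUnkSymbol(String sym) {
        unkIndex = getOrAddIndex(sym);
    }

    // Never adds to the vocabulary; unknown words map to the unk index.
    int getIndexPossiblyUnk(String word) {
        Integer existing = wordToIndex.get(word);
        return existing != null ? existing : unkIndex;
    }

    int numWords() {
        return wordToIndex.size();
    }
}
```

This is the method to call at query time, when unseen words must not grow the vocabulary.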
-
getWord
W getWord(int index)
Gets the word object for an index.
- Parameters:
index - the index of a word in the vocabulary
- Returns:
the word object stored at that index
-
numWords
int numWords()
Number of words that have been added so far.
- Returns:
the number of words in the vocabulary
-
getStartSymbol
W getStartSymbol()
Returns the start symbol (usually something like <s>).
- Returns:
the start symbol
-
setStartSymbol
void setStartSymbol(W sym)
-
getEndSymbol
W getEndSymbol()
Returns the end symbol (usually something like </s>).
- Returns:
the end symbol
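In language modeling, the start and end symbols are commonly used to mark sentence boundaries before words are converted to indices. The sketch below illustrates that convention only; SentencePadding and toPaddedIndices are hypothetical names, and the lookup function passed in stands in for whichever indexer method the caller prefers.

```java
import java.util.List;
import java.util.function.ToIntFunction;

// Hypothetical helper, not part of the documented interface.
class SentencePadding {
    // Wraps a tokenized sentence in start/end symbols and maps each token to an index.
    static int[] toPaddedIndices(List<String> tokens, String start, String end,
                                 ToIntFunction<String> indexOf) {
        int[] out = new int[tokens.size() + 2];
        out[0] = indexOf.applyAsInt(start);
        for (int i = 0; i < tokens.size(); i++) {
            out[i + 1] = indexOf.applyAsInt(tokens.get(i));
        }
        out[out.length - 1] = indexOf.applyAsInt(end);
        return out;
    }
}
```

Passing the lookup as a function keeps the helper independent of whether the caller wants add-on-miss or unk-on-miss behavior.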
-
setEndSymbol
void setEndSymbol(W sym)
-
getUnkSymbol
W getUnkSymbol()
Returns the unk symbol (usually something like <unk>).
- Returns:
the unk symbol
-
setUnkSymbol
void setUnkSymbol(W sym)
-
trimAndLock
void trimAndLock()
Informs the implementation that no more words can be added to the vocabulary. Implementations may perform some space optimization, and should trigger an error if an attempt is made to add a word after this point.
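One way an implementation could honor this contract is sketched below. LockableWordIndex is a hypothetical class; trimming the backing list and throwing IllegalStateException are assumptions of this sketch, since the interface only asks that some error be triggered.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the lock-after-build contract.
class LockableWordIndex {
    private final Map<String, Integer> wordToIndex = new HashMap<>();
    private final ArrayList<String> indexToWord = new ArrayList<>();
    private boolean locked = false;

    int getOrAddIndex(String word) {
        Integer existing = wordToIndex.get(word);
        if (existing != null) {
            return existing; // lookups of known words still work after locking
        }
        if (locked) {
            // Assumption: the "error" is an unchecked exception.
            throw new IllegalStateException("Vocabulary is locked; cannot add: " + word);
        }
        int index = indexToWord.size();
        wordToIndex.put(word, index);
        indexToWord.add(word);
        return index;
    }

    // Releases spare list capacity, then refuses any further additions.
    void trimAndLock() {
        indexToWord.trimToSize();
        locked = true;
    }
}
```

Failing loudly on a late addition surfaces bugs where query-time code calls getOrAddIndex instead of getIndexPossiblyUnk.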
-