Package edu.berkeley.nlp.lm.io
Class KneserNeyLmReaderCallback<W>
java.lang.Object
edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback<W>
- Type Parameters:
W
- All Implemented Interfaces:
ArrayEncodedNgramLanguageModel<W>, LmReader<ProbBackoffPair,ArpaLmReaderCallback<ProbBackoffPair>>, LmReaderCallback<LongRef>, NgramOrderedLmReaderCallback<LongRef>, NgramLanguageModel<W>, Serializable
public class KneserNeyLmReaderCallback<W>
extends Object
implements NgramOrderedLmReaderCallback<LongRef>, LmReader<ProbBackoffPair,ArpaLmReaderCallback<ProbBackoffPair>>, ArrayEncodedNgramLanguageModel<W>, Serializable
Class for producing a Kneser-Ney language model in ARPA format from raw text.
Confusingly, this class is both an LmReaderCallback (called from TextReader, which reads plain text) and an LmReader, which "reads" counts, produces Kneser-Ney probabilities and backoffs, and passes them on to an ArpaLmReaderCallback.
- Author:
adampauls
- See Also:
-
Nested Class Summary
Nested classes/interfaces inherited from interface edu.berkeley.nlp.lm.ArrayEncodedNgramLanguageModel
ArrayEncodedNgramLanguageModel.DefaultImplementations
Nested classes/interfaces inherited from interface edu.berkeley.nlp.lm.NgramLanguageModel
NgramLanguageModel.StaticMethods
-
Field Summary
Fields
Modifier and Type, Field, Description
protected static final float DEFAULT_DISCOUNT
protected final int lmOrder
protected final HashNgramMap<KneserNeyCountValueContainer.KneserNeyCounts> ngrams
protected final ConfigOptions opts
protected static final long serialVersionUID
protected final int startIndex
protected final WordIndexer<W> wordIndexer
    This array represents the discount used for each ngram order.
-
Constructor Summary
Constructors
Constructor, Description
KneserNeyLmReaderCallback(WordIndexer<W> wordIndexer, int maxOrder)
KneserNeyLmReaderCallback(WordIndexer<W> wordIndexer, int maxOrder, ConfigOptions opts)
-
Method Summary
Modifier and Type, Method, Description
void addNgram(int[] ngram, int startPos, int endPos, LongRef value, String words, boolean justLastWord, long[][] scratch)
void call
    Called for each n-gram
void call
void callJustLast(W[] ngram, LongRef value, long[][] scratch)
void cleanup()
    Called once all reading is done.
static double[] defaultDiscounts()
static double[] defaultMinCounts()
protected float getDiscountForOrder(int ngramOrder)
protected float getHighestOrderProb(int[] ngram, int startPos, int endPos)
int getLmOrder()
    Maximum size of n-grams stored by the model.
float getLogProb(int[] ngram)
    Equivalent to getLogProb(ngram, 0, ngram.length)
float getLogProb(int[] ngram, int startPos, int endPos)
    Calculate language model score of an n-gram.
float getLogProb(List<W> ngram)
    Scores an n-gram.
protected float getLowerOrderBackoff(int[] ngram, int startPos, int endPos)
protected float getLowerOrderProb(int[] ngram, int startPos, int endPos)
long getTotalSize()
WordIndexer<W> getWordIndexer()
    Each LM must have a WordIndexer which assigns integer IDs to each word W in the language.
void handleNgramOrderFinished(int order)
    Called when all n-grams of a given order are finished
void handleNgramOrderStarted(int order)
    Called when n-grams of a given order are started
protected float interpolateProb(int[] ngram, int startPos, int endPos)
void parse(ArpaLmReaderCallback<ProbBackoffPair> callback)
float scoreSentence(List<W> sentence)
    Scores a complete sentence, taking appropriate care with the start- and end-of-sentence symbols.
void setOovWordLogProb(float logProb)
    Sets the (log) probability for an OOV word.
-
Field Details
-
serialVersionUID
protected static final long serialVersionUID
- See Also:
-
DEFAULT_DISCOUNT
protected static final float DEFAULT_DISCOUNT
- See Also:
-
lmOrder
protected final int lmOrder
-
wordIndexer
protected final WordIndexer<W> wordIndexer
This array represents the discount used for each n-gram order. The original Kneser-Ney discounting (-ukndiscount) uses one discounting constant for each n-gram order. These constants are estimated as D = n1 / (n1 + 2*n2), where n1 and n2 are the total numbers of n-grams with exactly one and exactly two counts, respectively. For simplicity, this code uses a constant discount of 0.75 for every order. However, other discounts can be specified.
-
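As an illustration of the estimate described for this field, the following minimal sketch computes D = n1 / (n1 + 2*n2) from the counts of a single n-gram order. The class name KneserNeyDiscountSketch, the method estimateDiscount, and the fallback to 0.75 when no n-gram has a count of one or two are assumptions made for this example; they are not part of this class's API.

```java
/** Sketch only: estimates a Kneser-Ney discount for one n-gram order. */
final class KneserNeyDiscountSketch {

    /** Assumed fallback, matching the constant discount of 0.75 mentioned in the text. */
    static final double FALLBACK_DISCOUNT = 0.75;

    /**
     * @param counts observed count of every distinct n-gram of one order
     * @return n1 / (n1 + 2*n2), or the fallback if no n-gram has count 1 or 2
     */
    static double estimateDiscount(long[] counts) {
        long n1 = 0; // number of n-grams seen exactly once
        long n2 = 0; // number of n-grams seen exactly twice
        for (long c : counts) {
            if (c == 1) {
                n1++;
            } else if (c == 2) {
                n2++;
            }
        }
        if (n1 + 2 * n2 == 0) {
            return FALLBACK_DISCOUNT; // avoid dividing by zero
        }
        return (double) n1 / (n1 + 2.0 * n2);
    }

    public static void main(String[] args) {
        // Hypothetical bigram counts: three singletons, two doubletons, two frequent bigrams.
        long[] bigramCounts = {1, 1, 1, 2, 2, 5, 9};
        System.out.println("Estimated discount: " + estimateDiscount(bigramCounts));
    }
}
```

With very small corpora n1 and n2 can both be zero, which is one reason a fixed constant is a convenient default.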
ngrams
protected final HashNgramMap<KneserNeyCountValueContainer.KneserNeyCounts> ngrams
-
opts
protected final ConfigOptions opts
-
startIndex
protected final int startIndex
-
-
Constructor Details
-
KneserNeyLmReaderCallback
- Parameters:
wordIndexer -
maxOrder -
inputIsSentences - If true, input n-grams are assumed to be sentences, and all sub-n-grams of up to order maxOrder are added. If false, input n-grams are assumed to be atomic.
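The sentence mode described for inputIsSentences can be pictured with the sketch below, which enumerates every contiguous sub-n-gram of a sentence up to maxOrder. SubNgramSketch and subNgrams are hypothetical names chosen for this example. The sketch ignores start- and end-of-sentence symbols and is not the library's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch only: lists all contiguous sub-n-grams of a sentence up to a maximum order. */
final class SubNgramSketch {

    /**
     * @param sentence word IDs of one sentence
     * @param maxOrder largest n-gram length to produce
     * @return all n-grams of length 1..maxOrder, shortest orders first
     */
    static List<int[]> subNgrams(int[] sentence, int maxOrder) {
        List<int[]> out = new ArrayList<>();
        for (int order = 1; order <= maxOrder; order++) {
            // Slide a window of the current order across the sentence.
            for (int start = 0; start + order <= sentence.length; start++) {
                out.add(Arrays.copyOfRange(sentence, start, start + order));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] sentence = {7, 8, 9, 10}; // hypothetical word IDs
        for (int[] ngram : subNgrams(sentence, 3)) {
            System.out.println(Arrays.toString(ngram));
        }
    }
}
```

In the atomic mode, by contrast, each input n-gram would be counted once as given, with no windowing.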
-
KneserNeyLmReaderCallback
-
-
Method Details
-
call
-
callJustLast
-
call
Description copied from interface: LmReaderCallback
Called for each n-gram
- Specified by:
call in interface LmReaderCallback<W>
- Parameters:
ngram - The integer representation of the words as given by the provided WordIndexer
value - The value of the n-gram
words - The string representation of the n-gram (space separated)
-
addNgram
public void addNgram(int[] ngram, int startPos, int endPos, LongRef value, String words, boolean justLastWord, long[][] scratch)
- Parameters:
ngram -
startPos -
endPos -
value -
words -
-
interpolateProb
protected float interpolateProb(int[] ngram, int startPos, int endPos)
-
getHighestOrderProb
protected float getHighestOrderProb(int[] ngram, int startPos, int endPos)
-
getLowerOrderProb
protected float getLowerOrderProb(int[] ngram, int startPos, int endPos)
-
getLowerOrderBackoff
protected float getLowerOrderBackoff(int[] ngram, int startPos, int endPos)
-
getDiscountForOrder
protected float getDiscountForOrder(int ngramOrder)
-
cleanup
public void cleanup()
Description copied from interface: LmReaderCallback
Called once all reading is done.
- Specified by:
cleanup in interface LmReaderCallback<W>
-
defaultDiscounts
public static double[] defaultDiscounts()
-
defaultMinCounts
public static double[] defaultMinCounts()
-
parse
- Specified by:
parse in interface LmReader<ProbBackoffPair,ArpaLmReaderCallback<ProbBackoffPair>>
-
getWordIndexer
Description copied from interface: NgramLanguageModel
Each LM must have a WordIndexer which assigns integer IDs to each word W in the language.
- Specified by:
getWordIndexer in interface NgramLanguageModel<W>
-
handleNgramOrderFinished
public void handleNgramOrderFinished(int order)
Description copied from interface: NgramOrderedLmReaderCallback
Called when all n-grams of a given order are finished
- Specified by:
handleNgramOrderFinished in interface NgramOrderedLmReaderCallback<W>
- Parameters:
order -
-
handleNgramOrderStarted
public void handleNgramOrderStarted(int order)
Description copied from interface: NgramOrderedLmReaderCallback
Called when n-grams of a given order are started
- Specified by:
handleNgramOrderStarted in interface NgramOrderedLmReaderCallback<W>
- Parameters:
order -
-
getLmOrder
public int getLmOrder()
Description copied from interface: NgramLanguageModel
Maximum size of n-grams stored by the model.
- Specified by:
getLmOrder in interface NgramLanguageModel<W>
-
scoreSentence
Description copied from interface: NgramLanguageModel
Scores a complete sentence, taking appropriate care with the start- and end-of-sentence symbols. This is a convenience method and will generally be inefficient.
- Specified by:
scoreSentence in interface NgramLanguageModel<W>
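A rough picture of what sentence scoring involves is given by the sketch below. It pads the sentence with boundary symbols and sums the log probability of each word given up to lmOrder - 1 words of context. Everything here is an assumption made for illustration: the class SentenceScoringSketch, the literal symbols "<s>" and "</s>", and the ngramLogProb function standing in for a model. It does not reproduce this class's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Sketch only: sums n-gram log probabilities over a boundary-padded sentence. */
final class SentenceScoringSketch {

    /**
     * @param sentence     words of the sentence, without boundary symbols
     * @param lmOrder      maximum n-gram length the (hypothetical) model scores
     * @param ngramLogProb stand-in for a model: log probability of the last word
     *                     of the n-gram given the preceding words
     */
    static double scoreSentence(List<String> sentence, int lmOrder,
                                ToDoubleFunction<List<String>> ngramLogProb) {
        List<String> padded = new ArrayList<>();
        padded.add("<s>");  // assumed start-of-sentence symbol
        padded.addAll(sentence);
        padded.add("</s>"); // assumed end-of-sentence symbol

        double total = 0.0;
        // The start symbol is context only, so scoring begins with the first real word.
        for (int end = 2; end <= padded.size(); end++) {
            int start = Math.max(0, end - lmOrder);
            total += ngramLogProb.applyAsDouble(padded.subList(start, end));
        }
        return total;
    }

    public static void main(String[] args) {
        // Toy scorer that prints each n-gram and assigns it a fixed log probability.
        double score = scoreSentence(List.of("the", "cat", "sat"), 3, ngram -> {
            System.out.println(ngram);
            return -1.0;
        });
        System.out.println("Sentence score: " + score);
    }
}
```

Building a padded copy of the sentence and a sub-list per position is the kind of overhead that makes a convenience method like this slower than scoring integer arrays directly.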
-
getLogProb
Description copied from interface: NgramLanguageModel
Scores an n-gram. This is a convenience method and will generally be relatively inefficient. More efficient versions are available in ArrayEncodedNgramLanguageModel.getLogProb(int[], int, int) and ContextEncodedNgramLanguageModel.getLogProb(long, int, int, edu.berkeley.nlp.lm.ContextEncodedNgramLanguageModel.LmContextInfo).
- Specified by:
getLogProb in interface NgramLanguageModel<W>
-
getLogProb
public float getLogProb(int[] ngram, int startPos, int endPos)
Description copied from interface: ArrayEncodedNgramLanguageModel
Calculate language model score of an n-gram. Warning: if you pass in an n-gram of length greater than getLmOrder(), this call will silently ignore the extra words of context. In other words, if you pass in a 5-gram (endPos - startPos == 5) to a 3-gram model, it will only score the words from startPos + 2 to endPos.
- Specified by:
getLogProb in interface ArrayEncodedNgramLanguageModel<W>
- Parameters:
ngram - array of words in integer representation
startPos - start of the portion of the array to be read
endPos - end of the portion of the array to be read.
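The index arithmetic behind the warning for this method can be sketched as follows. ContextTruncationSketch and effectiveStart are hypothetical names introduced for this example; the sketch shows only which positions would be read under the behavior described, not how the library computes it.

```java
/** Sketch only: which start position is actually read when the n-gram is too long. */
final class ContextTruncationSketch {

    /**
     * @param startPos requested start of the n-gram (inclusive)
     * @param endPos   requested end of the n-gram (exclusive)
     * @param lmOrder  maximum n-gram length stored by the model
     * @return the first position that still falls inside the model's context window
     */
    static int effectiveStart(int startPos, int endPos, int lmOrder) {
        // Keep at most lmOrder words, counted back from endPos.
        return Math.max(startPos, endPos - lmOrder);
    }

    public static void main(String[] args) {
        // A 5-gram handed to a 3-gram model: the two earliest context words are dropped.
        System.out.println(effectiveStart(0, 5, 3));
        // A bigram handed to a 3-gram model: nothing is dropped.
        System.out.println(effectiveStart(1, 3, 3));
    }
}
```

Callers that need to detect truncation can compare endPos - startPos with getLmOrder() before scoring, since the call itself gives no indication.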
-
getLogProb
public float getLogProb(int[] ngram)
Description copied from interface: ArrayEncodedNgramLanguageModel
Equivalent to getLogProb(ngram, 0, ngram.length)
- Specified by:
getLogProb in interface ArrayEncodedNgramLanguageModel<W>
- See Also:
-
getTotalSize
public long getTotalSize()
-
setOovWordLogProb
public void setOovWordLogProb(float logProb)
Description copied from interface: NgramLanguageModel
Sets the (log) probability for an OOV word. Note that this is in general different from the log probability of the unk tag.
- Specified by:
setOovWordLogProb in interface NgramLanguageModel<W>
-