The purpose of this lab session is to get some hands-on experience with Part-of-Speech (POS) tagging.
/infres/ic2/suchanek/nlp/wikipedia_pos.txt.
You may work in any programming language (as long as it produces the desired output). In case you opt for Java, you can proceed as follows:
Create a class HiddenMarkovModel, which contains two fields:
protected Map<String, Map<String,Double>> transitionProb;
protected Map<String, Map<String,Double>> emissionProb;
The first maps a tag to a successor tag and the number of times that this transition was seen. The second maps a tag to a word and the number of times that this emission was seen.
Add a toString() method that outputs both maps in a readable format.
Write the methods
public void foundTransition(String fromTag, String toTag);
public void foundEmission(String tag, String word);
which increase the respective counter by 1.
For example, if the transitionProb map is empty and we call
foundTransition("NNP", "VB");
foundTransition("NNP", "VB");
foundTransition("NNP", "ADJ");
foundTransition("VB", "ADJ");
then the map should be NNP -> {VB -> 2, ADJ -> 1}, VB -> {ADJ -> 1}.
Add a method normalize(), which normalizes the counts in transitionProb and emissionProb to probabilities, so that the values stored for each tag sum to 1.
For example, the above map should become NNP -> {VB -> 0.667, ADJ -> 0.333}, VB -> {ADJ -> 1}.
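The counting and normalization steps above can be sketched as follows. This is only one possible way to do it, not a reference solution: the class name CountSketch is made up for illustration, only the transition map is shown (the emission map would work the same way), and normalize is written as a static helper that takes the map as an argument so that it can be reused for both fields.

```java
import java.util.Map;
import java.util.TreeMap;

public class CountSketch {
    // Outer key: tag. Inner key: successor tag. Value: count (a probability after normalizing).
    protected Map<String, Map<String, Double>> transitionProb = new TreeMap<>();

    /** Increases the counter for the transition fromTag -> toTag by 1. */
    public void foundTransition(String fromTag, String toTag) {
        transitionProb
            .computeIfAbsent(fromTag, k -> new TreeMap<>()) // create the row on first use
            .merge(toTag, 1.0, Double::sum);                // start at 1, or add 1
    }

    /** Divides every count by the total of its row, so that each row sums to 1. */
    public static void normalize(Map<String, Map<String, Double>> table) {
        for (Map<String, Double> row : table.values()) {
            double total = row.values().stream().mapToDouble(Double::doubleValue).sum();
            if (total > 0) {
                row.replaceAll((key, count) -> count / total);
            }
        }
    }
}
```

Calling foundTransition with the four example pairs and then normalize(transitionProb) should turn the NNP row into 2/3 for VB and 1/3 for ADJ.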
Hints:
BufferedReader
myString.split(X) (where, e.g., X=" " or X="/")
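The two hints above could be combined roughly as in the sketch below. It assumes, based only on the suggested separators, that each line of the corpus holds space-separated tokens of the form word/TAG; check the actual file before relying on this. The class name CorpusReaderSketch and the method read are hypothetical. The word and tag are separated at the last "/" so that a word containing a slash is not cut in the wrong place.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class CorpusReaderSketch {
    /**
     * Reads lines of space-separated "word/TAG" tokens (an assumed format)
     * and returns, for each non-empty line, its list of {word, tag} pairs.
     */
    public static List<List<String[]>> read(Reader source) {
        List<List<String[]>> sentences = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                List<String[]> sentence = new ArrayList<>();
                for (String token : line.trim().split(" ")) {
                    int slash = token.lastIndexOf('/');
                    // Skip empty or malformed tokens (no slash, no word, or no tag)
                    if (slash <= 0 || slash == token.length() - 1) continue;
                    sentence.add(new String[] {
                        token.substring(0, slash), token.substring(slash + 1) });
                }
                if (!sentence.isEmpty()) sentences.add(sentence);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sentences;
    }
}
```

To read from a file, pass a java.io.FileReader for the corpus path as the source. Each consecutive pair of tags in a sentence would then go to foundTransition, and each {word, tag} pair to foundEmission.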
/infres/ic2/suchanek/nlp/sample_pos.txt.
Explain the result of toString() of the model.
Serializable.
Write the methods
public double emissionProbability(String tag, String word);
public double transitionProbability(String fromTag, String toTag);
which return 0 if the input could not be found.
/infres/ic2/suchanek/nlp/sound_check.txt and store it on disk (using ObjectOutputStream).
Create a class Viterbi, which has a field protected HiddenMarkovModel model and a constructor that takes an HMM stored on disk as argument (use ObjectInputStream).
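Storing a model with ObjectOutputStream and loading it again with ObjectInputStream might look like the sketch below. TinyModel is a made-up stand-in for the HMM class, and the method names save and load are hypothetical; the same pattern should apply to any class that implements Serializable and whose fields are themselves serializable (HashMap, TreeMap, String, and Double all are).

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

public class SerializationSketch {
    /** Stand-in for the HMM class (hypothetical, for illustration only). */
    public static class TinyModel implements Serializable {
        private static final long serialVersionUID = 1L;
        public Map<String, Map<String, Double>> transitionProb = new HashMap<>();
    }

    /** Writes the model to the given file. */
    public static void save(TinyModel model, File file) {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(model);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads a model back from the given file. */
    public static TinyModel load(File file) {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (TinyModel) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Model class not on the classpath", e);
        }
    }
}
```

A constructor that takes the file as argument could simply assign the result of such a load call to the model field.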
Add a method public List<String> parse(String sentence), which, given a sentence, returns the list of most likely tags.
Hints:
double[][] probabilities=new double[numWords][numTags];
int[][] backpointers=new int[numWords][numTags];
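A minimal sketch of how the two arrays from the hint can be filled and read back is given below. It is not the expected solution and makes several assumptions: the class name ViterbiSketch and the helper lookup are made up, the maps are passed in directly instead of coming from a stored model, the sentence is tokenized by splitting on single spaces, and every tag is treated as equally likely at the start of the sentence (the handout does not say how to handle the first word). Plain products are used, which can underflow on long sentences; summing logarithms is a common alternative.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

public class ViterbiSketch {
    /** Returns table[outer][inner], or 0 if either key is missing. */
    private static double lookup(Map<String, Map<String, Double>> table, String outer, String inner) {
        Map<String, Double> row = table.get(outer);
        return row == null ? 0.0 : row.getOrDefault(inner, 0.0);
    }

    public static List<String> parse(String sentence,
                                     Map<String, Map<String, Double>> transitionProb,
                                     Map<String, Map<String, Double>> emissionProb) {
        String[] words = sentence.trim().split(" ");
        List<String> tags = new ArrayList<>(emissionProb.keySet());
        int numWords = words.length, numTags = tags.size();
        if (numTags == 0) return new ArrayList<>();

        double[][] probabilities = new double[numWords][numTags];
        int[][] backpointers = new int[numWords][numTags];

        // First word: assumed uniform start, so only the emission probability counts.
        for (int t = 0; t < numTags; t++) {
            probabilities[0][t] = lookup(emissionProb, tags.get(t), words[0]);
        }
        // Remaining words: keep the best-scoring predecessor for each tag.
        for (int w = 1; w < numWords; w++) {
            for (int t = 0; t < numTags; t++) {
                double emit = lookup(emissionProb, tags.get(t), words[w]);
                for (int p = 0; p < numTags; p++) {
                    double score = probabilities[w - 1][p]
                            * lookup(transitionProb, tags.get(p), tags.get(t)) * emit;
                    if (score > probabilities[w][t]) {
                        probabilities[w][t] = score;
                        backpointers[w][t] = p;
                    }
                }
            }
        }
        // Pick the best final tag, then follow the backpointers to the front.
        int best = 0;
        for (int t = 1; t < numTags; t++) {
            if (probabilities[numWords - 1][t] > probabilities[numWords - 1][best]) best = t;
        }
        LinkedList<String> result = new LinkedList<>();
        for (int w = numWords - 1; w >= 0; w--) {
            result.addFirst(tags.get(best));
            best = backpointers[w][best];
        }
        return result;
    }
}
```

Note that a word never seen in training gets emission probability 0 for every tag, so all paths through it score 0 and the returned tags are then arbitrary; some form of smoothing would be needed to handle that case.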