The purpose of this lab session is to extract structured information from a natural language text corpus.
To hand in the results, send the ZIP file (named like Fabian_SUCHANEK_and_Moussa_LO.zip) by email by Monday 18th of December 2011, 10:00 to
fabian@{LASTNAME}.name
Replace {LASTNAME} with my last name.
For these labs, we provide the data in a preprocessed version containing only the text. The format of this file is as follows: The first line is the title of the first article, while the following lines (up to the first blank line) form the content of this article, in plain text format. The second article comes after the next blank line, and so on. There are 50,441 articles in total.
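The file format described above can be read with a small loop. The following is only a sketch: the class name CorpusReaderSketch, the Article holder and the method readArticles are hypothetical names that are not part of the lab code, and the sketch assumes that a title is always the first non-blank line after a blank line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class CorpusReaderSketch {

    /** One article: its title (first line) and its content (the following lines). */
    static final class Article {
        final String title;
        final String content;

        Article(String title, String content) {
            this.title = title;
            this.content = content;
        }
    }

    /** Reads all articles from the input; a blank line ends the current article. */
    static List<Article> readArticles(Readable in) {
        List<Article> articles = new ArrayList<>();
        Scanner scanner = new Scanner(in);
        String title = null;
        StringBuilder content = new StringBuilder();
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine();
            if (line.trim().isEmpty()) {
                // Blank line: close the current article, if one is open.
                if (title != null) {
                    articles.add(new Article(title, content.toString()));
                    title = null;
                    content.setLength(0);
                }
            } else if (title == null) {
                title = line; // first non-blank line of a block is the title
            } else {
                if (content.length() > 0) content.append('\n');
                content.append(line);
            }
        }
        // The last article may not be followed by a blank line.
        if (title != null) articles.add(new Article(title, content.toString()));
        return articles;
    }

    public static void main(String[] args) {
        String sample = "Alan Turing\nAlan Turing was an English mathematician.\n\n"
                + "Afghanistan\nAfghanistan is a landlocked country.\n";
        for (Article a : readArticles(new java.io.StringReader(sample))) {
            System.out.println(a.title + " -> " + a.content);
        }
    }
}
```

To read the real corpus, you could pass a java.io.FileReader for the data file instead of the StringReader used in main.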
Hint: Regular expressions work as follows in Java:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern pattern = Pattern.compile(EXPRESSION);
Matcher matcher = pattern.matcher(STRING);
if (matcher.find()) {
    // matcher.group(N) holds the N-th group of the match (N starts at 1)
    // matcher.group() holds the entire match
}
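As a concrete instance of the hint, here is a minimal sketch with an actual expression filled in. The class name RegexHintDemo, the pattern and the method firstYear are made up for illustration and are not part of the lab code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexHintDemo {
    // Hypothetical pattern: a capitalized word, a day number, a comma, a four-digit year.
    static final Pattern MONTH_DAY_YEAR =
            Pattern.compile("([A-Z][a-z]+) (\\d{1,2}), (\\d{4})");

    /** Returns the year of the first "Word D, YYYY" match, or null if there is none. */
    static String firstYear(String text) {
        Matcher matcher = MONTH_DAY_YEAR.matcher(text);
        if (matcher.find()) {
            // group(1) is the word, group(2) the day, group(3) the year,
            // and group() would be the entire match.
            return matcher.group(3);
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(firstYear("He died on June 7, 1954 in Wilmslow."));
    }
}
```

Note that backslashes have to be doubled inside Java string literals, so the regular expression \d is written as "\\d".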
In InformationExtraction, there is the method String extractConcept(String sentence). This method receives the first sentence of a Wikipedia article (without brackets). For example, for the Wikipedia article about Alan Turing, the method will receive the string "Alan Turing was an English mathematician and computer scientist.". Write code that extracts the concept of which the article entity is an instance. In the example, this is the string "English mathematician". The method should return that string.
Try to make your code smarter to improve the precision. The higher the precision, the better. You may use all types of tricks in the code, but do not tailor the code to the first 20 articles. Hand in your measure of precision and the first 50 results.
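One possible starting point for extractConcept is a pattern of the form "X is a Y". The sketch below is a baseline under stated assumptions, not a reference solution: the class name ConceptSketch and the list of stop words (and, or, who, which, that, of, in, from, for) are arbitrary choices, and returning null when nothing matches is an assumption about the expected behavior.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConceptSketch {
    // A form of "to be" plus an article, then the shortest phrase that ends
    // before a connective word, a punctuation mark, or the end of the sentence.
    static final Pattern IS_A = Pattern.compile(
            "\\b(?:is|was|are|were)\\s+(?:an?|the)\\s+([^,.;]+?)"
            + "(?=\\s+(?:and|or|who|which|that|of|in|from|for)\\b|[,.;]|$)");

    /** Returns the phrase after "is a", "was an", ..., or null if the pattern is absent. */
    static String extractConcept(String sentence) {
        Matcher matcher = IS_A.matcher(sentence);
        return matcher.find() ? matcher.group(1).trim() : null;
    }

    public static void main(String[] args) {
        System.out.println(extractConcept(
                "Alan Turing was an English mathematician and computer scientist."));
    }
}
```

This baseline misses sentences that do not use a form of "to be" followed by an article, so there is room to improve it.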
In InformationExtraction, there is the method String extractDate(String sentence). This method receives the first sentence of a Wikipedia article (with brackets). For example, for the Wikipedia article about Alan Turing, the method will receive the string "Alan Mathison Turing (June 23rd, 1912 - June 7, 1954) was an English mathematician and computer scientist.". Write code that extracts the first date mentioned in this sentence. In the example, this is "June 23rd, 1912". (Optionally: normalize the date to "1912-06-23"). The method should return that date as a string. Measure the precision on 20 results manually. Hand in your measure of precision and the first 50 results.
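A date of the form used in the example can be captured with one regular expression. The following sketch assumes English month names followed by a day and a year. The class name DateSketch and the helper normalize are hypothetical, and other date formats (for instance "23 June 1912" or a year alone) are deliberately not covered.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateSketch {
    static final List<String> MONTHS = Arrays.asList(
            "January", "February", "March", "April", "May", "June", "July",
            "August", "September", "October", "November", "December");

    // Month name, day with optional ordinal suffix, optional comma, four-digit year.
    static final Pattern DATE = Pattern.compile(
            "\\b(" + String.join("|", MONTHS) + ")\\s+(\\d{1,2})(?:st|nd|rd|th)?,?\\s+(\\d{4})\\b");

    /** Returns the first date as written in the sentence, or null if none is found. */
    static String extractDate(String sentence) {
        Matcher matcher = DATE.matcher(sentence);
        return matcher.find() ? matcher.group() : null;
    }

    /** Optional step: converts a matched date to the form YYYY-MM-DD. */
    static String normalize(String date) {
        if (date == null) return null;
        Matcher matcher = DATE.matcher(date);
        if (!matcher.find()) return null;
        int month = MONTHS.indexOf(matcher.group(1)) + 1; // January is month 1
        int day = Integer.parseInt(matcher.group(2));
        int year = Integer.parseInt(matcher.group(3));
        return String.format(Locale.ROOT, "%04d-%02d-%02d", year, month, day);
    }

    public static void main(String[] args) {
        String sentence = "Alan Mathison Turing (June 23rd, 1912 - June 7, 1954) "
                + "was an English mathematician and computer scientist.";
        String date = extractDate(sentence);
        System.out.println(date + " / " + normalize(date));
    }
}
```

Because find() returns the leftmost match, the first date in the sentence is returned even when a second date follows.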
Write a method extractLocation(), which extracts the location of an entity. For example, your code should produce the result "<Afghanistan, is_located_in, South Asia>". Measure the precision on 20 results manually. Hand in your measure of precision and the first 50 results.
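A simple baseline for the location task is to look for the word "in" followed by a capitalized phrase. The sketch below makes several assumptions that the task description does not fix: the signature with a title and a sentence parameter, the class name LocationSketch, and the choice to return null when no location is found are all hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LocationSketch {
    // The word "in" followed by one or more capitalized words, e.g. "in South Asia".
    static final Pattern IN_PLACE = Pattern.compile(
            "\\bin\\s+([A-Z][\\p{L}-]*(?:\\s+[A-Z][\\p{L}-]*)*)");

    /** Builds a triple of the form <title, is_located_in, place>, or returns null. */
    static String extractLocation(String title, String sentence) {
        Matcher matcher = IN_PLACE.matcher(sentence);
        if (!matcher.find()) return null;
        return "<" + title + ", is_located_in, " + matcher.group(1) + ">";
    }

    public static void main(String[] args) {
        System.out.println(extractLocation("Afghanistan",
                "Afghanistan is a landlocked country in South Asia."));
    }
}
```

This pattern also fires on phrases such as "in June", so it will produce false positives; filtering out month names, or requiring a word such as "located" or "country" earlier in the sentence, are possible ways to raise the precision.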