Information Extraction

Fabian Suchanek (fabian@{LASTNAME}.name)

Gaston Berger University, Senegal, Winter 2011

 

The purpose of this lab session is to extract structured information from a natural language text corpus.

To hand in the results,

  1. Create a ZIP file that has your names in the filename, for example
       Fabian_SUCHANEK_and_Moussa_LO.zip
  2. Put your entire code into that ZIP file.
  3. Put the results that the exercise asks for into the ZIP file (NOT: all results).
Send the ZIP file by email no later than Monday 18th of December 2011, 10:00, to
   fabian@{LASTNAME}.name
Replace {LASTNAME} by my last name.

Data Set

Our corpus will be the Simple English Wikipedia, a simpler and smaller encyclopedia than the regular English Wikipedia.

For these labs, we provide the data in a preprocessed version containing only the text. The format of this file is as follows: The first line is the title of the first article, while the following lines (up to the first blank line) form the content of this article, in plain text format. The second article comes after the next blank line, and so on. There are 50,441 articles in total.
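The file format described above can be read with a small loop over the lines. The following is only a sketch: the class name CorpusReaderSketch, the nested Article type and the method readArticles are names invented for this illustration and are not part of the provided classes. It assumes that the content lines of an article can be joined with single spaces.

```java
import java.util.ArrayList;
import java.util.List;

final class CorpusReaderSketch {

    /** One article: the title line plus all content lines joined by spaces. */
    static final class Article {
        final String title;
        final String content;

        Article(String title, String content) {
            this.title = title;
            this.content = content;
        }
    }

    /** Splits the corpus lines into articles; a blank line ends an article. */
    static List<Article> readArticles(Iterable<String> lines) {
        List<Article> articles = new ArrayList<>();
        String title = null;                       // null means "between articles"
        StringBuilder content = new StringBuilder();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                if (title != null) {               // blank line closes the current article
                    articles.add(new Article(title, content.toString()));
                    title = null;
                    content.setLength(0);
                }
            } else if (title == null) {
                title = trimmed;                   // first non-blank line is the title
            } else {
                if (content.length() > 0) content.append(' ');
                content.append(trimmed);
            }
        }
        if (title != null) {                       // last article may lack a trailing blank line
            articles.add(new Article(title, content.toString()));
        }
        return articles;
    }
}
```

To run it on the real corpus, the lines could be loaded with Files.readAllLines(Paths.get("corpus.txt"), StandardCharsets.UTF_8), where "corpus.txt" is a placeholder for wherever you saved the download.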

Code

In the following, I give the instructions for Java. However, you may work in any programming language, as long as your program delivers the desired output. In Java, you can use the following classes:

Hint: Regular expressions work as follows in Java:

Pattern pattern=Pattern.compile(EXPRESSION);
Matcher matcher=pattern.matcher(STRING);
if(matcher.find()) {
  // matcher.group(N) holds the N-th group of the match
  // matcher.group() holds the entire match
}
Use named regular expressions.
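One possible reading of this hint (an assumption on my part) is to keep each compiled pattern in a named constant and to refer to its groups by name instead of by number. Java supports named groups with the syntax (?<name>...) and matcher.group("name"). The class NamedGroupSketch, the method birthYear and the example sentence below are made up for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NamedGroupSketch {

    // The pattern itself has a name (BORN), and so do its groups (name, year).
    private static final Pattern BORN = Pattern.compile(
        "(?<name>[A-Z][a-z]+(?: [A-Z][a-z]+)*) was born in (?<year>\\d{4})");

    /** Returns the four-digit year, or null if the text does not match. */
    static String birthYear(String text) {
        Matcher m = BORN.matcher(text);
        return m.find() ? m.group("year") : null;
    }
}
```

With this sketch, birthYear("Alan Turing was born in 1912.") yields "1912". Named groups keep the code readable when a pattern grows and the group numbers shift.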

Tasks

For all of the following, go for precision rather than recall.
  1. Download the Wikipedia corpus. Have a look at the corpus.
  2. Download the provided classes. Have a look at these classes.
  3. In the class InformationExtraction, there is the method String extractConcept(String sentence). This method receives the first sentence of a Wikipedia article (without brackets). For example, for the Wikipedia article about Alan Turing, the method will receive the string "Alan Turing was an English mathematician and computer scientist.". Write code that extracts the concept of which the article entity is an instance. In the example, this is the string "English mathematician". The method should return that string.
  4. Measure the precision of your output manually on the first 20 results. A result is correct if it is the answer to the question "What is X? X is a ...". For Alan Turing, the following answers are wrong: Analogously, we do not want results such as "kind", "way", "word", "member", "part", and so on. We want only clean concept names.

    Try to make your code smarter to improve the precision. The higher the precision, the better. You may use all types of tricks in the code, but do not tailor the code to the first 20 articles. Hand in your measure of precision and the first 50 results.

  5. In the class InformationExtraction, there is the method String extractDate(String sentence). This method receives the first sentence of a Wikipedia article (with brackets). For example, for the Wikipedia article about Alan Turing, the method will receive the string "Alan Mathison Turing (June 23rd, 1912 - June 7, 1954) was an English mathematician and computer scientist.". Write code that extracts the first date mentioned in this sentence. In the example, this is "June 23rd, 1912". (Optionally: normalize the date to "1912-06-23"). The method should return that date as a string. Measure the precision on 20 results manually. Hand in your measure of precision and the first 50 results.
  6. Optional: Write a method extractLocation(), which extracts the location of an entity. For example, your code should produce the result "<Afghanistan, is_located_in, South Asia>". Measure the precision on 20 results manually. Hand in your measure of precision and the first 50 results.
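
As a starting point for tasks 3 to 6, the methods could look roughly like the sketch below. It is not a reference solution. The class name ExtractionSketch, the helper normalizeDate, the chosen stop words and the String parameter of extractLocation are all assumptions made for this illustration. In line with "precision rather than recall", each method returns null when its pattern does not match. The date pattern covers only the "Month day, year" format from the example, and the location pattern assumes a sentence of the form "X is a ... in Y", which will certainly miss many articles.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ExtractionSketch {

    // Task 3: "... is/was a/an/the <concept>"; the concept stops at the first
    // conjunction, preposition, relative pronoun, or punctuation mark.
    private static final Pattern CONCEPT = Pattern.compile(
        "\\b(?:is|was|are|were) (?:an?|the) (?<concept>[^,.;]+?)"
        + "(?= (?:and|or|who|that|which|of|in|from|for|with)\\b|[,.;]|$)");

    // Task 4: head nouns that are too vague to count as a concept.
    private static final Set<String> VAGUE = new HashSet<>(
        Arrays.asList("kind", "way", "word", "member", "part"));

    static String extractConcept(String sentence) {
        Matcher m = CONCEPT.matcher(sentence);
        if (!m.find()) return null;               // no answer rather than a guess
        String concept = m.group("concept").trim();
        String head = concept.substring(concept.lastIndexOf(' ') + 1);
        return VAGUE.contains(head.toLowerCase(Locale.ROOT)) ? null : concept;
    }

    private static final List<String> MONTHS = Arrays.asList(
        "January", "February", "March", "April", "May", "June", "July",
        "August", "September", "October", "November", "December");

    // Task 5: only the "Month day, year" format, with an optional ordinal suffix.
    private static final Pattern DATE = Pattern.compile(
        "(?<month>" + String.join("|", MONTHS) + ")"
        + " (?<day>\\d{1,2})(?:st|nd|rd|th)?, (?<year>\\d{4})");

    static String extractDate(String sentence) {
        Matcher m = DATE.matcher(sentence);
        return m.find() ? m.group() : null;       // find() returns the leftmost match
    }

    // Optional part of task 5: turns "June 23rd, 1912" into "1912-06-23".
    static String normalizeDate(String date) {
        Matcher m = DATE.matcher(date);
        if (!m.matches()) return null;
        int month = MONTHS.indexOf(m.group("month")) + 1;
        int day = Integer.parseInt(m.group("day"));
        return String.format(Locale.ROOT, "%s-%02d-%02d", m.group("year"), month, day);
    }

    // Task 6: "<Entity> is [a/an] [lowercase words] in <Capitalized Place>".
    private static final Pattern LOCATION = Pattern.compile(
        "^(?<entity>[A-Z][^,.()]*?) is (?:an? )?(?:[a-z-]+ )*?in "
        + "(?<place>[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*)");

    static String extractLocation(String sentence) {
        Matcher m = LOCATION.matcher(sentence);
        if (!m.find()) return null;
        return "<" + m.group("entity") + ", is_located_in, " + m.group("place") + ">";
    }
}
```

On the two Alan Turing sentences from the tasks, this sketch returns "English mathematician" and "June 23rd, 1912". On a made-up sentence such as "Afghanistan is a country in South Asia." it returns "<Afghanistan, is_located_in, South Asia>". Expect to refine the stop words and patterns once you look at real results.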