Information Extraction

Fabian Suchanek (fabian@{LASTNAME}.name)

Winter 2011

 

The purpose of this lab session is to extract structured information from a natural language text corpus.

Hand in

By email, no later than Monday, 12 December 2011, 10:00.

Data Set

Our corpus will be the Simple English Wikipedia, a simpler and smaller encyclopedia than the regular English Wikipedia.

For these labs, we provide the data in a preprocessed version containing only the text. The format of this file is as follows: The first line is the title of the first article, while the following lines (up to the first blank line) form the content of this article, in plain text format. The second article comes after the next blank line, and so on. There are 50,441 articles in total.
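To make the file format concrete, here is a minimal sketch of how such a file could be split into articles. It is only an illustration of the format described above, not the provided Parser class; the names CorpusFormatSketch, Article and readArticles are hypothetical, and the two sample articles are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration of the corpus format only; use the provided Parser class for the lab itself.
class CorpusFormatSketch {

    /** Hypothetical stand-in for an article: its title and its plain-text content. */
    static final class Article {
        final String title;
        final String content;

        Article(String title, String content) {
            this.title = title;
            this.content = content;
        }
    }

    /** Splits the lines of a corpus file into articles, which are separated by blank lines. */
    static List<Article> readArticles(List<String> lines) {
        List<Article> articles = new ArrayList<>();
        String title = null;                    // null means: the next non-blank line is a title
        StringBuilder content = new StringBuilder();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                // A blank line closes the current article, if there is one.
                if (title != null) {
                    articles.add(new Article(title, content.toString()));
                    title = null;
                    content.setLength(0);
                }
            } else if (title == null) {
                title = line;                   // first line of an article is its title
            } else {
                if (content.length() > 0) {
                    content.append('\n');
                }
                content.append(line);           // all further lines form the content
            }
        }
        // The last article may not be followed by a blank line.
        if (title != null) {
            articles.add(new Article(title, content.toString()));
        }
        return articles;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "Leicester",
            "Leicester is a city in England.",
            "",
            "Hollywood",
            "Hollywood is a district in Los Angeles.");
        for (Article article : readArticles(sample)) {
            System.out.println(article.title + " -> " + article.content);
        }
    }
}
```

For a real file, the lines could be obtained with java.nio.file.Files.readAllLines, although for a corpus of this size a streaming reader such as the provided Parser is the more natural choice.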

We provide the following classes:

  1. a class Page, which represents a page in Wikipedia.
  2. a class Parser, which allows iterating over the pages of a Wikipedia corpus.
  3. a class Triple, which represents a triple of subject, predicate and object.

Tasks

For all of the following, go for precision rather than recall.
  1. Create an abstract class Extractor, which has an abstract method extract(Page):Triple. Create a subclass, TestExtractor, which produces triples of the form <PageTitle, "is", "processed">. Create a class Main, which has only one method: run. This method takes as input a corpus file, a target file and a list of extractors. It iterates over all pages in the corpus (using the provided Parser class). For each page, it calls all extractors and writes the resulting triples to the target file (one triple per line). Test this method with the TestExtractor on the first 50 pages of the corpus.
  2. Write an extractor DateExtractor that uses a regular expression to find the first date mentioned in the article. Let it return a triple of the form <PageTitle, "hasDate", Date>. Try the extractor with the pages of Elvis Presley and Alan Turing. If you are adventurous, try normalizing the dates you extract to the form [-]YYYY-MM-DD. Regular expressions work as follows in Java:
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    Pattern pattern = Pattern.compile(EXPRESSION);
    Matcher matcher = pattern.matcher(STRING);
    while (matcher.find()) {
      // matcher.group(N) holds the N-th capturing group of the match (numbered from 1)
      // matcher.group() holds the entire match
    }
    Use named regular expressions. Manually measure the precision (the fraction of extracted triples that are correct) on 20 output pairs and hand in the results.
  3. Write an extractor TypeExtractor that extracts the type of the article entity ("Leicester is a city"). Manually exclude terms that are too abstract ("member of...", "way of..."). The results should be triples of the form <PageTitle, "is_a", Type>. A triple is correct if Type is the answer to the question "What is PageTitle?" (for example, "What is Leicester?" is answered by "Leicester is a city!").
    Manually measure the precision on 20 output pairs and hand in the results.
  4. Write an extractor LocationExtractor that extracts the location of a place ("Hollywood is a district in Los Angeles"). Alternatively or additionally, write a TypeAndLocationExtractor, which first calls the TypeExtractor, checks whether the article entity is a city, district, etc., and, if so, extracts the location. Manually measure the precision on 20 output pairs and hand in the results.
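
As an illustration of the regular-expression mechanics in Task 2, the following sketch finds the first full date in a text with named capturing groups and normalizes it to YYYY-MM-DD. It is one possible approach under stated assumptions, not a reference solution: the class DateSketch and the method firstDate are hypothetical names, the two date spellings ("23 June 1912" and "January 8, 1935") are assumptions about how dates appear in the corpus, and the sample sentences are made up. It does not handle years before the common era or incomplete dates, which favors precision over recall.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: first full date in a text, normalized to YYYY-MM-DD.
class DateSketch {

    private static final List<String> MONTHS = Arrays.asList(
        "january", "february", "march", "april", "may", "june",
        "july", "august", "september", "october", "november", "december");

    private static final String MONTH =
        "January|February|March|April|May|June|July|August|September|October|November|December";

    // Assumed spellings: "23 June 1912" (day first) and "January 8, 1935" (month first).
    // Java does not allow the same group name twice, hence the suffixes 1 and 2.
    private static final Pattern DATE = Pattern.compile(
        "\\b(?:(?<day1>\\d{1,2}) (?<month1>" + MONTH + ") (?<year1>\\d{4})"
        + "|(?<month2>" + MONTH + ") (?<day2>\\d{1,2}), (?<year2>\\d{4}))\\b");

    /** Returns the first date found in the text as YYYY-MM-DD, or null if none is found. */
    static String firstDate(String text) {
        Matcher matcher = DATE.matcher(text);
        if (!matcher.find()) {
            return null;
        }
        // Exactly one of the two alternatives matched; pick its groups.
        boolean dayFirst = matcher.group("day1") != null;
        String day = dayFirst ? matcher.group("day1") : matcher.group("day2");
        String month = dayFirst ? matcher.group("month1") : matcher.group("month2");
        String year = dayFirst ? matcher.group("year1") : matcher.group("year2");
        int monthNumber = MONTHS.indexOf(month.toLowerCase(Locale.ROOT)) + 1;
        // Note: the day is not checked against the length of the month.
        return String.format(Locale.ROOT, "%04d-%02d-%02d",
            Integer.parseInt(year), monthNumber, Integer.parseInt(day));
    }

    public static void main(String[] args) {
        // prints 1935-01-08
        System.out.println(firstDate("Elvis Presley was born on January 8, 1935."));
        // prints 1912-06-23
        System.out.println(firstDate("Alan Turing was born on 23 June 1912."));
    }
}
```

Inside an extractor, the returned string would become the object of the <PageTitle, "hasDate", Date> triple; pages for which firstDate returns null would simply produce no triple.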