Labelling Web Documents using Statistical Topic Models Based on Priors Extracted from Wikipedia

Project Type: PDF-led

This project will develop algorithms to automatically generate descriptive labels for large collections of web documents.

Project Leader(s): 

Postdoctoral fellow: Dr. Mathieu Sinn, David R. Cheriton School of Computer Science, University of Waterloo

Lead faculty member: Dr. Pascal Poupart, David R. Cheriton School of Computer Science, University of Waterloo

We will develop algorithms to automatically generate descriptive labels for large collections of web documents. Companies can use such labels to decide which websites to place advertisements on, and electronic publishers can use them to categorize media offers. Currently, no approach can robustly and automatically label clusters of documents at a quality approaching that of human labellers. Because the main difficulties are capturing the underlying concepts of a group of documents and expressing them in a short, human-readable phrase, we will develop statistical topic models that leverage the online encyclopaedia Wikipedia to produce high-quality labels. Google has an immediate need for such an approach to improve its internal use of document clusters and may develop new commercial services that depend on the availability of a fully automated, high-quality labelling technique.

Non-academic participants: