November 10, Wednesday
12:00 – 13:30

The Bare Essentials - Non-redundant Corpus Construction
Graduate seminar
Lecturer : Rafi Cohen
Affiliation : CS, BGU
Location : 202/37
Host : Graduate Seminar
Can we use statistical Natural Language Processing methods on redundant data? Document collections (corpora) may include a large amount of redundancy due to copied text; this phenomenon is common in news articles and Electronic Health Records. Methods for detecting and handling redundancy are common in Bioinformatics, where they are used to create sequence databases, as well as in plagiarism detection. We will show that redundant text may bias statistical methods for processing such corpora, and present a robust heuristic for identifying a non-biased subset of a corpus.