
January 3, Tuesday
12:00 – 14:00

Clustering verbs into semantic classes based on their subcategorisation frame distribution
Computer Science seminar
Lecturer : Mr. Yuval Krymolowski
Lecturer homepage : http://www.cl.haifa.ac.il/~yuval/
Affiliation : School of Informatics, University of Edinburgh
Location : -101/58
Host : Dr. Michael Elkin
This is joint work with Anna Korhonen (Computer Laboratory, Cambridge University).

This work touches on the relation between syntax and semantics. It stems from the work of Beth Levin in "English Verb Classes and Alternations" (1993): 'If the syntactic properties of a verb indeed follow in large part from its meaning, then it should be possible to identify general principles that derive the behavior of a verb from its meaning.' (p. 11)

In our work, we observe the syntactic behaviour of verbs in a corpus and derive the semantic classes of these verbs from it. The behaviour is described by subcategorisation frames. The subcategorisation frame (SCF) of a verb is the syntactic structure of its arguments. For example, the SCF of a transitive verb is "NP" because it takes a direct object, which is a noun phrase (NP).

We parse the texts and observe the SCF distribution of each verb. We then cluster the verbs using the information bottleneck method as well as a very naive nearest neighbour method.
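As a rough illustration of the nearest neighbour idea only (not the information bottleneck method, and not necessarily the procedure used in this work), the sketch below turns SCF counts into distributions and links each verb to its closest neighbour. The counts, the frame labels, the function names and the choice of Jensen-Shannon divergence as the distance are all assumptions made for this example.

```python
import math
from collections import Counter

# Hypothetical SCF counts per verb: toy numbers, not taken from any corpus.
scf_counts = {
    "give":  Counter({"NP": 20, "NP_NP": 50, "NP_PP": 30}),
    "send":  Counter({"NP": 25, "NP_NP": 40, "NP_PP": 35}),
    "sleep": Counter({"INTRANS": 90, "PP": 10}),
    "walk":  Counter({"INTRANS": 70, "PP": 30}),
}

def to_distribution(counts):
    """Normalise raw frame counts into a probability distribution."""
    total = sum(counts.values())
    return {frame: n / total for frame, n in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two SCF distributions.

    Assumed distance measure; the abstract does not name one.
    """
    frames = set(p) | set(q)
    m = {f: 0.5 * (p.get(f, 0.0) + q.get(f, 0.0)) for f in frames}

    def kl(a):
        return sum(a[f] * math.log2(a[f] / m[f]) for f in a if a[f] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)

def nearest_neighbours(dists):
    """Map each verb to the other verb with the most similar distribution."""
    return {
        v: min((u for u in dists if u != v),
               key=lambda u: js_divergence(p, dists[u]))
        for v, p in dists.items()
    }

def cluster_by_nearest_neighbour(dists):
    """Group verbs that are connected through nearest-neighbour links."""
    parent = {v: v for v in dists}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path compression
            v = parent[v]
        return v

    for v, u in nearest_neighbours(dists).items():
        parent[find(v)] = find(u)

    clusters = {}
    for v in dists:
        clusters.setdefault(find(v), set()).add(v)
    return sorted(sorted(c) for c in clusters.values())

dists = {verb: to_distribution(c) for verb, c in scf_counts.items()}
print(cluster_by_nearest_neighbour(dists))
```

With these toy counts, verbs that share ditransitive frames end up in one group and verbs that are mostly intransitive in another. Real SCF distributions are extracted from parsed text and are far noisier.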

I will describe two studies: one on texts from a general corpus (British National Corpus) and the other on a domain-specific (biomedical) corpus. The main difference between the corpora is that the biomedical texts contain much less polysemy than is usually seen in general language. We report how polysemy is reflected in the clustering results. Since verbs in a domain-specific corpus can have a meaning different from that in general language, we observe semantic classes that are particular to the biomedical domain.

This work has important implications for NLP applications. The semantic classes obtained in an unsupervised manner reflect knowledge about the domain. This knowledge can be used for various information retrieval and information extraction applications.