Multi-Lingual Automatic Lexical Acquisition



Knowledge of a language, like learning and processing it, requires knowledge of its words. Yet the information encoded in a lexical entry, its use, the relation of words to each other in the lexicon, and the relationship of the lexicon to the grammar are complex and unsettled issues, on which researchers hold very different views. We think that the accumulated evidence that lexical effects are strong ("it is all in the words") can be reconciled with the theoretical need for generalisation and succinctness by exploring the notion of classes of words.

Investigating the notion of class promises to shed light on several concerns common to the recent theoretical and computational linguistics literature: in particular, automatic lexical acquisition and organisation, and the integration of grammatical and probabilistic information. The goal of the current project is to investigate these issues by studying what part of a verb's lexical entry contains class-related information, using a corpus-based experimental approach.

The Proposal 

The linguistic notions that we investigate are related to the argument structure of a verb -- the entities that constitute the fundamental architecture of a proposition (who did what to whom). Merlo-Stevenson (2001) investigate statistical correlates of notions such as Agent or Patient, to automatically distinguish action verbs from change-of-state verbs in English. We focus our attention on the thematic relations of the NP arguments and the notion of argumenthood for prepositional phrases.

In the proposed research, we extend this corpus-based approach to new semantic roles, such as Beneficiary and Instrument, and to new languages (French and Italian). In this way, we intend to demonstrate the completeness of the method by applying it to a large portion of the thematic inventory, and its cross-linguistic validity by applying it to languages other than English.

The Methodology

The approach is based on the observation that verb classes show robust statistical regularities in subcategorization frames and argument structure. The methodology we propose to adopt has been used extensively before, in Merlo-Stevenson (2001) and Merlo-Leybold (2001).
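
To make the idea of class-level statistical regularities concrete, here is a minimal sketch in Python. It is not the procedure of the cited work: the frame inventory, the counts, the class labels and the nearest-centroid rule are all assumptions chosen for illustration, and the numbers are invented rather than drawn from any corpus. Each verb is reduced to the relative frequencies of the frames it occurs in, and an unseen verb is assigned to the class whose average profile is closest.

```python
from collections import Counter
from math import dist

# Hypothetical frame inventory (an assumption; real work would use richer features).
FRAMES = ("intransitive", "transitive")

# Invented toy counts, NOT real corpus data: verb -> how often each frame was seen.
TOY_COUNTS = {
    "run": Counter({"intransitive": 80, "transitive": 20}),
    "jump": Counter({"intransitive": 90, "transitive": 10}),
    "melt": Counter({"intransitive": 45, "transitive": 55}),
    "open": Counter({"intransitive": 40, "transitive": 60}),
}

# Illustrative class labels for the toy verbs.
TOY_LABELS = {
    "run": "action",
    "jump": "action",
    "melt": "change_of_state",
    "open": "change_of_state",
}


def frame_profile(counts):
    """Turn raw frame counts into relative frequencies, ordered as in FRAMES."""
    total = sum(counts[frame] for frame in FRAMES)
    if total == 0:
        raise ValueError("no observations for this verb")
    return tuple(counts[frame] / total for frame in FRAMES)


def class_centroids(counts_by_verb, labels):
    """Average the frame profiles of the verbs in each class."""
    grouped = {}
    for verb, label in labels.items():
        grouped.setdefault(label, []).append(frame_profile(counts_by_verb[verb]))
    return {
        label: tuple(sum(column) / len(column) for column in zip(*profiles))
        for label, profiles in grouped.items()
    }


def classify(counts, centroids):
    """Assign the class whose centroid is nearest in Euclidean distance."""
    profile = frame_profile(counts)
    return min(centroids, key=lambda label: dist(profile, centroids[label]))


centroids = class_centroids(TOY_COUNTS, TOY_LABELS)
unseen_verb = Counter({"intransitive": 70, "transitive": 30})
print(classify(unseen_verb, centroids))
```

The nearest-centroid rule is used here only because it fits in a few lines; any supervised classifier trained on the same frequency profiles would illustrate the same point.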





Technical Report