Corpus Linguistics
Part I. Corpus-linguistics basics & SFB Corpora
WS 2012/13 Tuesdays (8.1.2013, 22.1.2013, 29.1.2013, 5.2.2013)
10:30 - 12:00 46.21.04.13 (Kruppstr. 108)
Lecture Slides & Materials
Course book:
Tony McEnery, Richard Xiao and Yukio Tono (2006). Corpus-Based Language Studies. London: Routledge
8.1.2013
Course overview.
Corpus linguistics basics:
definitions of a corpus;
corpus design;
taxonomies of corpora;
corpus linguistics as a methodology or a theory;
corpus-based vs. corpus-driven approach;
main fields of application of corpus linguistics;
data-intensive linguistics.
Corpus Linguistics Basics I
Chris Brew and Marc Moens. Data-Intensive Linguistics
22.1.2013
Corpus linguistics basics (cont.):
representativeness;
balance & sampling;
mark-up & annotation;
The British National Corpus (BNC);
case study: forensic linguistics.
Corpus Linguistics Basics II
BNC User Reference Guide
"... and then ... Language description and author attribution" by Malcolm Coulthard
statement of: DEREK WILLIAM BENTLEY, aged 19, 1 Fairview Road, London Road
29.1.2013
Multilingual corpora in SFB 991:
JRC-Acquis (EC documents in the languages of all member states: Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene and Swedish)
LCC (mostly newspaper texts in Catalan, Danish, Dutch, English, Estonian, Finnish, French, German, Icelandic, Italian, Japanese, Korean, Norwegian, Serbian, Sorbian, Spanish, Swedish, Turkish, etc.)
MULTEXT-East (Orwell’s “1984” in Bulgarian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Romanian, Serbian, Slovak and Slovene, aligned at sentence level)
Monolingual corpora in SFB 991:
Chinese
Polish
Russian
Bulgarian
Macedonian
AntConc - a freeware concordance program
Corpus Linguistics Basics III
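What a concordance program such as AntConc produces can be illustrated with a minimal keyword-in-context (KWIC) sketch in Python. This is only a sketch under stated assumptions: the function name kwic, the window size and the sample text are made up for illustration and are not taken from AntConc or from the course materials.

```python
import re

def kwic(text, keyword, window=3):
    """Return one keyword-in-context line per occurrence of keyword.

    Hypothetical helper: tokenisation is a plain lowercase word split,
    which is far cruder than what a real concordancer does.
    """
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword.lower():
            # Up to `window` tokens of context on each side of the hit.
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Illustrative sample text (an assumption, not corpus data).
sample = "The corpus is large. A corpus is a collection of texts."
for line in kwic(sample, "corpus"):
    print(line)
```

Each output line shows one hit of the search word in square brackets with a few words of context on either side, which is the basic view a concordancer offers for inspecting usage patterns.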
5.2.2013
Monolingual corpora in SFB 991 (cont.):
German corpora: Mannheimer Korpus 1 & 2, Bonner Zeitungskorpus, LCC (Leipzig Corpora Collection) German corpus, political speeches (president & government); Negra & TiGer
English corpora: BNC (British National Corpus), Penn Treebank, Penn Discourse Treebank, OntoNotes English Corpus, PARC 700 Dependency Bank
BootCaT: Simple Utilities to Bootstrap Corpora And Terms from the Web
Corpus Linguistics Basics IV
Part II. Basic programming skills
WS 2012/13 Monday & Tuesday (25.2.2013, 26.2.2013)
9:00 - 12:00 (on the campus)
Last updated on 13.2.2013