Plenary Talks

    Monika Bednarek




    The University of Sydney

    Corpus Linguistics and Profanity (‘bad’ language, swear words, curse words, taboo words…)

    In this talk I aim to bring together my enduring research interests in corpus linguistics, media linguistics and the language of emotion/opinion, with a focus on the use of ‘bad’ language in American English – in particular, what are commonly called ‘swear words’, ‘taboo words’, ‘curse words’ or ‘profanity’. My talk will consider the challenges that such language poses for corpus linguistics, while also discussing its role in interpersonal language systems such as appraisal and negotiation. I will further report on research in progress that examines the use of such language in contemporary American television series that are highly popular and heavily exported on a global scale. This study is based on a new corpus of dialogue transcribed from over 60 contemporary US television series: The Sydney Television Corpus (SydTV). SydTV is a small, specialised corpus which has been designed to be representative of the language variety of contemporary US American TV dialogue. ‘Contemporary’ is here defined as the year of first broadcast falling between 2000 and 2012. This specific time frame was adopted because the first decade of the 21st century was characterised by the global rise of American TV series, and the ‘golden age of television’ that characterises US TV series from this period is still ongoing. ‘US American’ is defined as having the United States as country of origin. ‘TV dialogue’ is defined as the actual dialogue uttered by actors on screen as they are performing characters in fictional TV series. I will compare frequencies of ‘bad’ language words in this corpus with frequencies in other corpora, while also including qualitative analysis of their usage in contemporary TV dialogue.



    Susan Hunston


    University of Birmingham

    What do the numbers mean? The use of quantitative information in Corpus Linguistics

    In this paper I shall present a view of Corpus Linguistics that conceptualises it in three phases, distinguished by the use made in each phase of quantitative data. Quantitative information has become increasingly important in Corpus Linguistics, and increasingly sophisticated as measures that are sensitive to how language works have become more readily available. Questions around the use of quantitative information are driven by the need in Corpus Linguistics to innovate methodologically (to find the best way of making new discoveries) and theoretically (to describe language in new ways).

    In phase 1 studies, corpora from different geographical areas, or chronological times, or registers, are compared by quantifying the relative frequency of given grammatical or semantic categories. Such methods have underpinned substantial advances, for example the Longman Grammar of Spoken and Written English, work on Systemic-Functional Linguistics, and work comparing learner varieties of English, among many others.
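
    As a purely illustrative sketch (not taken from the talk), the kind of comparison described for phase 1 can be expressed as a relative frequency normalised per million words. The function names, the small modal-verb list standing in for a grammatical category, and the two toy "corpora" are all assumptions made for this example.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def relative_frequency(text, category, per=1_000_000):
    """Occurrences of any word in `category`, normalised per `per` tokens."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0  # avoid division by zero on an empty corpus
    hits = sum(1 for tok in tokens if tok in category)
    return hits * per / len(tokens)

# Hypothetical category: a handful of English modal verbs.
MODALS = {"can", "could", "may", "might", "must", "shall", "should", "will", "would"}

# Invented toy texts; real corpora would be far larger.
corpus_a = "You must go now and you should not wait here"
corpus_b = "She went home early because it was late"

freq_a = relative_frequency(corpus_a, MODALS)
freq_b = relative_frequency(corpus_b, MODALS)
print(freq_a, freq_b)
```

    Normalising by corpus size is what makes counts from corpora of different sizes comparable; on texts this small the figures are of course not meaningful in themselves.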

    Phase 2 studies prioritise lexis over grammar and individual wordforms over categories of form or meaning. In these studies, frequency is often reduced to a concept of ‘typicality’ or ‘centrality’. Comparison between corpora is usually not the identifying feature of such work. Examples include Sinclair’s work on Units of Meaning, or Francis and Hunston’s work on grammar patterns. The key aspects of phase 2 studies are their exploratory, ‘bottom-up’ approach and the novelty of their insights.

    A challenge for Corpus Linguistics is to marry the rigour of quantitative measures with innovation in insight. One way of doing this is to allow numbers to drive the way that information in the corpus is organised. Studies of this kind are what I term phase 3 studies, and two examples are given in this paper. One is a study of adjectives in a corpus of comments about university teaching staff (see Millar and Hunston 2015). The other is a study of lexis in a corpus compiled from an interdisciplinary academic journal (see Murakami et al., in press). In both cases the initial corpus work treats the corpus as a ‘bag of words’, allowing co-occurrence calculations to organise the data before linguistic considerations are brought to bear. Phase 3 studies are not necessarily an improvement on other methods, but as methodological innovation is key to progress in Corpus Linguistics they do offer an additional exploratory and data-driven approach to be considered.
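
    To make the ‘bag of words’ idea concrete, the following is a minimal sketch of one possible co-occurrence calculation: counting how often two words appear in the same text, with word order ignored. It is not the procedure used in the studies cited above; the function name and the toy comments are invented for illustration.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(docs):
    """Count, for each pair of words, the number of texts containing both."""
    pair_counts = Counter()
    for doc in docs:
        # Treat each text as an unordered set of word types (a bag of words).
        vocab = sorted(set(doc.lower().split()))
        pair_counts.update(combinations(vocab, 2))
    return pair_counts

# Invented toy data: each string stands for one short comment.
docs = [
    "helpful clear funny",
    "clear helpful",
    "boring unclear",
    "funny helpful",
]

counts = cooccurrence(docs)
print(counts.most_common(3))
```

    Counts of this kind could then be passed to a clustering or association measure so that the numbers, rather than prior linguistic categories, suggest how the data should be grouped.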



    Liang Maocheng




    Beijing Foreign Studies University


    Corpus linguistics in the Big Data era: opportunities and challenges

    The large size of modern corpora enables language researchers to look at a lot of language at once, and to draw conclusions about language on the basis of observations of a lot more data than before (Sinclair 1991). Consequently, the validity of any corpus research is closely related to the size of the corpora used. This has led to an endless pursuit of ever larger corpora, and it has become commonly accepted, as Sinclair (2004: 189) claims, that "There is no virtue in being small. Small is not beautiful. It is simply a limitation."


    With the coming of the Big Data era, the increased size of corpora makes it possible to observe even more data, and further enhances the validity of corpus research. It will very likely open new horizons in language studies. However, a number of difficulties will also emerge. These include, among other things, the collection and storage of data, the homogeneity and heterogeneity of texts, the management of metadata, the efficiency of data analysis, and new statistical technologies for handling Big Data.


    This paper discusses the changes in Corpus Linguistics brought about by Big Data, and highlights some challenges and possible solutions. A software demonstration will also be given to illustrate how Big Data technologies can be used to enhance corpus research.







    Lancaster University

    A New Look at Learner Language - the Trinity Lancaster Corpus

    Does cultural and linguistic background affect learner speech - and if so, how? What impact may age have on learner production? Is gender a linguistically important feature when exploring the speech of learners of English? How does learner language production vary by task type? Is learner language different when a learner is leading an interaction as opposed to being led through an interaction by a person who is proficient in the language?

    Questions such as these have been addressed regularly in the literature on learner language. However, until recently it was difficult to explore these questions in learner speech. Using a new multi-million-word corpus being developed at Lancaster University with Trinity College London, we can start to address these issues. By exploiting this large, orthographically transcribed corpus of learner speech, amply provided with relevant metadata, we can gain fresh insights into learner speech.

    In this talk I will give an overview of the construction of the Trinity Lancaster Corpus, discussing the tasks the speakers engaged in and the range of metadata we have available for those speakers. Following that, I will review some initial findings from the corpus. The findings will use a range of metadata to show how, when considered singly and in groups, that metadata can give us answers to questions such as those outlined.



    Ute Römer



    Georgia State University

    Learner corpora, emerging constructions, and language teaching

    This talk adopts a usage-based perspective on language acquisition to investigate how knowledge of verb-argument constructions (VACs) develops in second language learners across proficiency levels. I will first present findings from an analysis of L1 German and L1 Spanish learner use of English VACs, such as the ‘V about n’ (e.g., let’s talk about the weather) or the ‘V with n’ construction (e.g., he always agreed with her). I will then discuss what the findings mean for language teaching. The analysis is based on corpora of learner writing at different levels of proficiency, described in further detail below. I was interested in determining (1) how VACs develop in second language (L2) writing as proficiency increases and (2) how the use and emergence of VACs is affected by the learner’s first language.


    The paper builds on previous work on learner knowledge of VACs carried out in a usage-based linguistics tradition (Gries & Wulff, 2005; Römer et al., 2014a, 2014b). This work has shown that advanced learners of English have constructional knowledge, that learners’ VAC knowledge differs in systematic ways from that of native speakers, and that learners’ verb-VAC associations differ across L1 groups. What previous studies have not been able to address, mostly due to the unavailability of pertinent data at lower proficiency levels, is how this constructional knowledge unfolds over time (though see Li, Eskildsen, & Cadierno, 2014). Likewise, only a few studies have systematically contrasted learners from different L1 backgrounds to investigate the role of transfer from the first language. The present talk seeks to take steps towards closing both of these gaps.


    To gather information on learner VAC use at different proficiency levels, I use subsets of the Education First-Cambridge Open Language Database (EFCAMDAT; Geertzen, Alexopoulou, & Korhonen, 2013), consisting of writing samples by learners of a range of L1s who were placed into 16 different proficiency levels. For this study, I retrieved sets of texts written by German and Spanish learners at Common European Framework of Reference (CEFR) levels A1 through C2. The resulting EFCAMDAT subsets (over 28,000 texts and 2.8 million words from L1 German learners, and over 40,000 texts and 3.2 million words from L1 Spanish learners) constitute a pseudo-longitudinal learner corpus that complements existing corpus resources. From these EFCAMDAT subsets, I exhaustively retrieved instances of 19 different VACs. In addition, and in order to provide further evidence on advanced learner VAC knowledge, data on the same 19 VACs was retrieved from the German and Spanish subcomponents of the International Corpus of Learner English (ICLE) and the Louvain International Database of Spoken English Interlanguage (LINDSEI).



Copyright © 2002-2015 Asia Pacific Corpus Linguistics and the School of Foreign Languages, Beihang University