Plenary Talks - School of Foreign Languages, Beihang University


Plenary Talks

Published: 2016-05-12

Monika Bednarek

The University of Sydney

Corpus Linguistics and Profanity (‘bad’ language, swear words, curse words, taboo words…)

In this talk I aim to bring together my enduring research interests in corpus linguistics, media linguistics and the language of emotion/opinion, with a focus on the use of ‘bad’ language in American English – in particular, what are commonly called ‘swear words’, ‘taboo words’, ‘curse words’ or ‘profanity’. My talk will consider the challenges that such language poses for corpus linguistics, while also discussing its role in interpersonal language systems such as appraisal and negotiation. I will further report on research in progress that examines the use of such language in contemporary American television series that are highly popular and heavily exported on a global scale. This study is based on a new corpus of dialogue transcribed from over 60 contemporary US television series: The Sydney Television Corpus (SydTV). SydTV is a small, specialised corpus which has been designed to be representative of the language variety of contemporary US American TV dialogue. Contemporary is here defined as the year of first broadcast falling between 2000 and 2012. This specific time frame was adopted because the first decade of the 21st century was characterised by the global rise of American TV series, and the ‘golden age of television’ that characterises US TV series from this period is still on-going. US American is defined as having the United States as country of origin. TV dialogue is defined as the actual dialogue uttered by actors on screen as they are performing characters in fictional TV series. I will compare frequencies of ‘bad’ language words in this corpus with frequencies in other corpora, while also including qualitative analysis of their usage in contemporary TV dialogue.

Susan Hunston

University of Birmingham

What do the numbers mean? The use of quantitative information in Corpus Linguistics

In this paper I shall present a view of Corpus Linguistics that conceptualises it in three phases, distinguished by the use made in each phase of quantitative data. Quantitative information has become increasingly important in Corpus Linguistics, and increasingly sophisticated as measures that are sensitive to how language works have become more readily available. Questions around the use of quantitative information are driven by the need in Corpus Linguistics to innovate methodologically (to find the best way of making new discoveries) and theoretically (to describe language in new ways).

In phase 1 studies, corpora from different geographical areas, or chronological times, or registers, are compared by quantifying the relative frequency of given grammatical or semantic categories. Such methods have underpinned substantial advances, for example the Longman Grammar of Spoken and Written English, work on Systemic-Functional Linguistics, and work comparing learner varieties of English, among many others.
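The kind of comparison described for phase 1 can be illustrated with a minimal sketch. It assumes a category can be approximated by a fixed word list and that each corpus is already tokenised. The function name `relative_frequency`, the toy corpora and the modal-verb list are hypothetical illustrations, not taken from any of the studies mentioned.

```python
from collections import Counter


def relative_frequency(tokens, category, per=1_000_000):
    """Return occurrences of `category` words per `per` tokens."""
    if not tokens:
        raise ValueError("corpus is empty")
    counts = Counter(token.lower() for token in tokens)
    # A Counter returns 0 for words that never occur in the corpus.
    hits = sum(counts[word] for word in category)
    return hits / len(tokens) * per


# Toy stand-ins for two corpora (hypothetical data, not from any study).
corpus_a = "we must go and we should stay".split()
corpus_b = "they went home and they stayed there today".split()

# Hypothetical category: a few modal verbs, listed in lower case.
modals = {"must", "should", "may", "might"}

freq_a = relative_frequency(corpus_a, modals, per=1000)
freq_b = relative_frequency(corpus_b, modals, per=1000)
print(f"corpus A: {freq_a:.1f} per 1,000 tokens")
print(f"corpus B: {freq_b:.1f} per 1,000 tokens")
```

Normalising by corpus size is what makes the two figures comparable when the corpora differ in length. A real study would typically rely on annotated categories rather than a word list, and would add a significance test.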

Phase 2 studies prioritise lexis over grammar and individual wordforms over categories of form or meaning. In these studies, frequency is often reduced to a concept of ‘typicality’ or ‘centrality’. Comparison between corpora is usually not the identifying feature of such work. Examples include Sinclair’s work on Units of Meaning, or Francis and Hunston’s work on grammar patterns. The key aspects of phase 2 studies are their exploratory, ‘bottom-up’ approach and the novelty of their insights.

A challenge for Corpus Linguistics is to marry the rigour of quantitative measures with innovation in insight. One way of doing this is to allow numbers to drive the way that information in the corpus is organised. These are what I term phase 3 studies, and two examples are given in this paper. One is a study of adjectives in a corpus of comments about university teaching staff (see Millar and Hunston 2015). The other is a study of lexis in a corpus compiled from an interdisciplinary academic journal (see Murakami et al., in press). In both cases the initial corpus work treats the corpus as a ‘bag of words’, allowing co-occurrence calculations to organise the data before linguistic considerations are brought to bear. Phase 3 studies are not necessarily an improvement on other methods, but as methodological innovation is key to progress in Corpus Linguistics they do offer an additional exploratory and data-driven approach to be considered.
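To make the ‘bag of words’ idea concrete, here is a minimal sketch of one possible co-occurrence calculation. It assumes that each text is treated as an unordered set of word types and that two words co-occur when they appear in the same text. The function `cooccurrence_counts` and the sample comments are hypothetical; this is not the procedure used in the studies cited above.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count how many documents each pair of word types shares."""
    pair_counts = Counter()
    for doc in documents:
        # Bag of words: ignore word order and repetition within a text.
        types = sorted(set(doc.lower().split()))
        # Sorting keeps each pair in one canonical (alphabetical) order.
        pair_counts.update(combinations(types, 2))
    return pair_counts


# Hypothetical comments, invented for illustration only.
comments = [
    "helpful clear lecturer",
    "clear helpful",
    "boring lecturer",
]

counts = cooccurrence_counts(comments)
for pair, n in counts.most_common(3):
    print(pair, n)
```

No linguistic knowledge enters the calculation. The counts, or an association measure derived from them, could then be used to group words before any interpretation begins.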

Liang Maocheng

Beijing Foreign Studies University

Corpus linguistics in the Big Data era: opportunities and challenges

The large size of modern corpora enables language researchers to look at a lot of language at once, and to draw conclusions about language on the basis of observations of a lot more data than before (Sinclair 1991). Consequently, the validity of any corpus research is closely related to the size of the corpora used. This has led to an endless pursuit of ever larger corpora, and it has become commonly accepted, as Sinclair (2004: 189) claims, that "There is no virtue in being small. Small is not beautiful. It is simply a limitation."

With the coming of the Big Data era, the increased size of corpora makes it possible to observe even more data, and further enhances the validity of corpus research. It will very likely open new horizons in language studies. However, a number of difficulties will also emerge. These include, among other things, the collection and storage of data, the homogeneity and heterogeneity of texts, the management of metadata, the efficiency of data analysis, and new statistical technologies for handling Big Data.

This paper discusses the changes in Corpus Linguistics brought about by Big Data, and highlights some challenges and possible solutions. A software demonstration will also be given to illustrate how Big Data technologies can be used to enhance corpus research.

Tony McEnery

Lancaster University

A New Look at Learner Language - the Trinity Lancaster Corpus

Does cultural and linguistic background affect learner speech - and if so how? What impact may age have on learner production? Is gender a linguistically important feature when exploring the speech of learners of English? How does learner language production vary by task type? Is learner language different when a learner is leading an interaction as opposed to being led through an interaction by a person who is proficient in the language?

Questions such as these have been addressed regularly in the literature on learner language. However, until recently it was difficult to explore these questions in learner speech. Using a new multi-million-word corpus being developed at Lancaster University with Trinity College London, we can start to address these issues. By exploiting this large, orthographically transcribed corpus of learner speech, amply provided with relevant metadata, we can gain fresh insights into learner speech.

In this talk I will give an overview of the construction of the Trinity Lancaster Corpus, discussing the tasks the speakers engaged in and the range of metadata we have available for those speakers. Following from that I will review some initial findings from the corpus. The findings will use a range of metadata to show how, when considered singly and in groups, that metadata can give us answers to questions such as those outlined.

Ute Römer

Georgia State University

Learner corpora, emerging constructions, and language teaching

This talk adopts a usage-based perspective on language acquisition to investigate how knowledge of verb-argument constructions (VACs) develops in second language learners across proficiency levels. I will first present findings from an analysis of L1 German and L1 Spanish learner use of English VACs, such as the ‘V about n’ (e.g., let’s talk about the weather) or the ‘V with n’ construction (e.g., he always agreed with her). I will then discuss what the findings mean for language teaching. The analysis is based on corpora of learner writing at different levels of proficiency, described in further detail below. I was interested in determining (1) how VACs develop in second language (L2) writing as proficiency increases and (2) how the use and emergence of VACs are affected by the learner’s first language.

The paper builds on previous work on learner knowledge of VACs carried out in a usage-based linguistics tradition (Gries & Wulff, 2005; Römer et al., 2014a, 2014b). This work has shown that advanced learners of English have constructional knowledge, that learners’ VAC knowledge differs in systematic ways from that of native speakers, and that learners’ verb-VAC associations differ across L1 groups. What previous studies have not been able to address, mostly due to the unavailability of pertinent data at lower proficiency levels, is how this constructional knowledge unfolds over time (though see Li, Eskildsen, & Cadierno, 2014). Likewise, only a few studies have systematically contrasted learners from different L1 backgrounds to investigate the role of transfer from the first language. The present talk seeks to take steps toward closing both of these gaps.

To gather information on learner VAC use at different proficiency levels, I use subsets of the Education First-Cambridge Open Language Database (EFCAMDAT; Geertzen, Alexopoulou, & Korhonen, 2013), consisting of writing samples by learners of a range of L1s who were placed into 16 different proficiency levels. For this study, I retrieved sets of texts written by German and Spanish learners at Common European Framework of Reference (CEFR) levels A1 through C2. The resulting EFCAMDAT subsets (over 28,000 texts and 2.8 million words from L1 German learners, and over 40,000 texts and 3.2 million words from L1 Spanish learners) constitute a pseudo-longitudinal learner corpus that complements existing corpus resources. From these EFCAMDAT subsets, I exhaustively retrieved instances of 19 different VACs. In addition, and in order to provide further evidence on advanced learner VAC knowledge, data on the same 19 VACs was retrieved from the German and Spanish subcomponents of the International Corpus of Learner English (ICLE) and the Louvain International Database of Spoken English Interlanguage (LINDSEI).