Network: It's, um, English, just like we really speak it: Using an immense data base, lexicographers have taken raw language and produced a revolutionary new dictionary. Robert Nurden reports

Robert Nurden
Sunday 23 October 1994 20:02 EDT

The unexpurgated gossip of dinner ladies from Hackney, the negotiations of a company director from Newcastle, and the cries of croquet enthusiasts from Bromley have all helped to create a revolutionary dictionary.

Their spontaneous chit-chat is part of the Spoken Corpus project, devised by Longman, the publishing company. The project involved creating a data base of 10 million words taken directly from everyday situations - the largest ever compiled in English.

About 150 volunteers agreed to wear a tape recorder for up to two weeks so that all their conversations could be recorded. The result: tapes which, if joined up, would be 34 times the height of Mount Everest.

The tapes were transcribed to produce on disk the world's largest data base of spoken English. Lexicographers then turned this English in the raw - much of it extremely rich - into material for dictionaries.

The Spoken Corpus's first offspring is the Longman Language Activator, described by a leading grammarian, Professor Sir Randolph Quirk, as 'the book the world's been waiting for'. The Activator is a dictionary for advanced students of English that gives not only a word's definition - as monolingual dictionaries do - but points the reader to related words or phrases as they are actually used. It offers a far wider range of such phrases than a standard thesaurus. For instance, the word 'lucky' leads to 'fall on your feet', 'not know you're born' and even 'keep your fingers crossed'.

Technology has not merely enabled lexicographers to work more efficiently and accurately; it also helps dictionary compilers to trace the ebb and flow of new expressions in the language - which phrases are taking root and which are disappearing. Previously, they had to guess.

Search techniques also enable lexicographers to test frequency of usage. The data base reveals the words that are favourites in speech but infrequent on the page. The word 'really', for instance, is used five times more often in speech than in writing. The search also shows that women speak in different ways from men, and reveals important details about regional and class speech patterns.

Electronic corpora - the word corpus refers to the fact that this is a collection of words - enable specialists to home in on categories of language that interest them: social science, legal terminology, physics, geography, poetic usage and so on.

Linguists have long known about the importance of phatic communion - the noises and pauses that we use to express doubt, joy, fear or aggression, to play for time or to be just plain pig-headed. Um and ah, and even suckings of teeth or intakes of breath, are highly subtle vocal devices that can now be analysed in depth.

The Spoken Corpus is part of the British National Corpus, a collaborative venture between universities and educational bodies that has produced more than 100 million words, 90 million of them written. The Corpus has already changed the way textbooks for foreign students are put together, and has helped Longman to produce the first multimedia CD-rom dictionary. In future, monolingual dictionaries, which are devoted to helping people whose first language is English, will also contain real examples of spoken English, rather than invented ones. The days of dry academics poring over file cards in dusty research rooms have long gone: the dinner ladies from Hackney have seen to that.

Oxford University Computing Services, which is handling sales of the Spoken Corpus disks, has not yet put a price on them. But the Corpus should become available in the next few weeks to linguists, lexicographers and compilers of English teaching materials.
