Since 1986 - Covering the Fastest Computers in the World and the People Who Run Them

September 13, 2013

Mining for Shakespeare

Tiffany Trader

Did you know that “all” is among the words Shakespeare used least often relative to his contemporaries? Or that the over- or under-use of common words like “you” and “there” serves as a marker of individual writing style?

The writing process makes manifest a distinctive lexical inventory, a personal lexicon as unique as a snowflake or fingerprint. Given sufficiently large data sets and suitable quantitative methods, subtle variations in writing style can be teased out to reveal linguistic preferences, psychological issues and even authorship itself.

An article in the acclaimed non-profit science journal PLOS ONE further explores this topic. The paper’s authors applied a ranking method to a text corpus containing 55,055 unique words from 168 plays written during the 16th and 17th centuries.

The researchers cross-referenced this word list with the works of John Fletcher, Ben Jonson, Thomas Middleton and William Shakespeare and generated, for each author, a list of the 20 most and 20 least frequently used words.

“The results of using this new method were very encouraging,” the researchers subsequently wrote of their work. “For some authors, such as Shakespeare, the slight under-use of particular words provided better markers of individuation than over-used words.”

Shakespeare’s lowest ranked (i.e., least used) words were: all, to (infinitive), now, and ye. His top ranked words were: will (the noun), thee, you and did.

Jonson and Middleton also shied away from the use of “ye”, but this was not the case for Fletcher, who had a clear preference for the word in comparison to his contemporaries.

The authors of the journal article refer to these statistical variations in word use as “quantifiable markers that can objectively measure an author’s creative mind at work.”

What’s interesting here is that writing style can be boiled down to “the tendency to over-utilise or avoid particular common words and phrasings,” a spontaneous process that operates faster than thought. Using quantitative methods, such as word frequencies, to find insight in bodies of text is known as computational stylistics.
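To make the idea of over- and under-use concrete, here is a minimal sketch in Python. It is not the ranking method from the PLOS ONE paper, which the article does not describe in detail. It simply compares how often one author uses a word with how often a reference collection does. The function names, the smoothing constant and the toy sentences are all assumptions made up for illustration.

```python
import re
from collections import Counter


def relative_frequencies(text):
    """Return each word's share of all word tokens in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def usage_ratios(author_text, reference_text, vocabulary, eps=1e-6):
    """Ratio of the author's rate to the reference rate for each word.

    Values above 1 suggest over-use, values below 1 suggest under-use.
    The small constant eps (an illustrative choice) avoids division by zero
    when a word is missing from one of the texts.
    """
    author = relative_frequencies(author_text)
    reference = relative_frequencies(reference_text)
    return {
        word: (author.get(word, 0.0) + eps) / (reference.get(word, 0.0) + eps)
        for word in vocabulary
    }


# Toy stand-ins for an author's plays and for everyone else's plays.
author_text = "thee and thee and you did go and you did stay"
reference_text = "ye all go now and ye all stay now and all is well"
markers = ["thee", "you", "did", "all", "now", "ye", "and"]

ratios = usage_ratios(author_text, reference_text, markers)

# Rank candidate marker words from most over-used to most under-used.
ranked = sorted(markers, key=lambda word: ratios[word], reverse=True)
for word in ranked:
    print(f"{word:>5}  ratio = {ratios[word]:.3g}")
```

On a real corpus the two strings would be replaced by the full text of one author's plays and of the remaining plays, and a simple ratio like this would need a sturdier statistic to cope with rare words and plays of different lengths.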

The same scoring method used in this computational stylistics study is being used in the analysis of biomedical data to help in the fight against cancer and other diseases. Just as there are markers unique to linguistic style, there are biomarkers in medical research. It all comes down “to the quantification of subtle variations of attributes present in large amounts of data,” according to the researchers.

