Use Case
Attributed to John Taylor, Public domain, via Wikimedia Commons
The project leverages computing power to conduct large-scale analyses across literary corpora in ways which were almost impossible to accomplish before.
Project Description
Scholars are unsure just what William Shakespeare wrote. We now know that plays published under his name contain contributions from other dramatists, and that he had a hand in others' plays. Moreover, half of Shakespeare's plays are known to us via multiple early versions whose differences might reflect revision of the play by Shakespeare and/or someone else, or censorship, or corruption of the text in scribal and print transmission. This project is funded by the Academies Partnership in Supporting Excellence in Cross-Disciplinary Research (APEX) scheme, grant APX\R1\241032. Two researchers at De Montfort University, Professor Gabriel Egan (expert in Shakespeare) and Professor Raouf Hamzaoui (expert in Information Theory), collaboratively explore the differences between the early editions of Shakespeare using new information-theoretic techniques that shed light on literary style, habits of revision, censorship, and textual corruption in ways not previously possible. This work is timely as the full set of plays (Shakespeare's and other writers') has only recently become available to investigators as large numbers of well-curated digital texts.
Skills and technical support requirements
Through training, individual experimentation and technical support, the fellowship has enabled Prof Egan to further develop several competencies that are necessary for upscaling research via HPC DRI. Although the project already had a strong foundation of existing datasets, Python scripts and experience with Unix-based systems, the fellowship has further enhanced its technical potential. This includes better articulation and refinement of workflows, code restructuring, and code management through version control practices. Moreover, technical guidance from, and collaboration with, the RSE team at the Bede Tier 2 HPC facility helped to overcome authentication challenges and resolve job permission issues, and has been crucial in supporting code optimisation, job scheduling and running analyses in parallel. As a result, the project has successfully employed HPC to calculate how each of around 110 Word Adjacency Networks (WANs, which reveal writing style through authors' habits of clustering their 100 most frequent words) in Shakespeare’s plays compares to the WANs of the full canons of seven other dramatists.
User roles and DRI requirements
The Fellow undertook various DRI-related user roles over the course of the project's workflow. As a source explorer/corpus builder, the Fellow gathered digital transcriptions of plays by dramatists of the late 16th and early 17th centuries. These came from the freely available digital transcriptions of around 60% of all the books published in the UK up to the year 1700, in the form of the Text Creation Partnership (TCP) Phase One and Phase Two datasets. A step-change in researchers' use of the TCP data has been the EarlyPrint project at Washington University in St. Louis, which has completed morpho-syntactic tagging of the TCP files and provided a web-based Corpus Query Language front-end to interrogate this enhanced data. The Fellow then combined files from these resources and, as a data-shaper, prepared the dataset for the next phase of the workflow. In that phase, the Fellow acted as an AI/compute developer, generating two-dimensional matrices that each capture the writing habits of a dramatist by clustering their 100 most frequent words in a Word Adjacency Network (WAN).
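As an illustration of what such a two-dimensional matrix might look like, the following is a minimal Python sketch and not the project's own code. It assumes that adjacency means "follows within a fixed window of tokens" and that each row is normalised to sum to 1. The names build_wan, top_n and window are hypothetical, and the project's actual weighting scheme may differ.

```python
from collections import Counter

import numpy as np


def build_wan(tokens, top_n=100, window=5):
    """Return (words, matrix) for a simple word adjacency network.

    Assumed construction (hypothetical): entry [i, j] counts how often
    word j follows word i within `window` tokens; each non-empty row is
    then normalised so that it sums to 1.
    """
    counts = Counter(tokens)
    # Keep only the top_n most frequent words as network nodes.
    words = [w for w, _ in counts.most_common(top_n)]
    index = {w: i for i, w in enumerate(words)}
    matrix = np.zeros((len(words), len(words)))
    for pos, source in enumerate(tokens):
        if source not in index:
            continue
        # Look ahead at the next `window` tokens after the source word.
        for target in tokens[pos + 1 : pos + 1 + window]:
            if target in index:
                matrix[index[source], index[target]] += 1.0
    row_sums = matrix.sum(axis=1, keepdims=True)
    # Divide only rows with at least one count; empty rows stay zero.
    np.divide(matrix, row_sums, out=matrix, where=row_sums > 0)
    return words, matrix
```

For a real play, tokens would be the lower-cased word sequence of the text and top_n would be 100, giving a 100 by 100 matrix.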
While the initial steps of the above workflow were easily accomplished on a desktop computer, computing the relative entropy between just two WAN matrices would take around 10 minutes. Comparing each of the 110 WANs in Shakespeare’s plays to the WANs of the full canons of seven other dramatists means about 770 pairwise comparisons, or roughly five days of serial computation for a single configuration. Repeating this for different variable choices would therefore be a cumbersome task on a desktop machine.
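To make the comparison step concrete, here is a hedged Python sketch of a relative entropy between two WAN matrices. It assumes both matrices are row-normalised and have the same word order, weights rows uniformly unless told otherwise, and skips entries where either matrix is zero. The function name wan_relative_entropy and its row_weights parameter are hypothetical, and the project's own measure may be defined differently.

```python
import numpy as np


def wan_relative_entropy(p, q, row_weights=None):
    """Relative entropy of row-stochastic matrix p with respect to q (nats).

    Assumptions for this sketch: rows are weighted uniformly unless
    `row_weights` is given, and entries where either matrix is zero are
    skipped so that the result stays finite.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if p.shape != q.shape:
        raise ValueError("WAN matrices must have the same shape")
    if row_weights is None:
        row_weights = np.full(p.shape[0], 1.0 / p.shape[0])
    row_weights = np.asarray(row_weights, dtype=float)
    # Only entries that are positive in both matrices contribute.
    mask = (p > 0) & (q > 0)
    terms = np.zeros_like(p)
    terms[mask] = p[mask] * np.log(p[mask] / q[mask])
    # Weighted sum of the per-row divergences.
    return float(row_weights @ terms.sum(axis=1))
```

The measure is not symmetric, so comparing a play to a canon and the canon to the play will generally give different values.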
The DISKAH fellowship via the partnership with N8 secured access to the Bede Tier 2 HPC, as well as technical support from a dedicated RSE team, allowing the Fellow to perform full WAN comparisons as an AI/compute developer. Such comparisons reveal specific stylistic features, revision patterns and other authorship practices in Shakespeare’s plays, which the Fellow plans to publish, as a scholarly communicator, in academic papers and an upcoming monograph.
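Because each play-to-canon comparison is independent of the others, the workload divides naturally across HPC workers. The Python sketch below shows one possible way to enumerate the comparisons and share them out round-robin. The function names and identifiers are hypothetical, and the worker index is assumed to be supplied by whatever job scheduler is in use. It is not a description of how the project's jobs were actually organised.

```python
from itertools import product


def comparison_tasks(play_ids, dramatist_ids):
    """List every (play, dramatist) pair; each pair is an independent job."""
    return list(product(play_ids, dramatist_ids))


def tasks_for_worker(tasks, worker_index, worker_count):
    """Give worker `worker_index` every `worker_count`-th task (round robin)."""
    if not 0 <= worker_index < worker_count:
        raise ValueError("worker_index must be in range(worker_count)")
    return tasks[worker_index::worker_count]


# Placeholder identifiers, not real file names from the project.
plays = [f"play_{i:03d}" for i in range(110)]
dramatists = [f"dramatist_{j}" for j in range(7)]
tasks = comparison_tasks(plays, dramatists)
first_share = tasks_for_worker(tasks, worker_index=0, worker_count=77)
```

With 110 plays and seven dramatists this yields 770 tasks. Split across, say, 77 workers, each worker receives 10 comparisons, which at around 10 minutes per comparison would take on the order of 100 minutes rather than several days.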
Highlights
The project leverages computing power to conduct large-scale analyses across literary corpora in ways which were almost impossible to accomplish before. By exploring different variable choices in the WANs of Shakespeare’s plays and using HPC to compare them to the WANs of seven dramatists from the late 16th and early 17th centuries, the project strengthens literary investigations with innovative digital methods that can be applied at scale. This work, in turn, has the potential to transform textual variation research and provide new insights into Shakespeare’s work, that of his contemporaries, and beyond.
Future work
The project’s future work will expand on WAN comparisons and experimentation with variable choices across the set of corpora.
Project Outputs
Dataset and software: https://apex.dmu.ac.uk
