ABSTRACT
Sept. 20 (Fri.)
[SESSION 1] 'The Language of Tragedy: Tracking a Shifting Target in a Historical Corpus' Jonathan Hope (Professor of Literary Linguistics at Strathclyde University, Glasgow, UK) In a number of papers, Michael Witmore and I have used digital techniques to explore the relationship between genre and language in Shakespeare (2004, 2007, 2010, 2012). Perhaps our most significant, and surprising, finding to date has been the very strong relationship that can be identified between literary genre (which we take to be a high-level attribute of texts) and micro-linguistic features (a low-level attribute of texts).
[Key Words: Text analytics, Early Modern English, genre theory, visualisation, data mining]
Jonathan Hope and Michael Witmore, 2004, 'The very large textual object: a prosthetic reading of Shakespeare', Early Modern Literary Studies 9.3 / Special Issue 12: 6.1-36, http://www.shu.ac.uk/emls/09-3/hopewhit.htm
'Different Characteristics of Variant Readings Based on Comparison of Major Textual Similarity Measures' Maki Miyake (Associate Professor, Graduate School of Language and Culture, Osaka University) In this study, we present a computational approach to textual variant analysis, focusing specifically on influential modern editions of the Greek New Testament. As the Nestle-Aland Greek New Testament is considered one of the most useful critical resources for current biblical studies, we set the 27th edition as the norm text. Even though this is not the latest version (the 28th edition was released last September), it is still widely recognized as the primary source. We compared the 27th edition with modern editions such as Westcott-Hort (1881), Scrivener (1894) and the Robinson-Pierpont Majority Text (2005) within the Synoptic Gospels; the Gospels of Matthew, Mark and Luke are similar in structure, content, and wording.
The similarity measures were computed over contiguous sequences of words ranging in size from 4 to 10 tokens. We also performed the calculation verse by verse, which is equivalent to doing it line by line. We applied several promising similarity measures to investigate the differences between the editions. The major word-matching algorithms commonly used in information theory were employed to calculate the differences between words. More specifically, we made use of two string-based measures, Levenshtein distance and Jaro-Winkler distance, and of cosine similarity as a token-based measure. Levenshtein distance is a well-known edit distance that counts simple edit operations such as insertion, deletion, and substitution. Jaro-Winkler distance, in turn, handles more complex edit operations, including character transpositions. Cosine similarity measures the angle between two token sets represented as vectors. Alongside these classical measures, we focus on a fuzzy matching technique called FCosine that was recently proposed by Wang et al. (2011). The FCosine measure combines Levenshtein distance and cosine similarity; this hybrid method is capable of approximate string matching because it takes advantage of both edit-based and token-based distances. By comparing the statistical results, we explored which aspects account for the differences between variant readings. At the same time, we discussed how effectively each similarity measure captures the characteristics of the variants. [Key Words: Textual Variants, Edit Distance, Cosine Similarity]
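The following Python sketch illustrates the kinds of measures named above: a Levenshtein edit distance, a token-based cosine similarity, and a simple normalized hybrid of the two. The whitespace tokenization, the equal weighting in the hybrid score, and the sample strings are illustrative assumptions; in particular, the hybrid is only in the spirit of the FCosine measure of Wang et al. (2011), not their exact algorithm, and Jaro-Winkler distance is omitted for brevity.

```python
import math
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between token-count vectors of two verses."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def hybrid_score(a: str, b: str) -> float:
    """Illustrative hybrid of edit-based and token-based similarity
    (in the spirit of fuzzy token matching, not Wang et al.'s exact FCosine)."""
    edit_sim = 1 - levenshtein(a, b) / max(len(a), len(b), 1)
    return 0.5 * edit_sim + 0.5 * cosine_similarity(a, b)

# Example: comparing the same (hypothetical) verse in two editions
print(hybrid_score("εν αρχη ην ο λογος", "εν αρχη ην ο λογος και"))
```

In this sketch the edit-based term is sensitive to small spelling variants within words, while the token-based term is sensitive to added or omitted words; combining them is one simple way to approximate the behaviour described for the hybrid measure.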
'Data Criticism: a Methodology for the Quantitative Evaluation of Non-Textual Historical Sources with Case Studies on Silk Road Maps and Photographs' Asanobu Kitamoto (Associate Professor, National Institute of Informatics) This paper proposes the concept of "data criticism" and demonstrates how it can be applied to non-textual historical sources such as maps and photographs. Data criticism deals with the evaluation of sources in the context of historical studies, which is the same role that textual criticism plays. Nevertheless, the criticism of data, especially of non-textual (visual) sources such as maps and photographs, has not been well studied, owing to the lack of a methodology for quantitative evaluation. Visual sources are essential for the spatial interpretation of historical facts, but the necessity of critical methods has gone relatively unnoticed. We believe that this is due to the appearance of visual sources, which look like records of fact because their creation depends heavily on objective techniques and tools such as surveying and the camera. We show, however, that visual sources are not facts, and that their quality needs to be evaluated before they are used as reliable sources, in the same way as text. Hence we use old maps and photographs as case studies to demonstrate how data criticism can bring about new perspectives on the interpretation of visual sources. We first introduce the computational part of data criticism. We developed a suite of geo-referencing techniques for maps, including support for different map projections, one-point versus entire registration of maps, and the preservation of linear features of maps such as streets. This process is typically performed by the built-in geometric correction in geographic information systems (GIS), such as rubber-sheeting, but we show that these techniques have room for improvement, and we propose new algorithms using ground control points and lines, or the latitude-longitude mesh on the map. The result is integrated into web-based tools, on top of geo-browsers such as Google Maps and Google Earth, and we also developed a new interface, "mappinning" (map + pinning), for performing one-point registration on the fly. We then discuss the historical part of data criticism; namely, how data criticism can lead to new interpretations of visual sources and to the discovery of new historical facts. First, we criticized Central Asian maps from the European expeditions of Stein, Hedin and others. We quantified the distribution of errors and explained their causes in the context of the technical limitations of the survey technology available at the time. Second, we geo-referenced the old map of Beijing, the "Complete Map of Peking, Qianlong Period," created about 250 years ago. We identified misalignments, possibly due to incorrect restoration in the past, and revealed for the first time the original form of the map by connecting its 203 sheets on the basis of our proposed geometric correction method. Third, we focused on ruins located in a few regions of the Tarim Basin and identified conceptual relationships between ruins that appear under different names and types in expedition reports and in current survey reports. This suggests that a consistent interpretation of facts is possible even when the text is inconsistent across sources. In the age of data, historical studies should incorporate and integrate a larger variety of data, and we believe that 'data criticism' is a key methodology for the proper interpretation of data and the discovery of historical facts. It is a natural extension of textual criticism, which has been studied for a long time, and it addresses the same central concern of historical studies; namely, the evaluation of the value of sources. Finally, most of the maps and photographs introduced in this paper are accessible at the Digital Silk Road website, http://dsr.nii.ac.jp/geography/ [Key Words: Data criticism, old maps, historical sources, quantitative evaluation, Silk Road]
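As a rough illustration of the computational side of geo-referencing, the sketch below fits a least-squares affine transform from ground control points and reports per-point residuals, which is one simple way to quantify a map's error distribution. The control-point values are invented, and the actual Digital Silk Road algorithms (including the use of control lines and latitude-longitude meshes) are more elaborate than this generic fit.

```python
import numpy as np

def fit_affine(pixel_pts, geo_pts):
    """Least-squares affine transform mapping map/pixel coordinates (x, y)
    to geographic coordinates (lon, lat) from ground control points."""
    pixel_pts = np.asarray(pixel_pts, dtype=float)
    geo_pts = np.asarray(geo_pts, dtype=float)
    A = np.hstack([pixel_pts, np.ones((len(pixel_pts), 1))])  # [x, y, 1] rows
    params, *_ = np.linalg.lstsq(A, geo_pts, rcond=None)      # 3x2 parameters
    return params

def residuals(params, pixel_pts, geo_pts):
    """Per-point registration error: a simple measure of how well a
    historical map agrees with modern coordinates."""
    A = np.hstack([np.asarray(pixel_pts, float), np.ones((len(pixel_pts), 1))])
    return np.linalg.norm(A @ params - np.asarray(geo_pts, float), axis=1)

# Hypothetical control points: (pixel x, pixel y) -> (longitude, latitude)
px = [(0, 0), (1000, 0), (1000, 800), (0, 800), (500, 400)]
geo = [(82.0, 41.5), (82.5, 41.5), (82.5, 41.1), (82.0, 41.1), (82.26, 41.29)]
p = fit_affine(px, geo)
print(residuals(p, px, geo))   # error magnitude at each control point
```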
[SESSION 2] 'A Consideration on Marking up the Madhyāntavibhāgaṭīkā by Using TEI P5: A Complicated Case of a Critical Edition Including Extensive Reconstruction' Koichi Takahashi (Research Fellow, University of Tokyo) This paper will discuss how to mark up the text called Madhyāntavibhāgaṭīkā, a Buddhist text composed in Sanskrit in about the 6th century. The modern critical edition of this work has an extremely strange appearance. The opening part of the edition reads as follows: uttamajanā hi prāyaśoguruṃśraddhādevatāṃcābhyarcyakarmasupravartantaityayamapyahamuttamajanana The text begins in italic type but suddenly switches to roman at the end of the word "karmasu"; after a few lines, however, it changes back to italic. In this way, italic and roman passages alternate throughout this edition of more than 200 pages. The strange structure of the edition is the result of repairing the damaged parts of the Sanskrit manuscript. One third of each leaf of this manuscript, the sole source material for editing this work, is entirely lost. However, a situation peculiar to Buddhist literature made it possible to reconstruct the lost parts of the text. Most of the Buddhist canons and exegeses had been translated into Tibetan by the 12th century, and some modern scholars attempted to restore the lost parts by means of philological investigation of the Tibetan rendering. Dr. S. Yamaguchi, one of these scholars, successfully restored the entire text and issued its critical edition in 1934. This critical edition uses italic type to indicate the reconstruction. As a result, italic lines appear repeatedly throughout his edition, seemingly at random and without any association with the logical structure. In other words, the edition includes a physical structure representing the damaged portions of the manuscript as well as a logical structure shown by paragraphing. In this way, the critical edition of the Madhyāntavibhāgaṭīkā has an extremely complicated structure. In spite of its unusual appearance, it is expected to be digitized because of its philosophical significance. XML seems to be a suitable data format for this purpose, and the TEI P5 Guidelines could make it easier to mark up the text; for example, they provide tag sets for representing both kinds of structure. [Key Words: XML, TEI P5, Sanskrit, Buddhist, Madhyāntavibhāgaṭīkā]
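As a hedged illustration of how such a double structure might be handled, the following Python sketch builds and checks a small TEI P5-style fragment. The choice of <supplied> for the reconstructed (italic) passages, the attribute values, and the idea of splitting spans at paragraph boundaries are assumptions made for illustration, not the encoding the paper itself proposes.

```python
import xml.etree.ElementTree as ET

# A minimal, hypothetical TEI P5 sketch (not the markup proposed in the paper):
# text reconstructed from the Tibetan translation (printed in italics by
# Yamaguchi) is wrapped in <supplied>, while <p> carries the logical paragraph
# structure. A reconstructed span crossing a paragraph boundary would be split
# into one <supplied> element per paragraph.
TEI_SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p>
      <supplied reason="lost" resp="#Yamaguchi"
        >uttamajanā hi prāyaśoguruṃśraddhādevatāṃcābhyarcyakarmasu</supplied>
      pravartantaityayamapyahamuttamajanana
    </p>
  </body></text>
</TEI>"""

root = ET.fromstring(TEI_SAMPLE)                 # well-formedness check
ns = {"tei": "http://www.tei-c.org/ns/1.0"}
for i, p in enumerate(root.findall(".//tei:p", ns), 1):
    supplied = p.findall("tei:supplied", ns)     # reconstructed (italic) spans
    print(f"paragraph {i}: {len(supplied)} reconstructed span(s)")
```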
'A Nordic Tradition for Digital Scholarly Editions?' Espen S. Ore (System Developer, Dept. of Linguistics and Scandinavian Studies, Unit for Digital Documentation, University of Oslo, Norway) Since work on digital editions started in the different Nordic countries more than 20 years ago, there seems to have developed a consensus, or convergence, of editorial styles for digital scholarly editions in the Nordic countries (here defined as Denmark, Finland, Norway and Sweden; for Finnish editions, especially those based on material written in Swedish[1]). In this paper I want to discuss whether this is really the case and to present some fairly recent digital editions from the Nordic countries. The editions still in development or finished within the last five years span texts written from the early 18th century up until recent times - and if work on medieval texts is included, the start of the time span goes even further back. The texts include both published literature and unpublished private writing. One special case here is the written works of the Norwegian painter Edvard Munch. Since 1995 the community working on Nordic textual scholarship has had a formal network with (now) biannual conferences and special workshops[2]. This paper looks at how this level of collaboration may have led to a convergence in how digital editions are built - both in the process of constructing a digital edition and in how the final result is presented. Among the editions discussed in this paper are Ludvig Holberg (No/Da), Søren Kierkegaard (Da), Henrik Ibsen (No), Edvard Munch (No), Zakarias Topelius (Fi), Selma Lagerlöf (Sw) and others. The BEE used its own system for text encoding, MECS[3], and so did the edition of the complete works of Kierkegaard[4], although here, as with Wittgenstein's Nachlass, the proprietary encoding was later translated into XML. Since the start of the project on Henrik Ibsen's Writings (HIW), TEI has been the preferred encoding system. HIW started in 1998 and so used SGML/TEI for its first few years, but this changed to XML/TEI with the introduction of TEI P4. All the Nordic edition projects started after those mentioned above have been based on XML/TEI. A digital edition is not - or not only - classified by the text encoding that is used. Almost all the Nordic digital editions are developed as archives of digital data, where the edition itself is only one of several possible views into the complete archive. The following elements are usually included, at least in the archive: images (facsimiles of manuscripts and/or printed editions); encoded transcriptions of manuscripts and/or printed editions; new editions of the texts, usually stored as encoded text and possibly as facsimiles of a new printed edition; and secondary data of various kinds. While the projects listed so far have been digitally created editions, there has also been a Nordic trend towards national archives of digitally available texts, whether they are published as images of text (PDF and other facsimiles) or as displayed text produced from encoded text files. One of the possible Nordic twists here is that text archives are created and maintained at a national level rather than at a regional or purely institutional level[5]. And for all of them the published edition is not the formal end of the project: the long-term conservation of the archive is a built-in part of the project. This usually means that the digital archive is stored at a national institution such as a national library or a university (universities are mostly government-financed institutions in the Nordic countries), although we might find an exception in Edvard Munch's writings, where the digital archive is stored at the Munch Museum. Insofar as there has been a convergence in digital editions in the Nordic countries, it is that a) certain large editorial projects, which may have national or other funding, develop complete editions of selected authors and build up complex archives of text and images, while b) at the national level there are institutions which provide long-term archives for digital editions, whether they are text editions or facsimile editions. How far this represents a particular Nordic school of digital editions depends on what is done elsewhere. This paper will compare the situation in the Nordic countries with that in some other Western European countries and in North America. [Key Words: Digital editions, textual scholarship, text encoding, text archives]
'Linked Open Data for Chinese Characters' Tomohiko Morioka (Assistant Professor, Center for Informatics in East Asian Studies, Institute for Research in Humanities, Kyoto University) This report describes the RDF mapping of the CHISE character ontology[1], focusing in particular on the representation and usage of Chinese characters (Kanji, Hanzi, etc.). Characters are a basis of various kinds of data. A Chinese character is a logogram: each character basically indicates a morpheme. From this point of view, a Chinese character can be a hub of linguistic and semantic information about the various words and texts written with Chinese characters. The CHISE character ontology is a large-scale character ontology which includes 238 thousand character objects, covering Unicode and non-Unicode characters, their glyphs, etc. It was developed for CHISE (Character Information Service Environment[2]), a character-processing system that does not depend on character codes. The framework of CHISE is based on database processing. It uses CONCORD as its database engine; CONCORD is a prototype-style object-oriented database system based on directed graphs, like RDF. We developed a Web service to display and edit CONCORD objects, called "EsT". EsT was originally designed as an HTML-based Web application; however, we added an RDF mapping and implemented an RDF/XML output feature. Thus the CHISE character ontology can be used as Linked Open Data. As demonstrations of data linking between the character ontology and other information resources, I made two examples: (1) links between oracle-bone characters and rubbings/photographs of their sources (e.g. {fig.1}[3]); (2) links between characters and the "Bibliography of Oriental Studies"[4] (e.g. {fig.2}). From the viewpoint of the character ontology, these links work as sources/examples that describe and identify characters clearly. In addition, they also work as entry points to other information resources. CHISE provides a search service for Chinese characters, named "CHISE IDS Find"[5]. If a user enters one or more components of a Chinese character into the "Character components" window and runs the search, the characters that include all of the specified components are displayed. It is a very useful service for searching Chinese characters, especially characters that are difficult to enter with ordinary input methods designed for everyday characters. The result page of CHISE IDS Find has links to character objects via EsT (the first column of each line is a link to the EsT page displaying detailed information about the character object). Thus users can trace links to other information resources via EsT. It is also useful for search robots and for third-party Web services based on a Web API. From this point of view, the RDF/XML feature of EsT can provide deeper cooperation with third-party applications. In an ordinary HTML page of EsT, there is an RDF button (at the right side of the top line). When the RDF button is clicked, EsT outputs the RDF/XML format instead of HTML. {fig.3} As a future plan, I intend to make links between Chinese characters and morphemes of classical Chinese. This will serve as an example of integrating the character-processing layer and the linguistic-processing layers. This report briefly outlines the concepts of the CHISE character ontology and the basic mechanism of its RDF mapping, and describes its potential as an information hub for other kinds of information resources. [Key Words: Chinese Character, RDF, Linked Open Data, character ontology, character representation]
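As a hedged sketch of how a client might consume such RDF/XML output as Linked Open Data, the following Python example parses a small RDF/XML description of a character object with rdflib and walks its triples. The namespace, property names and URIs are invented for illustration and are not the actual CHISE vocabulary or EsT endpoints.

```python
from rdflib import Graph

# Hypothetical RDF/XML in the style of what an "RDF button" might return for
# one character object; the ex: namespace and property names are invented for
# illustration and do not reproduce the real CHISE ontology.
SAMPLE_RDFXML = """<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/chise-demo#">
  <rdf:Description rdf:about="http://example.org/chise-demo/char/U+6F22">
    <ex:ucs>U+6F22</ex:ucs>
    <ex:idsComponent rdf:resource="http://example.org/chise-demo/char/U+6C35"/>
    <ex:seeAlso rdf:resource="http://example.org/oracle-bone/rubbing/123"/>
  </rdf:Description>
</rdf:RDF>"""

g = Graph()
g.parse(data=SAMPLE_RDFXML, format="xml")   # same parser a client would apply
                                            # to an RDF/XML response

# Each triple can point onward to another Linked Data resource (a component
# character, a rubbing photograph, a bibliography entry, and so on).
for s, p, o in g:
    print(s, p, o)
```

In practice a crawler or third-party application would fetch such descriptions over HTTP and follow the resource URIs from one object to the next, which is the "information hub" role described above.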
[PANEL 1] 'Semantic Uplift in the Digital Humanities' Cormac Hampson (Knowledge and Data Engineering Group, Trinity College Dublin, Ireland) The rise in digital humanities research has led to a huge increase in the digitisation of cultural heritage artefacts. However, the resultant data often resides in independent silos that do not share their rich information or connect to other relevant collections. This lack of connectivity reduces the impact of the digitisation process and makes it more challenging to find potential synergies between dispersed artefacts. A major obstacle to providing tangible links between digital collections is the proliferation of unstructured text and insufficiently rich metadata. Trinity College Dublin is directly addressing this challenge on a number of fronts, as exemplified by its leading role in three European Commission-funded projects: CULTURA[1], CENDARI[2] and DigCurV[3]. This panel session will discuss the latest research on semantic uplift (best-practice education of curators, entity extraction, ontology mapping, etc.) being performed at Trinity College Dublin, and explore its relevance to the wider digital humanities community.
[PLENARY 1] 'A new stage in the development of cultural resource databases' Ellis Tinios (Honorary Lecturer in History, University of Leeds, UK) Hitherto, institutions such as libraries, museums and universities have developed cultural resource databases to catalogue and manage materials in their possession. A logical progression from this start has been to link these institutional databases to form union catalogues. At the same time, specialists in certain subject areas have created very detailed but narrowly focused databases to support their own work. Researchers in the humanities have led the way in the creation of both personal and institutional databases. Recently, new capabilities have been developed that carry significant implications for future database development and use: associative search systems, image-matching systems and the like. These technologies have moved beyond the experimental stage and are now ready for practical application. They promise a higher level of functionality than was previously possible in databases. Researchers in the humanities must now reassess their approach to archiving and data management in order to make the most of the new functionalities made possible by these developments. This panel brings together two scholars in Japanese cultural studies who are heavy users of databases and two leading developers of the next generation of databases to explore future prospects.