ABSTRACT
Sept. 20 (Fri.)
[SESSION 1] 'The Language of Tragedy: Tracking a Shifting Target in a Historical Corpus' Jonathan Hope (Professor of Literary Linguistics at Strathclyde University, Glasgow, UK) In a number of papers, Michael Witmore and I have used digital techniques to explore the relationship between genre and language in Shakespeare (2004, 2007, 2010, 2012). Perhaps our most significant, and surprising, finding to date has been the very strong relationship that can be identified between literary genre (which we take to be a high-level attribute of texts) and micro-linguistic features (a low-level attribute of texts).
[Key Words: Text analytics, Early Modern English, genre theory, visualisation, data mining]
Jonathan Hope and Michael Witmore, 2004, 'The very large textual object: a prosthetic reading of Shakespeare', Early Modern Literary Studies 9.3 / Special Issue 12: 6.1-36, http://www.shu.ac.uk/emls/09-3/hopewhit.htm
'Different Characteristics of Variant Readings Based on Comparison of Major Textual Similarity Measures' Maki Miyake (Associate Professor, Graduate School of Language and Culture, Osaka University) In this study, we present a computational approach to textual variant analysis, focusing specifically on influential modern editions of the Greek New Testament. As the Nestle-Aland Greek New Testament is considered one of the most useful critical resources for current biblical studies, we set the 27th edition as the norm text. Even though this is not the latest version (the 28th edition was released last September), it is still widely recognized as the primary source. We compared the 27th edition with modern editions such as Westcott-Hort (1881), Scrivener (1894) and the Robinson-Pierpont Majority Text (2005) within the Synoptic Gospels; the Gospels of Matthew, Mark and Luke are similar in structure, content, and wording.
The similarity measures were computed over contiguous sequences of words ranging in size from 4 to 10 tokens. We also performed the calculation verse by verse, which is equivalent to doing it line by line. We applied several promising similarity measures to investigate the differences between the editions. The major word-matching algorithms commonly used in information theory were employed to calculate the differences between words. More specifically, we made use of two string-based measures, Levenshtein distance and Jaro-Winkler distance, and of cosine similarity as a token-based measure. Levenshtein distance is a well-known edit distance that counts simple edit operations such as insertion, deletion, and substitution. Jaro-Winkler distance, in turn, handles more complex edit operations, including character transpositions. Cosine similarity measures the angle between two token sets represented as vectors. Alongside these classical measures, we focus on a fuzzy matching technique called FCosine that was recently proposed by Wang et al. (2011). The FCosine measure combines Levenshtein distance and cosine similarity; this hybrid method is capable of approximate string matching because it takes advantage of both edit-based and token-based distances. By comparing the statistical results, we explored which aspects account for the differences between variant readings. At the same time, we discussed how effectively each similarity measure captures the characteristics of the variants. [Key Words: Textual Variants, Edit Distance, Cosine Similarity]
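The following Python sketch illustrates the kinds of measures named above: a Levenshtein edit distance, a token-based cosine similarity, and a simple normalized hybrid of the two. The whitespace tokenization, the equal weighting in the hybrid score, and the sample strings are illustrative assumptions; in particular, the hybrid is only in the spirit of the FCosine measure of Wang et al. (2011), not their exact algorithm, and Jaro-Winkler distance is omitted for brevity.

```python
import math
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between token-count vectors of two verses."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def hybrid_score(a: str, b: str) -> float:
    """Illustrative hybrid of edit-based and token-based similarity
    (in the spirit of fuzzy token matching, not Wang et al.'s exact FCosine)."""
    edit_sim = 1 - levenshtein(a, b) / max(len(a), len(b), 1)
    return 0.5 * edit_sim + 0.5 * cosine_similarity(a, b)

# Example: comparing the same (hypothetical) verse in two editions
print(hybrid_score("εν αρχη ην ο λογος", "εν αρχη ην ο λογος και"))
```

In this sketch the edit-based term is sensitive to small spelling variants within words, while the token-based term is sensitive to added or omitted words; combining them is one simple way to approximate the behaviour described for the hybrid measure.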
'Data Criticism: a Methodology for the Quantitative Evaluation of Non-Textual Historical Sources with Case Studies on Silk Road Maps and Photographs' Asanobu Kitamoto (Associate Professor, National Institute of Informatics) This paper proposes the concept of "data criticism" and demonstrates how it can be applied to non-textual historical sources such as maps and photographs. Data criticism deals with the evaluation of sources in the context of historical studies, which is the same role that textual criticism plays. Nevertheless, the criticism of data, especially of non-textual (visual) sources such as maps and photographs, has not been well studied, owing to the lack of a methodology for quantitative evaluation. Visual sources are essential for the spatial interpretation of historical facts, but the necessity of critical methods has gone relatively unnoticed. We believe that this is due to the appearance of visual sources, which look like records of fact because their creation depends heavily on objective techniques and tools such as surveying and the camera. We show, however, that visual sources are not facts, and that their quality needs to be evaluated before they are used as reliable sources, in the same way as text. Hence we use old maps and photographs as case studies to demonstrate how data criticism can bring about new perspectives on the interpretation of visual sources. We first introduce the computational part of data criticism. We developed a suite of geo-referencing techniques for maps, including support for different map projections, one-point versus entire registration of maps, and the preservation of linear features of maps such as streets. This process is typically performed by the built-in geometric correction in geographic information systems (GIS), such as rubber-sheeting, but we show that these techniques have room for improvement, and we propose new algorithms using ground control points and lines, or the latitude-longitude mesh on the map. The result is integrated into web-based tools, on top of geo-browsers such as Google Maps and Google Earth, and we also developed a new interface, "mappinning" (map + pinning), for performing one-point registration on the fly. We then discuss the historical part of data criticism; namely, how data criticism can lead to new interpretations of visual sources and to the discovery of new historical facts. First, we criticized Central Asian maps from the European expeditions of Stein, Hedin and others. We quantified the distribution of errors and explained their causes in the context of the technical limitations of the survey technology available at the time. Second, we geo-referenced the old map of Beijing, the "Complete Map of Peking, Qianlong Period," created about 250 years ago. We identified misalignments, possibly due to incorrect restoration in the past, and revealed for the first time the original form of the map by connecting its 203 sheets on the basis of our proposed geometric correction method. Third, we focused on ruins located in a few regions of the Tarim Basin and identified conceptual relationships between ruins that appear under different names and types in expedition reports and in current survey reports. This suggests that a consistent interpretation of facts is possible even when the text is inconsistent across sources. In the age of data, historical studies should incorporate and integrate a larger variety of data, and we believe that 'data criticism' is a key methodology for the proper interpretation of data and the discovery of historical facts. It is a natural extension of textual criticism, which has been studied for a long time, and it addresses the same central concern of historical studies; namely, the evaluation of the value of sources. Finally, most of the maps and photographs introduced in this paper are accessible at the Digital Silk Road website, http://dsr.nii.ac.jp/geography/ [Key Words: Data criticism, old maps, historical sources, quantitative evaluation, Silk Road]
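As a rough illustration of the computational side of geo-referencing, the sketch below fits a least-squares affine transform from ground control points and reports per-point residuals, which is one simple way to quantify a map's error distribution. The control-point values are invented, and the actual Digital Silk Road algorithms (including the use of control lines and latitude-longitude meshes) are more elaborate than this generic fit.

```python
import numpy as np

def fit_affine(pixel_pts, geo_pts):
    """Least-squares affine transform mapping map/pixel coordinates (x, y)
    to geographic coordinates (lon, lat) from ground control points."""
    pixel_pts = np.asarray(pixel_pts, dtype=float)
    geo_pts = np.asarray(geo_pts, dtype=float)
    A = np.hstack([pixel_pts, np.ones((len(pixel_pts), 1))])  # [x, y, 1] rows
    params, *_ = np.linalg.lstsq(A, geo_pts, rcond=None)      # 3x2 parameters
    return params

def residuals(params, pixel_pts, geo_pts):
    """Per-point registration error: a simple measure of how well a
    historical map agrees with modern coordinates."""
    A = np.hstack([np.asarray(pixel_pts, float), np.ones((len(pixel_pts), 1))])
    return np.linalg.norm(A @ params - np.asarray(geo_pts, float), axis=1)

# Hypothetical control points: (pixel x, pixel y) -> (longitude, latitude)
px = [(0, 0), (1000, 0), (1000, 800), (0, 800), (500, 400)]
geo = [(82.0, 41.5), (82.5, 41.5), (82.5, 41.1), (82.0, 41.1), (82.26, 41.29)]
p = fit_affine(px, geo)
print(residuals(p, px, geo))   # error magnitude at each control point
```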
[SESSION 2] 'A Consideration on Marking up the Madhyāntavibhāgaṭīkā by Using TEI P5: A Complicated Case of a Critical Edition Including Extensive Reconstruction' Koichi Takahashi (Research Fellow, University of Tokyo) This paper will discuss how to mark up the text called Madhyāntavibhāgaṭīkā, a Buddhist text composed in Sanskrit in about the 6th century. The modern critical edition of this work has an extremely strange appearance. The opening part of the edition reads as follows: uttamajanā hi prāyaśoguruṃśraddhādevatāṃcābhyarcyakarmasupravartantaityayamapyahamuttamajanana The text begins in italic type but suddenly switches to roman at the end of the word "karmasu"; after a few lines, however, it changes back to italic. In this way, italic and roman passages alternate throughout this edition of more than 200 pages. The strange structure of the edition is the result of repairing the damaged parts of the Sanskrit manuscript. One third of each leaf of this manuscript, the sole source material for editing this work, is entirely lost. However, a situation peculiar to Buddhist literature made it possible to reconstruct the lost parts of the text. Most of the Buddhist canons and exegeses had been translated into Tibetan by the 12th century, and some modern scholars attempted to restore the lost parts by means of philological investigation of the Tibetan rendering. Dr. S. Yamaguchi, one of these scholars, successfully restored the entire text and issued its critical edition in 1934. This critical edition uses italic type to indicate the reconstruction. As a result, italic lines appear repeatedly throughout his edition, seemingly at random and without any association with the logical structure. In other words, the edition includes a physical structure representing the damaged portions of the manuscript as well as a logical structure shown by paragraphing. In this way, the critical edition of the Madhyāntavibhāgaṭīkā has an extremely complicated structure. In spite of its unusual appearance, it is expected to be digitized because of its philosophical significance. XML seems to be a suitable data format for this purpose, and the TEI P5 Guidelines could make it easier to mark up the text; for example, they provide tag sets for representing both kinds of structure. [Key Words: XML, TEI P5, Sanskrit, Buddhist, Madhyāntavibhāgaṭīkā]
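As a hedged illustration of how such a double structure might be handled, the following Python sketch builds and checks a small TEI P5-style fragment. The choice of <supplied> for the reconstructed (italic) passages, the attribute values, and the idea of splitting spans at paragraph boundaries are assumptions made for illustration, not the encoding the paper itself proposes.

```python
import xml.etree.ElementTree as ET

# A minimal, hypothetical TEI P5 sketch (not the markup proposed in the paper):
# text reconstructed from the Tibetan translation (printed in italics by
# Yamaguchi) is wrapped in <supplied>, while <p> carries the logical paragraph
# structure. A reconstructed span crossing a paragraph boundary would be split
# into one <supplied> element per paragraph.
TEI_SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p>
      <supplied reason="lost" resp="#Yamaguchi"
        >uttamajanā hi prāyaśoguruṃśraddhādevatāṃcābhyarcyakarmasu</supplied>
      pravartantaityayamapyahamuttamajanana
    </p>
  </body></text>
</TEI>"""

root = ET.fromstring(TEI_SAMPLE)                 # well-formedness check
ns = {"tei": "http://www.tei-c.org/ns/1.0"}
for i, p in enumerate(root.findall(".//tei:p", ns), 1):
    supplied = p.findall("tei:supplied", ns)     # reconstructed (italic) spans
    print(f"paragraph {i}: {len(supplied)} reconstructed span(s)")
```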
'A Nordic Tradition for Digital Scholarly Editions?' Espen S. Ore (System Developer, Dept. of Linguistics and Scandinavian Studies, Unit for Digital Documentation, University of Oslo, Norway) Since work on digital editions started in the different Nordic countries more than 20 years ago, there seems to have developed a consensus, or convergence, of editorial styles for digital scholarly editions in the Nordic countries (here defined as Denmark, Finland, Norway and Sweden; for Finnish editions, especially those based on material written in Swedish[1]). In this paper I want to discuss whether this is really the case and to present some fairly recent digital editions from the Nordic countries. The editions still in development or finished within the last five years span texts written from the early 18th century up until recent times - and if work on medieval texts is included, the start of the time span goes even further back. The texts include both published literature and unpublished private writing. One special case here is the written works of the Norwegian painter Edvard Munch. Since 1995 the community working on Nordic textual scholarship has had a formal network with (now) biannual conferences and special workshops[2]. This paper looks at how this level of collaboration may have led to a convergence in how digital editions are built - both in the process of constructing a digital edition and in how the final result is presented. Among the editions discussed in this paper are Ludvig Holberg (No/Da), Søren Kierkegaard (Da), Henrik Ibsen (No), Edvard Munch (No), Zakarias Topelius (Fi), Selma Lagerlöf (Sw) and others. The BEE used its own system for text encoding, MECS[3], and so did the edition of the complete works of Kierkegaard[4], although here, as with Wittgenstein's Nachlass, the proprietary encoding was later translated into XML. Since the start of the project on Henrik Ibsen's Writings (HIW), TEI has been the preferred encoding system. HIW started in 1998 and so used SGML/TEI for its first few years, but this changed to XML/TEI with the introduction of TEI P4. All the Nordic edition projects started after those mentioned above have been based on XML/TEI. A digital edition is not - or not only - classified by the text encoding that is used. Almost all the Nordic digital editions are developed as archives of digital data, where the edition itself is only one of several possible views into the complete archive. The following elements are usually included, at least in the archive: images (facsimiles of manuscripts and/or printed editions); encoded transcriptions of manuscripts and/or printed editions; new editions of the texts, usually stored as encoded text and possibly as facsimiles of a new printed edition; and secondary data of various kinds. While the projects listed so far have been digitally created editions, there has also been a Nordic trend towards national archives of digitally available texts, whether they are published as images of text (PDF and other facsimiles) or as displayed text produced from encoded text files. One of the possible Nordic twists here is that text archives are created and maintained at a national level rather than at a regional or purely institutional level[5]. And for all of them the published edition is not the formal end of the project: the long-term conservation of the archive is a built-in part of the project. This usually means that the digital archive is stored at a national institution such as a national library or a university (universities are mostly government-financed institutions in the Nordic countries), although we might find an exception in Edvard Munch's writings, where the digital archive is stored at the Munch Museum. Insofar as there has been a convergence in digital editions in the Nordic countries, it is that a) certain large editorial projects, which may have national or other funding, develop complete editions of selected authors and build up complex archives of text and images, while b) at the national level there are institutions which provide long-term archives for digital editions, whether they are text editions or facsimile editions. How far this represents a particular Nordic school of digital editions depends on what is done elsewhere. This paper will compare the situation in the Nordic countries with that in some other Western European countries and in North America. [Key Words: Digital editions, textual scholarship, text encoding, text archives]
'Linked Open Data for Chinese Characters' Tomohiko Morioka (Assistant Professor, Center for Informatics in East Asian Studies, Institute for Research in Humanities, Kyoto University) This report describes the RDF mapping of the CHISE character ontology[1], focusing in particular on the representation and usage of Chinese characters (Kanji, Hanzi, etc.). Characters are a basis of various kinds of data. A Chinese character is a logogram: each character basically indicates a morpheme. From this point of view, a Chinese character can be a hub of linguistic and semantic information about the various words and texts written with Chinese characters. The CHISE character ontology is a large-scale character ontology which includes 238 thousand character objects, covering Unicode and non-Unicode characters, their glyphs, etc. It was developed for CHISE (Character Information Service Environment[2]), a character-processing system that does not depend on character codes. The framework of CHISE is based on database processing. It uses CONCORD as its database engine; CONCORD is a prototype-style object-oriented database system based on directed graphs, like RDF. We developed a Web service to display and edit CONCORD objects, called "EsT". EsT was originally designed as an HTML-based Web application; however, we added an RDF mapping and implemented an RDF/XML output feature. Thus the CHISE character ontology can be used as Linked Open Data. As demonstrations of data linking between the character ontology and other information resources, I made two examples: (1) links between oracle-bone characters and rubbings/photographs of their sources (e.g. {fig.1}[3]); (2) links between characters and the "Bibliography of Oriental Studies"[4] (e.g. {fig.2}). From the viewpoint of the character ontology, these links work as sources/examples that describe and identify characters clearly. In addition, they also work as entry points to other information resources. CHISE provides a search service for Chinese characters, named "CHISE IDS Find"[5]. If a user enters one or more components of a Chinese character into the "Character components" window and runs the search, the characters that include all of the specified components are displayed. It is a very useful service for searching Chinese characters, especially characters that are difficult to enter with ordinary input methods designed for everyday characters. The result page of CHISE IDS Find has links to character objects via EsT (the first column of each line is a link to the EsT page displaying detailed information about the character object). Thus users can trace links to other information resources via EsT. It is also useful for search robots and for third-party Web services based on a Web API. From this point of view, the RDF/XML feature of EsT can provide deeper cooperation with third-party applications. In an ordinary HTML page of EsT, there is an RDF button (at the right side of the top line). When the RDF button is clicked, EsT outputs the RDF/XML format instead of HTML. {fig.3} As a future plan, I intend to make links between Chinese characters and morphemes of classical Chinese. This will serve as an example of integrating the character-processing layer and the linguistic-processing layers. This report briefly outlines the concepts of the CHISE character ontology and the basic mechanism of its RDF mapping, and describes its potential as an information hub for other kinds of information resources. [Key Words: Chinese Character, RDF, Linked Open Data, character ontology, character representation]
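As a hedged sketch of how a client might consume such RDF/XML output as Linked Open Data, the following Python example parses a small RDF/XML description of a character object with rdflib and walks its triples. The namespace, property names and URIs are invented for illustration and are not the actual CHISE vocabulary or EsT endpoints.

```python
from rdflib import Graph

# Hypothetical RDF/XML in the style of what an "RDF button" might return for
# one character object; the ex: namespace and property names are invented for
# illustration and do not reproduce the real CHISE ontology.
SAMPLE_RDFXML = """<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/chise-demo#">
  <rdf:Description rdf:about="http://example.org/chise-demo/char/U+6F22">
    <ex:ucs>U+6F22</ex:ucs>
    <ex:idsComponent rdf:resource="http://example.org/chise-demo/char/U+6C35"/>
    <ex:seeAlso rdf:resource="http://example.org/oracle-bone/rubbing/123"/>
  </rdf:Description>
</rdf:RDF>"""

g = Graph()
g.parse(data=SAMPLE_RDFXML, format="xml")   # same parser a client would apply
                                            # to an RDF/XML response

# Each triple can point onward to another Linked Data resource (a component
# character, a rubbing photograph, a bibliography entry, and so on).
for s, p, o in g:
    print(s, p, o)
```

In practice a crawler or third-party application would fetch such descriptions over HTTP and follow the resource URIs from one object to the next, which is the "information hub" role described above.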
[PANEL 1] 'Semantic Uplift in the Digital Humanities' Cormac Hampson (Knowledge and Data Engineering Group, Trinity College Dublin, Ireland) The rise in digital humanities research has led to a huge increase in the digitisation of cultural heritage artefacts. However, the resultant data often resides in independent silos that do not share their rich information or connect to other relevant collections. This lack of connectivity reduces the impact of the digitisation process and makes it more challenging to find potential synergies between dispersed artefacts. A major obstacle to providing tangible links between digital collections is the proliferation of unstructured text and insufficiently rich metadata. Trinity College Dublin is directly addressing this challenge on a number of fronts, as exemplified by its leading role in three European Commission-funded projects: CULTURA[1], CENDARI[2] and DigCurV[3]. This panel session will discuss the latest research on semantic uplift (best-practice education of curators, entity extraction, ontology mapping, etc.) being performed at Trinity College Dublin, and explore its relevance to the wider digital humanities community.
[PLENARY 1] 'A new stage in the development of cultural resource databases' Ellis Tinios (Honorary Lecturer in History, University of Leeds, UK) Hitherto, institutions such as libraries, museums and universities have developed cultural resource databases to catalogue and manage materials in their possession. A logical progression from this start has been to link these institutional databases to form union catalogues. At the same time, specialists in certain subject areas have created very detailed but narrowly focused databases to support their own work. Researchers in the humanities have led the way in the creation of both personal and institutional databases. Recently, new capabilities have been developed that carry significant implications for future database development and use: associative search systems, image-matching systems and the like. These technologies have moved beyond the experimental stage and are now ready for practical application. They promise a higher level of functionality than was previously possible in databases. Researchers in the humanities must now reassess their approach to archiving and data management in order to make the most of the new functionalities made possible by these developments. This panel brings together two scholars in Japanese cultural studies who are heavy users of databases and two leading developers of the next generation of databases to explore future prospects.