Online Access to Manuscripts of Byzantine Chant

Louis W. G. Barton, The University of Oxford, St Anne's College, Oxford OX2 6HS, U.K.; louis.barton.je.77@aya.yale.edu
Constantine J. Terzopoulos, Metropolis of Hydra, Spetsai and Aegina, Greece; frc@psaltiki.net
Julia Craig-McFeely, Royal Holloway, University of London, Digital Image Archive of Medieval Music, Egham, Surrey TW20 0EX, U.K.; j.craig-mcfeely@rhul.ac.uk
ABSTRACT
The manuscripts of Byzantine and post-Byzantine ecclesiastical chant are of foundational importance for many fields of study. In this paper, we advocate the archiving of manuscripts that contain musical notation of Eastern Orthodox chants or hymns as digital images. We summarise best practices for making digital photographs, such that handling of manuscripts may be kept to a minimum while the needs of liturgists, chant scholars, palaeographers, and others are met. We propose a software architecture for building Web-based, distributed digital libraries, such that the rights of holders of manuscripts are protected but the practical needs of end-users are accommodated. We discuss the problem of metadata ontology for the indexing of digital libraries, whereby searching for manuscript images on the Web will produce targeted and thorough results but, at the same time, library cataloguers will not be limited by a standard ontology that might be inappropriate to their requirements or intentions.
A vast resource of archaic manuscripts containing neume notation of music has survived as a legacy of the past thousand years or more. Most of these manuscripts have been inaccessible for use in liturgy or for scholarly study, due to their physical isolation, concerns over the fragility of these irreplaceable artefacts, and the all-too-frequent tragedy of thefts from monastic libraries. The advent of the Internet and Worldwide Web, progress in the technology of digital photography, and growing sensitivity to the preservation of culture outside the popular, commercial media have provided our generation an opportunity to evangelise the ancient chants and notations more widely than has been possible before.
Keywords Byzantine chant manuscripts, digital photography, distributed digital image libraries, metadata ontologies, decentralisation.
1. INTRODUCTION
The Church has always lived in a mystic link between earth and heaven. One's experiences of the earthly side of this link—the beauty of Church buildings, the solemnity of the divine liturgy, icons of saints and angels, prayers, readings from Scripture, the scent of incense, and so on—are all anagogic, pointing us to this reality. No less are the archaic manuscripts of ecclesiastic song tangible expressions of this link. They point us to a reality that our earthly eyes have not seen, nor our earthly ears heard. The rich treasure of hymns, chant, and notated manuscripts that is our inheritance from preceding generations has, however, been almost completely neglected during the past century [12]. Various trends have caused this neglect, the most pervasive of which has been the continuous turning toward American and Western European cultural forms. In particular, the art of reading and writing in traditional forms of neume notation has been neglected in favour of Western common-practice (or 'modern') musical notation.
Figure 1. MS. Athos, Laura Gamma 67, f. 65v. Image: microfilm at Monumenta Musicae Byzantinae [9].
An object lesson in why digital archiving is important can be drawn from the loss of the manuscript MS. Athos Laura Gamma 67 (the so-called 'Chartres fragment'). This early 11th-century manuscript of Middle Byzantine neume notation was destroyed in WWII by the Allied bombing of Chartres, France, in May 1944. All that remains of it today is a black-and-white photographic record of mediocre quality (cf., Figure 1, showing the sticheron "He sophia tou theou" in fourth plagal mode). It is impossible to know what misfortunes may befall other neumed manuscripts in the future: fire at monasteries, acts of terrorism, nuclear war, and so on. This, in itself, is a compelling reason for archiving these manuscripts as high-quality digital images; the images should be stored in several, widely-separated places to increase the chance that at least one copy will survive to future generations.
This research project is supported by Eduserv.
2. DIGITAL PHOTOGRAPHY
We prefer digital photography over film photography or digital scanning. [In the PDF version, 'Zoom In' on the starred images, below, for more detail.]
2.1 Pixel Resolution and Colour Depth
Both the 'resolution' (i.e., the number of picture elements, or pixels, recorded per unit of length) and the 'colour depth' (i.e., the number of distinct colours that can be represented) of a photograph can—depending on the photographic equipment used—be extremely high, going well beyond the limitations of human vision. This allows images to be digitally magnified with little or no loss of clarity. Such magnification might be unimportant when one is just reading the liturgical text and melos, but it can be quite important for studies of scribal handwriting, of ink, of paper or parchment, and so on. Such details can be crucial for dating a manuscript, for distinguishing where a later scribe added to or changed the writing, etc. In cases where a manuscript has suffered damage from water, fire, mould, insects, etc., or where scribal corrections were made, tiny gradations of colour may be digitally filtered to reveal details that are imperceptible to the human eye.
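As an illustration of such filtering, the short sketch below stretches the contrast of one colour channel so that narrow gradations fill the whole tonal range. It is a minimal example in Python (using the Pillow and NumPy libraries) under our own simplifying assumptions, not the method of any particular project:

```python
import numpy as np
from PIL import Image

def stretch_channel_contrast(path, channel=0, lo_pct=2, hi_pct=98):
    """Amplify tiny gradations in one colour channel by stretching a
    narrow intensity band to the full 0-255 range (illustrative only;
    real ink analysis is considerably more sophisticated)."""
    arr = np.asarray(Image.open(path).convert("RGB"), dtype=np.float64)
    band = arr[:, :, channel]                 # 0=red, 1=green, 2=blue
    lo, hi = np.percentile(band, [lo_pct, hi_pct])
    if hi <= lo:                              # guard against flat images
        hi = lo + 1.0
    stretched = np.clip((band - lo) / (hi - lo) * 255.0, 0.0, 255.0)
    return Image.fromarray(stretched.astype(np.uint8), mode="L")
```

Note that such filtering presupposes a source image of full colour depth; a grayscale photograph (cf., Figure 5) leaves little for the filter to work with.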
Figure 3. Kastoria Cathedral Library, MS. 8, f. 40v; detail.
Digital magnification of this image provides no more exact visual information than one sees at normal size (Figure 4 shows an enlargement of this photograph at 3x magnification).
Figure 4. MS. Kastoria 8, f. 40v; photo at 3x magnification.
Figure 2. Kastoria Cathedral Library, MS. 8, ff. 40v-41r.
Figure 2 shows two folios of the highly important Kastoria 8 manuscript. It is a 14th-century asmatikon consisting of 83 folios (its size is 21.5 cm tall by 15 cm wide); it is unique in having Middle Byzantine round notation written in parallel with a more ancient, Palaeo-Byzantine system of notation with Great Signs (μεγάλα σημάδια)—a writing system that remains undeciphered today. Generally, manuscripts containing parallel notations (often called 'double vorlagen') are of special importance for studying the development and evolution of notations (another example is Sinai MS. 8). Kastoria 8 is like the Rosetta stone in that it may enable scholars eventually to decipher the Great Signs, and so re-establish knowledge of the Church's ancient liturgical practices. This black-and-white film photograph (shown at reduced size in Figure 2) is the best-quality image currently available; an excerpt of the photograph is shown full-size in Figure 3. It is typical of the poor-quality images that scholars have to deal with when studying Eastern Orthodox neumed manuscripts. The colour depth is just shades of grey, and so it is quite difficult to distinguish between scribal ink and damage or imperfections of the writing surface. Digital filtering of this image by computer software to reveal obscure details would also be practically impossible.
Figure 5 shows an excerpt of MS. Grottaferrata (Italy), Cryptensis Γ.γ.VI, f. 90, as a grayscale image that has been digitally filtered. Instead of helping, grayscale filtering has diminished the detail.
Figure 5. Digital filtering applied to a grayscale photograph. Today, one can find on the Web some full-colour, digital images of Eastern Orthodox neumed manuscripts, but in many cases the resolution of these images is insufficient for scholarly study. We found, for instance, an image of Panteleimon Monastery Codex 1013, f. 258, online (an excerpt is shown in Figure 6 at full scale). The image shows the beauty of this 19th-century manuscript to the glory of God, but the resolution is too low even for reading of the text or for transcription of the melos from this image.
2.2 Cradling, Lighting, and Camera Angle
Figure 9 shows elements of a portable kit for on-site photography. Obviously, a paramount concern is that the handling of a manuscript shall never cause any damage. For this reason, and to hold the camera at the correct vertical angle, an ordinary copy-stand or a specially-built tool is used. A cradle is excellent for a photo lab, but cradles usually are infeasible to carry in portable kits. It is best to photograph an entire codex at one time (rather than selected folios), so that the manuscript need be handled only once for photographing.
Figure 6. Panteleimon Monastery, MS. 1013, f. 258; excerpt. Image: from the website of the Hellenic Ministry of Culture. Size: 59 kB. [Image is no longer online at the HMC website.]
We recommend that, if manuscripts are subjected to handling for the purpose of photography, the best available photographic techniques ought to be used in the first instance; it should not be necessary to handle the manuscript again in the future to make new images of it. Perhaps an entirely new type of imaging technology that emerges (such as 3-dimensional holography) might warrant re-imaging of some manuscripts. In special cases, other techniques that exist today, such as UV (ultraviolet light) imaging, infrared reflectography, or X-ray imaging (cf., Figures 7 and 8), might warrant re-handling [7]. The current technology of digital imaging, however, produces images that look quite realistic and that satisfy most requirements for image-data analysis.

A practical limit on resolution and colour depth is storage space: higher quality requires substantially more data storage. (Transmission time must also be considered in the case of Web-deployment of images.) Just one image in TIFF format can require 300-400 megabytes of storage; a standard CD would hold just two such images. One must realistically assess the cost of long-term data storage in media, controlled environment, and security. A sensible balance between image quality and data quantity is given in our checklist, below.

For data analysis, we find that colour depth is almost as important as resolution: these two factors, plus focus, are interrelated when discriminating manuscript features by small differences of ink colour—tiny gradations that are imperceptible to the human eye. These gradations are valuable in many kinds of image analysis. When the file size of digital images is a constraining consideration, normally we decrease the resolution rather than reducing the colour depth.
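To make the storage arithmetic concrete: the uncompressed size of an image is simply its pixel count times bytes per pixel. A small illustrative helper follows (the page dimensions are those of Kastoria 8 from Section 2.1; the resolution and bit-depth values are example settings, not prescriptions, and larger manuscripts at higher settings reach the file sizes cited above):

```python
def uncompressed_size_mb(width_cm, height_cm, dpi, bits_per_pixel):
    """Estimate uncompressed image size in megabytes. Size grows with
    the square of the resolution, which is why archival masters are
    so much larger than Web-deployed copies."""
    px_w = width_cm / 2.54 * dpi              # centimetres to pixels
    px_h = height_cm / 2.54 * dpi
    return px_w * px_h * bits_per_pixel / 8 / 1e6

print(uncompressed_size_mb(21.5, 15.0, 400, 24))    # ~24 MB per folio
print(uncompressed_size_mb(21.5, 15.0, 1200, 48))   # ~432 MB per folio
```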
Figure 7. UV imaging. Image: Yale University, Beinecke Library.
Figure 8. Comparative wavelengths for imaging.
Usually, each codex has a ‘natural position’ in which it can be laid open, such that a full page can be photographed, and distortion is not introduced at the gutter of the codex’s binding (see, Figure 10).
Figure 9. Portable kit with custom manuscript holder [4]. Remark the use of a non-reflective, black background.
Figure 10. Image distortion near the codex's binding (right). Source: MS. Poissy Antiphonal, f. 46v (Latin chant). Image from: http://www.lib.latrobe.edu.au/MMDB/MusicDBDB/.
The capture device must be held absolutely still by a mechanical armature directly above the subject. Exposure time may be long (four to six minutes), and the slightest vibration is significant. The image must not be skewed by the angle of the capture device to the subject. Figure 11 shows image distortion that resulted from a hand-held camera, which was not directly above the subject.

Figure 11. Image distortion caused by oblique camera angle. Source: MS. Sim Kar 59, ff. 2v-3r (neon anastasimatarion; 1860).
Correct lighting is crucial for realising high-quality images. Wide-spectrum, artificial lighting is generally better than natural daylight, because one has more control for even light across the whole subject without any glare. Daylight also contains harmful UV (ultraviolet) rays, which are excluded from daylight-balanced artificial lights. Light is best applied indirectly via reflectors (as portrait photographers normally do). Figure 12 shows a 'lighting hot spot' in the centre of the image, which was caused by glare of direct lighting on the subject.

Figure 12. 'Hot spot' in an image, caused by uneven light. Source: National Library, EBE 2406, f. 221v (papadike; 1453).

2.3 Reference Metrics
Each photograph should have included in it (but not overlapping the subject) standard metrics of physical dimensions and of colour registration. This is important so that the image can be accurately calibrated; it can be crucial in automated digital processing; and it forms an integral part of the archival record. An example of the industry-standard metrics is shown in Figure 13, to the right of the subject (in this example, the edge of a vellum manuscript).
2.4 Checklist for Digital Photography
This checklist was written for the 'NEUMES 2006' conference [http://purl.oclc.org/SCRIBE/NEUMES/conference2006/] by Dr Craig-McFeely. Remark especially that the JPEG image format is totally excluded from the 'workflow'. This means that archival images and working drafts shall not be saved in '.jpg' format; doing so would cause permanent loss of detail in the saved image. (A sketch for checking delivered files against this list follows below.)

- Specify that you want High Resolution (and say specifically 400 dpi at real size, or higher).
- PNG or uncompressed TIFF format, both at capture and in delivery; NO JPEGs in the workflow.
- 8-bit (or 24-bit, depending on library protocols for describing bit-depth) RGB colour as the minimum colour depth (this is the standard setting). Colour depth of 16-bit or 48-bit creates much larger images, but also stores much more colour information.
- Colour profile of the capture device embedded in the image.
- Industry-standard scales of size, colour, and grayscale (Figures 13a and b) photographed beside the image but not touching it.
- White and black points, colour correction, and contrast adjusted at the time of capture (not after).
- Image shall be sharp (in focus) to the finest level of detail.
- NO unsharp mask shall be applied during capture.
- NO unsharp mask applied after capture.
- NO rotation, deskewing, reshaping, levels, colour or exposure adjustment after capture (i.e., no image adjustments of any kind after capture).
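Several of these points can be checked mechanically when image files are delivered. The following is a minimal sketch in Python using the Pillow library; it is a heuristic aid under our assumptions about how dpi and colour profiles are recorded in the file, not a substitute for visual inspection:

```python
from PIL import Image  # Pillow library

def check_archival_image(path):
    """Flag obvious departures from the capture checklist:
    file format, recorded resolution, colour mode, embedded profile."""
    problems = []
    with Image.open(path) as im:
        if im.format not in ("PNG", "TIFF"):
            problems.append(f"format is {im.format}, not PNG/TIFF")
        dpi = im.info.get("dpi")  # (x, y), when recorded in the file
        if dpi is None or min(dpi) < 400:
            problems.append(f"resolution {dpi} below 400 dpi or unrecorded")
        if im.mode != "RGB":
            problems.append(f"colour mode is {im.mode}, not RGB")
        if "icc_profile" not in im.info:
            problems.append("no embedded colour profile")
    return problems

# Example: report any problems found in one delivered folio image.
for issue in check_archival_image("folio_040v.png"):
    print("WARNING:", issue)
```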
Figures 13a-b. Standard metrics, placed at side of the subject. Images: (a) Bodleian Library, Univ. of Oxford; (b) DIAMM [4].
Figure 14. Excerpt: PNG image of archival quality, at high resolution and colour depth. Image file size of one folio is 58.7 MB. Source: Worcester Cathedral (U.K.), MS F173, f. 6v (Latin).
Figure 15. Excerpt: Medium-quality image, appropriate for access by subscription. Image file size of one folio is 11.4 megabytes. Source: MS Codex Sinaiticus (4th-century Bible in Greek). Image: http://www.itsee.bham.ac.uk/images/Sinaiticusfullquality.jpg.
Figure 16. Excerpt: Legible-quality JPEG image, appropriate for free access on the Web. Image file size of two folios is 327 kB. Source: Schøyen Collection Ms. 2033 (Gospel lectionary; 11th c.). Image: http://www.schoyencollection.com/music_files/ms2033.jpg.
2.5 The Choice of Image Format
Today's digital cameras typically allow images to be saved in TIFF or JPEG data format; TIFF should be chosen. JPEG is a lossy format: its compression algorithm achieves smaller file sizes by discarding details of the image. The JPEG2000 format offers lossless compression, but it is not widely supported by computer software.

TIFF is a flexible format that can be lossy or lossless, depending on whether image compression is used. TIFF allows metadata to be stored with the image. Non-compressed TIFF images are quite large: using the resolution and colour depth that we recommend, an image of just one manuscript page can have a file size of 100 megabytes or more (depending on the dimensions of the manuscript). Acquiring and maintaining enough data storage for a collection of such images can be a problem. Furthermore, most Web browsers do not support TIFF and are unlikely ever to do so.

A better choice is PNG (Portable Network Graphics). Its compression algorithm is non-lossy and produces files that are substantially smaller than TIFF: the image of a manuscript page typically is 45% smaller in PNG than in TIFF. PNG is endorsed by the World Wide Web Consortium (W3C) and is supported by most Web browsers. PNG has other features that make it a good choice, such as embedding of basic metadata in the image file, gamma-correction to adjust for colour bias in the end-user's computer screen, and so on.

2.6 Image Archiving and Web Deployment
As with the monastic rule of St Benedict in the West, Eastern monasticism assumes a vow of stability, that is, dedicating oneself to remain at the same monastery for life. The underlying reasons for stability are, by analogy, valid also in the ever-changing world of Web-based resources. If manuscript images are to be of abiding benefit as religious, scholarly, and cultural resources, their 'virtual locations' on the Web must remain stable. Likewise, the 'rule' of library science requires stability of accession numbers.

Circumstances may require the change of a Web-hosting address, restructuring of directories, takedown of an image, and so forth, so stability may be a difficult goal to realise. One can, however, mitigate this problem by good planning and a long-term vision.

Whenever possible, we reference a manuscript image in metadata, indexes, and search results indirectly, via a 'virtual address' that can remain stable. A directory service maps from the virtual address to the current, actual location of the image. Similarly, the Web convention of virtual domain names ('www.') uses virtual addressing: a website may be moved to a different physical location, but references by domain-name are not affected. The granularity of virtual Web-addressing, however, does not cover the case of an image's filename being changed, or the image being moved within a website; references to it get broken. The Uniform Resource Identifier (URI) plan may eventually help. In any case, the crucial factor is one's intent of long-term stability.

High-quality imaging creates huge files, and so Web-deployment of archival-quality manuscript images is impractical; browsing through 100 MB images is slow, even on a broadband connection. Providers usually deploy images that have been scaled down to fit a browser viewport. Often, the colour depth has been reduced by JPEG compression, but PNG would be a better choice. Re-scaling and compression help the provider as well as end-users: end-users gain rapid access to images for browsing, while the provider may withhold archival-quality images from general access and offer them only for a fee or by subscription. The latter could be via file-download or mailing of a CD-ROM. Some providers are experimenting with other delivery methods; for example, [4] uses Zoomify for image viewing. Even if the provider does not charge a fee for access to archival images, access-by-subscription can be used to require end-users to agree to a contract governing how the provider's archival-quality images may and may not be used.

Original, archival-quality images must not be discarded. Consuelo Dutschke [5] relates a tragic story about a university that, after deploying reduced, compressed images of its manuscripts on the Web, deleted its archival-quality images just to save space. Long-term reliability of digital media is critical. Redundant copies should be stored at geographically separate locations in case of disaster. Data-storage facilities in caves are an excellent option.
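The directory service described above can be quite simple in principle. The sketch below models it as a lookup table in Python; the identifiers and URLs are invented for illustration, and a real service would answer HTTP redirects from persistent storage:

```python
# A minimal sketch of a 'virtual address' resolver. The identifiers
# and URLs are hypothetical, not real services.
RESOLVER_TABLE = {
    "ms/kastoria8/f40v": "https://images.example.org/2007/kastoria8_040v.png",
    "ms/sinai1234/f12r": "https://mirror.example.net/sinai/1234_012r.png",
}

def resolve(virtual_id):
    """Map a stable virtual identifier to the image's current URL.
    When an image moves, only this table changes; published links
    that cite the virtual identifier keep working."""
    return RESOLVER_TABLE.get(virtual_id)

print(resolve("ms/kastoria8/f40v"))
```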
3. DIGITAL LIBRARIES
One can think of the Worldwide Web as a huge database system, where each document or image acts like a 'database record', and search engines (such as Google.com) are like database indexes by which users can find 'records' that are of interest. The Web differs from a conventional database mainly in that the 'records' are distributed. That is, the data are stored on many Web servers that are independent and non-collaborative. Standardised data formats, protocols, indexes, and hyperlinks bind this 'database' together. Potential uses of Web technology in evangelisation should not be underestimated. The potential is great for touching the hearts and minds of people across the globe. Access to usable-quality images of Eastern Orthodox manuscripts via the Web promotes scholarly study; such study may lead to a deeper understanding of the song and practices of past generations, and this can strengthen the links that connect the modern to the ancient Church. Web resources act to raise public awareness, appreciation, and respect for tradition. It would seem in the best interest of the Church that its treasury of manuscripts containing chant and hymns should be accessible to the public as readable images on the Web. As we proposed in [1][2][3], this is accomplished most effectively by the distributed content model, rather than the old, centralised database model. The older model, which pre-dates the Web, requires a Web portal.
3.1 Access Barrier at the Server
A recent, reactionary trend among Web-content providers is to keep their content behind a portal. Portals can provide extra functionality for users, such as search tools and the integration of descriptive metadata with content. A portal, however, acts as an "access barrier" (Figure 17) by blocking direct access to content. A portal may be necessary if the provider wishes to charge a fee for access, or if users are required to agree to a contract-of-use. Portals can also be effective in blocking access to a website by robots. But if a portal is not needed, then one should not be used.
Figure 17. A Web portal acts as an "access barrier" to content.
Figure 17 is a conceptual diagram showing the layers of software that a user interacts with on a webserver that uses the centralised database model. A portal is a "first class object" on the Web (see definition below)—normally it is a static webpage. In a typical system, the user must validate that s/he has an account with this provider via an "authenticate" method. Typically, the session of an authenticated user is tracked by a cookie, which the provider stores on the user's computer. Authenticated users may be given access to content via the portal's search or browse methods. Some providers may allow authenticated users direct access to content via the content's URL (or Web address), but this method of authentication usually can be circumvented.

We call the above the "centralised database model" because it is the way databases usually are deployed on the Web. A provider, however, might or might not actually use a relational database management system behind the portal. The salient characteristics of this model are: (a) limited access via an "access barrier"; (b) storage of content at one central repository, or a small number of repositories; and (c) editorial control over the content by some agency or designated group of editors.

Some content providers are using the centralised database model just because software is readily available and easy to use with this model. For instance, the Microsoft Access® database program is widely available on Windows® computers, and it is seductive in its ease-of-use and its WYSIWYG ('what you see is what you get') user-interface. After a large amount of content and functionality has been created in Access, however, it becomes difficult to migrate the content and functionality to another software model. Microsoft's internal design of Access is proprietary—no effective method exists for migrating complete Access applications to a different system. Providers who have created an Access database on a desktop computer may find that the easiest way to deploy its content on the Web is via a Web-portal to Microsoft Access.
3.2 Exposing Objects at the Server
In [2], we discuss at length the problem of 'dark matter' on the Web; that is to say, the centralised database model causes content to be hidden from search engines, such as Google.com. End-users searching for content on the Web via a search engine typically will not see content that is behind a portal. Various workarounds have been used to solve this problem. A provider can expose on the Web a list of content and descriptive metadata. Search engines might be given an account so that they can index the content, while users without an account and password are still denied direct access to it. Despite these possible workarounds, one of our principal recommendations is that manuscript images of Eastern Orthodox chant and hymns should be freely accessible to everyone via the Web in legible form.
Figure 18. Hybrid architecture to expose content on a server.
A crucial facet of the distributed content model we are proposing is that images shall be "first class objects" on the Web. That is, each image shall be accessible by its own URL without password or session login. Descriptive metadata about an image should be exposed, and associated with the image via XML (Extensible Markup Language) or HTML. First class objects should be in an exposed directory of the server (viz., the index of that directory is exposed), or their URLs shall be listed on an exposed webpage. Indexing robots shall be able to find this content, and end-users shall be able to access the content directly. The URL may contain a command (e.g., a Perl or Java Servlet call) that generates a webpage dynamically. Figure 18 is a hybrid scheme depicting XML metadata and a compressed image exposed in a first class object, plus a hyperlink referring the user to a Web portal for access to the archival image. The high-quality images are 'hidden', but anyone can view the compressed image. Basic metadata may also be embedded in PNG images. Exposed images shall be of sufficiently high quality to allow accurate reading and transcription of text and neumation (e.g., Figure 16)—not degraded to the point of being useless for scholarly work (e.g., Figure 6). Exposed images may be 'watermarked' or overprinted with a notice of origin, or a copyright notice at one corner, if this is desired.
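To make the first class object concrete, the sketch below generates one exposed HTML page of the kind that Figure 18 depicts: basic metadata in meta elements (here in a Dublin Core style), the freely viewable compressed image, and a hyperlink to the portal that guards the archival master. The filenames, identifier, and URLs are invented for illustration:

```python
# One exposed HTML page per image; crawlers and end-users reach it
# directly, with no login. All names and URLs are hypothetical.
FIRST_CLASS_OBJECT = """<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Kastoria Cathedral Library, MS. 8, f. 40v</title>
  <meta name="DC.identifier" content="ms/kastoria8/f40v">
  <meta name="DC.format" content="image/png">
</head>
<body>
  <!-- compressed, legible image: freely accessible and indexable -->
  <img src="kastoria8_040v_web.png" alt="MS. Kastoria 8, f. 40v">
  <!-- the archival master stays behind the subscription portal -->
  <a href="https://portal.example.org/request?id=ms/kastoria8/f40v">
    Request archival-quality image</a>
</body>
</html>"""

with open("kastoria8_040v.html", "w", encoding="utf-8") as f:
    f.write(FIRST_CLASS_OBJECT)
```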
Of course, the hybrid architecture does not resolve the situation of some providers who are looking for the easiest way to deploy their existing databases on the Web. A further refinement of the distributed content model involves the use of dynamically-generated (or ‘pushed’) content. As mentioned above, this may be done by a Java Servlet or CGI call in an HTML hyperlink or Form. The server that responds to such a call may track users who are currently logged-in, by a cookie, a unique session ID embedded in a dynamically-generated HTML page, or by some other method. When the server takes a call for an image, it can verify that the user is logged-in before it pushes the image from a protected directory. This, however, does not resolve the problem of ‘dark matter’ not being indexed by search engines.
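A minimal sketch of such session-checked delivery follows, written here with the Flask microframework purely for illustration (the paper prescribes no toolkit, and the directory path and login mechanics are hypothetical):

```python
from flask import Flask, abort, send_from_directory, session

app = Flask(__name__)
app.secret_key = "change-me"             # signs the session cookie

PROTECTED_DIR = "/srv/archive/masters"   # never exposed by the webserver

@app.route("/archival/<path:filename>")
def archival_image(filename):
    # The session cookie records login state; an unauthenticated
    # request never reaches the protected directory.
    if not session.get("user"):
        abort(403)
    return send_from_directory(PROTECTED_DIR, filename)
```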
3.3 The Distributed Content Model
Figure 19 is a conceptual diagram of how the end-user can find images via dialog with an index under the distributed model. An index catalogues images from many resources, creating a 'virtual library' of references to distributed images. The user can access an image from its hosting site by clicking a hyperlink in the index. The user can search an index by keywords or by other methods such as maps, timelines, etc., as shown by [6]. The user may see in the index additional metadata about an image, beyond what is found on the hosting site. One manuscript image may be indexed by many different indexes, where each index may have a special focus (such as Orthodox liturgy, manuscript art, or types of neume notation). Scholarly-edited indexes may store specialised information about images that is not of interest for the purposes of the image-hosting service. Users would use whatever index most closely reflects their areas of interest.
3.4 Image Copyright and Revenue
The centralised database model has advantages for controlling access to content and charging fees. This may be important for institutions that need income to defray the costs of digital imaging and on-going maintenance of archives. Other sources of revenue, however, should be considered: renting of advertising space; donations; grants; reserving full-quality images for fee payers (see the revenue scheme at [8]); etc. We believe that chant manuscripts are the heritage of everyone, and they should not be withheld from all but those who are financially able to pay a fee.
Figure 19. On the Web, anyone can publish. Anyone can make a virtual public library, too. In the Distributed Digital Library model, an index collates Web content to create a library.
Whether a claim of copyright ownership over digital photographs can be sustained legally is questionable. Under copyright law of the U.S.A., for instance, a photograph may be copyrighted if it is an ‘original work of art or authorship’ (e.g., a photograph with creative posing of the subject or artistic lighting). An image that is a ‘faithful reproduction’ (like a photocopy) of a 2-dimensional object does not have a valid copyright in the U.S.A. Clearly, the content of an archaic manuscript (that is to say, its ecclesiastic text and melos) is in the ‘public domain’, since the original author has long been deceased. Normally, a copyright lasts just for the lifetime of the original author, plus 70 years. In this Internet age, the holders of manuscripts that contain neume notation, sacred texts, or decorative art have a decision to make: whether they shall share this legacy with all people by providing images online for study, appreciation, and evangelisation.
4. DESCRIPTIVE METADATA
In the distributed content model that we propose, each image has descriptive metadata tied to it. Such data identify the manuscript (its location, dimensions, date, etc.) and explain its content (the mode, ecclesiastical text, religious occasion, etc.). This provides information to users who view the image, but it is important also for indexing. Metadata comprise the keywords by which users can search for and find images on the Web via ordinary search engines.

4.1 Ontological Commitment
Generally, an image itself exposes no metadata. (Some image file formats enable metadata to be embedded in the image, but such data might not be exposed on the Web in the same way that HTML or XML data are.) There is widespread agreement on the 'basic' metadata for manuscripts; these include the name of the collection, the shelf mark, the city of the holding institution, folio number, physical dimensions, etc. Any inconsistencies between different schemes of 'basic' metadata can be reconciled.

An ontology produces a set of metadata categories for describing the members of a collection. Each category may have a closed set of values that it may take on. A category also may be 'optional' (viz., nulls allowed). Logically, an 'ontology' is a theory about the predicates (or properties) that exist for the individuals in a class of objects. It is axiomatic that no finite ontology can fully exhaust the semantics (or meaning) of a real-world object; for instance, a list of the chemical constituents of an apple is not the same as a fresh apple. 'Ontological commitment' (or an interpretation) is a choice of categories (and perhaps their sets of values) that will be used in describing the objects of a set. A metadata designer must commit to a fixed ontology for data-entry if searching for objects by keywords is to return consistent and complete results.
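The notions of closed value sets, optional categories, and commitment can be made concrete in a few lines. The category names and values below are invented for illustration, not a proposed ontology:

```python
# A toy rendering of 'ontological commitment': a fixed category set,
# some categories with closed value lists, some optional (nulls allowed).
ONTOLOGY = {
    "notation": {"closed": ["ekphonetic", "Palaeo-Byzantine",
                            "Middle Byzantine", "Chrysanthine"],
                 "optional": False},
    "mode":     {"closed": None, "optional": True},   # free text allowed
}

def validate(record):
    """Accept a metadata record only if it commits to the ontology;
    consistent commitment is what makes keyword search complete."""
    for category, rule in ONTOLOGY.items():
        value = record.get(category)
        if value is None:
            if not rule["optional"]:
                return False          # required category missing
        elif rule["closed"] is not None and value not in rule["closed"]:
            return False              # value outside the closed set
    return True

print(validate({"notation": "Middle Byzantine"}))        # True
print(validate({"notation": "neumes in campo aperto"}))  # False
```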
Figure 21. Projections of ontology semantics onto a real object.
Figure 21 illustrates the concept that various ontologies (depicted in a 3-dimensional "semantic metadata space") project semantic views onto an actual object, such as a manuscript image (depicted in a 2-dimensional "real-world object space").
Figure 20. Ontological commitment increases with semantics.
In the field of computers and the humanities, many initiatives have proposed 'standard' metadata ontologies, such as: TEI (Text Encoding Initiative); the Dublin Core; ontologies for Latin chant manuscripts [4][5][10], CANTUS, Cantus Planus, etc.; ontologies for Orthodox chant manuscripts [6][10]; and so on. The richer the ontologies are, the harder it is to reconcile their differences. Example: the database dictionary of the Mount Athos digitisation project [currently offline] has a field for "Musical Notation":

  Definition: type of musical notation present.
  Properties: text, 100 characters.
  Validate against: Musical Notation Look-Up Table.

The "Musical Notation Look-Up Table" is not provided just now, but it likely will be irreconcilable with other ontologies, such as the University of Indiana Variations [http://variations2.indiana.edu/]. Variations uses a closed set of names for Western notations, plus "Non-Western", with the instruction to list the country of origin in parentheses. Digital Scriptorium's 'standard' metadata ontology [5] specifies:

  Definition: type of musical notation, when present.
  Example: Neumes in campo aperto.
  Properties: Data type: text. Field size: 60 characters.
Figure 20 illustrates the principle that, as metadata become more specific about the meaning (or semantics) of an image's content, the degree to which a cataloguer must commit to an ontology of the manuscript content rises disproportionately. We separate metadata at two threshold points (the red dots). "Metadata linked to the content semantics" are specific to particular areas in the text.
The Digital Scriptorium ontology has a "Subjects" category. Its closed value set includes (in part): "Ecclesiastical-cnclsSynds"; "Ecclesiastical-other"; "Ecclesiastical-papal"; …; "Monastic"; "Musical". This set covers a broad range of Latin medieval and Renaissance manuscripts, but it will not satisfy the requirements of Eastern Orthodox chant and hymns for standard, searchable terms.
The ontological commitments of the authors of a 'standard' can be inferred. Consider the following excerpt from Indiana Variations:

  <WorkStructure label="Symphonies, no. 7, op. 92, A major (Ludwig van Beethoven)" id="1">
    <Section title="" type="Movement" label="I. Poco sostenuto - Vivace" id="2">
      <Section title="" type="Section" label="Introduction" id="3"/>

This ontology also has categories for "Date of First Performance" (optional) and "Publisher" (required). Many of its categories are limited to a closed, "Controlled Vocabulary" of English terms. Ontological commitment necessarily is strong when describing the content of a manuscript like Kastoria 8 (cf., Figures 2-4), whose Great Signs are undeciphered. Detailed metadata "linked to the content semantics" require an ontological commitment to one semantic interpretation where no sure knowledge yet exists today.
4.2 Bias of Americanism in Metadata
Admittedly, English is the lingua franca of computing. In the case of Latin manuscripts, end-users may be more adept at searching metadata in English rather than Latin, even though English is not the user's native language. Nevertheless, we are uneasy about the trend in Internet culture to 'standardise' ontologies in English. Metadata about Eastern Orthodox manuscripts may require Greek terms that do not exist in English, such that transliteration is the only option. Transliteration variants, however, are common. For example, the word κέντημα appears as 'kentema' or 'kentima'; ἐκφωνητικά may be 'ekphonetic' or 'ecphonetic'; and so on. If an ontology has a closed set of values, then variant transliterations may be listed in a table of synonyms, which indexes and search services shall treat as equivalents. Metadata categories that have open sets of values or allow free-form text, however, are not easily normalised by synonym tables. Latinisation of Greek words often is ad hoc: phonetic mapping by-rule is not reliably predictive of orthography as it is actually practised. Archaic, technical, or infrequently-used words are especially troublesome as keywords.

The use of Greek orthography in metadata is unavoidable in some cases, such as for quoting an incipit (the first few words of a chant or hymn). Transcription of a manuscript text or an incipit shall be literal. Translation of text into English, or transliteration into the Latin alphabet, can be helpful, but it is not a definitive record. For metadata to appear in the Greek language on a webpage with an image, there are three main options of character-encoding.

a) The 'content-type' of an HTML document can be declared with 'charset' (character set) ISO-8859-7. This is commonly done in Greek-language webpages. The keyboards of computers for the Greek market normally are mapped to this encoding. From the American perspective, however, such text is not 'human readable', because it does not display correctly in a plain-text, ASCII viewer. For example, the word ψαλτικών ("chant") displays in an ASCII editor as "øáëôéêþí" [sic].

b) HTML 4.0 provides character entities in ASCII for Greek letters, such that the HTML sourcecode shall be 'human readable'. In this encoding, ψαλτικῶν could be written as: &psi;&alpha;&lambda;&tau;&iota;&kappa;&omega;&nu; This is not entirely acceptable, because the character entity set does not include Greek letters with diacritic marks. (Diacritics for Latin letters, however, are supported.) For example, the character 'ῶ' typically gets degraded to just 'ω'.

c) The character set can be declared as UTF-8, which allows one to mix ASCII characters and Unicode characters in the same document. In UTF-8, ψαλτικῶν can be written correctly as: &#968;&#945;&#955;&#964;&#953;&#954;&#8182;&#957; Alternatively, using HTML character entities one can write: &psi;&alpha;&lambda;&tau;&iota;&kappa;&#8182;&nu; The hexadecimal number &#x3C8; may replace the decimal &#968;. The XML specification [http://www.w3c.org/TR/REC-xml] states that UTF-8 is the default encoding of XML documents, and that all XML-compliant processors must accept UTF-8 encoding. We recommend using UTF-8 for both XML and HTML documents.

A similar problem arises with metadata about manuscripts written in the Cyrillic alphabet (Russian, Old Church Slavonic, etc.) or in other scripts of the Eastern Church. In transcribing archaic Greek texts, we encounter special difficulty with Polytonic Greek, which is not covered by the standard Unicode block of Greek letters (U+0370-U+03FF) [http://www.unicode.org]. A 'Greek Extended' block (U+1F00-U+1FFF) is available with 'precomposed' Polytonic letters. Alternatively, one can use letters from the standard block plus 'combining characters'. Alternate encodings, however, add a burden on index services, which must treat them as equivalent. To display Polytonic Greek, a special font usually is required.
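Index services can fold both kinds of equivalence (transliteration variants and alternate Unicode encodings) into one search key. Below is a minimal sketch using Python's standard library; the synonym table is a two-entry sample, not a real authority file:

```python
import unicodedata

# Sample synonym table folding transliteration variants to one form.
SYNONYMS = {"kentima": "kentema", "ecphonetic": "ekphonetic"}

def index_key(term):
    """Normalise a search term: NFC normalisation makes precomposed
    polytonic letters (Greek Extended block) and base-letter-plus-
    combining-mark sequences compare equal; the synonym table then
    folds transliteration variants to a canonical spelling."""
    term = unicodedata.normalize("NFC", term).lower()
    return SYNONYMS.get(term, term)

assert index_key("kentima") == index_key("kentema")
# U+03C9 plus U+0342 (combining perispomeni) and precomposed U+1FF6
# are canonically equivalent, so they yield the same key:
assert index_key("\u03c9\u0342") == index_key("\u1ff6")
```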
4.3 Specialised Metadata Ontologies
The main reason for a 'standard' ontology is to produce consistent and complete results from keyword searches. In our view, a single, standard ontology may not be appropriate for manuscript images. The ontology needed for one purpose or area of interest might be substantially different from the ontology needed for another.
Figure 22. Three models of index/metadata/content separation.
We proposed "specialised indexes" in [2], whereby each index may have its own ontology regarding a set of distributed content. Also, XML technologies are emerging to facilitate "secondary markup" of content, viz., metadata in a separate markup document referring to content in the primary document. Special ontologies can exist in secondary markup. Such ontologies would be indexed automatically by Web search engines. One ontology could treat chant manuscripts in their relation to the modern liturgy; another ontology might be committed to palaeographical study; and so on. In the conventional database model, indexes and data are tightly coupled and are under central control. Figure 22 illustrates, at left, the concept that a database normally encapsulates an index, metadata, and content-data into one software unit. Manuscript images (as content) can be loosely coupled via a local filepath.
The middle part of Figure 22 depicts the ordinary Web model, where metadata and data-content are fully integrated at a website. Indexing is done separately by search engines (e.g., Google.com). In our distributed content model (Figure 22, at right), content and semantically-rich metadata are segregated. This model encourages multiple ontologies referring to the same content, and there is no bias toward English. Under this model, only 'basic' metadata are tightly coupled with the content. Two methods of "semantic projection" onto content are illustrated: (α) "specialised indexes", where the rich metadata are encapsulated with the index; and (β) "secondary markup", where specialised ontologies can be exposed to conventional indexing services. Both methods have a many-to-one (N:1) relationship of metadata to content (not 1:1). Examples of type α are [6] (a "scholarly-edited" index) and [11] (a special-interest index to which anyone can contribute—it has an integral search engine, hyperlinks to distributed content, free-form metadata in multiple languages, and an ontology of keyword 'tags'). We are aware of no working examples of type β at this writing.
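Since no working examples of type β exist at this writing, the following is a purely speculative sketch of what a secondary-markup document might look like, built here with Python's standard XML library; the element names and target URL are invented, not a standard:

```python
import xml.etree.ElementTree as ET

# Standoff metadata: the document lives apart from the image it
# describes, so many such files, each committed to a different
# ontology, can refer to the same content (an N:1 relationship).
markup = ET.Element("secondaryMarkup",
                    target="https://images.example.org/kastoria8_040v.png")
region = ET.SubElement(markup, "region", folio="40v", line="3")
ET.SubElement(region, "notation", system="Palaeo-Byzantine").text = "Great Signs"
ET.SubElement(region, "mode").text = "fourth plagal"

print(ET.tostring(markup, encoding="unicode"))
```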
Multiple, rich metadata ontologies allow a collection of content to be indexed in terms of many different conceptual schemes, user languages, or fields of interest. Search results can be made more targeted and 'intelligent' than is possible with a single ontology.

5. CURRENT DEVELOPMENTS
Digital archiving projects have been started at Greek Orthodox monasteries and at other repositories. This includes imaging and description of many neumed manuscripts. Table 1 summarises the holdings of several repositories. Worldwide, there are more than 4,000 manuscripts containing Byzantine or post-Byzantine neume notation; of this number, approximately half are housed at Mount Athos. The Holy Community of Mt Athos has twenty monasteries; those listed in Table 1 are noted by a '✥' symbol. The 400 MSS of Pantokrator, alone, comprise 270,000 pages. The projects at Mt Athos, and at St Catherine's Monastery on Mount Sinai (Egypt), are of significant importance and deserve special support.

Table 1. Manuscript holdings at various digitisation projects.

  Repository                          Total MSS     Music MSS
  Grand Lavra (Megiste Lavra) ✥       2,242         130
  Pantokrator Monastery ✥             400           60 to 70
  Xenophontos Monastery ✥             300           at least 96
  Vatopaidi Monastery ✥               1,700         400
  Iberon Monastery ✥                  over 2,000    514
  Simon Petras Monastery ✥            123           30
  Gregoriou Monastery ✥               804           approx. 60
  Koutloumousiou Monastery ✥          770           —
  Chilandari Monastery ✥              —             103 *
  Docheiariou Monastery ✥             —             107 *
  St Panteleimon Monastery ✥          1,920 *       150 *
  St Catherine's Monastery, Sinai     3,300         approx. 350
  National Library of Greece          over 4,500    241 **
  Univ. of Athens, Music Dept.        N/A           approx. 250

  * Digitisation plans are unknown to us as of this publication date.
  ** Music MSS are not currently included in digitisation at N.L.G.

6. CONCLUSIONS
Manuscripts containing neume notation should be archived as high-resolution, high colour-depth digital images. High-quality images may be reserved for fee-based or subscription access, but images of sufficient quality for reading and accurate transcription should be made freely available via the Web. Free images should be visible on the Web as "first class objects". We advocate that only 'basic' metadata should be stored with an image. High-level semantics for various interests and languages should be stored in specialised indexes and/or in "secondary markup".

7. ACKNOWLEDGMENTS
We are grateful to the Eduserv Foundation for their financial support. We thank the Abbot of Saint Catherine's Monastery, His Beatitude Damaskenos, Archbishop of Sinai and Raithu, for his blessing and kind co-operation. We thank Fr Justin, the librarian of the St Catherine Monastery Library, for his work and for graciously having sent Fr Constantine several excellent images of important Sinai music manuscripts. We also thank Fr Moses the Hagiorite, Frs Theophilos and Prochoros of the monastery of Pantokrator, and Fr Theologos of Iveron for their assistance in gathering information on the Athonite monastery library projects.

8. REFERENCES
[1] Barton, Louis W. G., John A. Caldwell, and Peter G. Jeavons, "E-Library of Medieval Chant Manuscript Transcriptions," Proceedings of the 5th ACM/IEEE Joint Conference on Digital Libraries (ACM, 2005).
[2] Barton, Louis W. G., Peter G. Jeavons, John A. Caldwell, and Koon Shan Barry Ng, "First Class Objects and Indexes for Chant Manuscripts," Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries (ACM, 2007).
[3] Barton, Louis W. G., "The NEUMES Project: Digital Transcription of Medieval Chant Manuscripts," Proceedings of the Second International Conference on WEB Delivering of Music (IEEE Computer Society, 2002).
[4] Digital Image Archive of Medieval Music (DIAMM); online at http://www.diamm.ac.uk/.
[5] Digital Scriptorium; online at http://www.scriptorium.columbia.edu/.
[6] Distributed Digital Library of Chant Manuscript Images; online at http://purl.oclc.org/SCRIBE/NEUMES/distributed_image_library/.
[7] Kokla, Vassiliki, Vassilis Konstantinou, and Alexandra Psarrou, "Towards the Creation of Generalised Computational Models for the Characterisation of Inks Used in Byzantine Manuscripts," Proceedings of the 15th World Conference on Nondestructive Testing (Rome, 2000).
[8] Library of Congress, The; Photoduplication Service; online at http://www.loc.gov/preserv/pds/digital.html.
[9] Monumenta Musicae Byzantinae, Copenhagen; online at http://www.igl.ku.dk/MMB/chartres.html.
[10] NEUMES Project, The; online at http://purl.oclc.org/SCRIBE/NEUMES/.
[11] SWiK index; online at http://swik.net/SWiK.
[12] Terzopoulos (Terss), Fr Constantine J., "Psaltic Notes"; online at http://www.psaltiki.net/PsalticNotes/.