
Some small stuff from around the world, or the web, or the world that is the web, that deserves some attention here, in this post and in future ones. First of all the oldest bible (ok, maybe not such small stuff): the Codex Sinaiticus has been digitized and made accessible online. As the project website states, ‘Codex Sinaiticus, a manuscript of the Christian Bible written in the middle of the fourth century, contains the earliest complete copy of the Christian New Testament’. Last year I attended a lecture by David Parker, one of the project members, as part of a symposium on Text comparison and digital creativity. Parker presented the project as it was unfolding and discussed the difficulties and challenges of presenting online a document that has been scattered across different institutions, which makes it truly a ‘virtual’ Codex Sinaiticus. Now that it has been collected once more into one online object, you can actually browse through the quires and folios of the manuscript and zoom in on details that take your fancy. You can even adjust the lighting, and surely do many more interesting things that I have not yet explored.
The goals of the project revolved around the historical research, conservation, digitization, transcription and dissemination of the manuscript. The study of the production of Codex Sinaiticus has proved invaluable for the study of book or manuscript history and production. The history of Codex Sinaiticus has also been very important for the development of the idea and the creation of the concept we nowadays refer to as ‘bible’ as a collection of canonical books:
‘The ability to place these ‘canonical books’ in a single codex itself influenced the way Christians thought about their books, and this is directly dependent upon the technological advances seen in Codex Sinaiticus. The quality of its parchment and the advanced binding structure that would have been needed to support over 730 large-format leaves, which make Codex Sinaiticus such an outstanding example of book manufacture, also made possible the concept of a ‘Bible’. The careful planning, skilful writing and editorial control needed for such an ambitious project gives us an invaluable insight into early Christian book production.’
The presentation of the manuscript on the website is marvelous. And what a chance to brush up your Ancient Greek! You can check it out for yourself here.

Last week, on the 19th and 20th of March, the first Academic Publishing in the Mediterranean Region (APM) conference was held, an offshoot of the APE (Academic Publishing in Europe) conference, which was held for the fourth time last January in Berlin. Both conferences want to cross the traditional sectoral boundaries that exist in scholarly communication, where scholars, publishers, policy makers, middlemen and librarians all have their separate gatherings and meetings. APE and APM are independent and international conferences about all aspects of academic publishing, meant to foster knowledge exchange and dialogue between the different stakeholders in scholarly communication. The APM, held in Florence, specifically focused on the diversities and particularities of the Mediterranean region, with its many languages and its focus on the Humanities and Social Sciences (HSS) and monographs. Culture, tradition, books and manuscripts are still very important in the Mediterranean region, as the opening speaker Augusto Marinelli (the rector of the University of Florence) remarked. However, electronic experiments and digitization projects are also increasingly being undertaken. These innovations are taking place in the context of the current financial crisis, which is hitting the Italian publishing and library industry hard, said Mauro Guerrini from the Italian Library Association (AIB – Associazione Italiana Biblioteche). He stated that where, in a knowledge economy, knowledge is the key to innovation and development, decreasing (library) resources and cutbacks in science and scholarly communication might be detrimental to overall economic development.
One of the possible solutions to this impasse might lie in what Maria Cristina Pedicchio (President of the Technology District in Molecular Medicine and Professor of Algebra at the University of Trieste) calls public-private partnerships in research. She referred to the knowledge triangle from the Lisbon Strategy: research, education and innovation should lead to growth and jobs, as was the expected scenario. Public-private partnerships could be a powerful tool for innovation in this respect. When knowledge and research are the key issues for economic and social development and governments do not invest in them, they will decline even further. We need to invest in research and human capital in order to stay competitive, says Pedicchio. Part of the EU strategy is focused on cluster policies to develop innovative clusters. But there is no single model; we need different clusters operating in different models. The specific local aspects also play a large role. Pedicchio says that in order to obtain open innovation, we need open clusters. Innovation can only be created in visible, dynamic environments, not in isolated organisations. For this to come about we need the support of the triple helix: academic research, the private sector and public administration. Innovation depends on the interaction between strong academic research (universities), dynamic entrepreneurship and the availability of risk capital (private sector), as well as public administration.
Pedicchio goes on to discuss different kinds of cluster experiments in various European countries. From these experiments she concludes that we need a multidisciplinary cultural approach. Pedicchio shows that these kinds of collaborations can lead to the development of cultural open spaces which can foster and enhance research and innovation and can attract human resources, companies and financing.
The prerequisites for these kinds of open collaborations, says Pedicchio, are the possibility of international and intersectoral mobility, the availability of knowledge by means of open access policies for the dissemination of science and frontier knowledge, investment in young people, and the dissemination of knowledge to society at large. We need to build national clusters but at the same time we need to try to integrate them. National policies need to be involved in this process, as locality is a physical requirement for clusters; they need to be local and physically based, adhering to regional policies. This means constant change and adaptation between European policies and national policies.
The second keynote, delivered by Andrea Bozzi (Director of the Institute for Computational Linguistics) focused on the scholarly editing of old manuscripts in digital library collections by means of computational tools. Bozzi explained the connection between computer science and the tradition of text transmission, focussing especially on texts that are transmitted by manuscripts. As Bozzi explained, we can now make a model for digital philology, developing integrated tools for scholarly editing. This can lead to a new kind of historical publication which can be enriched and which adds new value to the publication which hitherto has been static. Bozzi asked what the dimension of these integrated tools can be for a new kind of library and its users. He mentioned several digital tools for scholarly editing, such as an integrated open source environment for images and/or texts, image enhancement (within this environment), text indexing and concordance (by means of free web services), collaborative textual criticism, stemmatology and NLP tools (lemmatization, morphological analysis, treebank construction, comparison, meaning extraction, etc.). These are all new tools for studying manuscript archives in a collaborative way. They need to be combined with scholarly editing criteria. An example of a digital annotation tool is the Pinakes Text-architecture, which is a web based relational database application (Pinakes was the first library catalogue system, developed in the Library of Alexandria by the Greek poet Callimachus of Cyrene). From the website:
“Pinakes is a non-commercial tool the aim of which is to offer a renewed historiographic approach to the classification of the scientific heritage. Thanks to the integration of different types of objects, such as instruments, manuscripts, texts, iconography a.o., Pinakes aims at transforming the traditional approach to the primary sources of the history of science into a sort of archeology of scientific knowledge.”
As Bozzi stated, it is a highly flexible system that can be applied in, for example, Greek papyrology, Egyptology, Roman philology and general philology. It can also be applied to different languages and documents. As an example of what Pinakes can do as a tool for the textual criticism of medieval manuscripts, Bozzi showed how it can link to collated sources. In this way one can analyze the variants in the collated source. Differences and variants can be retrieved in the critical apparatus, which is a very important aspect of historical linguistics. Framing tools can remember the encoding and record the variants in the critical apparatus: in this way you have enriched the text by using these specific tools. This technique could also be applied to old printed books, says Bozzi, where one could find different editions and detect the differences between them.
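To make the idea of detecting variants between collated sources a bit more concrete, here is a minimal sketch in Python. It is not how Pinakes works (the lecture did not go into that level of detail); it simply uses the standard library's `difflib` to align two witnesses word by word and list the places where they differ, roughly what an entry in a critical apparatus records. The function name `collate` and the sample readings are made up for illustration.

```python
from difflib import SequenceMatcher


def collate(base, witness):
    """List (base reading, witness reading) pairs where the witness departs from the base.

    Hypothetical helper for illustration only: a word-level alignment,
    with no normalisation of spelling, punctuation or abbreviations.
    """
    a, b = base.split(), witness.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    variants = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # An empty string on either side marks an omission or an addition.
            variants.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return variants


# Invented sample readings, standing in for two editions of the same text.
base = "the quick brown fox jumps over the lazy dog"
witness = "the quick red fox jumps over lazy dog"

for lemma, reading in collate(base, witness):
    print(f"{lemma or '(absent)'} ] {reading or '(omitted)'}")
```

Real collation has to cope with orthographic variation, transpositions and more than two witnesses, which is exactly where the scholarly judgment discussed at the conference comes in.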
In the future Bozzi wants to focus on integration with other NLP tools and on applying the system to cuneiform texts on tablets. Most importantly, he wants to develop a way to export the edited texts, critical apparatuses, annotations and indexes to a print publication, under agreement with publishing houses, via print on demand (POD).
Pinakes can become a specialized scholarly editing tool and an integrated web-based platform within the electronic publishing roadmap of Interedition (an interoperable, supranational infrastructure for digital editions). Bozzi reflected on what the role of libraries can be in building this infrastructure and which role publishers could play. For one, libraries also need to receive tools to offer to their users. In this respect Bozzi argued that it is very important that we have standards for these kinds of research infrastructures, also for primary sources.
The ultimate goal should be a digital infrastructure for the Humanities: we need to enrich the European research by cooperation and in this respect the setting of standards is fundamental, as Bozzi concludes.

From the afternoon session on the Mediterranean region and its diversities I would like to focus on Andrea Angiolini’s (Società editrice il Mulino, Italy) lecture on The Darwin Project, a publishing infrastructure and working space for monographs & textbooks. As Angiolini argues, the differences between HSS and STM are fading out. This means new challenges for the publisher and new needs for our scholars and students. Angiolini clarifies that Mulino is a very traditional publisher, who believes that physical books are still the best thing to publish especially when it comes to reading them. In this respect Mulino is quite slow in the whole digital process. As Angiolini says, they would like to stay in between scholars, librarians and the market. However, something is gradually changing in Italy, both in the university and in the market. HSS research is increasingly moving from monographs to both monographs and journals and from a generalist approach to a more specialized one. There is also a visible shift from Italian to mixed language communication and from a less formal career and texts evaluation process to a more formalized one.
As the bookstores are buying less and acquisition budgets for libraries are decreasing, the break-even point for publishers is moving further away. This, combined with a research style that is increasingly being conducted online, has led Mulino, in order to stay effective (to reach a public, to service the scholar and the market), to move to the online domain and develop the Darwin (Digital Archive for Web Integrated Networks) project. Darwin is an integrated system for the online publication of digital editions. It can be seen as an infrastructure aimed at adding value to printed books. In this respect, Angiolini says, it wants to meet the needs and demands of the users, based on standards.
Within the Darwin project, monographs will be published both in print and in digital editions. Abstracts and DOIs will be added at the chapter level and all the books will be fully quotable. What is new is that the texts are based on DocBook and not on PDF, DocBook being a better format for searchability and the like. It is a richer format that can do anything paper can. Further functions include opening and collapsing comments within the text. You can also interact with the text, annotate it, and make the note public or private. You can search different parts of the publication and highlight certain parts (semantic search). In this respect Angiolini argues that Darwin is designed not only for reading and searching but also for studying and collaborating while doing research. You can make it into a workspace, with public or private note taking and public or private bookmarks. The project will be online in autumn 2009.
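Why would a structured XML format be easier to search than PDF? A small sketch may help. The fragment below is a hypothetical, heavily simplified DocBook-style document (no namespace, invented titles and DOI-like identifiers), and the functions `chapter_index` and `search` are my own illustration, not anything shown in the Darwin presentation. The point is only that when chapters, abstracts and identifiers are marked up explicitly, a few lines of code can pull them out or search within them.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified DocBook-style fragment; identifiers are invented.
SAMPLE = """
<book>
  <title>Sample Monograph</title>
  <chapter>
    <info>
      <title>Introduction</title>
      <biblioid class="doi">10.0000/example.ch1</biblioid>
      <abstract><para>Sets out the research question.</para></abstract>
    </info>
    <para>Body text of the first chapter.</para>
  </chapter>
  <chapter>
    <info>
      <title>Sources</title>
      <biblioid class="doi">10.0000/example.ch2</biblioid>
      <abstract><para>Describes the manuscripts used.</para></abstract>
    </info>
    <para>Body text about manuscripts and archives.</para>
  </chapter>
</book>
"""


def chapter_index(xml_text):
    """Collect title, DOI and abstract for every chapter."""
    root = ET.fromstring(xml_text)
    rows = []
    for chapter in root.iter("chapter"):
        info = chapter.find("info")
        rows.append({
            "title": info.findtext("title"),
            "doi": info.findtext("biblioid[@class='doi']"),
            "abstract": info.findtext("abstract/para"),
        })
    return rows


def search(xml_text, term):
    """Return the titles of chapters whose text contains the term (case-insensitive)."""
    root = ET.fromstring(xml_text)
    hits = []
    for chapter in root.iter("chapter"):
        text = " ".join(chapter.itertext())
        if term.lower() in text.lower():
            hits.append(chapter.findtext("info/title"))
    return hits


for row in chapter_index(SAMPLE):
    print(row["doi"], row["title"])
print(search(SAMPLE, "manuscripts"))
```

A production system would of course use real DocBook (with its namespace and schema) and a proper search index; the sketch only shows why explicit structure makes chapter-level abstracts, identifiers and searching straightforward.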
It will be an open project, claims Angiolini, adaptable to different texts and formats and to different access models (though it will be based on, and start off as, a subscription model). As Angiolini states, if we want to publish research and be effective at the same time, we must take a mixed way, otherwise monographs will soon no longer exist. We are moving from contents to contents plus editorial services. This produces a new publisher's profile. This change is almost mandatory if publishers want to be part of the solution and not of the problem in the digital age.
After Angiolini’s lecture a question came from the audience: does Angiolini think people would annotate on a proprietary platform? How can Darwin be combined with other platforms, will it be compatible with other publishers’ websites, and will it let scholars mix their notes? Wouldn’t users rather use Zotero, or other browser-based environments? Angiolini replied that Darwin is still an experiment and that he does not know exactly how scholars will go about using it.
Highlights from day 2 of APM will follow soon.
Long overdue, here are my notes on the second day of the colloquium Text Comparison and Digital Creativity.
The day started with keynote speaker Bella Hass Weinberg, who stressed the point that even in the digital era the creativity of text comparison still lies with the researcher. She stated that much of the research on text comparison done in the pre-computer era cannot be done by a computer. Translation involves hermeneutics, which means that text comparison is about interpretation, not just about linguistics. One of the main questions thus remains: when we analyze texts, what comes first, semantics or linguistics? Weinberg says semantics. When it comes to computers and text comparison, there are still big problems with machine translation and machine comparison of texts, because these assume the perfection of OCR. OCR can work for clearly demarcated letters, but it does not work well for scripts whose letters take different shapes according to their position in the word. Weinberg went on to argue that many centuries ago there was already very sophisticated text comparison without the aid of computers. She thus asks, ‘what can we do with the computer now that we could not do before?’ Computers have facilitated analysis, but only in counting and comparing as basic features, not in the rest.
Unfortunately, Weinberg only discussed the current state of affairs in ICT and text comparison without contemplating possible future developments. The next session, however, tried to show what kinds of new developments have been made within the digital realm to assist textual comparison. Vika Zafrin, a digital humanities expert, talked about her research on distributed networks with/in text encoding and annotation. In her definition annotation stretches into hypertext resources such as social tagging (for instance del.icio.us), blog comments and comments solicited via specialized software. She gave the example of the Virtual Humanities Lab (VHL) at Brown University, which created an annotation tool/engine that functioned as a web-based space for collaborative work. Zafrin argued that semantic encoding (for instance, what kinds of elements and attributes to use in a DTD) can also be seen as a form of annotation. She mentioned a tool with which comments can be inserted directly into document schemata. She also mentioned some other digital tools: with Diigo you can highlight and annotate web pages. Zotero is a scholarly annotation tool: you can add notes and tags to the objects in your Zotero library. There are also new developments in media annotation, such as Vertov, which allows for the annotation of multimedia files. As Zafrin showed, it seems that scholars are increasingly digging distributed networking. Although there are still issues concerning the credibility of scholarship on the Internet and the amount of quality control, Zafrin argued that internet scholarship has many pros too: it enables scholars to find each other more easily, so it makes the conversation broader. Increased interdisciplinarity is also encouraged by distributed networking. In addition, Zafrin correctly noted that distributed networked tools are the only ones available for born-digital content.
Adriaan van der Weel talked about how new media are giving us new perspectives on knowledge production. In the digital era the tissue of our society still remains book based. Van der Weel explained this situation by pointing to the history of the textual medium. Discovering what Gutenberg’s invention actually meant took quite a while, so it will equally take some time to find out what digital textuality actually means. What we did at first was adapt the computer to the ‘book order’. In this sense digital textuality is still a hybrid, since we adapted it to the book. Van der Weel spoke of a gestation period, in which the new medium needs to be both discovered and invented. What are the essential differences between what the computer can do and what we could do before? Important in this respect is that the book as a medium never functionally changed. But the nature of text did change with the change of medium. This is what Van der Weel called medial transformativity: there exist discontinuities between the textual mediums, for each medium has its own bias based in its technical properties. Van der Weel went on to elaborate on some possibilities the computer offers in the process of knowledge dissemination. He concluded by saying that to establish the true nature of digital textuality, we need to recognize that next to the process of discovery (the invention of the digital medium) we still need some time for the process of invention: we humanities scholars need to say what we want from the digital medium. We need to be widely creative and experimental to determine what we want: we need to be inventors.
In the next session Peter Øhrstrøm gave a nice overview of his endeavor to turn a seventeenth-century book into a hypertext. The book, Ogdoas Scholastica by Jacob Lorhard, in which the term ontology was coined for the first time, lends itself perfectly to this because of its extensive use of dichotomies. Øhrstrøm argued that the representation of Lorhard’s ontology using modern hypertext provides a better and deeper understanding and overview of the history of ontology, of Lorhard’s ideas of knowledge and knowledge organization, and of Lorhard’s use of ideas from other writers. Ben Salemans delved deeper into the question whether ICT can be seen as a methodological innovation: does it speed up existing techniques or does it create a new domain of techniques? Although he remarked that these are of course more questions for the philosophy of science, he did argue that forms of ‘deductive’ science can also be helped by the computer. Finally John Lavagnino talked about the possibility of systematic emendation with the help of ICT.
Wido van Peursen and Ernst Thoutenhoofd closed the day with their lecture about current and future text comparison and digital creativity, which also served as a wrap up and summary of the colloquium. As they argued, digital creativity is in principle a paradox. The digital stands for calculation and sorting where creativity stands for unpredictability and subjectivity. But digital creativity can also be seen as the ingenuity of human beings to create algorithms for the processing of language. A second paradox can be seen between presence (materiality, ‘what meaning cannot convey’, textual carriers, physics) and meaning (interpretation, attribution of meaning, texts and meta-physics). Van Peursen argued that there has been an increased interest in the material carriers of text, connected to the technological innovations. There are however challenges to the use of the computer in interpretation. The question is: what does the computer contribute to our interpretation of the text? A third paradox exists between scholarly (interpretation, analysis and subjective) and scientific (sort, quantify etc.) research. Does the computer thus give text comparison a more ‘scientific’ character? To some extent it does, but Van Peursen also argued that scholarly judgment and experience are still needed. Finally, he asked whether we only imitate the classical instruments or whether we also develop new research strategies. In this respect it seems that there is a process going on of both continuation and innovation, where the developments in knowledge creation and representation seem both to revolve around the reconsideration of notions like data, information and knowledge. We seem to be heading towards new forms of collaborative knowledge creation and in this way we are now in a transitional phase between the order of the book and the digital order.
Ernst Thoutenhoofd ended the lecture with a short exploration of the notion of presence in virtual environments like Second Life, in which the experience of reality can be seen as a strictly cognitive event. Where in a sense all reality is virtual, the computer can serve the same function as reality, which is the mediation of presence. All our types of knowledge are also mediated and our experiences are also constantly being mediated. In this way the humanities can be seen as a mediation field. And this is something the humanities need to remember: we are not only studying our world, but we are studying ourselves, or the interactions between ourselves and our world, in which we create each other at the same time.
Last Thursday and Friday the two-day colloquium ‘Text comparison and digital creativity’ was held at the KNAW, as part of its 200th anniversary year. The symposium was a joint initiative of the Virtual Knowledge Studio (VKS) and the Leiden-based Turgama project. For more information on both organizers simply follow the links.
One of the main points of the colloquium was the paradox of digital creativity. The digital stands for the objective, calculative ‘scientific’ method, where creativity stands for subjectivity, interpretation and scholarship. The concept of digital creativity is coupled to the practice of text comparison, leading to the question what the influence of the recent digital developments has been on the field of textual comparison. One of the main questions asked to the speakers was whether ICT developments only lead to a speeding up of the research process or whether it truly introduces new methods of investigation. Moreover, in what sense does the computer affect the creation and representation of knowledge and data?
Another conceptual theme used during the colloquium was the dichotomy of ‘presence’ (materiality, ‘physics’) and meaning (meta-physics, interpretation). The question is whether the hegemony of meaning has come under attack by a re-awakened interest in presence, as stated by Wido van Peursen and Ernst Thoutenhoofd in their introductory text to the colloquium. Focusing more on the topic of the colloquium, they raise the question:
How [does] the computational, analytical work done in digital scholarship relate to the subjective moods of interpretation and intuition that characterize traditional philology?¹
Isn’t text comparison becoming more and more like an exact science with the coming of computation? And is it being separated from text interpretation in this respect? Or does interpretation still play an important role in textual comparison?
One of the keynote speakers was David Crystal, who explored the changing nature of text in his lecture, focusing on the emergence of what he calls Digitally Mediated Communication (DMC). In his lecture he compared DMC (looking at both continuities and discontinuities) with other, traditional ‘texts’ (speech, writing and sign), by taking a look at the salient features of these mediums of communication. He concluded that although DMC has more properties linked to writing, it deploys properties of both writing and speech. More interesting, however, is the fact that, as Crystal argues, DMC has led to the rise of texts with properties that have no written or spoken equivalent (he mentions spam filters, search engine rankings and moderated/filtered texts), that are multi-authored (wikis) and that have no boundaries (texts are never finished). He argued that the salient features of DMC are still for a large part unknown or uncertain, and urged that its properties be studied from within linguistics.
The session on texts as artefacts showed how artefacts can be represented or studied using digital technologies. Bruce Zuckerman gave a tour of the InscriptiFact website/database, which offers different ways of searching for (texts as) artefacts and inscriptions (using a wide range of indexing techniques) and of representing (texts as) artefacts, using pictures that can, for example, be moved around the screen and are searchable themselves. Zuckerman also talked about an experimental feature of the database where artefacts can be viewed under different angles of lighting, using a light dome, thus greatly improving their ‘presence’. These techniques, as he argued, have led to different levels of interpretation that were not, or hardly, possible before. Roger Boyle introduced a technique that makes it possible to take a look ‘inside’ paper, which helps with finding and identifying watermarks. Watermarks can often be unintentional marks of value and time, which can help to establish the attribution and dating of texts. He argued that computer science can and does bring more than a bag of techniques to improve pictures for codicologists, palaeographers and papyrologists.
During the session on texts as objects of transmission, David Parker gave a lecture on the virtual Codex Sinaiticus, which in its original form is scattered across different locations. Four partner institutions are now working together to create a virtual Codex Sinaiticus. Parker explored the similarities and differences between the ancient production and its electronic reproduction. Ulrich Schmid gave a somewhat similar lecture about his endeavors in transmitting the New Testament online, exploring specifically the question of how the digital medium can help us face challenges in text editing, while also seeing a lot of challenges still surrounding the creation of a fully interactive digital edition (from technical difficulties to platform, preservation and copyright issues).
Eep Talstra’s lecture focused more on philosophical and methodological questions concerning bible study in the digital age. Talstra asked whether we speed up classical techniques, or whether we develop a new domain of techniques for access to classical texts. Basically he asks the question how to use computer technology in the domain of Bible and Philology. For in the study of classical texts three layers of text analysis come together: text as a literary composition, as a linguistic structure and as a source for the study of language. Can computer assisted textual analysis help us to do justice to the three layers present in the classical data, more than classical tools could do for us?

Mats Dahlström explored the issue of scholarly editions and editing. Are they just recordings of matters of fact? In this respect he mentioned one of the biggest tensions in scholarly editing, namely the tension between different scholarly and scientific ideals: are scholarly editions a representation of facts or of interpretations? He argued that the pattern of conflicts is not medium specific; it is rather a general trait of textual transmission. The new medium will not do away with the tensions; in some cases it will even enhance them. Here Dahlström sees similarities between the tensions in library digitization and those in scholarly editing, which he went on to compare during the rest of his lecture.
Some of the main points made during the first day were that technology should serve as a tool for the scholar/scientist, not the other way around, and that Humanists should be proactive in their demands towards technological implementations. There should be a dialogue between implementation from above and humanities input from below.
More on Day II of the colloquium will follow shortly.









