
From Metadata To Linked Data

These are my reflections and notes on the From Metadata to Linked Data Summer School being held at Trinity College, Dublin. They are being written on the fly so they will have typos and whenever things get interesting I stop writing. (There is therefore an inverse relationship between the notes and my interest level.) The Twitter hashtag is #m2ld.

The general flow of the workshop is:

  • XML and TEI - encoding texts
  • Parallel segmentation and the Versioning Machine
  • Scholarly workflows and CollateX
  • Text analysis with Voyeur
  • Semantic web and Linked data

Susan Schreibman: Introduction to XML and TEI

Susan started the first day with a very fast and good introduction to XML and TEI. She discussed "tag abuse" (using the wrong TEI tags) and why we should try to follow guidelines. I learned that the "P" in P4 of the TEI Guidelines stands for "Proposal." She encouraged participants to start with TEI Lite. I wonder if TEI-Analytics is ready to go as a next level of TEI beyond Lite. (Probably not yet as there isn't a lot of documentation.) Susan closed off the introduction by talking about authority lists.

The question came up as to whether there was a good place to go to find established authority lists. Some suggestions were Library of Congress and Getty.

After the break we worked hands-on with oXygen on marking up a poem.

Shawn Day: Geospatial Markup

Shawn Day was the after-lunch speaker. He walked us through geospatial encoding. He used as an example the Herodotus TimeMap where you have the text linked to a map and timeline. He also mentioned The Republic of Letters.

He made the point that geospatial is more than GIS. It is space and place. It is about geovisualization, which can be cartographic, but it can have other dimensions. Geospatial is not automatic; you have to do a lot by hand.

Some jargon:

  • Geoparsing is a growing field of parsing text to identify words and phrases as a place, assigning geographic identifiers, and linking to
  • Geotagging is adding specific references to media from text to images.
  • Geocoding is to give a place name and get back the code needed to geotag.
  • Geolocating is assessing the location of a real world object (like my iPhone) based on a mobile connection.
  • GIS is Geographic Information System
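
To make the first few terms concrete, here is a minimal Python sketch of geoparsing followed by geocoding. The toy gazetteer, its rounded coordinates, and the function names are my own illustrative assumptions, not anything Shawn showed; a real project would look names up in an authority list.

```python
import re

# Hypothetical toy gazetteer: place name -> (latitude, longitude).
# The coordinates are deliberately rounded and only illustrative.
GAZETTEER = {
    "Dublin": (53.35, -6.26),
    "Athens": (37.98, 23.73),
}


def geoparse(text):
    """Return the gazetteer place names mentioned in the text, in order of appearance."""
    found = []
    for match in re.finditer(r"[A-Z][a-z]+", text):
        word = match.group(0)
        if word in GAZETTEER and word not in found:
            found.append(word)
    return found


def geocode(name):
    """Look up the coordinates for a place name, or None if it is unknown."""
    return GAZETTEER.get(name)


sentence = "Herodotus never saw Dublin, but he knew Athens well."
places = geoparse(sentence)
print([(name, geocode(name)) for name in places])
```

Matching capitalized words against a list is about the crudest geoparser possible, which is part of why so much still has to be done by hand.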

One of the current standards is KML (Keyhole Markup Language), a language for visualizing geographic information that Google Maps and Google Earth use. GML is another standard.

How do we deal with names in the TEI? There are guidelines for Place Names and Places. <place> is sophisticated and lets you encode all sorts of information beyond just the name. With <place> you can have a place that has a particular name at a particular time. If you use KML in TEI then you have to declare the KML namespace.

A key principle is to not specify accuracy that doesn't exist!
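
As a rough illustration of the last two points, the Python sketch below builds a KML Placemark with the KML 2.2 namespace declared and rounds the coordinates to a stated precision. The `placemark` helper and its `decimals` parameter are hypothetical names of mine, and this shows only the KML side, not how the fragment would be embedded in a TEI <place>.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"  # the KML 2.2 namespace
ET.register_namespace("kml", KML_NS)


def placemark(name, lat, lon, decimals=2):
    """Build a namespaced KML Placemark; KML writes coordinates as lon,lat."""
    pm = ET.Element(f"{{{KML_NS}}}Placemark")
    ET.SubElement(pm, f"{{{KML_NS}}}name").text = name
    point = ET.SubElement(pm, f"{{{KML_NS}}}Point")
    coords = ET.SubElement(point, f"{{{KML_NS}}}coordinates")
    # Round to the precision the source actually supports, no more.
    coords.text = f"{round(lon, decimals)},{round(lat, decimals)}"
    return pm


xml_text = ET.tostring(placemark("Dublin", 53.3498, -6.2603), encoding="unicode")
print(xml_text)
```

Two decimal places is roughly a kilometre, which is about as much as "Dublin" as a city name can justify.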

We then did a hands-on exercise using a TEI document and then a KML document (that could be loaded into Google Earth). Finally, Shawn showed some time visualization tools like HistoryFlow.

Jennifer Edmond: Digital but still Human: The Place of Technology in the Humanities Research Process

We ended the day with a discussion led by Jennifer. The discussion circled around issues of how the digital humanities interacts with the "traditional" humanities. Are we peeling off and becoming our own discipline with different questions, different traditions and different methods? Or do we represent a direction of the humanities that is becoming mainstream? I couldn't help wondering what a sociologist of knowledge or historian will say a century from now. What will actually influence the evolution of the academy? Could granting agencies, by focusing on digital projects, have more influence than discussions in the disciplines? Could our students, by choosing new media over old, affect hiring to the point where the traditional humanities practices go the way of the classics?

Day 2

Susan Schreibman: Introduction to Parallel Segmentation Encoding and the Versioning Machine

Susan introduced the Versioning Machine in preparation for an exercise. This was my first time using it and I love its simplicity. It is open source, and more importantly, open to adaptation if you know a bit of XSLT or CSS. We talked about what you can learn by juxtaposing variants. It is a form of visualization.

Joris van Zundert: New Directions in Scholarly Textual Editing

Joris started by talking about the reduplication of effort and asked why we are all building the same tools. The Interedition project is trying to "promote the interoperability of the tools and methodology we use in the field of digital scholarly editing and research." I've grumbled about the reinventing the wheel rhetoric before - perhaps what we should be aiming for is interoperability. Can Interedition achieve this in Europe?

Joris then talked about Data Based Research as a paradigm shift of going from small amounts of texts to large data stores. He rightly critiqued the term Evidence Based Research where the assumption is that large amounts of data constitute evidence and the individual texts we have traditionally used do not. He also mentioned real time analysis of data similar to weather predictions. He talked about NELL, a project spidering the web and trying to extract facts from the pages spidered. These facts are then organized and people browsing them can vote them up or down. See the facts about books.

He summarized the current trends as W2 > 3C, i.e. Web 2.0 leads to three Cs, Collaboration among them. "The worst thing that can happen with the building of tools is that they are successful." He gave the example of eLaborate, a text transcription tool that has been too successful. (They don't have the resources to sustain their tools.) We are living in a time of a parade of prototypes. These prototype tools are monolithic and can't survive the end of the project and the passing of the researchers' interest. This raises the question of how we solve this. In Interedition they think distribution and redundancy will keep things safe. Putting things in the cloud builds in redundancy. (I'm not sure what definition of cloud makes this true - I'm guessing he is thinking of commercial clouds that are run to a level of reliability.)

He then talked about Microservices - tiny little pieces of software. He gave as an example a collation workflow (CollateX) and the individual microservices. The idea is that one can make new applications by recombining different microservices. We will then need computation curation.
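
To picture the recombining idea, here is a small Python sketch in which each step of a collation is its own tiny function. This is not CollateX or its API; the step names are my own, and the alignment step uses the standard library's difflib purely as a stand-in.

```python
from difflib import SequenceMatcher


def tokenize(text):
    """Service 1: split a witness into word tokens."""
    return text.split()


def normalize(tokens):
    """Service 2: lowercase and strip punctuation so trivial variants match."""
    return [t.lower().strip(".,;:!?") for t in tokens]


def align(tokens_a, tokens_b):
    """Service 3: report where two token lists differ (difflib stand-in)."""
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    return [
        (tag, tokens_a[i1:i2], tokens_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def collate(witness_a, witness_b):
    """Chain the services; any step could be swapped for another implementation."""
    return align(normalize(tokenize(witness_a)), normalize(tokenize(witness_b)))


variants = collate("The quick brown fox.", "The quick red fox")
print(variants)  # [('replace', ['brown'], ['red'])]
```

Because each step only agrees on its input and output, a different tokenizer or aligner could be dropped in without touching the rest.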

I like the idea of microservices, though I think we end up needing certain larger services (like Fedora or a cached text index/textbase). I'm not sold on the reliability of depending on services from other places that may not always be running. We probably need to be able to reimplement other people's services, either by installing instances that we can control or by using their pseudocode to reimplement them in our environments. Alternatively we may have to run our services on commercial platforms where we pay for reliability.

Joris ended by speculating about what the new edition will look like. He sees the edition as a container - a larger application - that combines microservices and content into an interpretative framework.

Tara Andrews: Introduction to Scholarly Workflows for Textual Editing

Tara talked about the process of scholarly editing. She presented a very conservative process with the following steps:

  • Collatio
  • Recensio
  • Examinatio
  • Emendatio

Now we tend to think of the steps in a digital flow as including:

  • Data creation
  • Data analysis
  • Data publication

Note the switch to "data". Not sure what that means.

She then shifted to talking about Stemmatic analysis. She showed a tree diagram of versions of a text generated by statistical tools like stemmatic analysis that come from genetics. (Did I get that right?) She made the important point that generating diagrams doesn't prove anything - the diagrams (and underlying statistics) need to be interpreted.

Tara then walked us through the three steps and raised some important issues around publishing and review. Is publishing electronic editions considered for tenure and promotion? How can they be reviewed? Should we just use the digital tools up to publication, at which point we go out to print?

Day 3

Geoffrey Rockwell: Text Analysis

I taught text analysis on Wednesday morning. I showed and let students use TAPoRware and Voyeur. You can see my script at .

Tobias Blanke: How do computers understand texts

Tobias went in the afternoon and he filled in a lot of the theory that I passed over. In general I like this approach of letting students play first and then stepping back. Tobias wanted to make sure that we understand the limits of analytics.

Tobias pointed to a number of resources like the Jonathan Stray talk on journalism and large data which talks about how he analyzed the Wikileaks collection of thousands of documents. What do we do when the corpus we are handling is too large to read? Is augmented reading different when you haven't read the original or plan to?

Tobias then talked about how the computer handles texts - i.e. what is in the black box. He mentioned TREC and the importance of a community that defines goals and evaluates algorithms against goals. This is something we have to bring back to the digital humanities.

Tobias gave a great overview of some of the key models, algorithms and statistical measurements from the vector space model to the TF-IDF to Zipf's law.
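
For my own reference, here is a minimal Python sketch of two of those ideas: documents as TF-IDF weighted vectors, compared by cosine similarity. The three toy documents and the exact weighting (raw term count times log(N/df)) are my assumptions; there are many variants of the formula.

```python
import math
from collections import Counter

docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog sat on the log",
    "d3": "cats and dogs",
}


def tf_idf_vectors(documents):
    """Return {doc_id: {term: weight}} using raw term frequency times log(N / df)."""
    tokenized = {doc_id: text.split() for doc_id, text in documents.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    return {
        doc_id: {term: count * math.log(n_docs / df[term])
                 for term, count in Counter(tokens).items()}
        for doc_id, tokens in tokenized.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


vectors = tf_idf_vectors(docs)
print(round(cosine(vectors["d1"], vectors["d2"]), 3))
print(cosine(vectors["d1"], vectors["d3"]))
```

Note that "cat" and "cats" share nothing here, so d1 and d3 score zero; without stemming or lemmatization the model only sees identical strings.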

An interesting question was raised about whether we really want the computer to duplicate human experts or whether we want to be surprised. Algorithms that just duplicate human expertise aren't really that interesting unless they can show us why they came to a conclusion, and in that showing surprise us. A related question is what would we do with a tool that generated complete interpretations? Would we stop doing literary studies? Would we want to read computer generated interpretations or is the act of interpretation what we are really interested in? The lazy undergraduate might want interpretations generated so they don't have to read or think, but most of us want to think for ourselves. We play chess in the age of computer chess playing systems that beat most of us because we enjoy the playing!

Tobias then talked about what to do with poorly OCRed text, which is common in really large text collections like Google Books (where there simply are too many books to correct by hand). What techniques work well with large amounts of text with poor OCR? Tobias mentioned nGrams as an example.
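
One way n-grams can help, as I understand it, is that a word with an OCR error still shares most of its character n-grams with the correct form. The Python sketch below is my own illustration of that idea (trigrams compared with Jaccard overlap), not something Tobias presented.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams of a word, padded with boundary markers."""
    padded = f"#{word.lower()}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap of character n-grams, between 0.0 and 1.0."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


# "rnodern" imitates a common OCR confusion of "m" with "rn".
print(ngram_similarity("modern", "rnodern"))
print(ngram_similarity("modern", "ancient"))
```

An exact-match index would miss "rnodern" entirely, while the n-gram overlap still ranks it far closer to "modern" than an unrelated word.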

After the break we looked at Information Extraction, which is not really text analysis (for interpretation). It is about extracting facts and entities from a text. He showed Open Calais and DBpedia Spotlight. He mentioned the GATE system as offering tools for extracting knowledge. This connects with Greg Crane's points in "What to do with a million books."

Information Filtering: Tobias talked about information filtering, where you apply filters to an incoming document flow. He ended his session with a Yahoo Pipes exercise.
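
A minimal Python sketch of the filtering idea, assuming the simplest possible filter (keyword matching) and a made-up stream of headlines; Yahoo Pipes wires up the same kind of step graphically.

```python
def keyword_filter(documents, keywords):
    """Yield only the documents that mention at least one keyword (case-insensitive)."""
    wanted = {k.lower() for k in keywords}
    for doc in documents:
        if wanted & set(doc.lower().split()):
            yield doc


# Hypothetical incoming flow; in practice this might be an RSS feed.
incoming = [
    "New TEI guidelines released",
    "Football results for Saturday",
    "Linked data workshop in Dublin",
]

kept = list(keyword_filter(incoming, ["TEI", "linked"]))
print(kept)
```

Writing the filter as a generator means it can sit on a flow that never ends, passing documents along one at a time.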

Day 4

Owen Conlan: From Concepts to Knowledge

Owen talked about how we represent knowledge on the computer. He presented a classical view of meaning as a symbol (word "jaguar") that points to a real object (the jaguar in the jungle). He then talked about ways that we represent knowledge on a computer:

  • Objects
  • XML
  • Clauses (the old AI way)
  • Graphs

He then talked about graphs and how computers can infer things from graphs and that led to ontologies. A well formed ontology "is one that is expressed in a well-defined syntax that has a well-defined machine interpretation consistent with the above ontology definition."
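
To see what inferring from a graph can mean, here is a toy Python sketch: a handful of triples and one rule (is_a is transitive) applied until no new facts appear. The triples and the `infer_is_a` name are mine, and real ontology reasoners do far more than this.

```python
# Toy knowledge graph as (subject, predicate, object) triples; all names are made up.
TRIPLES = {
    ("jaguar", "is_a", "cat"),
    ("cat", "is_a", "mammal"),
    ("mammal", "is_a", "animal"),
    ("jaguar", "lives_in", "jungle"),
}


def infer_is_a(triples):
    """Add every is_a fact implied by transitivity until nothing new appears."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        is_a = [(s, o) for s, p, o in facts if p == "is_a"]
        for s1, o1 in is_a:
            for s2, o2 in is_a:
                new = (s1, "is_a", o2)
                if o1 == s2 and new not in facts:
                    facts.add(new)
                    changed = True
    return facts


inferred = infer_is_a(TRIPLES) - TRIPLES
print(sorted(inferred))
```

Nobody stated that a jaguar is an animal, yet the graph plus one rule yields it, which is the "more out than we put in" promise in miniature.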

I found myself trying to remember the philosophy of language and logic classes I used to take and how ontologies in a knowledge engineering sense map onto discussions in philosophy. Here are a few questions:

  • The classical model of meaning being a relationship to the world works for physical objects but not for all sorts of things we talk about from "justice" to the past. Is that a problem?
  • The computer doesn't actually model the pointing to the world - it models the relationships between the symbols. We then interpret the relationship to the world. Is that the strength of computer modeling?
  • We don't need ontologies to have meaningful conversations, why should computers? What would be the alternative? Could computers just use statistics where they have a large unstructured textbase (like the web) and draw statistical inferences rather than have formally modeled descriptions of the world?
  • How is this different from the AI work of the 60s and 70s? One answer is that the AI approach was trying to generalize about the world while ontologies are local models. Another answer is the use. AI was trying to get to "artificial intelligence" while the knowledge engineering folk are trying to build systems for particular applications. There is however a hope in the KE world that a good ontology can be reused by other people. I sense some creeping desire for generality (and return to the promise of AI where we get more out than we put in.)
  • How do we deal with ambiguity? The genius of human poetry is how we use ambiguity or polysemy. Ontologies seem to be focused on reducing ambiguity.

We discussed the recent industry initiative to standardize on microdata.

Alexander O'Connor: An overview of the semantic web and linked data

Alexander walked us through RDF and SPARQL. RDF is not XML and XML is not RDF. You can serialize RDF in XML, but you don't have to. He showed Turtle.
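
The point that RDF is a model rather than a syntax can be sketched in Python: the same two statements written out as N-Triples and as Turtle. The serializers below are hand-rolled simplifications of mine that only handle this example (no escaping, no datatypes), and the two DBpedia statements are illustrative.

```python
# RDF is a set of (subject, predicate, object) statements; the syntax is a separate choice.
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
DBR = "http://dbpedia.org/resource/"
DBO = "http://dbpedia.org/ontology/"
PREFIXES = {"dbr": DBR, "dbo": DBO, "rdfs": RDFS}

triples = [
    (DBR + "Dublin", RDFS + "label", '"Dublin"'),  # literal object, already quoted
    (DBR + "Dublin", DBO + "country", DBR + "Republic_of_Ireland"),
]


def to_ntriples(statements):
    """One statement per line, every IRI written out in full."""
    def term(t):
        return t if t.startswith('"') else f"<{t}>"
    return "\n".join(f"{term(s)} {term(p)} {term(o)} ." for s, p, o in statements)


def to_turtle(statements, prefixes):
    """Same statements, with IRIs shortened to prefixed names where a prefix matches."""
    def term(t):
        if t.startswith('"'):
            return t
        for name, base in prefixes.items():
            if t.startswith(base):
                return f"{name}:{t[len(base):]}"
        return f"<{t}>"
    header = [f"@prefix {name}: <{base}> ." for name, base in prefixes.items()]
    body = [f"{term(s)} {term(p)} {term(o)} ." for s, p, o in statements]
    return "\n".join(header + [""] + body)


print(to_ntriples(triples))
print(to_turtle(triples, PREFIXES))
```

For real work a library such as rdflib would be the sensible choice; the sketch is only meant to show that the triples stay the same while the notation changes.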

Alexander had us playing with SNORQL, which lets you try SPARQL queries against DBpedia.
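
The kind of query we tried can also be sent from a script. The Python sketch below only builds the request URL for DBpedia's public endpoint; the query is my own example, and the `format` parameter is an assumption about what that particular endpoint accepts rather than part of the SPARQL standard.

```python
from urllib.parse import urlencode

ENDPOINT = "https://dbpedia.org/sparql"  # DBpedia's public SPARQL endpoint

QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
  <http://dbpedia.org/resource/Dublin> rdfs:label ?label .
  FILTER (lang(?label) = "en")
}
LIMIT 1
"""


def sparql_url(query, endpoint=ENDPOINT):
    """Build a GET URL that asks the endpoint for JSON results."""
    # "format" is assumed to be understood by this endpoint; other servers may
    # expect an Accept header instead.
    params = {"query": query.strip(), "format": "application/sparql-results+json"}
    return f"{endpoint}?{urlencode(params)}"


url = sparql_url(QUERY)
print(url)
# Fetching needs network access, e.g. urllib.request.urlopen(url).read()
```

Pasting the printed URL into a browser is a quick way to check the query before building anything on top of it.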

Alexander used so we could give feedback during the lecture.

Owen Conlan: Use Case: Introduction and Discussion: Linking Data for the Humanities

Owen closed off the day by giving us a picture of where things are going and how linked data can be used in the Humanities. He talked about DBpedia.

The Vision is:

  • Publish RDF on the web
  • Share vocabularies (ontologies?)
  • Use URIs as names for things that provide useful information when looked up
  • Tools that can use all this

A URI that identifies a real-world object != URI that identifies a document about that object. We can make statements about the real object and about the document that describes it. How do we link these? Owen talked about two approaches, 303 Redirects and Fragments (hash URIs).
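
Here is how I picture the two approaches, as a Python sketch. The /resource/, /page/ and /data/ layout imitates the way DBpedia arranges its URIs; the `resolve` and `document_for` functions are hypothetical, not any server's actual code.

```python
from urllib.parse import urldefrag


def resolve(path, accept):
    """Return (status, location) for a request, following a DBpedia-style URI layout."""
    prefix = "/resource/"
    if not path.startswith(prefix):
        return 404, None
    name = path[len(prefix):]
    # The thing itself cannot be sent over HTTP, so answer 303 See Other and
    # point at a document about it, chosen by the Accept header.
    if "application/rdf+xml" in accept or "text/turtle" in accept:
        return 303, f"/data/{name}"
    return 303, f"/page/{name}"


def document_for(hash_uri):
    """Hash URI approach: the document is the URI with its fragment stripped."""
    return urldefrag(hash_uri).url


print(resolve("/resource/Dublin", "text/html"))
print(resolve("/resource/Dublin", "text/turtle"))
print(document_for("http://example.org/places#dublin"))
```

With fragments the client strips the part after # before it ever contacts the server, so no redirect is needed; with 303 the server does the work on every lookup.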

To publish your data you need:

  • Understand the principles
  • Understand your data
  • Choose URIs for Things in your Data
  • Set up Your Infrastructure
  • Link to other Data Sets

VoID (the Vocabulary of Interlinked Datasets) is a way to describe linked datasets from different projects and the links between them.

Laura Mandell showed how they use RDF in NINES.

To search for RDF that has been shared see - it is a registry of open linked data.

Day 5: Case Studies

On day 5 we looked at case studies provided by participants. We used these to talk through how you bring all the technologies together in a real project. The group I was in had a lively discussion about cataloging and linking museum artefacts. Museums have databases of their data and are now experimenting with linked data, but there are questions about the importance of quality, the uses of metadata, whether Google might be easier for users, and the costs of good metadata. Here are some of the questions we were discussing:

  • What is the point of good metadata? Is the work worth it? How is metadata used and by whom?
  • What is the point of metadata at all? Are there alternatives for users?
  • What is the use of linking metadata records from one museum to another? Can it be used to audit metadata (by comparing it to another museum catalogue)? Do users want to search across museum catalogues?
  • What are the alternatives? Is Google a viable alternative for users trying to find stuff across museums?

We talked about how one can use the Mechanical Turk or image games (or other forms of crowdsourcing) to improve metadata, create it, and provide alternatives.




Page last modified on July 08, 2011, at 05:06 AM - Powered by PmWiki