From Metadata To Linked Data
These are my reflections and notes on the From Metadata to Linked Data Summer School being held at Trinity College, Dublin. They are being written on the fly so they will have typos and whenever things get interesting I stop writing. (There is therefore an inverse relationship between the notes and my interest level.) The Twitter hashtag is #m2ld .
The general flow of the workshop is:
Susan Schreibman: Introduction to XML and TEI
Susan started the first day with a very fast and good introduction to XML and TEI. She discussed "tag abuse" (using the wrong TEI tags) and why we should try to follow guidelines. I learned that the "P" in P4 of the TEI Guidelines stands for "proposal." She encouraged participants to start with TEI Lite. I wonder if TEI-Analytics is ready to go as a next level of TEI beyond Lite. (Probably not yet as there isn't a lot of documentation.) Susan closed off the introduction by talking about authority lists.
The question came up as to whether there was a good place to go to find established authority lists. Some suggestions were Library of Congress and Getty.
After the break we worked hands-on with oXygen on marking up a poem.
Shawn Day: Geospatial Markup
Shawn Day was the after-lunch speaker. He walked us through geospatial encoding. As an example he used the Herodotus TimeMap, where the text is linked to a map and a timeline. He also mentioned The Republic of Letters.
He made the point that geospatial is more than GIS. It is space and place. It is about geovisualization, which can be cartographic but can also have other dimensions. Geospatial is not automatic (you have to do a lot by hand).
Among the current standards is KML (Keyhole Markup Language), a language for visualizing geographic information that Google Maps and Google Earth use. GML (Geography Markup Language) is another standard.
How do we deal with names in the TEI? There are guidelines for Place Names and Places. <place> is sophisticated and lets you encode all sorts of information beyond just the name. With <place> you can have a place that has a particular name at a particular time. If you use KML in TEI then you have to declare the KML namespace.
A key principle is to not specify accuracy that doesn't exist!
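To make the namespace point and the accuracy principle concrete for myself, here is a minimal Python sketch (not from Shawn's exercise). It builds a TEI place that carries a KML point. The helper name build_place, the decimals parameter and the Dublin coordinates are my own illustrative assumptions, and whether a given TEI schema accepts the foreign KML element depends on how the schema is customized.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
KML_NS = "http://www.opengis.net/kml/2.2"

# Bind readable prefixes so the serialized XML declares xmlns:tei and xmlns:kml.
ET.register_namespace("tei", TEI_NS)
ET.register_namespace("kml", KML_NS)


def build_place(name, lon, lat, decimals=2):
    """Return a TEI <place> holding a KML <Point> (hypothetical helper)."""
    place = ET.Element(f"{{{TEI_NS}}}place")
    ET.SubElement(place, f"{{{TEI_NS}}}placeName").text = name
    location = ET.SubElement(place, f"{{{TEI_NS}}}location")
    point = ET.SubElement(location, f"{{{KML_NS}}}Point")
    coords = ET.SubElement(point, f"{{{KML_NS}}}coordinates")
    # KML orders coordinates as longitude,latitude. Formatting to a fixed
    # number of decimals avoids claiming more accuracy than the source has.
    coords.text = f"{lon:.{decimals}f},{lat:.{decimals}f}"
    return place


place = build_place("Dublin", -6.2603, 53.3498, decimals=2)
print(ET.tostring(place, encoding="unicode"))
```

The printed element declares both namespaces on the root, which is the declaration step the TEI document needs before any kml: element can appear in it.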
We then did a hands-on exercise using a TEI document and then a KML document (that could be loaded into Google Earth). Finally, Shawn showed some time visualization tools like HistoryFlow.
Jennifer Edmond: Digital but still Human: The Place of Technology in the Humanities Research Process
We ended the day with a discussion led by Jennifer. The discussion circled around how the digital humanities interact with the "traditional" humanities. Are we peeling off and becoming our own discipline with different questions, different traditions and different methods? Or do we represent a direction of the humanities that is becoming mainstream? I couldn't help wondering what a sociologist of knowledge or historian will say a century from now. What will actually influence the evolution of the academy? Could granting agencies, by focusing on digital projects, have more influence than discussions in the disciplines? Could our students, by choosing new media over old, affect hiring to the point where the traditional humanities practices go the way of the classics?
Susan Schreibman: Introduction to Parallel Segmentation Encoding and the Versioning Machine
Susan introduced the Versioning Machine in preparation for an exercise. This was my first time using it and I love its simplicity. It is open source, and more importantly, open to adaptation if you know a bit of XSLT or CSS. We talked about what you can learn by juxtaposing variants. It is a form of visualization.
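For my own reference, here is a minimal Python sketch of what parallel segmentation encoding does. Each point of variation is an app element holding one rdg per witness, and the text of any one witness can be read back out by picking its readings. The sample line, the witness sigla #A and #B, and the function name reading_for are invented for illustration. The sketch leaves out the TEI namespace and assumes app elements are direct children of the line, which real files will not always satisfy.

```python
import xml.etree.ElementTree as ET

# One verse line with a single point of variation between witnesses A and B.
LINE = (
    '<l>The <app>'
    '<rdg wit="#A">bright</rdg>'
    '<rdg wit="#B">pale</rdg>'
    '</app> moon rose</l>'
)


def reading_for(line_xml, witness):
    """Rebuild the text of one witness from a parallel-segmented line."""
    root = ET.fromstring(line_xml)
    parts = [root.text or ""]
    for app in root:
        for rdg in app.findall("rdg"):
            # wit may list several witnesses separated by spaces.
            if witness in rdg.get("wit", "").split():
                parts.append(rdg.text or "")
        # Text shared by all witnesses follows the <app> as its tail.
        parts.append(app.tail or "")
    return "".join(parts)


for siglum in ("#A", "#B"):
    print(siglum, reading_for(LINE, siglum))
```

Printing the witnesses one under the other is a crude text-only version of the side-by-side view.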
Joris van Zundert: New Directions in Scholarly Textual Editing
Joris started by talking about the duplication of effort and asked why we are all building the same tools. The Interedition project is trying to "promote the interoperability of the tools and methodology we use in the field of digital scholarly editing and research." I've grumbled about the reinventing the wheel rhetoric before - perhaps what we should be aiming for is interoperability. Can Interedition achieve this in Europe?
Joris then talked about Data Based Research as a paradigm shift from small amounts of text to large data stores. He rightly critiqued the term Evidence Based Research, where the assumption is that large amounts of data constitute evidence and the individual texts we have traditionally used do not. He also mentioned real time analysis of data similar to weather predictions. He talked about NELL, a project spidering the web and trying to extract facts from the pages spidered. These facts are then organized and browsers can vote them up or down. See the facts about books.
He summarized the current trends as W2 > 3C, i.e. Web 2.0 leads to three Cs, Collaboration among them. "The worst thing that can happen with the building of tools is that they are successful." He gave the example of eLaborate, a text transcription tool that has been too successful. (They don't have the resources to sustain their tools.) We are living in a time of a parade of prototypes. These prototype tools are monolithic and can't survive the end of the project and the passing of the researchers' interest. This raises the question of how we solve this. In Interedition they think distribution and redundancy will keep things safe. Putting things in the cloud builds in redundancy. (I'm not sure what definition of cloud makes this true - I'm guessing he is thinking of commercial clouds that are run to a level of reliability.)
He then talked about Microservices - tiny little pieces of software. He gave as an example a collation workflow (CollateX) and the individual microservices. The idea is that one can make new applications by recombining different microservices. We will then need computation curation.
I like the idea of microservices, though I think we end up needing certain larger services (like Fedora or a cached text index/textbase). I'm not sold on the reliability of depending on services from other places that may not always be running. We probably need to be able to reimplement other people's services, either by installing instances that we can control or by using their pseudocode to reimplement them in our environments. Alternatively we may have to run our services on commercial platforms where we pay for reliability.
Joris ended by speculating about what the new edition will look like. He sees the edition as a container - a larger application - that combines microservices and content into an interpretative framework.
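Here is a rough Python sketch of the recombination idea, with plain functions standing in for networked services. It is not CollateX's API: the step names tokenize, normalize and align are my own, and the alignment uses difflib from the standard library as a stand-in for a real collation algorithm.

```python
from difflib import SequenceMatcher


def tokenize(text):
    """Split a witness into word tokens."""
    return text.split()


def normalize(tokens):
    """Lowercase tokens and strip surrounding punctuation."""
    return [t.lower().strip(".,;:!?") for t in tokens]


def align(a, b):
    """Align two token lists and report where they agree or differ."""
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()]


def collate(text_a, text_b, steps=(tokenize, normalize)):
    """Run each witness through the chosen steps, then align the results."""
    def prepare(text):
        for step in steps:
            text = step(text)
        return text
    return align(prepare(text_a), prepare(text_b))


for row in collate("The quick brown fox.", "The quick red fox"):
    print(row)
```

Because the steps are passed in as a tuple, a different application can swap in another tokenizer or drop normalization without touching the aligner, which is the kind of recombination Joris was describing.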
Tara Andrews: Introduction to Scholarly Workflows for Textual Editing
Tara talked about the process of scholarly editing. She presented a very conservative process with the following steps:
Now we tend to think of the steps in a digital flow as including:
Note the switch to "data". Not sure what that means.
She then shifted to talking about Stemmatic analysis. She showed a tree diagram of versions of a text generated by statistical tools like stemmatic analysis that come from genetics. (Did I get that right?) She made the important point that generating diagrams doesn't prove anything - the diagrams (and underlying statistics) need to be interpreted.
Tara then walked us through the three steps and raised some important issues around publishing and review. Are electronic editions considered for tenure and promotion? How can they be reviewed? Should we just use the digital tools up to publication, at which point we go out to print?
Geoffrey Rockwell: Text Analysis
Tobias Blanke: How do computers understand texts
Tobias went in the afternoon and he filled in a lot of the theory that I passed over. In general I like this approach of letting students play first and then stepping back. Tobias wanted to make sure that we understand the limits of analytics.
Tobias pointed to a number of resources like the Jonathan Stray talk on journalism and large data which talks about how he analyzed the Wikileaks collection of thousands of documents. What do we do when the corpus we are handling is too large to read? Is augmented reading different when you haven't read the original or plan to?
Tobias then talked about how the computer handles texts, i.e. what is in the black box. He mentioned TREC and the importance of a community that defines goals and evaluates algorithms against those goals. This is something we have to bring back to the digital humanities.
Tobias gave a great overview of some of the key models, algorithms and statistical measurements from the vector space model to the TF-IDF to Zipf's law.
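To pin down two of these for myself, here is a minimal Python sketch of TF-IDF weighting inside a vector space model, with cosine similarity to compare documents. The three toy documents and the function names are mine, and the weighting is one common variant (relative term frequency times the log of inverse document frequency), not necessarily the formula on Tobias's slides.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Turn each document into a sparse {term: weight} vector."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


docs = ["the cat sat", "the cat ran", "the dog barked"]
vectors = tf_idf_vectors(docs)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

A word that occurs in every document, like "the" here, gets a weight of zero, so the first two documents are similar only because they share "cat".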
An interesting question was raised about whether we really want the computer to duplicate human experts or do we want to be surprised? Algorithms that just duplicate human expertise aren't really that interesting unless they can show us why they came to a conclusion, and in that showing surprise us. A related question is what would we do with a tool that generated complete interpretations? Would we stop doing literary studies? Would we want to read computer generated interpretations or is the act of interpretation what we are really interested in. The lazy undergraduate might want interpretations generated so they don't have to read or think, but most of us want to think for ourselves. We play chess in the age of computer chess playing systems that beat most of us because we enjoy the playing!
Tobias then talked about what to do with poorly OCRed text, which is common in really large text collections like Google Books (where there simply are too many books to correct by hand). What techniques work well with large amounts of text with poor OCR? Tobias mentioned n-grams as an example.
After the break we looked at Information Extraction, which is not really text analysis (for interpretation). It is about extracting facts and entities from a text. He showed OpenCalais and DBpedia Spotlight. He mentioned the GATE system as offering tools for extracting knowledge. This connects with Greg Crane's points in "What to do with a million books."
Information Filtering: Tobias talked about information filtering, where you apply filters to an incoming document flow. He ended his session with a Yahoo Pipes exercise.
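I didn't note exactly which n-gram technique Tobias had in mind. One common reading is character n-grams, which let you match words that OCR has partly mangled. Here is a minimal Python sketch under that assumption. The function names, the padding character and the sample words are mine.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams of a word, padded at both ends."""
    padded = "_" * (n - 1) + word.lower() + "_" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity(a, b, n=3):
    """Jaccard overlap between the n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# "hist0ry" stands in for an OCR error (zero instead of the letter o).
for candidate in ("history", "mystery"):
    print(candidate, similarity("hist0ry", candidate))
```

A single wrong character only spoils the few n-grams that contain it, so the damaged word still scores closer to its true form than to an unrelated word.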
Owen Conlan: From Concepts to Knowledge
Owen talked about how we represent knowledge on the computer. He presented a classical view of meaning as a symbol (word "jaguar") that points to a real object (the jaguar in the jungle). He then talked about ways that we represent knowledge on a computer:
He then talked about graphs and how computers can infer things from graphs and that led to ontologies. A well formed ontology "is one that is expressed in a well-defined syntax that has a well-defined machine interpretation consistent with the above ontology definition."
I found myself trying to remember the philosophy of language and logic classes I used to take and how ontologies in a knowledge engineering sense map onto discussions in philosophy. Here are a few questions:
We discussed the recent industry Schema.org initiative to standardize on microdata.
Alexander O'Connor: An overview of the semantic web and linked data
Alexander walked us through RDF and SPARQL. RDF is not XML and XML is not RDF. You can serialize RDF in XML, but you don't have to. He showed Turtle.
Alexander had us playing with SNORQL, which lets you try SPARQL queries against DBpedia.
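The point that RDF is a data model and not a file format can be sketched in a few lines of Python. Below, the triples live as plain tuples and are only turned into text at the end, as N-Triples (a simple line-based syntax that is also valid Turtle). The Literal wrapper and the function names are my own, and the two statements about Dublin are illustrative and have not been checked against DBpedia.

```python
from collections import namedtuple

# Hypothetical wrapper to tell string literals apart from URIs.
Literal = namedtuple("Literal", "value")

DBR = "http://dbpedia.org/resource/"
DBO = "http://dbpedia.org/ontology/"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

# Each statement is a (subject, predicate, object) triple.
triples = [
    (DBR + "Dublin", RDFS + "label", Literal("Dublin")),
    (DBR + "Dublin", DBO + "country", DBR + "Republic_of_Ireland"),
]


def term(t):
    """Format one term: literals in quotes, URIs in angle brackets."""
    if isinstance(t, Literal):
        escaped = t.value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    return f"<{t}>"


def to_ntriples(statements):
    """Serialize triples one per line, each ending with a full stop."""
    return "\n".join(f"{term(s)} {term(p)} {term(o)} ." for s, p, o in statements)


print(to_ntriples(triples))
```

The same list of tuples could just as well be written out as RDF/XML; only the serializer would change.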
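For later reference, here is a Python sketch of the kind of query we were typing in, together with the URL a client would request. It only builds the URL and does not contact the server. The endpoint address is the public DBpedia one as I understand it, the query parameter name comes from the SPARQL protocol, and the query itself is my own example, not one from the lecture.

```python
from urllib.parse import urlencode

# Assumed public endpoint; check it before relying on it.
ENDPOINT = "https://dbpedia.org/sparql"

# Ask for the English label of the resource that identifies Dublin.
QUERY = """PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
  <http://dbpedia.org/resource/Dublin> rdfs:label ?label .
  FILTER (lang(?label) = "en")
}"""


def build_request_url(endpoint, query):
    """Encode a SPARQL query into a GET request URL."""
    return f"{endpoint}?{urlencode({'query': query})}"


url = build_request_url(ENDPOINT, QUERY)
print(url)
```

Pasting the query text into SNORQL's form should be the interactive equivalent of requesting this URL.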
Alexander used http://typewith.me/ so we could give feedback during the lecture.
Owen Conlan: Use Case: Introduction and Discussion: Linking Data for the Humanities
Owen closed off the day by giving us a picture of where things are going and how linked data can be used in the Humanities. He talked about DBpedia.
The Vision is:
A URI that identifies a real-world object != a URI that identifies a document about that object. We can make statements about the real object and about the document that describes it. How do we link these? Owen talked about two approaches, 303 Redirects and Fragments (hash URIs).
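Here is how I understand the two approaches, as a small Python sketch. In the redirect approach the server answers a request for the thing's URI with HTTP 303 See Other, pointing at a document about it. In the fragment approach the part after # never reaches the server, so the client ends up fetching the document automatically. The respond and document_for functions are toy stand-ins, the /resource, /page and /data paths imitate DBpedia's layout, and example.org is a placeholder.

```python
from urllib.parse import urldefrag


def respond(path, accept="text/html"):
    """Toy server logic for the 303 approach: return (status, headers)."""
    if path.startswith("/resource/"):
        name = path[len("/resource/"):]
        wants_rdf = "application/rdf+xml" in accept or "text/turtle" in accept
        # The thing itself cannot be sent over HTTP, so point at a document
        # about it: a data file for machines, a web page for people.
        target = f"/data/{name}" if wants_rdf else f"/page/{name}"
        return 303, {"Location": target}
    return 200, {}


def document_for(hash_uri):
    """Fragment approach: the document URI is the URI without its fragment."""
    return urldefrag(hash_uri).url


print(respond("/resource/Dublin"))
print(respond("/resource/Dublin", accept="text/turtle"))
print(document_for("http://example.org/places#dublin"))
```

The trade-off as I see it is that redirects cost an extra round trip per lookup, while fragments put every thing described under one document URI into a single file.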
To publish your data you need:
Laura Mandell showed how they use RDF in NINES.
To search for RDF that has been shared, see http://ckan.net/ - it is a registry of open linked data.
Day 5: Case Studies
On day 5 we looked at case studies provided by participants. We used these to talk through how you bring all the technologies together in a real project. The group I was in had a lively discussion about cataloging and linking museum artefacts. Museums have databases of their data, and now they are experimenting with linked data, but there are questions about the importance of quality, the uses of metadata, whether Google might be easier for users, and the costs of good metadata. Here are some of the questions we were discussing:
We talked about how one can use the Mechanical Turk or image games (or other forms of crowdsourcing) to improve metadata, create it, and provide alternatives.
Page last modified on July 08, 2011, at 05:06 AM