Digging Into Data Challenge 2011

Main.DiggingIntoDataChallenge2011 History


June 10, 2011, at 03:54 PM by 205.201.247.78 -
Added lines 200-201:
Patrick Juola asked the team whether what they were doing was closer to genre analysis than authorship attribution.
June 10, 2011, at 03:52 PM by 205.201.247.78 -
June 10, 2011, at 03:52 PM by 205.201.247.78 -
Added lines 183-186:
Peter Ainsworth concluded with results from the subprojects. This project identified a unique colour-space palette for the manuscript artist/master known as the follower of the Rohan Master. For the quilts, they found that Victorian crazy quilts are similar to modern quilts, along with gaining insights into colour. For the maps, they have results that help them understand the differences between French and British maps. They found that French mapmakers had a better grasp of how climatic features affect the topography, an unexpected insight that they hadn't anticipated.

Humanists haven't had the chance to think of such things before, but they could help with authorship. Authorship remains a useful concept for anonymous works. Collaboration with non-humanists forces humanists to make explicit their methodology of visual inspection.

Changed lines 189-190 from:

to:
!!! Respondent: Sha Xin Wei
Wei took the chance to talk about disciplinary practice. He feels that the next step is to co-develop tools with interpretation, moving from human-in-the-loop machine processing to humanist-in-the-loop. He commented on the need to interpret and critique the tools too. We should turn our interpretative skills on our own tools. What you see might be what you expect to see.

He talked about tertiary orality. There are a lot more people with mobile phones than with internet access on the planet.

He talked about the huge problem of signal analysis and semantic analysis. There are a lot of assumptions about what is signal and what is noise. He showed a video of a responsive environment with shallow semantics. He argued for a performative approach. Meaning is constructed in performance. How can we use that to guide the development of tools?

He sees interpretation as a promulgation of scholarly dialogue. Are graphs really sufficient to the phenomena?

Finally, what is the unit of analysis? Maybe there are no primitives. Maybe there is experimental practice as a form of performance. Can we imagine an experimental form of humanities where we build the very things we are studying?

June 10, 2011, at 03:29 PM by 205.201.247.78 -
Added lines 181-182:
Peter then talked about the computer science side of things. The memorandum of agreement was important when negotiating across disciplines. He also showed how they recorded Skype sessions for future reference. They also shared a software repository, which was used to share algorithms. Segmentation tended to be different for different types of images, but the algorithms for edge detection could be similar. They are developing a statistical framework that can tell you how much confidence you should have in the statistics about the segments.
June 10, 2011, at 03:23 PM by 205.201.247.78 -
Changed lines 175-180 from:
Peter then talked about the medieval manuscripts. The manuscripts, while produced in one spot, are now dispersed. We can bring them together in the cloud. Their main quarry is the artists and scribes who created the manuscripts. They created an image tool for examining the pages. The art historians try to define traits of a master artist. They are trying to assist in the tracking of artists. Likewise, they are interested in the scribes and their particular orthography. Doing it by hand/eye is difficult. Could digital techniques help? Could it help with the identity of the shadowy figures who copied manuscripts?




to:
Peter then talked about the medieval manuscripts. The manuscripts, while produced in one spot, are now dispersed. We can bring them together in the cloud. Their main quarry is the artists and scribes who created the manuscripts. They created an image tool for examining the pages. The art historians try to define traits of a master artist. They are trying to assist in the tracking of artists. Part of the issue is segmentation, so one has smaller shapes (heads and helmets, for example). They also used colour-space analysis. Likewise, they are interested in the scribes and their particular orthography. Doing it by hand/eye is difficult. Could digital techniques help? Could it help with the identity of the shadowy figures who copied manuscripts? They applied Sobel edge detection.
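Sobel edge detection is a standard technique; a toy sketch (my own illustration with an invented tiny "image", not the team's code) looks like this:

```python
# Minimal sketch of Sobel edge detection on a small grayscale grid.
# The image and numbers are invented for illustration only.

def sobel_magnitude(img):
    """Return gradient magnitudes for the interior pixels of a 2D grid."""
    kx = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]  # horizontal-gradient kernel
    ky = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]  # vertical-gradient kernel
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(kx[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(ky[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out

# A vertical light/dark boundary produces a strong response at the edge
# and zero response in the flat regions on either side.
image = [[0, 0, 0, 9, 9, 9]] * 5
edges = sobel_magnitude(image)
```

The response peaks exactly where the dark and light columns meet, which is why edge maps are a useful first step before segmenting heads, helmets, and other shapes.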

Dean Rehberger talked about 19th and 20th century quilts. He commented on how important the young scholars were to the success of the project, as were computer scientists who cared. In their quilt database they have tens of thousands of items with lots of metadata. One question they addressed was whether they could identify which were "crazy" quilts, a Victorian invention. They were produced by women and are not as regular. They have a crazy explosion of shapes and colour. Segmentation was again an issue. Their algorithm got to about 70% accuracy. What was interesting was the group of false positives. Now they are trying to see if they can determine if something is an Amish quilt. They are also interested in how quilters take up ideas from each other.

Peter Bajcsy then talked about dealing with maps of the 17th and 18th century. One thing they did was to use the neatlines for scale rather than scale indicators (which the computer can't find easily). The neatline is the frame around the edge that often has ticks that indicate scale. He showed a table showing the Great Lakes over different maps and how close the map area for each lake was compared to the actual area. Can this be used to tell how accurate the map as a whole is? Can they tell things about maps from different countries (English vs. French)?
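The neatline trick boils down to simple arithmetic; here is my own toy sketch with invented numbers (only the approximate real lake area is factual):

```python
# Toy sketch of deriving scale from neatline ticks and using it to
# compare a lake's measured map area to its actual area.
# All pixel numbers are invented for illustration.

def scale_from_neatline(tick_px, tick_miles):
    """Miles per pixel, derived from the pixel spacing of neatline ticks
    whose real-world spacing is known."""
    return tick_miles / tick_px

def estimated_area(area_px, miles_per_px):
    """Convert a measured pixel area into square miles."""
    return area_px * miles_per_px ** 2

miles_per_px = scale_from_neatline(tick_px=50, tick_miles=100)  # 2 mi/px
est = estimated_area(area_px=5000, miles_per_px=miles_per_px)
actual = 22300  # roughly Lake Michigan's area in square miles
accuracy_ratio = est / actual
```

A ratio near 1.0 for each lake would suggest the map as a whole is drawn close to true scale, which is the comparison the table he showed was making.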

See http://isda.ncsa.uiuc.edu/DID/ for more.



June 10, 2011, at 03:02 PM by 205.201.247.78 -
Changed lines 173-178 from:





to:
Peter talked about some of the challenges of collaboration across multiple sites. They chronicled their journey in a [[http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/3372/2950 | First Monday article]]. One thing that helped was a memorandum of understanding at the beginning. This covered permissions and credit.

Peter then talked about the medieval manuscripts. The manuscripts, while produced in one spot, are now dispersed. We can bring them together in the cloud. Their main quarry is the artists and scribes who created the manuscripts. They created an image tool for examining the pages. The art historians try to define traits of a master artist. They are trying to assist in the tracking of artists. Likewise, they are interested in the scribes and their particular orthography. Doing it by hand/eye is difficult. Could digital techniques help? Could it help with the identity of the shadowy figures who copied manuscripts?





June 10, 2011, at 02:52 PM by 205.201.247.78 -
Changed lines 170-175 from:





to:
!! Digging into Image Data to Answer Authorship Related Questions
Peter Ainsworth started by talking about the complexity of authorship. It is in the 15th century that authorship emerges as a significant designation. We care about authorship because it is key to understanding cultural production. Their challenge was to look at authorship through 3 very different image corpora (manuscripts, quilts, and maps). In all three cases they don't know who produced the items, though they think they might. They designed image analysis algorithms to extract features and then classify images.






June 10, 2011, at 02:44 PM by 205.201.247.78 -
Changed lines 168-173 from:
She asked: what do you do with a million texts? It turns out that you can break them down into their words. What do you do with a billion words then? She finds it harder to swallow the distant reading techniques and questions that come when you have billions of words. She asked questions about what she might be able to do. She asked about words and their range of meanings. She had a general request for a more thorough examination of the effects of OCR errors.




to:
She asked: what do you do with a million texts? It turns out that you can break them down into their words. What do you do with a billion words then? She finds it harder to swallow the distant reading techniques and questions that come when you have billions of words. She asked questions about what she might be able to do. She asked about words and their range of meanings. She had a general request for a more thorough examination of the effects of OCR errors. She closed with a plea for both the large-scale tools and provision for the input of scholars (in addition to citizen participants).






June 10, 2011, at 02:41 PM by 205.201.247.78 -
Changed lines 168-171 from:
She asked: what do you do with a million texts? It turns out that you can break them down into their words. What do you do with a billion words then? She finds it harder to swallow the distant reading techniques and questions that come when you have billions of words. She asked questions about what she might be able to do.


to:
She asked: what do you do with a million texts? It turns out that you can break them down into their words. What do you do with a billion words then? She finds it harder to swallow the distant reading techniques and questions that come when you have billions of words. She asked questions about what she might be able to do. She asked about words and their range of meanings. She had a general request for a more thorough examination of the effects of OCR errors.




June 10, 2011, at 02:34 PM by 205.201.247.78 -
Changed lines 166-167 from:
Cynthia chose to address the team.
to:
Cynthia chose to address the team. She talked about mongrel texts and how the first generation of printed texts was problematic. Many of the important texts have no modern edition, and the digitization of what is there will mean that many texts go from their medieval mongrel phase to electronic form without modern editing.

She asked: what do you do with a million texts? It turns out that you can break them down into their words. What do you do with a billion words then? She finds it harder to swallow the distant reading techniques and questions that come when you have billions of words. She asked questions about what she might be able to do.


June 10, 2011, at 02:25 PM by 205.201.247.78 -
Changed lines 157-158 from:
Greg Crane talked about what a variorum edition is. John Darlington talked about why 2000 years of Latin is great for studying variation. They have crawled 1.2 M books from the Internet Archive of which 25 K are catalogued as Latin but many of them are not. He talked about the problem of polysemy when using large text databases. They trained a broad-coverage word sense disambiguation tool using parallel texts (English/Latin). Where you have a Latin work and its English translation you can train a disambiguation tool which can then be run on the rest of the corpus.
to:
Greg Crane talked about what a variorum edition is. A colleague talked about why 2000 years of Latin is great for studying variation. They have crawled 1.2 M books from the Internet Archive of which 25 K are catalogued as Latin but many of them are not. He talked about the problem of polysemy when using large text databases. They trained a broad-coverage word sense disambiguation tool using parallel texts (English/Latin). Where you have a Latin work and its English translation you can train a disambiguation tool which can then be run on the rest of the corpus.
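The parallel-text trick can be illustrated with a toy example. The tiny labelled "corpus", the ambiguous word, and the bag-of-words classifier below are my own invention, far simpler than what the team built:

```python
# Toy word sense disambiguation trained from parallel texts: the English
# gloss of an ambiguous Latin word (here "ius": "law" vs. "broth") acts
# as the sense label for each occurrence's context words.
from collections import Counter, defaultdict

# (context words from the Latin passage, sense label from the English side)
labelled = [
    ({"civile", "romanum"}, "law"),
    ({"naturale", "gentium"}, "law"),
    ({"coquere", "olla"}, "broth"),
]

# Count how often each context word co-occurs with each sense.
sense_counts = defaultdict(Counter)
for context, sense in labelled:
    for w in context:
        sense_counts[sense][w] += 1

def disambiguate(context):
    """Pick the sense whose training contexts share the most words."""
    scores = {s: sum(c[w] for w in context) for s, c in sense_counts.items()}
    return max(scores, key=scores.get)

sense = disambiguate({"civile", "gentium"})  # → "law"
```

Once trained on the aligned portion, a classifier like this can be run over the untranslated remainder of the corpus, which is the workflow described above.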
Added lines 161-167:
John Darlington talked about creating high-throughput infrastructure for OCR and text-based feature extraction for Greek and Latin. He talked about e-science frameworks and how they can be developed for supporting projects like this. Another colleague talked about how e-science could be applied to large-scale OCR. The key is minimizing the need for human intervention. To do this one needs ground truth that can be used to train the OCR.

Greg Crane closed the presentation. He talked about how we need the participation of youth to transform our intellectual culture. We have to invert the hierarchical virtuoso culture of classics so that student researchers can do meaningful work (instead of being told that they can't contribute until they have a [=PhD=].) Citizen scholarship is the future.

!!! Respondent: Cynthia Damon
Cynthia chose to address the team.

June 10, 2011, at 02:09 PM by 205.201.247.78 -
Changed lines 1-2 from:
I'm at the [[http://diggingintodata.org | Digging Into Data Challenge]] conference that is bringing together the investigators of the first round of the Challenge. I was the Canadian lead on the [[http://criminalintent.org | Datamining with Criminal Intent]] project. What follows are my conference notes. They are therefore incomplete and rough. You can also see the twitter feed searching for #DiD11 or look at Jen Howard's notes at http://bit.ly/l1k2bL . A group photo of the Canadian respondents, grant council folk and investigators is at http://www.theoreti.ca/?p=3702 .
to:
I'm at the [[http://diggingintodata.org | Digging Into Data Challenge]] conference that is bringing together the investigators of the first round of the Challenge. I was the Canadian lead on the [[http://criminalintent.org | Datamining with Criminal Intent]] project. What follows are my conference notes. They are therefore incomplete and rough. You can also see the twitter feed searching for [=#DiD11=] or look at Jen Howard's notes at http://bit.ly/l1k2bL . A group photo of the Canadian respondents, grant council folk and investigators is at http://www.theoreti.ca/?p=3702 .
June 10, 2011, at 02:09 PM by 205.201.247.78 -
Changed lines 1-2 from:
I'm at the [[http://diggingintodata.org | Digging Into Data Challenge]] conference that is bringing together the investigators of the first round of the Challenge. I was the Canadian lead on the [[http://criminalintent.org | Datamining with Criminal Intent]] project. What follows are my conference notes. They are therefore incomplete and rough. You can also see the twitter feed searching for #DiD11 or look at Jen Howard's notes at http://bit.ly/l1k2bL .
to:
I'm at the [[http://diggingintodata.org | Digging Into Data Challenge]] conference that is bringing together the investigators of the first round of the Challenge. I was the Canadian lead on the [[http://criminalintent.org | Datamining with Criminal Intent]] project. What follows are my conference notes. They are therefore incomplete and rough. You can also see the twitter feed searching for #DiD11 or look at Jen Howard's notes at http://bit.ly/l1k2bL . A group photo of the Canadian respondents, grant council folk and investigators is at http://www.theoreti.ca/?p=3702 .
June 10, 2011, at 02:07 PM by 205.201.247.78 -
Changed lines 97-98 from:
I was part of the presentation on the [[http://criminalintent.org | Criminal Intent]] project so I
to:
I was part of the presentation on the [[http://criminalintent.org | Criminal Intent]] project so I couldn't take notes, but you can see the slides at http://criminalintent.org along with instructions on how to do it yourself.
Added lines 104-105:
You can read the full text of his paper at http://lenz.unl.edu/papers/2011/06/10/prison-art.html
Changed lines 154-160 from:
They concluded by talking about culturomics: "the application of high throughput data collection and analysis to the study of culture." Reminds me of Lev Manovich's Cultural Analytics, though he is looking at non-textual data in many cases.
to:
They concluded by talking about culturomics: "the application of high throughput data collection and analysis to the study of culture." Reminds me of Lev Manovich's Cultural Analytics, though he is looking at non-textual data in many cases.

!! Towards Dynamic Variorum Editions
Greg Crane talked about what a variorum edition is. John Darlington talked about why 2000 years of Latin is great for studying variation. They have crawled 1.2 M books from the Internet Archive of which 25 K are catalogued as Latin but many of them are not. He talked about the problem of polysemy when using large text databases. They trained a broad-coverage word sense disambiguation tool using parallel texts (English/Latin). Where you have a Latin work and its English translation you can train a disambiguation tool which can then be run on the rest of the corpus.

Bruce Robertson then talked about work with Greek. He talked about his workflow for processing Greek texts. Because OCR doesn't work well on Greek they used students to correct stuff.

June 10, 2011, at 01:18 PM by 205.201.247.78 -
Deleted line 0:
Changed lines 146-147 from:
Erez Lieberman-Aiden & JB Michel from Harvard have been working on the Google Books corpus and developed the Google N-Gram viewer as a result.
to:
Erez Lieberman-Aiden & JB Michel from Harvard have been working on the Google Books corpus and developed the Google [=NGram=] viewer as a result.
Changed lines 150-152 from:
They showed some very interesting graphs of the uptake of inventions (how is "radio" talked about after its invention?).
to:
They showed some very interesting graphs of the uptake of inventions (how is "radio" talked about after its invention?). They tracked fame over time (people get famous faster and get forgotten faster). They tracked the careers that make one famous (political figures, authors, and actors do best). They did a lot of work on censorship by the Nazis and how in Germany certain people were suppressed.
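The underlying measure in these graphs is simple: occurrences of a word in a year divided by all words printed that year. A sketch with invented counts (not real Google Books data):

```python
# Relative frequency per year, the quantity plotted in n-gram viewers.
# The counts below are made up for illustration.

def relative_frequency(word_counts, total_counts):
    """Per-year frequency of a word, as a fraction of all words that year."""
    return {year: word_counts.get(year, 0) / total_counts[year]
            for year in total_counts}

radio_counts = {1900: 2, 1910: 40, 1925: 900}          # hits for "radio"
totals = {1900: 1_000_000, 1910: 1_000_000, 1925: 1_000_000}  # corpus sizes

curve = relative_frequency(radio_counts, totals)
```

Plotting `curve` over the years gives exactly the kind of invention-uptake curve described above; dividing by the yearly total is what makes different-sized years comparable.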

They concluded by talking about culturomics: "the application of high throughput data collection and analysis to the study of culture." Reminds me of Lev Manovich's Cultural Analytics, though he is looking at non-textual data in many cases.
June 10, 2011, at 12:54 PM by 205.201.247.78 -
Changed lines 1-2 from:
I'm at the [[http://diggingintodata.org | Digging Into Data Challenge]] conference that is bringing together the investigators of the first round of the Challenge. I was the Canadian lead on the [[http://criminalintent.org | Datamining with Criminal Intent]] project. What follows are my conference notes. They are therefore incomplete and rough.
to:

I'm at the [[http://diggingintodata.org | Digging Into Data Challenge]] conference that is bringing together the investigators of the first round of the Challenge. I was the Canadian lead on the [[http://criminalintent.org | Datamining with Criminal Intent]] project. What follows are my conference notes. They are therefore incomplete and rough. You can also see the twitter feed searching for #DiD11 or look at Jen Howard's notes at http://bit.ly/l1k2bL .
Changed lines 135-151 from:
Our datasets show that the humanities have comparable if not larger collections. Our data is also messier and more interesting.
to:
Our datasets show that the humanities have comparable if not larger collections. Our data is also messier and more interesting (at least to us).

Mark Liberman then showed some of their results, but they see this project as being of interest to people beyond linguists.

!!! Respondent: Dan Jurafsky
The respondent started by reflecting on what happens with "micro-revolutions": large datasets can lead to research micro-revolutions. He gave a survey of what can be done across disciplines with lots of data. With very large datasets you can look at "lopsided scarcity", where in a long-tail situation you want to look at sparse items. You can look at patterns that in a normal dataset would appear so infrequently that statistical inferences can't be made.

This project has advanced research on forced alignment tools. Another technical problem that they tackled is anonymization.

He closed by talking about the collaboration of humanities and computer science. It is hard in both fields to get tenure for this type of work. Humanities scholars feel under attack. How can we entice folk in CS to take the humanities more seriously? At U of Alberta we are lucky that we have about 6 CS faculty interested in our work. It seems to be a cultural thing - a department with lots of people working with humanists will have a climate that is welcoming.

!! Keynote on Culturomics: Quantitative analysis of culture using millions of digitized books
Erez Lieberman-Aiden & JB Michel from Harvard have been working on the Google Books corpus and developed the Google N-Gram viewer as a result.

They talked about reading practices in research. We can read a few books very carefully or we can read a lot of books algorithmically. They have been thinking about cultural evolution and change. They realized they could look at language change. They showed how irregular verbs tend to regularize over time, especially if they are used less frequently.

They showed some very interesting graphs of the uptake of inventions (how is "radio" talked about after its invention?).
June 10, 2011, at 11:32 AM by 205.201.247.78 -
Changed line 134 from:
Our datasets are comparable.
to:
Our datasets show that the humanities have comparable if not larger collections. Our data is also messier and more interesting.
June 10, 2011, at 11:29 AM by 205.201.247.78 -
Added lines 122-134:
John reflected on large data and the deluge of humanities data coming. Compare these big science projects:

* Human genome: 3 GB
* Hubble space telescope: 0.5 TB/year
* Sloan digital sky survey: 16 TB

To these one can compare some humanities projects:

* DASS audio sampler: 350 GB
* Year of Speech: >1 TB
* Beazley Archive of ancient Artifacts: 25 TB

Our datasets are comparable.
June 10, 2011, at 11:22 AM by 205.201.247.78 -
Changed lines 120-121 from:
John Coleman talked about the [[http://www.phon.ox.ac.uk/mining/  | Mining a Year of Speech]] project, which dealt with the challenge of handling very large audio corpora. An audio corpus is going to be hundreds of times bigger than a corresponding annotated text corpus. John talked about the challenge of linking the audio to annotations by various types of people.
to:
John Coleman talked about the [[http://www.phon.ox.ac.uk/mining/  | Mining a Year of Speech]] project, which dealt with the challenge of handling very large audio corpora. An audio corpus is going to be hundreds of times bigger than a corresponding annotated text corpus. John talked about the challenge of linking the audio to annotations by various types of people. They think of their year's collection as a grove of corpora (where each corpus is a tree).
June 10, 2011, at 11:20 AM by 205.201.247.78 -
Changed lines 120-121 from:
John Coleman talked about the [[http://www.phon.ox.ac.uk/mining/  | Mining a Year of Speech]] project, which dealt with the challenge of handling very large audio corpora.
to:
John Coleman talked about the [[http://www.phon.ox.ac.uk/mining/  | Mining a Year of Speech]] project, which dealt with the challenge of handling very large audio corpora. An audio corpus is going to be hundreds of times bigger than a corresponding annotated text corpus. John talked about the challenge of linking the audio to annotations by various types of people.
June 10, 2011, at 11:15 AM by 205.201.247.78 -
Changed lines 113-115 from:
Tom talked about the role of the humanities. One role is to bring critical voices that question the bullshit. Another role is to talk about governance.

to:
Tom talked about the role of the humanities. One role is to bring critical voices that question the bullshit. Another role is to talk about governance. Another is to think about how the media are being changed. The cloud drives disintermediation. There is no longer a single big media channel for businesses to use to get to everyone.

An interesting fact he mentioned is the world-wide explosion of rules and regulations. Large corporations need to deal with these rules everywhere, which can be a nightmare.

The impact of the cloud is probably slowing down. Now is when the social and human innovations will start to kick in. He ended by talking about the Waterloo campus at Stratford where they are developing programs that teach technology, creative arts, and business together.  They are also putting on an annual conference, Canada 3.0.

!!  Mining a Year of Speech
John Coleman talked about the [[http://www.phon.ox.ac.uk/mining/  | Mining a Year of Speech]] project, which dealt with the challenge of handling very large audio corpora.
June 10, 2011, at 10:50 AM by 205.201.247.78 -
Changed lines 113-115 from:
to:
Tom talked about the role of the humanities. One role is to bring critical voices that question the bullshit. Another role is to talk about governance.

June 10, 2011, at 10:47 AM by 205.201.247.78 -
Changed lines 102-113 from:
He redeployed a question from before: "Is it not art?" While that was asked of visualizations before in a sarcastic fashion, Steve asked it again with respect. Is what we are doing, telling new stories, an art?
to:
He redeployed a question from before: "Is it not art?" While that was asked of visualizations before in a sarcastic fashion, Steve asked it again with respect. Is what we are doing, telling new stories, an art?

!! Tom Jenkins: Bringing Humanity to Data to Create Meaning
Chad Gaffield introduced Tom Jenkins from Open Text. Tom is the Executive Chairman and Chief Strategy Officer of Open Text Corporation, which evolved out of the New Oxford English Dictionary project at Waterloo. He now sits on the SSHRC Council. Tom talks eloquently about the importance of the humanities in the information revolution.

Tom distinguished between tool makers and tool users. The STEM community are the tool makers. The social sciences, humanists, and artists are the users. One wants to have both in society.

Tom then talked about the beginnings of Open Text and how they dominated the web search business for about 3 years. Most of the web is now behind firewalls (the dark or deep web.) Open Text builds technologies for the deep web.

He then switched to talking about the impact of the cloud. He argued that Web 3.0 is the move from the social cloud to the semantic web. The cloud is making rich mobile and social devices available. We are amazed when we first access them and then a year later they are outdated. Tom argued that the amazing thing about the shift from Web 1.0 to Web 2.0 is the rise of Facebook. We are social animals so it makes sense that Facebook would challenge Google. He talked about the tension between transparency (Facebook) and privacy (Wikileaks). This tension is a social science and ethics issue, not a technology issue.

June 10, 2011, at 09:31 AM by 205.201.247.78 -
Changed lines 94-102 from:
There were questions about this idea of double use in cases where there isn't more than one (universe or Shakespeare). You can only really test results against multiple other phenomena when there is more than one. David seemed to think that hiding data from yourself is a way to have control sets. This seems artificial. Once you have adapted your hypothesis to fit the full dataset you are back into the same situation of using one instance to form a theory.
to:
There were questions about this idea of double use in cases where there isn't more than one (universe or Shakespeare). You can only really test results against multiple other phenomena when there is more than one. David seemed to think that hiding data from yourself is a way to have control sets. This seems artificial. Once you have adapted your hypothesis to fit the full dataset you are back into the same situation of using one instance to form a theory.

!! Data Mining with Criminal Intent
I was part of the presentation on the [[http://criminalintent.org | Criminal Intent]] project so I

!!! Respondent: Stephen Ramsay
Steve reminded us of the history of text analysis and visualization. He recalled our call to have more playful experimentation. Steve talked about how we are indebted to science in the project. He drew attention to how we argued that we would use scientific tools to tell new stories. Human stories are what the digital humanities are about.

He redeployed a question from before: "Is it not art?" While that was asked of visualizations before in a sarcastic fashion, Steve asked it again with respect. Is what we are doing, telling new stories, an art?
June 10, 2011, at 08:27 AM by 205.201.247.78 -
Changed line 94 from:
There were questions about this idea of double use in cases where there isn't more than one (universe or Shakespeare). You can only really test results against multiple other phenomena when there is more than one. David seemed to think that hiding data from yourself is a way to have control sets. This seems artificial. Once you have adapted your hypothesis to fit the full dataset you are back into the same situation of using one instance to form a theory.
to:
There were questions about this idea of double use in cases where there isn't more than one (universe or Shakespeare). You can only really test results against multiple other phenomena when there is more than one. David seemed to think that hiding data from yourself is a way to have control sets. This seems artificial. Once you have adapted your hypothesis to fit the full dataset you are back into the same situation of using one instance to form a theory.
June 10, 2011, at 08:23 AM by 205.201.247.78 -
Changed lines 92-94 from:
David then talked of the danger of exploratory tools. He gave the example of the theory of continental drift that double uses the data (where the data used to generate a hypothesis is then used to prove it). We need to build tools that only show a subset of data so that you can then test on the full dataset.
to:
David then talked of the danger of exploratory tools. He gave the example of the theory of continental drift that double uses the data (where the data used to generate a hypothesis is then used to prove it). We need to build tools that only show a subset of data so that you can then test on the full dataset.

There were questions about this idea of double use in cases where there isn't more than one (universe or Shakespeare). You can only really test results against multiple other phenomena when there is more than one. David seemed to think that hiding data from yourself is a way to have control sets. This seems artificial. Once you have adapted your hypothesis to fit the full dataset you are back into the same situation of using one instance to form a theory.
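The hold-out idea David describes is easy to sketch (my own toy illustration, not anything the team built): explore a random subset, then confirm on the portion the tool never showed you.

```python
# Sketch of hiding part of a dataset from yourself: hypotheses formed on
# the exploration set can then be tested on the held-back confirmation set.
import random

def split_holdout(items, explore_fraction=0.5, seed=42):
    """Deterministically split data into an exploration set and a hidden
    confirmation set."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * explore_fraction)
    return shuffled[:cut], shuffled[cut:]

documents = [f"doc{i}" for i in range(10)]
explore, confirm = split_holdout(documents)
```

The objection raised in the discussion still applies: once a hypothesis has been revised against the confirmation set, that set is no longer an independent test.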
June 10, 2011, at 08:17 AM by 205.201.247.78 -
Changed lines 79-92 from:
Steve Downie talked about MIREX - the music information retrieval challenge and exchange. They agree on challenges and then compete to generate the best algorithms. The algorithms are then run on a large music database to compare them. Thus Steve's team was able to compare segmentation algorithms against the ground truth.
to:
Steve Downie talked about MIREX - the music information retrieval challenge and exchange. They agree on challenges and then compete to generate the best algorithms. The algorithms are then run on a large music database to compare them. Thus Steve's team was able to compare segmentation algorithms against the ground truth. He showed a visualization/sonification that compares the ground truth segmentation to the different algorithms.
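A toy version of scoring an algorithm against ground-truth segment boundaries (my own invention, much simpler than the actual MIREX metrics): count true boundaries matched by a prediction within a small time tolerance.

```python
# Compare a segmentation algorithm's predicted boundaries (in seconds)
# against hand-annotated ground truth, with a matching tolerance.
# The boundary times below are invented for illustration.

def boundary_hit_rate(predicted, truth, tolerance=0.5):
    """Fraction of true boundaries matched by some prediction within
    `tolerance` seconds."""
    hits = sum(any(abs(p - t) <= tolerance for p in predicted)
               for t in truth)
    return hits / len(truth)

truth = [10.0, 35.0, 62.0, 90.0]       # annotated section boundaries
predicted = [10.2, 36.1, 61.8, 120.0]  # one algorithm's output

rate = boundary_hit_rate(predicted, truth)
```

Running every submitted algorithm against the same ground truth with a metric like this is what lets a shared evaluation rank them.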

!!! Respondent: David Huron
David talked about what has happened in Genetics. They have repeatedly hired people from very different backgrounds in order to tackle the key questions. They have hired computer scientists to bring new thinking into the field.

The humanities have lost several disciplines, like psychology and linguistics, to science. The questions in linguistics have stayed the same, but the methods have changed.

Disciplines should be defined by questions, not practices. If we define the humanities as a set of close reading and interpretative practices then it will never change. If we think of it as the disciplines addressing questions about the human and human expression then we should be willing to adopt new methods and hire people out of new areas.

David went on to talk about how tools need audiences. Good tools don't succeed on their own. You need to build an audience. Focus on the questions and discovery not technology. Email in the late 1970s was useless because few others had it. Now it is useful because everyone you want to correspond with has it. (Of course the spammers have also found a way to make it problematic again.)

In the arts and humanities we supposedly put a premium on community and human interaction and yet, paradoxically, we don't do it. David argued for collaborative practices. Central to collaboration is confessing ignorance. In the humanities we don't dare confess to not knowing something. We overvalue the pedantry of knowing everything (or pretending to). We socialize our students to mask their ignorance.

David then talked of the danger of exploratory tools. He gave the example of the theory of continental drift that double uses the data (where the data used to generate a hypothesis is then used to prove it). We need to build tools that only show a subset of data so that you can then test on the full dataset.
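David's suggestion amounts to a holdout split: form your hypothesis on one portion of the data and test it on the portion you hid from yourself. A minimal sketch of the idea (the function name and 50/50 split are mine, not David's):

```python
import random

def holdout_split(records, explore_fraction=0.5, seed=42):
    """Shuffle records and split them into an exploration set
    (for forming hypotheses) and a held-out set (for testing them).
    A fixed seed keeps the split reproducible."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * explore_fraction)
    return shuffled[:cut], shuffled[cut:]

explore, held_out = holdout_split(range(100))
```

A tool that only ever showed you the `explore` half during browsing would build David's discipline directly into the interface.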
June 10, 2011, at 07:47 AM by 205.201.247.78 -
Added lines 76-79:

David talked about the interesting relationship with the broader community that is interested in music.  It strikes me that the musical community rivals the genealogists in their engagement in citizen research.

Steve Downie talked about MIREX - the music information retrieval challenge and exchange. They agree on challenges and then compete to generate the best algorithms. The algorithms are then run on a large music database to compare them. Thus Steve's team was able to compare segmentation algorithms against the ground truth.
June 10, 2011, at 07:41 AM by 205.201.247.78 -
Changed lines 68-71 from:
Alastair talked about what is happening in the UK and what JISC is supporting. He started with the work from UCL on Log Analysis

Splashes and Ripples
showed significant improvements.
to:
Alastair talked about what is happening in the UK and what JISC is supporting. He started with studies about utilization starting with the work from UCL on Log Analysis (the [[http://www.ucl.ac.uk/infostudies/research/circah/lairah/ | LAIRAH project]]) that showed disappointing use of online resources. This 2006 study was followed by [[http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1846535 | Splashes and Ripples]] (2011) which showed significant improvements.
Changed lines 73-75 from:
Ichiro showed a screencast of a PhD student doing the annotating.
to:
Ichiro showed a screencast of a PhD student doing the annotating. It was impressive to see the student parsing a rock song.

David De Roure talked about the structured analysis that they ran on the deluge of data available. Using the student-sourced ground truth they could create a linked data repository to support scholars in a sustainable way. The idea is to share standardized linked data so that it can be mashed with other data. Their philosophy is that the web is a content management system and their website is an API. To do this they developed an ontology of music segmentation.
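They didn't walk through the ontology itself, but as a rough illustration, a segmentation published as linked data boils down to subject-predicate-object triples like these (the `ex:` vocabulary and all the URIs are invented for the sketch; a real deployment would use the project's published ontology):

```python
def segmentation_to_triples(track_uri, segments):
    """Turn (label, start, end) segments into simple RDF-style triples.

    segments: list of (label, start_seconds, end_seconds) tuples.
    Returns a list of (subject, predicate, object) triples.
    """
    triples = []
    for i, (label, start, end) in enumerate(segments):
        seg = f"{track_uri}/segment/{i}"
        triples.append((track_uri, "ex:hasSegment", seg))
        triples.append((seg, "ex:label", label))
        triples.append((seg, "ex:beginsAt", start))
        triples.append((seg, "ex:endsAt", end))
    return triples

triples = segmentation_to_triples(
    "http://example.org/track/1",
    [("intro", 0.0, 12.5), ("verse", 12.5, 40.0)])
```

Because each segment gets its own URI, anyone else's data about the same track can link to it - which is the "website as API" point.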
June 10, 2011, at 07:25 AM by 205.201.247.78 -
Added lines 64-75:

! Day 2

!! Alastair Dunning, JISC
Alastair talked about what is happening in the UK and what JISC is supporting. He started with the work from UCL on Log Analysis

Splashes and Ripples showed significant improvements.

!! Structural Analysis of Large Amounts of Musical Information (SALAMI)
Ichiro Fujinaga started talking about the SALAMI project which gathered "ground truth" data about what educated listeners would think is the structure of a musical work. This is to test music recognition algorithms. They double-key annotated 1000 songs of various genres. They used Sonic Visualizer (an open source tool) for the annotation.

Ichiro showed a screencast of a PhD student doing the annotating.
June 09, 2011, at 03:08 PM by 205.201.247.78 -
June 09, 2011, at 03:07 PM by 205.201.247.78 -
June 09, 2011, at 02:57 PM by 205.201.247.78 -
Changed lines 56-57 from:
Chris asked why you can't manipulate data in a visualization? It is essentially a browser, so why can't we annotate, manipulate, and cycle stuff back into the visualization. All data is annotation. So Chris has come up with "ampliation" which is annotation and interpretation.
to:
Chris asked why you can't manipulate data in a visualization. It is essentially a browser, so why can't we annotate, manipulate, and cycle stuff back into the visualization? All data is annotation. So Chris has come up with "ampliation", which is annotation and interpretation. It means to enlarge and extend.

He showed some demos of very cool visual tools that tended to combine multiple panels that interact much like [[http://voyeurtools.org | Voyeur]] does, but customized for the data. See his [[http://www.cs.ou.edu/~weaver/improvise/index.html | Improvise]] site for the tools and applications.

Then they presented a visualization design that was developed by a design student from Milan. This allows one to layer filters into a query that controls a visualization.

!!! Respondent: Stephen Nichols
Stephen addressed the issue of how we deal with correspondence projects. Letters can be trivial or important. One needs to "ampliate" the data with context and interpretation.
June 09, 2011, at 02:34 PM by 205.201.247.78 -
Changed lines 51-52 from:

to:
!! Digging into the Enlightenment
Dan Edelstein and Chris Weaver presented on their project visualizing enlightenment correspondence. It is one of the [[https://republicofletters.stanford.edu/ | Mapping the Republic of Letters]] projects.

Chris Weaver talked about visualization. He argued that viz is not just a static artefact - it is a verb too. It is a process. He also argued that a visualization is not a representation. He recommended the book Illuminating the Path, which talks about visual analytics, which he feels is a methodology that crosses disciplines.

Chris asked why you can't manipulate data in a visualization? It is essentially a browser, so why can't we annotate, manipulate, and cycle stuff back into the visualization. All data is annotation. So Chris has come up with "ampliation" which is annotation and interpretation.

June 09, 2011, at 01:16 PM by 205.201.247.78 -
Changed lines 49-52 from:
Jennifer was the respondent. She asked about the value of speech (audio) datasets (over all the text available.) Prosody research is one case of where you need audio. She argued that the Harvesting project has reduced the costs for researchers who need access to such audio datasets.


to:
Jennifer was the respondent. She asked about the value of speech (audio) datasets over all the text available. Prosody research is one case where you need audio. She argued that the Harvesting project has reduced the costs for researchers who need access to such audio datasets, though it has problems.


June 09, 2011, at 01:14 PM by 205.201.247.78 -
Changed lines 49-50 from:
Jennifer was the respondent. She asked about the value of speech (audio) datasets (over all the text available.) Prosody research is one case of where you need audio.
to:
Jennifer was the respondent. She asked about the value of speech (audio) datasets (over all the text available.) Prosody research is one case of where you need audio. She argued that the Harvesting project has reduced the costs for researchers who need access to such audio datasets.


June 09, 2011, at 01:10 PM by 205.201.247.78 -
Changed lines 48-49 from:

to:
!!! Respondent: Jennifer Cole
Jennifer was the respondent. She asked about the value of speech (audio) datasets (over all the text available.) Prosody research is one case of where you need audio.

June 09, 2011, at 12:56 PM by 205.201.247.78 -
Changed lines 46-49 from:
Michael ? talked about trying to validate the outcomes from the study of harvested clips. You can use labs, but language behaviour in the lab is sometime different from that generated spontaneously.


to:
Michael Wagner talked about trying to validate the outcomes from the study of harvested clips. You can use labs, but language behaviour in the lab is sometimes different from that generated spontaneously. Harvested (and spontaneous) experiments and lab experiments can complement each other.


June 09, 2011, at 12:54 PM by 205.201.247.78 -
Changed lines 46-49 from:
Michael ? talked about trying to validate the outcomes from the study of harvested clips.


to:
Michael ? talked about trying to validate the outcomes from the study of harvested clips. You can use labs, but language behaviour in the lab is sometime different from that generated spontaneously.


June 09, 2011, at 12:53 PM by 205.201.247.78 -
Changed lines 46-47 from:

to:
Michael ? talked about trying to validate the outcomes from the study of harvested clips.


June 09, 2011, at 12:49 PM by 205.201.247.78 -
Changed lines 37-38 from:
Is representing a change a form of history. Information visualization is a hot topic. He is critical of the rhetoric of the digital humanities. What does it include and exclude? Why the focus on visualization? Peter is interested with understanding and causation. Visualization is not understanding, though it might lead to it? Understanding is more than seeing patterns and shapes. Causation can't be seen, only inferred. Regression analysis is rarely discussed in the digital humanities.
to:
Is representing a change in the form of history? Information visualization is a hot topic, but what does it show? Peter is critical of the rhetoric of the digital humanities, especially around visualization. What does it include and exclude? Why the focus on visualization? Peter is interested in understanding and causation. Visualization is not understanding, though it might lead to it? Understanding is more than seeing patterns and shapes. Causation can't be seen, only inferred. (I would counter that causation can be represented in text or visualization.) Regression analysis is rarely discussed in the digital humanities even though it is a way of testing causal inferences.
Changed lines 42-47 from:
Linguistics has been going from armchair experiments (sit in an armchair and ask yourself how you would say it) to lab experiments (record others saying it) to large web experiments (harvest lots of examples of others saying it.)

There is a machine learning component to this project
.


to:
Linguistics has been going from armchair experiments (sit in an armchair and ask yourself how you would say it) to lab experiments (record others saying it) to large web experiments (harvest lots of examples of others saying it). The web harvesting, however, takes hand checking by an RA.

There is a machine learning component to this project. Jonathan Hull talked about the classification experiment. Too complex to describe (I have to listen carefully), but fascinating.


June 09, 2011, at 12:34 PM by 205.201.247.78 -
Changed lines 42-43 from:

to:
Linguistics has been going from armchair experiments (sit in an armchair and ask yourself how you would say it) to lab experiments (record others saying it) to large web experiments (harvest lots of examples of others saying it.)

There is a machine learning component to this project.


June 09, 2011, at 12:32 PM by 205.201.247.78 -
Changed lines 10-11 from:
The conversations circled around some of the cultural challenges of the digital humanities and how Digging could make a difference. We discussed the relationship between projects like those being supported by Digging and the "traditional" humanities. Is big data transformative or the return of old approaches quantified?
to:
The conversations circled around some of the cultural challenges of the digital humanities and how Digging could (or should) make a difference. We discussed issues of credit, team work, and the relationship between projects like those being supported by Digging and the "traditional" humanities. Is big data transformative or the return of old approaches quantified and on another scale?
June 09, 2011, at 12:30 PM by 205.201.247.78 -
Changed lines 1-2 from:
I'm at the [[http://diggingintodata.rg | Digging Into Data Challenge]] conference that is bringing together the investigators of the first round of the Challenge. What follows are my conference notes. They are therefore incomplete and rough.
to:
I'm at the [[http://diggingintodata.org | Digging Into Data Challenge]] conference that is bringing together the investigators of the first round of the Challenge. I was the Canadian lead on the [[http://criminalintent.org | Datamining with Criminal Intent]] project. What follows are my conference notes. They are therefore incomplete and rough.
Changed lines 40-43 from:
Mats Rooth started the Harvesting presentation which takes advantage of the fact that you can search the internet for strings and get sites that have audio. This allows them to harvest audio for linguistic research. There are lots of media sites that will give you the time offset for an audio passage and let you download the audio. This lets them be harvested for hundreds or thousands of tokens for a word sequence.


to:
Mats Rooth started the Harvesting presentation which takes advantage of the fact that you can search the internet for strings and get sites that have audio. This allows them to harvest audio for linguistic research. There are lots of media sites that will give you the time offset for an audio passage and let you download the audio. This lets them harvest hundreds or thousands of tokens for a word sequence. He gave as an example audio clips of people saying "than I did".
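As a hypothetical sketch of the harvesting step: given per-word timestamps for a transcript (which some media sites expose), you can locate every occurrence of a phrase like "than I did" and compute the clip window to download. All the names and data below are invented for illustration:

```python
def find_phrase_windows(timed_words, phrase, padding=0.5):
    """Find clip windows for each occurrence of a phrase.

    timed_words: list of (word, start_seconds) in transcript order.
    Returns a list of (start, end) windows, padded on both sides.
    """
    target = phrase.lower().split()
    words = [w.lower() for w, _ in timed_words]
    windows = []
    for i in range(len(words) - len(target) + 1):
        if words[i:i + len(target)] == target:
            start = timed_words[i][1]
            end = timed_words[i + len(target) - 1][1]
            windows.append((max(0.0, start - padding), end + padding))
    return windows

# A fragment of a timed transcript for one harvested audio file.
transcript = [("taller", 10.0), ("than", 10.4), ("I", 10.6), ("did", 10.8)]
windows = find_phrase_windows(transcript, "than I did")
```

Each `(start, end)` window would then be used to request just that slice of audio from the hosting site, which is how a few searches can yield hundreds of tokens.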


June 09, 2011, at 12:26 PM by 205.201.247.78 -
Added lines 39-43:
!! Harvesting Speech Datasets for Linguistic Research on the Web
Mats Rooth started the Harvesting presentation which takes advantage of the fact that you can search the internet for strings and get sites that have audio. This allows them to harvest audio for linguistic research. There are lots of media sites that will give you the time offset for an audio passage and let you download the audio. This lets them be harvested for hundreds or thousands of tokens for a word sequence.


June 09, 2011, at 12:12 PM by 205.201.247.78 -
Changed lines 33-34 from:
Peter Baskerville was the respondent on the Railroads project. He pointed out how hard it is review changing targets like scholarly datasets. He commended the authors for their frankness on the tedious work of structuring databases. We may be overwhelmed by an abundance of data, but we have to avoid being drowned. We have to find ways to integrate and crosswalk data.
to:
Peter Baskerville was the respondent on the Railroads project. He pointed out how hard it is to review changing targets like scholarly datasets. He commended the authors for their frankness on the tedious work of structuring databases. We may be overwhelmed by an abundance of data, but we have to avoid being drowned. We have to find ways to integrate and crosswalk data. The assumption of abundance might close down interesting research that adds to the stockpile of abundant data.

Peter talked about the mistake of calling for a revolution. Traditional humanities work like data collection, careful coding, and interpretation is still needed. Big data needs both human and automated handling. It is not a revolution, but leads to a cyborg of human and automated method.

Is representing a change a form of history. Information visualization is a hot topic. He is critical of the rhetoric of the digital humanities. What does it include and exclude? Why the focus on visualization? Peter is interested with understanding and causation. Visualization is not understanding, though it might lead to it? Understanding is more than seeing patterns and shapes. Causation can't be seen, only inferred. Regression analysis is rarely discussed in the digital humanities.

June 09, 2011, at 12:00 PM by 205.201.247.78 -
Added lines 32-34:
!!! Respondent: Peter Baskerville
Peter Baskerville was the respondent on the Railroads project. He pointed out how hard it is review changing targets like scholarly datasets. He commended the authors for their frankness on the tedious work of structuring databases. We may be overwhelmed by an abundance of data, but we have to avoid being drowned. We have to find ways to integrate and crosswalk data.

June 09, 2011, at 11:57 AM by 205.201.247.78 -
Changed lines 30-31 from:
Will Thomas closed this presentation talking about how he is using the tool to help him read newspapers differently. He stressed how we can now visualize data spatially. Will compared his project to one from the 70s/80s that focused on railroads in Vermont. That project was the work of a solitary scholar and it left behind a PDF of a table of data. The Railroads project is a team project that has far more data and it is presenting it back in multiple ways. Will talked about having students learn through enrichming the data. Will argued that we now face a social change in disciplines like history of recognizing team work.
to:
Will Thomas closed this presentation talking about how he is using the tool to help him read newspapers differently. He stressed how we can now visualize data spatially. Will compared his project to one from the 70s/80s that focused on railroads in Vermont. That project was the work of a solitary scholar and it left behind a PDF of a table of data. The Railroads project is a team project that has far more data and it is presenting it back in multiple ways. Will talked about having students learn through enriching the data. Will argued that we now face a social change in disciplines like history toward recognizing team work. Historians until recently used space for illustration and they used the power of narrative to bridge gaps. Now the interactive map is reshaping scholarly practice.
June 09, 2011, at 11:55 AM by 205.201.247.78 -
Changed lines 30-31 from:
Will Thomas closed this presentation talking about how he is using the tool to help him read newspapers differently.
to:
Will Thomas closed this presentation talking about how he is using the tool to help him read newspapers differently. He stressed how we can now visualize data spatially. Will compared his project to one from the 70s/80s that focused on railroads in Vermont. That project was the work of a solitary scholar and it left behind a PDF of a table of data. The Railroads project is a team project that has far more data and it is presenting it back in multiple ways. Will talked about having students learn through enrichming the data. Will argued that we now face a social change in disciplines like history of recognizing team work.
June 09, 2011, at 11:48 AM by 205.201.247.78 -
Changed lines 30-31 from:

to:
Will Thomas closed this presentation talking about how he is using the tool to help him read newspapers differently.
June 09, 2011, at 11:44 AM by 205.201.247.78 -
Changed lines 26-31 from:
An important part of the project was data integration, standardization, and quality. They developed a series of online case studies built on Aurora software.
to:
An important part of the project was data integration, standardization, and quality. They had to structure data to fit it together. They then developed a series of online case studies that draw on a data warehouse. Richard and others focused on many of the data difficulties faced. This is a problem in the humanities where each dataset is unique and hard to merge with others. On the other hand the data is robust and meaningful. This is one of the challenges of large scale digital humanities - we have deep data that is about real people, but it is too complex for the simple mining tools from other fields.

The [[http://auroraproject.unl.edu/whitepaper.html | Aurora Project]] is the underlying software infrastructure they built for this project. The idea is that "apps" can be built on top of Aurora for particular research questions. Aurora seems to be a sort of scholarly middleware.


June 09, 2011, at 11:21 AM by 205.201.247.78 -
Changed lines 24-26 from:
Why railroads? Early global process that transformed landscape and economy of the US. They are studying the effects of new and global infrastructure on people, environment and economic work.
to:
Why railroads? Early global process that transformed landscape and economy of the US. They are studying the effects of new and global infrastructure on people, environment and economic work. 

An important part of the project was data integration, standardization, and quality. They developed a series of online case studies built on Aurora software.
June 09, 2011, at 11:19 AM by 205.201.247.78 -
Changed lines 19-24 from:
Not having a lot of funding Brett and others developed the idea of making it a contest or challenge. The competition was so popular that for the second round there are now 8 agencies supporting the Challenge.
to:
Not having a lot of funding, Brett and others developed the idea of making it a contest or challenge. The competition was so popular that for the second round there are now 8 agencies supporting the Challenge.

!! Railroads and the Making of Modern America
Richard Healey started the project presentations on the [[http://railroads.unl.edu/ | Railroads and the Making of Modern America project]].

Why railroads? Early global process that transformed landscape and economy of the US. They are studying the effects of new and global infrastructure on people, environment and economic work.
June 09, 2011, at 11:12 AM by 205.201.247.78 -
Changed lines 17-19 from:
# What can funders do to encourage the development of new methods on large data sets?
to:
# What can funders do to encourage the development of new methods on large data sets?

Not having a lot of funding Brett and others developed the idea of making it a contest or challenge. The competition was so popular that for the second round there are now 8 agencies supporting the Challenge.
June 09, 2011, at 11:09 AM by 205.201.247.78 -
Changed lines 1-2 from:
I'm at the [[http://diggingintodata.rg Digging Into Data Challenge]] conference that is bringing together the investigators of the first round of the Challenge. What follows are my conference notes. They are therefore incomplete and rough.
to:
I'm at the [[http://diggingintodata.rg | Digging Into Data Challenge]] conference that is bringing together the investigators of the first round of the Challenge. What follows are my conference notes. They are therefore incomplete and rough.
Changed lines 8-17 from:
* Encouraging research mashups. Another benefit of Digging is that it encouraged established projects to interoperation. The project Im on (Datamining with Criminal Intent) built interoperability between the Old Bailey project, Zotero and Voyeur.
to:
* Encouraging research mashups. Another benefit of Digging is that it encouraged established projects to interoperate. The project I'm on (Datamining with Criminal Intent) built interoperability between the Old Bailey project, Zotero and Voyeur.

The conversations circled around some of the cultural challenges of the digital humanities and how Digging could make a difference. We discussed the relationship between projects like those being supported by Digging and the "traditional" humanities. Is big data transformative or the return of old approaches quantified?

!! Brett Bobley: Introduction
Brett started the event talking about some of the questions and challenges that motivated the Digging Into Data Challenge.

# Our ability to digitize materials has outstripped our methods for analyzing them.
# Simply having data for reading is not enough - we need computational access to run new methods.
# What can funders do to encourage the development of new methods on large data sets?
June 09, 2011, at 10:00 AM by 205.201.247.78 -
Changed lines 3-4 from:
Before the conference proper started we had a meeting with CLIR who are evaluating the programme.
to:
Before the conference proper started we had a meeting with CLIR who are evaluating the programme. Some of the points that resonated with me are:
June 09, 2011, at 09:59 AM by 205.201.247.78 -
Added lines 1-8:
I'm at the [[http://diggingintodata.rg Digging Into Data Challenge]] conference that is bringing together the investigators of the first round of the Challenge. What follows are my conference notes. They are therefore incomplete and rough.

Before the conference proper started we had a meeting with CLIR who are evaluating the programme.

* Gender representation is an issue. In the Challenge, and in the digital humanities in general, we need to work harder to involve women researchers, especially as leaders. We run the risk of DH being seen as the last bastion of old men in the humanities.
* Representation by new scholars is also an issue. The Challenge should bring together graduate students and new faculty; they need to be encouraged to meet up and they need the validation of attention from the research councils.
* Supporting international research. One of the innovations of Digging is that it has one review process that crossed national boundaries. If your project was approved all the national partners got funded. We should see this model generalized beyond the digital humanities.
* Encouraging research mashups. Another benefit of Digging is that it encouraged established projects to interoperation. The project Im on (Datamining with Criminal Intent) built interoperability between the Old Bailey project, Zotero and Voyeur.
Page last modified on June 10, 2011, at 03:54 PM - Powered by PmWiki
