
Text Mining The Novel

These are my notes on the 4th year meeting of the Text Mining the Novel project.

Note: as always, this is written live and full of problems.

Thursday, October 26th

Ted Underwood: Proposal for a starter set of English-language fiction 1700-2007

Ted started us off with a discussion of what a starter set of texts for exploring text mining of novels might look like. See his proposal here.

He imagined three sets:

  • a fiction set
  • a de-duplicated fiction set
  • a smaller, manageable and cleaned set of about 3000 volumes

The problem is what sort of selection criteria we should be talking about for the texts:

  • Canon
  • Best sellers
  • Those based on newspapers
  • Representative text

We need to be able to define a subset so that people have something to explore, but what can we provide without appearing to have a bias? We don't want a canon, but rather something users have to pick from, which makes them think.

Susan Brown: Linked Data in Relation to Access and Revised, Computationally Informed Collection

Susan Brown is working on a linked open dataset on British women authors. She wants to situate her work in relation to the issues around access. She believes that we need to work harder if we want a new history of the novel through quantitative means. Ian Watt in The Rise of the Novel recognized the need for qualitative methods to supplement quantitative methods.

She asked what sort of linked open data infrastructure would be needed to study the novel beyond what they are doing with Orlando. She talked about the LINCS project, which imagines a triple store of linked data about the things that matter to humanists. There would be data in the store and tools to get new data into the store. There would be interfaces to explore the store.

We are coming to an infrastructure moment. Bratton has argued in The Stack for the need for distributed infrastructure. Alan Liu argues that infrastructure is now ideology.

When we think of the infrastructure that we use, it is often social: things like WordPress. Linked open data lets projects stay in silos, but still share their metadata.

She showed HuViz which looks engaging and inviting.

Mark Algee-Hewitt: In(compatibilities): Making Meaning from Multiple Corpora

Mark talked about corpora at the Stanford Literary Lab. They are trying to gather them and to make them interoperable. They've run into fundamental incompatibilities of size, period, assembly methods, and so on. What do we do about this? Why does it matter?

  • For small groups just starting there are no starting sets.
  • There is a lack of large reliable corpora.
  • There is a question of representativeness in corpora.

He showed great sets of ngram graphs that seem to show a clear trend over time, but when you disaggregate the corpora you find problems at the intersections of corpora. They don't match up where there is an overlap. He also showed a problem with signals being overwhelmed by size.

We have had a generation of boutique projects that collected small highly selective corpora. Now we want to join them all to break down the silos, but the reality is that they don't aggregate well because of incompatibilities. This even happens in merged corpora from the same player like the Artemis tools in Gale.

To some extent text mining tells us about how corpora are selected.

Day 2: Friday, October 27th

Laura Mandell: OCR

Laura talked about the challenges of getting good OCRed text, especially from 18th century books. She did a test on "circumstantial information" using different systems on a set of difficult texts.

Lisa Teichmann: High School Reading Lists

Lisa talked about school reading lists and banned books. Books that are challenged change over time. How do reading lists compare to other reading lists? She had an Austrian high school reading list and a Turkish list. The two lists didn't overlap except for the world literature. Both are mostly male and mostly non-contemporary. She wanted to see how these lists overlapped with other lists of prestige literature.
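A minimal sketch (mine, not Lisa's method) of how one might measure the overlap between two reading lists, assuming each list has been reduced to normalized title strings. The `jaccard` helper and the placeholder titles are hypothetical; they are not the lists from the talk.

```python
def jaccard(list_a, list_b):
    """Share of distinct titles that appear on both lists (0.0 to 1.0)."""
    a, b = set(list_a), set(list_b)
    if not a and not b:
        return 0.0  # two empty lists: define the overlap as zero
    return len(a & b) / len(a | b)


# Placeholder titles only (assumption): two shared "world literature" items
# plus items unique to each national list.
austrian = ["world_classic_1", "world_classic_2", "austrian_title_1"]
turkish = ["world_classic_1", "world_classic_2",
           "turkish_title_1", "turkish_title_2"]

shared = sorted(set(austrian) & set(turkish))
print(shared)                      # titles on both lists
print(jaccard(austrian, turkish))  # 2 shared out of 5 distinct titles
```

In practice the hard part is matching titles across languages and editions before any set comparison makes sense.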

Anatoly Detwyler: The People's Literature: Identifying a New Canon of Modern Chinese Literature

Anatoly talked about developing a corpus for modern Chinese literature. This would lower the threshold for DH work. These could be used to study modern Chinese lit or as a control set.

How to create a robust data-driven list? He looked at a number of lists. There are important anthologies. He also used WorldCat's search API to find works reprinted at least twice. He then culled things himself. This ended in a list of 290. This covers about 90% of what is on university reading lists.

Hoyt Long: Anthologizing Modern Japanese Literature

Hoyt talked about the problems of canon formation. He then talked about a Japanese project to build a corpus from anthologies. They ended up with a list of 139 works. Hoyt is working with a catalogue of 600K total entries from 1,260 anthologies. After filtering you get 42K entries. He then tried different ways of ranking these using citation index algorithms.

There is a Japanese version of the Gutenberg project that has gathered statistics on the usage of their site.

He showed some comparison rankings based on different measures.

He then had some great questions about whether we can identify works that lost attention and so on.

Matt Erlin and Douglas Knox: How to Do Things with VIAF Clusters

Matt talked about a project to look at world literature, which he defined as works that circulate beyond their country of origin (usually in translation). There is a resource out there, the VIAF (Virtual International Authority File), which aggregates information about works with links to "expressions" of a work (mostly translations). One can then rank works by the number of expressions. Hamlet ranks first. He talked about different ways of sorting the data he got from VIAF.
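The ranking step can be sketched roughly as follows, assuming the VIAF data has already been flattened into (work, expression) pairs. The record shape, the identifiers and the `rank_by_expressions` helper are all hypothetical; this is not the VIAF API or Matt and Douglas's actual pipeline.

```python
from collections import Counter


def rank_by_expressions(records):
    """Rank works by their number of distinct expressions.

    records: iterable of (work_id, expression_id) pairs.
    Returns a list of (work_id, count), largest count first,
    with ties broken alphabetically by work_id.
    """
    distinct = set(records)  # drop repeated (work, expression) pairs
    counts = Counter(work for work, _ in distinct)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Made-up identifiers for illustration only.
records = [
    ("work_a", "expr_fr"), ("work_a", "expr_de"), ("work_a", "expr_fr"),
    ("work_b", "expr_es"),
]
print(rank_by_expressions(records))
```

Counting distinct pairs rather than raw rows matters here, since aggregated authority data can list the same expression more than once.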

Digital archives like other collections are always biased in some way. This doesn't invalidate the goal of creating curated corpora.

Matt Wilkens: Race and Literature

Matt and Richard So are working on race and American fiction lists. They are compiling different types of lists from publishers. Out of 30K novels, the 20th century American novel is 97% white, with 3% for everybody else. What can we do about this? The History of Black Writing project out of Kansas is trying to gather and then digitize any novel by African American writers. The copies they are getting from various places give good reason to believe that African American novels were being published, but through other channels.

Then Matt talked about a case study of a literary geography of British and foreign works from 1880-1940. He got four corpora:

  • Hathi - 7,399 volumes
  • Prominent - something like the canon - 576 vols
  • London - regionalist writing
  • Foreign - generically expansive, a list of writers from outside UK writing in English

He then showed graphs comparing these lists as to gender, ethnicity, genre, and whether it is in Hathi. He had interesting things to say about the foreign writings list. This was drawn from the scholarship. It had a broader range of genre than other lists as many foreigners were writing in different genres.

He talked about all the challenges of building out lists with enough texts given the lack of balance.

Andrew Piper: Ethics of Scraping Fan Fiction

Andrew talked about the ethical issues around scraping fan fiction.

Simon DeDeo: (Cognitive) Surprise! The Anxiety of Influence

Despite all our technology, the only way for one brain to influence another is through text, speech and media. What we see is based on what we expect.

He talked about how we don't really see the world as it is. We throw so much out. Deep learning reproduces this by reducing input. What we see bears as much relation to what is there as Windows does to reality.

How then does reality enter in? For Simon we are sensitive to surprise, sensitive to the contradictions. Simon then talked about the game 20 questions. One can think of the number of questions it takes to get an answer as the amount of information there is in the answer. (Shannon's Mathematical Theory of Communication, 1948) We can then get probabilities/uncertainty.
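A small sketch of the 20 questions idea, assuming ideal yes/no questions that each halve the remaining possibilities. The function names are mine, not Simon's: an answer with probability p takes about -log2(p) questions (bits) to pin down, and the average over all answers is the entropy.

```python
import math


def surprisal_bits(p):
    """Information in an outcome of probability p, in bits (yes/no questions)."""
    return -math.log2(p)


def entropy_bits(probs):
    """Average number of yes/no questions needed, given outcome probabilities."""
    return sum(p * surprisal_bits(p) for p in probs if p > 0)


# Twenty perfect yes/no questions can distinguish 2**20 equally likely answers.
print(2 ** 20)
# Picking one of those answers therefore carries 20 bits of information.
print(surprisal_bits(1 / 2 ** 20))
# Four equally likely answers need two questions on average.
print(entropy_bits([0.25, 0.25, 0.25, 0.25]))
```

The less probable an answer, the more questions it takes, which is the sense in which surprise and information are the same quantity.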

He then talked about measuring the surprise in an archive like physics papers. We can calculate the probability of a newer paper based on (the bag of words of) what came before. Then he talked about looking at fan fiction, where there is less surprise. Then he showed graphs from a cool project by a student on poetry. This used stylistic features, not bag of words. He showed different patterns of how you get into the Norton anthology of poetry. For some like Yeats

Then he talked about looking at surprise in the transcripts of the Estates General during the French Revolution. He talked about "resonance", which would be a measure of novel speeches that have impact over time (resonance over time).
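A rough sketch of how surprise and resonance could be computed over a sequence of speeches. The assumptions here are mine and are not spelled out in the talk as I noted it: each speech is a bag of words, surprise is the Kullback-Leibler divergence between smoothed word distributions, "novelty" is surprise relative to earlier speeches, "transience" is surprise relative to later ones, and resonance is novelty minus transience. The function names, the smoothing and the toy speeches are hypothetical.

```python
import math
from collections import Counter


def distribution(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed word distribution over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_bits(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits: surprise of p given q."""
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p)


def novelty_transience_resonance(docs, i, window=1):
    """Compare document i with up to `window` documents before and after it."""
    vocab = sorted({w for doc in docs for w in doc})
    dists = [distribution(doc, vocab) for doc in docs]
    past = dists[max(0, i - window):i]
    future = dists[i + 1:i + 1 + window]
    if not past or not future:
        raise ValueError("document i needs at least one neighbour on each side")
    novelty = sum(kl_bits(dists[i], q) for q in past) / len(past)
    transience = sum(kl_bits(dists[i], q) for q in future) / len(future)
    return novelty, transience, novelty - transience


# Toy speeches (made up): the second breaks with the first, the third follows it.
speeches = [
    ["tax", "tax", "king"],
    ["rights", "nation", "rights"],
    ["rights", "nation", "nation"],
]
print(novelty_transience_resonance(speeches, 1))
```

Under these assumptions a speech scores high when it departs from what came before and what comes after looks like it; a novel speech that nobody picks up scores near zero.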

Simon then talked about how this model might apply to the internet where you have forking discussions. They looked at deviation in downstream comments.




Page last modified on October 27, 2017, at 01:15 PM