This is lifted literally from a piece I wrote as part of the Living Archive project I am involved in:
If you do anything that falls into the current catch-all that is the digital humanities, it is hard to avoid big data at the moment. Big data means, more or less, humanities projects that want to use data sets that are big. Already notice the change: humanities scholars and something called a ‘data set’. Not some books, films, or paintings. And then this data set is to be big. How big? Science big (well, not really, as the sciences’ big data sets make humanities data sets look small).
The thumbnail sketch of this is often via Moretti’s ‘distant reading’, where instead of closely reading a small set of texts (the novels of Dickens, the films of Ford, or even a sample of romance novels or westerns) you develop methods to analyse enormous numbers of novels. As a consequence the questions you ask are different, and also, once appropriately digitised, various pattern matching exercises can be performed simply to see what there is. So, instead of looking at 20 novels in your research, you study, say, 20 million.
This requires infrastructure – bandwidth, storage, tools, labs, boffins and geeks. Heady stuff. And most of all, it needs money. Politically, in some sectors of the research/university universe this is doubly attractive, simply because our performance is routinely measured not just by publications and successful grants, but by the size of these grants. A single funded big data humanities project therefore becomes particularly attractive, because against your institution’s research criteria it means you are plainly being more successful – for better or worse, more research income always trumps less, regardless of outcomes (well, up to a point).
Manovich’s lab is into big data, and he’s written some useful pointers and case studies from his lab (for example his Vertov essay, and the chapter in Berry’s anthology). Because of the scale of things, visualisation becomes significant in this domain, and so there is a lot of highly skilled and interesting work that lies at the intersection of computing, digital humanities, visualisation, and then what questions to ask.
Post-Theory was Bordwell and Carroll’s shot across the bows of big theory cinema studies. Grand theory, in their critique, was continental-inspired work that presumed to have an all-encompassing theoretical framework, which was then wheeled out and placed over cinema (films, audiences, institutions, it didn’t really matter) but disregarded the specificity of the objects under study and of historical practice. At times their description of ‘grand theory’ risked caricature (Sinnerbrink’s New Philosophies of Film covers this material well), and as a card-carrying member of some forms of grand theory I am unwilling to dismiss it in quite this manner. However, what they do offer is what they describe as ‘middle theory’, which is specifically grounded in smaller, localised samples, looks at (well, so they’d like to think) what is, and then theorises about that. (Actually I don’t think Bordwell and Carroll are nearly as elegantly hermeneutic as that, but that’s by the by here.) What I like though is the intent and inclination here. You look closely, not naively or without some opening questions or gambits you’re wanting to wonder about, but it is a much more open process than that proffered by high theory, where the risk, in terms of the object of study, is that the theoretical framework becomes tautologically validated using the film as evidence, rather than the other way around. It is a method that in some ways can be thought to put the (romantic?) desire to understand the films (in the case of Bordwell and Carroll) first, rather than the doing of theory as an end.
So between the enormity of big data, and the traditional intensely close reading of what we might as well describe as small data (the westerns of Ford, the tragedies of Shakespeare), we can see the rise of what I would like to describe as middle data. What is middle data? It is a methodological field, not an algorithmic process (for those that recognise it, this is from Barthes’ “From Work to Text” as a nod to the importance of grand theory to my own work) which means that it lies between the sorts of new practices that emerge when we apply novel digital techniques to things that can be treated as data using what I’d like to think of as more traditional propositions.
The Living Archive project is, pretty much, a middle data project. The data set is small and focussed, though larger than what you’d ordinarily look at in depth for the usual close reading. It is not large enough to be capital ‘B’ big data. Yet it is thoroughly digital in its methods, and has been thought of from the outset as an inherently digital system, so it has been made in a way that enables and facilitates a rich variety of other research methods or propositions. By inherently digital I mean the project, from inception, was never conceived of as merely the digitisation and dissemination of an analogue collection onto the web, but looked to use the affordances of the digital in concert with the network to enable, invent, test, and play with other sorts of possibilities.
Some of these methods are quite simple, for example the ability to curate clips (acts) from performances into novel series outside of the shows in which they appeared through informal tags and curation into collections. This makes it reasonably simple for a scholar to collect all acts of a certain type into their own collection, to perhaps compare changes or even similarities in performance across the history of the circus, or perhaps to look at the history of costume, performance style, gender and performance, or public sphere politics, physical circus and performance.
However, it is more interesting than being able to curate parts into collections because the overall digital system has been built as a platform, which means it provides APIs that let new and different interfaces to the data be easily written. This means it is a relatively lightweight (and agile) task to write different interfaces to interrogate the data in different ways and to present the outcomes of this in different ways. For instance, PhD candidate Reuben Stanton spent only a few hours developing an interface to retrieve and then visualise how many shows are currently public versus how many are private. This visualisation, which is currently known within the project team as ‘the iceberg’, dynamically draws a timeline and places public shows above the ‘waterline’ and private ones below. While pragmatically useful simply to indicate, and encourage, further marking up of the non-public shows so that they can be made public by Circus Oz, it also indirectly reveals other things.
For instance, there are few shows in the early years, then a peak of recorded shows, and then a more recent decline. The recent decline can be partly explained by a backlog of digitisation and upload, but it could also be speculated that the preserved record appears like this because video becomes an available (though expensive) technology, with an early spike of interest and use. It is not, though, very simple or straightforward to use, and so is not used regularly after the initial enthusiasm. (Of course there is also the obvious point that trying to preserve the original media for so long, in now obsolescent formats, means that less will be usable.) The rise could correspond to the development of new digital formats like DV, where tape length is now much longer and more suited to recording a show, storage is cheaper and simpler (tape costs per hour plummeted at this time), and so it becomes easier, cheaper, and less disruptive to record a show. The tapes are physically much smaller and it is literally a simple task to film and then stick the unedited tape into a drawer. Finally, the more recent decline might reflect the move to a completely digital mode where, to begin with, it was trivial to record a lot (and we all did), but then storage of non-tape media (we could call it non-physical but that isn’t really accurate) on hard drives was informal, and material may have been stored willy-nilly then erased, lost, forgotten, deleted and otherwise indirectly treated as ephemeral.
This is speculative, and none of it may be correct. However, the ability to interrogate the data through different questions and to visualise this simply and quickly (in this case: how many shows are there? from when? which are public? and then draw this) means the system and research project make it easy to develop novel questions or propositions utilising the data. This means there is an agility to how the system can be utilised, in addition to its explicit role of providing the company and the public with access to the record of performance of Circus Oz.
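The questions behind the iceberg (how many shows, from when, which are public) can be sketched in a few lines. This is only an illustration of the idea, not the project's code: the record fields `year` and `public`, the sample data, and the function names `iceberg_counts` and `draw_iceberg` are all assumptions, and the sketch reads from a hard-coded list rather than the archive's actual API.

```python
from collections import Counter

# Hypothetical records. The real archive's API and field names are not
# described here, so 'title', 'year' and 'public' are assumptions.
shows = [
    {"title": "Show A", "year": 1985, "public": True},
    {"title": "Show B", "year": 1995, "public": True},
    {"title": "Show C", "year": 1995, "public": True},
    {"title": "Show D", "year": 1995, "public": False},
    {"title": "Show E", "year": 2005, "public": False},
    {"title": "Show F", "year": 2005, "public": False},
]


def iceberg_counts(shows):
    """Return {year: (public_count, private_count)}, ordered by year."""
    public = Counter(s["year"] for s in shows if s["public"])
    private = Counter(s["year"] for s in shows if not s["public"])
    years = sorted(set(public) | set(private))
    return {year: (public[year], private[year]) for year in years}


def draw_iceberg(counts):
    """Draw a text timeline: public shows above the waterline, private below."""
    years = list(counts)
    top = max((pub for pub, _ in counts.values()), default=0)
    bottom = max((priv for _, priv in counts.values()), default=0)

    def row(filled):
        # One four-character cell per year, matching the width of the label.
        return " ".join(" ## " if f else "    " for f in filled).rstrip()

    lines = [row(counts[y][0] >= level for y in years)
             for level in range(top, 0, -1)]
    lines.append("~".join("~~~~" for _ in years))  # the waterline
    lines += [row(counts[y][1] >= level for y in years)
              for level in range(1, bottom + 1)]
    lines.append(" ".join(str(y) for y in years))
    return "\n".join(lines)


print(draw_iceberg(iceberg_counts(shows)))
```

Because the counting and the drawing are separate steps, asking a different question of the same records (shows per decade, say, or shows carrying a given tag) only means swapping the counting function.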
At the moment I like to conceive of this as ‘middle data’ as it utilises computational analysis, a constrained data set, and visualisation to facilitate varieties of close analysis (small data?). It is a method that implicitly requires and relies upon the digital and computational (it is not just digitisation but also the varieties of analysis and calculation enabled by the computer) to discover patterns that then prompt closer analyses of our object of study.
For example, in 2000 a colleague and I developed a film annotation system (the gloriously named SMIL Annotation Film Engine, aka SMAFE) which allowed me to add in and out markers (timecode) to an entire film, and then to extract these shots and sequences in various ways and show the results. The film was John Ford’s 1956 western The Searchers, and I marked up the film manually around the presence of doorways. This was based on an intuitive hunch: in Ford’s work there is a clear distinction made and used between the inside and outside of the home, and so I believed doorways might be a significant though unrecognised element of the film’s mise-en-scène. Simply viewing the film, and trying to notice doors, is one method. However, once the film was marked up, the system allowed me to search on specific criteria (camera outside, inside, looking in, looking out, and so on). A strong poetic pattern around doorways was immediately apparent, and visible. The point here is that the tool allowed the exploration of questions normally associated with close reading using methodologies that relied upon the affordances of the computational. This is a method that worked in concert with, and so then supported, a more developed close reading of the film – just how were doorways part of the film’s mise-en-scène? why? and what might this contribute to our understanding of what the film might be about?
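The mark-up-then-search workflow described above can be sketched as follows. This is not SMAFE itself, whose internals are not described here: the `Annotation` class, the `search` and `total_duration` helpers, the tag names, and every timecode in the sample data are invented for illustration and are not real timings from The Searchers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One marked-up shot: in and out points in seconds, plus free tags."""
    start: float
    end: float
    tags: frozenset


# Invented sample mark-up; the timecodes are placeholders, not real data.
annotations = [
    Annotation(10.0, 25.0, frozenset({"doorway", "camera_inside", "looking_out"})),
    Annotation(300.0, 312.5, frozenset({"doorway", "camera_outside", "looking_in"})),
    Annotation(7000.0, 7020.0, frozenset({"doorway", "camera_inside", "looking_out"})),
]


def search(annotations, *required):
    """Return the annotations that carry every one of the required tags."""
    wanted = set(required)
    return [a for a in annotations if wanted <= a.tags]


def total_duration(annotations):
    """Total screen time, in seconds, covered by the given annotations."""
    return sum(a.end - a.start for a in annotations)


inside_looking_out = search(annotations, "camera_inside", "looking_out")
for a in inside_looking_out:
    print(f"{a.start:8.1f} -> {a.end:8.1f}  {sorted(a.tags)}")
print("seconds on screen:", total_duration(inside_looking_out))
```

Listing the matches in film order is what lets a pattern show itself: where in the film the matching shots fall, and how long they run, become things you can see at a glance and then return to with a closer reading.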
Barthes, Roland. “From Work to Text.” Image-Music-Text. Trans. Stephen Heath. London: Flamingo, 1977. 155–64. Print.
Bordwell, David, and Noël Carroll. Post-Theory: Reconstructing Film Studies. Madison: University of Wisconsin Press, 1996. Print.
Manovich, Lev. “How to Compare One Million Images?” Understanding Digital Humanities. Ed. David M. Berry. London: Palgrave Macmillan, 2012. 249–78. Print.
Manovich, Lev. “Visualizing Vertov.” Russian Journal of Communication 5.1 (2013): 44–55. Taylor and Francis. Web. 24 May 2013.
Moretti, Franco. Distant Reading. 1st ed. Verso, 2013. Print.
Moretti, Franco. Graphs, Maps, Trees: Abstract Models for Literary History. Verso, 2007. Print.
Sinnerbrink, Robert. New Philosophies of Film: Thinking Images. 1st ed. Bloomsbury Academic, 2011. Print.