We have some great news: a prototype version of the Arclight app has been completed. While we are continuing to refine it for use by researchers and the broader public, the app is already capable of running fast Scaled Entity Searches across the entire 1.6-million-page corpus of the Media History Digital Library. Individual searches for as many as 31,000 separate entities have successfully returned results.
A longer article about Arclight’s implementation of Scaled Entity Search (SES) will be published in the coming months, but in the meantime, we would like to share some initial findings about the primary dataset we have been processing in order to generate entity lists: the Early Film Credits (EFC) dataset. The EFC dataset is based on Einar Lauritzen and Gunnar Lundquist’s seminal two-volume filmography of early American cinema, The American Film-Index.[1] Lauritzen and Lundquist’s work is widely considered to be the most reliable filmography available on American narrative films of the 1910s. Paul Spehr, who converted the filmography to digital form for his 1996 work American Film Personnel and Company Credits, 1908-1920,[2] generously made this data available for use by Project Arclight. Spehr provided us with the data in the form of a remarkably well-formatted and consistent text file, which meant that we could easily write a Perl script to convert the file to Extensible Markup Language (XML).
Once in XML, all of the individual entities in the data, which included some 30,000 individual films, 17,000 personnel (including animal actors!), and 1,000 separate companies, were individually marked up for counting, sorting, and analysis. This enabled us to generate very precise entity lists: for instance, we might be interested in creating a list of all 1918 films directed by Marshall Neilan and starring Mary Pickford. By using XQuery, an XML query language that returns results based on requests for specific elements or attributes, we could return such a list very quickly. However, because our basic research question (Who’s trending in media history?) is quite general, we strove to create as complete a list of the entities contained in the EFC data as possible. Thus, we simply queried for lists of all the individual companies, films, and personnel contained in the data.
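To make the idea of querying marked-up entities concrete, here is a minimal sketch in Python (using the standard library's ElementTree rather than XQuery). The element and attribute names (film, title, credit, role, year) are our own invention for illustration, since the actual EFC markup is not described in this post, and the three records are hand-typed samples rather than an extract from the EFC file.

```python
import xml.etree.ElementTree as ET

# Invented schema and hand-typed sample records, for illustration only;
# the real EFC element names and structure may differ.
SAMPLE = """
<films>
  <film year="1918">
    <title>Stella Maris</title>
    <credit role="director">Marshall Neilan</credit>
    <credit role="actor">Mary Pickford</credit>
  </film>
  <film year="1918">
    <title>Johanna Enlists</title>
    <credit role="director">William Desmond Taylor</credit>
    <credit role="actor">Mary Pickford</credit>
  </film>
  <film year="1919">
    <title>Daddy-Long-Legs</title>
    <credit role="director">Marshall Neilan</credit>
    <credit role="actor">Mary Pickford</credit>
  </film>
</films>
"""


def films_matching(root, director, actor, year):
    """Return titles of films from `year` that credit both people in those roles."""
    titles = []
    for film in root.iter("film"):
        if film.get("year") != str(year):
            continue
        directors = {c.text for c in film.findall("credit[@role='director']")}
        actors = {c.text for c in film.findall("credit[@role='actor']")}
        if director in directors and actor in actors:
            titles.append(film.findtext("title"))
    return titles


def all_personnel(root):
    """Return every distinct credited name, sorted alphabetically."""
    return sorted({c.text for c in root.iter("credit")})


root = ET.fromstring(SAMPLE.strip())
print(films_matching(root, "Marshall Neilan", "Mary Pickford", 1918))
print(all_personnel(root))
```

The first function corresponds to the narrow, precise kind of list described above; the second corresponds to the broad "give us everyone" list we actually used.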
Although we still have some processing to do on this data before using it as a source of entities in SES, it has proven revelatory, even in unprocessed form, for the quantitative information it contains. As a statistical representation of company output, for instance, the data reveals the sheer prolificacy of the major film producers during the one-reel period (roughly 1907-15). Eight companies—seven of which were members of the Motion Picture Patents Company—accounted for almost 40% of all the fiction films produced during this period: Vitagraph (2840 films), Lubin (1880), Essanay (1808), Kalem (1708), Selig (1644), Edison (1502), American Biograph (1315), and Universal (1019). Vitagraph alone accounted for nearly 8% of all fiction titles in the dataset. In keeping with what we know about industry shifts in the teens, the set becomes more heavily weighted toward feature-oriented companies after 1916. Restricting the set to the period 1917-20, Universal (796), Paramount (667), Pathe (518), Triangle (418), and Fox (403) shoot to the top of the list—although Vitagraph remains at #3 in terms of total output, at 657 films (Click here for the complete list in Google Doc form).
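The shares quoted above can be roughly sanity-checked from the company counts themselves. The sketch below is only an approximation: it assumes a round total of 35,000 titles in the dataset (the exact denominator is not stated in this post), so the resulting percentages should be read as ballpark figures.

```python
# Company counts as quoted in the paragraph above.
COMPANY_COUNTS = {
    "Vitagraph": 2840,
    "Lubin": 1880,
    "Essanay": 1808,
    "Kalem": 1708,
    "Selig": 1644,
    "Edison": 1502,
    "American Biograph": 1315,
    "Universal": 1019,
}

# Assumption: a rounded dataset total; the exact figure is not given here.
ASSUMED_TOTAL = 35_000


def share(count, total):
    """Return `count` as a percentage of `total`."""
    return 100 * count / total


top_eight = sum(COMPANY_COUNTS.values())
print(top_eight)
print(round(share(top_eight, ASSUMED_TOTAL), 1))
print(round(share(COMPANY_COUNTS["Vitagraph"], ASSUMED_TOTAL), 1))
```

Under that assumed total, the eight companies together come to a little under 40% and Vitagraph alone to about 8%, in line with the figures reported above.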
At the same time, the data show the extent of the “long tail” of production during this period. Two-thirds of the individual companies recorded in the EFC dataset released only one or two films between 1908 and 1920, and a solid majority (74%) of the 733 films from such companies were released after 1914. This tail, though long, is also very thin; collectively, these companies account for only about 2% of all the titles in the set. Yet a closer look at them reveals their diverse nature; the titles they produced were not all obscure one-offs. They run the gamut from the period’s huge special features (The Birth of a Nation and Civilization), to independent features from well-known stars (Hobart Henley’s 1918 Parentage, produced by Frank J. Seng), to films we know virtually nothing about (1916’s Carma, directed by John Harvey and starring Sylva Carmen, from Florida Productions).
As named entities, the diversity of these companies shows one of the perils of “distant reading”: losing qualitatively crucial context in the sea of quantitative scale. However, the sheer number of these companies prompts us to examine them closely, both as a totality and as definable groups, and to search for any shared historical characteristics. And at least one feature links a great many of them: distribution via alternative means, whether through states’ rights, roadshowing, or as a special through one of the major companies. That a majority of these films come from 1915 or later suggests that the importance of alternative distribution in the aftermath of the wider industry’s transition to features may have been underestimated. Current histories tend to frame this period as one of industry consolidation and vertical integration; this data forces us to consider the extent to which marginal forms of distribution continued to be important for certain producers and under certain circumstances.
Similarly interesting patterns show up in the other mass entity sets, but they point less to historical questions than to historiographical and interpretive ones. The named entity list of personnel, for example, shows a not unexpected frequency of credits for certain figures with historiographical prominence. When the list is ordered by total number of credits, D.W. Griffith (749 credits), Mack Sennett (548), and Billy Bitzer (486) lead the pack, both because their films have generally survived and because their involvement in productions has been a common subject of silent cinema research. However, immediately below them are Eddie Lyons (432) and Lee Moran (423), two figures who, though not nearly as prominent in film history, were not only extremely prolific in the period covered by the set, but also participated in multiple aspects of production. Lyons, for example, is credited in 268 films in the set, but his total number of credits is 432, simply because he tended to direct, star in, and write the films he was involved with. The first woman to appear in the list, Mabel Normand (at #11 with 290 credits), similarly worked as a director (32 credits) and scenarist (7 credits) in addition to acting (click here for a list of the top 30 credited named entities).
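The gap between a person's number of credits and the number of films they appear in comes down to how the tally is made. A minimal sketch of the distinction follows; the (film, person, role) rows and the names "Person A" and "Person B" are placeholders we made up, not EFC records.

```python
from collections import Counter, defaultdict

# Placeholder rows of (film_id, person, role); not actual EFC data.
CREDITS = [
    ("film-001", "Person A", "director"),
    ("film-001", "Person A", "actor"),
    ("film-001", "Person A", "writer"),
    ("film-002", "Person A", "actor"),
    ("film-002", "Person B", "director"),
    ("film-003", "Person B", "director"),
]


def tally(credits):
    """Map each person to (total credits, number of distinct films)."""
    total_credits = Counter()
    films_by_person = defaultdict(set)
    for film_id, person, _role in credits:
        total_credits[person] += 1            # every role counts as a credit
        films_by_person[person].add(film_id)  # each film counts only once
    return {p: (total_credits[p], len(films_by_person[p])) for p in total_credits}


for person, (n_credits, n_films) in tally(CREDITS).items():
    print(person, n_credits, n_films)
```

Here "Person A" has four credits across two films, the same pattern (on a much smaller scale) as someone who directs, writes, and stars in the same picture.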
At the other end of the spectrum, certain high-profile personnel receive very few credits because of the structure of the data set. Since the data includes no fields for executive-level personnel above the level of director, figures like George Kleine and Harry Aitken receive only a single credit each (for films they are listed as writing or directing), despite the fact that they effectively produced hundreds of films. Those who were prominent in one aspect of production, like Griffith’s screenwriter Frank Woods, are underrepresented in other aspects (production supervision, in Woods’ case). The named entity credits thus remind us to be aware of the limitations of our data, and to understand the interpretive challenges and constraints of its structure.
The titles list, because it represents the largest and in many ways most ambiguous set of entities in the EFC dataset, also presents the greatest challenge when it comes to SES. Film titles have a tendency to generate large numbers of false positives, particularly when they are short: witness Vitagraph’s War (1915) or Fox’s She (1917). This is less of a problem when performing an individual keyword search, since contextualizing information about the film can easily be used to disambiguate the results. Scaled searches, by contrast, are only effective when the entities in question are relatively unambiguous. Thus, the titles list is less useful overall than the other two. However, it does show a number of interesting patterns. The first involves the uniqueness of film titles during the 1910s; out of some 35,000 titles total, 81% (about 29,000) are represented only once in the set. The set of most repeated titles, on the other hand, reads like a list of narrative tropes: Never Again, The Turning Point, The Trap, The Awakening, The Sacrifice, Retribution, The Greater Love, The Flirt. The first truly unambiguous title on the list, Rip Van Winkle, designates seven separate films released during this period. Other classics received four or more adaptations between 1908 and 1920, including Romeo and Juliet, Carmen, Uncle Tom’s Cabin, and Dr. Jekyll and Mr. Hyde.
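Counting title repetition of this kind is straightforward once the titles are in a flat list. The sketch below shows one way to compute the share of entries whose title occurs only once, along with the most repeated titles. The short list is a toy example with made-up repetition counts, and the function name is ours.

```python
from collections import Counter

# Toy list of title entries; the repetition counts here are invented.
TITLES = [
    "The Trap", "The Trap", "The Trap",
    "Never Again", "Never Again",
    "Carma", "Parentage", "Civilization",
]


def title_stats(titles, top_n=2):
    """Return (share of entries with a unique title, the top_n most repeated titles)."""
    counts = Counter(titles)
    unique_entries = sum(1 for n in counts.values() if n == 1)
    return unique_entries / len(titles), counts.most_common(top_n)


unique_share, most_repeated = title_stats(TITLES)
print(unique_share)
print(most_repeated)
```

Note that the share is computed over all title entries rather than over distinct titles, which is how we read the 81% figure above. A fuller version would probably also normalize case and punctuation before counting.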
While these initial findings will certainly be refined as we continue to process the data for Scaled Entity Search, they point to the inherent power, and the undeniable perils, of large data sets. We are excited to contextualize all of the raw quantitative information the EFC set contains with the SES results for individual entities. Ultimately, this data gives us a second set of quantitative metrics, allowing us to compare a given entity’s “hits” in the EFC with its number of hits in SES. The relationship between those two numbers, though necessarily complicated by issues of disambiguation and other interpretive challenges, may prove quite revelatory.
[1] Stockholm: Film-Index, 1976.
[2] Jefferson: McFarland, 1996.