Organizing a Content Profile for a Large Heterogeneous Collection of Interactive Projects

Adam Larson, Ethan Wolfe, Roger Lorelli and Dr. Eric Kaltman 


Our aim is to create a content profile of a large, heterogeneous data set of interactive projects. Just as librarians preserve books, we must preserve software for the historical record. This work analyzes an 18-terabyte backup drive of interactive projects from Carnegie Mellon’s Entertainment Technology Center. Data sets that span more than 20 years and contain hundreds of interactive projects are typically not made available to the public because of intellectual property concerns, so the chance to explore one is a unique opportunity. Qualitatively, we went through each folder, examined its context, and categorized it in a 15-category spreadsheet. Quantitatively, we scanned and indexed the entire data set to aggregate metadata on each file. Together, this information allowed us to compile a comprehensive content profile of the data set.
We found that the use of Adobe Flash in the data set tracked the technology's historical popularity until its deprecation in 2019. The game projects that would typically have used Adobe Flash were, predictably, built with the Unity engine instead. In our qualitative categorization, 56% of the 546 projects were games and 14% were mobile applications. Our quantitative findings exposed data redundancy in the data set and gave us raw counts of the file types in use. For example, of the 9,202,496 files scanned, 618,453 were .bin files and 869,223 were .png files. In terms of size, approximately 7.59 terabytes were video files. Most importantly, we found that 36% of the data set was redundant. This information allows Carnegie Mellon to condense its data set, save money on server costs, and more accurately appraise the data it generates in the future.
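
The quantitative pass described above can be illustrated with a minimal sketch. The abstract does not say how files were indexed or how redundancy was measured, so this sketch makes its own assumptions: files are grouped by extension, and a file counts as redundant when it is a byte-identical copy (same SHA-256 digest) of a file already seen. The names profile_tree and file_digest are hypothetical and are not taken from the project.

```python
import hashlib
import os
from collections import Counter, defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def profile_tree(root):
    """Walk `root` and summarize file counts, sizes, and duplicate bytes.

    Assumption (not from the abstract): "redundant" means every copy of a
    file's content beyond the first, as identified by its SHA-256 digest.
    """
    count_by_ext = Counter()
    bytes_by_ext = Counter()
    sizes_by_digest = defaultdict(list)

    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            size = os.path.getsize(path)
            ext = os.path.splitext(name)[1].lower() or "(none)"
            count_by_ext[ext] += 1
            bytes_by_ext[ext] += size
            sizes_by_digest[file_digest(path)].append(size)

    total_bytes = sum(bytes_by_ext.values())
    # Keep one copy per digest; count the rest as redundant.
    redundant_bytes = sum(sum(sizes[1:]) for sizes in sizes_by_digest.values())
    return {
        "files": sum(count_by_ext.values()),
        "count_by_ext": dict(count_by_ext),
        "bytes_by_ext": dict(bytes_by_ext),
        "total_bytes": total_bytes,
        "redundant_bytes": redundant_bytes,
        "redundant_fraction": redundant_bytes / total_bytes if total_bytes else 0.0,
    }
```

Calling profile_tree on a mounted copy of a backup would return per-extension counts and sizes comparable in kind to the .bin, .png, and video figures reported above. On a multi-terabyte drive, hashing every file is costly; one plausible refinement is to group files by size first and hash only those that share a size with another file.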


Session 2 – 3:00 p.m. – 4:15 p.m.

Room B – Sierra 1422