Mining Goodreads

Literary Reception Studies at Scale

  • Sponsors:
    • Price Lab for Digital Humanities
    • Humanities and Human Flourishing Project
    • Penn Libraries
  • Principal Investigator:
    • James F. English (Professor, English)
  • Principal Developers:
    • Scott Enderle (DH Specialist Librarian, Penn Libraries)
    • Rahul Dhakecha (Masters Student, CIS)
  • Project Developers:
    • Tianli Han (Masters Student, CIS)
    • Sharvan Shah (Masters Student, CIS)
  • Student RAs:
    • Daniel Sample (BA, English)
    • Alex Anderson (BA, Comp Lit)
    • Savannah Lambert (BA, English)
  • Consultants:
    • James Pawelski (Positive Psychology Center)
    • Lyle Ungar (Professor, CIS and Psychology)
    • Louis Tay (Professor, Psychology, Purdue University)

Goodreads is the world’s leading social reading and curation site. Its 50 million unique users per month rate and review the books they read, arrange them on personalized “shelves,” and form fan clubs, friend networks, and discussion groups. It is a site of major importance for literary reception studies, a field that cuts across the disciplines of sociology, psychology, and literary studies, and involves scrutiny not (merely) of works of literature but of the tastes, habits, and experiences of actual readers. Perhaps the greatest challenge for reception scholars has been the difficulty of gathering datasets adequately representative of such a vast and varied field of practice as reading. The typical procedure has been to generalize from small-scale studies of at most a few dozen relatively homogeneous subjects. The massive quantity of data about all sorts of readers and reading contained on the Goodreads site is potentially transformative of the field.

The Mining Goodreads project at Penn, cosponsored by the Humanities and Human Flourishing Project and the Price Lab for Digital Humanities, focuses on readers of contemporary fiction. We have built a database comprising nearly three million Goodreads reviews, together with corresponding ratings (on a 5-star scale) and metadata on the books, authors, and users themselves. We have gathered all the reviews of 500 bestsellers and of 1,300 novels that made the shortlists of major awards since 1960, plus a substantial cross-section of the reviews of curated top-200 lists of mystery/crime/detective fiction, science fiction, and chick lit/modern romance novels. In addition, we have compiled all the reviews written by a random set of 1,672 highly active Goodreads users (sampled from all users who had posted at least 150 reviews as of January 2018). Between them, these readers have read more than 200,000 unique works of fiction across the full range of popular and not-so-popular genres. We have refined and structured this data and developed a substantial body of code, most of which may be shared with other researchers to help them pursue their own questions about contemporary readers, tastes, and values.

We have not made use of any information pertaining to users who maintain private accounts or profiles on Goodreads. Although the information we have collected is freely visible to any browser, we have designed our experiments and composed our outputs so as to ensure the anonymity of all users in our sample. The User ID numbers that appear in our Bokeh visualizations are randomly assigned for our own purposes; they are not actual Goodreads IDs.
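
For readers curious how randomly assigned IDs of this kind can be produced, the following is a minimal Python sketch, not the project's actual code. The function name `assign_surrogate_ids` and the example identifiers are hypothetical. It assumes the real identifiers are available as a list and that the resulting mapping is kept private by the researchers.

```python
import random


def assign_surrogate_ids(real_ids, seed=None):
    """Map each real ID to a randomly assigned surrogate ID.

    Hypothetical helper for illustration only. The surrogate numbers are
    a shuffled run of 1..N, so they carry no trace of the original IDs.
    """
    rng = random.Random(seed)
    unique_ids = sorted(set(real_ids))      # deduplicate, fix a stable order
    surrogates = list(range(1, len(unique_ids) + 1))
    rng.shuffle(surrogates)                 # break any link to that order
    return dict(zip(unique_ids, surrogates))


# Example with made-up identifiers (not real Goodreads IDs).
mapping = assign_surrogate_ids(["u-1029", "u-88", "u-4410", "u-88"], seed=0)
```

Only the surrogate values would appear in published outputs. Passing a fixed `seed` makes the assignment reproducible within a project, while leaving it as `None` gives a fresh assignment on each run.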

This page is maintained by The Price Lab for Digital Humanities and Penn Libraries Digital Scholarship