Foray Into Topic Modeling

Topic modeling suggests new avenues for Proust studies. Applications like Mallet and PhiloMine compute the statistical relationships among tokens (as one-, two-, or three-word phrases) appearing within specified spans of text, such as paragraphs or groups of, say, fifty words. Since the Recherche runs to more than one million words, topic modeling can highlight features of the text that are not perceptible during serial reading. I ran the first tome, which contains Du côté de chez Swann part I, “Combray,” and part II, “Un Amour de Swann,” through Mallet to generate token clusters for ten topics, and the results reveal some interesting patterns. The command-line output shows the nineteen most prominent words in each of the ten recurring patterns (topics) that the model found in the text.

  1. chose moment pouvait jamais puis rien esprit pourtant visage savait voulait dire savoir mal trouvait première devait autres instant
  2. dit bien dire air jamais beaucoup tête toujours princesse ami docteur reste choses sais enfin regard répondit jeune entendu
  3. vie amour plaisir souvent celle ainsi gilberte pu pensée besoin donnait tant sorte milieu cause femmes étais connaître joie
  4. après temps jusqu heure pendant allait presque chambre longtemps près seul passer heures penser jour tard souvenir chercher toute
  5. combray côté déjà rue soleil semblait fleurs saint bois place eau ciel petits vers jardin matin champs dessus autour
  6. faisait toutes petite peine seule beau toute sourire donner phrase quelques trouver parfois contraire nature suite musique croire corps
  7. swann odette chez verdurin monde disait gens femme forcheville homme soir effet amis connaissait demander personne cœur cottard
  8. voir faire aller autres jours jour toujours maison venait venir désir grande contre dès autant paris rien lequel bien
  9. grand tante mère père françoise faire bien fille disait parents maman voix partie personne bonne petit mort famille laisser
  10. devant guermantes yeux nom air petit surtout or doute mieux église image fit vue dame tant aussitôt figure lesquelles
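
As a minimal sketch of how a listing like this can be explored programmatically, the Python below parses numbered topic rows and reports which words occur in more than one topic. It assumes the "number, period, words" layout printed above rather than Mallet's own file format, uses only three of the ten rows for brevity, and the function names are illustrative. Because the listing holds only the top nineteen words per topic, any overlaps it finds are a partial view of the full model.

```python
import re
from collections import defaultdict

# Three rows copied from the listing above (topics 1, 2, and 8).
LISTING = """\
  1. chose moment pouvait jamais puis rien esprit pourtant visage savait voulait dire savoir mal trouvait première devait autres instant
  2. dit bien dire air jamais beaucoup tête toujours princesse ami docteur reste choses sais enfin regard répondit jeune entendu
  8. voir faire aller autres jours jour toujours maison venait venir désir grande contre dès autant paris rien lequel bien
"""

def parse_listing(text):
    """Map each topic number to its list of top words."""
    topics = {}
    for line in text.splitlines():
        match = re.match(r"\s*(\d+)\.\s+(.*)", line)
        if match:
            topics[int(match.group(1))] = match.group(2).split()
    return topics

def shared_words(topics):
    """Map each word found in more than one topic to its sorted topic numbers."""
    seen = defaultdict(set)
    for number, words in topics.items():
        for word in words:
            seen[word].add(number)
    return {word: sorted(nums) for word, nums in seen.items() if len(nums) > 1}

topics = parse_listing(LISTING)
overlaps = shared_words(topics)
for word in sorted(overlaps):
    print(word, overlaps[word])
```

Pasting all ten rows into `LISTING` extends the comparison to the whole table.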

Some of the results are unsurprising, such as topic 7, which clearly derives from the many evening scenes at the Verdurins (soir, chez, maison) where Swann courted Odette among their coterie (forcheville, cottard), often becoming jealously heartbroken (cœur, désir) as he wondered whether she was seeing other admirers on the sly (demander, connaissait, amis). Other topics reveal interesting patterns that fit with scenes across the entire narrative, such as number 10. It emphasizes the use and observation of the eyes (yeux, vue) in connection with the Duc and Duchesse de Guermantes, whose mysterious airs and glances are described in several Combray church passages, as well as their association with the art and symbolism of France (image, figure). But what also emerges is the consistency of the preposition before (devant), emphasizing the narrator's location not only in front of their paintings and of their glances, but also in front of a church in connection with a woman (dame), a recurrence that we can tease out by reading the database passages from the English translation.

Using a PHP script and a MySQL database (graciously provided by Elijah Meeks), we can extract the tokens, their word counts, and their connections from the Mallet topic model files into a graph file of nodes and edges, allowing us to view the ten topics as a network model in Gephi.
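
The general idea of turning topics into a graph can be sketched in a few lines of Python. This is not Meeks's PHP/MySQL pipeline, only an illustration under stated assumptions: each topic becomes a node linked to each of its words, the output is a CSV edge table with `Source`, `Target`, and `Weight` columns of the kind Gephi's spreadsheet importer reads, and every weight is a placeholder of 1 because the printed listing carries no word counts. The two topics are abbreviated from the listing, and the function names are hypothetical.

```python
import csv
import io

# Two abbreviated topics from the listing above; a real run would load all ten.
topics = {
    7: ["swann", "odette", "verdurin", "forcheville", "cottard"],
    10: ["guermantes", "yeux", "église", "dame", "devant"],
}

def topic_edges(topics):
    """Yield one (source, target, weight) row per topic-word pair."""
    for number, words in sorted(topics.items()):
        for word in words:
            # Placeholder weight: the printed listing has no word counts.
            yield (f"topic_{number}", word, 1)

def write_edge_csv(topics, handle):
    """Write a header row plus one edge per row to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(topic_edges(topics))

buffer = io.StringIO()
write_edge_csv(topics, buffer)
print(buffer.getvalue())
```

Writing to a file instead of `io.StringIO` (opened with `newline=""` and `encoding="utf-8"`) would produce a table that can be loaded as edges; a word that appears under two topics then becomes a node bridging them.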

This entirely computer-generated model of associative networks in tome 1 of the Recherche is markedly different from the static model created by my particular reading of the church motif above, though the two share some consistencies and show some interesting disparities.

For instance, when we drill down and filter to look more closely at the terms that join the different topics, we see that the word for nothing (rien) is the one that most frequently connects topics 6 and 9, which respectively center on themes of beautiful bodily gestures in music and family/home relationships, while time (temps) joins topic 6 with 3, which is focused on positive terms for love of Gilberte.
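
The filtering step described here was done interactively in Gephi, but its logic can be approximated in code. The sketch below assumes per-topic word counts are available in lines of the form "index word topic:count topic:count ...", which is my understanding of the layout of Mallet's word-topic counts file and should be checked against an actual export. The counts in `SAMPLE` are invented toy values, not output from the model discussed here, and `strongest_link` is a hypothetical helper that scores a word by the smaller of its two counts in a pair of topics.

```python
# Invented toy counts, NOT output from the actual model. Each line follows the
# assumed layout "index word topic:count topic:count ...".
SAMPLE = """\
0 rien 6:40 9:35 1:12
1 temps 6:22 3:30
2 musique 6:18
3 mère 9:50
"""

def parse_counts(text):
    """Return {word: {topic: count}} from word-topic count lines."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        word = parts[1]
        counts[word] = {}
        for pair in parts[2:]:
            topic, count = pair.split(":")
            counts[word][int(topic)] = int(count)
    return counts

def strongest_link(counts, topic_a, topic_b):
    """Return the word with the largest count shared by both topics, or None."""
    best_word, best_score = None, 0
    for word, by_topic in counts.items():
        # The smaller of the two counts measures how strongly the word
        # belongs to both topics at once.
        score = min(by_topic.get(topic_a, 0), by_topic.get(topic_b, 0))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

counts = parse_counts(SAMPLE)
print(strongest_link(counts, 6, 9))
print(strongest_link(counts, 6, 3))
```

Taking the minimum is only one possible design choice; summing the two counts, or using Gephi's edge weights directly, would rank bridging terms somewhat differently.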

According to the statistical features of the text, then, the first two parts of Du côté de chez Swann associate the expression of romantic love primarily with time, while the memory of familial love is associated primarily with absence. This perhaps comes as no shock to most readers of Proust, but if we compare this model with a search for the term nothing in the church motif database, we receive a number of passages associated predominantly with romantic love. These two fields of data, then, suggest a reading of the church motif as concerned with concepts of absence in romantic love, somewhat against the grain of the rest of the novel. There is not enough space here to deal with the problematics of translation/tutor text comparisons or the relation of computational algorithms to critical interpretation. But it is clear that domain expertise is just as necessary in digital scholarship as it is in print, as shown by the (illuminating) disparities between a human reading and a machine reading of the text.