In this exercise, we study how similarity can be used to display interesting information.
I took 17 ebooks from the Internet (from Project Gutenberg: http://www.gutenberg.org), picking them by download popularity at the time and filtering to keep books that were at least somewhat popular, in English, and of usable size.
Then I ran the TF-IDF algorithm on that corpus, kept the 50 best words per document, and rendered each book as a chromosome, with each word as a gene. Each word is mapped as follows:
– the IDF factor, since it is the same for a given word across the entire corpus, is used as the size (height) of the gene: if the word is important, then its presence has a high impact on the overall properties of the book.
– the TF is used as the intensity of the gene: if the word/gene is more present in this book, then it expresses itself more than the other genes/words, and appears whiter.
– the size of a book depends on the number of words.
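The mapping above can be sketched in a few lines of Python. This is only an illustration, not the code behind the rendering: it assumes a naive tokenizer and the textbook definitions TF = count / total words in the book and IDF = log(N / number of books containing the word). The names `tokenize` and `build_genes` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: a naive tokenizer that lowercases and keeps alphabetic runs.
    return re.findall(r"[a-z]+", text.lower())


def build_genes(corpus, top_k=50):
    """corpus maps a book title to its raw text.

    Returns a dict mapping each title to its top_k genes, best TF-IDF first.
    """
    counts = {title: Counter(tokenize(text)) for title, text in corpus.items()}
    n_docs = len(counts)

    # Document frequency: in how many books does each word appear?
    df = Counter()
    for book_counts in counts.values():
        df.update(book_counts.keys())

    # IDF depends only on the word, so it is shared by every book.
    idf = {word: math.log(n_docs / d) for word, d in df.items()}

    genes = {}
    for title, book_counts in counts.items():
        total = sum(book_counts.values())
        scored = []
        for word, n in book_counts.items():
            tf = n / total
            scored.append({
                "word": word,
                "height": idf[word],   # gene size comes from the IDF
                "intensity": tf,       # gene brightness comes from the TF
                "score": tf * idf[word],
            })
        # Highest score first; ties broken alphabetically for stable output.
        scored.sort(key=lambda g: (-g["score"], g["word"]))
        genes[title] = scored[:top_k]
    return genes
```

Calling `build_genes(corpus)` with the 17 texts would give, for each book, up to 50 genes whose `height` and `intensity` can be passed directly to whatever drawing code is used. A word that appears in every book gets an IDF of 0 under this definition, so it never ranks among the best words.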
In addition, we can cycle through the books using the W and X keys: the current book is highlighted in green, and all the best words of this book that appear at least once in other books are displayed in green across the whole corpus, with the count per book. It is then easy to see which books are similar, and how they are similar.
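The lookup behind that highlighting might be sketched as follows. Again this is an assumption about how it could be done, not the actual implementation: `shared_words` is a hypothetical helper that takes the raw texts, the list of best words per book, and the title of the selected book.

```python
import re
from collections import Counter


def shared_words(corpus, best_words, selected):
    """For each best word of the selected book, count its occurrences
    in every other book.

    Returns a dict word -> {title: count}, keeping only the words that
    appear at least once in another book.
    """
    # Assumption: the same naive lowercase, alphabetic-only tokenization.
    counts = {
        title: Counter(re.findall(r"[a-z]+", text.lower()))
        for title, text in corpus.items()
    }
    result = {}
    for word in best_words[selected]:
        per_book = {
            title: book_counts[word]
            for title, book_counts in counts.items()
            if title != selected and book_counts[word] > 0
        }
        if per_book:
            result[word] = per_book
    return result
```

The books that show up most often in the result, with the highest counts, are the ones closest to the selected book, and the words themselves say what the books have in common.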