Emeline Liu

   About       Resume       Archive       Feed   


Text Mining the Bible with R, Pt. 2

Associations

I LOVE looking at associations. Maybe I’m just naive and not particularly well-read on current updates in AI/machine learning, but I think the human brain is still more effective at reading between the lines right now. Or, at least my brain is more effective at drawing its own conclusions instead of writing algorithms…. But anyway, in this post I’ll be going over associations for the four most commons words in my corpus, which I found in the previous post. Here they are again:

library(tm)
library(plyr)
library(httr)
library(stringr)
library(tm)
#The rest of these libraries are used for visualization
library(ggplot2)
library(graph)
library(Rgraphviz)
library(wordcloud)
library(FactoMineR)
library(cluster)
library(topicmodels)
library(SnowballC)
cname <- ("C:/Users/Emeline/Documents/R_projects/bible/text")   

corpus <- Corpus(DirSource(cname))

#Cleaning
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, stripWhitespace)

#Remove special characters
removeSpecChar <- function(x) gsub("[^[:alnum:]///' ]", "", x)
corpus <- tm_map(corpus, removeSpecChar)

#convert to a plain text file
corpus <- tm_map(corpus, PlainTextDocument)

#Create a term document matrix, remove words with fewer than 3 characters
tdm1 <- TermDocumentMatrix(corpus, control=list(wordLengths=c(3,Inf)))
#Create document term matrix, remove words with fewer than 3 characters
dtm1 <- DocumentTermMatrix(corpus, control=list(wordLengths=c(3,Inf)))

#organize by frequency
dtmsort <- sort(colSums(as.matrix(dtm1)), decreasing=TRUE)
#most frequent
dtmsort[1:4]
##  will shall  lord   god 
## 57072 52639 49610 33122

I was originally planning to look at the top six most common words, but I became overly excited and wrote too many things, so I cut it down to a more digestable three.

From what I can tell, the findAssocs function in R works by using correlations across the document term matrix. Therefore, a high correlation coefficient should mean that two words were found together in a larger number of documents. A good illustration of this concept can be found at this forum post here. Please note - I set the correleation coefficient arbitrarily to return around 5-15 words per each term of interest, so they do vary by each search word.

Will

I’ll start with the most commonly found word.

findAssocs(dtm1, "will", corlimit=0.95)
##           will
## became    0.98
## room      0.98
## male      0.97
## used      0.97
## fishermen 0.96
## like      0.96
## outside   0.96
## examples  0.95

The highly correlated words here weren’t too inspiring for me to write about. I did a search on BibleHub and ran a findAssocs on “room” and “rooms.” From the BibleHub search, the occurences appear to typically be people asking for a room to stay in.

findAssocs(dtm1, "room", corlimit=0.95)
##           room
## became    0.98
## will      0.98
## answers   0.97
## outside   0.97
## badly     0.96
## examples  0.96
## fishermen 0.96
## floors    0.96
## rooms     0.96
## taxes     0.96
## used      0.96
## whitewash 0.96
## brothers  0.95
findAssocs(dtm1, "rooms", corlimit=0.95)
##             rooms
## brothers     1.00
## adriatic     0.99
## chalk        0.99
## coated       0.99
## fortyfirst   0.99
## fortynine    0.99
## outside      0.99
## prior        0.99
## skinned      0.99
## thirtysixth  0.99
## twentysixth  0.99
## basin        0.98
## everyone     0.98
## unwashed     0.98
## whitewash    0.98
## mixed        0.97
## ninetynine   0.97
## taxes        0.97
## used         0.97
## fortyone     0.96
## overcomes    0.96
## room         0.96
## thirtyeight  0.96
## babies       0.95
## became       0.95
## male         0.95
## pigs         0.95
## ruby         0.95

These associations also didn’t shed much light for me. I think these numbers found for “rooms” is perhaps an issue with me not cleaning my corpus well enough… My only other note was that “male” appears to be occasonally used interchangeably with “man.” To compare their frequencies:

term.freqall <- rowSums(as.matrix(tdm1))
term.freqall["male"]
## male 
## 1050
term.freqall["man"]
##   man 
## 20219

Shall

Next up is “shall.”

findAssocs(dtm1, "shall", corlimit=0.97)
##              shall
## habitations   0.99
## lodge         0.99
## reprover      0.99
## travailing    0.99
## prevailed     0.98
## principality  0.98
## retained      0.98
## selfwilled    0.98
## springing     0.98
## testimonies   0.98
## untimely      0.98
## baldness      0.97
## bundle        0.97
## confounded    0.97
## constrained   0.97
## heed          0.97
## incline       0.97
## recompensed   0.97
## reproached    0.97
## thus          0.97

I was particularly intrigued by the appearance of the word “reprover,” which is defined as one who criticizes or corrects, disapproves or censures, disproves or refutes. In the bible, the original Hebrew word appears to be “yissor,” or “one who reproves, faultfinder.” source When I took my biblical literacy class, I found out that the name Satan comes from the Hebrew word satan, which translates to adversary or enemy. From my Roman Catholic upbringing, I always felt that those who intentional sought to find fault with God were doomed to Hell. I thought I run a findAssocs on “reprover” to see if it correlated highly with “Satan”.

findAssocs(dtm1, "lord", corlimit=0.7)
##           lord
## leopards  0.80
## lips      0.78
## kings     0.77
## buckets   0.76
## clouds    0.76
## eagles    0.76
## gardens   0.76
## noses     0.76
## air       0.75
## lilies    0.74
## master    0.74
## sides     0.74
## cheese    0.73
## fathers   0.73
## hills     0.73
## cursed    0.72
## lions     0.72
## nuts      0.72
## powder    0.72
## rewarded  0.72
## simple    0.72
## bones     0.71
## goats     0.71
## highest   0.71
## kicked    0.71
## offerings 0.71
## rocks     0.71
## ships     0.71
## spindle   0.71
## uprising  0.71
## mounts    0.70
## sapphire  0.70
## valleys   0.70

Unfortunately that theory was not fruitful. This “baldness” appears again though; BibleHub states that this is likely due to the ancient Egyptian habit of constantly shaving the head, only allowing hair growth when in mourning. In contrast, Jews would shave the head to mourn.

Lord

The associations for “lord” were by far the most surprising to see. I mean, leopards??

findAssocs(dtm1, "lord", corlimit=0.7)
##           lord
## leopards  0.80
## lips      0.78
## kings     0.77
## buckets   0.76
## clouds    0.76
## eagles    0.76
## gardens   0.76
## noses     0.76
## air       0.75
## lilies    0.74
## master    0.74
## sides     0.74
## cheese    0.73
## fathers   0.73
## hills     0.73
## cursed    0.72
## lions     0.72
## nuts      0.72
## powder    0.72
## rewarded  0.72
## simple    0.72
## bones     0.71
## goats     0.71
## highest   0.71
## kicked    0.71
## offerings 0.71
## rocks     0.71
## ships     0.71
## spindle   0.71
## uprising  0.71
## mounts    0.70
## sapphire  0.70
## valleys   0.70

I actually did not find a good reason for the leopards. The animal would have been native to biblical lands, but I’m not sure why it’s commonly associated with “lord.”

Furthermore, I remain perplexed about the lips. I searched for those associations:

findAssocs(dtm1, "lips", corlimit=0.97)
##               lips
## valleys       0.98
## bones         0.97
## camels        0.97
## dove          0.97
## horses        0.97
## people        0.97
## simple        0.97
## spears        0.97
## streets       0.97
## yoke          0.97

but I’m not sure that is very insightful… I think lips is just commonly used as a rhetorical device.

But this does offer me a great opportunity for a quick aside!

Quick Aside: Song of Songs

The Song of Songs, also known as the Song of Solomon, the Canticles of Canticles, or just Canticles, is one of the scrolls in the Old Testament. I personally refer to it as a book of erotic love poetry, but that is somewhat melodramatic of me. However, it is a back-and-forth poem of two lovers and is unique in the bible for its celebration of sexual love. I immediately thought of this section when I saw “lips,” as I was sure there had to be some vivid imagery there.

First, Song of Solomon 4:3 - “Your lips are like a scarlet ribbon; your mouth is lovely. Your temples behind your veil are like the halves of a pomegranate.”

Next, Song of Solomon 4:11 - “Your lips drop sweetness as the honeycomb, my bride; milk and honey are under your tongue. The fragrance of your garments is like the fragrance of Lebanon.”

Pretty sensual if I say so….

God

findAssocs(dtm1, "god", corlimit=0.98)
##           god
## damascus 0.99
## figs     0.99
## fourteen 0.99
## son      0.99
## blood    0.98
## lion     0.98
## rain     0.98
## sword    0.98
## tooth    0.98

Damascus is the capital of Syria and is one of the oldest continuousy inhabited cities in the world. The city was surrounded by the Aramaean state of Aram-Damascus from the late 12th century BCE to 732 BCE. The bible does chronicle a significant portion of the land’ history, mostly its interaction with Isarael. This would probably explain why it commonly appears with the word “God.” As another quick aside, there is a horrific prophecy about the destruction of Damascus in Isaiah 17, which starts off with “See, Damascus will no longer be a city but will become a heap of ruins.” There are people who are convinced that this prophecy is starting to come true currently, due to the modern conflict between Israel and Syria. I will refrain from commenting on that here….

As for the figs, there’s even a whole page on Wikipedia dedicated to the subject! Link here. The earliest reference to the fig tree is when Eve eats from the forbidden tree and feels such shame that she clothes herself and Adam in fig leaves. Otherwise, figs are commonly referenced in parables or as rheotical devices.

Conclusions

Associations are super fun! However, most of this blog post involved me having to look at other research to get any meaningful insight into why those terms would be correlated. In my opinion, associations are another great way to get a high-level grasp of the data, but are just a piece in the puzzle of getting the full story.