Emeline Liu



Text Mining the Bible with R, Pt. 1

Introduction

This is the first post in a series using text analysis on the Bible. I’ll be going over the first exploratory steps I typically follow for text mining.

Biblical Literacy

This section gives some background on my relationship with the Bible; please skip down if you’d just like to get to the code.

I took a biblical literacy class with Prof. Timothy Beal during my sophomore year at CWRU, and to this day I’m not sure any other course has impacted how I look at the world as much as RLGN 209 did. While I grew up in a Roman Catholic household, I never really read the Bible, never quite noticed the level of cultural sway, never bothered to question what is arguably one of the most influential pieces of literature in history.

The Bible is full of contradictions, stories of a vengeful and often inexplicable God, archaic rules for living, and abject statements that God shouldn’t be questioned. To me, this made religion feel impossibly difficult to grasp. I probably ask too many questions, and I have the bad habit of thinking about things in black and white. How could I have faith in something rooted in a text that’s so twisted and obsolete? But Prof. Beal, himself a believer married to a minister, took the angle that the Bible should be a place for questions that may never be answered. Religion can be a place for questions that may never be answered. And it’s completely fine if those questions never get answered, because that in itself can be faith as well.

(In the interest of full disclosure, I no longer affiliate with the Roman Catholic church, nor do I believe in a centralized God, but I would consider myself spiritual.)

I love data science because it allows me to ask as many questions as I want and be able to find the answers on my own. I love the Bible because it’s fascinating and heavy and a work of art. Might as well see what happens when I combine the two.

Data sources

The Old Testament was originally written in Biblical Hebrew, with certain passages in Biblical Aramaic. The New Testament was written in Koine Greek, the common language of Greece during Hellenistic and Roman antiquity. The Bible has since been translated into 531 languages. In English alone, there exist quite a few different translations because of diverging linguistic, philosophical, and theological considerations: what to do with idioms? What kind of angle is the translator trying to convey? What about concepts that are now completely foreign?

I decided to use as many translations as possible, for two reasons: the tm package has several functions that only work with more than one document, and using multiple versions would hopefully smooth over linguistic quirks that might be present in just one translation.

For my corpus, I included:

1. New International Version
2. American Standard Version
3. Bible in Basic English
4. God’s Living Word
5. LXX2012: Septuagint in American English
6. Revised Version with Apocrypha (1895)
7. The World English Bible
8. The World Messianic Bible
9. The King James Version, which is widely considered to be the most influential of the translations

The text files, along with the analysis, are checked in here at my GitHub.

Cleaning

I used the tm package to assemble my corpus. I performed standard text transformations: convert everything to lowercase, remove numbers, remove punctuation, remove common English stopwords, strip whitespace, and get rid of special characters. I also applied stemming, which cuts words down to their base form (e.g. “believe” and “believed” both become “believ”).

library(tm)
library(plyr)
library(httr)
library(stringr)
#The rest of these libraries are used for visualization and modeling
library(ggplot2)
library(graph)
library(Rgraphviz)
library(wordcloud)
library(FactoMineR)
library(cluster)
library(topicmodels)
library(SnowballC)

cname <- "C:/Users/Emeline/Documents/R_projects/bible/text"

corpus <- Corpus(DirSource(cname))

#Cleaning
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, stripWhitespace)

#Remove special characters
removeSpecChar <- function(x) gsub("[^[:alnum:]///' ]", "", x)
corpus <- tm_map(corpus, removeSpecChar)

#convert to a plain text file
corpus <- tm_map(corpus, PlainTextDocument)

#Create a term document matrix, remove words with fewer than 3 characters
tdm1 <- TermDocumentMatrix(corpus, control=list(wordLengths=c(3,Inf)))
#Create document term matrix, remove words with fewer than 3 characters
dtm1 <- DocumentTermMatrix(corpus, control=list(wordLengths=c(3,Inf)))

The final result of the cleaning is a term document matrix (you can also use a document term matrix; the two are transposes of each other). A term document matrix is a mathematical matrix that records the frequency of each term across a collection of documents, with one row per term and one column per document.

tdm1
## <<TermDocumentMatrix (terms: 34812, documents: 9)>>
## Non-/sparse entries: 119507/193801
## Sparsity           : 62%
## Maximal term length: 35
## Weighting          : term frequency (tf)
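
To get a feel for what the matrix actually holds, here’s a quick peek at the counts of a couple of terms across the nine translations. This is just a sketch: I’m assuming the stemmed terms “lord” and “israel” are in the matrix (they show up in the frequent-terms list further down), and m0 is just a throwaway name I’m using here.

#Peek at two rows of the term document matrix: one row per (stemmed) term,
#one column per translation
m0 <- as.matrix(tdm1)
m0[c("lord", "israel"), ]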

Most and Least Frequent Words

I like to look at the most and least frequent words in a corpus first, because it gives you a very quick picture of what’s going on.

#organize by frequency
dtmsort <- sort(colSums(as.matrix(dtm1)), decreasing=TRUE)
#most frequent
head(dtmsort)
##  will shall  lord   god  said  unto 
## 57072 52639 49610 33122 33046 28684
#least frequent
tail(dtmsort)
##  zoreah   zorit     zue    zuza  zuzims zuzites 
##       1       1       1       1       1       1
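
The tail is full of terms that show up exactly once. Out of curiosity, you can count how many of those there are straight from the dtmsort vector above; the exact number will depend on your cleaning choices.

#Number of terms that appear exactly once across the whole corpus
sum(dtmsort == 1)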

To look at the twenty most frequent words:

dtmsort[1:20]
##   will  shall   lord    god   said   unto    man israel   king    son 
##  57072  52639  49610  33122  33046  28684  20219  20159  19368  19315 
##    one   thou people   came   come   land    men    thy  house   made 
##  18901  17445  16135  15941  15745  14752  14676  14510  14095  13612

Here’s a plot of all words with a frequency of at least 10,000.

#plot frequent words
(freq.terms <- findFreqTerms(tdm1, lowfreq=10000))
##  [1] "also"     "came"     "children" "come"     "day"      "give"    
##  [7] "god"      "hand"     "house"    "israel"   "king"     "land"    
## [13] "let"      "lord"     "made"     "man"      "may"      "men"     
## [19] "now"      "one"      "people"   "put"      "said"     "shall"   
## [25] "son"      "thee"     "thou"     "thy"      "unto"     "upon"    
## [31] "went"     "will"
term.freq <- rowSums(as.matrix(tdm1))
term.freq <- subset(term.freq, term.freq>=10000)
df1 <- data.frame(term=names(term.freq),freq=term.freq)

#Creates plot
ggplot(df1, aes(x=term,y=freq)) + geom_bar(stat="identity") + xlab("Terms") + ylab("Count") + coord_flip()
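
One optional tweak, not part of the plot above: ggplot2 orders the terms alphabetically by default, so if you’d rather sort the bars by count, wrapping the term in reorder() should do it.

#Same plot, with bars ordered by frequency instead of alphabetically
ggplot(df1, aes(x=reorder(term, freq), y=freq)) + geom_bar(stat="identity") + xlab("Terms") + ylab("Count") + coord_flip()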

Word Cloud

Here’s a word cloud of all words with a minimum frequency of 5,000.

#word cloud
m1 <- as.matrix(tdm1)
word.freq <- sort(rowSums(m1), decreasing=T)
wordcloud(words=names(word.freq), freq=word.freq, min.freq=5000, random.order=F, scale=c(5, .1), colors=brewer.pal(9, "Spectral"))

I (like most humans) like word clouds because they’re easy to digest: you can quickly locate the most common terms and get a fast grasp on what’s going on.

Cluster Dendrogram

A dendrogram is a hierarchical diagram that shows similarities between objects. When I look at other text mining posts on the internet, I usually see people remove the sparse terms from the document term matrix and build their cluster dendrogram on what’s left. I was still getting a mammoth number of terms even when removing everything with a sparsity above 0.95, i.e. dropping words that are missing from more than 95% of the documents; with only nine documents, that barely removes anything. I decided to just make the diagram with the 35 most common words (a sketch of the sparse-term approach follows the code below).

#Gets the 35 most frequent words
dtmsortcut <- dtmsort[1:35]
m2 <- as.matrix(dtmsortcut)
#Creates the distance matrix
distMatrix <- dist(scale(m2))
fit <- hclust(distMatrix, method="ward.D")
plot(fit, cex=0.67)
rect.hclust(fit, k=5) #5 clusters
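
For reference, the sparse-term approach I mentioned above would look something like the sketch below. It’s not what I used for the plot: with only nine very similar documents, even an aggressive sparsity threshold like 0.1 (which keeps only terms present in roughly 90% or more of the documents, i.e. all nine here) can still leave a lot of terms.

#Alternative: cluster on whatever terms survive sparse-term removal
tdm.dense <- removeSparseTerms(tdm1, sparse=0.1)
m.alt <- as.matrix(tdm.dense)
dist.alt <- dist(scale(m.alt))
fit.alt <- hclust(dist.alt, method="ward.D")
plot(fit.alt, cex=0.67)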

Topic Modeling

I always look at topic models, but I’m never able to get anything too meaningful out of them. I’m assuming that’s because I don’t yet understand the algorithms behind the method well enough. Still, here are five topics and the eight most probable terms in each.

#Remove documents with no remaining terms (LDA cannot handle all-zero rows)
dtm1 <- dtm1[rowSums(as.matrix(dtm1)) > 0, ]
lda <- LDA(dtm1, k=5) #find 5 topics
(term <- terms(lda, 8)) #8 most probable terms in each topic
##      Topic 1 Topic 2 Topic 3  Topic 4  Topic 5   
## [1,] "will"  "shall" "will"   "shall"  "will"    
## [2,] "lord"  "unto"  "lord"   "lord"   "shall"   
## [3,] "said"  "lord"  "god"    "will"   "said"    
## [4,] "put"   "thou"  "one"    "god"    "yahweh"  
## [5,] "give"  "thy"   "king"   "said"   "god"     
## [6,] "god"   "god"   "son"    "israel" "israel"  
## [7,] "made"  "said"  "people" "king"   "children"
## [8,] "come"  "will"  "said"   "upon"   "man"
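
To see how the topics line up with the individual translations, topicmodels can also report the most likely topic for each document. This is a quick sketch using the lda object above; the topic numbers may differ from run to run, since LDA is randomly initialized.

#Most likely topic for each translation (document)
topics(lda, 1)
#Full document-topic probability distribution
round(posterior(lda)$topics, 3)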

Conclusion

To me, this is the point where I have a better picture of the data and can start formulating more detailed questions to ask. If you actually got this far in the post, stay tuned: I’ll be going over associations next.