Text Mining the Bible with R, Pt. 1
This is the first post in a series using text analysis on the Bible. I’ll be going over the first exploratory steps I typically follow for text mining.
This section gives some background on my relationship with the Bible; skip down if you'd just like to get to the code.
I took a biblical literacy class with Prof. Timothy Beal during my sophomore year at CWRU, and to this day I’m not sure any other course has impacted how I look at the world as much as RLGN 209 did. While I grew up in a Roman Catholic household, I never really read the Bible, never quite noticed the level of cultural sway, never bothered to question what is arguably one of the most influential pieces of literature in history.
The Bible is full of contradictions, stories of a vengeful and often inexplicable God, archaic rules for living, and abject statements that God shouldn’t be questioned. To me, this had made religion feel impossibly difficult to grasp. I ask probably too many questions at all times, and have the bad habit of thinking through things in black and white. How could I have faith in something rooted in a text that’s so twisted and obsolete? But even as a believer married to a minister, Prof. Beal’s angle was that the Bible should be a place for questions that may never be answered. Religion can be a place for questions that may never be answered. And it’s completely fine that those questions never get answered, because that in itself can be faith as well.
(In the interest of full disclosure, I no longer affiliate with the Roman Catholic church, nor do I believe in a centralized God, but I would consider myself spiritual.)
I love data science because it allows me to ask as many questions as I want and be able to find the answers on my own. I love the Bible because it’s fascinating and heavy and a work of art. Might as well see what happens when I combine the two.
The Old Testament was originally written in Biblical Hebrew, with certain passages in Biblical Aramaic. The New Testament was written in Koine Greek, the common language of Greece during Hellenistic and Roman antiquity. The Bible has since been translated into 531 languages. In English alone, there exist quite a few different translations because of diverging linguistic, philosophical, and theological considerations. E.g., what to do with idioms? What kind of angle is the translator trying to convey? What about concepts that are now completely foreign?
I decided to try to use as many translations as possible, for two reasons: the tm package has several functions that only work with more than one document, and using multiple translations would hopefully smooth over linguistic quirks that might be present in just one version.
For my corpus, I included:

1. New International Version
2. American Standard Version
3. Bible in Basic English
4. God’s Living Word
5. LXX2012: Septuagint in American English
6. Revised Version with Apocrypha (1895)
7. The World English Bible
8. The World Messianic Bible
9. The King James Version, which is widely considered to be the most influential of the translations
The text files, along with the analysis, are checked in here at my GitHub.
I used the tm package to assemble my corpus. I performed standard text transformations: convert everything to lowercase, remove numbers, remove punctuation, remove common stopwords, strip whitespace, and get rid of special characters. I also applied stemming, which cuts words down to their base form.
The final result of the cleaning is a term document matrix. (You can also use a document term matrix; term document and document term matrices are transposes of each other.) A term document matrix is a mathematical matrix that records the frequency of each term across a collection of documents.
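Roughly, the cleaning pipeline and the resulting matrix look like this. The corpus below is a tiny toy stand-in for the nine translation files, and the exact transformation order is a sketch of the standard tm recipe rather than my verbatim script:

```r
library(tm)        # text mining framework
library(SnowballC) # Snowball stemmer used by stemDocument()

# Tiny toy corpus standing in for the nine translation files.
docs <- VCorpus(VectorSource(c(
  "In the beginning God created the heaven and the earth.",
  "In the beginning God made the heavens and the earth."
)))

docs <- tm_map(docs, content_transformer(tolower))       # lowercase
docs <- tm_map(docs, removeNumbers)                      # strip digits
docs <- tm_map(docs, removePunctuation)                  # strip punctuation
docs <- tm_map(docs, removeWords, stopwords("english"))  # drop common stopwords
docs <- tm_map(docs, stripWhitespace)                    # collapse extra spaces
docs <- tm_map(docs, stemDocument)                       # stem words to their base

tdm <- TermDocumentMatrix(docs)  # rows = terms, columns = documents
inspect(tdm)
```

With the real corpus, each element of the `VectorSource` would be one translation read in from its text file.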
```
## <<TermDocumentMatrix (terms: 34812, documents: 9)>>
## Non-/sparse entries: 119507/193801
## Sparsity           : 62%
## Maximal term length: 35
## Weighting          : term frequency (tf)
```
Most and Least Frequent Words
I like to look at the most and least frequent words in a corpus first, because it gives you a very quick picture of what’s going on.
```
##  will shall  lord   god  said  unto 
## 57072 52639 49610 33122 33046 28684
```
```
##  zoreah   zorit     zue    zuza  zuzims zuzites 
##       1       1       1       1       1       1
```
To look at the twenty most frequent words:
```
##   will  shall   lord    god   said   unto    man israel   king    son 
##  57072  52639  49610  33122  33046  28684  20219  20159  19368  19315 
##    one   thou people   came   come   land    men    thy  house   made 
##  18901  17445  16135  15941  15745  14752  14676  14510  14095  13612
```
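The frequency counts come straight out of the term document matrix: sum each term's row, then sort. A minimal sketch, using a toy matrix in place of the real `tdm`:

```r
library(tm)

# Toy corpus; in the real analysis `tdm` is the cleaned nine-translation matrix.
docs <- VCorpus(VectorSource(c("lord god said lord will",
                               "god will shall lord said")))
tdm  <- TermDocumentMatrix(docs)

freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)  # total count per term
head(freq, 20)  # most frequent terms
tail(freq, 6)   # least frequent terms
```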
Here’s a plot of all words with a frequency greater than 10000.
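The plot can be produced along these lines; the counts below are a subset of the output above, and ggplot2 is an assumption on my part (base `barplot` works just as well):

```r
library(ggplot2)

# A subset of the counts printed above, for illustration.
freq <- c(will = 57072, shall = 52639, lord = 49610,
          god  = 33122, said  = 33046, unto = 28684)

top <- data.frame(word = names(freq), count = unname(freq))
top <- subset(top, count > 10000)  # keep only words above the cutoff

ggplot(top, aes(x = reorder(word, count), y = count)) +
  geom_col() +
  coord_flip() +  # horizontal bars are easier to read with many terms
  labs(x = NULL, y = "Frequency",
       title = "Words occurring more than 10,000 times")
```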
```
##  "also"  "came"  "children"  "come"    "day"   "give"
##  "god"   "hand"  "house"     "israel"  "king"  "land"
##  "let"   "lord"  "made"      "man"     "may"   "men"
##  "now"   "one"   "people"    "put"     "said"  "shall"
##  "son"   "thee"  "thou"      "thy"     "unto"  "upon"
##  "went"  "will"
```
Here’s a word cloud of all words with a minimum frequency of 5000.
I (like most humans) like word clouds because they’re easy to digest: we can quickly locate the most common terms and get a fast grasp on what’s going on.
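A sketch of the word cloud call, using the wordcloud package; the frequencies here are a toy subset of the real frequency vector:

```r
library(wordcloud)
library(RColorBrewer)

# Toy subset of the real term-frequency vector.
freq <- c(lord = 49610, god = 33122, said = 33046, unto = 28684,
          israel = 20159, zuzims = 1)

set.seed(42)  # the layout is random, so fix the seed for reproducibility
wordcloud(names(freq), freq,
          min.freq     = 5000,                  # drop words below the cutoff
          colors       = brewer.pal(6, "Dark2"),
          random.order = FALSE)                 # biggest words in the middle
```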
A dendrogram is a hierarchical diagram that shows similarities between objects. When I look at other text mining posts on the internet, I usually see people remove the sparse terms of the document term matrix and use that for their cluster dendrogram. I was still getting a mammoth number of entries even after removing terms with a sparsity threshold of 0.95, which drops terms that are absent from more than 95% of the documents. I decided to just make the diagram with the 35 most common words.
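Roughly, the dendrogram comes from hierarchical clustering on distances between term vectors. A minimal sketch on a toy matrix (with the real `tdm`, you would first subset to the 35 most common terms, or call `removeSparseTerms`):

```r
library(tm)

# Toy corpus; with the real matrix, keep only the most common terms first.
docs <- VCorpus(VectorSource(c("lord god said israel king",
                               "king israel lord land people",
                               "god said people land lord")))
tdm <- TermDocumentMatrix(docs)

m  <- as.matrix(tdm)
d  <- dist(m)                        # Euclidean distance between term rows
hc <- hclust(d, method = "ward.D2")  # agglomerative clustering
plot(hc, main = "Cluster dendrogram of common terms")
```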
I always look at topic models, but I’m never able to get anything too meaningful out of them. I’m assuming that’s because I don’t fully understand the algorithms behind the method. But here are five topics and the first eight terms of each topic.
```
##      Topic 1 Topic 2 Topic 3  Topic 4  Topic 5   
## [1,] "will"  "shall" "will"   "shall"  "will"    
## [2,] "lord"  "unto"  "lord"   "lord"   "shall"   
## [3,] "said"  "lord"  "god"    "will"   "said"    
## [4,] "put"   "thou"  "one"    "god"    "yahweh"  
## [5,] "give"  "thy"   "king"   "said"   "god"     
## [6,] "god"   "god"   "son"    "israel" "israel"  
## [7,] "made"  "said"  "people" "king"   "children"
## [8,] "come"  "will"  "said"   "upon"   "man"     
```
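For completeness, a sketch of the topic-model fit with the topicmodels package. `LDA()` wants a document term matrix (documents as rows, the transpose of the TDM above); the real run uses k = 5, while the toy corpus here only supports 2:

```r
library(tm)
library(topicmodels)

# Toy corpus; the real fit used the full nine-translation matrix with k = 5.
docs <- VCorpus(VectorSource(c("lord god said israel king",
                               "thou thy unto shall will",
                               "people land son children came")))
dtm <- DocumentTermMatrix(docs)  # LDA() expects documents as rows

set.seed(1234)          # topic fitting is randomly initialised
lda <- LDA(dtm, k = 2)  # two toy topics; the post uses k = 5
terms(lda, 8)           # first eight terms of each topic
```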
To me, this is the point where I have a better picture of the information and can start formulating more detailed questions to ask. If you actually got this far in this post - stay tuned, I’ll be going over associations next.