First Steps with R / Part I
Working with R on "Look Homeward, Angel!" by Thomas Wolfe
My first steps as a beginner with the programming language R are based on the German translation of the novel "Look Homeward, Angel!", written autobiographically by Thomas Wolfe in 1929.
Why am I, someone artistically engaged with the Latvian-born colour field painter Mark Rothko, analysing the German translation of the autobiographical novel Schau heimwärts, Engel! by Thomas Wolfe? What does one have to do with the other?
The story is short: when I began to take an interest in the biography of Mark Rothko, who was born Marcus Rothkowitz in Daugavpils, Latvia, I also began to work through his life in text, along with the points of contact with the reality that may have surrounded him. By chance I saw the film Genius and, because of the lasting impression it made on me, started reading books by Thomas Wolfe. I associated passages from Wolfe's novels with the actual and hypothetical life experiences of Mark Rothko. I think Wolfe found the perfect words for experiences Rothko may have had. Wolfe's language flows like lava and, like Mark Rothko's paintings, is imbued with an immense power; with it he attributes a massive influence on our lives to the things around us. His linguistic richness has impressed me deeply.
Thomas Wolfe
Thomas Wolfe was an American writer, born in Asheville, North Carolina, in 1900 as the youngest of eight children, the son of an Irish-Scottish mother and a Pennsylvania-German stonemason. Wolfe died of tuberculosis of the brain in 1938, shortly before his 38th birthday.
There is Thomas Wolfe, a boy of, I believe, thirty years or less, whose only novel Look Homeward, Angel can be placed alongside our best literary works, a colossal creation of deep lust for life.
While reading the book, I came across many terms that are no longer known or understood today. Some are linguistic peculiarities of the translator; others are references to literature, theatre, operetta or music that are foreign to us today because they no longer appear in our educational canon, or are at most quoted without our knowing what they refer to. To name just a few examples, quite indiscriminately: to percolate, Gargantua, Lord Fauntleroy curls or bucolic wilderness. I will provide explanations for many of these terms in another article, hopefully coming soon.
This finally led me to perform my first text analysis with R on Schau heimwärts, Engel! by Thomas Wolfe. The underlying text, the German translation by Hans Schiebelhuth, who was posthumously honoured with the Georg Büchner Prize, is available online in the Gutenberg project.
This documentation of my first steps with R is explicitly aimed at beginners and should encourage them to use R as a basis for a possibly even better understanding of the text. I am still at the very beginning and let myself drift where my whimsy and my research lead me.
The text
The beginning
In the first step I created a single file from the forty chapters of the book.
I named this file schauheimwaertsengel.txt.
For now I don't want to work with RStudio, only with the plain R console.
What's R?
R is a system for statistical calculations and graphics. It consists of a programming language and a runtime environment with graphics, a debugger, access to certain system functions, and the ability to execute programs stored in script files.
R has a modular structure, which means that in addition to the packages with basic functions already included, more than 2500 further packages can be installed.
Where do I get R?
R can be downloaded from the R project page. I have installed the 64-bit version.
First steps
Let me start by saying that I am an absolute beginner in R. I am happy about every comment that helps me turn a beginner's cumbersome approaches into smarter ways of working and thinking. Since I am not a designing programmer but a programming designer, I have no real programming background, which sometimes makes it difficult for me to adapt examples found online to my purposes.
My first steps were based on this video series from Jalayer Academy. Thank you very much for this very understandable tutorial.
After the installation I start R and find this screen: The R console.
In R we work with objects and commands. First we find out where our working directory is. For this we use the command getwd() - get working directory.
We can change this with setwd() (set working directory). All my R projects are located on drive C: in the folder R. Since I am working on texts by Wolfe, I named the directory r-wolfe. When setting a working directory, make sure that the directory exists.
So we write and then check again:
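The commands for this step might look like the following minimal sketch. The path C:/R/r-wolfe is the one named above; the dir.exists() guard and the variable names project_dir and current_dir are assumptions added here so that the snippet does not stop with an error on a machine where that folder is missing.

```r
# Show the current working directory.
getwd()

# Folder named in the text; adjust the path to your own machine.
project_dir <- "C:/R/r-wolfe"

# setwd() stops with an error if the folder does not exist, so check first.
if (dir.exists(project_dir)) {
  setwd(project_dir)
}

# Check again: this prints the new working directory if the change succeeded.
current_dir <- getwd()
print(current_dir)
```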
Perfect. All files that we load or save as objects are now in C:/R/r-wolfe. With dir() we can display all files in the working directory.
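As a small sketch of the listing step: dir() is a short form of list.files(), and its pattern argument takes a regular expression. The variable names files and txt_files are only chosen for illustration.

```r
# List all files in the current working directory.
files <- dir()
print(files)

# Restrict the listing to text files via a regular expression.
txt_files <- dir(pattern = "\\.txt$")
print(txt_files)
```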
The directory contains two files: auszug.txt and the full text schauheimwaertsengel.txt. I apply all commands to both texts, but the text and code examples are easier to show with the short excerpt in auszug.txt. The complete text of Schau heimwärts, Engel! contains more than 1.2 million characters and almost 10,000 lines. For copyright reasons, I ask everyone to compile this text themselves.
To be able to edit the text in R, I load it as an object in R. At first I had some problems with the umlauts.
First of all I would like to display the text of auszug.txt once in the console. Note that R is case-sensitive, so pay attention to upper and lower case in commands!
Let's have a look at it.
The command displays the content of the loaded document directly in the console. It has 3 lines, and we see immediately that the special characters are not displayed correctly because the UTF-8 encoding was not recognized.
But if we specify the correct encoding, the problem is quickly solved:
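A minimal sketch of such a call, assuming the file is read with readLines() and its encoding argument. Since the real excerpt is not reproduced here, the snippet first writes a small stand-in file; sample_lines, sample_path and lines_utf8 are hypothetical names.

```r
# Stand-in for auszug.txt: three lines with umlauts, written as UTF-8 bytes
# to a temporary file (\u00e4 is the escape for "ä").
sample_lines <- c("Schau heimw\u00e4rts, Engel!", "", "Ein Engl\u00e4nder kam sp\u00e4ter.")
sample_path <- tempfile(fileext = ".txt")
writeLines(enc2utf8(sample_lines), sample_path, useBytes = TRUE)

# Tell readLines() that the file is UTF-8 so the umlauts are marked correctly.
lines_utf8 <- readLines(sample_path, encoding = "UTF-8")
print(lines_utf8)
```

With the real file in the working directory, the call would presumably be readLines("auszug.txt", encoding = "UTF-8").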
That's more like it!
To store the text from the file auszug.txt in an object named auszug, we use the assignment operator <-. The object can then be displayed in the console simply by typing auszug.
We do the same with the file schauheimwaertsengel.txt, which we load into the object wolfetext.
Remove empty lines with paste() and collapse
As we have seen above, the object auszug contains some empty lines, which we remove with this command.
We join the lines with paste(), using its collapse argument with a space character as the separator. Any duplicate spaces are removed later. The result is now one line of text.
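Joining lines this way can be sketched as follows. collapse is an argument of paste() rather than a function of its own; the three sample lines are made up and stand in for the real excerpt.

```r
# Made-up stand-in for the excerpt: two lines of text with an empty line between.
auszug <- c("Erste Zeile.", "", "Zweite Zeile.")

# Join all elements into a single string, separated by one space each.
auszug <- paste(auszug, collapse = " ")

# The empty line leaves a double space behind, which can be cleaned up later.
print(auszug)
length(auszug)
```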
I won't display the object wolfetext here, but I proceed with it exactly as with auszug: I join all lines with a space via the collapse argument and assign the result back to the object wolfetext.
Remove punctuation marks and special characters with gsub()
Besides many so-called stop or filler words, our texts contain punctuation marks. These can be removed most easily with regular expressions. Regular expressions are a magical thing in themselves; for introductions and reference lists, see: Regular Expressions in stringr, R Regular Expressions, Regular Expressions with The R Language.
First we remove all punctuation and special characters and replace them with spaces.
And we see: All punctuation and special characters have been successfully removed and replaced by spaces.
With the next command we assign the results of the two commands to our two objects auszug and wolfetext. Both now contain the text without punctuation and special characters.
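A minimal sketch of this step, assuming gsub() with the POSIX class [[:punct:]]; the exact pattern used for the real text is not shown in the post, and the sample sentence and variable names are made up.

```r
# Made-up sample sentence with several punctuation marks.
text <- "Hallo, Welt! Wie geht's? (Gut.)"

# Replace every punctuation character with a space.
no_punct <- gsub("[[:punct:]]", " ", text)
print(no_punct)
```

Typographic quotation marks in a real German text may need to be added to the pattern explicitly, depending on the locale.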
All digits are removed as well.
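Removing digits could look like the following sketch. The class [[:digit:]] and the replacement by a space are assumptions; the sample string is made up.

```r
# Made-up sample with two numbers.
text <- "der roman erschien 1929 in 40 kapiteln"

# Replace every digit with a space.
no_digits <- gsub("[[:digit:]]", " ", text)
print(no_digits)
```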
To make the texts easier to process, all words are now converted to lower case using the function tolower().
And we now assign the command to auszug and wolfetext again.
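As a short sketch with a made-up sample string: the base R function is spelled tolower(), all in lower case.

```r
# Made-up sample with mixed case.
text <- "Der Engel Sah Ihn"

# Convert to lower case and assign the result back to the same object.
text <- tolower(text)
print(text)
```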
Now we remove all strings of length 1 from our texts and assign the result directly back to the object auszug. We proceed in the same way with wolfetext. This is done with the RegEx \\b[A-z]\\b{1}.
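The idea can be sketched on a made-up ASCII sample. Note that the range [A-z] also covers the few punctuation characters that sit between Z and a in ASCII; the class [[:alpha:]] used below is a swapped-in variant, not the pattern from the post, and the clean-up of repeated spaces is an extra step added for readability.

```r
# Made-up sample containing three single-letter strings.
text <- "er sah a b und ging o weiter"

# \\b marks a word boundary, so only letters standing alone are matched.
no_single <- gsub("\\b[[:alpha:]]\\b", " ", text)

# Collapse the runs of spaces left behind by the replacements.
no_single <- gsub(" +", " ", no_single)
print(no_single)
```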
Now I would like to save the two objects auszug and wolfetext before further processing. This is done with the function write.table().
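A minimal sketch of saving an object as plain text with write.table(). The arguments shown are assumptions, since the post does not list them. The file name auszug-lower.txt is the one used in this article, but the sketch writes to a temporary folder so that nothing in the working directory is overwritten.

```r
# Made-up stand-in for the cleaned excerpt.
auszug <- "schau heimwaerts engel"

# Write to a temporary folder; use "auszug-lower.txt" alone to save
# into the working directory instead.
out_path <- file.path(tempdir(), "auszug-lower.txt")

# Suppress quotes, row names and column names so only the text is written.
write.table(auszug, file = out_path, quote = FALSE,
            row.names = FALSE, col.names = FALSE, fileEncoding = "UTF-8")

# Read the file back to check what was saved.
saved <- readLines(out_path, encoding = "UTF-8")
print(saved)
```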
Our working directory now contains four files. The files auszug-lower.txt and schauheimwaertsengel-lower.txt contain the original text in lower case, without blank lines, punctuation marks, numbers or single-character strings.
Packages in R
The stringr package and a little bit of statistics.
R packages are collections of functions and data sets developed by the community. They enhance the performance of R by improving existing basic R functionality or adding new functionality. (...) Recently the official repository (CRAN) has reached 10,000 published packages, and many more are publicly available on the Internet. (Source: https://www.datacamp.com/community/tutorials/r-packages-guide).
Since I want to count certain words, I installed the package stringr first. You can install packages simply via the Packages menu or with the install.packages() function.
In some cases we have to specify in a popup window which server we want to download the package from and where it should be installed. With library(), an installed package is loaded into the workspace. Even if the package is already installed, it has to be loaded once in every session.
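Installing and loading might be sketched like this. The requireNamespace() check is an addition that skips the download when stringr is already present; it is not taken from the post.

```r
# Install stringr only if it is not available yet (needs an internet connection
# and may ask for a CRAN mirror in an interactive session).
if (!requireNamespace("stringr", quietly = TRUE)) {
  install.packages("stringr")
}

# Load the package; this has to be repeated in every new R session.
library(stringr)
```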
How often er or sie?
Now we can let some search functions loose on our texts. I start by searching for the strings "er" and "sie" (German for "he" and "she"), first in the object auszug so that the result is easier to check.
We get 12 hits. If we look at auszug, we see that später, engländer and, for example, segler also contain the string er. If we add a space before and after the search term, we only find the word er.
We get 3 hits, which is correct. For wolfetext there are 4946 hits for er and 3901 hits for sie. That is a ratio of 55.9% to 44.1%. Not necessarily a surprising result for an autobiographical novel written by a male author in 1929.
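The difference between the two searches can be sketched with str_count() from stringr on a made-up sentence; the counts below belong to this sample, not to the real excerpt. The last lines recompute the ratio from the two hit counts reported above.

```r
library(stringr)

# Made-up sample (\u00e4 is the escape for "ä"): "er" occurs as a word
# and inside später, der and engländer.
auszug <- "da kam er sp\u00e4ter als der engl\u00e4nder und sie sah dass er ging"

# Substring search: counts every occurrence of the two letters.
hits_substring <- str_count(auszug, "er")    # 5

# Search with surrounding spaces: counts only the stand-alone word.
hits_word <- str_count(auszug, " er ")       # 2

# Ratio of the hit counts reported for the full text.
er_hits <- 4946
sie_hits <- 3901
share <- round(100 * c(er_hits, sie_hits) / (er_hits + sie_hits), 1)
print(share)                                 # 55.9 44.1
```

One limitation of the space-based pattern: a word at the very beginning or end of the text has no space on one side and is therefore not counted. A word-boundary pattern such as "\\ber\\b" is a possible alternative.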
The search can also be combined:
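One way to combine both searches is the alternation operator | inside a single pattern. This is a sketch under that assumption, again on a made-up sentence; the post does not show which form was actually used.

```r
library(stringr)

# Made-up sample sentence (\u00e4 is the escape for "ä").
auszug <- "da kam er sp\u00e4ter als der engl\u00e4nder und sie sah dass er ging"

# Count stand-alone "er" or "sie" in one pass.
hits_both <- str_count(auszug, " er | sie ")   # 3
print(hits_both)
```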
In the next step I would like to examine quantitatively how often the protagonists occur, and to find out whether I can detect a tendency towards certain modal verbs in the text. Furthermore, I would like to remove all stop or filler words from the text and finally create a word cloud from the whole text.
I will do this in the second part, which will hopefully be published here this week.
If you have suggestions or criticism for me, leave them in the comments! And if you have some good tips for R tutorials that are specifically about text analysis and text mining, please leave them in the comments as well. I also love to read what you have already done with R!
tl;dr
My first steps with the programming language R are based on the German translation of the novel Schau heimwärts, Engel!, written autobiographically by Thomas Wolfe in 1929.