First Steps with R / Part I

Working with R on "Look Homeward, Angel!" by Thomas Wolfe

My first steps as a beginner with the programming language R are based on the German translation of the autobiographical novel "Look Homeward, Angel!", published by Thomas Wolfe in 1929.

Why am I, someone artistically engaged with the Latvian-born colour field painter Mark Rothko, analysing the German translation of the autobiographical novel Schau heimwärts, Engel! by Thomas Wolfe? What does one have to do with the other?

This is a short story: at the latest when I began to take an interest in the biography of Mark Rothko, who was born Marcus Rothkowitz in Daugavpils, Latvia, I began to work through his life in text, along with the points of contact with the reality that may have surrounded him. By chance I saw the film Genius and, because of the lasting impression it made on me, started reading books by Thomas Wolfe. I associated passages from Wolfe's novels with the actual and hypothetical life experiences of Mark Rothko. I think Wolfe found the perfect words for what Rothko may have experienced. Wolfe's language, flowing like lava, is imbued with an immense power, much like Mark Rothko's paintings; with it he attributes a massive influence on our lives to the things around us. His language and its richness have impressed me deeply.

Thomas Wolfe

Thomas Wolfe was an American writer, born in Asheville, North Carolina, in 1900 as the last of eight children, the son of an Irish-Scottish mother and a Pennsylvania-German stonemason. Wolfe died of tuberculosis of the brain in 1938, shortly before his 38th birthday.

There is Thomas Wolfe, a boy of, I believe, thirty years or less, whose only novel Look Homeward, Angel can be placed alongside our best literary works, a colossal creation of deep lust for life.

Source of the quote: Sinclair Lewis

While reading the book, I came across many terms that I no longer knew or understood. Some are linguistic peculiarities of the translator, others are references to literature, theatre, operetta or music that are foreign to us today because they no longer appear in our educational canon, or are quoted at most in passing without our knowing what they refer to. By way of example, and quite indiscriminately, I would just like to name percolate, Gargantua, Lord Fauntleroy curls and bucolic wilderness. I will provide explanations for many of these terms in another article, hopefully coming soon.

This finally led me to perform my first text analysis with R on the text Schau heimwärts, Engel! by Thomas Wolfe. The underlying German translation by Hans Schiebelhuth, who was posthumously honoured with the Georg Büchner Prize, is available online at the Gutenberg project.

This was new territory.
His heart grew light.

Source: Thomas Wolfe, Schau heimwärts, Engel!, Rowohlt Verlag GmbH, Hamburg 1954

This documentation of my first steps with R is explicitly aimed at beginners and should encourage them to use R as a basis for a possibly even better understanding of the text. I am still at the very beginning and let myself drift where my whimsy and my research lead me.

The text

The beginning

In the first step I created a single file from the forty chapters of the book.

I named this file schauheimwaertsengel.txt.
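For readers who already have R running, the merging step could also be done in R itself. The following is only a sketch under assumptions: the chapter file names (kapitel-01.txt, kapitel-02.txt, ...) and the demo folder are hypothetical, and two tiny dummy chapters stand in for the forty real ones.

```r
# Hypothetical demo folder with two dummy chapter files
ordner <- file.path(tempdir(), "kapitel-demo")
dir.create(ordner, showWarnings = FALSE)
writeLines("Erstes Kapitel", file.path(ordner, "kapitel-01.txt"))
writeLines("Zweites Kapitel", file.path(ordner, "kapitel-02.txt"))

# Collect the chapter files; zero-padded numbers keep them in reading order
dateien <- list.files(ordner, pattern = "^kapitel-[0-9]+\\.txt$", full.names = TRUE)

# Read every chapter and append the lines to one long character vector
gesamt <- unlist(lapply(dateien, readLines, encoding = "UTF-8"))

# Write all lines into a single file
writeLines(gesamt, file.path(ordner, "gesamt.txt"), useBytes = TRUE)
```

With real chapter files, only the folder and the file name pattern would need to be adapted.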

For now, I don't want to work with RStudio, only with the plain R console.

What's R?

R is a system for statistical calculations and graphics. It consists of a programming language and a runtime environment with graphics, a debugger, access to certain system functions, and the ability to execute programs stored in script files.

R has a modular structure, which means that, in addition to the base packages already included, thousands of further packages can be installed.

Where do I get R?

R can be downloaded from the R project page. I have installed the 64-bit version.

First steps

Let me start by saying I am an absolute beginner in R. I am happy about every comment that helps me to turn the cumbersome approaches of a beginner into smarter actions and ways of thinking. Since I am not a designing programmer but a programming designer, I have no real background in programming languages, which sometimes makes it difficult for me to adapt examples found online for my purposes.

My first steps were based on this video series from Jalayer Academy. Thank you very much for this very understandable tutorial.

After the installation I start R and find this screen: The R console.

In R we work with objects and commands. First we find out where our working directory is. For this we use the command getwd() - get working directory.

getwd()
[1] "C:/Users/benutzer/Documents"
Where is the working directory?

We can change this with setwd(). All my R-projects are located on C in the R folder. Since I am working on the texts of Wolfe, I named the directory r-wolfe. When setting working directories one should make sure that the directory exists.

So we write and then check again:

setwd("C:/R/r-wolfe")
getwd()
[1] "C:/R/r-wolfe"
Define own working directory.
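Since setwd() fails when the target folder is missing, the existence check mentioned above can be done in R as well. This is a minimal sketch; the demo folder inside tempdir() is a hypothetical stand-in for C:/R/r-wolfe.

```r
# Hypothetical project folder (stand-in for "C:/R/r-wolfe")
ziel <- file.path(tempdir(), "r-wolfe-demo")

# Create the folder only if it does not exist yet
if (!dir.exists(ziel)) dir.create(ziel, recursive = TRUE)

# setwd() invisibly returns the previous working directory
alt <- setwd(ziel)
aktuell <- getwd()
aktuell

# Switch back so the rest of the session is not affected
setwd(alt)
```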

Perfect. All files that we load or save are now read from or written to C:/R/r-wolfe. With dir() we can display all files in the working directory.

dir()
[1] "auszug.txt"    "schauheimwaertsengel.txt"
Which files are in the working directory?

The directory contains two files: auszug.txt and the full text schauheimwaertsengel.txt. I apply all commands to both texts, because the short text in auszug.txt makes the examples easier to show and to check. The complete text of Schau heimwärts, Engel! contains more than 1.2 million characters and almost 10,000 lines. For copyright reasons, I ask everyone to compile this text themselves.

To be able to edit the text in R, I load it as an object in R. At first I had some problems with the umlauts.

First of all I would like to display the text of auszug.txt in the console. Please note that R is case-sensitive, so upper and lower case in commands matter!

Let's have a look at it.

readLines("auszug.txt")
[1] "Ein Engländer namens Gilbert Gaunt (was er später in Gant änderte, vermutlich ein Zugeständnis an die Aussprache der Yankees) war im Jahre 1837 auf einem Segler von Bristol nach Baltimore gekommen. "
[2] ""                                                                                                                                                                                                          
[3] "Den Wert eines Gasthauses, das er sich gekauft hatte, ließ er seine unfürsorgliche Kehle hinunterrollen." 
Visible problems with the umlauts.

The command displays the content of the file directly in the console. It has 3 lines, and we see immediately that the special characters are not displayed correctly because the UTF-8 encoding was not recognized.

But if we specify the correct encoding, the problem is quickly solved:

readLines("auszug.txt", encoding = "UTF-8")
[1] "Ein Engländer namens Gilbert Gaunt (was er später in Gant änderte, vermutlich ein Zugeständnis an die Aussprache der Yankees) war im Jahre 1837 auf einem Segler von Bristol nach Baltimore gekommen. "
[2] ""                                                                                                                                                                                                      
[3] "Den Wert eines Gasthauses, das er sich gekauft hatte, ließ er seine unfürsorgliche Kehle hinunterrollen." 
readLines() with the correct encoding = "UTF-8".
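To try the effect of the encoding argument without the novel at hand, here is a small self-contained sketch. The temporary file and the sample lines are made up for illustration; writeLines(), readLines(), enc2utf8() and Encoding() are base R functions.

```r
# Sample lines with umlauts, written as \u escapes so the script itself stays ASCII
beispiel <- c("Ein Engl\u00e4nder namens Gilbert Gaunt", "", "lie\u00df er")

# Write the lines as UTF-8 bytes into a temporary file (stand-in for auszug.txt)
pfad <- tempfile(fileext = ".txt")
writeLines(enc2utf8(beispiel), pfad, useBytes = TRUE)

# Read the file back and declare the content as UTF-8
zeilen <- readLines(pfad, encoding = "UTF-8")

# Lines with non-ASCII characters are now marked as "UTF-8"
Encoding(zeilen)
```

Whether the umlauts look broken without the encoding argument depends on the operating system and the R version, so the sketch only shows the variant with an explicit encoding.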

That's more like it!
To store the text from the file auszug.txt in an object named auszug, we use the assignment operator <- . The object can then easily be displayed in the console by typing auszug.

auszug <- readLines("auszug.txt", encoding = "UTF-8")
auszug
[1] "Ein Engländer namens Gilbert Gaunt (was er später in Gant änderte, vermutlich ein Zugeständnis an die Aussprache der Yankees) war im Jahre 1837 auf einem Segler von Bristol nach Baltimore gekommen. "
[2] ""                                                                                                                                                                                                      
[3] "Den Wert eines Gasthauses, das er sich gekauft hatte, ließ er seine unfürsorgliche Kehle hinunterrollen."  
With <- we assign the result of a command to an object.

We do the same with the file schauheimwaertsengel.txt, which we load into the object wolfetext.

wolfetext <- readLines("schauheimwaertsengel.txt", encoding = "UTF-8")

Remove empty lines with paste() and collapse

As we have seen above, the object auszug contains an empty line, which we remove with this command.

auszug <- paste(auszug, collapse=' ') 
Remove empty lines with paste() and its collapse argument.

The collapse argument of paste() joins all lines into a single string, with a space character between them. Any duplicate spaces are removed later. The result is now one line of text.

auszug
[1] "Ein Engländer namens Gilbert Gaunt (was er später in Gant änderte, vermutlich ein Zugeständnis an die Aussprache der Yankees) war im Jahre 1837 auf einem Segler von Bristol nach Baltimore gekommen.   Den Wert eines Gasthauses, das er sich gekauft hatte, ließ er seine unfürsorgliche Kehle hinunterrollen."
The object auszug reduced to one line.

I will not display the object wolfetext here. But I proceed with it in exactly the same way as with auszug, i.e. I join all lines with a space using paste() and collapse, and assign the result back to the object wolfetext.

wolfetext <- paste(wolfetext, collapse=' ') 
wolfetext now contains the complete text from Schau heimwärts, Engel!
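What collapse does is easiest to see on a tiny made-up vector. The sketch below also shows one possible alternative, namely dropping the empty lines before joining; that variant is my own assumption and not part of the steps above.

```r
# Three made-up lines, the middle one empty (like in auszug)
zeilen <- c("Erste Zeile.", "", "Dritte Zeile.")

# Join all lines with a space; the empty line leaves a double space behind
eine_zeile <- paste(zeilen, collapse = " ")
eine_zeile
# [1] "Erste Zeile.  Dritte Zeile."

# Alternative: drop empty lines first, then join
ohne_leer <- paste(zeilen[zeilen != ""], collapse = " ")
ohne_leer
# [1] "Erste Zeile. Dritte Zeile."
```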

Remove punctuation marks and special characters with gsub()

Our texts contain punctuation marks in addition to many so-called stop words or filler words. Punctuation can be removed most easily with Regular Expressions. Regular Expressions are a magical thing in themselves. I found a good introduction in this tutorial and here, and a nearly complete list of all expressions can be found at this point. More links: Regular Expressions in stringr, R Regular Expressions, Regular Expressions with The R Language.

First we remove all punctuation and special characters and replace them with spaces.

gsub(pattern="\\W", replacement=" ", auszug)
[1] "Ein Engländer namens Gilbert Gaunt  was er später in Gant änderte  vermutlich ein Zugeständnis an die Aussprache der Yankees  war im Jahre 1837 auf einem Segler von Bristol nach Baltimore gekommen    Den Wert eines Gasthauses  das er sich gekauft hatte  ließ er seine unfürsorgliche Kehle hinunterrollen "
Remove punctuation and special characters.

And we see: All punctuation and special characters have been successfully removed and replaced by spaces.

With the next commands we assign the results back to our two objects auszug and wolfetext. Both now contain the text without punctuation and special characters, each of which has been replaced by a space.

auszug <- gsub(pattern="\\W", replacement=" ", auszug)
wolfetext <- gsub(pattern="\\W", replacement=" ", wolfetext)
Objects are assigned the results of the commands.

All digits are removed as well, with

auszug <- gsub(pattern="\\d", replacement=" ", auszug)
wolfetext <- gsub(pattern="\\d", replacement=" ", wolfetext)
Remove numbers with RegEx \\d.
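Replacing characters with spaces leaves runs of several spaces behind. The text mentions that duplicate spaces are removed later; one possible way to do this, as a sketch on a made-up sentence, is the pattern \\s+ together with trimws(). The sample sentence and the object name sauber are assumptions for illustration.

```r
# Made-up ASCII sample so the result does not depend on the locale
x <- "Im Jahre 1837, auf einem Segler (von Bristol) nach Baltimore!"

x <- gsub(pattern = "\\W", replacement = " ", x)   # punctuation -> space
x <- gsub(pattern = "\\d", replacement = " ", x)   # digits -> space

# Collapse runs of whitespace into a single space and trim both ends
sauber <- trimws(gsub(pattern = "\\s+", replacement = " ", x))
sauber
# [1] "Im Jahre auf einem Segler von Bristol nach Baltimore"
```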

To make the texts easier to process, all words are now converted to lower case using the function tolower().

tolower(auszug)
[1] "ein engländer namens gilbert gaunt  was er später in gant änderte  vermutlich ein zugeständnis an die aussprache der yankees  war im jahre      auf einem segler von bristol nach baltimore gekommen    den wert eines gasthauses  das er sich gekauft hatte  ließ er seine unfürsorgliche kehle hinunterrollen "
Lower case with tolower().

And we now assign the result to auszug and wolfetext again.

auszug <- tolower(auszug)
wolfetext <- tolower(wolfetext)
auszug and wolfetext are now completely written in lower case.

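Now we remove all single-letter strings from our texts and assign the result directly to the object auszug again. We proceed in the same way with wolfetext. Since both texts are already in lower case, the RegEx \\b[a-z]\\b is sufficient: one letter with a word boundary on each side. (In ASCII order, a range such as [A-z] would also cover a few non-letter characters, for example the underscore.)

auszug <- gsub(pattern="\\b[a-z]\\b", replacement=" ", auszug)
wolfetext <- gsub(pattern="\\b[a-z]\\b", replacement=" ", wolfetext)
Remove single-letter strings.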
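The effect of the word boundaries can be checked on a short made-up string. This is only a sketch: the sample string is invented, and the pattern uses the range a-z, which assumes text that is already in lower case.

```r
# Made-up lower-case sample with a few single letters in between
x <- "a b ist so e in x"

# A single letter with a word boundary (\\b) on both sides becomes a space
ohne_einzel <- gsub(pattern = "\\b[a-z]\\b", replacement = " ", x)

# Tidy up the leftover spaces to make the result easier to read
sauber <- trimws(gsub(pattern = "\\s+", replacement = " ", ohne_einzel))
sauber
# [1] "ist so in"
```

Letters inside longer words such as ist or so are untouched, because there is no word boundary directly before and after them.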

Now I would like to save the two objects auszug and wolfetext before further processing. This is done with the function write.table().

write.table(auszug, "auszug-lower.txt", sep="\t")
write.table(wolfetext, "schauheimwaertsengel-lower.txt", sep="\t")
Before further processing an intermediate backup.
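write.table() is designed for tables, so with its default settings it also writes a header line, a row name and quotation marks around the text. If only the plain text is wanted in the backup file, writeLines() might be a lighter alternative. The following sketch uses a made-up string and a temporary file name, both of which are assumptions.

```r
# Made-up lower-case text (umlaut as \u escape to keep the script ASCII)
text_klein <- "ein engl\u00e4nder namens gilbert gaunt"

# Write the plain string as UTF-8 bytes, without header or quotes
ziel <- file.path(tempdir(), "auszug-lower-demo.txt")
writeLines(enc2utf8(text_klein), ziel, useBytes = TRUE)

# Read it back to check that nothing was lost
zurueck <- readLines(ziel, encoding = "UTF-8")
zurueck == text_klein
# [1] TRUE
```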

Our working directory now contains 4 files. The files auszug-lower.txt and schauheimwaertsengel-lower.txt contain the text in lower case, without blank lines, punctuation marks, numbers or single-letter strings.

dir()
[1] "auszug-lower.txt" "auszug.txt" "schauheimwaertsengel.txt" "schauheimwaertsengel-lower.txt" 
Overview of the working directory.

Packages in R

The stringr package and a little bit of statistics.

R packages are collections of functions and data sets developed by the community. They enhance the performance of R by improving existing basic R functionality or adding new functionality. (...) Recently the official repository (CRAN) has reached 10,000 published packages, and many more are publicly available on the Internet. (Source: https://www.datacamp.com/community/tutorials/r-packages-guide).

Since I want to count certain words, I installed the package stringr first. You can install packages simply via the Packages menu or with the install.packages() function.

install.packages("stringr")
Install package stringr.

In some cases we may have to specify in a popup window from which server we want to download the package and where it should be installed. With library() an installed package is then loaded into the current session. Even if the package is already installed, it has to be loaded once in every session.

library(stringr)
The library() function loads a package into the session.

How often er or sie?

Now we can let some search functions loose on our texts. I start by searching for the strings "er" and "sie", first in the object auszug, so that the result is easier to check.

str_count(auszug, "er")
[1] 12
Search for er with the result 12.

We get 12 hits. If we look at auszug we see that später, engländer and segler, for example, also contain the string er. If we add a space on each side of our search string, we will only find the word er.

str_count(auszug, " er ")
[1] 3
Real er hits.
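Searching with a space on both sides can miss a word at the very beginning or end of the text. As a cross-check, whole words can also be counted in base R by splitting the text at whitespace. This is a sketch on a made-up sentence, not the method used above, and the object names are my own.

```r
# Made-up lower-case sample; "er" also hides inside aber, der and segler
probe <- "er kam und sie ging er blieb aber sie nicht der segler er"

# Substring count: every occurrence of "er", also inside other words
teil <- lengths(regmatches(probe, gregexpr("er", probe, fixed = TRUE)))
teil
# [1] 6

# Count with a space on both sides: misses "er" at the start and at the end
mit_leerzeichen <- lengths(regmatches(probe, gregexpr(" er ", probe, fixed = TRUE)))
mit_leerzeichen
# [1] 1

# Whole-word count: split at whitespace and compare each word
woerter <- strsplit(trimws(probe), "\\s+")[[1]]
ganze_woerter <- c(er = sum(woerter == "er"), sie = sum(woerter == "sie"))
ganze_woerter
```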

We get 3 hits, which is correct. In wolfetext there are 4946 hits for er and 3901 hits for sie, a ratio of 55.9% to 44.1%. Not necessarily a surprising result for an autobiographical novel written by a male author in 1929. One caveat: in German, sie means both "she" and "they", and after lower-casing it also covers the formal Sie, so the count for sie is not purely feminine.

str_count(wolfetext, " er ")
[1] 4946
str_count(wolfetext, " sie ")
[1] 3901
Occurrences of er and sie in the text.
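The percentages can be computed in R directly from the two counts. A minimal sketch, where the object names treffer and anteile are my own choice:

```r
# Hit counts for " er " and " sie " as reported above
treffer <- c(er = 4946, sie = 3901)

# Share of each word in percent, rounded to one decimal place
anteile <- round(100 * treffer / sum(treffer), 1)
anteile
#   er  sie 
# 55.9 44.1 
```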

The search can also be combined:

str_count(wolfetext, c(" er "," sie "))
[1] 4946 3901
Combined search.

In the next step I would like to examine the occurrence of the protagonists quantitatively and find out whether I can detect a tendency towards certain modal verbs in the text. Furthermore, I would like to remove all stop words and filler words from the text and finally create a word cloud from the whole text.

I will do this in the second part, which will hopefully be published here this week.

If you have suggestions or criticism for me, leave them in the comments! And if you have some good tips for R tutorials that are specifically about text analysis and text mining, please leave them in the comments too. I also love to read what you have already done with R!

tl;dr

My first steps with the programming language R are based on the German translation of the autobiographical novel Schau heimwärts, Engel!, published by Thomas Wolfe in 1929.
