The Dirty Work - Reshaping Data for Visualization

This post is the first of the three in a series to make an interactive data visualization web app:

Part 1 - The Dirty Work - Reshaping Data for Visualization
Part 2 - The Jigsaw Puzzle Pieces - Creating Graphs with ggplot2
Part 3 - Assembling the Pieces - Creating R Shiny App

The three posts aim to give a concrete example with a full workflow of visualizing data from preprocessing, graphing to app building.

The case is adapted from how the R Shiny App for a Library Survey was actually created. Due to the nature and the goal of the survey, this interactive visualization paid much attention to how different subgroups of respondents (by country/major/status) used the library services. Most tasks concern how to graph categorical data.

This first post deals with the dirty part, data cleaning, with a focus on reshaping data structures for visualization. It starts with reshaping data for single graphs, and extends to reshaping for R Shiny App that deals with a lot more repetition and automation. Along the way, you will also see useful techniques in data cleaning applicable in many scenarios other than graphing.

Note that it is always the actual needs of plotting that determine how we reshape our data. In this case, the challenges of data preprocessing, or reshaping data, for visualization arise from how data were collected, generated, structured and stored in ways quite far from what could be ready to use; challenges also come from how different packages require different data structures to produce the same plots.

The codes and techniques we discuss here consist of Shiny App creation, specifically what we would call the global.R script. It is not part of the ui or server functions, but handle the data needs of building up a Shiny App. This page shows a very small portion of what the global.R script looks like for the survey Shiny App.

Ready?

First of all, let’s load all the packages that we will use for this post.

library(readxl)
library(dplyr)
library(stringr)
library(data.table)

For illustration purposes below, I loaded the libraries each time I needed them so that it is known which package was attached. In reality, however, including the libraries at the very beginning of the script as above is recommended.

sample data

Let’s load the sample data. The data came from a library service survey, cleaned and deidentified for the purpose of this demo.

## setwd()
load("~/Desktop/r sample files/survey")

Let’s take a first look at our data. head(), str() and summary() are all useful functions to get a sense of our data. See a brief introduction here.

head(survey)

## # A tibble: 6 x 17
##   status  country major  Q1.1   Q1.2  Q1.3  top_reason     place_options  
##   <fct>   <fct>   <fct>  <chr>  <chr> <chr> <chr>          <chr>          
## 1 Freshm… U.S.    Undef… Never  Never Never Find a quiet … quiet (occasio…
## 2 Freshm… China   Undef… Occas… Often Often Find a quiet … crowded,focused
## 3 Freshm… U.S.    Undef… Never  Occa… Often Meet up with … (close to) sil…
## 4 Freshm… China   Undef… Occas… Occa… Occa… Meet up with … focused,(close…
## 5 Freshm… U.S.    Undef… Never  Occa… Occa… Find a quiet … (close to) sil…
## 6 Freshm… China   Undef… Occas… Often Often Find a quiet … (close to) sil…
## # ... with 9 more variables: space_lib <chr>, rank_crowded <dbl>,
## #   rank_modpop <dbl>, rank_noisy <dbl>, rank_quiet <dbl>,
## #   rank_silent <dbl>, rank_relaxed <dbl>, rank_focused <dbl>,
## #   workshops <chr>

Reshaping Data

We will skip all the cleaning before reshaping, such as merging files, removing missing cases, recoding etc. Below we review several scenarios that we actually encountered when visualizing the survey data.

summarizing before plotting

There were several times that we needed to create a frequency table before visualizing. Here’s how to do it.

tb <- survey %>% count(major) %>% data.frame() # generates a frequency table
tb <- tb %>% arrange(-n) 
tb

##                                       major   n
## 1             Business, Finance & Economics 118
## 2                                 Undefined  57
## 3              Humanities & Social Sciences  32
## 4                          CS & Engineering  27
## 5 Data Science & Interactive Media Business  24
## 6                                   Science  21
## 7                               Mathematics  21
## 8                    Interactive Media Arts  20

Another way to do it is to create a data frame from table().

tb2 <- data.frame(table(survey$major))
names(tb2) <- c("major", "n")
tb2 <- tb2[order(tb2$n, decreasing = T),]
tb2

##                                       major   n
## 1             Business, Finance & Economics 118
## 8                                 Undefined  57
## 2              Humanities & Social Sciences  32
## 6                          CS & Engineering  27
## 3 Data Science & Interactive Media Business  24
## 5                                   Science  21
## 7                               Mathematics  21
## 4                    Interactive Media Arts  20

extracting strings from the separators

Have you ever got a bit annoyed when you saw a column of strings separated by the commas, tabs or spaces? Worse if we need to plot something out of the data. Data like this need to be cleaned before plotting. We will see two cases below. Each explains one situation.

One note before we proceed: Because we designed the survey questions or recoded the submitted answers, all variable values were known to us. For instance, we knew all the values for “top reason of visiting the library”. This makes it easier for us to deal with the strings.

Case I

First of all, we have data from student submissions on “My favorite place to study in Library”. The messy raw data were recoded to be consisent across respondents. Most have single strings, but some have multiple ones separated by comma.

head(survey$space_lib, n = 25)

##  [1] "desk"             "windows"          "428"             
##  [4] "428"              "talking"          "silent"          
##  [7] "windows"          "400"              "silent"          
## [10] "428"              "NULL"             "silent"          
## [13] "silent"           "windows"          "group study room"
## [16] "400"              "group study room" "group study room"
## [19] "400"              "407"              "407, silent"     
## [22] "other"            "other"            "group study room"
## [25] "NULL"

Let’s say we want to create a frequency table counting how many times a unique value appeared. We are not too worried about this case because the strings themeselves do not contain the separator, which is the comma. So the program will not get confused which comma is what - is it a separator or part of the string. We can just split the strings by comma.

On the contrary, in the case below we will see shortly, some strings came with the separator, such as the comma in “Print, photocopy, scan”. Simply splitting those strings would have been inappropriate.

Back to where we were.

lib <- unlist(strsplit(survey$space_lib, ","))
## strsplit() splits the elements of the character vector space_lib into substrings by the commas
## strsplit() is from base R
## the result is a long list of all the extracted strings in vectors
## we need to unlist() the list, and get a data.frame after unlist()
head(lib, n = 30)

##  [1] "desk"             "windows"          "428"             
##  [4] "428"              "talking"          "silent"          
##  [7] "windows"          "400"              "silent"          
## [10] "428"              "NULL"             "silent"          
## [13] "silent"           "windows"          "group study room"
## [16] "400"              "group study room" "group study room"
## [19] "400"              "407"              "407"             
## [22] " silent"          "other"            "other"           
## [25] "group study room" "NULL"             "silent"          
## [28] "silent"           "silent"           "407"

lib <- str_trim(lib, side = "both")
## str_trim() is a function from the stringr package
## removes the trailing and leading spaces generated in spliting

lib <- lib[!lib %in% c("NULL")]
## removes values coded as "NULL"

lib <- data.frame(lib) %>% count(lib) %>% data.frame() %>% arrange(-n)
## counts the frequency of each unique value
lib

##                 lib  n
## 1            silent 77
## 2               407 58
## 3           windows 47
## 4             other 31
## 5  group study room 23
## 6               400 22
## 7               428 17
## 8              desk 14
## 9              sofa 11
## 10       large desk  7
## 11           corner  6
## 12       small desk  4
## 13               4F  2
## 14            table  2
## 15           outlet  1
## 16          privacy  1
## 17          reserve  1
## 18          talking  1

Case II

In the second case, we want to visualize the distribution of visiting library by country. The data for “top reasons to visit the library” is top_reason. The group variable is country.

head(survey$top_reason)

## [1] "Find a quiet place to study,Hang out between classes,Work on a class assignment/paper"  
## [2] "Find a quiet place to study,Hang out between classes,Work on a class assignment/paper"  
## [3] "Meet up with friends,Use a library computer,Print, photocopy, scan"                     
## [4] "Meet up with friends,Print, photocopy, scan,Get readings from Course Reserve"           
## [5] "Find a quiet place to study,Work on a class assignment/paper,Print, photocopy, scan"    
## [6] "Find a quiet place to study,Work on a class assignment/paper,Borrow books and materials"

As we can see, each column contains multiple strings separated by commas. Ideally, we would like one column to have one unique value. Like what we did before, we need to extract the individual strings from the commas. The problem is, however, this time some strings contain “,”, the separator, within themselves. For instance, “Print, photocopy, scan” has “,” between the verbs. So does “Use specialized databases (e.g. Bloomberg, Wind)”.

One solution is, as the first step, to create one column for each value of top_reason, matched with country. The cell would be filled if the case is true for an individual, meaning that person from a specific country voted for a reason.

Step I

Let’s first store all unique values of top_reason in a vector Reason, which will be used as a pattern in extracting the strings.

Reason <- c("Work on a class assignment/paper", "Watch video or listen audio", "Use specialized databases \\(e.g. Bloomberg, Wind\\)", "Use a library computer", "Use a group study room", "Print, photocopy, scan", "Other", "Meet up with friends", "Hang out between classes", "Get readings from Course Reserve", "Get help from a librarian", "Find a quiet place to study", "Borrow books and materials", "Attend a library workshop")

Reason

##  [1] "Work on a class assignment/paper"                    
##  [2] "Watch video or listen audio"                         
##  [3] "Use specialized databases \\(e.g. Bloomberg, Wind\\)"
##  [4] "Use a library computer"                              
##  [5] "Use a group study room"                              
##  [6] "Print, photocopy, scan"                              
##  [7] "Other"                                               
##  [8] "Meet up with friends"                                
##  [9] "Hang out between classes"                            
## [10] "Get readings from Course Reserve"                    
## [11] "Get help from a librarian"                           
## [12] "Find a quiet place to study"                         
## [13] "Borrow books and materials"                          
## [14] "Attend a library workshop"

Note that I added \\ to "Use specialized databases (e.g. Bloomberg, Wind)". The reason was that to quote a metacharacter (paranthese) with special meaning, double backslashes need to come before it.

Step II

We can now reformat the data to a data frame with each column being a “reason”.

r <- data.frame(survey$country)
## first we create a new data frame
## let the first column be country because we will plot on country subsets

Values of each column will be extracted from the messy strings in the original dataset, if matched with the column name. For instance, if "Find a quiet place to study", among the other separated strings, matches with the column "Find a quiet place to study" for an individual, the string will be added to the cell.

for (m in 1: length(Reason)){
  r[,Reason[m]] <- str_extract(survey$top_reason, Reason[m])
}
## length(Reason) counts how many elements the Reason vector contains
## str_extract(string, pattern) is from the stringr package
## r[,Reason[m]] adds each reason as a column to the data frame "r" one by one
## str_extract(survey$top_reason, Reason[m]) extracts each element of reason by the pattern Reason[m]

head(r)

##   survey.country Work on a class assignment/paper
## 1           U.S. Work on a class assignment/paper
## 2          China Work on a class assignment/paper
## 3           U.S.                             <NA>
## 4          China                             <NA>
## 5           U.S. Work on a class assignment/paper
## 6          China Work on a class assignment/paper
##   Watch video or listen audio
## 1                        <NA>
## 2                        <NA>
## 3                        <NA>
## 4                        <NA>
## 5                        <NA>
## 6                        <NA>
##   Use specialized databases \\(e.g. Bloomberg, Wind\\)
## 1                                                 <NA>
## 2                                                 <NA>
## 3                                                 <NA>
## 4                                                 <NA>
## 5                                                 <NA>
## 6                                                 <NA>
##   Use a library computer Use a group study room Print, photocopy, scan
## 1                   <NA>                   <NA>                   <NA>
## 2                   <NA>                   <NA>                   <NA>
## 3 Use a library computer                   <NA> Print, photocopy, scan
## 4                   <NA>                   <NA> Print, photocopy, scan
## 5                   <NA>                   <NA> Print, photocopy, scan
## 6                   <NA>                   <NA>                   <NA>
##   Other Meet up with friends Hang out between classes
## 1  <NA>                 <NA> Hang out between classes
## 2  <NA>                 <NA> Hang out between classes
## 3  <NA> Meet up with friends                     <NA>
## 4  <NA> Meet up with friends                     <NA>
## 5  <NA>                 <NA>                     <NA>
## 6  <NA>                 <NA>                     <NA>
##   Get readings from Course Reserve Get help from a librarian
## 1                             <NA>                      <NA>
## 2                             <NA>                      <NA>
## 3                             <NA>                      <NA>
## 4 Get readings from Course Reserve                      <NA>
## 5                             <NA>                      <NA>
## 6                             <NA>                      <NA>
##   Find a quiet place to study Borrow books and materials
## 1 Find a quiet place to study                       <NA>
## 2 Find a quiet place to study                       <NA>
## 3                        <NA>                       <NA>
## 4                        <NA>                       <NA>
## 5 Find a quiet place to study                       <NA>
## 6 Find a quiet place to study Borrow books and materials
##   Attend a library workshop
## 1                      <NA>
## 2                      <NA>
## 3                      <NA>
## 4                      <NA>
## 5                      <NA>
## 6                      <NA>

table(r$`Use specialized databases \\(e.g. Bloomberg, Wind\\)`)

## 
## Use specialized databases (e.g. Bloomberg, Wind) 
##                                               17

## You may try in the earlier steps quoting without backlashes. No values will be carried over here. The frequency will be 0 in this table.

Step III

Thinking of our goal again - what we need is a frequency table by country. We need to reshape the data to that structure.

We need a data frame. This time the reasons came into one column.

dtset <- data.frame(Reason)
dtset

##                                                  Reason
## 1                      Work on a class assignment/paper
## 2                           Watch video or listen audio
## 3  Use specialized databases \\(e.g. Bloomberg, Wind\\)
## 4                                Use a library computer
## 5                                Use a group study room
## 6                                Print, photocopy, scan
## 7                                                 Other
## 8                                  Meet up with friends
## 9                              Hang out between classes
## 10                     Get readings from Course Reserve
## 11                            Get help from a librarian
## 12                          Find a quiet place to study
## 13                           Borrow books and materials
## 14                            Attend a library workshop

levels(dtset$Reason)[levels(dtset$Reason) == "Use specialized databases \\(e.g. Bloomberg, Wind\\)"] <- "Use specialized databases (e.g. Bloomberg, Wind)"

## reset the values of factor Reason: we've done quoting and let's remove the backslashes

Next we will summarize the distribution by country for each “reason”.

g <- c("China", "U.S.", "Other")

for (n in 1:length(g)){
  dtset[,g[n]] <- apply(r[r[,1] == g[n], 2:15], 2, function(x) length(which(!is.na(x))))
}
## adding one group member a time to the data.frame
## g[n]: vector elements ("China", "U.S.", "Other")
## apply(subset[rows where elements == values of the group vector, columns applied], rows, function(counts non-missing cases))

dtset$Total<- rowSums(dtset[,2:4], na.rm = TRUE, dims = 1)
## creates a new column to calculate the total

dtset <- dtset[order(dtset$Total,decreasing = T),]
## reorders the rows by Total

dtset

##                                              Reason China U.S. Other Total
## 12                      Find a quiet place to study   123   47    50   220
## 6                            Print, photocopy, scan   104   45    52   201
## 1                  Work on a class assignment/paper    89   46    31   166
## 13                       Borrow books and materials    63   15    28   106
## 10                 Get readings from Course Reserve    30    6     8    44
## 4                            Use a library computer    11   20    10    41
## 5                            Use a group study room    32    4     5    41
## 8                              Meet up with friends    23    4     9    36
## 9                          Hang out between classes    16    6     5    27
## 11                        Get help from a librarian    13    2     7    22
## 3  Use specialized databases (e.g. Bloomberg, Wind)    12    3     2    17
## 14                        Attend a library workshop     9    5     3    17
## 2                       Watch video or listen audio     7    2     2    11
## 7                                             Other     2    2     1     5

When we need it, we can also reshape dtset from wide to long.

dtset %>% melt()

##                                              Reason variable value
## 1                       Find a quiet place to study    China   123
## 2                            Print, photocopy, scan    China   104
## 3                  Work on a class assignment/paper    China    89
## 4                        Borrow books and materials    China    63
## 5                  Get readings from Course Reserve    China    30
## 6                            Use a library computer    China    11
## 7                            Use a group study room    China    32
## 8                              Meet up with friends    China    23
## 9                          Hang out between classes    China    16
## 10                        Get help from a librarian    China    13
## 11 Use specialized databases (e.g. Bloomberg, Wind)    China    12
## 12                        Attend a library workshop    China     9
## 13                      Watch video or listen audio    China     7
## 14                                            Other    China     2
## 15                      Find a quiet place to study     U.S.    47
## 16                           Print, photocopy, scan     U.S.    45
## 17                 Work on a class assignment/paper     U.S.    46
## 18                       Borrow books and materials     U.S.    15
## 19                 Get readings from Course Reserve     U.S.     6
## 20                           Use a library computer     U.S.    20
## 21                           Use a group study room     U.S.     4
## 22                             Meet up with friends     U.S.     4
## 23                         Hang out between classes     U.S.     6
## 24                        Get help from a librarian     U.S.     2
## 25 Use specialized databases (e.g. Bloomberg, Wind)     U.S.     3
## 26                        Attend a library workshop     U.S.     5
## 27                      Watch video or listen audio     U.S.     2
## 28                                            Other     U.S.     2
## 29                      Find a quiet place to study    Other    50
## 30                           Print, photocopy, scan    Other    52
## 31                 Work on a class assignment/paper    Other    31
## 32                       Borrow books and materials    Other    28
## 33                 Get readings from Course Reserve    Other     8
## 34                           Use a library computer    Other    10
## 35                           Use a group study room    Other     5
## 36                             Meet up with friends    Other     9
## 37                         Hang out between classes    Other     5
## 38                        Get help from a librarian    Other     7
## 39 Use specialized databases (e.g. Bloomberg, Wind)    Other     2
## 40                        Attend a library workshop    Other     3
## 41                      Watch video or listen audio    Other     2
## 42                                            Other    Other     1
## 43                      Find a quiet place to study    Total   220
## 44                           Print, photocopy, scan    Total   201
## 45                 Work on a class assignment/paper    Total   166
## 46                       Borrow books and materials    Total   106
## 47                 Get readings from Course Reserve    Total    44
## 48                           Use a library computer    Total    41
## 49                           Use a group study room    Total    41
## 50                             Meet up with friends    Total    36
## 51                         Hang out between classes    Total    27
## 52                        Get help from a librarian    Total    22
## 53 Use specialized databases (e.g. Bloomberg, Wind)    Total    17
## 54                        Attend a library workshop    Total    17
## 55                      Watch video or listen audio    Total    11
## 56                                            Other    Total     5

making use of info from several columns

It is typical that a survey has a group of questions evaluating one problem. In our survey, we have several variables rank_crowded, rank_modpop, rank_noisy, rank_quiet, and rank_silent asking if library users would like to study in an environment that is “crowded”, “moderately populated”, “noisy”, “quiet (occasional whispers)”, or “(close to) silent”.

head(survey[c("rank_crowded","rank_modpop", "rank_noisy", "rank_quiet", "rank_silent")])

## # A tibble: 6 x 5
##   rank_crowded rank_modpop rank_noisy rank_quiet rank_silent
##          <dbl>       <dbl>      <dbl>      <dbl>       <dbl>
## 1           NA          NA         NA          1          NA
## 2            1          NA         NA         NA          NA
## 3           NA           3         NA         NA           1
## 4           NA          NA         NA         NA           2
## 5           NA          NA         NA          3           1
## 6           NA          NA          2          3           1

summary(survey[c("rank_crowded","rank_modpop", "rank_noisy", "rank_quiet", "rank_silent")])

##   rank_crowded    rank_modpop      rank_noisy      rank_quiet  
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.00  
##  1st Qu.:1.000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:1.00  
##  Median :1.000   Median :3.000   Median :2.500   Median :2.00  
##  Mean   :2.099   Mean   :2.689   Mean   :4.075   Mean   :1.94  
##  3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:7.000   3rd Qu.:2.00  
##  Max.   :7.000   Max.   :6.000   Max.   :7.000   Max.   :6.00  
##  NA's   :239     NA's   :259     NA's   :280     NA's   :153   
##   rank_silent   
##  Min.   :1.000  
##  1st Qu.:1.000  
##  Median :1.000  
##  Mean   :1.724  
##  3rd Qu.:2.000  
##  Max.   :5.000  
##  NA's   :139

When plotting, we’d like to present the results altogether. Therefore, we’d like to create a matrix with all the columns grouped by student status.

rank_status <- 
  data.frame(survey %>% group_by(status) %>% summarise(rank_crowded = round(mean(rank_crowded, na.rm = TRUE),2))) %>% 
  left_join(data.frame(survey %>% group_by(status) %>% summarise(rank_modpop = round(mean(rank_modpop, na.rm = TRUE),2)))) %>%
  left_join(data.frame(survey %>% group_by(status) %>% summarise(rank_noisy = round(mean(rank_noisy, na.rm = TRUE),2)))) %>%
  left_join(data.frame(survey %>% group_by(status) %>% summarise(rank_quiet = round(mean(rank_quiet, na.rm = TRUE),2)))) %>%
  left_join(data.frame(survey %>% group_by(status) %>% summarise(rank_silent = round(mean(rank_silent, na.rm = TRUE),2)))) %>%
  rename(`crowded` = rank_crowded, `moderately populated` = rank_modpop, `noisy` = rank_noisy, `quiet (occasional whispers)` = rank_quiet, `(close to) silent` = rank_silent) 

## grouped by status

rank_status

##           status crowded moderately populated noisy
## 1       Freshman    1.52                 2.43  3.08
## 2      Sophomore    1.60                 2.36  3.40
## 3         Junior    1.25                 1.67  2.00
## 4         Senior    3.42                 3.21  4.93
## 5 Other Programs    4.00                 3.50  7.00
##   quiet (occasional whispers) (close to) silent
## 1                        1.90              1.76
## 2                        2.13              1.71
## 3                        1.45              1.89
## 4                        1.97              1.58
## 5                        1.91              1.93

Later, we will create a heatmap out of this matrix. The current data structure does not support the graph making, so we need to further reshape the data from wide to long.

rank_status <- rank_status %>% melt()
## melt() is from data.table package
## reshaping the matrix/data.frame for making the heatmap easier using melt()

rank_status

##            status                    variable value
## 1        Freshman                     crowded  1.52
## 2       Sophomore                     crowded  1.60
## 3          Junior                     crowded  1.25
## 4          Senior                     crowded  3.42
## 5  Other Programs                     crowded  4.00
## 6        Freshman        moderately populated  2.43
## 7       Sophomore        moderately populated  2.36
## 8          Junior        moderately populated  1.67
## 9          Senior        moderately populated  3.21
## 10 Other Programs        moderately populated  3.50
## 11       Freshman                       noisy  3.08
## 12      Sophomore                       noisy  3.40
## 13         Junior                       noisy  2.00
## 14         Senior                       noisy  4.93
## 15 Other Programs                       noisy  7.00
## 16       Freshman quiet (occasional whispers)  1.90
## 17      Sophomore quiet (occasional whispers)  2.13
## 18         Junior quiet (occasional whispers)  1.45
## 19         Senior quiet (occasional whispers)  1.97
## 20 Other Programs quiet (occasional whispers)  1.91
## 21       Freshman           (close to) silent  1.76
## 22      Sophomore           (close to) silent  1.71
## 23         Junior           (close to) silent  1.89
## 24         Senior           (close to) silent  1.58
## 25 Other Programs           (close to) silent  1.93

Reshaping for Shiny

Reshaping for R Shiny App asks for a lot more repetition and automation. In this case, repetition and automation mainly come from the idea that we want a chunk of codes to automatically produce graphs for the same purpose and repeatedly for different groups. For instance, earlier we may have created a plot by country; now we want to plot graphs by country as well as status and major. Later when we hit the “subgroup button” in Shiny App, we want to be able to dynamically choose the subgroup of data to be visualized and presented.

It may sound abstract and vague at the moment. When we get to Part III, we will get a more concrete idea of how that works.

Remember earlier we created a frequency table for major. Here we want to do the same thing for status and country as well. We need to repeat the previous procedure on each one of them. As a result, we will get three data frames.

One more thing we want to do is store the three data frames in one object on a higher level. A list is a good option for such a task. The reason is that later when we need to access the single data frames, we can do so by subsetting (dtset[[i]]).

set <- survey[c("status", "country", "major")]
tb3 <- replicate(length(set), data.frame(), simplify=FALSE)
## create a list of data frames for each group

for (m in 1:length(set)){
  tb3[m] <- list(data.frame(table(set[m])))
  names(tb3[[m]]) <- c("", "#")
  tb3[[m]] <- tb3[[m]][order(tb3[[m]]$`#`,decreasing = T),]
}
## replicate() creates a list of 4 data frame; each data frame for one summary table
## tb3[m] <- list() adds one data frame by another
## tb3[[m]] works on data frame
## tb3[[m]][order(tb3[[m]]$`#`,decreasing = T),] orders rows by freq

for (i in 1:3){
  colnames(tb3[[i]])[1] <- c("Student Status", "Country / Region", "Major")[i]
}
## rename the column names

tb3

## [[1]]
##   Student Status   #
## 1       Freshman 131
## 2      Sophomore  80
## 4         Senior  71
## 3         Junior  19
## 5 Other Programs  19
## 
## [[2]]
##   Country / Region   #
## 1            China 178
## 3            Other  71
## 2             U.S.  69
## 4        Undefined   2
## 
## [[3]]
##                                       Major   #
## 1             Business, Finance & Economics 118
## 8                                 Undefined  57
## 2              Humanities & Social Sciences  32
## 6                          CS & Engineering  27
## 3 Data Science & Interactive Media Business  24
## 5                                   Science  21
## 7                               Mathematics  21
## 4                    Interactive Media Arts  20

Earlier we summarized the distribution of “top reason for visiting the library” by country. We now want to do it for each group (status, country and major) and store them all in a list, like what we did above.

g <- list(status = c("Freshman", "Sophomore", "Junior", "Senior", "Other Programs") ,
          country = c("China", "U.S.", "Other"),
          major = c("Business, Finance & Economics","Humanities & Social Sciences","Data Science & Interactive Media Business",                     "Interactive Media Arts","Science","CS & Engineering","Mathematics","Undefined"))

Reason <- c("Work on a class assignment/paper", "Watch video or listen audio", "Use specialized databases \\(e.g. Bloomberg, Wind\\)", "Use a library computer",  "Use a group study room", "Print, photocopy, scan",  "Other", "Meet up with friends",  "Hang out between classes", "Get readings from Course Reserve", "Get help from a librarian", "Find a quiet place to study", "Borrow books and materials", "Attend a library workshop")

r <- data.frame(survey$status, survey$country, survey$major)
for (m in 1: length(Reason)){
  r[,Reason[m]] <- str_extract(survey$top_reason, Reason[m])
}
## reformat dataset structure so that each column is a Reason[i]

dtset <- replicate(3, data.frame(Reason), simplify = FALSE)
for (i in 1:3){
  levels(dtset[[i]]$Reason)[levels(dtset[[i]]$Reason) == "Use specialized databases \\(e.g. Bloomberg, Wind\\)"] <- "Use specialized databases (e.g. Bloomberg, Wind)"
}
## reset the values of factor Reason for each data frame dtset[[i]]: 
## we've done quoting and let's remove the backslashes

for (m in 1:length(dtset)){
  for (n in 1:length(g[[m]])){
    subgroup <- r[r[,m] == g[[m]][n], (length(g)+1):(length(Reason) + length(g))]
    dtset[[m]][,g[[m]][n]] <- apply(subgroup, 2, function(x) length(which(!is.na(x))))
  }
  dtset[[m]]$Total<- rowSums(dtset[[m]][,2:n], na.rm = TRUE, dims = 1)
  dtset[[m]] <- dtset[[m]][order(dtset[[m]]$Total,decreasing = T),]
}

## m denotes how many groups we have. We have 3 groups: country/status/major
## n denotes how many elements we have in each group. e.g. 3 elements in the group "country" (China, U.S., Other)

dtset

## [[1]]
##                                              Reason Freshman Sophomore
## 12                      Find a quiet place to study       94        55
## 6                            Print, photocopy, scan       76        50
## 1                  Work on a class assignment/paper       76        44
## 13                       Borrow books and materials       38        25
## 10                 Get readings from Course Reserve       13        12
## 5                            Use a group study room       18        12
## 4                            Use a library computer       16        10
## 8                              Meet up with friends       17        12
## 9                          Hang out between classes       14         5
## 11                        Get help from a librarian        7         7
## 3  Use specialized databases (e.g. Bloomberg, Wind)        4         3
## 14                        Attend a library workshop       13         0
## 2                       Watch video or listen audio        5         4
## 7                                             Other        2         1
##    Junior Senior Other Programs Total
## 12     11     45             16   205
## 6      10     52             14   188
## 1       9     28             10   157
## 13      9     34              1   106
## 10      3     15              2    43
## 5       3      7              1    40
## 4       3      7              5    36
## 8       1      4              2    34
## 9       2      2              4    23
## 11      1      8              0    23
## 3       2      8              0    17
## 14      2      1              1    16
## 2       0      1              1    10
## 7       1      1              0     5
## 
## [[2]]
##                                              Reason China U.S. Other Total
## 12                      Find a quiet place to study   123   47    50   170
## 6                            Print, photocopy, scan   104   45    52   149
## 1                  Work on a class assignment/paper    89   46    31   135
## 13                       Borrow books and materials    63   15    28    78
## 5                            Use a group study room    32    4     5    36
## 10                 Get readings from Course Reserve    30    6     8    36
## 4                            Use a library computer    11   20    10    31
## 8                              Meet up with friends    23    4     9    27
## 9                          Hang out between classes    16    6     5    22
## 3  Use specialized databases (e.g. Bloomberg, Wind)    12    3     2    15
## 11                        Get help from a librarian    13    2     7    15
## 14                        Attend a library workshop     9    5     3    14
## 2                       Watch video or listen audio     7    2     2     9
## 7                                             Other     2    2     1     4
## 
## [[3]]
##                                              Reason
## 12                      Find a quiet place to study
## 6                            Print, photocopy, scan
## 1                  Work on a class assignment/paper
## 13                       Borrow books and materials
## 10                 Get readings from Course Reserve
## 5                            Use a group study room
## 4                            Use a library computer
## 8                              Meet up with friends
## 11                        Get help from a librarian
## 9                          Hang out between classes
## 3  Use specialized databases (e.g. Bloomberg, Wind)
## 14                        Attend a library workshop
## 2                       Watch video or listen audio
## 7                                             Other
##    Business, Finance & Economics Humanities & Social Sciences
## 12                            80                           21
## 6                             78                           23
## 1                             63                           18
## 13                            36                           13
## 10                            10                            6
## 5                             17                            3
## 4                             15                            3
## 8                             17                            1
## 11                             9                            2
## 9                              6                            1
## 3                             10                            3
## 14                             5                            1
## 2                              7                            0
## 7                              1                            1
##    Data Science & Interactive Media Business Interactive Media Arts
## 12                                        17                     10
## 6                                         12                     10
## 1                                         16                      9
## 13                                         9                      8
## 10                                         2                      4
## 5                                          2                      5
## 4                                          3                      5
## 8                                          1                      2
## 11                                         1                      3
## 9                                          5                      1
## 3                                          0                      1
## 14                                         4                      1
## 2                                          0                      0
## 7                                          0                      1
##    Science CS & Engineering Mathematics Undefined Total
## 12      12               18          16        47   174
## 6       15               13          14        37   165
## 1        8               16          11        26   141
## 13      11                9           4        17    90
## 10       4                4           8         7    38
## 5        3                4           3         4    37
## 4        2                3           1         9    32
## 8        2                4           2         7    29
## 11       1                3           2         2    21
## 9        1                3           0        10    17
## 3        0                1           1         1    16
## 14       3                2           0         1    16
## 2        1                1           1         1    10
## 7        0                0           0         2     3

We will see how we actually make use of the list of data frames when we talk about building the Shiny App.