This post is the first of the three in a series to make an interactive data visualization web app:
The three posts aim to give a concrete example with a full workflow of visualizing data from preprocessing, graphing to app building.
The case is adapted from how the R Shiny App for a Library Survey was actually created. Due to the nature and the goal of the survey, this interactive visualization paid much attention to how different subgroups of respondents (by country/major/status) used the library services. Most tasks concern how to graph categorical data.
This first post deals with the dirty part, data cleaning, with a focus on reshaping data structures for visualization. It starts with reshaping data for single graphs, and extends to reshaping for R Shiny App that deals with a lot more repetition and automation. Along the way, you will also see useful techniques in data cleaning applicable in many scenarios other than graphing.
Note that it is always the actual needs of plotting that determine how we reshape our data. In this case, the challenges of data preprocessing, or reshaping data, for visualization arise from how data were collected, generated, structured and stored in ways quite far from what could be ready to use; challenges also come from how different packages require different data structures to produce the same plots.
The codes and techniques we discuss here consist of Shiny App creation, specifically what we would call the global.R
script. It is not part of the ui
or server
functions, but handle the data needs of building up a Shiny App. This page shows a very small portion of what the global.R
script looks like for the survey Shiny App.
First of all, let’s load all the packages that we will use for this post.
library(readxl)
library(dplyr)
library(stringr)
library(data.table)
For illustration purposes below, I loaded the libraries each time I needed them so that it is known which package was attached. In reality, however, including the libraries at the very beginning of the script as above is recommended.
Let’s load the sample data. The data came from a library service survey, cleaned and deidentified for the purpose of this demo.
## setwd()
load("~/Desktop/r sample files/survey")
Let’s take a first look at our data. head()
, str()
and summary()
are all useful functions to get a sense of our data. See a brief introduction here.
head(survey)
## # A tibble: 6 x 17
## status country major Q1.1 Q1.2 Q1.3 top_reason place_options
## <fct> <fct> <fct> <chr> <chr> <chr> <chr> <chr>
## 1 Freshm… U.S. Undef… Never Never Never Find a quiet … quiet (occasio…
## 2 Freshm… China Undef… Occas… Often Often Find a quiet … crowded,focused
## 3 Freshm… U.S. Undef… Never Occa… Often Meet up with … (close to) sil…
## 4 Freshm… China Undef… Occas… Occa… Occa… Meet up with … focused,(close…
## 5 Freshm… U.S. Undef… Never Occa… Occa… Find a quiet … (close to) sil…
## 6 Freshm… China Undef… Occas… Often Often Find a quiet … (close to) sil…
## # ... with 9 more variables: space_lib <chr>, rank_crowded <dbl>,
## # rank_modpop <dbl>, rank_noisy <dbl>, rank_quiet <dbl>,
## # rank_silent <dbl>, rank_relaxed <dbl>, rank_focused <dbl>,
## # workshops <chr>
We will skip all the cleaning before reshaping, such as merging files, removing missing cases, recoding etc. Below we review several scenarios that we actually encountered when visualizing the survey data.
There were several times that we needed to create a frequency table before visualizing. Here’s how to do it.
tb <- survey %>% count(major) %>% data.frame() # generates a frequency table
tb <- tb %>% arrange(-n)
tb
## major n
## 1 Business, Finance & Economics 118
## 2 Undefined 57
## 3 Humanities & Social Sciences 32
## 4 CS & Engineering 27
## 5 Data Science & Interactive Media Business 24
## 6 Science 21
## 7 Mathematics 21
## 8 Interactive Media Arts 20
Another way to do it is to create a data frame from table()
.
tb2 <- data.frame(table(survey$major))
names(tb2) <- c("major", "n")
tb2 <- tb2[order(tb2$n, decreasing = T),]
tb2
## major n
## 1 Business, Finance & Economics 118
## 8 Undefined 57
## 2 Humanities & Social Sciences 32
## 6 CS & Engineering 27
## 3 Data Science & Interactive Media Business 24
## 5 Science 21
## 7 Mathematics 21
## 4 Interactive Media Arts 20
Have you ever got a bit annoyed when you saw a column of strings separated by the commas, tabs or spaces? Worse if we need to plot something out of the data. Data like this need to be cleaned before plotting. We will see two cases below. Each explains one situation.
One note before we proceed: Because we designed the survey questions or recoded the submitted answers, all variable values were known to us. For instance, we knew all the values for “top reason of visiting the library”. This makes it easier for us to deal with the strings.
First of all, we have data from student submissions on “My favorite place to study in Library”. The messy raw data were recoded to be consisent across respondents. Most have single strings, but some have multiple ones separated by comma.
head(survey$space_lib, n = 25)
## [1] "desk" "windows" "428"
## [4] "428" "talking" "silent"
## [7] "windows" "400" "silent"
## [10] "428" "NULL" "silent"
## [13] "silent" "windows" "group study room"
## [16] "400" "group study room" "group study room"
## [19] "400" "407" "407, silent"
## [22] "other" "other" "group study room"
## [25] "NULL"
Let’s say we want to create a frequency table counting how many times a unique value appeared. We are not too worried about this case because the strings themeselves do not contain the separator, which is the comma. So the program will not get confused which comma is what - is it a separator or part of the string. We can just split the strings by comma.
On the contrary, in the case below we will see shortly, some strings came with the separator, such as the comma in “Print, photocopy, scan”. Simply splitting those strings would have been inappropriate.
Back to where we were.
lib <- unlist(strsplit(survey$space_lib, ","))
## strsplit() splits the elements of the character vector space_lib into substrings by the commas
## strsplit() is from base R
## the result is a long list of all the extracted strings in vectors
## we need to unlist() the list, and get a data.frame after unlist()
head(lib, n = 30)
## [1] "desk" "windows" "428"
## [4] "428" "talking" "silent"
## [7] "windows" "400" "silent"
## [10] "428" "NULL" "silent"
## [13] "silent" "windows" "group study room"
## [16] "400" "group study room" "group study room"
## [19] "400" "407" "407"
## [22] " silent" "other" "other"
## [25] "group study room" "NULL" "silent"
## [28] "silent" "silent" "407"
lib <- str_trim(lib, side = "both")
## str_trim() is a function from the stringr package
## removes the trailing and leading spaces generated in spliting
lib <- lib[!lib %in% c("NULL")]
## removes values coded as "NULL"
lib <- data.frame(lib) %>% count(lib) %>% data.frame() %>% arrange(-n)
## counts the frequency of each unique value
lib
## lib n
## 1 silent 77
## 2 407 58
## 3 windows 47
## 4 other 31
## 5 group study room 23
## 6 400 22
## 7 428 17
## 8 desk 14
## 9 sofa 11
## 10 large desk 7
## 11 corner 6
## 12 small desk 4
## 13 4F 2
## 14 table 2
## 15 outlet 1
## 16 privacy 1
## 17 reserve 1
## 18 talking 1
In the second case, we want to visualize the distribution of visiting library by country. The data for “top reasons to visit the library” is top_reason
. The group variable is country
.
head(survey$top_reason)
## [1] "Find a quiet place to study,Hang out between classes,Work on a class assignment/paper"
## [2] "Find a quiet place to study,Hang out between classes,Work on a class assignment/paper"
## [3] "Meet up with friends,Use a library computer,Print, photocopy, scan"
## [4] "Meet up with friends,Print, photocopy, scan,Get readings from Course Reserve"
## [5] "Find a quiet place to study,Work on a class assignment/paper,Print, photocopy, scan"
## [6] "Find a quiet place to study,Work on a class assignment/paper,Borrow books and materials"
As we can see, each column contains multiple strings separated by commas. Ideally, we would like one column to have one unique value. Like what we did before, we need to extract the individual strings from the commas. The problem is, however, this time some strings contain “,”, the separator, within themselves. For instance, “Print, photocopy, scan” has “,” between the verbs. So does “Use specialized databases (e.g. Bloomberg, Wind)”.
One solution is, as the first step, to create one column for each value of top_reason
, matched with country
. The cell would be filled if the case is true for an individual, meaning that person from a specific country voted for a reason.
Let’s first store all unique values of top_reason
in a vector Reason
, which will be used as a pattern in extracting the strings.
Reason <- c("Work on a class assignment/paper", "Watch video or listen audio", "Use specialized databases \\(e.g. Bloomberg, Wind\\)", "Use a library computer", "Use a group study room", "Print, photocopy, scan", "Other", "Meet up with friends", "Hang out between classes", "Get readings from Course Reserve", "Get help from a librarian", "Find a quiet place to study", "Borrow books and materials", "Attend a library workshop")
Reason
## [1] "Work on a class assignment/paper"
## [2] "Watch video or listen audio"
## [3] "Use specialized databases \\(e.g. Bloomberg, Wind\\)"
## [4] "Use a library computer"
## [5] "Use a group study room"
## [6] "Print, photocopy, scan"
## [7] "Other"
## [8] "Meet up with friends"
## [9] "Hang out between classes"
## [10] "Get readings from Course Reserve"
## [11] "Get help from a librarian"
## [12] "Find a quiet place to study"
## [13] "Borrow books and materials"
## [14] "Attend a library workshop"
Note that I added \\
to "Use specialized databases (e.g. Bloomberg, Wind)"
. The reason was that to quote a metacharacter (paranthese) with special meaning, double backslashes need to come before it.
We can now reformat the data to a data frame with each column being a “reason”.
r <- data.frame(survey$country)
## first we create a new data frame
## let the first column be country because we will plot on country subsets
Values of each column will be extracted from the messy strings in the original dataset, if matched with the column name. For instance, if "Find a quiet place to study"
, among the other separated strings, matches with the column "Find a quiet place to study"
for an individual, the string will be added to the cell.
for (m in 1: length(Reason)){
r[,Reason[m]] <- str_extract(survey$top_reason, Reason[m])
}
## length(Reason) counts how many elements the Reason vector contains
## str_extract(string, pattern) is from the stringr package
## r[,Reason[m]] adds each reason as a column to the data frame "r" one by one
## str_extract(survey$top_reason, Reason[m]) extracts each element of reason by the pattern Reason[m]
head(r)
## survey.country Work on a class assignment/paper
## 1 U.S. Work on a class assignment/paper
## 2 China Work on a class assignment/paper
## 3 U.S. <NA>
## 4 China <NA>
## 5 U.S. Work on a class assignment/paper
## 6 China Work on a class assignment/paper
## Watch video or listen audio
## 1 <NA>
## 2 <NA>
## 3 <NA>
## 4 <NA>
## 5 <NA>
## 6 <NA>
## Use specialized databases \\(e.g. Bloomberg, Wind\\)
## 1 <NA>
## 2 <NA>
## 3 <NA>
## 4 <NA>
## 5 <NA>
## 6 <NA>
## Use a library computer Use a group study room Print, photocopy, scan
## 1 <NA> <NA> <NA>
## 2 <NA> <NA> <NA>
## 3 Use a library computer <NA> Print, photocopy, scan
## 4 <NA> <NA> Print, photocopy, scan
## 5 <NA> <NA> Print, photocopy, scan
## 6 <NA> <NA> <NA>
## Other Meet up with friends Hang out between classes
## 1 <NA> <NA> Hang out between classes
## 2 <NA> <NA> Hang out between classes
## 3 <NA> Meet up with friends <NA>
## 4 <NA> Meet up with friends <NA>
## 5 <NA> <NA> <NA>
## 6 <NA> <NA> <NA>
## Get readings from Course Reserve Get help from a librarian
## 1 <NA> <NA>
## 2 <NA> <NA>
## 3 <NA> <NA>
## 4 Get readings from Course Reserve <NA>
## 5 <NA> <NA>
## 6 <NA> <NA>
## Find a quiet place to study Borrow books and materials
## 1 Find a quiet place to study <NA>
## 2 Find a quiet place to study <NA>
## 3 <NA> <NA>
## 4 <NA> <NA>
## 5 Find a quiet place to study <NA>
## 6 Find a quiet place to study Borrow books and materials
## Attend a library workshop
## 1 <NA>
## 2 <NA>
## 3 <NA>
## 4 <NA>
## 5 <NA>
## 6 <NA>
table(r$`Use specialized databases \\(e.g. Bloomberg, Wind\\)`)
##
## Use specialized databases (e.g. Bloomberg, Wind)
## 17
## You may try in the earlier steps quoting without backlashes. No values will be carried over here. The frequency will be 0 in this table.
Thinking of our goal again - what we need is a frequency table by country. We need to reshape the data to that structure.
We need a data frame. This time the reasons came into one column.
dtset <- data.frame(Reason)
dtset
## Reason
## 1 Work on a class assignment/paper
## 2 Watch video or listen audio
## 3 Use specialized databases \\(e.g. Bloomberg, Wind\\)
## 4 Use a library computer
## 5 Use a group study room
## 6 Print, photocopy, scan
## 7 Other
## 8 Meet up with friends
## 9 Hang out between classes
## 10 Get readings from Course Reserve
## 11 Get help from a librarian
## 12 Find a quiet place to study
## 13 Borrow books and materials
## 14 Attend a library workshop
levels(dtset$Reason)[levels(dtset$Reason) == "Use specialized databases \\(e.g. Bloomberg, Wind\\)"] <- "Use specialized databases (e.g. Bloomberg, Wind)"
## reset the values of factor Reason: we've done quoting and let's remove the backslashes
Next we will summarize the distribution by country for each “reason”.
g <- c("China", "U.S.", "Other")
for (n in 1:length(g)){
dtset[,g[n]] <- apply(r[r[,1] == g[n], 2:15], 2, function(x) length(which(!is.na(x))))
}
## adding one group member a time to the data.frame
## g[n]: vector elements ("China", "U.S.", "Other")
## apply(subset[rows where elements == values of the group vector, columns applied], rows, function(counts non-missing cases))
dtset$Total<- rowSums(dtset[,2:4], na.rm = TRUE, dims = 1)
## creates a new column to calculate the total
dtset <- dtset[order(dtset$Total,decreasing = T),]
## reorders the rows by Total
dtset
## Reason China U.S. Other Total
## 12 Find a quiet place to study 123 47 50 220
## 6 Print, photocopy, scan 104 45 52 201
## 1 Work on a class assignment/paper 89 46 31 166
## 13 Borrow books and materials 63 15 28 106
## 10 Get readings from Course Reserve 30 6 8 44
## 4 Use a library computer 11 20 10 41
## 5 Use a group study room 32 4 5 41
## 8 Meet up with friends 23 4 9 36
## 9 Hang out between classes 16 6 5 27
## 11 Get help from a librarian 13 2 7 22
## 3 Use specialized databases (e.g. Bloomberg, Wind) 12 3 2 17
## 14 Attend a library workshop 9 5 3 17
## 2 Watch video or listen audio 7 2 2 11
## 7 Other 2 2 1 5
When we need it, we can also reshape dtset
from wide to long.
dtset %>% melt()
## Reason variable value
## 1 Find a quiet place to study China 123
## 2 Print, photocopy, scan China 104
## 3 Work on a class assignment/paper China 89
## 4 Borrow books and materials China 63
## 5 Get readings from Course Reserve China 30
## 6 Use a library computer China 11
## 7 Use a group study room China 32
## 8 Meet up with friends China 23
## 9 Hang out between classes China 16
## 10 Get help from a librarian China 13
## 11 Use specialized databases (e.g. Bloomberg, Wind) China 12
## 12 Attend a library workshop China 9
## 13 Watch video or listen audio China 7
## 14 Other China 2
## 15 Find a quiet place to study U.S. 47
## 16 Print, photocopy, scan U.S. 45
## 17 Work on a class assignment/paper U.S. 46
## 18 Borrow books and materials U.S. 15
## 19 Get readings from Course Reserve U.S. 6
## 20 Use a library computer U.S. 20
## 21 Use a group study room U.S. 4
## 22 Meet up with friends U.S. 4
## 23 Hang out between classes U.S. 6
## 24 Get help from a librarian U.S. 2
## 25 Use specialized databases (e.g. Bloomberg, Wind) U.S. 3
## 26 Attend a library workshop U.S. 5
## 27 Watch video or listen audio U.S. 2
## 28 Other U.S. 2
## 29 Find a quiet place to study Other 50
## 30 Print, photocopy, scan Other 52
## 31 Work on a class assignment/paper Other 31
## 32 Borrow books and materials Other 28
## 33 Get readings from Course Reserve Other 8
## 34 Use a library computer Other 10
## 35 Use a group study room Other 5
## 36 Meet up with friends Other 9
## 37 Hang out between classes Other 5
## 38 Get help from a librarian Other 7
## 39 Use specialized databases (e.g. Bloomberg, Wind) Other 2
## 40 Attend a library workshop Other 3
## 41 Watch video or listen audio Other 2
## 42 Other Other 1
## 43 Find a quiet place to study Total 220
## 44 Print, photocopy, scan Total 201
## 45 Work on a class assignment/paper Total 166
## 46 Borrow books and materials Total 106
## 47 Get readings from Course Reserve Total 44
## 48 Use a library computer Total 41
## 49 Use a group study room Total 41
## 50 Meet up with friends Total 36
## 51 Hang out between classes Total 27
## 52 Get help from a librarian Total 22
## 53 Use specialized databases (e.g. Bloomberg, Wind) Total 17
## 54 Attend a library workshop Total 17
## 55 Watch video or listen audio Total 11
## 56 Other Total 5
It is typical that a survey has a group of questions evaluating one problem. In our survey, we have several variables rank_crowded
, rank_modpop
, rank_noisy
, rank_quiet
, and rank_silent
asking if library users would like to study in an environment that is “crowded”, “moderately populated”, “noisy”, “quiet (occasional whispers)”, or “(close to) silent”.
head(survey[c("rank_crowded","rank_modpop", "rank_noisy", "rank_quiet", "rank_silent")])
## # A tibble: 6 x 5
## rank_crowded rank_modpop rank_noisy rank_quiet rank_silent
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 NA NA NA 1 NA
## 2 1 NA NA NA NA
## 3 NA 3 NA NA 1
## 4 NA NA NA NA 2
## 5 NA NA NA 3 1
## 6 NA NA 2 3 1
summary(survey[c("rank_crowded","rank_modpop", "rank_noisy", "rank_quiet", "rank_silent")])
## rank_crowded rank_modpop rank_noisy rank_quiet
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.00
## 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:1.00
## Median :1.000 Median :3.000 Median :2.500 Median :2.00
## Mean :2.099 Mean :2.689 Mean :4.075 Mean :1.94
## 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:7.000 3rd Qu.:2.00
## Max. :7.000 Max. :6.000 Max. :7.000 Max. :6.00
## NA's :239 NA's :259 NA's :280 NA's :153
## rank_silent
## Min. :1.000
## 1st Qu.:1.000
## Median :1.000
## Mean :1.724
## 3rd Qu.:2.000
## Max. :5.000
## NA's :139
When plotting, we’d like to present the results altogether. Therefore, we’d like to create a matrix with all the columns grouped by student status.
rank_status <-
data.frame(survey %>% group_by(status) %>% summarise(rank_crowded = round(mean(rank_crowded, na.rm = TRUE),2))) %>%
left_join(data.frame(survey %>% group_by(status) %>% summarise(rank_modpop = round(mean(rank_modpop, na.rm = TRUE),2)))) %>%
left_join(data.frame(survey %>% group_by(status) %>% summarise(rank_noisy = round(mean(rank_noisy, na.rm = TRUE),2)))) %>%
left_join(data.frame(survey %>% group_by(status) %>% summarise(rank_quiet = round(mean(rank_quiet, na.rm = TRUE),2)))) %>%
left_join(data.frame(survey %>% group_by(status) %>% summarise(rank_silent = round(mean(rank_silent, na.rm = TRUE),2)))) %>%
rename(`crowded` = rank_crowded, `moderately populated` = rank_modpop, `noisy` = rank_noisy, `quiet (occasional whispers)` = rank_quiet, `(close to) silent` = rank_silent)
## grouped by status
rank_status
## status crowded moderately populated noisy
## 1 Freshman 1.52 2.43 3.08
## 2 Sophomore 1.60 2.36 3.40
## 3 Junior 1.25 1.67 2.00
## 4 Senior 3.42 3.21 4.93
## 5 Other Programs 4.00 3.50 7.00
## quiet (occasional whispers) (close to) silent
## 1 1.90 1.76
## 2 2.13 1.71
## 3 1.45 1.89
## 4 1.97 1.58
## 5 1.91 1.93
Later, we will create a heatmap out of this matrix. The current data structure does not support the graph making, so we need to further reshape the data from wide to long.
rank_status <- rank_status %>% melt()
## melt() is from data.table package
## reshaping the matrix/data.frame for making the heatmap easier using melt()
rank_status
## status variable value
## 1 Freshman crowded 1.52
## 2 Sophomore crowded 1.60
## 3 Junior crowded 1.25
## 4 Senior crowded 3.42
## 5 Other Programs crowded 4.00
## 6 Freshman moderately populated 2.43
## 7 Sophomore moderately populated 2.36
## 8 Junior moderately populated 1.67
## 9 Senior moderately populated 3.21
## 10 Other Programs moderately populated 3.50
## 11 Freshman noisy 3.08
## 12 Sophomore noisy 3.40
## 13 Junior noisy 2.00
## 14 Senior noisy 4.93
## 15 Other Programs noisy 7.00
## 16 Freshman quiet (occasional whispers) 1.90
## 17 Sophomore quiet (occasional whispers) 2.13
## 18 Junior quiet (occasional whispers) 1.45
## 19 Senior quiet (occasional whispers) 1.97
## 20 Other Programs quiet (occasional whispers) 1.91
## 21 Freshman (close to) silent 1.76
## 22 Sophomore (close to) silent 1.71
## 23 Junior (close to) silent 1.89
## 24 Senior (close to) silent 1.58
## 25 Other Programs (close to) silent 1.93
Reshaping for R Shiny App asks for a lot more repetition and automation. In this case, repetition and automation mainly come from the idea that we want a chunk of codes to automatically produce graphs for the same purpose and repeatedly for different groups. For instance, earlier we may have created a plot by country; now we want to plot graphs by country as well as status and major. Later when we hit the “subgroup button” in Shiny App, we want to be able to dynamically choose the subgroup of data to be visualized and presented.
It may sound abstract and vague at the moment. When we get to Part III, we will get a more concrete idea of how that works.
Remember earlier we created a frequency table for major
. Here we want to do the same thing for status
and country
as well. We need to repeat the previous procedure on each one of them. As a result, we will get three data frames.
One more thing we want to do is store the three data frames in one object on a higher level. A list is a good option for such a task. The reason is that later when we need to access the single data frames, we can do so by subsetting (dtset[[i]]
).
set <- survey[c("status", "country", "major")]
tb3 <- replicate(length(set), data.frame(), simplify=FALSE)
## create a list of data frames for each group
for (m in 1:length(set)){
tb3[m] <- list(data.frame(table(set[m])))
names(tb3[[m]]) <- c("", "#")
tb3[[m]] <- tb3[[m]][order(tb3[[m]]$`#`,decreasing = T),]
}
## replicate() creates a list of 4 data frame; each data frame for one summary table
## tb3[m] <- list() adds one data frame by another
## tb3[[m]] works on data frame
## tb3[[m]][order(tb3[[m]]$`#`,decreasing = T),] orders rows by freq
for (i in 1:3){
colnames(tb3[[i]])[1] <- c("Student Status", "Country / Region", "Major")[i]
}
## rename the column names
tb3
## [[1]]
## Student Status #
## 1 Freshman 131
## 2 Sophomore 80
## 4 Senior 71
## 3 Junior 19
## 5 Other Programs 19
##
## [[2]]
## Country / Region #
## 1 China 178
## 3 Other 71
## 2 U.S. 69
## 4 Undefined 2
##
## [[3]]
## Major #
## 1 Business, Finance & Economics 118
## 8 Undefined 57
## 2 Humanities & Social Sciences 32
## 6 CS & Engineering 27
## 3 Data Science & Interactive Media Business 24
## 5 Science 21
## 7 Mathematics 21
## 4 Interactive Media Arts 20
Earlier we summarized the distribution of “top reason for visiting the library” by country. We now want to do it for each group (status
, country
and major
) and store them all in a list, like what we did above.
g <- list(status = c("Freshman", "Sophomore", "Junior", "Senior", "Other Programs") ,
country = c("China", "U.S.", "Other"),
major = c("Business, Finance & Economics","Humanities & Social Sciences","Data Science & Interactive Media Business", "Interactive Media Arts","Science","CS & Engineering","Mathematics","Undefined"))
Reason <- c("Work on a class assignment/paper", "Watch video or listen audio", "Use specialized databases \\(e.g. Bloomberg, Wind\\)", "Use a library computer", "Use a group study room", "Print, photocopy, scan", "Other", "Meet up with friends", "Hang out between classes", "Get readings from Course Reserve", "Get help from a librarian", "Find a quiet place to study", "Borrow books and materials", "Attend a library workshop")
r <- data.frame(survey$status, survey$country, survey$major)
for (m in 1: length(Reason)){
r[,Reason[m]] <- str_extract(survey$top_reason, Reason[m])
}
## reformat dataset structure so that each column is a Reason[i]
dtset <- replicate(3, data.frame(Reason), simplify = FALSE)
for (i in 1:3){
levels(dtset[[i]]$Reason)[levels(dtset[[i]]$Reason) == "Use specialized databases \\(e.g. Bloomberg, Wind\\)"] <- "Use specialized databases (e.g. Bloomberg, Wind)"
}
## reset the values of factor Reason for each data frame dtset[[i]]:
## we've done quoting and let's remove the backslashes
for (m in 1:length(dtset)){
for (n in 1:length(g[[m]])){
subgroup <- r[r[,m] == g[[m]][n], (length(g)+1):(length(Reason) + length(g))]
dtset[[m]][,g[[m]][n]] <- apply(subgroup, 2, function(x) length(which(!is.na(x))))
}
dtset[[m]]$Total<- rowSums(dtset[[m]][,2:n], na.rm = TRUE, dims = 1)
dtset[[m]] <- dtset[[m]][order(dtset[[m]]$Total,decreasing = T),]
}
## m denotes how many groups we have. We have 3 groups: country/status/major
## n denotes how many elements we have in each group. e.g. 3 elements in the group "country" (China, U.S., Other)
dtset
## [[1]]
## Reason Freshman Sophomore
## 12 Find a quiet place to study 94 55
## 6 Print, photocopy, scan 76 50
## 1 Work on a class assignment/paper 76 44
## 13 Borrow books and materials 38 25
## 10 Get readings from Course Reserve 13 12
## 5 Use a group study room 18 12
## 4 Use a library computer 16 10
## 8 Meet up with friends 17 12
## 9 Hang out between classes 14 5
## 11 Get help from a librarian 7 7
## 3 Use specialized databases (e.g. Bloomberg, Wind) 4 3
## 14 Attend a library workshop 13 0
## 2 Watch video or listen audio 5 4
## 7 Other 2 1
## Junior Senior Other Programs Total
## 12 11 45 16 205
## 6 10 52 14 188
## 1 9 28 10 157
## 13 9 34 1 106
## 10 3 15 2 43
## 5 3 7 1 40
## 4 3 7 5 36
## 8 1 4 2 34
## 9 2 2 4 23
## 11 1 8 0 23
## 3 2 8 0 17
## 14 2 1 1 16
## 2 0 1 1 10
## 7 1 1 0 5
##
## [[2]]
## Reason China U.S. Other Total
## 12 Find a quiet place to study 123 47 50 170
## 6 Print, photocopy, scan 104 45 52 149
## 1 Work on a class assignment/paper 89 46 31 135
## 13 Borrow books and materials 63 15 28 78
## 5 Use a group study room 32 4 5 36
## 10 Get readings from Course Reserve 30 6 8 36
## 4 Use a library computer 11 20 10 31
## 8 Meet up with friends 23 4 9 27
## 9 Hang out between classes 16 6 5 22
## 3 Use specialized databases (e.g. Bloomberg, Wind) 12 3 2 15
## 11 Get help from a librarian 13 2 7 15
## 14 Attend a library workshop 9 5 3 14
## 2 Watch video or listen audio 7 2 2 9
## 7 Other 2 2 1 4
##
## [[3]]
## Reason
## 12 Find a quiet place to study
## 6 Print, photocopy, scan
## 1 Work on a class assignment/paper
## 13 Borrow books and materials
## 10 Get readings from Course Reserve
## 5 Use a group study room
## 4 Use a library computer
## 8 Meet up with friends
## 11 Get help from a librarian
## 9 Hang out between classes
## 3 Use specialized databases (e.g. Bloomberg, Wind)
## 14 Attend a library workshop
## 2 Watch video or listen audio
## 7 Other
## Business, Finance & Economics Humanities & Social Sciences
## 12 80 21
## 6 78 23
## 1 63 18
## 13 36 13
## 10 10 6
## 5 17 3
## 4 15 3
## 8 17 1
## 11 9 2
## 9 6 1
## 3 10 3
## 14 5 1
## 2 7 0
## 7 1 1
## Data Science & Interactive Media Business Interactive Media Arts
## 12 17 10
## 6 12 10
## 1 16 9
## 13 9 8
## 10 2 4
## 5 2 5
## 4 3 5
## 8 1 2
## 11 1 3
## 9 5 1
## 3 0 1
## 14 4 1
## 2 0 0
## 7 0 1
## Science CS & Engineering Mathematics Undefined Total
## 12 12 18 16 47 174
## 6 15 13 14 37 165
## 1 8 16 11 26 141
## 13 11 9 4 17 90
## 10 4 4 8 7 38
## 5 3 4 3 4 37
## 4 2 3 1 9 32
## 8 2 4 2 7 29
## 11 1 3 2 2 21
## 9 1 3 0 10 17
## 3 0 1 1 1 16
## 14 3 2 0 1 16
## 2 1 1 1 1 10
## 7 0 0 0 2 3
We will see how we actually make use of the list of data frames when we talk about building the Shiny App.