My business is with R tonight, with maps and color. #IVMOOC Week 5

So the midterm has me a bit disheartened.  Not because of difficulty, but because of the miscommunication between the admins and the students.  In Canvas, the midterm description listed a deadline of Feb 16th, 8pm EST.  After putting my son to bed, and while making dinner, I logged into Canvas and rushed through the midterm.  Flying through it, I ended up with a 21.  The next day I saw on Twitter that it was actually due on the 19th, yet this wasn't reflected in Canvas.  Had I known, I would have set aside a proper block of time to write the midterm unhurried, and would have done much better.  As it was, I had some misclicks, questions I skipped and never came back to, and some bonehead mistakes.  (For all of the graphic work and video editing I've done, why in the blue hell would I put saturation down as qualitative? Derp.)

Also, upon seeing the correct answers, there were some dubious "incorrect" answers.  Word Cloud vs Tag Cloud?  Cartograms vs Cartogram?  These small errors don't mean much, but they point to issues with the automatic grading system, and I hope they get corrected in future versions.  My mark can stand as it is; it wasn't my best effort and is more a reflection of poor time management.  But this kind of thing can be really discouraging for people and could make them walk away from the course.

And that’s all I have to say about that.

Week 5.  For this week, I mostly read the book chapter rather than watching the videos, and the hands-on videos didn't cover much more than the chapter and the Sci2 wiki.  I wasn't terribly happy with the Sci2 output, so I went to R.  After thinking for a bit, I abandoned most of my attempts to collect all the data through R, because I remembered the 'tree' command in unix.  So after a download of Homebrew, brewing up tree and a look at the manual, I had my workflow:

  1. Generate directory hierarchy using tree
  2. Export it to xml
  3. Translate the xml to JSON
  4. load the JSON into R
  5. Refine the data
  6. Add secondary variable to encode the data (file counts)
  7. Treemapify the data and plot
  8. Clean up the plot in Affinity Designer
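
Steps 1 through 3 happen outside of R (the exact tree invocation is repeated in the code comments below).  If you'd rather drive them from an R session, shelling out works too; a minimal sketch, with the XML-to-JSON conversion still left to an online converter as I ended up doing:

# Steps 1-2: dump the directory listing (directories only, depth 1, with sizes) to XML.
system("tree -dhX -L 1 -o test.xml --du ~/OwnCloud/Graduate*")
# Step 3: I converted test.xml to json.txt with an online XML-to-JSON converter.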

Below is my R code:

library(plyr)
library(magrittr)
library(treemap)
library(treemapify)
library(ggplot2)
library(jsonlite)  # provides fromJSON() below (rjson or RJSONIO would work too)

# This originally started out by mapping the directory hierarchy in R, but then I remembered
# tree from unix (I'll get to that later).
# I used this bit of code, plus Excel, to get the file counts for each directory.
# With a better knowledge of regular expressions, I would subset the n.files data frame
# by directory structure to the depth I'd like (1).

path<-"~/OwnCloud/Graduate Studies Research"
dirs <- list.dirs(path)

num.files <- function(x) {
  out <- length(list.files(x))
  out
}

n.files <- ldply(dirs,num.files)
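
# (Aside, not part of the original workflow: a hedged sketch of how to keep only the
#  depth-1 subdirectories without regular expressions, by counting path components
#  relative to the root.)
root.depth <- length(strsplit(path, "/")[[1]])
dir.depth <- sapply(strsplit(dirs, "/"), length)
n.files.depth1 <- ldply(dirs[dir.depth == root.depth + 1], num.files)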

# Prior to this step, I collected the information about the directory structure
# using the unix tree command, outputting the directory structure (depth 1) to XML:
# tree -dhX -L 1 -o test.xml --du ~/OwnCloud/Graduate*
# I then converted the XML to JSON using an online converter, because I couldn't
# get the XML and RJSONIO libraries to work correctly.

data.temp <- fromJSON("~/json.txt") %>%
data.frame()

# Trimming the data frame and putting the columns into the correct data formats

data3 <- data.temp[, 1:3]
data3$file.count <- c(55, 23, 271, 32, 47, 68, 313, 2120, 171, 44, 63, 86)
names(data3) <- c("root", "subdir", "size", "count")
data3$size <- as.numeric(data3$size)
data3$subdir <- factor(data3$subdir)
data3$count <- cut(data3$count,
                   c(0, 50, 100, 150, 300, 2200),
                   labels = c("0-50", "50-100", "100-150", "150-300", "300+"))

# Using treemapify to transform the data into a structure that ggplot2 can use.
# Then used ggplotify to plot it, because a layered grammar of graphics is awesome.

treemapify(data3,
           area = "size",
           fill = "count",
           label = "subdir") %>%
  ggplotify() +
  guides(fill = guide_legend("# of Files")) +
  ggtitle(bquote(atop(.("Treemap Visualization"),
                      atop(italic(.("Graduate Studies Directory, Depth = 1")), ""))))

And here is the resulting visualization:

IVMOOC Assignment 5
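
One small aside on the code above: the treemap package gets loaded but never actually used.  If ggplotify ever misbehaves, treemap's own plotting function should give a comparable chart from the same data3 frame.  This is an untested sketch, so treat the arguments as a starting point:

library(treemap)

# Sketch: treemap() draws straight to the graphics device rather than returning a
# ggplot object; size drives the tile areas and the binned file count drives the fill.
treemap(data3,
        index = "subdir",
        vSize = "size",
        vColor = "count",
        type = "categorical",
        title = "Treemap Visualization: Graduate Studies Directory, Depth = 1")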

I'll take "Famous Titles" for $400 Alex. #IVMOOC Week 4

Two posts in a week, and two weeks of #IVMOOC down! Topical analysis this week, and keeping in theme, here’s what I’m doing.  Alongside setting up data architecture, moving research along, planning future moves, and comparing and contrasting frameworks (including engineering thinking, skills, accreditation, assessment and professional behaviour), I’m eating lunch, listening to Hidetake Takayama, writing this post and trying to figure out what to make for dinner. Pretty much the same as everyone else!

Anyways.  I liked this week, but I got sidetracked digging into visualizing data from Twitter.  In the midst of watching one of the videos, I wanted to see how the #antivaxprof shenanigans were heating up or calming down.  I fired up R, figured out how to use the Twitter API through ROAuth and twitteR, and got to business. This is data from Feb 4th to 11th containing #antivaxprof. I wanted to see graphically how the tweets were being used: when tweets were originating and how they were progressing.  Below is the quick and dirty plot of that. Red dots are original tweets, green dots are retweets of those tweets, and each is plotted against the retweet count of the original.

Twitter Viz
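
The script for that plot didn't make it into this post, but the gist of it with twitteR is roughly the following.  Treat it as a hedged sketch rather than the exact code I ran: the credential variables are placeholders for a registered Twitter app, and the column names (created, retweetCount, isRetweet) are the ones twListToDF() returns.

library(twitteR)
library(ggplot2)

# Placeholder credentials for a registered Twitter app.
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

# Pull the week of #antivaxprof tweets and flatten them into a data frame.
tweets <- searchTwitter("#antivaxprof", n = 1500,
                        since = "2015-02-04", until = "2015-02-11")
tweets.df <- twListToDF(tweets)

# Originals vs. retweets over time, each plotted against the retweet count.
ggplot(tweets.df, aes(x = created, y = retweetCount, colour = isRetweet)) +
  geom_point() +
  scale_colour_manual(values = c("FALSE" = "red", "TRUE" = "darkgreen"))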

Then I wanted to see how this was capturing the attention of the twittersphere, simply by looking at the frequency of tweets per day and overlaying that with the total retweet count of all the tweets in a day.  (I guess the #IVMOOC workflow is sinking in.)  Enter plyr; here is the quick and dirty version of that.  I read this graph as showing that public interest in the story declined in both activity and reach over time.  That's the thing about controversies and social media: short bursts of attention that die down, or at least settle among only those most affected.  I'm certain this data is a little biased, though, since Twitter is just one small slice of the social media globe.

Twitter Viz 2
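
Again, the exact code isn't in the post, but the plyr step boils down to a ddply() along these lines (a sketch reusing the tweets.df frame from the snippet above):

library(plyr)

# One row per day: how many tweets went out, and their combined retweet count.
tweets.df$day <- as.Date(tweets.df$created)
daily <- ddply(tweets.df, .(day), summarize,
               tweet_count = length(id),
               total_retweets = sum(retweetCount))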

Then I went back to actually doing what I was supposed to be doing for #IVMOOC.  I searched the NSF grants database for “Engineering Education” and worked on extracting and visualizing the word co-occurrence network in the titles.  I had to use a bit of scripting to get things the way I wanted them:

resizeLinear(references, 2, 40)
resizeLinear(weight, .5, 5)
colorize(references, gray, black)
g.edges.color = "34,94,168"
for n in g.nodes:
    n.x = n.xpos * 40
    n.y = n.ypos * 40
    n.strokecolor = n.color
    if (n.references > 20):
        n.labelvisible = true
A bit of time and some cleaning up in Affinity Designer, and I submitted this:

IVMOOC Assignment 4

I can see a lot of potential uses for this in my own work and research.  Fantastic week.  Next up: the midterm.  

(Cue ominous music)

Where in the world is Carmen Sandiego?: #IVMOOC Week 3

Getting through this week was a bit tough, not due to content or complexity, but due to life, work and winter.  Being the only well parent is a lot of work.  Couple that with deadlines, grant writing and research sessions, and there was very little free time.  I made it though!

After watching the videos and hands-on clips for the week, I felt this assignment would be pretty easy to tackle.  I chose to visualize the amount of NSF funding for Engineering Education research.  My search of the Scholarly Database yielded 2000 results, with multiple awards per state.  Aggregating in the very manual way outlined in the hands-on video was far too inefficient and tedious, so I turned to R and some of the wonderful packages that have been created (gdata, ggmap, ggplot2, plyr).  I did use Sci2 to geocode the data, since I had used up my query limits with Google Maps while playing around with ggmap.

After that it was pretty easy.  I aggregated at the state level, as the Sci2 geocoder gave me some errors when encoding at the zip code or city level.  I pulled up a map of the continental USA and plotted the amount of funding, indicating the amounts with both size and color.  The code I used is below:

library(plyr)
library(gdata)
library(ggmap)
library(ggplot2)

setwd("~/MOOCs/IVMOOC")
data <- read.csv("NSF Master LandL.csv")

data.state <-ddply(data, .(state, Latitude, Longitude), summarize, expected_total_amount = sum(expected_total_amount))
usa <- get_map(location = 'united states', zoom = 4, color="color", maptype='road')
g <- guide_legend("Amount of Funding (USD)")

ggmap(usa) +
  geom_point(aes(x = Longitude, y = Latitude,
                 size = expected_total_amount, color = expected_total_amount),
             data = data.state, show_guide = TRUE) +
  scale_colour_gradient(low = "#e7e1ef", high = "#dd1c77") +
  guides(colour = g, size = g) +
  theme(legend.position = "bottom") +
  ggtitle("Geocoding of NSF Funding of Engineering Education Research") +
  xlab("Longitude") +
  ylab("Latitude")

I wanted to explore both the state level and the city level, but I ran into those query limits, and into the limits of how much time I could spend on this.  Maybe in the future I'll delve into this a bit more and expand the code to geocode by zip in R, and then plot those points on the US map.  My visualization:

Geospatial Assignment
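
For future reference, that zip-level geocoding would probably look something like the following with ggmap's geocode().  It's an untested sketch: it burns the same Google query limits that tripped me up earlier, and ZipCode is a stand-in for however the zip column ends up being named in the awards data.

library(ggmap)
library(plyr)

# Geocode each unique zip once (ZipCode is a placeholder column name), then join
# back and aggregate, mirroring the state-level ddply() above.
zips <- unique(as.character(data$ZipCode))
zip.coords <- geocode(paste(zips, "USA"), output = "latlon")
zip.coords$ZipCode <- zips

data.zip <- merge(data, zip.coords, by = "ZipCode")
data.zip <- ddply(data.zip, .(ZipCode, lat, lon), summarize,
                  expected_total_amount = sum(expected_total_amount))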

All in all, a fun week.  The bits I found most useful: the nice overview of colors and their use in geospatial maps, and the hands-on sessions.  Next up, week 4 and the midterm!