12 Apr 2016
I’ve a confession to make: I’m a terrible blogger. I blog out of need and interest, and I’ve a lot of the latter but little of the former.
“The game has changed.” (Clu)
(Yes, Tron: Legacy. Get over it.)
Bob Rudis (@hrbrmstr), a useR, visualization practitioner, data-driven security expert, and blogger, has constructed a wonderful set of information visualization challenges for the next year. (There goes my free time.) I was a little late to the party and missed the first week due to massive family sickness, but I’m going to try to follow along and participate in the coming weeks.
This week’s task is outlined on hrbrmstr’s blog. In short, it’s about looking at US Department of Housing and Urban Development data on the homeless population in the USA and producing a truthful, insightful and effective visualization of it. This is an emotionally charged data set, and I kept thinking about it all week: not only from the perspective of what questions I wanted to answer with my visualization, but from the perspective of someone who’s been fortunate enough to have a home, a family and a great deal of opportunity, and what that means to me.
Some may say that data is just numbers, and that it’s hard to connect to it. I don’t think it is. Data always represents something, and in this case it represents real people in a terrible situation. I believe that keeping that in mind connects you to the data, keeps you honest and motivates you.
“That’s all I have to say about that.” (Forrest Gump)
I explored the data, producing scatterplot matrices and looking for interesting relationships. Then I realized that I wanted to see what was being done to help the homeless. Most US government funding is made public, so I took a walk over to the Department of Housing and Urban Development, found the Continuum of Care (CoC) Program grant awards by state, and integrated those into the analysis. (I don’t know if this is cheating, but I wanted to see the funding and its effects on the homeless population.) This is still VASTLY incomplete and only sheds light on a small sliver of the entire story (support and funding don’t fix the problem, but they can help). This line of analysis is very nuanced and very deep; this little bit of work is a grain of sand in the Sahara.
I’m a fan of Alberto Cairo and his work; his books are fantastic reads and highly recommended. In “The Functional Art” he used an image from Época magazine: a scatterplot with each point representing summary values for a year on separate x and y scales, with the points connected to illustrate how the relationship changes over the years. I don’t know what it’s officially called (“connected scatterplot” seems to be the common term), but I call it a temporal scatterplot. I thought this would be a fantastic fit for my intended visualization.
I thought this was best represented at two levels, National and State. This provides the “Big Picture” as well as capturing the smaller stories that differ between the states. I believe that if you only present one of these, it’s terribly misleading, as the aggregate always smooths out the individual bumps and assumes homogeneity where none exists.
First up, the National “Big Picture”. The homeless population and funding data were aggregated by year and charted as homeless population vs. CoC funding, with a spline connecting each point. (Thanks to @hrbrmstr for the ggalt package and the tip on geom_xspline2.)
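A rough sketch of how a chart like this can be built with ggalt. The data frame and its values below are invented purely for illustration; the real data prep and plotting code live in the repo:

```r
library(ggplot2)
library(ggalt)  # provides geom_xspline, the spline connector

# toy yearly summaries, made up for illustration only
national <- data.frame(
  year     = 2007:2015,
  homeless = c(671, 664, 643, 649, 636, 634, 610, 578, 564),  # thousands
  funding  = c(1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 1.9, 2.0, 2.1)   # $ billions
)

ggplot(national, aes(x = funding, y = homeless)) +
  geom_xspline() +                                   # spline through the yearly points
  geom_point() +
  geom_text(aes(label = year), vjust = -0.8, size = 3) +
  labs(x = "CoC funding ($B)", y = "Homeless population (thousands)") +
  theme_minimal()
```

The spline makes the year-to-year path readable at a glance, which is the whole point of the connected/temporal scatterplot form.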
This shows that, overall, funding has been increasing over the years while the homeless population has been declining. However, what is going on in 2010 with the increase in the homeless population? This is where further research and additional analysis can shed some light on the situation. My initial cursory research reveals that this was the year the Iraq war ended, the year of the ‘flash crash’, and the bottoming of the jobs market post-recession. That’s a lot of potential causes, should someone with more expertise want to keep going.
Next up, the State-level results. The homeless population and funding data were aggregated by year and state, normalized to population and displayed per 100,000 people. They are plotted similarly to the National level, with the addition of small multiples for each state. The multiples are arranged according to the funding they received.
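The state-level version is the same idea plus a per-capita normalization and small multiples. A sketch of the shape of that code, where `state_year` and its column names are assumptions standing in for the real joined data:

```r
library(dplyr)
library(ggplot2)
library(ggalt)

# assumed shape: one row per state per year, already joined to census population
# state_year: columns state, year, homeless, funding, population

per_capita <- state_year %>%
  mutate(homeless_per_100k = homeless / population * 1e5,
         funding_per_100k  = funding  / population * 1e5,
         # order the facets by the funding each state received
         state = reorder(state, funding_per_100k, FUN = max))

ggplot(per_capita, aes(funding_per_100k, homeless_per_100k)) +
  geom_xspline(size = 0.3) +
  geom_point(size = 0.6) +
  facet_wrap(~ state) +
  theme_minimal()
```

`reorder()` on the faceting variable is what arranges the multiples by funding received rather than alphabetically.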
This illustrates a vastly different picture than the National level. It shows wild swings in both funding and homeless population and directly challenges the Big Picture view of throwing money at a problem. Look at the first three panels: Rhode Island, Vermont and Alaska. Something really odd is happening there that warrants a closer look. Ultimately, these differences and odd relationships lend support to my initial statement that this is far more complex and requires a great deal more thought, investigation and expertise than I can provide. It also directly challenges an unstated assumption of my own: that funding and homeless population are variables that are in phase, when most likely they are lagged in a cyclical cause-effect loop.
I also thought that both might be better served as interactive documents, so I added some yaml and roxygen comments to use rmarkdown::render() (which calls knitr::spin) with the shiny runtime to produce a simple interactive html document direct from the R code. The ggplot code to produce the static images is there too; you can change the include options to show either the static or the interactive version, but be sure to uncomment the actual call to the plot object.
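For anyone who hasn’t used knitr::spin before, the trick is a roxygen-style header at the top of a plain .R script. This is a sketch of the pattern, not the exact header from my repo:

```r
#' ---
#' title: "Homeless population vs. CoC funding"
#' output: html_document
#' runtime: shiny
#' ---

#+ setup, include=FALSE
# Ordinary R code lives below; chunk options go in "#+" comments.
# Rendering is one call:
#   rmarkdown::render("analysis.R")
# render() passes the script through knitr::spin() under the hood, and the
# shiny runtime turns the output into an interactive document rather than
# a static page.
```

The nice part is that the same script stays runnable as plain R, since everything rmarkdown needs is hidden in comments.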
You can view the interactive version on my shinyapps.io account.
If you’re interested in seeing what I did, simply go to my github (link in sidebar) and check out the code. I’ve included png and svg versions, and all of the data.
All in all, while I enjoyed the challenge, it gave me far more to think about than simply visualizing information.
01 Oct 2015
For the past couple of years I’ve been delving into the world of information visualization, partly due to intense interest, partly due to job responsibilities. I’ve read a lot, taken a lot of courses, and have been working to polish my skills along the way. Having a design-focused background (engineering), coupled with a broad foundation in data analysis and statistics (PhD) and my own modest skills in graphics and typography, has helped quite a bit.
To make a long story short, part of my role is helping engineering programs cope with large amounts of curriculum, assessment, survey and student data. At the heart of that is presenting information in effective ways, following best practices, to organize, distill and design visualizations that maximize data utility while maintaining the truth contained within. Some of this work is driven by accreditation and quality concerns, but the crux is providing instructors, administrators and programs with the insight needed to develop data-informed continuous improvement practices.
The other day I was provided a sample report that had some visualizations of curriculum mapping data. These samples were created for a workshop provided by the accreditor and are intended to show programs what sort of reporting is required. I’m all for samples that illustrate the implicit requirements of an evaluator. What I didn’t like is that the samples presented the information in a way that disregards effective practice and doesn’t provide an accurate or meaningful representation of the data. The charts don’t lie; they just make it difficult to see the underlying components of the data. They also serve a singular purpose, rather than providing a broad view that can be used for a variety of purposes.
Here’s Goofus’ charts:
Some background: the two-letter codes on the x-axis represent specific outcomes. The I-D-A codes represent the level at which an outcome is developed (I = introduced, D = developed, A = advanced). On the second chart, the semester is given by the numbers 1-8, a full 4-year program comprising 8 semesters.
1) Goofus’ presentation makes it difficult to see the counts for each level within an outcome: you can see relative distributions, but nothing quantitative. There isn’t really a part-to-whole relationship to convey, so stacking the bars in this fashion serves little purpose.
2) Goofus repeats his offense from the first chart: relative sizing makes it difficult to see exact numbers. Additionally, this chart aggregates the I-D-A counts across all outcomes. Why? Just to see how much the program focuses on a particular area? The aggregation hides the detailed information for each outcome.
3) Goofus! Pie charts can be difficult to read, especially with multiple categories being compared; people have difficulty comparing relative sizes based on angle. Goofus also has to be careful with the data mappings, as pie charts are maligned for their frequent errors in percentages. And they don’t really describe the data in a relevant fashion (what does representing it as a percentage actually buy you?).
I also think Goofus organized the charts incorrectly. They appear as Meso-Micro-Macro, which is unintuitive. There are also text and readability issues, but those can be fixed with careful font selection and emphasis.
So I decided to play the part of Gallant and improve them (or at least make what I consider to be improvements).
First, I changed the order to Macro-Meso-Micro, which gives the reader a sense of drilling down through the views. Second, I changed the font to a serif, because I find those easier to read (Gill Sans being an exception due to its purposeful design). Lastly, rather than present these as stark bars, I applied a Tuftian principle and overlaid a white grid. This allows for easier comparison and a clear depiction of the units.
1) Gallant throws away the pie and uses a waffle chart (a square pie chart) to present the data as counts rather than percentages. The reader can still see the proportions, but also has a link back to the absolute counts in the data.
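The waffle chart itself is nearly a one-liner with hrbrmstr’s waffle package. The counts below are invented for illustration, not the program’s real numbers:

```r
library(waffle)

# invented counts of courses at each curriculum level
parts <- c(Introduced = 14, Developed = 9, Advanced = 5)

# one square per course: proportions stay visible, counts stay countable
waffle(parts, rows = 4, title = "Courses by curriculum level")
```

Because each square maps to exactly one course, the reader gets the part-to-whole impression of a pie without losing the underlying counts.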
2) Gallant separates the curriculum levels out, allowing for quick comparison between outcomes and levels (without guesswork or arithmetic). The discretization lets the interested reader see exactly how many courses contributed to the development of an outcome.
3) Gallant uses Tufte’s small multiples, arranged by semester on the horizontal and I-D-A on the vertical. Each multiple plots the number of courses in that semester for each outcome. This gives a more detailed overview of a program: which outcomes were most intensely developed, when they were developed, and how many courses developed them.
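A sketch of that small-multiples layout in ggplot2, including the white-grid-over-bars trick mentioned earlier. The curriculum-mapping data here is randomly generated stand-in data, not the real report:

```r
library(ggplot2)

# toy curriculum-mapping counts, invented for illustration
cmap <- expand.grid(outcome  = c("KB", "PA", "IN", "DE"),
                    level    = c("I", "D", "A"),
                    semester = 1:8)
set.seed(1)
cmap$n <- rpois(nrow(cmap), 2)  # number of courses per outcome/level/semester

ggplot(cmap, aes(outcome, n)) +
  geom_col(fill = "grey35") +
  facet_grid(level ~ semester) +      # I-D-A on the vertical, semester across
  theme_minimal() +
  theme(panel.grid.major.y = element_line(colour = "white"),
        panel.ontop = TRUE,           # draw the white grid on top of the bars
        panel.grid.minor = element_blank())
```

`panel.ontop = TRUE` is one way to get the Tuftian white grid: the grid lines are drawn over the bars, so the units are readable without heavy axis ink.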
I believe that Gallant’s presentation of the data gives the reader more utility and increases the value of a single graphic. It also provides a sense of connectedness, with each graph being a view related to the others.
The moral of this post is: “Learn from those who came before you.” Simply charting the data without first thinking of its intended use, not showing the parts of the whole, and ignoring best practice will make you a Goofus.
So be a Gallant.
Oh, and these charts were created in R, using ggplot2, and arranged with gridExtra. You can see the code in my github repo.
08 Jun 2015
So, IVMOOC has been over for quite some time now. Looking back on the experience, I consider it a personal success. The first half of the course was interesting and gave good space to practise with the tools and refine the skill set needed for visualization. The midterm and final were shenanigan-laden, badly designed assessments, but the instructors have made great strides to address the issues and have committed to improvements.
The second half of the course, the client projects, was where things really picked up for me. Initially I wanted to work on CoBRA, the Comic Book Readership Archive, but couldn’t find a group. I saw that some of the other students who tweeted within the course, Kristin (@cysiphist) and Max Kemman (@MaxKemman), had settled on a project and a group. I asked if the group was full, and quickly joined.
Was that ever a good choice.
This was the first time in many MOOCs that I’ve actually stayed for the project part. Usually I left because I didn’t want the hassle of a virtual team and other members not pulling their weight. That was entirely not the case here: I had lucked into a team of like-minded professionals. Teh-Hen, Dulce, Max and Kristin are amazing people to work with. Each person worked without prompting, in the area of the project they felt most comfortable in. They also picked up slack without question or complaint when life bulldozed over me during the final report write-up. I can’t say enough good things about the group; they made working on the project fun and the course entirely worthwhile. If all MOOC group projects went like this, they would be amazing. This one, at least, has inspired me to stick around for others.
The project was really interesting too: visualizing the networks and relationships represented in Digital Humanities Quarterly. If you’re interested, you can take a look at our work here. Dulce and I worked separately, yet with many parallel approaches to collecting, cleaning and validating the data. I also worked on documenting the process, posting all of our work to the github repo, and creating the front page. The project let me stretch out and gain some new skills in R (web scraping, and a lot of use of dplyr, magrittr, rmarkdown and shiny). I had some plans that didn’t quite make it in, such as a shiny app portraying some of the wordcloud analysis from Dulce and Kristin side by side.
We’ve been asked to allow the project to be used as an example in future years, and DHQ has asked us to work on a paper. Not bad for what I originally had slated as a ‘fun learning experience’. I also managed to walk away with a certificate and a badge for my efforts (break out the digital scout sash). The skills I’ve learned from the course have already translated to my job and other professional work, as I’m finding applications for the user-driven workflow and many of the analysis techniques in assessment and learning in higher education.
I’ve got to wrap this post up, as it’s long overdue. I learned a great deal about information visualization, but perhaps the most important lesson was to put faith in the other people in the course and actually give the group parts of MOOCs a chance. Maybe you’ll end up pleasantly surprised too.