CEEA Reflections


Yes, I get it: I don't post often enough. I'm going to change that, though, and overhaul things around here, for a few reasons:

  1. A better place to showcase my own work
  2. A place for professional reflection
  3. Interaction with the rstats and visualization communities
  4. Personal portfolio

Since I was last here, my kids have grown (!), my wife and I continue on the parenting journey (whee!), I launched my consulting career (woo!), interviewed for a teaching-focused faculty position (unsuccessfully), started working more on building the local R community (yay!), began working with Software Carpentry to promote research computing (oooh!), and put a LOT of development time into miscellaneous projects at work (visualization, applications and data infrastructure). I also put a great deal of time into developing and organizing my R code into a number of libraries that help me work more efficiently.

I don't think they're anything noteworthy outside of Queen's, but they offer other interested parties a way to use what I've built for their own purposes.

As part of my continuing professional development, I won a bid to run a visualization workshop at the Canadian Engineering Education Association's Annual Conference. The conference happened just last week (June 4th to June 7th) in Toronto, Ontario. I love attending CEEA, partly because it's the ONLY Canadian conference in my research area (engineering education) and partly because of the amazing people you meet there.

I have never had a bad time at CEEA. If you're reading this and are interested in engineering education, come to the next meeting in sunny Vancouver (UBC is hosting!).

My workshop was on the fundamentals of data visualization. I used a split lecture/activity format covering the principles of effective visualization (blending Tufte, Cleveland, Bertin, Few and Kirkland with Cairo, Camões, Börner, Evergreen and Kosara). This was for engineers, so they innately understand aspects of design and a user-centered approach, and are quite technically literate (or at least Excel-literate). What I wanted to provide were the elements of theory and practice they are blissfully unaware of: some guidelines, some theory and a workflow for creating their own visualizations.

I was happy with how everything came together, but the hypercritical part of me sees many areas for improvement. There was a lot of interest: I had a full room, well over the 40-person registration limit, and actually ran out of my 2-page guides. Thankfully I put everything into a GitHub repo, which contains the PDF slides and the 2-page guide from my talk. A lot of people came up to speak with me afterwards about my work, and I made some fantastic contacts around visualizing formative assessment data to better engage and support student learning.

The rest of the conference was great. We had a good turnout at the EGAD workshop later in the day and a fantastic reception dinner, which kicked off a great conference. Two great keynote speakers, lots of great lightning talks, and fantastic conversations around engineering education, outcomes assessment and continuous improvement. Fantastic food, dinner at Casa Loma, and more consultation and transparency from the Canadian Engineering Accreditation Board.

Looking forward to next year!

52Vis - Homeless

I've a confession to make: I'm a terrible blogger. I blog out of need and interest, and I've a lot of the latter but little of the former.

"The game has changed." (Clu)

(Yes, Tron: Legacy. Get over it.)

Bob Rudis (@hrbrmstr), a useR, visualization practitioner, data-driven security expert and blogger, has constructed a wonderful set of information visualization challenges for the next year. (There goes my free time.) I was a little late to the party and missed the first week due to massive family sickness, but I'm going to try to follow along and participate in the coming weeks.

This week's task is outlined on hrbrmstr's blog. In short, it's looking at US Department of Housing and Urban Development (HUD) data on the homeless population in the USA and producing a truthful, insightful and effective visualization of the data. This is an emotionally charged data set, and I kept thinking about it all week: not only from the perspective of what questions I wanted to answer with my visualization, but from the perspective of someone who has been fortunate enough to have a home, a family and a great deal of opportunity, and what that means to me.

Some may say that data is just numbers, and that it's hard to connect to. I don't think so. Data always represents something, and in this case it represents real people in a terrible situation. I believe that keeping that in mind connects you to the data, keeps you honest and motivates you.

"That's all I have to say about that." (Forrest Gump)

I explored the data, producing scatterplot matrices and looking for interesting relationships. Then I realized that I wanted to see what was being done to help homeless people. Most US government funding is made public, so I took a walk over to the HUD site, found the Continuum of Care (CoC) Program grant awards by state, and integrated them into the analysis. (I don't know if this is cheating, but I wanted to see the funding and its effect on the homeless population.) This is still VASTLY incomplete and only sheds light on a small sliver of the entire story (support and funding don't fix the problem, but they can help). This line of analysis is nuanced and deep; this little bit of work is a grain of sand in the Sahara.
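The exploration step boils down to a scatterplot matrix; a minimal sketch (assuming a hypothetical `homeless` data frame; the column names are illustrative, not the real ones):

```r
library(GGally)  # ggpairs() draws a scatterplot matrix from a data frame

# `homeless` is a hypothetical data frame of the HUD counts plus the CoC
# grant awards; the column names here are illustrative.
ggpairs(homeless[, c("total_homeless", "sheltered", "unsheltered", "coc_funding")])
```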

I'm a fan of Alberto Cairo and his work. His books are fantastic reads and highly recommended. In "The Functional Art" he shows an image from Época magazine: a scatterplot in which each point represents summary values for a year on separate x and y scales. The points can be connected, illustrating how the relationship changes over the years. I don't know what it's officially called (a connected scatterplot, perhaps), but I call it a temporal scatterplot. I thought this would be a fantastic fit for my intended visualization.

I thought this was best represented at two levels, national and state. This provides the "Big Picture" while also capturing the smaller stories that differ between the states. I believe that if you present only one of these, it's terribly misleading, as the aggregate always smooths out the individual bumps and assumes homogeneity where none exists.

First up, the national "Big Picture". The homeless population and funding data were aggregated by year and charted as homeless population vs. CoC funding, with a spline connecting the yearly points. (Thanks to @hrbrmstr for the ggalt package and the tip on geom_xspline2.)
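The plotting code amounts to something like this sketch (assuming a `national` data frame with `year`, `population` and `coc` columns; the real code used geom_xspline2, but ggalt's plain geom_xspline shows the idea):

```r
library(ggplot2)
library(ggalt)   # provides geom_xspline() for the connecting spline

# `national` is assumed to hold one row per year, with the total homeless
# count (`population`) and total CoC funding (`coc`); names are illustrative.
ggplot(national, aes(x = coc, y = population)) +
  geom_xspline() +                                        # spline through the yearly points
  geom_point() +
  geom_text(aes(label = year), vjust = -0.8, size = 3) +  # label each year
  labs(x = "CoC funding", y = "Homeless population",
       title = "US homeless population vs. CoC funding, by year") +
  theme_minimal()
```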

[Figure: National-level temporal scatterplot]

This shows that, overall, funding has only increased over the years while the homeless population has declined. However, what is going on in 2010, with the increase in the homeless population? This is where further research and additional analysis can shed some light on the situation. My initial cursory research reveals that this was the year the Iraq war ended, the year of the 'flash crash', and the bottom of the post-recession jobs market. That's a lot of candidate causes, should someone with more expertise want to keep going.

Next up, the state-level results. The homeless population and funding data were aggregated by year and state, normalized to state population, and displayed per 100,000 people. They are plotted like the national level, with the addition of small multiples for each state, arranged according to the funding received.
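A sketch of the faceting, assuming a hypothetical `by_state` data frame with `year`, `state`, `homeless`, `coc` and state population (`pop`) columns:

```r
library(dplyr)
library(ggplot2)
library(ggalt)

# Normalize both measures to rates per 100,000 residents, then order the
# states by total funding so the facet layout itself carries information.
per_capita <- by_state %>%
  mutate(homeless_rate = homeless / pop * 1e5,
         coc_rate      = coc / pop * 1e5) %>%
  mutate(state = reorder(state, coc_rate, FUN = sum))

ggplot(per_capita, aes(x = coc_rate, y = homeless_rate)) +
  geom_xspline(size = 0.3) +
  geom_point(size = 0.6) +
  facet_wrap(~ state, scales = "free") +  # one small multiple per state
  labs(x = "CoC funding per 100,000 people",
       y = "Homeless population per 100,000 people") +
  theme_minimal()
```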

[Figure: State-level small multiples]

This illustrates a vastly different picture than the national level. It shows wild swings in both funding and homeless population and directly challenges the big-picture view of throwing money at a problem. Look at the first three panels: Rhode Island, Vermont and Alaska. Something really odd is happening there that warrants a closer look. Ultimately, these differences and odd relationships support my initial statement that this is far more complex and requires a great deal more thought, investigation and expertise than I can provide. It also directly challenges an unstated assumption of my own: that funding and homeless population are in-phase variables, when most likely they are lagged in a cyclical cause-effect loop.

I also thought that both might be better served as interactive documents, so I added some YAML and roxygen comments so that rmarkdown::render() (which calls knitr::spin()) with the Shiny runtime produces a simple interactive HTML document directly from the R code. The ggplot code to produce the static images is there too; you can change the include options to show either the static or the interactive version, but be sure to uncomment the actual call to the plot object.
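For anyone unfamiliar with the spin workflow, the header looks roughly like this (a minimal sketch; the title and file name are made up):

```r
#' ---
#' title: "52Vis - Homeless"
#' output: html_document
#' runtime: shiny
#' ---

#' Narrative text lives in `#'` comments; chunk options go on `#+` lines.
#+ echo=FALSE, include=TRUE

# Rendering the .R file directly calls knitr::spin() under the hood:
# rmarkdown::render("homeless.R")
```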

You can view the interactive version on my shinyapps.io account.

If you're interested in seeing what I did, simply go to my GitHub (link in the sidebar) and check out the code. I've included PNG and SVG versions, and all of the data.

All in all, while I enjoyed the challenge, it gave me far more to think about than simply visualizing information.

Goofus and Gallant: Charts

For the past couple of years I've been delving into the world of information visualization, partly due to intense interest, partly due to job responsibilities. I've read a lot, taken a lot of courses, and have been working to polish my skills along the way. Having a design-focused background (engineering), coupled with a broad foundation in data analysis and statistics (PhD) and my own modest skills in graphics and typography, has helped quite a bit.

To make a long story short, part of my role is helping engineering programs cope with large amounts of curriculum, assessment, survey and student data. At the heart of that is presenting information in effective ways, following best practices, to organize, distill and design visualizations that maximize data utility while maintaining the truth contained within. Some of this work is driven by accreditation and quality concerns, but the crux is providing instructors, administrators and programs with the insight needed to develop data-informed continuous improvement practices.

The other day I was given a sample report containing some visualizations of curriculum mapping data. The samples were created for a workshop provided by the accreditor and are intended to show programs what sort of reporting is required. I'm all for samples that illustrate the implicit requirements of an evaluator. What I didn't like was that the samples presented the information in a way that disregards effective practice and doesn't provide an accurate or meaningful representation of the data. The charts don't lie; they just make it difficult to see the underlying components of the data. They also serve a singular purpose, rather than providing a broad view that can be used for a variety of purposes.

Here are Goofus' charts:

[Figure: Goofus' charts]

Some background: the two-letter codes on the x-axis represent specific outcomes. I-D-A represents the level at which an outcome is developed (I = introduced, D = developed, A = advanced). On the second chart, the semester is given by the numbers 1-8, with a full four-year program comprising 8 semesters.

1) Goofus' presentation makes it difficult to see the counts for each level within an outcome: you can see relative distributions, but nothing quantitative. There isn't really a part-to-whole relationship to convey, so presenting the bars in this fashion serves little purpose.

2) Goofus repeats his offense from the first chart: relative sizing, with exact numbers difficult to read. Additionally, his chart aggregates the I-D-A counts across all attributes. Why? Just to see how much the program focuses on a particular area? The aggregation hides the detailed information for each outcome.

3) Goofus! Pie charts can be difficult to read, especially with multiple categories being compared, because people have difficulty comparing relative sizes based on angle. Goofus also has to be careful with the data mappings, as pie charts are maligned for frequently mis-totalled percentages. And they don't really describe the data in a relevant fashion (what does representing it as a percentage actually buy you?).

I also think Goofus organized the charts incorrectly. They appear as meso-micro-macro, which is unintuitive. There are also text and readability issues, but those can be fixed with careful font selection and emphasis.

So I decided to play the part of Gallant and improve them (or at least make what I consider to be improvements).

[Figure: Gallant's charts]

First, I changed the order to macro-meso-micro, which gives the reader a sense of drilling down through the views. Second, I changed the font to a serif, because I find those easier to read (Gill Sans being an exception, due to its purposeful design). Lastly, rather than present these as stark bars, I applied a Tuftian principle and overlaid a white grid. This allows for easier comparison and a clear depiction of the units.
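In ggplot2 terms, the white-grid overlay is just a theme tweak; a sketch (the `df`, `outcome` and `n` names are hypothetical):

```r
library(ggplot2)

# Overlay the grid on the bars: panel.ontop = TRUE draws the panel grid
# above the data layer, so the panel fill must be left transparent.
ggplot(df, aes(x = outcome, y = n)) +
  geom_col(fill = "grey40") +
  theme_minimal() +
  theme(panel.ontop = TRUE,
        panel.background = element_rect(fill = NA),
        panel.grid.major.y = element_line(colour = "white"),
        panel.grid.minor = element_blank())
```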

1) Gallant throws away the pie and uses a waffle chart (a square pie chart) to present the data as counts rather than percentages. The reader can still see the proportions, but keeps a link back to the absolute counts of the data. (A code sketch follows below, after point 3.)

2) Gallant separates the curriculum levels, allowing for quick comparison between outcomes and levels (without measuring or arithmetic). The discretization lets the interested reader see exactly how many courses contributed to the development of an outcome.

3) Gallant uses Tufte's small multiples, sketched below: the multiples are arranged by semester on the horizontal and I-D-A level on the vertical, and each multiple plots the number of courses in that semester for each outcome. This gives a more detailed overview of a program: which attributes were more intensely developed, when they were developed, and how many courses developed them.
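For the waffle chart in point 1, @hrbrmstr's waffle package makes this straightforward; a sketch with made-up counts:

```r
library(waffle)

# Hypothetical counts of courses at each curriculum level
parts <- c(Introduced = 14, Developed = 21, Advanced = 9)

waffle(parts, rows = 5,
       title = "Courses by curriculum level",
       xlab = "1 square = 1 course")
```

And the small multiples in point 3 map naturally onto a facet grid; a sketch assuming a hypothetical `courses` data frame with one row per course/outcome pairing:

```r
library(ggplot2)

# `courses` has one row per course/outcome pairing, with `level` in I/D/A
# and `semester` in 1-8; all names here are illustrative.
ggplot(courses, aes(x = outcome)) +
  geom_bar() +                    # counts the courses per outcome
  facet_grid(level ~ semester) +  # I-D-A on rows, semester on columns
  labs(x = "Outcome", y = "Number of courses") +
  theme_minimal()
```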

I believe that Gallant's presentation of the data gives the reader more utility and increases the value of a single graphic. It also provides a sense of connectedness, with each graph being a view related to the others.

The moral of this post is: "Learn from those who came before you." Simply charting the data without first thinking of its intended use, not showing the parts of the whole, and ignoring best practice will make you a Goofus.

So be a Gallant.

Oh, and these charts were created in R using ggplot2 and arranged with gridExtra. You can see the code in my GitHub repo.