My Journey To Data Visualization

At the Digital Humanities Winter School (Pune, 2018), conducted by Dr. Dhanashree Thorat of the Center of Digital Humanities Pune, I became a sort of specimen among all the humanities and liberal arts academicians. At one point, one professor, remarking on the rising interest in Digital Humanities, even pointed out:

There are so many children pushed into science and engineering streams in India, who would rather become journalists, musicians, etc and end up in Digital Marketing just to make use of the skills they earned in school. They are the ones who will take over as Digital Humanists, so where does that leave the Humanities scholars?

Guilty on all charges.

Read about my experience at my first ever DH-inclusive Conference here.

Needless to say, that hit a lot closer to home. I said so and laughed it off. I will be making another post on the job opportunities that come with being a Digital Humanist, but the particular case study I want to scrutinize in my Data Viz post is that of Elijah Meeks. I first ran into this Stanford Chinese History scholar turned alt-ac Digital Humanist turned corporate data visualizer on a Digital Humanities subreddit AMA. At the time, he was already working on Netflix’s Data Visualization team, so he was answering questions on Data Analysis, Data Science, and Data Visualization in a commercial setting, but I had also noticed that the Stanford DH blog had a lot of his entries on his Gephi and D3.js adventures. So I followed along. When I was not listening to the First Draft podcast (hosted by Meeks, Jason Heppler, and Paul Zenke), I was reading the Stanford blog and trying to figure out my angle for my own blog post on this increasingly complex and vast topic.

Image courtesy: Isaac Smith

The podcast featured discussions on how “Tableau is just Excel for 2015” and how the fan culture surrounding it exists only because it allows people to impress teachers and bosses with all its “slicing and dicing of data”. One episode focused entirely on the differences between the academic (alt-ac) and commercial sides of Data Visualization. The Stanford blog gave me preliminary knowledge of network analysis, and specifically of Gephi as a form of Data Visualization.

At the same time, I was working my way through FreeCodeCamp’s Data Visualization certificate. As with all of my FreeCodeCamp experience, the instruction modules were challenging but workable, while the projects themselves were a mammoth beast I simply could not conquer. The Read-Search-Ask method only gave me what other people were doing to pass the tests, but not why it works only in such-and-such order. It was not the first time, and will not be the last time, I wished I had someone to guide me through this frustrating mess. Yes, yes, I can ask the forum and they’re all very kind, but how could I ask for an explanation of an entire block of code for each project?  (O_O)

However, I did receive some respite in the form of Meeks’ textbook D3.js in Action. This was the first time I was supplementing interactive learning with some textbook knowledge and it made a world of difference for me.

Image courtesy: Manning

Then I had the absolutely bright (please note the sarcasm) idea of downloading a couple of data sets, installing Gephi, Tableau, and Power BI, and trying to replicate the visualizations in each of them and with D3.js while learning all these tools along the way. Apparently, I am a glutton for unpaid, knowledge-enhancing, interesting-looking work, so I decided to go through the Kaggle track courses on Python, Pandas, Machine Learning, Data Science, and Data Visualization, then also apply the data sets in Python. Re-doing it with R is where I drew the line. Phew.

So I went ahead and did just that.

What’s Data Visualization got to do with Digital Humanities?

I learned so much about the value of storytelling with data, though I didn’t exactly apply it in these small projects. There’s always so much more to learn. In hindsight, it was way too ambitious of me to try to incorporate all of Data Visualization in one post.

Here’s a sneak peek.

The above info-graphic includes a list of prominent people in the Data Visualization field along with a couple of meta-jokes that probably only I will get…

Here’s my secret: I confess I started learning Data Visualization, only because it looked cool. I love looking at stats, knowing how many books I’ve read, what was my rating for them on an average, and other random simple stats. I still believe I actually made it to 38,000+ words in NaNoWriMo 2018, only to watch the simple word count graph climbing up slowly. That is not to say I like numbers. I have an uneasy relationship with mathematics most of the time. That’s precisely why the charts and graphs fascinate me. Bunch of boring numbers, possibly fascinating insights! Stats laid out and interesting connections heretofore unknown! It also seemed to have more tangible results in the form of a Digital Humanities project. As in, I could sit down and see the different kind of Doctors, connected via the monsters and their motivations, with a click of a button… Or so I thought. (Yes I found a data set which had just that.)

The flip side is, to put it simply, I am no good at art either. The design side of it doesn’t come naturally to me. So far, my strategy has been: I just mess around and hope I hit on something that looks good. Then I realized the line between gimmicky, pointless graphs and visually clean graphs is actually not well defined. In Susie Lu’s classification of Data Visualization practitioners, I would probably want to be one of “The Italians” but I know I am more likely to be “The Fun Freelancer”.

In my content creation and digital marketing job, one of my favorite tasks was to make infographics of my blog posts. Of course, infographics are definitely not data visualization projects. In fact, infographics are subjective, with a definite story arc: a beginning, a middle, and an end. Data visualizations are data-driven and open-ended, leaving the viewer to draw their own conclusions. Infographics might not even draw on a large range of data sets, as is the case with data viz. While infographics are used for marketing, blogging, résumé, and case study purposes, data viz is more likely to be applicable in reports, editorials, and newsletters.

While Digital Humanities is the intersection of digital technology and humanities research, Data Visualization is a perfect amalgamation of the two. In Zhao Kaidi’s paper on Data Visualization from The School of Computing, National University of Singapore, Data Visualization is the process of finding connecting patterns, over-arching trends, comparative ratios, blink-and-miss relationships, and overall understanding of a system with its data set using computer graphics.

Cave drawings are often cited as the earliest data visualizations, but I personally do not understand how a cave drawing is numerical data. It simply is not data visualization, however visual it may be. The case can be made for hieroglyphs, ideograms, emojis, etc., but again, they are not on the same scale. Languages and alphabets also cannot really count as data visualization since they are symbols of communication at a much more personal scale.

Image Courtesy: Wikimedia Commons

Data Viz is not Scientific Visualization. It harnesses the human skill of pattern recognition to make sense of a large set of numerical data.

In recent years, there has been a rise of BuJos – Bullet Journals which help track daily, monthly, and yearly productivity, mood, activities, etc. in simple graphs. A complicated system of symbols and colors makes up the keys and legends. They are more personal data than say, a fitness tracker or a calendar on your phone, and even smaller-scale than crucial business decisions. Data powers the experience but the drive is always human stories.

Image courtesy: Matt Ragland

I absolutely love the idea of mapping out daily life in small charts using such graphs to get a lens on a person’s life. The story that is told via data analysis for humans is more nuanced, contextualized, and intricately… It is the difference between blindly following data and letting your own logical assumptions guide the data so that it makes sense and you can take decisive action accordingly. This is part of Data Humanism. There are certain traits that count as wholly human. Philip K. Dick would have you believe that is empathy (Do Androids Dream of Electric Sheep?) but I would argue that imagination, contextual inquiry, and human connection are required to make not just ethical and moral, but also efficient use of data.

The science of data visualization includes cognitive psychology, color theory, design, and neuroaesthetics. This by itself makes it a perfect fit within Digital Humanities. In computer terms, the processing power of our brain is largely taken up by our sight. It commands the highest bandwidth; in fact, the same as an entire computer network. If you are curious, for reference’s sake, touch occupies the same bandwidth as a USB key, our hearing takes up the bandwidth of a hard disk, and finally, taste takes up that of a meager calculator.

There are several “And… Boom!” moments in this video and several other data visualization talks, but what strikes me the most is that there is someone interpreting the chart. The boxes, lines, colors, squiggles going up-down-up-down-left-right-left-right… There is a reason for each one. The understanding of this data comes with constantly asking, “Why?” Why was there a gap in fear mongering in media during September 2001? Why is the scale of money owed and forgiven so vastly ‘unexplainable’? At the same time, there is more satisfaction of a concept understood when we “figure” it out on our own, rather than being spoon-fed by somebody. We are more likely to be understanding of any inferences we make as opposed to having it shoved in front of us.

There’s this one iconic scene in The Little Prince where the Little Prince asks the narrator, a pilot, to draw him a sheep. He complains about each of the sheep drawn (old, fat, small) until the pilot just draws a box and says the sheep is inside the box and all the prince needs to do is imagine the sheep. The Little Prince, who is the metaphor for the pilot’s inner child, gets it right away and is delighted. The point of this anecdote is that all visualization is always open to interpretation. Collaborative data visualization is the synergy of creativity, imagination, and learning what works.

I also want to note that given how much content we consume from all around the world, we all have a dormant design literacy within us. It took me a while to realize it, but when I was a website content creator, I instinctively knew what kind of spacing was appropriate or what the correct image sizing was. I was by no means great at it immediately; I have never done anything beyond a few HTML / CSS projects and still struggle with getting it right, but I knew what could work.

Numerical data is difficult to understand, especially at a large scale. On the other hand, the psychology of data viz evidently touches the sore spot, to enable better-informed business decisions. Just as in Digital Marketing, the benefits of Data Visualization are based on cognition, perception, and persuasion. Good copy or a good visual is easily understandable, ‘parsable’, and convinces potential decision-makers of a specific path. Again, as in the case of Digital Marketing, we are pushed to express ourselves in the language of the eye. All 2,000+ words must have images, graphs, infographics, stats; the more videos, the lower the bounce rate! Speaking of which…

I made an infographic on Data Visualization. A Visualization on Data Viz would be where it gets really meta…

Humans are often said to have a shorter attention span than a goldfish. Data Visualization should therefore favor pre-attentive processing over attentive processing. Visuals even work better than words, mainly because visual imagery crosses cultural and language barriers more easily. There’s less room for misinterpretation even as people reach their own conclusions.

Vilayanur S. Ramachandran’s 8 laws of Aesthetic Experience:
There are quite a few things I learned to keep in mind when it comes to the visuals of a visualization. Memory is known to be a sensory trigger. Color specifically highlights the more important sections of the memory and thus in Data Visualization helps to compare or contrast metrics. Contrasting colors especially create greater impact for us than multiple colors. The aesthetic experience of a visual is also given a semantic color association with certain motifs. Designers keep accessibility for color-blindness issues in mind as well.

Image courtesy: M. B. M.

While I consider most of Data Visualization, especially given the subject matter of each project, to fall largely under the Digital Humanities spectrum, it would interest me even otherwise. Specifically, I am drawn to the analysis side of the Digital Humanities. Digital Humanities is a cycle of collection, archival, analysis, and visualization of data, and I would like to be working on the later stages of that cycle. For this reason, I find the collecting of data sets more tedious than figuring out the connections.

No doubt, many will say I am wrong, but I spent hours searching for the right data set with right components, only to clean or curate them some more before I actually started even visualizing them, only to realize this is not what I wanted.

Let the dataset change your mindset.

– Hans Rosling

On the whole, Data Visualization gives a data-driven persuasion, decision, and action backed by evidence. The focus is on practical over theoretical.

Applications of Data Visualizations

How is Data Visualization used?

Chris Bailey from Leeds Metropolitan University in his Introduction: Making Knowledge Visual “re-visualized” Visual Culture as the following:

  • Data Capture
  • Data Investigation
  • Data Analysis
  • Data Modelling
  • Data Presentation
  • Data Dissemination

These steps were covered in the Digital Humanities Winter School, where we had a couple of workshops on Omeka, Timeline.js, and Storyline.js.

I had heard a lot about George Mason University’s Omeka, and the ‘.js’ got me thinking we might even be doing a little bit of programming. While these were not particularly coding-intensive, they are obviously very useful to Digital Humanists and Humanities researchers.

We divided into a dozen groups and set up our laptops to get down to business. Our task of the hour was to find a couple of tourist spots around the city of Pune. My partner-in-project, a honey-voiced SuperLinguist scholar-in-training told me her list of unconventional (read: Non-crowded, not-as-popular because not-as-upper-caste) tourist spots and we input this into our Omeka storyboard.

Questions asked by the participants:
Why is filling up the metadata fields important?
Metadata is data about data. This is an archival effort.
Moreover, if we want to make a textual analysis or data visualization of the collection in the storyboard, metadata is the all-important content required for a successful gathering of results. 

Which should we use? Omeka.com vs Omeka.net vs Omeka.org?
Messing around for personal use?  Omeka.com
Working on a large-scale project? Pick a plan on Omeka.net
Looking for customization and open data options? Pick Omeka.org (Omeka S)

(At this point, my old laptop’s poor keyboard had given out in parts and I had to connect it to a virtual keyboard on my phone. The whole setup was ridiculous but thankfully SuperLinguist went along with my mess and I ended up having bucket-loads of fun.)

Next, we used Storymap.js to plot the same tourist spots on maps, with slides accompanying them. (Cultural Heritage sites in Pune using Storymap.js) My favourite examples from the site are obviously a guided tour of The Garden of Earthly Delights and Arya Stark’s journey across The Seven Kingdoms.

One of the student presenters created a record of all the botanical logs from the local research institute with Timeline.js. It is a very easy-to-use tool. A clean Excel sheet of data with the dates and titles ensured she had a clean, sophisticated-looking timeline.

Dr. Dhanashree Thorat also introduced me to this one-stop DH tools website TAPoR, which is an exercise in data viz as well. All those coloured dots are different Data Viz tools and I am just thankful I didn’t set my mind out to learn all of them.

Data Visualization Tools

Gephi: Network Analysis Toolkit

I was initially very confused by all the buttons. (Okay, I still am. I couldn’t get Gephi to install even after I installed Java, so I also installed the Java environment.) But I managed to push through despite it feeling very complicated. Gradually, it felt like MS Paint on steroids. I first used the sample data set on Les Miserables characters, then I used a census on Indian socio-economic status from data.world.

It seemed quite… ordinary. To switch things up, I used this Doctor Who data set. Once it came to view, it looked mesmerizing. Immediately I noticed that the number of episodes for each Doctor was wildly varying, but the monsters in each episode were overlapping. I changed the node size for each of them according to their modularity. Now I could see the overlaps clearly. When I filtered the data, it looked less messy. 

Tableau

I tried out Tableau because it seemed very popular with business enterprises giving out more sophisticated PowerPoints in their board meetings, or so I imagine. It looks sleek and there are dozens of tutorials.

I downloaded data on the books I read over the years from Goodreads and “cleaned” it up a little on good ol’ Excel alternative Google Sheets. My initial objective was only to find out the highest number of books and pages I read over the last 10 years. Then I wanted to check if the year for the two could be the same. Therefore, I picked a simple bar graph but I admit, I did go a little crazy and pick a treemap, pie chart, and region map before settling into the more mundane and comfortable bar graph.

Tableau is much more user-friendly, in that the buttons and words are more familiar. The dimensions and measures are already sorted out. There are mathematical terms for calculations, but those are also easy to get the hang of. The x-coordinate and y-coordinate can be fixed by simply dragging and dropping the column headings from the data set. I found it easier to just use SUM(Number of Pages) and SUM(Read Count) for the purposes of my objective.
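For comparison, the same two sums can be sketched in pandas. This is only an illustration under assumptions: the rows are invented, and the "Year Read" column is a hypothetical stand-in for however the year is stored in the cleaned-up sheet.

```python
import pandas as pd

# Invented sample rows; "Year Read" is a hypothetical column name.
books = pd.DataFrame({
    "Year Read": [2016, 2016, 2017, 2017, 2017, 2018],
    "Number of Pages": [300, 450, 200, 250, 280, 900],
    "Read Count": [1, 1, 1, 1, 1, 1],
})

# Equivalent of SUM(Number of Pages) and SUM(Read Count) per year.
totals = books.groupby("Year Read")[["Number of Pages", "Read Count"]].sum()

# The year with the most pages and the year with the most books.
top_pages_year = totals["Number of Pages"].idxmax()
top_books_year = totals["Read Count"].idxmax()

print(totals)
print(top_pages_year, top_books_year)
```

In this toy data the two years differ, which is exactly the kind of question the bar graph was meant to answer.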

The different colors on each bar represent each book, along with its number of pages. But I wanted to see the number of pages I read, and which book had the highest number of pages. Here’s a zoomed-out version of that graph.

Zoomed-in:

Now I wanted to use a bigger data set, use more calculations, and use other graphs.

Enter a show-stoppin’ Broadway dataset.

I found a bunch of figures on musicals from Broadway.com. My objective here was to find the highest amount made by all musicals and plays in the last 5 years, the highest number of audience in attendance for these, and how long each show ran in the theaters.
However, they were not in .csv format, so here’s what I did.

1. Make sense of the data

2. Click on Data from the toolbar.

3. Select Split text into columns

4. Since the values were separated by a Space, I selected exactly that as the Separator.

5. Import the sheet to Tableau. As you can see, the data is more intelligible.
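For anyone who would rather script this step than click through a spreadsheet, the same split can be sketched in pandas. This is a minimal sketch under assumptions: the sample rows and column names are invented (they are not the Broadway figures), and it assumes no individual value contains a space.

```python
import io
import pandas as pd

# Invented sample rows standing in for space-separated figures.
raw_text = """show week gross attendance
ShowA 1 1200000 10500
ShowB 1 950000 9800
ShowA 2 1150000 10100
"""

# sep=" " plays the role of the Separator chosen in the spreadsheet.
shows = pd.read_csv(io.StringIO(raw_text), sep=" ")

# Convert to comma-separated text, ready to be saved as a .csv file.
csv_text = shows.to_csv(index=False)

print(shows)
print(csv_text)
```

With a real file, passing a path instead of io.StringIO(raw_text) to read_csv, and a path to to_csv, would read and write files on disk.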

After messing around A LOT here is the storyboard for Broadway shows of the last 5 years.

The individual graphs:

D3.js (FreeCodeCamp)

The Data Visualization certification in FreeCodeCamp includes JSON, AJAX, API, and D3.js projects. The data sets are provided by FCC and so are the user stories, of course.

Notes on D3.js from D3.js in Action by Elijah Meeks

D3.js is used in Research, Big Data dashboards, and to find patterns in data for Business Intelligence.

  • D3 is Selecting & Binding
  • Selecting is the action of picking a group of web page elements on which actions can be performed.
  • “Binding” combines data with the web page elements
  • Selections are a group of one or more web page elements that may be associated with a set of data.
  • “Binding data” is a selection of web page elements and a corresponding set of data.
  • <g> is a group element with no graphical representation, that is not in bounded space. For example, <g> is used to move label+circle at the same time
  • Method Chaining / Function Chaining works because setter methods return the selection they were called on, so calls can be strung together; .data(), for example, binds each element in a selection to an item in an array
Image courtesy: Markus Spiske
D3
Weaknesses
Strengths
Not a simple charting library
Data-driven graphics
More than graphs+charts
Create vector graphics
Geo-spatial+Network viz
High level of interactivity
Create original techniques.
Abstraction+Syntax for maps+dynamic text content+data viz = All-in-one
Does not depend on obsolete web browsers

The book incorporates links to interactive graphs directly in eBook versions and gives a detailed understanding of both Data Viz and D3.js principles overall.

Python (Kaggle)

Kaggle is a website for Data Science enthusiasts. There are competitions with a lot of money to be won, hundreds of data sets, and even a short curriculum. The lessons start out simple, especially if you have a summary sheet on another tab.

As with FCC, I did not get the hang of it until weeks later really. I slogged my way through Machine Learning, whizzed past Python, floundered around Pandas, and made it to Data Visualization finally.

Here are the most important commands (because I simply cannot remember all of them, I will try to put together a collection of the commands only required for Data Visualization)

First we have to import pandas and initialize the data set:

import pandas as pd
reviews = pd.read_csv("file.csv", index_col=0)

index_col tells pandas which column to use as the row index (here the first column, 0) instead of the default numbering 0, 1, 2, 3…
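To see what index_col changes, here is a minimal sketch that reads the same text twice. The three-row file and its column names are invented for illustration.

```python
import io
import pandas as pd

# Invented sample file whose first column is an ID.
csv_text = """id,title,points
101,Alpha,87
102,Beta,92
103,Gamma,90
"""

# Without index_col, pandas numbers the rows 0, 1, 2, ...
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column ("id") becomes the row labels.
id_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_index.index.tolist())  # [0, 1, 2]
print(id_index.index.tolist())       # [101, 102, 103]
```

Note that once "id" is the index it is no longer one of the regular columns, so rows are looked up by label, e.g. id_index.loc[102].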

Bar chart

dataset['column'].value_counts().head(n).plot.bar()
  • value_counts() returns how many times each unique value occurs, most frequent first
  • head(n) keeps the first n rows of that result, i.e. the n most frequent values
  • plot.bar() draws the result as a bar chart using matplotlib

This gives the absolute counts, but what we usually want is each value’s share of the whole data set. In that case, we have to divide by the total number of rows.

(dataset['column'].value_counts().head(n) / len(dataset)).plot.bar() 

Here we take the counts of the n most frequent values in a column and divide each by the length of the dataset. Wow, explaining this is weird.
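The difference between the two versions is easier to see on a tiny example. This is a sketch with invented data; 'column' and n mirror the placeholders used in the snippets.

```python
import pandas as pd

# Invented toy data: 10 rows, 4 distinct values.
dataset = pd.DataFrame({
    "column": ["red"] * 4 + ["blue"] * 3 + ["green"] * 2 + ["white"],
})

n = 3

# Absolute counts of the n most frequent values.
counts = dataset["column"].value_counts().head(n)

# The same counts as a share of all rows in the dataset.
shares = counts / len(dataset)

print(counts)
print(shares)
```

Because head(n) drops the rarer values, the shares do not have to add up to 1. Calling counts.plot.bar() or shares.plot.bar() would then draw the charts, provided matplotlib is installed.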

Line chart

dataset['column'].value_counts().sort_index().plot.line() 
  • sort_index() sorts the counts by their labels (the index), so the line runs in order along the x-axis rather than by frequency.
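A small sketch of why the sorting matters for a line chart, using invented ratings as the data:

```python
import pandas as pd

# Invented toy data: ratings on a 1 to 5 scale.
dataset = pd.DataFrame({"column": [3, 5, 3, 4, 3, 5, 1, 3, 4, 5]})

# value_counts() orders by frequency: the most common rating comes first.
by_frequency = dataset["column"].value_counts()

# sort_index() reorders by the rating itself, which is what a line needs.
by_label = by_frequency.sort_index()

print(by_frequency.index.tolist())  # [3, 5, 4, 1]
print(by_label.index.tolist())      # [1, 3, 4, 5]
```

Plotting by_frequency as a line would zigzag back and forth along the x-axis; by_label.plot.line() moves steadily from the lowest rating to the highest.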

Area chart

dataset['column'].value_counts().sort_index().plot.area()

Interval Variable and Histogram

For an interval variable, the difference between two values is a meaningful quantitative measurement, so it can be mapped on different charts.

dataset[dataset['column'] < 200]['column'].plot.hist()

A histogram is a bar chart in which each bar covers a range of values (a bin) rather than a single value.
The main disadvantage is that skewed data won’t be represented well: a few extreme values stretch the bins so far that everything else is squashed into one bar. To avoid this, possibly essential data has to be cut off from the whole set, which is what the < 200 filter above does.
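The effect of skew can be sketched with numbers alone. The data below is invented, and the example uses numpy.histogram with 5 bins to keep the output short (plot.hist() defaults to 10 bins); it illustrates the idea rather than reproducing the plot exactly.

```python
import numpy as np
import pandas as pd

# Invented toy data: most values are small, one outlier is huge.
dataset = pd.DataFrame({"column": [10, 15, 20, 25, 30, 40, 55, 80, 120, 3000]})

# With the outlier included, 5 equal-width bins span 10 to 3000,
# so nearly every value falls into the first bin.
all_counts, all_edges = np.histogram(dataset["column"], bins=5)

# Filtering to values below 200 spreads the bins over the bulk of the
# data, at the cost of dropping the outlier.
trimmed = dataset[dataset["column"] < 200]["column"]
trimmed_counts, trimmed_edges = np.histogram(trimmed, bins=5)

print(all_counts.tolist())      # [9, 0, 0, 0, 1]
print(trimmed_counts.tolist())  # [5, 1, 1, 1, 1]
```

The first histogram says almost nothing about the nine ordinary values; the second shows their spread but silently leaves out the tenth.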

So Much More

Of course, I am nowhere close to even finishing the first few tasks I had in mind in time for this post.

What should I have included in this post that a beginner data visualizer would want to know more about? Please let me know! I would love to learn more. My eyes are currently set on Palladio and Flourish thanks to Miriam Posner.
