Data visualisation is widely used in industry today. A good visual makes it much easier to communicate important insights and to support business decisions than looking at the raw data does.
But visualisations need to be used carefully, as there are many ways they can go wrong. Using the wrong type of visualisation not only makes it harder for the reader to absorb the information, but can also lead them to misinterpret it.
Choosing the right type of visual depends on the data or insights that need to be presented. There is a fine line between making a visual pretty and making it meaningful. For example, overusing colours can make a visual confusing, and trying to show too much information on a single visual is generally not good practice.
Please note that the data used to generate the visualisations in this blog is either fake or open-source.
Good practices to follow while generating a visualisation
But what actually makes a good visual? What should almost every visualisation include?
1. Axis Labels
Axis labelling is crucial in any type of visualisation. The labels must be meaningful, as they help the reader understand the data or insights shown on the visual. Units need to be included wherever applicable.
I will consider three types of visualisations in this blog, namely bar charts, histograms, and scatter plots.
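As a minimal sketch of the point about labels and units, here is how a bar chart could be labelled with matplotlib. The choice of library, the product names, the revenue figures, and the function name `labelled_bar_chart` are all assumptions made up for illustration; they are not the data or code behind the plots in this blog.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt


def labelled_bar_chart():
    """Draw a bar chart with descriptive axis labels (made-up data)."""
    products = ["A", "B", "C", "D"]            # hypothetical categories
    revenue = [12_500, 9_800, 15_200, 7_300]   # hypothetical values in GBP

    fig, ax = plt.subplots(figsize=(6, 4))
    ax.bar(products, revenue)

    # Say what each axis shows, and include the unit where one applies.
    ax.set_xlabel("Product")
    ax.set_ylabel("Monthly revenue (GBP)")
    return fig, ax


fig, ax = labelled_bar_chart()
# fig.savefig("axis_labels.png") would write the chart to disk.
```

Without the y-axis label, a reader could not tell whether the bars show revenue, units sold, or something else, nor which currency is meant.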
2. Title
A title might seem trivial, but it helps a lot in adding context to the visual. When the reader looks at the plot, they will most likely look at the title first (if there is one!). With a descriptive title, the reader immediately knows what the visualisation is about.
Using the visualisations from the previous sub-section:
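In code, adding a title is typically a single call. The following is a rough sketch with matplotlib, assuming a small made-up dataset of heights and weights; the function name `titled_scatter` and the numbers are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt


def titled_scatter():
    """Draw a scatter plot with axis labels and a descriptive title."""
    heights_cm = [150, 160, 165, 172, 180, 188]  # made-up values
    weights_kg = [52, 58, 63, 70, 79, 86]        # made-up values

    fig, ax = plt.subplots(figsize=(6, 4))
    ax.scatter(heights_cm, weights_kg, marker=".")
    ax.set_xlabel("Height (cm)")
    ax.set_ylabel("Weight (kg)")

    # The title states what is plotted and where the data comes from.
    ax.set_title("Weight against height for six people (made-up data)")
    return fig, ax


fig, ax = titled_scatter()
```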
3. Formatting
Using the bar chart from the previous sub-section, some formatting has been applied to improve the plot:
- Gridlines can be useful in some cases to help the reader read and compare the values.
- The y-axis values have been converted to a more intuitive format.
- The value of each bar has been added on top of it.
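The three formatting steps above could look roughly like this in matplotlib. This is a sketch under assumptions: the data is made up, "more intuitive" is interpreted here as showing thousands with a "k" suffix, and `formatted_bar_chart` is a hypothetical name. `Axes.bar_label` requires matplotlib 3.4 or later.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter


def formatted_bar_chart():
    """Bar chart with gridlines, readable y-axis values and bar labels."""
    products = ["A", "B", "C", "D"]            # hypothetical categories
    revenue = [12_500, 9_800, 15_200, 7_300]   # hypothetical values in GBP

    fig, ax = plt.subplots(figsize=(6, 4))
    bars = ax.bar(products, revenue)

    # Light horizontal gridlines, drawn behind the bars.
    ax.grid(axis="y", alpha=0.3)
    ax.set_axisbelow(True)

    # Show 10000 as "10k" on the y-axis (one possible "intuitive" format).
    ax.yaxis.set_major_formatter(
        FuncFormatter(lambda value, pos: f"{value / 1000:.0f}k")
    )

    # Write each bar's value just above the bar.
    labels = [f"{value / 1000:.1f}k" for value in revenue]
    annotations = ax.bar_label(bars, labels=labels, padding=3)

    ax.set_xlabel("Product")
    ax.set_ylabel("Monthly revenue (GBP)")
    ax.set_title("Monthly revenue by product (made-up data)")
    return fig, ax, annotations


fig, ax, annotations = formatted_bar_chart()
```

Keeping the gridlines faint (`alpha=0.3`) is a deliberate choice here: they should help the eye without competing with the bars.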
The bar charts below show how to deal with overlapping x-labels to make them clearer. The y-axis tick labels were also changed to integers because, in this case, the number of letters cannot be a decimal.
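One common way to handle both issues in matplotlib is to rotate the x tick labels and to force integer ticks on the y-axis. The sketch below assumes made-up department names and letter counts, and `readable_ticks_chart` is a hypothetical name; the 45 degree rotation is just one reasonable setting.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator


def readable_ticks_chart():
    """Bar chart with rotated x-labels and integer-only y ticks."""
    departments = [                      # hypothetical long category names
        "Customer Services",
        "Human Resources",
        "Information Technology",
        "Research and Development",
        "Sales and Marketing",
    ]
    letters = [3, 1, 4, 2, 5]            # made-up counts

    fig, ax = plt.subplots(figsize=(6, 4))
    ax.bar(departments, letters)

    # Rotate the x tick labels so long names no longer overlap.
    ax.tick_params(axis="x", labelrotation=45)
    for label in ax.get_xticklabels():
        label.set_horizontalalignment("right")

    # Counts cannot be fractional, so only place ticks at integers.
    ax.yaxis.set_major_locator(MaxNLocator(integer=True))

    ax.set_xlabel("Department")
    ax.set_ylabel("Number of letters")
    fig.tight_layout()  # make room for the rotated labels
    return fig, ax


fig, ax = readable_ticks_chart()
```

If the labels are still too long after rotating, a horizontal bar chart (`ax.barh`) is another option worth trying.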
Things to consider before making a visualisation
Visualising data involving values of different entities:
Bar charts can be used to visualise data involving values of different groups or entities.
Visualising data involving proportions for comparison purposes:
To visualise data involving proportions, stacked bar charts or grouped bar charts can be used.
Stacked bar charts can be used:
1. If the groups or categories in the data contain at most 3 subgroups.
2. Or if you want to compare the totals of the groups as well as the proportions of the subgroups in the data.
But whenever you have more than 3 subgroups, it might be better to use grouped bar charts instead of stacked bar charts.
It is recommended to avoid pie charts as much as possible for data involving proportions, because it can be difficult to compare the sizes of the slices, especially when they are similar. Use one of the types of bar charts instead.
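To make the difference between the two bar chart types concrete, here is a small sketch that draws the same made-up data both ways with matplotlib. The regions, sales channels, numbers, and the function name `stacked_and_grouped` are all hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np


def stacked_and_grouped():
    """Draw the same data as a stacked and as a grouped bar chart."""
    regions = ["North", "South", "East"]  # groups (made up)
    channels = {                          # subgroups (made up)
        "Online": [30, 45, 25],
        "In store": [50, 35, 40],
        "Phone": [20, 10, 15],
    }

    x = np.arange(len(regions))
    fig, (ax_stacked, ax_grouped) = plt.subplots(1, 2, figsize=(10, 4))

    # Stacked: each subgroup starts where the previous one ended,
    # so the full bar height is the group total.
    totals = np.zeros(len(regions))
    for name, values in channels.items():
        ax_stacked.bar(x, values, bottom=totals, label=name)
        totals = totals + np.array(values)

    # Grouped: subgroups sit side by side, all starting from zero.
    width = 0.8 / len(channels)
    for i, (name, values) in enumerate(channels.items()):
        offset = (i - (len(channels) - 1) / 2) * width
        ax_grouped.bar(x + offset, values, width=width, label=name)

    for ax, title in ((ax_stacked, "Stacked"), (ax_grouped, "Grouped")):
        ax.set_xticks(x)
        ax.set_xticklabels(regions)
        ax.set_xlabel("Region")
        ax.set_ylabel("Number of sales")
        ax.set_title(f"{title} bar chart (made-up data)")
        ax.legend()
    fig.tight_layout()
    return fig, totals


fig, totals = stacked_and_grouped()
```

In the stacked version the group totals are easy to read off, while in the grouped version every subgroup shares the same baseline, which makes subgroups easier to compare with each other.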
Checking the distribution of data:
To check the distribution of numeric data in a dataset, histograms are normally used. They plot the counts or frequencies of the ranges of values in the data. An appropriate number of bins must be chosen so that the plot is representative of the data.
The histograms below show how the number of bins affects the shape of the histogram. Please note that they have all been generated using the same data.
As we can see, choosing too few bins (10 bins) leads to missing out on information, and choosing too many bins (2000 bins) leads to a rough-looking distribution. In this case, setting the number of bins to 100 produced the smoothest and most informative plot.
So, depending on your data, you should experiment with the number of bins to find the one that is most representative of the data.
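A quick way to experiment is to draw the same data with several bin counts side by side. The sketch below uses the bin counts mentioned above (10, 100 and 2000) on synthetic, normally distributed data; the data, the seed, and the function name `compare_bin_counts` are assumptions, so the result will not match the plots in this blog exactly.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np


def compare_bin_counts(bin_counts=(10, 100, 2000), n_samples=10_000, seed=0):
    """Plot one dataset as several histograms with different bin counts."""
    rng = np.random.default_rng(seed)
    data = rng.normal(loc=50, scale=10, size=n_samples)  # synthetic data

    fig, axes = plt.subplots(1, len(bin_counts), figsize=(12, 3.5))
    counts_by_bins = {}
    for ax, bins in zip(axes, bin_counts):
        # ax.hist returns the count in each bin and the bin edges.
        counts, edges, _ = ax.hist(data, bins=bins)
        counts_by_bins[bins] = counts
        ax.set_title(f"{bins} bins")
        ax.set_xlabel("Value")
        ax.set_ylabel("Frequency")
    fig.tight_layout()
    return fig, counts_by_bins


fig, counts_by_bins = compare_bin_counts()
```

Every panel contains the same 10,000 values; only the width of the ranges they are counted in changes.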
Checking the relationship between numeric variables:
To check the relationship between two numeric variables, scatter plots can be used. The data points are represented by markers which can be of various types (‘.’, ‘+’, ‘*’) on the plot.
A perfect linear relationship would be represented by a straight line.
The relationship between the variables can be:
1. Positive or Negative
Positive: When one variable increases, the other one will tend to increase as well.
Negative: When one variable increases, the other one will tend to decrease.
2. Strong or Weak
Strong: The points lie close to the underlying trend (for example, close to a straight line).
Weak: The points are scattered more widely around the trend, possibly with outliers.
Note: The variables in the second scatter plot are still related, but not as strongly as in the first one, as the points are more spread out.
3. Linear or Non-Linear
Linear: The points follow a pattern that can be represented by a straight line.
Non-Linear: The points follow a pattern that cannot be represented by a straight line, such as a curve.
4. No relationship
When there is no relationship, there is no obvious pattern in the plot of the variables.
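The kinds of relationship listed above can be reproduced with synthetic data. The following sketch generates four made-up datasets, plots each as a scatter plot, and prints the Pearson correlation coefficient computed with `numpy.corrcoef`. The formulas, noise levels, seed, and the function name `relationship_examples` are assumptions chosen only to illustrate each case.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np


def relationship_examples(n=300, seed=0):
    """Scatter plots of four synthetic relationships with Pearson's r."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0, 10, n)

    # name -> (y values, marker); all data is synthetic
    examples = {
        "Strong positive": (2 * x + rng.normal(0, 1, n), "."),
        "Weak negative": (-x + rng.normal(0, 6, n), "+"),
        "Non-linear": ((x - 5) ** 2 + rng.normal(0, 1, n), "*"),
        "No relationship": (rng.normal(0, 1, n), "."),
    }

    fig, axes = plt.subplots(2, 2, figsize=(9, 7))
    correlations = {}
    for ax, (name, (y, marker)) in zip(axes.flat, examples.items()):
        r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient
        correlations[name] = r
        ax.scatter(x, y, marker=marker)
        ax.set_title(f"{name} (r = {r:.2f})")
        ax.set_xlabel("x")
        ax.set_ylabel("y")
    fig.tight_layout()
    return fig, correlations


fig, correlations = relationship_examples()
for name, r in correlations.items():
    print(f"{name}: r = {r:.2f}")
```

Pearson's r only measures how well a straight line describes the data, so the curved example can have an r close to zero even though the variables are clearly related. This is one reason to look at the scatter plot itself rather than rely on a single number.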
A good visualisation is one where the reader can take in all the information in the plot clearly, without being left with unanswered questions.
To conclude, I want to reflect on some of the important takeaways from this blog.
- Since there are many types of visualisations out there, take a moment to understand what information you want to share, and to decide which type of visualisation would be best for that. Otherwise it can be misleading and the reader might misinterpret the information.
- Make sure the visualisation is clearly labelled and formatted where necessary so that the reader can absorb the information easily. But be wary of overdoing it.
Thank you for reading my blog. I hope it helped.