Categorisation - How to Unlock the Value of Business Data

Information Overload

The amount of data produced by humanity is growing exponentially every year, with no sign of stopping. It’s estimated that more data will be produced in just the first month of 2023 than in the entirety of 2013. This inexorable rise brings incredible new opportunities for innovation, but also the inevitable problem of information overload and the challenge of organising this information in a way that provides the most benefit to humanity.

There is clearly great business value in well-organised data. In executing its mission statement to “Organise the world's information and make it universally accessible and useful”, Google has become one of a handful of companies to achieve a market cap of over one trillion dollars. Indeed, the vast amounts of data a business generates are almost worthless without some means of organising and categorising them.

The Importance of Categorisation

Imagine going into a library only to discover that all of the books had been taken off the shelves and piled up in an enormous heap in the middle of the floor. The vast body of knowledge towering in front of you would be completely useless without the books being categorised on the shelves in the familiar Dewey Decimal system.

This is why categorisation is essential for businesses to unlock the value in the data they own. Put simply: once the amount of data you hold is more than you can browse through, it becomes worthless without the means to categorise it.

Taxonomies - Organising with Trees

Before we can start organising data into categories, it’s essential that we decide what those categories are. Going back to our jumbled library, it’s no use if I call a Hercule Poirot story a “Detective novel” but you call it a “Murder Mystery”. That’s where taxonomies come in. It might sound like a technical word, but a taxonomy is simply a tree structure in which everything has its place.

There are many types of taxonomies we are all familiar with, from the aforementioned Dewey Decimal System for books to Darwin’s Tree of Life, which breaks animals and plants down into their species.

There are many standardised taxonomies for different types of data which will allow you to freely exchange information and collaborate with other companies, but the most important thing is that the categories you use are consistent and meaningful to you.
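
As a minimal sketch, a taxonomy can be modelled as a tree in which each category records its parent. The category names and the `path_to` and `is_a` helpers below are illustrative assumptions, not part of any standard taxonomy.

```python
# Hypothetical library taxonomy: each category maps to its parent (None = root).
TAXONOMY = {
    "Books": None,
    "Fiction": "Books",
    "Non-fiction": "Books",
    "Crime": "Fiction",
    "Detective novel": "Crime",
    "Western": "Fiction",
}


def path_to(category: str) -> list[str]:
    """Return the path from the root of the tree down to `category`."""
    path = []
    current = category
    while current is not None:
        path.append(current)
        current = TAXONOMY[current]  # raises KeyError for unknown categories
    return list(reversed(path))


def is_a(category: str, ancestor: str) -> bool:
    """True if `category` sits at or below `ancestor` in the tree."""
    return ancestor in path_to(category)


print(" > ".join(path_to("Detective novel")))
```

Because every category has exactly one parent, two people filing the same book end up in the same place, which is the consistency the taxonomy is there to provide.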

The Power of Categorised Data

Once data has been categorised it unlocks a vast array of opportunities for us to use that data:

Searching and Navigating

This is perhaps the most obvious. We find it fast and easy to find our way around a library or a book shop, or navigate through a hierarchically structured website like Wikipedia.

Querying and Analytics

Once we have data arranged in a taxonomy, it becomes easy to ask questions of it, for example:

“How many detective novels are there in my library and what percentage of them are by Agatha Christie?”
“How many of the emails that come into my inbox are spam and is it going up or down over time?”
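
The first question can be answered with a few lines of code once each item carries a category. This is only a sketch: the `BOOKS` records and the `share_by_author` function are made up for illustration.

```python
from collections import Counter

# Hypothetical catalogue: (title, author, category) records.
BOOKS = [
    ("The Murder of Roger Ackroyd", "Agatha Christie", "Detective novel"),
    ("Murder on the Orient Express", "Agatha Christie", "Detective novel"),
    ("The Hound of the Baskervilles", "Arthur Conan Doyle", "Detective novel"),
    ("The Big Sleep", "Raymond Chandler", "Detective novel"),
    ("Riders of the Purple Sage", "Zane Grey", "Western"),
]


def share_by_author(books, category, author):
    """Count items in `category` and the percentage written by `author`."""
    authors = Counter(a for _, a, c in books if c == category)
    total = sum(authors.values())
    percentage = 100 * authors[author] / total if total else 0.0
    return total, percentage


total, pct = share_by_author(BOOKS, "Detective novel", "Agatha Christie")
print(f"{total} detective novels, {pct:.0f}% by Agatha Christie")
```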

Automation

This is where taxonomies become really powerful. Categorised information can be used to drive automated behaviours:

“Invoices go to Purchasing, but contracts go to Legal”
“Email marked as spam skips the inbox”
“Flagged content must be reviewed by a moderator”
“Personally identifiable information must not be retained for more than a month”
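
Behaviours like these can be sketched as a lookup from category to action. The `ROUTES` table, the action strings, and the manual fallback are assumptions chosen for illustration.

```python
# Hypothetical routing table: category -> automated action.
ROUTES = {
    "Invoice": "send to Purchasing",
    "Contract": "send to Legal",
    "Spam": "skip the inbox",
    "Flagged content": "queue for moderator review",
}


def route(category: str) -> str:
    """Return the action for a category, falling back to manual triage."""
    return ROUTES.get(category, "hold for manual triage")


for category in ("Invoice", "Contract", "Newsletter"):
    print(f"{category}: {route(category)}")
```

Keeping the rules in a table rather than in branching code means a new category can be automated by adding one entry.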

Driving Calculations

As well as driving automation, categorisation can drive calculations:

“New release movies cost 50% more”
“VAT applies to biscuits, but not cakes”
“A litre of diesel generates 2.4 kg of carbon emissions”
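
The first rule above could be sketched as a multiplier looked up by category. The “Back catalogue” category, the base price, and the `rental_price` function are hypothetical.

```python
# Hypothetical price rules: category -> multiplier applied to the base price.
PRICE_MULTIPLIER = {
    "New release": 1.5,     # "New release movies cost 50% more"
    "Back catalogue": 1.0,  # assumed category for everything else
}


def rental_price(base_price_pence: int, category: str) -> int:
    """Price in pence after applying the multiplier for `category`."""
    return round(base_price_pence * PRICE_MULTIPLIER[category])


print(rental_price(300, "New release"))
print(rental_price(300, "Back catalogue"))
```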

Ways to Categorise

There are many ways to categorise data, from straightforward manual approaches to cutting-edge machine learning techniques.

Manual

The simplest way to categorise is manually. Humans are amazing at understanding and categorising data.

Manual categorisation works well when the data you need to categorise numbers in the hundreds or thousands of items, but once you are dealing with millions of items it can become cost-prohibitive unless the work is of very high value.

Even though manual categorisation is the least sophisticated approach, it still has a number of important uses in even the most sophisticated systems:

As a check of a sample to ensure your automated mechanisms are behaving as expected
As training data for machine learning systems
As a last resort for data that could not be read by an automated system, e.g. reading the address on a parcel that a computer vision system could not read
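
The first of these uses, checking a sample, might look something like the sketch below. The `spot_check` function, the example emails, and the deliberately case-sensitive rule are all assumptions for illustration.

```python
import random


def spot_check(items, auto_label, human_label, sample_size, seed=0):
    """Agreement rate between automated and human labels on a random sample."""
    rng = random.Random(seed)  # fixed seed so the audit sample is reproducible
    sample = rng.sample(items, min(sample_size, len(items)))
    if not sample:
        return 0.0
    agreed = sum(auto_label(item) == human_label(item) for item in sample)
    return agreed / len(sample)


# Hypothetical human judgements for four emails.
HUMAN = {
    "URGENT: server down": "urgent",
    "Lunch on Friday?": "normal",
    "urgent invoice query": "urgent",
    "Weekly newsletter": "normal",
}


def auto(text):
    # Deliberately naive rule: case-sensitive, so it misses lower-case "urgent".
    return "urgent" if "URGENT" in text else "normal"


rate = spot_check(list(HUMAN), auto, HUMAN.get, sample_size=4)
print(f"Agreement on sample: {rate:.0%}")
```

A low agreement rate on the sample is a prompt to look at the disagreements, which here would reveal the case-sensitivity problem in the rule.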

Automatic - Rule Based

If you want to categorise large volumes of data, a rule-based system can make the process much faster and easier to scale.

Rules can be very simple or extremely complex, for example:

Emails that contain a specific word, for example “URGENT”, could be flagged to the user
Books that contain the word “Prairie” but not “dog” could be categorised as “Westerns”
Messages containing text that matches the regular expression for a purchase order code could be categorised as purchase orders
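
The three rules above can be sketched in a few lines. The purchase order format (`PO-` followed by six digits), the category names, and the order in which rules are tried are assumptions made for this example.

```python
import re

# Hypothetical purchase order format: "PO-" followed by six digits.
PO_PATTERN = re.compile(r"\bPO-\d{6}\b")


def categorise(text: str) -> str:
    """Apply simple rules in priority order; the first match wins."""
    if PO_PATTERN.search(text):
        return "Purchase order"
    if "URGENT" in text:
        return "Urgent"
    # Compare whole words so that "dogma" does not count as "dog".
    words = set(re.findall(r"[a-z]+", text.lower()))
    if "prairie" in words and "dog" not in words:
        return "Western"
    return "Uncategorised"


print(categorise("Please confirm PO-123456 today"))
print(categorise("Riding across the prairie at dawn"))
print(categorise("The prairie dog dug a burrow"))
```

Every decision here can be traced back to a specific rule, which is the main attraction of a rule-based approach.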

Many existing tools, such as email clients, have categorisation engines built in, and businesses that need to ingest large volumes of data may use dedicated categorisation tools.

Validation and Cleaning

Real-world data is often messy and incomplete, and this is where rule-based systems benefit from cleaning and validation tools. These might:

Remove duplicates
Fix common spelling errors
Apply consistent formatting

This makes it much easier for simple rules to accurately categorise imperfect, real world data.
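
A cleaning step covering those three tasks could be sketched as follows. The `SPELLING_FIXES` table and the choice of lower case with single spaces as the "consistent format" are assumptions for illustration.

```python
# Hypothetical corrections for misspellings seen in the incoming data.
SPELLING_FIXES = {"invioce": "invoice", "recieve": "receive"}


def clean(records: list[str]) -> list[str]:
    """Normalise formatting, fix known misspellings, and drop duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        # Consistent formatting: lower case, single spaces, no padding.
        words = record.lower().split()
        text = " ".join(SPELLING_FIXES.get(word, word) for word in words)
        if text and text not in seen:  # keep the first occurrence only
            seen.add(text)
            cleaned.append(text)
    return cleaned


print(clean(["  Invioce  1042 ", "invoice 1042", "Recieve goods"]))
```

Note that the first two records only turn out to be duplicates after the spelling and formatting have been fixed, which is why cleaning comes before rule matching.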

Automatic - Machine Learning

The inflexible nature of rules is both a blessing and a curse. A rule-based system is always consistent and its decision making can always be explained (which is more than can be said for humans in many instances), but before we can create a rule we must find a subject matter expert who can explain in unambiguous terms exactly how the categorisation should be done. In many real-world situations, such as “Is this a picture of a cat or a dog?” or “What genre of music is this?”, humans can categorise very accurately but struggle to explain how they came to a conclusion.

This is where artificial intelligence techniques come in.

There are two main subtypes of artificial intelligence categorisation: supervised and unsupervised.

Supervised Learning

Supervised learning is the most common type of machine learning categoriser. In supervised learning a categoriser is trained using a set of pre-categorised data that is known to be correct. Typically this will be done using a set of data that has been manually categorised by humans. Sometimes this will have been done specially for the purpose of training the machine learning system, but often if a manual process is being automated then you can use the body of manual work that has been done in the past as the basis for training the model.

If we go back to our example of a library, a supervised learning system could learn from the current locations of all the books in the library in order to automatically put new books onto the right shelves as they arrive.

Supervised learning is especially good for categorisation tasks, such as image recognition, where you have a pre-existing set of categorised data from which it’s hard to extrapolate rules.
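
To make the library example concrete, here is a minimal sketch of a supervised categoriser: a small multinomial naive Bayes classifier written in plain Python. Naive Bayes is just one possible technique, chosen here because it is short; the training titles, categories, and function names are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count words per category from (text, category) training pairs."""
    word_counts = defaultdict(Counter)
    category_counts = Counter()
    for text, category in examples:
        category_counts[category] += 1
        word_counts[category].update(text.lower().split())
    return word_counts, category_counts


def predict(model, text):
    """Pick the category with the highest naive Bayes log-probability."""
    word_counts, category_counts = model
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_examples = sum(category_counts.values())
    best_category, best_score = None, -math.inf
    for category, n_examples in category_counts.items():
        counts = word_counts[category]
        denominator = sum(counts.values()) + len(vocabulary)
        score = math.log(n_examples / total_examples)
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not rule a category out.
            score += math.log((counts[word] + 1) / denominator)
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical training data: titles already shelved by librarians.
SHELVED = [
    ("murder at the vicarage", "Detective novel"),
    ("the mysterious detective case", "Detective novel"),
    ("death of a detective", "Detective novel"),
    ("riders of the prairie", "Western"),
    ("the lonesome cowboy trail", "Western"),
    ("sheriff of the prairie town", "Western"),
]

model = train(SHELVED)
print(predict(model, "the cowboy and the sheriff"))
```

Nobody wrote a rule saying that “cowboy” implies a Western; the model inferred it from where existing books were shelved.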

Unsupervised Learning

Sometimes we don’t have a set of categorised data to work from, and that’s where unsupervised learning comes in. Unsupervised learning tries to work out patterns and clusters in existing data without relying on explicitly labelled data. A familiar example is the recommendation engine used by Amazon and other online shops: when you buy something, it uses information about what other people who bought that item also bought to suggest other things that you might wish to buy.
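
The “people who bought this also bought” idea can be sketched as a simple co-occurrence count over past orders. This is an illustration of the general idea only, not a description of how any particular shop implements its recommendations; the `ORDERS` data and the `also_bought` function are hypothetical.

```python
from collections import Counter

# Hypothetical order history: each set is one customer's basket.
ORDERS = [
    {"tent", "sleeping bag", "torch"},
    {"tent", "sleeping bag"},
    {"tent", "camping stove"},
    {"novel", "bookmark"},
]


def also_bought(orders, item, top_n=2):
    """Rank other items by how often they share a basket with `item`."""
    co_counts = Counter()
    for basket in orders:
        if item in basket:
            co_counts.update(basket - {item})
    # Most frequent first; ties are broken alphabetically for stable output.
    ranked = sorted(co_counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [name for name, _ in ranked[:top_n]]


print(also_bought(ORDERS, "tent"))
```

No one labelled the tent and the sleeping bag as related; the grouping emerges from the purchase data alone.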

Machine learning is an extremely powerful technology, but it has two major weaknesses: explainability and bias.

Explainability

Unlike a rule-based system, it’s not always obvious why a machine learning model has made the decision it made and, unlike a human, you can’t ask it.

While this is acceptable for a system that only needs to work most of the time, like a spam filter, it might be completely unacceptable in finance or medical applications. Some techniques can be used with machine learning to make it easier to understand why certain decisions were made, but it will never be as black and white as a rule-based system.

Bias

This is a subject that has become increasingly important as more of our lives are affected by the output of machine learning. It’s easy to think of machine learning as a process as coldly logical as a Star Trek Vulcan, but in reality, when a machine learning model is trained on the output of a human, it may well take on the biases of that human. In a real-world example, a CV-screening application developed by Amazon displayed a gender bias in its behaviour because of the biased input data it was trained on.

Bringing it all Together

As we have discussed, there are huge benefits to categorising your data and many ways in which this can be done. Here at Curvestone, data categorisation is one of our key technology pillars for helping businesses innovate and grow.

If you would like to know more, please get in touch.
