“You’re only as solid as what you build on.” It is essential to go through the fundamentals of multi-dimensionality and related concepts before getting our hands on QuarkCube.
The concept of multi-dimensionality
Multi-dimensionality can simply be defined as the ability to view, store and process data based on many dimensions. That sounds scholarly. But is it a new and vague concept to us?
Hmm… not really. We actually apply it in our daily lives.
Let’s take the example of a closet. The one in the picture is small. So, say we have a closet which is 25 times the size of the one in this image.
It’s your friend’s, and he comes up to you asking, “Hey pal, can you tell me the number of full-sleeved Van Heusen white linen shirts with black buttons in my closet?”
You realise that it’s a big task ahead. Why? Because although the closet looks neat, it is far from organized, and it is going to take you a while to search for the shirts satisfying all his conditions and then count them.
(Well, counting wouldn’t take you that long, but searching certainly will.)
It would have been an easy task if your friend had grouped all his clothes according to their characteristics and placed them accordingly.
How can he possibly organize it?
There is more than one way to do it, and it differs from person to person. Let’s list a few ways to organize it. He may group the clothes based on:
Color of garment,
Type of material,
Type of sleeves,
Type of stripes,
Type of neck,
Brand,
Type of clothing,
Type of button,
Color of button.
It would have been easier for you to answer his question if:
The closet had shirts, pants, socks and ties kept separately (type of clothing)
Within each type of clothing, the items were split by brand (brand)
Among the Van Heusen shirts, white shirts were kept apart from the other coloured shirts (colour of garment)
Among those white shirts, he had placed full-sleeved and half-sleeved shirts separately (type of sleeves)
Among the full-sleeved shirts, he had kept linen separately from cotton (type of material)
Among those white full-sleeved linen shirts, he had placed the shirts according to button colour (colour of button)
It would then have taken you far less effort to fetch those “white full-sleeved Van Heusen linen shirts with black buttons” and count them, or answer anything else your friend asks.
The order of grouping may vary, but the end result will be the same.
This explains the advantage of storing and processing the data based on the different dimensions. Multi-dimensional databases store the data based on the dimensions specified by the user.
This is a pretty simple example in which the data are the clothes and the dimensions are the different features/aspects by which we grouped them.
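The closet analogy can be sketched in a few lines of Python. The garments below are made-up illustration data, and the field names (`clothing`, `brand`, `colour`, `sleeves`, `material`, `button_colour`) are assumptions chosen to mirror the dimensions listed earlier. Once the clothes are grouped by those dimensions, the friend's question becomes a single lookup rather than a search through the whole closet.

```python
from collections import Counter

# Hypothetical closet contents; each garment is described by its dimension values.
closet = [
    {"clothing": "shirt", "brand": "Van Heusen", "colour": "white",
     "sleeves": "full", "material": "linen", "button_colour": "black"},
    {"clothing": "shirt", "brand": "Van Heusen", "colour": "white",
     "sleeves": "full", "material": "linen", "button_colour": "black"},
    {"clothing": "shirt", "brand": "Van Heusen", "colour": "white",
     "sleeves": "half", "material": "linen", "button_colour": "black"},
    {"clothing": "shirt", "brand": "Van Heusen", "colour": "blue",
     "sleeves": "full", "material": "cotton", "button_colour": "white"},
    {"clothing": "pants", "brand": "Van Heusen", "colour": "white",
     "sleeves": "none", "material": "linen", "button_colour": "black"},
]

# The dimensions (assumed names) used to organise the closet.
DIMENSIONS = ("clothing", "brand", "colour", "sleeves", "material", "button_colour")

# Organise once: count the garments on each "shelf", i.e. each
# combination of dimension values.
shelves = Counter(tuple(garment[d] for d in DIMENSIONS) for garment in closet)

# The friend's question is now a direct lookup on one shelf.
question = ("shirt", "Van Heusen", "white", "full", "linen", "black")
print(shelves[question])
```

The tuple order in `DIMENSIONS` plays the role of the grouping order; changing it changes the keys but not the counts.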
In the analogy, clothes are the data, but that is not the kind of data a computer deals with directly, so let’s look at another example.
Imagine we have data on the marks of students of grade 12, their region and subjects.
This data has only a few records, so searching will be quick. But imagine we had the records for all the students in India. It would take time to get an overall picture of the data, i.e., to summarize it.
Multi-dimensional databases group the records which are homogeneous (i.e., have the same characteristics), aggregate them and store the aggregated value.
Given below is the representation in which subject and region are the two dimensions in our data.
What is stored within those cells are the aggregates of all the records having the corresponding row and column characteristics.
When the data is already aggregated and stored, it will be easy to answer a question like “What are the total marks of the students in the west in the subject ‘Physics’?”
That’s one long question, but by looking up the matching row and column, we find its value, i.e., 112.
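A minimal sketch of this idea in plain Python follows. The records are hypothetical and are not the data behind the table described above, so the totals differ from the 112 quoted in the text. The point is only that each (region, subject) cell is computed once and then read back with a lookup.

```python
from collections import defaultdict

# Hypothetical records: (region, subject, marks).
records = [
    ("North", "Physics", 44),
    ("North", "Physics", 50),
    ("North", "Maths", 38),
    ("West", "Physics", 40),
    ("West", "Physics", 35),
    ("West", "Maths", 42),
    ("South", "Maths", 47),
]

# Aggregate once: one cell per (region, subject) combination.
cells = defaultdict(int)
for region, subject, marks in records:
    cells[(region, subject)] += marks

# Answering the question is now a lookup by row and column.
west_physics_total = cells[("West", "Physics")]
print(west_physics_total)  # 75 for this sample data
```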
It is to be noted that QuarkCube allows many other functions such as sum, standard deviation, variance, count, etc., in order to summarize the data.
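To make these summaries concrete, here is a rough illustration using Python's standard `statistics` module. It is not QuarkCube's API. The marks are hypothetical, and the choice of population (rather than sample) variance is an assumption made for the example.

```python
import statistics

# Hypothetical marks of the students that fall into a single cell
# (for example: region = West, subject = Physics).
cell_marks = [40, 35, 45, 40]

# Several ways to summarise the same cell.
summary = {
    "sum": sum(cell_marks),
    "count": len(cell_marks),
    # Population variance / standard deviation (an assumption for this sketch).
    "variance": statistics.pvariance(cell_marks),
    "std_dev": statistics.pstdev(cell_marks),
}

for name, value in summary.items():
    print(name, value)
```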
This looks simple because it has only 2 dimensions (region, subject). Now let’s consider another dimension ‘Gender’ of the students and try to group them based on 3 dimensions.
The resulting data is represented in the form of a cube because the data is grouped based on more than 2 dimensions. The third dimension is gender. The picture displays sliced versions of the cube (Male and female layers). These layers put together form the whole data block.
The image below shows us the whole cube. Each block in the cube holds exactly one value (even though it is drawn in 3D). For instance, 95 is the total marks obtained in physics by the male students (in the data) who are in the northern region. The power of multi-dimensionality is felt when the size of the data is huge.
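The cube and its layers can be sketched as a dictionary keyed by (region, subject, gender). The records below are hypothetical, not the figures shown in the images, and `slice_by_gender` is an illustrative helper name rather than a QuarkCube function.

```python
from collections import defaultdict

# Hypothetical records: (region, subject, gender, marks).
records = [
    ("North", "Physics", "Male", 50),
    ("North", "Physics", "Male", 40),
    ("North", "Physics", "Female", 48),
    ("West", "Physics", "Female", 44),
    ("West", "Maths", "Male", 39),
    ("West", "Maths", "Female", 41),
]

# Build the cube: each block holds exactly one aggregated value.
cube = defaultdict(int)
for region, subject, gender, marks in records:
    cube[(region, subject, gender)] += marks

def slice_by_gender(cube, gender):
    """Return the 2-D layer (region, subject) -> total for one gender."""
    return {(r, s): total for (r, s, g), total in cube.items() if g == gender}

male_layer = slice_by_gender(cube, "Male")
female_layer = slice_by_gender(cube, "Female")

# Stacking the layers back together gives the region x subject totals.
flattened = defaultdict(int)
for layer in (male_layer, female_layer):
    for key, total in layer.items():
        flattened[key] += total

print(cube[("North", "Physics", "Male")])
print(flattened[("North", "Physics")])
```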
Let’s elaborate on some of the terms related to multi-dimensionality.
Dimensions are the elementary components of a multi-dimensional model. They are the perspectives from which we view the data. In other words, the aspects on which we are grouping/clustering similar data are called dimensions.
Data scientists can relate dimensions to categorical variables, and those who design experiments can relate them to the factors of the experiment.
Hierarchies are the different levels in a dimension. For example, let’s say we have data on the city, state and country of a customer. We frame a dimension “location of the customer” and define a hierarchy so that we can consolidate customer data at the different levels of the hierarchy (country-wise, state-wise and city-wise).
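As a small sketch of consolidating at different hierarchy levels, the snippet below totals a purchase amount country-wise, state-wise and city-wise. The rows, the `amount` measure and the `roll_up` helper are all assumptions introduced for illustration.

```python
from collections import defaultdict

# Hypothetical customer purchases: (country, state, city, amount).
purchases = [
    ("India", "Tamil Nadu", "Chennai", 120),
    ("India", "Tamil Nadu", "Coimbatore", 80),
    ("India", "Karnataka", "Bengaluru", 150),
    ("USA", "Texas", "Austin", 200),
]

# Levels of the "location of the customer" hierarchy, coarsest first.
LEVELS = ("country", "state", "city")

def roll_up(rows, level):
    """Total the amounts at one level of the location hierarchy."""
    depth = LEVELS.index(level) + 1
    totals = defaultdict(int)
    for *location, amount in rows:
        # Keep only the part of the path down to the requested level.
        totals[tuple(location[:depth])] += amount
    return dict(totals)

print(roll_up(purchases, "country"))
print(roll_up(purchases, "state"))
print(roll_up(purchases, "city"))
```

Keeping the full path in each key (for example `("India", "Tamil Nadu")`) avoids mixing up states or cities that share a name in different countries.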
A dimension can have many hierarchies. Each different perspective on the data in a dimension is taken as an individual hierarchy.
An example is all we need to understand this. Let’s say you own a bakery. The items that you sell may be grouped based on their category (bread, cookies, tarts, pies and cakes) or based on their production (in-house or procured). The grouping can be done in two ways. Hence, the number of hierarchies for the items sold (the dimension) is two.
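The two hierarchies can be illustrated with a short sketch. The item names, unit counts and the `total_by` helper are hypothetical. The same sales are totalled once per hierarchy, and both views add up to the same grand total.

```python
from collections import defaultdict

# Hypothetical item master: every item has a category and a production mode.
items = {
    "sourdough loaf": {"category": "bread", "production": "in-house"},
    "butter cookie": {"category": "cookies", "production": "procured"},
    "lemon tart": {"category": "tarts", "production": "in-house"},
    "apple pie": {"category": "pies", "production": "procured"},
    "sponge cake": {"category": "cakes", "production": "in-house"},
}

# Hypothetical sales: (item, units sold).
sales = [
    ("sourdough loaf", 30),
    ("butter cookie", 50),
    ("lemon tart", 12),
    ("apple pie", 8),
    ("sponge cake", 20),
]

def total_by(hierarchy):
    """Total the units sold along one hierarchy of the item dimension."""
    totals = defaultdict(int)
    for item, units in sales:
        totals[items[item][hierarchy]] += units
    return dict(totals)

by_category = total_by("category")
by_production = total_by("production")
print(by_category)
print(by_production)
```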
Attributes are those components which explain more about a particular dimension. They are like adjectives. For example, if our data has the columns ‘customer_id’ and ‘gender’, customer_id becomes the dimension and gender becomes the extra detail about it; hence gender is an attribute of customer_id. Attributes which stay constant with respect to time are the ‘Real time attributes’. In the bakery example, we may call either of the hierarchies an attribute as well, because the category of item and the production variables explain more about the dimension (i.e., products sold).
In the picture, whether or not a vegetable is above the ground, is the attribute which describes the vegetables (dimension).
An attribute which is anticipated to vary with respect to the time dimension or another dimension is termed a ‘varying attribute’.
For instance, say you own a tender coconut shop with 3 varieties of tender coconuts. In specific months you purchase certain varieties from Vendor A and Vendor B, and you track your sales. We can summarise the data based on the dimensions ‘Type of tender coconut’, ‘Month’ and also ‘Vendor’. Since the vendor changes with respect to time and the variety of tender coconut, we add it as a varying attribute to help analyse the individual performance of the vendors for each product variety.
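One way to picture a varying attribute is as a lookup keyed by the dimensions it depends on. The sketch below is an illustration under assumed data (variety names, months and unit counts are invented) and is not how QuarkCube stores varying attributes.

```python
from collections import defaultdict

# Hypothetical varying attribute: the vendor depends on (variety, month).
vendor_for = {
    ("green", "Jan"): "Vendor A",
    ("green", "Feb"): "Vendor B",
    ("yellow", "Jan"): "Vendor B",
    ("yellow", "Feb"): "Vendor B",
    ("orange", "Jan"): "Vendor A",
    ("orange", "Feb"): "Vendor A",
}

# Hypothetical sales: (variety, month, units sold).
sales = [
    ("green", "Jan", 100),
    ("green", "Feb", 90),
    ("yellow", "Jan", 60),
    ("yellow", "Feb", 70),
    ("orange", "Jan", 40),
    ("orange", "Feb", 55),
]

# Resolve the vendor for each record, then total sales per (vendor, variety).
sales_by_vendor_variety = defaultdict(int)
for variety, month, units in sales:
    vendor = vendor_for[(variety, month)]
    sales_by_vendor_variety[(vendor, variety)] += units

for key, units in sorted(sales_by_vendor_variety.items()):
    print(key, units)
```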
Measures are numeric components of a model. The values which are aggregated/summarised based on the dimensions are termed as ‘Measures’. Basically it is that metric that one wants to track or is most interested in. A model must have a measure. In the tender coconut example, we track sales based on ‘Type of tender coconut’, ‘Month’ and ‘Vendor’, hence sales is the measure. (Of course as an owner, one would want to know about the performance, for which, the most important number to track is that of sales). Analysing sales will help identify the scope for business improvement.
A data scientist can relate it to the numerical variable that one tries to predict in a regression problem.
Representation of data
As we know, a cube stores the aggregate of the data which was grouped on multiple dimensions. Let’s try to visualise it.
For instance, in the bakery example let’s consider our dimensions are
Year
Market (Asia, America)
Items (Cakes, tarts, cookies, pies, bread)
Presence of egg (0-No, 1-Yes)
By storing the data based on different dimensions, we will be able to fetch it quickly, and since the stored value is aggregated, it gives us the ‘just in time’ information that we need to make big decisions.
We will be able to get answers for any combination quickly. It will also help us compare performance between different products, different quarters, different markets and different variations in products. For example, what are the total actual sales for eggless pies? The columns in orange are those which hold values for pies. The sum of the corresponding totals answers our question.
The items sold can also be grouped using their other hierarchy (in-house/procured).
For this cube, we can quickly fetch data for a question like “What is the total sales in Asia for eggless procured items?” Cubes store the information tracked on the different dimensions and at different hierarchy levels.
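Both bakery questions can be sketched against a small in-memory cube. All figures, the year values, the item-to-production mapping and the `total_sales` helper are hypothetical and only illustrate how stored aggregates are combined to answer a query.

```python
# Hypothetical cube cells: (year, market, item, has_egg) -> total sales.
cube = {
    (2023, "Asia", "Cakes", 0): 120,
    (2023, "Asia", "Cakes", 1): 200,
    (2023, "Asia", "Pies", 0): 80,
    (2023, "America", "Pies", 0): 60,
    (2023, "America", "Pies", 1): 90,
    (2024, "Asia", "Cookies", 0): 70,
    (2024, "Asia", "Bread", 0): 50,
    (2024, "America", "Tarts", 1): 40,
}

# Hypothetical second hierarchy of the item dimension.
production = {
    "Cakes": "in-house",
    "Tarts": "in-house",
    "Bread": "in-house",
    "Cookies": "procured",
    "Pies": "procured",
}

def total_sales(market=None, item=None, has_egg=None, produced=None):
    """Sum every cell that matches the given filters; None means 'any'."""
    total = 0
    for (year, cell_market, cell_item, cell_egg), sales in cube.items():
        if market is not None and cell_market != market:
            continue
        if item is not None and cell_item != item:
            continue
        if has_egg is not None and cell_egg != has_egg:
            continue
        if produced is not None and production[cell_item] != produced:
            continue
        total += sales
    return total

# "What is the total sales for eggless pies?"
eggless_pies = total_sales(item="Pies", has_egg=0)
# "What is the total sales in Asia for eggless procured items?"
asia_eggless_procured = total_sales(market="Asia", has_egg=0, produced="procured")
print(eggless_pies, asia_eggless_procured)
```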
Multi-dimensionality is a powerful feature, and its effectiveness is felt when the size of the data (no. of records, no. of features) is huge. The inherent ability to summarise and store data makes multi-dimensional cubes the go-to for effective business intelligence. In a data-driven world, multi-dimensional analysis remains indispensable.