Thursday, 16 March 2017

The new age of analytics

In the last few years, the role of analytics has grown drastically with the accumulation of data. Transactions for economic activities are carried out through applications that offer convenience and security to the parties involved. Access to technology for the general public has acted as a catalyst for the digital transformation of society. Every digital activity leaves a footprint, which is a useful means of measuring the behaviour of individual customers. This large body of data, popularly known as Big Data, is associated with the 3Vs: volume, variety and velocity. Companies have been storing data in their ERP (Enterprise Resource Planning) systems for the last decade and are finding it difficult to make predictions from such large data sets. Other areas, such as sports, also use data for analysis and for gaining insights that provide a competitive advantage over opponents. Analytics goes hand in hand with data, and the most important aspect is feature selection, which lays the groundwork for further analysis. Understanding the mathematical intricacies of sophisticated algorithms and methods is a necessary, but not sufficient, condition for a sound result. An understanding of the data aids in identifying variables and subsequently building models for analysis.
A basic linear regression model is used to predict a dependent (response) variable on the basis of an explanatory (independent) variable. So, how can we say that a mundane linear regression can address complex practical problems? Regression alone may not provide the complete picture, and the data might be easier to understand with visualisations such as a scatter plot. A high correlation coefficient (r), say around 0.9, would seem to point to a strong relationship between the two variables, but one must not mistake correlation for causation. For example, suppose a continuous production facility shows a substantial difference in output between day and night shifts. Running a linear regression model would tell us that the night shift is negatively correlated with output, but we still would not know what causes it. The data alone does not explain why the night shift differs from the day shift when the same equipment is in use during both periods. Data broken down by time period within the night shift reveals that production builds up in an exponential manner as the night progresses. A closer look at the operational working environment suggested that inadequate supervision results in workers resting during the early part of the night and increasing production during the later part. This suggests that domain knowledge, or context, is an important factor in interpreting the result.
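
The shift example can be sketched in a few lines of Python. The figures below are invented purely for illustration (they are not data from any real facility), and the names `day_output`, `night_output` and `is_night` are hypothetical. The sketch first regresses output on a night-shift indicator, then looks at the night shift hour by hour, where a straight line fits the logarithm of output better than the raw output, which is what exponential build-up would look like.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# Hypothetical hourly output (units per hour) over an 8-hour shift.
hours = list(range(8))
day_output = [100, 102, 98, 101, 99, 103, 100, 97]   # roughly steady
night_output = [20 * 1.25 ** h for h in hours]       # slow start, then builds up

# Step 1: regress output on a shift indicator (0 = day, 1 = night).
is_night = [0] * len(day_output) + [1] * len(night_output)
output = day_output + night_output
slope, intercept = fit_line(is_night, output)
r_shift = pearson_r(is_night, output)
print(f"shift effect on output: slope={slope:.1f}, r={r_shift:.2f}")

# Step 2: look inside the night shift, hour by hour.
r_raw = pearson_r(hours, night_output)
r_log = pearson_r(hours, [math.log(y) for y in night_output])
print(f"night shift, hour vs output:      r={r_raw:.3f}")
print(f"night shift, hour vs log(output): r={r_log:.3f}")
```

The first step only says that night shifts go with lower output. The second step shows the shape of the problem within the shift, but neither step says why it happens; that still takes a look at the shop floor.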

The above example seems simple, but understanding the cause of such problems takes time, and companies incur losses from such inefficient performance. Data accumulates at an unprecedented rate, and because retrieving data takes effort, it becomes imperative to use past data in a prudent and timely manner. In today's world, practical applied analytics is important because it reduces the lead time between conception and recommendation. Feature selection and context help in building the model, which in turn guides the selection of the appropriate methodology or methodologies. I use the word methodologies to suggest that one might need to look at the initial data, clean it, look again for features using correlation analysis, then build models and consider multiple variables, transformations and additional iterations to arrive at the final model. A primary understanding of the data (the sufficient condition), combined with several methodologies (the necessary condition), leads to effective and practical analysis.
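
The first pass of that iterative loop (look at the data, clean it, screen features by correlation) might look like the minimal sketch below. The records and the field names `hours_supervised`, `machine_age` and `output` are hypothetical, and dropping incomplete rows is only one of several possible cleaning choices.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def clean(rows):
    """Drop records with any missing (None) value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def rank_features(rows, target):
    """Rank candidate features by absolute correlation with the target."""
    ys = [r[target] for r in rows]
    features = [k for k in rows[0] if k != target]
    scores = {f: pearson_r([r[f] for r in rows], ys) for f in features}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Hypothetical raw records; None marks a missing reading.
raw = [
    {"hours_supervised": 1, "machine_age": 5, "output": 30},
    {"hours_supervised": 2, "machine_age": 2, "output": 42},
    {"hours_supervised": 3, "machine_age": 7, "output": 49},
    {"hours_supervised": None, "machine_age": 4, "output": 55},
    {"hours_supervised": 4, "machine_age": 3, "output": 61},
    {"hours_supervised": 5, "machine_age": 6, "output": 70},
    {"hours_supervised": 6, "machine_age": 1, "output": 79},
    {"hours_supervised": 7, "machine_age": 4, "output": None},
    {"hours_supervised": 7, "machine_age": 4, "output": 92},
    {"hours_supervised": 8, "machine_age": 8, "output": 100},
]

cleaned = clean(raw)
ranked = rank_features(cleaned, target="output")
for name, r in ranked:
    print(f"{name}: r={r:.2f}")
```

The ranking is only a starting point for the next iteration. Which features are kept, transformed or combined in the final model still depends on the context behind the numbers.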