Over the last few years, the role of analytics has grown dramatically
with the accumulation of data. Transactions for economic activities are carried
out using applications that offer convenience and security to the parties
involved. Access to technology for the general public has acted as a catalyst
for the digital transformation of society. Every digital activity leaves a
footprint, which is a useful means of measuring the behaviour of an individual
customer. The large body of data popularly known as Big Data is associated with
the 3Vs, i.e. volume, variety and velocity. Companies have been storing data in
their ERP (Enterprise Resource Planning) systems for the last decade and
are finding it difficult to make predictions from such large data sets. There
are also other areas, such as sports, that use data for analysis and for gaining
insights that give a competitive advantage over opponents. Analytics goes hand
in hand with data, and its most important aspect is feature selection, which
lays the groundwork for further analysis. Understanding the mathematical
intricacies of sophisticated algorithms and methods is a necessary, but not
sufficient, condition for a sound result. An understanding of the data aids in
identifying variables and subsequently building models for analysis.
A basic linear regression model is
used for predicting a dependent (response) variable on the basis of an
explanatory (independent) variable. So, how can we say that a mundane linear
regression can address complex practical problems? Regression alone may not
provide the complete picture, and the relationship may be easier to understand
with visualisations such as a scatter plot. A high r-value, say around 0.9,
would seem to point to a strong relationship between the two variables, but one
must not mistake correlation for causation. For example, a continuous production
facility has a substantial difference in output between its day and night shifts.
Running a linear regression model would tell us that the night shift is
negatively correlated with output, but we don’t actually know what causes this.
The data does not explain how the night shift differs from the day shift, or why
the same equipment behaves differently during the two time periods. Data for
different time intervals within the night shift reveals that production builds
up in an exponential manner as time passes. A closer look at the operational
working environment shows that inadequate supervision leads to workers resting
during the early part of the night and increasing production during the later
part of the night. This suggests that domain knowledge, or context, is an
important factor in interpreting results.
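
The shift example can be sketched in a few lines of Python. This is only an
illustration: the hourly output figures are invented for the purpose, and the
helper names fit_line and pearson_r are hypothetical, not taken from any
library. The first step regresses output on a night-shift indicator; the second
looks inside the night shift hour by hour, which is the step that domain
knowledge prompts.

```python
from math import exp, log, sqrt


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical units produced in each hour of one 8-hour shift (made-up data).
day_output = [100, 102, 98, 101, 99, 100, 103, 97]
night_output = [10, 15, 22, 33, 50, 75, 112, 168]

# Step 1: regress output on a 0/1 night-shift indicator.
is_night = [0] * len(day_output) + [1] * len(night_output)
output = day_output + night_output
shift_slope, shift_intercept = fit_line(is_night, output)
shift_r = pearson_r(is_night, output)
print(f"night-shift effect: {shift_slope:.1f} units/hour, r = {shift_r:.2f}")

# Step 2: look inside the night shift. A straight-line fit on log(output)
# against the hour estimates a constant hourly growth factor.
hours = list(range(1, len(night_output) + 1))
growth_slope, _ = fit_line(hours, [log(v) for v in night_output])
growth_factor = exp(growth_slope)
print(f"night output grows by a factor of about {growth_factor:.2f} per hour")
```

The first fit only says that night shifts produce less. The second shows the
pattern of slow starts and late catch-up, and even that does not name the
cause; the explanation (inadequate supervision) still has to come from the shop
floor.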
The above example seems simple, but
understanding the cause of such problems takes time, and companies incur losses
from such inefficient performance. Data accumulates at an unprecedented rate,
and it becomes imperative to use past data in a prudent and timely manner, since
retrieving data requires effort. In today’s world, practical applied
analytics is important because it reduces the lead time between conception
and recommendation. Feature selection and context help
in building the model that will be a precursor to the selection
of the appropriate methodology or methodologies. I use the word methodologies
to suggest that one might need to look at the initial data, clean it, look again
for features using correlation analysis, then build models and consider multiple
variables, transformations and additional iterations to arrive at the final
model for the result. A primary understanding of the data (the sufficient
condition) draws on several methodologies (the necessary condition) for
effective and practical analysis of data.
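
One pass of the loop described above (look at the data, clean it, screen
features by correlation, fit a first model) might look like the following
minimal sketch. The records, the feature names supervision and temperature, and
the helper pearson_r are all hypothetical and chosen only for illustration;
ranking by correlation is a screening step, not evidence of cause.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical raw records (made-up values); one row has a missing reading.
rows = [
    {"supervision": 1.0, "temperature": 21.0, "output": 52.0},
    {"supervision": 2.0, "temperature": 19.0, "output": 61.0},
    {"supervision": 3.0, "temperature": None, "output": 70.0},
    {"supervision": 4.0, "temperature": 22.0, "output": 83.0},
    {"supervision": 5.0, "temperature": 20.0, "output": 90.0},
    {"supervision": 6.0, "temperature": 21.0, "output": 102.0},
]

# Clean: keep only complete rows.
clean = [row for row in rows if all(v is not None for v in row.values())]

# Screen: rank candidate features by absolute correlation with the target.
target = [row["output"] for row in clean]
features = ["supervision", "temperature"]
scores = {
    name: pearson_r([row[name] for row in clean], target) for name in features
}
ranked = sorted(features, key=lambda name: abs(scores[name]), reverse=True)

# Model: fit a one-variable least squares line on the top-ranked feature.
xs = [row[ranked[0]] for row in clean]
mean_x = sum(xs) / len(xs)
mean_y = sum(target) / len(target)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, target)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x

for name in ranked:
    print(f"{name}: r = {scores[name]:.2f}")
print(f"output = {intercept:.1f} + {slope:.1f} * {ranked[0]}")
```

In practice this pass would be repeated: add variables, try transformations,
and check each candidate model against what is known about the process before
settling on a final one.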