Predictive Analytics

Predictive Analytics is the latest major trend in commercial data analysis, driven by the convergence of three factors:

  • new, readily available 'community-sourced' software tools
  • the huge increase in data sourced from multiple channels (particularly the web) and
  • cloud-based technologies which provide end-users with access to high-end computing power.

Predictive Analytics vs. Business Intelligence

Predictive Analytics differs from traditional data analysis techniques (sometimes termed 'Business Intelligence') because it is generally used to forecast future states and events, rather than to describe historic performance. It also uses far more sophisticated statistical and mathematical techniques, and requires a combination of mathematical and computing skills.

Predictive Analytics is Model Based

In practice, Predictive Analytics usually begins with the identification of a model developed from the underlying data. Once the model has been tested across larger datasets, it can be used to predict future events given predetermined inputs. In contrast, Business Intelligence uses basic summarisation and aggregation techniques to describe historic data, and generally stops at that point, with reporting that simply describes the historic data in a summarised way. In the real world, however, the two approaches exhibit some overlap. As an example, it is possible to use summarisation to predict future states, and predictive models can be used to explain past performance, though neither of these is the focus of its respective discipline. Typical examples of the application of Predictive Modelling include the development of models describing customer churn (defection rates and drivers), credit default risks, and customer segmentation based on behavioural criteria.
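The contrast between summarising and modelling can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data is invented (monthly support calls per customer against churn rate), the function names are hypothetical, and a straight-line fit stands in for whatever model a real project would choose.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def predict(model, x):
    """Apply a fitted (intercept, slope) model to a new input."""
    intercept, slope = model
    return intercept + slope * x


def mean_abs_error(model, xs, ys):
    """Average absolute gap between predictions and observed values."""
    return mean(abs(predict(model, x) - y) for x, y in zip(xs, ys))


# Invented data: support calls per month (x) and churn rate in % (y).
calls = [1, 2, 3, 4, 5, 6, 7, 8]
churn = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

# Business Intelligence style: summarise the history and stop there.
average_churn = mean(churn)

# Predictive Analytics style: fit a model on part of the data,
# test it on data it has not seen, then use it to forecast.
model = fit_line(calls[:6], churn[:6])
holdout_error = mean_abs_error(model, calls[6:], churn[6:])
forecast = predict(model, 10)  # predicted churn at 10 calls per month

print(f"average churn: {average_churn:.2f}%")
print(f"hold-out error: {holdout_error:.2f} percentage points")
print(f"forecast at 10 calls: {forecast:.2f}%")
```

The summary figure describes the past only, whereas the fitted model can be checked against held-out records and then asked about an input that has not yet been observed.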

Utilises sophisticated statistical algorithms

As mentioned previously, a key difference between Predictive Analytics and traditional Business Intelligence is the sophistication of the techniques used to process data. Business Intelligence uses simple summation, averages and (infrequently) variances. Predictive Analytics uses techniques from the realm of mathematics and statistics, such as clustering (k-means etc.), principal component analysis, regression (linear and non-linear), decision trees, machine learning, optimisation and time series forecasting. Rather than producing a single output in the form of data summaries or totals, Predictive Analytics often produces a model that is then used to forecast future states. Broadly, when these models are applied to past data, the most common purpose is model verification rather than reporting.
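As a rough illustration of one of these techniques, the sketch below implements a bare-bones one-dimensional k-means in plain Python. The spend figures, the starting centroids and the function names are all assumptions made for the example; real work would normally use a tested library and handle multiple dimensions and random initialisation.

```python
def assign(values, centroids):
    """Place each value in the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for v in values:
        nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
        clusters[nearest].append(v)
    return clusters


def k_means_1d(values, initial_centroids, max_iterations=20):
    """Alternate assignment and centroid update until nothing changes."""
    centroids = list(initial_centroids)
    for _ in range(max_iterations):
        clusters = assign(values, centroids)
        # An empty cluster keeps its previous centroid.
        updated = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(values, centroids)


# Invented data: monthly spend per customer, two obvious groups.
monthly_spend = [12, 15, 14, 90, 95, 88, 13, 92]
centroids, clusters = k_means_1d(monthly_spend, initial_centroids=[0, 100])

print(centroids)
print(clusters)

# The centroids act as the model: a new customer is placed in a segment
# without re-running the analysis.
new_customer_segment = assign([20], centroids)
```

The output worth keeping here is the pair of centroids rather than a total or an average, which is the sense in which the technique produces a model rather than a report.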

What's propelling Predictive Analytics?

In addition to the availability of software tools, two other factors have propelled the commercial application of Predictive Analytics: vast and ever-increasing volumes of data, and the ready availability of high performance computing resources via shared, cloud-based services. As an example, a Predictive Analytics practitioner can analyse Twitter data feeds that can run into the gigabytes using cloud-based services, such as Amazon EC2, that are charged by CPU cycle (or time). All this can be done without any individual company investing in acres of computers - all that is required is Predictive Analytics software, an internet connection and a credit card!

Advanced (and more costly) skills are required for Predictive Analytics

A combination of mathematics/statistics and high computer proficiency skills is required to perform Predictive Analytics successfully, meaning that Data Scientists command salaries that exceed those of a traditional business analyst by a factor of three #1. In industry jargon, practitioners of Predictive Analytics are termed 'Data Scientists' - a term which differentiates them from 'business analysts' and reflects their higher level technical skills (and higher pay packets). As a starting point for numeracy skills, a Data Scientist should be familiar with regression, principal component analysis, clustering techniques, decision trees, ANOVA, matrices and the concepts underlying the appropriate use of statistical techniques (correlation, causality vs. association, homoscedasticity, measurement reliability, autocorrelation, Type I and II errors etc.). For learning and genetic algorithms, the appropriate choice of algorithm and an understanding of its limitations are important. In addition to a solid knowledge of maths and statistics, data scientists need to know how to 'wrangle' data. This somewhat quaint term is entirely appropriate: most data needs to be extracted, cleansed, transformed and reformatted before it is suitable for analytics work. This means that data scientists need good scripting skills using tools and languages like Python and R, and they are usually proficient in programming too. This combination of skills (maths/statistics and programming) is important, because the hallmark of a good data scientist is an ability to work on a problem in a self-sufficient manner - it would be unacceptable to hold up an analytics job until 'someone in IT prepares our data for us'!
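To give a flavour of what 'wrangling' involves, here is a minimal Python sketch using only the standard library. The raw extract, the column names and the cleaning rules (two accepted date formats, dropping incomplete and duplicate rows) are assumptions invented for the example; real data will need its own rules.

```python
import csv
import io
from datetime import datetime

# Invented raw extract: stray spaces, mixed date formats, currency symbols,
# a missing value and a duplicated record.
RAW = """customer_id,signup_date,monthly_spend
 101 ,2012-01-15,"$1,200.50"
102,15/02/2012,$85
103,,n/a
101,2012-01-15,"$1,200.50"
"""

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def parse_date(text):
    """Return a date for any accepted format, or None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def parse_money(text):
    """Strip currency formatting and return a float, or None if not numeric."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def wrangle(raw_text):
    """Extract, cleanse and reformat rows; skip incomplete or repeated ones."""
    rows, seen = [], set()
    for record in csv.DictReader(io.StringIO(raw_text)):
        customer_id = record["customer_id"].strip()
        signup = parse_date(record["signup_date"])
        spend = parse_money(record["monthly_spend"])
        if not customer_id or signup is None or spend is None:
            continue  # incomplete record
        if customer_id in seen:
            continue  # duplicate record
        seen.add(customer_id)
        rows.append({
            "customer_id": int(customer_id),
            "signup_date": signup.isoformat(),
            "monthly_spend": spend,
        })
    return rows


clean_rows = wrangle(RAW)
for row in clean_rows:
    print(row)
```

Of the four raw records, only two survive in a consistent format, which is typical of how much preparation sits between a raw extract and any modelling.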

Sourcing Data Scientists for Predictive Modelling - maths, not business graduates

The different and higher level numeracy skill requirements for Predictive Analytics are important: while traditional Business Intelligence work can be done by a standard business/commerce graduate analyst (albeit one with a bent for numerical skills), Predictive Analytics requires at least a background in formal undergraduate mathematics, statistics or econometrics, as well as strong computing skills, including but not limited to programming. Successful junior Data Scientists performing Predictive Analytics work are more likely to be sourced from graduates with formal qualifications in mathematics, statistics or econometrics - not commerce or business. It is probably easier (though probably more costly) to source a maths, statistics or econometrics graduate and provide them with commercial training than it is to take a business/commerce graduate and bring them up to the required numeracy standard for Predictive Analytics work.

Tips for budding data scientists

If you're studying for a business degree, it will be important to focus on subjects that improve your numeracy and programming skills. If you're already doing a maths or statistics course, the addition of some commercially oriented subjects would be beneficial, but these are secondary when it comes to hiring a data scientist. For both disciplines, anything you can do to apply yourself to real-world problems will be looked on favourably - try to choose practical projects in your course that do this. As for 'what programming language do I learn', the answer is: any of the standard ones. A good programmer can easily apply him or herself to any programming language, at least to the level required for predictive analytics. Scripting-based languages like Python are probably best, and Python is becoming the most widely used language for data preparation ('wrangling'), but PHP, Perl, and even VBA are all useful. The addition of more formal languages like C and C++ is useful, but very high level skills in these languages are generally not required.

Predictive Analytics Software

In response to the huge and growing demand for Predictive Analytics, there are over 40 vendors of software tools that focus on providing end-users with the techniques required to develop and apply Predictive Analytics models. All the 'legacy' providers of large-scale data analysis software are represented (IBM, SAP, SAS etc.) as well as a raft of new vendors. Many in this latter category supply free or community versions of their software (e.g. RapidMiner) in an effort to harness the 'wisdom of the masses'. In the 'community' model of distribution, vendors make available a free version of their product, and obtain their revenue from paid versions which include support and consulting services. In many cases, the vendor applies some functionality limitations to the community version, but this is not a hard-and-fast rule. A third category of Predictive Analytics software suppliers includes organisations that have historically provided statistical or Business Intelligence packages, and have since developed add-on modules that bring Predictive Analytics to their base software. Examples include SPSS (commercial) and R (free and open source). With well-established products, these suppliers have added Predictive Analytics capabilities while maintaining their core focus on statistics.



#1: Sydney Morning Herald - May 12, 2012 "Data miners find there's gold in them thar files"