In this short article, we'll show you a simple example of one of the classics of data science - in this case modelling using the Decision Tree methodology.
We'll quickly take you through the steps required to develop a simple predictive model using RapidMiner - and only assume you have a very basic knowledge of this tool (namely, how to create a new process and load a set of data). You'll also need some familiarity with statistical concepts like nominal, ordinal and numeric attribute types, and correlation vs. causation, but I've tried to explain these briefly below.
If you have access to RapidMiner (the demo version will work fine), you can download the source data here and follow along!
Decision Trees - what are they and why are they useful?
The Decision Tree is a classic modelling method for predicting future values from a set of identified attributes, and RapidMiner's straightforward model development user interface makes creating one about as simple as it can be. Decision Trees are in wide use - for instance, they're commonly used by financial institutions seeking to understand the risks of consumer lending. By examining customer attributes like age and income, and correlating these variables (singly, and in combination) against loan outcomes like defaults, the credit provider can get a sense of the best and riskiest consumers to lend to. Moreover, they can use this information to price their products via their interest rates and other product pricing features. The beauty of the Decision Tree is in its simple-to-understand visualisation of these correlations. If you'd like to understand more about Decision Trees, this article provides a good introduction; however, they're generally seen as quite intuitive and you should be able to adequately understand them by following this example.
In this article and the next, we'll work through an example, showing you the steps required to set up a training dataset, formulate a predictive model and then test this model against real data to determine its usefulness.
The question: what sorts of people survived the Titanic disaster?
We'll use a well-known training dataset that contains passenger data from the ill-fated Titanic. Thought of as unsinkable, the ship hit an iceberg on its maiden voyage and sank in the seas off Canada with the loss of over 1,500 lives. The dataset we'll use lists, for each passenger, their key attributes like fare paid, class of service and age (the independent variables). It also contains a key 'outcome' attribute (the dependent variable) - did the passenger survive the sinking or not? Using a decision tree, we'll take this training dataset and let RapidMiner choose the best attributes to predict survival, and then we'll test the model to see how 'good' the predictions are by comparing what the model says with what actually happened.
Step 1. Getting the Titanic 'survival' data
The first step is to get hold of the Titanic dataset. There are numerous versions available on the web, but the one we're using can be downloaded here. It's worth opening this file in Excel first to get a quick overview of its contents. The dataset is a passenger list, and for each passenger you can see the following attributes:
| Attribute | Description |
|---|---|
| name | The name of the passenger |
| age | Age in years |
| fare | The fare paid |
| pclass | An integer from 1 to 3, representing the class of travel (1=First class, 2=Second class etc.) |
| sex | Gender of passenger (M=Male, F=Female) |
| survived | A flag representing whether the passenger survived (1) or not (0) |
Now let's start up RapidMiner to develop our predictive model. First, create a new process using the 'blank' template. You should see an empty Process panel to which we'll add the required RapidMiner operators. We'll now step through each of the operators in sequence.
Step 2. Read the data into RapidMiner
As our training dataset is in Excel format, we'll use the 'Read Excel' operator. Find the operator in the Operators panel, then drag it across to the Process panel. Each operator has a set of parameters, and the first one to fill in, in this case, is the 'excel file'. Use the browser icon to find your file (Titanic3.xls). The easiest way to set up a dataset correctly in RapidMiner is to use the Import Configuration Wizard, but before doing this click the 'first row as names' checkbox to ensure the column headings are used as attribute names. After clicking the wizard button, you'll be taken through the steps required to convert your Excel data into a RapidMiner dataset. As you step through the wizard, let RapidMiner choose the default attribute types and roles, although we'll be adjusting one of them using the 'Set Role' operator later (Step 5 below).
At this point you might like to take a look at the data in RapidMiner. You can do this by connecting the 'out' port of the 'Read Excel' operator to the 'res' (Results) port on the right hand side of the screen. Save your model as something like 'TitanicSurvivalPredictor' and click 'Run'. You should now see a table of the Excel file's contents, with the first row used as the attribute names.
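If it helps to see what 'first row as names' means outside RapidMiner, here's a minimal Python sketch. It isn't what RapidMiner does internally, and it reads CSV text rather than an Excel file (reading Excel would need a third-party library). The three sample rows are made up for illustration and only follow the column layout described in Step 1.

```python
import csv
import io

# Made-up sample rows using the column layout from Step 1.
# A real run would read the downloaded file instead of this inline string.
SAMPLE_CSV = """name,age,fare,pclass,sex,survived
Passenger A,29,211.34,1,F,1
Passenger B,40,7.90,3,M,0
Passenger C,18,13.00,2,F,1
"""


def read_rows(text):
    """Parse CSV text, using the first row as the attribute names."""
    reader = csv.DictReader(io.StringIO(text))
    return list(reader)


rows = read_rows(SAMPLE_CSV)
print(list(rows[0].keys()))   # the attribute names taken from the first row
print(len(rows), "rows loaded")
```

Note that everything comes back as text here; deciding which columns are numeric and which are nominal is the job the import wizard does for you in RapidMiner.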
Step 3. Select the most useful attributes in the dataset
The dataset contains a number of attributes we don't need for modelling survival. Selecting attributes for a model from a full set of variables is an important topic in its own right, but for brevity, I've already selected the attributes we'll use in our model, conscious of making a useful, predictive model while not overfitting the variables (see below). It's always good practice to reduce your dataset attributes down to just what you need for your model - this makes your data and model easier to understand, and reduces the risk of 'overfitting'. It's critically important to understand the statistical concept of 'overfitting', and if you're not sure what this means, it's well worthwhile familiarising yourself. Why? Because 'overfitted' models are dangerous models: they look useful, but in fact are very poor at being used for future predictions. You can be sure that any competent data scientist clearly understands the risk of overfitting and manages their models accordingly.
To select the variables we need, add a 'Select Attributes' operator to your model, and connect the 'out' port of the 'Read Excel' operator to the 'exa' (Example) port of the 'Select Attributes' operator. In the 'Select Attributes' operator, you now need to adjust the parameters so we get only the attributes we need. Select 'subset' for 'attribute filter type', then click 'Select Attributes'. We'll be using the following attributes (variables), so select these:
| Attribute | Description |
|---|---|
| age | The age of the passenger |
| fare | The value of the fare paid for this passenger - higher numbers equate to more expensive tickets |
| pclass | The travelling class eg. First or Third class |
| sex | The gender of the passenger |
| survived | A flag that represents whether the person survived or not |
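As a rough Python equivalent of the 'subset' filter, the sketch below keeps only the five attributes we want and drops everything else. The function name `select_attributes` and the two sample rows are my own inventions for illustration.

```python
# The five attributes chosen for the model.
WANTED = ["age", "fare", "pclass", "sex", "survived"]


def select_attributes(rows, wanted):
    """Keep only the named attributes in each row, in the given order."""
    return [{name: row[name] for name in wanted} for row in rows]


# Made-up rows; 'name' stands in for the attributes we don't need.
rows = [
    {"name": "Passenger A", "age": 29, "fare": 211.34, "pclass": 1, "sex": "F", "survived": 1},
    {"name": "Passenger B", "age": 40, "fare": 7.90, "pclass": 3, "sex": "M", "survived": 0},
]

selected = select_attributes(rows, WANTED)
print(selected[0])
```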
Step 4. Discretize the 'Survived' and 'PClass' attributes
RapidMiner's Decision Tree operator requires that your prediction attribute (Survived) be nominal (ie. non-numeric or, in this case, 'Survived' vs. 'Did not survive'). Our dataset, however, comes to us with the 'survived' status as either 1 or 0 ie. numeric. Without fixing this by transforming the variable into a nominal attribute, the model could produce outcomes like 'survived=0.8' - of course this makes no sense: like being pregnant, you can't be 'sort of' dead! The Discretize operators let you adjust your numeric attributes that represent classes, turning them into binominal (two-valued) or polynominal (many-valued) nominal values, which is what we need. Add a 'Discretize by User Specification' operator to your process model, and select 'Single' for the 'attribute filter type' and 'survived' for the 'attribute'. This means you're going to discretize just one attribute - whether the passenger survived or not. Click the 'Edit List' button, and add two class names: 'DidSurvive' and 'DidNotSurvive'. Set the upper limit as '1' for DidSurvive and '0' for DidNotSurvive. If you join your operator to the results and run the model, you'll see the adjusted dataset containing the words 'DidSurvive' and 'DidNotSurvive' now in the 'survived' attribute. This is what we need for the decision tree - you'll find that the technique of discretizing attributes is an important one when developing predictive models.
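To make the 'upper limit' idea concrete, here's a small Python sketch of discretizing by user-specified limits: each value is given the name of the first class whose upper limit it doesn't exceed. This is my own illustration of the idea, not RapidMiner's actual implementation, and the function name `discretize` is hypothetical.

```python
def discretize(value, classes):
    """Return the name of the first class whose upper limit is >= value.

    `classes` is a list of (name, upper_limit) pairs, in any order.
    """
    for name, upper in sorted(classes, key=lambda pair: pair[1]):
        if value <= upper:
            return name
    raise ValueError(f"{value} is above every upper limit")


# The two classes and upper limits entered via 'Edit List'.
SURVIVED_CLASSES = [("DidSurvive", 1), ("DidNotSurvive", 0)]

labels = [discretize(v, SURVIVED_CLASSES) for v in [1, 0, 0, 1]]
print(labels)  # ['DidSurvive', 'DidNotSurvive', 'DidNotSurvive', 'DidSurvive']
```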
We also have another discretize task to perform. The 'pclass' attribute represents the cabin class for the passenger, with values from '1' (First Class) to '3' (Third Class). Because this attribute is numeric, as for 'survived' described above, RapidMiner will assume it's a continuous variable (like fare paid), and our decision tree could end up with nonsensical 'forks' in the tree for pclass like '<2.5' but... there's no such thing as '2 and a half' class on board ships! In fact, we want to treat pclass as a discrete nominal attribute so that the tree forks on 1st, 2nd or 3rd class only. To do this, add another 'Discretize by User Specification' operator, select a single variable ('pclass'), and add three classes, one for each class of travel, with upper limits of '1', '2' and '3' respectively.
Join these two discretize operators to the flow by connecting the 'exa' output ports to the 'exa' input ports. We now have a dataset where cabin class (pclass) assumes one of the three textual values, and RapidMiner's decision tree won't split them into non-existent, non-integer pclass values.
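Where does a fork like '<2.5' come from? A common way for a tree to split a numeric attribute is to test thresholds halfway between the values it has seen. The Python sketch below shows that idea under that assumption (I'm not claiming this is exactly how RapidMiner picks its thresholds), and the class names 'First', 'Second' and 'Third' are placeholders of my own.

```python
def numeric_split_points(values):
    """Midpoints between consecutive distinct values: the thresholds a tree
    could test if it treats the attribute as a continuous number."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def nominal_branches(values):
    """Distinct categories: one possible branch per value for a nominal attribute."""
    return sorted(set(values))


# Placeholder class names, not taken from the dataset.
CLASS_NAMES = {1: "First", 2: "Second", 3: "Third"}

pclass_numeric = [1, 3, 2, 3, 1, 2, 3]
pclass_nominal = [CLASS_NAMES[v] for v in pclass_numeric]

print(numeric_split_points(pclass_numeric))  # [1.5, 2.5]
print(nominal_branches(pclass_nominal))      # ['First', 'Second', 'Third']
```

The numeric version produces the 'one and a half' and 'two and a half' class thresholds we want to avoid; the nominal version can only ever fork on real classes.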
Step 5. Set Role
On its own, RapidMiner's decision tree doesn't know what we're actually trying to predict with this model. How do we tell RapidMiner what we'd like to predict? The answer is 'by setting the label role on the prediction attribute'. To do this, add a 'Set Role' operator, and choose 'survived' as the 'attribute name'. Now choose a 'target role' of 'label'. The label role tells RapidMiner that this attribute is the one that contains the prediction - in our case whether the passenger survived or not. You might like to think of the 'label' as the model's answer attribute. Note that each dataset can contain only one 'label' attribute (all other attributes are termed 'regular'). In other words, all attributes in a model have a role: the attributes you use to do the prediction are termed 'regular' (which happens to be the default for all attributes when loading a data set). The attribute that defines the outcome or answer is called the 'label', and in our case there can be only one of them: did the person survive the disaster or not, because that's the outcome we're trying to predict.
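In code terms, setting a role amounts to telling the learner which column is the answer and which columns are the inputs. A minimal Python sketch of that separation follows; the function name `split_by_role` and the sample rows are hypothetical.

```python
def split_by_role(rows, label_name):
    """Separate each row into its regular attributes and its single label value."""
    regular = [{k: v for k, v in row.items() if k != label_name} for row in rows]
    label = [row[label_name] for row in rows]
    return regular, label


# Made-up rows, already discretized as in Step 4.
rows = [
    {"age": 29, "fare": 211.34, "pclass": "First", "sex": "F", "survived": "DidSurvive"},
    {"age": 40, "fare": 7.90, "pclass": "Third", "sex": "M", "survived": "DidNotSurvive"},
]

regular, label = split_by_role(rows, "survived")
print(regular[0])  # the inputs used for prediction
print(label)       # the answers the model should learn to predict
```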
Step 6. Create a Decision Tree
At last we can insert our Decision Tree operator (you can search for it in the operator search text box, or select it directly under the operator selection dropdown in 'Modelling - Predictive - Trees - Decision Tree'). Add this to your Process panel, and join it up to your data by connecting the output of the last operator we used ('Set Role') to the 'tra' (training) input port. For now, leave Decision Tree's parameters as is - once you've created a decision tree, you can look at modifying advanced parameters like 'pruning' and 'maximal depth' to fine-tune the model, balancing simplicity with accuracy. Don't forget to connect the output from Decision Tree to the results connector so you can see the results - you must always do this with any model, and it's a common error for newcomers to RapidMiner to forget this step, wondering why they don't see anything when they run their models!
Step 7. Run the Process
Click 'Run', and if you've set up your operators correctly, you should see the results in a view (tab) called 'Tree (decision tree)'. You should see that the first-level predictor attribute is 'sex'. With a Decision Tree, RapidMiner attempts to list its predictor variables in 'most influential' to 'least influential' order. What do we find from the Decision Tree image? A quick overview suggests females were more likely than males to survive the Titanic's sinking. As you go a level further down, you should see 'pclass' ie. cabin class as the next most significant predictor of survival - yes, the movie was right: first class passengers seemed to have a higher probability of surviving than those down in steerage.
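How might a tree decide that 'sex' belongs at the top? One common criterion is information gain: how much knowing an attribute reduces the uncertainty (entropy) about the outcome. The Python sketch below uses that criterion as an assumption; which criterion RapidMiner actually applies depends on the operator's settings. The ten passengers are made up purely so the arithmetic is easy to follow.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, label):
    """Reduction in label entropy after splitting the rows on one attribute."""
    before = entropy([row[label] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[label])
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return before - after


# Ten made-up passengers: (sex, pclass, survived). Not real Titanic records.
PASSENGERS = [
    ("F", "First", "Yes"), ("F", "Second", "Yes"), ("F", "Third", "Yes"),
    ("F", "Third", "Yes"), ("F", "Third", "No"),
    ("M", "First", "Yes"), ("M", "First", "No"), ("M", "Second", "No"),
    ("M", "Third", "No"), ("M", "Third", "No"),
]
rows = [dict(zip(("sex", "pclass", "survived"), p)) for p in PASSENGERS]

gains = {name: information_gain(rows, name, "survived") for name in ("sex", "pclass")}
best = max(gains, key=gains.get)
print(gains)
print("first split on:", best)
```

In this toy sample 'sex' reduces the uncertainty about survival far more than 'pclass' does, so it would be chosen for the first fork.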
Let's talk about cross-correlation and overfitting briefly, because there's a great example of this issue in practice with this dataset. Intuitively, it makes sense that to a large extent 'fare' is a proxy for 'pclass' (cabin class) - expensive fares are more likely to be for first class, while the cheapest fares are more likely to be associated with the cheaper cabins. Without going into detail, here's a great example of how to avoid overfitting: ideally, you want your model to have only the attributes that do not 'cross-correlate' - as far as possible. Given 'fare' and 'pclass' can be expected to cross-correlate, it would be a bad idea to include both attributes in your model. Simplistically, you could describe the situation where you included both in your model as 'these two attributes 'predict' each other well' - but you know that this really tells us nothing (in fact, it would detract from the model's usefulness and give you or others a false sense of the model's reliability). The take-out message is: avoid including multiple attributes that are near proxies for each other - pick one or the other, whichever you're most interested in, or whichever you think would be most useful for prediction using your model. This example illustrates the concept of overfitting, where you can mistakenly get a model that appears to fit the data well, but in fact tells you nothing you didn't already know. Note that there are formal statistical tests for cross-correlation, but you can go a long way to simplify your models and avoid overfitting by just using common sense: in informal terms, don't include multiple attributes that mean roughly the same thing.
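One simple, informal check for 'near proxy' attributes is the Pearson correlation coefficient: values close to +1 or -1 suggest two attributes carry much the same information. Here's a minimal Python sketch; the fares and classes are made-up numbers, and any cut-off you choose for 'too correlated' is a judgment call rather than a rule.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up fares paired with the numeric class (1=First ... 3=Third).
fares = [210.0, 150.0, 30.0, 26.0, 8.0, 7.5]
pclass = [1, 1, 2, 2, 3, 3]

r = pearson(fares, pclass)
print(round(r, 2))  # strongly negative: higher fares go with lower class numbers
```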
We won't cover fine-tuning your model in this article, but for more advanced users, a next step may be to go back to your Process panel and adjust Decision Tree operator parameters like pruning to change the objectives of the algorithm, and consequently create different 'trees'.
That's it! You've loaded a training dataset and created a predictive model that predicts survivability based on a passenger list containing a set of influencing attributes.
We haven't covered the next steps of assessing the model's validity using more advanced metrics, but what you've got so far should be a good introduction into the use of Decision Trees in RapidMiner to understand the effect of attributes on an outcome. Note also that RapidMiner comes with a simple 'Golf' dataset and there are numerous tutorials online on how to use this dataset with Decision Trees to predict the likelihood of playing golf given various weather conditions.
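The simplest version of 'what the model says versus what actually happened' is accuracy: the share of passengers for whom the prediction matches the real outcome. The Python sketch below shows just that one basic measure, with made-up predictions; it's a starting point, not a full validation approach.

```python
def accuracy(predicted, actual):
    """Fraction of cases where the predicted label equals the actual label."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


# Made-up outcomes for five passengers held back from training.
actual = ["DidSurvive", "DidNotSurvive", "DidNotSurvive", "DidSurvive", "DidNotSurvive"]
predicted = ["DidSurvive", "DidNotSurvive", "DidSurvive", "DidSurvive", "DidNotSurvive"]

print(accuracy(predicted, actual))  # 4 of 5 correct -> 0.8
```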
Here's a screenshot of the completed model for you to check your own model against:
Your model tells you 'what', but 'why'?
We briefly touched on the model's results above: remember how we discovered that those paying higher fares did have a slight survival advantage, and younger female travellers survived significantly more often than older males? There's an important Data Science fact behind these findings. While a high-quality, predictive model may tell you what happened based on inputs, to understand 'why', you generally need to understand your model's total or qualitative environment much more soundly - in this case culturally and from a gender history perspective. In the case of the financial services credit risk assessment Decision Tree modelling, answering 'why' will often require delving into sociology, for instance (eg. why are older people more risk averse, and consequently better credit risks?). In some cases, 'why' (for example with our Titanic model, why did younger females survive at a higher rate than older males?) can be answered by obtaining other, often unrelated data and developing new models from this data. However, often this quantitative data is unavailable, and answering 'why' requires delving into qualitative historical influences. The key point here is that when someone asks you 'why' a model predicts what it predicts, you will often need to answer 'I don't know'. Or, being more specific, 'my model tells us which attributes are associated with which outcomes'. But be aware of the limitations of your model: the difference between a competent data scientist, and an also-ran analyst attempting data science modelling tasks without the technical statistical background or the maturity to understand what their models mean, is often given away by their answer to this question. While reporting to your client or sponsor words to the effect of 'I don't know' is almost certainly too abrupt, a much better answer is probably 'I don't know, but here's what I'd need to do to find out why'.
Be aware though: the 'what' can be easy, the 'why' can be extremely complex to ascertain, and require skills way outside those expected of a traditional data scientist. The Titanic example above is instructive: why did younger females survive at a higher rate than older males? The answer, if it's obtainable, will lie in a historical and gender sociological study (and I'm only guessing here)...not the traditional bailiwick of a data scientist. In short, as a data scientist, you'd prove your mettle by highlighting what's required to answer the 'why' question, not answering it yourself.
Association vs. Causality
Focussing in on the last section's discussion of identifying the 'why', it's critically important to note that any modelling of data reports 'association' only. It makes no claims over causality. A more basic example illustrates the point: if you looked outside onto a street on a rainy day, you might note that there seem to be far more umbrellas in use than on a dry day. But we don't for a minute say that 'lots of umbrellas cause wet weather'. Rather, we say that lots of umbrellas are associated with wet weather. The same concept applies to any and all modelling you do: models generally test associations, but have little to say about causality. Understanding the causes of rainy weather has little to do with umbrellas, and it's extremely important to bear this concept in mind when doing any predictive insights work. Confusing causality with association is a common trap for young players, and differentiates those able to successfully apply Data Science techniques from those who aren't. These kinds of mistakes in the world of data are more common than you think, and are one of the reasons maths/statistics/data science graduates are being paid at multiples of their business graduate/commerce colleagues when working with data in commercial organisations.
Effective Data Science Modelling - what do I need to know and apply?
While there's probably another article on this for the future (and you can already find many online), I hope this example article has illustrated what I've seen to be the most important concepts to grasp from both a technical and practical perspective (ie. you know both the concept, and how to apply it) when it comes to successfully applying for a Data Science role. These are the statistical/mathematical skillsets I've seen matter most in the selection of successful candidates (in no particular order by the way):
- Choice of Data Science tool/technique/algorithm - match the objective, and the data (eg. in this example we chose a Decision Tree as the most useful tool for a flat dataset with intuitive and descriptive nominal outcomes, and - to a lesser extent - attributes)
- Overfitting - risks and minimisation techniques. Avoid cross-correlating attributes (test if necessary, but also use common sense)
- Association vs. Causality - understanding, accepting and communicating the difference. The last point (communicating) is the most important
- Causality identification techniques - being able to link qualitative and quantitative model data to find, or at least hypothesise the 'why' (this is the most difficult skill by the way - don't steer into it unless you know what you're doing ie. are you really both a data scientist, and a sociologist?). Communicate your work's limitations (which are not necessarily yours by the way!) and defer to experts when necessary
- Model complexity minimisation - 'simplicity as an objective' or, in particular, care when choosing unintuitive techniques like black box algorithms that can be difficult or impossible when it comes to describing how they work. Remember your audience/client: they need assurance they can both explain and rely on your model. Being unable to describe it, or describing it with the complex provisos of a black-box model is a significant downside and generally not worth the predictive improvement unless it's an order of magnitude above its next best model competitor (IMHO). This concept itself needs clear articulation!
Note that these are almost job interview points for a (qualified) interviewer talking to someone applying for a Data Science job. Of course, these are in addition to the efficient and practical use of all sorts of Data Science tools (like big-data technology, languages like Python, modules like pandas, cleansing/transformation techniques and the huge suite of Data Insight software tools available these days). Remember, backgrounds are still important: generally speaking, a maths/statistics graduate can be expected to have good technical data handling and Data Science tool skills using the field's suite of technologies (this is usually part of their course, or something they're expected to pick up by themselves). In short, it's virtually assumed that a maths/statistics graduate is competent in, or can pick up very quickly, languages like Python and tools like pandas. For a business/commerce graduate, these are things that generally need specific probing in an interview. More brutally, an A-grade business/commerce graduate may or may not have good technology skills unless there's specific evidence on their transcript: a maths/statistics graduate can generally be expected to either have these skills already, or be able to pick them up on the job.
Trained data scientists, through specific post-grad or commercial courses, tend to have these things drummed into them from day one because they're led by experienced industry people used to selecting employees or contractors, but...some of these skills are subtle, and will, in a competent interview, be explored then - particularly if the candidate's formal background does not suggest such knowledge. Technology skills are relatively easy to test - provided the applicant (fairly) knows what's expected before the interview (the web has a plethora of online Python tests for instance). The deep understanding of subtle statistical concepts like causation and overfitting will only come out in an interview, and the situational interview model (or a variation thereof eg. MMI - Multiple Mini Interview) seems to be the best-practice way to assess these skills. These kinds of interviews do require a skilled interviewer, or at least one who works together with a non data-scientist interviewer (eg. the 'HR' person). The unfortunately common 'outsourcing of interview to HR' approach is not appropriate to Data Science jobs where both a technical and qualitative understanding of the role are required to select the best person. In these cases the assessment criteria may tend toward the 'behavioural' - which will advantage or disadvantage candidates depending on important non-technical factors relevant to their job performance, but outside their abilities to perform the technical characteristics of the role. In this situation, the technical requirements of the role will probably (and merely) be assessed based on the candidate's CV, given the interviewer's technical inability to assess these skills to any deep level in the interview itself. With forewarning on who will conduct the interview, a candidate would be best advised to prepare accordingly!
(Note: the increasingly large starting salaries now provided by organisations wishing to recruit data scientists, combined with the growing organisational understanding of the criticality of these roles, appears to be making the selection panels for these roles increasingly likely to include skilled data scientists in this recruitment process. In summary, it is less likely now that a candidate would be interviewed without senior data scientists included as part of the interview process).
Creation of model - what are you actually trying to describe here? This is dependent on the role: do you get to influence where you focus your efforts? Or is this pre-determined by your leaders/managers? This aspect of a role differentiates what I'd call a 'Data Scientist' from a 'Data technician'. In the crazy, hot world of data science, you may find the money is not that different, but for your own job satisfaction (and ability to actualise your skills), be sure you understand what the role needs.
Other useful articles
| Link | Topic |
|---|---|
| http://auburnbigdata.blogspot.com.au/2013/03/decision-tree-in-rapidminer.html | Decision tree overview |
| http://www.simafore.com/blog/bid/107076/How-to-choose-optimal-decision-tree-model-parameters-in-Rapidminer | Decision Tree parameters |