Introduction to Displayr 5: Machine Learning and Multivariate Statistics
This post gives a brief overview of how the more advanced data science analysis methods work in Displayr.
Which advanced data science methods are available?
There are too many to list. Seriously. Displayr lets you perform analyses using R, which gives you access to one of the largest repositories of advanced data science methods. Everything from random forests and decision trees through to generalized linear models, PCA, and MANOVA is available.
Displayr makes you more productive when using R
Most of Displayr's advanced analyses use R. This leads to an obvious question: why not instead use R directly, or use one of the specialized R development environments like RStudio? The short answer is that unless your data is "big", you will be more productive when using R in Displayr. This is because:
- We have curated some of the most common methods (see below).
- Displayr has better tools for simultaneously exploring and tidying the data. That is, you can create simple tables and plots by dragging and dropping and, at the same time, tidy the data.
- Displayr's automatic updating greatly reduces the number of errors and the amount of time taken to build a model in an iterative fashion. If you are not sure what I am talking about, check out Introduction to Displayr 4: Simple calculations.
- Displayr's Pages and Data trees make it easy to organize complex analyses (see Introducing Displayr: the data science and reporting app for everyone).
- Displayr integrates the reporting and analysis. See Introduction to Displayr 6: Reporting - automated and reproducible.
The curated methods
Commonly used machine learning and multivariate statistical methods are available by point and click from Insert > Analysis. For example, if you select Insert > Analysis > Regression you get a generalized linear model. Its output is shown below.
While this looks reasonably pretty, real beauty is not on the surface. If you click on the output, Displayr shows you:
- Tips for improving the analysis (see the orange boxes to the right).
- The selections made during the creation of the analysis. Consequently, you can always work out how the analysis was conducted and there is no need to create any separate documentation.
- A "live" analysis: modifying the selections updates the analysis.
If you are familiar with generalized linear models, you will also see that some of the more challenging aspects of doing high quality analysis are built in and can be accessed via the user interface. In particular:
- You can change the Type of the model to fit other generalized linear models and related models (e.g., Binary Logit, Poisson, multinomial logit (MNL), and negative binomial (NBD)).
- Robust standard errors are available at the click of a button.
- Built-in options handle Missing data (e.g., multiple imputation).
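For readers who think in code, the options above correspond roughly to the following in plain R. This is a minimal sketch on the built-in mtcars data using only base R. It is not the code Displayr generates: the choice of variables is arbitrary, the robust standard errors use a hand-rolled HC0 sandwich estimator, and the missing-data step uses simple complete-case deletion rather than multiple imputation.

```r
# Illustrative only: plain base R on the built-in mtcars data,
# not the code that Displayr writes for you.
data(mtcars)

# Changing the model "Type" amounts to changing the family of the GLM.
linear <- glm(mpg ~ wt + hp, data = mtcars, family = gaussian())
logit  <- glm(am ~ wt, data = mtcars, family = binomial())   # binary logit
pois   <- glm(carb ~ hp, data = mtcars, family = poisson())  # Poisson counts

# Robust (heteroskedasticity-consistent, HC0) standard errors for the
# linear model, computed by hand with the sandwich formula.
X         <- model.matrix(linear)
e         <- residuals(linear, type = "response")
bread     <- solve(crossprod(X))
meat      <- crossprod(X * e)   # equals t(X) %*% diag(e^2) %*% X
robust_se <- sqrt(diag(bread %*% meat %*% bread))

# A simple missing-data option (an assumption, not multiple imputation):
# drop incomplete cases before fitting.
with_gaps <- mtcars
with_gaps$hp[c(3, 7)] <- NA     # hypothetical missing values
complete  <- glm(mpg ~ wt + hp, data = with_gaps, na.action = na.omit)

# Classical and robust standard errors side by side.
print(cbind(classical = sqrt(diag(vcov(linear))), robust = robust_se))
```

The point of the sketch is only that each menu option maps onto a small, well-defined change in the model specification, which is why it can sensibly be exposed as a drop-down or check box.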
The code is all there!
If this is not your first square dance, there is a good chance that you are thinking "yeah, sure, looks pretty, but at the end of the day I know I am going to need to write code, as the special stuff I do is not in the menus". Don't be alarmed: Displayr's got this covered. Click on Properties > R CODE and you will see the underlying code. The variables in blue are the various controls (e.g., check boxes and drop-downs) from the Inputs tab. Note that the variable labels shown in the previous outputs are now replaced by variable names in yellow. You can see and edit all the code. When you make your selections in the user interface, Displayr writes the necessary code in the background. This can save a heap of time but still gives you the flexibility to customize and write your own code.
If you look at the top-left of the screenshot below you will see the list of Pages in the Displayr document that I have been creating while writing this post. It shows that I am currently on a page called "Regression Tree". Underneath it are two further nodes. The first node, "cart", represents the specific analysis shown. The second, "Title", is the text object at the top of the page.
Underneath, you can see the linear regression output introduced before, where the model itself is named glm (you can change these names by clicking on them).
The third page in Pages shows a comparison of the two models, based on the correlation of each model's predictions with the observed data. If you are not familiar with R, the actual code will be a bit mysterious, but that is not so important. The important thing to note here is that even though the two previous pages held one graphical summary of a model (the Sankey diagram of the regression tree) and one tabular summary (the regression), we can refer to each of these objects in other R Outputs. Doing so creates a chain of analyses, where each object used is a link.
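If the comparison code looks mysterious, here is a rough idea of what such a comparison does, written in plain R outside Displayr. It is a sketch under assumptions: mtcars stands in for the survey data used in this post, the object names glm_model and cart_model are hypothetical stand-ins for the outputs named glm and cart, and the tree is fitted with the rpart package, which may differ from what Displayr uses.

```r
# Illustrative only: mtcars stands in for the data used in the post.
library(rpart)  # recommended package included with standard R installations

data(mtcars)

# Two models of the same outcome (hypothetical names, see the note above).
glm_model  <- glm(mpg ~ wt + hp + disp, data = mtcars)
cart_model <- rpart(mpg ~ wt + hp + disp, data = mtcars)

# Correlate each model's in-sample predictions with the observed values.
comparison <- c(
  glm  = cor(predict(glm_model, type = "response"), mtcars$mpg),
  cart = cor(predict(cart_model), mtcars$mpg)
)

print(round(comparison, 3))
```

In a plain R session, changing either model means re-running the comparison by hand. In Displayr, the comparison refers to the two outputs by name, so it is recomputed whenever either of them changes.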
In the R code you can also see that the various terms have shading behind them. The R Outputs are blue. Yellow indicates that "NPS Google" is a variable set in the Data tree. If you hover your mouse over these you will get a preview.
Once we have chained together analyses, they are linked and update automatically. For example, if you remove one of the predictors from the linear regression or the regression tree, that model will automatically update, as will the comparison of the models.
The final post in this series of introductory posts is Introduction to Displayr 6: Reporting - automated and reproducible.