## What is...?

Sampling error is the difference between the sample values and the true population values, which results from the use of random sampling.


A latent variable is a variable that is inferred using models from observed data. These can be inferred through a wide range of approaches.


Feature engineering is the process of selecting and transforming variables when creating a predictive model using machine learning or statistical modeling.

Non-sampling error refers to any deviation between the results of a survey and the truth that is not caused by the random selection of observations.

Survey data processing is the manipulation or transformation of raw survey data into meaningful results which can be analyzed to answer a research question.


A conversion rate is the percentage of people who move from one stage of a process to the next. It is often used to identify weak spots in a process.

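As a minimal sketch (the funnel stages and counts below are invented for illustration), a conversion rate can be computed directly from stage counts:

```python
def conversion_rate(entered, converted):
    """Percentage of people who moved from one stage to the next."""
    return 100 * converted / entered

# Hypothetical funnel counts, purely for illustration.
funnel = {"visited": 1000, "added_to_cart": 250, "purchased": 50}

visit_to_cart = conversion_rate(funnel["visited"], funnel["added_to_cart"])       # 25.0
cart_to_purchase = conversion_rate(funnel["added_to_cart"], funnel["purchased"])  # 20.0
```

In this made-up funnel the lower cart-to-purchase rate would flag that stage as the weaker spot.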

A model is a usable description of how a system is believed to work. It is a simplification of reality, with unnecessary detail excluded.


Standard error is the estimated standard deviation of the sampling distribution of a statistic. It quantifies the uncertainty around a parameter estimate.

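For the common case of the sample mean, the standard error is the sample standard deviation divided by the square root of the sample size. A sketch using Python's `statistics` module and invented data:

```python
import math
import statistics

def standard_error_of_mean(sample):
    # SE of the mean = sample standard deviation / sqrt(n)
    return statistics.stdev(sample) / math.sqrt(len(sample))

# Made-up observations, for illustration only.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
se = standard_error_of_mean(sample)
```

A larger sample shrinks the standard error, reflecting reduced uncertainty about the mean.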

A price sensitivity meter is a set of survey questions used to work out how to set prices for products. It provides a framework for the range of prices people consider acceptable.

Multidimensional scaling (MDS) is a technique used to visualize the distances between objects when the distances between pairs of objects are known.

The chi-square test of homogeneity tests to see whether different columns (or rows) of data in a table come from the same population or not.


A column chart is a data visualization where each category is represented by a rectangle, with the height of each rectangle proportional to the value it represents.

Overplotting is when the values or labels in a data visualization overlap, making the visualization difficult to read.

A bubble chart is a data visualization that displays multiple circles (bubbles) in a two-dimensional plot. It can be used to show three variables.

The effective sample size is an estimate of the sample size required to achieve the same level of precision if that sample were a simple random sample.

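For weighted samples, one widely used approximation is Kish's formula, n_eff = (Σw)² / Σw². A sketch with invented weights:

```python
def effective_sample_size(weights):
    # Kish's approximation: (sum of weights)^2 / sum of squared weights.
    return sum(weights) ** 2 / sum(w * w for w in weights)

# With equal weights, the effective sample size equals the actual size.
equal = [1.0] * 100
# Uneven weights reduce the effective sample size below 100.
uneven = [1.0] * 50 + [3.0] * 50
```

The more variable the weights, the greater the loss of precision relative to a simple random sample of the same size.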

A labeled scatter plot is a data visualization that displays the values of two different variables, with text labels showing the meaning of each data point.


A scatter plot is a chart that displays the values of two variables as points. The data for each point is represented by its position on the chart.


A/B testing involves testing two different approaches to solving a problem (approach A and approach B) and working out which is better according to the data.

The chi-square test of independence tests to see whether there is a relationship between two categorical variables in a dataset.

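The test statistic compares observed cell counts with the counts expected if the two variables were unrelated. A sketch with a made-up 2×2 table; the closed-form p-value below applies only when the degrees of freedom equal 1:

```python
import math

def chi_square_independence(table):
    """Chi-square statistic and degrees of freedom for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Invented counts: preference (rows) by group (columns).
stat, df = chi_square_independence([[30, 10], [20, 40]])
# For df = 1 the p-value has a closed form via the complementary error function.
p_value = math.erfc(math.sqrt(stat / 2))
```

A small p-value indicates the observed counts are unlikely under independence, suggesting a relationship between the two variables.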

The R-Squared statistic quantifies the predictive accuracy of a statistical model. It is also known as the coefficient of determination and R².

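R² can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. A sketch with invented observed and predicted values:

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

# Made-up observed values and model predictions.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(observed, predicted)
```

A value of 1 means the predictions match the data exactly; values near 0 mean the model predicts no better than the mean.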

The chi-square frequency test works out whether observed values are consistent with expectations and whether any difference is statistically significant.

Survey quotas set the number of observations needed to meet a specified requirement. Quotas can be interlocking or non-interlocking.

Data filtering is the process of choosing a smaller part of your dataset and using that subset for viewing or analysis. It is usually temporary.

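In code, filtering often amounts to building a temporary subset while leaving the original data untouched. The records below are invented for illustration:

```python
# Made-up survey records.
records = [
    {"age": 23, "region": "North", "score": 7},
    {"age": 41, "region": "South", "score": 9},
    {"age": 35, "region": "North", "score": 5},
]

# Temporary subset: only respondents from the North region.
north = [r for r in records if r["region"] == "North"]
```

The original `records` list is unchanged, so the filter can be discarded or redefined at any time.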

Data measurement scales are classifications which indicate the types of mathematical operations that can be performed on the data.


A crosstab is a table showing the relationship between two or more variables. Crosstabs are useful for finding patterns and correlations in data.

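At its simplest, a crosstab is a table of cell counts for each combination of two variables. A sketch using Python's `collections.Counter` on made-up responses:

```python
from collections import Counter

# Invented responses: (gender, drink preference) pairs.
responses = [
    ("F", "tea"), ("F", "coffee"), ("M", "tea"),
    ("M", "coffee"), ("F", "tea"), ("M", "coffee"),
]

# Each key of the Counter is one cell of the crosstab.
crosstab = Counter(responses)
```

Reading the counts by row (gender) and column (preference) reveals any pattern in how the two variables relate.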

Dummy variables are variables that take values of 0 and 1, where the values indicate the presence or absence of something.

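A sketch of one-hot encoding a categorical variable into 0/1 dummy columns; the `to_dummies` helper and colour data are invented for illustration:

```python
def to_dummies(values, categories=None):
    """One-hot encode a list of categorical values as 0/1 columns."""
    if categories is None:
        categories = sorted(set(values))
    # One column per category; 1 marks presence, 0 absence.
    return {c: [1 if v == c else 0 for v in values] for c in categories}

colors = ["red", "blue", "red", "green"]
dummies = to_dummies(colors)
```

Each observation has exactly one 1 across the dummy columns, which is why dummy coding lets categorical data enter numeric models such as regression.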

A small multiple is a data visualization that consists of multiple charts arranged in a grid. This makes it easy to compare the entirety of the data.


Logistic regression is a type of regression analysis used when the dependent variable is binary (i.e., has only two possible outcomes).


D-error is a measure that quantifies how good an experimental design is at extracting information from respondents. Variants include Bayesian D-error.

A correlation matrix is a handy way to visualize correlation coefficients between sets of variables. A correlation matrix is also used as an input for more advanced analyses.

Deep learning is a subset of machine learning. Like other machine-learning techniques, deep learning creates a mapping from input data to a target outcome.


Missing data can be structurally missing, missing completely at random, missing at random, or nonignorable (also known as missing not at random).

Alternatives to a random sample include quota samples, convenience samples, volunteer samples, purposive samples, and snowball/referral samples.

Factor analysis and principal component analysis identify patterns in the correlations between variables. They are used to identify underlying variables.


MaxDiff is a survey research technique for working out relative preferences for multiple items. It is also known as maximum difference or best-worst scaling.


A spurious correlation is when two variables falsely appear to be causally related, normally due to an unseen third factor.

Heteroscedasticity is a specific type of pattern in the residuals of a model where the variability for a subset of the residuals is much larger.


The replication crisis is the growing belief that many scientific studies cannot be reproduced. This could imply that significant theories are wrong.

Statistics is a mathematical field which deals with quantitative data. Data science is a multidisciplinary field which deals with data in a range of forms.


Data stacking is a way of organising data to find anomalies. Data stacking involves splitting a data set into smaller files and stacking the values for each variable.

Residuals in statistics or machine learning are the differences between observed data values and predicted data values. They are also known as errors.

Metadata is data about data. This refers to not the data itself, but rather to any information that describes some aspect of the data.


A decision tree is a diagram that shows how to make a prediction based on a series of questions. The responses determine which branch is followed next.

Random sampling is the selection of observations from a population by chance. It can take several different forms, and there are alternatives to it.

Functional data analysis is a collection of methods for analyzing data over a curve, surface or continuum.

Research is reproducible when the exact results of a study can be reproduced given the original code, data and software.

Research is replicable when an independent study, using new data, reaches the same conclusions as the original study.

Missing data occurs when expected values are absent from a dataset. It is a problem because it can bias results, and there are several ways to handle it.

Rebasing involves modifying a calculation by changing the sample (base) used in the calculation. Rebasing is commonly performed to remove ambiguous responses from a calculation.

String splitting is the process of breaking up a text string in a systematic way, so that the individual parts of the text can be processed. For example, a timestamp can be split into its date and time parts.

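A sketch of splitting a timestamp string (an invented value) into parts that can then be processed separately:

```python
# Hypothetical timestamp, for illustration.
timestamp = "2023-07-14 09:30:15"

# Split systematically: first on the space, then on the internal separators.
date_part, time_part = timestamp.split(" ")
year, month, day = date_part.split("-")
hour, minute, second = time_part.split(":")
```

Once split, each component can be converted, validated, or analyzed independently, e.g. grouping records by `month` or `hour`.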

Correlation is usually defined as a measure of the linear relationship between two quantitative variables (e.g., height and weight). Often a slightly looser definition is used.

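The Pearson correlation coefficient can be computed from first principles. A sketch with invented height and weight data that happen to be perfectly linearly related:

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance scaled by the two standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Made-up data: weight increases exactly linearly with height here,
# so the correlation is 1 (real data would fall somewhere in [-1, 1]).
heights = [150, 160, 170, 180]
weights = [52, 60, 68, 76]
r = pearson_r(heights, weights)
```

Values near +1 or -1 indicate a strong linear relationship; values near 0 indicate little or no linear relationship.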

Data sorting is any process that involves arranging the data into some meaningful order to make it easier to understand, analyze or visualize.

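A sketch of sorting made-up sales records into a meaningful order so the largest values are easiest to see:

```python
# Invented sales records.
sales = [
    {"product": "B", "revenue": 120},
    {"product": "A", "revenue": 300},
    {"product": "C", "revenue": 80},
]

# Sort by revenue, highest first; sorted() returns a new list.
by_revenue = sorted(sales, key=lambda r: r["revenue"], reverse=True)
top_product = by_revenue[0]["product"]
```

Because `sorted()` leaves the original list untouched, the same data can be re-sorted by a different key for another view.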

A random forest is an ensemble of decision trees. Like other machine-learning techniques, random forests use training data to learn to make predictions.


Selection bias is a systematic error that arises when a sample is not selected at random. It has several common sources, and there are ways to avoid it.

A distance matrix is a table that shows the distance between pairs of objects.

Raw data typically refers to data tables where rows contain observations and columns represent variables that describe some property of each observation.

A p-value is a quantitative summary of the evidence for or against a hypothesis of interest. It is computed using a statistical test.

Hierarchical clustering is a clustering technique that builds a hierarchy of nested clusters. It has distinctive strengths and weaknesses compared with alternative methods.

Market segmentation typically involves forming groups of similar people. Segmentation variables are the characteristics used to determine whether people are similar.

When you have a series of numbers where values can be predicted based on preceding values in the series, the series is said to exhibit autocorrelation.

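Lag-k autocorrelation can be estimated by correlating a series with a shifted copy of itself. A sketch with a simple trending series (invented values):

```python
def autocorrelation(series, lag=1):
    """Lag-k autocorrelation of a numeric series."""
    n = len(series)
    mean = sum(series) / n
    # Cross-products of deviations k steps apart, scaled by total variance.
    num = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    den = sum((v - mean) ** 2 for v in series)
    return num / den

# A steadily increasing series: each value is predictable from the last.
trend = [1, 2, 3, 4, 5, 6, 7, 8]
r1 = autocorrelation(trend, lag=1)
```

The strongly positive lag-1 value reflects that neighbouring values in a trending series move together; a series of independent noise would give a value near zero.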