Understanding: Take An Interaction Out Of A Model

Whenever you are building a model, one of the most important decisions you need to make is which interaction terms to include.

As a rule of thumb, the default in regression is to leave interactions out. This means that you should only add an interaction when you have a solid reason. If you try to add all possible interactions, it will look like you are simply fishing in the data. In ANOVA models, however, the opposite default is very common: most people add all possible interactions and only take them out when there is a solid reason.

While we believe that our approach is better, we’re not actually discussing that in this article. Instead, the main goal is to explain what it really means when an interaction is or is not in a model. 

In order to fully understand, we believe that there’s nothing better than an example. 

Imagine that you have a model of the height of a shrub (Height) based on the amount of bacteria in the soil (Bacteria) and whether the shrub is located in partial or full sun (Sun).

Height is measured in cm, Bacteria is measured in thousands per ml of soil, and Sun = 0 if the plant is in partial sun and Sun = 1 if the plant is in full sun.

Here’s the model without an interaction term: 

Height = 42 + 2.3*Bacteria + 11*Sun

[Figure: model without an interaction term]

And here is the model with one: 

Height = 35 + 1.2*Bacteria + 9*Sun + 3.2*Bacteria*Sun

[Figure: model with an interaction term]

If you take a closer look at the two previous images, you can easily draw some conclusions:

1. Adding the interaction allows the effect (slope) of Bacteria to differ in the two Sun conditions.

2. It also allows the effect (mean difference) in the Sun condition (the height of the orange lines) to differ at different values of Bacteria.

3. The interaction coefficient itself (3.2) estimates this difference in effect for one predictor, depending on the value of the other.

If it turns out that the best estimate of the interaction coefficient was zero, not 3.2, then the bottom graph would have looked like the top one.
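To make these three points concrete, here is a minimal Python sketch that simply plugs values into the two fitted equations from this example. The function names are ours and purely illustrative; the coefficients are the ones given above.

```python
def height_no_interaction(bacteria, sun):
    """Fitted model without the interaction term."""
    return 42 + 2.3 * bacteria + 11 * sun


def height_with_interaction(bacteria, sun):
    """Fitted model with the Bacteria*Sun interaction term."""
    return 35 + 1.2 * bacteria + 9 * sun + 3.2 * bacteria * sun


def bacteria_slope(model, sun):
    """Change in predicted Height for a one-unit increase in Bacteria."""
    return model(1, sun) - model(0, sun)


def sun_effect(model, bacteria):
    """Predicted Height difference between full sun and partial sun."""
    return model(bacteria, 1) - model(bacteria, 0)


for name, model in [("without", height_no_interaction),
                    ("with", height_with_interaction)]:
    print(name, "interaction")
    print("  Bacteria slope, partial sun:", round(bacteria_slope(model, 0), 2))
    print("  Bacteria slope, full sun:   ", round(bacteria_slope(model, 1), 2))
    print("  Sun effect at Bacteria = 0: ", round(sun_effect(model, 0), 2))
    print("  Sun effect at Bacteria = 5: ", round(sun_effect(model, 5), 2))
```

Without the interaction, the Bacteria slope is 2.3 in both Sun conditions and the Sun effect is 11 at every value of Bacteria. With it, the slope is 1.2 in partial sun and 1.2 + 3.2 = 4.4 in full sun, and the Sun effect grows from 9 at Bacteria = 0 to 9 + 3.2*5 = 25 at Bacteria = 5.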

But there is also a counter-intuitive part. 

When you include an interaction in a model, you’re estimating that coefficient from the data. Sometimes the estimate will be zero, or very close to zero, and sometimes not.

When you don’t specify an interaction term, the difference in effects doesn’t just go away. Whatever difference exists in the data is still there. By leaving the term out, you are simply forcing its coefficient to be zero.

This is, incidentally, the same issue as removing an intercept from a model when it theoretically should be zero or is not statistically different from zero.

In the case of the intercept, the general consensus is that removing it will lead to unacceptable bias and poor fit. It’s a rare situation where removal is worthwhile.

Technically, the same is true for interactions, but they are generally held to a different standard. Why? First, interactions are usually more exploratory. Second, they add more complexity to a model.

As with all issues of model complexity, sometimes the better coefficient estimates and model fit are worth it and sometimes they aren’t. Whether the complexity is worth it depends on the hypotheses, the sample size, and the purpose of the model.


Statistics Basics – What You Need To Know

If you just decided to start studying statistics, then you need to know that there are some statistics basics that you need to be aware of. These basics give you the foundation of this new area and will be helpful as you’re getting started.

The reality is that statistics is a powerful tool when you are doing data analysis. Simple charts and graphs can already give you a lot of information: a simple bar chart, for example, delivers a high-level view of the data. With statistics, however, you can work in a more information-driven and targeted way. Ultimately, math helps you get to concrete answers and conclusions to the question that you are studying.

With statistics, you get a deeper insight into how your data is structured and then, based on that structure, how you can apply different techniques to get even more information. 

So, now that you already know that you need to learn some statistics basics, it is time to get started. 

Basics Of Statistics

One of the things that you need to understand about basics in statistics is that there are many different concepts and formulas that you need to always keep in mind. But don’t worry because we are going to take a look at each one of them. 

Statistics Basic Concepts

There’s no question that when you are looking at data that you collected and trying to get more information out of it, you will need to use some statistical features such as bias, variance, mean, median, percentiles, among others. 

[Figure: basic statistics concepts (median, quartiles, min and max)]

If you take a look at the above image, you will see different statistics basic concepts that are important when you are learning statistics. 

As you can see, the line in the middle is the median value of the data. The median is often used instead of the mean because it is more robust to outlier values. Then, you can see the quartiles. The first quartile is the 25th percentile, which means that 25% of the points in the data fall below that value. On the opposite side, you have the third quartile, which is the 75th percentile. This means that 75% of the points in the data fall below that value. Last but not least, you can also see the min and max values, which represent the lower and upper ends of the data range.
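As a small illustration of these concepts, the sketch below uses Python’s standard statistics module to compute the quartiles, minimum, and maximum of a made-up data set. The data values are hypothetical, and the "inclusive" interpolation method is only one of several conventions for computing percentiles, so other tools may return slightly different quartiles.

```python
import statistics

# Hypothetical data set, used only for illustration.
data = [1, 3, 4, 8, 10, 12, 20, 25, 30]

# n=4 splits the data into quarters and returns the three cut points:
# first quartile (25th percentile), median, third quartile (75th percentile).
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")

print("min:", min(data))
print("first quartile:", q1)
print("median:", median)
print("third quartile:", q3)
print("max:", max(data))
```

With nine values and this method, the three cut points land exactly on data values: 4, 10, and 20.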

Besides these simple basics in statistics, there is another set of concepts known as the three Ms:

#1: Mean: 

The mean is just the average result of an experiment, test, survey, or quiz. So, how can you calculate it?

Here’s an example. Let’s say that you discovered the heights of 5 different people: 5 feet 6 inches, 5 feet 7 inches, 5 feet 10 inches, 5 feet 8 inches, 5 feet 8 inches.

In order to determine the mean, you need to sum up all the heights and then divide the sum total by the number of heights that you discovered. So, in this case: 

Mean = (5 feet 6 inches + 5 feet 7 inches + 5 feet 10 inches + 5 feet 8 inches + 5 feet 8 inches) / 5

Mean = 339 inches / 5

Mean = 67.8 inches or 5 feet 7.8 inches
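The same calculation can be sketched in a few lines of Python. This is only an illustration of the arithmetic above; the variable names are ours.

```python
# Heights from the example, as (feet, inches) pairs.
heights = [(5, 6), (5, 7), (5, 10), (5, 8), (5, 8)]

# Convert everything to inches so the values can be added directly.
inches = [ft * 12 + inch for ft, inch in heights]

total = sum(inches)            # sum of all heights, in inches
mean = total / len(inches)     # divide by the number of heights

# Convert the mean back to feet and inches for display.
whole_feet, extra_inches = divmod(mean, 12)

print("sum in inches:", total)
print("mean in inches:", mean)
print("mean:", int(whole_feet), "feet", round(extra_inches, 1), "inches")
```

The sum comes out to 339 inches and the mean to 67.8 inches, matching the result above.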

#2: Median:

The median is the middle value of your data. So, as you can imagine, you need to calculate the median differently depending on whether you have an odd number of values or an even number of values. Let’s take a look at each one of these cases:

  • Odd Number Of Values: 

Let’s take the previous example we used to calculate the mean above. In case you don’t remember, you had collected the heights of 5 people: 5 feet 6 inches, 5 feet 7 inches, 5 feet 10 inches, 5 feet 8 inches, 5 feet 8 inches.

To calculate the median, you need to order the numbers from the smallest to the largest first: 

5 feet 6 inches, 5 feet 7 inches, 5 feet 8 inches, 5 feet 8 inches, 5 feet 10 inches

As you can see, the value in the middle is 5 feet 8 inches which is also the median. 

  • Even Number Of Values:

Let’s take another example of data. Imagine that you got the following values when you collected your data: 7, 2, 43, 16, 11, 5.

The first thing that you need to do to determine the median is to, again, line up the values in order from the smallest to the largest:

2, 5, 7, 11, 16, 43

And now you have 2 values in the middle – 7 and 11. To determine the median, you will need to calculate the mean between these two values: 

Median = (7 + 11) / 2 

Median = 9
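Both cases can be checked with Python’s standard statistics module. In this short sketch the five heights are written in inches (5 feet 6 inches = 66 inches, and so on), and the variable names are ours.

```python
import statistics

# Odd number of values: the five heights, converted to inches.
heights_in = [66, 67, 70, 68, 68]
print(sorted(heights_in))
print(statistics.median(heights_in))   # the single middle value after sorting

# Even number of values: the mean of the two middle values.
values = [7, 2, 43, 16, 11, 5]
print(sorted(values))
print(statistics.median(values))
```

The first median is 68 inches, which is 5 feet 8 inches, and the second is 9, as calculated by hand above.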

#3: Mode: 

The mode is just the most common result that appears in your data set. Let’s use the same heights’ example once again: 5 feet 6 inches, 5 feet 7 inches, 5 feet 10 inches, 5 feet 8 inches, 5 feet 8 inches.

So, to determine the mode, you can put these values in order to make it easier to find the most common value in the data:

5 feet 6 inches, 5 feet 7 inches, 5 feet 8 inches, 5 feet 8 inches, 5 feet 10 inches

As you can see, the only value that repeats is 5 feet 8 inches, which occurs two times, so the mode is 5 feet 8 inches.
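Counting how often each value occurs is easy to sketch in Python with collections.Counter. The string labels below are just our shorthand for the heights in the example.

```python
from collections import Counter

heights = ["5 ft 6 in", "5 ft 7 in", "5 ft 10 in", "5 ft 8 in", "5 ft 8 in"]

# Count the occurrences of each value.
counts = Counter(heights)

# most_common(1) returns a list holding the (value, count) pair with the highest count.
mode_value, mode_count = counts.most_common(1)[0]
print(mode_value, "occurs", mode_count, "times")
```

Keep in mind that a data set can have more than one mode. When several values tie for the highest count, this sketch reports only one of them.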


Variance

In statistics, another important concept that you need to understand is variance. Simply put, variance is just the spread of a data set. So, you can say that it is a measurement that is used to identify how far each number in the data set is from the mean. 

Variance is especially important when you want to calculate probabilities of future events, because it tells you how widely the values of a random variable are likely to scatter around the mean.

Some of the implications of the variance concept that you should keep in mind include:

  • The larger the variance, the more spread there is in the data set. 
  • A large variance means that there are more values far from the mean and far from each other. 
  • A small variance means that the values in your data set are closer together in value.
  • When you have a variance that is equal to zero, this means that all of the values within your data set are identical. 
  • All variances that are not equal to zero are positive numbers. 

Now that you understand what variance is, you need to know how to calculate it. Simply put, you take the difference between each number in the data set and the mean, square each difference to make it a positive number, add up all the squared differences, and then divide the total by the number of values in your data set. Here’s the formula:

Variance = Σ (X - u)² / N

Where:

X = individual data value

u = the mean of the values

N = total number of data values in your data set.

It is worth noting that when you are calculating a sample variance to estimate a population variance, the denominator of the variance equation becomes N – 1. This removes bias from the estimate: dividing by N would, on average, underestimate the population variance.
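Here is a minimal Python sketch of both versions of the calculation. As an illustration we reuse the five heights from the mean example, written in inches; the variable names are ours. The last line compares the results with the standard statistics module.

```python
import statistics

# The five heights from the mean example, in inches.
heights_in = [66, 67, 70, 68, 68]

n = len(heights_in)
mean = sum(heights_in) / n

# Squared difference between each value and the mean.
squared_diffs = [(x - mean) ** 2 for x in heights_in]

population_variance = sum(squared_diffs) / n        # divide by N
sample_variance = sum(squared_diffs) / (n - 1)      # divide by N - 1

print(round(population_variance, 2), round(sample_variance, 2))

# The standard library provides both versions directly.
print(statistics.pvariance(heights_in), statistics.variance(heights_in))
```

For these heights the squared differences add up to 8.8, so dividing by N = 5 gives 1.76 and dividing by N - 1 = 4 gives 2.2. The sample version is always the larger of the two.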

One of the main advantages of variance is the fact that it treats all deviations from the mean of the data set in the same way, no matter the direction. The main disadvantage of using the variance is the fact that it gives added weight to values that are far from the mean (outliers). Because the deviations are squared, a few extreme values can dominate the result and skew your interpretation of the data set as a whole.


Covariance

Among the statistics basics formulas, there is another one that is especially important – covariance. But before we get to the formula, you need to understand what covariance is. 

Simply put, covariance shows you how two variables are related to one another. So, more technically, covariance refers to the measure of how two random variables in a data set will change together. 

  • When you have a positive covariance, this means that the two variables are positively related which is the same as saying that they move in the same direction. 
  • When you have a negative covariance, this means that the two variables are inversely related which is the same as saying that they move in opposite directions. 

Here’s the covariance formula:

[Figure: covariance formula]

Where:

X = the independent variable

Y = the dependent variable

N = the number of data points in the sample

X-Bar = the mean of the independent variable X

Y-Bar = the mean of the dependent variable Y
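To see how these pieces fit together, here is a hedged Python sketch of a covariance calculation. It assumes the usual definition: multiply each X deviation from X-Bar by the matching Y deviation from Y-Bar, add up the products, and divide. Whether you divide by N or by N - 1 depends on the convention you follow, so the function below (our own helper, with made-up data) supports both.

```python
def covariance(xs, ys, sample=False):
    """Covariance of two equal-length sequences.

    Divides by N by default; pass sample=True to divide by N - 1.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of the products of paired deviations from the two means.
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)


# Hypothetical data: y_up rises with x, y_down falls as x rises.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]
y_down = [10, 8, 6, 4, 2]

print(covariance(x, y_up))      # positive: the variables move together
print(covariance(x, y_down))    # negative: the variables move in opposite directions
print(covariance(x, y_up, sample=True))
```

The sign of the result is what tells you the direction of the relationship. The size of the number depends on the units of X and Y, so it is hard to interpret on its own.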

Bottom Line

As you can see, these basic statistical terms are very simple to understand and you shouldn’t have any difficulties in putting them into practice. However, these basic concepts and formulas are extremely important since they are the basis of statistics. 


Differences Between Explanatory Models And Predictive Models

As a statistics student or as a researcher, you know that sometimes you are asked to create a specific model. Let’s say that you are asked to create a model that predicts who will drop out of college in a specific year. So, you decide to use a binary logistic regression. After all, you know that your outcome will only take two values: 0 for not dropping out and 1 for dropping out.

The truth is that no matter if you are a student or already a researcher, you were trained to build models with the purpose of discovering and understanding the relationships that may exist between an outcome and a set of predictors. However, what you may not know is that model building works differently for predictive models. So, how should you approach this situation?

This is what we are about to discover today by looking into explanatory models and predictive models and stating their differences. 

Explanatory Models

When you are using explanatory models, you are looking to identify variables that have a scientifically meaningful and statistically significant relationship with an outcome. 

So, your main goal is to test theoretical hypotheses, with an emphasis on both theoretically meaningful relationships and determining whether each relationship is statistically significant.

Some of the steps in explanatory models include fitting potentially theoretically important predictors, checking for statistical significance, evaluating effect sizes, and running diagnostics.

Predictive Models

When you pick a predictive model, your main goal is different. In this case, your goal is to use the relationships between predictors and the outcome variable to generate good predictions for future outcomes. With this in mind, you can easily understand that predictive models are created in a very different way than explanatory models. After all, in this case, you are looking for predictive accuracy. 

Variables in a predictive model are chosen based on their association with the outcome, and not on statistical significance or scientific meaning.

There are times when statistically significant variables will not be included in a predictive model. A significant predictor that adds no predictive benefit is excluded.

If the predictor is significant but only observable immediately before or at the time of the observed outcome, it cannot be used for predictions.

For example, theoretical models have shown that water temperatures are a highly significant factor in determining whether a tropical storm turns into a hurricane. That variable is not useful in a prediction model of the expected number of hurricanes during the upcoming season because it can only be measured immediately before an impending hurricane.

That’s too late.

One of the things to keep in mind when you are using predictive models is that you should always explore. Transforming a continuous predictor, for example by squaring it or taking its square root, is one approach. The primary limitation on including a predictor in the model is whether it will be available when the model is run in the future.

The primary risk when creating a predictive model is overfitting, which is the result of creating a model that fits the current sample so closely that it may not be a good representation of the population. So, how can you decrease this risk?

A simple way to do this is to use only half of your data to create your model. Then test your model on the other half.
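Here is a minimal Python sketch of this split-half idea. To keep it short and self-contained, it uses made-up data and a simple straight-line model fitted by least squares rather than the logistic regression from the dropout example. All names and numbers in it are hypothetical.

```python
import random

rng = random.Random(0)  # fixed seed so the split is reproducible

# Hypothetical data: one predictor x and an outcome y = 1 + 2x + noise.
data = [(x, 1 + 2 * x + rng.gauss(0, 1)) for x in range(40)]

# Randomly split the data into two halves.
rng.shuffle(data)
half = len(data) // 2
train, test = data[:half], data[half:]


def fit_line(points):
    """Least squares fit for a single predictor: returns (intercept, slope)."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def mean_squared_error(points, intercept, slope):
    """Average squared prediction error on the given points."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in points) / len(points)


# Build the model on one half, then check it on the other half.
intercept, slope = fit_line(train)
train_mse = mean_squared_error(train, intercept, slope)
test_mse = mean_squared_error(test, intercept, slope)

print("train MSE:", round(train_mse, 3))
print("test MSE: ", round(test_mse, 3))
```

If the error on the held-out half is much larger than the error on the half used to build the model, that is a sign that the model is overfitting the sample.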