Hypothesis Tests Every Data Scientist Should Know
Hypothesis tests are crucial for validating assumptions about data, offering a quantifiable measure of the strength of evidence against those assumptions. These tests should not be viewed as definitive proofs; instead, they serve as tools for decision-making and evaluating evidence under uncertainty.
In this blog post, I will highlight the following tests:
t-Test
Chi-Square Test
One-way ANOVA
t-Test
The t-test is used to compare means between two groups. There are three common types of t-tests:
One Sample t-Test
Independent Samples t-Test
Paired Samples t-Test
One Sample t-Test
Purpose: Compare the mean of a single sample to a known or hypothesized population mean.
Null Hypothesis (H0): The sample mean equals the population mean.
Alternative Hypothesis: The sample mean differs from the population mean.
```python
from scipy.stats import ttest_1samp

# Sample data
data = [12.9, 10.3, 11.2, 13.8, 9.6, 12.3, 14.1, 11.7, 10.9, 12.5]

# Hypothesized population mean
population_mean = 12.0

# Perform one-sample t-test
t_statistic, p_value = ttest_1samp(data, population_mean)

# Print results
print("T-statistic:", t_statistic)
print("P-value:", p_value)

# Interpretation
alpha = 0.05  # Significance level
if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")
```
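It can help to see what `ttest_1samp` computes under the hood. The sketch below (variable names are my own) recomputes the t-statistic from its definition, t = (sample mean − μ) / (s / √n), with s the sample standard deviation, and cross-checks it against SciPy:

```python
import numpy as np
from scipy.stats import ttest_1samp

data = [12.9, 10.3, 11.2, 13.8, 9.6, 12.3, 14.1, 11.7, 10.9, 12.5]
population_mean = 12.0

# Manual computation: t = (sample mean - mu) / (s / sqrt(n)),
# using the sample standard deviation (ddof=1)
n = len(data)
sample_mean = np.mean(data)
sample_std = np.std(data, ddof=1)
t_manual = (sample_mean - population_mean) / (sample_std / np.sqrt(n))

# Cross-check against SciPy's result
t_scipy, _ = ttest_1samp(data, population_mean)
print("Manual t:", t_manual)
print("SciPy t: ", t_scipy)
```

The two values agree, which is a useful sanity check when you are learning what the test statistic actually measures.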
Independent Samples t-Test
Purpose: Compare means between two independent groups.
Null Hypothesis (H0): There is no difference between the means.
Alternative Hypothesis: The means of the groups differ.
```python
from scipy.stats import ttest_ind

# Example data
group1 = [5.1, 4.8, 6.3, 5.5, 5.7]
group2 = [7.2, 6.9, 7.8, 7.4, 7.0]

# Perform independent t-test
t_stat, p_value = ttest_ind(group1, group2)

print("t-statistic:", t_stat)
print("p-value:", p_value)

# Decision
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")
```
Paired Samples t-Test
Purpose: Compare the means of two related groups. The test is appropriate when you have "before-and-after" measurements or when the same subjects are measured under two conditions.
Null Hypothesis: The mean difference is zero.
Alternative hypothesis: The mean difference is not zero.
```python
from scipy.stats import ttest_rel

# Example data (pre-test and post-test scores)
pre_test = [85, 89, 78, 92, 88, 76, 95, 91]
post_test = [88, 90, 80, 94, 86, 79, 97, 93]

# Perform paired t-test
t_stat, p_value = ttest_rel(pre_test, post_test)

print("t-statistic:", t_stat)
print("p-value:", p_value)

# Decision
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")
```
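A useful way to understand the paired test: it is equivalent to a one-sample t-test on the per-subject differences against zero. A quick sketch using the same example scores:

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp

pre_test = [85, 89, 78, 92, 88, 76, 95, 91]
post_test = [88, 90, 80, 94, 86, 79, 97, 93]

# The paired t-test is a one-sample t-test on the per-subject differences
differences = np.array(pre_test) - np.array(post_test)
t_diff, p_diff = ttest_1samp(differences, 0.0)

# Compare with the paired test on the raw scores
t_rel, p_rel = ttest_rel(pre_test, post_test)
print("One-sample on differences:", t_diff, p_diff)
print("Paired t-test:            ", t_rel, p_rel)
```

Both calls produce identical statistics, which is why the null hypothesis is stated in terms of the mean difference.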
Chi-Square Test
Here, I am highlighting the chi-square test of independence, which is commonly used to test whether two categorical variables are related.
It can answer questions like:
Does gender influence the preference for a particular type of chocolate?
How does bike type preference vary among different age groups?
Consider the following contingency table:

| Group\Category | Category A | Category B | Category C |
| --- | --- | --- | --- |
| Group A | 10 | 20 | 30 |
| Group B | 6 | 9 | 17 |
Variables: Group and Category
Null Hypothesis: The two variables are independent (no relationship exists).
Alternative Hypothesis: The two variables are not independent (there is an association).
Degrees of freedom: the number of cell counts that are free to vary once the row and column totals are fixed. For a contingency table, it is computed as:
$$df = (\text{number of rows} - 1) \times (\text{number of columns} - 1)$$
```python
import numpy as np
from scipy.stats import chi2_contingency

# Define the contingency table
observed = np.array([[10, 20, 30],
                     [6, 9, 17]])

# Perform the chi-square test
chi2, p, dof, expected = chi2_contingency(observed)

# Print the results
print("Chi-Square Statistic:", chi2)
print("P-Value:", p)
print("Degrees of Freedom:", dof)
print("Expected Frequencies:")
print(expected)

# Decision
alpha = 0.05
if p < alpha:
    print("There is an association between the two variables.")
else:
    print("No evidence of an association; the variables appear independent.")
```
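The expected frequencies that `chi2_contingency` reports come from the independence assumption itself: if the variables were independent, each cell would contain (row total × column total) / grand total. The sketch below recomputes them and the test statistic by hand to show where the numbers come from:

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[10, 20, 30],
                     [6, 9, 17]])

# Expected count for each cell = (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 3)
expected_manual = row_totals * col_totals / observed.sum()

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

# Cross-check against SciPy (no continuity correction is applied here,
# since the table has more than one degree of freedom)
chi2, p, dof, expected = chi2_contingency(observed)
print("Manual chi2:", chi2_manual)
print("SciPy chi2: ", chi2)
```

The manual values match SciPy's, including the expected-frequency matrix printed earlier.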
One-Way ANOVA
While a t-test compares means between two groups, one-way ANOVA can compare means across more than two groups.
Null Hypothesis: All group means are equal
Alternative Hypothesis: At least one group mean is different
```python
from scipy.stats import f_oneway

# Example data: Scores from three groups
group1 = [85, 90, 88, 92, 85]
group2 = [78, 82, 79, 81, 80]
group3 = [89, 91, 93, 94, 92]

# Perform One-Way ANOVA
stat, p_value = f_oneway(group1, group2, group3)

# Print results
print("F-Statistic:", stat)
print("P-Value:", p_value)

# Decision
alpha = 0.05  # Significance level
if p_value < alpha:
    print("At least one group mean is different.")
else:
    print("Fail to reject the null hypothesis; no evidence the group means differ.")
```
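A significant ANOVA result says only that at least one mean differs, not which one. One common follow-up, sketched here with pairwise t-tests and a Bonferroni correction (other post-hoc procedures, such as Tukey's HSD, exist; the dictionary layout is my own):

```python
from itertools import combinations
from scipy.stats import ttest_ind

groups = {
    "group1": [85, 90, 88, 92, 85],
    "group2": [78, 82, 79, 81, 80],
    "group3": [89, 91, 93, 94, 92],
}

alpha = 0.05
pairs = list(combinations(groups, 2))

# Bonferroni correction: divide alpha by the number of comparisons,
# controlling the overall chance of a false positive across all pairs
adjusted_alpha = alpha / len(pairs)

for name_a, name_b in pairs:
    t_stat, p_value = ttest_ind(groups[name_a], groups[name_b])
    verdict = "differ" if p_value < adjusted_alpha else "no evidence of difference"
    print(f"{name_a} vs {name_b}: p = {p_value:.4f} -> {verdict}")
```

Without such a correction, running many pairwise tests at alpha = 0.05 inflates the overall false-positive rate, which is the reason ANOVA is used as a first step rather than testing every pair directly.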