Hypothesis Tests Every Data Scientist Should Know


Hypothesis tests are crucial for validating assumptions about data, quantifying how consistent the observed data are with those assumptions. These tests should not be viewed as definitive proofs; instead, they serve as tools for decision-making and for evaluating evidence under uncertainty.

In this blog post, I will highlight the following tests:

  • t-Test

  • Chi-Square Test

  • One-way ANOVA

t-Test


The t-test is used to compare means between two groups. There are three common types:

  1. One Sample t-Test

  2. Independent Samples t-Test

  3. Paired Samples t-Test

One Sample t-Test

Purpose: Compare the mean of a single sample to a known or hypothesized population mean.

Null Hypothesis (H0): The sample mean equals the population mean.

Alternative Hypothesis (H1): The sample mean differs from the population mean.

import numpy as np
from scipy.stats import ttest_1samp

# Sample data
data = [12.9, 10.3, 11.2, 13.8, 9.6, 12.3, 14.1, 11.7, 10.9, 12.5]

# Hypothesized population mean
population_mean = 12.0

# Perform one-sample t-test
t_statistic, p_value = ttest_1samp(data, population_mean)

# Print results
print("T-statistic:", t_statistic)
print("P-value:", p_value)

# Interpretation
alpha = 0.05  # Significance level
if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")
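By default, `ttest_1samp` runs a two-sided test. If you only care about a deviation in one direction, recent SciPy versions (1.6+) accept an `alternative` argument; a minimal sketch using the same sample:

```python
from scipy.stats import ttest_1samp

data = [12.9, 10.3, 11.2, 13.8, 9.6, 12.3, 14.1, 11.7, 10.9, 12.5]

# One-sided test: H1 is that the true mean is GREATER than 12.0
t_stat, p_one_sided = ttest_1samp(data, 12.0, alternative='greater')

print("t-statistic:", t_stat)
print("one-sided p-value:", p_one_sided)
```

Since the sample mean here (11.93) is below 12.0, the one-sided p-value against 'greater' will be large; use `alternative='less'` to test the opposite direction.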

Independent Samples t-Test

Purpose: Compare means between two independent groups.

Null Hypothesis (H0): There is no difference between the means.

Alternative Hypothesis (H1): The means of the two groups differ.

from scipy.stats import ttest_ind

# Example data
group1 = [5.1, 4.8, 6.3, 5.5, 5.7]
group2 = [7.2, 6.9, 7.8, 7.4, 7.0]

# Perform independent t-test (Student's t-test; assumes equal variances by default)
t_stat, p_value = ttest_ind(group1, group2)

print("t-statistic:", t_stat)
print("p-value:", p_value)

# Decision
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
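`ttest_ind` assumes equal population variances by default (the classic Student's test). When that assumption is doubtful, Welch's t-test is the usual alternative; a sketch with the same data:

```python
from scipy.stats import ttest_ind

group1 = [5.1, 4.8, 6.3, 5.5, 5.7]
group2 = [7.2, 6.9, 7.8, 7.4, 7.0]

# equal_var=False switches to Welch's t-test, which does not
# assume the two groups share the same variance
t_stat, p_value = ttest_ind(group1, group2, equal_var=False)

print("Welch t-statistic:", t_stat)
print("p-value:", p_value)
```

With well-separated means like these, both versions reject the null; they diverge mainly when group variances (or sizes) differ substantially.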

Paired Samples t-Test

Purpose: Compare the means of two related groups. The test is appropriate when you have "before-and-after" measurements or when the same subjects are measured under two conditions.

Null Hypothesis (H0): The mean difference is zero.

Alternative Hypothesis (H1): The mean difference is not zero.

from scipy.stats import ttest_rel

# Example data (pre-test and post-test scores)
pre_test = [85, 89, 78, 92, 88, 76, 95, 91]
post_test = [88, 90, 80, 94, 86, 79, 97, 93]

# Perform paired t-test
t_stat, p_value = ttest_rel(pre_test, post_test)

print("t-statistic:", t_stat)
print("p-value:", p_value)

# Decision
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
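One way to build intuition for the paired test: it is equivalent to a one-sample t-test on the per-subject differences against a hypothesized mean of zero. A quick check with the scores above:

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp

pre_test = [85, 89, 78, 92, 88, 76, 95, 91]
post_test = [88, 90, 80, 94, 86, 79, 97, 93]

# Per-subject improvement
diffs = np.array(post_test) - np.array(pre_test)

t_paired, p_paired = ttest_rel(pre_test, post_test)
t_diff, p_diff = ttest_1samp(diffs, 0.0)

# Same magnitude of t (the sign flips with the order of subtraction)
# and identical two-sided p-value
print(np.isclose(abs(t_paired), abs(t_diff)))
print(np.isclose(p_paired, p_diff))
```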

Chi-Square Test


Here, I am highlighting the chi-square test for independence. It is commonly used to test whether two categorical variables are associated.

It can answer questions like:

  1. Does gender influence the preference for a particular type of chocolate?

  2. How does bike type preference vary among different age groups?

Consider the following contingency table:

Group\Category   Category A   Category B   Category C
Group A          10           20           30
Group B          6            9            17

Variables: Group and Category

Null Hypothesis: The two variables are independent (no relationship exists).

Alternative Hypothesis: The two variables are not independent (there is an association).

Degrees of freedom: the number of cell counts that are free to vary once the row and column totals are fixed. For a contingency table:

$$df = (\text{Number of rows} - 1) \times (\text{Number of cols} - 1)$$

For the 2×3 table above, df = (2 − 1) × (3 − 1) = 2.

import numpy as np
from scipy.stats import chi2_contingency

# Define the contingency table
observed = np.array([[10, 20, 30],
                     [6, 9, 17]])

# Perform the chi-square test
chi2, p, dof, expected = chi2_contingency(observed)

# Print the results
print("Chi-Square Statistic:", chi2)
print("P-Value:", p)
print("Degrees of Freedom:", dof)
print("Expected Frequencies:")
print(expected)

alpha = 0.05

if p < alpha:
    print("Reject the null hypothesis: the variables appear to be associated.")
else:
    print("Fail to reject the null hypothesis: no evidence of association.")
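The expected frequencies returned by `chi2_contingency` can also be computed by hand: under independence, each cell's expected count is its row total times its column total, divided by the grand total. A quick verification:

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[10, 20, 30],
                     [6, 9, 17]])

# Expected count per cell: row_total * column_total / grand_total
row_totals = observed.sum(axis=1, keepdims=True)   # [[60], [32]]
col_totals = observed.sum(axis=0, keepdims=True)   # [[16, 29, 47]]
grand_total = observed.sum()                       # 92
expected_manual = row_totals * col_totals / grand_total

_, _, _, expected = chi2_contingency(observed)
print(np.allclose(expected_manual, expected))  # the two computations agree
```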

One-Way ANOVA


While a t-test compares means between two groups, a one-way ANOVA compares means across three or more groups.

Null Hypothesis (H0): All group means are equal.

Alternative Hypothesis (H1): At least one group mean is different.

import numpy as np
from scipy.stats import f_oneway

# Example data: Scores from three groups
group1 = [85, 90, 88, 92, 85]
group2 = [78, 82, 79, 81, 80]
group3 = [89, 91, 93, 94, 92]

# Perform One-Way ANOVA
stat, p_value = f_oneway(group1, group2, group3)

# Print results
print("F-Statistic:", stat)
print("P-Value:", p_value)

# Decision
alpha = 0.05  # Significance level
if p_value < alpha:
    print("Reject the null hypothesis: at least one group mean differs.")
else:
    print("Fail to reject the null hypothesis.")
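A significant ANOVA result only says that some group differs; it does not say which. A common follow-up is Tukey's HSD test for pairwise comparisons, available as `scipy.stats.tukey_hsd` in SciPy 1.8+; a sketch with the same groups:

```python
from scipy.stats import tukey_hsd

group1 = [85, 90, 88, 92, 85]
group2 = [78, 82, 79, 81, 80]
group3 = [89, 91, 93, 94, 92]

# Pairwise comparisons with family-wise error rate control
result = tukey_hsd(group1, group2, group3)
print(result)

# result.pvalue[i, j] is the adjusted p-value for groups i and j
print("group1 vs group2 p-value:", result.pvalue[0, 1])
```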