Title: Descriptive Statistics Functions for Numeric Data
Version: 0.1.2
Description: Provides fundamental functions for descriptive statistics, including MODE(), estimate_mode(), center_stats(), position_stats(), pct(), spread_stats(), kurt(), skew(), and shape_stats(), which assist in summarizing the center, spread, and shape of numeric data. For more details, see McCurdy (2025), "Introduction to Data Science with R" https://jonmccurdy.github.io/Introduction-to-Data-Science/.
License: MIT + file LICENSE
Encoding: UTF-8
RoxygenNote: 7.3.2
Depends: R (≥ 3.5)
LazyData: true
Suggests: roxygen2
NeedsCompilation: no
Packaged: 2025-07-20 21:01:46 UTC; lukepapayoanou
Author: Luke Papayoanou [aut], Jon McCurdy [aut, cre]
Maintainer: Jon McCurdy <j.r.mccurdy@msmary.edu>
Repository: CRAN
Date/Publication: 2025-07-22 11:01:57 UTC

MSMU: Fundamental Data Functions Package

Description

The MSMU package provides core functions for descriptive statistics and exploratory data analysis. It includes functions for computing central tendency, spread, shape, and position statistics, along with utility functions for estimating modes and standardized ranges. The package contains both functions and datasets, each documented in the entries that follow.

Author(s)

Luke Papayoanou, Jon McCurdy


Find the Mode of a Numeric Vector

Description

Calculates the mode (most frequent value) of a numeric vector. If there is a tie, returns all values that share the highest frequency.

Usage

MODE(x)

Arguments

x

A numeric vector.

Value

A numeric value (or vector) representing the mode(s) of x.

Examples

# Mode of a Numeric Vector
MODE(c(1,2,3,3,3,4,5,5,3,8))

# Mode of the number of cylinders in mtcars dataset
data("mtcars")
MODE(mtcars$cyl)
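
The tie-handling behavior described above can be illustrated with a minimal base R sketch. The helper name mode_sketch() is hypothetical and this is not the package's implementation of MODE(); dropping missing values is also an assumption made for the illustration.

```r
# Hypothetical helper, not the package's MODE(): count each distinct value
# and keep every value that reaches the highest count.
mode_sketch <- function(x) {
  x <- x[!is.na(x)]                    # assumption: missing values are ignored
  ux <- unique(x)                      # distinct values, in order of first appearance
  counts <- tabulate(match(x, ux))     # how often each distinct value occurs
  ux[counts == max(counts)]            # all values tied for the top frequency
}

mode_sketch(c(1, 2, 3, 3, 3, 4, 5, 5, 3, 8))  # 3 occurs most often
mode_sketch(c(1, 1, 2, 2, 3))                 # tie: both 1 and 2 are returned
```

Counting with match() and tabulate() keeps the values numeric, which avoids the character conversion that table() would introduce.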


Professional baseball teams data

Description

This dataset contains historical performance and statistics for professional baseball teams across multiple seasons from 2000 to 2020.

Usage

baseball_teams

Format

A data frame with 630 rows and 12 columns:

year

Year (integer)

team_name

Team (character)

games_played

Number of games played (integer)

wins

Number of wins (integer)

losses

Number of losses (integer)

world_series

World Series winner that specific year (character)

runs_scored

Number of total runs scored during season (integer)

hits

Number of total hits during season (integer)

homeruns

Number of total home runs during season (integer)

earned_run_average

Team earned run average per 9 innings (numeric)

fielding_percentage

Team fielding percentage (numeric)

home_attendance

Average home game attendance (integer)

Source

Data retrieved from Lahman's Baseball Database with alterations made for educational purposes.


College basketball data

Description

This dataset contains performance statistics for 363 men’s college basketball teams from the 2022-23 season.

Usage

basketball

Format

A data frame with 363 rows and 18 columns:

School

School (character)

State

State (character)

W

Wins (integer)

L

Losses (integer)

W.L.

Win-loss percentage (numeric)

SRS

Simple Rating System (numeric)

SOS

Strength of Schedule (numeric)

Points.Scored

Points scored (integer)

Points.Allowed

Points allowed (integer)

FG.

Team field goal percentage (numeric)

X3P.

Three point percentage (numeric)

FT.

Free throw percentage (numeric)

Rebounds

Number of rebounds (integer)

AST

Number of assists (integer)

STL

Number of steals (integer)

Blocks

Number of blocks (integer)

Turn.Overs

Number of turnovers (integer)

Fouls

Number of fouls (integer)

Source

Data retrieved from Sports Reference with alterations made for educational purposes.


Summary of Central Tendency

Description

Computes a variety of center statistics for a numeric vector, including: mean, median, trimmed means (10% and 25%), and estimated mode (via probability density function using estimate_mode()).

Usage

center_stats(x)

Arguments

x

A numeric vector.

Value

A named numeric vector with values for:

mean

Arithmetic mean

median

Median

trim25

25% trimmed mean

trim10

10% trimmed mean

est_mode

Estimated mode from estimate_mode()

See Also

estimate_mode

Examples

# Center Stats of continuous random data
set.seed(123)
x <- rnorm(1000, mean=50, sd=10)
center_stats(x)

# Center Stats of Sepal Length in iris data set
data("iris")
center_stats(iris$Sepal.Length)
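
As a rough illustration of the statistics listed above, the following base R sketch computes the mean, median, and trimmed means. The helper name center_sketch() is hypothetical, the estimated mode is left out, and it is an assumption that "25% trimmed" means trimming 25% of the observations from each end, which is how the trim argument of base R's mean() works.

```r
# Hypothetical helper, not the package's center_stats().
center_sketch <- function(x) {
  x <- x[!is.na(x)]                       # assumption: missing values are ignored
  c(mean   = mean(x),
    median = median(x),
    trim25 = mean(x, trim = 0.25),        # drops 25% of observations from each end
    trim10 = mean(x, trim = 0.10))        # drops 10% of observations from each end
}

# One large outlier pulls the mean upward but leaves the median unchanged
center_sketch(c(1, 2, 3, 4, 100))
```

Comparing the trimmed means with the plain mean gives a quick sense of how much extreme values influence the center.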


Christmas data

Description

Santa's dataset, exploring whether Santa gives children presents based on a variety of variables!

Usage

christmas

Format

A data frame with 1000 rows and 15 columns:

Gender

Gender (character)

Toy_Count

Number of toys (integer)

Chores_Completed

Number of chores completed (numeric)

Favorite_Color

Child's favorite color (character)

Helping_Hand

Child's helping hand number/score (integer)

Complaints_Received

Number of complaints the child makes (numeric)

Tantrum_Count

Number of tantrums the child has (integer)

Rule_Breaks

Number of rules the child breaks (numeric)

Sharing_Behavior

Child's willingness to share (numeric)

Hours_of_Sleep

Child's average hours of sleep per night (numeric)

Screen_Time

Child's average hours of screen time (numeric)

School_Grade

Child's school grade (numeric)

Parent_Presence

Child's parent presence (numeric)

Greed_Score

Santa's numeric system for labeling children's greed (numeric)

Outcome

Whether a child gets a present or coal (character)

Source

Santa


Class demographics

Description

A sample dataset representing demographic and academic information for 50 college students.

Usage

class_demographics

Format

A data frame with 50 rows and 6 columns:

names

Person's name (character)

ages

Person's age (integer)

state

Person's state (character)

year

Person's year in college (character)

majors

Person's major (character)

sport

Binary sport indicator: 1 (yes) or 0 (no) (integer)

Source

Synthetic Data


College data

Description

This dataset provides detailed information on 777 U.S. colleges and universities from 1995, covering aspects of admissions, academics, finances, and student demographics.

Usage

college_data

Format

A data frame with 777 rows and 16 columns:

Name

College name (character)

Region

US region (character)

Accept

Acceptance (integer)

Enroll

Enrollment (integer)

Top10perc

Percent of students who were in the top 10% of their high school class (integer)

Top25perc

Percent of students who were in the top 25% of their high school class (integer)

F.Undergrad

Full-time undergraduates (integer)

P.Undergrad

Part-time undergraduates (integer)

Outstate

Number of Out of state students (integer)

Room.Board

Annual room and board price (integer)

PhD

Percentage of Faculty with a PhD (integer)

Terminal

Percentage of Faculty with a terminal degree (integer)

S.F.Ratio

Student Faculty ratio (numeric)

perc.alumni

Percent of alumni who donate to the college (integer)

Expend

Instructional expenditure per student (integer)

Grad.Rate

Graduation Rate (integer)

Source

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. Adapted from the College data set in the ISLR library with alterations made for educational purposes.


County data

Description

Data for 3142 counties in the United States containing demographic, educational, economic, and technological statistics.

Usage

county_data

Format

A data frame with 3142 rows and 17 columns:

state

State (character)

name

County name (character)

fips

County level FIPS code (integer)

pop

County population (integer)

households

Number of households (integer)

median_age

Median age of people in county (numeric)

age_over_18

Percent of people over age 18 (numeric)

age_over_65

Percent of people over age 65 (numeric)

hs_grad

Percent of high school graduates (numeric)

bachelors

Percent of people with bachelor's degrees (numeric)

white

Percent of population that is white (numeric)

black

Percent of population that is black (numeric)

hispanic

Percent of population that is Hispanic (numeric)

household_has_smartphone

Percent of households who have a smartphone (numeric)

mean_household_income

Average household income (integer)

median_household_income

Median household income (integer)

unemployment_rate

Unemployment rate (numeric)

Source

Adapted from the county_complete data set in the usdata library with alterations made for educational purposes.


Course scores data

Description

This dataset contains academic performance records for 200 students across four years of high school, with scores or letter grades in English and Math.

Usage

course_scores

Format

A data frame with 200 rows and 10 columns:

student

Student ID (integer)

type

Grade type (character)

Freshman_English

Freshman English Score/letter grade (character)

Freshman_Math

Freshman Math Score/letter grade (character)

Sophomore_English

Sophomore English Score/letter grade (character)

Sophomore_Math

Sophomore Math Score/letter grade (character)

Junior_English

Junior English Score/letter grade (character)

Junior_Math

Junior Math Score/letter grade (character)

Senior_English

Senior English Score/letter grade (character)

Senior_Math

Senior Math Score/letter grade (character)

Source

Synthetic Data


Synthetic Census dataset

Description

A synthetic dataset containing demographic and socioeconomic information for 1,000 individuals.

Usage

data_210_census

Format

A data frame with 1000 rows and 5 columns:

age

Person's age (integer)

gender

Person's gender (character)

degree

Person's level of education (character)

salary

Person's yearly salary (integer)

height

Person's height in inches (integer)

Source

Synthetic Data


2020 election data

Description

Dataset providing detailed results from the 2020 U.S. presidential election at the county level.

Usage

election_2020

Format

A data frame with 32177 rows and 7 columns:

state

State (character)

state_ev

State electoral votes (integer)

county

County name (character)

candidate

Candidate name (character)

party

Candidate party (character)

total_votes

Total number of votes (integer)

won

Whether the candidate won the county (logical)

Source

Data retrieved from MIT Election Data and Science Lab, 2018, "County Presidential Election Returns 2000-2020" with alterations made for educational purposes.


Estimate the Mode of Continuous Data Using a Density Function

Description

Estimates the mode of a numeric vector by identifying the value corresponding to the peak of its estimated probability density function.

Usage

estimate_mode(x)

Arguments

x

A numeric vector. Missing values (NA) are removed.

Value

A single numeric value representing the estimated mode.

Examples

# Estimate the mode of continuous random data
set.seed(123)
x <- rnorm(1000, mean=5, sd=2)
estimate_mode(x)

# Estimate the mode of miles-per-gallon (mpg) in the mtcars dataset
data("mtcars")
estimate_mode(mtcars$mpg)
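
The idea of reading the mode off the peak of an estimated density can be sketched with base R's density(). The helper name estimate_mode_sketch() is hypothetical, and the use of the default Gaussian kernel and default bandwidth is an assumption; the package's own estimate_mode() may use different settings.

```r
# Hypothetical helper, not the package's estimate_mode().
estimate_mode_sketch <- function(x) {
  x <- x[!is.na(x)]              # missing values are removed, as documented above
  d <- density(x)                # kernel density estimate (default kernel and bandwidth)
  d$x[which.max(d$y)]            # grid point where the estimated density peaks
}

set.seed(123)
x <- rnorm(1000, mean = 5, sd = 2)
estimate_mode_sketch(x)          # expected to land near the true mode of 5
```

Because the result depends on the bandwidth, different smoothing choices can shift the estimate slightly.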


Exam data

Description

Synthetic dataset containing academic performance and background information for 1,000 students.

Usage

exam_data

Format

A data frame with 1000 rows and 8 columns:

gender

Student's gender (character)

race.ethnicity

Student's race/ethnicity (character)

parental.level.of.education

Parents' level of education (character)

lunch

Student's lunch plan (character)

test.preparation.course

Student's test prep level (character)

math.score

Student's math score (integer)

reading.score

Student's reading score (integer)

writing.score

Student's writing score (integer)

Source

Data retrieved from roycekimmons generated data


Football/Quarterback data

Description

Dataset containing performance statistics for 106 football players who attempted a pass in the NFL for the 2022 season.

Usage

football

Format

A data frame with 106 rows and 17 columns:

Player

Player's name (character)

Tm

Player's team (character)

Age

Player's age (integer)

Pos

Player's position (character)

G

Number of games (integer)

GS

Number of games started (integer)

Wins

Number of wins (integer)

Cmp

Number of completions (integer)

Att

Number of throwing attempts (integer)

Cmp.

Completion percentage (numeric)

Yds

Number of yards thrown (integer)

TD

Number of touchdowns (integer)

Int

Number of interceptions thrown (integer)

Y.A

Yards per Attempt (numeric)

Y.G

Yards per Game (numeric)

Rate

Passer rating (numeric)

QBR

Total Quarterback Rating (numeric)

Source

Data retrieved from Pro Football Reference with alterations made for educational purposes.


Heart data

Description

Dataset containing medical and diagnostic information for 303 patients, used to study the presence of Atherosclerotic Heart Disease (AHD).

Usage

heart

Format

A data frame with 303 rows and 14 columns:

Age

Patient's age (integer)

Sex

Patient's sex (1 = Male, 0 = Female) (integer)

ChestPain

Chest pain type (character)

RestBP

Resting blood pressure (in mm Hg on admission to the hospital) (integer)

Chol

Serum cholesterol in mg/dl (integer)

Fbs

Fasting blood sugar > 120 mg/dl (1 = true; 0 = false) (integer)

RestECG

Resting electrocardiographic results (integer)

MaxHR

Maximum heart rate achieved (integer)

ExAng

Exercise induced angina (1 = yes; 0 = no) (integer)

Oldpeak

ST depression induced by exercise relative to rest (numeric)

Slope

The slope of the peak exercise ST segment (integer)

Ca

Number of major vessels (0-3) colored by fluoroscopy (integer)

Thal

Thal condition (character)

AHD

Atherosclerotic heart disease condition (character)

Source

Data retrieved from UC Irvine Machine Learning Repository.


Housing data

Description

Data on houses that were recently sold in the Duke Forest neighborhood of Durham, NC in November 2020.

Usage

housing_data

Format

A data frame with 98 rows and 6 columns:

price

Home price (numeric)

bed

Number of bedrooms (integer)

bath

Number of bathrooms (numeric)

area

Square footage (integer)

year_built

Year the house was built (integer)

lot

Lot size (numeric)

Source

Adapted from the duke_forest dataset in the openintro library with alterations made for educational purposes.


Income data

Description

Dataset containing basic demographic and financial information for 20 individuals.

Usage

income_data

Format

A data frame with 20 rows and 5 columns:

ID

ID (integer)

Ages

Age (integer)

Years_til_Retirement.65

Years until retirement at 65 (integer)

Salary

Salary (integer)

Birth_weight

Birth weight (integer)

Source

Synthetic Data


Compute Sample Kurtosis

Description

Calculates the kurtosis of a numeric vector. A value near 0 suggests normal kurtosis (mesokurtic), positive values indicate heavier tails (leptokurtic), and negative values indicate lighter tails (platykurtic).

Usage

kurt(x)

Arguments

x

A numeric vector.

Details

The z-scores are computed as:

z_i = \frac{x_i - \bar{x}}{sd}

The kurtosis is then calculated as:

\text{Kurtosis} = \frac{1}{n} \sum_{i=1}^{n} z_i^4 - 3

Where \bar{x} is the mean of x, sd is the standard deviation of x, and n is the number of observations.

Value

A single numeric value representing the kurtosis.

Examples

# Kurtosis of mpg in mtcars
data("mtcars")
kurt(mtcars$mpg)
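
The formula in the Details section can be written out directly in base R. The helper name kurt_sketch() is hypothetical, and it is an assumption that sd refers to the sample standard deviation returned by sd(), which uses an n - 1 denominator.

```r
# Hypothetical helper that follows the documented formula; not the package's kurt().
kurt_sketch <- function(x) {
  x <- x[!is.na(x)]               # assumption: missing values are ignored
  z <- (x - mean(x)) / sd(x)      # z-scores, using the sample standard deviation
  mean(z^4) - 3                   # average fourth power of z, minus 3
}

kurt_sketch(c(1, 2, 3, 4, 5))     # evenly spaced values give a negative (platykurtic) result
```

Subtracting 3 centers the statistic so that a normal distribution scores near 0, matching the interpretation given in the Description.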



Ledger data

Description

Dataset mimicking a ledger showing the price an item was bought and sold for, the date it occurred, and the color of the product.

Usage

ledger_data

Format

A data frame with 4 rows and 104 columns:

color

Color (character)

type

age (integer)

Jan_08

Price on date (numeric)

Jan_15

Price on date (numeric)

Jan_16

Price on date (numeric)

Jan_31

Price on date (numeric)

Feb_02

Price on date (numeric)

Feb_03

Price on date (numeric)

Feb_04

Price on date (numeric)

Feb_14

Price on date (numeric)

Feb_20

Price on date (numeric)

Feb_22

Price on date (numeric)

Feb_25

Price on date (numeric)

Feb_27

Price on date (numeric)

Feb_28

Price on date (numeric)

Mar_01

Price on date (numeric)

Mar_05

Price on date (numeric)

Mar_09

Price on date (numeric)

Mar_12

Price on date (numeric)

Mar_16

Price on date (numeric)

Mar_20

Price on date (numeric)

Mar_21

Price on date (numeric)

Mar_22

Price on date (numeric)

Mar_24

Price on date (numeric)

Mar_27

Price on date (numeric)

Mar_28

Price on date (numeric)

Mar_31

Price on date (numeric)

Apr_06

Price on date (numeric)

Apr_08

Price on date (numeric)

Apr_10

Price on date (numeric)

Apr_18

Price on date (numeric)

Apr_19

Price on date (numeric)

Apr_24

Price on date (numeric)

Apr_26

Price on date (numeric)

Apr_29

Price on date (numeric)

May_01

Price on date (numeric)

May_04

Price on date (numeric)

May_12

Price on date (numeric)

May_17

Price on date (numeric)

May_24

Price on date (numeric)

May_25

Price on date (numeric)

May_28

Price on date (numeric)

Jun_01

Price on date (numeric)

Jun_04

Price on date (numeric)

Jun_11

Price on date (numeric)

Jun_16

Price on date (numeric)

Jun_25

Price on date (numeric)

Jun_28

Price on date (numeric)

Jul_03

Price on date (numeric)

Jul_04

Price on date (numeric)

Jul_08

Price on date (numeric)

Jul_10

Price on date (numeric)

Jul_11

Price on date (numeric)

Jul_13

Price on date (numeric)

Jul_18

Price on date (numeric)

Jul_23

Price on date (numeric)

Jul_25

Price on date (numeric)

Aug_05

Price on date (numeric)

Aug_12

Price on date (numeric)

Aug_13

Price on date (numeric)

Aug_24

Price on date (numeric)

Aug_26

Price on date (numeric)

Sep_02

Price on date (numeric)

Sep_06

Price on date (numeric)

Sep_07

Price on date (numeric)

Sep_08

Price on date (numeric)

Sep_16

Price on date (numeric)

Sep_21

Price on date (numeric)

Sep_22

Price on date (numeric)

Sep_23

Price on date (numeric)

Sep_27

Price on date (numeric)

Oct_07

Price on date (numeric)

Oct_09

Price on date (numeric)

Oct_10

Price on date (numeric)

Oct_15

Price on date (numeric)

Oct_16

Price on date (numeric)

Oct_17

Price on date (numeric)

Oct_19

Price on date (numeric)

Oct_20

Price on date (numeric)

Oct_21

Price on date (numeric)

Oct_22

Price on date (numeric)

Oct_29

Price on date (numeric)

Oct_30

Price on date (numeric)

Oct_31

Price on date (numeric)

Nov_03

Price on date (numeric)

Nov_04

Price on date (numeric)

Nov_12

Price on date (numeric)

Nov_13

Price on date (numeric)

Nov_14

Price on date (numeric)

Nov_16

Price on date (numeric)

Nov_18

Price on date (numeric)

Nov_23

Price on date (numeric)

Nov_24

Price on date (numeric)

Dec_02

Price on date (numeric)

Dec_03

Price on date (numeric)

Dec_06

Price on date (numeric)

Dec_11

Price on date (numeric)

Dec_12

Price on date (numeric)

Dec_13

Price on date (numeric)

Dec_16

Price on date (numeric)

Dec_17

Price on date (numeric)

Dec_18

Price on date (numeric)

Dec_19

Price on date (numeric)

Dec_26

Price on date (numeric)

Source

Synthetic Data


MLB data

Description

Batter statistics for the 2018 Major League Baseball season.

Usage

mlb_eda

Format

A data frame with 1270 rows and 13 columns:

name

Player's name (character)

team

Player's team (character)

position

Player's position (character)

games

Number of games (integer)

AB

Number of at bats (integer)

R

Number of runs (integer)

H

Number of hits (integer)

doubles

Number of doubles (integer)

HR

Number of home runs (integer)

RBI

Number of Runs Batted In (integer)

AVG

Player's batting average (numeric)

SLG

Player's slugging percentage (numeric)

OPS

Player's on-base plus slugging (numeric)

Source

Data retrieved from MLB, with alterations made for educational purposes.


Mount St. Mary's dorm data

Description

Dataset summarizing the distribution of male and female students across various dormitories at Mount College, categorized by academic year.

Usage

mount_dorms

Format

A data frame with 4 rows and 11 columns:

year

Student's year (character)

m_Pangborn

Males living in Pangborn (integer)

m_Sheridan

Males living in Sheridan (integer)

m_Terrace

Males living in Terrace (integer)

m_Powell

Males living in Powell (integer)

m_Towers

Males living in the Towers (integer)

f_Pangborn

Females living in Pangborn (integer)

f_Sheridan

Females living in Sheridan (integer)

f_Terrace

Females living in Terrace (integer)

f_Powell

Females living in Powell (integer)

f_Towers

Females living in the Towers (integer)

Source

Synthetic Data


Percent Within N Standard Deviations of the Mean

Description

Calculates the percentage of values in a numeric vector that fall within n standard deviations of the mean.

Usage

pct(x, n)

Arguments

x

A numeric vector.

n

A positive numeric value indicating how many standard deviations from the mean to use as bounds.

Value

A single numeric value representing the percentage (0–100) of values within the specified range.

Examples

# Percentage of values that fall within 2 sds of the mean in random normal data
set.seed(123)
x <- rnorm(1000)
pct(x,2)

# Percentage of values that fall within 2 sds of the mean in iris Sepal Lengths
data("iris")
pct(iris$Sepal.Length, 2)
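
The calculation described above amounts to counting how many z-scores have an absolute value of at most n. The helper name pct_sketch() is hypothetical, and treating the bounds as inclusive and using the sample standard deviation from sd() are both assumptions; the package's pct() may differ in those details.

```r
# Hypothetical helper, not the package's pct().
pct_sketch <- function(x, n) {
  x <- x[!is.na(x)]                    # assumption: missing values are ignored
  z <- abs(x - mean(x)) / sd(x)        # distance from the mean in standard deviations
  100 * mean(z <= n)                   # share of values inside the bounds, as a percentage
}

pct_sketch(c(1, 2, 3, 4, 5), 1)        # 2, 3, and 4 are within 1 sd, so 60
pct_sketch(c(1, 2, 3, 4, 5), 2)        # every value is within 2 sd, so 100
```

For roughly normal data the results for n = 1, 2, and 3 should be close to 68, 95, and 99.7, which makes the function a quick check against the empirical rule.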



Compute Position Statistics: Quintiles and Quartiles

Description

Calculates the quantiles, including quartiles (data split into 4 equal parts) and quintiles (data split into 5 equal parts), of a numeric vector using the 'quantile()' function. NAs are removed.

Usage

position_stats(x)

Arguments

x

A numeric vector.

Details

Percentiles are values that divide a dataset into 100 equal parts, each representing 1% of the distribution. For example, the 25th percentile is the value below which 25% of the data fall.

Quartiles are special percentiles that divide the data into four equal groups: Q1 (25th percentile), Q2 (50th percentile or median), Q3 (75th percentile).

Quintiles divide data into five equal groups, each representing 20% of the distribution. The cut points are the 20th, 40th, 60th, and 80th percentiles.

Value

A list with two elements:

quint

Numeric vector of quintiles (0%, 20%, 40%, ..., 100%)

quart

Numeric vector of quartiles (0%, 25%, 50%, 75%, 100%)

Examples

# Position stats of random data
set.seed(123)
x <- rnorm(1000)
position_stats(x)

# Position stats of MPG in mtcars data set
data("mtcars")
position_stats(mtcars$mpg)
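
The quartile and quintile cut points described in the Details section can be reproduced with base R's quantile(). The helper name position_sketch() is hypothetical, and relying on the default interpolation type of quantile() is an assumption about how the package computes its values.

```r
# Hypothetical helper, not the package's position_stats().
position_sketch <- function(x) {
  list(
    quint = quantile(x, probs = seq(0, 1, by = 0.20), na.rm = TRUE),  # 0%, 20%, ..., 100%
    quart = quantile(x, probs = seq(0, 1, by = 0.25), na.rm = TRUE)   # 0%, 25%, ..., 100%
  )
}

res <- position_sketch(1:9)
res$quart    # quartiles of 1 to 9: 1, 3, 5, 7, 9
res$quint    # six cut points, from the minimum to the maximum
```

The 0% and 100% entries are simply the minimum and maximum, so each vector has one more element than the number of groups it defines.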



Reaction Data

Description

This dataset contains synthetic reaction time measurements for 100 individuals under different conditions.

Usage

reaction_time

Format

A data frame with 100 rows and 6 columns:

person

Person id (integer)

color

color (character)

left

left (numeric)

right

right (numeric)

age

Person age (numeric)

gender

Person gender (character)

Source

Synthetic Data


Compute Sample Skewness and Kurtosis

Description

Calculates the skewness of a numeric vector (via skew()). A positive value indicates right skew (long right tail), while a negative value indicates left skew (long left tail). A zero value represents symmetry. Calculates the kurtosis of a numeric vector (via kurt()). A value near 0 suggests normal kurtosis (mesokurtic), positive values indicate heavier tails (leptokurtic), and negative values indicate lighter tails (platykurtic).

Usage

shape_stats(x)

Arguments

x

A numeric vector.

Value

A list with two elements:

skew

Skew of Data from skew()

kurt

Kurtosis of Data from kurt()

Examples

# Shape stats of mpg in mtcars
data("mtcars")
shape_stats(mtcars$mpg)



Compute Sample Skewness

Description

Calculates the skewness of a numeric vector. A positive value indicates right skew (long right tail), while a negative value indicates left skew (long left tail). A zero value represents symmetry.

Usage

skew(x)

Arguments

x

A numeric vector.

Value

A single numeric value representing the skewness of the distribution.

Examples

# Skew of Sepal Lengths in iris
data("iris")
skew(iris$Sepal.Length)
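
This entry does not state a formula, so the sketch below is only one plausible reading: it assumes skewness is computed like the kurtosis documented for kurt(), as the average of the cubed z-scores. The helper name skew_sketch() is hypothetical, and the package's skew() may use a different estimator.

```r
# Hypothetical helper under the stated assumption; not the package's skew().
skew_sketch <- function(x) {
  x <- x[!is.na(x)]               # assumption: missing values are ignored
  z <- (x - mean(x)) / sd(x)      # z-scores, using the sample standard deviation
  mean(z^3)                       # average third power of z
}

skew_sketch(c(1, 2, 3, 4, 5))     # symmetric data: 0 up to floating-point error
skew_sketch(c(1, 1, 1, 2, 10))    # long right tail: positive
```

Cubing keeps the sign of each z-score, so large values on one side dominate and determine whether the result is positive or negative.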



Historic soccer data

Description

This dataset contains historical match results from various international soccer games between different countries for the years 1872-2024.

Usage

soccer

Format

A data frame with 13750 rows and 5 columns:

date

Date of match (character)

home_team

Home team name (character)

away_team

Away team name (character)

home_score

Home teams goal count (integer)

away_score

Away teams goal count (integer)

Source

Data retrieved from Kaggle International football results dataset with alterations made for educational purposes.


Summary of Spread Statistics

Description

Computes a variety of spread statistics for a numeric vector, including the standard deviation, the interquartile range (IQR), the normalized minimum, maximum, and range, and the percentage of data within 1, 2, and 3 standard deviations (via pct()).

Usage

spread_stats(x)

Arguments

x

A numeric vector

Value

sd

Standard Deviation

iqr

Interquartile Range

minz

Normalized Minimum

maxz

Normalized Maximum

diffz

Normalized Range

pct1

Percent of data within 1 standard deviation from pct()

pct2

Percent of data within 2 standard deviations from pct()

pct3

Percent of data within 3 standard deviations from pct()

See Also

pct

Examples

# Spread stats of random normal data
set.seed(123)
x <- rnorm(1000)
spread_stats(x)

# Spread stats of mpg in mtcars
data("mtcars")
spread_stats(mtcars$mpg)
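
The listed values can be approximated in base R as shown below. The helper name spread_sketch() is hypothetical, and it is an assumption that "normalized" minimum and maximum mean the z-scores of the smallest and largest observations, with the normalized range as their difference; the package's spread_stats() may define them differently.

```r
# Hypothetical helper under the stated assumptions; not the package's spread_stats().
spread_sketch <- function(x) {
  x <- x[!is.na(x)]                                   # assumption: missing values are ignored
  z <- (x - mean(x)) / sd(x)                          # z-scores of every observation
  pct_within <- function(n) 100 * mean(abs(z) <= n)   # percentage within n standard deviations
  c(sd    = sd(x),
    iqr   = IQR(x),
    minz  = min(z),                                   # z-score of the minimum
    maxz  = max(z),                                   # z-score of the maximum
    diffz = max(z) - min(z),                          # range expressed in standard deviations
    pct1  = pct_within(1),
    pct2  = pct_within(2),
    pct3  = pct_within(3))
}

spread_sketch(c(1, 2, 3, 4, 5))
```

Expressing the minimum, maximum, and range in standard deviation units makes them comparable across variables measured on different scales.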