
Descriptive Statistics


Descriptive statistics is a subfield of statistics that deals with characterizing the features of known data. Descriptive statistics give summaries of either population or sample data. Aside from descriptive statistics, inferential statistics is another important discipline of statistics used to draw conclusions about population data.

Descriptive statistics is divided into two categories:

  • Measures of Central Tendency
  • Measures of Dispersion

In this article, we will learn about descriptive statistics, including their many categories, formulae, and examples in detail.

What is Descriptive Statistics?

Descriptive statistics is a branch of statistics focused on summarizing, organizing, and presenting data in a clear and understandable way. Its primary aim is to define and analyze the fundamental characteristics of a dataset without making sweeping generalizations or assumptions about the entire data set.

The main purpose of descriptive statistics is to provide a straightforward and concise overview of the data, enabling researchers or analysts to gain insights and understand patterns, trends, and distributions within the dataset.

Descriptive statistics typically involve measures of central tendency (such as mean, median, mode), dispersion (such as range, variance, standard deviation), and distribution shape (including skewness and kurtosis). Additionally, graphical representations like charts, graphs, and tables are commonly used to visualize and interpret the data.

Histograms, bar charts, pie charts, scatter plots, and box plots are some examples of widely used graphical techniques in descriptive statistics.

Descriptive Statistics Definition

Descriptive statistics is a type of statistical analysis that uses quantitative methods to summarize the features of a sample or population. It presents simple and precise summaries of the sample and its observations using metrics such as the mean, median, and variance, along with graphs and charts.

Types of Descriptive Statistics

There are three types of descriptive statistics:

  • Measures of Central Tendency
  • Measures of Variability (Dispersion)
  • Measures of Frequency Distribution

Measures of Central Tendency

A measure of central tendency is a statistical measure that describes an entire distribution or dataset with a single representative value. Any of the central tendency measures gives a concise description of the whole data distribution. In the following sections, we will look at the measures of central tendency, their formulae, applications, and kinds in depth.

Mean is the sum of all items in a group or collection divided by the number of items in that group or collection. The mean of a data collection is typically represented as x̄ (pronounced “x bar”). The formula for calculating the mean of ungrouped data is given as follows:

For a series of observations:

x̄ = Σx / n
  • x̄ = Mean Value of Provided Dataset
  • Σx = Sum of All Terms
  • n = Number of Terms

Example: Weights of 7 girls in kg are 54, 32, 45, 61, 20, 66 and 50. Determine the mean weight for the provided collection of data.

Mean = Σx/n = (54 + 32 + 45 + 61 + 20 + 66 + 50)/7 = 328/7 ≈ 46.86

Thus, the group’s mean weight is about 46.86 kg.
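The arithmetic above can be verified with Python’s standard statistics module:

```python
import statistics

# Example data from above: weights of 7 girls in kg
weights = [54, 32, 45, 61, 20, 66, 50]

# Mean = sum of all terms / number of terms
mean = sum(weights) / len(weights)
print(round(mean, 2))  # → 46.86

# statistics.mean performs the same calculation
assert abs(mean - statistics.mean(weights)) < 1e-9
```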

The median of a data set is the value of the middle-most observation obtained after arranging the data in ascending order; it is another measure of central tendency. The median formula differs for grouped and ungrouped data.

Ungrouped Data Median (n is odd): [(n + 1)/2]th term

Ungrouped Data Median (n is even): [(n/2)th term + ((n/2) + 1)th term]/2

Example: Weights of 7 girls in kg are 54, 32, 45, 61, 20, 66 and 50. Determine the median weight for the provided collection of data.

Arrange the provided data collection in ascending order: 20, 32, 45, 50, 54, 61, 66

Median = [(n + 1)/2]th term = [(7 + 1)/2]th term = 4th term = 50

Thus, the group’s median weight is 50 kg.
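The odd-n median rule can be sanity-checked in Python:

```python
import statistics

weights = [54, 32, 45, 61, 20, 66, 50]

# With n = 7 (odd), the median is the ((7 + 1)/2) = 4th smallest value
median = sorted(weights)[(len(weights) + 1) // 2 - 1]
print(median)  # → 50

# statistics.median applies the same rule (and averages the two middle
# values when n is even)
assert median == statistics.median(weights)
```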

Mode is one of the measures of central tendency, defined as the value that appears the most frequently in the provided data, i.e. the observation with the highest frequency is known as the mode of data. The mode formulae provided below can be used to compute the mode for ungrouped data.

Mode of Ungrouped Data: Most Repeated Observation in Dataset

Example: Weights of 7 girls in kg are 54, 32, 45, 61, 20, 45 and 50. Determine the mode weight for the provided collection of data.

Mode = Most repeated observation in the dataset = 45

Thus, the group’s mode weight is 45 kg.
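The same lookup is built into the standard library:

```python
import statistics

weights = [54, 32, 45, 61, 20, 45, 50]

# Mode = the most frequently occurring observation (45 appears twice)
mode = statistics.mode(weights)
print(mode)  # → 45
```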

Measures of Variability

If the variability of data within an experiment must be established, absolute measures of variability are employed. These metrics reflect differences within a data collection in terms of the average deviations of the observations. The most prevalent absolute measures of variability are described below, along with their formulae.

Range

The range represents the spread of your data from the lowest to the highest value in the distribution. It is the most straightforward measure of variability to compute. To get the range, subtract the lowest value in the data set from the highest value.

Range = Highest Value – Lowest Value

Example: Calculate the range of the following data series:  5, 13, 32, 42, 15, 84

Arrange the provided data series in ascending order: 5, 13, 15, 32, 42, 84

Range = H − L = 84 − 5 = 79

So, the range is 79.
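Since only the extremes matter, no sorting is actually required in code:

```python
values = [5, 13, 32, 42, 15, 84]

# Range = Highest Value - Lowest Value
data_range = max(values) - min(values)
print(data_range)  # → 79
```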

Standard Deviation

Standard deviation (s or SD) represents the average level of variability in your dataset. It tells you, on average, how far each score lies from the mean. The higher the standard deviation, the more varied the dataset is.

To calculate standard deviation, follow these six steps:

Step 1: Make a list of each score and calculate the mean.

Step 2: Calculate the deviation from the mean by subtracting the mean from each score.

Step 3: Square each of these deviations.

Step 4: Sum up all the squared deviations.

Step 5: Divide the sum of squared deviations by N − 1.

Step 6: Find the square root of the number you obtained.

Example: Calculate standard deviation of the following data series:  5, 13, 32, 42, 15, 84.

Step 1: Calculate the mean of the series: Σx/n = (5 + 13 + 32 + 42 + 15 + 84)/6 = 191/6 ≈ 31.83

Step 2: Subtract the mean from each score to get the deviations from the mean.

Step 3: Square each deviation.

Step 4: Sum the squared deviations: 4182.84

Step 5: Divide the sum of squared deviations by N − 1: 4182.84/5 = 836.57

Step 6: Take the square root: √836.57 = 28.92

So, the standard deviation is 28.92.
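The six steps map directly onto a few lines of Python, and the result matches the built-in sample standard deviation:

```python
import math
import statistics

scores = [5, 13, 32, 42, 15, 84]

# Step 1: list each score and calculate the mean
mean = sum(scores) / len(scores)

# Step 2: deviation of each score from the mean
deviations = [x - mean for x in scores]

# Step 3: square each deviation
squared = [d ** 2 for d in deviations]

# Step 4: sum the squared deviations
total = sum(squared)

# Step 5: divide by N - 1 (sample variance)
variance = total / (len(scores) - 1)

# Step 6: take the square root
sd = math.sqrt(variance)
print(round(sd, 2))  # → 28.92

# statistics.stdev performs the same sample calculation
assert math.isclose(sd, statistics.stdev(scores))
```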

Variance is calculated as the average of squared deviations from the mean. Variance measures the degree of dispersion in a data collection: the more scattered the data, the larger the variance in relation to the mean. To calculate the variance, square the standard deviation. The symbol for variance is s².

Example: Calculate the variance of the following data series: 5, 13, 32, 42, 15, 84.

We found above that the sum of squared deviations divided by N − 1 is 836.57, whose square root gives SD = 28.92 (squaring the rounded SD gives 836.37, a small rounding artifact).

s² = 836.57

So, the variance is 836.57.
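Computing the variance directly from the data, rather than squaring the rounded standard deviation, avoids that rounding error:

```python
import statistics

scores = [5, 13, 32, 42, 15, 84]

# Sample variance = sum of squared deviations / (N - 1)
variance = statistics.variance(scores)
print(round(variance, 2))  # → 836.57

# The sample standard deviation is its square root
assert abs(statistics.stdev(scores) ** 2 - variance) < 1e-9
```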

Mean Deviation

Mean deviation is used to find the average of the absolute deviations of the data about the mean, median, or mode. Mean deviation is sometimes also known as absolute deviation. The formula for mean deviation is given as follows:

Mean Deviation = Σ|X − μ|/n
  • μ is the Central Value (mean, median, or mode)
  • n is the Number of Observations
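As a sketch, the mean deviation about the mean for the running example series:

```python
scores = [5, 13, 32, 42, 15, 84]

# Central value μ: here the mean (the median or mode could be used instead)
mu = sum(scores) / len(scores)

# Mean Deviation = Σ|X - μ| / n
mean_dev = sum(abs(x - mu) for x in scores) / len(scores)
print(round(mean_dev, 2))  # → 20.83
```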

Quartile Deviation

Quartile deviation is half of the difference between the third and first quartiles. The formula for quartile deviation is given as follows:

Quartile Deviation = (Q3 − Q1)/2
  • Q3 is the Third Quartile
  • Q1 is the First Quartile
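Python’s statistics.quantiles can supply Q1 and Q3; note that different quartile conventions give slightly different values:

```python
import statistics

scores = [5, 13, 32, 42, 15, 84]

# Cut points dividing the sorted data into four equal parts
# (default method is "exclusive"; method="inclusive" would differ slightly)
q1, q2, q3 = statistics.quantiles(scores, n=4)

# Quartile Deviation = (Q3 - Q1) / 2
quartile_deviation = (q3 - q1) / 2
print(quartile_deviation)  # → 20.75
```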

Other measures of dispersion include the relative measures also known as the coefficients of dispersion.

Measures of Frequency Distribution

Datasets consist of various scores or values. Statisticians employ graphs and tables to summarize the occurrence of each possible value of a variable, often presented as percentages or counts.

For instance, suppose you were conducting a poll to determine people’s favorite Beatle. You would create one column listing all potential options (John, Paul, George, and Ringo) and another column indicating the number of votes each received. Statisticians represent these frequency distributions through graphs or tables.
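That tally is exactly what collections.Counter produces; the vote counts below are made up for illustration, only the option names come from the example above:

```python
from collections import Counter

# Hypothetical poll responses
votes = ["Paul", "John", "Paul", "George", "Ringo", "John", "Paul"]

frequency = Counter(votes)
total = len(votes)

# Frequency table: counts alongside percentages
for name, count in frequency.most_common():
    print(f"{name}: {count} ({100 * count / total:.1f}%)")
```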

Univariate Descriptive Statistics

Univariate descriptive statistics focus on one thing at a time. We look at each thing individually and use different ways to understand it better. Programs like SPSS and Excel can help us with this.

If we only look at the average (mean) of something, like how much people earn, it might not give us the true picture, especially if some people earn a lot more or less than others. Instead, we can also look at other things like the middle value (median) or the one that appears most often (mode). And to understand how spread out the values are, we use things like standard deviation and variance along with the range.

Bivariate Descriptive Statistics

When we have information about more than one thing, we can use bivariate or multivariate descriptive statistics to see if they are related. Bivariate analysis compares two things to see if they change together. Before doing any more complicated tests, it’s important to look at how the two things compare in the middle.

Multivariate analysis is similar to bivariate analysis, but it looks at more than two things at once, which helps us understand relationships even better.

Representations of Data in Descriptive Statistics

Descriptive statistics use a variety of ways to summarize and present data in an understandable manner. This helps us grasp the data set’s patterns, trends, and properties.

Frequency Distribution Tables: Frequency distribution tables divide data into categories or intervals and display the number of observations (frequency) that fall into each one. For example, suppose we have a class of 20 students and are tracking their test scores. We may make a frequency distribution table that contains score ranges (e.g., 0-10, 11-20) and displays how many students scored in each range.

Graphs and Charts: Graphs and charts graphically display data, making it simpler to understand and analyze. For example, using the same test score data, we may generate a bar graph with the x-axis representing score ranges and the y-axis representing the number of students. Each bar on the graph represents a score range, and its height shows the number of students scoring within that range.

These approaches help us summarize and visualize data, making it easier to discover trends, patterns, and outliers, which is critical for making informed decisions and reaching meaningful conclusions in a variety of sectors.

Descriptive Statistics Applications

Descriptive statistics are used in a variety of sectors to summarize, organize, and display data in a meaningful and intelligible way. Here are a few popular applications:

  • Business and Economics: Descriptive statistics are useful for analyzing sales data, market trends, and customer behaviour. They are used to generate averages, medians, and standard deviations in order to better evaluate product performance, pricing strategies, and financial metrics.
  • Healthcare: Descriptive statistics are used to analyze patient data such as demographics, medical histories, and treatment outcomes. They assist healthcare workers in determining illness prevalence, assessing treatment efficacy, and identifying risk factors.
  • Education: Descriptive statistics are useful in education since they summarize student performance on tests and examinations. They assist instructors in assessing instructional techniques, identifying areas for improvement, and monitoring student growth over time.
  • Market Research: Descriptive statistics are used to analyze customer preferences, product demand, and market trends. They enable businesses to make educated decisions about product development, advertising campaigns, and market segmentation.
  • Finance and investment: Descriptive statistics are used to analyze stock market data, portfolio performance, and risk management. They assist investors in determining investment possibilities, tracking asset values, and evaluating financial instruments.

Difference Between Descriptive Statistics and Inferential Statistics

The key difference: descriptive statistics summarize the features of known data, while inferential statistics use sample data to draw conclusions or make predictions about a wider population.

Solved Examples on Descriptive Statistics

Example 1: Calculate the Mean, Median and Mode for the following series: {4, 8, 9, 10, 6, 12, 14, 4, 5, 3, 4}

First, calculate the mean:

Mean = Σx/n = (4 + 8 + 9 + 10 + 6 + 12 + 14 + 4 + 5 + 3 + 4)/11 = 79/11 = 7.1818

Thus, the mean is 7.1818.

Next, calculate the median. Arrange the data in ascending order: 3, 4, 4, 4, 5, 6, 8, 9, 10, 12, 14

Median = [(n + 1)/2]th term = [(11 + 1)/2]th term = 6th term = 6

Thus, the median is 6.

Finally, calculate the mode:

Mode = Most repeated observation in the dataset = 4

Thus, the mode is 4.
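All three results can be confirmed with the statistics module:

```python
import statistics

data = [4, 8, 9, 10, 6, 12, 14, 4, 5, 3, 4]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

print(round(mean, 4))  # → 7.1818
print(median)          # → 6
print(mode)            # → 4
```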

Example 2: Calculate the Range for the following series: {4, 8, 9, 10, 6, 12, 14, 4, 5, 3, 4}

Arrange the provided data series in ascending order: 3, 4, 4, 4, 5, 6, 8, 9, 10, 12, 14 Range = H – L = 14 – 3 = 11 So, the range is 11.

Example 3: Calculate the standard deviation and variance of following data: {12, 24, 36, 48, 10, 18}

First we compute the standard deviation. Calculate the mean, the deviations from the mean, and the squared deviations:

Mean = (12 + 24 + 36 + 48 + 10 + 18)/6 = 148/6 ≈ 24.67

Dividing the sum of squared deviations by N − 1: 1093.33/5 = 218.67

√218.67 = 14.79

So, the standard deviation is 14.79.

Now we calculate the variance:

s² = 218.67

So, the variance is 218.67.
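A quick check of both results:

```python
import statistics

data = [12, 24, 36, 48, 10, 18]

sd = statistics.stdev(data)           # sample standard deviation
variance = statistics.variance(data)  # sample variance (= sd squared)

print(round(sd, 2))        # → 14.79
print(round(variance, 2))  # → 218.67
```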

Practice Problems on Descriptive Statistics

P1) Determine the sample variance of the following series: {17, 21, 52, 28, 26, 23}

P2) Determine the mean and mode of the following series: {21, 14, 56, 41, 18, 15, 18, 21, 15, 18}

P3) Find the median of the following series: {7, 24, 12, 8, 6, 23, 11}

P4) Find the standard deviation and variance of the following series: {17, 28, 42, 48, 36, 42, 20}

FAQs of Descriptive Statistics

What is meant by descriptive statistics?

Descriptive statistics seek to summarize, organize, and display data in an accessible manner while avoiding making sweeping generalizations about the whole population. It aids in discovering patterns, trends, and distributions within the collection.

How is the mean computed in descriptive statistics?

Mean is computed by adding together all of the values in the dataset and dividing them by the total number of observations. It measures the dataset’s central tendency or average value.

What role do measures of variability play in descriptive statistics?

Measures of variability, such as range, standard deviation, and variance, aid in quantifying the spread or dispersion of data points around the mean. They give insights on the dataset’s variety and consistency.

Can you explain the median in descriptive statistics?

The median is the midpoint value of a dataset when sorted in ascending or descending order. It measures central tendency and is important when dealing with skewed data or outliers.

How can frequency distribution measurements contribute to descriptive statistics?

Measures of frequency distribution summarize the incidence of various values or categories within a dataset. They give insights into the distribution pattern of the data and are commonly represented by graphs or tables.

How are inferential statistics distinguished from descriptive statistics?

Inferential statistics use sample data to draw inferences or make predictions about a wider population, whereas descriptive statistics summarize aspects of known data. Descriptive statistics concentrate on the present dataset, whereas inferential statistics go beyond the observable data.

Why are descriptive statistics necessary in data analysis?

Descriptive statistics give researchers and analysts a clear and straightforward summary of the dataset, helping them to identify patterns, trends, and distributions. It aids in making educated judgements and gaining valuable insights from data.

What are the four types of descriptive statistics?

There are four major types of descriptive statistics:

  • Measures of Frequency
  • Measures of Central Tendency
  • Measures of Dispersion or Variation
  • Measures of Position

Which is an example of descriptive statistics?

Descriptive statistics examples include the study of mean, median, and mode.

Descriptive Statistics | Definitions, Types, Examples

Published on 4 November 2022 by Pritha Bhandari. Revised on 9 January 2023.

Descriptive statistics summarise and organise characteristics of a data set. A data set is a collection of responses or observations from a sample or entire population.

In quantitative research, after collecting data, the first step of statistical analysis is to describe characteristics of the responses, such as the average of one variable (e.g., age), or the relation between two variables (e.g., age and creativity).

The next step is inferential statistics, which help you decide whether your data confirms or refutes your hypothesis and whether it is generalisable to a larger population.

Table of contents

  • Types of descriptive statistics
  • Frequency distribution
  • Measures of central tendency
  • Measures of variability
  • Univariate descriptive statistics
  • Bivariate descriptive statistics
  • Frequently asked questions

There are 3 main types of descriptive statistics:

  • The distribution concerns the frequency of each value.
  • The central tendency concerns the averages of the values.
  • The variability or dispersion concerns how spread out the values are.

Types of descriptive statistics

You can apply these to assess only one variable at a time, in univariate analysis, or to compare two or more, in bivariate and multivariate analysis.

For example, a survey might ask respondents how many times in the past year they did each of the following:

  • Go to a library
  • Watch a movie at a theater
  • Visit a national park

A data set is made up of a distribution of values, or scores. In tables or graphs, you can summarise the frequency of every possible value of a variable in numbers or percentages.

  • Simple frequency distribution table
  • Grouped frequency distribution table

From this table, you can see that more women than men or people with another gender identity took part in the study. In a grouped frequency distribution, you can group numerical response values and add up the number of responses for each group. You can also convert each of these numbers to percentages.

Measures of central tendency estimate the center, or average, of a data set. The mean , median and mode are 3 ways of finding the average.

Here we will demonstrate how to calculate the mean, median, and mode using the first 6 responses of our survey.

The mean, or M, is the most commonly used method for finding the average.

To find the mean, simply add up all response values and divide the sum by the total number of responses. The total number of responses or observations is called N.

The median is the value that’s exactly in the middle of a data set.

To find the median, order each response value from the smallest to the biggest. Then, the median is the number in the middle. If there are two numbers in the middle, find their mean.

The mode is simply the most popular or most frequent response value. A data set can have no mode, one mode, or more than one mode.

To find the mode, order your data set from lowest to highest and find the response that occurs most frequently.

Measures of variability give you a sense of how spread out the response values are. The range, standard deviation and variance each reflect different aspects of spread.

The range gives you an idea of how far apart the most extreme response scores are. To find the range, simply subtract the lowest value from the highest value.

Standard deviation

The standard deviation (s) is the average amount of variability in your dataset. It tells you, on average, how far each score lies from the mean. The larger the standard deviation, the more variable the data set is.

There are six steps for finding the standard deviation:

  • List each score and find their mean.
  • Subtract the mean from each score to get the deviation from the mean.
  • Square each of these deviations.
  • Add up all of the squared deviations.
  • Divide the sum of the squared deviations by N – 1.
  • Find the square root of the number you found.

For example, for a sample of six scores whose squared deviations sum to 421.5:

Step 5: 421.5/5 = 84.3

Step 6: √84.3 = 9.18

The variance is the average of squared deviations from the mean. Variance reflects the degree of spread in the data set. The more spread the data, the larger the variance is in relation to the mean.

To find the variance, simply square the standard deviation. The symbol for variance is s².

Univariate descriptive statistics focus on only one variable at a time. It’s important to examine data from each variable separately using multiple measures of distribution, central tendency and spread. Programs like SPSS and Excel can be used to easily calculate these.

If you were to only consider the mean as a measure of central tendency, your impression of the ‘middle’ of the data set can be skewed by outliers, unlike the median or mode.

Likewise, while the range is sensitive to extreme values, you should also consider the standard deviation and variance to get easily comparable measures of spread.

If you’ve collected data on more than one variable, you can use bivariate or multivariate descriptive statistics to explore whether there are relationships between them.

In bivariate analysis, you simultaneously study the frequency and variability of two variables to see if they vary together. You can also compare the central tendency of the two variables before performing further statistical tests.

Multivariate analysis is the same as bivariate analysis but with more than two variables.

Contingency table

In a contingency table, each cell represents the intersection of two variables. Usually, an independent variable (e.g., gender) appears along the vertical axis and a dependent one appears along the horizontal axis (e.g., activities). You read ‘across’ the table to see how the independent and dependent variables relate to each other.

Interpreting a contingency table is easier when the raw data is converted to percentages. Percentages make each row comparable to the other by making it seem as if each group had only 100 observations or participants. When creating a percentage-based contingency table, you add the N for each independent variable on the end.

From this table, it is clearer that similar proportions of children and adults go to the library over 17 times a year. Additionally, children most commonly went to the library between 5 and 8 times, while for adults, this number was between 13 and 16.
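A percentage-based contingency table can be sketched in plain Python; the counts below are hypothetical, not the study’s actual data:

```python
# Hypothetical counts: library visits per year by age group
table = {
    "Children": {"0-4": 20, "5-8": 35, "9-12": 25, "13-16": 15, "17+": 5},
    "Adults":   {"0-4": 30, "5-8": 20, "9-12": 20, "13-16": 25, "17+": 5},
}

# Convert each row to percentages so groups of different sizes are comparable
for group, counts in table.items():
    n = sum(counts.values())
    percentages = {k: round(100 * v / n, 1) for k, v in counts.items()}
    print(f"{group} (N = {n}): {percentages}")
```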

Scatter plots

A scatter plot is a chart that shows you the relationship between two or three variables. It’s a visual representation of the strength of a relationship.

In a scatter plot, you plot one variable along the x-axis and another one along the y-axis. Each data point is represented by a point in the chart.

From your scatter plot, you see that as the number of movies seen at movie theaters increases, the number of visits to the library decreases. Based on your visual assessment of a possible linear relationship, you perform further tests of correlation and regression.

Descriptive statistics: Scatter plot

Descriptive statistics summarise the characteristics of a data set. Inferential statistics allow you to test a hypothesis or assess whether your data is generalisable to the broader population.

The 3 main types of descriptive statistics concern the frequency distribution, central tendency, and variability of a dataset.

  • Distribution refers to the frequencies of different responses.
  • Measures of central tendency give you the average for each response.
  • Measures of variability show you the spread or dispersion of your dataset.
  • Univariate statistics summarise only one variable  at a time.
  • Bivariate statistics compare two variables .
  • Multivariate statistics compare more than two variables .

Cite this Scribbr article


Bhandari, P. (2023, January 09). Descriptive Statistics | Definitions, Types, Examples. Scribbr. Retrieved 14 May 2024, from https://www.scribbr.co.uk/stats/descriptive-statistics-explained/


Grad Coach

Quant Analysis 101: Descriptive Statistics

Everything You Need To Get Started (With Examples)

By: Derek Jansen (MBA) | Reviewers: Kerryn Warren (PhD) | October 2023

If you’re new to quantitative data analysis , one of the first terms you’re likely to hear being thrown around is descriptive statistics. In this post, we’ll unpack the basics of descriptive statistics, using straightforward language and loads of examples . So grab a cup of coffee and let’s crunch some numbers!

Overview: Descriptive Statistics

  • What are descriptive statistics?
  • Descriptive vs inferential statistics
  • Why the descriptives matter
  • The “Big 7” descriptive statistics
  • Key takeaways

What are descriptive statistics?

At the simplest level, descriptive statistics summarise and describe relatively basic but essential features of a quantitative dataset – for example, a set of survey responses. They provide a snapshot of the characteristics of your dataset and allow you to better understand, roughly, how the data are “shaped” (more on this later). For example, a descriptive statistic could include the proportion of males and females within a sample or the percentages of different age groups within a population.

Another common descriptive statistic is the humble average (which in statistics-talk is called the mean ). For example, if you undertook a survey and asked people to rate their satisfaction with a particular product on a scale of 1 to 10, you could then calculate the average rating. This is a very basic statistic, but as you can see, it gives you some idea of how this data point is shaped .


What about inferential statistics?

Now, you may have also heard the term inferential statistics being thrown around, and you’re probably wondering how that’s different from descriptive statistics. Simply put, descriptive statistics describe and summarise the sample itself, while inferential statistics use the data from a sample to make inferences or predictions about a population.

Put another way, descriptive statistics help you understand your dataset, while inferential statistics help you make broader statements about the population, based on what you observe within the sample. If you’re keen to learn more, we cover inferential stats in another post.

Why do descriptive statistics matter?

While descriptive statistics are relatively simple from a mathematical perspective, they play a very important role in any research project . All too often, students skim over the descriptives and run ahead to the seemingly more exciting inferential statistics, but this can be a costly mistake.

The reason for this is that descriptive statistics help you, as the researcher, comprehend the key characteristics of your sample without getting lost in vast amounts of raw data. In doing so, they provide a foundation for your quantitative analysis . Additionally, they enable you to quickly identify potential issues within your dataset – for example, suspicious outliers, missing responses and so on. Just as importantly, descriptive statistics inform the decision-making process when it comes to choosing which inferential statistics you’ll run, as each inferential test has specific requirements regarding the shape of the data.

Long story short, it’s essential that you take the time to dig into your descriptive statistics before looking at more “advanced” inferentials. It’s also worth noting that, depending on your research aims and questions, descriptive stats may be all that you need in any case . So, don’t discount the descriptives! 


The “Big 7” descriptive statistics

With the what and why out of the way, let’s take a look at the most common descriptive statistics. Beyond the counts, proportions and percentages we mentioned earlier, we have what we call the “Big 7” descriptives. These can be divided into two categories – measures of central tendency and measures of dispersion.

Measures of central tendency

True to the name, measures of central tendency describe the centre or “middle section” of a dataset. In other words, they provide some indication of what a “typical” data point looks like within a given dataset. The three most common measures are:

  • The mean, which is the mathematical average of a set of numbers – in other words, the sum of all numbers divided by the count of all numbers.
  • The median, which is the middlemost number in a set of numbers, when those numbers are ordered from lowest to highest.
  • The mode, which is the most frequently occurring number in a set of numbers (in any order). Naturally, a dataset can have one mode, no mode (no number occurs more than once) or multiple modes.

To make this a little more tangible, let’s look at a sample dataset, along with the corresponding mean, median and mode. This dataset reflects the service ratings (on a scale of 1 – 10) from 15 customers.

Example set of descriptive stats

As you can see, the mean of 5.8 is the average rating across all 15 customers. Meanwhile, 6 is the median . In other words, if you were to list all the responses in order from low to high, Customer 8 would be in the middle (with their service rating being 6). Lastly, the number 5 is the most frequent rating (appearing 3 times), making it the mode.

Together, these three descriptive statistics give us a quick overview of how these customers feel about the service levels at this business. In other words, most customers feel rather lukewarm and there’s certainly room for improvement. From a more statistical perspective, this also means that the data tend to cluster around the 5-6 mark , since the mean and the median are fairly close to each other.

To take this a step further, let’s look at the frequency distribution of the responses . In other words, let’s count how many times each rating was received, and then plot these counts onto a bar chart.

Example frequency distribution of descriptive stats

As you can see, the responses tend to cluster toward the centre of the chart, creating something of a bell-shaped curve. In statistical terms, this is called a normal distribution.

As you delve into quantitative data analysis, you’ll find that normal distributions are very common, but they’re certainly not the only type of distribution. In some cases, the data can lean toward the left or the right of the chart (i.e., toward the low end or high end). This lean is reflected by a measure called skewness, and it’s important to pay attention to this when you’re analysing your data, as this will have an impact on what types of inferential statistics you can use on your dataset.
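To make skewness concrete, here is a minimal sketch of the moment-based skewness formula in plain Python, using a small illustrative dataset with a long right tail:

```python
def skewness(data):
    """Moment-based sample skewness: g1 = m3 / m2**1.5."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # 2nd central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # 3rd central moment
    return m3 / m2 ** 1.5

# A long right tail produces positive skewness; a left tail, negative
print(skewness([1, 2, 2, 3, 10]) > 0)  # True (right-skewed)
```

A perfectly symmetric dataset (for example, [1, 2, 3]) returns a skewness of zero.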

Example of skewness

Measures of dispersion

While the measures of central tendency provide insight into how “centred” the dataset is, it’s also important to understand how dispersed that dataset is. In other words, to what extent the data cluster toward the centre – specifically, the mean. In some cases, the majority of the data points will sit very close to the centre, while in other cases, they’ll be scattered all over the place. Enter the measures of dispersion, of which there are three:

Range, which measures the difference between the largest and smallest number in the dataset. In other words, it indicates how spread out the dataset really is.

Variance, which measures how much each number in a dataset varies from the mean (average). More technically, it calculates the average of the squared differences between each number and the mean. A higher variance indicates that the data points are more spread out, while a lower variance suggests that the data points are closer to the mean.

Standard deviation, which is the square root of the variance. It serves the same purposes as the variance, but is a bit easier to interpret as it presents a figure that is in the same unit as the original data. You’ll typically present this statistic alongside the means when describing the data in your research.
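All three measures of dispersion can be sketched with Python’s statistics module (the dataset below is illustrative; population formulas are used here, while variance/stdev give the n − 1 sample versions):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]          # illustrative dataset

rng = max(data) - min(data)               # range: 9 - 2 = 7
var = statistics.pvariance(data)          # population variance: 4
sd = statistics.pstdev(data)              # standard deviation = sqrt(variance): 2.0

print(rng, var, sd)  # 7 4 2.0
```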

Again, let’s look at our sample dataset to make this all a little more tangible.

Example dataset showing range, variance and standard deviation

As you can see, the range of 8 reflects the difference between the highest rating (10) and the lowest rating (2). The standard deviation of 2.18 tells us that, on average, results within the dataset are 2.18 away from the mean (of 5.8), reflecting a relatively dispersed set of data.

For the sake of comparison, let’s look at another much more tightly grouped (less dispersed) dataset.

Example of skewed data

As you can see, all the ratings lie between 5 and 8 in this dataset, resulting in a much smaller range, variance and standard deviation. You might also notice that the data are clustered toward the right side of the graph – in other words, the data are skewed. If we calculate the skewness for this dataset, we get a result of -0.12; the negative value confirms this lean toward the higher end (the tail stretches to the left, so the distribution is negatively skewed).

In summary, range, variance and standard deviation all provide an indication of how dispersed the data are. These measures are important because they help you interpret the measures of central tendency within context. In other words, if your measures of dispersion are all fairly high numbers, you need to interpret your measures of central tendency with some caution, as the results are not particularly centred. Conversely, if the data are all tightly grouped around the mean (i.e., low dispersion), the mean becomes a much more “meaningful” statistic.

Key Takeaways

We’ve covered quite a bit of ground in this post. Here are the key takeaways:

  • Descriptive statistics, although relatively simple, are a critically important part of any quantitative data analysis.
  • Measures of central tendency include the mean (average), median and mode.
  • Skewness indicates whether a dataset leans to one side or another.
  • Measures of dispersion include the range, variance and standard deviation.



Purdue Online Writing Lab Purdue OWL® College of Liberal Arts

Descriptive Statistics

This page is brought to you by the OWL at Purdue University.

Copyright ©1995-2018 by The Writing Lab & The OWL at Purdue and Purdue University. All rights reserved. This material may not be published, reproduced, broadcast, rewritten, or redistributed without permission. Use of this site constitutes acceptance of our terms and conditions of fair use.

The mean, the mode, the median, the range, and the standard deviation are all examples of descriptive statistics. Descriptive statistics are used because in most cases, it isn't possible to present all of your data in any form that your reader will be able to quickly interpret.

Generally, when writing descriptive statistics, you want to present at least one form of central tendency (or average), that is, either the mean, median, or mode. In addition, you should present one form of variability, usually the standard deviation.

Measures of Central Tendency and Other Commonly Used Descriptive Statistics

The mean, median, and the mode are all measures of central tendency. They attempt to describe what the typical data point might look like. In essence, they are all different forms of 'the average.' When writing statistics, you never want to say 'average' because it is difficult, if not impossible, for your reader to understand if you are referring to the mean, the median, or the mode.

The mean is the most common form of central tendency, and is what most people usually are referring to when they say average. It is simply the total sum of all the numbers in a data set, divided by the total number of data points. For example, the following data set has a mean of 4: {-1, 0, 1, 16}. That is, the values sum to 16, and 16 divided by 4 is 4. If there isn't a good reason to use one of the other forms of central tendency, then you should use the mean to describe the central tendency.

The median is simply the middle value of a data set. In order to calculate the median, all values in the data set need to be ordered, from either highest to lowest, or vice versa. If there are an odd number of values in a data set, then the median is easy to calculate. If there is an even number of values in a data set, then the calculation becomes more difficult. Statisticians still debate how to properly calculate a median when there is an even number of values, but for most purposes, it is appropriate to simply take the mean of the two middle values. The median is useful when describing data sets that are skewed or have extreme values. Incomes of baseball players, for example, are commonly reported using a median because a small minority of baseball players makes a lot of money, while most players make more modest amounts. The median is less influenced by extreme scores than the mean.

The mode is the most commonly occurring number in the data set. The mode is best used when you want to indicate the most common response or item in a data set. For example, if you wanted to predict the score of the next football game, you may want to know what the most common score is for the visiting team, but having an average score of 15.3 won't help you if it is impossible to score 15.3 points. Likewise, a median score may not be very informative either, if you are interested in what score is most likely.

Standard Deviation

The standard deviation is a measure of variability (it is not a measure of central tendency). Conceptually it is best viewed as the 'average distance that individual data points are from the mean.' Data sets that are highly clustered around the mean have lower standard deviations than data sets that are spread out.

For example, the first data set would have a higher standard deviation than the second data set:

Notice that both groups have the same mean (5) and median (also 5), but the two groups contain different numbers and are organized much differently. This organization of a data set is often referred to as a distribution. Because the two data sets above have the same mean and median, but different standard deviation, we know that they also have different distributions. Understanding the distribution of a data set helps us understand how the data behave.
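Since the two example data sets appeared in a figure, here is a hypothetical pair with the same properties: both share a mean and median of 5, but the first is far more spread out than the second.

```python
import statistics

# Hypothetical stand-ins for the two groups described above
spread_out = [1, 3, 5, 7, 9]
clustered = [4, 5, 5, 5, 6]

assert statistics.mean(spread_out) == statistics.mean(clustered) == 5
assert statistics.median(spread_out) == statistics.median(clustered) == 5

print(statistics.pstdev(spread_out))  # larger standard deviation
print(statistics.pstdev(clustered))   # smaller standard deviation
```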


Statistics - explanations and formulas

Descriptive Statistics


Descriptive statistics are techniques used for describing, graphing, organizing and summarizing quantitative data . They describe something, either visually or statistically, about individual variables or the association among two or more variables. For instance, a social researcher may want to know how many people in his/her study are male or female, what the average age of the respondents is, or what the median income is. Researchers often need to know how closely their data represent the population from which it is drawn so that they can assess the data’s representativeness.

Descriptive statistics include the mean, standard deviation, mode, and median.

Descriptive information gives researchers a general picture of their data, as opposed to an explanation for why certain variables may be associated with each other. Descriptive statistics are often contrasted with inferential statistics, which are used to make inferences, or to explain factors, about the population. Data can be summarized at the univariate level with visual pictures, such as graphs, histograms, and pie charts. Statistical techniques used to describe individual variables include frequencies, the mean, median, mode, cumulative percent, percentile, standard deviation, variance, and interquartile range. Data can also be summarized at the bivariate level. Measures of association between two variables include calculations of eta, gamma, lambda, Pearson’s r, Kendall’s tau, Spearman’s rho, and chi-square (χ²), among others. Bivariate relationships can also be illustrated in visual graphs that describe the association between two variables.

(from Oxford Reference Online )

  • Last Updated: Apr 1, 2024 8:54 AM
  • URL: https://libguides.und.edu/statistics

2.8 Descriptive Statistics


Class Time:

  • The student will construct a histogram and a box plot.
  • The student will calculate univariate statistics.
  • The student will examine the graphs to interpret what the data implies.

Collect the Data

Record the number of pairs of shoes you own.

  • x̄ = _____
  • Are the data discrete or continuous? How do you know?
  • In complete sentences, describe the shape of the histogram.
  • Are there any potential outliers? List the value(s) that could be outliers. Use a formula to check the end values to determine if they are potential outliers.
  • Min = _____
  • Max = _____
  • Q1 = _____
  • Q3 = _____
  • IQR = _____
  • Construct a box plot of data
  • What does the shape of the box plot imply about the concentration of data? Use complete sentences.
  • Using the box plot, how can you determine if there are potential outliers?
  • How does the standard deviation help you to determine concentration of the data and whether or not there are potential outliers?
  • What does the IQR represent in this problem?
  • above the mean.
  • below the mean.
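As a sketch of how the outlier check in this lab might be done, here is the usual 1.5 × IQR rule in Python. The shoe counts are hypothetical; statistics.quantiles with n=4 returns Q1, the median, and Q3:

```python
import statistics

shoes = [2, 4, 4, 5, 6, 7, 8, 9, 12, 25]  # hypothetical shoe counts

q1, _, q3 = statistics.quantiles(shoes, n=4)  # quartiles (exclusive method)
iqr = q3 - q1
lower = q1 - 1.5 * iqr   # values below this are potential outliers
upper = q3 + 1.5 * iqr   # values above this are potential outliers

outliers = [x for x in shoes if x < lower or x > upper]
print(outliers)  # [25]
```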


Want to cite, share, or modify this book? This book uses the Creative Commons Attribution License and you must attribute OpenStax.

Access for free at https://openstax.org/books/introductory-statistics/pages/1-introduction
  • Authors: Barbara Illowsky, Susan Dean
  • Publisher/website: OpenStax
  • Book title: Introductory Statistics
  • Publication date: Sep 19, 2013
  • Location: Houston, Texas
  • Book URL: https://openstax.org/books/introductory-statistics/pages/1-introduction
  • Section URL: https://openstax.org/books/introductory-statistics/pages/2-8-descriptive-statistics

© Jun 23, 2022 OpenStax. Textbook content produced by OpenStax is licensed under a Creative Commons Attribution License . The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, and OpenStax CNX logo are not subject to the Creative Commons license and may not be reproduced without the prior and express written consent of Rice University.


Descriptive Statistics: Definition, Overview, Types, and Example

Adam Hayes, Ph.D., CFA, is a financial writer with 15+ years Wall Street experience as a derivatives trader. Besides his extensive derivative trading expertise, Adam is an expert in economics and behavioral finance. Adam received his master's in economics from The New School for Social Research and his Ph.D. from the University of Wisconsin-Madison in sociology. He is a CFA charterholder as well as holding FINRA Series 7, 55 & 63 licenses. He currently researches and teaches economic sociology and the social studies of finance at the Hebrew University in Jerusalem.


Descriptive statistics are brief informational coefficients that summarize a given data set, which can be either a representation of the entire population or a sample of a population. Descriptive statistics are broken down into measures of central tendency and measures of variability (spread). Measures of central tendency include the mean, median, and mode, while measures of variability include standard deviation, variance, minimum and maximum variables, kurtosis , and skewness .

Key Takeaways

  • Descriptive statistics summarizes or describes the characteristics of a data set.
  • Descriptive statistics consists of three basic categories of measures: measures of central tendency, measures of variability (or spread), and frequency distribution.
  • Measures of central tendency describe the center of the data set (mean, median, mode).
  • Measures of variability describe the dispersion of the data set (variance, standard deviation).
  • Measures of frequency distribution describe the occurrence of data within the data set (count).


Understanding Descriptive Statistics

Descriptive statistics help describe and understand the features of a specific data set by giving short summaries about the sample and measures of the data. The most recognized types of descriptive statistics are measures of center. For example, the mean , median , and mode , which are used at almost all levels of math and statistics, are used to define and describe a data set. The mean, or the average, is calculated by adding all the figures within the data set and then dividing by the number of figures within the set.

For example, the sum of the following data set is 20: (2, 3, 4, 5, 6). The mean is 4 (20/5). The mode of a data set is the value appearing most often, and the median is the figure situated in the middle of the data set. It is the figure separating the higher figures from the lower figures within a data set. However, there are less common types of descriptive statistics that are still very important.

People use descriptive statistics to repurpose hard-to-understand quantitative insights across a large data set into bite-sized descriptions. A student's grade point average (GPA), for example, provides a good understanding of descriptive statistics. The idea of a GPA is that it takes data points from a wide range of exams, classes, and grades, and averages them together to provide a general understanding of a student's overall academic performance. A student's personal GPA reflects their mean academic performance.

Descriptive statistics, especially in fields such as medicine, often visually depict data using scatter plots, histograms, line graphs, or stem and leaf displays. We'll talk more about visuals later in this article.

Types of Descriptive Statistics

Descriptive statistics fall into three categories: measures of central tendency, measures of variability (also known as measures of dispersion), and measures of frequency distribution.

Central Tendency

Measures of central tendency focus on the average or middle values of data sets, whereas measures of variability focus on the dispersion of data. These two measures use graphs, tables and general discussions to help people understand the meaning of the analyzed data.

Measures of central tendency describe the center position of a distribution for a data set. A person analyzes the frequency of each data point in the distribution and describes it using the mean, median, or mode, which measures the most common patterns of the analyzed data set.

Measures of Variability

Measures of variability (or the measures of spread) aid in analyzing how dispersed the distribution is for a set of data. For example, while the measures of central tendency may give a person the average of a data set, it does not describe how the data is distributed within the set.

So while the average of the data may be 65 out of 100, there can still be data points at both 1 and 100. Measures of variability help communicate this by describing the shape and spread of the data set. Range, quartiles, absolute deviation, and variance are all examples of measures of variability.

Consider the following data set: 5, 19, 24, 62, 91, 100. The range of that data set is 95, which is calculated by subtracting the lowest number (5) in the data set from the highest (100).
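The range calculation above is a one-liner:

```python
data = [5, 19, 24, 62, 91, 100]
data_range = max(data) - min(data)  # highest minus lowest
print(data_range)  # 95
```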

Distribution

Distribution (or frequency distribution) refers to the number of times a data point occurs (or, conversely, the number of times a data point fails to occur). Consider a data set: male, male, female, female, female, other. The distribution of this data can be classified as:

  • The number of males in the data set is 2.
  • The number of females in the data set is 3.
  • The number of individuals identifying as other is 1.
  • The number of non-males is 4.
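The frequency counts listed above can be tallied with collections.Counter:

```python
from collections import Counter

responses = ["male", "male", "female", "female", "female", "other"]
counts = Counter(responses)

print(counts["male"], counts["female"], counts["other"])  # 2 3 1
non_males = len(responses) - counts["male"]
print(non_males)  # 4
```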

Univariate vs. Bivariate

In descriptive statistics, univariate analysis examines only one variable. It is used to identify the characteristics of a single trait and is not used to analyze relationships or causation.

For example, imagine a room full of high school students. Say you wanted to gather the average age of the individuals in the room. This univariate data is only dependent on one factor: each person's age. By gathering this one piece of information from each person and dividing by the total number of people, you can determine the average age.

Bivariate data, on the other hand, attempts to link two variables by searching for correlation. Two types of data are collected, and the relationship between the two pieces of information is analyzed together. Because multiple variables are analyzed, this approach may also be referred to as multivariate .

Let's say each high school student in the example above takes a college assessment test, and we want to see whether older students are testing better than younger students. In addition to gathering the age of the students, we need to gather each student's test score. Then, using data analytics, we mathematically or graphically depict whether there is a relationship between student age and test scores.

The preparation and reporting of financial statements is an example of descriptive statistics. Analyzing that financial information to make decisions on the future is inferential statistics.

Descriptive Statistics and Visualizations

One essential aspect of descriptive statistics is graphical representation. Visualizing data distributions effectively can be incredibly powerful, and this is done in several ways.

Histograms are tools for displaying the distribution of numerical data. They divide the data into bins or intervals and represent the frequency or count of data points falling into each bin through bars of varying heights. Histograms help identify the shape of the distribution, central tendency, and variability of the data.

Another visualization is boxplots. Boxplots, also known as box-and-whisker plots, provide a concise summary of a data distribution by highlighting key summary statistics including the median (middle line inside the box), quartiles (edges of the box), and potential outliers (points outside the "whiskers"). Boxplots visually depict the spread and skewness of the data and are particularly useful for comparing distributions across different groups or variables.
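The five-number summary a boxplot displays (minimum, Q1, median, Q3, maximum) can be sketched with the standard library, reusing the illustrative dataset from earlier:

```python
import statistics

data = [5, 19, 24, 62, 91, 100]

# statistics.quantiles with n=4 returns the three quartile cut points
q1, median, q3 = statistics.quantiles(data, n=4)
print(min(data), q1, median, q3, max(data))
```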

Descriptive Statistics and Outliers

Anytime descriptive statistics are being discussed, it's important to note outliers. Outliers are data points that significantly differ from other observations in a dataset. These could be errors, anomalies, or rare events within the data.

Detecting and managing outliers is a step in descriptive statistics to ensure accurate and reliable data analysis. To identify outliers, you can use graphical techniques (such as boxplots or scatter plots) or statistical methods (such as Z-score or IQR method). These approaches help pinpoint observations that deviate substantially from the overall pattern of the data.

The presence of outliers can have a notable impact on descriptive statistics. This is vitally important, as outliers can skew results and affect the interpretation of data. Outliers can disproportionately influence measures of central tendency, such as the mean, pulling it towards their extreme values. For example, the mean of the dataset (1, 1, 1, 997) is 250, even though 250 is hardly representative of the dataset. This distortion can lead to misleading conclusions about the typical behavior of the dataset.
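The example above is easy to verify, and it also shows why the median is often preferred for skewed data:

```python
import statistics

data = [1, 1, 1, 997]

print(sum(data) / len(data))    # 250.0 - the outlier drags the mean up
print(statistics.median(data))  # 1.0   - the median resists the outlier
```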

Depending on the context, outliers may either be removed (if they are genuinely erroneous or irrelevant) or kept, since they may hold important information. As you analyze your data, consider what the outliers can contribute and whether it makes more sense to strike those data points from your descriptive statistic calculations.

Descriptive Statistics vs. Inferential Statistics

Descriptive statistics have a different function than inferential statistics, which use data to make decisions or to apply characteristics from one data set to another.

Imagine another example where a company sells hot sauce. The company gathers data such as the count of sales , average quantity purchased per transaction , and average sale per day of the week. All of this information is descriptive, as it tells a story of what actually happened in the past. In this case, it is not being used beyond being informational.

Let's say the same company wants to roll out a new hot sauce. It gathers the same sales data above, but it crafts the information to make predictions about what the sales of the new hot sauce will be. The act of using descriptive statistics and applying characteristics to a different data set makes the data set inferential statistics. We are no longer simply summarizing data; we are using it to predict what will happen regarding an entirely different body of data (the new hot sauce product).

What Is Descriptive Statistics?

Descriptive statistics is a means of describing features of a data set by generating summaries about data samples. It's often depicted as a summary of data shown that explains the contents of data. For example, a population census may include descriptive statistics regarding the ratio of men and women in a specific city.

What Are Examples of Descriptive Statistics?

Descriptive statistics are informational and meant to describe the actual characteristics of a data set. When analyzing numbers regarding the prior Major League Baseball season, descriptive statistics include the highest batting average for a single player, the number of runs allowed per team, and the average wins per division.

What Is the Main Purpose of Descriptive Statistics?

The main purpose of descriptive statistics is to provide information about a data set. In the example above, there are hundreds of baseball players who engage in thousands of games. Descriptive statistics summarize the large amount of data into several useful bits of information.

What Are the Types of Descriptive Statistics?

The three main types of descriptive statistics are frequency distribution, central tendency, and variability of a data set. The frequency distribution records how often data occurs, central tendency records the data's center point of distribution, and variability of a data set records its degree of dispersion.

Can Descriptive Statistics Be Used to Make Inference or Predictions?

No. While these descriptives help understand data attributes, inferential statistical techniques—a separate branch of statistics—are required to understand how variables interact with one another in a data set.

The Bottom Line

Descriptive statistics refers to the analysis, summary, and communication of findings that describe a data set. Although often not used for decision-making on its own, descriptive statistics holds value in explaining high-level summaries of a set of information such as the mean, median, mode, variance, range, and count of information.


A Guide on Data Analysis

3 Descriptive Statistics

When you have an area of interest that you want to research, a problem that you want to solve, a relationship that you want to investigate, theoretical and empirical processes will help you.

An estimand is defined as “a quantity of scientific interest that can be calculated in the population and does not change its value depending on the data collection design used to measure it (i.e., it does not vary with sample size and survey design, or the number of non-respondents, or follow-up efforts).” (Rubin 1996)

Estimands include:

  • population means
  • population variances
  • correlations
  • factor loadings
  • regression coefficients

3.1 Numerical Measures

There are differences between a population and a sample.

Order Statistics: \(y_{(1)},y_{(2)},...,y_{(n)}\) where \(y_{(1)}\le y_{(2)}\le ... \le y_{(n)}\)

Coefficient of variation: the standard deviation divided by the mean. This is a stable, dimensionless statistic, useful for comparison across variables measured on different scales.
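A minimal sketch of the coefficient of variation in Python; the measurement data are hypothetical, and the sample standard deviation (n − 1 denominator) is used:

```python
import statistics

def coefficient_of_variation(data):
    """CV = standard deviation / mean (dimensionless)."""
    return statistics.stdev(data) / statistics.mean(data)

# Hypothetical measurements in different units can still be compared
heights_cm = [160, 170, 180]
weights_kg = [55, 70, 85]

print(coefficient_of_variation(heights_cm))  # ~0.059
print(coefficient_of_variation(weights_kg))  # ~0.214 (relatively more variable)
```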

Symmetric: mean = median, skewness = 0

Skewed right: mean > median, skewness > 0

Skewed left: mean < median, skewness < 0

Central moments: \(\mu=E(Y)\) , \(\mu_2 = \sigma^2=E(Y-\mu)^2\) , \(\mu_3 = E(Y-\mu)^3\) , \(\mu_4 = E(Y-\mu)^4\)

For normal distributions, \(\mu_3=0\), so the skewness \(g_1 = \mu_3 / \sigma^3 = 0\)

\(\hat{g_1}\) is distributed approximately as \(N(0,6/n)\) if the sample is from a normal population (valid when \(n > 150\))

  • For large samples, inference on skewness can be based on normal tables with 95% confidence interval for \(g_1\) as \(\hat{g_1}\pm1.96\sqrt{6/n}\)
  • For small samples, special tables from Snedecor and Cochran 1989, Table A 19(i) or Monte Carlo test
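The large-sample confidence interval above can be sketched directly; the sample skewness of 0.2 and n = 200 below are hypothetical values:

```python
import math

def skewness_ci_95(g1_hat, n):
    """Large-sample 95% CI for skewness: g1_hat +/- 1.96 * sqrt(6/n)."""
    half_width = 1.96 * math.sqrt(6 / n)
    return (g1_hat - half_width, g1_hat + half_width)

lo, hi = skewness_ci_95(0.2, 200)
print(lo < 0 < hi)  # True: zero lies inside the interval, consistent with normality
```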

Kurtosis is defined as \(g_2^*=\frac{E(Y-\mu)^4}{\sigma^4}\); for a normal distribution, \(g_2^*=3\). Kurtosis is therefore often redefined (as excess kurtosis) as \(g_2=\frac{E(Y-\mu)^4}{\sigma^4}-3\), where the 4th central moment is estimated by \(m_4=\sum_{i=1}^{n}(y_i-\overline{y})^4/n\)

  • The asymptotic sampling distribution of \(\hat{g_2}\) is approximately \(N(0,24/n)\) (valid when \(n > 1000\))
  • Large-sample inference on kurtosis uses standard normal tables
  • Small samples use tables by Snedecor and Cochran, 1989, Table A 19(ii) or Geary 1936

3.2 Graphical Measures

3.2.1 Shape

It’s a good habit to label your graph, so others can easily follow.

Other, more advanced plots are also available.

3.2.2 Scatterplot

3.3 Normality Assessment

Since the normal (Gaussian) distribution has many applications, we typically hope that our data, or our variable, is normally distributed. Hence, we have to assess normality based not only on numerical measures but also on graphical measures.

3.3.1 Graphical Assessment

Normal probability (Q-Q) plot

The straight line represents the theoretical line for normally distributed data. The dots represent real empirical data that we are checking. If all the dots fall on the straight line, we can be confident that our data follow a normal distribution. If our data wiggle and deviate from the line, we should be concerned with the normality assumption.

3.3.2 Summary Statistics

Sometimes it’s hard to tell whether your data follow the normal distribution just by looking at the graph. Hence, we often have to conduct statistical tests to aid our decision. Common tests are:

Methods based on normal probability plot

  • Correlation Coefficient with Normal Probability Plots
  • Shapiro-Wilk Test

Methods based on empirical cumulative distribution function

  • Anderson-Darling Test
  • Kolmogorov-Smirnov Test
  • Cramer-von Mises Test
  • Jarque–Bera Test

3.3.2.1 Methods based on normal probability plot

3.3.2.1.1 Correlation Coefficient with Normal Probability Plots

(Looney and Gulledge Jr 1985; Samuel S. Shapiro and Francia 1972) The correlation coefficient between \(y_{(i)}\) and \(m_i^*\), as given on the normal probability plot, is:

\[W^*=\frac{\sum_{i=1}^{n}(y_{(i)}-\bar{y})(m_i^*-0)}{(\sum_{i=1}^{n}(y_{(i)}-\bar{y})^2\sum_{i=1}^{n}(m_i^*-0)^2)^{0.5}}\]

where \(\bar{m^*}=0\)

Pearson product moment formula for correlation:

\[\hat{\rho}=\frac{\sum_{i=1}^{n}(y_i-\bar{y})(x_i-\bar{x})}{(\sum_{i=1}^{n}(y_{i}-\bar{y})^2\sum_{i=1}^{n}(x_i-\bar{x})^2)^{0.5}}\]

  • When the correlation is 1, the plot is exactly linear and normality is assumed.
  • The closer the correlation is to zero, the more confident we are in rejecting normality
  • Inference on W* needs to be based on special tables ( Looney and Gulledge Jr 1985 )

3.3.2.1.2 Shapiro-Wilk Test

(Shapiro and Wilk 1965)

\[W=\left(\frac{\sum_{i=1}^{n}a_i(y_{(i)}-\bar{y})(m_i^*-0)}{(\sum_{i=1}^{n}a_i^2(y_{(i)}-\bar{y})^2\sum_{i=1}^{n}(m_i^*-0)^2)^{0.5}}\right)^2\]

where \(a_1,..,a_n\) are weights computed from the covariance matrix for the order statistics.

  • Researchers typically use this test to assess normality (for \(n < 2000\) ). Under normality, W is close to 1, just like \(W^*\) . Notice that the only difference between \(W\) and \(W^*\) is the weights.
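In practice the test is rarely computed by hand; a sketch using SciPy's implementation (assumes `scipy` is installed; the data values are illustrative):

```python
from scipy import stats

data = [4.8, 5.1, 5.3, 4.9, 5.0, 5.2, 5.4, 4.7, 5.1, 5.0]
W, p = stats.shapiro(data)
# W close to 1 is consistent with normality; reject normality for small p
print(W, p)
```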

3.3.2.2 Methods based on empirical cumulative distribution function

The formula for the empirical cumulative distribution function (CDF) is:

\(F_n(t)\) = estimate of the probability that an observation is \(\le t\) = (number of observations \(\le t\) ) / n

This method requires large sample sizes, but it can be applied to distributions other than the normal (Gaussian) one.
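The empirical CDF definition above translates directly into code; a minimal sketch (the sample values are illustrative):

```python
def ecdf(data, t):
    # F_n(t) = (number of observations <= t) / n
    return sum(1 for v in data if v <= t) / len(data)

sample = [3, 1, 4, 1, 5, 9, 2, 6]
print(ecdf(sample, 4))  # 0.625  (5 of the 8 observations are <= 4)
```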


3.3.2.2.1 Anderson-Darling Test

The Anderson-Darling statistic ( T. W. Anderson and Darling 1952 ) :

\[A^2=\int_{-\infty}^{\infty}(F_n(t)-F(t))^2\frac{dF(t)}{F(t)(1-F(t))}\]

  • a weighted average of squared deviations (it weights small and large values of t more heavily)

For the normal distribution,

\(A^2 = -\sum_{i=1}^{n}(2i-1)\left(\ln(p_i) +\ln(1-p_{n+1-i})\right)/n-n\)

where \(p_i=\Phi(\frac{y_{(i)}-\bar{y}}{s})\) , the probability that a standard normal variable is less than \(\frac{y_{(i)}-\bar{y}}{s}\)

Reject the normality assumption when \(A^2\) is too large.

Evaluate the null hypothesis that the observations are randomly selected from a normal population using the critical values provided by Marsaglia and Marsaglia (2004) and Stephens (1974).

This test can be applied to other distributions:

  • Exponential
  • Extreme-value
  • Weibull: log(Weibull) = Gumbel
  • Log-normal (two-parameter)

Consult ( Stephens 1974 ) for more detailed transformation and critical values.
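The normal-case statistic above can be computed directly; a minimal sketch (Φ comes from `math.erf`; the data values are illustrative, and no finite-sample correction is applied):

```python
import math

def phi(z):
    # standard normal CDF
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def anderson_darling_normal(y):
    n = len(y)
    ys = sorted(y)
    ybar = sum(ys) / n
    s = math.sqrt(sum((v - ybar) ** 2 for v in ys) / (n - 1))  # sample sd
    p = [phi((v - ybar) / s) for v in ys]
    # A^2 = -sum_{i=1}^{n} (2i-1) [ln p_i + ln(1 - p_{n+1-i})] / n - n
    return -sum((2 * i - 1) * (math.log(p[i - 1]) + math.log(1 - p[n - i]))
                for i in range(1, n + 1)) / n - n

print(anderson_darling_normal([4.8, 5.1, 5.3, 4.9, 5.0, 5.2, 5.4, 4.7, 5.1, 5.0]))
```

For these near-normal values the statistic comes out small; compare it with the tabulated critical values to make a decision.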

3.3.2.2.2 Kolmogorov-Smirnov Test

  • Based on the largest absolute difference between the empirical and expected cumulative distributions
  • A variation of the K-S test is Kuiper’s test
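The K-S statistic against a fully specified normal \(N(\mu,\sigma^2)\) can be sketched as follows (illustrative data; note that when \(\mu\) and \(\sigma\) are estimated from the same sample, a correction such as Lilliefors' is needed for valid p-values):

```python
import math

def phi(z):
    # standard normal CDF
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def ks_statistic_normal(y, mu, sigma):
    # D = max_i max( i/n - F(y_(i)),  F(y_(i)) - (i-1)/n )
    n = len(y)
    d = 0.0
    for i, v in enumerate(sorted(y), start=1):
        f = phi((v - mu) / sigma)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

print(ks_statistic_normal([4.8, 5.1, 5.3, 4.9, 5.0, 5.2, 5.4, 4.7, 5.1, 5.0], 5.0, 0.25))
```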

3.3.2.2.3 Cramer-von Mises Test

  • Based on the average squared discrepancy between the empirical distribution and a given theoretical distribution. Each discrepancy is weighted equally (unlike the Anderson-Darling test, which weights the tails more heavily)

3.3.2.2.4 Jarque–Bera Test

( Bera and Jarque 1981 )

Based on the skewness and kurtosis to test normality.

\(JB = \frac{n}{6}(S^2+(K-3)^2/4)\) where \(S\) is the sample skewness and \(K\) is the sample kurtosis

\(S=\frac{\hat{\mu_3}}{\hat{\sigma}^3}=\frac{\sum_{i=1}^{n}(x_i-\bar{x})^3/n}{(\sum_{i=1}^{n}(x_i-\bar{x})^2/n)^\frac{3}{2}}\)

\(K=\frac{\hat{\mu_4}}{\hat{\sigma}^4}=\frac{\sum_{i=1}^{n}(x_i-\bar{x})^4/n}{(\sum_{i=1}^{n}(x_i-\bar{x})^2/n)^2}\)

Recall that \(\hat{\sigma^2}\) is the estimate of the second central moment (variance); \(\hat{\mu_3}\) and \(\hat{\mu_4}\) are the estimates of the third and fourth central moments.

If the data comes from a normal distribution, the JB statistic asymptotically has a chi-squared distribution with two degrees of freedom.

The null hypothesis is a joint hypothesis of the skewness being zero and the excess kurtosis being zero.
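The JB statistic follows directly from the moment formulas for \(S\) and \(K\) above; a minimal sketch (the data values are illustrative):

```python
def jarque_bera(x):
    n = len(x)
    xbar = sum(x) / n
    m2 = sum((v - xbar) ** 2 for v in x) / n
    m3 = sum((v - xbar) ** 3 for v in x) / n
    m4 = sum((v - xbar) ** 4 for v in x) / n
    S = m3 / m2 ** 1.5          # sample skewness
    K = m4 / m2 ** 2            # sample kurtosis
    return n / 6 * (S ** 2 + (K - 3) ** 2 / 4)

print(round(jarque_bera([2, 4, 4, 4, 5, 5, 7, 9]), 5))  # 0.59017
```

Compare the result against a chi-squared distribution with two degrees of freedom (for large n).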

3.4 Bivariate Statistics

Correlation between

  • Two Continuous variables
  • Two Discrete variables
  • Categorical and Continuous

Questions to keep in mind:

  • Is the relationship linear or non-linear?
  • If the variable is continuous, is it normal and homoscedastic?
  • How big is your dataset?

3.4.1 Two Continuous

3.4.1.1 Pearson Correlation

  • Good with linear relationship

3.4.1.2 Spearman Correlation

3.4.2 Categorical and Continuous

3.4.2.1 Point-Biserial Correlation

Similar to the Pearson correlation coefficient, the point-biserial correlation coefficient is between -1 and 1 where:

  • -1 means a perfectly negative correlation between the two variables
  • 0 means no correlation between the two variables
  • 1 means a perfectly positive correlation between the two variables
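The point-biserial coefficient is just the Pearson correlation with the dichotomous variable coded 0/1, so it can be computed by hand; a sketch with illustrative data:

```python
import math

def point_biserial(binary, y):
    # identical to the Pearson correlation with the dichotomous variable coded 0/1
    n = len(y)
    mb, my = sum(binary) / n, sum(y) / n
    cov = sum((b - mb) * (v - my) for b, v in zip(binary, y))
    sb = math.sqrt(sum((b - mb) ** 2 for b in binary))
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    return cov / (sb * sy)

group = [0, 0, 0, 1, 1, 1]                # e.g., control vs. treatment
score = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]    # illustrative continuous outcome
print(round(point_biserial(group, score), 4))  # 0.8783
```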

Alternatively

3.4.2.2 Logistic Regression

See 3.4.2.2

3.4.3 Two Discrete

3.4.3.1 distance metrics.

Some consider distance not to be a correlation metric because it is not unit-independent (i.e., if you rescale the data, the metric changes), but it is still a useful proxy. Distance metrics are more often used as similarity measures.

  • Euclidean Distance
  • Manhattan Distance
  • Chessboard Distance
  • Minkowski Distance
  • Canberra Distance
  • Hamming Distance
  • Cosine Distance
  • Sum of Absolute Distance
  • Sum of Squared Distance
  • Mean-Absolute Error

3.4.3.2 Statistical Metrics

3.4.3.2.1 Chi-Squared Test

3.4.3.2.1.1 Phi Coefficient

3.4.3.2.1.2 Cramer’s V

  • between nominal categorical variables (no natural order)

\[ \text{Cramer's V} = \sqrt{\frac{\chi^2/n}{\min(c-1,r-1)}} \]

where:

  • \(\chi^2\) = chi-square statistic
  • \(n\) = sample size
  • \(r\) = number of rows
  • \(c\) = number of columns
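The formula above only needs the chi-square statistic from the contingency table; a self-contained sketch (the table counts are illustrative):

```python
import math

def cramers_v(table):
    # table: list of rows of a contingency table of observed counts
    r, c = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(r)) for j in range(c)]
    chi2 = sum((table[i][j] - row_tot[i] * col_tot[j] / n) ** 2
               / (row_tot[i] * col_tot[j] / n)
               for i in range(r) for j in range(c))
    return math.sqrt(chi2 / n / min(c - 1, r - 1))

print(round(cramers_v([[10, 20], [20, 10]]), 4))  # 0.3333
```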

Alternatively, adjusted versions include:

  • ncchisq: noncentral chi-square
  • nchisqadj: adjusted noncentral chi-square
  • fisher: Fisher Z transformation
  • fisheradj: bias-corrected Fisher Z transformation

3.4.3.2.1.3 Tschuprow’s T

  • 2 nominal variables

3.4.3.3 Ordinal Association (Rank correlation)

  • Good with non-linear relationship

3.4.3.3.1 Ordinal and Nominal

3.4.3.3.1.1 Freeman’s Theta

  • Ordinal and nominal

3.4.3.3.1.2 Epsilon-squared

3.4.3.3.2 Two Ordinal

3.4.3.3.2.1 Goodman Kruskal’s Gamma

  • 2 ordinal variables

3.4.3.3.2.2 Somers’ D

or Somers’ Delta

3.4.3.3.2.3 Kendall’s Tau-b

3.4.3.3.2.4 Yule’s Q and Y

A special \((2 \times 2)\) version of the Goodman Kruskal’s Gamma coefficient.

\[ \text{Yule's Q} = \frac{ad - bc}{ad + bc} \]

We typically use Yule’s \(Q\) in practice, while Yule’s Y is related to \(Q\) as follows:

\[ \text{Yule's Y} = \frac{\sqrt{ad} - \sqrt{bc}}{\sqrt{ad} + \sqrt{bc}} \]

\[ Q = \frac{2Y}{1 + Y^2} \]

\[ Y = \frac{1 - \sqrt{1-Q^2}}{Q} \]
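For a \(2 \times 2\) table with cell counts \(a, b, c, d\), both coefficients are one-liners; a sketch with illustrative counts that also checks the \(Q\)–\(Y\) relationship:

```python
import math

def yules_q(a, b, c, d):
    return (a * d - b * c) / (a * d + b * c)

def yules_y(a, b, c, d):
    return (math.sqrt(a * d) - math.sqrt(b * c)) / (math.sqrt(a * d) + math.sqrt(b * c))

a, b, c, d = 30, 10, 10, 30   # illustrative 2x2 cell counts
Q, Y = yules_q(a, b, c, d), yules_y(a, b, c, d)
print(Q, Y)                   # 0.8 0.5
assert abs(Q - 2 * Y / (1 + Y ** 2)) < 1e-12   # Q = 2Y / (1 + Y^2)
```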

3.4.3.3.2.5 Tetrachoric Correlation

  • is a special case of Polychoric Correlation when both variables are binary

3.4.3.3.2.6 Polychoric Correlation

  • between ordinal categorical variables (natural order).
  • Assumption: the ordinal variable is a discrete representation of a latent, normally distributed continuous variable (e.g., income = low, medium, high).

3.5 Summary

Get the correlation table for continuous variables only


Comparing correlations across different types of variables (e.g., continuous vs. categorical) can be problematic, as can distinguishing non-linear from linear relationships. One solution is to use mutual information from information theory (i.e., how much knowing one variable reduces uncertainty about the other).

To implement mutual information, we have the following approximations

\[ \downarrow \text{prediction error} \approx \downarrow \text{uncertainty} \approx \uparrow \text{association strength} \]

More specifically, following the X2Y metric , we have the following steps:

1. Predict \(y\) without \(x\) (i.e., a baseline model): use the average of \(y\) when \(y\) is continuous, or the most frequent value when \(y\) is categorical.

2. Predict \(y\) with \(x\) (e.g., linear model, random forest, etc.).

3. Calculate the difference in prediction error between steps 1 and 2.
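These steps can be sketched for a continuous \(y\) with MAE as the error metric (the data are illustrative, and a simple least-squares line stands in for the prediction model):

```python
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]
n = len(y)

# Step 1: baseline prediction -- the mean of y
baseline = sum(y) / n
mae_base = sum(abs(v - baseline) for v in y) / n

# Step 2: predict y from x with a simple least-squares line y = a + b*x
mx, my = sum(x) / n, sum(y) / n
b = sum((u - mx) * (v - my) for u, v in zip(x, y)) / sum((u - mx) ** 2 for u in x)
a = my - b * mx
mae_model = sum(abs(v - (a + b * u)) for u, v in zip(x, y)) / n

# Step 3: the relative error reduction proxies association strength
x2y = (mae_base - mae_model) / mae_base
print(round(x2y, 3))
```

Since these y values are nearly linear in x, knowing x removes most of the baseline error, so x2y comes out close to 1.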

To have a comprehensive table that can handle

  • continuous vs. continuous
  • categorical vs. continuous
  • continuous vs. categorical
  • categorical vs. categorical

the suggested model is Classification and Regression Trees (CART), but we can certainly use other models as well.

The drawbacks of this method are:

  • Asymmetry: \((x,y) \neq (y,x)\)
  • Comparability: different pairs of variables may be compared using different metrics (e.g., misclassification error vs. MAE)

3.5.1 Visualization

Correlation plots can be drawn as a basic matrix, in a more general form, as a heat map with the correlation values overlaid, or with further elaboration in ggplot2 (plots omitted).

Descriptive Statistics

Descriptive statistics is a branch of statistics that is concerned with describing the characteristics of the known data. Descriptive statistics provides summaries about either the population data or the sample data. Apart from descriptive statistics, inferential statistics is another crucial branch of statistics that is used to make inferences about the population data.

Descriptive statistics can be broadly classified into two categories - measures of central tendency and measures of dispersion. In this article, we will learn more about descriptive statistics, its various types, formulas, and see associated examples.

What are Descriptive Statistics?

Descriptive statistics are used to quantitatively or visually summarize the features of a sample. By using certain tools, data from a sample can be analyzed to identify trends or patterns. Descriptive statistics help to organize the data in a more manageable and readable format.

Descriptive Statistics Definition

Descriptive statistics can be defined as a field of statistics that is used to summarize the characteristics of a sample by utilizing certain quantitative techniques. It helps to provide simple and precise summaries of the sample and the observations using measures like mean, median, variance, graphs, and charts. Univariate descriptive statistics are used to describe data containing only one variable. On the other hand, bivariate and multivariate descriptive statistics are used to describe data with multiple variables.

Types of Descriptive Statistics

Measures of central tendency and measures of dispersion are two types of descriptive statistics that are used to quantitatively summarize the characteristics of grouped and ungrouped data. When an experiment is conducted, the raw data obtained is known as ungrouped data. When this data is organized logically it is known as grouped data. To visually represent data, descriptive statistics use graphs, charts, and tables. Some important types of descriptive statistics are given below.

Types of Descriptive Statistics

Measures of Central Tendency

In descriptive statistics, the measures of central tendency are used to describe data by determining a single representative central value. The important measures of central tendency are given below:

Mean: The mean can be defined as the sum of all observations divided by the total number of observations. The formulas for the mean are given as follows:

Ungrouped data Mean: x̄ = Σ\(x_{i}\) / n

Grouped data Mean: x̄ = \(\frac{\sum M_{i}f_{i}}{\sum f_{i}}\)

Here, \(x_{i}\) is the i th observation, \(M_{i}\) is the midpoint of the i th interval, \(f_{i}\) is the corresponding frequency and n is the sample size.

Median: The median can be defined as the center-most observation that is obtained by arranging the data in ascending order. The formulas for the median are given as follows:

Ungrouped data Median (n is odd): [(n + 1) / 2]th term

Ungrouped data Median (n is even): [(n / 2)th term + ((n / 2) + 1)th term] / 2

Grouped data Median: l + [((n / 2) - c) / f] × h

l is the lower limit of the median class (the class containing the (n / 2)th observation), c is the cumulative frequency of the class preceding the median class, f is the frequency of the median class and h is the class height.

Mode: The mode is the most frequently occurring observation in the data set. The formulas for the mode are given as follows:

Ungrouped data Mode: Most recurrent observation

Grouped data Mode: L + h \(\frac{\left(f_{m}-f_{1}\right)}{\left(f_{m}-f_{1}\right)+\left(f_{m}-f_{2}\right)}\)

L is the lower limit of the modal class, h is the class height, f\(_m\) is the frequency of the modal class, f\(_1\) is the frequency of the class preceding the modal class and f\(_2\) is the frequency of the class succeeding the modal class.
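For ungrouped data, these measures match Python's standard `statistics` module; a quick illustration:

```python
from statistics import mean, median, mode

data = [1, 4, 6, 1, 8, 15, 18, 1, 5, 1]
print(mean(data))    # 6
print(median(data))  # 4.5  (average of the 5th and 6th ordered values)
print(mode(data))    # 1
```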

Measures of Dispersion

In descriptive statistics, the measures of dispersion are used to determine how spread out a distribution is with respect to the central value. The important measures of dispersion are given below:

Range: The range can be defined as the difference between the highest value and the lowest value. The formula is given as follows:

Range = H - S

H is the highest value and S is the lowest value in a data set.

Variance: The variance gives the variability of the distribution with respect to the mean. The formulas for the variance are given as follows:

Grouped Data Sample Variance: s\(^2\) = \(\sum \frac{f\left ( M_{i}-\overline{X} \right )^{2}}{N - 1}\)

Grouped Data Population Variance: σ\(^2\) = \(\sum \frac{f\left ( M_{i}-\overline{X} \right )^{2}}{N}\)

Ungrouped Data Sample Variance: s\(^2\) = \(\sum \frac{\left ( X_{i}-\overline{X} \right )^{2}}{n - 1}\)

Ungrouped Data Population Variance: σ\(^2\) = \(\sum \frac{\left ( X_{i}-\overline{X} \right )^{2}}{n}\)

where, \(\overline{X}\) stands for mean, \(M_{i}\) is the midpoint of the i th interval, \(X_{i}\) is the i th data point, N is the summation of all frequencies and n is the number of observations.

Standard Deviation: The square root of the variance will result in the standard deviation . It helps to analyze the variability in a data set in a more effective manner as compared to the variance. The formula is given as follows:

Standard Deviation: S.D. = √Variance = σ

Mean Deviation: The mean deviation will give the average of the absolute value of the data about the mean, median, or mode. It is also known as absolute deviation. The formula is given as follows:

Mean Deviation = \(\sum_{1}^{n}\frac{|X - \overline{X}|}{n}\)

where \(\overline{X}\) is the central value.

Quartile Deviation: Half of the difference between the third and first quartile gives the quartile deviation . The formula is given as follows:

Quartile deviation = \(\frac{Q_{3}-Q_{1}}{2}\)
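For ungrouped data, these dispersion measures can be checked with the standard `statistics` module (sample variance uses the n − 1 denominator, population variance uses n):

```python
from statistics import variance, pvariance, stdev

data = [7, 11, 15, 18, 36, 43]
print(round(variance(data), 2))   # 209.47  (sample variance, n - 1)
print(round(pvariance(data), 2))  # 174.56  (population variance, n)
print(round(stdev(data), 2))      # 14.47   (sample standard deviation)
print(max(data) - min(data))      # 36      (range = H - S)
```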

Other measures of dispersion include the relative measures also known as the coefficients of dispersion.

Descriptive Statistics Representations

Descriptive statistics can also be used to summarize data visually before quantitative methods of analysis are applied to them. Some important forms of representations of descriptive statistics are as follows:

Frequency Distribution Tables: These can be either simple or grouped frequency distribution tables . They are used to show the distribution of values or classes along with the corresponding frequencies. Such tables are very useful in making charts as well as catching patterns in data.

Graphs and Charts: Graphs and charts help to represent data in a completely visual format. It can be used to display percentages, distributions, and frequencies. Scatter plots , bar graphs , pie charts , etc., are some graphs that are used in descriptive statistics.

Descriptive Statistics Examples

Descriptive statistics help to provide the summary statistics for different data sets thereby, enabling comparison. The descriptive statistics examples are given as follows:

  • Suppose the marks of students belonging to class A are {70, 85, 90, 65} and class B are {60, 40, 89, 96}. Then the average marks of each class are given by the mean as 77.5 and 71.25 respectively. This shows that the average of class A is higher than that of class B.
  • Using the same example, suppose it needs to be determined how far apart the most extreme responses are then the range is used. Range A = 25 and Range B = 56, thus, depicting that the range of class B is higher than the range of class A.

Descriptive Statistics vs Inferential Statistics

Inferential and descriptive statistics are both used to analyze data. Descriptive statistics helps to describe the data quantitatively while inferential statistics uses these parameters to make inferences about the population. The differences between descriptive statistics and inferential statistics are given below.

Related Articles:

  • Data Collection Methods
  • Summary statistics
  • How to Find Median

Important Notes on Descriptive Statistics

  • Descriptive statistics are used to describe the features of a sample or population using quantitative analysis methods.
  • Descriptive statistics can be classified into measures of central tendency and measures of dispersion.
  • Mean, mode, standard deviation, etc., are some measures of descriptive statistics.
  • Data of descriptive statistics can be visually represented using tables, charts, and graphs.

Examples on Descriptive Statistics

Example 1: Using descriptive statistics, find the mean and mode of the given data.

{1, 4, 6, 1, 8, 15, 18, 1, 5, 1}

Solution: Total number of observations = 10

Sum of observations = 1 + 4 + 6 + 1 + 8 + 15 + 18 + 1 + 5 + 1 = 60

Mean = 60 / 10 = 6

Mode = Most frequently occurring observation = 1

Answer: Mean = 6, Mode = 1

Example 2: Find the sample variance of the following data

{7, 11, 15, 18, 36, 43}

Solution: Sample variance formula, s\(^2\) = \(\sum \frac{\left ( X_{i}-\overline{X} \right )^{2}}{n - 1}\)

Mean, \(\overline{X}\) = 21.67, n = 6

s\(^2\) = [(7 - 21.67)\(^2\) + (11 - 21.67)\(^2\) + (15 - 21.67)\(^2\) + (18 - 21.67)\(^2\) + (36 - 21.67)\(^2\) + (43 - 21.67)\(^2\)] / (6 - 1)

Answer: s\(^2\) = 209.47

Example 3: Find the median and the mean deviation about the median for the given data

{9, 10, 12, 16, 17, 17, 18, 20}

Solution: n = 8

Median = [(n / 2)th term + ((n / 2) + 1)th term] / 2

= [(8 / 2)th term + ((8 / 2) + 1)th term] / 2

= (4th term + 5th term) / 2

= (16 + 17) / 2 = 16.5

Mean deviation about median = \(\sum_{1}^{n}\frac{|X - 16.5|}{n}\)

= [|9 - 16.5| + |10 - 16.5| + |12 - 16.5| + |16 - 16.5| + |17 - 16.5| + |17 - 16.5| + |18 - 16.5| + |20 - 16.5| ] / 8

Answer: Median = 16.5, Mean deviation about median = 3.125


FAQs on Descriptive Statistics

What is the Meaning of Descriptive Statistics?

Descriptive statistics is a branch of statistics that focuses on describing the characteristics of a sample or a population by using various quantitative methods.

What are the Types of Descriptive Statistics?

Measures of central tendency and measures of dispersion are the two types of descriptive statistics. Apart from this graphs, charts and tables can also be used for a visual representation of data.

What are the Measures of Central Tendency in Descriptive Statistics?

There are three measures of central tendency in descriptive statistics. These are the mean, median, and mode .

What are the Important Formulas in Descriptive Statistics?

The important formulas in descriptive statistics are as follows:

  • Mean = sum of all observations / number of observations.
  • Median = [(n + 1) / 2]th term (for odd n)
  • Variance = \(\sum \frac{\left ( X_{i}-\overline{X} \right )^{2}}{n - 1}\)
  • Standard Deviation = √Variance

What are the Measures of Dispersion in Descriptive Statistics?

In descriptive statistics, the measures of dispersion are variance, standard deviation, range, mean deviation, quartile deviation, and coefficients of dispersion.

How to Represent Data in Descriptive Statistics?

The best way to get a visual representation of data in descriptive statistics is by using frequency distribution tables. These tables can also be used to create various graphs and charts that help in further analysis of data.

What is the Difference Between Descriptive Statistics and Inferential Statistics?

Descriptive statistics uses quantitative tools like mean, variance, range, etc., to describe the features of data. Inferential statistics uses analytical tools such as z test, t test , linear regression, etc., to make generalizations about the population from the sample data.

Methods and formulas for Descriptive Statistics (Tables)

This topic covers the mean, median, minimum, maximum, sum, standard deviation, N nonmissing, N missing, count, row percent, column percent, and total percent.

The mean is the sum of all observations divided by the number of (non-missing) observations. Use the following formula to calculate the mean for each cell or margin using the data corresponding to that cell or margin.

\[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\]

The median is the middle value in an ordered data set. Thus, at least half the observations are less than or equal to the median, and at least half the observations are greater than or equal to the median.

If the number of observations in a data set is odd, the median is the value in the middle. If the number of observations in a data set is even, the median is the average of the two middle values.


The minimum is the smallest data value that is in a table cell or margin.

The maximum is the largest data value that is in a table cell or margin.

The sum is the total of all the data values that are in a table cell or margin.

The standard deviation is the most common measure of dispersion, or how spread out the data are about the mean. The more widely the values are spread out, the larger the standard deviation. The standard deviation is calculated by taking the square root of the variance.

Use this formula to calculate the standard deviation for each cell or margin using the data from that cell or margin.

\[s = \sqrt{\frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}{n-1}}\]

N nonmissing is the number of non-missing observations that are in a table cell or margin.

N missing is the number of missing observations that are in a table cell or margin.

The count is the number of times each combination of categories occurs.

The row percent is obtained by multiplying the ratio of a cell count to the corresponding row total by 100 and is given by:

\[\text{Row percent} = \frac{\text{cell count}}{\text{row total}} \times 100\]

The column percent is obtained by multiplying the ratio of a cell count to the corresponding column total by 100 and is given by:

\[\text{Column percent} = \frac{\text{cell count}}{\text{column total}} \times 100\]

The total percent is obtained by multiplying the ratio of a cell count to the total number of observations by 100 and is given by:

\[\text{Total percent} = \frac{\text{cell count}}{\text{total number of observations}} \times 100\]



14 Quantitative analysis: Descriptive statistics

Numeric data collected in a research project can be analysed quantitatively using statistical tools in two different ways. Descriptive analysis refers to statistically describing, aggregating, and presenting the constructs of interest or associations between these constructs. Inferential analysis refers to the statistical testing of hypotheses (theory testing). In this chapter, we will examine statistical techniques used for descriptive analysis, and the next chapter will examine statistical techniques for inferential analysis. Much of today’s quantitative data analysis is conducted using software programs such as SPSS or SAS. Readers are advised to familiarise themselves with one of these programs for understanding the concepts described in this chapter.

Data preparation

In research projects, data may be collected from a variety of sources: postal surveys, interviews, pretest or posttest experimental data, observational data, and so forth. This data must be converted into a machine-readable, numeric format, such as in a spreadsheet or a text file, so that they can be analysed by computer programs like SPSS or SAS. Data preparation usually follows the following steps:

Data coding. Coding is the process of converting data into numeric format. A codebook should be created to guide the coding process. A codebook is a comprehensive document containing a detailed description of each variable in a research study, items or measures for that variable, the format of each item (numeric, text, etc.), the response scale for each item (i.e., whether it is measured on a nominal, ordinal, interval, or ratio scale, and whether this scale is a five-point, seven-point scale, etc.), and how to code each value into a numeric format. For instance, if we have a measurement item on a seven-point Likert scale with anchors ranging from ‘strongly disagree’ to ‘strongly agree’, we may code that item as 1 for strongly disagree, 4 for neutral, and 7 for strongly agree, with the intermediate anchors in between. Nominal data such as industry type can be coded in numeric form using a coding scheme such as: 1 for manufacturing, 2 for retailing, 3 for financial, 4 for healthcare, and so forth (of course, nominal data cannot be analysed statistically). Ratio scale data such as age, income, or test scores can be coded as entered by the respondent. Sometimes, data may need to be aggregated into a different form than the format used for data collection. For instance, if a survey measuring a construct such as ‘benefits of computers’ provided respondents with a checklist of benefits that they could select from, and respondents were encouraged to choose as many of those benefits as they wanted, then the total number of checked items could be used as an aggregate measure of benefits. Note that many other forms of data—such as interview transcripts—cannot be converted into a numeric format for statistical analysis. 
Codebooks are especially important for large complex studies involving many variables and measurement items, where the coding process is conducted by different people, to help the coding team code data in a consistent manner, and also to help others understand and interpret the coded data.

Data entry. Coded data can be entered into a spreadsheet, database, text file, or directly into a statistical program like SPSS. Most statistical programs provide a data editor for entering data. However, these programs store data in their own native format—e.g., SPSS stores data as .sav files—which makes it difficult to share that data with other statistical programs. Hence, it is often better to enter data into a spreadsheet or database where it can be reorganised as needed, shared across programs, and subsets of data can be extracted for analysis. Smaller data sets with less than 65,000 observations and 256 items can be stored in a spreadsheet created using a program such as Microsoft Excel, while larger datasets with millions of observations will require a database. Each observation can be entered as one row in the spreadsheet, and each measurement item can be represented as one column. Data should be checked for accuracy during and after entry via occasional spot checks on a set of items or observations. Furthermore, while entering data, the coder should watch out for obvious evidence of bad data, such as the respondent selecting the ‘strongly agree’ response to all items irrespective of content, including reverse-coded items. If so, such data can be entered but should be excluded from subsequent analysis.


Data transformation. Sometimes, it is necessary to transform data values before they can be meaningfully interpreted. For instance, reverse coded items—where items convey the opposite meaning of that of their underlying construct—should be reversed (e.g., in a 1-7 interval scale, 8 minus the observed value will reverse the value) before they can be compared or combined with items that are not reverse coded. Other kinds of transformations may include creating scale measures by adding individual scale items, creating a weighted index from a set of observed measures, and collapsing multiple values into fewer categories (e.g., collapsing incomes into income ranges).
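The reverse-coding rule above (8 minus the observed value on a 1–7 scale) is a one-liner; a sketch with hypothetical raw responses:

```python
# Reverse-code items on a 1-7 Likert scale: new_value = 8 - old_value
responses = [1, 4, 7, 2, 6]          # hypothetical raw answers to a reverse-coded item
recoded = [8 - r for r in responses]
print(recoded)  # [7, 4, 1, 6, 2]
```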

Univariate analysis

Univariate analysis—or analysis of a single variable—refers to a set of statistical techniques that can describe the general properties of one variable. Univariate statistics include: frequency distribution, central tendency, and dispersion. The frequency distribution of a variable is a summary of the frequency—or percentages—of individual values or ranges of values for that variable. For instance, we can measure how many times a sample of respondents attend religious services—as a gauge of their ‘religiosity’—using a categorical scale: never, once per year, several times per year, about once a month, several times per month, several times per week, and an optional category for ‘did not answer’. If we count the number or percentage of observations within each category—except ‘did not answer’ which is really a missing value rather than a category—and display it in the form of a table, as shown in Figure 14.1, what we have is a frequency distribution. This distribution can also be depicted in the form of a bar chart, as shown on the right panel of Figure 14.1, with the horizontal axis representing each category of that variable and the vertical axis representing the frequency or percentage of observations within each category.

Frequency distribution of religiosity

With very large samples, where observations are independent and random, the frequency distribution tends to follow a plot that looks like a bell-shaped curve—a smoothed bar chart of the frequency distribution—similar to that shown in Figure 14.2. Here most observations are clustered toward the centre of the range of values, with fewer and fewer observations clustered toward the extreme ends of the range. Such a curve is called a normal distribution .

For example, for the test scores {15, 20, 21, 20, 36, 15, 25, 15}, the mean is (15 + 20 + 21 + 20 + 36 + 15 + 25 + 15)/8 = 20.875.

Lastly, the mode is the most frequently occurring value in a distribution of values. In the previous example, the most frequently occurring value is 15, which is the mode of the above set of test scores. Note that any value that is estimated from a sample, such as mean, median, mode, or any of the later estimates are called a statistic .

The range of these test scores is 36 - 15 = 21.

Bivariate analysis

Bivariate analysis examines how two variables are related to one another. The most common bivariate statistic is the bivariate correlation —often, simply called ‘correlation’—which is a number between -1 and +1 denoting the strength of the relationship between two variables. Say that we wish to study how age is related to self-esteem in a sample of 20 respondents—i.e., as age increases, does self-esteem increase, decrease, or remain unchanged? If self-esteem increases, then we have a positive correlation between the two variables, if self-esteem decreases, then we have a negative correlation, and if it remains the same, we have a zero correlation. To calculate the value of this correlation, consider the hypothetical dataset shown in Table 14.1.
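A bivariate correlation can be computed directly from its definition; a minimal sketch with hypothetical age and self-esteem values (not the Table 14.1 data):

```python
import math

def pearson(x, y):
    # r = cov(x, y) / (sd(x) * sd(y)), computed from sums of deviations
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

age = [21, 25, 30, 35, 40, 45]                 # hypothetical respondents
self_esteem = [5.2, 5.5, 6.1, 6.0, 6.8, 7.1]   # hypothetical scale scores
r = pearson(age, self_esteem)
print(round(r, 3))  # a value near +1 indicates a strong positive correlation
```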

Normal distribution

After computing bivariate correlation, researchers are often interested in knowing whether the correlation is significant (i.e., a real one) or caused by mere chance. Answering such a question would require testing the following hypothesis:

\[H_0:\quad \rho = 0 \]

Social Science Research: Principles, Methods and Practices (Revised edition) Copyright © 2019 by Anol Bhattacherjee is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.

Share This Book


NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2024 Jan-.


Exploratory data analysis: frequencies, descriptive statistics, histograms, and boxplots.

Jacob Shreffler; Martin R. Huecker.

Last Update: November 3, 2023.

  • Definition/Introduction

Researchers must utilize exploratory data techniques to present findings to a target audience and create appropriate graphs and figures. Researchers can determine if outliers exist, data are missing, and statistical assumptions will be upheld by understanding data. Additionally, it is essential to comprehend these data when describing them in conclusions of a paper, in a meeting with colleagues invested in the findings, or while reading others’ work.

  • Issues of Concern

This comprehension begins with exploring these data through the outputs discussed in this article. Individuals who do not conduct research must still comprehend new studies, and knowledge of fundamentals in analyzing data and interpretation of histograms and boxplots facilitates the ability to appraise recent publications accurately. Without this familiarity, decisions could be implemented based on inaccurate delivery or interpretation of medical studies.

Frequencies and Descriptive Statistics

Effective presentation of study results, in presentation or manuscript form, typically starts with frequencies and descriptive statistics (ie, means, medians, standard deviations). One can get a better sense of the variables by examining these data to determine whether a balanced and sufficient research design exists. Frequencies also inform on missing data and give a sense of outliers (discussed below).

Luckily, software programs are available to conduct exploratory data analysis. For this chapter, we will be examining the following research question.

RQ: Are there differences in drug life (length of effect) for Drug 23 based on the administration site?

A more precise hypothesis could be: Is Drug 23 longer-lasting when administered via site A compared with site B?

To address this research question, exploratory data analysis is conducted. First, it is essential to start with the frequencies of the variables. To keep things simple, only the variables of minutes (drug life effect) and administration site (A vs B) are included. See Figure 1 for the frequency outputs.

Figure 1 shows that the administration site appears to be a balanced design, with 50 individuals in each group. The excerpt for minutes frequencies in the bottom portion of Figure 1 shows how many cases fell into each time frame, with the cumulative percent on the right-hand side. Examining Figure 1 reveals one suspiciously low measurement (135) for a time variable. If a data point seems inaccurate, a researcher should find the case and confirm whether it was an entry error. For the sake of this review, the authors state that this was an entry error and should have been entered as 535, not 135. Had the analysis proceeded without checking this, the data analysis, results, and conclusions would have been invalid. Along with finding entry errors and determining whether groups are balanced, potential missing data should be explored. If not responsibly evaluated, missing values can nullify results.
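The frequency check described above can be sketched in a few lines of Python; the minutes values below are invented for illustration, not the actual study data:

```python
from collections import Counter

# Hypothetical drug-life measurements in minutes (invented data)
minutes = [535, 540, 545, 135, 550, 540, 535, 545, 550, 540]

freq = Counter(minutes)
for value, count in sorted(freq.items()):
    print(value, count)

# A value far below the rest stands out and should be verified as a
# possible entry error before any further analysis
suspicious = [m for m in minutes if m < 300]
```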

After replacing the incorrect 135 with 535, descriptive statistics, including the mean, median, mode, minimum/maximum scores, and standard deviation, were examined. Output for the research example for the variable of minutes can be seen in Figure 2. Observe each variable to ensure that the mean seems reasonable and that the minimum and maximum are within an appropriate range based on medical competence or an available codebook. One assumption common in statistical analyses is a normal distribution. Figure 2 shows that the mode differs from the mean and the median. Visualization tools such as histograms can be used to examine these scores for normality and outliers before making decisions.

Histograms are useful in assessing normality, as many statistical tests (eg, ANOVA and regression) assume the data have a normal distribution. Deviation from a normal distribution is quantified using skewness and kurtosis. [1]  Skewness occurs when one tail of the curve is longer. If the tail is lengthier on the left side of the curve (more cases on the higher values), the distribution is negatively skewed, whereas if the tail is longer on the right side, it is positively skewed. Kurtosis is another facet of normality. Positive kurtosis occurs when the distribution is sharply peaked with heavy tails, whereas negative kurtosis occurs when the distribution is flatter with light tails. [2]
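Skewness and excess kurtosis can be computed from the standardized third and fourth moments. The following sketch uses the population (uncorrected) moment formulas and invented data; most statistics packages report sample-corrected versions that differ by small factors of n:

```python
import math

def skewness(xs):
    """Standardized third moment (population formula)."""
    n = len(xs)
    m = sum(xs) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / n)
    return sum((x - m) ** 3 for x in xs) / (n * sd ** 3)

def excess_kurtosis(xs):
    """Standardized fourth moment minus 3 (population formula)."""
    n = len(xs)
    m = sum(xs) / n
    var = sum((x - m) ** 2 for x in xs) / n
    return sum((x - m) ** 4 for x in xs) / (n * var ** 2) - 3.0

data = [1, 2, 2, 3, 3, 3, 4, 4, 10]   # long right tail
sk = skewness(data)                    # positive, i.e. right-skewed
```

A perfectly symmetric sample has skewness 0, and a normal distribution has excess kurtosis 0.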

Additionally, histograms reveal outliers: data points either entered incorrectly or truly very different from the rest of the sample. When there are outliers, one must determine whether they reflect random chance or an error in the experiment, and provide strong justification if the decision is to exclude them. [3]  Outliers require attention to ensure the data analysis accurately reflects the majority of the data and is not influenced by extreme values; cleaning these outliers can result in better quality decision-making in clinical practice. [4]  A common approach to determining whether a variable is approximately normally distributed is to convert values to z scores and check whether any scores are less than -3 or greater than 3. For a normal distribution, about 99.7% of scores lie within three standard deviations of the mean. [5]  Importantly, one should not automatically throw out values outside this range, but should consider them alongside the other factors mentioned above. Outliers are relatively common, so when they are prevalent, one must assess the risks and benefits of exclusion. [6]
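The z-score screen described above is straightforward with the standard library; the sample below is invented, with one extreme entry:

```python
import statistics

# Hypothetical sample: 24 ordinary values plus one extreme entry
scores = [10] * 12 + [11] * 6 + [9] * 6 + [60]

mean = statistics.mean(scores)
sd = statistics.stdev(scores)    # sample standard deviation (n - 1 denominator)
z = [(x - mean) / sd for x in scores]
flagged = [x for x, zi in zip(scores, z) if abs(zi) > 3]   # candidates for
                                                           # review, not automatic exclusions
```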

Figure 3 provides examples of histograms. In Figure 3A, 2 possible outliers causing kurtosis are observed. If only values within 3 standard deviations are retained, the result in Figure 3B is observed. This histogram appears much closer to an approximately normal distribution once the kurtosis is treated. Remember, all evidence should be considered before eliminating outliers. When reporting outliers in scientific papers, state the number of outliers excluded and justify why they were excluded.

Boxplots can be used to examine for outliers, assess the range of data, and show differences among groups. Boxplots provide a visual representation of ranges and medians, illustrating differences among groups, and are useful in various outlets, including evidence-based medicine. [7]  Boxplots give a picture of data distribution when there are too many values to display individually (as in a scatterplot). [8]  Figure 4 illustrates the differences between drug administration site and the length of drug life from the above example.

Figure 4 shows differences with potential clinical impact. Had any outliers existed (the data from the histogram were cleaned), they would appear beyond the line endpoints. The red boxes represent the middle 50% of scores, so the bottom and top of each box mark the 25th and 75th percentiles; the line within each red box represents the median number of minutes for that administration site, and the horizontal lines at the ends of the whiskers mark the most extreme values not classified as outliers. In examining the boxplots, an overlap in minutes between the 2 administration sites was observed: the approximate top 25 percent from site B had the same times as the bottom 25 percent at site A. Site B had a median under 525 minutes, whereas administration site A had a median greater than 550. If there were no differences in adverse reactions at site A, this figure provides evidence that healthcare providers should administer the drug via site A. Researchers could follow by testing a third administration site, site C. Figure 5 shows what would happen if site C led to a longer drug life compared to site A.

Figure 5 displays the same site A data as Figure 4, but something looks different. The large variance at site C makes site A's variance appear smaller. In other words, patients administered the drug via site C had a larger range of scores. Some patients experience a longer drug life when the drug is administered via site C than the median at site A; however, the broad range (lack of precision) and lower median should be the focus. The minutes are much more tightly clustered at site A, with a higher median and a more precise range. One may conclude that this makes site A the more desirable site.
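The quantities a boxplot draws (median, quartiles, and the 1.5 × IQR fences beyond which points are plotted as outliers) can be computed directly; the site A minutes below are invented for illustration:

```python
import statistics

# Hypothetical drug-life minutes for one administration site
site_a = [548, 552, 555, 549, 560, 553, 551, 558, 547, 556, 554, 550]

q1, q2, q3 = statistics.quantiles(site_a, n=4)   # quartiles; q2 is the median
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr   # points beyond the fences would be drawn
upper_fence = q3 + 1.5 * iqr   # as individual outlier markers
outliers = [x for x in site_a if x < lower_fence or x > upper_fence]
```

Note that different quartile conventions (statistics.quantiles defaults to the "exclusive" method) can shift the fences slightly.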

  • Clinical Significance

Ultimately, by understanding basic exploratory data methods, medical researchers and consumers of research can make quality and data-informed decisions. These data-informed decisions will result in the ability to appraise the clinical significance of research outputs. By overlooking these fundamentals in statistics, critical errors in judgment can occur.

  • Nursing, Allied Health, and Interprofessional Team Interventions

All interprofessional healthcare team members need to be at least familiar with, if not well-versed in, these statistical analyses so they can read and interpret study data and apply the data implications in their everyday practice. This approach allows all practitioners to remain abreast of the latest developments and provides valuable data for evidence-based medicine, ultimately leading to improved patient outcomes.


Exploratory Data Analysis Figure 1 Contributed by Martin Huecker, MD and Jacob Shreffler, PhD

Exploratory Data Analysis Figure 2 Contributed by Martin Huecker, MD and Jacob Shreffler, PhD

Exploratory Data Analysis Figure 3 Contributed by Martin Huecker, MD and Jacob Shreffler, PhD

Exploratory Data Analysis Figure 4 Contributed by Martin Huecker, MD and Jacob Shreffler, PhD

Exploratory Data Analysis Figure 5 Contributed by Martin Huecker, MD and Jacob Shreffler, PhD

Disclosure: Jacob Shreffler declares no relevant financial relationships with ineligible companies.

Disclosure: Martin Huecker declares no relevant financial relationships with ineligible companies.

This book is distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) ( http://creativecommons.org/licenses/by-nc-nd/4.0/ ), which permits others to distribute the work, provided that the article is not altered or used commercially. You are not required to obtain permission to distribute this article, provided that you credit the author and journal.

  • Cite this Page Shreffler J, Huecker MR. Exploratory Data Analysis: Frequencies, Descriptive Statistics, Histograms, and Boxplots. [Updated 2023 Nov 3]. In: StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2024 Jan-.



Research Methods Knowledge Base


Descriptive Statistics

Descriptive statistics are used to describe the basic features of the data in a study. They provide simple summaries about the sample and the measures. Together with simple graphics analysis, they form the basis of virtually every quantitative analysis of data.

Descriptive statistics are typically distinguished from inferential statistics . With descriptive statistics you are simply describing what is or what the data shows. With inferential statistics, you are trying to reach conclusions that extend beyond the immediate data alone. For instance, we use inferential statistics to try to infer from the sample data what the population might think. Or, we use inferential statistics to make judgments of the probability that an observed difference between groups is a dependable one or one that might have happened by chance in this study. Thus, we use inferential statistics to make inferences from our data to more general conditions; we use descriptive statistics simply to describe what’s going on in our data.

Descriptive Statistics are used to present quantitative descriptions in a manageable form. In a research study we may have lots of measures. Or we may measure a large number of people on any measure. Descriptive statistics help us to simplify large amounts of data in a sensible way. Each descriptive statistic reduces lots of data into a simpler summary. For instance, consider a simple number used to summarize how well a batter is performing in baseball, the batting average. This single number is simply the number of hits divided by the number of times at bat (reported to three significant digits). A batter who is hitting .333 is getting a hit one time in every three at bats. One batting .250 is hitting one time in four. The single number describes a large number of discrete events. Or, consider the scourge of many students, the Grade Point Average (GPA). This single number describes the general performance of a student across a potentially wide range of course experiences.

Every time you try to describe a large set of observations with a single indicator you run the risk of distorting the original data or losing important detail. The batting average doesn’t tell you whether the batter is hitting home runs or singles. It doesn’t tell whether she’s been in a slump or on a streak. The GPA doesn’t tell you whether the student was in difficult courses or easy ones, or whether they were courses in their major field or in other disciplines. Even given these limitations, descriptive statistics provide a powerful summary that may enable comparisons across people or other units.

Univariate Analysis

Univariate analysis involves the examination across cases of one variable at a time. There are three major characteristics of a single variable that we tend to look at:

  • the distribution
  • the central tendency
  • the dispersion

In most situations, we would describe all three of these characteristics for each of the variables in our study.

The Distribution

The distribution is a summary of the frequency of individual values or ranges of values for a variable. The simplest distribution would list every value of a variable and the number of persons who had each value. For instance, a typical way to describe the distribution of college students is by year in college, listing the number or percent of students at each of the four years. Or, we describe gender by listing the number or percent of males and females. In these cases, the variable has few enough values that we can list each one and summarize how many sample cases had the value. But what do we do for a variable like income or GPA? With these variables there can be a large number of possible values, with relatively few people having each one. In this case, we group the raw scores into categories according to ranges of values. For instance, we might look at GPA according to the letter grade ranges. Or, we might group income into four or five ranges of income values.

One of the most common ways to describe a single variable is with a frequency distribution . Depending on the particular variable, all of the data values may be represented, or you may group the values into categories first (e.g. with age, price, or temperature variables, it would usually not be sensible to determine the frequencies for each value; rather, the values are grouped into ranges and the frequencies determined). Frequency distributions can be depicted in two ways: as a table or as a graph. The table above shows an age frequency distribution with five categories of age ranges defined. The same frequency distribution can be depicted in a graph as shown in Figure 1. This type of graph is often referred to as a histogram or bar chart.
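Grouping raw values into ranges, as described above, takes only a few lines once a banding function is defined; the ages and band width below are invented:

```python
from collections import Counter

# Hypothetical ages of survey respondents
ages = [18, 22, 25, 31, 33, 37, 41, 44, 48, 52, 55, 61, 64, 67, 72]

def age_band(age, width=15, start=15):
    """Label the 15-year range an age falls into, e.g. 31 -> '30-44'."""
    lo = start + ((age - start) // width) * width
    return f"{lo}-{lo + width - 1}"

dist = Counter(age_band(a) for a in ages)   # frequency of each age range
```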

Distributions may also be displayed using percentages. For example, you could use percentages to describe the:

  • percentage of people in different income levels
  • percentage of people in different age ranges
  • percentage of people in different ranges of standardized test scores

Central Tendency

The central tendency of a distribution is an estimate of the “center” of a distribution of values. There are three major types of estimates of central tendency:

The Mean or average is probably the most commonly used method of describing central tendency. To compute the mean, you add up all the values and divide by the number of values. For example, the mean quiz score is determined by summing all the scores and dividing by the number of students taking the quiz. For example, consider the test score values:

15, 20, 21, 20, 36, 15, 25, 15

The sum of these 8 values is 167 , so the mean is 167/8 = 20.875 .

The Median is the score found at the exact middle of the set of values. One way to compute the median is to list all scores in numerical order and then locate the score in the center of the sample. For example, if there are 500 scores in the list, score #250 would be the median. If we order the 8 scores shown above, we get:

15, 15, 15, 20, 20, 21, 25, 36

There are 8 scores, so scores #4 and #5 represent the halfway point. Since both of these scores are 20 , the median is 20 . If the two middle scores had different values, you would have to interpolate to determine the median.

The Mode is the most frequently occurring value in the set of scores. To determine the mode, you might again order the scores as shown above, and then count each one. The most frequently occurring value is the mode. In our example, the value 15 occurs three times and is the mode. In some distributions there is more than one modal value. For instance, in a bimodal distribution there are two values that occur most frequently.

Notice that for the same set of 8 scores we got three different values ( 20.875 , 20 , and 15 ) for the mean, median and mode respectively. If the distribution is truly normal (i.e. bell-shaped), the mean, median and mode are all equal to each other.

Dispersion refers to the spread of the values around the central tendency. There are two common measures of dispersion, the range and the standard deviation. The range is simply the highest value minus the lowest value. In our example distribution, the high value is 36 and the low is 15 , so the range is 36 - 15 = 21 .

The Standard Deviation is a more accurate and detailed estimate of dispersion, because an outlier can greatly exaggerate the range (as was true in this example, where the single outlier value of 36 stands apart from the rest of the values). The standard deviation shows the relation that a set of scores has to the mean of the sample. Again, let's take the set of scores:

15, 20, 21, 20, 36, 15, 25, 15

To compute the standard deviation, we first find the distance between each value and the mean. We know from above that the mean is 20.875 . So, the differences from the mean are:

-5.875, -0.875, 0.125, -0.875, 15.125, -5.875, 4.125, -5.875

Notice that values that are below the mean have negative discrepancies and values above it have positive ones. Next, we square each discrepancy:

34.515625, 0.765625, 0.015625, 0.765625, 228.765625, 34.515625, 17.015625, 34.515625

Now, we take these “squares” and sum them to get the Sum of Squares (SS) value. Here, the sum is 350.875 . Next, we divide this sum by the number of scores minus 1 . Here, the result is 350.875 / 7 = 50.125 . This value is known as the variance . To get the standard deviation, we take the square root of the variance (remember that we squared the deviations earlier). This would be SQRT(50.125) = 7.079901129253 .

Although this computation may seem convoluted, it's actually quite simple. To see this, consider the formula for the standard deviation:

\[ s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} \]

where:

  • X is each score,
  • X̄ is the mean (or average),
  • n is the number of values,
  • Σ means we sum across the values.

In the top part of the ratio, the numerator, each score has the mean subtracted from it, the difference is squared, and the squares are summed. In the bottom part, we take the number of scores minus 1 . The ratio is the variance, and its square root is the standard deviation. In English, we can describe the standard deviation as:

the square root of the sum of the squared deviations from the mean divided by the number of scores minus one.

Although we can calculate these univariate statistics by hand, it gets quite tedious when you have more than a few values and variables. Every statistics program is capable of calculating them easily for you. For instance, I put the eight scores into SPSS and got the following table as a result:

which confirms the calculations I did by hand above.

The standard deviation allows us to reach some conclusions about specific scores in our distribution. Assuming that the distribution of scores is normal or bell-shaped (or close to it!), the following conclusions can be reached:

  • approximately 68% of the scores in the sample fall within one standard deviation of the mean
  • approximately 95% of the scores in the sample fall within two standard deviations of the mean
  • approximately 99.7% of the scores in the sample fall within three standard deviations of the mean

For instance, since the mean in our example is 20.875 and the standard deviation is 7.0799 , we can from the above statement estimate that approximately 95% of the scores will fall in the range of 20.875-(2*7.0799) to 20.875+(2*7.0799) or between 6.7152 and 35.0348 . This kind of information is a critical stepping stone to enabling us to compare the performance of an individual on one variable with their performance on another, even when the variables are measured on entirely different scales.
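All of these univariate calculations are available in Python's standard statistics module. Using the eight example scores (reconstructed here from the summary values quoted above):

```python
import statistics

scores = [15, 20, 21, 20, 36, 15, 25, 15]

mean = statistics.mean(scores)          # 20.875
median = statistics.median(scores)      # 20
mode = statistics.mode(scores)          # 15
variance = statistics.variance(scores)  # 50.125, with n - 1 in the denominator
sd = statistics.stdev(scores)           # about 7.0799

# Approximate 95% range under a normal assumption
low, high = mean - 2 * sd, mean + 2 * sd   # about 6.72 to 35.03
```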


Guidelines for Examining Unusual Patterns of Cancer and Environmental Concerns

Appendix A: Statistical Considerations

At a glance.

  • This appendix provides guidance on epidemiologic and descriptive statistical methods to assess cancer occurrences.
  • The standardized incidence ratio (SIR) is often used to assess if there is an excess number of cancer cases.
  • Alpha, beta, and statistical power relate to the types of errors that can occur during hypothesis testing.


This section provides general guidance regarding epidemiologic and descriptive statistical methods most commonly used to assess occurrences of cancer. Frequencies, proportions, rates, and other descriptive statistics are useful first steps in evaluating the suspected unusual pattern of cancer. These statistics can be calculated by geographical location (e.g., census tracts) and by demographic variables such as age category, race, ethnicity, and sex. Comparisons can then be made across different stratifications using statistical summaries such as ratios.

Standardized incidence ratio

The standardized incidence ratio (SIR) is often used to assess whether there is an excess number of cancer cases, considering what is “expected” to occur within an area over time given existing knowledge of the type of cancer and the local population at risk. The SIR is a ratio of the number of observed cancer cases in the study population compared to the number that would be expected if the study population experienced the same cancer rates as a selected reference population. Typically, the state as a whole is used as a reference population. The equation is as follows:

\[ \text{SIR} = \frac{\text{observed number of cancer cases}}{\text{expected number of cancer cases}} \]

Adjusting for factors

The SIR can be adjusted for factors such as age, sex, race, or ethnicity, but it is most commonly adjusted for differences in age between two populations. In cancer analyses, adjusting for age is important because age is a risk factor for many cancers, and the population in an area of interest could be, on average, younger or older than the reference population 1 2 . In these instances, comparing the crude counts or rates would present a biased comparison.

For more guidance, this measure is explained in many epidemiologic textbooks, sometimes under standardized mortality ratio, which uses the same method but measures mortality instead of incidence rates 3 4 5 6 7 8 9 . Two ways are generally used to adjust via standardization, an indirect and a direct method. An example of one method is shown below, but a discussion of other methods is provided in several epidemiologic textbooks 3 and reference manuals 10 .

An example is provided in the table below, adjusting for age groups. The second column, denoted with an "O," is the observed number of cases in the area of interest, which in this example is a particular county within the state. The third column shows the population totals for each age group within the county of interest, designated as "A." The state age-specific cancer rates are shown in the fourth column, denoted as "B." To get the expected number of cases in the fifth column, A and B are multiplied for each row. The observed cases and the expected cases are then each summed across age groups.

*Number of cases in a specified time frame. † Number of cases in the state divided by the state population for the specified time frame. Rates are typically expressed per 100,000 or 1,000,000 population.

The number of observed cancer cases can then be compared to the expected. The SIR is calculated using the formula below.

Standardized Incidence Ratio Example
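The indirect standardization in the table above can be sketched as follows; the age groups, counts, populations, and rates below are hypothetical, not the figures from the worked example:

```python
# (age group, observed cases O, county population A, state rate B per 100,000)
strata = [
    ("0-39",   5, 40_000,  10.0),
    ("40-59", 20, 30_000,  60.0),
    ("60+",   45, 20_000, 200.0),
]

observed = sum(o for _, o, _, _ in strata)
# Expected cases per stratum: A x B, with the rate rescaled from per-100,000
expected = sum(pop * rate / 100_000 for _, _, pop, rate in strata)
sir = observed / expected   # SIR > 1 means more cases observed than expected
```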

Confidence intervals

A confidence interval (CI) is one of the most important statistics to be calculated, as it helps to provide understanding of both statistical significance and precision of the estimate. The narrower the confidence interval, the more precise the estimate 4 .

A common way of calculating confidence intervals for the SIR is shown below 4 :

Confidence Intervals

Using the example above produces this result:

Confidence Intervals Example

If the confidence interval for the SIR includes 1.0, the SIR is not considered statistically significant. However, there are many considerations when using the SIR. Because the statistic can be affected by small case counts, the proportion of the population within the area of interest, and other factors, the significance of the SIR should not be used as the sole metric to determine further assessment in the investigation of unusual patterns of cancer. Additionally, with small samples, exact statistical methods, which are calculated directly from data probabilities, such as Fisher’s exact test, can be considered. These calculations can be performed using software such as R, Microsoft Excel, SAS, and STATA 9 . A few additional topics regarding the SIR are summarized below.
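One common large-sample approximation (not necessarily the formula pictured above) computes the CI for the SIR on the log scale, treating the observed count as Poisson; the counts here are hypothetical:

```python
import math

observed, expected = 70, 62.0   # hypothetical counts
sir = observed / expected

# Standard error of log(SIR) under a Poisson model for the observed count
se_log = 1 / math.sqrt(observed)
lower = sir * math.exp(-1.96 * se_log)
upper = sir * math.exp(1.96 * se_log)

# The interval spans 1.0 here, so this SIR would not be deemed significant
significant = not (lower <= 1.0 <= upper)
```

Exact Poisson intervals are preferred when the observed count is small.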

Reference population

Decisions about the reference population should be made prior to calculating the SIR. The reference population used for the SIR could be people in the surrounding census tracts, other counties in the state, or the entire state. Selecting the appropriate reference population depends upon the hypothesis being tested, and the population should be large enough to provide relatively stable reference rates. One issue to consider is the size of the study population relative to the reference population. If the study population is small relative to the overall state population, including the study population in the reference population calculation will not yield substantially different results; however, excluding the study population from the reference population may reduce bias. If the reference population is smaller than the state as a whole (such as another county), it should be similar to the study population in terms of factors that could be confounders (such as age distribution, socioeconomic status, and environmental exposures other than the exposure of interest). However, the reference population should not be selected to be similar to the study population in terms of the exposure of interest. Appropriate comparisons may also better address issues of environmental justice and health equity. Ultimately, careful consideration of the reference population is necessary, since the choice can affect the interpretation of findings and can introduce bias or reduce the precision of estimates.

Limitations and further considerations for the SIR

One difficulty in community cancer investigations is that the population under study is generally a community or part of a community, leading to a relatively small number of individuals comprising the total population (e.g., small denominator for rate calculations). Small denominators frequently yield wide confidence intervals, meaning that estimates like the SIR may be imprecise 5 . Other methods, such as qualitative analyses or geospatial/spatial statistics methods, can provide further examination of the cancer and area of concern to better discern associations. Further epidemiologic studies may help calculate other statistics, such as logistic regression or Poisson regression. These methods are described in Appendix B. Other resources can provide additional guidance on use of p-values, confidence intervals, and statistical tests 3 4 9 11 12 .

Alpha, beta, and statistical power

Another important consideration in community cancer investigations is the types of errors that can occur during hypothesis testing and the related alpha, beta, and statistical power for the investigation. A type I error occurs when the null hypothesis (H0) is rejected but is actually true (e.g., concluding that there is a difference in cancer rates between the study population and the reference population when there is actually no difference). The probability of a type I error is often referred to as alpha (α) [13].


A type II error occurs when the null hypothesis is not rejected even though it is false (e.g., concluding that there is no difference in cancer rates when there actually is a difference). The probability of a type II error is often referred to as beta (β).
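The meaning of alpha can be illustrated with a small simulation: when the null hypothesis is true, repeated testing at alpha = 0.05 rejects it, incorrectly, in roughly 5% of samples. This is a minimal, hypothetical sketch (a one-sample z-test with known variance), not a method prescribed by these guidelines.

```python
import random
from statistics import NormalDist

random.seed(1)
z_crit = NormalDist().inv_cdf(0.975)   # two-sided critical value for alpha = 0.05

trials, rejections = 2000, 0
for _ in range(trials):
    # Draw a sample from a population where H0 is true (the mean really is 0)
    sample = [random.gauss(0.0, 1.0) for _ in range(30)]
    z = (sum(sample) / len(sample)) / (1.0 / len(sample) ** 0.5)
    if abs(z) > z_crit:
        rejections += 1  # a Type I error: H0 rejected although it is true

print(f"Empirical Type I error rate: {rejections / trials:.3f}")
```

The empirical rejection rate comes out close to the nominal alpha of 0.05, which is exactly what the definition of a Type I error predicts.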


Power is the probability of rejecting the null hypothesis when it is actually false (e.g., concluding there is a difference in cancer rates between the study population and the reference population when there actually is a difference). Power equals 1 − beta. Power is related to the sample size of the study: the larger the sample size, the greater the power. Power also depends on several other factors, including the following:

  • The size of the effect (e.g., rate ratio or rate difference) to be detected
  • The probability of incorrectly rejecting the null hypothesis (alpha)
  • Other features related to the study design, such as the distribution and variability of the outcome measure

As with other epidemiologic analyses, in community cancer investigations, a power analysis can be conducted to estimate the minimum number of people (sample size) needed in a study for detection of an effect (e.g., rate ratio or rate difference) of a given size with a specified level of power (1 − beta) and a specified probability of rejecting the null hypothesis when the null hypothesis is true (alpha), given an assumed distribution for the outcome. Typically, a power value of 0.8 (equivalent to a beta value of 0.2) and an alpha value of 0.05 are used. An alpha value of 0.05 corresponds to a 95% confidence interval.

Selecting an alpha value larger than 0.05 (e.g., 0.10, corresponding to a 90% confidence interval) increases the possibility of concluding that there is a difference when there is actually no difference (a Type I error). Selecting a smaller alpha value (e.g., 0.01, corresponding to a 99% confidence interval) decreases that risk and is sometimes considered when many SIRs are computed, because some statistically significant apparent associations are expected by chance alone. As the number of SIRs examined increases, the number that will be statistically significant by chance alone also increases (if alpha is 0.05, then 5% of the results are expected to be statistically significant by chance alone). However, one may account for this fact when interpreting results, rather than using a lower alpha value [14]. Decreasing the alpha value will also decrease the power for detection of differences between the study population and the reference population.
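As a rough illustration of such a power calculation, the sketch below approximates the power to detect an elevated SIR by applying a normal approximation to the Poisson count of observed cases. The function name and the inputs are hypothetical and simplified; in practice, dedicated statistical software would typically be used.

```python
from statistics import NormalDist

def sir_power(expected: float, true_sir: float, alpha: float = 0.05) -> float:
    """Approximate power to detect SIR != 1 with a two-sided test,
    using a normal approximation to the Poisson count of cases:
    under the alternative, O ~ N(E * theta, E * theta)."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    mu, sd = expected * true_sir, (expected * true_sir) ** 0.5
    # H0 is rejected when the observed count falls outside these cutoffs
    upper_cut = expected + z_crit * expected ** 0.5
    lower_cut = expected - z_crit * expected ** 0.5
    return (1 - nd.cdf((upper_cut - mu) / sd)) + nd.cdf((lower_cut - mu) / sd)

# For a doubling of the rate (true SIR = 2), power grows with the
# expected number of cases E, i.e., with the size of the study population.
for e in (2, 5, 10, 20):
    print(f"E = {e:2d}  power = {sir_power(e, 2.0):.2f}")
```

Two properties worth noting: when the true SIR is 1, the "power" reduces to alpha itself (the Type I error rate), and as E grows the power approaches 1, mirroring the statement above that larger sample sizes yield larger power.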

In many investigations of suspected unusual patterns of cancer, the number of people in the study population is determined by factors that may prevent the selection of a sample size sufficient to detect statistically significant differences. In these situations, a power analysis can be used to estimate the power of the study for detecting a difference in rates of a given magnitude. This information can help determine whether, and what type of, statistical analysis is appropriate. The results of a power calculation can therefore be informative regarding how best to move forward.

Additional Contributing Authors:

Andrea Winquist, Angela Werner

  1. Waller LA, Gotway CA. Applied spatial statistics for public health data. New York: John Wiley and Sons; 2004.
  2. National Cancer Institute. Cancer Incidence Statistics [Internet]. 2021 [cited 2022 Jan 7]. Available from: https://surveillance.cancer.gov/statistics/types/incidence.html
  3. Gordis L. Epidemiology. Philadelphia, PA: Elsevier Saunders; 2014.
  4. Merrill R. Environmental epidemiology: principles and methods. Sudbury, MA: Jones and Bartlett Publishers, Inc.; 2008.
  5. Kelsey JL, Whittemore AS, Evans AS, Thompson WD. Methods in observational epidemiology. 2nd ed. New York, NY: Oxford University Press; 1996.
  6. Sahai H, Khurshid A. Statistics in epidemiology: methods, techniques, and applications. Boca Raton: CRC; 1996.
  7. Selvin S. Statistical analysis of epidemiologic data. New York, NY: Oxford University Press; 1996.
  8. Breslow NE, Day NE. Statistical methods in cancer research. Volume I – The analysis of case-control studies. IARC Sci Publ. 1980;(32):5–338.
  9. Rothman KJ, Greenland S, Lash TL. Modern epidemiology. 3rd ed. Philadelphia, PA: Lippincott Williams & Wilkins; 2008.
  10. United Kingdom and Ireland Association of Cancer Registries. Standard Operating Procedure: Investigating and Analysing Small-Area Cancer Clusters [Internet]. 2015. Available from: Cancer Cluster SOP_0.pdf (ukiacr.org)
  11. Greenland S, Senn S, Rothman KJ, Carlin J, Poole C, Goodman S, et al. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol. 2016;31(4):337–50.
  12. Amrhein V, Greenland S, McShane B. Scientists rise up against statistical significance. Nature. 2019;567:305–7.
  13. Pagano M, Gauvreau K. Principles of biostatistics. 2nd ed. Pacific Grove, CA: Duxbury Thomson Learning; 2000.
  14. Rothman KJ. No adjustments are needed for multiple comparisons. Epidemiology. 1990;1(1):43–6.

