Descriptive Statistics | Definitions, Types, Examples

Published on 4 November 2022 by Pritha Bhandari. Revised on 9 January 2023.

Descriptive statistics summarise and organise characteristics of a data set. A data set is a collection of responses or observations from a sample or entire population.

In quantitative research, after collecting data, the first step of statistical analysis is to describe characteristics of the responses, such as the average of one variable (e.g., age), or the relation between two variables (e.g., age and creativity).

The next step is inferential statistics, which help you decide whether your data confirms or refutes your hypothesis and whether it is generalisable to a larger population.

Table of contents

  • Types of descriptive statistics
  • Frequency distribution
  • Measures of central tendency
  • Measures of variability
  • Univariate descriptive statistics
  • Bivariate descriptive statistics
  • Frequently asked questions

There are 3 main types of descriptive statistics:

  • The distribution concerns the frequency of each value.
  • The central tendency concerns the averages of the values.
  • The variability or dispersion concerns how spread out the values are.


You can apply these to assess only one variable at a time, in univariate analysis, or to compare two or more, in bivariate and multivariate analysis. As a running example, suppose you survey respondents about how often in the past year they did each of the following leisure activities:

  • Go to a library
  • Watch a movie at a theatre
  • Visit a national park

A data set is made up of a distribution of values, or scores. In tables or graphs, you can summarise the frequency of every possible value of a variable in numbers or percentages.

Simple frequency distribution table:

Gender   Number
Male     182
Female   235
Other    27

From this table, you can see that more women than men or people with another gender identity took part in the study. In a grouped frequency distribution, you can group numerical response values and add up the number of responses for each group. You can also convert each of these numbers to percentages.

Grouped frequency distribution table:

Library visits in the past year   Percent
0–4                               6%
5–8                               20%
9–12                              42%
13–16                             24%
17+                               8%

Measures of central tendency estimate the centre, or average, of a data set. The mean, median and mode are 3 ways of finding the average.

Here we will demonstrate how to calculate the mean, median, and mode using the first 6 responses of our survey.

The mean, or M, is the most commonly used method for finding the average.

To find the mean, simply add up all response values and divide the sum by the total number of responses. The total number of responses or observations is called N.

Mean number of library visits
Data set: 15, 3, 12, 0, 24, 3
Sum of all values: 15 + 3 + 12 + 0 + 24 + 3 = 57
Total number of responses: N = 6
Mean: divide the sum of values by N to find M: 57/6 = 9.5

The median is the value that’s exactly in the middle of a data set.

To find the median, order the response values from smallest to largest. The median is the number in the middle. If there are two numbers in the middle, find their mean.

Median number of library visits
Ordered data set: 0, 3, 3, 12, 15, 24
Middle numbers: 3, 12
Median: find the mean of the two middle numbers: (3 + 12)/2 = 7.5

The mode is simply the most popular or most frequent response value. A data set can have no mode, one mode, or more than one mode.

To find the mode, order your data set from lowest to highest and find the response that occurs most frequently.

Mode number of library visits
Ordered data set: 0, 3, 3, 12, 15, 24
Mode: find the most frequently occurring response: 3
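To check these calculations, here is a minimal sketch using Python's built-in statistics module on the six survey responses from the tables above:

```python
from statistics import mean, median, mode

visits = [15, 3, 12, 0, 24, 3]  # library visits reported by 6 respondents

print(mean(visits))    # 9.5: the sum of 57 divided by N = 6
print(median(visits))  # 7.5: the mean of the two middle values, 3 and 12
print(mode(visits))    # 3: the most frequently occurring value
```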

Measures of variability give you a sense of how spread out the response values are. The range, standard deviation and variance each reflect different aspects of spread.

The range gives you an idea of how far apart the most extreme response scores are. To find the range, simply subtract the lowest value from the highest value.

Standard deviation

The standard deviation (s) is the average amount of variability in your dataset. It tells you, on average, how far each score lies from the mean. The larger the standard deviation, the more variable the data set is.

There are six steps for finding the standard deviation:

  • List each score and find their mean.
  • Subtract the mean from each score to get the deviation from the mean.
  • Square each of these deviations.
  • Add up all of the squared deviations.
  • Divide the sum of the squared deviations by N – 1.
  • Find the square root of the number you found.
Raw data     Deviation from mean   Squared deviation
15           15 – 9.5 = 5.5        30.25
3            3 – 9.5 = -6.5        42.25
12           12 – 9.5 = 2.5        6.25
0            0 – 9.5 = -9.5        90.25
24           24 – 9.5 = 14.5       210.25
3            3 – 9.5 = -6.5        42.25
Mean = 9.5   Sum = 0               Sum of squares = 421.5

Step 5: 421.5/5 = 84.3

Step 6: √84.3 = 9.18

The variance is the average of squared deviations from the mean. Variance reflects the degree of spread in the data set. The more spread the data, the larger the variance is in relation to the mean.

To find the variance, simply square the standard deviation. The symbol for variance is s².
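The same spread measures can be checked with the statistics module; this sketch follows the six-step procedure's N – 1 (sample) convention:

```python
from statistics import stdev, variance

visits = [15, 3, 12, 0, 24, 3]

print(max(visits) - min(visits))  # range: 24 - 0 = 24
print(variance(visits))           # sample variance: 421.5 / 5 = 84.3
print(stdev(visits))              # sample standard deviation: √84.3 ≈ 9.18
```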

Univariate descriptive statistics focus on only one variable at a time. It’s important to examine data from each variable separately using multiple measures of distribution, central tendency and spread. Programs like SPSS and Excel can be used to easily calculate these.

Visits to the library
N                   6
Mean                9.5
Median              7.5
Mode                3
Standard deviation  9.18
Variance            84.3
Range               24

If you were to consider only the mean as a measure of central tendency, your impression of the ‘middle’ of the data set could be skewed by outliers, unlike the median or mode.

Likewise, while the range is sensitive to extreme values, you should also consider the standard deviation and variance to get easily comparable measures of spread.

If you’ve collected data on more than one variable, you can use bivariate or multivariate descriptive statistics to explore whether there are relationships between them.

In bivariate analysis, you simultaneously study the frequency and variability of two variables to see if they vary together. You can also compare the central tendency of the two variables before performing further statistical tests.

Multivariate analysis is the same as bivariate analysis but with more than two variables.

Contingency table

In a contingency table, each cell represents the intersection of two variables. Usually, an independent variable (e.g., gender) appears along the vertical axis and a dependent one appears along the horizontal axis (e.g., activities). You read ‘across’ the table to see how the independent and dependent variables relate to each other.

Number of visits to the library in the past year

Group      0–4   5–8   9–12   13–16   17+
Children   32    68    37     23      22
Adults     36    48    43     83      25

Interpreting a contingency table is easier when the raw data are converted to percentages. Percentages make the rows comparable to each other by presenting each group as if it had only 100 observations or participants. When creating a percentage-based contingency table, you add the N for each independent variable at the end of the row.

Visits to the library in the past year (percentages)

Group      0–4   5–8   9–12   13–16   17+   N
Children   18%   37%   20%    13%     12%   182
Adults     15%   20%   18%    35%     11%   235

From this table, it is clearer that similar proportions of children and adults visit the library more than 17 times a year. Additionally, children most commonly went to the library between 5 and 8 times, while for adults this number was between 13 and 16.
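The same row-percentage conversion can be sketched in Python with pandas, using the counts from the raw table above:

```python
import pandas as pd

# Raw counts from the contingency table above
counts = pd.DataFrame(
    {"0-4": [32, 36], "5-8": [68, 48], "9-12": [37, 43],
     "13-16": [23, 83], "17+": [22, 25]},
    index=["Children", "Adults"],
)

# Divide each row by its total, then express as whole percentages
percentages = counts.div(counts.sum(axis=1), axis=0).mul(100).round(0)
percentages["N"] = counts.sum(axis=1)  # append each group's N
print(percentages)
```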

Scatter plots

A scatter plot is a chart that shows you the relationship between two or three variables. It’s a visual representation of the strength of a relationship.

In a scatter plot, you plot one variable along the x-axis and another along the y-axis. Each observation appears as a point on the chart.

From your scatter plot, you see that as the number of movies seen at movie theaters increases, the number of visits to the library decreases. Based on your visual assessment of a possible linear relationship, you perform further tests of correlation and regression.

Descriptive statistics: Scatter plot
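A minimal matplotlib sketch of such a plot; the paired values below are hypothetical, invented only to illustrate the negative relationship described above:

```python
import matplotlib.pyplot as plt

movies = [1, 3, 5, 7, 9, 11, 13]            # hypothetical movies seen
library_visits = [22, 18, 15, 12, 9, 5, 2]  # hypothetical library visits

plt.scatter(movies, library_visits)
plt.xlabel("Movies seen at theatres in the past year")
plt.ylabel("Visits to the library in the past year")
plt.title("Descriptive statistics: scatter plot")
plt.show()
```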

Descriptive statistics summarise the characteristics of a data set. Inferential statistics allow you to test a hypothesis or assess whether your data is generalisable to the broader population.

The 3 main types of descriptive statistics concern the frequency distribution, central tendency, and variability of a dataset.

  • Distribution refers to the frequencies of different responses.
  • Measures of central tendency give you the average for each response.
  • Measures of variability show you the spread or dispersion of your dataset.
  • Univariate statistics summarise only one variable at a time.
  • Bivariate statistics compare two variables.
  • Multivariate statistics compare more than two variables.



Descriptive Statistics in Excel

By Jim Frost

Descriptive statistics summarize your dataset, painting a picture of its properties. These properties include various central tendency and variability measures, distribution properties, outlier detection, and other information. Unlike inferential statistics, descriptive statistics only describe your dataset’s characteristics and do not attempt to generalize from a sample to a population.


In this post, I provide step-by-step instructions for using Excel to calculate descriptive statistics for your data. Importantly, I also show you how to interpret the results, determine which statistics are most applicable to your data, and help you navigate some of the lesser-known values.

Additionally, I include links to resources I’ve written that present clear explanations of relevant statistical concepts that you won’t find in Excel’s documentation. And, I use an example dataset for us to work through and interpret together!

Before proceeding, ensure that Excel’s Data Analysis ToolPak is installed. On the Data tab, look for Data Analysis, as shown below.

Excel menu with Data Analysis ToolPak.

If you don’t see Data Analysis, install that ToolPak. Learn how to install it in my post about using Excel to perform t-tests. It’s free!

Let’s start with a caveat. Use descriptive statistics together with graphs. The statistical output contains numbers that describe the properties of your data. While they provide useful information, charts are often more intuitive. The best practice is to use graphs and statistical output together to maximize your understanding. At the end of this post, I display the histograms for the variables in this dataset.

For this example, we’ll assess two variables, the height and weight of preteen girls. I collected these data during a real experiment. To use this feature in Excel, arrange your data in columns or rows. I have my data in columns, as shown in the snippet below.

Displays portion of the dataset for this descriptive statistics example.

Download the Excel file that contains the data for this example: HeightWeight.

In Excel, click Data Analysis on the Data tab, as shown above. In the Data Analysis popup, choose Descriptive Statistics, and then follow the steps below.

Excel's Descriptive Statistics option in the Data Analysis menu.

Step-by-Step Instructions for Filling in Excel’s Descriptive Statistics Box

  • Under Input Range, select the range for the variables that you want to analyze. You can include multiple variables as long as they form a contiguous block. While you can explore more than one variable, the analysis assesses each variable in a univariate manner (i.e., no correlation).
  • In Grouped By, choose how your variables are organized. I always include one variable per column, as this format is standard across software. Alternatively, you can include one variable per row.
  • Check the Labels in first row checkbox if you have meaningful variable names in row 1. This option makes the output easier to interpret.
  • In Output options, choose where you want Excel to display the results.
  • Check the Summary statistics box to display most of the descriptive statistics (central tendency, dispersion, distribution properties, sum, and count).
  • Check the Confidence Level for Mean box to display a confidence interval for the mean. Enter the confidence level. 95% is usually a good value. For more information about confidence levels, read my post about confidence intervals.
  • Check Kth Largest and Kth Smallest to display a high and low value. If you enter 1, Excel displays the highest and lowest values. If you enter 2, it shows the 2nd highest and lowest values. Etc.

For our example dataset, fill in the dialog box as shown below.

Excel's dialog box for descriptive statistics.
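As an aside, for readers working outside Excel: here is a rough Python sketch that produces comparable statistics with pandas and SciPy. The file and column names are hypothetical stand-ins for the downloadable HeightWeight dataset, not the actual labels:

```python
import pandas as pd
from scipy import stats

# Hypothetical file/column names standing in for the HeightWeight download
df = pd.read_excel("HeightWeight.xlsx")

for col in ["Height M", "Weight kg"]:
    s = df[col].dropna()
    print(s.describe())           # count, mean, std, min, quartiles, max
    print("skewness:", s.skew())  # bias-adjusted, like Excel's Skewness
    print("kurtosis:", s.kurt())  # excess kurtosis, like Excel's Kurtosis
    # Excel's "Confidence Level(95.0%)": t critical value × standard error
    print("CI half-width:", stats.t.ppf(0.975, len(s) - 1) * s.sem())
```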

Interpreting Excel’s Descriptive Statistics Results

After Excel creates the statistical output, I autofit the columns for clarity.

Excel's descriptive statistics output.

As you can see, we’re assessing two variables, height in meters and weight in kilograms.

Generally, we’ll work our way down from the top of Excel’s descriptive statistics output. However, I’ll group the results into categories that make sense. Consequently, the following discussion doesn’t strictly follow the order of the output. If you want to learn more about the statistics, be sure to click the links for more detailed information!

Central Tendencies (Mean, Median, Mode)

A measure of central tendency describes where most of the values in the dataset occur. It’s the center of the distribution of values. Excel presents three measures of central tendency. Which one is best for your data?

  • Mean: This measure is the one with which you’re most familiar. It’s the sum of all observations divided by the number of observations. It’s best for data that follow symmetric distributions.
  • Median: This value splits your data in half. Half the values fall above the median while half are below it. It’s best for skewed distributions.
  • Mode: This measure represents the value that occurs most frequently in your data. It’s best for categorical and ordinal data.

The example data are continuous variables. Excel frequently displays “N/A” for the mode when you have continuous data. That happens because continuous data are unlikely to have exactly duplicated values, a requirement for the mode. Thanks to a data collection artifact, my data are continuous, but Excel displays the mode anyway. The study’s nurse collected the underlying data in inches and pounds, rounded them to the nearest unit, and converted them to their metric equivalents. That process produced clumps of rounded values. However, the mode really is not a good measure for these data.

Related post: Data Types and How to Graph Them

Central Tendency for our Descriptive Statistics Example

What can we learn by comparing the mean and median for both variables? For the height data, they are virtually equal, 1.51m and 1.50m, respectively. For symmetric distributions, the mean and median will be very close together. That’s a good sign that the heights follow a symmetric distribution, making the mean a good choice. The mean tells us that the height distribution centers on 1.51m.

However, there is a difference between the weight mean (46.3kg) and median (44.9kg). When the mean is greater than the median, it indicates that the distribution is right-skewed. We should use the median for these data. Half the data points fall above 44.9kg, and half fall below.

For more information about the different measures of central tendency, their calculations, how data types and distribution properties affect them, graphical representations, and when to use each type, read my post about Measures of Central Tendency.

Measures of Dispersion (Standard Deviation, Variance, Range)

Previously, you saw how a measure of central tendency indicates where most observations fall. Measures of dispersion indicate how closely clustered or loosely spread the data points fall around the center. Excel presents three measures of dispersion. In general, as their values increase, data points spread out further from the center (i.e., the distribution becomes broader).

  • Standard Deviation: The standard or typical difference between each data point and the mean. This measure uses the original units of the data, simplifying interpretation. Hence, analysts use this measure of variability the most frequently. The standard deviation is the square root of the variance.
  • Variance: The average squared difference of the values from the mean. Because the calculations use squared differences, the variance is in squared units rather than the original data units. While higher values of the variance indicate greater variability, there is no intuitive interpretation for specific values. Read more about the variance.
  • Range: The difference between the largest and smallest values in a dataset. The range is easy to understand, but it is based on only the two most extreme values in the dataset, making it very susceptible to outliers. Additionally, the size of the dataset affects the range. As the sample size increases, the range tends to expand. Consequently, use the range to compare variability only when the sample sizes are similar. Read more about the range.

Typically, use the standard deviation. When you have fairly skewed data, consider using the interquartile range (IQR), which Excel doesn’t provide, unfortunately.

Variability for our Descriptive Statistics Example

For the height data, the standard deviation is 0.07m (7cm). The typical height falls 7cm from the mean of 1.51m. The range tells us that the spread from the tallest to the shortest is 0.33m (33cm). You can draw similar conclusions from the weight data.

It might be tempting to compare the variability of heights and weights using their standard deviations. However, the standard deviations use different units, m and kg, making a direct comparison impossible. For some data, though, you can compare coefficients of variation, which are easy to calculate from the standard deviations and means. For more information, read my post about the coefficient of variation.
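As a quick sketch of that calculation, using only the height summary values quoted above:

```python
def coefficient_of_variation(std_dev: float, mean: float) -> float:
    """CV = standard deviation divided by the mean, a unitless ratio."""
    return std_dev / mean

# Heights from the example: SD 0.07 m, mean 1.51 m
print(coefficient_of_variation(0.07, 1.51))  # ≈ 0.046, i.e. about 4.6%
```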

For more information about the different measures of variability, their calculations, and when to use each type, read my post about Measures of Variability.

Distribution Shape Properties: Kurtosis and Skewness

Kurtosis and skewness are two measures that help you understand the general properties of your data’s distribution. These measures compare your distribution’s shape to a symmetric distribution and the normal distribution.

When either kurtosis or skewness deviates substantially from zero, it might indicate that your data do not follow a normal distribution. However, use a normality test or a normal distribution plot to make that determination.

I find that histograms present the same information more intuitively. However, graph axes and bin sizes can be manipulated to exaggerate or deemphasize characteristics, while these statistics are completely objective.

Related posts : Using Histograms to Understand Your Data and Manually Adjusting Your Graph Axes

Kurtosis indicates how the peaks and tails of your distribution compare to the normal distribution. Is the peak taller or shorter than the normal distribution? Are the tails thicker or thinner? In the table below, a kurtosis of zero is the normal-distribution baseline for comparison. For more details about this statistic, read my post about Kurtosis.

Kurtosis   Interpretation
Zero       Consistent with a normal distribution
Positive   Thicker tails than the normal distribution
Negative   Thinner tails than the normal distribution

For our example data, height has a kurtosis of -0.35. This value is close to zero, indicating that the tails are consistent with the normal distribution. However, weight has a kurtosis of 1.15, suggesting the tails are thicker than the normal distribution.

Skewness indicates the symmetry of your data’s distribution. Skewed data are asymmetric. The terms right-skewed and left-skewed indicate the direction in which the long tail points on a distribution curve. Learn more about skewed distributions.

Skewness   Interpretation
Zero       A perfectly symmetric distribution
Positive   Right-skewed data
Negative   Left-skewed data

Note that a U-shaped distribution can be symmetric even though it is inverted compared to the normal distribution.

For our example data, height has a skewness of 0.11. This value is close to zero, signifying that these data have a symmetric distribution. However, weight has a skewness of 1.05, which indicates it is right-skewed.

The relative locations of the mean and median and these distribution properties paint a consistent picture of these two variables. For the height data, the mean and median are nearly equal, and kurtosis and skewness are both virtually zero. These measures collectively imply that the heights follow a symmetric distribution consistent with the normal distribution.

Conversely, the weight data have a mean that is higher than the median, a positive skew value, and a positive kurtosis value. These values suggest that the weights follow an asymmetric, right-skewed distribution that is not consistent with the normal distribution.
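Here is a small sketch of how those rules of thumb could be applied in Python. The 0.5 cutoff is an arbitrary illustration, not a standard threshold; as noted above, use a normality test for a real determination:

```python
def describe_shape(skew: float, kurt: float, tol: float = 0.5) -> str:
    # Rough labels following this section's rules of thumb;
    # tol is an arbitrary cutoff chosen only for illustration.
    symmetry = ("approximately symmetric" if abs(skew) < tol
                else "right-skewed" if skew > 0 else "left-skewed")
    tails = ("tails consistent with the normal" if abs(kurt) < tol
             else "thicker tails than the normal" if kurt > 0
             else "thinner tails than the normal")
    return f"{symmetry}, {tails}"

print(describe_shape(0.11, -0.35))  # heights
print(describe_shape(1.05, 1.15))   # weights
```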

Minimum and Maximum

The minimum and maximum values in your dataset can help you understand where your data fall. For our example data, the heights fall between 1.33 and 1.66 m, while the weights fall between 29.26 and 80.74 kg. Additionally, these values can help you identify outliers. Frequently, data entry errors create values that fall outside the range of valid data. Look at the minimum and maximum values and see if they make sense for your data!

Related post: Five Ways to Find Outliers in Your Data

Sum and Count

The sum is simply the sum of all values for each variable. I’ve never found this to be helpful, but perhaps it will be for you. The count is the number of observations for each variable. Use this value to determine whether the sample size is what you expected. Both the height and weight variables have 88 observations.

Precision of the Mean: Standard Error and the Confidence Interval

The standard error and the confidence interval assess how precisely your sample mean estimates the population mean. A relatively precise estimate indicates that your sample estimate is likely to be close to the actual population value. Conversely, an imprecise estimate tends to be further away from the correct population value.

Technically, neither of these values belongs in the descriptive statistics output because they use your sample data to infer the properties of a larger population (inferential statistics). Descriptive statistics only describe your data without considering a population. However, Excel includes them in the output, so I’ll interpret them here.

Be aware that inferential statistics impose additional requirements on data collection methodologies that do not apply to descriptive statistics. For example, you must use a representative sampling methodology, such as random sampling; otherwise, these measures are invalid.

For more information, read my post about the differences between descriptive and inferential statistics.

Standard Error of the Mean

The standard error of the mean is the standard deviation of the sampling distribution of the mean. What?!

If you took many samples from the same population and calculated each sample’s mean, you’d produce a distribution of sample means. That distribution has a standard deviation, which is the standard error of the mean.

Smaller standard errors indicate that your sample provides a more precise estimate of the population value. Unfortunately, there is no intuitive interpretation of these values. However, the calculations for confidence intervals (CIs) incorporate the standard error, and CIs are much easier to interpret. So, focus on the CIs and don’t worry about the standard errors!

Related post: Standard Error of the Mean

Confidence Interval (CI) of the Mean

A confidence interval of the mean is a range of values that a population mean is likely to fall within. Because of random sampling error, you know that your sample mean is unlikely to equal the population mean, but how large is that difference? CIs help you answer that question by providing a range of probable values for the population mean.

Narrow CIs indicate more precise estimates of the population mean. In other words, you can expect your sample mean to be relatively close to the population mean.

Excel doesn’t provide the range, but it does display the number to add and subtract from your mean to calculate the confidence interval.

For the height data, Excel displays 0.015530282, which I’m rounding to 0.02. To calculate the CI, add and subtract this value from the mean height: 1.51 +/- 0.02 creates a CI of 1.49 – 1.53. We can be confident that the mean height for this population falls between these two values.

Using the same process, the confidence interval for weight is [43.98, 48.68]. We can be confident that the mean weight for the population falls between these values.
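A sketch of how that +/- value can be reproduced outside Excel (t critical value times the standard error of the mean); because the SD is rounded to 0.07 here, the result lands near, not exactly on, Excel's 0.0155:

```python
from scipy import stats

def ci_half_width(std_dev: float, n: int, confidence: float = 0.95) -> float:
    # The +/- value: t critical value × (SD / √n)
    return stats.t.ppf((1 + confidence) / 2, n - 1) * std_dev / n ** 0.5

# Heights: SD ≈ 0.07 m, n = 88 observations (from the Count row)
hw = ci_half_width(0.07, 88)
print(1.51 - hw, 1.51 + hw)  # roughly the 1.49 – 1.53 interval above
```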

If you want to know more about standard errors, confidence intervals, and confidence levels, read my post about How Confidence Intervals Work.

Histograms of our Descriptive Statistics Data

Let’s see the histograms for our example data. These graphs are not a part of Excel’s descriptive statistics. However, my suggestion is that you graph your data first and then study the numbers. All the statistics in this post describe the data that created the graphs below.

Are there any surprises?

Histogram of heights.

For myself, I expected the height data to be more perfectly symmetrical. However, they are very slightly skewed to the right. The weight data are more right skewed, consistent with the descriptive statistics.

While the Descriptive Statistics analysis can’t assess correlation, read my post about Using Excel to Calculate Correlation to evaluate the relationship between these two variables!


Reader Interactions


September 7, 2023 at 2:38 am

I find this very helpful for doing my practical assignment.


January 26, 2022 at 11:38 am

I love your books and blogs. I am a layman when it comes to statistics. I like to use Excel to analyze interesting pursuits like the effectiveness of our community’s cane toad removal efforts and my efforts to improve air gun target shooting. Quick question: Can Excel produce the “predicted R-squared” you describe to determine overfitting? I don’t see it; just the “adjusted R-squared” to evaluate when too many independent variables are being used. If it doesn’t offer “predicted R-squared” directly, are there formulas I can use to calculate it?


January 26, 2022 at 5:34 pm

Hi Richard,

I love how statistics are usable in so many situations. It’s fantastic you’ve found ways in your personal life and community!

Unfortunately, I don’t believe that Excel can calculate predicted R-square as a built-in function. There are some 3rd party Excel add-ons that might be able to calculate it. I’m not sure. You could probably create an Excel formula yourself to find the answer. I might take a stab at that some point, but my to do list is a bit long!

I’m sorry I didn’t have a better answer for you!


January 23, 2022 at 10:53 pm

Thank you Jim! I don’t know if my question is appropriate to this post, so please disregard if that’s the case. I used an online calculator to find a sample size, with a 95% confidence level and 5% confidence interval. Now I have collected the data and have my sample mean, and I would like to report it estimating what the population mean would probably be like. And I don’t know if I should calculate the confidence interval using the 5% that I used in the sample size calculation (mean +/- 5% of the mean), or if I should use the number reported by Excel to calculate the 95% confidence interval, which you discuss in your post (mean +/- number reported by Excel). Thank you again.

January 25, 2022 at 2:51 pm

I’m not sure what you mean when you say you looked up both a 95% confidence level and a 5% confidence interval. Do you mean a 95% confidence level and a 5% significance level? Those are two different forms of the same thing, and those values fit together. The significance level = 1 – confidence level. So, if you use a confidence level of 95%, that corresponds to a significance level of 5% in a hypothesis test.

So, I’m not entirely sure what you mean there.

However, in terms of reporting for the confidence interval specifically, you report the confidence level, which is almost always 95%. Here is how the APA says you should report CIs. From their manual,

“Use the format 95% CI [LL, UL] where LL is the lower limit of the confidence interval and UL is the upper limit. For example, one might report: 95% CI [5.62, 8.31].”

Even if you’re not required to use the APA format, you’ll be on solid ground by using it. Depending on the knowledge of your audience, you could follow that up with a fleshed-out interpretation, such as the following for the APA’s example:

The results indicate there is a 95% confidence level that the population mean falls between 5.62 and 8.31.

I hope that answers your questions!

January 23, 2022 at 4:04 pm

Hello Jim, thank you for this post! I have a question about the Confidence Interval (95%) number that Excel provides. Is it just applicable to estimating the population mean? I am wondering how to provide a confidence interval for proportions observed in the sample, for example the number of cases that are within a certain height range (to use your example), so that we can generalize the results to the population. I wanted to know which confidence interval to use if I wanted to report that we are 95% certain that (roughly) 35% (+/- confidence interval number) of the population’s height will be between 1.45–1.51 (looking at the distribution of your histogram above). Will this be a different CI – perhaps what is used to calculate the sample size?

January 23, 2022 at 5:03 pm

Confidence intervals are only applicable to inferential statistics. Inferential statistics are when you use a sample to generalize to a population. In other words, you’re using sample characteristics to infer the properties of a population. As I mention in this post, it’s not accurate for Excel to include CIs in their descriptive statistics, which adds to the confusion! Inferential statistics need to account for the sampling error, which is the difference between your sample and the population. CIs are one way of doing that. So, if you want to generalize to a population, then you’re performing inferential statistics and CIs are appropriate.

Descriptive statistics is when you’re just describing the sample that you measured. There’s no uncertainty because you’ve measured everyone in the sample. Hence, there’s no reason at all to use a CI or hypothesis testing. You know the sample exactly. So, if you are not generalizing to a population and just want to understand the sample itself, don’t use CIs.

There are confidence intervals for population parameters other than the mean. You can obtain them for proportions, standard deviations, and so on. They just involve different calculations and data types. For proportions, you need binary data. For example, if you had pass/fail data, those are binary. You could collect a random sample and calculate the proportion of those who passed out of the total number. Additionally, you could obtain a CI for the proportion, which gives you a range of likely values for the population proportion. In this example, you are again generalizing from the sample to the population. Hence, CIs are appropriate.

You could convert the continuous height data to binary data. For example, all heights greater than X could be considered “tall” while all heights lower than X are “not tall.” You’d have binary data with the two possible values of tall/not-tall. You could then calculate a proportion for those who are tall and get a CI for that proportion.


December 1, 2021 at 6:44 pm

Thank you, Jim! I’m in my first semester of a doctoral program, and it has been 20+ years since I was in a stats class. I recommended that my professor link to your blog because it is very helpful for our intro course and a good companion to the textbook. I have this bookmarked for the future.


May 28, 2021 at 10:40 am

Or, in simple terms, I want to ask: what is the difference between SE = SD/√N and SE(m) = √(2MSE/r), and what do they each tell us?

May 28, 2021 at 4:53 am

Thank you sir! I read your recommended post and SEM post also, again very nicely explained. I could now understand the line: “The standard deviation is the variability of individual data points around the sample mean. The standard error of the mean is the variability of sample means in the sampling distribution of means.”

But again my question is standard error of the mean is given in the end of the ANOVA and I can understand that it is a kind of variability measures for the different sample means in the sampling distribution and is used for further calculations.

But what about the standard error of the sample mean (individual sample only)? Many research articles mention the SE for individual sample means, for which we could also use the standard deviation. In SPSS descriptive statistics, both SD and SE are given for individual sample means. After calculating, I find this SE of the sample mean = SD for the sample mean / sqrt of the number of replications or individual units in that sample, which is similar to the SEM formula, SD/sqrt of the number of samples.

So my query is: what does this SE for a sample mean indicate?

May 28, 2021 at 11:21 pm

Hi Himanshu,

I *think* I see where some of your confusion is but I’m not sure.

Let me clarify. You’re seeing the standard error of the mean in an ANOVA context and you’re thinking it applies to the multiple means that you’re analyzing? If so, that’s not correct, although I can see how that would seem to make sense in that context! The F-test itself assesses the variability of the group means. To read how that works, read my post about the F-test in ANOVA . That does involve assessing both the variability of the group means and data points around their mean.

However, that is different from my discussion about the standard error of the mean. These standard errors are for individual sample means. Although you can have them for the group means in ANOVA too. But, in my post about the standard error of the mean, I’m talking about them from the standpoint of an individual sample. The distribution of means I’m referring to in that context is the sampling distribution (not the multiple means in ANOVA). You can have only one sample mean but the procedure still estimates a sampling distribution.

So, while reading my post about the standard error of the mean, keep in mind that I AM referring to individual sample means–exactly what you’re asking about! I hope that will clarify that aspect for you.

Yes, the standard error for an individual sample mean is the standard deviation/square root of the sample size. Again, that formula is in my other post.

I’m not familiar with SE(m)= Sqrt (2MSE/r). I don’t have SPSS so I’m not sure what that is in relation to. Sorry.

If I’m misunderstanding what you’re unsure about, please clarify!


May 21, 2021 at 4:57 pm

When we talk about skewness, we talk about the right tail and the left tail (we divide the distribution into two parts). If the right tail is long then we say right-skewed, else left-skewed.

In the case of unimodal data, we divide the distribution into two parts by looking at the peak. The right side of the peak is considered the right tail and the left side the left tail. So here, the mode is the point which divides the distribution into two parts.

But in the case of bimodal data, if we divide it into two parts using either mode then it will not look symmetric, even though my distribution can be symmetrical if I use another point, like the median, to divide it into two parts.

So I am getting confused: am I interpreting rightly that in the case of unimodal data we divide the distribution by looking at the peak (mode) and then compare the two parts to get an idea of skewness, or is there another technique which we use to divide the distribution into two parts?

Thanks…

May 20, 2021 at 9:12 am

Respected Sir Greetings any reply to this comment please

Stay Safe Best wishes

May 20, 2021 at 2:52 pm

Somehow your previous question slipped through the cracks! I’ll be answering momentarily!

May 16, 2021 at 10:19 am

Hello sir Greetings of the day

Here I am with one more query regarding the descriptive statistics.

1. Sir What is the difference between the Standard deviation (SD) and Standard Error (SE). Suppose we have given 3 treatments to a population with 5 Replication each. As of now what I have understood is : a.) we calculate SD for each treatment mean and write mean of 5 replication in a given respective treatment +- SD of respective treatment in the table b.) SE or SEM is calculated in ANOVA when it is performed for all the treatment and is used for the calculation of LSD. But in many research papers they use to mention mean +- SE in many places with the treatment mean instead of SD. Also in SPSS, the descriptive statistics provide both SD and SE for the treatment. So my question is how SE is calculated for treatment instead of whole of the population (different treatment in ANOVA as point b).

2. In Excel 2016 there are two formulas given, STDEV.S and STDEV.P. I think STDEV.S is for a sample and is actually the SD, and STDEV.P is for a population and is actually the SE. Sample means each treatment (only 5 replications) and population means all the treatments (all 3 treatments along with their respective 5 replications) in combination (the population comprises all the treatments which we have given to the population).

Am I correct or not for the point 2?

Thank you and Regards

May 20, 2021 at 4:19 pm

The standard deviation is the variability of individual data points around the sample mean. The standard error of the mean is the variability of sample means in the sampling distribution of means. Specifically, the standard error of the mean is the standard deviation of the sampling distribution. Conversely, the standard deviation applies to the distribution of sample values.

Statistical procedures use the standard error of the mean to calculate p-values and confidence intervals. Typically, you don’t interpret them directly. It assesses how precisely your sample mean estimates the population mean.

There are different equations for the standard deviation depending on whether you’re using a sample to estimate a population (use STDEV.S) or whether you just want to know the standard deviation for a particular dataset and not use it to infer the properties of a larger population (use STDEV.P). For more information on that issue and the nature of the difference between the two formulas, read my post about Measures of Variability, which discusses all that. Note that STDEV.P is NOT the standard error.

So, you have three different calculation methods, standard deviations for a sample or a population (click link above), and the standard error of the mean, which is the sample standard deviation divided by the square root of the sample size.

I hope that helps!


March 18, 2021 at 4:21 pm

Do you have instructions on how to make graphs in Excel?

March 19, 2021 at 3:10 pm

I currently don’t have posts about how to make graphs in Excel. However, I am expanding my Excel content all the time and will eventually explain how to create and interpret graphs in Excel. Was there a particular graph you’re interested in?


March 18, 2021 at 1:45 pm

Hi Jim, we read you from many parts of the world; thank you for sharing your knowledge.

Greetings from Colombia!


February 28, 2021 at 3:58 pm

Thanks! Very helpful – like the book I bought from you!

February 28, 2021 at 5:56 pm

Thank you, Dr. Muller! I’m also so glad to hear that my book was helpful! 🙂


February 25, 2021 at 5:23 pm

greatly appreciated..thank you very much..this is really helpful.


February 23, 2021 at 10:30 am

Hi Jim, there are some errors stating kurtosis for skewness and vice versa.

February 23, 2021 at 1:45 pm

Thank you Bal Ram! I’ve fixed that typo!


February 23, 2021 at 7:22 am

Your Descriptive Statistics in Excel manual is very good and applicable to my veterinary and agronomy students. For your information, I bought your books Regression Analysis and Hypothesis Testing on Amazon. Greetings from Brazil.


February 22, 2021 at 3:51 am

Thanks a bunch, Jim. You have always done it well. Much appreciated.

Someone mentioned that you did a book on Minitab. Which book is that? I would like to have it, since I have Minitab but most lessons are either on SPSS or XLSTAT.

February 22, 2021 at 3:39 pm

I have three books and all three use Minitab. In these books, I don’t teach the use of Minitab but I use it to perform the analyses, create the output and graphs, etc. My goal is that everyone can learn from them even if they don’t use Minitab. However, if you use Minitab, I’m sure you’ll get a little bit more!

To see my books, go to my webstore. My books are listed there and you can even get free samples of them, so you can get an idea of what they cover and how I use Minitab. I include a note about my usage of Minitab at the end of the Introduction section in each book.

Happy reading! Jim


February 22, 2021 at 2:44 am

Thank you so much Jim for the simplicity in your explanations and support towards our research problems. Stay blessed

February 22, 2021 at 3:25 pm

Hi Sulaina! I’m so glad it was helpful! You stay blessed as well! 🙂


February 22, 2021 at 2:00 am

I was looking for clear cut explanation of descriptive stats in excel and you explained with utmost clarity. Thanks a ton!

February 22, 2021 at 3:23 pm

You bet, Dhawal! So glad it was helpful!


February 22, 2021 at 1:38 am

Thank you so much for your elaborate exposition. This is very enlightening. You make statistics really enjoyable & functional in research


February 22, 2021 at 12:54 am

Excellent, Jim!!! Thank you so much.

February 22, 2021 at 3:22 pm

You’re very welcome, Janardhan!


February 22, 2021 at 12:50 am

Appreciated, Jim. I bought your books but found that they use Minitab. Can you create a version of your book using Excel? I understand Excel doesn’t have all of the capabilities of Minitab, but can you cover the topics that Excel is capable of, without using VBA?

Yes! My plan is to write a book that focuses on using Excel to perform statistical analysis.


February 22, 2021 at 12:43 am

Always very helpful! Appreciated Jim! Very clearly explained

February 22, 2021 at 12:47 am

Thanks, Bob!! 🙂


Quant Analysis 101: Descriptive Statistics

Everything You Need To Get Started (With Examples)

By: Derek Jansen (MBA) | Reviewers: Kerryn Warren (PhD) | October 2023

If you’re new to quantitative data analysis, one of the first terms you’re likely to hear being thrown around is descriptive statistics. In this post, we’ll unpack the basics of descriptive statistics, using straightforward language and loads of examples. So grab a cup of coffee and let’s crunch some numbers!

Overview: Descriptive Statistics

  • What are descriptive statistics?
  • Descriptive vs inferential statistics
  • Why the descriptives matter
  • The “Big 7” descriptive statistics
  • Key takeaways

What are descriptive statistics?

At the simplest level, descriptive statistics summarise and describe relatively basic but essential features of a quantitative dataset – for example, a set of survey responses. They provide a snapshot of the characteristics of your dataset and allow you to better understand, roughly, how the data are “shaped” (more on this later). For example, a descriptive statistic could include the proportion of males and females within a sample or the percentages of different age groups within a population.

Another common descriptive statistic is the humble average (which in statistics-talk is called the mean ). For example, if you undertook a survey and asked people to rate their satisfaction with a particular product on a scale of 1 to 10, you could then calculate the average rating. This is a very basic statistic, but as you can see, it gives you some idea of how this data point is shaped .


What about inferential statistics?

Now, you may have also heard the term inferential statistics being thrown around, and you’re probably wondering how that’s different from descriptive statistics. Simply put, descriptive statistics describe and summarise the sample itself, while inferential statistics use the data from a sample to make inferences or predictions about a population.

Put another way, descriptive statistics help you understand your dataset, while inferential statistics help you make broader statements about the population, based on what you observe within the sample. If you’re keen to learn more, we cover inferential stats in another post, or you can check out the explainer video below.

Why do descriptive statistics matter?

While descriptive statistics are relatively simple from a mathematical perspective, they play a very important role in any research project. All too often, students skim over the descriptives and run ahead to the seemingly more exciting inferential statistics, but this can be a costly mistake.

The reason for this is that descriptive statistics help you, as the researcher, comprehend the key characteristics of your sample without getting lost in vast amounts of raw data. In doing so, they provide a foundation for your quantitative analysis. Additionally, they enable you to quickly identify potential issues within your dataset – for example, suspicious outliers, missing responses and so on. Just as importantly, descriptive statistics inform the decision-making process when it comes to choosing which inferential statistics you’ll run, as each inferential test has specific requirements regarding the shape of the data.

Long story short, it’s essential that you take the time to dig into your descriptive statistics before looking at more “advanced” inferentials. It’s also worth noting that, depending on your research aims and questions, descriptive stats may be all that you need in any case. So, don’t discount the descriptives!


The “Big 7” descriptive statistics

With the what and why out of the way, let’s take a look at the most common descriptive statistics. Beyond the counts, proportions and percentages we mentioned earlier, we have what we call the “Big 7” descriptives. These can be divided into two categories – measures of central tendency and measures of dispersion.

Measures of central tendency

True to the name, measures of central tendency describe the centre or “middle section” of a dataset. In other words, they provide some indication of what a “typical” data point looks like within a given dataset. The three most common measures are:

The mean , which is the mathematical average of a set of numbers – in other words, the sum of all numbers divided by the count of all numbers. 
The median , which is the middlemost number in a set of numbers, when those numbers are ordered from lowest to highest.
The mode , which is the most frequently occurring number in a set of numbers (in any order). Naturally, a dataset can have one mode, no mode (no number occurs more than once) or multiple modes.

To make this a little more tangible, let’s look at a sample dataset, along with the corresponding mean, median and mode. This dataset reflects the service ratings (on a scale of 1 – 10) from 15 customers.

Example set of descriptive stats

As you can see, the mean of 5.8 is the average rating across all 15 customers. Meanwhile, 6 is the median. In other words, if you were to list all the responses in order from low to high, Customer 8 would be in the middle (with their service rating being 6). Lastly, the number 5 is the most frequent rating (appearing 3 times), making it the mode.

Together, these three descriptive statistics give us a quick overview of how these customers feel about the service levels at this business. In other words, most customers feel rather lukewarm and there’s certainly room for improvement. From a more statistical perspective, this also means that the data tend to cluster around the 5–6 mark, since the mean and the median are fairly close to each other.

To take this a step further, let’s look at the frequency distribution of the responses. In other words, let’s count how many times each rating was received, and then plot these counts onto a bar chart.

Example frequency distribution of descriptive stats

As you can see, the responses tend to cluster toward the centre of the chart, creating something of a bell-shaped curve. In statistical terms, this is called a normal distribution.

As you delve into quantitative data analysis, you’ll find that normal distributions are very common, but they’re certainly not the only type of distribution. In some cases, the data can lean toward the left or the right of the chart (i.e., toward the low end or high end). This lean is reflected by a measure called skewness, and it’s important to pay attention to this when you’re analysing your data, as it will have an impact on what types of inferential statistics you can use on your dataset.

Example of skewness

Measures of dispersion

While the measures of central tendency provide insight into how “centred” the dataset is, it’s also important to understand how dispersed that dataset is. In other words, to what extent the data cluster toward the centre – specifically, the mean. In some cases, the majority of the data points will sit very close to the centre, while in other cases, they’ll be scattered all over the place. Enter the measures of dispersion, of which there are three:

Range , which measures the difference between the largest and smallest number in the dataset. In other words, it indicates how spread out the dataset really is.

Variance , which measures how much each number in a dataset varies from the mean (average). More technically, it calculates the average of the squared differences between each number and the mean. A higher variance indicates that the data points are more spread out , while a lower variance suggests that the data points are closer to the mean.

Standard deviation, which is the square root of the variance. It serves the same purposes as the variance, but is a bit easier to interpret as it presents a figure that is in the same unit as the original data. You’ll typically present this statistic alongside the means when describing the data in your research.

Again, let’s look at our sample dataset to make this all a little more tangible.


As you can see, the range of 8 reflects the difference between the highest rating (10) and the lowest rating (2). The standard deviation of 2.18 tells us that, on average, results within the dataset sit 2.18 away from the mean (of 5.8), reflecting a relatively dispersed set of data.
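Here is a small sketch of computing these measures with Python’s statistics module. The ratings below are hypothetical values constructed to match the post’s mean (5.8), median (6), mode (5) and range (8); the original raw data aren’t shown, so the variance and standard deviation come out slightly different:

```python
from statistics import mean, median, mode, stdev, variance

# 15 hypothetical service ratings (not the post's actual raw data)
ratings = [2, 2, 3, 4, 5, 5, 5, 6, 6, 7, 7, 8, 8, 9, 10]

print(mean(ratings))                # 5.8
print(median(ratings))              # 6
print(mode(ratings))                # 5
print(max(ratings) - min(ratings))  # range: 8
print(variance(ratings))            # sample variance
print(stdev(ratings))               # sample standard deviation
```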

For the sake of comparison, let’s look at another much more tightly grouped (less dispersed) dataset.

Example of skewed data

As you can see, all the ratings lie between 5 and 8 in this dataset, resulting in a much smaller range, variance and standard deviation. You might also notice that the data are clustered toward the right side of the graph – in other words, the data are skewed. If we calculate the skewness for this dataset, we get a result of -0.12; the negative value confirms this lean, with the bulk of the ratings sitting toward the high end and the longer tail pointing left.

In summary, range, variance and standard deviation all provide an indication of how dispersed the data are. These measures are important because they help you interpret the measures of central tendency within context. In other words, if your measures of dispersion are all fairly high numbers, you need to interpret your measures of central tendency with some caution, as the results are not particularly centred. Conversely, if the data are all tightly grouped around the mean (i.e., low dispersion), the mean becomes a much more “meaningful” statistic.

Key Takeaways

We’ve covered quite a bit of ground in this post. Here are the key takeaways:

  • Descriptive statistics, although relatively simple, are a critically important part of any quantitative data analysis.
  • Measures of central tendency include the mean (average), median and mode.
  • Skewness indicates whether a dataset leans to one side or the other.
  • Measures of dispersion include the range, variance and standard deviation.

If you’d like hands-on help with your descriptive statistics (or any other aspect of your research project), check out our private coaching service, where we hold your hand through each step of the research journey.


ed

Good day. May I ask about where I would be able to find the statistics cheat sheet?

Khan

Right above your comment 🙂

Laarbik Patience

Good job. You saved me.

Lou

Brilliant and well explained. So much information explained clearly!


Python for Data Science

Table of contents

  • Introduction
  • Descriptive statistics with Python
  • ... using Pandas
  • ... using Researchpy

Descriptive statistics

Descriptive statistics summarize the data and are broken down into measures of central tendency (mean, median, and mode) and measures of variability (standard deviation, minimum/maximum values, range, kurtosis, and skewness). The example data used on this page is [3, 5, 7, 8, 8, 9, 10, 11].

Measures of Central Tendency

Mean

The average value of the data, calculated by adding all the measurements of a variable together and dividing that sum by the number of observations. The formula is displayed below.

$$ \bar{x} = \frac{\sum x}{n} $$

where $\bar{x}$ is the estimated average, $\sum$ indicates to add all the values in the data, $x$ represents the measurements, and $n$ is the total number of observations.

Calculating the mean using the example data:

$$ \bar{x} = \frac{3 + 5 + 7 + 8 + 8 + 9 + 10 + 11}{8} = 7.625 $$

Median

The middle value when the measurements are placed in ascending order. If there is no true midpoint, the median is calculated by adding the two midpoints together and dividing by 2. For the example data:

$$ \text{median} = \frac{8 + 8}{2} = 8 $$

Mode

The number that occurs the most in the set of measurements.

Measures of Variability

Variance

The sum of the squared deviations divided by the number of observations minus 1. This definition is an unbiased estimate of the population variance. Because it is calculated from squared deviations, variance is expressed in squared units rather than the original unit of measurement.

$$ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} $$

where $x_i$ is the $i^{th}$ value of the measurement, $\bar{x}$ is the estimated average, $\sum$ indicates to add all the values in the data, and $n$ is the total number of observations.

Standard deviation

The positive square root of the variance, $s = \sqrt{s^2}$. The standard deviation can be interpreted in the unit of measurement of the original observations.

Minimum value

The smallest value of the measurements.

Maximum value

The largest value of the measurements.

Range

The difference between the maximum and minimum values.

Kurtosis

A measure of the tailedness of a distribution.

Skew

A measure of the symmetry of the distribution of the data.

Descriptive Statistics with Python

There are a few ways to get descriptive statistics using Python. Below, we will show how to get them using Pandas and Researchpy. First, let's import an example data set.

Continuous variables

The describe() method returns many useful descriptive statistics, with a mix of measures of central tendency and measures of variability. This includes the number of non-missing observations; the mean; the standard deviation; the minimum value; the 25th, 50th (a.k.a. the median), and 75th percentiles; and the maximum value. It is missing some information that is typically desired regarding the mean, namely the standard error and the 95% confidence interval. No worries though: pairing this with Researchpy's summary_cont() method provides the descriptive statistics that are wanted; this method will be shown later.
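For context, here is a minimal sketch of the describe() call being discussed; the data and column name are made up, loosely echoing the blood-pressure example shown later on this page.

```python
import pandas as pd

# Hypothetical stand-in for the example data set imported above
df = pd.DataFrame({"bp_before": [143, 163, 153, 153, 146, 150, 148, 153]})

# Returns count, mean, std, min, 25th/50th/75th percentiles, and max
print(df["bp_before"].describe())
```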

Categorical variables

Using both the describe() and value_counts() methods is useful since they complement each other in the information they return. The describe() method reports that "Female" occurs more often than "Male", but one can see that is not the case, since both occur an equal number of times. For more information about these methods, please see the official documentation pages for describe() and value_counts() .
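A small sketch (with made-up data) showing why the two methods complement each other: describe() must pick a single "top" category even when counts are tied, while value_counts() reveals the tie.

```python
import pandas as pd

sex = pd.Series(["Female", "Male", "Female", "Male"], name="sex")

# Returns count, unique, top, and freq; with tied counts the
# reported "top" category is arbitrary
print(sex.describe())

# Shows that both categories actually occur equally often
print(sex.value_counts())
```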

Distribution measures

Pandas also provides the kurtosis() and skew() methods for these distribution measures. For more information, please see their official documentation pages for kurtosis() and skew() .
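A brief sketch of the two Pandas calls (the data is made up):

```python
import pandas as pd

s = pd.Series([3, 5, 7, 8, 8, 9, 10, 11])

print("kurtosis:", s.kurtosis())  # sample (excess) kurtosis
print("skew:", s.skew())          # sample skewness
```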

Variable N Mean SD SE 95% Conf. Interval
0 bp_before 120.0 156.45 11.389845 1.039746 154.391199 158.508801

Researchpy's summary_cont() method returns less overall information compared to the describe() method, but it returns more in-depth information regarding the mean: the non-missing count, mean, standard deviation (SD), standard error (SE), and the 95% confidence interval.

Variable Outcome Count Percent
0 sex Female 60 50.0
1 Male 60 50.0

The summary_cat() method returns the variable name, the non-missing count, and the percentage of each category of a variable. By default, the outcomes are sorted in descending order. For more information about these methods, please see the official documentation for summary_cont() and summary_cat() .
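A sketch of the two Researchpy calls discussed above. The DataFrame here is a made-up stand-in for the example data set; Researchpy must be installed (pip install researchpy).

```python
import pandas as pd
import researchpy as rp

df = pd.DataFrame({
    "bp_before": [143, 163, 153, 153, 146, 150],
    "sex": ["Female", "Male", "Female", "Male", "Female", "Male"],
})

# Continuous variable: N, mean, SD, SE, and the 95% confidence interval
print(rp.summary_cont(df["bp_before"]))

# Categorical variable: counts and percentages per category
print(rp.summary_cat(df["sex"]))
```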

MGHIHP HE-802, Spring 2021

Chapter 1 (Jan 19–24): Descriptive statistics and data distributions

This is the only chapter you need to look at for the first week of the course, which is January 19–24 2021. You can ignore all subsequent chapters in this e-book, for now.

Please read this entire chapter and then complete the assignment at the end.

I recommend that you quickly skim through the assignment at the end of the chapter before going through this chapter’s content, so that you know in advance the tasks that you will be expected to do.

This week, our goals are to…

Become familiar with this e-book.

Become familiar with basic descriptive statistics and how they are calculated.

Examine different types of data distributions.

Inspect and explore a dataset.

Modify data so that it is ready to be used to answer a research question.

1.1 Course Basics

Since this is the first week of the course, we will go over a few administrative and workflow items now. Please start by watching the following embedded video:

The video above can also be viewed externally at https://youtu.be/09COx-n6m8I .

Below is some information about how to use this e-book and how the course will work on a weekly basis.

Chapter 0 of this e-book contains information about the course. You can refer to it on an as-needed basis. The course calendar and final project sections of Chapter 0 might be particularly useful.

Throughout this e-book, you will find many links that you can click on. These are marked in blue underlined text like this . Any time you see blue text like that, keep in mind that you can click on it! You can practice by clicking here .

Each week of the course, a chapter of this e-book will be assigned. This week, the assigned chapter is Chapter 1, the one you are reading right now. You can look up which chapter is assigned for each week in the course calendar in Chapter 0. This week, you should read through all of the text in this chapter, watch any videos, and then complete the assignment at the end of the chapter. Everything you need to do this week is exclusively in this chapter. And you will do the same for chapters assigned in future weeks.

The due date of each assignment is usually Sunday night at the end of each week. Assignment due dates are given in the course calendar in Chapter 0. Assignments should be submitted in D2L in the appropriate dropbox for each chapter’s work. Any other submission instructions will be included in each individual assignment.

  • If you find that the work for a given week is taking you too long, I recommend that you stop and send an e-mail to all course instructors. For example, if you get stuck on the first few tasks in a particular week, I recommend that you stop at that point (or even sooner) and contact us. We can meet on Zoom and work through a section of the content together so that you finish at a pace that is reasonable for your schedule. Please overcommunicate with us rather than undercommunicate. Do not hesitate to contact us!

1.2 Basic Concepts

We will begin by looking at a few foundational concepts in quantitative analysis. Please open the following PowerPoint document and read the first 20 slides:

  • Week 1 Basic Concepts

Once you have read the first 20 slides at the link above, follow along with the remaining slides (page #21 and up) as needed as you read the rest of the material below in this chapter.

1.3 Descriptive Statistics

Descriptive statistics help us rapidly understand key characteristics of any data we have. We will be using them throughout the course, so it is important to become comfortable with them, starting this week. The key descriptive statistics are: arithmetic mean (often just called the mean), median, mode, range, and standard deviation. Please use the resources below to learn or review how all of these are calculated and what they mean.

Let’s begin by watching this video about mean, median, and mode: 5

The video above can also be viewed externally at https://www.youtube.com/watch?v=k3aKKasOmIw .

As an optional (not required) supplement to the video above, you can click the links below to read more about mean, median, and mode at their respective Wikipedia pages, if you would like:

  • Arithmetic Mean (often just called the mean)
  • Median
  • Mode

Next, please continue by watching this video about how to calculate standard deviation: 6

The video above can be viewed externally at https://www.youtube.com/watch?v=IaTFpp-uzp0 .

Note that in the context of the video above, a sample just means a group of numbers. The sample standard deviation just means the standard deviation of a group of numbers.

As an optional (not required) supplement to the video above, you can click the links below to learn more about standard deviation, if you would like:

  • Standard deviation Howcast video – This video is a bit rushed and can be good just to refresh how to calculate standard deviation, rather than learn it for the first time.
  • Standard deviation on Wikipedia – This page contains some good descriptions of standard deviation as well as another calculation example.

Finally, you should also know how to calculate the range of a set of data. The range is simply the difference between the smallest and largest value in a sample of data or group of numbers. The video below explains how to calculate the range: 7

The video above can also be viewed externally at https://www.youtube.com/watch?v=0HS1P3vhNBU .
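If you prefer to check the hand calculations from the videos in code, here is a minimal Python sketch (the scores are made up):

```python
import statistics

scores = [4, 8, 6, 5, 3, 7]  # a hypothetical sample

print("mean:", statistics.mean(scores))
print("sample standard deviation:", statistics.stdev(scores))  # n - 1 denominator
print("range:", max(scores) - min(scores))
```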

Now that we have learned about how to describe a set of numbers numerically, let’s move on to some visual ways in which we can describe and learn more about a set of numbers.

1.4 Data Distributions

1.4.1 Histograms

Histograms are an extremely useful tool to describe data. Histograms simply count the number of values that are in your data within selected intervals.

Please watch the following video to familiarize yourself with how a histogram is made: 8

The video above can be viewed externally at https://www.youtube.com/watch?v=gSEYtAjuZ-Y .

As an optional (not required) supplement to the video above, you can click the link below to learn more about histograms, if you would like:

  • Histogram: Make a Chart in Easy Steps

1.4.2 Types of distributions

Now we will examine a few examples related to histograms and data distributions. Let’s start by watching the following video, which reviews a number of possible data distributions: 9

The video above can be viewed externally at https://www.youtube.com/watch?v=Y53_8WRrPzg .

1.4.3 Normal distributions

Imagine that we measured the heights of hundreds or thousands of people. It is likely that our histogram of all the heights would look like this: 10


This is called a normal distribution . Normal distributions can be spread out wide or very compact, but they all are tallest in the middle and shortest at the ends (the tails). They can all be characterized by a mean and standard deviation. Some examples are below.

Below is a normal distribution with 10000 samples (10000 measurements of something), mean = 50, and standard deviation = 5. You could pretend this is data on the number of questions that 10000 people got correct on a test: the average score was 50, and the average deviation from that score was 5. The minimum score appears to be about 30 and the highest around 70 or 80.
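For readers who want to reproduce a figure like this one, here is a minimal sketch using NumPy and Matplotlib (the random seed is arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
samples = rng.normal(loc=50, scale=5, size=10000)  # mean = 50, SD = 5

plt.hist(samples, bins=50)
plt.xlabel("Value")
plt.ylabel("Count")
plt.title("Normal distribution: mean = 50, SD = 5, n = 10000")
plt.show()
```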


Here below is another with 10000 samples, mean = 50, standard deviation = 1. You can see that this one is much more compact (which I have emphasized by keeping the x-axis range the same as above). You could pretend that these are the lengths of hand-manufactured walking sticks that are meant to be 50 inches in length but aren’t always perfect.


And finally here is another normal distribution with 10000 samples, mean = 50, and standard deviation = 50. The next two histograms both show the same distributions, but with different x-axis ranges and buckets. I’m not sure what this could be an example of!


All three of these are normal distributions, each characterized by a different mean and standard deviation. In a normal distribution that is balanced on both sides, like these ones, the mean will be (at least approximately) the same as the median and the mode of that distribution.

1.4.4 z-scores

We will end our section on data distributions by combining many of the topics we covered above. z-scores combine our knowledge of mean, standard deviation, and data distributions to allow us to understand our data better. Please watch the following video about how z-scores are calculated: 11

The video above can be viewed externally at https://www.youtube.com/watch?v=5S-Zfa-vOXs .
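As a quick illustration of the formula from the video, z = (x − mean) / standard deviation, here is a minimal sketch with made-up data:

```python
import statistics

scores = [15, 3, 12, 0, 24, 3]  # hypothetical data

mean = statistics.mean(scores)
sd = statistics.stdev(scores)

# Each z-score is the number of standard deviations a value
# lies above (positive) or below (negative) the mean
z_scores = [round((x - mean) / sd, 2) for x in scores]
print(z_scores)
```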

These optional resources might also be useful as you think more about z-scores, if you would like to see some more explanations and examples:

  • Z-score example from Khan Academy
  • An introduction to z scores
  • UConn Education Research Basics: Normal Distribution
  • David Lane’s Z-Table

1.5 Visualizing and Inspecting Your Data

The first step of data analysis should be to become familiar with your data through descriptive statistics and graphing (often using visualizations such as histograms, scatterplots, boxplots, and more).

Below are some guidelines to keep in mind as you learn about your data: 12

  • Make sure that the values for each of your variables are valid (this includes checking for data entry errors or values outside the possible range for a variable).
  • Check to see if variables are normally distributed. This is often important to know as you determine which types of statistical tests you can and cannot run on your data.
  • Get a feel for how much variability you have in your variables. The descriptive statistics/characteristics we looked at above can be useful for this (especially mean, standard deviation, and shape of a variable’s distribution).
  • Check for floor or ceiling effects. For example, if your data comes from an educational assessment tool to which students responded, were questions so easy or so hard that people got them all correct or all wrong?
  • If you have demographic variables, look at the characteristics of your sample (this affects generalizability, or how representative of a larger population your data are or are not).
  • Identify outliers and make preliminary determination of how you plan to handle them.
  • Once you are confident you have a clean dataset, you can score and/or code any variables that are not already ready for analysis. Then you should double-check you have done those calculations correctly, usually by looking in your data spreadsheet and generating many two-way tables.

We will be practicing all of these guidelines throughout this course. Again, the first step of data analysis should be to become familiar with the data.

1.5.1 Outliers

Another important consideration as you prepare to do data analysis is to determine if there are any outliers in your data. If there are, you have to decide how to handle them. Descriptive statistics and charts are especially useful in detecting outliers.

Below are some common options you have as you decide how to handle outliers in your data: 13

  • Remove (exclude) observations 14 that are outliers from your analysis.
  • Transform the data, if there is a possible and reasonable transformation that would mitigate any problematic effects of the outliers.
  • Change nothing and run your analyses as initially planned.

The strategy you choose to deal with outliers will depend on a lot of factors, and you need to think carefully about how you plan to handle extreme values in your analysis (and you will need to justify this in any findings you report). This will vary from dataset to dataset. You need to figure out what makes the most sense in the context of the research question you are trying to address. This decision-making process does not have any definitive rules. Instead, you will gain experience gradually that will help you decide how to handle outliers.

1.6 Basic data analysis in spreadsheet software

In this section, we will introduce a few techniques for rapidly learning about a dataset using Microsoft Excel or similar spreadsheet software such as Google Sheets or OpenOffice. A spreadsheet can be an excellent way to learn about your data and prepare it for more rigorous quantitative analysis in other software.

Below, we will learn how to calculate descriptive statistics, make selected charts, identify outliers, and sort data within a spreadsheet.

1.6.1 Descriptive statistics

Calculating descriptive statistics rapidly is an extremely valuable skill. The video below demonstrates how to do this in Excel: 15

The video above can be viewed externally at https://www.youtube.com/watch?v=-tFWH7AYLek .

Alternatively, the resources below show a different method of rapidly generating descriptive statistics in Excel:

  • Descriptive Statistics. Easy Excel. https://www.excel-easy.com/examples/descriptive-statistics.html .
  • Descriptive Statistics using “Data Analysis” tool in Excel. ozanteaching. YouTube. https://www.youtube.com/watch?v=5MFjwM6K5Sg .

1.6.2 Boxplot

Boxplots are a very useful tool for visualizing our data. Before we see how to create them in Excel, you can watch the video below if you would like, to understand how they are constructed: 16

The video above can be viewed externally at https://www.youtube.com/watch?v=fHLhBnmwUM0 .

And now let’s see how to make a boxplot in Excel: 17

The video above can be viewed externally at https://www.youtube.com/watch?v=BcwFD0ICOf0 .

1.6.3 Identify outliers

The following video shows one effective way to identify outliers in your data using Excel: 18

The video above can be viewed externally at https://www.youtube.com/watch?v=dt1MQmMaf4E .

Note that boxplots can also help you identify outliers.

1.6.4 Histogram

The video below shows how to make a histogram in Excel: 19

The video above is available externally at https://www.youtube.com/watch?v=is14ehdy7jo .

As an optional additional resource, you can read about how to make a histogram in Excel here:

  • Histogram. Easy Excel. https://www.excel-easy.com/examples/histogram.html .

1.6.5 Sorting data

Sometimes, it might be useful to reorder the observations (rows) of your data in a spreadsheet, based on one variable (column) of the data. The video below demonstrates this process in Excel: 20

The video above can be viewed externally at https://www.youtube.com/watch?v=KS9N4yAjuYQ .

As an optional additional resource, you can read about sorting data in Excel here:

  • Excel 2016 - Sorting Data. GCF Global. https://edu.gcfglobal.org/en/excel2016/sorting-data/1/ .

1.7 Assignment

Please complete and submit all of the tasks, which are presented in separate sections below. You can put your responses into a Word document or similar type of document. Any submission format is fine.

Do not hesitate to contact course instructors with any questions or concerns, big or small.

This assignment is due on the night of Sunday January 24 2021.

1.7.1 Conceptual Questions

The tasks below relate the statistical concepts that you learned about in this chapter to quantitative research design.

Task 1 : What are Z-scores?

Task 2 : Describe one situation where it would be more useful to use z-scores than raw scores when analyzing data.

Task 3 : Using your own words, what does the standard deviation tell us, and how is it calculated?

1.7.2 Exploring Data

In this part of the assignment, you will use a dataset called Autism Stigma Data. Click here to access this dataset . Download and then open it in a spreadsheet program of your choice. 21

Here is some background on the dataset: A team of researchers has developed an intervention to reduce mental illness stigma among primary care physicians. The intervention is an online training program designed to address common misconceptions and stereotypes about mental illness. The researchers used a pre-post design and were interested in changes in both mental illness stigma (operationalized as attitudes towards people with mental illness, measured in the socdist_pre and socdist_post variables in the dataset) and knowledge (operationalized as mental health literacy, measured in the know_pre and know_post variables in the dataset). They also collected demographic information from all participants.

Task 4 : Familiarize yourself with the dataset. Look at the variable names, types of variables, labels, value labels, etc. How many variables (columns) are there in this data?

Task 5 : What is the primary research question the researchers are trying to address with this study?

Task 6 : What is the main independent variable the researchers are interested in?

Task 7 : What are the primary dependent variable(s) the researchers are interested in?

Task 8 : What are the levels of measurement of each of the variables in the dataset?

For the next few questions on outliers, in addition to visually inspecting the data in a spreadsheet, you might find it useful to make a box plot.

Task 9 : Are there any outliers or data entry errors in the dataset? Also explain how you made that determination. Present at least one box plot as part of your answer.

Task 10 : How did you decide whether a value was an outlier?

Task 11 : For any data entry errors: What kinds of error(s) did you find? How did you decide to “fix” the error?

Before moving forward to the next questions, make sure you are now working with a “clean” dataset in which you have handled any outliers.

Task 12 : Report the mean and standard deviation for the pre and post stigma and knowledge variables.

Task 13 : Do the stigma and knowledge variables appear to be normally distributed? Use histograms to make a determination and explain your reasoning.

Now, split the data by gender. You can do this by sorting the data by the gender column. Then, copy all of the rows for women into a new spreadsheet and all of the rows for men into yet another empty spreadsheet. Now you are ready to analyze the data separately for men and women.

Task 14 : What are the mean and standard deviation for men and women (separately) for each of the stigma and knowledge variables? Are these variables normally distributed? Show the histograms for men and women for each of the stigma and knowledge variables.

Task 15 : Based on your preliminary examination of the data, what do you think the next steps should be to answer the research question?

1.7.3 Follow up and submission

You have now reached the end of this week’s assignment. The tasks below will guide you through submission of the assignment and allow us to gather questions and/or feedback from you.

Task 16 : Please write any questions you have for the course instructors (optional).

Task 17 : Please write any feedback you have about the instructional materials (optional).

Task 18 : Please submit your assignment (multiple files are fine) to the D2L assignment drop-box corresponding to this chapter and week of the course. Please e-mail all instructors if you experience any difficulty with this process.

Finding mean, median, and mode. Khan Academy. YouTube. Nov 14 2011. https://www.youtube.com/watch?v=k3aKKasOmIw . ↩︎

How To Calculate The Standard Deviation. The Organic Chemistry Tutor. Sep 26 2019. YouTube. https://www.youtube.com/watch?v=IaTFpp-uzp0 . ↩︎

Finding the Range | How to Find the Range of a Data Set. Math with Mr. J. April 28 2020. YouTube. https://www.youtube.com/watch?v=0HS1P3vhNBU . ↩︎

How to create a histogram. Khan Academy. February 4 2015. YouTube. https://www.youtube.com/watch?v=gSEYtAjuZ-Y . ↩︎

Classifying shapes of distributions. Khan Academy. June 22 2018. YouTube https://www.youtube.com/watch?v=Y53_8WRrPzg . ↩︎

Image source: https://i.stack.imgur.com/hvTdo.png ↩︎

Z-score introduction. Khan Academy. June 14 2018. YouTube. https://www.youtube.com/watch?v=5S-Zfa-vOXs . ↩︎

This list was initially provided by Dr. Annie Fox at MGH Institute of Health Professions. It has been slightly modified. ↩︎

Remember, an observation is a row of your data when it is in a spreadsheet. A row of data can be a person, an organization, a group, a car, or anything else about which data has been collected. An observation is also sometimes called a data point . ↩︎

Excel - Basic Descriptive Statistics (Mean, Variance, Standard Devation, etc.). Jalayer Academy. January 12 2012. YouTube. https://www.youtube.com/watch?v=-tFWH7AYLek . ↩︎

Starmer, Josh. Boxplots, Clearly Explained. StatQuest with Josh Starmer. July 10 2017. YouTube. https://www.youtube.com/watch?v=fHLhBnmwUM0 ↩︎

Creating a boxplot in Microsoft Excel 365. People Analytics Alaska. December 28 2019. YouTube. https://www.youtube.com/watch?v=BcwFD0ICOf0 . ↩︎

How to Find Outliers with Excel. Absent Data Channel. December 3 2017. YouTube. https://www.youtube.com/watch?v=dt1MQmMaf4E . ↩︎

Price, Ryan. How to Make a Histogram in Excel 2016. August 27 2016. YouTube. https://www.youtube.com/watch?v=is14ehdy7jo . ↩︎

Excel 2013: Sorting Data. GCFLearnFree.org. November 26 2013. YouTube. https://www.youtube.com/watch?v=KS9N4yAjuYQ . ↩︎

You can download and then open the file in Excel or a similar program on your own computer. You might also be able to click on the “Add to My Drive” button to copy it into your own Google Drive account and then open it in Google Sheets. ↩︎


Descriptive Statistics: Definition, Overview, Types, and Examples


Descriptive statistics are brief informational coefficients that summarize a given data set, which can be either a representation of the entire population or a sample of a population. Descriptive statistics are broken down into measures of central tendency and measures of variability (spread). Measures of central tendency include the mean , median , and mode , while measures of variability include standard deviation , variance , minimum and maximum variables, kurtosis , and skewness .

Key Takeaways

  • Descriptive statistics summarizes or describes the characteristics of a data set.
  • Descriptive statistics consists of three basic categories of measures: measures of central tendency, measures of variability (or spread), and frequency distribution.
  • Measures of central tendency describe the center of the data set (mean, median, mode).
  • Measures of variability describe the dispersion of the data set (variance, standard deviation).
  • Measures of frequency distribution describe the occurrence of data within the data set (count).


Understanding Descriptive Statistics

Descriptive statistics help describe and explain the features of a specific data set by giving short summaries about the sample and measures of the data. The most recognized types of descriptive statistics are measures of center. For example, the mean, median, and mode, which are used at almost all levels of math and statistics, are used to define and describe a data set. The mean, or the average, is calculated by adding all the figures within the data set and then dividing by the number of figures within the set.

For example, the sum of the following data set is 20: (2, 3, 4, 5, 6). The mean is 4 (20/5). The mode of a data set is the value appearing most often, and the median is the figure situated in the middle of the data set. It is the figure separating the higher figures from the lower figures within a data set. However, there are less common types of descriptive statistics that are still very important.

People use descriptive statistics to distill hard-to-understand quantitative insights from a large data set into bite-sized descriptions. A student's grade point average (GPA), for example, provides a good illustration of descriptive statistics. The idea of a GPA is that it takes data points from a range of individual course grades and averages them together to provide a general understanding of a student's overall academic performance. A student's personal GPA reflects their mean academic performance.

Descriptive statistics, especially in fields such as medicine, often visually depict data using scatter plots, histograms, line graphs, or stem and leaf displays. We'll talk more about visuals later in this article.

Types of Descriptive Statistics

Most descriptive statistics are either measures of central tendency or measures of variability (also known as measures of dispersion), with measures of frequency distribution making up the third category.

Central Tendency

Measures of central tendency focus on the average or middle values of data sets, whereas measures of variability focus on the dispersion of data. These two measures use graphs, tables, and general discussions to help people understand the meaning of the analyzed data.

Measures of central tendency describe the center position of a distribution for a data set. A person analyzes the frequency of each data point in the distribution and describes it using the mean, median, or mode, which measures the most common patterns of the analyzed data set.

Measures of Variability

Measures of variability (or measures of spread) aid in analyzing how dispersed the distribution is for a set of data. For example, while the measures of central tendency may give a person the average of a data set, it does not describe how the data is distributed within the set.

So while the average of the data might be 65 out of 100, there can still be data points at both 1 and 100. Measures of variability help communicate this by describing the shape and spread of the data set. Range, quartiles , absolute deviation, and variance are all examples of measures of variability.

Consider the following data set: 5, 19, 24, 62, 91, 100. The range of that data set is 95, which is calculated by subtracting the lowest number (5) in the data set from the highest (100).

Distribution

Distribution (or frequency distribution) refers to the number of times a data point occurs. Alternatively, it can be how many times a data point fails to occur. Consider this data set: male, male, female, female, female, other. The distribution of this data can be classified as:

  • The number of males in the data set is 2.
  • The number of females in the data set is 3.
  • The number of individuals identifying as other is 1.
  • The number of non-males is 4.

In descriptive statistics, univariate data analyzes only one variable. It is used to identify characteristics of a single trait and is not used to analyze any relationships or causations.

For example, imagine a room full of high school students. Say you wanted to gather the average age of the individuals in the room. This univariate data is only dependent on one factor: each person's age. By gathering this one piece of information from each person and dividing by the total number of people, you can determine the average age.

Bivariate data, on the other hand, attempts to link two variables by searching for correlation. Two types of data are collected, and the relationship between the two pieces of information is analyzed together. When more than two variables are analyzed together, the approach is referred to as multivariate.

Let's say each high school student in the example above takes a college assessment test, and we want to see whether older students are testing better than younger students. In addition to gathering the ages of the students, we need to find out each student's test score. Then, using data analytics, we mathematically or graphically depict whether there is a relationship between student age and test scores.

The preparation and reporting of financial statements is an example of descriptive statistics. Analyzing that financial information to make decisions on the future is inferential statistics.

One essential aspect of descriptive statistics is graphical representation. Visualizing data distributions effectively can be incredibly powerful, and this is done in several ways.

Histograms are tools for displaying the distribution of numerical data. They divide the data into bins or intervals and represent the frequency or count of data points falling into each bin through bars of varying heights. Histograms help identify the shape of the distribution, central tendency, and variability of the data.

Another visualization is boxplots. Boxplots, also known as box-and-whisker plots, provide a concise summary of a data distribution by highlighting key summary statistics, including the median (the middle line inside the box), the quartiles (the edges of the box), and potential outliers (points plotted beyond the whiskers). Boxplots visually depict the spread and skewness of the data and are particularly useful for comparing distributions across different groups or variables.

Whenever descriptive statistics are being discussed, it's important to note outliers. Outliers are data points that significantly differ from other observations in a dataset. These could be errors, anomalies, or rare events within the data.

Detecting and managing outliers is a step in descriptive statistics to ensure accurate and reliable data analysis. To identify outliers, you can use graphical techniques (such as boxplots or scatter plots) or statistical methods (such as Z-score or IQR method). These approaches help pinpoint observations that deviate substantially from the overall pattern of the data.
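As an illustration, here is a minimal Python sketch of both rules on made-up data. The cutoffs used (1.5 × IQR beyond the quartiles; |z| > 2) are common conventions, not fixed laws:

```python
import numpy as np

data = np.array([2, 3, 3, 4, 5, 5, 6, 42])

# IQR rule: flag points beyond 1.5 * IQR outside the quartiles
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

# Z-score rule: flag points more than 2 standard deviations from the mean
z = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z) > 2]

print("IQR outliers:", iqr_outliers)    # [42]
print("z-score outliers:", z_outliers)  # [42]
```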

The presence of outliers can have a notable impact on descriptive statistics, skewing results and affecting the interpretation of data. Outliers can disproportionately influence measures of central tendency, such as the mean, pulling it towards their extreme values. For example, the mean of the dataset (1, 1, 1, 997) is 250, even though that is hardly representative of the dataset. This distortion can lead to misleading conclusions about the typical behavior of the dataset.

Depending on the context, outliers can often be treated by removing them (if they are genuinely erroneous or irrelevant). Alternatively, outliers may hold important information and should be kept for the insight they provide. As you analyze your data, consider what the outliers can contribute and whether it makes more sense to strike those data points from your descriptive statistic calculations.

Descriptive Statistics vs. Inferential Statistics

Descriptive statistics have a different function from inferential statistics, which are data sets that are used to make decisions or apply characteristics from one data set to another.

Imagine another example where a company sells hot sauce. The company gathers data such as the count of sales, average quantity purchased per transaction, and average sale per day of the week. All of this information is descriptive, as it tells a story of what actually happened in the past. In this case, it is not being used beyond being informational.

Now let's say that the company wants to roll out a new hot sauce. It gathers the same sales data above, but it uses the information to make predictions about what the sales of the new hot sauce will be. The act of using descriptive statistics and applying characteristics to a different data set makes the data set inferential statistics. We are no longer simply summarizing data; we are using it to predict what will happen regarding an entirely different body of data (in this case, the new hot sauce product).

What Is Descriptive Statistics?

Descriptive statistics is a means of describing features of a data set by generating summaries about data samples. For example, a population census may include descriptive statistics regarding the ratio of men and women in a specific city.

What Are Examples of Descriptive Statistics?

In recapping a Major League Baseball season, for example, descriptive statistics might include team batting averages, the number of runs allowed per team, and the average wins per division.

What Is the Main Purpose of Descriptive Statistics?

The main purpose of descriptive statistics is to provide information about a data set. In the example above, there are dozens of baseball teams, hundreds of players, and thousands of games. Descriptive statistics summarizes large amounts of data into useful bits of information.

What Are the Types of Descriptive Statistics?

The three main types of descriptive statistics are frequency distribution, central tendency, and variability of a data set. The frequency distribution records how often data occurs, central tendency records the data's center point of distribution, and variability of a data set records its degree of dispersion.

Can Descriptive Statistics Be Used to Make Inferences or Predictions?

Technically speaking, descriptive statistics only serves to help understand historical data attributes. Inferential statistics—a separate branch of statistics—is used to understand how variables interact with one another in a data set and possibly predict what might happen in the future.

Descriptive statistics refers to the analysis, summary, and communication of findings that describe a data set. While rarely used on their own for decision-making, descriptive statistics hold value in providing high-level summaries of a set of information, such as its mean, median, mode, variance, range, and count.



What Is Descriptive Statistics?

Descriptive statistics is a statistical measure used to describe data through numbers, like mean, median and mode. Here’s how to calculate them.

Satyapriya Chaudhari

In this article, I’ll help you understand the difference between descriptive statistics and inferential statistics. Then we’ll walk through some examples of descriptive statistics and how you can calculate them yourself.

What Is Statistics?

Statistics is the science of collecting and analyzing data from a sample in order to infer properties of the population it represents. In other words, statistics is interpreting data in order to make predictions about the population.

There are two branches of statistics.

  • Descriptive Statistics : Descriptive statistics is a statistical measure that describes data.
  • Inferential Statistics : You practice inferential statistics when you use a random sample of data taken from a population to describe and make inferences about the population.

Descriptive Statistics vs. Inferential Statistics?

Descriptive statistics summarize data through certain numbers like mean, median, mode, etc. so as to make it easier to understand and interpret the data. Descriptive statistics don’t involve any generalization or inference beyond what is immediately available. This means that the descriptive statistics represent the available data (sample) and aren’t based on any theory of probability.

Commonly Used Measures

  • Measures of central tendency
  • Measures of dispersion (or variability)

What Are the Measures of Central Tendency?

A measure of central tendency is a one-number summary of the data that typically describes the center of the data. This one-number summary is of three types.

1. What Is the Mean?

Mean is the ratio of the sum of all observations in the data to the total number of observations. This is also known as average. Thus, mean is a number around which the entire data set is spread.

More on Statistics Statistical Tests: When to Use T-Test, Chi-Square and More

2. What Is the Median?  

Median is the point which divides the entire data into two equal halves. One half of the data is less than the median and the other half is greater than the median. Median is calculated by first arranging the data in either ascending or descending order.

  • If the number of observations is odd, median is given by the middle observation in the sorted form.
  • If the number of observations is even, the median is given by the mean of the two middle observations in the sorted form.

An important point to note is that the order of the data (ascending or descending) does not affect the median.

3. What Is the Mode? 

Mode is the number that has the maximum frequency in the entire data set. In other words, mode is the number that appears most often. A data set can have one mode or more than one mode.

  • If there is only one number that appears the most number of times, the data has one mode, and is called uni-modal .
  • If there are two numbers that appear equally frequently, the data has two modes, and is called bi-modal .
  • If there are more than two numbers that appear equally frequently, the data has more than two modes. We call that multi-modal .

How to Find the Mean, Median and Mode

Consider the following data points:

17, 16, 21, 18, 15, 17, 21, 19, 11, 23

We calculate the mean as:

$$ \bar{x} = \frac{17 + 16 + 21 + 18 + 15 + 17 + 21 + 19 + 11 + 23}{10} = \frac{178}{10} = 17.8 $$

To calculate the median, let’s arrange the data in ascending order:

11, 15, 16, 17, 17, 18, 19, 21, 21, 23

Since the number of observations is even (10), the median is given by the average of the two middle observations (the fifth and sixth here): (17 + 18)/2 = 17.5.

Mode is given by the number that occurs the greatest number of times. Here, 17 and 21 both occur twice. Hence, this is a bi-modal data set and the modes are 17 and 21.
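These results are easy to verify in code. A minimal sketch using Python's standard library (multimode requires Python 3.8+):

```python
import statistics

data = [17, 16, 21, 18, 15, 17, 21, 19, 11, 23]

print(statistics.mean(data))       # 17.8
print(statistics.median(data))     # 17.5
print(statistics.multimode(data))  # [17, 21] -> a bi-modal data set
```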

A few things to note: 

  • Since median and mode don’t consider all the data points for calculations, median and mode are robust against outliers (i.e., these are not affected by outliers).
  • At the same time, mean shifts toward the outlier as it considers all the data points. This means if the outlier is big, mean overestimates the data, and if it is small, the data is underestimated.
  • If the distribution is symmetrical and unimodal, mean = median = mode. This is the case, for example, for the normal distribution.

More on Data How to Find Outliers With IQR Using Python

What Are Measures of Dispersion?

Measures of dispersion describe the spread of the data around the central value (or the measures of central tendency).

7 Measures of Dispersion

  • Absolute deviation from mean
  • Variance
  • Standard deviation
  • Range
  • Quartiles
  • Skewness
  • Kurtosis

1.  Absolute Deviation From Mean

The absolute deviation from the mean, also called mean absolute deviation (MAD), describes the variation in the data set. In a sense, it tells you the average absolute distance of each data point from the mean. We calculate it as:

$$ \text{MAD} = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}| $$

2. Variance

Variance measures how far data points spread out from the mean. A high variance indicates that data points are spread widely, and a small variance indicates that the data points are closer to the data set’s mean. We calculate it (using the unbiased n − 1 denominator) as:

$$ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} $$

3. Standard Deviation

The square root of the variance is called the standard deviation. We calculate it as:

$$ s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} $$
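A minimal sketch computing these three measures of dispersion for the data set used above (the standard library has no built-in MAD, so it is computed by hand):

```python
import statistics

data = [17, 16, 21, 18, 15, 17, 21, 19, 11, 23]
mean = statistics.mean(data)

mad = sum(abs(x - mean) for x in data) / len(data)  # mean absolute deviation
variance = statistics.variance(data)                # n - 1 denominator
sd = statistics.stdev(data)                         # square root of the variance

print(mad, variance, sd)
```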

Dive Into Probability Distributions 4 Probability Distributions Every Data Scientist Needs to Know

4. Range

Range is the difference between the maximum value and the minimum value in the data set. It is given as:

$$ \text{Range} = x_{\max} - x_{\min} $$

5. Quartiles

Quartiles are the points in the data set that divides the data set into four equal parts. Q1, Q2 and Q3 are the first, second and third quartile of the data set.

  • 25 percent of the data points lie below Q1 and 75 percent lie above it.
  • 50 percent of the data points lie below Q2 and 50 percent lie above it. Q2 is simply the median.
  • 75 percent of the data points lie below Q3 and 25 percent lie above it.
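A quick sketch computing the quartiles of the sorted data set used earlier. Note that NumPy's default linear interpolation is only one of several quartile conventions, so results can differ slightly from hand methods:

```python
import numpy as np

data = [11, 15, 16, 17, 17, 18, 19, 21, 21, 23]

q1, q2, q3 = np.percentile(data, [25, 50, 75])
print(q1, q2, q3)  # q2 equals the median (17.5)
```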

More on Quartiles Calculating Quartiles: A Step-by-Step Explaination

6. Skewness

Skewness measures the asymmetry in a probability distribution. Skewness can be positive, negative or undefined. We’ll focus on positive and negative skew.

  • Positive Skew: This is the case when the tail on the right side of the curve is bigger than that on the left side. For these distributions, the mean is greater than the mode.
  • Negative Skew: This is the case when the tail on the left side of the curve is bigger than that on the right side. For these distributions, mean is smaller than the mode.

A commonly used, moment-based way of calculating skewness is:

$$ \text{skewness} = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^3}{s^3} $$

If the skewness is zero, the distribution is symmetrical. If it is negative, the distribution is negatively skewed and if it is positive, it is positively skewed.

7. Kurtosis

Kurtosis describes whether the data is light tailed (lack of outliers) or heavy tailed (outliers present) when compared to a normal distribution. There are three kinds of kurtosis:

  • Mesokurtic: This is the case when the excess kurtosis is zero, as for the normal distribution.
  • Leptokurtic: This is when the tail of the distribution is heavy (outliers present) and the kurtosis is higher than that of the normal distribution.
  • Platykurtic: This is when the tail of the distribution is light (no outliers) and the kurtosis is lower than that of the normal distribution.
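A small sketch illustrating both measures with SciPy on a made-up, right-skewed sample:

```python
from scipy import stats

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]  # long tail on the right

print(stats.skew(right_skewed))      # positive -> right (positive) skew
print(stats.kurtosis(right_skewed))  # excess kurtosis, 0 for a normal curve
```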


Returning to the library-visits data used earlier in this document:

Mode number of library visits
Ordered data set 0, 3, 3, 12, 15, 24
Mode Find the most frequently occurring response: 3

Measures of variability give you a sense of how spread out the response values are. The range, standard deviation and variance each reflect different aspects of spread.

The range gives you an idea of how far apart the most extreme response scores are. To find the range , simply subtract the lowest value from the highest value.

Standard deviation

The standard deviation ( s or SD ) is the average amount of variability in your dataset. It tells you, on average, how far each score lies from the mean. The larger the standard deviation, the more variable the data set is.

There are six steps for finding the standard deviation:

  • List each score and find their mean.
  • Subtract the mean from each score to get the deviation from the mean.
  • Square each of these deviations.
  • Add up all of the squared deviations.
  • Divide the sum of the squared deviations by N – 1.
  • Find the square root of the number you found.
Raw data Deviation from mean Squared deviation
15 15 – 9.5 = 5.5 30.25
3 3 – 9.5 = -6.5 42.25
12 12 – 9.5 = 2.5 6.25
0 0 – 9.5 = -9.5 90.25
24 24 – 9.5 = 14.5 210.25
3 3 – 9.5 = -6.5 42.25
Mean = 9.5 Sum = 0 Sum of squares = 421.5

Step 5: 421.5/5 = 84.3

Step 6: √84.3 = 9.18
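You can verify the six steps with Python's standard library:

```python
import statistics

data = [15, 3, 12, 0, 24, 3]

print(statistics.variance(data))  # 84.3, the result of step 5
print(statistics.stdev(data))     # ~9.18, the result of step 6
```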

The variance is the average of squared deviations from the mean. Variance reflects the degree of spread in the data set. The more spread the data, the larger the variance is in relation to the mean.

To find the variance, simply square the standard deviation. The symbol for variance is s 2 .

Univariate descriptive statistics focus on only one variable at a time. It’s important to examine data from each variable separately using multiple measures of distribution, central tendency and spread. Programs like SPSS and Excel can be used to easily calculate these.

Visits to the library
N 6
Mean 9.5
Median 7.5
Mode 3
Standard deviation 9.18
Variance 84.3
Range 24

If you were to only consider the mean as a measure of central tendency, your impression of the “middle” of the data set can be skewed by outliers, unlike the median or mode.

Likewise, while the range is sensitive to outliers , you should also consider the standard deviation and variance to get easily comparable measures of spread.

If you’ve collected data on more than one variable, you can use bivariate or multivariate descriptive statistics to explore whether there are relationships between them.

In bivariate analysis, you simultaneously study the frequency and variability of two variables to see if they vary together. You can also compare the central tendency of the two variables before performing further statistical tests .

Multivariate analysis is the same as bivariate analysis but with more than two variables.

Contingency table

In a contingency table, each cell represents the intersection of two variables. Usually, an independent variable (e.g., gender) appears along the vertical axis and a dependent one appears along the horizontal axis (e.g., activities). You read “across” the table to see how the independent and dependent variables relate to each other.

Number of visits to the library in the past year
Group 0–4 5–8 9–12 13–16 17+
Children 32 68 37 23 22
Adults 36 48 43 83 25

Interpreting a contingency table is easier when the raw data is converted to percentages. Percentages make each row comparable to the other by making it seem as if each group had only 100 observations or participants. When creating a percentage-based contingency table, you add the N for each independent variable on the end.

Visits to the library in the past year (Percentages)
Group 0–4 5–8 9–12 13–16 17+ N
Children 18% 37% 20% 13% 12% 182
Adults 15% 20% 18% 35% 11% 235

From this table, it is clearer that similar proportions of children and adults go to the library over 17 times a year. Additionally, children most commonly went to the library between 5 and 8 times, while for adults, this number was between 13 and 16.
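A sketch of the same conversion in pandas, using the counts from the table above: dividing each row by its row total turns counts into row percentages.

```python
import pandas as pd

counts = pd.DataFrame(
    {"0-4": [32, 36], "5-8": [68, 48], "9-12": [37, 43],
     "13-16": [23, 83], "17+": [22, 25]},
    index=["Children", "Adults"],
)

# Divide each row by its row total, then express as whole percentages
percent = counts.div(counts.sum(axis=1), axis=0).mul(100).round().astype(int)
print(percent)  # e.g. Children, 0-4 -> 18
```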

Scatter plots

A scatter plot is a chart that shows you the relationship between two or three variables . It’s a visual representation of the strength of a relationship.

In a scatter plot, you plot one variable along the x-axis and another one along the y-axis. Each data point is represented by a point in the chart.

From your scatter plot, you see that as the number of movies seen at movie theaters increases, the number of visits to the library decreases. Based on your visual assessment of a possible linear relationship, you perform further tests of correlation and regression.
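A minimal Matplotlib sketch of such a scatter plot, with made-up values for the two variables:

```python
import matplotlib.pyplot as plt

movie_visits = [1, 3, 5, 8, 10, 14]     # hypothetical x variable
library_visits = [22, 16, 12, 9, 5, 2]  # hypothetical y variable

plt.scatter(movie_visits, library_visits)
plt.xlabel("Movies seen at a theater (past year)")
plt.ylabel("Library visits (past year)")
plt.show()
```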

Descriptive statistics: Scatter plot

Descriptive statistics summarize the characteristics of a data set. Inferential statistics allow you to test a hypothesis or assess whether your data is generalizable to the broader population.

The 3 main types of descriptive statistics concern the frequency distribution, central tendency, and variability of a dataset.

  • Distribution refers to the frequencies of different responses.
  • Measures of central tendency give you the average for each response.
  • Measures of variability show you the spread or dispersion of your dataset.
  • Univariate statistics summarize only one variable  at a time.
  • Bivariate statistics compare two variables .
  • Multivariate statistics compare more than two variables .


Descriptive Statistics: Definition & Charts and Graphs


  • Difference Between Descriptive and Inferential
  • Excel Instructions
  • Graphs, Charts and Plots

See also: Basic Statistics Terms

1. Definition of Descriptive Statistics

Descriptive statistics are one of the fundamental “must knows” with any set of data. They give you a general idea of trends in your data, including:

  • The mean, mode, median and range .
  • Variance and standard deviation .
  • Count, maximum and minimum.

Descriptive statistics are useful because they allow you to take a large amount of data and summarize it. For example, let’s say you had data on the incomes of one million people. No one is going to want to read a million pieces of data; even if they did, they wouldn’t be able to glean any useful information from it. On the other hand, if you summarize it, it becomes useful: an average wage, or a median income, is much easier to understand than reams of data.

Descriptive statistics can be further broken down into several sub-areas, like:

  • Measures of central tendency.
  • Measures of dispersion.
  • Charts & graphs.
  • Shapes of distributions.

The charts, graphs and plots site index is below . For definitions and information on how to find measures of spread and central tendency, see: Basic statistics (which covers the basic terms you’ll find in descriptive statistics like interquartile range , outliers and standard deviation ).

2. Difference Between Descriptive and Inferential Statistics

Statistics can be broken down into two areas:

  • Descriptive statistics: describes and summarizes data. You are just describing what the data shows: a trend, a specific feature, or a certain statistic (like a mean or median).
  • Inferential statistics : uses statistics to make predictions.

Descriptive statistics just describes data. For example, descriptive statistics about a college could include: the average SAT score for incoming freshmen; the median income of parents; racial makeup of the student body. It says nothing about why the data might exist, or what trends you might be able to see from the data. When you take your data and start to make predictions about future behavior or trends, that’s inferential statistics. Inferential statistics also allows you to take sample data (e.g. from one university) and apply it to a larger population (e.g. all universities in the country).

3. Excel Descriptive Statistics

How to Calculate Excel Descriptive Statistics: Steps

Step 1: Type your data into Excel, in a single column. For example, if you have ten items in your data set, type them into cells A1 through A10.

Step 2: Click the “Data” tab and then click “Data Analysis” in the Analysis group.

Step 3: Highlight “Descriptive Statistics” in the pop-up Data Analysis window.

Step 4: Type an input range into the “Input Range” text box. For this example, type “A1:A10” into the box.

Step 5: Check the “Labels in first row” check box if you have titled the column in row 1, otherwise leave the box unchecked.

Step 6: Type a cell location into the “Output Range” box. For example, type “C1.” Make sure the columns to the right of that cell are empty, as the summary output fills two adjacent columns.

Step 7: Click the “Summary Statistics” check box and then click “OK” to display Excel descriptive statistics. A list of descriptive statistics will be returned in the column you selected as the Output Range.
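
If you would rather script the same summary than point and click, a few lines of Python with pandas reproduce most of Excel’s Data Analysis output; the ten values below are hypothetical stand-ins for the A1:A10 example.

```python
import pandas as pd

# Ten hypothetical values standing in for cells A1:A10
data = pd.Series([2, 4, 4, 4, 5, 5, 7, 9, 10, 12], name="score")

# describe() gives count, mean, std, min, quartiles and max in one call
print(data.describe())

# Individual statistics similar to Excel's summary output
print("Mean:  ", data.mean())
print("Median:", data.median())
print("Mode:  ", data.mode().tolist())  # mode() may return more than one value
print("Range: ", data.max() - data.min())
```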

4. Descriptive Statistics: Charts, Graphs and Plots

There are literally dozens of charts and graphs you can make from data. Which one you choose depends upon what kind of data you have and what you want to display. For example, if you wanted to display relationships between data in categories, you could make a bar graph.

Grouped bar graph. Image: CDC.

A pie chart would show you how categories in your data relate to the whole set.

Pie chart showing water consumption. Image courtesy of EPA.

Scatter plots are a good way to display the relationship between two variables.

Less common, but useful in some cases, are dot plots and box and whisker charts:

Dot plot example.

How To Articles for Descriptive Statistics

  • Causal Graph
  • Absolute Frequency: Definition, Examples
  • Make a Histogram.
  • Make a Relative Frequency Histogram.
  • How to Make a Frequency Chart and Determine Frequency.
  • House Graph
  • Choose Bin Sizes.
  • How to Make an Ogive Graph.
  • Read a Box Plot.
  • Find a Box Plot Interquartile Range.
  • Draw a Frequency Distribution Table.
  • Make a Cumulative Frequency Distribution Table.
  • Find a Quadratic Mean.
  • Make a Stemplot.
  • U Chart: Definition, Example
  • Venn Diagram Templates .

Microsoft Excel : Descriptive Statistics

  • How to Create a Bar Graph in Excel.
  • Create a Histogram in Excel.
  • How to Make a Scatter Plot in Microsoft Excel.
  • Create a Frequency Distribution Table in Excel.
  • How to Make a Pie Chart in Excel.
  • Grubb’s Test to Find Outliers .

Minitab and SPSS for Descriptive Statistics

  • How to Make a Scatterplot in Minitab.
  • Make a Boxplot in Minitab.
  • How to Make a Histogram in Minitab.
  • How to Create a Bar Graph in Minitab.
  • How to Make an SPSS Frequency Table.
  • How to Make an SPSS Histogram.
  • Make a Bar Chart in SPSS.
  • How to Make an SPSS Boxplot.
  • How to Make an SPSS Scatterplot.
  • Make a Pie Chart in SPSS.

Definitions

  • 68-95-99.7 Rule.
  • The Area Principle.
  • Attribute Control Chart
  • Back-to-Back Stemplot.
  • Bimodal Distribution.
  • Bland-Altman Plot.
  • Collider Variable
  • What is a Cumulative Frequency Distribution?
  • What is a Directed Acyclic Graph?
  • What is a Forest Plot or Blobbogram?
  • Frequency Distribution Table.
  • What is a Funnel Plot?
  • Grouped Data.
  • What are upper hinges and lower hinges?
  • Interquartile Mean.
  • Measures of Position
  • Measures of Spread.
  • Measures of Variation .
  • What is an NP Chart?
  • What is a P-Chart?
  • What is a Probability Tree?
  • What is a Pyramid Graph?
  • Ribbon Diagram
  • Scatter Plot.
  • Radar Charts.
  • What is a Seven Number Summary?
  • What is a Skewed Distribution?
  • Finding Skewness .
  • Scales of Measurement.
  • What is a Stemplot?
  • Symmetric Distribution.
  • What is a Timeplot?
  • Uniform Distribution.
  • What is a Unimodal Distribution?
  • Upper and Lower Fences.
  • Variogram .
  • Waterfall plot
  • X-MR (X-Moving Range) Chart
  • Youden Plot
  • Misleading Graphs in Real Life.
  • Types of Graphs .

Unit 07: How to Evaluate Descriptive and Inferential Statistics

Unit 7: Assignment #1 (due before 11:59 pm Central on Wednesday September 29):

  • Watch Lynda.com’s (2010) video, “ Understanding Descriptive and Inferential Statistics .”
  • Read Laerd Statistics’ (no date) article, “ Descriptive and Inferential Statistics ” ( web link ).
  • Read a section of Wikipedia’s (2020) entry, “ Descriptive Statistics ” ( web link ).
  • Read Statistics HowTo’s (2014) article, “ Inferential Statistics: Definition, Uses ” ( web link ).
  • At this point, you should be clear on the difference between descriptive and inferential statistics and the common uses for both types of statistics.
  • If you’re not clear, you might want to re-read the above articles and re-watch the videos.
  • You might also want to review how to write a Five-Paragraph Examples-Style Essay, by watching the latter part of Professor Gernsbacher’s lecture video, “ The Five-Paragraph Model ” (a transcript of the video is available in PDF and Word ).
  • Check your essay to make sure your Introduction Paragraph has a hook and a Thesis Statement .
  • Check your Thesis Statement to make sure that it is ONE sentence that captures all three  of your three examples .
  • Check your essay to make sure it has three Examples Paragraphs .
  • Check your essay to make sure it has a Conclusion Paragraph .
  • Check your Conclusion Paragraph to make sure it has a Re-stated Thesis Statement and ends with something (mildly) witty or profound.
  • Check your Re-stated Thesis Statement to make sure that it is ONE sentence that summarizes all three of your three examples .
  • Check ALL FIVE of your paragraphs — your Introduction Paragraph, each of your three Examples Paragraphs, and your Conclusion Paragraph — to make sure each of your five paragraphs has FIVE sentences (a Topic Sentence, three Supporting Sentences, and a Conclusion Sentence).
  • Save your essay as a PDF and name the file YourLastname_DescriptiveEssay.pdf .
  • Check your Thesis Statement to make sure that it is ONE sentence that incorporates all three of your three examples .
  • Check your Conclusion Paragraph to make sure it has a Re-stated Thesis Statement and ends with something (mildly) witty or profound.
  • Check ALL FIVE of your paragraphs — your Introduction Paragraph, each of your three Examples Paragraphs, and your Conclusion Paragraph — to make sure each of your five paragraphs has FIVE sentences (a Topic Sentence, three Supporting Sentences, and a Conclusion Sentence).
  • Save your second essay as a PDF and name the file YourLastname_InferentialEssay.pdf.
  • If you ever wonder why we repeatedly practice skills, such as writing five-paragraph essays, in different contexts throughout this course, consider the words of William James ( Word ), who is widely considered the father of U.S. Psychology!
  • First, attach your Descriptive Statistics Essay, saved as a PDF.
  • Remember to “Attach” your Descriptive Statistics Essay PDF and not use the “File” tool.
  • Because the Discussion Board will allow only one file to be attached to each post, make a reply post to your own post.
  • Use your reply post to attach your Inferential Statistics Essay, saved as a PDF.
  • You should write both essays, and then make your Discussion Board post, because if you turn in only one essay (or turn in one essay well before the other), we’ll be alerted to grade only that one essay.

Unit 7: Assignment #2 (due before 11:59 pm Central on Thursday September 30):

  • NOTE: This book was published in 1954; therefore, the examples are from the 1940s and early 1950s. However, it’s still a beloved book (e.g., it’s recommended reading in a college Physics class), despite its age.
  • Chapter 2 explains the deception caused by indiscriminately referring to the mean, median, and mode (i.e., three central-tendency descriptive statistics) as “the average.”
  • Chapter 3 explains the deception caused by random variation and the solutions provided by inferential statistics.
  • Chapter 4 explains the deception caused by differences that aren’t meaningful.
  • Chapters 5 and 6 explain deception by graphs and figures.
  • When reading these chapters, jot down your three favorite deceptions. For example, you might choose as one of your favorite deceptions the hypothetical real estate agent’s deceptive use of a neighborhood’s “average” income in Chapter 2.
  • You need to choose an audience for your teaching document. Your choices are (1) other college students; (2) middle-school students (age 12 to 14); or (3) older adults (over age 60).
  • You need to choose a medium for your teaching document. Your choices are (1) a PPT; (2) an Infographic; or (3) a comic strip (e.g., The Nib’s ).
  • You need to save your teaching document as a PDF, named YourLastname_StatsDeception.pdf .
  • attach your teaching document PDF, and
  • tell us the intended audience of your teaching document and why you chose that intended audience.

Unit 7: Assignment #3 (due before 11:59 pm Central on Friday October 1):

  • Read Sullivan and Feinn’s (2012) article, “ Using Effect Size—or Why the P Value Is Not Enough ” ( web link ).
  • Sullivan and Feinn’s (2012) article might be harder to read than other articles you’ve read in this course. But try to understand it at least at a superficial level. Feel free to Google terms that you don’t know.
  • Read Chapman and Louis’s (2017) article, “ The Seven Sins of Statistical Misinterpretation ” ( web link ).
  • In contrast to gaining a working, but superficial understanding of the computations and the like that Sullivan and Feinn (2012) provide in their article, make sure you understand well the seven “sins” that Chapman and Louis provide in their article.
  • Choose three of the 9 articles that you found and read in Unit #5 and that you synthesized in Unit #6.
  • Choose the three articles (of your 9 articles) that will be the easiest (and most logical) to evaluate according to Chapman and Louis’s (2017) “ Seven Sins of Statistical Misinterpretation ” ( web link ).
  • First, download the unfilled PDF and save it on your own computer.
  • Second, rename the unfilled PDF to be YourLastName_PSY-311_StatsCheck_Fillable.pdf. In other words, add your last name to the beginning of the filename.
  • Third, on your computer, open a PDF writer app, such as Preview, Adobe Reader, or the like. Be sure to open your PDF writer app before you open the unfilled PDF from your computer.
  • Fourth, from within your PDF writer app, open the unfilled PDF, which you have already saved onto your computer and re-named.
  • Fifth, using the PDF writer app, fill in the PDF.
  • Sixth, save your now-filled-in PDF on your computer.
  • There are three pages in the fillable PDF; use a different page for each of your three articles. Make sure that your citation in the citation text box at the top of the page is in APA style (see Unit 5). It’s okay, for this assignment, if you can’t italicize parts of the citation (in the citation text box of the fillable PDF).
  • Go to the Unit 7: Assignment #3 Discussion Board and attach your filled-in PDF.

Unit 7: Assignment #4 (due before 11:59 pm Central on Sunday October 3):

  • Make sure you understand the article’s answer to the concern that “[the pollsters] never call me.”
  • Make sure you understand the article’s answer to the concern that “nobody I know says that.”
  • Read Rumsey’s (no date) article, “ How to Interpret the Margin of Error in Statistics ” ( web link ).
  • Make sure you understand the difference between sampling a population and surveying (or polling) an entire population.
  • Make sure you understand what a margin of error means in a public opinion poll or survey.
  • Read Hunter’s (no date) article, “ Margin of Error and Confidence Levels Made Simple ” ( web link ).
  • Make sure you understand what it means to calculate a margin of error at a 95% confidence level.
  • Make sure you understand the relation between sample size and margin of error.
  • Harter and Adkins (2015): “ Engaged Employees Less Likely to Have Health Problems ” ( web link ).
  • Newport (2017a) “ Email Outside of Working Hours Not a Burden to US Workers ” ( web link ).
  • Newport and Dugan (2017): “ Americans Still See Manufacturing as Key to Job Creation ” ( web link ).
  • Newport (2018a): “ Average American Predicts Retirement Age of 66 ” ( web link ).
  • Swift (2017a): “ Most U.S. Employed Adults Plan to Work Past Retirement Age ” ( web link ).
  • Newport (2017b): “ Young, Old in US Plan on Relying More on Social Security ” ( web link ).
  • Jones (2017a): “ Worry About Hunger, Homelessness Up for Lower-Income in US ” ( web link ).
  • Norman (2017): “ Financially Stressed in US Now Prefer Saving to Spending ” ( web link ).
  • Jones (2017b): “ Half of Non-Homeowners Expect to Buy Homes in Five Years ” ( web link ).
  • Newport (2018b): “ Americans’ Views of Their Spending and Saving ” ( web link ).
  • Rigoni and Nelson (2016): “ Millennials Want Jobs That Promote Their Well-Being ” ( web link ).
  • Witters (2017a): “ Hawaii Leads US States in Well-Being for Record Sixth Time ” ( web link ).
  • Witters (2017b): “ Naples, Florida, Remains Top US Metro for Well-Being ” ( web link ).
  • McCarthy (2017a): “ U.S. Support for Gay Marriage Edges to New High ” ( web link ).
  • McCarthy (2017b): “ Americans More Positive about Effects of Immigration ” ( web link ).
  • Swift (2017b): “ More Americans Say Immigrants Help Rather Than Hurt Economy ” ( web link ).
  • Reinhart and Ray (2018): “ Record Unhappiness with Women’s Position in U.S. ” ( web link ).
  • Auter (2018): “ Half of College Students Say Their Major Leads to a Good Job ” ( web link ).
  • Maturo (2017): “ One in Three Veterans Consult Coworkers About College Major ” ( web link ).
  • Auter (2017): “ Second Thoughts on College Major Linked to Source of Advice ” ( web link ).
  • What was the topic of the public opinion poll?
  • Why did you choose this topic (and read this report)?
  • What three findings from this public opinion poll do you think are the most interesting – and why do you think those three findings are interesting?
  • What was the total sample size?
  • What was the poll’s margin of error?
  • Was the margin of error calculated at the 95% confidence level?
  • What does it mean that the margin of error was calculated at the 95% confidence level?

Unit 7: Assignment #5 (due before 11:59 pm Central on Monday October 4):

  • Go to the Unit 7: Assignment #4 and #5 Discussion Board and read all the other students’ posts.
  • One of your responses must be to a student who wrote about at least one of the topics that you also wrote about.
  • One of your responses must be to a student who wrote about at least one topic that you did NOT write about.
  • Your third response can be to any other student (besides the two students you responded to in 1. and 2. above).
  • If no other student wrote about one of the topics that you also wrote about, you can respond to three students who all wrote about different topics than you wrote about.

Unit 7: Assignment #6 (due before 11:59 pm Central on Tuesday October 5):

  • Read both essays that each of the other members of your small Chat Group posted on the Unit 7: Assignment #1 Discussion Board . If you are in a Chat Group with two other students, that means you will read four essays; if you are in a Chat Group with only one other student, that means you will read two essays.
  • Review how to provide peer review on your Chat Group members’ essays by reading the Peer Review Guidelines ( Word ). Note that you will again be answering 12 questions about each member’s essays.
  • Prior to your Chat Group meeting online, all members of your Chat Group must have completed steps a. and b. of this Assignment .
  • Then, spend the remainder of your hour-long Chat with each Chat Group member providing peer review of the other Chat Group members’ essays.
  • Nominate one member of your Chat Group (who participated in the Chat) to make a post on the Unit 7: Assignment #6 Discussion Board that summarizes your Group Chat in at least 200 words.
  • Nominate another member of your Chat Group (who also participated in the Chat) to save the Chat transcript as a PDF, as described in the Course How To (under the topic, “How to Save and Attach a Group Text Chat Transcript”), and attach the Chat transcript to a post on the Unit 7: Assignment #6 Discussion Board .
  • Nominate another member of your Chat Group (who also participated in the Chat) to make another post on the Unit 7: Assignment #6 Discussion Board that states the name of your Chat Group, the names of the Chat Group members who participated in the Chat, the date of your Chat, and the start and stop time of your Chat.
  • If only two persons participated in the Chat, then one of those two persons needs to do two of the above three tasks.
  • Before ending the Group Chat, bid goodbye to each other. In the next Unit you will be forming new Chat Groups!
  • All members of the Chat Group must record a typical Unit entry in your own Course Journal for Unit 7.

Congratulations, you have finished Unit 7! Onward to Unit 8 !

Open-Access Active-Learning Research Methods Course by Morton Ann Gernsbacher, PhD is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License . The materials have been modified to add various ADA-compliant accessibility features, in some cases including alternative text-only versions. Permissions beyond the scope of this license may be available at http://www.gernsbacherlab.org .

Descriptive Statistics

The mean, the mode, the median, the range, and the standard deviation are all examples of descriptive statistics. Descriptive statistics are used because in most cases, it isn't possible to present all of your data in any form that your reader will be able to quickly interpret.

Generally, when writing descriptive statistics, you want to present at least one form of central tendency (or average), that is, either the mean, median, or mode. In addition, you should present one form of variability, usually the standard deviation.

Measures of Central Tendency and Other Commonly Used Descriptive Statistics

The mean, median, and the mode are all measures of central tendency. They attempt to describe what the typical data point might look like. In essence, they are all different forms of 'the average.' When writing statistics, you never want to say 'average' because it is difficult, if not impossible, for your reader to understand if you are referring to the mean, the median, or the mode.

The mean is the most common form of central tendency, and is what most people usually are referring to when they say average. It is simply the total sum of all the numbers in a data set, divided by the total number of data points. For example, the following data set has a mean of 4: {-1, 0, 1, 16}. That is, the values sum to 16, and 16 divided by 4 data points is 4. If there isn't a good reason to use one of the other forms of central tendency, then you should use the mean to describe the central tendency.

The median is simply the middle value of a data set. In order to calculate the median, all values in the data set need to be ordered, from either highest to lowest or vice versa. If there is an odd number of values in a data set, the median is simply the middle value. If there is an even number of values, the standard convention is to take the mean of the two middle values. The median is useful when describing data sets that are skewed or have extreme values. Incomes of baseball players, for example, are commonly reported using a median because a small minority of players makes a lot of money, while most players make more modest amounts. The median is less influenced by extreme scores than the mean.

The mode is the most commonly occurring number in the data set. The mode is best used when you want to indicate the most common response or item in a data set. For example, if you wanted to predict the score of the next football game, you may want to know what the most common score is for the visiting team, but having an average score of 15.3 won't help you if it is impossible to score 15.3 points. Likewise, a median score may not be very informative either, if you are interested in what score is most likely.
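
All three averages are easy to verify with Python's built-in statistics module. The first data set is the {-1, 0, 1, 16} example above; the set used for the mode is hypothetical.

```python
import statistics

data = [-1, 0, 1, 16]

print(statistics.mean(data))    # (-1 + 0 + 1 + 16) / 4 = 4
print(statistics.median(data))  # middle pair is (0, 1), so (0 + 1) / 2 = 0.5
print(statistics.mode([3, 5, 3, 9, 3]))  # 3 occurs most often
```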

Standard Deviation

The standard deviation is a measure of variability (it is not a measure of central tendency). Conceptually it is best viewed as the 'average distance that individual data points are from the mean.' Data sets that are highly clustered around the mean have lower standard deviations than data sets that are spread out.

For example, a data set whose values are spread widely around the mean will have a higher standard deviation than one whose values cluster tightly around the same mean.
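
A minimal sketch with Python's statistics module illustrates this; the two groups below are hypothetical stand-ins, chosen so that both have a mean of 5 and a median of 5 but very different spreads:

```python
import statistics

# Hypothetical groups: both have a mean of 5 and a median of 5
group_1 = [1, 2, 5, 8, 9]  # values spread widely around the mean
group_2 = [4, 5, 5, 5, 6]  # values clustered tightly around the mean

for group in (group_1, group_2):
    print(statistics.mean(group),
          statistics.median(group),
          round(statistics.pstdev(group), 2))  # population standard deviation
```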

Notice that both groups have the same mean (5) and median (also 5), but the two groups contain different numbers and are organized much differently. This organization of a data set is often referred to as a distribution. Because the two data sets above have the same mean and median, but different standard deviation, we know that they also have different distributions. Understanding the distribution of a data set helps us understand how the data behave.

Descriptive Statistics for Summarising Data

Ray W. Cooksey

UNE Business School, University of New England, Armidale, NSW Australia

This chapter discusses and illustrates descriptive statistics . The purpose of the procedures and fundamental concepts reviewed in this chapter is quite straightforward: to facilitate the description and summarisation of data. By ‘describe’ we generally mean either the use of some pictorial or graphical representation of the data (e.g. a histogram, box plot, radar plot, stem-and-leaf display, icon plot or line graph) or the computation of an index or number designed to summarise a specific characteristic of a variable or measurement (e.g., frequency counts, measures of central tendency, variability, standard scores). Along the way, we explore the fundamental concepts of probability and the normal distribution. We seldom interpret individual data points or observations primarily because it is too difficult for the human brain to extract or identify the essential nature, patterns, or trends evident in the data, particularly if the sample is large. Rather we utilise procedures and measures which provide a general depiction of how the data are behaving. These statistical procedures are designed to identify or display specific patterns or trends in the data. What remains after their application is simply for us to interpret and tell the story.

Reflect on the QCI research scenario and the associated data set discussed in Chap. 4. Consider the following questions that Maree might wish to address with respect to decision accuracy and speed scores:

  • What was the typical level of accuracy and decision speed for inspectors in the sample? [see Procedure 5.4 – Assessing central tendency.]
  • What was the most common accuracy and speed score amongst the inspectors? [see Procedure 5.4 – Assessing central tendency.]
  • What was the range of accuracy and speed scores; the lowest and the highest scores? [see Procedure 5.5 – Assessing variability.]
  • How frequently were different levels of inspection accuracy and speed observed? What was the shape of the distribution of inspection accuracy and speed scores? [see Procedure 5.1 – Frequency tabulation, distributions & crosstabulation.]
  • What percentage of inspectors would have ‘failed’ to ‘make the cut’ assuming the industry standard for acceptable inspection accuracy and speed combined was set at 95%? [see Procedure 5.7 – Standard ( z ) scores.]
  • How variable were the inspectors in their accuracy and speed scores? Were all the accuracy and speed levels relatively close to each other in magnitude or were the scores widely spread out over the range of possible test outcomes? [see Procedure 5.5 – Assessing variability.]
  • What patterns might be visually detected when looking at various QCI variables singly and together as a set? [see Procedure 5.2 – Graphical methods for displaying data, Procedure 5.3 – Multivariate graphs & displays, and Procedure 5.6 – Exploratory data analysis.]

This chapter includes discussions and illustrations of a number of procedures available for answering questions about data like those posed above. In addition, you will find discussions of two fundamental concepts, namely probability and the normal distribution; concepts that provide building blocks for Chaps. 6 and 7.

Procedure 5.1: Frequency Tabulation, Distributions & Crosstabulation

Frequency Tabulation and Distributions

Frequency tabulation serves to provide a convenient counting summary for a set of data that facilitates interpretation of various aspects of those data. Basically, frequency tabulation occurs in two stages:

  • First, the scores in a set of data are rank ordered from the lowest value to the highest value.
  • Second, the number of times each specific score occurs in the sample is counted. This count records the frequency of occurrence for that specific data value.

Consider the overall job satisfaction variable, jobsat , from the QCI data scenario. Performing frequency tabulation across the 112 Quality Control Inspectors on this variable using the SPSS Frequencies procedure (Allen et al. 2019 , ch. 3; George and Mallery 2019 , ch. 6) produces the frequency tabulation shown in Table 5.1 . Note that three of the inspectors in the sample did not provide a rating for jobsat thereby producing three missing values (= 2.7% of the sample of 112) and leaving 109 inspectors with valid data for the analysis.

Table 5.1 Frequency tabulation of overall job satisfaction scores

The display of frequency tabulation is often referred to as the frequency distribution for the sample of scores. For each value of a variable, the frequency of its occurrence in the sample of data is reported. It is possible to compute various percentages and percentile values from a frequency distribution.

Table 5.1 shows the ‘Percent’ or relative frequency of each score (the percentage of the 112 inspectors obtaining each score, including those inspectors who were missing scores, which SPSS labels as ‘System’ missing). Table 5.1 also shows the ‘Valid Percent’ which is computed only for those inspectors in the sample who gave a valid or non-missing response.

Finally, it is possible to add up the ‘Valid Percent’ values, starting at the low score end of the distribution, to form the cumulative distribution or ‘Cumulative Percent’ . A cumulative distribution is useful for finding percentiles which reflect what percentage of the sample scored at a specific value or below.

We can see in Table 5.1 that 4 of the 109 valid inspectors (a ‘Valid Percent’ of 3.7%) indicated the lowest possible level of job satisfaction—a value of 1 (Very Low) – whereas 18 of the 109 valid inspectors (a ‘Valid Percent’ of 16.5%) indicated the highest possible level of job satisfaction—a value of 7 (Very High). The ‘Cumulative Percent’ number of 18.3 in the row for the job satisfaction score of 3 can be interpreted as “roughly 18% of the sample of inspectors reported a job satisfaction score of 3 or less”; that is, nearly a fifth of the sample expressed some degree of negative satisfaction with their job as a quality control inspector in their particular company.

If you have a large data set having many different scores for a particular variable, it may be more useful to tabulate frequencies on the basis of intervals of scores.

For the accuracy scores in the QCI database, you could count scores occurring in intervals such as ‘less than 75% accuracy’, ‘at least 75% but less than 85% accuracy’, ‘at least 85% but less than 95% accuracy’, and ‘95% accuracy or greater’, rather than counting the individual scores themselves. This would yield what is termed a ‘grouped’ frequency distribution, since the data have been grouped into intervals or score classes. Producing such an analysis using SPSS would involve extra steps to create the new category or ‘grouping’ system for scores prior to conducting the frequency tabulation.
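
As a sketch of how such a tabulation might be scripted outside SPSS, the pandas snippet below builds both a simple frequency table (with valid and cumulative percentages) and a grouped version using the intervals just described; the accuracy values are hypothetical, since the QCI data are not reproduced here.

```python
import pandas as pd

# Hypothetical accuracy scores standing in for the QCI data
accuracy = pd.Series([72, 78, 81, 84, 86, 88, 90, 92, 94, 95, 97, 99])

# Simple frequency tabulation with valid and cumulative percentages
counts = accuracy.value_counts().sort_index()
table = pd.DataFrame({"Frequency": counts,
                      "Valid Percent": 100 * counts / counts.sum()})
table["Cumulative Percent"] = table["Valid Percent"].cumsum()
print(table)

# Grouped frequency distribution using the intervals described above
bins = [0, 75, 85, 95, 101]
labels = ["< 75%", "75 to < 85%", "85 to < 95%", ">= 95%"]
grouped = pd.cut(accuracy, bins=bins, labels=labels, right=False)
print(grouped.value_counts().sort_index())
```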

Crosstabulation

In a frequency crosstabulation , we count frequencies on the basis of two variables simultaneously rather than one; thus we have a bivariate situation.

For example, Maree might be interested in the number of male and female inspectors in the sample of 112 who obtained each jobsat score. Here there are two variables to consider: inspector’s gender and inspector’s jobsat score. Table 5.2 shows such a crosstabulation as compiled by the SPSS Crosstabs procedure (George and Mallery 2019, ch. 8). Note that inspectors who did not report a score for jobsat and/or gender have been omitted as missing values, leaving 106 valid inspectors for the analysis.

Table 5.2 Frequency crosstabulation of jobsat scores by gender category for the QCI data

The crosstabulation shown in Table 5.2 gives a composite picture of the distribution of satisfaction levels for male inspectors and for female inspectors. If frequencies or ‘Counts’ are added across the gender categories, we obtain the numbers in the ‘Total’ column (the percentages or relative frequencies are also shown immediately below each count) for each discrete value of jobsat (note this column of statistics differs from that in Table 5.1 because the gender variable was missing for certain inspectors). By adding down each gender column, we obtain, in the bottom row labelled ‘Total’, the number of males and the number of females that comprised the sample of 106 valid inspectors.

The totals, either across the rows or down the columns of the crosstabulation, are termed the marginal distributions of the table. These marginal distributions are equivalent to frequency tabulations for each of the variables jobsat and gender . As with frequency tabulation, various percentage measures can be computed in a crosstabulation, including the percentage of the sample associated with a specific count within either a row (‘% within jobsat ’) or a column (‘% within gender ’). You can see in Table 5.2 that 18 inspectors indicated a job satisfaction level of 7 (Very High); of these 18 inspectors reported in the ‘Total’ column, 8 (44.4%) were male and 10 (55.6%) were female. The marginal distribution for gender in the ‘Total’ row shows that 57 inspectors (53.8% of the 106 valid inspectors) were male and 49 inspectors (46.2%) were female. Of the 57 male inspectors in the sample, 8 (14.0%) indicated a job satisfaction level of 7 (Very High). Furthermore, we could generate some additional interpretive information of value by adding the ‘% within gender’ values for job satisfaction levels of 5, 6 and 7 (i.e. differing degrees of positive job satisfaction). Here we would find that 68.4% (= 24.6% + 29.8% + 14.0%) of male inspectors indicated some degree of positive job satisfaction compared to 61.2% (= 10.2% + 30.6% + 20.4%) of female inspectors.

This helps to build a picture of the possible relationship between an inspector’s gender and their level of job satisfaction (a relationship that, as we will see later, can be quantified and tested using procedures covered in Chaps. 6 and 7).

It should be noted that a crosstabulation table such as that shown in Table 5.2 is often referred to as a contingency table, about which more will be said later (see the related procedures in Chap. 7).
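
For readers working outside SPSS, a crosstabulation of this kind can be sketched with pandas; the gender and jobsat values below are hypothetical, not the actual QCI data.

```python
import pandas as pd

# Hypothetical gender and job satisfaction ratings (1 = Very Low ... 7 = Very High)
df = pd.DataFrame({
    "gender": ["M", "F", "M", "F", "M", "F", "M", "F", "M", "F"],
    "jobsat": [5, 7, 6, 4, 7, 6, 3, 7, 5, 2],
})

# Counts with marginal totals, as in a contingency table
print(pd.crosstab(df["jobsat"], df["gender"], margins=True))

# '% within gender': each gender column sums to 100
print(pd.crosstab(df["jobsat"], df["gender"], normalize="columns") * 100)
```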

Advantages

Frequency tabulation is useful for providing convenient data summaries which can aid in interpreting trends in a sample, particularly where the number of discrete values for a variable is relatively small. A cumulative percent distribution provides additional interpretive information about the relative positioning of specific scores within the overall distribution for the sample.

Crosstabulation permits the simultaneous examination of the distributions of values for two variables obtained from the same sample of observations. This examination can yield some useful information about the possible relationship between the two variables. More complex crosstabulations can also be done where the values of three or more variables are tracked in a single systematic summary. The use of frequency tabulation or cross-tabulation in conjunction with various other statistical measures, such as measures of central tendency (see Procedure 5.4) and measures of variability (see Procedure 5.5), can provide a relatively complete descriptive summary of any data set.

Disadvantages

Frequency tabulations can get messy if interval or ratio-level measures are tabulated simply because of the large number of possible data values. Grouped frequency distributions really should be used in such cases. However, certain choices, such as the size of the score interval (group size), must be made, often arbitrarily, and such choices can affect the nature of the final frequency distribution.

Additionally, percentage measures have certain problems associated with them, most notably, the potential for their misinterpretation in small samples. One should be sure to know the sample size on which percentage measures are based in order to obtain an interpretive reference point for the actual percentage values.

For example

In a sample of 10 individuals, 20% represents only two individuals whereas in a sample of 300 individuals, 20% represents 60 individuals. If all that is reported is the 20%, then the mental inference drawn by readers is likely to be that a sizeable number of individuals had a score or scores of a particular value—but what is ‘sizeable’ depends upon the total number of observations on which the percentage is based.

Where Is This Procedure Useful?

Frequency tabulation and crosstabulation are very commonly applied procedures used to summarise information from questionnaires, both in terms of tabulating various demographic characteristics (e.g. gender, age, education level, occupation) and in terms of actual responses to questions (e.g. numbers responding ‘yes’ or ‘no’ to a particular question). They can be particularly useful in helping to build up the data screening and demographic stories discussed in Chap. 4. Categorical data from observational studies can also be analysed with this technique (e.g. the number of times Suzy talks to Frank, to Billy, and to John in a study of children’s social interactions).

Certain types of experimental research designs may also be amenable to analysis by crosstabulation with a view to drawing inferences about distribution differences across the sets of categories for the two variables being tracked.

You could employ crosstabulation in conjunction with the tests described in Chap. 7 to see if two different styles of advertising campaign differentially affect the product purchasing patterns of male and female consumers.

In the QCI database, Maree could employ crosstabulation to help her answer the question “do different types of electronic manufacturing firms ( company ) differ in terms of their tendency to employ male versus female quality control inspectors ( gender )?”

Software Procedures

SPSS: select the variable(s) you wish to analyse; for the crosstabulation procedure, a button in the dialog box allows you to choose various types of statistics and percentages to show in each cell of the table.

NCSS: select the variable(s) you wish to analyse.

SYSTAT: select the variable(s) you wish to analyse and choose the optional statistics you wish to see.

STATGRAPHICS: select the variable(s) you wish to analyse; when the ‘Tables and Graphs’ window opens, choose the tables and graphs you wish to see.

R Commander: select the variable(s) you wish to analyse and choose the optional statistics you wish to see.

Procedure 5.2: Graphical Methods for Displaying Data

Graphical methods for displaying data include bar and pie charts, histograms and frequency polygons, line graphs and scatterplots. It is important to note that what is presented here is a small but representative sampling of the types of simple graphs one can produce to summarise and display trends in data. Generally speaking, SPSS offers the easiest facility for producing and editing graphs, but with a rather limited range of styles and types. SYSTAT, STATGRAPHICS and NCSS offer a much wider range of graphs (including graphs unique to each package), but with the drawback that it takes somewhat more effort to get the graphs in exactly the form you want.

Bar and Pie Charts

These two types of graphs are useful for summarising the frequency of occurrence of various values (or ranges of values) where the data are categorical (nominal or ordinal level of measurement).

  • A bar chart uses vertical and horizontal axes to summarise the data. The vertical axis is used to represent frequency (number) of occurrence or the relative frequency (percentage) of occurrence; the horizontal axis is used to indicate the data categories of interest.
  • A pie chart gives a simpler visual representation of category frequencies by cutting a circular plot into wedges or slices whose sizes are proportional to the relative frequency (percentage) of occurrence of specific data categories. Some pie charts can have one or more slices emphasised by ‘exploding’ them out from the rest of the pie. (A minimal sketch of both chart types appears just below this list.)
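
As a minimal sketch of both chart types, the matplotlib snippet below draws a bar chart and a pie chart (with one ‘exploded’ slice) from the same category frequencies; the firm labels echo the QCI company types, but the percentages are made up.

```python
import matplotlib.pyplot as plt

# Hypothetical percentages of female inspectors by company type
firms = ["PC", "Large computer", "Automobile",
         "Large appliance", "Small appliance"]
percents = [15, 12, 25, 26, 22]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.bar(firms, percents)             # bar chart: one bar per category
ax1.set_ylabel("Percent")
ax2.pie(percents, labels=firms,
        explode=[0, 0.1, 0, 0, 0],   # 'explode' the lowest slice for emphasis
        autopct="%1.0f%%")           # print the percentage on each slice
plt.tight_layout()
plt.show()
```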

Consider the company variable from the QCI database. This variable depicts the types of manufacturing firms that the quality control inspectors worked for. Figure 5.1 illustrates a bar chart summarising the percentage of female inspectors in the sample coming from each type of firm. Figure 5.2 shows a pie chart representation of the same data, with an ‘exploded slice’ highlighting the percentage of female inspectors in the sample who worked for large business computer manufacturers – the lowest percentage of the five types of companies. Both graphs were produced using SPSS.

Fig. 5.1 Bar chart: Percentage of female inspectors

Fig. 5.2 Pie chart: Percentage of female inspectors

The pie chart was modified with an option to show the actual percentage along with the label for each category. The bar chart shows that computer manufacturing firms have relatively fewer female inspectors compared to the automotive and electrical appliance (large and small) firms. This trend is less clear from the pie chart which suggests that pie charts may be less visually interpretable when the data categories occur with rather similar frequencies. However, the ‘exploded slice’ option can help interpretation in some circumstances.

Certain software programs, such as SPSS, STATGRAPHICS, NCSS and Microsoft Excel, offer the option of generating 3-dimensional bar charts and pie charts and incorporating other ‘bells and whistles’ that can potentially add visual richness to the graphic representation of the data. However, you should generally be careful with these fancier options as they can produce distortions and create ambiguities in interpretation (e.g. see discussions in Jacoby 1997 ; Smithson 2000 ; Wilkinson 2009 ). Such distortions and ambiguities could ultimately end up providing misinformation to researchers as well as to those who read their research.

Histograms and Frequency Polygons

These two types of graphs are useful for summarising the frequency of occurrence of various values (or ranges of values) where the data are essentially continuous (interval or ratio level of measurement) in nature. Both histograms and frequency polygons use vertical and horizontal axes to summarise the data. The vertical axis is used to represent the frequency (number) of occurrence or the relative frequency (percentage) of occurrences; the horizontal axis is used for the data values or ranges of values of interest. The histogram uses bars of varying heights to depict frequency; the frequency polygon uses lines and points.

There is a visual difference between a histogram and a bar chart: the bar chart uses bars that do not physically touch, signifying the discrete and categorical nature of the data, whereas the bars in a histogram physically touch to signal the potentially continuous nature of the data.

Suppose Maree wanted to graphically summarise the distribution of speed scores for the 112 inspectors in the QCI database. Figure 5.3 (produced using NCSS) illustrates a histogram representation of this variable. Figure 5.3 also illustrates another representational device called the ‘density plot’ (the solid tracing line overlaying the histogram) which gives a smoothed impression of the overall shape of the distribution of speed scores. Figure 5.4 (produced using STATGRAPHICS) illustrates the frequency polygon representation for the same data.

Fig. 5.3 Histogram of the speed variable (with density plot overlaid)

Fig. 5.4 Frequency polygon plot of the speed variable

These graphs employ a grouped format where speed scores which fall within specific intervals are counted as being essentially the same score. The shape of the data distribution is reflected in these plots. Each graph tells us that the inspection speed scores are positively skewed with only a few inspectors taking very long times to make their inspection judgments and the majority of inspectors taking rather shorter amounts of time to make their decisions.

Both representations tell a similar story; the choice between them is largely a matter of personal preference. However, if the number of bars to be plotted in a histogram is potentially very large (and this is usually directly controllable in most statistical software packages), then a frequency polygon would be the preferred representation simply because the amount of visual clutter in the graph will be much reduced.

It is somewhat of an art to choose an appropriate definition for the width of the score grouping intervals (or ‘bins’ as they are often termed) to be used in the plot: choose too many and the plot may look too lumpy and the overall distributional trend may not be obvious; choose too few and the plot will be too coarse to give a useful depiction. Programs like SPSS, SYSTAT, STATGRAPHICS and NCSS are designed to choose an ‘appropriate’ number of bins to be used, but the analyst’s eye is often a better judge than any statistical rule that a software package would use.
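
The effect of bin choice is easy to see by plotting the same data at several bin counts. The matplotlib sketch below uses simulated, positively skewed scores as a stand-in for the speed variable.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
# Positively skewed scores, loosely mimicking the speed variable
speed = rng.gamma(shape=2.0, scale=3.0, size=112)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, bins in zip(axes, (4, 15, 60)):
    ax.hist(speed, bins=bins)  # too few bins is coarse; too many looks lumpy
    ax.set_title(f"{bins} bins")
plt.tight_layout()
plt.show()
```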

There are several interesting variations of the histogram which can highlight key data features or facilitate interpretation of certain trends in the data. One such variation is a graph called a dual histogram (available in SYSTAT; a variation called a ‘comparative histogram’ can be created in NCSS) – a graph that facilitates visual comparison of the frequency distributions for a specific variable for participants from two distinct groups.

Suppose Maree wanted to graphically compare the distributions of speed scores for inspectors in the two categories of education level ( educlev ) in the QCI database. Figure 5.5 shows a dual histogram (produced using SYSTAT) that accomplishes this goal. This graph still employs the grouped format where speed scores falling within particular intervals are counted as being essentially the same score. The shape of the data distribution within each group is also clearly reflected in this plot. However, the story conveyed by the dual histogram is that, while the inspection speed scores are positively skewed for inspectors in both categories of educlev, the comparison suggests that inspectors with a high school level of education (= 1) tend to take slightly longer to make their inspection decisions than do their colleagues who have a tertiary qualification (= 2).

Fig. 5.5 Dual histogram of speed for the two categories of educlev

Line Graphs

The line graph is similar in style to the frequency polygon but is much more general in its potential for summarising data. In a line graph, we seldom deal with percentage or frequency data. Instead we can summarise other types of information about data such as averages or means (see Procedure 5.4 for a discussion of this measure), often for different groups of participants. Thus, one important use of the line graph is to break down scores on a specific variable according to membership in the categories of a second variable.

In the context of the QCI database, Maree might wish to summarise the average inspection accuracy scores for the inspectors from different types of manufacturing companies. Figure 5.6 was produced using SPSS and shows such a line graph.

Fig. 5.6 Line graph comparison of companies in terms of average inspection accuracy

Note how the trend in performance across the different companies becomes clearer with such a visual representation. It appears that the inspectors from the Large Business Computer and PC manufacturing companies have better average inspection accuracy compared to the inspectors from the remaining three industries.

With many software packages, it is possible to further elaborate a line graph by including error or confidence interval bars (see the relevant procedure in Chap. 8). These give some indication of the precision with which the average level for each category in the population has been estimated (narrow bars signal a more precise estimate; wide bars signal a less precise estimate).

Figure 5.7 shows such an elaborated line graph, using 95% confidence interval bars, which can be used to help make more defensible judgments (compared to Fig. 5.6 ) about whether the companies are substantively different from each other in average inspection performance. Companies whose confidence interval bars do not overlap each other can be inferred to be substantively different in performance characteristics.

Fig. 5.7 Line graph using confidence interval bars to compare accuracy across companies

The accuracy confidence interval bars for participants from the Large Business Computer manufacturing firms do not overlap those from the Large or Small Electrical Appliance manufacturers or the Automobile manufacturers.

We might conclude that quality control inspection accuracy is substantially better in the Large Business Computer manufacturing companies than in these other industries but is not substantially better than the PC manufacturing companies. We might also conclude that inspection accuracy in PC manufacturing companies is not substantially different from Small Electrical Appliance manufacturers.
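
A line graph with confidence interval bars of this kind can be sketched with matplotlib’s errorbar function; the company means and CI half-widths below are hypothetical, chosen only to mimic the pattern described above.

```python
import matplotlib.pyplot as plt

# Hypothetical mean accuracy per company, with 95% CI half-widths
companies = ["Large computer", "PC", "Large appliance",
             "Small appliance", "Automobile"]
means = [91, 88, 83, 84, 82]
ci = [2.5, 3.0, 3.0, 3.5, 4.0]  # half-width of each 95% confidence interval

plt.errorbar(companies, means, yerr=ci, fmt="-o", capsize=4)
plt.ylabel("Mean inspection accuracy (%)")
plt.title("Average accuracy by company with 95% CI bars")
plt.show()
```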

Scatterplots

Scatterplots are useful in displaying the relationship between two interval- or ratio-scaled variables or measures of interest obtained on the same individuals, particularly in correlational research (see the fundamental concept of correlation and the related procedure in Chap. 6).

In a scatterplot, one variable is chosen to be represented on the horizontal axis; the second variable is represented on the vertical axis. In this type of plot, all data point pairs in the sample are graphed. The shape and tilt of the cloud of points in a scatterplot provide visual information about the strength and direction of the relationship between the two variables. A very compact elliptical cloud of points signals a strong relationship; a very loose or nearly circular cloud signals a weak or non-existent relationship. A cloud of points generally tilted upward toward the right side of the graph signals a positive relationship (higher scores on one variable associated with higher scores on the other and vice-versa). A cloud of points generally tilted downward toward the right side of the graph signals a negative relationship (higher scores on one variable associated with lower scores on the other and vice-versa).
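
A basic scatterplot takes only a few lines in matplotlib; the snippet below simulates a loose, upward-tilted cloud (a weak positive relationship) rather than using the actual QCI scores.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
speed = rng.gamma(shape=2.0, scale=3.0, size=112)  # hypothetical decision times
noise = rng.normal(0.0, 6.0, size=112)
accuracy = 80 + 0.8 * speed + noise                # loose, positively tilted cloud

plt.scatter(speed, accuracy)
plt.xlabel("Inspection speed")
plt.ylabel("Inspection accuracy (%)")
plt.show()
```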

Maree might be interested in displaying the relationship between inspection accuracy and inspection speed in the QCI database. Figure 5.8, produced using SPSS, shows what such a scatterplot might look like. Several characteristics of the data for these two variables can be noted in Fig. 5.8. The shape of the distribution of data points is evident. The plot has a fan-shaped characteristic to it, which indicates that accuracy scores are highly variable (exhibit a very wide range of possible scores) at very fast inspection speeds but get much less variable and tend to be somewhat higher as inspection speed increases (where inspectors take longer to make their quality control decisions). Thus, there does appear to be some relationship between inspection accuracy and inspection speed (a weak positive relationship, since the cloud of points tends to be very loose but tilted generally upward toward the right side of the graph; slower speeds tend to be slightly associated with higher accuracy).

Fig. 5.8 Scatterplot relating inspection accuracy to inspection speed

However, it is not the case that the inspection decisions which take longest to make are necessarily the most accurate (see the labelled points for inspectors 7 and 62 in Fig. 5.8 ). Thus, Fig. 5.8 does not show a simple relationship that can be unambiguously summarised by a statement like “the longer an inspector takes to make a quality control decision, the more accurate that decision is likely to be”. The story is more complicated.

Some software packages, such as SPSS, STATGRAPHICS and SYSTAT, offer the option of using different plotting symbols or markers to represent the members of different groups so that the relationship between the two focal variables (the ones anchoring the X and Y axes) can be clarified with reference to a third categorical measure.

Maree might want to see if the relationship depicted in Fig. 5.8 changes depending upon whether the inspector was tertiary-qualified or not (this information is represented in the educlev variable of the QCI database).

Figure 5.9 shows what such a modified scatterplot might look like; the legend in the upper corner of the figure defines the marker symbols for each category of the educlev variable. Note that for both High School only-educated inspectors and Tertiary-qualified inspectors, the general fan-shaped relationship between accuracy and speed is the same. However, it appears that the distribution of points for the High School only-educated inspectors is shifted somewhat upward and toward the right of the plot suggesting that these inspectors tend to be somewhat more accurate as well as slower in their decision processes.

Fig. 5.9 Scatterplot displaying accuracy vs speed conditional on educlev group

There are many other styles of graphs available, often dependent upon the specific statistical package you are using. Interestingly, NCSS and, particularly, SYSTAT and STATGRAPHICS, appear to offer the most variety in terms of types of graphs available for visually representing data. A reading of the user’s manuals for these programs (see the Useful additional readings) would expose you to the great diversity of plotting techniques available to researchers. Many of these techniques go by rather interesting names such as: Chernoff’s faces, radar plots, sunflower plots, violin plots, star plots, Fourier blobs, and dot plots.

Advantages

These graphical methods provide summary techniques for visually presenting certain characteristics of a set of data. Visual representations are generally easier to understand than a tabular representation and, when these plots are combined with available numerical statistics, they can give a very complete picture of a sample of data. Newer methods have become available which permit more complex representations to be depicted, opening possibilities for creatively visually representing more aspects and features of the data (leading to a style of visual data storytelling called infographics; see, for example, McCandless 2014; Toseland and Toseland 2012). Many of these newer methods can display data patterns from multiple variables in the same graph (several of these newer graphical methods are illustrated and discussed in Procedure 5.3).

Disadvantages

Graphs tend to be cumbersome and space consuming if a great many variables need to be summarised. In such cases, using numerical summary statistics (such as means or correlations) in tabular form alone will provide a more economical and efficient summary. Also, it can be very easy to give a misleading picture of data trends using graphical methods by simply choosing the ‘correct’ scaling for maximum effect or choosing a display option (such as a 3-D effect) that ‘looks’ presentable but which actually obscures a clear interpretation (see Smithson 2000; Wilkinson 2009).

Thus, you must be careful in creating and interpreting visual representations so that aesthetic choices made for the sake of appearance do not become more important than obtaining a faithful and valid representation of the data—a very real danger with many of today’s statistical packages where ‘default’ drawing options have been pre-programmed in. No single plot can completely summarise all possible characteristics of a sample of data. Thus, choosing a specific method of graphical display may, of necessity, force a behavioural researcher to represent certain data characteristics (such as frequency) at the expense of others (such as averages).

Virtually any research design which produces quantitative data and statistics (even to the extent of just counting the number of occurrences of several events) provides opportunities for graphical data display which may help to clarify or illustrate important data characteristics or relationships. Remember, graphical displays are communication tools just like numbers—which tool to choose depends upon the message to be conveyed. Visual representations of data are generally more useful in communicating to lay persons who are unfamiliar with statistics. Care must be taken though as these same lay people are precisely the people most likely to misinterpret a graph if it has been incorrectly drawn or scaled.

Application Procedures
SPSS: … then choose from a range of gallery chart types; drag the chart type into the working area and customise the chart with desired variables, labels, etc. Many elements of a chart, including error bars, can be controlled.
NCSS: … whichever type of chart you choose, you can control many features of the chart from the dialog box that pops open upon selection.
STATGRAPHICS: … whichever type of chart you choose, you can control a number of features of the chart from the series of dialog boxes that pops open upon selection.
SYSTAT: … (which offers a range of other more novel graphical displays, including the dual histogram). For each choice, a dialog box opens which allows you to control almost every characteristic of the graph you want.
R Commander: …; for some graphs (… being the exception), there is minimal control offered by R Commander over the appearance of the graph (you need to use full R commands to control more aspects; e.g. see Chang).

Procedure 5.3: Multivariate Graphs & Displays

Graphical methods for displaying multivariate data (i.e. many variables at once) include scatterplot matrices, radar (or spider) plots, multiplots, parallel coordinate displays, and icon plots. Multivariate graphs are useful for visualising broad trends and patterns across many variables (Cleveland 1995; Jacoby 1998). Such graphs typically sacrifice precision in representation in favour of a snapshot pictorial summary that can help you form general impressions of data patterns.

It is important to note that what is presented here is a small but reasonably representative sampling of the types of graphs one can produce to summarise and display trends in multivariate data. Generally speaking, SYSTAT offers the best facilities for producing multivariate graphs, followed by STATGRAPHICS, but with the drawback that it is somewhat tricky to get the graphs in exactly the form you want. SYSTAT also has excellent facilities for creating new forms and combinations of graphs – essentially allowing graphs to be tailor-made for a specific communication purpose. Both SPSS and NCSS offer a more limited range of multivariate graphs, generally restricted to scatterplot matrices and variations of multiplots. Microsoft Excel or STATGRAPHICS are the packages to use if radar or spider plots are desired.

Scatterplot Matrices

A scatterplot matrix is a useful multivariate graph designed to show relationships between pairs of many variables in the same display.

Figure 5.10 illustrates a scatterplot matrix, produced using SYSTAT, for the mentabil, accuracy, speed, jobsat and workcond variables in the QCI database. It is easy to see that all the scatterplot matrix does is stack all pairs of scatterplots into a format where it is easy to pick out the graph for any ‘row’ variable that intersects a ‘column’ variable.

Fig. 5.10 Scatterplot matrix relating mentabil, accuracy, speed, jobsat & workcond
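A similar matrix can be sketched in R using the base pairs() function, whose diag.panel argument accepts a custom panel for the diagonal cells; the panel.hist helper below follows the worked example in R’s own pairs() documentation, and the qci data frame is again an assumed stand-in for the imported QCI data.

    # Scatterplot matrix with histograms on the diagonal, as in Fig. 5.10.
    panel.hist <- function(x, ...) {
      usr <- par("usr"); on.exit(par(usr = usr))  # restore coordinates afterwards
      par(usr = c(usr[1:2], 0, 1.5))              # make vertical room for the bars
      h <- hist(x, plot = FALSE)
      rect(h$breaks[-length(h$breaks)], 0,
           h$breaks[-1], h$counts / max(h$counts))
    }
    pairs(qci[, c("mentabil", "accuracy", "speed", "jobsat", "workcond")],
          diag.panel = panel.hist)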

In those plots where a ‘row’ variable intersects itself in a column of the matrix (along the so-called ‘diagonal’), SYSTAT permits a range of univariate displays to be shown. Figure 5.10 shows univariate histograms for each variable (recall Procedure 5.2). One obvious drawback of the scatterplot matrix is that, if many variables are to be displayed (say, ten or more), the graph gets very crowded and becomes very hard to visually appreciate.

Looking at the first column of graphs in Fig. 5.10, we can see the scatterplot relationships between mentabil and each of the other variables. We can get a visual impression that mentabil seems to be slightly negatively related to accuracy (the cloud of scatter points tends to angle downward to the right, suggesting, very slightly, that higher mentabil scores are associated with lower levels of accuracy).

Conversely, the visual impression of the relationship between mentabil and speed is that the relationship is slightly positive (higher mentabil scores tend to be associated with higher speed scores = longer inspection times). Similar types of visual impressions can be formed for other parts of Fig. 5.10. Notice that the histogram plots along the diagonal give a clear impression of the shape of the distribution for each variable.

Radar Plots

The radar plot (also known as a spider graph, for obvious reasons) is a simple and effective device for displaying scores on many variables. Microsoft Excel offers a range of options and capabilities for producing radar plots, such as the plot shown in Fig. 5.11. Radar plots are generally easy to interpret and provide a good visual basis for comparing plots from different individuals or groups, even if a fairly large number of variables (say, up to about 25) are being displayed.

Like a clock face, variables are evenly spaced around the centre of the plot in clockwise order starting at the 12 o’clock position. Visual interpretation of a radar plot primarily relies on shape comparisons, i.e. the rise and fall of peaks and valleys along the spokes around the plot. Valleys near the centre display low scores on specific variables; peaks near the outside of the plot display high scores on specific variables. [Note that, technically, radar plots employ polar coordinates.] SYSTAT can draw graphs using polar coordinates but not as easily as Excel can, from the user’s perspective.

Radar plots work best if all the variables represented are measured on the same scale (e.g. a 1 to 7 Likert-type scale or 0% to 100% scale). Individuals who are missing any scores on the variables being plotted are typically omitted.

Fig. 5.11 Radar plot comparing attitude ratings for inspectors 66 and 104
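Outside of Excel, one commonly used R option for radar plots is the fmsb package (an assumption about available tooling). In the sketch below, the variable names att1 to att9 and all the rating values are invented purely to show the mechanics; a real plot would take the two inspectors’ rows from the QCI data.

    # Radar plot comparing two inspectors on nine 7-point attitude scales.
    library(fmsb)  # radarchart() needs the axis maxima and minima as rows 1 and 2
    ratings <- as.data.frame(rbind(
      rep(7, 9),                          # row 1: axis maxima
      rep(1, 9),                          # row 2: axis minima
      c(7, 7, 6, 7, 2, 2, 6, 7, 6),       # invented ratings for one inspector
      c(1, 2, 1, 1, 2, 1, 2, 1, 1)))      # invented ratings for another
    names(ratings) <- paste0("att", 1:9)  # placeholder variable names
    radarchart(ratings, pcol = c("blue", "red"), vlcex = 0.8)
    legend("topright", legend = c("Inspector A", "Inspector B"),
           lty = 1, col = c("blue", "red"))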

The radar plot in Fig. 5.11, produced using Excel, compares two specific inspectors, 66 and 104, on the nine attitude rating scales. Inspector 66 gave the highest rating (= 7) on the cultqual variable and inspector 104 gave the lowest rating (= 1). The plot shows that inspector 104 tended to provide very low ratings on all nine attitude variables, whereas inspector 66 tended to give very high ratings on all variables except acctrain and trainapp, where the scores were similar to those for inspector 104. Thus, in general, inspector 66 tended to show much more positive attitudes toward their workplace compared to inspector 104.

While Fig. 5.11 was generated to compare the scores for two individuals in the QCI database, it would be just as easy to produce a radar plot that compared the five types of companies in terms of their average ratings on the nine variables, as shown in Fig. 5.12.

Fig. 5.12 Radar plot comparing average attitude ratings for five types of company

Here we can form the visual impression that the five types of companies differ most in their average ratings of mgmtcomm and least in the average ratings of polsatis. Overall, the average ratings from inspectors from PC manufacturers (black diamonds with solid lines) seem to be generally the most positive, as their scores lie on or near the outer ring of scores, and those from Automobile manufacturers tend to be least positive on many variables (except the training-related variables).

Extrapolating from Fig. 5.12, you may rightly conclude that including too many groups and/or too many variables in a radar plot comparison can lead to so much clutter that any visual comparison would be severely degraded. You may have to experiment with using colour-coded lines to represent different groups versus line and marker shape variations (as used in Fig. 5.12), because choice of coding method for groups can influence the interpretability of a radar plot.

Multiplots

A multiplot is simply a hybrid style of graph that can display group comparisons across a number of variables. There are a wide variety of possible multiplots one could potentially design (SYSTAT offers great capabilities with respect to multiplots). Figure 5.13 shows a multiplot comprising a side-by-side series of profile-based line graphs – one graph for each type of company in the QCI database.

Fig. 5.13 Multiplot comparing profiles of average attitude ratings for five company types

The multiplot in Fig. 5.13, produced using SYSTAT, graphs the profile of average attitude ratings for all inspectors within a specific type of company. This multiplot shows the same story as the radar plot in Fig. 5.12, but in a different graphical format. It is still fairly clear that the average ratings from inspectors from PC manufacturers tend to be higher than for the other types of companies and the profile for inspectors from automobile manufacturers tends to be lower than for the other types of companies.

The profile for inspectors from large electrical appliance manufacturers is the flattest, meaning that their average attitude ratings were less variable than for other types of companies. Comparing the ease with which you can glean the visual impressions from Figs. 5.12 and 5.13 may lead you to prefer one style of graph over another. If you have such preferences, chances are others will also, which may mean you need to carefully consider your options when deciding how best to display data for effect.

Frequently, choice of graph is less a matter of which style is right or wrong, but more a matter of which style will suit specific purposes or convey a specific story, i.e. the choice is often strategic.

Parallel Coordinate Displays

A parallel coordinate display is useful for displaying individual scores on a range of variables, all measured using the same scale. Furthermore, such graphs can be combined side-by-side to facilitate very broad visual comparisons among groups, while retaining individual profile variability in scores. Each line in a parallel coordinate display represents one individual, e.g. an inspector.

The interpretation of a parallel coordinate display, such as the two shown in Fig. 5.14, depends on visual impressions of the peaks and valleys (highs and lows) in the profiles as well as on the density of similar profile lines. The graph is called ‘parallel coordinate’ simply because it assumes that all variables are measured on the same scale and that scores for each variable can therefore be located along vertical axes that are parallel to each other (imagine vertical lines on Fig. 5.14 running from bottom to top for each variable on the X-axis). The main drawback of this method of data display is that only those individuals in the sample who provided legitimate scores on all of the variables being plotted (i.e. who have no missing scores) can be displayed.

Fig. 5.14 Parallel coordinate displays comparing profiles of average attitude ratings for five company types
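A minimal version of such a side-by-side display can be sketched in R with parcoord() from the MASS package. The qci data frame, the company labels and the att1 to att9 variable names below are illustrative assumptions about how the QCI data might be organised.

    # Side-by-side parallel coordinate displays for two company types.
    library(MASS)  # provides parcoord()
    att <- paste0("att", 1:9)   # placeholder names for the nine attitude variables
    pc_firms   <- na.omit(qci[qci$company == "PC", att])         # complete cases only
    auto_firms <- na.omit(qci[qci$company == "Automobile", att])
    op <- par(mfrow = c(1, 2))  # draw the two displays side by side
    parcoord(pc_firms,   main = "PC manufacturers")
    parcoord(auto_firms, main = "Automobile manufacturers")
    par(op)                     # restore the plotting layout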

The parallel coordinate display in Fig. 5.14, produced using SYSTAT, graphs the profile of average attitude ratings for all inspectors within two specific types of company: the left graph for inspectors from PC manufacturers and the right graph for automobile manufacturers.

There are fewer lines in each display than the number of inspectors from each type of company simply because several inspectors from each type of company were missing a rating on at least one of the nine attitude variables. The graphs show great variability in scores amongst inspectors within a company type, but there are some overall patterns evident.

For example, inspectors from automobile companies clearly and fairly uniformly rated mgmtcomm toward the low end of the scale, whereas the reverse was generally true for that variable for inspectors from PC manufacturers. Conversely, inspectors from automobile companies tend to rate acctrain and trainapp more toward the middle to high end of the scale, whereas the reverse is generally true for those variables for inspectors from PC manufacturers.

Icon Plots

Perhaps the most creative types of multivariate displays are the so-called icon plots. SYSTAT and STATGRAPHICS offer an impressive array of different types of icon plots, including, amongst others, Chernoff’s faces, profile plots, histogram plots, star glyphs and sunray plots (Jacoby 1998 provides a detailed discussion of icon plots).

Icon plots generally use a specific visual construction to represent the variable scores obtained by each individual within a sample or group. All icon plots are thus methods for displaying the response patterns for individual members of a sample, as long as those individuals are not missing any scores on the variables to be displayed (note that this is the same limitation as for radar plots and parallel coordinate displays). To illustrate icon plots, without generating too many icons to focus on, Figs. 5.15, 5.16, 5.17 and 5.18 present four different icon plots for QCI inspectors classified, using a new variable called BEST_WORST, as either the worst performers (= 1, where their accuracy scores were less than 70%) or the best performers (= 2, where their accuracy scores were 90% or greater).

Fig. 5.15 Chernoff’s faces icon plot comparing individual attitude ratings for best and worst performing inspectors

Fig. 5.16 Profile plot comparing individual attitude ratings for best and worst performing inspectors

Fig. 5.17 Histogram plot comparing individual attitude ratings for best and worst performing inspectors

Fig. 5.18 Sunray plot comparing individual attitude ratings for best and worst performing inspectors

The Chernoff’s faces plot gets its name from the visual icon used to represent variable scores – a cartoon-type face. This icon tries to capitalise on our natural human ability to recognise and differentiate faces. Each feature of the face is controlled by the scores on a single variable. In SYSTAT, up to 20 facial features are controllable; the first five being curvature of mouth, angle of brow, width of nose, length of nose and length of mouth (SYSTAT Software Inc., 2009, p. 259). The theory behind Chernoff’s faces is that similar patterns of variable scores will produce similar looking faces, thereby making similarities and differences between individuals more apparent.

The profile plot and histogram plot are actually two variants of the same type of icon plot. A profile plot represents individuals’ scores for a set of variables using simplified line graphs, one per individual. The profile is scaled so that the vertical height of the peaks and valleys correspond to actual values for variables where the variables anchor the X-axis in a fashion similar to the parallel coordinate display. So, as you examine a profile from left to right across the X-axis of each graph, you are looking across the set of variables. A histogram plot represents the same information in the same way as for the profile plot but using histogram bars instead.

Figure 5.15, produced using SYSTAT, shows a Chernoff’s faces plot for the best and worst performing inspectors using their ratings of job satisfaction, working conditions and the nine general attitude statements.

Each face is labelled with the inspector number it represents. The gaps indicate where an inspector had missing data on at least one of the variables, meaning a face could not be generated for them. The worst performers are drawn using red lines; the best using blue lines. The first variable is jobsat and this variable controls mouth curvature; the second variable is workcond and this controls angle of brow, and so on. It seems clear that there are differences in the faces between the best and worst performers with, for example, best performers tending to be more satisfied (smiling) and with higher ratings for working conditions (brow angle).

Beyond a broad visual impression, there is little in terms of precise inferences you can draw from a Chernoff’s faces plot. It really provides a visual sketch, nothing more. The fact that there is no obvious link between facial features, variables and score levels means that the Chernoff’s faces icon plot is difficult to interpret at the level of individual variables – a holistic impression of similarity and difference is what this type of plot facilitates.
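For completeness, Chernoff-style faces can also be drawn in R via the aplpack package; this is an assumption about tooling (the original figure came from SYSTAT), and the qci data frame and variable names remain illustrative stand-ins. As with all icon plots, only complete cases can be drawn.

    # Chernoff's faces for inspectors with complete data on the plotted variables.
    library(aplpack)  # assumed installed; provides faces()
    vars <- c("jobsat", "workcond", paste0("att", 1:9))  # placeholder names
    complete <- na.omit(qci[, c("inspector", vars)])     # drop missing-data cases
    faces(complete[, vars], labels = as.character(complete$inspector))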

Figure 5.16, produced using SYSTAT, shows a profile plot for the best and worst performing inspectors using their ratings of job satisfaction, working conditions and the nine attitude variables.

Like the Chernoff’s faces plot (Fig. 5.15), as you read across the rows of the plot from left to right, each graph corresponds to an inspector in the sample who was either in the worst performer (red) or best performer (blue) category. The first attitude variable is jobsat and anchors the left end of each line graph; the last variable is polsatis and anchors the right end of the line graph. The remaining variables are represented in order from left to right across the X-axis of each graph. Figure 5.16 shows that these inspectors are rather different in their attitude profiles, with best performers tending to show taller profiles on the first two variables, for example.

Figure 5.17, produced using SYSTAT, shows a histogram plot for the best and worst performing inspectors based on their ratings of job satisfaction, working conditions and the nine attitude variables. This plot tells the same story as the profile plot, only using histogram bars. Some people would prefer the histogram icon plot to the profile plot because each histogram bar corresponds to one variable, making the visual linking of a specific bar to a specific variable much easier than visually linking a specific position along the profile line to a specific variable.

The sunray plot is actually a simplified adaptation of the radar plot (called a “star glyph”) used to represent scores on a set of variables for each individual within a sample or group. Remember that a radar plot basically arranges the variables around a central point like a clock face; the first variable is represented at the 12 o’clock position and the remaining variables follow around the plot in a clockwise direction.

While the spokes (the actual ‘star’ of the glyph’s name) of the plot are visible, unlike a radar plot there is no interpretive scale evident. A variable’s score is visually represented by its distance from the central point. Thus, the star glyphs in a sunray plot are designed, like Chernoff’s faces, to provide a general visual impression, based on icon shape. A wide-diameter, well-rounded plot indicates an individual with high scores on all variables; a small-diameter, well-rounded plot indicates the reverse. Jagged plots represent individuals with highly variable scores across the variables. ‘Stars’ of similar size, shape and orientation represent similar individuals.

Figure 5.18, produced using STATGRAPHICS, shows a sunray plot for the best and worst performing inspectors. An interpretation glyph is also shown in the lower right corner of Fig. 5.18, where variables are aligned with the spokes of a star (e.g. jobsat is at the 12 o’clock position). This sunray plot could lead you to form the visual impression that the worst performing inspectors (group 1) have rather less rounded rating profiles than do the best performing inspectors (group 2) and that the jobsat and workcond spokes are generally lower for the worst performing inspectors.

Comparatively speaking, the sunray plot makes identifying similar individuals a bit easier (perhaps even easier than Chernoff’s faces) and, when ordered as STATGRAPHICS showed in Fig. 5.18 , permits easier visual comparisons between groups of individuals, but at the expense of precise knowledge about variable scores. Remember, a holistic impression is the goal pursued using a sunray plot.
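Star glyphs of the kind underlying a sunray plot can be sketched with base R’s stars() function, again assuming the illustrative qci data frame and variable names used earlier; key.loc simply positions a unit ‘interpretation’ glyph like the one in the corner of Fig. 5.18.

    # One star glyph per inspector, using only complete cases.
    vars <- c("jobsat", "workcond", paste0("att", 1:9))  # placeholder names
    complete <- na.omit(qci[, c("inspector", vars)])
    stars(complete[, vars],
          labels = as.character(complete$inspector),  # label each glyph
          key.loc = c(14, 1.5))                       # where to draw the key glyph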

Multivariate graphical methods provide summary techniques for visually presenting certain characteristics of a complex array of data on many variables. Such visual representations are generally better at helping us to form holistic impressions of multivariate data than any sort of tabular representation or numerical index. They also allow us to compress many numerical measures into a finite representation that is generally easy to understand. Multivariate graphical displays can add interest to an otherwise dry statistical reporting of numerical data. They are designed to appeal to our pattern recognition skills, focusing our attention on features of the data such as shape, level, variability and orientation. Some multivariate graphs (e.g. radar plots, sunray plots and multiplots) are useful not only for representing score patterns for individuals but also for providing summaries of score patterns across groups of individuals.

Multivariate graphs tend to get very busy-looking and are hard to interpret if a great many variables or a large number of individuals need to be displayed (imagine any of the icon plots, for a sample of 200 questionnaire participants, displayed on an A4 page – each icon would be so small that its features could not be easily distinguished, thereby defeating the purpose of the display). In such cases, using numerical summary statistics (such as averages or correlations) in tabular form alone will provide a more economical and efficient summary. Also, some multivariate displays will work better for conveying certain types of information than others.

Information about variable relationships may be better displayed using a scatterplot matrix. Information about individual similarities and difference on a set of variables may be better conveyed using a histogram or sunray plot. Multiplots may be better suited to displaying information about group differences across a set of variables. Information about the overall similarity of individual entities in a sample might best be displayed using Chernoff’s faces.

Because people differ greatly in their visual capacities and preferences, certain types of multivariate displays will work for some people and not others. Sometimes, people will not see what you see in the plots. Some plots, such as Chernoff’s faces, may not strike a reader as a serious statistical procedure and this could adversely influence how convinced they will be by the story the plot conveys. None of the multivariate displays described here provide sufficiently precise information for solid inferences or interpretations; all are designed to simply facilitate the formation of holistic visual impressions. In fact, you may have noticed that some displays (scatterplot matrices and the icon plots, for example) provide no numerical scaling information that would help make precise interpretations. If precision in summary information is desired, the types of multivariate displays discussed here would not be the best strategic choices.

Virtually any research design which produces quantitative data/statistics for multiple variables provides opportunities for multivariate graphical data display which may help to clarify or illustrate important data characteristics or relationships. Thus, for survey research involving many identically-scaled attitudinal questions, a multivariate display may be just the device needed to communicate something about patterns in the data. Multivariate graphical displays are simply specialised communication tools designed to compress a lot of information into a meaningful and efficient format for interpretation—which tool to choose depends upon the message to be conveyed.

Generally speaking, visual representations of multivariate data could prove more useful in communicating to lay persons who are unfamiliar with statistics or who prefer visual as opposed to numerical information. However, these displays would probably require some interpretive discussion so that the reader clearly understands their intent.

Application Procedures
SPSS: … then choose from the gallery; drag the chart type into the working area and customise the chart with desired variables, labels, etc. Only a few elements of each chart can be configured and altered.
NCSS: … Only a few elements of this plot are customisable in NCSS.
SYSTAT: … (you can select what type of plot you want to appear in the diagonal boxes) or … (… can be selected by choosing a variable, e.g. …) or … or … (for icon plots, you can choose from a range of icons including Chernoff’s faces, histogram, star, sun or profile, amongst others). A large number of elements of each type of plot are easily customisable, although it may take some trial and error to get exactly the look you want.
STATGRAPHICS: … or … or … or … Several elements of each type of plot are easily customisable, although it may take some trial and error to get exactly the look you want.
R Commander: … You can select what type of plot you want to appear in the diagonal boxes, and you can control some other features of the plot. Other multivariate data displays are available via various R packages (e.g. the … or … package), but not through R Commander.

Procedure 5.4: Assessing Central Tendency

The three most commonly reported measures of central tendency are the mean, median and mode. Each measure reflects a specific way of defining central tendency in a distribution of scores on a variable and each has its own advantages and disadvantages.

The mean is the most widely used measure of central tendency (also called the arithmetic average). Very simply, a mean is the sum of all the scores for a specific variable in a sample divided by the number of scores used in obtaining the sum. The resulting number reflects the average score for the sample of individuals on which the scores were obtained. If one were asked to predict the score that any single individual in the sample would obtain, the best prediction, in the absence of any other relevant information, would be the sample mean. Many parametric statistical methods (such as several of the procedures discussed in Chap. 7) deal with sample means in one way or another. For any sample of data, there is one and only one possible value for the mean in a specific distribution. For most purposes, the mean is the preferred measure of central tendency because it utilises all the available information in a sample.

In the context of the QCI database, Maree could quite reasonably ask what inspectors scored on the average in terms of mental ability ( mentabil ), inspection accuracy ( accuracy ), inspection speed ( speed ), overall job satisfaction ( jobsat ), and perceived quality of their working conditions ( workcond ). Table 5.3 shows the mean scores for the sample of 112 quality control inspectors on each of these variables. The statistics shown in Table 5.3 were computed using the SPSS Frequencies ... procedure. Notice that the table indicates how many of the 112 inspectors had a valid score for each variable and how many were missing a score (e.g. 109 inspectors provided a valid rating for jobsat; 3 inspectors did not).

Table 5.3 Measures of central tendency for specific QCI variables

Each mean needs to be interpreted in terms of the original units of measurement for each variable. Thus, the inspectors in the sample showed an average mental ability score of 109.84 (higher than the general population mean of 100 for the test), an average inspection accuracy of 82.14%, and an average speed for making quality control decisions of 4.48 s. Furthermore, in terms of their work context, inspectors reported an average overall job satisfaction of 4.96 (on the 7-point scale, nearly one full scale point above the Neutral point of 4, indicating a generally positive but not strong level of job satisfaction) and an average perceived quality of work conditions of 4.21 (on the 7-point scale, just about at the level of Stressful but Tolerable).
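For readers working in R, the means in Table 5.3 are trivially reproduced, assuming an imported qci data frame with these variable names; na.rm = TRUE is needed because some inspectors are missing scores on some variables.

    # Sample means for the five variables; missing scores are skipped.
    sapply(qci[, c("mentabil", "accuracy", "speed", "jobsat", "workcond")],
           mean, na.rm = TRUE)   # e.g. speed = 4.48 s, jobsat = 4.96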

The mean is sensitive to the presence of extreme values, which can distort its value, giving a biased indication of central tendency. As we will see below, the median is an alternative statistic to use in such circumstances. However, it is also possible to compute what is called a trimmed mean, where the mean is calculated after a certain percentage (say, 5% or 10%) of the lowest and highest scores in a distribution have been ignored (a process called ‘trimming’; see, for example, the discussion in Field 2018, pp. 262–264). This yields a statistic less influenced by extreme scores. The drawbacks are that the decision as to what percentage to trim can be somewhat subjective and trimming necessarily sacrifices information (i.e. the extreme scores) in order to achieve a less biased measure. Some software packages, such as SPSS, SYSTAT or NCSS, can report a specific percentage trimmed mean, if that option is selected for descriptive statistics or exploratory data analysis procedures (see Procedure 5.6). Comparing the original mean with a trimmed mean can provide an indication of the degree to which the original mean has been biased by extreme values.
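In R, trimming is built into the mean() function itself: the trim argument gives the proportion of scores dropped from each tail before averaging (the qci data frame is the same assumed stand-in as before).

    mean(qci$speed, na.rm = TRUE)                # ordinary mean, about 4.48 s
    mean(qci$speed, trim = 0.05, na.rm = TRUE)   # mean after trimming 5% from each tail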

Very simply, the median is the centre or middle score of a set of scores. By ‘centre’ or ‘middle’ is meant that 50% of the data values are smaller than or equal to the median and 50% of the data values are larger when the entire distribution of scores is rank ordered from the lowest to highest value. Thus, we can say that the median is that score in the sample which occurs at the 50th percentile. [Note that a ‘percentile’ is the specific score at or below which a specific percentage of the sample scored. Thus, a score at the 25th percentile means that 25% of the sample achieved this score or a lower score.] Table 5.3 shows the 25th, 50th and 75th percentile scores for each variable – note how the 50th percentile score is exactly equal to the median in each case.
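This equivalence is easy to confirm in R under the same qci assumption; note that packages interpolate percentiles in slightly different ways, so small differences from SPSS output are possible.

    quantile(qci$speed, probs = c(0.25, 0.50, 0.75), na.rm = TRUE)
    median(qci$speed, na.rm = TRUE)   # identical to the 50th percentile above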

The median is reported somewhat less frequently than the mean but does have some advantages over the mean in certain circumstances. One such circumstance is when the sample of data has a few extreme values in one direction (either very large or very small relative to all other scores). In this case, the mean would be influenced (biased) to a much greater degree than would the median since all of the data are used to calculate the mean (including the extreme scores) whereas only the single centre score is needed for the median. For this reason, many nonparametric statistical procedures (such as several of the procedures discussed in Chap. 7) focus on the median as the comparison statistic rather than on the mean.

A discrepancy between the values for the mean and median of a variable provides some insight into the degree to which the mean is being influenced by the presence of extreme data values. In a distribution where there are no extreme values on either side of the distribution (or where extreme values balance each other out on either side of the distribution, as happens in a normal distribution – see Fundamental Concept II), the mean and the median will coincide at the same value and the mean will not be biased.

For highly skewed distributions, however, the value of the mean will be pulled toward the long tail of the distribution because that is where the extreme values lie. However, in such skewed distributions, the median will be insensitive (statisticians call this property ‘robustness’) to extreme values in the long tail. For this reason, the direction of the discrepancy between the mean and median can give a very rough indication of the direction of skew in a distribution (‘mean larger than median’ signals possible positive skewness; ‘mean smaller than median’ signals possible negative skewness). Like the mean, there is one and only one possible value for the median in a specific distribution.

In Fig. 5.19, the left graph shows the distribution of speed scores and the right-hand graph shows the distribution of accuracy scores. The speed distribution clearly shows the mean being pulled toward the right tail of the distribution whereas the accuracy distribution shows the mean being just slightly pulled toward the left tail. The effect on the mean is stronger in the speed distribution, indicating a greater biasing effect due to some very long inspection decision times.

Fig. 5.19 Effects of skewness in a distribution on the values for the mean and median

If we refer to Table 5.3, we can see that the median score for each of the five variables has also been computed. Like the mean, the median must be interpreted in the original units of measurement for the variable. We can see that for mentabil, accuracy, and workcond, the value of the median is very close to the value of the mean, suggesting that these distributions are not strongly influenced by extreme data values in either the high or low direction. However, note that the median speed was 3.89 s compared to the mean of 4.48 s, suggesting that the distribution of speed scores is positively skewed (the mean is larger than the median—refer to Fig. 5.19). By contrast, the median jobsat score was 5.00 whereas the mean score was 4.96, suggesting very little substantive skewness in the distribution (mean and median are nearly equal).

The mode is the simplest measure of central tendency. It is defined as the most frequently occurring score in a distribution. Put another way, it is the score that more individuals in the sample obtain than any other score. An interesting problem associated with the mode is that there may be more than one in a specific distribution. In the case where multiple modes exist, the issue becomes which value do you report? The answer is that you must report all of them. In a ‘normal’ bell-shaped distribution, there is only one mode and it is indeed at the centre of the distribution, coinciding with both the mean and the median.

Table 5.3 also shows the mode for each of the five variables. For example, inspectors achieved a mentabil score of 111 more often than any other score, and inspectors reported a jobsat rating of 6 more often than any other rating. SPSS only ever reports one mode even if several are present, so one must be careful and look at a histogram plot for each variable to make a final determination of the mode(s) for that variable.
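Base R has no built-in mode function, but one is easily written and, unlike SPSS, it can return every mode at once (the qci data frame is assumed as before).

    # Return all modes of a variable, not just the first one found.
    stat_mode <- function(x) {
      tab <- table(x)                           # frequency of each distinct score
      as.numeric(names(tab)[tab == max(tab)])   # keep all equally frequent scores
    }
    stat_mode(qci$mentabil)   # e.g. 111
    stat_mode(qci$jobsat)     # e.g. 6; returns several values if multimodal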

All three measures of central tendency yield information about what is going on in the centre of a distribution of scores. The mean and median provide a single number which can summarise the central tendency in the entire distribution. The mode can yield one or multiple indices. With many measurements on individuals in a sample, it is advantageous to have single number indices which can describe the distributions in summary fashion. In a normal or near-normal distribution of sample data, the mean, the median, and the mode will all generally coincide at the one point. In this instance, all three statistics will provide approximately the same indication of central tendency. Note however that it is seldom the case that all three statistics would yield exactly the same number for any particular distribution. The mean is the most useful statistic, unless the data distribution is skewed by extreme scores, in which case the median should be reported.

While measures of central tendency are useful descriptors of distributions, summarising data using a single numerical index necessarily reduces the amount of information available about the sample. Not only do we need to know what is going on in the centre of a distribution, we also need to know what is going on around the centre of the distribution. For this reason, most social and behavioural researchers report not only measures of central tendency, but also measures of variability (see Procedure 5.5). The mode is the least informative of the three statistics because of its potential for producing multiple values.

Measures of central tendency are useful in almost any type of experimental design, survey or interview study, and in any observational studies where quantitative data are available and must be summarised. The decision as to whether the mean or median should be reported depends upon the nature of the data, which should ideally be ascertained by visual inspection of the data distribution. Some researchers opt to report both measures routinely. Computation of means is a prelude to many parametric statistical methods, and comparison of medians is associated with many nonparametric statistical methods (see, for example, the relevant procedures in Chap. 7).

Application Procedures
SPSS: … then press the ‘…’ button and choose mean, median and mode. To see trimmed means, you must use the Exploratory Data Analysis procedure; see Procedure 5.6.
NCSS: … then select the reports and plots that you want to see; make sure you indicate that you want to see the ‘Means Section’ of the Report. If you want to see trimmed means, tick the ‘Trimmed Section’ of the Report.
SYSTAT: … then select the mean, median and mode (as well as any other statistics you might wish to see). If you want to see trimmed means, tick the ‘Trimmed mean’ section of the dialog box and set the percentage to trim in the box labelled ‘Two-sided’.
STATGRAPHICS: … or … then choose the variable(s) you want to describe and select Summary Statistics (you don’t get any options for statistics to report – measures of central tendency and variability are automatically produced). STATGRAPHICS will not report modes, and you will need to use … and request ‘Percentiles’ in order to see the 50th percentile score, which will be the median; however, it won’t be labelled as the median.
R Commander: … then select the central tendency statistics you want to see. R Commander will not produce modes and, to see the median, make sure that the ‘Quantiles’ box is ticked – the .5 quantile (50th percentile) score is the median; however, it won’t be labelled as the median.

Procedure 5.5: Assessing Variability

There are a variety of measures of variability to choose from, including the range, interquartile range, variance and standard deviation. Each measure reflects a specific way of defining variability in a distribution of scores on a variable and each has its own advantages and disadvantages. Most measures of variability are associated with a specific measure of central tendency, so researchers are now commonly expected to report both a measure of central tendency and its associated measure of variability whenever they display numerical descriptive statistics on continuous or rank-ordered variables.

Range

This is the simplest measure of variability for a sample of data scores. The range is merely the largest score in the sample minus the smallest score in the sample. The range is the one measure of variability not explicitly associated with any measure of central tendency. It gives a very rough indication as to the extent of spread in the scores. However, since the range uses only two of the total available scores in the sample, the rest of the scores are ignored, which means that a lot of potentially useful information is being sacrificed. There are also problems if either the highest or lowest (or both) scores are atypical or too extreme in their value (as in highly skewed distributions). When this happens, the range gives a very inflated picture of the typical variability in the scores. Thus, the range tends not to be a frequently reported measure of variability.

Table 5.4 shows a set of descriptive statistics, produced by the SPSS Frequencies procedure, for the mentabil, accuracy, speed, jobsat and workcond measures in the QCI database. In the table, you will find three rows labelled ‘Range’, ‘Minimum’ and ‘Maximum’.

Table 5.4 Measures of central tendency and variability for specific QCI variables

Using the data from these three rows, we can draw the following descriptive picture. Mentabil scores spanned a range of 50 (from a minimum score of 85 to a maximum score of 135). Speed scores had a range of 16.05 s (from 1.05 s – the fastest quality decision – to 17.10 s – the slowest quality decision). Accuracy scores had a range of 43 (from 57% – the least accurate inspector – to 100% – the most accurate inspector). Both work context measures (jobsat and workcond) exhibited a range of 6 – the largest possible range given the 1 to 7 scale of measurement for these two variables.
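These figures are simple to reproduce in R, assuming the usual qci data frame.

    range(qci$speed, na.rm = TRUE)         # minimum and maximum, e.g. 1.05 and 17.10
    diff(range(qci$speed, na.rm = TRUE))   # the range statistic itself, e.g. 16.05 s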

Interquartile Range

The Interquartile Range ( IQR ) is a measure of variability that is specifically designed to be used in conjunction with the median. The IQR also takes care of the extreme data problem which typically plagues the range measure. The IQR is defined as the range that is covered by the middle 50% of scores in a distribution once the scores have been ranked in order from lowest value to highest value. It is found by locating the value in the distribution at or below which 25% of the sample scored and subtracting this number from the value in the distribution at or below which 75% of the sample scored. The IQR can also be thought of as the range one would compute after the bottom 25% of scores and the top 25% of scores in the distribution have been ‘chopped off’ (or ‘trimmed’ as statisticians call it).

The IQR gives a much more stable picture of the variability of scores and, like the median, is relatively insensitive to the biasing effects of extreme data values. Some behavioural researchers prefer to divide the IQR in half, which gives a measure called the Semi-Interquartile Range (S-IQR). The S-IQR can be interpreted as the distance one must travel away from the median, in either direction, to reach the value which separates the top (or bottom) 25% of scores in the distribution from the remaining 75%.

The IQR or S-IQR is typically not produced by descriptive statistics procedures by default in many computer software packages; however, it can usually be requested as an optional statistic to report or it can easily be computed by hand using percentile scores. Both the median and the IQR figure prominently in Exploratory Data Analysis, particularly in the production of boxplots (see Procedure 5.6).

Figure 5.20 illustrates the conceptual nature of the IQR and S-IQR compared to that of the range. Assume that 100% of data values are covered by the distribution curve in the figure. It is clear that these three measures would provide very different values for a measure of variability. Your choice would depend on your purpose. If you simply want to signal the overall span of scores between the minimum and maximum, the range is the measure of choice. But if you want to signal the variability around the median, the IQR or S-IQR would be the measure of choice.

Fig. 5.20 How the range, IQR and S-IQR measures of variability conceptually differ

Note: Some behavioural researchers refer to the IQR as the hinge-spread (or H-spread) because of its use in the production of boxplots:

  • the 25th percentile data value is referred to as the ‘lower hinge’;
  • the 75th percentile data value is referred to as the ‘upper hinge’; and
  • their difference gives the H-spread.

Midspread is another term you may see used as a synonym for interquartile range.

Referring back to Table 5.4, we can find statistics reported for the median and for the ‘quartiles’ (25th, 50th and 75th percentile scores) for each of the five variables of interest. The ‘quartile’ values are useful for finding the IQR or S-IQR because SPSS does not report these measures directly. The median clearly equals the 50th percentile data value in the table.

If we focus, for example, on the speed variable, we could find its IQR by subtracting the 25th percentile score of 2.19 s from the 75th percentile score of 5.71 s to give a value for the IQR of 3.52 s (the S-IQR would simply be 3.52 divided by 2 or 1.76 s). Thus, we could report that the median decision speed for inspectors was 3.89 s and that the middle 50% of inspectors showed scores spanning a range of 3.52 s. Alternatively, we could report that the median decision speed for inspectors was 3.89 s and that the middle 50% of inspectors showed scores which ranged 1.76 s either side of the median value.
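The same arithmetic in R, under the usual qci assumption (R’s IQR() uses a particular percentile interpolation rule, so values can differ fractionally from SPSS):

    q <- quantile(qci$speed, probs = c(0.25, 0.75), na.rm = TRUE)
    unname(q[2] - q[1])                # IQR by hand: 5.71 - 2.19 = 3.52 s
    IQR(qci$speed, na.rm = TRUE)       # the built-in equivalent
    IQR(qci$speed, na.rm = TRUE) / 2   # the S-IQR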

Note: We could compare the ‘Minimum’ or ‘Maximum’ scores to the 25th percentile score and 75th percentile score respectively to get a feeling for whether the minimum or maximum might be considered extreme or uncharacteristic data values.

Variance

The variance uses information from every individual in the sample to assess the variability of scores relative to the sample mean. Variance assesses the average squared deviation of each score from the mean of the sample. Deviation refers to the difference between an observed score value and the mean of the sample—they are squared simply because adding them up in their naturally occurring unsquared form (where some differences are positive and others are negative) always gives a total of zero, which is useless for an index purporting to measure something.

If many scores are quite different from the mean, we would expect the variance to be large. If all the scores lie fairly close to the sample mean, we would expect a small variance. If all scores exactly equal the mean (i.e. all the scores in the sample have the same value), then we would expect the variance to be zero.
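A tiny worked example in R makes both points concrete: unsquared deviations always sum to zero, and note that R’s var() uses the sample formula with an n - 1 denominator rather than the simple average of squared deviations (the scores below are invented for illustration).

    x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # invented scores with a mean of 5
    sum(x - mean(x))                 # deviations sum to zero (up to rounding error)
    mean((x - mean(x))^2)            # average squared deviation = 4
    var(x)                           # sample variance = 32/7, about 4.57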

Figure 5.21 illustrates some possibilities regarding variance of a distribution of scores having a mean of 100. The very tall curve illustrates a distribution with small variance. The distribution of medium height illustrates a distribution with medium variance and the flattest distribution is a distribution with large variance.

Fig. 5.21 The concept of variance

If we had a distribution with no variance, the curve would simply be a vertical line at a score of 100 (meaning that all scores were equal to the mean). You can see that as variance increases, the tails of the distribution extend further outward and the concentration of scores around the mean decreases. You may have noticed that variance and range (as well as the IQR) will be related, since the range focuses on the difference between the ends of the two tails in the distribution and larger variances extend the tails. So, a larger variance will generally be associated with a larger range and IQR compared to a smaller variance.

It is generally difficult to descriptively interpret the variance measure in a meaningful fashion since it involves squared deviations around the sample mean. [Note: If you look back at Table 5.4, you will see the variance listed for each of the variables (e.g. the variance of accuracy scores is 84.118), but the numbers themselves make little sense and do not relate to the original measurement scale for the variables (which, for the accuracy variable, went from 0% to 100% accuracy).] Instead, we use the variance as a stepping stone for obtaining a measure of variability that we can clearly interpret, namely the standard deviation. However, you should know that variance is an important concept in its own right simply because it provides the statistical foundation for many of the correlational procedures and statistical inference procedures described in Chaps. 6, 7 and 8.

When considering either correlations or tests of statistical hypotheses, we frequently speak of one variable explaining or sharing variance with another (see the relevant procedures in Chaps. 6 and 7). In doing so, we are invoking the concept of variance as set out here—what we are saying is that variability in the behaviour of scores on one particular variable may be associated with or predictive of variability in scores on another variable of interest (e.g. it could explain why those scores have a non-zero variance).

Standard Deviation

The standard deviation (often abbreviated as SD, sd or Std. Dev.) is the most commonly reported measure of variability because it has a meaningful interpretation and is used in conjunction with reports of sample means. Variance and standard deviation are closely related measures in that the standard deviation is found by taking the square root of the variance. The standard deviation, very simply, is a summary number that reflects the ‘average distance of each score from the mean of the sample’. In many parametric statistical methods, both the sample mean and sample standard deviation are employed in some form. Thus, the standard deviation is a very important measure, not only for data description, but also for hypothesis testing and the establishment of relationships as well.

Referring once more to Table 5.4, we’ll focus on the results for the speed variable for discussion purposes. Table 5.4 shows that the mean inspection speed for the QCI sample was 4.48 s. We can also see that the standard deviation (in the row labelled ‘Std Deviation’) for speed was 2.89 s.

This standard deviation has a straightforward interpretation: we would say that ‘on the average, an inspector’s quality inspection decision speed differed from the mean of the sample by about 2.89 s in either direction’. In a normal distribution of scores (see Fundamental Concept II), we would expect to see about 68% of all inspectors having decision speeds between 1.59 s (the mean minus one amount of the standard deviation) and 7.37 s (the mean plus one amount of the standard deviation).
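The same quantities are quickly obtained in R, under the usual qci assumption.

    m <- mean(qci$speed, na.rm = TRUE)   # 4.48 s
    s <- sd(qci$speed, na.rm = TRUE)     # 2.89 s; sd() is the square root of var()
    m + c(-1, 1) * s                     # roughly 1.59 s to 7.37 s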

We noted earlier that the range of the speed scores was 16.05 s. However, the fact that the maximum speed score was 17.1 s compared to the 75th percentile score of just 5.71 s seems to suggest that this maximum speed might be rather atypically large compared to the bulk of speed scores. This means that the range is likely to be giving us a false impression of the overall variability of the inspectors’ decision speeds.

Furthermore, given that the mean speed score was higher than the median speed score, suggesting that speed scores were positively skewed (this was confirmed by the histogram for speed shown in Fig. 5.19 in Procedure 5.4), we might consider emphasising the median and its associated IQR or S-IQR rather than the mean and standard deviation. Of course, similar diagnostic and interpretive work could be done for each of the other four variables in Table 5.4.

Measures of variability (particularly the standard deviation) provide a summary measure that gives an indication of how variable (spread out) a particular sample of scores is. When used in conjunction with a relevant measure of central tendency (particularly the mean), a reasonable yet economical description of a set of data emerges. When there are extreme data values or severe skewness is present in the data, the IQR (or S-IQR) becomes the preferred measure of variability to be reported in conjunction with the sample median (or 50th percentile value). These latter measures are much more resistant (‘robust’) to influence by data anomalies than are the mean and standard deviation.

As mentioned above, the range is a very cursory index of variability; thus, it is not as useful as the variance or standard deviation. Variance has little meaningful interpretation as a descriptive index; hence, the standard deviation is most often reported. However, the standard deviation (or IQR) has little meaning if the sample mean (or median) is not reported along with it.

Knowing that the standard deviation for accuracy is 9.17 tells you little unless you know the mean accuracy (82.14) that it is the standard deviation from.

Like the sample mean, the standard deviation can be strongly biased by the presence of extreme data values or severe skewness in a distribution, in which case the median and IQR (or S-IQR) become the preferred measures. The biasing effect will be most noticeable in samples which are small in size (say, less than 30 individuals) and far less noticeable in large samples (say, in excess of 200 or 300 individuals). [Note that, in a manner similar to a trimmed mean, it is possible to compute a trimmed standard deviation to reduce the biasing effect of extreme data values; see Field 2018, p. 263.]

It is important to realise that the resistance of the median and IQR (or S-IQR) to extreme values is only gained by deliberately sacrificing a good deal of the information available in the sample (nothing is obtained without a cost in statistics). What is sacrificed is information from all members of the sample other than those who scored at the median and the 25th and 75th percentile points on a variable of interest; information from all members of the sample is automatically incorporated in the mean and standard deviation for that variable.

Any investigation where you might report on or read about measures of central tendency on certain variables should also report measures of variability. This is particularly true for data from experiments, quasi-experiments, observational studies and questionnaires. It is important to consider measures of central tendency and measures of variability to be inextricably linked—one should never report one without the other if an adequate descriptive summary of a variable is to be communicated.

Other descriptive measures, such as those for skewness and kurtosis, may also be of interest if a more complete description of any variable is desired. Most good statistical packages can be instructed to report these additional descriptive measures as well.

Of all the statistics you are likely to encounter in the business, behavioural and social science research literature, means and standard deviations will dominate as measures for describing data. Additionally, these statistics will usually be reported when any parametric tests of statistical hypotheses are presented as the mean and standard deviation provide an appropriate basis for summarising and evaluating group differences.

Application Procedures
SPSS: … then press the ‘…’ button and choose Std. Deviation, Variance, Range, Minimum and/or Maximum as appropriate. SPSS does not have an option to produce either the IQR or S-IQR; however, if you request ‘Quantiles’, you will see the 25th and 75th percentile scores, which can then be used to quickly compute either variability measure. Remember to select appropriate central tendency measures as well.
NCSS: … then select the reports and plots that you want to see; make sure you indicate that you want to see the Variance Section of the Report. Remember to select appropriate central tendency measures as well (by opting to see the Means Section of the Report).
SYSTAT: … then select SD, Variance, Range, Interquartile range, Minimum and/or Maximum as appropriate. Remember to select appropriate central tendency measures as well.
STATGRAPHICS: … or … then choose the variable(s) you want to describe and select Summary Statistics (you don’t get any options for statistics to report – measures of central tendency and variability are automatically produced). STATGRAPHICS does not produce either the IQR or S-IQR; however, ‘Percentiles’ can be requested in order to see the 25th and 75th percentile scores, which can then be used to quickly compute either variability measure.
R Commander: … then select either the Standard Deviation or Interquartile Range as appropriate. R Commander will not produce the range statistic or report minimum or maximum scores. Remember to select appropriate central tendency measures as well.

Fundamental Concept I: Basic Concepts in Probability

The concept of simple probability.

In Procedures 5.1 and 5.2 , you encountered the idea of the frequency of occurrence of specific events such as particular scores within a sample distribution. Furthermore, it is a simple operation to convert the frequency of occurrence of a specific event into a number representing the relative frequency of that event. The relative frequency of an observed event is merely the number of times the event is observed divided by the total number of times one makes an observation. The resulting number ranges between 0 and 1 but we typically re-express this number as a percentage by multiplying it by 100%.

In the QCI database, Maree Lakota observed data from 112 quality control inspectors of which 58 were male and 51 were female (gender indications were missing for three inspectors). The statistics 58 and 51 are thus the frequencies of occurrence for two specific types of research participant, a male inspector or a female inspector.

If she divided each frequency by the total number of observations (i.e. 112), she would obtain .52 for males and .46 for females (leaving .02 of observations with unknown gender). These statistics are relative frequencies which indicate the proportion of times that Maree obtained data from a male or female inspector. Multiplying each relative frequency by 100% would yield 52% and 46%, which she could interpret as indicating that 52% of her sample was male and 46% was female (leaving 2% of the sample with unknown gender).
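To make the computation concrete, here is a minimal Python sketch (Python is our choice for illustration; it is not used in the source examples) that reproduces these relative frequencies from the raw QCI counts. Note that 3/112 is closer to .03; the text quotes .02 as the remainder left over after rounding the other two proportions.

```python
# Relative frequencies for the QCI gender counts quoted above.
counts = {"male": 58, "female": 51, "unknown": 3}
n = sum(counts.values())  # 112 observations in total

for category, count in counts.items():
    rel_freq = count / n
    print(f"{category:8s} relative frequency = {rel_freq:.2f} ({rel_freq:.0%})")
```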

It does not take much of a leap in logic to move from the concept of ‘relative frequency’ to the concept of ‘probability’. In our discussion above, we focused on relative frequency as indicating the proportion or percentage of times a specific category of participant was obtained in a sample. The emphasis here is on data from a sample.

Imagine now that Maree had infinite resources and research time and was able to obtain ever larger samples of quality control inspectors for her study. She could still compute the relative frequencies for obtaining data from males and females in her sample but as her sample size grew larger and larger, she would notice these relative frequencies converging toward some fixed values.

If, by some miracle, Maree could observe all of the quality control inspectors on the planet today, she would have measured the entire population and her computations of relative frequency for males and females would yield two precise numbers, each indicating the proportion of the population of inspectors that was male and the proportion that was female.

If Maree were then to list all of these inspectors and randomly choose one from the list, the chances that she would choose a male inspector would be equal to the proportion of the population of inspectors that was male and this logic extends to choosing a female inspector. The number used to quantify this notion of ‘chances’ is called a probability. Maree would therefore have established the probability of randomly observing a male or a female inspector in the population on any specific occasion.

Probability is expressed on a 0.0 (the observation or event will certainly not be seen) to 1.0 (the observation or event will certainly be seen) scale where values close to 0.0 indicate observations that are less certain to be seen and values close to 1.0 indicate observations that are more certain to be seen (a value of .5 indicates an even chance that an observation or event will or will not be seen – a state of maximum uncertainty). Statisticians often interpret a probability as the likelihood of observing an event or type of individual in the population.

In the QCI database, we noted that the relative frequency of observing males was .52 and for females was .46. If we take these relative frequencies as estimates of the proportions of each gender in the population of inspectors, then .52 and .46 represent the probability of observing a male or female inspector, respectively.

Statisticians would state this as “the probability of observing a male quality control inspector is .52” or, in a more commonly used shorthand code, the likelihood of observing a male quality control inspector is p = .52 (p for probability). For some, probabilities make more sense if they are converted to percentages (by multiplying by 100%). Thus, p = .52 can also be understood as a 52% chance of observing a male quality control inspector.

We have seen that relative frequency is a sample statistic that can be used to estimate the population probability. Our estimate will get more precise as we use larger and larger samples (technically, as the size of our samples more closely approximates the size of our population). In most behavioural research, we never have access to entire populations so we must always estimate our probabilities.

In some very special populations, having a known number of fixed possible outcomes, such as results of coin tosses or rolls of a die, we can analytically establish event probabilities without doing an infinite number of observations; all we must do is assume that we have a fair coin or die. Thus, with a fair coin, the probability of observing a H or a T on any single coin toss is ½ or .5 or 50%; the probability of observing a 6 on any single throw of a die is 1/6 or .16667 or 16.667%. With behavioural data, though, we can never measure all possible behavioural outcomes, which thereby forces researchers to depend on samples of observations in order to make estimates of population values.
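The convergence idea and the analytic special cases can both be illustrated with a short simulation. The following Python sketch (illustrative only; not part of the source) rolls a fair die ever more times and shows the relative frequency of a six settling toward the analytic value of 1/6:

```python
# Simulated die rolls: relative frequency converges toward the analytic
# probability of 1/6 as the number of observations grows.
import random

random.seed(1)  # fixed seed so the sketch is reproducible
for n in (100, 10_000, 1_000_000):
    sixes = sum(1 for _ in range(n) if random.randint(1, 6) == 6)
    print(f"n = {n:>9}: relative frequency of a six = {sixes / n:.5f}")
print(f"analytic probability = {1 / 6:.5f}")
```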

The concept of probability is central to much of what is done in the statistical analysis of behavioural data. Whenever a behavioural scientist wishes to establish whether a particular relationship exists between variables or whether two groups, treated differently, actually show different behaviours, he/she is playing a probability game. Given a sample of observations, the behavioural scientist must decide whether what he/she has observed is providing sufficient information to conclude something about the population from which the sample was drawn.

This decision always has a non-zero probability of being in error simply because in samples that are much smaller than the population, there is always the chance or probability that we are observing something rare and atypical instead of something which is indicative of a consistent population trend. Thus, the concept of probability forms the cornerstone for statistical inference, about which we will have more to say later (see the relevant Fundamental Concept in Chap. 7). Probability also plays an important role in helping us to understand theoretical statistical distributions (e.g. the normal distribution) and what they can tell us about our observations. We will explore this idea further in Fundamental Concept II.

The Concept of Conditional Probability

It is important to understand that the concept of probability as described above focuses upon the likelihood or chances of observing a specific event or type of observation for a specific variable relative to a population or sample of observations. However, many important behavioural research issues may focus on the question of the probability of observing a specific event given that the researcher has knowledge that some other event has occurred or been observed (this latter event is usually measured by a second variable). Here, the focus is on the potential relationship or link between two variables or two events.

With respect to the QCI database, Maree could ask the quite reasonable question: “what is the probability (estimated in the QCI sample by a relative frequency) of observing an inspector being female given that she knows that the inspector works for a Large Business Computer manufacturer?”

To address this question, all she needs to know is:

  • how many inspectors from Large Business Computer manufacturers are in the sample ( 22 ); and
  • how many of those inspectors were female ( 7 ) (inspectors who were missing a score for either company or gender have been ignored here).

If she divides 7 by 22, she would obtain the probability that an inspector is female given that they work for a Large Business Computer manufacturer – that is, p = .32 .

This type of question points to the important concept of conditional probability (‘conditional’ because we are asking “what is the probability of observing one event conditional upon our knowledge of some other event”).

Continuing with the previous example, Maree would say that the conditional probability of observing a female inspector working for a Large Business Computer manufacturer is .32 or, equivalently, a 32% chance. Compare this conditional probability of p  = .32 to the overall probability of observing a female inspector in the entire sample ( p  = .46 as shown above).

This means that there is evidence for a connection or relationship between gender and the type of company an inspector works for. That is, the chances are lower for observing a female inspector from a Large Business Computer manufacturer than they are for simply observing a female inspector at all.

Maree therefore has evidence suggesting that females may be relatively under-represented in Large Business Computer manufacturing companies compared to the overall population. Knowing something about the company an inspector works for therefore can help us make a better prediction about their likely gender.

Suppose, however, that Maree’s conditional probability had been exactly equal to p = .46. This would mean that there was exactly the same chance of observing a female inspector working for a Large Business Computer manufacturer as there was of observing a female inspector in the general population. Here, knowing something about the company an inspector works for doesn’t help Maree make any better prediction about their likely gender. This would mean that the two variables are statistically independent of each other.

A classic case of events that are statistically independent is two successive throws of a fair die: rolling a six on the first throw gives us no information for predicting how likely it will be that we would roll a six on the second throw. The conditional probability of observing a six on the second throw given that I have observed a six on the first throw is .16667 (= 1 divided by 6), which is the same as the simple probability of observing a six on any specific throw. This statistical independence also means that if we wanted to know the probability of throwing two sixes on two successive throws of a fair die, we would just multiply the probabilities for each independent event (i.e. throw) together; that is, .16667 × .16667 = .02778 (this is known as the multiplication rule of probability; see, for example, Smithson 2000, p. 114).

Finally, you should know that conditional probabilities are often asymmetric. This means that for many types of behavioural variables, reversing the conditional arrangement will change the story about the relationship. Bayesian statistics (see the relevant Fundamental Concept in Chap. 7) relies heavily upon this asymmetric relationship between conditional probabilities.

Maree has already learned that the conditional probability that an inspector is female given that they worked for a Large Business Computer manufacturer is p = .32. She could easily turn the conditional relationship around and ask what is the conditional probability that an inspector works for a Large Business Computer manufacturer given that the inspector is female?

From the QCI database, she can find that 51 inspectors in her total sample were female and that, of those 51, 7 worked for a Large Business Computer manufacturer. If she divided 7 by 51, she would get p = .14 (notice that all that changed was the number she divided by). Thus, there is only a 14% chance of observing an inspector working for a Large Business Computer manufacturer given that the inspector is female – a rather different probability from p = .32, and one that tells quite a different story.
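Both conditional probabilities, and their asymmetry, can be reproduced with a few lines of Python (a sketch using only the counts quoted in the text; the variable names are ours):

```python
# Conditional probabilities from the QCI counts quoted above:
# 22 inspectors at Large Business Computer (LBC) manufacturers, 7 of
# whom are female; 51 female inspectors in the whole sample.
n_lbc = 22          # inspectors working for LBC manufacturers
n_female_lbc = 7    # of those, how many are female
n_female = 51       # female inspectors overall

p_female_given_lbc = n_female_lbc / n_lbc     # p(female | LBC)
p_lbc_given_female = n_female_lbc / n_female  # p(LBC | female)

print(f"p(female | LBC) = {p_female_given_lbc:.2f}")  # .32
print(f"p(LBC | female) = {p_lbc_given_female:.2f}")  # .14
```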

As you will see in the relevant Procedures in Chaps. 6 and 7, conditional relationships between categorical variables are precisely what crosstabulation contingency tables are designed to reveal.

Procedure 5.6: Exploratory Data Analysis

There are a variety of visual display methods for EDA, including stem & leaf displays, boxplots and violin plots. Each method reflects a specific way of displaying features of a distribution of scores or measurements and, of course, each has its own advantages and disadvantages. In addition, EDA displays are surprisingly flexible and can combine features in various ways to enhance the story conveyed by the plot.

Stem & Leaf Displays

The stem & leaf display is a simple data summary technique which not only rank orders the data points in a sample but presents them visually so that the shape of the data distribution is reflected. Stem & leaf displays are formed from data scores by splitting each score into two parts: the first part of each score serving as the ‘stem’, the second part as the ‘leaf’ (e.g. for 2-digit data values, the ‘stem’ is the number in the tens position; the ‘leaf’ is the number in the ones position). Each stem is then listed vertically, in ascending order, followed horizontally by all the leaves in ascending order associated with it. The resulting display thus shows all of the scores in the sample, but reorganised so that a rough idea of the shape of the distribution emerges. As well, extreme scores can be easily identified in a stem & leaf display.
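As an illustration of the splitting rule, the short Python sketch below builds a rudimentary text-based stem & leaf display for a handful of made-up 2-digit scores (real packages add refinements such as half-stems and outlier flagging):

```python
# A bare-bones stem & leaf display for 2-digit scores: the tens digit is
# the stem, the ones digit is the leaf. Scores are illustrative only.
from collections import defaultdict

scores = [57, 62, 68, 71, 74, 75, 77, 81, 82, 82, 85, 88, 90, 93]

stems = defaultdict(list)
for score in sorted(scores):
    stems[score // 10].append(score % 10)  # split each score into stem and leaf

for stem in sorted(stems):
    leaves = "".join(str(leaf) for leaf in stems[stem])
    print(f"{stem} | {leaves}")
```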

Consider the accuracy and speed scores for the 112 quality control inspectors in the QCI sample. Figure 5.22 (produced by the R Commander Stem-and-leaf display … procedure) shows the stem & leaf displays for inspection accuracy (left display) and speed (right display) data.

Fig. 5.22 Stem & leaf displays produced by R Commander

[The first six lines reflect information from R Commander about each display: lines 1 and 2 show the actual R command used to produce the plot (the variable name has been highlighted in bold); line 3 gives a warning indicating that inspectors with missing values (= NA in R) on the variable have been omitted from the display; line 4 shows how the stems and leaves have been defined; line 5 indicates what a leaf unit represents in value; and line 6 indicates the total number (n) of inspectors included in the display.]

In Fig. 5.22, for the accuracy display on the left-hand side, the ‘stems’ have been split into ‘half-stems’—one (which is starred) associated with the ‘leaves’ 0 through 4 and the other associated with the ‘leaves’ 5 through 9—a strategy that gives the display better balance and visual appeal.

Notice how the left stem & leaf display conveys a fairly clear (yet sideways) picture of the shape of the distribution of accuracy scores. It has a rather symmetrical bell-shape to it with only a slight suggestion of negative skewness (toward the extreme score at the top). The right stem & leaf display clearly depicts the highly positively skewed nature of the distribution of speed scores. Importantly, we could reconstruct the entire sample of scores for each variable using its display, which means that unlike most other graphical procedures, we didn’t have to sacrifice any information to produce the visual summary.

Some programs, such as SYSTAT, embellish their stem & leaf displays by indicating in which stem or half-stem the ‘median’ (50th percentile), the ‘upper hinge score’ (75th percentile), and ‘lower hinge score’ (25th percentile) occur in the distribution (recall the discussion of interquartile range in Procedure 5.5 ). This is shown in Fig. 5.23 , produced by SYSTAT, where M and H indicate the stem locations for the median and hinge points, respectively. This stem & leaf display labels a single extreme accuracy score as an ‘outside value’ and clearly shows that this actual score was 57.

Fig. 5.23 Stem & leaf display, produced by SYSTAT, of the accuracy QCI variable

Boxplots

Another important EDA technique is the boxplot or, as it is sometimes known, the box-and-whisker plot. This plot provides a symbolic representation that preserves less of the original nature of the data (compared to a stem & leaf display) but typically gives a better picture of the distributional characteristics. The basic boxplot, shown in Fig. 5.24, utilises information about the median (50th percentile score) and the upper (75th percentile score) and lower (25th percentile score) hinge points in the construction of the ‘box’ portion of the graph (the ‘median’ defines the centre line in the box; the ‘upper’ and ‘lower hinge values’ define the end boundaries of the box—thus the box encompasses the middle 50% of data values).

Fig. 5.24 Boxplots for the accuracy and speed QCI variables

Additionally, the boxplot utilises the IQR (recall Procedure 5.5 ) as a way of defining what are called ‘fences’ which are used to indicate score boundaries beyond which we would consider a score in a distribution to be an ‘outlier’ (or an extreme or unusual value). In SPSS, the inner fence is typically defined as 1.5 times the IQR in each direction and a ‘far’ outlier or extreme case is typically defined as 3 times the IQR in either direction (Field 2018 , p. 193). The ‘whiskers’ in a boxplot extend out to the data values which are closest to the upper and lower inner fences (in most cases, the vast majority of data values will be contained within the fences). Outliers beyond these ‘whiskers’ are then individually listed. ‘Near’ outliers are those lying just beyond the inner fences and ‘far’ outliers lie well beyond the inner fences.
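The fence logic can be expressed compactly in code. The Python sketch below uses the 1.5 × IQR and 3 × IQR conventions described above; note that packages differ in how they compute quartiles, so boundary cases may be classified slightly differently than in SPSS:

```python
# Tukey-style fences: inner fences at 1.5 x IQR and outer fences at
# 3 x IQR beyond the hinges; values between the fences are 'near'
# outliers and values beyond the outer fences are 'far' outliers.
import statistics

def classify_outliers(scores):
    q1, _, q3 = statistics.quantiles(scores, n=4)  # 25th, 50th, 75th %iles
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)
    near = [x for x in scores
            if (outer[0] <= x < inner[0]) or (inner[1] < x <= outer[1])]
    far = [x for x in scores if x < outer[0] or x > outer[1]]
    return inner, near, far

# Illustrative, positively skewed scores (not the QCI data).
inner, near, far = classify_outliers(
    [1.5, 2.1, 2.8, 3.4, 4.0, 4.5, 5.2, 6.0, 11.9, 20.0])
print("inner fences:", inner, "near outliers:", near, "far outliers:", far)
```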

Figure 5.24 shows two simple boxplots (produced using SPSS), one for the accuracy QCI variable and one for the speed QCI variable. The accuracy plot shows a median value of about 83, roughly 50% of the data fall between about 77 and 89 and there is one outlier, inspector 83, in the lower ‘tail’ of the distribution. The accuracy boxplot illustrates data that are relatively symmetrically distributed without substantial skewness. Such data will tend to have their median in the middle of the box, whiskers of roughly equal length extending out from the box and few or no outliers.

The speed plot shows a median value of about 4 s, roughly 50% of the data fall between 2 s and 6 s and there are four outliers, inspectors 7, 62, 65 and 75 (although inspectors 65 and 75 fall at the same place and are rather difficult to read), all falling in the slow speed ‘tail’ of the distribution. Inspectors 65, 75 and 7 are shown as ‘near’ outliers (open circles) whereas inspector 62 is shown as a ‘far’ outlier (asterisk). The speed boxplot illustrates data which are asymmetrically distributed because of skewness in one direction. Such data may have their median offset from the middle of the box and/or whiskers of unequal length extending out from the box and outliers in the direction of the longer whisker. In the speed boxplot, the data are clearly positively skewed (the longer whisker and extreme values are in the slow speed ‘tail’).

Boxplots are very versatile representations in that side-by-side displays for sub-groups of data within a sample can permit easy visual comparisons of groups with respect to central tendency and variability. Boxplots can also be modified to incorporate information about error bands associated with the median producing what is called a ‘notched boxplot’. This helps in the visual detection of meaningful subgroup differences, where boxplot ‘notches’ don’t overlap.

Figure 5.25 (produced using NCSS) compares the distributions of accuracy and speed scores for QCI inspectors from the five types of companies, plotted side-by-side.

Fig. 5.25 Comparisons of the accuracy (regular boxplots) and speed (notched boxplots) QCI variables for different types of companies

Focus first on the left graph in Fig. 5.25 which plots the distribution of accuracy scores broken down by company using regular boxplots. This plot clearly shows the differing degree of skewness in each type of company (indicated by one or more outliers in one ‘tail’, whiskers which are not the same length and/or the median line being offset from the centre of a box), the differing variability of scores within each type of company (indicated by the overall length of each plot—box and whiskers), and the differing central tendency in each type of company (the median lines do not all fall at the same level of accuracy score). From the left graph in Fig. 5.25 , we could conclude that: inspection accuracy scores are most variable in PC and Large Electrical Appliance manufacturing companies and least variable in the Large Business Computer manufacturing companies; Large Business Computer and PC manufacturing companies have the highest median level of inspection accuracy; and inspection accuracy scores tend to be negatively skewed (many inspectors toward higher levels, relatively fewer who are poorer in inspection performance) in the Automotive manufacturing companies. One inspector, working for an Automotive manufacturing company, shows extremely poor inspection accuracy performance.

The right display compares types of companies in terms of their inspection speed scores, using ‘notched’ boxplots. The notches define upper and lower error limits around each median. Aside from the very obvious positive skewness for speed scores (with a number of slow speed outliers) in every type of company (least so for Large Electrical Appliance manufacturing companies), the story conveyed by this comparison is that inspectors from Large Electrical Appliance and Automotive manufacturing companies have substantially faster median decision speeds compared to inspectors from Large Business Computer and PC manufacturing companies (i.e. their ‘notches’ do not overlap, in terms of speed scores, on the display).

Boxplots can also add interpretive value to other graphical display methods through the creation of hybrid displays. Such displays might combine a standard histogram with a boxplot along the X-axis to provide an enhanced picture of the data distribution as illustrated for the mentabil variable in Fig. 5.26 (produced using NCSS). This hybrid plot also employs a data ‘smoothing’ method called a density trace to outline an approximate overall shape for the data distribution. Any one graphical method would tell some of the story, but combined in the hybrid display, the story of a relatively symmetrical set of mentabil scores becomes quite visually compelling.

Fig. 5.26 A hybrid histogram-density-boxplot of the mentabil QCI variable

Violin Plots

Violin plots are a more recent and interesting EDA innovation, implemented in the NCSS software package (Hintze 2012 ). The violin plot gets its name from the rough shape that the plots tend to take on. Violin plots are another type of hybrid plot, this time combining density traces (mirror-imaged right and left so that the plots have a sense of symmetry and visual balance) with boxplot-type information (median, IQR and upper and lower inner ‘fences’, but not outliers). The goal of the violin plot is to provide a quick visual impression of the shape, central tendency and variability of a distribution (the length of the violin conveys a sense of the overall variability whereas the width of the violin conveys a sense of the frequency of scores occurring in a specific region).
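For readers working outside NCSS, a roughly comparable display can be produced with matplotlib's violinplot function (an assumption on our part; the simulated data below merely stand in for skewed, speed-like scores):

```python
# A minimal violin plot with median and mean markers, similar in spirit
# to Fig. 5.27. Group data are simulated, not the QCI scores.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
groups = [rng.lognormal(mean=1.2, sigma=s, size=100) for s in (0.3, 0.5, 0.7)]

fig, ax = plt.subplots()
ax.violinplot(groups, showmedians=True, showmeans=True, showextrema=False)
ax.set_xticks([1, 2, 3])
ax.set_xticklabels(["Company A", "Company B", "Company C"])
ax.set_ylabel("speed (s)")
plt.show()
```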

Figure 5.27 (produced using NCSS) compares the distributions of speed scores for QCI inspectors across the five types of companies, plotted side-by-side. The violin plot conveys a similar story to the boxplot comparison for speed in the right graph of Fig. 5.25. However, notice that with the violin plot, unlike with a boxplot, you also get a sense of distributions that have ‘clumps’ of scores in specific areas. Some violin plots, like that for Automobile manufacturing companies in Fig. 5.27, have a shape suggesting a multi-modal distribution (recall Procedure 5.4 and the discussion of the fact that a distribution may have multiple modes). The violin plot in Fig. 5.27 has also been produced to show where the median (solid line) and mean (dashed line) would fall within each violin. This facilitates two interpretations: (1) a relative comparison of central tendency across the five companies and (2) relative degree of skewness in the distribution for each company (indicated by the separation of the two lines within a violin; skewness is particularly bad for the Large Business Computer manufacturing companies).

Fig. 5.27 Violin plot comparisons of the speed QCI variable for different types of companies

EDA methods (of which we have illustrated only a small subset; we have not reviewed dot density diagrams, for example) provide summary techniques for visually displaying certain characteristics of a set of data. The advantage of the EDA methods over more traditional graphing techniques such as those described in Procedure 5.2 is that as much of the original integrity of the data is maintained as possible while maximising the amount of summary information available about distributional characteristics.

Stem & leaf displays maintain the data in as close to their original form as possible whereas boxplots and violin plots provide more symbolic and flexible representations. EDA methods are best thought of as communication devices designed to facilitate quick visual impressions and they can add interest to any statistical story being conveyed about a sample of data. NCSS, SYSTAT, STATGRAPHICS and R Commander generally offer more options and flexibility in the generation of EDA displays than SPSS.

EDA methods tend to get cumbersome if a great many variables or groups need to be summarised. In such cases, using numerical summary statistics (such as means and standard deviations) will provide a more economical and efficient summary. Boxplots or violin plots are generally more space efficient summary techniques than stem & leaf displays.

Often, EDA techniques are used as data screening devices, which are typically not reported in actual write-ups of research (we will discuss data screening in more detail in Chap. 8). This is a perfectly legitimate use for the methods although there is an argument for researchers to put these techniques to greater use in published literature.

Software packages may use different rules for constructing EDA plots which means that you might get rather different looking plots and different information from different programs (you saw some evidence of this in Figs. 5.22 and 5.23 ). It is important to understand what the programs are using as decision rules for locating fences and outliers so that you are clear on how best to interpret the resulting plot—such information is generally contained in the user’s guides or manuals for NCSS (Hintze 2012 ), SYSTAT (SYSTAT Inc. 2009a , b ), STATGRAPHICS (StatPoint Technologies Inc. 2010 ) and SPSS (Norušis 2012 ).

Virtually any research design which produces numerical measures (even to the extent of just counting the number of occurrences of several events) provides opportunities for employing EDA displays which may help to clarify data characteristics or relationships. One extremely important use of EDA methods is as data screening devices for detecting outliers and other data anomalies, such as non-normality and skewness, before proceeding to parametric statistical analyses. In some cases, EDA methods can help the researcher to decide whether parametric or nonparametric statistical tests would be best to apply to his or her data because critical data characteristics such as distributional shape and spread are directly reflected.

Application Procedures

SPSS: the Explore... procedure produces stem-and-leaf displays and boxplots by default; variables may be explored on a whole-of-sample basis or broken down by the categories of a specific variable (called a ‘factor’ in the procedure). Cases can also be labelled with a variable (like in the QCI database), so that outlier points in the boxplot are identifiable. The chart-building facilities can also be used to custom build different types of boxplots.

NCSS: the descriptive report procedure produces a stem-and-leaf display by default. A separate box plot procedure can be used to produce box plots with different features (such as ‘notches’ and connecting lines), and it can be configured to produce violin plots (by selecting the plot shape as ‘density with reflection’).

SYSTAT: a stem-and-leaf procedure can be used to produce stem-and-leaf displays for variables; however, you cannot really control any features of these displays. A boxplot procedure can be used to produce boxplots of many types, with a number of features being controllable.

STATGRAPHICS: a one-variable analysis allows you to do a complete exploration of a single variable, including a stem-and-leaf display (you need to select this option) and a boxplot (produced by default). Some features of the boxplot can be controlled, but not features of the stem-and-leaf diagram. Other summary procedures can produce not only descriptive statistics but also boxplots with some controllable features.

R Commander: the dialog box for each procedure offers some features of the display or plot that can be controlled; whole-of-sample boxplots or boxplots by groups are possible.

Procedure 5.7: Standard ( z ) Scores

In certain practical situations in behavioural research, it may be desirable to know where a specific individual’s score lies relative to all other scores in a distribution. A convenient measure is to observe how many standard deviations (see Procedure 5.5 ) above or below the sample mean a specific score lies. This measure is called a standard score or z -score . Very simply, any raw score can be converted to a z -score by subtracting the sample mean from the raw score and dividing that result by the sample’s standard deviation. z -scores can be positive or negative and their sign simply indicates whether the score lies above (+) or below (−) the mean in value. A z -score has a very simple interpretation: it measures the number of standard deviations above or below the sample mean a specific raw score lies.

In the QCI database, we have a sample mean for speed scores of 4.48 s and a standard deviation for speed scores of 2.89 s (recall Table 5.4 in Procedure 5.5). If we are interested in the z-score for Inspector 65’s raw speed score of 11.94 s, we would obtain a z-score of +2.58 using the method described above (subtract 4.48 from 11.94 and divide the result by 2.89). The interpretation of this number is that a raw decision speed score of 11.94 s lies about 2.6 standard deviations above the mean decision speed for the sample.
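The calculation is trivially expressed in code. This Python sketch simply reproduces the worked example above:

```python
# z-score: number of standard deviations a raw score lies from the mean.
def z_score(raw, mean, sd):
    return (raw - mean) / sd

# Inspector 65's speed (11.94 s), with the sample mean (4.48 s) and
# standard deviation (2.89 s) quoted in the text.
z = z_score(11.94, mean=4.48, sd=2.89)
print(f"z = {z:+.2f}")  # +2.58: about 2.6 SDs above the mean speed
```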

z -scores have some interesting properties. First, if one converts (statisticians would say ‘transforms’) every available raw score in a sample to z -scores, the mean of these z -scores will always be zero and the standard deviation of these z -scores will always be 1.0. These two facts about z -scores (mean = 0; standard deviation = 1) will be true no matter what sample you are dealing with and no matter what the original units of measurement are (e.g. seconds, percentages, number of widgets assembled, amount of preference for a product, attitude rating, amount of money spent). This is because transforming raw scores to z -scores automatically changes the measurement units from whatever they originally were to a new system of measurements expressed in standard deviation units.

Suppose Maree was interested in the performance statistics for the top 25% most accurate quality control inspectors in the sample. Given a sample size of 112, this would mean finding the top 28 inspectors in terms of their accuracy scores. Since Maree is interested in performance statistics, speed scores would also be of interest. Table 5.5 (generated using the SPSS Descriptives … procedure, listed using the Case Summaries … procedure and formatted for presentation using Excel) shows accuracy and speed scores for the top 28 inspectors in descending order of accuracy scores. The z -score transformation for each of these scores is also shown (last two columns) as are the type of company, education level and gender for each inspector.

Table 5.5 Listing of the 28 (top 25%) most accurate QCI inspectors’ accuracy and speed scores as well as standard (z) score transformations for each score

Case  Inspector  Company                                   Educ level          Gender  Accuracy  Speed  Zaccuracy  Zspeed
1     8          PC Manufacturer                           High School Only    Male    100       1.52   1.95       −1.03
2     9          PC Manufacturer                           High School Only    Female  100       3.32   1.95       −0.40
3     14         PC Manufacturer                           High School Only    Male    100       3.83   1.95       −0.23
4     17         PC Manufacturer                           High School Only    Female  99        7.07   1.84       0.90
5     101        PC Manufacturer                           High School Only    –       98        3.11   1.73       −0.47
6     19         PC Manufacturer                           Tertiary Qualified  Female  94        3.84   1.29       −0.22
7     34         Large Electrical Appliance Manufacturer   Tertiary Qualified  Male    94        1.90   1.29       −0.89
8     65         Large Business Computer Manufacturer      High School Only    Male    94        11.94  1.29       2.58
9     67         Large Business Computer Manufacturer      High School Only    Male    94        2.34   1.29       −0.74
10    80         Large Business Computer Manufacturer      High School Only    Female  94        4.68   1.29       0.07
11    5          PC Manufacturer                           Tertiary Qualified  Male    93        4.18   1.18       −0.10
12    18         PC Manufacturer                           Tertiary Qualified  Male    93        7.32   1.18       0.98
13    46         Small Electrical Appliance Manufacturer   Tertiary Qualified  Female  93        2.01   1.18       −0.86
14    64         Large Business Computer Manufacturer      High School Only    Female  92        5.18   1.08       0.24
15    77         Large Business Computer Manufacturer      Tertiary Qualified  Female  92        6.11   1.08       0.56
16    79         Large Business Computer Manufacturer      High School Only    Male    92        4.38   1.08       −0.03
17    106        Large Electrical Appliance Manufacturer   Tertiary Qualified  Male    92        1.70   1.08       −0.96
18    58         Small Electrical Appliance Manufacturer   High School Only    Male    91        4.12   0.97       −0.12
19    63         Large Business Computer Manufacturer      High School Only    Male    91        4.73   0.97       0.09
20    72         Large Business Computer Manufacturer      Tertiary Qualified  Male    91        4.72   0.97       0.08
21    20         PC Manufacturer                           High School Only    Male    90        4.53   0.86       0.02
22    69         Large Business Computer Manufacturer      High School Only    Male    90        4.94   0.86       0.16
23    71         Large Business Computer Manufacturer      High School Only    Female  90        10.46  0.86       2.07
24    85         Automobile Manufacturer                   Tertiary Qualified  Female  90        3.14   0.86       −0.46
25    111        Large Business Computer Manufacturer      High School Only    Male    90        4.11   0.86       −0.13
26    6          PC Manufacturer                           High School Only    Male    89        5.46   0.75       0.34
27    61         Large Business Computer Manufacturer      Tertiary Qualified  Male    89        5.71   0.75       0.43
28    75         Large Business Computer Manufacturer      High School Only    Male    89        12.05  0.75       2.62

There are three inspectors (8, 9 and 14) who scored maximum accuracy of 100%. Such accuracy converts to a z -score of +1.95. Thus 100% accuracy is 1.95 standard deviations above the sample’s mean accuracy level. Interestingly, all three inspectors worked for PC manufacturers and all three had only high school-level education. The least accurate inspector in the top 25% had a z -score for accuracy that was .75 standard deviations above the sample mean.

Interestingly, the top three inspectors in terms of accuracy had decision speeds that fell below the sample’s mean speed; inspector 8 was the fastest inspector of the three with a speed just over 1 standard deviation ( z  = −1.03) below the sample mean. The slowest inspector in the top 25% was inspector 75 (case #28 in the list) with a speed z -score of +2.62; i.e., he was over two and a half standard deviations slower in making inspection decisions relative to the sample’s mean speed.

The fact that z-scores always have a common measurement scale having a mean of 0 and a standard deviation of 1.0 leads to an interesting application of standard scores. Suppose we focus on inspector number 65 (case #8 in the list) in Table 5.5. It might be of interest to compare this inspector’s quality control performance in terms of both his decision accuracy and decision speed. Such a comparison is impossible using raw scores since the inspector’s accuracy score and speed scores are different measures which have differing means and standard deviations expressed in fundamentally different units of measurement (percentages and seconds). However, if we are willing to assume that the score distributions for both variables are approximately the same shape and that both accuracy and speed are measured with about the same level of reliability or consistency (see Chap. 8), we can compare the inspector’s two scores by first converting them to z-scores within their own respective distributions, as shown in Table 5.5.

Inspector 65 looks rather anomalous in that he demonstrated a relatively high level of accuracy (raw score = 94%; z  = +1.29) but took a very long time to make those accurate decisions (raw score = 11.94 s; z  = +2.58). Contrast this with inspector 106 (case #17 in the list) who demonstrated a similar level of accuracy (raw score = 92%; z  = +1.08) but took a much shorter time to make those accurate decisions (raw score = 1.70 s; z  = −.96). In terms of evaluating performance, from a company perspective, we might conclude that inspector 106 is performing at an overall higher level than inspector 65 because he can achieve a very high level of accuracy but much more quickly; accurate and fast is more cost effective and efficient than accurate and slow.

Note: We should be cautious here since we know, from our previous explorations of the speed variable in Procedure 5.6, that accuracy scores look fairly symmetrical while speed scores are positively skewed; assuming that the two variables have the same distribution shape, so that z-score comparisons are permitted, would therefore be problematic.

You might have noticed that as you scanned down the two columns of z-scores in Table 5.5, there was a suggestion of a pattern between the signs attached to the respective z-scores for each person. There seems to be a very slight preponderance of pairs of z-scores where the signs are reversed (12 out of 22 pairs). This observation provides some very preliminary evidence to suggest that there may be a relationship between inspection accuracy and decision speed, namely that a more accurate decision tends to be associated with a faster decision speed. Of course, this pattern would be better verified using the entire sample rather than the top 25% of inspectors. However, you may find it interesting to learn that it is precisely this sort of suggestive evidence (about agreement or disagreement between z-score signs for pairs of variable scores throughout a sample) that is captured and summarised by a single statistical indicator called a ‘correlation coefficient’ (see the Fundamental Concept and Procedure on correlation in Chap. 6).

z-scores are not the only type of standard score that is commonly used. Three other types of standard scores are: stanines (standard nines), IQ scores and T-scores (not to be confused with the t-test described in Chap. 7). These other types of scores have the advantage of producing only positive integer scores rather than positive and negative decimal scores. This makes interpretation somewhat easier for certain applications. However, you should know that almost all other types of standard scores come from a specific transformation of z-scores. This is because once you have converted raw scores into z-scores, they can then be quite readily transformed into any other system of measurement by simply multiplying a person’s z-score by the new desired standard deviation for the measure and adding to that product the new desired mean for the measure.

T-scores are simply z-scores transformed to have a mean of 50.0 and a standard deviation of 10.0; IQ scores are simply z-scores transformed to have a mean of 100 and a standard deviation of 15 (or 16 in some systems). For more information, see Fundamental Concept II .
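This recipe is worth seeing in code form. The Python sketch below applies the generic linear transformation to one z-score to yield its T-score and IQ-metric equivalents (the example z-value is taken from Table 5.5):

```python
# Generic standard-score recipe: new score = z * (desired SD) + (desired mean).
def rescale(z, new_mean, new_sd):
    return z * new_sd + new_mean

z = 1.29  # e.g. Inspector 65's accuracy z-score from Table 5.5
print("T-score:  ", rescale(z, new_mean=50, new_sd=10))   # 62.9
print("IQ-metric:", rescale(z, new_mean=100, new_sd=15))  # 119.35
```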

Standard scores are useful for representing the position of each raw score within a sample distribution relative to the mean of that distribution. The unit of measurement becomes the number of standard deviations a specific score is away from the sample mean. As such, z-scores can permit cautious comparisons across samples or across different variables having vastly differing means and standard deviations, within the constraints of the comparison samples having similarly shaped distributions and roughly equivalent levels of measurement reliability. z-scores also form the basis for establishing the degree of correlation between two variables. Transforming raw scores into z-scores does not change the shape of a distribution or the rank ordering of individuals within that distribution. For this reason, a z-score is referred to as a linear transformation of a raw score. Interestingly, z-scores provide an important foundational element for more complex analytical procedures such as factor analysis, cluster analysis and multiple regression analysis (see the relevant Procedures in Chaps. 6 and 7).

While standard scores are useful indices, they are subject to restrictions if used to compare scores across samples or across different variables. The samples must have similar distribution shapes for the comparisons to be meaningful and the measures must have similar levels of reliability in each sample. The groups used to generate the z -scores should also be similar in composition (with respect to age, gender distribution, and so on). Because z -scores are not an intuitively meaningful way of presenting scores to lay-persons, many other types of standard score schemes have been devised to improve interpretability. However, most of these schemes produce scores that run a greater risk of facilitating lay-person misinterpretations simply because their connection with z -scores is hidden or because the resulting numbers ‘look’ like a more familiar type of score which people do intuitively understand.

It is extremely rare for a T-score to exceed 100 or go below 0 because this would mean that the raw score was in excess of 5 standard deviations away from the sample mean. This unfortunately means that T-scores are often misinterpreted as percentages because they typically range between 0 and 100 and therefore ‘look’ like percentages. However, T-scores are definitely not percentages.

Finally, a common misunderstanding of z -scores is that transforming raw scores into z -scores makes them follow a normal distribution (see Fundamental Concept II ). This is not the case. The distribution of z -scores will have exactly the same shape as that for the raw scores; if the raw scores are positively skewed, then the corresponding z -scores will also be positively skewed.

z-scores are particularly useful in evaluative studies where relative performance indices are of interest. Whenever you compute a correlation coefficient (see Chap. 6), you are implicitly transforming the two variables involved into z-scores (which equates the variables in terms of mean and standard deviation), so that only the patterning in the relationship between the variables is represented. z-scores are also useful as a preliminary step to more advanced parametric statistical methods when variables differing in scale, range and/or measurement units must be equated for means and standard deviations prior to analysis.

Application Procedures

SPSS: tick the box labelled ‘Save standardized values as variables’. z-scores are saved as new variables (labelled as Z followed by the original variable name, as shown in Table 5.5) which can then be listed or analysed further.

NCSS: select a new variable to hold the z-scores, then select the ‘STANDARDIZE’ transformation from the list of available functions. z-scores are saved as new variables which can then be listed or analysed further.

SYSTAT: z-scores are saved as new variables which can then be listed or analysed further.

STATGRAPHICS: select an empty column in the database, then choose the ‘STANDARDIZE’ transformation, choose the variable you want to transform and give the new variable a name.

R Commander: select the variables you want to standardize; R Commander automatically saves the transformed variable to the database, appending Z. to the front of each variable’s name.

Fundamental Concept II: The Normal Distribution

Arguably the most fundamental distribution used in the statistical analysis of quantitative data in the behavioural and social sciences is the normal distribution (also known as the Gaussian or bell-shaped distribution ). Many behavioural phenomena, if measured on a large enough sample of people, tend to produce ‘normally distributed’ variable scores. This includes most measures of ability, performance and productivity, personality characteristics and attitudes. The normal distribution is important because it is the one form of distribution that you must assume describes the scores of a variable in the population when parametric tests of statistical inference are undertaken. The standard normal distribution is defined as having a population mean of 0.0 and a population standard deviation of 1.0. The normal distribution is also important as a means of interpreting various types of scoring systems.

Figure 5.28 displays the standard normal distribution (mean = 0; standard deviation = 1.0) and shows that there is a clear link between z -scores and the normal distribution. Statisticians have analytically calculated the probability (also expressed as percentages or percentiles) that observations will fall above or below any specific z -score in the theoretical standard normal distribution. Thus, a z -score of +1.0 in the standard normal distribution will have 84.13% (equals a probability of .8413) of observations in the population falling at or below one standard deviation above the mean and 15.87% falling above that point. A z -score of −2.0 will have 2.28% of observations falling at that point or below and 97.72% of observations falling above that point. It is clear then that, in a standard normal distribution, z -scores have a direct relationship with percentiles .

Fig. 5.28 The normal (bell-shaped or Gaussian) distribution

Figure 5.28 also shows how T-scores relate to the standard normal distribution and to z -scores. The mean T-score falls at 50 and each increment or decrement of 10 T-score units means a movement of another standard deviation away from this mean of 50. Thus, a T-score of 80 corresponds to a z -score of +3.0—a score 3 standard deviations higher than the mean of 50.

Of special interest to behavioural researchers are the values for z -scores in a standard normal distribution that encompass 90% of observations ( z  = ±1.645—isolating 5% of the distribution in each tail), 95% of observations ( z  = ±1.96—isolating 2.5% of the distribution in each tail), and 99% of observations ( z  = ±2.58—isolating 0.5% of the distribution in each tail).
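These tail areas need not be looked up in tables; they follow from the standard normal CDF, which can be computed from the error function in Python's standard library. A minimal sketch verifying the three benchmark bands:

```python
# Standard normal CDF via the error function:
# Phi(z) = (1 + erf(z / sqrt(2))) / 2.
from math import erf, sqrt

def norm_cdf(z):
    """Proportion of a standard normal distribution at or below z."""
    return (1 + erf(z / sqrt(2))) / 2

for z in (1.645, 1.96, 2.58):
    coverage = norm_cdf(z) - norm_cdf(-z)  # central area between -z and +z
    print(f"z = ±{z}: {coverage:.1%} of observations")  # ~90%, ~95%, ~99%
```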

Depending upon the degree of certainty required by the researcher, these bands describe regions outside of which one might define an observation as being atypical or as perhaps not belonging to a distribution being centred at a mean of 0.0. Most often, what is taken as atypical or rare in the standard normal distribution is a score at least two standard deviations away from the mean, in either direction. Why choose two standard deviations? Since in the standard normal distribution, only about 5% of observations will fall outside a band defined by z -scores of ±1.96 (rounded to 2 for simplicity), this equates to data values that are 2 standard deviations away from their mean. This can give us a defensible way to identify outliers or extreme values in a distribution.

Thinking ahead to what you will encounter in Chap. 7, this ‘banding’ logic can be extended into the world of statistics (like means and percentages) as opposed to just the world of observations. You will frequently hear researchers speak of some statistic estimating a specific value (a parameter) in a population, plus or minus some other value.

A survey organisation might report political polling results in terms of a percentage and an error band, e.g. 59% of Australians indicated that they would vote Labor at the next federal election, plus or minus 2%.

Most commonly, this error band (±2%) is defined by possible values for the population parameter that are about two standard deviations (or two standard errors—a concept discussed further in Chap. 7) away from the reported or estimated statistical value. In effect, the researcher is saying that on 95% of the occasions he/she would theoretically conduct his/her study, the population value estimated by the statistic being reported would fall between the limits imposed by the endpoints of the error band (the official name for this error band is a confidence interval; see Chap. 8). The well-understood mathematical properties of the standard normal distribution are what make such precise statements about levels of error in statistical estimates possible.

Checking for Normality

It is important to understand that transforming the raw scores for a variable to z-scores (recall Procedure 5.7) does not produce z-scores which follow a normal distribution; rather they will have the same distributional shape as the original scores. However, if you are willing to assume that the normal distribution is the correct reference distribution in the population, then you are justified in interpreting z-scores in light of the known characteristics of the normal distribution.

In order to justify this assumption, not only to enhance the interpretability of z -scores but more generally to enhance the integrity of parametric statistical analyses, it is helpful to actually look at the sample frequency distributions for variables (using a histogram (illustrated in Procedure 5.2 ) or a boxplot (illustrated in Procedure 5.6 ), for example), since non-normality can often be visually detected. It is important to note that in the social and behavioural sciences as well as in economics and finance, certain variables tend to be non-normal by their very nature. This includes variables that measure time taken to complete a task, achieve a goal or make decisions and variables that measure, for example, income, occurrence of rare or extreme events or organisational size. Such variables tend to be positively skewed in the population, a pattern that can often be confirmed by graphing the distribution.

If you cannot justify an assumption of ‘normality’, you may be able to force the data to be normally distributed by using what is called a ‘normalising transformation’. Such transformations will usually involve a nonlinear mathematical conversion (such as computing the logarithm, square root or reciprocal) of the raw scores. Such transformations will force the data to take on a more normal appearance so that the assumption of ‘normality’ can be reasonably justified, but at the cost of creating a new variable whose units of measurement and interpretation are more complicated. [For some non-normal variables, such as the occurrence of rare, extreme or catastrophic events (e.g. a 100-year flood or forest fire, coronavirus pandemic, the Global Financial Crisis or other type of financial crisis, man-made or natural disaster), the distributions cannot be ‘normalised’. In such cases, the researcher needs to model the distribution as it stands. For such events, extreme value theory (e.g. see Diebold et al. 2000 ) has proven very useful in recent years. This theory uses a variation of the Pareto or Weibull distribution as a reference, rather than the normal distribution, when making predictions.]
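The three nonlinear conversions mentioned above are one-liners in most environments. A Python sketch with numpy (our assumption; any package will do), applied to a few illustrative positive scores:

```python
# Common 'normalising' transformations for positively skewed scores.
# Which one (if any) is appropriate depends on the data.
import numpy as np

speed = np.array([1.5, 2.1, 2.8, 3.4, 4.5, 6.0, 11.9])  # illustrative scores

log_speed = np.log10(speed)    # compresses the long right tail
sqrt_speed = np.sqrt(speed)    # a milder compression
recip_speed = 1.0 / speed      # a stronger one; note it reverses score order

print(log_speed.round(3))
```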

Figure 5.29 displays before and after pictures of the effects of a logarithmic transformation on the positively skewed speed variable from the QCI database. Each graph, produced using NCSS, is of the hybrid histogram-density trace-boxplot type first illustrated in Procedure 5.6. The left graph clearly shows the strong positive skew in the speed scores and the right graph shows the result of taking the log10 of each raw score.

Fig. 5.29 Combined histogram-density trace-boxplot graphs displaying the before and after effects of a ‘normalising’ log10 transformation of the speed variable

Notice how the long tail toward slow speed scores is pulled in toward the mean and the very short tail toward fast speed scores is extended away from the mean. The result is a more ‘normal’ appearing distribution. The assumption would then be that we could assume normality of speed scores, but only in a log10 format (i.e. it is the log of speed scores that we assume is normally distributed in the population). In general, taking the logarithm of raw scores provides a satisfactory remedy for positively skewed distributions (but not for negatively skewed ones). Furthermore, anything we do with the transformed speed scores now has to be interpreted in units of log10(seconds), which is a more complex interpretation to make.

Another visual method for detecting non-normality is to graph what is called a normal Q-Q plot (the Q-Q stands for Quantile-Quantile). This plots the percentiles for the observed data against the percentiles for the standard normal distribution (see Cleveland 1995 for more detailed discussion; also see Lane 2007, http://onlinestatbook.com/2/advanced_graphs/q-q_plots.html). If the pattern for the observed data follows a normal distribution, then all the points on the graph will fall approximately along a diagonal line.
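Constructing a normal Q-Q plot by hand clarifies what the packages do: sort the data, assign each point a plotting position, and pair each observed quantile with the standard normal quantile for that position. A Python sketch (numpy, scipy and matplotlib are our assumptions; conventions for plotting positions vary by package):

```python
# A hand-rolled normal Q-Q plot for a skewed, simulated sample.
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

x = np.sort(np.random.default_rng(7).lognormal(size=100))  # skewed sample
probs = (np.arange(1, len(x) + 1) - 0.5) / len(x)  # plotting positions
theoretical = norm.ppf(probs)                      # expected normal quantiles

plt.scatter(theoretical, x, s=12)
plt.xlabel("theoretical normal quantiles")
plt.ylabel("observed quantiles")
plt.show()  # skewness shows up as systematic bending away from a straight line
```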

Figure 5.30 shows the normal Q-Q plots for the original speed variable and the transformed log-speed variable, produced using the SPSS Explore... procedure. The diagnostic diagonal line is shown on each graph. In the left-hand plot, for speed , the plot points clearly deviate from the diagonal in a way that signals positive skewness. The right-hand plot, for log_speed, shows the plot points generally falling along the diagonal line thereby conforming much more closely to what is expected in a normal distribution.

Fig. 5.30 Normal Q-Q plots for the original speed variable and the new log_speed variable

In addition to visual ways of detecting non-normality, there are also numerical ways. As highlighted in Chap. 1, there are two additional characteristics of any distribution, namely skewness (asymmetric distribution tails) and kurtosis (peakedness of the distribution). Both have an associated statistic that provides a measure of that characteristic, similar to the mean and standard deviation statistics. In a normal distribution, the values for the skewness and kurtosis statistics are both zero (skewness = 0 means a symmetric distribution; kurtosis = 0 means a mesokurtic distribution). The further away each statistic is from zero, the more the distribution deviates from a normal shape. Both the skewness statistic and the kurtosis statistic have standard errors (see Chap. 7) associated with them (which work very much like the standard deviation, only for a statistic rather than for observations); these can be routinely computed by almost any statistical package when you request a descriptive analysis. Without going into the logic right now (this will come in Chap. 7), a rough rule of thumb you can use to check for normality using the skewness and kurtosis statistics is to do the following (a small computational sketch follows the list):

  • Prepare: Take the standard error for the statistic and multiply it by 2 (or 3 if you want to be more conservative).
  • Interval: Add the result from the Prepare step to the value of the statistic, and also subtract the result from the value of the statistic. You will end up with two numbers, one low and one high, that define the ends of an interval (what you have just created approximates what is called a 'confidence interval'; see Chap. 8).
  • Check: If zero falls inside this interval (i.e. between the low and high endpoints from the Interval step), then there is likely no significant issue with that characteristic of the distribution. If zero falls outside the interval (i.e. lower than the low endpoint or higher than the high endpoint), then you likely have a non-normality issue with respect to that characteristic.
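As a concrete illustration of this rule of thumb, the sketch below implements the Prepare-Interval-Check steps in Python. The standard error formulas are the exact small-sample ones that packages such as SPSS report; the function name and simulated data are illustrative only:

```python
import numpy as np
from scipy import stats

def normality_check(x, multiplier=2):
    """Rough check: flag a problem when zero falls outside
    statistic +/- multiplier * standard error."""
    n = len(x)
    # Exact small-sample standard errors for skewness and (excess) kurtosis.
    se_skew = np.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2 * se_skew * np.sqrt((n**2 - 1) / ((n - 3) * (n + 5)))
    results = {}
    for name, value, se in [
        ("skewness", stats.skew(x, bias=False), se_skew),
        ("kurtosis", stats.kurtosis(x, bias=False), se_kurt),  # 0 = mesokurtic
    ]:
        lo, hi = value - multiplier * se, value + multiplier * se  # Interval
        results[name] = (value, lo, hi, not (lo <= 0 <= hi))      # Check
    return results

x = np.random.default_rng(7).lognormal(1.0, 0.6, 112)
for name, (value, lo, hi, problem) in normality_check(x).items():
    print(f"{name}: {value:.3f}, interval [{lo:.3f}, {hi:.3f}], problem: {problem}")
```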

Visually, we saw in the left graph in Fig. 5.29 that the speed variable was highly positively skewed. What if Maree wanted to check some numbers to support this judgment? She could ask SPSS to produce the skewness and kurtosis statistics for both the original speed variable and the new log_speed variable using either the Frequencies... or the Explore... procedure. Table 5.6 shows what SPSS would produce if the Frequencies... procedure were used.

Table 5.6 Skewness and kurtosis statistics and their standard errors for both the original speed variable and the new log_speed variable

Variable     Skewness   Std. error of skewness   Kurtosis   Std. error of kurtosis
speed        1.487      .229                     3.071      .455
log_speed    −.050      .229                     −.672      .455

Using the 3-step check rule described above, Maree could roughly evaluate the normality of the two variables as follows:

  • speed, skewness: [Prepare] 2 × .229 = .458 ➔ [Interval] 1.487 − .458 = 1.029 and 1.487 + .458 = 1.945 ➔ [Check] zero does not fall inside the interval bounded by 1.029 and 1.945, so there appears to be a significant problem with skewness. Since the value of the skewness statistic (1.487) is positive, the problem is positive skewness, confirming what the left graph in Fig. 5.29 showed.
  • speed, kurtosis: [Prepare] 2 × .455 = .91 ➔ [Interval] 3.071 − .91 = 2.161 and 3.071 + .91 = 3.981 ➔ [Check] zero does not fall inside the interval bounded by 2.161 and 3.981, so there appears to be a significant problem with kurtosis. Since the value of the kurtosis statistic (3.071) is positive, the problem is leptokurtosis: the peak of the distribution is too tall relative to what is expected in a normal distribution.
  • log_speed, skewness: [Prepare] 2 × .229 = .458 ➔ [Interval] −.050 − .458 = −.508 and −.050 + .458 = .408 ➔ [Check] zero falls within the interval bounded by −.508 and .408, so there appears to be no problem with skewness. The log transform appears to have corrected the problem, confirming what the right graph in Fig. 5.29 showed.
  • log_speed, kurtosis: [Prepare] 2 × .455 = .91 ➔ [Interval] −.672 − .91 = −1.582 and −.672 + .91 = .238 ➔ [Check] zero falls within the interval bounded by −1.582 and .238, so there appears to be no problem with kurtosis. The log transform appears to have corrected this problem as well, rendering the distribution approximately mesokurtic (i.e. normal) in shape.

There are also more formal tests of significance (see Chap. 7) that one can use to numerically evaluate normality, such as the Kolmogorov-Smirnov test and the Shapiro-Wilk test. Each of these tests, for example, can be produced by SPSS on request via the Explore... procedure.
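Both tests are also available in open-source tools. Below is a minimal sketch using scipy on simulated stand-in data. One caveat: a Kolmogorov-Smirnov test with the normal distribution's parameters estimated from the sample itself is liberal unless a Lilliefors-type correction is applied (SPSS applies such a correction), so the K-S p-value here should be read as approximate:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
speed = rng.lognormal(1.0, 0.6, 112)  # illustrative skewed scores
log_speed = np.log10(speed)

for name, x in [("speed", speed), ("log_speed", log_speed)]:
    w, p_sw = stats.shapiro(x)  # Shapiro-Wilk test of normality
    # K-S test against a normal with the sample's own mean and SD
    # (no Lilliefors correction, so treat the p-value as approximate).
    d, p_ks = stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1)))
    print(f"{name}: Shapiro-Wilk p = {p_sw:.4f}, K-S p = {p_ks:.4f}")
```

A small p-value on either test signals a significant departure from normality; here the raw speed scores should fail both tests while log_speed should not.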

1 For more information, see Chap. 1 – The language of statistics.


References for Procedure 5.3

  • Cleveland WR. Visualizing data. Summit, NJ: Hobart Press; 1995.
  • Jacoby WJ. Statistical graphics for visualizing multivariate data. Thousand Oaks, CA: Sage; 1998.

References for Fundamental Concept II

  • Diebold FX, Schuermann T, Stroughair D. Pitfalls and opportunities in the use of extreme value theory in risk management. The Journal of Risk Finance. 2000;1(2):30–35. doi:10.1108/eb043443.
  • Lane D. Online statistics education: A multimedia course of study. Houston, TX: Rice University; 2007.

