Data Representation: Definition, Types, Examples

Data Representation: Data representation is a technique for presenting and analysing numerical data. It depicts the relationship between facts, ideas, information, and concepts in a diagram. It is a fundamental learning strategy that is simple and easy to understand, and the form it takes always depends on the type of data in a specific domain. Graphical representations are available in many different shapes and sizes.

In mathematics, a graph is a chart in which statistical data are represented by curves or lines drawn across coordinate points plotted on its surface. It helps in studying the relationship between two variables by letting one evaluate how much one variable changes with respect to another over time. It is also useful for analysing series and frequency distributions in a given context. On this page, we will go through several types of graphs that can be used to display data graphically. Continue reading to learn more.


Data Representation in Maths

Definition: After collecting data, the investigator has to condense it into tabular form to study its salient features. Such an arrangement is known as the presentation of data.

Any information gathered may be organised in a frequency distribution table, and then shown using pictographs or bar graphs. A bar graph is a representation of numbers made up of equally wide bars whose lengths are determined by the frequency and scale you choose.

The collected raw data can be placed in any one of the given ways:

  • Serial order or alphabetical order
  • Ascending order
  • Descending order

Data Representation Example

Example: Let the marks obtained by \(30\) students of class VIII in a class test, out of \(50\), according to their roll numbers, be:

\(39,\,25,\,5,\,33,\,19,\,21,\,12,\,41,\,12,\,21,\,19,\,1,\,10,\,8,\,12\)

\(17,\,19,\,17,\,17,\,41,\,40,\,12,\,41,\,33,\,19,\,21,\,33,\,5,\,1,\,21\)

The data in the given form is known as raw data or ungrouped data. The above-given data can be placed in the serial order as shown below:

[Table: the marks arranged in serial order]

Now suppose you want to analyse the standard of achievement of the students. Arranging the marks in ascending or descending order gives a much better picture.

Ascending order:

\(1,\,1,\,5,\,5,\,8,\,10,\,12,\,12,\,12,\,12,\,17,\,17,\,17,\,19,\,19\)

\(19,\,19,\,21,\,21,\,21,\,21,\,25,\,33,\,33,\,33,\,39,\,40,\,41,\,41,\,41\)

Descending order:

\(41,\,41,\,41,\,40,\,39,\,33,\,33,\,33,\,25,\,21,\,21,\,21,\,21,\,19,\,19\)

\(19,\,19,\,17,\,17,\,17,\,12,\,12,\,12,\,12,\,10,\,8,\,5,\,5,\,1,\,1\)

Raw data arranged in ascending or descending order of magnitude is known as an array or arrayed data.
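In practice, an array can also be produced directly in code. The following is a minimal Python sketch (not part of the original article) that sorts the marks from the example above into ascending and descending order:

```python
# Raw marks of the 30 students from the example above
marks = [39, 25, 5, 33, 19, 21, 12, 41, 12, 21, 19, 1, 10, 8, 12,
         17, 19, 17, 17, 41, 40, 12, 41, 33, 19, 21, 33, 5, 1, 21]

ascending = sorted(marks)                  # array in ascending order
descending = sorted(marks, reverse=True)   # array in descending order

print(ascending)
print(descending)
```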

Types of Graphical Representation of Data

A few common graphical representations of data are given below:

  • Frequency distribution table
  • Bar graph
  • Histogram
  • Pie chart
  • Line graph

Pictorial Representation of Data: Bar Chart

A bar graph represents qualitative data visually. The information is displayed horizontally or vertically and compares items like amounts, characteristics, times, and frequency.

The bars are arranged in order of frequency, so the more important categories are emphasised. By looking at all the bars, it is easy to tell which categories in a set of data dominate the others. Bar graphs can take several forms, such as single, stacked, or grouped.

[Figure: bar chart]
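As an illustration of the idea, here is a minimal Python sketch using matplotlib; the subject counts below are invented for the example:

```python
import matplotlib.pyplot as plt

# Hypothetical data: number of students choosing each favourite subject
subjects = ["Maths", "Science", "English", "History"]
students = [7, 5, 4, 4]

plt.bar(subjects, students, width=0.6)      # equally wide bars
plt.xlabel("Favourite subject")
plt.ylabel("Number of students")
plt.title("Favourite subjects of a class")  # suitable title
plt.show()
```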

Graphical Representation of Data: Frequency Distribution Table

A frequency table, or frequency distribution, is a way to present raw data so that the information it contains can be easily understood.

The frequency distribution table is constructed using tally marks. Tally marks are a form of numeral system in which vertical lines are used for counting; a diagonal line is drawn across every four vertical lines to make a group of \(5\).

[Figure: tally marks]

Consider a jar containing pieces of bread of different colours, as shown below:

[Figure: jar containing pieces of bread of different colours]

Construct a frequency distribution table for the data shown above.

[Table: frequency distribution of the colours]
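A frequency distribution can also be built programmatically. The sketch below uses Python's collections.Counter on an invented list of colours standing in for the jar above, and prints tally-style groups of five:

```python
from collections import Counter

# Hypothetical observations standing in for the jar above
colours = ["red", "blue", "red", "green", "blue", "red",
           "yellow", "blue", "red", "green", "red", "blue"]

frequency = Counter(colours)   # counts how often each colour appears

print(f"{'Colour':<8}{'Tally':<12}{'Frequency'}")
for colour, count in frequency.items():
    # '|||| ' stands for a completed group of five tallies
    tally = "|||| " * (count // 5) + "|" * (count % 5)
    print(f"{colour:<8}{tally:<12}{count}")
```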

Graphical Representation of Data: Histogram

The histogram is another kind of graph that uses bars in its display. A histogram is used for quantitative data: ranges of values, known as classes, are listed along the bottom, and the classes with greater frequencies have taller bars.

A histogram and a bar graph look very similar; however, they differ in the level of measurement of the data. Bar graphs measure the frequency of categorical data. A categorical variable has two or more categories, such as gender or hair colour.

[Figure: histogram]
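As a sketch (not from the original article), matplotlib's hist function sorts quantitative values into classes and draws adjoining bars; here it is applied to the class-test marks from the earlier example:

```python
import matplotlib.pyplot as plt

# Marks of the 30 students from the earlier example
marks = [39, 25, 5, 33, 19, 21, 12, 41, 12, 21, 19, 1, 10, 8, 12,
         17, 19, 17, 17, 41, 40, 12, 41, 33, 19, 21, 33, 5, 1, 21]

# Classes (bins) of width 10: 0-10, 10-20, ..., 40-50
plt.hist(marks, bins=range(0, 51, 10), edgecolor="black")
plt.xlabel("Marks (out of 50)")
plt.ylabel("Number of students")
plt.title("Distribution of class test marks")
plt.show()
```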

Graphical Representation of Data: Pie Chart

A pie chart is used to represent the numerical proportions of a dataset. The circle is divided into sectors, each of which represents the proportion of a particular element out of the whole. It is therefore also known as a circle chart or circle graph.

[Figure: pie chart]
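A minimal matplotlib sketch of a circle chart; the activity shares below are invented for illustration:

```python
import matplotlib.pyplot as plt

# Hypothetical proportions of time spent on daily activities
activities = ["Sleep", "School", "Play", "Study", "Other"]
hours = [8, 6, 3, 3, 4]   # totals 24 hours

# Each sector's size is proportional to its share of the total
plt.pie(hours, labels=activities, autopct="%1.1f%%")
plt.title("How a student spends a day")
plt.show()
```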

Graphical Representation of Data: Line Graph

A line graph uses points and lines to represent change over time. In other words, it is a chart that shows a line joining multiple points, or a line that shows the relationship between the points.

The diagram depicts quantitative data between two changing variables with a straight line or a curve that joins a series of successive data points. Line charts compare the two variables on the vertical and horizontal axes.

[Figure: line graph]
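A short matplotlib sketch of a line graph plotting one variable against time; the temperature readings are invented:

```python
import matplotlib.pyplot as plt

# Hypothetical temperature readings taken over a week
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
temperature = [31, 32, 30, 29, 33, 34, 32]   # degrees Celsius

plt.plot(days, temperature, marker="o")      # successive points joined by a line
plt.xlabel("Day of the week")
plt.ylabel("Temperature (°C)")
plt.title("Daily maximum temperature")
plt.show()
```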

General Rules for Visual Representation of Data

A few rules help present information effectively in a graphical representation; they are given below:

  • Suitable Title: Ensure that an appropriate title is given to the graph, indicating the subject of the presentation.
  • Measurement Unit: State the unit of measurement used in the graph.
  • Proper Scale: Choose an appropriate scale so that the data is represented accurately.
  • Index: Provide an index of the colours, shades, lines, and designs used in the graph for better understanding.
  • Data Sources: At the bottom of the graph, include the source of the information wherever necessary.
  • Keep it Simple: Build the graph in such a way that everyone can understand it easily.
  • Neat: Choose the correct size, fonts, colours, etc., so that the graph serves as a model presentation of the information.

Solved Examples on Data Representation

Q.1. Construct the frequency distribution table for the data on heights (in \({\rm{cm}}\)) of \(20\) boys using the class intervals \(130-135\), \(135-140\), and so on. The heights of the boys in \({\rm{cm}}\) are:

[Data: heights of the 20 boys]

Ans: The frequency distribution for the above data can be constructed as follows:

[Table: frequency distribution of the heights]

Q.2. Write the steps for the construction of a bar graph.
Ans: To construct a bar graph, follow the given steps:
1. Take a graph paper and draw two lines perpendicular to each other; call them the horizontal and vertical axes.
2. Mark the information given in the data, such as days, weeks, months, years, places, etc., at uniform gaps along the horizontal axis.
3. Choose a suitable scale to decide the heights of the rectangles (bars), and mark these values on the vertical axis.
4. Draw bars of equal width, with the heights marked in the previous step, along the horizontal axis with equal spacing between them.
The figure so obtained is the bar graph representing the given numerical data.

Q.3. Read the bar graph and then answer the given questions:
I. What information does the given bar graph provide?
II. In what order does the number of students change over the years?
III. In which year is the increase in the number of students the maximum?
IV. State whether true or false: the enrolment during \(1996-97\) is double that of \(1995-96\).

[Figure: bar graph of the number of students in class VI from 1995-96 to 1999-2000]

Ans:
I. The bar graph represents the number of students in class \({\rm{VI}}\) of a school during the academic years \(1995-96\) to \(1999-2000\).
II. The number of students changes in increasing order, as the heights of the bars keep growing.
III. The increase in the number of students is uniform, and so is the increase in the heights of the bars. Hence, the growth is not maximum in any particular year.
IV. The enrolment in \(1996-97\) is \(200\) and the enrolment in \(1995-96\) is \(150\), so the enrolment in \(1996-97\) is not double that of \(1995-96\). The statement is false.

Q.4. Write the frequency distribution for the given information on the ages of \(25\) students of class VIII in a school.
\(15,\,16,\,16,\,14,\,17,\,17,\,16,\,15,\,15,\,16,\,16,\,17,\,15\)
\(16,\,16,\,14,\,16,\,15,\,14,\,15,\,16,\,16,\,15,\,14,\,15\)
Ans: The frequency distribution of the ages of the \(25\) students is:

[Table: frequency distribution of the ages]

Q.5. There are \(20\) students in a classroom. The teacher asked the students to talk about their favourite subjects. The results are listed below:

[Table: favourite subjects of the 20 students]

By looking at the above data, which is the most-liked subject?
Ans: Representing the above data in a frequency distribution table using tally marks:

[Table: frequency distribution of the favourite subjects]

From the above table, we can see that the maximum number of students \((7)\) likes mathematics.

Also check: Diagrammatic Representation of Data

In this article, we have discussed data representation with an example. We then covered graphical representations such as the bar graph, frequency table, and pie chart, and later discussed the general rules for graphical representation. Finally, you can find solved examples along with a few FAQs. These will help you gain further clarity on this topic.


FAQs on Data Representation

Q.1: How is data represented?
A: Collected data can be expressed in various ways, such as bar graphs, pictographs, frequency tables, line graphs, pie charts and many more. The choice depends on the purpose of the data, and the type of graph is chosen accordingly.

Q.2: What are the different types of data representation?
A: A few types of data representation are given below:
1. Frequency distribution table
2. Bar graph
3. Histogram
4. Line graph
5. Pie chart

Q.3: What is data representation, and why is it essential?
A: After collecting data, the investigator has to condense it into tabular form to study its salient features. Such an arrangement is known as the presentation of data.
Importance: Data visualisation gives us a clear understanding of what the information means by displaying it visually through maps or graphs. Visualised data is easier for the mind to comprehend and makes it easier to spot trends and outliers within large data sets.

Q.4: What is the difference between data and representation?
A: The term data refers to a collection of specific quantitative facts, such as height or the number of children, whereas data representation is that data after it has been processed, arranged and presented in a form that gives it meaning.

Q.5: Why do we use data representation?
A: Data visualisation gives us a clear understanding of what the information means by displaying it visually through maps or graphs. Visualised data is easier for the mind to comprehend and makes it easier to spot trends and outliers within large data sets.


K12 LibreTexts

2.1: Types of Data Representation



Two common types of graphic displays are bar charts and histograms. Both bar charts and histograms use vertical or horizontal bars to represent the number of data points in each category or interval. The main graphical difference is that in a bar chart there are spaces between the bars, while in a histogram there are no spaces between the bars. Why does this subtle difference exist, and what does it imply about graphic displays in general?

Displaying Data

It is often easier for people to interpret relative sizes of data when that data is displayed graphically. Note that a  categorical variable  is a variable that can take on one of a limited number of values and a  quantitative variable  is a variable that takes on numerical values that represent a measurable quantity. Examples of categorical variables are tv stations, the state someone lives in, and eye color while examples of quantitative variables are the height of students or the population of a city. There are a few common ways of displaying data graphically that you should be familiar with. 

A pie chart shows the relative proportions of data in different categories. Pie charts are excellent ways of displaying categorical data with easily separable groups. The following pie chart shows six categories labeled A-F. The size of each pie slice is determined by the central angle. Since there are \(360^\circ\) in a circle, the size of the central angle \(\theta_A\) of category A can be found by:

\(\theta_A = \dfrac{\text{frequency of category A}}{\text{total number of data points}} \times 360^\circ\)

[Figure: pie chart with six categories A-F. CK-12 Foundation - https://www.flickr.com/photos/slgc/16173880801 - CCSA]
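As a rough illustration of how that formula is applied (the category frequencies below are invented, not taken from the figure), each slice's central angle is its share of the total multiplied by 360 degrees:

```python
# Hypothetical frequencies for categories A-F
frequencies = {"A": 4, "B": 3, "C": 5, "D": 2, "E": 4, "F": 2}
total = sum(frequencies.values())

for category, count in frequencies.items():
    angle = count / total * 360   # central angle in degrees
    print(f"Category {category}: {angle:.1f} degrees")
```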

A  bar chart  displays frequencies of categories of data. The bar chart below has 5 categories, and shows the TV channel preferences for 53 adults. The horizontal axis could have also been labeled News, Sports, Local News, Comedy, Action Movies. The reason why the bars are separated by spaces is to emphasize the fact that they are categories and not continuous numbers. For example, just because you split your time between channel 8 and channel 44 does not mean on average you watch channel 26. Categories can be numbers so you need to be very careful.

[Figure: bar chart of TV channel preferences for 53 adults. CK-12 Foundation - https://www.flickr.com/photos/slgc/16173880801 - CCSA]

A  histogram  displays frequencies of quantitative data that has been sorted into intervals. The following is a histogram that shows the heights of a class of 53 students. Notice the largest category is 56-60 inches with 18 people.

[Figure: histogram of the heights of 53 students]

A boxplot (also known as a box and whiskers plot) is another way to display quantitative data. It displays the five number summary (minimum, Q1, median, Q3, maximum). The box can be displayed either vertically or horizontally, depending on the labeling of the axis. The box does not need to be perfectly symmetrical because it represents data that might not be perfectly symmetrical.

[Figure: example boxplot]

Earlier, you were asked about the difference between histograms and bar charts. The reason for the space in bar charts but no space in histograms is that bar charts graph categorical variables while histograms graph quantitative variables. It would be extremely improper to omit the space in bar charts because you would run the risk of implying a spectrum from one side of the chart to the other. Note that in the bar chart where TV stations were shown, the station numbers were not listed horizontally in order by size. This was to emphasize the fact that the stations were categories.

Create a boxplot of the following numbers in your calculator.

8.5, 10.9, 9.1, 7.5, 7.2, 6, 2.3, 5.5

Enter the data into L1 by going into the Stat menu.


Then turn the statplot on and choose boxplot.


Use Zoomstat to automatically center the window on the boxplot.

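If you would rather use Python than a graphing calculator, the following minimal matplotlib sketch draws the same boxplot from the data above:

```python
import matplotlib.pyplot as plt

data = [8.5, 10.9, 9.1, 7.5, 7.2, 6, 2.3, 5.5]

plt.boxplot(data, vert=False)   # horizontal box-and-whisker plot
plt.xlabel("Value")
plt.title("Boxplot of the sample data")
plt.show()
```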

Create a pie chart to represent the preferences of 43 hungry students.

  • Other – 5
  • Burritos – 7
  • Burgers – 9
  • Pizza – 22


Create a bar chart representing the sports preferences of a group of 33 people.

  • Football – 12
  • Baseball – 10
  • Basketball – 8
  • Hockey – 3


Create a histogram for the income distribution of 200 million people.

  • Below $50,000 is 100 million people
  • Between $50,000 and $100,000 is 50 million people
  • Between $100,000 and $150,000 is 40 million people
  • Above $150,000 is 10 million people

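Because the counts for each income range are given directly, one way to sketch this in Python is to draw touching bars from the counts; this is an approximation of the exercise, not the original calculator output:

```python
import matplotlib.pyplot as plt

# Income intervals and the number of people (in millions) in each
labels = ["< 50k", "50k-100k", "100k-150k", "> 150k"]
millions = [100, 50, 40, 10]

# width=1.0 makes the bars touch, as in a histogram of continuous data
plt.bar(labels, millions, width=1.0, edgecolor="black")
plt.xlabel("Income range ($)")
plt.ylabel("People (millions)")
plt.title("Income distribution of 200 million people")
plt.show()
```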

1. What types of graphs show categorical data?

2. What types of graphs show quantitative data?

A math class of 30 students had the following grades:

A 10
B 10
C 5
D 3
F 2

3. Create a bar chart for this data.

4. Create a pie chart for this data.

5. Which graph do you think makes a better visual representation of the data?

A set of 20 exam scores is 67, 94, 88, 76, 85, 93, 55, 87, 80, 81, 80, 61, 90, 84, 75, 93, 75, 68, 100, 98

6. Create a histogram for this data. Use your best judgment to decide what the intervals should be.

7. Find the  five number summary  for this data.

8. Use the  five number summary  to create a boxplot for this data.

9. Describe the data shown in the boxplot below.

[Figure: boxplot for Question 9]

10. Describe the data shown in the histogram below.

[Figure: histogram for Question 10]

A math class of 30 students has the following eye colors:

Brown 20
Blue 5
Green 3
Other 2

11. Create a bar chart for this data.

12. Create a pie chart for this data.

13. Which graph do you think makes a better visual representation of the data?

14. Suppose you have data that shows the breakdown of registered republicans by state. What types of graphs could you use to display this data?

15. From which types of graphs could you obtain information about the spread of the data? Note that spread is a measure of how spread out all of the data is.

Review (Answers)

To see the Review answers, open this  PDF file  and look for section 15.4. 

Vocabulary
A bar chart is a graphic display of categorical variables that uses bars to represent the frequency of the count in each category.
A box- and- whisker plot is a graphic display of quantitative data that demonstrates the five number summary.
A boxplot is a graphic display of quantitative data that illustrates the five number summary.
A categorical variable is a variable that can take on one of a limited number of values. Examples of categorical variables are tv stations, the state someone lives in, and eye color.
A continuous variable is a variable that takes on any value within the limits of the variable.
The five number summary of a set of data is the minimum, first quartile, second quartile, third quartile, and maximum.
A frequency polygon is a graph constructed by using lines to join the midpoints of each interval, or bin.
A histogram is a display that indicates the frequency of specified ranges of continuous data values on a graph in the form of immediately adjacent bars.
A line plot is a graph that shows the frequency of data over a number line.
In statistics, linear regression is a process that attempts to model the relationship between two variables by fitting a linear equation to the data.
A pie chart is a graphic display of categorical data where the relative size of each pie slice corresponds to the frequency of each category.
A quantitative variable is a variable that takes on numerical values that represent a measurable quantity. Examples of quantitative variables are the height of students or the population of a city.
A scatterplot is a type of visual display that shows pairs of data for two different variables.
A stem-and-leaf plot is a way of organizing data values from least to greatest using place value. Usually, the last digit of each data value becomes the "leaf" and the other digits become the "stem".

Additional Resources

PLIX: Play, Learn, Interact, eXplore - Baby Due Date Histogram

Practice: Types of Data Representation

Real World: Prepare for Impact


17 Data Visualization Techniques All Professionals Should Know


  • 17 Sep 2019

There’s a growing demand for business analytics and data expertise in the workforce. But you don’t need to be a professional analyst to benefit from data-related skills.

Becoming skilled at common data visualization techniques can help you reap the rewards of data-driven decision-making , including increased confidence and potential cost savings. Learning how to effectively visualize data could be the first step toward using data analytics and data science to your advantage to add value to your organization.

Several data visualization techniques can help you become more effective in your role. Here are 17 essential data visualization techniques all professionals should know, as well as tips to help you effectively present your data.


What Is Data Visualization?

Data visualization is the process of creating graphical representations of information. This process helps the presenter communicate data in a way that’s easy for the viewer to interpret and draw conclusions.

There are many different techniques and tools you can leverage to visualize data, so you want to know which ones to use and when. Here are some of the most important data visualization techniques all professionals should know.

Data Visualization Techniques

The type of data visualization technique you leverage will vary based on the type of data you’re working with, in addition to the story you’re telling with your data .

Here are some important data visualization techniques to know:

  • Pie Chart
  • Bar Chart
  • Histogram
  • Gantt Chart
  • Heat Map
  • Box and Whisker Plot
  • Waterfall Chart
  • Area Chart
  • Scatter Plot
  • Pictogram Chart
  • Timeline
  • Highlight Table
  • Bullet Graph
  • Choropleth Map
  • Word Cloud
  • Network Diagram
  • Correlation Matrix

1. Pie Chart

Pie Chart Example

Pie charts are one of the most common and basic data visualization techniques, used across a wide range of applications. Pie charts are ideal for illustrating proportions, or part-to-whole comparisons.

Because pie charts are relatively simple and easy to read, they’re best suited for audiences who might be unfamiliar with the information or are only interested in the key takeaways. For viewers who require a more thorough explanation of the data, pie charts fall short in their ability to display complex information.

2. Bar Chart

Bar Chart Example

The classic bar chart , or bar graph, is another common and easy-to-use method of data visualization. In this type of visualization, one axis of the chart shows the categories being compared, and the other, a measured value. The length of the bar indicates how each group measures according to the value.

One drawback is that labeling and clarity can become problematic when there are too many categories included. Like pie charts, they can also be too simple for more complex data sets.

3. Histogram

Histogram Example

Unlike bar charts, histograms illustrate the distribution of data over a continuous interval or defined period. These visualizations are helpful in identifying where values are concentrated, as well as where there are gaps or unusual values.

Histograms are especially useful for showing the frequency of a particular occurrence. For instance, if you’d like to show how many clicks your website received each day over the last week, you can use a histogram. From this visualization, you can quickly determine which days your website saw the greatest and fewest number of clicks.

4. Gantt Chart

Gantt Chart Example

Gantt charts are particularly common in project management, as they’re useful in illustrating a project timeline or progression of tasks. In this type of chart, tasks to be performed are listed on the vertical axis and time intervals on the horizontal axis. Horizontal bars in the body of the chart represent the duration of each activity.

Utilizing Gantt charts to display timelines can be incredibly helpful, and enable team members to keep track of every aspect of a project. Even if you’re not a project management professional, familiarizing yourself with Gantt charts can help you stay organized.

5. Heat Map

Heat Map Example

A heat map is a type of visualization used to show differences in data through variations in color. These charts use color to communicate values in a way that makes it easy for the viewer to quickly identify trends. Having a clear legend is necessary in order for a user to successfully read and interpret a heatmap.

There are many possible applications of heat maps. For example, if you want to analyze which time of day a retail store makes the most sales, you can use a heat map that shows the day of the week on the vertical axis and time of day on the horizontal axis. Then, by shading in the matrix with colors that correspond to the number of sales at each time of day, you can identify trends in the data that allow you to determine the exact times your store experiences the most sales.
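As a hedged sketch of that retail example, the snippet below draws a day-by-hour heat map with seaborn; the sales figures are random placeholders rather than real data:

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
hours = [f"{h}:00" for h in range(9, 21)]   # store hours 9:00-20:00

# Placeholder sales counts for each (day, hour) cell
rng = np.random.default_rng(0)
sales = rng.integers(0, 50, size=(len(days), len(hours)))

sns.heatmap(sales, xticklabels=hours, yticklabels=days,
            cmap="YlOrRd", cbar_kws={"label": "Sales"})
plt.xlabel("Time of day")
plt.ylabel("Day of week")
plt.title("Sales by day and hour (placeholder data)")
plt.show()
```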

6. Box and Whisker Plot

Box and Whisker Plot Example

A box and whisker plot, or box plot, provides a visual summary of data through its quartiles. First, a box is drawn from the first quartile to the third quartile of the data set. A line within the box represents the median. “Whiskers,” or lines, are then drawn extending from the box to the minimum (lower extreme) and maximum (upper extreme). Outliers are represented by individual points that are in-line with the whiskers.

This type of chart is helpful in quickly identifying whether or not the data is symmetrical or skewed, as well as providing a visual summary of the data set that can be easily interpreted.

7. Waterfall Chart

Waterfall Chart Example

A waterfall chart is a visual representation that illustrates how a value changes as it’s influenced by different factors, such as time. The main goal of this chart is to show the viewer how a value has grown or declined over a defined period. For example, waterfall charts are popular for showing spending or earnings over time.

8. Area Chart

Area Chart Example

An area chart , or area graph, is a variation on a basic line graph in which the area underneath the line is shaded to represent the total value of each data point. When several data series must be compared on the same graph, stacked area charts are used.

This method of data visualization is useful for showing changes in one or more quantities over time, as well as showing how each quantity combines to make up the whole. Stacked area charts are effective in showing part-to-whole comparisons.

9. Scatter Plot

Scatter Plot Example

Another technique commonly used to display data is a scatter plot . A scatter plot displays data for two variables as represented by points plotted against the horizontal and vertical axis. This type of data visualization is useful in illustrating the relationships that exist between variables and can be used to identify trends or correlations in data.

Scatter plots are most effective for fairly large data sets, since it’s often easier to identify trends when there are more data points present. Additionally, the closer the data points are grouped together, the stronger the correlation or trend tends to be.

10. Pictogram Chart

Pictogram Example

Pictogram charts , or pictograph charts, are particularly useful for presenting simple data in a more visual and engaging way. These charts use icons to visualize data, with each icon representing a different value or category. For example, data about time might be represented by icons of clocks or watches. Each icon can correspond to either a single unit or a set number of units (for example, each icon represents 100 units).

In addition to making the data more engaging, pictogram charts are helpful in situations where language or cultural differences might be a barrier to the audience’s understanding of the data.

11. Timeline

Timeline Example

Timelines are the most effective way to visualize a sequence of events in chronological order. They’re typically linear, with key events outlined along the axis. Timelines are used to communicate time-related information and display historical data.

Timelines allow you to highlight the most important events that occurred, or need to occur in the future, and make it easy for the viewer to identify any patterns appearing within the selected time period. While timelines are often relatively simple linear visualizations, they can be made more visually appealing by adding images, colors, fonts, and decorative shapes.

12. Highlight Table

Highlight Table Example

A highlight table is a more engaging alternative to traditional tables. By highlighting cells in the table with color, you can make it easier for viewers to quickly spot trends and patterns in the data. These visualizations are useful for comparing categorical data.

Depending on the data visualization tool you’re using, you may be able to add conditional formatting rules to the table that automatically color cells that meet specified conditions. For instance, when using a highlight table to visualize a company’s sales data, you may color cells red if the sales data is below the goal, or green if sales were above the goal. Unlike a heat map, the colors in a highlight table are discrete and represent a single meaning or value.

13. Bullet Graph

Bullet Graph Example

A bullet graph is a variation of a bar graph that can act as an alternative to dashboard gauges to represent performance data. The main use for a bullet graph is to inform the viewer of how a business is performing in comparison to benchmarks that are in place for key business metrics.

In a bullet graph, the darker horizontal bar in the middle of the chart represents the actual value, while the vertical line represents a comparative value, or target. If the horizontal bar passes the vertical line, the target for that metric has been surpassed. Additionally, the segmented colored sections behind the horizontal bar represent range scores, such as “poor,” “fair,” or “good.”

14. Choropleth Maps

Choropleth Map Example

A choropleth map uses color, shading, and other patterns to visualize numerical values across geographic regions. These visualizations use a progression of color (or shading) on a spectrum to distinguish high values from low.

Choropleth maps allow viewers to see how a variable changes from one region to the next. A potential downside to this type of visualization is that the exact numerical values aren’t easily accessible because the colors represent a range of values. Some data visualization tools, however, allow you to add interactivity to your map so the exact values are accessible.

15. Word Cloud

Word Cloud Example

A word cloud , or tag cloud, is a visual representation of text data in which the size of the word is proportional to its frequency. The more often a specific word appears in a dataset, the larger it appears in the visualization. In addition to size, words often appear bolder or follow a specific color scheme depending on their frequency.

Word clouds are often used on websites and blogs to identify significant keywords and compare differences in textual data between two sources. They are also useful when analyzing qualitative datasets, such as the specific words consumers used to describe a product.

16. Network Diagram

Network Diagram Example

Network diagrams are a type of data visualization that represent relationships between qualitative data points. These visualizations are composed of nodes and links, also called edges. Nodes are singular data points that are connected to other nodes through edges, which show the relationship between multiple nodes.

There are many use cases for network diagrams, including depicting social networks, highlighting the relationships between employees at an organization, or visualizing product sales across geographic regions.

17. Correlation Matrix

Correlation Matrix Example

A correlation matrix is a table that shows correlation coefficients between variables. Each cell represents the relationship between two variables, and a color scale is used to communicate whether the variables are correlated and to what extent.

Correlation matrices are useful to summarize and find patterns in large data sets. In business, a correlation matrix might be used to analyze how different data points about a specific product might be related, such as price, advertising spend, launch date, etc.
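A minimal sketch of computing and plotting a correlation matrix with pandas and seaborn; the product dataset below is invented for illustration:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Invented product data: price, advertising spend, and units sold
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "price": rng.normal(20, 3, 100),
    "ad_spend": rng.normal(500, 100, 100),
    "units_sold": rng.normal(1000, 150, 100),
})

corr = df.corr()   # pairwise correlation coefficients

sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation matrix (invented data)")
plt.show()
```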

Other Data Visualization Options

While the examples listed above are some of the most commonly used techniques, there are many other ways you can visualize data to become a more effective communicator. Some other data visualization options include:

  • Bubble clouds
  • Circle views
  • Dendrograms
  • Dot distribution maps
  • Open-high-low-close charts
  • Polar areas
  • Radial trees
  • Ring Charts
  • Sankey diagram
  • Span charts
  • Streamgraphs
  • Wedge stack graphs
  • Violin plots


Tips For Creating Effective Visualizations

Creating effective data visualizations requires more than just knowing how to choose the best technique for your needs. There are several considerations you should take into account to maximize your effectiveness when it comes to presenting data.

Related : What to Keep in Mind When Creating Data Visualizations in Excel

One of the most important steps is to evaluate your audience. For example, if you’re presenting financial data to a team that works in an unrelated department, you’ll want to choose a fairly simple illustration. On the other hand, if you’re presenting financial data to a team of finance experts, it’s likely you can safely include more complex information.

Another helpful tip is to avoid unnecessary distractions. Although visual elements like animation can be a great way to add interest, they can also distract from the key points the illustration is trying to convey and hinder the viewer’s ability to quickly understand the information.

Finally, be mindful of the colors you utilize, as well as your overall design. While it’s important that your graphs or charts are visually appealing, there are more practical reasons you might choose one color palette over another. For instance, using low contrast colors can make it difficult for your audience to discern differences between data points. Using colors that are too bold, however, can make the illustration overwhelming or distracting for the viewer.

Related : Bad Data Visualization: 5 Examples of Misleading Data

Visuals to Interpret and Share Information

No matter your role or title within an organization, data visualization is a skill that’s important for all professionals. Being able to effectively present complex data through easy-to-understand visual representations is invaluable when it comes to communicating information with members both inside and outside your business.

There’s no shortage in how data visualization can be applied in the real world. Data is playing an increasingly important role in the marketplace today, and data literacy is the first step in understanding how analytics can be used in business.

Are you interested in improving your analytical skills? Learn more about Business Analytics , our eight-week online course that can help you use data to generate insights and tackle business decisions.

This post was updated on January 20, 2022. It was originally published on September 17, 2019.


11 Data Visualization Techniques for Every Use-Case with Examples


Data visualization is rapidly becoming an essential skill in data science and many other data-driven industries, such as finance, education, and healthcare. This comes with no surprise: as data practitioners are dealing with an ever-growing volume of complex and varied data, data visualization provides a set of techniques to make sense of it and effectively communicate data insights.

Historically considered a minor topic in data science, today, data visualization is a vibrant and fast-paced field, enriched with numerous techniques, tools, theories, and contributions from other disciplines, like psychology and neuroscience. If you’re interested in becoming a data visualization wizard, DataCamp has you covered. Check out our data visualization course catalog to access more than 30 data visualization courses taught by leading experts and covering a variety of popular technologies.

This article provides an overview of the state of data visualization. We will focus on the most popular data visualization analyses, techniques, and tools. Keep reading!

The Power of Good Data Visualization

Data visualization involves the use of graphical representations of data, such as graphs, charts, and maps. Compared to descriptive statistics or tables, visuals provide a more effective way to analyze data, including identifying patterns, distributions, and correlations and spotting outliers in complex datasets.

Visuals allow data scientists to summarize thousands of rows and columns of complex data and put it in an understandable and accessible format.

By bringing data to life with insightful plots and charts, data visualization is vital in decision-making processes. Whether it’s data analysts breaking down their findings to non-technical stakeholders, data scientists performing A/B tests for marketing purposes, or machine learning engineers explaining potential bias in complex large language models like ChatGPT, data visualization is the key to moving from data insights to decision-making.

Despite the use of data visualization, many thorough and detailed data analyses still end up in the drawer for the simple reason that they didn’t get to captivate the audience, whether decision-makers, stakeholders, or other members of the team.

Thanks to progress in disciplines like neuroscience, today, we know the way a data visualization is depicted can severely affect how people perceive it. The choices you make when designing a graph (for example, the colors, the layout, and the size) can make a big difference. Interested in the theory behind data visualization? Our Understanding Data Visualization Course is a great place to get started.

While data visualization has an important role to play when communicating data insights, the recipe for successful communication is more complex. That’s the idea behind data storytelling, an innovative approach that advocates for using visuals, narrative, and data to turn data insights into action. To know more about data storytelling, check out our DataFramed podcast , where we speak with Brent Dykes, Senior Director of Insights & Data Storytelling at Blast Analytics and author of Effective Data Storytelling.

Types of Data Visualization Analysis

Data visualization is used to analyze visually the behavior of the different variables in a dataset, such as a relationship between data points in a variable or the distribution. Depending on the number of variables you want to study at once, you can distinguish three types of data visualization analysis.

  • Univariate analysis. Used to summarize the behavior of only one variable at a time.
  • Bivariate analysis. Helps to study the relationship between two variables.
  • Multivariate analysis. Allows data practitioners to analyze more than two variables at once.

Key Data Visualization Techniques

Let’s now examine the most popular data visualization techniques!

Line plots

One of the most used visualizations, line plots are excellent at tracking the evolution of a variable over time. They are normally created by putting a time variable on the x-axis and the variable you want to analyze on the y-axis. For example, the line plot below shows the evolution of the DJIA stock price during 2022.

[Figure: line plot of the DJIA stock price during 2022. Source: DataCamp]

To learn about how to create compelling line plots, check out our Line Plots in MatplotLib with Python Tutorial .

Bar charts

A bar chart ranks data according to the value of multiple categories. It consists of rectangles whose lengths are proportional to the value of each category. Bar charts are prevalent because they are easy to read. Businesses commonly use them to make comparisons, like comparing the market share of different brands or the revenue of different regions. There are multiple types of bar charts, each suited for a different purpose, including vertical bar plots, horizontal bar plots, and clustered bar plots.

[Figure: vertical, horizontal, and clustered bar plots]

Our course, Introduction to Data Science in Python , covers a range of data visualization techniques, including bar plots.

Histograms

Histograms are one of the most popular visualizations for analyzing the distribution of data. They show the distribution of a numerical variable with bars.

To build a histogram, the numerical data is first divided into several ranges or bins, and the frequency of occurrence of each range is counted. The horizontal axis shows the range, while the vertical axis represents the frequency or percentage of occurrences of a range.

Histograms immediately showcase how a variable's distribution is skewed or where it peaks. Here are examples from our Data Demystified Series on Data Visualizations that Capture Distributions .

[Figure: example histograms]

Box and whisker plots

Another great plot for summarizing the distribution of a variable is the boxplot. Boxplots provide an intuitive and compelling way to spot the following elements:

  • Median . The middle value of a dataset where 50% of the data is less than the median and 50% of the data is higher than the median.
  • The upper quartile . The 75th percentile of a dataset where 75% of the data is less than the upper quartile, and 25% of the data is higher than the upper quartile.
  • The lower quartile . The 25th percentile of a dataset where 25% of the data is less than the lower quartile and 75% is higher than the lower quartile.
  • The interquartile range . The upper quartile minus the lower quartile
  • The upper adjacent value . Or colloquially, the “maximum.” It represents the upper quartile plus 1.5 times the interquartile range.
  • The lower adjacent value . Or colloquially, the “minimum." It represents the lower quartile minus 1.5 times the interquartile range.
  • Outliers . Any values above the “maximum” or below the “minimum.”

[Figure: the anatomy of a box plot. Source: Galarnyk]

For example, the following seaborn-based boxplot shows the distribution of sepal length in three varieties of iris plants, drawing on the popular iris dataset. Our Python Seaborn Tutorial For Beginners is a perfect resource to discover how to create boxplots and other graphs using Python’s popular visualization package, Seaborn.

[Figure: boxplot of sepal length for three iris species]
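A minimal sketch of how such a plot can be produced with seaborn's built-in iris dataset (an approximation of the figure described above, not the original code):

```python
import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")   # columns: sepal_length, ..., species

sns.boxplot(data=iris, x="species", y="sepal_length")
plt.title("Sepal length by iris species")
plt.show()
```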

Scatter plots

Scatter plots are used to visualize the relationship between two continuous variables. Each point on the plot represents a single data point, and the position of the point on the x and y-axis represents the values of the two variables. It is often used in data exploration to understand the data and quickly surface potential correlations.

The following example takes again the iris dataset to plot the relationship between sepal width and sepal length.

[Figure: scatter plot of sepal width vs. sepal length]
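A short seaborn sketch approximating that scatter plot (again, not the original code behind the figure):

```python
import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")

# Each point is one flower; position encodes the two measured variables
sns.scatterplot(data=iris, x="sepal_width", y="sepal_length")
plt.title("Sepal width vs. sepal length")
plt.show()
```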

To have more examples of scatter plots, read our Data Demystified Series on Data Visualizations that Capture Relationships . You can also learn to create a variety of plots, including scatter plots, in our plotting with Matplotlib tutorial .

Bubble plot

Scatter plots can be easily augmented by adding new elements that represent new variables. For example, if we want to plot the relationship between sepal width and sepal length in the different varieties of iris, we could just add colors to the points, as follows:

[Figure: scatter plot colored by iris species]

We could also change the size of the points according to another variable. This is what characterizes the so-called bubble plots. For example, this incredible graph shows the relationship between a country's life expectancy and GDP, adding color to represent the country's region, and size to represent the country's population.

[Figure: bubble plot of life expectancy vs. GDP by country. Source: Gapminder]

We cover bubble plots and how to create them in our course, Intermediate Interactive Data Visualization with plotly in R .

Treemaps

Treemaps are well suited to showing part-to-whole relationships in data. They display hierarchical data as a set of rectangles. Each rectangle is a category within a given variable, and the area of the rectangle is proportional to the size of that category. Compared to similar visualizations, like pie charts, treemaps are often considered more intuitive and preferable.

Below you can find an example.

[Figure: treemap example]

In our Sentiment Analysis in R course , you’ll learn how to use treemaps to visualize sentiment in groups of documents.

Heatmaps

A heatmap is a common and beautiful matrix plot that can be used to graphically summarize the relationship between two variables. The degree of correlation between the two variables is represented by a color code.

For example, this heatmap, extracted from our Intermediate Data Visualization with Seaborn Course, analyzes the occupations of the guests of the Daily Show during the 1999-2012 period. As expected, guests from the acting and media industries are the most frequent attendees.


To learn more about how to create a heatmap , you can check out our tutorial that explores how to make one using Power BI.

Word clouds

Word clouds are useful for visualizing common words in a text or data set. They're similar to bar plots but are often more visually appealing, although at times they can be harder to interpret. Word clouds are useful in the following scenarios:

  • Quickly identify the most important themes or topics in a large body of text.
  • Understand the overall sentiment or tone of a piece of writing.
  • Explore patterns or trends in data that contain textual information.
  • Communicate the key ideas or concepts in a visually engaging way.

Check out our Generating WordClouds in Python Tutorial to discover how to create your own word cloud.

Source: DataCamp

A considerable proportion of the data generated every day is inherently spatial. Spatial data (also sometimes known as geospatial data or geographic information) is data for which a specific location is associated with each record.

Every spatial data point can be located on a map using a certain coordinate reference system. For example, the image below, extracted from our GeoPandas Tutorial , shows the different districts of Barcelona.

Geospatial analysis is a rapidly-evolving field within data science. Maps are at the heart of this discipline. Check out our Working with Geospatial Data in Python Course to start drawing maps today!


Network diagrams

Most data is stored in tables, but that is not the only format available. So-called graphs are better suited for analyzing data that is organized in networks, ranging from online social networks, like Facebook and Twitter, to transportation networks, like metro lines. Network analytics is the subdomain of data science that uses graphs to study networks.

Network graphs consist of two main components: nodes and edges, also known as relationships. This is an example of a simple network graph.

image6.png

Cool right? The possibilities of network graphs are endless. To get a gentle introduction to this field, we highly recommend our Introduction to Network Analysis in Python Course .


Choosing the Right Visualization Technique

We have just presented a small subset of the many data visualization techniques available. Depending on the type of analysis you want to perform, some graphs will be more suitable than others.

For example, if you want to showcase trends and fluctuations in data over time, a line plot is what you’re looking for. By contrast, if you want to analyze the distribution of the data points in a variable, a histogram or a boxplot will be better suited.

When deciding what technique to use, ask yourself the following questions:

  • How many variables do you want to analyze at once? Depending on the answer, you will be performing univariate, bivariate, or multivariate analysis.
  • What do you want to analyze? Each visualization is suitable for analyzing one of the following phenomena:
      • Distribution
      • Correlation
      • Part-to-whole

With practice, matching the visualization technique with the type of data and the question being answered will be a straightforward process.

Tools for Data Visualization

Data visualization tools range from no-code business intelligence tools like Power BI and Tableau to online visualization platforms like DataWrapper and Google Charts. There are also specific packages in popular programming languages for data science, such as Python and R . As such, data visualization is often viewed as the entry point, or “gateway drug,” for many aspiring data practitioners.

When deciding on a data visualization tool, you should consider the following factors:

  • Learning curve . The ease of use and complexity of data visualization tools range considerably. Generally, the more features and capabilities, the steeper the learning curve. Simpler data visualization tools are better suited for non-technical users, but they come with more constraints and limitations.
  • Flexibility . If you want complete control over every aspect of your visualizations, go for tools with wide flexibility. It will take longer to get familiar with them, but once you do, you will be able to produce highly polished, customized visualizations.
  • Type of visualization . Data visualization tools can be categorized depending on whether they focus on independent plots or dashboards. The first category of tools is designed to create one visualization at a time. The second category treats applications or dashboards as the basic unit. Tools like Power BI and Tableau fall within this category.
  • Price . Price is an important factor to consider when choosing a data visualization tool. Depending on your needs and budget, some tools will function better than others.

In the fast-paced field of data visualization, new tools enter the ecosystem every day, and choosing the right one for your needs can be daunting. That's why we have prepared an article with 12 of the Best Data Visualization Tools that may help you make up your mind.


Best Practices for Effective Data Visualization

The main goal of data visualization is to reduce complexity and provide clarity. Choosing the right data visualization technique is vital for success, but there are many other factors to consider. Here are some of the design best practices to effectively communicate data insights with your audience.

  • Consider your audience . As a golden rule, you should always empathize with the audience your visualization is addressing. This means having a good understanding of your audience’s area of expertise, level of technical knowledge, and interests.
  • Clear the clutter . To avoid making unreadable, cluttered visualizations, ask yourself if what you’re including is relevant to the audience, and remove unnecessary elements as much as you can.
  • Keep an eye on the fonts . Even though it can be tempting to use different fonts and sizes, as a general rule of thumb, stick to one font with no more than three different sizes. You should follow the font hierarchy and keep headings larger than the body, as well as use a bold typeface to highlight key elements and headings.
  • Use colors creatively . Color is one of the most eye-catching aspects of any data visualization. As such, put a lot of thought into choosing the color scheme of your data visualization. This means having a consistent color palette across your visualizations and using color systematically to distinguish between groups, levels of importance, and different kinds of information hierarchy.

Making visualizations can fairly be considered an art. Intuition and good taste make a difference, but you should always consider the theory behind them. To learn more about the best practices for effective data visualization, we highly recommend checking out our Data Storytelling & Communication Cheat Sheet. Further, if you are working with dashboards, this article on Best Practices for Designing Dashboards is worth reading.

How to Master Data Visualization Techniques

We hope you enjoyed this article. Now that you have an insight into the state of data visualization, it’s time for practice. DataCamp is here to help. You can find more resources to guide you through your data visualization journey below:

  • Data Visualization with Python Skill Track
  • Data Visualization with R Skill Track
  • Data Visualization in Power BI Course
  • Data Visualization in Tableau Course

  • Data Visualization Cheat Sheet


Decoding Computation Through the Essence of Data Representation

Diving deep into the foundational ways information is encoded.

Jason M. Pittman, Better Programming

The theory of computation boils down to eight core principles . These principles describe the very nature of computation itself — what can be computed, how efficiently, and with what resources. Just as physics seeks to understand the universe’s fundamental laws, the theory of computation reveals the primitives of computing.

While the theory may seem abstract and far removed from practical applications, it defines the entire field of computer science. Upstream, this is the foundation for information technology. Thus, we’re not only interested in the design and analysis of algorithms, software, and hardware but also in understanding what computation means in both practical and philosophical contexts.

With this in mind, let’s explore the third principle in our list of eight: data representation and processing. First, let me ask a question:

When you think about data, what do you see in your mind?

Hold your answer. I promise, we’ll come back to it. For now, I want to outline four data representation and processing elements for us to consider. We can visualize each using our running GameCharacter code example.


Data Representation


While data comes in many forms, most mathematical models work only with real numbers. As a result, we often have to engineer our inputs before model development and inference. Data representation is best illustrated with an example.

This is what the top 3 rows of our dataset look like; we can assume that we have at least a few thousand observations. The goal is to train a probabilistic model to predict each person's likelihood of buying a magazine subscription after being given a free trial.

customer_id | purchased_subscription | income_annualized | age | review | linked_payment_method
12          | TRUE                   | 78,000            | 49  | 4      | CREDIT_CARD
328         | FALSE                  | 45,000            | 21  | 2      | NONE
89          | TRUE                   | 120,000           | 50  | 5      | PAYPAL

We want to predict purchased_subscription using the other columns as predictors.

  • income_annualized is a continuous variable between 0 and 10 million
  • age is a positive integer under 120
  • review is on the scale of 1 to 5 (1 being the worst, 5 being the best)
  • linked_payment_method is one of PAYPAL, CREDIT_CARD or NONE if no payment method is linked
  • customer_id is only for identification purposes (excluded from model)

Continuous Variables

Income is the only continuous variable among our predictors. We can keep income as it is or transform it to another set of real numbers (for instance, we can divide income by 1000 or take its logarithm). In either case, income remains a real number and needs no further preprocessing.

We can also group income into buckets if we think people in the same income bucket share similar purchasing behavior. For instance, any income under 30,000 would be Group 1, any income between 30,000 and 50,000 would be Group 2, etc. This effectively turns income into a categorical variable .

Categorical Variables

linked_payment_method is a categorical variable represented as a string. If we think that all linked payment types are equal, we can represent payment as a binary indicator (payment linked versus no payment linked). If we think that linking Paypal affects Magazine purchase differently than linking Credit Card (maybe there's a hefty fee for Paypal?), then we should divide this into three classes: no payment linked, paypal and credit card.

Let's go with the second assumption. The next step is to expand our single linked_payment_method predictor into 3 different predictors: no_payment_linked, credit_card_linked, paypal_linked. Each of these predictors evaluates to 1 if the customer falls into that category and 0 otherwise. The sum of these three predictors should equal one for every observation.

This method of enumerating categorical variables is known as one-hot encoding . If we one-hot encoded our categorical variable linked_payment_method , our data would look like:

customer_id | purchased_subscription | income_annualized | age | review | no_payment_linked | credit_card_linked | paypal_linked
12          | TRUE                   | 78,000            | 49  | 4      | 0                 | 1                  | 0
328         | FALSE                  | 45,000            | 21  | 2      | 1                 | 0                  | 0
89          | TRUE                   | 120,000           | 50  | 5      | 0                 | 0                  | 1
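If you prefer to see this step as code, here is a minimal sketch of the encoding logic. The function name and return type are illustrative assumptions, not code from the original article; only the category strings mirror the table above.

    #include <array>
    #include <cassert>
    #include <string>

    // Map a linked_payment_method string to the three one-hot indicator columns:
    // {no_payment_linked, credit_card_linked, paypal_linked}.
    std::array<int, 3> one_hot_payment(const std::string& method) {
        if (method == "NONE")        return {1, 0, 0};
        if (method == "CREDIT_CARD") return {0, 1, 0};
        if (method == "PAYPAL")      return {0, 0, 1};
        return {0, 0, 0};   // unknown category; a real pipeline should handle this explicitly
    }

    int main() {
        auto row = one_hot_payment("CREDIT_CARD");
        assert(row[0] + row[1] + row[2] == 1);   // exactly one indicator is set per observation
    }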

Ordinal Variables

review is an ordinal variable. Ordinal variables are categorical variables where the possible values are ordered. We have the flexibility here to leave them as is or to one-hot encode them. If we one-hot encode them, we will lose information about the order.

Interaction Effects between Variables

The effect of certain variables might differ based on the values of another variable. For instance, the purchase behavior of a 50+ year old making 150,000 a year might be different from the purchase behavior of a 20 year old making the same amount of money.

While most models and algorithms will understand these relationships implicitly, it is sometimes better to create new model features to explicitly capture this behavior.

Interaction between Continuous Variables

We can treat both age and income as continuous variables and designate a new variable age_income_interaction as their product. We can then use age_income_interaction as a predictor in our model.

customer_id | income_annualized | age | age_income_interaction
12          | 78,000            | 49  | 49 x 78,000 = 3,822,000
328         | 45,000            | 21  | 21 x 45,000 = 945,000
89          | 120,000           | 50  | 50 x 120,000 = 6,000,000

Interaction between Categorical Variables

Alternatively, we can bucket both age and income and capture the interaction between the buckets. We will one-hot encode this predictor similarly to what we did with the linked_payment_method predictor, crossing two income bands (split at 50,000) with three age bands (the oldest being over 50) to form six Age-Income buckets:

                | Age band 1          | Age band 2          | Age > 50
Income ≤ 50,000 | Age-Income Bucket 1 | Age-Income Bucket 2 | Age-Income Bucket 3
Income > 50,000 | Age-Income Bucket 4 | Age-Income Bucket 5 | Age-Income Bucket 6

Interaction between Categorical and Continuous Variable

Assume we want to stick with our buckets for income , but keep age as a continuous variable. This creates new predictors income_bucket_1_age and income_bucket_2_age . income_bucket_1_age will reflect the customer's age if they fall under bucket 1 and 0 otherwise. Similarly, income_bucket_2_age will reflect the customer's age if they fall under bucket 2 and 0 otherwise.

While these examples deal with interactions between two variables, we can capture interactions for any number of variables. For instance, we can use the interaction between age, income, and review as a predictor.



This course is about learning how computers work, from the perspective of systems software: what makes programs work fast or slow, and how properties of the machines we program impact the programs we write. We want to communicate ideas, tools, and an experimental approach.

The course divides into six units:

  • Data representation
  • Assembly & machine programming
  • Storage & caching
  • Kernel programming
  • Process management
  • Concurrency

The first unit, data representation , is all about how different forms of data can be represented in terms the computer can understand.

Computer memory is kind of like a Lite Brite.


A Lite Brite is a big black backlit pegboard coupled with a supply of colored pegs in a limited set of colors. You can plug in the pegs to make all kinds of designs. A computer’s memory is like a vast pegboard where each slot holds one of 256 different colors. The colors are numbered 0 through 255, so each slot holds one byte . (A byte is a number between 0 and 255, inclusive.)

A slot of computer memory is identified by its address . On a computer with M bytes of memory, and therefore M slots, you can think of the address as a number between 0 and M−1. My laptop has 16 gibibytes of memory, so M = 16×2^30 = 2^34 = 17,179,869,184 = 0x4'0000'0000—a very large number!

The problem of data representation is the problem of representing all the concepts we might want to use in programming—integers, fractions, real numbers, sets, pictures, texts, buildings, animal species, relationships—using the limited medium of addresses and bytes.

Powers of ten and powers of two. Digital computers love the number two and all powers of two. The electronics of digital computers are based on the bit , the smallest unit of storage, which is a base-two digit: either 0 or 1. More complicated objects are represented by collections of bits. This choice has many scale and error-correction advantages. It also refracts upwards to larger choices, and even into terminology. Memory chips, for example, have capacities based on large powers of two, such as 2^30 bytes. Since 2^10 = 1024 is pretty close to 1,000, 2^20 = 1,048,576 is pretty close to a million, and 2^30 = 1,073,741,824 is pretty close to a billion, it’s common to refer to 2^30 bytes of memory as “a gigabyte,” even though that term strictly means 10^9 = 1,000,000,000 bytes. For greater precision, there are terms that explicitly signal the use of powers of two: 2^30 bytes is a gibibyte , where the “-bi-” component means “binary.”
Virtual memory. Modern computers actually abstract their memory spaces using a technique called virtual memory . The lowest-level kind of address, called a physical address , really does take on values between 0 and M −1. However, even on a 16GiB machine like my laptop, the addresses we see in programs can take on values like 0x7ffe'ea2c'aa67 that are much larger than M −1 = 0x3'ffff'ffff . The addresses used in programs are called virtual addresses . They’re incredibly useful for protection: since different running programs have logically independent address spaces, it’s much less likely that a bug in one program will crash the whole machine. We’ll learn about virtual memory in much more depth in the kernel unit ; the distinction between virtual and physical addresses is not as critical for data representation.

Most programming languages prevent their users from directly accessing memory. But not C and C++! These languages let you access any byte of memory with a valid address. This is powerful; it is also very dangerous. But it lets us get a hands-on view of how computers really work.

C++ programs accomplish their work by constructing, examining, and modifying objects . An object is a region of data storage that contains a value, such as the integer 12. (The standard specifically says “a region of data storage in the execution environment, the contents of which can represent values”.) Memory is called “memory” because it remembers object values.

In this unit, we often use functions called hexdump to examine memory. These functions are defined in hexdump.cc . hexdump_object(x) prints out the bytes of memory that comprise an object named x , while hexdump(ptr, size) prints out the size bytes of memory starting at a pointer ptr .

For example, in datarep1/add.cc , we might use hexdump_object to examine the memory used to represent some integers:
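The listing itself isn't reproduced here, but a self-contained sketch of the idea might look like this. The hexdump_object below is a toy stand-in for the course's helper, not its actual implementation; only the three integers follow the surrounding text.

    #include <cstdio>
    #include <cstring>

    // Toy stand-in for the course's hexdump_object: print the address of `x`
    // followed by the bytes that make up its object representation.
    template <typename T>
    void hexdump_object(const T& x) {
        const unsigned char* p = (const unsigned char*) &x;
        std::printf("%p:", (const void*) p);
        for (std::size_t i = 0; i != sizeof(T); ++i) {
            std::printf(" %02x", p[i]);
        }
        std::printf("\n");
    }

    int main() {
        int a = 1;
        int b = 2;
        int c = 3;
        hexdump_object(a);   // on x86-64, prints four bytes: 01 00 00 00
        hexdump_object(b);
        hexdump_object(c);
    }

Its output (not reproduced here) is what the next paragraph describes.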

This display reports that a , b , and c are each four bytes long; that a , b , and c are located at different, nonoverlapping addresses (the long hex number in the first column); and shows us how the numbers 1, 2, and 3 are represented in terms of bytes. (More on that later.)

The compiler, hardware, and standard together define how objects of different types map to bytes. Each object uses a contiguous range of addresses (and thus bytes), and objects never overlap (objects that are active simultaneously are always stored in distinct address ranges).

Since C and C++ are designed to help software interface with hardware devices, their standards are transparent about how objects are stored. A C++ program can ask how big an object is using the sizeof keyword. sizeof(T) returns the number of bytes in the representation of an object of type T , and sizeof(x) returns the size of object x . The result of sizeof is a value of type size_t , which is an unsigned integer type large enough to hold any representable size. On 64-bit architectures, such as x86-64 (our focus in this course), size_t can hold numbers between 0 and 2^64−1.

Qualitatively different objects may have the same data representation. For example, the following three objects have the same data representation on x86-64, which you can verify using hexdump :

In C and C++, you can’t reliably tell the type of an object by looking at the contents of its memory. That’s why tricks like our different addf*.cc functions work.

An object can have many names. For example, here, local and *ptr refer to the same object:
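A sketch of what that example might look like (the values are illustrative):

    #include <cstdio>

    int main() {
        int local = 61;        // one object...
        int* ptr = &local;     // ...with two names: `local` and `*ptr`
        *ptr = 62;             // modifies the same object that `local` names
        std::printf("%d\n", local);   // prints 62
    }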

The different names for an object are sometimes called aliases .
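The next example declares several characters. The original listing is not shown, so this is a sketch consistent with the description that follows; the initial values are assumptions.

    #include <cstdio>

    char ch1 = 'a';           // global variable
    const char ch2 = 'b';     // constant (non-modifiable) global variable

    void f() {
        char ch3 = 'c';              // local variable
        char* ch4 = new char('d');   // local variable holding a pointer to anonymous
                                     // storage allocated by `new char`
        std::printf("%c %c %c %c\n", ch1, ch2, ch3, *ch4);
        delete ch4;
    }

    int main() {
        f();
    }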

There are five objects here:

  • ch1 , a global variable
  • ch2 , a constant (non-modifiable) global variable
  • ch3 , a local variable
  • ch4 , a local variable
  • the anonymous storage allocated by new char and accessed by *ch4

Each object has a lifetime , which is called storage duration by the standard. There are three different kinds of lifetime.

  • static lifetime: The object lasts as long as the program runs. ( ch1 , ch2 )
  • automatic lifetime: The compiler allocates and destroys the object automatically as the program runs, based on the object’s scope (the region of the program in which it is meaningful). ( ch3 , ch4 )
  • dynamic lifetime: The programmer allocates and destroys the object explicitly. ( *ch4 )

Objects with dynamic lifetime aren’t easy to use correctly. Dynamic lifetime causes many serious problems in C programs, including memory leaks, use-after-free, double-free, and so forth. Those serious problems cause undefined behavior and play a “disastrously central role” in “our ongoing computer security nightmare” . But dynamic lifetime is critically important. Only with dynamic lifetime can you construct an object whose size isn’t known at compile time, or construct an object that outlives the function that created it.

The compiler and operating system work together to put objects at different addresses. A program’s address space (which is the range of addresses accessible to a program) divides into regions called segments . Objects with different lifetimes are placed into different segments. The most important segments are:

  • Code (also known as text or read-only data ). Contains instructions and constant global objects. Unmodifiable; static lifetime.
  • Data . Contains non-constant global objects. Modifiable; static lifetime.
  • Heap . Modifiable; dynamic lifetime.
  • Stack . Modifiable; automatic lifetime.

The compiler decides on a segment for each object based on its lifetime. The final compiler phase, which is called the linker , then groups all the program’s objects by segment (so, for instance, global variables from different compiler runs are grouped together into a single segment). Finally, when a program runs, the operating system loads the segments into memory. (The stack and heap segments grow on demand.)

We can use a program to investigate where objects with different lifetimes are stored. (See cs61-lectures/datarep2/mexplore0.cc .) This shows address ranges like this:

Object declaration (C++ program text) | Lifetime (abstract machine) | Segment        | Example address range (runtime location in x86-64 Linux, non-PIE)
Constant global                       | Static                      | Code (or Text) | 0x40'0000 (≈1 × 2^22)
Global                                | Static                      | Data           | 0x60'0000 (≈1.5 × 2^22)
Local                                 | Automatic                   | Stack          | 0x7fff'448d'0000 (≈2^47)
Anonymous, returned by new            | Dynamic                     | Heap           | 0x1a0'0000 (≈1.6 × 2^24)

Constant global data and global data have the same lifetime, but are stored in different segments. The operating system uses different segments so it can prevent the program from modifying constants. It marks the code segment, which contains functions (instructions) and constant global data, as read-only, and any attempt to modify code-segment memory causes a crash (a “Segmentation violation”).

An executable is normally at least as big as the static-lifetime data (the code and data segments together). Since all that data must be in memory for the entire lifetime of the program, it’s written to disk and then loaded by the OS before the program starts running. There is an exception, however: the “bss” segment is used to hold modifiable static-lifetime data with initial value zero. Such data is common, since all static-lifetime data is initialized to zero unless otherwise specified in the program text. Rather than storing a bunch of zeros in the object files and executable, the compiler and linker simply track the location and size of all zero-initialized global data. The operating system sets this memory to zero during the program load process. Clearing memory is faster than loading data from disk, so this optimization saves both time (the program loads faster) and space (the executable is smaller).

Abstract machine and hardware

Programming involves turning an idea into hardware instructions. This transformation happens in multiple steps, some you control and some controlled by other programs.

First you have an idea , like “I want to make a flappy bird iPhone game.” The computer can’t (yet) understand that idea. So you transform the idea into a program , written in some programming language . This process is called programming.

A C++ program actually runs on an abstract machine . The behavior of this machine is defined by the C++ standard , a technical document. This document is supposed to be so precisely written as to have an exact mathematical meaning, defining exactly how every C++ program behaves. But the document can’t run programs!

C++ programs run on hardware (mostly), and the hardware determines what behavior we see. Mapping abstract machine behavior to instructions on real hardware is the task of the C++ compiler (and the standard library and operating system). A C++ compiler is correct if and only if it translates each correct program to instructions that simulate the expected behavior of the abstract machine.

This same rough series of transformations happens for any programming language, although some languages use interpreters rather than compilers.

A bit is the fundamental unit of digital information: it’s either 0 or 1.

C++ manages memory in units of bytes —8 contiguous bits that together can represent numbers between 0 and 255. C’s unit for a byte is char : the abstract machine says a byte is stored in char . That means an unsigned char holds values in the inclusive range [0, 255].

The C++ standard actually doesn’t require that a byte hold 8 bits, and on some crazy machines from decades ago , bytes could hold nine bits! (!?)

But larger numbers, such as 258, don’t fit in a single byte. To represent such numbers, we must use multiple bytes. The abstract machine doesn’t specify exactly how this is done—it’s the compiler and hardware’s job to implement a choice. But modern computers always use place–value notation , just like in decimal numbers. In decimal, the number 258 is written with three digits, the meanings of which are determined both by the digit and by their place in the overall number:

\[ 258 = 2\times10^2 + 5\times10^1 + 8\times10^0 \]

The computer uses base 256 instead of base 10. Two adjacent bytes can represent numbers between 0 and \(255\times256+255 = 65535 = 2^{16}-1\) , inclusive. A number larger than this would take three or more bytes.

\[ 258 = 1\times256^1 + 2\times256^0 \]

On x86-64, the ones place, the least significant byte, is on the left, at the lowest address in the contiguous two-byte range used to represent the integer. This is the opposite of how decimal numbers are written: decimal numbers put the most significant digit on the left. The representation choice of putting the least-significant byte in the lowest address is called little-endian representation. x86-64 uses little-endian representation.

Some computers actually store multi-byte integers the other way, with the most significant byte stored in the lowest address; that’s called big-endian representation. The Internet’s fundamental protocols, such as IP and TCP, also use big-endian order for multi-byte integers, so big-endian is also called “network” byte order.

The C++ standard defines five fundamental unsigned integer types, along with relationships among their sizes. Here they are, along with their actual sizes and ranges on x86-64:

Type                       | Size (abstract machine)    | Size (x86-64) | Range (x86-64)
unsigned char (a byte)     | 1                          | 1             | [0, 255] = [0, 2^8−1]
unsigned short             | ≥1                         | 2             | [0, 65,535] = [0, 2^16−1]
unsigned int (or unsigned) | ≥ size of unsigned short   | 4             | [0, 4,294,967,295] = [0, 2^32−1]
unsigned long              | ≥ size of unsigned int     | 8             | [0, 18,446,744,073,709,551,615] = [0, 2^64−1]
unsigned long long         | ≥ size of unsigned long    | 8             | [0, 18,446,744,073,709,551,615] = [0, 2^64−1]

Other architectures and operating systems implement different ranges for these types. For instance, on IA32 machines like Intel’s Pentium (the 32-bit processors that predated x86-64), sizeof(long) was 4, not 8.

Note that all values of a smaller unsigned integer type can fit in any larger unsigned integer type. When a value of a larger unsigned integer type is placed in a smaller unsigned integer object, however, not every value fits; for instance, the unsigned short value 258 doesn’t fit in an unsigned char x . When this occurs, the C++ abstract machine requires that the smaller object’s value equals the least -significant bits of the larger value (so x will equal 2).

In addition to these types, whose sizes can vary, C++ has integer types whose sizes are fixed. uint8_t , uint16_t , uint32_t , and uint64_t define 8-bit, 16-bit, 32-bit, and 64-bit unsigned integers, respectively; on x86-64, these correspond to unsigned char , unsigned short , unsigned int , and unsigned long .

This general procedure is used to represent a multi-byte integer in memory.

  • Write the large integer in hexadecimal format, including all leading zeros required by the type size. For example, the unsigned value 65534 would be written 0x0000FFFE . There will be twice as many hexadecimal digits as sizeof(TYPE) .
  • Divide the integer into its component bytes, which are its digits in base 256. In our example, they are, from most to least significant, 0x00, 0x00, 0xFF, and 0xFE.

In little-endian representation, the bytes are stored in memory from least to most significant. If our example was stored at address 0x30, the byte 0xFE would sit at address 0x30, 0xFF at 0x31, 0x00 at 0x32, and 0x00 at 0x33.

In big-endian representation, the bytes are stored in the reverse order.
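A small sketch makes the byte order visible, using std::memcpy to copy an object's representation into a byte array:

    #include <cstdio>
    #include <cstring>

    int main() {
        unsigned x = 65534;                    // 0x0000FFFE
        unsigned char bytes[sizeof(x)];
        std::memcpy(bytes, &x, sizeof(x));     // copy the object representation
        for (std::size_t i = 0; i != sizeof(x); ++i) {
            std::printf("%02x ", bytes[i]);    // prints "fe ff 00 00" on a little-endian machine such as x86-64
        }
        std::printf("\n");
    }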

Computers are often fastest at dealing with fixed-length numbers, rather than variable-length numbers, and processor internals are organized around a fixed word size . A word is the natural unit of data used by a processor design . In most modern processors, this natural unit is 8 bytes or 64 bits , because this is the power-of-two number of bytes big enough to hold those processors’ memory addresses. Many older processors could access less memory and had correspondingly smaller word sizes, such as 4 bytes (32 bits).

The best representation for signed integers—and the choice made by x86-64, and by the C++20 abstract machine—is two’s complement . Two’s complement representation is based on this principle: Addition and subtraction of signed integers shall use the same instructions as addition and subtraction of unsigned integers.

To see what this means, let’s think about what -x should mean when x is an unsigned integer. Wait, negative unsigned?! This isn’t an oxymoron because C++ uses modular arithmetic for unsigned integers: the result of an arithmetic operation on unsigned values is always taken modulo 2^B, where B is the number of bits in the unsigned value type. Thus, on x86-64, -x is simply the number that, when added to x , yields 0 (mod 2^B). For example, when unsigned x = 0xFFFFFFFFU , then -x == 1U , since x + -x equals zero (mod 2^32).

To obtain -x , we flip all the bits in x (an operation written ~x ) and then add 1. To see why, consider the bit representations. What is x + (~x + 1) ? Well, (~x)_i (the i-th bit of ~x ) is 1 whenever x_i is 0, and vice versa. That means that every bit of x + ~x is 1 (there are no carries), and x + ~x is the largest unsigned integer, with value 2^B−1. If we add 1 to this, we get 2^B. Which is 0 (mod 2^B)! The highest “carry” bit is dropped, leaving zero.
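A quick check of the flip-and-add-one rule (a sketch; the assertions assume x86-64, where unsigned is 32 bits):

    #include <cassert>

    int main() {
        unsigned x = 0xFFFFFFFFU;
        assert(-x == 1U);                // x + (-x) == 0 (mod 2^32)
        unsigned y = 12U;
        assert(-y == ~y + 1);            // negation is "flip the bits, then add one"
        assert(y + ~y == 0xFFFFFFFFU);   // y + ~y is all ones, the largest unsigned value
    }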

Two’s complement arithmetic uses half of the unsigned integer representations for negative numbers. A two’s-complement signed integer with B bits has the following values:

  • If the most-significant bit is 1, the represented number is negative. Specifically, the represented number is – (~x + 1) , where the outer negative sign is mathematical negation (not computer arithmetic).
  • If every bit is 0, the represented number is 0.
  • If the most-significant bit is 0 but some other bit is 1, the represented number is positive.

The most significant bit is also called the sign bit , because if it is 1, then the represented value depends on the signedness of the type (and that value is negative for signed types).

Another way to think about two’s-complement is that, for B -bit integers, the most-significant bit has place value 2^(B−1) in unsigned arithmetic and −2^(B−1) in signed arithmetic. All other bits have the same place values in both kinds of arithmetic.

The two’s-complement bit pattern for x + y is the same whether x and y are considered as signed or unsigned values. For example, in 4-bit arithmetic, 5 has representation 0b0101 , while the representation 0b1100 represents 12 if unsigned and –4 if signed ( ~0b1100 + 1 = 0b0011 + 1 == 4). Let’s add those bit patterns and see what we get:
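Working it out, and dropping the carry out of the fourth bit:

\[ 0101_2 + 1100_2 = 1\,0001_2 \;\longrightarrow\; 0001_2 = 1 \]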

Note that this is the right answer for both signed and unsigned arithmetic : 5 + 12 = 17 = 1 (mod 16), and 5 + -4 = 1.

Subtraction and multiplication also produce the same results for unsigned arithmetic and signed two’s-complement arithmetic. (For instance, 5 * 12 = 60 = 12 (mod 16), and 5 * -4 = -20 = -4 (mod 16).) This is not true of division. (Consider dividing the 4-bit representation 0b1110 by 2. In signed arithmetic, 0b1110 represents -2, so 0b1110/2 == 0b1111 (-1); but in unsigned arithmetic, 0b1110 is 14, so 0b1110/2 == 0b0111 (7).) And, of course, it is not true of comparison. In signed 4-bit arithmetic, 0b1110 < 0 , but in unsigned 4-bit arithmetic, 0b1110 > 0 . This means that a C compiler for a two’s-complement machine can use a single add instruction for either signed or unsigned numbers, but it must generate different instruction patterns for signed and unsigned division (or less-than, or greater-than).

There are a couple of quirks with C signed arithmetic. First, in two’s complement, there are more negative numbers than positive numbers. A representation whose sign bit is 1 and whose other bits are all 0 has no positive counterpart at the same bit width: for this number, -x == x . (In 4-bit arithmetic, -0b1000 == ~0b1000 + 1 == 0b0111 + 1 == 0b1000 .) Second, and far worse, arithmetic overflow on signed integers is undefined behavior .

Type                 | Size (abstract machine)        | Size (x86-64) | Range (x86-64)
signed char          | 1                              | 1             | [−128, 127] = [−2^7, 2^7−1]
short (or short int) | = size of unsigned short       | 2             | [−32,768, 32,767] = [−2^15, 2^15−1]
int                  | = size of unsigned int         | 4             | [−2,147,483,648, 2,147,483,647] = [−2^31, 2^31−1]
long                 | = size of unsigned long        | 8             | [−9,223,372,036,854,775,808, 9,223,372,036,854,775,807] = [−2^63, 2^63−1]
long long            | = size of unsigned long long   | 8             | [−9,223,372,036,854,775,808, 9,223,372,036,854,775,807] = [−2^63, 2^63−1]

The C++ abstract machine requires that signed integers have the same sizes as their unsigned counterparts.

We distinguish pointers , which are concepts in the C abstract machine, from addresses , which are hardware concepts. A pointer combines an address and a type.

The memory representation of a pointer is the same as the representation of its address value. The size of that integer is the machine’s word size; for example, on x86-64, a pointer occupies 8 bytes, and a pointer to an object located at address 0x400abc would be stored as the 8-byte value 0x0000'0000'0040'0abc (in little-endian order, its bytes are 0xbc, 0x0a, 0x40, 0x00, 0x00, 0x00, 0x00, 0x00 from lowest to highest address).

The C++ abstract machine defines an unsigned integer type uintptr_t that can hold any address. (You have to #include <inttypes.h> or <cinttypes> to get the definition.) On most machines, including x86-64, uintptr_t is the same as unsigned long . Cast a pointer to an integer address value with syntax like (uintptr_t) ptr ; cast back to a pointer with syntax like (T*) addr . Casts between pointer types and uintptr_t are information preserving, so this assertion will never fail:
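For instance, a round trip like the following (a sketch; the variable names are illustrative) should always succeed:

    #include <cassert>
    #include <inttypes.h>

    int main() {
        int x = 61;
        int* ptr = &x;
        uintptr_t addr = (uintptr_t) ptr;   // pointer -> integer address
        int* ptr2 = (int*) addr;            // integer address -> pointer
        assert(ptr == ptr2);                // the round trip preserves the pointer
    }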

Since it is a 64-bit architecture, the size of an x86-64 address is 64 bits (8 bytes). That’s also the size of x86-64 pointers.

To represent an array of integers, C++ and C allocate the integers next to each other in memory, in sequential addresses, with no gaps or overlaps. Here, we put the integers 0, 1, and 258 next to each other, starting at address 1008: the first int occupies addresses 1008–1011, the second 1012–1015, and the third 1016–1019.

Say that you have an array of N integers, and you access each of those integers in order, accessing each integer exactly once. Does the order matter?

Computer memory is random-access memory (RAM), which means that a program can access any bytes of memory in any order—it’s not, for example, required to read memory in ascending order by address. But if we run experiments, we can see that even in RAM, different access orders have very different performance characteristics.

Our arraysum program sums up all the integers in an array of N integers, using an access order based on its arguments, and prints the resulting delay. Here’s the result of a couple experiments on accessing 10,000,000 items in three orders, “up” order (sequential: elements 0, 1, 2, 3, …), “down” order (reverse sequential: N , N −1, N −2, …), and “random” order (as it sounds).

order  | trial 1  | trial 2  | trial 3
up     | 8.9 ms   | 7.9 ms   | 7.4 ms
down   | 9.2 ms   | 8.9 ms   | 10.6 ms
random | 316.8 ms | 352.0 ms | 360.8 ms

Wow! Down order is just a bit slower than up, but random order seems about 40 times slower. Why?

Random order is defeating many of the internal architectural optimizations that make memory access fast on modern machines. Sequential order, since it’s more predictable, is much easier to optimize.

Foreshadowing. This part of the lecture is a teaser for the Storage unit, where we cover access patterns and caching, including the processor caches that explain this phenomenon, in much more depth.

The C++ programming language offers several collection mechanisms for grouping subobjects together into new kinds of object. The collections are arrays, structs, and unions. (Classes are a kind of struct. All library types, such as vectors, lists, and hash tables, use combinations of these collection types.) The abstract machine defines how subobjects are laid out inside a collection. This is important, because it lets C/C++ programs exchange messages with hardware and even with programs written in other languages: messages can be exchanged only when both parties agree on layout.

Array layout in C++ is particularly simple: the objects in an array are laid out sequentially in memory, with no gaps or overlaps. Assume a declaration like T x[N] , where x is an array of N objects of type T , and say that the address of x is a . Then the address of element x[i] equals a + i * sizeof(T) , and sizeof(x) == N * sizeof(T) .

Sidebar: Vector representation

The C++ library type std::vector defines an array that can grow and shrink. For instance, this function creates a vector containing the numbers 0 up to N in sequence:
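A sketch of such a function (the original isn't shown; this is one plausible version, and the function name is an assumption):

    #include <vector>

    std::vector<int> make_sequence(int n) {
        std::vector<int> v;
        for (int i = 0; i != n; ++i) {
            v.push_back(i);    // v grows as needed; its contents live on the heap
        }
        return v;
    }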

Here, v is an object with automatic lifetime. This means its size (in the sizeof sense) is fixed at compile time. Remember that the sizes of static- and automatic-lifetime objects must be known at compile time; only dynamic-lifetime objects can have varying size based on runtime parameters. So where and how are v ’s contents stored?

The C++ abstract machine requires that v ’s elements are stored in an array in memory. (The v.data() method returns a pointer to the first element of the array.) But it does not define std::vector ’s layout otherwise, and C++ library designers can choose different layouts based on their needs. We found these to hold for the std::vector in our library:

sizeof(v) == 24 for any vector of any type, and the address of v is a stack address (i.e., v is located in the stack segment).

The first 8 bytes of the vector hold the address of the first element of the contents array—call it the begin address . This address is a heap address, which is as expected, since the contents must have dynamic lifetime. The value of the begin address is the same as that of v.data() .

Bytes 8–15 hold the address just past the contents array—call it the end address . Its value is the same as &v.data()[v.size()] . If the vector is empty, then the begin address and the end address are the same.

Bytes 16–23 hold an address greater than or equal to the end address. This is the capacity address . As a vector grows, it will sometimes outgrow its current location and move its contents to new memory addresses. To reduce the number of copies, vectors usually request more memory from the operating system than they immediately need; this additional space, which is called “capacity,” supports cheap growth. Often the capacity doubles on each growth spurt, since this allows operations like v.push_back() to execute in O(1) time on average.
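One way to check those observations is a sketch like the following. It peeks at layout details that the abstract machine leaves unspecified, so it is only a diagnostic experiment for the particular library described above, not portable code.

    #include <cassert>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    int main() {
        std::vector<int> v = {1, 2, 3};
        assert(sizeof(v) == 24);                    // three 8-byte words in this library's layout

        std::uintptr_t words[3];
        std::memcpy(words, &v, sizeof(words));      // copy the three stored addresses
        assert(words[0] == (std::uintptr_t) v.data());                 // begin address
        assert(words[1] == (std::uintptr_t) (v.data() + v.size()));    // end address
        assert(words[2] >= words[1]);                                  // capacity address
    }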

Compilers must also decide where different objects are stored when those objects are not part of a collection. For instance, consider this program:
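The program isn't reproduced here, but a sketch along these lines (the variable names follow the discussion below; the values are arbitrary) shows the idea:

    #include <cstdio>

    int main() {
        int i1 = 1;
        int i2 = 2;
        char c1 = 'a';
        char c2 = 'b';
        char c3 = 'c';
        // Print each object's address to see where the compiler placed it.
        std::printf("%p %p %p %p %p\n", (void*) &i1, (void*) &i2,
                    (void*) &c1, (void*) &c2, (void*) &c3);
    }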

The abstract machine says these objects cannot overlap, but does not otherwise constrain their positions in memory.

On Linux, GCC will put all these variables into the stack segment, which we can see using hexdump . But it can put them in the stack segment in any order , as we can see by reordering the declarations (try declaration order i1 , c1 , i2 , c2 , c3 ), by changing optimization levels, or by adding different scopes (braces). The abstract machine gives the programmer no guarantees about how object addresses relate. In fact, the compiler may move objects around during execution, as long as it ensures that the program behaves according to the abstract machine. Modern optimizing compilers often do this, particularly for automatic objects.

But what order does the compiler choose? With optimization disabled, the compiler appears to lay out objects in decreasing order by declaration, so the first declared variable in the function has the highest address. With optimization enabled, the compiler follows roughly the same guideline, but it also rearranges objects by type—for instance, it tends to group char s together—and it can reuse space if different variables in the same function have disjoint lifetimes. The optimizing compiler tends to use less space for the same set of variables. This is because it’s arranging objects by alignment.

The C++ compiler and library restrict the addresses at which some kinds of data appear. In particular, the address of every int value is always a multiple of 4, whether it’s located on the stack (automatic lifetime), the data segment (static lifetime), or the heap (dynamic lifetime).

A bunch of observations will show you these rules:

Type                              | Size | Address restriction | alignof(T)
char (signed char, unsigned char) | 1    | No restriction      | 1
short (unsigned short)            | 2    | Multiple of 2       | 2
int (unsigned int)                | 4    | Multiple of 4       | 4
long (unsigned long)              | 8    | Multiple of 8       | 8
float                             | 4    | Multiple of 4       | 4
double                            | 8    | Multiple of 8       | 8
long double                       | 16   | Multiple of 16      | 16
T* (any pointer type)             | 8    | Multiple of 8       | 8

These are the alignment restrictions for an x86-64 Linux machine.

These restrictions hold for most x86-64 operating systems, except that on Windows, the long type has size and alignment 4. (The long long type has size and alignment 8 on all x86-64 operating systems.)

Just like every type has a size, every type has an alignment. The alignment of a type T is a number a ≥1 such that the address of every object of type T must be a multiple of a . Every object with type T has size sizeof(T) —it occupies sizeof(T) contiguous bytes of memory; and has alignment alignof(T) —the address of its first byte is a multiple of alignof(T) . You can also say sizeof(x) and alignof(x) where x is the name of an object or another expression.

Alignment restrictions can make hardware simpler, and therefore faster. For instance, consider cache blocks. CPUs access memory through a transparent hardware cache. Data moves from primary memory, or RAM (which is large—a couple gigabytes on most laptops—and uses cheaper, slower technology) to the cache in units of 64 or 128 bytes. Those units are always aligned: on a machine with 128-byte cache blocks, the bytes with memory addresses [127, 128, 129, 130] live in two different cache blocks (with addresses [0, 127] and [128, 255]). But the 4 bytes with addresses [4n, 4n+1, 4n+2, 4n+3] always live in the same cache block. (This is true for any small power of two: the 8 bytes with addresses [8n,…,8n+7] always live in the same cache block.) In general, it’s often possible to make a system faster by leveraging restrictions—and here, the CPU hardware can load data faster when it can assume that the data lives in exactly one cache line.

The compiler, library, and operating system all work together to enforce alignment restrictions.

On x86-64 Linux, alignof(T) == sizeof(T) for all fundamental types (the types built in to C: integer types, floating point types, and pointers). But this isn’t always true; on x86-32 Linux, double has size 8 but alignment 4.

It’s possible to construct user-defined types of arbitrary size, but the largest alignment required by a machine is fixed for that machine. C++ lets you find the maximum alignment for a machine with alignof(std::max_align_t) ; on x86-64, this is 16, the alignment of the type long double (and the alignment of some less-commonly-used SIMD “vector” types ).

We now turn to the abstract machine rules for laying out all collections. The sizes and alignments for user-defined types—arrays, structs, and unions—are derived from a couple simple rules or principles. Here they are. The first rule applies to all types.

1. First-member rule. The address of the first member of a collection equals the address of the collection.

Thus, the address of an array is the same as the address of its first element. The address of a struct is the same as the address of the first member of the struct.

The next three rules depend on the class of collection. Every C abstract machine enforces these rules.

2. Array rule. Arrays are laid out sequentially as described above.

3. Struct rule. The second and subsequent members of a struct are laid out in order, with no overlap, subject to alignment constraints.

4. Union rule. All members of a union share the address of the union.

In C, every struct follows the struct rule, but in C++, only simple structs follow the rule. Complicated structs, such as structs with some public and some private members, or structs with virtual functions, can be laid out however the compiler chooses. The typical situation is that C++ compilers for a machine architecture (e.g., “Linux x86-64”) will all agree on a layout procedure for complicated structs. This allows code compiled by different compilers to interoperate.

The next rule defines the operation of the malloc library function.

5. Malloc rule. Any non-null pointer returned by malloc has alignment appropriate for any type. In other words, assuming the allocated size is adequate, the pointer returned from malloc can safely be cast to T* for any T .

Oddly, this holds even for small allocations. The C++ standard (the abstract machine) requires that malloc(1) return a pointer whose alignment is appropriate for any type, including types that don’t fit.

And the final rule is not required by the abstract machine, but it’s how sizes and alignments on our machines work.

6. Minimum rule. The sizes and alignments of user-defined types, and the offsets of struct members, are minimized within the constraints of the other rules.

The minimum rule, and the sizes and alignments of basic types, are defined by the x86-64 Linux “ABI” —its Application Binary Interface. This specification standardizes how x86-64 Linux C compilers should behave, and lets users mix and match compilers without problems.

Consequences of the size and alignment rules

From these rules we can derive some interesting consequences.

First, the size of every type is a multiple of its alignment .

To see why, consider an array with two elements. By the array rule, these elements have addresses a and a+sizeof(T) , where a is the address of the array. Both of these addresses contain a T , so they are both a multiple of alignof(T) . That means sizeof(T) is also a multiple of alignof(T) .

We can also characterize the sizes and alignments of different collections .

  • The size of an array of N elements of type T is N * sizeof(T) : the sum of the sizes of its elements. The alignment of the array is alignof(T) .
  • The size of a union is the maximum of the sizes of its components (because the union can only hold one component at a time). Its alignment is also the maximum of the alignments of its components.
  • The size of a struct is at least as big as the sum of the sizes of its components. Its alignment is the maximum of the alignments of its components.

Thus, the alignment of every collection equals the maximum of the alignments of its components.

It’s also true that the alignment equals the least common multiple of the alignments of its components. You might have thought lcm was a better answer, but the max is the same as the lcm for every architecture that matters, because all fundamental alignments are powers of two.

The size of a struct might be larger than the sum of the sizes of its components, because of alignment constraints. Since the compiler must lay out struct components in order, and it must obey the components’ alignment constraints, and it must ensure different components occupy disjoint addresses, it must sometimes introduce extra space in structs. Here’s an example: the struct will have 3 bytes of padding after char c , to ensure that int i2 has the correct alignment.
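A sketch of such a struct (the member names follow the text; the struct name is an assumption, and the sizes assume x86-64 Linux):

    #include <cstdio>

    struct example {
        char c;     // offset 0
                    // offsets 1-3: three bytes of padding, so that i2 is 4-byte aligned
        int i2;     // offsets 4-7
    };

    int main() {
        std::printf("%zu\n", sizeof(example));   // prints 8: 1 byte + 3 bytes of padding + 4 bytes
    }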

Thanks to padding, reordering struct components can sometimes reduce the total size of a struct. Padding can happen at the end of a struct as well as the middle. Padding can never happen at the start of a struct, however (because of Rule 1).

The rules also imply that the offset of any struct member —which is the difference between the address of the member and the address of the containing struct— is a multiple of the member’s alignment .

To see why, consider a struct s with member m at offset o . The malloc rule says that any pointer returned from malloc is correctly aligned for s . Every pointer returned from malloc is maximally aligned, equalling 16*x for some integer x . The struct rule says that the address of m , which is 16*x + o , is correctly aligned. That means that 16*x + o = alignof(m)*y for some integer y . Divide both sides by a = alignof(m) and you see that 16*x/a + o/a = y . But 16/a is an integer—the maximum alignment is a multiple of every alignment—so 16*x/a is an integer. We can conclude that o/a must also be an integer!

Finally, we can also derive the necessity for padding at the end of structs. (How?)

What happens when an object is uninitialized? The answer depends on its lifetime.

  • static lifetime (e.g., int global; at file scope): The object is initialized to 0.
  • automatic or dynamic lifetime (e.g., int local; in a function, or int* ptr = new int ): The object is uninitialized and reading the object’s value before it is assigned causes undefined behavior.

Compiler hijinks

In C++, most dynamic memory allocation uses special language operators, new and delete , rather than library functions.

Though this seems more complex than the library-function style, it has advantages. A C compiler cannot tell what malloc and free do (especially when they are redefined to debugging versions, as in the problem set), so a C compiler cannot necessarily optimize calls to malloc and free away. But the C++ compiler may assume that all uses of new and delete follow the rules laid down by the abstract machine. That means that if the compiler can prove that an allocation is unnecessary or unused, it is free to remove that allocation!

For example, we compiled this program in the problem set environment (based on test003.cc ):

The optimizing C++ compiler removes all calls to new and delete , leaving only the call to m61_printstatistics() ! (For instance, try objdump -d testXXX to look at the compiled x86-64 instructions.) This is valid because the compiler is explicitly allowed to eliminate unused allocations, and here, since the ptrs variable is local and doesn’t escape main , all allocations are unused. The C compiler cannot perform this useful transformation. (But the C compiler can do other cool things, such as unroll the loops .)

One of C’s more interesting choices is that it explicitly relates pointers and arrays. Although arrays are laid out in memory in a specific way, they generally behave like pointers when they are used. This property probably arose from C’s desire to explicitly model memory as an array of bytes, and it has beautiful and confounding effects.

We’ve already seen one of these effects. The hexdump function takes a pointer and a size as its arguments. But we can pass an array directly as the pointer argument: when an array is used in an expression like this (here, as a function argument), it magically changes into a pointer to its first element, so passing the array means exactly the same thing as passing a pointer to its first element.
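A self-contained sketch of the idea (this toy hexdump is a stand-in for the course's version, not its actual implementation; the parameter names are assumptions):

    #include <cstdio>
    #include <cstring>

    // Toy stand-in: print `size` bytes starting at `ptr`.
    void hexdump(const void* ptr, std::size_t size) {
        const unsigned char* p = (const unsigned char*) ptr;
        for (std::size_t i = 0; i != size; ++i) {
            std::printf("%02x ", p[i]);
        }
        std::printf("\n");
    }

    int main() {
        int arr[4] = {1, 2, 258, 0};
        hexdump(arr, sizeof(arr));      // the array decays to a pointer to its first element...
        hexdump(&arr[0], sizeof(arr));  // ...so this call means exactly the same thing
    }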

C programmers transition between arrays and pointers very naturally.

A confounding effect is that unlike all other types, in C arrays are passed to and returned from functions by reference rather than by value. C is a call-by-value language except for arrays. This means that all function arguments and return values are copied, so that parameter modifications inside a function do not affect the objects passed by the caller—except for arrays. For instance:

    #include <cstdio>

    void f(int a[2]) {
        a[0] = 1;
    }

    int main() {
        int x[2] = {100, 101};
        f(x);
        printf("%d\n", x[0]);   // prints 1!
    }

If you don’t like this behavior, you can get around it by using a struct or a C++ std::array .

    #include <array>
    #include <cstdio>

    struct array1 {
        int a[2];
    };

    void f1(array1 arg) {
        arg.a[0] = 1;
    }

    void f2(std::array<int, 2> a) {
        a[0] = 1;
    }

    int main() {
        array1 x = {{100, 101}};
        f1(x);
        printf("%d\n", x.a[0]);    // prints 100
        std::array<int, 2> x2 = {100, 101};
        f2(x2);
        printf("%d\n", x2[0]);     // prints 100
    }

C++ extends the logic of this array–pointer correspondence to support arithmetic on pointers as well.

Pointer arithmetic rule. In the C abstract machine, arithmetic on pointers produces the same result as arithmetic on the corresponding array indexes.

Specifically, consider an array T a[n] and pointers T* p1 = &a[i] and T* p2 = &a[j] . Then:

Equality : p1 == p2 if and only if (iff) p1 and p2 point to the same address, which happens iff i == j .

Inequality : Similarly, p1 != p2 iff i != j .

Less-than : p1 < p2 iff i < j .

Also, p1 <= p2 iff i <= j ; and p1 > p2 iff i > j ; and p1 >= p2 iff i >= j .

Pointer difference : What should p1 - p2 mean? Using array indexes as the basis, p1 - p2 == i - j . (But the type of the difference is always ptrdiff_t , which on x86-64 is long , the signed version of size_t .)

Addition : p1 + k (where k is an integer type) equals the pointer &a[i + k] . ( k + p1 returns the same thing.)

Subtraction : p1 - k equals &a[i - k] .

Increment and decrement : ++p1 means p1 = p1 + 1 , which means p1 = &a[i + 1] . Similarly, --p1 means p1 = &a[i - 1] . (There are also postfix versions, p1++ and p1-- , but C++ style prefers the prefix versions.)

No other arithmetic operations on pointers are allowed. You can’t multiply pointers, for example. (You can multiply addresses by casting the pointers to the address type, uintptr_t —so (uintptr_t) p1 * (uintptr_t) p2 —but why would you?)
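A quick illustration of these rules (a sketch with an arbitrary array):

    #include <cstdio>

    int main() {
        int a[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        int* p1 = &a[2];
        int* p2 = &a[7];
        printf("%d\n", p1 < p2);            // 1, because 2 < 7
        printf("%ld\n", (long) (p2 - p1));  // 5, i.e., 7 - 2
        printf("%d\n", *(p1 + 3));          // 5, because p1 + 3 == &a[5]
        ++p1;                               // now p1 == &a[3]
        printf("%d\n", *p1);                // 3
    }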

From pointers to iterators

Let’s write a function that can sum all the integers in an array.
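A sketch of such a function (the original isn't reproduced here; assume it takes the array and an element count):

    int sum(int a[], int size) {
        int s = 0;
        for (int i = 0; i != size; ++i) {
            s += a[i];
        }
        return s;
    }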

This function can compute the sum of the elements of any int array. But because of the pointer–array relationship, its a argument is really a pointer . That allows us to call it with subarrays as well as with whole arrays. For instance:
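For example (continuing the sketch; the array contents are arbitrary):

    int a[6] = {1, 2, 3, 4, 5, 6};
    sum(a, 6);          // sum of the whole array: 21
    sum(a, 3);          // sum of the first three elements: 6
    sum(a + 2, 3);      // sum of the subarray a[2], a[3], a[4]: 12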

This way of thinking about arrays naturally leads to a style that avoids sizes entirely, using instead a sentinel or boundary argument that defines the end of the interesting part of the array.
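In that style, sum might take a pointer to the first element and a pointer just past the last interesting element (a sketch):

    int sum(int* first, int* last) {
        int s = 0;
        while (first != last) {
            s += *first;
            ++first;
        }
        return s;
    }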

These expressions compute the same sums as the above:
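Continuing the sketches above:

    sum(a, a + 6);          // 21
    sum(a, a + 3);          // 6
    sum(a + 2, a + 5);      // 12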

Note that the data from first to last forms a half-open range . In mathematical notation, we care about elements in the range [first, last) : the element pointed to by first is included (if it exists), but the element pointed to by last is not. Half-open ranges give us a simple and clear way to describe empty ranges, such as zero-element arrays: if first == last , then the range is empty.

Note that given a ten-element array a , the pointer a + 10 can be formed and compared, but must not be dereferenced—the element a[10] does not exist. The C/C++ abstract machines allow users to form pointers to the “one-past-the-end” boundary elements of arrays, but users must not dereference such pointers.

So in C, two pointers naturally express a range of an array. The C++ standard template library, or STL, brilliantly abstracts this pointer notion to allow two iterators , which are pointer-like objects, to express a range of any standard data structure—an array, a vector, a hash table, a balanced tree, whatever. This version of sum works for any container of int s; notice how little it changed:
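A sketch of the iterator version (the template parameter name is arbitrary):

    template <typename It>
    int sum(It first, It last) {
        int s = 0;
        while (first != last) {
            s += *first;
            ++first;
        }
        return s;
    }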

Some example uses:
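For instance (a sketch, reusing the array a from the earlier examples):

    #include <list>
    #include <vector>

    std::vector<int> v = {1, 2, 3, 4, 5, 6};
    std::list<int> l = {4, 5, 6};

    sum(v.begin(), v.end());    // 21
    sum(l.begin(), l.end());    // 15
    sum(a, a + 6);              // plain arrays still work: 21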

Addresses vs. pointers

What’s the difference between these expressions? (Again, a is an array of type T , and p1 == &a[i] and p2 == &a[j] .)
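Call the two results d1 and d2, as the discussion below does; the expressions have this shape:

    ptrdiff_t d1 = p1 - p2;                            // pointer arithmetic
    uintptr_t d2 = (uintptr_t) p1 - (uintptr_t) p2;    // address arithmetic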

The first expression is defined analogously to index arithmetic, so d1 == i - j . But the second expression performs the arithmetic on the addresses corresponding to those pointers, so we would expect d2 to equal sizeof(T) * d1 . Always be aware of which kind of arithmetic you’re using. Generally, arithmetic on pointers should not involve sizeof , since the sizeof is included automatically according to the abstract machine; but arithmetic on addresses almost always should involve sizeof .

Although C++ is a low-level language, the abstract machine is surprisingly strict about which pointers may be formed and how they can be used. Violate the rules and you’re in hell because you have invoked the dreaded undefined behavior .

Given an array a[N] of N elements of type T :

Forming a pointer &a[i] (or a + i ) with 0 ≤ i ≤ N is safe.

Forming a pointer &a[i] with i < 0 or i > N causes undefined behavior.

Dereferencing a pointer &a[i] with 0 ≤ i < N is safe.

Dereferencing a pointer &a[i] with i < 0 or i ≥ N causes undefined behavior.

(For the purposes of these rules, objects that are not arrays count as single-element arrays. So given T x , we can safely form &x and &x + 1 and dereference &x .)

What “undefined behavior” means is horrible. A program that executes undefined behavior is erroneous. But the compiler need not catch the error. In fact, the abstract machine says anything goes : undefined behavior is “behavior … for which this International Standard imposes no requirements.” “Possible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message).” Other possible behaviors include allowing hackers from the moon to steal all of a program’s data, take it over, and force it to delete the hard drive on which it is running. Once undefined behavior executes, a program may do anything, including making demons fly out of the programmer’s nose.

Pointer arithmetic, and even pointer comparisons, are also affected by undefined behavior. It’s undefined to go beyond an array’s bounds using pointer arithmetic. Pointers may be compared for equality or inequality even if they point to different arrays or objects, but if you try to compare pointers into different arrays via less-than, like this:
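(a sketch, with a and b as two unrelated arrays)

    int a[10];
    int b[10];
    bool lt = &a[0] < &b[0];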

that causes undefined behavior.

If you really want to compare pointers that might be to different arrays—for instance, you’re writing a hash function for arbitrary pointers—cast them to uintptr_t first.

Undefined behavior and optimization

A program that causes undefined behavior is not a C++ program . The abstract machine says that a C++ program, by definition, is a program whose behavior is always defined. The C++ compiler is allowed to assume that its input is a C++ program. (Obviously!) So the compiler can assume that its input program will never cause undefined behavior. Thus, since undefined behavior is “impossible,” if the compiler can prove that a condition would cause undefined behavior later, it can assume that condition will never occur.

Consider this program:
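The program isn't reproduced here; a sketch of its essential shape (how the pointer value is parsed from the command line is an assumption) is:

    #include <cassert>
    #include <cstdio>
    #include <cstdlib>

    int main(int argc, char* argv[]) {
        // Interpret the command-line argument as a pointer value.
        char* x = (char*) strtoul(argv[1], nullptr, 0);
        assert(x + 1 > x);                      // "obviously" true?
        printf("x     = %p\n", (void*) x);
        printf("x + 1 = %p\n", (void*) (x + 1));
    }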

If we supply a value equal to (char*) -1 , we’re likely to see the program print the two pointer values, with no assertion failure! But that’s an apparently impossible result. The printout can only happen if x + 1 > x (otherwise, the assertion will fail and stop the printout). But x + 1 , which equals 0 , is less than x , which is the largest 8-byte value!

The impossible happens because of undefined behavior reasoning. When the compiler sees an expression like x + 1 > x (with x a pointer), it can reason this way:

“Ah, x + 1 . This must be a pointer into the same array as x (or it might be a boundary pointer just past that array, or just past the non-array object x ). This must be so because forming any other pointer would cause undefined behavior.

“The pointer comparison is the same as an index comparison. x + 1 > x means the same thing as &x[1] > &x[0] . But that holds iff 1 > 0 .

“In my infinite wisdom, I know that 1 > 0 . Thus x + 1 > x always holds, and the assertion will never fail.

“My job is to make this code run fast. The fastest code is code that’s not there. This assertion will never fail—might as well remove it!”

Integer undefined behavior

Arithmetic on signed integers also has important undefined behaviors. Signed integer arithmetic must never overflow. That is, the compiler may assume that the mathematical result of any signed arithmetic operation, such as x + y (with x and y both int ), can be represented inside the relevant type. It causes undefined behavior, therefore, to add 1 to the maximum positive integer. (The ubexplore.cc program demonstrates how this can produce impossible results, as with pointers.)

Arithmetic on unsigned integers is much safer with respect to undefined behavior. Unsigned arithmetic is defined to wrap around: it is performed modulo 2^N, where N is the number of bits in the type. This means that if you add 1 to the maximum unsigned integer, the result is zero.

Dividing an integer by zero causes undefined behavior whether or not the integer is signed.
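A small sketch of the contrast (the signed-overflow and divide-by-zero lines are left as comments, since actually executing them would be undefined):

    #include <climits>
    #include <cstdio>

    int main() {
        unsigned u = UINT_MAX;
        printf("%u\n", u + 1);      // well defined: wraps modulo 2^32, prints 0

        int x = INT_MAX;
        // int y = x + 1;           // signed overflow: undefined behavior
        // int z = x / 0;           // division by zero: undefined behavior
        (void) x;
    }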

Sanitizers, which in our makefiles are turned on by supplying SAN=1 , can catch many undefined behaviors as soon as they happen. Sanitizers are built in to the compiler itself; a sanitizer involves cooperation between the compiler and the language runtime. This has the major performance advantage that the compiler introduces exactly the required checks, and the optimizer can then use its normal analyses to remove redundant checks.

That said, undefined behavior checking can still be slow. Undefined behavior allows compilers to make assumptions about input values, and those assumptions can directly translate to faster code. Turning on undefined behavior checking can make some benchmark programs run 30% slower [link] .

Signed integer undefined behavior

File cs61-lectures/datarep5/ubexplore2.cc contains the following program.
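The file isn't reproduced here; based on the description that follows, its core is a loop of roughly this shape (the argument parsing details are assumptions):

    #include <cstdio>
    #include <cstdlib>

    int main(int argc, char* argv[]) {
        int n1 = (int) strtol(argv[1], nullptr, 0);
        int n2 = (int) strtol(argv[2], nullptr, 0);
        for (int i = n1; i <= n2; ++i) {
            printf("%d\n", i);
        }
    }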

What will be printed if we run the program with ./ubexplore2 0x7ffffffe 0x7fffffff ?

0x7fffffff is the largest positive value that can be represented by type int . Adding one to this value yields 0x80000000 , which in two's complement representation is the smallest negative number representable by type int .

Assuming that the program behaves this way, the loop exit condition i > n2 can never be met, and the program should run (and print out numbers) forever.

However, if we run the optimized version of the program, it prints only two numbers and then exits.

The unoptimized program does print forever and never exits.

What’s going on here? We need to look at the compiled assembly of the program with and without optimization (via objdump -S ).

The unoptimized version basically looks like this:

This is a pretty direct translation of the loop.

The optimized version, though, does it differently. As always, the optimizer has its own ideas. (Your compiler may produce different results!)

The compiler changed the source’s less than or equal to comparison, i <= n2 , into a not equal to comparison in the executable, i != n2 + 1 (in both cases using signed computer arithmetic, i.e., modulo 2^32)! The comparison i <= n2 will always return true when n2 == 0x7FFFFFFF , the maximum signed integer, so the loop goes on forever. But the i != n2 + 1 comparison does not always return true when n2 == 0x7FFFFFFF : when i wraps around to 0x80000000 (the smallest negative integer), then i equals n2 + 1 (which also wrapped), and the loop stops.

Why did the compiler make this transformation? In the original loop, the step-6 jump is immediately followed by another comparison and jump in steps 1 and 2. The processor jumps all over the place, which can confuse its prediction circuitry and slow down performance. In the transformed loop, the step-7 jump is never followed by a comparison and jump; instead, step 7 goes back to step 4, which always prints the current number. This more streamlined control flow is easier for the processor to make fast.

But the streamlined control flow is only a valid substitution under the assumption that the addition n2 + 1 never overflows . Luckily (sort of), signed arithmetic overflow causes undefined behavior, so the compiler is totally justified in making that assumption!

Programs based on ubexplore2 have demonstrated undefined behavior differences for years, even as the precise reasons why have changed. In some earlier compilers, we found that the optimizer just upgraded the int s to long s—arithmetic on long s is just as fast on x86-64 as arithmetic on int s, since x86-64 is a 64-bit architecture, and sometimes using long s for everything lets the compiler avoid conversions back and forth. The ubexplore2l program demonstrates this form of transformation: since the loop variable is added to a long counter, the compiler opportunistically upgrades i to long as well. This transformation is also only valid under the assumption that i + 1 will not overflow—which it can’t, because of undefined behavior.

Using an unsigned type prevents all this undefined behavior, because arithmetic overflow on unsigned integers is well defined in C/C++. The ubexplore2u.cc file uses an unsigned loop index and comparison, and ./ubexplore2u and ./ubexplore2u.noopt behave exactly the same (though you have to give arguments like ./ubexplore2u 0xfffffffe 0xffffffff to see the overflow).

Computer arithmetic and bitwise operations

Basic bitwise operators.

Computers offer not only the usual arithmetic operators like + and - , but also a set of bitwise operators. The basic ones are & (and), | (or), ^ (xor/exclusive or), and the unary operator ~ (complement). In truth table form:

    & (and)     0   1
          0     0   0
          1     0   1

    | (or)      0   1
          0     0   1
          1     1   1

    ^ (xor)     0   1
          0     0   1
          1     1   0

    ~ (complement)
         ~0 == 1
         ~1 == 0

In C or C++, these operators work on integers. But they work bitwise: the result of an operation is determined by applying the operation independently at each bit position. Here’s how to compute 12 & 4 in 4-bit unsigned arithmetic:
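Writing the operands in binary:

      1100   (12)
    & 0100   ( 4)
    ------
      0100   ( 4)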

These basic bitwise operators simplify certain important arithmetics. For example, (x & (x - 1)) == 0 tests whether x is zero or a power of 2.
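For example, in 4-bit arithmetic:

    x == 0b1000 (8):   x & (x - 1) == 0b1000 & 0b0111 == 0b0000   // power of 2
    x == 0b0110 (6):   x & (x - 1) == 0b0110 & 0b0101 == 0b0100   // not a power of 2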

Negation of signed integers can also be expressed using a bitwise operator: -x == ~x + 1 . This is in fact how we define two's complement representation. We can verify that x and -x add up to zero under this representation:
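For example, with 4-bit integers and x == 5:

         x == 0b0101   ( 5)
        ~x == 0b1010
    ~x + 1 == 0b1011   (-5 in two's complement)
    x + (~x + 1) == 0b10000, a 5-bit value, which truncates to 0b0000 in 4 bits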

Bitwise "and" ( & ) can help with modular arithmetic. For example, x % 32 == (x & 31) for unsigned x . We essentially "mask off", or clear, the higher-order bits to do modulo-powers-of-2 arithmetic. The same idea works in any base: in decimal, for example, the fastest way to compute x % 100 is to take just the two least significant digits of x .

Bitwise shift of unsigned integer

x << i appends i zero bits starting at the least significant bit of x . High order bits that don't fit in the integer are thrown out. For example, assuming 4-bit unsigned integers
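One concrete case:

    0b1101 << 2 == 0b0100   // the two high-order bits fall off the end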

Similarly, x >> i appends i zero bits at the most significant end of x . Lower bits are thrown out.
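And correspondingly, in 4-bit arithmetic:

    0b1101 >> 2 == 0b0011   // the two low-order bits are discarded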

Bitwise shift helps with division and multiplication. For example:
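For unsigned x :

    x << 1 == x * 2
    x >> 1 == x / 2         // integer division, rounding down
    x << 6 == x * 64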

A modern compiler can optimize y = x * 66 into y = (x << 6) + (x << 1) .

Bitwise operations also allow us to treat the bits within an integer separately. This can be useful for "options".

For example, when we call a function to open a file, we have a lot of options:

  • Open for reading?
  • Open for writing?
  • Read from the end?
  • Optimize for writing?

We have a lot of true/false options.

One bad way to implement this is to have this function take a bunch of arguments -- one argument for each option. This makes the function call look like this:
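Something like this (a sketch with made-up names and options):

    // One boolean parameter per option: hard to read and easy to get wrong.
    open_file("data.txt",
              true,     // read?
              false,    // write?
              false,    // read from the end?
              true);    // optimize for writing?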

The long list of arguments slows down the function call, and one can also easily lose track of the meaning of the individual true/false values passed in.

A cheaper way to achieve this is to use a single integer to represent all the options. Have each option defined as a power of 2, and simply | (or) them together and pass them as a single integer.

Flags are usually defined as powers of 2 so we set one bit at a time for each flag. It is less common but still possible to define a combination flag that is not a power of 2, so that it sets multiple bits in one go.
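A sketch of the flag style (all names here are made up; real APIs such as the POSIX open() use the same idea):

    // Each option is a distinct power of 2, so each occupies its own bit.
    constexpr unsigned OPT_READ     = 1u << 0;   // 0b0001
    constexpr unsigned OPT_WRITE    = 1u << 1;   // 0b0010
    constexpr unsigned OPT_FROM_END = 1u << 2;   // 0b0100
    constexpr unsigned OPT_OPTIMIZE = 1u << 3;   // 0b1000

    void open_file(const char* name, unsigned options);

    // Combine options with |; inside open_file, test them with
    // expressions like (options & OPT_READ).
    open_file("data.txt", OPT_READ | OPT_FROM_END);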

File cs61-lectures/datarep5/mb-driver.cc contains a memory allocation benchmark. The core of the benchmark looks like this:
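The handout code isn't reproduced here; a self-contained sketch with the same shape (the memnode fields, the arena shown, and the operation count are stand-ins, and the arena just wraps new and delete as described below) is:

    struct memnode {
        const char* file;
        unsigned line;
    };

    // Stand-in arena: in the handout these are wrappers for malloc and free.
    struct memnode_arena {
        memnode* allocate() { return new memnode; }
        void deallocate(memnode* m) { delete m; }
    };

    int main() {
        memnode_arena arena;
        const unsigned nnodes = 4096;
        const unsigned long noperations = 100000000;    // assumed operation count

        // Allocate 4096 memnodes.
        memnode* m[nnodes];
        for (unsigned i = 0; i != nnodes; ++i) {
            m[i] = arena.allocate();
        }

        // Free-and-then-allocate ("recycle") memnodes noperations times.
        for (unsigned long i = 0; i != noperations; ++i) {
            unsigned pos = (unsigned) (i % nnodes);     // slot choice is an assumption
            arena.deallocate(m[pos]);
            m[pos] = arena.allocate();
        }

        // Free everything at the end.
        for (unsigned i = 0; i != nnodes; ++i) {
            arena.deallocate(m[i]);
        }
    }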

The benchmark tests the performance of the memnode_arena::allocate() and memnode_arena::deallocate() functions. In the handout code, these functions do the same thing as new memnode and delete memnode —they are wrappers for malloc and free . The benchmark allocates 4096 memnode objects, then frees and reallocates memnodes noperations times, and then frees all of them.

We only allocate memnode s, and all memnode s are of the same size, so we don't need metadata that keeps track of the size of each allocation. Furthermore, since all dynamically allocated data is freed at the end of the function, for each individual memnode_free() call we don't really need to return memory to the system allocator. We can simply reuse this memory during the function and return all of it to the system at once when the function exits.

If we run the benchmark with 100,000,000 allocations, and use the system malloc() and free() functions to implement the memnode allocator, the benchmark finishes in 0.908 seconds.

Our alternative implementation of the allocator can finish in 0.355 seconds, beating the heavily optimized system allocator by roughly a factor of 2.5. We will reveal how we achieved this in the next lecture.

We continue our exploration with the memnode allocation benchmark introduced from the last lecture.

File cs61-lectures/datarep6/mb-malloc.cc contains a version of the benchmark using the system new and delete operators.

In this function we allocate an array of 4096 pointers to memnode s, which occupy 2^3 × 2^12 = 2^15 bytes on the stack. We then allocate 4096 memnode s. Our memnode is defined like this:
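The definition isn't reproduced here; based on the description below, it is along these lines (the member names are assumptions):

    #include <string>

    struct memnode {
        std::string file;
        unsigned line;
    };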

Each memnode contains a std::string object and an unsigned integer. Each std::string object internally contains a pointer to a character array in the heap. Therefore, every time we create a new memnode , we need 2 allocations: one to allocate the memnode itself, and another one performed internally by the std::string object when we initialize/assign a string value to it.

Every time we deallocate a memnode by calling delete , we also destroy the std::string object, and the string object knows that it should deallocate the heap character array it internally maintains. So there are also 2 deallocations occurring each time we free a memnode .

We make the benchmark return a seemingly meaningless result to prevent an aggressive compiler from optimizing everything away. We also use this result to check that our subsequent optimizations of the allocator are correct: every version should produce the same result.

This version of the benchmark, using the system allocator, finishes in 0.335 seconds. Not bad at all.

Spoiler alert: We can do 15x better than this.

1st optimization: std::string

We only deal with one file name, namely "datarep/mb-filename.cc", which is constant throughout the program for all memnode s. It's also a string literal, which means it has static lifetime. Why can't we simply use a const char* in place of the std::string and let the pointer point to the static constant string? This saves us the internal allocation/deallocation performed by std::string every time we initialize/delete a string.

The fix is easy: we simply change the memnode definition:
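Keeping the same member names as the sketch above:

    struct memnode {
        const char* file;    // points to a string literal with static lifetime
        unsigned line;
    };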

This version of the benchmark now finishes in 0.143 seconds, a 2x improvement over the original benchmark. This 2x improvement is consistent with the 2x reduction in the number of allocations/deallocations mentioned earlier.

You may ask why people still use std::string if it involves an additional allocation and is slower than const char* , as shown in this benchmark. std::string is much more flexible in that it also deals with data that doesn't have static lifetime, such as input from a user or data the program receives over the network. In short, when the program deals with strings that are not constant, heap data is likely to be very useful, and std::string provides facilities to conveniently handle on-heap data.

2nd optimization: the system allocator

We still use the system allocator to allocate/deallocate memnode s. The system allocator is a general-purpose allocator, which means it must handle allocation requests of all sizes. Such general-purpose designs usually come with a performance compromise. Since we only allocate memnode s, which are fairly small objects that all have the same size, we can build a special-purpose allocator just for them.

In cs61-lectures/datarep5/mb2.cc , we actually implement a special-purpose allocator for memnode s:
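The handout version isn't reproduced here; a sketch that matches the description below is:

    #include <vector>

    struct memnode_arena {
        std::vector<memnode*> free_list;

        memnode* allocate() {
            if (free_list.empty()) {
                return new memnode;             // free list empty: go to the system allocator
            }
            memnode* m = free_list.back();      // otherwise reuse a previously freed memnode
            free_list.pop_back();
            return m;
        }

        void deallocate(memnode* m) {
            free_list.push_back(m);             // keep the memory around for later reuse
        }
    };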

This allocator maintains a free list (a C++ vector ) of freed memnode s. allocate() simply pops a memnode off the free list if there is one, and deallocate() simply puts the memnode on the free list. This free list serves as a buffer between the system allocator and the benchmark function, so that the system allocator is invoked less frequently. In fact, in the benchmark, the system allocator is invoked only 4096 times, when it initializes the pointer array. That's a huge reduction because the 10 million "recycle" operations in the middle now don't involve the system allocator at all.

With this special-purpose allocator we can finish the benchmark in 0.057 seconds, another 2.5x improvement.

However, this allocator now leaks memory: it never actually calls delete ! Let's fix this by letting it also keep track of all allocated memnode s. The modified definition of memnode_arena now looks like this:
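A sketch of the updated arena, following the description below:

    #include <vector>

    struct memnode_arena {
        std::vector<memnode*> allocated;    // every memnode ever handed out
        std::vector<memnode*> free_list;    // memnodes currently available for reuse

        ~memnode_arena() {
            destroy_all();                  // runs automatically when the arena goes out of scope
        }

        void destroy_all() {
            for (memnode* m : allocated) {
                delete m;
            }
            allocated.clear();
            free_list.clear();
        }

        memnode* allocate() {
            if (free_list.empty()) {
                memnode* m = new memnode;
                allocated.push_back(m);
                return m;
            }
            memnode* m = free_list.back();
            free_list.pop_back();
            return m;
        }

        void deallocate(memnode* m) {
            free_list.push_back(m);
        }
    };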

With the updated allocator we simply need to invoke arena.destroy_all() at the end of the function to fix the memory leak. And we don't even need to invoke this method manually! We can use the C++ destructor for the memnode_arena struct, defined as ~memnode_arena() in the code above, which is automatically called when our arena object goes out of scope. We simply make the destructor invoke the destroy_all() method, and we are all set.

Fixing the leak doesn’t appear to affect performance at all. This is because the overhead added by tracking the allocated list and calling delete only affects the initial allocation of the 4096 memnode s and the final cleanup at the end of the function. These roughly 8192 additional operations are a relatively small number compared to the 10 million recycle operations, so the added overhead is hardly noticeable.

Spoiler alert: We can improve this by another factor of 2.

3rd optimization: std::vector

In our special-purpose allocator memnode_arena , we maintain an allocated list and a free list, both using C++ std::vector s. std::vector s are dynamic arrays; like std::string they involve an additional level of indirection and store the actual array in the heap. We don't access the allocated list during the "recycling" part of the benchmark (which takes the bulk of the benchmark time, as we showed earlier), so the allocated list is probably not our bottleneck. We do, however, add and remove elements from the free list for each recycle operation, so the indirection introduced by the std::vector here may actually be our bottleneck. Let's find out.

Instead of using a std::vector , we could use a linked list of all free memnode s for the actual free list. We will need some extra metadata in the memnode to store pointers for this linked list. However, unlike in the debugging allocator pset, in a free list we don't need to store this metadata in addition to the actual memnode data: the memnode is free, and not in use, so we can reuse its memory, using a union:
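A sketch of that union (assuming the const char* version of memnode, which is trivially destructible):

    union freeable_memnode {
        memnode n;                      // the node's contents while it is in use
        freeable_memnode* next_free;    // the free-list link while it is free
    };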

We then maintain the free list like this:
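A sketch of the linked free list inside the arena (the destructor and destroy_all() stay as before, deleting everything in allocated):

    #include <vector>

    struct memnode_arena {
        std::vector<freeable_memnode*> allocated;   // for destroy_all(), as before
        freeable_memnode* free_list = nullptr;      // head of the linked free list

        memnode* allocate() {
            freeable_memnode* fn;
            if (free_list) {
                fn = free_list;                     // pop the head of the free list
                free_list = fn->next_free;
            } else {
                fn = new freeable_memnode;
                allocated.push_back(fn);
            }
            return &fn->n;
        }

        void deallocate(memnode* m) {
            // The memnode is the first (and only active) member of the union,
            // so its address is also the address of the enclosing union.
            freeable_memnode* fn = (freeable_memnode*) m;
            fn->next_free = free_list;              // push onto the free list
            free_list = fn;
        }
    };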

Compared to the std::vector free list, this linked free list points directly to an available memnode whenever it is not empty ( free_list != nullptr ), without going through any indirection. With the std::vector free list, one first has to go into the heap to access the actual array containing pointers to free memnode s, and only then access the memnode itself.

With this change we can now finish the benchmark in under 0.03 seconds. Another 2x improvement over the previous one!

Compared to the benchmark with the system allocator (which finished in 0.335 seconds), we managed to achieve a speedup of nearly 15x with arena allocation.


Computer Science: Reflections on the Field, Reflections from the Field (2004)

5 Data, Representation, and Information

The preceding two chapters address the creation of models that capture phenomena of interest and the abstractions both for data and for computation that reduce these models to forms that can be executed by computer. We turn now to the ways computer scientists deal with information, especially in its static form as data that can be manipulated by programs.

Gray begins by narrating a long line of research on databases—storehouses of related, structured, and durable data. We see here that the objects of research are not data per se but rather designs of “schemas” that allow deliberate inquiry and manipulation. Gray couples this review with introspection about the ways in which database researchers approach these problems.

Databases support storage and retrieval of information by defining—in advance—a complex structure for the data that supports the intended operations. In contrast, Lesk reviews research on retrieving information from documents that are formatted to meet the needs of applications rather than predefined schematized formats.

Interpretation of information is at the heart of what historians do, and Ayers explains how information technology is transforming their paradigms. He proposes that history is essentially model building—constructing explanations based on available information—and suggests that the methods of computer science are influencing this core aspect of historical analysis.

DATABASE SYSTEMS: A TEXTBOOK CASE OF RESEARCH PAYING OFF

Jim Gray, Microsoft Research

A small research investment helped produce U.S. market dominance in the $14 billion database industry. Government and industry funding of a few research projects created the ideas for several generations of products and trained the people who built those products. Continuing research is now creating the ideas and training the people for the next generation of products.

Industry Profile

The database industry generated about $14 billion in revenue in 2002 and is growing at 20 percent per year, even though the overall technology sector is almost static. Among software sectors, the database industry is second only to operating system software. Database industry leaders are all U.S.-based corporations: IBM, Microsoft, and Oracle are the three largest. There are several specialty vendors: Tandem sells over $1 billion/ year of fault-tolerant transaction processing systems, Teradata sells about $1 billion/year of data-mining systems, and companies like Information Resources Associates, Verity, Fulcrum, and others sell specialized data and text-mining software.

In addition to these well-established companies, there is a vibrant group of small companies specializing in application-specific databases—for text retrieval, spatial and geographical data, scientific data, image data, and so on. An emerging group of companies offer XML-oriented databases. Desktop databases are another important market focused on extreme ease of use, small size, and disconnected (offline) operation.

Historical Perspective

Companies began automating their back-office bookkeeping in the 1960s. The COBOL programming language and its record-oriented file model were the workhorses of this effort. Typically, a batch of transactions was applied to the old-tape-master, producing a new-tape-master and printout for the next business day. During that era, there was considerable experimentation with systems to manage an online database that could capture transactions as they happened. At first these systems were ad hoc, but late in that decade network and hierarchical database products emerged. A COBOL subcommittee defined a network data model standard (DBTG) that formed the basis for most systems during the 1970s. Indeed, in 1980 DBTG-based Cullinet was the leading software company.

However, there were some problems with DBTG. DBTG uses a low-level, record-at-a-time procedural language to access information. The programmer has to navigate through the database, following pointers from record to record. If the database is redesigned, as often happens over a decade, then all the old programs have to be rewritten.

The relational data model, enunciated by IBM researcher Ted Codd in a 1970 Communications of the Association for Computing Machinery article, 1 was a major advance over DBTG. The relational model unified data and metadata so that there was only one form of data representation. It defined a non-procedural data access language based on algebra or logic. It was easier for end users to visualize and understand than the pointers-and-records-based DBTG model.

The research community (both industry and university) embraced the relational data model and extended it during the 1970s. Most significantly, researchers showed that a non-procedural language could be compiled to give performance comparable to the best record-oriented database systems. This research produced a generation of systems and people that formed the basis for products from IBM, Ingres, Oracle, Informix, Sybase, and others. The SQL relational database language was standardized by ANSI/ISO between 1982 and 1986. By 1990, virtually all database systems provided an SQL interface (including network, hierarchical, and object-oriented systems).

Meanwhile the database research agenda moved on to geographically distributed databases and to parallel data access. Theoretical work on distributed databases led to prototypes that in turn led to products. Today, all the major database systems offer the ability to distribute and replicate data among nodes of a computer network. Intense research on data replication during the late 1980s and early 1990s gave rise to a second generation of replication products that are now the mainstays of mobile computing.

Research of the 1980s showed how to execute each of the relational data operators in parallel—giving hundred-fold and thousand-fold speedups. The results of this research began to appear in the products of several major database companies. With the proliferation of data mining in the 1990s, huge databases emerged. Interactive access to these databases requires that the system use multiple processors and multiple disks to read all the data in parallel. In addition, these problems require near-linear time search algorithms. University and industrial research of the previous decade had solved these problems and forms the basis of the current VLDB (very large database) data-mining systems.

1. E.F. Codd, 1970, “A Relational Model of Data for Large Shared Data Banks,” Communications of the ACM 13(6):377-387.

Rollup and drilldown data reporting systems had been a mainstay of decision-support systems ever since the 1960s. In the middle 1990s, the research community really focused on data-mining algorithms. They invented very efficient data cube and materialized view algorithms that form the basis for the current generation of business intelligence products.

The most recent round of government-sponsored research creating a new industry comes from the National Science Foundation’s Digital Libraries program, which spawned Google. It was founded by a group of “database” graduate students who took a fresh look at how information should be organized and presented in the Internet era.

Current Research Directions

There continues to be active and valuable research on representing and indexing data, adding inference to data search, compiling queries more efficiently, executing queries in parallel, integrating data from heterogeneous data sources, analyzing performance, and extending the transaction model to handle long transactions and workflow (transactions that involve human as well as computer steps). The availability of huge volumes of data on the Internet has prompted the study of data integration, mediation, and federation in which a portal system presents a unification of several data sources by pulling data on demand from different parts of the Internet.

In addition, there is great interest in unifying object-oriented concepts with the relational model. New data types (image, document, and drawing) are best viewed as the methods that implement them rather than by the bytes that represent them. By adding procedures to the database system, one gets active databases, data inference, and data encapsulation. This object-oriented approach is an area of active research and ferment both in academe and industry. It seems that in 2003, the research prototypes are mostly done and this is an area that is rapidly moving into products.

The Internet is full of semi-structured data—data that has a bit of schema and metadata, but is mostly a loose collection of facts. XML has emerged as the standard representation of semi-structured data, but there is no consensus on how such data should be stored, indexed, or searched. There have been intense research efforts to answer these questions. Prototypes have been built at universities and industrial research labs, and now products are in development.

The database research community now has a major focus on stream data processing. Traditionally, databases have been stored locally and are updated by transactions. Sensor networks, financial markets, telephone calls, credit card transactions, and other data sources present streams of data rather than a static database. The stream data processing researchers are exploring languages and algorithms for querying such streams and providing approximate answers.

Now that nearly all information is online, data security and data privacy are extremely serious and important problems. A small, but growing, part of the database community is looking at ways to protect people’s privacy by limiting the ways data is used. This work also has implications for protecting intellectual property (e.g., digital rights management, watermarking) and protecting data integrity by digitally signing documents and then replicating them so that the documents cannot be altered or destroyed.

Case Histories

The U.S. government funded many database research projects from 1972 to the present. Projects at the University of California at Los Angeles gave rise to Teradata and produced many excellent students. Projects at Computer Corp. of America (SDD-1, Daplex, Multibase, and HiPAC) pioneered distributed database technology and object-oriented database technology. Projects at Stanford University fostered deductive database technology, data integration technology, query optimization technology, and the popular Yahoo! and Google Internet sites. Work at Carnegie Mellon University gave rise to general transaction models and ultimately to the Transarc Corporation. There have been many other successes from AT&T, the University of Texas at Austin, Brown and Harvard Universities, the University of Maryland, the University of Michigan, Massachusetts Institute of Technology, Princeton University, and the University of Toronto among others. It is not possible to enumerate all the contributions here, but we highlight three representative research projects that had a major impact on the industry.

Project INGRES

Project Ingres started at the University of California at Berkeley in 1972. Inspired by Codd’s paper on relational databases, several faculty members (Stonebraker, Rowe, Wong, and others) started a project to design and build a relational system. Incidental to this work, they invented a query language (QUEL), relational optimization techniques, a language binding technique, and interesting storage strategies. They also pioneered work on distributed databases.

The Ingres academic system formed the basis for the Ingres product now owned by Computer Associates. Students trained on Ingres went on to start or staff all the major database companies (AT&T, Britton Lee, HP, Informix, IBM, Oracle, Tandem, Sybase). The Ingres project went on to investigate distributed databases, database inference, active databases, and extensible databases. It was rechristened Postgres, which is now the basis of the digital library and scientific database efforts within the University of California system. Recently, Postgres spun off to become the basis for a new object-relational system from the start-up Illustra Information Technologies.

Codd’s ideas were inspired by seeing the problems IBM and its customers were having with IBM’s IMS product and the DBTG network data model. His relational model was at first very controversial; people thought that the model was too simplistic and that it could never give good performance. IBM Research management took a gamble and chartered a small (10-person) systems effort to prototype a relational system based on Codd’s ideas. That system produced a prototype that eventually grew into the DB2 product series. Along the way, the IBM team pioneered ideas in query optimization, data independence (views), transactions (logging and locking), and security (the grant-revoke model). In addition, the SQL query language from System R was the basis for the ANSI/ISO standard.

The System R group went on to investigate distributed databases (project R*) and object-oriented extensible databases (project Starburst). These research projects have pioneered new ideas and algorithms. The results appear in IBM’s database products and those of other vendors.

Not all research ideas work out. During the 1970s there was great enthusiasm for database machines—special-purpose computers that would be much faster than general-purpose operating systems running conventional database systems. These research projects were often based on exotic hardware like bubble memories, head-per-track disks, or associative RAM. The problem was that general-purpose systems were improving at 50 percent per year, so it was difficult for exotic systems to compete with them. By 1980, most researchers realized the futility of special-purpose approaches and the database-machine community switched to research on using arrays of general-purpose processors and disks to process data in parallel.

The University of Wisconsin hosted the major proponents of this idea in the United States. Funded by the government and industry, those researchers prototyped and built a parallel database machine called Gamma. That system produced ideas and a generation of students who went on to staff all the database vendors. Today the parallel systems from IBM, Tandem, Oracle, Informix, Sybase, and Microsoft all have a direct lineage from the Wisconsin research on parallel database systems. The use of parallel database systems for data mining is the fastest-growing component of the database server industry.

The Gamma project evolved into the Exodus project at Wisconsin (focusing on an extensible object-oriented database). Exodus has now evolved to the Paradise system, which combines object-oriented and parallel database techniques to represent, store, and quickly process huge Earth-observing satellite databases.

And Then There Is Science

In addition to creating a huge industry, database theory, science, and engineering constitute a key part of computer science today. Representing knowledge within a computer is one of the central challenges of computer science ( Box 5.1 ). Database research has focused primarily on this fundamental issue. Many universities have faculty investigating these problems and offer classes that teach the concepts developed by this research program.


How can knowledge be represented so that algorithms can make new inferences from the knowledge base? This problem has challenged philosophers for millennia. There has been progress. Euclid axiomized geometry and proved its basic theorems, and in doing so implicitly demonstrated mechanical reasoning from first principles. George Boole’s Laws of Thought created a predicate calculus, and Laplace’s work on probability was a first start on statistical inference.

Each of these threads—proofs, predicate calculus, and statistical inference—were major advances; but each requires substantial human creativity to fit new problems to the solution. Wouldn’t it be nice if we could just put all the books and journals in a library that would automatically organize them and start producing new answers?

There are huge gaps between our current tools and the goal of a self-organizing library, but computer scientists are trying to fill the gaps with better algorithms and better ways of representing knowledge. Databases are one branch of this effort to represent information and reason about it. The database community has taken a bottom-up approach, working with simple data representations and developing a calculus for asking and answering questions about the database.

The fundamental approach of database researchers is to insist that the information must be schematized—the information must be represented in a predefined schema that assigns a meaning to each value. The author-title-subject-abstract schema of a library system is a typical example of this approach. The schema is used both to organize the data and to make it easy to express questions about the database.

Database researchers have labored to make it easy to define the schema, easy to add data to the database, and easy to pose questions to the database. Early database systems were dreadfully difficult to use—largely because we lacked the algorithms to automatically index huge databases and lacked powerful query tools. Today there are good tools to define schemas, and graphical tools that make it easy to explore and analyze the contents of a database.

This has required invention at all levels of the problem. At the lowest levels we had to discover efficient algorithms to sort, index, and organize numeric, text, temporal, and spatial information so that higher-level software could just pick from a wide variety of organizations and algorithms. These low-level algorithms mask data placement so that it can be spread among hundreds or thousands of disks; they mask concurrency so that the higher-level software can view a consistent data snapshot, even though the data is in flux. The low-level software includes enough redundancy so that once data is placed in the database, it is safe to assume that the data will never be lost. One major advance was the theory and algorithms to automatically guarantee these concurrency-reliability properties.

Text, spatial, and temporal databases have always posed special challenges. Certainly there have been huge advances in indexing these databases, but researchers still have many more problems to solve. The advent of image, video, and sound databases raises new issues. In particular, we are now able to extract a huge number of features from images and sounds, but we have no really good ways to index these features. This is just another aspect of the “curse of dimensionality” faced by database systems in the data-mining and data analysis area. When each object has more than a dozen attributes, traditional indexing techniques give little help in reducing the approximate search space.

So, there are still many unsolved research challenges for the low-level database “plumbers.”

The higher-level software that uses this plumbing has been a huge success. Early on, the research community embraced the relational data model championed by Ted Codd. Codd advocated the use of non-procedural set-oriented programming to define schemas and to pose queries. After a decade of experimentation, these research ideas evolved into the SQL database language. Having this high-level non-procedural language was a boon both to application programmers and to database implementers. Application programmers could write much simpler programs. The database implementers faced the challenge of optimizing and executing SQL. Because it is so high level (SQL is a non-procedural functional dataflow language), SQL allows data to be distributed across many computers and disks. Because the programs do not mention any physical structures, the implementer is free to use whatever “plumbing” is available. And because the language is functional, it can be executed in parallel.

Techniques for implementing the relational data model and algorithms for efficiently executing database queries remain a core part of the database research agenda. Over the last decade, the traditional database systems have grown to include analytics (data cubes), and also data-mining algorithms borrowed from the machine-learning and statistics communities. There is increasing interest in solving information retrieval and multimedia database issues.

Today, there are very good tools for defining and querying traditional database systems; but, there are still major research challenges in the traditional database field. The major focus is automating as much of the data administration tasks as possible—making the database system self-healing and self-managing.

We are still far from the goal of building systems that automatically ingest information, reason about it, and produce answers on demand. But the goal is closer, and it seems attainable within this century.

COMPUTER SCIENCE IS TO INFORMATION AS CHEMISTRY IS TO MATTER

Michael Lesk, Rutgers University

In other countries computer science is often called “informatics” or some similar name. Much computer science research derives from the need to access, process, store, or otherwise exploit some resource of useful information. Just as chemistry is driven to large extent by the need to understand substances, computing is driven by a need to handle data and information. As an example of the way chemistry has developed, see Oliver Sacks’s book Uncle Tungsten: Memories of a Chemical Boyhood (Vintage Books, 2002). He describes his explorations through the different metals, learning the properties of each, and understanding their applications. Similarly, in the history of computer science, our information needs and our information capabilities have driven parts of the research agenda. Information retrieval systems take some kind of information, such as text documents or pictures, and try to retrieve topics or concepts based on words or shapes. Deducing the concept from the bytes can be difficult, and the way we approach the problem depends on what kind of bytes we have and how many of them we have.

Our experimental method is to see if we can build a system that will provide some useful access to information or service. If it works, those algorithms and that kind of data become a new field: look at areas like geographic information systems. If not, people may abandon the area until we see a new motivation to exploit that kind of data. For example, face-recognition algorithms have received a new impetus from security needs, speeding up progress in the last few years. An effective strategy to move computer science forward is to provide some new kind of information and see if we can make it useful.

Chemistry, of course, involves a dichotomy between substances and reactions. Just as we can (and frequently do) think of computer science in terms of algorithms, we can talk about chemistry in terms of reactions. However, chemistry has historically focused on substances: the encyclopedias and indexes in chemistry tend to be organized and focused on compounds, with reaction names and schemes getting less space on the shelf. Chemistry is becoming more balanced as we understand reactions better; computer science has always been more heavily oriented toward algorithms, but we cannot ignore the driving force of new kinds of data.

The history of information retrieval, for example, has been driven by the kinds of information we could store and use. In the 1960s, for example, storage was extremely expensive. Research projects were limited to text

materials. Even then, storage costs meant that a research project could just barely manage to have a single ASCII document available for processing. For example, Gerard Salton’s SMART system, one of the leading text retrieval systems for many years (see Salton’s book, The SMART Automatic Retrieval System , Prentice-Hall, 1971), did most of its processing on collections of a few hundred abstracts. The only collections of “full documents” were a collection of 80 extended abstracts, each a page or two long, and a collection of under a thousand stories from Time Magazine , each less than a page in length. The biggest collection was 1400 abstracts in aeronautical engineering. With this data, Salton was able to experiment on the effectiveness of retrieval methods using suffixing, thesauri, and simple phrase finding. Salton also laid down the standard methodology for evaluating retrieval systems, based on Cyril Cleverdon’s measures of “recall” (percentage of the relevant material that is retrieved in response to a query) and “precision” (the percentage of the material retrieved that is relevant). A system with perfect recall finds all relevant material, making no errors of omission and leaving out nothing the user wanted. In contrast, a system with perfect precision finds only relevant material, making no errors of commission and not bothering the user with stuff of no interest. The SMART system produced these measures for many retrieval experiments and its methodology was widely used, making text retrieval one of the earliest areas of computer science with agreed-on evaluation methods. Salton was not able to do anything with image retrieval at the time; there were no such data available for him.

Another idea shaped by the amount of information available was “relevance feedback,” the idea of identifying useful documents from a first retrieval pass in order to improve the results of a later retrieval. With so few documents, high precision seemed like an unnecessary goal. It was simply not possible to retrieve more material than somebody could look at. Thus, the research focused on high recall (also stimulated by the insistence by some users that they had to have every single relevant document). Relevance feedback helped recall. By contrast, the use of phrase searching to improve precision was tried but never got much attention simply because it did not have the scope to produce much improvement in the running systems.

The basic problem is that we wish to search for concepts, and what we have in natural language are words and phrases. When our documents are few and short, the main problem is not to miss any, and the research at the time stressed algorithms that found related words via associations or improved recall with techniques like relevance feedback.

Then, of course, several other advances—computer typesetting and word processing to generate material and cheap disks to hold it—led to much larger text collections. Figure 5.1 shows the decline in the price of disk space since the first disks in the mid-1950s, generally following the cost-performance trends of Moore’s law.

FIGURE 5.1 Decline in the price of disk space, 1950 to 2004.

Cheaper storage led to larger and larger text collections online. Now there are many terabytes of data on the Web. These vastly larger volumes mean that precision has now become more important, since a common problem is to wade through vastly too many documents. Not surprisingly, in the mid-1980s efforts started on separating the multiple meanings of words like “bank” or “pine” and became the research area of “sense disambiguation.” 2 With sense disambiguation, it is possible to imagine searching for only one meaning of an ambiguous word, thus avoiding many erroneous retrievals.

Large-scale research on text processing took off with the availability of the TREC (Text Retrieval Evaluation Conference) data. Thanks to the National Institute of Standards and Technology, several hundred megabytes of text were provided (in each of several years) for research use. This stimulated more work on query analysis, text handling, searching algorithms, and related areas; see the series titled TREC Conference Proceedings, edited by Donna Harman of NIST.

2. See Michael Lesk, 1986, “How to Tell a Pine Cone from an Ice Cream Cone,” pp. 26-28.

Document clustering appeared as an important way to shorten long search results. Clustering enables a system to report not, say, 5000 documents but rather 10 groups of 500 documents each, and the user can then explore the group or groups that seem relevant. Salton anticipated the future possibility of such algorithms, as did others. 3 Until we got large collections, though, clustering did not find application in the document retrieval world. Now one routinely sees search engines using these techniques, and faster clustering algorithms have been developed.

Thus the algorithms explored switched from recall aids to precision aids as the quantity of available data increased. Manual thesauri, for example, have dropped out of favor for retrieval, partly because of their cost but also because their goal is to increase recall, which is not today’s problem. In terms of finding the concepts hinted at by words and phrases, our goals now are to sharpen rather than broaden these concepts: thus disambiguation and phrase matching, and not as much work on thesauri and term associations.

Again, multilingual searching started to matter, because multilingual collections became available. Multilingual research shows a more precise example of particular information resources driving research. The Canadian government made its Parliamentary proceedings (called Hansard ) available in both French and English, with paragraph-by-paragraph translation. This data stimulated a number of projects looking at how to handle bilingual material, including work on automatic alignment of the parallel texts, automatic linking of similar words in the two languages, and so on. 4

A similar effect was seen with the Brown corpus of tagged English text, where the part of speech of each word (e.g., whether a word is a noun or a verb) was identified. This produced a few years of work on algorithms that learned how to assign parts of speech to words in running text based on statistical techniques, such as the work by Garside. 5

3. See, for example, N. Jardine and C.J. van Rijsbergen, 1971, “The Use of Hierarchical Clustering in Information Retrieval,” 7:217-240.

4. See, for example, T.K. Landauer and M.L. Littman, 1990, “Fully Automatic Cross-Language Document Retrieval Using Latent Semantic Indexing,” pp. 31-38, University of Waterloo Centre for the New OED and Text Research, Waterloo, Ontario, October; or I. Dagan and Ken Church, 1997, “Termight: Coordinating Humans and Machines in Bilingual Terminology Acquisition,” 12(1/2):89-107.

5. Roger Garside, 1987, “The CLAWS Word-tagging System,” in R. Garside, G. Leech, and G. Sampson (eds.), Longman, London.

One might see an analogy to various new fields of chemistry. The recognition that pesticides like DDT were environmental pollutants led to a new interest in biodegradability, and the Freon propellants used in aerosol cans stimulated research in reactions in the upper atmosphere. New substances stimulated a need to study reactions that previously had not been a top priority for chemistry and chemical engineering.

As storage became cheaper, image storage was now as practical as text storage had been a decade earlier. Starting in the 1980s we saw the IBM QBIC project demonstrating that something could be done to retrieve images directly, without having to index them by text words first. 6 Projects like this were stimulated by the availability of “clip art” such as the COREL image disks. Several different projects were driven by the easy access to images in this way, with technology moving on from color and texture to more accurate shape processing. At Berkeley, for example, the “Blobworld” project made major improvements in shape detection and recognition, as described in Carson et al. 7 These projects demonstrated that retrieval could be done with images as well as with words, and that properties of images could be found that were usable as concepts for searching.

Another new kind of data that became feasible to process was sound, in particular human speech. Here it was the Defense Advanced Research Projects Agency (DARPA) that took the lead, providing the SWITCHBOARD corpus of spoken English. Again, the availability of a substantial file of tagged information helped stimulate many research projects that used this corpus and developed much of the technology that eventually went into the commercial speech recognition products we now have. As with the TREC contests, the competitions run by DARPA based on its spoken language data pushed the industry and the researchers to new advances. National needs created a new technology; one is reminded of the development of synthetic rubber during World War II or the advances in catalysis needed to make explosives during World War I.

Yet another kind of new data was geo-coded data, introducing a new set of conceptual ideas related to place. Geographical data started showing up in machine-readable form during the 1980s, especially with the release of the Dual Independent Map Encoding (DIME) files after the 1980 census and the Topologically Integrated Geographic Encoding and Referencing (TIGER) files from the 1990 census. The availability, free of charge, of a complete U.S. street map stimulated much research on systems to display maps, to give driving directions, and the like. 8 When aerial photographs also became available, there was the triumph of Microsoft’s “Terraserver,” which made it possible to look at a wide swath of the world from the sky along with correlated street and topographic maps. 9

6. See, for example, Wayne Niblack, Ron Barber, William Equitz, Myron Flickner, Eduardo H. Glasman, Dragutin Petkovic, Peter Yanker, Christos Faloutsos, and Gabriel Taubin, 1993, “The QBIC Project: Querying Images by Content, Using Color, Texture, and Shape,” pp. 173-187.

7. C. Carson, M. Thomas, S. Belongie, J.M. Hellerstein, and J. Malik, 1999, “Blobworld: A System for Region-based Image Indexing and Retrieval,” Springer-Verlag, Amsterdam, pp. 509-516.

More recently, in the 1990s, we have started to look at video search and retrieval. After all, if a CD-ROM contains about 300,000 times as many bytes per pound as a deck of punched cards, and a digitized video has about 500,000 times as many bytes per second as the ASCII script it comes from, then with video today we should be about where we were with text in the 1960s. And indeed there are a few projects, most notably the Informedia project at Carnegie Mellon University, that experiment with video signals; they do not yet have ways of searching enormous collections, but they are developing algorithms that exploit whatever they can find in the video: scene breaks, closed-captioning, and so on.

Again, there is the problem of deducing concepts from a new kind of information. We started with the problem of words in one language that need to be combined when synonymous and picked apart when ambiguous, then moved on to detecting synonyms across multiple languages, and then to concepts depicted in pictures and sounds. Now we see research such as that by Jezekiel Ben-Arie associating words like “run” or “hop” with video images of people doing those actions. In the same way, chemistry is renewed when molecules like “buckyballs” are created and stimulate new theoretical and reaction studies.

Defining concepts for search can be extremely difficult. For example, despite our abilities to parse and define every item in a computer language, we have made no progress on retrieval of software; people looking for search or sort routines depend on metadata or comments. Some areas seem more flexible than others: text and naturalistic photograph processing software tends to be very general, while software to handle CAD diagrams and maps tends to be more specific. Algorithms are sometimes portable; both speech processing and image processing need Fourier transforms, but the literature is less connected than one might like (partly

  

An early publication was R. Elliott and M. Lesk, 1982, “Route Finding in Street Maps by Computers and People,” Proceedings of the National Conference on Artificial Intelligence (AAAI-82), Pittsburgh, Pa., August, pp. 258-261.

  

T. Barclay, J. Gray, and D. Slutz, 2000, “Microsoft Terraserver: A Spatial Data Warehouse,” Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Association for Computing Machinery, New York, pp. 307-318.

because of the difference between one-dimensional and two-dimensional transforms).
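To make the shared mathematics concrete, here is a minimal sketch, not from the essay and assuming NumPy with synthetic signals, of the point just made: speech work applies a one-dimensional Fourier transform to a signal sampled in time, while image work applies a two-dimensional transform over space.

# A minimal sketch (invented example): the same Fourier machinery in one
# dimension for a speech frame and in two dimensions for an image.
import numpy as np

# One-dimensional case: a short synthetic "speech" frame sampled at 8 kHz.
fs = 8000
t = np.arange(0, 0.032, 1 / fs)                       # a 32 ms frame
frame = np.sin(2 * np.pi * 440 * t) + 0.3 * np.random.randn(t.size)
spectrum_1d = np.fft.rfft(frame)                      # frequency content over time

# Two-dimensional case: a small synthetic "image" with a stripe pattern.
image = np.tile(np.sin(2 * np.pi * np.arange(64) / 8), (64, 1))
spectrum_2d = np.fft.fft2(image)                      # spatial-frequency content

print(spectrum_1d.shape)    # one frequency axis
print(spectrum_2d.shape)    # two frequency axes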

There are many other examples of interesting computer science research stimulated by the availability of particular kinds of information. Work on string matching today is often driven by the need to align sequences in either protein or DNA data banks. Work on image analysis is heavily influenced by the need to deal with medical radiographs. And there are many other interesting projects specifically linked to an individual data source. Among examples:

  • The British Library scanning of the original manuscript of Beowulf in collaboration with the University of Kentucky, working on image enhancement until the result of the scanning is better than reading the original;
  • The Perseus project, demonstrating the educational applications possible because of the earlier Thesaurus Linguae Graecae project, which digitized all the classical Greek authors;
  • The work in astronomical analysis stimulated by the Sloan Digital Sky Survey;
  • The creation of the field of “forensic paleontology” at the University of Texas as a result of doing MRI scans of fossil bones;
  • And, of course, the enormous amount of work on search engines stimulated by the Web.

When one of these fields takes off, and we find wide usage of some online resource, it benefits society. Every university library gained readers as its catalog went online and became accessible to students in their dorm rooms. Third World researchers can now access large amounts of technical content their libraries could rarely acquire in the past.

In computer science, and in chemistry, there is a tension between the algorithm/reaction and the data/substance. For example, should one look up an answer or compute it? Once upon a time logarithms were looked up in tables; today we also compute them on demand. Melting points and other physical properties of chemical substances are looked up in tables; perhaps with enough quantum mechanical calculation we could predict them, but it’s impractical for most materials. Predicting tomorrow’s weather presents a harder choice. One approach is to measure the current conditions, take some equations that model the atmosphere, and calculate forward a day. Another is to measure the current conditions, look in a big database for the previous day most similar to today, and then take the day after that one as the best prediction for tomorrow. So far, however, the meteorologists feel that calculation is better. Another complicated example is chess: weighing the time pressure of tournament play against the speed and storage available in computers, chess programs handle the opening and the endgame by looking in tables of old data and compute their way through the middle game.
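The look-up-versus-compute tension can be illustrated with a small sketch built on the passage's logarithm example; the table granularity and function names are invented, and either route gives essentially the same answer.

# A minimal sketch (invented example) of computing a value on demand
# versus looking it up in a precomputed table.
import bisect
import math

def log_compute(x: float) -> float:
    # "Compute it": evaluate the function directly.
    return math.log10(x)

# "Look it up": a precomputed table plus linear interpolation,
# roughly what a printed table of logarithms offered.
XS = [1.0 + 0.01 * i for i in range(901)]            # arguments 1.00 ... 10.00
TABLE = [math.log10(x) for x in XS]

def log_lookup(x: float) -> float:
    i = bisect.bisect_left(XS, x)
    if i == 0 or XS[i] == x:
        return TABLE[i]
    x0, x1 = XS[i - 1], XS[i]
    y0, y1 = TABLE[i - 1], TABLE[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)     # interpolate between entries

print(log_compute(3.1416), log_lookup(3.1416))       # nearly identical results

Which route is preferable depends, as the passage suggests, on the relative cost of storage and computation and on how much precomputed data is available.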

To conclude, a recipe for stimulating advances in computer science is to make some data available and let people experiment with it. With the incredibly cheap disks and scanners available today, this should be easier than ever. Unfortunately, what we gain with technology we are losing to law and economics. Many large databases are protected by copyright; few motion pictures, for example, are old enough to have gone out of copyright. Content owners generally refuse to grant permission for wide use of their material, whether out of greed or fear: they may have figured out how to get rich off their files of information or they may be afraid that somebody else might have. Similarly it is hard to get permission to digitize in-copyright books, no matter how long they have been out of print. Jim Gray once said to me, “May all your problems be technical.” In the 1960s I was paying people to key in aeronautical abstracts. It never occurred to us that we should be asking permission of the journals involved (I think what we did would qualify as fair use, but we didn’t even think about it). Today I could scan such things much more easily, but I would not be able to get permission. Am I better off or worse off?

There are now some 22 million chemical substances in the Chemical Abstracts Service Registry and 7 million reactions. New substances continue to intrigue chemists and cause research on new reactions, with of course enormous interest in biochemistry both for medicine and agriculture. Similarly, we keep adding data to the Web, and new kinds of information (photographs of dolphins, biological flora, and countless other things) can push computer scientists to new algorithms. In both cases, synthesis of specific instances into concepts is a crucial problem. As we see more and more kinds of data, we learn more about how to extract meaning from it, and how to present it, and we develop a need for new algorithms to implement this knowledge. As the data gets bigger, we learn more about optimization. As it gets more complex, we learn more about representation. And as it gets more useful, we learn more about visualization and interfaces, and we provide better service to society.

HISTORY AND THE FUNDAMENTALS OF COMPUTER SCIENCE

Edward L. Ayers, University of Virginia

We might begin with a thought experiment: What is history? Many people, I’ve discovered, think of it as books and the things in books. That’s certainly the explicit form in which we usually confront history. Others, thinking less literally, might think of history as stories about the past; that would open us to oral history, family lore, movies, novels, and the other forms in which we get most of our history.

All these images are wrong, of course, in the same way that images of atoms as little solar systems are wrong, or pictures of evolution as profiles of ever taller and more upright apes and people are wrong. They are all models, radically simplified, that allow us to think about such things in the exceedingly small amounts of time that we allot to these topics.

The same is true for history, which is easiest to envision as technological progress, say, or westward expansion, or the emergence of freedom—or of increasing alienation, exploitation of the environment, or the growth of intrusive government.

Those of us who think about specific aspects of society or nature for a living, of course, are never satisfied with the stories that suit the purposes of everyone else so well.

We are troubled by all the things that don’t fit, all the anomalies, variance, and loose ends. We demand more complex measurement and description, and fewer smoothing metaphors and lowest common denominators.

Thus, to scientists, atoms appear as clouds of probability; evolution appears as a branching, labyrinthine bush in which some branches die out and others diversify. It can certainly be argued that past human experience is as complex as anything in nature and likely much more so, if by complexity we mean numbers of components, variability of possibilities, and unpredictability of outcomes.

Yet our means of conveying that complexity remain distinctly analog: the story, the metaphor, the generalization. Stories can be wonderfully complex, of course, but they are complex in specific ways: of implication, suggestion, evocation. That’s what people love and what they remember.

But maybe there is a different way of thinking about the past: as information. In fact, information is all we have. Studying the past is like studying scientific processes for which you have the data but cannot run the experiment again, in which there is no control, and in which you can never see the actual process you are describing and analyzing. All we have is information in various forms: words in great abundance, billions of numbers, millions of images, some sounds and buildings, artifacts.

The historian’s goal, it seems to me, should be to account for as much of the complexity embedded in that information as we can. That, it appears, is what scientists do, and it has served them well.

And how has science accounted for ever-increasing amounts of complexity in the information it uses? Through ever more sophisticated instruments. The connection between computer science and history could be analogous to that between telescopes and stars, microscopes and cells. We could be on the cusp of a new understanding of the patterns of complexity in human behavior of the past.

The problem may be that there is too much complexity in that past, or too much static, or too much silence. In the sciences, we’ve learned how to filter, infer, use indirect evidence, and fill in the gaps, but we have a much more literal approach to the human past.

We have turned to computer science for tasks of more elaborate description, classification, representation. The digital archive my colleagues and I have built, the Valley of the Shadow Project, permits the manipulation of millions of discrete pieces of evidence about two communities in the era of the American Civil War. It uses sorting mechanisms, hypertextual display, animation, and the like to allow people to handle the evidence of this part of the past for themselves. This isn’t cutting-edge computer science, of course, but it’s darned hard and deeply disconcerting to some, for it seems to abdicate responsibility, to undermine authority, to subvert narrative, to challenge story.

Now, we’re trying to take this work to the next stage, to analysis. We have composed a journal article that employs an array of technologies, especially geographic information systems and statistical analysis, in the creation of the evidence. The article presents its argument, evidence, and historiographical context as a complex textual, tabular, and graphical representation. XML offers a powerful means to structure text, and XSL an even more powerful means to transform it and manipulate its presentation. The text is divided into sections called “statements,” each supported by an “explanation.” Each explanation, in turn, is supported by evidence and connected to relevant historiography.

Linkages, forward and backward, between evidence and narrative are central. The historiography can be automatically sorted by author, date, or title; the evidence can be arranged by date, topic, or type. Both evidence and historiographical entries are linked to the places in the analysis where they are invoked. The article is meant to be used online, but it can be printed in a fixed format with all the limitations and advantages of print.
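As a rough illustration of the structure described, the sketch below builds a statement/explanation/evidence hierarchy with Python's standard xml.etree.ElementTree. The element names, attributes, and placeholder content are hypothetical; the article's actual schema and its XSL transformations are not reproduced here.

# A hypothetical sketch of a "statement / explanation / evidence /
# historiography" hierarchy; names and content are invented placeholders.
import xml.etree.ElementTree as ET

article = ET.Element("article", title="Hypothetical analysis")

statement = ET.SubElement(article, "statement", id="s1")
statement.text = "Placeholder claim supported by the evidence below."

explanation = ET.SubElement(statement, "explanation", id="s1e1")
explanation.text = "Placeholder reasoning that connects claim and evidence."

# Evidence and historiography entries carry attributes (type, author, date)
# so they can be sorted and linked back to the analysis, as described above.
ET.SubElement(explanation, "evidence", ref="ev-001", type="tabular", date="1862")
ET.SubElement(explanation, "historiography", ref="hist-001", author="Smith", date="1990")

print(ET.tostring(article, encoding="unicode"))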

So, what are the implications of thinking of the past in the hardheaded sense of admitting that all we really have of the past is information? One implication might be great humility, since all we have for most of the past are the fossils of former human experience, words frozen in ink and images frozen in line and color. Another implication might be hubris: if we suddenly have powerful new instruments, might we be on the threshold of a revolution in our understanding of the past? We’ve been there before.

A connection between history and social science was tried before, during the first days of accessible computers. Historians taught themselves statistical methods and even programming languages so that they could adopt the techniques, models, and insights of sociology and political science. In the 1950s and 1960s the creators of the new political history called on historians to emulate the precision, explicitness, replicability, and inclusivity of the quantitative social sciences. For two decades that quantitative history flourished, promising to revolutionize the field. And to a considerable extent it did: it changed our ideas of social mobility, political identification, family formation, patterns of crime, economic growth, and the consequences of ethnic identity. It explicitly linked the past to the present and held out a history of obvious and immediate use.

But that quantitative social science history collapsed suddenly, the victim of its own inflated claims, limited method and machinery, and changing academic fashion. By the mid-1980s, history, along with many of the humanities and social sciences, had taken the linguistic turn. Rather than software manuals and codebooks, graduate students carried books of French philosophy and German literary interpretation. The social science of choice shifted from sociology to anthropology; texts replaced tables. A new generation defined itself in opposition to social scientific methods just as energetically as an earlier generation had seen in those methods the best means of writing a truly democratic history. The first computer revolution largely failed.

The first effort at that history fell into decline in part because historians could not abide the distance between their most deeply held beliefs and what the statistical machinery permitted, the abstraction it imposed. History has traditionally been built around contingency and particularity, but the most powerful tools of statistics are built on sampling and extrapolation, on generalization and tendency. Older forms of social history talked about vague and sometimes dubious classifications in part because that was what the older technology of tabulation permitted us to see. It has become increasingly clear across the social sciences that such flat ways of describing social life are inadequate; satisfying explanations must be dynamic, interactive, reflexive, and subtle, refusing to reify structures of social life or culture. The new technology permits a new cross-fertilization.

Ironically, social science history faded just as computers became widely available, just as new kinds of social science history became feasible. No longer is there any need for white-coated attendants at huge mainframes and expensive proprietary software. Rather than reducing people to rows and columns, searchable databases now permit researchers to maintain the identities of individuals in those databases and to represent entire populations rather than samples. Moreover, the record can now include things social science history could only imagine before the Web: completely indexed newspapers, with the original readable on the screen; completely searchable letters and diaries by the thousands; and interactive maps with all property holders identified and linked to other records. Visualization of patterns in the data, moreover, far outstrips the possibilities of numerical calculation alone. Manipulable histograms, maps, and time lines promise a social history that is simultaneously sophisticated and accessible. We have what earlier generations of social science historians dreamed of: a fast and widely accessible network linked to cheap and powerful computers running common software with well-established standards for the handling of numbers, texts, and images. New possibilities of collaboration and cumulative research beckon. Perhaps the time is right to reclaim a worthy vision of a disciplined and explicit social scientific history that we abandoned too soon.

What does this have to do with computer science? Everything, it seems to me. If you want hard problems, historians have them. And what’s the hardest problem of all right now? The capture of the very information that is history. Can computer science imagine ways to capture historical information more efficiently? Can it offer ways to work with the spotty, broken, dirty, contradictory, nonstandardized information we work with?

The second hard problem is the integration of this disparate evidence in time and space, offering new precision, clarity, and verifiability, as well as opening new questions and new ways of answering them.

If we can think of these ways, then we face virtually limitless possibilities. Is there a more fundamental challenge or opportunity for computer science than helping us to figure out human society over human time?


Computer Science: Reflections on the Field, Reflections from the Field provides a concise characterization of key ideas that lie at the core of computer science (CS) research. The book offers a description of CS research recognizing the richness and diversity of the field. It brings together two dozen essays on diverse aspects of CS research, their motivation and results. By describing in accessible form computer science's intellectual character, and by conveying a sense of its vibrancy through a set of examples, the book aims to prepare readers for what the future might hold and help to inspire CS researchers in its creation.



National Research Council; Division of Behavioral and Social Sciences and Education; Commission on Behavioral and Social Sciences and Education; Committee on Basic Research in the Behavioral and Social Sciences; Gerstein DR, Luce RD, Smelser NJ, et al., editors. The Behavioral and Social Sciences: Achievements and Opportunities. Washington (DC): National Academies Press (US); 1988.


5 Methods of Data Collection, Representation, and Analysis

This chapter concerns research on collecting, representing, and analyzing the data that underlie behavioral and social sciences knowledge. Such research, methodological in character, includes ethnographic and historical approaches, scaling, axiomatic measurement, and statistics, with its important relatives, econometrics and psychometrics. The field can be described as including the self-conscious study of how scientists draw inferences and reach conclusions from observations. Since statistics is the largest and most prominent of methodological approaches and is used by researchers in virtually every discipline, statistical work draws the lion’s share of this chapter’s attention.

Problems of interpreting data arise whenever inherent variation or measurement fluctuations create challenges to understand data or to judge whether observed relationships are significant, durable, or general. Some examples: Is a sharp monthly (or yearly) increase in the rate of juvenile delinquency (or unemployment) in a particular area a matter for alarm, an ordinary periodic or random fluctuation, or the result of a change or quirk in reporting method? Do the temporal patterns seen in such repeated observations reflect a direct causal mechanism, a complex of indirect ones, or just imperfections in the data? Is a decrease in auto injuries an effect of a new seat-belt law? Are the disagreements among people describing some aspect of a subculture too great to draw valid inferences about that aspect of the culture?

Such issues of inference are often closely connected to substantive theory and specific data, and to some extent it is difficult and perhaps misleading to treat methods of data collection, representation, and analysis separately. This report does so, as do all sciences to some extent, because the methods developed often are far more general than the specific problems that originally gave rise to them. There is much transfer of new ideas from one substantive field to another—and to and from fields outside the behavioral and social sciences. Some of the classical methods of statistics arose in studies of astronomical observations, biological variability, and human diversity. The major growth of the classical methods occurred in the twentieth century, greatly stimulated by problems in agriculture and genetics. Some methods for uncovering geometric structures in data, such as multidimensional scaling and factor analysis, originated in research on psychological problems, but have been applied in many other sciences. Some time-series methods were developed originally to deal with economic data, but they are equally applicable to many other kinds of data.

  • In economics: large-scale models of the U.S. economy; effects of taxation, money supply, and other government fiscal and monetary policies; theories of duopoly, oligopoly, and rational expectations; economic effects of slavery.
  • In psychology: test calibration; the formation of subjective probabilities, their revision in the light of new information, and their use in decision making; psychiatric epidemiology and mental health program evaluation.
  • In sociology and other fields: victimization and crime rates; effects of incarceration and sentencing policies; deployment of police and fire-fighting forces; discrimination, antitrust, and regulatory court cases; social networks; population growth and forecasting; and voting behavior.

Even such an abridged listing makes clear that improvements in methodology are valuable across the spectrum of empirical research in the behavioral and social sciences as well as in application to policy questions. Clearly, methodological research serves many different purposes, and there is a need to develop different approaches to serve those different purposes, including exploratory data analysis, scientific inference about hypotheses and population parameters, individual decision making, forecasting what will happen in the event or absence of intervention, and assessing causality from both randomized experiments and observational data.

This discussion of methodological research is divided into three areas: design, representation, and analysis. The efficient design of investigations must take place before data are collected because it involves how much, what kind of, and how data are to be collected. What type of study is feasible: experimental, sample survey, field observation, or other? What variables should be measured, controlled, and randomized? How extensive a subject pool or observational period is appropriate? How can study resources be allocated most effectively among various sites, instruments, and subsamples?

The construction of useful representations of the data involves deciding what kind of formal structure best expresses the underlying qualitative and quantitative concepts that are being used in a given study. For example, cost of living is a simple concept to quantify if it applies to a single individual with unchanging tastes in stable markets (that is, markets offering the same array of goods from year to year at varying prices), but as a national aggregate for millions of households and constantly changing consumer product markets, the cost of living is not easy to specify clearly or measure reliably. Statisticians, economists, sociologists, and other experts have long struggled to make the cost of living a precise yet practicable concept that is also efficient to measure, and they must continually modify it to reflect changing circumstances.
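For the simple single-household case just mentioned, one common formalization is a fixed-basket (Laspeyres-type) index. The sketch below is illustrative only, with invented goods, quantities, and prices; the aggregation over millions of households and changing product markets that makes the national concept hard is exactly what this simple formula leaves out.

# A minimal sketch: cost of living for one household with unchanging tastes,
# measured as the cost of a fixed basket at current versus base-year prices.
BASKET = {"bread": 100, "rent": 12, "fuel": 40}        # base-year quantities (invented)

BASE_PRICES = {"bread": 0.50, "rent": 300.0, "fuel": 1.20}
CURRENT_PRICES = {"bread": 0.55, "rent": 330.0, "fuel": 1.10}

def fixed_basket_index(base_prices, current_prices, basket):
    # Index = 100 * (cost of the basket now) / (cost of the basket in the base year)
    base_cost = sum(basket[g] * base_prices[g] for g in basket)
    current_cost = sum(basket[g] * current_prices[g] for g in basket)
    return 100.0 * current_cost / base_cost

print(round(fixed_basket_index(BASE_PRICES, CURRENT_PRICES, BASKET), 1))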

Data analysis covers the final step of characterizing and interpreting research findings: Can estimates of the relations between variables be made? Can some conclusion be drawn about correlation, cause and effect, or trends over time? How uncertain are the estimates and conclusions and can that uncertainty be reduced by analyzing the data in a different way? Can computers be used to display complex results graphically for quicker or better understanding or to suggest different ways of proceeding?

Advances in analysis, data representation, and research design feed into and reinforce one another in the course of actual scientific work. The intersections between methodological improvements and empirical advances are an important aspect of the multidisciplinary thrust of progress in the behavioral and social sciences.

Designs for Data Collection

Four broad kinds of research designs are used in the behavioral and social sciences: experimental, survey, comparative, and ethnographic.

Experimental designs, in either the laboratory or field settings, systematically manipulate a few variables while others that may affect the outcome are held constant, randomized, or otherwise controlled. The purpose of randomized experiments is to ensure that only one or a few variables can systematically affect the results, so that causes can be attributed. Survey designs include the collection and analysis of data from censuses, sample surveys, and longitudinal studies and the examination of various relationships among the observed phenomena. Randomization plays a different role here than in experimental designs: it is used to select members of a sample so that the sample is as representative of the whole population as possible. Comparative designs involve the retrieval of evidence that is recorded in the flow of current or past events in different times or places and the interpretation and analysis of this evidence. Ethnographic designs, also known as participant-observation designs, involve a researcher in intensive and direct contact with a group, community, or population being studied, through participation, observation, and extended interviewing.
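The two distinct roles of randomization described above can be sketched in a few lines. This is an illustrative toy with an invented population and arbitrary sample sizes: random sampling aims at a representative sample for a survey, while random assignment aims at comparable treatment and control groups for an experiment.

# A minimal sketch: random *sampling* (survey) versus random *assignment*
# (experiment). Population identifiers and sizes are invented.
import random

random.seed(0)
population = list(range(10_000))               # a hypothetical population

# Survey design: draw a simple random sample so the sample is representative.
survey_sample = random.sample(population, 500)

# Experimental design: randomly split a recruited pool into treatment and
# control so that, on average, only the treatment differs between groups.
pool = random.sample(population, 200)
random.shuffle(pool)
treatment, control = pool[:100], pool[100:]

print(len(survey_sample), len(treatment), len(control))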

Experimental Designs

Laboratory Experiments

Laboratory experiments underlie most of the work reported in Chapter 1 , significant parts of Chapter 2 , and some of the newest lines of research in Chapter 3 . Laboratory experiments extend and adapt classical methods of design first developed, for the most part, in the physical and life sciences and agricultural research. Their main feature is the systematic and independent manipulation of a few variables and the strict control or randomization of all other variables that might affect the phenomenon under study. For example, some studies of animal motivation involve the systematic manipulation of amounts of food and feeding schedules while other factors that may also affect motivation, such as body weight, deprivation, and so on, are held constant. New designs are currently coming into play largely because of new analytic and computational methods (discussed below, in “Advances in Statistical Inference and Analysis”).

Two examples of empirically important issues that demonstrate the need for broadening classical experimental approaches are open-ended responses and lack of independence of successive experimental trials. The first concerns the design of research protocols that do not require the strict segregation of the events of an experiment into well-defined trials, but permit a subject to respond at will. These methods are needed when what is of interest is how the respondent chooses to allocate behavior in real time and across continuously available alternatives. Such empirical methods have long been used, but they can generate very subtle and difficult problems in experimental design and subsequent analysis. As theories of allocative behavior of all sorts become more sophisticated and precise, the experimental requirements become more demanding, so the need to better understand and solve this range of design issues is an outstanding challenge to methodological ingenuity.

The second issue arises in repeated-trial designs when the behavior on successive trials, even if it does not exhibit a secular trend (such as a learning curve), is markedly influenced by what has happened in the preceding trial or trials. The more naturalistic the experiment and the more sensitive the measurements taken, the more likely it is that such effects will occur. But such sequential dependencies in observations cause a number of important conceptual and technical problems in summarizing the data and in testing analytical models, which are not yet completely understood. In the absence of clear solutions, such effects are sometimes ignored by investigators, simplifying the data analysis but leaving residues of skepticism about the reliability and significance of the experimental results. With continuing development of sensitive measures in repeated-trial designs, there is a growing need for more advanced concepts and methods for dealing with experimental results that may be influenced by sequential dependencies.

Randomized Field Experiments

The state of the art in randomized field experiments, in which different policies or procedures are tested in controlled trials under real conditions, has advanced dramatically over the past two decades. Problems that were once considered major methodological obstacles—such as implementing randomized field assignment to treatment and control groups and protecting the randomization procedure from corruption—have been largely overcome. While state-of-the-art standards are not achieved in every field experiment, the commitment to reaching them is rising steadily, not only among researchers but also among customer agencies and sponsors.

The health insurance experiment described in Chapter 2 is an example of a major randomized field experiment that has had and will continue to have important policy reverberations in the design of health care financing. Field experiments with the negative income tax (guaranteed minimum income) conducted in the 1970s were significant in policy debates, even before their completion, and provided the most solid evidence available on how tax-based income support programs and marginal tax rates can affect the work incentives and family structures of the poor. Important field experiments have also been carried out on alternative strategies for the prevention of delinquency and other criminal behavior, reform of court procedures, rehabilitative programs in mental health, family planning, and special educational programs, among other areas.

In planning field experiments, much hinges on the definition and design of the experimental cells, the particular combinations needed of treatment and control conditions for each set of demographic or other client sample characteristics, including specification of the minimum number of cases needed in each cell to test for the presence of effects. Considerations of statistical power, client availability, and the theoretical structure of the inquiry enter into such specifications. Current important methodological thresholds are to find better ways of predicting recruitment and attrition patterns in the sample, of designing experiments that will be statistically robust in the face of problematic sample recruitment or excessive attrition, and of ensuring appropriate acquisition and analysis of data on the attrition component of the sample.

Also of major significance are improvements in integrating detailed process and outcome measurements in field experiments. To conduct research on program effects under field conditions requires continual monitoring to determine exactly what is being done—the process—and how it corresponds to what was projected at the outset. Relatively unintrusive, inexpensive, and effective implementation measures are of great interest. There is, in parallel, a growing emphasis on designing experiments to evaluate distinct program components in contrast to summary measures of net program effects.

Finally, there is an important opportunity now for further theoretical work to model organizational processes in social settings and to design and select outcome variables that, in the relatively short time of most field experiments, can predict longer-term effects: For example, in job-training programs, what are the effects on the community (role models, morale, referral networks) or on individual skills, motives, or knowledge levels that are likely to translate into sustained changes in career paths and income levels?

Survey Designs

Many people have opinions about how societal mores, economic conditions, and social programs shape lives and encourage or discourage various kinds of behavior. People generalize from their own cases, and from the groups to which they belong, about such matters as how much it costs to raise a child, the extent to which unemployment contributes to divorce, and so on. In fact, however, effects vary so much from one group to another that homespun generalizations are of little use. Fortunately, behavioral and social scientists have been able to bridge the gaps between personal perspectives and collective realities by means of survey research. In particular, governmental information systems include volumes of extremely valuable survey data, and the facility of modern computers to store, disseminate, and analyze such data has significantly improved empirical tests and led to new understandings of social processes.

Within this category of research designs, two major types are distinguished: repeated cross-sectional surveys and longitudinal panel surveys. In addition, and cross-cutting these types, there is a major effort under way to improve and refine the quality of survey data by investigating features of human memory and of question formation that affect survey response.

Repeated cross-sectional designs can either attempt to measure an entire population—as does the oldest U.S. example, the national decennial census—or they can rest on samples drawn from a population. The general principle is to take independent samples at two or more times, measuring the variables of interest, such as income levels, housing plans, or opinions about public affairs, in the same way. The General Social Survey, collected by the National Opinion Research Center with National Science Foundation support, is a repeated cross-sectional database that was begun in 1972. One methodological question of particular salience in such data is how to adjust for nonresponses and “don’t know” responses. Another is how to deal with self-selection bias. For example, to compare the earnings of women and men in the labor force, it would be mistaken to first assume that the two samples of labor-force participants are randomly selected from the larger populations of men and women; instead, one has to consider and incorporate in the analysis the factors that determine who is in the labor force.

In longitudinal panels, a sample is drawn at one point in time and the relevant variables are measured at this and subsequent times for the same people. In more complex versions, some fraction of each panel may be replaced or added to periodically, such as expanding the sample to include households formed by the children of the original sample. An example of panel data developed in this way is the Panel Study of Income Dynamics (PSID), conducted by the University of Michigan since 1968 (discussed in Chapter 3 ).

Comparing the fertility or income of different people in different circumstances at the same time to find correlations always leaves a large proportion of the variability unexplained, but common sense suggests that much of the unexplained variability is actually explicable. There are systematic reasons for individual outcomes in each person’s past achievements, in parental models, upbringing, and earlier sequences of experiences. Unfortunately, asking people about the past is not particularly helpful: people remake their views of the past to rationalize the present and so retrospective data are often of uncertain validity. In contrast, generation-long longitudinal data allow readings on the sequence of past circumstances uncolored by later outcomes. Such data are uniquely useful for studying the causes and consequences of naturally occurring decisions and transitions. Thus, as longitudinal studies continue, quantitative analysis is becoming feasible about such questions as: How are the decisions of individuals affected by parental experience? Which aspects of early decisions constrain later opportunities? And how does detailed background experience leave its imprint? Studies like the two-decade-long PSID are bringing within grasp a complete generational cycle of detailed data on fertility, work life, household structure, and income.
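A small sketch may help show what a panel adds over repeated cross-sections: because the same people are observed at each wave, change can be computed within persons. The person-year records and helper function below are invented for illustration.

# A minimal sketch of panel data: the same individuals observed repeatedly,
# so within-person change is directly measurable. All values are invented.
panel = [
    # (person_id, year, income)
    (1, 1980, 14_000), (1, 1985, 19_000), (1, 1990, 23_000),
    (2, 1980, 22_000), (2, 1985, 21_000), (2, 1990, 30_000),
]

def within_person_change(records, start_year, end_year):
    # Income change per person between two waves; independent cross-sections
    # cannot provide this, because they sample different people each time.
    by_person = {}
    for pid, year, income in records:
        by_person.setdefault(pid, {})[year] = income
    return {pid: waves[end_year] - waves[start_year]
            for pid, waves in by_person.items()
            if start_year in waves and end_year in waves}

print(within_person_change(panel, 1980, 1990))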

Advances in Longitudinal Designs

Large-scale longitudinal data collection projects are uniquely valuable as vehicles for testing and improving survey research methodology. In ways that lie beyond the scope of a cross-sectional survey, longitudinal studies can sometimes be designed—without significant detriment to their substantive interests—to facilitate the evaluation and upgrading of data quality; the analysis of relative costs and effectiveness of alternative techniques of inquiry; and the standardization or coordination of solutions to problems of method, concept, and measurement across different research domains.

Some areas of methodological improvement include discoveries about the impact of interview mode on response (mail, telephone, face-to-face); the effects of nonresponse on the representativeness of a sample (due to respondents’ refusal or interviewers’ failure to contact); the effects on behavior of continued participation over time in a sample survey; the value of alternative methods of adjusting for nonresponse and incomplete observations (such as imputation of missing data, variable case weighting); the impact on response of specifying different recall periods, varying the intervals between interviews, or changing the length of interviews; and the comparison and calibration of results obtained by longitudinal surveys, randomized field experiments, laboratory studies, onetime surveys, and administrative records.

It should be especially noted that incorporating improvements in methodology and data quality has been and will no doubt continue to be crucial to the growing success of longitudinal studies. Panel designs are intrinsically more vulnerable than other designs to statistical biases due to cumulative item non-response, sample attrition, time-in-sample effects, and error margins in repeated measures, all of which may produce exaggerated estimates of change. Over time, a panel that was initially representative may become much less representative of a population, not only because of attrition in the sample, but also because of changes in immigration patterns, age structure, and the like. Longitudinal studies are also subject to changes in scientific and societal contexts that may create uncontrolled drifts over time in the meaning of nominally stable questions or concepts as well as in the underlying behavior. Also, a natural tendency to expand over time the range of topics and thus the interview lengths, which increases the burdens on respondents, may lead to deterioration of data quality or relevance. Careful methodological research to understand and overcome these problems has been done, and continued work as a component of new longitudinal studies is certain to advance the overall state of the art.

Longitudinal studies are sometimes pressed for evidence they are not designed to produce: for example, in important public policy questions concerning the impact of government programs in such areas as health promotion, disease prevention, or criminal justice. By using research designs that combine field experiments (with randomized assignment to program and control conditions) and longitudinal surveys, one can capitalize on the strongest merits of each: the experimental component provides stronger evidence for causal statements that are critical for evaluating programs and for illuminating some fundamental theories; the longitudinal component helps in the estimation of long-term program effects and their attenuation. Coupling experiments to ongoing longitudinal studies is not often feasible, given the multiple constraints of not disrupting the survey, developing all the complicated arrangements that go into a large-scale field experiment, and having the populations of interest overlap in useful ways. Yet opportunities to join field experiments to surveys are of great importance. Coupled studies can produce vital knowledge about the empirical conditions under which the results of longitudinal surveys turn out to be similar to—or divergent from—those produced by randomized field experiments. A pattern of divergence and similarity has begun to emerge in coupled studies; additional cases are needed to understand why some naturally occurring social processes and longitudinal design features seem to approximate formal random allocation and others do not. The methodological implications of such new knowledge go well beyond program evaluation and survey research. These findings bear directly on the confidence scientists—and others—can have in conclusions from observational studies of complex behavioral and social processes, particularly ones that cannot be controlled or simulated within the confines of a laboratory environment.

Memory and the Framing of Questions

A very important opportunity to improve survey methods lies in the reduction of nonsampling error due to questionnaire context, phrasing of questions, and, generally, the semantic and social-psychological aspects of surveys. Survey data are particularly affected by the fallibility of human memory and the sensitivity of respondents to the framework in which a question is asked. This sensitivity is especially strong for certain types of attitudinal and opinion questions. Efforts are now being made to bring survey specialists into closer contact with researchers working on memory function, knowledge representation, and language in order to uncover and reduce this kind of error.

Memory for events is often inaccurate, biased toward what respondents believe to be true—or should be true—about the world. In many cases in which data are based on recollection, improvements can be achieved by shifting to techniques of structured interviewing and calibrated forms of memory elicitation, such as specifying recent, brief time periods (for example, in the last seven days) within which respondents recall certain types of events with acceptable accuracy.

The wording and ordering of questions also matter. In one survey experiment, respondents were asked the following two questions, with the order varied on different forms of the questionnaire:

  • “Taking things altogether, how would you describe your marriage? Would you say that your marriage is very happy, pretty happy, or not too happy?”
  • “Taken altogether how would you say things are these days—would you say you are very happy, pretty happy, or not too happy?”

Presenting this sequence in both directions on different forms showed that the order affected answers to the general happiness question but did not change the marital happiness question: responses to the specific issue swayed subsequent responses to the general one, but not vice versa. The explanations for and implications of such order effects on the many kinds of questions and sequences that can be used are not simple matters. Further experimentation on the design of survey instruments promises not only to improve the accuracy and reliability of survey research, but also to advance understanding of how people think about and evaluate their behavior from day to day.

Comparative Designs

Both experiments and surveys involve interventions or questions by the scientist, who then records and analyzes the responses. In contrast, many bodies of social and behavioral data of considerable value are originally derived from records or collections that have accumulated for various nonscientific reasons, quite often administrative in nature, in firms, churches, military organizations, and governments at all levels. Data of this kind can sometimes be subjected to careful scrutiny, summary, and inquiry by historians and social scientists, and statistical methods have increasingly been used to develop and evaluate inferences drawn from such data. Some of the main comparative approaches are cross-national aggregate comparisons, selective comparison of a limited number of cases, and historical case studies.

Among the more striking problems facing the scientist using such data are the vast differences in what has been recorded by different agencies whose behavior is being compared (this is especially true for parallel agencies in different nations), the highly unrepresentative or idiosyncratic sampling that can occur in the collection of such data, and the selective preservation and destruction of records. Means to overcome these problems form a substantial methodological research agenda in comparative research. An example of the method of cross-national aggregative comparisons is found in investigations by political scientists and sociologists of the factors that underlie differences in the vitality of institutions of political democracy in different societies. Some investigators have stressed the existence of a large middle class, others the level of education of a population, and still others the development of systems of mass communication. In cross-national aggregate comparisons, a large number of nations are arrayed according to some measures of political democracy and then attempts are made to ascertain the strength of correlations between these and the other variables. In this line of analysis it is possible to use a variety of statistical cluster and regression techniques to isolate and assess the possible impact of certain variables on the institutions under study. While this kind of research is cross-sectional in character, statements about historical processes are often invoked to explain the correlations.
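The cross-national aggregate strategy can be sketched as a small regression exercise. The figures below are synthetic stand-ins rather than real country measurements, and ordinary least squares is used only as the simplest representative of the cluster and regression techniques mentioned above.

# A minimal sketch (synthetic data): regress a democracy measure on
# candidate explanatory variables across many countries.
import numpy as np

rng = np.random.default_rng(0)
n_countries = 40
education = rng.uniform(2, 14, n_countries)          # mean years of schooling
middle_class = rng.uniform(0.1, 0.6, n_countries)    # middle-class share
democracy = 1.5 * education + 8.0 * middle_class + rng.normal(0, 3, n_countries)

# Ordinary least squares: democracy ~ education + middle_class + intercept
X = np.column_stack([education, middle_class, np.ones(n_countries)])
coefficients, *_ = np.linalg.lstsq(X, democracy, rcond=None)
print("education, middle-class, intercept:", np.round(coefficients, 2))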

More limited selective comparisons, applied by many of the classic theorists, involve asking similar kinds of questions but over a smaller range of societies. Why did democracy develop in such different ways in America, France, and England? Why did northeastern Europe develop rational bourgeois capitalism, in contrast to the Mediterranean and Asian nations? Modern scholars have turned their attention to explaining, for example, differences among types of fascism between the two World Wars, and similarities and differences among modern state welfare systems, using these comparisons to unravel the salient causes. The questions asked in these instances are inevitably historical ones.

Historical case studies involve only one nation or region, and so they may not be geographically comparative. However, insofar as they involve tracing the transformation of a society’s major institutions and the role of its main shaping events, they involve a comparison of different periods of a nation’s or a region’s history. The goal of such comparisons is to give a systematic account of the relevant differences. Sometimes, particularly with respect to the ancient societies, the historical record is very sparse, and the methods of history and archaeology mesh in the reconstruction of complex social arrangements and patterns of change on the basis of few fragments.

Like all research designs, comparative ones have distinctive vulnerabilities and advantages: One of the main advantages of using comparative designs is that they greatly expand the range of data, as well as the amount of variation in those data, for study. Consequently, they allow for more encompassing explanations and theories that can relate highly divergent outcomes to one another in the same framework. They also contribute to reducing any cultural biases or tendencies toward parochialism among scientists studying common human phenomena.

One main vulnerability in such designs arises from the problem of achieving comparability. Because comparative study involves studying societies and other units that are dissimilar from one another, the phenomena under study usually occur in very different contexts—so different that in some cases what is called an event in one society cannot really be regarded as the same type of event in another. For example, a vote in a Western democracy is different from a vote in an Eastern bloc country, and a voluntary vote in the United States means something different from a compulsory vote in Australia. These circumstances make for interpretive difficulties in comparing aggregate rates of voter turnout in different countries.

The problem of achieving comparability appears in historical analysis as well. For example, changes in laws and enforcement and recording procedures over time change the definition of what is and what is not a crime, and for that reason it is difficult to compare the crime rates over time. Comparative researchers struggle with this problem continually, working to fashion equivalent measures; some have suggested the use of different measures (voting, letters to the editor, street demonstration) in different societies for common variables (political participation), to try to take contextual factors into account and to achieve truer comparability.

A second vulnerability is controlling variation. Traditional experiments make conscious and elaborate efforts to control the variation of some factors and thereby assess the causal significance of others. In surveys as well as experiments, statistical methods are used to control sources of variation and assess suspected causal significance. In comparative and historical designs, this kind of control is often difficult to attain because the sources of variation are many and the number of cases few. Scientists have made efforts to approximate such control in these cases of “many variables, small N.” One is the method of paired comparisons. If an investigator isolates 15 American cities in which racial violence has been recurrent in the past 30 years, for example, it is helpful to match them with 15 cities of similar population size, geographical region, and size of minorities—such characteristics are controls—and then search for systematic differences between the two sets of cities. Another method is to select, for comparative purposes, a sample of societies that resemble one another in certain critical ways, such as size, common language, and common level of development, thus attempting to hold these factors roughly constant, and then seeking explanations among other factors in which the sampled societies differ from one another.
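The paired-comparison idea can be sketched as a crude matching routine: for each city where the phenomenon occurred, find the most similar city on the control characteristics mentioned above (population, region, minority share). The cities, figures, and distance function below are invented for illustration.

# A minimal sketch of paired comparison by matching on control characteristics.
CASE_CITIES = [
    {"name": "A", "pop": 650_000, "region": "midwest", "minority": 0.31},
    {"name": "B", "pop": 1_200_000, "region": "south", "minority": 0.42},
]
CANDIDATE_CONTROLS = [
    {"name": "X", "pop": 700_000, "region": "midwest", "minority": 0.28},
    {"name": "Y", "pop": 1_150_000, "region": "south", "minority": 0.40},
    {"name": "Z", "pop": 300_000, "region": "west", "minority": 0.15},
]

def distance(a, b):
    # Crude dissimilarity: scaled population gap plus minority-share gap,
    # plus a penalty for being in a different region.
    return (abs(a["pop"] - b["pop"]) / 1_000_000
            + abs(a["minority"] - b["minority"])
            + (0.0 if a["region"] == b["region"] else 1.0))

for case in CASE_CITIES:
    match = min(CANDIDATE_CONTROLS, key=lambda control: distance(case, control))
    print(case["name"], "paired with", match["name"])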

Ethnographic Designs

Traditionally identified with anthropology, ethnographic research designs are playing increasingly significant roles in most of the behavioral and social sciences. The core of this methodology is participant-observation, in which a researcher spends an extended period of time with the group under study, ideally mastering the local language, dialect, or special vocabulary, and participating in as many activities of the group as possible. This kind of participant-observation is normally coupled with extensive open-ended interviewing, in which people are asked to explain in depth the rules, norms, practices, and beliefs through which (from their point of view) they conduct their lives. A principal aim of ethnographic study is to discover the premises on which those rules, norms, practices, and beliefs are built.

The use of ethnographic designs by anthropologists has contributed significantly to the building of knowledge about social and cultural variation. And while these designs continue to center on certain long-standing features—extensive face-to-face experience in the community, linguistic competence, participation, and open-ended interviewing—there are newer trends in ethnographic work. One major trend concerns its scale. Ethnographic methods were originally developed largely for studying small-scale groupings known variously as village, folk, primitive, preliterate, or simple societies. Over the decades, these methods have increasingly been applied to the study of small groups and networks within modern (urban, industrial, complex) society, including the contemporary United States. The typical subjects of ethnographic study in modern society are small groups or relatively small social networks, such as outpatient clinics, medical schools, religious cults and churches, ethnically distinctive urban neighborhoods, corporate offices and factories, and government bureaus and legislatures.

As anthropologists moved into the study of modern societies, researchers in other disciplines—particularly sociology, psychology, and political science—began using ethnographic methods to enrich and focus their own insights and findings. At the same time, studies of large-scale structures and processes have been aided by the use of ethnographic methods, since most large-scale changes work their way into the fabric of community, neighborhood, and family, affecting the daily lives of people. Ethnographers have studied, for example, the impact of new industry and new forms of labor in “backward” regions; the impact of state-level birth control policies on ethnic groups; and the impact on residents in a region of building a dam or establishing a nuclear waste dump. Ethnographic methods have also been used to study a number of social processes that lend themselves to its particular techniques of observation and interview—processes such as the formation of class and racial identities, bureaucratic behavior, legislative coalitions and outcomes, and the formation and shifting of consumer tastes.

Advances in structured interviewing (see above) have proven especially powerful in the study of culture. Techniques for understanding kinship systems, concepts of disease, color terminologies, ethnobotany, and ethnozoology have been radically transformed and strengthened by coupling new interviewing methods with modern measurement and scaling techniques (see below). These techniques have made possible more precise comparisons among cultures and identification of the most competent and expert persons within a culture. The next step is to extend these methods to study the ways in which networks of propositions (such as boys like sports, girls like babies) are organized to form belief systems. Much evidence suggests that people typically represent the world around them by means of relatively complex cognitive models that involve interlocking propositions. The techniques of scaling have been used to develop models of how people categorize objects, and they have great potential for further development, to analyze data pertaining to cultural propositions.

Ideological Systems

Perhaps the most fruitful area for the application of ethnographic methods in recent years has been the systematic study of ideologies in modern society. Earlier studies of ideology were in small-scale societies that were rather homogeneous. In these studies researchers could report on a single culture, a uniform system of beliefs and values for the society as a whole. Modern societies are much more diverse both in origins and number of subcultures, related to different regions, communities, occupations, or ethnic groups. Yet these subcultures and ideologies share certain underlying assumptions or at least must find some accommodation with the dominant value and belief systems in the society.

The challenge is to incorporate this greater complexity of structure and process into systematic descriptions and interpretations. One line of work carried out by researchers has tried to track the ways in which ideologies are created, transmitted, and shared among large populations that have traditionally lacked the social mobility and communications technologies of the West. This work has concentrated on large-scale civilizations such as China, India, and Central America. Gradually, the focus has generalized into a concern with the relationship between the great traditions—the central lines of cosmopolitan Confucian, Hindu, or Mayan culture, including aesthetic standards, irrigation technologies, medical systems, cosmologies and calendars, legal codes, poetic genres, and religious doctrines and rites—and the little traditions, those identified with rural, peasant communities. How are the ideological doctrines and cultural values of the urban elites, the great traditions, transmitted to local communities? How are the little traditions, the ideas from the more isolated, less literate, and politically weaker groups in society, transmitted to the elites?

India and southern Asia have been fruitful areas for ethnographic research on these questions. The great Hindu tradition was present in virtually all local contexts through the presence of high-caste individuals in every community. It operated as a pervasive standard of value for all members of society, even in the face of strong little traditions. The situation is surprisingly akin to that of modern, industrialized societies. The central research questions are the degree and the nature of penetration of dominant ideology, even in groups that appear marginal and subordinate and have no strong interest in sharing the dominant value system. In this connection the lowest and poorest occupational caste—the untouchables—serves as an ultimate test of the power of ideology and cultural beliefs to unify complex hierarchical social systems.

Historical Reconstruction

Another current trend in ethnographic work is its convergence with archival methods. One joining point is the application of the descriptive and interpretative procedures used by ethnographers to reconstruct the cultures that created historical documents, diaries, and other records—to interview history, so to speak. For example, a revealing study showed how the Inquisition in the Italian countryside between the 1570s and 1640s gradually worked subtle changes in an ancient fertility cult in peasant communities; the peasant beliefs and rituals assimilated many elements of witchcraft as peasants learned them from their persecutors. A good deal of social history—particularly that of the family—has drawn on discoveries made in the ethnographic study of primitive societies. As described in Chapter 4, this particular line of inquiry rests on a marriage of ethnographic, archival, and demographic approaches.

Other lines of ethnographic work have focused on the historical dimensions of nonliterate societies. A strikingly successful example of this kind of effort is a study of head-hunting. By combining an interpretation of local oral tradition with the fragmentary observations made by outside observers (such as missionaries, traders, and colonial officials), historical fluctuations in the rate and significance of head-hunting were shown to be partly a response to such international forces as the Great Depression and World War II. Researchers are also investigating the ways in which various groups in contemporary societies invent versions of traditions that may or may not reflect the actual history of the group. This process has been observed among elites seeking political and cultural legitimation and among hard-pressed minorities (for example, the Basque in Spain, the Welsh in Great Britain) seeking roots and political mobilization in a larger society.

Ethnography is a powerful method to record, describe, and interpret the system of meanings held by groups and to discover how those meanings affect the lives of group members. It is a method well adapted to the study of situations in which people interact with one another and the researcher can interact with them as well, so that information about meanings can be evoked and observed. Ethnography is especially suited to exploration and elucidation of unsuspected connections; ideally, it is used in combination with other methods—experimental, survey, or comparative—to establish with precision the relative strengths and weaknesses of such connections. By the same token, experimental, survey, and comparative methods frequently yield connections, the meaning of which is unknown; ethnographic methods are a valuable way to determine them.

Models for Representing Phenomena

The objective of any science is to uncover the structure and dynamics of the phenomena that are its subject, as they are exhibited in the data. Scientists continuously try to describe possible structures and ask whether the data can, with allowance for errors of measurement, be described adequately in terms of them. Over a long time, various families of structures have recurred throughout many fields of science; these structures have become objects of study in their own right, principally by statisticians, other methodological specialists, applied mathematicians, and philosophers of logic and science. Methods have evolved to evaluate the adequacy of particular structures to account for particular types of data. In the interest of clarity we discuss these structures in this section and the analytical methods used for estimation and evaluation of them in the next section, although in practice they are closely intertwined.

A good deal of mathematical and statistical modeling attempts to describe the relations, both structural and dynamic, that hold among variables that are presumed to be representable by numbers. Such models are applicable in the behavioral and social sciences only to the extent that appropriate numerical measurement can be devised for the relevant variables. In many studies the phenomena in question and the raw data obtained are not intrinsically numerical, but qualitative, such as ethnic group identifications. The identifying numbers used to code such questionnaire categories for computers are no more than labels, which could just as well be letters or colors. One key question is whether there is some natural way to move from the qualitative aspects of such data to a structural representation that involves one of the well-understood numerical or geometric models or whether such an attempt would be inherently inappropriate for the data in question. The decision as to whether or not particular empirical data can be represented in particular numerical or more complex structures is seldom simple, and strong intuitive biases or a priori assumptions about what can and cannot be done may be misleading.

Recent decades have seen rapid and extensive development and application of analytical methods attuned to the nature and complexity of social science data. Examples of nonnumerical modeling are increasing. Moreover, the widespread availability of powerful computers is probably leading to a qualitative revolution: it affects not only the ability to compute numerical solutions to numerical models but also the ability to work out the consequences of all sorts of structures that do not involve numbers at all. The following discussion gives some indication of the richness of past progress and of future prospects, although it is by necessity far from exhaustive.

In describing some of the areas of new and continuing research, we have organized this section on the basis of whether the representations are fundamentally probabilistic or not. A further useful distinction is between representations of data that are highly discrete or categorical in nature (such as whether a person is male or female) and those that are continuous in nature (such as a person’s height). Of course, there are intermediate cases involving both types of variables, such as color stimuli that are characterized by discrete hues (red, green) and a continuous luminance measure. Probabilistic models lead very naturally to questions of estimation and statistical evaluation of the correspondence between data and model. Those that are not probabilistic involve additional problems of dealing with and representing sources of variability that are not explicitly modeled. At the present time, scientists understand some aspects of structure, such as geometries, and some aspects of randomness, as embodied in probability models, but do not yet adequately understand how to put the two together in a single unified model. Table 5-1 outlines the way we have organized this discussion and shows where the examples in this section lie.

Table 5-1. A Classification of Structural Models.

Probability Models

Some behavioral and social sciences variables appear to be more or less continuous, for example, utility of goods, loudness of sounds, or risk associated with uncertain alternatives. Many other variables, however, are inherently categorical, often with only two or a few values possible: for example, whether a person is in or out of school, employed or not employed, identifies with a major political party or political ideology. And some variables, such as moral attitudes, are typically measured in research with survey questions that allow only categorical responses. Much of the early probability theory was formulated only for continuous variables; its use with categorical variables was not really justified, and in some cases it may have been misleading. Recently, very significant advances have been made in how to deal explicitly with categorical variables. This section first describes several contemporary approaches to models involving categorical variables, followed by ones involving continuous representations.

Log-Linear Models for Categorical Variables

Many recent models for analyzing categorical data of the kind usually displayed as counts (cell frequencies) in multidimensional contingency tables are subsumed under the general heading of log-linear models, that is, linear models in the natural logarithms of the expected counts in each cell in the table. These recently developed forms of statistical analysis allow one to partition variability due to various sources in the distribution of categorical attributes, and to isolate the effects of particular variables or combinations of them.
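
To make the form concrete, a saturated log-linear model for a two-way contingency table with row variable A and column variable B can be written as follows (standard textbook notation, not notation taken from this report):

\[
\log \mu_{ij} \;=\; \lambda \;+\; \lambda^{A}_{i} \;+\; \lambda^{B}_{j} \;+\; \lambda^{AB}_{ij},
\]

where \(\mu_{ij}\) is the expected count in cell (i, j), the single-factor terms capture the row and column margins, and the \(\lambda^{AB}_{ij}\) terms capture the association between A and B; setting \(\lambda^{AB}_{ij} = 0\) gives the model of independence. Models with more variables add higher-order interaction terms in the same fashion.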

Present log-linear models were first developed and used by statisticians and sociologists and then found extensive application in other social and behavioral sciences disciplines. When applied, for instance, to the analysis of social mobility, such models separate factors of occupational supply and demand from other factors that impede or propel movement up and down the social hierarchy. With such models, for example, researchers discovered the surprising fact that occupational mobility patterns are strikingly similar in many nations of the world (even among disparate nations like the United States and most of the Eastern European socialist countries), and from one time period to another, once allowance is made for differences in the distributions of occupations. The log-linear and related kinds of models have also made it possible to identify and analyze systematic differences in mobility among nations and across time. As another example of applications, psychologists and others have used log-linear models to analyze attitudes and their determinants and to link attitudes to behavior. These methods have also diffused to and been used extensively in the medical and biological sciences.

Regression Models for Categorical Variables

Models that permit one variable to be explained or predicted by means of others, called regression models, are the workhorses of much applied statistics; this is especially true when the dependent (explained) variable is continuous. For a two-valued dependent variable, such as alive or dead, models and approximate theory and computational methods for one explanatory variable were developed in biometry about 50 years ago. Computer programs able to handle many explanatory variables, continuous or categorical, are readily available today. Even now, however, the accuracy of the approximate theory on given data is an open question.
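
As a minimal illustration of such a model for a two-valued dependent variable, the sketch below fits a logistic regression by maximum likelihood using a Newton-Raphson iteration; the data are simulated, and the code is a bare-bones sketch rather than production practice (a real analysis would also report standard errors and check convergence).

```python
import numpy as np

def logistic_fit(X, y, n_iter=25):
    """Maximum-likelihood logistic regression via Newton-Raphson.

    X : (n, p) design matrix (include a column of ones for an intercept)
    y : (n,) array of 0/1 outcomes
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # predicted probabilities
        W = p * (1.0 - p)                     # weights for the Newton step
        gradient = X.T @ (y - p)
        hessian = X.T @ (X * W[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta

# toy example: one explanatory variable with true coefficients (0.5, 1.5)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
prob = 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * x)))
y = (rng.uniform(size=200) < prob).astype(float)
X = np.column_stack([np.ones_like(x), x])
print(logistic_fit(X, y))   # estimates should be close to (0.5, 1.5)
```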

Using classical utility theory, economists have developed discrete choice models that turn out to be somewhat related to the log-linear and categorical regression models. Models for limited dependent variables, especially those that cannot take on values above or below a certain level (such as weeks unemployed, number of children, and years of schooling) have been used profitably in economics and in some other areas. For example, censored normal variables (called tobits in economics), in which observed values outside certain limits are simply counted, have been used in studying decisions to go on in school. It will require further research and development to incorporate information about limited ranges of variables fully into the main multivariate methodologies. In addition, with respect to the assumptions about distribution and functional form conventionally made in discrete response models, some new methods are now being developed that show promise of yielding reliable inferences without making unrealistic assumptions; further research in this area promises significant progress.
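
For reference, the censored-normal (tobit) specification mentioned above is usually written as follows; this is the generic textbook form, not a formula from the report:

\[
y_i^{*} = x_i'\beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0,\sigma^{2}), \qquad
y_i = \max(0,\, y_i^{*}),
\]

so that only the censored value \(y_i\) is observed, with values of the latent variable below the limit recorded simply as zeros (or, more generally, counted as falling outside the limit).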

One problem arises from the fact that many of the categorical variables collected by the major data bases are ordered. For example, attitude surveys frequently use a 3-, 5-, or 7-point scale (from high to low) without specifying numerical intervals between levels. Social class and educational levels are often described by ordered categories. Ignoring order information, which many traditional statistical methods do, may be inefficient or inappropriate, but replacing the categories by successive integers or other arbitrary scores may distort the results. (For additional approaches to this question, see sections below on ordered structures.) Regression-like analysis of ordinal categorical variables is quite well developed, but multivariate analysis of such variables needs further research. New log-bilinear models have been proposed, but to date they deal specifically with only two or three categorical variables. Additional research extending the new models, improving computational algorithms, and integrating the models with work on scaling promises to lead to valuable new knowledge.

Models for Event Histories

Event-history studies yield the sequence of events that respondents to a survey sample experience over a period of time; for example, the timing of marriage, childbearing, or labor force participation. Event-history data can be used to study educational progress, demographic processes (migration, fertility, and mortality), mergers of firms, labor market behavior, and even riots, strikes, and revolutions. As interest in such data has grown, many researchers have turned to models that pertain to changes in probabilities over time to describe when and how individuals move among a set of qualitative states.

Much of the progress in models for event-history data builds on recent developments in statistics and biostatistics for life-time, failure-time, and hazard models. Such models permit the analysis of qualitative transitions in a population whose members are undergoing partially random organic deterioration, mechanical wear, or other risks over time. With the increased complexity of event-history data that are now being collected, and the extension of event-history data bases over very long periods of time, new problems arise that cannot be effectively handled by older types of analysis. Among the problems are repeated transitions, such as between unemployment and employment or marriage and divorce; more than one time variable (such as biological age, calendar time, duration in a stage, and time exposed to some specified condition); latent variables (variables that are explicitly modeled even though not observed); gaps in the data; sample attrition that is not randomly distributed over the categories; and respondent difficulties in recalling the exact timing of events.
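
A schematic statement of the hazard-model idea may help fix terms. In textbook notation (not the report's), the hazard of leaving a state at time t given covariates x, and its proportional-hazards form, are:

\[
h(t \mid x) \;=\; \lim_{\Delta t \to 0} \frac{\Pr(t \le T < t + \Delta t \mid T \ge t,\, x)}{\Delta t},
\qquad
h(t \mid x) \;=\; h_0(t)\, e^{x'\beta},
\]

where T is the (possibly censored) duration in the state and \(h_0(t)\) is a baseline hazard. Extensions of this basic form are what handle repeated transitions, multiple time scales, and the other complications listed above.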

Models for Multiple-Item Measurement

For a variety of reasons, researchers typically use multiple measures (or multiple indicators) to represent theoretical concepts. Sociologists, for example, often rely on two or more variables (such as occupation and education) to measure an individual’s socioeconomic position; educational psychologists ordinarily measure a student’s ability with multiple test items. Although the basic observations are categorical, in a number of applications they are interpreted as a partitioning of something continuous. For example, in test theory one thinks of the measures of both item difficulty and respondent ability as continuous variables, possibly multidimensional in character.

Classical test theory and newer item-response theories in psychometrics deal with the extraction of information from multiple measures. Testing, which is a major source of data in education and other areas, results in millions of test items stored in archives each year for purposes ranging from college admissions to job-training programs for industry. One goal of research on such test data is to be able to make comparisons among persons or groups even when different test items are used. Although the information collected from each respondent is intentionally incomplete in order to keep the tests short and simple, item-response techniques permit researchers to reconstitute the fragments into an accurate picture of overall group proficiencies. These new methods provide a better theoretical handle on individual differences, and they are expected to be extremely important in developing and using tests. For example, they have been used in attempts to equate different forms of a test given in successive waves during a year, a procedure made necessary in large-scale testing programs by legislation requiring disclosure of test-scoring keys at the time results are given.
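
As one common concrete form of an item-response model (a standard two-parameter logistic specification, offered here only as an illustration), the probability that respondent i answers item j correctly is:

\[
\Pr(X_{ij} = 1 \mid \theta_i) \;=\; \frac{1}{1 + \exp\{-a_j(\theta_i - b_j)\}},
\]

where \(\theta_i\) is the respondent's ability, \(b_j\) the item's difficulty, and \(a_j\) its discrimination. Because abilities and item parameters are estimated on a common scale, scores from different collections of items can be compared.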

An example of the use of item-response theory in a significant research effort is the National Assessment of Educational Progress (NAEP). The goal of this project is to provide accurate, nationally representative information on the average (rather than individual) proficiency of American children in a wide variety of academic subjects as they progress through elementary and secondary school. This approach is an improvement over the use of trend data on university entrance exams, because NAEP estimates of academic achievements (by broad characteristics such as age, grade, region, ethnic background, and so on) are not distorted by the self-selected character of those students who seek admission to college, graduate, and professional programs.

Item-response theory also forms the basis of many new psychometric instruments, known as computerized adaptive testing, currently being implemented by the U.S. military services and under additional development in many testing organizations. In adaptive tests, a computer program selects items for each examinee based upon the examinee’s success with previous items. Generally, each person gets a slightly different set of items and the equivalence of scale scores is established by using item-response theory. Adaptive testing can greatly reduce the number of items needed to achieve a given level of measurement accuracy.
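
The sketch below illustrates the item-selection step of such an adaptive test under the two-parameter logistic model shown earlier: given a provisional ability estimate, the program picks the unused item with the greatest Fisher information. The item bank and its parameter values are hypothetical, and real systems add ability re-estimation after each response, exposure control, and content balancing.

```python
import numpy as np

def p_correct(theta, a, b):
    # two-parameter logistic item response function (assumed form)
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    # Fisher information of an item at ability level theta
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def next_item(theta_hat, item_bank, administered):
    """Return the index of the most informative unadministered item."""
    best_idx, best_info = None, -np.inf
    for idx, (a, b) in enumerate(item_bank):
        if idx in administered:
            continue
        info = item_information(theta_hat, a, b)
        if info > best_info:
            best_idx, best_info = idx, info
    return best_idx

# hypothetical bank of (discrimination, difficulty) pairs
item_bank = [(1.2, -1.0), (0.8, 0.0), (1.5, 0.5), (1.0, 1.2)]
print(next_item(theta_hat=0.4, item_bank=item_bank, administered={0}))
```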

Nonlinear, Nonadditive Models

Virtually all statistical models now in use impose a linearity or additivity assumption of some kind, sometimes after a nonlinear transformation of variables. Imposing these forms on relationships that do not, in fact, possess them may well result in false descriptions and spurious effects. Unwary users, especially of computer software packages, can easily be misled. But more realistic nonlinear and nonadditive multivariate models are becoming available. Extensive use with empirical data is likely to force many changes and enhancements in such models and stimulate quite different approaches to nonlinear multivariate analysis in the next decade.

Geometric and Algebraic Models

Geometric and algebraic models attempt to describe underlying structural relations among variables. In some cases they are part of a probabilistic approach, such as the algebraic models underlying regression or the geometric representations of correlations between items in a technique called factor analysis. In other cases, geometric and algebraic models are developed without explicitly modeling the element of randomness or uncertainty that is always present in the data. Although this latter approach to behavioral and social sciences problems has been less researched than the probabilistic one, there are some advantages in developing the structural aspects independent of the statistical ones. We begin the discussion with some inherently geometric representations and then turn to numerical representations for ordered data.

Although geometry is a huge mathematical topic, little of it seems directly applicable to the kinds of data encountered in the behavioral and social sciences. A major reason is that the primitive concepts normally used in geometry—points, lines, coincidence—do not correspond naturally to the kinds of qualitative observations usually obtained in behavioral and social sciences contexts. Nevertheless, since geometric representations are used to reduce bodies of data, there is a real need to develop a deeper understanding of when such representations of social or psychological data make sense. Moreover, there is a practical need to understand why geometric computer algorithms, such as those of multidimensional scaling, work as well as they apparently do. A better understanding of the algorithms will increase the efficiency and appropriateness of their use, which becomes increasingly important with the widespread availability of scaling programs for microcomputers.

Over the past 50 years several kinds of well-understood scaling techniques have been developed and widely used to assist in the search for appropriate geometric representations of empirical data. The whole field of scaling is now entering a critical juncture in terms of unifying and synthesizing what earlier appeared to be disparate contributions. Within the past few years it has become apparent that several major methods of analysis, including some that are based on probabilistic assumptions, can be unified under the rubric of a single generalized mathematical structure. For example, it has recently been demonstrated that such diverse approaches as nonmetric multidimensional scaling, principal-components analysis, factor analysis, correspondence analysis, and log-linear analysis have more in common in terms of underlying mathematical structure than had earlier been realized.

Nonmetric multidimensional scaling is a method that begins with data about the ordering established by subjective similarity (or nearness) between pairs of stimuli. The idea is to embed the stimuli into a metric space (that is, a geometry with a measure of distance between points) in such a way that distances between points corresponding to stimuli exhibit the same ordering as do the data. This method has been successfully applied to phenomena that, on other grounds, are known to be describable in terms of a specific geometric structure; such applications were used to validate the procedures. Such validation was done, for example, with respect to the perception of colors, which are known to be describable in terms of a particular three-dimensional structure known as the Euclidean color coordinates. Similar applications have been made with Morse code symbols and spoken phonemes. The technique is now used in some biological and engineering applications, as well as in some of the social sciences, as a method of data exploration and simplification.
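
The criterion most such algorithms minimize can be stated compactly. One common form (Kruskal's stress-1, given here as a reminder rather than as the report's own notation) is:

\[
\text{Stress-1} \;=\; \sqrt{\frac{\sum_{i<j} \bigl(d_{ij} - \hat d_{ij}\bigr)^{2}}{\sum_{i<j} d_{ij}^{2}}},
\]

where \(d_{ij}\) are the distances between points i and j in the fitted configuration and \(\hat d_{ij}\) are disparities obtained by monotone regression so that they preserve the observed ordering of similarities. A configuration with low stress reproduces the similarity ordering well.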

One question of interest is how to develop an axiomatic basis for various geometries using as a primitive concept an observable such as the subject’s ordering of the relative similarity of one pair of stimuli to another, which is the typical starting point of such scaling. The general task is to discover properties of the qualitative data sufficient to ensure that a mapping into the geometric structure exists and, ideally, to discover an algorithm for finding it. Some work of this general type has been carried out: for example, there is an elegant set of axioms based on laws of color matching that yields the three-dimensional vectorial representation of color space. But the more general problem of understanding the conditions under which the multidimensional scaling algorithms are suitable remains unsolved. In addition, work is needed on understanding more general, non-Euclidean spatial models.

Ordered Factorial Systems

One type of structure common throughout the sciences arises when an ordered dependent variable is affected by two or more ordered independent variables. This is the situation to which regression and analysis-of-variance models are often applied; it is also the structure underlying the familiar physical identities, in which physical units are expressed as products of the powers of other units (for example, energy has the unit of mass times the square of the unit of distance divided by the square of the unit of time).

There are many examples of these types of structures in the behavioral and social sciences. One example is the ordering of preference of commodity bundles—collections of various amounts of commodities—which may be revealed directly by expressions of preference or indirectly by choices among alternative sets of bundles. A related example is preferences among alternative courses of action that involve various outcomes with differing degrees of uncertainty; this is one of the more thoroughly investigated problems because of its potential importance in decision making. A psychological example is the trade-off between delay and amount of reward, yielding those combinations that are equally reinforcing. In a common, applied kind of problem, a subject is given descriptions of people in terms of several factors, for example, intelligence, creativity, diligence, and honesty, and is asked to rate them according to a criterion such as suitability for a particular job.

In all these cases and a myriad of others like them the question is whether the regularities of the data permit a numerical representation. Initially, three types of representations were studied quite fully: the dependent variable as a sum, a product, or a weighted average of the measures associated with the independent variables. The first two representations underlie some psychological and economic investigations, as well as a considerable portion of physical measurement and modeling in classical statistics. The third representation, averaging, has proved most useful in understanding preferences among uncertain outcomes and the amalgamation of verbally described traits, as well as some physical variables.
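
In two-factor notation (ours, introduced only to make the three cases explicit), the additive, multiplicative, and averaging representations of a dependent measure \(\phi\) in terms of independent-variable measures \(\psi_1\) and \(\psi_2\) are:

\[
\phi(a, x) = \psi_1(a) + \psi_2(x), \qquad
\phi(a, x) = \psi_1(a)\,\psi_2(x), \qquad
\phi(a, x) = \frac{w_1\,\psi_1(a) + w_2\,\psi_2(x)}{w_1 + w_2},
\]

with nonnegative weights \(w_1\) and \(w_2\) in the averaging case.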

For each of these three cases—adding, multiplying, and averaging—researchers know what properties or axioms of order the data must satisfy for such a numerical representation to be appropriate. On the assumption that one or another of these representations exists, and using numerical ratings by subjects instead of ordering, a scaling technique called functional measurement (referring to the function that describes how the dependent variable relates to the independent ones) has been developed and applied in a number of domains. What remains problematic is how to encompass at the ordinal level the fact that some random error intrudes into nearly all observations and then to show how that randomness is represented at the numerical level; this continues to be an unresolved and challenging research issue.

During the past few years considerable progress has been made in understanding certain representations inherently different from those just discussed. The work has involved three related thrusts. The first is a scheme of classifying structures according to how uniquely their representation is constrained. The three classical numerical representations are known as ordinal, interval, and ratio scale types. For systems with continuous numerical representations and of scale type at least as rich as the ratio one, it has been shown that only one additional type can exist. A second thrust is to accept structural assumptions, like factorial ones, and to derive for each scale the possible functional relations among the independent variables. And the third thrust is to develop axioms for the properties of an order relation that leads to the possible representations. Much is now known about the possible nonadditive representations of both the multifactor case and the one where stimuli can be combined, such as combining sound intensities.

Closely related to this classification of structures is the question: What statements, formulated in terms of the measures arising in such representations, can be viewed as meaningful in the sense of corresponding to something empirical? Statements here refer to any scientific assertions, including statistical ones, formulated in terms of the measures of the variables and logical and mathematical connectives. These are statements for which asserting truth or falsity makes sense. In particular, statements that remain invariant under certain symmetries of structure have played an important role in classical geometry, dimensional analysis in physics, and in relating measurement and statistical models applied to the same phenomenon. In addition, these ideas have been used to construct models in more formally developed areas of the behavioral and social sciences, such as psychophysics. Current research has emphasized the communality of these historically independent developments and is attempting both to uncover systematic, philosophically sound arguments as to why invariance under symmetries is as important as it appears to be and to understand what to do when structures lack symmetry, as, for example, when variables have an inherent upper bound.

Many subjects do not seem to be correctly represented in terms of distances in continuous geometric space. Rather, in some cases, such as the relations among meanings of words—which is of great interest in the study of memory representations—a description in terms of tree-like, hierarchical structures appears to be more illuminating. This kind of description appears appropriate both because of the categorical nature of the judgments and the hierarchical, rather than trade-off, nature of the structure. Individual items are represented as the terminal nodes of the tree, and groupings by different degrees of similarity are shown as intermediate nodes, with the more general groupings occurring nearer the root of the tree. Clustering techniques, requiring considerable computational power, have been and are being developed. Some successful applications exist, but much more refinement is anticipated.
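
A minimal sketch of such tree-building (agglomerative) clustering is shown below, assuming the SciPy library is available; the items and their feature vectors are invented solely for illustration, and the choice of average linkage is one of several reasonable options.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# hypothetical feature vectors for a handful of words (rows = items)
rng = np.random.default_rng(1)
items = ["dog", "cat", "wolf", "car", "truck", "bus"]
features = rng.normal(size=(len(items), 4))

# build the tree: average-linkage agglomerative clustering on Euclidean distances
tree = linkage(features, method="average", metric="euclidean")

# cut the tree into two groups; items with the same label share a branch
labels = fcluster(tree, t=2, criterion="maxclust")
for item, label in zip(items, labels):
    print(item, label)
```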

Network Models

Several other lines of advanced modeling have progressed in recent years, opening new possibilities for empirical specification and testing of a variety of theories. In social network data, relationships among units, rather than the units themselves, are the primary objects of study: friendships among persons, trade ties among nations, cocitation clusters among research scientists, interlocking among corporate boards of directors. Special models for social network data have been developed in the past decade, and they give, among other things, precise new measures of the strengths of relational ties among units. A major challenge in social network data at present is to handle the statistical dependence that arises when the units sampled are related in complex ways.

Statistical Inference and Analysis

As was noted earlier, questions of design, representation, and analysis are intimately intertwined. Some issues of inference and analysis have been discussed above as related to specific data collection and modeling approaches. This section discusses some more general issues of statistical inference and advances in several current approaches to them.

Causal Inference

Behavioral and social scientists use statistical methods primarily to infer the effects of treatments, interventions, or policy factors. Previous chapters included many instances of causal knowledge gained this way. As noted above, the large experimental study of alternative health care financing discussed in Chapter 2 relied heavily on statistical principles and techniques, including randomization, in the design of the experiment and the analysis of the resulting data. Sophisticated designs were necessary in order to answer a variety of questions in a single large study without confusing the effects of one program difference (such as prepayment or fee for service) with the effects of another (such as different levels of deductible costs), or with effects of unobserved variables (such as genetic differences). Statistical techniques were also used to ascertain which results applied across the whole enrolled population and which were confined to certain subgroups (such as individuals with high blood pressure) and to translate utilization rates across different programs and types of patients into comparable overall dollar costs and health outcomes for alternative financing options.

A classical experiment, with systematic but randomly assigned variation of the variables of interest (or some reasonable approach to this), is usually considered the most rigorous basis from which to draw such inferences. But random samples or randomized experimental manipulations are not always feasible or ethically acceptable. Then, causal inferences must be drawn from observational studies, which, however well designed, are less able to ensure that the observed (or inferred) relationships among variables provide clear evidence on the underlying mechanisms of cause and effect.

Certain recurrent challenges have been identified in studying causal inference. One challenge arises from the selection of background variables to be measured, such as the sex, nativity, or parental religion of individuals in a comparative study of how education affects occupational success. The adequacy of classical methods of matching groups in background variables and adjusting for covariates needs further investigation. Statistical adjustment of biases linked to measured background variables is possible, but it can become complicated. Current work in adjustment for selectivity bias is aimed at weakening implausible assumptions, such as normality, when carrying out these adjustments. Even after adjustment has been made for the measured background variables, other, unmeasured variables are almost always still affecting the results (such as family transfers of wealth or reading habits). Analyses of how the conclusions might change if such unmeasured variables could be taken into account are essential in attempting to make causal inferences from an observational study, and systematic work on useful statistical models for such sensitivity analyses is just beginning.

A third important issue arises from the necessity of distinguishing among competing hypotheses when the explanatory variables are measured with different degrees of precision. Both the estimated size and significance of an effect are diminished when it has large measurement error, and the coefficients of other correlated variables are affected even when the other variables are measured perfectly. Similar results arise from conceptual errors, when one measures only proxies for a theoretical construct (such as years of education to represent amount of learning). In some cases, there are procedures for simultaneously or iteratively estimating both the precision of complex measures and their effect on a particular criterion.

Although complex models are often necessary to infer causes, once their output is available, it should be translated into understandable displays for evaluation. Results that depend on the accuracy of a multivariate model and the associated software need to be subjected to appropriate checks, including the evaluation of graphical displays, group comparisons, and other analyses.

New Statistical Techniques

Internal resampling.

One of the great contributions of twentieth-century statistics was to demonstrate how a properly drawn sample of sufficient size, even if it is only a tiny fraction of the population of interest, can yield very good estimates of most population characteristics. When enough is known at the outset about the characteristic in question—for example, that its distribution is roughly normal—inference from the sample data to the population as a whole is straightforward, and one can easily compute measures of the certainty of inference, a common example being the 95 percent confidence interval around an estimate. But population shapes are sometimes unknown or uncertain, and so inference procedures cannot be so simple. Furthermore, more often than not, it is difficult to assess even the degree of uncertainty associated with complex data and with the statistics needed to unravel complex social and behavioral phenomena.

Internal resampling methods attempt to assess this uncertainty by generating a number of simulated data sets similar to the one actually observed. The definition of similar is crucial, and many methods that exploit different types of similarity have been devised. These methods provide researchers the freedom to choose scientifically appropriate procedures and to replace procedures that are valid under assumed distributional shapes with ones that are not so restricted. Flexible and imaginative computer simulation is the key to these methods. For a simple random sample, the “bootstrap” method repeatedly resamples the obtained data (with replacement) to generate a distribution of possible data sets. The distribution of any estimator can thereby be simulated and measures of the certainty of inference be derived. The “jackknife” method repeatedly omits a fraction of the data and in this way generates a distribution of possible data sets that can also be used to estimate variability. These methods can also be used to remove or reduce bias. For example, the ratio-estimator, a statistic that is commonly used in analyzing sample surveys and censuses, is known to be biased, and the jackknife method can usually remedy this defect. The methods have been extended to other situations and types of analysis, such as multiple regression.
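
A minimal sketch of the bootstrap idea for a simple random sample is given below: the observed data are resampled with replacement many times, the statistic of interest is recomputed each time, and the spread of those replicates estimates the uncertainty of the original estimate. The data here are simulated, and the number of replications is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_se(data, statistic, n_boot=2000):
    """Bootstrap standard error of an arbitrary statistic."""
    n = len(data)
    replicates = np.empty(n_boot)
    for b in range(n_boot):
        resample = rng.choice(data, size=n, replace=True)  # sample with replacement
        replicates[b] = statistic(resample)
    return replicates.std(ddof=1)

data = rng.exponential(scale=2.0, size=50)        # a small, skewed sample
print("median:", np.median(data))
print("bootstrap SE of median:", bootstrap_se(data, np.median))
```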

There are indications that under relatively general conditions, these methods, and others related to them, allow more accurate estimates of the uncertainty of inferences than do the traditional ones that are based on assumed (usually, normal) distributions when that distributional assumption is unwarranted. For complex samples, such internal resampling or subsampling facilitates estimating the sampling variances of complex statistics.

An older and simpler, but equally important, idea is to use one independent subsample in searching the data to develop a model and at least one separate subsample for estimating and testing a selected model. Otherwise, it is next to impossible to make allowances for the excessively close fitting of the model that occurs as a result of the creative search for the exact characteristics of the sample data—characteristics that are to some degree random and will not predict well to other samples.

Robust Techniques

Many technical assumptions underlie the analysis of data. Some, like the assumption that each item in a sample is drawn independently of other items, can be weakened when the data are sufficiently structured to admit simple alternative models, such as serial correlation. Usually, these models require that a few parameters be estimated. Assumptions about shapes of distributions, normality being the most common, have proved to be particularly important, and considerable progress has been made in dealing with the consequences of different assumptions.

More recently, robust techniques have been designed that permit sharp, valid discriminations among possible values of parameters of central tendency for a wide variety of alternative distributions by reducing the weight given to occasional extreme deviations. It turns out that by giving up, say, 10 percent of the discrimination that could be provided under the rather unrealistic assumption of normality, one can greatly improve performance in more realistic situations, especially when unusually large deviations are relatively common.

These valuable modifications of classical statistical techniques have been extended to multiple regression, in which procedures of iterative reweighting can now offer relatively good performance for a variety of underlying distributional shapes. They should be extended to more general schemes of analysis.
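
The sketch below shows the reweighting idea in its simplest setting, a robust estimate of central tendency: residuals are scaled, observations with large scaled residuals get reduced weight, and the weighted mean is re-solved until it stabilizes. The Huber weight function and the tuning constant 1.345 are conventional choices, the data are simulated, and extending the same loop to multiple regression follows the pattern described above.

```python
import numpy as np

def huber_location(x, c=1.345, tol=1e-8, max_iter=100):
    """Huber M-estimate of central tendency via iterative reweighting."""
    mu = np.median(x)
    scale = np.median(np.abs(x - mu)) / 0.6745          # MAD-based scale estimate
    for _ in range(max_iter):
        r = np.abs(x - mu) / scale                      # scaled residuals
        w = np.minimum(1.0, c / np.maximum(r, 1e-12))   # downweight large deviations
        mu_new = np.sum(w * x) / np.sum(w)
        if abs(mu_new - mu) < tol:
            break
        mu = mu_new
    return mu

rng = np.random.default_rng(2)
data = np.concatenate([rng.normal(0, 1, 95), rng.normal(8, 1, 5)])  # 5% gross errors
print("mean:", data.mean(), "robust estimate:", huber_location(data))
```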

In some contexts—notably the most classical uses of analysis of variance—the use of adequate robust techniques should help to bring conventional statistical practice closer to the best standards that experts can now achieve.

Many Interrelated Parameters

In trying to give a more accurate representation of the real world than is possible with simple models, researchers sometimes use models with many parameters, all of which must be estimated from the data. Classical principles of estimation, such as straightforward maximum-likelihood, do not yield reliable estimates unless either the number of observations is much larger than the number of parameters to be estimated or special designs are used in conjunction with strong assumptions. Bayesian methods do not draw a distinction between fixed and random parameters, and so may be especially appropriate for such problems.

A variety of statistical methods have recently been developed that can be interpreted as treating many of the parameters as random quantities (or as behaving like random quantities), even if they are regarded as representing fixed quantities to be estimated. Theory and practice demonstrate that such methods can improve on the simpler fixed-parameter methods from which they evolved, especially when the number of observations is not large relative to the number of parameters. Successful applications include college and graduate school admissions, where the quality of a previous school is treated as a random parameter when the data are insufficient to estimate it well separately. Efforts to create appropriate models using this general approach for small-area estimation and undercount adjustment in the census are important potential applications.
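
A schematic two-level normal model (textbook notation, not the report's) shows how treating parameters as random produces the improvement described above: each separate estimate is pulled toward the overall mean in proportion to its own unreliability.

\[
y_i \mid \theta_i \sim N(\theta_i, \sigma_i^{2}), \qquad
\theta_i \sim N(\mu, \tau^{2}), \qquad
\hat\theta_i \;=\; \frac{\tau^{2}}{\tau^{2} + \sigma_i^{2}}\, y_i
\;+\; \frac{\sigma_i^{2}}{\tau^{2} + \sigma_i^{2}}\, \mu,
\]

where \(y_i\) is the direct estimate for unit i (a school, a small area), \(\sigma_i^{2}\) its sampling variance, and \(\mu\) and \(\tau^{2}\) describe how the underlying parameters vary across units; in practice \(\mu\) and \(\tau^{2}\) must themselves be estimated from the data.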

Missing Data

In data analysis, serious problems can arise when certain kinds of (quantitative or qualitative) information are partially or wholly missing. Various approaches to dealing with these problems have been or are being developed. One method developed recently for dealing with certain aspects of missing data is called multiple imputation: each missing value in a data set is replaced by several values representing a range of possibilities, with statistical dependence among missing values reflected by linkage among their replacements. It is currently being used to handle a major problem of incompatibility between the 1980 and previous Census Bureau public-use tapes with respect to occupation codes. The extension of these techniques to address such problems as nonresponse to income questions in the Current Population Survey has been examined in exploratory applications with great promise.
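
After the several completed data sets are analyzed separately, the results are combined by standard rules (given here in their usual textbook form, not as quoted from the report). For a scalar quantity Q estimated from each of m imputed data sets:

\[
\bar Q = \frac{1}{m}\sum_{k=1}^{m} \hat Q_k, \qquad
B = \frac{1}{m-1}\sum_{k=1}^{m} \bigl(\hat Q_k - \bar Q\bigr)^{2}, \qquad
T = \bar U + \Bigl(1 + \frac{1}{m}\Bigr) B,
\]

where \(\hat Q_k\) and its variance estimate \(U_k\) come from the k-th completed data set, \(\bar U\) is the average of the \(U_k\), and the total variance T reflects both ordinary sampling variability and the extra uncertainty due to the missing values.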

Computer Packages and Expert Systems

The development of high-speed computing and data handling has fundamentally changed statistical analysis. Methodologies for all kinds of situations are rapidly being developed and made available for use in computer packages that may be incorporated into interactive expert systems. This computing capability offers the hope that much data analysis will be done more carefully and more effectively than previously and that better strategies for data analysis will move from the practice of expert statisticians, some of whom may not have tried to articulate their own strategies, to both wide discussion and general use.

But powerful tools can be hazardous, as witnessed by occasional dire misuses of existing statistical packages. Until recently the only strategies available were to train more expert methodologists or to train substantive scientists in more methodology, but without continual updating such training tends to become outmoded. Now there is the opportunity to capture in expert systems the current best methodological advice and practice. If that opportunity is exploited, standard methodological training of social scientists will shift to emphasizing strategies for using good expert systems—including understanding the nature and importance of the comments such systems provide—rather than how to patch together something on one’s own. With expert systems, almost all behavioral and social scientists should become able to conduct any of the more common styles of data analysis more effectively and with more confidence than all but the most expert do today. However, the difficulties in developing expert systems that work as hoped for should not be underestimated. Human experts cannot readily explicate all of the complex cognitive network that constitutes an important part of their knowledge. As a result, the first attempts at expert systems were not especially successful (as discussed in Chapter 1). Additional work is expected to overcome these limitations, but it is not clear how long it will take.

Exploratory Analysis and Graphic Presentation

The formal focus of much statistics research in the middle half of the twentieth century was on procedures to confirm or reject precise, a priori hypotheses developed in advance of collecting data—that is, procedures to determine statistical significance. There was relatively little systematic work on realistically rich strategies for the applied researcher to use when attacking real-world problems with their multiplicity of objectives and sources of evidence. More recently, a species of quantitative detective work, called exploratory data analysis, has received increasing attention. In this approach, the researcher seeks out possible quantitative relations that may be present in the data. The techniques are flexible and include an important component of graphic representations. While current techniques have evolved for single responses in situations of modest complexity, extensions to multiple responses and to single responses in more complex situations are now possible.

Graphic and tabular presentation is a research domain in active renaissance, stemming in part from suggestions for new kinds of graphics made possible by computer capabilities, for example, hanging histograms and easily assimilated representations of numerical vectors. Research on data presentation has been carried out by statisticians, psychologists, cartographers, and other specialists, and attempts are now being made to incorporate findings and concepts from linguistics, industrial and publishing design, aesthetics, and classification studies in library science. Another influence has been the rapidly increasing availability of powerful computational hardware and software, now available even on desktop computers. These ideas and capabilities are leading to an increasing number of behavioral experiments with substantial statistical input. Nonetheless, criteria of good graphic and tabular practice are still too much matters of tradition and dogma, without adequate empirical evidence or theoretical coherence. To broaden the respective research outlooks and vigorously develop such evidence and coherence, extended collaborations between statistical and mathematical specialists and other scientists are needed, a major objective being to understand better the visual and cognitive processes (see Chapter 1) relevant to effective use of graphic or tabular approaches.

Combining Evidence

Combining evidence from separate sources is a recurrent scientific task, and formal statistical methods for doing so go back 30 years or more. These methods include the theory and practice of combining tests of individual hypotheses, sequential design and analysis of experiments, comparisons of laboratories, and Bayesian and likelihood paradigms.

There is now growing interest in more ambitious analytical syntheses, which are often called meta-analyses. One stimulus has been the appearance of syntheses explicitly combining all existing investigations in particular fields, such as prison parole policy, classroom size in primary schools, cooperative studies of therapeutic treatments for coronary heart disease, early childhood education interventions, and weather modification experiments. In such fields, a serious approach to even the simplest question—how to put together separate estimates of effect size from separate investigations—leads quickly to difficult and interesting issues. One issue involves the lack of independence among the available studies, due, for example, to the effect of influential teachers on the research projects of their students. Another issue is selection bias, because only some of the studies carried out, usually those with “significant” findings, are available and because the literature search may not uncover all relevant studies that are available. In addition, experts agree, although informally, that the quality of studies from different laboratories and facilities differs appreciably and that such information probably should be taken into account. Inevitably, the studies to be included used different designs and concepts and controlled or measured different variables, making it difficult to know how to combine them.
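
Even the simplest combination rule makes the difficulty plain. Under a fixed-effect model (standard notation, not the report's), separate effect-size estimates are pooled by inverse-variance weighting:

\[
\hat\theta \;=\; \frac{\sum_i w_i\, \hat\theta_i}{\sum_i w_i}, \qquad
w_i = \frac{1}{v_i}, \qquad
\operatorname{Var}\bigl(\hat\theta\bigr) = \frac{1}{\sum_i w_i},
\]

where \(\hat\theta_i\) is the estimate from study i and \(v_i\) its estimated variance. Dependence among studies, selection bias, and unequal study quality all violate the assumptions behind these simple weights, which is precisely why richer models and careful appraisal remain necessary.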

Rich, informal syntheses, allowing for individual appraisal, may be better than catch-all formal modeling, but the literature on formal meta-analytic models is growing and may be an important area of discovery in the next decade, relevant both to statistical analysis per se and to improved syntheses in the behavioral and social and other sciences.

Opportunities and Needs

This chapter has cited a number of methodological topics associated with behavioral and social sciences research that appear to be particularly active and promising at the present time. As throughout the report, they constitute illustrative examples of what the committee believes to be important areas of research in the coming decade. In this section we describe recommendations for an additional $16 million annually to facilitate both the development of methodologically oriented research and, equally important, its communication throughout the research community.

Methodological studies, including early computer implementations, have for the most part been carried out by individual investigators with small teams of colleagues or students. Occasionally, such research has been associated with quite large substantive projects, and some of the current developments of computer packages, graphics, and expert systems clearly require large, organized efforts, which often lie at the boundary between grant-supported work and commercial development. As such research is often a key to understanding complex bodies of behavioral and social sciences data, it is vital to the health of these sciences that research support continue on methods relevant to problems of modeling, statistical analysis, representation, and related aspects of behavioral and social sciences data. Researchers and funding agencies should also be especially sympathetic to the inclusion of such basic methodological work in large experimental and longitudinal studies. Additional funding for work in this area, both in terms of individual research grants on methodological issues and in terms of augmentation of large projects to include additional methodological aspects, should be provided largely in the form of investigator-initiated project grants.

Ethnographic and comparative studies also typically rely on project grants to individuals and small groups of investigators. While this type of support should continue, provision should also be made to facilitate the execution of studies using these methods by research teams and to provide appropriate methodological training through the mechanisms outlined below.

Overall, we recommend an increase of $4 million in the level of investigator-initiated grant support for methodological work. An additional $1 million should be devoted to a program of centers for methodological research.

Many of the new methods and models described in the chapter, if and when adopted to any large extent, will demand substantially greater amounts of research devoted to appropriate analysis and computer implementation. New user interfaces and numerical algorithms will need to be designed and new computer programs written. And even when generally available methods (such as maximum-likelihood) are applicable, model application still requires skillful development in particular contexts. Many of the familiar general methods that are applied in the statistical analysis of data are known to provide good approximations when sample sizes are sufficiently large, but their accuracy varies with the specific model and data used. To estimate the accuracy requires extensive numerical exploration. Investigating the sensitivity of results to the assumptions of the models is important and requires still more creative, thoughtful research. It takes substantial efforts of these kinds to bring any new model on line, and the need becomes increasingly important and difficult as statistical models move toward greater realism, usefulness, complexity, and availability in computer form. More complexity in turn will increase the demand for computational power. Although most of this demand can be satisfied by increasingly powerful desktop computers, some access to mainframe and even supercomputers will be needed in selected cases. We recommend an additional $4 million annually to cover the growth in computational demands for model development and testing.

Interaction and cooperation between the developers and the users of statistical and mathematical methods need continual stimulation—both ways. Efforts should be made to teach new methods to a wider variety of potential users than is now the case. Several ways appear effective for methodologists to communicate to empirical scientists: running summer training programs for graduate students, faculty, and other researchers; encouraging graduate students, perhaps through degree requirements, to make greater use of the statistical, mathematical, and methodological resources at their own or affiliated universities; associating statistical and mathematical research specialists with large-scale data collection projects; and developing statistical packages that incorporate expert systems in applying the methods.

Methodologists, in turn, need to become more familiar with the problems actually faced by empirical scientists in the laboratory and especially in the field. Several ways appear useful for communication in this direction: encouraging graduate students in methodological specialties, perhaps through degree requirements, to work directly on empirical research; creating postdoctoral fellowships aimed at integrating such specialists into ongoing data collection projects; and providing for large data collection projects to engage relevant methodological specialists. In addition, research on and development of statistical packages and expert systems should be encouraged to involve the multidisciplinary collaboration of experts with experience in statistical, computer, and cognitive sciences.

A final point has to do with the promise held out by bringing different research methods to bear on the same problems. As our discussions of research methods in this and other chapters have emphasized, different methods have different powers and limitations, and each is designed especially to elucidate one or more particular facets of a subject. An important type of interdisciplinary work is the collaboration of specialists in different research methodologies on a substantive issue, examples of which have been noted throughout this report. If more such research were conducted cooperatively, the power of each method pursued separately would be increased. To encourage such multidisciplinary work, we recommend increased support for fellowships, research workshops, and training institutes.

Funding for fellowships, both pre- and postdoctoral, should be aimed at giving methodologists experience with substantive problems and at upgrading the methodological capabilities of substantive scientists. Such targeted fellowship support should be increased by $4 million annually, of which $3 million should be for predoctoral fellowships emphasizing the enrichment of methodological concentrations. The new support needed for research workshops is estimated to be $1 million annually. And new support needed for various kinds of advanced training institutes aimed at rapidly diffusing new methodological findings among substantive scientists is estimated to be $2 million annually.



Graphical Representation of Data

Graphical Representation of Data: Graphical representation turns numbers and facts into pictures and diagrams. Instead of reading long lists of numbers, we use charts, graphs, and other visuals to understand information better. In this introduction to data visualization, we will look at the different kinds of graphs, charts, and pictures that reveal the patterns and stories hidden in data.

There is an entire branch of mathematics dedicated to collecting, analysing, interpreting, and presenting numerical data in visual form so that it becomes easy to understand and compare; this branch is known as Statistics.

The branch is widely applied and has many real-life uses, such as business analytics, demography, and astrostatistics. In this article, we cover the graphical representation of data, including its types, rules, and advantages.


Table of Contents

  • What is Graphical Representation
  • Types of Graphical Representations: Line Graphs, Bar Graphs, Histograms, Line Plots, Stem and Leaf Plots, Box and Whisker Plots, Pie Charts
  • Graphical Representations Used in Maths: Value-Based or Time Series Graphs, Frequency-Based Graphs
  • Principles of Graphical Representation
  • Advantages and Disadvantages of Using Graphical Representation
  • General Rules for Graphical Representation of Data
  • Frequency Polygon
  • Solved Examples on Graphical Representation of Data

What is Graphical Representation?

Graphical representation is a way of presenting data in pictorial form. It helps a reader understand a large set of data easily, because the patterns in the data become visible at a glance.

There are two ways of representing data:

  • Tabular (numerical) representation.
  • Pictorial representation through graphs.

They say, “A picture is worth a thousand words”, and it is usually better to present data in a graphical format. Surveys and practical evidence suggest that information is retained and understood better when it is available as visuals, because human beings process visual information more readily than any other form.

How much better? A widely quoted claim is that visuals are processed tens of thousands of times faster than text; the exact figure is debated, but the advantage of presenting data visually is well established.


Comparisons between different items are best shown with graphs, since it becomes easier to grasp the essence of the data about different items at a glance. Let’s look briefly at each type of graphical representation:

A line graph is used to show how the value of a particular variable changes with time. We plot this graph by connecting the points at different values of the variable. It can be useful for analyzing the trends in the data and predicting further trends. 
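For instance, a minimal sketch of a line graph in Python (assuming the matplotlib library is available; the monthly temperature values are made up for illustration):

```python
import matplotlib.pyplot as plt

# Hypothetical example data: a variable (temperature) recorded over time (months)
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
temperature = [15, 18, 24, 30, 34, 36]

plt.plot(months, temperature, marker="o")  # join the points with a line
plt.xlabel("Month")
plt.ylabel("Temperature (°C)")
plt.title("Line graph: change of a variable with time")
plt.show()
```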


A bar graph is a type of graphical representation of the data in which bars of uniform width are drawn with equal spacing between them on one axis (x-axis usually), depicting the variable. The values of the variables are represented by the height of the bars. 
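A small illustrative sketch of a bar graph (again assuming matplotlib; the subjects and marks are hypothetical):

```python
import matplotlib.pyplot as plt

# Hypothetical example data: one bar per category, height = value of the variable
subjects = ["Maths", "Science", "English", "History"]
marks = [85, 78, 92, 70]

plt.bar(subjects, marks, width=0.5)   # uniform bar width, equal spacing
plt.xlabel("Subject")
plt.ylabel("Marks")
plt.title("Bar graph")
plt.show()
```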


A histogram is similar to a bar graph, but it is based on the frequency of numerical values rather than their actual values. The data is organised into intervals, and the bars represent the frequency of the values in each interval. That is, it counts how many values of the data lie in a particular range.
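A brief histogram sketch (matplotlib assumed; the marks and class intervals are invented for illustration):

```python
import matplotlib.pyplot as plt

# Hypothetical example data: raw values are grouped into intervals (bins);
# each bar's height is the count of values falling in that interval
marks = [12, 15, 22, 25, 27, 31, 33, 35, 38, 41, 44, 45, 47, 48, 49]
bins = [10, 20, 30, 40, 50]           # class intervals 10-20, 20-30, ...

plt.hist(marks, bins=bins, edgecolor="black")
plt.xlabel("Marks (class intervals)")
plt.ylabel("Frequency")
plt.title("Histogram")
plt.show()
```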


A line plot (also called a dot plot) displays data as points or check marks above a number line, showing the frequency of each value.


This is a type of plot in which each value is split into a “leaf” (in most cases, the last digit) and a “stem” (the remaining digits). For example, the number 42 is split into the leaf 2 and the stem 4.
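A stem-and-leaf display can be built with a few lines of plain Python; the values below are hypothetical, and the split assumes two-digit numbers (stem = tens digit, leaf = units digit):

```python
from collections import defaultdict

# Hypothetical example data; stem = all digits except the last, leaf = last digit
values = [42, 45, 47, 51, 53, 53, 58, 62, 66, 71]

stems = defaultdict(list)
for v in sorted(values):
    stems[v // 10].append(v % 10)     # e.g. 42 -> stem 4, leaf 2

for stem in sorted(stems):
    leaves = " ".join(str(leaf) for leaf in stems[stem])
    print(f"{stem} | {leaves}")
# Output:
# 4 | 2 5 7
# 5 | 1 3 3 8
# 6 | 2 6
# 7 | 1
```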


A box and whisker plot divides the data into four parts (quartiles) to summarise it. It is mainly concerned with the spread, average, and median of the data.
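A minimal box-and-whisker sketch (matplotlib assumed; the scores are invented for illustration):

```python
import matplotlib.pyplot as plt

# Hypothetical example data; the box spans the middle 50% of the data (the quartiles),
# the line inside the box marks the median, and the whiskers show the spread
scores = [35, 41, 44, 48, 50, 52, 55, 57, 60, 63, 67, 72, 90]

plt.boxplot(scores)
plt.ylabel("Scores")
plt.title("Box and whisker plot")
plt.show()
```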


A pie chart represents data as a circular graph. The circle is divided so that each portion represents a proportion of the whole.
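A short pie-chart sketch (matplotlib assumed; the channel names and budget shares are hypothetical):

```python
import matplotlib.pyplot as plt

# Hypothetical example data: each sector's angle is proportional to its share of the whole
channels = ["TV", "Radio", "Social media", "Print"]
budget_share = [40, 15, 30, 15]       # percentages of the total budget

plt.pie(budget_share, labels=channels, autopct="%1.0f%%")
plt.title("Pie chart")
plt.show()
```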


Graphical Representations Used in Maths

Graphs in maths are used to study the relationships between two or more variables that are changing. Statistical data can be summarised more effectively using graphs. There are broadly two ways of thinking about graphs in maths:

  • Value-Based or Time Series Graphs
  • Frequency-Based Graphs

These graphs allow us to study the change of a variable with respect to another variable within a given interval of time. The variables can be anything. Time Series graphs study the change of variable with time. They study the trends, periodic behavior, and patterns in the series. We are more concerned with the values of the variables here rather than the frequency of those values. 

Example: Line Graph

Frequency-Based Graphs

These graphs are concerned with the distribution of the data: how many values lie within a particular range of the variable, and which range has the maximum frequency of values. They are used to judge the spread, the average, and sometimes the median of the variable under study.

Example: Histogram, Frequency Polygon, Box Plot

Principles of Graphical Representation
  • All types of graphical representations follow algebraic principles.
  • When plotting a graph, there’s an origin and two axes.
  • The x-axis is horizontal, and the y-axis is vertical.
  • The axes divide the plane into four quadrants.
  • The origin is where the axes intersect.
  • Positive x-values are to the right of the origin; negative x-values are to the left.
  • Positive y-values are above the x-axis; negative y-values are below.

Advantages and Disadvantages of Using Graphical Representation

Advantages

  • It gives us a summary of the data which is easier to look at and analyze.
  • It saves time.
  • We can compare and study more than one variable at a time.

Disadvantages

  • It usually takes only one aspect of the data and ignores the other. For example, A bar graph does not represent the mean, median, and other statistics of the data. 
  • Interpretation of graphs can vary based on individual perspectives, leading to subjective conclusions.
  • Poorly constructed or misleading visuals can distort data interpretation and lead to incorrect conclusions.

General Rules for Graphical Representation of Data

The goal of any graph should be a clear and accurate picture of the data. Keep the following points in mind while plotting and designing the graphs above:

  • Whenever possible, mention the source of the data for the viewer.
  • Choose appropriate colours and font sizes so that the graph looks neat.
  • Mention the unit of measurement in the top right corner of the graph.
  • Choose a proper scale so that the graph represents the data accurately.
  • Last but not least, give the graph a suitable title.

Frequency Polygon

A frequency polygon is a graph constructed by joining the midpoints of the class intervals. The height at each midpoint (or bin) represents the frequency of the values that lie in that interval.


Solved Examples on Graphical Representation of Data

Question 1: What are the different types of frequency-based plots?

Answer: Common frequency-based plots include the histogram, the frequency polygon, and the box plot.

Question 2: A company with an advertising budget of Rs 10,00,00,000 has planned the following expenditure in the different advertising channels such as TV Advertisement, Radio, Facebook, Instagram, and Printed media. The table represents the money spent on different channels. 

Draw a bar graph for the following data. 

  • Put each of the channels on the x-axis
  • The height of the bars is decided by the value of each channel.


Question 3: Draw a line plot for the following data 

  • Put each x value from the table on the x-axis.
  • Plot the y value corresponding to each x value and join the consecutive points with line segments.


Question 4: Make a frequency plot of the following data: 

  • Draw the class intervals on the x-axis and frequencies on the y-axis.
  • Calculate the midpoint of each class interval.
Class Interval Mid Point Frequency
0-3 1.5 3
3-6 4.5 4
6-9 7.5 2
9-12 10.5 6

Now join the mid points of the intervals and their corresponding frequencies on the graph. 


This graph shows both the histogram and frequency polygon for the given distribution.
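The same construction can be checked in code. The sketch below (assuming matplotlib is available) recomputes the midpoints of the class intervals from the table above and joins the (midpoint, frequency) points to draw the frequency polygon over the histogram:

```python
import matplotlib.pyplot as plt

# Class intervals and frequencies from the table above
intervals = [(0, 3), (3, 6), (6, 9), (9, 12)]
frequencies = [3, 4, 2, 6]

# Midpoint of each class interval, e.g. (0 + 3) / 2 = 1.5
midpoints = [(lo + hi) / 2 for lo, hi in intervals]   # [1.5, 4.5, 7.5, 10.5]

# Histogram bars for the grouped data
plt.bar(midpoints, frequencies, width=3, edgecolor="black", alpha=0.4)

# Frequency polygon: join the midpoints of consecutive intervals
plt.plot(midpoints, frequencies, marker="o", color="red")

plt.xlabel("Class interval midpoint")
plt.ylabel("Frequency")
plt.title("Histogram with frequency polygon")
plt.show()
```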


Conclusion of Graphical Representation

Graphical representation is a powerful tool for understanding data, but it’s essential to be aware of its limitations. While graphs and charts can make information easier to grasp, they can also be subjective, complex, and potentially misleading . By using graphical representations wisely and critically, we can extract valuable insights from data, empowering us to make informed decisions with confidence.

Graphical Representation of Data – FAQs

What are the advantages of using graphs to represent data?

Graphs offer visualization, clarity, and easy comparison of data, aiding in outlier identification and predictive analysis.

What are the common types of graphs used for data representation?

Common graph types include bar, line, pie, histogram, and scatter plots , each suited for different data representations and analysis purposes.

How do you choose the most appropriate type of graph for your data?

Select a graph type based on data type, analysis objective, and audience familiarity to effectively convey information and insights.

How do you create effective labels and titles for graphs?

Use descriptive titles, clear axis labels with units, and legends to ensure the graph communicates information clearly and concisely.

How do you interpret graphs to extract meaningful insights from data?

Interpret graphs by examining trends, identifying outliers, comparing data across categories, and considering the broader context to draw meaningful insights and conclusions.


Data Representation

Literature on Data Representation

Here’s the entire UX literature on Data Representation by the Interaction Design Foundation, collated in one place:

Learn more about Data Representation

Take a deep dive into Data Representation with our course AI for Designers .

In an era where technology is rapidly reshaping the way we interact with the world, understanding the intricacies of AI is not just a skill, but a necessity for designers . The AI for Designers course delves into the heart of this game-changing field, empowering you to navigate the complexities of designing in the age of AI. Why is this knowledge vital? AI is not just a tool; it's a paradigm shift, revolutionizing the design landscape. As a designer, make sure that you not only keep pace with the ever-evolving tech landscape but also lead the way in creating user experiences that are intuitive, intelligent, and ethical.

AI for Designers is taught by Ioana Teleanu, a seasoned AI Product Designer and Design Educator who has established a community of over 250,000 UX enthusiasts through her social channel UX Goodies. She imparts her extensive expertise to this course from her experience at renowned companies like UiPath and ING Bank, and now works on pioneering AI projects at Miro.

In this course, you’ll explore how to work with AI in harmony and incorporate it into your design process to elevate your career to new heights. Welcome to a course that doesn’t just teach design; it shapes the future of design innovation.

In lesson 1, you’ll explore AI's significance, understand key terms like Machine Learning, Deep Learning, and Generative AI, discover AI's impact on design, and master the art of creating effective text prompts for design.

In lesson 2, you’ll learn how to enhance your design workflow using AI tools for UX research, including market analysis, persona interviews, and data processing. You’ll dive into problem-solving with AI, mastering problem definition and production ideation.

In lesson 3, you’ll discover how to incorporate AI tools for prototyping, wireframing, visual design, and UX writing into your design process. You’ll learn how AI can assist to evaluate your designs and automate tasks, and ensure your product is launch-ready.

In lesson 4, you’ll explore the designer's role in AI-driven solutions, how to address challenges, analyze concerns, and deliver ethical solutions for real-world design applications.

Throughout the course, you'll receive practical tips for real-life projects. In the Build Your Portfolio exercises, you’ll practice how to integrate AI tools into your workflow and design for AI products, enabling you to create a compelling portfolio case study to attract potential employers or collaborators.

All open-source articles on Data Representation

Visual mapping – the elements of information visualization.


Rating Scales in UX Research: The Ultimate Guide






Representation of Data/Information

Computers do not understand human language; they understand data in a prescribed form. Data representation is the method used to represent and encode data in a computer system. A user generally inputs numbers, text, images, audio, and video, but the computer first converts this data into machine language and then processes it.

Some Common Data Representation Methods

Data representation plays a vital role in storing, processing, and communicating data. A correct and effective data representation method improves data processing performance and system compatibility.

Computers represent data in the following forms:

Number System

A computer system treats numbers as data; this includes integers, decimals, and complex numbers. All input numbers are represented internally in binary form, that is, using only 0s and 1s. Number systems are categorized into four types −

  • Binary − The binary number system is the basis of all data representation in a digital system. It consists of only two values, 0 and 1, so its base is 2. A value can be written externally as (10110010)₂. A computer system uses binary digits (0s and 1s) to represent data internally.
  • Octal − The octal number system uses 8 digits: 0, 1, 2, 3, 4, 5, 6, and 7, so its base is 8. A value can be written externally as (324017)₈.
  • Decimal − The decimal number system uses 10 digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9, so its base is 10. A value can be written externally as (875629)₁₀.
  • Hexadecimal − The hexadecimal number system uses 16 digits, 0–9 and A–F, so its base is 16; it is commonly used as a compact way of writing binary values.

The table below summarises these number systems along with their base and digits.

Number System
System Base Digits
Binary 2 0 1
Octal 8 0 1 2 3 4 5 6 7
Decimal 10 0 1 2 3 4 5 6 7 8 9
Hexadecimal 16 0 1 2 3 4 5 6 7 8 9 A B C D E F
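As a quick illustration, Python's built-in conversion helpers can show the same value in each of these number systems (a minimal sketch):

```python
# One value shown in the four number systems discussed above
n = 178                     # decimal (base 10)

print(bin(n))               # 0b10110010  -> binary, base 2
print(oct(n))               # 0o262       -> octal, base 8
print(hex(n))               # 0xb2        -> hexadecimal, base 16

# Converting back: int() accepts a digit string and its base
print(int("10110010", 2))   # 178
print(int("262", 8))        # 178
print(int("B2", 16))        # 178
```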

Bits and Bytes

A bit is the smallest data unit that a computer uses in computation; all the computation tasks done by the computer systems are based on bits. A bit represents a binary digit in terms of 0 or 1. The computer usually uses bits in groups. It's the basic unit of information storage and communication in digital computing.

A group of eight bits is called a byte. Half of a byte is called a nibble; it means a group of four bits is called a nibble. A byte is a fundamental addressable unit of computer memory and storage. It can represent a single character, such as a letter, number, or symbol using encoding methods such as ASCII and Unicode.

Bytes are used to determine file sizes, storage capacity, and available memory space. A kilobyte (KB) is equal to 1,024 bytes, a megabyte (MB) is equal to 1,024 KB, and a gigabyte (GB) is equal to 1,024 MB. File size is roughly measured in KBs and availability of memory space in MBs and GBs.


The following table shows the conversion of Bits and Bytes −

Value Equivalent
1 Byte 8 Bits
1024 Bytes 1 Kilobyte
1024 Kilobytes 1 Megabyte
1024 Megabytes 1 Gigabyte
1024 Gigabytes 1 Terabyte
1024 Terabytes 1 Petabyte
1024 Petabytes 1 Exabyte
1024 Exabytes 1 Zettabyte
1024 Zettabytes 1 Yottabyte
1024 Yottabytes 1 Brontobyte
1024 Brontobytes 1 Geopbytes
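A minimal sketch in Python that uses the 1,024-based factors from the table above to convert a raw byte count into a human-readable size (the helper name human_readable is just an illustrative choice):

```python
def human_readable(num_bytes: float) -> str:
    """Convert a raw byte count to a human-readable size (1 KB = 1024 bytes)."""
    for unit in ("Bytes", "KB", "MB", "GB", "TB", "PB"):
        if num_bytes < 1024 or unit == "PB":
            return f"{num_bytes:.2f} {unit}"
        num_bytes /= 1024          # 1 KB = 1024 bytes, 1 MB = 1024 KB, ...

print(human_readable(8))            # 8.00 Bytes
print(human_readable(5_242_880))    # 5.00 MB  (5 * 1024 * 1024 bytes)
print(human_readable(3 * 1024**3))  # 3.00 GB
```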

A text code is a scheme that represents text characters (letters, digits, punctuation marks, and other symbols) as numeric codes inside the computer. Some of the most commonly used text code systems are −

  • EBCDIC
  • ASCII
  • Extended ASCII
  • Unicode

EBCDIC stands for Extended Binary Coded Decimal Interchange Code. IBM developed EBCDIC in the early 1960s and used it in mainframe systems such as System/360 and its successors. To meet commercial and data-processing demands, it supports letters, numbers, punctuation marks, and special symbols. Its character codes distinguish EBCDIC from other character encoding methods such as ASCII, so data encoded in EBCDIC is not directly compatible with systems that expect ASCII; conversion is required to exchange data between them. EBCDIC encodes each character as an 8-bit binary code and defines 256 symbols.


ASCII stands for American Standard Code for Information Interchange. It is a 7-bit code that specifies character values from 0 to 127. ASCII is a character-encoding standard that assigns numeric values to characters, such as letters, digits, punctuation marks, and control characters, used in computers and communication equipment.

ASCII originally defined 128 characters, encoded with 7 bits, allowing for 2^7 (128) potential characters. The ASCII standard specifies characters for the English alphabet (uppercase and lowercase), numerals from 0 to 9, punctuation marks, and control characters for formatting and control tasks such as line feed, carriage return, and tab.

ASCII Table (excerpt)
ASCII Code Decimal Value Character
0000 0000 0 Null prompt
0000 0001 1 Start of heading
0000 0010 2 Start of text
0000 0011 3 End of text
0000 0100 4 End of transmit
0000 0101 5 Enquiry
0000 0110 6 Acknowledge
0000 0111 7 Audible bell
0000 1000 8 Backspace
0000 1001 9 Horizontal tab
0000 1010 10 Line Feed
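A short Python sketch showing how characters map to their ASCII codes and back, including the 8-bit binary form used in the table above:

```python
# Map characters to their ASCII codes and back
for ch in ["A", "a", "0", "\t", "\n"]:
    code = ord(ch)                              # character -> numeric code
    print(repr(ch), code, format(code, "08b"))  # e.g. 'A' 65 01000001

print(chr(65))      # numeric code -> character: 'A'
print(chr(10))      # 10 is the Line Feed control character (prints a blank line)
```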

Extended ASCII (Extended American Standard Code for Information Interchange) is an 8-bit code that adds character values from 128 to 255. It keeps the normal ASCII character set of 128 characters encoded in 7 bits and adds further characters that use the full 8 bits of a byte, giving a total of 256 potential characters.

Different extended ASCII variants exist, each introducing more characters beyond the conventional ASCII set. These additional characters may include symbols, letters, and special characters specific to a particular language or region.


Unicode is a worldwide character standard that uses 8 to 32 bits (1 to 4 bytes in UTF-8) to represent letters, numbers, and symbols. It is a standard character encoding designed to provide a consistent way to represent text in nearly all of the world’s writing systems. Every character is assigned a unique numeric code, called a code point, independent of platform, program, or language. Unicode offers a wide variety of characters, including alphabets, ideographs, symbols, and emojis.

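A brief Python sketch showing Unicode code points and how the same character can occupy a different number of bytes in UTF-8, UTF-16, and UTF-32 (the sample characters are arbitrary):

```python
# Unicode assigns each character a unique code point; encodings store it in 1-4 bytes (UTF-8)
for ch in ["A", "é", "₹", "😀"]:
    print(
        ch,
        f"U+{ord(ch):04X}",              # the Unicode code point
        len(ch.encode("utf-8")),         # bytes in UTF-8  (1, 2, 3, or 4)
        len(ch.encode("utf-16-le")),     # bytes in UTF-16 (2 or 4)
        len(ch.encode("utf-32-le")),     # bytes in UTF-32 (always 4)
    )
```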

  • Open access
  • Published: 16 August 2024

Gender-based differences in the representation and experiences of academic leaders in medicine and dentistry: a mixed method study from Pakistan

  • Muhammad Shahzad 1 , 2 ,
  • Brekhna Jamil 3   na1 ,
  • Mehboob Bushra 4   na1 ,
  • Usman Mahboob 3   na1 &
  • Fayig Elmigdadi 1   na1  

BMC Medical Education volume  24 , Article number:  885 ( 2024 ) Cite this article


Background

Research evidence suggests gender-based differences in the extent and experiences of academic leaders across the globe, even in developed countries like the USA, UK, and Canada. The under-representation of women is particularly common in higher education organizations, including medical and dental schools. The current study aimed to investigate the gender-based distribution of academic leaders and explore their experiences in medical and dental institutes in a developing country, Pakistan.

Methods

A mixed-method approach was used. Gender-based distribution data of academic leaders in 28 colleges, including 18 medical and 10 dental colleges of Khyber Pakhtunkhwa, Pakistan, were collected. Qualitative data regarding the experiences of academic leaders (n = 10) was collected through semi-structured interviews, followed by transcription and thematic analysis using standard procedures.

Results

Gender-based disparities exist across all institutes, with the greatest differences at the top-rank leadership level (principals/deans), where 84.5% of the positions were occupied by males. The gender gap was relatively narrow at mid-level leadership positions, with the share of female leaders reaching as high as > 40% in medical/dental education. The qualitative analysis found gender-based differences in experiences under four themes: leadership attributes, leadership journey, challenges, and support.

Conclusions

The study showed that women are not only significantly under-represented in leadership positions in medical and dental colleges in Pakistan, but they also face gender-based discrimination and struggle to maintain a decent work-life balance. These findings are critical and can have important implications for governments, organizations, human resource managers, and policymakers in terms of enacting laws, proposing regulations, and establishing support mechanisms to improve gender balance and help current and aspiring leaders in their leadership journey.

Peer Review reports

An inclusive and diverse workforce is crucial for organizational success in today’s modern world. However, despite a gradual increase in the representation of women in the workforce globally, gender discrimination and unequal gender ratios still exist in workplaces both horizontally (i.e., across different sectors and industries) and vertically (women are usually excluded from the top positions at the organizational level) [ 1 , 2 , 3 ]. Workplace gender discrimination, in any form, is not only illegal and against basic human rights [ 4 ]; it also contributes towards the lower socioeconomic status of women globally [ 5 ]. The under-representation of women, especially at leadership positions, is particularly common in higher education organizations, including medical and dental schools, as reported previously [ 6 , 7 ]. Even those who reach a leadership position in academic healthcare institutes publish research less frequently [ 8 ], receive fewer research grants [ 9 ], and have 23% fewer chances of promotion [ 10 ] than their male counterparts.

Recently, researchers tried to explore the extent of gender-based parity in leadership positions in academic medicine and sciences in many different countries of the world. Surprisingly, gender imbalance exists at all levels and nations, including countries ranking high on gender equality indices, such as Nordic countries [ 11 ]. According to a report by the European Commission, only 25% of the professors in European member states are women [ 12 ]. Similarly, in medical schools in the United States of America (USA), only 18% of the department heads are females [ 13 ]. Research studies also revealed that gender-based inequalities are multifaceted, and many extrinsic and intrinsic factors contributing to the issue have been identified [ 10 , 14 ]. However, most research studies exploring women and leadership in academic medicine have been conducted in developed Western countries, including Europe, the USA, and Canada [ 15 ]. There is a dire need for research to document the extent and how male and female professionals develop as leaders during their careers in academic medicine and dentistry, especially from a non-western perspective. In this context, a country like Pakistan offers an excellent opportunity to study and assess the unique barriers these women must overcome to attain leadership positions in academic medicine and dentistry.

Pakistan is a former British colony and a Muslim-majority country in South Asia with over 241 million population, of whom 48.54% are women [ 16 ]. The country’s strictly conservative and patriarchal society strongly influences culture, traditions, and gender dynamics [ 17 ] thus significantly reducing women’s opportunities for career choice and progression [ 18 ]. The overall job market in Pakistan is male-dominated, and only certain professions, such as teaching and medical practice, are deemed more suitable for women in Pakistan. Women constitute nearly 70% of the students in Pakistani medical schools, but only 50% register with PMDC, and even fewer continue medical practice as a career [ 19 ].

There exists a critical gap in our understanding of the gender-based differences in the extent and experiences of leaders from a non-Western perspective. In the current study, we investigated the gender-based distribution of academic leaders in medical and dental colleges in Khyber Pakhtunkhwa (KP), Pakistan, and explored the experiences of men and women occupying leadership positions in these colleges. The study results would help design gender-based leadership development strategies and promote gender equity in academic institutes across Pakistan.

Reflexivity

The individuals who conceptualized and designed this study (MS & BJ) hold leadership positions themselves. We are cognizant of both authors’ positions and, to avoid our personal biases being reflected in the study findings, we constantly engaged in reflexive practices [ 20 ]. For example, the author (MS) wrote personal memos during the data collection, coding, and interpretation. Each transcript was thoroughly discussed, and codes were shared between the researchers. Moreover, before the data analysis started, the coding team set aside and bracketed their personal assumptions to further improve the trustworthiness of the study findings.

Based on the study objectives, a mixed-method sequential study design [ 21 ], including (a) a survey of the gender-based distribution of academic leadership positions in medical and dental colleges and (b) qualitative interviews with selected individuals occupying leadership positions to explore gender-based differences in their experiences, was employed. Integration occurred during the development of the interview protocol and the description of the results of both phases to gain an in-depth insight into the extent and experiences of gender-based distribution in academic medicine and dentistry. Ethics approval for the study was granted by the institutional research board of Khyber Medical University (Ref No: 1–11/IHPER/MHPE/KMU/23–62), and the study design followed the GRAMMS (Good Reporting of a Mixed Methods Study) guidelines.

Study setting and sampling

The sampling frame for the quantitative survey was all the medical and dental colleges of KP, both public and private, recognized by the Pakistan Medical & Dental Council. At the time of data collection (June – October 2023), there were 30 colleges in KP offering undergraduate programs in medicine (MBBS) and dentistry (BDS). Information about the gender and rank of those working in academic leadership positions was collected from the official websites of the institutes, followed by confirmation of the organogram with the HR administration of the respective college. Only two private institutes, including a dental college, declined to provide the required details and were excluded from the study. Academic leaders were the individuals occupying the position of dean, director, or principal, or head of a department or chair. Academic leaders in medical and dental colleges in Pakistan may be divided into three leadership tiers based on roles and responsibilities. The top level includes the deans, directors, or principals responsible for looking after the overall organizational management, including teaching, research, and administration. The mid-level academic leaders take responsibility for a specific educational/training program and include the heads of departments or chairpersons. The line-level leadership is responsible for teaching within the program, including teachers, course coordinators, etc.

For the qualitative arm of the study, we could only obtain the email IDs of 78 academic leaders connected to the undergraduate medical and dental programs at the top and mid-level. The inclusion criteria were persons of any age and gender who had been serving in a leadership position for at least one year in a medical or dental college in KP. Those serving in a leadership position on additional/ad-hoc/acting charge who did not meet the required qualifications and experience were excluded from the study. Potential participants were sent an email describing the study and an invitation to participate. Only 21 responded positively to our email, and they were all included in our final sample. Written informed consent was obtained from all participants who agreed to take part in the study. Data saturation was observed after 7 interviews [ 22 ], but we continued our data collection to ensure equal representation of both genders (5 male, 5 female). A thank-you email was sent to the remaining prospective participants informing them that they would not be interviewed as data saturation had been reached.

Data collection

For quantitative data collection, the college name, status (public or private), and the rank and gender of those working in leadership positions were taken from the websites and entered into Microsoft Excel 2013. Qualitative data was collected through semi-structured interview sessions conducted by a single researcher (MS) face-to-face or online using the Microsoft® Teams platform, depending on the participant’s choice. Based on the available published literature, a 10-item interview guide was developed and piloted on two participants who were not part of the original study (Supplementary file 1). Information on age, gender, qualification, experience, and current role was also collected. Face-to-face interviews were audio recorded using the record function of a mobile phone (iPhone 13, Apple Inc., USA). In the case of online interviews, the recording was made using the record option available in the meeting link. To capture the real essence of the experiences, the participants were free to respond in English, Urdu (the national language of Pakistan), or Pashto, most of the interviewees’ mother tongue or first language. Data was securely recorded, translated, and transcribed verbatim.

Data analysis

The survey data was analyzed using Microsoft Excel 2013 and presented as frequency and percentages. The female-to-male ratio was calculated across different institutes, programs, academic ranks, and departments. Qualitative data was transcribed, and thematic analysis was done using ATLAS Ti Version 8.4.5. The Braun and Clarke framework [ 21 ] was used for thematic analysis (Fig.  1 ). The analysis process involved a systematic approach to coding and interpretation by two researchers (MS & BM). The coding list was further discussed with supervisors (BJ & UM) for improved quality.

Figure 1: Braun and Clarke thematic analysis steps

Quantitative analysis

In total, 28 colleges (18 medical and 10 dental) with 497 academic leaders (312 in medicine and 185 in dentistry) were surveyed. The overall gender-based distribution of academic leaders is presented in supplementary Tables 1 & 2 and summarized in Fig. 2. Overall, women remain under-represented at all leadership tiers. Gender-based differences were highest at top leadership positions, including principals/deans/directors, where women occupy only 15.5% (n = 9) of the total leadership positions. However, women are relatively well represented in mid-level leadership positions. The mid-level leadership positions are further categorized into three groups (basic sciences, clinical sciences, and medical/dental education departments). Across these three categories, the leadership gap tends to be narrower in the medical/dental education departments, where more than 40% (n = 12) of the heads of the departments were females.

Figure 2: Bar graph showing the gender-based distribution of academic leaders in medical and dental colleges of KP, Pakistan

We next assessed gender-based differences in academic leadership positions in medical and dental colleges (Fig.  3 A & B). At top-level management in dental colleges (Fig.  3 A), the percentage of women leaders is almost twice (26.7%; n  = 4) that of women leaders in medical colleges (11.6%; n  = 5) (Fig.  3 B).

Figure 3: Bar graph representing gender-based distribution at different leadership tiers in (A) dental colleges and (B) medical colleges

Gender-based disparities were also observed at the level of the department as well as between private and public sector institutes (Fig. 4 & supplementary Tables 1 & 2). In general, except for medical education, the female-to-male ratio at leadership positions was higher in public sector medical (average 0.44; range 0.06–1.71) and dental colleges (average 0.67; range 0.1–1.6) than in private sector medical (average 0.36; range 0.14–0.67) and dental (average 0.32; range 0.1–0.5) colleges in the KP province (Fig. 4). Similarly, in certain clinical fields such as obstetrics & gynaecology and prosthodontics, leadership positions were entirely occupied by only one gender.

Figure 4: Bar graph representing the female-to-male ratio in public and private sector medical and dental colleges

Qualitative analysis

Demographic characteristics of the participants are presented in Table  1 . Wide variations exist among the participants in terms of age, experience, academic credentials and number of years in a leadership position. The mean age of the participants was 45.4 ± 5.2 years and average experience was 16.4 ± 5.3 years.

Qualitative findings

The experiences of the participants can be described under four main themes and eleven subthemes as presented in Table  2 .

Theme 1: Leadership attributes

The participants identified various attributes that are important for a leader. Both male and female participants expressed that a leader is internally motivated to substantially impact the lives of those around them. A vast majority of the participants agreed that they possessed a clear and strong desire not only to bring about positive transformations in the lives of others through their leadership but also to personally witness and experience the visible results of their efforts.

If you are a leader , you are required to influence people and guide them properly (ID-1 Male)

Another leadership attribute identified by participants was good teamwork.

…One of the leadership qualities is working with your colleagues and subordinates as a team. And when you take them into confidence , they support and encourage you and facilitate you in every way… (ID7- Male)

Positive mindset, flexibility, diplomacy, and self-awareness were also cited as essential traits for leaders.

As a leader , you are required to know about yourself.… (ID-1- Female) When you are in a team. Anyone can come up with a good idea , an even better idea than yours. You must have the capacity to Embrace that (ID-10- Male)

Theme 2: Leadership journey

The participants in the study shared their views on leadership, stating that it involves multiple responsibilities and requires both education and practical experience. The path to leadership was seen as challenging, but male and female participants had different perspectives on how they achieved it. Female leaders were found to actively seek opportunities, while leadership roles were more readily offered to male participants.

The transition into leadership roles as a male has been easy (ID9-Male)

Female leaders also struggle to balance their domestic responsibilities with their leadership roles, which was not the case for male leaders.

…. I had to do all the house chores myself. The cleaning , the cooking , the washing , the daily activities , and then return home to do more work with the kids. (ID5-Female)

Both male and female participants acknowledged the existence of gender discrimination in leadership roles.

When we apply for projects to get funds , we are unable to get a percentage like the men. (ID2-Female) The gender disparity does come into play , especially when you’re working as a head of an institute where all the men are working with you , even if no one says anything , the default thinking is maybe she’s a woman and that’s why she’s saying this (ID5-Female)

Participants also highlighted various advantages of leadership in their career growth, including training, financial stability, and autonomy.

…but then they got some advantages to their financially stable, they are socially responsible, you have a strong social background… (ID4- Male)

Theme 3: challenges

Both male and female participants highlighted the underrepresentation of women in leadership positions.

Yes , because our society is male-dominated. In my own opinion if there was a female in my place , she would have surrendered and said that she would have forgiven that…. (ID6- Male)

Moreover, female participants faced discrimination while pursuing leadership positions. They were able to overcome these challenges by relinquishing some degree of control and seeking assistance from other individuals.

When I was going for my PhD , in my interview , they said , oh , you’re a woman. OK , you’re married. Ohh. You have kids. What are you going to do with your kids… (ID8-Female) ….However , within the limited end available resources , I made efforts with the support from the institute and high-ups (ID7- Female)

The challenge of work-family balance was identified by both male and female participants. However, the male participants reported that their families were taken care of by their spouses, which allowed them to focus on their work, whereas the female participants had to balance multiple responsibilities.

My wife is a housewife , so that she can give the family time. And my parents are also looked after by my wife. So , if financially I’m supporting them , the emotional support is always there , and it doesn’t make any difference if I am busy (ID1-Male) …. Yes , a terrible effect. It has badly affected my health. My family has excluded me from their social activities and their social circle… (ID2-Female)

Theme 4: support

Mentoring was found to be vital to leaders for their personal growth and advancement, as mentors provided focused guidance and support to aspiring leaders, helping them reach their fullest potential. It was interesting to note that most female participants acknowledged the mentoring role of their fathers during their leadership journey.

Mentor for me as a person is my father. He retired as the Chief Secretary of the province. He was a very senior bureaucrat , a very well-read person , and a scholar… (ID5- Female)

Both male and female participants stressed the importance of developing support systems to build resilience. These include the institution, the family, the administration, and a collaborative work environment. Interestingly, most female participants highlighted that support from female colleagues is necessary to sustain leadership roles.

A leader cannot become a leader in a vacuum. Your bosses , your followers , your teammates , and your colleagues all have to support a leader to become the leader and there has to be a network of support available in terms of learning opportunities in terms of funding opportunities in terms of Carrying out the vision of the leader (ID-8 Male) My husband has supported me a lot… (ID-3 Female) Of Course , we don’t get to wherever we are until we have somebody as support , especially in academics. (ID2- Female) Yes , and not only that , they (female colleagues) give me support at my workplace. We give support to each other… (ID-5- Female)

Discussion

Gender equity and diversity in leadership are crucially important for designing and implementing policies that ensure a safe learning environment for students and produce a critical mass of healthcare professionals trained in catering to the needs of a diverse patient population. The integration of results from the quantitative and qualitative arms of the present study suggests that gender-based discrepancies in the extent and experiences of academic leaders exist in Pakistan. The quantitative analysis showed that women are significantly under-represented, especially in senior-level leadership positions, where women occupy only 15.5% of the positions. These results follow the same trends as observed in neighbouring South Asian countries such as Bangladesh and India, where women occupy 16% and 18% of the top leadership positions in healthcare [ 23 , 24 ]. These findings were further confirmed during the qualitative study, which reported gender-based inequalities at both the faculty and administration levels, thus confirming the alarming gender gaps in leadership positions in higher education that exist not only in Pakistan [ 25 ] but also in developed countries like the USA, Canada, and the UK [ 15 ].

The qualitative study also revealed possible reasons behind the observed gender-based inequalities. For example, the subtheme “Gender discrimination” was acknowledged by both male and female participants. The finding that opportunities and sponsorship are more readily available to males than to females has also been reflected in the literature previously [ 26 ]. Compared to males, it is less likely that a female at a junior rank is recognized for her leadership potential and actively helped (sponsored) in the promotion to a higher rank by a senior who has the power and influence to do so [ 27 ].

Another emerging theme was the role of “mentors” in the leadership journey. The male participants reported their colleagues, teachers, and clinical supervisors as their mentors, a finding consistent with the previously published literature [ 28 , 29 ]. However, the female participants frequently reported the role of their father in their academic success and leadership journey. Since Pakistan is a patriarchal society with traditional gender roles in place, this finding is not surprising. In the context of gender, the father always plays a major role in helping, guiding, and supporting daughters in the pursuit of education, professional excellence, or a leadership journey [ 17 ], especially when they are not married.

Most of the study participants started their leadership journey at entry-level positions. These findings are in concordance with the study by Tagoe and colleagues [ 30 ], who indicate that most leadership experience, especially in the case of women leaders, can be gained while working in middle-management positions. The presence of women in mid-level leadership positions and their preference for academic medicine [ 10 ] means that gender disparities in high-level positions may gradually be reduced in the future.

Our study reports comparatively more women in mid-level than in senior-level leadership positions (Fig. 2). During the qualitative interviews, the participants, especially the women leaders, revealed that they are struggling to maintain a work-life balance. Long working hours (sometimes exceeding 80 h per week), along with domestic responsibilities such as taking care of children and family [ 31 ], force women to remain in middle rather than senior leadership positions [ 32 , 33 ]. These issues also force many women doctors in Pakistan to quit their jobs [ 18 ].

The uneven gender-based distribution of academic leaders was also observed between private and public sector institutes as well as individual departments within the same institute. In general, except for medical education, the female-to-male ratio was higher in public sector medical and dental colleges of the KP province. These findings are similar to recently published reports from Sweden and the USA, indicating a narrower gender gap in the public than private sector [ 34 , 35 ].

Similarly, in certain clinical fields such as obstetrics & gynaecology and prosthodontics, leadership positions were entirely occupied by only one gender. This gender polarization is not surprising. Due to cultural norms and traditions, female patients mostly prefer women doctors for obstetrics and gynaecology services [ 36 ], thereby making it the field of choice for women doctors in Pakistan [ 37 , 38 ]. On the contrary, the number of women leaders in surgery and allied fields was disproportionately low in the current study. Historically, surgery has been considered a masculine field, and women are scarce in its leadership positions across the globe, even in developed countries like the USA [ 39 ].

This study found that gender disparities exist in leadership positions as well as in experiences in medical and dental colleges of KP, Pakistan. The women leaders frequently reported experiences of gender-based inequality and challenges in maintaining a decent work-life balance due to multiple domestic responsibilities. Although the majority of top-level leadership positions are occupied by men, the relatively high representation of women at the mid-level indicates that gender-based disparities in leadership could become more balanced in the future. These findings are critical and can have important implications for governments, organizations, and policymakers in terms of enacting laws, proposing regulations, and establishing support mechanisms that help current and aspiring leaders in their leadership journey.

Strengths and limitations

Our study provides a multi-faceted perspective on the extent and experiences of gender-based disparities in medical and dental institutes in Pakistan. The inclusion of all medical and dental institutes increased the generalizability of the study findings, while the more than 93% response rate of the survey reduced selection bias, thus increasing validity. Moreover, the qualitative arm of the study provided deep insights into the experiences of academic leaders in a developing and strictly patriarchal society. However, our study has some limitations. The study findings cannot be generalized across Pakistan due to widely prevalent differences across ethnicities, cultures, and socioeconomic backgrounds. Another limitation that should be considered is that the survey information was collected based on the department-wise availability of leadership roles. However, the subdivisions in some academic departments, for example Medicine & Allied, were not included in our study, and therefore variations are expected in the exact female-to-male ratio in the medical and dental colleges.

Data availability

This published article and its supplementary information files include all data generated or analyzed during this study.

Cassells R, Duncan AS. Gender equity insights 2019: Breaking through the glass ceiling. Bankwest Curtin Economics Centre Report series [Internet]. 2019 Mar [cited 2022 Dec 20]; https://ideas.repec.org//p/ozl/bcecrs/ge04.html

Coder L, Spiller M. Leadership Education and Gender Roles: Think Manager, Think ‘?’ The Academy of Educational Leadership Journal [Internet]. 2013 Jul 1 [cited 2023 Dec 18]; https://www.semanticscholar.org/paper/Leadership-Education-and-Gender-Roles%3A-Think-Think-Coder-Spiller/0dc762a868621fab949e0159187a40ff95a7393e

OECD. Gender - OECD [Internet]. 2023 [cited 2024 Apr 22]. https://www.oecd.org/gender/

Schrijver NJ. Fifty years International Human rights covenants: improving the Global Protection of Human rights by bridging the gap. Between Two Continents. 2016;41(4):457–64.


Stamarski CS, Son Hing LS. Gender inequalities in the workplace: the effects of organizational structures, processes, practices, and decision makers’ sexism. Front Psychol. 2015;6. https://doi.org/10.3389/fpsyg.2015.01400

AAMC. AAMC. 2019 [cited 2023 Feb 6]. Advancing women in academic medicine. https://www.aamc.org/news-insights/advancing-women-academic-medicine

Jacobs JW, Jagsi R, Stanford FC, Sarno D, Spector ND, Silver JK, et al. Gender Representation among United States Medical Board Leadership. J Womens Health (Larchmt). 2022;31(12):1710–8.


Nielsen MW, Andersen JP, Schiebinger L, Schneider JW. One and a half million medical papers reveal a link between author gender and attention to gender and sex analysis. Nat Hum Behav. 2017;1(11):791–6.

Magua W, Zhu X, Bhattacharya A, Filut A, Potvien A, Leatherberry R, et al. Are Female Applicants Disadvantaged in National Institutes of Health Peer Review? Combining Algorithmic text mining and qualitative methods to detect evaluative differences in R01 reviewers’ critiques. J Womens Health (Larchmt). 2017;26(5):560–70.

Hartmark-Hill J, Hale T, Gallitano A. Women Physicians and Promotion in Academic Medicine. N Engl J Med. 2021;384(7):680.

Silander C, Haake U, Lindberg L, Riis U. Nordic research on gender equality in academic careers: a literature review. Eur J High Educ. 2022;12(1):72–97.

European Commission. She Fig. 2018 - Publications Office of the EU [Internet]. 2019 [cited 2023 Apr 18]. https://op.europa.eu/en/publication-detail/-/publication/9540ffa1-4478-11e9-a8ed-01aa75ed71a1

Carr PL, Raj A, Kaplan SE, Terrin N, Breeze JL, Freund KM. Gender differences in Academic Medicine: Retention, Rank, and Leadership comparisons from the National Faculty Survey. Acad Med. 2018;93(11):1694–9.

Győrffy Z, Dweik D, Girasek E. Workload, mental health and burnout indicators among female physicians. Hum Resour Health. 2016;14:12.

Alwazzan L, Al-Angari SS. Women’s leadership in academic medicine: a systematic review of extent, condition and interventions. BMJ Open. 2020;10(1):e032232.

Pakistan Bureau of Statistics. Population Census [Internet]. 2023 [cited 2023 Dec 4]. https://www.pbs.gov.pk/content/population-census

Chaudhary N, Dutt A. Women as agents of change: exploring women leaders’ resistance and shaping of gender ideologies in Pakistan. Front Psychol. 2022;13. https://www.frontiersin.org/articles/ https://doi.org/10.3389/fpsyg.2022.800334

Mohsin M, Syed J. The missing doctors — an analysis of educated women and female domesticity in Pakistan. Gend Work Organ. 2020;27(6):1077–102.

Rafia Z. The doctor brides - DAWN.COM [Internet]. 2013 [cited 2023 Feb 6]. https://www.dawn.com/news/1032070

Dowling M. Approaches to reflexivity in qualitative research. Nurse Res. 2006;13(3):7–21.

Creswell JW, Clark VLP. Designing and conducting mixed methods research. SAGE; 2007. p. 297.

Saunders B, Sim J, Kingstone T, Baker S, Waterfield J, Bartlam B, et al. Saturation in qualitative research: exploring its conceptualization and operationalization. Qual Quant. 2018;52(4):1893–907.

Gulati K, Davies J, de la González Á. Addressing leadership competency gaps and gender disparities in India’s medical workforce: a call to action. Lancet Reg Health Southeast Asia. 2023;16:100247.

Shiner C, Chemonics I. 2023 [cited 2023 Dec 15]. Women Rise: Celebrating Women in Bangladesh’s Health Sector Who Champion Empowerment. https://chemonics.com/blog/women-rise-celebrating-women-in-bangladeshs-health-sector-who-champion-empowerment/

Batool SQ. Gender and Higher Education In Pakistan. 2013.

Hobgood C, Draucker C. Gender differences in experiences of Leadership Emergence among Emergency Medicine Department Chairs. JAMA Netw Open. 2022;5(3):e221860.

Levine RB, Ayyala MS, Skarupski KA, Bodurtha JN, Fernández MG, Ishii LE, et al. It’s a little different for men—sponsorship and gender in Academic Medicine: a qualitative study. J GEN INTERN MED. 2021;36(1):1–8.

Aminololama-Shakeri S, Seritan AL. Mentorship in academic medicine. Mentoring in formal and informal contexts. Charlotte. NC, US: IAP Information Age Publishing; 2016. pp. 227–46. (Adult learning in professional, organizational, and community settings).

Mellon A, Murdoch-Eaton D. Supervisor or mentor: is there a difference? Implications for paediatric practice. Arch Dis Child. 2015;100(9):873–8.

Tagoe M, Abakah E. Issues of women’s political participation and decision-making in local governance: perspectives from the Central Region of Ghana. Int J Public Adm. 2015;38(5):371–80.

Boateng DA. Experiences of female academics in Ghana: negotiation and strengths as strategies for successful careers. Afr J Social Work. 2018;8(1):21–30.

Barron M. Senior-Level African American Women, Underrepresentation, and Career Decision-Making. Walden Dissertations and Doctoral Studies [Internet]. 2019; https://scholarworks.waldenu.edu/dissertations/6305

Charles A. Perceptions of Women of Color on Career Advancement in High Technology Management. Walden Dissertations and Doctoral Studies [Internet]. 2017; https://scholarworks.waldenu.edu/dissertations/4276

a Gardner D, Lucas T, Mabikacheche, McConnell M. Making government an even better place for women to work | McKinsey [Internet]. 2023 [cited 2023 Dec 16]. https://www.mckinsey.com/industries/public-sector/our-insights/making-government-an-even-better-place-for-women-to-work

Claésson L. Managerial representation: are women better off in the Public. or the Private Sector?; 2019.

Riaz B, Sherwani NZF, Inam SHA, Rafiq MY, Tanveer S, Arif A, et al. Physician gender preference amongst females attending Obstetrics/Gynecology clinics. Cureus. 2021;13(5):e15028.

Ahmed I, Ashar A. To be or not to be an obstetrician / gynaecologist. Pak J Med Sci. 2020;36(6):1165–70.

Zaheer A. The lost doctors. Pak J Med Sci. 2022;38(8):2053–5.

Battaglia F, Farhan SA, Narmeen M, Karimuddin AA, Jalal S, Tse M, et al. Does gender influence leadership roles in academic surgery in the United States of America? A cross-sectional study. Int J Surg. 2020;83:67–74.

Download references

Acknowledgements

Not applicable.

Author information

Brekhna Jamil, Bushra Mehboob, Usman Mahboob and Fayig Elmigdadi contributed equally to this work.

Authors and Affiliations

Faculty of Dentistry, Zarqa University, Zarqa, Jordan

Muhammad Shahzad & Fayig Elmigdadi

Institute of Basic Medical Sciences, Khyber Medical University, Peshawar, Pakistan

Muhammad Shahzad

Institute of Health Professions Education and Research, Khyber Medical University, Peshawar, Pakistan

Brekhna Jamil & Usman Mahboob

Department of Oral and Maxillofacial Surgery, Peshawar Dental College, Peshawar, Pakistan

Mehboob Bushra


Contributions

M.S.: study design, data collection, analysis and writing the manuscript. B.J. and U.M.: supervision, data analysis and finalizing the manuscript. B.M. and F.A.: helped in qualitative data analysis, writing, reviewing and finalizing the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Muhammad Shahzad .

Ethics declarations

Ethics approval and consent to participate.

Ethics approval for the study was granted by the institutional research board of Khyber Medical University (Ref No: 1–11/IHPER/MHPE/KMU/23–62). The study design followed the GRAMMS (Good Reporting of a Mixed Methods Study) guidelines, and informed consent was obtained from all participants.

Consent for publication

Competing interests.

The authors declare no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Supplementary Material 2

Supplementary Material 3

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Shahzad, M., Jamil, B., Bushra, M. et al. Gender-based differences in the representation and experiences of academic leaders in medicine and dentistry: a mixed method study from Pakistan. BMC Med Educ 24 , 885 (2024). https://doi.org/10.1186/s12909-024-05811-6


Received : 30 January 2024

Accepted : 23 July 2024

Published : 16 August 2024

DOI : https://doi.org/10.1186/s12909-024-05811-6


  • Qualitative study

BMC Medical Education

ISSN: 1472-6920


10 Methods of Data Presentation That Really Work in 2024

Leah Nguyen • 20 August, 2024 • 13 min read

Have you ever presented a data report to your boss/coworkers/teachers thinking it was super dope like you’re some cyber hacker living in the Matrix, but all they saw was a pile of static numbers that seemed pointless and didn't make sense to them?

Understanding raw numbers is hard. Making people from non-analytical backgrounds understand those numbers is even more challenging.

How can you clear up those confusing numbers and make your presentation as clear as day? Let's check out these best ways to present data. 💎

  • How many types of charts are available to present data? 7
  • How many charts are there in statistics? 4, including bar, line, histogram and pie.
  • How many types of charts are available in Excel? 8
  • Who invented charts? William Playfair
  • When were charts invented? The 18th century


Data Presentation - What Is It?

The term ’data presentation’ refers to presenting data in a way that even the most clueless person in the room can understand.

Some say it’s witchcraft (you’re manipulating the numbers in some ways), but we’ll just say it’s the power of turning dry, hard numbers or digits into a visual showcase that is easy for people to digest.

Presenting data correctly can help your audience understand complicated processes, identify trends, and instantly pinpoint whatever is going on without exhausting their brains.

Good data presentation helps…

  • Make informed decisions and arrive at positive outcomes . If you see the sales of your product steadily increase throughout the years, it’s best to keep milking it or start turning it into a bunch of spin-offs (shoutout to Star Wars👀).
  • Reduce the time spent processing data . Humans can digest information graphically 60,000 times faster than in the form of text. Grant them the power of skimming through a decade of data in minutes with some extra spicy graphs and charts.
  • Communicate the results clearly . Data does not lie. They’re based on factual evidence and therefore if anyone keeps whining that you might be wrong, slap them with some hard data to keep their mouths shut.
  • Add to or expand the current research . You can see what areas need improvement, as well as what details often go unnoticed while surfing through those little lines, dots or icons that appear on the data board.

Methods of Data Presentation and Examples

Imagine you have a delicious pepperoni, extra-cheese pizza. You can decide to cut it into the classic 8 triangle slices, the party style 12 square slices, or get creative and abstract on those slices. 

There are various ways to cut a pizza and you get the same variety with how you present your data. In this section, we will bring you the 10 ways to slice a pizza - we mean to present your data - that will make your company’s most important asset as clear as day. Let's dive into 10 ways to present data efficiently.

#1 - Tabular 

Among various types of data presentation, tabular is the most fundamental method, with data presented in rows and columns. Excel or Google Sheets would qualify for the job. Nothing fancy.

a table displaying the changes in revenue between the year 2017 and 2018 in the East, West, North, and South region

This is an example of a tabular presentation of data on Google Sheets. Each row and column has an attribute (year, region, revenue, etc.), and you can do a custom format to see the change in revenue throughout the year.
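If you would rather script the table than click around a spreadsheet, the same idea can be sketched in a few lines of pandas. The revenue figures below are made up for illustration and are not taken from the sheet shown above:

```python
import pandas as pd

# Hypothetical revenue figures by region -- illustrative only.
data = {
    "Region": ["East", "West", "North", "South"],
    "Revenue 2017": [120_000, 95_000, 88_000, 102_000],
    "Revenue 2018": [135_000, 99_000, 91_000, 110_000],
}

df = pd.DataFrame(data)
# Add a column showing the year-on-year change, similar to a custom format in Sheets.
df["Change"] = df["Revenue 2018"] - df["Revenue 2017"]
print(df.to_string(index=False))
```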

#2 - Text

When presenting data as text, all you do is write your findings down in paragraphs and bullet points, and that's it. A piece of cake for you, a tough nut to crack for whoever has to go through all of that reading to get to the point.

  • 65% of email users worldwide access their email via a mobile device.
  • Emails that are optimised for mobile generate 15% higher click-through rates.
  • 56% of brands using emojis in their email subject lines had a higher open rate.

(Source: CustomerThermometer )

All of the above present statistical information in textual form. Since not many people like going through a wall of text, you'll have to figure out another route when using this method, such as breaking the data down into short, clear statements, or even catchy puns if you've got the time to think of them.

#3 - Pie chart

A pie chart (or a ‘donut chart’ if you stick a hole in the middle of it) is a circle divided into slices that show the relative sizes of data within a whole. If you’re using it to show percentages, make sure all the slices add up to 100%.


The pie chart is a familiar face at every party and is usually recognised by most people. However, one setback of using this method is our eyes sometimes can’t identify the differences in slices of a circle, and it’s nearly impossible to compare similar slices from two different pie charts, making them the villains in the eyes of data analysts.

a half-eaten pie chart
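If you want to script a pie chart rather than draw it in a spreadsheet, here is a minimal matplotlib sketch. The market-share figures are made up for illustration, and the slices deliberately add up to 100%:

```python
import matplotlib.pyplot as plt

# Hypothetical market-share data -- the slices must add up to 100%.
labels = ["Product A", "Product B", "Product C", "Other"]
shares = [45, 30, 15, 10]

fig, ax = plt.subplots()
ax.pie(shares, labels=labels, autopct="%1.0f%%", startangle=90)
ax.set_title("Market share (illustrative data)")
ax.axis("equal")  # keep the pie circular
plt.show()
```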

#4 - Bar chart

The bar chart is a chart that presents a bunch of items from the same category, usually in the form of rectangular bars that are placed at an equal distance from each other. Their heights or lengths depict the values they represent.

They can be as simple as this:

a simple bar chart example

Or more complex and detailed like this example of data presentation. Contributing to an effective statistic presentation, this one is a grouped bar chart that not only allows you to compare categories but also the groups within them as well.

an example of a grouped bar chart
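For reference, a grouped bar chart like this can be sketched in a few lines of matplotlib. The quarterly sales figures below are invented for illustration only:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical quarterly sales for two product lines -- illustrative only.
quarters = ["Q1", "Q2", "Q3", "Q4"]
product_a = [20, 34, 30, 35]
product_b = [25, 32, 34, 20]

x = np.arange(len(quarters))  # positions of the groups
width = 0.35                  # width of each bar

fig, ax = plt.subplots()
ax.bar(x - width / 2, product_a, width, label="Product A")
ax.bar(x + width / 2, product_b, width, label="Product B")
ax.set_xticks(x)
ax.set_xticklabels(quarters)
ax.set_ylabel("Units sold")
ax.set_title("Grouped bar chart (illustrative data)")
ax.legend()
plt.show()
```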

#5 - Histogram

Similar in appearance to the bar chart, but the rectangular bars in a histogram usually don't have gaps between them like their counterparts do.

Instead of measuring categories like weather preferences or favourite films as a bar chart does, a histogram only measures things that can be put into numbers.

an example of a histogram chart showing the distribution of students' score for the IQ test

Teachers can use presentation graphs like a histogram to see which score group most of the students fall into, like in this example above.
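A minimal matplotlib sketch of such a histogram is shown below; the student scores are randomly generated, not real exam results:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical test scores for a class of 30 students -- illustrative only.
rng = np.random.default_rng(seed=1)
scores = rng.normal(loc=70, scale=12, size=30).clip(0, 100)

fig, ax = plt.subplots()
ax.hist(scores, bins=range(0, 101, 10), edgecolor="black")
ax.set_xlabel("Score")
ax.set_ylabel("Number of students")
ax.set_title("Histogram of test scores (illustrative data)")
plt.show()
```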

#6 - Line graph

Regarding ways of displaying data, we shouldn't overlook the effectiveness of line graphs. A line graph is represented by a group of data points joined together by straight lines. There can be one or more lines to compare how several related things change over time.

an example of the line graph showing the population of bears from 2017 to 2022

On a line chart’s horizontal axis, you usually have text labels, dates or years, while the vertical axis usually represents the quantity (e.g.: budget, temperature or percentage).
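Here is a minimal matplotlib sketch of a single-line graph; the population figures are invented and do not match the chart above:

```python
import matplotlib.pyplot as plt

# Hypothetical yearly counts -- illustrative only.
years = [2017, 2018, 2019, 2020, 2021, 2022]
population = [120, 135, 150, 148, 160, 172]

fig, ax = plt.subplots()
ax.plot(years, population, marker="o")
ax.set_xlabel("Year")        # horizontal axis: dates or years
ax.set_ylabel("Population")  # vertical axis: the quantity
ax.set_title("Line graph (illustrative data)")
plt.show()
```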

#7 - Pictogram graph

A pictogram graph uses pictures or icons relating to the main topic to visualise a small dataset. The fun combination of colours and illustrations makes it a frequent choice in schools.

How to Create Pictographs and Icon Arrays in Visme-6 pictograph maker

Pictograms are a breath of fresh air if you want to stay away from the monotonous line chart or bar chart for a while. However, they can present a very limited amount of data and sometimes they are only there for displays and do not represent real statistics.
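If you only need a quick, throwaway pictogram, even plain text can do the job. The sketch below uses an asterisk as the icon, a made-up "each icon = 10 units" scale, and invented fruit counts:

```python
# A minimal text pictogram: each icon stands for 10 units -- illustrative only.
data = {"Apples": 40, "Bananas": 25, "Cherries": 10}
icon, unit = "*", 10

for fruit, count in data.items():
    icons = icon * round(count / unit)
    print(f"{fruit:<10} {icons}  ({count})")
```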

#8 - Radar chart

If presenting five or more variables in the form of a bar chart is too stuffy then you should try using a radar chart, which is one of the most creative ways to present data.

Radar charts show data in terms of how they compare to each other starting from the same point. Some also call them ‘spider charts’ because each aspect combined looks like a spider web.

a radar chart showing the text scores between two students

Radar charts can be a great use for parents who'd like to compare their child's grades with their peers to lower their self-esteem. You can see that each axis represents a subject with a score value ranging from 0 to 100. Each student's score across the five subjects is highlighted in a different colour.

a radar chart showing the power distribution of a Pokemon

If you think that this method of data presentation somehow feels familiar, then you’ve probably encountered one while playing Pokémon .
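A radar (spider) chart can be drawn with matplotlib's polar projection. The subjects and scores below are made up for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical scores of one student across five subjects -- illustrative only.
subjects = ["Math", "English", "Science", "History", "Art"]
scores = [80, 65, 90, 70, 85]

# Close the polygon by repeating the first value at the end.
angles = np.linspace(0, 2 * np.pi, len(subjects), endpoint=False).tolist()
angles += angles[:1]
values = scores + scores[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.plot(angles, values, marker="o")
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(subjects)
ax.set_ylim(0, 100)
ax.set_title("Radar chart (illustrative data)")
plt.show()
```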

#9 - Heat map

A heat map represents data density in colours. The bigger the number, the more intense the colour used to represent it.

voting chart

Most US citizens would be familiar with this data presentation method in geography. For elections, many news outlets assign a specific colour code to a state, with blue representing one candidate and red representing the other. The shade of either blue or red in each state shows the strength of the overall vote in that state.

a heatmap showing which parts the visitors click on in a website

Another great thing you can use a heat map for is to map what visitors to your site click on. The more a particular section is clicked the ‘hotter’ the colour will turn, from blue to bright yellow to red.
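A heat map can be sketched with matplotlib's imshow; the 5x5 grid of click counts below is randomly generated, not real analytics data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical click counts on a 5x5 grid of page regions -- illustrative only.
rng = np.random.default_rng(seed=2)
clicks = rng.integers(low=0, high=100, size=(5, 5))

fig, ax = plt.subplots()
im = ax.imshow(clicks, cmap="hot")
fig.colorbar(im, ax=ax, label="Clicks")
ax.set_title("Heat map of page clicks (illustrative data)")
plt.show()
```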

#10 - Scatter plot

If you present your data in dots instead of chunky bars, you’ll have a scatter plot. 

A scatter plot is a grid with several inputs showing the relationship between two variables. It’s good at collecting seemingly random data and revealing some telling trends.

a scatter plot example showing the relationship between beach visitors each day and the average daily temperature

For example, in this graph, each dot shows the average daily temperature versus the number of beach visitors across several days. You can see that the dots get higher as the temperature increases, so it’s likely that hotter weather leads to more visitors.
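A scatter plot like the one described can be sketched as follows; the temperature and visitor numbers are synthetic, generated only to loosely mimic the trend in the example:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical temperatures and visitor counts -- illustrative only.
rng = np.random.default_rng(seed=3)
temperature = rng.uniform(15, 35, size=40)            # daily temperature in Celsius
visitors = temperature * 20 + rng.normal(0, 60, 40)   # roughly increases with heat

fig, ax = plt.subplots()
ax.scatter(temperature, visitors)
ax.set_xlabel("Average daily temperature (Celsius)")
ax.set_ylabel("Beach visitors")
ax.set_title("Scatter plot (illustrative data)")
plt.show()
```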

5 Data Presentation Mistakes to Avoid

#1 - Assume your audience understands what the numbers represent

You may know all the behind-the-scenes of your data since you’ve worked with them for weeks, but your audience doesn’t.

sales data board

Showing without telling only invites more and more questions from your audience, as they have to constantly make sense of your data, wasting the time of both sides as a result.

While showing your data presentation, tell the audience what the data is about before hitting them with waves of numbers. You can use interactive activities such as polls, word clouds, online quizzes and Q&A sections, combined with icebreaker games, to assess their understanding of the data and address any confusion beforehand.

#2 - Use the wrong type of chart

Charts such as pie charts must total 100%, so if your numbers add up to 193% like the example below, you're definitely doing it wrong.

bad example of data presentation
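A quick sanity check in code can catch this mistake before the chart is ever drawn; the segment values below are made up to reproduce the 193% problem:

```python
# Quick sanity check before building a pie chart -- illustrative values only.
slices = {"Segment A": 80, "Segment B": 65, "Segment C": 48}  # adds up to 193
total = sum(slices.values())
if total != 100:
    print(f"Warning: slices add up to {total}%, not 100% -- rethink the chart type.")
```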

Before making a chart, ask yourself: what do I want to accomplish with my data? Do you want to see the relationship between the data sets, show the up and down trends of your data, or see how segments of one thing make up a whole?

Remember, clarity always comes first. Some data visualisations may look cool, but if they don’t fit your data, steer clear of them. 

#3 - Make it 3D

3D is a fascinating graphical presentation example. The third dimension is cool, but full of risks.


Can you see what’s behind those red bars? Because we can’t either. You may think that 3D charts add more depth to the design, but they can create false perceptions as our eyes see 3D objects closer and bigger than they appear, not to mention they cannot be seen from multiple angles.

#4 - Use different types of charts to compare contents in the same category


This is like comparing a fish to a monkey. Your audience won’t be able to identify the differences and make an appropriate correlation between the two data sets. 

Next time, stick to one type of data presentation only. Avoid the temptation of trying various data visualisation methods in one go and make your data as accessible as possible.

#5 - Bombard the audience with too much information

The goal of data presentation is to make complex topics much easier to understand, and if you’re bringing too much information to the table, you’re missing the point.

a very complicated data presentation with too much information on the screen

The more information you give, the more time it will take for your audience to process it all. If you want to make your data understandable and give your audience a chance to remember it, keep the information within it to an absolute minimum. You should end your session with open-ended questions to see what your participants really think.

What are the Best Methods of Data Presentation?

Finally, which is the best way to present data?

The answer is…

There is none! Each type of presentation has its own strengths and weaknesses and the one you choose greatly depends on what you’re trying to do. 

For example:

  • Go for a scatter plot if you’re exploring the relationship between different data values, like seeing whether the sales of ice cream go up because of the temperature or because people are just getting more hungry and greedy each day?
  • Go for a line graph if you want to mark a trend over time. 
  • Go for a heat map if you like some fancy visualisation of the changes in a geographical location, or to see your visitors' behaviour on your website.
  • Go for a pie chart (especially in 3D) if you want to be shunned by others because it was never a good idea👇

example of how a bad pie chart represents the data in a complicated way

Frequently Asked Questions

What is a chart presentation?

A chart presentation is a way of presenting data or information using visual aids such as charts, graphs, and diagrams. The purpose of a chart presentation is to make complex information more accessible and understandable for the audience.

When can I use charts for the presentation?

Charts can be used to compare data, show trends over time, highlight patterns, and simplify complex information.

Why should you use charts for presentation?

You should use charts because they keep your content and visuals clean: as visual representations of your data, they provide clarity, simplicity, comparison and contrast, and they save a huge amount of time.

What are the 4 graphical methods of presenting data?

Histogram, frequency polygon, smoothed frequency curve, and cumulative frequency (ogive) graph; the pie chart is another commonly used diagrammatic method.

Leah Nguyen

Words that convert, stories that stick. I turn complex ideas into engaging narratives - helping audiences learn, remember, and take action.


EGCT: enhanced graph convolutional transformer for 3D point cloud representation learning

  • Original Article
  • Published: 23 August 2024


  • Gang Chen 1 ,
  • Wenju Wang 1 ,
  • Haoran Zhou 1 &
  • Xiaolin Wang 1  


Representation learning on point cloud data that simultaneously captures local and global feature information is an urgent problem for high-precision 3D environment perception. However, current representation learning methods either focus only on efficiently learning local features, or capture long-distance dependencies but lose fine-grained features. Therefore, we explore transformers on the topological structures of point cloud graphs, proposing an enhanced graph convolutional transformer (EGCT) method. EGCT constructs a graph topology for the disordered and unstructured point cloud. It then uses an enhanced point feature representation method to further aggregate the feature information of all neighborhood points, compactly representing the features of this local neighborhood graph. In the subsequent process, the graph convolutional transformer simultaneously performs self-attention calculations and convolution operations on the point coordinates and features of the neighborhood graph, efficiently utilizing the spatial geometric information of point cloud objects. EGCT therefore learns comprehensive geometric information of point cloud objects, which helps improve segmentation and classification accuracy. On the ShapeNetPart and ModelNet40 datasets, our EGCT method achieves a mIoU of 86.8%, and OA and AA of 93.5% and 91.2%, respectively. On the large-scale indoor scene point cloud dataset (S3DIS), the OA of the EGCT method is 90.1% and the mIoU is 67.8%. Experimental results demonstrate that our EGCT method can achieve point cloud segmentation and classification performance comparable to state-of-the-art methods while maintaining low model complexity. Our source code is available at https://github.com/shepherds001/EGCT .
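The first step the abstract describes, building a graph topology over an unordered point cloud, is commonly done with a k-nearest-neighbour search. The sketch below shows that generic step in NumPy; it is only an illustration of the idea, not the authors' implementation (their code is at the GitHub link above), and the function name knn_graph is ours:

```python
import numpy as np

def knn_graph(points: np.ndarray, k: int = 16) -> np.ndarray:
    """Return an (N, k) array of neighbour indices for each point.

    Generic k-NN graph construction over an unordered point cloud;
    illustrative only, not the EGCT authors' implementation.
    """
    # Pairwise squared Euclidean distances, shape (N, N).
    diff = points[:, None, :] - points[None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1)
    np.fill_diagonal(dist2, np.inf)  # exclude self-loops
    # Indices of the k closest points per row.
    return np.argsort(dist2, axis=1)[:, :k]

# Example: 1024 random 3D points.
cloud = np.random.default_rng(0).random((1024, 3))
neighbours = knn_graph(cloud, k=16)
print(neighbours.shape)  # (1024, 16)
```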


Data availability

The ShapeNetPart, ModelNet40 and S3DIS datasets used in this study were obtained from public domains and are available online at https://shapenet.org/ , https://modelnet.cs.princeton.edu and http://buildingparser.stanford.edu/dataset.html .


This work was sponsored by Natural Science Foundation of Shanghai under Grant No. 19ZR1435900.

Author information

Authors and Affiliations

College of Communication and Art Design, University of Shanghai for Science and Technology, Shanghai, China

Gang Chen, Wenju Wang, Haoran Zhou & Xiaolin Wang


Contributions

GC contributed to conceptualization, methodology, resources and software; GC, HZ and XW performed data curation; WW carried out formal analysis and funding acquisition; WW, HZ and XW performed supervision and validation; HZ and XW contributed to visualization; GC performed writing-original draft; GC and WW performed writing-review and editing. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Wenju Wang .

Ethics declarations

Conflict of interest.

The authors declare no conflict of interest.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Chen, G., Wang, W., Zhou, H. et al. EGCT: enhanced graph convolutional transformer for 3D point cloud representation learning. Vis Comput (2024). https://doi.org/10.1007/s00371-024-03600-2


Accepted : 01 August 2024

Published : 23 August 2024

DOI : https://doi.org/10.1007/s00371-024-03600-2


  • Scene understanding
  • Graph convolution
  • Transformer
  • 3D point cloud

