
Graphical Representation of Data

Graphical Representation of Data: welcome to the world of graphical representation, where numbers and facts become lively pictures and colorful diagrams. Instead of staring at dull lists of numbers, we use charts, graphs, and other visuals to understand information better. In this introduction to data visualization, we'll learn about the different kinds of graphs, charts, and pictures that help us see the patterns and stories hidden in data.

There is an entire branch of mathematics dedicated to collecting, analyzing, interpreting, and presenting numerical data in visual form so that it becomes easy to understand and compare; this branch is known as Statistics.

The field is broad and has a plethora of real-life applications, such as business analytics, demography, and astrostatistics. In this article, we cover everything about the graphical representation of data, including its types, rules, advantages, and more.


Table of Contents

  • What is Graphical Representation?
  • Types of Graphical Representations
  • Graphical Representations Used in Maths
  • Value-Based or Time Series Graphs
  • Frequency-Based Graphs
  • Principles of Graphical Representation
  • Advantages and Disadvantages of Graphical Representation
  • General Rules for Graphical Representation of Data
  • Frequency Polygon
  • Solved Examples on Graphical Representation of Data

What is Graphical Representation?

Graphical representation is a way of presenting data in pictorial form. It helps a reader to understand a large set of data very easily, as it shows the various patterns in the data in visual form.

There are two ways of representing data:

  • Tabular representation through tables.
  • Pictorial representation through graphs.

They say, “A picture is worth a thousand words,” and it is usually better to represent data in a graphical format. In practical evidence and surveys, scientists have found that retention and understanding of information improve when it is available in the form of visuals, as human beings process data better in visual form than in any other form.

Does it improve understanding two or three times over? A widely repeated (though debated) claim is that a normal human being processes visual information as much as 60,000 times faster than text, a figure that is amusing and striking at the same time.


Comparison between different items is best shown with graphs; it becomes easier to compare the crux of the data about different items. Let's look at each of the different types of graphical representation briefly:

Line Graph

A line graph is used to show how the value of a particular variable changes with time. We plot this graph by connecting the points at different values of the variable. It can be useful for analyzing the trends in the data and predicting further trends.
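To make this concrete, here is a minimal sketch of a line graph in Python with matplotlib; the monthly temperature figures are invented for the example.

```python
import matplotlib.pyplot as plt

# Hypothetical data: average temperature recorded over six months
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
temps = [4, 6, 11, 15, 20, 24]

plt.plot(months, temps, marker="o")  # connect the points at each value
plt.xlabel("Month")
plt.ylabel("Temperature (°C)")
plt.title("Average Monthly Temperature")
plt.show()
```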


Bar Graph

A bar graph is a type of graphical representation of the data in which bars of uniform width are drawn with equal spacing between them on one axis (usually the x-axis), depicting the variable. The values of the variable are represented by the heights of the bars.


Histogram

This is similar to a bar graph, but it is based on the frequency of numerical values rather than their actual values. The data is organized into intervals, and the bars represent the frequency of the values in each interval; that is, it counts how many values of the data lie in a particular range.
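A minimal matplotlib sketch of a histogram, using invented exam scores grouped into intervals of width 10:

```python
import matplotlib.pyplot as plt

# Hypothetical exam scores; the histogram counts how many fall in each interval
scores = [23, 45, 56, 78, 52, 67, 34, 88, 91, 57, 62, 49, 71, 83, 55]

plt.hist(scores, bins=range(20, 101, 10), edgecolor="black")
plt.xlabel("Score interval")
plt.ylabel("Frequency")
plt.title("Distribution of Exam Scores")
plt.show()
```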


Line Plot

It is a plot that displays data as points or check marks above a number line, showing the frequency of each value.


Stem and Leaf Plot

This is a type of plot in which each value is split into a "leaf" (in most cases, the last digit) and a "stem" (the other remaining digits). For example, the number 42 is split into leaf (2) and stem (4).
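Plotting libraries rarely ship a stem-and-leaf plot, but the split described above is easy to reproduce in a few lines of Python; the data values below are invented for the example.

```python
from collections import defaultdict

# Hypothetical two-digit data values
values = [12, 14, 21, 25, 28, 32, 34, 36, 42]

# Split each value into a stem (tens digit) and a leaf (last digit)
plot = defaultdict(list)
for v in sorted(values):
    stem, leaf = divmod(v, 10)
    plot[stem].append(leaf)

for stem, leaves in plot.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
# 1 | 2 4
# 2 | 1 5 8
# ...
```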


Box and Whisker Plot

These plots divide the data into four parts to show its summary. They are more concerned with the spread, average, and median of the data.
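A minimal matplotlib sketch of a box-and-whisker plot over an invented data set; the box spans the quartiles, the center line marks the median, and the whiskers show the spread.

```python
import matplotlib.pyplot as plt

# Hypothetical data set summarized by quartiles, median, and spread
data = [12, 14, 21, 25, 28, 32, 34, 36, 50, 53, 54, 62, 67, 83, 91]

plt.boxplot(data, vert=False)
plt.xlabel("Value")
plt.title("Box and Whisker Plot")
plt.show()
```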


Pie Chart

It is a type of graph that represents the data as a circular graph. The circle is divided such that each portion represents a proportion of the whole.
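A minimal matplotlib sketch of a pie chart; the budget categories and shares are invented for the example.

```python
import matplotlib.pyplot as plt

# Hypothetical proportions of a whole (they must sum to 100%)
labels = ["Rent", "Food", "Travel", "Savings"]
shares = [40, 30, 10, 20]

plt.pie(shares, labels=labels, autopct="%1.0f%%")  # each slice is a proportion of the whole
plt.title("Monthly Budget Breakdown")
plt.show()
```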


Graphical Representations Used in Maths

Graphs in maths are used to study the relationships between two or more variables that are changing. Statistical data can be summarized in a better way using graphs. There are basically two schools of thought for making graphs in maths:

  • Value-Based or Time Series Graphs
  • Frequency-Based Graphs

These graphs allow us to study the change of a variable with respect to another variable within a given interval of time. The variables can be anything. Time Series graphs study the change of variable with time. They study the trends, periodic behavior, and patterns in the series. We are more concerned with the values of the variables here rather than the frequency of those values. 

Example: Line Graph

Frequency-Based Graphs

These kinds of graphs are more concerned with the distribution of the data: how many values lie within a particular range of the variable, and which range has the maximum frequency of values. They are used to judge the spread, the average, and sometimes the median of a variable under study.

Example: Frequency Polygon

Principles of Graphical Representation
  • All types of graphical representations follow algebraic principles.
  • When plotting a graph, there’s an origin and two axes.
  • The x-axis is horizontal, and the y-axis is vertical.
  • The axes divide the plane into four quadrants.
  • The origin is where the axes intersect.
  • Positive x-values are to the right of the origin; negative x-values are to the left.
  • Positive y-values are above the x-axis; negative y-values are below.


Advantages

  • It gives us a summary of the data which is easier to look at and analyze.
  • It saves time.
  • We can compare and study more than one variable at a time.

Disadvantages

  • It usually captures only one aspect of the data and ignores the others. For example, a bar graph does not represent the mean, median, and other statistics of the data.
  • Interpretation of graphs can vary based on individual perspectives, leading to subjective conclusions.
  • Poorly constructed or misleading visuals can distort data interpretation and lead to incorrect conclusions.
General Rules for Graphical Representation of Data

We should keep some things in mind while plotting and designing these graphs; the goal should be a clearer picture of the data. The following should be kept in mind while plotting the above graphs:

  • Whenever possible, the data source should be mentioned for the viewer.
  • Always choose proper colors and font sizes, such that the graph looks neat.
  • The measurement unit should be mentioned in the top right corner of the graph.
  • A proper scale should be chosen while making the graph, so that the graph looks accurate.
  • Last but not least, a suitable title should be chosen.

Frequency Polygon

A frequency polygon is a graph constructed by joining the midpoints of the intervals. The height of each interval, or bin, represents the frequency of the values that lie in that interval.


Solved Examples on Graphical Representation of Data

Question 1: What are the different types of frequency-based plots?

Types of frequency-based plots: histogram, frequency polygon, and box plot.

Question 2: A company with an advertising budget of Rs 10,00,00,000 has planned the following expenditure in the different advertising channels such as TV Advertisement, Radio, Facebook, Instagram, and Printed media. The table represents the money spent on different channels. 

Draw a bar graph for the following data. 

  • Put each of the channels on the x-axis
  • The height of the bars is decided by the value of each channel.
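The expenditure table itself was not preserved here, so the sketch below uses hypothetical per-channel values that sum to the Rs 10,00,00,000 (10 crore) budget, purely to illustrate the construction.

```python
import matplotlib.pyplot as plt

# Hypothetical spend per channel (in crores of rupees); the real table is not shown above
channels = ["TV", "Radio", "Facebook", "Instagram", "Printed Media"]
spend = [4, 1, 2, 2, 1]  # sums to the 10-crore budget

plt.bar(channels, spend)              # channels on the x-axis
plt.ylabel("Expenditure (Rs crore)")  # bar height = money spent on the channel
plt.title("Advertising Budget by Channel")
plt.show()
```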


Question 3: Draw a line plot for the following data 

  • Put each x-value from the data table on the x-axis.
  • Join the points corresponding to each value on the x-axis.


Question 4: Make a frequency plot of the following data: 

  • Draw the class intervals on the x-axis and frequencies on the y-axis.
  • Calculate the midpoint of each class interval.
Class Interval Mid Point Frequency
0-3 1.5 3
3-6 4.5 4
6-9 7.5 2
9-12 10.5 6

Now join the mid points of the intervals and their corresponding frequencies on the graph. 


This graph shows both the histogram and frequency polygon for the given distribution.
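A minimal matplotlib sketch that reproduces this figure from the table above: bars over the class intervals for the histogram, and the midpoints joined for the frequency polygon.

```python
import matplotlib.pyplot as plt

midpoints = [1.5, 4.5, 7.5, 10.5]   # midpoints of the class intervals 0-3, 3-6, 6-9, 9-12
freqs = [3, 4, 2, 6]

# Histogram: one bar of width 3 centered over each class interval
plt.bar(midpoints, freqs, width=3, edgecolor="black", alpha=0.5)
# Frequency polygon: join the midpoints at their frequencies
plt.plot(midpoints, freqs, marker="o", color="red")
plt.xlabel("Class interval")
plt.ylabel("Frequency")
plt.show()
```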


Conclusion of Graphical Representation

Graphical representation is a powerful tool for understanding data, but it’s essential to be aware of its limitations. While graphs and charts can make information easier to grasp, they can also be subjective, complex, and potentially misleading. By using graphical representations wisely and critically, we can extract valuable insights from data, empowering us to make informed decisions with confidence.

Graphical Representation of Data – FAQs

What are the advantages of using graphs to represent data?

Graphs offer visualization, clarity, and easy comparison of data, aiding in outlier identification and predictive analysis.

What are the common types of graphs used for data representation?

Common graph types include bar, line, pie, histogram, and scatter plots, each suited for different data representations and analysis purposes.

How do you choose the most appropriate type of graph for your data?

Select a graph type based on data type, analysis objective, and audience familiarity to effectively convey information and insights.

How do you create effective labels and titles for graphs?

Use descriptive titles, clear axis labels with units, and legends to ensure the graph communicates information clearly and concisely.

How do you interpret graphs to extract meaningful insights from data?

Interpret graphs by examining trends, identifying outliers, comparing data across categories, and considering the broader context to draw meaningful insights and conclusions.


Graphical Representation of Data

Graphical representation of data is an attractive method of showcasing numerical data that helps in analyzing and representing quantitative data visually. A graph is a kind of chart in which data are plotted as variables across coordinates. With a graph, it becomes easy to analyze the extent of change in one variable based on the change in other variables. Graphical representation of data is done through different mediums such as lines, plots, diagrams, etc. Let us learn more about this interesting concept of graphical representation of data, the different types, and solve a few examples.


Definition of Graphical Representation of Data

A graphical representation is a visual representation of data and statistics-based results using graphs, plots, and charts. This kind of representation is more effective for understanding and comparing data than a tabular form. Graphical representation helps to qualify, sort, and present data in a method that is simple to understand for a larger audience. Graphs enable studying the cause-and-effect relationship between two variables through both time series and frequency distribution. The data obtained from surveys is turned into a graphical representation through the use of symbols, such as lines on a line graph, bars on a bar chart, or slices of a pie chart. This visual representation helps in clarity, comparison, and understanding of numerical data.

Representation of Data

The word data is from the Latin word datum, which means "something given". The numerical figures collected through a survey are called data and can be represented in two forms: tabular form and visual form through graphs. Once the data is collected through constant observations, it is arranged, summarized, and classified, and finally represented in the form of a graph. There are two kinds of data: quantitative and qualitative. Quantitative data is more structured, continuous or discrete, and suited to statistical analysis, whereas qualitative data is unstructured and cannot be analyzed numerically.

Principles of Graphical Representation of Data

The principles of graphical representation are algebraic. In a graph, there are two lines known as axes or coordinate axes: the X-axis and the Y-axis. The horizontal axis is the X-axis and the vertical axis is the Y-axis. They are perpendicular to each other and intersect at O, the point of origin. On the right side of the origin, the X-axis has positive values, and on the left side it has negative values. In the same way, the Y-axis has positive values above the origin and negative values below it. When the x-axis and y-axis intersect at the origin, they divide the plane into four parts, called Quadrant I, Quadrant II, Quadrant III, and Quadrant IV. This form of representation is seen in a frequency distribution, which can be represented by several methods, namely the histogram, smoothed frequency graph, pie diagram or pie chart, cumulative or ogive frequency graph, and frequency polygon.


Advantages and Disadvantages of Graphical Representation of Data

Listed below are some advantages and disadvantages of using a graphical representation of data:

  • It improves the way of analyzing and learning as the graphical representation makes the data easy to understand.
  • It can be used in almost all fields from mathematics to physics to psychology and so on.
  • It is easy to understand for its visual impacts.
  • It shows the whole and huge data in an instance.
  • It is mainly used in statistics to determine the mean, median, and mode for different data.

The main disadvantage of graphical representation of data is that it takes a lot of effort as well as resources to find the most appropriate data and then represent it graphically.

Rules of Graphical Representation of Data

While presenting data graphically, there are certain rules that need to be followed. They are listed below:

  • Suitable Title: The title of the graph should be appropriate and indicate the subject of the presentation.
  • Measurement Unit: The measurement unit in the graph should be mentioned.
  • Proper Scale: A proper scale needs to be chosen to represent the data accurately.
  • Index: For better understanding, index the appropriate colors, shades, lines, designs in the graphs.
  • Data Sources: The source of the data should be mentioned at the bottom of the graph wherever necessary.
  • Simple: The construction of a graph should be easily understood.
  • Neat: The graph should be visually neat in terms of size and font to read the data accurately.

Uses of Graphical Representation of Data

The main use of a graphical representation of data is understanding and identifying the trends and patterns of the data. It helps in analyzing large quantities, comparing two or more data, making predictions, and building a firm decision. The visual display of data also helps in avoiding confusion and overlapping of any information. Graphs like line graphs and bar graphs, display two or more data clearly for easy comparison. This is important in communicating our findings to others and our understanding and analysis of the data.

Types of Graphical Representation of Data

Data is represented in different types of graphs such as plots, pies, diagrams, etc. They are as follows:

  • Bar Graph: A group of data represented with rectangular bars with lengths proportional to the values is a bar graph. The bars can be plotted either vertically or horizontally.
  • Pie Chart: A pie chart is a type of graph in which a circle is divided into sectors, each representing a proportion of the whole. Two main formulas used in pie charts are: central angle of a sector = (value of the category ÷ total value) × 360°, and percentage share = (value of the category ÷ total value) × 100%.
  • Line Graph: A line graph represents the data as a series of points connected with straight lines; the plotted points are called markers.
  • Pictograph: Data shown in the form of pictures is a pictograph. Pictorial symbols for words, objects, or phrases can be represented with different numbers.
  • Histogram: A histogram is a type of graph in which the diagram consists of rectangles whose area is proportional to the frequency of a variable and whose width is equal to the class interval.
  • Frequency Distribution Table: This table showcases the data in ascending order along with the corresponding frequencies. The frequency of the data is often represented by f.
  • Stem and Leaf Plot: A stem and leaf plot is a way to represent quantitative data according to frequency ranges or frequency distribution. It is a graph that shows numerical data arranged in order, with each data value broken into a stem and a leaf.
  • Scatter Plot: A scatter diagram or scatter plot is a way of graphically representing two variables using Cartesian coordinates; the plot shows the relationship between the two variables.


Examples on Graphical Representation of Data

Example 1 : A pie chart is divided into 3 parts with the angles measuring as 2x, 8x, and 10x respectively. Find the value of x in degrees.

We know the sum of all angles in a pie chart is 360º.
⇒ 2x + 8x + 10x = 360º
⇒ 20x = 360º
⇒ x = 360º/20
⇒ x = 18º
Therefore, the value of x is 18º.
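The same computation as a quick check in Python:

```python
# Angles of the three sectors are 2x, 8x, and 10x; they must sum to 360 degrees
coefficients = [2, 8, 10]
x = 360 / sum(coefficients)           # 360 / 20
print(x)                              # 18.0 degrees
print([c * x for c in coefficients])  # sector angles: [36.0, 144.0, 180.0]
```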

Example 2: Ben is trying to read the plot given below. His teacher has given him stem and leaf plot worksheets. Can you help him answer the questions? i) What is the mode of the plot? ii) What is the mean of the plot? iii) Find the range.

Stem Leaf
1 2 4
2 1 5 8
3 2 4 6
5 0 3 4 4
6 2 5 7
8 3 8 9
9 1

Solution: i) Mode is the number that appears most often in the data. Leaf 4 occurs twice on the plot, against stem 5.

Hence, mode = 54

ii) The sum of all data values is 12 + 14 + 21 + 25 + 28 + 32 + 34 + 36 + 50 + 53 + 54 + 54 + 62 + 65 + 67 + 83 + 88 + 89 + 91 = 958

To find the mean, we have to divide the sum by the total number of values.

Mean = Sum of all data values ÷ 19 = 958 ÷ 19 = 50.42

iii) Range = the highest value - the lowest value = 91 - 12 = 79
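These answers can be verified with Python's statistics module after expanding the plot back into its raw values:

```python
import statistics

# Raw values reconstructed from the stem-and-leaf plot above
data = [12, 14, 21, 25, 28, 32, 34, 36, 50, 53, 54, 54, 62, 65, 67, 83, 88, 89, 91]

print(statistics.mode(data))            # 54
print(round(statistics.mean(data), 2))  # 50.42
print(max(data) - min(data))            # 79
```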


FAQs on Graphical Representation of Data

What is Graphical Representation?

Graphical representation is a form of visually displaying data through various methods like graphs, diagrams, charts, and plots. It helps in sorting, visualizing, and presenting data in a clear manner through different types of graphs. Statistics mainly use graphical representation to show data.

What are the Different Types of Graphical Representation?

The different types of graphical representation of data are:

  • Stem and leaf plot
  • Scatter diagrams
  • Frequency Distribution

Is the Graphical Representation of Numerical Data a Chart?

Yes. These graphical representations are built from numerical data that have been accumulated through various surveys and observations. The method of presenting such numerical data is called a chart. There are different kinds of charts, such as pie charts, bar graphs, line graphs, etc., that help in clearly showcasing the data.

What is the Use of Graphical Representation of Data?

Graphical representation of data is useful in clarifying, interpreting, and analyzing data by plotting points and drawing line segments, surfaces, and other geometric forms or symbols.

What are the Ways to Represent Data?

Tables, charts, and graphs are all ways of representing data, and they can be used for two broad purposes. The first is to support the collection, organization, and analysis of data as part of the process of a scientific study; the second is to communicate the results of that analysis to others.

What is the Objective of Graphical Representation of Data?

The main objective of representing data graphically is to display information visually that helps in understanding the information efficiently, clearly, and accurately. This is important to communicate the findings as well as analyze the data.


Graphical Representation


Graphical Representation is a way of analysing numerical data. It exhibits the relation between data, ideas, information and concepts in a diagram. It is easy to understand and it is one of the most important learning strategies. It always depends on the type of information in a particular domain. There are different types of graphical representation. Some of them are as follows:

  • Line Graphs – Line graph or the linear graph is used to display the continuous data and it is useful for predicting future events over time.
  • Bar Graphs – Bar Graph is used to display the category of data and it compares the data using solid bars to represent the quantities.
  • Histograms – The graph that uses bars to represent the frequency of numerical data that are organised into intervals. Since all the intervals are equal and continuous, all the bars have the same width.
  • Line Plot – It shows the frequency of data on a given number line. An ‘x’ is placed above the number line each time that value occurs.
  • Frequency Table – The table shows the number of pieces of data that fall within the given interval.
  • Circle Graph – Also known as the pie chart, it shows the relationships of the parts to the whole. The whole circle represents 100%, and each category occupies a portion represented by its specific percentage, such as 15%, 56%, etc.
  • Stem and Leaf Plot – In the stem and leaf plot, the data are organised from least value to greatest value. The digits of the least place value form the leaves, and the next place-value digits form the stems.
  • Box and Whisker Plot – The plot diagram summarises the data by dividing it into four parts. Box and whisker show the range (spread) and the middle (median) of the data.


General Rules for Graphical Representation of Data

There are certain rules to effectively present the information in the graphical representation. They are:

  • Suitable Title: Make sure that the appropriate title is given to the graph which indicates the subject of the presentation.
  • Measurement Unit: Mention the measurement unit in the graph.
  • Proper Scale: To represent the data in an accurate manner, choose a proper scale.
  • Index: Index the appropriate colours, shades, lines, design in the graphs for better understanding.
  • Data Sources: Include the source of information wherever it is necessary at the bottom of the graph.
  • Keep it Simple: Construct a graph in an easy way that everyone can understand.
  • Neat: Choose the correct size, fonts, colours etc in such a way that the graph should be a visual aid for the presentation of information.

Graphical Representation in Maths

In Mathematics, a graph is defined as a chart with statistical data, which are represented in the form of curves or lines drawn across the coordinate point plotted on its surface. It helps to study the relationship between two variables where it helps to measure the change in the variable amount with respect to another variable within a given interval of time. It helps to study the series distribution and frequency distribution for a given problem.  There are two types of graphs to visually depict the information. They are:

  • Time Series Graphs – Example: Line Graph
  • Frequency Distribution Graphs – Example: Frequency Polygon Graph

Principles of Graphical Representation

Algebraic principles are applied to all types of graphical representation of data. In graphs, data is represented using two lines called coordinate axes. The horizontal axis is denoted as the x-axis and the vertical axis is denoted as the y-axis. The point at which the two lines intersect is called the origin ‘O’. Consider the x-axis: the distance from the origin to the right side will take a positive value, and the distance from the origin to the left side will take a negative value. Similarly, for the y-axis, the points above the origin will take a positive value, and the points below the origin will take a negative value.


Generally, the frequency distribution is represented in four methods, namely

  • Smoothed frequency graph
  • Pie diagram
  • Cumulative or ogive frequency graph
  • Frequency Polygon

Merits of Using Graphs

Some of the merits of using graphs are as follows:

  • The graph is easily understood by everyone without any prior knowledge.
  • It saves time
  • It allows us to relate and compare the data for different time periods
  • It is used in statistics to determine the mean, median and mode for different data, as well as in the interpolation and the extrapolation of data.

Example of a Frequency Polygon Graph

Here are the steps to follow to construct a frequency polygon from a frequency distribution and represent it graphically.

  • Obtain the frequency distribution and find the midpoints of each class interval.
  • Represent the midpoints along x-axis and frequencies along the y-axis.
  • Plot the points corresponding to the frequency at each midpoint.
  • Join these points, using lines in order.
  • To complete the polygon, join the point at each end immediately to the lower or higher class marks on the x-axis.

Draw the frequency polygon for the following data

Class interval: 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80, 80-90
Frequency: 4, 6, 8, 10, 12, 14, 7, 5

Solution: Mark the class intervals along the x-axis and the frequencies along the y-axis.

Assume a class interval 0-10 with frequency zero and a class interval 90-100 with frequency zero.

Now calculate the midpoint of each class interval.

Class Interval Midpoint Frequency
0-10 5 0
10-20 15 4
20-30 25 6
30-40 35 8
40-50 45 10
50-60 55 12
60-70 65 14
70-80 75 7
80-90 85 5
90-100 95 0

Using the midpoint and the frequency value from the above table, plot the points A (5, 0), B (15, 4), C (25, 6), D (35, 8), E (45, 10), F (55, 12), G (65, 14), H (75, 7), I (85, 5) and J (95, 0).

To obtain the frequency polygon ABCDEFGHIJ, draw the line segments AB, BC, CD, DE, EF, FG, GH, HI, IJ, and connect all the points.
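A minimal Python sketch of the same construction, plotting the points A through J and joining them in order:

```python
import matplotlib.pyplot as plt

# Midpoints and frequencies from the table, including the zero-frequency end classes
midpoints = [5, 15, 25, 35, 45, 55, 65, 75, 85, 95]
freqs = [0, 4, 6, 8, 10, 12, 14, 7, 5, 0]

plt.plot(midpoints, freqs, marker="o")  # joins A(5, 0) ... J(95, 0) with line segments
plt.xlabel("Class interval midpoint")
plt.ylabel("Frequency")
plt.title("Frequency Polygon")
plt.show()
```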


Frequently Asked Questions

What are the Different Types of Graphical Representation?

Some of the various types of graphical representation include:

  • Line Graphs
  • Frequency Table
  • Circle Graph, etc.


What are the Advantages of Graphical Method?

Some of the advantages of graphical representation are:

  • It makes data more easily understandable.
  • It saves time.
  • It makes the comparison of data more efficient.


What Is Data Visualization: Brief Theory, Useful Tips and Awesome Examples

By Al Boicheva

Updated: June 23, 2022

Creating data visualizations to present your data is no longer just a nice-to-have skill. The ability to effectively sort and communicate your data through charts is now a must-have for any business in any field that deals with data. Data visualization helps businesses quickly make sense of complex data and start making decisions based on that data. This is why today we’ll talk about what data visualization is. We’ll discuss how and why it works, what types of charts to choose in which cases, how to create effective charts, and, of course, end with beautiful examples.

So let’s jump right in. As usual, don’t hesitate to fast-travel to a particular section of your interest.

Article overview: 1. What Does Data Visualization Mean? 2. How Does it Work? 3. When to Use it? 4. Why Use it? 5. Types of Data Visualization 6. Data Visualization VS Infographics: 5 Main Differences 7. How to Create Effective Data Visualization?: 5 Useful Tips 8. Examples of Data Visualization

1. What is Data Visualization?

Data visualization is a graphic representation of data that aims to communicate large and complex data efficiently, in a way that is easier to grasp and understand. In a sense, data visualization is the mapping between the original data and the graphic elements that determine how the attributes of those elements vary. The visualization is usually made with charts, lines, points, bars, and maps.

  • Data Viz is a branch of descriptive statistics, but it requires design, computing, and statistical skills.
  • Aesthetics and functionality go hand in hand to communicate complex statistics in an intuitive way.
  • Data Viz tools and technologies are essential for making data-driven decisions.
  • It’s a fine balance between form and functionality.
  • Every STEM field benefits from understanding data.

2. How Does it Work?

If we can see it, our brains can internalize and reflect on it. This is why it’s much easier and more effective to make sense of a chart and see trends than to read a massive document that would take a lot of time and focus to rationalize. We wouldn’t want to repeat the cliché that humans are visual creatures, but it’s a fact that visualization is much more effective and comprehensible.

In a way, we can say that data Viz is a form of storytelling with the purpose to help us make decisions based on data. Such data might include:

  • Tracking sales
  • Identifying trends
  • Identifying changes
  • Monitoring goals
  • Monitoring results
  • Combining data

3. When to Use it?

Data visualization is useful for companies that deal with lots of data on a daily basis. It’s essential to have your data and trends instantly visible, which beats scrolling through colossal spreadsheets. When the trends stand out instantly, your clients or viewers can understand them instead of getting lost in a clutter of numbers.

With that being said, Data Viz is suitable for:

  • Annual reports
  • Presentations
  • Social media micronarratives
  • Informational brochures
  • Trend-trafficking
  • Candlestick chart for financial analysis
  • Determining routes

Common cases when data visualization sees use are in sales, marketing, healthcare, science, finances, politics, and logistics.

4. Why Use it?

Short answer: decision making. Data visualization comes with the undeniable benefits of quickly recognizing patterns and interpreting data. More specifically, it is an invaluable tool for the following cases.

  • Identifying correlations between the relationship of variables.
  • Getting market insights about audience behavior.
  • Determining value vs risk metrics.
  • Monitoring trends over time.
  • Examining rates and potential through frequency.
  • Ability to react to changes.

5. Types of Data Visualization

As you probably already guessed, Data Viz is much more than simple pie charts and graphs styled in a visually appealing way. The methods that this branch uses to visualize statistics include a series of effective types.

Map visualization is a great method to analyze and display geographically related information and present it accurately via maps. This intuitive way aims to distribute data by region. Since maps can be 2D or 3D, static or dynamic, there are numerous combinations one can use in order to create a Data Viz map.

COVID-19 Spending Data Visualization POGO by George Railean

The most common ones, however, are:

  • Regional Maps: Classic maps that display countries, cities, or districts. They often represent data in different colors for different characteristics in each region.
  • Line Maps: They usually contain space and time and are ideal for routing, especially for driving or taxi routes in the area due to their analysis of specific scenes.
  • Point Maps: These maps distribute data of geographic information. They are ideal for businesses to pinpoint the exact locations of their buildings in a region.
  • Heat Maps: They indicate the weight of a geographical area based on a specific property. For example, a heat map may distribute the saturation of infected people by area.

Charts present data in the form of graphs, diagrams, and tables. They are often confused with graphs, since graphs are indeed a subcategory of charts. There is a small difference, however: graphs show the mathematical relationship between groups of data and are only one of the chart methods for representing data.

Gluten in America - chart data visualization

Infographic Data Visualization by Madeline VanRemmen

With that out of the way, let’s talk about the most basic types of charts in data visualization.

Bar Graphs

Finance Statistics - Bar Graph visualization

They use a series of bars to illustrate how data develops. They are ideal for lighter data and for following trends of no more than three variables; otherwise, the bars become cluttered and hard to comprehend. Ideal for year-on-year comparisons and monthly breakdowns.

Pie Charts

Pie chart visualization type

These familiar circular graphs divide data into portions. The bigger the slice, the bigger the portion. They are ideal for depicting sections of a whole, and the sections must always sum to 100%. Avoid pie charts when you need to show data development over time or when you lack a value for any of the portions. Doughnut charts have the same use as pie charts.

Line Graphs

Line graph - common visualization type

They use one or more lines that show development over time, allowing you to track multiple variables at once. A great example is tracking a brand’s product sales over the years. Area charts have the same use as line charts.

Scatter Plot

Scatter Plot - data visualization idea

These charts let you see patterns in the data. They have an x-axis and a y-axis for two different values. For example, if your x-axis contains information about car prices while the y-axis is about salaries, the positive or negative relationship between the two will tell you what a person’s car says about their salary.
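A minimal matplotlib sketch of a scatter plot; the salary and car-price pairs are invented to illustrate reading a relationship off the two axes.

```python
import matplotlib.pyplot as plt

# Hypothetical (salary, car price) pairs, in thousands
salaries = [30, 45, 50, 60, 75, 90, 110]
car_prices = [8, 12, 15, 18, 25, 30, 42]

plt.scatter(salaries, car_prices)  # each dot is one person
plt.xlabel("Salary (thousands)")
plt.ylabel("Car price (thousands)")
plt.title("Salary vs. Car Price")
plt.show()
```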

Tables

Unlike the charts we just discussed, tables show data in almost raw format. They are ideal when your data is hard to present visually, and when you aim to show specific numerical data that one is supposed to read rather than visualize.

Creative data table visualization

Data Visualisation | To bee or not to bee by Aishwarya Anand Singh

For example, charts are perfect to display data about a particular illness over a time period in a particular area, but a table comes to better use when you also need to understand specifics such as causes, outcomes, relapses, a period of treatment, and so on.

6. Data Visualization VS Infographics

5 Main Differences

They are not that different, as both visually represent data. You will often search for infographics and find images titled “data visualization”, and the other way around. In many cases, however, these titles aren’t misleading. Why is that?

  • Data visualization is made of just one element. It could be a map, a chart, or a table. Infographics , on the other hand, often include multiple Data Viz elements.
  • Unlike data visualizations that can be simple or extremely complex and heavy, infographics are simple and target wider audiences. The latter is usually comprehensible even to people outside of the field of research the infographic represents.
  • Interestingly enough, data Viz doesn’t offer narratives and conclusions, it’s a tool and basis for reaching those. While infographics, in most cases offer a story and a narrative. For example, a data visualization map may have the title “Air pollution saturation by region”, while an infographic with the same data would go “Areas A and B are the most polluted in Country C”.
  • Data visualizations can be made in Excel or with other tools that automatically generate the design, unless they are set for presentation or publishing. The aesthetics of infographics , however, are of great importance, and the designs must be appealing to wider audiences.
  • In terms of interaction, data visualizations often offer interactive charts, especially in an online form. Infographics, on the other hand, rarely have interaction and are usually static images.

While on topic, you could also be interested to check out these 50 engaging infographic examples that make complex data look great.

7. Tips to Create Effective Data Visualization

The process is naturally similar to creating Infographics and it revolves around understanding your data and audience. To be more precise, these are the main steps and best practices when it comes to preparing an effective visualization of data for your viewers to instantly understand.

1. Do Your Homework

Preparation is half the work already done. Before you even start visualizing data, you have to be sure you understand that data to the last detail.

Knowing your audience is undeniably another important part of the homework, as different audiences process information differently. Who are the people you’re visualizing data for? How do they process visual data? Is it enough to hand them a single pie chart, or will they need a more in-depth visual report?

The third part of preparing is to determine exactly what you want to communicate to the audience. What kind of information are you visualizing, and does it reflect your goal?

And last, think about how much data you’ll be working with and take it into account.

2. Choose the Right Type of Chart

In a previous section, we listed the basic chart types that find use in data visualization. To determine best which one suits your work, there are a few things to consider.

  • How many variables will you have in a chart?
  • How many items will you place for each of your variables?
  • What will be the relation between the values (time period, comparison, distributions, etc.)?

With that being said, a pie chart is ideal if you need to present the portion of a whole that each item takes. For example, you can use it to showcase what percent of the market share a particular product takes. Pie charts, however, are unsuitable for distributions, comparisons, and following trends through time periods. Bar graphs, scatter plots, and line graphs are much more effective in those cases.

Another example is how to use time in your charts. It’s way more accurate to use a horizontal axis because time should run left to right. It’s way more visually intuitive.

3. Sort your Data

Start with removing every piece of data that does not add value and is basically excess for the chart. Sometimes, you have to work with a huge amount of data which will inevitably make your chart pretty complex and hard to read. Don’t hesitate to split your information into two or more charts. If that won’t work for you, you could use highlights or change the entire type of chart with something that would fit better.

Tip: When you use bar charts and columns for comparison, sort the information in an ascending or a descending way by value instead of alphabetical order.
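The sorting tip as a minimal Python sketch, with invented category totals:

```python
import matplotlib.pyplot as plt

# Hypothetical category totals, sorted descending by value rather than alphabetically
data = {"Gamma": 12, "Alpha": 30, "Delta": 7, "Beta": 21}
items = sorted(data.items(), key=lambda kv: kv[1], reverse=True)

labels, values = zip(*items)
plt.bar(labels, values)
plt.title("Categories Sorted by Value")
plt.show()
```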

4. Use Colors to Your Advantage

In every form of visualization, colors are your best friend and the most powerful tool. They create contrasts, accents, and emphasis and lead the eye intuitively. Even here, color theory is important.

When you design your chart, make sure you don’t use more than 5 or 6 colors. Anything more than that will make your graph overwhelming and hard to read for your viewers. However, color intensity is a different thing that you can use to your advantage. For example, when you compare the same concept in different periods of time, you could sort your data from the lightest shade of your chosen color to its darker one. It creates a strong visual progression, proper to your timeline.

Things to consider when you choose colors:

  • Different colors for different categories.
  • A consistent color palette for all charts in a series that you will later compare.
  • It’s appropriate to use color blind-friendly palettes.
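One way to get that light-to-dark progression is to sample a single sequential colormap; a minimal sketch with matplotlib's built-in Blues palette and invented yearly figures:

```python
import matplotlib.pyplot as plt
import numpy as np

years = ["2019", "2020", "2021", "2022"]
values = [10, 14, 17, 23]  # hypothetical yearly figures

# Sample increasingly dark shades of one sequential colormap for the timeline
shades = plt.cm.Blues(np.linspace(0.4, 0.9, len(years)))
plt.bar(years, values, color=shades)
plt.title("Same Metric Across Years, Light to Dark")
plt.show()
```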

5. Get Inspired

Always put your inspiration to work when you want to be at the top of your game. Look through examples, infographics, and other people’s work and see what works best for each type of data you need to implement.

This Twitter account Data Visualization Society is a great way to start. In the meantime, we’ll also handpick some amazing examples that will get you in the mood to start creating the visuals for your data.

8. Examples for Data Visualization

As another art form, Data Viz is a fertile ground for some amazing well-designed graphs that prove that data is beautiful. Now let’s check out some.

Dark Souls III Experience Data

We start with Meng Hsiao Wei’s personal project presenting his experience with playing Dark Souls 3. It’s a perfect example that infographics and data visualization are tools for personal designs as well. The research is pretty massive yet very professionally sorted into different types of charts for the different concepts. All data visualizations are made with the same color palette and look great in infographics.

Data of My Dark Souls 3 example

My dark souls 3 playing data by Meng Hsiao Wei

Greatest Movies of all Time

Katie Silver has compiled a list of the 100 greatest movies of all time based on critics and crowd reviews. The visualization shows key data points for every movie such as year of release, oscar nominations and wins, budget, gross, IMDB score, genre, filming location, setting of the film, and production studio. All movies are ordered by the release date.

Greatest Movies visualization chart

100 Greatest Movies Data Visualization by Katie Silver

The Most Violent Cities

Federica Fragapane shows data for the 50 most violent cities in the world in 2017. The items are arranged on a vertical axis based on population and ordered along the horizontal axis according to the homicide rate.

The Most Violent Cities example

The Most Violent Cities by Federica Fragapane

Family Businesses as Data

These data visualizations and illustrations were made by Valerio Pellegrini for Perspectives Magazine. They show a pie chart with sector breakdown as well as a scatter plot for contribution for employment.

Family Businesses as Data Visual

PERSPECTIVES MAGAZINE – Family Businesses by Valerio Pellegrini

Orbit Map of the Solar System

The map shows data on the orbits of more than 18,000 asteroids in the solar system. Each asteroid is shown at its position on New Year’s Eve 1999, colored by type of asteroid.

Orbit Map of the Solar System graphic

An Orbit Map of the Solar System by Eleanor Lutz

The Semantics Of Headlines

Katja Flükiger has a take on how headlines tell the story. The data visualization aims to communicate how much the selling influences the telling. The project was completed at Maryland Institute College of Art to visualize references to immigration, color-coding the value judgments implied by word choice and context.

The Semantics Of Headlines graph

The Semantics of Headlines by Katja Flükiger

Moon and Earthquakes

This data visualization works on answering whether the moon is responsible for earthquakes. The chart features the time and intensity of earthquakes in response to the phase and orbit location of the moon.

Moon and Earthquakes statistics visual

Moon and Earthquakes by Aishwarya Anand Singh

Dawn of the Nanosats

The visualization shows the satellites launched from 2003 to 2015. The graph represents the types of institutions focused on the projects as well as the nations that financed them. The left side shows the number of launches per year and the satellite applications.

Dawn of the Nanosats visualization

WIRED UK – Dawn of the Nanosats by Valerio Pellegrini

Final Words

Data visualization is not only a form of science but also a form of art. Its purpose is to help businesses in any field quickly make sense of complex data and start making decisions based on that data. To make your graphs efficient and easy to read, it’s all about knowing your data and audience. This way you’ll be able to choose the right type of chart and use visual techniques to your advantage.



Introduction to Data Visualization (The Sheridan Libraries, Johns Hopkins University)

What is Data Visualization?


"There is a magic in graphs."

"the profile of a curve reveals in a flash a whole situation — the life history of an epidemic, a panic, or an era of prosperity. the curve informs the mind, awakens the imagination, convinces .", - henry d. hubbard, national bureau of standards, data visualization is the graphical representation of data for understanding and communication. this encompasses two primary classes of visualization:, information visualization - visualization of data. this can either be:         exploratory: you are trying to explore and understand patterns and trends within your data.         explanatory:  there is something in your data you would like to communicate to your audience., scientific visualization - scientific visualization involves the visualization of data with an inherent spatial component. this can be the visualization of scalar, vector, and tensor fields. common areas of scientific visualization include computational fluid dynamics, medical imaging and analysis and weather data analysis..

Good data visualizations allow us to reason and think effectively about our data. By presenting information visually, we offload internal cognition onto the perceptual system. If we see numerical data in a table, we may be able to find a trend, but it will take a significant amount of work on our part to recognize and conceptualize that trend. By plotting that data visually, the trend becomes immediately clear to our mind through our perceptual system.

A good example of this is "Anscombe's quartet", four datasets that share the same descriptive statistics, including mean, variance, and correlation.

Anscombe's Quartet Table

Upon visual inspection, it becomes immediately clear that these datasets, while seemingly identical according to common summary statistics, are each unique. This is the power of effective data visualization: it allows us to bypass cognition by communicating directly with our perceptual system.
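The claim is easy to verify numerically; a minimal Python sketch using the standard Anscombe values:

```python
import numpy as np

# Anscombe's quartet: same means, variances, and correlation, very different shapes
x1 = x2 = x3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

for x, y in [(x1, y1), (x2, y2), (x3, y3), (x4, y4)]:
    print(round(np.mean(y), 2), round(np.var(y, ddof=1), 2),
          round(np.corrcoef(x, y)[0, 1], 2))
# each dataset prints roughly the same summary: mean ~7.5, variance ~4.12, correlation ~0.82
```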

Anscombe's Quartet plot published under the terms of "Creative Commons Attribution-Share Alike", source: http://commons.wikimedia.org/wiki/File:Anscombe%27s_quartet_3.svg Anscombe's Quartet table source: https://multithreaded.stitchfix.com/assets/images/blog/anscombes_quartet_table.png

These materials are licensed under a Creative Commons license, attributable to Johns Hopkins University.
Source: https://guides.library.jhu.edu/datavisualization


Graphical Representation

Graphical Representation Definition

Graphical representation refers to the use of charts and graphs to visually display, analyze, clarify, and interpret numerical data, functions, and other qualitative structures.


What is Graphical Representation?

Graphical representation refers to the use of intuitive charts to clearly visualize and simplify data sets. Data is ingested into graphical representation software and then represented by a variety of symbols, such as lines on a line chart, bars on a bar chart, or slices on a pie chart, from which users can gain greater insight than from numerical analysis alone.

Representational graphics can quickly illustrate general behavior and highlight phenomena, anomalies, and relationships between data points that might otherwise be overlooked, and can contribute to predictions and better, data-driven decisions. The types of representational graphics used will depend on the type of data being explored.

Types of Graphical Representation

Data charts are available in a wide variety of maps, diagrams, and graphs that typically include textual titles and legends to denote the purpose, measurement units, and variables of the chart. Choosing the most appropriate chart depends on a variety of different factors -- the nature of the data, the purpose of the chart, and whether a graphical representation of qualitative data or a graphical representation of quantitative data is being depicted. There are dozens of different formats for graphical representation of data. Some of the most popular charts include:

  • Bar Graph -- contains a vertical axis and horizontal axis and displays data as rectangular bars with lengths proportional to the values that they represent; a useful visual aid for marketing purposes
  • Choropleth -- thematic map in which an aggregate summary of a geographic characteristic within an area is represented by patterns of shading proportionate to a statistical variable
  • Flow Chart -- diagram that depicts a workflow graphical representation with the use of arrows and geometric shapes; a useful visual aid for business and finance purposes
  • Heatmap -- a colored, two-dimensional matrix of cells in which each cell represents a grouping of data and each cell’s color indicates its relative value
  • Histogram – frequency distribution and graphical representation uses adjacent vertical bars erected over discrete intervals to represent the data frequency within a given interval; a useful visual aid for meteorology and environment purposes
  • Line Graph – displays continuous data; ideal for predicting future events over time;  a useful visual aid for marketing purposes
  • Pie Chart -- shows percentage values as a slice of pie; a useful visual aid for marketing purposes
  • Pointmap -- CAD & GIS contract mapping and drafting solution that visualizes the location of data on a map by plotting geographic latitude and longitude data
  • Scatter plot -- a diagram that shows the relationship between two sets of data, where each dot represents individual pieces of data and each axis represents a quantitative measure
  • Stacked Bar Graph -- a graph in which each bar is segmented into parts, with the entire bar representing the whole, and each segment representing different categories of that whole; a useful visual aid for political science and sociology purposes
  • Timeline Chart -- a long bar labelled with dates paralleling it that display a list of events in chronological order, a useful visual aid for history charting purposes
  • Tree Diagram -- a hierarchical genealogical tree that illustrates a family structure; a useful visual aid for history charting purposes
  • Venn Diagram -- consists of multiple overlapping shapes, usually circles, each representing a set; the default inner join graphical representation

Proprietary and open source software for graphical representation of data is available in a wide variety of programming languages. Software packages often provide spreadsheets equipped with built-in charting functions.

Advantages and Disadvantages of Graphical Representation of Data

Tabular and graphical representations of data are vital components in analyzing and understanding large quantities of numerical data and the relationships between data points. Data visualization is one of the most fundamental approaches to data analysis, providing an intuitive and universal means to visualize, abstract, and share complex data patterns. The primary advantages of graphical representation of data are:

  • Facilitates and improves learning: graphics make data easy to understand and eliminate language and literacy barriers
  • Understanding content: visuals are more effective than text in human understanding
  • Flexibility of use: graphical representation can be leveraged in nearly every field involving data
  • Increases structured thinking: users can make quick, data-driven decisions at a glance with visual aids
  • Supports creative, personalized reports for more engaging and stimulating visual  presentations 
  • Improves communication: analyzing graphs that highlight relevant themes is significantly faster than reading through a descriptive report line by line
  • Shows the whole picture: an instantaneous, full view of all variables, time frames, data behavior and relationships

Disadvantages of graphical representation of data typically concern the cost of human effort and resources, the process of selecting the most appropriate graphical and tabular representation of data, greater design complexity of visualizing data, and the potential for human bias.

Why Graphical Representation of Data is Important

Graphic visual representation of information is a crucial component in understanding and identifying patterns and trends in the ever increasing flow of data. Graphical representation enables the quick analysis of large amounts of data at one time and can aid in making predictions and informed decisions. Data visualizations also make collaboration significantly more efficient by using familiar visual metaphors to illustrate relationships and highlight meaning, eliminating complex, long-winded explanations of an otherwise chaotic-looking array of figures. 

Data only has value once its significance has been revealed and consumed, and its consumption is best facilitated with graphical representation tools that are designed with human cognition and perception in mind. Human visual processing is very efficient at detecting relationships and changes between sizes, shapes, colors, and quantities. Attempting to gain insight from numerical data alone, especially in big data instances in which there may be billions of rows of data, is exceedingly cumbersome and inefficient.

Does HEAVY.AI Offer a Graphical Representation Solution?

HEAVY.AI's visual analytics platform is an interactive data visualization client that works seamlessly with server-side technologies HEAVY.AIDB and Render to enable data science analysts to easily visualize and instantly interact with massive datasets. Analysts can interact with conventional charts and data tables, as well as big data graphical representations such as massive-scale scatterplots and geo charts. Data visualization contributes to a broad range of use cases, including performance analysis in business and guiding research in academia.

Data visualization is the representation of data through use of common graphics, such as charts, plots, infographics and even animations. These visual displays of information communicate complex data relationships and data-driven insights in a way that is easy to understand.

Data visualization can be utilized for a variety of purposes, and it’s important to note that it is not only reserved for use by data teams. Management also leverages it to convey organizational structure and hierarchy while data analysts and data scientists use it to discover and explain patterns and trends. Harvard Business Review (link resides outside ibm.com) categorizes data visualization into four key purposes: idea generation, idea illustration, visual discovery, and everyday data viz. We’ll delve deeper into these below:

Idea generation

Data visualizations are commonly used to spur idea generation across teams. They are frequently leveraged during brainstorming or Design Thinking sessions at the start of a project by supporting the collection of different perspectives and highlighting the common concerns of the collective. While these visualizations are usually unpolished and unrefined, they help set the foundation within the project to ensure that the team is aligned on the problem that they’re looking to address for key stakeholders.

Idea illustration

Data visualization for idea illustration assists in conveying an idea, such as a tactic or process. It is commonly used in learning settings, such as tutorials, certification courses, and centers of excellence, but it can also be used to represent organization structures or processes, facilitating communication between the right individuals for specific tasks. Project managers frequently use Gantt charts and waterfall charts to illustrate workflows. Data modeling also uses abstraction to represent and better understand data flow within an enterprise’s information system, making it easier for developers, business analysts, data architects, and others to understand the relationships in a database or data warehouse.

Visual discovery

Visual discovery and everyday data viz are more closely aligned with data teams. While visual discovery helps data analysts, data scientists, and other data professionals identify patterns and trends within a dataset, everyday data viz supports the subsequent storytelling after a new insight has been found.

Everyday data viz

Data visualization is a critical step in the data science process, helping teams and individuals convey data more effectively to colleagues and decision makers. Teams that manage reporting systems typically leverage defined template views to monitor performance. However, data visualization isn’t limited to performance dashboards. For example, while text mining, an analyst may use a word cloud to capture key concepts, trends, and hidden relationships within this unstructured data. Alternatively, they may utilize a graph structure to illustrate relationships between entities in a knowledge graph. There are a number of ways to represent different types of data, and it’s important to remember that it is a skillset that should extend beyond your core analytics team.

The earliest form of data visualization can be traced back to the Egyptians in the pre-17th century, largely used to assist in navigation. As time progressed, people leveraged data visualizations for broader applications, such as in the economic, social, and health disciplines. Perhaps most notably, Edward Tufte published The Visual Display of Quantitative Information (link resides outside ibm.com), which illustrated that individuals could utilize data visualization to present data in a more effective manner. His book continues to stand the test of time, especially as companies turn to dashboards to report their performance metrics in real-time. Dashboards are effective data visualization tools for tracking and visualizing data from multiple data sources, providing visibility into the effects of specific behaviors by a team or an adjacent one on performance. Dashboards include common visualization techniques, such as:

  • Tables: This consists of rows and columns used to compare variables. Tables can show a great deal of information in a structured way, but they can also overwhelm users that are simply looking for high-level trends.
  • Pie charts and stacked bar charts: These graphs are divided into sections that represent parts of a whole. They provide a simple way to organize data and compare the size of each component to one another.
  • Line charts and area charts:  These visuals show change in one or more quantities by plotting a series of data points over time and are frequently used within predictive analytics. Line graphs utilize lines to demonstrate these changes while area charts connect data points with line segments, stacking variables on top of one another and using color to distinguish between variables.
  • Histograms: This graph plots a distribution of numbers using a bar chart (with no spaces between the bars), representing the quantity of data that falls within a particular range. This visual makes it easy for an end user to identify outliers within a given dataset.
  • Scatter plots: These visuals are beneficial in revealing the relationship between two variables, and they are commonly used within regression data analysis. However, these can sometimes be confused with bubble charts, which are used to visualize three variables via the x-axis, the y-axis, and the size of the bubble.
  • Heat maps: These graphical displays are helpful in visualizing behavioral data by location. This can be a location on a map, or even a webpage.
  • Tree maps: These display hierarchical data as a set of nested shapes, typically rectangles. Treemaps are great for comparing the proportions between categories via their area size.

Access to data visualization tools has never been easier. Open source libraries, such as D3.js, provide a way for analysts to present data in an interactive way, allowing them to engage a broader audience with new data. Some of the most popular open source visualization libraries include:

  • D3.js: It is a front-end JavaScript library for producing dynamic, interactive data visualizations in web browsers.  D3.js  (link resides outside ibm.com) uses HTML, CSS, and SVG to create visual representations of data that can be viewed on any browser. It also provides features for interactions and animations.
  • ECharts:  A powerful charting and visualization library that offers an easy way to add intuitive, interactive, and highly customizable charts to products, research papers, presentations, etc.  Echarts  (link resides outside ibm.com) is based in JavaScript and ZRender, a lightweight canvas library.
  • Vega:   Vega  (link resides outside ibm.com) defines itself as “visualization grammar,” providing support to customize visualizations across large datasets which are accessible from the web.
  • deck.gl: It is part of Uber's open source visualization framework suite.  deck.gl  (link resides outside ibm.com) is a framework, which is used for  exploratory data analysis  on big data. It helps build high-performance GPU-powered visualization on the web.

With so many data visualization tools readily available, there has also been a rise in ineffective information visualization. Visual communication should be simple and deliberate to ensure that your data visualization helps your target audience arrive at your intended insight or conclusion. The following best practices can help ensure your data visualization is useful and clear:

Set the context: It’s important to provide general background information to ground the audience around why this particular data point is important. For example, if e-mail open rates were underperforming, we may want to illustrate how a company’s open rate compares to the overall industry, demonstrating that the company has a problem within this marketing channel. To drive an action, the audience needs to understand how current performance compares to something tangible, like a goal, benchmark, or other key performance indicators (KPIs).

Know your audience(s): Think about who your visualization is designed for and then make sure your data visualization fits their needs. What is that person trying to accomplish? What kind of questions do they care about? Does your visualization address their concerns? You’ll want the data that you provide to motivate people to act within their scope of their role. If you’re unsure if the visualization is clear, present it to one or two people within your target audience to get feedback, allowing you to make additional edits prior to a large presentation.

Choose an effective visual:  Specific visuals are designed for specific types of datasets. For instance, scatter plots display the relationship between two variables well, while line graphs display time series data well. Ensure that the visual actually assists the audience in understanding your main takeaway. Misalignment of charts and data can result in the opposite, confusing your audience further versus providing clarity.

Keep it simple:  Data visualization tools can make it easy to add all sorts of information to your visual. However, just because you can, it doesn’t mean that you should! In data visualization, you want to be very deliberate about the additional information that you add to focus user attention. For example, do you need data labels on every bar in your bar chart? Perhaps you only need one or two to help illustrate your point. Do you need a variety of colors to communicate your idea? Are you using colors that are accessible to a wide range of audiences (e.g. accounting for color blind audiences)? Design your data visualization for maximum impact by eliminating information that may distract your target audience.

18 Best Types of Charts and Graphs for Data Visualization [+ Guide]

Erica Santiago

Published: May 22, 2024

As a writer for the marketing blog, I frequently use various types of charts and graphs to help readers visualize the data I collect and better understand their significance. And trust me, there's a lot of data to present.

In fact, the volume of data created, captured, copied, and consumed in 2025 is projected to be almost double what we produce today.

This makes data visualization essential for businesses. Different types of graphs and charts can help you:

  • Motivate your team to take action.
  • Impress stakeholders with goal progress.
  • Show your audience what you value as a business.

Data visualization builds trust and can organize diverse teams around new initiatives. So, I'm going to talk about the types of graphs and charts that you can use to grow your business.

And, if you still need a little more guidance by the end of this post, check out our data visualization guide for more information on how to design visually stunning and engaging charts and graphs.  

Charts vs Graphs: What's the Difference?

A lot of people think charts and graphs are synonymous (I know I did), but they're actually two different things.

Charts visually represent current data in the form of tables and diagrams, but graphs are more numerical and show how one variable affects another.

For example, in one of my favorite sitcoms, How I Met Your Mother, Marshall creates a bunch of charts and graphs representing his life. One of these charts is a Venn diagram referencing the song "Cecilia" by Simon & Garfunkel.

Marshall says, "This circle represents people who are breaking my heart, and this circle represents people who are shaking my confidence daily. Where they overlap? Cecilia."

The diagram is a chart and not a graph because it doesn't track how these people make him feel over time or how these variables are influenced by each other.

It may show where the two types of people intersect but not how they influence one another.

Later, Marshall makes a line graph showing how his friends' feelings about his charts have changed in the time since presenting his "Cecilia diagram."

Note: He calls the line graph a chart on the show, but it's acceptable because the nature of line graphs and charts makes the terms interchangeable. I'll explain later, I promise.

The line graph shows how the time since showing his Cecilia chart has influenced his friends' tolerance for his various graphs and charts. 

Marshall graph

I can't even begin to tell you all how happy I am to reference my favorite HIMYM joke in this post.

Now, let's dive into the various types of graphs and charts. 

Different Types of Graphs for Data Visualization

1. Bar Graph

I strongly suggest using a bar graph to avoid clutter when one data label is long or if you have more than 10 items to compare. Also, fun fact: if the example below were vertical, it would be a column graph.

Customer bar graph example

Best Use Cases for These Types of Graphs

Bar graphs can help track changes over time. I've found that bar graphs are most useful when there are big changes or to show how one group compares against other groups.

The example above compares the number of customers by business role. It makes it easy to see that there is more than twice the number of customers per role for individual contributors than any other group.

A bar graph also makes it easy to see which group of data is highest or most common.

For example, at the start of the pandemic, online businesses saw a big jump in traffic. So, if you want to look at monthly traffic for an online business, a bar graph would make it easy to see that jump.

Other use cases for bar graphs include:

  • Product comparisons.
  • Product usage.
  • Category comparisons.
  • Marketing traffic by month or year.
  • Marketing conversions.

Design Best Practices for Bar Graphs

  • Use consistent colors throughout the chart, selecting accent colors to highlight meaningful data points or changes over time.
  • Use horizontal labels to improve readability.
  • Start the y-axis at 0 to appropriately reflect the values in your graph.

2. Line Graph

A line graph reveals trends or progress over time, and you can use it to show many different categories of data. You should use it when you track a continuous data set.

This makes the terms line graphs and line charts interchangeable because the very nature of both is to track how variables impact each other, particularly how something changes over time. Yeah, it confused me, too.

Types of graphs — example of a line graph.

Line graphs help users track changes over short and long periods. Because of this, I find these types of graphs are best for seeing small changes.

Line graphs help me compare changes for more than one group over the same period. They're also helpful for measuring how different groups relate to each other.

A business might use this graph to compare sales rates for different products or services over time.

These charts are also helpful for measuring service channel performance. For example, a line graph that tracks how many chats or emails your team responds to per month.

Design Best Practices for Line Graphs

  • Use solid lines only.
  • Don't plot more than four lines to avoid visual distractions.
  • Use the right height so the lines take up roughly 2/3 of the y-axis' height.

3. Bullet Graph

A bullet graph reveals progress towards a goal, compares this to another measure, and provides context in the form of a rating or performance.

Types of graph — example of a bullet graph.

In the example above, the bullet graph shows the number of new customers against a set customer goal. Bullet graphs are great for comparing performance against goals like this.

These types of graphs can also help teams assess possible roadblocks because you can analyze data in a tight visual display.

For example, I could create a series of bullet graphs measuring performance against benchmarks or use a single bullet graph to visualize these KPIs against their goals:

  • Customer satisfaction.
  • Average order size.
  • New customers.

Seeing this data at a glance and alongside each other can help teams make quick decisions.

Bullet graphs are one of the best ways to display year-over-year data analysis. Bullet graphs can also visualize:

  • Customer satisfaction scores.
  • Customer shopping habits.
  • Social media usage by platform.

Design Best Practices for Bullet Graphs

  • Use contrasting colors to highlight how the data is progressing.
  • Use one color in different shades to gauge progress.

4. Column + Line Graph

Column + line graphs are also called dual-axis charts. They consist of a column and line graph together, with both graphics sharing the x-axis but each occupying its own y-axis.

Best Use Cases

These graphs are best for comparing two data sets with different measurement units, such as rate and time. 

As a marketer, you may want to track two trends at once.

Design Best Practices 

Use distinct colors for the line and the columns to make the graph more visually appealing and to further differentiate the data.

The Four Basic Types of Charts

Before we get into charts, I want to touch on the four basic chart types that I use the most. 

1. Bar Chart

Bar charts are pretty self-explanatory. I use them to indicate values by the length of bars, which can be displayed horizontally or vertically. Vertical bar charts, like the one below, are sometimes called column charts. 

bar chart examples

2. Line Chart 

I use line charts to show changes in values across continuous measurements, such as across time, generations, or categories. For example, the chart below shows the changes in ice cream sales throughout the week.

line chart example

3. Scatter Plot

A scatter plot uses dotted points to compare values against two different variables on separate axes. It's commonly used to show correlations between values and variables. 

scatter plot examples

4. Pie Chart

Pie charts are charts that represent data in a circular (pie-shaped) graphic, and each slice represents a percentage or portion of the whole. 

Notice the example below of a household budget. (Which reminds me that I need to set up my own.)

Notice that the percentage of income going to each expense is represented by a slice. 

pie chart

Different Types of Charts for Data Visualization

To better understand chart types and how you can use them, here's an overview of each:

1. Column Chart

Use a column chart to show a comparison among different items or to show a comparison of items over time. You could use this format to see the revenue per landing page or customers by close date.

Types of charts — example of a column chart.

Best Use Cases for This Type of Chart

I use both column and bar charts to display changes in data, but I've noticed column charts are best for negative data. The main difference, of course, is that column charts show information vertically while bar charts show data horizontally.

For example, warehouses often track the number of accidents on the shop floor. When the number of incidents falls below the monthly average, a column chart can make that change easier to see in a presentation.

In the example above, this column chart measures the number of customers by close date. Column charts make it easy to see data changes over a period of time. This means that they have many use cases, including:

  • Customer survey data, like showing how many customers prefer a specific product or how much a customer uses a product each day.
  • Sales volume, like showing which services are the top sellers each month or the number of sales per week.
  • Profit and loss, showing where business investments are growing or falling.

Design Best Practices for Column Charts

  • Use horizontal labels to improve readability.
  • Start the y-axis at 0 to appropriately reflect the values in your chart.

2. Area Chart

Okay, an area chart is basically a line chart, but I swear there's a meaningful difference.

The space between the x-axis and the line is filled with a color or pattern. It is useful for showing part-to-whole relations, like showing individual sales reps’ contributions to total sales for a year.

It helps me analyze both overall and individual trend information.

Types of charts — example of an area chart.

Best Use Cases for These Types of Charts

Area charts help show changes over time. They work best for big differences between data sets and help visualize big trends.

For example, the chart above shows users by creation date and life cycle stage.

A line chart could show more subscribers than marketing qualified leads. But this area chart emphasizes how much bigger the number of subscribers is than any other group.

These charts make the size of a group and how groups relate to each other more visually important than data changes over time.

Area charts  can help your business to:

  • Visualize which product categories or products within a category are most popular.
  • Show key performance indicator (KPI) goals vs. outcomes.
  • Spot and analyze industry trends.

Design Best Practices for Area Charts

  • Use transparent colors so information isn't obscured in the background.
  • Don't display more than four categories to avoid clutter.
  • Organize highly variable data at the top of the chart to make it easy to read.

3. Stacked Bar Chart

I suggest using this chart to compare many different items and show the composition of each item you’re comparing.

Types of charts — example of a stacked bar chart.

These charts  are helpful when a group starts in one column and moves to another over time.

For example, the difference between a marketing qualified lead (MQL) and a sales qualified lead (SQL) is sometimes hard to see. The chart above helps stakeholders see these two lead types from a single point of view — when a lead changes from MQL to SQL.

Stacked bar charts are excellent for marketing. They make it simple to add a lot of data on a single chart or to make a point with limited space.

These charts  can show multiple takeaways, so they're also super for quarterly meetings when you have a lot to say but not a lot of time to say it.

Stacked bar charts are also a smart option for planning or strategy meetings. This is because these charts can show a lot of information at once, but they also make it easy to focus on one stack at a time or move data as needed.

You can also use these charts to:

  • Show the frequency of survey responses.
  • Identify outliers in historical data.
  • Compare a part of a strategy to its performance as a whole.

Design Best Practices for Stacked Bar Charts

  • Best used to illustrate part-to-whole relationships.
  • Use contrasting colors for greater clarity.
  • Make the chart scale large enough to view group sizes in relation to one another.

4. Mekko Chart

Also known as a Marimekko chart, this type of chart  can compare values, measure each one's composition, and show data distribution across each one.

It's similar to a stacked bar, except the Mekko's x-axis can capture another dimension of your values — instead of time progression, like column charts often do. In the graphic below, the x-axis compares the cities to one another.

Types of charts — example of a Mekko chart.

I typically use a Mekko chart to show growth, market share, or competitor analysis.

For example, the Mekko chart above shows the market share of asset managers grouped by location and the value of their assets. This chart clarifies which firms manage the most assets in different areas.

It's also easy to see which asset managers are the largest and how they relate to each other.

Mekko charts can seem more complex than other types of charts, so it's best to use these in situations where you want to emphasize scale or differences between groups of data.

Other use cases for Mekko charts include:

  • Detailed profit and loss statements.
  • Revenue by brand and region.
  • Product profitability.
  • Share of voice by industry or niche.

Design Best Practices for Mekko Charts

  • Vary your bar heights if the portion size is an important point of comparison.
  • Don't include too many composite values within each bar. Consider reevaluating your presentation if you have a lot of data.
  • Order your bars from left to right in such a way that exposes a relevant trend or message.

5. Pie Chart

Remember, a pie chart represents numbers in percentages, and the total sum of all segments needs to equal 100%.

Types of charts — example of a pie chart.

The image above shows another example of customers by role in the company.

The bar chart  example shows you that there are more individual contributors than any other role. But this pie chart makes it clear that they make up over 50% of customer roles.

Pie charts make it easy to see a section in relation to the whole, so they are good for showing:

  • Customer personas in relation to all customers.
  • Revenue from your most popular products or product types in relation to all product sales.
  • Percent of total profit from different store locations.

Design Best Practices for Pie Charts

  • Don't illustrate too many categories to ensure differentiation between slices.
  • Ensure that the slice values add up to 100%.
  • Order slices according to their size.

6. Scatter Plot Chart

As I said earlier, a scatter plot or scattergram chart will show the relationship between two different variables or reveal distribution trends.

Use this chart when there are many different data points, and you want to highlight similarities in the data set. This is useful when looking for outliers or understanding your data's distribution.

Types of charts — example of a scatter plot chart.

Scatter plots are helpful in situations where you have too much data to see a pattern quickly. They are best when you use them to show relationships between two large data sets.

In the example above, this chart shows how customer happiness relates to the time it takes for them to get a response.

This type of chart  makes it easy to compare two data sets. Use cases might include:

  • Employment and manufacturing output.
  • Retail sales and inflation.
  • Visitor numbers and outdoor temperature.
  • Sales growth and tax laws.

Try to choose two data sets that already have a positive or negative relationship. That said, this type of chart  can also make it easier to see data that falls outside of normal patterns.

Design Best Practices for Scatter Plots

  • Include more variables, like different sizes, to incorporate more data.
  • Start the y-axis at 0 to represent data accurately.
  • If you use trend lines, only use a maximum of two to make your plot easy to understand.

7. Bubble Chart

A bubble chart is similar to a scatter plot in that it can show distribution or relationship. There is a third data set shown by the size of the bubble or circle.

 Types of charts — example of a bubble chart.

In the example above, the number of hours spent online isn't just compared to the user's age, as it would be on a scatter plot chart.

Instead, you can also see how the gender of the user impacts time spent online.

This makes bubble charts useful for seeing the rise or fall of trends over time. It also lets you add another option when you're trying to understand relationships between different segments or categories.

For example, if you want to launch a new product, this chart could help you quickly see your new product's cost, risk, and value. This can help you focus your energies on a low-risk new product with a high potential return.

You can also use bubble charts for:

  • Top sales by month and location.
  • Customer satisfaction surveys.
  • Store performance tracking.
  • Marketing campaign reviews.

Design Best Practices for Bubble Charts

  • Scale bubbles according to area, not diameter.
  • Make sure labels are clear and visible.
  • Use circular shapes only.

8. Waterfall Chart

I sometimes use a waterfall chart to show how an initial value changes with intermediate values — either positive or negative — and results in a final value.

Use this chart to reveal the composition of a number. An example of this would be to showcase how different departments influence overall company revenue and lead to a specific profit number.

Types of charts — example of a waterfall chart.

9. Funnel Chart

The most common use case for a funnel chart is the marketing or sales funnel. But there are many other ways to use this versatile chart.

If you have at least four stages of sequential data, this chart can help you easily see what inputs or outputs impact the final results.

For example, a funnel chart can help you see how to improve your buyer journey or shopping cart workflow. This is because it can help pinpoint major drop-off points.

Other stellar options for these types of charts include:

  • Deal pipelines.
  • Conversion and retention analysis.
  • Bottlenecks in manufacturing and other multi-step processes.
  • Marketing campaign performance.
  • Website conversion tracking.

Design Best Practices for Funnel Charts

  • Scale the size of each section to accurately reflect the size of the data set.
  • Use contrasting colors or one color in graduated hues, from darkest to lightest, as the size of the funnel decreases.

10. Heat Map

A heat map shows the relationship between two items and provides rating information, such as high to low or poor to excellent. This chart displays the rating information using varying colors or saturation.

 Types of charts — example of a heat map.

Best Use Cases for Heat Maps

In the example above, the darker shades of green show where the majority of people agree.

With enough data, heat maps can make a viewpoint that might seem subjective more concrete. This makes it easier for a business to act on customer sentiment.

There are many uses for these types of charts. In fact, many tech companies use heat map tools to gauge user experience for apps, online tools, and website design.

Another common use for heat map charts  is location assessment. If you're trying to find the right location for your new store, these maps can give you an idea of what the area is like in ways that a visit can't communicate.

Heat maps can also help with spotting patterns, so they're good for analyzing trends that change quickly, like ad conversions. They can also help with:

  • Competitor research.
  • Customer sentiment.
  • Sales outreach.
  • Campaign impact.
  • Customer demographics.

Design Best Practices for Heat Map

  • Use a basic and clear map outline to avoid distracting from the data.
  • Use a single color in varying shades to show changes in data.
  • Avoid using multiple patterns.

11. Gantt Chart

The Gantt chart is a horizontal chart that dates back to 1917. This chart maps the different tasks completed over a period of time.

Gantt charting is one of the most essential tools for project managers. It brings all the completed and uncompleted tasks into one place and tracks the progress of each.

While the left side of the chart displays all the tasks, the right side shows the progress and schedule for each of these tasks.

This chart type allows you to:

  • Break projects into tasks.
  • Track the start and end of the tasks.
  • Set important events, meetings, and announcements.
  • Assign tasks to the team and individuals.

Gantt Chart - product creation strategy

13. Donut Chart

I use donut charts for the same use cases as pie charts, but I tend to prefer the former because the data is easier to read.

Another benefit to donut charts is that the empty center leaves room for extra layers of data, like in the examples above. 

Design Best Practices for Donut Charts 

Use varying colors to better differentiate the data being displayed, just make sure the colors are in the same palette so viewers aren't put off by clashing hues. 

14. Sankey Diagram

A Sankey Diagram visually represents the flow of data between categories, with the link width reflecting the amount of flow. It’s a powerful tool for uncovering the stories hidden in your data.

As data grows more complex, charts must evolve to handle these intricate relationships. Sankey Diagrams excel at this task.

Sankey Diagram

With ChartExpo , you can create a Sankey Chart with up to eight levels, offering multiple perspectives for analyzing your data. Even the most complicated data sets become manageable and easy to interpret.

You can customize your Sankey charts and every component, including nodes, links, stats, text, colors, and more. Because ChartExpo is an add-in for Microsoft Excel, Google Sheets, and Power BI, you can create beautiful Sankey diagrams while keeping your data safe in your favorite tools.

Sankey diagrams can be used to visualize all types of data that contain a flow of information. They beautifully connect the flows and present the data in an optimal way.

Here are a few use cases:

  • Sankey diagrams are widely used to visualize energy production, consumption, and distribution. They help in tracking how energy flows from one source (like oil or gas) to various uses (heating, electricity, transportation).
  • Businesses use Sankey diagrams to trace customer interactions across different channels and touchpoints. It highlights the flow of users through a funnel or process, revealing drop-off points and success paths.
  • In supply chain management, these diagrams show how resources, products, or information flow between suppliers, manufacturers, and retailers, identifying bottlenecks and inefficiencies.

Design Best Practices for Sankey Diagrams 

When utilizing a Sankey diagram, it is essential to maintain simplicity while ensuring accuracy in proportions. Clear labeling and effective color usage are key factors to consider. Emphasizing the logical flow direction and highlighting significant flows will enhance the visualization.

How to Choose the Right Chart or Graph for Your Data

Channels like social media or blogs have multiple data sources, and managing these complex content assets can get overwhelming. What should you be tracking? What matters most?

How do you visualize and analyze the data so you can extract insights and actionable information?

1. Identify your goals for presenting the data.

Before creating any data-based graphics, I ask myself if I want to convince or clarify a point. Am I trying to visualize data that helped me solve a problem? Or am I trying to communicate a change that's happening?

A chart or graph can help compare different values, understand how different parts impact the whole, or analyze trends. Charts and graphs can also be useful for recognizing data that veers away from what you’re used to or help you see relationships between groups.

So, clarify your goals then use them to guide your chart selection.

2. Figure out what data you need to achieve your goal.

Different types of charts and graphs use different kinds of data. Graphs usually represent numerical data, while charts are visual representations of data that may or may not use numbers.

So, while all graphs are a type of chart, not all charts are graphs. If you don't already have the kind of data you need, you might need to spend some time putting your data together before building your chart.

3. Gather your data.

Most businesses collect numerical data regularly, but you may need to put in some extra time to collect the right data for your chart.

Besides quantitative data tools that measure traffic, revenue, and other user data, you might need some qualitative data.

These are some other ways you can gather data for your data visualization:

  • Interviews 
  • Quizzes and surveys
  • Customer reviews
  • Reviewing customer documents and records
  • Community boards

4. Select the right type of graph or chart.

Choosing the wrong visual aid or defaulting to the most common type of data visualization could confuse your viewer or lead to mistaken data interpretation.

But a chart is only useful to you and your business if it communicates your point clearly and effectively.

Ask yourself the questions below to help find the right chart or graph type.

5 Questions to Ask When Deciding Which Type of Chart to Use

1. Do you want to compare values?

Charts and graphs are perfect for comparing one or many value sets, and they can easily show the low and high values in the data sets. To create a comparison chart, use these types of graphs:

  • Scatter plot

2. Do you want to show the composition of something?

Use this type of chart to show how individual parts make up the whole of something, like the device type used for mobile visitors to your website or total sales broken down by sales rep.

To show composition, use these charts:

  • Stacked bar

3. Do you want to understand the distribution of your data?

Distribution charts help you to understand outliers, the normal tendency, and the range of information in your values.

Use these charts to show distribution:

4. Are you interested in analyzing trends in your data set?

If you want more information about how a data set performed during a specific time, there are specific chart types that do extremely well.

You should choose one of the following:

  • Dual-axis line

5. Do you want to better understand the relationship between value sets?

Relationship charts can show how one variable relates to one or many different variables. You could use this to show how something positively affects, has no effect, or negatively affects another variable.

When trying to establish the relationship between things, use these charts:

Written By Sushma_P

Last Modified 22-06-2023

Graphical Representation: Advantages, Types & Examples

Graphical Representation: A graph is a categorised representation of data. It helps us understand the data easily. Data is a collection of numerical figures collected through surveying. The word data came from the Latin word ‘Datum’, which means ‘something given’. After a research question is developed, data is collected through observation. The collected data is then arranged, summarised, classified, and finally represented graphically. This is the concept of graphical representation of data.

In this article, let’s study the different kinds of graphical representation with examples, the types of graphical representation, and the graphical representation of data in statistics.

What Are Graphical Representations?

Graphical representation refers to the use of intuitive charts to clearly visualise and simplify data sets. Data obtained from surveying is fed into data visualisation software. It is then represented by symbols, such as lines on a line graph, bars on a bar chart, or slices of a pie chart. In this way, users can achieve much more clarity and understanding than by numerical study alone.

Advantages of Graphical Representation

Some of the advantages of using graphs are listed below:

  • The graph helps us understand the data or information even when we have no idea about it.
  • It saves time.
  • It makes it easier for us to compare the data for different time periods or different kinds.
  • It is mainly used in statistics to determine the mean, median, and mode of different data, and for the interpolation and extrapolation of data.

Use of Graphical Representations

The main purpose of presenting scientific data in graphs is to convey information efficiently, utilising the power of visual display while avoiding confusion or deception. This is important in communicating our findings to others and in our own understanding and analysis of the data.

Graphical data representation is crucial in understanding and identifying trends and patterns in the ever-increasing data flow. Graphical representation helps in quick analysis of large quantities and can support making predictions and informed decisions.

General Rules for Graphical Representation of Data

The following are a few rules to present the information in the graphical representation:

  • Suitable title:  The title of the graph should be appropriate and indicate the subject of the presentation.
  • Measurement unit:  The measurement unit in the graph should be mentioned.
  • Proper scale:   Choose a proper scale to represent the data accurately.
  • Index:  For better understanding, index the appropriate colours, shades, lines, and designs in the graphs. 
  • Data sources:  Data sources should be mentioned at the bottom of the graph wherever necessary.
  • Keep it simple:  The graph should be constructed in such a way that it is effortlessly understood.
  • Neat:  The correct size, fonts, colours, etc., should be chosen so that the graph serves as a visual aid for presenting the information.

Types of Graphical Representation

1. Line graph
2. Histogram
3. Bar graph
4. Pie chart
5. Frequency polygon
6. Ogives or Cumulative frequency graphs

1. Line Graph

A line graph is a chart used to show information that changes over time. We plot line graphs by connecting several points with straight lines. Another name for it is a line chart. The line graph contains two axes: the \(x-\)axis and the \(y-\)axis.

  • The horizontal axis is the \(x-\)axis.
  • The vertical axis is the \(y-\)axis.

Example: The following graph shows the number of motorbikes sold on different days of the week.

Line Graph

2. Histogram

Continuous data represented on the two-dimensional graph is called a histogram. In the histogram, the bars are placed continuously side by side without a gap between consecutive bars. In other words, rectangles are erected on the class intervals of the distribution. The areas of the rectangles formed by bars are proportional to the frequencies.

Example: Following is an example of a histogram showing the average pass percentage of students.

Histogram

3. Bar Graph

Bar graphs can be of two types: horizontal bar graphs and vertical bar graphs. While a horizontal bar graph is used for qualitative data or data varying over space, the vertical bar graph is associated with quantitative data or time-series data.

Bars are rectangles of varying lengths and equal width, drawn either horizontally or vertically. We use multiple or grouped bar graphs to compare related series. Component or sub-divided bar diagrams are used for representing data divided into several components.

Example: The following graph is an example of a bar graph representing the money spent month-wise.

Bar Graph

4. Pie Chart

The sector of a circle represents various observations or components, and the whole circle represents the sum of the value of all the components. The total central angle of a circle is \({360^{\rm{o}}}\) and is divided according to the values of the components.

The central angle of a component\( = \frac{{{\rm{ value}}\,{\rm{of}}\,{\rm{the}}\,{\rm{component }}}}{{{\rm{total}}\,{\rm{value}}}} \times {360^{\rm{o}}}\)

Sometimes, the value of the components is expressed in percentages. In such cases, The central angle of a component\( = \frac{{{\rm{ percentage}}\,{\rm{value}}\,{\rm{of}}\,{\rm{the}}\,{\rm{component }}}}{{100}} \times {360^{\rm{o}}}\)

Example:  The following figure represents a pie-chart

Pie Chart

5. Frequency Polygon

A frequency polygon is another way of representing frequency distribution graphically. Follow the steps below to make a frequency polygon:

(i) Calculate and obtain the frequency distribution and the mid-points of each class interval.
(ii) Represent the mid-points along the \(x-\)axis and the frequencies along the \(y-\)axis.
(iii) Mark the points corresponding to the frequency at each midpoint.
(iv) Now join these points in straight lines.
(v) To finish the frequency polygon, join the consecutive points at each end (as the case may be at zero frequency) on the \(x-\)axis.

Example: The following graph is the frequency polygon showing the road race results.

Frequency Polygon

6. Ogives or Cumulative Frequency Graphs

By plotting cumulative frequency against the respective class intervals, we obtain ogives. There are two types of ogives: less than type and more than type.

Less than type ogives are obtained by taking the less than cumulative frequency on the vertical axis. We can obtain more than type ogives by plotting the more than type cumulative frequency on the vertical axis and joining the plotted points successively by line segments.

Example: The below graph represents the less than and more than ogives for the entrance examination scores of \(60\) students.

Ogives or Cumulative Frequency Graphs

Solved Examples – Basic Graphical Representation

Q.1. The wildlife population in the following years, \(2013, 2014, 2015, 2016, 2017, 2018,\) and \(2019\) were \(300, 200, 400, 600, 500, 400\) and \(500,\) respectively. Represent these data using a line graph. Ans: We can represent the population for seven consecutive years by drawing a line diagram as given below. Let us consider years on the horizontal axis and population on the vertical axis.

For the year \(2013,\) the population was \(300.\) It can be written as a point \((2013, 300)\) Similarly, we can write the points for the succeeding years as follows: \((2014, 200), (2015, 400), (2016, 600), (2017, 500), (2018, 400)\) and \((2019, 500)\)

We can obtain the line graph by plotting all these points and joining them using a ruler. The following line diagram shows the population of wildlife from \(2013\) to \(2019.\)

 Basic Graphical Representation

Q.2. Draw a histogram for the following data that represents the marks scored by \(120\) students in an examination:

Marks: \(0-20\), \(20-40\), \(40-60\), \(60-80\), \(80-100\)
Number of students: \(5\), \(10\), \(40\), \(45\), \(20\)

Ans: The class intervals are of an equal length of \(20\) marks. Let us indicate the class intervals along the \(x-\)axis and the number of students along the \(y-\)axis, with the appropriate scale. The histogram is given below.

 Basic Graphical Representation

Q.3. The total number of scoops of vanilla ice cream in the different months of a year is given below:

Month: March, April, May, June, July
Scoops sold: \(240\), \(400\), \(440\), \(320\), \(200\)

For the above data, draw a bar graph. Ans: The following graph represents the number of vanilla ice cream scoops sold from March to July. The month is indicated along the \(x-\)axis, and the number of scoops sold is represented along the \(y-\)axis.

 Basic Graphical Representation

Q.4. The number of hours spent by a working woman on various activities on a working day is given below. Using the angle measurement, draw a pie chart.

Activity: Household, Sleep, Cooking, Office, TV, Other
Hours: \(3\), \(7\), \(2\), \(9\), \(1\), \(2\)

Ans: The central angle of a component\( = \frac{{{\rm{ value}}\,{\rm{of}}\,{\rm{the}}\,{\rm{component }}}}{{{\rm{total}}\,{\rm{value}}}} \times {360^{\rm{o}}}\). We may calculate the central angles for various components as follow:

Household: \(3\) hours; central angle \(\frac{3}{{24}} \times {360^{\rm{o}}} = {45^{\rm{o}}}\)
Sleep: \(7\) hours; central angle \(\frac{7}{{24}} \times {360^{\rm{o}}} = {105^{\rm{o}}}\)
Cooking: \(2\) hours; central angle \(\frac{2}{{24}} \times {360^{\rm{o}}} = {30^{\rm{o}}}\)
Office: \(9\) hours; central angle \(\frac{9}{{24}} \times {360^{\rm{o}}} = {135^{\rm{o}}}\)
TV: \(1\) hour; central angle \(\frac{1}{{24}} \times {360^{\rm{o}}} = {15^{\rm{o}}}\)
Other: \(2\) hours; central angle \(\frac{2}{{24}} \times {360^{\rm{o}}} = {30^{\rm{o}}}\)
Total: \(24\) hours; \({360^{\rm{o}}}\)

By knowing the central angle, a pie chart is drawn,

 Basic Graphical Representation

Q.5. Draw a frequency polygon for the following data using a histogram.

Class interval: \(140-145\), \(145-150\), \(150-155\), \(155-160\), \(160-165\), \(165-170\), \(170-175\)
Frequency: \(35\), \(40\), \(55\), \(50\), \(40\), \(35\), \(20\)

Ans: To draw a frequency polygon, we take the imaginary classes \(135-140\) at the beginning and \(175-180\) at the end, each with frequency zero. The following is the frequency table tabulated for the given data:

Class interval \(140-145\): midpoint \(142.5\), frequency \(35\)
Class interval \(145-150\): midpoint \(147.5\), frequency \(40\)
Class interval \(150-155\): midpoint \(152.5\), frequency \(55\)
Class interval \(155-160\): midpoint \(157.5\), frequency \(50\)
Class interval \(160-165\): midpoint \(162.5\), frequency \(40\)
Class interval \(165-170\): midpoint \(167.5\), frequency \(35\)
Class interval \(170-175\): midpoint \(172.5\), frequency \(20\)

Let’s mark the class intervals along the \(x-\)axis and the frequency along the \(y-\)axis.

 Basic Graphical Representation

Using the above table, plot the points on the histogram: \((137.5, 0), (142.5, 35), (147.5, 40), (152.5, 55), (157.5, 50), (162.5, 40),\) \((167.5, 35), (172.5, 20)\) and \((177.5, 0).\)

We join these points one after the other to obtain the required frequency polygon.

In this article, we have studied the details of the graphical representation of data. We learnt the meaning, uses, and advantages of using graphs. Then we studied the different types of graphs with examples. Lastly, we solved examples to help students understand the concept in a better way.

Frequently Asked Questions (FAQs) on Basic Graphical Representation

Q.1: What are graphical representations? Ans: Graphical representations use charts or graphs to represent given data visually, making the information easier to analyse and interpret.

Q.2: What are the 6 types of graphs used? Ans: The following are the types of graphs we use commonly:
1. Line graph
2. Histogram
3. Bar graph
4. Pie chart
5. Frequency polygon
6. Ogives or cumulative frequency graphs

Q.3: What are the advantages of the graphical method? Ans: The advantages of using a graphical method are:
1. Facilitates improved learning
2. Better understanding of content
3. Flexibility of use
4. Increases structured thinking
5. Supports creative, personalised reports for more engaging and stimulating visual presentations
6. Better communication
7. It shows the whole picture

Q.4: What is the graphical representation of an idea? Ans: Graphical representations exhibit relationships between ideas, data, information, and concepts in a visual graph or map. Graphical representations are easy to understand.

Q.5: How do you draw a frequency polygon? Ans: First, obtain the frequency distribution and find the midpoints of each class interval. Mark the midpoints along the \(x-\)axis and the frequencies along the \(y-\)axis. Plot the points corresponding to the frequency. Join the points using line segments in order.


6 Data Visualization Examples To Inspire Your Own


Data informs virtually every business decision an organization makes. Because of this, it’s become increasingly important for professionals of all backgrounds to be adept at working with data.

While data can provide immense value, it’s important that professionals are able to effectively communicate the significance of the data to stakeholders. This is where data visualization comes into play. By transforming raw data into engaging visuals using various data visualization tools, it’s much easier to communicate the insights gleaned from it.

Here are six real-world examples of data visualization that you can use to inspire your own.

What Is Data Visualization?

Data visualization is the process of turning raw data into graphical representations.

Visualizations make it easy to communicate trends in data and draw conclusions. When presented with a graph or chart, stakeholders can easily visualize the story the data is telling, rather than try to glean insights from raw data.

There are countless data visualization techniques , including:

  • Scatter plots

The technique you use will vary based on the type of data you’re handling and what you’re trying to communicate.

6 Real-World Data Visualization Examples

1. The Most Common Jobs by State

NPR Job Visualization

Source: NPR

National Public Radio (NPR) produced a color-coded, interactive display of the most common jobs in each state in each year from 1978 to 2014. By dragging the scroll bar at the bottom of the map, you’re able to visualize occupational changes over time.

If you’re trying to represent geographical data, a map is the best way to go.

2. COVID-19 Hospitalization Rates

CDC COVID-19 Visualization

Source: CDC

Throughout the COVID-19 pandemic, the Centers for Disease Control and Prevention (CDC) has been transforming raw data into easily digestible visuals. This line graph represents COVID-19 hospitalization rates from March through November 2020.

The CDC tactfully incorporated color to place further emphasis on the stark increase in hospitalization rates, using a darker shade for lower values and a lighter shade for higher values.

3. Forecasted Revenue of Amazon.com

Statista Data Visualization

Source: Statista

Data visualizations aren’t limited to historical data. This bar chart created by Statista visualizes the forecasted gross revenue of Amazon.com from 2018 to 2025.

This visualization uses a creative title to summarize the main message of the data, as well as a darker orange color to make the most important data point stand out.

4. Web-Related Statistics

Internet Live Stats Visualization

Source: Internet Live Stats

Internet Live Stats has tracked web-related statistics and pioneered methods for visualizing data to show how different digital properties have ebbed and flowed over time.

Simple infographics like this one are particularly effective when your goal is to communicate key statistics rather than to visualize trends or forecasts.

5. Most Popular Food Delivery Items

Eater Food Delivery Visualization

Source: Eater

Eater, Vox’s food and dining brand, created this fun take on a “pie” chart, which shows the most common foods ordered for delivery in each US state.

To visualize this data, Eater used a specific type of pie chart known as a spie chart. Spie charts are essentially pie charts in which you can vary the height of each segment to further visualize differences in data.
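For the curious, a spie-like chart can be approximated with a polar bar plot, varying each wedge’s angular width and radial height independently. The sketch below is our own illustration with made-up categories and numbers, not Eater’s actual chart.

```python
import numpy as np
import matplotlib.pyplot as plt

labels = ["Pizza", "Sushi", "Tacos", "Wings"]   # hypothetical categories
shares = np.array([0.4, 0.3, 0.2, 0.1])         # wedge widths (sum to 1)
heights = np.array([1.0, 1.5, 0.8, 1.2])        # wedge heights (second variable)

# Cumulative angles give each wedge its start position and width
angles = 2 * np.pi * np.concatenate(([0.0], np.cumsum(shares)))
widths = np.diff(angles)

ax = plt.subplot(projection="polar")
ax.bar(angles[:-1], heights, width=widths, align="edge", edgecolor="white")
ax.set_xticks(angles[:-1] + widths / 2)
ax.set_xticklabels(labels)
ax.set_yticklabels([])
plt.show()
```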

6. Netflix Viewing Patterns

Vox Netflix Visualization

Source: Vox

Vox created this interesting visualization depicting the viewing patterns of Netflix users over time by device type. This Sankey diagram visualizes the tendency of users to switch to streaming via larger device types.


Visualizing Data to Make Business Decisions

The insights and conclusions drawn from data visualizations can guide the decision-making and strategic planning processes for your organization.

To ensure your visualizations are relevant, accurate, and ethical, familiarize yourself with basic data science concepts . With a foundational knowledge in data science, you can maintain confidence in your data and better understand its significance. An online analytics course can help you get started.

Are you interested in improving your data science and analytical skills? Download our Beginner’s Guide to Data & Analytics to learn how you can leverage the power of data for professional and organizational success.

This post was updated on February 26, 2021. It was originally published on January 12, 2017.

Med-MGF: multi-level graph-based framework for handling medical data imbalance and representation

  • Open access
  • Published: 02 September 2024
  • Volume 24, article number 242 (2024)


  • Tuong Minh Nguyen 1 ,
  • Kim Leng Poh 1 ,
  • Shu-Ling Chong 2 , 3 &
  • Jan Hau Lee 3 , 4  

Modeling patient data, particularly electronic health records (EHR), is one of the major focuses of machine learning studies in healthcare, as these records provide clinicians with valuable information that can potentially assist them in disease diagnosis and decision-making.

In this study, we present a multi-level graph-based framework called MedMGF, which models both patient medical profiles extracted from EHR data and their relationship network of health profiles in a single architecture. The medical profiles consist of several layers of data embedding derived from interval records obtained during hospitalization, and the patient-patient network is created by measuring the similarities between these profiles. We also propose a modification to the Focal Loss (FL) function to improve classification performance on imbalanced datasets without the need to impute the data. MedMGF’s performance was evaluated against several Graph Convolutional Network (GCN) baseline models implemented with Binary Cross Entropy (BCE), FL, the class balancing parameter \(\alpha\) , and the Synthetic Minority Oversampling Technique (SMOTE).

Our proposed framework achieved high classification performance (AUC: 0.8098, ACC: 0.7503, SEN: 0.8750, SPE: 0.7445, NPV: 0.9923, PPV: 0.1367) on an extremely imbalanced pediatric sepsis dataset (n=3,014, imbalance ratio of 0.047). It yielded a classification improvement of 3.81% in AUC and 15% in SEN compared to the baseline GCN+\(\alpha\)FL (AUC: 0.7717, ACC: 0.8144, SEN: 0.7250, SPE: 0.8185, PPV: 0.1559, NPV: 0.9847), and an improvement of 5.88% in AUC and 22.5% in SEN compared to GCN+FL+SMOTE (AUC: 0.7510, ACC: 0.8431, SEN: 0.6500, SPE: 0.8520, PPV: 0.1688, NPV: 0.9814). It also showed a classification improvement of 3.86% in AUC and 15% in SEN compared to the baseline GCN+\(\alpha\)BCE (AUC: 0.7712, ACC: 0.8133, SEN: 0.7250, SPE: 0.8173, PPV: 0.1551, NPV: 0.9847), and an improvement of 14.33% in AUC and 27.5% in SEN in comparison to GCN+BCE+SMOTE (AUC: 0.6665, ACC: 0.7271, SEN: 0.6000, SPE: 0.7329, PPV: 0.0941, NPV: 0.9754).

When compared to all baseline models, MedMGF achieved the highest SEN and AUC results, demonstrating the potential for several healthcare applications.


Introduction

Making an accurate medical diagnosis for a patient requires consideration of several aspects of clinical information and evidence. This includes reviewing the patient’s medical history, performing physical examinations, ordering tests, interpreting test results, and consulting with other professionals if necessary. The data collected during this process are mainly stored as tabular Electronic health records (EHR) (e.g., vital signs, laboratory results), high-frequency physiologic waveforms (e.g., electrocardiogram), imaging (e.g., radiograph), or other forms of medical data. Using these data, clinicians are able to monitor the patient’s disease progression and make informed treatment decisions. As it contains a large volume of rich clinical information, EHR can potentially be used to support clinical research as well [ 1 , 2 ]. The use of EHR as a data source for Machine learning (ML) studies has increased significantly over the past few years, and modeling EHR data has been one of the major focuses of ML applications in the healthcare sector [ 3 , 4 ].

The concept of Patient similarity network (PSN) is an emerging research field within the context of precision medicine [ 5 , 6 ]. The diagnosis made using this network is based on the premise that if patients’ medical data are similar in several aspects, then their clinical progress should be similar as well. It is hypothesized that a common disease trajectory resulting in a specific outcome may establish a similarity between patients, thereby making the insight gained using PSN more reliable and robust [ 7 ]. Recent advances in ML techniques have led to the development of a variety of methods to construct PSNs. The International classification of diseases (ICD) is often utilized to establish connections between patients [ 8 , 9 ]. In some instances, medical inputs are converted into feature vectors, and the distance between these vectors determines the degree of similarity between them [ 7 , 10 ]. Studies usually treat the medical inputs as a flat structure or embed them within several layers of neural networks without preserving their structure or interpretation. The latter often requires a separate training process to create the medical embeddings before they are introduced into the PSN for further training, which could result in an increase in training costs.

In this work, we propose the Medical Multilevel Graph-based Framework (MedMGF), a framework that is capable of modeling medical data, as well as representing the patient’s individual medical profile and their similarity to other patients, within a single architecture. Depending on data availability, the medical profile can be constructed from EHR, physiologic waveforms, imaging data, or a combination thereof. In this study, we demonstrate the feasibility of the framework using EHR data. In contrast to most studies, which treat EHR as a flat structure, we preserve its natural hierarchical structure and provide an intuitive way to describe it by incorporating interval data from multiple hospitalizations. A multi-level embedding process allows the medical inputs to pass directly through the PSN, where the embedding and the PSN are optimized through a single training procedure. We also propose a modification of the Focal Loss (FL, [ 11 ]) function to improve classification performance on imbalanced datasets without having to impute the data, thus reducing the amount of preprocessing needed. In general, MedMGF encapsulates the following characteristics: (1) generality and modality, (2) multi-purpose, (3) intuitive interpretation, and (4) minimal data requirements.

In this study, our objective is to present the framework architecture and demonstrate the feasibility of MedMGF on an imbalanced pediatric sepsis EHR dataset, evaluating its classification performance against several Graph Convolutional Network (GCN, [ 12 ]) baselines implemented with Binary Cross Entropy (BCE, [ 13 ]), FL, the class balancing parameter \(\alpha\) , and the Synthetic Minority Over-sampling Technique (SMOTE, [ 14 ]).

Related works

Electronic health record modeling.

The use of ML in modelling EHR has become more prevalent as EHR contains rich clinical information that can potentially assist clinicians in making diagnosis and treatment decisions. Although most studies model the EHR in a flat manner [ 15 , 16 ], exploring its structural aspects may reveal new possibilities for enhancing the model. In particular, Choi et al. developed Multi-layer Representation Learning for Medical concepts (Med2Vec, [ 17 ]), and continued to explore this approach with the Graph-based Attention Model (GRAM, [ 18 ]) and Multi-level Medical Embedding (MiME, [ 19 ]). By leveraging the parent-child relationship on the knowledge-based directed graph, GRAM can learn the representation of medical concepts (e.g., ICD codes) with attention mechanisms and predict the next hospital visit’s diagnosis code. Based on GRAM, Li et al. developed the Multimodal Diagnosis Prediction model (MDP, [ 20 ]), which allows clinical data to be integrated into the framework. Although clinical data from EHR can be weighed dynamically to highlight the most important features, the data is still processed in a flat manner. With MiME, Choi et al. constructed a hierarchical structure of EHR data based on the relationship between symptoms and treatments, where a hospital visit consists of a number of symptoms, each corresponding to a number of specific treatments. This influential interaction is encapsulated in the patient’s data embedding representation, which is used for prediction purposes. The precision-recall area under the curve (PR-AUC) for heart failure prediction showed a 15% improvement compared to the baseline model. As most EHR datasets lack the connection between symptoms and treatments, EHR data may need to undergo a rigorous pre-processing procedure before being mapped to MiME. In addition, the current MiME structure may not capture aspects of EHR data other than the relationship between symptoms and treatments. In light of these two drawbacks, we propose MedMGF, a framework for modeling EHR data that can capture all aspects of EHR efficiently and effectively with minimal data preprocessing required.

Patient similarity network

There are several approaches to constructing a PSN using ICD codes [ 8 , 9 ]. One approach is to create a bipartite graph connecting patients to their corresponding ICD codes, in a similar manner to what Lu and Uddin did in 2021. This bipartite graph is then converted into a weighted PSN, in which the weight of an edge is determined by the number of mutual ICD codes between the patients [ 9 ]. In this approach, the number of mutual ICD codes used to connect patients is highly dependent upon cohort and ICD code selection. In Rouge et al.’s study, an inverse document frequency weighted vector of 674 ICD-10 codes was constructed for each patient. A cosine similarity between these vectors was calculated for all possible pairs of patients. The PSN was then constructed using a pre-defined threshold on the calculated distances [ 8 ]. As the number of patients increases, it becomes more difficult to process the large ICD matrix computationally. In other cases, the medical input is mapped into feature vectors, and distance metrics (e.g. Euclidean, Cosine, Manhattan) are applied to determine the degree of similarity [ 8 , 10 ]. In the work of Navaz et al., two similarity matrices were calculated separately for static data (e.g. age) and dynamic data (e.g. vital signs). These matrices were then fused together to construct the PSN [ 7 ].

Focal loss function

The FL function was first introduced by Lin et al. in 2018 [ 11 ]. On training data \(\mathcal {D}=\{(\textbf{x}_i,y_i)\}^N_{i=1}\) drawn independently from an i.i.d. probability distribution, the FL for a binary classification problem is defined as follows:

\(FL(p_t) = -(1-p_t)^{\gamma } \log (p_t)\)

where \(p_t\) is the predicted probability of the true class (\(p_t = p\) if \(y=1\) and \(p_t = 1-p\) otherwise, with p the model’s predicted probability) and \(\gamma \ge 0\) is a user-defined hyperparameter to control the rate at which easy samples are down-weighted. It can be observed that FL reduces to the Cross Entropy (CE) when \(\gamma = 0\) . FL introduces a modulating factor \((1-p_t)^\gamma\) to the CE to dynamically adjust the loss based on the difficulty of each sample. This factor is higher for misclassified samples and lower for well-classified samples. Thus, FL reduces the impact of the dominant class by focusing on difficult samples. Researchers typically perform cross-validation to find the optimal value of \(\gamma\) [ 11 , 21 ]. In a strategic policy proposed by Mukhoti et al., a higher value of \(\gamma\) is allocated to predicted probabilities less than a pre-calculated threshold and a lower value of \(\gamma\) to probabilities greater than the threshold [ 22 ]. The results of their work showed that a dynamic value of \(\gamma\) could improve FL calibration. In another work, Ghosh et al. proposed to dynamically adjust \(\gamma\) based on its value from the previous steps [ 23 ]. Either way, the classification performance is strongly influenced by and dependent on the value of \(\gamma\) . Considering this dependence, we propose a modification that allows us to dynamically adjust the modulating factor in a similar manner without relying on the hyperparameter \(\gamma\) .
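As a concrete reference, here is a minimal PyTorch sketch of the \(\alpha\)-balanced binary focal loss defined above (our own illustration; the paper does not provide implementation code).

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss: alpha_t * (1 - p_t)^gamma * (-log p_t)."""
    t = targets.float()
    p = torch.sigmoid(logits)
    p_t = p * t + (1 - p) * (1 - t)              # probability of the true class
    alpha_t = alpha * t + (1 - alpha) * (1 - t)  # class-balancing weight
    ce = F.binary_cross_entropy_with_logits(logits, t, reduction="none")  # -log(p_t)
    return (alpha_t * (1 - p_t).pow(gamma) * ce).mean()
```

With \(\gamma = 0\) and \(\alpha_t = 1\) this reduces to plain BCE, matching the observation above.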

The framework consists of three main components: the patient’s medical profile, which represents the health data extracted from the EHR data; the patient-patient network, which represents the similarity among the patients based on their profiles; and the modified FL function. An individual’s medical profile is constructed based on a hierarchical representation that embeds several layers of information derived from interval data collected during hospitalizations and medical modules. In this study, we present the medical representation for EHR data. The overall framework is illustrated in Fig. 1 . The notations for the patient’s medical profile and the patient-patient network are listed in Tables 1 and 2 .

Fig. 1: The MedMGF framework consists of several layers of medical data embedding

Patient’s medical profile

Suppose that an individual’s medical profile for a specific disease contains health data from a sequence of hospitalizations \(\left( \mathcal {V}^{(1)},\mathcal {V}^{(2)},...,\mathcal {V}^{(i)},...,\mathcal {V}^{(T)}\right)\) , where, at each hospitalization \(\mathcal {V}^{(i)}\) , a sequence of medical data \(\left( \mathcal {D}^{(1)},\mathcal {D}^{(2)},...,\mathcal {D}^{(j)},...,\mathcal {D}^{(t)}\right)\) is entered at time intervals \(\left( \Delta _1,\Delta _2,...,\Delta _j,...,\Delta _t\right)\) , with \(\Delta _j\) being the time interval between \(\mathcal {D}^{(j)}\) and \(\mathcal {D}^{(j-1)}\) . Medical data \(\mathcal {D}^{(j)}\) collected at the j -th interval includes the medical module from EHR data \(\mathcal {S}_{E}^{(j)}\) , imaging data \(\mathcal {S}_{I}^{(j)}\) , signal data \(\mathcal {S}_{S}^{(j)}\) , or a combination thereof; then \(\mathcal {D}^{(j)} = \oplus \left( \mathcal {S}_{E}^{(j)},\mathcal {S}_{I}^{(j)},\mathcal {S}_{S}^{(j)},...\right)\) , where \(\oplus (.)\) represents the CONCAT data aggregation function. Let \(\textbf{d}^{(j)}\) be the vector representation of \(\mathcal {D}^{(j)}\) at the j -th interval, \(\textbf{v}^{(i)}\) be a vector representation of the i -th hospitalization \(\mathcal {V}^{(i)}\) , and \(\textbf{s}_{E}^{(j)},\textbf{s}_{I}^{(j)},\textbf{s}_{S}^{(j)}\) be the vector representations of \(\mathcal {S}_{E}^{(j)},\mathcal {S}_{I}^{(j)},\mathcal {S}_{S}^{(j)}\) ; then \(\textbf{d}^{(j)} = \oplus \left( \textbf{s}_{E}^{(j)},\textbf{s}_{I}^{(j)},\textbf{s}_{S}^{(j)},...\right)\) and \(\textbf{v}^{(i)} = \oplus \left( \textbf{d}^{(1)}, \textbf{d}^{(2)},...,\textbf{d}^{(j)},...,\textbf{d}^{(t)}\right) \in \mathbb {R}^{t \times z}\) , where z represents the number of the medical modules. We define \(\textbf{h}\) to be the vector representation of a patient’s medical profile; then \(\textbf{h}= \oplus \left( \textbf{v}^{(1)}, \textbf{v}^{(2)},..., \textbf{v}^{(i)},...,\textbf{v}^{(T)} \right)\) .

The interval sequence \(\left( \Delta _1,\Delta _2,...,\Delta _t\right)\) represents the irregular periodicity of the hospital data, where \(\Delta _i\) can vary to match the requirement of the desired analysis. For this study, we fix \(\Delta _1 = \Delta _2=...= \Delta _t = \Delta\) so that the medical data will be extracted at a fixed interval \(\Delta\) . Different variables are collected at different intervals, resulting in three possible scenarios: no value is recorded, one value is recorded, or multiple values are recorded. We extract the variable values of an interval as follows: if no data are available for the j -th interval, the value from the previous interval is carried forward. If more than one value is recorded in the interval, the worst value is taken (Fig. 2 ).

Fig. 2: Data extraction rule at an interval: when no value is available during the interval, the value from the previous interval is carried forward. The worst value is selected if more than one value is available in the interval
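A pandas sketch of this extraction rule, with hypothetical column names and a fixed two-hour interval (both our assumptions for illustration), might look as follows.

```python
import pandas as pd

# Hypothetical long-format records: one row per (time, variable, value)
records = pd.DataFrame({
    "hours_since_admission": [1, 2, 2, 7],
    "variable": ["heart_rate"] * 4,
    "value": [120, 135, 128, 110],
})

DELTA = 2  # fixed interval width in hours (an assumption for illustration)
records["interval"] = records["hours_since_admission"] // DELTA

# "Worst" value per interval -- max is a stand-in here; the clinically
# worst direction depends on the variable
per_interval = records.groupby(["variable", "interval"])["value"].max()

# Reindex over all intervals and carry the previous value forward where empty
all_intervals = range(int(records["interval"].max()) + 1)
series = per_interval.loc["heart_rate"].reindex(all_intervals).ffill()
print(series)  # interval 2 has no record, so it carries interval 1's value forward
```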

Patient-patient medical profile network

The patient’s medical profile network is defined as a graphical network \(\mathcal {G} = \left( V,E,X,A\right)\) with \(|V| = N\) nodes and \(|E|\) edges, where nodes represent patients and the edge weights represent the degree of similarity between them. The node feature matrix \(X =(\textbf{x}_1, \textbf{x}_2, \textbf{x}_3,...,\textbf{x}_N) \in \mathbb {R}^{N \times T}\) contains the feature vectors of all nodes. A single row \(\textbf{x}_i\) of the node matrix X is a representation of a patient’s medical profile from T hospitalizations, as described in the “ Patient’s medical profile ” section. Hence, \(\textbf{x}_i = \textbf{h}_i = \left\{ \textbf{v}^{(1)}, \textbf{v}^{(2)},..., \textbf{v}^{(i)},...,\textbf{v}^{(T)} \right\}\) . In order to determine the similarity between patients, we measure the similarity between their medical profiles. Since the medical profile is represented as a data vector, we can measure the similarity between patients’ medical profiles by calculating the Euclidean distance between them. Let \(u,v \in V\) be the two nodes representing patients u and v on \(\mathcal {G}\) ; the similarity distance \(d(u,v)\) is then defined as follows:

\(d(u,v) = \left\Vert \textbf{x}_u - \textbf{x}_v\right\Vert _2 = \sqrt{\sum _{k=1}^{T}\left( x_{u,k} - x_{v,k}\right) ^2} \qquad (3)\)

Using Eq. 3 , a Euclidean distance matrix can be constructed for \(\mathcal {G}\) . This distance matrix allows us to construct the patient-patient medical profile network \(\mathcal {G}\) . If we assume that no two patients’ profiles are absolutely identical, then \(\mathcal {G}\) will be a complete network. Patients with similar profiles will stay close to each other, forming several clusters in the network representation. As connections between very different profiles may produce noisy data for classification, we define a similarity threshold \(\xi\) to control the number of connections on \(\mathcal {G}\) .

The connection between nodes \(u, v\) is represented by \((u,v)\in E\) , and \((u,v) = \mathbbm{1}\{d(u,v) \le \xi : u,v \in V\}\) . The adjacency matrix is then expressed as \(A \in \mathbb {R}^{N\times N}\) , \(A_{uv} = \mathbbm{1}\{(u,v)=1: (u,v)\in E, u,v \in V\}\) . The construction of the patient’s medical profile network consists of the following steps:

1. Calculate the Euclidean distance matrix using the node feature matrix \(X\) and Eq. 3 .

2. Set a threshold \(\xi\) on the similarity matrix.

3. Using the thresholded similarity matrix, construct the adjacency matrix \(A\) and the network \(\mathcal {G}\) .
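The three steps above translate almost directly into NumPy; the sketch below is a minimal illustration (the function name and toy data are ours).

```python
import numpy as np

def build_patient_network(X, xi):
    """Build the adjacency matrix A from profile vectors X (N x T).

    Connects two patients when the Euclidean distance between their
    medical-profile vectors is at most the threshold xi.
    """
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))  # step 1: pairwise Euclidean distances
    A = (D <= xi).astype(int)              # steps 2-3: threshold into adjacency
    np.fill_diagonal(A, 0)                 # no self-loops
    return A

# Toy usage: 5 patients with 8-dimensional profiles
A = build_patient_network(np.random.rand(5, 8), xi=1.0)
print(A)
```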

Tree-structure representation of EHR

EHRs are often formatted similarly to relational databases, where variables are categorized by their interpretation into tables, such as demographic information, vital signs, and laboratory results. By leveraging this relationship, the EHR can be easily represented as a tree structure, where a table i is mapped to an object denoted as \(\textbf{o}_i^{(t)}\) and variable j recorded under the table is denoted as \(\textbf{o}_{ij}^{(t)}\) (Fig. 3 ). In this section, the t -th interval superscript will be dropped to simplify the notation. i and j will be used as the notation for the nodes representing the corresponding objects. The tree-based representation of EHR data is defined as \(\mathcal {T} = (\mathcal {P},\mathcal {C},\mathcal {A})\) , where \(\mathcal {P}\) is a set of parent nodes and \(\mathcal {C}\) is a set of child nodes. Let \(i \in \mathcal {P}\) and \(j \in \mathcal {C}\) ; then the connection between parent and child is represented by \((i,j)\in \mathcal {A}\) . The adjacency matrix is expressed as \(\mathcal {A} \in \mathbb {R}^{|\mathcal {P}|\times |\mathcal {C}|}\) , with \(\mathcal {A}_{ij} = 1\) if there is a connection between them and \(\mathcal {A}_{ij} = 0\) otherwise. An empty root node \(R_{\mathcal {T}}\) is added to \(\mathcal {P}\) to receive the final data embedding, and its connections to the existing parent nodes are added to \(\mathcal {A}\) . The data embedding in the tree structure is carried from child node to parent node recursively from the bottom to the root node. The notation summary is listed in Table 3 . The data embedding at any parent node is as follows:

\(\textbf{o}_i = \sum _{j \in \mathcal {C}(i)} \textbf{W}_i \textbf{o}_{ij} \qquad (4)\)

In Eq. 4 , the data of child nodes \(\textbf{o}_{ij}\) are transformed by multiplying with the weight matrix \(\textbf{W}_i \in \mathbb {R}^{j}\) and then summed together to obtain the embedding of the object group \(\textbf{o}_i\) . At the root node, the data is aggregated with a CONCAT function. Hence, the data embedding vector at the root node \(\textbf{s}_{E} \in \mathbb {R}^{|\mathcal {C(R_\mathcal {T})}|}\) will have the dimension of the number of its child nodes (Eq. 5).

\(\textbf{s}_{E} = \oplus \left( \textbf{o}_1, \textbf{o}_2, \ldots , \textbf{o}_{|\mathcal {C}(R_{\mathcal {T}})|}\right) \qquad (5)\)

Fig. 3: Tree-structure representation of the EHR data
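To make the bottom-up embedding concrete, here is a small PyTorch sketch of a two-level EHR tree (tables of variables under a root), under our own simplifying assumptions: each table’s child variables are combined by a learnable weight vector (Eq. 4), and the per-table embeddings are concatenated at the root (Eq. 5).

```python
import torch
import torch.nn as nn

class EHRTreeEmbedding(nn.Module):
    """Bottom-up embedding of a two-level EHR tree (illustrative sketch)."""
    def __init__(self, table_sizes):
        super().__init__()
        # One weight vector per table, collapsing its variables to a scalar
        self.weights = nn.ModuleList(
            [nn.Linear(n_vars, 1, bias=False) for n_vars in table_sizes]
        )

    def forward(self, tables):
        # tables[i]: tensor of shape (batch, n_vars_i) for table i
        embedded = [w(t) for w, t in zip(self.weights, tables)]  # Eq. 4
        return torch.cat(embedded, dim=-1)                       # Eq. 5: root s_E

# Toy usage: a vitals table with 5 variables and a labs table with 15
model = EHRTreeEmbedding([5, 15])
s_E = model([torch.randn(2, 5), torch.randn(2, 15)])
print(s_E.shape)  # torch.Size([2, 2]) -- one dimension per table
```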

Proposed modification of loss function

Given the training data \(D = \{\textbf{x}_i,y_i\}^N_{i=1}\) , where \(\textbf{x}_i \in \mathbb {R}^d\) is the feature vector of d dimensions and \(y_i \in \{0,1\}\) is the label of the i -th sample, \(\textbf{x}_i\) is extracted from EHR data and is used to construct the patient’s medical profile and patient-patient network as described in the previous sections. Let \(p_t\) be the predicted probability of the patient at node i being in the positive class and \(\alpha _t\) be a balancing parameter for the imbalanced classes; we propose a modification of the FL function for binary classification as follows:

\(eFL(p_t) = -\alpha _t \left( 1 - e^{p_t}\right) ^{-1} \log (p_t) \qquad (6)\)

We propose to use \((1- e^{p_t})^{-1}\) instead of the original factor \((1-p_t)^{\gamma }\) to control the sample weight. Figure 4 shows the weight distribution that the modulating term assigns to different predicted probabilities. The proposed modulating factor imposes a more severe penalty for a predicted probability that is further away from the actual probability, as compared to the original modulating factor. In this way, it strongly draws the attention of the loss function during the learning process to the wrongly predicted samples, emphasizing the punishment for predicted probabilities that are close to zero. The advantage of this approach over the original FL is that the sample weight can be dynamically adjusted without being dependent on \(\gamma\) , thereby eliminating the need to tune a hyperparameter. A large penalty assigned to a sample that is greatly mispredicted is the driving force behind the improved classification (Fig. 4 ).

Fig. 4: Visualization of the sample weights assigned by the original FL and the proposed modified FL function
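A PyTorch sketch of the modified loss is given below. Note the sign convention is our assumption: we use the magnitude \((e^{p_t}-1)^{-1}\) of the stated factor so that the loss stays positive and penalizes confident mistakes, as described above.

```python
import torch

def e_focal_loss(logits, targets, alpha=0.047):
    """Sketch of the proposed eFL; the exact formulation may differ from the paper."""
    t = targets.float()
    p = torch.sigmoid(logits)
    p_t = p * t + (1 - p) * (1 - t)              # probability of the true class
    alpha_t = alpha * t + (1 - alpha) * (1 - t)  # class-balancing weight
    modulator = 1.0 / (torch.exp(p_t) - 1).clamp_min(1e-6)  # grows as p_t -> 0
    ce = -torch.log(p_t.clamp_min(1e-6))
    return (alpha_t * modulator * ce).mean()
```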

Multi-level data embedding & model learning

The data is embedded in a bottom-up manner, folding in several layers of information: medical module embedding, interval data embedding, and hospitalization embedding. A patient’s medical profile is encoded through the following embedding sequence:

\(\textbf{d}^{(j)} = \oplus \left( \textbf{s}_{E}^{(j)},\textbf{s}_{I}^{(j)},\textbf{s}_{S}^{(j)},\ldots \right) \qquad (7)\)

\(\textbf{v}^{(i)} = \oplus \left( \textbf{d}^{(1)}, \textbf{d}^{(2)},\ldots ,\textbf{d}^{(t)}\right) \qquad (8)\)

\(\textbf{h} = \oplus \left( \textbf{v}^{(1)}, \textbf{v}^{(2)},\ldots ,\textbf{v}^{(T)}\right) \qquad (9)\)

In Eqs. 7 , 8 , and 9 , \(\oplus (.)\) represents a CONCAT aggregation function. The embedding \(\textbf{h}\) is then used to construct the patient-patient network \(\mathcal {G}\) described in the “ Patient-patient medical profile network ” section. Let n be a node on \(\mathcal {G}\) and \(\mathcal {N}(n)\) be the neighbors of n on the network; then the final embedding of node n on \(\mathcal {G}\) is encoded as follows:

\(\textbf{h}_n = \sigma \left( \textbf{W} \sum _{u \in \mathcal {N}(n)} \textbf{h}_u \right) \qquad (10)\)

where \(\textbf{W}\) is a trainable weight matrix to transform the embedding of the neighbor nodes, \(\sigma\) is a softmax activation function. The learning loss is measured by the proposed loss function as described in “ Proposed modification of loss function ” section. The framework is trained and validated in a transductive manner. The training algorithm is shown in Algorithm 1.

Algorithm 1: Pseudocode of the framework training
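The message-passing step can be sketched as a single GCN-style layer; the code below is our illustration of the operation described above, not the authors’ implementation.

```python
import torch
import torch.nn as nn

class PatientGCNLayer(nn.Module):
    """One message-passing layer over the patient network (sketch)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # trainable transform W

    def forward(self, H, A):
        # H: (N, in_dim) node embeddings; A: (N, N) adjacency matrix
        aggregated = A @ H  # sum neighbor embeddings
        return torch.softmax(self.W(aggregated), dim=-1)  # softmax activation

# Toy transductive step: 5 patients, 8-dim profiles, 2 output classes
H, A = torch.randn(5, 8), torch.eye(5)
layer = PatientGCNLayer(8, 2)
print(layer(H, A).shape)  # torch.Size([5, 2])
```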

Dataset and data processing

This study was conducted using the public dataset Pediatric intensive care dataset (PICD), version 1.1.0, which is available in the PhysioNet repository [ 24 ]. The dataset consists of patients aged 0-18 years admitted to the Intensive care units (ICUs) at the Children’s Hospital of Zhejiang University School of Medicine, Zhejiang, China, from 2010-2019. Our previous work, published in 2023, described the method of selecting cohort samples and extracting data [ 25 ]. We follow the same procedure for collecting data and defining sepsis in this study. However, in the current study, only continuous variables were used, and raw demographic, vital sign, and laboratory data were used instead of category-coded data. This study was approved by the National University of Singapore’s Institutional Review Board (NUS-IRB-2024-396).

Evaluation metrics

The evaluation task is to predict the sepsis outcome of the patients in the test set. As it is a binary classification task, we used Accuracy (ACC), Sensitivity (SEN), Specificity (SPE), Negative predictive value (NPV), Positive predictive value (PPV), and Area under the receiver operating characteristic curve (AUC) to evaluate the model performance.

AUC was measured by plotting the true positive rate against the false positive rate. A high AUC indicates the model’s ability to distinguish the classes in binary classification. The rest of the metrics are derived from the confusion matrix (Table 4 ). SEN and SPE are the proportions of TP among all positives and TN among all negatives, while PPV and NPV measure the proportions of TP among predicted positives and TN among predicted negatives.
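All of these metrics can be derived from a single confusion matrix; a short sklearn sketch (our own, for illustration):

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

def report_metrics(y_true, y_pred, y_score):
    """Compute ACC, SEN, SPE, PPV, NPV, and AUC from predictions."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "SEN": tp / (tp + fn),   # sensitivity (recall)
        "SPE": tn / (tn + fp),   # specificity
        "PPV": tp / (tp + fp),   # positive predictive value
        "NPV": tn / (tn + fn),   # negative predictive value
        "AUC": roc_auc_score(y_true, y_score),
    }

print(report_metrics([0, 0, 1, 1], [0, 1, 1, 1], [0.2, 0.6, 0.7, 0.9]))
```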

Study design

The data was split into 70% for training and 30% for testing, with the test set masked during training. We trained MedMGF on the training data and reported the model performance on the masked test data (Fig. 5 ). The evaluation aimed: (1) to validate the overall performance of the framework compared to the baseline models, (2) to compare its effectiveness against the oversampling method, and (3) to verify that the proposed loss function is comparable to existing loss functions. In the first evaluation, we used three sets of baseline models, including Logistic Regression (LR) and GCN implemented with BCE, FL, and the balancing parameter \(\alpha\) . In the second evaluation, we used GCN+BCE+SMOTE and GCN+FL+SMOTE as the baseline models. As SMOTE is the most common oversampling technique for imbalanced data, it was selected for this study. In the third evaluation, we implemented our proposed framework using BCE and FL with the balancing parameter \(\alpha\) as the baseline models. Finally, we used t-distributed stochastic neighbor embedding (t-SNE) plots to visualize the data embedding produced by MedMGF+eFL and the best two baseline models (GCN+\(\alpha\)FL and GCN+\(\alpha\)BCE) to demonstrate the learning process. The performance of all models is summarized in Table 5 . A summary of the proposed MedMGF and the previous studies (MiME, GRAM, and MDP) is also provided in Table 6 to highlight the differences of our approach.

Fig. 5: The training and validation workflow of MedMGF

Models were fine-tuned to perform optimally. \(\gamma\) was selected to optimize model performance, and the balancing parameter was set to the imbalance ratio of the dataset (\(\alpha _+ = 0.047\)). All models except the LR models were trained with the Adam optimizer, a maximum of 10,000 epochs, and a learning rate of 0.01. Training used an early-stopping mechanism: training stopped when the validation loss did not decrease for 10 epochs; otherwise, results were reported at the conclusion of the training process. BCE with logit loss was set up with mean reduction. The data was split into 70% for training and 30% for testing using the sklearn library. The SMOTE oversampling algorithm was implemented using the imblearn library. The t-SNE plots were produced with the sklearn library. The framework was implemented in the Spyder IDE (MIT, version 5.5.0, Python version 3.9.14).
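A minimal sketch of the data split and SMOTE step with sklearn and imblearn, using a synthetic stand-in for the cohort (the real dataset is not reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Synthetic stand-in with roughly the study's imbalance (~4.5% positives)
X, y = make_classification(n_samples=3014, weights=[0.955], random_state=0)

# 70/30 split, as in the study
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y
)

# SMOTE oversampling, used only for the baseline models
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
print(y_train.sum(), "positives before SMOTE;", y_res.sum(), "after")
```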

Statistical methods

We calculated medians [interquartile ranges (IQRs)] for continuous variables and absolute counts (percentages) for categorical variables. Differences between the sepsis and non-sepsis cohorts were assessed with Mann-Whitney U tests on continuous variables and Pearson’s Chi-squared tests on categorical variables. All statistical analyses were performed using Microsoft Excel (version 16.55, Microsoft, USA), with statistical significance taken as p < 0.05.

Demographic and baseline clinical characteristics of patients

The cohort contains 3,014 admissions with a median age of 1.13 (IQR: 0.15-4.30) years and 1,698 (56.3%) males. The number of sepsis-positive cases is 134 (4.4%), which results in an imbalance ratio of 0.047 between classes. A total of three demographic variables (age, length of stay in the intensive care unit, length of stay in the hospital), five vital signs (temperature, heart rate, respiratory rate, diastolic and systolic blood pressure), and 15 laboratory variables are included in the study (Appendix A). An overview of cohort demographics and clinical outcomes can be found in [ 25 ].

Model performance comparison against baseline model

On PICD (imbalance ratio of 0.047), LR produced predictions overwhelmingly in favor of the dominant class, resulting in a low SEN (0.0256) and high ACC (0.9546), SPE (0.9965), and NPV (0.9578). With LR+SMOTE, the classification improved significantly in AUC and SEN (AUC: 0.7740, SEN: 0.7179). Compared to these baseline models, MedMGF+eFL showed higher classification performance for AUC, SEN, and NPV (AUC: 0.8098, ACC: 0.7503, SEN: 0.8750, SPE: 0.7445, PPV: 0.1367, NPV: 0.9923). Specifically, MedMGF+eFL obtained an increase of 29.88% in AUC and 84.94% in SEN when compared to LR.

For both GCN+BCE and GCN+FL, we observed no effective learning for the minority class. However, integrating the balancing parameter \(\alpha\) improved the results. GCN+\(\alpha\)FL (AUC: 0.7717, ACC: 0.8144, SEN: 0.7250, SPE: 0.8185, PPV: 0.1559, NPV: 0.9847) gave slightly higher performance than GCN+\(\alpha\)BCE (AUC: 0.7712, ACC: 0.8133, SEN: 0.7250, SPE: 0.8173, PPV: 0.1551, NPV: 0.9847), though the difference was not considered significant. Compared to GCN+\(\alpha\)FL, the proposed MedMGF+eFL framework demonstrated a 3.81% increase in AUC and a 15% increase in SEN.

Model performance comparison with different loss functions & SMOTE

Using BCE and FL alone does not lead to effective learning during training due to the extreme imbalance ratio of the dataset. Performance improvements were only achieved with the inclusion of SMOTE. The GCN+SMOTE+FL model (AUC: 0.7510, ACC: 0.8431, SEN: 0.6500, SPE: 0.8520, PPV: 0.1688, NPV: 0.9814) yielded better results than GCN+SMOTE+BCE (AUC: 0.6665, ACC: 0.7271, SEN: 0.6000, SPE: 0.7329, PPV: 0.0941, NPV: 0.9754). When compared with GCN+SMOTE+FL, the MedMGF+eFL model showed a 5.88% increase in AUC and a 22.5% increase in SEN, although there was a decrease of 8.05% in SPE and 3.21% in PPV. Additionally, MedMGF+eFL achieved a 3.58% increase in AUC and a 4.98% increase in SEN when compared to LR+SMOTE.

Model performance comparison for the proposed loss function

We observed that MedMGF achieved high SEN (\(\alpha\)BCE: 0.8750, \(\alpha\)FL: 0.8250, eFL: 0.8750), high AUC (\(\alpha\)BCE: 0.7975, \(\alpha\)FL: 0.7998, eFL: 0.8098), and high NPV (\(\alpha\)BCE: 0.9896, \(\alpha\)FL: 0.9897, eFL: 0.9923) when compared to all baseline models. The best SEN (0.8750), AUC (0.8098), and NPV (0.9923) were achieved with the proposed loss function eFL. However, MedMGF+eFL experienced a decrease in PPV (0.1367) and SPE (0.7445) compared to the other two models.

Fig. 6: The data embedding transformation during training with MedMGF+eFL, GCN+\(\alpha\)FL, and GCN+\(\alpha\)BCE. Yellow dots represent positive samples and purple dots represent negative samples

Figure 6 presents the final patient embeddings within the patient network, generated by the proposed MedMGF+eFL framework, alongside GCN+\(\alpha\)FL and GCN+\(\alpha\)BCE. In this visualization, yellow dots represent the positive class, while purple dots represent the negative class. For MedMGF+eFL, we observed that the yellow dots initially intermingle with the purple dots, making it challenging to establish a clear boundary between them. However, as training progresses, the yellow dots gradually cluster together, and by epoch 700, most of them have concentrated at one end, facilitating easier classification. In contrast, the other two baseline models quickly separated the dots, but the separation process slowed down starting from epoch 300 for GCN+\(\alpha\)FL and from epoch 400 for GCN+\(\alpha\)BCE. Learning in these models ceased around epoch 500, whereas it continued with MedMGF+eFL, leading to a higher SEN for MedMGF+eFL.

In this study, we propose a novel multi-level graph-based framework designed to represent clinical knowledge that can be utilized for several downstream applications. It consists of three components: a tree structure that captures an individual patient’s medical information, a patient-patient network, and a modified loss function specifically for imbalanced datasets. The integration of patient medical profiles and patient networks within a unified architecture facilitates multiple types of analyses, including patient stratification and cohort discovery. Our results demonstrated the framework’s effectiveness, achieving improved classification performance on a highly imbalanced pediatric sepsis dataset (imbalance ratio of 0.047) compared to baseline models. Furthermore, the proposed loss function has shown improvements in classification performance over BCE and FL. In the following section, we will discuss the framework’s properties, its clinical implications, as well as its limitations and potential directions for future research.

Framework approach . Our approach focuses on preserving the EHR’s inherent structure and interpretability by leveraging its existing groupings. By utilizing this structure and organizing it in a tree-like manner, we effectively reduce the dimensionality of the data input while maintaining a minimum level of data embedding interpretation. This dimensionality reduction leads to faster training times and a less complex learning process. The approach has also been designed to facilitate the integration of domain experts’ knowledge, allowing them to construct the structure intuitively, thereby enhancing interpretability. Compared to the creation of graphical models like Bayesian networks [ 26 ] by domain experts, constructing a tree structure is simpler and more cost-effective. Through this graph-based architecture, we can visually represent both the patient’s medical profile and their relationships with other patients. This architecture incorporates several layers of information, including interval data and hospitalization records. Essentially, it encompasses the entire hospitalization of the patient and the data for each visit in a compact, easily visualized format. Depending on the context, this can be presented either as an individual medical profile or as a cluster of similar patients. Furthermore, by integrating patient medical profiles and patient networks into a unified architecture and training process, we achieve a reduction in training costs.

Framework properties . MedMGF has the following key properties: (1) generality and modality, (2) multi-purpose functionality, (3) intuitive interpretation, and (4) minimal data processing requirements.

Firstly, the framework is designed to seamlessly integrate with various types of medical data by embedding and extending the number of modules to accommodate additional data sources. This modular approach allows for the effortless incorporation of new information, enabling the framework to be easily modified and updated in response to evolving medical data. With its flexible module structure, the MedMGF framework efficiently utilizes available information, enhancing its adaptability and scalability.

Secondly, the framework demonstrates potential for a wide range of tasks, such as disease diagnosis, cohort discovery, and risk prediction. For instance, the similarity between patient profiles can be leveraged to predict another patient’s risk of rehospitalization or their likely response to treatment. Clinicians can also utilize the framework to identify individuals at risk for certain diseases or adverse reactions by comparing medical profiles. Additionally, MedMGF can serve as a bedside monitoring system, tracking patients’ conditions and the progression of their diseases. In some scenarios, the framework could be adapted to alert clinicians when a patient’s medical profile closely resembles that of a specific disease or when certain characteristics are present, enhancing early detection and intervention.

A third characteristic of the framework is its ease of interpretation, which enables clinicians to easily understand concepts related to the structure of the EHR, patient profiles, and the patient network. By presenting the data in a clear and concise manner, the framework can assist clinicians in making informed decisions and gaining valuable insights from framework visualizations. This intuitive interpretability enhances the framework’s effectiveness and usability in various medical contexts, ultimately contributing to improved patient care and outcomes.

Last but not least, it requires minimal processing of EHR data since it does not require oversampling techniques to improve learning, or additional processing to map data to the multilevel graph-based structure. However, it still requires basic processing tasks such as handling missing data, removing outliers, and selecting variables for the EHR tree-based structure.

Handling data imbalance . Class imbalances in medical data are common and can significantly impair classification performance [ 27 , 28 ]. Due to these imbalances, ML models may struggle to accurately differentiate between classes, often leading to biased predictions that favor the dominant class. Various techniques can address this issue at both the data and algorithmic levels. Data-level approaches include oversampling and undersampling, while algorithmic-level approaches involve heuristics that prioritize minority classes [ 29 ].

Oversampling techniques, such as SMOTE, have shown effectiveness in improving ML model performance by generating synthetic samples during training. This, however, may introduce unwanted additional noise and bias to the training process. On the other hand, undersampling, which reduces the number of samples in the dominant class, is not beneficial when dealing with extremely imbalanced or small medical datasets. In this study, we address the imbalance problem at the algorithmic level by modifying the focal loss function. By assigning a modulating term to samples, the loss function can concentrate more on hard-to-classify samples during training. The modulating term proposed in our study creates a flexible sampling weight that adapts based on the framework’s learning at each training round, eliminating the need to rely on a hyperparameter.

Framework explainability . It is essential for clinicians to understand how machine learning models make decisions to apply these models to their practice. For this reason, models should be able to explain how data is used, identify the factors influencing decisions, and clarify how those decisions are reached. Given this need, it is not surprising that Explainable AI (XAI) has seen rapid growth in recent years [ 30 , 31 , 32 ]. XAI plays a critical role in bridging the gap between proof-of-concept studies and applied ML in medicine [ 33 ]. By leveraging XAI, potential biases or errors in the model can be identified, offering insights into the reasons behind specific decisions. Moreover, it can be used to tune parameters or correct errors in the model. XAI techniques commonly used in ML-based studies include Shapley Additive Explanations (SHAP, [ 34 ]) and Local Interpretable Model-Agnostic Explanations (LIME, [ 30 ]). Currently, our framework does not use XAI, but it can easily be adapted to do so. For example, it is possible to identify different nodes’ attention weights with SHAP or LIME, or with Graph Attention Networks (GAT, [ 35 ]) integrated into the framework. An alternative is to integrate a GAT-like approach into the hierarchical embeddings to enhance explainability during model learning.

Framework complexity . The framework consists of four operations: (1) the tree-structure representation and embedding for EHR data \(\mathcal {O}_{\mathcal {T}}\) , (2) the multilevel data embedding for the patient’s medical profile \(\mathcal {O}_{\mathcal {P}}\) , (3) the construction of the patient-patient medical profile network \(\mathcal {O}_{\mathcal {G}}\) , and (4) inference for downstream tasks on the medical profile network \(\mathcal {O}_{\mathcal {I}}\) . Hence, the time complexity of the overall framework will be the sum of these operations:

\(\mathcal {O} = \mathcal {O}_{\mathcal {T}} + \mathcal {O}_{\mathcal {P}} + \mathcal {O}_{\mathcal {G}} + \mathcal {O}_{\mathcal {I}}\)

The patient’s medical profile network is constructed based on a Euclidean distance matrix \(\in \mathbb {R}^{N \times N}\) , with N the number of patients. Hence, the complexity is estimated to be \(\mathcal {O}(N^2)\) . The core operation in (1), (2), and (4) is based on the message passing mechanism. This mechanism includes the feature transformation, neighborhood aggregation, and updating via an activation function, in both the forward and backward pass, for one layer. In the forward pass, the feature transformation is a multiplication between the node feature matrix \(X \in \mathbb {R}^{N \times T}\) and the transformation weight matrix \(W \in \mathbb {R}^{T \times T}\) , hence \(\mathcal {O}(NT^2)\) . Neighbor aggregation is a multiplication between matrices of size \(N \times N\) and \(N \times T\) , yielding \(\mathcal {O}(N^2T)\) . Finally, the cost of the activation function is \(\mathcal {O}(N)\) . In practice, we could use a sparse operator, so the cost of the neighbor aggregation can be reduced to \(\mathcal {O}(|E|T)\) . Hence, the total cost of the forward pass is \(\mathcal {O}(NT^2) + \mathcal {O}(|E|T) + \mathcal {O}(N)\) . In the backward pass, the cost of performing backpropagation for X and W is \(\mathcal {O}(NT^2) + \mathcal {O}(|E|T)\) .

In the tree-structure representation \(\mathcal {T}\) for EHR data, there are \((|\mathcal {P}|-1)\) message-passing operations and one aggregation operation at the root node. Additionally, the multilevel data embedding for intervals and hospitalizations consists of three aggregation functions. Hence, the time complexity of the framework from the four mentioned operations is:

\(\mathcal {O}(N^2) + \mathcal {O}(NT^2) + \mathcal {O}(|E|T) + \mathcal {O}(N)\)

As we used embedding to reduce the dimension of the feature vectors, T is rather small compared to the original dimension of the feature vectors. Therefore, the overall complexity of MedMGF reduces to \(\mathcal {O}(N^2)\) . The complexity depends on the number of samples in the dataset. As the number of samples grows, it will be taxing to construct the patient network.

Comparison with previous studies . Table 6 provides a summary of our proposed framework, MedMGF, in comparison with MiME, GRAM, and MDP, all of which employ a multi-level embedding approach to medical data representation. While MiME and GRAM may be limited to diagnosis and treatment codes, MedMGF encompasses more aspects of EHR data and can be extended to incorporate imaging, signals, and other data types into the representation. Although MDP integrated clinical data into GRAM, the data is still handled in a flat manner. While existing works exploit the hierarchy between medical codes, MedMGF exploits the hierarchy within the EHR data itself. MiME and GRAM are capable of representing complex and general medical concepts beyond just data alone.

All methods are capable of handling both small and large medical datasets. With MedMGF, the complexity increases significantly as the number of patients increases. None of the methods have integrated XAI, with model interpretation primarily derived from the framework architecture. GRAM and MDP are notable for their use of attention mechanisms, which allow for better model interpretation and feature importance determination. In this regard, MedMGF relies on the intuitive tree structure of EHR data as well as the integration of a network of patient similarity to enhance the interpretation of the model. As of now, MedMGF does not have a mechanism for determining the importance of features.

In comparison with existing methods, MedMGF has the advantage of handling imbalanced data and does not require additional data processing. Existing methods do not address imbalanced data directly and may require additional steps to process medical codes when applied to other EHR datasets.

Clinical impact. The MedMGF framework demonstrates significant improvements in AUC and SEN over the baseline models on an extremely imbalanced dataset. The improvement in these metrics suggests that MedMGF may improve diagnostic precision and accuracy in real-world medical settings, such as sepsis diagnosis, where the septic population is much smaller than the healthy population. Furthermore, a false negative in sepsis diagnosis can be far more detrimental to the patient’s well-being than a false positive; it is therefore desirable to achieve a high SEN to reduce the number of false negatives. Admittedly, false positives can contribute to a greater incidence of antibiotic resistance, but the patient’s well-being and mortality risk should usually take precedence. Our MedMGF framework is therefore advantageous, since it can deliver a high SEN on imbalanced data. By effectively addressing the challenges posed by imbalanced datasets, MedMGF can potentially open up new possibilities for more accurate and reliable clinical applications. In terms of development, deployment, and application, MedMGF can be tailored to a variety of hospital needs, including disease diagnosis, bedside monitoring, and research assistance. This versatility eliminates the need for multiple systems and frameworks, resulting in cost savings.
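For reference, the two metrics discussed here relate directly to false negatives and false positives:

\(\mathrm{SEN} = \frac{TP}{TP + FN}, \qquad \mathrm{SPE} = \frac{TN}{TN + FP}\)

so driving false negatives (FN) down is exactly what raises SEN, at the possible cost of more false positives.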

Study limitation & future works. There are, however, a number of limitations to our study. First, the limited number of datasets used for evaluation may raise concerns about MedMGF’s generalizability, an important requirement for ML models to perform well on unseen data. Without a diverse range of datasets, the model may fail to accurately predict outcomes in real-world scenarios, leading to unreliable results and limited practical applications. Second, for demonstration purposes, we used only a portion of the data collected within 24 hours of ICU admission. To validate the generality of the framework when modeling patient medical profiles with several hospitalizations and intervals, more data points should be included. Additionally, the number of features in the dataset is relatively small for validating the complexity of the framework, as discussed in the complexity analysis. As our experiments used only EHR data, more research should be conducted to validate the framework on other data types (e.g., imaging, waveforms).

Furthermore, a comparison of our work with previous studies would provide additional evidence of MedMGF’s efficiency. However, previous studies such as MiME and Med2Vec require data on the relationship between symptoms and treatments, which is not available in our dataset; implementing MiME or GRAM with our current data is therefore challenging. We did not include performance comparisons of MedMGF with MiME, GRAM, and MDP for the following reasons. Each framework uses different metrics for different tasks: MiME uses PR-AUC for predicting the onset of heart failure (HF); GRAM uses Accuracy@K (where K represents the top diagnosis-code guesses for the next hospital visit) to count correctly predicted codes, and AUC for predicting the onset of HF; MDP measures Accuracy@K for the top K diagnosis-code guesses for the next hospital visit; and MedMGF uses AUC, ACC, SEN, SPE, NPV, and PPV for sepsis prediction. Given the differences in the nature of the tasks defined in the various experiments and the metrics used, comparing results across studies is challenging. In addition, we were not able to reproduce MiME or GRAM because our dataset lacks the relationship between treatment and diagnosis codes.

In addition, the stated characteristics of the framework are inferred from its design. Demonstrating the modality characteristic is currently challenging because our dataset lacks imaging or waveform data; more research is therefore needed to confirm the framework's feasibility for a variety of other analysis purposes and to confirm its multipurpose characteristic. Finally, we have not yet incorporated an XAI mechanism into the framework. To address these limitations, future research could collect data over a longer period to conduct more comprehensive and diverse evaluations. Incorporating data from multiple healthcare institutions or collaborating with other researchers could also enhance the framework’s generalizability and validate its effectiveness across different settings. Integrating an XAI mechanism would further enhance interpretability: XAI is an emerging area of applied ML in healthcare with the potential to significantly improve model interpretation and promote practical ML applications in clinical settings. Literature reviews and model development related to XAI, and to the contribution of various types of medical data to clinical prediction models, could be valuable areas for further research.

Our study proposes MedMGF, a framework that integrates medical profile representation and a patient-patient profile network within a single architecture. It utilizes the hierarchical structure of EHR data to represent patients’ medical data and the graphical structure of the patient-patient network to perform supervised tasks. Additionally, the proposed modification to the focal loss improved classification performance on imbalanced datasets compared to the baseline models. Overall, the framework encapsulates both generality and modality and can easily be adapted to a variety of analyses and applications. It can be further extended by incorporating XAI to enhance its interpretability and transparency in future research.
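For context, the sketch below shows the standard binary focal loss of Lin et al. (see the reference list), on which the proposed modification builds; the modified form itself is defined in the body of the paper and is not reproduced here:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               gamma: float = 2.0, alpha: float = 0.25) -> torch.Tensor:
    """Standard binary focal loss: down-weights easy, well-classified
    examples so that the rare positive class drives the gradient."""
    targets = targets.float()
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)        # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()
```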

Availability of data and materials

The datasets analysed during the current study are available in the PhysioNet repository, https://physionet.org/content/picdb/1.1.0/ .

Code availability

The underlying code for this study is not publicly available but may be made available to qualified researchers on reasonable request from the corresponding author. Link: https://github.com/zoetmn/med-mgf .

Wang W, Ferrari D, Haddon-Hill G, Curcin V. Electronic Health Records as Source of Research Data. In: Machine Learning for Brain Disorders, vol. 197. Springer US; 2023. pp. 331–354. https://doi.org/10.1007/978-1-0716-3195-9_11 . https://link.springer.com/10.1007/978-1-0716-3195-9_11 .

Kim MK, Rouphael C, McMichael J, Welch N, Dasarathy S. Challenges in and Opportunities for Electronic Health Record-Based Data Analysis and Interpretation. Gut Liver. 2024;18. https://doi.org/10.5009/gnl230272 .

Habehh H, Gohel S. Machine learning in healthcare. Curr Genomics. 2021;22:291–300. https://doi.org/10.2174/1389202922666210705124359 . https://www.eurekaselect.com/194468/article

Amirahmadi A, Ohlsson M, Etminani K. Deep learning prediction models based on EHR trajectories: a systematic review. J Biomed Inform. 2023;144:104430. https://doi.org/10.1016/j.jbi.2023.104430 . https://linkinghub.elsevier.com/retrieve/pii/S153204642300151X

Pai S, Bader GD. Patient Similarity Networks for Precision Medicine. J Mol Biol. 2018;430:2924–38. https://doi.org/10.1016/j.jmb.2018.05.037 . https://linkinghub.elsevier.com/retrieve/pii/S0022283618305321

Parimbelli E, Marini S, Sacchi L, Bellazzi R. Patient similarity for precision medicine: A systematic review. J Biomed Inform. 2018;83:87–96. https://doi.org/10.1016/j.jbi.2018.06.001 . https://linkinghub.elsevier.com/retrieve/pii/S1532046418301072

Navaz AN, T El-Kassabi H, Serhani MA, Oulhaj A, Khalil K. A Novel Patient Similarity Network (PSN) Framework Based on Multi-Model Deep Learning for Precision Medicine. J Personalized Med. 2022;12:768. https://doi.org/10.3390/jpm12050768 .

Roque FS, Jensen PB, Schmock H, Dalgaard M, Andreatta M, Hansen T, et al. Using electronic patient records to discover disease correlations and stratify patient cohorts. PLoS Comput Biol. 2011;7. https://doi.org/10.1371/journal.pcbi.1002141 .

Lu H, Uddin S. A weighted patient network-based framework for predicting chronic diseases using graph neural networks. Sci Rep. 2021;11. https://doi.org/10.1038/s41598-021-01964-2 .

Panahiazar M, Taslimitehrani V, Pereira NL, Pathak J. Using EHRs for Heart Failure Therapy Recommendation Using Multidimensional Patient Similarity Analytics. Stud Health Technol Inform. 2015;210:369–73.


Lin TY, Goyal P, Girshick R, He K, Dollár P. Focal Loss for Dense Object Detection. 2018. arXiv:1708.02002 .

Kipf TN, Welling M. Semi-Supervised Classification with Graph Convolutional Networks. 2017. arXiv:1609.02907 .

Shannon CE. A Mathematical Theory of Communication. Bell Syst Tech J. 1948;27:379–423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x . https://ieeexplore.ieee.org/document/6773024

Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: Synthetic Minority Over-sampling Technique. J Artif Intell Res. 2002;16:321–57. https://doi.org/10.1613/jair.953 . https://www.jair.org/index.php/jair/article/view/10302

Mukherjee P, Humbert-Droz M, Chen JH, Gevaert O. SCOPE: predicting future diagnoses in office visits using electronic health records. Sci Rep. 2023;13:11005. https://doi.org/10.1038/s41598-023-38257-9 . https://www.nature.com/articles/s41598-023-38257-9

Grout R, Gupta R, Bryant R, Elmahgoub MA, Li Y, Irfanullah K, et al. Predicting disease onset from electronic health records for population health management: a scalable and explainable Deep Learning approach. Front Artif Intell. 2024;6:1287541. https://doi.org/10.3389/frai.2023.1287541 . https://www.frontiersin.org/articles/10.3389/frai.2023.1287541/full

Choi E, Bahadori MT, Searles E, Coffey C, Sun J. Multi-layer Representation Learning for Medical Concepts. 2016. arXiv:1602.05568 .

Choi E, Bahadori MT, Song L, Stewart WF, Sun J. GRAM: Graph-based Attention Model for Healthcare Representation Learning. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM; 2017. pp. 787–795. https://doi.org/10.1145/3097983.3098126 . https://dl.acm.org/doi/10.1145/3097983.3098126 .

Choi E, Xiao C, Stewart WF, Sun J. MiME: Multilevel Medical Embedding of Electronic Health Records for Predictive Healthcare. 2018. arXiv:1810.09593 .

Li R, Ma F, Gao J. Integrating Multimodal Electronic Health Records for Diagnosis Prediction. AMIA Annual Symposium proceedings, vol. 2021. AMIA Symposium; 2021. pp. 726–735.

Charoenphakdee N, Vongkulbhisal J, Chairatanakul N, Sugiyama M. On Focal Loss for Class-Posterior Probability Estimation: A Theoretical Perspective. 2020. arXiv:2011.09172 .

Mukhoti J, Kulharia V, Sanyal A, Golodetz S, Torr PHS, Dokania PK. Calibrating Deep Neural Networks using Focal Loss. 2020. arXiv:2002.09437 .

Ghosh A, Schaaf T, Gormley MR. AdaFocal: Calibration-aware Adaptive Focal Loss. 2023. arXiv:2211.11838 .

Zeng X, Yu G, Lu Y, Tan L, Wu X, Shi S, et al. PIC, a paediatric-specific intensive care database. Sci Data. 2020;7:14. https://doi.org/10.1038/s41597-020-0355-4 . http://www.nature.com/articles/s41597-020-0355-4

Nguyen TM, Poh KL, Chong SL, Lee JH. Effective diagnosis of sepsis in critically ill children using probabilistic graphical model. Transl Pediatr. 2023;12:538–51. https://doi.org/10.21037/tp-22-510 .

Andersen SK. Probabilistic reasoning in intelligent systems: Networks of plausible inference. Artif Intell. 1991;48:117–24. https://doi.org/10.1016/0004-3702(91)90084-W . https://linkinghub.elsevier.com/retrieve/pii/000437029190084W

Thabtah F, Hammoud S, Kamalov F, Gonsalves A. Data imbalance in classification: Experimental evaluation. Inf Sci. 2020;513:429–41. https://doi.org/10.1016/j.ins.2019.11.004 . https://linkinghub.elsevier.com/retrieve/pii/S0020025519310497

Johnson JM, Khoshgoftaar TM. Survey on deep learning with class imbalance. J Big Data. 2019;6:27. https://doi.org/10.1186/s40537-019-0192-5 . https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0192-5

Rezvani S, Wang X. A broad review on class imbalance learning techniques. Appl Soft Comput. 2023;143:110415. https://doi.org/10.1016/j.asoc.2023.110415 . https://linkinghub.elsevier.com/retrieve/pii/S1568494623004337

Ribeiro MT, Singh S, Guestrin C. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Francisco California USA: ACM; 2016. pp. 1135–1144. https://doi.org/10.1145/2939672.2939778 . https://dl.acm.org/doi/10.1145/2939672.2939778 .

Ali S, Abuhmed T, El-Sappagh S, Muhammad K, Alonso-Moral JM, Confalonieri R, et al. Explainable Artificial Intelligence (XAI): What we know and what is left to attain Trustworthy Artificial Intelligence. Inf Fusion. 2023;99:101805. https://doi.org/10.1016/j.inffus.2023.101805 . https://linkinghub.elsevier.com/retrieve/pii/S1566253523001148

Saeed W, Omlin C. Explainable AI (XAI): A systematic meta-survey of current challenges and future opportunities. Knowl-Based Syst. 2023;263:110273. https://doi.org/10.1016/j.knosys.2023.110273 . https://linkinghub.elsevier.com/retrieve/pii/S0950705123000230

S Band S, Yarahmadi A, Hsu CC, Biyari M, Sookhak M, Ameri R, et al. Application of explainable artificial intelligence in medical health: A systematic review of interpretability methods. Inform Med Unlocked. 2023;40:101286. https://doi.org/10.1016/j.imu.2023.101286 . https://linkinghub.elsevier.com/retrieve/pii/S2352914823001302 .

Lundberg S, Lee SI. A Unified Approach to Interpreting Model Predictions. 2017. arXiv:1705.07874 .

Veličković P, Cucurull G, Casanova A, Romero A, Lió P, Bengio Y. Graph Attention Networks. 2018. arXiv:1710.10903 .


Acknowledgements

This study received no funding.

Author information

Authors and Affiliations

Department of Industrial Engineering and Management, National University of Singapore, Singapore, 117576, Singapore

Tuong Minh Nguyen & Kim Leng Poh

Children’s Emergency, KK Women’s and Children’s Hospital, Singapore, 229899, Singapore

Shu-Ling Chong

SingHealth-Duke NUS Paediatrics Academic Clinical Programme, Duke-NUS Medical School, Singapore, 169857, Singapore

Shu-Ling Chong & Jan Hau Lee

Children’s Intensive Care Unit, KK Women’s and Children’s Hospital, Singapore, 229899, Singapore

Jan Hau Lee


Contributions

The concept and architecture were designed by TMN under the supervision of PKL, JHL and SLC. PKL provided the technical guidance. JHL and SLC provided clinical interpretation of the results. All authors contributed to manuscript preparation, writing, critical revisions, and have read and approved the final manuscript.

Corresponding author

Correspondence to Tuong Minh Nguyen .

Ethics declarations

Ethics approval and consent to participate

The Institutional Review Board of National University of Singapore approved this study (IRB: NUS-IRB-2024-396).

Consent for publication

This paper did not include any individual data. Not applicable.

Competing interests

The authors declare no competing interests.


Supplementary Information

Supplementary material 1.


About this article

Nguyen, T., Poh, K., Chong, S.L. et al. Med-MGF: multi-level graph-based framework for handling medical data imbalance and representation. BMC Med Inform Decis Mak 24, 242 (2024). https://doi.org/10.1186/s12911-024-02649-2


Received: 24 May 2024

Accepted: 23 August 2024

Published: 02 September 2024

DOI: https://doi.org/10.1186/s12911-024-02649-2


Keywords
  • Pediatric sepsis
  • Patient network
  • Graphical models
  • Message passing
  • Machine learning


Computer Science > Machine Learning

Title: Task-Oriented Communication for Graph Data: A Graph Information Bottleneck Approach

Abstract: Graph data, essential in fields like knowledge representation and social networks, often involves large networks with many nodes and edges. Transmitting these graphs can be highly inefficient due to their size and redundancy for specific tasks. This paper introduces a method to extract a smaller, task-focused subgraph that maintains key information while reducing communication overhead. Our approach utilizes graph neural networks (GNNs) and the graph information bottleneck (GIB) principle to create a compact, informative, and robust graph representation suitable for transmission. The challenge lies in the irregular structure of graph data, making GIB optimization complex. We address this by deriving a tractable variational upper bound for the objective function. Additionally, we propose the VQ-GIB mechanism, integrating vector quantization (VQ) to convert subgraph representations into a discrete codebook sequence, compatible with existing digital communication systems. Our experiments show that this GIB-based method significantly lowers communication costs while preserving essential task-related information. The approach demonstrates robust performance across various communication channels, suitable for both continuous and discrete systems.
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Signal Processing (eess.SP)


  • Data Descriptor
  • Open access
  • Published: 07 September 2024

A Semantic Knowledge Graph of European Mountain Value Chains

  • Valentina Bartalesi 1 ,
  • Gianpaolo Coro   ORCID: orcid.org/0000-0001-7232-191X 1 ,
  • Emanuele Lenzi 1 ,
  • Nicolò Pratelli 1 ,
  • Pasquale Pagano 1 ,
  • Michele Moretti 2 &
  • Gianluca Brunori 2  

Scientific Data volume 11, Article number: 978 (2024)


Subjects: Socioeconomic scenarios

The United Nations forecast a significant shift in global population distribution by 2050, with rural populations projected to decline. This decline will particularly challenge mountain areas’ cultural heritage, well-being, and economic sustainability. Understanding the economic, environmental, and societal effects of rural population decline is particularly important in Europe, where mountainous regions are vital for supplying goods. The present paper describes a geospatially explicit semantic knowledge graph containing information on 454 European mountain value chains. It is the first large-size, structured collection of information on mountain value chains. Our graph, structured through ontology-based semantic modelling, offers representations of the value chains in the form of narratives. The graph was constructed semi-automatically from unstructured data provided by mountain-area expert scholars. It is accessible through a public repository and explorable through interactive Story Maps and a semantic Web service. Through semantic queries, we demonstrate that the graph allows for exploring territorial complexities and discovering new knowledge on mountain areas’ environmental, societal, territory, and economic aspects that could help stem depopulation.

Background & Summary

The 2018 update of the World Urbanization Prospects, released by the United Nations Department of Economic and Social Affairs, projects a significant shift in the global population. Currently, 47% of the world population is rural, but this is expected to decline to 30% by 2050 1 . This transition raises concerns for traditional and cultural heritage, urban well-being, and ecological sustainability. Massive rural-to-urban migration will increase city populations, pollution, and energy consumption. The United Nations (UN) predicts that by 2050, 7 billion people will live in cities 2 , creating unsustainable conditions for food, health, and energy security. Challenges include excessive public administration burdens, a cost of living mismatched with salaries, surges in pollution and greenhouse gas emissions, and increased healthcare expenditures. Additionally, cities will depend more on distant resources, leading to negative environmental impacts. In Europe, 36% of the territory is mountainous and critical for the supply of public and private goods. Therefore, understanding factors that can mitigate rural and mountain area depopulation is crucial for sustainability. This aligns with the UN’s Sustainable Development Goal 11 2 , which emphasises strategies to preserve rural economies and services, including economic diversification and tourism enhancements. Recent international initiatives promote capacity-building and participatory processes involving stakeholders and policymakers to create resilient and sustainable mountain areas in response to climate change 3 .

However, such endeavours demand substantial volumes of data to generate meaningful insights. Specifically, data related to the environment, geography, demographics, and economics are essential for comprehending how regions and mountain-based value chains react and adapt to climate change. Such information is key to gaining new knowledge on the dynamics of mountain value chains. For example, it allows identifying value chains that share the same environmental characteristics (e.g. rivers, lakes, vineyards, chestnut trees) and issues (e.g. depopulation, emigration, climate change problems), or that offer similar products across territories (e.g. cow or sheep milk cheese).

The present paper describes an extensive collection of data on 454 value chains from 23 European mountain areas belonging to 16 countries, representing as far as possible the diversity of European mountain areas (Supplementary Table  1 ). Although these data do not cover the entire spectrum of European mountain value chains, they aim to be as representative and reliable as possible by embedding information from local experts with comprehensive views. These experts overall selected the 454 value chains as those with the highest importance in the 23 areas from a socio-ecological (innovation, stage of development, size, governance system, and environmental status and protection) perspective. When they had sufficient information, the experts also extended the data beyond their monitored territories to connect and compare the value chains with similar or related territories and other value chains not initially involved in the project (e.g., the Scandinavian Mountains, the Massif Central, and the Pyrenees).

Data representation and publication principles

Our data collection is organised as a semantic knowledge graph , i.e., a structured representation of knowledge, where knowledge entities and their relations are modelled in a graph format 4 , 5 . The structure of our graph is based on an ontology. An ontology (formally, a computational ontology ) is a model of the knowledge structure of a system in terms of the significant classes (e.g., event, person, location, object) and relations emerging from the observation of the system itself 6 . It is an abstract, simplified view of the system structured in a machine-readable format. Building an ontology requires identifying the relevant concepts and relations (symbolised by unary and binary predicates) of a domain of interest, and organising them in a hierarchical structure. In summary, a semantic knowledge graph based on an ontology has nodes corresponding to the classes of the ontology and edges corresponding to the relations. A narrative is an example of a system describable as a knowledge graph modelled on an ontology.

Our collection is a semantic knowledge graph of narratives, where each narrative is a sub-graph explaining one among the 454 value chains. Each value chain narrative is a semantic network of narrative events related to each other through plot-dependent semantic relations. The overall graph is described under the Web Ontology Language (OWL) and complies with the Narrative Ontology 7 (NOnt), which provides a structure to represent the knowledge of a narrative formally. NOnt is used in several cultural heritage projects (e.g., CRAEFT 8 , Mingei 9 , and IMAGO 10 ). It reuses classes and properties of, and complements, the CIDOC CRM ISO standard 11 and other standard ontologies (FRBRoo 12 , OWL Time 13 , and GeoSPARQL 14 ). NOnt reuses most concepts from the CIDOC CRM ontology - among which is the concept of event - because this ontology is used by many European cultural and museum institutions for data representation. NOnt adds new concepts, such as the role of a narrative actor, the fabula and plot, and new properties, such as the causal dependency between two events and the geospatial belonging of an event to a country or a territory. By reusing concepts from other consolidated ontologies, NOnt enhances interoperability with other vocabularies and semantic knowledge bases and allows for building more extensive knowledge networks in agreement with the Linked Open Data paradigm 15 . From a conservation perspective, our analysed value chains are part of the European cultural heritage and are suitably described as narratives because they relate to European territories’ history, artisanal knowledge, and environmental characteristics.

We used narratives for data description also because they are central to human activities across cultural, scientific, and social domains and can establish shared meanings among diverse domain-specific communities 16 . Psychological theories assert that humans comprehend reality by organising events into narratives 17 , 18 . Stories can depict characters’ intentions, emotions, and aspirations through the attributes of objects and events 19 and can describe overall territorial aspects beyond the analytical data 20 , 21 .

Our value chain narratives include comprehensive information on the selected European product value chains, e.g., about the production of European cheese, beer, milk, sheep and dairy farming, flour, herbs, oil, wine, tourism, carpentry, food and drink, nuts, and others. Overall, they cover economic assets, biodiversity, and ecosystem service descriptions (e.g., food and water resources and touristic features). Therefore, our representation also includes geographical information such as maps, pinned locations, and polygonal areas. A map is a valuable support to represent the spatiotemporal structure of a territory story and the relationships between places 21 . For this reason, we also represented the value chain spatiotemporal narratives as Story Maps , i.e., online interactive maps enriched with textual/audio events and digital material narrating overall territorial complexity. Story Maps allow exploring and navigating a narrative through many digital devices (e.g., PCs, tablets, smartphones, interactive displays) and can be built through collaborative tools 20 . They are valuable to represent the life, emotions, reality, fiction, legends, and expectations associated with the described territory beyond a mere map representation 20 , 21 , 22 , 23 , 24 , and fill the perceptual gap between a territory-as-a-whole and its map 25 .

Our principal target data users and stakeholders are policymakers at all spatial scales, from local to European. The knowledge contained in the data is valuable to designing local, regional, national and/or European policies, strategies, and actions to promote the development of mountain areas, starting from the value chains that populate these areas. In fact, the stakeholders can use this knowledge to understand the prevailing economic sector (primary, secondary, or tertiary) of their respective regions. They can also infer information at a finer spatial scale, such as the resources (natural, cultural, and others) on which the productive fabrics depend. Based on this information, they can design place-based and data-driven policies supporting the socio-economic development of marginalised mountain areas. Other stakeholders of our data are citizens who wish to have an overview of their regions’ territorial and economic assets and the related peculiarities, competitions, and challenges in Europe.

Our semantic knowledge graph is available as a collection on a Figshare repository and through a public semantic database instance, and is interactively explorable through online Story Maps ( Data Records and Usage Notes ). To our knowledge, this is the first extensive and structured collection of information on European mountain value chains.

The Figshare collection tries to meet the FAIR (Findable, Accessible, Interoperable, Reusable) principles as far as possible. Figshare indeed fosters the alignment of the hosted data collections to data FAIRness 26 . The data have a unique and persistent Digital Object Identifier (DOI) assigned (Findable-F1 property). The collection’s metadata comply with the DataCite metadata interconnection schema, and we fulfilled all essential and mandatory fields (Findable-F2). The metadata schema contains a dedicated “doi” field (Findable-F3) and is indexed in both the Figshare search engine (without authentication required) and the major search engines (Findable-F4). Moreover, we added textual metadata for each data folder and a data dictionary to improve data interpretability. The data and metadata are accessible for download, without authentication required. They comply with the “Attribution 4.0 International” licence ( CC BY 4.0 ), i.e., they can be copied, redistributed, transformed, and reused even for commercial purposes. Access is also guaranteed through several open authentication protocols (Accessible-A1), and the collection’s metadata and DOI will be preserved for the repository’s lifetime (Accessible-A2). The metadata are accessible through the Figshare APIs and are exportable (through the collection’s page) to several standards (Interoperable-I1). They conform to controlled vocabularies of research categorisation and open licences (Interoperable-I2). The data vocabulary contains a controlled list of concepts belonging to ontological standards (Interoperable-I3). Finally, the metadata description, the open licence, the availability of the input and output data (complemented by provenance description through the present paper), and the use of a semantic knowledge graph for data representation strongly support our collection’s reusability (Reusable-R1 and R2).

Paper and project background

In the present paper, we describe how we built our knowledge graph for 454 European value chains. The primary source data were unstructured textual documents provided by territory experts working in the MOuntain Valorisation through INterconnectedness and Green growth (MOVING) European project 3 . MOVING was an H2020 project (September 2020 - August 2024) involving 23 organizations and companies that monitor, support, and conduct value chains in mountain areas. The primary project target was to collect updated and comparable knowledge on mountainous territories, with the conjecture that this would lead to a deeper understanding of the context, trends, and potential evolution of mountain communities, territories, and businesses. Moreover, this understanding would help design new policies for conservation and evolution. As a main strategy, the project proposed a bottom-up participatory process involving value chain actors, stakeholders, and policymakers to co-design European policy frameworks for resilience and sustainability. The heterogeneous MOVING community of practice monitored 454 value chains. In the first two project years (2020-2021), the territory experts studied and collected local knowledge about geography, traditions, and societal and economic aspects. Each expert independently compiled information on his/her monitored value chains. The provided information was complete from the point of view of the MOVING project scope. The experts used a socio-ecological system (SES) approach to understand the value chain contributions to the mountain areas’ resilience and sustainable development. Within the SES framework, they related the value chain processes and activities to the local natural resources, particularly those affected by climate change and major socioeconomic and demographic trends (e.g., out-migration, livelihoods, and basic-service provisioning). They prioritised land-use and land-use change indicators because most value chains were agri-food and forestry-based, heavily relying on land resources. However, they also included other regional assets when particularly relevant for the region (e.g. hydropower in Scotland, Portugal and Romania; tourism in Italy, Portugal, Spain, Scandinavian countries, Serbia, North Macedonia, Romania, and Bulgaria). The SES approach was also justified by the MOVING project’s focus on understanding the balance between economically valuable activities and environmental protection. Finding the right balance between these contrasting stressors will likely be more difficult in the near future due to the increasing number of European natural protected areas 27 . The possibility of analysing the vulnerabilities of mountainous value chains’ environments, actors, resources, governance, and processes altogether was critical in this context, and could also support decision authorities in the design of multi-actor (public and private) institutional arrangements and multi-level (local, regional, national, and European) policies.

While this approach generated valuable and new knowledge, a side effect was the non-homogeneity of the collected information, e.g., administrative codes and statistical data were sometimes missing, and the investigated territory and value chain data often did not focus on the same assets across the studies. The need for managing this heterogeneous-knowledge scenario was the primary motivation for our study. After approval by the MOVING scientific community, we automatically transformed the unstructured, expert-provided data into a semantic knowledge graph. Here, we also demonstrate - through queries in the SPARQL Protocol and RDF Query Language (SPARQL) - that this representation allows for extracting new valuable knowledge for societal, economic, and environmental monitoring and studies.

The present paper outlines a semi-automated workflow developed to convert unstructured data about European value chains ( VCs ) into a semantic knowledge graph, as depicted in Fig.  1 and elaborated in the current section.

Figure 1

Conceptual flowchart of our data preparation, augmentation, validation, and publication workflow.

Our input was a set of textual documents, each detailing practical aspects of European VCs, including economic, meteorological, climatic, ecological, cultural, and societal aspects, along with specifications about their geographical regions and nations.

During the data preparation phase, these documents were processed to create a preliminary semi-structured form of the VC narratives, organized in tables with rows corresponding to narrative events. Then, a data augmentation phase extracted, for each event, information about the mentioned places, locations, organizations, and keywords, and enriched the data with geospatial references. This enriched and structured narrative representation was then converted into a semantic knowledge graph using the OWL format ( Knowledge graph creation and publication ).

This OWL-formatted knowledge graph was subsequently published in an openly accessible online semantic triple store and visually represented through 454 Story Maps. The OWL file, being the main output of this research, is available for other researchers for import into their semantic triple stores ( Usage Notes ). It allows them to explore the rich information about European value chains that the graph encapsulates.

Data preparation

Our data collection originated from textual documents on VCs written by territory experts (researchers, members of local authorities, non-governmental organisations, producers’ and processors’ cooperatives, Local Action Groups, extension services, and others) within the MOVING European project 3 . Each involved country had from 1 to 51 associated documents (Table  1 ). The textual documents coarsely followed a textual-data collection schema designed by researchers at the University of Pisa (UniPi), who were involved in the MOVING project. As a preliminary validation, the UniPi researchers checked each expert’s document for inconsistencies in the geographical locations, primary resources, and socioeconomic assets of the reference area and value chain. When inconsistencies were identified, they sent the document back to the expert(s) for adjustment and repeated the checks on the updated document.

As an additional pre-processing step, we organised the information in the VC documents through an MS Excel table. This table contained one row for each VC and the columns corresponded to different VC aspects (Table  2 ). Some columns contained numeric values (e.g., for incomes and tourism). Other columns contained descriptions in natural language (e.g., the landscape description) or categorised information (e.g., Local Administrative Units). The table was very sparse since information on several columns was often unavailable. This table aimed to provide a first overview of the commonalities and heterogeneity between the VCs across European countries and regions. This file was the only manually prepared dataset of our workflow and the basis of the narrative building and augmentation part described in the next section. The MOVING project experts were also asked to check whether the MS Excel table correctly reported and represented the information they had provided.

As a further pre-processing step, we processed the MS Excel table to produce new tables in Comma Separated Value (CSV) format, one for each VC. Each CSV table was a rough, structured version of a VC ( VC table ). Our Figshare repository contains these files for consultation ( Data Records ). Each VC table contained 11 rows corresponding to the key events of a VC narrative (Table  3 , right-hand column). Each row corresponded to one narrative event , with an associated title and description column. To build the VC tables from the MS Excel table, we implemented a JAVA process that automatically mapped the column contents of one row of the MS Excel table onto the description column of one VC table, as sketched below. Table  3 reports this mapping. For one-to-one mappings, we directly reported the source column’s text content. When multiple columns corresponded to the same VC event, we appended the column contents through text-harmonisation rules for the conjunctions.
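A simplified Python sketch of this mapping step (the original process was implemented in JAVA; the column names and event titles below are illustrative, while the actual mapping follows Table 3):

```python
import pandas as pd

# Hypothetical subset of the column-to-event mapping of Table 3.
EVENT_MAP = {
    "Introduction": ["Country", "Reference region"],
    "Landscape": ["Landscape description"],
    "Value chain": ["Value chain description", "Main products"],
}

def excel_row_to_vc_table(row: pd.Series) -> pd.DataFrame:
    """Turn one MS Excel row (one value chain) into a VC event table."""
    events = []
    for title, columns in EVENT_MAP.items():
        texts = [str(row[c]) for c in columns if pd.notna(row.get(c))]
        events.append({"title": title, "description": ". ".join(texts)})
    return pd.DataFrame(events, columns=["title", "description"])

# Usage: df = pd.read_excel("value_chains.xlsx")
# excel_row_to_vc_table(df.iloc[0]).to_csv("vc_0001.csv", index=False)
```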

This mapping process produced 454 VC tables, which were the input to the subsequent augmentation phase.

Data augmentation

In the present section, we describe all data augmentation steps in our workflow and eventually report the corresponding algorithm pseudo-codes.

Named entity extraction

Our workflow used a named entity extraction module we implemented in JAVA. This module characterised each event in the VC narrative with abstract or physical objects mentioned in the event description texts ( named entities ). The module used the NLPHub service 28 , a cloud computing service that coordinates and consolidates the results of various state-of-the-art text-mining processes integrated within the D4Science e-Infrastructure 29 , 30 , 31 . In our workflow, we set the NLPHub to identify entities of types location , person , and organisation , plus the keywords of the text. Keywords were individual words or compound terms particularly meaningful within their respective contexts. The NLPHub exploited the D4Science cloud computing platform (named DataMiner 31 ) to efficiently manage the processing of ∼ 5000 event texts in our dataset via distributed and concurrent cloud processing. The named entity extraction module augmented each VC table with one additional column ( named entities ) reporting a comma-separated list of named entities (and keywords) associated with each event.
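Schematically, the module's interaction with the service can be pictured as below; note that the endpoint URL and payload schema are placeholders, not the real NLPHub API:

```python
import requests

# Placeholder endpoint: the real NLPHub service is hosted on the
# D4Science e-Infrastructure, and its actual API differs from this sketch.
NLPHUB_URL = "https://example.org/nlphub/annotate"

def extract_entities(text: str) -> list:
    """Request location/person/organisation entities and keywords for one
    event description (illustrative payload schema)."""
    payload = {"text": text,
               "annotations": ["location", "person", "organization", "keyword"]}
    response = requests.post(NLPHUB_URL, json=payload, timeout=60)
    response.raise_for_status()
    return response.json().get("entities", [])
```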

Wikidata entry association

We used the named entities extracted by the previous step as the input of queries to the Wikidata semantic service’s SPARQL endpoint 32 . A JAVA process executed these SPARQL queries to Wikidata to check if each narrative-event entity could correspond to a Wikidata entry. One special rule was adopted for location -type entities. By convention, Wikidata describes location -type entries with the first letter capitalised. Our process used this convention to check for the existence of Wikidata entries associated with location -type named entities.

In the case of a correspondence found, the process extracted the entry’s Wikidata Internationalized Resource Identifier (IRI). The IRI is part of the information the Wikidata SPARQL response returns for an entry, and it persists even after the entry content is updated. For instance, the “Alps” entity had the following Wikidata entry IRI associated: https://www.wikidata.org/wiki/Q1286 , which corresponds to the Q1286 identifier.

As an additional step, our process checked the consistency of the Wikidata entry retrieved. In particular, it explored the entry-associated Wikipedia pages. For a Wikidata entry to be valid, its associated Wikipedia pages should not correspond to (i) a disambiguation page, (ii) a page with a title not matching the named entity, or (iii) a page referring to a different named entity type. For example, the Wikipedia page associated with a location -type named entity had to correspond to a location. This check distinguished cases like Tours (the French city) from tours (journeys in the area). These rules overall improved the precision of the association between a Wikidata entry and a named entity, i.e., a validated Wikidata entry likely had the same meaning as the named entity.

At the end of the Wikidata entry retrieval and consistency check, our workflow added one column (named IRIs ) to every VC table. This column contained, for each event, the valid IRIs of the event’s entities. Entities without a valid IRI associated were discarded because they brought the risk of introducing false topics in the narratives.
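The Wikidata lookup can be reproduced against the public SPARQL endpoint. A minimal sketch (the Wikipedia-page consistency checks described above are omitted):

```python
import requests

WDQS = "https://query.wikidata.org/sparql"

def wikidata_iri(entity: str, lang: str = "en") -> str | None:
    """Return the IRI of the Wikidata entry whose label matches the
    named entity exactly, or None if no entry is found."""
    query = f'SELECT ?item WHERE {{ ?item rdfs:label "{entity}"@{lang} . }} LIMIT 1'
    response = requests.get(WDQS, params={"query": query},
                            headers={"Accept": "application/sparql-results+json"})
    response.raise_for_status()
    rows = response.json()["results"]["bindings"]
    return rows[0]["item"]["value"] if rows else None

# wikidata_iri("Alps") -> "http://www.wikidata.org/entity/Q1286"
```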

Geometry association

As an additional data augmentation step, a Python process added a new column (named geometry ) to each VC table containing spatial representations for the location -type entities. The process checked each valid location -type entity for having a corresponding coordinate pair in the associated Wikidata entry. In particular, it retrieved the Wikidata “coordinate location” property (P625) content as a reference longitude-latitude coordinate pair. Moreover, the process also checked if a polygon was possibly associated with the entity. To this aim, it used an instance of the open-access QLever endpoint of the University of Freiburg 33 to retrieve a possible polygon representation from the OpenStreetMap subgraph included in this large knowledge graph. QLever is a SPARQL engine capable of efficiently indexing and querying large knowledge graphs (even with over 100 billion triples) such as Wikidata, Wikimedia Commons, OpenStreetMap, UniProt, PubChem, and DBLP 34 . The University of Freiburg populated a large knowledge graph with these sources. Our process reported all geometries found on the QLever service as Well-Known Text (WKT) formatted strings 35 . The first VC event ( Introduction ), was always assigned the country’s polygon and centroid. Our process added the found entities’ geometries to the geometry column of their associated events. It reported both the polygon and point representations when existing. All geometries reported by our workflow used the EPSG:4326 geodetic coordinate system for World (equirectangular projection).
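Continuing the previous sketch (reusing its imports and the WDQS endpoint), the point geometry of a validated entry can be retrieved through its “coordinate location” property (P625); polygon retrieval from the QLever/OpenStreetMap graph follows the same query pattern against a different endpoint:

```python
def wikidata_point(qid: str) -> str | None:
    """Fetch the P625 coordinate of a Wikidata entry as a WKT point string."""
    query = f"SELECT ?coord WHERE {{ wd:{qid} wdt:P625 ?coord . }} LIMIT 1"
    response = requests.get(WDQS, params={"query": query},
                            headers={"Accept": "application/sparql-results+json"})
    response.raise_for_status()
    rows = response.json()["results"]["bindings"]
    return rows[0]["coord"]["value"] if rows else None  # e.g. "Point(10.0 46.5)"

# wikidata_point("Q1286") -> WKT point in EPSG:4326 (longitude-latitude order)
```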

Representation of Local Administrative Units

The expert-provided data also included the indications of the 2-level Local Administrative Units 36 (LAUs) of the municipalities covered by each VC (Table  2 ). A VC could span more than one municipality and often had several LAUs associated. Eurostat, the statistical office of the European Union, has been producing regional statistics for these areas since 2003 37 . Different LAUs can form one “Nomenclature of Territorial Unit for Statistics” (NUTS), for which Eurostat produces additional statistics. These statistics help assess trends for local community typologies (rural, suburban, and urban), urbanisation degree (city, town and suburb, rural), functional urban areas (cities and their surrounding commuting zones), and coastal areas.

Our workflow included a Python process to retrieve a geometry representation of the VC-associated LAUs (as WKT strings). The process searched for a polygonal representation of each LAU code in two structured files published by Eurostat in their Geographic Information System of the Commission (GISCO) 38 . GISCO is an EU-funded geographic information system that includes data on administrative boundaries and thematic information (e.g., population data) at the levels of European nations and regions. The first GISCO file our process used was a GeoJSON file 39 containing all WKT polygon representations of the Eurostat-monitored LAUs. However, the experts often reported NUTS codes instead of LAU codes. Therefore, if a polygon representation could not be found for one LAU code, our process searched for the same code in a second GISCO GeoJSON file containing NUTS polygon representations 40 . Since different countries could use the same LAU and NUTS codes for different territories, our process used the VC’s belonging country code (e.g., IT, ES, UK) for disambiguation.

Our process found correspondences for all LAU and NUTS codes associated with our VCs (1224 total). It augmented each VC table’s geometry column with LAU (or NUTS) geometries repeated for each event. It represented all geometries with equirectangular projection, also used in GISCO.
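A sketch of the LAU/NUTS lookup; the GISCO property names used below ("LAU_ID", "NUTS_ID", "CNTR_CODE") reflect the published GISCO schemas but should be treated as assumptions:

```python
import json

def lau_geometry(code: str, country: str, lau_path: str, nuts_path: str):
    """Return the GeoJSON geometry of a LAU code, falling back to the
    NUTS file when the code is actually a NUTS identifier. The country
    code disambiguates identical codes used by different countries."""
    for path, key in ((lau_path, "LAU_ID"), (nuts_path, "NUTS_ID")):
        with open(path, encoding="utf-8") as f:
            for feature in json.load(f)["features"]:
                props = feature["properties"]
                if props.get(key) == code and props.get("CNTR_CODE") == country:
                    return feature["geometry"]
    return None
```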

Geometry filtering

The geometries associated with the VC narrative events were checked for “geographical consistency” with the narrative itself. A story map set in the Austrian Alps that mentioned a cow breed also found in America might lead to the inclusion of United States regions’ entities (and thus geometries) in the story. From a narrative point of view, associating a point too distant from the VC territory would be distracting and would produce jittery paths on the map that could confuse the reader. Therefore, we decided to avoid shifts from one continent to another or between far locations in our narratives while keeping a geographically coherent focus.

A dedicated JAVA process estimated a bi-variate log-normal distribution on the longitude-latitude pairs of each narrative, including the LAU/NUTS centroids among the pairs. The process computed the upper and lower 95% log-normal confidence limits on the coordinates and considered the coordinates outside these boundaries as outliers. Consequently, if most coordinates pertained to a specific region, the calculated boundaries naturally surrounded that region. Otherwise, the boundaries likely encompassed all coordinates if these were uniformly distributed worldwide (e.g., in a global-scale narrative). We demonstrated the validity of a bi-variate log-normal distribution for estimating the primary geographical focus of a narrative in a previous work 20 . Each event in our narrative underwent outlier removal using this log-normal filter. By construction, at least the LAU/NUTS geometries remained associated with an event after the filtering. All geometries associated with an event are reported on a map during the event visualisation in a Story Map.
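A compact sketch of the filter; shifting the coordinates to a strictly positive range before taking logarithms is an assumption of this illustration, as is the NumPy setting:

```python
import numpy as np

def lognormal_outlier_filter(coords: np.ndarray, z: float = 1.96) -> np.ndarray:
    """Keep the (lon, lat) pairs inside the ~95% log-normal confidence
    region estimated from all coordinates of one narrative.

    coords: (n, 2) array of longitude-latitude pairs (EPSG:4326).
    """
    shifted = coords + np.array([181.0, 91.0])  # make all values strictly positive
    logs = np.log(shifted)
    mu = logs.mean(axis=0)
    sigma = logs.std(axis=0, ddof=1)
    inside = np.all(np.abs(logs - mu) <= z * sigma, axis=1)
    return coords[inside]
```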

Image assignment

As a final data augmentation step, our workflow assigned images to the 11 events of each VC narrative through a dedicated Python process. The image associated with the first event ( Introduction ) was always the geographical map of the VC-associated country, retrieved from Wikidata through a SPARQL query on the “locator map image” property (P242). Quantitative events such as “Income and gross value added” and “Employment” were not associated with images because their images would necessarily be conceptual, and we verified that the MOVING community did not perceive such conceptual images as meaningful. For the remaining events, we used images the MOVING project members willingly provided for each country (without copyright violation). On average, six images per country were available, which we enriched with additional region-specific images from Wikimedia Commons 41 referring to the VC territories. Our Python process randomly sampled images from the VC’s country-associated image set (without repetitions) and assigned them to the narrative events while prioritising the expert-provided images. For example, the narrative “Chestnut flour coming from the rediscovered chestnut cultivation activities in the area” was enriched with seven images of Tuscany by the MOVING members and two images of the Apuan Alps (where this chestnut flour is produced) from Wikimedia Commons.
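The sampling logic can be sketched as follows (function and variable names are illustrative):

```python
import random

def assign_images(expert_images: list, commons_images: list, n_events: int) -> list:
    """Assign images to narrative events without repetition, always
    prioritising expert-provided images over Wikimedia Commons ones."""
    if len(expert_images) >= n_events:
        return random.sample(expert_images, n_events)
    needed = n_events - len(expert_images)
    extra = random.sample(commons_images, min(needed, len(commons_images)))
    return list(expert_images) + extra
```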

In the present section, we report the algorithms of the data augmentation processes described so far.

The data augmentation algorithm for named-entity extraction and geometry association can be summarised as follows:

Algorithm 1


The algorithm translating LAU/NUTS codes into WKT strings is the following:

Algorithm 2


The geometry filtering algorithm can be summarised as follows:

Algorithm 3


The image assignment algorithms can be summarised as follows:

Algorithm 4


Knowledge graph creation and publication

Our workflow used an additional Python process to translate all augmented VC tables into a semantic knowledge graph. As a first step, this process translated the VC tables into JSON files that followed a schema we designed and optimised in a previous work 20 , 42 . This JSON representation structurally describes the event sequence and the associated entities, images, geometries, and Wikidata IRIs. Our process also stored each JSON file in a PostgreSQL-JSON database for quick retrieval and use for narrative visualisation ( Usage Notes ).

As a second step, the process translated each JSON file into a Web Ontology Language (OWL) graph file and assembled all graphs into one overall OWL graph file 43 . To this aim, it invoked a JAVA-based semantic triplifier software we implemented for this specific sub-task. The VC-individual and the overall OWL graphs complied with the OWL 2 Description Logic 44 , which assured the decidability of the language. They adhered to the Narrative Ontology model version 2.0 7 , 45 , extended with the GeoSPARQL ontology 14 , a standard of the Open Geospatial Consortium that handles geospatially explicit data in ontologies. We published the entire VC-narrative OWL graph (and sub-graphs) in the Figshare repository attached to the present paper ( Data Records ) to openly allow users to import them in a semantic triple store and query, explore, and infer knowledge on the 454 European VCs represented. This file was the main output of our workflow.

We also published the knowledge graph on a public-access Apache Jena GeoSPARQL Fuseki 46 semantic triple store to allow users to execute semantic and geospatial queries to our knowledge graph openly ( Usage Notes ). Fuseki is a SPARQL service that can internally store Resource Description Framework (RDF) data representing semantic triples consisting of a subject (node), a predicate (relation) and an object (node) of the knowledge graph. This service allows retrieving the stored RDF triples through a SPARQL/GeoSPARQL endpoint.
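Once the OWL file is imported into a triple store (or against the public Fuseki instance linked from the Figshare repository; the endpoint URL below is a placeholder), the graph can be explored with a few lines of Python. This schema-agnostic query simply counts the instances of each class:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Placeholder URL: the actual public Fuseki endpoint is linked from the
# Figshare repository described under Data Records.
sparql = SPARQLWrapper("https://example.org/fuseki/moving-vc/sparql")
sparql.setQuery("""
    SELECT ?class (COUNT(?s) AS ?n)
    WHERE { ?s a ?class }
    GROUP BY ?class
    ORDER BY DESC(?n)
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["class"]["value"], row["n"]["value"])
```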

It is important to stress that the main target of the knowledge graph was to enhance the communication about the value chains to a broad, heterogeneous audience. Our target stakeholders were value chain and territory experts, citizens, and local and national administrations. These stakeholders need an overall understanding of the value chains, their role in characterising the territory, and the criticalities and strengths of the territory. Entities such as locations, persons, organisations, and keywords - enriched with images and geometries - matched their interests. Consequently, we did not include statistics and numeric indicators among the entities because data analytics was not a target of the narratives. Moreover, the unavailability of statistical data for several value chains would have created knowledge gaps across the narratives. Therefore, we reported statistical data, when available, in an understandable and readable textual format in the event text while leaving the possibility to conduct data analytics on the tabular-format files available in the Figshare repository.

As an additional note, we clarify that semantic knowledge graphs were a more suitable choice for data representation than tabular data models. Tabular data models, such as those used in relational databases, satisfy a predefined schema. Coercing our data to rows and columns was unsuitable for quickly capturing the complex relationships between value chain entities. Moreover, tabular models hardly manage inconsistent naming conventions, formats, and ambiguous identifiers like those in our data. Although foreign keys allow for modelling rich, interconnected data, they introduce complexity in the database schema, making knowledge extraction more challenging. Moreover, as the volume of data grows, managing and querying large relational tables can become inefficient and require dedicated distributed systems. In a scenario like ours, where data were many, heterogeneous, and dynamic, we could not assume that a traditional relational schema was efficient and effective. Instead, we used Linked Data and Semantic Web technologies because they offered more flexibility in quickly extending, modifying, and interconnecting a knowledge base of diverse and heterogeneous data. Moreover, semantic graphs could intuitively represent rich and complex relationships between the data while capturing real-world facts. They also enacted interoperability through the reuse of shared vocabularies and IRIs from other semantic knowledge bases, allowing the creation of interconnected, consistent data networks. Finally, as semantic technologies are Web-native, they quickly allowed for accessing and querying data through standard Web protocols.

Data Records

We made the data available on a public-access Figshare repository 47 (currently version 3). One dataset provides the overall OWL knowledge graph for download, which allows other users to reproduce the entire knowledge base in another triple store ( Usage Notes ). This knowledge graph contains 503,963 triples. The data in the graph are also available in CSV and GeoPackage formats for easy import, manipulation, and inspection in multiple software solutions. Another dataset presents a folder hierarchy containing sub-graphs, each focusing on one VC at a time. The folder organisation is optimised for regional ecological, socioeconomic, and agricultural modelling experts. The files are organised into subfolders, each corresponding to a country. The name of each file reports the title of the corresponding value chain. Each file is in the OWL format (e.g. wood_charcoal_from_Gran_Canaria_island.owl). The complete file collection contains 454 files that can be imported independently of each other. The Figshare repository also contains all links to the Java and Python software used in our workflow.

Additionally, the repository contains a direct link to our public-access Apache Jena GeoSPARQL Fuseki instance hosting the entire VC knowledge graph. This instance allows the execution of SPARQL and GeoSPARQL queries ( Usage Notes ). The Figshare repository also contains the MS Excel file that was the input of our workflow. It allows for comparing our workflow's original and final products and for repeating the technical validation. Finally, the repository contains all VC tables in CSV format resulting from the data preparation phase. The authors received authorisation from the MOVING project community to publish this material.

Technical Validation

Formal consistency of the knowledge graph

We used a semantic reasoner to validate the logical consistency of our entire OWL graph. A semantic reasoner is software designed to infer logical consequences from a set of asserted facts or axioms. We used the Openllet open-source semantic reasoner 48,49,50 to (i) check the consistency of our OWL graph (i.e., guarantee that it did not imply contradictions), (ii) check that the class hierarchy respected that of the Narrative Ontology, (iii) test geometry consistency (polygon closures and correct WKT formatting), and (iv) test the execution of complex SPARQL and GeoSPARQL queries.

Openllet confirmed the consistency of our knowledge graph on all the checks reported above. The reasoner confirmed that the subclass relations and the complete class hierarchy fully respected those of the Narrative Ontology. The class hierarchy allowed the correct extraction of all subclasses of a class. Finally, all geometries were assessed as consistent. GeoSPARQL queries allowed the execution of spatial reasoning and all algebraic operations between sample-selected polygonal geometries from the VCs.
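
Openllet is Java-based; as an illustrative Python analogue (our assumption, not the validation code we ran), a comparable consistency check can be performed with owlready2, which bundles the Pellet reasoner from which Openllet was forked:

# Illustrative consistency check with owlready2's bundled Pellet reasoner
# (requires a local Java runtime); the file name is a placeholder.
from owlready2 import get_ontology, sync_reasoner_pellet, default_world

onto = get_ontology("file://./overall_knowledge_graph.owl").load()  # placeholder file
with onto:
    # Raises OwlReadyInconsistentOntologyError if the ontology is inconsistent.
    sync_reasoner_pellet()

# Classes inferred to be unsatisfiable (equivalent to owl:Nothing), if any:
print(list(default_world.inconsistent_classes()))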

Additionally, we executed automatic checks to ensure that no event in the OWL graph was empty or contained meaningless or misreported content (e.g., “N/A”, “Empty”, “Unavailable”, etc.). The checks also verified that every LAU had an associated WKT polygon from the GISCO database, and we manually verified that the correspondences were correct. An expert from the MOVING project also conducted a sample check to verify that the mapping between the pre-processed MS Excel table columns and the VC-narrative events (Table  3 ) produced meaningful and human-readable descriptions.
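
The following Python sketch illustrates the kind of automatic checks described above (our reconstruction, not the original validation scripts): rejecting placeholder event texts and verifying that LAU polygons parse as valid WKT.

# Sketch of the automatic content and geometry checks (assumption-based).
from shapely import wkt

PLACEHOLDERS = {"", "n/a", "empty", "unavailable"}

def event_text_is_meaningful(text: str) -> bool:
    """Reject empty or placeholder event texts such as 'N/A' or 'Unavailable'."""
    return text.strip().lower() not in PLACEHOLDERS

def lau_wkt_is_valid(wkt_string: str) -> bool:
    """Verify that a LAU geometry parses as well-formed, valid WKT."""
    try:
        geometry = wkt.loads(wkt_string)
    except Exception:
        return False  # malformed WKT (e.g., an unclosed polygon ring)
    return geometry.is_valid

assert event_text_is_meaningful("Charcoal production in the island forests.")
assert not event_text_is_meaningful("N/A")
assert lau_wkt_is_valid("POLYGON((0 0, 1 0, 1 1, 0 1, 0 0))")
assert not lau_wkt_is_valid("POLYGON((0 0, 1 0))")  # ring with too few points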

Performance of the named entity extraction-filtering process

We evaluated the performance of the combined process of named entity extraction plus Wikidata entry association (filtering). This process influences the results of the queries to the knowledge graph. An incomplete set of entities would indeed limit the information associated with an event. Moreover, as highlighted in the next section, it would also limit the discovery of interconnections between the events. When querying for events with several associated entities, a target event would only be selected if all entities were retrieved correctly.

To measure the quality of the named entity extraction-filtering process, we evaluated its information-extraction performance on manually annotated story events. To this aim, we selected a statistically meaningful set of events to annotate through the sample size determination formula with finite population correction, i.e.

$$n_0 = \frac{Z^2 \, p \, (1 - p)}{MOE^2}, \qquad n = \frac{n_0}{1 + \frac{n_0 - 1}{N}}$$

where n is the target sample size adjusted for a finite population; n_0 is the initial sample size assuming an infinite population; Z is the Z-score corresponding to the desired confidence level (Z = 1.96 for the 95% confidence level we used); p is the prior assessment (0.5 for uninformative conditions); MOE is the margin of error on the true error (5%); and N is the total number of events (population size).
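
For reference, the computation can be reproduced with a few lines of Python; the population size below is illustrative, not the exact number of events in our collection.

# Sample size with finite population correction (Z=1.96, p=0.5, MOE=0.05).
def sample_size(N: int, Z: float = 1.96, p: float = 0.5, MOE: float = 0.05) -> float:
    n0 = (Z ** 2) * p * (1 - p) / (MOE ** 2)  # infinite-population sample size (~384.16)
    return n0 / (1 + (n0 - 1) / N)            # finite population correction

print(round(sample_size(N=2500)))  # ~333 for an illustrative population of 2500 events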

This formula estimated that 330 events (corresponding to 30 stories) would be sufficient for performance assessment. Consequently, we randomly selected 30 stories from our collection. In this sub-collection, we identified the automatically extracted-filtered entities correctly associated with key event-related concepts ( true positives , TP). Then, we identified those unrelated to key concepts ( false positives , FP). Finally, we annotated additional entities (with valid Wikidata pages associated) from the events' texts that the extraction-filtering process missed ( false negatives , FN). Based on these annotations, we calculated the following standard performance measurements:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

The evaluation results are reported in Table 4. The high Precision (0.99) suggests that most extracted entities were correctly associable with key concepts expressed in the events. Maximising the reliability of the extracted entities was indeed the principal target of our entity extraction-filtering process. The lower Recall (0.93), instead, suggests that the extracted entity set could be incomplete, which could negatively impact multi-entity querying. However, the F1 measure (0.96) was that of a good information extraction system. Therefore, although the extraction-filtering process could be improved (which will be part of our future work), its performance was sufficiently high to reliably support our knowledge graph.

Query-based validation

We verified that our knowledge graph could contribute to discovering new knowledge from the data, which was its principal aim. In particular, we collected the types of questions the MOVING-project scientists and stakeholders considered valuable, i.e., hard to answer without a semantic knowledge representation. These questions were collected (i) during plenary meetings with the MOVING community of practice, (ii) after identifying the principal study targets of the rural-area experts involved in the project (typically value chains within their territories), and (iii) by reading project deliverables. For example, the experts' targets were the VCs sharing common environmental characteristics (e.g. rivers, lakes, vineyards, and chestnut trees), issues (e.g. depopulation, pollution, and deforestation), and similar products (e.g. cow/sheep milk and cheese). Discovering this knowledge from the data holds significant value for mountain ecosystems, as it aids in planning sustainable environmental management strategies 21. Additionally, this knowledge is valuable in supporting the long-term ecological sustainability of urban areas and in comprehending and mitigating the decline of fundamental services in mountain areas brought about by the ongoing depopulation trends 2,51,52. We demonstrated that our knowledge graph could contribute in these directions.

We focussed on ten types of knowledge-extraction targets, corresponding to ten SPARQL/GeoSPARQL queries regarding different and complementary aspects of European mountain products and their related spatial distributions. In particular, we extracted the VCs with the following characteristics:

related to vineyard products (Q1)

possibly affected by deforestation (Q2)

involving cheese with Protected Designation of Origin (PDO) certification (Q3)

producing cheese made with cow and/or goat milk (Q4)

using sheep to produce cheese (Q5)

using sheep to produce wool (Q6)

operating in the Alps (Q7)

operating around Aosta city (Italy) (Q8)

operating in Scotland (Q9)

operating around long Italian rivers (>100 km) (Q10)

The information extracted by these queries overall covered the interests of the MOVING community experts. It would indeed have been hard to extract the same information through the usual data representations and technologies adopted by this scientific community. Based on the query results, we calculated Precision, Recall, and F1. High Precision was achieved in most cases, i.e., even when the information retrieved was incomplete (mostly due to misdetection by the named entity extraction processes), the results were reliable. The performance measurements (Table 5) demonstrate the overall high quality of our knowledge graph and the general effectiveness of the queries. In the following, we report the details of the queries and the corresponding results.

Q1 - Value chains related to vineyard products

In the following, we report the SPARQL query corresponding to Q1.

SPARQL Query 1

PREFIX narra: <https://dlnarratives.eu/ontology#>
PREFIX ecrm: <http://erlangen-crm.org/current/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?title ?country WHERE {
  ?event1 narra:partOfNarrative ?narrative .
  ?narrative rdfs:label ?title .
  ?narrative narra:isAboutCountry ?countryIRI .
  ?countryIRI rdfs:label ?country .
  {
    ?event1 narra:hasEntity <https://dlnarratives.eu/resource/Q22715> .
    ?event1 narra:hasEntity <https://dlnarratives.eu/resource/Q282> .
    ?event1 narra:hasEntity <https://dlnarratives.eu/resource/Q10978> .
    ?event2 narra:partOfNarrative ?narrative .
    ?event2 narra:hasEntity <https://dlnarratives.eu/resource/Q282> .
    FILTER (?event1 != ?event2)
  } UNION {
    ?event2 narra:partOfNarrative ?narrative .
    ?event2 narra:hasEntity <https://dlnarratives.eu/resource/Q10978> .
    FILTER (?event1 != ?event2)
    ?event1 narra:hasEntity <https://dlnarratives.eu/resource/Q22715> .
    ?event3 narra:partOfNarrative ?narrative .
    ?event3 narra:hasEntity <https://dlnarratives.eu/resource/Q10978> .
    FILTER (?event1 != ?event2 && ?event1 != ?event3 && ?event2 != ?event3)
  }
} ORDER BY lcase(?country)

The query retrieves distinct titles of narratives along with their associated countries. Several ontology prefixes are specified at the beginning of the query to shorten the corresponding IRIs, which are used in the subsequent parts of the query. The SELECT statement specifies the variables (“?title” and “?country”) whose values the query will output. The WHERE clause contains the conditions that need to be satisfied for each result. It involves several semantic-triple patterns connected by the “.” operator:

The first triple pattern ("?event1 narra:partOfNarrative ?narrative") connects an event to a narrative;

The second triple pattern ("?narrative rdfs:label ?title") retrieves the label (title) of the narrative;

The third triple pattern ("?narrative narra:isAboutCountry ?countryIRI") connects the narrative to its related country;

The fourth triple pattern ("?countryIRI rdfs:label ?country") retrieves the label (name) of the country.

The subsequent UNION blocks combine pattern alternatives, each representing a condition under which events are selected. The blocks retrieve events associated with at least one entity among “vineyard” (id. Q22715), “wine” (id. Q282), and “grape” (id. Q10978). These entities were chosen with the help of an expert. They are the entities most related to vineyards in our knowledge graph. The expert was aided by an entity search tool included in our visualisation facilities ( Usage Notes ).

The sets of narrative events containing the entities above are labelled "event1", "event2", and "event3", respectively. Filters (e.g., "FILTER (?event1 != ?event2)") are applied to ensure that the matched events are distinct, so that the entities may appear in separate events.

Finally, the ORDER BY clause sorts the results alphabetically by the lowercase label of the country. If multiple sub-graphs are imported instead of the overall graph, the query should be extended with a "FROM <urn:x-arq:UnionGraph>" clause before the WHERE clause, to specify that the query operates on the union of the sub-graphs.

In summary, this query retrieves the titles of the narratives, and their associated countries, comprising events related to the "vineyard", "wine", and "grape" entities. The query produced the output reported in Table 6. To verify the correctness of the retrieved information, we manually checked, with the help of a MOVING expert, the VCs (among the 454) that contained information on vineyard products. Precision and Recall (0.93 and 0.90, respectively) were reasonably high, and F1 (0.91) indicated good retrieval performance. The main reason for Recall not reaching one was the presence of faults (false negatives) by the named entity extraction processes in detecting vineyard-related entities in the event texts. Precision was, instead, negatively affected by citations of vineyard-related products in VCs that did not focus on vineyard products (false positives).

Q2 - Value chains possibly affected by deforestation

In the following, we report the SPARQL query corresponding to Q2.

SPARQL Query 2

SELECT DISTINCT ?title ?country WHERE {
  ?event narra:partOfNarrative ?narrative .
  ?narrative narra:isAboutCountry ?countryIRI .
  ?countryIRI rdfs:label ?country .
  ?narrative rdfs:label ?title .
  ?event narra:hasEntity <https://dlnarratives.eu/resource/Q169940> .
} ORDER BY lcase(?country)

As in Q1, this query retrieves the titles and associated countries of the narratives mentioning deforestation. The notable difference from Q1 is in the WHERE clause, which retrieves the events ("?event narra:hasEntity <https://dlnarratives.eu/resource/Q169940>") having "deforestation" (id. Q169940) among the associated entities.

This query produced the result reported in Table  7 . Expert verification assessed that it retrieved the complete and correct set (Precision and Recall equal to 1) of all VCs affected by deforestation. Therefore, this query shows the value of our knowledge graph for discovering critical threats to the VCs and their territories.

Q3 - Value chains involving cheese with Protected Designation of Origin certification

In the following, we report the SPARQL query corresponding to Q3.

SPARQL Query 3

PREFIX narra: <https://dlnarratives.eu/ontology#>

SELECT DISTINCT ?title ?country WHERE {
  ?event1 narra:partOfNarrative ?narrative .
  ?narrative rdfs:label ?title .
  ?narrative narra:isAboutCountry ?countryIRI .
  ?countryIRI rdfs:label ?country .
  {
    ?event1 narra:hasEntity <https://dlnarratives.eu/resource/Q10943> .
    ?event2 narra:hasEntity <https://dlnarratives.eu/resource/Q13439060> .
    FILTER (?event1 != ?event2)
  }
} ORDER BY lcase(?country)

The query structure is similar to that of Q1, with the difference that the entities "cheese" (id. Q10943) and "Protected designation of origin" (PDO) (id. Q13439060) are used to detect events and, consequently, the target VCs.

The query produced the results reported in Table 8. This case is notable because it exposes the named entity extraction process as a potential performance bottleneck. Although the query did not produce false positives (i.e., Precision was 1), there were many false negatives due to frequently missed recognition of PDO mentions in the event texts (Recall was 0.26). One reason is that long and articulated entities like "Protected designation of origin" are often subject to misspelling, abbreviation, and native-language reporting (e.g., DOP in Italian), which prevents algorithms from identifying them. Therefore, Q3 showed a potential limitation of our knowledge graph when searching for articulated entities. However, the very complexity of these entities ensured that the results were correct when the entities were identified.

Q4 - Value chains producing cheese made with cow and/or goat milk

In the following, we report the SPARQL query corresponding to Q4.

SPARQL Query 4

SELECT DISTINCT ?title ?country WHERE {
  ?event1 narra:partOfNarrative ?narrative .
  ?narrative rdfs:label ?title .
  ?narrative narra:isAboutCountry ?countryIRI .
  ?countryIRI rdfs:label ?country .
  {
    ?event2 narra:hasEntity <https://dlnarratives.eu/resource/Q2934> .
    ?event1 narra:hasEntity <https://dlnarratives.eu/resource/Q830> .
    ?event1 narra:hasEntity <https://dlnarratives.eu/resource/Q11748378> .
    ?event2 narra:hasEntity <https://dlnarratives.eu/resource/Q11748378> .
    ?event2 narra:hasEntity <https://dlnarratives.eu/resource/Q830> .
    ?event1 narra:hasEntity <https://dlnarratives.eu/resource/Q10943> .
    ?event3 narra:hasEntity <https://dlnarratives.eu/resource/Q2934> .
    ?event4 narra:partOfNarrative ?narrative .
    ?event4 narra:hasEntity <https://dlnarratives.eu/resource/Q830> .
    FILTER (?event1 != ?event2 && ?event1 != ?event3 && ?event2 != ?event3 && ?event1 != ?event4 && ?event2 != ?event4 && ?event3 != ?event4)
  }
} ORDER BY lcase(?country)

This query searches for narratives bound to four entities: "cheese" (id. Q10943), "cow" (id. Q11748378), "goat" (id. Q2934), and "cattle" (id. Q830). The query structure is similar to that of Q1.

The query produced the results reported in Table 9. The results were still affected by the named entity extraction bottleneck, because the query's success depended on the correct identification of all four terms in a narrative. Compared to Q3, this query tested the retrieval of multiple, simpler terms. Recall (0.55) was indeed higher than that of the articulated-entity search of Q3 (0.26), with Precision at 0.84.

Q5-Q6 - Value chains using sheep to produce cheese vs wool

In the following, we report the SPARQL queries corresponding to Q5 and Q6.

SPARQL Query 5 - VCs using sheep to produce cheese

SELECT DISTINCT ?title ?country WHERE {
  ?event1 narra:partOfNarrative ?narrative .
  ?narrative rdfs:label ?title .
  ?narrative narra:isAboutCountry ?countryIRI .
  ?countryIRI rdfs:label ?country .
  {
    ?event1 narra:hasEntity <https://dlnarratives.eu/resource/Q7368> .
    ?event2 narra:hasEntity <https://dlnarratives.eu/resource/Q10943> .
    FILTER (?event1 != ?event2)
  }
} ORDER BY lcase(?country)

SPARQL Query 6 - VCs using sheep to produce wool

SELECT DISTINCT ?title ?country WHERE {
  ?event1 narra:partOfNarrative ?narrative .
  ?narrative rdfs:label ?title .
  ?narrative narra:isAboutCountry ?countryIRI .
  ?countryIRI rdfs:label ?country .
  {
    ?event1 narra:hasEntity <https://dlnarratives.eu/resource/Q7368> .
    ?event2 narra:hasEntity <https://dlnarratives.eu/resource/Q42329> .
    FILTER (?event1 != ?event2)
  }
} ORDER BY lcase(?country)

These queries have the same structure as Q1. They share one entity, "sheep" (id. Q7368), combined with two different usages (corresponding to the different entities used in the WHERE clauses), i.e., "cheese" (id. Q10943) in Q5 and "wool" (id. Q42329) in Q6.

The results are reported in Tables 10 and 11. The performance measurements of the two queries were very similar. The false positives, which affected Precision, were due to mentions of sheep in other VCs that did not concern cheese or wool production. Notably, although the two queries retrieved mostly different VCs, the fraction of correct narratives retrieved (Precision) was 0.69 in each case. Moreover, the Recall values (0.72 and 0.75, respectively) were similar, due to similar fractions of undetected mentions (false negatives) of cheese and wool by the named entity extraction processes. Overall, an F1 of ∼0.70 for both queries indicated moderate-to-high reliability of the results.

Q7 - Value chains operating in the Alps

In the following, we report the GeoSPARQL query corresponding to Q7.

GeoSPARQL Query 7

PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX osm: <https://www.openstreetmap.org/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX osm2rdfkey: <https://osm2rdf.cs.uni-freiburg.de/rdf/key#>

SELECT ?nlabel ?clabel ?wktLau WHERE {
  ?narra narra:isAboutCountry ?country ;
         narra:isAboutLAU ?lau ;
         rdfs:label ?nlabel .
  ?country rdfs:label ?clabel .
  ?lau geo:hasGeometry ?glau .
  ?glau geo:asWKT ?wktLau .
  {
    SELECT ?wkt WHERE {
      SERVICE <https://qlever.cs.uni-freiburg.de/api/osm-planet> {
        ?osm_id osm2rdfkey:wikidata wd:Q1286 ;
                geo:hasGeometry ?geometry .
        ?geometry geo:asWKT ?wkt .
      }
    }
  }
  FILTER(geof:sfIntersects(?wktLau, ?wkt)) .
}

The query retrieves the VC narrative titles, countries, and LAU polygons that overlap a polygon defining the Alps region. A value chain's LAUs define the main areas where the VC operates (i.e., produces and sells products). The query internally calls the QLever endpoint provided by the University of Freiburg (Section Geometry association ), and in particular the OpenStreetMap ("osm") subgraph, to define the Alps polygonal region. The SELECT statement specifies the variables "nlabel" (narrative title), "clabel" (country name), and "wktLau" (LAU geometry in WKT format) that will be the output of the query. The WHERE clause contains the conditions that should be satisfied by each result. Unlike the previous queries, it includes the following patterns:

The triple pattern "?narra narra:isAboutLAU ?lau" connects a narrative to the corresponding LAU;

the triple pattern "?lau geo:hasGeometry ?glau" retrieves the geometry of the LAU;

the triple pattern "?glau geo:asWKT ?wktLau" retrieves the WKT description of the LAU geometry;

A nested SELECT clause retrieves the WKT description ("?wkt") under the following WHERE conditions:

The SERVICE keyword is used to invoke the external QLever endpoint ("https://qlever.cs.uni-freiburg.de/api/osm-planet");

The triple pattern "?osm_id osm2rdfkey:wikidata wd:Q1286" retrieves the instance corresponding to the QLever entity "Alps" (wd:Q1286);

The triple pattern "?osm_id geo:hasGeometry ?geometry" retrieves the geometry object of "Alps";

The triple pattern "?geometry geo:asWKT ?wkt" retrieves the WKT format of the "Alps" geometry.

A final FILTER clause computes the intersection between the LAU and the "Alps" geometries and retains all LAU geometries intersecting "Alps". The set of LAUs returned by this query can be imported into a Geographic Information System (GIS) visualiser and overlapped with the reference region (Fig. 2).

Figure 2. Comparison between the Alps region (red polygon) and the Local Administrative Units (orange polygons) of the value chains operating in this region. An interactive map is also provided in the Figshare repository associated with the present article.

The expert’s evaluation highlighted that the LAUs this query retrieved were correct and complete (Precision and Recall were 1). Therefore, the query was valuable in retrieving region-specific VCs.
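
As a side note, the WKT polygons returned by queries like Q7 can be loaded into standard geospatial tooling; the following hedged Python sketch builds a GeoPackage from illustrative query output (the column names mirror the query's output variables).

# Loading WKT polygons from a query result into a GeoDataFrame (illustrative data).
import geopandas as gpd
import pandas as pd
from shapely import wkt

rows = [  # illustrative rows shaped like the (?nlabel, ?clabel, ?wktLau) output
    {"nlabel": "Example VC", "clabel": "Italy",
     "wktLau": "POLYGON((7.0 45.5, 7.5 45.5, 7.5 46.0, 7.0 46.0, 7.0 45.5))"},
]
df = pd.DataFrame(rows)
gdf = gpd.GeoDataFrame(df, geometry=df["wktLau"].apply(wkt.loads), crs="EPSG:4326")
gdf.to_file("alps_vcs.gpkg", driver="GPKG")  # openable in QGIS for overlay with the Alps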

Q8 - Value chains operating around Aosta city (Italy)

In the following, we report the GeoSPARQL query corresponding to Q8.

GeoSPARQL Query 8

PREFIX uom: <http://www.opengis.net/def/uom/OGC/1.0/>

SELECT ?nlabel ?clabel ?wktLau WHERE {
  {
    ?narra narra:isAboutCountry ?country ;
           narra:isAboutLAU ?lau ;
           rdfs:label ?nlabel .
    ?country rdfs:label ?clabel .
    ?lau geo:hasGeometry ?glau .
    ?glau geo:asWKT ?wktLau .
  }
  FILTER(geof:sfIntersects(?wktLau,
    geof:buffer("POINT(7.3196649 45.7370885)"^^geo:wktLiteral, 0.3, uom:degree))) .
}

This query extracts the VC titles, countries, and LAU geometries of the value chains operating within a maximum distance of about 23 km from Aosta. The query structure is similar to that of Q7, with the difference that it does not use an external endpoint to retrieve the reference geometry. Instead, the FILTER clause computes the intersection between all VCs' LAU geometries and a circular buffer of 0.3 degrees ( ∼ 23 km at this latitude) around the Aosta longitude-latitude coordinates.
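
The degree-to-kilometre conversion can be sanity-checked in a couple of lines: one degree of longitude spans about 111.32 · cos(latitude) km, so 0.3 degrees corresponds to roughly 23 km at the latitude of Aosta.

# Converting the 0.3-degree buffer radius to kilometres at Aosta's latitude.
import math

lat = 45.7370885  # Aosta latitude used in the query
km_per_lon_degree = 111.32 * math.cos(math.radians(lat))
print(round(0.3 * km_per_lon_degree, 1))  # ~23.3 km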

The query produced the results visualised in Fig.  3 . As in the case of Q7, the expert’s evaluation highlighted that the LAUs retrieved by this query were correct and complete (Precision and Recall were 1). Therefore, the query was valuable in retrieving city-specific VCs.

Figure 3. Local Administrative Units (orange) of the value chains operating within a 0.3-degree circle around Aosta, Italy (red). An interactive map is also provided in the Figshare repository associated with the present article.

Q9 - Value chains operating in Scotland

In the following, we report the GeoSPARQL query corresponding to Q9.

GeoSPARQL Query 9

PREFIX osmrel: <https://www.openstreetmap.org/relation/>
PREFIX schema: <http://schema.org/>

SELECT ?nlabel ?clabel ?wktLau WHERE {
  ?narra narra:isAboutCountry ?country ;
         narra:isAboutLAU ?lau ;
         rdfs:label ?nlabel .
  ?country rdfs:label ?clabel .
  ?lau geo:hasGeometry ?glau .
  ?glau geo:asWKT ?wktLau .
  {
    SELECT ?wkt WHERE {
      SERVICE <https://qlever.cs.uni-freiburg.de/api/osm-planet> {
        ?osm_id osm2rdfkey:wikidata wd:Q22 ;
                a osm:relation ;
                geo:hasGeometry ?geometry .
        ?geometry geo:asWKT ?wkt .
      }
    }
  }
  FILTER(geof:sfWithin(?wktLau, ?wkt)) .
}

This query extracts the VC titles, countries, and LAU geometries of the value chains operating in Scotland. The query structure is still similar to that of Q7. It uses the same external QLever OpenStreetMap endpoint to retrieve the geometry of the Scotland boundaries. The FILTER clause then selects the VCs' LAU geometries that fall within these boundaries (geof:sfWithin).

The query produced the results reported in Fig.  4 . The expert’s evaluation highlighted that the LAUs this query retrieved were correct and complete (Precision and Recall were 1). Therefore, the query was valuable in retrieving country-specific VCs.

Figure 4. Local Administrative Units (orange) of the value chains operating within the polygon defining Scotland's national boundaries (red). An interactive map is also provided in the Figshare repository associated with the present article.

Q10 - Value chains operating around long Italian rivers

In the following, we report the GeoSPARQL query corresponding to Q10.

GeoSPARQL Query 10

PREFIX osmkey: <https://www.openstreetmap.org/wiki/Key:>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?nlabel ?clabel ?wktLau WHERE {
  ?narra narra:isAboutCountry ?country ;
         narra:isAboutLAU ?lau ;
         rdfs:label ?nlabel .
  ?country rdfs:label ?clabel .
  ?lau geo:hasGeometry ?glau .
  ?glau geo:asWKT ?wktLau .
  {
    SELECT ?river_osm ?river_wd ?river_name ?length ?wkt WHERE {
      SERVICE <https://qlever.cs.uni-freiburg.de/api/osm-planet> {
        ?river_osm a osm:relation ;
                   osmkey:waterway ?waterway ;
                   geo:hasGeometry ?geometry ;
                   osmkey:name ?river_name ;
                   osm2rdfkey:wikidata ?river_wd .
        ?geometry geo:asWKT ?wkt .
        SERVICE <https://qlever.cs.uni-freiburg.de/api/wikidata> {
          ?river_wd wdt:P31/wdt:P279* wd:Q4022 ;
                    wdt:P17 wd:Q38 ;
                    wdt:P2043 ?length .
          FILTER (?length > 100)
        }
      }
    } ORDER BY DESC(?length)
  }
  FILTER(geof:sfIntersects(?wktLau, ?wkt)) .
}

This query retrieves all VCs operating close to an Italian river longer than 100 km. The query structure is very similar to that of Q7 and uses the same external endpoint. The main differences are the following:

the nested SELECT clause operates on two different QLever-instance subgraphs: OpenStreetMap and Wikidata. The query retrieves the river geometries from the first; it then uses the second to retrieve the list of "rivers" (id. Q4022) located in "Italy" (id. Q38) whose "length" (id. P2043) exceeds 100 km ("FILTER (?length > 100)");

the final FILTER clause computes the intersection between the Italian rivers and the VCs' LAU geometries.

The query produced the results reported in Fig.  5 . All VCs retrieved were correct and complete (Precision and Recall were 1). Therefore, the query was valuable in retrieving river-related VCs and, by extension, could be used to extract water-basin-related VCs.

Figure 5. Local Administrative Units (orange) of the value chains intersecting Italian rivers longer than 100 km (red). An interactive map is also provided in the Figshare repository associated with the present article.

Usage Notes

Our open-access Figshare repository 47 (currently version 3) contains the entire OWL file and 454 OWL files corresponding to all VC tables. To perform semantic (non-geospatial) queries, a user can download and import the whole OWL graph (or a subset of the 454 OWL files, to focus on specific regions or value chains) into an Apache Jena Fuseki triple store instance on a local machine 46. After downloading and installing Fuseki 53, users should access the server interface through a Web browser and navigate to the "manage" section. Then, by clicking on the "new dataset" tab, they should create a new dataset, specifying a name and type (i.e., In-memory or Persistent). Next, they should select the newly created dataset and upload the entire OWL file (or a subset of the 454 OWL files) by clicking the "add data" button. Users can verify that the dataset was successfully populated by executing queries on the Fuseki SPARQL query interface. To perform geospatially explicit queries, a user should download and install Apache Jena GeoSPARQL Fuseki 54, a Fuseki version enhanced with GeoSPARQL features. Currently, this version does not have a Web interface to facilitate data import. Therefore, users should import data programmatically through the GeoSPARQL module embedded in this service 54. All geospatialised narrative event data (text, entities, geometries) in the knowledge graph are also available in CSV and GeoPackage formats, which can be imported, visualised, and manipulated through GIS software (e.g., QGIS 55).
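
For users who prefer to skip a triple store altogether, a hedged alternative is to inspect one of the per-VC OWL files directly with Python's rdflib; the file name below is the example given in the Data Records section.

# Local inspection of a per-VC OWL file with rdflib (no triple store required).
from rdflib import Graph

g = Graph()
g.parse("wood_charcoal_from_Gran_Canaria_island.owl", format="xml")
print(len(g), "triples")

# List the narrative events via the same predicate used in the queries above.
q = """
PREFIX narra: <https://dlnarratives.eu/ontology#>
SELECT ?event WHERE { ?event narra:partOfNarrative ?narrative }
"""
for row in g.query(q):
    print(row.event)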

Our Figshare repository also contains a link to an overall visualisation facility for the VC narratives in the form of interactive and navigable Story Maps 56 . This facility allows our final users and stakeholders to easily explore the value chain locations, entities, events, and images. An overall map shows the distribution of the 454 VC narratives. After clicking on one reference pin, the user is redirected to the Story Map of the corresponding VC narrative. The Story Map lets the user go through the story events while panning, zooming, and inspecting the map locations at the right-hand side of the visualisation panel. The Story Maps are also available as an offline-visualisable HTML page collection in the Figshare repository.

Each Story Map gives users access to a "Search" Web interface (through a button at the top-left of the introductory page) offering a visual search tool that executes semantic queries behind the scenes. This functionality interrogates the entire knowledge graph residing on a public-access Apache Jena GeoSPARQL Fuseki instance we currently offer and maintain 57. The "Search" interface allows users to augment the knowledge reported in one Story Map with the knowledge contained in all other narratives. The "Search" functionality uses predefined SPARQL queries to extract:

All stories in which an entity appears;

All events across all narratives in which an entity appears;

The number of occurrences of one entity across all narratives;

The entities that co-occur with one entity across all events of all narratives.

The purpose of this feature is to let all users (including those without skills in formal semantics) explore narrative interconnections.
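
As a hedged sketch of the last query type (entity co-occurrence), the following Python snippet builds and submits an equivalent SPARQL aggregation to a Fuseki endpoint; the endpoint URL is a placeholder, and the query shape is our assumption rather than the interface's exact predefined query.

# Entities co-occurring with "cheese" (Q10943) across all narrative events
# (assumption-based reconstruction of the predefined co-occurrence query).
import requests

ENDPOINT = "http://localhost:3030/moving/sparql"  # placeholder endpoint URL
entity = "https://dlnarratives.eu/resource/Q10943"  # "cheese"

query = f"""
PREFIX narra: <https://dlnarratives.eu/ontology#>
SELECT ?other (COUNT(?event) AS ?n) WHERE {{
  ?event narra:hasEntity <{entity}> .
  ?event narra:hasEntity ?other .
  FILTER (?other != <{entity}>)
}} GROUP BY ?other ORDER BY DESC(?n)
"""

resp = requests.post(ENDPOINT, data={"query": query},
                     headers={"Accept": "application/sparql-results+json"}, timeout=60)
resp.raise_for_status()
for row in resp.json()["results"]["bindings"][:10]:
    print(row["other"]["value"], row["n"]["value"])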

Code availability

Our public-access Figshare repository 47 contains all the Java and Python programs we used to execute the described workflow, along with the data inputs and outputs. It allows other scientists to repeat the technical validation. The repository also contains an interactive, offline-visualisable HTML version of the result tables and maps.

Our code and data also embed information from the following external sources: Wikidata 58 , OpenStreetMap 59 through the open-access QLever endpoint of the University of Freiburg 33 , and geographic area definitions from Eurostat-GISCO 38 , 39 , 40 .

United Nations, Department of Economic and Social Affairs, Population Division. World urbanization prospects: The 2018 revision (ST/ESA/SER.A/420). https://population.un.org/wup/Publications/Files/WUP2018-Report.pdf (2018).

United Nations. Sustainable Development Goal 11, “Make cities and human settlements inclusive, safe, resilient and sustainable”. on-line https://sdgs.un.org/goals/goal11 Accessed 4 January 2023 (2015).

MOVING H2020 Project. Mountain Valorisation through Interconnectedness and Green Growth (MOVING) Project Web site. https://www.moving-h2020.eu/ (2020).

Lehmann, F. Semantic networks. Computers & Mathematics with Applications 23 , 1–50 (1992).


Hogan, A. et al . Knowledge graphs. ACM Computing Surveys (Csur) 54 , 1–37 (2021).


Guarino, N., Oberle, D. & Staab, S. What is an ontology? Handbook on ontologies 1–17 (2009).

Meghini, C., Bartalesi, V. & Metilli, D. Representing narratives in digital libraries: The narrative ontology. Semantic Web 12 , 241–264 (2021).

CRAEFT European Project. Craft Understanding, Education, Training, and Preservation for Posterity and Prosperity (CRAEFT) Project Web site. on-line Accessed 12 July 2024 https://www.craeft.eu/ (2024).

Mingei European Project. Mingei - Representation and Preservation of Heritage Crafts Project Web site. on-line Accessed 12 July 2024 https://www.mingei-project.eu/ (2024).

IMAGO Italian PRIN Project. Index Medii Aevi Geographiae Operum (IMAGO) Project Web site. on-line https://imagoarchive.it Accessed 12 July 2024 (2024).

Doerr, M. The cidoc conceptual reference module: an ontological approach to semantic interoperability of metadata. AI magazine 24 , 75–75 (2003).


Bekiari, C. et al . Definition of FRBRoo: A conceptual model for bibliographic information in object-oriented formalism. International Federation of Library Associations and Institutions (IFLA) repository https://repository.ifla.org/handle/123456789/659 (2017).

Pan, F. & Hobbs, J. R. Time ontology in owl. W3C working draft, W3C 1 , 1 (2006).

Battle, R. & Kolas, D. Geosparql, enabling a geospatial semantic web. Semantic Web Journal 3 , 355–370 (2011).

Thanos, C., Meghini, C., Bartalesi, V. & Coro, G. An exploratory approach to data driven knowledge creation. Journal of Big Data 10 , 1–15 (2023).

McInerny, G. J. et al . Information visualisation for science and policy: engaging users and avoiding bias. Trends in ecology & evolution 29 , 148–157 (2014).

Bruner, J. The narrative construction of reality. Critical inquiry 18 , 1–21 (1991).

Taylor, C. Sources of the self: The making of the modern identity (Harvard University Press, 1992).

Delafield-Butt, J. T. & Trevarthen, C. The ontogenesis of narrative: from moving to meaning. Frontiers in psychology 6 , 1157 (2015).


Bartalesi, V., Coro, G., Lenzi, E., Pagano, P. & Pratelli, N. From unstructured texts to semantic story maps. International Journal of Digital Earth 16 , 234–250 (2023).


Bartalesi, V. et al . Using semantic story maps to describe a territory beyond its map. Semantic web (Online) 1–18, https://doi.org/10.3233/SW-233485 (2023).

Caquard, S. & Cartwright, W. Narrative cartography: From mapping stories to the narrative of maps and mapping. The Cartographic Journal 51 , 101–106, https://doi.org/10.1179/0008704114Z.000000000130 (2014).

Peterle, G. Carto-fiction: narrativising maps through creative writing. Social & Cultural Geography 20 , 1070–1093, https://doi.org/10.1080/14649365.2018.1428820 (2019).

Bartalesi, V., Metilli, D., Pratelli, N. & Pontari, P. Towards a knowledge base of medieval and renaissance geographical latin works: The imago ontology. Digital Scholarship in the Humanities https://doi.org/10.1093/llc/fqab060 (2021).

Korzybski, A. A non-aristotelian system and its necessity for rigour in mathematics and physics. In Science and sanity: an introduction to non-Aristotelian systems and general semantics (Lancaster, 1933).

Figshare. How Figshare aligns with the FAIR principles. Figshare Web site https://help.figshare.com/article/how-figshare-aligns-with-the-fair-principles (2024).

European Environment Agency. Terrestrial protected areas in Europe. https://www.eea.europa.eu/en/analysis/indicators/terrestrial-protected-areas-in-europe (2023).

Coro, G., Panichi, G., Pagano, P. & Perrone, E. Nlphub: An e-infrastructure-based text mining hub. Concurrency and Computation: Practice and Experience 33 , e5986 (2021).

Assante, M. et al . Enacting open science by d4science. Future Generation Computer Systems 101 , 555–563 (2019).

Coro, G., Candela, L., Pagano, P., Italiano, A. & Liccardo, L. Parallelizing the execution of native data mining algorithms for computational biology. Concurrency and Computation: Practice and Experience 27 , 4630–4644 (2015).

Coro, G., Panichi, G., Scarponi, P. & Pagano, P. Cloud computing in a distributed e-infrastructure using the web processing service standard. Concurrency and Computation: Practice and Experience 29 , e4219 (2017).

Wikidata. SPARQL entity retrieval specifications and examples. https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples (2024).

University of Freiburg. QLever instance. https://qlever.cs.uni-freiburg.de/ (2024).

Bast, H. & Buchhold, B. QLever: A query engine for efficient SPARQL+text search. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, 647-656 (2017).

Open Geospatial Consortium. Well-known text representation of coordinate reference systems. https://www.ogc.org/standard/wkt-crs/ (2024).

Eurostat. Local Administrative Units. https://ec.europa.eu/eurostat/web/nuts/local-administrative-units (2024).

Brandmueller, T., Schäfer, G., Ekkehard, P., Müller, O. & Angelova-Tosheva, V. Territorial indicators for policy purposes: Nuts regions and beyond. Regional Statistics 7 , 78–89 (2017).

Eurostat. GISCO - the Geographic Information System of the COmmission. https://ec.europa.eu/eurostat/web/gisco (2023).

Eurostat. LAU descriptions in GeoJSON format. https://ec.europa.eu/eurostat/web/gisco/geodata/statistical-units/local-administrative-units (2024).

Eurostat. NUTS descriptions in GeoJSON format. https://ec.europa.eu/eurostat/web/gisco/geodata/statistical-units/territorial-units-statistics (2024).

Wikimedia Commons. https://commons.wikimedia.org/ Retrieved June 2 (2012).

Bartalesi, Valentina. JSON schema of the internal SMBVT data representation. https://dlnarratives.eu/schema%20JSON.json (2022).

W3C Consortium. RDF 1.1: On Semantics of RDF Datasets. https://www.w3.org/TR/rdf11-datasets/#bib-RDF11-MT (2024).

Ciccarese, P. & Peroni, S. The collections ontology: creating and handling collections in owl 2 dl frameworks. Semantic Web 5 , 515–529 (2014).

Meghini, C., Bartalesi, V., Metilli, D., Lenzi, E. & Pratelli, N. Narrative ontology. https://dlnarratives.eu/ontology/ (2024).

The Apache Software Foundation. Apache Jena Fuseki (2014).

Bartalesi, V. et al . Figshare collection: A Knowledge Graph of European Mountain Territory and Value Chain data. FigShare https://doi.org/10.6084/m9.figshare.c.7098079 (2024).

Openllet. Openllet: An Open Source OWL DL reasoner for Java. https://github.com/Galigator/openllet (2023).

DuCharme, B. Learning SPARQL: querying and updating with SPARQL 1.1 (O'Reilly Media, 2013).

Lam, A. N., Elvesæter, B. & Martin-Recuerda, F. A performance evaluation of OWL 2 DL reasoners using ORE 2015 and very large bio ontologies. In Proceedings of DMKG2023: 1st International Workshop on Data Management for Knowledge Graphs, May 28, 2023, Hersonissos, Greece (2023).

United Nations. 68% of the world population projected to live in urban areas by 2050, says UN. on-line. https://www.un.org/en/desa/68-world-population-projected-live-urban-areas-2050-says-un (2018).

Dax, T. & Copus, A. European rural demographic strategies: Foreshadowing post-lisbon rural development policy? World 3 , 938–956 (2022).

The Apache Software Foundation. Apache Jena Fuseki. https://jena.apache.org/documentation/fuseki2/ (2024).

The Apache Software Foundation. GeoSPARQL Fuseki. https://jena.apache.org/documentation/geosparql/geosparql-fuseki.html (2024).

QGIS. Software download. https://qgis.org/download/ (2024).

Bartalesi, V. et al . Figshare collection: Visualisation of the MOVING 454 Story Maps. FigShare https://figshare.com/articles/online_resource/Moving_454_Storymaps/25334272?backTo=/collections/_/7098079 (2024).

Bartalesi, V., Lenzi, E. & Pratelli, N. Figshare collection: Fuseki instance for knowledge graph querying. FigShare https://tool.dlnarratives.eu/Moving_454_Storymaps/geosparql.html (2024).

Wikidata. Wikidata Web site. https://www.wikidata.org/wiki/Wikidata:Main_Page (2024).

OpenStreetMap. OpenStreetMap data (Protocolbuffer Binary Format) in RDF representation enhanced by GeoSPARQL triples. https://planet.openstreetmap.org/pbf/planet-240701.osm.pbf (2024).


Acknowledgements

This work has received funding from the European Union’s Horizon 2020 research and innovation programme under the MOVING project (grant agreement no 862739). The authors wish to thank all MOVING-project partners who have contributed to the source data used for the present article.

Author information

Authors and Affiliations

Istituto di Scienza e Tecnologie dell’Informazione “A. Faedo” del Consiglio Nazionale delle Ricerche (ISTI-CNR), Pisa, 56124, Italy

Valentina Bartalesi, Gianpaolo Coro, Emanuele Lenzi, Nicolò Pratelli & Pasquale Pagano

Dipartimento di Scienze Agrarie, Alimentari e Agro-ambientali, dell’Università di Pisa, Pisa, 56124, Italy

Michele Moretti & Gianluca Brunori


Contributions

V.B. was one of the main developers of the Narrative Ontology; she orchestrated the experiment and conducted the validation. G.C. designed and developed the workflow for data augmentation and co-orchestrated the experiment. M.M. designed and developed the input MS Excel table collecting all information from the MOVING partners. N.P. designed and developed the LAU/NUTS conversion through GISCO. E.L. designed and developed the Story Map visualisation and prepared the OWL graph instance on the Apache Jena Fuseki service. P.P. and G.B. supervised and supported the experiment through the MOVING project funding. All authors reviewed the manuscript.

Corresponding author

Correspondence to Gianpaolo Coro .

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Table

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .


About this article

Cite this article

Bartalesi, V., Coro, G., Lenzi, E. et al. A Semantic Knowledge Graph of European Mountain Value Chains. Sci Data 11 , 978 (2024). https://doi.org/10.1038/s41597-024-03760-9


Received: 22 March 2024

Accepted: 07 August 2024

Published: 07 September 2024

DOI: https://doi.org/10.1038/s41597-024-03760-9


