Assignment 2: Exploratory Data Analysis

In groups of 3-4, identify a dataset of interest and perform exploratory analysis in Tableau to understand the structure of the data, investigate hypotheses, and develop preliminary insights. Prepare a PDF or Google Slides report using this template outline: include a set of 10 or more visualizations that illustrate your findings, one summary “dashboard” visualization, as well as a write-up of your process and what you learned.

Submit your group’s report URL and, individually, your peer assessments for A2 by Monday 10/1, 11:59pm.

Week 1: Data Selection

First, choose a topic of interest to you and find a dataset that can provide insights into that topic. See below for recommended datasets to help you get started.

If working with a self-selected dataset, please check with the course staff to ensure it is appropriate for this assignment. Be advised that data collection and preparation (also known as data wrangling) can be a very tedious and time-consuming process. Be sure you have sufficient time to conduct exploratory analysis, after preparing the data.

After selecting a topic and dataset – but prior to analysis – you should write down an initial set of at least three questions you’d like to investigate.

Week 2: Exploratory Visual Analysis

Next, perform an exploratory analysis of your dataset using Tableau. You should consider two different phases of exploration.

In the first phase, you should seek to gain an overview of the shape & structure of your dataset. What variables does the dataset contain? How are they distributed? Are there any notable data quality issues? Are there any surprising relationships among the variables? Be sure to also perform “sanity checks” for patterns you expect to see!

In the second phase, you should investigate your initial questions, as well as any new questions that arise during your exploration. For each question, start by creating a visualization that might provide a useful answer. Then refine the visualization (by adding additional variables, changing sorting or axis scales, filtering or subsetting data, etc.) to develop better perspectives, explore unexpected observations, or sanity check your assumptions. You should repeat this process for each of your questions, but feel free to revise your questions or branch off to explore new questions if the data warrants.

Final Deliverable

Your final submission will be a written report: 10 or more captioned “quick and dirty” Tableau visualizations outlining your most important insights, and one summary Tableau “dashboard” visualization that addresses one (or more) of your chosen hypotheses. The “dashboard” may have multiple charts to communicate your main findings, but it is a static image (so design and label accordingly). Focus on the answers to your initial questions, but be sure to also describe surprises as well as challenges encountered along the way, e.g. data quality issues.

Each visualization image should be accompanied by a title and a short caption (1-2 sentences). Provide sufficient detail in each caption such that anyone could read through your report and understand your findings. Feel free to annotate your images to draw attention to specific features of the data, keeping in mind the visual principles we’ve learned so far.

To easily export images from Tableau, use the Worksheet > Export > Image… menu item.

Grading Criteria

  • Poses clear questions applicable to the chosen dataset.
  • Appropriate data quality assessment and transformations.
  • Breadth of analysis, exploring multiple questions.
  • Depth of analysis, with appropriate follow-up questions.
  • Expressive & effective visualizations appropriate to analysis questions.
  • Clearly written, understandable captions that communicate primary insights.

Data Sources

  • NYC Open Data: data on NYC trees, taxis, subway, citibike, 311 calls, land lot use, etc.
  • data.gov: everything from hourly precipitation and fruit & vegetable prices to crime reports and electricity usage.
  • Dataset Search by Google Research: indexes public open datasets.
  • Stanford Open Policing Dataset
  • Physician Medicare Data
  • Civil Rights Data Collection
  • Data Is Plural newsletter’s Structured Archive: a spreadsheet of public datasets ranging from curious to wide-reaching, e.g. “How often do Wikipedia editors edit?” and “Four years of rejected vanity license plate requests”.
  • Yelp Open Dataset
  • U.S. Census Bureau: use their Discovery Tool.
  • US Health Data: central searchable repository of US health data (Centers for Disease Control and National Center for Health Statistics), e.g. surveys on pregnancy, cause of death, health care access, obesity, etc.
  • International Monetary Fund
  • IPUMS.org: integrated census & survey data from around the world.
  • Federal Elections Commission: campaign finance & expenditures.
  • Stanford Mass Shootings in America Project: data up to 2016, with pointers to alternatives.
  • USGS Earthquake Catalog
  • Federal Aviation Administration
  • FiveThirtyEight Data: datasets and code behind fivethirtyeight.com.
  • ProPublica Data Store: datasets collected by ProPublica or obtained via FOIA requests, e.g. Chicago parking ticket data.
  • Machine Learning Repository: a large variety of maintained datasets.
  • Socrata Open Data
  • 17 places to find datasets for data science projects
  • Awesome Public Datasets (GitHub): topic-centric list of high-quality open datasets in public domains.
  • Open Syllabus: 6,059,459 syllabi.

Tableau Resources

Tableau Training: specifically, the Getting Started video and the Visual Analytics section are most helpful when first getting off the ground.

Build-It-Yourself Exercises: specifically, the sections Build Charts and Analyze Data > Build Common / Advanced Chart Types and Build Data Views From Scratch > Analyze Data are a good documentation resource.

Drawing With Numbers: a blog with example walkthroughs in its Visualizations section; its Tableau Wiki page has a number of useful links for the most common advanced exploratory / visualization Tableau techniques.

Additional Tools

Your dataset almost certainly will require reformatting, restructuring, or cleaning before visualization. Here are some tools for data preparation:

  • Tableau includes basic functionality for data import, transformation & blending.
  • R with the ggplot2 library.
  • Python Jupyter notebooks with libraries such as Altair or Matplotlib.
  • Trifacta Wrangler: an interactive tool for data transformation & visual profiling.
  • OpenRefine: a free, open-source tool for working with messy data.
  • JavaScript data utilities, e.g. the Datalib JS library via Vega.
  • Pandas: data table and manipulation utilities for Python.
  • Or, the programming language and tools of your choice.
  • Or, the programming language and tools of your choice.

Assignment 2: Exploratory Data Analysis


Assignment Due: April 18, 2016


A wide variety of digital tools have been designed to help users visually explore data sets and confirm or disconfirm hypotheses about the data. The task in this assignment is to use existing software tools to formulate and answer a series of specific questions about a data set of your choice. After answering the questions you should create a final visualization that is designed to present the answer to your question to others. You should maintain a web notebook that documents all the questions you asked and the steps you performed from start to finish. The goal of this assignment is not to develop a new visualization tool, but to understand better the process of exploring data using off-the-shelf visualization tools.

Here is one way to start.

  • Step 1. Pick a domain that you are interested in. Some good possibilities might be the physical properties of chemical elements, the types of stars, or the human genome. Feel free to use an example from your own research, but do not pick an example that you already have created visualizations for.
  • Step 2. Pose an initial question that you would like to answer. For example: Is there a relationship between melting point and atomic number? Are the brightness and color of stars correlated? Are there different patterns of nucleotides in different regions in human DNA?
  • Step 3. Assess the fitness of the data for answering your question. Inspect the data--it is invariably helpful to first look at the raw values. Does the data seem appropriate for answering your question? If not, you may need to start the process over. If so, does the data need to be reformatted or cleaned prior to analysis? Perform any steps necessary to get the data into shape prior to visual analysis.

You will need to iterate through these steps a few times. It may be challenging to find interesting questions and a dataset that has the information that you need to answer those questions.

Exploratory Analysis Process

After you have an initial question and a dataset, construct a visualization that provides an answer to your question. As you construct the visualization you will find that your question evolves - often it will become more specific. Keep track of this evolution and the other questions that occur to you along the way. Once you have answered all the questions to your satisfaction, think of a way to present the data and the answers as clearly as possible. In this assignment, you should use existing visualization software tools. You may find it beneficial to use more than one tool.

Before starting, write down the initial question clearly. And, as you go, maintain a wiki notebook of what you had to do to construct the visualizations and how the questions evolved. Include in the notebook where you got the data, and documentation about the format of the dataset. Describe any transformations or rearrangements of the dataset that you needed to perform; in particular, describe how you got the data into the format needed by the visualization system. Keep copies of any intermediate visualizations that helped you refine your question. After you have constructed the final visualization for presenting your answer, write a caption and a paragraph describing the visualization, and how it answers the question you posed. Think of the figure, the caption and the text as material you might include in a research paper.

Your assignment must be posted to the wiki before class on April 18, 2016.

You should look for data sets online in convenient formats such as Excel or a CSV file. The web contains a lot of raw data. In some cases you will need to convert the data to a format you can use. Format conversion is a big part of visualization research so it is worth learning techniques for doing such conversions. Although it is best to find a data set you are especially interested in, here are pointers to a few datasets: Online Datasets

Visualization Software

To create the visualizations, we will be using Tableau, a commercial visualization tool that supports many different ways to interact with the data. Tableau has given us licenses so that you can install the software on your own computer. One goal of this assignment is for you to learn to use and evaluate the effectiveness of Tableau. Please talk to me if you think it won't be possible for you to use the tool. In addition to Tableau, you are free to also use other visualization tools as you see fit.

  • Tableau Download Instructions

How to create your wiki page

Begin by creating a new wiki page for this assignment. The title of the page should be of the form:

A2-FirstnameLastname.

The wiki syntax will look like this: *[[A2-FirstnameLastname|Firstname Lastname]]. Hit the edit button for the next section to see how I created the link for my name.

To upload images to the wiki, first create a link for the image of the form [[Image:image_name.jpg]] (replacing image_name.jpg with a unique image name for use by the server). This will create a link you can follow that will then allow you to upload the image. Alternatively, you can use the "Upload file" link in the toolbox to upload the image first, and then subsequently create a link to it on your wiki page.

Add a link to your finished reports here

Once you are finished editing the page, add a link to it here with your full name as the link text.

  • Maneesh Agrawala
  • Pedro Dantas
  • Emma Townley-Smith
  • Matthew Pick
  • Pascal Odek
  • Flavia Grey
  • Ernesto Ramirez
  • Raymond Luong
  • Haiyin Wang
  • Christina Kao
  • Leigh Hagestad
  • Benjamin Wang
  • Shenli Yuan
  • Juan Marroquin
  • Sarah Sterman
  • Patrick Briggs
  • Samuel Hansen
  • Andrew McCabe
  • Mackenzie Leake
  • Christine Quan
  • Janette Cheng
  • Lorena Huang Liu
  • Pontus Orraryd
  • Oskar Ankarberg
  • Brandon Liu
  • Serena Wong
  • Sarah Wymer
  • Ben-han Sung
  • Nikolas Martelaro
  • Gloria Chua
  • John Morgan
  • Filippa Karrfelt
  • Catherine Mullings
  • Juliana Cook
  • Andrei Terentiev
  • Adrian Leven
  • John Newcomb
  • Tum Chaturapruek
  • Sarah Nader
  • Lucas Throckmorton
  • Kyle Dumovic
  • Kris Sankaran
  • Chinmayi Dixit
  • Shannon Kao
  • Zach Maurer
  • Maria Frank
  • Sage Isabella
  • Gabriella Brignardello
  • Cong Qiaoben
  • Erin Singer
  • Jennifer Lu
  • Kush Nijhawan
  • Santiago Seira Silva-Herzog
  • Joanne Jang
  • Bradley Reyes
  • Milan Doshi
  • Sukhi Gulati
  • Shirbi Ish-Shalom
  • Lawrence Rogers
  • Mark Schramm



2 Exploratory Data Analysis (EDA)

2.1 Overview

Exploratory Data Analysis (EDA) may also be described as data-driven hypothesis generation . Given a complex set of observations, often EDA provides the initial pointers towards various learning techniques. The data is examined for structures that may indicate deeper relationships among cases or variables.

In this lesson, we will focus on both aspects of EDA:

  • Numerical summarization
  • Data Visualization

This course is based on R software. There are several attractive features of R that make it a software of choice both in academia as well as in industry.

  • R is an open-source software and is free to download.
  • R is supported by 3,000+ packages to deal with large volumes of data in a wide variety of applications. For instance, the svd() function performs a singular value decomposition in a single line of code, which cannot be implemented so easily in C, Java, or Python (see the short sketch after this list).
  • R is quite versatile. After an algorithm is developed in R, the program may be sped up by transforming the R codes into other languages.
  • R is a mainstream analytical tool.
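As an illustration of the point about svd() above, here is a minimal sketch; the matrix is an arbitrary example:

```r
# Singular value decomposition in a single call
A <- matrix(c(2, 0, 1,
              0, 3, 1,
              1, 1, 4), nrow = 3, byrow = TRUE)

decomp <- svd(A)  # a list with d (singular values), u and v
decomp$d
```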

References:

  • The Popularity of Data Analysis Software by R.A. Muenchen,
  • R You Ready for R? by Ashlee Vance
  • R Programming for Data Science by Roger Peng

The following diagram shows that R has been gaining popularity in recent years: monthly programming discussion traffic shows explosive growth in discussions regarding R.

R has a vibrant user community. As a result, R has the most website links pointing to it.

R can be installed from the CRAN website R-Project by following the instructions there. Downloading R-Studio is strongly recommended. To develop familiarity with R, it is suggested to work through the material in Introduction to R. For further information, refer to the Course Syllabus. Other useful websites on R are Stack Overflow R Questions and R Seek.

One of the objectives of this course is to strengthen the basics in R. The R-Labs given in the textbook are followed closely. Along with the material in the text, two other features in R are introduced.

  • R Markdown : This allows the users to knit the R codes and outputs directly into the document.
  • R library ggplot2: a very useful and sophisticated set of plotting functions to produce high-quality graphs

Upon successful completion of this lesson, you should be able to:

  • Develop familiarity with the R software.
  • Apply numerical and visual summarization of data.
  • Appreciate the importance of EDA before embarking on sophisticated model building.

2.2 What is Data

Introduction.

Anything that is observed or conceptualized falls under the purview of data. In a somewhat restricted view, data is something that can be measured. Data represent facts or something that has actually taken place, been observed, and been measured. Data may come out of passive observation or active collection. Each data point must be rooted in a physical, demographic, or behavioral phenomenon, and must be unambiguous and measurable. Data is observed on each unit under study and stored in an electronic device.

Definition 2.1 (Data) denotes a collection of objects and their attributes

Definition 2.2 (Attribute) (feature, variable, or field) is a property or characteristic of an object

Definition 2.3 (Collection of Attributes) describes an object (individual, entity, case, or record)

ID Sex Education Income
248 Male High School $100,000
249 Female High School $12,000
250 Male College $23,000
251 Male Child $0
252 Female High School $19,798
253 Male High School $40,100
254 Male Less than 1st Grade $2691
255 Male Child $0
256 Male 11th Grade $30,000
257 Male Ph.D. $30686

Each Row is an Object and each Column is an Attribute

Often these attributes are referred to as variables. Attributes contain information regarding each unit of observation. Depending on how many different types of information are collected from each unit, the data may be univariate, bivariate or multivariate.

Data can have varied forms and structures but in one criterion they are all the same – data contains information and characteristics that separate one unit or observation from the others.

Types of Attributes

Definition 2.4 (Nominal) Qualitative variables that do not have a natural order, e.g. Hair color, Religion, Residence zipcode of a student

Definition 2.5 (Ordinal) Qualitative variables that have a natural order, e.g. Grades, Rating of a service rendered on a scale of 1-5 (1 is terrible and 5 is excellent), Street numbers in New York City

Definition 2.6 (Interval) Measurements where the difference between two values is meaningful, e.g. Calendar dates, Temperature in Celsius or Fahrenheit

Definition 2.7 (Ratio) Measurements where both difference and ratio are meaningful, e.g. Temperature in Kelvin, Length, Counts

Discrete and Continuous Attributes

Definition 2.8 (Discrete Attribute) A variable or attribute is discrete if it can take a finite or a countably infinite set of values. A discrete variable is often represented as an integer-valued variable. A binary variable is a special case where the attribute can assume only two values, usually represented by 0 and 1. Examples of a discrete variable are the number of birds in a flock; the number of heads realized when a coin is flipped 10 times, etc.

Definition 2.9 (Continuous Attribute) A variable or attribute is continuous if it can take any value in a given range; possibly the range being infinite. Examples of continuous variables are weights and heights of birds, the temperature of a day, etc.

In the hierarchy of data, nominal is at the lowermost rank as it carries the least information. The highest type of data is ratio, since it contains the maximum possible information. While analyzing the data, note that procedures applicable to a lower data type can be applied to a higher one, but the reverse is not true. Analysis procedures for nominal data can be applied to interval-type data, but this is not recommended, since such procedures completely ignore the amount of information interval-type data carries. Procedures developed for interval or even ratio-type data, however, cannot be applied to nominal or ordinal data. A prudent analyst should recognize each data type and then decide on the methods applicable.

2.3 Numerical Summarization

Summary statistics.

The vast number of values on a large number of variables needs to be properly organized to extract information from them. Broadly speaking, there are two methods to summarize data: visual summarization and numerical summarization. Both have their advantages and disadvantages; applied jointly, they extract the maximum information from raw data.

Summary statistics are numbers computed from the sample that present a summary of the attributes.

Measures of Location

They are single numbers representing a set of observations. Measures of location also include measures of central tendency. Measures of central tendency can also be taken as the most representative values of the set of observations. The most common measures of location are the Mean, the Median, the Mode, and the Quartiles.

Definition 2.10 (Mean) the arithmetic average of all the observations. The mean equals the sum of all observations divided by the sample size

Definition 2.11 (Median) the middle-most value of the ranked set of observations so that half the observations are greater than the median and the other half is less. Median is a robust measure of central tendency

Definition (Mode) the most frequently occurring value in the data set. This makes more sense when attributes are not continuous

2.3.2 Quartiles

Quartiles are division points which split the data into four equal parts after rank-ordering them.

The division points are called Q1 (the first quartile), Q2 (the second quartile, or median), and Q3 (the third quartile).

Note that they are not necessarily four equidistant points on the range of the sample.

Similarly, Deciles and Percentiles are defined as division points that divide the rank-ordered data into 10 and 100 equal segments.

Note that the mean is very sensitive to outliers (extreme or unusual observations), whereas the median is not. The mean is affected if even a single observation is changed. The median, on the other hand, has a 50% breakdown point, which means that unless 50% of the values in a sample change, the median will not change.

Measures of Spread

Measures of location are not enough to capture all aspects of the attributes. Measures of dispersion are necessary to understand the variability of the data. The most common measures of dispersion are the Variance, the Standard Deviation, the Interquartile Range, and the Range.

Definition 2.12 (Variance) measures how far data values lie from the mean . It is defined as the average of the squared differences between the mean and the individual data values

Definition 2.13 (Standard Deviation) is the square root of the variance. It is defined as the average distance between the mean and the individual data values

Definition 2.14 (Interquartile range (IQR)) is the difference between Q3 and Q1. IQR contains the middle 50% of data

Definition 2.15 (Range) is the difference between the maximum and minimum values in the sample

Measures of Skewness

In addition to the measures of location and dispersion, the arrangement of data or the shape of the data distribution is also of considerable interest. The most ‘well-behaved’ distribution is a symmetric distribution where the mean and the median are coincident. The symmetry is lost if there exists a tail in either direction. Skewness measures whether or not a distribution has a single long tail.

Skewness is measured as: \[ \dfrac{\sqrt{n} \left( \Sigma \left(x_{i} - \bar{x} \right)^{3} \right)}{\left(\Sigma \left(x_{i} - \bar{x} \right)^{2}\right)^{\frac{3}{2}}}\]

The figure below gives examples of symmetric and skewed distributions. Note that these diagrams are generated from theoretical distributions and in practice one is likely to see only approximations.

example of a symmetric distribution

Suppose we have the data: 3, 5, 6, 9, 0, 10, 1, 3, 7, 4, 8. Calculate the following summary statistics yourself, then check against the answers below: the mean, the median, Q1 and Q3, the variance and standard deviation, the IQR, the range, and the skewness.

  • Mean: (3+5+6+9+0+10+1+3+7+4+8)/11 = 5.091.
  • Median: the ordered data is 0, 1, 3, 3, 4, 5, 6, 7, 8, 9, 10; thus, 5 is the median.
  • Q1 and Q3: Q1 is 3 and Q3 is 8.
  • Variance and Standard Deviation: the variance is 10.491 (= ((3 - 5.091)^2 + … + (8 - 5.091)^2)/10). Thus, the standard deviation is the square root of 10.491, i.e. 3.239.
  • IQR: Q3 - Q1 = 8 - 3 = 5.
  • Range: max - min = 10 - 0 = 10.
  • Skewness: -0.03.
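These values can be checked with a few lines of base R. A minimal sketch (note that R's default quantile convention gives slightly different quartiles than those quoted above):

```r
x <- c(3, 5, 6, 9, 0, 10, 1, 3, 7, 4, 8)

mean(x)          # 5.091
median(x)        # 5
var(x)           # 10.491 (sample variance, divides by n - 1)
sd(x)            # 3.239
diff(range(x))   # range: 10
IQR(x)           # default quantile type gives 7.5 - 3 = 4.5 rather than 5

# Skewness, using the formula given above
sqrt(length(x)) * sum((x - mean(x))^3) / sum((x - mean(x))^2)^1.5   # -0.03
```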

Measures of Correlation

All the above summary statistics are applicable only for univariate data where information on a single attribute is of interest. Correlation describes the degree of the linear relationship between two attributes, X and Y .

With X taking the values x (1), … , x ( n ) and Y taking the values y (1), … , y ( n ), the sample correlation coefficient is defined as: \[\rho (X,Y)=\dfrac{\sum_{i=1}^{n}\left ( x(i)-\bar{x} \right )\left ( y(i)-\bar{y} \right )}{\left( \sum_{i=1}^{n}\left ( x(i)-\bar{x} \right )^2\sum_{i=1}^{n}\left ( y(i)-\bar{y} \right )^2\right)^\frac{1}{2}}\]

The correlation coefficient is always between -1 (perfect negative linear relationship) and +1 (perfect positive linear relationship). If the correlation coefficient is 0, then there is no linear relationship between X and Y.
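As a quick check of the formula, here is a small sketch in R with two arbitrary illustrative vectors; cor() implements the same definition:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)

cor(x, y)   # about 0.85

# The same value computed directly from the definition above
sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
```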

In the figure below a set of representative plots are shown for various values of the population correlation coefficient ρ ranging from - 1 to + 1. At the two extreme values, the relation is a perfectly straight line. As the value of ρ approaches 0, the elliptical shape becomes round and then it moves again towards an elliptical shape with the principal axis in the opposite direction.

example correlation coefficients

Try the applet “CorrelationPicture” and “CorrelationPoints” from the University of Colorado at Boulder .

Try the applet “Guess the Correlation” from the Rossman/Chance Applet Collection .

2.3.3 Measures of Similarity and Dissimilarity

Similarity and dissimilarity.

Distance or similarity measures are essential for solving many pattern recognition problems, such as classification and clustering. Various distance/similarity measures are available in the literature to compare two data distributions. As the names suggest, a similarity measure quantifies how close two distributions are. For multivariate data, complex summary methods have been developed to answer this question.

Definition 2.16 (Similarity Measure) Numerical measure of how alike two data objects are; it often falls between 0 (no similarity) and 1 (complete similarity)

Definition 2.17 (Dissimilarity Measure) Numerical measure of how different two data objects are; it ranges from 0 (objects are alike) to \(\infty\) (objects are completely different)

Definition 2.18 (Proximity) refers to a similarity or dissimilarity

Similarity/Dissimilarity for Simple Attributes

Here, p and q are the attribute values for two data objects.

  • Nominal: \(s=\begin{cases} 1 & \text{ if } p=q \\ 0 & \text{ if } p\neq q \end{cases}\) and \(d=\begin{cases} 0 & \text{ if } p=q \\ 1 & \text{ if } p\neq q \end{cases}\)

  • Ordinal: \(s=1-\dfrac{\left | p-q \right |}{n-1}\) and \(d=\dfrac{\left | p-q \right |}{n-1}\) (values mapped to integers 0 to n-1, where n is the number of values)

  • Interval or Ratio: \(s=1-\left | p-q \right |\) or \(s=\dfrac{1}{1+\left | p-q \right |}\), and \(d=\left | p-q \right |\)

Distance, such as the Euclidean distance, is a dissimilarity measure and has some well-known properties.

Common Properties of Dissimilarity Measures

  • d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 if and only if p = q,
  • d(p, q) = d(q, p) for all p and q,
  • d(p, r) ≤ d(p, q) + d(q, r) for all p, q, and r, where d(p, q) is the distance (dissimilarity) between points (data objects) p and q.

A distance that satisfies these properties is called a metric . Following is a list of several common distance measures to compare multivariate data. We will assume that the attributes are all continuous.

Euclidean Distance

Assume that we have measurements \(x_{ik}\) , \(i = 1 , \ldots , N\) , on variables \(k = 1 , \dots , p\) (also called attributes).

The Euclidean distance between the i th and j th objects is \[d_E(i, j)=\left(\sum_{k=1}^{p}\left(x_{ik}-x_{jk} \right) ^2\right)^\frac{1}{2}\]

for every pair (i, j) of observations.

The weighted Euclidean distance is: \[d_{WE}(i, j)=\left(\sum_{k=1}^{p}W_k\left(x_{ik}-x_{jk} \right) ^2\right)^\frac{1}{2}\]

If scales of the attributes differ substantially, standardization is necessary.

Minkowski Distance

The Minkowski distance is a generalization of the Euclidean distance.

With the measurement, \(x _ { i k } , i = 1 , \dots , N , k = 1 , \dots , p\) , the Minkowski distance is \[d_M(i, j)=\left(\sum_{k=1}^{p}\left | x_{ik}-x_{jk} \right | ^ \lambda \right)^\frac{1}{\lambda} \]

where \(\lambda \geq 1\) . It is also called the \(L_λ\) metric.

  • \(\lambda = 1 : L _ { 1 }\) metric, Manhattan or City-block distance.
  • \(\lambda = 2 : L _ { 2 }\) metric, Euclidean distance.
  • \(\lambda \rightarrow \infty : L _ { \infty }\) metric, Supremum distance. \[ \lim_{\lambda \to \infty}\left( \sum_{k=1}^{p}\left | x_{ik}-x_{jk} \right | ^ \lambda \right) ^\frac{1}{\lambda} =\max\left( \left | x_{i1}-x_{j1}\right| , ... , \left | x_{ip}-x_{jp}\right| \right) \]

Note that λ and p are two different parameters: p is the dimension of the data matrix, which remains finite, while λ is the order of the metric.

Mahalanobis Distance

Let X be a N × p matrix. Then the \(i^{th}\) row of X is \[x_{i}^{T}=\left( x_{i1}, ... , x_{ip} \right)\]

The Mahalanobis distance is \[d_{MH}(i, j)=\left( \left( x_i - x_j\right)^T \Sigma^{-1} \left( x_i - x_j\right)\right)^\frac{1}{2}\]

where \(\Sigma\) is the \(p \times p\) sample covariance matrix.

Suppose we have three objects with measurements \(x_1 = (1, 3, 1, 2, 4)\), \(x_2 = (1, 2, 1, 2, 1)\), and \(x_3 = (2, 2, 2, 2, 2)\). Work out the following yourself, then check against the answers below.

  • Calculate the Euclidean distances.
  • Calculate the Minkowski distances (\(\lambda = 1\) and \(\lambda\rightarrow\infty\) cases).

The Euclidean distances are: \[d _ { E } ( 1,2 ) = \left( ( 1 - 1 ) ^ { 2 } + ( 3 - 2 ) ^ { 2 } + ( 1 - 1 ) ^ { 2 } + ( 2 - 2 ) ^ { 2 } + ( 4 - 1 ) ^ { 2 } \right) ^ { 1 / 2 } = 3.162\]

\[d_{ E } ( 1,3 ) = \left( ( 1 - 2 ) ^ { 2 } + ( 3 - 2 ) ^ { 2 } + ( 1 - 2 ) ^ { 2 } + ( 2 - 2 ) ^ { 2 } + ( 4 - 2 ) ^ { 2 } \right) ^ { 1 / 2 } = 2.646\]

\[d_{ E } ( 2,3 ) = \left( ( 1 - 2 ) ^ { 2 } + ( 2 - 2 ) ^ { 2 } + ( 1 - 2 ) ^ { 2 } + ( 2 - 2 ) ^ { 2 } + ( 1 - 2 ) ^ { 2 } \right) ^ { 1 / 2 } = 1.732\]

  • Minkowski distances (when \(\lambda = 1\) ) are:

\[d_ { M } ( 1,2 ) = | 1 - 1 | + | 3 - 2 | + | 1 - 1 | + | 2 - 2 | + | 4 - 1 | = 4\]

\[d_ { M } ( 1,3 ) = | 1 - 2 | + | 3 - 2 | + | 1 - 2 | + | 2 - 2 | + | 4 - 2 | = 5\]

\[d_ { M } ( 2,3 ) = | 1 - 2 | + | 2 - 2 | + | 1 - 2 | + | 2 - 2 | + | 1 - 2 | = 3\]

Minkowski distances \(( \text { when } \lambda \rightarrow \infty )\) are:

\[d _ { M } ( 1,2 ) = \max ( | 1 - 1 | , | 3 - 2 | , | 1 - 1 | , | 2 - 2 | , | 4 - 1 | ) = 3\]

\[d _ { M } ( 1,3 ) = 2 \text { and } d _ { M } ( 2,3 ) = 1\]
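These distances can be verified with base R's dist() function; the three objects are those used in the worked answers above:

```r
X <- rbind(c(1, 3, 1, 2, 4),
           c(1, 2, 1, 2, 1),
           c(2, 2, 2, 2, 2))

dist(X, method = "euclidean")   # 3.162, 2.646, 1.732
dist(X, method = "manhattan")   # lambda = 1:          4, 5, 3
dist(X, method = "maximum")     # lambda -> infinity:  3, 2, 1
```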

Now suppose we have two objects with measurements \(x_1 = (2, 3)\) and \(x_2 = (10, 7)\), and that the sample covariance matrix of the data is \(\Sigma = \left( \begin{array} { l l } { 19 } & { 11 } \\ { 11 } & { 7 } \end{array} \right)\). Work out the following yourself, then check against the answers below.

  • Calculate the Minkowski distance (\(\lambda = 1, \lambda = 2, \text{ and } \lambda \rightarrow \infty\) cases) between the first and second objects.
  • Calculate the Mahalanobis distance between the first and second objects.

The Minkowski distances are:

\[\lambda = 1 : \mathrm { d } _ { \mathrm { M } } ( 1,2 ) = | 2 - 10 | + | 3 - 7 | = 12\]

\[\lambda = 2 : \mathrm { d } _ { \mathrm { M } } ( 1,2 ) = \mathrm { d } _ { \mathrm { E } } ( 1,2 ) = \left( ( 2 - 10 ) ^ { 2 } + ( 3 - 7 ) ^ { 2 } \right) ^ { 1 / 2 } = 8.944\]

\[\lambda \rightarrow \infty : \mathrm { d } _ { \mathrm { M } } ( 1,2 ) = \max ( | 2 - 10 | , | 3 - 7 | ) = 8\]

Since \(\Sigma = \left( \begin{array} { l l } { 19 } & { 11 } \\ { 11 } & { 7 } \end{array} \right)\), we have \(\Sigma ^ { - 1 } = \left( \begin{array} { c c } { 7 / 12 } & { - 11 / 12 } \\ { - 11 / 12 } & { 19 / 12 } \end{array} \right)\), and the Mahalanobis distance is \(d _ { M H } ( 1,2 ) = 2\).
  • R code for Mahalanobis distance
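One possible version of that code, using base R's mahalanobis() function (which returns the squared distance) with the two objects and covariance matrix above:

```r
# Mahalanobis distance between x1 = (2, 3) and x2 = (10, 7)
x1 <- c(2, 3)
x2 <- c(10, 7)
Sigma <- matrix(c(19, 11, 11, 7), nrow = 2)

sqrt(mahalanobis(x1, center = x2, cov = Sigma))   # 2
```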

Common Properties of Similarity Measures

Similarities have some well-known properties:

  • s(p, q) = 1 (or maximum similarity) only if p = q,
  • s(p, q) = s(q, p) for all p and q, where s(p, q) is the similarity between data objects p and q.

Similarity Between Two Binary Variables

The above similarity or distance measures are appropriate for continuous variables. However, for binary variables a different approach is necessary.

      q=1        q=0
p=1   \(n_{1,1}\)   \(n_{1,0}\)
p=0   \(n_{0,1}\)   \(n_{0,0}\)

Here \(n_{1,1}\) counts the attributes where p = 1 and q = 1, \(n_{1,0}\) those where p = 1 and q = 0, and so on.

Simple Matching and Jaccard Coefficients

  • Simple matching coefficient \(= \left( n _ { 1,1 } + n _ { 0,0 } \right) / \left( n _ { 1,1 } + n _ { 1,0 } + n _ { 0,1 } + n _ { 0,0 } \right)\) .
  • Jaccard coefficient \(= n _ { 1,1 } / \left( n _ { 1,1 } + n _ { 1,0 } + n _ { 0,1 } \right)\) .

Work out the answers to the following question yourself, then check against the answers below.

Given data:

  • p = 1 0 0 0 0 0 0 0 0 0
  • q = 0 0 0 0 0 0 1 0 0 1

The frequency table is:

q=1 q=0
p=1 0 1
p=0 2 7

Calculate the Simple matching coefficient and the Jaccard coefficient.

  • Simple matching coefficient = (0 + 7) / (0 + 1 + 2 + 7) = 0.7.
  • Jaccard coefficient = 0 / (0 + 1 + 2) = 0.
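A small sketch reproducing these two coefficients in R for the vectors above:

```r
# Binary vectors from the example above
p <- c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
q <- c(0, 0, 0, 0, 0, 0, 1, 0, 0, 1)

n11 <- sum(p == 1 & q == 1)   # 0
n10 <- sum(p == 1 & q == 0)   # 1
n01 <- sum(p == 0 & q == 1)   # 2
n00 <- sum(p == 0 & q == 0)   # 7

(n11 + n00) / (n11 + n10 + n01 + n00)   # simple matching coefficient: 0.7
n11 / (n11 + n10 + n01)                 # Jaccard coefficient: 0
```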

2.4 Visualization

To understand thousands of rows of data in a limited time there is no alternative to visual representation. The objective of visualization is to reveal hidden information through simple charts and diagrams. Visual representation of data is the first step toward data exploration and toward formulating an analytical relationship among the variables. In a whirl of complex and voluminous data, visualization in one, two, and three dimensions helps data analysts sift through data in a logical manner and understand the data dynamics. It is instrumental in identifying patterns and relationships among groups of variables. Visualization techniques depend on the type of variables: techniques available to represent nominal variables are generally not suitable for visualizing continuous variables, and vice versa. Data often contains complex information, and it is easier to internalize complex information through a visual mode. Graphs, charts, and other visual representations provide quick and focused summarization.

Tools for Displaying Single Variables

Histograms are the most common graphical tool to represent continuous data. On the horizontal axis, the range of the sample is plotted. On the vertical axis are plotted the frequencies or relative frequencies of each class. The class width has an impact on the shape of the histogram. The histograms in the previous section were drawn from a random sample generated from theoretical distributions. Here we consider a real example to construct histograms.

The dataset used for this purpose is the Wage data that is included in the ISLR package in R. A full description of the data is given in the package. The following R code produces the figure below which illustrates the distribution of wages for all 3000 workers.

  • Sample R code for Distribution of Wage
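One plausible version of that code (the original is only linked above), assuming the ISLR and ggplot2 packages are installed:

```r
library(ISLR)
library(ggplot2)

# Distribution of wage for all 3,000 workers in the Wage data
ggplot(Wage, aes(x = wage)) +
  geom_histogram(bins = 30) +
  labs(x = "Wage", y = "Count")
```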

Histogram showing the distribution of the wages of all 3,000 workers.

The data is mostly symmetrically distributed but there is a small bimodality in the data which is indicated by a small hump towards the right tail of the distribution.

The data set contains a number of categorical variables, one of which is Race. A natural question is whether the wage distribution is the same across races. There are several libraries in R which may be used to construct histograms across the levels of a categorical variable, along with many other sophisticated graphs and charts. One such library is ggplot2; details of its functionality are given with the R code below.

In the following figures, histograms are drawn for each Race separately.

  • Sample R code for Histogram of Wage by Race
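A plausible sketch of the faceted version (ISLR::Wage and ggplot2, as above):

```r
# Same histogram, drawn separately for each race
ggplot(Wage, aes(x = wage)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ race)
```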

Histogram showing the distribution of the wages of all 3,000 workers, grouped by race.

Because of the huge disparity among the counts of the different races, the above histograms may not be very informative. Code for an alternative visual display of the same information is shown below, followed by the plot.

  • Sample R code for Histogram of Wage by Race (Alternative)
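Since the original code is only linked, here is one guess at the alternative display: a single histogram with the bars filled (stacked) by race:

```r
ggplot(Wage, aes(x = wage, fill = race)) +
  geom_histogram(bins = 30)
```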

Histogram showing the distribution of the wages of all 3,000 workers, grouped by race.

The second type of histogram also may not be the best way of presenting all the information. However, it does bring further clarity to the small concentration at the right tail.

A boxplot is used to describe the shape of a data distribution and especially to identify outliers. Typically an observation is an outlier if it is either less than Q1 - 1.5 IQR or greater than Q3 + 1.5 IQR, where IQR is the interquartile range defined as Q3 - Q1. This rule is conservative and often too many points are identified as outliers. Hence sometimes only those points outside of [Q1 - 3 IQR, Q3 + 3 IQR] are identified as outliers.

  • Sample R code for Boxplot of Distribution of Wage
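A plausible sketch of that boxplot (ISLR::Wage and ggplot2, as above):

```r
# Boxplot of the overall wage distribution
ggplot(Wage, aes(x = "", y = wage)) +
  geom_boxplot() +
  labs(x = NULL, y = "Wage")
```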

Boxplot showing the distribution of the wages of all 3,000 workers.

The boxplot of the Wage distribution clearly identifies many outliers. It reflects the histogram depicting the distribution of Wage. The story is clearer from the boxplots drawn of the wage distribution for the individual races. Here is the R code, followed by the boxplot that results:

  • Sample R code for Boxplot of Wage by Race
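A plausible sketch of that code:

```r
# Boxplots of wage for each race
ggplot(Wage, aes(x = race, y = wage)) +
  geom_boxplot()
```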

Boxplot showing the distribution of the wages of all 3,000 workers, grouped by race.

Tools for Displaying Relationships Between Two Variables

Scatterplot.

The most standard way to visualize relationships between two variables is a scatterplot. It shows the direction and strength of association between two variables but does not quantify it. Scatterplots also help to identify unusual observations. In the previous section (Section 1(b).2) a set of scatterplots are drawn for different values of the correlation coefficient. The data there is generated from a theoretical distribution of multivariate normal distribution with various values of the correlation parameter. Below is the R code used to obtain a scatterplot for these data:

The following is the scatterplot of the variables Age and Wage for the Wage data.

  • Sample R Code for Relationship of Age and Wage
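One plausible version of that code (ISLR::Wage and ggplot2, as above):

```r
# Scatterplot of Age against Wage
ggplot(Wage, aes(x = age, y = wage)) +
  geom_point(alpha = 0.3)
```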

Scatterplot between Age and Wage

It is clear from the scatterplot that the Wage does not seem to depend on Age very strongly. However, a set of points towards the top are very different from the rest. A natural follow-up question is whether Race has any impact on the Age-Wage dependency or the lack of it. Here is the R code and then the new plot:

  • Sample R Code for Relationship of Age and Wage
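One plausible version, colouring the points by race (the original figure may use faceting instead):

```r
ggplot(Wage, aes(x = age, y = wage, colour = race)) +
  geom_point(alpha = 0.4)
```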

Scatterplot between Age and Wage by Race

We have noted before that the disproportionately high number of Whites in the data masks the effects of the other races. There does not seem to be any association between Age and Wage, controlling for Race.

Contour plot

This is useful when a continuous attribute is measured on a spatial grid. They partition the plane into regions of similar values. The contour lines that form the boundaries of these regions connect points with equal values. In spatial statistics, contour plots have a lot of applications.

Contour plots join points of equal probability. Within the contour lines concentration of bivariate distribution is the same. One may think of the contour lines as slices of a bivariate density, sliced horizontally. Contour plots are concentric; if they are perfect circles then the random variables are independent. The more oval-shaped they are, the farther they are from independence. Note the conceptual similarity in the scatterplot series in Sec 1.(b).2. In the following plot, the two disjoint shapes in the interior-most part indicate that a small part of the data is very different from the rest.

Here is the R code for the contour plot that follows:

  • Sample R Code for Contour Plot of Age and Wage
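A minimal sketch using 2-d density contours in ggplot2 (the original code is only linked):

```r
ggplot(Wage, aes(x = age, y = wage)) +
  geom_density_2d()
```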

Contour Plot of Age and Wage

Tools for Displaying More Than Two Variables

Scatterplot matrix.

Displaying more than two variables on a single scatterplot is not possible. A scatterplot matrix is one possible visualization of three or more continuous variables taken two at a time.

The data set used to display the scatterplot matrix is the College data that is included in the ISLR package. A full description of the data is given in the package. Here is the R code for the scatterplot matrix that follows:

  • Sample R Code for Scatterplot Matrix of College Attributes
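A plausible sketch using base R's pairs() on a handful of numeric attributes (the exact attributes used in the original figure are not shown):

```r
library(ISLR)

# Scatterplot matrix for a few numeric attributes of the College data
pairs(College[, c("Apps", "Accept", "Enroll", "Outstate", "Grad.Rate")])
```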

Scatterplot Matrix of College Attributes

Parallel Coordinates

An innovative way to present multiple dimensions in the same figure is by using parallel coordinate systems. Each dimension is presented by one coordinate and instead of plotting coordinates at the right angle to one another, each coordinate is placed side-by-side. The advantage of such an arrangement is that many different continuous and discrete variables can be handled within a parallel coordinate system, but if the number of observations is too large, the profiles do not separate out from one another and patterns may be missed.

The illustration below corresponds to the Auto data from the ISLR package. Only 35 cars are considered but all dimensions are taken into account. The cars considered are different varieties of Toyota and Ford, categorized into two groups: produced before 1975 and produced in 1975 or after. The older models are represented by dotted lines whereas the newer cars are represented by dashed lines. The Fords are represented by blue color and Toyotas are represented by pink color. Here is the R code for the profile plot of this data that follows:

  • Sample R Code for Profile Plot of Toyota and Ford Cars
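One possible sketch with MASS::parcoord(). The original figure uses 35 selected cars; here, as an approximation, we simply take every car whose name contains "toyota" or "ford":

```r
library(ISLR)
library(MASS)

cars <- subset(Auto, grepl("toyota|ford", name))

parcoord(cars[, 1:7],                                          # mpg ... year
         col = ifelse(grepl("ford", cars$name), "blue", "pink"),
         lty = ifelse(cars$year < 75, 3, 2))                   # dotted = pre-1975
```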

Profile plot of Toyota and Ford cars

The differences among the four groups are very clear from the figure. Early Ford models had 8 cylinders, were heavy, and had high horsepower and displacement. Naturally, they had low MPG and less time to accelerate. No Toyota belonged to this category. All Toyota cars are built after 1975, have 4 cylinders (one exception only) and MPG performance belongs to the upper half of the distribution. Note that only 35 cars are compared in the profile plot. Hence each car can be followed over all the attributes. However had the number of observations been higher, the distinction among the profiles would have been lost and the plot would not be informative.

Interesting Multivariate Plots

Following are some interesting visualizations of multivariate data. In a Star Plot, stars are drawn according to rules defined by the items’ characteristics. Each axis represents one attribute and the solid lines represent each item’s value on that attribute. All attributes of the observations could be represented; however, for the sake of clarity only 10 attributes are chosen here.

Again, the star plot follows the R code used to generate it:

  • Sample R Code for Starplot of College Data
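A minimal sketch with base R's stars(); taking the first 10 colleges and 10 numeric attributes is an assumption about the original figure:

```r
library(ISLR)

# Star plot of the first 10 colleges on 10 numeric attributes
stars(College[1:10, 2:11], labels = rownames(College)[1:10])
```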

Starplot of College Data

Another interesting plotting technique for multivariate data is the Chernoff Face, where the attributes of each observation are used to draw different features of a face. Thirty colleges and universities from the College dataset are compared below.

Again, R code and then the plot follows:

  • Sample R Code for Comparison of Colleges and Universities

Comparison of Colleges and Universities

For comparing a small number of observations on up to 15 attributes, Chernoff faces are a useful technique. However, whether two items look more or less similar depends on interpretation.

2.5 R Scripts

This course requires a fair amount of R coding. The textbook takes the reader through the R code relevant to each chapter in a step-by-step manner. Sample R code is also provided in the Visualization section. In this section, a brief introduction is given to a few of the important and useful features of R.

Introductions to R are available at Statistical R Tutorials and Cran R Project . There are many other online resources available for R. R users’ groups are thriving and highly communicative. A few additional resources are mentioned in the Course Syllabus.

One of the most important features of R is its libraries. They are freely downloadable from the CRAN site. It is not possible to make a list of all, or even most, R packages; the list is ever-changing, as the R user community continuously builds and refines the available packages. The link below is a good starting point for a list of packages for data manipulation and visualization.

R Studio Useful Packages

R Library: ggplot2

R has many packages and plotting options for data visualization but possibly none of them are able to produce as beautiful and as customizable statistical graphics as ggplot2 does. It is unlike most other graphics packages because it has a deep underlying grammar based on the Grammar of Graphics (Wilkinson, 2005). It is composed of a set of independent components that can be composed in many different ways. This makes ggplot2 very powerful because the user is not limited to a set of pre-specified graphics. The plots can be built up iteratively and edited later. The package is designed to work in a layered fashion, starting with a layer showing the raw data and then adding layers of annotations and statistical summaries.

The grammar of graphics is an answer to a question: what is a statistical graphic?

In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (color, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system. Faceting can be used to generate the same plot for different subsets of the dataset. It is the combination of these independent components that make up a graphic.

A brief description of the main components is given below:

  • The data and a set of aesthetic mappings describe how variables in the data are mapped to various aesthetic attributes
  • Geometric objects, geoms for short, represent what is actually on the plot: points, lines, polygons, etc.
  • Statistical transformations, stats for short, summarise data in many useful ways. For example, binning and counting observations to create a histogram, or summarising a 2d relationship with a linear model. Stats are optional but very useful.
  • A faceting specification describes how to break up the data into subsets and how to display those subsets as small multiples. This is also known as conditioning or latticing/trellising.

The basic command for plotting is qplot(X, Y, data = <data name>) (quick plot!). Unlike the more common plot() command, qplot() can be used to produce many other types of graphics by varying the geom argument. Examples of a few common geoms are given below.

  • geom = “point” is the default
  • geom = “smooth” fits a smoother to the data and displays the smooth and its standard error
  • geom = “boxplot” produces a box-and-whisker plot to summarise the distribution of a set of points

For continuous variables

  • geom = “histogram” draws a histogram
  • geom = “density” draws a density plot

For discrete variables

  • geom = “bar” produces a bar chart.

Aesthetics and faceting are two important features of ggplot2. Color, shape, size and other aesthetic arguments are used when observations coming from different subgroups are plotted on the same graph. Faceting takes an alternative approach: it creates tables of graphics by splitting the data into subsets and displaying the same graph for each subset in an arrangement that facilitates comparison.
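A few illustrative qplot() calls corresponding to the geoms, aesthetics, and faceting described above, using ggplot2's built-in mpg data as a stand-in (note that recent ggplot2 versions soft-deprecate qplot() in favour of the full ggplot() interface):

```r
library(ggplot2)

qplot(displ, hwy, data = mpg)                     # geom = "point" (default)
qplot(displ, hwy, data = mpg, geom = "smooth")    # smoother + standard error
qplot(class, hwy, data = mpg, geom = "boxplot")   # distribution by group
qplot(hwy, data = mpg, geom = "histogram")        # continuous variable
qplot(class, data = mpg, geom = "bar")            # discrete variable

# Aesthetics and faceting
qplot(displ, hwy, data = mpg, colour = class)     # colour by subgroup
qplot(displ, hwy, data = mpg, facets = ~ drv)     # small multiples
```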

From Wickham, H. (2009). ggplot2: Elegant Graphics for Data Analysis , Springer.

R Markdown is an extremely useful facility that lets a user incorporate R code and output directly in a document. For a comprehensive introduction to R Markdown and how to use it, you may consult R Markdown in the course STAT 485.


Assignment Exploratory Data Analysis

Graded assignment

Find a dataset and perform an Exploratory Data Analysis on it.

You can use data from anywhere. For example, you may use Google dataset search, Kaggle datasets, a dataset from an R package, or something you collected yourself.

Requirements

  • explain the dataset in 1 or 2 paragraphs
  • use tidyverse
  • clean, legible R code (preferably following something close to the Google style guide)
  • table(s) with relevant summary statistics
  • descriptive and exploratory plots
  • explain what you did and why (maximum 5 paragraphs total); if applicable, note findings such as missingness, outliers, unlikely values, group differences, etcetera.

Other languages: If you are fluent in another programming language, then feel free to use that language where possible. But we require you to follow the scope of the course, so do your plots in the grammar-of-graphics way, make sure that what you do is statistically valid, etc. FWIW: in RStudio you can directly include code chunks in your .Rmd file for the following languages: R, Bash, D3, Python, C (Rcpp), SQL and Stan.

Format: GitHub submission of an RStudio project folder

  • the dataset (csv, xlsx, sav, dat, json, or any other common format)
  • one .Rmd (R Markdown) file
  • a compiled .pdf or .html
  • we should be able to compile the .Rmd to the same .pdf or .html. That means no errors!
  • the names of all group members. Student numbers are not needed.

HINT: If you create an RStudio Project, all paths can be made relative to the .Rproj file, which avoids file path errors that depend on the local machine.

Data Analytics Coding Fundamentals : UVic BIDA302: Course Book

Assignment 2 - week 3 - exploratory data analysis

This chunk of R code loads the packages that we will be using.
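The chunk itself is not reproduced here; based on the “Packages used” section below, it would look something like this:

```r
# Packages for this assignment (based on the "Packages used" list below)
library(tidyverse)  # dplyr, tidyr, ggplot2, forcats, ...
library(readxl)
```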

Introduction

For this homework assignment, please write your answer for each question after the question text and before the next question.

In some cases, you will have to insert R code chunks, and run them to ensure that you’ve got the right result.

Use all of the R Markdown formatting you want! Bullets, bold text, etc. are welcome. And don’t forget to consider changes to the YAML.

Once you have finished your assignment, create an HTML document by “knitting” the document using either the “Preview” or “Knit” button in the top left of the script window frame.

New Homes Registry

The B.C. Ministry of Municipal Affairs and Housing publishes data from BC Housing’s New Homes Registry, by regional district and municipality, for three types of housing: single detached, multi-unit homes, and purpose built rental.

The name of the file is “bc-stats_2018-new-homes-data_tosend.xlsx”

Packages used

This exercise relies on the following packages:

  • documentation for {readxl} (in particular, review the “Overview” page and the “Articles”)

  • documentation for {ggplot2}

You will also require functions from {dplyr}, {tidyr}, and (potentially) {forcats}.

1. Explore the file

List the sheet names in the file. (You may wish to assign the long and cumbersome name of the source file to a character string object with a shorter, more concise name.)
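One possible approach (the path is an assumption: the file is taken to sit in the project folder):

```r
library(readxl)

# Store the long file name in a shorter object, then list the sheets
homes_file <- "bc-stats_2018-new-homes-data_tosend.xlsx"
excel_sheets(homes_file)
```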

2. Importing a sheet

Here’s a screenshot of the top rows of the sheet with single detached housing:

Excel file: single detached

What problems do you anticipate with the way this sheet is laid out?

In Question 5, you will be making a plot that uses data from this sheet. Will you need all of the rows and columns that contain information of one kind or another?

What are the data types of each column in the Excel file? Ask yourself things like “What is the variable type of this column? What type do I think R will interpret this as?”

Read in the sheet, using no options. What is notable about the content of the R data table, compared to the Excel source?

Read the contents of the file again, so that the header rows are in the right place, and with the “Note:” column omitted.

(See this page on the {readxl} reference material for some tips.)

Note: there are many possible solutions to this problem. Once you’ve created an object that you can manipulate in the R environment, your solution may involve some of the {dplyr} data manipulations.
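A sketch of one such solution; the sheet name, the number of header rows to skip, and the name of the note column are placeholders that depend on the actual layout of the file:

```r
library(dplyr)

single_detached <- read_excel(homes_file,
                              sheet = "Single Detached",  # placeholder name
                              skip  = 3) %>%              # placeholder count
  select(-contains("Note"))
```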

3. Tidy data

Does this data frame violate any of the principles of tidy data?

If so, use the pivot functions from {tidyr} to turn it into a tidy structure.
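A sketch of a pivot to long format, assuming the first two columns identify the region and municipality and the remaining columns hold yearly counts:

```r
library(tidyr)

single_tidy <- single_detached %>%
  pivot_longer(cols = -c(1, 2),        # keep the first two ID columns as-is
               names_to = "year",
               values_to = "units")
```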

4. Joining tables

Because the structure of the data in the Excel file is consistent, we can copy-and-paste and edit our code above, and assemble the contents of the three sheets into a single data table.

Repeat the import and tidy steps for the sheets containing the data for multi-unit homes and purpose built rental, and assign each to a unique object. At the end of this step you will have three tidy data frame objects in your environment, one each for single detached, multi-unit homes, and purpose built rentals.

Now join the three tables, creating a single table that contains all of the information that was previously stored in three separate sheets.
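One way to combine the three tables, stacking them and recording the housing type; multi_tidy and rental_tidy are assumed to come from repeating the steps above:

```r
homes <- bind_rows(
  "single_detached"      = single_tidy,
  "multi_unit"           = multi_tidy,
  "purpose_built_rental" = rental_tidy,
  .id = "housing_type"
)
```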

5. EDA: plotting

Now you’ve got a tidy structure, it’s time for some exploratory data analysis!

Plot the total number of housing units built in B.C. by municipality, but only the 10 municipalities with the greatest number of homes built, sorted from most to least. (I will leave it up to you to decide if you want to do that by a single year or by the total of all three years in the data.)

Hints and resources:

The Data visualisation and Exploratory data analysis chapters of R for Data Science , 2nd ed. might be handy references

The {ggplot2} reference pages

The Factors chapter of R for Data Science , 2nd ed.

The {forcats} reference pages

You might need to do further data manipulation before you can plot what you want

Sometimes I find it very helpful to make a sketch of the plot I envision, and then write down which variables are associated with the various parts of the plot
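Putting the hints together, here is one possible sketch of an answer; the column names ("Municipality", "units") are assumptions carried over from the sketches above:

```r
library(dplyr)
library(ggplot2)
library(forcats)

homes %>%
  group_by(Municipality) %>%
  summarise(total_units = sum(units, na.rm = TRUE)) %>%
  slice_max(total_units, n = 10) %>%
  ggplot(aes(x = total_units, y = fct_reorder(Municipality, total_units))) +
  geom_col() +
  labs(x = "Total units registered", y = NULL)
```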

Lesson 2 - Exploratory data analysis


Learning objectives

  • Understand the main steps involved in exploratory data analysis
  • Visualise geographical data with seaborn
  • Slice, mask, and index pandas.Series and pandas.DataFrame objects
  • Merge pandas.DataFrame objects together on a common key
  • Apply the DataFrame.groupby() operation to aggregate data across different groups of interest

This lesson draws heavily on the following textbook chapters:

  • Chapter 2 of Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron
  • Chapter 3 (pp. 146-170) of Python Data Science Handbook by Jake VanderPlas

You may also find the following blog post useful:

  • Exploratory Data Analysis of Craft Beers: Data Profiling

Homework:

  • Solve the exercises included in this notebook
  • Work through lessons 2 and 4 from Kaggle Learn's introduction to pandas

If you get stuck on understanding a particular pandas technique, you might find their docs to be helpful.

What is exploratory data analysis?


In data science we apply the scientific method to data with the goal of gaining insights. This means that we state a hypothesis about the data, test it, and refine it if necessary. In this framework, exploratory data analysis (EDA) is the step where we explore the data before actually building models. It helps us understand what information is actually contained in the data and what insights could be gained from it.

Formally, the goals of EDA are:

  • Suggest hypotheses about the phenomena of interest
  • Check if necessary data is available to test these hypotheses
  • Make a selection of appropriate methods and models to achieve the goal
  • Suggest what data should be gathered for further investigation

This exploratory phase lays out the path for the rest of a data science project and is therefore a crucial part of the process.

In this lesson we will analyse two datasets:

  • housing.csv
  • housing_addresses.csv

The first is the California housing dataset we saw in lesson 1, while the second provides information about the neighbourhoods associated with each house. This auxiliary data was generated using the reverse geocoding functionality from Google Maps , where the latitude and longitude coordinates for each house are converted into the closest, human-readable address.

The types of questions we will try to answer include:

  • Which cities have the most houses?
  • Which cities have the most expensive houses?
  • What is the breakdown of house prices by proximity to the ocean?

Import libraries

As in lesson 1, we will be making use of the pandas and seaborn libraries. It is often a good idea to import all your libraries in a single cell block near the top of your notebooks so that your collaborators can quickly see whether they need to install new libraries or not.
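A minimal sketch of such an import cell (the aliases pd, sns and plt are the conventional ones; the exact cell in the original notebook is not reproduced here):

    from pathlib import Path

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt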

Load the data

As usual, we can download our datasets using our helper function get_datasets :

We also make use of the pathlib library to handle our filepaths:
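The get_datasets helper ships with the course materials and is not reproduced here. Assuming it has downloaded housing.csv and housing_addresses.csv into a local data/ directory (the directory name is an assumption), loading and previewing the two tables looks roughly like this:

    DATA = Path("data")  # assumed download location used by the course helper

    housing_data = pd.read_csv(DATA / "housing.csv")
    housing_addresses = pd.read_csv(DATA / "housing_addresses.csv")

    # Peek at the first few rows of each table
    housing_data.head()
    housing_addresses.head()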

longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY
street_number route locality-political postal_code latitude_longitude
0 3130 Grizzly Peak Boulevard Berkeley 94705.0 37.88,-122.23
1 2005 Tunnel Road Oakland 94611.0 37.86,-122.22
2 6886 Chabot Road Oakland 94618.0 37.85,-122.24
3 6365 Florio Street Oakland 94618.0 37.85,-122.25
4 5407 Bryant Avenue Oakland 94618.0 37.84,-122.25

Inspect the data

The shape of the data

Whenever we have a new dataset it is handy to begin by getting an idea of how large the DataFrame is. This can be done with either the built-in len() function or the DataFrame.shape attribute:
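For example (the row count in the comment comes from the dataset used in this lesson, which has 20,640 rows and 10 columns in the housing table):

    print(len(housing_data))         # number of rows: 20640
    print(housing_data.shape)        # (rows, columns): (20640, 10)
    print(housing_addresses.shape)   # rows and columns of the address table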

Rename columns

Usually one finds that the column headers in the raw data are either ambiguous or appear in multiple DataFrame objects, in which case it is handy to give them the same name. Although it's obvious from the DataFrame.head() method what the column headers are for our housing and address data, in most cases one has tens or hundreds of columns and the fastest way to see their names is as follows:

Let's rename the route and locality-political columns to something more transparent:
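A sketch of inspecting the column names and renaming the two ambiguous ones, assigning the result back rather than using inplace=True:

    # List the column names of the address table
    print(housing_addresses.columns.tolist())

    # Rename the ambiguous columns to clearer names
    housing_addresses = housing_addresses.rename(
        columns={"route": "street_name", "locality-political": "city"}
    )
    housing_addresses.head()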

street_number street_name city postal_code latitude_longitude
0 3130 Grizzly Peak Boulevard Berkeley 94705.0 37.88,-122.23
1 2005 Tunnel Road Oakland 94611.0 37.86,-122.22
2 6886 Chabot Road Oakland 94618.0 37.85,-122.24
3 6365 Florio Street Oakland 94618.0 37.85,-122.25
4 5407 Bryant Avenue Oakland 94618.0 37.84,-122.25

Note: Many pandas methods like DataFrame.rename() can manipulate an object in place without returning a new object. When inplace=True is passed, as in data_frame.rename(columns=..., inplace=True), the object is modified directly, so the usual assignment of the result back to a variable is not needed. The default is inplace=False, so you have to set it explicitly if you want this behaviour.

Unique values

Since we are dealing with data about California, we should check that the city column contains a reasonable number of unique entries. In pandas we can check this with the DataFrame.nunique() method:
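For example, counting the distinct city names in the address table:

    housing_addresses["city"].nunique()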

Exercise #1

Does the above number make sense to you? What additional data could you find to determine if it does or does not?

Visualising geographical data

In this lesson we will be focusing on how the house location affects its price, so let's make a scatterplot of the latitude and longitude values to see if we can identify any interesting patterns:

Although the points look like the shape of California, we see that many are overlapping which obscures potentially interesting substructure. We can fix this by configuring the transparency of the points with the alpha argument:

This is much better as we can now see distinct clusters of houses. To make this plot even more informative, let's colour the points according to the median house value; we will use the viridis colourmap (palette) as this has been carefully designed for data that has a sequential nature (i.e. low to high values):
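A sketch of the three scatter plots described above; the alpha values and styling are illustrative choices, not necessarily those used in the original notebook:

    # Raw locations: many points overlap
    sns.scatterplot(x="longitude", y="latitude", data=housing_data)
    plt.show()

    # Transparency reveals dense clusters
    sns.scatterplot(x="longitude", y="latitude", alpha=0.1, data=housing_data)
    plt.show()

    # Colour by median house value with the sequential viridis palette
    sns.scatterplot(
        x="longitude", y="latitude", hue="median_house_value",
        palette="viridis", alpha=0.3, data=housing_data,
    )
    plt.show()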

Exercise #2

What does the figure above tell us about the relationship between house prices and location or population density?

Finally, to make our visualisation a little more intuitive, we can also overlay the scatter plot on an actual map of California:

Exercise #3

Can you explain the light green and yellow hotspots in the above figure?

Merging DataFrames

Although the housing_data and housing_addresses DataFrames contain interesting information, it would be nice if there was a way to join the two tables.

More generally, one of the most common operations in pandas (or data science for that matter) is the combination of data contained in various objects. In particular, merge or join operations combine datasets by linking rows using one or more keys . These operations are central to relational databases (e.g. SQL-based). The pandas.merge() function is the main entry point for using these algorithms on your data.

Let's use this idea to combine the housing_data and housing_addresses pandas.DataFrame objects via their common latitude and longitude coordinates. First we need to combine the latitude and longitude columns of housing_data into the same lat,lon format as our housing_addresses pandas.DataFrame . To do so, we will use the Series.astype() function to convert the numerical column values to strings, then use string concatenation to create a new latitude_longitude column with the desired format:
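A sketch of that conversion:

    # Convert the numeric coordinates to strings and join them as "lat,lon"
    housing_data["latitude_longitude"] = (
        housing_data["latitude"].astype(str)
        + ","
        + housing_data["longitude"].astype(str)
    )
    housing_data.head()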

longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity latitude_longitude
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY 37.88,-122.23
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY 37.86,-122.22
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY 37.85,-122.24
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY 37.85,-122.25
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY 37.85,-122.25

Exercise #4

Calculate the number of unique values in the latitude_longitude column of both housing_data and housing_addresses .

  • What can you conclude from this comparison?
  • Why do you think there might be fewer unique coordinate pairs than the total number of houses?

Now that we have latitude_longitude present in both pandas.DataFrame objects we can merge the two tables together in pandas as follows:
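A sketch of the left join (the name housing_merged for the result is an assumption):

    # Left join: keep every row of housing_data and attach address info
    housing_merged = pd.merge(
        housing_data, housing_addresses, on="latitude_longitude", how="left"
    )
    housing_merged.head()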

longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity latitude_longitude street_number street_name city postal_code
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY 37.88,-122.23 3130 Grizzly Peak Boulevard Berkeley 94705.0
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY 37.86,-122.22 2005 Tunnel Road Oakland 94611.0
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY 37.85,-122.24 6886 Chabot Road Oakland 94618.0
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY 37.85,-122.25 6365 Florio Street Oakland 94618.0
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY 37.85,-122.25 6365 Florio Street Oakland 94618.0

Boom! We now have a single pandas.DataFrame that links house prices and attributes with their addresses.

Note: The merge operation above is an example of a left join , namely we use all the key combinations of latitude_longitude found in the left table housing_data .

In general the 'how' argument of pandas.merge() allows four different join types:

  • 'left' : Use all key combinations found in the left table
  • 'right' : Use all key combinations found in the right table
  • 'inner' : Use only the key combinations observed in both tables
  • 'outer' : Use all key combinations observed in both tables together

A visual example of these different merges in action is shown in the figure below.


Figure: Graphical representation of the different types of merges between two DataFrames df1 and df2 that are possible in pandas.

Warning: If the keys in one table match more than one row in the other pandas.DataFrame you can expect a larger table to appear after you do a left join. To avoid this behaviour, you can run DataFrame.drop_duplicates() before doing the merge.

Dropping columns

It is not uncommon for a dataset to have tens or hundreds of columns, and in practice we may only want to focus our attention on a smaller subset. One way to remove unwanted columns is via the DataFrame.drop() method. Since we have duplicate information about the latitude and longitude coordinates, we may as well drop the latitude_longitude column from our merged pandas.DataFrame :
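A sketch of the drop (axis=1 tells pandas to drop a column rather than a row):

    housing_merged = housing_merged.drop("latitude_longitude", axis=1)
    housing_merged.head()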

longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity street_number street_name city postal_code
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY 3130 Grizzly Peak Boulevard Berkeley 94705.0
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY 2005 Tunnel Road Oakland 94611.0
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY 6886 Chabot Road Oakland 94618.0
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY 6365 Florio Street Oakland 94618.0
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY 6365 Florio Street Oakland 94618.0

Note: You will often encounter the cryptic axis parameter when dealing with pandas.DataFrame and pandas.Series objects. This parameter is used to specify along which dimension we want to apply a given transformation - see the figure below for a graphical representation.


Figure: Visualisation of the axis parameter in pandas.

Saving a DataFrame to disk

At this point, we have a unified table with housing data and their addresses. This is usually a good point to save the intermediate results to disk so that we can reload them without having to run all the preprocessing steps again. To do so in pandas, we can make use of the DataFrame.to_csv() function:
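A sketch; the output filename is an assumption, and index=False prevents the row index from being written as an extra column:

    housing_merged.to_csv(DATA / "housing_merged.csv", index=False)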

Indexing, selection, and filtering

Now that we have a tidy pandas.DataFrame let's apply some of the most common pandas methods to make queries on the data.

A pandas.Series object provides array-style item selection that allows one to perform slicing, masking, and fancy indexing . For example, we can study the housing_median_age column from a variety of angles:

Since a pandas.DataFrame acts like a two-dimensional array, pandas provides special indexing operators called DataFrame.iloc[] and DataFrame.loc[] to slice and dice our data. As an example, let's select a single row and multiple columns by label :

We can perform similar selections with integers using DataFrame.iloc[] :
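A combined sketch of the Series-style selections and the label- and position-based DataFrame selections described above (the specific rows and columns are illustrative):

    age = housing_merged["housing_median_age"]

    age[:5]            # slicing
    age[age > 50]      # masking with a boolean condition
    age[[0, 2, 4]]     # fancy indexing with a list of labels

    # Label-based: one row, two columns
    housing_merged.loc[0, ["city", "median_house_value"]]

    # Position-based: the same row and columns by integer position
    housing_merged.iloc[0, [12, 8]]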

Masks or filters are especially common to use, e.g. let's select the subset of expensive houses:
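The threshold that defines "expensive" is not reproduced in this excerpt; the sketch below uses $450,000 purely as an illustrative cut-off:

    # Boolean mask: keep only the rows above the (illustrative) price threshold
    housing_merged[housing_merged["median_house_value"] > 450_000]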

longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity street_number street_name city postal_code
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY 3130 Grizzly Peak Boulevard Berkeley 94705.0
89 -122.27 37.80 52.0 249.0 78.0 396.0 85.0 1.2434 500001.0 NEAR BAY 321 10th Street Oakland 94607.0
140 -122.18 37.81 30.0 292.0 38.0 126.0 52.0 6.3624 483300.0 NEAR BAY NaN NaN Oakland 94611.0
459 -122.25 37.87 52.0 609.0 236.0 1349.0 250.0 1.1696 500001.0 NEAR BAY 15 Canyon Road Berkeley 94704.0
489 -122.25 37.86 48.0 2153.0 517.0 1656.0 459.0 3.0417 489600.0 NEAR BAY 2805 Kelsey Street Berkeley 94705.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
20422 -118.90 34.14 35.0 1503.0 263.0 576.0 216.0 5.1457 500001.0 <1H OCEAN 890 West Potrero Road Thousand Oaks 91361.0
20426 -118.69 34.18 11.0 1177.0 138.0 415.0 119.0 10.0472 500001.0 <1H OCEAN NaN East Las Virgenes Canyon Road NaN 91307.0
20427 -118.80 34.19 4.0 15572.0 2222.0 5495.0 2152.0 8.6499 500001.0 <1H OCEAN 5135 Island Forest Place Westlake Village 91362.0
20436 -118.69 34.21 10.0 3663.0 409.0 1179.0 371.0 12.5420 500001.0 <1H OCEAN 6 Ranchero Road Bell Canyon 91307.0
20443 -118.85 34.27 50.0 187.0 33.0 130.0 35.0 3.3438 500001.0 <1H OCEAN 2182 Tierra Rejada Road Moorpark 93021.0

1257 rows × 14 columns

We can also use Python’s string methods, via the .str accessor, to filter rows based on text values:
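The exact filter is not reproduced in this excerpt, but the 11,794 rows shown below are consistent with keeping every district whose ocean_proximity value contains the substring "OCEAN" (i.e. both "<1H OCEAN" and "NEAR OCEAN"); a sketch:

    housing_merged[housing_merged["ocean_proximity"].str.contains("OCEAN")]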

longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity street_number street_name city postal_code
701 -121.97 37.64 32.0 1283.0 194.0 485.0 171.0 6.0574 431000.0 <1H OCEAN 33803 Palomares Road Castro Valley 94552.0
830 -121.99 37.61 9.0 3666.0 711.0 2341.0 703.0 4.6458 217000.0 <1H OCEAN NaN South Fork Trail Castro Valley 94552.0
859 -121.97 37.57 21.0 4342.0 783.0 2172.0 789.0 4.6146 247600.0 <1H OCEAN 121 Overacker Terrace Fremont 94536.0
860 -121.96 37.58 15.0 3575.0 597.0 1777.0 559.0 5.7192 283500.0 <1H OCEAN NaN Deer Gulch Loop Trail Fremont 94536.0
861 -121.98 37.58 20.0 4126.0 1031.0 2079.0 975.0 3.6832 216900.0 <1H OCEAN 37296 Mission Boulevard Fremont 94536.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
20502 -118.68 34.33 45.0 121.0 25.0 67.0 27.0 2.9821 325000.0 <1H OCEAN NaN Windmill Canyon Road Simi Valley 93063.0
20503 -118.75 34.33 27.0 534.0 85.0 243.0 77.0 8.2787 330000.0 <1H OCEAN NaN Middle Ridge Fire Road Simi Valley 93065.0
20504 -118.73 34.29 11.0 5451.0 736.0 2526.0 752.0 7.3550 343900.0 <1H OCEAN 3435 Avenida Simi Simi Valley 93063.0
20505 -118.72 34.29 22.0 3266.0 529.0 1595.0 494.0 6.0368 248000.0 <1H OCEAN 3889 Avenida Simi Simi Valley 93063.0
20506 -118.73 34.29 8.0 4983.0 754.0 2510.0 725.0 6.9454 276500.0 <1H OCEAN 3435 Avenida Simi Simi Valley 93063.0

11794 rows × 14 columns

Note: An important feature with pandas.Series and pandas.DataFrame objects is the fact that they contain an index that lets us slice and modify the data. This Index object can be accessed via Series.index or DataFrame.index and is typically an array of integers that denote the location of each row.

For example, our age object has an index for each row or house in the dataset:

We can also sort the values according to the index, which in our example amounts to reversing the order:

Finally, there are times when you want to reset the index of the pandas.Series or pandas.DataFrame objects; this can be achieved as follows:
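A sketch of the three index operations just described:

    age.index                        # the Index object of the Series

    age.sort_index(ascending=False)  # sort by index, i.e. reverse the row order here

    age.reset_index()                # turn the old index into a regular column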

index housing_median_age
0 0 41.0
1 1 21.0
2 2 52.0
3 3 52.0
4 4 52.0
... ... ...
20635 20635 25.0
20636 20636 18.0
20637 20637 17.0
20638 20638 18.0
20639 20639 16.0

20640 rows × 2 columns

Note that this creates a new index column and leaves the pandas.Series in ascending index order.

Whenever you need to quickly find the frequencies associated with categorical data, the DataFrame.value_counts() and Series.nlargest() functions come in handy. For example, if we want to see which city has the most houses, we can run the following:
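For example (showing the ten most frequent cities is an illustrative choice):

    housing_merged["city"].value_counts().nlargest(10)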

This seems to make sense, since Los Angeles, San Diego, and San Francisco have some of the largest populations. We can check whether this is indeed the case by aggregating the data to calculate the total population across the group of cities. pandas provides a flexible DataFrame.groupby() interface that enables us to slice, dice, and summarise datasets in a natural way. In particular, pandas allows one to:

  • Split a pandas object into pieces using one or more keys
  • Calculate group summary statistics, like count, mean, standard deviation, or a user-defined function
  • Apply within-group transformations or other manipulations, like normalisation, rank, or subset selection
  • Compute pivot tables and cross-tabulations.

Let's combine these ideas to answer our question, followed by an explanation of how the GroupBy mechanics really work:
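A sketch of that aggregation (the name city_population is an assumption); the double brackets keep the result as a one-column DataFrame, matching the output below:

    city_population = housing_merged.groupby("city")[["population"]].sum()
    city_population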

population
city
Acampo 9626.0
Acton 6740.0
Adelanto 6583.0
Adin 364.0
Agoura Hills 26776.0
... ...
Yreka 8971.0
Yuba City 49772.0
Yucaipa 31965.0
Yucca Valley 18955.0
Zenia 228.0

989 rows × 1 columns

This seems to work: we now have the total population across the housing blocks in each city, but the result is sorted alphabetically. To get the cities with the largest populations, we can use the sort_values() method as follows:
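A sketch:

    city_population.sort_values(by="population", ascending=False)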

population
city
Los Angeles 3495957.0
San Diego 1069557.0
San Jose 818234.0
San Francisco 702282.0
Sacramento 614478.0
... ...
Parker 83.0
El Portal 79.0
Pearsonville 48.0
Forest Ranch 47.0
North Richmond 42.0

That's much better! We can store the result as a new pandas.DataFrame and plot the distribution:
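A sketch (the name top_cities and the choice to plot the ten most populous cities are assumptions):

    top_cities = (
        city_population.sort_values(by="population", ascending=False)
        .reset_index()
    )
    top_cities.head()

    # Bar plot of the most populous cities
    sns.barplot(x="population", y="city", data=top_cities.head(10))
    plt.show()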

city population
0 Los Angeles 3495957.0
1 San Diego 1069557.0
2 San Jose 818234.0
3 San Francisco 702282.0
4 Sacramento 614478.0

GroupBy mechanics

The group operation above is best understood through Hadley Wickham's split-apply-combine strategy: break a big problem up into manageable pieces, operate on each piece independently, and then put all the pieces back together.

Split: In the first stage of the process, data contained in a pandas.Series or pandas.DataFrame object is split into groups based on one or more specified keys . The splitting is performed on a particular axis of the object (see the notebook from lesson 2). For example, a pandas.DataFrame can be grouped on its rows ( axis=0 ) or its columns ( axis=1 ).

Apply: Once the split is done, a function is applied to each group, producing a new value.

Combine: Finally, the results of all those function applications are combined into a result object. The form of the resulting object will usually depend on what's being done to the data.

See the figure below for an example of a simple group aggregation.


Figure: Illustration of a group aggregation.

In general, the grouping keys can take many forms, and they do not all have to be of the same type. Frequently, the grouping information is found in the same pandas.DataFrame as the data you want to work on, so the key is usually a column name . For example, let's create a simple pandas.DataFrame as follows:
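A sketch that reproduces the structure of the table below (the random values will differ on each run):

    import numpy as np

    df_foo = pd.DataFrame(
        {
            "key1": ["a", "a", "b", "b", "a"],
            "key2": ["one", "two", "one", "two", "one"],
            "data1": np.random.randn(5),
            "data2": np.random.randn(5),
        }
    )
    df_foo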

key1 key2 data1 data2
0 a one 1.837510 0.875095
1 a two 0.771915 -0.256958
2 b one -0.830038 -0.037795
3 b two 0.727881 -0.254475
4 a one 0.168928 -1.287844

We can then use the column names as the group keys (similar to what we did above with the cities):

This grouped variable is now a GroupBy object. It has not actually calculated anything yet except for some intermediate data about the group key df_foo['key1'] . The main idea is that this object has all of the information needed to then apply some operation to each of the groups. For example, we can get the mean per group as follows:
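A sketch; selecting the two data columns explicitly keeps the aggregation to the numeric columns:

    grouped = df_foo.groupby("key1")
    grouped[["data1", "data2"]].mean()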

data1 data2
key1
a 0.926117 -0.223236
b -0.051078 -0.146135

Exercise #5

  • Use the above DataFrame.groupby() techniques to find the top 10 cities which have the most expensive houses on average. Look up some of the names on the web - do the results make sense?
  • Use the DataFrame.loc[] method to filter out the houses with the capped values of over $500,000. Repeat the same step as above.

Exercise #6

  • Use the DataFrame.groupby() and agg() techniques to find the distribution of mean house prices according to ocean_proximity .
  • Store the result from the above step in a new pandas.DataFrame and visualise it with seaborn's barplot.


mGalarnyk / project2.md (GitHub Gist)
Exploratory Data Analysis Project 2 (JHU) Coursera

Unzipping and loading files, question 1 ( plot1.r ).

Have total emissions from PM2.5 decreased in the United States from 1999 to 2008? Using the base plotting system, make a plot showing the total PM2.5 emission from all sources for each of the years 1999, 2002, 2005, and 2008.

Exploratory Data Analysis Project 2 question 1

Question 2 ( plot2.R )

Have total emissions from PM2.5 decreased in Baltimore City, Maryland ( fips == "24510" ) from 1999 to 2008? Use the base plotting system to make a plot answering this question.

Exploratory Data Analysis Project 2 question 2

Question 3 ( plot3.R )

Of the four types of sources indicated by the type variable (point, nonpoint, onroad, nonroad), which of these four sources have seen decreases in emissions from 1999–2008 for Baltimore City? Which have seen increases in emissions from 1999–2008? Use the ggplot2 plotting system to make a plot answering this question.

Exploratory Data Analysis Project 2 question 3

Question 4 ( plot4.R )

Across the United States, how have emissions from coal combustion-related sources changed from 1999–2008?

Exploratory Data Analysis Project 2 question 4

Question 5 ( plot5.R )

How have emissions from motor vehicle sources changed from 1999–2008 in Baltimore City?

Exploratory Data Analysis Project 2 question 5

Question 6 ( plot6.R )

Compare emissions from motor vehicle sources in Baltimore City with emissions from motor vehicle sources in Los Angeles County, California ( fips == "06037" ). Which city has seen greater changes over time in motor vehicle emissions?

Exploratory Data Analysis Project 2 question 6

6.859 : Interactive Data Visualization

Assignment 2: Exploratory Data Analysis

In this assignment, you will identify a dataset of interest and perform an exploratory analysis to better understand the shape & structure of the data, investigate initial questions, and develop preliminary insights & hypotheses. Your final submission will take the form of a report consisting of captioned visualizations that convey key insights gained during your analysis.

  • Step 1: Data Selection

First, you will pick a topic area of interest to you and find a dataset that can provide insights into that topic. To streamline the assignment, we’ve pre-selected a number of datasets for you to choose from.

However, if you would like to investigate a different topic and dataset, you are free to do so. If you do decide to work with a self-selected dataset, please check with the course staff to ensure it is appropriate for this assignment. Be advised that data collection and preparation (also known as data wrangling ) can be a very tedious and time-consuming process. Be sure you have sufficient time to conduct exploratory analysis, after preparing the data.

After selecting a topic and dataset — but prior to analysis — you should write down an initial set of at least three different questions you’d like to investigate.

  • Part 2: Exploratory Visual Analysis

Next, you will perform an exploratory analysis of your dataset using a visualization tool such as Tableau. You should consider two different phases of exploration.

In the first phase, you should seek to gain an overview of the shape & structure of your dataset. What variables does the dataset contain? How are they distributed? Are there any notable data quality issues? Are there any surprising relationships among the variables? Be sure to also perform “sanity checks” for patterns you expect to see!

In the second phase, you should investigate your initial questions, as well as any new questions that arise during your exploration. For each question, start by creating a visualization that might provide a useful answer. Then refine the visualization (e.g., by adding additional variables, changing sorting or axis scales, transforming your data by filtering or subsetting it, etc.) to develop better perspectives, explore unexpected observations, or sanity check your assumptions. You should repeat this process for each of your questions, but feel free to revise your questions or branch off to explore new questions if the data warrants.

  • Final Deliverable

Your final submission should take the form of a PDF report — similar to a slide show or comic book — that consists of 10 or more captioned visualizations detailing your most important insights. Your “insights” can include important surprises or issues (such as data quality problems affecting your analysis) as well as responses to your analysis questions. To help you gauge the scope of this assignment, see this example report analyzing data about motion pictures . We’ve annotated and graded this example to help you calibrate for the breadth and depth of exploration we’re looking for.

Each visualization image should be a screenshot exported from a visualization tool, accompanied with a title and descriptive caption (2-4 sentences long) describing the insight(s) learned from that view. Provide sufficient detail for each caption such that anyone could read through your report and understand what you’ve learned. You are free, but not required, to annotate your images to draw attention to specific features of the data. You may perform highlighting within the visualization tool itself, or draw annotations on the exported image. To easily export images from Tableau, use the Worksheet > Export > Image… menu item.

The end of your report should include a brief summary of main lessons learned.

Recommended Data Sources

To get up and running quickly with this assignment, we recommend exploring one of the following provided datasets:

World Bank Indicators, 1960–2017 . The World Bank has tracked global human development through indicators such as climate change, economy, education, environment, gender equality, health, and science and technology since 1960. The linked repository contains indicators that have been formatted to facilitate use with Tableau and other data visualization tools. However, you’re also welcome to browse and use the original data by indicator or by country . Click on an indicator category or country to download the CSV file.

Chicago Crimes, 2001–present (click Export to download a CSV file). This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days. Data is extracted from the Chicago Police Department’s CLEAR (Citizen Law Enforcement Analysis and Reporting) system.

Daily Weather in the U.S., 2017 . This dataset contains daily U.S. weather measurements in 2017, provided by the NOAA Daily Global Historical Climatology Network . This data has been transformed: some weather stations with only sparse measurements have been filtered out. See the accompanying weather.txt for descriptions of each column .

Social mobility in the U.S. . Raj Chetty’s group at Harvard studies the factors that contribute to (or hinder) upward mobility in the United States (i.e., will our children earn more than we will). Their work has been extensively featured in The New York Times. This page lists data from all of their papers, broken down by geographic level or by topic. We recommend downloading data in the CSV/Excel format, and encourage you to consider joining multiple datasets from the same paper (under the same heading on the page) for a sufficiently rich exploratory process.

The Yelp Open Dataset provides information about businesses, user reviews, and more from Yelp’s database. The data is split into separate files ( business , checkin , photos , review , tip , and user ), and is available in either JSON or SQL format. You might use this to investigate the distributions of scores on Yelp, look at how many reviews users typically leave, or look for regional trends about restaurants. Note that this is a large, structured dataset and you don’t need to look at all of the data to answer interesting questions. In order to download the data you will need to enter your email and agree to Yelp’s Dataset License .

  • Additional Data Sources

If you want to investigate datasets other than those recommended above, here are some possible sources to consider. You are also free to use data from a source different from those included here. If you have any questions on whether your dataset is appropriate, please ask the course staff ASAP!

  • data.boston.gov - City of Boston Open Data
  • data.gov - U.S. Government Open Datasets
  • U.S. Census Bureau - Census Datasets
  • IPUMS.org - Integrated Census & Survey Data from around the World
  • Federal Elections Commission - Campaign Finance & Expenditures
  • Federal Aviation Administration - FAA Data & Research
  • fivethirtyeight.com - Data and Code behind the Stories and Interactives
  • Buzzfeed News
  • Socrata Open Data
  • 17 places to find datasets for data science projects
  • Visualization Tools

You are free to use one or more visualization tools in this assignment. However, in the interest of time and for a friendlier learning curve, we strongly encourage you to use Tableau . Tableau provides a graphical interface focused on the task of visual data exploration. You will (with rare exceptions) be able to complete an initial data exploration more quickly and comprehensively than with a programming-based tool.

  • Tableau - Desktop visual analysis software . Available for both Windows and MacOS; with free licenses for MIT students.
  • Data Transforms in Vega-Lite . A tutorial on the various built-in data transformation operators available in Vega-Lite.
  • Data Voyager , a research prototype from the UW Interactive Data Lab, combines a Tableau-style interface with visualization recommendations. Use at your own risk!
  • R , using the ggplot2 library or with R’s built-in plotting functions.
  • Jupyter Notebooks (Python) , using libraries such as Altair or Matplotlib .

Data Wrangling Tools

The data you choose may require reformatting, transformation or cleaning prior to visualization. Here are tools you can use for data preparation. We recommend first trying to import and process your data in the same tool you intend to use for visualization . If that fails, pick the most appropriate option among the tools below. Contact the course staff if you are unsure what might be the best option for your data!

  • Graphical Tools
  • Tableau Prep - Tableau itself provides basic facilities for data import, transformation & blending; Tableau Prep is a more sophisticated data preparation tool.
  • Trifacta Wrangler - Interactive tool for data transformation & visual profiling.
  • OpenRefine - A free, open source tool for working with messy data.
  • Programming Tools
  • Arquero — a JavaScript-based data transformation library. See the Introducing Arquero Observable notebook to get started.
  • JavaScript data utilities and/or the Datalib JS library
  • Pandas - Data table and manipulation utilities for Python.
  • dplyr - A library for data manipulation in R.
  • Or, the programming language and tools of your choice…

The assignment score is out of a maximum of 10 points. Submissions that squarely meet the requirements (i.e., the “Satisfactory” column in the rubric below) will receive a score of 8. We will determine scores by judging the breadth and depth of your analysis, whether visualizations meet the expressiveness and effectiveness principles, and how well-written and synthesized your insights are.

We will use the following rubric to grade your assignment. Note, rubric cells may not map exactly to specific point scores.

Breadth of Exploration
  • Excellent: More than 3 questions were initially asked, and they target substantially different portions/aspects of the data.
  • Satisfactory: At least 3 questions were initially asked of the data, but there is some overlap between questions.
  • Poor: Fewer than 3 initial questions were posed of the data.

Depth of Exploration
  • Excellent: A sufficient number of follow-up questions were asked to yield insights that helped to more deeply explore the initial questions.
  • Satisfactory: Some follow-up questions were asked, but they did not take the analysis much deeper than the initial questions.
  • Poor: No follow-up questions were asked after answering the initial questions.

Data Quality
  • Excellent: Data quality was thoroughly assessed with extensive profiling of fields and records.
  • Satisfactory: Simple checks were conducted on only a handful of fields or records.
  • Poor: Little or no evidence that data quality was assessed.

Visualizations
  • Excellent: More than 10 visualizations were produced, and a variety of marks and encodings were explored. All design decisions were both expressive and effective.
  • Satisfactory: At least 10 visualizations were produced. The visual encodings chosen were largely effective and expressive, but some errors remain.
  • Poor: Several ineffective or inexpressive design choices are made. Fewer than 10 visualizations have been produced.

Data Transformation
  • Excellent: More advanced transformations were used to extend the dataset in interesting or useful ways.
  • Satisfactory: Simple transforms (e.g., sorting, filtering) were primarily used.
  • Poor: The raw dataset was used directly, with little to no additional transformation.

Captions
  • Excellent: Captions richly describe the visualizations and contextualize the insight within the analysis.
  • Satisfactory: Captions do a good job describing the visualizations, but could better connect prior or subsequent steps of the analysis.
  • Poor: Captions are missing, overly brief, or shallow in their analysis of visualizations.

Creativity & Originality
  • Excellent: You exceeded the parameters of the assignment, with original insights or a particularly engaging design.
  • Satisfactory: You met all the parameters of the assignment.
  • Poor: You met most of the parameters of the assignment.
  • Submission Details

This is an individual assignment. You may not work in groups.

Your completed exploratory analysis report is due by Tuesday 3/9, 11:59 pm EST . Submit your PDF report on Canvas .


Assignment 2: Exploratory Data Analysis

In this assignment, you will identify a dataset of interest and perform an initial analysis to better understand the shape & structure of the data, investigate initial questions, and develop preliminary insights & hypotheses. Your final submission will take the form of a report consisting of annotated and/or captioned visualizations that convey key insights gained during your analysis.

Step 1: Data Selection

First, pick a topic area of interest to you and find a dataset that can provide insights into that topic. To streamline the assignment, we've pre-selected a number of datasets included below for you to choose from.

However, if you would like to investigate a different topic and dataset, you are free to do so. If working with a self-selected dataset and you have doubts about its appropriateness for the course, please check with the course staff. Be advised that data collection and preparation (also known as data wrangling ) can be a very tedious and time-consuming process. Be sure you have sufficient time to conduct exploratory analysis, after preparing the data.

After selecting a topic and dataset – but prior to analysis – you should write down an initial set of at least three questions you'd like to investigate.

Part 2: Exploratory Visual Analysis

Next, you will perform an exploratory analysis of your dataset using a visualization tool such as Vega-Lite/Altair or Tableau. You should consider two different phases of exploration.

In the first phase, you should seek to gain an overview of the shape & structure of your dataset. What variables does the dataset contain? How are they distributed? Are there any notable data quality issues? Are there any surprising relationships among the variables? Be sure to perform "sanity checks" for any patterns you expect the data to contain.

In the second phase, you should investigate your initial questions, as well as any new questions that arise during your exploration. For each question, start by creating a visualization that might provide a useful answer. Then refine the visualization (by adding additional variables, changing sorting or axis scales, filtering or subsetting data, etc. ) to develop better perspectives, explore unexpected observations, or sanity check your assumptions. You should repeat this process for each of your questions, but feel free to revise your questions or branch off to explore new questions if the data warrants.

Final Deliverable

Your final submission should take the form of a sequence of images – similar to a comic book – that consists of 8 or more visualizations detailing your most important observations.

Your observations can include important surprises or issues (such as data quality problems affecting your analysis) as well as responses to your analysis questions. Where appropriate, we encourage you to include annotated visualizations to guide viewers' attention and provide interpretive context. (If you aren't sure what we mean by "annotated visualization," see this page for some examples .)

Provide sufficient detail such that anyone can read your report and understand what you've learned without already being familiar with the dataset. To help gauge the scope of this assignment, see this example report analyzing motion picture data .

Each image should be a visualization, including any titles or descriptive annotations highlighting the insight(s) shown in that view. For example, annotations could take the form of guidelines and text labels, differential coloring, and/or fading of non-focal elements. You are also free to include a caption for each image, though no more than 2 sentences: be concise! You may create annotations using the visualization tools of your choice, or by adding them using image editing or vector graphics tools.

You must write up your report in a computational notebook format, published online. Examples include Observable notebooks or hosted Jupyter notebooks. Submit the URL of your notebook on the Canvas A2 submission page . For example, to publish using Observable from a private notebook, click the "..." menu button in the upper right and select "Enable link sharing", then copy and submit your notebook URL.

Be sure to enable link sharing if needed (e.g., on Observable), otherwise the course staff will not be able to view your submission!

  • To export a Vega-Lite visualization, be sure you are using the "canvas" renderer, right click the image, and select "Save Image As...".
  • To export images from Tableau, use the Worksheet > Export > Image... menu item.
  • To add an image to an Observable notebook, first add your image as a notebook file attachment: click the "..." menu button and select "File attachments". Then load the image in a new notebook cell: FileAttachment("your-file-name.png").image() .

Potential Data Sources

To get up and running quickly with this assignment, here are some existing data sources.

The World Bank Data, 1960-2017

The World Bank has tracked global human development by indicators such as climate change, economy, education, environment, gender equality, health, and science and technology since 1960. We have 20 indicators from the World Bank for you to explore . Alternatively, you can browse the original data by indicators or by countries . Click on an indicator category or country to download the CSV file.

Data: https://github.com/ZeningQu/World-Bank-Data-by-Indicators

Daily Weather in the U.S., 2017

This dataset contains daily U.S. weather measurements in 2017, provided by the NOAA Daily Global Historical Climatology Network . This data has been transformed: some weather stations with only sparse measurements have been filtered out. See the accompanying weather.txt for descriptions of each column .

Data: weather.csv.gz (gzipped CSV)

Yelp Open Dataset

This dataset provides information about businesses, user reviews, and more from Yelp's database. The data is split into separate files ( business , checkin , photos , review , tip , and user ), and is available in either JSON or SQL format. You might use this to investigate the distributions of scores on Yelp, look at how many reviews users typically leave, or look for regional trends about restaurants. Note that this is a large, structured dataset and you don't need to look at all of the data to answer interesting questions.

In order to download the data you will need to enter your email and agree to Yelp's Dataset License .

Data: Yelp Access Page (data available in JSON & SQL formats)

Additional Data Sources

Here are some other possible sources to consider. You are also free to use data from a source different from those included here. If you have any questions on whether a dataset is appropriate, please ask the course staff ASAP!

  • data.seattle.gov - City of Seattle Open Data
  • data.wa.gov - State of Washington Open Data
  • nwdata.org - Open Data & Civic Tech Resources for the Pacific Northwest
  • data.gov - U.S. Government Open Datasets
  • U.S. Census Bureau - Census Datasets
  • IPUMS.org - Integrated Census & Survey Data from around the World
  • Federal Elections Commission - Campaign Finance & Expenditures
  • Federal Aviation Administration - FAA Data & Research
  • fivethirtyeight.com - Data and Code behind the Stories and Interactives
  • Buzzfeed News - Open-source data from BuzzFeed's newsroom
  • Kaggle Datasets - Datasets for Kaggle contests
  • List of datasets useful for course projects - curated by Mike Freeman

Visualization Tools

You are free to use one or more visualization tools in this assignment. However, in the interest of time and for a friendlier learning curve, we strongly encourage you to use Tableau and/or Vega-Lite .

  • Tableau - Desktop visual analysis software . Available for both Windows and MacOS; register for a free student license.
  • Vega-Lite is a high-level grammar of interactive graphics. It provides a concise, declarative JSON syntax to create an expressive range of visualizations for data analysis and presentation.
  • Jupyter Notebooks (Python) , using libraries such as Altair or Matplotlib .
  • Voyager - Research prototype from the UW Interactive Data Lab . Voyager combines a Tableau-style interface with visualization recommendations. Use at your own risk!

Data Wrangling Tools

The data you choose may require reformatting, transformation or cleaning prior to visualization. Here are tools you can use for data preparation. We recommend first trying to import and process your data in the same tool you intend to use for visualization. If that fails, pick the most appropriate option among the tools below. Contact the course staff if you are unsure what might be the best option for your data!

Graphical Tools

  • Tableau - Tableau provides basic facilities for data import, transformation & blending.
  • Trifacta Wrangler - Interactive tool for data transformation & visual profiling.
  • OpenRefine - A free, open source tool for working with messy data.

Programming Tools

  • Arquero : JavaScript library for wrangling and transforming data tables.
  • JavaScript basics for manipulating data in the browser .
  • Pandas - Data table and manipulation utilities for Python.
  • dplyr - A library for data manipulation in R.
  • Or, the programming language and tools of your choice...

Grading Criteria

Each submission will be graded based on both the analysis process and included visualizations. Here are our grading criteria:

  • Poses clear questions applicable to the chosen dataset.
  • Appropriate data quality assessment and transformation.
  • Sufficient breadth of analysis, exploring multiple questions.
  • Sufficient depth of analysis, with appropriate follow-up questions.
  • Expressive & effective visualizations crafted to investigate analysis questions.
  • Clearly written, understandable annotations that communicate primary insights.

Submission Details

Your completed exploratory analysis report is due Monday 4/26, 11:59pm . As described above, your report should take the form of an online notebook. Submit the URL of your notebook ( ensure any link sharing is enabled! ) on the Canvas A2 page .

Open access | Published: 31 August 2024

Digital consults in heart failure care: a randomized controlled trial

  • Jelle P. Man   ORCID: orcid.org/0000-0003-2040-3686 1 , 2 , 3 ,
  • Maarten A. C. Koole   ORCID: orcid.org/0000-0002-0437-7329 1 , 4 , 5 ,
  • Paola G. Meregalli 1 , 3 ,
  • M. Louis Handoko   ORCID: orcid.org/0000-0002-8942-7865 1 , 3 , 6 ,
  • Susan Stienen   ORCID: orcid.org/0000-0002-5573-9377 1 , 3 ,
  • Frederik J. de Lange   ORCID: orcid.org/0000-0002-7590-9230 1 , 3 ,
  • Michiel M. Winter 1 , 3 , 4 ,
  • Marlies P. Schijven   ORCID: orcid.org/0000-0001-7013-0116 7 ,
  • Wouter E. M. Kok   ORCID: orcid.org/0000-0002-7789-6285 1 , 3 ,
  • Dorianne I. Kuipers 1 , 3 ,
  • Pim van der Harst   ORCID: orcid.org/0000-0002-2713-686X 6 ,
  • Folkert W. Asselbergs   ORCID: orcid.org/0000-0002-1692-8669 1 , 8 , 9 ,
  • Aeilko H. Zwinderman   ORCID: orcid.org/0000-0003-0361-3139 10 , 11 ,
  • Marcel G. W. Dijkgraaf   ORCID: orcid.org/0000-0003-0750-8790 10 , 11 ,
  • Steven A. J. Chamuleau   ORCID: orcid.org/0000-0002-9952-6701 1 , 2 , 3 &
  • Mark J. Schuuring   ORCID: orcid.org/0000-0002-2843-1852 12 , 13 , 14  

Nature Medicine (2024)

Subjects: Combination drug therapy, Heart failure, Outcomes research

Guideline-directed medical therapy (GDMT) has clear benefits on morbidity and mortality in patients with heart failure; however, GDMT use remains low. In the multicenter, open-label, investigator-initiated ADMINISTER trial, patients (n = 150) diagnosed with heart failure and reduced ejection fraction (HFrEF) were randomized (1:1) to receive usual care or a strategy using digital consults (DCs). DCs contained (1) digital data sharing from patient to clinician (pharmacotherapy use, home-measured vital signs and Kansas City Cardiomyopathy Questionnaires); (2) patient education via a text-based e-learning; and (3) guideline recommendations to all treating clinicians. All remotely gathered information was processed into a digital summary that was available to clinicians in the electronic health record before every consult. All patient interactions were conducted remotely by default. The primary endpoint was change in GDMT score over 12 weeks (ΔGDMT); this GDMT score directly incorporated all non-conditional class 1 indications for HFrEF therapy with equal weights. The ADMINISTER trial met its primary outcome of achieving a higher GDMT score in the DC group after a follow-up of 12 weeks (ΔGDMT score in the DC group: median 1.19, interquartile range (0.25, 2.3) arbitrary units versus 0.08 (0.00, 1.00) in usual care; P < 0.001). To our knowledge, this is the first multicenter randomized controlled trial to show that a DC strategy is effective at achieving GDMT optimization. ClinicalTrials.gov registration: NCT05413447 .

Heart failure (HF) affects more than 64 million people worldwide, and this concerning healthcare problem is projected to worsen due to an increasing prevalence 1 . However, the number of healthcare professionals and the resources available in outpatient clinics are limited, which makes it challenging to deliver optimal care.

The prognosis of patients with HF and reduced ejection fraction (HFrEF) has improved considerably since the introduction of recent HF therapies, including β-blockers, angiotensin-converting enzyme inhibitors (ACEs)/angiotensin receptor neprilysin inhibitors (ARNIs), mineralocorticoid receptor antagonists (MRAs), sodium-glucose co-transporter 2 inhibitors (SGLT2is) and intravenous iron administration 2 . In patients with HFrEF, the estimated effect of the medication is the greatest for a combination of β-blocker, ARNI, MRA and SGLT2i, and rapid optimization with a combination is recommended by the 2023 Focused Update of the 2021 European Society of Cardiology (ESC) guidelines 3 , 4 , 5 , 6 , 7 , 8 , 9 , 10 , 11 , 12 , 13 . Strikingly, there is still substantial underuse of guideline-directed medical therapy (GDMT) 13 , 14 , 15 . The explanation for the worldwide underuse of GDMT is multifactorial and includes inter-doctor and inter-hospital variation and the absence of sufficient infrastructure that is able to support rapid optimization 15 .

Remote digital GDMT optimization using at-home measured vital signs and guideline support, defined as multifaceted digital consults (DCs), seems promising 16 , 17 , 18 , 19 , 20 , 21 , 22 , 23 . In patients with inflammatory bowel disease, a multifaceted digital intervention was proven safe and effective at reducing hospitalizations and outpatient consults 24 . Previous studies regarding digital GDMT optimization in patients with HFrEF showed an increase in GDMT usage. However, these studies were single center or had non-randomized designs limiting their generalizability 16 , 17 , 18 , 19 , 20 , 21 , 22 , 23 . Hence, the open-label Assessment of Digital consults in heart failure Management regarding clINical Impact, SafeTy and Efficacy using a Randomized controlled trial (ADMINISTER) was performed. A multifaceted approach was adopted by providing a multifaceted DC constituting the following components: (1) digital data sharing, including the exchange of pharmacotherapy use and home-measured vital signs; (2) patient education via a text-based e-learning; and (3) digital guideline recommendations to treating clinicians.

Between 22 September 2022 and 12 March 2024, 150 patients with HFrEF were randomly assigned to receive DC or usual care (Fig. 1 ). The last patient completed the 12-week follow-up on 4 June 2024. The median age was 70 years (interquartile range (IQR) (58.3, 75.0)), and 74% ( n  = 111) of the patients were male. The groups were similar in terms of baseline characteristics (Table 1 ).

Figure 1: Patient flow diagram.

Primary endpoint

A DC strategy resulted in a higher change in GDMT score (ΔGDMT) than in the usual care group over a 12-week follow-up period (median 1.19, IQR (0.25, 2.34) arbitrary units (AU) in the DC group versus 0.08 (0.00, 1.00) in usual care; P  < 0.001, difference = 0.75, 95% confidence interval (CI) (0.21, 1.12); Fig. 2 ). The internal components of the ΔGDMT score are displayed in Table 2 .

Figure 2: The increase in the median GDMT score is shown (along with error bars displaying the 95% CIs). The asterisk indicates a significant difference according to the two-sided Mann–Whitney U-test (difference = 0.75, 95% CI (0.21, 1.12), P < 0.01).

Secondary endpoints

Time-until-event analysis revealed a lower time to optimal medical therapy (OMT) in the DC group compared to usual care during the 12-week follow-up (hazard ratio = 4.51, P  < 0.01). At 12 weeks, OMT was reached more often in the DC group (22 (28.2%) versus 5 (6.9%), P  < 0.01). No difference was observed in the amount of time investment for patients (3.0 h (1.5, 4.0) versus 2.5 h (1.0, 6.0), P  = 0.59), change in quality of life (QoL) (2.8 AU (−2.1, 9.8) versus 2.1 AU (−2.8, 15.3), P  = 0.70) or satisfaction (0 AU (−1, 0.25) versus 0 AU (−0.75, 0), P  = 0.38) during the 12-week follow-up period. The DC strategy was safe, as there were no differences in the number of hyperkalemia events (9 versus 10, P  = 0.85), estimated glomerular filtration rate (eGFR) < 30 ml min −1  1.73 m −2 (3 versus 4, P  = 0.91) or number of HF hospitalizations per group (10 versus 7, P  = 0.73) during the 12-week follow-up period. The strategy was associated with more remote consults (2.0 (1.0, 3.75) versus 1.0 (0.0, 2.0), P  < 0.01) and the same number of physical consults (1.2 versus 1.4, P  = 0.9) during the 12-week follow-up period. The number of summaries in the DC group sent to the clinician was 3 (2, 5). The net promoter score (NPS) for clinicians was 7.4, which is moderately positive; seven clinicians were promotors; 11 were passive; and six were detractors.

Exploratory endpoints

A DC strategy resulted in a higher GDMT score than usual care in the pre-specified subgroup analyses among patients with new-onset HF, patients who received HF nurse support or no nurse support, age higher or lower than the median, eGFR higher or lower than the median, New York Heart Association (NYHA) class II or class III and non-academic hospitals or tertiary academic referral centers (Fig. 3). No significant interactions were observed. The P values of the interaction terms are included in the Supplementary Information .

Figure 3: The red Kaplan–Meier curve represents the time until OMT in the treatment group, and the blue curve represents the time until OMT in the usual care group. Kaplan–Meier estimates and error bars displaying the 95% CIs are shown.

The ADMINISTER trial showed that a DC strategy was effective at optimizing the GDMT within 12 weeks in patients with HFrEF. A notable additional finding was that a DC strategy was safe, as no differences were observed in the occurrence of hyperkaliemia, kidney dysfunction or hospitalizations. Moreover, this approach did not lead to an increased burden on patient-reported time spent on healthcare, QoL or satisfaction. Furthermore, subgroup analysis revealed that the effect was observed among different NYHA classes, HF nurse support, age and eGFR groups, new-onset or existing HF and non-academic hospitals or tertiary academic referral centers (Fig. 4 ). The ADMINISTER trial hereby provides, to our knowledge, the first multicenter evidence of the efficacy and safety of multifaceted DC for optimizing GDMT.

Figure 4: The median, along with error bars indicating the 95% CI, is shown, as well as the P values of the two-sided Mann–Whitney U-test for the effect in each subgroup.

Most studies of digital systems for HF management focus on monitoring vital signs to detect and act on worsening HF 23 , 25 , 26 , 27 , 28 , 29 . Little focus has thus far been placed on the impact of digital systems for remote GDMT optimization or on a multifaceted approach, but there are some single-center trials and non-randomized studies of digital systems for remote GDMT optimization 19 , 20 , 21 , 23 . The largest single-center randomized controlled trial (RCT) of remote GDMT optimization was conducted by Brahmbhatt et al. 22 . Other pilot RCTs by Antonicelli et al., Artanian et al. and Romero et al. all evaluated similar methodologies 19 , 20 , 21 , 23 . All of these methods use intensive monitoring from a HF titration clinic to optimize GDMT remotely. These methods were effective at increasing GDMT, but considering that these trials were exclusively performed in tertiary centers, questions remain regarding the generalizability of these approaches, as expertise on GDMT optimization is plentiful in these clinics, and nurses are available to frequently check GDMT. In the ADMINISTER trial, DCs are implemented in tertiary referral centers and non-academic hospitals, and the safety, efficacy and feasibility of these consults are, therefore, tested in multiple centers.

Ghazi et al. 30 recently showed with PROMPT-HF that alerts can result in an increased chance of a new GDMT class prescription (relative risk = 1.41, 95% CI (1.03, 1.93); P  = 0.03). PROMPT-HF is, therefore, an important advocate for the use of guideline support for clinicians; however, remote strategies are likely to still be needed to effectively optimize GDMT, as patients with HFrEF need to have recurrent contact with clinicians to achieve GDMT optimization. Without a remote strategy, GDMT optimization would lead to a substantial increase in physical appointments and an associated burden on the healthcare system. The present trial showed that GDMT optimization can be achieved using DCs, which resulted in increased remote contact and no significant difference in time spent on healthcare. The PROMPT-HF study has some limitations regarding its generalizability, as it was a single-center study using a single electronic health record system. The ADMINISTER trial points toward a transferable digital solution that includes guideline support in a remote digital GDMT optimization strategy.

A relevant factor to consider regarding the efficacy of DC is the time investment required from researchers to enable clinicians to perform DCs. The preparation time to make a digital summary in the electronic health record was approximately 12 min for the first consult and 4–5 min for additional consults. The time investment per patient would, therefore, be approximately 17–18 min for the average number of consults performed in the intervention group. The creation of these digital summaries is, however, automatable. This would require the following digital infrastructure:

Automatic generation of a note to clinicians containing medication status and (at-home measured) vital signs before each consult with a patient with HFrEF

The digital distribution of an e-learning and a message to the patient to record vital signs and to check their medication before an appointment

Interactive fields in the digital summary to clinicians that change based on the latest (at-home measured) information

With such a system, recreating the procedures performed in the DC group would require no additional time from investigators.
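As a purely illustrative sketch of the first component, generating such a pre-consult note could look like the following in R. The function, field names and medication labels are hypothetical and invented for illustration; this is not the software used in the trial.

```r
# Hypothetical sketch only: composing a pre-consult summary note from
# home-measured vital signs and current medication status.
make_summary_note <- function(patient) {
  sprintf(
    paste("Pre-consult summary",
          "Systolic BP: %g mmHg | Diastolic BP: %g mmHg",
          "Heart rate: %g bpm | Weight: %.1f kg",
          "Current HF medication: %s", sep = "\n"),
    patient$sbp, patient$dbp, patient$hr, patient$weight,
    paste(patient$medication, collapse = ", ")
  )
}

# Invented example values, not trial data
example_patient <- list(
  sbp = 112, dbp = 70, hr = 68, weight = 81.5,
  medication = c("ARNI (half target dose)", "beta-blocker (target dose)",
                 "SGLT2i (target dose)")
)
cat(make_summary_note(example_patient))
```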

During GDMT optimization, a patient may not tolerate more medication, for example after a drop in systolic blood pressure (BP) to < 90 mmHg or an increase in potassium to > 5.0 mmol l−1. ESC guidelines state that optimization should continue until the specified target dose is reached or until maximal tolerability is reached. This maximum tolerability occurs at different dosages depending on the patient's reaction to the treatment. BP measurements are essential to assess whether OMT was reached. BP was measured more often in the treatment group as part of the home measurements. An increased number of measurements means more data to act on, with the added benefit that the clinician is more aware of the patient's situation. However, it is unlikely that the higher GDMT score and the increased number of patients reaching OMT (22 in the DC group versus five in the control group) were driven to a large extent by the increased number of measurements, because:

Non-persistent drops of systolic BP ≤ 90 mmHg in patients with otherwise normal systolic BP were not classified as hypotension if the patients were not symptomatic.

Of the patients who reached OMT, 81.2% in the treatment group and 60% in the control group were optimized on GDMT while participating in the trial (Table 2 ). This increased prescription rate of GDMT has a far greater impact on a patient's BP than an increased number of measurements.

Among clinicians, the NPS was 7.4, which is a moderately positive score. We used a single-timepoint NPS for clinicians, as the DC strategy first needs to be implemented before a clinician can reflect on its use in practice. Critics frequently indicated in the accompanying free text that a remote strategy does not work for every patient. Promoters frequently indicated that having a summary of relevant (at-home measured) clinical information was useful. Although there have been critiques of the NPS, it has been shown to correspond well with a person's intention to change behavior 31 , 32 . This score thus points toward a moderately positive attitude of clinicians to adopt a DC strategy. More in-depth qualitative research on the concerns of critics might be useful to identify potential improvements. Not yet knowing the efficacy of DC might have lowered the NPS for some clinicians.

Patients with HFrEF exhibit a wide range of clinical profiles, in both variety and severity. Not all older patients use digital solutions 33 , 34 . These patients could have participated less in this study, as they generally have minimal experience with digital technology and sometimes struggle to use it 35 , 36 . However, the patients in this trial were similar in age to those in other studies of patients with HFrEF 26 , 27 , 28 , 29 , 30 , 37 , 38 , 39 , 40 , 41 , 42 . Although we did not track active family support for DC, feedback from outpatient clinics indicated that family members were engaged throughout the optimization process, which might have enhanced patients' confidence in participating in this trial. The refusal percentage of 35.6% in this trial (Fig. 1 ) was similar to the average refusal rate of other pragmatic RCTs (38.4%) 43 .

In the ADMINISTER trial, only patients who had not already reached OMT and who did not have contraindications for all GDMT optimizations were considered for participation. Compared to the CHECK-HF and TITRATE-HF registries, the patients enrolled in the ADMINISTER trial constituted a representative sample of patients with HF, with similar important baseline characteristics, such as age, ischemic or non-ischemic cause of HF, occurrence of chronic obstructive pulmonary disease (COPD) and laboratory values 37 , 38 . Baseline GDMT use rates were also similar: in the CHECK-HF registry, 84% of patients were treated with ACE/angiotensin II receptor blockers (ARB), 86% with β-blocker and 56% with MRA; SGLT2i and ARNI were not available at that time. In the more recent TITRATE-HF registry, 87% of patients were treated with ACE/ARB, 87% with β-blocker and 76% with MRA. Furthermore, 65% of patients were treated with SGLT2i and 57% with ARNI.

The applicability of the DC strategy to healthcare systems outside The Netherlands remains to be tested. This trial was not powered on its secondary outcomes. In this trial, clinicians were not informed of a usual care group assignment in order to optimally capture local practice. However, in some cases, assignment to the usual care group might have been deduced, which might have caused an underestimation of the treatment effect. Changes in heart rate (HR), BP and renal function during the 12-week follow-up indicated that patients were taking their prescribed medication; patient adherence was not otherwise assessed. No validated GDMT score was available at the start of the trial. The GDMT score used (Table 3 ) directly incorporates all non-conditional recommendations for the treatment of chronic HF from the ESC guidelines. The primary outcome can, therefore, also be interpreted as a direct measure of clinician adherence with regard to GDMT optimization.

Despite the efficacy of our intervention, substantial room for improvement persists. Although 29% of the DC group achieved OMT, in clear contrast to the 7% in the usual care group, 71% of the DC group still has considerable potential for further optimization. An important factor in GDMT optimization is, of course, patient motivation. Not all patients are motivated to take (extra) medication. However, many patients are motivated to exchange less-appropriate medication for GDMT recommendations. Also, in this trial, clinicians were requested and advised to book regular appointments but were not forced into a schedule. This allows for easier implementation in various types of clinics and takes into account the work schedules of participating clinicians. However, optimization in this trial was thus also limited by clinicians' availability for GDMT optimization. Greater optimization is expected to be achievable through several key measures: increasing clinician awareness, allocating more time for dedicated HF care paths with personalized digital platforms and implementing even more intensive follow-up with additional contact moments at the outpatient clinic. We suggest that reimbursement structures be explored to reflect the time needed to optimize GDMT in patients with HF using digital pathways. This approach can lead to better management of patients with HF or, in the future, an even larger group of patients with chronic diseases, improving guideline adherence and satisfaction and ultimately leading to better healthcare outcomes.

In summary, the ADMINISTER trial met its primary outcome of achieving a higher ΔGDMT score in the DC group within 12 weeks. Moreover, the DC strategy was safe and did not lead to an increased burden on patient-reported time spent on healthcare, QoL or satisfaction. To our knowledge, this is the first multicenter RCT to show that a DC strategy is effective at achieving GDMT optimization.

The ADMINISTER trial was a prospective, investigator-initiated, pragmatic, multicenter RCT to evaluate the effect of DC on GDMT optimization, safety, time spent on healthcare and quality of care. The study was conducted at four centers in The Netherlands, with a case mix of two academic tertiary referral centers (University Medical Center Utrecht and Amsterdam UMC at two locations: AMC and VUmc) and two non-academic hospitals (Cardiology Center of The Netherlands and Red Cross Hospital). The local medical ethics committee of Amsterdam University Medical Center issued a waiver for this study because two routine treatments were compared (DC and usual care), and the patient burden was limited to only two questionnaires. The institutional review boards of the University Medical Center Utrecht, Cardiology Center of The Netherlands and Red Cross Hospital subsequently approved the trial based on their own review and the previous approval from the medical ethics committee of the Amsterdam University Medical Center. This trial was conducted in accordance with the Declaration of Helsinki and the International Conference of Harmonization Guidelines for Good Clinical Practice. The authors are solely responsible for the design and execution of this study, all study analyses, the drafting and editing of the paper and its final contents. This trial is registered at ClinicalTrials.gov (identifier: NCT05413447 ).

Randomization

Patients were randomly assigned to receive DC or usual care. Randomization and enrollment were performed by the investigator using a computerized randomization tool (Castor EDC). Patients were randomly assigned in a 1:1 ratio, stratified by new-onset or established HF status and by hospital. A variable block randomization algorithm with block sizes of two, four and six was used.

Patient selection

Patients diagnosed with HFrEF (defined as a left ventricular ejection fraction (LVEF) ≤ 40%) who were older than 18 years of age at four participating centers in The Netherlands were eligible for the study. All etiologies of HFrEF were included because they share similar uptitration schemes for GDMT. In this pragmatic trial, clinicians were encouraged to refer patients with HFrEF who had not already reached OMT and who did not have contraindications for all medications. Moreover, the research team screened the ward and outpatient clinics for patients with HFrEF who did not already have OMT or contraindications precluding GDMT optimization. When screening was done, all patients of participating clinicians scheduled for a particular period were assessed. Patients with HFrEF and, at initial assessment, potential for GDMT optimization were thus considered for participation in this study. Researchers excluded patients with NYHA class I ( n  = 9), patients who did not understand the Dutch language ( n  = 3), patients who had an active coronavirus disease 2019 (COVID-19) infection ( n  = 0) and patients who had contraindications for all medications or had already reached maximal tolerability for GDMT optimization ( n  = 35) (Fig. 1 ). All patients provided written informed consent.

Study procedures

Patients randomized to the intervention group received multifaceted DC 43 . A researcher digitally collected vital signs measured at home by the patient, symptoms, information on salt and fluid intake, information on medication and relevant laboratory results, all of which were digitally sent by participants in the DC group. This information was combined with tailored guideline recommendations in a single summary and passed to the clinicians via the electronic health record. The following data were digitally transferred from patient to clinician in this manner:

Pharmacotherapy use and home-measured vital signs (systolic BP, diastolic BP, HR and weight). If the patient did not own a BP monitor, one validated and recommended by the Dutch Heart Foundation was provided for the duration of the trial. If a personal BP monitor was used, it was checked whether that monitor was validated and recommended by the Dutch Heart Foundation; if not, the patient was supplied with a validated BP monitor.

Digital questionnaires on QoL (using the Kansas City Cardiomyopathy Questionnaire), symptoms, checked medication and salt and fluid intake.

A text-based e-learning on HF with a section on recent advances in HF medical therapies. The text was based on patient-directed information on https://www.heartfailurematters.org/nl . Patients performed the BP measurements at home using instructions from the text-based e-learning and the validated BP monitors.

As part of the e-learning, information was given on salt and fluid intake. Patients were first informed about fluid and salt restrictions and how to deal with them and were then asked whether they felt able to adhere to their fluid and salt restriction. The e-learning was delivered once to each patient via email, with the option to revisit it at any time via a dedicated site or email. The text of the e-learning is provided in the supplementary materials. The research staff was available for questions about the e-learning and the digital at-home measurements; the treating cardiologist was also available for questions during any upcoming remote or physical consult.

The summarizing note followed a standardized format and was systematically added to the electronic health record 1 day before every consult with a nurse or cardiologist. The investigators were not able to measure whether this report was read; it was included as a standard note in the electronic health record. A mockup of this note is included in the supplementary materials. All follow-up consults over the 12 weeks after the first consult were, by default, held via video (Microsoft Teams) or telephone (remote). Although consults were planned as remote by default and encouraged for all patients in the DC group, clinicians were allowed to perform a physical consult if they thought this was necessary.

If the patient was randomized to the usual care group, no alterations were made to usual care. Usual care varied per clinician and institution and was left to practice routines; however, every patient contact was recorded. To optimally capture regular practice, clinicians were not informed about the assignment of a patient to the usual care group. A patient consult was defined as any outpatient patient–clinician contact and was divided into remote consults (telephone or video contact) and in-person contact (referred to as physical consults). These contacts were planned ahead in all participating centers. All DCs and consults in the usual care group were performed by cardiologists, cardiologists in training or HF nurses. The trial was open label, as it was immediately apparent when a patient was allocated to the DC group, and clinicians needed to know when to use the DC strategy.

The primary outcome was the ΔGDMT score (Table 3 ). The GDMT score was calculated by dividing the received dose by the target dose according to ESC guidelines, at baseline and at study completion. The score directly incorporates all non-conditional recommendations for the treatment of chronic HF from the ESC guidelines, without any manual weighting factors or alteration. No other validated score was available at the start of the trial. The baseline score was subtracted from the score at study completion for every patient to obtain the ΔGDMT score. For each medicine, the score ranges between a maximum of 1 (corresponding to the optimal treatment according to the guidelines) and a minimum of 0 (corresponding to not administering the medicine). The maximum GDMT score per patient was 6 (all four pharmacotherapy groups constituting GDMT at the target dose, a switch to ARNI and adequate iron status screening and supplementation if needed). The GDMT score thus includes the following items (a minimal sketch of this scoring is given after the list of valid reasons below):

ACE/ARB/ARNI dose.

Because ARNI is recommended as a replacement for ACE, an extra score of 1 is assigned for the replacement of ACE with ARNI.

β-Blocker dose.

MRA dose.

SGLT2i dose.

Intravenous iron administration if the patient had iron deficiency, defined as ferritin < 100 ng ml−1 or ferritin < 300 ng ml−1 with transferrin saturation (TSAT) < 20%. For patients screened for iron deficiency no longer than 1 year ago and, if needed, appropriately supplemented, a score of 1 was assigned.

Valid reasons for not prescribing GDMT were determined by the treating clinician, and a valid reason counted as 1 for the GDMT score. Common valid reasons were:

Persistent systolic BP ≤ 90 mmHg (valid reason for all four drugs). Per the standard operating procedure, a patient who was not symptomatic was considered to have too low a systolic BP after two or more measurements of BP ≤ 90 mmHg 2 .

Symptomatic hypotension.

eGFR < 30 ml min−1 1.73 m−2 (valid reason for ACE/ARB/ARNI and MRA) 2 .

eGFR < 20 ml min−1 1.73 m−2 (valid reason for SGLT2i) 2 .

Potassium > 5.0 mmol l−1 (valid reason for ACE/ARB/ARNI and MRA) 2 .

HR ≤ 60 beats per minute (valid reason for β-blocker).

Allergy to a medication group.
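To make the scoring concrete, the following is a minimal R sketch of the GDMT score as described above. The helper names, drug-class labels and example doses are hypothetical and this is not the trial's analysis code; it only mirrors the stated rules (dose ratio per class capped at 1, a valid reason counting as 1, and the ARNI switch and adequate iron management each adding 1, for a per-patient maximum of 6).

```r
# One drug class: received_dose / target_dose, capped at 1,
# or 1 if a valid reason for not prescribing is documented.
gdmt_component <- function(received_dose, target_dose, valid_reason = FALSE) {
  if (valid_reason) return(1)
  min(received_dose / target_dose, 1)
}

# Full per-patient score: four drug classes plus ARNI switch and iron management.
gdmt_score <- function(doses, targets, valid_reasons,
                       arni_switch = FALSE, iron_ok = FALSE) {
  classes <- mapply(gdmt_component, doses, targets, valid_reasons)
  sum(classes) + as.numeric(arni_switch) + as.numeric(iron_ok)
}

# Hypothetical example: partial dosing at baseline, full optimization at follow-up.
baseline <- gdmt_score(
  doses         = c(raasi = 5,  beta_blocker = 2.5, mra = 0,  sglt2i = 0),
  targets       = c(raasi = 10, beta_blocker = 10,  mra = 25, sglt2i = 10),
  valid_reasons = c(raasi = FALSE, beta_blocker = FALSE, mra = FALSE, sglt2i = FALSE)
)
followup <- gdmt_score(
  doses         = c(raasi = 10, beta_blocker = 5,  mra = 25, sglt2i = 10),
  targets       = c(raasi = 10, beta_blocker = 10, mra = 25, sglt2i = 10),
  valid_reasons = c(raasi = FALSE, beta_blocker = FALSE, mra = FALSE, sglt2i = FALSE),
  arni_switch = TRUE, iron_ok = TRUE
)
delta_gdmt <- followup - baseline  # primary outcome per patient
```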

The following secondary outcomes were collected:

Throughout the 12-week follow-up period, it was monitored whether each patient achieved OMT. For patients who reached optimal pharmacotherapy, the time until OMT was analyzed. OMT was defined as a score of 1 for every medication group.

Patient-reported time spent on healthcare. The following question was asked of patients digitally via Castor EDC as part of the questionnaires sent to the patient: ‘How much time have you spent on your consult appointments in the past 3 months (including travel time and preparations for your consult)?’.

12-week changes in QoL were evaluated using total Kansas City Cardiomyopathy Questionnaire 12 scores at the start and end of the trial period via Castor EDC.

12-week changes in patient satisfaction were evaluated using the NPS. Patients were asked the following question at baseline: ‘How likely are you to recommend your current care with regard to heart failure care to a friend or colleague with heart failure?’ and the following question at the end of the 12-week follow-up: ‘How likely are you to recommend the care provided in this trial to a friend or colleague with heart failure?’. The answer ranged from 1 to 10 in steps of 1, and the questionnaire was distributed via Castor EDC.

Data on the safety of DC were acquired by reporting the total number of hospitalizations, occurrences of eGFR < 30 ml min−1 1.73 m−2 and occurrences of potassium > 5.0 mmol l−1 in the DC group versus usual care during the 12-week follow-up period.

Healthcare consumption was measured using the frequency of remote consults (the number of remote consults in the DC group versus usual care) and physical consults (the number of physical consults in DC group versus usual care) during the 12-week follow-up period.

Satisfaction of the clinicians with DC was evaluated using the NPS. The following question was asked of participating clinicians at the end of the trial: ‘How likely are you to recommend the care provided in this trial (with digital summaries of home measurements and remote consultations) to a colleague?’. The answer ranged from 1 to 10 in steps of 1, and the questionnaire was distributed via Castor EDC. The answers were classified according to the standard classification system for a single-timepoint NPS 31 , 32 : promoters scored a 9 or 10; passive users scored a 7 or 8; and respondents who scored a 6 or lower were classified as critics.
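For reference, the single-timepoint classification described above can be sketched in R as follows. The responses shown are invented, and the net score computed here is the conventional percentage of promoters minus percentage of critics, included only to illustrate the classification rule.

```r
# Classify 1-10 responses into the standard NPS groups.
classify_nps <- function(score) {
  cut(score, breaks = c(0, 6, 8, 10),
      labels = c("critic", "passive", "promoter"))
}

# Conventional net score: % promoters minus % critics.
net_promoter_score <- function(scores) {
  groups <- classify_nps(scores)
  100 * (mean(groups == "promoter") - mean(groups == "critic"))
}

# Hypothetical clinician responses on the 1-10 scale
responses <- c(9, 8, 7, 10, 6, 8, 9, 5)
table(classify_nps(responses))
net_promoter_score(responses)
```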

Statistical analysis

The required sample size was calculated from a superiority perspective using the primary outcome. Division into de novo and established HF was made because of differing reasons for potential undertreatment and different baseline values. New-onset HF was defined as a diagnosis of HFrEF fewer than 3 months ago with no more than one consult after this diagnosis. It was uncertain whether the benefit of the intervention would differ between strata; it was, therefore, assumed to be equal for all strata. According to the sample size calculation in nQuery (Statsols), a sample size of 71 per arm has a statistical power of 80% to detect a difference in means of 0.36 (the difference between a group 1 mean, µ1, of 2.26 and a group 2 mean, µ2, of 1.9), assuming a common standard deviation of 0.76 and using a two-group t-test with a 5% two-sided significance level. The sample size calculation was based on 53 patients treated for HFrEF between 1 January 2022 and 20 March 2022. To allow for 5% dropout, 150 patients in total were enrolled. This sample size seemed feasible given the number of visiting patients with HFrEF. The treatment effect was estimated to be a 0.36 increase in the primary outcome, corresponding to one in three patients receiving the target dosage for one medicine, or one intravenous iron administration/appropriate screening, after 12 weeks in the intervention group.
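As a cross-check, the same calculation can be reproduced with base R's power.t.test; the trial used nQuery, so this is an independent re-derivation under the same assumptions rather than the authors' code.

```r
# Two-group t-test sample size for a mean difference of 0.36 (2.26 vs 1.9),
# common SD 0.76, 80% power, two-sided alpha of 0.05.
power.t.test(delta = 0.36, sd = 0.76, power = 0.80, sig.level = 0.05,
             type = "two.sample", alternative = "two.sided")
# Returns n of approximately 71 per group (142 in total), consistent with
# enrolling 150 patients after allowing for roughly 5% dropout.
```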

The ΔGDMT score was not normally distributed and is presented as median ± IQR. Between-group differences were calculated using the Mann–Whitney U-test. The pre-specified secondary outcomes of the number of remote and physical consults per patient were reported as rates per consult type, and between-group differences were tested using Poisson regression analysis or, in case of overdispersion, negative binomial regression. Time to OMT was analyzed using a Cox proportional hazards model and visualized using Kaplan–Meier curves. The numbers of patients with an eGFR < 30 ml min−1 1.73 m−2, potassium > 5.0 mmol l−1 or at least one hospitalization during the 12-week follow-up were reported as counts and percentages and analyzed using chi-square tests. Time spent on healthcare, 12-week changes in QoL and patient and healthcare satisfaction were not normally distributed and are reported as median ± IQR or as mean and standard deviation, as appropriate. Between-group differences were tested using the Mann‒Whitney U-test. Healthcare satisfaction of clinicians with the intervention was reported using the NPS. A pre-specified subgroup analysis was performed on the primary outcome, with the following covariates: eGFR greater or less than the median, NYHA class, new-onset or existing HF, ischemic or non-ischemic etiology, age greater or less than the median, the use of nurse support and non-academic hospitals or tertiary academic referral centers. The effect of the intervention in each pre-specified subgroup was tested using the Mann–Whitney U-test and quantified as the difference of the medians of the outcomes between the intervention and control groups with the associated CI of this difference. Interactions between subgroups and interventions were subsequently tested by comparing the difference of the effects versus the pooled standard errors using a t-test.
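A minimal R sketch of these pre-specified analyses is shown below. The data frame and column names are simulated and hypothetical, not the trial dataset; they serve only to illustrate which standard functions correspond to each test.

```r
library(survival)  # coxph, Surv, survfit
library(MASS)      # glm.nb

# Simulated, hypothetical data for illustration only (not trial data)
set.seed(1)
df <- data.frame(
  arm               = rep(c("DC", "usual care"), each = 75),
  delta_gdmt        = c(rnorm(75, 1.1, 0.8), rnorm(75, 0.6, 0.8)),
  n_remote_consults = rpois(150, lambda = rep(c(2, 1), each = 75)),
  followup_days     = sample(14:84, 150, replace = TRUE),
  reached_omt       = rbinom(150, 1, rep(c(0.29, 0.07), each = 75)),
  hyperkalemia      = rbinom(150, 1, 0.05)
)

# Primary outcome: between-group difference in the (non-normal) delta GDMT score
wilcox.test(delta_gdmt ~ arm, data = df)

# Consult rates: Poisson regression, or negative binomial in case of overdispersion
glm(n_remote_consults ~ arm, family = poisson, data = df)
MASS::glm.nb(n_remote_consults ~ arm, data = df)

# Time to OMT: Cox proportional hazards model and Kaplan-Meier curves
coxph(Surv(followup_days, reached_omt) ~ arm, data = df)
survfit(Surv(followup_days, reached_omt) ~ arm, data = df)

# Safety counts: chi-square test of event occurrence by arm
chisq.test(table(df$arm, df$hyperkalemia))
```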

All primary, secondary and exploratory analyses were pre-specified in the statistical analysis plan or requested by reviewers and were performed in the intention-to-treat population. The trial did not have a data safety monitoring board, as it was considered to be a low-risk trial. The analysis was carried out using R version 4.3.1. All recordkeeping was done using Castor EDC (2022.3.0.0). A two-tailed P value less than 0.05 was considered significant for all outcomes. This trial was registered under clinical trial registration number NCT05413447 at ClinicalTrials.gov.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

Anonymized participant data can be made available upon requests directed to the corresponding author. Proposals will be reviewed on the basis of scientific merit, ethical review, available resources and regulatory requirements. All requests complying with legal and ethical requirements for data sharing will be granted. Responses to such requests can be expected within 1 month. After approval of a proposal, anonymized data will be made available for re-use. A steering committee will have the right to review and comment on any draft papers based on these data before publication.

Code availability

No custom computational code or software was developed for this study. Analyses were performed with publicly available software packages as described in the Methods section.

Savarese, G. et al. Global burden of heart failure: a comprehensive and updated review of epidemiology. Cardiovasc. Res. 118 , 3272–3287 (2023).

McDonagh, T. A. et al. 2021 ESC guidelines for the diagnosis and treatment of acute and chronic heart failure. Eur. Heart J. 42 , 3599–3726 (2021).

Fatima, K., Butler, J. & Fonarow, G. C. Residual risk in heart failure and the need for simultaneous implementation and innovation. Eur. J. Heart Fail. 25 , 1477–1480 (2023).

Tromp, J. et al. A systematic review and network meta-analysis of pharmacological treatment of heart failure with reduced ejection fraction. JACC Heart Fail. 10 , 73–84 (2022).

Fauvel, C. et al. Differences between heart failure specialists and non‐specialists regarding heart failure drug implementation and up‐titration. Eur. J. Heart Fail. 25 , 1884–1886 (2023).

Greene, S. J., Butler, J. & Fonarow, G. C. Simultaneous or rapid sequence initiation of quadruple medical therapy for heart failure—optimizing therapy with the need for speed. JAMA Cardiol. 6 , 743–744 (2021).

Chioncel, O. et al. Non‐cardiac comorbidities and intensive up‐titration of oral treatment in patients recently hospitalized for heart failure: insights from the STRONG‐HF trial. Eur. J. Heart Fail. 25 , 1994–2006 (2023).

Shen, L. et al. Accelerated and personalized therapy for heart failure with reduced ejection fraction. Eur. Heart J. 43 , 2573–2587 (2022).

Patolia, H., Khan, M. S., Fonarow, G. C., Butler, J. & Greene, S. J. Implementing guideline-directed medical therapy for heart failure: JACC Focus Seminar 1/3. J. Am. Coll. Cardiol. 82 , 529–543 (2023).

Jalloh, M. B. et al. Bridging treatment implementation gaps in patients with heart failure: JACC Focus Seminar 2/3. J. Am. Coll. Cardiol. 82 , 544–558 (2023).

Packer, M. & McMurray, J. J. V. Rapid evidence-based sequencing of foundational drugs for heart failure and a reduced ejection fraction. Eur. J. Heart Fail. 23 , 882–894 (2021).

Mebazaa, A. et al. Safety, tolerability and efficacy of up-titration of guideline-directed medical therapies for acute heart failure (STRONG-HF): a multinational, open-label, randomised, trial. Lancet 400 , 1938–1952 (2022).

Brunner-La Rocca, H.-P. et al. Contemporary drug treatment of chronic heart failure with reduced ejection fraction. JACC Heart Fail. 7 , 13–21 (2019).

Pierce, J. B. et al. Contemporary use of sodium-glucose cotransporter-2 inhibitor therapy among patients hospitalized for heart failure with reduced ejection fraction in the US: the Get With The Guidelines-Heart Failure registry. JAMA Cardiol. 8 , 652–661 (2023).

Savarese, G. et al. Heart failure drug treatment—inertia, titration, and discontinuation: a multinational observational study (EVOLUTION HF). JACC Heart Fail. 11 , 1–14 (2023).

Samsky, M. D. et al. Patient perspectives on digital interventions to manage heart failure medications: the VITAL-HF Pilot. J. Clin. Med. 12 , 4676 (2023).

Giordano, A., Zanelli, E. & Scalvini, S. Home-based telemanagement in chronic heart failure: an 8-year single-site experience. J. Telemed. Telecare 17 , 382–386 (2011).

Massot, M. et al. Ultra-fast remote up-titration of heart failure treatment: a safe, efficient and feasible protocol. Eur. Heart J. 43 , ehac544.945 (2022).

Antonicelli, R., Mazzanti, I., Abbatecola, A. M. & Parati, G. Impact of home patient telemonitoring on use of β-blockers in congestive heart failure. Drugs Aging 27 , 801–805 (2010).

Romero, E. et al. Remote monitoring titration clinic to implement guideline-directed therapy for heart failure patients with reduced ejection fraction: a pilot quality-improvement intervention. Front. Cardiovasc. Med. 10 , 1202615 (2023).

Artanian, V. et al. Impact of remote titration combined with telemonitoring on the optimization of guideline-directed medical therapy for patients with heart failure: internal pilot of a randomized controlled trial. JMIR Cardio. 4 , e21962 (2020).

Brahmbhatt, D. H. et al. The effect of using a remote patient management platform in optimizing guideline-directed medical therapy in heart failure patients: a randomized controlled trial. JACC Heart Fail. 12 , 678–690 (2024).

Man, J. P. et al. Digital solutions to optimize guideline-directed medical therapy prescriptions in heart failure patients: current applications and future directions. Curr. Heart Fail. Rep. 21 , 147–161 (2024).

de Jong, M. J. et al. Telemedicine for management of inflammatory bowel disease (myIBDcoach): a pragmatic, multicentre, randomised controlled trial. Lancet 390 , 959–968 (2017).

Brahmbhatt, D. H. & Cowie, M. R. Remote management of heart failure: an overview of telemonitoring technologies. Card. Fail. Rev. 5 , 86–92 (2019).

Brugts, J. J. et al. Remote haemodynamic monitoring of pulmonary artery pressures in patients with chronic heart failure (MONITOR-HF): a randomised clinical trial. Lancet 401 , 2113–2123 (2023).

Hernandez, A. F. et al. Multiple cArdiac seNsors for mAnaGEment of Heart Failure (MANAGE-HF)—phase I evaluation of the integration and safety of the HeartLogic multisensor algorithm in patients with heart failure. J. Card. Fail. 28 , 1245–1254 (2022).

Lindenfeld, J. A. et al. Haemodynamic-guided management of heart failure (GUIDE-HF): a randomised controlled trial. Lancet 398 , 991–1001 (2021).

Adamson, P. B. et al. Pulmonary artery pressure-guided heart failure management reduces 30-day readmissions. Circ. Heart Fail. 9 , e002600 (2016).

Ghazi, L. et al. Electronic health record alerts for management of heart failure with reduced ejection fraction in hospitalized patients: the PROMPT-AHF trial. Eur. Heart J. 44 , 4233–4242 (2023).

Krol, M. W., de Boer, D., Delnoij, D. M. & Rademakers, J. J. D. J. M. The Net Promoter Score—an asset to patient experience surveys? Health Expect. 18 , 3099–3109 (2015).

Lucero, K. S. Net promoter score (NPS): what does net promoter score offer in the evaluation of continuing medical education? J. Eur. CME 11 , 2152941 (2022).

Schuuring, M. J., Man, J. P. & Chamuleau, S. A. J. Inclusive health tracking. JACC Adv. 2 , 100545 (2023).

Guasti, L. et al. Digital health in older adults for the prevention and management of cardiovascular diseases and frailty. A clinical consensus statement from the ESC Council for Cardiology Practice/Taskforce on Geriatric Cardiology, the ESC Digital Health Committee and the ESC Working Group on e-Cardiology. ESC Heart Fail. 9 , 2808–2822 (2022).

Bujnowska-Fedak, M. & Grata-Borkowska, U. Use of telemedicine-based care for the aging and elderly: promises and pitfalls. Smart Homecare Technol. TeleHealth 3 , 91–105 (2015).

Chen, C., Ding, S. & Wang, J. Digital health for aging populations. Nat. Med. 29 , 1623–1630 (2023).

Malgie, J. et al. Contemporary guideline-directed medical therapy in de novo, chronic, and worsening heart failure patients: first data from the TITRATE-HF study. Eur. J. Heart Fail. 26 , 1549–1560 (2024).

Brunner-La Rocca, H. P. et al. Contemporary drug treatment of chronic heart failure with reduced ejection fraction: the CHECK-HF registry. JACC Heart Fail. 7 , 13–21 (2019).

McMurray, J. J. V. et al. Dapagliflozin in patients with heart failure and reduced ejection fraction. N. Engl. J. Med. 381 , 1995–2008 (2019).

Packer, M. et al. Cardiovascular and renal outcomes with empagliflozin in heart failure. N. Engl. J. Med. 383 , 1413–1424 (2020).

Ghazi, L. et al. Electronic alerts to improve heart failure therapy in outpatient practice: a cluster randomized trial. J. Am. Coll. Cardiol. 79 , 2203–2213 (2022).

Lin, L. Y., Jochym, N. & Merz, J. F. Refusal rates and waivers of informed consent in pragmatic and comparative effectiveness RCTs: a systematic review. Contemp. Clin. Trials 104 , 106361 (2021).

Man, J. P. et al. Digital consults to optimize guideline-directed therapy: design of a pragmatic multicenter randomized controlled trial. ESC Heart Fail. 11 , 560–569 (2024).

Acknowledgements

This investigator-initiated study was funded by the Amsterdam University Medical Center Innovation Grant 2021 without any contribution from an industry partner. The funder had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript. We thank all heart failure nurses, other treating clinicians and medical students for their contributions.

Author information

Authors and Affiliations

Department of Cardiology, Amsterdam UMC, Amsterdam, The Netherlands

Jelle P. Man, Maarten A. C. Koole, Paola G. Meregalli, M. Louis Handoko, Susan Stienen, Frederik J. de Lange, Michiel M. Winter, Wouter E. M. Kok, Dorianne I. Kuipers, Folkert W. Asselbergs & Steven A. J. Chamuleau

Netherlands Heart Institute, Utrecht, The Netherlands

Jelle P. Man & Steven A. J. Chamuleau

Amsterdam Cardiovascular Science, University of Amsterdam, Amsterdam, The Netherlands

Jelle P. Man, Paola G. Meregalli, M. Louis Handoko, Susan Stienen, Frederik J. de Lange, Michiel M. Winter, Wouter E. M. Kok, Dorianne I. Kuipers & Steven A. J. Chamuleau

Cardiology Center of the Netherlands, Utrecht, The Netherlands

Maarten A. C. Koole & Michiel M. Winter

Department of Cardiology, Red Cross Hospital, Beverwijk, The Netherlands

Maarten A. C. Koole

Department of Cardiology, University Medical Center Utrecht, Utrecht, The Netherlands

M. Louis Handoko & Pim van der Harst

Department of Surgery, Amsterdam UMC, Amsterdam, The Netherlands

Marlies P. Schijven

Institute of Health Informatics, University College London, London, UK

Folkert W. Asselbergs

National Institute for Health Research, University College London Hospitals, Biomedical Research Centre, University College London, London, UK

Department of Epidemiology and Data Science, Amsterdam UMC, Amsterdam, The Netherlands

Aeilko H. Zwinderman & Marcel G. W. Dijkgraaf

Methodology, Amsterdam Public Health, Amsterdam, The Netherlands

Department of Cardiology, Medical Spectrum Twente, Enschede, The Netherlands

Mark J. Schuuring

Department of Biomedical Signals and Systems, University of Twente, Enschede, The Netherlands

Cardiovascular Health Research Pillar, University of Twente, Enschede, The Netherlands

Contributions

J.P.M. was involved in the design, execution and analysis of the trial and the writing of the manuscript. M.A.C.K., M.L.H. and S.S. were involved in the execution of the trial and the editing of the manuscript. M.G.W.D., W.E.M.K. and M.P.S. were involved with the design of the trial and the editing of the manuscript. F.J.d.L. was involved in the design and execution of the trial and the editing of the manuscript. D.I.K., M.M.W., P.G.M. and P.v.d.H. were involved in the execution of the trial. F.W.A. was involved with the analysis and the editing of the manuscript. A.H.Z. was involved with the statistical analysis of the trial and the editing of the manuscript. S.A.J.C. and M.J.S. were involved with the design, execution and analysis of the trial and the editing of the manuscript.

Corresponding author

Correspondence to Mark J. Schuuring.

Ethics declarations

Competing interests

M.L.H. is supported by the Dutch Heart Foundation (Dr. E. Dekker Senior Clinical Scientist Grant 2020T058) and CVON (2020B008 RECONNEXT). M.L.H. received an investigator-initiated research grant from Vifor Pharma; an educational grant from Boehringer Ingelheim and Novartis; and speaker/consultancy fees from Abbott, AstraZeneca, Bayer, Boehringer Ingelheim, Merck Sharp & Dohme, Novartis, Daiichi Sankyo, Quin and Vifor Pharma. W.E.M.K. received a speaking fee from Novartis. F.W.A. received grant funding from the European Union Horizon scheme (AI4HF 101080430 and DataTools4Heart 101057849). M.J.S. received an independent research grant from AstraZeneca to the research institute. The other authors declare no competing interests.

Peer review

Peer review information

Nature Medicine thanks Christiane Angermann and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Lorenzo Righetto, in collaboration with the Nature Medicine team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Fig. 1, Tables 1–4 and Notes 1–4.

Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article

Man, J.P., Koole, M.A.C., Meregalli, P.G. et al. Digital consults in heart failure care: a randomized controlled trial. Nat Med (2024). https://doi.org/10.1038/s41591-024-03238-6

Received: 02 July 2024

Accepted: 07 August 2024

Published: 31 August 2024

DOI: https://doi.org/10.1038/s41591-024-03238-6
