
Applied Machine Learning in Python

Introduction ¶

This book aims to provide an accessible introduction to applying machine learning with Python, in particular using the scikit-learn library. I assume that you’re already somewhat familiar with Python and the libraries of the scientific Python ecosystem. If you find that you have a hard time following some of the details of numpy, matplotlib, and pandas, I highly recommend you look at Jake VanderPlas’ Python Data Science Handbook.

Scope and Goals ¶

After reading this book, you will be able to do exploratory data analysis on a dataset, identify potential machine learning solutions, and implement and evaluate them. The focus of the book is on tried-and-true methodology for solving real-world machine learning problems. However, we will not go into the details of productionizing and deploying the solutions. We will mostly focus on what’s known as tabular data, i.e. data that would usually be represented as a pandas DataFrame, Excel spreadsheet, or CSV file. While we will discuss working with text data in Chapter TODO, there are many more advanced techniques, for which I’ll point you towards Dive into Deep Learning by Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola. We will not look at image recognition, video or speech data, or time series forecasting, though many of the core concepts described in this book also apply there.

What is machine learning? ¶

Machine learning, also known as predictive modeling in statistics, is a research field and a collection of techniques to extract knowledge from data, often used to automate decision-making processes. Applications of machine learning are pervasive in technology, in particular in complex websites such as Facebook, Amazon, YouTube, or Google. These sites use machine learning to personalize the experience, show relevant content, decide on advertisements, and much more. Without machine learning, none of these services would look anything like they do today. Outside the web, machine learning has also become integral to commercial applications in manufacturing, logistics, material design, financial markets, and many more. Finally, over the last few years, machine learning has also become essential to research in practically all data-driven sciences, including physics, astronomy, biology, medicine, earth sciences, and social sciences.

There are three main sub-areas of machine learning: supervised learning, unsupervised learning, and reinforcement learning, each of which applies to a somewhat different setting. We’ll discuss each in turn, and give some examples of how they can be used.

Supervised Learning ¶

Supervised learning is by far the most commonly used in practice. In supervised learning, a model is built from a dataset of input-output pairs, where the inputs are known as features or independent variables, which we’ll denote by \(x\) , and the output is known as the target or label, which we’ll denote by \(y\) . The input here is a representation of an entity of interest, say a customer of your online shop, represented by their age, location, and shopping history. The output is a quantity of interest that we want our model to predict, say whether they would buy a particular product if we recommend it to them. To build a model, we need to collect many such pairs, i.e. we need to build records of many customers and their decisions about whether or not they bought the product after a recommendation was shown to them. Such a set of input-output pairs collected for the purpose of building a supervised machine learning model is called a training set .


Once we have collected this dataset, we can (attempt to) build a supervised machine learning model that will make a prediction for a new user that wasn’t included in the training dataset. That might enable us to make better recommendations, i.e. only show recommendations to a user that’s likely to buy.
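As a concrete sketch of this workflow in scikit-learn, here is what such a model could look like. The customer records, feature choices, and the use of logistic regression are all hypothetical, purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training set: each row represents a customer by
# (age, number of past purchases); y records whether they bought
# the recommended product (1) or not (0).
X_train = np.array([[25, 1], [42, 8], [31, 0], [58, 12], [19, 2], [36, 5]])
y_train = np.array([0, 1, 0, 1, 0, 1])

model = LogisticRegression().fit(X_train, y_train)

# Make a prediction for a new customer who was not in the training set.
new_customer = np.array([[40, 7]])
prediction = model.predict(new_customer)
```

How to choose a model, represent the features, and evaluate the predictions properly is the subject of the rest of the book.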


The name supervised learning comes from the fact that during learning, the dataset contains the correct targets, which act as a supervisor for the model training.

For both regression and classification, it’s important to keep in mind the concept of generalization. Let’s say we have a regression task: we have features, that is, data vectors \(x_i\), and targets \(y_i\) drawn from a joint distribution. We now want to learn a function \(f\) such that \(f(x)\) is approximately \(y\), not on the training data, but on new data drawn from the same distribution. This is what’s called generalization, and it is a core distinction from function approximation: in principle, we don’t care how well we do on the \(x_i\); we only care how well we do on new samples from the distribution. We’ll go into much more detail about generalization when we dive into supervised learning.
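In practice, generalization is estimated by holding out data the model never sees during fitting. A minimal sketch with synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Synthetic data from one joint distribution: y depends linearly on x, plus noise.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 2 * X[:, 0] + rng.normal(scale=0.5, size=200)

# Hold out data the model never sees during fitting; performance there
# estimates generalization rather than memorization.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)
train_score = model.score(X_train, y_train)  # fit quality on seen data
test_score = model.score(X_test, y_test)     # estimate of generalization
```

The held-out score is the quantity we actually care about; we will return to this pattern many times.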

Classification and Regression ¶

\(^1\) There are many other kinds of supervised learning tasks such as ranking or probability estimation, however, we will focus on classification and regression, the most commonly used tasks, in this book.

There are two main kinds of supervised learning tasks, called classification and regression \(^1\) . If the target of interest \(y\) that we want to predict is a continuous quantity, the task is a regression problem. If it is discrete, i.e. one of several distinct choices, then it is a classification problem. For example, predicting the time it will take a patient to recover from an illness, say measured in days, is a regression task. Our model might predict that a patient will be ready to leave the hospital 3.5 days after admission, or 5, or 10. This is regression because time is clearly a continuous quantity, and there is a clear sense of ordering and distance between the different possible predictions. If the correct prediction is that the patient can leave after 4.5 days, but instead we predict 5, that might not be exactly correct, but it might still be a useful prediction. Even 6 might be somewhat useful, while 20 would be totally wrong.

\(^2\) This might be more naturally formulated as a multi-label task, which is basically a series of binary classification tasks. There could be more than one medication that leads to success, so this could be phrased as a yes/no question for each candidate.

An example of a classification task would be predicting which of a set of medications a patient would respond best to \(^2\) . Here, we have a fixed set of disjoint candidates that are known a priori, and there is usually no order or sense of distance between the classes. If medication A is the best, then predicting any other medication is a mistake, so we need to predict the exact right outcome for the prediction to be accurate. A very common instance of classification is the special case of binary classification, where there are exactly two choices. Often this can be formulated as a “yes/no” question to which you want to predict an answer. Examples of this are “is this email spam?”, “is there a pedestrian on the street?”, “will this customer buy this product?” or “should we run an X-ray on this patient?”.

The distinction between classification and regression is important, as it changes the algorithms we will use and the way we measure success. For classification, a common metric is accuracy, the fraction of correctly classified examples, i.e. the fraction of times the model predicted the right class. For regression, on the other hand, a common metric is mean squared error, which is the average squared distance from the prediction to the correct answer. In other words, in regression you want the prediction to be close to the truth, while in classification you want to predict exactly the correct class. In practice, the difference is a bit more subtle, and we will discuss model evaluation in depth in chapter TODO.
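Both metrics are available in scikit-learn. A small worked example with made-up labels and predictions:

```python
from sklearn.metrics import accuracy_score, mean_squared_error

# Classification: accuracy is the fraction of exactly correct predictions.
y_true_class = [1, 0, 1, 1, 0]
y_pred_class = [1, 0, 0, 1, 0]
acc = accuracy_score(y_true_class, y_pred_class)  # 4 of 5 correct -> 0.8

# Regression: mean squared error rewards being close, not only exactly right.
y_true_reg = [4.5, 10.0, 3.0]
y_pred_reg = [5.0, 9.0, 3.0]
mse = mean_squared_error(y_true_reg, y_pred_reg)  # (0.5**2 + 1**2 + 0**2) / 3
```

Note how the regression metric gives partial credit to the prediction of 5.0 for a true value of 4.5, while accuracy counts the misclassified third example as simply wrong.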

Usually it’s quite clear whether a task is classification or regression, but there are some cases that could be solved using either approach. A somewhat common example is ratings in the 5-star rating system that’s popular on many online platforms. Here, the possible ratings are one star, two stars, three stars, four stars, and five stars. These are discrete choices, so you could apply a classification algorithm. On the other hand, there is a clear ordering, and if the real answer is one star, predicting two stars is probably better than predicting five stars, which means it might be more appropriate to use regression. Which one is more appropriate depends on the particular algorithm you’re using and how it integrates into your larger workflow.

Generalization ¶

When building a model for classification or regression, keep in mind that what we’re interested in is applying the model to new data for which we do not know the outcome. If we build a model for detecting spam emails, but it only works on emails in the training set, i.e. emails the model has seen during model building, it will be quite useless. What we want from a spam detection algorithm is to predict reasonably well whether a new email, one that was not included in the training set, is spam or not. The ability of a supervised model to make accurate predictions on new data is called generalization, and it is the core goal of supervised learning. Without asking for generalization, an algorithm could solve the spam detection task on the training data by just storing all the data and, when presented with one of these emails, looking up what the correct answer was. This approach is known as memorization, but it cannot be applied to new data.
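A memorizing “model” can be sketched in a few lines of plain Python; the emails and labels here are invented for illustration:

```python
# A "memorizing" spam model: store every training email with its label.
# The emails and labels are made up for illustration.
training_emails = {
    "win a free prize now": 1,   # spam
    "meeting moved to 3pm": 0,   # not spam
    "cheap pills online": 1,     # spam
}

def memorizing_predict(email):
    # Perfect on the training set, but returns None for anything new:
    # memorization gives no way to handle unseen emails.
    return training_emails.get(email)

on_seen = memorizing_predict("meeting moved to 3pm")  # looks up stored answer
on_new = memorizing_predict("lunch tomorrow?")        # no stored answer
```

The memorizer is flawless on its own training data and useless on anything else, which is exactly why training-set performance alone tells us nothing about generalization.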

Conditions for success ¶

For a supervised learning model to generalize well, i.e. for it to be able to make accurate predictions on new data, some key assumptions must be met:

First, the necessary information for making the correct prediction actually needs to be encoded in the training data . For example, if I try to learn to predict a fair coin flip before the coin is tossed, I won’t be able to build a working machine learning model, no matter what I choose as the input features. The process is entirely random, and the information needed to make a prediction is just not available. More technically, one might say the process has high intrinsic randomness that we cannot overcome by building better models. While you’re unlikely to encounter a case as extreme (and obvious) as a coin toss, many processes in the real world are quite random (such as the behavior of people), and it’s impossible to make entirely accurate predictions for them.

In other cases, a prediction might be possible in principle, but we might not have provided the right information to the model. For example, it might be possible for a machine learning model to learn to diagnose pneumonia in a patient, but not if the only information about the patient that we present to the model is their shopping habits and wardrobe. If we use a chest X-ray as a representation of the patient, together with a collection of symptoms, we will likely have better success. Even if the information is represented in the input, learning might still fail if the model is unable to extract the information. For example, visual stimuli are very easy for humans to interpret, but in general much harder for machine learning algorithms to understand. Consequently, it would be much harder for a machine to determine whether a graffito is offensive from a photograph than if the same information was represented as a text file.

Secondly, the training dataset needs to be large and varied enough to capture the variability of the process . In other words, the training data needs to be representative of the whole process, not only a small portion of it. Humans are very good at abstracting properties, and a child will be able to understand what a car is after seeing only a handful of examples. Machine learning algorithms, on the other hand, require a lot of variability to be present in the training data. For example, to learn the concept of what a car looks like, an algorithm likely needs to see pictures of vans, of trucks, of sedans; pictures from the front, the side, and above; pictures of cars parked and in traffic; pictures in rain and in sunshine, in a garage and outdoors; maybe even pictures taken by a phone camera and pictures taken by a news camera. As we said before, the whole point of supervised learning is to generalize, so we want our model to apply to new settings. However, how new a setting can be depends on the representation of the data and the algorithm in question. If the algorithm has only ever seen trucks, it might not recognize a sedan. If the algorithm has never seen a snow-covered car, it’s unlikely to recognize one. Photos (also known as natural images in machine learning) are a very extreme example, as they have a lot of variability, and so often require a lot of training data. If your data has a simple structure, or the relationship between your features and your target is simple, then only a handful of training examples might be enough.


Third and finally, the data that the model is applied to needs to be generated from the same process as the data the model was trained on . A model can only generalize to data that in essence adheres to the same rules and has the same structure. If I collect data about public transit ridership in Berlin and use it to make predictions in New York, my model is likely to perform poorly. While I might be able to measure the same things, say the number of people at stations, population density, holidays, etc., there are so many differences between data collected in Berlin and data collected in New York that it’s unlikely a model trained on one could predict well on the other. As another example, let’s say you train an image recognition model for recognizing hot dogs on a dataset of stock photos, and you want to deploy it to an app using a phone camera. This is also likely to fail, as stock photography doesn’t resemble photos taken by users pointing their phone. Stock photography is professionally produced and well-lit, the angles are carefully chosen, and often the food is altered to show it in its best light (have you noticed how food in a restaurant never looks like it does in a commercial?). In short, machine learning requires you to use a training dataset that was generated by the same process as the data the model will be applied to.

Mathematical Background ¶

From a mathematical standpoint, supervised learning assumes that there is a joint distribution \(p(x, y)\) and that the training dataset consists of independent, identically distributed (i.i.d.) samples from this joint distribution. The model is then applied to new data sampled from the same distribution, but for which \(y\) is unknown. The model is used to estimate \(p(y | x)\) , or more commonly the mode of this distribution, i.e. the most likely value for \(y\) to take given the \(x\) we observed. In the case of learning to predict a coin flip, you could actually learn a very accurate model of \(p(y | x)\) that predicts heads and tails with equal probability. There is no way to predict the particular outcome itself, though.
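A quick simulation illustrates this point: for a fair coin, we can estimate the distribution very accurately, yet no prediction of individual outcomes can beat chance. This is a toy sketch, not a modeling recipe:

```python
import numpy as np

# Simulate a fair coin; any "features" would carry no information here.
rng = np.random.RandomState(42)
flips = rng.randint(2, size=10_000)  # 0 = tails, 1 = heads

# We can model the distribution p(y) very accurately...
p_heads = flips.mean()  # close to 0.5

# ...but predicting the mode (always "heads", say) is right only about
# half the time: individual outcomes remain unpredictable.
mode_accuracy = (flips == 1).mean()
```

An accurate distribution estimate and accurate individual predictions are two different things; for a fair coin, only the former is achievable.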

The third requirement for success can be expressed by saying that the test data is sampled i.i.d. from the same distribution \(p(x, y)\) that the training data was generated from.

Unsupervised Learning ¶

In unsupervised machine learning, we are usually just given data points \(x\) , and the goal is to learn something about the structure of the data. This is usually a more open-ended task than what we saw in supervised learning. This kind of task is called unsupervised because even during training, there is no “supervision” providing a correct answer. There are several sub-categories of unsupervised learning that we’ll discuss in Chapter 3, in particular clustering, dimensionality reduction, and signal decomposition. Clustering is the task of finding coherent groups within a dataset, say subgroups of customers that behave in a similar way, such as “students”, “new parents”, and “retirees”, each with a distinct shopping pattern. Here, in contrast to classification, the groups are not pre-defined. We might not know what the groups are, how many groups there are, or even whether there is a coherent way to define any groups. There might also be several different ways the data could be grouped: say you’re looking at portraits. One way to group them could be by whether the subject wears glasses or not. Another could be by the direction they are facing. Yet another might be hair color or skin color. If you tell an algorithm to cluster the data, you don’t know which aspect it will pick up on, and usually manually inspecting the groups or clusters is the only way to interpret the results.
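As a small illustration (with synthetic data standing in for customer records), k-means clustering in scikit-learn finds groups without ever being told what they mean:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic "customers": three behavioral groups, e.g. (age, purchases per month).
# The algorithm is never told which row belongs to which group.
rng = np.random.RandomState(0)
centers = np.array([[20.0, 1.0], [35.0, 6.0], [68.0, 3.0]])
X = np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in centers])

# Ask for three clusters; interpreting them requires inspecting the results,
# e.g. looking at the cluster centers the algorithm found.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_
found_centers = kmeans.cluster_centers_
```

Note that we had to specify the number of clusters ourselves, and that the labels 0, 1, 2 carry no meaning until we inspect what each cluster contains.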

Two other, related, unsupervised learning tasks are dimensionality reduction and signal decomposition. Here, we are not looking for groups in the data, but for underlying factors of variation that are potentially more semantic than the original representation. Going back to the example of portraits, an algorithm might find that head orientation, lighting, and hair color are important aspects of the image that vary independently. In dimensionality reduction, we are usually looking for a representation that is lower-dimensional, i.e. that has fewer variables than the original feature space. This can be particularly useful for visualizing datasets with many features, by projecting them into a two-dimensional space that’s easily plotted. Another common application of signal decomposition is topic modeling of text data. Here, we are trying to find topics among a set of documents, say news articles, court documents, or social media posts. This is related to clustering, with the difference that each document can be assigned multiple topics: topics in the news could be politics, religion, sports, and economics, and an article could be about both politics and economics.
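A minimal sketch of dimensionality reduction with PCA in scikit-learn, on synthetic data that has only two real underlying factors of variation:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 10-dimensional data that really varies along only two hidden factors.
rng = np.random.RandomState(0)
latent = rng.normal(size=(100, 2))            # two underlying factors
mixing = rng.normal(size=(2, 10))             # spread across 10 observed features
X = latent @ mixing + rng.normal(scale=0.05, size=(100, 10))

# Project to two dimensions, e.g. for a scatter plot of the dataset.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
explained = pca.explained_variance_ratio_.sum()  # near 1: little is lost here
```

Because the data truly has two underlying factors, almost no variance is lost in the projection; on real data the explained-variance ratio tells you how faithful the low-dimensional view is.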

Both clustering and signal decomposition are most commonly used in exploratory analysis, where we are trying to understand the data. They are less commonly used in production systems, as they lend themselves less easily to automating a decision process. Sometimes signal decomposition is used as a mechanism to extract more semantic features from a dataset, on top of which a supervised model is learned. This can be particularly useful if there is a large amount of data, but only a small amount of annotated data, i.e. data for which the outcome \(y\) is known.

Reinforcement Learning ¶

The third main family of machine learning tasks is reinforcement learning, which is quite different from the other two. Both supervised and unsupervised learning basically work on a dataset that was collected and stored, from which we then build a model. Potentially, this model is then applied to new data in the future. In reinforcement learning, on the other hand, there is no real notion of a dataset. Instead, reinforcement learning is about a program (usually known as an agent) interacting with a particular environment. Through this interaction, the agent learns to achieve a particular goal. A good example is a program learning to play a video game. Here, the agent would be an AI playing the game, while the environment would be the game itself, i.e. the world in which it plays out. The agent observes the environment and has a choice of actions (say moving forward and backward, and jumping), and each of these actions will result in the environment being in a new state (i.e. with the agent placed a bit forward, or backward, or falling into a hole). Given the new state of the environment, the agent can again choose an action, and the environment will be in a new state as a consequence.


Fig. 1 The reinforcement learning cycle. ¶

The learning in reinforcement learning happens via so-called rewards , which need to be specified by the data scientist building the system. The agent is trained to seek rewards (hence the name reinforcement learning), and will find series of actions that maximize its reward. In a game, a reward could be given to the agent every time it scores points, or just once when it wins the game. In the second case, there might be a long delay between the agent taking an action and the agent winning the game, and one of the main challenges in reinforcement learning is dealing with such settings (this is known as the credit assignment problem: which of my actions should get the credit for winning the game?).
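The agent-environment-reward cycle can be sketched in a few lines of plain Python. The environment here is a made-up toy with a single delayed reward, and the agent acts randomly; it is meant only to show the shape of the interaction, not a learning algorithm:

```python
import random

# Toy environment: the agent moves on positions 0..10 and receives a reward
# only upon reaching position 10 -- a delayed reward, like winning a game.
def step(state, action):
    new_state = max(0, min(10, state + action))  # action is -1 or +1
    reward = 1 if new_state == 10 else 0
    return new_state, reward

# One episode of a random agent interacting with the environment.
random.seed(0)
state, total_reward, history = 0, 0, []
for _ in range(200):
    action = random.choice([-1, 1])      # agent chooses an action
    state, reward = step(state, action)  # environment returns new state + reward
    total_reward += reward
    history.append((action, state, reward))
    if reward == 1:                      # goal reached, episode over
        break
# A random policy may or may not reach the goal within the episode; a trained
# agent would learn to pick the actions that maximize the reward.
```

Every reinforcement learning setup has this loop at its core; the hard part, which actual algorithms address, is using the reward signal to improve the agent's choice of actions.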

Compared with supervised learning, reinforcement learning is a much more indirect way to specify the learning problem: we don’t provide the algorithm with the correct answer (i.e. the correct sequence of actions to win the game); instead, we only reward the agent once it achieves a goal. Surprisingly, this can work quite well in practice. It is like learning a game without anyone ever telling you the rules, or what the goal of the game is, but only telling you whether you lost or won at the end. As you might expect, it might take you many, many tries to figure out the game.

However, algorithms are notoriously patient, and researchers have been able to use reinforcement learning to create programs that can play a wide variety of complex games. One of the most surprising and impressive feats was learning to play the ancient Chinese board game of Go at a superhuman level.


When this was publicized in TODO, many researchers in the area were shocked, as the game was known to be notoriously hard, and many believed it could not be learned by any known algorithm. While the initial work used some human knowledge, later systems learned to play the game from scratch, i.e. without any rewards other than for winning the game, by repeatedly playing against themselves. The resulting programs are now playing at a superhuman level, meaning that they are basically unbeatable, even by the best human players in the world. Similar efforts are now underway for other games, in particular computer games like StarCraft II and DOTA.

Algorithms achieved superhuman performance in the game of chess long before this, in the year TODO, with the famous matches between Kasparov and Deep Blue. Chess has far fewer possible moves, and games are much shorter sequences of actions than in Go or StarCraft, which makes it much easier to devise algorithms to play chess.

Reinforcement learning also has a long history in other areas, in particular robotics, where it is used for learning and tuning behaviors such as walking or grasping. While many impressive achievements have been made with reinforcement learning, several aspects limit its broad usefulness. A potential application of reinforcement learning could be self-driving cars. However, as mentioned above, reinforcement learning usually requires many attempts or iterations before it learns a specific task. If I wanted a car to learn to park, it might fail thousands or hundreds of thousands of times first. Unfortunately, in the real world this is impractical: self-driving cars are very expensive, and we don’t want to crash them over and over again. It might also be risky for the person conducting the experiment. With thousands of attempts, even if the car doesn’t crash, the gas will run out, and the person having to reset the experiment every time will probably get very tired very quickly. Therefore, reinforcement learning is most successful when there is a good way to simulate the environment, as is the case with games and with some aspects of robotics. For learning how to park, a simulation might actually work well, as the sensing of other cars and the steering of the car can be simulated well. However, for really learning how to drive, a car would need to be able to deal with a variety of situations, such as different weather conditions, crowded streets, people running into the street, kids chasing balls, navigating detours, and many other scenarios. Simulating these in a realistic way is very hard, and so reinforcement learning is much harder to apply in the physical world.

A setting that has attracted some attention, and might become more relevant soon, is online platforms that are not games. You could think of a social media timeline as an agent that gets rewarded for you looking at it. Right now, this is often formulated as a supervised learning task (or, more accurately, active learning). However, your interactions with social media are not usually independent events; your behavior online is shaped by what is being presented to you, and what was shown to you in the past might influence what is shown to you in the future. A maybe somewhat cynical analogy would be to think of the timeline as an agent playing you, winning whenever you stay glued to the screen (or click an ad or buy a product). I’m not aware that this has been implemented anywhere, but as computational capacity increases and algorithms become more sophisticated, it is a natural direction to explore.

Reinforcement learning is a fascinating topic, but well beyond the scope of this book. For an introduction, see the book by Sutton and Barto (TODO). For an overview of modern approaches, see TODO.

As you might have noticed in the table of contents, this book mostly concerns itself with supervised and unsupervised learning, and we will not discuss reinforcement learning any further. As a matter of fact, the book heavily emphasizes supervised learning, which has found the largest success among the three in practical applications so far. While all three of these areas are interesting in their own right, when you see an application of machine learning, or when someone says they are using machine learning for something, chances are they mean supervised learning, which is arguably the most well-understood, and the easiest to productionize and analyze.

Isn’t this just statistics? ¶

A very common question I get is “is machine learning not just statistics?”, and I want to quickly address how the approach in this book differs from the approach taken in a statistics class or textbook. The machine learning community and the statistics community have some historical differences (ML being born much later, and from within computer science), but study many of the same subjects. So I don’t think it makes sense to say that one thing is statistics and the other is machine learning. However, there is usually a somewhat different emphasis in the kinds of problems and questions that are addressed in each, and I think it’s important to distinguish these tasks. Much of statistics deals with inference , which means that given a dataset, we want to make statements that hold for the dataset (often called the population in statistics) as a whole. Machine learning, on the other hand, often emphasizes prediction , which means we are looking to make statements about each sample, in other words individual-level statements. Asking “do people that take zinc get sick less often?” is an inference question, as it asks whether something happens on average over the whole population. A related prediction question would be “will this particular person get sick if they take zinc?”. The answer to the inference question would be either “yes” or “no”, and using hypothesis testing methodology this statement could have an effect size and a significance level attached to it. The answer to the prediction question would be a prediction for each sample of interest, or maybe even a program that can make predictions given information about a new patient.

As you can see, these are two fundamentally different kinds of questions, and require fundamentally different kinds of tools to answer them. This book solely looks at the prediction task, and we consider a model a good model if it can make good predictions. We do not claim that the model allows us to make any statistical or even causal statements that hold for the whole population, or the process that generated the dataset.

There are some other interesting differences between the kinds of prediction questions studied in supervised machine learning and the inference questions traditionally studied in statistics; in particular, machine learning usually assumes that we have access to data that was generated from the process we want to model, and that all samples are created equal (i.i.d.). Statistical inference usually makes no such assumptions, and instead assumes that we have some knowledge about the structure of the process that generated the data. As an example, consider predicting a presidential election outcome. As of this writing, there are 58 past elections to learn from. For a machine learning task, this is by no means enough observations to learn from. Even worse, these samples are not created equal. The circumstances of the first election are clearly different from what they will be for the next election. The economic and societal situation will be different, as will the candidates. So really, we have no examples whatsoever from the process that we’re interested in. However, by understanding all the differences from previous elections, we might still be able to make accurate forecasts using statistical modeling.


If you’re interested in a discussion of prediction vs inference and how they relate, I highly recommend the seminal paper Statistical Modeling: The Two Cultures by Leo Breiman.

Some of my favorite machine learning textbooks are written by statisticians (the subfield is called predictive modeling), and there are certainly machine learning researchers that work on inference questions, so I think making a distinction between statistics and machine learning is not that useful. However, if you look at how a statistics textbook teaches, say, logistic regression, the intention is likely to be inference, and so the methods will be different from this book, where the emphasis is on prediction, and you should keep this in mind.

This is not to say that one is better than the other in any sense, but that it’s important to pick the right tool for the job. If you want to answer an inference question, the tools in this book are unlikely to help you, but if you want to make accurate predictions, they likely will.

The bigger picture ¶

This book is mostly technical in nature, with an emphasis on practical programming techniques. However, there are some important guiding principles for developing machine learning solutions that are often forgotten by practitioners who find themselves deep in the technical aspects. In this section, I want to draw your attention to what I think are crucial aspects of using machine learning in applications. It might seem a bit dry for now, but I encourage you to keep these ideas in mind while working through the rest of the book, and maybe come back here at the end, once you’ve gotten your feet a bit wet.

The machine learning process ¶

Outside of the depicted process are the formulation of the problem and the definition of measures, both of which are critical, but usually not part of the loop. The actual machine learning process itself starts with data collection, which might mean mining historical data, labeling data by hand, running simulations, or even performing actual physical experiments. Once the data is collected, it needs to be processed into a format suitable for machine learning, which we’ll discuss in more detail in Chapter TODO. Before building the model, exploratory data analysis and visualization are essential to form or confirm intuitions about the structure of the data, to spot potential data quality issues, to select suitable candidate models, and potentially to generate new features. The next step, model building, usually involves building several candidate models, tweaking them, and comparing them. Once a model is selected, it is usually evaluated first in an offline manner, that is, using already collected data.


Then, potentially, it is further validated in a live setting with current data. Finally, the model is deployed into the production environment. For a web app, deployment might mean deployment in the software sense: deploying a service that takes user data, runs the model, and renders some outcome on your website. For industrial applications, deployment could mean integrating your defect detection into an assembly line and discarding defective parts; if your model evaluates real estate, deployment might mean buying highly valued properties.

This process is depicted as a circle, as deployment usually generates new data, or informs future data collection, and restarts the process. While I drew a circle, in reality this is more than one loop; it is a fully connected graph, where after each step you might decide to go back to previous steps and start over, improving your model or your process. At any point, you might find data quality issues, figure out new informative ways to represent the data, or find out that your model doesn't perform as well as you thought. Each time, you might decide to improve any of the previously taken steps. Usually there are many iterations before reaching integration and deployment for the first time, as deploying an unsuitable model might pose substantial risk to your project.

The rest of the book will focus on model building and evaluation, which are at the core of machine learning. However, for a successful project, all of the steps in the process are important. Formulating the problem, collecting data, and establishing success metrics are often at least as crucial as selecting the right model and tweaking it. Given the technical nature of the material presented in this book, it’s easy to lose sight of how critical all the steps of the process are. We will discuss some of these in a bit more detail now.

The role of data ¶

Clearly the data used for building and evaluating a machine learning model is a crucial ingredient. Data collection is often overlooked in machine learning education, where students usually look at fixed datasets, and the same is true for online competitions and platforms such as Kaggle. However, in practice, data collection is usually part of building any machine learning application, and there is usually a choice to collect additional data, or to change the data collection. Having more data can be the difference between a model that's not working and a model that outperforms human judgement, in particular if you can collect data that covers the variability that you will encounter in prediction. Sometimes it might be possible to collect additional features that make the task much easier, and selecting what data to collect is often as critical as selecting the right model. Usually it's easier to throw away data later than to add new fields to the data collection. It's common for data scientists to start working on a model only to discover that a critical aspect of the process was not logged, and a task that could have been easy becomes extremely hard.


Potentially one of the most ingenious ways to capture labeled training data is ReCAPTCHA. It provides a service to verify that a web user is not a bot by having them solve visual tasks. The solutions are then used as ground-truth annotations for training machine learning models.

Depending on the problem you're tackling, the effort and cost of data collection can vary widely. In some settings, the data is basically free and endless. Say you want to predict how much attention a post will receive on social media. As long as your post is similar to other posts on the platform, you can obtain arbitrary amounts of training data by looking at existing posts and collecting the number of likes, comments, and other engagement. This data-rich situation often appears when you are trying to predict the future, and you can observe the labels of past data simply by waiting, i.e. seeing how many people like a photo. In some cases the same might be true for showing ads or recommendations, where you are able to observe past behavior of users, or in content moderation, where users might flag offending content for you. This assumes that the feedback loop is relatively short and the events repeat often, though. If you work in retail, the two data points that are most crucial (at least in the US) are Black Friday and Christmas. And while you might be able to observe them, you can only observe them once a year, and if you make a bad decision, you might go out of business before observing them again.

Another common situation is automating a business process that has previously been done manually. Usually collecting the answers is not free in this setting, but it's often possible to collect additional data by manual annotation. The price of collecting more data then depends on the level of qualification required to create accurate labels, and the time involved. If you want to detect personal attacks in your online community, you can likely use a crowd-sourcing platform or a contractor to get reasonable labels. If your decision requires expert knowledge, say, which period a painting was created in, hiring an expert might be much more expensive or even impossible. In this situation, it's often interesting to ask yourself what is more cost-effective: spending time building and tuning a complex machine learning model, or collecting more data and potentially getting results with less effort. We will discuss how to make this decision in TODO.

Finally, there are situations where getting additional data is infeasible or impossible; in these situations, people speak of precious data. One example is the outcome of a drug trial, which is lengthy and expensive, and where collecting additional data might not be feasible. Others are the simulation of a complex physical system, or observations from a scientific measurement; maybe each sample corresponds to a new microchip architecture for which you want to model energy efficiency. These are the settings where tweaking your model and diving deep into the problem might pay off, but such situations are overall rather rare in practice.


Feedback loops in data collection ¶

One aspect that is often neglected in data collection is that deploying a machine learning model might change the process generating the data. A simple example of this would be a spammer who, once a model is able to flag their content, changes their strategy or content so as to no longer be detected. Clearly, the data here changed as a consequence of deploying the model, and a model that might have been able to accurately identify spam in an offline setting might not work in practice. In this example, there is adversarial intent, and the spammers intentionally try to defeat the model. However, similar changes might happen incidentally, and still invalidate a previous model. For example, when building systems for product recommendation, the model often relies on data that was collected using some other recommendation scheme, and the choice of this scheme clearly influences what data will be collected. If a streaming platform never suggests a particular movie, it's unlikely to be seen by many users, and so it will not show up in the user data that's collected, and so a machine learning algorithm will not recommend it, creating a feedback loop that leads to the movie being ignored. There is a whole subset of machine learning devoted to this kind of interactive data collection, called active learning, where the data that is collected is closely related to the model that's being built. This area also has a close relation to reinforcement learning.

Given the existence of these feedback loops, it's important to ensure that your model performs well not only in an offline test, but also in a production environment. Often this is hard to simulate, as you might not be able to anticipate the reaction of your users to deploying an algorithm. In this case, using A/B testing might be a way to evaluate your system more rigorously.
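As a rough sketch of how the outcome of an A/B test might be checked, a chi-squared test can compare conversion counts between the two variants. The counts below are entirely made up for illustration:

```python
from scipy.stats import chi2_contingency

# Hypothetical outcome counts from an A/B test:
# rows are variants, columns are [converted, did not convert].
table = [[120, 880],   # variant A: current system
         [150, 850]]   # variant B: new model

chi2, p_value, dof, expected = chi2_contingency(table)

# A small p-value suggests the difference in conversion rates
# between the two variants is unlikely to be due to chance alone.
print(p_value)
```

In a real deployment you would also account for how users were assigned to variants and how long the test ran; the statistical test is only one piece of a rigorous online evaluation.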


A particularly nefarious example of this feedback loop has been observed (TODO citation) in what is known as predictive policing. The premise of predictive policing is to send police patrols to neighborhoods where they expect to observe crime, at times that they expect to observe crime. However, if police are sent to a neighborhood, they are likely to find criminal activity there (even if it might be minor); and clearly they will not find criminal activity in neighborhoods they did not patrol. Historically, police patrols in certain US cities have focused on non-white neighborhoods, and given this historical data, predictive policing methods steered patrols to these same neighborhoods. This led them to observe more crime there, leading to more data showing crime in these neighborhoods, leading to more patrols being sent there, and so on.

Metrics and evaluation ¶

One of the most important parts of machine learning is defining the goal, and defining a way to measure that goal. The first part of this is having the right data for evaluating your model: data that reflects the way the model will be used in production. Equally important is establishing a measure of impact for your task. Usually your application is driven by some ultimate goal, such as user engagement, revenue, keeping patients healthy, or any number of possible motivations. The question is how your machine learning solution will impact this goal. It's important to note that the goal is rarely if ever to make accurate predictions. It's not an evaluation metric that counts, but the real-world impact of the decisions made by your model.
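A toy illustration of this point, with entirely made-up predictions and costs: the model with the higher accuracy can still be the worse choice once you price the decisions it makes.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
pred_a = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])  # always predicts 0
pred_b = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])  # catches both positives

# Hypothetical costs: a missed positive (false negative) costs 100,
# a false alarm (false positive) costs 10.
def business_cost(y_true, y_pred, fn_cost=100, fp_cost=10):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return fn * fn_cost + fp * fp_cost

print(accuracy_score(y_true, pred_a), business_cost(y_true, pred_a))  # 0.8 200
print(accuracy_score(y_true, pred_b), business_cost(y_true, pred_b))  # 0.7 30
```

Model A looks better by accuracy, but under these (assumed) costs model B makes far cheaper mistakes. The numbers only matter insofar as they show that the ranking of models can flip once decisions, not predictions, are evaluated.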

\(^3\) Here and in the following I will talk of your goals in terms of a business, however, if your goal is scientific discovery, health or justice, the same principles apply.

There are some common hurdles in measuring the impact of your model. Often, the effect on the bottom line is only very indirect. If you're removing fake news from your social media platform, this will not directly increase your ad revenue, and removing a particular fake news article will probably have no measurable impact. However, curating your platform will help maintain your brand image and might draw users to your platform, which in turn will create more revenue. But this effect is likely to be mixed in with many other effects, and much delayed in time. So data scientists often rely on surrogate metrics, measures that relate to intermediate business goals \(^3\) and can be measured more directly, such as user engagement or click-through rate.

The problem with such surrogate metrics is that they might not capture what you assume they capture. I heard an (if not true, then at least illustrative) anecdote about optimizing the placement of an ad on a shopping website. An optimization algorithm placed it right next to the search button, with the same color as the search button, which resulted in the most clicks. However, when analyzing the results more closely, the team found that the clicks were caused by users missing the search button and accidentally clicking the ad, resulting not in any sales, but in irritated users who had to go back and search again.

There is usually a hierarchy of measurements, from the accuracy of a model on an offline holdout dataset, which is easy to calculate but can be misleading in several ways, to more business-specific metrics that can be evaluated on an online system, to the actual business goal. Moving from evaluating just the model to the whole process, and then to how the process integrates into your business, makes evaluation more complex and more risky. Usually, evaluation on all levels is required if possible: if a model does well in offline tests, it can be tried in a user study. If the user study is promising, it can be deployed more widely, and potentially an outcome on the actual objective can be observed. However, often we have to be satisfied with surrogate metrics at some level, as it's unlikely that each model will have a measurable impact on the bottom line of a complex product.

One aspect that I find is often overlooked by junior data scientists is establishing a baseline. If you are employing machine learning in any part of your process, you should have a baseline of not employing machine learning. What are your gains if you do? What if you replace your deep neural network with the simplest heuristic that you can come up with? How will it affect your users? There are cases in which the difference between 62% accuracy and 63% accuracy can have a big impact on the bottom line, but more often than not, small improvements in the model will not drastically alter the overall process and result.
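scikit-learn's DummyClassifier makes such a no-machine-learning baseline cheap to set up. The sketch below uses a synthetic, imbalanced dataset as a stand-in for real data; the point is only the comparison pattern, not the numbers:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data: about 90% of samples belong to class 0.
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline that always predicts the majority class -- no learning involved.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression().fit(X_train, y_train)

print("baseline accuracy:", baseline.score(X_test, y_test))
print("model accuracy:   ", model.score(X_test, y_test))
```

On imbalanced data like this, the majority-class baseline already scores around 90% accuracy, which is exactly why a model's accuracy number means little until it is compared against a baseline (and, as discussed above, against a metric that is relevant to your process).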

When developing any machine learning solution, always keep in mind how your model will fit into the overall process, what consequences your predictions have, and how to measure the overall impact of the decisions made by the model.

When to use and not to use machine learning ¶

As you might be able to tell by the existence of this book, I'm excited about machine learning and an avid advocate. However, I think it is crucial not to fall victim to hype, and to carefully consider whether a particular situation calls for a machine learning solution. Many machine learning practitioners get caught up in the (fascinating) details of algorithms and datasets, but lose perspective of the bigger picture. To the data scientist with a machine learning hammer, too often everything looks like a classification nail. In general, I would recommend restricting yourself to supervised learning in most practical settings; in other words, if you do not have a training dataset for which you know the outcome, it will be very hard to create an effective solution. As mentioned before, machine learning will be most useful for making individual-level predictions, not for inference. I also already laid out some prerequisites for using supervised learning in the respective section above. Let's assume all of these criteria are met and you carefully chose your business-relevant metrics. This still doesn't mean machine learning is the right solution for your problem. There are several aspects that need to be balanced: on the one hand, there is the positive effect a successful model can have. On the other hand, there is the cost of developing the initial solution. Is it worth your time as a data scientist to attack this problem, or are there problems where you can have a bigger impact? There is also the even greater cost of maintaining a machine learning solution in a production environment [SHG+14] . Machine learning models are often opaque and hard to maintain. Their exact behavior depends on the training data, and if the data changes (maybe the trends on social media change, or the political climate changes, or a new competitor appears), the model needs to be adjusted.
A model might also make unexpected predictions, potentially leading to costly errors or annoyed customers. All of these issues need to be weighed against the potential benefits of using a model.

Your default should be not to use machine learning, unless you can demonstrate that your solution improves the overall process and impacts relevant business goals, while being robust to possible changes in the data and potentially even to adversarial behavior. Try hard to come up with heuristics that outperform any model you develop, and always compare your model to the simplest approach and the simplest model that you can think of. And keep in mind: don't evaluate these via model accuracy; evaluate them on something relevant to your process.

Ethical aspects of machine learning ¶

One aspect of machine learning that has only recently been getting significant attention is ethics. The field of ethics in technology is quite broad, and machine learning and data science raise many of the same questions that are associated with any use of technology. However, there are many situations where machine learning quite directly impacts individuals, for example when hiring decisions, credit approvals, or even risk assessments in the criminal justice system [BHJ+18] are powered by machine learning. Given the complexity of machine learning algorithms, the intricate dependencies on the training data, and the potential for feedback loops, it is often hard to assess the impact that the deployment of an algorithm or model can have on individuals. However, that by no means relieves data scientists of the responsibility to investigate potential issues of bias and discrimination in machine learning. There is a growing community that investigates fairness, accountability, and transparency in machine learning and data science, providing tools to detect and address issues in algorithms and datasets. On the other hand, there are some who question algorithmic solutions to ethical issues, and ask for a broader perspective on the impact of data science and machine learning on society [KHD19] [FL20] . This is a complex topic, and so far, there is little consensus on best practices and concrete steps. However, most researchers and practitioners agree that fairness, accountability, and transparency are essential principles for the future of machine learning in society. While approaches to fair machine learning are beyond the scope of this book, I want to encourage you to keep the issues of bias and discrimination in mind. Real-world examples, such as the use of predictive policing, racial discrimination in criminal risk assessment, or gender discrimination in ML-driven hiring, unfortunately abound.
If your application involves humans in any capacity (and most do), make sure to pay special attention to these topics, and research best practices for evaluating your process, data, and modeling.

Scaling up ¶

This book focuses on using Python and scikit-learn for machine learning. One of the main limitations of scikit-learn that I'm often asked about is that it is usually restricted to a single machine, not a cluster (though there are some ways around this in some cases). My standpoint on this is that for most applications, using a single machine is often enough, easier, and potentially even faster [RND+12] . Once your data is processed to a point where you can start your analysis, few applications require more than at most several gigabytes of data, and many applications only require megabytes. These workloads are easily handled on modern machines: even if your machine does not have enough memory, it's quick and cheap to rent machines with hundreds of GB of RAM from a cloud provider, and do all your machine learning in memory on a single machine. If this is possible, I would encourage you to go with this solution. The interactivity and simplicity that come from working on a single machine are hard to beat, and the number of libraries available is far greater for local computations. Even if your raw data, say user logs, is many terabytes, the data extracted for machine learning might be only hundreds of megabytes, and so after preparing the data in a distributed environment such as Spark, you can transition to a single machine for your machine learning workflow. Clearly there are situations when a single machine is not enough; large tech companies often use bespoke in-house systems to learn models on immense data streams. However, most projects don't operate on the scale of the Facebook timeline or Google searches, and even if your production environment requires truly large amounts of data, prototyping on a subset on a single machine can be helpful for quick exploratory analysis or a prototype. Avoid premature optimization and start small, where small these days might mean hundreds of gigabytes.
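Even without a distributed system, pandas can reduce a large raw file to a small analysis table chunk by chunk, so that only a summary ever needs to fit in memory at once. The file and column names below are made up, and a tiny generated CSV stands in for the "many terabytes" of logs:

```python
import pandas as pd

# Create a small stand-in for a large raw log file.
pd.DataFrame({"user_id": range(10_000),
              "clicks": [i % 7 for i in range(10_000)]}
             ).to_csv("logs.csv", index=False)

# Stream the raw file in chunks; each chunk is summarized and discarded,
# so memory use is bounded by the chunk size, not the file size.
chunks = pd.read_csv("logs.csv", chunksize=2_000)
partial = pd.concat(chunk.groupby("clicks")["user_id"].count()
                    for chunk in chunks)
per_value = partial.groupby(level=0).sum()
print(per_value)
```

The resulting summary is tiny regardless of how large the raw file was, and it is this kind of extracted table, not the raw logs, that you would feed into your machine learning workflow on a single machine.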

Richard Berk, Hoda Heidari, Shahin Jabbari, Michael Kearns, and Aaron Roth. Fairness in criminal justice risk assessments: the state of the art. Sociological Methods & Research , pages 0049124118782533, 2018.

Sina Fazelpour and Zachary C Lipton. Algorithmic fairness from a non-ideal perspective. In AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society (AIES) 2020 . 2020.

Os Keyes, Jevan Hutson, and Meredith Durbin. A mulching proposal: analysing and improving an algorithmic system for turning the elderly into high-nutrient slurry. In Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing systems , 1–11. 2019.

Antony Rowstron, Dushyanth Narayanan, Austin Donnelly, Greg O’Shea, and Andrew Douglas. Nobody ever got fired for using hadoop on a cluster. In Proceedings of the 1st International Workshop on Hot Topics in Cloud Data Processing , 1–5. 2012.

David Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, and Michael Young. Machine learning: the high interest credit card of technical debt. In SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop) . 2014.


DataSci 207: Applied Machine Learning

Lecture: mo, tu, th, office hours: tu, 8-9 am pt.

This course provides a practical introduction to the rapidly growing field of machine learning— training predictive models to generalize to new data. We start with linear and logistic regression and implement gradient descent for these algorithms, the core engine for training. With these key building blocks, we work our way to understanding widely used neural network architectures, focusing on intuition and implementation with TensorFlow/Keras. While the course centers on neural networks, we will make sure to cover key ideas in unsupervised learning and nonparametric modeling.

Along the way, weekly short coding assignments will connect lectures with concrete data and real applications. A more open-ended final project will tie together crucial concepts in experimental design and analysis with models and training.

This class meets for one 90 min class periods each week.

All materials for this course are posted on GitHub in the form of Jupyter notebooks.

  • Please fill out this PRE-COURSE survey so I can get to know a bit more about you and your programming background.
  • Due to a large number of private Slack inquiries, I encourage you to first read this website for commonly asked questions.
  • Any questions regarding course content and organization (including assignments and final project) should be posted on my Slack channel. You are strongly encouraged to answer other students' questions when you know the answer.
  • If there are private matters specific to you (e.g., special accommodations), please contact me directly.
  • If you miss a class, watch the recording and inform me here .
  • If you want to stay up to date with recent work in AI/ML, start by looking at the conferences NeurIPS and ICML .
  • ML study guidelines: Stanford's super cheatsheet .

Core data science courses: research design, storing and retrieving data, exploring and analyzing data.

Undergraduate-level probability and statistics. Linear algebra is recommended.

Python (v3).

Jupiter and JupiterLab notebooks. You can install them in your computer using pip or Anaconda . More information here .

Git(Hub), including clone/commmit/push from the command line. You can sign up for an account here.

If you have a MacOS M1, this .sh script will install everything for you (credit goes to one of my former students, Michael Tay)

Mac/Windows/Linux are all acceptable to use.

  • Raschka & Mirjalili (RM) , Python machine learning: Machine learning and deep learning with Python, scikit-learn, and TensorFlow 2.
  • Weekly coding assignments, submitted via GitHub and Digital Campus (see notes below).
  • You will present your final project in class during the final session. You are allowed to work in teams (see notes below).
  • You will submmit your code and presentation slides via GitHub (see notes below).

Communication channel

For the final project you will form a group (3-4 people are ideal; 2-5 people are allowed; no 1 person group allowed). Grades will be calibrated by group size. Your group can only include members from the section in which you are enrolled.

Do not just re-run an existing code repository; at the minimum, you must demonstrate the ability to perform thoughtful data preprocessing and analysis (e.g., data cleaning, model training, hyperparameter selection, model evaluation).

The topic of your project is totally flexible (see also below some project ideas).

  • week 04: inform me here about your group, question and dataset you plan to use.
  • week 08: prepare the baseline presentation of your project. You will present in class (no more than 10 min).
  • week 14: prepare the final presentation of your project. You will present in class (no more than 10 min).
  • Second Sight through Machine Learning
  • Can we predict solar panel electricity production using equipment and weather data?
  • Predict Stock Portfolio Returns using News Headlines
  • Pneumonia Detection from Chest X-rays
  • Predicting Energy Usage from Publicly Available Building Performance Data
  • Can we Predict What Movies will be Well Received?
  • ML for Music Genre Classification
  • Predicting Metagenome Sample Source Environment from Protein Annotations
  • California Wildfire Prediction
  • Title, Authors
  • What is the question you will be working on? Why is it interesting?
  • What is the data you will be using? Include data source, size of dataset, main features to be used. Please also include summary statistics of your data.
  • What prediction algorithms do you plan to use? Please describe them in detail.
  • How will you evaluate your results? Please describe your chosen performance metrics and/or statistical tests in detail.
  • (15%) Motivation: Introduce your question and why the question is interesting. Explain what has been done before in this space. Describe your overall plan to approach your question. Provide a summary of your results.
  • (15%) Data: Describe in detail the data that you are using, including the source(s) of the data and relevant statistics.
  • (15%) Approach: Describe in detail the models (baseline + improvement over baseline) that you use in your approach.
  • (30%) Experiments: Provide insight into the effect of different hyperparameter choices. Please include tables, figures, and graphs to illustrate your experiments.
  • (10%) Conclusions: Summarize the key results, what has been learned, and avenues for future work.
  • (15%) Code submission: Provide link to your GitHub repo. The code should be well commented and organized.
  • Contributions: Specify the contributions of each author (e.g., data processing, algorithm implementation, slides etc).
  • Create a GitHub repo for Assignments 1-10. Upload the homework's .ipynb file to Gradescope each week before the deadline.
  • Create a team GitHub repo for Final Project. This repo will contain your code as well as PowerPoint slides. Add me as a contributor if your repo is private (my username is corneliailin), and add the link to your repo here

Integrating a diverse set of experiences is important for a more comprehensive understanding of machine learning. I will make an effort to read papers and hear from a diverse group of practitioners; still, limits exist on this diversity in the field of machine learning. I acknowledge that there may be both overt and covert biases in the material due to the lens with which it was created. I would like to nurture a learning environment that supports a diversity of thoughts, perspectives, and experiences, and honors your identities (including race, gender, class, sexuality, religion, ability, veteran status, etc.) in the spirit of the UC Berkeley Principles of Community.

To help accomplish this, please contact me or submit anonymous feedback through I School channels if you have any suggestions to improve the quality of the course. If you have a name and/or set of pronouns that you prefer I use, please let me know. If something was said in class (by anyone) or you experience anything that makes you feel uncomfortable, please talk to me about it. If you feel like your performance in the class is being impacted by experiences outside of class, please don’t hesitate to come and talk with me. I want to be a resource for you. Also, anonymous feedback is always an option, and may lead me to make a general announcement to the class, if necessary, to address your concerns.

As a participant in teamwork and course discussions, you should also strive to honor the diversity of your classmates.

If you prefer to speak with someone outside of the course, MICS Academic Director Lisa Ho, I School Assistant Dean of Academic Programs Catherine Cronquist Browning, and the UC Berkeley Office for Graduate Diversity are excellent resources. Also see the following link.

Machine Learning and image analysis towards improved energy management in Industry 4.0: a practical case study on quality control

  • Original Article
  • Open access
  • Published: 13 May 2024
  • Volume 17, article number 48 (2024)


  • Mattia Casini,
  • Paolo De Angelis,
  • Marco Porrati,
  • Paolo Vigo,
  • Matteo Fasano,
  • Eliodoro Chiavazzo &
  • Luca Bergamasco (ORCID: orcid.org/0000-0001-6130-9544)


With the advent of Industry 4.0, Artificial Intelligence (AI) has created a favorable environment for the digitalization of manufacturing and processing, helping industries to automate and optimize operations. In this work, we focus on a practical case study of a brake caliper quality control operation, which is usually accomplished by human inspection and requires a dedicated handling system, with a slow production rate and thus inefficient energy usage. We report on a Machine Learning (ML) methodology, based on Deep Convolutional Neural Networks (D-CNNs), that automatically extracts information from images in order to automate the process. A complete workflow has been developed on the target industrial test case. To find the best compromise between accuracy and computational demand, several D-CNN architectures have been tested. The results show that a judicious choice of ML model, with proper training, allows fast and accurate quality control; thus, the proposed workflow could be implemented for an ML-powered version of the considered problem. This would eventually enable better management of the available resources, in terms of time consumption and energy usage.


Introduction

An efficient use of energy resources in industry is key for a sustainable future (Bilgen, 2014; Ocampo-Martinez et al., 2019). The advent of Industry 4.0 and of Artificial Intelligence has created a favorable context for the digitalisation of manufacturing processes. In this view, Machine Learning (ML) techniques have the potential to assist industries in a better and smarter usage of the available data, helping to automate and improve operations (Narciso & Martins, 2020; Mazzei & Ramjattan, 2022). For example, ML tools can be used to analyze sensor data from industrial equipment for predictive maintenance (Carvalho et al., 2019; Dalzochio et al., 2020), which allows identification of potential failures in advance, and thus better planning of maintenance operations with reduced downtime. Similarly, energy consumption optimization (Shen et al., 2020; Qin et al., 2020) can be achieved via ML-enabled analysis of available consumption data, with consequent adjustment of the operating parameters, schedules, or configurations to minimize energy consumption while maintaining optimal production efficiency. Energy consumption forecasts (Liu et al., 2019; Zhang et al., 2018) can also be improved, especially in industrial plants relying on renewable energy sources (Bologna et al., 2020; Ismail et al., 2021), by analysis of historical data on weather patterns and forecasts, to optimize the usage of energy resources, avoid energy peaks, and leverage alternative energy sources or storage systems (Li & Zheng, 2016; Ribezzo et al., 2022; Fasano et al., 2019; Trezza et al., 2022; Mishra et al., 2023). Finally, ML tools can also serve for fault or anomaly detection (Angelopoulos et al., 2019; Md et al., 2022), which allows prompt corrective actions to optimize energy usage and prevent energy inefficiencies.
Within this context, ML techniques for image analysis (Casini et al., 2024 ) are also gaining increasing interest (Chen et al., 2023 ), for their application to e.g. materials design and optimization (Choudhury, 2021 ), quality control (Badmos et al., 2020 ), process monitoring (Ho et al., 2021 ), or detection of machine failures by converting time series data from sensors to 2D images (Wen et al., 2017 ).

Incorporating digitalisation and ML techniques into Industry 4.0 has led to significant energy savings (Maggiore et al., 2021 ; Nota et al., 2020 ). Projects adopting these technologies can achieve an average of 15% to 25% improvement in energy efficiency in the processes where they were implemented (Arana-Landín et al., 2023 ). For instance, in predictive maintenance, ML can reduce energy consumption by optimizing the operation of machinery (Agrawal et al., 2023 ; Pan et al., 2024 ). In process optimization, ML algorithms can improve energy efficiency by 10-20% by analyzing and adjusting machine operations for optimal performance, thereby reducing unnecessary energy usage (Leong et al., 2020 ). Furthermore, the implementation of ML algorithms for optimal control can lead to energy savings of 30%, because these systems can make real-time adjustments to production lines, ensuring that machines operate at peak energy efficiency (Rahul & Chiddarwar, 2023 ).

In automotive manufacturing, ML-driven quality control can lead to energy savings by reducing the need for redoing parts or running inefficient production cycles (Vater et al., 2019 ). In high-volume production environments such as consumer electronics, novel computer-based vision models for automated detection and classification of damaged packages from intact packages can speed up operations and reduce waste (Shahin et al., 2023 ). In heavy industries like steel or chemical manufacturing, ML can optimize the energy consumption of large machinery. By predicting the optimal operating conditions and maintenance schedules, these systems can save energy costs (Mypati et al., 2023 ). Compressed air is one of the most energy-intensive processes in manufacturing. ML can optimize the performance of these systems, potentially leading to energy savings by continuously monitoring and adjusting the air compressors for peak efficiency, avoiding energy losses due to leaks or inefficient operation (Benedetti et al., 2019 ). ML can also contribute to reducing energy consumption and minimizing incorrectly produced parts in polymer processing enterprises (Willenbacher et al., 2021 ).

Here we focus on a practical industrial case study of brake caliper processing. In detail, we focus on the quality control operation, which is typically accomplished by human visual inspection and requires a dedicated handling system. This eventually implies a slower production rate and inefficient energy usage. We thus propose the integration of an ML-based system to automatically perform the quality control operation, without the need for a dedicated handling system and thus with reduced operation time. To this end, we rely on ML tools able to analyze and extract information from images, that is, deep convolutional neural networks, D-CNNs (Alzubaidi et al., 2021; Chai et al., 2021).

Figure 1: Sample 3D model (GrabCAD) of the considered brake caliper: (a) part without defects, and (b) part with three sample defects, namely a scratch, a partially missing letter in the logo, and a circular painting defect (shown by the yellow squares, from left to right respectively)

A complete workflow for the purpose has been developed and tested on a real industrial test case. This includes: dedicated pre-processing of the brake caliper images; their labelling and analysis using two dedicated D-CNN architectures (one for background removal, and one for defect identification); and post-processing and analysis of the neural network output. Several different D-CNN architectures have been tested, in order to find the best model in terms of accuracy and computational demand. The results show that a judicious choice of ML model, with proper training, yields fast and accurate recognition of possible defects. The best-performing models indeed reach over 98% accuracy on the target criteria for quality control, and take only a few seconds to analyze each image. These results make the proposed workflow compliant with typical industrial expectations; therefore, in perspective, it could be implemented for an ML-powered version of the considered industrial problem. This would eventually allow better performance of the manufacturing process and, ultimately, better management of the available resources in terms of time consumption and energy expense.

Figure 2: Different neural network architectures: convolutional encoder (a) and encoder-decoder (b)

The industrial quality control process that we target is the visual inspection of manufactured components, to verify the absence of possible defects. Due to industrial confidentiality reasons, a representative open-source 3D geometry (GrabCAD) of the considered parts, similar to the original one, is shown in Fig. 1. For illustrative purposes, the clean geometry without defects (Fig. 1(a)) is compared to the geometry with three possible sample defects, namely: a scratch on the surface of the brake caliper, a partially missing letter in the logo, and a circular painting defect (highlighted by the yellow squares, from left to right respectively, in Fig. 1(b)). Note that one or multiple defects may be present on the geometry, and that other types of defects may also be considered.

Within the industrial production line, this quality control is typically time consuming, and requires a dedicated handling system with the associated slow production rate and energy inefficiencies. Thus, we developed a methodology to achieve an ML-powered version of the control process. The method relies on data analysis and, in particular, on information extraction from images of the brake calipers via Deep Convolutional Neural Networks, D-CNNs (Alzubaidi et al., 2021 ). The designed workflow for defect recognition is implemented in the following two steps: 1) removal of the background from the image of the caliper, in order to reduce noise and irrelevant features in the image, ultimately rendering the algorithms more flexible with respect to the background environment; 2) analysis of the geometry of the caliper to identify the different possible defects. These two serial steps are accomplished via two different and dedicated neural networks, whose architecture is discussed in the next section.

Convolutional Neural Networks (CNNs) pertain to a particular class of deep neural networks for information extraction from images. The feature extraction is accomplished via convolution operations; thus, the algorithms receive an image as an input, analyze it across several (deep) neural layers to identify target features, and provide the obtained information as an output (Casini et al., 2024 ). Regarding this latter output, different formats can be retrieved based on the considered architecture of the neural network. For a numerical data output, such as that required to obtain a classification of the content of an image (Bhatt et al., 2021 ), e.g. correct or defective caliper in our case, a typical layout of the network involving a convolutional backbone, and a fully-connected network can be adopted (see Fig. 2 (a)). On the other hand, if the required output is still an image, a more complex architecture with a convolutional backbone (encoder) and a deconvolutional head (decoder) can be used (see Fig. 2 (b)).
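The convolution operation at the core of these networks can be illustrated with a minimal numpy sketch (an illustrative toy, not the D-CNN implementation used in the article; `conv2d_valid` is a name of my choosing):

```python
import numpy as np

def conv2d_valid(img, kernel):
    """'Valid' 2D cross-correlation: slide the kernel over the image and
    sum the elementwise products at each position (no padding)."""
    kh, kw = kernel.shape
    out_h, out_w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(img[r:r + kh, c:c + kw] * kernel)
    return out

img = np.arange(9, dtype=float).reshape(3, 3)
kernel = np.ones((2, 2))  # a simple 2x2 summing filter
features = conv2d_valid(img, kernel)
print(features)  # [[ 8. 12.] [20. 24.]]
```

Real CNN layers stack many such filters, with learned weights, across deep layers.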

As previously introduced, our workflow targets the analysis of the brake calipers in a two-step procedure: first, the removal of the background from the input image (e.g. Fig. 1 ); second, the geometry of the caliper is analyzed and the part is classified as acceptable or not depending on the absence or presence of any defect, respectively. Thus, in the first step of the procedure, a dedicated encoder-decoder network (Minaee et al., 2021 ) is adopted to classify the pixels in the input image as brake or background. The output of this model will then be a new version of the input image, where the background pixels are blacked. This helps the algorithms in the subsequent analysis to achieve a better performance, and to avoid bias due to possible different environments in the input image. In the second step of the workflow, a dedicated encoder architecture is adopted. Here, the previous background-filtered image is fed to the convolutional network, and the geometry of the caliper is analyzed to spot possible defects and thus classify the part as acceptable or not. In this work, both deep learning models are supervised , that is, the algorithms are trained with the help of human-labeled data (LeCun et al., 2015 ). Particularly, the first algorithm for background removal is fed with the original image as well as with a ground truth (i.e. a binary image, also called mask , consisting of black and white pixels) which instructs the algorithm to learn which pixels pertain to the brake and which to the background. This latter task is usually called semantic segmentation in Machine Learning and Deep Learning (Géron, 2022 ). Analogously, the second algorithm is fed with the original image (without the background) along with an associated mask, which serves the neural networks with proper instructions to identify possible defects on the target geometry. 
The required pre-processing of the input images, as well as their use for training and validation of the developed algorithms, are explained in the next sections.
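As a toy illustration of what the first (background removal) step produces, the sketch below (my own, not the authors' code) applies a predicted binary mask to an image, blacking out the background pixels:

```python
import numpy as np

def apply_background_mask(image, mask):
    """Keep pixels where mask == 1 (caliper); zero out pixels where
    mask == 0 (background). image: (H, W, 3), mask: (H, W)."""
    return image * mask[..., None]

# Toy 2x2 RGB image; the mask marks the left column as caliper pixels.
img = np.array([[[200, 10, 10], [90, 90, 90]],
                [[180, 20, 20], [80, 80, 80]]], dtype=np.uint8)
mask = np.array([[1, 0],
                 [1, 0]], dtype=np.uint8)
filtered = apply_background_mask(img, mask)
print(filtered[:, :, 0])  # background column is zeroed
```

In the article's workflow, the mask itself is predicted by the encoder-decoder network; here it is hard-coded for illustration.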

Image pre-processing

Machine Learning approaches rely on data analysis; thus, the quality of the final results is well known to depend strongly on the amount and quality of the data available for training of the algorithms (Banko & Brill, 2001; Chen et al., 2021). In our case, the input images should be representative of the target analysis and include adequate variability of the possible features, to allow the neural networks to produce the correct output. In this view, the original images should include, e.g., different possible backgrounds, different viewing angles of the considered geometry, and different light exposures (as local light reflections may affect the color of the geometry and thus the analysis). The creation of such a dataset for specific cases is not always straightforward; in our case, for example, it would imply a systematic acquisition of a large set of images in many different conditions. This would require, in turn, disposing of all the possible target defects on real parts, and of an automatic acquisition system, e.g., a robotic arm with an integrated camera. Given that, in our case, the initial dataset could not be generated on real parts, we chose to generate a well-balanced dataset of images in silico, that is, based on image renderings of the real geometry. The key idea was that, if the rendered geometry is sufficiently close to a real photograph, the algorithms may be trained on artificially-generated images and then tested on a few real ones. This approach, if properly automatized, makes it easy to produce a large number of images in all the different conditions required for the analysis.

In a first step, starting from the CAD file of the brake calipers, we worked manually in the open-source software Blender (Blender) to modify the material properties and achieve a realistic rendering. After that, defects were generated by means of Boolean (subtraction) operations between the geometry of the brake caliper and ad-hoc geometries for each defect. Fine-tuning of the generated defects allowed for a realistic representation of the different defects. Once the results were satisfactory, we developed an automated Python code to generate the renderings in different conditions. The Python code can: load a given CAD geometry, change the material properties, set different viewing angles for the geometry, add different types of defects (with given size, rotation, and location on the geometry of the brake caliper), add a custom background, change the lighting conditions, render the scene, and save it as an image.

In order to make the dataset as varied as possible, we introduced three light sources into the rendering environment: a diffuse natural light to simulate daylight conditions, and two additional artificial lights. The intensity of each light source and the viewing angle were then varied randomly, to mimic different daylight conditions and illuminations of the object. This procedure was designed to provide different situations akin to real use, and to make the model invariant to lighting conditions and camera position. Moreover, to provide additional flexibility to the model, the training dataset of images was virtually expanded using data augmentation (Mumuni & Mumuni, 2022), where saturation, brightness, and contrast were varied randomly during training operations. This procedure consistently increased the number and variety of the images in the training dataset.
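A minimal numpy sketch of the kind of photometric jitter described (illustrative only; the article does not give its augmentation code, and the jitter ranges below are arbitrary assumptions):

```python
import numpy as np

def augment(img, rng):
    """Randomly jitter brightness and contrast of a float image in [0, 1].
    Contrast scales pixel values around mid-gray; brightness shifts them."""
    brightness = rng.uniform(-0.1, 0.1)  # additive shift (assumed range)
    contrast = rng.uniform(0.8, 1.2)     # multiplicative scaling (assumed range)
    out = (img - 0.5) * contrast + 0.5 + brightness
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(42)
img = np.full((4, 4, 3), 0.6)  # a flat mid-tone toy image
aug = augment(img, rng)
```

Applying such transforms on the fly during training effectively multiplies the dataset without storing extra images.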

The developed automated pre-processing steps easily allow for batch generation of thousands of different images to be used for training of the neural networks. This possibility is key for proper training, as the variability of the input images allows the models to learn all the possible features and details that may change under real operating conditions.

Figure 3: Examples of the ground truth for the two target tasks: background removal (a) and defect recognition (b)

The first tests using this virtual database showed that, although the generated images were very similar to real photographs, the models were not able to properly recognize the target features in the real images. Thus, in an attempt to get closer to a proper set of real images, we decided to adopt a hybrid dataset, where the virtually generated images were mixed with the few available real ones. However, given that some possible defects were missing in the real images, we also decided to manipulate the images to introduce virtual defects on real images. The obtained dataset finally included more than 4,000 images, of which 90% were rendered and 10% were obtained from real images. To avoid possible bias in the training dataset, defects were present in 50% of the cases in both the rendered and real image sets. Thus, in the overall dataset, the real original images with no defects were 5% of the total.
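The stated composition can be checked with a few lines of arithmetic (the 4,000 figure below is approximate, since the text says "more than 4,000 images"):

```python
total = 4000                    # approximate dataset size from the text
rendered = int(total * 0.90)    # 90% rendered images
real = total - rendered         # 10% real images
defect_free_real = real * 0.50  # defects present in 50% of each subset
share = defect_free_real / total
print(f"real defect-free images: {share:.0%} of the dataset")  # 5%
```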

Along with the code for the rendering and manipulation of the images, dedicated Python routines were developed to generate the corresponding data labelling for the supervised training of the networks, namely the image masks. In particular, two masks were generated for each input image: one for the background removal operation, and one for the defect identification. In both cases, the masks consist of a binary (i.e. black and white) image where all the pixels of a target feature (i.e. the geometry or defect) are assigned unitary values (white), whereas all the remaining pixels are blacked (zero values). An example of these masks in relation to the geometry in Fig. 1 is shown in Fig. 3.

All the generated images were then down-sampled, that is, their resolution was reduced to avoid unnecessarily large computational times and (RAM) memory usage while maintaining the required level of detail for training of the neural networks. Finally, the input images and the related masks were split into a mosaic of smaller tiles, to achieve a size suitable for feeding the images to the neural networks with even lower RAM requirements. All the tiles were processed, and the whole image was reconstructed at the end of the process to visualize the overall final results.
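The tiling and reconstruction steps can be sketched as follows (an illustrative toy assuming image sides are multiples of the tile size; function names are my own, not the authors' routines):

```python
import numpy as np

def tile_image(img, tile):
    """Split an image whose sides are multiples of `tile` into square tiles,
    in row-major order."""
    H, W = img.shape[:2]
    return [img[r:r + tile, c:c + tile]
            for r in range(0, H, tile) for c in range(0, W, tile)]

def untile_image(tiles, shape, tile):
    """Reassemble tiles (row-major order) back into the original shape."""
    out = np.zeros(shape, dtype=tiles[0].dtype)
    i = 0
    for r in range(0, shape[0], tile):
        for c in range(0, shape[1], tile):
            out[r:r + tile, c:c + tile] = tiles[i]
            i += 1
    return out

img = np.arange(64).reshape(8, 8)
tiles = tile_image(img, 4)                     # four 4x4 tiles
restored = untile_image(tiles, img.shape, 4)   # lossless round trip
```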

Figure 4: Confusion matrix for accuracy assessment of the neural network models

Choice of the model

Within the scope of the present application, a wide range of possibly suitable models is available (Chen et al., 2021). In general, the choice of the best model for a given problem should be made on a case-by-case basis, considering an acceptable compromise between the achievable accuracy and the computational complexity/cost. Overly simple models can be very fast to respond, yet have reduced accuracy. On the other hand, more complex models can generally provide more accurate results, although they typically require larger amounts of data for training, and thus longer computational times and higher energy expense. Hence, testing plays the crucial role of allowing identification of the best trade-off between these two extreme cases. A benchmark for model accuracy can generally be defined in terms of a confusion matrix, where the model response is summarized into the following possibilities: True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN). This concept is summarized in Fig. 4. For the background removal, Positive (P) stands for pixels belonging to the brake caliper, while Negative (N) stands for background pixels. For the defect identification model, Positive (P) stands for non-defective geometries, whereas Negative (N) stands for defective geometries. With respect to these two cases, the True/False statements stand for correct or incorrect identification, respectively. The model accuracy can therefore be assessed as (Géron, 2022): \(A = \frac{TP + TN}{TP + TN + FP + FN}\)
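These four counts and the resulting accuracy can be computed in a small self-contained sketch (toy labels; following the defect identification convention in the text, 1 = non-defective = Positive and 0 = defective = Negative):

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """TP/TN/FP/FN counts for binary labels (1 = Positive, 0 = Negative)."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tp, tn, fp, fn

y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])  # toy ground-truth labels
y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 1])  # toy model predictions
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tp, tn, fp, fn, accuracy)  # 4 2 1 1 0.75
```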

Based on this metric, the accuracy of different models can then be evaluated on a given dataset, where typically 80% of the data is used for training and the remaining 20% for validation. For the defect recognition stage, the following models were tested: VGG-16 (Simonyan & Zisserman, 2014), ResNet50, ResNet101, ResNet152 (He et al., 2016), Inception V1 (Szegedy et al., 2015), Inception V4 and InceptionResNet V2 (Szegedy et al., 2017). Details on the assessment procedure for the different models are provided in the Supplementary Information file. For the background removal stage, the DeepLabV3 \(+\) (Chen et al., 2018) model was chosen as the first option, and no additional models were tested as it directly provided satisfactory results in terms of accuracy and processing time. This gives a preliminary indication that, in terms of task complexity, the defect identification stage can be more demanding than the background removal operation for the case study at hand. Besides the assessment of the accuracy according to, e.g., the metrics discussed above, additional information can generally be collected, such as too low an accuracy (indicating an insufficient amount of training data), possible bias of the models on the data (indicating a non-well-balanced training dataset), or other specific issues related to missing representative data in the training dataset (Géron, 2022). This information helps both to correctly shape the training dataset, and to gather useful indications for the fine-tuning of the model after its choice has been made.
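A minimal sketch of such a shuffled 80/20 split (illustrative only; the article does not specify how its split was implemented, and the function name is my own):

```python
import numpy as np

def train_val_split(n_samples, val_frac=0.2, seed=0):
    """Shuffle sample indices and split them into train/validation subsets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_val = int(round(n_samples * val_frac))
    return idx[n_val:], idx[:n_val]  # (train indices, validation indices)

train_idx, val_idx = train_val_split(1000)
print(len(train_idx), len(val_idx))  # 800 200
```

Fixing the seed makes the split reproducible across runs.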

Background removal

An initial bias of the model for background removal arose from the color of the original target geometry (red). The model was indeed identifying red spots in the background as part of the target geometry, an unwanted output. To improve the model's flexibility, and thus its accuracy in identifying the background, the training dataset was expanded using data augmentation (Géron, 2022). This technique artificially increases the size of the training dataset by applying various transformations to the available images, with the goal of improving the performance and generalization ability of the models. The approach typically involves applying geometric and/or color transformations to the original images; in our case, to account for different viewing angles of the geometry, different light exposures, and different color reflections and shadowing effects. These improvements of the training dataset proved to be effective for the background removal operation, with a validation accuracy finally ranging above 99% and model response times of around 1-2 seconds. An example of the output of this operation for the geometry in Fig. 1 is shown in Fig. 5.

While the results obtained were satisfactory for the original (red) color of the calipers, we decided to test the model's applicability to brake calipers of other colors as well. To this end, the model was trained and tested on a grayscale version of the images of the calipers, which completely removes any possible bias of the model towards a specific color. In this case, the validation accuracy of the model still ranged above 99%; thus, this approach proved particularly useful for making the model suitable for the background removal operation even on images including calipers of different colors.
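Grayscale conversion of this kind can be sketched with standard luminance weights (BT.601 coefficients are an assumption here, as the article does not state which conversion was used):

```python
import numpy as np

def to_grayscale(img):
    """Collapse RGB channels to a single luminance channel (ITU-R BT.601
    weights), removing any color cue the model could latch onto."""
    weights = np.array([0.299, 0.587, 0.114])
    return img @ weights

red_patch = np.full((2, 2, 3), [255.0, 0.0, 0.0])  # toy pure-red pixels
gray = to_grayscale(red_patch)
print(gray[0, 0])  # 255 * 0.299 = 76.245
```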

Figure 5: Target geometry after background removal

Defect recognition

An overview of the performance of the tested models for the defect recognition operation on the original geometry of the caliper is reported in Table 1 (see also the Supplementary Information file for more details on the assessment of the different models). The results report the achieved validation accuracy (\(A_v\)) and the number of parameters (\(N_p\)), the latter being the total number of trainable parameters in each model (Géron, 2022). Here, this quantity is adopted as an indicator of the complexity of each model.

Figure 6: Accuracy (a) and loss function (b) curves for the ResNet101 model during training

As the results in Table 1 show, the VGG-16 model was quite imprecise on our dataset, eventually showing underfitting (Géron, 2022). Thus, we decided to opt for the ResNet and Inception families of models. Both families proved suitable for handling our dataset, with slightly less accurate results provided by ResNet50 and Inception V1. The best results were obtained using ResNet101 and Inception V4, with very high final accuracy and fast processing times (on the order of \(\sim \)1 second). Finally, the ResNet152 and InceptionResNet V2 models proved to be slightly too complex and slow for our case; they provided excellent results but with longer response times (on the order of \(\sim \)3-5 seconds). The response time is affected by the complexity (\(N_p\)) of the model itself, and by the hardware used. In our work, GPUs were used for training and testing all the models, and the hardware conditions were kept the same for all of them.

Based on the results obtained, the ResNet101 model was chosen as the best solution for our application, in terms of accuracy and reduced complexity. After fine-tuning operations, the accuracy obtained with this model reached nearly 99%, on both the validation and test datasets. The latter includes real target images that the models have never seen before; thus, it can be used to test the ability of the models to generalize the information learnt during the training/validation phase.

The trends of accuracy increase and loss decrease during training of the ResNet101 model on the original geometry are shown in Fig. 6(a) and (b), respectively. The loss function quantifies the error between the predicted output of the model during training and the actual target values in the dataset. In our case, the loss is computed using the cross-entropy function, and the model is trained with the Adam optimiser (Géron, 2022). The error is expected to decrease during training, which eventually leads to more accurate predictions of the model on previously-unseen data. The combination of accuracy and loss trends, along with other control parameters, is typically monitored to evaluate the training process and avoid, e.g., under- or over-fitting problems (Géron, 2022). As Fig. 6(a) shows, the accuracy experiences a sudden step increase during the very first training epochs (an epoch being one complete pass of the model through the training database (Géron, 2022)). The accuracy then increases smoothly with the epochs, until an asymptotic value is reached for both training and validation accuracy. These trends can generally be associated with proper training; in particular, the closeness of the training and validation accuracy curves can be interpreted as an absence of over-fitting problems. Likewise, Fig. 6(b) shows that the loss curves are close to each other, with a monotonically-decreasing trend, which can be interpreted as an absence of under-fitting problems, and thus as proper training of the model.
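The cross-entropy loss mentioned above can be sketched, for the binary case, as follows (a toy numpy version for illustration, not the framework implementation actually used for training):

```python
import math
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted
    probabilities; eps clipping avoids log(0)."""
    p = np.clip(p_pred, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

y = np.array([1.0, 0.0])
p = np.array([0.5, 0.5])   # maximally uncertain predictions
loss = binary_cross_entropy(y, p)
print(loss)  # ln(2), about 0.693
```

More confident correct predictions yield a lower loss, which is what the optimiser drives down during training.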

Fig. 7: Final results of the analysis on the defect identification: (a) considered input geometry; (b), (c) and (d) identification of a scratch on the surface, a partially missing logo, and a painting defect, respectively (highlighted in the red frames).

Finally, an example output of the overall analysis is shown in Fig. 7 : the considered input geometry is shown in (a), along with the identification of the defects in (b), (c) and (d), as obtained from the developed protocol. Note that the different defects have been separated into several figures here for illustrative purposes; the analysis, however, yields the identification of all defects on one single image. In this work, a binary classification was performed on the considered brake calipers, where the output of the models discriminates between defective and non-defective components based on the presence or absence of any of the considered defects. Note that the fine tuning of this discrimination ultimately rests with the user's requirements. Indeed, the model output is the probability (from 0 to 100%) of the possible presence of defects; thus, the discrimination between a defective and a non-defective part ultimately depends on the user's choice of the acceptance threshold for the considered part (50% in our case). Therefore, stricter or looser criteria can be readily adopted. For particularly complex cases, multiple models may also be used concurrently for the same task, and the final output defined based on a cross-comparison of the results from the different models. As a last remark on the proposed procedure, note that here we adopted a binary classification based on the presence or absence of any defect; however, a further classification could also be implemented, to distinguish among different types of defects (multi-class classification) on the brake calipers.
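The acceptance-threshold logic and the cross-comparison of multiple models can be sketched as follows; the function names and the "any model flags it" ensemble rule are our own illustrative choices, while the 0.50 default reproduces the threshold stated above:

```python
def classify(p_defect, threshold=0.50):
    """Binary decision from the model's defect probability.
    Lowering the threshold gives stricter QC (more rejects),
    raising it gives a looser one."""
    return "defective" if p_defect >= threshold else "compliant"

def cross_compare(probabilities, threshold=0.50):
    """Hypothetical ensemble rule for complex cases: flag the part as
    defective if ANY of the concurrent models exceeds the threshold
    (a conservative cross-comparison policy)."""
    return "defective" if any(p >= threshold for p in probabilities) else "compliant"
```

For example, `classify(0.2, threshold=0.1)` rejects a part that the default threshold would accept, illustrating how a stricter criterion is adopted without retraining the model.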

Energy saving

Illustrative scenarios

Given that the proposed tools have not yet been implemented and tested within a real industrial production line, we analyze here three prospective scenarios to provide a practical example of the potential for energy savings in an industrial context. Specifically, we consider a generic brake caliper assembly line formed by 14 stations, as outlined in Table 1 in the work by Burduk and Górnicka ( 2017 ). This assembly line features a critical inspection station dedicated to defect detection, around which we construct three distinct scenarios to evaluate the efficacy of traditional human-based control operations versus a quality control system augmented by the proposed Machine Learning (ML) tools, namely:

First Scenario (S1): Human-Based Inspection. The traditional approach involves a human operator responsible for the inspection tasks.

Second Scenario (S2): Hybrid Inspection. This scenario introduces a hybrid inspection system where our proposed ML-based automatic detection tool assists the human inspector. The ML tool analyzes the brake calipers and alerts the human inspector only when it encounters difficulties in identifying defects, specifically when the probability of a defect being present or absent falls below a certain threshold. This collaborative approach aims to combine the precision of ML algorithms with the experience of human inspectors, and can be seen as a possible transition scenario between the human-based and a fully-automated quality control operation.

Third Scenario (S3): Fully Automated Inspection. In the final scenario, we conceive a completely automated defect inspection station powered exclusively by our ML-based detection system. This setup eliminates the need for human intervention, relying entirely on the capabilities of the ML tools to identify defects.

For simplicity, we assume that all the stations are aligned in series without buffers, minimizing unnecessary complications in our estimations. To quantify the beneficial effects of implementing ML-based quality control, we adopt the Overall Equipment Effectiveness (OEE) as the primary metric for the analysis. OEE is a comprehensive measure derived from the product of three critical factors, as outlined by Nota et al. ( 2020 ): Availability (the ratio of operating time with respect to planned production time); Performance (the ratio of actual output with respect to the theoretical maximum output); and Quality (the ratio of the good units with respect to the total units produced). In this section, we will discuss the details of how we calculate each of these factors for the various scenarios.
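The OEE metric itself is simply the product of the three factors; as a trivial sketch (the numbers in the example are illustrative, not the scenario values derived later):

```python
def oee(availability, performance, quality):
    """Overall Equipment Effectiveness: the product of the three
    factors, each expressed as a fraction in [0, 1]."""
    return availability * performance * quality

# Illustrative values only: A = 90%, P = 95%, Q = 80% -> OEE = 68.4%
oee_example = oee(0.9, 0.95, 0.8)
```

Because the factors multiply, a large gain in one factor (here, Quality through automated inspection) lifts the OEE even when the other two barely move.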

To calculate Availability ( \(A\) ), we consider an 8-hour work shift ( \(t_{shift}\) ) with 30 minutes of breaks ( \(t_{break}\) ), during which we assume that production stops (except for the fully automated scenario), and 30 minutes of scheduled downtime ( \(t_{sched}\) ) required for machine cleaning and startup procedures. For the unscheduled downtime ( \(t_{unsched}\) ), primarily due to machine breakdowns, we assume an average breakdown probability ( \(\rho _{down}\) ) of 5% for each machine, with an average repair time of one hour per incident ( \(t_{down}\) ). Based on these assumptions, since the Availability represents the ratio of run time ( \(t_{run}\) ) to production time ( \(t_{pt}\) ), it can be calculated using the following formula:

\[ A = \frac{t_{run}}{t_{pt}} = \frac{t_{pt} - t_{unsched}}{t_{pt}}, \qquad t_{pt} = t_{shift} - t_{break} - t_{sched}, \]

with the unscheduled downtime being computed as follows:

\[ t_{unsched} = \left[ 1-\left( 1-\rho _{down}\right) ^{N}\right] t_{down}, \]

where \(N\) is the number of machines in the production line and \(1-\left( 1-\rho _{down}\right) ^{N}\) represents the probability that at least one machine breaks down during the work shift. For the sake of simplicity, \(t_{down}\) is assumed constant regardless of the number of failures.
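Under these assumptions, the Availability can be computed as in the following sketch (times in minutes; the variable names are ours, and the numbers reproduce the stated assumptions for the baseline scenario):

```python
def availability(t_shift, t_break, t_sched, n_machines, rho_down, t_down):
    """Availability A = t_run / t_pt, with the unscheduled downtime
    modelled as [1 - (1 - rho_down)^N] * t_down (all times in minutes)."""
    t_pt = t_shift - t_break - t_sched                        # production time
    t_unsched = (1.0 - (1.0 - rho_down) ** n_machines) * t_down
    return (t_pt - t_unsched) / t_pt

# Baseline: 8-h shift, 30' breaks, 30' scheduled downtime,
# 14 stations, 5% breakdown probability, 1-h average repair time.
a_baseline = availability(480, 30, 30, 14, 0.05, 60)
```

With these inputs the Availability comes out at roughly 93%, i.e. about half an hour of expected unscheduled downtime against 420 minutes of production time.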

Table  2 presents the numerical values used to calculate the Availability in the three scenarios. In the second scenario, we can observe that integrating the automated station leads to a decrease in the first factor of the OEE analysis, which can be attributed to the additional station for automated quality control (and the related potential failures); this ultimately increases the estimated unscheduled downtime. In the third scenario, the detrimental effect of the additional station is offset by the beneficial effect of the automated quality control in removing the need for pauses during operator breaks; thus, the Availability for the third scenario is substantially equivalent to that of the first one (baseline).

The second factor of the OEE, Performance ( \(P\) ), assesses the operational efficiency of the production equipment relative to its maximum designed speed, i.e. the ideal cycle time ( \(t_{line}\) ). This evaluation accounts for reductions in cycle speed and minor stoppages, collectively termed speed losses . These losses are challenging to estimate in advance, as Performance is typically measured using historical data from the production line. For this analysis, we hypothesize a reasonable estimate of 60 seconds lost to speed losses ( \(t_{losses}\) ) in each work cycle. Although this assumption may appear strong, it will become evident later that, within the context of this analysis – particularly regarding the impact of automated inspection on energy savings – the Performance (like the Availability) is only marginally influenced by the introduction of an automated inspection station. To account for the effect of automated inspection on the assembly-line speed, we keep the time required by the other 13 stations ( \(t^*_{line}\) ) constant while varying the time allocated for visual inspection ( \(t_{inspect}\) ). According to Burduk and Górnicka ( 2017 ), the total operation time of the production line, excluding inspection, is 1263 seconds, with manual visual inspection taking 38 seconds. For the fully automated third scenario, we assume an inspection time of 5 seconds, which encloses the photo collection, pre-processing, ML analysis, and post-processing steps. In the second scenario, instead, we add an extra time to the purely automatic case to account for the cases in which the confidence of the ML model falls below 90%. We assume this happens once in every 10 inspections, which is a conservative estimate, higher than what we observed during model testing; this results in adding 10% of the human inspection time to the fully automated time. Thus, when \(t_{losses}\) is known, the Performance can be expressed as follows:

\[ P = \frac{t_{line}}{t_{line} + t_{losses}}, \qquad t_{line} = t^{*}_{line} + t_{inspect}. \]

The calculated values of the Performance are presented in Table  3 . We can note that the change in inspection time has a negligible impact on this factor, since it does not affect the speed losses; at least to our knowledge, there is no clear evidence to suggest that the introduction of a new inspection station would alter these losses. Moreover, given the specific linear layout of the considered production line, the change in inspection time has only a marginal effect on the production speed. However, this approach could potentially bias our scenarios towards always favouring automation. To evaluate this hypothesis, a sensitivity analysis exploring scenarios where the production line operates at a faster pace is discussed in the next subsection.
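Reading Performance as the ratio of the ideal cycle time to the actual cycle time (our interpretation of the setup above; variable names are ours), the three scenarios can be sketched as:

```python
def performance(t_star_line, t_inspect, t_losses=60.0):
    """Performance P = t_line / (t_line + t_losses), with the ideal
    cycle time t_line = t*_line + t_inspect (times in seconds)."""
    t_line = t_star_line + t_inspect
    return t_line / (t_line + t_losses)

p_s1 = performance(1263, 38.0)               # S1: human inspection (38 s)
p_s2 = performance(1263, 5.0 + 0.1 * 38.0)   # S2: hybrid, +10% of human time
p_s3 = performance(1263, 5.0)                # S3: fully automated (5 s)
```

All three values land within about a tenth of a percentage point of each other, consistent with the negligible impact on Performance noted above.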

The last factor, Quality ( \(Q\) ), quantifies the ratio of compliant products to the total products manufactured, effectively filtering out items that fail to meet the quality standards due to defects. Given the objective of our automated algorithm, we anticipate this factor of the OEE to be significantly enhanced by the ML-based automated inspection station. To estimate it, we assume a constant defect probability for the production line ( \(\rho _{def}\) ) of 5%. Consequently, the number of defective products ( \(N_{def}\) ) during the work shift is calculated as \(N_{unit} \cdot \rho _{def}\) , where \(N_{unit}\) represents the average number of units (brake calipers) assembled on the production line, defined as:

\[ N_{unit} = \frac{t_{run}}{t_{line} + t_{losses}}. \]

To quantify the defective units identified, we consider the inspection accuracy ( \(\rho _{acc}\) ): for human visual inspection, the typical accuracy is 80% (Sundaram & Zeid, 2023 ), while for the ML-based station we use the accuracy of our best model, i.e., 99%. Additionally, we account for the probability that a defective caliper is mistakenly identified as defect-free, i.e., the false-negative rate ( \(\rho _{FN}\) ), defined as

In the absence of any reasonable evidence to justify a bias towards one type of mistake over the others, we assume a uniform distribution regarding error preference for both human and automated inspections, i.e. we set \(\rho ^{H}_{FN} = \rho ^{ML}_{FN} = \rho _{FN} = 50\%\) . Thus, the number of final compliant goods ( \(N_{goods}\) ), i.e., the calipers that are identified as quality-compliant, can be calculated as:

where \(N_{detect}\) is the total number of detected defective units, comprising TN (true negatives, i.e. correctly identified defective calipers) and FN (false negatives, i.e. calipers mistakenly identified as defect-free). The Quality factor can then be computed as:

Table  4 summarizes the Quality factor calculation, showcasing the substantial improvement brought by the ML-based inspection station due to its higher accuracy compared to human operators.

Fig. 8: Overall Equipment Effectiveness (OEE) analysis for the three scenarios (S1: Human-Based Inspection, S2: Hybrid Inspection, S3: Fully Automated Inspection). The height of the bars represents the percentage of the three factors A : Availability, P : Performance, and Q : Quality, which can be read from the left axis. The green bars indicate the OEE value, derived from the product of these three factors. The red line shows the recall rate, i.e. the probability that a defective product is rejected by the client, with values displayed on the right (red) axis.

Finally, we can determine the Overall Equipment Effectiveness by multiplying the three factors computed above. Additionally, we can estimate the recall rate ( \(\rho _{R}\) ), which reflects the rate at which a customer might reject products. This is derived from the difference between the total number of defective units, \(N_{def}\) , and the number of units correctly identified as defective, TN , indicating the potential for defective brake calipers to bypass the inspection process. In Fig.  8 we summarize the outcomes of the three scenarios. It is crucial to note that the scenarios incorporating the automated defect detector, S2 and S3, significantly enhance the Overall Equipment Effectiveness, primarily through substantial improvements in the Quality factor. Among these, the fully automated inspection scenario, S3, emerges as a slightly superior option, thanks to the additional benefit of removing the breaks and increasing the speed of the line. However, given the several assumptions required for this OEE study, these results should be interpreted as illustrative, and considered primarily as a comparison against the baseline scenario. To analyze the sensitivity of the outlined scenarios to the adopted assumptions, we investigate the influence of the line speed and of the human accuracy on the results in the next subsection.
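The recall rate described above can be sketched as follows (variable names are ours; the unit count is arbitrary since it cancels out, while the defect probability and accuracies reproduce the stated assumptions):

```python
def recall_rate(n_unit, rho_def, rho_acc):
    """Fraction of total output reaching the customer with an
    undetected defect: (N_def - TN) / N_unit, where N_def =
    N_unit * rho_def and TN = N_def * rho_acc are the defective
    units produced and those correctly flagged, respectively."""
    n_def = n_unit * rho_def   # defective units produced
    tn = n_def * rho_acc       # correctly identified as defective
    return (n_def - tn) / n_unit

rho_r_human = recall_rate(400, 0.05, 0.80)  # human inspection: 1.0%
rho_r_ml = recall_rate(400, 0.05, 0.99)     # ML-based inspection: 0.05%
```

The twenty-fold drop in escaped defects between the two accuracies is what drives the Quality improvement (and the red recall-rate line) in Fig. 8.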

Sensitivity analysis

The scenarios described previously are illustrative and based on several simplifying hypotheses. One such hypothesis is that the production-chain layout operates entirely in series, with each station awaiting the arrival of the workpiece from the preceding station, resulting in a relatively slow production rate (1263 seconds). This setup can be quite different from reality, where slower operations can be accelerated by installing additional machines in parallel to balance the workload and enhance productivity. Moreover, we utilized a literature value of 80% for the accuracy of the human visual inspector, as reported by Sundaram and Zeid ( 2023 ). However, this accuracy can vary significantly with factors such as the experience of the inspector and the defect type.

Fig. 9: Effect of the assembly time of the stations (excluding visual inspection), \(t^*_{line}\) , and of the human inspection accuracy, \(\rho _{acc}\) , on the OEE analysis. Subplot (a) shows the difference between scenario S2 (Hybrid Inspection) and the baseline scenario S1 (Human Inspection), while subplot (b) displays the difference between scenario S3 (Fully Automated Inspection) and the baseline. The maps indicate in red the values of \(t^*_{line}\) and \(\rho _{acc}\) where the integration of automated inspection stations can significantly improve the OEE, and in blue where it may lower the score. The dashed lines denote the break-even points, and the circled points pinpoint the values of the scenarios used in the “Illustrative scenarios” subsection.

A sensitivity analysis on these two factors was conducted to address these variations. The assembly time of the stations (excluding visual inspection), \(t^*_{line}\) , was varied from 60 s to 1500 s, and the human inspection accuracy, \(\rho _{acc}\) , from 50% (akin to a random guesser) to 100% (an ideal visual inspector), while the other variables were kept fixed.

The comparison of the OEE enhancement for the two scenarios employing ML-based inspection against the baseline scenario is displayed in the two maps of Fig.  9 . As the figure shows, due to the high accuracy and rapid response of the proposed automated inspection station, the area where the process may benefit from energy savings in the assembly line (red shades) is significantly larger than the area where its introduction could degrade performance (blue shades). However, it can also be observed that the automated inspection could be superfluous or even detrimental in those scenarios where human accuracy and assembly speed are already very high, indicating an already highly accurate workflow. In these cases, and particularly for very fast production lines, short quality-control times can be expected to be key (beyond accuracy) for the optimization.
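A simplified stand-in for this sensitivity sweep can be sketched as follows. This is not the paper's full OEE model: Availability is held fixed, Quality is approximated by the fraction of output not shipped defective, and all names besides the stated assumptions (5% defect rate, 99% ML accuracy, 60 s speed losses, 38 s vs 5 s inspection times) are ours:

```python
import numpy as np

def sweep(rho_def=0.05, rho_ml=0.99, t_losses=60.0):
    """Grid sweep over t*_line and human accuracy, returning the
    difference in P*Q between the ML-based and the human scenario
    (positive where automation helps, negative where it does not)."""
    t_star = np.linspace(60.0, 1500.0, 25)   # t*_line [s]
    rho_acc = np.linspace(0.50, 1.00, 26)    # human inspection accuracy
    T, R = np.meshgrid(t_star, rho_acc)

    p_human = (T + 38.0) / (T + 38.0 + t_losses)  # manual inspection: 38 s
    p_auto = (T + 5.0) / (T + 5.0 + t_losses)     # automated inspection: 5 s
    q_human = 1.0 - rho_def * (1.0 - R)           # defects shipped by the human
    q_auto = 1.0 - rho_def * (1.0 - rho_ml)       # defects shipped by the model
    return p_auto * q_auto - p_human * q_human

delta = sweep()
```

Even in this reduced model both a favourable (positive) and an unfavourable (negative) region appear, with the break-even boundary sitting in the high-human-accuracy corner, qualitatively as in Fig. 9.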

Finally, it is important to remark that the blue region (the area below the dashed break-even lines) might expand if the accuracy of the neural networks for defect detection turns out to be lower when implemented in a real production line. This would indicate the necessity of new rounds of active learning and of an increase in the ratio of real images in the database, to eventually enhance the performance of the ML model.

Conclusions

Industrial quality control of manufactured parts is typically carried out by human visual inspection. This usually requires a dedicated handling system and generally results in a slower production rate, with the associated non-optimal use of energy resources. Based on a practical test case for quality control in brake caliper manufacturing, in this work we have reported on a workflow developed to integrate Machine Learning methods and automate the process. The proposed approach relies on image analysis via Deep Convolutional Neural Networks. These models allow efficient extraction of information from images, thus potentially representing a valuable alternative to human inspection.

The proposed workflow relies on a two-step procedure on the images of the brake calipers: first, the background is removed from the image; second, the geometry is inspected to identify possible defects. These two steps are accomplished by two dedicated neural-network models, an encoder-decoder and an encoder network, respectively. Training of these neural networks typically requires a large number of images representative of the problem. Given that such a database is not always readily available, we have presented and discussed an alternative methodology for the generation of the input database using 3D renderings. While integration of the database with real photographs was required for optimal results, this approach has allowed fast and flexible generation of a large base of representative images. The pre-processing steps required to feed the data to the neural networks, and the training of the networks, have also been discussed.

Several models have been tested and evaluated, and the best one for the considered case identified. The obtained accuracy for defect identification reaches \(\sim \) 99% on the tested cases. Moreover, the response of the models on each image is fast (on the order of a few seconds), which makes them compliant with the most typical industrial expectations.

In order to provide a practical example of the possible energy savings achievable by implementing the proposed ML-based methodology for quality control, we have analyzed three prospective industrial scenarios: a baseline scenario, where quality-control tasks are performed by a human inspector; a hybrid scenario, where the proposed ML automatic detection tool assists the human inspector; and a fully-automated scenario, where we envision a completely automated defect inspection. The results show that the proposed tools may help increase the Overall Equipment Effectiveness by up to \(\sim \) 10% with respect to the considered baseline scenario. However, a sensitivity analysis on the speed of the production line and on the accuracy of the human inspector has also shown that the automated inspection could be superfluous or even detrimental in those cases where human accuracy and assembly speed are already very high. In these cases, reducing the time required for quality control can be expected to be the major controlling parameter (beyond accuracy) for optimization.

Overall, the results show that, with proper tuning, these models may represent a valuable resource for integration into production lines, with positive outcomes on the overall effectiveness, ultimately leading to a better use of energy resources. To this end, while the practical implementation of the proposed tools can be expected to require limited investments (e.g. a portable camera, a dedicated workstation, and an operator with proper training), in-field tests on a real industrial line would be required to confirm the potential of the proposed technology.

Agrawal, R., Majumdar, A., Kumar, A., & Luthra, S. (2023). Integration of artificial intelligence in sustainable manufacturing: Current status and future opportunities. Operations Management Research, 1–22.

Alzubaidi, L., Zhang, J., Humaidi, A. J., Al-Dujaili, A., Duan, Y., Al-Shamma, O., Santamaría, J., Fadhel, M. A., Al-Amidie, M., & Farhan, L. (2021). Review of deep learning: Concepts, cnn architectures, challenges, applications, future directions. Journal of big Data, 8 , 1–74.


Angelopoulos, A., Michailidis, E. T., Nomikos, N., Trakadas, P., Hatziefremidis, A., Voliotis, S., & Zahariadis, T. (2019). Tackling faults in the industry 4.0 era-a survey of machine—learning solutions and key aspects. Sensors, 20 (1), 109.

Arana-Landín, G., Uriarte-Gallastegi, N., Landeta-Manzano, B., & Laskurain-Iturbe, I. (2023). The contribution of lean management—industry 4.0 technologies to improving energy efficiency. Energies, 16 (5), 2124.

Badmos, O., Kopp, A., Bernthaler, T., & Schneider, G. (2020). Image-based defect detection in lithium-ion battery electrode using convolutional neural networks. Journal of Intelligent Manufacturing, 31 , 885–897. https://doi.org/10.1007/s10845-019-01484-x

Banko, M., & Brill, E. (2001). Scaling to very very large corpora for natural language disambiguation. In Proceedings of the 39th annual meeting of the association for computational linguistics (pp. 26–33).

Benedetti, M., Bonfà, F., Introna, V., Santolamazza, A., & Ubertini, S. (2019). Real time energy performance control for industrial compressed air systems: Methodology and applications. Energies, 12 (20), 3935.

Bhatt, D., Patel, C., Talsania, H., Patel, J., Vaghela, R., Pandya, S., Modi, K., & Ghayvat, H. (2021). Cnn variants for computer vision: History, architecture, application, challenges and future scope. Electronics, 10 (20), 2470.

Bilgen, S. (2014). Structure and environmental impact of global energy consumption. Renewable and Sustainable Energy Reviews, 38 , 890–902.

Blender. (2023). Open-source software. https://www.blender.org/ . Accessed 18 Apr 2023.

Bologna, A., Fasano, M., Bergamasco, L., Morciano, M., Bersani, F., Asinari, P., Meucci, L., & Chiavazzo, E. (2020). Techno-economic analysis of a solar thermal plant for large-scale water pasteurization. Applied Sciences, 10 (14), 4771.

Burduk, A., & Górnicka, D. (2017). Reduction of waste through reorganization of the component shipment logistics. Research in Logistics & Production, 7 (2), 77–90. https://doi.org/10.21008/j.2083-4950.2017.7.2.2

Carvalho, T. P., Soares, F. A., Vita, R., Francisco, R. d. P., Basto, J. P., & Alcalá, S. G. (2019). A systematic literature review of machine learning methods applied to predictive maintenance. Computers & Industrial Engineering, 137 , 106024.

Casini, M., De Angelis, P., Chiavazzo, E., & Bergamasco, L. (2024). Current trends on the use of deep learning methods for image analysis in energy applications. Energy and AI, 15 , 100330. https://doi.org/10.1016/j.egyai.2023.100330

Chai, J., Zeng, H., Li, A., & Ngai, E. W. (2021). Deep learning in computer vision: A critical review of emerging techniques and application scenarios. Machine Learning with Applications, 6 , 100134.

Chen, L. C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H. (2018). Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV) (pp. 801–818).

Chen, L., Li, S., Bai, Q., Yang, J., Jiang, S., & Miao, Y. (2021). Review of image classification algorithms based on convolutional neural networks. Remote Sensing, 13 (22), 4712.

Chen, T., Sampath, V., May, M. C., Shan, S., Jorg, O. J., Aguilar Martín, J. J., Stamer, F., Fantoni, G., Tosello, G., & Calaon, M. (2023). Machine learning in manufacturing towards industry 4.0: From ‘for now’to ‘four-know’. Applied Sciences, 13 (3), 1903. https://doi.org/10.3390/app13031903

Choudhury, A. (2021). The role of machine learning algorithms in materials science: A state of art review on industry 4.0. Archives of Computational Methods in Engineering, 28 (5), 3361–3381. https://doi.org/10.1007/s11831-020-09503-4

Dalzochio, J., Kunst, R., Pignaton, E., Binotto, A., Sanyal, S., Favilla, J., & Barbosa, J. (2020). Machine learning and reasoning for predictive maintenance in industry 4.0: Current status and challenges. Computers in Industry, 123 , 103298.

Fasano, M., Bergamasco, L., Lombardo, A., Zanini, M., Chiavazzo, E., & Asinari, P. (2019). Water/ethanol and 13x zeolite pairs for long-term thermal energy storage at ambient pressure. Frontiers in Energy Research, 7 , 148.

Géron, A. (2022). Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow . O’Reilly Media, Inc.

GrabCAD. (2023). Brake caliper 3D model by Mitulkumar Sakariya from the GrabCAD free library (non-commercial public use). https://grabcad.com/library/brake-caliper-19 . Accessed 18 Apr 2023.

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).

Ho, S., Zhang, W., Young, W., Buchholz, M., Al Jufout, S., Dajani, K., Bian, L., & Mozumdar, M. (2021). Dlam: Deep learning based real-time porosity prediction for additive manufacturing using thermal images of the melt pool. IEEE Access, 9 , 115100–115114. https://doi.org/10.1109/ACCESS.2021.3105362

Ismail, M. I., Yunus, N. A., & Hashim, H. (2021). Integration of solar heating systems for low-temperature heat demand in food processing industry-a review. Renewable and Sustainable Energy Reviews, 147 , 111192.

LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521 (7553), 436–444.

Leong, W. D., Teng, S. Y., How, B. S., Ngan, S. L., Abd Rahman, A., Tan, C. P., Ponnambalam, S., & Lam, H. L. (2020). Enhancing the adaptability: Lean and green strategy towards the industry revolution 4.0. Journal of cleaner production, 273 , 122870.

Liu, Z., Wang, X., Zhang, Q., & Huang, C. (2019). Empirical mode decomposition based hybrid ensemble model for electrical energy consumption forecasting of the cement grinding process. Measurement, 138 , 314–324.

Li, G., & Zheng, X. (2016). Thermal energy storage system integration forms for a sustainable future. Renewable and Sustainable Energy Reviews, 62 , 736–757.

Maggiore, S., Realini, A., Zagano, C., & Bazzocchi, F. (2021). Energy efficiency in industry 4.0: Assessing the potential of industry 4.0 to achieve 2030 decarbonisation targets. International Journal of Energy Production and Management, 6 (4), 371–381.

Mazzei, D., & Ramjattan, R. (2022). Machine learning for industry 4.0: A systematic review using deep learning-based topic modelling. Sensors, 22 (22), 8641.

Md, A. Q., Jha, K., Haneef, S., Sivaraman, A. K., & Tee, K. F. (2022). A review on data-driven quality prediction in the production process with machine learning for industry 4.0. Processes, 10 (10), 1966. https://doi.org/10.3390/pr10101966

Minaee, S., Boykov, Y., Porikli, F., Plaza, A., Kehtarnavaz, N., & Terzopoulos, D. (2021). Image segmentation using deep learning: A survey. IEEE transactions on pattern analysis and machine intelligence, 44 (7), 3523–3542.


Mishra, S., Srivastava, R., Muhammad, A., Amit, A., Chiavazzo, E., Fasano, M., & Asinari, P. (2023). The impact of physicochemical features of carbon electrodes on the capacitive performance of supercapacitors: a machine learning approach. Scientific Reports, 13 (1), 6494. https://doi.org/10.1038/s41598-023-33524-1

Mumuni, A., & Mumuni, F. (2022). Data augmentation: A comprehensive survey of modern approaches. Array, 16 , 100258. https://doi.org/10.1016/j.array.2022.100258

Mypati, O., Mukherjee, A., Mishra, D., Pal, S. K., Chakrabarti, P. P., & Pal, A. (2023). A critical review on applications of artificial intelligence in manufacturing. Artificial Intelligence Review, 56 (Suppl 1), 661–768.

Narciso, D. A., & Martins, F. (2020). Application of machine learning tools for energy efficiency in industry: A review. Energy Reports, 6 , 1181–1199.

Nota, G., Nota, F. D., Peluso, D., & Toro Lazo, A. (2020). Energy efficiency in industry 4.0: The case of batch production processes. Sustainability, 12 (16), 6631. https://doi.org/10.3390/su12166631

Ocampo-Martinez, C., et al. (2019). Energy efficiency in discrete-manufacturing systems: Insights, trends, and control strategies. Journal of Manufacturing Systems, 52 , 131–145.

Pan, Y., Hao, L., He, J., Ding, K., Yu, Q., & Wang, Y. (2024). Deep convolutional neural network based on self-distillation for tool wear recognition. Engineering Applications of Artificial Intelligence, 132 , 107851.

Qin, J., Liu, Y., Grosvenor, R., Lacan, F., & Jiang, Z. (2020). Deep learning-driven particle swarm optimisation for additive manufacturing energy optimisation. Journal of Cleaner Production, 245 , 118702.

Rahul, M., & Chiddarwar, S. S. (2023). Integrating virtual twin and deep neural networks for efficient and energy-aware robotic deburring in industry 4.0. International Journal of Precision Engineering and Manufacturing, 24 (9), 1517–1534.

Ribezzo, A., Falciani, G., Bergamasco, L., Fasano, M., & Chiavazzo, E. (2022). An overview on the use of additives and preparation procedure in phase change materials for thermal energy storage with a focus on long term applications. Journal of Energy Storage, 53 , 105140.

Shahin, M., Chen, F. F., Hosseinzadeh, A., Bouzary, H., & Shahin, A. (2023). Waste reduction via image classification algorithms: Beyond the human eye with an ai-based vision. International Journal of Production Research, 1–19.

Shen, F., Zhao, L., Du, W., Zhong, W., & Qian, F. (2020). Large-scale industrial energy systems optimization under uncertainty: A data-driven robust optimization approach. Applied Energy, 259 , 114199.

Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 .

Sundaram, S., & Zeid, A. (2023). Artificial Intelligence-Based Smart Quality Inspection for Manufacturing. Micromachines, 14 (3), 570. https://doi.org/10.3390/mi14030570

Szegedy, C., Ioffe, S., Vanhoucke, V., & Alemi, A. (2017). Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the AAAI conference on artificial intelligence (vol. 31).

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1–9).

Trezza, G., Bergamasco, L., Fasano, M., & Chiavazzo, E. (2022). Minimal crystallographic descriptors of sorption properties in hypothetical mofs and role in sequential learning optimization. npj Computational Materials, 8 (1), 123. https://doi.org/10.1038/s41524-022-00806-7

Vater, J., Schamberger, P., Knoll, A., & Winkle, D. (2019). Fault classification and correction based on convolutional neural networks exemplified by laser welding of hairpin windings. In 2019 9th International Electric Drives Production Conference (EDPC) (pp. 1–8). IEEE.

Wen, L., Li, X., Gao, L., & Zhang, Y. (2017). A new convolutional neural network-based data-driven fault diagnosis method. IEEE Transactions on Industrial Electronics, 65 (7), 5990–5998. https://doi.org/10.1109/TIE.2017.2774777

Willenbacher, M., Scholten, J., & Wohlgemuth, V. (2021). Machine learning for optimization of energy and plastic consumption in the production of thermoplastic parts in sme. Sustainability, 13 (12), 6800.

Zhang, X. H., Zhu, Q. X., He, Y. L., & Xu, Y. (2018). Energy modeling using an effective latent variable based functional link learning machine. Energy, 162 , 883–891.


Acknowledgements

This work has been supported by GEFIT S.p.a.

Open access funding provided by Politecnico di Torino within the CRUI-CARE Agreement.

Author information

Authors and Affiliations

Department of Energy, Politecnico di Torino, Turin, Italy

Mattia Casini, Paolo De Angelis, Paolo Vigo, Matteo Fasano, Eliodoro Chiavazzo & Luca Bergamasco

R &D Department, GEFIT S.p.a., Alessandria, Italy

Marco Porrati


Corresponding author

Correspondence to Luca Bergamasco .

Ethics declarations

Conflict of interest statement

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (PDF 354 KB)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Casini, M., De Angelis, P., Porrati, M. et al. Machine Learning and image analysis towards improved energy management in Industry 4.0: a practical case study on quality control. Energy Efficiency 17 , 48 (2024). https://doi.org/10.1007/s12053-024-10228-7


Received : 22 July 2023

Accepted : 28 April 2024

Published : 13 May 2024

DOI : https://doi.org/10.1007/s12053-024-10228-7


  • Industry 4.0
  • Energy management
  • Artificial intelligence
  • Machine learning
  • Deep learning
  • Convolutional neural networks
  • Computer vision


Original Research Article

Remote sensing estimation of δ 15 N PN in the Zhanjiang Bay using Sentinel-3 OLCI data based on machine learning algorithm


  • 1 College of Electronic and Information Engineering, Guangdong Ocean University, Zhanjiang, China
  • 2 College of Chemistry and Environmental Science, Guangdong Ocean University, Zhanjiang, China
  • 3 College of Ocean and Meteorology, Guangdong Ocean University, Zhanjiang, China

The particulate nitrogen (PN) isotopic composition (δ 15 N PN ) plays an important role in quantifying the contribution rates of particulate organic matter sources and in indicating water environmental pollution. Estimation of δ 15 N PN from satellite images can provide spatiotemporally continuous data for studies of nitrogen cycling and ecological environment governance. Here, to fully understand the spatiotemporal dynamics of δ 15 N PN , we developed a machine learning algorithm for retrieving δ 15 N PN , a successful case of combining nitrogen isotopes and remote sensing technology. Based on field observation data from Zhanjiang Bay in May and September 2016, three machine learning retrieval models (Back Propagation Neural Network, Random Forest and Multiple Linear Regression) were constructed using optical indicators composed of in situ remote sensing reflectance as input variables and δ 15 N PN as the output variable. Through comparative analysis, the Back Propagation Neural Network (BPNN) model had the best retrieval performance. The BPNN model was applied to quasi-synchronous Ocean and Land Color Imager (OLCI) data onboard Sentinel-3. The determination coefficient (R 2 ), root mean square error (RMSE) and mean absolute percentage error (MAPE) of satellite-ground matching point data based on the BPNN model were 0.63, 1.63‰, and 20.10%, respectively. The satellite retrieval results show that the retrieved δ 15 N PN agreed well with the measured δ 15 N PN . In addition, independent datasets were used to validate the BPNN model, which showed good accuracy in δ 15 N PN retrieval, indicating that an effective machine-learning-based model for retrieving δ 15 N PN has been built.
However, to enhance the algorithm's performance, we need to strengthen information collection covering diverse coastal water bodies and optimize the input variables of the optical indicators. This study provides important technical support for large-scale and long-term understanding of the biogeochemical processes of particulate organic matter, as well as a new management strategy for water quality and environmental monitoring.

1 Introduction

Nitrogen is one of the main nutrients for marine organisms and, together with carbon, one of the two most basic elements in marine ecosystems ( Eppley and Peterson, 1979 ; Falkowski, 1997 ; Galloway et al., 2004 ). Nitrogen transforms between particulate and dissolved states, and the mutual transformation of its different forms constitutes a complex marine nitrogen cycle ( Pajares and Ramos, 2019 ). The marine nitrogen cycle is closely related to the carbon cycle: nitrogen limitation or excess can decrease or increase the absorption of CO 2 by phytoplankton, so the nitrogen cycle can indirectly affect climate change by regulating the carbon cycle ( Falkowski, 1997 ; Voss et al., 2013 ). Consequently, accurately grasping the spatiotemporal characteristics of ocean nitrogen is of great significance for deeply understanding the ocean nitrogen cycle and climate change.

Although particulate nitrogen (PN) accounts for only 0.5% of the total nitrogen pool in the ocean, it degrades easily and cycles quickly, and it is an important component of the marine nitrogen pool ( Capone et al., 2008 ). The main sources of coastal marine particulate nitrogen include marine phytoplankton production, riverine inputs and sewage effluent, and there are significant differences in the isotopic values of particulate nitrogen from different sources ( Montoya et al., 2002 ; Wu et al., 2007 ; Lu et al., 2021 ). The particulate nitrogen isotope (δ 15 N PN ) is a potential indicator of particulate organic matter sources, and the contribution ratios of different sources of particulate organic matter can be quantitatively calculated using δ 15 N PN and the particulate organic carbon isotope (δ 13 C) ( Chen et al., 2021 ; Huang et al., 2021 ; Lu et al., 2021 ). δ 15 N PN can also indicate pollution of the water environment, as it reflects the source of absorbed nutrients ( Sarma et al., 2020 ). One of the main sources of nitrogen-containing nutrients in coastal waters is sewage, for which δ 15 N PN is significantly enriched (>10‰) ( Sarma et al., 2020 ). In addition, the variation of particulate nitrogen isotope values is also affected by isotope fractionation during nitrogen conversion processes such as nitrification, denitrification, and biological assimilation ( Cifuentes et al., 1988 ; Granger et al., 2010 ). Therefore, knowledge of the distribution and variation of δ 15 N PN , and of the factors controlling them, is essential to elucidate the sources and biogeochemical processes of particulate organic matter ( Huang et al., 2020 ).

Traditionally, particulate nitrogen isotope values are obtained by collecting water samples in situ and determining them in the laboratory ( Chen et al., 2021 ; Lu et al., 2021 ). However, this method is time-consuming, labor-intensive and inefficient, and cannot provide particulate nitrogen isotope values at large spatial scales or over long periods, so it is worth exploring how to obtain them more conveniently and effectively. Ocean color remote sensing retrieves water parameters by establishing a response relationship between the remote sensing reflectance and the water parameters ( Wang et al., 2022 ), and has the advantages of large-scale and long-term continuous observation ( Shen et al., 2020 ). In the past several decades, remote sensing has been used to retrieve water environmental parameters such as chlorophyll a (Chl a), total suspended matter (TSM), colored dissolved organic matter (CDOM), total phosphorus (TP), total nitrogen (TN), and dissolved inorganic nitrogen (DIN) ( Xu et al., 2010 ; Ondrusek et al., 2012 ; Mathew et al., 2017 ; Du et al., 2018 ; Watanabe et al., 2018 ; Shen et al., 2022 ). Compared with Chl a, TSM and CDOM, the spectral response of the other parameters may not be significant, which makes them difficult to retrieve with traditional empirical fitting methods ( Zheng et al., 2024 ). Machine learning methods generally perform better than simple empirical fitting methods ( Liu et al., 2021 ), and inland, coastal, and oceanic water environments have recently been studied with them ( Cao et al., 2020 ; Liu et al., 2021 ; Shen et al., 2022 ; Tian et al., 2024 ; Maciel et al., 2021 ; Pahlevan et al., 2020 ). Machine learning algorithms can not only combine multiple input features that are sensitive to the target variable, but also have stronger fitting ability to capture the relationship between the input variables and the target variable ( Liu et al., 2021 ).
Sentinel-3, the third of the six Sentinel satellite missions of the Copernicus programme, is equipped with a highly sophisticated water color sensor, the Ocean and Land Color Imager (OLCI). It regularly monitors the ocean in near real-time, with its data made publicly accessible worldwide, and it is widely applied in the field of water color remote sensing ( Du et al., 2018 ; Pahlevan et al., 2020 ; Shen et al., 2020 , 2022 ).

Thus, this study aims to explore the potential of machine learning methods for satellite retrieval of δ 15 N PN . To achieve this aim, we first determined the optical indicators for retrieving particulate nitrogen isotope values, then constructed an optimal machine learning retrieval model of δ 15 N PN and applied it to Sentinel-3 OLCI data to obtain spatiotemporal information of δ 15 N PN in Zhanjiang Bay (a typical eutrophic bay in China). The results of this study could improve remote sensing monitoring of coastal δ 15 N PN and support a comprehensive understanding of the biogeochemical processes of the marine nitrogen cycle.

2 Materials and methods

2.1 Study area

Zhanjiang Bay is located in the northwest of the South China Sea and is a typical bay with a small mouth and large belly ( Figure 1 ). The connection between Zhanjiang Bay and the South China Sea is mainly through a narrow channel with a width of approximately 2 km ( Zhang et al., 2020b ). Zhanjiang Bay is located in the subtropical monsoon climate zone, with a rainy season from April to September, and less rainfall from November to February of the following year ( Chen et al., 2019 ). There are many industrial zones, agricultural zones, aquaculture zones, ports, and densely populated areas along the coast of Zhanjiang Bay. A large amount of industrial and agricultural wastewater and domestic sewage are discharged into the bay, bringing a large amount of nutrients and organic matter to Zhanjiang Bay, which has a certain impact on the ecological environment of Zhanjiang Bay ( Zhang et al., 2023 ). Previous studies have shown that the degree of eutrophication in the water of Zhanjiang Bay is gradually becoming severe ( Zhang et al., 2020b ).


Figure 1 Study area and field sampling stations. S1-S23 and A1-A29 mark the sampling stations in May and September 2016, respectively.

2.2 Field data collection and analysis

The sample collection was conducted in Zhanjiang Bay in May and September 2016, with 23 surface water samples collected in May and 29 in September. The sampling stations are shown in Figure 1 . The water samples were placed in polyethylene bottles (each acid-cleaned and rinsed with ultrapure water) and refrigerated at 4°C, and were taken back to the laboratory for further analysis on the same day. Furthermore, a spectroradiometer (USB2000+, Ocean Optics, Inc., USA) was used to measure the remote sensing reflectance spectra above the water surface between 200 and 1100 nm (1 nm interval), in accordance with the protocols proposed by Mobley ( Mobley, 1999 ). Remote sensing reflectance was determined with an above-water method at an azimuth angle of 135° from the sun and a viewing angle of 45° from the nadir ( Mobley, 1999 ). At every water sampling site, the radiances from the sky, the water, and the reference panel were measured. Remote sensing reflectance (R rs (λ)) was calculated using Equation 1 :
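The image for Equation 1 did not survive extraction. Reconstructed here from the symbol definitions below and the standard above-water protocol ( Mobley, 1999 ), it is presumably:

```latex
R_{rs}(\lambda) = \frac{L_u(\lambda) - r\,L_s(\lambda)}{\pi\,L_p(\lambda)/\rho_p(\lambda)} \qquad (1)
```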

where λ is the wavelength, L u (λ) is the upwelling spectral radiance, L s (λ) is the incident spectral sky radiance, and r is the proportionality coefficient, with a value of 0.025 ( Yu et al., 2023 ). L p (λ) is the radiance from the gray reference panel, and ρ p (λ) is the known reflectance of the gray panel ( Yu et al., 2023 ).

Glass fiber filter membranes (pre-combustion at 450 °C for 4 h, GF/F, Whatman) with a 47-mm diameter were used to filter the TSM, Chl a, and PN samples. The weight method was used to calculate TSM concentrations ( Zhou et al., 2021 ). Chl a in the GF/F filter was extracted using 90% acetone and analyzed using the fluorometric method ( Lao et al., 2021 ; Zhou et al., 2021 ). An element analyzer, coupled with a stable isotope ratio mass spectrometer (EA Isolink-253 Plus, Thermo Fisher Scientific, Inc. USA) was used to measure the concentration of PN and δ 15 N PN ( Chen et al., 2021 ). The mean standard deviation of δ 15 N PN and PN concentration was ±0.3‰ and ±0.3%, respectively ( Chen et al., 2021 ).

2.3 Satellite data acquisition and processing

The satellite data for this study were selected from the Ocean and Land Color Imager (OLCI) carried by Sentinel-3. The OLCI data contain a total of 21 spectral bands, ranging from 400 to 1020 nm, including 16 water-color bands, with a spatial resolution of 300 m and a global coverage time of 1-2 days ( Su et al., 2021 ), enabling global multispectral medium-resolution ocean/land observation. Sentinel-3 OLCI image data can be downloaded through the European Space Agency's official website (ESA, https://scihub.copernicus.eu/dhus/#/home ). We used the C2RCC (Case 2 Regional Coast Color) algorithm integrated into the Sentinel Application Platform (SNAP) software to perform atmospheric correction on the Sentinel-3 OLCI data. The atmospherically corrected images were further processed and analyzed in SNAP.

2.4 δ 15 N PN retrieval algorithms

2.4.1 Back Propagation Neural Network

Back Propagation Neural Network (BPNN) is a common multi-layer feedforward artificial neural network that uses the backpropagation algorithm to train the network weights ( Liu et al., 2017 ). Its main characteristics are strong nonlinear fitting ability and self-adaptive learning. BPNN is widely used to retrieve water parameters in oceans and lakes ( Wang et al., 2023a ; Chen et al., 2015 ; Ju et al., 2023 ). This study used a three-layer BPNN with one input layer, one hidden layer, and one output layer. The hidden layer transfer function was the tan-sigmoid function "tansig", the output layer function was the linear function "purelin", and the training function was "trainlm". The maximum number of training epochs was set to 1000, the learning rate to 0.3, and the training error goal to 0.001. Determining the number of hidden layer nodes is a key step in building a BPNN model; the basic principle is to select as few hidden layer nodes as possible while meeting accuracy requirements ( Sun et al., 2009 ). This study tested 1 to 10 hidden layer nodes to determine the optimal number. In addition, because of the small number of training samples, Bayesian regularization was introduced to prevent overfitting ( MacKay, 1992 ), implemented through the function "trainbr". The neural network model for each node count was trained separately, and the determination coefficient (R 2 ), mean absolute percentage error (MAPE) and root mean square error (RMSE) between the measured and predicted values of the training samples were calculated to select the optimal number of nodes ( Table 1 ). Moreover, we applied the trained models to the testing samples and obtained the R 2 , MAPE, and RMSE of the testing samples ( Table 2 ).
Tables 1 and 2 show that the regularized network has strong generalization ability and that the choice of hidden layer nodes has little impact on the training results, which eliminates the trial-and-error work otherwise required to determine the optimal network size. Based on Tables 1 and 2 , we selected 10 hidden layer nodes for our BPNN model. The training and testing of the BPNN model were conducted in MATLAB R2018a software.
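The BPNN setup above can be sketched in Python with scikit-learn's MLPRegressor standing in for the MATLAB toolbox; the two input features and δ 15 N PN values below are synthetic stand-ins, and the L2 penalty (alpha) only loosely approximates the Bayesian regularization ("trainbr") used in the original:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Synthetic stand-ins: 52 samples, 2 optical-indicator features, d15N-like target
X = rng.uniform(0.5, 1.5, size=(52, 2))
y = 8.0 - 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0.0, 0.3, 52)

# One hidden layer with 10 nodes and a tanh transfer function, echoing the
# "tansig" hidden layer and linear ("purelin") output of the MATLAB model;
# lbfgs suits small datasets, alpha adds L2 regularization against overfitting
bpnn = MLPRegressor(hidden_layer_sizes=(10,), activation="tanh",
                    solver="lbfgs", alpha=1e-2, max_iter=1000,
                    random_state=0)
bpnn.fit(X, y)
r2_train = bpnn.score(X, y)  # coefficient of determination on the training set
```

In practice one would loop over hidden_layer_sizes from (1,) to (10,) and compare R 2 , MAPE and RMSE, as done in Tables 1 and 2.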


Table 1 R 2 , MAPE and RMSE of the measured and predicted values of training samples with different hidden layer nodes.


Table 2 R 2 , MAPE and RMSE of the measured and predicted values of testing samples with different hidden layer nodes.

2.4.2 Random Forest

Random Forest (RF) is a powerful machine learning algorithm. As an ensemble learning technique, RF uses several decision trees, each trained on randomly chosen feature and sample subsets ( Belgiu and Drăguţ, 2016 ; Wang et al., 2023a ). The final prediction is obtained by averaging or voting the predictions of the individual trees ( Belgiu and Drăguţ, 2016 ; Wang et al., 2023a ). This study utilized the "TreeBagger" tool in MATLAB R2018a to construct the random forest model. Through experiments ( Figure 2 ), the optimal number of trees and the optimal number of leaf nodes were determined to be 200 and 5, respectively.
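An equivalent configuration can be sketched with scikit-learn's RandomForestRegressor in place of MATLAB's TreeBagger, again on synthetic stand-in data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
# Synthetic stand-ins for the optical indicators and d15N-like target
X = rng.uniform(0.5, 1.5, size=(52, 2))
y = 8.0 - 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0.0, 0.3, 52)

# 200 trees with at least 5 samples per leaf, mirroring the reported
# TreeBagger settings (number of trees = 200, leaf size = 5)
rf = RandomForestRegressor(n_estimators=200, min_samples_leaf=5,
                           random_state=1)
rf.fit(X, y)
pred = rf.predict(X)  # one averaged-over-trees prediction per sample
```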


Figure 2 Determination of the optimal number of leaf nodes and trees.

2.4.3 Multiple Linear Regression

Multiple Linear Regression (MLR) describes how the dependent variable changes with multiple independent variables. The algorithm is simple and fast, has low computational complexity, is suitable for local scales, and is widely used in remote sensing estimation of water parameters ( Qing et al., 2013 ; Olmanson et al., 2016 ; Yang et al., 2017 ). This study used δ 15 N PN as the dependent variable and optical indicators composed of remote sensing reflectance as the independent variables for multiple linear regression fitting. The fitting tool was the "regress" function in MATLAB R2018a software.
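The same fit can be sketched with ordinary least squares in Python, the analogue of MATLAB's "regress"; coefficients and data below are synthetic stand-ins:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
# Synthetic stand-ins: two optical-indicator features, d15N-like target with
# true coefficients (-2.0, -1.5) and intercept 8.0 plus small noise
X = rng.uniform(0.5, 1.5, size=(52, 2))
y = 8.0 - 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0.0, 0.3, 52)

# Ordinary least squares fit of d15N on the two indicators
mlr = LinearRegression().fit(X, y)
coefs, intercept = mlr.coef_, float(mlr.intercept_)
```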

2.5 Accuracy evaluation

The accuracy evaluation of this study mainly includes four indices. The Pearson correlation coefficient (r), determination coefficient (R 2 ), mean absolute percentage error (MAPE) and root mean square error (RMSE) are calculated following Equations 2 – 5 :
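The images for Equations 2–5 did not survive extraction. Reconstructed from the symbol definitions in the following sentence (X i and Y i the paired values, here taken as predicted and measured, Z and W their means), the standard forms of these indices are presumably:

```latex
r = \frac{\sum_{i=1}^{n}(X_i - Z)(Y_i - W)}
         {\sqrt{\sum_{i=1}^{n}(X_i - Z)^2}\,\sqrt{\sum_{i=1}^{n}(Y_i - W)^2}} \qquad (2)

R^2 = 1 - \frac{\sum_{i=1}^{n}(X_i - Y_i)^2}{\sum_{i=1}^{n}(Y_i - W)^2} \qquad (3)

\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{X_i - Y_i}{Y_i}\right| \qquad (4)

\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(X_i - Y_i)^2} \qquad (5)
```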

where n is the number of samples, X i and Y i refer to the values of the two variables, and Z and W denote the mean values of the two variables in the sample.
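These four indices are straightforward to compute; a minimal NumPy sketch (assuming the conventional definitions of the indices, since the equation images were lost):

```python
import numpy as np

def accuracy_indices(measured, predicted):
    """Return Pearson r, R^2, MAPE (%) and RMSE between measured and predicted."""
    m = np.asarray(measured, dtype=float)
    p = np.asarray(predicted, dtype=float)
    r = float(np.corrcoef(m, p)[0, 1])                       # Pearson correlation
    r2 = float(1.0 - np.sum((m - p) ** 2) / np.sum((m - m.mean()) ** 2))
    mape = float(100.0 * np.mean(np.abs((p - m) / m)))       # percent error
    rmse = float(np.sqrt(np.mean((p - m) ** 2)))
    return r, r2, mape, rmse
```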

3 Results

3.1 In situ distribution of TSM and Chl a concentration

As shown in Figure 3 , the TSM concentration ranged from 5.25 to 45.35 mg/L (average 13.70 mg/L) in May. It should be noted that operational errors prevented us from obtaining the Chl a concentration outside the bay in May; Chl a concentration inside the bay ranged from 0.68 to 19.33 μg/L (average 5.49 μg/L) in May. In September, the concentrations of TSM and Chl a ranged from 2.60 to 62.40 mg/L (average 11.63 mg/L) and from 1.62 to 21.88 μg/L (average 7.53 μg/L), respectively ( Figure 3 ).


Figure 3 In situ distribution of the TSM and Chl a concentration in (A) May and (B) September.

3.2 In situ distribution of δ 15 N PN and PN concentration

During the survey period, PN concentration ranged from 0.026 to 0.135 mg/L in May, with an average value of 0.048 mg/L. In September, PN concentration ranged from 0.022 to 0.09 mg/L, with an average value of 0.043 mg/L. Overall, the PN concentration in September was slightly lower than that in May. In addition, as shown in Figure 4 , the average concentration of PN outside the bay was higher than that inside the bay in both May and September. In May, the average concentrations of PN outside and inside the bay were 0.062 mg/L and 0.044 mg/L, respectively; in September, they were 0.053 mg/L and 0.034 mg/L, respectively. The δ 15 N value ranged from 5.89‰ to 10.14‰ (average 7.77‰) in May and from 3.73‰ to 12.08‰ (average 7.77‰) in September. Similar to the distribution of PN concentration, the average δ 15 N outside the bay was higher than that inside the bay in May and September. Figure 4 shows that the average values of δ 15 N outside and inside the bay in May were 8.71‰ and 7.43‰, respectively; in September, they were 10.0‰ and 5.9‰, respectively. The spatial distribution differences of δ 15 N inside and outside the bay were more pronounced in September ( Figure 4 ).


Figure 4 In situ distribution of the δ 15 N PN and PN concentration in (A) May and (B) September.

3.3 Development, validation, and application of δ 15 N PN retrieval model

First, we conducted a correlation analysis between in situ R rs (λ) and δ 15 N PN and found that single-band remote sensing reflectance was not effective for retrieval, with low correlations (P>0.05), which are not presented here. To obtain the best band combinations, we designed six optical indicators ( Table 3 ), following the R rs (λ) combination forms designed by Ling et al. ( Ling et al., 2020 ). Concretely, for each indicator form X, the 240 possible combinations of R rs (λ) over the sixteen OLCI bands (400 nm, 413 nm, 443 nm, 490 nm, 510 nm, 560 nm, 620 nm, 665 nm, 674 nm, 681 nm, 709 nm, 754 nm, 761 nm, 764 nm, 768 nm, and 779 nm) were evaluated in MATLAB R2018a software; the combination with the highest correlation coefficient (r) between X and δ 15 N PN determined the selection of λ 1 and λ 2 ( Table 3 ). From Table 3 , it can be seen that X2 and X5 perform well, with r values of -0.78 (P<0.01) and -0.77 (P<0.01), respectively. We selected these two optical indicators as input variables to construct the δ 15 N PN retrieval model.
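The exhaustive band-pair search can be sketched as follows; the reflectance spectra and δ 15 N PN values are random stand-ins, and only the band-ratio form (X2-like, R rs (λ 1 )/R rs (λ 2 )) is shown:

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(3)
bands = [400, 413, 443, 490, 510, 560, 620, 665, 674, 681,
         709, 754, 761, 764, 768, 779]
# Synthetic stand-ins: one reflectance vector per band (52 stations) and
# matching d15N values
rrs = {b: rng.uniform(0.001, 0.02, 52) for b in bands}
d15n = rng.uniform(3.7, 12.1, 52)

def best_band_ratio(rrs, y):
    """Scan every ordered band pair for the ratio Rrs(l1)/Rrs(l2) whose
    Pearson correlation with y has the largest magnitude."""
    best = (None, None, 0.0)
    for l1, l2 in permutations(rrs, 2):      # 16 * 15 = 240 ordered pairs
        x = rrs[l1] / rrs[l2]
        r = np.corrcoef(x, y)[0, 1]
        if abs(r) > abs(best[2]):
            best = (l1, l2, float(r))
    return best

l1, l2, r = best_band_ratio(rrs, d15n)
```

The other five indicator forms in Table 3 would each get an analogous scan, keeping the pair with the highest |r| per form.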


Table 3 Design of optical indicators and selection of optimal band combinations.

We randomly divided the dataset (input and output variables) into a training set (35 samples) and a testing set (17 samples), trained each model, and verified its performance. As shown in Figure 5 , by the evaluation indices R 2 , MAPE, and RMSE ( Figure 5A-F ), the BPNN, RF and MLR methods perform well on the training sets (BPNN: R 2 = 0.64, MAPE = 14.93%, RMSE = 1.32‰; RF: R 2 = 0.70, MAPE = 13.84%, RMSE = 1.20‰; MLR: R 2 = 0.65, MAPE = 15.14%, RMSE = 1.29‰) and testing sets (BPNN: R 2 = 0.84, MAPE = 10.71%, RMSE = 0.99‰; RF: R 2 = 0.65, MAPE = 13.66%, RMSE = 1.22‰; MLR: R 2 = 0.84, MAPE = 11.94%, RMSE = 1.03‰), which meets our retrieval requirements for δ 15 N PN . To further validate the BPNN, RF and MLR methods for δ 15 N PN retrieval, we obtained quasi-synchronous Sentinel-3 data (September 20, 2016) during the sampling period and extracted 20 satellite-ground matching points that were little affected by clouds, shadows, and sun glint. The established BPNN, RF and MLR models were applied to the Sentinel-3 data. As shown in Figure 6 , the BPNN and MLR models (BPNN: R 2 = 0.63, MAPE = 20.10%, RMSE = 1.63‰; MLR: R 2 = 0.63, MAPE = 20.71%, RMSE = 1.63‰) perform slightly better than the RF model (R 2 = 0.55, MAPE = 21.99%, RMSE = 1.67‰), with points more evenly distributed on both sides of the trend line. From Figure 6 , the overall accuracy of the BPNN model is comparable to that of the MLR model, but the MAPE of the BPNN model is slightly lower; therefore, we used the BPNN model for retrieving δ 15 N PN . It is worth noting that the satellite-retrieved δ 15 N PN shows a certain degree of over- or underestimation ( Figure 6 ). This may be because the sampling time and satellite transit time were not synchronized (differing by more than 24 hours), and the time window is an important factor affecting retrieval accuracy ( Fu et al., 2023 ). On the other hand, it may be due to errors caused by atmospheric correction ( Zhao et al., 2022 ). Moreover, Figure 7 compares the OLCI-derived values of the optical indicators (X2 and X5) with the measured values, demonstrating acceptable performance. The band combination reduces, to some extent, the errors introduced by atmospheric correction in the algorithm implementation ( Zhao et al., 2022 ).
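The split-and-score workflow above can be sketched in Python (synthetic stand-in data; MLR shown as the example model):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
# Synthetic stand-ins for the 52-sample optical-indicator dataset
X = rng.uniform(0.5, 1.5, size=(52, 2))
y = 8.0 - 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0.0, 0.3, 52)

# 52 samples randomly split into 35 for training and 17 for testing,
# matching the split reported in the text
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=17, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)
resid = y_te - model.predict(X_te)
rmse_test = float(np.sqrt(np.mean(resid ** 2)))  # held-out RMSE
```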


Figure 5 Scatter plots between the estimated δ 15 N PN of BPNN (A, B) , RF (C, D) , and MLR (E, F) models and the measured δ 15 N PN .


Figure 6 Scatter plots between the satellite-retrieved δ 15 N PN using BPNN, RF and MLR models and the measured δ 15 N PN in September 2016.


Figure 7 Comparison of measured and OLCI-derived values for match-up points at (A) R rs (674)/R rs (681) and (B) (R rs (674)-R rs (681))/(R rs (674)+R rs (681)).

As shown in Figure 8 , the spatial distribution map obtained by applying the BPNN model to Sentinel-3 data shows that δ 15 N PN outside Zhanjiang Bay is slightly higher than inside it. However, a few areas are affected by factories, docks, and aquaculture (circled in Figure 8 ) ( Lu et al., 2020 ; Zhang et al., 2020a ; Zhou et al., 2022 ); these areas are highly susceptible to the influence of sewage or wastewater, resulting in stronger dynamic changes in δ 15 N PN , so the δ 15 N PN retrieval results there may differ somewhat from the measured values. Overall, however, the retrieval results are relatively consistent with the measured results, which indicates that the BPNN model is reliable for retrieving δ 15 N PN and that using Sentinel-3 data to retrieve δ 15 N PN has great potential for application.


Figure 8 δ 15 N PN retrieved from Sentinel-3 OLCI image (20 September 2016) in Zhanjiang Bay and its adjacent waters.

4 Discussion

4.1 Influencing factors of δ 15 N PN and PN concentration

PN concentration and δ 15 N PN in the ocean are influenced by various processes such as water mass mixing, nutrient gain and loss, and phytoplankton production ( Sigman and Casciotti, 2001 ; Dagg et al., 2004 ; Ye et al., 2017 ). For bays strongly affected by human activities, PN concentration and δ 15 N PN will also be affected by terrestrial factors such as soil, land runoff, and discharge of wastewater ( Cloern et al., 2002 ; Bristow et al., 2013 ; Ye et al., 2017 ). Under the joint action of multiple factors, PN concentration and δ 15 N PN in Zhanjiang Bay exhibited different characteristics in different months and regions.

TSM is the main carrier of terrestrial particulate matter ( Chester and Jickells, 2012 ). As shown in Figure 9 , there was a significant positive correlation between PN and TSM concentrations in May (r=0.746, P<0.01), but no significant correlation in September. This indicates that terrestrial particulate matter had an important impact on PN concentration in May, while the terrestrial component of PN in September was relatively low. To a certain extent, the Chl a concentration reflects the status of phytoplankton production ( Luhtala et al., 2013 ). As illustrated in Figure 9 , there was a positive correlation between PN and Chl a concentrations in May and September (r=0.647, P<0.01 for May; r=0.476, P<0.01 for September), indicating that phytoplankton production had a certain impact on PN concentration. In general, δ 15 N PN can effectively indicate the source of PN ( Ye et al., 2017 ; Chen et al., 2021 ). The δ 15 N PN composition of marine organic matter ranges from 3‰ to 12‰ ( Chen et al., 2021 ; Huang et al., 2021 ), while wastewater and livestock usually have δ 15 N PN values of 10‰ to 22‰ ( Huang et al., 2021 ). The δ 15 N values in this study ranged from 3.73‰ to 12.08‰ (average 7.77‰). Therefore, the source of PN in Zhanjiang Bay may be mainly marine organic matter, mixed with terrestrial input such as wastewater. In addition, there was a significant correlation between δ 15 N PN and Chl a concentration in May and September (r=0.574, P<0.05 for May; r=0.806, P<0.01 for September; Figure 9 ). Both PN concentration and δ 15 N PN showed a good correlation with Chl a concentration, indicating that phytoplankton production contributed significantly to the source of PN.


Figure 9 Correlation of PN concentration, δ 15 N PN and related environmental parameters in the surface water of Zhanjiang Bay in May (A) and September (B) .

Significantly, stations with higher δ 15 N PN (>10 ‰) generally had higher Chl a concentration (>10 μg/L) ( Table 4 ). The reception of hypereutrophic municipal wastewater in the bay area can easily lead to the bloom of phytoplankton, resulting in higher concentrations of Chl a ( Gao et al., 2021 ). When the growth rate of phytoplankton accelerates, the isotopic fractionation that occurs during the rapid absorption of inorganic nitrogen by phytoplankton can lead to a heavier nitrogen isotope composition of particulate organic matter ( Mariotti et al., 1984 ). Due to the preferential absorption of NH 4 + during the growth process of phytoplankton, the strong nitrification in coastal water and the preferential utilization of 14 N in NH 4 + by phytoplankton can lead to the accumulation of residual NH 4 + in water by 15 N ( Cifuentes et al., 1988 ). When phytoplankton continue to absorb these enriched 15 N in NH 4 + , it will cause an increase in the δ 15 N PN value of the produced particulate organic matter ( Cifuentes et al., 1988 ; Ke et al., 2017 ). Moreover, due to the rapid economic development and increased human activities in Zhanjiang, the process of heterotrophic bacteria has been intensified ( Li et al., 2021 ). The strong biodegradation process prioritizes the degradation of organic matter containing lighter isotopes, leading to the enrichment of residual organic matter with heavy nitrogen isotopes ( Li et al., 2021 ). It is worth noting that the δ 15 N PN value at station S13 and A29 was relatively high (>10 ‰), but the Chl a concentration was not high (5.34 μg/L and 6.75 μg/L, respectively), this may be due to the impact of sewage or wastewater input, as S13 and A29 are located near the factory and aquaculture industry, respectively. Previous study showed that in the region of algal uptake, sewage-derived NH 4 + and sewage-derived NO 3 - could raise the δ 15 N PN value by 9.0-17.2‰ and 10-15‰, respectively ( Estep and Vigg, 1985 ; Leavitt et al., 2006 ). 
In addition, Zhou et al. (2021) indicated that, during non-typhoon periods in Zhanjiang Bay, the heavy PN isotopes can be attributed to phytoplankton utilization of mineralized NH 4 + from wastewater. Therefore, the relatively heavy δ 15 N PN component in Zhanjiang Bay and its adjacent waters can be attributed to phytoplankton production and to sewage or wastewater input.


Table 4 Stations with higher δ 15 N PN .

4.2 Evaluation of δ 15 N PN remote sensing retrieval model

In section 4.1, we showed that the PN in Zhanjiang Bay derives mainly from phytoplankton production, and that δ 15 N PN correlates well with the Chl a concentration. Therefore, δ 15 N PN can be linked to the water color parameter Chl a, which allows a suitable δ 15 N PN remote sensing retrieval model to be established. It is well known that the absorption of phytoplankton pigments produces an absorption peak in the spectral reflectance near 674 nm, while a fluorescence peak occurs near 681 nm; both are characteristic spectral bands of Chl a ( Su et al., 2021 ). Consequently, choosing these two bands to retrieve δ 15 N PN is scientifically sound and reliable. In the RF model, because measured values between 5 and 11 dominated the training dataset, the decision trees may be biased, concentrating the predictions within the range of 5 to 11. Moreover, a random forest averages the predictions of its individual trees, so its output is always bounded by the minimum and maximum of the training targets; it cannot extrapolate. This produced a set of nearly identical estimates between 10 and 11 in the testing set. When we need to infer values beyond the training range, the random forest performs poorly ( Wang et al., 2023b ). The remedy is to expand the scope of the dataset in the future to keep it balanced. The BPNN and MLR models performed well on both the training and testing sets, and also performed well in the application to Sentinel-3 data.
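The bounded-prediction behavior of the random forest described above can be demonstrated with a minimal sketch (hypothetical data; scikit-learn is assumed, which is not the implementation used in this study): a forest averages the leaf values of its trees, so its predictions never leave the range of the training targets, even for inputs far outside the training domain.

```python
# Sketch (hypothetical data): random forest predictions are bounded by the
# minimum and maximum of the training targets, so the model cannot extrapolate.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(0.8, 1.2, size=(200, 1))      # stand-in for a band-ratio feature
y_train = 5.0 + 6.0 * (X_train[:, 0] - 0.8) / 0.4   # targets span 5-11, as in our dataset

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

X_out = np.array([[0.5], [1.5]])                    # features far outside the training range
pred = rf.predict(X_out)
# pred stays inside [y_train.min(), y_train.max()] despite the out-of-range inputs
```

Expanding the training set so the targets cover the full expected δ 15 N PN range is the only way to widen these bounds.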
Although the multiple linear regression algorithm is simple and easy to implement, it should be noted that the regression coefficients of this model are suitable only for Zhanjiang Bay and its adjacent sea areas. For other sea areas, the parameters may need to be regionalized, and the applicability of the model requires further verification. The samples in this study were collected during the rainy season, when increased rainfall carries more nutrients into the sea via land runoff and thereby increases phytoplankton biomass ( Baek et al., 2009 ). Chl a and δ 15 N PN were well correlated in May and September, which provides a good foundation for establishing the δ 15 N PN remote sensing model. During the dry season, however, reduced rainfall changes the physical, chemical, and biological conditions of the water, as well as the activity of phytoplankton. Whether Chl a and δ 15 N PN remain well correlated, and whether the δ 15 N PN retrieval model we established is still applicable, depends on further sample collection and verification.
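To illustrate how simple such an MLR model is to implement, the following sketch fits a linear model of δ 15 N PN on the two band indicators by ordinary least squares. The data and resulting coefficients here are hypothetical; as noted above, real coefficients are specific to Zhanjiang Bay and would need regionalization elsewhere.

```python
# Sketch of a two-variable MLR fit (hypothetical data): delta15N_PN modeled as
# a linear function of the band ratio and the normalized band difference.
import numpy as np

# Columns: Rrs(674)/Rrs(681), (Rrs(674)-Rrs(681))/(Rrs(674)+Rrs(681))
X = np.array([[1.05, 0.024], [1.10, 0.048], [0.95, -0.026],
              [1.20, 0.091], [1.00, 0.000], [1.15, 0.070]])
y = np.array([7.1, 8.0, 5.9, 9.4, 6.6, 8.8])     # hypothetical delta15N_PN (per mil)

A = np.column_stack([np.ones(len(X)), X])        # prepend an intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)     # [intercept, b_ratio, b_ndiff]
pred = A @ coef                                  # fitted values on the training data
```

The transparency of the fitted coefficients is the main appeal of MLR over the BPNN, at the cost of assuming a purely linear relationship.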

In order to better evaluate the δ 15 N PN remote sensing retrieval model established in this study, we selected six widely used Chl a retrieval algorithms ( Table 5 ), including three empirical algorithms (Three-Band Algorithm: TBA; Fluorescence Line Height: FLH; Maximum Chlorophyll Index: MCI) ( Gower et al., 1999 ; Dall'Olmo et al., 2005 ; Gower et al., 2005 ) and three semi-analytical algorithms (Gons; Simis; improved Quasi-Analytical Algorithm: QAA750E) ( Gons et al., 2002 ; Simis et al., 2005 ; Xue et al., 2019 ), and attempted to retrieve δ 15 N PN with them. As shown in Table 5 , we substituted the in situ remote sensing reflectance into the expressions of the six algorithms and then compared the results with the measured δ 15 N PN . For the semi-analytical algorithms, the comparison was made between the phytoplankton absorption coefficient (a ph (λ)) they produce and δ 15 N PN . Table 5 shows that the BPNN algorithm proposed in this study has the highest accuracy (R 2 = 0.66, MAPE = 13.52%, RMSE = 1.19‰), followed by the TBA algorithm (R 2 = 0.49, MAPE = 14.32%, RMSE = 1.46‰). The Gons and Simis semi-analytical algorithms also performed well (Gons: R 2 = 0.46, MAPE = 16.31%, RMSE = 1.49‰; Simis: R 2 = 0.42, MAPE = 16.81%, RMSE = 1.56‰), indicating that semi-analytical algorithms have some application potential for retrieving δ 15 N PN . The MCI, FLH, and QAA750E algorithms performed poorly. Algorithms composed of more bands may therefore introduce more sources of uncertainty, which directly degrades the retrieval accuracy.
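The three accuracy statistics used throughout this comparison can be sketched as follows; the measured/retrieved values below are hypothetical and serve only to show the definitions of R 2 , MAPE, and RMSE.

```python
# Definitions of the accuracy metrics reported in Table 5 (data are hypothetical).
import numpy as np

def r2(obs, est):
    """Coefficient of determination."""
    return 1.0 - np.sum((obs - est) ** 2) / np.sum((obs - obs.mean()) ** 2)

def mape(obs, est):
    """Mean absolute percentage error, in %."""
    return 100.0 * np.mean(np.abs((est - obs) / obs))

def rmse(obs, est):
    """Root mean square error, in the units of obs (here per mil)."""
    return float(np.sqrt(np.mean((est - obs) ** 2)))

measured  = np.array([5.2, 6.8, 8.1, 9.5, 10.7])   # hypothetical delta15N_PN (per mil)
retrieved = np.array([5.6, 6.5, 8.4, 9.0, 10.2])
scores = (r2(measured, retrieved), mape(measured, retrieved), rmse(measured, retrieved))
```

Note that MAPE is scale-dependent on the observed values, which is why it is reported alongside the absolute error measure RMSE.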


Table 5 Expressions of 6 Chl a retrieval algorithms and the comparison analysis between algorithm results and δ 15 N PN .

Additionally, we obtained 18 measured δ 15 N PN values in September 2017 and applied the BPNN model established in this study to Sentinel-3 OLCI image data (18 September 2017). The retrieval results are shown in Figure 10 . In terms of retrieval accuracy ( Figure 11 ), the R 2 , RMSE and MAPE were 0.59, 1.78‰, and 34.06%, respectively. The retrieval results meet the requirements to a certain extent, indicating that our δ 15 N PN retrieval model has some applicability.


Figure 10 δ 15 N PN retrieved from Sentinel-3 OLCI image (18 September 2017) in Zhanjiang Bay and its adjacent waters.


Figure 11 Scatter plots between the satellite-retrieved δ 15 N PN using BPNN model and the measured δ 15 N PN in September 2017.

4.3 The advantages and prospects of developing δ 15 N PN remote sensing model

Isotope fractionation gives PN from different sources characteristic nitrogen stable isotope values, which makes it possible to determine the sources and fate of PN and makes δ 15 N PN a valuable tracer for tracking PN sources and understanding N cycling in water systems ( Chen et al., 2021 ; Huang et al., 2021 ; Lu et al., 2021 ). Although traditional field surveys and chemical methods can measure δ 15 N PN accurately, they consume considerable time, manpower, and resources. How, then, can we quickly and extensively track the dynamic changes of δ 15 N PN ? Satellite remote sensing has developed rapidly in recent decades, and various high-performance sensors have been developed for marine environmental monitoring, which greatly facilitates research and exploration of the ocean. Satellite remote sensing has irreplaceable advantages in large-scale spatial and long-term monitoring: by sacrificing only a small amount of accuracy, we can obtain acceptable results, which is of great significance for studying the biogeochemical processes and nitrogen cycling of PN in the ocean. As the analysis above shows, δ 15 N PN values can also distinguish water environmental pollution and thus indirectly indicate the pollution level of water bodies, providing a new method and strategy for traditional water quality monitoring and management.

However, the δ 15 N PN remote sensing model established in this study involves data from only two cruises, so its limitations cannot be denied. Meanwhile, the quality of satellite images and the atmospheric correction also introduce uncertainty into the δ 15 N PN estimation. In the future, we will increase the sampling frequency and use more data from different seasons and regions to validate our established model, enhancing its robustness and universality.

5 Conclusions

Based on the measured δ 15 N PN values and remote sensing reflectance in Zhanjiang Bay in May and September 2016, this study constructed three machine learning models (BPNN, RF, MLR) for δ 15 N PN retrieval. After screening and analysis, the model input variables consisted of two optical indicators, namely R rs (674)/R rs (681) and (R rs (674) − R rs (681))/(R rs (674) + R rs (681)). Accuracy evaluation on the training and test sets and analysis of the Sentinel-3 retrieval results showed that the BPNN model performed better than the other two models. In addition, the PN in Zhanjiang Bay derived mainly from phytoplankton production, which is closely related to chlorophyll a, providing a reliable basis for remote sensing retrieval of δ 15 N PN . This basis was also confirmed by the fact that the two bands sensitive to δ 15 N PN (674 nm and 681 nm) are also the characteristic spectral bands of chlorophyll a. However, due to the limited datasets and insufficient model optimization, the performance of the δ 15 N PN retrieval model that we established still needs to be improved. In the future, we will continue to expand the datasets, optimize the model input variables, and construct a more robust δ 15 N PN retrieval model for continuous, long-term monitoring.
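The two optical indicators used as model inputs can be computed directly from the reflectance at the two characteristic bands; a minimal sketch, with hypothetical R rs values:

```python
# The two model input indicators, computed from hypothetical Rrs values
# (in sr^-1) at the 674 nm and 681 nm bands.
def model_inputs(rrs_674, rrs_681):
    ratio = rrs_674 / rrs_681                          # Rrs(674)/Rrs(681)
    ndiff = (rrs_674 - rrs_681) / (rrs_674 + rrs_681)  # normalized difference
    return ratio, ndiff

ratio, ndiff = model_inputs(0.012, 0.010)  # hypothetical reflectance values
```

Both indicators are dimensionless band combinations, which makes them relatively insensitive to multiplicative errors in the absolute reflectance calibration.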

Data availability statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Author contributions

GY: Conceptualization, Investigation, Methodology, Software, Writing – original draft, Writing – review & editing. YZ: Conceptualization, Data curation, Formal analysis, Validation, Writing – review & editing. DF: Funding acquisition, Project administration, Supervision, Writing – review & editing. FC: Resources, Visualization, Writing – review & editing. CC: Methodology, Writing – review & editing.

Funding

The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. This study was supported by the National Key Research and Development Program of China (No. 2022YFC3103101); Key Special Project for Introduced Talents Team of Southern Marine Science and Engineering Guangdong Laboratory (No. GML2021GD0809); National Natural Science Foundation of China (No. 42206187); Key projects of the Guangdong Education Department (No. 2023ZDZX4009).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Baek S. H., Shimode S., Kim H. C., Han M. S., Kikuchi T. (2009). Strong bottom-up effects on phytoplankton community caused by a rainfall during spring and summer in Sagami Bay, Japan. J. Mar. Syst. 75, 253–264. doi: 10.1016/j.jmarsys.2008.10.005


Belgiu M., Drăguţ L. (2016). Random forest in remote sensing: A review of applications and future directions. ISPRS J. Photogramm. Remote Sens. 114, 24–31. doi: 10.1016/j.isprsjprs.2016.01.011

Bristow L. A., Jickells T. D., Weston K., Marca-Bell A., Parker R., Andrews J. E. (2013). Tracing estuarine organic matter sources into the southern North Sea using C and N isotopic signatures. Biogeochemistry 113, 9–22. doi: 10.1007/s10533-012-9758-4

Cao Z., Ma R., Duan H., Pahlevan N., Melack J., Shen M., et al. (2020). A machine learning approach to estimate chlorophyll-a from Landsat-8 measurements in inland lakes. Remote Sens. Environ. 248, 111974. doi: 10.1016/j.rse.2020.111974

Capone D. G., Bronk D. A., Mulholland M. R., Carpenter E. J. (2008). Nitrogen in the marine environment (Amsterdam: Elsevier), 1–50.


Chen F., Lao Q., Jia G., Chen C., Zhu Q., Zhou X. (2019). Seasonal variations of nitrate dual isotopes in wet deposition in a tropical city in China. Atmos. Environ. 196, 1–9. doi: 10.1016/j.atmosenv.2018.09.061

Chen F., Lu X., Song Z., Huang C., Jin G., Chen C., et al. (2021). Coastal currents regulate the distribution of the particulate organic matter in western Guangdong offshore waters as evidenced by carbon and nitrogen isotopes. Mar. pollut. Bull. 172, 112856. doi: 10.1016/j.marpolbul.2021.112856


Chen J., Quan W., Cui T., Song Q. (2015). Estimation of total suspended matter concentration from MODIS data using a neural network model in the China eastern coastal zone. Estuar. Coast. Shelf Sci. 155, 104–113. doi: 10.1016/j.ecss.2015.01.018

Chester R., Jickells T. D. (2012). Marine geochemistry (London: John Wiley & Sons), 1–50.

Cifuentes L. A., Sharp J. H., Fogel M. L. (1988). Stable carbon and nitrogen isotope biogeochemistry in the Delaware estuary. Limnol. Oceanogr. 33, 1102–1115. doi: 10.4319/lo.1988.33.5.1102

Cloern J. E., Canuel E. A., Harris D. (2002). Stable carbon and nitrogen isotope composition of aquatic and terrestrial plants of the San Francisco Bay estuarine system. Limnol. Oceanogr. 47, 713–729. doi: 10.4319/lo.2002.47.3.0713

Dagg M., Benner R., Lohrenz S., Lawrence D. (2004). Transformation of dissolved and particulate materials on continental shelves influenced by large rivers: plume processes. Cont. Shelf Res. 24, 833–858. doi: 10.1016/j.csr.2004.02.003

Dall'Olmo G., Gitelson A. A., Rundquist D. C., Leavitt B., Barrow T., Holz J. C. (2005). Assessing the potential of SeaWiFS and MODIS for estimating chlorophyll concentration in turbid productive waters using red and near-infrared bands. Remote Sens. Environ. 96, 176–187. doi: 10.1016/j.rse.2005.02.007

Du C., Wang Q., Li Y., Lyu H., Zhu L., Zheng Z., et al. (2018). Estimation of total phosphorus concentration using a water classification method in inland water. Int. J. Appl. Earth Obs. Geoinf. 71, 29–42. doi: 10.1016/j.jag.2018.05.007

Eppley R. W., Peterson B. J. (1979). Particulate organic matter flux and planktonic new production in the deep ocean. Nature 282, 677–680. doi: 10.1038/282677a0

Estep M. L., Vigg S. (1985). Stable carbon and nitrogen isotope tracers of trophic dynamics in natural populations and fisheries of the Lahontan Lake system, Nevada. Can. J. Fish. Aquat.Sci. 42, 1712–1719. doi: 10.1139/f85-215

Falkowski P. G. (1997). Evolution of the nitrogen cycle and its influence on the biological sequestration of CO 2 in the ocean. Nature 387, 272–275. doi: 10.1038/387272a0

Fu L., Zhou Y., Liu G., Song K., Tao H., Zhao F., et al. (2023). Retrieval of chla concentrations in lake Xingkai using OLCI images. Remote Sens. 15, 3809. doi: 10.3390/rs15153809

Galloway J. N., Dentener F. J., Capone D. G., Boyer E. W., Howarth R. W., Seitzinger S. P., et al. (2004). Nitrogen cycles: past, present, and future. Biogeochemistry 70, 153–226. doi: 10.1007/s10533-004-0370-0

Gao C., Yu F., Chen J., Huang Z., Jiang Y., Zhuang Z., et al. (2021). Anthropogenic impact on the organic carbon sources, transport and distribution in a subtropical semi-enclosed bay. Sci. Total Environ. 767, 145047. doi: 10.1016/j.scitotenv.2021.145047

Gons H. J., Rijkeboer M., Ruddick K. G. (2002). A chlorophyll-retrieval algorithm for satellite imagery (Medium Resolution Imaging Spectrometer) of inland and coastal waters. J. Plankton Res. 24, 947–951. doi: 10.1093/plankt/24.9.947

Gower J. F. R., Doerffer R., Borstad G. A. (1999). Interpretation of the 685nm peak in water-leaving radiance spectra in terms of fluorescence, absorption and scattering, and its observation by MERIS. Int. J. Remote Sens. 20, 1771–1786. doi: 10.1080/014311699212470

Gower J., King S., Borstad G., Brown L. (2005). Detection of intense plankton blooms using the 709 nm band of the MERIS imaging spectrometer. Int. J. Remote Sens. 26, 2005–2012. doi: 10.1080/01431160500075857

Granger J., Sigman D. M., Rohde M. M., Maldonado M. T., Tortell P. D. (2010). N and O isotope effects during nitrate assimilation by unicellular prokaryotic and eukaryotic plankton cultures. Geochim. Cosmochim. Acta 74, 1030–1040. doi: 10.1016/j.gca.2009.10.044

Huang C., Chen F., Zhang S., Chen C., Meng Y., Zhu Q., et al. (2020). Carbon and nitrogen isotopic composition of particulate organic matter in the Pearl River Estuary and the adjacent shelf. Estuarine Coast. Shelf Sci. 246, 107003. doi: 10.1016/j.ecss.2020.107003

Huang C., Lao Q., Chen F., Zhang S., Chen C., Bian P., et al. (2021). Distribution and sources of particulate organic matter in the northern south China Sea: implications of human activity. J. Ocean Univ. China. 20, 1136–1146. doi: 10.1007/s11802-021-4807-z

Ju A., Wang H., Wang L., Weng Y. (2023). Application of machine learning algorithms for prediction of ultraviolet absorption spectra of chromophoric dissolved organic matter (CDOM) in seawater. Front. Mar. Sci. 10, 1065123. doi: 10.3389/fmars.2023.1065123

Ke Z., Tan Y., Huang L., Zhao C., Jiang X. (2017). Spatial distributions of δ 13 C, δ 15 N and C/N ratios in suspended particulate organic matter of a bay under serious anthropogenic influences: Daya Bay, China. Mar. pollut. Bull. 114, 183–191. doi: 10.1016/j.marpolbul.2016.08.078

Lao Q., Liu G., Shen Y., Su Q., Lei X. (2021). Biogeochemical processes and eutrophication status of nutrients in the northern Beibu Gulf, South China. J. Earth Syst. Sci. 130, 199. doi: 10.1007/s12040-021-01706-y

Leavitt P. R., Brock C. S., Ebel C., Patoine A. (2006). Landscape-scale effects of urban nitrogen on a chain of freshwater lakes in central North America. Limnol. Oceanogr. 51, 2262–2277. doi: 10.4319/lo.2006.51.5.2262

Li J., Chen F., Zhang S., Huang C., Chen C., Zhou F., et al. (2021). Origin of the particulate organic matter in a monsoon-controlled bay in southern China. J. Mar. Sci. Eng. 9, 541. doi: 10.3390/jmse9050541

Ling Z., Sun D., Wang S., Qiu Z., Huan Y., Mao Z., et al. (2020). Remote sensing estimation of colored dissolved organic matter (CDOM) from GOCI measurements in the Bohai Sea and Yellow Sea. Environ. Sci. pollut. Res. 27, 6872–6885. doi: 10.1007/s11356-019-07435-6

Liu H., Li Q., Bai Y., Yang C., Wang J., Zhou Q., et al. (2021). Improving satellite retrieval of oceanic particulate organic carbon concentrations using machine learning methods. Remote Sens. Environ. 256, 112316. doi: 10.1016/j.rse.2021.112316

Liu Y., Xu L., Li M. (2017). The parallelization of back propagation neural network in mapreduce and spark. Int. J. Parallel Progr. 45, 760–779. doi: 10.1007/s10766-016-0401-1

Lu X., Huang C., Chen F., Zhang S., Lao Q., Chen C., et al. (2021). Carbon and nitrogen isotopic compositions of particulate organic matter in the upwelling zone off the east coast of Hainan Island, China. Mar. pollut. Bull. 167, 112349. doi: 10.1016/j.marpolbul.2021.112349

Lu X., Zhou F., Chen F., Lao Q., Zhu Q., Meng Y., et al. (2020). Spatial and seasonal variations of sedimentary organic matter in a subtropical bay: Implication for human interventions. Int. J. Environ. Res. Public Health 17, 1362. doi: 10.3390/ijerph17041362

Luhtala H., Tolvanen H., Kalliola R. (2013). Annual spatio-temporal variation of the euphotic depth in the SW-Finnish archipelago, Baltic Sea. Oceanologia 55, 359–373. doi: 10.5697/oc.55-2.359

Maciel D. A., Barbosa C. C. F., de Moraes Novo E. M. L., Júnior R. F., Begliomini F. N. (2021). Water clarity in Brazilian water assessed using Sentinel-2 and machine learning methods. ISPRS J. Photogramm. Remote Sens. 182, 134–152. doi: 10.1016/j.isprsjprs.2021.10.009

MacKay D. J. (1992). Bayesian interpolation. Neural comput. 4, 415–447. doi: 10.1162/neco.1992.4.3.415

Mariotti A., Lancelot C., Billen G. (1984). Natural isotopic composition of nitrogen as a tracer of origin for suspended organic matter in the Scheldt estuary. Geochim. Cosmochim. Acta 48, 549–555. doi: 10.1016/0016-7037(84)90283-7

Mathew M. M., Srinivasa Rao N., Mandla V. R. (2017). Development of regression equation to study the Total Nitrogen, Total Phosphorus and Suspended Sediment using remote sensing data in Gujarat and Maharashtra coast of India. J. Coast. Conserv. 21, 917–927. doi: 10.1007/s11852-017-0561-1

Mobley C. D. (1999). Estimation of the remote-sensing reflectance from above-surface measurements. Appl. opt. 38, 7442–7455. doi: 10.1364/AO.38.007442

Montoya J. P., Carpenter E. J., Capone D. G. (2002). Nitrogen fixation and nitrogen isotope abundances in zooplankton of the oligotrophic North Atlantic. Limnol. Oceanogr. 47, 1617–1628. doi: 10.4319/lo.2002.47.6.1617

Olmanson L. G., Brezonik P. L., Finlay J. C., Bauer M. E. (2016). Comparison of Landsat 8 and Landsat 7 for regional measurements of CDOM and water clarity in lakes. Remote Sens. Environ. 185, 119–128. doi: 10.1016/j.rse.2016.01.007

Ondrusek M., Stengel E., Kinkade C. S., Vogel R. L., Keegstra P., Hunter C., et al. (2012). The development of a new optical total suspended matter algorithm for the Chesapeake Bay. Remote Sens. Environ. 119, 243–254. doi: 10.1016/j.rse.2011.12.018

Pahlevan N., Smith B., Schalles J., Binding C., Cao Z., Ma R., et al. (2020). Seamless retrievals of chlorophyll-a from Sentinel-2 (MSI) and Sentinel-3 (OLCI) in inland and coastal waters: A machine-learning approach. Remote Sens. Environ. 240, 111604. doi: 10.1016/j.rse.2019.111604

Pajares S., Ramos R. (2019). Processes and microorganisms involved in the marine nitrogen cycle: knowledge and gaps. Front. Mar. Sci. 6, 739. doi: 10.3389/fmars.2019.00739

Qing S., Zhang J., Cui T., Bao Y. (2013). Retrieval of sea surface salinity with MERIS and MODIS data in the Bohai Sea. Remote Sens. Environ. 136, 117–125. doi: 10.1016/j.rse.2013.04.016

Sarma V. V. S. S., Krishna M. S., Srinivas T. N. R. (2020). Sources of organic matter and tracing of nutrient pollution in the coastal Bay of Bengal. Mar. pollut. Bull. 159, 111477. doi: 10.1016/j.marpolbul.2020.111477

Shen M., Duan H., Cao Z., Xue K., Qi T., Ma J., et al. (2020). Sentinel-3 OLCI observations of water clarity in large lakes in eastern China: Implications for SDG 6.3. 2 evaluation. Remote Sens. Environ. 247, 111950. doi: 10.1016/j.rse.2020.111950

Shen M., Luo J., Cao Z., Xue K., Qi T., Ma J., et al. (2022). Random forest: An optimal chlorophyll-a algorithm for optically complex inland water suffering atmospheric correction uncertainties. J. Hydrol. 615, 128685. doi: 10.1016/j.jhydrol.2022.128685

Sigman D. M., Casciotti K. L. (2001). Nitrogen isotopes in the ocean (Amsterdam: Elsevier), 1884–1894. doi: 10.1006/rwos.2001.0172

Simis S. G., Peters S. W., Gons H. J. (2005). Remote sensing of the cyanobacterial pigment phycocyanin in turbid inland water. Limnol. Oceanogr. 50, 237–245. doi: 10.4319/lo.2005.50.1.0237

Su H., Lu X., Chen Z., Zhang H., Lu W., Wu W. (2021). Estimating coastal chlorophyll-a concentration from time-series OLCI data based on machine learning. Remote Sens. 13, 576. doi: 10.3390/rs13040576

Sun D., Li Y., Wang Q., Le C. (2009). Remote sensing retrieval of CDOM concentration in Lake Taihu with hyper-spectral data and neural network model. Geomatics Inf. Sci. Wuhan University. 34, 851–855.

Tian D., Zhao X., Gao L., Liang Z., Yang Z., Zhang P., et al. (2024). Estimation of water quality variables based on machine learning model and cluster analysis-based empirical model using multi-source remote sensing data in inland reservoirs, South China. Environ. pollut. 342, 123104. doi: 10.1016/j.envpol.2023.123104

Voss M., Bange H. W., Dippner J. W., Middelburg J. J., Montoya J. P., Ward B. (2013). The marine nitrogen cycle: recent discoveries, uncertainties and the potential relevance of climate change. Philos. Trans. R. Soc. B. 368, 20130121. doi: 10.1098/rstb.2013.0121

Wang Y., Liu H., Wu G. (2022). Satellite retrieval of oceanic particulate organic nitrogen concentration. Front. Mar. Sci. 9, 943867. doi: 10.3389/fmars.2022.943867

Wang J., Tang J., Wang W., Wang Y., Wang Z. (2023a). Quantitative retrieval of chlorophyll-a concentrations in the Bohai-yellow sea using GOCI surface reflectance products. Remote Sens. 15 (22), 5285. doi: 10.3390/rs15225285

Wang J., Wu X., Ma D., Wen J., Xiao Q. (2023b). Remote sensing retrieval based on machine learning algorithm: Uncertainty analysis. Natl. Remote Sens. Bullet. 27 (3), 790–801. doi: 10.11834/jrs.20221172

Watanabe F., Alcântara E., Curtarelli M., Kampel M., Stech J. (2018). Landsat-based remote sensing of the colored dissolved organic matter absorption coefficient in a tropical oligotrophic reservoir. Remote Sens. Appl.: Soc Environ. 9, 82–90. doi: 10.1016/j.rsase.2017.12.004

Wu Y., Dittmar T., Ludwichowski K. U., Kattner G., Zhang J., Zhu Z. Y., et al. (2007). Tracing suspended organic nitrogen from the Yangtze River catchment into the East China Sea. Mar. Chem. 107, 367–377. doi: 10.1016/j.marchem.2007.01.022

Xu Y., Zhang Y., Zhang D. (2010). Retrieval of dissolved inorganic nitrogen from multi-temporal MODIS data in Haizhou Bay. Mar. Geod. 33, 1–15. doi: 10.1080/01490410903530257

Xue K., Ma R., Duan H., Shen M., Boss E., Cao Z. (2019). Inversion of inherent optical properties in optically complex waters using sentinel-3A/OLCI images: A case study using China's three largest freshwater lakes. Remote Sens. Environ. 225, 328–346. doi: 10.1016/j.rse.2019.03.006

Yang Z., Reiter M., Munyei N. (2017). Estimation of chlorophyll-a concentrations in diverse water bodies using ratio-based NIR/Red indices. Remote Sens. Appl.: Soc Environ. 6, 52–58. doi: 10.1016/j.rsase.2017.04.004

Ye F., Guo W., Shi Z., Jia G., Wei G. (2017). Seasonal dynamics of particulate organic matter and its response to flooding in the Pearl River Estuary, China, revealed by stable isotope (δ 13 C and δ 15 N) analyses. J. Geophys. Res.: Oceans. 122, 6835–6856. doi: 10.1002/2017JC012931

Yu G., Zhong Y., Liu S., Lao Q., Chen C., Fu D., et al. (2023). Remote sensing estimates of particulate organic carbon sources in the Zhanjiang bay using sentinel-2 data and carbon isotopes. Remote Sens. 15, 3768. doi: 10.3390/rs15153768

Zhang J., Fu M., Zhang P., Sun D., Peng D. (2023). Unravelling nutrients and carbon interactions in an urban coastal water during algal bloom period in Zhanjiang bay, China. Water 15, 900. doi: 10.3390/w15050900

Zhang P., Peng C. H., Zhang J. B., Zou Z. B., Shi Y. Z., Zhao L. R., et al. (2020a). Spatiotemporal urea distribution, sources, and indication of DON bioavailability in Zhanjiang Bay, China. Water 12, 633. doi: 10.3390/w12030633

Zhang P., Xu J. L., Zhang J. B., Li J. X., Zhang Y. C., Li Y., et al. (2020b). Spatiotemporal dissolved silicate variation, sources, and behavior in the eutrophic Zhanjiang Bay, China. Water 12, 3586. doi: 10.3390/w12123586

Zhao Z., Cai X., Huang C., Shi K., Li J., Jin J., et al. (2022). A novel semianalytical remote sensing retrieval strategy and algorithm for particulate organic carbon in inland waters based on biogeochemical-optical mechanisms. Remote Sens. Environ. 280, 113213. doi: 10.1016/j.rse.2022.113213

Zheng H., Wu Y., Han H., Wang J., Liu S., Xu M., et al. (2024). Utilizing residual networks for remote sensing estimation of total nitrogen concentration in Shandong offshore areas. Front. Mar. Sci. 11, 1336259. doi: 10.3389/fmars.2024.1336259

Zhou X., Jin G., Li J., Song Z., Zhang S., Chen C., et al. (2021). Effects of typhoon mujigae on the biogeochemistry and ecology of a semi-enclosed bay in the northern South China sea. J. Geophys. Res.: Biogeosci. 126, e2020JG006031. doi: 10.1029/2020JG006031

Zhou F., Xiong M., Wang S., Tian S., Jin G., Chen F., et al. (2022). Impacts of human activities and environmental changes on spatial-seasonal variations of metals in surface sediments of Zhanjiang bay, China. Front. Mar. Sci. 9, 925567. doi: 10.3389/fmars.2022.925567

Keywords: particulate nitrogen, δ 15 N PN , remote sensing, machine learning algorithm, Sentinel-3, OLCI, Zhanjiang Bay

Citation: Yu G, Zhong Y, Fu D, Chen F and Chen C (2024) Remote sensing estimation of δ 15 N PN in the Zhanjiang Bay using Sentinel-3 OLCI data based on machine learning algorithm. Front. Mar. Sci. 11:1366987. doi: 10.3389/fmars.2024.1366987

Received: 08 January 2024; Accepted: 24 April 2024; Published: 14 May 2024.


Copyright © 2024 Yu, Zhong, Fu, Chen and Chen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Yafeng Zhong, [email protected] ; Dongyang Fu, [email protected]


  13. Applied ML Week 4 Assignment solution|Evaluation

    Key ConceptsUnderstand how specific supervised learning algorithms - in particular, those based on decision trees and neural networks - estimate their own pa...

  14. Applied Machine Learning in Python

    This course will introduce the learner to applied machine learning, focusing more on the techniques and methods than on the statistics behind these methods. The course will start with a discussion of how machine learning is different than descriptive statistics, and introduce the scikit learn toolkit through a tutorial. The issue of dimensionality of data will be discussed, and the task of ...

  15. Apply Machine Learning Module 4 Assignment

    If the issue persists, it's likely a problem on our side. Unexpected token < in JSON at position 0. keyboard_arrow_up. content_copy. SyntaxError: Unexpected token < in JSON at position 0. Refresh. Explore and run machine learning code with Kaggle Notebooks | Using data from multiple data sources.

  16. 5 Free University Courses to Learn Machine Learning

    2. Data Science: Machine Learning - Harvard. Data Science: Machine Learning is another course where you'll get to learn machine learning fundamentals by working on practical applications such as movie recommendation systems. The course goes over the following topics: Link: Data Science: Machine Learning. 3.

  17. Applied Machine Learning in Python

    This course will introduce the learner to applied machine learning, focusing more on the techniques and methods than on the statistics behind these methods. The course will start with a discussion of how machine learning is different than descriptive statistics, and introduce the scikit learn toolkit through a tutorial. ... Assignment 4 • 180 ...

  18. Smart Delivery Assignment through Machine Learning and the ...

    Intelligent transportation and advanced mobility techniques focus on helping operators to efficiently manage navigation tasks in smart cities, enhancing cost efficiency, increasing security, and reducing costs. Although this field has seen significant advances in developing large-scale monitoring of smart cities, several challenges persist concerning the practical assignment of delivery ...

  19. Applied Sciences

    Airborne pollutants pose a significant threat in the occupational workplace resulting in adverse health effects. Within the Industry 4.0 environment, new systems and technologies have been investigated for risk management and as health and safety smart tools. The use of predictive algorithms via artificial intelligence (AI) and machine learning (ML) tools, real-time data exchange via the ...

  20. DataSci 207: Applied Machine Learning (Spring 2023)

    Office hours: Tu, 8-9 am PT. This course provides a practical introduction to the rapidly growing field of machine learning— training predictive models to generalize to new data. We start with linear and logistic regression and implement gradient descent for these algorithms, the core engine for training. With these key building blocks, we ...

  21. Examining the synergistic effects through machine learning prediction

    The anaerobic co-digestion (ACoD) of palm oil mill effluent (POME) and decanter cake (DC) is gaining considerable attention as it helps to overcome insufficient feedstock issues in anaerobic digesters during low crop seasons. However, the ACoD process involves complex non-linear relationships between input parameters and process outcomes, posing challenges in accurately evaluating the ...

  22. Machine Learning and image analysis towards improved energy ...

    With the advent of Industry 4.0, Artificial Intelligence (AI) has created a favorable environment for the digitalization of manufacturing and processing, helping industries to automate and optimize operations. In this work, we focus on a practical case study of a brake caliper quality control operation, which is usually accomplished by human inspection and requires a dedicated handling system ...

  23. Applied Sciences

    The recognition and localization of strawberries are crucial for automated harvesting and yield prediction. This article proposes a novel RTF-YOLO (RepVgg-Triplet-FocalLoss-YOLO) network model for real-time strawberry detection. First, an efficient convolution module based on structural reparameterization is proposed. This module was integrated into the backbone and neck networks to improve ...

  24. Frontiers

    2.4.2 Random Forest. Random Forest (RF) is a powerful machine learning algorithm. As an ensemble learning technique, RF uses several decision trees, each of which is trained using randomly chosen feature and sample subsets (Belgiu and Drăguţ, 2016; Wang et al., 2023a).By averaging or voting the predictions from each individual tree, the final prediction is obtained (Belgiu and Drăguţ, 2016 ...