Analytics Insight

10 Best Research and Thesis Topic Ideas for Data Science in 2022

These research and thesis topics for data science can help both students and scholars build knowledge and skills.

  • Handling practical video analytics in a distributed cloud: With increased dependency on the internet, sharing videos has become a mode of data and information exchange. The Internet of Things (IoT), telecom infrastructure, and operators play a major role in generating insights from video analytics. Several questions remain open here, such as how efficient existing analytics systems are and what will change once real-time analytics are integrated.
  • Smart healthcare systems using big data analytics: Big data analytics plays a significant role in making healthcare more efficient, accessible, and cost-effective. Big data analytics enhances the operational efficiency of smart healthcare providers by providing real-time analytics. It enhances the capabilities of the intelligent systems by using short-span data-driven insights, but there are still distinct challenges that are yet to be addressed in this field.
  • Identifying fake news using real-time analytics: The circulation of fake news has become a pressing issue in the modern era. Data gathered from social media networks may seem legitimate, but sometimes it is not. The sources that provide the data are unauthenticated most of the time, which makes this a crucial issue to address.
  • Secure federated learning with real-world applications: Federated learning is a technique that trains an algorithm across multiple decentralized edge devices and servers. It can be adopted to build models locally, but whether it can be deployed at scale across multiple platforms with high-level security is still unclear.
  • Big data analytics and its impact on marketing strategy: The advent of data science and big data analytics has redefined the marketing industry. It has helped enterprises by offering valuable insights into their existing and future customers. But several issues, such as surplus data, integrating complex data into customers’ journeys, and complete data privacy, remain largely unexplored and need immediate attention.
  • Impact of big data on business decision-making: Present studies signify that big data has transformed the way managers and business leaders make critical decisions concerning the growth and development of the business. It allows them to access objective data and analyse the market environments, enabling companies to adapt rapidly and make decisions faster. Working on this topic will help students understand the present market and business conditions and help them analyse new solutions.
  • Implementing big data to understand consumer behaviour: Big data is used to analyse the data points depicting a consumer’s journey after buying a product, which gives a clearer picture of specific scenarios. This topic will help students understand the problems businesses face in using these insights and develop new strategies to generate more ROI.
  • Applications of big data to predict future demand and forecasting: Predictive analytics in data science has emerged as an integral part of decision-making and demand forecasting. Working on this topic will enable students to determine the significance of high-quality historical data analysis and the factors that drive higher consumer demand.
  • The importance of data exploration over data analysis: Exploration enables a deeper understanding of the dataset, making it easier to navigate and use the data later. Analysts must understand the differences between data exploration and analysis and use each according to specific needs to fulfill organizational requirements.
  • Data science and software engineering: Software engineering and development are a major part of data science. Skilled data professionals should learn and explore the various technical and software skills needed to perform critical AI and big data tasks.
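
For the federated learning topic in the list above, a rough sketch may help make "training across decentralized devices" concrete. It assumes a simple weighted-averaging scheme (often called federated averaging) on a toy linear model; the function names, learning rate, and client data are hypothetical, and it includes none of the security mechanisms the topic asks about.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y) pair held privately by one client


def local_update(weights: List[float], data: List[Sample], lr: float = 0.1) -> List[float]:
    """Run one pass of gradient descent for y = w0 + w1 * x on a client's own data."""
    w0, w1 = weights
    for x, y in data:
        err = (w0 + w1 * x) - y
        w0 -= lr * err
        w1 -= lr * err * x
    return [w0, w1]


def federated_average(client_weights: List[List[float]], client_sizes: List[int]) -> List[float]:
    """Combine client models, weighting each by its number of samples."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[d] * n for w, n in zip(client_weights, client_sizes)) / total
        for d in range(dim)
    ]


def train(clients: List[List[Sample]], rounds: int = 200) -> List[float]:
    """Only model weights leave a client; the raw samples never do."""
    global_w = [0.0, 0.0]
    for _ in range(rounds):
        updates = [local_update(global_w, data) for data in clients]
        global_w = federated_average(updates, [len(d) for d in clients])
    return global_w


# Two hypothetical clients whose data follow y = 2x + 1.
clients = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (0.5, 2.0)]]
model = train(clients)
```

The server only ever sees the weight vectors returned by `local_update`, which is the property that makes the approach attractive for privacy-sensitive data. Whether that is enough at scale is the open question the topic raises.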


Thesis/Capstone for Master's in Data Science | Northwestern SPS - Northwestern School of Professional Studies

Data Science

Capstone and Thesis Overview

Capstone and thesis are similar in that they both represent a culminating, scholarly effort of high quality. Both should clearly state a problem or issue to be addressed. Both will allow students to complete a larger project and produce a product or publication that can be highlighted on their resumes. Students should consider the factors below when deciding whether a capstone or thesis may be more appropriate to pursue.

A capstone is a practical or real-world project that can emphasize preparation for professional practice. A capstone is more appropriate if:

  • you don't necessarily need or want the experience of the research process or writing a big publication
  • you want more input on your project, from fellow students and instructors
  • you want more structure to your project, including assignment deadlines and due dates
  • you want to complete the project or graduate in a timely manner

A student can enroll in MSDS 498 Capstone in any term. However, capstone specialization courses can provide a unique student experience and may be offered only twice a year. 

A thesis is an academic-focused research project with broader applicability. A thesis is more appropriate if:

  • you want to get a PhD or other advanced degree and want the experience of the research process and writing for publication
  • you want to work individually with a specific faculty member who serves as your thesis adviser
  • you are more self-directed, are good at managing your own projects with very little supervision, and have a clear direction for your work
  • you have a project that requires more time to pursue

Students can enroll in MSDS 590 Thesis as long as there is an approved thesis project proposal, identified thesis adviser, and all other required documentation at least two weeks before the start of any term.

From Faculty Director, Thomas W. Miller, PhD

“Capstone projects and thesis research give students a chance to study topics of special interest to them. Students can highlight analytical skills developed in the program. Work on capstone and thesis research projects often leads to publications that students can highlight on their resumes.”

A thesis is an individual research project that usually takes two to four terms to complete. Capstone course sections, on the other hand, represent a one-term commitment.

Students need to evaluate their options prior to choosing a capstone course section because capstones vary widely from one instructor to the next. There are both general and specialization-focused capstone sections. Some capstone sections offer individual research projects, others offer team research projects, and a few give students a choice of individual or team projects.

Students should refer to the SPS Graduate Student Handbook for more information regarding registration for either MSDS 590 Thesis or MSDS 498 Capstone.

Capstone Experience

If students wish to engage with an outside organization to work on a project for capstone, they can refer to this checklist and lessons learned for some helpful tips.

Capstone Checklist

  • Start early — set aside a minimum of one to two months prior to the capstone quarter to determine the industry and modeling interests.
  • Networking — pitch your idea to potential organizations for projects and focus on the business benefits you can provide.
  • Permission request — make sure your final project can be shared with others in the course and the information can be made public.
  • Engagement — engage with the capstone professor prior to and immediately after getting the dataset to ensure appropriate scope for the 10 weeks.
  • Teambuilding — recruit team members who have similar interests for the type of project during the first week of the course.

Capstone Lessons Learned

  • Access to company data can take longer than expected; not having access before or at the start of the term can severely delay progress.
  • The project timeline should align with the coursework timeline as closely as possible.
  • Designate one point of contact (POC) for business-facing communication to ensure streamlined messages and more effective time management with the organization.
  • Manage expectations on both sides: for the business, the work is pro bono; for students, it does not guarantee internship or job opportunities.
  • Data security or masking that is not completed in time can put the entire opportunity at risk.

Publication of Work

Northwestern University Libraries offers an option for students to publish their master’s thesis or capstone in Arch, Northwestern’s open access research and data repository.

Benefits for publishing your thesis:

  • Your work will be indexed by search engines and discoverable by researchers around the world, extending your work’s impact beyond Northwestern
  • Your work will be assigned a Digital Object Identifier (DOI) to ensure perpetual online access and to facilitate scholarly citation
  • Your work will help accelerate discovery and increase knowledge in your subject domain by adding to the global corpus of public scholarly information

Get started:

  • Visit Arch online
  • Log in with your NetID
  • Describe your thesis: title, author, date, keywords, rights, license, subject, etc.
  • Upload your thesis or capstone PDF and any related supplemental files (data, code, images, presentations, documentation, etc.)
  • Select a visibility: Public, Northwestern-only, Embargo (i.e. delayed release)
  • Save your work to the repository

Your thesis manuscript or capstone report will then be published on the MSDS page. You can view other published work here.

For questions or support in publishing your thesis or capstone, please contact [email protected].

Chapman University Digital Commons

Computational and Data Sciences (PhD) Dissertations

Below is a selection of dissertations from the Doctor of Philosophy in Computational and Data Sciences program in Schmid College that have been included in Chapman University Digital Commons. Additional dissertations from years prior to 2019 are available through the Leatherby Libraries' print collection or in Proquest's Dissertations and Theses database.

Dissertations from 2023

Computational Analysis of Antibody Binding Mechanisms to the Omicron RBD of SARS-CoV-2 Spike Protein: Identification of Epitopes and Hotspots for Developing Effective Therapeutic Strategies, Mohammed Alshahrani

Integration of Computer Algebra Systems and Machine Learning in the Authoring of the SANYMS Intelligent Tutoring System, Sam Ford

Voluntary Action and Conscious Intention, Jake Gavenas

Random Variable Spaces: Mathematical Properties and an Extension to Programming Computable Functions, Mohammed Kurd-Misto

Computational Modeling of Superconductivity from the Set of Time-Dependent Ginzburg-Landau Equations for Advancements in Theory and Applications, Iris Mowgood

Application of Machine Learning Algorithms for Elucidation of Biological Networks from Time Series Gene Expression Data, Krupa Nagori

Stochastic Processes and Multi-Resolution Analysis: A Trigonometric Moment Problem Approach and an Analysis of the Expenditure Trends for Diabetic Patients, Isaac Nwi-Mozu

Applications of Causal Inference Methods for the Estimation of Effects of Bone Marrow Transplant and Prescription Drugs on Survival of Aplastic Anemia Patients, Yesha M. Patel

Causal Inference and Machine Learning Methods in Parkinson's Disease Data Analysis, Albert Pierce

Causal Inference Methods for Estimation of Survival and General Health Status Measures of Alzheimer’s Disease Patients, Ehsan Yaghmaei

Dissertations from 2022

Computational Approaches to Facilitate Automated Interchange between Music and Art, Rao Hamza Ali

Causal Inference in Psychology and Neuroscience: From Association to Causation, Dehua Liang

Advances in NLP Algorithms on Unstructured Medical Notes Data and Approaches to Handling Class Imbalance Issues, Hanna Lu

Novel Techniques for Quantifying Secondhand Smoke Diffusion into Children's Bedroom, Sunil Ramchandani

Probing the Boundaries of Human Agency, Sook Mun Wong

Dissertations from 2021

Predicting Eye Movement and Fixation Patterns on Scenic Images Using Machine Learning for Children with Autism Spectrum Disorder, Raymond Anden

Forecasting the Prices of Cryptocurrencies using a Novel Parameter Optimization of VARIMA Models, Alexander Barrett

Applications of Machine Learning to Facilitate Software Engineering and Scientific Computing, Natalie Best

Exploring Behaviors of Software Developers and Their Code Through Computational and Statistical Methods, Elia Eiroa Lledo

Assessing the Re-Identification Risk in ECG Datasets and an Application of Privacy Preserving Techniques in ECG Analysis, Arin Ghazarian

Multi-Modal Data Fusion, Image Segmentation, and Object Identification using Unsupervised Machine Learning: Conception, Validation, Applications, and a Basis for Multi-Modal Object Detection and Tracking, Nicholas LaHaye

Machine-Learning-Based Approach to Decoding Physiological and Neural Signals, Elnaz Lashgari

Learning-Based Modeling of Weather and Climate Events Related To El Niño Phenomenon via Differentiable Programming and Empirical Decompositions, Justin Le

Quantum State Estimation and Tracking for Superconducting Processors Using Machine Learning, Shiva Lotfallahzadeh Barzili

Novel Applications of Statistical and Machine Learning Methods to Analyze Trial-Level Data from Cognitive Measures, Chelsea Parlett

Optimal Analytical Methods for High Accuracy Cardiac Disease Classification and Treatment Based on ECG Data, Jianwei Zheng

Dissertations from 2020

Development of Integrated Machine Learning and Data Science Approaches for the Prediction of Cancer Mutation and Autonomous Drug Discovery of Anti-Cancer Therapeutic Agents, Steven Agajanian

Allocation of Public Resources: Bringing Order to Chaos, Lance Clifner

A Novel Correction for the Adjusted Box-Pierce Test — New Risk Factors for Emergency Department Return Visits within 72 hours for Children with Respiratory Conditions — General Pediatric Model for Understanding and Predicting Prolonged Length of Stay, Sidy Danioko

A Computational and Experimental Examination of the FCC Incentive Auction, Logan Gantner

Exploring the Employment Landscape for Individuals with Autism Spectrum Disorders using Supervised and Unsupervised Machine Learning, Kayleigh Hyde

Integrated Machine Learning and Bioinformatics Approaches for Prediction of Cancer-Driving Gene Mutations, Oluyemi Odeyemi

On Quantum Effects of Vector Potentials and Generalizations of Functional Analysis, Ismael L. Paiva

Long Term Ground Based Precipitation Data Analysis: Spatial and Temporal Variability, Luciano Rodriguez

Gaining Computational Insight into Psychological Data: Applications of Machine Learning with Eating Disorders and Autism Spectrum Disorder, Natalia Rosenfield

Connecting the Dots for People with Autism: A Data-driven Approach to Designing and Evaluating a Global Filter, Viseth Sean

Novel Statistical and Machine Learning Methods for the Forecasting and Analysis of Major League Baseball Player Performance, Christopher Watkins

Dissertations from 2019

Contributions to Variable Selection in Complexly Sampled Case-control Models, Epidemiology of 72-hour Emergency Department Readmission, and Out-of-site Migration Rate Estimation Using Pseudo-tagged Longitudinal Data, Kyle Anderson

Bias Reduction in Machine Learning Classifiers for Spatiotemporal Analysis of Coral Reefs using Remote Sensing Images, Justin J. Gapper

Estimating Auction Equilibria using Individual Evolutionary Learning, Kevin James

Employing Earth Observations and Artificial Intelligence to Address Key Global Environmental Challenges in Service of the SDGs, Wenzhao Li

Image Restoration using Automatic Damaged Regions Detection and Machine Learning-Based Inpainting Technique, Chloe Martin-King

Theses from 2017

Optimized Forecasting of Dominant U.S. Stock Market Equities Using Univariate and Multivariate Time Series Analysis Methods, Michael Schwartz


Thesis Option

Data Science master’s students can choose to satisfy the research experience requirement by selecting the thesis option. Students will spend the majority of their second year working on a substantial data science project that culminates in the submission and oral defense of a master’s thesis. While all thesis projects must be related to data science, students are given leeway in finding a project in a domain of study that fits with their background and interest.

All students choosing the thesis option must find a research advisor and submit a thesis proposal by mid-April of their first year of study. Thesis proposals will be evaluated by the Data Science faculty committee and only those students whose proposals are accepted will be allowed to continue with the thesis option.  

To account for the time spent on thesis research, students choosing the thesis option are able to substitute three required courses (the Capstone and two "free" elective courses, as defined in the final bullet point on the degree requirement page) with AC 302.

17 Compelling Machine Learning Ph.D. Dissertations

Posted by Daniel Gutierrez, ODSC, on August 12, 2021 in Machine Learning, Modeling, Research

Working in the field of data science, I’m always seeking ways to keep current in the field, and there are a number of important resources available for this purpose: new book titles, blog articles, conference sessions, Meetups, webinars/podcasts, not to mention the gems floating around in social media. But to dig even deeper, I routinely look at what’s coming out of the world’s research labs. And one great way to keep a pulse on what the research community is working on is to monitor the flow of new machine learning Ph.D. dissertations. Admittedly, many such theses are laser-focused and narrow, but from previous experience reading these documents, you can learn an awful lot about new ways to solve difficult problems over a vast range of problem domains.

In this article, I present a number of hand-picked machine learning dissertations that I found compelling in terms of my own areas of interest and aligned with problems that I’m working on. I hope you’ll find a number of them that match your own interests. Each dissertation may be challenging to consume but the process will result in hours of satisfying summer reading. Enjoy!

Please check out my previous data science dissertation round-up article.

1. Fitting Convex Sets to Data: Algorithms and Applications

This machine learning dissertation concerns the geometric problem of finding a convex set that best fits a given data set. The overarching question serves as an abstraction for data-analytical tasks arising in a range of scientific and engineering applications with a focus on two specific instances: (i) a key challenge that arises in solving inverse problems is ill-posedness due to a lack of measurements. A prominent family of methods for addressing such issues is based on augmenting optimization-based approaches with a convex penalty function so as to induce a desired structure in the solution. These functions are typically chosen using prior knowledge about the data. The thesis also studies the problem of learning convex penalty functions directly from data for settings in which we lack the domain expertise to choose a penalty function. The solution relies on suitably transforming the problem of learning a penalty function into a fitting task; and (ii) the problem of fitting tractably-described convex sets given the optimal value of linear functionals evaluated in different directions.
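
To make the idea of a convex penalty that induces structure more concrete, here is a minimal sketch of the simplest textbook case. It assumes a fixed L1 penalty chosen in advance, not a penalty learned from data as in the dissertation, and the function names and sample values are illustrative only. For a single observation y, minimizing 0.5 * (x - y)^2 + lam * |x| has a closed-form solution known as soft-thresholding, which pushes small values to exactly zero and so yields sparse solutions.

```python
from typing import List


def soft_threshold(y: float, lam: float) -> float:
    """Minimizer of 0.5 * (x - y)**2 + lam * abs(x) for a penalty weight lam >= 0."""
    if y > lam:
        return y - lam
    if y < -lam:
        return y + lam
    return 0.0


def denoise(observations: List[float], lam: float) -> List[float]:
    """Apply the L1-penalized estimate coordinate by coordinate."""
    return [soft_threshold(y, lam) for y in observations]


# Hypothetical noisy measurements: small entries are shrunk to exactly zero.
estimate = denoise([3.0, -0.2, 0.5, -2.0], lam=1.0)
```

The choice of the L1 norm here encodes prior knowledge that the true signal is sparse. The dissertation addresses the harder setting where no such prior knowledge is available.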

2. Structured Tensors and the Geometry of Data

This machine learning dissertation analyzes data to build a quantitative understanding of the world. Linear algebra is the foundation of algorithms, dating back one hundred years, for extracting structure from data. Modern technologies provide an abundance of multi-dimensional data, in which multiple variables or factors can be compared simultaneously. To organize and analyze such data sets we can use a tensor, the higher-order analogue of a matrix. However, many theoretical and practical challenges arise in extending linear algebra to the setting of tensors. The first part of the thesis studies and develops the algebraic theory of tensors. The second part of the thesis presents three algorithms for tensor data. The algorithms use algebraic and geometric structure to give guarantees of optimality.
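
As a small illustration of a tensor as the higher-order analogue of a matrix, the sketch below stores a three-way array as nested lists and flattens it into an ordinary matrix. The column ordering (last index varying fastest) and the function names are assumptions made for this example; other unfolding conventions exist, and nothing here comes from the dissertation's algorithms.

```python
from typing import List, Tuple

Tensor3 = List[List[List[float]]]  # indexed as tensor[i][j][k]


def shape(tensor: Tensor3) -> Tuple[int, int, int]:
    """Return the sizes (I, J, K) of the three modes."""
    return len(tensor), len(tensor[0]), len(tensor[0][0])


def unfold_mode0(tensor: Tensor3) -> List[List[float]]:
    """Flatten an I x J x K tensor into an I x (J*K) matrix, with k varying fastest."""
    return [[value for row in slab for value in row] for slab in tensor]


# A hypothetical 2 x 2 x 2 data set, e.g. subjects x variables x time points.
data = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
matrix = unfold_mode0(data)
```

Unfolding lets matrix tools be applied to tensor data, but it discards the multi-way structure, which is one reason dedicated tensor methods are studied.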

3. Statistical approaches for spatial prediction and anomaly detection

This machine learning dissertation is primarily a description of three projects. It starts with a method for spatial prediction and parameter estimation for irregularly spaced, non-Gaussian data. It is shown that by judiciously replacing the likelihood with an empirical likelihood in the Bayesian hierarchical model, approximate posterior distributions for the mean and covariance parameters can be obtained. Due to the complex nature of the hierarchical model, standard Markov chain Monte Carlo methods cannot be applied to sample from the posterior distributions. To overcome this issue, a generalized sequential Monte Carlo algorithm is used. Finally, this method is applied to iron concentrations in California. The second project focuses on anomaly detection for functional data, specifically functional data where the observed functions may lie over different domains. Each function is approximated as a low-rank sum of spline basis functions, and the coefficients are compared for each basis across functions. The idea is that if two functions are similar, their respective coefficients should not be significantly different. This project concludes with an application of the proposed method to detect anomalous behavior of users of a supercomputer at NREL. The final project is an extension of the second project to two-dimensional data. This project aims to detect location and temporal anomalies from ground motion data from a fiber-optic cable using distributed acoustic sensing (DAS).

4. Sampling for Streaming Data

Advances in data acquisition technology pose challenges in analyzing large volumes of streaming data. Sampling is a natural yet powerful tool for analyzing such data sets because it offers competitive estimation accuracy at low computational cost. Unfortunately, sampling methods and their statistical properties for streaming data, especially streaming time series data, are not well studied in the literature. Meanwhile, estimating the dependence structure of multidimensional streaming time-series data in real time is challenging. With large volumes of streaming data, the problem becomes more difficult when the multidimensional data are collected asynchronously across distributed nodes, which motivates sampling representative data points from streams. This machine learning dissertation proposes a series of leverage score-based sampling methods for streaming time series data. Simulation studies and real data analyses are conducted to validate the proposed methods. A theoretical analysis of the asymptotic behavior of the least-squares estimator based on the subsamples is also developed.
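
The sketch below illustrates the general idea of leverage score-based subsampling in its simplest batch form. It is not the streaming method proposed in the dissertation: it assumes simple linear regression with an intercept, for which the leverage of point i is 1/n + (x_i - mean)^2 / Sxx, and all function names and data are hypothetical.

```python
import random
from typing import List, Tuple


def leverage_scores(xs: List[float]) -> List[float]:
    """Leverage (hat-matrix diagonal) for the regression y = a + b * x."""
    n = len(xs)
    mean = sum(xs) / n
    sxx = sum((x - mean) ** 2 for x in xs)
    return [1.0 / n + (x - mean) ** 2 / sxx for x in xs]


def leverage_sample(xs: List[float], ys: List[float], m: int, seed: int = 0):
    """Draw m points with probability proportional to leverage, with inverse-probability weights."""
    rng = random.Random(seed)
    scores = leverage_scores(xs)
    total = sum(scores)
    probs = [s / total for s in scores]
    picked = rng.choices(range(len(xs)), weights=probs, k=m)
    return [(xs[i], ys[i], 1.0 / (m * probs[i])) for i in picked]


def weighted_least_squares(sample: List[Tuple[float, float, float]]) -> Tuple[float, float]:
    """Fit intercept and slope on the weighted subsample."""
    sw = sum(w for _, _, w in sample)
    mx = sum(w * x for x, _, w in sample) / sw
    my = sum(w * y for _, y, w in sample) / sw
    sxy = sum(w * (x - mx) * (y - my) for x, y, w in sample)
    sxx = sum(w * (x - mx) ** 2 for x, _, w in sample)
    slope = sxy / sxx
    return my - slope * mx, slope


# Hypothetical noiseless data on the line y = 3x - 1.
xs = [float(i) for i in range(100)]
ys = [3.0 * x - 1.0 for x in xs]
intercept, slope = weighted_least_squares(leverage_sample(xs, ys, m=20))
```

Points far from the mean of x have high leverage and are sampled more often, so a small subsample can still pin down the fitted line well. The inverse-probability weights correct for the uneven sampling.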

5. Statistical Machine Learning Methods for Complex, Heterogeneous Data

This machine learning dissertation develops statistical machine learning methodology for three distinct tasks. Each method blends classical statistical approaches with machine learning methods to provide principled solutions to problems with complex, heterogeneous data sets. The first framework proposes two methods for high-dimensional shape-constrained regression and classification. These methods reshape pre-trained prediction rules to satisfy shape constraints like monotonicity and convexity. The second method provides a nonparametric approach to the econometric analysis of discrete choice. This method provides a scalable algorithm for estimating utility functions with random forests, and combines this with random effects to properly model preference heterogeneity. The final method draws inspiration from early work in statistical machine translation to construct embeddings for variable-length objects like mathematical equations.

6. Topics in Multivariate Statistics with Dependent Data

This machine learning dissertation comprises four chapters. The first is an introduction to the topics of the dissertation and the remaining chapters contain the main results. Chapter 2 gives new results for consistency of maximum likelihood estimators with a focus on multivariate mixed models. The presented theory builds on the idea of using subsets of the full data to establish consistency of estimators based on the full data. The theory is applied to two multivariate mixed models for which it was unknown whether maximum likelihood estimators are consistent. In Chapter 3 an algorithm is proposed for maximum likelihood estimation of a covariance matrix when the corresponding correlation matrix can be written as the Kronecker product of two lower-dimensional correlation matrices. The proposed method is fully likelihood-based. Some desirable properties of separable correlation in comparison to separable covariance are also discussed. Chapter 4 is concerned with Bayesian vector auto-regressions (VARs). A collapsed Gibbs sampler is proposed for Bayesian VARs with predictors and the convergence properties of the algorithm are studied. 
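
To show what a separable (Kronecker-structured) correlation matrix looks like, the sketch below builds one from two small correlation matrices. The matrices and their interpretation (two variables measured at two time points) are made up for illustration, and the estimation algorithm from the dissertation is not reproduced.

```python
from typing import List

Matrix = List[List[float]]


def kron(a: Matrix, b: Matrix) -> Matrix:
    """Kronecker product: block (i, j) of the result equals a[i][j] * b."""
    return [
        [a_ij * b_kl for a_ij in row_a for b_kl in row_b]
        for row_a in a
        for row_b in b
    ]


# Hypothetical 2 x 2 correlation matrices, e.g. between variables and between time points.
corr_vars = [[1.0, 0.5], [0.5, 1.0]]
corr_time = [[1.0, 0.2], [0.2, 1.0]]

# The 4 x 4 product is again a correlation matrix: unit diagonal and symmetric.
corr_full = kron(corr_vars, corr_time)
```

The structure reduces the number of free parameters: the full 4 x 4 correlation matrix has six free off-diagonal entries in general, while the separable form here is determined by only two.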

7. Model Selection and Estimation for High-dimensional Data Analysis

In the era of big data, uncovering useful information and hidden patterns in data is a prevalent task in different fields. However, it is challenging to effectively select input variables and estimate their effects. The goal of this machine learning dissertation is to develop reproducible statistical approaches that provide mechanistic explanations of the phenomena observed in big data analysis. The research contains two parts: variable selection and model estimation. The first part investigates how to measure and interpret the usefulness of an input variable using an approach called “variable importance learning” and builds tools (methodology and software) that can be widely applied. Two variable importance measures are proposed, a parametric measure SOIL and a non-parametric measure CVIL, using the ideas of model combining and cross-validation respectively. The SOIL method is theoretically shown to have the inclusion/exclusion property: when the model weights are properly concentrated around the true model, the SOIL importance can well separate the variables in the true model from the rest. The CVIL method possesses desirable theoretical properties and enhances the interpretability of many mysterious but effective machine learning methods. The second part focuses on how to estimate the effect of a useful input variable in the case where the interaction of two input variables exists. The dissertation investigates the minimax rate of convergence for regression estimation in high-dimensional sparse linear models with two-way interactions and constructs an adaptive estimator that achieves the minimax rate of convergence regardless of the true heredity condition and the sparsity indices.

8. High-Dimensional Structured Regression Using Convex Optimization

While the term “Big Data” can have multiple meanings, this dissertation considers the type of data in which the number of features can be much greater than the number of observations (also known as high-dimensional data). High-dimensional data is abundant in contemporary scientific research due to the rapid advances in new data-measurement technologies and computing power. Recent advances in statistics have witnessed great development in the field of high-dimensional data analysis. This machine learning dissertation proposes three methods that study three different components of a general framework of the high-dimensional structured regression problem. A general theme of the proposed methods is that they cast a certain structured regression as a convex optimization problem. In so doing, the theoretical properties of each method can be well studied, and efficient computation is facilitated. Each method is accompanied by a thorough theoretical analysis of its performance, and also by an R package containing its practical implementation. It is shown that the proposed methods perform favorably (both theoretically and practically) compared with pre-existing methods.

9. Asymptotics and Interpretability of Decision Trees and Decision Tree Ensembles

Decision trees and decision tree ensembles are widely used nonparametric statistical models. A decision tree is a binary tree that recursively segments the covariate space along the coordinate directions to create hyper-rectangles as basic prediction units, fitting a constant value within each of them. A decision tree ensemble combines multiple decision trees, either in parallel or in sequence, in order to increase model flexibility and accuracy, as well as to reduce prediction variance. Despite the fact that tree models have been extensively used in practice, results on their asymptotic behaviors are scarce. This machine learning dissertation presents analyses of tree asymptotics from the perspectives of tree terminal nodes, tree ensembles, and models incorporating tree ensembles respectively. The study introduces a few new tree-related learning frameworks which provide provable statistical guarantees and interpretations. A study of the Gini index used in the greedy tree building algorithm reveals its limiting distribution, leading to the development of a test of better splitting that helps to measure the uncertain optimality of a decision tree split. This test is combined with the concept of decision tree distillation, which implements a decision tree to mimic the behavior of a black box model, to generate stable interpretations by guaranteeing a unique distillation tree structure as long as there are sufficiently many random sample points. Mild modification and regularization are also applied to standard tree boosting to create a new boosting framework named Boulevard. The framework integrates two new mechanisms: honest trees, which isolate the tree terminal values from the tree structure, and adaptive shrinkage, which scales the boosting history to create an equally weighted ensemble. This theoretical development provides the prerequisite for the practice of statistical inference with boosted trees.
Lastly, the thesis investigates the feasibility of incorporating existing semi-parametric models with tree boosting. 
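
To make the greedy splitting step concrete, here is a minimal sketch of how a single split can be chosen by minimizing weighted Gini impurity on one feature. It is a generic textbook illustration, not the dissertation's method or its test of better splitting; the function names and the toy data are assumptions made for this example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(xs, ys):
    """Return (threshold, weighted_impurity) for the best split on one feature.

    Candidate thresholds are midpoints between consecutive distinct values.
    If no split improves on the unsplit impurity, the threshold is None.
    """
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best = (None, gini(ys))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical toy data: two well-separated classes on a single feature.
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = ["a", "a", "a", "b", "b", "b"]
threshold, impurity = best_split(xs, ys)
print(threshold, impurity)
```

A full tree would apply this search recursively to each resulting region and across every feature.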

10. Bayesian Models for Imputing Missing Data and Editing Erroneous Responses in Surveys

This dissertation develops Bayesian methods for handling unit nonresponse, item nonresponse, and erroneous responses in large-scale surveys and censuses containing categorical data. The focus is on applications of nested household data where individuals are nested within households and certain combinations of the variables are not allowed, such as the U.S. Decennial Census, as well as surveys subject to both unit and item nonresponse, such as the Current Population Survey.

11. Localized Variable Selection with Random Forest  

Due to recent advances in computer technology, the cost of collecting and storing data has dropped drastically. This makes it feasible to collect large amounts of information for each data point. This increasing trend in feature dimensionality justifies the need for research on variable selection. Random forest (RF) has demonstrated the ability to select important variables and model complex data. However, simulations confirm that in some cases it fails to detect less influential features in the presence of variables with large impacts. This dissertation proposes two algorithms for localized variable selection: clustering-based feature selection (CBFS) and locally adjusted feature importance (LAFI). Both methods aim to find regions where the effects of weaker features can be isolated and measured. CBFS combines RF variable selection with a two-stage clustering method to detect variables whose effects can be detected only in certain regions. LAFI, on the other hand, uses a binary tree approach to split data into bins based on response variable rankings, and implements RF to find important variables in each bin. A larger LAFI value is assigned to variables that get selected in more bins. Simulations and real data sets are used to evaluate these variable selection methods.

12. Functional Principal Component Analysis and Sparse Functional Regression

The focus of this dissertation is on functional data which are sparsely and irregularly observed. Such data require special consideration, as classical functional data methods and theory were developed for densely observed data. As is the case in much of functional data analysis, the functional principal components (FPCs) play a key role in current sparse functional data methods via the Karhunen-Loève expansion. Thus, after a review of relevant background material, this dissertation is divided roughly into two parts, the first focusing specifically on theoretical properties of FPCs, and the second on regression for sparsely observed functional data.

13. Essays In Causal Inference: Addressing Bias In Observational And Randomized Studies Through Analysis And Design

In observational studies, identifying assumptions may fail, often quietly and without notice, leading to biased causal estimates. Although less of a concern in randomized trials where treatment is assigned at random, bias may still enter the equation through other means. This dissertation has three parts, each developing new methods to address a particular pattern or source of bias in the setting being studied. The first part extends the conventional sensitivity analysis methods for observational studies to better address patterns of heterogeneous confounding in matched-pair designs. The second part develops a modified difference-in-difference design for comparative interrupted time-series studies. The method permits partial identification of causal effects when the parallel trends assumption is violated by an interaction between group and history. The method is applied to a study of the repeal of Missouri’s permit-to-purchase handgun law and its effect on firearm homicide rates. The final part presents a study design to identify vaccine efficacy in randomized control trials when there is no gold standard case definition. The approach augments a two-arm randomized trial with natural variation of a genetic trait to produce a factorial experiment. 

14. Bayesian Shrinkage: Computation, Methods, and Theory

Sparsity is a standard structural assumption that is made while modeling high-dimensional statistical parameters. This assumption essentially entails a lower-dimensional embedding of the high-dimensional parameter, thus enabling sound statistical inference. Apart from this obvious statistical motivation, in many modern applications of statistics such as genomics and neuroscience, parameters of interest are indeed of this nature. For almost two decades, spike and slab type priors have been the Bayesian gold standard for modeling sparsity. However, due to their computational bottlenecks, shrinkage priors have emerged as a powerful alternative. This family of priors can almost exclusively be represented as a scale mixture of Gaussian distributions, and posterior Markov chain Monte Carlo (MCMC) updates of related parameters are then relatively easy to design. Although shrinkage priors were tipped as having computational scalability in high dimensions, when the number of parameters is in the thousands or more, they do come with their own computational challenges. Standard MCMC algorithms implementing shrinkage priors generally scale cubically in the dimension of the parameter, severely limiting real-life application of these priors.

The first chapter of this dissertation addresses this computational issue and proposes an alternative exact posterior sampling algorithm whose complexity scales linearly in the ambient dimension. The algorithm developed in the first chapter is specifically designed for regression problems. The second chapter develops a Bayesian method based on shrinkage priors for high-dimensional multiple response regression. Chapter three chooses a specific member of the shrinkage family known as the horseshoe prior and studies its convergence rates in several high-dimensional models.

15. Topics in Measurement Error Analysis and High-Dimensional Binary Classification

This dissertation proposes novel methods to tackle two problems: the misspecified model with measurement error and high-dimensional binary classification, both of which have a crucial impact on applications in public health. The first problem arises in epidemiological practice. Epidemiologists often categorize a continuous risk predictor since categorization is thought to be more robust and interpretable, even when the true risk model is not a categorical one. Thus, their goal is to fit the categorical model and interpret the categorical parameters. The second project considers the problem of high-dimensional classification between two groups with unequal covariance matrices. Rather than estimating the full quadratic discriminant rule, it is proposed to perform simultaneous variable selection and linear dimension reduction on the original data, with the subsequent application of quadratic discriminant analysis on the reduced space. Further, in order to support the proposed methodology, two R packages were developed, CCP and DAP, along with two vignettes as long-format illustrations of their usage.

16. Model-Based Penalized Regression

This dissertation contains three chapters that consider penalized regression from a model-based perspective, interpreting penalties as assumed prior distributions for unknown regression coefficients. The first chapter shows that treating a lasso penalty as a prior can facilitate the choice of tuning parameters when standard methods for choosing the tuning parameters are not available, and when it is necessary to choose multiple tuning parameters simultaneously. The second chapter considers a possible drawback of treating penalties as models, specifically possible misspecification. The third chapter introduces structured shrinkage priors for dependent regression coefficients which generalize popular independent shrinkage priors. These can be useful in various applied settings where many regression coefficients are not only expected to be nearly or exactly equal to zero, but also structured.
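
The idea of reading a penalty as a prior can be illustrated in the simplest possible setting. The sketch below assumes a single observation z of a coefficient b with unit-variance Gaussian noise and a Laplace prior on b with rate lam; the negative log posterior is then 0.5*(z - b)^2 + lam*|b| up to a constant, which is the one-dimensional lasso objective, and its minimizer is the soft-thresholding rule. This is a standard textbook fact, not the dissertation's own procedure, and the function names are made up for this example.

```python
def objective(b, z, lam):
    """Negative log posterior (up to a constant): Gaussian likelihood plus Laplace prior."""
    return 0.5 * (z - b) ** 2 + lam * abs(b)


def soft_threshold(z, lam):
    """Minimizer of objective(b, z, lam) over b, i.e. the 1-D lasso / MAP estimate."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Large observations are shrunk toward zero; small ones are set exactly to zero.
for z in (3.0, -3.0, 0.5):
    print(z, soft_threshold(z, 1.0))
```

In this reading, the tuning parameter lam is the rate of the assumed prior, which is what makes model-based choices of the tuning parameter possible.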

17. Topics on Least Squares Estimation

This dissertation revisits and makes progress on some old but challenging problems concerning least squares estimation, the work-horse of supervised machine learning. Two major problems are addressed: (i) least squares estimation with heavy-tailed errors, and (ii) least squares estimation in non-Donsker classes. For (i), this problem is studied both from a worst-case perspective, and a more refined envelope perspective. For (ii), two case studies are performed in the context of (a) estimation involving sets and (b) estimation of multivariate isotonic functions. Understanding these particular aspects of least squares estimation problems requires several new tools in the empirical process theory, including a sharp multiplier inequality controlling the size of the multiplier empirical process, and matching upper and lower bounds for empirical processes indexed by non-Donsker classes.

How to Learn More about Machine Learning

At our upcoming event this November 16th-18th in San Francisco, ODSC West 2021 will feature a plethora of talks, workshops, and training sessions on machine learning and machine learning research. You can register now for 50% off all ticket types before the discount drops to 40% in a few weeks. Some highlighted sessions on machine learning include:

  • Towards More Energy-Efficient Neural Networks? Use Your Brain!: Olaf de Leeuw | Data Scientist | Dataworkz
  • Practical MLOps: Automation Journey: Evgenii Vinogradov, PhD | Head of DHW Development | YooMoney
  • Applications of Modern Survival Modeling with Python: Brian Kent, PhD | Data Scientist | Founder The Crosstab Kite
  • Using Change Detection Algorithms for Detecting Anomalous Behavior in Large Systems: Veena Mendiratta, PhD | Adjunct Faculty, Network Reliability and Analytics Researcher | Northwestern University

Sessions on MLOps:

  • Tuning Hyperparameters with Reproducible Experiments: Milecia McGregor | Senior Software Engineer | Iterative
  • MLOps… From Model to Production: Filipa Peleja, PhD | Lead Data Scientist | Levi Strauss & Co
  • Operationalization of Models Developed and Deployed in Heterogeneous Platforms: Sourav Mazumder | Data Scientist, Thought Leader, AI & ML Operationalization Leader | IBM
  • Develop and Deploy a Machine Learning Pipeline in 45 Minutes with Ploomber: Eduardo Blancas | Data Scientist | Fidelity Investments

Sessions on Deep Learning:

  • GANs: Theory and Practice, Image Synthesis With GANs Using TensorFlow: Ajay Baranwal | Center Director | Center for Deep Learning in Electronic Manufacturing, Inc
  • Machine Learning With Graphs: Going Beyond Tabular Data: Dr. Clair J. Sullivan | Data Science Advocate | Neo4j
  • Deep Dive into Reinforcement Learning with PPO using TF-Agents & TensorFlow 2.0: Oliver Zeigermann | Software Developer | embarc Software Consulting GmbH
  • Get Started with Time-Series Forecasting using the Google Cloud AI Platform: Karl Weinmeister | Developer Relations Engineering Manager | Google

Daniel Gutierrez, ODSC

Daniel D. Gutierrez is a practicing data scientist who’s been working with data long before the field came in vogue. As a technology journalist, he enjoys keeping a pulse on this fast-paced industry. Daniel is also an educator having taught data science, machine learning and R classes at the university level. He has authored four computer industry books on database and data science technology, including his most recent title, “Machine Learning and Data Science: An Introduction to Statistical Learning Methods with R.” Daniel holds a BS in Mathematics and Computer Science from UCLA.

37 Research Topics In Data Science To Stay On Top Of

Stewart Kaplan

  • February 22, 2024

As a data scientist, staying on top of the latest research in your field is essential.

The data science landscape changes rapidly, and new techniques and tools are constantly being developed.

To keep up with the competition, you need to be aware of the latest trends and topics in data science research.

In this article, we will provide an overview of 37 hot research topics in data science.

We will discuss each topic in detail, including its significance and potential applications.

These topics could be an idea for a thesis or simply topics you can research independently.

Stay tuned – this is one blog post you don’t want to miss!

37 Research Topics in Data Science

1.) Predictive Modeling

Predictive modeling is a significant portion of data science and a topic you must be aware of.

Simply put, it is the process of using historical data to build models that can predict future outcomes.

Predictive modeling has many applications, from marketing and sales to financial forecasting and risk management.

As businesses increasingly rely on data to make decisions, predictive modeling is becoming more and more important.

While it can be complex, predictive modeling is a powerful tool that gives businesses a competitive advantage.
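
As a minimal sketch of the idea, the snippet below fits a straight line to historical data with ordinary least squares and uses it to predict the next value. The monthly sales figures are invented for illustration, and real predictive models usually involve many more variables and a held-out validation step.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical historical data: month number and sales for that month.
months = [1, 2, 3, 4, 5]
sales = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(months, sales)
forecast = slope * 6 + intercept  # predicted sales for month 6
print(slope, intercept, forecast)
```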

2.) Big Data Analytics

These days, it seems like everyone is talking about big data.

And with good reason – organizations of all sizes are sitting on mountains of data, and they’re increasingly turning to data scientists to help them make sense of it all.

But what exactly is big data? And what does it mean for data science?

Simply put, big data is a term used to describe datasets that are too large and complex for traditional data processing techniques.

Big data typically refers to datasets of a few terabytes or more.

But size isn’t the only defining characteristic – big data is also characterized by its high Velocity (the speed at which data is generated), Variety (the different types of data), and Volume (the amount of information).

Given the enormity of big data, it’s not surprising that organizations are struggling to make sense of it all.

That’s where data science comes in.

Data scientists use various methods to wrangle big data, including distributed computing and other decentralized technologies.

With the help of data science, organizations are beginning to unlock the hidden value in their big data.

By harnessing the power of big data analytics, they can improve their decision-making, better understand their customers, and develop new products and services.

3.) Auto Machine Learning

Auto machine learning is a research topic in data science concerned with developing algorithms that can automatically learn from data without intervention.

This area of research is vital because it frees data scientists from writing custom code for every dataset.

This allows us to focus on other tasks, such as model selection and validation.

Auto machine learning algorithms can learn from data in a hands-off way for the data scientist – while still providing incredible insights.

This makes them a valuable tool for data scientists who either don’t have the skills to do their own analysis or are struggling.

4.) Text Mining

Text mining is a research topic in data science that deals with text data extraction.

This area of research is important because it allows us to get as much information as possible from the vast amount of text data available today.

Text mining techniques can extract information from text data, such as keywords, sentiments, and relationships.

This information can be used for various purposes, such as model building and predictive analytics.
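
A very small example of keyword extraction is counting word frequencies after dropping common stopwords. The sketch below uses only the Python standard library; the stopword list and the sample sentence are assumptions made for this example, and production systems typically rely on richer tokenization and weighting schemes.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stopword list for the example.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "it", "was"}


def top_keywords(text, k=3):
    """Return the k most frequent non-stopword tokens with their counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(token for token in tokens if token not in STOPWORDS)
    return counts.most_common(k)


doc = "The battery life is great. The battery charges fast and the screen is great."
print(top_keywords(doc, 2))
```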

5.) Natural Language Processing

Natural language processing is a data science research topic that analyzes human language data.

This area of research is important because it allows us to understand and make sense of the vast amount of text data available today.

Natural language processing techniques can build predictive and interactive models from any language data.

Natural language processing is pretty broad, and recent advances like GPT-3 have pushed this topic to the forefront.

6.) Recommender Systems

Recommender systems are an exciting topic in data science because they allow us to make better products, services, and content recommendations.

Businesses can better understand their customers and their needs by using recommender systems.

This, in turn, allows them to develop better products and services that meet the needs of their customers.

Recommender systems are also used to recommend content to users.

This can be done on an individual level or at a group level.

Think about Netflix, for example, always knowing what you want to watch!

Recommender systems are a valuable tool for businesses and users alike.
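
One simple way to produce individual-level recommendations is user-based collaborative filtering: find the user whose ratings look most like yours and suggest what they liked that you have not seen. The sketch below is a toy illustration with made-up users, titles, and ratings; it is not how any particular service works.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse rating dicts (item -> rating)."""
    shared = set(u) & set(v)
    dot = sum(u[item] * v[item] for item in shared)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(target, ratings):
    """Recommend items rated by the most similar user that the target has not rated."""
    others = [name for name in ratings if name != target]
    nearest = max(others, key=lambda name: cosine(ratings[target], ratings[name]))
    seen = ratings[target]
    unseen = {item: r for item, r in ratings[nearest].items() if item not in seen}
    return sorted(unseen, key=unseen.get, reverse=True)


# Hypothetical ratings on a 1-5 scale.
ratings = {
    "alice": {"Film A": 5, "Film B": 4},
    "bob": {"Film A": 4, "Film B": 5, "Film C": 5},
    "carol": {"Film B": 1, "Film D": 5},
}
print(recommend("alice", ratings))
```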

7.) Deep Learning

Deep learning is a research topic in data science that deals with artificial neural networks.

These networks are composed of multiple layers, and each layer is formed from various nodes.

Deep learning networks can learn from data similarly to how humans learn, irrespective of the data distribution.

This makes them a valuable tool for data scientists looking to build models that can learn from data independently.

The deep learning network has become very popular in recent years because of its ability to achieve state-of-the-art results on various tasks.

There seems to be a new SOTA deep learning research paper every single day!

8.) Reinforcement Learning

Reinforcement learning is a research topic in data science that deals with algorithms that can learn on multiple levels from interactions with their environment.

This area of research is essential because it allows us to develop algorithms that can learn non-greedy approaches to decision-making, allowing businesses and companies to favor long-term gains over short-term ones.

9.) Data Visualization

Data visualization is an excellent research topic in data science because it allows us to see our data in a way that is easy to understand.

Data visualization techniques can be used to create charts, graphs, and other visual representations of data.

This allows us to see the patterns and trends hidden in our data.

Data visualization is also used to communicate results to others.

This allows us to share our findings with others in a way that is easy to understand.

There are many ways to contribute to and learn about data visualization.

Some ways include attending conferences, reading papers, and contributing to open-source projects.

10.) Predictive Maintenance

Predictive maintenance is a hot topic in data science because it allows us to prevent failures before they happen.

This is done using data analytics to predict when a failure will occur.

This allows us to take corrective action before the failure actually happens.

While this sounds simple, avoiding false positives while keeping recall is challenging and an area wide open for advancement.
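
The trade-off can be made concrete by computing precision (how many raised alarms were real failures) and recall (how many real failures were caught). The labels below are invented for illustration, with 1 meaning "failure" and 0 meaning "no failure".

```python
def precision_recall(actual, predicted):
    """Return (precision, recall) for binary labels where 1 marks a failure."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)        # failures correctly flagged
    fp = sum(1 for a, p in pairs if not a and p)    # false alarms
    fn = sum(1 for a, p in pairs if a and not p)    # missed failures
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical outcomes for eight machines.
actual = [1, 0, 0, 1, 1, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 1]

precision, recall = precision_recall(actual, predicted)
print(precision, recall)
```

Lowering the alarm threshold usually raises recall at the cost of more false positives, which is why both numbers need to be tracked together.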

11.) Financial Analysis

Financial analysis is an older topic that has been around for a while but is still a great field where contributions can be felt.

Current researchers are focused on analyzing macroeconomic data to make better financial decisions.

This is done by analyzing the data to identify trends and patterns.

Financial analysts can use this information to make informed decisions about where to invest their money.

Financial analysis is also used to predict future economic trends.

This allows businesses and individuals to prepare for potential financial hardships and enable companies to be cash-heavy during good economic conditions.

Overall, financial analysis is a valuable tool for anyone looking to make better financial decisions.

12.) Image Recognition

Image recognition is one of the hottest topics in data science because it allows us to identify objects in images.

This is done using artificial intelligence algorithms that can learn from data and understand what objects you’re looking for.

This allows us to build models that can accurately recognize objects in images and video.

This is a valuable tool for businesses and individuals who want to be able to identify objects in images.

Think about security, identification, routing, traffic, etc.

Image Recognition has gained a ton of momentum recently – for a good reason.

13.) Fraud Detection

Fraud detection is a great topic in data science because it allows us to identify fraudulent activity before it happens.

This is done by analyzing data to look for patterns and trends that may be associated with the fraud.

Once our machine learning model recognizes some of these patterns in real time, it immediately detects fraud.

This allows us to take corrective action before the fraud actually happens.

Fraud detection is a valuable tool for anyone who wants to protect themselves from potential fraudulent activity.

14.) Web Scraping

Web scraping is a controversial topic in data science because it allows us to collect data from the web, which is usually data you do not own.

This is done by extracting data from websites using scraping tools that are usually custom-programmed.

This allows us to collect data that would otherwise be inaccessible.

For obvious reasons, web scraping is a unique tool – giving you data your competitors would have no chance of getting.

I think there is an excellent opportunity to create new and innovative ways to make scraping accessible for everyone, not just those who understand Selenium and Beautiful Soup.

15.) Social Media Analysis

Social media analysis is not new; many people have already created exciting and innovative algorithms to study this.

However, it is still a great data science research topic because it allows us to understand how people interact on social media.

This is done by analyzing data from social media platforms to look for insights, bots, and recent societal trends.

Once we understand these practices, we can use this information to improve our marketing efforts.

For example, if we know that a particular demographic prefers a specific type of content, we can create more content that appeals to them.

Social media analysis is also used to understand how people interact with brands on social media.

This allows businesses to understand better what their customers want and need.

Overall, social media analysis is valuable for anyone who wants to improve their marketing efforts or understand how customers interact with brands.

16.) GPU Computing

GPU computing is a fun new research topic in data science because it allows us to process data much faster than traditional CPUs.

Due to how GPUs are made, they’re incredibly proficient at intense matrix operations, outperforming traditional CPUs by very high margins.

While the computation is fast, the coding is still tricky.

There is an excellent research opportunity to bring these innovations to non-traditional modules, allowing data science to take advantage of GPU computing outside of deep learning.

17.) Quantum Computing

Quantum computing is a new research topic in data science and physics because it allows us to process data much faster than traditional computers.

It also opens the door to new types of data.

There are some problems that simply can’t be solved on a classical computer.

For example, if you wanted to understand how a single atom moved around, a classical computer couldn’t handle this problem.

You’ll need to utilize a quantum computer to handle quantum mechanics problems.

This may be the “hottest” research topic on the planet right now, with some of the top researchers in computer science and physics worldwide working on it.

You could be too.

18.) Genomics

Genomics may be the only research topic that can compete with quantum computing regarding the “number of top researchers working on it.”

Genomics is a fantastic intersection of data science because it allows us to understand how genes work.

This is done by sequencing the DNA of different organisms to look for insights into our own and other species.

Once we understand these patterns, we can use this information to improve our understanding of diseases and create new and innovative treatments for them.

Genomics is also used to study the evolution of different species.

Genomics is the future and a field begging for new and exciting research professionals to take it to the next step.

19.) Location-based services

Location-based services are an old and time-tested research topic in data science.

Since GPS and 4G cell phone reception became a thing, we’ve been trying to stay informed about how humans interact with their environment.

This is done by analyzing data from GPS tracking devices, cell phone towers, and Wi-Fi routers to look for insights into how humans interact.

Once we understand these practices, we can use this information to improve our geotargeting efforts, improve maps, find faster routes, and improve cohesion throughout a community.

Location-based services are used to understand the user, something every business could always use a little bit more of.

While a seemingly “stale” field, location-based services have seen a revival period with self-driving cars.


20.) Smart City Applications

Smart city applications are all the rage in data science research right now.

By harnessing the power of data, cities can become more efficient and sustainable.

But what exactly are smart city applications?

In short, they are systems that use data to improve city infrastructure and services.

This can include anything from traffic management and energy use to waste management and public safety.

Data is collected from various sources, including sensors, cameras, and social media.

It is then analyzed to identify tendencies and habits.

This information can be used to make predictions about future needs and optimize city resources.

As more and more cities strive to become “smart,” the demand for data scientists with expertise in smart city applications is only growing.

21.) Internet Of Things (IoT)

The Internet of Things, or IoT, is an exciting new data science and sustainability research topic.

IoT is a network of physical objects embedded with sensors and connected to the internet.

These objects can include everything from alarm clocks to refrigerators; they’re all connected to the internet.

That means that they can share data with computers.

And that’s where data science comes in.

Data scientists are using IoT data to learn everything from how people use energy to how traffic flows through a city.

They’re also using IoT data to predict when an appliance will break down or when a road will be congested.

Really, the possibilities are endless.

With such a wide-open field, it’s easy to see why IoT is being researched by some of the top professionals in the world.

22.) Cybersecurity

Cybersecurity is a relatively new research topic in data science and in general, but it’s already garnering a lot of attention from businesses and organizations.

After all, with the increasing number of cyber attacks in recent years, it’s clear that we need to find better ways to protect our data.

While most of cybersecurity focuses on infrastructure, data scientists can leverage historical events to find potential exploits to protect their companies.

Sometimes, looking at a problem from a different angle helps, and that’s what data science brings to cybersecurity.

Also, data science can help to develop new security technologies and protocols.

As a result, cybersecurity is a crucial data science research area and one that will only become more important in the years to come.

23.) Blockchain

Blockchain is an incredible new research topic in data science for several reasons.

First, it is a distributed database technology that enables secure, transparent, and tamper-proof transactions.

Did someone say transmitting data?

This makes it an ideal platform for tracking data and transactions in various industries.

Second, blockchain is powered by cryptography, which not only makes it highly secure – but is a familiar foe for data scientists.

Finally, blockchain is still in its early stages of development, so there is much room for research and innovation.

As a result, blockchain is a great new research topic in data science that vows to revolutionize how we store, transmit and manage data.


24.) Sustainability

Sustainability is a relatively new research topic in data science, but it is gaining traction quickly.

To keep up with this demand, The Wharton School of the University of Pennsylvania has started to offer an MBA in Sustainability.

This demand isn’t shocking, and some of the reasons include the following:

Sustainability is an important issue that is relevant to everyone.

Datasets on sustainability are constantly growing and changing, making it an exciting challenge for data scientists.

There hasn’t been a “set way” to approach sustainability from a data perspective, making it an excellent opportunity for interdisciplinary research.

As data science grows, sustainability will likely become an increasingly important research topic.

25.) Educational Data

Education has always been a great topic for research, and with the advent of big data, educational data has become an even richer source of information.

By studying educational data, researchers can gain insights into how students learn, what motivates them, and what barriers these students may face.

Besides, data science can be used to develop educational interventions tailored to individual students’ needs.

Imagine being the researcher that helps that high schooler pass mathematics; what an incredible feeling.

With the increasing availability of educational data, data science has enormous potential to improve the quality of education.


26.) Politics

As data science continues to evolve, so does the scope of its applications.

Originally used primarily for business intelligence and marketing, data science is now applied to various fields, including politics.

By analyzing large data sets, political scientists (data scientists with a cooler name) can gain valuable insights into voting patterns, campaign strategies, and more.

Further, data science can be used to forecast election results and understand the effects of political events on public opinion.

With the wealth of data available, there is no shortage of research opportunities in this field.

As data science evolves, so does our understanding of politics and its role in our world.

27.) Cloud Technologies

Cloud technologies are a great research topic.

They allow for the outsourcing and sharing of computer resources and applications over the internet.

This lets organizations save money on hardware and maintenance costs while providing employees access to the latest and greatest software and applications.

I believe there is an argument that AWS could be the greatest and most technologically advanced business ever built (Yes, I know it’s only part of the company).

Besides, cloud technologies can help improve team members’ collaboration by allowing them to share files and work on projects together in real-time.

As more businesses adopt cloud technologies, data scientists must stay up-to-date on the latest trends in this area.

By researching cloud technologies, data scientists can help organizations to make the most of this new and exciting technology.


28.) Robotics

Robotics has recently become a household name, and it’s for a good reason.

First, robotics deals with controlling and planning physical systems, an inherently complex problem.

Second, robotics requires various sensors and actuators to interact with the world, making it an ideal application for machine learning techniques.

Finally, robotics is an interdisciplinary field that draws on various disciplines, such as computer science, mechanical engineering, and electrical engineering.

As a result, robotics is a rich source of research problems for data scientists.

29.) HealthCare

Healthcare is an industry that is ripe for data-driven innovation.

Hospitals, clinics, and health insurance companies generate a tremendous amount of data daily.

This data can be used to improve the quality of care and outcomes for patients.

This is perfect timing, as the healthcare industry is undergoing a significant shift towards value-based care, which means there is a greater need than ever for data-driven decision-making.

As a result, healthcare is an exciting new research topic for data scientists.

There are many different ways in which data can be used to improve healthcare, and there is a ton of room for newcomers to make discoveries.


30.) Remote Work

There’s no doubt that remote work is on the rise.

In today’s global economy, more and more businesses are allowing their employees to work from home or anywhere else they can get a stable internet connection.

But what does this mean for data science? Well, for one thing, it opens up a whole new field of research.

For example, how does remote work impact employee productivity?

What are the best ways to manage and collaborate on data science projects when team members are spread across the globe?

And what are the cybersecurity risks associated with working remotely?

These are just a few of the questions that data scientists will be able to answer with further research.

So if you’re looking for a new topic to sink your teeth into, remote work in data science is a great option.

31.) Data-Driven Journalism

Data-driven journalism is an exciting new field of research that combines the best of both worlds: the rigor of data science with the creativity of journalism.

By applying data analytics to large datasets, journalists can uncover stories that would otherwise be hidden.

And telling these stories compellingly can help people better understand the world around them.

Data-driven journalism is still in its infancy, but it has already had a major impact on how news is reported.

In the future, it will only become more important as journalists become increasingly fluent with data.

It is an exciting new topic and research field for data scientists to explore.


32.) Data Engineering

Data engineering is a staple in data science, focusing on efficiently managing data.

Data engineers are responsible for developing and maintaining the systems that collect, process, and store data.

In recent years, there has been an increasing demand for data engineers as the volume of data generated by businesses and organizations has grown exponentially.

Data engineers must be able to design and implement efficient data-processing pipelines and have the skills to optimize and troubleshoot existing systems.

If you are looking for a challenging research topic that could have an immediate, worldwide impact, then improving existing approaches to data engineering, or inventing a new one, would be a good start.

33.) Data Curation

Data curation has been a hot topic in the data science community for some time now.

Curating data involves organizing, managing, and preserving data so researchers can use it.

Data curation can help to ensure that data is accurate, reliable, and accessible.

It can also help to prevent research duplication and to facilitate the sharing of data between researchers.

Data curation is a vital part of data science. In recent years, there has been an increasing focus on data curation, as it has become clear that it is essential for ensuring data quality.

As a result, data curation is now a major research topic in data science.

There are numerous books and articles on the subject, and many universities offer courses on data curation.

Data curation is an integral part of data science and will only become more important in the future.


34.) Meta-Learning

Meta-learning is gaining a ton of steam in data science. It’s learning how to learn.

So, if you can learn how to learn, you can learn anything much faster.

Meta-learning is mainly used in deep learning, as applications outside of this are generally pretty hard.

In deep learning, many parameters need to be tuned for a good model, and there’s usually a lot of data.

You can save time and effort if you can automatically and quickly do this tuning.

In machine learning, meta-learning can improve models’ performance by sharing knowledge between different models.

For example, if you have several different models that all solve the same problem, you can use meta-learning to share knowledge between them and improve the group’s overall performance.
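One simple way to picture several models pooling what they know is a weighted vote. The sketch below is only an illustration under stated assumptions: it uses plain accuracy-weighted voting (a basic ensemble technique, not a full meta-learning algorithm), and the three toy models, the `build_weighted_ensemble` helper, and the validation data are all hypothetical.

```python
from collections import defaultdict


def accuracy(model, examples):
    """Fraction of (x, label) pairs that the model classifies correctly."""
    return sum(model(x) == y for x, y in examples) / len(examples)


def build_weighted_ensemble(models, validation):
    """Return a classifier in which each base model votes, weighted by
    its accuracy on held-out validation data."""
    weights = [accuracy(m, validation) for m in models]

    def predict(x):
        votes = defaultdict(float)
        for model, weight in zip(models, weights):
            votes[model(x)] += weight
        # The label with the largest total weight wins.
        return max(votes, key=votes.get)

    return predict


# Three hypothetical base models for the same task: "is the number positive?"
def by_sign(x):
    return x > 0


def by_threshold(x):
    return x > 2  # too strict for small positive numbers


def always_true(x):
    return True  # ignores its input entirely


validation = [(-3, False), (-1, False), (1, True), (4, True)]
ensemble = build_weighted_ensemble(
    [by_sign, by_threshold, always_true], validation
)
print(ensemble(1), ensemble(-2))
```

The weaker models still get a say, but the model that did best on the validation data carries the most weight, so the group can outperform its weakest members.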

I don’t know how anyone looking for a research topic could stay away from this field; it’s what the Terminator warned us about!

35.) Data Warehousing

A data warehouse is a system used for data analysis and reporting.

It is a central data repository created by combining data from multiple sources.

Data warehouses are often used to store historical data, such as sales data, financial data, and customer data.

This type of data can be used to create reports and perform statistical analysis.
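To make the idea concrete, here is a minimal sketch of a warehouse in miniature, assuming two hypothetical source systems (online and in-store sales) and using Python's built-in `sqlite3` module as a stand-in for a real warehouse engine. The table layout and the sample figures are invented for illustration.

```python
import sqlite3

# Two hypothetical source systems, each exporting (month, product, amount).
online_sales = [("2023-01", "widget", 120.0), ("2023-02", "widget", 80.0)]
store_sales = [("2023-01", "widget", 50.0), ("2023-01", "gadget", 30.0)]

# An in-memory database plays the role of the central repository.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales (month TEXT, product TEXT, amount REAL, source TEXT)"
)

# Load step: combine both sources, tagging each row with its origin.
for source, rows in (("online", online_sales), ("store", store_sales)):
    warehouse.executemany(
        "INSERT INTO sales VALUES (?, ?, ?, ?)",
        [(month, product, amount, source) for month, product, amount in rows],
    )

# Report step: total revenue per month across all sources.
report = warehouse.execute(
    "SELECT month, SUM(amount) FROM sales GROUP BY month ORDER BY month"
).fetchall()
print(report)
```

Keeping the `source` column makes it possible to trace any figure in a report back to the system it came from.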

Data warehouses also store data that the organization is not currently using.

This type of data can be used for future research projects.

Data warehousing is an incredible research topic in data science because it offers a variety of benefits.

Data warehouses help organizations to save time and money by reducing the need for manual data entry.

They also help to improve the accuracy of reports and provide a complete picture of the organization’s performance.

Data warehousing feels like one of the weakest parts of the Data Science Technology Stack; if you want a research topic that could have a monumental impact – data warehousing is an excellent place to look.


36.) Business Intelligence

Business intelligence aims to collect, process, and analyze data to help businesses make better decisions.

Business intelligence can improve marketing, sales, customer service, and operations.

It can also be used to identify new business opportunities and track competition.

BI is simply another tool in your company’s toolbox to help you keep dominating your market.

Data science is the perfect tool for business intelligence because it combines statistics, computer science, and machine learning.

Data scientists can use business intelligence to answer questions like, “What are our customers buying?” or “What are our competitors doing?” or “How can we increase sales?”

Business intelligence is a great way to improve your business’s bottom line and an excellent opportunity to dive deep into a well-respected research topic.

37.) Crowdsourcing

One of the newest areas of research in data science is crowdsourcing.

Crowdsourcing is a process of sourcing tasks or projects to a large group of people, typically via the internet.

This can be done for various purposes, such as gathering data, developing new algorithms, or even just for fun (think: online quizzes and surveys).

But what makes crowdsourcing so powerful is that it allows businesses and organizations to tap into a vast pool of talent and resources they wouldn’t otherwise have access to.

And with the rise of social media, it’s easier than ever to connect with potential crowdsource workers worldwide.

Imagine if you could affect that by finding innovative ways to improve how people work together.

That would have a huge effect.


Final Thoughts: Are These Research Topics in Data Science for You?

Thirty-seven different research topics in data science are a lot to take in, but we hope you found a research topic that interests you.

If not, don’t worry – there are plenty of other great topics to explore.

The important thing is to get started with your research and find ways to apply what you learn to real-world problems.

We wish you the best of luck as you begin your data science journey!

Stewart Kaplan


Department of Computer Science

Thesis Projects and Research in Data Science

The Master's thesis is a mandatory course of the Master's program in Data Science. The thesis is supervised by a professor from the data science faculty list.

Research in Data Science is a core elective for students in Data Science under the supervision of a data science professor.

Research in Data Science

The project is independent work under the supervision of a member of the data science faculty.

Only students who have passed at least one core course in Data Management and Processing, and one core course in Data Analysis can start with a research project.

Before starting, the project must be registered in mystudies and a project description must be submitted at the start of the project to the studies administration by e-mail (address see Contact in right column).

Master's Thesis

The Master's Thesis requires 6 months of full time study/work, and we strongly discourage you from attending any courses in parallel. We recommend that you acquire all course credits before the start of the Master’s thesis. The topic for the Master’s thesis must be chosen within Data Science.

Before starting a Master’s thesis, it is important to agree with your supervisor on the task and the assessment scheme. Both have to be documented thoroughly. You electronically register the Master’s thesis in mystudies.

It is possible to complete the Master’s thesis in industry provided that a professor involved in the Data Science Master’s program supervises the thesis and your tutor approves it.

Further details on internal regulations of the Master’s thesis can be downloaded from the following website: .

Overview Master's Theses Projects

Chair of Programming Methodology

  • Prof. Dr. Martin Vechev

Institute for Computing Platform

  • Prof. Dr. Gustavo Alonso
  • Prof. Dr. Torsten Hoefler
  • Prof. Dr. Ana Klimovic
  • Prof. Dr. Timothy Roscoe

Institute for Machine Learning

  • Prof. Dr. Valentina Boeva
  • Prof. Dr. Joachim Buhmann
  • Prof. Dr. Ryan Cotterell
  • Prof. Dr. Menna El-Assady
  • Prof. Dr. Niao He
  • Prof. Dr. Thomas Hofmann
  • Prof. Dr. Andreas Krause
  • Prof. Dr. Fernando Perez Cruz
  • Prof. Dr. Gunnar Rätsch
  • Prof. Dr. Mrinmaya Sachan
  • Prof. Dr. Bernhard Schölkopf
  • Prof. Dr. Julia Vogt

Institute for Pervasive Computing

  • Prof. Dr. Otmar Hilliges

Institute of Computer Systems

  • Prof. Dr. Markus Püschel

Institute of Information Security

  • Prof. Dr. David Basin
  • Prof. Dr. Srdjan Capkun
  • Prof. Dr. Florian Tramèr

Institute of Theoretical Computer Science

  • Prof. Dr. Bernd Gärtner

Institute of Visual Computing

  • Prof. Dr. Markus Gross
  • Prof. Dr. Marc Pollefeys
  • Prof. Dr. Olga Sorkine
  • Prof. Dr. Siyu Tang

Disney Research Zurich

  • Prof. Dr. Robert Sumner

Automatic Control Laboratory

  • Prof. Dr. Florian Dörfler
  • Prof. Dr. John Lygeros

Communication Technology Laboratory

  • Prof. Dr. Helmut Bölcskei

Computer Engineering and Networks Laboratory

  • Prof. Dr. Laurent Vanbever
  • Prof. Dr. Roger Wattenhofer

Computer Vision Laboratory

  • Prof. Dr. Ender Konukoglu
  • Prof. Dr. Luc Van Gool
  • Prof. Dr. Fisher Yu

Institute for Biomedical Engineering

  • Prof. Dr. Klaas Enno Stephan

Integrated Systems Laboratory

  • Prof. Dr. Luca Benini
  • Prof. Dr. Christoph Studer

Signal and Information Processing Laboratory (ISI)

  • Prof. Dr. Amos Lapidoth
  • Prof. Dr. Hans-Andrea Loeliger

D-MATH does not publish Master's Theses projects. In case of interest contact the professor directly.

FIM - Institute for Mathematical Research

  • Prof. Dr. Alessio Figalli

Financial Mathematics

  • Prof. Dr. Josef Teichmann

Institute for Operations Research

  • Prof. Dr. Robert Weismantel
  • Prof. Dr. Rico Zenklusen

RiskLab Switzerland

  • Prof. Dr. Patrick Cheridito
  • Prof. Dr. Mario Valentin Wüthrich

Seminar for Applied Mathematics

  • Prof. Dr. Rima Alaifari
  • Prof. Dr. Siddhartha Mishra

Seminar for Statistics

  • Prof. Dr. Afonso Bandeira
  • Prof. Dr. Peter Bühlmann
  • Prof. Dr. Yuansi Chen
  • Prof. Dr. Nicolai Meinshausen
  • Prof. Dr. Jonas Peters
  • Prof. Dr. Johanna Ziegel

Law, Economics, and Data Science Group

  • Prof. Dr. Elliott Ash (D-GESS)

Institute for Geodesy and Photogrammetry

  • Prof. Dr. Konrad Schindler (D-BSSE)

Machine Learning - CMU

PhD Dissertations


[all are .pdf files].

Learning Models that Match Jacob Tyo, 2024

Improving Human Integration across the Machine Learning Pipeline Charvi Rastogi, 2024

Reliable and Practical Machine Learning for Dynamic Healthcare Settings Helen Zhou, 2023

Automatic customization of large-scale spiking network models to neuronal population activity (unavailable) Shenghao Wu, 2023

Estimation of BVk functions from scattered data (unavailable) Addison J. Hu, 2023

Rethinking object categorization in computer vision (unavailable) Jayanth Koushik, 2023

Advances in Statistical Gene Networks Jinjin Tian, 2023

Post-hoc calibration without distributional assumptions Chirag Gupta, 2023

The Role of Noise, Proxies, and Dynamics in Algorithmic Fairness Nil-Jana Akpinar, 2023

Collaborative learning by leveraging siloed data Sebastian Caldas, 2023

Modeling Epidemiological Time Series Aaron Rumack, 2023

Human-Centered Machine Learning: A Statistical and Algorithmic Perspective Leqi Liu, 2023

Uncertainty Quantification under Distribution Shifts Aleksandr Podkopaev, 2023

Probabilistic Reinforcement Learning: Using Data to Define Desired Outcomes, and Inferring How to Get There Benjamin Eysenbach, 2023

Comparing Forecasters and Abstaining Classifiers Yo Joong Choe, 2023

Using Task Driven Methods to Uncover Representations of Human Vision and Semantics Aria Yuan Wang, 2023

Data-driven Decisions - An Anomaly Detection Perspective Shubhranshu Shekhar, 2023

Applied Mathematics of the Future Kin G. Olivares, 2023



Principled Machine Learning for Societally Consequential Decision Making Amanda Coston, 2023

Long term brain dynamics extend cognitive neuroscience to timescales relevant for health and physiology Maxwell B. Wang, 2023

Long term brain dynamics extend cognitive neuroscience to timescales relevant for health and physiology Darby M. Losey, 2023

Calibrated Conditional Density Models and Predictive Inference via Local Diagnostics David Zhao, 2023

Towards an Application-based Pipeline for Explainability Gregory Plumb, 2022

Objective Criteria for Explainable Machine Learning Chih-Kuan Yeh, 2022

Making Scientific Peer Review Scientific Ivan Stelmakh, 2022

Facets of regularization in high-dimensional learning: Cross-validation, risk monotonization, and model complexity Pratik Patil, 2022

Active Robot Perception using Programmable Light Curtains Siddharth Ancha, 2022

Strategies for Black-Box and Multi-Objective Optimization Biswajit Paria, 2022

Unifying State and Policy-Level Explanations for Reinforcement Learning Nicholay Topin, 2022

Sensor Fusion Frameworks for Nowcasting Maria Jahja, 2022

Equilibrium Approaches to Modern Deep Learning Shaojie Bai, 2022

Towards General Natural Language Understanding with Probabilistic Worldbuilding Abulhair Saparov, 2022

Applications of Point Process Modeling to Spiking Neurons (Unavailable) Yu Chen, 2021

Neural variability: structure, sources, control, and data augmentation Akash Umakantha, 2021

Structure and time course of neural population activity during learning Jay Hennig, 2021

Cross-view Learning with Limited Supervision Yao-Hung Hubert Tsai, 2021

Meta Reinforcement Learning through Memory Emilio Parisotto, 2021

Learning Embodied Agents with Scalably-Supervised Reinforcement Learning Lisa Lee, 2021

Learning to Predict and Make Decisions under Distribution Shift Yifan Wu, 2021

Statistical Game Theory Arun Sai Suggala, 2021

Towards Knowledge-capable AI: Agents that See, Speak, Act and Know Kenneth Marino, 2021

Learning and Reasoning with Fast Semidefinite Programming and Mixing Methods Po-Wei Wang, 2021

Bridging Language in Machines with Language in the Brain Mariya Toneva, 2021

Curriculum Learning Otilia Stretcu, 2021

Principles of Learning in Multitask Settings: A Probabilistic Perspective Maruan Al-Shedivat, 2021

Towards Robust and Resilient Machine Learning Adarsh Prasad, 2021

Towards Training AI Agents with All Types of Experiences: A Unified ML Formalism Zhiting Hu, 2021

Building Intelligent Autonomous Navigation Agents Devendra Chaplot, 2021

Learning to See by Moving: Self-supervising 3D Scene Representations for Perception, Control, and Visual Reasoning Hsiao-Yu Fish Tung, 2021

Statistical Astrophysics: From Extrasolar Planets to the Large-scale Structure of the Universe Collin Politsch, 2020

Causal Inference with Complex Data Structures and Non-Standard Effects Kwhangho Kim, 2020

Networks, Point Processes, and Networks of Point Processes Neil Spencer, 2020

Dissecting neural variability using population recordings, network models, and neurofeedback (Unavailable) Ryan Williamson, 2020

Predicting Health and Safety: Essays in Machine Learning for Decision Support in the Public Sector Dylan Fitzpatrick, 2020

Towards a Unified Framework for Learning and Reasoning Han Zhao, 2020

Learning DAGs with Continuous Optimization Xun Zheng, 2020

Machine Learning and Multiagent Preferences Ritesh Noothigattu, 2020

Learning and Decision Making from Diverse Forms of Information Yichong Xu, 2020

Towards Data-Efficient Machine Learning Qizhe Xie, 2020

Change modeling for understanding our world and the counterfactual one(s) William Herlands, 2020

Machine Learning in High-Stakes Settings: Risks and Opportunities Maria De-Arteaga, 2020

Data Decomposition for Constrained Visual Learning Calvin Murdock, 2020

Structured Sparse Regression Methods for Learning from High-Dimensional Genomic Data Micol Marchetti-Bowick, 2020

Towards Efficient Automated Machine Learning Liam Li, 2020

Learning Collections of Functions Emmanouil Antonios Platanios, 2020

Provable, structured, and efficient methods for robustness of deep networks to adversarial examples Eric Wong, 2020

Reconstructing and Mining Signals: Algorithms and Applications Hyun Ah Song, 2020

Probabilistic Single Cell Lineage Tracing Chieh Lin, 2020

Graphical network modeling of phase coupling in brain activity (unavailable) Josue Orellana, 2019

Strategic Exploration in Reinforcement Learning - New Algorithms and Learning Guarantees Christoph Dann, 2019

Learning Generative Models using Transformations Chun-Liang Li, 2019

Estimating Probability Distributions and their Properties Shashank Singh, 2019

Post-Inference Methods for Scalable Probabilistic Modeling and Sequential Decision Making Willie Neiswanger, 2019

Accelerating Text-as-Data Research in Computational Social Science Dallas Card, 2019

Multi-view Relationships for Analytics and Inference Eric Lei, 2019

Information flow in networks based on nonstationary multivariate neural recordings Natalie Klein, 2019

Competitive Analysis for Machine Learning & Data Science Michael Spece, 2019

The When, Where and Why of Human Memory Retrieval Qiong Zhang, 2019

Towards Effective and Efficient Learning at Scale Adams Wei Yu, 2019

Towards Literate Artificial Intelligence Mrinmaya Sachan, 2019

Learning Gene Networks Underlying Clinical Phenotypes Under SNP Perturbations From Genome-Wide Data Calvin McCarter, 2019

Unified Models for Dynamical Systems Carlton Downey, 2019

Anytime Prediction and Learning for the Balance between Computation and Accuracy Hanzhang Hu, 2019

Statistical and Computational Properties of Some "User-Friendly" Methods for High-Dimensional Estimation Alnur Ali, 2019

Nonparametric Methods with Total Variation Type Regularization Veeranjaneyulu Sadhanala, 2019

New Advances in Sparse Learning, Deep Networks, and Adversarial Learning: Theory and Applications Hongyang Zhang, 2019

Gradient Descent for Non-convex Problems in Modern Machine Learning Simon Shaolei Du, 2019

Selective Data Acquisition in Learning and Decision Making Problems Yining Wang, 2019

Anomaly Detection in Graphs and Time Series: Algorithms and Applications Bryan Hooi, 2019

Neural dynamics and interactions in the human ventral visual pathway Yuanning Li, 2018

Tuning Hyperparameters without Grad Students: Scaling up Bandit Optimisation Kirthevasan Kandasamy, 2018

Teaching Machines to Classify from Natural Language Interactions Shashank Srivastava, 2018

Statistical Inference for Geometric Data Jisu Kim, 2018

Representation Learning @ Scale Manzil Zaheer, 2018

Diversity-promoting and Large-scale Machine Learning for Healthcare Pengtao Xie, 2018

Distribution and Histogram (DIsH) Learning Junier Oliva, 2018

Stress Detection for Keystroke Dynamics Shing-Hon Lau, 2018

Sublinear-Time Learning and Inference for High-Dimensional Models Enxu Yan, 2018

Neural population activity in the visual cortex: Statistical methods and application Benjamin Cowley, 2018

Efficient Methods for Prediction and Control in Partially Observable Environments Ahmed Hefny, 2018

Learning with Staleness Wei Dai, 2018

Statistical Approach for Functionally Validating Transcription Factor Bindings Using Population SNP and Gene Expression Data Jing Xiang, 2017

New Paradigms and Optimality Guarantees in Statistical Learning and Estimation Yu-Xiang Wang, 2017

Dynamic Question Ordering: Obtaining Useful Information While Reducing User Burden Kirstin Early, 2017

New Optimization Methods for Modern Machine Learning Sashank J. Reddi, 2017

Active Search with Complex Actions and Rewards Yifei Ma, 2017

Why Machine Learning Works George D. Montañez, 2017

Source-Space Analyses in MEG/EEG and Applications to Explore Spatio-temporal Neural Dynamics in Human Vision Ying Yang, 2017

Computational Tools for Identification and Analysis of Neuronal Population Activity Pengcheng Zhou, 2016

Expressive Collaborative Music Performance via Machine Learning Gus (Guangyu) Xia, 2016

Supervision Beyond Manual Annotations for Learning Visual Representations Carl Doersch, 2016

Exploring Weakly Labeled Data Across the Noise-Bias Spectrum Robert W. H. Fisher, 2016

Optimizing Optimization: Scalable Convex Programming with Proximal Operators Matt Wytock, 2016

Combining Neural Population Recordings: Theory and Application William Bishop, 2015

Discovering Compact and Informative Structures through Data Partitioning Madalina Fiterau-Brostean, 2015

Machine Learning in Space and Time Seth R. Flaxman, 2015

The Time and Location of Natural Reading Processes in the Brain Leila Wehbe, 2015

Shape-Constrained Estimation in High Dimensions Min Xu, 2015

Spectral Probabilistic Modeling and Applications to Natural Language Processing Ankur Parikh, 2015

Computational and Statistical Advances in Testing and Learning Aaditya Kumar Ramdas, 2015

Corpora and Cognition: The Semantic Composition of Adjectives and Nouns in the Human Brain Alona Fyshe, 2015

Learning Statistical Features of Scene Images Wooyoung Lee, 2014

Towards Scalable Analysis of Images and Videos Bin Zhao, 2014

Statistical Text Analysis for Social Science Brendan T. O'Connor, 2014

Modeling Large Social Networks in Context Qirong Ho, 2014

Semi-Cooperative Learning in Smart Grid Agents Prashant P. Reddy, 2013

On Learning from Collective Data Liang Xiong, 2013

Exploiting Non-sequence Data in Dynamic Model Learning Tzu-Kuo Huang, 2013

Mathematical Theories of Interaction with Oracles Liu Yang, 2013

Short-Sighted Probabilistic Planning Felipe W. Trevizan, 2013

Statistical Models and Algorithms for Studying Hand and Finger Kinematics and their Neural Mechanisms Lucia Castellanos, 2013

Approximation Algorithms and New Models for Clustering and Learning Pranjal Awasthi, 2013

Uncovering Structure in High-Dimensions: Networks and Multi-task Learning Problems Mladen Kolar, 2013

Learning with Sparsity: Structures, Optimization and Applications Xi Chen, 2013

GraphLab: A Distributed Abstraction for Large Scale Machine Learning Yucheng Low, 2013

Graph Structured Normal Means Inference James Sharpnack, 2013 (Joint Statistics & ML PhD)

Probabilistic Models for Collecting, Analyzing, and Modeling Expression Data Hai-Son Phuoc Le, 2013

Learning Large-Scale Conditional Random Fields Joseph K. Bradley, 2013

New Statistical Applications for Differential Privacy Rob Hall, 2013 (Joint Statistics & ML PhD)

Parallel and Distributed Systems for Probabilistic Reasoning Joseph Gonzalez, 2012

Spectral Approaches to Learning Predictive Representations Byron Boots, 2012

Attribute Learning using Joint Human and Machine Computation Edith L. M. Law, 2012

Statistical Methods for Studying Genetic Variation in Populations Suyash Shringarpure, 2012

Data Mining Meets HCI: Making Sense of Large Graphs Duen Horng (Polo) Chau, 2012

Learning with Limited Supervision by Input and Output Coding Yi Zhang, 2012

Target Sequence Clustering Benjamin Shih, 2011

Nonparametric Learning in High Dimensions Han Liu, 2010 (Joint Statistics & ML PhD)

Structural Analysis of Large Networks: Observations and Applications Mary McGlohon, 2010

Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy Brian D. Ziebart, 2010

Tractable Algorithms for Proximity Search on Large Graphs Purnamrita Sarkar, 2010

Rare Category Analysis Jingrui He, 2010

Coupled Semi-Supervised Learning Andrew Carlson, 2010

Fast Algorithms for Querying and Mining Large Graphs Hanghang Tong, 2009

Efficient Matrix Models for Relational Learning Ajit Paul Singh, 2009

Exploiting Domain and Task Regularities for Robust Named Entity Recognition Andrew O. Arnold, 2009

Theoretical Foundations of Active Learning Steve Hanneke, 2009

Generalized Learning Factors Analysis: Improving Cognitive Models with Machine Learning Hao Cen, 2009

Detecting Patterns of Anomalies Kaustav Das, 2009

Dynamics of Large Networks Jurij Leskovec, 2008

Computational Methods for Analyzing and Modeling Gene Regulation Dynamics Jason Ernst, 2008

Stacked Graphical Learning Zhenzhen Kou, 2007

Actively Learning Specific Function Properties with Applications to Statistical Inference Brent Bryan, 2007

Approximate Inference, Structure Learning and Feature Estimation in Markov Random Fields Pradeep Ravikumar, 2007

Scalable Graphical Models for Social Networks Anna Goldenberg, 2007

Measure Concentration of Strongly Mixing Processes with Applications Leonid Kontorovich, 2007

Tools for Graph Mining Deepayan Chakrabarti, 2005

Automatic Discovery of Latent Variable Models Ricardo Silva, 2005


Instructions for MSc Thesis

Before the thesis

Before you start work on your thesis, it is important to put some thought into the choice of topic and familiarize yourself with the criteria and procedure. To do that, follow these steps, in this order:

Step 0: Read the university instructions.

Read the MSc thesis instructions and grading criteria on the university website. Computer Science Master's program: [link]. Data Science Master's program: [link].

Step 1: Choose a topic.

Choose a topic among the ones listed on the group's webpage [link].

You can also propose your own topic. In this case, you must explain what the main contribution of the thesis will be and identify at least one scientific publication that is related to the topic you propose.

Step 2: Contact us.

Submit the application form [link] to let us know of your interest in doing your thesis in the group. Note: If you contact us, please be ready to start work on the thesis within one month.

Step 3: Agree on the topic.

We have a brief discussion about the topic and devise a high-level plan for thesis work and content. We also agree on a start date, when you start work on the thesis. In addition, you should contact a second evaluator for the thesis.

Thesis timeline

Below you find the milestones after you have started work on the thesis. In parentheses, you find an estimate of when each milestone occurs. The thesis work ends when you submit it for approval. The total duration from start to end of the thesis should be about four months.

Milestone #0: Thesis outline (at most 3 weeks from the start).

You create a first outline of the thesis. The outline should contain the titles of the chapters, along with a (tentative) list of sections and contents. An indicative template for the outline is shown below on this page.

Milestone #1: A draft with first results (about 2 months from start) .

All chapters should contain some readable content (not necessarily polished). Most importantly, some results should already be described. Ideally, you should be able to complete and refine the results within one more month.

Milestone #2: A draft with all results (about 1 month before the end).

Most content should now be in the draft. Some polishing remains and some results may still be refined. Notify the second evaluator that you are near the end of the thesis work. Optionally, you may send the thesis draft and receive preliminary comments from the second evaluator.

Milestone #3: Submit the thesis for approval (end of thesis work).

You will receive a grade and comments after the next program board's meeting.
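As a rough planning aid, the milestone estimates above can be turned into calendar dates. The sketch below is only illustrative: the function name `thesis_milestones`, the 30-day "month", and the 120-day total duration are assumptions made for this example, not part of the official instructions.

```python
from datetime import date, timedelta


def thesis_milestones(start: date, total_days: int = 120) -> dict:
    """Approximate milestone dates from a start date.

    Assumptions (not from the instructions): a month is 30 days and the
    whole thesis takes `total_days` days (about four months).
    """
    month = timedelta(days=30)
    end = start + timedelta(days=total_days)
    return {
        "Milestone #0: outline (latest)": start + timedelta(weeks=3),
        "Milestone #1: draft with first results": start + 2 * month,
        "Milestone #2: draft with all results": end - month,
        "Milestone #3: submit for approval": end,
    }


if __name__ == "__main__":
    # Hypothetical start date, chosen only for illustration.
    for name, day in thesis_milestones(date(2024, 1, 8)).items():
        print(f"{day.isoformat()}  {name}")
```

The dates are estimates only; the actual schedule is the one agreed with the supervisor.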


What you can expect from the supervisor:

  • Comments for the thesis draft after each milestone (see timeline above) and, if necessary, a meeting.
  • Suggestions for how to proceed in cases when you encounter a major hurdle.

In addition, you are welcome to participate in the group meetings and discuss your thesis work with other group members.

Note however that one of the grading criteria for the thesis is whether you worked independently -- and in the end, the thesis should be your own work.

Template for Thesis Outline

Below you find a suggested template for the outline of the thesis. You may adapt it to your work, of course (e.g., change chapter titles or structure).

A summary of the thesis that mentions the broader topic of the thesis and why it is important; the research question or technical problem addressed by the thesis; the main thesis contributions (e.g., data gathering, developed methods and algorithms, experimental evaluation) and results.

Chapter 1: Introduction

The introduction should motivate the thesis and give a longer summary. It should be written in a way that allows anyone in your program to understand it, even if they are not experts in the topic.

  • What is the broader topic of the thesis?
  • Why is it important?
  • What research question(s) or technical problems does the thesis address?
  • What are the most related works from the literature on the topic? How does the thesis differ from what has already been done?
  • What are the main thesis contributions (e.g., data gathering, developed methods and algorithms, experimental evaluation)?
  • What are the results?

Chapter 2: Related literature

Organize this chapter in sections, with one section for each research area that is related to your thesis. For each research area, cite all the publications that are related to your topic, and describe at least the most important of them.

Chapter 3: Preliminaries

In this chapter, place the information that is necessary for you to describe the contributions and results of the thesis. It may be different from thesis to thesis, but could include sections about:

  • Setting. Define the terms and notation you will be using. State any assumptions you make across the thesis.
  • Background on Methods. Describe existing methods from the literature (e.g., algorithms or ML models) that you use for your work.
  • Data (esp. for a Data Science thesis). If the main contribution is data analysis, then describe the data here, before the analysis.

Chapter 4: Methodological contribution

For a Computer Science thesis, this part typically describes the algorithm(s) developed for the thesis. For a Data Science thesis, this part typically describes the method for the analysis.

Chapter 5: Results

This chapter describes the results obtained when the methods of Chapter 4 are used on data.

For a Computer Science thesis, this part typically describes the performance of the developed algorithm(s) on various synthetic and real datasets. For a Data Science thesis, this part typically describes the findings of the analysis.

The chapter should also describe what insights are obtained from the results.

Chapter 6: Conclusion

  • Summarize the contribution of the thesis.
  • Provide an evaluation: are the results conclusive, are there limitations in the contribution?
  • How would you extend the thesis, what can be done next on the same topic?


Data science masters theses.

The Master of Science in Data Science program requires the successful completion of 12 courses to obtain a degree. These requirements cover six core courses, a leadership or project management course, two required courses corresponding to a declared specialization, two electives, and a capstone project or thesis. This collection contains a selection of masters theses or capstone projects by MSDS graduates.

Collection Details

Chair of Data Science

Thesis guide.

This short guide gives you brief guidelines on how to organise and conduct your thesis. It covers both master's and bachelor's theses.

Managing your Thesis

A thesis is a complex task that, like a software project, has to be managed properly. It can be seen as having four stages, which are iterated several times.

Define the goals of your thesis together with your supervisor. Write them down as a bullet-point list. This list should have 3-5 points. If there are more, you have to aggregate them, and if you cannot aggregate them, you have set yourself too many goals. Note that goals are not tasks. Goals determine what you want to find out with your thesis and form the basis of the research questions. Most likely you will refine your goals throughout the thesis. That is OK, because your knowledge of the domain improves and you are therefore able to write down more accurate questions.

It is often underestimated, but very important, to find out what others have done. Most likely others have worked on similar topics, so you have to search for their work. Use search engines like Google Scholar or grab a recent book on the topic of your thesis and start your literature research from there. In-depth research helps you avoid the unpleasant surprise of discovering, after 6 months of work, that somebody else has already found the same solution. Moreover, it builds up your background knowledge of the domain. Only with sufficient background knowledge are you able to make the right decisions.

Good implementation starts with a work plan that has tasks, milestones (i.e., what to achieve when) and thoughts on how to get there. You do not have to create a full-fledged Gantt or PERT chart; a simple task list might be sufficient to structure your work. Use an issue tracking system or something similar, because it also helps you keep track of things you have already tried (and it feels good to close open issues).

Evaluation is the most important part, and it is often overlooked. You have invested most of your time in implementing your goals, so evaluation seems annoying and time-consuming. However, it is the evaluation of your system that answers your research questions or lets you judge whether you achieved your goals. So plan your evaluation before the implementation by writing down a coarse-grained evaluation plan, and reserve enough time for it.

Note that this is not a linear but an iterative process. In your evaluation you may discover that something does not work out, so you have to adapt your evaluation or even your goals. That is fine, since you learn as you go. If you knew everything from the beginning, there would be no need for research.

Templates and Further Resources

  • Structure+Hints for Seminar, Bachelor, Master and PhD Talks
  • Latex Template for MA/BA Thesis
  • Scientific Writing
  • Hints on Scientific Presentations - although focused on Theoretical Computer Science, most parts are also relevant for Computer Science in general (where proofs are not given in a formal way but by implementation and/or empirical analysis)

How to conduct a good scientific thesis

Regardless of whether you are a Bachelor, Master or PhD student (or later on a researcher), there is always the central question of how to conduct a good thesis. Sadly, there are no strict “rules” for doing so, and it requires a lot of expertise. Fortunately, some simple tips allow you to bootstrap the quality of your writing, your presentations and your thesis in general, and enable you to develop a critical but open mind, the most important tool for any future career. In this article I want to give you some tips on how to bootstrap your scientific skills. So what are scientific skills? Science is about discovering things nobody has known before and explaining your discovery to other people. Explaining your discovery is important, since it

  • allows others to validate your work
  • helps yourself in gaining a deeper understanding
  • enables future discoveries based on what you have found out.

That is true not only for research, but for nearly any industry job for academically educated persons. How do you convince your boss that your solution for a particular project is the best one? How do you argue that the current roadmap does not make sense? How do you convince your team members to invest time in a particular functionality? How do you judge the validity of your own decisions? Conducting a good scientific thesis requires you to do a decent job and communicate the results. Hence it involves

  • Reading Skills
  • Writing Skills
  • Presentation Skills
  • Discussions
  • Critical Thinking and Self-Reflection

I will cover these points below and give tips on how to bootstrap the particular skills. In my seminars, and when I have to judge theses, I apply those principles as criteria for the grade you obtain. So at least my students should take the advice seriously ;) Disclaimer: it is by no means exhaustive and only my personal opinion and experience. The most important thing is to find your own way, while giving credit to all the people who walked the way before and trying to learn from their experience. This attitude is important not only in science and research.

One of the most important skills is motivation. Once upon a time a student asked the teacher how to learn uninteresting stuff efficiently. The short answer is that you can't, at least not beyond the exam. So the key insight about motivation is to choose things you really like to do. Students often make the mistake of going the easy way, taking courses where the exams are “easy going”. That is a waste of time, since it is by far harder to learn boring stuff than really interesting stuff. Think of when you have watched a really entertaining movie. You can remember nearly all the details and you had a good feeling after watching it. How do you feel after a bad movie?

So when you are choosing a topic for a scientific thesis, take one that makes you feel every day like you have watched the best movie ever. Most supervisors give you the freedom to adjust a particular thesis topic so that it better fits your motivation. So engage in a critical discussion with your supervisor about what you want to do and what not. Of course this requires you to go deeper into potential topics and to explore what best fits your interests. But especially when you are studying computer science, you should have a natural habit of being attracted by strange, nerdy stuff. Key lesson: learn to motivate yourself and take topics that you are excited about.

  • Read with a purpose (start with questions about the article; the more specific, the better).
  • Discuss what you read with others.
  • Write a short summary of your key findings, either graphically (Mindmaps, Rhino Maps) or textually.

Write often; write concisely. Writing requires you to express your (parallel, mostly non-linear) thoughts in a linear medium, and this is very difficult. But the medium forces you to express your thoughts clearly and to connect each thought to the next (the so-called flow, or “red thread” in German). Learning good writing techniques fills whole courses and books. However, for getting started there are five simple rules:

  • Write down the 3-6 most important questions you want to answer, with at most two sentences per question.
  • Write down the motivation: why are those questions important?
  • Give an example for every question.
  • Write down how you will answer the questions.
  • Write down the answers to the questions.

If the motivation (second rule) and the examples (third rule) have nothing in common, you have put the wrong questions together in the same box.

Presentations follow the same rules as writing, with the exception that an oral presentation does not allow you to present all your details. So avoid explaining every little detail, because you will lose the audience. In general, your talk should be structured similarly to your written thesis:

  • Motivation: Why is your talk relevant? Give an example.
  • State-of-the-Art: What did others do in the context of your talk?
  • Questions: What questions will you answer in this talk?
  • Methodology: How will you answer the questions?
  • Experiments: How did you implement the methodology?
  • Evaluation: What is the answer to every question? What did you learn?
  • Future Work: What are the loose ends of your work? (A work without loose ends is most often not very helpful.)

Keep that simple structure and use the questions to guide the audience through your thoughts and findings. Stick to the KISS principle (Keep it short and simple). Your audience will appreciate it.

Technical University of Munich

  • Data Analytics and Machine Learning Group
  • TUM School of Computation, Information and Technology
  • Technical University of Munich

Open Topics

We offer multiple Bachelor/Master theses, Guided Research projects and IDPs in the area of data mining/machine learning. A  non-exhaustive list of open topics is listed below.

If you are interested in a thesis or a guided research project, please send your CV and transcript of records to Prof. Stephan Günnemann via email and we will arrange a meeting to talk about the potential topics.

Robustness of Large Language Models

Type: Master's Thesis


  • Strong knowledge in machine learning
  • Very good coding skills
  • Proficiency with Python and deep learning frameworks (TensorFlow or PyTorch)
  • Knowledge about NLP and LLMs


The success of Large Language Models (LLMs) has precipitated their deployment across a diverse range of applications. With the integration of plugins enhancing their capabilities, it becomes imperative to ensure that the governing rules of these LLMs are foolproof and immune to circumvention. Recent studies have exposed significant vulnerabilities inherent to these models, underlining an urgent need for more rigorous research to fortify their resilience and reliability. A focus in this work will be the understanding of the working mechanisms of these attacks.

We are currently seeking students for the upcoming Summer Semester of 2024, so we welcome prompt applications. 

Contact: Tom Wollschläger


  • Universal and Transferable Adversarial Attacks on Aligned Language Models
  • Attacking Large Language Models with Projected Gradient Descent
  • Representation Engineering: A Top-Down Approach to AI Transparency
  • Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks

Generative Models for Drug Discovery

Type: Master Thesis / Guided Research

  • Strong machine learning knowledge
  • Proficiency with Python and deep learning frameworks (PyTorch or TensorFlow)
  • Knowledge of graph neural networks (e.g. GCN, MPNN)
  • No formal education in chemistry, physics or biology needed!

Effectively designing molecular geometries is essential to advancing pharmaceutical innovations, a domain which has experienced great attention through the success of generative models. These models promise a more efficient exploration of the vast chemical space and generation of novel compounds with specific properties by leveraging their learned representations, potentially leading to the discovery of molecules with unique properties that would otherwise go undiscovered. Our topics lie at the intersection of generative models like diffusion/flow matching models and graph representation learning, e.g., graph neural networks. The focus of our projects can be model development with an emphasis on downstream tasks ( e.g., diffusion guidance at inference time ) and a better understanding of the limitations of existing models.

Contact :  Johanna Sommer , Leon Hetzel

  • Equivariant Diffusion for Molecule Generation in 3D
  • Equivariant Flow Matching with Hybrid Probability Transport for 3D Molecule Generation
  • Structure-based Drug Design with Equivariant Diffusion Models

Data Pruning and Active Learning

Type: Interdisciplinary Project (IDP) / Hiwi / Guided Research / Master's Thesis

Data pruning and active learning are vital techniques in scaling machine learning applications efficiently. Data pruning involves the removal of redundant or irrelevant data, which enables training models with considerably less data but the same performance. Similarly, active learning describes the process of selecting the most informative data points for labeling, thus reducing annotation costs and accelerating model training. However, current methods are often computationally expensive, which makes them difficult to apply in practice. Our objective is to scale active learning and data pruning methods to large datasets using an extrapolation-based approach.
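To make the idea of "selecting the most informative data points" concrete, here is a minimal sketch of entropy-based uncertainty sampling, one standard active-learning baseline. It is not the extrapolation-based approach mentioned above; the function names and the toy probabilities are assumptions for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(pool, k):
    """Return the ids of the k unlabeled samples with the highest predictive entropy.

    `pool` maps a sample id to the model's predicted class probabilities
    (a hypothetical input format chosen for this sketch).
    """
    ranked = sorted(pool, key=lambda sample_id: entropy(pool[sample_id]), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Toy predictions for three unlabeled samples.
    pool = {"a": [0.98, 0.02], "b": [0.5, 0.5], "c": [0.7, 0.3]}
    print(select_for_labeling(pool, k=2))
```

Scoring every pool sample requires a forward pass per sample, which hints at why such methods become expensive on large datasets.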

Contact: Sebastian Schmidt , Tom Wollschläger , Leo Schwinn

  • Large-scale Dataset Pruning with Dynamic Uncertainty

Efficient Machine Learning: Pruning, Quantization, Distillation, and More - DAML x Pruna AI

Type: Master's Thesis / Guided Research / Hiwi

The efficiency of machine learning algorithms is commonly evaluated by looking at target performance, speed and memory footprint metrics. Reducing the costs associated with these metrics is of primary importance for real-world applications with limited resources (e.g. embedded systems, real-time predictions). In this project, you will work in collaboration with the DAML research group and the Pruna AI startup on investigating solutions to improve the efficiency of machine learning models by looking at multiple techniques like pruning, quantization, distillation, and more.
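As a minimal illustration of two of the techniques named above, the sketch below applies magnitude pruning and symmetric 8-bit quantization to a flat list of weights. It uses plain Python rather than a deep learning framework, and all function names are assumptions for this example, not an API of any particular library.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest absolute value."""
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def quantize_int8(weights):
    """Symmetric quantization to integers in [-127, 127] with a single scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Map the integers back to approximate floating-point weights."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.5, -0.1, 0.05, -1.27]  # toy values
    print(magnitude_prune(weights, sparsity=0.5))
    q, scale = quantize_int8(weights)
    print(q, dequantize(q, scale))
```

Pruning trades accuracy for sparsity, while quantization trades it for lower memory per weight; the rounding error of this scheme is at most half the scale per weight.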

Contact: Bertrand Charpentier

  • The Efficiency Misnomer
  • A Gradient Flow Framework for Analyzing Network Pruning
  • Distilling the Knowledge in a Neural Network
  • A Survey of Quantization Methods for Efficient Neural Network Inference

Deep Generative Models

Type:  Master Thesis / Guided Research

  • Strong machine learning and probability theory knowledge
  • Knowledge of generative models and their basics (e.g., Normalizing Flows, Diffusion Models, VAE)
  • Optional: Neural ODEs/SDEs, Optimal Transport, Measure Theory

With recent advances, such as Diffusion Models, Transformers, Normalizing Flows, Flow Matching, etc., the field of generative models has gained significant attention in the machine learning and artificial intelligence research community. However, many problems and questions remain open, and the application to complex data domains such as graphs, time series, point processes, and sets is often non-trivial. We are interested in supervising motivated students to explore and extend the capabilities of state-of-the-art generative models for various data domains.
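For readers new to diffusion models, the following sketch shows the forward (noising) process of a denoising diffusion model in closed form: x_t = sqrt(ᾱ_t) · x_0 + sqrt(1 − ᾱ_t) · ε, where ᾱ_t is the running product of (1 − β_s). The linear schedule and its default endpoints are common choices assumed here for illustration, and the function names are hypothetical.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1 .. beta_T (assumed defaults)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, product = [], 1.0
    for beta in betas:
        product *= 1.0 - beta
        alpha_bars.append(product)
    return alpha_bars


def noisy_sample(x0, alpha_bar_t, noise):
    """Closed-form sample of x_t given x_0 and standard normal noise."""
    signal = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    return [signal * x + sigma * e for x, e in zip(x0, noise)]


if __name__ == "__main__":
    alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
    x0 = [1.0, -0.5]  # toy data point
    noise = [random.gauss(0.0, 1.0) for _ in x0]
    print(noisy_sample(x0, alpha_bars[499], noise))
```

Extending this process to graphs, sets or point processes is non-trivial precisely because adding Gaussian noise coordinate-wise is not always meaningful for such data.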

Contact : Marcel Kollovieh , David Lüdke

  • Flow Matching for Generative Modeling
  • Auto-Encoding Variational Bayes
  • Denoising Diffusion Probabilistic Models 
  • Structured Denoising Diffusion Models in Discrete State-Spaces

Graph Structure Learning

Type:  Guided Research / Hiwi

  • Optional: Knowledge of graph theory and mathematical optimization

Graph deep learning is a powerful ML concept that enables the generalisation of successful deep neural architectures to non-Euclidean structured data. Such methods have shown promising results in a vast range of applications spanning the social sciences, biomedicine, particle physics, computer vision, graphics and chemistry. One of the major limitations of most current graph neural network architectures is that they often rely on the assumption that the underlying graph is known and fixed. However, this assumption is not always true, as the graph may be noisy or partially and even completely unknown. In the case of noisy or partially available graphs, it would be useful to jointly learn an optimised graph structure and the corresponding graph representations for the downstream task. On the other hand, when the graph is completely absent, it would be useful to infer it directly from the data. This is particularly interesting in inductive settings where some of the nodes were not present at training time. Furthermore, learning a graph can become an end in itself, as the inferred structure can provide complementary insights with respect to the downstream task. In this project, we aim to investigate solutions and devise new methods to construct an optimal graph structure based on the available (unstructured) data.
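When the graph is completely absent, a simple non-learned baseline is to connect each data point to its k nearest neighbours in feature space. The sketch below builds such a k-NN graph; it is only a reference point against which learned structures could be compared, and the function name is an assumption for this example.

```python
import math


def knn_graph(points, k):
    """Return sorted undirected edges linking each point to its k nearest neighbours.

    `points` is a list of coordinate tuples; distances are Euclidean.
    """
    edges = set()
    for i, p in enumerate(points):
        distances = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        for _, j in distances[:k]:
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)


if __name__ == "__main__":
    # Two well-separated toy clusters.
    points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
    print(knn_graph(points, k=1))
```

A fixed k-NN graph ignores the downstream task, which is exactly the limitation that jointly learning the structure aims to address.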

Contact : Filippo Guerranti

  • A Survey on Graph Structure Learning: Progress and Opportunities
  • Differentiable Graph Module (DGM) for Graph Convolutional Networks
  • Learning Discrete Structures for Graph Neural Networks
  • NodeFormer: A Scalable Graph Structure Learning Transformer for Node Classification

A Machine Learning Perspective on Corner Cases in Autonomous Driving Perception  

Type: Master's Thesis 

Industrial partner: BMW 


  • Strong knowledge in machine learning 
  • Knowledge of Semantic Segmentation  
  • Good programming skills 
  • Proficiency with Python and deep learning frameworks (TensorFlow or PyTorch) 


In autonomous driving, state-of-the-art deep neural networks are used for perception tasks such as semantic segmentation. While the environment in datasets is controlled, novel classes or unknown disturbances can occur in real-world applications. To provide safe autonomous driving, these cases must be identified.

The objective is to explore novel class segmentation and out of distribution approaches for semantic segmentation in the context of corner cases for autonomous driving. 

Contact: Sebastian Schmidt


  • Segmenting Known Objects and Unseen Unknowns without Prior Knowledge 
  • Efficient Uncertainty Estimation for Semantic Segmentation in Videos  
  • Natural Posterior Network: Deep Bayesian Uncertainty for Exponential Family  
  • Description of Corner Cases in Automated Driving: Goals and Challenges 

Active Learning for Multi Agent 3D Object Detection 

Type: Master's Thesis

Industrial partner: BMW

  • Knowledge in Object Detection 
  • Excellent programming skills 

In autonomous driving, state-of-the-art deep neural networks are used for perception tasks such as 3D object detection. To provide promising results, these networks often require a lot of complex annotation data for training. These annotations are often costly and redundant. Active learning is used to select the most informative samples for annotation and to cover a dataset with as little annotated data as possible.

The objective is to explore active learning approaches for 3D object detection using combined uncertainty and diversity based methods.  
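One way to picture a combined uncertainty and diversity criterion is a greedy selection that scores each candidate by its uncertainty plus its distance to the samples already chosen. The sketch below is a toy illustration under that assumption; the scoring rule, the `diversity_weight` parameter and the function name are hypothetical and do not come from the referenced papers.

```python
import math


def select_batch(features, uncertainty, k, diversity_weight=0.5):
    """Greedily pick k sample ids, trading off uncertainty against diversity.

    `features` maps sample id -> feature tuple, `uncertainty` maps
    sample id -> score (higher means more uncertain).
    """
    if k <= 0 or not features:
        return []
    remaining = set(features)
    first = max(remaining, key=lambda s: uncertainty[s])
    selected = [first]
    remaining.remove(first)
    while remaining and len(selected) < k:

        def score(s):
            # Distance to the closest already-selected sample rewards diversity.
            nearest = min(math.dist(features[s], features[t]) for t in selected)
            return uncertainty[s] + diversity_weight * nearest

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    features = {"a": (0.0,), "b": (0.1,), "c": (5.0,)}
    uncertainty = {"a": 0.9, "b": 0.85, "c": 0.5}
    print(select_batch(features, uncertainty, k=2))
```

With `diversity_weight=0.0` the procedure reduces to pure uncertainty sampling, which tends to pick near-duplicate samples such as "a" and "b" in the toy data.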

  • Exploring Diversity-based Active Learning for 3D Object Detection in Autonomous Driving   
  • Efficient Uncertainty Estimation for Semantic Segmentation in Videos   
  • KECOR: Kernel Coding Rate Maximization for Active 3D Object Detection
  • Towards Open World Active Learning for 3D Object Detection   

Graph Neural Networks

Type:  Master's thesis / Bachelor's thesis / guided research

  • Knowledge of graph/network theory

Graph neural networks (GNNs) have recently achieved great successes in a wide variety of applications, such as chemistry, reinforcement learning, knowledge graphs, traffic networks, or computer vision. These models leverage graph data by updating node representations based on messages passed between nodes connected by edges, or by transforming node representations using spectral graph properties. These approaches are very effective, but many theoretical aspects of these models remain unclear and there are many possible extensions to improve GNNs and go beyond the nodes' direct neighbors and simple message aggregation.
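The message-passing update described above can be sketched in a few lines. The example performs one round of mean aggregation over each node and its direct neighbours; it deliberately omits the learned weight matrices and non-linearities of a real GNN layer, and the function name is an assumption for this illustration.

```python
def message_passing_step(features, edges):
    """One round of mean aggregation over each node and its direct neighbours.

    `features` maps node -> list of floats; `edges` is a list of
    undirected (u, v) pairs.
    """
    neighbours = {node: [node] for node in features}  # include a self-loop
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)
    updated = {}
    for node, nbrs in neighbours.items():
        dim = len(features[node])
        updated[node] = [
            sum(features[n][d] for n in nbrs) / len(nbrs) for d in range(dim)
        ]
    return updated


if __name__ == "__main__":
    # Path graph 0 - 1 - 2 with one-dimensional toy features.
    features = {0: [1.0], 1: [3.0], 2: [5.0]}
    print(message_passing_step(features, [(0, 1), (1, 2)]))
```

Stacking several such rounds lets information travel beyond direct neighbours, one hop per round.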

Contact: Simon Geisler

  • Semi-supervised classification with graph convolutional networks
  • Relational inductive biases, deep learning, and graph networks
  • Diffusion Improves Graph Learning
  • Weisfeiler and leman go neural: Higher-order graph neural networks
  • Reliable Graph Neural Networks via Robust Aggregation

Physics-aware Graph Neural Networks

Type:  Master's thesis / guided research

  • Proficiency with Python and deep learning frameworks (JAX or PyTorch)
  • Knowledge of graph neural networks (e.g. GCN, MPNN, SchNet)
  • Optional: Knowledge of machine learning on molecules and quantum chemistry

Deep learning models, especially graph neural networks (GNNs), have recently achieved great successes in predicting quantum mechanical properties of molecules. There is a vast number of applications for these models, such as finding the best method of chemical synthesis or selecting candidates for drugs, construction materials, batteries, or solar cells. However, GNNs have only been proposed in recent years and there remain many open questions about how to best represent and leverage quantum mechanical properties and methods.

Contact: Nicholas Gao

  • Directional Message Passing for Molecular Graphs
  • Neural message passing for quantum chemistry
  • Learning to Simulate Complex Physics with Graph Network
  • Ab initio solution of the many-electron Schrödinger equation with deep neural networks
  • Ab-Initio Potential Energy Surfaces by Pairing GNNs with Neural Wave Functions
  • Tensor field networks: Rotation- and translation-equivariant neural networks for 3D point clouds

Robustness Verification for Deep Classifiers

Type: Master's thesis / Guided research

  • Strong machine learning knowledge (at least equivalent to IN2064 plus an advanced course on deep learning)
  • Strong background in mathematical optimization (preferably combined with Machine Learning setting)
  • Proficiency with Python and deep learning frameworks (PyTorch or TensorFlow)
  • (Preferred) Knowledge of training techniques to obtain classifiers that are robust against small perturbations in data

Description: Recent work shows that deep classifiers suffer in the presence of adversarial examples: misclassified points that are very close to the training samples or even visually indistinguishable from them. This undesired behaviour constrains the deployment of promising neural-network-based classification methods in safety-critical scenarios. Therefore, new training methods should be proposed that promote (or preferably ensure) robust behaviour of the classifier around training samples.
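To illustrate both notions on the simplest possible model, the sketch below attacks a linear classifier with the fast gradient sign method (FGSM) and computes its exact L∞ robustness radius. For a linear score s(x) = w·x + b with logistic loss and label y ∈ {−1, +1}, the sign of the loss gradient with respect to x is −y · sign(w), and the prediction cannot change for perturbations smaller than |s(x)| / ‖w‖₁. Deep classifiers do not admit such a closed form, which is what makes verification hard; all names and numbers here are illustrative assumptions.

```python
def score(w, b, x):
    """Linear decision score; the predicted label is its sign."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def sign(v):
    return (v > 0) - (v < 0)


def fgsm_linear(w, x, y, eps):
    """FGSM step for a linear model with logistic loss and label y in {-1, +1}."""
    return [xi - eps * y * sign(wi) for wi, xi in zip(w, x)]


def certified_radius_linf(w, b, x):
    """Largest L-infinity radius within which the prediction cannot change.

    Assumes w has at least one non-zero entry.
    """
    return abs(score(w, b, x)) / sum(abs(wi) for wi in w)


if __name__ == "__main__":
    w, b = [1.0, -2.0], 0.0  # toy classifier
    x, y = [0.3, 0.1], 1     # correctly classified point
    print(certified_radius_linf(w, b, x))
    print(score(w, b, fgsm_linear(w, x, y, eps=0.1)))
```

In the toy example the attack budget of 0.1 exceeds the certified radius, so the perturbed point crosses the decision boundary.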

Contact: Aleksei Kuvshinov

References (Background):

  • Intriguing properties of neural networks
  • Explaining and harnessing adversarial examples
  • SoK: Certified Robustness for Deep Neural Networks
  • Certified Adversarial Robustness via Randomized Smoothing
  • Formal guarantees on the robustness of a classifier against adversarial manipulation
  • Towards deep learning models resistant to adversarial attacks
  • Provable defenses against adversarial examples via the convex outer adversarial polytope
  • Certified defenses against adversarial examples
  • Lipschitz-margin training: Scalable certification of perturbation invariance for deep neural networks

Uncertainty Estimation in Deep Learning

Type: Master's Thesis / Guided Research

  • Strong knowledge in probability theory

Safe prediction is a key feature in many intelligent systems. Classically, Machine Learning models compute output predictions regardless of the underlying uncertainty of the encountered situations. In contrast, aleatoric and epistemic uncertainty bring knowledge about undecidable and uncommon situations. The uncertainty view can be a substantial help to detect and explain unsafe predictions, and therefore make ML systems more robust. The goal of this project is to improve the uncertainty estimation in ML models in various types of tasks.
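One common way to separate the two kinds of uncertainty uses an ensemble of classifiers: the entropy of the averaged prediction (total uncertainty) splits into the average entropy of the individual members (aleatoric part) and the remainder, which measures disagreement between members (epistemic part). The sketch below is an illustration of that decomposition only, not the method of the referenced papers; the function names are assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def decompose_uncertainty(member_probs):
    """Return (total, aleatoric, epistemic) uncertainty for one input.

    `member_probs` holds one class-probability vector per ensemble member.
    """
    n = len(member_probs)
    num_classes = len(member_probs[0])
    mean = [sum(m[c] for m in member_probs) / n for c in range(num_classes)]
    total = entropy(mean)
    aleatoric = sum(entropy(m) for m in member_probs) / n
    return total, aleatoric, total - aleatoric


if __name__ == "__main__":
    # Members disagree confidently: uncertainty is mostly epistemic.
    print(decompose_uncertainty([[1.0, 0.0], [0.0, 1.0]]))
    # Members agree that the input is ambiguous: uncertainty is aleatoric.
    print(decompose_uncertainty([[0.5, 0.5], [0.5, 0.5]]))
```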

Contact: Tom Wollschläger ,   Dominik Fuchsgruber ,   Bertrand Charpentier

  • Can You Trust Your Model’s Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift
  • Predictive Uncertainty Estimation via Prior Networks
  • Posterior Network: Uncertainty Estimation without OOD samples via Density-based Pseudo-Counts
  • Evidential Deep Learning to Quantify Classification Uncertainty
  • Weight Uncertainty in Neural Networks

Hierarchies in Deep Learning

Type:  Master's Thesis / Guided Research

Multi-scale structures are ubiquitous in real-life datasets. As an example, phylogenetic nomenclature naturally reveals a hierarchical classification of species based on their evolutionary history. Learning multi-scale structures can help to reveal natural and meaningful organizations in the data and also to obtain compact data representations. The goal of this project is to leverage multi-scale structures to improve the speed, performance and understanding of Deep Learning models.

Contact: Marcel Kollovieh , Bertrand Charpentier

  • Tree Sampling Divergence: An Information-Theoretic Metric for Hierarchical Graph Clustering
  • Hierarchical Graph Representation Learning with Differentiable Pooling
  • Gradient-based Hierarchical Clustering
  • Gradient-based Hierarchical Clustering using Continuous Representations of Trees in Hyperbolic Space

MIT Libraries home DSpace@MIT

This collection of MIT Theses in DSpace contains selected theses and dissertations from all MIT departments. Please note that this is NOT a complete collection of MIT theses. To search all MIT theses, use MIT Libraries' catalog .

MIT's DSpace contains more than 58,000 theses completed at MIT dating as far back as the mid 1800's. Theses in this collection have been scanned by the MIT Libraries or submitted in electronic format by thesis authors. Since 2004 all new Masters and Ph.D. theses are scanned and added to this collection after degrees are awarded.

MIT Theses are openly available to all readers. Please share how this access affects or benefits you. Your story matters.

If you have questions about MIT theses in DSpace, [email protected] . See also Access & Availability Questions or About MIT Theses in DSpace .

If you are a recent MIT graduate, your thesis will be added to DSpace within 3-6 months after your graduation date. Please email [email protected] with any questions.


MIT Theses may be protected by copyright. Please refer to the MIT Libraries Permissions Policy for permission information. Note that the copyright holder for most MIT theses is identified on the title page of the thesis.

Theses by Department

  • Comparative Media Studies
  • Computation for Design and Optimization
  • Computational and Systems Biology
  • Department of Aeronautics and Astronautics
  • Department of Architecture
  • Department of Biological Engineering
  • Department of Biology
  • Department of Brain and Cognitive Sciences
  • Department of Chemical Engineering
  • Department of Chemistry
  • Department of Civil and Environmental Engineering
  • Department of Earth, Atmospheric, and Planetary Sciences
  • Department of Economics
  • Department of Electrical Engineering and Computer Sciences
  • Department of Humanities
  • Department of Linguistics and Philosophy
  • Department of Materials Science and Engineering
  • Department of Mathematics
  • Department of Mechanical Engineering
  • Department of Nuclear Science and Engineering
  • Department of Ocean Engineering
  • Department of Physics
  • Department of Political Science
  • Department of Urban Studies and Planning
  • Engineering Systems Division
  • Harvard-MIT Program of Health Sciences and Technology
  • Institute for Data, Systems, and Society
  • Media Arts & Sciences
  • Operations Research Center
  • Program in Real Estate Development
  • Program in Writing and Humanistic Studies
  • Science, Technology & Society
  • Science Writing
  • Sloan School of Management
  • Supply Chain Management
  • System Design & Management
  • Technology and Policy Program

Collections in this community

  • Doctoral Theses
  • Graduate Theses
  • Undergraduate Theses

Recent Submissions

  • Peptide-Functionalized Layer-by-Layer Nanoparticles Demonstrate Improved Blood Brain Barrier Permeability for Glioblastoma Treatment
  • Nanoparticle superlattice processing: monodispersed building-blocks & single crystal films
  • Machine learning methods for discovering metabolite structures from mass spectra



Bachelor and Master Theses

BSc (96), MSc (74), MSc SciComp (15), Diploma (5)

  • Nicolas Hellthaler: Footnote-Augmented Documents for Passage Retrieval , Bachelor Thesis, February 2024.
  • Simon Gimmini: Exploring Temporal Patterns in Art Through Diffusion Models , Master Thesis, February 2024
  • Xingqi Cheng:  A Rule-based Post-processor for Temporal Knowledge Graph Extrapolation , Master Thesis, January 2024.
  • Raphael Ebner: Leveraging Large Language Models for Information Extraction and Knowledge Representation , Bachelor Thesis, January 2024.
  • Angelina Basova: Table Extraction from PDF Documents , Master Thesis, December 2023
  • Milena Bruseva:  Benchmarking Vector Databases: A Framework for Evaluating Embedding Based Retrieval , Master Thesis, December 2023.
  • Luis Wettach:  Medical Electronic Data Capture at Home – A Privacy Compliant Framework , Master Thesis, December 2023.
  • Jayson Pyanowski:  Semantic Search with Contextualized Query Generation , Master Thesis, December 2023.
  • Philipp Göldner: Information Retrieval using Sparse Embeddings , Master Thesis, December 2023.
  • Vivian Kazakova:  A Topic Modeling Framework for Biomedical Text Analysis , Bachelor Thesis, October 2023
  • Dennis Geiselmann: Context-Aware Dense Retrieval , Master Thesis, October 2023.
  • Konrad Goldenbaum: Semantic Search and Topic Exploration of Scientific Paper Corpora , Bachelor Thesis, October 2023
  • Yingying Cao:  Keyword-based Summarization of (Legal) Documents , Master Thesis Scientific Computing, August, 2023.
  • Julian Freyberg: Structural and Logical Document Layout Analysis using Graph Neural Networks , Master Thesis, August 2023.
  • Marina Walther:  A Universal Online Social Network Conversation Model , Master Thesis, August 2023.
  • David Pohl:  Zero-Shot Word Sense Disambiguation using Word Embeddings , Bachelor Thesis, August 2023
  • Klemens Gerber:  Automatic Enrichment of Company Information in Knowledge Graphs , Master Thesis, August 2023.
  • Bastian Müller:  An Adaptable Question Answering Framework with Source-Citations , Bachelor Thesis, August 2023
  • Jiahui Li:  Styled Text Summarization via Domain-specific Paraphrasing ,  Master Thesis Scientific Computing, July 2023.
  • Sophia Matthis: Multi-Aspect Exploration of Plenary Protocols , Master Thesis, June 2023.
  • Till Rostalski:  A Generic Patient Similarity Framework for Clinical Data Analysis , Bachelor Thesis, June 2023
  • David Jackson:  Automated Extraction of Drug Analysis and Discovery Networks , Master Thesis Scientific Computing, May 2023.
  • Christopher Brückner:  Multi-Feature Clustering of Search Results , Master Thesis, April 2023.
  • Paul Dietze:  Formula Classification and Mathematical Token Embeddings , Bachelor Thesis, April 2023.
  • Sophia Hammes:  A Neural-Based Approach for Link Discovery in the Process Management Domain , Master Thesis, March 2023.
  • Fabian Kneissl:  Time-Dependent Graph Modeling of Twitter Conversations , Master Thesis, March 2023.
  • Lucienne-Sophie Marmé:   A Bootstrap Approach for Classifying Political Tweets into Policy Fields , Bachelor Thesis, March 2023.
  • Jing Fan: Assessing Factual Accuracy of Generated Text Using Semantic Role Labeling , Bachelor Thesis, March 2023.
  • Fabio Gebhard:  A Rule-based Approach for Numerical Question Answering , Master Thesis, December 2022.
  • Severin Laicher:  Learning and Exploring Similarity of Sales Items and its Dependency on Sales Data , Master Thesis, September 2022.
  • Raeesa Yousaf: Explainability of Graph Roles Extracted from Networks , Bachelor Thesis, September 2022.
  • Julian Seibel: Towards GAN-based Open-World Knowledge Graph Completion , Master Thesis, June 2022.
  • Claire Zhao Sun: Extracting and Exploring Causal Factors from Financial Documents , Master Thesis Scientific Computing, May 2022.
  • Ziqiu Zhou:  Semantic Extensions of OSM Data Through Mining Tweets in the Domain of Disaster Management , Master Thesis, May 2022.
  • Lukas Ballweg:  Analysis of Lobby Networks and their Extraction from Semi-Structured Data ,  Bachelor Thesis, April 2022.
  • Benjamin Wagner:  Benchmarking Graph Databases for Knowledge Graph Handling , Bachelor Thesis, March 2022.
  • Cedric Bender:  Exploration and Analysis of Methods for German Tweet Stream Summarization , Bachelor Thesis, March 2022. 
  • Johannes Klüh:  Polyphonic Music Generation for Multiple Instruments using Music Transformer , Bachelor Thesis, March 2022.
  • Nicolas Reuter: Automatic Annotation of Song Lyrics Using Wikipedia Resources , Master Thesis, December 2021.
  • Mateusz Chrzastek: Extractive Keyphrases from Noun Chunk Similarity , Bachelor Thesis, October 2021.
  • Fabrizio Primerano: Document Information Extraction from Visually-rich Documents with Unbalanced Class Structure , Master Thesis, October 2021.
  • Sarah Marie Bopp: Gender-centric Analysis of Tweets from German Politicians , Bachelor Thesis, September 2021.
  • Philipp Göldner: A Framework for Numerical Information Extraction , Bachelor Thesis, July 2021.
  • Robin Khanna: Adaptive Topic Modelling for Twitter Data , Bachelor Thesis, July 2021.
  • Thomas Rekers: Correlating Postings from Different Social Media Platforms , Master Thesis, July 2021.
  • Duc Anh Phi: Background Linking of News Articles , Master Thesis, May 2021.
  • Eike Harms: Linking Table and Text Quantities in Documents , Master Thesis, April 2021.
  • Raphael Arndt: Regelbasierte Binärklassifizierung von Webseiten , Bachelor Thesis, April 2021.
  • Jonas Gann: Integrating Identity Management Providers based on Online Access Law , Bachelor Thesis, March 2021.
  • Björn Ternes: Kontextbasierte Informationsextraktion aus Datenschutzerklärungen , Bachelor Thesis, March 2021.
  • Fabio Becker: A Generative Model for Dynamic Networks with Community Structures , Master Thesis, December 2020.
  • Jan-Gabriel Mylius: Visual Analysis of Paragraph Similarity , Bachelor Thesis, December 2020
  • Alexander Hebel: Information Retrieval mit PostgreSQL , Master Thesis, November 2020.
  • Jonas Albrecht: Lexikon-basierte Sentimentanalyse von Tweets , Bachelor Thesis, November 2020.
  • Marina Walther: A Network-based Approach to Investigate Medical Time Series Data , Bachelor Thesis, September 2020.
  • Stefan Hickl: Automatisierte Generierung von Inhaltsverzeichnissen aus PDF-Dokumenten , Bachelor Thesis, September 2020.
  • Christopher Brückner: Structure-centric Near-Duplicate Detection , Bachelor Thesis, August 2020.
  • David Jackson: Extracting Knowledge Graphs from Biomedical Literature , Bachelor Thesis, August 2020.
  • David Richter: Single-Pass Training von Klassifikatoren basierend auf einem großem Web-Korpus , Master Thesis, August 2020.
  • Julian Freyberg: Time-sensitive Multi-label Classification of News Articles , Bachelor Thesis, July 2020.
  • John Ziegler: Modelling and Exploration of Property Graphs for Open Source Intelligence , Master Thesis, August, 2020.
  • Johannes Keller: A Network-based Approach for Modeling Twitter Topics , Master Thesis, June 2020.
  • Erik Koynov: Three Stage Statute Retrieval Algorithm with BERT and Hierarchical Pretraining , Bachelor Thesis, May 2020.
  • Fabian Kaiser: Cross-Reference Resolution in German and European Law , Master Thesis, April 2020.
  • Hasan Malik: Open Numerical Information Extraction , Master Thesis, Scientific Computing, March 2020.
  • Matthias Rein: Exploration of User Networks and Content Analysis of the German Political Twittersphere , Master Thesis, March 2020.
  • Philip Hausner: Time-centric Content Exploration in Large Document Collections , Master Thesis, March 2020.
  • Mohammad Dawas: On the Analysis of Networks Extracted from Relational Databases , Master Thesis, Scientific Computing, February 2020.
  • Lea Zimmermann: Mapping Machine Learning Frameworks to End2End Infrastructures , Bachelor Thesis, February 2020
  • Bente Nittka: Modelling Verdict Documents for Automated Judgment Grounds Prediction , Bachelor Thesis, November 2019
  • Michael Pronkin: A Framework for a Person-Centric Gazetteer Service , Bachelor Thesis, November 2019
  • Jessica Löhr: Analysis and Exploration of Register Data of Companies , Bachelor Thesis, October 2019
  • Seida Basha: Extraction of Comment Threads of Political News Articles , Bachelor Thesis, September 2019
  • Lukas Rüttgers: Analyse von YouTube-Kommentaren zur Förderung von Diskussionen , Master Thesis, Scientific Computing, July 2019
  • Gloria Feher: Concepts in Context: A Network-based Approach , Master Thesis, July 2019
  • Dennis Aumiller: Implementation of a Relational Document Hypergraph for Information Retrieval , Master Thesis, April 2019
  • Raheel Ahsan: Efficient Entity Matching , Master Thesis, Scientific Computing, March 2019
  • Christian Straßberger: Time-Varying Graphs to Explore Medical Time Series , Master Thesis, Scientific Computing, March 2019
  • Frederik Schwabe: Zitationsnetzwerke in Gesetzestexten und juristischen Entscheidungen , Bachelor Thesis, February 2019
  • Kilian Claudius Valenti: Extraktion und Exploration von Kookkurenznetzwerken aus Arztbriefen , Bachelor Thesis, February 2019
  • Satya Almasian: Learning Joint Vector Representation of Words and Named Entities , Master Thesis, Scientific Computing, October 2018
  • Naghmeh Fazeli: Evolutionary Analysis of News Article Networks , Master Thesis, Scientific Computing, October 2018
  • Lukas Kades: Development and Evaluation of an Indoor Simulation Model for Visitor Behaviour on a Trade Fair , Master Thesis, October 2018
  • David Stronczek: Named Entity Disambiguation using Implicit Networks , Master Thesis, August 2018
  • Julius Franz Foitzik: A Social Network Approach towards Location-based Recommendation , Master Thesis, April 2018
  • Carine Dengler: Network-based Modeling and Analysis of Political Debates , Master Thesis, May 2018
  • Maximilian Langknecht: Exploration-Based Feature Analysis of Time Series Using Minimum Spanning Trees ,  Bachelor Thesis, May 2018
  • Jayson Salazar: Extraction and Analysis of Dynamic Co-occurence Networks from Medical Text , Master Thesis, Scientific Computing, April 2018
  • Fabio Becker: Toponym Resolution in HeidelPlace , Bachelor Thesis, April 2018
  • Felix Stern: Correlating Finance News Articles and Stock Indexes , Master Thesis, March 2018
  • Oliver Hommel: Symbolical Inversion of Formulas in an OLAP Context , Master Thesis, Scientific Computing,  March 2018
  • Jan Greulich: Reasoning with Imprecise Temporal and Geographical Data , Master Thesis, February 2018
  • Johannes Visintini: Modelling and Analyzing Political Activity Networks , Bachelor Thesis, February 2018
  • Sebastian Lackner:  Efficient Algorithms for Anti-community Detection , Master Thesis, February 2018
  • Leonard Henger: Erstellung eines konzeptionellen Datenmodells für Zeitreihen und Erkennung von Zeitreihenausreißern , Bachelor Thesis, December 2017
  • Christian Kromm: Short-term travel time prediction in complex contents , Master Thesis, December 2017
  • Christian Schütz: A Generative Model for Correlated Geospatial Property Graphs with Social Network Characteristics , Bachelor Thesis, December 2017
  • Sophia Stahl: Association Rule Based Pattern Mining of Cancer Genome Variants , Master Thesis, December 2017
  • Patrick Breithaupt: Evolving Topic-centric Information Networks , Master Thesis, October 2017
  • Michael Müller: Graph Based Event Summarization , Master Thesis, September 2017
  • Slavin Donchev: Statement Extraction from German Newspaper Articles , Bachelor Thesis, August 2017
  • Dennis Aumiller: Mining Relationship Networks from University Websites , Bachelor Thesis, August 2017
  • Katja Hauser: Latent Information Networks from German Newspaper Articles , Bachelor Thesis, April 2017
  • Xiaoyu Ye: Extraction and Analysis of Organization and Person Networks , Master Thesis, April 2017
  • Martin Enderlein: Modeling and Exploring Company Networks , Bachelor Thesis, January 2017
  • Ludwig Richter: A Generic Gazetter Data Model and an Extensible Framework for Geoparsing , Master Thesis, October 2016
  • Benjamin Keller: Matching Unlabeled Instances against a Known Data Schema Using Active Learning , Bachelor Thesis, August 2016
  • Julien Stern: Generation and Analysis of Event Networks from GDELT Data , Bachelor Thesis, July 2016
  • Hüseyin Dagaydin: Personalized Filtering of SAP Internal Search Results based on Search Behavior , Master Thesis, March 2016
  • Zaher Aldefai: Improvement of SAP Search HANA results through Text Analysis , Master Thesis, April 2016
  • Jens Cram: Adapting In-Memory Representations of Property Graphs to Mixed Workloads , Bachelor Thesis, April 2016
  • Antonio Jiménez Fernández: Collection and Analysis of User Generated Comments on News Articles , Bachelor Thesis, April 2016
  • Nils Weiher: Temporal Affiliation Network Extraction from Wikidata , Bachelor Thesis, March 2016
  • Claudia Dünkel: Erweiterung des Wu-Holme Modells für Zitationsnetzwerke , Bachelor Thesis, January 2016
  • Muhammad El-Hindi: VisIndex: A Multi-dimensional Tree Index for Histogram Queries , Master Thesis, December 2015
  • Annika Boldt: Rahmenwerk für kontextsensitive Hilfe von webbasierten Anwendungen , Master Thesis, December 2015
  • Carine Dengler: Das INDY-Bildanalyseframework für die Geschichtswissenschaften , Bachelor Thesis, October 2015
  • Leif-Nissen Lundbaek: Conceptional analysis of cryptocurrencies towards smart financial networks , Master Thesis, Scientific Computing, October 2015
  • Viktor Bersch: Effiziente Identifikation von Ereignissen zur Auswertung komplexer Angriffsmuster auf IT Infrastrukturen , Master Thesis, September 2015
  • Ranjani Dilip Banhatti: Graph Regularization Parameter for Non-Negative Matrix Factorization , Master Thesis, Scientific Computing, September 2015
  • Konrad Kühne: Temporal-Topological Analysis of Evolutionary Message Networks , Bachelor Thesis, July 2015
  • Stefanie Bachmann: The K-Function and its use for Bandwidth Parameter Estimation , Bachelor Thesis, July 2015
  • Philipp Daniel Freiberger: Temporal Evolution of Communities in Affiliation Networks , Bachelor Thesis, June 2015
  • Johannes Auer: Bewertung von GitHub Projekten anhand von Eventdaten , Bachelor Thesis, March 2015
  • Christian Kromm: Erkennung und Analyse von Regionalen Hashtag Communities in Twitter , Bachelor Thesis, March 2015
  • Matthias Brandt: Evolution of Correlation of Hashtags in Twitter, Master Thesis, February 2015
  • Jonas Scholten: Effizientes Indexing von Twitter-Daten für temporale und räumliche TopK-Suche unter Verwendung von Mongo DB , Bachelor Thesis, February 2015
  • Patrick Breithaupt: Experimentelle Analyse des Exponetial Random Graph Modells , Bachelor Thesis, February 2015
  • Timm Schäuble: Classification of Temporal Relations between Events , Bachelor Thesis, January 2015
  • Andreas Spitz: Analysis and Exploration of Centrality and Referencing Patterns in Networks of News Articles, Master Thesis, November 2014
  • Tobias Zatti: Simulation und Erweiterung von sozialen Netzwerken durch Random Graphs am Beispiel von Twitter , Bachelor Thesis, November 2014
  • Ludwig Richter: Automated Field-Boundary Detection by Trajectory Analysis of Agricultural Machinery , Bachelor Thesis, August 2014
  • Thomas Metzger: Mining Sequential Patterns from Song Lists , Bachelor Thesis, July 2014
  • Arthur Arlt: Determining Rates of False Positives and Negatives in Fast Flux Botnet Detection , Master Thesis, July 2014.
  • Hanna Lange: Stream-based Event and Place Detection from Social Media , June 2014
  • Christian Karr: Effektive Indexierung von räumlichen und zeitlichen Daten , Bachelor Thesis, May 2014
  • Haikuhi Jaghinyan: Evaluation of the HANA Graph Engine, Bachelor Thesis, March 2014
  • Sebastian Rode: Speeding Up Graph Traversals in the SAP HANA Database , Diploma Thesis, Mathematics/Computer Science, March 2014
  • Isil Özge Pekel: Performing Cluster Analysis on Column Store Databases , Master Thesis, March 2014
  • Andreas Runk: Integrating Information about Persons from Linked Open Data , Master Thesis, February 2014
  • Tobias Limpert: Verbesserung der spatio-temporal Event Extraktion und ihrer Kontextinformation durch Relationsextraktionsmethoden , Bachelor Thesis, December 2013
  • Christian Seyda: Comparison of graph-based and vector-space geographical topic detection , Master Thesis, December 2013
  • Bartosz Bogasz: Generation of Place Summaries from Wikipedia , Master Thesis, December 2013
  • David Richter: Segmentierung geographischer Regionen aus Social Media mittels Superpixelverfahren , Bachelor Thesis, October 2013
  • Marek Walkowiak: Gazetteer-gestützte Erkennung und Disambiguierung von Toponymen in Text , Bachelor Thesis, October 2013
  • Mirko Kiefer: Histo: A Protocol for Peer-to-Peer Data Synchronization in Mobile Apps , Bachelor Thesis, September 2013
  • Daniel Egenolf: Extraktion und Normalisierung von Personeninformation für die Kombination mit Spatio-temporal Events , Bachelor Thesis, September 2013
  • Lisa Tuschner: Tag-Recommendation auf Basis von Flickr Daten , Bachelor Thesis, September 2013
  • Edward-Robert Tyercha: An Efficient Access Structure to 3D Mesh Data with Column Store Databases , Master Thesis, September 2013
  • Matthias Iacsa: Study of NetPLSA with respect to regularization in multidimensional spaces , Bachelor Thesis, July 2013
  • Timo Haas: Analyse und Exploration von temporalen Aspekten in OSM-Daten , Bachelor Thesis, June 2013
  • Julian Wintermayr: Evaluation of Semantic Web storage solutions focusing on Spatial and Temporal Queries , Bachelor Thesis, June 2013
  • Bertil Nestorius Baron: Aggregate Center Queries in Dynamic Road Networks , Diploma Thesis, Mathematics/Computer Science, May 2013
  • Viktor Bersch: Methoden zur temporalen Analyse und Exploration von Reviews , Bachelor Thesis, May 2013
  • Cornelius Ratsch: Adaptive String Dictionary Compression in In-Memory Column-Store Database Systems , Master Thesis, April 2013
  • Andreas Zerkowitz: Aufbau und Analyse eines Event-Repository aus Wikipedia , Bachelor Thesis, April 2013
  • Erik von der Osten: Influential Graph Properties of Collaborative-Filtering based Recommender Systems , Diploma Thesis, Mathematics/Computer Science, March 2013
  • Philipp Harth: Local Similarity in Geometric Graphs via Spectral Correspondence , Master Thesis, February 2013
  • Benjamin Kirchholtes: A General Solution for the Point Cloud Docking Problem , Master Thesis, February 2013
  • Manuel Kaufmann: Modellierung und Analyse heuristischer und linguistischer Methoden zur Eventextraktion , Bachelor Thesis, November 2012
  • Dennis Runz: Socio-Spatial Event Detection in Dynamic Interaction Graphs , Master Thesis, November 2012
  • Andreas Schuster: Compressed Data Structures for Tuple Identifiers in Column-Oriented Databases , Master Thesis, October 2012
  • Christian Kapp: Person Comparison based on Name Normalization and Spatio-temporal Events , Master Thesis, September 2012
  • Jörg Hauser: Algorithms for Model Assignment in Multi-Gene Phylogenetics , Master Thesis, August 2012
  • Andreas Klein: The CSGridFile for Managing and Querying Point Data in Column Stores , Master Thesis, August 2012
  • Andreas Runk: Dynamisches Rerouting in Strassennetzwerken , Bachelor Thesis, August 2012
  • Markus Neusinger: Erkennung von Sternströmen mit Hilfe moderner Clusteringverfahren , Diploma Thesis Physics/Computer Science, August 2012
  • Clemens Maier: Visualisierung und Modellierung des auf BRF+ aufgebauten Workflows , Bachelor Thesis, August 2012
  • Daniel Kruck: Investigation of Exact Graph and Tree Isomorphism Problems , Bachelor Thesis, July 2012
  • Andreas Fay: Correlation and Exploration of Events , Master Thesis, February 2012
  • Cornelius Ratsch: Extending Context-Aware Query Autocompletion , Bachelor Thesis, February 2012
  • Alexander Wilhelm: Spezifikation und Suche komplexer Routen in Strassennetzwerken , Diploma Thesis, Mathematics/Computer Science, February 2012
  • Britta Keller: Ein Event-basiertes Ähnlichkeitsmodell für biomedizinische Dokumente , Bachelor Thesis, February 2012
  • Simon Jarke: Effiziente Suche von Substrukturen in grossen geometrischen Graphen , Master Thesis, November 2011
  • Markus Kurz: Visualizing and Exploring Nonparametric Density Estimations of Context-aware Itemsets , Bachelor Thesis, October 2011
  • Frank Tobian: Modelle und Rankingverfahren zur Kombination von textueller und geographischer Suche , Bachelor Thesis, September 2011
  • Alexander Hochmuth: Efficient Computation of Hot Spots in Road Networks , Bachelor Thesis, June 2011
  • Selina Raschack: Spezifikation von Mustern auf räumlichen Daten und Suche von zugehörigen Musterinstanzen , Bachelor Thesis, May 2011
  • Bechir Ben Slama: Dynamische Erkennung von Ausreißern in Straßennetzwerken , Master Thesis, March 2011
  • Marcus Schaber: Scalable Routing using Spatial Database Systems , Bachelor Thesis, March 2011
  • Edward-Robert Tyercha: Co-Location Pattern Mining mit MapReduce , Bachelor Thesis, March 2011
  • Benjamin Hiller: Analyse und Verarbeitung von OpenStreetMap-Daten mit MapReduce , Bachelor Thesis, March 2011
  • Serge Thiery Akoa Owona: Apache Cassandra as Database System for the Activiti BPM Engine , Bachelor Thesis, February 2011
  • Maik Häsner: Bestimmung und Überwachung von Hot Spots in Strassennetzwerken , Master Thesis, October 2010.
  • Philipp Harth: Scale-Dependent Pattern Mining on Volunteered Geographic Information , Bachelor Thesis, August 2010.
  • Peter Artmann: Design and Implementation of a Rule-based Warning and Messaging System , Bachelor Thesis, June 2010.
  • Christopher Röcker: Analyse und Rekonstruktion unvollständiger Sensordaten , Bachelor Thesis, March 2010.
  • Andreas Klein: Eine Indexstruktur zur Verwaltung und Anfrage an Moving Regions auf Grundlage des TPR∗-Baumes , Bachelor Thesis, February 2010.
  • Benjamin Kirchholtes: Object Recognition and Extraction in Satellites Images using the Insight Segmentation and Registration Toolkit (ITK) , Bachelor Thesis, February 2010.
  • Fabian Rühle: Performance Analysis of Column-based Main Memory Databases , Bachelor Thesis, December 2009.
  • Pavel Popov: GeoDok: Extraktion und Visualisierung von Ortsinformationen in Dokumenten , Bachelor Thesis, December 2009.



Voices of XR: Sanjeel Parekh


Sanjeel Parekh is a research scientist at Meta Reality Labs Research. His research primarily focuses on building machine learning tools for problems involving audio-visual data such as source separation, event detection, and speech enhancement. 

He earned his PhD in computer science at Technicolor and Telecom University of Paris-Saclay in 2019. His thesis was on learning representations for robust audio-visual scene analysis. Other areas he finds interesting and engaging are multimedia and ML research, music, philosophy, math, and machines. 

His talk will focus on audiovisual scene understanding and how the field appears through the lens of augmented/virtual reality. Processing multi-sensory information to robustly detect and respond to objects and events in our surroundings lies at the heart of human perception. What does it take to impart such ability to machines? In this talk, he will explore this question in two parts: first through some of his work on multimodal and interpretable ML methods for audiovisual scene analysis. He will then outline research challenges and opportunities posed in the context of AR/VR, delving into a few in greater detail. A secondary goal of this presentation is to provide an overview of open research initiatives by the lab for collaboration with the broader research community.

Date: Monday, April 29, 2024
Time: 2-3:15pm (EDT)
Location: Studio X - Carlson Library, 1st Floor & Zoom
Register to attend.

Explore our other speakers

The Voices of XR speaker series is made possible by Kathy McMorran Murray and the National Science Foundation (NSF) Research Traineeship (NRT) program as part of the Interdisciplinary Graduate Training in the Science, Technology, and Applications of Augmented and Virtual Reality at the University of Rochester (#1922591).


Thesis Oral Defense: Assessing reproducibility of brain-behavior associations using bootstrap aggregation methods

Abstract: In this thesis, amidst growing utilization of resting-state functional connectivity MRI (rs-fcMRI) for linking neural activity to pathological conditions, we confront the prevalent concerns regarding the reliability of such data. Our exploration concentrates on improving the reproducibility of brain-behavior associations within the framework of the Human Connectome Project (HCP) dataset. We employ two distinct bootstrap aggregation approaches to investigate the enhancement of functional connectivity reliability: individual time series bagging using Circular Block Bootstrap (CBB) and subject-level bagging utilizing Linear Support Vector Regression (LSVR) models. Our investigation into individual time series bagging with CBB reveals that this method does not significantly bolster the reproducibility of brain-behavior associations. This finding points to the complexity of achieving reliable functional connectivity measures and the limitations of certain aggregation methods in overcoming this challenge. In contrast, our examination of subject-level bagging through LSVR models presents a more promising outcome. This approach markedly enhances the reliability of model weights between analyses, demonstrating its efficacy in improving data robustness and reproducibility. This differential impact of the two methodologies underscores the critical role of appropriate analytical strategies in enhancing the reliability of neuroimaging data. By delineating the outcomes of these two methodologies, this thesis contributes to the broader discussion on data reliability in the field of neuroimaging. It underscores the necessity for continued methodological innovations and validations across varied datasets to advance the reliability and interpretability of rs-fcMRI studies.
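To make the time-series bagging idea in the abstract more concrete, here is a minimal sketch of a circular block bootstrap followed by averaging a statistic over the resamples. It is an illustration only, not the thesis's pipeline: the function names, the block length, the number of resamples, and the choice of Pearson correlation between two series as a stand-in for a functional connectivity estimate are all assumptions made for this example.

```python
import math
import random


def circular_block_bootstrap(n, block_len, rng):
    """Return n resampled time indices built from wrapped blocks (hypothetical helper)."""
    indices = []
    while len(indices) < n:
        start = rng.randrange(n)  # any time point may start a block
        # Indices wrap past the end of the series; this wrap is the "circular" part.
        indices.extend((start + k) % n for k in range(block_len))
    return indices[:n]  # trim the last block so the resample has length n


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def bagged_correlation(x, y, block_len=10, n_boot=200, seed=0):
    """Average the correlation over many block-bootstrap resamples (bagging)."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        idx = circular_block_bootstrap(n, block_len, rng)
        # The same indices are applied to both series so their time alignment is kept.
        estimates.append(pearson([x[i] for i in idx], [y[i] for i in idx]))
    return sum(estimates) / len(estimates)


if __name__ == "__main__":
    # Synthetic, autocorrelated signals standing in for two regional time series.
    noise = random.Random(1)
    x = [math.sin(t / 5) + noise.gauss(0, 0.3) for t in range(200)]
    y = [math.sin(t / 5) + noise.gauss(0, 0.3) for t in range(200)]
    print(round(bagged_correlation(x, y), 3))
```

Resampling whole blocks rather than single time points keeps short-range temporal dependence inside each block, which is the usual motivation for block bootstraps on autocorrelated data; the block length here is arbitrary and would need tuning for real recordings.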

Advisors: Dr. Wheelock and Dr. Lahiri

Committee Members: Soumendra Lahiri, Muriah Wheelock, Robert Lunde


Thesis Data Science LM-DATA UNICT


Computer Science Thesis Oral

April 5, 2024 2:00pm — 4:00pm.

Location: In Person and Virtual - ET - Reddy Conference Room, Gates Hillman 4405 and Zoom

Speaker: MATT BUTROVICH, Ph.D. Candidate, Computer Science Department, Carnegie Mellon University

On Embedding Database Management System Logic in Operating Systems via Restricted Programming Environments

The rise in computer storage and network performance means that disk I/O and network communication are often no longer bottlenecks in database management systems (DBMSs). Instead, the overheads associated with operating system (OS) services (e.g., system calls, thread scheduling, and data movement from kernel-space) limit query processing responsiveness. User-space applications can elide these overheads with a kernel-bypass design. However, extracting benefits from kernel-bypass frameworks is challenging, and the libraries are incompatible with standard deployment and debugging tools. 

This thesis presents an alternative in user-bypass: a design that extends OS behavior for DBMS-specific features, including observability, networking, and query execution. Historically, DBMS developers avoid kernel extensions for safety and security reasons, but recent improvements in OS extensibility present new opportunities. With user-bypass, developers write safe, event-driven programs to push DBMS logic into the kernel and avoid user-space overheads. There are two ways to invoke user-bypass logic: (1) when a DBMS in user-space invokes these programs, user-bypass provides behavior similar to a new OS system call, albeit without kernel modifications; (2) when an OS thread or interrupt triggers these programs in kernel-space, user-bypass inserts DBMS logic into the kernel stack.

First, we present a framework that employs user-bypass to collect training data for self-driving DBMSs efficiently. User-bypass programs reduce the number of round trips to kernel-space to retrieve performance counters and other system metrics. Next, we present a database proxy that applies user-bypass to support features like connection pooling and workload replication while reducing data copying and user-space thread scheduling. User-bypass programs embed DBMS network protocol logic in multiple layers of the OS network stack, applying DBMS proxy logic in a kernel-space fast path. Lastly, we present an embedded DBMS for future user-bypass applications. We discuss the design decisions, environment challenges, and performance characteristics of a DBMS that offers ACID transactions over multi-versioned data in kernel-space. We also explore applications of this user-bypass DBMS and compare them to modern user-space systems.

The techniques proposed in this thesis show user-bypass benefits across multiple DBMS design disciplines and provide a template for future DBMS and OS co-design.

Thesis Committee:

Andrew Pavlo (Chair), Jignesh M. Patel, Justine Sherry, Samuel Madden (Massachusetts Institute of Technology)

In Person and Zoom Participation.  See announcement.


Professor Promotes Data Literacy With New Book


Adam Tashman has seen firsthand the value of teaching data science at the high school level.

In 2022, Tashman, an associate professor of data science at the University of Virginia, and one of his former master’s students, Matt Dakolios, launched a pilot course at The Covenant School in Charlottesville where they worked with high school students on data analysis, covering key tools and applications.

“My feeling was we can and should teach even really young kids,” Tashman said. “There are lots of opportunities where kids are doing math and science to just do more with data.”

The course went well, Tashman said, but it was also a lot of work, and he began to wonder how the materials he was developing could be used to promote data literacy on a larger scale. 

Out of this experience and the insights it provided emerged a new book, authored by Tashman, titled “From Concepts to Code: Introduction to Data Science,” set to be published on April 12. The book includes weekly lesson plans that could be used as part of a full-year introductory course for advanced high school or college students.

Tashman said he hopes the book will be a valuable resource for educators and students but also for anyone interested in learning more about this transformative discipline.  

The book is the latest step in Tashman’s journey to data science that began when he was a mathematics major at UVA. After graduating, he discovered that there was significant demand on Wall Street for quantitative analysts, known as quants, whose work focused on derivative pricing, portfolio optimization, and risk management.

Eventually, though, he discovered a new path. 

“I got excited about working in other areas apart from finance,” Tashman said. He would soon get an opportunity to lead a team of data scientists who worked on a variety of projects.  

In 2019 he joined the faculty of UVA’s School of Data Science where he has taught a wide range of courses on topics such as big data systems, R programming, and probability. He has also served as director of the online master’s program.

In November 2022 tragedy struck Charlottesville when UVA football players Devin Chandler, Lavel Davis Jr., and D’Sean Perry were shot and killed on Grounds after returning from a class field trip. The incident had a profound impact on the University community, including on Tashman.

“I was just thinking, what is something that I could do that maybe could be helpful, just based on my background, my experience?” he said. 

So, using the materials he created during his work with The Covenant School, he got to work, waking up early every morning and writing about two pages per day.

After many months of writing, Tashman produced a resource that aims to demystify the world of data, educating readers on tools they can use to address data questions. 

For some, the book could serve as an entry point to a career in data science. But Tashman hopes that, even for readers who are not interested in becoming data scientists, the book will expand their data literacy and allow them to better understand the potential and impact of this rapidly expanding field.

“We’re all going to be faced with a lot of data, and unfortunately some of it can be quite technical,” Tashman said. “It can be easy for people to be misled or mislead with data.” 

“I really feel like, even if people don’t want to necessarily be a data scientist or mathematician, to have that literacy just gives them power to be able to critically examine things, even if it’s just reading the newspaper,” he added.

Tashman also hopes that readers of his new book will realize that data science is not just a field for people who already excel in math or computer science.

“It’s really about solving problems where data is at the forefront,” Tashman said, noting that he only took one computing class in college but developed into a good programmer through continued practice and motivation. 

“I feel like that there are a lot of people out there not giving themselves enough credit,” he said. 

In addition to his new book, Tashman recently worked with Siri Russell, associate dean for diversity, equity, inclusion, and community partnerships at the School of Data Science, to develop the content for the Data and Society Challenge, a competition for 11th and 12th grade students in Virginia to learn more about data science and the career opportunities it can provide. It’s all part of Tashman’s work to show young people that data science is a field that is both impactful and inclusive.

“It turns out, with data science you can make a nice living, you can improve the lives of people, and you can do things that you feel passionate about,” Tashman said. “I think if you get that, then you’ll be more motivated to keep an open mind and learn these tools.”


  1. SOLUTION: Thesis chapter 4 analysis and interpretation of data sample

  2. Thesis on cloud computing data security. Cloud Computing Security

  3. (PDF) Towards Data Science

  4. Computer Science Thesis Data Analysis

  5. Thesis Methodology Sample

  6. (PDF) Master’s Thesis in Computing Science



  2. PhD Thesis Defense. Vadim Sotskov

  3. Data Science MSc thesis oral presentation

  4. DATA SCIENTIST: Ways to become a Data Scientist

  5. The Data Sciences Institute

  6. Unleashing The Power Of Data Science To Transform Industries


  1. 10 Best Research and Thesis Topic Ideas for Data Science in 2022

    In this article, we have listed 10 such research and thesis topic ideas to take up as data science projects in 2022. Handling practical video analytics in a distributed cloud: With increased dependency on the internet, sharing videos has become a mode of data and information exchange. The role of the implementation of the Internet of Things ...

  2. How to write a great data science thesis

    They will stress the importance of structure, substance and style. They will urge you to write down your methodology and results first, then progress to the literature review, introduction and conclusions and to write the summary or abstract last. To write clearly and directly with the reader's expectations always in mind.

  3. Five Tips For Writing A Great Data Science Thesis

    Although educational programs, conventions and thesis requirements vary wildly, I hope to offer some common guidelines for any student currently working on a Data Science thesis. The article offers five guidance points, but may effectively be summarized in a single line: "Write for your reader, not for yourself."

  4. Thesis/Capstone for Master's in Data Science

    Data Science; Capstone and Thesis Overview; Capstone and Thesis Overview. Capstone and thesis are similar in that they both represent a culminating, scholarly effort of high quality. Both should clearly state a problem or issue to be addressed. Both will allow students to complete a larger project and produce a product or publication that can ...

  5. Ten Research Challenge Areas in Data Science

    Abstract. To drive progress in the field of data science, we propose 10 challenge areas for the research community to pursue. Since data science is broad, with methods drawing from computer science, statistics, and other disciplines, and with applications appearing in all sectors, these challenge areas speak to the breadth of issues spanning science, technology, and society.

  6. Top 10 Essential Data Science Topics to Real-World Application From the

    1. Introduction. Statistics and data science are more popular than ever in this era of data explosion and technological advances. Decades ago, John Tukey (Brillinger, 2014) said, "The best thing about being a statistician is that you get to play in everyone's backyard." More recently, Xiao-Li Meng (2009) said, "We no longer simply enjoy the privilege of playing in or cleaning up everyone ...

  7. Computational and Data Sciences (PhD) Dissertations

    Computational and Data Sciences (PhD) Dissertations. Below is a selection of dissertations from the Doctor of Philosophy in Computational and Data Sciences program in Schmid College that have been included in Chapman University Digital Commons. Additional dissertations from years prior to 2019 are available through the Leatherby Libraries ...

  8. 10 Compelling Machine Learning Ph.D. Dissertations for 2020

    This dissertation explores three topics related to random forests: tree aggregation, variable importance, and robustness. 10. Climate Data Computing: Optimal Interpolation, Averaging, Visualization and Delivery. This dissertation solves two important problems in the modern analysis of big climate data.

  9. Thesis Option

    Data Science master's students can choose to satisfy the research experience requirement by selecting the thesis option. Students will spend the majority of their second year working on a substantial data science project that culminates in the submission and oral defense of a master's thesis. While all thesis projects must be related to data science, students are given leeway in finding a ...

  10. 17 Compelling Machine Learning Ph.D. Dissertations

    This dissertation revisits and makes progress on some old but challenging problems concerning least squares estimation, the work-horse of supervised machine learning. Two major problems are addressed: (i) least squares estimation with heavy-tailed errors, and (ii) least squares estimation in non-Donsker classes.

  11. 37 Research Topics In Data Science To Stay On Top Of » EML

    Also, data science can help to develop new security technologies and protocols. As a result, cybersecurity is a crucial data science research area and one that will only become more important in the years to come. 23.) Blockchain. Blockchain is an incredible new research topic in data science for several reasons.

  12. 5 Must-Read Data Science Papers (and How to Use Them)

    How to use: follow the experts' practical tips to streamline development and production. #2 — Software 2.0 💻. This classic post from Andrej Karpathy articulated the paradigm that machine learning models are software applications with code based in data.

  13. Thesis Projects and Research in DS

    Thesis Projects and Research in DS. The Master's thesis is a mandatory course of the Master's program in Data Science. The thesis is supervised by a professor of the data science faculty list. Research in Data Science is a core elective for students in Data Science under the supervision of a data sci ...

  14. PhD Dissertations

    PhD Dissertations [All are .pdf files] Probabilistic Reinforcement Learning: Using Data to Define Desired Outcomes, and Inferring How to Get There Benjamin Eysenbach, 2023. Data-driven Decisions - An Anomaly Detection Perspective Shubhranshu Shekhar, 2023. METHODS AND APPLICATIONS OF EXPLAINABLE MACHINE LEARNING Joon Sik Kim, 2023. Applied Mathematics of the Future Kin G. Olivares, 2023

  15. Instructions for MSc Thesis

    For a Data Science thesis, this part typically describes the method for the analysis. Chapter 5: Results. This chapter describes the results obtained when the methods of Chapter 4 are used on data. For a Computer Science thesis, this part typically describes the performance of the developed algorithm(s) on various synthetic and real datasets.

  16. Data Science Masters Theses // Arch : Northwestern University

    Data Science Masters Theses. The Master of Science in Data Science program requires the successful completion of 12 courses to obtain a degree. These requirements cover six core courses, a leadership or project management course, two required courses corresponding to a declared specialization, two electives, and a capstone project or thesis.

  17. OATD

    You may also want to consult these sites to search for other theses: Google Scholar; NDLTD, the Networked Digital Library of Theses and Dissertations. NDLTD provides information and a search engine for electronic theses and dissertations (ETDs), whether they are open access or not. Proquest Theses and Dissertations (PQDT), a database of dissertations and theses, whether they were published ...

  18. BSc/MSc Thesis

    BSc/MSc Thesis. Our research group offers various interesting topics for a BSc or MSc thesis, the latter both in Computer Science and Scientific Computing. These topics are typically closely related to ongoing research projects (see our Research Page and Publications). Below, we outline the basic procedure you should follow when planning to do ...

  19. Thesis guide • University of Passau

    This short guide is intended to give you a brief guideline on how to organise and conduct your thesis. This guide covers both master's and bachelor's theses. Managing your Thesis. A thesis is a complex task that, like a software project, has to be managed properly. It can be seen as having four stages, which are iterated several times.

  20. Open Theses

    Open Topics: We offer multiple Bachelor/Master theses, Guided Research projects and IDPs in the area of data mining/machine learning. A non-exhaustive list of open topics is given below. If you are interested in a thesis or a guided research project, please send your CV and transcript of records to Prof. Stephan Günnemann via email and we will arrange a meeting to talk about the potential ...

  21. MIT Theses

    MIT's DSpace contains more than 58,000 theses completed at MIT dating as far back as the mid 1800's. Theses in this collection have been scanned by the MIT Libraries or submitted in electronic format by thesis authors. Since 2004 all new Masters and Ph.D. theses are scanned and added to this collection after degrees are awarded.

  22. Bachelor and Master Theses

    Jiahui Li: Styled Text Summarization via Domain-specific Paraphrasing, Master Thesis Scientific Computing, July 2023. Sophia Matthis: Multi-Aspect Exploration of Plenary Protocols, Master Thesis, June 2023. Till Rostalski: A Generic Patient Similarity Framework for Clinical Data Analysis, Bachelor Thesis, June 2023.

  23. Voices of XR: Sanjeel Parekh

    Sanjeel Parekh is a research scientist at Meta Reality Labs Research. His research primarily focuses on building machine learning tools for problems involving audio-visual data such as source separation, event detection, and speech enhancement. He earned his PhD in computer science at Technicolor and Telecom University of Paris-Saclay in 2019. His thesis was on learning representations for ...

  24. Thesis Oral Defense: Assessing reproducibility of Brain-behavior

    Abstract: In this thesis, amidst growing utilization of resting-state functional connectivity MRI (rs-fcMRI) for linking neural activity to pathological conditions, we confront the prevalent concerns regarding the reliability of such data. Our exploration concentrates on improving the reproducibility of brain-behavior associations within the framework of the Human Connectome Project (HCP) dataset.

  25. Thesis Data Science LM-DATA UNICT

    Official thesis template for the Data Science master's degree course at University of Catania. An easy-to-use online LaTeX editor. No installation, real-time collaboration, version control, hundreds of LaTeX document templates, and more.

  26. Computer Science Thesis Oral

    Computer Science Thesis Oral April 5, 2024 2:00pm — 4:00pm ... (OS) services (e.g., system calls, thread scheduling, and data movement from kernel-space) limit query processing responsiveness. User-space applications can elide these overheads with a kernel-bypass design. However, extracting benefits from kernel-bypass frameworks is ...

  27. Professor Promotes Data Literacy With New Book

    Out of this experience and the insights it provided emerged a new book, authored by Tashman, titled "From Concepts to Code: Introduction to Data Science," set to be published on April 12. The book includes weekly lesson plans that could be used as part of a full-year introductory course for advanced high school or college students.

  28. Invitation to the Thesis Defense of Mr. Philippe Anthony C. Bautista

    The Department of Information Systems and Computer Science invites you to the Thesis entitled "Revenue Time-series Forecasting of a Subscription-based Telecommunications Business Product Using LSTM and GRU" by Mr. Philippe Anthony ... A Data-Driven Disaster Preparedness and Recovery Policy Framework. Thu, 11 Apr 2024 Katipunan Avenue ...

  29. ESS Oral Defense: Amanda Semler "Microbial Life at Marine Hydrocarbon

    Stanford University *** Ph.D. Thesis/ Oral Defense *** Amanda Semler Wednesday, April 10th, 2:00pm Green Earth Science 365 Department of Earth System Science Advisor: Dr. Anne Dekas Marine hydrocarbon seeps, also known as cold seeps, are widespread features of continental margins and sedimentary basins worldwide and are hotspots of microbial activity on the seafloor.