
BSc and MSc Thesis Subjects of the Bioinformatics Group

On this page you can find an overview of the BSc and MSc thesis topics that are offered by our group. The procedure to find the right thesis project for you is described below.

MSc thesis: In the Bioinformatics group, we offer a wide range of MSc thesis projects, from applied bioinformatics to computational method development. Here is a list of available MSc thesis projects. Besides serving as MSc thesis topics, these projects can also be pursued as part of a Research Practice.

BSc thesis: As a BSc student you will work as an apprentice alongside one of the PhD students or postdocs in the group. You will work on your own research project, closely guided by your supervisor. You will be expected to work with several tools and/or databases, be creative and potentially overcome technical challenges. Below you will find short descriptions of the research projects of our PhDs and Postdocs. In addition you can take a look at the list of MSc thesis projects above.

Procedure for WUR students:

  • Request an intake meeting with one of our thesis coordinators by filling out the MSc intake form or BSc intake form and sending it to [email protected]
  • Contact project supervisors to discuss specific projects that fit your background and interest
  • Upon a match, take care of the required thesis administration together with your supervisor(s) and enroll in the thesis BrightSpace site to find more information on a thesis in the Bioinformatics group

Procedure for non-WUR students or students in other non-standard situations: We have limited space for interns from other institutes. If you are interested, please email our thesis coordinators at [email protected]; please attach your CV and indicate your main research interests.

BSc thesis topics

  • Integrative omics for the discovery of biosynthetic pathways in plants
  • Molecular function prediction of natural products
  • Linking the metabolome and genome
  • Linking metagenomics and metatranscriptomics to study the endophytic root microbiome
  • Exploiting variation in lettuce and its wild relatives


Open thesis topics

Within our group we can offer various topics in the field of applied bioinformatics, high-throughput data analysis, genome and metagenome research as well as postgenomics and systems biology. Below you can find a list of suggested open topics for BSc and MSc theses and student projects. For further details on each topic or alternative projects please contact us.

Exploring the Role of Nasal Microbiota in Neurological Diseases (M.Sc.)

Background

Microorganisms, including those in the human nasal cavity, maintain stability and functionality. Recent research suggests a potential link between the nasal microbiota and neurological diseases such as Parkinson’s disease (PD), Alzheimer’s disease (AD), and multiple sclerosis (MS) (1). However, the nature of this relationship remains unclear due to a limited number of studies. While much focus has been on the gut-brain axis, the influence of the nose-brain axis on the immune system and respiratory homeostasis requires further investigation (2). Some studies have indicated that altering the nasal microbiota could potentially prevent or treat neurological diseases, highlighting the need to understand the complex interactions between the nasal microbiota and the brain. Evidence suggests that the nasal microbiome may travel through the olfactory pathway to the brain (2, 3). The diversity of bacteria in the nasal cavity is highly dynamic and can vary depending on age, physiology, and lifestyle.

This project will investigate how nasal microbiota stability impacts the blood-brain barrier (BBB) and its potential role in the development and progression of neurological diseases. Our goal is to gain a comprehensive understanding of the nasal microbial community, the conditions under which it remains stable, and how disruptions in nasal homeostasis might contribute to neurodegeneration.

Objective

The primary objective of this project is to explore the conditions under which nasal microbiota stability or instability is associated with neurological diseases, focusing on potential diagnostic and therapeutic applications.

Methodology

  1. Literature Review: Conduct a thorough review of existing studies on the nasal microbiota and its potential impact on neurological diseases.
  2. Data Comparison and Analysis: Compare data gathered from literature on the nasal microbiota, analyzing differences in composition and diversity, and identifying potential patterns.
  3. Mechanistic Studies: Explore how alterations in the nasal microbiota might influence the BBB and contribute to the pathology of neurological diseases.
  4. Model Creation and Analysis: Develop a model based on literature data to analyze the stability of the nasal microbiota and its potential role in modulating the risk of neurological diseases.

Expected Outcome

This project aims to shed light on the role of the nasal microbiota in neurological diseases, potentially leading to novel diagnostic and therapeutic strategies. By understanding the dynamics of nasal microbiota stability, we hope to uncover new insights into preventing and treating neurodegenerative conditions.

References

  1. García-Jiménez, Beatriz, et al. Computational and Structural Biotechnology Journal 19 (2021): 226-246.
  2. Xie, Jin, et al. Pharmacological Research 179 (2022): 106189.
  3. Thangaleela, Subramanian, et al. Microorganisms 10.7 (2022): 1405.

Contact: Dr. Reihaneh Mostolizadeh
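The comparison of composition and diversity mentioned in the methodology could start from a standard diversity measure. The following is a minimal sketch, assuming genus-level read counts per sample are available; the counts below are made up for illustration and the choice of the Shannon index is an assumption, not something the project description prescribes.

```python
import math


def shannon_diversity(counts):
    """Shannon index H = -sum(p_i * ln(p_i)) over taxa with non-zero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Hypothetical read counts per genus for two nasal samples (illustrative only).
sample_a = [40, 30, 20, 10]
sample_b = [85, 10, 5, 0]

h_a = shannon_diversity(sample_a)
h_b = shannon_diversity(sample_b)
print(f"Shannon diversity: sample A = {h_a:.2f}, sample B = {h_b:.2f}")
```

A sample dominated by a single genus gets a lower index than an evenly distributed one, which makes the value a simple first summary when comparing data sets collected from different studies.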

Automated reconstruction of high-quality genome-scale models using machine learning (B.Sc. or M.Sc.)

Background

Genome-scale metabolic models (GEMs) are essential in biological research and biotechnological development, as they enable the comprehensive analysis of metabolic networks and fluxes. Reconstructing a high-quality GEM involves a detailed workflow of 96 steps (6). Despite the standard protocols and operating procedures available for GEM construction, the process remains time-consuming. This has led to recent efforts aimed at automating the reconstruction steps. Researchers have developed various protocols that combine automated steps to streamline the reconstruction and refinement of GEMs. In recent years, machine learning (ML) has played a significant role in the reconstruction and analysis of GEMs, enhancing their quality and accuracy (1, 4).

Objective

This project aims to develop an automated protocol for reconstructing high-quality genome-scale models using available ML approaches. We have compiled all available literature focusing on the application of ML in the reconstruction of GEMs. By integrating these ML-based methods into a cohesive automated procedure, we intend to facilitate the reconstruction and refinement of GEMs.

Methodology

  1. Literature review and compilation: Gather and analyze literature on ML approaches used in GEM reconstruction.
  2. Automation protocol development: Combine the identified ML-based steps into an automated workflow.
  3. Comparison and selection: As a first step, for organisms with multiple annotated genomes, compare the annotations and select the most comprehensive one.
  4. GEM reconstruction: Apply the automated protocol to reconstruct the GEM.
  5. Refinement using ML: Refine the reconstructed GEM with ML approaches for tasks such as gap filling, pathway prediction (2), gene essentiality prediction (5), and EC number prediction (3).

Expected Outcome

This project will result in an automated, ML-based protocol for GEM reconstruction. It will allow for comparing different ML approaches and improve the efficiency and quality of GEMs.

References

  1. Kim, Yeji, Gi Bae Kim, and Sang Yup Lee. "Machine learning applications in genome-scale metabolic modeling." Current Opinion in Systems Biology 25 (2021): 42-49.
  2. Dale, Joseph M., Liviu Popescu, and Peter D. Karp. "Machine learning methods for metabolic pathway prediction." BMC Bioinformatics 11 (2010): 1-14.
  3. Ryu, Jae Yong, Hyun Uk Kim, and Sang Yup Lee. "Deep learning enables high-quality and high-throughput prediction of enzyme commission numbers." Proceedings of the National Academy of Sciences 116.28 (2019): 13996-14001.
  4. Zampieri, Guido, et al. "Machine and deep learning meet genome-scale metabolic modeling." PLoS Computational Biology 15.7 (2019): e1007084.
  5. Hasibi, Ramin, Tom Michoel, and Diego A. Oyarzún. "Integration of graph neural networks and genome-scale metabolic models for predicting gene essentiality." npj Systems Biology and Applications 10.1 (2024): 24.
  6. Thiele, Ines, and Bernhard Ø. Palsson. "A protocol for generating a high-quality genome-scale metabolic reconstruction." Nature Protocols 5.1 (2010): 93-121.

Contact: Dr. Reihaneh Mostolizadeh

Comparative genome analysis of Streptococcus agalactiae (GBS) from elephants (M.Sc.)

Background

Group B streptococci (GBS) are fairly common. In livestock, they are the causative agent of an udder inflammation most often seen in dairy cows.

In elephants, S. agalactiae is associated with paronychia. Under human care, elephants are known to reach a high age. This comes with an age-related decline of their immune system, which can cause usually harmless skin or foot diseases to become chronic. Better knowledge of these bacterial infections is a vital foundation for optimized treatments and therapeutic approaches.

In a recent study by the "Hessische Landeslabor" (Hesse State Laboratory, LHL), several S. agalactiae isolates were compared using microbiological methods, and extensive biochemical profiles were created. Notably, the serotype could not be determined for a high number of isolates. For this reason, some isolates were sequenced so that a full comparative genome analysis can be performed using the latest bioinformatics methods.

Thesis aims

  • Implementation of typical bioinformatic analyses (assembly, mapping, annotation, ...)
  • Comparative analysis of GBS isolates (ABR, pan- and core genome, virulence factors, ...)
  • Closer inspection of the genes used for serotyping
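As an illustration of the pan- and core genome comparison named in the aims, here is a minimal sketch that assumes each isolate has already been reduced to a set of gene-cluster identifiers (for example by an orthology clustering step). The isolate names and gene identifiers are hypothetical.

```python
def pan_and_core(gene_sets):
    """Return (pan genome, core genome) for a dict mapping isolate -> set of gene clusters."""
    genomes = list(gene_sets.values())
    if not genomes:
        return set(), set()
    pan = set().union(*genomes)        # genes found in at least one isolate
    core = set.intersection(*genomes)  # genes shared by every isolate
    return pan, core


# Hypothetical gene-cluster content of three isolates (illustrative only).
isolates = {
    "isolate_1": {"geneA", "geneB", "geneC"},
    "isolate_2": {"geneA", "geneB", "geneD"},
    "isolate_3": {"geneA", "geneB", "geneC", "geneE"},
}

pan, core = pan_and_core(isolates)
accessory = pan - core  # genes present in some, but not all, isolates
print(sorted(core), sorted(accessory))
```

The accessory set is often the interesting part in such a comparison, since resistance genes, virulence factors and serotype-specific genes tend to vary between isolates.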

Prerequisites

  • Interest in solving biological/veterinary questions using bioinformatics
  • Extensive knowledge of the Linux command line
  • Ability to work independently and methodically

Contact: Linda Fenske

Workflow Design (Nextflow) (M.Sc.)

Analysing (bacterial) sequence data for biological/medical questions often means repeating certain standard processes (QC, assembly, annotation, etc.).

To improve reproducibility and simplify these processes, flexible pipelines with a wide palette of tools are used. Often Nextflow (or a similar workflow tool) is used to support a variety of environments or to simplify installation.

With DSL2, Nextflow recently introduced a significant evolution of the Nextflow language, which promises better scalability and modularization of pipelines, along with a better design of workflows.

  • Revision and updating of an existing workflow for analysing bacterial data
  • Migration of the workflow from Nextflow DSL1 to DSL2
  • Visualising the results (creating a GUI)

Prerequisites 

  • Knowledge of Nextflow or motivation to become acquainted with Nextflow
  • Programming knowledge in Python, Groovy (Nextflow) or similar
  • Knowledge and interest in visualisation and processing of data

Platon Bioinformatics Tool Enhancement for Faster Plasmid Identification (M.Sc.) - taken

Modern high-throughput sequencing devices enable the rapid determination of sequence data obtained from interacting microbial communities without a prior cultivation step. This provides easy access to genetic information from otherwise unculturable microbiota. (Computational) interpretation of such data relies either on assigning raw sequencing reads to their source organisms in order to infer taxonomic origin or gene-coding content, or on assembling the metagenome datasets, thereby recovering longer contiguous DNA stretches of the underlying microbial genomes.

Assembled metagenomic contigs are typically clustered (most often, depending on coverage or nucleotide composition), yielding individual draft or complete genomes of novel bacterial species. In this process, however, contigs of non-chromosomal origin such as plasmids are often overlooked.

Still, the analysis of plasmids is of utmost importance, since they constitute a key mechanism of horizontal gene transfer between microbial hosts. They are known to harbor genes that are beneficial or important for microbial fitness or survival under certain environmental conditions (e.g. in the presence of certain antimicrobial agents), or that enable their hosts to perform metabolic processes they otherwise could not (e.g. degradation of novel substrates).

Several bioinformatics applications have been developed for the computational identification of plasmid-borne contigs, most typically focusing on the extraction of plasmid contigs from the assemblies of individual draft genomes. Among these tools are Platon (Schwengers et al., 2020), PlasClass (Pellow et al., 2020) and PlasFlow (Krawczyk et al., 2018), of which Platon exhibits excellent performance, but its runtime characteristics currently impede its application to potentially large metagenome assemblies.

  • Overhaul of the Platon code base, switching from a contig-centered approach to one based on bulk data processing in order to significantly decrease overall runtime.
  • Inlining of certain sub-analysis steps, such as circularity testing, into the Python codebase instead of relying on the invocation of external tools (Pyrodigal, pyHMMER, PyTrimal)
  • Conditional tool execution: Do not invoke additional tools if preceding steps already exclude a sequence from being a plasmid
  • Runtime and performance assessment with regard to the original implementation
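Two of the aims above, in-process circularity testing and conditional tool execution, can be sketched in plain Python. This is a simplified illustration under stated assumptions, not Platon's implementation: circularity is approximated by an exact prefix/suffix overlap, the function names are hypothetical, and the length cutoffs are placeholder values.

```python
def end_overlap(seq, min_len=20):
    """Length of the longest exact prefix/suffix overlap of at least min_len bases, else 0."""
    for k in range(len(seq) // 2, min_len - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0


def first_failed_check(contig, checks):
    """Run (name, predicate) checks in order and stop at the first one that fails."""
    for name, predicate in checks:
        if not predicate(contig):
            return name  # later, more expensive checks are never invoked
    return None


expensive_calls = []


def expensive_annotation(contig):
    # Stand-in for a costly step such as gene prediction or HMM searches.
    expensive_calls.append(len(contig))
    return True


checks = [
    ("length", lambda c: 1_000 <= len(c) <= 500_000),  # hypothetical cutoffs
    ("annotation", expensive_annotation),
]

head = "ACGTTGCAAGGCTTAACCGGTTAGC"      # 25 bp repeated at both contig ends
circular_like = head + "T" * 40 + head
short_contig = "ACGT" * 50              # 200 bp, fails the length check

overlap = end_overlap(circular_like)
rejected_by = first_failed_check(short_contig, checks)
print(overlap, rejected_by, expensive_calls)
```

Ordering the checks from cheap to expensive means that contigs excluded early never trigger the costly step, which is where most of the runtime saving on large metagenome assemblies would be expected. A real circularity test would also need to tolerate mismatches in the overlap.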

Requirements

  • Familiarity with Linux and (modular) Python programming (incl. unit testing)
  • Methodical way of working
  • Able to work independently

Contact: Oliver Schwengers

Develop and Compare Curare Modules for Different DGE Libraries (M.Sc.)

Differential gene expression analysis (DGE) is a commonly used method in RNA sequencing, in which the expressions of different genes in samples from different conditions are statistically compared to identify relevant genes in stress or defense situations. To simplify the execution of these analyses, the software Curare was developed.

Currently, the R library DESeq2 is used for the statistical evaluation of expression data, but there are also alternative libraries such as edgeR or Limma that pursue similar or completely different statistical approaches.

This Master's thesis aims to write, compare, and combine Curare modules for various DGE libraries. This requires working with different R libraries, integrating the evaluation into Curare (written in Snakemake), and visualizing the results in an HTML report.

  • Write Curare modules for different DGE libraries and compare and combine them.
  • Learn about different R libraries for statistical analysis of expression data.
  • Integrate the analysis in Curare (written in Snakemake) and visualize the results in an HTML report.
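One simple way to compare and combine the results of several DGE libraries is to look at the overlap of the genes each one reports as significant. The sketch below assumes the per-tool gene lists have already been exported from R; the gene identifiers are hypothetical and the majority-vote rule is only one possible way of combining results.

```python
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two gene sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def consensus(calls, min_tools=2):
    """Genes reported as significant by at least min_tools of the libraries."""
    votes = Counter(gene for genes in calls.values() for gene in genes)
    return {gene for gene, n in votes.items() if n >= min_tools}


# Hypothetical significant genes per library (illustrative only).
calls = {
    "DESeq2": {"g1", "g2", "g3"},
    "edgeR": {"g2", "g3", "g4"},
    "limma": {"g3", "g5"},
}

pairwise = {(a, b): jaccard(calls[a], calls[b]) for a, b in combinations(calls, 2)}
shared = consensus(calls)
print(pairwise, sorted(shared))
```

Pairwise similarities show how strongly the libraries agree, while the consensus set gives a conservative gene list that could be highlighted in the HTML report.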

Contact: Patrick Blumenkamp

Reconstruction and visualization of KEGG metabolic pathways in the EDGAR platform (M.Sc.)

EDGAR is a web-based platform for analyzing microbial data. It is developed by employees of the Bioinformatics and Systems Biology department at JLU Giessen and provides multifaceted methods for investigating genomes.

KEGG (Kyoto Encyclopedia of Genes and Genomes) provides curated databases and resources for (among other things) the functional annotation and classification of genes. In previous projects, KEGG functional categories for all organisms and their corresponding genes were computed in the EDGAR platform. These are currently displayed directly in two analysis modules, in purely quantitative terms.

MinPath is a program for reconstructing biological/metabolic pathways. It attempts to infer a minimal set of metabolic pathways that can explain the genes found in a given dataset, excluding pathways that are redundant. The above-mentioned KEGG categories will be used as input for this program.
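The idea of a minimal set of pathways that explains the observed genes can be illustrated with a small set-cover example. The sketch below uses a greedy heuristic, which is a substitute chosen for brevity and not MinPath's own optimisation method; the pathway names and function identifiers are hypothetical.

```python
def greedy_min_pathways(observed, pathways):
    """Greedily pick pathways until every explainable observed function is covered."""
    uncovered = set(observed) & set().union(*pathways.values())
    chosen = []
    while uncovered:
        # Sorting the names first makes tie-breaking deterministic.
        best = max(sorted(pathways), key=lambda p: len(pathways[p] & uncovered))
        chosen.append(best)
        uncovered -= pathways[best]
    return chosen


# Hypothetical pathway definitions: pathway -> set of functional identifiers.
pathways = {
    "pathA": {"K1", "K2", "K3"},
    "pathB": {"K3", "K4"},
    "pathC": {"K2", "K3"},
}
observed = {"K1", "K2", "K3", "K4"}

minimal = greedy_min_pathways(observed, pathways)
print(minimal)
```

In this toy input, pathC is never selected because every function it contains is already explained by pathA, which is the kind of redundancy a parsimony approach is meant to remove.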

The goal of the project is to develop a comparative analysis module, based on KEGG pathway information, for the EDGAR platform.

Thesis Aims

  • Parse the available KEGG data in a structured manner and compute KEGG metabolic pathways for all given genomes in EDGAR using MinPath.
  • Design comparative visualizations for the EDGAR frontend using the resulting data, allowing users to interactively explore their data (see fig. 4 here as an example)
  • Adjust the project scope in consultation with the student depending on the project status to accommodate shared ideas, as EDGAR incorporates a wide selection of data with potential for creative analysis methods.

Requirements

  • Programming skills in Python and JavaScript (can also be learned during the process)
  • Basic SQL database knowledge

PlasmidHunter: Validation of a metagenome-based plasmid search using public plasmid sequences (M.Sc.)

Plasmids play an important role in the genetic variability of organisms. They replicate independently and can be transferred between organisms, both within and between species. Therefore, plasmids are key drivers of horizontal gene transfer. Often, they are the effective and only difference between commensal and pathogenic bacterial strains. In recent years, it has become obvious that plasmids are among the main mechanisms for the dissemination of antimicrobial resistance and hence are of special interest in medical microbiology. Detecting plasmids and analyzing their dissemination is an important epidemiological and scientific topic that might help to detect current and prevent future outbreaks of antibiotic resistance.

One promising data source containing known and unknown plasmids is whole-metagenome datasets of samples from different sources (soil, waste water, the human gut). For many of these samples, sequencing data is freely accessible in public databases, often annotated with additional meta information such as date, source and location of each sample.

Our project processes these datasets from the MGnify database in a standardized way via modern cloud technologies and makes them accessible to users for a fast search of new plasmids within this huge amount of data.

This Master's thesis should validate this search using existing plasmid databases (such as PLSDB) and analyze the search results, including comprehensive visualizations.

  • Implementation of a workflow to process PLSDB entries with our existing search workflow
  • Statistical analysis of the results and screening for potentially interesting candidates for further analysis
  • Visualization of the results
  • Knowledge of command line tools and Python
  • Interest in cloud technologies
  • Prior experience with workflow systems, like Nextflow or Snakemake

Contact: Sebastian Beyvers

Webservice for searching gene families in plants (M.Sc.)

The input is a list of protein sequences. In step 1a, a Pfam search is performed with the sequences to find common domains. In step 1b, a multiple sequence alignment of the sequences is calculated. The conserved regions are automatically extracted from the alignment to calculate HMMs. In step 2, the HMMs of the domains from 1a and 1b are used to search a database of plant proteins.

  • The results are visualized and made available for download
  • Steps 1 and 2 are also provided as a command-line tool
  • The programming language(s) and frameworks can be freely chosen
  • Test data will be provided
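Step 1b, the automatic extraction of conserved regions from a multiple sequence alignment, could be approached as follows. This is a minimal sketch under assumed thresholds (at least 80% identity per column, blocks of at least three columns); the function name and the toy alignment are hypothetical, and building HMMs from the extracted blocks would be done with a separate tool.

```python
from collections import Counter


def conserved_blocks(alignment, min_identity=0.8, min_len=3):
    """Return (start, end) column ranges, end exclusive, of conserved alignment blocks."""
    n_cols = len(alignment[0])
    flags = []
    for i in range(n_cols):
        column = [seq[i] for seq in alignment]
        residue, count = Counter(column).most_common(1)[0]
        # A column is conserved if its most frequent symbol is not a gap
        # and reaches the identity threshold.
        flags.append(residue != "-" and count / len(column) >= min_identity)

    blocks, start = [], None
    for i, conserved in enumerate(flags + [False]):  # sentinel closes a trailing block
        if conserved and start is None:
            start = i
        elif not conserved and start is not None:
            if i - start >= min_len:
                blocks.append((start, i))
            start = None
    return blocks


# Toy alignment of three protein sequences (illustrative only).
alignment = [
    "MKTAYIAKQR",
    "MKTA-LAKQR",
    "MKSAYVAKQR",
]

blocks = conserved_blocks(alignment)
print(blocks, [alignment[0][s:e] for s, e in blocks])
```

Both thresholds trade sensitivity against specificity: looser values yield more and longer blocks, and therefore more HMMs to search with in step 2.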

Contact: Oliver Rupp

Ribosomal binding site prediction based on 16S-rRNA (M.Sc.)

Bacterial translation is initiated by the assembly of ribosomal proteins as part of the translation initiation complex at the coding sequence (CDS) start site. For most CDS, there is a ribosomal binding site (RBS) immediately upstream of the gene, consisting of a 5-10bp spacer and a (partial or complete) Shine-Dalgarno sequence (SD) 5’-AGGAGG-3’ to which the ribosome binds. However, some genes have neither an SD nor a known RBS and are still expressed (Omotajo, D. et al. , 2015) . The Shine-Dalgarno sequence was first described in E. coli but is found in many bacterial genomes and is complementary to the anti-SD sequence at the 3′-end of 16S-rRNA.

The exact Shine-Dalgarno and spacer sequences vary between bacterial species. However, because the anti-Shine-Dalgarno sequence is present in the 16S-rRNA of each bacterial genome, it can be used to predict RBS in a species-independent manner.  Therefore, a deep learning approach using the 16S-rRNA sequences and the sequence upstream of the CDS is promising for accurately predicting the presence of RBS independent of species-specific variants.

  • Design and implementation of a neural network for ribosomal binding site prediction in bacteria,
  • evaluation of the features used by the neural network, and
  • analysis of the presence of RBS in exemplary bacterial genomes
  • Prior experience with deep learning frameworks such as Tensorflow/Keras, or willingness to learn them
  • Prior experience in the development of documented code and dependency management or willingness to learn them
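The complementarity between the Shine-Dalgarno sequence and the 3′-end of the 16S-rRNA can be made concrete with a simple string-matching baseline. This is not the deep learning approach proposed for the thesis, only an illustration of the underlying signal. The 16S tail is written in the DNA alphabet, and the upstream sequence and function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse complement of a DNA string over the alphabet A, C, G, T."""
    return dna.translate(COMPLEMENT)[::-1]


def find_sd(upstream, rrna_tail, min_match=4):
    """Longest exact match between the upstream region and the reverse complement
    of the 16S 3' tail. Returns (motif, spacer length to the start codon) or None."""
    target = reverse_complement(rrna_tail)
    for k in range(len(target), min_match - 1, -1):
        for i in range(len(target) - k + 1):
            motif = target[i:i + k]
            pos = upstream.rfind(motif)  # prefer the match closest to the start codon
            if pos != -1:
                return motif, len(upstream) - (pos + k)
    return None


rrna_tail = "ACCTCCTTA"             # 3' end of a 16S rRNA gene, DNA alphabet
upstream = "CATTACAAGGAGGTTTAAAC"   # hypothetical 20 bp directly upstream of a CDS

hit = find_sd(upstream, rrna_tail)
print(reverse_complement(rrna_tail), hit)
```

Because the expected motif is derived from the 16S sequence of the genome at hand, the same code works for species whose Shine-Dalgarno variant differs from the canonical one. Exact matching is the main limitation, since it cannot score partial or mismatched binding sites, which is one motivation for a learned model.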

Contact: Julian Hahnfeld

Integrative Omics FAIR Workflow (M.Sc.)

Background

Processing and analysing 'omics data often requires applying predefined building blocks of code, e.g. for performing quality control, statistical analysis or machine learning. However, biologists and ecologists are often overwhelmed by the technical complexity of programmatic approaches and interfaces. Hence, scientific workflows can not only automate but also facilitate important recurring processes in high-throughput 'omics analysis.

The existing modularized iESTIMATE pipeline aims at automating and facilitating the complex analysis of ecological metabolomics data, the integration with other phenomics data and the preparation for sequencing and (meta-)genomics data. The central aim of the pipeline is to extract so-called molecular traits that explain molecular mechanisms in plants or microorganisms.

Thesis Aims

  • Revision and modularisation of existing code to create the R package "iESTIMATE"
  • Implementing a workflow in Nextflow or Common Workflow Language (CWL) using test data, implementing unit tests and capturing provenance information
  • Publishing the R package and the workflow following the FAIR principles
  • Knowledge of R and a bit of Python
  • Knowledge of Linux command line, containers, Nextflow (Groovy), YAML, or motivation to become acquainted with them
  • Keen interest in analysis of integrative 'omics data and in topics in molecular ecology

Contact: Kristian Peters

Aarhus University

Bioinformatics Research Centre

Master's thesis in bioinformatics.

In the Master’s program in bioinformatics, you must do a 30 ECTS Master’s thesis. You must start your 30 ECTS thesis no later than February 1 (or September 1) a year and a half after commencement of your studies (i.e. February 2021 for students admitted in summer 2019, or September 2021 for students admitted in winter 2020). You must complete your thesis (including the exam) no later than June 30 the same year if you started on February 1 (or January 31 the following year if you started on September 1).

You can read the course description for the MSc thesis project at:

kursuskatalog.au.dk/en/course/114372/Thesis-30-ECTS-Bioinformatics

You can read some general information and advice about Master’s thesis work at:

https://studerende.au.dk/en/studies/subject-portals/bioinformatics/masters-thesis/masters-thesis/

You can see abstracts of (some) Master's theses from BiRC at:

https://www.birc.au.dk/~cstorm/birc-msc/birc-msc.html

Thesis contract

Before you start your thesis, you must make a thesis contract. The thesis contract must be completed and approved by January 15 (or August 15). You can read about how to submit the contract on the above web page. As part of the thesis contract, you must attach a PDF file containing a project description, project goals, an activity plan, and a supervision plan. This is very much like what you have to describe for a Project in Bioinformatics. At BiRC, you should use the following template for this description.

Problem statement, activity plan, and supervision plan (in docx format)

When formulating the thesis project, you should keep in mind that it should cover 30 ECTS of work, i.e. full-time work for the entire semester and the following exam period. Group projects should of course cover this for every group member.

Choosing a topic

Before you can make a thesis contract and commence your thesis work, you must (of course) choose a topic and a supervisor. The supervisor must be a tenured researcher associated with BiRC, but you can also have one or more co-supervisors.

When choosing a thesis topic, it is a good idea to think about the classes and projects that you have done during your Master’s studies and about what kind of work you like. Contact potential supervisors as early as possible to discuss your wishes and ideas. Remember that you are always welcome to come by our offices and discuss. You can also ask potential supervisors for examples of theses that they have supervised in order to get a better idea of what a thesis can look like.

Also, we plan an information meeting for students that focus on thesis and project work every Fall. Below are the slides from the last such information meeting.

Slides from MSc info meeting (November 2023)

Ten simple rules for writing a great MSc thesis at BiRC (November 2022)

The slides also contain good advice about how to organize your thesis work. The above www page also contains some advice.

Group projects: It is possible to do the thesis project as a group project. Each group member must fill out an individual contract stating the other group members. A group hands in a single thesis, but each group member is examined individually. In general, we very much encourage group assignments, as many students find it motivating to work together in a group and to have group members with whom to discuss and solve the many details of a thesis project.

Projects involving external collaborators: It is possible to do a project that involves external collaboration, e.g. with people from industry or from other university departments. Such collaborators will be associated with your thesis as co-supervisors. In the thesis contract, it is possible to indicate that the thesis project is done in collaboration with an industrial partner, whether an NDA has been signed, and whether the final thesis report must be made publicly available.

The thesis report presents the completed work and can be written in Danish or English. The report must contain an English summary/abstract. The summary/abstract is included in the assessment, and the assessment places emphasis on the academic content, as well as the student’s spelling and writing skills. The extent of the thesis report is agreed with the supervisor, but is typically about 50-60 pages excluding front page, table of contents and appendices. If the MSc thesis is done as a group project, the report must be done in such a way that the group members can be assessed individually. This means that you can either (1) do a joint report in which everyone is equally responsible for all parts of the report, or (2) do a joint report in which it is stated (e.g. in the table of contents) which of you has done the individual parts of the report and is responsible for them. See https://studerende.au.dk/en/studies/subject-portals/bioinformatics/masters-thesis/masters-thesis/ under "Group assignment" for details.

In your thesis contract, you state the hand-in date. This can be between June 1 and 15 (or January 1 and 15); earlier dates are also possible. The exact date is (of course) decided in collaboration with your supervisor. You hand in your thesis via Digital Exam (like you are used to for Projects in Bioinformatics).

The thesis exam is a 60 min oral exam. It starts with a 30 min presentation from you about your thesis work, followed by a 30 min discussion between you, the examiner (your supervisor), and an external examiner. Your presentation is based upon a question that you get from your supervisor one week before the exam. The exam must be held before June 30 (or January 31). In principle, the exam can be held from the day after you hand in your thesis. The exact date is decided upon by your supervisor, and often depends on the availability of external examiners. The final grade reflects an overall assessment of your report, your presentation, and your discussion.

If you have any questions about thesis work, then you are always welcome to ask!


Richard and Loan Hill Department of Biomedical Engineering

Colleges of Engineering and Medicine

MS in Bioinformatics

Required Semester Hours: 36

Thesis track

The thesis track is designed for MS in Bioinformatics students who are interested in conducting research. This track is strongly advised if you may be interested in pursuing a PhD in the future.

Researching and writing a master’s thesis is an academically intensive process that takes the place of 8 credits of traditional coursework. Students work with a faculty advisor to choose a topic of interest, engage in high-level study of that topic, and develop a paper that is suitable for presentation at a conference or submission to a journal.

The thesis experience provides definition to your master’s degree experience and can bolster your application for jobs or doctoral-level study by demonstrating your capabilities.

In the thesis option, you will earn 8 credits in BME 598 Master’s Thesis Research and at least 28 credit hours from coursework. At least 12 of your coursework credits must come from courses at the 500 level, excluding BME 595, BME 596, and BIOE 598. You may be allowed limited credit hours from BME 596 Independent Study with department approval. There is no comprehensive examination.

Recent UIC master’s thesis projects in bioinformatics include:

Thesis titles

Nikita Dsouza

Strategies for Identification of Small Molecule Inhibitors of Ad2 E3-19K/HLA-A2 Binding Interaction

A Statistical Framework for GeneSet Enrichment Analysis based on DNA Methylation and Gene Expression

Navya Josyula

Identifying Ligand Binding Sites of Proteins using Crystallographic Bfactors and Relative Pocket Sizes

Non-thesis track

In the non-thesis track, you earn all of your required 36 credit hours from coursework. Of these, 16 must be from courses at the 500 level. There is no comprehensive examination.

Across-the-board requirements

  • 1 hour of BME 595
  • Present at least one seminar (BME 595) before graduation
  • Students entering the program without an undergraduate degree in bioengineering or biomechanical engineering must also take BME 480, BME 481, and BME 530

MS alumni in their own words

Daiqing Chen ’21 MS in Bioinformatics

What led you to choose bioinformatics for your MS degree? How do you think computational technology is changing biomedical engineering? I was doing molecular biology during my undergrad. Wet lab experiments are very time- and money-consuming. I have seen people using bioinformatics methods to solve biological questions, and I want to be able to use them. I actually don’t know much about engineering, but I believe a computational method can be useful for any field. The high efficiency allows people to do more things than ever before.

What are your plans for once you have completed your degree? I am planning on working as a research assistant in biological lab, most likely doing research about cancer. My time at UIC helped me get more familiar with American culture.

Have you worked in any labs? Yes, the Computational Functional Genomics Laboratory. I did a project to validate machine learning models that predict kidney function decline. I also worked on high-throughput single-cell sequence analysis.

Your primary hobby/outside interest: Playing badminton.

Favorite restaurant in Chicago: Minhin’s cuisine for the dim sum.

Additional information

  • MS in Bioinformatics course checklist: thesis track
  • MS in Bioinformatics course checklist: non-thesis track
  • MS in Bioinformatics graduate catalog page
  • UIC Graduate College admissions
  • Important deadlines for BME graduate students

Thesis Preparation and Filing: Staff from the University Archives and the UCLA Graduate Division present information on University regulations governing manuscript preparation and completion of degree requirements. Students should plan to attend at least one quarter before they plan to file a thesis or dissertation. More information is found at https://grad.ucla.edu/gasaa/library/thesisintro.htm

The official UCLA manuscript preparation guide for PhD Dissertations can be found at https://grad.ucla.edu/gasaa/etd/thesisguide.pdf

Repository of the master project in Bioinformatics at Lund University

mostafa-ti/Master_thesis_bioinformatics

Investigation of age-related changes in neuroblast populations using RNA velocity

Project overview

README file for a master project in Bioinformatics

  • BINP52 - Master project - 60 ECTS
  • Master programme in Bioinformatics
  • Department of Biology, Faculty of Science, Lund University, Sweden

This work aimed to understand the signature of aging on the neurogenesis process in the brain. For this purpose, three regions of the mouse brain, the Dentate Gyrus (DG), Subventricular Zone (SVZ), and Olfactory Bulb (OB), were compared across three age groups: young (3 months), adult (14 months), and aged (24 months). This project provided an RNA velocity map of the neurogenic niches of mice, together with substantial new knowledge about the biology of newly born neuroblasts and how they are affected by aging. Overall, our data advanced our understanding of the aging signature on the neurogenesis process in the brain.

  • Student: Mostafa Torbati
  • Supervisors: Henrik Ahlenius ( [email protected] ) and Jonas Fritze ( [email protected] )
  • Institute: Department of Clinical Sciences Lund, Division of Neurology, Lund Stem Cell Center, Lund University, Sweden
  • Corresponding author: Mostafa Torbati

Hardware specifications

The following hardware was used:

  • Apple MacBook Pro (Mid 2014, macOS Big Sur Version 11.7) used as a local computer.
  • LUNARC's Aurora HPC Desktop service (high-performance computing), used for preprocessing and visualizing the raw data.

Software versions

  • Python 3.8.2
  • JupyterLab 2.2.8
  • 10x Genomics Cell Ranger v6.0
  • Scanpy v1.7.2
  • Velocyto v0.17.17
  • ScVelo v0.2.4

Installation

In this project, we work with large data sets and software such as Cell Ranger, which require high-performance computing (HPC) power. All the bioinformatics workflows for this project were run on the LUNARC Aurora HPC Desktop service (LUNARC is the center for scientific and technical computing at Lund University). The software is pre-installed on the LUNARC clusters, so users can load the packages that a project requires. To manage the compute resources efficiently, jobs are submitted through the SLURM job scheduler: put simply, we write a bash script that loads all the necessary packages and runs the commands, and SLURM sends it to the system backend. Here is an example of a bash script for submitting a job on Aurora:

The script is then submitted to the scheduler using the command:

Project Flowchart

Data preparation with Cell Ranger workflow

Build a custom reference.

For this project, we used transgenic mice in which Enhanced Green Fluorescent Protein (EGFP) is expressed under the control of the DCX promoter. As EGFP is not included in the pre-built reference genome, we first created a custom reference genome and added EGFP to an existing mouse reference.

The cellranger mkref command is used to build a custom reference for use with the Cell Ranger pipeline.

Prepare input files

  • The following files are required to build the reference:
  • FASTA file containing the genome sequences
  • GTF file containing the gene annotation

get the FASTA file

get the GTF file

filter the GTF file

GTF files can contain entries for non-polyA transcripts that overlap with protein-coding gene models. To remove these entries from the GTF, we use cellranger mkgtf and add the filter argument --attribute=gene_biotype:protein_coding to the mkgtf command:
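To illustrate what such an attribute filter does, here is a minimal Python sketch that keeps only the GTF records whose attribute column carries gene_biotype "protein_coding". It is a simplified stand-in for illustration, not a replacement for cellranger mkgtf, whose exact behaviour may differ; the function name and the toy records are hypothetical.

```python
import re

# Matches the gene_biotype attribute in the 9th GTF column.
BIOTYPE_PATTERN = re.compile(r'gene_biotype "([^"]+)"')


def filter_gtf_by_biotype(lines, biotype="protein_coding"):
    """Keep comment lines and records whose gene_biotype equals `biotype`."""
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)  # header/comment lines pass through
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # skip malformed records (assumption of this sketch)
        match = BIOTYPE_PATTERN.search(fields[8])
        if match and match.group(1) == biotype:
            kept.append(line)
    return kept


# Toy records, invented for illustration only.
example = [
    "#!genome-build toy",
    'chr1\tensembl\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_biotype "protein_coding";',
    'chr1\tensembl\tgene\t950\t1200\t.\t-\t.\tgene_id "G2"; gene_biotype "lncRNA";',
]
filtered = filter_gtf_by_biotype(example)
```

In this toy input, the lncRNA record is dropped while the comment line and the protein-coding record are kept.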

Setup the command for cellranger mkref

Add a marker gene (EGFP) to the FASTA and GTF

Get the EGFP gene sequence (EGFP complete sequence)

Edit the header of the EGFP FASTA file to make it more informative

We need to count the number of bases in the sequence:

The result of this command shows that there are 4733 bases. This number is needed to create the custom GTF for EGFP.
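One way to count the bases of a FASTA record is sketched below in Python: header lines (starting with ">") are skipped and the lengths of the remaining sequence lines are summed. The function name and the toy record are hypothetical; this is not the command used in the project.

```python
def count_bases(fasta_text):
    """Count sequence characters in FASTA text, ignoring header and empty lines."""
    total = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        total += len(line)
    return total


# Toy record for illustration; the real EGFP file has 4733 bases.
toy_fasta = ">EGFP toy record\nATGGTGAGCA\nAGGGCGAG\n"
n_bases = count_bases(toy_fasta)
```

Applied to the real EGFP.fa, the returned number is the value to use as the end coordinate in the custom GTF.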

Now make a custom GTF for EGFP with the following command:

Note: we need to insert the tabs that separate the 9 columns of information required for GTF.
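As a hedged illustration of the nine-column, tab-separated layout, the Python sketch below assembles a single GTF record spanning a 4733-base EGFP sequence. The sequence name, source, feature type, and attribute values are assumptions chosen for illustration and may differ from the record actually used in this project.

```python
def make_gtf_record(seqname, length, gene_id,
                    source="unknown", feature="exon", strand="+"):
    """Build one GTF line covering bases 1..length of `seqname`."""
    # Attribute keys are assumed here; adapt them to the reference in use.
    attributes = (
        f'gene_id "{gene_id}"; transcript_id "{gene_id}"; '
        f'gene_name "{gene_id}"; gene_biotype "protein_coding";'
    )
    columns = [
        seqname,      # 1: sequence name (must match the FASTA header)
        source,       # 2: annotation source
        feature,      # 3: feature type
        "1",          # 4: start (1-based)
        str(length),  # 5: end
        ".",          # 6: score
        strand,       # 7: strand
        ".",          # 8: frame
        attributes,   # 9: attributes
    ]
    return "\t".join(columns)


record = make_gtf_record("EGFP", 4733, "EGFP")
```

Joining the columns with "\t" guarantees real tab characters between the nine fields, which is the point of the note above.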

This is what the EGFP.gtf file looks like with the cat EGFP.gtf command:

Next, add the EGFP.fa to the end of the M. musculus genome FASTA. But first, make a copy so that the original is unchanged.

Now append EGFP.gtf to Mus_musculus.filtered_gtf_EGFP.gtf as follows:
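The copy-then-append pattern used for both the FASTA and the GTF can be sketched in Python as follows. The helper name and the tiny stand-in files are hypothetical, and the sketch assumes that the original file ends with a newline.

```python
import shutil
import tempfile
from pathlib import Path


def copy_and_append(original, addition, combined):
    """Copy `original` to `combined`, then append `addition` to the copy."""
    shutil.copyfile(original, combined)  # the original stays untouched
    with open(addition, "r") as src, open(combined, "a") as dst:
        shutil.copyfileobj(src, dst)     # stream the extra records onto the copy


# Demonstration with tiny stand-in files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    genome = tmp_dir / "genome.fa"
    egfp = tmp_dir / "EGFP.fa"
    combined = tmp_dir / "genome_EGFP.fa"
    genome.write_text(">chr1\nACGT\n")
    egfp.write_text(">EGFP\nATGG\n")
    copy_and_append(genome, egfp, combined)
    combined_text = combined.read_text()
    original_text = genome.read_text()
```

The same helper applies to the annotation: copy the filtered GTF, then append the EGFP record to the copy.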

Use the cellranger mkref command to create a custom reference directory with EGFP added

Now use the genome_Mus_musculus_EGFP.fa and Mus_musculus.filtered_gtf_EGFP.gtf files as inputs to the cellranger mkref pipeline:

This outputs a custom reference directory called Mus.musculus_genome_EGFP/ .

Converting RAW data to FASTQ files

Raw base call (BCL) files from Illumina sequencers were provided for this project. 10X Genomics describes several scenarios for designing the workflow of an experiment. Each scenario requires a specific Cell Ranger pipeline. Here is the schematic description of our scenario:

Project workflow

In this example, one sample is processed through one GEM well, resulting in one library which is sequenced across multiple flow cells. This workflow is commonly performed to increase sequencing depth.

CellRanger pipeline based on this workflow:

  • Convert the BCL files obtained from the Illumina sequencer to FASTQ files using cellranger mkfastq
  • Run cellranger count, which takes the FASTQ files from cellranger mkfastq and performs alignment, filtering, barcode counting, and UMI counting. It uses the Chromium cellular barcodes to generate feature-barcode matrices, determine clusters, and perform gene expression analysis. Implementing this pipeline allows all reads to be combined in a single instance.

Run cellranger mkfastq for the first sequencing run

mkfastq workflow

Run the command on LUNARC:

A successful run generates an output directory (mkfastq_1stRun in our case); within it, the directory containing the sample folders is named according to the flow cell ID.

Run cellranger mkfastq for the second sequencing run

A successful run generates an output directory (mkfastq_2ndRun in our case); within it, the directory containing the sample folders is named according to the flow cell ID.

Running cellranger mkfastq on the Illumina BCL files of both sequencing runs leaves us with two folders of FASTQ files, one per run, which are used in the next step.

Cell Ranger count pipeline

The count pipeline is the next step after mkfastq. This step takes the FASTQ files from the two sequencing runs and generates a count matrix, which is a table of the number of times each gene was detected in each cell. This matrix is then used as input for downstream analysis, such as identifying differentially expressed genes or clustering cells into groups. The pipeline does this through a series of steps, including quality control, alignment, and quantification of reads.

  • --fastqs: in the cellranger count pipeline, this argument accepts multiple comma-separated paths to FASTQ folders. Doing so treats all reads from the library, across flow cells, as one sample. This approach is essential when the same library has been sequenced on multiple flow cells.
  • --sample: takes the sample name as specified in the sample sheet supplied to cellranger mkfastq.
  • --transcriptome: points to the custom reference genome that we created with cellranger mkref.
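The way these arguments fit together can be sketched in Python by assembling the argument list for one sample, without executing it. The helper name, the run ID, the sample name, and the FASTQ paths are assumptions made for illustration; only the flags described above plus --id are used.

```python
def build_count_command(run_id, sample, fastq_dirs, transcriptome):
    """Return a cellranger count argument list for one sample (not executed)."""
    return [
        "cellranger", "count",
        f"--id={run_id}",
        f"--transcriptome={transcriptome}",
        # Several FASTQ folders are passed as one comma-separated value, so
        # reads from both sequencing runs are treated as a single sample.
        f"--fastqs={','.join(fastq_dirs)}",
        f"--sample={sample}",
    ]


# Hypothetical sample name and paths, for illustration only.
command = build_count_command(
    run_id="DG_young",
    sample="DG_young",
    fastq_dirs=["mkfastq_1stRun/outs/fastq_path", "mkfastq_2ndRun/outs/fastq_path"],
    transcriptome="Mus.musculus_genome_EGFP",
)
```

The resulting list could be handed to subprocess.run inside a SLURM job script once the paths match the actual directory layout.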

For this project, we have nine samples: three age groups from each of three neurogenic niches of the mouse brain, the Dentate Gyrus (DG), Subventricular Zone (SVZ), and Olfactory Bulb (OB). We must repeat the cellranger count pipeline for each region and age group.
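The nine region and age combinations can be enumerated with a short Python sketch, which is convenient when generating one job per sample. The sample naming scheme (region_age) is an assumption for illustration.

```python
from itertools import product

REGIONS = ["DG", "SVZ", "OB"]        # neurogenic niches
AGES = ["young", "adult", "aged"]    # 3, 14, and 24 months


def sample_names():
    """Return one hypothetical sample name per region/age combination."""
    return [f"{region}_{age}" for region, age in product(REGIONS, AGES)]


names = sample_names()
```

Looping over these names gives one cellranger count run, and later one velocyto run, per sample.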

cellranger count generates multiple outputs in different formats, which can be used for many downstream analyses.

Velocyto Run on 10X Chromium samples

We need to combine spliced and unspliced RNA-seq data for the velocity analysis. For this purpose, we use the Velocyto command line tool. Velocyto provides tools for different technologies. We use run10x, as our samples were generated with 10X Chromium technology.

Example bash script on LUNARC

We need to run this code for each of our nine samples. This pipeline generates a folder in the cellranger count output folder, which contains a loom file with the spliced and unspliced counts.

Scanpy and ScVelo

The rest of the analysis is performed in Jupyter Notebook: the prepared data is pre-processed with the Scanpy package, and RNA velocity is computed with ScVelo.

Wolf, F., Angerer, P. & Theis, F. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol 19, 15 (2018).

La Manno et al. (2018), RNA velocity of single cells, Nature.

Bergen et al. (2020), Generalizing RNA velocity to transient cell states through dynamical modeling, Nature Biotech.

Bergen et al. (2021), RNA velocity - current challenges and future perspectives, Molecular Systems Biology.

Copyright © Mostafa Torbati, Department of Clinical Sciences, Division of Neurology, Lund Stem Cell Center, Lund University, Sweden

Bioinformatics: Theses & Dissertations

Links for Theses and Dissertations

  • Proquest Dissertations and Theses Search US theses and dissertations. Accessed through OxLip+, search for 'dissertations and theses'.
  • Oxford Research Archive (ORA) Search for and download recent Oxford DPhil theses. Also contains an archive of articles, papers and research posters produced by academics and researchers at Oxford University. ORA is freely available and does not require a log-in.
  • EThOS Access to UK theses from the British Library [currently unavailable]. To use this service you will be required to set up an individual account.
  • DART-Europe Search European E-theses.

Theses and Dissertations On-line

Electronic collections

A number of recent theses and dissertations prepared at Oxford are available to download from the Oxford Research Archive (ORA). The British Library provides access to UK theses through its EThOS service [currently unavailable]. Already digitised UK theses can be downloaded freely as PDF files. Requests can be made to digitise older theses, but there is a charge for this service and a waiting time of 30 days for digitisation. The British Library no longer provides theses on microfilm.

Finding Oxford Theses

SOLO allows you to search for Theses in the Oxford collections.

1. Navigate to the SOLO homepage.

2. Click on the 'Advanced Search' button.

3. Click the 'Resource Type' menu and choose the 'Theses' option.

4. Type in the title or author of the thesis you are looking for and click the 'Search' button.

Other Relevant Guides

  • ORA: Oxford University Research Archive, by Jason Partridge (last updated Apr 10, 2024)
  • Last Updated: Aug 23, 2024 12:31 PM
  • URL: https://libguides.bodleian.ox.ac.uk/bioinformatics

© Bodleian Libraries 2021. Licensed under a Creative Commons Attribution 4.0 International Licence

Department of Mathematics and Computer Science

Master's Thesis

Master’s thesis with accompanying colloquium (30 credits).

The master’s thesis is meant to prove the student’s ability to work independently on an advanced problem from the bioinformatical field using scientific methods, as well as the student's ability to evaluate the findings appropriately and to depict them both orally and in written form in an adequate manner. (SPO 2019, § 9)

If the study regulations of 2012 apply to you, please have a look here.

If you're looking for a thesis, here are some suggestions.

Unofficial Extract from the Regulations:

  • Students can only be admitted to the master's thesis if they have successfully completed modules totaling 60 credits or more within the master's degree program.
  • To register the master's thesis, please use the form "Registration for the master thesis". You can find it on the pages of the examination office. Important: Be sure to register your master's thesis right at the beginning of your work! Otherwise you risk that the examiner combination or the topic will not be accepted!
  • The master's thesis should be approximately 70 pages in length.
  • The processing time is 23 weeks. Note: An extension is not possible. If your thesis is delayed for an important reason (for which you are not responsible), please contact the Examination Office with the relevant supporting documents.
  • The written part must be written in English.
  • The master's thesis must be evaluated by two authorized examiners. One of the two examiners should be the supervisor of the master's thesis. At least one of the two authorized examiners* must be involved in teaching the master's program and simultaneously be a lecturer at the Department of Mathematics and Computer Science or the Department of Biology, Chemistry, Pharmacy of the Freie Universität Berlin or at Charité.
  • If approved by the examining board, the work on the master's thesis can also be done externally at a suitable business or scientific or research institution, as long as scientific and scholarly supervision by an examiner in the program in bioinformatics is ensured.
  • The master's thesis is accompanied by a colloquium, which usually takes place in the assigned working group during the processing time. Students are expected to give a one-time presentation lasting approximately 30 minutes on the progress of their master's thesis.
  • The master's thesis must be submitted in electronic form (PDF), by e-mail to the examination office. When submitting the thesis, the student must certify in writing that he or she has written the thesis independently and has not used any sources or aids other than those specified. Use the Declaration of Authorship provided by the examination office for this purpose.

*These are usually all PhD scientists involved in teaching in the Master's program in Bioinformatics. However, persons who are not directly involved in teaching may also be authorized. In case of doubt, please contact the examination office, which can check whether a certain examiner or combination of examiners is possible. Note: The two examiners of a master's thesis should come from different working groups.

The Informationen & Anleitungen of the examination office offer further information concerning the registration and submission regulations of the master’s thesis (in German). The registration form is available in English.

Please note: If you have completed all the coursework and only need to finish the master's thesis, you no longer need to be enrolled (but you are allowed to be, of course).

Every summer semester the Mentoring organizes the workshop “How to write a bachelor’s / master’s thesis in bioinformatics”. Here you receive helpful tips and are free to ask your questions.

Here you can find a compilation of important information (FAQ Abschlussarbeit, in German).

Diploma / Master / Bachelor theses and Projects

In the following we list available, currently processed, and finished theses and student projects. When looking for a topic, please check not only the available topics but also the processed and recently finished ones: there might be follow-up theses or projects that are available but not yet announced. So if you find a topic interesting, please contact the corresponding supervisor for further information. Bioinformatics is a highly specialized application area of computer science and biology, and successfully solving research questions in this field requires a lot of interdisciplinary knowledge. Therefore, as a minimum requirement for doing a Master thesis with us, you must have attended one of our teaching courses. We may also ask you to present an introductory talk about your chosen topic (based on material provided by us) before we can accept you. This does not apply to Bachelor theses or projects.

Open Topics

Approximative iterative prediction of complex non-nested RNA structures

The structure of RNA molecules is typically studied in a simplified graph model that represents the intra-molecular interactions formed, i.e. base pairings. Due to computational complexity, such RNA structure models are typically restricted to nested base pairing models that can be visualized by a non-crossing planar graph. Such models were shown to cover the majority of structure-defining base pairs and are thus often sufficient for biologically relevant studies. Nevertheless, there is a large class of RNAs where the final structure is defined by the formation of non-nested base pairs, i.e. base pairs that have to be represented by crossing lines within a planar graph. Algorithms that consider such pairings often have a time complexity of O(n^5) or more, depending on the restrictions imposed on the context in which crossing base pairs are considered. Thus, they cannot feasibly be applied to long RNA molecules or in large-scale studies. Within this project we want to tackle this problem with an iterative scheme of structure prediction. That is, we will apply structure and interaction prediction approaches to predict nested and crossing structure elements in a hierarchical approach. While this will not necessarily identify the optimal crossing structure, it provides a very general model of crossing structure formation that exploits the speed of nested structure prediction approaches.
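To make the distinction between nested and non-nested structures concrete, the following Python sketch checks whether a set of base pairs contains a crossing: two pairs (i, j) and (k, l) cross when i < k < j < l. The function name and the quadratic pairwise check are illustrative choices, not part of the project.

```python
def has_crossing_pairs(pairs):
    """Return True if any two base pairs (i, j), (k, l) satisfy i < k < j < l."""
    # Normalise each pair so that the smaller index comes first, then sort.
    ordered = sorted((min(i, j), max(i, j)) for i, j in pairs)
    for index, (i, j) in enumerate(ordered):
        for k, l in ordered[index + 1:]:
            if i < k < j < l:
                return True
    return False


nested = [(1, 10), (2, 9), (3, 8)]         # stacked helix: no crossing
crossing = [(1, 6), (2, 5), (4, 9)]        # (2, 5) and (4, 9) cross
nested_result = has_crossing_pairs(nested)
crossing_result = has_crossing_pairs(crossing)
```

A structure for which this check returns False can be drawn as a non-crossing planar graph and handled by nested prediction algorithms.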

Port a raw read pipeline for microbiome data analysis to Galaxy

The microbiome is the collection of all microbes, such as bacteria, fungi, and viruses, along with their genes, that live inside and outside our bodies and in all environments surrounding us. To investigate microbiomes, researchers analyze sequencing data with sophisticated computational approaches: assembly, binning, taxonomic classification, functional profiling, etc. Analyzing microbiome data makes it possible to answer the two main questions of most microbiome analyses:

  • Who (which microorganisms) is there: by extracting the community from the microbiome reads
  • What are they doing (and how): by extracting the gene/pathway abundance profile from the metagenomics reads and the transcript abundance profiles from the metatranscriptomics reads, and combining them

These analyses rely on bioinformatics tools and databases. Few workflows to process this data are available, and most are not openly available, not transparent, or not easy for researchers to use. To tackle this problem, the Freiburg Galaxy team, together with the microGalaxy community, uses Galaxy to build workflows to analyze microbiome sequencing data.

Project context: MGnify offers an automated pipeline for the analysis and archiving of microbiome data to help determine the taxonomic diversity and the functional and metabolic potential of environmental samples. Although documented, the pipeline is not really usable outside MGnify's own resources. We would like to offer this pipeline to Galaxy users. This project aims to port the raw-reads part of the pipeline into Galaxy. More information about the project can be found here: https://github.com/usegalaxy-eu/project-ideas/issues/31

CRISPR accessory proteins

The CRISPR-Cas system is an adaptive immune system in many archaea and bacteria that provides resistance against invading genetic elements. The three major components of CRISPR-Cas systems are the CRISPR array, the leader sequence, and the Cas genes. A recent study [1] demonstrated that there are proteins adjacent to the Cas proteins that help the CRISPR-Cas system to switch targeting and degrading. This work aims to cluster/classify all the accessory proteins based on the associated Cas proteins. To do this, you will use the method from [1] to identify and analyze clusters.

Project outline:

  • Scan all archaeal and bacterial genomes that have a CRISPR-Cas system.
  • Extract the upstream and downstream flanking genes of each CRISPR-Cas system.
  • Classify the genes according to different conditions and find clusters with respect to location and function.

[1] https://www.tandfonline.com/doi/full/10.1080/15476286.2018.1483685

Implementing new features for RNA-RNA interaction prediction

Our group develops the tool IntaRNA, which is one of the state-of-the-art programs for RNA-RNA interaction prediction. We are continuously extending the tool (C++11, Boost, autotools, OpenMP), which is hosted on GitHub at BackofenLab/IntaRNA. Within the development process, we offer various student projects covering different aspects of the project. For a list of open topics, please refer to the issues marked "student project" on GitHub. If you are interested, please contact Martin Raden. Most topics can be adapted to suit a student project or a bachelor or master thesis.

Docker based RNA-analysis workbench

Are you interested in bleeding-edge Linux kernel technology and virtualization? Do you want to help distribute software packages in an OS-independent way? Then you can help us solve the deployment problems of scientific software in a general way. This project will use Docker [1], an open source project that automates the deployment of applications, to produce self-contained images (containers). These containers are OS independent, versioned (like a git history), and easy to use, which enables reproducibility of research results and easy deployment of entire software stacks.

Prerequisites: Linux/Unix, Bash, autotools

[1] https://www.docker.io

Team-Project: can be combined with the "Graph visualization framework" and the "Galaxy Tool integration" project

Galaxy Tool integration

Galaxy is an open, web-based platform for data-intensive research. The University of Freiburg runs a Galaxy server to serve the different needs of our researchers. In addition to the common next-generation sequencing tools, we offer tools for cheminformatics [1], proteomics, and RNA bioinformatics. To integrate an application into Galaxy, a thin wrapper between the Galaxy API and the targeted application needs to be written. Here usability is key: good wrappers are easy to use and abstract away complicated application details. As part of our Galaxy project, we are always looking for motivated tool wrappers who are enthusiastic about usability and want to work with a vibrant community to make bio- and cheminformatics tools accessible to more researchers. The overall aim is to put the developed wrapper in the Galaxy Tool Shed [2], a Galaxy app store, where everyone can get their favorite application with a few mouse clicks.

Prerequisites: XML, Bash, autotools, Python

[1] https://github.com/bgruening/galaxytools/tree/master/chemicaltoolbox

[2] https://wiki.galaxyproject.org/Tool%20Shed

Team-Project: can be combined with the "Graph visualization framework" and the "Docker" project

Interactive molecule design based on graph grammars

Further topics from the Galaxy team

The Freiburg Galaxy team hosts further project ideas in its own GitHub repo. If you want to work on NGS, big-data analysis, cloud or HPC computing, or develop complex front ends or back ends, have a look at the topics at the link below: https://github.com/bgruening/project-ideas/issues

Further topics concerning CRISPR research

Further topics ....

Further Topics are available on request. If you have a suggestion for a topic you are interested in, do not hesitate to contact us. Otherwise, the completed theses may lead you.

Topics in Progress

Automated web crawling for publications using compensatory mutation experiments

Compensatory mutation experiments provide the most reliable proof of specific inter-molecular base pairs formed by RNA-RNA interactions. They provide proof that specific base pairs are part of a mechanism that is based on the formation of an RNA-RNA interaction. Such experiments are expensive in both time and resources and are thus often part of the methodology of research projects that try to unravel specific molecular biological mechanisms. As a result, the experiments are often "only" one step in a longer list of experiments gathering proof for a project's hypothesis and are therefore only described within the main text of the manuscript. Since publication search engines typically parse and index only the title and abstract of published articles, identifying publications that involve compensatory mutations is not easy. In order to better understand the details of RNA-RNA interactions, e.g. to improve prediction algorithms or to design new ones, compensatory mutation experiment data would be most beneficial. This project aims at the development of a web crawling tool to systematically identify publications that provide such experimental details.
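One small building block of such a tool is a full-text screen for the relevant terminology. The Python sketch below ranks already-downloaded texts by how often they mention compensatory mutations; the regular expression, function names, and toy documents are assumptions for illustration, and a real crawler would additionally need retrieval, parsing, and more robust matching.

```python
import re

# Matches e.g. "compensatory mutation", "compensatory mutations",
# and "compensatory base-pair mutations" (case-insensitive).
PATTERN = re.compile(r"compensatory\s+(?:base[- ]pair\s+)?mutations?", re.IGNORECASE)


def count_mentions(full_text):
    """Number of times the term occurs in a document's full text."""
    return len(PATTERN.findall(full_text))


def rank_documents(documents):
    """Return (doc_id, hits) for documents with at least one hit, most hits first."""
    hits = [(doc_id, count_mentions(text)) for doc_id, text in documents.items()]
    hits = [item for item in hits if item[1] > 0]
    return sorted(hits, key=lambda item: (-item[1], item[0]))


# Toy documents, invented for illustration.
docs = {
    "a": "We validated the interaction by compensatory mutations. "
         "Compensatory mutation restored regulation.",
    "b": "No such experiment was performed.",
    "c": "A compensatory\nmutation was introduced.",
}
ranking = rank_documents(docs)
```

Because the screen runs on the full text, it can find experiments that are only described in the main body and not in the title or abstract.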

Design of a database and respective user front ends to collect and browse compensatory mutation experiments

Compensatory mutation experiments provide the most reliable proof of specific inter-molecular base pairs formed by RNA-RNA interactions. They provide proof that specific base pairs are part of a mechanism that is based on the formation of an RNA-RNA interaction. Such experiments are expensive in both time and resources and are thus often part of the methodology of research projects that try to unravel specific molecular biological mechanisms. Experimental details are often presented only in the form of illustrative images or use non-uniform textual encodings. Thus, the extraction of such information is typically done manually. In order to better understand the details of RNA-RNA interactions, e.g. to improve prediction algorithms or to design new ones, compensatory mutation experiment data would be most beneficial. This project aims at the development of a database schema suited to store experimental details and the respective metadata. To help in the manual encoding and reviewing of compensatory mutations, an interactive user front end is to be developed.

Visualizing the effect of homo-dimerization on RNA-RNA interaction formation

The formation of interactions between RNAs is key to many regulatory processes in life. Such RNA-RNA interactions (RRIs) are typically formed between regions of the molecules that are not involved in the (intra-molecular) structure formation of the molecules themselves. Thus, in order to predict RRIs, the structure of the interacting RNAs has to be taken into account. This is well modeled by accessibility-based RRI prediction tools like our in-house tool IntaRNA. Some regulatory molecules are produced in large amounts in order to fulfill their regulatory function via RRI formation. In such a scenario, it is quite likely that the molecules interact not only with their regulatory targets but also with molecules of their own type, which is called homo-dimerization. Dimerization can have multiple effects, e.g. (i) it might reduce the regulatory effect, since many RNAs are bound and not available for interaction with the target molecule, or (ii) it might change the structure of the dimerizing RNAs and thus "unlock" regions for RRI formation with the target that are otherwise blocked by intra-molecular structure. This project aims at studying such effects of homo-dimerization. To this end, a workflow is to be implemented that combines RRI prediction and constrained RNA structure prediction to model the effects of homo-dimerization. Furthermore, respective visualizations are to be developed and integrated into the workflow to simplify the study and interpretation of such effects.

Clustering SARS-CoV-2 spike protein sequences using autoencoder neural network

The aim of this project is to create a low-dimensional representation of SARS-CoV-2 spike protein sequences using an autoencoder neural network. The low-dimensional representation should then be explored with popular embedding methods such as t-SNE and UMAP and clustered, to see whether the original differences between sequences belonging to different clades (categories of sequences) are also maintained in lower dimensions.
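Before sequences can be fed into an autoencoder they have to be turned into numeric vectors. The sketch below shows one common choice, one-hot encoding over the 20 standard amino acids with zero padding to a fixed length. The encoding, the function name one_hot_encode and the handling of unknown residues are assumptions for illustration; the project description does not prescribe any of them.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence, length):
    """Encode a protein sequence as a flat list of length * 20 floats.

    Sequences longer than `length` are truncated, shorter ones are padded.
    Padding positions and unknown residues (e.g. 'X') stay all-zero.
    """
    width = len(AMINO_ACIDS)
    vector = [0.0] * (length * width)
    for pos, residue in enumerate(sequence[:length].upper()):
        idx = INDEX.get(residue)
        if idx is not None:
            vector[pos * width + idx] = 1.0
    return vector
```

Vectors of this form can serve as both input and reconstruction target of an autoencoder, whose bottleneck layer then provides the low-dimensional representation.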

Learn and predict nucleotide evolution in SARS-CoV-2 sequences using generative adversarial neural network

SARS-CoV-2 sequences mutate into multiple variants, categorized into lineages and clades, some of which alter the pathogenicity of the virus and make it more virulent. Using generative adversarial neural networks, artificial sequences can be generated from knowledge of the past evolution of SARS-CoV-2 sequences. Ideally, the neural network should learn the 'edit' mechanism by which sequences evolved in the past and should generate sequences based on the learned knowledge. The generated sequences should be compared with the true sequences to see how well the neural network performs.

Closed Topics

Genomic long-range RNA-RNA interactions in flaviviruses

For the replication of flaviviruses, the formation of a specific long-range RNA-RNA interaction of the trailing untranslated regions of the virus genomes is crucial. This project aims at the prediction, comparison and modeling of these interactions using state-of-the-art tools for RNA-RNA interaction prediction and RNA alignment to identify common and species-specific details of these interactions.

Evaluating classification methods based on microbial community composition

The microbiome is the collection of all microbes, such as bacteria, fungi and viruses, along with their genes, that live inside and on our bodies and in all environments surrounding us. To investigate microbiomes, researchers rely on sequencing data and on sophisticated computational approaches: assembly, binning, taxonomic classification, functional profiling etc. Analyzing microbiome data makes it possible to answer the two main questions of most microbiome analyses:

  • Who (which microorganisms) is there: by extracting the community from the microbiome reads
  • What are they doing (and how): by extracting the gene/pathway abundance profile from the metagenomics reads and transcript abundance profiles from the metatranscriptomics reads and combining them

The MGnify Pipeline (https://www.ebi.ac.uk/metagenomics/pipelines/5) provides a standardized way to process metagenomic data and store the results in a public database.

Project context: The MGnify database can be accessed via an API that allows the retrieval of microbiome abundance data of various origins. Example notebooks that use the API are already included in Galaxy. The project aims to use this abundance data and investigate potential applications in machine learning and comparative metagenomics. To this end, different normalization approaches need to be applied that normalize the data with regard to experiment-specific parameters, such as sequencing depth and sample size. The normalized data should be used to investigate the potential to classify samples from different biomes as well as different host phenotypes. The workflows should be implemented and documented in Galaxy.
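To make the role of normalization concrete, here is a minimal sketch of total-sum scaling, which converts raw read counts into relative abundances so that samples sequenced at different depths become comparable. This is only one of several possible normalization approaches and is chosen here as an assumption; the function name and the dictionary-based input format are likewise illustrative.

```python
def relative_abundance(counts):
    """Total-sum scaling of one sample.

    counts: dict mapping a taxon name to its raw read count.
    Returns a dict with the same keys whose values sum to 1
    (or all zeros for an empty sample).
    """
    total = sum(counts.values())
    if total == 0:
        return {taxon: 0.0 for taxon in counts}
    return {taxon: count / total for taxon, count in counts.items()}
```

Two samples with the same community composition but a tenfold difference in sequencing depth yield the same profile after this step, which is the property a downstream classifier relies on.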

Integration of multi-modal omics analysis framework into Galaxy

Single-cell multimodal omics allows simultaneous profiling of different types of information, such as gene expression, DNA methylation, chromatin accessibility and surface protein levels, for each individual cell. Such data enables cell characterization based on complex gene regulatory networks. Analysis of such datasets requires extensive knowledge of statistics and of programming languages such as R and Python. To provide experimentalists with complex multimodal analysis workflows, this project aims to integrate computational workflows into Galaxy. We chose to integrate muon-based workflows for such data analysis. The muon framework shares datatypes and features with Scanpy, a framework that is already integrated into Galaxy. The objectives of this project are the integration of muon multimodal analysis workflows into Galaxy and the development of Galaxy training material based on the integrated workflows.

Creation of a tutorial for metagenomics data analysis

Emerging and powerful technologies like DNA sequencing are getting cheaper and therefore more accessible for many applications, e.g. in microbiome research. This produces more data for scientists to analyze. Platforms like Galaxy help scientists to analyze their own (complex) data in a user-friendly way, but they first need to learn how to do that. The Galaxy Training Network (GTN) created an open-source e-learning infrastructure to provide a collection of tutorials developed and maintained by the worldwide Galaxy community ( https://training.galaxyproject.org ). For microbiome data analysis, the GTN currently offers 8 tutorials, each built around a research story ( https://training.galaxyproject.org/training-material/topics/metagenomics/ ). The microGalaxy community aims to expand that catalog for whole-genome microbiome data analysis. The aim of this project is to create a tutorial that uses data from the Human Microbiome Project as well as tools and tutorials developed by the Huttenhower lab to update the general overview tutorial.

Port an amplicon pipeline for microbiome data analysis to Galaxy

Project context: MGnify offers an automated pipeline for the analysis and archiving of microbiome data to help determine the taxonomic diversity and the functional and metabolic potential of environmental samples. Although documented, the pipeline is not really usable outside MGnify's own resources. We would like to offer this pipeline to Galaxy users. This project aims to port the amplicon part of the pipeline into Galaxy. More information about the project can be found here: https://github.com/usegalaxy-eu/project-ideas/issues/31

Development of a Galaxy pipeline for detection of SARS-CoV-2 variants in wastewater samples

Nearly two years after the first report of SARS-CoV-2 in Wuhan, China, the COVID-19 pandemic has affected more than 485 million people. Wastewater surveillance has attracted extensive public attention during the SARS-CoV-2 pandemic as a passive monitoring system that complements clinical and genomic surveillance activities. Several methods and protocols are already in place that effectively facilitate the detection and quantification of viral RNA in wastewater samples, and concentrations in wastewater have been shown to correlate with trends in reported cases. The Galaxy community has put a lot of effort into the continuous analysis of intra-host variation in SARS-CoV-2 ( https://galaxyproject.org/projects/covid19/ ), including the development of workflows. The aims of this thesis are to: (i) evaluate existing workflows for wastewater data analysis; (ii) expand and adapt existing Galaxy workflows; (iii) extensively test the workflows on mock and real data; (iv) connect with existing data sources.

Machine-learning-based improvement of genome-wide target prediction of sRNAs

Identifying putative regulatory target regions of bacterial small (s)RNAs is still a challenging problem due to the high false positive rate of predictive methods. One way to greatly reduce false positives is to combine genome-wide predictions of related organisms, which is the core feature of the CopraRNA approach. This project aims at the identification and benchmarking of fast, simple but still sufficiently reliable target prediction workflows based on machine learning techniques to speed up CopraRNA.

Graph neural network based model for cancer driver prediction

  • Python programming experience, Machine Learning

Gene prioritization based on phenotypes

Graph neural network-based method for single-cell RNA-seq denoising

Development of an automated scoring system for shared Galaxy histories

Due to the pandemic situation, face-to-face interaction with the public is not feasible. Therefore, together with the Street Science Community, we started the development of an online data analysis game ( http://streetscience.community/DNAnalyzer/ ). Within the game, users will learn about the microbiome, DNA, sequencing and how to perform a data analysis. Galaxy provides the perfect platform to learn and later perform data analyses. To get scores for their data analysis, gamers will share their histories. Within this project, a tool will be developed that compares two shared Galaxy histories and calculates a score for the submitted history. Further information about the topic can be found here: https://github.com/usegalaxy-eu/project-ideas/issues/28

Implementation of the infrastructure for an online & interactive game on DNA data analysis

Due to the pandemic situation, face-to-face interaction with the public is not feasible. Therefore, the Street Science Community started the development of an online data analysis game ( http://streetscience.community/DNAnalyzer/ ). Within the game, users will learn about the microbiome, DNA, sequencing and how to perform a data analysis. Galaxy provides the perfect platform to learn and later perform data analyses. However, gamers will register on a separate website that is connected with Galaxy and additionally tracks the successes and results of each gamer. The aim of this project is to implement a small webserver to register participants, display the videos, questions and puzzles, collect and display the scores of the participants, and connect with the automated scoring system developed in another master project. Further information about the topic can be found here: https://github.com/usegalaxy-eu/project-ideas/issues/22

Tool Resource Prediction for Genomic Datasets

The amount of CPU, RAM, and processing time for a tool to complete is dependent on the size of the input dataset and the complexity of the tool. By emulating these processing requirements with a benchmarking stress-testing tool such as stress-ng, we wish to accurately measure the footprint of the top set of tools on the UseGalaxy.eu workbench with repeated benchmarks, and try to predict their future footprint based on input data size and other extractable content, using machine-learning.
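The prediction step can be pictured with a deliberately simple stand-in: an ordinary least-squares line that relates input size to a measured resource such as runtime. The project itself mentions machine learning without naming a model, so the linear model, the function names and the single input feature are all assumptions made for illustration.

```python
def fit_line(sizes, measurements):
    """Ordinary least-squares fit of measurement = slope * size + intercept.

    sizes:        input dataset sizes from repeated benchmark runs
    measurements: the resource measured in each run (e.g. runtime or peak RAM)
    Requires at least two distinct sizes.
    """
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(measurements) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, measurements))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, size):
    """Predict the resource footprint for a new input size."""
    slope, intercept = model
    return slope * size + intercept
```

In practice the relationship is often non-linear and tool-specific, which is why additional features extracted from the input and more flexible models are worth investigating.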

Integrating Multi-Omics Data and Pathway Structure with Explainable Graph Neural Network for Precision Medicine

Cancer is a disease that has afflicted the human species for ages, with each tumor possessing its own set of unique characteristics. As a result, people with comparable phenotypes respond to similar therapy in different ways. Largely unsolved, this area has started evolving over the past few decades owing to the availability of multi-omics data and large-scale data of cancer cell lines with different drugs approved for clinical trials. Consequently, a new area termed personalized tumor therapy has emerged. The goal of this research is to propose a novel method that aims at predicting drug response for cancer cell lines.

Analyzing miRNA processing patterns from single-cell small RNA sequencing

Studying mRNA expression at single-cell resolution is a well-established research area. Numerous experimental and computational methods exist to sequence and analyze single-cell transcriptomes, but all of them were designed and optimized to work with protein-coding genes only. Currently, there are only very few experimental protocols to sequence small non-coding RNAs at the single-cell level. It was shown that the existing computational methods used for single-cell mRNA-Seq can be used to cluster mature miRNAs and that miRNAs also show cell-type-specific expression. In this project we aim to investigate whether miRNA processing is cell-type specific. To achieve this, we apply existing computational methods that were developed for bulk miRNA-Seq data to cluster individual cells based on miRNA processing patterns.

Analysis of CRISPR-Cas System in Marine Metagenomics

The award of the 2020 Nobel Prize in Chemistry to Emmanuelle Charpentier and Jennifer A. Doudna for the discovery and development of the CRISPR/Cas9 system highlights the importance of CRISPR-Cas systems. The CRISPR-Cas system is an adaptive immune system found in prokaryotic lifeforms and is very diverse in nature. Cas proteins evolve rapidly. Here, we aim to analyse metagenomic data from the marine ecosystem for CRISPR-Cas proteins. The main focus is on the class-2 type-V system, as the effector protein Cas12 from this system is a promising gene editing candidate. We used three databases for the analysis: the Tara Oceans database with 2,631 draft metagenomes, the MarRef dataset with 970 assembled metagenomes, and the IMG/VR dataset with above 90 percent completeness. We built four pipelines comprising different methods and tools for the whole analysis: pipeline 1 for detecting CRISPR-Cas systems and Cas12 proteins, pipeline 2 for transposons, pipeline 3 for repeats and their secondary structures, and pipeline 4 for the spacers and protospacer adjacent motifs (PAMs). We observed that the two tools used for detecting CRISPR-Cas systems (CRISPRCasIdentifier and CRISPRCasTyper) produce very different results, indicating the need for a more accurate and robust tool for the identification of CRISPR-Cas systems. For different variants of Cas12 proteins, we detected different transposable elements. From the analysis of detected repeats, we identified 13 different secondary structures for the repeats found in type V systems, many of them having a conserved GAAAC or GAA sequence at the 3′ terminus. During the spacer analysis, we detected different PAMs. Along with 5′ T-rich PAMs, we also detected 5′ A-rich PAMs upstream of the detected spacer sequences. Our work shows that a lot is still unknown about Cas12 proteins, and further in-depth analysis can lead to a better understanding of Cas12 proteins and CRISPR-Cas systems.

Peak calling and workflow implementation for the single-cell Assay for Transposase-Accessible Chromatin using sequencing

This Bachelor thesis presents the scATAC-Seq method, which finds open regions in the chromatin of the genome of individual cells, together with its biological background. Furthermore, it investigates how scATAC-Seq data are best processed so that as much high-quality information as possible about the open chromatin regions can be obtained. To this end, the data are specially preprocessed, the cells are then partially grouped, and finally the peaks are determined by peak calling. Subsequently, the peaks of the individual cell groups are merged again in order to compare them and to check them against various quality criteria. Four different methods are presented in this thesis to carry out this procedure with minor variations. For this purpose, approximately 3000 human cells obtained by scATAC-Seq are processed and examined with the different methods, and the results are then compared. The results show the potential of these ways of processing the data. However, no single method can be clearly recommended in this thesis, since a deeper investigation of the obtained peaks is needed to reach a final judgement on the quality of the results.

How genomes are shaped by direct and indirect selection pressure: a study in in silico experimental evolution

What are the different pressures that can shape genomes in evolution? The aim of this thesis is to focus particularly on the case of reductive genome evolution, i.e. the reduction of genome size over time as observed in some marine cyanobacteria. To address this topic, in silico artificial evolution is used, a method in which genomes of virtual organisms evolve via computer simulations, and in particular the Aevol model. Several experiments have been conducted to test the effect of several parameters (population size, mutation rate, and selection strength) on the genome structures and other selection measures (e.g. fitness, robustness).

Drug repurposing and adverse event prediction through EHR knowledge graph completion

Drug repurposing is the process of discovering new indications for existing, approved drugs, while adverse event prediction comprises identifying probable harmful effects of known or novel drugs. Both are normally done by in vivo and in vitro methods, which are costly, slow and limited in sample size, and which raise some ethical issues. Therefore, effective computational methods are needed. In this project, we investigate EHR data and create a machine learning model using a relational graph attention network to predict potential links between entities of interest like drugs, diagnoses, etc.

TAD detection in Hi-C data

Within the 3D space of a cell, DNA forms a structure resembling a ball of wool. At the points of contact of the DNA with itself, called DNA interactions, the "threads" within this "ball of wool" form smaller loop-like structures called DNA loops. At close genomic distance, these loops are called topologically associating domains (TADs). A z-score-based algorithm currently exists to detect these TADs, but it lacks the ability to detect overlapping TADs and hierarchical structures. In this master project a new approach based on neural networks should be investigated and implemented.

CRISPR-Cas9 Off-Target Prediction Methods

  • Python programming experience and Machine learning

Assigning tissues of action to genomic loci associated with kidney function

  • R-Programming experience

Website for visualization and publishing of single-cell RNA-sequencing (scRNA-seq) datasets

RRI prediction ranking

While IntaRNA is a state-of-the-art method to predict RNA-RNA interactions, it is not clear whether a predicted interaction actually occurs in nature. We are building a support vector machine model that should assess whether an in silico interaction occurs in vivo and can therefore be used to post-filter interaction predictions. RNA-RNA interaction predictions can be verified experimentally by mutation experiments. Based on such experimentally verified interactions, we are developing a positive and a negative training set. This dataset is already discussed in CopomuS by Raden et al.

Machine Learning for Gene Discovery

Current approaches to finding new genes have a high false positive rate. Help us develop a tool to filter candidates in this straightforward Machine Learning project. You will expand on our Scikit-Learn python code and work with state of the art bioinformatics tools. The project covers feature extraction, filtering and classification on an annotated dataset of alignment files.

Binding Affinity Prediction of Protein-Ligand Complexes

This project predicts the binding affinities between potential drugs (ligands) and the target proteins responsible for diseases or conditions. It uses the data of protein-ligand complexes stored in the PDBBind database to train a machine learning model. From every complex, features related to proteins are extracted using the pocket-finding software fpocket. Four ML models were studied in this project: Simple Linear Regression, Random Forest Regression, Support Vector Regression, and Rotation Forest Regression.

Multi Protein-Ligand Interaction Prediction using Machine

In this thesis, a voxelization procedure was developed and applied to targets (or proteins) in the PCBA (PubChem BioAssay) dataset to create a three-dimensional image of the protein-ligand binding site. These voxelization data were used to train a neural network, more specifically a CNN autoencoder to featurize the binding site by keeping only the most relevant information. This information was then combined with ligand features (which have been calculated using the RDKit descriptor tool from the RDKit library) and finally using machine learning techniques, protein-ligand binding affinity was predicted for each protein-ligand pair.
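The general idea of voxelization can be sketched as follows: atoms around the binding site are binned into a regular 3D grid, giving an image-like tensor that a CNN can consume. The thesis does not describe its actual procedure here, so the grid size, the resolution, the use of plain atom counts (rather than per-atom-type channels) and the function name voxelize are all assumptions.

```python
def voxelize(atoms, center, box_size=16.0, resolution=1.0):
    """Bin atom coordinates into a cubic occupancy grid.

    atoms:      iterable of (x, y, z) coordinates
    center:     (x, y, z) of the binding site; the box is centred on it
    box_size:   edge length of the cubic box (same unit as the coordinates)
    resolution: edge length of a single voxel
    Returns a nested list grid[i][j][k] holding the number of atoms per voxel.
    Atoms outside the box are ignored.
    """
    n = int(round(box_size / resolution))
    half = box_size / 2.0
    grid = [[[0] * n for _ in range(n)] for _ in range(n)]
    for x, y, z in atoms:
        i = int((x - center[0] + half) // resolution)
        j = int((y - center[1] + half) // resolution)
        k = int((z - center[2] + half) // resolution)
        if 0 <= i < n and 0 <= j < n and 0 <= k < n:
            grid[i][j][k] += 1
    return grid
```

A richer variant would keep one grid per atom type or physicochemical property, so that the autoencoder sees several channels per voxel instead of a single count.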

BioBlend to Galaxy API extension and OpenAPI specification

BioBlend is a Python library to enable simple interaction with Galaxy via the command line or scripts. Galaxy is a data analysis platform for accessible, reproducible and transparent computational research. It includes a web interface through which users can design and perform tasks in a visual and interactive manner. The Galaxy server also exposes this functionality through its REST-based Application Programming Interface (API). In this project, several important new features were introduced into BioBlend and the Galaxy API, and a tutorial was written for future developers.

Predicting Hi-C contact matrices using machine learning approaches

In recent years, many studies have shown that the three-dimensional conformation of genomes is a key factor for understanding several important mechanisms on the molecular biological level. However, the Hi-C experiments typically conducted to measure this 3D structure are still expensive, so that computational methods for predicting the spatial chromatin organization from existing data have recently become a subject of research. In this thesis, two machine learning approaches are investigated with regard to their usability for predicting chromosome conformation in the form of Hi-C contact matrices from ChIP-seq data. Here, the first method adapts and extends an existing dense neural network architecture for Hi-C matrix predictions, while the novel second method, Hi-cGAN, leverages techniques from image synthesis, especially conditional generative adversarial networks (cGANs). While the dense neural network approach can produce satisfactory predictions neither for the Hi-C matrices of the human cell lines GM12878 and K562 nor for Drosophila melanogaster embryonic cells in the chosen setting, Hi-cGAN yields encouraging outcomes in all three cases.

Within the 3D space of a cell, DNA forms a structure resembling a ball of wool. At the points of contact of the DNA with itself, called DNA interactions, the "threads" within this "ball of wool" form smaller loop-like structures called DNA loops. At close genomic distance, these loops are called topologically associating domains (TADs). A z-score-based algorithm currently exists to detect these TADs, but it lacks the ability to detect overlapping TADs and hierarchical structures. In this master project a new approach based on machine learning classifiers should be investigated and implemented.

Hi-C interaction matrix prediction based on protein location

In the 3D space of a cell, the DNA forms a structure that looks like a ball of wool. Obviously, many points of contact of the DNA thread with itself, called DNA interactions, exist in this "ball of wool" and form a structure that includes DNA loops. These loops contribute to the stability of the DNA and play an important role in gene regulation. Current research shows that proteins bind to the DNA at these loop locations and contribute to the formation of loops and thus to the whole structure. The structure of the DNA can be read out with a technique called Hi-C, and the resulting data is represented computationally as an interaction matrix. However, Hi-C is an expensive technique and for many cell types no data exists, while the technique to read out the position of proteins on the DNA (ChIP-Seq) is quite cheap and a lot of data is available online. The goal of this master project is to use a random forest approach to predict Hi-C interaction matrices by learning the location of proteins. Based on the results of the master project of Andre Bajorat, possible optimizations for this model are investigated.

Hierarchical TAD detection in Hi-C data

Within the 3D space of a cell, DNA forms a structure resembling a ball of wool. At the points of contact of the DNA with itself, called DNA interactions, the "threads" within this "ball of wool" form smaller loop-like structures called DNA loops. At close genomic distance, these loops are called topologically associating domains (TADs). A z-score-based algorithm currently exists to detect these TADs, but it lacks the ability to detect overlapping TADs and hierarchical structures. In this Bachelor thesis a method to detect these was developed and implemented.

Creating a linkage analysis workflow in Galaxy

Classical linkage analysis is the method of looking for genes that are inherited together in a family tree. It has now been superseded by variant analysis in the era of high-throughput sequencing, but is still relevant in rare disease studies. The Galaxy project is a free and open-source web-based platform for bioinformatic research and offers users an interactive drag-and-drop avenue to perform their analyses. This project would involve wrapping tools into Galaxy and chaining them together in a workflow for public user access. Optionally, training material can be written to guide users through the analysis. Applicants need only to know basic HTML/XML and Markdown.

Integrating a haplotype analysis visualization into Galaxy

The study of haplotypes is relevant to pedigree analysis, which looks for mutations inherited from founders that manifest only after many generations due to the semi-random/coalescent nature of inheritance. This project will be wrapping an existing haplotype visualization tool into Galaxy, an open source web-based bioinformatic analysis environment, in order to reach a greater number of users. Applicants must know basic Javascript and HTML/XML.

Multi-site RNA-RNA interaction prediction

Accessibility-based RNA-RNA interaction prediction methods typically model a single block of consecutive inter-molecular base pairs. Thus, interaction patterns that consist of multiple concurrently formed blocks cannot be predicted. Within this project, we are developing and testing possibilities to efficiently predict concurrent blocks of interaction within an accessibility-based prediction model. The approach will be based on IntaRNA, which is one of the state-of-the-art programs for RNA-RNA interaction prediction. The respective extensions of the IntaRNA package will be integrated into the main package for external use and further development.

Graph neural network-based method for disease gene prioritization

The uncovering of genes linked to human diseases is a pressing challenge in molecular biology and precision medicine. This task is often hindered by the large number of candidate genes and by the heterogeneity of the available information. Therefore, computational methods for the prioritization of candidate genes are needed to deal with these problems. A number of methods have been proposed and have shown promising results. However, there is still a need to develop more accurate disease gene prioritization methods. The aim of this project is to develop a graph neural network-based method for disease gene prioritization. This choice is supported by two observations: (1) graphs are a common and natural way to represent gene relations, and (2) neural networks for graphs are now state-of-the-art in graph (and graph node) classification problems.

A deep learning model to detect triple helices in genomics data

Triple helix formation is known to interfere with the gene expression process, often by modifying the transcription of targeted genes. Therefore, understanding how and where triple helices form is crucial to better understand gene expression. To identify regions where triple helices form, wet-lab experiments and some computational methods are used. However, none of the existing methods is based on machine learning. Here we would like to propose a deep learning-based method to detect triple helices in genomic data.

CRISPR/Cas9 is a unique and robust gene-editing method that has the ability to accurately edit target genes in a wide variety of organisms. However, experimental results indicate that the binding and cleavage of off-target sequences are a major concern for the application of CRISPR/Cas9, and the sgRNAs should be designed in such a way that the impact of off-targets is minimized. Several computational methods have been proposed as a substitute for expensive lab experiments to predict off-targets. Yet, more powerful approaches need to be devised to make precise predictions. Here we aim at proposing a Graph Convolutional Network model to predict off-targets of CRISPR/Cas9. The proposed model is expected to overcome the following typical challenges: data imbalance, robustness, and prediction across different cell types.

Ranking of mutations in RNA-RNA interactions

Point mutations are a common way to verify RNA-RNA interactions. So far, the selection of the position and the introduced mutation is done manually based on the expert knowledge of the experimenter. Within this project, we are developing and testing possibilities to automatically evaluate and rank candidate mutations concerning their potential for interaction validation. The approach is based on IntaRNA, which is one of the state-of-the-art programs for RNA-RNA interaction prediction. The respective extensions of the IntaRNA package are integrated into the main package for external use and further development.

Benchmarking Big-Data Workflows Across European Academic Clouds to Evaluate Cloud Bursting Strategies

The Galaxy Project, a web platform for big-data biomedical research, needs a lot of computational resources, and cloud bursting, i.e. sending excess workloads to the cloud, may be a solution in high-demand situations. But how do the various academic clouds, spread across Europe, perform? May one be better suited than another for a specific workload? Do physical distance and connectivity between data centers play a big enough role? What about the underlying infrastructure? Does it make a difference, even if the actual instance size is the same? In this work, where I benchmarked various academic clouds in Europe, I want to answer these questions and also offer a framework for future benchmarks, as the need for benchmarking more clouds arises.

Base-pair probabilities for accessibility-based RNA-RNA interaction prediction

Computing base pair probabilities of RNA-RNA interactions allows for a number of useful applications, such as the creation of dot plots, which allow for easy and fast comparison between different base pairing patterns. A number of tools exist that already incorporate base pair probability calculation, such as RNAcofold and NUPACK. However, these tools are limited to a specific algorithm for the optimal interaction computation that might lack precision or computational efficiency depending on the application. IntaRNA, on the other hand, is a highly flexible RNA-RNA interaction prediction tool that implements a large number of different prediction algorithms, including very efficient seed-constraint methods. This thesis explores the benefits and difficulties of introducing the computation of base pair probabilities into a number of IntaRNA predictors, including seed-based predictors. For this reason IntaRNA was extended with the ability to compute base pair probabilities, depending on the chosen prediction model. The output is provided as a dot plot to allow for easy investigation. Finally, a number of applications are presented that benefit from base pair probabilities, including the comparison between verified and non-verified RNA-RNA interactions and the detection of multi-site RNA interactions. Based on these results, potential improvements for IntaRNA's prediction model are discussed, including different approaches for the accessibility computation and the incorporation of sequence conservation into the prediction estimation.
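The notion of a base pair probability can be illustrated on a small, explicitly enumerated ensemble: each candidate interaction gets a Boltzmann weight exp(-E/RT), and the probability of a base pair is the summed weight of all interactions containing it, divided by the total weight. The sketch below does exactly that. It is not a description of how IntaRNA, RNAcofold or NUPACK compute these values; the explicit enumeration, the input format and the function name are assumptions, and the probabilities are relative to the listed ensemble only.

```python
import math

GAS_CONSTANT = 0.0019872   # kcal / (mol * K)
TEMPERATURE = 310.15       # K, i.e. 37 degrees Celsius
RT = GAS_CONSTANT * TEMPERATURE


def base_pair_probabilities(interactions, rt=RT):
    """Base pair probabilities within an explicitly listed ensemble.

    interactions: list of (energy, base_pairs) tuples, where energy is the
                  free energy in kcal/mol and base_pairs is a set of (i, j)
                  index pairs (i in RNA 1, j in RNA 2).
    Returns a dict mapping each (i, j) to its probability.
    """
    weights = [math.exp(-energy / rt) for energy, _ in interactions]
    partition_sum = sum(weights)
    probabilities = {}
    for weight, (_, base_pairs) in zip(weights, interactions):
        for bp in base_pairs:
            probabilities[bp] = probabilities.get(bp, 0.0) + weight / partition_sum
    return probabilities
```

Plotting these values on an i-by-j grid gives the kind of dot plot mentioned above, where base pairs shared by many low-energy interactions stand out.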

In the 3D space of a cell, the DNA forms a structure that looks like a ball of wool. Obviously, many points of contact of the DNA thread with itself, called DNA interactions, exist in this "ball of wool" and form a structure that includes DNA loops. These loops contribute to the stability of the DNA and play an important role in gene regulation. Current research shows that proteins bind to the DNA at these loop locations and contribute to the formation of loops and thus to the whole structure. The structure of the DNA can be read out with a technique called Hi-C, and the resulting data is represented computationally as an interaction matrix. However, Hi-C is an expensive technique and for many cell types no data exists, while the technique to read out the position of proteins on the DNA (ChIP-Seq) is quite cheap and a lot of data is available online. The goal of this master project is to use machine learning and neural network regression models to predict Hi-C interaction matrices by learning the location of proteins.

Hi-C interaction matrix correction

In the 3D space of a cell, the DNA forms a structure that looks like a ball of wool. Within this "ball of wool", the DNA strand touches itself at many points; these contacts, called DNA interactions, form a structure that includes DNA loops. However, many of these contacts are random contacts or measurement errors and need to be corrected. A Python implementation exists but reaches its limits on high-resolution data because of its high memory usage. This master project should investigate whether a more efficient algorithm exists and whether an implementation in C++ can run with fewer resources.
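The description does not name the correction algorithm. One common family of methods for Hi-C data is matrix balancing (the idea behind iterative correction), and the sketch below assumes that approach: rows and columns are repeatedly rescaled until all non-empty rows have nearly the same sum. The function name, the square-root scaling step and the toy matrix are choices made for this example only.

```python
import math

def balance_matrix(matrix, iterations=100):
    """Rescale a symmetric contact matrix so that non-empty rows get equal sums."""
    n = len(matrix)
    balanced = [row[:] for row in matrix]
    for _ in range(iterations):
        sums = [sum(row) for row in balanced]
        nonzero = [s for s in sums if s > 0]
        if not nonzero:
            break
        mean = sum(nonzero) / len(nonzero)
        # Square root of the relative coverage; empty rows are left untouched.
        scale = [math.sqrt(s / mean) if s > 0 else 1.0 for s in sums]
        for i in range(n):
            for j in range(n):
                balanced[i][j] /= scale[i] * scale[j]
    return balanced

# Hypothetical 3-bin contact matrix with uneven coverage.
raw = [
    [10.0, 4.0, 2.0],
    [4.0, 8.0, 1.0],
    [2.0, 1.0, 6.0],
]
corrected = balance_matrix(raw)
row_sums = [sum(row) for row in corrected]
print([round(s, 6) for s in row_sums])
```

The dense list-of-lists used here is exactly the kind of representation that becomes too large at high resolution; a sparse format would be the natural starting point for a memory-efficient version.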

Statistical significance for RNA alignment predictions and an evaluation schema for multiple sequence alignments in local mode

To evaluate the alignments predicted by the RNA sequence-structure alignment tool LocARNA, so far only the alignment score has been provided. The score is the optimal value of the objective function of the LocARNA optimization problem. However, the scores are not very informative for end users; for example, they do not show whether the predicted alignment is significant or likely to occur by chance. It would be desirable to have a statistical measure that not only ranks the quality of a given alignment but also makes it possible to compare the prediction to other alignment tools and to the reference alignment. In this thesis an empirical p-value for LocARNA will be developed. Furthermore, a suitable scoring schema for evaluating multiple sequence alignment results will be investigated.
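A minimal sketch of an empirical p-value, assuming the null distribution is obtained by scoring shuffled sequences: the p-value is the fraction of null scores that are at least as good as the observed one, with a pseudo-count so it is never exactly zero. The scoring function here is a hypothetical stand-in (number of matching positions), not the LocARNA score, and the shuffling scheme is only one of several possible null models.

```python
import random

def empirical_p_value(observed, null_scores):
    """Share of null scores >= observed, with a +1 pseudo-count."""
    at_least_as_good = sum(1 for score in null_scores if score >= observed)
    return (at_least_as_good + 1) / (len(null_scores) + 1)

def shuffled(sequence, rng):
    letters = list(sequence)
    rng.shuffle(letters)
    return "".join(letters)

def toy_score(a, b):
    """Hypothetical stand-in for an alignment score: count of matching positions."""
    return sum(x == y for x, y in zip(a, b))

rng = random.Random(0)
seq_a = "GGGAAACCCUUU"
seq_b = "GGGAAACCCUUU"
null_scores = [toy_score(seq_a, shuffled(seq_b, rng)) for _ in range(999)]
p_value = empirical_p_value(toy_score(seq_a, seq_b), null_scores)
print(p_value)
```

With 999 shuffles the smallest reportable p-value is 0.001, so the number of null samples bounds the resolution of the statistic.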

p-value statistics of IntaRNA predictions

The RNA-RNA interaction prediction tool IntaRNA provides sophisticated and highly accurate results in terms of free energy minimization. Since it is non-trivial for users to interpret the provided free energy terms, this project investigates how energy statistics and respective p-values can be provided.

RNA-RNA interaction prediction via seed extension

Our group develops the tool IntaRNA, which is one of the state-of-the-art programs for RNA-RNA interaction prediction. We are continuously extending the tool (C++11, Boost, Doxygen, Autotools, OpenMP), which is hosted on GitHub at BackofenLab/IntaRNA. This project aims at the implementation and testing of seed-extension strategies to speed up and improve IntaRNA's predictions. The developed extensions to IntaRNA will be integrated into the main package for external use and further development.

Implementing bioinformatics algorithms for teaching

Over the last few years, we have created interactive implementations of various algorithms discussed in our lectures. These are freely available in the Freiburg RNA tools - Teaching section of our public webserver. The algorithms are implemented in JavaScript and are accompanied by corresponding visualizations that help to understand each approach.

Identifying and analysis of new anti-CRISPR proteins

The CRISPR-Cas system of archaea and bacteria provides resistance against viruses and phages. Phages are in a constant battle with their prokaryotic hosts, and recent studies have described phage genes that inhibit CRISPR-Cas function. These are, however, likely to be quite diverse in function, as they can interfere with the CRISPR-Cas response at different stages. This work aims to develop a new method for identifying new families of anti-CRISPR proteins based on homology search.

Identification of CRISPR arrays using machine learning approach

Archaea and bacteria are known to acquire immunity against viruses and plasmids through a widely conserved RNA-based gene silencing pathway. This mechanism involves non-coding RNA that originates from Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) and CRISPR-associated (Cas) proteins, together forming the CRISPR-Cas system. CRISPRs consist of identical repeats that are between 20 and 47 base pairs in length, separated from each other by unique spacer sequences of similar length (27 to 72 base pairs). Most CRISPR arrays are flanked on the upstream (5') side by a leader sequence of 60 to 500 base pairs. These leaders often contain low-complexity sequences and are rarely conserved between more distantly related species. Finally, there are the Cas genes, which are usually located directly up- or downstream of the CRISPR array; however, they can also be found in very different locations. These genes encode protein complexes that work together with CRISPR arrays to provide the host cell with an adaptive immune system against invading viruses and plasmids. This work aims to develop a new tool to detect CRISPR arrays using machine learning approaches.
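The repeat and spacer lengths given above already suggest a simple rule-based candidate search, sketched below under strong assumptions: repeats are exact copies of a fixed length, and consecutive copies are separated by a spacer within the stated range. This is not the machine learning tool the project aims for; it only shows how candidate arrays could be generated before a classifier scores them. The function name, the example repeat and the spacers are made up for illustration.

```python
from collections import defaultdict

def find_candidate_arrays(genome, repeat_len=20, min_spacer=27, max_spacer=72,
                          min_repeats=3):
    """Return (repeat, start positions) for exact repeats with spacer-sized gaps."""
    positions = defaultdict(list)
    for i in range(len(genome) - repeat_len + 1):
        positions[genome[i:i + repeat_len]].append(i)
    candidates = []
    for kmer, starts in positions.items():
        run = [starts[0]]
        for start in starts[1:]:
            spacer_len = start - run[-1] - repeat_len
            if min_spacer <= spacer_len <= max_spacer:
                run.append(start)
            else:
                if len(run) >= min_repeats:
                    candidates.append((kmer, run))
                run = [start]
        if len(run) >= min_repeats:
            candidates.append((kmer, run))
    return candidates

# Hypothetical toy locus: three identical repeats separated by two unique spacers.
repeat = "GTTTTAGAGCTATGCTGTTT"
spacer_1 = "ACGTACGATCGATCGGATCCATGCAAGCTT"
spacer_2 = "TTGACCGGTAACGTTAGCCATGGACTAGTC"
flank = "CCCCCCCCCC"
genome = flank + repeat + spacer_1 + repeat + spacer_2 + repeat + flank
candidates = find_candidate_arrays(genome, repeat_len=len(repeat))
print(candidates)
```

Real repeats often carry mismatches and their length is not known in advance, which is one reason a learned model is attractive for the final decision.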

Crossdating of wood samples using MICA-aligned density profiles

Our group develops the tool MICA , which enables Multiple Interval-based Curve Alignment of arbitrary curve/profile data. It is currently applied to derive meaningful consensus data of experimentally measured wood density samples. Within this project, we will use MICA density profile alignments to evaluate their potential for crossdating, i.e. the time annotation of wood samples. Given the increased information compared to standard methods based on ring widths, the approach should yield high precision even for small wood samples (few rings).

Modular benchmark pilot framework for evaluating RNA alignment tools

To benchmark the quality of RNA alignment algorithms, it is important to validate their performance and compare it with that of similar tools. For this purpose, a benchmark-pilot framework to automatically benchmark RNA alignment algorithms such as LocARNA and SPARSE will be developed. The aim is a modular and easily extendable framework for evaluating a wide range of tools on different computation platforms, from PCs to high-performance computing grid systems. The project focuses on developing the benchmark-pilot code in Python using the Snakemake workflow manager, to replace the previously deployed system.

RNA-RNA interaction prediction for long molecules

Our group develops the tool IntaRNA (see PhD thesis A. Richter for details), which is one of the state-of-the-art programs for RNA-RNA interaction prediction. We are currently reimplementing and extending the tool (C++11, Boost, Doxygen, Autotools, OpenMP), which is hosted on GitHub at BackofenLab/IntaRNA. This project aims at the implementation and testing of strategies to enable predictions for very long input molecules, for which the standard approach might fail due to extreme memory consumption. The idea is to apply a window-based segmentation, which requires special result handling to avoid duplicates in the output. The developed extensions to IntaRNA will be integrated into the main package for external use and further development.
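The window-based idea can be sketched as follows, assuming overlapping fixed-size windows and duplicate removal by comparing hits in global coordinates. The predictor here is a hypothetical stand-in that reports occurrences of a short motif; the function names and the window parameters are illustrative and not part of IntaRNA.

```python
def windows(length, size, overlap):
    """Yield half-open (start, end) windows covering [0, length)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than the window size")
    start = 0
    while True:
        end = min(start + size, length)
        yield start, end
        if end == length:
            break
        start += size - overlap

def predict_in_windows(sequence, size, overlap, predict):
    """Run predict() per window and report each hit once, in global coordinates."""
    seen = set()
    results = []
    for start, end in windows(len(sequence), size, overlap):
        for local_start, local_end, score in predict(sequence[start:end]):
            hit = (start + local_start, start + local_end, score)
            if hit not in seen:  # the same site can be found in two windows
                seen.add(hit)
                results.append(hit)
    return results

def toy_predict(subsequence):
    """Hypothetical predictor: every occurrence of GGAG is a 'site' with score -1.0."""
    return [(i, i + 4, -1.0) for i in range(len(subsequence) - 3)
            if subsequence[i:i + 4] == "GGAG"]

hits = predict_in_windows("AAAAGGAGAAAA", size=8, overlap=4, predict=toy_predict)
print(hits)
```

Only the windowed sequence is held by the predictor at any time, but sites longer than the overlap can be missed, so the overlap has to be at least as large as the longest interaction of interest.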

Constrained RNA-RNA interaction prediction

Our group develops the tool IntaRNA (see PhD thesis A. Richter for details), which is one of the state-of-the-art programs for RNA-RNA interaction prediction. We are currently reimplementing and extending the tool (C++11, Boost, Doxygen, Autotools, OpenMP), which is hosted on GitHub at BackofenLab/IntaRNA. This project aims at the implementation and testing of new prediction modes that incorporate additional constraints to further improve prediction quality. To this end, an IntaRNA benchmark set and a corresponding protocol are compiled and used in the course of the thesis to evaluate the newly integrated features. Furthermore, statistics on known interactions and single-molecule structures will provide the parameters for the new constraints. The developed extensions to IntaRNA will be integrated into the main package for external use and further development.

Over the last few years, we have created interactive implementations of various algorithms discussed in our lectures. These are freely available in the Freiburg RNA tools - Teaching section of our public webserver. The algorithms are implemented in JavaScript and are accompanied by corresponding visualizations that help to understand each approach. In the course of this project we focus on sequence alignment algorithms as taught in our Bioinformatics-1 and -2 lectures.

Prediction of non-consecutive RNA-RNA interactions

  • exam in "RNA bioinformatics" lecture
  • C++ experience

Integration of BioJS into Galaxy

Galaxy is an open, web-based platform for data-intensive research. The University of Freiburg is running a Galaxy server to serve all different needs of our researchers. Visualization is a key aspect in the understanding of data analysis for medical and biological research. The Javascript library BioJS provides powerful visualization of multiple biological data. The overall aim is to integrate specific BioJS modules into Galaxy via its plugin architecture.

Large-scale clustering of non-coding RNAs in the Galaxy framework

Clustering of putative RNAs is currently the major approach for functional annotation of putative ncRNAs detected in genome-wide screens. GraphClust is one of the few approaches that can cluster hundreds of thousands of putative ncRNAs, as it is based on an alignment-free approach using an advanced graph kernel. The candidate clusters are iteratively retrieved and refined using RNA alignment tools. However, the clustering pipeline requires in-depth knowledge, as several tools have to be installed and configured. The goal of this project is an extension of the GraphClust tool using the Galaxy framework that makes it possible to (a) perform the clustering of RNAs via a web interface, (b) run the computations on various operating systems and computation frameworks, and (c) freely customize and extend the generic pipeline for specific needs. The project also involves applying the Galaxy workflow to a metatranscriptome dataset.

Characterization of ribosomal footprints with use of graph kernel based approaches

Ribosome profiling is an emerging technique that, using deep sequencing methods, gives new insight into the translation of proteins from single-codon to genome scale. In comparison to previously available methods such as microarrays and RNA-seq, Ribo-seq considers solely the mRNAs in a cell that are actively being translated, i.e. those providing the information for protein synthesis. This novel characteristic of Ribo-seq provides new data with a focus on the translation level. The obtained patterns of ribosomal footprints may reveal new aspects in the field of translation. The aim of this work is to classify Ribo-seq profiles according to different conditions and to find clusters with respect to Ribo-seq profiles. This is done with a tool named BlockClust, which is based on a graph kernel method called the Neighborhood Subgraph Pairwise Distance Kernel (NSPDK). BlockClust encodes expression profile data as graphs and employs the NSPDK method to achieve high performance. Although BlockClust was previously applied to cluster non-coding RNAs from their RNA-seq expression profiles, it can also be adapted for clustering and classification tasks on other types of data, e.g. ribosome profiling data. Therefore, we have adapted BlockClust by defining new attributes for finding patterns in Ribo-seq data and adding them to the previously available set of attributes. Moreover, we performed an optimization using different parameter sets. Furthermore, we showed that it is possible to employ BlockClust on ribosome profiles. We achieved a good performance in the classification of these profiles.

Approximate nearest neighbor query methods for large scale structured datasets

The task of efficiently finding the most similar representatives in a large set of entities is at the core of many problems in a variety of applications, ranging from chemoinformatics to recommender systems; when the objects of interest are structured entities, the problem becomes harder. In these cases structured instances are explicitly converted into sparse vectors that live in very high-dimensional spaces (even millions of features). Unfortunately, exact algorithms have a computational complexity that scales quadratically with the number of instances times the representation length of each instance; hence these approaches cannot be used when we have a large number of structured instances. A possible solution is to accept approximate results to gain efficiency. The candidate will extend one such approximate technique (the MinHash approximate nearest neighbor scheme) to efficiently solve the neighbor query in sub-linear time. The overall goal of the thesis is to provide an efficient and simple-to-use implementation of approximate nearest neighbor queries for large collections of high-dimensional sparse vectors.
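To make the MinHash idea concrete, here is a minimal sketch that treats each sparse vector as the set of its non-zero feature indices. For every seeded hash function the minimum hash value over the set is stored; the fraction of positions in which two signatures agree estimates the Jaccard similarity of the sets. The helper names and the use of seeded BLAKE2 hashes are choices for this example, and the sub-linear query itself (for instance via banding of signatures into hash tables) is not shown.

```python
import hashlib

def _hash(seed, feature):
    """Deterministic 64-bit hash of a feature under a given seed."""
    data = f"{seed}:{feature}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(features, num_hashes=256):
    """One minimum per seeded hash function over the non-empty feature set."""
    return [min(_hash(seed, f) for f in features) for seed in range(num_hashes)]

def estimated_jaccard(signature_a, signature_b):
    matches = sum(a == b for a, b in zip(signature_a, signature_b))
    return matches / len(signature_a)

# Hypothetical sparse instances given by their non-zero feature indices.
instance_a = set(range(0, 100))
instance_b = set(range(50, 150))
true_jaccard = len(instance_a & instance_b) / len(instance_a | instance_b)
estimate = estimated_jaccard(minhash_signature(instance_a),
                             minhash_signature(instance_b))
print(round(true_jaccard, 3), round(estimate, 3))
```

The signature length is independent of the dimensionality of the original vectors, which is what makes the scheme attractive for millions of features.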

Learning to design RNA polymers with graph kernels

Graph data structures allow us to model complex entities in a natural and expressive way. In the machine learning literature, several types of discriminative systems that can deal with graphs as input are known (e.g. recursive neural networks, graph kernels, graphical models, etc.); however, there are few generative approaches that can sample structures belonging to a desired distribution or class. The task of generating samples from a given distribution when this is accessible only via a finite number of examples is well developed when the domain of interest can be embedded in a vector space. The extension of these approaches to structured domains (i.e. where instances are strings, trees, graphs or hypergraphs) is however substantially less developed. One approach for learning constructive systems is based on a variant of the Metropolis-Hastings (MH) algorithm guided by an efficient graph grammar, which, crucially, can be efficiently induced from an example set. Such a neighborhood-graph-based grammar is suitable when the feasibility constraints are local in nature. RNA polymers, which form structures comprising hundreds of nodes (nucleotides), exhibit however dependencies between distant portions of the structure. In order to extend the constructive system to the RNA domain, Mr. Mautner has introduced a multi-level strategy based on a notion of graph minors, i.e. graphs obtained by edge contraction operations. An edge contraction is an operation which removes an edge from a graph while simultaneously merging the two vertices that it previously joined. By carefully defining a domain-dependent contraction strategy, Mr. Mautner was able to operate on smaller graphs for which local rules are sufficient to capture the feasibility constraints.
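The edge contraction operation described above can be sketched on a simple undirected graph stored as a dictionary from node to neighbour set. The representation and the function name are assumptions for this example; the domain-dependent choice of which edges to contract is not modelled here.

```python
def contract_edge(graph, u, v):
    """Return a new graph in which edge (u, v) is removed and v is merged into u."""
    if v not in graph.get(u, set()):
        raise ValueError(f"no edge between {u!r} and {v!r}")
    # The merged node inherits the neighbours of both endpoints.
    merged_neighbours = (graph[u] | graph[v]) - {u, v}
    result = {}
    for node, neighbours in graph.items():
        if node in (u, v):
            continue
        # Edges that pointed to v now point to the merged node u.
        result[node] = {u if n == v else n for n in neighbours}
    result[u] = merged_neighbours
    return result

# Hypothetical path graph a - b - c - d; contracting (b, c) leaves a - b - d.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
smaller = contract_edge(path, "b", "c")
print(smaller)
```

Applying the operation repeatedly yields the smaller graphs (minors) on which local rules can operate.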

Reinforcement learning techniques in RNA inverse folding

The functionality of a non-coding RNA molecule depends on its structure, which in turn is determined by the specific arrangement of its nucleotides. The inverse folding of an RNA refers to the problem of designing an RNA sequence which will fold into a desired structure. This is a computationally complex problem. Algorithms which solve this problem take different approaches, but they share the following strategy: they start from an initial sequence or population and try to move it towards a desired product by performing normal or optimized search methods. RNA inverse folding programs are given different constraints such as GC-content ranges or base pair or nucleotide configurations. The output is normally one or more sequences which fold into the target structure. This work introduces a basic system that, given a set of sample RNA secondary structures, produces models which generate structures similar to the sample set. The objectives and constraints are automatically extracted from the samples. For this purpose, a system is designed which generates models by learning on families of RNA sequences. This system consists of two subsystems. The first is responsible for decomposing the secondary structures of sample RNAs into structural features and building a structural feature corpus; it also extracts neighborhood connectivity models of structural features in the form of N-grams. The second is a reinforcement learning framework which uses the corpus and connectivity rules to produce models for generating structures which are similar to the samples. Results in this work show that the current system is able to produce models from RNA families which have a symmetric shape. To make the system capable of dealing with a broader range of RNA families and producing structures with functionalities identical to the sample structures, a refined feature extraction module has been added to the system. This module extracts the GC-content, size and local information of structural features and builds a refined feature corpus. This can provide the basis for a new set of experiments and a starting point for producing models with practical applications.

Explorative Enumeration of large energy landscapes

  • C++ implementation of the explorative energy landscape enumeration using the Vienna RNA package library.
  • Parallelization, benchmarking and implementation tuning.
  • Application of the developed program to large RNA molecules.
  • Creation of a complete pipeline to study kinetics of RNA molecules including visualization.

Investigating LocARNA parameter search space by using automatic configuration methods

In recent years many novel RNA species have been discovered by new sequencing techniques. The correct classification of these RNAs into new and existing families heavily relies on accurate sequence-structure alignment tools, which makes it desirable to constantly improve their alignment quality. Therefore, having a high-performing RNA alignment tool is of fundamental importance in the field of computational biology. LocARNA implements an efficient heuristic version of Sankoff's accurate but computationally expensive algorithm for simultaneous sequence and structure alignment. The use of heuristics makes the algorithm applicable in practice, but also forces the inclusion of many additional parameters. Since the performance of an algorithm depends on the parameter setting, it is desirable to optimize these settings in order to improve alignment results. One way to find optimal parameter configurations is to use an automatic algorithm configuration technique. In this work the state-of-the-art algorithm configuration tool SMAC is applied to improve LocARNA's default parameter settings. The optimization focuses on fundamental parameters of the LocARNA algorithm. Both global and local alignment cases are covered, although for the local case this marks the first in-depth optimization attempt. Hence this work also introduces a complete local alignment parameter optimization pipeline for LocARNA. As a result, improved default parameter settings as well as different input scenario settings for both the global and local alignment cases are proposed. Notably, the average alignment quality of the local case on an extension of the BRAliBase dataset was improved by up to 26%. In conclusion, the presented work not only managed to optimize LocARNA's local alignment but also provides a solid foundation for further work on parameter optimization using the implemented pipeline.

Graph-based clustering of CRISPR-Cas systems

  • Find the best way to encode the CRISPR-Cas system as a graph that represents nature as realistically as possible.
  • Use EDeN to perform unsupervised clustering of all available CRISPR-Cas systems in bacteria and archaea.
  • Compare results to previous classification systems.

Learning to Construct Graphs with Real Vector Attributes Using Graph Kernels

Graph data structures allow us to model complex entities in a natural and expressive way. In the machine learning literature, several types of discriminative systems that can deal with graphs as input are known (e.g. recursive neural networks, graph kernels, graphical models, etc.); however, there are few generative approaches that can sample structures belonging to a desired distribution or class. The task of generating samples from a given distribution when this is accessible only via a finite number of examples is well developed when the domain of interest can be embedded in a vector space. The extension of these approaches to structured domains (i.e. where instances are strings, trees, graphs or hypergraphs) is however substantially less developed. While specialized applications exist, e.g. sampling phylogenetic trees, sampling dependency graphs for structural learning in graphical models, or sampling large Web-like networks, data-driven approaches that can deal with general types of graphs are still in their infancy. Important applications of a successful generative graph system include the de novo generation of molecular graphs for drugs and RNA biopolymers with user-defined properties derived from prototypical natural examples. In these cases the spatial information of the atom arrangement becomes important for the determination of the associated physicochemical properties. It is therefore necessary to upgrade these generative graph systems to deal with graphs that can encode spatial information in the form of multiple real-valued attributes (e.g. 3D coordinates, distances, angles). In the thesis the candidate will address the constructive learning problem using a variant of the Metropolis-Hastings (MH) algorithm tailored for structural data types. She will upgrade the efficient graph grammar approach of a pre-existing code base to deal with graphs with real-valued attributes.
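The Metropolis-Hastings scheme mentioned above can be illustrated with a generic sampler, assuming a symmetric proposal so that a candidate is accepted with probability min(1, p(candidate) / p(current)). In the thesis setting the states would be graphs and the proposals would come from a graph grammar; here the states are plain integers and both the proposal and the score function are hypothetical stand-ins.

```python
import math
import random

def metropolis_hastings(initial, propose, log_score, steps, rng):
    """Sample states with probability proportional to exp(log_score(state))."""
    current, current_score = initial, log_score(initial)
    samples = []
    for _ in range(steps):
        candidate = propose(current, rng)
        candidate_score = log_score(candidate)
        # Symmetric proposal: better candidates are always accepted,
        # worse ones with probability p(candidate) / p(current).
        accept = candidate_score >= current_score or (
            rng.random() < math.exp(candidate_score - current_score))
        if accept:
            current, current_score = candidate, candidate_score
        samples.append(current)
    return samples

def toy_log_score(x):
    """Hypothetical target on 0..9, peaked at 5; infeasible states score -inf."""
    return -((x - 5) ** 2) / 2.0 if 0 <= x <= 9 else float("-inf")

def toy_propose(x, rng):
    """Symmetric local move: one step up or down."""
    return x + rng.choice((-1, 1))

samples = metropolis_hastings(5, toy_propose, toy_log_score, 20000, random.Random(0))
print(sum(samples) / len(samples))
```

Infeasible candidates receive a score of minus infinity and are therefore never accepted, which mirrors the role of feasibility constraints in the structured setting.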

A graph kernel approach to the identification and characterisation of structured noncoding RNAs using multiple sequence alignment information

Structured noncoding RNAs perform many functions that are essential for protein synthesis, RNA processing, and gene regulation. Structured RNAs can be detected by comparative genomics, in which homologous sequences are identified and inspected for mutations that conserve RNA secondary structure. To detect novel RNA classes in bacteria and archaea, a variety of bioinformatics strategies have been used, e.g. looking in upstream regions of protein coding genes for cis-regulatory RNAs. To identify ncRNAs independently from protein coding genes, Z. Weinberg has proposed a computational pipeline based on an initial BLAST clustering further refined by looking into secondary structures with CMfinder. The identified structures are then used in homology searches to find homologues that allow CMfinder to further refine its structural alignment. The resulting alignments are scored and then analysed manually to identify the most promising candidates and to infer possible biologic roles.

Interactive de novo molecular design

Synthesis of small molecules that improve on the curative properties of existing drugs or that are effective in previously untreatable illnesses is a very hard task, a task on which pharmaceutical companies are investing enormous amounts of resources. Computational methods become therefore an interesting solution when they can effectively replace the time consuming and expensive design, synthesis and test phases. Since de novo molecule-design systems have to explore a virtually infinite search space, exhaustive searching is infeasible, and they typically resort to local optimisation strategies. To date, one of the most critical aspects is the reliability of the evaluation function invoked to judge the quality of molecules that can be (and generally are) very different from those used in the function induction phase. One possible approach to overcome this difficulty is to integrate the expert knowledge of (medicinal) chemists in the evaluation loop. Doing so in an efficient way is not a trivial task, since one has to 1) minimise the number of times the system resorts to the expensive human oracle, and 2) use a form of interaction suitable for humans.

CRISPRloci visualization

  • Find the best way to modify/customize the CGView tool in order to work for our purpose (Java).
  • Integrate into CRISPRloci web server (JSP, HTML, Java).

RNA energy landscapes with pseudoknot structures

Most studies of RNA kinetics use nested structure models to enable at least moderate sequence lengths. Nevertheless, there is evidence that pseudoknot structures are important for the function of some RNA molecules. Thus, omitting them in kinetics studies can lead to wrong results. This project will compare kinetics based on energy landscapes with and without pseudoknot structures. Furthermore, new strategies have to be explored to cope with the vast increase in landscape size and enable reasonable studies.

  • C++ implementation of the explorative energy landscape enumeration strategies presented in our article in concert with the identified strategies by Bettina Hübner using the available algorithm implementations from the Energy Landscape Library (ELL).

Similarity notions for RNA kinetics comparison

For larger RNA molecules it is often not computationally feasible to enumerate their whole energy landscape. Thus, only partial views of the landscapes are used to compute the kinetics of the respective molecule. Within this project, different strategies are explored to measure the similarity of kinetics, i.e. to evaluate how well the coarse-grained model reflects the kinetics based on the complete energy landscape information.

Generating a local ncRNA benchmark set to evaluate local RNA alignment tools

Multiple local alignment of RNA sequences is still a challenging problem, as the parameters of existing tools have not yet been optimized for the local alignment case. The first step towards solving this problem is the generation of a local benchmark set with which existing local RNA alignment tools can be evaluated. The main part of this work is the implementation of a pipeline that appends genomic context of a given length to an existing (global) benchmark set. A simple evaluation of LocARNA on the local ncRNA benchmark set and a random test set will be performed.

Differential Benchmarking of CopraRNA - Finding the optimal input for a specific question

  • Generate an extensive dataset for differential benchmarking (also covering non-enteric bacteria).
  • Write scripts that automatically run and evaluate the CopraRNA runs.
  • Draw conclusions and develop guidelines for input organism selection.

Java GUI for Multiple Interval-based Curve Alignments (MICA)

  • Reimplementation of the MICA core algorithm in Java.
  • Development of a Graphical User Interface for MICA in Java.
  • Application of the new tool on tree growth data and other data from literature, evaluating the new implementation.

Improving miRNA target prediction in humans using a highly descriptive graph-based, machine-learning model

  • Compile training and test datasets of miRNA-mRNA interactions.
  • Generate highly sensitive candidate interaction sites.
  • Integrate all possible features into a novel graph model.
  • Train and test machine learning model using different settings and parameters and use model to filter candidates.
  • Compare results to existing tools.

Pruning strategies for large energy landscapes

The energy landscape framework enables the study of the folding kinetics of molecules, for instance the structure formation process of single RNA molecules or the interaction formation of two RNAs. To this end, the transition probabilities from one structure to its possible successor structures have to be identified. Unfortunately, the number of structures a molecule can adopt grows exponentially, and the energy landscape grows accordingly. One approach to this problem is to group structures into "macro-states" and to consider only transitions between such structure ensembles. But their number is often still too large to enable kinetics computations. Within this project, different approaches to prune the macro-state energy landscape representation are tested in order to reduce the corresponding transition encoding to a size that is feasible for kinetics computations. The pruning strategies are evaluated quantitatively and qualitatively with respect to reduced computational requirements and preserved kinetics quality.

RNA Barcodes for High-Throughput Sequencing Experiments

CLIP-seq is a method for genome-wide screening of interactions between RNAs and RNA-binding proteins. iCLIP is an extension of CLIP-seq that allows locating RNA-protein interactions with nucleotide precision. iCLIP employs random sequence tags to enable calculation of the number of binding events from PCR-amplified source material. Errors introduced into these sequence tags during amplification or sequencing can lead to serious overestimation of binding events. This thesis examines the suitability of RNA barcodes developed for multiplex sequencing assays to prevent or mitigate this effect.
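The overestimation effect can be illustrated with a small sketch: counting distinct tags treats every erroneous copy as a new binding event, whereas merging tags that differ from a more abundant tag in at most one position removes most of these artefacts. The greedy merging rule, the function names and the example tags are assumptions for illustration; they are not the barcode designs examined in the thesis.

```python
from collections import Counter

def hamming(a, b):
    """Number of mismatching positions between two equal-length tags."""
    return sum(x != y for x, y in zip(a, b))

def count_binding_events(tags, max_mismatches=1):
    """Greedy estimate: visit tags by decreasing abundance and merge near-duplicates."""
    accepted = []
    for tag, _ in Counter(tags).most_common():
        if not any(hamming(tag, kept) <= max_mismatches for kept in accepted):
            accepted.append(tag)
    return len(accepted)

# Hypothetical reads at one crosslink site: two true tags plus one sequencing error.
tags = ["ACGTA"] * 10 + ["ACGTT"] + ["GGCCA"] * 5
naive_count = len(set(tags))
corrected_count = count_binding_events(tags)
print(naive_count, corrected_count)
```

Purely random tags can also differ by a single base by chance, so this kind of merging trades overestimation for some underestimation; barcode sets designed with a guaranteed minimum distance avoid that ambiguity.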

Graph-kernel based aromaticity prediction

  • Data collection and preparation for training and testing of the SVMs.
  • Evaluation of the NSPDK prediction using the available tools from the GGL- and NSPDK-package.

Atom mapping of chemical reactions via Constraint Programming

  • C++ implementation of the CP-based atom mapping approach for even ITS rings presented in our article using the Gecode library.
  • Extension of the CP-approach to odd rings.
  • Evaluation of the approach using atom mappings of known chemical reactions.

Cluster based prediction of SH2 domain-peptide interactions using Graph Kernel

  • Collect data from several high-throughput experiments (e.g. microarray) and compile them into training and test sets.
  • Optimise hyper-parameters for the NSPDK kernel.
  • Use Support Vector Machine (SVM) based on NSPDK kernel for the classification.

Large Scale Activity Profile Induction for Small Molecules

  • efficiency in the train and in the test phase: some bioassays with hundreds of thousands instances are available; in the test phase 30M compounds have to be screened;
  • accuracy: the predicted activity profile has to be sufficiently close to the true activity profile to provide a reliable localization of compounds in activity space;
  • semi-supervised mode of training should be possible: since many bioassays contain information only for few tens to hundreds compounds it is necessary to make the best use of the vast amount of unsupervised information available;

In this thesis the candidate will use a graph kernel (NSPDK) to train a linear max margin model via fast stochastic gradient descent technique. The candidate will set up the necessary infrastructure to perform and monitor the in-silico predictions and develop novel techniques for large scale semi-supervised problems in the chemoinformatics field.
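As a rough sketch of training a linear max-margin model with stochastic gradient descent, the following code applies a Pegasos-style hinge-loss update to sparse feature dictionaries. The specific update rule is an assumption, since the description does not say which SGD variant is meant, and the hand-written features stand in for the NSPDK graph kernel features, which are not computed here.

```python
import random

def train_hinge_sgd(examples, epochs=20, lam=0.01, seed=0):
    """examples: list of (sparse feature dict, label in {-1, +1})."""
    rng = random.Random(seed)
    weights = {}
    step = 0
    for _ in range(epochs):
        order = list(range(len(examples)))
        rng.shuffle(order)
        for index in order:
            features, label = examples[index]
            step += 1
            eta = 1.0 / (lam * step)  # decreasing learning rate
            margin = label * score(weights, features)
            for key in weights:       # shrinkage from the L2 regulariser
                weights[key] *= 1.0 - eta * lam
            if margin < 1.0:          # hinge loss is active: move towards the example
                for key, value in features.items():
                    weights[key] = weights.get(key, 0.0) + eta * label * value
    return weights

def score(weights, features):
    return sum(weights.get(key, 0.0) * value for key, value in features.items())

# Hypothetical toy data: "active" compounds share ring features, inactive ones chains.
examples = [
    ({"ring": 1.0, "amine": 1.0}, +1),
    ({"ring": 1.0}, +1),
    ({"chain": 1.0}, -1),
    ({"chain": 1.0, "ester": 1.0}, -1),
]
weights = train_hinge_sgd(examples)
print({key: round(value, 3) for key, value in weights.items()})
```

Each update touches only the non-zero features of one example, which is what makes this family of methods practical for bioassays with hundreds of thousands of instances.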

Analysis of CLIP-seq and PARCLIP data for Argonaute to identify miRNA target sites

  • Collect PARCLIP and HITS-CLIP data for mammals and identify the corresponding mRNA sequences to the CLIP sequences.
  • Develop quality measures to map microRNA to each CLIP sequence.
  • Explore general properties and unique characteristics of the collected data. How do these datasets correspond to data found in microRNA databases?
  • Optimise IntaRNA parameters to identify correct target sites so that the predictions are very sensitive.

Learning binding preferences of RNA-binding proteins using in vitro affinities and in vivo binding sites

Structural elements in long non-coding RNAs.

Non-coding RNAs (ncRNAs) form a heterogeneous class of transcripts with little or no protein-coding capacity. Recently, it turned out that these molecules have a plethora of key regulatory roles in eukaryotic cells. NcRNAs act directly at the RNA level without ever being translated into protein. According to their length, one basically distinguishes small (<200 bp) and long (>200 bp) ncRNAs. The function of a small RNA is typically determined by its secondary structure fold rather than by the underlying primary sequence. There are several classes of small ncRNAs with well-defined and well-understood secondary structure motifs; examples include microRNAs (usually forming stem-loop structures) and transfer RNAs (which exhibit the prominent cloverleaf motif). In contrast, it is unclear to what extent long non-coding RNAs contain and are determined by regions of conserved secondary structure. The aim of this work is to analyse secondary structures of long ncRNAs on a genome-wide scale with state-of-the-art bioinformatic techniques, in order to identify and further characterise common structural elements shared by these transcripts. This may yield novel insights into the computational de novo prediction of long ncRNAs in recently sequenced eukaryotic genomes, one of the open problems in current RNA bioinformatics.

De Novo Molecular Design Using Graph Kernels

Large scale multiple genome alignment via an efficient kernel method

In order to make use of the large amount of genomic information that sequencing experiments are making available, efficient algorithmic procedures are needed. One of the most fundamental types of processing for genomic data is genome alignment, whereby regions belonging to several related genomes are put in one-to-one correspondence. As a result of the alignment procedure, information of biological relevance can be derived, such as the evolutionary conservation rate of given regions. The sequences in conserved regions are believed to be important and to correspond to functional biological entities like proteins and non-coding RNA. In other words, correct alignments allow the (semi-)automatic discovery of biological objects (belonging either to known classes or even to yet unknown classes). However, current genomic alignment techniques 1) are suitable only for relatively closely related species, and 2) can process only a relatively small number of genomes. In order to allow alignments of thousands of genomes, novel efficient techniques are needed. The choice of computational models suitable for this task has to take into consideration several requirements, such as a) efficiency, b) accuracy and c) flexibility.

Intersections of genomic intervals using interval trees

Testing for overlaps between genomic features is an important task in genomics research. This operation is known as intersection. In this project I implement a fast and flexible method to find intersections between two sets of genomic intervals by using interval trees. The implementation (unionBed) takes sets of features in BED format as input and finds overlaps between them. The unionBed results are then used to analyse three different secondary structure prediction hypotheses for co-transcriptional RNA folding and to compare them to each other.
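
To make the idea of interval-tree intersection concrete, here is a minimal sketch of a static centered interval tree over half-open intervals, as used by BED coordinates. It is not the unionBed implementation; the class name, the tuple layout `(chrom, start, end, name)` and the helper `intersect` are assumptions chosen for illustration.

```python
from collections import defaultdict


class IntervalTree:
    """Static centered interval tree over half-open intervals (start, end, name)."""

    def __init__(self, intervals):
        intervals = list(intervals)
        if any(iv[0] >= iv[1] for iv in intervals):
            raise ValueError("intervals must satisfy start < end")
        self.empty = not intervals
        if self.empty:
            return
        # The median start is covered by at least one interval, so every
        # node stores something and the recursion terminates.
        starts = sorted(iv[0] for iv in intervals)
        self.center = starts[len(starts) // 2]
        here = [iv for iv in intervals if iv[0] <= self.center < iv[1]]
        left = [iv for iv in intervals if iv[1] <= self.center]
        right = [iv for iv in intervals if iv[0] > self.center]
        self.by_start = sorted(here, key=lambda iv: iv[0])
        self.by_end = sorted(here, key=lambda iv: iv[1], reverse=True)
        self.left = IntervalTree(left) if left else None
        self.right = IntervalTree(right) if right else None

    def overlaps(self, qs, qe):
        """Return all stored intervals overlapping the query [qs, qe)."""
        if self.empty:
            return []
        hits = []
        if qe <= self.center:
            # Query lies left of the center: stored intervals overlap iff start < qe.
            for iv in self.by_start:
                if iv[0] >= qe:
                    break
                hits.append(iv)
            if self.left:
                hits += self.left.overlaps(qs, qe)
        elif qs > self.center:
            # Query lies right of the center: stored intervals overlap iff end > qs.
            for iv in self.by_end:
                if iv[1] <= qs:
                    break
                hits.append(iv)
            if self.right:
                hits += self.right.overlaps(qs, qe)
        else:
            # Query covers the center, so every interval stored here overlaps.
            hits.extend(self.by_start)
            if self.left:
                hits += self.left.overlaps(qs, qe)
            if self.right:
                hits += self.right.overlaps(qs, qe)
        return hits


def intersect(features_a, features_b):
    """Return sorted (name_a, name_b) pairs of overlapping features."""
    per_chrom = defaultdict(list)
    for chrom, start, end, name in features_b:
        per_chrom[chrom].append((start, end, name))
    trees = {chrom: IntervalTree(ivs) for chrom, ivs in per_chrom.items()}
    pairs = []
    for chrom, start, end, name in features_a:
        if chrom in trees:
            pairs += [(name, hit[2]) for hit in trees[chrom].overlaps(start, end)]
    return sorted(pairs)
```

Building one tree per chromosome for the second feature set and querying it with each feature of the first set gives a query cost that is logarithmic in the tree size plus the number of reported overlaps, provided the tree is reasonably balanced.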

hIntaRNA - Comparative prediction of sRNA targets in prokaryotes

The prediction of targets of bacterial sRNAs is a very challenging task, addressed by several approaches. The experimental testing and verification of sRNA targets is very costly and labour-intensive. Therefore, the reliable algorithmic prediction of putative sRNA targets could vastly reduce the amount of wet lab work. However, due to the very short and often imperfect complementarity between the sRNA and its target, the prediction is not a trivial task. The IntaRNA algorithm is one such approach, but it frequently does not yet yield satisfying results and therefore demands improvement. It has been stated "that it is difficult to make significant target predictions when searching sequences from a single organism, and that targets should be predicted in a comparative analysis of multiple organisms". Even though this was stated for eukaryotes, the basic idea also holds for bacteria. The task of improving the IntaRNA algorithm's prediction quality utilizes exactly this concept, also incorporating the individual phylogenetic distances between the organisms analyzed. For instance, there is compelling evidence that the MicA and RybB sRNAs in E. coli and Salmonella each have homologous targets in both organisms, thus indicating conservation on the regulatory level. Here, an implementation of the idea that overlapping target predictions for distinct organisms yield stronger evidence of a correct functional prediction is presented.

Secondary structure motif determination in ncRNA via graph kernel based computational models

A partition function variant of RNA base pair maximization in ADP

The goal of the project is to lay the foundations of computing RNA base pair probabilities, as done by the McCaskill algorithm, in the framework of Algebraic Dynamic Programming (ADP). In order to concentrate on the essential aspects of this problem, we simplify the scoring model of the algorithm to a Nussinov-style base pair maximization. The main challenge is to compute the outside part, which has no natural correspondence in the grammar parsing framework underlying ADP.
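
For orientation, the inside part of such a simplified model can be sketched in a few lines of plain Python (not ADP). The sketch assumes that every base pair contributes the same Boltzmann weight `pair_weight` and that Watson-Crick and G-U pairs are allowed; these choices, the function name and the `min_loop` parameter are illustrative assumptions. The outside part, which the project identifies as the main challenge, is deliberately not shown.

```python
# Allowed base pairs (Watson-Crick plus G-U wobble); an assumption of this sketch.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def partition_function(seq, pair_weight=1.0, min_loop=3):
    """Inside partition function of a Nussinov-style model.

    Z[i][j] sums the weights of all non-crossing structures on seq[i:j],
    where a structure with p base pairs has weight pair_weight ** p.
    With pair_weight == 1.0 the result is the number of structures.
    """
    n = len(seq)
    # Empty and single-base ranges admit only the empty structure.
    Z = [[1.0] * (n + 1) for _ in range(n + 1)]
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            # Case 1: the last base (index j - 1) is unpaired.
            total = Z[i][j - 1]
            # Case 2: the last base pairs with k, enclosing at least min_loop bases.
            for k in range(i, j - 1 - min_loop):
                if (seq[k], seq[j - 1]) in PAIRS:
                    total += Z[i][k] * pair_weight * Z[k + 1][j - 1]
            Z[i][j] = total
    return Z[0][n]
```

Because each structure is generated exactly once by the case distinction on the last base, the recurrence is unambiguous, which is the property a partition function computation needs and plain maximization does not.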

Generic JSP-based web frontend creation

Web frontends of terminal-based bioinformatics tools are important to ease their use for non-computer scientists and to enable ad hoc usage. The project aims at the development of a highly generic web frontend framework generalizing the currently available JSP-based frameworks of the CPSP-web-tools and Freiburg-RNA-tools. Main goals are to simplify the setup of new frontends for arbitrary terminal tools and to develop a robust generic framework. The integration is exemplified by creating a frontend for the recently developed program CARNA.

Alignment improvements using consensus dot plots

RNA alignments are essential for identifying and characterising structured non-coding RNA. RNA alignments differ from DNA or protein alignments in that they not only align according to sequence similarity, but also take the base-pairing patterns of secondary structures into account. A common procedure to characterise the structure of non-coding RNAs is to predict the consensus structure of elements of the same family. The problem with this is that any errors in the alignment are reflected directly in the quality of the predicted consensus structure. Therefore, it is of high importance to get the correct alignment of RNA families. The largest database of such family alignments is Rfam. A common error in these alignments is that a small subset of sequences has been misaligned with respect to the structure, which results in some stems being slightly offset to the left or right in comparison to the others. The goal of this thesis is to develop a method to automatically detect and re-align these misaligned stems and thus to deliver a quick method to correct these common errors in the Rfam database. Furthermore, a key part of the work is to understand the state of the art in approaches to align RNA sequences and to perform benchmark experiments that compare current tools to the method developed here. It is also important to understand the complexity of measuring the "goodness" of an alignment and to develop and compare such measures.

Local sequence and structure features in long RNA sequences

There is much evidence in molecular biology that RNA plays an important role in living cells. Research results in the last decade have shown that protein-coding sequences are only the tip of the iceberg with respect to genomic functional elements. Up to 90% of the genome is transcribed into RNA whose function still remains largely unknown. The structure of an RNA is an important property for its correct function, e.g. the cloverleaf of a tRNA. However, the experimental determination of the structure is still a very challenging task; therefore we try to deduce the structure from the nucleotide sequence that encodes it. Furthermore, we find evidence that long RNAs have local regions of functionality and that the entire sequence does not always contribute to a particular function. Examples are cis-regulatory elements on mRNA such as SECIS elements and miRNA binding sites. In this project we want to analyse long RNA sequences with respect to different sequence and structure features. The project aims to identify signatures of natural RNAs and dependencies between RNA sequence and structure. Sequence features comprise the A, U, G and C content as well as di-nucleotide and tri-nucleotide content. In terms of structural features we want to consider accessibilities, base-pair probabilities, accuracies (MEA) and predictions from tools like RNALfold. Given these features, we want to identify dependencies between them and between different sequences. First, the project involves a graph visualisation for the raw data of single features and different combinations of features. Because of the huge amount of data, we need to be able to focus or zoom into regions of interest. Further, we want to reduce the feature information to only regions of high significance in comparison to a background model. Thus, a suitable background model needs to be defined for each feature. With the simplified view, it should be easier to visually spot correlations between several features at once.
After an initial visual inspection, automatic methods shall be developed to analyse real datasets of different RNA classes to identify distinct sequence and/or structure signals. First we would like to concentrate on known cis-regulatory elements within the UTRs of mRNAs, and finally we would like to apply the automatic analysis developed in this thesis to find unknown signals in long non-coding RNAs.
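
The sequence features mentioned above (mono-, di- and tri-nucleotide content, and local content along a long sequence) can be sketched as follows. The function names, the conversion of T to U and the sliding-window parameters are assumptions made for this illustration; structural features such as accessibilities would come from external folding tools and are not covered.

```python
from collections import Counter


def kmer_content(seq, k):
    """Relative frequency of every k-mer (k=1, 2, 3 for mono-, di-, tri-nucleotides)."""
    seq = seq.upper().replace("T", "U")
    total = len(seq) - k + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: count / total for kmer, count in counts.items()}


def windowed_gc(seq, window, step):
    """GC fraction in sliding windows, a simple local sequence feature."""
    seq = seq.upper()
    values = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        values.append(sum(1 for base in chunk if base in "GC") / window)
    return values
```

Comparing such per-window values against the same statistic on a background model (for example shuffled sequences) is one way to reduce the data to regions of high significance.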

RNA-Protein interaction prediction with Graph Kernels

The aim of this work is to help with the implementation and evaluation of the novel algorithm Exparna-P. This algorithm computes all exact pattern matchings between two RNA strands for the entire structure ensemble. In order to speed up the algorithm, a new method needs to be implemented which computes the probability that a position is unpaired under a loop. Then the already existing chaining algorithm has to be slightly modified in order to compute the best set of non-overlapping and non-crossing exact pattern matchings for Exparna-P. The third part of this bachelor thesis is a comparison of the performance of Exparna-P with that of the Exparna tool.

Multiple sequence alignment methods of long non-coding RNAs

Long ncRNAs are a rapidly advancing field of genetics, and their roles (in gene regulation), organization, conservation and medical implications have so far been studied only briefly. It is, however, expected that they will play a great role in further genetic studies and progress. Due to their (sometimes impressive) length (of up to several hundred kb) and other particularities, their sequences are rather difficult to align. However, valid sequence alignments are the essential prerequisite for most subsequent bioinformatic studies of lncRNAs. Therefore, we analyse, compare and benchmark different alignment sets of vertebrate long ncRNAs, namely the Ensembl EPO alignments, the Galaxy Multiz/TBA blocks and alignments generated by a self-developed pipeline, and identify advantages and drawbacks of sequence alignments of lncRNAs.

Evaluating contaminations in genomic sequences

Despite continued advances in whole genome sequencing techniques and the development of powerful assembly algorithms, newly sequenced genomes still often suffer from contamination introduced during the sequencing process. The most common sources of contamination are accessory DNAs deliberately attached to the DNA/RNA under investigation, including vectors, adapters, linkers, and PCR primers. However, there are also unintended events, e.g. caused by transposon activity or simply impurities, that lead to contaminated genomic sequences. These may then result in misassemblies of genomic sequences, meaningless analyses and potentially erroneous conclusions. However, no one knows to what extent publicly available genomes are contaminated. To address this unsatisfying situation, we plan to develop a comparative genomics approach to broadly identify contaminations in available genomic sequences. The project is open not only to bioinformaticians and computer scientists; it is also suitable for students with a background in biology.

A new heuristic algorithm for IntaRNA for improved RNA-RNA interaction prediction

The number of discovered ncRNAs (non-coding RNAs) that regulate target mRNAs by base pairing is growing fast. This creates a demand for identifying the target mRNAs of those ncRNAs. Prediction of such interactions between ncRNAs and mRNAs has thus become necessary to help identify targets for known ncRNAs. A few computational algorithms were developed to predict such interactions. While some of the algorithms were fast enough for genome-wide searches, they were not very accurate in predicting interactions between long RNAs. This is because they neglected an important factor for interaction formation, namely the accessibility of the interacting site. IntaRNA considers site accessibility while maintaining the same time and space complexities as these fast algorithms. IntaRNA includes two algorithms: one gives optimal results according to the Turner free energy model, but is time-consuming with time complexity O(n²m²). The second algorithm is heuristic with time complexity of only O(nm), but does not give optimal results for all input sequences. In this thesis we present improvements over both algorithms of IntaRNA. First, we modified the non-heuristic algorithm to model more accurately how RNAs actually form an interaction. It simulates, in the same order, the sequence of events in which interaction formation is thought to happen in reality. The new implementation makes it possible to forbid high energy barriers that might be encountered during interaction formation and that are less likely to be overcome. Second, we improved the accuracy of the heuristic algorithm of IntaRNA, making it more reliable for use in biological research, without significantly increasing its runtime and space requirements.

Development and Implementation of an Alignment Program for Canonical Pseudoknots

At our lab, a general method to align various restricted classes of pseudoknots has been developed. The alignment scheme has also been implemented, but due to its generality, it is comparatively slow and not suitable for many large-scale practical applications. This work focuses on developing an efficient implementation of only one specialized instance of this scheme (the R&G pseudoknot class) that can be used in real practical scenarios. The topic is suitable for people interested in algorithms, data structures, software development, and C++ programming.

RNA Consensus Interaction Prediction

RNA-RNA interaction is a subject of considerable biological relevance as the binding of ncRNA to mRNA can affect both the transcription and translation of the bound mRNA and hence regulate gene expression. The accuracy and reliability of single sequence RNA structure prediction has been shown to increase significantly when the structure of an aligned set of RNA homologs is computed. As such, it is posited that by augmenting an existing RNA-RNA interaction prediction algorithm, that determines an interaction structure based only on thermodynamics, with a phylogenetic component a structure prediction of improved quality can be obtained. This thesis presents the theory, implementation and evaluation of an algorithm that combines thermodynamic and phylogenetic information to predict a consensus interaction structure on a set of aligned mRNAs and ncRNAs.

Experimental and theoretical investigations into the real-time analysis of microarray-based RNA amplification

  • Application of CDDM to a NASBA microarray with fixed amounts of target RNA
  • Incorporation of NASBA amplification into the CDDM binding kinetics

Centroid-based identification of local RNA elements

In this thesis we tackle the problem of identifying local RNA elements on a genome-wide scale. We employ a fast sparse algorithm to predict maximum expected accuracy structures based on base-pairing/unpairing probabilities. Moreover, we introduce a new locality definition and present an accuracy function reflecting this locality. Base-pairing and base-unpairing probabilities can be efficiently computed using RNAplfold, which is included in the Vienna package. Based on these probabilities, we identify structured regions that have high probabilities of containing significant local RNA motifs. After that, we introduce our new program RNAMotid together with other included features that enable it to scan genome-wide sequences for structured regions. Moreover, we discuss how several modules were integrated in our program to keep the analysis flexible and its steps optional. Finally, we evaluate the performance of RNAMotid in identifying local RNA motifs embedded in randomly shuffled context. Before that, we apply an overall parameter training followed by a family-based parameter training. Then we discuss the factors that affect the performance of RNAMotid.

Exploring structural characteristics of mRNA target sites using local folding

  • Research and preparation of the topic: read about RNA secondary structure, local folding, accessibility, positional entropy, etc. Also gather information on what has previously been done in the structural analysis of target sites.
  • Gather data: find experimentally validated target sites for different types of non-coding RNA (and proteins if possible).
  • Apply existing local folding programs to the data and calculate the structural characteristics of the target sites.
  • Implement a well-documented pipeline with Perl to be able to analyse arbitrary target sites in future.
  • Written manuscript.

Folding simulations in side chain lattice protein models

Side chain lattice protein models are a reasonable and necessary extension of the widely used backbone lattice protein models. To enable folding simulations, a structural neighborhood relation, a so-called move set, has to be defined that enables e.g. Monte Carlo simulations of the folding process. The thesis presents the K-local move set, a local move set defined generically for lattice protein models. The K-local move set is defined for both backbone and side-chain protein models via constraint satisfaction problems. The constraint-based approach enables its use for an arbitrary lattice. The K-local move set is then used in a simulation procedure for side-chain protein structures in the face-centered cubic lattice using real protein sequences and structures.

Inferring RNA Stem-Loop descriptors from multiple sequence-structure alignments for an index-based RNA search method

RNA can be grouped into RNA families according to structural and functional similarities. Currently, the Rfam 9.1 database ( http://rfam.sanger.ac.uk ) contains more than 1300 such families. We have already developed a fast index-based (with affix trees) search method for RNAs. Here, the query is a descriptor that consists of a stem-loop structure with possible wildcards at different positions. The more sequence information is given, the faster the underlying index-based search engine is. On the other hand, if too much sequence information is given, related but inexactly matching stem-loop structures would not be found. Therefore, the goal of this bachelor thesis is to derive such descriptors from Rfam seed alignments (or other multiple RNA sequence-structure alignments) to feed them into the search engine. If each necessary single descriptor gives a match within a certain region, one could infer a match of the underlying RNA family. A descriptor can therefore be seen as a necessary local motif of an RNA family.

Approximate pattern matching under generalised edit distance and extensions to suffix array library

The approximate pattern matching problem is the problem of finding all occurrences of a certain pattern in a usually much longer text, allowing for a fixed error threshold in the matching. The problem has been studied extensively and many very good solutions have been found. However, sufficiently general instances of the problem, namely those allowing for generalised error functions, remain without satisfactory algorithms. This thesis is an attempt to provide such a solution. The new method relies on the suffix array data structure to preprocess the text in linear time and to allow fast queries later. The new algorithm has two desirable features: it has a fairly simple explanation and implementation, and its space and time bounds are independent of the size of the alphabet, allowing for arbitrarily large alphabets. Furthermore, the new algorithm handles wildcards quite well while retaining the same worst-case time and space complexities. The algorithms are compared on genuine genetic data from the zebrafish genome and the results are presented. Finally, a parallelized version of the algorithm is presented for the CREW-PRAM model of computation. In addition to presenting the new algorithm, several contributions were made to an existing affix array library.
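
To make the problem statement concrete, the sketch below solves approximate pattern matching under a user-supplied substitution cost function with the classical Sellers dynamic programming scheme. This is explicitly not the suffix-array-based method of the thesis; it is a quadratic-time baseline, and the function names and the example cost function are assumptions for illustration.

```python
def approximate_matches(pattern, text, max_cost, sub_cost, indel_cost=1.0):
    """Sellers-style DP for approximate matching under generalised costs.

    Returns (end, cost) for every exclusive end position in text at which
    some substring matches the pattern with total cost <= max_cost.
    sub_cost(a, b) is the cost of aligning pattern symbol a with text symbol b.
    """
    m = len(pattern)
    # Column for the empty text prefix: only deletions of pattern symbols.
    prev = [i * indel_cost for i in range(m + 1)]
    hits = []
    for end, symbol in enumerate(text, start=1):
        # col[0] = 0 lets an occurrence start at any text position.
        col = [0.0] * (m + 1)
        for i in range(1, m + 1):
            col[i] = min(
                prev[i - 1] + sub_cost(pattern[i - 1], symbol),  # (mis)match
                prev[i] + indel_cost,      # extra text symbol
                col[i - 1] + indel_cost,   # skipped pattern symbol
            )
        if col[m] <= max_cost:
            hits.append((end, col[m]))
        prev = col
    return hits


def unit_cost(a, b):
    """Plain edit distance: 0 for a match, 1 for a mismatch."""
    return 0.0 if a == b else 1.0


def transition_cost(a, b):
    """Hypothetical generalised cost: transitions are cheaper than transversions."""
    if a == b:
        return 0.0
    return 0.5 if {a, b} in ({"A", "G"}, {"C", "T"}) else 1.0
```

The running time is proportional to the product of pattern and text length, which is exactly what index-based methods aim to avoid for long texts.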

A Library for Index-based Bidirectional Pattern Search with an Application to RNA Structural Motifs

In this master's thesis we present both known and new algorithms for the efficient construction and use of index data structures. These data structures have manifold applications in string processing. In particular, they can speed up pattern searches in indexed texts, which gives them an important role in the analysis of biomolecular sequences such as DNA (deoxyribonucleic acid), RNA (ribonucleic acid) and protein sequences.

Variations of the Sankoff-Algorithm with a Focus on Heuristics

The combination of alignment and secondary structure prediction for two RNA sequences can significantly improve the accuracy of the structural predictions. Algorithms that solve these problems simultaneously tend to be computationally expensive, like the original Sankoff algorithm (Sankoff, 1985). Methods that address this problem therefore impose constraints that reduce the computational complexity by restricting the folding and/or the alignment, and thus make the Sankoff algorithm more practical. This thesis reviews the different Sankoff-style methods and compares them to the Sankoff algorithm in terms of their parallels and differences. The focus is on the heuristics (i.e. the constraints imposed on the alignments and/or the structures) and on comparing them.

Abstractions for barrier estimations in RNA energy landscapes

RNAs take part in diverse processes in cells. Energy landscapes can be used to characterize the structural space of an RNA and thus can help us to better understand the processes in which RNAs are involved. The task of estimating energy barriers in RNA landscapes is important in many practical problems, such as kinetic RNA folding (Geis et al., 2008) and the search for bistable RNA molecules (Flamm et al., 2001). A few approaches have been developed to solve this problem. They need to be improved in two ways: lower time complexity and, at the same time, more accurate estimates. This master thesis investigates possible solutions to the above-mentioned problem. We apply shape abstraction to the barrier height estimation problem. In the master thesis a number of precise algorithms based on this abstraction have been developed and compared to already existing ones.

Kinetics of RNA-RNA hybridization

There are two conceptually different approaches to predict probable structures of RNA molecules: thermodynamic and kinetic modeling of RNA folding. While purely thermodynamic approaches (e.g. RNAfold) solely consider thermodynamic properties to determine favourable structures, kinetic approaches (e.g. Kinfold) consider the structural changes over a timeframe in addition to their thermodynamic properties. While thermodynamic RNA structure prediction has been extended to RNA-RNA interactions, there doesn't seem to exist a kinetic modeling approach yet. The goal of my diploma thesis will be the implementation and evaluation of two kinetic folding algorithms for RNA-RNA interactions based on stochastic simulations using a Gillespie algorithm as well as macro states.
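
As a minimal illustration of stochastic kinetic simulation with a Gillespie algorithm, the sketch below simulates a continuous-time Markov process on a small, explicitly given state graph. The state names, the rate dictionary layout and the Metropolis rate rule are assumptions for this example; a real RNA-RNA interaction model would derive states and rates from structures and their energies.

```python
import math
import random


def metropolis_rate(e_from, e_to, rt=0.616):
    """Metropolis rate between two states; rt is roughly RT in kcal/mol at 37 °C."""
    return min(1.0, math.exp(-(e_to - e_from) / rt))


def gillespie(rates, start, t_max, rng):
    """Simulate one trajectory of a continuous-time Markov process.

    rates: dict mapping state -> {neighbour: rate}.
    Returns a list of (time, state) entries, starting with (0.0, start).
    """
    t = 0.0
    state = start
    trajectory = [(t, state)]
    while True:
        out = rates.get(state, {})
        total = sum(out.values())
        if total <= 0.0:
            break  # absorbing state: nothing can happen any more
        # Waiting time is exponentially distributed with the total exit rate.
        t += rng.expovariate(total)
        if t > t_max:
            break
        # Choose the next state with probability proportional to its rate.
        threshold = rng.random() * total
        acc = 0.0
        for nxt, rate in out.items():
            acc += rate
            if threshold < acc:
                break
        state = nxt
        trajectory.append((t, state))
    return trajectory


# Hypothetical two-state example: an open complex that binds irreversibly.
example = gillespie({"open": {"bound": 1.0}, "bound": {}}, "open", 1e9, random.Random(42))
```

Averaging the state occupancies over many such trajectories gives an estimate of the time-dependent state probabilities.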

Significance of RNA-RNA interactions and RNA sequence-structure alignments

In this work, the significance of the scores produced by the programs LocARNA and IntaRNA, both developed at the Chair of Bioinformatics at the University of Freiburg, was investigated. LocARNA uses a sequence-structure alignment to score the structural and sequence similarity between two ncRNA sequences; IntaRNA scores possible binding sites between ncRNA and mRNA, taking into account not only the sequence but also the secondary structure of the sequences. We analysed how the scores reported by these two programs change depending on the length, the AU versus GC content, and the minimum free energy of the input sequences. To this end, large sets of random sequences were generated, the score distribution for sequences with identical properties was examined, and it was tested how the distribution changes when the length and the AU-to-GC ratio are varied. For LocARNA, a support vector machine was trained on these data; it can now predict the expected distribution for a pair of sequences. With this distribution as a null model, it is possible to determine the p-values, and thus the significance, of the scores reported by LocARNA. For IntaRNA, it was found that the ncRNA sequences influence the output in a way that cannot be explained by length, AU content and free energy alone. Further investigation is needed here before regularities can be determined with which the significance can be assessed.
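
The basic idea of judging a score against a null model built from random sequences can be sketched with an empirical p-value. This is a simplified stand-in and not the SVM-based model of the thesis: the null distribution is sampled directly by shuffling the input sequences, and `score_fn` is a hypothetical placeholder for a call to a scoring program in which higher scores are better.

```python
import random


def shuffled(seq, rng):
    """Random permutation of seq; preserves length and base composition."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


def empirical_p_value(observed, null_scores):
    """Fraction of null scores at least as high as the observed score.

    The +1 terms avoid reporting a p-value of exactly zero.
    """
    exceed = sum(1 for score in null_scores if score >= observed)
    return (exceed + 1) / (len(null_scores) + 1)


def score_significance(seq_a, seq_b, score_fn, n_random=1000, seed=0):
    """Score a sequence pair and estimate its p-value from shuffled pairs."""
    rng = random.Random(seed)
    observed = score_fn(seq_a, seq_b)
    null = [
        score_fn(shuffled(seq_a, rng), shuffled(seq_b, rng))
        for _ in range(n_random)
    ]
    return observed, empirical_p_value(observed, null)
```

For scores where lower is better, such as interaction energies, the comparison in `empirical_p_value` would have to be reversed.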

Efficient algorithms for pairwise sequence-structure alignment taking pseudoknots into account

Many computational problems involving RNA with pseudoknots are NP-hard. The best-known cases are structure prediction and alignment computation. A wide variety of algorithms have already been proposed for structure prediction. To work efficiently, however, they restrict the types of pseudoknots that may occur. This thesis contributes to a new algorithm introduced by the Chair of Bioinformatics. It makes it possible to efficiently compute sequence-structure alignments of RNA sequences that contain pseudoknots, using the principle of dynamic programming. The algorithm essentially consists of two parts. First, one of the two sequences is decomposed into a parse tree, and then the alignment is computed. The alignment benefits from restrictions to particular classes of pseudoknots in the same way as the corresponding structure prediction. As a consequence, the complexity of the alignment increases only by a linear factor relative to the respective prediction algorithm. This thesis deals with the first part, the computation of a decomposition of the first sequence. Various methods for doing this are investigated and analysed with regard to their effects on the subsequent alignment.

Multiple sequence-structure alignment of RNAs with fixed input structures and a consistency-based extension

Partition function alignment of RNAs

ncRNAs are observed to have important roles in transcription, translation and post-translational activities. Computational detection of ncRNAs requires sophisticated methods which take into account structural conservation in addition to sequence information. We present a new pairwise sequence-structure alignment algorithm, LocARNAp, with the aim of obtaining more accurate multiple alignments than its ancestor LocARNA by using the partition function of alignments and a consistency-based transformation. LocARNAp is a dynamic-programming-based sequence-structure alignment algorithm which computes posterior probabilities of alignment edges from the partition function of pairwise alignments. The posterior probabilities obtained are then consistency-transformed to include information about other sequences, thereby improving the multiple alignment computed using mLocARNA. We compare the multiple alignments generated by LocARNAp to those obtained from LocARNA, LARA, STRAL and FoldAlign using the BRAliBase 2.1 benchmark datasets and extensively study the effect of each parameter setting on the alignment quality. The analysis of the results suggests that our algorithm obtains overall better results than its ancestor LocARNA and the other algorithms. While there is considerable scope for further improvements, LocARNAp provides a strong foundation for further research in this direction.
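
One common form of a consistency transformation, known from ProbCons-style aligners, replaces each pairwise posterior matrix by an average over all paths through a third sequence: P'(x,y) = (1/N) Σ_z P(x,z) · P(z,y), with P(x,x) taken as the identity. Whether LocARNAp uses exactly this form is an assumption of this sketch, as are the function names and the dictionary-based data layout.

```python
def matmul(a, b):
    """Product of two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m):
    return [list(row) for row in zip(*m)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def get_matrix(probs, lengths, x, y):
    """Posterior matrix for the ordered pair (x, y), whichever order is stored."""
    if x == y:
        return identity(lengths[x])
    if (x, y) in probs:
        return probs[(x, y)]
    return transpose(probs[(y, x)])


def consistency_transform(probs, lengths):
    """Apply one round of the consistency transformation.

    probs: dict mapping (x, y) -> matrix of posterior match probabilities
    with lengths[x] rows and lengths[y] columns, one entry per sequence pair.
    """
    names = list(lengths)
    result = {}
    for x, y in probs:
        acc = [[0.0] * lengths[y] for _ in range(lengths[x])]
        for z in names:
            prod = matmul(get_matrix(probs, lengths, x, z),
                          get_matrix(probs, lengths, z, y))
            for i, row in enumerate(prod):
                for j, value in enumerate(row):
                    acc[i][j] += value
        result[(x, y)] = [[value / len(names) for value in row] for row in acc]
    return result
```

A match between two positions is thereby reinforced when both positions are also likely to align to the same position in a third sequence.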

A hybrid kinetics approach for RNA folding probabilities

Current research uses two main approaches for predicting RNA folding. The first is the direct simulation of the folding process, in which probabilities for different foldings/structures are determined stochastically over many iterations. This is the most frequently used method, but also by far the most expensive. Therefore, kinetics based on a simplification of the energy landscape are often used instead. Here, an energy landscape is a discrete description of the structure space of an RNA in which the folding process takes place. On the one hand, whole parts of the energy landscape are merged into so-called macrostates; on the other hand, the landscape is often represented in simplified form by a barrier tree, which discards adjacency information in favour of efficient computation. This thesis presents a hybrid approach that combines the commonly used macrostate and Arrhenius kinetics. Macrostate kinetics is used for the lower part of the energy landscape, while in the upper part, sampling of the transition heights makes Arrhenius kinetics possible. However, these two kinetics work with different time factors, so the respective rate matrices have to be scaled before the resulting rate matrix can be used. The thesis investigates ways of hybridising the two kinetics and points out fundamental limitations of the kinetics approach. In addition, a metric for comparing kinetics is presented in order to quantify visual differences in folding behaviour. Finally, this measure is used to assess the quality of the hybrid kinetics.

Sampling of folding funnels in discrete energy landscapes

The aim of this project is to implement a method for estimating the folding funnel of model molecules with discrete energy landscapes. The approach builds directly on previous work and extends it with new data structures and methods. The implementation is to build on the C++ programming library Energy Landscape Library (ELL) and extend it where necessary. In addition, a standalone program is to be developed that directly enables folding funnel studies. For several given, existing models, the sampled data are to be compared with existing exact studies in order to determine the quality and runtime of the new approach. For background details, see the PDF version of the project description.

Constraint Approach for Protein Structure Prediction in the Side Chain HP Model

Protein structure prediction has always been a very interesting problem, especially in the last 10 years. Many previous methods focused on prediction in the HP model; however, most of those methods have the drawback of giving only approximate solutions. Another drawback of most of those works is that they ignore the side chains of the amino acids. In this master thesis, we develop a concrete constraint model that takes the side chains of the amino acids into consideration and gives the exact energy-minimal solutions for a given sequence in the side chain HP model, without any approximation. We also present the results obtained with the prediction model and show some important statistics, especially about the degeneracy. In addition, we present some interesting results on generating protein-like sequences, although this is a hard task in the model. As we explain in this thesis, protein-like sequences occur with very low probability among random sequences in the side chain model.

Core Construction in the Cubic Lattice

Extensive theoretical groundwork already exists for the combinatorial problem of core construction, as does an implementation in Mozart/Oz. The project is therefore explicitly about an efficient reimplementation. Knowledge of C++ or Java is required, as is experience with constraint programming or a willingness to become thoroughly familiar with this technique.

Combining the results of different motif discovery programs for de novo prediction of TFBS - A critical approach

The project tackled the question: Can we trust the results of tools for de novo motif (TFBS) detection? If not, how can we improve the results?

Structure Space Analysis of Lattice Proteins Using Pull Moves

Pairwise Comparison of RNA Secondary Structures via Exact Pattern Matches

In this thesis we have developed two pairwise comparison methods based on exactly matching substructures, called exact pattern matches. In a first step, a set of overlapping and crossing substructures for two nested RNA secondary structures is found with the approach of pairwise common substructures from Siebert/Backofen. Our first method identifies the best global subset of non-crossing exact pattern matches for two given RNAs. In relation to the LAPCS problem, we call this problem the Longest Common Subsequence of Exact RNA Patterns. The developed dynamic programming algorithm needs O(n²m²) time and O(nm) space. Our second approach detects (local) clusters of exact pattern matches. A cluster is a non-crossing arrangement of exact pattern matches with a distance constraint between the substructures included in the cluster. The developed clustering strategy is fast and flexible enough for different analytical problems. We have tested both methods with two Hepatitis C virus RNAs and two 16S ribosomal RNAs. The results show that both methods can quickly identify significant similarities between two RNA secondary structures.
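The problem name refers to the classic longest common subsequence (LCS) problem. As a point of reference only, here is the textbook LCS dynamic program on plain strings, which runs in O(nm) time and space. It is not the thesis's algorithm, which operates on exact pattern matches between secondary structures and has a higher time bound; the function name is chosen for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    n, m = len(a), len(b)
    # table[i][j] holds the LCS length of the prefixes a[:i] and b[:j].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                # Matching symbols extend the best solution of both shorter prefixes.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Otherwise drop the last symbol of one of the two prefixes.
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]


print(lcs_length("GGCAU", "GCU"))  # 3, for the common subsequence G, C, U
```

The non-crossing requirement in the abstract plays the same role as the order-preserving requirement here: selected matches must appear in a compatible order in both inputs.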

Identifying Key Regulators in Genetic Networks

Genetic Regulatory Networks are a method of representing the complex assemblages of interconnected genes, proteins and other molecules. Components are represented as nodes, and activations and repressions between them are represented as labeled edges. Within these networks are Key Regulators; these are nodes capable of regulating a sub-network of the network. The task at hand is to present a model for genetic regulatory networks and to implement a means to identify the most suitable key regulator within the network. In addition, cycles representing positive or negative feedback loops increase the complexity of the task. The model is further refined by considering constraints such as choosing a key regulator that regulates a certain sub-network but has no effect on another sub-network.
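One simple way to read the task is as a reachability question on a directed graph. The sketch below is an assumption-laden illustration, not the model developed in the thesis: the network is a hypothetical dictionary from each node to a list of `(target, sign)` edges, the edge signs are stored but ignored, and a node counts as a candidate if it reaches every node of the wanted sub-network and none of the forbidden one.

```python
from collections import deque


def regulated_nodes(network, start):
    """All nodes reachable from start along regulation edges (cycles are safe)."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target, _sign in network.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def key_regulator_candidates(network, must_regulate, must_not_affect=frozenset()):
    """Nodes that reach every node in must_regulate and none in must_not_affect."""
    nodes = set(network) | {t for edges in network.values() for t, _ in edges}
    candidates = []
    for node in sorted(nodes):
        reach = regulated_nodes(network, node)
        if set(must_regulate) <= reach and not (reach & set(must_not_affect)):
            candidates.append(node)
    return candidates


# Toy network; "+" marks activation and "-" repression. B and D form a feedback loop.
toy_network = {
    "A": [("B", "+"), ("C", "-")],
    "B": [("D", "+")],
    "C": [("E", "+")],
    "D": [("B", "-")],
}
print(key_regulator_candidates(toy_network, {"B", "D"}))         # ['A', 'B', 'D']
print(key_regulator_candidates(toy_network, {"B", "D"}, {"E"}))  # ['B', 'D']
```

Choosing the "most suitable" regulator among the candidates, and taking the sign of each path into account, would need a scoring rule that the abstract does not specify.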

3D-Structure-Motifs Aware Sequence Structure Alignment of RNAs

Comparison of RNAs is mainly based on information about their sequence and secondary structure. The function of RNAs, on the other hand, is based on their 3D structure, which is hard to determine. However, there are widespread 3D motifs which can be identified more easily. Following Eric Westhof, such a motif can be defined as an ordered assembly of non-Watson-Crick base pairs within a helix. Current sequence structure alignment methods are not aware of such motifs, although these motifs can give strong guidance for such alignments. The project's aim is to integrate the knowledge about motifs into the recent tool LocARNA, which is a program for simultaneous folding and alignment of RNAs. The dynamic programming algorithm should be modified to detect the motifs and tested on biological data.

Exploration of biopolymer energy landscapes via random sampling

The structures of RNA molecules and proteins, which are both important biopolymers, are commonly assumed to be uniquely determined by their sequences. The structures of these biomolecules are in turn necessary to carry out the molecules' biological functions. Discretized structure models provide a coarse-grained description of the molecular structure, which is necessary to perform computational studies. In this research, RNA molecules were modeled as secondary structures, and proteins were modeled as self-avoiding walks on a lattice. The structure formation process of biopolymers is crucially determined by the properties and the topology of the underlying energy landscape, in which the folding proceeds. Typical characteristics of the energy landscape, like the number of local optima, the basin distribution, and the transition states between the optima, can be visualized by barrier trees. Barrier trees provide a reduced representation of energy landscapes, which can be used to study the dynamical behavior of biopolymer folding. The research described in this thesis aimed to present a generic, problem-independent approach for the generation of barrier trees.
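The two ingredients of a barrier tree, local minima and the barrier heights between them, can be illustrated on a toy landscape. The sketch below enumerates everything exhaustively, so it is not the random sampling approach of the thesis; the landscape format (a dictionary of energies plus a dictionary of neighbor lists) and the function names are assumptions. The barrier is computed as a minimax path, that is, the lowest possible maximum energy along any path between two states.

```python
import heapq


def local_minima(energy, neighbors):
    """States whose energy is strictly lower than that of every neighbor."""
    return [s for s in energy if all(energy[s] < energy[t] for t in neighbors[s])]


def barrier_height(energy, neighbors, start, goal):
    """Lowest possible maximum energy along any path from start to goal."""
    best = {start: energy[start]}
    heap = [(energy[start], start)]
    while heap:
        height, state = heapq.heappop(heap)
        if state == goal:
            return height
        if height > best[state]:
            continue  # outdated heap entry
        for nxt in neighbors[state]:
            new_height = max(height, energy[nxt])
            if new_height < best.get(nxt, float("inf")):
                best[nxt] = new_height
                heapq.heappush(heap, (new_height, nxt))
    return None  # goal is not connected to start


# Toy landscape: five states on a line, 0 - 1 - 2 - 3 - 4.
toy_energy = {0: 0.0, 1: 3.0, 2: 1.0, 3: 5.0, 4: 2.0}
toy_neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

print(local_minima(toy_energy, toy_neighbors))          # [0, 2, 4]
print(barrier_height(toy_energy, toy_neighbors, 0, 2))  # 3.0
print(barrier_height(toy_energy, toy_neighbors, 0, 4))  # 5.0
```

In a barrier tree for this toy landscape, minima 0 and 2 would merge at height 3.0, and minimum 4 would join them at height 5.0.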

Efficient solving of alignment-problems with side conditions using constraint techniques

Many problems and open questions of modern microbiology demand the comparison of RNA, DNA or protein sequences. The computational effort in performing these calculations is high, and microbiology urgently needs efficient tools. An important requirement is the capability of these tools to deal with certain constraints. Examples might be a specified number of matches of two compared sequences within certain regions or a consideration of the secondary structure of a protein. The subject of this thesis is the efficient alignment of sequences, where we focus especially on different constraints on the alignment and how to combine these constraints, even in multiple alignments.

Complete Enumeration of the Optimal Structures of Lattice Proteins by Dynamic Decomposition of the Associated Constraint Satisfaction Problem

The standard depth-first search (DFS) method for solving Constraint Satisfaction Problems (CSPs) does much redundant work if the CSP contains several unsolved independent partial problems. This is a frequent observation when constraint programming is applied to the structure prediction problem in the HP model. Within this thesis, a dynamically decomposing search strategy is developed and implemented. The resulting prototypical implementation is applied to the structure prediction problem and yields significant speedups compared to standard DFS. This speedup is necessary, for example, to allow high-throughput degeneracy calculation of lattice proteins in the HP model.
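Why independent partial problems cause redundant work can be seen in a small counting example. The sketch below is a static simplification, not the dynamic strategy of the thesis: it splits a CSP into components once, before any search, and uses brute-force enumeration instead of DFS with propagation. The constraint format (a list of `(scope, predicate)` pairs) and all function names are assumptions for illustration.

```python
from itertools import product


def count_solutions(variables, domains, constraints):
    """Brute-force solution count; constraints are (scope, predicate) pairs."""
    count = 0
    for values in product(*(domains[v] for v in variables)):
        assignment = dict(zip(variables, values))
        if all(pred(*(assignment[v] for v in scope)) for scope, pred in constraints):
            count += 1
    return count


def independent_components(variables, constraints):
    """Group variables that are linked, directly or indirectly, by constraints."""
    parent = {v: v for v in variables}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for scope, _pred in constraints:
        for v in scope[1:]:
            parent[find(v)] = find(scope[0])
    groups = {}
    for v in variables:
        groups.setdefault(find(v), []).append(v)
    return list(groups.values())


def count_solutions_decomposed(variables, domains, constraints):
    """Multiply the solution counts of the independent components."""
    total = 1
    for component in independent_components(variables, constraints):
        members = set(component)
        local = [(s, p) for s, p in constraints if set(s) <= members]
        total *= count_solutions(component, domains, local)
    return total


variables = ["a", "b", "c", "d"]
domains = {v: [0, 1, 2] for v in variables}
constraints = [(("a", "b"), lambda x, y: x != y), (("c", "d"), lambda x, y: x < y)]

print(count_solutions(variables, domains, constraints))             # 18
print(count_solutions_decomposed(variables, domains, constraints))  # 18
```

Both calls return the same count, but the first enumerates 3^4 = 81 full assignments while the decomposed version enumerates only 9 + 9 = 18 partial ones. The gap grows multiplicatively with the number and size of independent parts, which is the redundancy a decomposing search avoids.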

MuLoRa - An Approach for Multiple Local RNA Sequence-Structure Alignments

Feature Selection Methods for Locating Transcription Factor Binding Sites

Significance of RNA Sequence-Structure Motifs

Master in Bioinformatics for Health Sciences

Master thesis.

In the second year of the Master, students are required to complete a Master Thesis or Project. This internship offers the student the opportunity to become familiar with real-world bioinformatics, integrating all the skills and knowledge acquired during the programme.

Each academic year, the master coordinator opens an international call for research projects. As a result, a portfolio of more than 60 potential Master Completion Projects is provided to the students. Alternatively, students can themselves contact research groups other than those offered in the portfolio and propose possible projects.

The portfolio includes projects supervised by researchers from UPF and UB, as well as from different collaborating institutions. Over the years, the master has built a network of centers specialized in bioinformatics, including research centers, private companies, universities and hospitals.

UCLA Graduate Division

UCLA Graduate Programs

Program Requirements for Bioinformatics (Medical Informatics)

Applicable only to students admitted during the 2024-2025 academic year.

Bioinformatics

Interdepartmental Program College of Letters and Science

Graduate Degrees

The Medical Informatics Program offers the Master of Science (M.S.) and Doctor of Philosophy (Ph.D.) degrees in Medical Informatics.

Admissions Requirements

Master’s Degree

All academic affairs for graduate students in the program are directed by the program’s faculty graduate adviser, who is assisted by staff in the Graduate Student Affairs Office. Upon matriculation, students are assigned a three-faculty guidance committee by the faculty graduate adviser.

The chair of the guidance committee acts as the provisional adviser until a permanent adviser is selected. Provisional advisers are not committed to supervise examination or thesis work and students are not committed to the provisional adviser. Students select a permanent adviser before establishing a comprehensive examination or thesis committee.

Areas of Study

This area of study exposes students to foundational concepts in medical informatics, providing a background in clinical data, big data management, and analyses of new and emergent data utilized to guide biomedical research and healthcare. Study comprises an introduction to computational methods, clinical and biomedical knowledge representation, and exposure to core informatics topics.

Foreign Language Requirement

Course Requirements

Medical Informatics: 11 courses, 40 units

Students must be enrolled full time and complete 40 units (11 courses) of graduate (200 or 500 series) course work for the master’s degree. All courses must be taken for a letter grade, unless offered on S/U grading basis only.

Students must complete all of the following: (1) eight core courses (30 units): Bioengineering 220, 223A, 223B, one course from BE 224A or Bioinformatics M222 through M226, BE 224B, BE M226, BE M227, and BE M228; (2) eight units of Bioinformatics 596; and (3) two units of 200-level seminar or journal club courses approved by the program.

Teaching Experience

Not required.

Field Experience

Capstone Plan

The master’s capstone is an individual project in the format of a written report resulting from a research project. The report should describe the results of the student’s investigation of a problem in the area of medical informatics under the supervision of a faculty member in the program, who approves the subject and plan of the project and reads and approves the completed report. While the problem may be one of only limited scope, the report must exhibit a satisfactory style, organization, and depth of understanding of the subject. A student should normally start to plan the project at least one quarter before the award of the M.S. degree is expected. The advisory committee evaluates and grades the written report as not pass or M.S. pass and forwards the results to the faculty graduate adviser. Students who do not pass the evaluation are permitted one additional opportunity to pass; the revised report must be submitted to and graded by the advisory committee by the end of the 6th quarter.

The capstone plan is available for students in the Medical Informatics field. However, students in Computational & Systems Biology major are required to follow the Thesis Plan only.

Thesis Plan

Every master’s degree thesis plan requires the completion of an approved thesis that demonstrates the student’s ability to perform original, independent research.

Students must choose a permanent faculty adviser and submit a thesis proposal by the end of the third quarter of study. The proposal must be approved by the permanent adviser, who serves as the thesis adviser. The thesis is evaluated by a three-person committee that is nominated by the program and appointed by the Division of Graduate Education. Students must present the thesis in a public seminar.

Time-to-Degree

Normative time-to-degree for all fields is five quarters.

DEGREE NORMATIVE TIME TO ATC (Quarters) NORMATIVE TTD

MAXIMUM TTD

M.S.

Doctoral Degree

The Medical Informatics Advising Committee, chaired by the Faculty Graduate Advisor, advises students during the first year and is available to students throughout their course of study.

Upon entering their second year in the program, students will select a mentor who will serve as their dissertation chair, research advisor, and primary graduate advisor. Together the student and the mentor will convene a doctoral committee who will guide the student throughout their research, the University Oral Qualifying Exam, Doctoral Dissertation Defense, and will approve the final dissertation.

Individual Development Plan: Beginning with a mandatory training workshop in the first quarter of graduate study, students are required to generate an Individual Development Plan via myIDP Website: http://myidp.sciencecareers.org/ in order to map out their academic and professional development goals throughout graduate school. The myIDP must be updated annually, and the resulting printed summary discussed with and signed by (Year 1) the student’s advising committee member, or (Years 2-5) thesis adviser, and then turned in to the Graduate Student Affairs Office to be placed in the student’s academic file each year by June 1.

Annual Committee Meetings: Beginning one year after advancement to doctoral candidacy, and in each year thereafter until completion of the degree program, students are required to meet annually with their doctoral committee. At each meeting, students give a brief, 30-minute oral presentation of their dissertation research progress to their committee. The purpose of the meeting is to monitor the student’s progress, identify difficulties that may occur as the student progresses toward successful completion of the dissertation and, if necessary, approve changes in the  dissertation project. The presentation is not an examination.

Annual Progress Report: All students are required to submit a brief report (a one-page form is provided) of their time-to-degree progress and research activities indicating the principal research undertaken and any important results, research plans for the next year, conferences attended, seminars given, and publications appearing or manuscripts in preparation. The Annual Progress Report must be submitted to the Bioinformatics IDP Student Affairs Office for review by the Program Director.

Major Fields or Subdisciplines

These fields include computer science, translational bioinformatics, imaging informatics, public health informatics, and social medicine.

Students are required to enroll full-time in a minimum of 12 units each quarter. In addition to basic course requirements, all students are required to enroll in Bioinformatics 596 or 599 each quarter.

Students who have gaps in their previous training may take, with their thesis adviser’s approval, appropriate undergraduate courses. For example, students without a statistical background are recommended to take STATS 100B (Introduction to Mathematical Statistics) in their first year. Students without a Computer Science background are recommended to take COM SCI 180 (Introduction to Algorithms and Complexity), COM SCI 145 (Introduction to Data Mining), COM SCI 146 (Introduction to Machine Learning), or COM SCI 148 (Introduction to Data Science). However, these courses may not be applied toward the required course work for the doctoral degree.

Students must complete all of the following: (1) eight core courses (30 units): Bioengineering 220, 223A, 223B, one course from BE 224A or Bioinformatics M223 or M226, BE 224B, BE M226, BE M227, and BE M228; (2) MIMG C234; (3) eight units of Bioinformatics 596; (4) four units of 200-level seminar or journal club courses approved by the program; and (5) six electives, chosen from the following list: Bioinformatics M223, M226; Biomathematics 210, M230, M281, M282; Biostatistics 213, M232, M234, M235, 241, 276; Computer Science 240A, 240B, 241B, 245, 246, 247, 262A, M262C, 262Z, 263A, 265A, M268, M276A; Electrical and Computer Engineering 206, 210A, 210B, 211A, M217, 219; Information Studies 228, 246, 272, 277; Linguistics 218, 232; Neuroscience CM272; Physics in Biology and Medicine 210, 214, M248; Statistics 221, M231A, 231B, M232A, M232B, 238, M241, M243, M250, 256. Please note: other elective courses can be taken with the agreement of the Home Area Director and the student’s PI/faculty mentor. Courses must be taken for a letter grade, unless offered on S/U grading basis only.

Written and Oral Qualifying Examinations

Academic Senate regulations require all doctoral students to complete and pass university written and oral qualifying examinations prior to doctoral advancement to candidacy. Also, under Senate regulations, the University Oral Qualifying Examination is open only to the student and appointed members of the doctoral committee. In addition to university requirements, some graduate programs have other pre-candidacy examination requirements. What follows in this section is how students are required to fulfill all of these requirements for this doctoral program.

All committee nominations and reconstitutions adhere to the Minimum Standards for Doctoral Committee Constitution.

Doctoral students must complete the core courses described above before they are permitted to take the written and oral qualifying examinations. Students are required to pass a written qualifying examination that consists of a research proposal outside of their dissertation topic and the University Oral Qualifying Examination in which they defend their dissertation research proposal before their doctoral committee. Students are expected to complete the written examination in the summer following the first year and the oral qualifying examination by the end of fall quarter of the third year. The written qualifying examination must be passed before the University Oral Qualifying Examination can be taken.

During their first year, doctoral students perform laboratory rotations with program faculty whose research is of interest to them and select a dissertation adviser from the program faculty inside list by the end of their third quarter of enrollment. By the end of their second spring quarter, students must select a doctoral committee that is approved by the program chair and the Division of Graduate Education.

Written Qualifying Examination

The Written Qualifying Examination (WQE) must take place in the summer following the first year of doctoral study. In order to be eligible to take the WQE, students must have achieved at least two passing lab rotation evaluations, as well as at least a B average in all course work. Students are expected to formulate a testable research question and answer it, by carrying out a small, well-defined and focused project over a fixed one-month period. It must include the development of novel bioinformatic methodology. The topic and methodologies are to be selected by the student. The topic requires advance approval by the faculty committee, and may not be a project from a previous course, a rotation project, a project related to the student’s prior research experience, an anticipated dissertation research topic, or an active or anticipated research project in the laboratory of the student’s mentor. The WQE must be the student’s own ideas and work exclusively. Students are expected to complete a WQE paper of publication quality (except for originality), with a maximum length of 10 pages, single-spaced, excluding figures and references. This paper is submitted to the Student Affairs Office and graded by a faculty committee on a pass or no-pass basis. Students who do not pass the examination are permitted one additional opportunity to pass, which must be submitted to and graded by the faculty committee no later than the end of the summer of the first year.

Oral Qualifying Examination

The University Oral Qualifying Examination must be completed and passed by the end of the fall quarter of the third year. Students prepare a written description of the scientific background of their proposed dissertation research project, the specific aims of the project, preliminary findings, and proposed bioinformatic approaches for addressing the specific aims. This dissertation proposal must be written following an NIH research grant application format and be at least six pages, single spaced and excluding references, and is submitted to the students’ doctoral committee at least 10 days in advance of the examination. Exclusive of their doctoral committee members, students are free to consult with their dissertation adviser, or other individuals in  formulating the proposed research. The examination consists of an oral presentation of the proposal by the student to the committee. The student’s oral presentation and examination are expected to demonstrate: (1) a scholarly understanding of the background of the research proposal; (2) well-designed and testable aims; (3) a critical understanding of the bioinformatic, mathematical or statistical methodologies to be employed in the proposed research; and (4) an understanding of potential bioinformatic outcomes and their interpretation. This examination is graded Pass, Conditional Pass, or Fail. If the doctoral committee decides that the examination reflects performance below the expected mastery of graduate-level content, the committee may vote to give the student a Conditional Pass. A student who receives a Conditional Pass will be required to modify or re-write their research proposal, so as to bring it up to required standard. In the case of a Conditional Pass, the student will be permitted to seek the advice of their committee in modifying or re-writing the proposal. Any required re-write or modification will be submitted to, and reviewed by the doctoral committee. 
A second oral presentation is not necessary unless the doctoral committee requires one. The signed Report on the Oral Qualifying Examination & Request for Advancement to Candidacy will be retained in the Graduate Student Affairs Office until the student has satisfied the doctoral committee’s request for revision or re-write. Students are allowed only one chance to revise or re-write their proposal.

Advancement to Candidacy

Students are advanced to candidacy upon successful completion of the written and oral qualifying examinations.

Doctoral Dissertation

Every doctoral degree program requires the completion of an approved dissertation that demonstrates the student’s ability to perform original, independent research and constitutes a distinct contribution to knowledge in the principal field of study.

Final Oral Examination (Defense of the Dissertation)

Required for all students in the program.

Students are expected to complete the written qualifying examination in the summer following the first year of study and the University Oral Qualifying Examination by the end of fall quarter of the third year. Normative time-to-degree is five years (15 quarters).

DEGREE NORMATIVE TIME TO ATC (Quarters) NORMATIVE TTD

MAXIMUM TTD

Ph.D.

Academic Disqualification and Appeal of Disqualification

University Policy

A student who fails to meet the above requirements may be recommended for academic disqualification from graduate study. A graduate student may be disqualified from continuing in the graduate program for a variety of reasons. The most common is failure to maintain the minimum cumulative grade point average (3.00) required by the Academic Senate to remain in good standing (some programs require a higher grade point average). Other examples include failure of examinations, lack of timely progress toward the degree and poor performance in core courses. Probationary students (those with cumulative grade point averages below 3.00) are subject to immediate dismissal upon the recommendation of their department. University guidelines governing academic disqualification of graduate students, including the appeal procedure, are outlined in Standards and Procedures for Graduate Study at UCLA .

Special Departmental or Program Policy

Students must receive at least a grade of B- in core courses or repeat the course. Students who receive three grades of B- or lower in core courses, who fail all or part of the written or oral qualifying examinations twice, or who fail to maintain minimum progress may be recommended for academic disqualification by vote of the entire interdepartmental program committee. Failure to identify and maintain a thesis adviser is a basis for recommendation for academic disqualification. Students may appeal a recommendation for academic disqualification in writing to the interdepartmental program committee, and may present additional or mitigating information to the committee in person or in writing.

A subreddit to discuss the intersection of computers and biology. A subreddit dedicated to bioinformatics, computational genomics and systems biology.

Am I overthinking my Master Thesis?

To give you a bit of context: I am currently studying in a bioinformatics Master and expect to do my Thesis next semester. Over the last year I have been working for a research group in the immunology field, mainly applying machine learning models.

Unfortunately I felt like I had hit a ceiling on what they could teach me, since the majority of the group, including the group leader, have no informatics background whatsoever.

Since it's my Master Thesis I didn't want to do some task I had done over the whole last year without learning anything new. I talked to the group leader and she could not really offer me anything concrete. In general I always had the feeling she just saw me as a guy who can quickly code something together before getting to the more interesting immunology stuff.

So a week ago I told her I won't extend my contract and will do my Thesis somewhere else. I am currently talking to a former colleague who offered me a Thesis position at the company he is working at. So I am thinking this might be a great opportunity to see how work in the industry compares to academia. But this position seems more centered around software dev and less data science/bioinformatics and I am unsure as of yet which area interests me more.

So long story short: Am I overthinking the importance of the actual Thesis topic? In the end the degree is what counts, right? Did anyone have similar situations and how did your decision turn out?

COMMENTS

  1. PDF Bioinformatics Group

    This project will assess whether AMGs generally evolve into distinct shorter versions of the bacterial gene and whether the transfer of metabolic genes from phages to bacteria is a prevalent phenomenon. To this end, publicly available genomes of phages and bacteria will be scanned for metabolic genes (Shaffer et al. 2020).

  2. PDF Master's Thesis

    My passion to work with bioinformatics and molecular biology had been fulfilled by getting involved in this master thesis project. It was a great opportunity for me to practice bioinformatics techniques and the wet-lab work which enabled me to gain an immense knowledge that is useful for my future research. ...

  3. PDF Bioinformatic analysis of next-generation sequencing data

    Master`s Thesis Bioinformatics Masters Degree Programme, Institute of Biomedical Technology University of Tampere, Finland Tommi Rantapero May, 2012 . ii ... 3.1 Sample selection for sequencing 33 3.2 Targeted re-sequencing in FIMM 33 3.3 The bioinformatics workflow for variant data analysis 34 ...

  4. BSc and MSc Thesis Subjects of the Bioinformatics Group

    MSc thesis: In the Bioinformatics group, we offer a wide range of MSc thesis projects, from applied bioinformatics to computational method development. Here is a list of available MSc thesis projects. Besides the fact that these topics can be pursued for a MSc thesis, they can also be pursued as part of a Research Practice. BSc thesis: As a BSc student you will work as an apprentice alongside ...

  5. Masters Thesis : r/bioinformatics

    Masters Thesis. Hello all, I am entering my 2nd (last) year of my masters in Bioinformatics. Recently, I have began to start giving some serious thought into conducting and writing a thesis project. I have ideas, but I am having a hard time finding quality bioinformatics MS thesis to look at as examples of what a quality thesis should look like.

  6. Master's Thesis

    Thesis Advisors must: Hold a faculty appointment at a Harvard University school at the rank of Assistant Professor or above. Have a research program that uses computational methods in biomedical applications. Students may be co-advised by up to two advisors, with approval from the Program. The Thesis Advisor is expected to meet with students ...

  7. PDF Bioinformatics, Master of Science (M.S.)

    The Master of Science in Bioinformatics non-thesis option is a Professional Science Master's degree program. The mission of this professionally oriented program is to train graduates for leadership roles in bioinformatics, biotechnology, biomedicine and other sectors of the life sciences. The program imparts interdisciplinary knowledge ...

  8. Master Thesis subjects proposed by 3BIO-BioInfo Computational Biology

    The master thesis topics related to this project can be entirely bioinformatics or include an experimental part. 6. Food and house dust mite allergens [Dimitri Gilis] Allergy represents an important public health problem. On the one hand, we are developing bioinformatics tools to predict whether a protein corresponds to a food allergen.

  9. Open thesis topics

    Open thesis topics. Within our group we can offer various topics in the field of applied bioinformatics, high-throughput data analysis, genome and metagenome research as well as postgenomics and systems biology. Below you can find a list of suggested open topics for BSc and MSc theses and student projects.

  10. Master's Thesis in Bioinformatics

    In the Master's program in bioinformatics, you must do a 30 ECTS Master's thesis. You must start your 30 ECTS thesis no later than February 1 (or September 1) a year and a half after commencement of your studies (i.e. February 2021 for students admitted in summer 2019, or September 2021 for students admitted in winter 2020).

  11. MS in Bioinformatics

    The thesis track is designed for MS in Bioinformatics students who are interested in conducting research. This track is strongly advised if you may be interested in pursuing a PhD in the future. Researching and writing a master's thesis is an academically intensive process that takes the place of 8 credits of traditional coursework.

  12. Thesis

    Thesis. Every master's degree thesis plan requires the completion of an approved thesis that demonstrates the student's ability to perform original, independent research. Students must choose a permanent faculty adviser and submit a thesis proposal by the end of the third quarter of study. The proposal must be approved by the permanent ...

  13. Theses

    Theses. Thesis Preparation and Filing: Staff from the University Archives and the UCLA Graduate Division present information on University regulations governing manuscript preparation and completion of degree requirements. Students should plan to attend at least one quarter before they plan to file a thesis or dissertation.

  14. GitHub

    README file for a master project in Bioinformatics. BINP52 - Master project - 60 ECTS; Master programme in Bioinformatics; Department of Biology, Faculty of Science, Lund University, Sweden; This work aimed to understand the aging signature on the neurogenesis process in the brain.

  15. Oxford LibGuides: Bioinformatics: Theses & Dissertations

    A number of recent theses and dissertations prepared at Oxford are available to download from the Oxford Research Archive (ORA). The British Library provides access to UK theses through its EThOS service [currently unavailable]. Already digitised UK theses can be downloaded freely as PDF files.

  16. PDF Thesis in Bioinformatics

    CATALOG/COURSE DESCRIPTION. Research in bioinformatics, or interdisciplinary investigation of biomedical problems with significant bioinformatic components. This research is at the master's level, leading to completion of a scientific project for presentation as a thesis. May be repeated for credit.

  17. Master's Thesis • Studying Bioinformatics • Department of Mathematics

    The master's thesis is meant to prove the student's ability to work independently on an advanced problem from the bioinformatical field using scientific methods, as well as the student's ability to evaluate the findings appropriately and to depict them both orally and in written form in an adequate manner. (SPO 2019, § 9) Please read § 9 ...

  18. Bioinformatics Group Freiburg

    Bioinformatics is a highly specialized application area of computer science and biology and to successfully solve research questions in this field, you require a lot of interdisciplinary knowledge. Therefore, to do a Master thesis with us, we have the minimum requirement that you have attended one of our teaching courses. We may also ask you to ...

  19. Master in Bioinformatics for Health Sciences

    Master Thesis. In the second year of the Master, students are required to complete a Master Thesis or Project. This internship offers the student the opportunity to become familiar with real-world bioinformatics, integrating all the skills and knowledge acquired along the programme. Each academic year, the master coordinator opens an ...

  20. Program Requirements for Bioinformatics (Medical Informatics)

    Bioinformatics. Interdepartmental Program College of Letters and Science. ... Every master's degree thesis plan requires the completion of an approved thesis that demonstrates the student's ability to perform original, independent research. ... For example, students without statistical background are recommended to take STATS 100B ...

  21. Help on how to decide masters thesis topic : r/bioinformatics

    Help on how to decide masters thesis topic. I am doing my masters now and I have to submit a thesis for my program. I am not sure what topic I should do it on. I have the following modules in my bioinfo masters: statistics, omics, machine learning and systems biology. 1. Can I have some recommendations on topics which I can do my thesis ...

  22. Am I overthinking my Master Thesis? : r/bioinformatics

    If you go the academic route, your thesis is going to set the groundwork for your PhD, so it's important. If not, I can say from experience in bioinformatics with a masters degree: if you have an idea of what to do with a fasta or bam file, you are qualified. Yes, you are overthinking it if you are going the industry route.