Fundamentals of Reinforcement Learning

University of Alberta and Alberta Machine Intelligence Institute via Coursera


  • Welcome to the Course!
  • Welcome to: Fundamentals of Reinforcement Learning, the first course in a four-part specialization on Reinforcement Learning brought to you by the University of Alberta, Onlea, and Coursera. In this pre-course module, you'll be introduced to your instructors, get a flavour of what the course has in store for you, and be given an in-depth roadmap to help make your journey through this specialization as smooth as possible.
  • An Introduction to Sequential Decision-Making
  • For the first week of this course, you will learn how to understand the exploration-exploitation trade-off in sequential decision-making, implement incremental algorithms for estimating action-values, and compare the strengths and weaknesses of different algorithms for exploration. For this week’s graded assessment, you will implement and test an epsilon-greedy agent.
  • Markov Decision Processes
  • When you’re presented with a problem in industry, the first and most important step is to translate that problem into a Markov Decision Process (MDP). The quality of your solution depends heavily on how well you do this translation. This week, you will learn the definition of MDPs, you will understand goal-directed behavior and how this can be obtained from maximizing scalar rewards, and you will also understand the difference between episodic and continuing tasks. For this week’s graded assessment, you will create three example tasks of your own that fit into the MDP framework.
  • Value Functions & Bellman Equations
  • Once the problem is formulated as an MDP, finding the optimal policy is more efficient when using value functions. This week, you will learn the definition of policies and value functions, as well as the Bellman equations, the key technology that all of our algorithms will use.
  • Dynamic Programming
  • This week, you will learn how to compute value functions and optimal policies, assuming you have the MDP model. You will implement dynamic programming to compute value functions and optimal policies and understand the utility of dynamic programming for industrial applications and problems. Further, you will learn about Generalized Policy Iteration as a common template for constructing algorithms that maximize reward. For this week’s graded assessment, you will implement an efficient dynamic programming agent in a simulated industrial control problem.

Martha White and Adam White


4.9 rating, based on 24 Class Central reviews

4.8 rating at Coursera based on 2734 ratings


  • AA Anonymous 4 years ago This course has a very good outline and an appropriate level for RL beginners. The presentation and description in the lectures is simple but very accurate. I could totally follow it without reading the textbook. The workload is small; I finished the entire course in a week. The coding assignment is well organized and insightful as well. However, the quizzes are sometimes confusing without enough detail from the lectures. But I think it will be fine if you have more time than me and can read the materials from the textbook they point you to. The math is harder for beginners than in most other ML introductory courses, which is unavoidable because that’s the most important part of reinforcement learning. Better to start with some background in probability and stochastic processes.
  • AA Anonymous 4 years ago The course is overall very good, and it actually introduces you to Reinforcement Learning from scratch. Lectures are very clear, quizzes are challenging, and the course relies on a textbook, provided when you enroll. The only weak point, though not a serious issue, is that most of the lectures do not add content to what is in the book. Since studying the book is in fact mandatory, they could have used the lectures to better explain some concepts, assuming people read the book. Sometimes they do, but not so often.
  • Luiz Cunha 4 years ago Fantastic course. That's the RL MOOC I have been waiting for for so long. No surprise it is from students of RL guru R. Sutton at the University of Alberta. Very clearly and simply explained. Exercise and test difficulty spot on. Wouldn't change an iota of this course. Can't wait to do the rest of this RL specialization.
  • Stewart Adamson 4 years ago This is a great course on Reinforcement Learning (RL) and I thoroughly recommend it. This is the first course in the four-course Reinforcement Learning specialization from the Alberta Machine Intelligence Institute (AMII) at the University of Alberta. The course introduces the key concepts and goals of RL and follows the standard text on the subject (Sutton & Barto 2018) very closely. AMII is the "home" of Rich Sutton and Andy Barto, the authors of Reinforcement Learning: An Introduction, which is used throughout the specialization. It is available as a free PDF as part of the course material, and each week of the course starts with a reading exercise from the book covering the algorithms addressed in that week's videos, quizzes and assignments. Sutton & Barto 2018 is also used by Stanford and DeepMind in their RL courses. In the final week of the course, you get to implement a Dynamic Programming algorithm in a Jupyter notebook in Python as a programming assignment. You can check out the syllabus on Coursera.org for details of this course and the other courses in the specialization.
  • AA Anonymous 4 years ago First of all, this course is based on an excellent book, "Reinforcement Learning: An Introduction" (2nd edition) by Sutton and Barto. The text is clearly written, with graphs and illustrations. I especially like the bibliographical and historical remarks for context, since I'm a contextual learner. OK, so I'm not a huge fan of multiple-selection questions (the ones with the square boxes), but it is what it is. The Jupyter notebooks for weeks 1 and 4 give you a feel for research-quality results with only simple coding required. For an introductory course it's a good balance, but I suspect the programming workload will increase progressively in later courses. My recommendation is to take this course even if machine learning isn't your speciality. You're sure to find plenty of applications for the tools you'll learn here.
  • PJ Prateek Chandra Jha 3 years ago This course comes straight from the capital of reinforcement learning, the Alberta Machine Intelligence Institute, which is co-headed by Rich Sutton himself. You can be sure that this course is free of all fat and nutritious. The course gives you the fundamental understanding of Markov Decision Processes and approximate Dynamic Programming required to build expert RL systems in courses 2 and 3 of the specialization. I recommend everyone who has ever thought of breaking into RL to come and take this course. It's hands-on, with explanations of all key conceptual paradigms in Python, and gives you that much-needed head start you've been looking for ever since you heard about AlphaZero, AlphaGo and other recent advancements in AI. Regards & Thanks, Prateek.
  • AA Anonymous 4 years ago I enjoyed taking this course and feel that it has expanded my tool-kit. The course was well constructed, with reading assignments from the book followed by relatively short videos and assignments. I spent around 4-5 hours a week, and most of the time was on the reading assignment. I enjoyed the programming assignments, though it would have been nice if they had provided less scaffolding and asked us to write more code. The quizzes were OK; some of the questions were rather ambiguous and could have been improved with better wording. Overall I had a great learning experience and look forward to completing the series.
  • Matteo Hee 4 years ago This course formed an exceptional introduction to RL as a field while only requiring a rudimentary background in Python. I used this course to grow my understanding of Python (which is a new programming language for me) as well as introduce some of the problems of Reinforcement Learning. The "guest lecturers" in each module provided an interesting breadth of perspective on the material, while the primary instructors were clear, articulate, and engaging.
  • AA Anonymous 4 years ago Fundamentals of Reinforcement Learning is one of the best online courses I have done on Coursera. I like that the course is based on a textbook (Reinforcement Learning by Sutton), so you can really dig into the theory. The exercises are also very helpful and ambitious, which I like. I haven't found many advanced online courses that are as well explained as this one.
  • AP Andrei Petrovskii 3 years ago The course covers the basics of reinforcement learning and presents the main ideas of the topic. The course instructors are very clear. Understanding the course requires neither advanced skills in mathematics nor in programming. The presented concepts are useful if you are a beginner and know nothing about reinforcement learning.
  • AA Anonymous 4 years ago This is a great course which requires a genuine commitment. The teachers have made a great effort to make you understand the Bellman equation in detail! The quizzes and the coding exercises have an appropriate level of difficulty. You really have to take out your pencil and paper before answering.
  • AA Anonymous 4 years ago Really good course. It's really interesting and explains the basic concepts of reinforcement learning really well. The mix of reading the textbook and watching the videos helps you understand very well, and the programming assignments let you put the things you've learned to the test.
  • AA Anonymous 4 years ago I really like the course. It's simple and easy to understand. I managed to finish the entire course in a day. The programming assignments are well thought out and are a good mix of easy and challenging, while also giving a good understanding of the underlying concepts.
  • AA Anonymous 4 years ago It is a really good course. It is basically an introductory course to RL, but it has a good reference (that you have to read) and video lectures which explain the reference book with some examples. The resources, such as the notebooks, are well done and challenging enough.
  • AA Anonymous 4 years ago This course provided great value for me; the content and explanations are of good quality. Quizzes and programming exercises are challenging enough to help you grasp the necessary concepts and get hands-on experience. I look forward to the next course in the specialization.
  • AA Anonymous 4 years ago It gets the point across and the examples used are decent, but it's not a very engaging course. The assignments are not very difficult and walk you through the problem well. I found myself skipping through the lectures a bit, but learnt the basic ideas of RL fine.
  • AA Anonymous 4 years ago This course pushes you to become strong with the fundamentals of RL, and implementing them in code adds icing on the cake by giving confidence. I would recommend taking this course and moving ahead in this field.
  • AA Anonymous 4 years ago The best place to start reinforcement learning if you are new to it. Challenging assignments based on the concepts. Quizzes to test your fundamentals. Everything is organised. More importantly, a good discussion forum.
  • KF Kim Falk 4 years ago I found the course very interesting. The videos are very good and informative. I like the fact that the videos describe theory (with real-world examples) and don't try to teach Python while doing it.
  • AA Anonymous 4 years ago This course is from the number 1 origin of reinforcement learning. I think this course is a basic course, and passing it alone without the other courses in the Specialization is not enough.


Stanford Online

Reinforcement Learning

Stanford School of Engineering

To realize the full potential of AI, autonomous systems must learn to make good decisions. Reinforcement Learning (RL) is a powerful paradigm for training systems in decision making. RL algorithms are applicable to a wide range of tasks, including robotics, game playing, consumer modeling, and healthcare.

In this course, you will gain a solid introduction to the field of reinforcement learning. Through a combination of lectures and coding assignments, you will learn about the core approaches and challenges in the field, including generalization and exploration. You will also have a chance to explore the concept of deep reinforcement learning—an extremely promising new area that combines reinforcement learning with deep learning techniques.

  • Find the best strategies in an unknown environment using Markov decision processes, Monte Carlo policy evaluation, and other tabular solution methods.
  • Design and implement reinforcement learning algorithms on a larger scale with linear value function approximation and deep reinforcement learning techniques.
  • Model and optimize your strategies with policy-based reinforcement learning such as score functions, policy gradient, and REINFORCE.
  • Evaluate and enhance your reinforcement learning algorithms with bandits and MDPs.
  • Maximize learnings from a static dataset using offline and batch reinforcement learning methods.

Core Competencies

  • Batch/Offline Reinforcement Learning
  • Dynamic Programming
  • Monte Carlo Methods and Temporal Difference Learning
  • Monte Carlo Tree Search
  • Policy Gradient Methods
  • RL With Value Function Approximation

What You Need to Get Started

Prior to enrolling in your first course in the AI Professional Program, you must complete a short application (15 min) to demonstrate:

  • Proficiency in Python : Coding assignments will be in Python. Some assignments will require familiarity with basic Linux command line workflows.
  • College Calculus and Linear Algebra : You should be comfortable taking (multivariable) derivatives and understand matrix/vector notation and operations.
  • Probability Theory : You should be familiar with basic probability distributions (Continuous, Gaussian, Bernoulli, etc.) and be able to define concepts for both continuous and discrete random variables: Expectation, independence, probability distribution functions, and cumulative distribution functions.

Groups and Teams

Special Pricing

Have a group of five or more? Enroll as a group and learn together! By participating together, your group will develop a shared knowledge, language, and mindset to tackle challenges ahead. We can advise you on the best options to meet your organization’s training and development goals.

Teaching Team

Emma Brunskill

Associate Professor

Computer Science

Emma Brunskill is an Associate Professor at Stanford University. Her goal is to increase human potential through advancing interactive machine learning. Revolutions in storage and computation have made it easy to capture and react to sequences of decisions made and their outcomes. Simultaneously, due to the rise of chronic health conditions and the demand for educated workers, there is an urgent need for more scalable solutions to assist people in reaching their full potential. Interactive machine learning systems could be a key part of the solution. To enable this, her lab's work spans from advancing the theoretical understanding of reinforcement learning to developing new self-optimizing tutoring systems that they test with learners and in the classroom. Their applications focus on education, since education can radically transform the opportunities available to an individual.


CS234: Reinforcement Learning Spring 2024


Announcements

  • The poster session will be from 11:30am-2:30pm in the Huang Foyer (area outside of NVIDIA auditorium).

Course Description & Logistics

  • Lectures will be live every Monday and Wednesday. Videos of the lecture content will also be made available to enrolled students through Canvas.
  • Lecture Materials (videos and slides): All standard lecture materials will be delivered through modules with pre-recorded course videos that you can watch at your own pace. Each week's modules are listed in the schedule (accessible here) and will be posted by the end of the Sunday before that week's class. Guest lectures will be presented live and recorded for later viewing; recordings will be available to enrolled students through Canvas.
  • 1:1 office hours: Students can sign up for 1:1 office hours with faculty and CAs. These are all appointment-based so that students do not need to wait in a queue. See the calendar for times and sign-up links; office hour schedules will be posted by the end of Tuesday of week 1. These may be offered in person but will definitely be offered via Zoom.
  • Problem session practice: We will also make available optional problem session questions and videos to provide additional opportunities to learn the material.
  • Quizzes: Instead of a large high-stakes midterm, there will be four quizzes over the course. We will drop the lowest score of Quizzes 1-3.
  • Project: There will be no final project.
  • Platforms: All assignments and quizzes will be handled through Gradescope, where you will also find your grades. We will send out links and access codes to enrolled students through Canvas. You can find Winter 2023 materials here.
    Schedule overview (Spring 2024)

    • Lecture topics, in order: Tabular MDP Planning; Policy Evaluation; Q-Learning and Function Approximation; Policy Search 1-3; Offline RL 1-3; Exploration 1-3; Multi-Agent Game Playing; a Guest Lecture; and Value Alignment.
    • Assignments and exams: Assignment 1 released Apr 8 and due in mid-April, when Assignment 2 is released; Quiz 1 in the second half of April; Assignment 2 due around the end of April; the Midterm, the Assignment 3 release, and the Project Proposal in early May; Assignment 3 and the Project Milestone due in mid-to-late May; no class on Memorial Day (May 27); an in-class quiz in early June; the Poster Session and the Final Project Report in the final weeks of the quarter (Jun 3-16).
    • Reinforcement Learning: An Introduction, Sutton and Barto, 2nd Edition. This is available for free online; references refer to the final PDF version.
    • Reinforcement Learning: State-of-the-Art, Marco Wiering and Martijn van Otterlo, Eds.
    • Artificial Intelligence: A Modern Approach, Stuart J. Russell and Peter Norvig.
    • Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville.
    • David Silver's course on Reinforcement Learning.

    Grade Breakdown

    • Assignment 1: 10%
    • Assignment 2: 18%
    • Assignment 3: 18%
    • Midterm: 25%
    • Course Project: 24%
    • Proposal: 1%
    • Milestone: 2%
    • Poster Presentation: 5%

    Late Day Policy

    • You can use 5 late days total.
    • A late day extends the deadline by 24 hours.
    • You are allowed up to 2 late days each for Assignments 1, 2, and 3, the project proposal, and the project milestone, not to exceed 5 late days total. You may not use any late days for the project poster presentation or the final project paper. For group submissions such as the project proposal and milestone, every group member must have the required number of late days available; if one or more members do not, the whole group incurs the late penalty described below (at most 50% credit within 24 hours of the deadline, no credit after that).
    • If you use two late days and hand an assignment in after 48 hours, it will be worth at most 50%. If you do not have enough late days left, handing the assignment in within 24 hours of the (late-day-adjusted) deadline is worth at most 50%, and no credit is given more than 24 hours after that deadline. For example, if you use 2 late days, the 50% window runs from 48 to 72 hours after the original due date, and nothing is accepted after 72 hours. Please contact us if you think you have an extremely rare circumstance for which we should make an exception. This policy is to ensure that feedback can be given in a timely manner.
    • There will be one midterm and one quiz. See the schedule for the dates.
    • Exams will be held in class for on-campus students.
    • Conflicts : If you are not able to attend the in class midterm and quizzes with an official reason, please email us at [email protected] , as soon as you can so that an accommodation can be scheduled. (Historically this is either to ask you to take the exam remotely at the same time, or to schedule an alternate exam time).
    • Notes for the exams: You are welcome to bring a one-sided (letter-sized) page of handwritten notes to the midterm. For the quiz you are welcome to bring a double-sided (letter-sized) page of handwritten notes. No calculators, laptops, cell phones, tablets, or other resources will be allowed.

    Assignments and Submission Process

    • Assignments : See Assignments page where all the assignments will be posted.
    • Computing Resources : We will have some cloud resources available for later assignments.
    • Submission Process : The submission instructions for the assignments can also be found on the Assignments page .

    Office Hours

    • Individual 1:1 15-minute office hours that you can sign up for here. See our calendar for detailed schedules. Video conference links will be provided during sign-up.
    • Group Office Hours are 5PM-8PM on Wed, Thu, and Fri and will be held on Nooks. We will have small tables where students can work on particular problems together.
    • Go to the Zoom Client for Linux page and download the correct Linux package for your Linux distribution type, OS architecture and version.
    • Follow the linux installation instructions here .
    • Download Zoom installer here .
    • Installation instructions can be found here .
    • Go to Stanford Zoom and select 'Launch Zoom'.
    • Click 'Host a Meeting'; nothing will launch but this will give a link to 'download & run Zoom'.
    • Click on 'download & run Zoom' to obtain and download 'Zoom_launcher.exe'.
    • Run 'Zoom_launcher.exe' to install.

    Communication

    Regrading requests.

    • If you think that the course staff made a quantifiable error in grading your assignment or exam, then you are welcome to submit a regrade request. Regrade requests should be made on Gradescope and will be accepted for three days after assignments or exams are returned.
    • Note that while doing a regrade we may review your entire assignment, not just the part you bring to our attention (i.e., we may find errors in your work that we missed before).

    Academic Collaboration, AI Tools Usage and Misconduct

    Academic accommodation, credit/no credit enrollment.

    Readings Responses

    • Things to do ASAP (before the first class if possible)
    • Week 1 (1/18): class overview, intro, and multi-armed bandits
    • Week 2 (1/25): MDPs and dynamic programming
    • Week 3 (2/1): Monte Carlo methods and temporal difference learning
    • Week 4 (2/8): n-step bootstrapping and planning
    • Week 5 (2/15): on-policy prediction with approximation
    • Week 6 (2/22): on-policy control with approximation and off-policy methods with approximation
    • Week 7 (3/1): eligibility traces
    • Week 8 (3/8): policy gradient methods
    • Week 9 (3/22): applications and case studies
    • Week 10 (3/29): abstraction: options and hierarchy
    • Week 9 (10/25): game playing
    • Week 11 (4/5): exploration and intrinsic motivation
    • Week 12 (4/12): learning from human input
    • Week 13 (4/19): multiagent RL and safe RL
    • Week 14 (4/26): modern landscape
    • Week 15 (5/3): reproducibility, evaluation, and wrap-up
    • Final project (TBA)

    Reinforcement Learning Specialization - Coursera - course 1 - Fundamentals of Reinforcement Learning

    From University of Alberta. My notes on course 1.

    May 3, 2021 • 18 min read

    reinforcement learning   deepmind   coursera

    5/3/21 - Course 1 - Week 1 - An introduction to Sequential Decision-Making

    Module 2 learning objectives
    • Lesson 1: Introduction to Markov Decision Processes
    • Lesson 2: Goal of Reinforcement Learning
    • Lesson 3: Continuing Tasks

    Module 3 learning objectives
    • Lesson 1: Policies and Value Functions
    • Lesson 2: Bellman Equations
    • Lesson 3: Optimality (Optimal Policies & Value Functions)

    Module 4 learning objectives
    • Lesson 1: Policy Evaluation (Prediction)
    • Lesson 2: Policy Iteration (Control)
    • Lesson 3: Generalized Policy Iteration

    Coursera website: course 1 - Fundamentals of Reinforcement Learning of Reinforcement Learning Specialization

    my notes on course 2 - Sample-based Learning Methods , course 3 - Prediction and Control with Function Approximation , course 4 - A Complete Reinforcement Learning System (Capstone)

    4 courses over 16 weeks, taught by Martha White and Adam White.

    Fundamentals of Reinforcement Learning

    Sample-based Learning Methods

    Prediction and Control with Function Approximation

    A Complete Reinforcement Learning System (Capstone)

    specialization roadmap

    course 1 - We begin our study with multi-armed bandit problems. Here, we get our first taste of the complexities of incremental learning, exploration, and exploitation. After that, we move on to Markov decision processes to broaden the class of problems we can solve with reinforcement learning methods. Here we will learn about balancing short-term and long-term reward. We will introduce key ideas like policies and value functions, which are used in almost all RL systems. We conclude Course 1 with classic planning methods called dynamic programming. These methods have been used in large industrial control problems and can compute optimal policies given a complete model of the world.

    course 2 - In Course 2, we build on these ideas and design algorithms for learning without a model of the world. We study three classes of methods designed for learning from trial-and-error interaction. We start with Monte Carlo methods and then move on to temporal difference learning, including Q-learning. We conclude Course 2 with an investigation of methods for planning with learned models.

    course 3 - In Course 3, we leave the relative comfort of small finite MDPs and investigate RL with function approximation. Here we will see that the main concepts from Courses 1 and 2 transfer to problems with large or infinite state spaces. We will cover feature construction, neural network learning, policy gradient methods, and other particularities of the function approximation setting.

    course 4 - The final course in this specialization brings everything together in a Capstone project. Throughout this specialization, as in Rich and Andy’s book, we stress a rigorous and scientific approach to RL. We conduct numerous experiments designed to carefully compare algorithms. It takes careful planning and a lot of hard work to produce meaningful empirical results. In the Capstone, we will walk you through each step of this process so that you can conduct your own scientific experiment. We will explore all the stages from problem specification all the way to publication-quality plots. This is not just academic: in real problems, it’s important to verify and understand your system. After that, you should be ready to test your own new ideas or tackle a new exciting application of RL in your job. We hope you enjoy the show half as much as we enjoyed making it for you.

    Alberta is in Canada.

    I have set recommended goals 3 times a week.

    about supervised learning, unsupervised learning and RL

    You might wonder what’s the difference between supervised learning, unsupervised learning, and reinforcement learning? The differences are quite simple. In supervised learning we assume the learner has access to labeled examples giving the correct answer. In RL, the reward gives the agent some idea of how good or bad its recent actions were. You can think of supervised learning as requiring a teacher that helps you by telling you the correct answer. A reward on the other hand, is like having someone who can identify what good behavior looks like but can’t tell you exactly how to do it. Unsupervised learning sounds like it could be related but really has a very different goal. Unsupervised learning is about extracting underlying structure in data. It’s about the data representation. It can be used to construct representations that make a supervised or RL system better. In fact, as you’ll see later in this course, techniques from both supervised learning and unsupervised learning can be used within RL to aid generalization

    industrial control

    So I think the place we’re really going to see it take off is industrial control. In industrial control, we have experts who are really looking for ways to improve how well their systems work. So we’re going to see it do things like reduce energy costs or save on other types of costs that we have in these industrial control systems. In the hands of experts, we can really make these algorithms work well in the near future. So I really see it as a tool that’s going to facilitate experts in their work rather than, say, doing something like replacing people or automating them away.

    Reinforcement Learning Textbook

    As always, Reinforcement Learning: An Introduction (Second Edition) by Richard S. Sutton and Andrew G. Barto is THE reference. I didn’t know that Adam White was a student of Sutton. Lucky guy ;)

    K-armed Bandit problem


    Starts with reading of RLbook p25-36 (Chapter 2 Multi-armed Bandits)

    Evaluative vs. instructive feedback. Nonassociative refers to the setting where the agent faces only one situation, so it learns a single best action. At the end of the chapter there is a generalization where the bandit problem becomes associative, that is, when actions are taken in more than one situation.

    This is the stationary case, meaning that the values of the actions are fixed over time. If the bandit task were nonstationary, that is, if the true values of the actions changed over time, then exploration would be needed even in the deterministic case, to make sure one of the nongreedy actions has not changed to become better than the greedy one.

    sample-average action-value estimates

    $\epsilon$-greedy action selection

    With a nonstationary problem, we want to give more weight to recent rewards. This can be done with $Q_{n+1}=Q_n+\alpha[R_n-Q_n]$, where $\alpha \in [0,1]$ is a constant step-size parameter. It can then be written as $Q_{n+1}=(1-\alpha)^n Q_1+\displaystyle\sum_{i=1}^{n} \alpha(1-\alpha)^{n-i}R_i$. This is a weighted average because the sum of the weights is 1.

    Two other topics are discussed: optimistic initial values (which can push exploration in the first steps) and upper-confidence-bound (UCB) action selection. With optimistic initial values, the idea is to set a high initial value for the reward estimates so that the first actions are disappointing, pushing the agent to explore. With UCB:

    The idea of this upper confidence bound (UCB) action selection is that the square-root term is a measure of the uncertainty or variance in the estimate of $a$'s value. The quantity being maximized over is thus a sort of upper bound on the possible true value of action $a$, with $c$ determining the confidence level. Each time $a$ is selected the uncertainty is presumably reduced: $N_t(a)$ increments and, as it appears in the denominator, the uncertainty term decreases. On the other hand, each time an action other than $a$ is selected, $t$ increases but $N_t(a)$ does not; because $t$ appears in the numerator, the uncertainty estimate increases. The use of the natural logarithm means that the increases get smaller over time, but are unbounded; all actions will eventually be selected, but actions with lower value estimates, or that have already been selected frequently, will be selected with decreasing frequency over time.
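A minimal sketch of UCB action selection in Python may help make this concrete (the array-based interface and the default value of c are assumptions made for this illustration, not the course's notebook code):

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """Upper-confidence-bound action selection.

    Q : array of current action-value estimates Q_t(a)
    N : array of counts N_t(a), how often each action has been taken
    t : current time step (1-based)
    c : confidence-level parameter
    """
    # Actions that have never been tried have unbounded uncertainty: try them first.
    untried = np.where(N == 0)[0]
    if len(untried) > 0:
        return int(untried[0])
    upper_bounds = Q + c * np.sqrt(np.log(t) / N)
    return int(np.argmax(upper_bounds))
```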

    Exploration vs Exploitation trade-off

    How do we choose when to explore, and when to exploit? Randomly


    Assignment

    Implementation of a greedy agent and an $\epsilon$-greedy agent; comparisons across various $\epsilon$ values and step-sizes (1/N(a), …).
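A minimal sketch of what such an agent might look like, assuming NumPy; this is a simplified stand-in for illustration, not the notebook's actual agent class:

```python
import numpy as np

class EpsilonGreedyAgent:
    """Bandit agent with epsilon-greedy action selection and incremental updates."""

    def __init__(self, num_actions, epsilon=0.1, step_size=None, seed=0):
        self.q = np.zeros(num_actions)       # action-value estimates Q(a)
        self.n = np.zeros(num_actions)       # counts N(a)
        self.epsilon = epsilon
        self.step_size = step_size           # None -> sample average 1/N(a)
        self.rng = np.random.default_rng(seed)

    def select_action(self):
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(len(self.q)))   # explore
        return int(np.argmax(self.q))                     # exploit

    def update(self, action, reward):
        self.n[action] += 1
        alpha = self.step_size if self.step_size is not None else 1.0 / self.n[action]
        # Incremental update: Q <- Q + alpha * (R - Q)
        self.q[action] += alpha * (reward - self.q[action])
```

With `step_size=None` this reproduces the sample-average estimate; passing a constant `step_size` gives the exponential recency-weighted average discussed above.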

    notebooks in github

    end of C1W1 (course 1 week 1)

    5/7/21 - Course 1 - Week 2 - Markov Decision Process

    • Understand Markov Decision Processes, or MDPs
    • Describe how the dynamics of an MDP are defined
    • Understand the graphical representation of a Markov Decision Process
    • Explain how many diverse processes can be written in terms of the MDP framework
    • Describe how rewards relate to the goal of an agent
    • Understand episodes and identify episodic tasks
    • Formulate returns for continuing tasks using discounting
    • Describe how returns at successive time steps are related to each other
    • Understand when to formalize a task as episodic or continuing

    Reading chapter 3.1 to 3.3 (p47-56) in Sutton’s book

    Finite Markov Decision Processes

    • 3.1 - the Agent-Environment Interface
    • 3.2 - Goals and Rewards
    • 3.3 - Returns and Episodes

    In a Markov decision process, the probabilities given by p completely characterize the environment’s dynamics. That is, the probability of each possible value for $S_t$ and $R_t$ depends only on the immediately preceding state and action, $S_{t-1}$ and $A_{t-1}$ , and, given them, not at all on earlier states and actions.
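To make the role of $p$ concrete, the dynamics of a small finite MDP can be stored as a table; the states, rewards, and probabilities below are invented purely for illustration:

```python
# p[(s, a)] maps each (next_state, reward) pair to its probability.
p = {
    ("low_battery", "search"):   {("low_battery", 1.0): 0.6,
                                  ("depleted", -3.0):   0.4},
    ("low_battery", "recharge"): {("high_battery", 0.0): 1.0},
}

# The Markov property: these probabilities depend only on the current (s, a),
# and for every (s, a) pair they must sum to 1.
for (s, a), outcomes in p.items():
    assert abs(sum(outcomes.values()) - 1.0) < 1e-9, (s, a)
```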

    The state must include information about all aspects of the past agent–environment interaction that make a difference for the future. In general, actions can be any decisions we want to learn how to make, and the states can be anything we can know that might be useful in making them.

    The agent–environment boundary represents the limit of the agent’s absolute control, not of its knowledge.

    Goal can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward ). The reward signal is your way of communicating to the agent what you want it to achieve, not how you want it achieved.

    The expected return $G_t$ is defined as some specific function of the reward sequence. In the simplest case the return is the sum of the rewards: $G_t \doteq R_{t+1}+R_{t+2}+R_{t+3}+...+R_{T}$, where $T$ is the final time step.

    With continuing tasks, we can have $T=\infty$, so we introduce discounting. The agent chooses $A_t$ to maximize the expected discounted return $G_t \doteq R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+... = \displaystyle\sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$, where $\gamma$ is called the discount rate. The return satisfies the recursion $G_t = R_{t+1}+\gamma G_{t+1}$.

    Video MDP by Martha. By the end of this video: Understand Markov Decision Processes (MDPs), describe how the dynamics of an MDP are defined.
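The recursion $G_t = R_{t+1}+\gamma G_{t+1}$ also suggests a simple way to compute returns from a sampled reward sequence, working backwards from the end of the episode; a small illustrative sketch:

```python
def discounted_returns(rewards, gamma=0.9):
    """Compute G_t for every time step of an episode, given rewards R_1..R_T."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g      # G_t = R_{t+1} + gamma * G_{t+1}
        returns[t] = g
    return returns

# Example: three rewards of 1 with gamma = 0.9 give G_0 = 1 + 0.9 + 0.81 = 2.71.
print(discounted_returns([1.0, 1.0, 1.0]))
```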

    Martha highlights differences between the k-armed bandit and the MDP. The k-armed bandit agent is presented with the same situation at each time step, and the same action is always optimal. In many problems, different situations call for different responses, and the actions we choose now affect the amount of reward we can get in the future. In particular, if the state changes, the k-armed bandit formulation doesn't adapt; that is why we need MDPs.

    Video examples of MDPs by Adam . By the end of this video: Gain experience formalizing decision-making problems as MDPs , Appreciate the flexibility of the MDP formalism.

    Adam uses 2 examples: robot recycling cans and robot arm.

    Video the Goal of Reinforcement Learning by Adam. By the end of this video: Describe how rewards relate to the goal of an agent, Identify episodic tasks .

    With MDP, agents can have long-term goals.

    Video the Reward Hypothesis by Michael Littman.

    He gives a nice idea when defining reward hypothesis: a contrast between the simplicity of the idea of rewards with the complexity of the real world.

    Video Continuing Tasks by Martha. By the end of this video: Differentiate between episodic and continuing tasks . Formulate returns for continuing tasks using discounting . Describe how returns at successive time steps are related to each other.

    Adam uses a link to Sutton’s book; this is the 2020 version of the book.

    Video Examples of Episodic and Continuing Tasks by Martha. By the end of this video: Understand when to formalize a task as episodic or continuing .

    Martha gives two examples: an episodic task where the episode ends when the player is touched by an enemy, and a continuing task where an agent accepts or rejects tasks depending on priority and the servers available (a never-ending episode).

    Weekly assessment.

    This is a quiz and a peer-graded assignment: I had to describe three MDPs in full detail (states, actions, rewards).

    5/10/21 - Course 1 - Week 3 - Value Functions & Bellman Equations

    • Recognize that a policy is a distribution over actions for each possible state
    • Describe the similarities and differences between stochastic and deterministic policies
    • Identify the characteristics of a well-defined policy
    • Generate examples of valid policies for a given MDP
    • Describe the roles of state-value and action-value functions in reinforcement learning
    • Describe the relationship between value functions and policies
    • Create examples of valid value functions for a given MDP
    • Derive the Bellman equation for state-value functions
    • Derive the Bellman equation for action-value functions
    • Understand how Bellman equations relate current and future values
    • Use the Bellman equations to compute value functions
    • Define an optimal policy
    • Understand how a policy can be at least as good as every other policy in every state
    • Identify an optimal policy for given MDPs
    • Derive the Bellman optimality equation for state-value functions
    • Derive the Bellman optimality equation for action-value functions
    • Understand how the Bellman optimality equations relate to the previously introduced Bellman equations
    • Understand the connection between the optimal value function and optimal policies
    • Verify the optimal value function for given MDPs

    Reading chapter 3.5 to 3.8 (p58-67) in Sutton’s book

    Almost all reinforcement learning algorithms involve estimating value functions —functions of states (or of state–action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state).

    Searching for additional information, I came across ShangtongZhang's page and repos. There are only two of them, but they seem great: reinforcement-learning-an-introduction contains Python implementations of all the concepts from Sutton's book, and DeepRL seems to be a collection of PyTorch implementations (DQN, A2C, PPO, …).

    Here we see Bellman equation for state-value function $v_\pi(s)$

    Bellman equation for action-value function $q_\pi(s,a)$

    Optimal state-value function $v_*$ :

    Optimal action-value function $q_*$: $q_*(s,a) \doteq \max\limits_{\pi} q_\pi(s,a) = \mathbb{E}[R_{t+1}+\gamma v_*(S_{t+1}) \mid S_t=s, A_t=a]$

    We denote all optimal policies by $\pi_*$

    Bellman optimality equation for $v_*$

    Bellman optimality equation for $q_*$
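The equations themselves are standard; in the notation of Sutton & Barto (Chapter 3) they read:

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma v_\pi(s')]$$

$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')]$$

$$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma v_*(s')]$$

$$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma \max_{a'} q_*(s', a')]$$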


    Video Specifying Policies by Adam.

    By the end of this video, you’ll be able to

    Recognize that a policy is a distribution over actions for each possible state, describe the similarities and differences between stochastic and deterministic policies, and generate examples of valid policies for a given MDP (Markov Decision Process).

    Video Value Functions by Adam.

    describe the roles of the state-value and action-value functions in reinforcement learning, describe the relationship between value-functions and policies , and create examples of value-functions for a given MDP.

    Video Rich Sutton and Andy Barto: A brief History of RL

    Video Bellman Equation Derivation by Martha

    By the end of this video, you’ll be able to derive the Bellman equation for state-value functions , derive the Bellman equation for action-value functions , and understand how Bellman equations relate current and future values .

    Video Why Bellman Equations? by Martha

    By the end of this video, you’ll be able to use the Bellman equations to compute value functions

    Video Optimal Policies by Martha

    By the end of this video, you will be able to define an optimal policy, understand how a policy can be at least as good as every other policy in every state, and identify an optimal policy for a given MDP.

    Video Optimal Value Functions by Martha

    By the end of this video, you will be able to derive the Bellman optimality equation for the state-value function , derive the Bellman optimality equation for the action-value function , and understand how the Bellman optimality equations relate to the previously introduced Bellman equations .

    Video Using Optimal Value Functions to Get Optimal Policies by Martha

    By the end of this video, you’ll be able to understand the connection between the optimal value function and optimal policies and verify the optimal value function for a given MDP

    Video week 3 summary by Adam

    Policies

    5/18/21 - Course 1 - Week 4 - Dynamic Programming

    • Understand the distinction between policy evaluation and control
    • Explain the setting in which dynamic programming can be applied, as well as its limitations
    • Outline the iterative policy evaluation algorithm for estimating state values under a given policy
    • Apply iterative policy evaluation to compute value functions
    • Understand the policy improvement theorem
    • Use a value function for a policy to produce a better policy for a given MDP
    • Outline the policy iteration algorithm for finding the optimal policy
    • Understand “the dance of policy and value”
    • Apply policy iteration to compute optimal policies and optimal value functions
    • Understand the framework of generalized policy iteration
    • Outline value iteration, an important example of generalized policy iteration
    • Understand the distinction between synchronous and asynchronous dynamic programming methods
    • Describe brute force search as an alternative method for searching for an optimal policy
    • Describe Monte Carlo as an alternative method for learning a value function
    • Understand the advantage of Dynamic programming and “bootstrapping” over these alternative strategies for finding the optimal policy

    Reading chapter 4.1, 4.2, 4.3, 4.4, 4.6, 4.7 (pages 73-88) in Sutton’s book (with the help of Solutions_to_Reinforcement_Learning_by_Sutton_Chapter_4_r5.pdf )

    A common way of obtaining approximate solutions for tasks with continuous states and actions is to quantize the state and action spaces and then apply finite-state DP methods.

    Video Policy Evaluation vs. Control by Martha

    By the end of this video you will be able to understand the distinction between policy evaluation and control , and explain the setting in which dynamic programming can be applied as well as its limitations.

    Video Iterative Policy Evaluation by Martha

    By the end of this video you will be able to outline the iterative policy evaluation algorithm for estimating state values for a given policy, and apply iterative policy evaluation to compute value functions.

    The magic here is to turn the Bellman equation into an iterative update that converges to $v_\pi$.
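A minimal sketch of iterative policy evaluation for a finite MDP, assuming the dynamics are stored as a dict `p[(s, a)]` mapping (next_state, reward) pairs to probabilities (as in the toy example in the Week 2 notes above) and the policy as `pi[s]` mapping actions to probabilities; the names are illustrative, not the notebook's API:

```python
def policy_evaluation(p, pi, states, gamma=0.9, theta=1e-8):
    """Repeatedly apply the Bellman expectation backup until the values stabilize."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(
                pi[s][a] * prob * (r + gamma * V.get(s_next, 0.0))
                for a in pi[s]
                for (s_next, r), prob in p[(s, a)].items()
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```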


    Video Policy Improvement by Martha

    By the end of this video, you will be able to understand the policy improvement theorem , and how it can be used to construct improved policies, and use the value function for a policy to produce a better policy for a given MDP.

    Greedified policy is a strict improvement.


    Video Policy Iteration by Martha

    By the end of this video, you will be able to outline the policy iteration algorithm for finding the optimal policy, understand the dance of policy and value , how policy iteration reaches the optimal policy by alternating between evaluating policy and improving it, and apply policy iteration to compute optimal policies and optimal value functions.
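A compact sketch of this evaluate/greedify dance, reusing the dict-based dynamics format from the sketches above and assuming every action is available in every state (illustrative only, not the graded notebook's code):

```python
def q_from_v(p, V, s, actions, gamma=0.9):
    """One-step lookahead: action values implied by a state-value function V."""
    return {a: sum(prob * (r + gamma * V.get(s_next, 0.0))
                   for (s_next, r), prob in p[(s, a)].items())
            for a in actions}

def policy_iteration(p, states, actions, gamma=0.9, theta=1e-8):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    V = {s: 0.0 for s in states}
    policy = {s: actions[0] for s in states}      # arbitrary deterministic start
    while True:
        # Policy evaluation: sweep until the values of the current policy stabilize.
        while True:
            delta = 0.0
            for s in states:
                v_new = q_from_v(p, V, s, [policy[s]], gamma)[policy[s]]
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Policy improvement: greedify with respect to the current value function.
        stable = True
        for s in states:
            q = q_from_v(p, V, s, actions, gamma)
            best_action = max(q, key=q.get)
            if best_action != policy[s]:
                stable = False
            policy[s] = best_action
        if stable:
            return policy, V
```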


    Video Flexibility of the Policy Iteration Framework by Adam

    By the end of this video, you’ll be able to understand the framework of generalized policy iteration, outline value iteration (an important special case of generalized policy iteration), and differentiate between synchronous and asynchronous dynamic programming methods.

    Video Efficiency of Dynamic Programming by Adam

    By the end of this video, you’ll be able to describe Monte Carlo sampling as an alternative method for learning a value function, describe brute-force search as an alternative method for finding an optimal policy, and understand the advantages of dynamic programming and bootstrapping over these alternatives.

    The most important takeaway is that bootstrapping can save us from performing a huge amount of unnecessary work by exploiting the connection between the value of a state and its possible successors.
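For contrast with the dynamic programming sweeps above, here is a minimal first-visit Monte Carlo prediction sketch, which estimates $v_\pi$ from sampled episodes instead of from a model; the episode format (a list of (state, reward) pairs, where each reward is the one received on leaving that state) is an assumption made for this illustration:

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=0.9):
    """Estimate v_pi by averaging first-visit returns over sampled episodes."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Time step of the first visit to each state in this episode.
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # Returns G_t for every time step, computed backwards.
        returns_at = [0.0] * len(episode)
        g = 0.0
        for t in reversed(range(len(episode))):
            g = episode[t][1] + gamma * g
            returns_at[t] = g
        for s, t in first_visit.items():
            returns_sum[s] += returns_at[t]
            returns_count[s] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```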

    Video Warren Powell: Approximate Dynamic Programming for Fleet Management (Short)

    Video Week 4 Summary by Adam

    Reading chapter summary Chapter 4.8, (pages 88-89)

    Optimal Policies with Dynamic Programming

    end of C1W4 (course 1 week 4)

    end of course 1 (and with a certificate ;) )


    Reinforcement Learning

    Pacman seeks reward. Should he eat or should he run? When in doubt, Q-learn.

    Introduction

    In this project, you will implement value iteration and Q-learning. You will test your agents first on Gridworld (from class), then apply them to a simulated robot controller (Crawler) and Pacman.

    As in previous programming assignments, this assignment includes an autograder for you to grade your answers on your machine. This can be run with the command:

    It can be run for one particular question, such as q2, by:

    It can be run for one particular test by commands of the form:

    The code for this project consists of several Python files, some of which you will need to read and understand in order to complete the assignment, and some of which you can ignore. You can download all the code and supporting files as a zip archive (patched) .

    • A value iteration agent for solving known MDPs.
    • Q-learning agents for Gridworld, Crawler and Pacman.
    • A file to put your answers to questions given in the project.
    • Defines methods on general MDPs.
    • Defines the base classes which your agents will extend.
    • Utilities, including util.Counter, which is particularly useful for Q-learners.
    • The Gridworld implementation.
    • Classes for extracting features on (state, action) pairs, used for the approximate Q-learning agent.
    • Abstract class for general reinforcement learning environments.
    • Gridworld graphical display.
    • Graphics utilities.
    • Plug-in for the Gridworld text interface.
    • The crawler code and test harness. You will run this but not edit it.
    • GUI for the crawler robot.
    • Project autograder.
    • Parses autograder test and solution files.
    • General autograding test classes.
    • Directory containing the test cases for each question.
    • Project-specific autograding test classes.

    Files to Edit and Submit: You will fill in portions of valueIterationAgents.py, qlearningAgents.py, and analysis.py during the assignment. Please do not change the other files in this distribution or submit any of our original files other than these files.

    Evaluation: Your code will be autograded for technical correctness. Please do not change the names of any provided functions or classes within the code, or you will wreak havoc on the autograder. However, the correctness of your implementation -- not the autograder's judgements -- will be the final judge of your score. If necessary, we will review and grade assignments individually to ensure that you receive due credit for your work.

    Academic Dishonesty: We will be checking your code against other submissions in the class for logical redundancy. If you copy someone else's code and submit it with minor changes, we will know. These cheat detectors are quite hard to fool, so please don't try. We trust you all to submit your own work only; please don't let us down. If you do, we will pursue the strongest consequences available to us.

    Getting Help: You are not alone! If you find yourself stuck on something, contact the course staff for help. Office hours, section, and the discussion forum are there for your support; please use them. If you can't make our office hours, let us know and we will schedule more. We want these projects to be rewarding and instructional, not frustrating and demoralizing. But, we don't know when or how to help unless you ask.

    Discussion: Please be careful not to post spoilers.

    To get started, run Gridworld in manual control mode, which uses the arrow keys:

    You will see the two-exit layout from class. The blue dot is the agent. Note that when you press up , the agent only actually moves north 80% of the time. Such is the life of a Gridworld agent!

    You can control many aspects of the simulation. A full list of options is available by running:

    The default agent moves randomly

    You should see the random agent bounce around the grid until it happens upon an exit. Not the finest hour for an AI agent.

    Note: The Gridworld MDP is such that you first must enter a pre-terminal state (the double boxes shown in the GUI) and then take the special 'exit' action before the episode actually ends (in the true terminal state called TERMINAL_STATE, which is not shown in the GUI). If you run an episode manually, your total return may be less than you expected, due to the discount rate (-d to change; 0.9 by default). For instance, if an exit reward of +1 arrives on the k-th transition, the discounted return is \(0.9^{k-1}\), so collecting it after five transitions yields roughly 0.66 rather than 1.

    Look at the console output that accompanies the graphical output (or use -t for all text). You will be told about each transition the agent experiences (to turn this off, use -q ).

    As in Pacman, positions are represented by (x,y) Cartesian coordinates and any arrays are indexed by [x][y] , with 'north' being the direction of increasing y , etc. By default, most transitions will receive a reward of zero, though you can change this with the living reward option ( -r ).

    Question 1 (4 points): Value Iteration

    Recall the value iteration state update equation:

    \(V_{k+1}(s) \leftarrow \max_a \sum_{s'} T(s, a, s')\,[R(s, a, s') + \gamma V_k(s')]\)

    Write a value iteration agent in ValueIterationAgent , which has been partially specified for you in valueIterationAgents.py . Your value iteration agent is an offline planner, not a reinforcement learning agent, and so the relevant training option is the number of iterations of value iteration it should run (option -i ) in its initial planning phase. ValueIterationAgent takes an MDP on construction and runs value iteration for the specified number of iterations before the constructor returns.

    Value iteration computes k-step estimates of the optimal values, \(V_k\). In addition to running value iteration, implement the following methods for ValueIterationAgent using \(V_k\).

    • computeActionFromValues(state) computes the best action according to the value function given by self.values .
    • computeQValueFromValues(state, action) returns the Q-value of the (state, action) pair given by the value function given by self.values .

    These quantities are all displayed in the GUI: values are numbers in squares, Q-values are numbers in square quarters, and policies are arrows out from each square.

    Important: Use the "batch" version of value iteration where each vector \(V_k\) is computed from a fixed vector \(V_{k-1}\) (like in lecture), not the "online" version where one single weight vector is updated in place. This means that when a state's value is updated in iteration k based on the values of its successor states, the successor state values used in the value update computation should be those from iteration k-1 (even if some of the successor states had already been updated in iteration k). The difference is discussed in Sutton & Barto in the 6th paragraph of chapter 4.1.

    Note: A policy synthesized from values of depth k (which reflect the next k rewards) will actually reflect the next k+1 rewards (i.e. you return \(\pi_{k+1}\)). Similarly, the Q-values will also reflect one more reward than the values (i.e. you return \(Q_{k+1}\)).

    You should return the synthesized policy \(\pi_{k+1}\).

    Hint: You may optionally use the util.Counter class in util.py , which is a dictionary with a default value of zero. However, be careful with argMax : the actual argmax you want may be a key not in the counter!

    Note: Make sure to handle the case when a state has no available actions in an MDP (think about what this means for future rewards).
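
    The following is a minimal sketch (not the official solution) of how these pieces might fit together inside ValueIterationAgent. It assumes the MDP interface of the standard distribution (getStates(), getPossibleActions(state), getTransitionStatesAndProbs(state, action), getReward(state, action, nextState), isTerminal(state)) and that self.values, self.discount, and self.iterations are set up by the provided constructor; treat any name not mentioned in the project text as an assumption.

        import util  # already imported in valueIterationAgents.py

        def computeQValueFromValues(self, state, action):
            # Q(s,a) = sum over s' of T(s,a,s') * [R(s,a,s') + gamma * V(s')]
            q_value = 0.0
            for next_state, prob in self.mdp.getTransitionStatesAndProbs(state, action):
                reward = self.mdp.getReward(state, action, next_state)
                q_value += prob * (reward + self.discount * self.values[next_state])
            return q_value

        def computeActionFromValues(self, state):
            # Best action under the current values; None if no actions are
            # available (e.g. the terminal state), i.e. no future reward.
            actions = self.mdp.getPossibleActions(state)
            if not actions:
                return None
            return max(actions, key=lambda a: self.computeQValueFromValues(state, a))

        def runValueIteration(self):
            # Batch updates: every entry of V_k is computed from the fixed
            # vector V_{k-1}; self.values is only replaced after a full sweep.
            for _ in range(self.iterations):
                new_values = util.Counter()
                for state in self.mdp.getStates():
                    actions = self.mdp.getPossibleActions(state)
                    if actions:
                        new_values[state] = max(
                            self.computeQValueFromValues(state, a) for a in actions)
                    # States with no actions keep the default value of 0.
                self.values = new_values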

    To test your implementation, run the autograder:

    The following command loads your ValueIterationAgent , which will compute a policy and execute it 10 times. Press a key to cycle through values, Q-values, and the simulation. You should find that the value of the start state ( V(start) , which you can read off of the GUI) and the empirical resulting average reward (printed after the 10 rounds of execution finish) are quite close.

    Hint: On the default BookGrid, running value iteration for 5 iterations should give you this output:

    [Figure: BookGrid values after value iteration with k=5]

    Grading: Your value iteration agent will be graded on a new grid. We will check your values, Q-values, and policies after fixed numbers of iterations and at convergence (e.g. after 100 iterations).

    Question 2 (1 point): Bridge Crossing Analysis

    BridgeGrid is a grid world map with a low-reward terminal state and a high-reward terminal state separated by a narrow "bridge", on either side of which is a chasm of high negative reward. The agent starts near the low-reward state. With the default discount of 0.9 and the default noise of 0.2, the optimal policy does not cross the bridge. Change only ONE of the discount and noise parameters so that the optimal policy causes the agent to attempt to cross the bridge. Put your answer in question2() of analysis.py . (Noise refers to how often an agent ends up in an unintended successor state when they perform an action.) The default corresponds to:

    [Figure: BridgeGrid values after value iteration with k=100 under the default parameters]

    Grading: We will check that you only changed one of the given parameters, and that with this change, a correct value iteration agent should cross the bridge. To check your answer, run the autograder:

    Question 3 (5 points): Policies

    Consider the DiscountGrid layout, shown below. This grid has two terminal states with positive payoff (in the middle row), a close exit with payoff +1 and a distant exit with payoff +10. The bottom row of the grid consists of terminal states with negative payoff (shown in red); each state in this "cliff" region has payoff -10. The starting state is the yellow square. We distinguish between two types of paths: (1) paths that "risk the cliff" and travel near the bottom row of the grid; these paths are shorter but risk earning a large negative payoff, and are represented by the red arrow in the figure below. (2) paths that "avoid the cliff" and travel along the top edge of the grid. These paths are longer but are less likely to incur huge negative payoffs. These paths are represented by the green arrow in the figure below.

    [Figure: DiscountGrid layout, with the risky (red) and safe (green) paths]

    In this question, you will choose settings of the discount, noise, and living reward parameters for this MDP to produce optimal policies of several different types. Your setting of the parameter values for each part should have the property that, if your agent followed its optimal policy without being subject to any noise, it would exhibit the given behavior. If a particular behavior is not achieved for any setting of the parameters, assert that the policy is impossible by returning the string 'NOT POSSIBLE' .

    Here are the optimal policy types you should attempt to produce:

    • Prefer the close exit (+1), risking the cliff (-10)
    • Prefer the close exit (+1), but avoiding the cliff (-10)
    • Prefer the distant exit (+10), risking the cliff (-10)
    • Prefer the distant exit (+10), avoiding the cliff (-10)
    • Avoid both exits and the cliff (so an episode should never terminate)

    To check your answers, run the autograder:

    question3a() through question3e() should each return a 3-item tuple of (discount, noise, living reward) in analysis.py .

    Note: You can check your policies in the GUI. For example, using a correct answer to 3(a), the arrow in (0,1) should point east, the arrow in (1,1) should also point east, and the arrow in (2,1) should point north.

    Note: On some machines you may not see an arrow. In this case, press a button on the keyboard to switch to qValue display, and mentally calculate the policy by taking the arg max of the available qValues for each state.

    Grading: We will check that the desired policy is returned in each case.

    Question 4 (1 point): Asynchronous Value Iteration

    Write a value iteration agent in AsynchronousValueIterationAgent , which has been partially specified for you in valueIterationAgents.py . Your value iteration agent is an offline planner, not a reinforcement learning agent, and so the relevant training option is the number of iterations of value iteration it should run (option -i ) in its initial planning phase. AsynchronousValueIterationAgent takes an MDP on construction and runs cyclic value iteration (described in the next paragraph) for the specified number of iterations before the constructor returns. Note that all this value iteration code should be placed inside the constructor ( __init__ method).

    The reason this class is called AsynchronousValueIterationAgent is because we will update only one state in each iteration, as opposed to doing a batch-style update. Here is how cyclic value iteration works. In the first iteration, only update the value of the first state in the states list. In the second iteration, only update the value of the second. Keep going until you have updated the value of each state once, then start back at the first state for the subsequent iteration. If the state picked for updating is terminal, nothing happens in that iteration . You can implement it as indexing into the states variable defined in the code skeleton.

    As a reminder, here's the value iteration state update equation:

    \[ V_{k+1}(s) \leftarrow \max_{a} \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_k(s') \right] \]

    Value iteration iterates a fixed-point equation, as discussed in class. It is also possible to update the state values in different ways, such as in a random order (i.e., select a state randomly, update its value, and repeat) or in a batch style (as in Q1). In Q4, we will explore another technique.

    AsynchronousValueIterationAgent inherits from ValueIterationAgent from Q1. This implies that you only need to implement runValueIteration , and all other methods (such as computeQValueFromValues ) are inherited from Q1. Since the superclass constructor calls runValueIteration , overriding it is sufficient to change the agent's behavior as desired.
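
    A minimal sketch of the cyclic scheme, under the same interface assumptions as the Q1 sketch above:

        def runValueIteration(self):
            # One state updated per iteration, cycling through the state list.
            states = self.mdp.getStates()
            for i in range(self.iterations):
                state = states[i % len(states)]
                if self.mdp.isTerminal(state):
                    continue  # nothing happens in this iteration
                actions = self.mdp.getPossibleActions(state)
                if actions:
                    # In-place update: later iterations immediately see this value.
                    self.values[state] = max(
                        self.computeQValueFromValues(state, a) for a in actions)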

    To test your implementation, run the autograder. It should take less than a second to run. If it takes much longer, you may run into issues later in the project, so make your implementation more efficient now.

    The following command loads your AsynchronousValueIterationAgent in the Gridworld, which will compute a policy and execute it 10 times. Press a key to cycle through values, Q-values, and the simulation. You should find that the value of the start state ( V(start) , which you can read off of the GUI) and the empirical resulting average reward (printed after the 10 rounds of execution finish) are quite close.

    Grading: Your value iteration agent will be graded on a new grid. We will check your values, Q-values, and policies after fixed numbers of iterations and at convergence (e.g., after 1000 iterations).

    Question 5 (3 points): Prioritized Sweeping Value Iteration

    You will now implement PrioritizedSweepingValueIterationAgent , which has been partially specified for you in valueIterationAgents.py . Note that this class derives from AsynchronousValueIterationAgent , so the only method that needs to change is runValueIteration , which actually runs the value iteration.

    Prioritized sweeping attempts to focus updates of state values in ways that are likely to change the policy.

    For this project, you will implement a simplified version of the standard prioritized sweeping algorithm, which is described in this paper . We've adapted this algorithm for our setting. First, we define the predecessors of a state s as all states that have a nonzero probability of reaching s by taking some action a . Also, theta , which is passed in as a parameter, will represent our tolerance for error when deciding whether to update the value of a state. Here's the algorithm you should follow in your implementation.

    • Compute predecessors of all states.
    • Initialize an empty priority queue.
    • For each non-terminal state s , do:
        • Find the absolute value of the difference between the current value of s in self.values and the highest Q-value across all possible actions from s (this represents what the value should be); call this number diff . Do NOT update self.values[s] in this step.
        • Push s into the priority queue with priority -diff (note that this is negative ). We use a negative because the priority queue is a min heap, but we want to prioritize updating states that have a higher error.
    • For iteration in 0, 1, 2, ..., self.iterations - 1 , do:
        • If the priority queue is empty, then terminate.
        • Pop a state s off the priority queue.
        • Update s 's value (if it is not a terminal state) in self.values .
        • For each predecessor p of s , do:
            • Find the absolute value of the difference between the current value of p in self.values and the highest Q-value across all possible actions from p (this represents what the value should be); call this number diff . Do NOT update self.values[p] in this step.
            • If diff > theta , push p into the priority queue with priority -diff (note that this is negative ), as long as it does not already exist in the priority queue with equal or lower priority. As before, we use a negative because the priority queue is a min heap, but we want to prioritize updating states that have a higher error.

    A couple of important notes on implementation:

    • When you compute predecessors of a state, make sure to store them in a set , not a list, to avoid duplicates.
    • Please use util.PriorityQueue in your implementation. The update method in this class will likely be useful; look at its documentation.
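
    A sketch of the loop above, assuming the same MDP interface as in the earlier sketches, that theta is stored on the agent (here as self.theta , an assumed attribute name), that util.PriorityQueue provides push , pop , isEmpty , and update , and that every non-terminal state has at least one legal action, as in Gridworld:

        def runValueIteration(self):
            # Predecessors of every state, stored as sets to avoid duplicates.
            predecessors = {s: set() for s in self.mdp.getStates()}
            for s in self.mdp.getStates():
                for a in self.mdp.getPossibleActions(s):
                    for next_state, prob in self.mdp.getTransitionStatesAndProbs(s, a):
                        if prob > 0:
                            predecessors[next_state].add(s)

            def best_q(s):
                # Highest Q-value across all actions from s (what V(s) "should" be).
                return max(self.computeQValueFromValues(s, a)
                           for a in self.mdp.getPossibleActions(s))

            # Seed the queue with every non-terminal state, prioritized by error.
            queue = util.PriorityQueue()
            for s in self.mdp.getStates():
                if not self.mdp.isTerminal(s):
                    diff = abs(self.values[s] - best_q(s))
                    queue.update(s, -diff)  # negative: min-heap, bigger error first

            # Perform self.iterations prioritized updates.
            for _ in range(self.iterations):
                if queue.isEmpty():
                    return
                s = queue.pop()
                if not self.mdp.isTerminal(s):
                    self.values[s] = best_q(s)
                for p in predecessors[s]:
                    diff = abs(self.values[p] - best_q(p))
                    if diff > self.theta:
                        queue.update(p, -diff)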

    To test your implementation, run the autograder. It should take about 1 second to run. If it takes much longer, you may run into issues later in the project, so make your implementation more efficient now.

    You can run the PrioritizedSweepingValueIterationAgent in the Gridworld using the following command.

    Grading: Your prioritized sweeping value iteration agent will be graded on a new grid. We will check your values, Q-values, and policies after fixed numbers of iterations and at convergence (e.g., after 1000 iterations).

    Question 6 (4 points): Q-Learning

    Note that your value iteration agent does not actually learn from experience. Rather, it ponders its MDP model to arrive at a complete policy before ever interacting with a real environment. When it does interact with the environment, it simply follows the precomputed policy (e.g. it becomes a reflex agent). This distinction may be subtle in a simulated environment like a Gridworld, but it's very important in the real world, where the real MDP is not available.

    You will now write a Q-learning agent, which does very little on construction, but instead learns by trial and error from interactions with the environment through its update(state, action, nextState, reward) method. A stub of a Q-learner is specified in QLearningAgent in qlearningAgents.py , and you can select it with the option '-a q' . For this question, you must implement the update , computeValueFromQValues , getQValue , and computeActionFromQValues methods.

    Note: For computeActionFromQValues , you should break ties randomly for better behavior. The random.choice() function will help. In a particular state, actions that your agent hasn't seen before still have a Q-value, specifically a Q-value of zero, and if all of the actions that your agent has seen before have a negative Q-value, an unseen action may be optimal.

    Important: Make sure that in your computeValueFromQValues and computeActionFromQValues functions, you only access Q values by calling getQValue . This abstraction will be useful for question 10 when you override getQValue to use features of state-action pairs rather than state-action pairs directly.
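
    A sketch of these methods, assuming Q-values are kept in a util.Counter keyed by (state, action) (a design choice, not something the project text mandates), that the constructor initializes it (e.g. self.qvalues = util.Counter() ), and that the base class provides getLegalActions(state) , self.alpha , and self.discount :

        import random
        import util  # both already imported in qlearningAgents.py

        def getQValue(self, state, action):
            # Unseen (state, action) pairs default to 0.0 via util.Counter.
            return self.qvalues[(state, action)]

        def computeValueFromQValues(self, state):
            actions = self.getLegalActions(state)
            if not actions:
                return 0.0  # terminal state: no future reward
            return max(self.getQValue(state, a) for a in actions)

        def computeActionFromQValues(self, state):
            actions = self.getLegalActions(state)
            if not actions:
                return None
            best = self.computeValueFromQValues(state)
            # Break ties randomly among all maximizing actions.
            best_actions = [a for a in actions if self.getQValue(state, a) == best]
            return random.choice(best_actions)

        def update(self, state, action, nextState, reward):
            # Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * [r + gamma * max_a' Q(s',a')]
            sample = reward + self.discount * self.computeValueFromQValues(nextState)
            self.qvalues[(state, action)] = (
                (1 - self.alpha) * self.getQValue(state, action) + self.alpha * sample)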

    With the Q-learning update in place, you can watch your Q-learner learn under manual control, using the keyboard:

    Recall that -k will control the number of episodes your agent gets to learn. Watch how the agent learns about the state it was just in, not the one it moves to, and "leaves learning in its wake." Hint: to help with debugging, you can turn off noise by using the --noise 0.0 parameter (though this obviously makes Q-learning less interesting). If you manually steer Pacman north and then east along the optimal path for four episodes, you should see the following Q-values:

    QLearning

    Grading: We will run your Q-learning agent and check that it learns the same Q-values and policy as our reference implementation when each is presented with the same set of examples. To grade your implementation, run the autograder:

    Question 7 (2 points): Epsilon Greedy

    Complete your Q-learning agent by implementing epsilon-greedy action selection in getAction , meaning it chooses random actions an epsilon fraction of the time, and follows its current best Q-values otherwise. Note that choosing a random action may result in choosing the best action - that is, you should not choose a random sub-optimal action, but rather any random legal action.

    You can choose an element from a list uniformly at random by calling the random.choice function. You can simulate a binary variable with probability p of success by using util.flipCoin(p) , which returns True with probability p and False with probability 1-p .
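
    A sketch of epsilon-greedy action selection along these lines, continuing the Q-learning sketch above (it assumes self.epsilon is set by the base class):

        def getAction(self, state):
            legal_actions = self.getLegalActions(state)
            if not legal_actions:
                return None
            if util.flipCoin(self.epsilon):
                # Explore: any legal action, possibly including the greedy one.
                return random.choice(legal_actions)
            # Exploit: follow the current best Q-values.
            return self.computeActionFromQValues(state)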

    After implementing the getAction method, observe the following behavior of the agent in gridworld (with epsilon = 0.3).

    Your final Q-values should resemble those of your value iteration agent, especially along well-traveled paths. However, your average returns will be lower than the Q-values predict because of the random actions and the initial learning phase.

    You can also observe the following simulations for different epsilon values. Does the behavior of the agent match what you expect?

    With no additional code, you should now be able to run a Q-learning crawler robot:

    If this doesn't work, you've probably written some code too specific to the GridWorld problem and you should make it more general to all MDPs.

    This will invoke the crawling robot from class using your Q-learner. Play around with the various learning parameters to see how they affect the agent's policies and actions. Note that the step delay is a parameter of the simulation, whereas the learning rate and epsilon are parameters of your learning algorithm, and the discount factor is a property of the environment.

    Question 8 (1 point): Bridge Crossing Revisited

    First, train a completely random Q-learner with the default learning rate on the noiseless BridgeGrid for 50 episodes and observe whether it finds the optimal policy.

    Now try the same experiment with an epsilon of 0. Is there an epsilon and a learning rate for which it is highly likely (greater than 99%) that the optimal policy will be learned after 50 iterations? question8() in analysis.py should return EITHER a 2-item tuple of (epsilon, learning rate) OR the string 'NOT POSSIBLE' if there is none. Epsilon is controlled by -e , learning rate by -l .

    Note: Your response should not depend on the exact tie-breaking mechanism used to choose actions. This means your answer should be correct even if for instance we rotated the entire bridge grid world 90 degrees.

    To grade your answer, run the autograder:

    Question 9 (1 point): Q-Learning and Pacman

    Time to play some Pacman! Pacman will play games in two phases. In the first phase, training , Pacman will begin to learn about the values of positions and actions. Because it takes a very long time to learn accurate Q-values even for tiny grids, Pacman's training games run in quiet mode by default, with no GUI (or console) display. Once Pacman's training is complete, he will enter testing mode. When testing, Pacman's self.epsilon and self.alpha will be set to 0.0, effectively stopping Q-learning and disabling exploration, in order to allow Pacman to exploit his learned policy. Test games are shown in the GUI by default.

    Without any code changes you should be able to run Q-learning Pacman as follows:

    Note that PacmanQAgent is already defined for you in terms of the QLearningAgent you've already written. PacmanQAgent is only different in that it has default learning parameters that are more effective for the Pacman problem ( epsilon=0.05, alpha=0.2, gamma=0.8 ). You will receive full credit for this question if the command above works without exceptions and your agent wins at least 80% of the time. The autograder will run 100 test games after the 2000 training games.

    Note: If you want to experiment with learning parameters, you can use the option -a , for example -a epsilon=0.1,alpha=0.3,gamma=0.7 . These values will then be accessible as self.epsilon, self.gamma and self.alpha inside the agent.

    Note: While a total of 2010 games will be played, the first 2000 games will not be displayed because of the option -x 2000 , which designates the first 2000 games for training (no output). Thus, you will only see Pacman play the last 10 of these games. The number of training games is also passed to your agent as the option numTraining .

    Hint: If your QLearningAgent works for gridworld.py and crawler.py but does not seem to be learning a good policy for Pacman on smallGrid , it may be because your getAction and/or computeActionFromQValues methods do not in some cases properly consider unseen actions. In particular, because unseen actions have by definition a Q-value of zero, if all of the actions that have been seen have negative Q-values, an unseen action may be optimal. Beware of the argMax function from util.Counter!

    To grade your answer, run:

    Note: If you want to watch 10 training games to see what's going on, use the command:

    During training, you will see output every 100 games with statistics about how Pacman is faring. Epsilon is positive during training, so Pacman will play poorly even after having learned a good policy: this is because he occasionally makes a random exploratory move into a ghost. As a benchmark, it should take between 1,000 and 1,400 games before Pacman's rewards for a 100-episode segment become positive, reflecting that he's started winning more than losing. By the end of training, the average should remain positive and be fairly high (between 100 and 350).

    Make sure you understand what is happening here: the MDP state is the exact board configuration facing Pacman, with the now complex transitions describing an entire ply of change to that state. The intermediate game configurations in which Pacman has moved but the ghosts have not replied are not MDP states, but are bundled into the transitions.

    Once Pacman is done training, he should win very reliably in test games (at least 90% of the time), since now he is exploiting his learned policy.

    However, you will find that training the same agent on the seemingly simple mediumGrid does not work well. In our implementation, Pacman's average training rewards remain negative throughout training. At test time, he plays badly, probably losing all of his test games. Training will also take a long time, despite its ineffectiveness.

    Pacman fails to win on larger layouts because each board configuration is a separate state with separate Q-values. He has no way to generalize that running into a ghost is bad for all positions. Obviously, this approach will not scale.

    Question 10 (3 points): Approximate Q-Learning

    Implement an approximate Q-learning agent that learns weights for features of states, where many states might share the same features. Write your implementation in the ApproximateQAgent class in qlearningAgents.py , which is a subclass of PacmanQAgent .

    Note: Approximate Q-learning assumes the existence of a feature function f(s,a) over state and action pairs, which yields a vector \(f_1(s,a), \ldots, f_i(s,a), \ldots, f_n(s,a)\) of feature values. We provide feature functions for you in featureExtractors.py . Feature vectors are util.Counter (like a dictionary) objects containing the non-zero pairs of features and values; all omitted features have value zero.

    The approximate Q-function takes the following form:

    \[ Q(s,a) = \sum_{i=1}^{n} f_i(s,a)\, w_i \]

    where each weight \(w_i\) is associated with a particular feature \(f_i(s,a)\). In your code, you should implement the weight vector as a dictionary mapping features (which the feature extractors will return) to weight values. You will update your weight vectors similarly to how you updated Q-values:

    \[ w_i \leftarrow w_i + \alpha \cdot difference \cdot f_i(s,a) \]

    \[ difference = \left( r + \gamma \max_{a'} Q(s', a') \right) - Q(s, a) \]

    Note that the \(difference\) term is the same as in normal Q-learning, and \( r \) is the experienced reward.
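
    A sketch of the corresponding getQValue and update methods, assuming self.weights is a util.Counter and that the agent's feature extractor is available as self.featExtractor (an attribute name treated here as an assumption):

        def getQValue(self, state, action):
            # Q(s,a) = sum_i f_i(s,a) * w_i
            features = self.featExtractor.getFeatures(state, action)
            return sum(self.weights[f] * value for f, value in features.items())

        def update(self, state, action, nextState, reward):
            features = self.featExtractor.getFeatures(state, action)
            difference = (reward
                          + self.discount * self.computeValueFromQValues(nextState)
                          - self.getQValue(state, action))
            for f, value in features.items():
                self.weights[f] += self.alpha * difference * value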

    By default, ApproximateQAgent uses the IdentityExtractor , which assigns a single feature to every (state,action) pair. With this feature extractor, your approximate Q-learning agent should work identically to PacmanQAgent . You can test this with the following command:

    Important: ApproximateQAgent is a subclass of QLearningAgent , and it therefore shares several methods like getAction . Make sure that your methods in QLearningAgent call getQValue instead of accessing Q-values directly, so that when you override getQValue in your approximate agent, the new approximate q-values are used to compute actions.

    Once you're confident that your approximate learner works correctly with the identity features, run your approximate Q-learning agent with our custom feature extractor, which can learn to win with ease:

    Even much larger layouts should be no problem for your ApproximateQAgent . ( warning : this may take a few minutes to train)

    If you have no errors, your approximate Q-learning agent should win almost every time with these simple features, even with only 50 training games.

    Grading: We will run your approximate Q-learning agent and check that it learns the same Q-values and feature weights as our reference implementation when each is presented with the same set of examples. To grade your implementation, run the autograder:

    Congratulations! You have a learning Pacman agent!

    Complete Questions 1 through 10 as specified in the project instructions. Then upload valueIterationAgents.py , qlearningAgents.py , and analysis.py to Gradescope.

    Prior to submitting, be sure you run the autograder on your own machine. Running the autograder locally will help you to debug and expedite your development process. The autograder can be invoked on your own machine using the command:

    To run the autograder on a single question, such as question 3, invoke it by

    Note that running the autograder locally will not register your grades with us. Remember to submit your code below when you want to register your grades for this assignment.

    The autograder on Gradescope might take a while but don't worry: so long as you submit before the due date, it's not late .


    Reinforcement learning is a subfield of AI/statistics focused on exploring/understanding complicated environments and learning how to optimally acquire rewards. Examples are AlphaGo, clinical trials & A/B tests, and Atari game playing.

    New Coursera specialization on RL

    There is a new Coursera specialization on the fundamentals of reinforcement learning.

    The specialization is taught out of University of Alberta by Dr. Adam White and Dr. Martha White, with guest lectures from many well known researchers and practitioners in the field. The specialization follows the Sutton Barto textbook from chapter 2 to 13 (give or take a few sections).

    Right now, the first course is available. It goes from Bandits to Dynamic Programming and sets a foundation for more advanced topics in the field.

    Anyways, go sign up and tell your friends :)


    Implementation of Reinforcement Learning Algorithms. Python, OpenAI Gym, Tensorflow. Exercises and Solutions to accompany Sutton's Book and David Silver's course.

    dennybritz/reinforcement-learning


    This repository provides code, exercises and solutions for popular Reinforcement Learning algorithms. These are meant to serve as a learning tool to complement the theoretical materials from

    • Reinforcement Learning: An Introduction (2nd Edition)
    • David Silver's Reinforcement Learning Course

    Each folder corresponds to one or more chapters of the above textbook and/or course. In addition to exercises and solutions, each folder also contains a list of learning goals, a brief concept summary, and links to the relevant readings.

    All code is written in Python 3 and uses RL environments from OpenAI Gym . Advanced techniques use Tensorflow for neural network implementations.

    Table of Contents

    • Introduction to RL problems & OpenAI Gym
    • MDPs and Bellman Equations
    • Dynamic Programming: Model-Based RL, Policy Iteration and Value Iteration
    • Monte Carlo Model-Free Prediction & Control
    • Temporal Difference Model-Free Prediction & Control
    • Function Approximation
    • Deep Q Learning (WIP)
    • Policy Gradient Methods (WIP)
    • Learning and Planning (WIP)
    • Exploration and Exploitation (WIP)

    List of Implemented Algorithms

    • Dynamic Programming Policy Evaluation
    • Dynamic Programming Policy Iteration
    • Dynamic Programming Value Iteration
    • Monte Carlo Prediction
    • Monte Carlo Control with Epsilon-Greedy Policies
    • Monte Carlo Off-Policy Control with Importance Sampling
    • SARSA (On Policy TD Learning)
    • Q-Learning (Off Policy TD Learning)
    • Q-Learning with Linear Function Approximation
    • Deep Q-Learning for Atari Games
    • Double Deep-Q Learning for Atari Games
    • Deep Q-Learning with Prioritized Experience Replay (WIP)
    • Policy Gradient: REINFORCE with Baseline
    • Policy Gradient: Actor Critic with Baseline
    • Policy Gradient: Actor Critic with Baseline for Continuous Action Spaces
    • Deterministic Policy Gradients for Continuous Action Spaces (WIP)
    • Deep Deterministic Policy Gradients (DDPG) (WIP)
    • Asynchronous Advantage Actor Critic (A3C)

    Classes:

    • David Silver's Reinforcement Learning Course (UCL, 2015)
    • CS294 - Deep Reinforcement Learning (Berkeley, Fall 2015)
    • CS 8803 - Reinforcement Learning (Georgia Tech)
    • CS885 - Reinforcement Learning (UWaterloo), Spring 2018
    • CS294-112 - Deep Reinforcement Learning (UC Berkeley)

    Talks/Tutorials:

    • Introduction to Reinforcement Learning (Joelle Pineau @ Deep Learning Summer School 2016)
    • Deep Reinforcement Learning (Pieter Abbeel @ Deep Learning Summer School 2016)
    • Deep Reinforcement Learning ICML 2016 Tutorial (David Silver)
    • Tutorial: Introduction to Reinforcement Learning with Function Approximation
    • John Schulman - Deep Reinforcement Learning (4 Lectures)
    • Deep Reinforcement Learning Slides @ NIPS 2016
    • OpenAI Spinning Up
    • Advanced Deep Learning & Reinforcement Learning (UCL 2018, DeepMind)
    • Deep RL Bootcamp

    Other Projects:

    • carpedm20/deep-rl-tensorflow
    • matthiasplappert/keras-rl

    Selected Papers:

    • Human-Level Control through Deep Reinforcement Learning (2015-02)
    • Deep Reinforcement Learning with Double Q-learning (2015-09)
    • Continuous control with deep reinforcement learning (2015-09)
    • Prioritized Experience Replay (2015-11)
    • Dueling Network Architectures for Deep Reinforcement Learning (2015-11)
    • Asynchronous Methods for Deep Reinforcement Learning (2016-02)
    • Deep Reinforcement Learning from Self-Play in Imperfect-Information Games (2016-03)
    • Mastering the game of Go with deep neural networks and tree search


    Intelligent Decision-Making System of Air Defense Resource Allocation via Hierarchical Reinforcement Learning


    Recommendations:

    Decision modeling and simulation of fighter air-to-ground combat based on reinforcement learning

    With the Artificial Intelligence (AI) widely used in air combat simulation system, the decision-making system of fighter has reached a high level of complexity. Traditionally, the pure theoretical analysis and the rule-based system are not enough to ...

    Deep reinforcement learning-based air combat maneuver decision-making: literature review, implementation tutorial and future direction

    Nowadays, various innovative air combat paradigms that rely on unmanned aerial vehicles (UAVs), i.e., UAV swarm and UAV-manned aircraft cooperation, have received great attention worldwide. During the operation, UAVs are expected to perform agile ...

    Reinforcement Learning with Hierarchical Decision-Making

    This paper proposes a simple, hierarchical decision-making approach to reinforcement learning, under the framework of Markov decision processes. According to the approach, the choice of an action, in every time stage, is made through a successive ...



COMMENTS

  1. ChanchalKumarMaji/Reinforcement-Learning-Specialization

    David Silver Reinforcement Learning course - slides, YouTube playlist. About: [Coursera] Reinforcement Learning Specialization by "University of Alberta" & "Alberta Machine Intelligence Institute".

  2. greyhatguy007/Machine-Learning-Specialization-Coursera

    C3 - Unsupervised Learning, Recommenders, Reinforcement Learning ... Programming Assignment. Deep Q-Learning - Lunar Lander; Certificate of Completion. Specialization Certificate. Course Review : This Course is a best place towards becoming a Machine Learning Engineer. Even if ...

  3. Reinforcement Learning Specialization

    Through programming assignments and quizzes, students will: Build a Reinforcement Learning system that knows how to make automated decisions. Understand how RL relates to and fits under the broader umbrella of machine learning, deep learning, supervised and unsupervised learning.

  4. Unsupervised Learning, Recommenders, Reinforcement Learning

    In the third course of the Machine Learning Specialization, you will: • Use unsupervised learning techniques for unsupervised learning: including clustering and anomaly detection. • Build recommender systems with a collaborative filtering approach and a content-based deep learning method. • Build a deep reinforcement learning model.

  5. Assignment 1: Bandits and Exploration/Exploitation

    Welcome to Assignment 1. This notebook will: ... Introduce you to some of the reinforcement learning software we are going to use for this specialization; This class uses RL-Glue to implement most of our experiments. It was originally designed by Adam White, Brian Tanner, and Rich Sutton. This library will give you a solid framework to ...

  6. Programming Assignments for Reinforcement Learning Specialization

    Through programming assignments and quizzes, students will: Build a Reinforcement Learning system that knows how to make automated decisions. Understand how RL relates to and fits under the broader umbrella of machine learning, deep learning, supervised and unsupervised learning.

  7. A Complete Reinforcement Learning System (Capstone)

    There are 6 modules in this course. In this final course, you will put together your knowledge from Courses 1, 2 and 3 to implement a complete RL solution to a problem. This capstone will let you see how each component---problem formulation, algorithm selection, parameter selection and representation design---fits together into a complete ...

  8. Fundamentals of Reinforcement Learning

    Reinforcement Learning is a subfield of Machine Learning, but is also a general purpose formalism for automated decision-making and AI. This course introduces you to statistical learning techniques where an agent explicitly takes actions and interacts with the world. ... I enjoyed the programming assignments though it would have been nice if ...

  9. COMP 150: Reinforcement Learning

    These exercises will generally not involve extensive or elaborate programs. The emphasis should be on empirically analyzing various learning algorithms and reporting the results. The reports, all relevant code and data should be submitted on canvas. Grades for each programming assignment will be out of 10 points.

  10. Reinforcement Learning AI Course

    Reinforcement Learning. $1,750.00. Course materials are available for 90 days after the course ends. Course materials will be available through your mystanfordconnection account on the first day of the course at noon Pacific Time. A course syllabus and invitation to an optional Orientation Webinar will be sent 10-14 days prior to the course start.

  11. CS234: Reinforcement Learning Spring 2024

    Assignments will include the basics of reinforcement learning as well as deep reinforcement learning — an extremely promising new area that combines deep learning techniques with reinforcement learning. ... If you have a lot of programming experience but in a different language (e.g. C/ C++/ Matlab/ Javascript) you will probably be fine ...

  12. Decision Making and Reinforcement Learning

    You will learn to implement TD prediction, TD batch and offline methods, SARSA and Q-learning, and compare on-policy vs off-policy TD learning. You will then apply your knowledge in solving a Tic-tac-toe programming assignment. You could post in the discussion forum if you need assistance on the quiz and assignment.

  13. Assignments for Reinforcement Learning: Theory and Practice

    I anticipate projects taking one of two forms. Practice (preferred): An implemenation of RL in some domain of your choice - ideally one that you are using for research or in some other class. In this case, please describe the domain and your initial plans on how you intend to implement learning.

  14. Review: Coursera Reinforcement Learning specialization

    A Complete Reinforcement Learning System (Capstone) Course contents. ... The programming assignments (3-4 per course) are carried out in Jupyter notebooks coded in Python 3. The level of coding ...

  15. Coursera Fundamentals of Reinforcement Learning

    Reinforcement Learning is a subfield of Machine Learning, but is also a general purpose formalism for automated decision-making and AI. This course introduces you to statistical learning techniques where an agent explicitly takes actions and interacts with the world.

  16. Assignments for Reinforcement Learning: Theory and Practice

    Final project proposal due at 11:59pm on Thursday, 3/10. See the project page for full details on the project. Complete Homework for Chapters 10+11 on edx by Friday 11:59 PM CST. Complete Programming Assignment for Chapters 4,5,6,7 on edx by Sunday at 11:59 PM CST. You can submit your reading response here.

  17. Reinforcement Learning Specialization

    Video: The Goal of Reinforcement Learning, by Adam. By the end of this video: ... This is a quiz and a peer-graded assignment. I had to describe 3 MDPs with all their details (states, actions, rewards). ... Course 1 - Week 4 - Dynamic Programming Module 4 Learning Objectives. Lesson 1: Policy Evaluation (Prediction)

  18. Reinforcement Learning

    In this project, you will implement value iteration and Q-learning. You will test your agents first on Gridworld (from class), then apply them to a simulated robot controller (Crawler) and Pacman. As in previous programming assignments, this assignment includes an autograder for you to grade your answers on your machine.

  19. Fundamentals of Reinforcement Learning

    Reinforcement Learning is a subfield of Machine Learning, but is also a general purpose formalism for automated decision-making and AI. This course introduces you to statistical learning techniques where an agent explicitly takes actions and interacts with the world. ... The programming assignments are really great and practically introduce you ...

  20. GitHub

    No assignment; Week 3: Temporal Difference Learning Methods for Prediction. Assignment: Policy Evaluation with Temporal Difference Learning; Week 4: Temporal Difference Learning Methods for Control. Assignment: Q-learning and Expected Sarsa; Week 5: Planning, Learning & Acting. Assignment: Dyna-Q and Dyna-Q+

  21. New Coursera specialization on RL : r/reinforcementlearning

    There is a new Coursera specialization on the fundamentals of reinforcement learning. The specialization is taught out of University of Alberta by Dr. Adam White and Dr. Martha White, with guest lectures from many well known researchers and practitioners in the field. The specialization follows the Sutton Barto textbook from chapter 2 to 13 ...

  22. A novel ensemble deep reinforcement learning model for short‐term load

    The proposed hybrid deep reinforcement learning method can effectively aggregate three sub-models. The model has higher prediction accuracy than a single DL network. In addition, Q-learning, as a reinforcement learning algorithm, dynamically optimizes the weights of the basic predictors through an iterative approach.

  23. GitHub

    Reinforcement Learning: An Introduction (2nd Edition) Classes: David Silver's Reinforcement Learning Course (UCL, 2015) CS294 - Deep Reinforcement Learning (Berkeley, Fall 2015) CS 8803 - Reinforcement Learning (Georgia Tech) CS885 - Reinforcement Learning (UWaterloo), Spring 2018; CS294-112 - Deep Reinforcement Learning (UC Berkeley) Talks ...

  24. Intelligent Decision-Making System of Air Defense Resource Allocation

    To address these problems, a new hierarchical reinforcement learning algorithm named Hierarchy Asynchronous Advantage Actor-Critic (H-A3C) is developed. This algorithm is designed to have a hierarchical decision-making framework considering the characteristics of air defense operations and employs the hierarchical reinforcement learning method ...