Introduction

This project was carried out in partnership between the University of Alberta (UofA) and Delphi Technology Corp (DTC), an innovative flight training solution provider with offices in Winnipeg and Calgary. The ultimate goal of this research is to reduce the amount of instructor assistance needed while training new pilots. This objective, once reached, will have a positive impact on the entire Canadian aviation industry. Research in this direction will also significantly lower the cost of pilot training programs by reducing the number of additional flight instructors required to produce the needed number of new pilots. Additionally, this technology can potentially be applied to several related tutoring tasks.
The first phase of this project focuses on using ML to identify mistakes or sub-optimal maneuvers by trainee pilots inside a flight simulator (X-Plane), reducing the amount of human trainer supervision while improving student learning outcomes. To learn how to fly, trainees at DTC are currently guided through exercises in a flight simulator by a human flight instructor. We posit that training a predictive/RL-enabled system to take on some of these tasks is a viable approach to reducing an instructor's workload and allowing them to interact with more students simultaneously. As a first step, an RL agent is trained with the Soft Actor-Critic (SAC) algorithm to perform the most important task in flying an airplane: flying straight and level. The pitch, yaw, and roll of the airplane are used to measure the agent's success in performing this task. These variables are shown in the figure below.
[Figure: the pitch, yaw, and roll axes of the airplane]
Defining a good reward function that captures the essence of a task is crucial for training an RL agent to perform that task properly. For flying straight and level, the pitch and roll of the airplane should be nearly zero, while the target yaw is defined by the heading towards which the airplane is flying; for our task, the target yaw is 180 degrees. The agent receives a reward for being within a reasonable range (10 degrees) of these goal values, and the reward increases as the agent gets closer to the exact desired values. Throughout the flight, the state of the agent is represented using the first letter of each measurement that is within range: Y, P, and R. If the agent is within range of the desired pitch (zero), but not within range of the desired roll and yaw values, its state is represented as 'P'; if the state is 'YPR', the agent is within range for all three measurements. A simplified version of the final reward function based on these values is shown below:
Reward Function

Reward = Yaw_reward + Pitch_reward + Roll_reward
Yaw_reward   = 1 - (yaw_error)^0.4     if the state contains 'Y',   otherwise 0
Pitch_reward = 1 - (pitch_error)^0.4   if the state contains 'YP',  otherwise 0
Roll_reward  = 1 - (roll_error)^0.4    if the state contains 'YPR', otherwise 0
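
A rough translation of this scheme into code is sketched below. This is an illustrative sketch rather than DTC's production reward; in particular, normalizing each error by the 10-degree tolerance (so every term stays in [0, 1]) is an added assumption, while the simplified formula above uses the raw errors.

# Sketch of the shaped reward, assuming attitude angles in degrees.
TARGET_PITCH = 0.0   # degrees
TARGET_ROLL = 0.0    # degrees
TARGET_YAW = 180.0   # degrees (target heading for this task)
TOLERANCE = 10.0     # degrees within which an axis counts as "in range"


def attitude_state(pitch, roll, yaw):
    """Return the letter code of the axes currently within tolerance, e.g. 'YP'."""
    state = ""
    if abs(yaw - TARGET_YAW) <= TOLERANCE:
        state += "Y"
    if abs(pitch - TARGET_PITCH) <= TOLERANCE:
        state += "P"
    if abs(roll - TARGET_ROLL) <= TOLERANCE:
        state += "R"
    return state


def straight_and_level_reward(pitch, roll, yaw):
    """Yaw earns reward when in range, pitch only once yaw is also in range,
    and roll only once all three axes are in range ('Y', 'YP', 'YPR')."""
    state = attitude_state(pitch, roll, yaw)
    yaw_err = abs(yaw - TARGET_YAW) / TOLERANCE        # normalized errors (added assumption)
    pitch_err = abs(pitch - TARGET_PITCH) / TOLERANCE
    roll_err = abs(roll - TARGET_ROLL) / TOLERANCE

    reward = 0.0
    if "Y" in state:
        reward += 1.0 - yaw_err ** 0.4
        if "P" in state:
            reward += 1.0 - pitch_err ** 0.4
            if "R" in state:
                reward += 1.0 - roll_err ** 0.4
    return reward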
After training the agent, its performance is compared against that of a pilot. Some of the results from this comparison are presented in the Comparison section. Once the agent is confirmed to perform comparably to a pilot, it can be used as a guideline for students during their training, for example by showing them warnings or calculating a score based on how closely their actions match the trained agent's. It can also be used to prompt and teach them the correct maneuvers and decisions. This phase is still ongoing at DTC. In the following sections, we show the state distribution, the episodic reward, and a recorded video for both the pilot and the agent. To better demonstrate that our agent achieves results comparable to the pilot's, the corresponding results for the agent and the pilot are shown side by side in the Comparison section.
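
To make this first step concrete, the sketch below shows one way such an agent could be trained. The toy attitude dynamics stand in for the real X-Plane connection, and the use of Gymnasium and the stable-baselines3 SAC implementation are assumptions about tooling rather than a description of DTC's actual setup; the reward reuses the straight_and_level_reward sketch above.

import gymnasium as gym
import numpy as np
from stable_baselines3 import SAC


class StraightAndLevelEnv(gym.Env):
    """Observations: [pitch, roll, yaw] in degrees. Actions: elevator, aileron, rudder in [-1, 1]."""

    def __init__(self):
        self.observation_space = gym.spaces.Box(
            low=np.array([-90.0, -180.0, 0.0], dtype=np.float32),
            high=np.array([90.0, 180.0, 360.0], dtype=np.float32),
        )
        self.action_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)
        self.state = None
        self.t = 0

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.t = 0
        # Start near straight and level on the 180-degree target heading.
        self.state = np.array([0.0, 0.0, 180.0]) + self.np_random.uniform(-5.0, 5.0, size=3)
        return self.state.astype(np.float32), {}

    def step(self, action):
        self.t += 1
        # Toy stand-in dynamics: each control input nudges its axis slightly.
        # The real environment would send controls to X-Plane and read back the attitude.
        self.state += 2.0 * np.asarray(action) + self.np_random.normal(0.0, 0.2, size=3)
        reward = straight_and_level_reward(*self.state)  # reward sketch from above
        truncated = self.t >= 400  # fixed-length episodes for this sketch
        return self.state.astype(np.float32), float(reward), False, truncated, {}


model = SAC("MlpPolicy", StraightAndLevelEnv(), verbose=0)
model.learn(total_timesteps=100_000)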


Agent

Since we had recorded data from 68 episodes of flight performed by a pilot, we also looked at 68 episodes of the trained agent. The figure on the left shows the distribution of the agent's states over these 68 episodes. The figure on the right shows the agent's episodic reward based on the reward function described in the Introduction section.

[Figure (left): distribution of the agent's states over 68 episodes. Figure (right): the agent's episodic reward.]
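
For reference, state distributions and episodic rewards like those above could be computed from logged attitude data along the following lines; the log format (one sequence of (pitch, roll, yaw) readings per episode) is an assumption, and the helpers are the reward sketches from the Introduction.

from collections import Counter


def summarize_episodes(episodes):
    """episodes: iterable of per-episode lists of (pitch, roll, yaw) tuples in degrees.
    Returns how often each letter-coded state occurred and the total reward per
    episode, reusing attitude_state and straight_and_level_reward from above."""
    state_counts = Counter()
    episodic_rewards = []
    for episode in episodes:
        total = 0.0
        for pitch, roll, yaw in episode:
            state_counts[attitude_state(pitch, roll, yaw) or "none"] += 1
            total += straight_and_level_reward(pitch, roll, yaw)
        episodic_rewards.append(total)
    return state_counts, episodic_rewards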

Below is a recorded video of the trained agent performing the straight-and-level task. The reward for each timestep is overlaid on the video as it plays. This reward is based on the same reward function used for training the agent and is calculated by reading the pitch, roll, and heading of the airplane from the video using OCR.
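
The per-frame reward overlay could be reproduced with something along these lines; the use of OpenCV and Tesseract, the crop coordinates, and the video file name are all illustrative assumptions that depend on the actual cockpit/HUD layout and recording setup.

import cv2
import pytesseract


def read_value(frame, region):
    """OCR one numeric readout from an (x, y, w, h) crop of a video frame."""
    x, y, w, h = region
    crop = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    text = pytesseract.image_to_string(
        crop, config="--psm 7 -c tessedit_char_whitelist=0123456789.-"
    )
    try:
        return float(text.strip())
    except ValueError:
        return None  # OCR miss; skip this frame


# Hypothetical crop regions; these depend entirely on where the instruments sit on screen.
PITCH_REGION = (100, 50, 60, 25)
ROLL_REGION = (100, 80, 60, 25)
HEADING_REGION = (100, 110, 60, 25)

cap = cv2.VideoCapture("agent_flight.mp4")  # hypothetical file name
while True:
    ok, frame = cap.read()
    if not ok:
        break
    pitch = read_value(frame, PITCH_REGION)
    roll = read_value(frame, ROLL_REGION)
    heading = read_value(frame, HEADING_REGION)
    if None not in (pitch, roll, heading):
        # Reuse the straight_and_level_reward sketch from the Introduction.
        reward = straight_and_level_reward(pitch, roll, heading)
        print(f"timestep reward: {reward:.2f}")  # or overlay it with cv2.putText
cap.release()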

Pilot

A pilot was asked to fly straight and level for 68 episodes. The figure on the left shows the distribution of the airplane's states over these 68 episodes. The figure on the right shows the pilot's episodic reward based on the reward function described in the Introduction section.

[Figure (left): distribution of the airplane's states over the pilot's 68 episodes. Figure (right): the pilot's episodic reward.]

Below is a recorded video of the pilot performing the straight-and-level task.

Comparison

[Figures: the agent's and pilot's state distributions and episodic rewards shown side by side]

Since the agent is trained specifically on the reward function, while the pilot may factor several other considerations into their decision making, the agent reaches a higher reward under our reward function. This alone, however, is not an indication that the agent performs better than the pilot. Moreover, our aim was not to outperform the pilot; our goal was to train an agent that makes decisions similar to the pilot's in the same situations. The figure below therefore better shows that our agent can guide students during training much as a pilot would: it shows that over the course of training, the difference between the actions the agent and the pilot take in similar states decreases.
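
One way to quantify this similarity is the mean difference between the agent's and the pilot's control inputs on matched states, evaluated at successive training checkpoints. The nearest-state matching and array layout below are assumptions about how the logged flights are organized, not the exact metric used.

import numpy as np


def mean_action_gap(agent_states, agent_actions, pilot_states, pilot_actions):
    """agent_states / pilot_states: (N, 3) arrays of [pitch, roll, yaw];
    agent_actions / pilot_actions: matching arrays of control inputs.
    For each pilot state, find the agent's nearest logged state and compare the
    corresponding actions. Lower values mean more pilot-like behaviour."""
    gaps = []
    for state, pilot_action in zip(pilot_states, pilot_actions):
        nearest = np.argmin(np.linalg.norm(agent_states - state, axis=1))
        gaps.append(np.abs(agent_actions[nearest] - pilot_action).mean())
    return float(np.mean(gaps))

Evaluating this gap at successive checkpoints of the agent gives the decreasing curve described above; applying the same function to a student's logged flight in place of the pilot's gives the per-student comparison discussed below.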

Comparing the agent's performance with that of two students, we get the figure below. The actions that Student 2 takes differ more from the agent's actions than Student 1's do. This is a promising result, since Student 2 is on her first trial while Student 1 is on her fifth.

Below on the left is the pilot's recorded flight, and on the right is the agent's recorded flight.