Deep reinforcement learning for task offloading in UAV-assisted smart farm networks

Author: Zhang Bucai

1. Preface

As wireless network communication grows ever more powerful, "knowing the affairs of the world without leaving home" has become the norm, and drones and artificial intelligence are playing an increasingly important role in agriculture. Drones can automatically monitor farmland and perform large numbers of image classification tasks, for example to catch fires or floods before they damage the farm. With current technology, however, drones have limited energy and computing power and may not be able to execute all of these compute-intensive image classification tasks, so improving the drones' capabilities has become a top priority.

2. Related work

The use of reinforcement learning (RL) to manage wireless network resources and optimize performance has been studied extensively across many applications, such as energy management and radio resource allocation. Prior work investigates the challenges and opportunities of AI in 5G and 6G networks, and using AI to achieve energy efficiency in 6G will be essential. In addition, a deep RL approach has been proposed for joint optimization, maximizing computation while minimizing energy consumption through offloading in 5G and later networks.

That work also uses MEC servers as processing units to assist the network with compute-intensive tasks. Similarly, deep RL algorithms have been introduced in IIoT environments to find an optimal strategy for virtual network function placement and scheduling that minimizes end-to-end latency and cost.

Among studies on the use of drones in smart farms, some detail how drones can capture aerial images and use image classification to identify crops and weeds in fields. Others propose using drones to spray pesticides and discuss the trade-offs between latency and battery use.

In 5G and beyond, the simultaneous use of drones and MEC devices benefits many applications. Extensive surveys cover the use of drones and MEC for different applications such as space-air-ground networks and emergency search-and-rescue missions. The possibility of using drones to provide connectivity for 6G connected-car applications has also been discussed.

Existing methods for optimizing drone energy consumption and latency are not limited to smart farm scenarios. For example, one approach reduces power consumption by optimizing user association, power control, computing resource allocation, and location planning. Another considers a network of satellites, drones, ground base stations, and IoT devices and uses deep RL as a task-scheduling solution to minimize processing latency while respecting drone energy limits. Others use clustering and trajectory planning to optimize energy efficiency and task latency.

In addition, game-theoretic solutions have been used in drone swarm scenarios to solve task offloading problems. While we explore a similar problem, we focus on jointly solving the energy and task latency optimization problems through DQL.

3. System model

Our network consists of a set of drones j ∈ J. They can communicate with IoT devices z ∈ Z, with other drones, and with a set of MEC servers l ∈ L. Each drone has a battery with a maximum capacity of ΥBj. Both drones and MEC devices have processing capability; together they form the set of processing units j0 ∈ J+, which handle the tasks of the IoT devices.

At time t ∈ T, IoT devices can offload K types of tasks to the drones (αBjt). Each task type has a predefined deadline αDjt and a processing time αPjt, the time a processing unit needs to perform such a task. The goal is to find a scheduling algorithm for each drone that assigns each task to a processing unit so that the task is completed by its deadline while maximizing the drones' hover time. These two goals combine into our multi-objective maximization problem, maximizing:

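A plausible form of this weighted objective, inferred from the variable definitions in the next paragraph (a sketch, not necessarily the authors' exact formulation), is:

\max \; W \cdot \min_{j_0 \in J^+} \Upsilon^R_{j_0} \;-\; (1 - W)\,\Theta \sum_{j \in J} \sum_{t \in T} v_{jt}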

where W represents the importance of the hover-time objective, ΥRj0 is the remaining battery of a drone, vjt is the number of task deadline violations that have occurred, and Θ is a scaling factor used to normalize v. The first objective maximizes the minimum remaining power in order to extend the hover time of the drone network. The remaining power of a drone ΥRj0 can be calculated as follows:

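A plausible reading of this calculation, based on the definitions below and assuming the per-time-step costs are scaled by the simulation time T (a sketch, not the authors' verbatim equation), is:

\Upsilon^R_{j_0} = \Upsilon^B_{j_0} - T\left(\Upsilon^H_{j_0} + \Upsilon^A_{j_0} + \Upsilon^I_{j_0}\right) - \Upsilon^C_{j_0}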

Here, ΥBj0 represents the battery capacity, ΥHj0 the energy required for the drone to hover, ΥAj0 the energy required for the antenna to transmit, ΥIj0 the energy consumed by the processing unit in idle mode, T the simulation time, and ΥCj0 the energy consumed by the drone when completing tasks.

pjtj0t0 is a binary decision variable equal to 1 if processing unit j0 processes the task. The processing unit latency ∆jt is the total time a task must wait in the processing unit's queue plus the task's processing delay αPjt. The processing unit delay is given by the following formula:

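Based on the description of p+jtj0t0 below, one plausible form of this delay (an assumption about the exact expression) is:

\Delta_{jt} = \sum_{t_0 \in T} p^{+}_{jtj_0t_0}\,(t_0 - t) + \alpha^P_{jt}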

p+jtj0t0 is a binary decision variable equal to 1 if t0 is the time interval at which processing unit j0 starts processing the task; t0 is the time interval at which the task starts processing on processing unit j0, and t is the time interval at which the task arrives at processing unit j.

A deadline violation vjt occurs at time t when the sum of the IoT-to-drone transmission delay Δzjt, the processing unit delay Δjt, and the transmission delay between processing units Δj0t0 exceeds the task's deadline αDjt. This can be expressed mathematically as:

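A plausible formulation of this condition, following the definitions in the surrounding text, is:

v_{jt} = \mathbb{1}\!\left[\Delta_{zjt} + \Delta_{jt} + \Delta_{j_0t_0} > \alpha^D_{jt}\right]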

xjtj0 determines whether the task is completed on processing unit j0: it is set to 1 when the task will be executed on processing unit j0, and 0 otherwise. To avoid the ping-pong effect, a task can only be offloaded once.

In traditional Q-learning, Q values are stored in a Q table. When the agent needs to make a decision, it looks up the current state in the Q table and selects the action with the highest Q value. A Q value measures the future cumulative discounted reward for taking that action in a given state. At each time step, the agent performs an action, observes feedback from the environment, and then updates the Q table to reflect the new knowledge.
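
The standard tabular update behind this process is the Q-learning rule, where α is the learning rate and γ the discount factor:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right]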

In deep Q-learning, we use a deep neural network instead of Q tables. The input to the neural network is the state, and the output is the Q estimate for each action. The agent selects the action with the highest Q estimate. At each time step, the agent performs actions and observes feedback from the environment, which is then used to train the neural network. This approach can handle more complex state spaces and does not require explicit maintenance of Q tables.
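
To illustrate this, the following minimal Python sketch selects a processing unit with an epsilon-greedy policy over Q-values predicted by a small neural network; the state dimension, number of actions, network shape, and the use of PyTorch are illustrative assumptions, not the authors' implementation.

import random
import torch
import torch.nn as nn

# Illustrative sizes (assumptions): the state gathers the task type, per-unit
# latencies, battery levels, and transmission delays; actions are processing units.
STATE_DIM = 16
NUM_ACTIONS = 5  # e.g., 4 drones + 1 MEC server

# Small fully connected network mapping a state to one Q estimate per action.
q_network = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, NUM_ACTIONS),
)

def select_action(state, epsilon):
    # Explore with probability epsilon, otherwise pick the action (processing
    # unit) with the highest predicted Q value.
    if random.random() < epsilon:
        return random.randrange(NUM_ACTIONS)
    with torch.no_grad():
        q_values = q_network(torch.as_tensor(state, dtype=torch.float32))
    return int(torch.argmax(q_values).item())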

In Q-learning, after the agent performs the selected action, the Q value of that state-action pair is updated in the Q table and the agent moves to another state. Because computer memory is limited, Q-learning's state space and action space are limited. In DQL, instead of looking up the Q value in a Q table, we use a DNN to predict the Q value of each action in a given state. After the agent selects and executes an action, the agent's experience is collected.

An experience is a tuple that includes the agent's current state, next state, action, and reward. Experiences are stored in a buffer called the experience replay buffer, which is used to train the DNN. As more experience accumulates, the DNN becomes more accurate at predicting the Q value of each action.
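
A minimal sketch of such a replay buffer and one training step follows; the buffer capacity, batch size, and optimizer choice are assumptions, while the learning rate (0.05) and discount factor (0.85) are the values reported in the evaluation section.

import random
from collections import deque
import torch
import torch.nn as nn

# Same illustrative network shape as in the earlier sketch.
q_network = nn.Sequential(nn.Linear(16, 64), nn.ReLU(),
                          nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 5))
optimizer = torch.optim.Adam(q_network.parameters(), lr=0.05)  # learning rate from the evaluation section
GAMMA = 0.85  # discount factor from the evaluation section
replay_buffer = deque(maxlen=10_000)  # assumed capacity

def store_experience(state, action, reward, next_state):
    # Each experience is a tuple of (current state, action, reward, next state).
    replay_buffer.append((state, action, reward, next_state))

def train_step(batch_size=32):
    # Sample a minibatch and fit the network to the bootstrapped target
    # r + gamma * max_a' Q(s', a').
    if len(replay_buffer) < batch_size:
        return
    batch = random.sample(replay_buffer, batch_size)
    states, actions, rewards, next_states = (
        torch.as_tensor(x, dtype=torch.float32) for x in zip(*batch))
    q_pred = q_network(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_target = rewards + GAMMA * q_network(next_states).max(dim=1).values
    loss = nn.functional.mse_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()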

Each drone in the network has its own MDP framework. Here, drones are agents that receive tasks from IoT devices and must decide where each task will be processed. After the drone sends the task to the chosen processing unit, the drone's battery level changes, as does the latency of the processing unit, and these changes are reported back to the drone. The drone must choose a processing unit that minimizes deadline violations and energy consumption in order to receive the highest reward. The MDP is defined as follows:

State: The state includes the offloaded task type k, the latency of every processing unit ∆j0∈J+, the battery level of each drone ΥLj0∈J, and the transmission delay between each drone and the MEC devices ∆j1∈J+,j2∈J+,t∈T. The state is defined as:

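Given the components enumerated above, the state is plausibly a tuple of the form (an assumption about the exact notation):

s_{jt} = \left(k,\; \{\Delta_{j_0}\}_{j_0 \in J^+},\; \{\Upsilon^L_{j_0}\}_{j_0 \in J},\; \{\Delta_{j_1 j_2 t}\}_{j_1, j_2 \in J^+}\right)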

Reward: The reward function has two parts: a battery reward term ΥLja and a deadline-violation penalty factor (1 − E(vja)) + VLja·E(vja), where E(vja) indicates whether the chosen action led to a deadline violation. ΥLja rewards the agent for actions that do not cause a significant increase in energy consumption, where e is the threshold on the change in energy consumption. VLja penalizes the agent for choosing actions that result in deadline violations.

If a deadline violation could have been avoided by offloading the task to another processing unit, the penalty is harsher. If the violation was unavoidable, the penalty is lighter because no better computing position existed.
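
The described reward logic can be summarized in the short Python sketch below; the concrete penalty values, the battery-reward scale, and the threshold handling are illustrative assumptions that only mirror the behaviour described above.

def reward(battery_reward, energy_increase, e, violated, avoidable,
           harsh_penalty=-1.0, light_penalty=-0.2):
    # Deadline violated: penalize more harshly if another processing unit
    # could have met the deadline, more lightly if the violation was unavoidable.
    if violated:
        return harsh_penalty if avoidable else light_penalty
    # No violation: reward the action if its energy increase stays below the
    # threshold e on energy-consumption change.
    return battery_reward if energy_increase <= e else 0.0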

4. Benchmark methodology

1. Round-robin scheduling (RR): Each device j0∈J+ in the network with a processing unit is assigned an index from 1 to |J+|. The current drone cycles through this ordered list to decide where each task is offloaded.

2. Highest Energy First (HEF): Drones regularly update each other on their battery levels. The current drone first finds the device with the highest remaining charge. If the difference between the current energy level and that maximum energy level exceeds 1%, the task is offloaded to the drone with the highest energy level; otherwise the task is computed locally.

Since the MEC device has unlimited power, we must limit how often tasks are sent to the MEC; the selection probability for each MEC device is 1/|J+|.

3. Minimum Queue Time and Highest Energy First (QHEF): Drones regularly update each other on their battery levels and queue times. The algorithm first finds the shortest queue time. The drone then finds the device with the highest energy level among those whose queue time is less than or equal to this minimum. If that maximum energy level exceeds the current energy level by the threshold, the current drone offloads the task to this device; otherwise, the drone computes the task locally. (A short sketch of these three heuristics follows item 4 below.)

4. Q-Learning: We use the Q-Learning algorithm with the same action set, reward function, and epsilon-greedy strategy defined for the proposed algorithm. The Q-Learning baseline has the same state, but without the transmission delays ∆j1∈J+,j2∈J+,t∈T.
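
As referenced in item 3, the following Python sketch illustrates the three heuristic baselines (RR, HEF, QHEF); the device fields, the energy scale, and encoding the 1% rule as a 0.01 threshold on normalized energy are assumptions used only for illustration.

from itertools import cycle

def round_robin(units):
    # RR: cycle through the ordered list of processing units indefinitely.
    order = cycle(units)
    while True:
        yield next(order)

def highest_energy_first(local, units, threshold=0.01):
    # HEF: offload to the unit with the most remaining energy if it exceeds
    # the local drone's level by more than the threshold; else compute locally.
    best = max(units, key=lambda u: u["energy"])
    return best if best["energy"] - local["energy"] > threshold else local

def qhef(local, units, threshold=0.01):
    # QHEF: among units whose queue time is no worse than the minimum,
    # pick the highest-energy one; offload only if it beats local by the threshold.
    min_queue = min(u["queue_time"] for u in units)
    candidates = [u for u in units if u["queue_time"] <= min_queue]
    best = max(candidates, key=lambda u: u["energy"])
    return best if best["energy"] - local["energy"] > threshold else local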

5. Performance evaluation

We used Simu5G, a 5G network simulator running on OMNeT++, to simulate our smart farm network. The simulation has four drones (J = 4) and one MEC device (L = 1). There are three task types: fire detection, pest detection, and growth monitoring.

Task arrival intervals are modeled as exponentially distributed, and each task type has a unique average arrival rate and processing time.
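
For illustration, exponentially distributed inter-arrival times with a per-task-type mean can be generated as in the sketch below; the mean values are placeholders, not the parameters used in the paper.

import numpy as np

rng = np.random.default_rng(seed=1)

# Placeholder mean inter-arrival times (seconds) for each task type.
mean_interarrival = {"fire": 5.0, "pest": 10.0, "growth": 20.0}

def arrival_times(task_type, horizon):
    # Generate arrival timestamps for one task type up to the simulation horizon.
    t, times = 0.0, []
    while True:
        t += rng.exponential(mean_interarrival[task_type])
        if t > horizon:
            return times
        times.append(t)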

The remaining power and deadline violation results are averaged over ten runs with different seed values. For Q-Learning and deep Q-Learning, we assume a learning rate of 0.05 and a discount factor of 0.85. For comparison with reference [6], we used their energy consumption model and parameters, making the same assumptions about battery type and the hover power formula.

The energy parameters of the simulated drones over the entire simulation are: maximum battery capacity (ΥBj0) equal to 570, hovering (ΥHj0) equal to 211, antenna equal to 17, idle processing unit equal to 4320, and active processing unit equal to 12960.

6. Conclusion

We propose an algorithm based on deep reinforcement learning that improves the convergence speed of the existing Q-Learning algorithm. The deep learning component also allows us to incorporate more observations into the state, so our decision-making algorithm has more information than Q-Learning. We compare the proposed algorithm with four benchmark algorithms, RR, HEF, QHEF, and Q-Learning; the results show that the DQL algorithm converges 13 times faster than Q-Learning.

Finally, DQL is comparable to Q-Learning in terms of the percentage of remaining energy and the percentage of deadline violations. It is therefore a better solution to our joint optimization problem, reaching the optimal solution faster than Q-Learning. In the future, we plan to further reduce convergence time and address scalability issues.

Bibliography:

[1] A. D. A. Aldabbagh, C. Hairu, and M. Hanafi, "Classifying Pepper Plant Growth Using Deep Learning," Proceedings of the 10th IEEE International Conference on Systems Engineering and Technology (ICSET) 2020, pp. 213–217, IEEE, November 2020.

[2] Y. Lina and Y. Xiuming, "Design of Intelligent Pest Monitoring System Based on Image Classification Algorithm," Proceedings of the 3rd International Conference on Control and Robotics (ICCR) 2020, pp. 21–24, IEEE, December 2020.

[3] J. Zhao, Y. Wang, Z. Fei and X. Wang, "UAV Deployment Design with Latency Constraints in Smart Farms," Proceedings of the 2020 IEEE/CIC China International Conference on Communications (ICCC), pp. 424–429, IEEE, August 2020.

[4] S. Zhang, H. Zhang, and L. Song, "Beyond D2D: Full Dimension UAV-to-Everything Communications in 6G," IEEE Transactions on Vehicular Technology, vol. 69, pp. 6592–6602, 2020.