CNS Core: Medium: Collaborative: Exploring and Exploiting Learning for Efficient Network Control: Non-Stationarity, Inter-Dependence, and Domain-Knowledge

Project Team

Investigators
  • Prof. Xin Liu, UC Davis
  • Prof. Ness Shroff, Ohio State University
  • Prof. Zhi-Li Zhang, University of Minnesota

 

Project Goals

Traditionally, network protocols and control mechanisms are designed and engineered in accordance with certain theoretical models or design principles, under (often simplifying) assumptions about the network environment in which they operate. Network operations are mostly performed by operators through manual configuration of control parameters and resources, sometimes guided by measurement analysis and performance optimization. With an increasingly wide range of applications and complex network scenarios, these traditional methods do not always perform well. To address this challenge, machine learning techniques have been applied to a wide range of networking and distributed systems problems, from reducing data center cooling costs to traffic optimization and application management. While preliminary results are promising, applying machine learning techniques to networking poses many important research questions that must be explored systematically and in depth. The proposed research constitutes an important first step toward a principled understanding of the fundamental limitations and promising new opportunities in learning-based network control, from both theoretical and practical perspectives. It will help advance the emerging visions of self-driving networks and AIOps, and bring benefits to network operators, users, and society at large. This project also allows the PIs to continue their strong efforts to integrate research with education, broaden participation in computing, recruit and train female and under-represented students, and conduct outreach to K-12.

Networks are a collection of control and (distributed) data plane elements that operate at different time scales on diverse types of data, and that respond and adapt to changes in traffic demands and network state to achieve disparate objectives. Networking environments are highly dynamic and uncertain, with non-stationarity caused by surges and time-of-day changes in traffic demands and by unpredictable network failures; they are also inherently correlated, inter-dependent, and constrained, in part due to complex interactions among various network entities. Moreover, networks are engineered systems: there are basic principles that govern their design and operation, with constraints that cannot be violated and inherent relations that could yield substantial performance gains.

The proposed research focuses on learning-based network control problems that address these challenges along the following inter-related research thrusts. In Thrust 1, Network-Centric Learning Techniques, the PIs explore the fundamental limits (from a theoretical perspective) and advance new network-centric ML techniques for non-stationary, correlated, and constrained environments. In Thrust 2, Network-wide Learning-based Control and Horizontal/Vertical Interactions, the PIs study and develop innovative learning-based network control algorithms in a network-wide framework by exploiting (horizontal and vertical) interactions and leveraging shared learning. Last but not least, in the Evaluation Thrust, the PIs evaluate the proposed learning-based network control algorithms and compare them with conventional optimization and other ML-based approaches.

 

Accomplishments

Major Activities

Opportunistic Learning for an Efficient Exploration-Exploitation Tradeoff: For learning techniques to be useful in network control, learning efficiency is critical. To achieve this goal, we propose and investigate opportunistic learning, which reduces the cost of exploration by leveraging network dynamics. Existing RL algorithms typically assume that the cost of exploration, i.e., of trying a suboptimal action, remains the same over time. While this is a valid assumption in many existing applications, there is a variety of applications in which the actual cost of trying a suboptimal action varies with external conditions. In networks, traffic fluctuates dynamically, and because of these fluctuations the cost of exploration may vary significantly. Consider the example of finding the best routing configuration under peak load. When the network load is low, we can inject dummy traffic into the network so that the total load (the real load plus the dummy load) resembles the peak load. This allows us to learn the performance of a configuration under peak load, while the cost (regret) of using a suboptimal configuration is low because little real load is affected. Furthermore, we can give the dummy traffic lower priority than the real traffic to further reduce the real cost of exploration. Motivated by this observation, we have studied, as proposed, opportunistic learning algorithms that exploit less costly exploration opportunities efficiently and with provable guarantees.

A contextual bandit is a particularly useful version of the multi-armed bandit in which certain side information, called contextual information, is provided before making the decision in each time slot. Contextual bandits have been widely used for personalization and customization. When an external factor, such as network load, exists in the contextual bandit problem, the benefit of opportunistic learning also arises. Based on this intuition, we have developed and analyzed an Adaptive Upper-Confidence-Bound algorithm (AdaLinUCB) that dynamically balances the exploration-exploitation trade-off in opportunistic contextual bandits. To the best of our knowledge, this is the first work to study opportunistic learning for contextual bandits. We focus on problem-dependent bound analysis, a setting in which a better bound can be achieved under stronger assumptions; to the best of our knowledge, such a bound did not previously exist for LinUCB in the literature. We prove problem-dependent bounds for both the proposed AdaLinUCB and the traditional LinUCB algorithms: both have a regret upper bound of O((log T)^2), and the coefficient of the AdaLinUCB bound is smaller than that of LinUCB. Furthermore, using both synthetic and real-world large-scale datasets, we show that AdaLinUCB significantly outperforms other contextual bandit algorithms under large exploration cost fluctuations.
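
To make the opportunistic idea concrete, the sketch below shows a minimal LinUCB-style learner whose exploration width is scaled down when the external load (and hence the cost of exploring) is high. It illustrates the principle only; the exact AdaLinUCB confidence-width modulation and its analysis are given in the paper, and the class and parameter names here (OpportunisticLinUCB, alpha_max, load) are ours.

    import numpy as np

    class OpportunisticLinUCB:
        """Minimal LinUCB variant whose exploration width shrinks when the
        external load (i.e., the cost of exploring) is high, in the spirit of
        AdaLinUCB. Not the exact published update rule."""

        def __init__(self, dim, alpha_max=1.0, lam=1.0):
            self.A = lam * np.eye(dim)      # ridge-regularized Gram matrix
            self.b = np.zeros(dim)          # accumulated reward-weighted contexts
            self.alpha_max = alpha_max      # maximum exploration coefficient

        def select(self, contexts, load):
            """contexts: (n_arms, dim) array; load in [0, 1], 1 = peak load."""
            A_inv = np.linalg.inv(self.A)
            theta = A_inv @ self.b
            # Explore aggressively only when the load (exploration cost) is low.
            alpha = self.alpha_max * (1.0 - load)
            widths = np.sqrt(np.einsum('ij,jk,ik->i', contexts, A_inv, contexts))
            return int(np.argmax(contexts @ theta + alpha * widths))

        def update(self, context, reward):
            self.A += np.outer(context, context)
            self.b += reward * context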

Vertical Interaction among Network Functions: We initiated an investigation of effective ML-based control for vertically interacting network functions. Specifically, we have implemented an experimental environment that realizes state-of-the-art reinforcement-learning-based ABR (adaptive bit rate) and TCP congestion control algorithms. We have experimented with different bandwidth, network condition, and delay settings, as well as user actions, at the congestion control and application layers. Our experiments show that the current algorithms interact well together, with no observable negative interaction. This result suggests that ML-based network control can be promising even in complex environments with vertical interactions.

Best k-item Selection for Online Learning: In a number of networking applications it is important to select the best k items from a larger set of candidates (e.g., routing, web ranking, etc.). In a recent work [Ren-Shroff2020], we study sample complexity (i.e., the number of comparisons) bounds for active best-k item selection from pairwise comparisons. From a given set of items, the learner can make pairwise comparisons on every pair of items, and each comparison returns an independent noisy result about the preferred item. At any time, the learner can adaptively choose a pair of items to compare according to past observations (i.e., active learning). The learner's goal is to find the (approximately) best k items with a given confidence while using as few comparisons as possible. We study two problems: (i) finding the probably approximately correct (PAC) best k items and (ii) finding the exact best k items, both under strong stochastic transitivity and the stochastic triangle inequality. For PAC best-k item selection, we first show a lower bound and then propose an algorithm whose sample complexity upper bound matches the lower bound up to a constant factor. For exact best-k item selection, we first prove a worst-instance lower bound. We then propose two algorithms based on our PAC best-item selection algorithms: one works for k = 1 and is sample-complexity optimal up to a log log factor, and the other works for all values of k and is sample-complexity optimal up to a log factor.
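
For intuition, the following toy routine illustrates PAC best-item (k = 1) selection from noisy pairwise comparisons via a single-elimination pass, with a Hoeffding-style number of duels per comparison. It is a simplified illustration rather than the algorithms analyzed in [Ren-Shroff2020]; duel() and the preference matrix pref are simulated stand-ins.

    import math, random

    def duel(i, j, pref):
        """Simulate one noisy comparison: returns True if item i beats item j.
        pref[i][j] is the probability that i is preferred over j."""
        return random.random() < pref[i][j]

    def knockout_best_item(items, pref, eps=0.05, delta=0.05):
        """Single-elimination tournament for (eps, delta)-PAC best-item selection
        from pairwise comparisons. A simplified illustration, not the algorithm
        analyzed in [Ren-Shroff2020]."""
        champion = items[0]
        rounds = len(items) - 1
        # Enough duels per comparison so each comparison errs w.p. <= delta/rounds.
        m = math.ceil(math.log(2 * rounds / delta) / (2 * eps ** 2))
        for challenger in items[1:]:
            wins = sum(duel(challenger, champion, pref) for _ in range(m))
            if wins > m / 2:
                champion = challenger
        return champion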

Prediction and Learning for Predictive Caching: We have also been studying how to use prediction and learning for predictive caching in order to reduce delay in wireless networks. We explore the possibility of caching at wireless end-users, where caches are typically very small, orders of magnitude smaller than the catalog size. We develop a predictive multicasting and caching scheme in which the transmitter proactively multicasts popular content for end-users to cache and access locally when requested. We analyze the impact of this joint multicasting and caching on delay performance. Our analysis uses a novel application of heavy-traffic theory under the assumption of vanishing caches to show that predictive caching fundamentally alters the asymptotic throughput-delay scaling. This in turn translates into a several-fold delay improvement in simulations over the on-demand unicast baseline as the network operates close to full load. We highlight a fundamental delay-memory trade-off in the system and identify the correct memory scaling to fully benefit from the network multicasting gains.

Constrained Reinforcement Learning: Reinforcement learning (RL) involves an agent repeatedly interacting with an environment modeled by a Markov Decision Process (MDP), which is characterized by its state space, action space, and controlled transition probabilities. At each time slot the agent observes the current state of the environment and applies a control action, for which it earns a reward. The agent does not know the controlled transition probabilities that describe the system dynamics of the environment. This framework has many applications in network control, from routing to scheduling to congestion control. In our recent work, we consider RL in MDPs in which an agent repeatedly interacts with an environment modeled by a controlled Markov process. We design learning algorithms that maximize the cumulative reward earned over a time horizon of T time steps, while simultaneously ensuring that the average values of the M cost expenditures remain bounded. This is the constrained RL problem, which is critical for modeling networking constraints (e.g., capacity, expected delays, etc.). Further, the consideration of cumulative cost expenditures differs from the existing literature in that the agent now additionally needs to balance the cost expenses in an online manner, while simultaneously performing the exploration-exploitation trade-off typically encountered in RL tasks.
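
A common way to attack such constrained RL problems is a Lagrangian primal-dual scheme: learn with a penalized reward while multipliers on the cost constraints are raised whenever they are violated. The tabular sketch below illustrates this generic idea under an assumed env interface (reset()/step() returning reward and a cost vector); it is not the specific algorithm or regret analysis developed in this project.

    import numpy as np

    def primal_dual_q_learning(env, n_states, n_actions, budgets,
                               episodes=2000, gamma=0.99, lr=0.1, dual_lr=0.01):
        """Lagrangian-based tabular Q-learning for a constrained MDP: maximize
        reward while keeping average costs below `budgets`. A generic sketch;
        `env` is assumed to expose reset() -> state and
        step(a) -> (next_state, reward, costs, done)."""
        Q = np.zeros((n_states, n_actions))
        lam = np.zeros(len(budgets))            # one multiplier per constraint
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                a = int(np.argmax(Q[s])) if np.random.rand() > 0.1 \
                    else np.random.randint(n_actions)
                s2, r, costs, done = env.step(a)
                # Penalized reward: trade off reward against constraint violations.
                r_lag = r - float(lam @ np.asarray(costs))
                target = r_lag + gamma * np.max(Q[s2]) * (not done)
                Q[s, a] += lr * (target - Q[s, a])
                # Dual ascent: raise multipliers on violated constraints.
                lam = np.maximum(0.0, lam + dual_lr * (np.asarray(costs) - budgets))
                s = s2
        return Q, lam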

Mapping and Predicting Commercial mmWave 5G Throughput: 2019 saw the commercial deployment of 5G services in the US and worldwide, with the majority using 5G NR mid-band frequencies and a few using 5G NR millimeter wave (mmWave) high-band frequencies (e.g., Verizon in the US). We are particularly interested in mmWave 5G performance for several reasons. First, the ultra-high bandwidth (theoretically up to 20 Gbps) of mmWave 5G offers exciting new opportunities to support a variety of emerging and future bandwidth-intensive applications expected of the 5G eMBB service. On the other hand, theoretical analysis, simulation studies, controlled experiments, and limited field testing have made it widely known that mmWave radios face many technical challenges, making the design and management of 5G services based on mmWave radio a daunting task. For example, due to the directionality and limited range of mmWave radio and its high sensitivity to obstructions (e.g., surrounding buildings, moving bodies, foliage, etc.), establishing and maintaining a stable communication link with user equipment (UE) can be difficult, especially when the UE is moving. In this research effort, we apply machine learning to the problem of 5G throughput prediction at the UE side. Due to the vagaries of wireless signals, data-driven machine learning (ML) models have been proposed for 3G/4G cellular network management; given the diverse array of impact factors and their complex interplay, the need for ML models for 5G networks is even more acute. However, instead of blindly applying machine learning to 5G throughput prediction, we seek to answer a few basic questions: i) Is mmWave 5G throughput predictable, and to what extent? ii) What key UE-side factors (or features) most affect 5G throughput? iii) In order to capture these key factors, what types of ML models are best suited for 5G throughput prediction? In particular, can we develop ML models that are explainable? To this end, we carefully design our measurements under various settings (e.g., selecting indoor and outdoor locations, and considering both stationary and mobility scenarios at various speeds), and conduct extensive, reproducible, and repeated experiments for data collection, throughput characterization, and factor analysis. Based on these results, we propose and develop novel ML models for 5G throughput prediction on the UE side.
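
The prediction framework itself is conceptually simple: group UE-side features, train a model on a chosen combination of groups, and evaluate held-out error. The sketch below, using scikit-learn with hypothetical column names (e.g., rsrp, speed, throughput_mbps), conveys this composable-feature-group idea; the actual models, features, and tuning are described in our paper.

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_absolute_error

    # Hypothetical (numeric) column names; the real feature set comes from our measurements.
    FEATURE_GROUPS = {
        "location": ["latitude", "longitude"],
        "mobility": ["speed", "heading"],
        "tower":    ["distance_to_tower", "angle_to_tower"],
        "radio":    ["rsrp", "rsrq", "snr"],
    }

    def train_throughput_model(df, groups=("location", "mobility", "radio")):
        """Train a regressor on a chosen combination of feature groups and report
        held-out error. Illustrative only; the deployed framework also supports
        classification into throughput classes."""
        features = [c for g in groups for c in FEATURE_GROUPS[g]]
        X, y = df[features], df["throughput_mbps"]
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
        model = RandomForestRegressor(n_estimators=200, random_state=0)
        model.fit(X_tr, y_tr)
        print("MAE (Mbps):", mean_absolute_error(y_te, model.predict(X_te)))
        return model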

Despite the COVID-19 pandemic, the entire project team has continued to collaborate closely, with biweekly Zoom meetings and (formal and informal) joint student supervision. For example, continuing what we started last year, PI Zhang of the UMN team (NSF Project #1901103) has led the experimental research effort, with active participation from both the OSU (Shroff) and UC Davis (Liu) teams. The research challenges and problems uncovered through these measurement and experimental efforts, as well as the datasets collected, have spurred a number of collaborative research activities. In the following we describe several major activities carried out during this report period; many of them are still ongoing.

Measurement of Commercial 5G Services and Data Collection for Machine Learning-driven 5G and Beyond Network Control: Continuing the efforts we started earlier, we have carried out an in-depth study to collect datasets and performance metrics that characterize the performance of the commercial 5G services being rapidly rolled out. Our goals are multi-fold. We want to understand how 5G impacts application performance and energy efficiency, in particular the challenges posed by emerging 5G services as well as the opportunities they offer, with an aim to develop novel machine learning based methods for 5G network control. We are particularly interested in examining the effects of 5G deployment strategies such as Non-Standalone (NSA) and Standalone (SA) 5G, radio bands, and protocol-specific properties (e.g., RRC state transitions and power profiles) on the performance and power usage experienced by user equipment (smartphones) and the applications running on them. Last but not least, we need large amounts of data to enable and drive the various research tasks proposed in this project.

5G-Aware Video Streaming Applications: We have started to study video streaming over 5G mmWave through a three-campus collaborative effort, leveraging PI Zhang's expertise in real-world mmWave measurement and analysis, PI Shroff's expertise in theoretical analysis of learning algorithms, and PI Liu's expertise in practical applications of learning algorithms in networking. Postdoc Arnob Ghosh and the three PIs have met regularly to brainstorm and discuss various challenges and methodologies. Among the topics discussed, we have identified and formulated the problem of video-streaming ABR control over mmWave. As shown in our past and current measurement and experimental studies, existing ABR control does not perform well over mmWave because of the high volatility in available bandwidth due to environmental changes. We have advanced several novel mechanisms for making video streaming applications 5G-aware, and have formulated a reinforcement learning problem that specifically targets this high volatility. These research efforts are still ongoing.

Novel Bandit Algorithms for 4G/5G Handoffs: In this report period, we have studied how novel bandit algorithms can improve handoff performance under extreme mobility. Nowadays, the demand for seamless Internet access under extreme user mobility, such as on high-speed trains and vehicles, has become the norm rather than the exception. However, the 4G/5G mobile network cannot always reliably meet this demand, with non-negligible failures during handovers between base stations. A fundamental challenge is to balance the exploration of more measurements for a satisfactory handover against the exploitation of a timely handover before the fast-moving user leaves the serving base station's coverage. In this work, we start with a simple question: can the mobile network automatically learn a reliable handover policy for its users under extreme mobility? We formulate the exploration-exploitation trade-off in extreme mobility as a composition of two online learning problems, and we show how multi-armed bandits help solve both problems with a provable O(log J log T) regret, where J is the number of arms and T is the time horizon. Our theoretical analysis and tests over a real LTE dataset from high-speed rail confirm the benefit of online learning for reliable extreme mobility. This is a collaborative work involving PIs Liu and Shroff.
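
As a concrete illustration of the second sub-problem (choosing which target cells to measure), the sketch below runs Beta-Bernoulli Thompson Sampling over candidate cells, treating a reliable handover as a success. It captures only the bandit component in simplified form; BaTT's threshold tuning via ε-Binary-Search-First is omitted, and measure_cell is a hypothetical measurement callback.

    import numpy as np

    def thompson_sampling_cell_selection(measure_cell, n_cells, horizon):
        """Beta-Bernoulli Thompson Sampling over candidate target cells, where a
        'success' means the measured cell supported a reliable handover. A generic
        sketch of the bandit component only."""
        successes = np.ones(n_cells)   # Beta(1, 1) priors
        failures = np.ones(n_cells)
        for _ in range(horizon):
            samples = np.random.beta(successes, failures)
            cell = int(np.argmax(samples))
            ok = measure_cell(cell)    # returns True on a reliable handover
            successes[cell] += ok
            failures[cell] += 1 - ok
        # Report the cell with the best empirical success rate.
        return int(np.argmax(successes / (successes + failures)))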

Network Slicing via Constrained Reinforcement Learning: With the proliferation of mobile networks, we face a strong diversification of services that demands a more flexible network. To satisfy this need, network slicing has been embraced as a promising solution for resource utilization in 5G and future networks. However, this process is complicated, and traditional approaches cannot effectively perform resource orchestration due to the lack of accurate models and the presence of dynamic hidden structures. We formulate the resource allocation problem as a constrained Markov Decision Process and solve it using constrained reinforcement learning. Specifically, we use adaptive interior-point policy optimization and projection layers to handle cumulative and instantaneous constraints, respectively. Our evaluations show that our method is effective in resource allocation and outperforms baselines. The work is ongoing and is a collaboration involving PIs Liu and Zhang.
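
To illustrate how a projection layer can enforce an instantaneous constraint, the sketch below projects a raw per-slice allocation produced by the policy onto the feasible set {x >= 0, sum(x) <= capacity} using the standard Euclidean simplex projection. This is a generic illustration of the idea, not our exact network-slicing implementation.

    import numpy as np

    def project_allocation(raw_action, capacity):
        """Project a raw (possibly infeasible) per-slice resource allocation onto
        {x >= 0, sum(x) <= capacity}. Cumulative constraints are handled separately
        by the interior-point policy update."""
        x = np.maximum(raw_action, 0.0)
        if x.sum() <= capacity:
            return x
        # Euclidean projection onto the simplex {x >= 0, sum(x) = capacity}.
        u = np.sort(x)[::-1]
        css = np.cumsum(u)
        rho = np.nonzero(u * np.arange(1, len(x) + 1) > (css - capacity))[0][-1]
        theta = (css[rho] - capacity) / (rho + 1.0)
        return np.maximum(x - theta, 0.0)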

Minimizing the Age of Information over Heterogeneous Channels: We are also studying the problem of minimizing the age of information when a source can transmit status updates over two heterogeneous channels. Our work is motivated by recent developments in 5G mmWave technology, where transmissions may occur over an unreliable but fast (e.g., mmWave) channel or a slow but reliable (e.g., sub-6 GHz) channel. The unreliable channel is modeled as a time-correlated Gilbert-Elliot channel, where information can be transmitted at a high rate when the channel is in the "ON" state. The reliable channel provides a deterministic but lower data rate. The scheduling strategy determines which channel to use for each transmission, with the aim of minimizing the time-average age of information (AoI). The optimal scheduling problem is formulated as a Markov Decision Process (MDP), which in our setting poses significant challenges because, e.g., supermodularity does not hold over part of the state space. We show that there exists a multi-dimensional threshold-based scheduling policy that is optimal for minimizing the age. A low-complexity bisection algorithm is further devised to compute the optimal thresholds. Numerical simulations are provided to compare different scheduling policies.
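
The bisection step is standard: given a quantity that is monotone in the threshold and changes sign at the optimum, halve the search interval until convergence. The sketch below shows this generic routine; the actual function being bisected comes from our MDP analysis, and switch_gain here is a hypothetical placeholder.

    def bisect_threshold(switch_gain, lo, hi, tol=1e-6):
        """Generic bisection used to locate a scheduling threshold: switch_gain(x)
        is assumed monotone in x, positive below the optimal threshold and negative
        above it. Illustrates only the bisection step of the algorithm."""
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if switch_gain(mid) > 0:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)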

Specific Objectives

SyncCoding for Efficient Data Transmission: In a variety of network applications, there exists a significant amount of shared data between two end hosts. Examples include data synchronization services that replicate data from one node to another. Given that shared data may be highly correlated with the new data to be transmitted, we study how such shared data can best be utilized to improve the efficiency of data transmission [Lee-Shroff2020]. To answer this, we develop an inter-data encoding technique, SyncCoding, that effectively replaces bit sequences of the data to be transmitted with pointers to their matching bit sequences in the shared data, the so-called references. By doing so, SyncCoding can reduce data traffic, speed up data transmission, and save transmission energy. Our evaluations of SyncCoding implemented in Linux show that it outperforms popular existing encoding techniques: Brotli, LZMA, Deflate, and Deduplication. The gains of SyncCoding over these techniques in terms of compressed data size are about 12.5%, 20.8%, 30.1%, and 66.1% in a cloud storage scenario, and about 78.4%, 80.3%, 84.3%, and 94.3% in a web browsing scenario, respectively.
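
The sketch below conveys the core inter-data encoding idea with a toy byte-level encoder/decoder: substrings of the outgoing data that also appear in a shared reference are replaced by (offset, length) pointers, and everything else is sent as literals. The real SyncCoding operates at the bit level with carefully selected references and a compact pointer format, so this is an illustration only.

    def encode_against_reference(data: bytes, reference: bytes, min_match=8):
        """Toy inter-data encoder in the spirit of SyncCoding: greedily replace
        substrings of `data` that also occur in `reference` with (offset, length)
        pointers, keeping the rest as literals."""
        out, i = [], 0
        while i < len(data):
            best_len, best_off = 0, -1
            length = min_match
            # Try progressively longer matches starting at position i.
            while i + length <= len(data):
                off = reference.find(data[i:i + length])
                if off < 0:
                    break
                best_len, best_off = length, off
                length += 1
            if best_len >= min_match:
                out.append(("ref", best_off, best_len))
                i += best_len
            else:
                out.append(("lit", data[i:i + 1]))
                i += 1
        return out

    def decode(tokens, reference: bytes) -> bytes:
        """Reconstruct the original data from literals and reference pointers."""
        parts = []
        for t in tokens:
            if t[0] == "ref":
                _, off, length = t
                parts.append(reference[off:off + length])
            else:
                parts.append(t[1])
        return b"".join(parts)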

Approximate Gradient Coding in Large Machine Learning Applications: While large-scale machine learning has shown great promise for solving many practical problems, these applications require massive training datasets and model parameters, forcing practitioners to adopt distributed computing frameworks over a communication network to increase learning speed. However, the speedup gain is far from ideal due to the latency incurred in waiting for a few slow servers (called stragglers), delayed by network issues or other problems, to complete their tasks. For example, it has been observed that a straggler may run 8x slower than the average worker on Amazon EC2. To alleviate the straggler issue, current frameworks such as Hadoop deploy various straggler detection techniques and usually replicate the straggling tasks on other available nodes. Recently, gradient coding techniques have been proposed as an effective way to deal with stragglers in distributed learning applications. In the distributed gradient coding problem, which appears in a large number of these distributed ML problems, it has been established that, to exactly recover the gradient under s slow machines, the minimum computation load (number of stored data partitions) of each worker is at least s + 1, i.e., linear in s, which incurs a large overhead when s is large. In our recent work [WangShroff2019], we focus on approximate gradient coding, which aims to recover the gradient with bounded error ϵ. We outline our key results in the section on Significant Results.
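
The flavor of approximate gradient coding can be seen in the toy sketch below: each data partition is replicated on d workers, and the server forms an approximate full gradient from whichever workers respond, dropping partitions that no responder covers. Our actual schemes use a random edge-removal construction and a more careful decoder, so this is illustrative only; the function and variable names are ours.

    import numpy as np

    def cyclic_assignment(n_partitions, n_workers, d):
        """Give each worker d consecutive partitions (cyclically), so every
        partition is replicated on d workers."""
        return [[(w + j) % n_partitions for j in range(d)] for w in range(n_workers)]

    def approximate_full_gradient(worker_grads, n_partitions, dim):
        """worker_grads: {worker_id: {partition_id: gradient vector}} from the
        workers that responded in time. Partitions covered by no responder are
        simply dropped, which is the source of the bounded error eps."""
        grad = np.zeros(dim)
        for p in range(n_partitions):
            covering = [g[p] for g in worker_grads.values() if p in g]
            if covering:
                grad += np.mean(covering, axis=0)
        return grad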

Learning Universal and Transferable Graph Neural Network Embeddings: Learning powerful data embeddings has become a centerpiece of machine learning. Network data arising from various network control problems are naturally graph-structured; how to develop powerful data embeddings (especially in an unsupervised manner) that can be used to effectively tackle network control tasks is a key challenge we are addressing. Building on the success of deep learning on images and text, graph neural networks (GNNs) have recently been developed for various learning tasks on graph-structured datasets. Most existing GNNs are task-specific in the sense that they are trained on datasets via supervised (or semi-supervised) learning with task-specific labels, so the trained models cannot be directly applied to other tasks. In addition, they are often limited to learning node embeddings (based on a node's local structure within an underlying graph), as opposed to graph embeddings (a single embedding for the whole graph), which are more general purpose, e.g., for graph classification or graph generation tasks; a graph embedding can also serve as a node embedding by operating on a node's ego-graph. We have proposed a deep universal graph embedding neural network (DUGnn) framework that is capable of the following: 1) it can be trained on diverse datasets (e.g., with different node features) for a variety of tasks in an unsupervised fashion to learn task-independent graph embeddings; 2) the learned graph embedding model can be shared across different datasets, thus enjoying the benefits of transfer learning; and 3) the learned model can further be adapted and improved for specific tasks using adaptive supervised learning. Our DUGnn model incorporates a novel graph neural network (as a universal graph encoder) and leverages rich graph kernels (as a multi-task graph decoder) for both unsupervised learning and (task-specific) adaptive supervised learning.
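
For intuition on the encoder side, the sketch below implements a tiny message-passing graph encoder in NumPy: each layer averages neighbor features (with self-loops) and applies a learned linear map with a ReLU, and a sum readout produces a single whole-graph embedding. It illustrates only the generic GNN-encoder-plus-readout structure; DUGnn's actual encoder, graph-kernel-based multi-task decoder, and training objectives are described in our paper.

    import numpy as np

    def graph_embedding(adj, feats, weights, n_layers=2):
        """Tiny message-passing encoder: adj is the (n, n) adjacency matrix,
        feats the (n, f) node-feature matrix, and weights a list of per-layer
        linear maps. Returns one embedding vector for the whole graph."""
        a_hat = adj + np.eye(adj.shape[0])                # add self-loops
        a_norm = a_hat / a_hat.sum(axis=1, keepdims=True) # row-normalized aggregation
        h = feats
        for l in range(n_layers):
            h = np.maximum(a_norm @ h @ weights[l], 0.0)  # aggregate, transform, ReLU
        return h.sum(axis=0)                              # graph-level (sum) readout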

Network Anomaly Detection using Network Telemetry: As networks become increasingly complex, network operation and management (OAM) tasks such as capacity planning, security, network health monitoring, and troubleshooting are critical to ensure normal operation of business services. Traditionally, network operators rely on mechanisms and tools such as the Simple Network Management Protocol (SNMP), Syslog, and the CLI, which provide only slow (e.g., every 5 minutes), coarse-grain, incomplete, and often specific information that is hard to operationalize; these limitations restrict network automation. To address these drawbacks, Model-Driven Telemetry (MDT) has been developed in recent years and is widely adopted by network vendors. MDT leverages the power of data models to implement a high-frequency (on the order of seconds), push-based data collection approach that provides greater scalability. In this work, we explore the feasibility of leveraging granular MDT data streams to provide fine-grained visibility into network devices, enable data processing in near real time, and develop new approaches for designing OAM solutions. As a case study, we study anomaly detection using BGP MDT data from a large data center. Our work builds on and improves OutlierDenStream, which employs a clustering approach, DenStream, for detecting anomalies and raising alarms. In OutlierDenStream, each node in the network operates individually: an alarm is raised upon reception of k consecutive outlier samples at a node, and for multiple nodes during the same time slot. In this work, we first engineer a systematic MDT data processing and feature selection framework that is portable across vendors. By leveraging fine-grain MDT data, we then develop a novel unsupervised anomaly detection engine that incorporates time- and network-proximity-based heuristics. Our engine intelligently combines node-level alarms, which decreases false positives while increasing true positives, and therefore reduces the attention burden placed on network operators for verifying anomalies and making decisions.
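
The time-proximity idea is easy to state: escalate to a network-level alarm only when enough distinct nodes alarm within a short window. The sketch below is a simplified stand-in for the TP(k, delta) mechanism described under Significant Results; the exact correlation logic in NetCorDenStream is more involved.

    from collections import deque

    def correlate_alarms(node_alarms, k=3, delta=30):
        """Raise a network-level alarm whenever at least k distinct nodes raise
        alarms within a delta-second window. node_alarms is an iterable of
        (timestamp, node) pairs."""
        window = deque()               # (timestamp, node) pairs inside the window
        network_alarms = []
        for ts, node in sorted(node_alarms):
            window.append((ts, node))
            while window and ts - window[0][0] > delta:
                window.popleft()
            if len({n for _, n in window}) >= k:
                network_alarms.append(ts)
                window.clear()         # avoid re-raising on the same burst
        return network_alarms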

Systematic Measurement Study of 5G in the Wild: Motivated by the rapidly expanding deployment of 5G services, we have carried out an in-depth study to understand how 5G impacts application performance and energy efficiency. We examine the effects of 5G deployment strategies such as Non-Standalone (NSA) and Standalone (SA) 5G, radio bands, and protocol-specific properties (e.g., RRC state transitions and power profiles) on the performance and power usage experienced by user equipment (smartphones) and the applications running on them. Unlike last year's effort, this new effort covers both 5G mmWave, which provides significantly higher throughput than 4G LTE, and low-band radio (with throughput more comparable to 4G LTE), under both NSA and SA modes. Our objective is to understand how 5G impacts application performance and energy efficiency, in particular the challenges posed by emerging 5G services as well as the opportunities they offer, with an aim to develop both novel machine learning based methods for 5G network control and 5G-aware applications that best utilize 5G services.

Case for 5G-Aware Video Streaming Applications: Our measurement studies show that commercial mmWave 5G can indeed offer ultra-high bandwidth (up to 2 Gbps), capable of supporting bandwidth-intensive applications such as ultra-HD (UHD) 4K/8K and volumetric video streaming on mobile devices. However, mmWave 5G also exhibits highly variable throughput and incurs frequent handoffs (e.g., between 5G and 4G) due to its directional nature, signal blockage, and other environmental factors, especially when the device is mobile. All of these issues make it difficult for applications to achieve high Quality of Experience (QoE). These challenges have spurred us to explore new mechanisms and machine learning models not only for 5G performance prediction, but also for cross-layer, 5G-aware application adaptation mechanisms such as adaptive bit rate (ABR) algorithms.

UE 5G Power Measurement and Data-driven Power Modeling: In addition to the high bandwidth and low latency enabled by physical-layer innovations (e.g., massive MIMO, advanced channel coding, etc.), power saving is a top concern for mobile users of 5G. In cellular networks, this is usually achieved through different Radio Resource Control (RRC) states, and 5G is no exception. It is thus important to understand the RRC state machine of commercial 5G networks and its implications. The configurations of key parameters in the state machine lead to important performance and energy trade-offs; they are usually carrier-specific and can differ substantially between NSA and SA deployment modes. Our objective is to provide key insights into how to design machine learning-based, 5G-aware applications that best utilize 5G by balancing the critical tradeoff between performance and energy consumption, while also taking into account the availability of both network and computation resources.

Significant Results

  • We have been investigating opportunistic learning for better learning efficiency. We have proposed and analyzed an opportunistic contextual bandit algorithm, AdaLinUCB, that adaptively balances the exploration-exploitation trade-off when an external factor, such as network load, modulates the cost of exploration. Our theoretical analysis and simulation evaluations show the superior performance of the proposed algorithm compared to existing algorithms. To the best of our knowledge, this is the first work to study opportunistic learning for contextual bandits. We focus on problem-dependent bound analysis, a setting in which a better bound can be achieved under stronger assumptions; such a bound did not previously exist for LinUCB in the literature. We prove problem-dependent bounds for both the proposed AdaLinUCB and the traditional LinUCB algorithms: both have a regret upper bound of O((log T)^2), and the coefficient of the AdaLinUCB bound is smaller than that of LinUCB. Furthermore, using both synthetic and real-world large-scale datasets, we show that AdaLinUCB significantly outperforms other contextual bandit algorithms under large exploration cost fluctuations. The work appears in the proceedings of IJCAI 2019 (International Joint Conference on Artificial Intelligence).
  • As mentioned earlier, large-scale machine learning applications require massive training datasets and model parameters, forcing practitioners to adopt distributed computing frameworks over a communication network to increase learning speed, which makes them vulnerable to stragglers. In our recent work we focus on approximate gradient coding, which aims to recover the gradient with bounded error ϵ. Theoretically, our main contributions are three-fold: (i) we analyze the structure of optimal gradient codes and derive the information-theoretic lower bound on the minimum computation load: O(log(n)/log(n/s)) for ϵ = 0 and d ≥ O(log(1/ϵ)/log(n/s)) for ϵ > 0, where d is the computation load and ϵ is the error in the gradient computation; (ii) we design two approximate gradient coding schemes that exactly match these lower bounds, based on a random edge removal process; (iii) we implement our schemes and demonstrate their advantage over the current fastest gradient coding strategies. The proposed schemes provide an order-wise improvement over the state of the art in terms of computation load, and are optimal in terms of both computation load and latency.
  • In terms of our research on learning universal and transferable graph neural network embeddings, we have made the following major contributions: 1) we propose a novel, theoretically guaranteed DUGnn model for universal graph embedding learning that can be trained in an unsupervised fashion and is also capable of transfer learning; 2) we leverage rich graph kernels to design a multi-task graph decoder that incorporates the power of graph kernels into graph neural networks, getting the best of both worlds; and 3) our DUGnn model achieves superior results in comparison with existing graph neural networks and graph kernels on various graph classification benchmark datasets. Through extensive experiments and ablation studies, we show that the proposed DUGnn model consistently outperforms both existing state-of-the-art GNN models and graph kernels, improving accuracy by 3%-8% on graph classification benchmark datasets.
  • In terms of network anomaly detection using BGP MDT data streams, we have proposed Bravo, a systematic MDT data processing and feature selection framework that is portable to other MDT vendors. The source code, MDT data, and all prerequisite scripts for our results are available on GitHub. We have also developed NetCorDenStream, a novel unsupervised anomaly detection engine equipped with two correlation mechanisms: (i) TimeProximity NetCorDenStream, TP(k, delta), which intelligently combines node-level alarms, increasing true positive alarms and precision at the expense of some delay; and (ii) SigProximity NetCorDenStream, SP(k, delta), which incorporates counter heuristics from within TP(k, delta) to aid network operators in early troubleshooting. Our evaluation demonstrates that the system outperforms OutlierDenStream: it reduces the number of alarms raised by 59% while increasing precision, and the reduction in false alarm rate comes at the cost of only a slight delay in raising alarms.
  • The key takeaways of our work on NetCorDenStream are: 1) k and delta are tunable parameters set by network operators. Operators who seek to increase precision at the expense of delay can set k high (k = 5) and delta low (delta = 20 seconds); conversely, they can set k small (k = 1) and delta high (delta = 55 seconds) to reduce false alarms. 2) The SP(k, delta) mechanism quickly surfaces the specific counters that triggered an alarm, allowing network operators to handle alarms promptly. Thus, NetCorDenStream is practical and applicable across several vendor networks.
  • In terms of mapping and predicting commercial mmWave 5G throughput, we have made the following major contributions. i) To understand their potential impact on 5G throughput, we identify several UE-side factors and decompose them into quantifiable factors. We have conducted extensive empirical and statistical analyses of these factors individually to understand their impact on 5G throughput behavior and its predictability, and find that 5G throughput is driven by a wide spectrum of factors and interplays that are far more complex than in traditional cellular technologies such as 3G and 4G. ii) Based on our measurement findings, we develop a holistic and robust ML framework that predicts 5G throughput both qualitatively (via classification) and quantitatively (via regression). Our framework is "composable" in that it considers different feature groups (location, mobility, tower, radio connection) as well as their combinations. This is, to our knowledge, the first comprehensive study of the predictability of commercial 5G performance using real-world data. iii) Using the proposed prediction framework, we conduct extensive evaluations and demonstrate that it achieves accurate and reliable 5G throughput prediction, and that using 5G-specific features significantly improves the prediction results. Powered by judicious feature and ML model selection, our framework achieves an overall F1 score of 0.95-0.97 (with three prediction classes) and a 1.26x to 5.57x reduction in throughput prediction error compared to existing approaches designed for 3G/4G. We plan to make the ML models and datasets publicly available.
  • We have developed a novel bandit algorithm for handover under extreme mobility. Specifically, we derive a policy called BaTT (Bandit and Threshold Tuning) based on the unique properties of handover procedures and multi-armed bandit algorithms. To determine when to start the measurements for handover, we formulate the problem as a J-armed stochastic bandit over T rounds and solve it with ε-Binary-Search-First, achieving O(log J log T) regret. Then, to optimize handover reliability, BaTT decides what sequence of target cells to measure using opportunistic Thompson Sampling. The work is published at IWQoS 2021.
  • In terms of our measurement study of commercial 5G service performance, we have obtained the following key findings. We have performed a detailed performance examination of 5G over multiple frequency bands, including mmWave, and find that both throughput and latency have improved noticeably compared to the initial deployment. The end-to-end performance is highly correlated with geographical properties; we quantify the impact of such properties on network performance and observe critical differences between NSA and SA 5G. We have also performed experiments over T-Mobile's SA 5G, which is deployed on their low-band network. Our findings suggest that SA 5G has far fewer vertical handoffs than NSA 5G; this is, to our knowledge, the first look at the performance of commercial SA 5G. The complexity, diversity, and high variability of emerging 5G services highlight the need for machine learning-based methods for modeling and predicting 5G performance and for 5G network configuration and control.
  • In terms of our collaborative research on understanding the benefits that 5G brings to mobile applications and identifying new challenges in mmWave networking, we have obtained the following key outcomes. We have conducted a first comprehensive study evaluating existing state-of-the-art ABR algorithms for video streaming over mmWave 5G (key for supporting UHD resolution and beyond) and compared it with 4G. We find that, due to the poor signal propagation characteristics of mmWave 5G, existing ABR mechanisms over 5G can incur 3.7% to 259.5% higher stall time than over 4G/LTE. We have proposed simple yet effective interface selection mechanisms for 5G video streaming and demonstrated a 26.9% improvement in reducing video stalls together with a 4.2% improvement in energy efficiency, all without compromising user-perceived video quality. The entire project team, involving all three sites, is actively collaborating on tackling the research challenges we have uncovered.
  • In addition, we have advanced several new mechanisms to tackle the challenges facing UHD video streaming applications over 5G networks, thereby making them 5G-aware. We argue for the need to employ machine learning (ML) for effective throughput prediction to aid applications in intelligent bitrate adaptation. Furthermore, we advocate adaptive content bursting and dynamic radio (band) switching, which allow the 5G radio network to fully utilize the available radio resources under good channel/beam conditions while dynamically switching radio channels/bands (e.g., from 5G high-band to low-band, or from 5G to 4G) to maintain session connectivity and ensure a minimal bitrate. We have conducted an initial evaluation using real-world 5G throughput measurement traces. Our results show these mechanisms can help minimize, if not completely eliminate, video stalls despite wildly varying 5G throughput.
  • In terms of our research on 5G UE power modeling using machine learning, we have obtained the following key findings. First, using a network-based probing approach, we have developed mechanisms to infer the RRC states and configuration parameters for SA 5G (T-Mobile) and NSA 5G (Verizon and T-Mobile). For NSA 5G, which relies on 4G as an anchor, we find that the NR_RRC_CONNECTED→LTE_RRC_IDLE transition (due to UE data inactivity) for the carriers considered in our study is about 2x more energy efficient than what was found in the previous study. Second, we have closely investigated the power characteristics of 5G and 4G/LTE: in terms of energy efficiency for downlink (and uplink), 5G can be 79% (74%) less efficient than 4G at low throughput but up to 2x more efficient when the throughput is high. Third, using a data-driven, machine learning approach, we have built the first throughput- and signal-strength-aware network power model for 5G on different frequency bands.
  • In terms of our research on performance and energy/power consumption trade-offs, we have collected a large dataset consisting of more than 30,000 web page loadings of diverse websites, and use it to compare mmWave 5G vs. 4G page load time and energy consumption. We find that overall 5G improves the page load time at the cost of higher energy consumption compared to 4G. Moreover, this impact is highly web-page-dependent. We have built decision tree models that can intelligently select the appropriate network (5G or 4G) for web browsing.
  • We have also recently been investigating learning techniques to optimize the freshness of status updates that are sent from a large number of low-power sources to a common access point. The source nodes use carrier sensing to reduce collisions and adopt an asynchronous sleep-wake scheduling strategy to achieve a target network lifetime (e.g., 10 years). We use the age of information (AoI) to measure the freshness of status updates, and design sleep-wake parameters that minimize the weighted-sum peak AoI of the sources, subject to per-source battery lifetime constraints. When the mean transmission time of status-update packets is unknown, we devise a reinforcement learning algorithm that efficiently performs two tasks: a) it learns the unknown parameter, and b) it generates efficient controls that make channel access decisions. We analyze its performance by quantifying its regret, i.e., the sub-optimality gap between its average performance and that of a controller that knows the mean transmission time. Our numerical and NS-3 simulation results show that our solution can indeed extend the battery lifetime of the information sources while providing competitive AoI performance.
  • We consider the adaptive control of differentially private systems. We develop the first private RL algorithm, Private-OFU-RL, which attains sub-linear regret while guaranteeing privacy protection. In the process, we also provide a general procedure for the adaptive control of LQ systems under changing regularizers, which not only generalizes previous non-private controllers but also serves as a basis for general private controllers. This work was published at IEEE ISIT.

 

Key outcomes 

Towards Commercialization of Opportunistic Learning: Motivated by the theoretical and numerical results on opportunistic learning, we have studied it in real-world application scenarios. In particular, in collaboration with researchers at Target, we have studied opportunistic learning for user interface experimentation. Facing growing competition from online rivals, the retail industry is increasingly investing in its online shopping platforms to win the high-stakes battle for customer loyalty. User experience plays an essential role in this competition, and retailers continuously experiment with and optimize their user interfaces for a better user experience. The cost of experimentation is dominated by the opportunity cost of providing a suboptimal service to customers. In our work, we demonstrate the effectiveness of opportunistic bandits in making these experiments as inexpensive as possible, using real online retail data. Specifically, we model user interface experimentation as an opportunistic bandit problem in which the cost of exploration varies with a factor extracted from customer features. We achieve significant regret reduction by mitigating costly exploration and providing extra contextual information that helps guide the testing process.

Commercial 5G Performance Measurement and Datasets via Smartphones: Taking advantage of the fact that Minneapolis was one of the first cities in the US where mmWave 5G service was deployed, we have conducted, to our knowledge, the first measurement study of commercial mmWave 5G performance on smartphones, closely examining the 5G networks of three carriers (two mmWave carriers and one mid-band carrier) in three U.S. cities. Due to the lack of publicly available tools/APIs for 5G measurements, we developed our own 5G service monitoring and throughput measurement platform. We use our tools to conduct extensive, reproducible, and repeated field experiments for 5G throughput data collection in Minneapolis (as well as other U.S. cities). While ensuring high data quality, we carefully design systematic measurement methodologies to measure 5G throughput under various settings (indoor/outdoor, mobility scenarios, etc.). After cleaning, our dataset contains throughput samples collected by walking over 331 km and driving over 132 km. We conduct extensive field tests of 5G performance in diverse urban environments, systematically analyze the handoff mechanisms in 5G and their impact on network performance, explore the feasibility of using location and possibly other environmental information to predict network performance, and study application performance (web browsing, HTTP download, and volumetric video streaming) over 5G. Our study consumed more than 15 TB of cellular data. Conducted when 5G had just made its debut, it provides a baseline for studying how 5G performance evolves, and identifies key research directions for improving 5G users' experience in a cross-layer manner. We have released the data collected from our study (referred to as 5Gophers) at https://fivegophers.umn.edu/www20.

We propose and study Adaptive Federated Dropout (AFD), a novel technique for reducing the communication costs associated with federated learning. It optimizes both server-client communication and computation costs by allowing clients to train locally on a selected subset of the global model. We empirically show that this strategy, combined with existing compression methods, provides up to a 57x reduction in convergence time and outperforms state-of-the-art solutions for communication efficiency. Furthermore, it improves model generalization by up to 1.7%. The work is published at the IEEE INFOCOM Workshop on Distributed Machine Learning and Fog Networks 2021.
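
The sketch below shows the sub-model idea behind federated dropout in its simplest form: each client receives and trains only a random subset of the global weights, and the server averages updates per coordinate over the clients that trained it. AFD additionally adapts which coordinates each client keeps and combines this with compression, which is not modeled here; client_train is a hypothetical client-side training callback.

    import numpy as np

    def federated_dropout_round(global_w, client_train, n_clients, keep_frac=0.5, rng=None):
        """One round of federated averaging where each client only receives and
        trains a random subset of the global weights (simplified federated dropout).
        client_train(sub_w, mask, c) is assumed to return updated values for the
        kept coordinates of client c."""
        rng = rng or np.random.default_rng(0)
        accum = np.zeros_like(global_w)
        counts = np.zeros_like(global_w)
        for c in range(n_clients):
            mask = rng.random(global_w.shape) < keep_frac   # coordinates this client trains
            updated = client_train(global_w[mask], mask, c)
            accum[mask] += updated
            counts[mask] += 1
        trained = counts > 0
        new_w = global_w.copy()
        new_w[trained] = accum[trained] / counts[trained]   # per-coordinate average
        return new_w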

Accelerating Distributed Deep Learning using Multi-Path RDMA in Data Center Networks: Data center networks (DCNs) have widely deployed RDMA to support data-intensive applications such as machine learning. While DCNs are designed with rich multi-path topologies, current RDMA (hardware) technology does not support multi-path transport. In this work we advance Maestro, a purely software-based multi-path RDMA solution, to effectively utilize the rich multi-path topology for load balancing and reliability. As a "middleware" operating in user space, Maestro is modular and software-defined: it decouples path selection and load balancing mechanisms from hardware features, and allows DCN operators and applications to make flexible decisions by employing the best mechanisms as needed. As such, Maestro can be readily deployed using existing RDMA hardware (NICs) to support distributed deep learning (DDL) applications. Our experiments show that Maestro is capable of fully utilizing multiple paths with negligible CPU overhead, thereby enhancing the performance of DDL applications.
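
As an illustration of what a software path scheduler in such a middleware can do, the sketch below greedily assigns message chunks to the path with the earliest estimated finish time given per-path rate estimates. This is a generic load-balancing policy for illustration, not Maestro's actual scheduler; path_rates and the chunking are assumptions.

    import heapq

    def split_message(chunks, path_rates):
        """Assign message chunks (sizes in bytes) to paths greedily by earliest
        estimated finish time, given estimated per-path rates (bytes/sec).
        Returns a list of (path_id, chunk_size) assignments."""
        heap = [(0.0, p) for p in range(len(path_rates))]   # (est. completion time, path)
        heapq.heapify(heap)
        plan = []
        for size in chunks:
            finish, p = heapq.heappop(heap)
            finish += size / path_rates[p]
            plan.append((p, size))
            heapq.heappush(heap, (finish, p))
        return plan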

5G Datasets and Tools: We have released the functional artifacts (both datasets and tools) of our study at https://github.com/SIGCOMM21-5G/artifact as well as https://5gophers.umn.edu

 

Broader Impact

The PIs and their students have presented their work at multiple international conferences, in both the networking and machine learning fields. The PIs have also presented the results to collaborators in academia and industry, including Amazon, AT&T, Cisco, Ericsson, InterDigital, Microsoft, Target, and Verizon, and PI Shroff gave a keynote address at a Machine Learning for Networks workshop at INFOCOM. Besides conference and journal papers, the PIs interact regularly with the networking and cloud computing industry, both in one-on-one settings and by participating in industry-focused conferences and workshops. These interactions have been fruitful for cross-pollination of problems and solutions between the research community and industry. Overall, our efforts broaden the participation and interaction between the networking and machine learning communities, and between academia and industry.

Multiple trainees, including several female students and students from under-represented groups, have worked on problems related to this project at the three universities, gaining valuable real-world experience, communication skills, and collaboration skills, in addition to important technical skills. One female PhD student has been recruited at UC Davis and has been working jointly with PIs Liu and Shroff; in addition, a postdoc at OSU has been working jointly with PIs Shroff and Liu. On the UMN side, the project has provided the graduate students who worked on it (one Black Ph.D. student and one female Ph.D. student) with system and algorithm design, modeling, and analytical skills, as well as performance evaluation, testing, and implementation experience; the research experience gained from this project has enabled them to obtain a part-time internship at Cisco to work on applying machine learning to network management issues. At UC Davis and UMN, a number of undergraduate students, including one female student and one Black student, have been working on algorithm evaluation and 5G field measurements and trials. These efforts have enriched their research experiences and trained them in developing critical system-building skills.

Opportunities for training and professional development the project has provided

The students working on this project have gained valuable real-world experience, communication skills, and collaboration skills, in addition to important technical skills. One female PhD student, Esha Datta, has been recruited and is working on the project.

  • We have also involved a number of undergraduate students in research, including one African American student. They are Cazamere Comrie, Thomas Munduchira, Bryce Kroencke, Faraz Cherukattil, Andrei Blebea, Chaitenya Gupta, Alessandro Liu, Jiayi Zhang, Henry Zhu, Nguyen Duong. They have gained research experience, technical skills, communication skills, and collaboration skills.
  • At the UMN side, the project has provided the graduate students who worked on it (one Black Ph.D. student and two female Ph.D. students) with system and algorithm design, modeling, and analytical skills, as well as performance evaluation, testing, and implementation experience. The research experience gained from this project has enabled the students to obtain internships at AT&T, Cisco, Facebook, and Microsoft to work on applying machine learning to network management issues.
  • At UC Davis and UMN, a number of undergraduate students, including one female student and one Black student, have been working on algorithm evaluation and 5G field measurements and trials. These efforts have enriched their research experiences and trained them in developing critical system-building skills. At OSU, a female student is working on problems related to this project. At UC Davis, a new female student has started to work on this project.
  • We have also involved several undergraduate students (including one female student) in conducting 5G field measurements and trials, and in collecting datasets for our research on applying machine learning to 5G throughput prediction. These efforts have enriched their research experiences and trained them in developing critical system-building skills.
Dissemination to communities of interest

The research outcomes and results have been disseminated to communities of interest through research publications, conference and workshop presentations (in both the networking and machine learning fields), and invited talks at academic institutions as well as industrial research labs. The PIs and their students have presented their work at multiple international conferences, and the PIs have presented the results to collaborators in academia and industry, including Amazon, AT&T, Cisco, Ericsson, InterDigital, Microsoft, Target, and Verizon. PI Shroff has also presented a keynote address at a Machine Learning for Networks workshop at INFOCOM. Such efforts broaden the participation and interaction between the networking and machine learning communities, and between academia and industry. Besides conference and journal papers, the PIs interact regularly with the networking and cloud computing industry, both in one-on-one settings and by participating in industry-focused conferences and workshops; these interactions have been fruitful for cross-pollination of problems and solutions between the research community and industry. The project has also enabled the PIs and their students to collaborate with industry on several timely and important problems of mutual interest, and to disseminate our research findings to other communities of interest. Through internships of graduate students who worked on the project, we are collaborating with several industrial partners, including Amazon, AT&T, Ericsson, Cisco, Facebook, InterDigital, and Microsoft, to apply our research findings to their networks and systems. In addition, our research collaborations with industry have led to patent applications/disclosures with industrial partners such as AT&T. We have also received research awards and gifts from our industrial partners (including an Amazon AWS machine learning award, a Cisco University Research Award, and an InterDigital charitable gift) to support our research efforts.

Plans for the next reporting period
  • We plan to continue the current research directions on opportunistic learning, exploring both theoretical and applied research in this direction (a minimal illustrative sketch of the opportunistic-bandit idea appears after this list).
  • We will extend our efforts on networked multi-armed bandits to more general graphical reinforcement learning settings that are important for various network control problems.
  • To model non-stationary network behavior in our learning framework, we are investigating problems involving reinforcement learning under non-stationary conditions.
  • We plan to continue our research on graph neural networks, especially universal deep graph embedding using graph neural networks, and apply this framework to address various network control tasks.
  • We plan to continue our research on applying machine learning techniques and tools -- especially deep learning and deep reinforcement learning -- to address various network control tasks, including 5G/wireless channel and throughput prediction, video streaming, caching, routing, scheduling, and traffic engineering.
  • We plan to continue the recruitment and training of postdoctoral researchers, Ph.D. students, and undergraduate researchers, in particular from under-represented groups.
  • We plan to continue the current research directions on learning-based closed-loop control in cellular networks. We will explore both theoretical and applied research in this direction.
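To make the opportunistic-learning direction above concrete, the following Python sketch shows a LinUCB-style contextual bandit whose exploration bonus is scaled down when the observed load (i.e., the cost of exploring) is high. This is only a minimal illustration of the idea, not the published AdaLinUCB algorithm; the class name, the `exploration_scale` factor, and the toy feedback loop are illustrative assumptions.

```python
import numpy as np

class OpportunisticLinUCB:
    """Minimal LinUCB-style contextual bandit whose exploration width is
    scaled by an observed load level: explore aggressively when the
    cost of exploring (load) is low, act nearly greedily when it is high."""

    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha                                # base exploration width
        self.A = [np.eye(dim) for _ in range(n_arms)]     # per-arm design matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]   # per-arm response vectors

    def select(self, context, load):
        """Pick an arm for a context vector; `load` in [0, 1] shrinks the
        confidence bonus when exploration is expensive."""
        exploration_scale = self.alpha * (1.0 - load)
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                             # ridge-regression estimate
            bonus = np.sqrt(context @ A_inv @ context)    # confidence width
            scores.append(theta @ context + exploration_scale * bonus)
        return int(np.argmax(scores))

    def update(self, arm, context, reward):
        self.A[arm] += np.outer(context, context)
        self.b[arm] += reward * context


# Toy usage: 3 arms, 5-dimensional contexts, random loads and placeholder rewards.
rng = np.random.default_rng(0)
bandit = OpportunisticLinUCB(n_arms=3, dim=5)
for t in range(100):
    x = rng.normal(size=5)
    load = rng.uniform()              # e.g., current network load in [0, 1]
    arm = bandit.select(x, load)
    reward = rng.normal()             # placeholder feedback signal
    bandit.update(arm, x, reward)
```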

The impact on the development of the principal discipline(s) of the project

The results of this project help the community better understand how to learn effective policies in an efficient manner. Besides the research contributions, we have developed a measurement platform and smartphone-based data-collection tools for mmWave 5G performance measurement, and collected datasets that will be made publicly available. We believe that this measurement platform and these datasets will enable the research community to develop innovative network mechanisms as well as novel machine learning algorithms to support the development of emerging 5G networks and applications.

The impact on other disciplines

Our algorithms and analysis further advance the state of the art in machine learning, in particular multi-armed bandit algorithms. Our developments apply not only to the networking community but also to other communities. For example, we have applied the opportunistic learning algorithm to e-commerce in collaboration with Target. Through student internships, we are also collaborating with Cisco on applying machine learning to network management issues. We are also collaborating with Amazon on developing graph neural networks.

The impact on information resources that form infrastructure

We have developed a measurement platform and smartphone-based data-collection tools for mmWave 5G performance measurement, and collected datasets that will be made publicly available. We believe that this measurement platform and these datasets will be valuable information resources that form infrastructure, enabling the research community to develop innovative network mechanisms as well as novel machine learning algorithms to support the development of emerging 5G networks and applications.

The impact on the development of human resources

The project involves the three PIs, graduate students, postdocs, and undergraduate researchers. All of them have gained significant experience in collaboration, communication, and leadership skills, in addition to technical skills. As specified in the proposal, the PIs have made significant efforts in recruiting and training undergraduate researchers, as well as researchers from under-represented groups. In addition to her female Ph.D. student, PI Liu has recruited a female postdoc, who was supposed to start in early 2020 but is currently delayed due to visa issues. PIs Liu and Shroff are co-supervising a female Ph.D. student at UC Davis. Co-PI Zhang is supervising one female Ph.D. student and one Black Ph.D. student who are working on this research project.

 

Products


Presentations

  • Fundamental Limits of Approximate Gradient Coding [pdf]
    • S. Wang, J. Liu and N. B. Shroff
    • Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), vol. 3, no. 3 (2019): 52 (paper also appeared in ACM Sigmetrics 2020).
  • An Inter-data Encoding Technique that Exploits Synchronized Data for Network Applications [pdf]
    • Wooseung Nam, Joohyun Lee, Ness B. Shroff, Kyunghan Lee
    • IEEE Trans. on Mobile Computing (TMC), accepted for publication.
  • A First Look at Commercial 5G Performance on Smartphones [pdf]
    • Arvind Narayanan, Eman Ramadan, Jason Carpenter, Qingxu Liu, Yu Liu, Feng Qian, Zhi-Li Zhang
    • WWW '20: Proceedings of The Web Conference 2020 April 2020 Pages 894–905
  • Anomalous Model-Driven-Telemetry Network-Stream BGP Detection
    • Rostand A. K. Fezeu and Zhi-Li Zhang
    • Proceedings of the Second Workshop on Harnessing Data Revolution in Networks (HDR-Net), co-located with IEEE ICNP'20.
  • Modeling Personalized Item Frequency Information for Next-Basket Recommendation [pdf]
    • Hu, Haoji and He, Xiangnan and Gao, Jinyang and Zhang, Zhi-Li
    • SIGIR '20: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval July 2020 Pages 1071–1080
  • Mapping and Predicting Commercial mmWave 5G Throughput
    • Arvind Narayanan, Eman Ramadan, Rishabh Mehta, Xinyue Hu, Qingxu Liu, Udhaya Kumar Dayalan, Rostand A. K. Fezeu, Saurabh Verma, Peiqi Ji, Tao Li, Feng Qian, Zhi-Li Zhang
    • 2020 ACM Internet Measurement Conference (IMC’20).  
  • AdaLinUCB: Opportunistic Learning for Contextual Bandits [pdf] 
    • Xueying Guo, Xiaoxiao Wang, and Xin Liu
    • The 28th International Joint Conference on Artificial Intelligence (IJCAI), 2019.
  • An Opportunistic Bandit Approach for User Interface Experimentation [pdf]
    • Nader Bouacida, Amit Pande, Xin Liu
    • SIAM SDM Data science for Retail and Ecommerce (DSRE) workshop, 2020.
  • The Sample Complexity of Best-k Items Selection from Pairwise Comparisons [pdf]
    • W. Ren, J. Liu and N. B. Shroff
    • ICML’20, online, July 2020.
  • Fundamental Limits of Approximate Gradient Coding [pdf]
    • S. Wang, J. Liu and N. B. Shroff
    • ACM Sigmetrics' 20, Boston, Massachusetts, Jun. 2020 (moved to an online conference).
  • Asymptotically Optimal Load Balancing in Large-scale Heterogeneous Systems with Multiple Dispatchers 
    • X. Zhou, N. B. Shroff, and A. Wierman
    • to appear in Performance 2020.
  • Towards a Graph Signal Processing Framework for Modeling Power System Dynamics 
    • Xinyue Hu and Zhi-Li Zhang
    • 2019 IEEE Global Conference on Signal and Information Processing (GlobalSIP) 
  • 5G Tracker - A Crowdsourced Platform to Enable Research Using Commercial 5G Services 
    • Arvind Narayanan, Eman Ramadan, Jacob Quant, Peiqi Ji, Feng Qian, and Zhi-Li Zhang
    • ACM SIGCOMM’20 Poster
  • VeerEdge: Towards an Edge-Centric IoT Gateway 
    • Arvind Narayanan, Eman Ramadan, Jacob Quant, Peiqi Ji, Feng Qian, and Zhi-Li Zhang
    • CCGRID 2021: 690-695
  • ECIoT: Case for an Edge-Centric IoT Gateway 
    • Udhaya Kumar Dayalan, Rostand A. K. Fezeu, Nitin Varyani, Timothy J. Salo, Zhi-Li Zhang
    • HotMobile 2021: 154-156
  • Towards a Software-Defined, Fine-Grained QoS Framework for 5G and Beyond Networks
    • Zhi-Li Zhang, Udhaya Kumar Dayalan, Eman Ramadan, Timothy J. Salo
    • NAI@SIGCOMM 2021: 7-13
  • A variegated look at 5G in the wild: performance, power, and QoE implications 
    • Arvind Narayanan, Xumiao Zhang, Ruiyang Zhu, Ahmad Hassan, Shuowei Jin, Xiao Zhu, Xiaoxuan Zhang, Denis Rybkin, Zhengxuan Yang, Zhuoqing Morley Mao, Feng Qian, Zhi-Li Zhang
    • SIGCOMM 2021: 610-625
  • Lumos5G: Mapping and Predicting Commercial mmWave 5G Throughput 
    • Arvind Narayanan, Eman Ramadan, Rishabh Mehta, Xinyue Hu, Qingxu Liu, Rostand A. K. Fezeu, Udhaya Kumar Dayalan, Saurabh Verma, Peiqi Ji, Tao Li, Feng Qian, Zhi-Li Zhang
    • Internet Measurement Conference 2020: 176-193
  • Case for 5G-Aware Video Streaming Applications 
    • Eman Ramadan, Arvind Narayanan, Udhaya Kumar Dayalan, Rostand A. K. Fezeu, Feng Qian, Zhi-Li Zhang
    • Proceedings of the 1st Workshop on 5G Measurements, Modeling, and Use Cases August 2021 Pages 27–34
  • Minimizing Age of Information via Scheduling over Heterogeneous Channels [pdf]
    • J. Pan, A. M. Bedewy, Y. Sun, and N. B. Shroff
    • ACM MobiHoc’21, Shanghai, China, July 2021
  • Battle between Rate and Error in Minimizing Age of Information [pdf]
    • G. Yao, A. M. Bedewy, and N. B. Shroff
    • ACM MobiHoc’21, Shanghai, China, July 2021
  • Age-Optimal Low-Power Status Update over Time-Correlated Fading Channel [pdf]
    • G. Yao, A. M. Bedewy, and N. B. Shroff
    • IEEE ISIT’21, Melbourne, Australia, July 2021.
  • Adaptive Control of Differentially Private Linear Quadratic Systems [pdf]
    • S. R. Chowdhury, X. Zhou, and N. B. Shroff
    • IEEE ISIT’21, Melbourne, Australia, July 2021
  • Adaptive Federated Dropout: Improving Communication Efficiency and Generalization for Federated Learning [pdf]
    • Nader Bouacida, Jiahui Hou, Hui Zang, Xin Liu, arXiv.
    • IEEE INFOCOM Workshop on Distributed Machine Learning and Fog Networks 2021.
  • Can Online Learning Increase the Reliability of Extreme Mobility Management? [pdf]
    • Y. Li, E. Datta, J. Ding, N. B. Shroff, and X. Liu
    • IEEE/ACM International Symposium on Quality of Service, 2021.
  • Resource Allocation Method for Network Slicing Using Constrained Reinforcement Learning [pdf]
    • Yongshuai Liu, Jiaxin Ding, and Xin Liu
    • The 20th International Federation for Information Processing (IFIP) Networking, Poster, 2021
  • Policy Learning with Constraints in Model-free Reinforcement Learning: A Survey [pdf]
    • Yongshuai Liu, Avishai Halev, and Xin Liu
    • The 30th International Joint Conference on Artificial Intelligence (IJCAI), 2021

Other Products

Datasets: 5Gophers platform: 5G performance measurement datasets and performance map, https://fivegophers.umn.edu/www20.

5G Datasets and Tools:  We have released the functional artifacts (both datasets and tools) of our studies: https://github.com/SIGCOMM21-5G/artifact as well as https://5gophers.umn.edu

Websites: https://github.com/SIGCOMM21-5G/artifact and https://5gophers.umn.edu 

Technologies or Techniques: We have developed Maestro, a purely software-based, modular multipath RDMA solution that brings efficiency and flexibility. A novel vNIC-based user-space path control mechanism and a middleware transport layer achieve effective multipath transmission in RDMA without introducing extra memory copies. A user-space path monitor and load balancer are also realized to provide application-aware path selection and resilience for the heterogeneous requirements of distributed deep learning (DDL) applications. Our experiments show that Maestro can effectively utilize multiple paths by aggregating bandwidth and detecting path failures in collective communication, while introducing negligible CPU overhead on the data path, making it feasible for large-scale DDL platforms.
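As a rough, hypothetical illustration of the kind of user-space path monitoring and load balancing described above (this is not the Maestro implementation; the `Path`/`PathScheduler` abstractions, the weights, and the timeout-based failure detector are assumptions for exposition only):

```python
import time
from dataclasses import dataclass, field

@dataclass
class Path:
    """A logical multipath transfer path exposed through a vNIC (illustrative)."""
    name: str
    weight: float = 1.0          # relative share assigned by the load balancer
    inflight_bytes: int = 0      # bytes currently outstanding on this path
    last_ack: float = field(default_factory=time.monotonic)

class PathScheduler:
    """Weighted least-loaded path selection with a simple timeout-based
    failure detector, standing in for application-aware path control."""

    def __init__(self, paths, failure_timeout=0.5):
        self.paths = paths
        self.failure_timeout = failure_timeout

    def healthy_paths(self):
        now = time.monotonic()
        return [p for p in self.paths
                if now - p.last_ack < self.failure_timeout]

    def pick(self, message_size):
        candidates = self.healthy_paths() or self.paths   # fall back if all look stale
        # Choose the path with the smallest load normalized by its weight.
        best = min(candidates, key=lambda p: p.inflight_bytes / p.weight)
        best.inflight_bytes += message_size
        return best

    def on_completion(self, path, message_size):
        path.inflight_bytes = max(0, path.inflight_bytes - message_size)
        path.last_ack = time.monotonic()

# Toy usage: two paths, one weighted higher (e.g., a faster NIC port).
sched = PathScheduler([Path("path-0", weight=2.0), Path("path-1", weight=1.0)])
p = sched.pick(1 << 20)          # schedule a 1 MiB transfer
sched.on_completion(p, 1 << 20)  # mark it completed
```

In this sketch, a path that has not acknowledged traffic within the timeout is skipped, which mirrors, at a very high level, the path-failure detection and bandwidth aggregation behavior described above.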

Inventions:

Using user-side contextual factors to predict cellular radio throughput

This disclosure generally relates to cellular telephone technology and, in particular, to the use of machine learning to predict the throughput performance of cellular networks. A patent application for this invention is currently under preparation. Emerging 5G services offer numerous new opportunities for networked applications. The present disclosure provides techniques for predicting the throughput of mmWave 5G in real-life environments and describes machine learning models for predicting 5G throughput in such environments. The disclosure identifies key user equipment (UE) side factors that affect 5G performance and quantifies the extent to which 5G throughput can be predicted. It further describes a composable machine learning (ML) framework that judiciously considers features and their combinations, and that applies state-of-the-art ML techniques for making context-aware 5G throughput predictions. In practice, the described techniques achieve a 1.37× to 4.84× reduction in prediction error compared to existing models. The throughput prediction techniques may be used to support applications such as a dynamic 5G throughput map (akin to Google traffic map) and other 5G-aware applications. The present disclosure further describes robust methods that provide quick and accurate predictions of 5G performance without (or with little) active probing. It describes a data-driven approach that constructs a performance model for user equipment (UE) using a wide variety of carefully selected, robust, and easy-to-collect features. This data-driven approach is then used to automatically model the UE device and the complex relationships between the device's various "contexts" and 5G network performance, in particular for the environment-sensitive 5G mmWave radio. In one example, this prediction framework provides a high-resolution 5G "performance map" that may be used as a fundamental infrastructural service for 5G edge computing.
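To illustrate, at a very high level, the kind of context-aware throughput prediction the disclosure describes, the sketch below trains a gradient-boosted regressor on synthetic UE-side features. The feature names, the synthetic data, and the model choice are assumptions for exposition only and do not reflect the actual models, feature sets, or datasets developed in this project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for UE-side contextual features (illustrative names only).
rng = np.random.default_rng(42)
n = 2000
df = pd.DataFrame({
    "distance_to_tower_m": rng.uniform(10, 300, n),
    "moving_speed_mps": rng.uniform(0, 15, n),
    "compass_heading_deg": rng.uniform(0, 360, n),
    "line_of_sight": rng.integers(0, 2, n),
    "signal_strength_dbm": rng.uniform(-110, -70, n),
})
# Synthetic throughput label (Mbps) with some structure plus noise.
df["throughput_mbps"] = (
    1500 * df["line_of_sight"]
    - 3.0 * df["distance_to_tower_m"]
    - 20.0 * df["moving_speed_mps"]
    + 10.0 * (df["signal_strength_dbm"] + 110)
    + rng.normal(0, 50, n)
).clip(lower=0)

X = df.drop(columns="throughput_mbps")
y = df["throughput_mbps"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

# Train a context-aware throughput predictor and report its error.
model = GradientBoostingRegressor(random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)
print(f"MAE: {mean_absolute_error(y_test, pred):.1f} Mbps")
```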

Impacts

The results of this project help the community better understand how to integrate learning into network resource management, from both theoretical and practical perspectives. Besides the research contributions, we have developed a measurement platform and smartphone-based data-collection tools for 5G measurement and data-driven performance prediction, and collected datasets that have been made publicly available. This measurement platform and these datasets have enabled the research community to develop innovative network mechanisms as well as novel machine learning algorithms to support the development of emerging 5G networks and applications.

Impact on other disciplines: Our algorithms and analysis further advance the state of the art in machine learning, in particular multi-armed bandit algorithms. Our developments apply not only to the networking community but also to other communities. For example, we have applied the opportunistic learning algorithm to e-commerce in collaboration with Target. Through student internships, we are also collaborating with AT&T, Cisco, Facebook, and Microsoft on applying machine learning to network management issues. We are also collaborating with Cisco and Amazon on developing machine learning methods for enabling future intelligent network management and new (edge-assisted) cloud services.

Impact on the development of human resources: The project involves the three PIs, graduate students, postdocs, and undergraduate researchers. All of them have gained significant experience in collaboration, communication, and leadership skills, in addition to technical skills. As specified in the proposal, the PIs have made significant efforts in recruiting and training undergraduate researchers, as well as researchers from under-represented groups. In addition to her female Ph.D. student, PI Liu has recruited a female postdoc, who was supposed to start in early 2020 but is currently delayed due to visa issues. PIs Liu and Shroff are co-supervising a female Ph.D. student at UC Davis. One Ph.D. student who worked on the project, supervised by PI Zhang, is expected to graduate with his Ph.D. in September 2021 and will join AT&T Labs – Research. PI Zhang is also supervising two female Ph.D. students and one Black Ph.D. student who are working on this research project. One of the female Ph.D. students is expected to graduate with her Ph.D. by the end of 2021, and the other female student was offered a summer internship at AT&T Labs and worked on applying machine learning to cellular network load balancing. The Black Ph.D. student was offered a summer internship at Microsoft and worked on cloud computing systems.

Impact on teaching and educational experiences: At the University of Minnesota, the research problems, datasets, and tools identified, collected, and developed in the project have been used for course projects in our undergraduate and advanced networking classes (CSci 4211 "Introduction to Computer Networks" and CSci 8211 "Advanced Networking and Applications"), and will continue to be used in subsequent offerings of these classes (e.g., Spring 2020, Fall 2020, and Fall 2021). The project materials and software will be made public, downloadable from the class websites, and shared freely with other researchers and instructors who are interested in using them. Through our commercial 5G performance measurement and data-driven performance prediction projects, we have engaged a number of undergraduate students (including two female students), several of whom are co-authors of the research papers we have published. One female undergraduate student has just been awarded a UROP (Undergraduate Research Opportunity) scholarship to continue her research on IoT. The datasets we have collected in connection with the commercial 5G measurement study have been used in integrated research and teaching activities.

Impact on information resources that form infrastructure: The datasets we have collected in connection with the commercial 5G measurement study (as well as the toolkit) have been publicly released, and have been used by a number of research teams, in addition to our own integrated research and teaching activities.

Impact on technology transfer: Our research collaboration with AT&T Labs – Research has led to the filing of two joint patents with AT&T. Our measurement study of commercial 5G has led to the development of a novel machine-learning-based method for mmWave 5G performance prediction; a patent disclosure on this tool is being prepared by the University of Minnesota's Technology Transfer and Commercialization Office. Through the internship of a graduate student who worked on the project, we are collaborating with a Facebook team to apply the ideas and experience gained from this research to improve the latency and throughput performance of distributed machine learning in data center networks.

Impact on society beyond science and technology: As our society increasingly depends on networking and network technologies, especially emerging 5G networks, and as machine learning and AI play growing roles in both network control and various applications and services, our research will help inform the design of future networked systems, such as Beyond-5G networks and next-generation networked systems, and enable innovative new applications and services.