Distributed Analytics


As the volume of available data grows exponentially, traditional approaches to data processing and mining become either too time-consuming or infeasible. Typical scientific challenges in analyzing large-volume data include 1) how to parallelize existing data processing and mining algorithms, and 2) how to maintain good data mining accuracy while improving execution performance. For execution efficiency, most big data analytics is conducted in a distributed environment rather than on a single local computer. Related techniques include cluster computing, cloud computing, service computing, edge computing, and the Internet of Things (IoT). Research challenges on this topic include 1) efficient task and workflow scheduling in distributed environments to reduce overall execution time, and 2) service provisioning and collaboration for data analytics tasks on server and edge nodes, so that users on the client side can easily conduct data analytics remotely.

We have worked on parallelizing and benchmarking data processing and mining applications and algorithms for large-volume datasets, including data aggregation, Bayesian network learning, neural network training, causality learning, and deep learning. We have also worked on novel task and workflow scheduling algorithms and systems in the cloud, a hybrid architecture for IoT instrument management and data analytics, applications in smart transportation, and sensor data services. Our work on this research topic has been published in major journals (Sensors, IEEE Access, Neurocomputing, etc.) and conferences (ICWS, SCC, ICSOC, ISPA, INFORMS-CSS, ICA3PP, IEEE Big Data, etc.).


Open Source GitHub Repositories