The AI/deep learning data analysis use cases address large-scale pattern recognition tasks in which the input data consists of images (cervical cancer diagnostics), videos (multi-label video classification), or statistics collected from network devices or a parallel file system in the HPC context. The first two datasets originate from Kaggle, a repository of machine learning challenges used in the machine learning community as real-world benchmark problems of particular clinical or societal relevance:

  • Intel & MobileODT Cervical Cancer Screening problem
    • three diagnostic classes (cancer types)
    • overall data size is on the order of 40 GB
    • images of varying sizes, from 480x640 to 3096x4128 pixels
  • Google Cloud & YouTube-8M Video Understanding Challenge
    • 6.1M popular YouTube videos (over 400k hours) with almost 4000 classes; the video labels reflect the main semantic topics of the video content and are organized in a graph with 24 top-level verticals
    • raw video input data is not needed, since over 2.6 billion precomputed audio-visual features are available (at the frame or video level)
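
Because the cervical images vary widely in size, a fixed-size network input needs some form of size normalization. The source does not state which preprocessing will be used; the following is a minimal sketch of one common option, an aspect-preserving resize followed by padding to a square. The function name `letterbox_shape` and the target side of 224 pixels are assumptions chosen for illustration only.

```python
from typing import Tuple


def letterbox_shape(height: int, width: int, target: int = 224) -> Tuple[int, int, int, int]:
    """Return (new_height, new_width, pad_height, pad_width).

    The image is scaled so that its longest side equals `target`
    (an assumed input size), and the shorter side is then padded
    so that the result is a target x target square.
    """
    longest = max(height, width)
    # Scale both sides by the same factor to keep the aspect ratio.
    new_height = max(1, round(height * target / longest))
    new_width = max(1, round(width * target / longest))
    # Total padding still needed along each axis to reach the square.
    pad_height = target - new_height
    pad_width = target - new_width
    return new_height, new_width, pad_height, pad_width


if __name__ == "__main__":
    # The smallest and largest image sizes quoted in the list above.
    for h, w in [(480, 640), (3096, 4128)]:
        print((h, w), "->", letterbox_shape(h, w))
```

Both extremes quoted above share the same 3:4 aspect ratio, so under these assumptions they map to the same resized shape and padding.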

The other two datasets offer new opportunities to study HPC problems such as network routing and data placement with a data-driven approach.

  • MPI Routing problem: statistics obtained from network devices and Slurm outputs in large compute clusters. For each job, the collected data includes the routing tables between the compute nodes involved in the job, the related job performance statistics, and metadata about the job (type of code, owner of the run, number of cores needed).
  • Events Analysis for Data Placement problem: statistics collected from a parallel file system, including information about the usage patterns of files and directories, i.e. how they are used and their lifetime (when and how they are used, and when they are permanently deleted).
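
To make the shape of these two datasets more concrete, the records described above could be represented roughly as follows. This is only an illustrative sketch: the source lists the kinds of information collected but not a schema, so every class name, field name, and type below is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class JobRecord:
    """One job in the MPI Routing dataset (hypothetical layout)."""
    job_id: str
    code_type: str      # metadata: type of code
    owner: str          # metadata: owner of the run
    num_cores: int      # metadata: number of cores needed
    # Routing table: (source node, destination node) -> list of hops.
    routes: Dict[Tuple[str, str], List[str]] = field(default_factory=dict)
    # A single performance statistic, used here as a stand-in.
    runtime_seconds: Optional[float] = None


@dataclass
class FileUsageRecord:
    """Usage history of one file or directory (hypothetical layout)."""
    path: str
    created_at: float                                   # timestamp in seconds
    access_times: List[float] = field(default_factory=list)
    deleted_at: Optional[float] = None                  # None while the file exists

    def lifetime(self) -> Optional[float]:
        """Seconds between creation and deletion, or None if not deleted."""
        if self.deleted_at is None:
            return None
        return self.deleted_at - self.created_at
```

Under these assumptions, a learning task would pair such records with a target, for example job runtime for routing or file lifetime for data placement.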

The use cases will build on state-of-the-art deep learning methodology, using feedforward multi-layer perceptron (MLP) and convolutional neural network (CNN) architectures designed for these specific classification problems (supervised learning, where data such as images or videos is paired with class descriptions, i.e. labels). This is an active area of research, especially if the tasks considered in this project are reformulated into more challenging problem formulations. For images and videos, the main extensions of the base problem are incremental (online) learning and transfer learning. In addition, two further families of techniques will be considered: boosting techniques that rely on uneven usage of the available data (some data samples are accessed much more often than others), and ensemble approaches that design and apply multiple neural networks, each acting as a member of a voting panel of experts. Finally, in the spirit of recent trends in deep learning, an effort will be made to constrain the size of the proposed neural network models without sacrificing classification performance. Smaller models also partly facilitate the interpretability of the network’s predictive mechanisms.
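
Two of the ideas mentioned above, the voting panel of experts and the uneven usage of data samples, can be illustrated in a few lines. The sketch below is not the project's implementation: it assumes hard (label) majority voting with an arbitrary tie-break, and it assumes that per-sample weights are already given rather than derived by a boosting algorithm. The function names are hypothetical.

```python
import random
from collections import Counter
from typing import List, Sequence


def majority_vote(member_predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine the class labels predicted by several ensemble members.

    member_predictions[m][i] is the label that member m assigns to sample i.
    Ties are broken in favour of the smallest label (an arbitrary choice).
    """
    n_samples = len(member_predictions[0])
    voted = []
    for i in range(n_samples):
        votes = Counter(preds[i] for preds in member_predictions)
        best = max(votes.values())
        voted.append(min(label for label, count in votes.items() if count == best))
    return voted


def draw_weighted_batch(n_samples: int, weights: Sequence[float],
                        batch_size: int, rng: random.Random) -> List[int]:
    """Draw sample indices with replacement, proportionally to `weights`.

    Samples with larger weights (for example, those that are currently
    misclassified) are accessed more often than the others.
    """
    return rng.choices(range(n_samples), weights=weights, k=batch_size)


if __name__ == "__main__":
    # Three hypothetical networks voting on three samples.
    panel = [[0, 1, 2], [0, 2, 2], [1, 1, 2]]
    print(majority_vote(panel))
    # Sample 2 is three times as likely to be drawn as sample 1.
    print(draw_weighted_batch(3, [0.0, 1.0, 3.0], 8, random.Random(0)))
```

The weighted draw matters for the HPC side of the project as well, since uneven sample access changes the I/O pattern compared with a uniform pass over the dataset.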