Big Data Storage and Processing Infrastructures – SIF Research Master's Degree in Computer Science

This course provides an introduction to Big Data and Data Science platforms. We will introduce the main concepts, application domains, and research challenges associated with large-scale distributed big-data applications, both on the cloud and on edge devices. We will introduce the main storage and data-management models, such as MapReduce and its derivatives, as well as the associated technological platforms, such as Hadoop, Spark, and Flink. We will present the internals of these platforms and study comparative analyses that will enable students to choose the appropriate model and platform for a variety of application domains.
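
As a rough illustration of the MapReduce model mentioned above, the following single-machine Python sketch counts words in three phases (map, shuffle, reduce). The function names are illustrative assumptions, not the Hadoop API; a real platform would run each phase in parallel across many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)

def word_count(documents):
    # Hypothetical driver: runs the three phases sequentially in one process.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data on the cloud", "data on the edge"])
```

Here `counts` maps each word to its number of occurrences across both input strings.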

We will also consider the distributed nature of data-management applications, and explore the evolution of architectural models from peer-to-peer applications to the cloud and to the recent concept of edge computing.

In particular, we will study decentralized protocols that enable large-scale coordination and data dissemination, such as epidemic protocols, with applications such as recommender systems and video streaming. Finally, we will present recent research results in the context of hybrid cloud-edge architectures for large-scale machine learning, such as those of the Google Federated Learning team.
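
To give a feel for how an epidemic protocol spreads information, here is a minimal simulation sketch of push-style gossip. It assumes synchronous rounds, uniformly random peer selection, and no failures; the function name `push_gossip` and its parameters are hypothetical and chosen only for illustration.

```python
import random

def push_gossip(n_nodes, fanout, seed=0, max_rounds=1000):
    """Simulate push gossip and return the number of rounds until all
    nodes are informed (or max_rounds if that limit is reached first)."""
    rng = random.Random(seed)
    informed = {0}  # node 0 initially holds the message
    rounds = 0
    while len(informed) < n_nodes and rounds < max_rounds:
        newly_informed = set()
        for _ in informed:
            # Each informed node pushes the message to `fanout` random peers.
            for _ in range(fanout):
                newly_informed.add(rng.randrange(n_nodes))
        informed |= newly_informed
        rounds += 1
    return rounds

rounds = push_gossip(n_nodes=1000, fanout=2)
```

Under these assumptions, the number of informed nodes grows roughly geometrically in the early rounds, which is why such protocols need only a small number of rounds relative to the system size.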

Introduction to MapReduce and Hadoop
Physical infrastructures and software architectures to manage large-scale big-data platforms: challenges, design concepts, examples.
Beyond MapReduce: limitations, extensions, and post-Hadoop systems (Spark, Flink, …)
Comparative studies (architectures, performance, applications)
From Peer-to-Peer to the Cloud: DHTs to Key-Value Stores
Epidemic protocols for large-scale coordination: from video streaming to machine learning
Decentralized and Federated Learning: aggregation, distributed deep learning.
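
As an illustration of the aggregation step in federated learning, the sketch below shows one common rule, federated averaging: the server averages the clients' model weights, weighting each client by its number of local training samples. The function name and the flat-list model representation are assumptions made for this example.

```python
def federated_average(client_updates):
    """Aggregate client models by sample-weighted averaging.

    client_updates: list of (weights, n_samples) pairs, where `weights`
    is a flat list of floats of identical length for every client.
    """
    total_samples = sum(n for _, n in client_updates)
    dim = len(client_updates[0][0])
    return [
        sum(weights[i] * n for weights, n in client_updates) / total_samples
        for i in range(dim)
    ]

# Two hypothetical clients: the second holds three times as much data.
global_model = federated_average([([1.0, 2.0], 1), ([3.0, 4.0], 3)])
```

Only model weights and sample counts are sent to the server in this sketch; the raw training data stays on each client.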

Objectives

Teachers