Aida Mousavifar

PhD candidate

Data Summarization on Massive Data Streams

  Halls department, Hall 5
  Wednesday, 26 December 2018
  14:00 - 15:00


Many tasks in machine learning and data mining, such as data diversi cation, non-parametric
learning, kernel machines, clustering etc., require extracting a small but representative summary
from a massive dataset. Often, such problems can be posed as maximizing a submodular set function subject to a cardinality constraint. We consider this question in the streaming setting, where
elements arrive over time at a fast pace and thus we need to design an ecient, low-memory algorithm. One such method, proposed by Badanidiyuru et al. (2014), always finds a 0:5-approximate
solution. Can this approximation factor be improved? We answer this question armatively by
designing a new algorithm for streaming submodular maximization. It is the rst low-memory,
a single-pass algorithm that improves the factor 0:5, under the natural assumption that elements arrive in a random order. We also show that this assumption is necessary, i.e., that there is no such
algorithm with better than 0:5-approximation when elements arrive in arbitrary order. Our experiments demonstrate that our algorithm significantly outperforms the state of the art in applications
related to exemplar-based clustering, social graph analysis, and recommender systems.


Aida Mousavifar is a doctoral researcher in Theory of Computation Laboratory under supervision of Prof. Michael Kapralov at EPFL. She designed a fast algorithm for testing the cluster structure of big graphs in sublinear time. Specifically, given a target number of clusters and a measure of desired cluster quality, it distinguished between graphs that are admitting such clustering and graphs that are far from that by exploration of the graph in sublinear time. The work was joint with Microsoft Research Redmond and appeared in FOCS 2018.
She also designed a low-memory and efficient algorithm for streaming submodular maximization subject to a cardinality-constraint, that is a problem widely studied in machine learning and data mining, e.g. data diversification, clustering, etc. The experiments outperform state-of-the-art in applications related to social graph analysis and recommender systems.