Venue: The British Library, London, UK
Date: 9th December 2015
- Jon Crowcroft (Cambridge)
- Mark Handley (UCL)
- Graham Cormode (Warwick)
- Dan Olteanu (Oxford)
Many data scientists are not computer scientists. Thus we need to create systems that support:
- Ease of programming, across a wide range of data scales
- Seamless tolerance of inevitable system faults in large scale data centre systems
- Low latency, for responsive interactive data exploration of large data sets
From indexing the web for search and mining users’ opinions for recommendations, to simple large-scale statistical queries, hundreds of thousands of processors can be used in parallel through frameworks such as map/reduce, with extensions for stream and graph processing.
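To make the map/reduce model concrete, here is a minimal single-process sketch of word counting in plain Python. The function names are illustrative only; in a real framework (such as Hadoop or Spark) the map and reduce phases would run in parallel across many machines, with the shuffle performed by the framework.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group intermediate pairs by key, as the framework
    # would do between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine all counts for one word into a total.
    return key, sum(values)

def word_count(documents):
    pairs = chain.from_iterable(map_phase(d) for d in documents)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# word_count(["a b a", "b c"]) -> {"a": 2, "b": 2, "c": 1}
```

Word counting is “embarrassingly parallel”: each document can be mapped independently, and each key reduced independently, which is precisely why it fits this model so well.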
This vision of distributed computing only really works for “embarrassingly parallel” scenarios. The challenge for the research community is to build systems that support more complex models and algorithms which do not partition so easily into independent chunks, and to deliver answers in near real time, efficiently, on a data centre of a given size. Users want to integrate different tools (for example, R on Spark); they do not want to program for fault tolerance, yet as their tasks and data grow they will have to manage it; meanwhile, data science workloads resemble neither traditional batch processing nor the single-user interactive model.
These systems place novel requirements on data centre networking, operating systems, storage systems, databases, and programming languages and runtimes. This workshop addresses systems research for big data – where ‘big’ really means very large, of non-trivial dimensionality, possibly heterogeneous, and subject to potentially complex queries.
Key scientific questions
- Data centre systems architecture roadmap for 5-10 years out: if map/reduce is dead, what replaces it?
- Programming paradigms for big data: what are the key primitives to empower data scientists?
- Storage/database paradigms for big data: how do we make storage reliable, scalable and available?