2019 Graduate Academic Frontier Lecture Series (46) - Enabling High-performance Sampling for Big Data Processing


 

Speaker: Professor Jun Wang (王军), University of Central Florida, USA

Topic: Enabling High-performance Sampling for Big Data Processing

Abstract:

In this talk, we aim to demonstrate how to perform sampling in today's big data processing platforms. We enable both efficient and accurate approximations on arbitrary sub-datasets of a large dataset. Because caching offline samples for every sub-dataset incurs a prohibitive storage overhead, existing offline-sample-based systems provide high-accuracy results for only a limited number of sub-datasets, such as the most popular ones. On the other hand, current online-sample-based approximation systems, which generate samples at runtime, do not take into account the uneven storage distribution of a sub-dataset. They work well when a sub-dataset is uniformly distributed, but suffer from low sampling efficiency and poor estimation accuracy on unevenly distributed sub-datasets.

To address this problem, we develop a distribution-aware method called Sapprox. Our idea is to collect the number of occurrences of a sub-dataset in each logical partition of a dataset (its storage distribution) in the distributed system, and to use this information to facilitate online sampling. We have implemented Sapprox in the Hadoop ecosystem as an example system and open-sourced it on GitHub. Our comprehensive experimental results show that Sapprox can achieve a speedup of up to 20x over precise execution.
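The idea of using per-partition occurrence counts to guide online sampling can be illustrated with a small sketch. This is not Sapprox's actual implementation: it assumes a textbook probability-proportional-to-size scheme with a Hansen-Hurwitz estimator, and the function name `estimate_subdataset_sum`, its parameters, and the in-memory list-of-partitions layout are all hypothetical.

```python
import random


def estimate_subdataset_sum(partitions, occurrences, key, n_draws, rng):
    """Approximate the sum of values whose record key equals `key`.

    partitions  : list of partitions, each a list of (key, value) records
    occurrences : occurrences[i] = number of `key` records in partition i
                  (the "storage distribution"; assumed to be precomputed)
    n_draws     : number of partitions to draw, with replacement
    rng         : a random.Random instance
    """
    total = sum(occurrences)
    if total == 0:
        return 0.0  # the sub-dataset does not occur anywhere

    # Draw partitions with probability proportional to how many matching
    # records they hold, so partitions without the sub-dataset are never read.
    probs = [count / total for count in occurrences]
    drawn = rng.choices(range(len(partitions)), weights=probs, k=n_draws)

    estimate = 0.0
    for i in drawn:
        # Scan only the drawn partition and sum its matching values.
        partition_sum = sum(v for k, v in partitions[i] if k == key)
        # Hansen-Hurwitz: reweight by the inverse selection probability.
        estimate += partition_sum / probs[i]
    return estimate / n_draws


# Hypothetical data: sub-dataset "a" lives in only two of four partitions.
partitions = [
    [("b", 5), ("c", 1)],
    [("a", 2), ("a", 4), ("a", 6)],
    [("c", 7)],
    [("a", 10), ("b", 3)],
]
occurrences = [0, 3, 0, 1]
approx = estimate_subdataset_sum(partitions, occurrences, "a", 1000, random.Random(0))
```

In this toy data the true sum for "a" is 22, and the estimate converges toward it as `n_draws` grows. A uniform partition sampler would spend half of its draws on partitions that contain no "a" records at all, which is the inefficiency that distribution-aware sampling is meant to avoid.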

 

Time and venue: October 26, 2019, Information 235 (信息235)

Host college: College of Information Engineering

 

  


2019.10.17

Category: Training (培养工作)