Publications

r-HUMO: A Risk-aware Human-Machine Cooperation Framework for Entity Resolution with Quality Guarantees. IEEE Transactions on Knowledge and Data Engineering (TKDE), 2018.
Boyi Hou, Qun Chen, Zhaoqiang Chen, Youcef Nafa, Zhanhuai Li
[Abstract]  [Bibtex]  [PDF]  [Technical report]

Even though many approaches have been proposed for entity resolution (ER), it remains very challenging to enforce quality guarantees. To this end, we propose a risk-aware HUman-Machine cOoperation framework for ER, denoted by r-HUMO. Built on the existing HUMO framework, r-HUMO similarly enforces both precision and recall guarantees by partitioning an ER workload between the human and the machine. However, r-HUMO is the first solution that optimizes the process of human workload selection from a risk perspective. It iteratively selects the human workload via real-time risk analysis based on both the human-labeled results and the pre-specified machine metric. In this paper, we first introduce the r-HUMO framework and then present the risk model that prioritizes the instances for manual inspection. Finally, we empirically evaluate r-HUMO's performance on real data. Our extensive experiments show that r-HUMO is effective in enforcing quality guarantees and that, compared with the state-of-the-art alternatives, it can achieve the desired quality control at reduced human cost.
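
For a concrete picture of the iterative selection loop described above, here is a minimal Python sketch. It is not the published r-HUMO algorithm; risk_of, quality_met, label_fn, and the batch size are hypothetical placeholders:

# Hypothetical sketch of risk-driven human workload selection.
# `risk_of(pair, human_labeled)` estimates the chance a machine label is
# wrong; `quality_met` checks the enforced precision/recall guarantees.
def select_human_workload(pairs, risk_of, quality_met, label_fn, batch_size=50):
    human_labeled = {}                    # pair -> human-verified label
    machine_labeled = set(pairs)          # pairs still left to the machine
    while machine_labeled and not quality_met(human_labeled, machine_labeled):
        # Real-time risk analysis: re-rank using the labels gathered so far.
        ranked = sorted(machine_labeled,
                        key=lambda p: risk_of(p, human_labeled),
                        reverse=True)
        for pair in ranked[:batch_size]:  # riskiest instances go to the human
            machine_labeled.discard(pair)
            human_labeled[pair] = label_fn(pair)
    return human_labeled, machine_labeled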

@article{hou2018rhumo,
title={r-HUMO: A Risk-aware Human-Machine Cooperation Framework for Entity Resolution with Quality Guarantees},
author={Hou, Boyi and Chen, Qun and Chen, Zhaoqiang and Nafa, Youcef and Li, Zhanhuai},
journal={IEEE Transactions on Knowledge and Data Engineering (TKDE)},
year={2018},
doi={10.1109/TKDE.2018.2883532},
publisher={IEEE},
}

Relational data imputation with quality guarantee. Information Sciences, 2018.
Fengfeng Fan, Zhanhuai Li, Qun Chen, Lei Chen
[Abstract]  [Bibtex]  [PDF]

Missing attribute values are prevalent in real relational data, especially the data extracted from the Web. Their accurate imputation is important for ensuring high quality of data analytics. Even though many techniques have been proposed for this task, none of them provides a flexible mechanism for quality control. The lack of a quality guarantee may result in many missing values being filled with wrong values, which can easily bias subsequent data analysis. In this paper, we first propose a novel probabilistic framework based on the concept of Generalized Feature Dependency (GFD). By exploiting the monotonicity between imputation precision and match probability, it enables a flexible mechanism for quality control. We then present the imputation model with precision guarantee and the techniques to maximize recall while meeting a user-specified precision requirement. Finally, we evaluate the performance of the proposed approach on real data. Our extensive experiments show that it has a performance advantage over the state-of-the-art alternatives and, most importantly, that its quality control mechanism is effective.
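
The monotonicity assumption has a direct operational consequence: if precision is non-decreasing in match probability, then the lowest probability threshold that still meets the precision target maximizes recall. A minimal sketch, assuming the threshold is estimated from a labeled validation sample (the paper's own estimation procedure differs):

# Sketch: find the smallest probability threshold whose estimated precision
# meets the target; by monotonicity, a lower threshold admits more
# imputations and therefore yields higher recall.
def pick_threshold(validation, target_precision):
    # validation: list of (match_probability, imputation_was_correct)
    validation = sorted(validation, key=lambda x: x[0], reverse=True)
    correct, best = 0, None
    for total, (prob, ok) in enumerate(validation, start=1):
        correct += 1 if ok else 0
        if correct / total >= target_precision:
            best = prob          # accept everything at or above this value
    return best                  # None means the target is unattainable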

@article{FAN2018305,
title={Relational data imputation with quality guarantee},
author={Fengfeng Fan and Zhanhuai Li and Qun Chen and Lei Chen},
journal={Information Sciences},
volume={465},
pages={305--322},
year={2018},
issn={0020-0255},
doi={10.1016/j.ins.2018.07.017},
url={http://www.sciencedirect.com/science/article/pii/S0020025518305309},
}

Improving Machine-based Entity Resolution with Limited Human Effort: A Risk Perspective. International Workshop on Real-Time Business Intelligence and Analytics, 2018.
Zhaoqiang Chen, Qun Chen, Boyi Hou, Murtadha Ahmed, Zhanhuai Li
[Abstract]  [Bibtex]  [PDF]  [Technical report]

Pure machine-based solutions usually struggle with challenging classification tasks such as entity resolution (ER). To alleviate this problem, a recent trend is to involve the human in the resolution process, most notably through crowdsourcing. However, it remains very challenging to effectively improve machine-based entity resolution with limited human effort. In this paper, we investigate the problem of human and machine cooperation for ER from a risk perspective. We propose to select the machine-labeled instances at high risk of being mislabeled for manual verification. For this task, we present a risk model that takes into consideration the human-labeled instances as well as the output of machine resolution. Finally, we evaluate the performance of the proposed risk model on real data. Our experiments demonstrate that it can pick out the mislabeled instances with considerably higher accuracy than the existing alternatives. Provided with the same human cost budget, it can also achieve better resolution quality than the state-of-the-art approach based on active learning.
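
To make the risk idea tangible, here is one illustrative way to score risk; it is not the paper's risk model. The sketch compares the machine's confidence with the empirical accuracy observed on instances a human has already verified, bucketed by confidence:

from collections import defaultdict

# Illustrative risk score: the estimated chance that a machine label is
# wrong, calibrated on human-verified instances with similar confidence.
def calibrated_risk(confidence, verified, bins=10):
    # verified: list of (machine_confidence, machine_label_was_correct)
    bucket = lambda c: min(int(c * bins), bins - 1)
    seen, hits = defaultdict(int), defaultdict(int)
    for c, ok in verified:
        seen[bucket(c)] += 1
        hits[bucket(c)] += 1 if ok else 0
    b = bucket(confidence)
    if seen[b] == 0:
        return 1.0 - confidence       # no evidence yet: trust the machine
    return 1.0 - hits[b] / seen[b]    # observed error rate in this bucket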

@inproceedings{chen2018risker,
title={Improving Machine-based Entity Resolution with Limited Human Effort: A Risk Perspective},
author={Chen, Zhaoqiang and Chen, Qun and Hou, Boyi and Ahmed, Murtadha and Li, Zhanhuai},
booktitle={Proceedings of the International Workshop on Real-Time Business Intelligence and Analytics},
series={BIRTE'18},
numpages={5},
year={2018},
doi={10.1145/3242153.3242156},
publisher={ACM},
}

Enabling Quality Control for Entity Resolution: A Human and Machine Cooperation Framework. ICDE 2018.
Zhaoqiang Chen, Qun Chen, Fengfeng Fan, Yanyan Wang, Zhuo Wang, Youcef Nafa, Zhanhuai Li, Hailong Liu, Wei Pan
[Abstract]  [Bibtex]  [PDF]  [Slides]

Even though many machine algorithms have been proposed for entity resolution, it remains very challenging to find a solution with quality guarantees. In this paper, we propose a novel Human and Machine cOoperation (HUMO) framework for entity resolution (ER), which divides an ER workload between the machine and the human. HUMO enables a mechanism for quality control that can flexibly enforce both precision and recall levels. We introduce the optimization problem of HUMO, minimizing human cost given a quality requirement, and then present three optimization approaches: a conservative baseline one purely based on the monotonicity assumption of precision, a more aggressive one based on sampling, and a hybrid one that takes advantage of the strengths of both. Finally, we demonstrate by extensive experiments on real and synthetic datasets that HUMO can achieve high-quality results with a reasonable return on investment (ROI) in terms of human cost, and that it performs considerably better than the state-of-the-art alternatives in quality control.
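
A simplified view of the workload division under the monotonicity assumption: pairs the machine is confident about keep their machine label, and the ambiguous middle region goes to the human. The probability bounds lo and hi below are exactly what HUMO's optimization approaches compute; here they are plain parameters:

# Sketch: split an ER workload by machine-estimated match probability.
def divide_workload(pairs, match_prob, lo, hi):
    machine_match, machine_unmatch, human = [], [], []
    for p in pairs:
        prob = match_prob(p)
        if prob >= hi:
            machine_match.append(p)       # machine: match
        elif prob <= lo:
            machine_unmatch.append(p)     # machine: non-match
        else:
            human.append(p)               # human: manual inspection
    return machine_match, machine_unmatch, human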

@INPROCEEDINGS{chen2018humo,
author={Chen, Zhaoqiang and Chen, Qun and Fan, Fengfeng and Wang, Yanyan and Wang, Zhuo and Nafa, Youcef and Li, Zhanhuai and Liu, Hailong and Pan, Wei},
booktitle={2018 IEEE 34th International Conference on Data Engineering (ICDE)},
title={Enabling Quality Control for Entity Resolution: A Human and Machine Cooperation Framework},
year={2018},
pages={1156--1167},
doi={10.1109/ICDE.2018.00107},
month={April},
}

SenHint: A Joint Framework for Aspect-level Sentiment Analysis by Deep Neural Networks and Linguistic Hints. WWW 2018.
Yanyan Wang, Qun Chen, Xin Liu, Murtadha Ahmed, Zhanhuai Li, Wei Pan, Hailong Liu
[Abstract]  [Bibtex]  [PDF]  [Homepage]

The state-of-the-art techniques for aspect-level sentiment analysis focus on feature modeling using a variety of deep neural networks (DNN). Unfortunately, their practical performance may fall short of expectations due to the semantic complexity of natural languages. Motivated by the observation that linguistic hints (e.g. explicit sentiment words and shift words) can be strong indicators of sentiment, we present a joint framework, SenHint, which integrates the output of deep neural networks and the implication of linguistic hints into a coherent reasoning model based on a Markov Logic Network (MLN). In SenHint, linguistic hints are used in two ways: (1) to identify easy instances, whose sentiment can be automatically determined by the machine with high accuracy; (2) to capture implicit relations between aspect polarities. We also empirically evaluate the performance of SenHint on both English and Chinese benchmark datasets. Our experimental results show that SenHint can effectively improve accuracy compared with the state-of-the-art alternatives.
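
To make the notion of an "easy instance" concrete, here is a toy hint-based rule in the spirit of the paper. The word lists and the rule itself are illustrative assumptions; SenHint's actual hints and its MLN reasoning are far richer:

# Toy linguistic-hint rule: an instance is "easy" if it contains an
# explicit sentiment word, with shift words flipping polarity.
POSITIVE = {"good", "great", "excellent", "tasty"}
NEGATIVE = {"bad", "awful", "terrible", "bland"}
SHIFT = {"not", "never", "hardly"}

def easy_polarity(tokens):
    """Return +1 or -1 if the hints decide the polarity, None if 'hard'."""
    polarity, shifted = None, False
    for tok in tokens:
        if tok in SHIFT:
            shifted = True               # a shift word flips the next hit
        elif tok in POSITIVE:
            polarity = -1 if shifted else +1
            shifted = False
        elif tok in NEGATIVE:
            polarity = +1 if shifted else -1
            shifted = False
    return polarity                      # None: defer to joint MLN inference

print(easy_polarity("the soup was not tasty".split()))   # prints -1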

@inproceedings{DBLP:conf/www/WangCLALPL18,
author = {Yanyan Wang and
Qun Chen and
Xin Liu and
Murtadha H. M. Ahmed and
Zhanhuai Li and
Wei Pan and
Hailong Liu},
title = {SenHint: {A} Joint Framework for Aspect-level Sentiment Analysis by
Deep Neural Networks and Linguistic Hints},
booktitle = {Companion Proceedings of The Web Conference 2018 ({WWW} 2018),
Lyon, France, April 23-27, 2018},
pages = {207--210},
year = {2018},
url = {http://doi.acm.org/10.1145/3184558.3186980},
doi = {10.1145/3184558.3186980},
}

GraphU: A Unified Vertex-Centric Parallel Graph Processing Platform. ICDCS 2018.
Jing Su, Qun Chen, Zhuo Wang, Murtadha Ahmed, Zhanhuai Li
[Abstract]  [PDF]  [Homepage]

Many synchronous and asynchronous distributed platforms based on the Bulk Synchronous Parallel (BSP) model have been built for large-scale vertex-centric graph processing. Unfortunately, a program designed for a synchronous platform may not work properly on an asynchronous one. As a result, given the same problem, end users may be required to design different parallel algorithms for different platforms. Recently, we have proposed a unified programming model, DFA-G (Deterministic Finite Automaton for Graph processing), which expresses the computation at a vertex as a series of message-driven state transitions. It has the attractive property that any program modeled after it can run properly across synchronous and asynchronous platforms. In this demo, we first propose a framework of complexity analysis for DFA-G automata and show that it can significantly facilitate complexity analysis of asynchronous programs. Because the existing BSP platforms do not support efficient DFA-G execution, we then develop a new prototype platform, GraphU. GraphU is built on the popular open-source Giraph project, but it entirely removes synchronization barriers and decouples remote communication from vertex computation. Finally, we empirically evaluate the performance of various DFA-G programs on GraphU in a comparative study. Our experiments validate the efficacy of the proposed complexity analysis approach and the efficiency of GraphU.

Reasoning about attribute value equivalence in relational data. Information Systems, 2018.
Fengfeng Fan, Zhanhuai Li, Qun Chen, Lei Chen
[Abstract]  [Bibtex]  [PDF]

In relational data, identifying the distinct attribute values that refer to the same real-world entities is an essential task for many data cleaning and mining applications (e.g., duplicate record detection and functional dependency mining). The state-of-the-art approaches for attribute value matching are mainly based on string similarity among attribute values. However, these approaches may not perform well in the cases where the specified string similarity metric is not a reliable indicator for attribute value equivalence. To alleviate such limitations, we propose a new framework for attribute value matching in relational data. Firstly, we propose a novel probabilistic approach to reason about attribute value equivalence by value correlation analysis. We also propose effective methods for probabilistic equivalence reasoning with multiple attributes. Next, we present a unified framework, which incorporates both string similarity measurement and value correlation analysis by evidential reasoning. Finally, we demonstrate the effectiveness of our framework empirically on real-world datasets. Through extensive experiments, we show that our framework outperforms the string-based approaches by considerable margins on matching accuracy and achieves the desired efficiency.
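
As a toy rendition of the correlation idea: two values of the same attribute that denote the same entity should co-occur with similar values in the other attributes. The Jaccard measure below is our simplification, not the paper's probabilistic model:

# Sketch: score equivalence of two values of attribute `attr` by comparing
# the (attribute, value) contexts they co-occur with across all tuples.
def correlation_score(rows, attr, v1, v2):
    ctx = lambda v: {(a, row[a]) for row in rows if row[attr] == v
                     for a in row if a != attr}
    c1, c2 = ctx(v1), ctx(v2)
    return len(c1 & c2) / len(c1 | c2) if c1 | c2 else 0.0

rows = [{"name": "IBM", "city": "Armonk"},
        {"name": "Intl. Business Machines", "city": "Armonk"}]
print(correlation_score(rows, "name", "IBM", "Intl. Business Machines"))  # 1.0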

@article{Fan2018Reasoning,
title={Reasoning about Attribute Value Equivalence in Relational Data},
author={Fan, Fengfeng and Li, Zhanhuai and Chen, Qun and Chen, Lei},
journal={Information Systems},
volume={75},
year={2018},
}

Reducing Partition Skew on MapReduce: An Incremental Allocation Approach. Frontiers of Computer Science, 2018.
Zhuo Wang, Qun Chen, Bo Suo, Wei Pan, Zhanhuai Li
[Abstract]  [PDF]

MapReduce, a parallel computational model, has been widely used in processing big data in a distributed cluster. Consisting of alternate Map and Reduce phases, MapReduce has to shuffle the intermediate data generated by mappers to reducers. The key challenge of ensuring balanced workload on MapReduce is to reduce partition skew among reducers without detailed distribution information on mapped data. In this paper, we propose an incremental data allocation approach to reduce partition skew among reducers on MapReduce. The proposed approach divides mapped data into many micro-partitions and gradually gathers the statistics on their sizes in the process of mapping. The micro-partitions are then incrementally allocated to reducers in multiple rounds. We propose to execute incremental allocation in two steps: micro-partition scheduling and micro-partition allocation. We propose a Markov Decision Process (MDP) model to optimize the problem of multiple-round micro-partition scheduling for allocation commitment. We present an optimal solution with the time complexity of O(K·N²), in which K represents the number of allocation rounds and N represents the number of micro-partitions. Alternatively, we also present a greedy but more efficient algorithm with the time complexity of O(K·N ln N). Then, we propose a Min-Max programming model to handle the allocation mapping between micro-partitions and reducers, and present an effective heuristic solution due to its NP-completeness. Finally, we have implemented the proposed approach on Hadoop, an open-source MapReduce platform, and empirically evaluated its performance. Our extensive experiments show that compared with the state-of-the-art approaches, the proposed approach achieves considerably better data load balance among reducers as well as overall better parallel performance.
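
A greedy allocation of this flavor resembles classic longest-processing-time scheduling. The heap-based sketch below is an illustration of that general idea, not the paper's exact O(K·N ln N) algorithm:

# Sketch: greedily assign micro-partitions to reducers, largest first,
# always to the currently least-loaded reducer (LPT-style heuristic).
import heapq

def allocate(micro_sizes, num_reducers):
    heap = [(0, r) for r in range(num_reducers)]    # (load, reducer id)
    heapq.heapify(heap)
    assignment = {}
    for pid, size in sorted(enumerate(micro_sizes),
                            key=lambda x: x[1], reverse=True):
        load, r = heapq.heappop(heap)               # least-loaded reducer
        assignment[pid] = r
        heapq.heappush(heap, (load + size, r))
    return assignment

# Example: 8 micro-partitions over 3 reducers; resulting loads 12, 13, 12.
print(allocate([9, 7, 6, 5, 4, 3, 2, 1], 3))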

GL-RF: A Reconciliation Framework for Label-free Entity Resolution. Frontiers of Computer Science, 2018.
Yaoli Xu, Zhanhuai Li, Qun Chen, Fengfeng Fan
[Abstract]  [PDF]

Entity resolution (ER) plays a fundamental role in data integration and cleaning. Even though many approaches have been proposed for ER, their performance on data sets with different characteristics may vary significantly. Given a practical ER workload, it remains very challenging to select an appropriate resolution technique, while applying multiple techniques simultaneously would result in inconsistent resolutions. In this paper, we study the problem of how to reconcile the inconsistent results of different label-free resolution techniques. We first propose a generic label-free reconciliation framework, denoted by GL-RF. The proposed framework does not require manually labeled pairs, but reasons about the match status of the inconsistent pairs purely based on the implicit information contained in the consistent pairs. We then formalize the reconciliation problem and present an incremental K-neighbor influence algorithm. Finally, we empirically evaluate the performance of the proposed approach on real data sets in a comparative study. Our extensive experiments show that GL-RF performs considerably better than the state-of-the-art alternatives.

POOLSIDE: An Online Probabilistic Knowledge Base for Shopping Decision Support. CIKM 2017.
Ping Zhong, Zhanhuai Li, Qun Chen, Yanyan Wang, Lianping Wang, Murtadha HM Ahmed, Fengfeng Fan
[Abstract]  [Bibtex]  [PDF]  [Homepage]

We present POOLSIDE, an online PrObabilistic knOwLedge base for ShoppIng DEcision support, which provides an on-target recommendation service based on explicit user requirements. With a natural language interface, POOLSIDE can answer questions in real time. We describe how the knowledge base is constructed and how real-time response is enabled in POOLSIDE. Finally, we demonstrate that POOLSIDE gives high-quality product recommendations with high efficiency.

@inproceedings{DBLP:conf/cikm/ZhongLCWWAF17,
author = {Ping Zhong and
Zhanhuai Li and
Qun Chen and
Yanyan Wang and
Lianping Wang and
Murtadha H. M. Ahmed and
Fengfeng Fan},
title = {{POOLSIDE:} An Online Probabilistic Knowledge Base for Shopping Decision
Support},
booktitle = {Proceedings of the 2017 {ACM} on Conference on Information and Knowledge
Management, {CIKM} 2017, Singapore, November 06 - 10, 2017},
pages = {2559--2562},
year = {2017},
url = {http://doi.acm.org/10.1145/3132847.3133168},
doi = {10.1145/3132847.3133168},
}

A Human-and-Machine Cooperative Framework for Entity Resolution with Quality Guarantees. ICDE 2017.
Zhaoqiang Chen, Qun Chen, Zhanhuai Li
[Abstract]  [Bibtex]  [PDF]  [Homepage]

For entity resolution, it remains very challenging to find a solution with quality guarantees as measured by both precision and recall. In this demo, we propose a HUman-and-Machine cOoperative framework, denoted by HUMO, for entity resolution. Compared with the existing approaches, HUMO enables a flexible mechanism for quality control that can enforce both precision and recall levels. We also introduce the problem of minimizing human cost given a quality requirement and present the corresponding optimization techniques. Finally, we demonstrate on real datasets that HUMO achieves high-quality results with a reasonable return on investment (ROI) in terms of human cost.

@inproceedings{DBLP:conf/icde/ChenCL17,
author = {Zhaoqiang Chen and
Qun Chen and
Zhanhuai Li},
title = {A Human-and-Machine Cooperative Framework for Entity Resolution with
Quality Guarantees},
booktitle = {33rd {IEEE} International Conference on Data Engineering, {ICDE} 2017,
San Diego, CA, USA, April 19-22, 2017},
pages = {1405--1406},
year = {2017},
url = {https://doi.org/10.1109/ICDE.2017.197},
doi = {10.1109/ICDE.2017.197},
}

Parallelizing maximal clique and k-plex enumeration over graph data. Journal of Parallel and Distributed Computing, 2017.
Zhuo Wang, Qun Chen, Boyi Hou, Bo Suo, Zhanhuai Li, Wei Pan, Zachary G. Ives
[Abstract]  [Bibtex]  [PDF]

In a wide variety of emerging data-intensive applications, such as social network analysis, Web document clustering, entity resolution, and detection of consistently co-expressed genes in systems biology, the detection of dense subgraphs (cliques and k-plexes) is an essential component. Unfortunately, these problems are NP-complete and thus computationally intensive at scale; hence there is a need for techniques that distribute the computation across multiple machines, so that a computation too time-consuming on a single machine can be performed efficiently on a sufficiently large machine cluster. In this paper, we first propose a new approach for maximal clique and k-plex enumeration, which identifies dense subgraphs by binary graph partitioning. Given a connected graph G = (V, E), it has a space complexity of O(|E|) and a time complexity of O(|E|μ(G)), where μ(G) represents the number of different cliques (k-plexes) existing in G. It recursively divides a graph until each task is small enough to be processed in parallel. We then develop parallel solutions and demonstrate how graph partitioning can enable effective load balancing. Finally, we evaluate the performance of the proposed approach on real and synthetic graph data and show that it performs considerably better than existing approaches in both centralized and parallel settings. In the parallel setting, it achieves speedups of up to 10x over existing approaches on large graphs. Our parallel algorithms are primarily implemented and evaluated on MapReduce, a popular shared-nothing parallel framework, but they easily generalize to other shared-nothing or shared-memory parallel frameworks. The work presented in this paper extends our preliminary work on binary graph partitioning for maximal clique enumeration to handle maximal k-plex detection as well.
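
The binary-partition recursion for cliques has a compact statement: for any vertex v, every maximal clique of G either contains v (and lies in the closed neighborhood of v) or avoids v (and is a maximal clique of G - v that v cannot extend). Below is a purely centralized sketch of this idea; the paper's parallel, load-balanced version is substantially more involved:

# Sketch: enumerate maximal cliques by binary partitioning on a pivot v.
def maximal_cliques(adj):
    # adj: dict vertex -> set of neighbors (undirected, no self-loops)
    verts = set(adj)
    if not verts:
        return []
    if all(adj[u] >= verts - {u} for u in verts):   # G is complete
        return [verts]
    v = min(verts, key=lambda u: len(adj[u]))       # minimum-degree pivot
    sub = lambda keep: {u: adj[u] & keep for u in keep}
    with_v = maximal_cliques(sub(adj[v] | {v}))     # cliques containing v
    without_v = [c for c in maximal_cliques(sub(verts - {v}))
                 if not c <= adj[v]]                # else v would extend c
    return with_v + without_v

# Triangle 1-2-3 plus edge 3-4; output order may vary.
print(maximal_cliques({1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}))
# [{3, 4}, {1, 2, 3}]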

@article{Wang2017Parallelizing,
title={Parallelizing maximal clique and k-plex enumeration over graph data},
author={Wang, Zhuo and Chen, Qun and Hou, Boyi and Suo, Bo and Li, Zhanhuai and Pan, Wei and Ives, Zachary G.},
journal={Journal of Parallel and Distributed Computing},
volume={106},
pages={79--91},
year={2017},
}

DFA-G: A Unified Programming Model for Vertex-centric Parallel Graph Processing. ICDM 2016.
Bo Suo, Jing Su, Qun Chen, Zhanhuai Li, Wei Pan
[Abstract]  [Bibtex]  [PDF]  [Homepage]

Many systems have been built for vertex-centric parallel graph processing. Based on the Bulk Synchronous Parallel (BSP) model, they execute user-defined operations at vertices iteratively and exchange information between vertices by messages. Even though the native BSP systems (e.g. Pregel and Giraph) execute the operations on vertices synchronously within an iteration, many other platforms (e.g. Grace, Blogel and GraphHP) have proposed asynchronous execution for improved efficiency. However, they also bring about an undesirable side effect: a program designed for synchronous platforms may not run properly on asynchronous platforms. In this demo, we present DFA-G (Deterministic Finite Automaton for Graph processing), a unified programming model for vertex-centric parallel graph processing. Built on DFA, DFA-G expresses the computation at a vertex as a process of message-driven state transition. A program modeled after DFA-G can run properly on both synchronous and asynchronous platforms. We demonstrate how to build DFA-G models using a graphical user interface and also how to automatically translate them into BSP programs.
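
To give a flavor of the programming model, here is a toy vertex automaton in the DFA-G spirit. The states, transition rules, and driver API are illustrative assumptions, not the actual DFA-G interface:

# Toy message-driven vertex automaton: the vertex tracks the minimum label
# it has seen (as in connected components), and each incoming message
# drives one state transition, so the program behaves the same whether
# messages arrive synchronously or asynchronously.
class MinLabelVertex:
    def __init__(self, vid, neighbors):
        self.neighbors = neighbors
        self.label = vid             # current component label
        self.state = "WAIT"          # DFA states: WAIT <-> PROPAGATE

    def on_message(self, msg, send):
        # Transition 1: an improving message moves WAIT -> PROPAGATE.
        if msg < self.label:
            self.label = msg
            self.state = "PROPAGATE"
        # Transition 2: PROPAGATE emits messages and returns to WAIT.
        if self.state == "PROPAGATE":
            for n in self.neighbors:
                send(n, self.label)
            self.state = "WAIT"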

@inproceedings{DBLP:conf/icdm/SuoSCLP16,
author = {Bo Suo and
Jing Su and
Qun Chen and
Zhanhuai Li and
Wei Pan},
title = {{DFA-G:} {A} Unified Programming Model for Vertex-Centric Parallel
Graph Processing},
booktitle = {{IEEE} International Conference on Data Mining Workshops, {ICDM} Workshops
2016, December 12-15, 2016, Barcelona, Spain.},
pages = {1328--1331},
year = {2016},
url = {https://doi.org/10.1109/ICDMW.2016.0196},
doi = {10.1109/ICDMW.2016.0196},
}

A probabilistic ranking framework for web-based relational data imputation. Information Sciences, 2016.
Zhaoqiang Chen, Qun Chen, Jiajun Li, Zhanhuai Li, Lei Chen
[Abstract]  [Bibtex]  [PDF]

Due to the richness of information on the Web, there is increasing interest in searching the Web for missing attribute values in relational data. Web-based relational data imputation has to first extract multiple candidate values from the Web and then rank them by their matching probabilities. However, effective candidate ranking remains challenging because web documents are unstructured and popular search engines can only provide relevant, but not necessarily semantically matching, information. In this paper, we propose a novel probabilistic approach for ranking the web-retrieved candidate values. It can integrate various influence factors, e.g. snippet rank order, occurrence frequency, occurrence pattern, and keyword proximity, in a single framework by semantic reasoning. The proposed framework consists of a snippet influence model and a semantic matching model. The snippet influence model measures the influence of a snippet, and the semantic matching model measures the semantic similarity between a candidate value in a snippet and a missing relational value in a tuple. We also present effective probabilistic estimation solutions for both models. Finally, we empirically evaluate the performance of the proposed framework on real datasets. Our extensive experiments demonstrate that it outperforms the state-of-the-art techniques by considerable margins on imputation accuracy.
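
At a very high level, the two-model decomposition suggests a scoring rule of the following shape. This is only a sketch; both component functions are placeholders, and the paper's probabilistic estimation is considerably more elaborate:

# Sketch: rank candidate values by aggregating, over all retrieved
# snippets, the snippet's influence times the candidate's semantic
# match score within that snippet.
def rank_candidates(candidates, snippets, influence, match):
    # influence(snippet)        -> weight from rank order, frequency, ...
    # match(candidate, snippet) -> semantic similarity to the missing value
    scores = {c: sum(influence(s) * match(c, s) for s in snippets)
              for c in candidates}
    return sorted(candidates, key=scores.get, reverse=True)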

@article{Chen2016A,
title={A probabilistic ranking framework for web-based relational data imputation},
author={Chen, Zhaoqiang and Chen, Qun and Li, Jiajun and Li, Zhanhuai and Chen, Lei},
journal={Information Sciences},
volume={355--356},
number={C},
pages={152--168},
year={2016},
}

Automatic Web-based Relational Data Imputation. Frontiers of Computer Science, 2016.
Hailong Liu, Zhanhuai Li, Qun Chen, Zhaoqiang Chen
[Abstract]  [PDF]

Data incompleteness is one of the most important data quality problems in enterprise information systems. Most existing data imputation techniques deduce approximate values for the incomplete attributes by means of specific data quality rules or mathematical methods. Unfortunately, an approximation may be far from the truth, and when the observed data is inadequate, these techniques do not work well. The World Wide Web (WWW) has become the most important and most widely used information source, and several recent works have shown that using Web data can improve the quality of databases. In this paper, we propose a Web-based relational data imputation framework that tries to automatically retrieve real values from the WWW for the incomplete attributes. We take full advantage of the relations among different kinds of objects, based on the idea that the same kind of things have the same kind of relations with their relatives in a specific world. Our proposed techniques consist of two automatic query formulation algorithms and one graph-based candidate extraction model. Evaluations on two high-quality real datasets and one poor-quality real dataset demonstrate the effectiveness of our approach.

Gradual Machine Learning for Aspect-level Sentiment Analysis. 2018.
Yanyan Wang, Qun Chen, Jiquan Shen, Boyi Hou, Murtadha Ahmed, Zhanhuai Li
[Abstract]  [PDF]

Usually considered as a classification problem, aspect-level sentiment analysis can be very challenging on real data due to the semantic complexities of natural languages. The state-of-the-art solutions for aspect-level sentiment analysis are built on a variety of deep neural networks (DNN), whose efficacy depends on large amounts of accurately labeled training data. Unfortunately, high-quality labeled training data usually require expensive manual work, and are thus not readily available in many real scenarios. In this paper, we propose a novel learning paradigm, called gradual machine learning, which aims to enable effective machine learning without the requirement for manual labeling effort. It begins with some easy instances in a task, which can be automatically labeled by the machine with high accuracy, and then gradually labels the more challenging instances by iterative factor graph inference. In gradual machine learning, the hard instances in a task are gradually labeled in small stages based on the estimated evidential certainty provided by the labeled easier instances. Our extensive experiments on real benchmark data have shown that the performance of the proposed approach is considerably better than its unsupervised alternatives, and highly competitive compared with the state-of-the-art supervised DNN techniques. Using aspect-level sentiment analysis as a test case, we demonstrate that gradual machine learning is a promising paradigm potentially applicable to other challenging classification tasks requiring extensive labeling effort.
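
The paradigm has a simple skeleton: label what is easy, then repeatedly promote the unlabeled instances whose labels the current evidence makes most certain. A hedged sketch follows; in gradual machine learning the certainty estimate comes from factor graph inference, which is abstracted here as a callback:

# Sketch of the gradual labeling loop: start from automatically labeled
# easy instances, then iteratively label the instance whose inferred
# label is currently most certain, feeding it back as new evidence.
def gradual_learn(instances, label_easy, infer, batch=1):
    labeled = label_easy(instances)          # high-accuracy seed labels
    unlabeled = [i for i in instances if i not in labeled]
    while unlabeled:
        # infer(i, labeled) -> (label, certainty) given current evidence,
        # e.g., via factor graph inference over shared features.
        scored = [(i,) + tuple(infer(i, labeled)) for i in unlabeled]
        scored.sort(key=lambda t: t[2], reverse=True)
        for inst, lab, _ in scored[:batch]:  # promote the most certain
            labeled[inst] = lab
            unlabeled.remove(inst)
    return labeled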

GraphHP: A Hybrid Platform for Iterative Graph Processing. 2014.
Qun Chen, Zhanhuai Li, Beng Chin Ooi, Song Bai, Zhiying Gou
[Abstract]  [PDF]  [Homepage]  [Code]

The Bulk Synchronous Parallel (BSP) computational model has emerged as a popular parallel framework for building large-scale iterative graph processing systems. It is simple and yet practical, and its implementations (e.g., Pregel, Giraph and Hama) have so far shown good efficiency. However, its frequent synchronization and communication among the workers can cause substantial parallel inefficiency and hence affect its scalability. To alleviate the problem, we propose the GraphHP (Graph Hybrid Processing) platform, based on the programmer-friendly BSP abstraction, to optimize its synchronization and communication overhead.
We first propose a hybrid execution model that can effectively reduce the invocation frequency of distributed synchronization and communication. It differentiates between the computations within a graph partition and those across partitions, so that the computations within a partition can be performed in an in-memory pseudo-superstep iteration. We then demonstrate how the hybrid execution model can be easily implemented within the BSP abstraction, preserving its simple programming interface, by building the hybrid platform GraphHP on Hama, an open-source BSP implementation. Finally, we evaluate our GraphHP implementation using classical BSP applications and show that it performs significantly better than state-of-the-art BSP implementations.
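
The hybrid execution model can be paraphrased as: within each global superstep, iterate locally inside a partition until no intra-partition messages remain, and only then exchange cross-partition messages. A minimal single-process sketch of that control flow; the partitioning, vertex program, and message plumbing are all simplified assumptions, not GraphHP's implementation:

# Sketch of GraphHP-style hybrid execution: intra-partition messages are
# drained in an in-memory "pseudo-superstep" loop; only cross-partition
# messages incur a global superstep barrier.
def hybrid_run(pids, owner, compute, initial_msgs):
    # pids: partition ids; owner: vertex -> pid
    # compute(v, msg) -> list of (dst_vertex, out_msg); vertex state is
    # assumed to be updated inside compute.
    pending = list(initial_msgs)             # (dst_vertex, msg) pairs
    while pending:                           # one global superstep per pass
        cross = []                           # messages crossing partitions
        for pid in pids:
            local = [(v, m) for v, m in pending if owner[v] == pid]
            while local:                     # in-memory pseudo-supersteps
                v, m = local.pop()
                for dst, out in compute(v, m):
                    (local if owner[dst] == pid else cross).append((dst, out))
        pending = cross                      # barrier: exchange cross messages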