Comparative Study; Classification; Big Behavioral Data; High-Dimensional; Sparse
The predictive power of increasingly common large-scale behavioral data has been demonstrated by previous research. Such data capture human behavior through the actions and/or interactions of people. Their sparsity and ultra-high dimensionality pose significant challenges to state-of-the-art classification techniques. Moreover, no prior work has systematically explored the choice of methods with respect to the trade-off between classification performance and computational expense. This paper contributes in this direction through a benchmarking study: eleven classification models are compared on forty-one fine-grained behavioral data sets. Statistical performance comparisons, enriched with learning curve analyses, demonstrate two important findings. First, there is an inherent trade-off between generalization performance and training time, rendering the choice of an appropriate classifier dependent on computational constraints and data set characteristics. Well-regularized logistic regression achieves the best AUC, but it takes the longest to train; L2 regularization performs better than sparse L1 regularization. A similarity-based technique achieves an attractive generalization/time trade-off. Second, although the data sets used are large, the learning curve results show that, as a direct consequence of their high dimensionality and sparseness, significant value lies in collecting and analyzing even more data. This finding holds in both the instance and the feature dimensions, contrasting with learning curve studies on traditional data. The results of this study provide guidance for researchers and practitioners in selecting appropriate classification techniques, sample sizes, and data features, while also providing focus for scalable algorithm design in the face of large behavioral data.
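The L2-versus-L1 comparison above can be illustrated with a minimal sketch. This is not the paper's benchmark: it uses synthetic sparse, high-dimensional data as a stand-in for behavioral data, scikit-learn's `LogisticRegression`, and arbitrary sizes and regularization strength (`C=1.0`), all of which are assumptions for illustration only.

```python
# Minimal sketch (synthetic data, NOT the paper's benchmark): compare
# L2- vs L1-regularized logistic regression by AUC on sparse, high-dimensional input.
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)

# Synthetic stand-in for behavioral data: many features, ~1% non-zero entries.
n_samples, n_features = 2000, 5000
X = sparse_random(n_samples, n_features, density=0.01,
                  random_state=rng, format="csr")
w = rng.randn(n_features)  # hypothetical true weights
y = (X @ w + 0.5 * rng.randn(n_samples) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

aucs = {}
for penalty in ("l2", "l1"):
    clf = LogisticRegression(penalty=penalty, C=1.0,
                             solver="liblinear", max_iter=1000)
    clf.fit(X_tr, y_tr)
    aucs[penalty] = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(aucs)
```

On real behavioral data sets the paper reports L2 outperforming L1; whether that ordering appears on any particular synthetic draw depends on the data-generating process, so the sketch only shows how such a comparison is set up.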