Data Mining

metapath2vec: Scalable Representation Learning for Heterogeneous Networks

We study the problem of representation learning in heterogeneous networks. Its unique challenges come from the existence of multiple types of nodes and links, which limit the feasibility of the conventional network embedding techniques. We develop two scalable representation learning models, namely metapath2vec and metapath2vec++. The metapath2vec model formalizes meta-path-based random walks to construct the heterogeneous neighborhood of a node and then leverages a heterogeneous skip-gram model to perform node embeddings. The metapath2vec++ model further enables the simultaneous modeling of structural and semantic correlations in heterogeneous networks. Extensive experiments show that metapath2vec and metapath2vec++ are able to not only outperform state-of-the-art embedding models in various heterogeneous network mining tasks, such as node classification, clustering, and similarity search, but also discern the structural and semantic correlations between diverse network objects.
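The meta-path-based random walk described above can be sketched roughly as follows. This is a minimal illustration, not the released implementation: the function name `metapath_walk`, the dictionary-based graph layout, and the uniform choice among type-matching neighbors are all assumptions made for the example. The resulting walks would then be fed to a skip-gram model, which is not shown here.

```python
import random


def metapath_walk(adj, node_type, start, metapath, walk_length, rng=None):
    """Generate one random walk that follows a symmetric meta-path.

    adj        -- dict mapping each node to a list of its neighbors (assumed layout)
    node_type  -- dict mapping each node to its type label
    metapath   -- type sequence whose first and last entries match,
                  e.g. ["A", "P", "V", "P", "A"]
    """
    rng = rng or random.Random()
    if metapath[0] != metapath[-1]:
        raise ValueError("meta-path must start and end with the same type")
    if node_type[start] != metapath[0]:
        raise ValueError("start node type must match the first meta-path type")

    # Drop the repeated final type so the pattern can be cycled indefinitely.
    cycle = metapath[:-1]
    walk = [start]
    while len(walk) < walk_length:
        wanted = cycle[len(walk) % len(cycle)]
        # Only neighbors of the type the meta-path asks for are eligible.
        candidates = [v for v in adj.get(walk[-1], []) if node_type[v] == wanted]
        if not candidates:
            break  # dead end: no neighbor of the required type
        walk.append(rng.choice(candidates))
    return walk
```

For an author-paper-venue network, calling `metapath_walk(adj, node_type, "a1", ["A", "P", "V", "P", "A"], 9)` would yield a walk whose node types repeat the pattern A, P, V, P, A, P, V, P, A.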


Link: Github

WEKA for Imbalanced Data (SMOTE and HDDT)

We’ve modified WEKA (v3.7.14), the popular Java-based data mining software package, to include two methods developed specifically for learning from imbalanced data: Synthetic Minority Oversampling TEchnique (SMOTE), a popular sampling method for data preprocessing, and Hellinger Distance Decision Tree (HDDT), a skew-insensitive decision-tree algorithm for classification. In the provided WEKA implementation, SMOTE can be found as a supervised instance filter, while HDDT can be found as a tree-based classifier. The SMOTE filter implementation for WEKA is also available as a separate download.
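To make the two ideas concrete, here is a small Python sketch, independent of the WEKA code. It assumes numeric features only and a binary split; the function names `smote` and `hellinger_split_value` are invented for this example. The SMOTE part is simplified in that it picks base points at random instead of looping over every minority instance.

```python
import math
import random


def smote(minority, n_synthetic, k=5, rng=None):
    """Create synthetic minority points by interpolating toward near neighbors."""
    rng = rng or random.Random()
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    synthetic = []
    for _ in range(n_synthetic):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        # k nearest minority neighbors of the base point (Euclidean distance).
        others = [p for i, p in enumerate(minority) if i != base_idx]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # The new point lies at a random position on the segment base -> neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


def hellinger_split_value(pos_left, pos_right, neg_left, neg_right):
    """Hellinger distance between the class-conditional branch distributions.

    Each class is normalized by its own total, so the value does not
    depend on the class ratio (the source of the skew insensitivity).
    """
    pos_total = pos_left + pos_right
    neg_total = neg_left + neg_right
    if pos_total == 0 or neg_total == 0:
        return 0.0
    branches = ((pos_left, neg_left), (pos_right, neg_right))
    return math.sqrt(sum(
        (math.sqrt(p / pos_total) - math.sqrt(n / neg_total)) ** 2
        for p, n in branches
    ))
```

A split that sends all positives one way and all negatives the other reaches the maximum value of sqrt(2), whether the classes are balanced or skewed 1:100; a split that divides both classes in the same proportion scores 0.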

For more details on these methods, please consult the following publications:

Link: Download

Model Monitor (M2)

Model Monitor is a Java toolkit for the systematic evaluation of classifiers under changes in distribution. It provides methods for detecting distribution shifts in data, comparing the performance of multiple classifiers under shifts in distribution, and evaluating the robustness of individual classifiers to distribution change. As such, it allows users to determine the best model (or models) for their data under a number of potential scenarios. Additionally, Model Monitor is fully integrated with the WEKA machine learning environment, so that a variety of commodity classifiers can be used if desired.
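As a rough illustration of what detecting a distribution shift in a single numeric feature can look like, the sketch below computes the two-sample Kolmogorov-Smirnov statistic between a training sample and a test sample. This is an assumption chosen for the example: the description above does not say which tests Model Monitor applies, the toolkit itself is written in Java, and `ks_statistic` is a hypothetical name.

```python
def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two numeric samples.

    Returns a value in [0, 1]; 0 means the empirical distributions
    coincide, 1 means the samples do not overlap at all.
    """
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    largest_gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x, so that
        # i / len(a) and j / len(b) are the two empirical CDFs at x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        largest_gap = max(largest_gap, abs(i / len(a) - j / len(b)))
    return largest_gap
```

A large statistic for a feature suggests that its distribution differs between the two samples; turning the statistic into a decision would additionally require a significance threshold, which is left out here.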

Techniques implemented in this package come primarily from the following sources:

Links: Download | Manual

Perl/C SMOTE+Undersampling Wrapper Implementation

Learning from imbalanced data sets is a difficult problem from both the modeling and cost standpoints. In particular, when a class is of great interest but occurs relatively rarely, as with cases of fraud, instances of disease, and regions of interest in large-scale simulations, misclassifying the rare events carries a correspondingly high cost. Under such circumstances, models need both high minority-class accuracy and low total misclassification cost, so it becomes important to apply resampling and/or cost-based reweighting to improve prediction of the minority class. The question remains, however, of how to apply the sampling strategy effectively. To that end, we provide a wrapper paradigm that discovers the amount of resampling appropriate for a dataset. This method has produced favorable results compared to other imbalance methods and to several cost-sensitive learning methods, such as MetaCost. With it, we have obtained the lowest cost per test example of any result we are aware of for the KDD Cup 1999 intrusion detection dataset.
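The evaluation measure mentioned above, cost per test example, and the general wrapper idea of searching over resampling amounts can be sketched as follows. This is an illustrative stand-in written in Python, not the Perl/C wrapper itself: the plain minimum over a candidate list replaces whatever search procedure the package uses, and the names `cost_per_example` and `pick_resampling`, the cost values, and the `evaluate` callback are all hypothetical.

```python
def cost_per_example(y_true, y_pred, cost):
    """Average misclassification cost over a test set.

    cost maps (true_label, predicted_label) pairs to a numeric cost;
    correct predictions normally cost 0.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    total = sum(cost[(t, p)] for t, p in zip(y_true, y_pred))
    return total / len(y_true)


def pick_resampling(candidates, evaluate):
    """Return the candidate setting with the lowest evaluated cost.

    candidates -- iterable of (undersample_percent, smote_percent) pairs
    evaluate   -- caller-supplied function that resamples the training data
                  with the given setting, trains a model, and returns its
                  cost per example on held-out data (not implemented here)
    """
    return min(candidates, key=evaluate)
```

With a cost of 10 for missing a rare event and 1 for a false alarm, one miss and one false alarm among four test examples give a cost per example of (10 + 1) / 4 = 2.75.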

For more details on the wrapper method, please consult the following publication:

Link: Download

Condor Grid Analysis Software Package (GASP)

Whether you are a first-time Condor user or an advanced system administrator, job failure on the grid is inevitable. In a submission batch of 1,000 jobs, one might observe 500 job failures, leaving the user with several questions: Why are some jobs evicted multiple times? Why do some jobs create Shadow Exceptions? Is a group of machines incapable of running a particular submission? All of these are difficult to answer due to the scale of the machine pool and the number of jobs submitted. Failures may appear to occur at random, but often there is a pattern, and the Condor Grid Analysis Software Package (GASP) is the tool to help you find it.


Links: Download | Instructions