ABSTRACT
Commercial datasets are often large, relational, and dynamic. They contain many records of people, places, things, events, and their interactions over time. Such datasets are rarely structured appropriately for knowledge discovery, and they often contain variables whose meanings change across different subsets of the data. We describe how these challenges were addressed in a collaborative analysis project undertaken by the University of Massachusetts Amherst and the National Association of Securities Dealers (NASD). We describe several methods for data pre-processing that we applied to transform a large, dynamic, and relational dataset describing nearly the entirety of the U.S. securities industry, and we show how these methods made the dataset suitable for learning statistical relational models. To better utilize social structure, we first applied known consolidation and link formation techniques to associate individuals with branch office locations. In addition, we developed an innovative technique to infer professional associations by exploiting dynamic employment histories. Finally, we applied normalization techniques to create a suitable class label that adjusts for spatial, temporal, and other heterogeneity within the data. We show how these pre-processing techniques combine to provide the necessary foundation for learning high-performing statistical models of fraudulent activity.