Source Code Classification Using Bayesian Statistics
Theodore N. Carnahan
Dr. Jon Beck, Faculty Mentor
Identifying the authorship of computer source code is necessary in cases such as plagiarism detection and forensic analysis of virus code. Few strategies exist for making quantitative statements of authorship probability. Most work to date depends on statistical analysis or machine-learning approaches, and most of these systems rely on stylistic features of the code that are easily altered by source code formatting programs and pretty-printers. We present a new method based on Bayesian classification. We adapted a Bayesian spam-detection system to identify authorship and coupled it with a tokenizer written for the purpose. This system was trained with several different corpora of source code from textbooks, faculty, and students. The trained system was then used to predict authorship. We will present the architecture of the tokenizer and prediction system, along with the results generated, and conclude with a discussion of future research directions.
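The general idea of pairing a tokenizer with a Bayesian classifier can be illustrated with a minimal sketch. This is not the system described in the abstract: the regular-expression tokenizer, the `AuthorClassifier` class, the multinomial naive Bayes model, and the add-one smoothing are all assumptions chosen for illustration. Because the tokenizer discards whitespace, reindenting the code does not change the scores.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical tokenizer: identifiers, integer literals, and single
# punctuation characters. Whitespace is ignored entirely.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\s\w]")


def tokenize(source):
    """Split source code into a flat list of tokens."""
    return TOKEN_RE.findall(source)


class AuthorClassifier:
    """Multinomial naive Bayes over tokens (an illustrative assumption)."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # author -> token frequencies
        self.doc_counts = Counter()               # author -> training samples
        self.vocabulary = set()                   # every token seen in training

    def train(self, author, source):
        tokens = tokenize(source)
        self.token_counts[author].update(tokens)
        self.doc_counts[author] += 1
        self.vocabulary.update(tokens)

    def log_scores(self, source):
        """Return the unnormalized log posterior for each known author."""
        tokens = tokenize(source)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for author, counts in self.token_counts.items():
            total = sum(counts.values())
            # Log prior from the share of training samples by this author.
            score = math.log(self.doc_counts[author] / total_docs)
            for tok in tokens:
                # Add-one smoothing keeps unseen tokens from zeroing the score.
                score += math.log((counts[tok] + 1) / (total + vocab_size))
            scores[author] = score
        return scores

    def predict(self, source):
        scores = self.log_scores(source)
        return max(scores, key=scores.get)
```

In use, one would call `train(author, source)` once per labeled sample and then `predict(source)` on an unattributed file. The log scores can be normalized into probabilities if a quantitative statement of authorship is wanted.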
Keywords: source code, classification, analysis, Bayesian, virus, plagiarism
Topic(s): Computer Science
Presentation Type: Oral Paper
Session: 22-1
Location: VH 1408
Time: 10:30