2005 Student Research Conference:
18th Annual Student Research Conference

Mathematics and Computer Science

Source Code Classification Using Bayesian Statistics
Theodore N. Carnahan
Dr. Jon Beck, Faculty Mentor

Identifying the authorship of computer source code is necessary in cases such as plagiarism detection and forensic analysis of virus code. Few strategies exist for making quantitative statements of authorship probability. Most work to date depends on statistical analysis or machine-learning approaches. Most of these systems, however, rely on stylistic features of the code that are easily altered by source code formatting programs and pretty-printers. We present a new method based on Bayesian classification. A Bayesian spam-detection system was adapted to identify authorship and coupled with a tokenizer written for this purpose. The system was trained on several different corpora of source code from textbooks, faculty, and students, and the trained system was then used to predict authorship. The architecture of the tokenizer and prediction system will be presented, along with the results generated. Finally, future research directions will be discussed.
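
The abstract does not describe the tokenizer or the adapted spam filter in detail, so the following is only a minimal sketch of the general idea, assuming a multinomial naive Bayes model with add-one smoothing. The names `tokenize` and `AuthorClassifier`, the token pattern, and the sample snippets are all hypothetical and are not taken from the presented system. Because the tokenizer discards whitespace, reformatting a file with a pretty-printer leaves the token stream unchanged.

```python
import math
import re
from collections import Counter

# Hypothetical token pattern: identifiers/keywords, integers, single punctuation marks.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\s\w]")


def tokenize(source):
    """Split source text into tokens, ignoring all whitespace and layout."""
    return TOKEN_RE.findall(source)


class AuthorClassifier:
    """Multinomial naive Bayes over source tokens (an assumed model, for illustration)."""

    def __init__(self):
        self.token_counts = {}       # author -> Counter of token frequencies
        self.doc_counts = Counter()  # author -> number of training samples
        self.vocab = set()           # every token seen during training

    def train(self, author, source):
        tokens = tokenize(source)
        self.token_counts.setdefault(author, Counter()).update(tokens)
        self.doc_counts[author] += 1
        self.vocab.update(tokens)

    def posteriors(self, source):
        """Return a dict mapping each known author to P(author | tokens)."""
        tokens = tokenize(source)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        log_scores = {}
        for author, counts in self.token_counts.items():
            n_tokens = sum(counts.values())
            # Log prior from the share of training samples by this author.
            score = math.log(self.doc_counts[author] / total_docs)
            for tok in tokens:
                # Add-one smoothing keeps unseen tokens from zeroing the product.
                score += math.log((counts[tok] + 1) / (n_tokens + vocab_size))
            log_scores[author] = score
        # Normalize in log space to avoid underflow on long files.
        top = max(log_scores.values())
        weights = {a: math.exp(s - top) for a, s in log_scores.items()}
        norm = sum(weights.values())
        return {a: w / norm for a, w in weights.items()}


clf = AuthorClassifier()
clf.train("author_a", "for (int i = 0; i < n; i++) { total += a[i]; }")
clf.train("author_b", "while idx < len(items): acc = acc + items[idx]")
probs = clf.posteriors("for (int j = 0; j < m; j++) { sum += b[j]; }")
best_guess = max(probs, key=probs.get)
```

The returned probabilities sum to one over the authors seen in training, which is one way to express authorship as a quantitative probability; a real system would need far larger training corpora than these two single-line samples.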

Keywords: source code, classification, analysis, Bayesian, virus, plagiarism

Topic(s): Computer Science

Presentation Type: Oral Paper

Session: 22-1
Location: VH 1408
Time: 10:30
