Overview
PECAN (Programming Encoder Classification Analysis Network) is a research initiative focused on improving programming language identification with encoder-only transformer models. The project aims to build language classifiers that are accurate, efficient, and scalable enough to support software engineering and code understanding tools.
Motivation
Existing tools such as GuessLang and GitHub Linguist provide limited accuracy and scalability when faced with modern, diverse codebases. PECAN addresses this gap by leveraging deep learning techniques to enhance generalization, handle multilingual repositories, and maintain lightweight model architectures suitable for real-world deployment.
Key Challenges Addressed
- Limited Accuracy: Current tools struggle with ambiguous code snippets and mixed-language files
- Scalability Issues: Existing solutions don't scale well to large, diverse codebases
- Model Efficiency: Need for lightweight models that maintain high accuracy while being deployment-ready
- Dataset Limitations: Lack of comprehensive, diverse training datasets for modern programming languages
Approach
Model Architecture
We train and evaluate multiple encoder-only transformer models that identify programming languages from raw code snippets. The focus is on lightweight architectures that can be deployed efficiently in production environments while still aiming for state-of-the-art accuracy.
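To make "raw code snippets" concrete, here is a minimal sketch of how a snippet might be turned into fixed-length model input. The project's actual tokenizer is not described here, so the byte-level scheme, the special-token IDs, and the name `encode_snippet` are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical ID layout, not taken from the PECAN codebase.
PAD_ID = 0        # fills unused positions
CLS_ID = 1        # prepended token whose encoder output would feed the classifier head
BYTE_OFFSET = 2   # raw byte value b maps to token ID b + BYTE_OFFSET
VOCAB_SIZE = 256 + BYTE_OFFSET


def encode_snippet(code: str, max_len: int = 512) -> Tuple[List[int], List[int]]:
    """Convert a raw code snippet into fixed-length input IDs and an attention mask."""
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    # Reserve one position for CLS, then truncate the UTF-8 bytes to fit.
    body = code.encode("utf-8")[: max_len - 1]
    ids = [CLS_ID] + [b + BYTE_OFFSET for b in body]
    mask = [1] * len(ids)
    # Right-pad so every snippet in a batch has the same length.
    padding = max_len - len(ids)
    return ids + [PAD_ID] * padding, mask + [0] * padding
```

A byte-level vocabulary is only one option; it avoids out-of-vocabulary tokens across languages at the cost of longer sequences than a learned subword tokenizer would produce.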
Dataset Development
We initially use the GuessLang dataset while constructing a much larger custom dataset of more than 42 million code samples, designed to capture language diversity, syntax variability, and real-world code structure.
Highlights: 42 million+ code samples; programming language coverage still expanding; distributed training.
Training & Evaluation
- Distributed Training: Implementing distributed training across multiple GPUs for efficient model development
- Comprehensive Evaluation: Developing a unified evaluation pipeline to compare model families, including our in-house trained models, pre-trained encoders, GuessLang, and GitHub Linguist
- Scalable Infrastructure: Using the PyTorch and Hugging Face ecosystems for experimentation and reproducibility
- Experiment Tracking: Leveraging Weights & Biases (W&B) for comprehensive experiment tracking, hyperparameter optimization, and model performance visualization
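The unified evaluation idea in the list above can be sketched as a harness that treats every system as a function from code to a language label and scores them all on the same labeled samples. This is an illustrative sketch only: the `Classifier` interface, the choice of accuracy and macro-F1 as metrics, and the toy classifiers are assumptions, and the toy functions are stand-ins rather than wrappers around GuessLang, GitHub Linguist, or any PECAN model.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Classifier = Callable[[str], str]   # hypothetical common interface: code -> label
Sample = Tuple[str, str]            # (code snippet, gold language label)


def evaluate(classifier: Classifier, samples: List[Sample]) -> Dict[str, float]:
    """Score one classifier with accuracy and macro-averaged F1."""
    if not samples:
        raise ValueError("samples must not be empty")
    tp, fp, fn = Counter(), Counter(), Counter()
    for code, gold in samples:
        pred = classifier(code)
        if pred == gold:
            tp[gold] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[gold] += 1   # gold label gains a false negative
    # Macro-F1 averages per-language F1 over the languages present in the gold labels.
    labels = sorted({gold for _, gold in samples})
    f1_scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1_scores.append(2 * tp[label] / denom if denom else 0.0)
    return {
        "accuracy": sum(tp.values()) / len(samples),
        "macro_f1": sum(f1_scores) / len(f1_scores),
    }


def compare(classifiers: Dict[str, Classifier], samples: List[Sample]) -> Dict[str, Dict[str, float]]:
    """Run every classifier on the same samples so the scores are directly comparable."""
    return {name: evaluate(clf, samples) for name, clf in classifiers.items()}


# Toy stand-ins used only to exercise the harness.
def keyword_baseline(code: str) -> str:
    return "Python" if ("def " in code or "print(" in code) else "Rust"


def always_python(code: str) -> str:
    return "Python"


if __name__ == "__main__":
    demo = [("def f(): pass", "Python"), ("fn main() {}", "Rust")]
    print(compare({"keyword": keyword_baseline, "constant": always_python}, demo))
```

Reporting macro-F1 alongside accuracy matters when some languages have far fewer samples than others, since a model can score high accuracy while failing on rare languages.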
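Technical Stack
PyTorch and Hugging Face for training and experimentation; Weights & Biases (W&B) for experiment tracking.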
Poster & CV
View the PECAN poster draft and download my CV.
PECAN Poster (Draft)
Research Goals
- Unified Benchmark: Establish a comprehensive benchmark for lightweight transformer models in programming language identification
- Computational Efficiency: Achieve a practical balance between accuracy and computational cost for real-world deployment
- Downstream Applications: Support AI-driven software analysis, development tools, and automated code understanding systems
- Open Source Contribution: Provide the research community with improved tools and datasets for programming language identification
Impact & Applications
PECAN could benefit several areas of software engineering and development:
- IDE Integration: Enhanced language detection for code editors and development environments
- Repository Analysis: Improved accuracy in analyzing large, multi-language codebases
- Code Search & Discovery: Better language-aware code search and recommendation systems
- AI Code Tools: Foundation for more accurate AI-powered code analysis and generation tools
- Security Analysis: Enhanced capability for automated security scanning across different programming languages