Clean CaDET - Project Results

Platform Development

The Clean Code and Design Educational Tool (Clean CaDET) is a platform dedicated to the study of clean code. It presents a conglomerate of AI-powered tools for educators, learners, practitioners, and researchers studying clean code.

We've deployed our intelligent tutoring system to production: clean-cadet.tech. Our instance hosts software quality engineering lectures in Serbian. It maintains a userbase of various software engineering students from the Faculty of Technical Sciences in Novi Sad. The code behind the tool is open-source, along with the content (see tutor repository below).

We maintain several open-source repositories that host our platform and its ever-increasing set of capabilities:

The main platform repository contains several projects:
1. The code model project that parses C# code into an object model that is easier to analyze.
2. The smell detector project that hosts rules for code smell detection and adapters for smell detection machine learning models.
3. The dataset explorer project that has grown into a sophisticated application for analyzing code repositories and annotating smells. This project will soon be moved to a separate repository.
The tutor repository, which hosts the intelligent tutoring system specialized for the clean code domain.
The tutor front-end repository, which contains the client web application for interacting with the intelligent tutoring system.
The dataset explorer front-end repository, which contains the client web application for interacting with the dataset explorer.

Publications

1. Clean Code and Design Educational Tool

Prokić, S., Grujić, K.G., Luburić, N., Slivka, J., Kovačević, A., Vidaković, D., Sladić, G., 2021. In 2021 44th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO). IEEE.

Abstract: Many different code snippets can implement the same software feature. However, a significant subset of these possible solutions contains difficult-to-understand code that harms the software’s maintainability and evolution. Such low-quality code snippets directly harm profit, as frequent and fast code change enables businesses to seize new opportunities. Unfortunately, they are also prevalent in an industry that consists mostly of junior programmers.
We developed a platform called Clean CaDET to tackle the prevalence of low-quality code from two angles. The Smell Detector module presents a framework for integrating AI-based code quality assessment algorithms to identify low-quality code as the programmer is writing it. The Smart Tutor module hosts a catalog of educational content that helps the programmer understand the identified issue and suggests possible solutions. By combining the quality assessment with the educational aspect, our integrated solution presents a novel approach for increasing the quality of code produced by our industry.

2. The challenges of migrating an active learning classroom online in a crisis

Luburić, N., Slivka, J., Sladić, G. and Milosavljević, G., 2021. Computer Applications in Engineering Education.

Abstract: The coronavirus disease of 2019 (COVID‐19) pandemic has severely crippled our globalized society. Despite the chaos, much of our civilization continued to function, thanks to contemporary information and communication technologies. In education, this situation required instructors and students to abandon the traditional face‐to‐face lectures and move to a fully online learning environment. Such a transition is challenging, both for the teacher tasked with creating digital educational content, and the student who needs to study in a new and isolated working environment. As educators, we have experienced these challenges when migrating our university courses to an online environment. Through this paper, we look to assist educators with building and running an online course. Before we needed to transition online, we researched and followed the best practices to establish various digital educational elements in our online classroom. We present these elements, along with guidance regarding their development and use. Next, we designed an empirical study consisting of two surveys, focus group discussions, and observations to understand the factors that influenced students' engagement with our online classroom. We used the same study to evaluate students' perceptions regarding our digital educational elements. We report the findings and define a set of recommendations from these results to help educators motivate their students and develop engaging digital educational content. Although our research is motivated by the pandemic, our findings and contributions are useful to all educators looking to establish some form of online learning. This includes developers of massive open online courses and teachers promoting blended learning in their classrooms.

3. Towards a systematic approach to manual annotation of code smells

Slivka, J., Luburić, N., Prokić, S., Grujić, K.G., Kovačević, A., Sladić, G., Vidaković, D., 2023. Science of Computer Programming.

Abstract: Code smells are structures in code that indicate the presence of maintainability issues. A significant problem with code smells is their ambiguity. They are challenging to define, and software engineers have a different understanding of what a code smell is and which code suffers from code smells.
A solution to this problem could be an AI digital assistant that understands code smells and can detect and even resolve them. However, it is challenging to develop such an assistant as there are few usable datasets of code smells on which to train and evaluate it. Furthermore, the existing datasets suffer from issues that mainly arise from an unsystematic approach used for their construction.
Through this work, we address this issue by developing a procedure for the systematic manual annotation of code smells. We use this procedure to build a dataset of code smells. During this process, we refine the procedure and identify recommendations and risks for its use. The primary contribution is the proposed annotation model and procedure and the annotators’ experience report. The dataset and supporting tool are secondary contributions of our study. Notably, our dataset includes open-source projects written in the C# programming language, while almost all manually annotated datasets contain projects written in Java.

4. Automatic detection of Long Method and God Class code smells through neural source code embeddings

Kovačević, A., Slivka, J., Vidaković, D., Grujić, K.G., Luburić, N., Prokić, S. and Sladić, G., 2022. Expert Systems with Applications.

Abstract: Code smells are structures in code that often have a negative impact on its quality. Manually detecting code smells is challenging and researchers proposed many automatic code smell detectors. Most of the studies propose detectors based on code metrics and heuristics. However, these studies have several limitations, including evaluating the detectors using small-scale case studies and an inconsistent experimental setting. Furthermore, heuristic-based detectors suffer from limitations that hinder their adoption in practice. Thus, researchers have recently started experimenting with machine learning (ML) based code smell detection.
This paper compares the performance of multiple ML-based code smell detection models against multiple traditionally employed metric-based heuristics for detection of God Class and Long Method code smells. We evaluate the effectiveness of different source code representations for machine learning: traditionally used code metrics and code embeddings (code2vec, code2seq, and CuBERT).
We perform our experiments on the large-scale, manually labeled MLCQ dataset. We consider the binary classification problem – we classify the code samples as smelly or non-smelly and use the F1-measure of the minority (smell) class as a measure of performance. In our experiments, the ML classifier trained using CuBERT source code embeddings achieved the best performance for both God Class (F-measure of 0.53) and Long Method detection (F-measure of 0.75). With the help of a domain expert, we perform the error analysis to discuss the advantages of the CuBERT approach.
This study is the first to evaluate the effectiveness of pre-trained neural source code embeddings for code smell detection to the best of our knowledge. A secondary contribution of our study is the systematic evaluation of the effectiveness of multiple heuristic-based approaches on the same large-scale, manually labeled MLCQ dataset.
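The performance measure named in the abstract above, the F1-measure of the minority (smell) class, can be illustrated with a minimal sketch. The labels below are invented for illustration and are not taken from the MLCQ dataset; the function name `f1_for_class` is likewise hypothetical.

```python
def f1_for_class(y_true, y_pred, positive=1):
    """F1-measure computed for a single class (here: the smell class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = smelly (minority class), 0 = non-smelly.
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]

score = f1_for_class(y_true, y_pred)
print(round(score, 3))
```

Scoring only the smell class keeps the many correctly classified non-smelly samples from inflating the result, which is likely why a per-class measure suits an imbalanced dataset.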

5. Clean Code Tutoring: Makings of a Foundation

Luburić, N., Vidaković, D., Slivka, J., Prokić, S., Grujić, K.G., Kovačević, A., Sladić, G., 2022. CSEDU 2022.

Abstract: High-quality code enables sustainable software development, which is a prerequisite of a healthy digital society. To train software engineers to write higher-quality code, we developed an intelligent tutoring system (ITS) grounded in recent advances in ITS design. Its hallmark feature is the refactoring challenge subsystem, which enables engineers to develop procedural knowledge for analyzing code quality and improving it through refactoring. We conducted a focus group discussion with five working software engineers to get feedback for our system. We further conducted a controlled experiment with 51 software engineering learners, where we compared learning outcomes from using our ITS with educational pages offered by a learning management system. We examined the correctness of knowledge, level of knowledge retention after one week, and the learners’ perceived engagement. We found no statistically significant difference between the two groups, establishing that our system does not lead to worse learning outcomes. Additionally, instructors can analyze challenge submissions to identify common incorrect coding patterns and unexpected correct solutions to improve the challenges and related hints. We discuss how our instructors benefited from the challenge subsystem, shed light on the need for a specialized ITS design grounded in contemporary theory, and examine the broader educational potential.

6. Challenges of Knowledge Component Modeling: A Software Engineering Case Study

Nikola Luburić, Balša Šarenac, Luka Dorić, Dragan Vidaković, Katarina-Glorija Grujić, Aleksandar Kovačević, Simona Prokić, 2022. HEAd 2022.

Abstract: To improve instruction, educators require greater visibility into the learning gains of individual students and the ability to adapt to resolve any learning gaps. A prerequisite to achieving such instruction is decomposing knowledge into small components that function as gradual steps for the learner’s journey. However, decomposing a knowledge domain is not trivial, as instructors must overcome many practical challenges along the way.
We report on our experience of knowledge component modeling for the clean code analysis and refactoring domain. We describe our method and list four challenges encountered during modeling and our solution for them. The resulting knowledge component inventory is a secondary contribution of the paper. These results can assist educators in planning and executing knowledge component modeling to refine their instruction and produce more significant learning gains in their students.

7. An Intelligent Tutoring System to Support Code Maintainability Skill Development

Nikola Luburić, Luka Dorić, Jelena Slivka, Dragan Vidaković, Katarina-Glorija Grujić, Aleksandar Kovačević, Simona Prokić. Submitted to journal for consideration.

Abstract: Maintainability determines the ease of analyzing, modifying, reusing, and testing a software component. This quality aspect greatly affects the software’s lifetime cost, contributing to developer productivity and other quality aspects like reliability and performance. Consequently, academia and the industry emphasize the need to train software engineers to build maintainable software code. Unfortunately, code maintainability is an ill-defined domain from a knowledge and skill model perspective. It is challenging to teach and learn. This problem is aggravated by a lack of capable instructors in the field. Existing instructors rely on scalable one-size-fits-all teaching methods that are ineffective. Advances in e-learning technologies can alleviate these issues. This paper’s primary contribution is a conceptual model and implementation of an intelligent tutoring system (ITS) specialized in clean code analysis and refactoring. It includes specialized learning instruments for developing procedural knowledge for the refactoring domain. We designed, developed, and evaluated the ITS over two years of working with undergraduate students using a mixed-method approach anchored in design science. We report on the results from the empirical evaluation that showcase the utility of our contributions. The results of this study are helpful for software engineering instructors, including university professors, MOOC designers, and corporate training program trainers.

8. Automatic detection of code smells using metrics and CodeT5 embeddings: a case study in C#

Aleksandar Kovačević, Nikola Luburić, Jelena Slivka, Simona Prokić, Katarina-Glorija Grujić, Dragan Vidaković, and Goran Sladić. Neural Computing and Applications.

Abstract: Available on link above.

9. Semi-supervised detection of Long Method and God Class code smells

Ilija Brdar, Jelena Vlajkov, Jelena Slivka, Katarina-Glorija Grujić, Aleksandar Kovačević. SISY 2022.

Abstract: Code smells are poorly designed parts of code whose removal is essential for sustainable software development. However, recognizing code smells in practice is challenging. Machine Learning (ML)-based code smell detectors could solve this problem. Current ML-based code smell detection approaches are based on supervised learning (SL) that requires a large and diverse dataset for training. Unfortunately, the existing code smell datasets are small, which hinders the performance of the trained SL models. This paper aims to improve the performance of ML-based code smell detectors by employing semi-supervised learning (SSL). SSL models are trained by combining a manually labeled code smell dataset with unlabeled code snippets collected from open-source repositories. Two major SSL techniques are employed: self-training and co-training. Experiments were performed for two code smell types: God Class and Long Method. SSL classifiers significantly outperformed SL classifiers for God Class detection (by 6% F-measure). For Long Method detection, SSL classifiers slightly outperformed SL classifiers (by 1% F-measure). This paper is the first to consider applying SSL for code smell detection. SSL models outperforming SL models in all experiments suggest that SSL holds great potential to improve current code smell detectors, which is essential for their adoption in practice.
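To make the self-training idea from the abstract above concrete, here is a minimal sketch. It is not the paper's setup: the single feature (lines of code), the nearest-centroid classifier, the confidence margin, and all names and values are assumptions chosen to keep the example small.

```python
def centroids(samples):
    """Mean feature value per class, from (value, label) pairs."""
    sums, counts = {}, {}
    for value, label in samples:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(cents, value):
    """Return (label, margin); the margin is the gap between the
    distances to the farthest and the nearest class centroid."""
    dist = {label: abs(value - c) for label, c in cents.items()}
    nearest = min(dist, key=dist.get)
    farthest = max(dist, key=dist.get)
    return nearest, dist[farthest] - dist[nearest]


def self_train(labeled, unlabeled, min_margin=20.0, max_rounds=10):
    """Repeatedly pseudo-label confident unlabeled samples and retrain."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        cents = centroids(labeled)
        confident, remaining = [], []
        for value in pool:
            label, margin = predict(cents, value)
            if margin >= min_margin:
                confident.append((value, label))  # accept the pseudo-label
            else:
                remaining.append(value)  # too uncertain, keep unlabeled
        if not confident:
            break
        labeled.extend(confident)
        pool = remaining
    return centroids(labeled), pool


# Hypothetical data: method length in lines; 1 = Long Method, 0 = not.
labeled = [(10, 0), (20, 0), (120, 1), (150, 1)]
unlabeled = [12, 18, 140, 160, 75]

model, still_unlabeled = self_train(labeled, unlabeled)
print(model, still_unlabeled)
```

The ambiguous sample (75 lines) never clears the margin and stays unlabeled, which is the intended behavior: self-training only grows the training set with predictions the current model is confident about.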

10. Machine Learning approaches for Code Smell detection: a Systematic Literature Review

Katarina-Glorija Grujić, Simona Prokić, Aleksandar Kovačević, Jelena Slivka, Nikola Luburić, Dragan Vidaković, Goran Sladić. Submitted to journal for consideration.

Abstract: Code smells indicate suboptimal design or implementation choices in the source code. They often lead it to be more change- and fault-prone. Recently, the automation of code smell recognition has gained much attention. We conduct a systematic literature review to summarize the information and conclusions published in the period from 1.1.2018. until 1.2.2022. A total of 56 papers passed the inclusion and exclusion criteria. We found God Class, Feature Envy, Long Method, and Data Class to be the most detected code smells. Feature engineering methods are most often used, although we can notice that other methods also gain importance over time. Structural metrics are mainly used for the recognition of code smells. The cross-validation metric is commonly used for model validation, and for performance, recall, precision, and F-measure are most frequently used. Authors tend to create their own data sets for model training. We noticed that only a small group of code smells are commonly detected, and it is necessary to pay attention to the recognition of others. In addition to binary recognition, recognizing the severity of code smells to prioritize the code that needs refactoring should be brought into focus. It is necessary to use other metrics to recognize code smells, such as source code embedding and text embedding, because structural metrics cannot convey semantics. Systematizing data sets and creating high-quality publicly available data sets would enable the training of code smell recognition models.

11. Automatic detection of Feature Envy and Data Class code smells using machine learning

Milica Škipina, Aleksandar Kovačević, Nikola Luburić, Jelena Slivka. Expert Systems with Applications.

Abstract: Available on link above.

12. Understanding the Teamwork Challenges of Software Engineering Students

Dorić, L., Luburić, N., Slivka, J. and Kovačević, A., 2023, May. Understanding the Teamwork Challenges of Software Engineering Students. In 2023 46th MIPRO ICT and Electronics Convention (MIPRO) (pp. 1578-1583). IEEE.

Abstract: Developing collaborative skills in students is nontrivial. The fact that students work in teams does not mean they become skilled in teamwork. Students face varied challenges when working in teams that harm their skill development and attitude towards teamwork. To prepare students for the collaboration-intensive workplace, we researched and designed a catalog of challenges present in the teamwork of undergraduate software engineering students on 3-month projects. We created an initial catalog of 10 challenges by examining the literature, surveying 15 teaching assistants, and coding their opinions regarding the problems faced by student teams. Using the catalog, we crafted a survey for students nearing the end of their team project to assess which challenges were present in their teamwork. We surveyed students from multiple contexts, including teams of 3, teams of 4, and teams of 16 students. We analyzed 155 answers to determine the prevalence and intensity of the 10 challenges in student teams. We discuss our findings and best practices for resolving the most prevalent challenges. The catalog and recommendations are directly valuable for software engineering educators and can inform the broader community of collaborative learning researchers and instructional designers.

13. Identification of Code Properties that Support Code Smell Analysis

Prokić, S., Luburić, N., Slivka, J. and Kovačević, A., 2023, May. Identification of Code Properties that Support Code Smell Analysis. In 2023 46th MIPRO ICT and Electronics Convention (MIPRO) (pp. 1664-1669). IEEE.

Abstract: Code smells are structures in code that imply potential maintainability problems and may negatively impact software quality. One of the critical challenges with code smells is that their definitions are often vague, difficult to comprehend and subjective, making them hard to reliably and consistently detect and analyze by humans and automated systems. Most existing code smell detection approaches rely heavily on human interpretation and are typically supported by structural code metrics. Unfortunately, many of these approaches are incomplete and do not cover a range of code properties that could indicate potential code smells. This paper analyzes code smell detection approaches to identify code properties used for code smell detection and analysis. Informed by our previous work and the literature, we define five code properties used by humans and automatic detectors to identify code smells. We demonstrate how various code properties can be mapped to the 22 code smells defined by Martin Fowler. The resulting catalog of properties can help software engineers and code maintainability researchers analyze code smells and build automated code smell detectors that examine properties beyond the traditional structural metrics.

14. A case study in combining project-based learning and autograding in Machine Learning education

Vidaković, D., Slivka, J., Luburić, N., Savić, G., and Kovačević, A., 2023. A case study in combining project-based learning and autograding in Machine Learning education. In 13th International Conference on Information Society and Technology (ICIST 2023).

Abstract: As many industries are in high demand for Machine Learning (ML) practitioners to solve business problems, it is essential to ensure that students know how to select adequate ML tools for given contexts and apply them adequately. To this aim, we designed a project-based undergraduate university ML course. The course utilizes a blended approach, in which students collaboratively work on real-world projects using an autograding platform for code-based assignments specially developed for the needs of the course. The course includes traditional lectures, discussions, reporting, and oral presentations. The course was evaluated using class assessment outcomes, faculty surveys, and observations. The results indicated that the blended learning approach was well-received and helped students better understand how to apply ML tools. They also suggest that project-based learning, in combination with an autograding platform and a blended approach, can be an effective way to teach undergraduate ML.

15. A source code readability prediction model capturing reader interaction with code text

Segedinac, M., Savić, G., Zeljković, I., Slivka, J., Konjović, Z., 2024. Assessing code readability in Python programming courses using eye‐tracking. Computer Applications in Engineering Education

Completed Theses

Here we briefly showcase the master's and bachelor's degrees obtained by students while contributing to the Clean CaDET project. We thank all our candidates for their contribution to the project. While some solutions remain a part of the platform to this day, all of them have contributed to our understanding of the problem and the latest incarnation of the solution.

Master's degrees

1. Automatic detection of code smells based on code change history

Candidate: Simona Prokić
Mentor: Jelena Slivka

Short abstract: The paper presents a code smell detection model based on code change history. The model's inputs are the source code metrics' values in n revisions for the observed code snippet.

2. Automatic code smell detection based on information extracted from the codes’ textual content

Candidate: Katarina-Glorija Grujić
Mentor: Aleksandar Kovačević

Short abstract: The paper presents a code smell detection approach based on natural language processing of the identifier names and comments.

3. Identification of cohesive parts inside of a class

Candidate: Balša Šarenac
Mentor: Nikola Luburić

Short abstract: This research explores how different cohesive metrics can be combined to determine opportunities for extract class refactoring. The paper further describes the resulting algorithm and examples of its use.

Bachelor's degrees

1. Subsystem for educational challenges in the Clean CaDET platform

Candidate: Ana Atanacković
Mentor: Nikola Luburić

Short abstract: The candidate designed and developed the initial version of the refactoring challenges subsystem for the Clean CaDET platform. The paper further describes the complete challenge submission process.

2. Securing the educational platform Clean CaDET

Candidate: Luka Dorić
Mentor: Nikola Luburić

Short abstract: The candidate created an initial threat model of the Clean CaDET platform, including threats, risks, and proposals for mitigations. The paper further describes a subset of the mitigations, including the design and implementation of the appropriate security controls.

3. Intelligent Tutoring System implementation within the Clean CaDET platform

Candidate: Milica Siriški
Mentor: Nikola Luburić

Short abstract: The candidate researched effective strategies for delivering instruction. The paper further describes the design and implementation of the components that integrated this instructional expertise to offer a better educational experience to the end users.

4. Subsystem for creating a digital course within the platform for teaching clean code

Candidate: Vladimir Buđen
Mentor: Nikola Luburić

Short abstract: The candidate designed and developed controls for educational content creation. The paper further describes the subscription mechanism introduced to the platform to support licensing of the platform's features.

5. Clean code analysis through a clean code learning platform

Candidate: Nemanja Pualić
Mentor: Nikola Luburić

Short abstract: This paper describes code analysis through the clean code learning platform Clean CaDET, with an emphasis on the structure and dynamics of modules within the platform as well as a focus on implemented metrics and detection rules.

6. Web application for managing datasets for clean code

Candidate: Radoš Milićev
Mentor: Nikola Luburić

Short abstract: The candidate developed a front-end web application that invokes the Dataset Explorer functionality. The paper further describes the design and implementation of this component.

7. Clean code learning platform communication infrastructure

Candidate: Vladislav Maksimović
Mentor: Nikola Luburić

Short abstract: This paper presents the communication infrastructure of a platform for learning clean code. Performance analysis was conducted on projects of various sizes. The structure and behavior of the message exchange are described in detail.

8. Automatic code smell detection based on code2seq features

Candidate: Mitar Perović
Mentor: Jelena Slivka

Short abstract: The paper presents an ML-based code smell detection model that uses pre-trained source code embeddings as features. The embeddings are generated using the code2seq model.

9. Overcoming the class imbalance problem in datasets for automatic code smell detection

Candidate: Ilija Brdar
Mentor: Jelena Slivka

Short abstract: Code smell datasets are hugely imbalanced, which poses a challenge for ML-based code smell detection algorithms. This paper explores the benefits of multiple dataset augmentation techniques applied to alleviate the class imbalance problem.

10. Automatic code smell detection based on combining machine learning methods with heuristics

Candidate: Jelena Vlajkov
Mentor: Jelena Slivka

Short abstract: Most existing code smell detection approaches are based on metrics-based heuristics or training ML classifiers using code metrics as features. This paper combines these two approaches to improve code smell detection performance.

11. Detection of badly written Java methods using machine learning

Candidate: Zdravko Dugonjić
Mentor: Aleksandar Kovačević

Short abstract: The paper presents a heuristic-based approach for the Long Method code smell detection. In the approach, each method is segmented into logical blocks. The blocks' similarity is used to decide whether they should be a part of the same method.
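The block-similarity idea in this abstract can be sketched roughly as follows. The thesis does not state how blocks are delimited or compared, so everything here is an assumption made for illustration: blocks are split on blank lines, similarity is the Jaccard overlap of the words each block contains, and the 0.2 threshold is arbitrary.

```python
import re


def split_blocks(method_body):
    """Split a method body on blank lines (an assumed stand-in for logical blocks)."""
    return [b for b in re.split(r"\n\s*\n", method_body.strip()) if b.strip()]


def identifiers(block):
    """Collect word-like tokens; keywords and words inside strings are included."""
    return set(re.findall(r"[A-Za-z_]\w*", block))


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def split_points(method_body, threshold=0.2):
    """Indices of blocks that share little vocabulary with the block before them."""
    ids = [identifiers(b) for b in split_blocks(method_body)]
    return [i + 1 for i in range(len(ids) - 1)
            if jaccard(ids[i], ids[i + 1]) < threshold]


# Hypothetical Java method body with three blank-line-separated blocks.
body = """
int total = 0;
for (int price : prices) { total += price; }

int discount = total / 10;
total = total - discount;

logger.info("done");
"""

points = split_points(body)
print(points)
```

Here the first two blocks share `int` and `total`, so they stay together, while the logging block shares nothing with its predecessor and is flagged as a possible place to split the method.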

12. Client web application that interacts with a platform for learning clean code

Candidate: Nikola Pantelić
Mentor: Nikola Luburić

Short abstract: This paper presents the design and implementation of a client web application built using the Angular framework, which interacts with the clean code intelligent tutoring system.

13. Parsing Java source code into an object graph with calculated metrics

Candidate: Nikola Gudelj
Mentor: Nikola Luburić

Short abstract: This paper presents the design and implementation of a Java code parser which transforms Java source code into the CaDET model.

14. Self-training for Long Method and God Class detection

Candidate: Jelena Vlajkov
Mentor: Jelena Slivka

Short abstract: The existing code smell datasets are small, which hinders the performance of the supervised learning models. This paper aims to improve the performance of the supervised learning models by employing a semi-supervised learning self-training technique.

15. Co-training for Long Method and God Class detection

Candidate: Ilija Brdar
Mentor: Jelena Slivka

16. Active Learning for Long Method and God Class detection

Candidate: Mihajlo Ostojić
Mentor: Jelena Slivka

Short abstract: Manually labeled code smell datasets are small, as manual annotation of code smells is challenging and time-consuming. This paper aims to mitigate this problem by employing Active Learning - learning algorithms that actively query the annotators for labels to reduce the number of required labels and therefore reduce the human labeling effort.
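As a rough illustration of how an active learner might choose which snippets to send to the annotators, the sketch below uses uncertainty sampling. This is one common query strategy and is assumed here; the abstract does not say which strategy the thesis uses. The snippet ids and probabilities are invented.

```python
def uncertainty(prob_smelly):
    """Uncertainty of a binary prediction: 1.0 at probability 0.5, 0.0 at 0 or 1."""
    return 1.0 - abs(prob_smelly - 0.5) * 2.0


def select_queries(probabilities, batch_size):
    """Return the ids of the snippets the model is least sure about."""
    ranked = sorted(
        probabilities,
        key=lambda snippet_id: uncertainty(probabilities[snippet_id]),
        reverse=True,
    )
    return ranked[:batch_size]


# Hypothetical model outputs: predicted probability that a snippet is smelly.
probs = {"A": 0.97, "B": 0.52, "C": 0.10, "D": 0.41, "E": 0.80}

queries = select_queries(probs, batch_size=2)
print(queries)
```

Snippets the model already classifies with high confidence (such as "A" and "C") are skipped, so the annotators' limited time goes to the cases expected to be most informative.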

17. An Intelligent Tutoring System for Training Software Engineers

Candidate: Luka Dorić
Mentor: Nikola Luburić

Short abstract: This paper presents an intelligent tutoring system used for training undergraduate software engineering students and the results of its empirical evaluation. The theoretical foundations on which the system is based, the architecture of the system, the empirical evaluation of the system as well as the improvements are described. The result of the work is a learning platform called Clean CaDET Tutor which is applied to several courses at the Faculty of Technical Sciences in Novi Sad.
