We are four students in our final year at Polytech’ Nice-Sophia, specializing in Software Architecture.
This document presents the results of our research on the Test-Driven Development method. To present them concretely, the first section sets out the context of our research. The second section describes our study in more depth, along with the projects on which it is based.
Test-Driven Development (TDD) is a software development method that relies on writing tests before the tested code exists and, more importantly, on refactoring. More precisely, it has five steps. The first is to write a unit test. The second is to run the test and watch it fail; if it succeeds, there is a problem, since the tested code has not been written yet. Once the test is written and fails, the third step is to write just enough code to make it pass. The fourth step is to check that all the other tests still pass and, if some fail, to fix the issues until the whole suite passes. The final step is to refactor the code to improve it. The figure below illustrates the development process using the TDD method.
Figure 1 : TDD method process
The purpose of this method is to write the specifications first, in the form of unit tests, so that the written code provides exactly the wanted functionality. Beyond that, it aims to ensure that the code is always valid and more stable. It should also help the developer avoid regressions when refactoring the code.
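As a minimal sketch of this cycle, the Python snippet below walks through the five steps on a hypothetical `slugify` function. The function, its name and its behaviour are our own illustration and are not taken from any of the studied projects.

```python
# Steps 1 and 2 (red): the test is written first. Run at this point, before
# `slugify` exists, it fails with a NameError.
def test_slugify():
    assert slugify("Test Driven Development") == "test-driven-development"
    assert slugify("  TDD  ") == "tdd"


# Step 3 (green): write just enough code to make the test pass.
def slugify(title):
    return "-".join(title.lower().split())


# Step 4: run the whole suite again (here, a single test) to check that
# nothing else broke. Step 5 would then refactor `slugify`, re-running the
# suite after each change.
test_slugify()
```

The test acts as the specification: the implementation is only considered done once the assertions written beforehand hold.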
In this sub-chapter, we present our study of the impact of the TDD method on code quality, code maintainability and test coverage. The study is split into three sub-questions:
Test-Driven Development claims to produce code that is of better quality and always valid. This study aims to verify whether this claim holds, and in what way. There is no settled answer to this question yet, and it divides the developer community. To answer it with concrete arguments, we think it is interesting to compare TDD with a more common way of developing, which is to develop the functionality first and then write the tests. This common method is also known as the Test-Last (TL) method. In addition, many companies do not adopt TDD because they think that testing first costs more than getting something working first. It would be interesting to see whether this is true by comparing maintenance cost (i.e., number of issues, fixes…) and productivity (i.e., number of lines added and deleted) between the test-first method (TDD) and the test-last method.
In this study, we compare TDD projects and TL projects only. Other development methods are left out in order to narrow our field of research. As our main question is about code quality, we do not study the impact of TDD on development time compared to TL. We have restricted our analysis to a few metrics that define the code quality of a project; they are described in the Project Evaluation part. External factors such as team size, team experience or the language used are therefore out of scope. Even though these factors can affect code quality and code coverage, we lacked the time and resources to take them into account; they are among the limits of this study.
Other concepts linked to the TDD method, such as emergent software architecture, are not studied either, again in order to focus on answering our questions.
The projects studied here are either TDD or TL, but how can we know which development method a project uses? We do not have an automated tool capable of detecting it, so to choose the projects we rely on the project team and the developer community. Following the test-driven method is not a common choice, so the project team usually states clearly that the project is test-driven. In addition, we manually checked whether the commits of the TDD projects seem to follow the pattern of test first, then code. This is only a partial verification; ideally, a tool would analyse the commits and detect this pattern.
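To give an idea of what such a tool could look for, here is a rough Python sketch of one possible heuristic. It is not something we used in the study. It assumes that the commit history is already available as a chronological list of changed paths, and that a class `Foo` under `src/main/` is tested by `FooTest` under `src/test/` (a Maven-style convention that not every project follows). All function names are hypothetical.

```python
def expected_test_path(path):
    """Map a production class to its test class (assumed naming convention)."""
    prefix, suffix = "src/main/", ".java"
    if path.startswith(prefix) and path.endswith(suffix):
        return "src/test/" + path[len(prefix):-len(suffix)] + "Test.java"
    return None


def share_of_test_first_classes(commits):
    """Share of tested classes whose test appeared no later than the class.

    `commits` is a chronological list; each item lists the paths touched by
    one commit.
    """
    first_seen = {}
    for index, paths in enumerate(commits):
        for path in paths:
            first_seen.setdefault(path, index)

    tested = 0
    test_first = 0
    for path, index in first_seen.items():
        test_path = expected_test_path(path)
        if test_path is None or test_path not in first_seen:
            continue  # not a production class, or no matching test class
        tested += 1
        if first_seen[test_path] <= index:
            test_first += 1
    return test_first / tested if tested else 0.0


history = [
    ["src/test/java/FooTest.java"],
    ["src/main/java/Foo.java"],
    ["src/main/java/Bar.java"],
    ["src/test/java/BarTest.java"],
]
print(share_of_test_first_classes(history))  # 0.5
```

A test committed together with its class counts as test-first here, which is a generous reading: commit granularity alone cannot show the order in which the two files were actually written.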
In order to find a concrete answer, we had to find several projects built using a TDD approach, each with thousands of commits:
Other projects use TDD, but these are the most interesting for our work. We are also limited in time by our studies.
We found the following TL projects, which are of approximately the same size as the previous TDD projects.
We compared TDD method projects with TL method projects based on:
As Test-Driven Development is really driven by the tests, we expect TDD projects to have high code coverage: at least 80%, and higher than that of TL projects. The method involves an important refactoring phase, so cleaning the code is a central part of it. Because of this, we expect better code quality, but also more commits about refactoring and fewer about fixing or patching bugs.
The previous section described our research context, our goal and the common thread of our project. In this section, we present how we evaluate the project samples (i.e., what we evaluate and how).
To answer our three sub-questions, we used several metrics. All of our TDD and TL projects are evaluated using the same metrics so that they can be compared.
The code age measures, in months, the time elapsed since a file was last modified. Because some of the projects are older than others, the metric actually measured here is the average code age of the files relative to the project age. For example, if a file has not been modified for 10 months and the project is 12 months old, we can consider it a fairly stable file. If most of the files of a project are currently being modified, the developers have a lot of files to maintain. In the opposite case, the developers only have to focus on a few files, which is a sign of good maintainability.
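The relative code age can be written down in a few lines of Python. This is a sketch of our reading of the metric, not the code used by the tools described later; the function names are hypothetical.

```python
def relative_code_age(months_since_last_change, project_age_months):
    """Age of a file as a fraction of the age of the project."""
    if project_age_months <= 0:
        raise ValueError("project age must be positive")
    return months_since_last_change / project_age_months


def average_relative_code_age(file_ages_months, project_age_months):
    """Mean relative age over all the files of a project."""
    if not file_ages_months:
        raise ValueError("at least one file is required")
    ages = [relative_code_age(age, project_age_months) for age in file_ages_months]
    return sum(ages) / len(ages)


# The example from the text: untouched for 10 months in a 12-month-old project.
print(f"{relative_code_age(10, 12):.0%}")  # 83%
```

A value close to 100% means that the file has barely changed since the start of the project, while a value close to 0% means that it was modified very recently.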
The code coverage measures how much of a project has been tested.
The cyclomatic complexity measures the number of linearly independent paths through a function. Ideally, there should be at least as many unit tests as the cyclomatic complexity, which should itself be as low as possible.
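To make the definition concrete, the sketch below approximates the cyclomatic complexity of a piece of Python code by counting decision points with the standard `ast` module. It is a simplification for illustration only: the set of counted constructs is our own choice, and it is not the rule set SonarQube applies to the Java projects of this study.

```python
import ast

# Constructs that each add one decision point (a deliberately short list).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source):
    """Approximate complexity: 1 + number of decision points in `source`."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a or b or c` introduces two additional branches.
            complexity += len(node.values) - 1
    return complexity


SAMPLE = '''
def grade(score):
    if score < 0 or score > 100:
        raise ValueError("out of range")
    if score >= 50:
        return "pass"
    return "fail"
'''

print(cyclomatic_complexity(SAMPLE))  # 4
```

With a complexity of 4, the rule of thumb above calls for at least four unit tests of `grade`: one per side of the range check, one passing score and one failing score.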
Code smells are issues detected in the source code that can point to a deeper problem. For example, duplicated code, long methods and large classes are code smells. The more code smells are spotted in the source code, the more likely the project is to be difficult to maintain.
During our study, we applied the same process to seven different projects. We used four test-driven development projects and three test-last projects, as described before. We chose these seven projects for their similarity in terms of number of commits and lines of code.
TDD projects :
Test-last projects :
For each project, we tried to obtain the data about the metrics we defined in the first part of the report.
The first thing we did was to configure JaCoCo (Java Code Coverage) for each project.
JaCoCo is a free code coverage library for Java. Its advantages are the uniformity of the generated reports and its compatibility with Maven and Gradle. All of the projects we studied use either Maven or Gradle. We configured the plugin for each project. With Maven, the plugin is declared as follows:
<plugin>
  <groupId>org.jacoco</groupId>
  <artifactId>jacoco-maven-plugin</artifactId>
  <version>0.7.5.201505241946</version>
  <executions>
    <execution>
      <goals>
        <goal>prepare-agent</goal>
      </goals>
    </execution>
    <execution>
      <id>report</id>
      <phase>prepare-package</phase>
      <goals>
        <goal>report</goal>
      </goals>
    </execution>
  </executions>
</plugin>
The Gradle configuration is similar. Once the plugin is configured and the tests have been run, so that JaCoCo has execution data to work from, you can generate the report with:
$ mvn jacoco:report
For each project, we obtained a report like the following:
Figure 2 : JUnit4 JaCoCo report
This report gives us the code coverage of each project, a metric we need to compare the TDD and TL methods.
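We read the coverage from the HTML report, but the same figure can also be extracted programmatically. The sketch below assumes the XML flavour of the JaCoCo report, in which the project totals appear as `counter` elements directly under the root `report` element. The sample document is a hand-written fragment in that shape, not the output of one of our projects, and `coverage_percent` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment shaped like a JaCoCo XML report (illustrative values).
SAMPLE_REPORT = """<report name="demo">
  <package name="demo">
    <counter type="LINE" missed="1" covered="3"/>
  </package>
  <counter type="INSTRUCTION" missed="20" covered="80"/>
  <counter type="LINE" missed="5" covered="15"/>
</report>"""


def coverage_percent(report_xml, counter_type="INSTRUCTION"):
    """Return the project-level coverage for one counter type, in percent."""
    root = ET.fromstring(report_xml)
    # findall("counter") only matches direct children of <report>, which are
    # the project totals; per-package counters are nested deeper.
    for counter in root.findall("counter"):
        if counter.get("type") == counter_type:
            missed = int(counter.get("missed"))
            covered = int(counter.get("covered"))
            total = missed + covered
            return 100.0 * covered / total if total else 0.0
    raise ValueError(f"no {counter_type} counter in report")


print(coverage_percent(SAMPLE_REPORT))          # 80.0
print(coverage_percent(SAMPLE_REPORT, "LINE"))  # 75.0
```

Reading the same counter type for every project matters for the comparison, since instruction, line and branch coverage can differ noticeably for the same code base.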
We scanned each project with SonarQube, which gave us the general quality of the code. As with JaCoCo, the generated reports have the same format for every project, which makes the projects easy to compare. To scan a project with SonarQube, one needs to add a file called sonar-project.properties to it. The file we used for all our projects is the following:
sonar.java.source=1.8
sonar.sources=src/main
sonar.tests=src/test
sonar.junit.reportsPath=target/surefire-reports
sonar.jacoco.reportPaths=target/jacoco.exec
sonar.java.binaries=target/classes
#local props
sonar.login=admin
sonar.password=admin
sonar.host.url=http://localhost:9000
sonar.projectKey=JUNIT
sonar.projectName=junit
sonar.projectVersion=1.0
JaCoCo needs to be configured for the project (as described previously) and the project needs to be built. After launching the SonarQube server, you can scan the project with:
$ sonar-scanner
The reports can be found on localhost:9000, where the list of all your projects is displayed.
Another advantage of using Sonar is the uniformity of the generated reports. For example:
Figure 3 : Sonar reports for JUnit4
SonarQube gave us the cyclomatic complexity of the projects, along with the Sonar issues. Both are metrics we need to compare the TDD and TL methods. During a scan, Sonar reports several kinds of issues, for example bugs, vulnerabilities and code smells.
SoftVis3D is a framework for visualizing a project, literally. It is available in the SonarQube update center. Once the plugin is installed on Sonar, it is automatically available when you scan a project.
The goal is to provide a visualization of the hierarchical structure of a project. Folders or packages are shown as districts, files as buildings. The footprint, height and color of a building depend on Sonar metrics of your choice: you tell SoftVis3D which metric to use for each, and any metric can be used. This tool is useful to get a complete view of a project and to see whether a god class is present. In our study, we use it to judge the overall cleanliness of a project. We used the complexity as the footprint, the number of duplicated lines as the height, and the number of Sonar issues as the color.
Figure 4 : SoftVis3D results for JUnit4
We used Code Maat to study GitHub repositories. Code Maat is a command-line tool used to mine and analyze data from version-control systems. It supports many kinds of analyses. The ones we were interested in are “age” (the code age) and “revisions” (how many times a file has been modified).
Running code-maat directly is not the most convenient approach, so we wrote a simple CLI in Python with two commands: retrieve <git_url> and analyse <projects_names>. The first one clones the repository and runs the code-maat analyses on it, which outputs the raw data on code age and revisions (among other data we did not use in our project). The second one aggregates this raw data into readable statistics, which we used for our own analysis. It also counts the number of commits containing our predefined keywords (fix, refactor, add and test).
The script is written in Python and is available here:
https://github.com/lecourtoisn/code-maat-cli
Here is an example of the output we used:
fitnesse fix 6%(365) , refactor 1%(87) , add 4%(234) , test 14%(809)
origin fix 9%(1684) , refactor 1%(194) , add 6%(1111) , test 9%(1604)
spoon fix 30%(535) , refactor 7%(126) , add 13%(243) , test 12%(223)
jacoco fix 2%(35) , refactor 0%(2) , add 2%(39) , test 11%(151)
junit4 fix 7%(156) , refactor 1%(24) , add 5%(115) , test 14%(303)
node fix 17%(2850) , refactor 2%(383) , add 17%(2883) , test 18%(3084)
gson fix 6%(92) , refactor 0%(8) , add 6%(83) , test 13%(183)
jfreechart fix 1%(49) , refactor 0%(1) , add 3%(104) , test 7%(266)
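The keyword statistics in this output can be reproduced in spirit with a few lines of Python. The sketch below is an illustration and not the code of the script linked above: the function names are hypothetical, and the matching rule (case-insensitive substring) and the rounding are assumptions on our part.

```python
KEYWORDS = ("fix", "refactor", "add", "test")


def keyword_shares(messages, keywords=KEYWORDS):
    """Map each keyword to (percentage of commits, number of commits)."""
    total = len(messages)
    shares = {}
    for keyword in keywords:
        # Substring matching is crude: "address" also counts as "add".
        count = sum(1 for message in messages if keyword in message.lower())
        percent = round(100 * count / total) if total else 0
        shares[keyword] = (percent, count)
    return shares


def format_line(project, shares):
    """Render the shares in the same layout as the output above."""
    cells = [f"{keyword} {percent}%({count})" for keyword, (percent, count) in shares.items()]
    return f"{project} " + " , ".join(cells)


messages = [
    "Fix NPE in parser",
    "Add test for parser",
    "Refactor lexer",
    "Update README",
]
print(format_line("demo", keyword_shares(messages)))
```

Because a commit message can contain several keywords (the second message above counts for both "add" and "test"), the percentages of a project do not have to add up to 100%.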
For each project, we applied the evaluation process described previously, except for the SonarQube analysis of OpenCover, which can be found directly online. In the next two parts, we present and analyse these results to answer our questions.
Here are the raw results we obtained after analysing the projects.
Metrics | Fitnesse (TDD) | JUnit4 (TDD) | JFreeChart (TDD) | OpenCover (TDD) | Spoon (TL) | GSON (TL) | JaCoCo (TL) |
---|---|---|---|---|---|---|---|
Code Coverage | 48% | 85% | 45% | 93.9% | 90.7% | 83% | 80% |
Sonar issues | 1927 | 833 | 5039 | 286 | 2341 | 592 | 200 |
Complexity | 8612 | 2061 | 19323 | 1568 | 7635 | 1945 | 1962 |
Code Age | 48.3% | 21.5% | 18.1% | 83.6% | 9% | 50% | 35% |
Average number of revisions per file | 4.46 | 7.72 | 4.97 | 3.15 | 6.35 | 17.26 | 9.12 |
% “Fix” Commit | 6% | 7% | 1% | 7% | 30% | 6% | 2% |
% “Refactor” Commit | 1% | 1% | 0% | 2% | 7% | 0% | 0% |
% “Test” commit | 14% | 14% | 7% | 9% | 12% | 13% | 11% |
Figure 5 : Fitnesse
Figure 6 : JFreeChart
Figure 7 : JUnit4
Figure 8 : Google GSON
Figure 9 : JaCoCo
Figure 10 : Spoon
From the raw data we collected, we made charts for each of our metrics to get a better visualization and make comparisons easier. The first four projects in the charts are TDD and the last three are TL.
Figure 11 : Test Coverage
This chart represents the percentage of code coverage for each project. All TL projects have a code coverage of at least 80%, so they are well covered by tests overall. For the TDD projects, there is some disparity. Two projects have a code coverage higher than 80%, but the other two have a coverage of 48% and 52%, which is really low. We expected TDD projects to have high code coverage, but our study shows the opposite: TL projects seem to have better code coverage than TDD projects. However, as we studied only a few projects, these results may be exceptions or the consequence of external factors.
Figure 12 : Proportions of commits
In this view, where the projects are kept separate, the results are more or less uniform, except for Spoon, whose share of fix-related commits is surprisingly high, about four times higher than the others.
Figure 13 : Proportions of commits. TDD compared to TL
As expected, using a Test-Last method goes with more fixes (and therefore more bugs). However, Spoon skews our data, which shows that we need more data to get a reliable picture of the Test-Driven and Test-Last methods. The first metric also shows that Test-Driven Development projects have fewer test commits than Test-Last projects. This metric only looks at commits and does not represent test coverage. Yet, because of the way TDD works, with phases of testing and phases of refactoring, we expected TDD projects to have a higher percentage of refactor commits.
Figure 14 : Stability of files
The first four are TDD projects and the others are TL projects. GSON is a TL project made by Google; this statistic shows that each of its files has been edited approximately 17 times on average. Many researchers have shown that there is a correlation between the number of revisions and the number of bugs (Thomas Zimmermann studied it in his research: https://goo.gl/eNVqAK). So, based on this metric, GSON does not seem to be a clean project.
Overall, TL projects are less stable (edited more often) than TDD projects and consequently more prone to bugs.
Figure 15 : Stability and Test Coverage (TDD compared to TL)
This chart shows three important metrics. Overall, files in TDD projects are more stable and, consequently, less exposed to bugs. However, it is interesting to see that the mean test coverage of the TDD projects is below that of the TL projects. Using the TDD method does not mean having more tests; it is just a method in which the tests are written before the code.
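The group means behind this comparison can be recomputed from the raw results table. The short Python sketch below does so for code coverage and for the average number of revisions per file, using the values of the table and the TDD/TL grouping stated earlier; the variable and function names are ours. It gives a mean coverage of roughly 68% for the TDD projects against roughly 85% for the TL projects, and about 5.1 revisions per file against about 10.9.

```python
TDD_PROJECTS = ("Fitnesse", "JUnit4", "JFreeChart", "OpenCover")
TL_PROJECTS = ("Spoon", "GSON", "JaCoCo")

# Values copied from the raw results table.
COVERAGE = {
    "Fitnesse": 48.0, "JUnit4": 85.0, "JFreeChart": 45.0, "OpenCover": 93.9,
    "Spoon": 90.7, "GSON": 83.0, "JaCoCo": 80.0,
}
REVISIONS_PER_FILE = {
    "Fitnesse": 4.46, "JUnit4": 7.72, "JFreeChart": 4.97, "OpenCover": 3.15,
    "Spoon": 6.35, "GSON": 17.26, "JaCoCo": 9.12,
}


def group_mean(metric, projects):
    """Arithmetic mean of one metric over a group of projects."""
    values = [metric[name] for name in projects]
    return sum(values) / len(values)


for label, metric in (("coverage (%)", COVERAGE), ("revisions/file", REVISIONS_PER_FILE)):
    tdd = group_mean(metric, TDD_PROJECTS)
    tl = group_mean(metric, TL_PROJECTS)
    print(f"{label}: TDD {tdd:.1f}, TL {tl:.1f}")
```

With only four and three projects per group, a single outlier such as GSON's 17.26 revisions per file moves the TL mean considerably, which is one more reason to read these averages with caution.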
Looking at the SoftVis3D representations, where the footprint is the cyclomatic complexity, the color is the number of Sonar issues and the height is the number of duplicated lines, it appears that projects using TDD have a lower cyclomatic complexity than TL projects, which is what we expected. This result may be a consequence of the process described by Kent Beck for TDD:
The two rules imply an order to the tasks of programming:
- Red—write a little test that doesn’t work, perhaps doesn’t even compile at first
- Green—make the test work quickly, committing whatever sins necessary in the process
- Refactor—eliminate all the duplication created in just getting the test to work
Red/green/refactor. The TDDs mantra.
The refactor phase happens after every task in TDD, so developers are probably more used to refactoring, and the quality of the refactoring may be higher. Better refactoring may explain the lower cyclomatic complexity observed with the TDD method, because developers devote more of their time to it.
In our study, we investigated the impact of Test-Driven Development on code quality and code coverage compared to projects following the Test-Last method. We analysed seven projects of similar size to evaluate this impact. The results show that the TDD projects have better overall complexity and code quality than the TL projects, in line with our expectations. However, the results on code coverage do not match our expectations: half of the TDD projects have a code coverage lower than 60%, and all the TL projects have a coverage of at least 80%.
As we analysed only a few projects without a precisely defined context, we cannot generalize the results beyond our scope. Still, our results on code quality are similar to those of the study by Bhat and Nagappan, who also studied the impact of TDD on development time (which is not part of our scope). We hope that this study will contribute to the research in this field, especially regarding confidence in the impact of the Test-Driven Development method on code quality.
TDD projects:
TL projects:
Beck, K. (2003). Test-driven development: by example. Addison-Wesley Professional.
Astels, D. (2003). Test Driven Development: A Practical Guide. Prentice Hall Professional Technical Reference.
Pancur, M., & Ciglaric, M. (2011). Impact of test-driven development on productivity, code and tests: A controlled experiment. In Information and Software Technology 53 (pp. 557–573).
Bhat, T., & Nagappan, N. (2006, September). Evaluating the efficacy of test-driven development: industrial case studies. In Proceedings of the 2006 ACM/IEEE international symposium on Empirical software engineering (pp. 356-363). ACM.
Martin, R. C. (2008). Clean Code: A Handbook of Agile Software Craftsmanship. Pearson Education.
Kaufmann, R., & Janzen, D. (2003, October). Implications of test-driven development: a pilot study. In Companion of the 18th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications (pp. 298-299). ACM.
Zimmermann, T., Nagappan, N., & Zeller, A. Predicting Bugs from History. https://goo.gl/eNVqAK