Quantifying the Encapsulation of Implemented Software Architectures

TL;DR: We studied which static architecture metrics are correlated with a high ratio of local changes (i.e. changes made to only a single component). An analysis of 10 open-source systems shows a positive relationship between the percentage of code only used within a component and the ratio of local changes. We conclude that having small, clearly defined interfaces for your components leads to more local changes, which are easier to implement and test.

This week I had the pleasure of presenting our paper Quantifying the Encapsulation of Implemented Software Architectures at the 30th International Conference on Software Maintenance and Evolution. What follows is the high-level story I presented (using these slides). If you want all the details you can find the complete paper here.

Inspecting the title of our paper we see that it is about quantifying the encapsulation of implemented architectures. To understand what this paper talks about let's start by examining these concepts more closely.

Implemented software architectures

As a whole, software architecture is defined as:
the organisational structure of a software system including components, connections, constraints, and rationale. 
If we focus on the implementation within the code, we can only observe the components and the connections; things like constraints and rationale are normally defined in the documentation.

As an example, consider the figure on the right, which depicts a hypothetical system. We can clearly see the high-level components which (hopefully) implement a distinct functionality, and the connections that exist between these components. To get such a high-level overview of the system you can normally open up the source-code repository or look for it in the documentation. Should that fail you can always fall back to a whiteboard, a marker, and a software engineer working on the system. I have yet to meet the software engineer who cannot draw such a picture of their system.

Quantifying encapsulation

Encapsulation revolves around localizing the design decisions which are likely to change (a process also known as information hiding). If done correctly, we would see that the changes to a system are made to source-code modules which are located near each other, preferably in the same component. This makes it easier to implement the change (since we do not have to jump between components) and easier to test the change (since we have fewer components to test).

Given a system, a definition of its components, and all changes made in the past years we can easily determine whether the process of encapsulation has been successful by using the concepts of local change and non-local change as introduced by Yu et al. 

As a first step we classify each change-set in the history of a system (e.g. all commits or pull-requests) as either local or non-local. When a change-set contains source-code files from only a single component it is considered to be local; if more than one component is touched, the change-set is considered to be non-local. The figure on the left shows an example of each type of change-set, blue for local and brown for non-local.

After this classification we can quantify the success of the encapsulation by simply dividing the number of local change-sets by the number of total change-sets. For example, the figure on the left shows a change-set series containing ten change-sets of which seven are local, leading to a quantification of 0.7 for encapsulation.
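To make the classification and the ratio concrete, here is a minimal Python sketch. It is not the tooling used in the paper: the file names, the file-to-component mapping and the ten change-sets are all made up for illustration, chosen so that seven of the ten change-sets are local, as in the example above.

```python
from typing import Dict, Iterable, List


def is_local(change_set: Iterable[str], component_of: Dict[str, str]) -> bool:
    """A change-set is local when all of its files belong to exactly one component."""
    touched = {component_of[path] for path in change_set}
    return len(touched) == 1


def local_change_ratio(change_sets: List[List[str]], component_of: Dict[str, str]) -> float:
    """Number of local change-sets divided by the total number of change-sets."""
    if not change_sets:
        raise ValueError("need at least one change-set")
    local = sum(1 for change_set in change_sets if is_local(change_set, component_of))
    return local / len(change_sets)


# Hypothetical mapping from source file to component.
component_of = {
    "ui/form.py": "ui",
    "ui/menu.py": "ui",
    "core/model.py": "core",
    "db/store.py": "db",
}

# Ten hypothetical change-sets: seven local, three non-local.
history = [
    ["ui/form.py"],
    ["ui/form.py", "ui/menu.py"],
    ["core/model.py"],
    ["ui/menu.py", "core/model.py"],                  # non-local
    ["db/store.py"],
    ["core/model.py", "db/store.py"],                 # non-local
    ["ui/form.py"],
    ["core/model.py"],
    ["ui/form.py", "core/model.py", "db/store.py"],   # non-local
    ["db/store.py"],
]

print(local_change_ratio(history, component_of))  # 0.7
```

In practice the mapping from files to components is the hard part; here it is simply assumed to be given.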

As explained above we would like to see as many local change-sets as possible, so we want this number to be as high as possible. However, since we also expect to see some non-local changes for cross-cutting concerns such as logging we would not expect to see a ratio of 1 that often. To get a feel for which numbers are good we can calculate this metric for many systems, thus creating a benchmark which can tell us whether this 0.7 is relatively good or bad compared to other systems.

Up until now we have only seen concepts introduced by others, which makes a rather sub-standard research paper. So what is the problem here?

The timing problem(s)

The main issue with calculating the encapsulation of an implemented architecture using the concepts above is that it can only be done after a project has been finished. Although nice to know at that point in time, it would be nicer if we could calculate a metric on the project which provides some sort of indication of the encapsulation of the system now. Given that the current literature lists over 40 software architecture level metrics (an overview can be found here) we should be able to find something, right?

So we designed an experiment to see which software architecture metrics that can be calculated on a single snapshot of the code (i.e. snapshot-based metrics) are correlated with the encapsulation calculated over time (i.e. the historical encapsulation).

The first set-up was straightforward: select some systems, calculate the snapshot-based metrics, calculate the historical encapsulation, run the statistics, and Bob's your uncle. The figure on the right shows a sketch of the outline of this set-up, using the number of components as an example of a snapshot-based metric. At a glance this set-up seems correct, but after a while we figured out that it contains a (rather serious) flaw. Any thoughts?

Notice that we calculated the number of components based on the situation after the last change-set. But this change-set makes a change to the system, and can also change the number of components!

More graphically, consider the chart on the left which shows the number of components on the x-axis and the change-sets on the y-axis. We see that there is a period where we have 2 components, then a period where there are 5 components, only to drop to 4 components in the last change-set. Trying to correlate the historical encapsulation of 0.7 with a number of four components is clearly incorrect, since most of the time the number of components was either 2 or 5.

To remedy this problem we adjusted the design of our experiment. Instead of calculating the historical encapsulation over all of the change-sets in the history, we calculate it separately for each period in which the snapshot-based metric is stable.

In the example above this gives us two pairs of numbers, (2, 0.6) and (5, 0.75), to indicate the number of components and the historical encapsulation for that period. Note that we do not calculate a pair for when there are four components, since we do not consider a single change-set a 'period'.
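The adjusted design can be sketched in a few lines of Python. This is only an illustration under my own assumptions: each change-set is represented as a hypothetical (metric value, is_local) pair, and a period needs at least two consecutive change-sets with the same metric value (the `min_length` parameter is my choice, the text above only says that a single change-set is not a period). The data is invented to reproduce the numbers of the example.

```python
from itertools import groupby
from typing import List, Tuple


def encapsulation_per_stable_period(
    history: List[Tuple[int, bool]], min_length: int = 2
) -> List[Tuple[int, float]]:
    """Return (metric value, local-change ratio) for each stable period.

    `history` holds one (metric value, is_local) entry per change-set, in
    chronological order. Consecutive change-sets with the same metric value
    form a period; periods shorter than `min_length` are skipped.
    """
    pairs = []
    for value, group in groupby(history, key=lambda entry: entry[0]):
        flags = [local for _, local in group]
        if len(flags) >= min_length:
            pairs.append((value, sum(flags) / len(flags)))
    return pairs


# Hypothetical history of ten change-sets, seven of which are local.
history = (
    [(2, True), (2, False), (2, True), (2, True), (2, False)]  # 2 components, 3 of 5 local
    + [(5, True), (5, True), (5, False), (5, True)]            # 5 components, 3 of 4 local
    + [(4, True)]                                              # 4 components, single change-set
)

print(encapsulation_per_stable_period(history))  # [(2, 0.6), (5, 0.75)]
```

Note that the overall ratio of this history is still 0.7, but that number is no longer paired with the four components of the final snapshot.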


The results

Now that we know how to do the experiment we can execute it. First we selected 10 open-source software systems to investigate, giving us over 60 years of historical data. Secondly, we filtered the available snapshot-based software architecture metrics down to a list of twelve metrics. This list includes simple metrics such as the number of cyclic dependencies or the number of binary dependencies, but also more involved software metrics such as the metrics which form the basis for our dependency profiles.

These last metrics are (unfortunately) not yet widely known, so let me explain them quickly. In a dependency profile we divide the source-code modules within a component into four distinct categories based on the dependencies from and to other components. We can calculate the profile by calculating the percentage of code of the system in each category, thus a profile (50, 20, 25, 5) indicates that 50% of the code is internal to components, while 20% is depended upon from other components, 25% of the code depends on code from other components, leaving 5% in the last category of code which is depended upon and depends upon code from other components. 
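As a rough illustration, the profile can be computed from three inputs: the component of each module, the size of each module, and the dependencies between modules. The Python sketch below is a simplification under my own assumptions; the module names, sizes and dependencies are hypothetical, and the category labels in the comments are just descriptive names for the four categories described above.

```python
from typing import Dict, Set, Tuple


def dependency_profile(
    component_of: Dict[str, str],
    size_of: Dict[str, int],
    depends_on: Dict[str, Set[str]],
) -> Tuple[float, float, float, float]:
    """Percentage of code that is (internal, depended upon, depending, both)."""
    has_incoming = set()  # modules used from another component
    has_outgoing = set()  # modules that use another component
    for source, targets in depends_on.items():
        for target in targets:
            if component_of[source] != component_of[target]:
                has_outgoing.add(source)
                has_incoming.add(target)

    internal = inbound = outbound = both = 0
    for module, size in size_of.items():
        incoming, outgoing = module in has_incoming, module in has_outgoing
        if incoming and outgoing:
            both += size
        elif incoming:
            inbound += size
        elif outgoing:
            outbound += size
        else:
            internal += size

    total = sum(size_of.values())
    return tuple(100 * part / total for part in (internal, inbound, outbound, both))


# Hypothetical system of three components and four modules (sizes in lines of code).
component_of = {"a1": "A", "a2": "A", "b1": "B", "c1": "C"}
size_of = {"a1": 50, "a2": 20, "b1": 25, "c1": 5}
depends_on = {
    "a1": {"a2"},        # stays within component A
    "b1": {"a2", "c1"},  # B uses A and C
    "c1": {"a2"},        # C uses A
}

print(dependency_profile(component_of, size_of, depends_on))  # (50.0, 20.0, 25.0, 5.0)
```

Here a1 is internal, a2 is only depended upon, b1 only depends on other components, and c1 does both, which reproduces the (50, 20, 25, 5) profile from the text.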

After crunching all the numbers the result is that there is a positive correlation between the historical encapsulation and the percentage of internal code. In other words, we observed that systems which contain a higher percentage of internal code also exhibit periods with a higher ratio of local changes.

So what can I do with this result?

Given that more internal code is related to a higher ratio of local changes I would argue that you should strive towards an implementation with as much internal code as possible.

One way to achieve this is to define clear, small, and specific interfaces for your components. While this is often done correctly for the incoming interface of a component, the outgoing interface is often overlooked, leading to a large outgoing interface with a higher risk of needing change when other components are touched. 

More details ...

Interested in reading more about the design of the experiment? Or do you want to know how well other metrics correlate? (spoiler: they don't) Want to know more about our ideas about how the inspected software architecture metrics can be improved? Download the full paper here!

One step towards a software metrics catalog

A little over a year ago my proposal was accepted in the Tiny Transactions on Computer Science. If you have not read it, the body of the publication is:

Unfortunately, such a catalog has not materialized instantly :(

However, last week we did take a small step forward during the Workshop on Emerging Trends in Software Metrics. During this workshop, Arie van Deursen presented our proposal for a Software Metrics Catalog Format. This format is specifically designed to provide a concise, yet meaningful overview of a software metric, while also showing the relationships a software metric has with other metrics.

You can read the complete description of the Software Metrics Catalog Format in our publication, but it is probably more appealing (and fun!) to visit our demo implementation using a semantic wiki hosted at referata.com.

Naturally, all comments, questions, remarks and contributions to the catalog are more than welcome!

Update: If you want to use the format in a scientific publication, this LaTeX template might be helpful (thank you Joël Cox)!

Defending the propositions

As explained before, a thesis in Holland is accompanied by a set of propositions. These propositions are considered to be a part of the thesis, which means that the members of the doctoral committee are allowed to challenge them if they desire.

This is why the regulations state that the propositions:

'...shall be academically sound, shall lend themselves to opposition and be defendable by the PhD candidate, and shall be approved by the promotor.'

Furthermore, at least six of the propositions should not concern the topic(s) of the thesis, and at most two propositions can be playful in nature. 

So let's see whether we succeeded. Here is the list of my propositions (each one linking to this blog-post which contains an explanation for the proposition):
So what do you think, did I succeed? And do you agree with them all?

The propositions explained

This post explains all of the propositions listed here.

To enable the effective application of software metrics, a pattern catalog based on real-world usage scenarios must be developed.

This proposition is actually a complete publication; it was published in the Tiny Transactions on Computer Science, Volume 2. As explained in that paper, there has been a vast amount of research in the area of software metrics. Many software metrics have been designed and validated over the past decades, but only a few software metrics are used by project teams to identify and solve problems in a timely manner.

One reason for this lack of adoption is that it is currently hard to decide which software metric should be used in which situation. Documenting the benefits and limitations of metrics makes this decision easier, which ultimately leads to more successful software projects.



The software architect should take the responsibility for the implementation of the system.

According to the global IT Architect Association:

"The software architect has mastered the value, use, development and delivery of software intensive systems. They have developed skills in software development lifecycles, software engineering and software design. They are also responsible for communicating software concepts to all levels of management and for ensuring that expected quality attribute levels are achieved. "

On different occasions I have encountered a software architect who only concerns him/herself with the design of the system, not with the actual implementation. In other words, the architect does not communicate with the development team. This way of working is based on the assumption that the design is such that all of the quality attributes are achieved. Thus, when the implementation follows the design, the implemented system will also achieve the desired quality attributes.

Unfortunately, the implementation only follows the design in very rare cases. During the implementation developers will run into nasty problems with the technologies used, unexpected events and border-cases, or simply errors in the design. It is crucial that the development team can rely on the software architect to reflect upon these situations and make the decisions that are necessary. In other words, the software architect should be an integral part of the development team, and the one person that makes all final decisions.


If software engineering PhD students spend 20% of their time 'in the field', their research will be based on more realistic assumptions.

The field of software engineering research should reflect upon the way in which professionals design, construct, test and maintain (i.e. 'engineer') software systems. In my opinion, the best way to do this is to observe professionals to identify some of the problems that they are facing. The researcher then develops solutions for these problems and verifies whether these solutions indeed solve the identified problem.

For this last step it is crucial that the research has been based on realistic assumptions about the data available, the effort people want to invest or the processes that can be changed. If any of these assumptions are incorrect it is going to be hard to a) get professionals to apply the solution for the verification, and b) get wider acceptance for the solution after the initial validation.

By spending time together with professionals in industry, a PhD student gets an idea of what constraints are put upon these professionals in terms of time, resources and data. This knowledge can immediately be used to test the assumptions for potential solutions, avoiding the development of unnecessary or unrealistic ones. 

Making the names of reviewers public will make reviewers more inclined to write better reviews, which increases the quality of the overall review process.

In the majority of cases the review process for conferences and journals is a closed process. The input is a submitted paper and the output is a decision and (possibly) a set of reviews. Most of the time the paper has been read by two or three reviewers who wrote a review, and in some cases these reviews are discussed by the program committee (either online or in person).

Within the current process the names of the reviewers are known to the rest of the committee and the chairs, but the authors normally do not know who wrote the reviews. In theory this makes sure that the reviewers can be honest in their reviews without being afraid that their feedback gets back to them in undesired ways. Unfortunately, this also enables reviewers to reject papers on loose claims or false beliefs. In addition, this cloak of anonymity provides the reviewers an opportunity to be less civilized than they could be. Lastly, being anonymous decreases the reward for writing a very good and detailed review, since only a few people witness and appreciate it if you do.

Especially this last problem can be solved by making the names of reviewers public, since the authors know who to thank for reviewing their paper. In addition, just as a paper should not contain claims without evidence, a reviewer should be less inclined to make a loose claim if his name will be known. Of course, an author still needs to accept a 'reject'-decision, but this should be possible if the feedback in the review is honest, civil and supported by facts.


Hiring a skilled typist is an often overlooked option during the design of an automated process.

When I had just learned to program during my university training I wanted to automate everything: all repetitive behavior was to be captured in scripts, macros or programming tools. Using this strategy you quickly run into a situation in which the effort to automate some steps is way bigger than the eventual cost savings.

This XKCD-comic summarizes pretty clearly how much time you can spend on the automation of a certain task before you spend more time on the automation than on the task itself. For some of the systems I have seen this table would have saved quite some time, money and irritation for the people involved.


Understanding your goal makes it easier to deal with unpleasant chores.

In every job, study or other day-time passing activity there are chores that are not fun to do. For example, in this PhD project my initial reaction to a complete restructuring of a paper is always annoyance. Yes, it might be a good idea. Yes, it will make the paper easier to read. Yes, it does make the story better. But most of all, it requires me to work another three evenings and it requires me to throw away two days of work! Did I mention that the deadline is just three days away?

At these points in time I always tried to take a step back and look at the overall goal, which is to write a nice PhD thesis. The restructuring is gonna cost me now, but the realization that the paper becomes better (okay), which gives a higher chance of acceptance (good), which means that I can finish the overall project on time (awesome!) usually makes me want to do the restructuring anyway.

So, whenever an unpleasant chore comes along I try to ask myself: 'what goal am I getting closer to by completing this task?' In most cases, this helps me to do the task anyway. And in those cases that I cannot figure out why the task helps in reaching a goal, this helps me to not feel bad about not doing the task at all.


The replication of experiments becomes easier when all PhD students must replicate an existing study during their research.

Replicating an experiment is performing the same experiment with slight variations in terms of data or set-up. This is in contrast with reproducing an experiment, which is geared towards reproducing the exact same results. Within academia, replication of experiments is needed to confirm earlier results and to broaden the common body of knowledge. Replication of an experiment can be viewed as performing a double-check of the work, making it less likely that an error has been made earlier.

By replicating an experiment, a PhD student learns about the choices that need to be made during an experiment, and the possible effect(s) of these choices on the outcome of the experiment. In addition, the PhD student probably finds out that an experiment cannot be (easily) replicated because of missing data, too few details in the description of a procedure, or the absence of running source-code. Because of this first-hand experience (and frustration) about missing details the documentation produced by the student about his own experiments will be of higher quality, thus making the experiments easier to replicate.


Using only metrics as acceptation criteria leads to undesired optimization


This proposition is based on chapter 5 of the thesis, which in turn is based on the article called 'Getting What You Measure'. This article describes four pitfalls I have seen over and over again when metrics are being used in a project management setting. The most widespread one of these is probably 'treating the metric', i.e. making changes to a system just to improve the value of a metric.

In some cases this is not problematic because improving the value of the metric also helps in reaching a goal. For example, if you want people to write more code you could require them to check in 2000 Lines of Code each day. However, you probably want them to write useful code, something which is hard to capture in a metric. And even if you could, there would be a long list of other characteristics that are desirable, but that you didn't think of specifying at the start of a project.

Therefore, I think that using metrics as formal acceptance criteria is fine, but it should always be clear why a specific value of the metric is desirable. In other words, always communicate the overall goal along with the formal metrics, and focus the acceptance on the goal itself instead of the metrics.


The most important objective in [Boy Scout] training is to educate, not instruct. (cf. Lord Baden-Powell)

I spent a large part of my life being a boy-scout; only the last few years I have been a bit busy with a different project. By joining the scouts movement I have had many great times, I have traveled to many different places and have met a wide range of interesting people. Because of all this, I figured that a quote from the founding father of the scouts movement should be in my list of propositions.

My interpretation of this quote is that you should not strive to tell somebody what to do next, but that you should help the person to understand the current situation such that he can derive useful actions himself. I think the best way to summarize the benefits of this approach is to refer to the old saying: 'give a man a fish and he can eat for a day, teach a man to fish and he can eat for a lifetime'.


The fact that the McChicken tastes the same everywhere, proves that it is possible to have distinct teams produce the same results.


I have personally sampled the McChicken in many places in the Netherlands, and this PhD project allowed me to sample them in Spain, Portugal, Canada, the USA, Germany, Belgium, Italy and Switzerland. I am always amazed that this piece of fast-food does not only look similar, but also has the same taste (or, according to some, 'lack of taste') in all those locations.

Unfortunately, I do not have any insight into how this distribution/production process works. And although I probably do not want to know all of the details, I would really like to understand which conditions have to be met in order to replicate this achievement in other fields.    

Time to defend the dissertation

Since October 2008 I have been introducing myself as 'a technical consultant and a PhD student'. On the 28th of this month, around 16:15 hours I hope to drop the second part of this sentence!

Because on June 28th (at 15:00 hours) I will start defending my dissertation against the eight members of my doctoral committee. The ceremony lasts for a little over an hour and is carried out according to a strict set of rules, which prescribe everything from the way in which everybody needs to be addressed up to the clothes that will be worn by the committee, my paranymphs and me.

The subject of the defense is my dissertation, titled 'Metric-Based Evaluation of Implemented Software Architectures'. If you have followed this blog you know most of the content by now since it is basically a compilation of my previous publications. If not, the easiest thing to do is to read through the summary enclosed in the PDF version of the dissertation.

At first I was a bit skeptical about the usefulness of bundling all of the papers together since they are already published. However, writing down the overall story felt pretty good, and I must say that it was quite a joy to unpack the printed copies of the resulting book! (BTW, there are still a few copies left, so you can probably still get one of them at the defense).

Apart from asking questions about the dissertation itself, the members of the doctoral committee are also allowed to ask questions about one of the ten propositions that accompany the dissertation. I am not sure whether I can pull off a 'Project #tweetprop' (i.e. a short blog-post per proposition), but I'll definitely discuss the propositions in a later entry. So stay tuned!