Hug a developer … they’re in a terrible pain

I’m back from vacation … just to discover again the cruel reality:

links for 2008-08-18

Repositories for Models: VCS or Databases?

A recent post from Markus Voelter, replying to some articles by Martin Fowler, has resulted in an excellent discussion about repositories for MDSD and textual vs graphical DSLs. I was thinking about joining the discussion in the original post, but as my answer is a bit long to be added as a comment, I’ll reply to him here.

In his post, Markus raised a question that recurrently appears in my thoughts: where is it better to store models, as files in a VCS (like CVS/SVN) or as structured data in databases?

Unlike Markus, my first answer is usually: without any doubt, in databases. I suppose that this answer has much to do with my professional background, since in the past I have worked extensively with CASE tools. Yes, I believe I’m one of these people that M. Fowler describes in his post about MDSD:

The MDSD vision evolved from the development of graphical design notations and CASE tools. Proponents of these techniques saw graphical design notations as a way to raise the abstraction level above programming languages – thus improving development productivity. While these techniques and tools never caught on too far, the basic core ideas still live on and there is an ongoing community of people still developing them.

But it is also because I’m used to working with very large repositories, where there are lots of relationships between models. If you work with lots of components but few relationships, you usually do not have problems working with local files (as Eclipse does). VCSs are able to deal quite well with large numbers of files. But what happens when you have, for example, your entire transactional system modeled (> 30,000 components), with a very high degree of reuse between models (lots of links), and each model can belong to a different owner (only the owner can modify it)? As Markus points out in the comments:

Also, text files tend to be a bit hard to scale. Often the minimum you need is some kind of “cross-indexer” via a database so you can efficiently cross-ref, search, etc. In a “real repository” that’s easier.

Consider the Xtext case. What do you do once you have hundreds of Xtext resources? Each linking into each other. How do you efficiently load, unload, search, find-refs, etc? You need some kind of (in memory or persistent) index.

Exactly. When you work with huge amounts of data and links, you need to provide some kind of impact analysis or cross-reference functionality, you must be able to run complex queries, you must version not only the components but also the relationships, and you must be able to link to other models without needing to download them locally. Yes, I know that there are some solutions out there that provide some of those facilities for files too, such as text search engine libraries (Apache Lucene) and query languages (for example, XQuery for XML or SPARQL for RDF). But IMHO all of these solutions, although they work very well with few components, are not as powerful as what you can get *for free* using relational databases.
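To make the impact-analysis point concrete, here is a minimal sketch of the kind of cross-reference query a relational store gives you almost for free. It uses Python's standard `sqlite3` module with an in-memory database; the `component` and `link` tables, the sample rows, and the `impacted_by` helper are all hypothetical names invented for this illustration, not part of any real repository product.

```python
import sqlite3


def build_repository():
    """Create an in-memory repository with hypothetical components and links."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE component (id INTEGER PRIMARY KEY, name TEXT UNIQUE, owner TEXT);
        CREATE TABLE link (source_id INTEGER REFERENCES component(id),
                           target_id INTEGER REFERENCES component(id));
    """)
    conn.executemany("INSERT INTO component VALUES (?, ?, ?)", [
        (1, "Customer", "alice"),
        (2, "Account", "bob"),
        (3, "Transfer", "carol"),
        (4, "Report", "dave"),
    ])
    # A link means "source uses target":
    # Account -> Customer, Transfer -> Account, Report -> Customer
    conn.executemany("INSERT INTO link VALUES (?, ?)", [(2, 1), (3, 2), (4, 1)])
    return conn


def impacted_by(conn, name):
    """Return names of all components that directly or transitively use `name`."""
    rows = conn.execute("""
        WITH RECURSIVE impacted(id) AS (
            -- direct users of the changed component
            SELECT l.source_id FROM link l
            JOIN component c ON c.id = l.target_id
            WHERE c.name = ?
            UNION
            -- users of anything already known to be impacted
            SELECT l.source_id FROM link l
            JOIN impacted i ON l.target_id = i.id
        )
        SELECT c.name FROM component c
        JOIN impacted i ON c.id = i.id
        ORDER BY c.name
    """, (name,)).fetchall()
    return [row[0] for row in rows]
```

Calling `impacted_by(build_repository(), "Customer")` walks the links transitively, so it reports `Transfer` as well as the direct users `Account` and `Report`. Doing the same over thousands of files would need a separately maintained index.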

But I must also say that my opinion has changed somewhat over the years, due to my experience working with repositories. The database approach also has some problems. One of them is that you must always be connected to the database in order to work with the tool, and, although this sounds silly, it may limit the productivity of some developers. With a VCS, you only need to be connected while you check out the component; after that, you can work with it locally. Another pain point is that in relational databases you must create a fixed schema (even if you use a metamodel, you must still create it), and that can be a mess when you need to modify the data structure, since RDBMSs don’t provide schema versioning facilities. Fortunately, some new approaches have appeared on the market in recent years, such as schema-free databases, that will help with this task. Another side effect is that if you want to preserve the integrity of the models and the relationships, you have to deal with locking mechanisms, so the scenario becomes worse and the system usually ends up over-engineered. And finally, there are also some functionalities not provided by databases, such as versioning, accountability (who, when and what) and in some cases traceability (why), so you must develop them yourself (yay! we love to reinvent the wheel!). Almost all VCSs provide these facilities.
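As an illustration of the wheel that has to be reinvented, here is a minimal sketch of hand-rolled versioning and accountability on top of a relational table, using Python's standard `sqlite3` module. The `component_version` table and the `save` and `history` helpers are hypothetical names made up for this example; a real repository would also need locking, branching and relationship versioning, which this sketch ignores.

```python
import sqlite3
from datetime import datetime, timezone


def open_repository():
    """Create an in-memory store with an append-only version table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE component_version (
            name       TEXT    NOT NULL,
            version    INTEGER NOT NULL,
            content    TEXT    NOT NULL,  -- what
            author     TEXT    NOT NULL,  -- who
            changed_at TEXT    NOT NULL,  -- when
            reason     TEXT    NOT NULL,  -- why (traceability)
            PRIMARY KEY (name, version)
        )
    """)
    return conn


def save(conn, name, content, author, reason):
    """Append a new version of a component; older rows are never overwritten."""
    with conn:  # commit on success, roll back on error
        (latest,) = conn.execute(
            "SELECT COALESCE(MAX(version), 0) FROM component_version WHERE name = ?",
            (name,),
        ).fetchone()
        conn.execute(
            "INSERT INTO component_version VALUES (?, ?, ?, ?, ?, ?)",
            (name, latest + 1, content, author,
             datetime.now(timezone.utc).isoformat(), reason),
        )
    return latest + 1


def history(conn, name):
    """Return (version, author, reason) for every stored version, oldest first."""
    return conn.execute(
        "SELECT version, author, reason FROM component_version "
        "WHERE name = ? ORDER BY version",
        (name,),
    ).fetchall()
```

Every call to `save` adds a row instead of updating one, so `history` can answer who changed what, when and why. This is roughly what a VCS log gives you without writing any code.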

Let’s go back again to the original post. Markus talks about some conditions where repositories could fit well in this scenario:

My point is that a repository is not per se a bad thing, provided the following criteria: (1) you store all your relevant stuff in it (2) it provides versioning facilities (3) supports diff/merge on a meaningful abstraction level.

Ummm, I agree with almost everything, but I have some concerns:

  1. Not sure. If he is talking about storing all of the data that belongs to a model in the repository, then I agree. But if he is talking about storing the model and the code together, then I disagree. There are some scenarios where this is not convenient, for example when you want to be platform independent (and I’m not talking about all the MDA stuff). The various parsers/generators/interpreters could run and store the code/binaries on several platforms, and not always on the same platform where you store the model.
  2. I agree, versioning is an essential facility.
  3. Diff/Merge works well with textual DSLs, which have a concrete syntax. But, although this is a great feature, is it mandatory? I have worked a long time without this feature and I assure you that you can survive without it. And what happens with graphical DSLs?

Before concluding, I would also like to comment on one of the latest projects where we applied MDSD. In this project, we decided to use an Oracle XML DB to store our models in XML (something similar to what Eurocontrol-CFMU have done for their UML models), but we also added some metadata. By storing the XML directly in the database, we avoid the need to decompose the XML into a relational schema, and we allow developers to download the XML and work locally without needing to be connected to the database. We can also use all the SQL query facilities, and for those situations where performance could be a problem, we use the metadata to store some relevant data and relationships. Oh, and this RDBMS also provides us with versioning facilities. At this moment, we don’t have enough data in the repository to tell you whether this approach will be a success or not. Let’s see!
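The general shape of that approach (keep the XML document whole, and extract a little metadata for fast queries) can be sketched as follows. This is only an illustration under loud assumptions: it uses SQLite through Python's standard `sqlite3` module as a stand-in and does not use Oracle XML DB or any of its features, and the XML layout (`<model name="...">` with `<ref target="..."/>` children), the table names and the helper functions are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET


def open_store():
    """Create an in-memory store: whole XML documents plus a metadata table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE model (name TEXT PRIMARY KEY, xml TEXT NOT NULL);
        CREATE TABLE model_ref (source TEXT NOT NULL, target TEXT NOT NULL);
    """)
    return conn


def store_model(conn, xml_text):
    """Store the XML as-is and refresh the extracted cross-reference metadata."""
    root = ET.fromstring(xml_text)
    name = root.get("name")
    targets = [ref.get("target") for ref in root.iter("ref")]
    with conn:  # one transaction for the document and its metadata
        conn.execute("INSERT OR REPLACE INTO model VALUES (?, ?)", (name, xml_text))
        conn.execute("DELETE FROM model_ref WHERE source = ?", (name,))
        conn.executemany("INSERT INTO model_ref VALUES (?, ?)",
                         [(name, target) for target in targets])
    return name


def referencing(conn, target):
    """Find models that reference `target` without parsing any XML."""
    rows = conn.execute(
        "SELECT source FROM model_ref WHERE target = ? ORDER BY source", (target,))
    return [row[0] for row in rows]


def checkout(conn, name):
    """Return the stored XML unchanged, ready to be edited offline."""
    row = conn.execute("SELECT xml FROM model WHERE name = ?", (name,)).fetchone()
    return row[0] if row else None
```

The trade-off is that the metadata table duplicates information already in the XML, so it must be rebuilt on every store, as `store_model` does here. In exchange, cross-reference queries never have to open the documents.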

To sum up (or not!): I believe you should never reject the database approach (nor the VCS option). I cannot give you a “Golden Rule”, but my advice is that if you are not going to have lots of relationships between models, then use the VCS approach. Otherwise, first analyze what a VCS approach could offer you, and if it doesn’t fit well with your requirements, then use the database approach. But please, be careful not to make the metamodel too complicated, or you could have lots of performance problems.

As the post is quite long, I will leave for another post my thoughts about textual vs graphical DSLs. In the meantime, what is your opinion? I would love to hear stories from other folks on what people are doing in their companies.

links for 2008-07-31 [delicious.com]

Eclipse Ganymede hidden treasures

In the last week of June (as usual), the Eclipse Foundation delivered the new release of Eclipse, called Ganymede. This year the updated version is a coordinated release of 23 different projects and represents 18 MLOC. There are lots of articles and posts out there explaining the new features, so I’m not going to bore you with the rehashed details. I would just like to mention two interesting features.

The first one is a really cool feature introduced in the Eclipse Communication Framework project that enables distributed teams to reap the benefits of pair programming. Based on a Google Summer of Code proposal, Mustafa Isik developed Real-Time Shared Editing, dubbed Cola (collaborate), a mechanism that allows two developers to work collaboratively in real time to edit source code and/or documents. He has put together a short screencast showing the usage of this technology. Check it out! Digging further into this amazing feature, Mustafa pointed me to a Google Tech Talk he gave at EclipseDay at the Googleplex, where he explained how this plugin resolves any change conflict in real time. The video is worth a visit. And if you want to add this feature to other editors (by default it has been added to the JDT Java Source Code editor and Eclipse’s Default Text Editor), Scott Lewis has written some easy instructions … simply by adding a little bit of markup to plugin.xml.

The second one is the Usage Data Collector, a piece of technology that generates statistics on how the various components of the Eclipse workbench (loaded bundles, commands and actions, perspective changes, view usage, …) are being used by developers. The Eclipse Foundation’s intent is to use this data to help committers and organizations better understand how developers are using Eclipse, in order to improve the overall user experience. Privacy should not be a problem, as this feature is opt-in (there is an option on the “Usage Data Collector” preferences page labeled “Enable Capture”) and it is completely anonymous. Although the data collected is not quite representative, you can already see some statistics (I see lots of Cut-and-Paste Programming). I hope that these statistics will be public and that the Eclipse Foundation will publish reports regularly (I have not seen any notice about this). But besides the benefits that these statistics may have for the Eclipse Foundation, I believe they can also be attractive to organizations which have developed internal plugins. And I say this from my own experience. One of the problems we had in the past was how to measure the use of the different plugins we developed, and also what the response times were (we had several complaints about client performance). We finally had to create an infrastructure to collect and analyze this data. So, I look with interest at the possibility of extending the official UDC API (both listeners and monitors). Let’s see how it evolves in the future.
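For readers wondering what such a home-grown measuring infrastructure boils down to, here is a minimal sketch of an opt-in collector that counts how often a command runs and how long it takes. It is written in plain Python purely for illustration and is not the Eclipse UDC API; the `UsageCollector` class and its method names are hypothetical.

```python
import time
from collections import defaultdict


class UsageCollector:
    """Counts command invocations and accumulates their response times."""

    def __init__(self, enabled=False):
        self.enabled = enabled  # opt-in: nothing is recorded unless enabled
        self.counts = defaultdict(int)
        self.total_seconds = defaultdict(float)

    def record(self, command, func, *args, **kwargs):
        """Run `func` and, if capture is enabled, record usage under `command`."""
        if not self.enabled:
            return func(*args, **kwargs)
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Recorded even if the command raises, so failures are counted too.
            self.counts[command] += 1
            self.total_seconds[command] += time.perf_counter() - start

    def report(self):
        """Return {command: (invocations, average seconds per invocation)}."""
        return {command: (n, self.total_seconds[command] / n)
                for command, n in self.counts.items()}
```

Wrapping a plugin action as `collector.record("paste", do_paste)` is enough to get both usage counts and average response times out of `report()`. No user or machine identifier is stored, which keeps the data anonymous.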