Guidelines for New Software Projects

In this blog post I share the Guidelines for New Software Projects that we use at Wikimedia Deutschland.

I wrote down these guidelines recently after a third party contracted by Wikimedia Deutschland delivered software that was problematic in several ways. The department contracting this third party was not the software development department, and since we did not have the guidelines written down anywhere, both this department and the third party lacked information they needed to do a good job. The guidelines are a short-and-to-the-point two-page document that can be shared with third parties and serves as a starting point when making choices for internal projects.

If you do not have such guidelines at your organization, you can copy these, though you will of course need to update the “Widespread Knowledge at Our Department” section. If you have suggestions for improvements or interesting differences in your own guidelines, leave a comment!


Domain of Applicability

These guidelines are for non-trivial software that will need to be maintained or developed further. They do not apply to throw-away scripts (if indeed they will be thrown away) or trivial code such as a website with a few static web pages. If on the other hand the software needs to be maintained, then these guidelines are relevant to the degree that the software is complex and hard to replace.

Choice of Stack

The stack includes programming language, frameworks, libraries, databases and tools that are otherwise required to run or develop the software.

We want

  • It to be easy to hire people who can work with the stack
  • It to be easy for our software engineering department to work with the stack
  • The stack to continue to be supported and to evolve with the rest of the ecosystem

Therefore

  • Solutions (i.e. frameworks, libraries) known by many developers are preferred over obscure ones
  • Custom/proprietary solutions, where mature and popular alternatives are available, are undesired
  • Solutions known by our developers are preferred
  • Solutions with many contributors are preferred over those with few
  • Solutions with active development are preferred over inactive ones
  • More recent versions of the solutions are preferred, especially over unsupported ones
  • Introducing new dependencies requires justification for the maintenance cost they add
  • Solutions that are well designed (see below) are preferred over those that are not

Widespread Knowledge at Our Department

  • Languages: PHP and JavaScript
  • Databases: MySQL and SQLite
  • Frameworks: Symfony Components and Silex (discontinued)
  • Testing tools: PHPUnit, QUnit, Selenium/Mocha
  • Virtualization tools: Docker, Vagrant

Architecture and Code Quality

We want

  • It to be easy (and thus cheap) to maintain and further develop the project

Therefore

  • Decoupled architecture is preferred, including
    • Decoupling from the framework, especially if the framework has issues (see above) and thus forms a greater than usual liability.
    • Separation of persistence from domain and application logic
    • Separation of presentation from domain and application logic
  • The code should adhere to the SOLID principles, not be STUPID, and otherwise be clean at a low level.
  • The application’s logic should be tested with tests of the right type (i.e. unit, integration, edge-to-edge) that are themselves well written and not laden with unneeded coupling.
  • Setup of the project for development should be automated, preferably using virtualization tools such as Docker. After this automated setup developers should be able to easily run the tests and interact with the application.
  • If the domain is non-trivial, usage of Domain-driven design (including the strategic patterns) is preferred

Collaboration Best Practices

During 2016 and 2017 I worked in a small three-person dev team at Wikimedia Deutschland, aptly named the FUN team. This is the team that rewrote our fundraising application and implemented the Clean Architecture. Shortly after we started working on new features, we noticed room for improvement in many areas of our collaboration. We talked about these and created a Best Practices document, the content of which I share in this blog post.

We created this document in October 2016 and it focused on the issues we were having at the time or things we were otherwise concerned about. Hence it is by no means a comprehensive list, and it might contain things not applicable to other teams due to culture/value differences or different modes of working. That said, I think it can serve as a source of inspiration for other teams.

Make sure others can pick up a task where it was left

  • Make small commits
  • Publish code quickly and don’t go home with unpublished code
  • When partially finishing a task, describe what is done and still needs doing
  • Link Phabricator on Pull Requests and Pull Requests on Phabricator

Make sure others can (start) work(ing) on tasks

  • Try to describe new tasks in such a way others can work on them without first needing to inquire about details
  • Respond to emails and comments on tasks (at least) at the start and end of your day
  • Go through the review queue (at least) at the start and end of your day
  • Indicate if a task was started or is being worked on, so everyone can pick up any task without first having to check whether it is being worked on. At least pull the task into the “doing” column of the board, better yet assign yourself.

Make sure others know what is going on

  • Discuss introduction of new libraries or tools and creation of new projects
  • Only reviewed code gets deployed or used as critical infrastructure
  • If review does not happen, insist on it rather than adding to big branches
  • Discuss (or Pair Program) big decisions or tasks before investing a lot of time in them

Shared commitment

  • Try working on the highest priority tasks (including any support work such as code review of work others are doing)
  • Actively look for and work on problems and blockers others are running into
  • Follow-up commits containing suggested changes are often nicer than comments

PHP project template

Want to start a new PHP project? Perhaps yet another library you are creating? Tired of doing the same lame groundwork for the 5th time this month? Want to start a code kata and not lose time on generic setup work? I got just the project for you!

I’ve created a PHP project template that you can fork or copy to get started quickly. It contains setup for testing, linting and CI, and no actual PHP code. A detailed list of the project template’s contents:

  • Ready-to-go PHPUnit (configuration and working bootstrap)
  • Ready-to-go PHPCS
  • Docker environment with PHP 7.2 and Composer (so you do not need to have PHP or Composer installed!)
  • Tests and style checks runnable in the Docker environment with simple make commands
  • TravisCI ready
  • Code coverage creation on TravisCI and uploading to ScrutinizerCI (optional)
  • Coverage tag validation
  • Stub production and test classes for ultra-quick start (ideal when doing a kata)
  • COPYING and .gitignore files
  • README with instructions on how to run the tests

Getting started

  • Copy the code or fork the repository
  • If you do not want to use the MediaWiki coding style, remove mediawiki/mediawiki-codesniffer from composer.json
  • If you want to support older PHP versions, update composer.json and remove new PHP features from the stub PHP files
  • If the code is not a kata or quick experiment, update the PHP namespaces and the README
  • Start writing code!
  • If you want TravisCI and/or ScrutinizerCI integration you will need to log in to their respective websites
  • Optionally update the README

You can find the project template on GitHub.

Special thanks to weise, gbirke and KaiNissen for their contributions to the projects this template was extracted from.

My year in books

This is a short summary of my 2017 reading experience, following my 2016 Year In Books and my 2015 Year In Books.

Such Stats

I read 27 books, totaling just over 9000 pages. 13 of these were non-fiction, 6 hard science fiction and 8 pleb science fiction.

In 2017 I made an effort to rate books properly using the Goodreads scale (did not like, it was OK, I liked it, I really liked it, it was amazing), hence the books are more distributed rating-wise than in the previous years, where most got 4 stars.

Non-fiction

While only finished shortly after New Year, and thus not strictly a 2017 book even though I read 90% of it in 2017, my favorite for the year is The Righteous Mind: Why Good People are Divided by Politics and Religion. A few months ago I saw an interview of Jonathan Haidt, the author of this book. I was so impressed that I went looking for books written by him. As I was about to add his most recent work (this book) to my to-read list on Goodreads, I realized that it was already on there, added a few months before. The book is about social and moral psychology and taught me many useful models of how people behave, where to look for one’s blind spots, how to avoid polarization and how to understand many aspects of the political landscape. A highly recommended read!

Another non-fiction book that I found to be full of useful models that deepened my understanding of how the world works is The Psychopath Code: Cracking The Predators That Stalk Us. Since about 4% of people are psychopaths and thus systematically and mercilessly exploit others (while being very good at covering their tracks), you are almost guaranteed to significantly interact with some over the course of your life. Hence understanding how to detect them and prevent them from feeding on you or those you care about is an important skill. You can check out my highlights from the book to get a quick idea of the contents.

Fiction

Incandescence

After having read a number of more mainstream Science Fiction books that do not emphasize the science or grand ideas the way proper Hard Science Fiction does, I picked up Incandescence, by Greg Egan, author of one of my 2015 favorite books. As expected from Greg Egan, the focus definitely is on the science and the ideas. The story is set in the Amalgam galaxy, which I was already familiar with via a number of short stories.

Most of the book deals with a more primitive civilization gradually discovering physics from scratch, both through observation and theorization, eventually creating their own version of General Relativity. Following this is made extra challenging by the individuals in this civilization using their own terms for various concepts, such as the 6 different spatial directions. You can see this is not your usual SF book from this… different trailer that Greg made.

What is so great about this book is that Greg does not explain what is going on. He gradually drops clues that allow you to, piece by piece, get a better understanding of the situation, and of how the story of the primitive civilization fits the wider context. For instance, you find out what “the Incandescence” is. At least you will if you are familiar with the relevant “space stuff”. While in this case it is hard to miss if you know the “space stuff”, the hints are never really confirmed. This is also true for perhaps the most important hinted-at thing. I had a “Holy shit! Could it be… this is epic” moment when I made that connection.

Not a book I would recommend to people new to Hard SF, nor the best Egan book to start with. In any case, before reading this one, read some of the stories of The Amalgam.

Diaspora

Since I liked Incandescence so much I decided to finally give Diaspora a go. Diaspora had been on my to-read list for years and I never got to it because of its low rating on Goodreads. (Rant: Paying so much attention to the rating was perhaps not so smart of me, since for non-mainstream books the ratings are not very telling. They mainly seem to be a combination of what average people expect and what is popular, with less correlation to quality, and a punishing of material that is too sophisticated for the uninitiated or less intelligent.)

I loved the first chapter of Diaspora “Orphanogenesis” and cried some tears of joy at its conclusion.

The final chapters are epic. Spoiler ahead. How can a structure spanning 200 trillion universes and encoding the thoughts of an entire civilization be anything but epic? This beats the Xeelee ring in scale. The choices of some of the characters near the end make little sense to me, though it is still a great book overall, and perhaps the not-making-sense part makes sense if one considers how much these characters are not traditional humans.

Introduction to Iterators and Generators in PHP

In this post I demonstrate an effective way to create iterators and generators in PHP and provide an example of a scenario in which using them makes sense.

Generators have been around since PHP 5.5, and iterators have been around since the Planck epoch. Even so, a lot of PHP developers do not know how to use them well and cannot recognize situations in which they are helpful. In this blog post I share insights I have gained over the years that, when shared, always got an interested response from fellow developers. The post goes beyond the basics, provides a real-world example, and includes a few tips and tricks. To not leave out those unfamiliar with iterators, the post starts with the “What are Iterators” section, which you can safely skip if you can already answer that question.

What are Iterators

PHP has an Iterator interface that you can implement to represent a collection. You can loop over an instance of an Iterator just like you can loop over an array:
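
A minimal sketch, using the ArrayIterator that PHP ships with:

```php
$iterator = new ArrayIterator(['such', 'values', 'wow']);

foreach ($iterator as $key => $value) {
    echo $key, ' => ', $value, PHP_EOL;
}
```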

Why would you bother implementing the Iterator interface rather than just using an array? Let’s look at an example.

Imagine you have a directory with a bunch of text files. One of the files contains an ASCII NyanCat (~=[,,_,,]:3). It is the task of our code to find which file the NyanCat is hiding in.

We can get all the files by doing a glob( $path . '*.txt' ) and we can get the contents of a file with file_get_contents. We could just have a foreach going over the glob result that does the file_get_contents. Luckily we realize this would violate separation of concerns and make the “does this file contain NyanCat” logic hard to test, since it would be bound to the filesystem access code. Hence we create a function that gets the contents of the files, and one with our logic in it:
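
Something along these lines (a sketch; getFileTexts is an assumed name, findTextWithNyanCat is referenced below):

```php
function getFileTexts(string $path): array {
    // Eagerly fetches the contents of ALL text files in $path
    return array_map('file_get_contents', glob($path . '*.txt'));
}

function findTextWithNyanCat(iterable $texts): ?string {
    foreach ($texts as $text) {
        if (strpos($text, '~=[,,_,,]:3') !== false) {
            return $text;
        }
    }

    return null;
}

$nyanText = findTextWithNyanCat(getFileTexts($path));
```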

While this approach is decoupled, a big drawback is that now we need to fetch the contents of all files and keep all of that in memory before we even start executing any of our logic. If NyanCat is hiding in the first file, we’ll have fetched the contents of all others for nothing. We can avoid this by using an Iterator, as they can fetch their values on demand: they are lazy.
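
A sketch of such a lazy iterator (a reconstruction, not the original listing from this post):

```php
class TextFileIterator implements Iterator {
    private $paths;
    private $position = 0;

    public function __construct(string $path) {
        // Only the file paths are gathered upfront...
        $this->paths = glob($path . '*.txt');
    }

    public function current() {
        // ...the contents are fetched on demand, one file at a time
        return file_get_contents($this->paths[$this->position]);
    }

    public function key() {
        return $this->position;
    }

    public function next() {
        $this->position++;
    }

    public function rewind() {
        $this->position = 0;
    }

    public function valid() {
        return array_key_exists($this->position, $this->paths);
    }
}

$nyanText = findTextWithNyanCat(new TextFileIterator($path));
```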

Our TextFileIterator gives us a nice place to put all the filesystem code, while to the outside just looking like a collection of texts. The function housing our logic, findTextWithNyanCat, does not know that the text comes from the filesystem. This means that if you decide to get texts from the database, you could just create a new DatabaseTextBlobIterator and pass it to the logic function without making any changes to the latter. Similarly, when testing the logic function, you can give it an ArrayIterator.

I wrote more about basic Iterator functionality in Lazy iterators in PHP and Python and Some fun with iterators. I also blogged about a library that provides some (Wikidata specific) iterators and a CLI tool built around an Iterator. For more on how generators work, see the off-site post Generators in PHP.

PHP’s collection type hierarchy

Let’s start by looking at PHP’s type hierarchy for collections as of PHP 7.1. These are the core types that I think are most important:

  •  iterable
    • array
    • Traversable
      • Iterator
        • Generator
      • IteratorAggregate

At the very top we have iterable, the supertype of both array and Traversable. If you are not familiar with this type or are using a version of PHP older than 7.1, don’t worry, we don’t need it for the rest of this blog post.

Iterator is a subtype of Traversable, and the same goes for IteratorAggregate. The standard library iterator_ functions such as iterator_to_array all take a Traversable. This is important since it means you can give them an IteratorAggregate, even though it is not an Iterator. Later on in this post we’ll get back to what exactly an IteratorAggregate is and why it is useful.

Finally we have Generator, which is a subtype of Iterator. That means all functions that accept an Iterator can be given a Generator, and, by extension, that you can use generators in combination with the Iterator classes in the Standard PHP Library such as LimitIterator and CachingIterator.
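
For instance, a generator can be wrapped in the SPL’s LimitIterator (a minimal sketch):

```php
$numbers = (function () {
    $n = 0;
    while (true) {
        yield $n++;
    }
})();

// Take only the first three values of the infinite generator
foreach (new LimitIterator($numbers, 0, 3) as $number) {
    echo $number, PHP_EOL; // 0, 1, 2
}
```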

IteratorAggregate + Generator = <3

Generators are a nice and easy way to create iterators. Often you’ll only loop over them once, and not have any problem. However, beware that generators create iterators that are not rewindable, which means that if you loop over them more than once, you’ll get an exception.

Imagine the scenario where you pass in a generator to a service that accepts an instance of Traversable:
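
A sketch of that scenario (ThingProcessor is an illustrative name; doStuff is the method referenced below):

```php
class ThingProcessor {
    public function doStuff(Traversable $things): void {
        foreach ($things as $thing) {
            // first pass: works fine
        }

        foreach ($things as $thing) {
            // second pass: blows up when $things is a Generator
        }
    }
}

$things = (function () {
    yield 'such';
    yield 'things';
})();

(new ThingProcessor())->doStuff($things); // Exception: generators cannot be rewound
```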

The service class in which doStuff resides does not know it is getting a Generator, it just knows it is getting a Traversable. When working on this class, it is entirely reasonable to iterate through $things a second time.

This blows up if the provided $things is a Generator, because generators are non-rewindable. Note that it does not matter how you iterate through the value. Calling iterator_to_array with $things has the exact same result as using it in a foreach loop. Most, if not all, generators I have written do not use resources or state that inherently prevents them from being rewindable. So the double-iteration issue can be unexpected and seemingly silly.

There is a simple and easy way to get around it though. This is where IteratorAggregate comes in. Classes implementing IteratorAggregate must implement the getIterator() method, which returns a Traversable. Creating one of these is extremely trivial:
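
A minimal sketch of such a class (the name is made up):

```php
class AwesomeThings implements IteratorAggregate {
    public function getIterator(): Generator {
        yield 'such';
        yield 'things';
    }
}
```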

If you call getIterator, you’ll get a Generator instance, just like you’d expect. However, normally you never call this method. Instead you use the IteratorAggregate just as if it was an Iterator, by passing it to code that expects a Traversable. (This is also why usually you want to accept Traversable and not just Iterator.) We can now call our service that loops over the $things twice without any problem:
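
Continuing the sketch from above:

```php
// Each foreach inside doStuff asks the IteratorAggregate for a
// fresh Generator via getIterator(), so iterating twice is fine.
(new ThingProcessor())->doStuff(new AwesomeThings());
```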

By using IteratorAggregate we did not just solve the non-rewindable problem, we also found a good way to share our code. Sometimes it makes sense to use the code of a Generator in multiple classes, and sometimes it makes sense to have dedicated tests for the Generator. In both cases having a dedicated class and file to put it in is very helpful, and a lot nicer than exposing the generator via some public static function.

For cases where it does not make sense to share a Generator and you want to keep it entirely private, you might need to deal with the non-rewindable problem. For those cases you can use my Rewindable Generator library, which allows making your generators rewindable by wrapping their creation function:
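
Usage looks roughly like this (see the library’s README for the exact API):

```php
$things = new RewindableGenerator(function () {
    yield 'such';
    yield 'things';
});

foreach ($things as $thing) {
    // first pass
}

foreach ($things as $thing) {
    // second pass: works, the wrapper re-invokes the generator function
}
```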

A real-world example

A few months ago I refactored some code that is part of the Wikimedia Deutschland fundraising codebase. This code gets the filesystem paths of email templates by looking in a set of specified directories.
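
The code looked roughly like this (a reconstruction; the class name and the .twig extension are assumptions):

```php
class MailTemplateFinder {
    private $templateDirectories;

    public function __construct(array $templateDirectories) {
        $this->templateDirectories = $templateDirectories;
    }

    public function getTemplateNames(): array {
        $templateNames = [];

        foreach ($this->templateDirectories as $directory) {
            $paths = glob($directory . '/*.twig');

            // array_walk mutates $paths via its by-reference parameter
            array_walk($paths, function (&$path) {
                $path = basename($path, '.twig');
            });

            // ...and this assignment mutates the return variable on every pass
            $templateNames = array_merge($templateNames, $paths);
        }

        return $templateNames;
    }
}
```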

This code made the class bound to the filesystem, which made it hard to test. In fact, this code was not tested. Furthermore, this code irked me, since I like code to be on the functional side. The array_walk mutates its by-reference variable and the assignment at the end of the loop mutates the return variable.

This was refactored using the awesome IteratorAggregate + Generator combo:
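
The result looked along these lines (again a sketch, with assumed names):

```php
class TemplateNameIterator implements IteratorAggregate {
    private $templateDirectories;

    public function __construct(array $templateDirectories) {
        $this->templateDirectories = $templateDirectories;
    }

    public function getIterator(): Generator {
        foreach ($this->templateDirectories as $directory) {
            foreach (glob($directory . '/*.twig') as $path) {
                yield basename($path, '.twig');
            }
        }
    }
}
```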

Much easier to read/understand code, no state mutation whatsoever, good separation of concerns, easier testing and reusability of this collection building code elsewhere.

See also: Use cases for PHP generators (off-site post).

Tips and Tricks

Generators can yield key value pairs:
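
For instance (the names are made up):

```php
function characterBirthYears(): Generator {
    yield 'NyanCat' => 2011;
    yield 'GrumpyCat' => 2012;
}

foreach (characterBirthYears() as $character => $year) {
    echo $character, ': ', $year, PHP_EOL;
}
```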

You can use yield in PHPUnit data providers.

You can yield from an iterable.
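
For example:

```php
function moreNumbers(): Generator {
    yield 3;
    yield 4;
}

function allNumbers(): Generator {
    yield 0;
    yield from [1, 2];        // arrays and other iterables can be delegated to
    yield from moreNumbers(); // and so can other generators
}

// Outputs 0 through 4
foreach (allNumbers() as $number) {
    echo $number, PHP_EOL;
}
```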

Thanks to Leszek Manicki and Jan Dittrich for reviewing this blog post.

Yield in PHPUnit data providers

Initially I started creating a general post about PHP Generators, a feature introduced in PHP 5.5. However since I keep failing to come up with good examples for some cool ways to use Generators, I decided to do this mini post focusing on one such cool usage.

PHPUnit data providers

A commonly used PHPUnit feature is data providers. In a data provider you specify a list of argument lists, and the test methods that use the data provider get called once for each argument list.

Often data providers are created with an array variable in which the argument lists get stuffed. Example (including poor naming):
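
A sketch of what this tends to look like ($return being the poor name; the test subject is made up):

```php
class AdditionTest extends \PHPUnit\Framework\TestCase {

    /**
     * @dataProvider additionProvider
     */
    public function testAddition(int $a, int $b, int $expected) {
        $this->assertSame($expected, $a + $b);
    }

    public function additionProvider(): array {
        $return = [];

        $return[] = [1, 1, 2];
        $return[] = [2, 3, 5];

        return $return;
    }
}
```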

The not so nice thing here is that you have a variable (explicit state) and you modify it (mutable state). A more functional approach is to just return an array that holds the argument lists directly. However if your argument list creation is more complex than in this example, requiring state, this might not work. And when such state is required, you end up with more complexity and a higher chance that the $return variable will bite you.

Using yield

What you might not have realized is that data providers do not need to return an array. They need to return an iterable, so they can also return an Iterator, and by extension, a Generator. This means you can write the above data provider as follows:
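
Continuing the sketch from above:

```php
public function additionProvider(): Generator {
    yield [1, 1, 2];
    yield [2, 3, 5];
}
```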

No explicit state to be seen!

Update: my Introduction to Iterators and Generators in PHP is now live 🙂

The Fallacy of DRY

DRY, standing for Don’t Repeat Yourself, is a well-known design principle in the software development world.

It is not uncommon for removal of duplication to take center stage via mantras such as “Repetition is the root of all evil”. Yet while duplication is often bad, the well intended pursuit of DRY often leads people astray. To see why, let’s take a step back and look at what we want to achieve by removing duplication.

The Goal of Software

First and foremost, software exists to fulfill a purpose. Your client, which can be your employer, is paying money because they want the software to provide value. As a developer it is your job to provide this value as effectively as possible. This includes tasks beyond writing code to do whatever your client specifies, and might best be done by not writing any code. The creation of code is expensive. Maintenance of code and extension of legacy code is even more so.

Since creation and maintenance of software is expensive, the quality of a developer’s work (when just looking at the code) can be measured by how quickly functionality is delivered in a satisfactory manner, and how easy to maintain and extend the system is afterwards. Many design discussions arise about trade-offs between those two measures. The DRY principle mainly situates itself in the latter category: reducing maintenance costs. Unfortunately, applying DRY blindly often leads to increased maintenance costs.

The Good Side of DRY

So how does DRY help us reduce maintenance costs? If code is duplicated, and it needs to be changed, you will need to find all places where it is duplicated and apply the change. This is (obviously) more difficult than modifying one place, and more error-prone. You can forget about one place where the change needs to be applied, you can accidentally apply it differently in one location, or you can modify code that happens to be the same at present but should nevertheless not be changed due to conceptual differences (more on this later). This is also known as Shotgun Surgery. Duplicated code tends to also obscure the structure and intent of your code, making it harder to understand and modify. And finally, it conveys a sense of carelessness and lack of responsibility, which begets more carelessness.

Everyone who has been in the industry for a little while has come across horrid procedural code, or perhaps pretend-OO code, where copy-paste was apparently the favorite hammer of its creators. Such programmers indeed should heed DRY, because what they are producing suffers from the issues we just went over. So where is The Fallacy of DRY?

The Fallacy of DRY

Since removal of duplication is a means towards more maintainable code, we should only remove duplication if that removal makes the code more maintainable.

If you are reading this, presumably you are not a copy-and-paste programmer. Almost no one I ever worked with is. Once you know how to create well designed OO applications (i.e. by knowing the SOLID principles), are writing tests, etc., the code you create will be very different from the work of a copy-paste programmer. Even when adhering to the SOLID principles (to the extent that it makes sense) there might still be duplication that should be removed. The catch here is that this duplication will be mixed together with duplication that should stay, since removing it makes the code less maintainable. Hence trying to remove all duplication is likely to be counterproductive.

Costs of Unification

How can removing duplication make code less maintainable? If the costs of unification outweigh the costs of duplication, then we should stick with duplication. We’ve already gone over some of the costs of duplication, such as the need for Shotgun Surgery. So let’s now have a look at the costs of unification.

The first cost is added complexity. If you have two classes with a little bit of common code, you can extract this common code into a service, or if you are a masochist extract it into a base class. In both cases you got rid of the duplication by introducing a new class. While doing this you might reduce the total complexity by not having the duplication, and such extracting might make sense in the first place for instance to avoid a Single Responsibility Principle violation. Still, if the only reason for the extraction is reducing duplication, ask yourself if you are reducing the overall complexity or adding to it.

Another cost is coupling. If you have two classes with some common code, they can be fully independent. If you extract the common code into a service, both classes will now depend upon this service. This means that if you make a change to the service, you will need to pay attention to both classes using the service, and make sure they do not break. This is especially a problem if the service ends up being extended to do more things, though that is more of a SOLID issue. I’ll skip going over the results of code reuse via inheritance to avoid suicidal (or homicidal) thoughts in myself and my readers.

DRY = Coupling

— A slide at DDDEU 2017

The coupling increases the need for communication. This is especially true in the large, when talking about unifying code between components or applications, and when different teams end up depending on the same shared code. In such a situation it becomes very important that it is clear to everyone what exactly is expected from a piece of code, and making changes is often slow and costly due to the communication needed to make sure they work for everyone.

Another result of unification is that code can no longer evolve separately. If we have our two classes with some common code, and in the first a small behavior change is needed in this code, this change is easy to make. If you are dealing with a common service, you might do something such as adding a flag. That might even be the best thing to do, though it is likely to be harmful design-wise. Either way, you start down the path of corrupting your service, which has now turned into a frog in a pot of water that is being heated. If you unified your code, this is another point at which to ask yourself if that is still the best trade-off, or if some duplication might be easier to maintain.

You might be able to represent two different concepts with the same bit of code. This is problematic not only because different concepts need to be able to evolve individually, but also because having only a single representation in the code is misleading, effectively hiding that you are dealing with two different concepts. This is another point that gains importance the bigger the scope of reuse. Domain Driven Design has a strategic pattern called Bounded Contexts, which is about the separation of code that represents different (sub)domains. Generally speaking it is good to avoid sharing code between Bounded Contexts. You can find a concrete example of using the same code for two different concepts in my blog post on Implementing the Clean Architecture, in the section “Lesson learned: bounded contexts”.

DRY is for one Bounded Context

— Eric Evans in Good Design is Imperfect Design

Example
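
Both snippets below are reconstructions, with doAction standing in for an arbitrary statement. First:

```php
for ($i = 1; $i < 5; $i++) {
    doAction();
}
```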

How many times is doAction called? 3 times? 4 times? What about in this snippet:
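
```php
doAction();
doAction();
doAction();
doAction();
```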

I’m sure you can figure out what the first snippet does. The point here is that it takes more effort and that it is a lot easier to make a mistake.

This demonstrates how removing duplication on a very low-level can be harmful. If you have a good higher level example that can be understood unambiguously without providing a lot of extra context, do leave a comment.

Conclusion

Duplication itself does not matter. We care about code being easy (cheap) to modify without introducing regressions. Therefore we want simple code that is easy to understand. Pursuing removal of duplication as an end-goal rather than looking at the costs and benefits tends to result in a more complex codebase, with higher coupling, higher communication needs, inferior design and misleading code.

Review of Ayreon: The Source

In this post I review the source code of the Ayreon software. Well, actually not. This is a review of The Source, a progressive rock/metal album from the band Ayreon. Yes really. Much wow omg.

Overall rating

This album is awesome.

Like every Ayreon album, The Source features a crapton of singers, each portraying a character in the album’s story, with songs always featuring multiple of them, often with interleaving lines. The mastermind behind Ayreon is Arjen Anthony Lucassen, who for each album borrows some of the most OP singers and instrumentalists from bands all over the place. Hence if you are a metal or rock fan, you are bound to know some of the lineup.

What if you are not into those genres? I’ve seen Arjen described as a modern-day Mozart and some of his albums as works of art. Art that can be appreciated even if you otherwise are not a fan of the genre.

Some nitpicks

The lyrics, while nice, are to me not quite as epic as those in the earlier Ayreon album 01011001. A high bar to set, since 01 is my favorite Ayreon album. (At least when removing 2 tracks from it that I really do not like. Which is what I did years ago so I don’t remember what they are called. Which is bonus points for The Source, which has no tracks I dislike.) One of the things I really like about it is that some of the lyrics have characters with opposite moral or emotional stances debate which course of action to take.

These are some of the lyrics of The Fifth Extinction (a song from 01011001), with one singer’s lines in green italics and the other’s in red normal font:

I see a planet, perfect for our needs
behold our target, a world to plant our seeds
There must be life
first remove any trace of doubt!
they may all die
Don’t you think we should check it out?
We have no choice, we waited far too long
this is our planet, this is where they belong.
We may regret this,
is this the way it’s supposed to be?
A cold execution, a mindless act of cruelty!

I see mainly reptiles
A lower form or intelligence
mere brainless creatures
with no demonstrable sentience
What makes us superior
We did not do so great ourselves!
A dying race, imprisoned in restricted shells

<3

The only other nitpick I have is that the background tone in Star of Sirrah is kinda annoying when paying attention to the song.

Given my potato sophistication when it comes to music, I can’t say much more about the album. Go get a copy, it’s well worth the moneys.

The story

Now we get to the real reason I’m writing this review, or more honestly, this rant. If you’re not an Ayreon fanboy like me, this won’t be interesting for you (unless you like rants). Spoilers on the story in The Source ahead.

In summary, the story in The Source is as follows: There is a human civilization on a planet called Alpha and they struggle with several ecological challenges. To fix the problems they give control to Skynet (yes I will call it that). Skynet shuts everything down so everyone dies. Except a bunch of people who get onto a spaceship and find a home on a new planet to start over again. The story focuses on these people, how they deal with Skynet shutting everything down, their coming together, their journey to the new planet and how they begin new lives there.

Originality

It’s an OK story. A bit cheesy and not very original. Arjen clearly likes the “relying on machines is bad” topic, as evidenced by 01011001 and Victims of the Modern Age (Star One). When it comes to originality in similar concept albums, I think 01011001 does a way better job. Same goes for Cybion (Kalisia), which, although it focuses on different themes and was not created by Arjen (it does feature him very briefly), has a story with a similar structure. (Perhaps comparing with Cybion is a bit mean, since that bar is so high it’s hidden by the clouds.)

Consistency

Then there are some serious WTFs in the story. For instance, planet Alpha blows up after some time because of the quantum powercore melting down due to the cooling systems being deactivated. Why would Skynet let that happen? If it can take on an entire human civilization, surely it knows what such deactivation will result in? Why would it commit suicide, and fail its mission to deal with the ecological issues? Of course, this is again a bit nitpicky. Logic and consistency in the story are not the most important thing on such an album. Still, it bugs me.

Specificness

Another difference with 01011001 is that the story in The Source is more specific about things. If it were not, a lot of the WTFs would presumably be avoided, and you’d be more free to use your imagination. 01011001 tells the story of an aquatic race called the Forever and how they create humankind to solve their wee bit of an apathy problem. There are no descriptions of what the Forever look like, beyond them being an aquatic race, and no description of what their world and technology look like beyond the very abstract.

Take this line from Age of Shadows, the opening song of 01011001:

Giant machines blot out the sun

When I first heard this song, this line gave me the chills, as I was imagining giant machines in orbit around the system’s star (kinda like a Dyson swarm), or perhaps just passing in between the planet and the star, still at a significant distance from the planet. It took some months for me to realize that the lyrics’ author probably was thinking of surface-based machines, which makes them significantly less giant and cool. The lyrics don’t specify that though.

The Ayreon story

Everything I described so far are minor points to me. What really gets me is what The Source does to the overall Ayreon story. Let’s recap what it looked like before The Source:

The Forever lose their emotions and create the human race to fix themselves. Humanity goes WW3 and blows itself up, despite the Forever helping them to avoid this fate, and perhaps due to the Forever’s meddling to accelerate human evolution. Still, the Forever are able to awaken their race through the memories of the humans.

Of course this is just a very high level description, and there is much more to it than that. The Source changes this. It’s a prequel to 01011001 and reveals that the Forever are human, namely the humans that fled Alpha… Which turns the high level story into:

Humans on Alpha fuck their environment through the use of technology and then build Skynet. Some of them run away to a new planet (the water world they call Y) and re-engineer themselves to live there. They manage to fuck themselves with technology again and decide to create a new human race on Earth. Those new humans also fuck themselves with technology.

So to me The Source ruins a lot of the story from the other Ayreon albums. Instead of having this alien race and the humans, each with their own problems, we now have just humans, who manage to fuck themselves over 3 times in a row and win the universe’s biggest tards award. Great. #GrumpyJeroenIsGrumpy

Conclusion

Even with all the grump, I think this is an awesome album. Just don’t expect too much from the story, which is OK, but definitely not as great as the rest of the package. Go buy a copy.

Generic Entity handling code

In this blog post I outline my thinking on sharing code that deals with different types of Entities in your domain. We’ll cover what Entities are, code reuse strategies, pitfalls such as Shotgun Surgery and Anemic Domain Models and finally Bounded Contexts.

Why I wrote this post

I work at Wikimedia Deutschland, where amongst other things, we are working on software called Wikibase, which is what powers the Wikidata project. We have a dedicated team for this software, called the Wikidata team, which I am not part of. As an outsider who is somewhat familiar with the Wikibase codebase, I came across a writeup of a perceived problem in this codebase and a pair of possible solutions. I happen to disagree with what the actual problem is, and as a consequence also with the solutions. Since explaining why I think that takes a lot of general (non-Wikibase specific) explanation, I decided to write a blog post.

DDD Entities

Let’s start with defining what an Entity is. Entities are a tactical Domain Driven Design pattern. They are things that can change over time and are compared by identity rather than by value, unlike Value Objects, which do not have an identity.

Wikibase has objects which are conceptually such Entities, though they are implemented … oddly from a DDD perspective. In the excerpt quoted below, the word entity is, confusingly, not referring to the DDD concept. Instead, the Wikibase domain has a concept called Entity, implemented by an abstract class with the same name, and derived from by specific types of Entities, i.e. Item and Property. Those are the objects that are conceptually DDD Entities, yet diverge from what a DDD Entity looks like.

Entities normally contain domain logic (the lack of this is called an Anemic Domain Model), and don’t have setters. The lack of setters does not mean they are immutable, it’s just that actions are performed through methods in the domain language (see Ubiquitous Language). For instance “confirmBooked()” and “cancel()” instead of “setStatus()”.

The perceived problem

What follows is an excerpt from a document aimed at figuring out how to best construct entities in Wikibase:

Some entity types have required fields:

  • Properties require a data type
  • Lexemes require a language and a lexical category (both ItemIds)
  • Forms require a grammatical feature (an ItemId)

The ID field is required by all entities. This is less problematic however, since the ID can be constructed and treated the same way for all kinds of entities. Furthermore, the ID can never change, while other required fields could be modified by an edit (even a property’s data type can be changed using a maintenance script).

The fact that Properties require the data type ID to be provided to the constructor is problematic in the current code, as evidenced in EditEntity::clearEntity:

…as well as in EditEntity::modifyEntity():

Such special case handling will not be possible for entity types defined in extensions.

It is very natural for (DDD) Entities to have required fields. That is not a problem in itself. For examples, you can look at our Fundraising software.

So what is the problem really?

Generic vs specific entity handling code

Normally when you have a (DDD) Entity, say a Donation, you also have dedicated code that deals with those Donation objects. If you have another entity, say MembershipApplication, you will have other code that deals with it.

If the code handling Donation and the code handing MembershipApplication is very similar, there might be an opportunity to share things via composition. One should be very careful to not do this for things that happen to be the same but are conceptually different, and might thus change differently in the future. It’s very easy to add a lot of complexity and coupling by extracting small bits of what would otherwise be two sets of simple and easy to maintain code. This is a topic worthy of its own blog post, and indeed, I might publish one titled The Fallacy of DRY in the near future.

This sharing via composition is not really visible “from outside” of the involved services, except for the code that constructs them. If you have a DonationRepository and a MembershipRepository interface, they will look the same whether their implementations share something or not. Repositories might share cross-cutting concerns such as logging. Logging is not something you want to do in your repository implementations themselves, but you can easily create simple logging decorators. A LoggingDonationRepository and LoggingMembershipRepository could both depend on the same Logger class (or more likely interface), and thus be sharing code via composition. In the end, the DonationRepository still just deals with Donation objects, the MembershipRepository still just deals with MembershipApplication objects, and both remain completely decoupled from each other.
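
A sketch of what such a decorator can look like (the repository interface and its method are illustrative; the logger is the PSR-3 LoggerInterface):

```php
use Psr\Log\LoggerInterface;

interface DonationRepository {
    public function storeDonation(Donation $donation): void;
}

class LoggingDonationRepository implements DonationRepository {
    private $repository;
    private $logger;

    public function __construct(DonationRepository $repository, LoggerInterface $logger) {
        $this->repository = $repository;
        $this->logger = $logger;
    }

    public function storeDonation(Donation $donation): void {
        $this->logger->info('Storing a donation');
        $this->repository->storeDonation($donation);
    }
}
```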

In the Wikibase codebase there is an attempt at code reuse by having services that can deal with all types of Entities. Phrased like this it sounds nice. From the perspective of the user of the service, things are great at first glance. Thing is, those services then are forced to actually deal with all types of Entities, which almost guarantees greater complexity than having dedicated services that focus on a single entity.

If your Donation and MembershipApplication entities both implement Foobarable and you have a FoobarExecution service that operates on Foobarable instances, that is entirely fine. Things get dodgy when your Entities don’t always share the things your service needs, and the service ends up getting instances of object, or perhaps some minimal EntityInterface type.

In those cases the service can add a bunch of “if has method doFoobar, call it with these arguments” logic. Or perhaps you’re checking against an interface instead of a method, though this is by and large the same. This approach leads to Shotgun Surgery. It is particularly bad if you have a general service. If your service is really only about the doFoobar method, then at least you won’t need to poke at it when a new Entity is added to the system that has nothing to do with the Foobar concept. If the service on the other hand needs to fully save something or send an email with a summary of the data, each new Entity type will force you to change your service.

The “if doFoobar exists” approach does not work if you want plugins to your system to be able to use your generic services with their own types of Entities. To enable that, and avoid the Shotgun Surgery, your general service can delegate to specific ones. For instance, you can have an EntityRepository service with a save method that takes an EntityInterface. In its constructor it would take an array of specific repositories, i.e. a DonationRepository and a MembershipRepository. In its save method it would loop through these specific repositories and somehow determine which one to use. Perhaps they would have a canHandle method that takes an EntityInterface, or perhaps EntityInterface has a getType method that returns a string that is also used as a key in the array of specific repositories. Once the right one is found, the EntityInterface instance is handed over to its save method.
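
As a sketch of the getType variant (all names illustrative):

```php
interface EntityInterface {
    public function getType(): string;
}

class EntityRepository {
    /**
     * @var object[] specific repositories (i.e. a DonationRepository), keyed by entity type
     */
    private $repositories;

    public function __construct(array $repositories) {
        $this->repositories = $repositories;
    }

    public function save(EntityInterface $entity): void {
        $type = $entity->getType();

        if (!array_key_exists($type, $this->repositories)) {
            throw new InvalidArgumentException("No repository registered for entity type '$type'");
        }

        // Hand the entity over to the specific repository
        $this->repositories[$type]->save($entity);
    }
}
```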

This delegation approach is sane enough from an OO perspective. It does however involve specific repositories, which raises the question of why you are creating a general one in the first place. If there is no compelling reason to create the general one, just stick to specific ones and save yourself all this unneeded complexity and vagueness.

In Wikibase there is a generic web API endpoint for creating new entities. The users provide a pile of information via JSON or a bunch of parameters, which includes the type of Entity they are trying to create. If you have this type of functionality, you are forced to deal with this in some way, and probably want to go with the delegation approach. To me having such an API endpoint is very questionable, with dedicated endpoints being the simpler solution for everyone involved.

To wrap this up: dedicated entity handling code is much simpler than generic code, making it easier to write, use, understand and modify. Code reuse, where warranted, is possible via composition inside of implementations without changing the interfaces of services. Generic entity handling code is almost always a bad choice.

On top of what I already outlined, there is another big issue you can run into when creating generic entity handling code like is done in Wikibase.

Bounded Contexts

Bounded Contexts are a key strategic concept from Domain Driven Design. They are key in the sense that if you don’t apply them in your project, you cannot effectively apply tactical patterns such as Entities and Value Objects, and are not really doing DDD at all.

“Strategy without tactics is the slowest route to victory. Tactics without strategy are the noise before defeat.” — Sun Tzu

Bounded Contexts allow you to segregate your domain models, ideally having a Bounded Context per subdomain. A detailed explanation and motivation of this pattern is out of scope for this post, but suffice it to say that Bounded Contexts allow for simplification and thus make it easier to write and maintain code. For more information I can recommend Domain-Driven Design Distilled.

In the case of Wikibase there are likely a dozen or so relevant subdomains. While I did not do the analysis needed to create a comprehensive picture of which subdomains there are, which types they have, and which Bounded Contexts would make sense, a few easily stand out.

There is the so-called core Wikibase software, which was created for Wikidata.org, and deals with structured data for Wikipedia. It has two types of Entities (both in the Wikibase and in the DDD sense): Item and Property. Then there is (planned) functionality for Wiktionary, which will be structured dictionary data, and for Wikimedia Commons, which will be structured media data. These are two separate subdomains, and thus each deserve their own Bounded Context. This means having no code and no conceptual dependencies on each other or the existing Big Ball of Mud type “Bounded Context” in the Wikibase core software.

Conclusion

When standard approaches are followed, Entities can easily have required fields and optional fields. Creating generic code that deals with different types of entities is very suspect and can easily lead to great complexity and brittle code, as seen in Wikibase. It is also a road to not separating concepts properly, which is particularly bad when crossing subdomain boundaries.

OOP file_get_contents

I’m happy to announce the immediate availability of FileFetcher 4.0.0.

FileFetcher is a small PHP library that provides an OO way to retrieve the contents of files.
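
The interface at the heart of the library looks roughly like this:

```php
namespace FileFetcher;

interface FileFetcher {

    /**
     * @throws FileFetchingException
     */
    public function fetchFile(string $fileUrl): string;

}
```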

What’s OO about such an interface? You can inject an implementation of it into a class, so that the class does not know about the details of the implementation, and you can choose which implementation to provide. Calling file_get_contents does not allow changing the implementation, as it is a procedural/static call making use of global state.

Library number 8234803417 that does this exact thing? Probably not. The philosophy behind this library is to provide a very basic interface (FileFetcher) that, while insufficient for plenty of use cases, is ideal for a great many, in particular replacing procedural file_get_contents calls. The provided implementations are there to facilitate testing and common generic tasks around the actual file fetching. You are encouraged to create your own core file fetching implementation in your codebase, presumably an adapter to a library that focuses on this task, such as Guzzle.

So what is in it then? The library provides several trivial implementations of the FileFetcher interface at its heart:

  • SimpleFileFetcher: Adapter around file_get_contents
  • InMemoryFileFetcher: Adapter around an array provided to its constructor
  • ThrowingFileFetcher: Throws a FileFetchingException for all calls (added after 4.0)
  • NullFileFetcher: Returns an empty string for all calls (added after 4.0)
  • StubFileFetcher: Returns a stub value for all calls (added after 4.0)

It also provides a number of generic decorators.
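
As a sketch of what such a decorator looks like (this particular class is illustrative; see the README for the decorators the library actually ships):

```php
class CachingFileFetcher implements FileFetcher {
    private $fileFetcher;
    private $cache = [];

    public function __construct(FileFetcher $fileFetcher) {
        $this->fileFetcher = $fileFetcher;
    }

    public function fetchFile(string $fileUrl): string {
        if (!array_key_exists($fileUrl, $this->cache)) {
            $this->cache[$fileUrl] = $this->fileFetcher->fetchFile($fileUrl);
        }

        return $this->cache[$fileUrl];
    }
}
```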

Version 4.0.0 brings PHP7 features (scalar type hints \o/) and adds a few extra handy implementations. You can add the library to your composer.json (jeroen/file-fetcher) or look at the documentation on GitHub. You can also read about its inception in 2013.