Introduction to Iterators and Generators in PHP

In this post I demonstrate an effective way to create iterators and generators in PHP and provide an example of a scenario in which using them makes sense.

Generators have been around since PHP 5.5, and iterators have been around since the Planck epoch. Even so, many PHP developers do not know how to use them well and cannot recognize situations in which they are helpful. In this blog post I share insights I have gained over the years that, when shared, always got an interested response from fellow developers. The post goes beyond the basics, provides a real-world example, and includes a few tips and tricks. To not leave out those unfamiliar with iterators, the post starts with the “What are Iterators” section, which you can safely skip if you can already answer that question.

What are Iterators

PHP has an Iterator interface that you can implement to represent a collection. You can loop over an instance of an Iterator just like you can loop over an array:
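
For example, using the SPL’s ArrayIterator (any other Iterator implementation behaves the same way in a foreach):

    $texts = new ArrayIterator( [ 'first text', 'second text' ] );

    foreach ( $texts as $text ) {
        echo $text . PHP_EOL;
    }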

Why would you bother implementing your own Iterator rather than just using an array? Let’s look at an example.

Imagine you have a directory with a bunch of text files. One of the files contains an ASCII NyanCat (~=[,,_,,]:3). It is the task of our code to find which file the NyanCat is hiding in.

We can get all the files by doing a glob( $path . '*.txt' ) and we can get the contents of a file with file_get_contents. We could just have a foreach going over the glob result that does the file_get_contents. Luckily we realize this would violate separation of concerns and make the “does this file contain NyanCat” logic hard to test, since it would be bound to the filesystem access code. Hence we create one function that gets the contents of the files, and one with our logic in it:
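
A minimal sketch of what these two functions could look like (the function bodies are my own illustration; only the name findTextWithNyanCat is taken from the post itself):

    function getTextsFromDirectory( string $path ): array {
        $texts = [];

        foreach ( glob( $path . '*.txt' ) as $filePath ) {
            $texts[] = file_get_contents( $filePath );
        }

        return $texts;
    }

    function findTextWithNyanCat( array $texts ): ?string {
        foreach ( $texts as $text ) {
            if ( strpos( $text, '~=[,,_,,]:3' ) !== false ) {
                return $text;
            }
        }

        return null;
    }

    $nyanText = findTextWithNyanCat( getTextsFromDirectory( '/path/to/texts/' ) );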

While this approach is decoupled, a big drawback is that now we need to fetch the contents of all files and keep all of that in memory before we even start executing any of our logic. If NyanCat is hiding in the first file, we’ll have fetched the contents of all others for nothing. We can avoid this by using an Iterator, as they can fetch their values on demand: they are lazy.
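
A sketch of such an iterator, which only reads a file’s contents when asked for the current element (the class name matches the one used below; the implementation details are my own assumptions):

    class TextFileIterator implements Iterator {

        private $filePaths;
        private $position = 0;

        public function __construct( string $directoryPath ) {
            $this->filePaths = glob( $directoryPath . '*.txt' );
        }

        public function current() {
            // The expensive filesystem read only happens here, on demand.
            return file_get_contents( $this->filePaths[$this->position] );
        }

        public function key() {
            return $this->position;
        }

        public function next() {
            $this->position++;
        }

        public function rewind() {
            $this->position = 0;
        }

        public function valid() {
            return array_key_exists( $this->position, $this->filePaths );
        }
    }

findTextWithNyanCat can then type hint against iterable (or Iterator) instead of array, and be handed a TextFileIterator, an ArrayIterator in tests, or any other collection of texts.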

Our TextFileIterator gives us a nice place to put all the filesystem code, while to the outside just looking like a collection of texts. The function housing our logic, findTextWithNyanCat, does not know that the text comes from the filesystem. This means that if you decide to get texts from the database, you could just create a new DatabaseTextBlobIterator and pass it to the logic function without making any changes to the latter. Similarly, when testing the logic function, you can give it an ArrayIterator.

I wrote more about basic Iterator functionality in Lazy iterators in PHP and Python and Some fun with iterators. I also blogged about a library that provides some (Wikidata specific) iterators and a CLI tool built around an Iterator. For more on how generators work, see the off-site post Generators in PHP.

PHP’s collection type hierarchy

Let’s start by looking at PHP’s type hierarchy for collections as of PHP 7.1. These are the core types that I think are most important:

  •  iterable
    • array
    • Traversable
      • Iterator
        • Generator
      • IteratorAggregate

At the very top we have iterable, the supertype of both array and Traversable. If you are not familiar with this type or are using a version of PHP older than 7.1, don’t worry, we don’t need it for the rest of this blog post.

Iterator is a subtype of Traversable, and the same goes for IteratorAggregate. The standard library iterator_ functions such as iterator_to_array all take a Traversable. This is important since it means you can give them an IteratorAggregate, even though it is not an Iterator. Later on in this post we’ll get back to what exactly an IteratorAggregate is and why it is useful.

Finally we have Generator, which is a subtype of Iterator. That means all functions that accept an Iterator can be given a Generator, and, by extension, that you can use generators in combination with the Iterator classes in the Standard PHP Library such as LimitIterator and CachingIterator.

IteratorAggregate + Generator = <3

Generators are a nice and easy way to create iterators. Often you’ll only loop over them once, and not have any problem. However, beware that generators create iterators that are not rewindable, which means that if you loop over them more than once, you’ll get an exception.

Imagine the scenario where you pass in a generator to a service that accepts an instance of Traversable:
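
A minimal sketch of that scenario (the names are illustrative; only doStuff is mentioned in the text below):

    function getThings(): Generator {
        yield 'first thing';
        yield 'second thing';
    }

    class ThingService {

        public function doStuff( Traversable $things ) {
            foreach ( $things as $thing ) {
                // First pass: works fine.
            }

            foreach ( $things as $thing ) {
                // Second pass: throws an Exception when $things is a Generator.
            }
        }
    }

    ( new ThingService() )->doStuff( getThings() );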

The service class in which doStuff resides does not know it is getting a Generator, it just knows it is getting a Traversable. When working on this class, it is entirely reasonable to iterate through $things a second time.

This blows up if the provided $things is a Generator, because generators are non-rewindable. Note that it does not matter how you iterate through the value: calling iterator_to_array with $things has the exact same result as using it in a foreach loop. Most, if not all, generators I have written do not use resources or state that inherently prevent them from being rewindable. So the double-iteration issue can be unexpected and seemingly silly.

There is a simple and easy way to get around it though. This is where IteratorAggregate comes in. Classes implementing IteratorAggregate must implement the getIterator() method, which returns a Traversable. Creating one of these is extremely trivial:
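
A sketch of such a class, wrapping the generator from the earlier example; note that getIterator can itself be a generator function:

    class ThingCollection implements IteratorAggregate {

        public function getIterator(): Generator {
            yield 'first thing';
            yield 'second thing';
        }
    }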

If you call getIterator, you’ll get a Generator instance, just like you’d expect. However, normally you never call this method. Instead you use the IteratorAggregate just as if it was an Iterator, by passing it to code that expects a Traversable. (This is also why usually you want to accept Traversable and not just Iterator.) We can now call our service that loops over the $things twice without any problem:
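
Continuing the sketch from above:

    // Each foreach inside doStuff calls getIterator() again,
    // so every iteration gets a fresh Generator.
    ( new ThingService() )->doStuff( new ThingCollection() );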

By using IteratorAggregate we did not just solve the non-rewindable problem, we also found a good way to share our code. Sometimes it makes sense to use the code of a Generator in multiple classes, and sometimes it makes sense to have dedicated tests for the Generator. In both cases having a dedicated class and file to put it in is very helpful, and a lot nicer than exposing the generator via some public static function.

For cases where it does not make sense to share a Generator and you want to keep it entirely private, you might need to deal with the non-rewindable problem. For those cases you can use my Rewindable Generator library, which allows making your generators rewindable by wrapping their creation function:
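
Usage looks roughly like this; the constructor signature and namespace are assumptions from memory, so check the library’s README for the authoritative version:

    $things = new RewindableGenerator( function() {
        yield 'first thing';
        yield 'second thing';
    } );

    // Unlike a plain Generator, this can be iterated multiple times.
    ( new ThingService() )->doStuff( $things );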

A real-world example

A few months ago I refactored some code that is part of the Wikimedia Deutschland fundraising codebase. This code gets the filesystem paths of email templates by looking in a set of specified directories.
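
The original code is not reproduced here; as a rough illustration of the pattern being described (not the actual fundraising code), it was shaped roughly like this:

    private function getTemplateNames( array $directories ): array {
        $templates = [];

        foreach ( $directories as $directory ) {
            $fileNames = glob( $directory . '/*.twig' );

            // Mutates its by-reference argument
            array_walk( $fileNames, function( &$fileName ) {
                $fileName = basename( $fileName );
            } );

            // Mutates the return variable on every pass through the loop
            $templates = array_merge( $templates, $fileNames );
        }

        return $templates;
    }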

This code made the class bound to the filesystem, which made it hard to test. In fact, this code was not tested. Furthermore, this code irked me, since I like code to be on the functional side. The array_walk mutates its by-reference variable and the assignment at the end of the loop mutates the return variable.

This was refactored using the awesome IteratorAggregate + Generator combo:
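
Again as a sketch of the resulting shape rather than the verbatim code:

    class TemplateNameIterator implements IteratorAggregate {

        private $directories;

        public function __construct( array $directories ) {
            $this->directories = $directories;
        }

        public function getIterator(): Generator {
            foreach ( $this->directories as $directory ) {
                foreach ( glob( $directory . '/*.twig' ) as $filePath ) {
                    yield basename( $filePath );
                }
            }
        }
    }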

The result: code that is much easier to read and understand, no state mutation whatsoever, good separation of concerns, easier testing, and reusability of this collection-building code elsewhere.

See also: Use cases for PHP generators (off-site post).

Tips and Tricks

Generators can yield key-value pairs:
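
For example, the yielded keys show up as the keys in a foreach:

    function getUserAges(): Generator {
        yield 'Abby' => 28;
        yield 'Bob' => 42;
    }

    foreach ( getUserAges() as $name => $age ) {
        echo "$name is $age years old" . PHP_EOL;
    }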

You can use yield in PHPUnit data providers.

You can yield from an iterable.
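
The yield from keyword (PHP 7.0+) delegates to any iterable, be it an array or another generator. A minimal example:

    function getMoreThings(): Generator {
        yield 'fourth thing';
    }

    function getAllThings(): Generator {
        yield 'first thing';
        yield from [ 'second thing', 'third thing' ];
        yield from getMoreThings();
    }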

Thanks to Leszek Manicki and Jan Dittrich for reviewing this blog post.

Yield in PHPUnit data providers

Initially I started creating a general post about PHP Generators, a feature introduced in PHP 5.5. However since I keep failing to come up with good examples for some cool ways to use Generators, I decided to do this mini post focusing on one such cool usage.

PHPUnit data providers

A commonly used PHPUnit feature is data providers. In a data provider you specify a list of argument lists, and the test methods that use the data provider get called once for each argument list.

Often data providers are created with an array variable in which the argument lists get stuffed. Example (including poor naming):
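
A sketch of such a provider (the Person class and the test itself are made up for this example; assuming a reasonably recent PHPUnit):

    class PersonTest extends \PHPUnit\Framework\TestCase {

        /**
         * @dataProvider invalidAgeProvider
         */
        public function testGivenInvalidAge_constructorThrowsException( $invalidAge ) {
            $this->expectException( \InvalidArgumentException::class );
            new Person( $invalidAge );
        }

        public function invalidAgeProvider(): array {
            $return = [];

            $return[] = [ -1 ];
            $return[] = [ 'not an age' ];
            $return[] = [ null ];

            return $return;
        }
    }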

The not so nice thing here is that you have a variable (explicit state) and you modify it (mutable state). A more functional approach is to just return an array that holds the argument lists directly. However if your argument list creation is more complex than in this example, requiring state, this might not work. And when such state is required, you end up with more complexity and a higher chance that the $return variable will bite you.

Using yield

What you might not have realized is that data providers do not need to return an array. They need to return an iterable, so they can also return an Iterator, and by extension, a Generator. This means you can write the above data provider as follows:
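
The same provider using yield; as a bonus, string keys give the data sets readable names in PHPUnit’s output:

    public function invalidAgeProvider(): Generator {
        yield 'negative age' => [ -1 ];
        yield 'non-numeric age' => [ 'not an age' ];
        yield 'null age' => [ null ];
    }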

No explicit state to be seen!

Stay tuned for more generator goodness if I can overcome my own laziness (hint hint :))

The Fallacy of DRY

DRY, standing for Don’t Repeat Yourself, is a well-known design principle in the software development world.

It is not uncommon for removal of duplication to take center stage via mantras such as “Repetition is the root of all evil”. Yet while duplication is often bad, the well intended pursuit of DRY often leads people astray. To see why, let’s take a step back and look at what we want to achieve by removing duplication.

The Goal of Software

First and foremost, software exists to fulfill a purpose. Your client, who may well be your employer, is paying money because they want the software to provide value. As a developer it is your job to provide this value as effectively as possible. This includes tasks beyond writing code to do whatever your client specifies, and sometimes it is best done by not writing any code at all. The creation of code is expensive. Maintenance of code and extension of legacy code is even more so.

Since creation and maintenance of software is expensive, the quality of a developer’s work (when just looking at the code) can be measured by how quickly functionality is delivered in a satisfactory manner, and by how easy the system is to maintain and extend afterwards. Many design discussions arise about trade-offs between those two measures. The DRY principle mainly situates itself in the latter category: reducing maintenance costs. Unfortunately, applying DRY blindly often leads to increased maintenance costs.

The Good Side of DRY

So how does DRY help us reduce maintenance costs? If code is duplicated, and it needs to be changed, you will need to find all places where it is duplicated and apply the change. This is (obviously) more difficult than modifying one place, and more error-prone. You can forget about one place where the change needs to be applied, you can accidentally apply it differently in one location, or you can modify code that happens to be the same at present but should nevertheless not be changed due to conceptual differences (more on this later). This is also known as Shotgun Surgery. Duplicated code also tends to obscure the structure and intent of your code, making it harder to understand and modify. And finally, it conveys a sense of carelessness and lack of responsibility, which begets more carelessness.

Everyone who has been in the industry for a little while has come across horrid procedural code, or perhaps pretend-OO code, where copy-paste was apparently the favorite hammer of its creators. Such programmers indeed should heed DRY, because what they are producing suffers from the issues we just went over. So where is The Fallacy of DRY?

The Fallacy of DRY

Since removal of duplication is a means towards more maintainable code, we should only remove duplication if that removal makes the code more maintainable.

If you are reading this, presumably you are not a copy-and-paste programmer. Almost no one I have ever worked with is. Once you know how to create well-designed OO applications (i.e. by knowing the SOLID principles), are writing tests, etc., the code you create will be very different from the work of a copy-paste programmer. Even when adhering to the SOLID principles (to the extent that it makes sense) there might still be duplication that should be removed. The catch here is that this duplication will be mixed together with duplication that should stay, since removing it would make the code less maintainable. Hence trying to remove all duplication is likely to be counterproductive.

Costs of Unification

How can removing duplication make code less maintainable? If the costs of unification outweigh the costs of duplication, then we should stick with duplication. We’ve already gone over some of the costs of duplication, such as the need for Shotgun Surgery. So let’s now have a look at the costs of unification.

The first cost is added complexity. If you have two classes with a little bit of common code, you can extract this common code into a service, or, if you are a masochist, extract it into a base class. In both cases you get rid of the duplication by introducing a new class. While doing this you might reduce the total complexity by not having the duplication, and such extraction might make sense in the first place, for instance to avoid a Single Responsibility Principle violation. Still, if the only reason for the extraction is reducing duplication, ask yourself if you are reducing the overall complexity or adding to it.

Another cost is coupling. If you have two classes with some common code, they can be fully independent. If you extract the common code into a service, both classes will now depend upon this service. This means that if you make a change to the service, you will need to pay attention to both classes using the service, and make sure they do not break. This is especially a problem if the service ends up being extended to do more things, though that is more of a SOLID issue. I’ll skip going over the results of code reuse via inheritance to avoid suicidal (or homicidal) thoughts in myself and my readers.

DRY = Coupling

— A slide at DDDEU 2017

The coupling increases the need for communication. This is especially true in the large, when talking about unifying code between components or applications, and when different teams end up depending on the same shared code. In such a situation it becomes very important that it is clear to everyone what exactly is expected from a piece of code, and making changes is often slow and costly due to the communication needed to make sure they work for everyone.

Another result of unification is that code can no longer evolve separately. If we have our two classes with some common code, and the first needs a small behavior change in this code, that change is easy to make. If you are dealing with a common service, you might do something such as adding a flag. That might even be the best thing to do, though it is likely to be harmful design-wise. Either way, you start down the path of corrupting your service, which has now turned into a frog in a pot of water that is slowly being heated. If you unified your code, this is another point at which to ask yourself if that is still the best trade-off, or if some duplication might be easier to maintain.

You might be able to represent two different concepts with the same bit of code. This is problematic not only because different concepts need to be able to evolve individually, it’s also misleading to have only a single representation in the code, which effectively hides that you are dealing with two different concepts. This is another point that gains importance the bigger the scope of reuse. Domain Driven Design has a strategic pattern called Bounded Contexts, which is about the separation of code that represents different (sub)domains. Generally speaking it is good to avoid sharing code between Bounded Contexts. You can find a concrete example of using the same code for two different concepts in my blog post on Implementing the Clean Architecture, in the section “Lesson learned: bounded contexts”.

DRY is for one Bounded Context

— Eric Evans

Conclusion

Duplication itself does not matter. We care about code being easy (cheap) to modify without introducing regressions. Therefore we want simple code that is easy to understand. Pursuing removal of duplication as an end-goal rather than looking at the costs and benefits tends to result in a more complex codebase, with higher coupling, higher communication needs, inferior design and misleading code.

Review of Ayreon: The Source

In this post I review the source code of the Ayreon software. Well, actually not. This is a review of The Source, a progressive rock/metal album from the band Ayreon. Yes really. Much wow omg.

Overall rating

This album is awesome.

Like every Ayreon album, The Source features a crapton of singers, each portraying a character in the album’s story, with songs always featuring multiple of them, often with interleaving lines. The mastermind behind Ayreon is Arjen Anthony Lucassen, who for each album borrows some of the most OP singers and instrumentalists from bands all over the place. Hence if you are a metal or rock fan, you are bound to know some of the lineup.

What if you are not into those genres? I’ve seen Arjen described as a modern day Mozart and some of his albums as works of art. Art that can be appreciated even if you otherwise are not a fan of the genre.

Some nitpicks

The lyrics, while nice, are to me not quite as epic as those in the earlier Ayreon album 01011001. A high bar to set, since 01 is my favorite Ayreon album. (At least when removing 2 tracks from it that I really do not like. Which is what I did years ago so I don’t remember what they are called. Which is bonus points for The Source, which has no tracks I dislike.) One of the things I really like about it is that some of the lyrics have characters with opposite moral or emotional stances debate which course of action to take.

These are some of the lyrics of The Fifth Extinction (a song from 01011001), with one singer’s lines in green italics and the other’s in red normal font:

I see a planet, perfect for our needs
behold our target, a world to plant our seeds
There must be life
first remove any trace of doubt!
they may all die
Don’t you think we should check it out?
We have no choice, we waited far too long
this is our planet, this is where they belong.
We may regret this,
is this the way it’s supposed to be?
A cold execution, a mindless act of cruelty!

I see mainly reptiles
A lower form of intelligence
mere brainless creatures
with no demonstrable sentience
What makes us superior
We did not do so great ourselves!
A dying race, imprisoned in restricted shells

<3

The only other nitpick I have is that the background tone in Star of Sirrah is kinda annoying when paying attention to the song.

Given my potato sophistication when it comes to music, I can’t say much more about the album. Go get a copy, it’s well worth the moneys.

The story

Now we get to the real reason I’m writing this review, or more honestly, this rant. If you’re not an Ayreon fanboy like me, this won’t be interesting for you (unless you like rants). Spoilers on the story in The Source ahead.

In summary, the story in The Source is as follows: There is a human civilization on a planet called Alpha and they struggle with several ecological challenges. To fix the problems they give control to Skynet (yes I will call it that). Skynet shuts everything down so everyone dies. Except a bunch of people who get onto a spaceship and find a home on a new planet to start over again. The story focuses on these people, how they deal with Skynet shutting everything down, their coming together, their journey to the new planet and how they begin new lives there.

Originality

It’s an OK story. A bit cheesy and not very original. Arjen clearly likes the “relying on machines is bad” topic, as evidenced by 01011001 and Victims of the Modern Age (Star One). When it comes to originality in similar concept albums I think 01011001 does a way better job. The same goes for Cybion (Kalisia), which, although it focuses on different themes and was not created by Arjen (it does feature him very briefly), has a story with a similar structure. (Perhaps comparing with Cybion is a bit mean, since that bar is so high it’s hidden by the clouds.)

Consistency

Then there are some serious WTFs in the story. For instance, planet Alpha blows up after some time because of the quantum powercore melting down due to the cooling systems being deactivated. Why would Skynet let that happen? If it can take on an entire human civilization, surely it knows what such deactivation will result in? Why would it commit suicide, and fail its mission to deal with the ecological issues? Of course, this is again a bit nitpicky. Logic and consistency in the story are not the most important thing on such an album, of course. Still, it bugs me.

Specificness

Another difference with 01011001 is that the story in The Source is more specific about things. If it were not, a lot of the WTFs would presumably be avoided, and you’d be more free to use your imagination. 01011001 tells the story of an Aquatic race called the Forever and how they create humankind to solve their wee bit of an apathy problem. There are no descriptions of what the Forever look like, beyond them being an Aquatic race, and no description of what their world and technology look like beyond the very abstract.

Take this line from Age of Shadows, the opening song of 01011001:

Giant machines blot out the sun

When I first heard this song this line gave me the chills, as I was imagining giant machines in orbit around the system’s star (kinda like a Dyson swarm), or perhaps just passing in between the planet and the star, still at a significant distance from the planet. It took some months for me to realize that the author of the lyrics was probably thinking of surface-based machines, which makes them significantly less giant and cool. The lyrics don’t specify that though.

The Ayreon story

Everything I described so far are minor points to me. What really gets me is what The Source does to the overall Ayreon story. Let’s recap what it looked like before The Source:

The Forever lose their emotions and create the human race to fix themselves. Humanity goes WW3 and blows itself up, despite the Forever helping them to avoid this fate, and perhaps due to the Forever’s meddling to accelerate human evolution. Still, the Forever are able to awaken their race through the memories of the humans.

Of course this is just a very high level description, and there is much more to it than that. The Source changes this. It’s a prequel to 01011001 and reveals that the Forever are human, namely the humans that fled Alpha… Which turns the high level story into:

Humans on Alpha fuck their environment through the use of technology and then build Skynet. Some of them run away to a new planet (the water world they call Y) and re-engineer themselves to live there. They manage to fuck themselves with technology again and decide to create a new human race on earth. Those new humans also fuck themselves with technology.

So to me The Source ruins a lot of the story from the other Ayreon albums. Instead of having this alien race and the humans, each with their own problems, we now have just humans, who manage to fuck themselves over 3 times in a row and win the universe’s biggest tards award. Great. #GrumpyJeroenIsGrumpy

Conclusion

Even with all the grump, I think this is an awesome album. Just don’t expect too much from the story, which is OK, but definitely not as great as the rest of the package. Go buy a copy.

Generic Entity handling code

In this blog post I outline my thinking on sharing code that deals with different types of Entities in your domain. We’ll cover what Entities are, code reuse strategies, pitfalls such as Shotgun Surgery and Anemic Domain Models and finally Bounded Contexts.

Why I wrote this post

I work at Wikimedia Deutschland, where amongst other things, we are working on a software called Wikibase, which is what powers the Wikidata project. We have a dedicated team for this software, called the Wikidata team, which I am not part of. As an outsider that is somewhat familiar with the Wikibase codebase, I came across a writeup of a perceived problem in this codebase and a pair of possible solutions. I happen to disagree with what the actual problem is, and as a consequence also the solutions. Since explaining why I think that takes a lot of general (non-Wikibase specific) explanation, I decided to write a blog post.

DDD Entities

Let’s start with defining what an Entity is. Entities are a tactical Domain Driven Design pattern. They are things that can change over time and are compared by identity rather than by value, unlike Value Objects, which do not have an identity.

Wikibase has objects which are conceptually such Entities, though they are implemented … oddly from a DDD perspective. In the excerpt quoted below, the word entity is, confusingly, not referring to the DDD concept. Instead, the Wikibase domain has a concept called Entity, implemented by an abstract class with the same name, and derived from by specific types of Entities, i.e. Item and Property. Those are the objects that are conceptually DDD Entities, yet diverge from what a DDD Entity looks like.

Entities normally contain domain logic (the lack of this is called an Anemic Domain Model), and don’t have setters. The lack of setters does not mean they are immutable, it’s just that actions are performed through methods in the domain language (see Ubiquitous Language). For instance “confirmBooked()” and “cancel()” instead of “setStatus()”.
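
A rough illustration of the difference, using a hypothetical Booking entity (not code from any of the projects mentioned):

    class Booking {

        private $id;
        private $status = 'new';

        public function __construct( int $id ) {
            $this->id = $id;
        }

        // Domain language instead of setStatus(); invariants can be enforced here.
        public function confirmBooked() {
            if ( $this->status === 'cancelled' ) {
                throw new \RuntimeException( 'Cannot confirm a cancelled booking' );
            }
            $this->status = 'booked';
        }

        public function cancel() {
            $this->status = 'cancelled';
        }

        public function getId(): int {
            return $this->id;
        }
    }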

The perceived problem

What follows is an excerpt from a document aimed at figuring out how to best construct entities in Wikibase:

Some entity types have required fields:

  • Properties require a data type
  • Lexemes require a language and a lexical category (both ItemIds)
  • Forms require a grammatical feature (an ItemId)

The ID field is required by all entities. This is less problematic however, since the ID can be constructed and treated the same way for all kinds of entities. Furthermore, the ID can never change, while other required fields could be modified by an edit (even a property’s data type can be changed using a maintenance script).

The fact that Properties require the data type ID to be provided to the constructor is problematic in the current code, as evidenced in EditEntity::clearEntity as well as in EditEntity::modifyEntity().

Such special case handling will not be possible for entity types defined in extensions.

It is very natural for (DDD) Entities to have required fields. That is not a problem in itself. For examples you can look at our Fundraising software.

So what is the problem really?

Generic vs specific entity handling code

Normally when you have a (DDD) Entity, say a Donation, you also have dedicated code that deals with those Donation objects. If you have another entity, say MembershipApplication, you will have other code that deals with it.

If the code handling Donation and the code handing MembershipApplication is very similar, there might be an opportunity to share things via composition. One should be very careful to not do this for things that happen to be the same but are conceptually different, and might thus change differently in the future. It’s very easy to add a lot of complexity and coupling by extracting small bits of what would otherwise be two sets of simple and easy to maintain code. This is a topic worthy of its own blog post, and indeed, I might publish one titled The Fallacy of DRY in the near future.

This sharing via composition is not really visible “from outside” of the involved services, except for the code that constructs them. If you have a DonationRepository and a MembershipRepository interface, they will look the same whether their implementations share something or not. Repositories might share cross-cutting concerns such as logging. Logging is not something you want to do in your repository implementations themselves, but you can easily create simple logging decorators. A LoggingDonationRepository and a LoggingMembershipRepository could both depend on the same Logger class (or more likely interface), and thus share code via composition. In the end, the DonationRepository still just deals with Donation objects, the MembershipRepository still just deals with MembershipApplication objects, and both remain completely decoupled from each other.
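
A sketch of such a decorator; DonationRepository and Donation stand in for whatever the real domain defines, and the logger is assumed to be a PSR-3 LoggerInterface:

    use Psr\Log\LoggerInterface;

    interface DonationRepository {
        public function storeDonation( Donation $donation );
    }

    class LoggingDonationRepository implements DonationRepository {

        private $repository;
        private $logger;

        public function __construct( DonationRepository $repository, LoggerInterface $logger ) {
            $this->repository = $repository;
            $this->logger = $logger;
        }

        public function storeDonation( Donation $donation ) {
            $this->logger->info( 'Storing a donation' );
            $this->repository->storeDonation( $donation );
        }
    }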

In the Wikibase codebase there is an attempt at code reuse by having services that can deal with all types of Entities. Phrased like this it sounds nice. From the perspective of the user of the service, things are great at first glance. Thing is, those services then are forced to actually deal with all types of Entities, which almost guarantees greater complexity than having dedicated services that focus on a single entity.

If your Donation and MembershipApplication entities both implement Foobarable and you have a FoobarExecution service that operates on Foobarable instances, that is entirely fine. Things get dodgy when your Entities don’t always share the things your service needs, and the service ends up getting instances of object, or perhaps some minimal EntityInterface type.

In those cases the service can add a bunch of “if has method doFoobar, call it with these arguments” logic. Or perhaps you’re checking against an interface instead of a method, though this is by and large the same. This approach leads to Shotgun Surgery. It is particularly bad if you have a general service. If your service is really only about the doFoobar method, then at least you won’t need to poke at it when a new Entity is added to the system that has nothing to do with the Foobar concept. If the service on the other hand needs to fully save something or send an email with a summary of the data, each new Entity type will force you to change your service.

The “if doFoobar exists” approach does not work if you want plugins to your system to be able to use your generic services with their own types of Entities. To enable that, and avoid the Shotgun Surgery, your general service can delegate to specific ones. For instance, you can have an EntityRepository service with a save method that takes an EntityInterface. In its constructor it would take an array of specific repositories, i.e. a DonationRepository and a MembershipRepository. In its save method it would loop through these specific repositories and somehow determine which one to use. Perhaps they would have a canHandle method that takes an EntityInterface, or perhaps EntityInterface has a getType method that returns a string that is also used as a key in the array of specific repositories. Once the right one is found, the EntityInterface instance is handed over to its save method.
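
A sketch of that delegation approach (all type and method names here are illustrative assumptions):

    class DelegatingEntityRepository {

        /**
         * @var array Specific repositories, keyed by the entity type they handle
         */
        private $repositories;

        public function __construct( array $repositoriesByEntityType ) {
            $this->repositories = $repositoriesByEntityType;
        }

        public function save( EntityInterface $entity ) {
            $type = $entity->getType();

            if ( !array_key_exists( $type, $this->repositories ) ) {
                throw new \InvalidArgumentException( "No repository registered for entity type '$type'" );
            }

            $this->repositories[$type]->save( $entity );
        }
    }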

This delegation approach is sane enough from an OO perspective. It does however involve specific repositories, which raises the question of why you are creating a general one in the first place. If there is no compelling reason to create the general one, just stick to the specific ones and save yourself all this unneeded complexity and vagueness.

In Wikibase there is a generic web API endpoint for creating new entities. The users provide a pile of information via JSON or a bunch of parameters, which includes the type of Entity they are trying to create. If you have this type of functionality, you are forced to deal with this in some way, and probably want to go with the delegation approach. To me having such an API endpoint is very questionable, with dedicated endpoints being the simpler solution for everyone involved.

To wrap this up: dedicated entity handling code is much simpler than generic code, making it easier to write, use, understand and modify. Code reuse, where warranted, is possible via composition inside of implementations without changing the interfaces of services. Generic entity handling code is almost always a bad choice.

On top of what I already outlined, there is another big issue you can run into when creating generic entity handling code like is done in Wikibase.

Bounded Contexts

Bounded Contexts are a key strategic concept from Domain Driven Design. They are key in the sense that if you don’t apply them in your project, you cannot effectively apply tactical patterns such as Entities and Value Objects, and are not really doing DDD at all.

“Strategy without tactics is the slowest route to victory. Tactics without strategy are the noise before defeat.” — Sun Tzu

Bounded Contexts allow you to segregate your domain models, ideally having a Bounded Context per subdomain. A detailed explanation and motivation of this pattern is out of scope for this post, but suffice it to say that Bounded Contexts allow for simplification and thus make it easier to write and maintain code. For more information I can recommend Domain-Driven Design Distilled.

In case of Wikibase there are likely a dozen or so relevant subdomains. While I did not do the analysis to create a comprehensive picture of which subdomains there are, which types they have, and which Bounded Contexts would make sense, a few easily stand out.

There is the so-called core Wikibase software, which was created for Wikidata.org, and deals with structured data for Wikipedia. It has two types of Entities (both in the Wikibase and in the DDD sense): Item and Property. Then there is (planned) functionality for Wiktionary, which will be structured dictionary data, and for Wikimedia Commons, which will be structured media data. These are two separate subdomains, and thus each deserve their own Bounded Context. This means having no code and no conceptual dependencies on each other or the existing Big Ball of Mud type “Bounded Context” in the Wikibase core software.

Conclusion

When standard approaches are followed, Entities can easily have required fields and optional fields. Creating generic code that deals with different types of entities is very suspect and can easily lead to great complexity and brittle code, as seen in Wikibase. It is also a road to not separating concepts properly, which is particularly bad when crossing subdomain boundaries.

OOP file_get_contents

I’m happy to announce the immediate availability of FileFetcher 4.0.0.

FileFetcher is a small PHP library that provides an OO way to retrieve the contents of files.
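
The interface at the heart of the library looks roughly like this (simplified from memory; see the library itself for the authoritative version):

    interface FileFetcher {

        /**
         * @throws FileFetchingException
         */
        public function fetchFile( string $fileUrl ): string;
    }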

What’s OO about such an interface? You can inject an implementation of it into a class, which keeps the class ignorant of the implementation details and lets you choose which implementation to provide. Calling file_get_contents does not allow changing the implementation, as it is a procedural/static call making use of global state.

Library number 8234803417 that does this exact thing? Probably not. The philosophy behind this library is to provide a very basic interface (FileFetcher) that while insufficient for plenty of use cases, is ideal for a great many, in particular replacing procedural file_get_contents calls. The provided implementations are to facilitate testing and common generic tasks around the actual file fetching. You are encouraged to create your own core file fetching implementation in your codebase, presumably an adapter to a library that focuses on this task such as Guzzle.

So what is in it then? At its heart, the library provides several trivial implementations of the FileFetcher interface:

  • SimpleFileFetcher: Adapter around file_get_contents
  • InMemoryFileFetcher: Adapter around an array provided to its constructor
  • ThrowingFileFetcher: Throws a FileFetchingException for all calls (added after 4.0)
  • NullFileFetcher: Returns an empty string for all calls (added after 4.0)
  • StubFileFetcher: Returns a stub value for all calls (added after 4.0)

It also provides a number of generic decorators.

Version 4.0.0 brings PHP7 features (scalar type hints \o/) and adds a few extra handy implementations. You can add the library to your composer.json (jeroen/file-fetcher) or look at the documentation on GitHub. You can also read about its inception in 2013.

PHP development with Docker

I’m the kind of dev that dreads configuring webservers and that would rather not have to put up with random ops stuff before being able to get work done. Docker is one of those things I’ve never looked into, cause clearly it’s evil annoying boring evil confusing evil ops stuff. Two of my colleagues just introduced me to a one-line docker command that kind of blew my mind.

Want to run tests for a project but don’t have PHP7 installed? Want to execute a custom Composer script that runs both these tests and the linters without having Composer installed? Don’t want to execute code you are not that familiar with on your machine that contains your private keys, etc? Assuming you have Docker installed, this command is all you need:
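
The command is roughly the following, based on the documented usage of the official Composer Docker image (the exact flags in the original post may differ):

    docker run --rm --interactive --tty --volume $PWD:/app composer composer ci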

This command uses the Composer Docker image, as indicated by the first composer at the end of the command. After that you can specify whatever you want to execute, in this case composer ci, where ci is a custom composer Script. (If you want to know what the Docker image is doing behind the scenes, check its entry point file.)

This works without having PHP or Composer installed, and is very fast after the initial dependencies have been pulled. And each time you execute the command, the environment is destroyed, avoiding state leakage. You can create a composer alias in your .bash_aliases as follows, and then execute composer on your host just as you would do if it was actually installed (and running) there.
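
Such an alias could look like this (an assumption mirroring the command above, not necessarily the exact alias from the post):

    alias composer='docker run --rm --interactive --tty --volume $PWD:/app composer composer'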

Of course you are not limited to running Composer commands; you can also invoke PHPUnit

or indeed any PHP code.

This one liner is not sufficient if you require additional dependencies, such as PHP extensions, databases or webservers. In those cases you probably want to create your own Docker file. Though to run the tests of most PHP libraries you should be good. I’ve now uninstalled my local Composer and PHP.

To get started, install Docker and add your user to the docker group (system restart might be needed afterwards):
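
On a Debian or Ubuntu system that could look something like the following; these commands are an assumption, so adapt them to your distribution:

    sudo apt-get install docker.io
    sudo usermod -aG docker $USER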

Why Every Single Argument of Dan North is Wrong

Alternative title: Dan North, the Straw Man That Put His Head in His Ass.

This blog post is a reply to Dan’s presentation Why Every Element of SOLID is Wrong. It is crammed full with straw man argumentation in which he misinterprets what the SOLID principles are about. After refuting each principle he proposes an alternative, typically a well-accepted non-SOLID principle that does not contradict SOLID. If you are not that familiar with the SOLID principles and cannot spot the bullshit in his presentation, this blog post is for you. The same goes if you enjoy bullshit being pointed out and broken down.

What follows are screenshots of select slides with comments on them underneath.

Dan starts by asking “What is a single responsibility anyway?”. Perhaps he should have figured that out before giving a presentation about how it is wrong.

A short (non-comprehensive) description of the principle: systems change for various different reasons. Perhaps a database expert changes the database schema for performance reasons, perhaps a User Interface person is reorganizing the layout of a web page, perhaps a developer changes business logic. What the Single Responsibility Principle says is that ideally changes for such disparate reasons do not affect the same code. If they did, different people would get in each other’s way. Possibly worse still, if the concerns are mixed together and you want to change some UI code, suddenly you need to deal with, and thus understand, the business logic and database code.

How can we predict what is going to change? Clearly you can’t, and this is simply not needed to follow the Single Responsibility Principle or to get value out of it.

Write simple code… no shit. One of the best ways to write simple code is to separate concerns. You can be needlessly vague about it and simply state “write simple code”. I’m going to label this Dan North’s Pointlessly Vague Principle. Congratulations sir.

The idea behind the Open Closed Principle is not that complicated. To partially quote the first line on the Wikipedia Page (my emphasis):

… such an entity can allow its behaviour to be extended without modifying its source code.

In other words, when you ADD behavior, you should not have to change existing code. This is very nice, since you can add new functionality without having to rewrite old code. Contrast this to shotgun surgery, where to make an addition, you need to modify existing code at various places in the codebase.

In practice, you cannot gain full adherence to this principle, and you will have places where you will need to modify existing code. Full adherence to the principle is not the point. Like with all engineering principles, they are guidelines which live in a complex world of trade offs. Knowing these guidelines is very useful.

Clearly it’s a bad idea to leave code in place that is wrong after a requirement change. That’s not what this principle is about.

Another very informative “simple code is a good thing” slide.

To be honest, I’m not entirely sure what Dan is getting at with his “is-a, has-a” vs “acts-like-a, can-be-used-as-a”. It does make me think of the Interface Segregation Principle, which, coincidentally, is the next principle he misinterprets.

The remainder of this slide is about the “favor composition over inheritance” principle. This is really good advice, which has been well accepted in professional circles for a long time. This principle is about code sharing, which is generally better done via composition than inheritance (the latter creates very strong coupling). In the last big application I wrote there are several hundred classes, and fewer than a handful inherit concrete code. Inheritance has a use which is completely different from code reuse: subtyping and polymorphism. I won’t go into detail about those here, and will just say that this is at the core of what Object Orientation is about, and that even in the application I mentioned, this is used all over, making the Liskov Substitution Principle very relevant.

Here Dan is slamming the principle for being too obvious? Really?

“Design small, role-based classes”. Here Dan changed “interfaces” into “classes”, which results in a line that makes me think of the Single Responsibility Principle. More importantly, there is a misunderstanding about the meaning of the word “interface” here. This principle is about the abstract concept of an interface, not the language construct that you find in some programming languages such as Java and PHP. A class forms an interface. This principle applies to OO languages that do not have an interface keyword, such as Python, and even to those that do not have a class keyword, such as Lua.

If you follow the Interface Segregation Principle and create interfaces designed for specific clients, it becomes much easier to construct or invoke those clients. You won’t have to provide additional dependencies that your client does not actually care about. In addition, if you are doing something with those extra dependencies, you know this client will not be affected.

This is a bit bizarre. The definition Dan provides is good enough, even though it is incomplete, which can be excused by it being a slide. From the slide it’s clear that the Dependency Inversion Principle is about dependencies (who would have guessed) and coupling. The next slide is about how reuse is overrated. As we’ve already established, this is not what the principle is about.

As to the Dependency Inversion Principle leading to DI frameworks that you then depend on… this is like saying that if you eat food, you might eat non-nutritious food such as sand, which is not healthy. The fix is not to reject food altogether, it is to not eat food that is non-nutritious. Remember the application I mentioned? It uses dependency injection all the way, without using any framework or magic. In fact, 95% of the code does not bind to the web-framework used due to adherence to the Dependency Inversion Principle. (Read more about this application)

That attitude explains a lot about the preceding slides.

Yeah, please do write simple code. The SOLID principles and many others can help you with this difficult task. There is a lot of hard-won knowledge in our industry and many problems are well understood. Frivolously rejecting that knowledge with “I know better” is an act of supreme arrogance and ignorance.

I do hope this is the category Dan falls into, because the alternative of purposefully misleading people for personal profit (attention via controversy) rustles my jimmies.

If you’re not familiar with the SOLID principles, I recommend you start by reading their associated Wikipedia pages. If you are like me, it will take you practice to truly understand the principles and their implications and to find out where they break down or should be superseded. Knowing about them and keeping an open mind is already a good start, which will likely lead you to many other interesting principles and practices.

My year in books

This is a short summary of my 2016 reading experience, following my 2015 Year In Books.

Such Stats

I’ve read 38 books, most of which were novels, up from last year’s “44”, which included at least a dozen short stories. These totaled 10779 pages, up from 6xxx in both previous years.

Favorites FTW

Peter Watts

My favorites for 2016 are without a doubt Blindsight and Echopraxia by Peter Watts. I got to know Watts, who is now on my short list of favorite authors, through the short story Malek, which I mentioned in my 2015 year in books. I can’t describe the books in a way that does them sufficient justice, but suffice it to say that they explore a good number of interesting questions and concepts. Exactly what you (or at least I) want from Hard (character) Science Fiction.

You can haz a video with Peter Watts reading a short excerpt from Echopraxia at the Canada Privacy Symposium. You can also get a feel from these books based on their quotes: Blindsight quotes, Echopraxia quotes. Or simply read Malek 🙂 Though Malek does not have OP space vampires that are seriously OP.

My favorite non-fiction book for the year is Thinking Fast and Slow, which is crammed full with information useful to anyone who wants to better understand how their (or others) mind works, where it tends to go off the rails, and what can be done about those cases. The number two slot goes to Domain-Driven Design Distilled, which unlike the red and blue books is actually somewhat readable.

My favorite short story was Crystal Nights by Greg Egan. It explores the creation of something akin to strong AI via simulated evolution, including various ethical and moral questions this raises. It does not start with a tetravalent graph more like diamond than graphite, but you can’t have everything now can you?

Dat Distribution

All books I read, except for 2 short stories, were published in the year 2000 or later.

Lying Lydia <3

Lydia after audiobook hax

For 2016 I set a reading goal of 21 books, to beat Lydia‘s 20. She ended up reading 44, so beat me. However, she used OP audio book hax. I attempted this as well but did not manage to catch up, even though I racked up 11.5 days worth of listening time. This year I will win of course. (Plan B)

Series, Seriously

As you might be able to deduce from the read-in vs published-in chart, I went through a number of series. I read all The Expanse novels after watching the first season of the television series based on them. Similarly I read the Old Man’s War novels, except the last one, which I have not finished yet. Both series are fun, though they might not live up to the expectations of fellow Hard Science Fiction master race members, as they are geared more towards plain SF plebs. Finally I started on the Revelation Space books by Alastair Reynolds after reading House of Suns, which has been on my reading list since forever. It was also high time, since clearly I need to have read all the books from authors that have been in a Google hangout with Special Circumstances agent Iain M. Banks.

Simple is not easy

Simplicity is possibly the single most important thing on the technical side of software development. It is crucial to keep development costs down and external quality high. This blog post is about why simplicity is not the same thing as easiness, and common misconceptions around these terms.

Simple is not easy

Simple is the opposite of complex. Both are a measure of complexity, which arises from intertwining things such as concepts and responsibilities. Complexity is objective, and certain aspects of it, such as Cyclomatic Complexity, can be measured with many code quality tools.

Easy is the opposite of hard. Both are a measure of effort, which unlike complexity, is subjective and highly dependent on the context. For instance, it can be quite hard to rename a method in a large codebase if you do not have a tool that allows doing so safely. Similarly, it can be difficult to understand an OO project if you are not familiar with OO.

Achieving simplicity is hard

I’m sorry I wrote you such a long letter; I didn’t have time to write a short one.

Blaise Pascal

Finding simple solutions, or brief ways to express something clearly, is harder than finding something that works but is more complex. In other words, achieving simplicity is hard. This is unfortunate, since dealing with complexity is also hard.

In recent decades the cost of software maintenance has become much greater than the cost of its creation, so it makes sense to make maintenance as easy as we can. This means avoiding as much complexity as we can during the creation of the software, which is a hard task. The cost of the complexity does not suddenly appear once the software goes into an official maintenance phase, it is there on day 2, when you need to deal with code from day 1.

Good design requires thought

Questions about whether design is necessary or affordable are quite beside the point: design is inevitable. The alternative to good design is bad design, not no design at all.

— Vaughn Vernon in Domain-Driven Design Distilled

Some people in the field conflate simple and easy in a particularly unfortunate manner. They reason that if you need to think a lot about how to create a design, it will be hard to understand the design. Clearly, thinking a lot about a design does not guarantee that it is good and minimizes complexity. You can do a good job and create something simple or you can overengineer. There is however one guarantee that can be made based on the effort spent: for non-trivial problems, if little effort was spent (by going for the easy approach), the solution is going to be more complex than it could have been.

One high-profile case of such conflation can be found in the principles behind the Agile Manifesto. While I don’t fully agree with some of the other principles, this is the only one I strongly disagree with (unless you remove the middle part). Yay Software Craftsmanship manifesto.

Simplicity–the art of maximizing the amount of work not done–is essential

Principles behind the Agile Manifesto

Similarly we should be careful to not confuse the ease of understanding a system with the ease of understanding how or why it was created the way it was. The latter, while still easier than the actual task of creating a simple solution, is still going to be harder than working with said simple solution, especially for those that lack the skills used in its creation.

Again, I found a relatively high-profile example of such confusion:

If the implementation is hard to explain, it’s a bad idea. If the implementation is easy to explain, it may be a good idea.

The Zen of Python

I think this is just wrong.

You can throw all books in a library onto a big pile and then claim it’s easy to explain where a particular book is – in the pile – though actually finding the book is a bigger challenge. It’s true that you need more skills to use a well-organized library effectively than you need to go through a pile of books randomly. You need to know the alphabet, be familiar with the concept of genres, etc. Clearly an organized library is easier to deal with than our pile of books for anyone that has those skills.

It is also true that sometimes it does not make sense to invest in the skill that allows working more effectively, and that sometimes you simply cannot find people with the desired skills. This is where the real bottleneck is: learning. Most of the time these investments are worth it, as they allow you to work both faster and better from that point on.

See also

In my reply to the Big Ball of Mud paper I also talk about how achieving simplicity requires effort.

The main source of inspiration that led me to this blog post is Rich Hickey’s 2012 Rails Conf keynote, where he starts by differentiating simple and easy. If you don’t know who Rich Hickey is (he created Clojure), go watch all his talks on YouTube now; they are well worth the time. (I don’t agree with everything he says but it tends to be interesting regardless.) You can start with this keynote, which goes into more detail than this blog post and adds a bunch of extra goodies on top. <3 Rich

Following the reasoning in this blog post, you cannot trade software quality for lower cost. You can read more about this in the Tradable Quality Hypothesis and Design Stamina Hypothesis articles.

There is another blog post titled Simple is not easy, which as far as I can tell, differentiates the terms without regard to software development.