Clean Architecture + Bounded Contexts diagram

I’m happy to release two Clean Architecture + Bounded Contexts diagrams into the public domain (CC0 1.0).

I created these diagrams for Wikimedia Deutschland with the help of Jan Dittrich, Charlie Kritschmar and Hanna Petruschat. They represent the architecture of our fundraising codebase. I explain the rules of this architecture in my post Clean Architecture + Bounded Contexts. The new diagrams are based on the ones I published two years ago in my Clean Architecture Diagram post.

Diagram 1: Clean Architecture + DDD, generic version. Click to enlarge. Link: SVG version

Diagram 2: Clean Architecture + DDD, fundraising version. Click to enlarge. Link: SVG version


Base Libraries Should be Stable

In this post I go over the principles that govern package (library) design and one specific issue I have come across several times.

Robert C. Martin has proposed six Package Principles. Personally I find the (linked) description on Wikipedia rather confusing, especially if you do not already understand the principles. Here is my take on library design, using short and simple points:

Libraries should

  • cover a single responsibility
  • provide a clean and well segregated interface to their users
  • hide/encapsulate implementation details
  • not have cyclic dependencies
  • not depend on less stable libraries

The rest of this post focuses on the importance of the last point.

Stable Base Libraries

I am assuming usage of a package manager such as Composer (PHP) or NPM. This means that each library defines its dependencies (if any) and the version ranges of those that it is compatible with. Libraries have releases and these follow semver. It also means that libraries are ultimately consumed by applications that define their dependencies in the same way.

Consider the scenario where you have various applications that all consume a base library.

If this library contains a lot of concrete code, it will not be very stable, and from time to time a major release will be needed or desired. A major release is one that can contain breaking changes and therefore does not automatically get used by consumers of the library. These consumers instead need to manually update the version range of the library they are compatible with and possibly adjust their code to work with the breaking changes.

In this scenario that is not a problem. Suppose the library is at version 1.2 and all applications use version 1.x. If breaking changes are made in the library, these will need to be released as version 2.0. Thanks to the versioning system, the consuming applications can upgrade at their leisure. There is of course some cost to making a major release: the consumers still need to spend some time on the upgrade, especially if they are affected by the breaking changes. Still, this is typically easy to deal with and a normal part of the development process.
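In Composer terms, the consuming application’s constraint might look like this (the package names are made up for illustration):

```json
{
    "require": {
        "acme/base-library": "^1.2"
    }
}
```

The ^1.2 constraint matches any 1.x release from 1.2 upwards, but not 2.0. To adopt the new major version, the consumer changes the constraint to ^2.0 and adjusts any affected code.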

Things change drastically when we have an unstable library that is used by other libraries, even if those libraries themselves are unstable.

In such a scenario, making breaking changes to the base library is very costly. Let’s assume we make a breaking change to the base library and that we want to use the new version in our application. What do we actually need to do to get there?

  • Make a release of the base library (Library A)
  • Update library B to specify it is compatible with the new A and make a new release of B
  • Update library C to specify it is compatible with the new A and make a new release of C
  • Update library D to specify it is compatible with the new A and make a new release of D
  • Update our application to specify it is compatible with the new version of A

In other words, we need to make a new release of EVERY library that uses our base library and that is also used by our application. This can be very painful, and it is a big waste of time if these intermediate libraries were not actually affected by the breaking change to begin with. This means that it is costly, and generally a bad idea, to have libraries depend on another library that is not very stable.

Stability can be achieved in several ways. If the package is very abstract, and we assume good design, it will be stable. Think of a package providing a logging interface, such as psr/log. Stability can also be achieved through a combination of following the package principles and taking care to avoid breaking changes.
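As an illustration of why abstract packages stay stable, here is a consumer binding only to the psr/log interface (the SignupService class is a made-up example):

```php
use Psr\Log\LoggerInterface;

class SignupService {

	private $logger;

	// Depending on the abstract psr/log interface keeps this class
	// decoupled from whichever concrete logging library gets used.
	public function __construct( LoggerInterface $logger ) {
		$this->logger = $logger;
	}

	public function signUp( string $email ): void {
		// ... the actual signup logic would go here ...
		$this->logger->info( 'New signup: {email}', [ 'email' => $email ] );
	}
}
```

Since the interface package has almost no reason to change, everything that depends on it can evolve at its own pace.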

Conclusion

Keep the package principles in mind, and avoid depending on unstable libraries in your own libraries where possible.

See also

Clean Architecture + Bounded Contexts

In this follow-up to Implementing the Clean Architecture I introduce you to a combination of The Clean Architecture and the strategic DDD pattern known as Bounded Contexts.

At Wikimedia Deutschland we use this combination of The Clean Architecture and Bounded Contexts for our fundraising applications. In this post I describe the structure we have and the architectural rules we follow in the abstract. For the story on how we got to this point and a more concrete description, see my post Bounded Contexts in the Wikimedia Fundraising Software. In that post and at the end of this one I link you to a real-world codebase that follows the abstract rules described in this post.

If you are not yet familiar with The Clean Architecture, please first read Implementing the Clean Architecture.

Clean Architecture + Bounded Contexts

Diagram by Jeroen De Dauw, Charlie Kritschmar, Jan Dittrich and Hanna Petruschat. Click image to enlarge

Diagram depicting Clean Architecture + Bounded Contexts

In the top layer of the diagram we have applications. These can be web applications, they can be console applications, they can be monoliths, they can be microservices, etc. Each application has presentation code which in bigger applications tends to reside in a decoupled presentation layer using patterns such as presenters. All applications also somehow construct the dependency graph they need, perhaps using a Dependency Injection Container or set of factories. Often this involves reading configuration from somewhere. The applications contain ALL framework binding, hence they are the place where you will find the Controllers if you are using a typical web framework.
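As a sketch of what such construction code can look like (the names are illustrative, not the actual fundraising code):

```php
class TopLevelFactory {

	private $config;

	public function __construct( array $config ) {
		$this->config = $config;
	}

	public function newCancelDonationUseCase(): CancelDonationUseCase {
		// Knowledge of which concrete implementations get used lives
		// here, not in the Bounded Context or the presentation code.
		return new CancelDonationUseCase( $this->newDonationRepository() );
	}

	private function newDonationRepository(): DonationRepository {
		return new MySqlDonationRepository( $this->config['db'] );
	}
}
```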

Since the applications are in the top layer, and dependencies can only go down, no code outside of the applications is allowed to depend on code in the applications. That means there is 0 binding to mechanisms such as frameworks and presentation code outside of the applications.

In the second layer we have the Bounded Contexts, ideally one Bounded Context per subdomain. At the core of each BC we have the Domain Model and Domain Services, containing the business logic part of the subdomain. Dependencies can only point inwards, so the Domain Model, which is at the center, cannot depend on anything further out. Around the Domain Model are the Domain Services. These include interfaces for persistence services such as Repositories. The UseCases form the final ring. They can use both the Domain Model and the Domain Services. They also form a boundary around the two, meaning that no code outside of the Bounded Context is allowed to talk to the Domain Model or Domain Services.

The Bounded Contexts include their own Persistence Layer. The Persistence Layer can use a relational database, files on the file system, a remote web API, a combination of these, etc. It has implementations of domain services such as Repositories which are used by the UseCases. These implementations are the only thing that is allowed to talk to and know about the low-level aspects of the Persistence Layer. The only things that can use these service implementations are other Domain Services and the UseCases.

The UseCases, including their Request Models and Response Models, form the public interface of the Bounded Context. This means that there is 0 binding to the persistence mechanisms outside of the Bounded Context. It also means that the code responsible for the domain logic cannot be directly accessed elsewhere, such as in the presentation layer of an application.
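To make this concrete, the public interface of a UseCase might look like this (a minimal sketch with illustrative names; the real fundraising UseCases are richer):

```php
class CancelDonationUseCase {

	private $repository;

	public function __construct( DonationRepository $repository ) {
		$this->repository = $repository;
	}

	// Callers hand in a Request Model and get back a Response Model.
	// They never see the Donation entity or the repository interface.
	public function cancelDonation( CancelDonationRequest $request ): CancelDonationResponse {
		$donation = $this->repository->getDonationById( $request->getDonationId() );
		$donation->cancel();
		$this->repository->storeDonation( $donation );

		return CancelDonationResponse::newSuccessResponse();
	}
}
```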

The applications and Bounded Contexts contain all the domain-specific code. This code can make use of libraries and of course the runtime (i.e. PHP) itself.

As examples of Bounded Contexts following this approach, see the Donation Context and Membership Context. For an application following this architecture, see the FundraisingFrontend, which uses both the Donation Context and Membership Context. Both these contexts are also used by another application, the code of which is sadly not currently public. You can also read the stories of how we rewrote the FundraisingFrontend to use the Clean Architecture and how we refactored towards Bounded Contexts.

Further reading

If you are not yet familiar with Bounded Contexts or how to design them well, I recommend reading Domain-Driven Design Distilled.

Bounded Contexts in the Wikimedia Fundraising Software

In this follow-up to rewriting the Wikimedia Deutschland fundraising I tell the story of how we reorganized our codebases along the lines of the DDD strategic pattern Bounded Contexts.

In 2016 the FUN team at Wikimedia Deutschland rewrote the Wikimedia Deutschland fundraising application. This new codebase uses The Clean Architecture and near the end of the rewrite got reorganized partially towards Bounded Contexts. After adding many new features to the application in 2017, we reorganized further towards Bounded Contexts, this time also including our other fundraising applications. In this post I explain the questions we had and which decisions we ended up making. I also link to the relevant code so you have real world examples of using both The Clean Architecture and Bounded Contexts.

This post is a good primer to my Clean Architecture + Bounded Contexts post, which describes the structure and architecture rules we now have in detail.

Our initial rewrite

Back in 2014, we had two codebases, each in their own git repository and not using code from the other. The first was a user-facing PHP web-app that allows people to make donations and apply for memberships, called FundraisingFrontend. The “frontend” here stands for “user facing”. This is the application that we rewrote in 2016. The second codebase mainly contains the Fundraising Operations Center, a PHP web-app used by the fundraising staff to moderate and analyze donations and membership applications. This second codebase also contains some scripts to export data for communication with third-party systems.

Both the FundraisingFrontend and Fundraising Operations Center (FOC) used the same MySQL database. They each accessed this database in their own way, using PDO, Doctrine or SQL directly. In 2015 we created a Fundraising Store component based on Doctrine, to be used by both applications. Because we rewrote the FundraisingFrontend, all data access code there now uses this component. The FOC codebase we have been gradually migrating away from random SQL or PDO in its logic to also using this component, a process that is still ongoing.

Hence as we started our rewrite of the FundraisingFrontend in 2016, we had 3 sets of code: FundraisingFrontend, FOC and Fundraising Store. After our rewrite the picture still looked largely the same. What we had done was turn FundraisingFrontend from a big ball of mud into a well-designed application with proper architecture and separation of subdomains using Bounded Contexts. So while there were multiple components within the FundraisingFrontend, it was still all in one codebase / git repository. (Except for several small PHP libraries, but those are not relevant here.)

Need for reorganization

During 2017 we did a bunch of work on the FOC application and the associated export script, all the while doing gradual refactoring towards sanity. While refactoring we found ourselves implementing things such as a DonationRepository, things that already existed in the Bounded Contexts part of the FundraisingFrontend codebase. We realized that at least the subset of the FOC application that does moderation is part of the same subdomains that we created those Bounded Contexts for.

The inability to use these Bounded Contexts in the FOC app was forcing us to do double work by creating a second implementation of perfectly good code we already had. This also forced us to pay a lot of extra attention to avoid inconsistencies. For instance, we ran into issues with the FOC app persisting Donations in a state deemed invalid by the donations Bounded Context.

To address these issues we decided to share the Bounded Contexts that we created during the 2016 rewrite with all relevant applications.

Sharing our Bounded Contexts

We considered several approaches to sharing our Bounded Contexts between our applications. The first approach we considered was having a dedicated git repository per BC. To do this we would need to answer what exactly would go into this repository and what would stay behind. We were concerned that for many changes we’d need to touch both the BC git repo and the application git repo, which requires more work and coordination than making a change in a single repository. This led us to consider options such as putting all BCs together into a single repo, to minimize this cost, or simply putting all code (BCs and applications) into one huge repository.

We ended up going with one repo per BC, though we started with a single BC to see how well this approach would work before committing to it. With this approach we still faced the question of what exactly should go into the BC repo. Should that just be the UseCases and their dependencies, or also the presentation layer code that uses the UseCases? We decided to leave the presentation layer code in the application repositories, to avoid extra (heavy) dependencies in the BC repo and because the UseCases provide a nice interface to the BC. Following this approach it is easy to tell if code belongs in the BC repo or not: if it binds to the domain model, it belongs in the BC repo.

These are the Bounded Context git repositories:

The BCs still use the FundraisingStore component in their data access services. Since this is not visible from outside of the BC, we can easily refactor towards removing the FundraisingStore and having the data access mechanism of a BC be truly private to it (as it is supposed to be for BCs).

The new BC repos allow us to continue gradual refactoring of the FOC and swap out legacy code with UseCases from the BCs. We can do so without duplicating what we already did and, thanks to the proper encapsulation, also without running into consistency issues.

Clean Architecture + Bounded Contexts

During our initial rewrite we created a diagram to represent the flavor of The Clean Architecture that we were using. For an updated version that depicts the structure of The Clean Architecture + Bounded Contexts and describes the rules of this architecture in detail, see my blog post The Clean Architecture + Bounded Contexts.

Diagram depicting Clean Architecture + Bounded Contexts

Clean Architecture: UseCase tests

When creating an application that follows The Clean Architecture you end up with a number of UseCases that hold your application logic. In this blog post I outline a testing pattern for effectively testing these UseCases and avoiding common pitfalls.

Testing UseCases

A UseCase contains the application logic for a single “action” that your system supports. For instance “cancel a membership”. This application logic interacts with the domain and various services. These services and the domain should have their own unit and integration tests. Each UseCase gets used in one or more applications, where it gets invoked from inside the presentation layer. Typically you want to have a few integration or edge-to-edge tests that cover this invocation. In this post I look at how to test the application logic of the UseCase itself.

UseCases tend to have “many” collaborators. I can’t recall any that had fewer than 3. For the typical UseCase the number is likely closer to 6 or 7, with more collaborators being possible even when the design is good. That means constructing a UseCase takes some work: you need to provide working instances of all the collaborators.

Integration Testing

One way to deal with this is to write integration tests for your UseCases. Simply get an instance of the UseCase from your Top Level Factory or Dependency Injection Container.

This approach often requires you to mutate the factory or DIC. Want to test that an exception from the persistence service gets handled properly? You’ll need to use some test double instead of the real service, or perhaps mutate the real service in some way. Want to verify a mail got sent? You definitely want to use a Spy here instead of the real service. Mutability comes with a cost, so it is better avoided.

A second issue with using real collaborators is that your tests get slow due to real persistence usage. Even using an in-memory SQLite database (that needs initialization) instead of a simple in-memory fake repository makes for a speed difference of easily two orders of magnitude.

Unit Testing

While there might be some cases where integration tests make sense, normally it is better to write unit tests for UseCases. This means having test doubles for all collaborators. Which leads us to the question of how to best inject these test doubles into our UseCases.

As an example I will use the CancelMembershipApplicationUseCase of the Wikimedia Deutschland fundraising application.

This UseCase uses 3 collaborators: an authorization service, a repository (persistence service) and a mailing service. First it checks with the authorizer whether the operation is allowed, then it interacts with the persistence service, and finally, if all went well, it uses the mailing service to send a confirmation email. Our unit test should cover all this behavior and needs to inject test doubles for the 3 collaborators.

The most obvious approach is to construct the UseCase in each test method.
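Something like this (a sketch; the test double classes and helper methods are approximations of the real test code):

```php
public function testGivenValidRequest_cancellationSucceeds(): void {
	$useCase = new CancelMembershipApplicationUseCase(
		new SucceedingAuthorizer(),
		$this->newRepositoryWithCancellableApplication(),
		new MailerSpy()
	);

	$response = $useCase->cancelApplication( new CancellationRequest( 1 ) );

	$this->assertTrue( $response->isSuccess() );
}

public function testGivenValidRequest_confirmationEmailGetsSent(): void {
	$mailerSpy = new MailerSpy();

	$useCase = new CancelMembershipApplicationUseCase(
		new SucceedingAuthorizer(),
		$this->newRepositoryWithCancellableApplication(),
		$mailerSpy
	);

	$useCase->cancelApplication( new CancellationRequest( 1 ) );

	$mailerSpy->assertWasCalledOnce();
}
```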

Note how both these test methods use the same test doubles. This is not always the case, for instance when testing authorization failure, the test double for the authorizer service will differ, and when testing persistence failure, the test double for the persistence service will differ.

Normally a test function will only change a single test double.

UseCases tend to have, on average, two or more behaviors (and thus tests) per collaborator. That means for most UseCases you will be repeating the construction of the UseCase in a dozen or more test functions. That is a problem. Ask yourself why.

If the answer you came up with was DRY then think again and read my blog post on DRY 😉 The primary issue is that you couple each of those test methods to the list of collaborators. So when the constructor signature of your UseCase changes, you will need to do Shotgun Surgery and update all test functions, even those that have nothing to do with the changed collaborator. A second issue is that you pollute the test methods with irrelevant details, making them harder to read.

Default Test Doubles Pattern

The pattern is demonstrated using PHP + PHPUnit and will need some adaptation when using a testing framework that does not work with a class-based model like that of PHPUnit.

The coupling to the constructor signature and resulting Shotgun Surgery can be avoided by having a default instance of the UseCase filled with the right test doubles. This can be done by having a newUseCase method that constructs the UseCase and returns it. A way to change specific collaborators is also needed (i.e. a FailingAuthorizer to test handling of failing authorization).

Making the UseCase itself mutable is a big no-no. Adding optional parameters to the newUseCase method works in languages that have named parameters. Since PHP does not have named parameters, another solution is needed.

An alternative approach to getting modified collaborators into the newUseCase method is using fields. This is less nice than named parameters, as it introduces mutable state on the level of the test class. Still, since in PHP this approach gives us named fields and is understood by tools, it is better than either using a positional list of optional arguments or emulating named arguments with an associative array (key-value map).

The fields can be set in the setUp method, which gets called by PHPUnit before the test methods. For each test method PHPUnit instantiates the test class, then calls setUp, and then calls the test method.

With this field-based approach individual test methods can modify a specific collaborator by writing to the field before calling newUseCase.
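Putting the pattern together, the test class looks roughly like this (a sketch; the double classes and helper methods are approximations):

```php
class CancelMembershipApplicationUseCaseTest extends \PHPUnit\Framework\TestCase {

	private $authorizer;
	private $repository;
	private $mailer;

	protected function setUp(): void {
		// Default collaborators that do not cause any failures.
		$this->authorizer = new SucceedingAuthorizer();
		$this->repository = new InMemoryApplicationRepository();
		$this->repository->storeApplication( $this->newCancellableApplication() );
		$this->mailer = new MailerSpy();
	}

	private function newUseCase(): CancelMembershipApplicationUseCase {
		return new CancelMembershipApplicationUseCase(
			$this->authorizer,
			$this->repository,
			$this->mailer
		);
	}

	public function testWhenAuthorizationFails_cancellationFails(): void {
		// Only the collaborator this test cares about gets replaced.
		$this->authorizer = new FailingAuthorizer();

		$response = $this->newUseCase()->cancelApplication( new CancellationRequest( 1 ) );

		$this->assertFalse( $response->isSuccess() );
	}
}
```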

The choice of default collaborators is important. To minimize binding in the test functions, the default collaborators should not cause any failures. This is the case both when using the field-based approach and when using optional named parameters.

If the authorization service failed by default, most test methods would need to modify it, even if they have nothing to do with authorization. And it is not always self-evident they need to modify the unrelated collaborator. Imagine the default authorization service indeed fails and that the testWhenSaveFails_cancellationFails test method forgets to modify it. This test method would end up passing even if the behavior it tests is broken, since the UseCase will return the expected failure result even before getting to the point where it saves something.

This is why inside of the setUp function the example creates a “cancellable application” and puts it inside an in-memory test double of the repository.

I chose the CancelMembershipApplication UseCase as an example because it is short and easy to understand. For most UseCases it is even more important to avoid the constructor signature coupling as this issue becomes more severe with size. And no matter how big or small the UseCase is, you benefit from not polluting your tests with unrelated setup details.

You can view the whole CancelMembershipApplicationUseCase and CancelMembershipApplicationUseCaseTest.


Guidelines for New Software Projects

In this blog post I share the Guidelines for New Software Projects that we use at Wikimedia Deutschland.

I wrote down these guidelines recently after a third party that was contracted by Wikimedia Deutschland delivered software that was problematic in several ways. The department contracting this third party was not the software development department, and since we did not have guidelines written down anywhere, both this department and the third party lacked information they needed to do a good job. The guidelines are a short-and-to-the-point 2-page document that can be shared with third parties and serves as a starting point when making choices for internal projects.

If you do not have such guidelines at your organization, you can copy these, though you will of course need to update the “Widespread Knowledge at Our Department” section. If you have suggestions for improvements or interesting differences in your own guidelines, leave a comment!


Domain of Applicability

These guidelines are for non-trivial software that will need to be maintained or developed further. They do not apply to throw-away scripts (if indeed they will be thrown away) or trivial code such as a website with a few static web pages. If on the other hand the software needs to be maintained, then these guidelines are relevant to the degree that the software is complex and hard to replace.

Choice of Stack

The stack includes programming language, frameworks, libraries, databases and tools that are otherwise required to run or develop the software.

We want

  • It to be easy to hire people who can work with the stack
  • It to be easy for our software engineering department to work with the stack
  • The stack to continue to be supported and to develop along with the rest of the ecosystem

Therefore

  • Solutions (i.e. frameworks, libraries) known by many developers are preferred over obscure ones
  • Custom/proprietary solutions, where mature and popular alternatives are available, are undesired
  • Solutions known by our developers are preferred
  • Solutions with many contributors are preferred over those with few
  • Solutions with active development are preferred over inactive ones
  • More recent versions of the solutions are preferred, especially over unsupported ones
  • Introducing new dependencies requires justification for the maintenance cost they add
  • Solutions that are well designed (see below) are preferred over those that are not

Widespread Knowledge at Our Department

  • Languages: PHP and JavaScript
  • Databases: MySQL and SQLite
  • Frameworks: Symfony Components and Silex (discontinued)
  • Testing tools: PHPUnit, QUnit, Selenium/Mocha
  • Virtualization tools: Docker, Vagrant

Architecture and Code Quality

We want

  • It to be easy (and thus cheap) to maintain and further develop the project

Therefore

  • Decoupled architecture is preferred, including
    • Decoupling from the framework, especially if the framework has issues (see above) and thus forms a greater-than-usual liability.
    • Separation of persistence from domain and application logic
    • Separation of presentation from domain and application logic
  • The code should adhere to the SOLID principles, not be STUPID, and otherwise be clean on a low level.
  • The application logic should be tested with tests of the right type (i.e. Unit, Integration, Edge-to-Edge) that are themselves well written and not laden with unneeded coupling.
  • Setup of the project for development should be automated, preferably using virtualization tools such as Docker. After this automated setup developers should be able to easily run the tests and interact with the application.
  • If the domain is non-trivial, usage of Domain-driven design (including the strategic patterns) is preferred

Collaboration Best Practices

During 2016 and 2017 I worked in a small three-person dev team at Wikimedia Deutschland, aptly named the FUN team. This is the team that rewrote our fundraising application and implemented the Clean Architecture. Shortly after we started working on new features we noticed room for improvement in many areas of our collaboration. We talked about these and created a Best Practices document, the content of which I share in this blog post.

We created this document in October 2016, and it focused on the issues we were having at the time or things we were otherwise concerned about. Hence it is by no means a comprehensive list, and it might contain things not applicable to other teams due to culture/value differences or different modes of working. That said, I think it can serve as a source of inspiration for other teams.

Make sure others can pick up a task where it was left

  • Make small commits
  • Publish code quickly and don’t go home with unpublished code
  • When partially finishing a task, describe what is done and still needs doing
  • Link Phabricator on Pull Requests and Pull Requests on Phabricator

Make sure others can (start) work(ing) on tasks

  • Try to describe new tasks in such a way others can work on them without first needing to inquire about details
  • Respond to emails and comments on tasks (at least) at the start and end of your day
  • Go through the review queue (at least) at the start and end of your day
  • Indicate if a task was started or is being worked on, so everyone can pick up any task without first having to check that it is not already being worked on. At least pull the task into the “doing” column of the board; better yet, assign yourself.

Make sure others know what is going on

  • Discuss introduction of new libraries or tools and creation of new projects
  • Only reviewed code gets deployed or used as critical infrastructure
  • If review does not happen, insist on it rather than adding to big branches
  • Discuss (or Pair Program) big decisions or tasks before investing a lot of time in them

Shared commitment

  • Try working on the highest priority tasks (including any support work such as code review of work others are doing)
  • Actively look for and work on problems and blockers others are running into
  • Follow-up commits containing suggested changes are often nicer than comments

PHP project template

Want to start a new PHP project? Perhaps yet another library you are creating? Tired of doing the same lame groundwork for the 5th time this month? Want to start a code kata and not lose time on generic setup work? I got just the project for you!

I’ve created a PHP project template that you can fork or copy to get started quickly. It contains setup for testing, linting and CI, and no actual PHP code. A detailed list of the project template’s contents:

  • Ready-to-go PHPUnit (configuration and working bootstrap)
  • Ready-to-go PHPCS
  • Docker environment with PHP 7.2 and Composer (so you do not need to have PHP or Composer installed!)
  • Tests and style checks runnable in the Docker environment with simple make commands
  • TravisCI ready
  • Code coverage creation on TravisCI and uploading to ScrutinizerCI (optional)
  • Coverage tag validation
  • Stub production and test classes for ultra-quick start (ideal when doing a kata)
  • COPYING and .gitignore files
  • README with instructions on how to run the tests

Getting started

  • Copy the code or fork the repository
  • If you do not want to use the MediaWiki coding style, remove mediawiki/mediawiki-codesniffer from composer.json
  • If you want to support older PHP versions, update composer.json and remove new PHP features from the stub PHP files
  • If the code is not a kata or quick experiment, update the PHP namespaces and the README
  • Start writing code!
  • If you want TravisCI and/or ScrutinizerCI integration you will need to log in to their respective websites
  • Optionally update the README

You can find the project template on GitHub.

Special thanks to weise, gbirke and KaiNissen for their contributions to the projects this template was extracted from.

My year in books

This is a short summary of my 2017 reading experience, following my 2016 Year In Books and my 2015 Year In Books.

Such Stats

I read 27 books, totaling just over 9000 pages. 13 of these were non-fiction, 6 were hard science fiction and 8 were pleb science fiction.

In 2017 I made an effort to rate books properly using the Goodreads scale (did not like it, it was OK, liked it, really liked it, it was amazing), hence the books are more distributed rating-wise than in the previous years, where most got 4 stars.

Non-fiction

While I only finished it shortly after New Year, and it is thus not strictly a 2017 book even though I read 90% of it in 2017, my favorite of the year is The Righteous Mind: Why Good People are Divided by Politics and Religion. A few months ago I saw an interview with Jonathan Haidt, the author of this book. I was so impressed that I went looking for books written by him. As I was about to add his most recent work (this book) to my to-read list on Goodreads, I realized that it was already on there, added a few months before. The book is about social and moral psychology and taught me many useful models of how people behave, where to look for one’s blind spots, how to avoid polarization and how to understand many aspects of the political landscape. A highly recommended read!

Another non-fiction book that I found to be full of useful models that deepened my understanding of how the world works is The Psychopath Code: Cracking The Predators That Stalk Us. Since about 4% of people are psychopaths and thus systematically and mercilessly exploit others (while being very good at covering their tracks), you are almost guaranteed to significantly interact with some over the course of your life. Hence understanding how to detect them and prevent them from feeding on you or those you care about is an important skill. You can check out my highlights from the book to get a quick idea of the contents.

Fiction

Incandescence

After having read a number of more mainstream Science Fiction books that do not emphasize the science or grand ideas the way proper Hard Science Fiction does, I picked up Incandescence by Greg Egan, author of one of my 2015 favorite books. As expected from Greg Egan, the focus definitely is on the science and the ideas. The story is set in The Amalgam galaxy, which I was already familiar with via a number of short stories.

Most of the book deals with a more primitive civilization gradually discovering physics from scratch, both through observation and theorization, eventually creating their own version of General Relativity. Following this is made extra challenging by the individuals in this civilization using their own terms for various concepts, such as the 6 different spatial directions. You can see this is not your usual SF book from this… different trailer that Greg made.

What is so great about this book is that Greg does not explain what is going on. He gradually drops clues that allow you to, piece by piece, get a better understanding of the situation and of how the story of the primitive civilization fits the wider context. For instance, you find out what “the Incandescence” is. At least you will if you are familiar with the relevant “space stuff”. While in this case it is hard to miss if you know the “space stuff”, the hints are never really confirmed. This is also true for perhaps the most important hinted-at thing. I had a “Holy shit! Could it be… this is epic” moment when I made that connection.

This is not a book I would recommend to people new to Hard SF, nor the best Egan book to start with. In any case, before reading this one, read some of the stories of The Amalgam.

Diaspora

Since I liked Incandescence so much I decided to finally give Diaspora a go. Diaspora had been on my to-read list for years, and I never got to it because of its low rating on Goodreads. (Rant: paying so much attention to the rating was perhaps not so smart of me, since for non-mainstream works the ratings are not very telling. They mainly seem to be a combination of what average people expect and what is popular, with less correlation to quality, and a punishing of material that is too sophisticated for the uninitiated or less intelligent.)

I loved the first chapter of Diaspora, “Orphanogenesis”, and cried some tears of joy at its conclusion.

The final chapters are epic. Spoiler ahead. How can a structure spanning 200 trillion universes and encoding the thoughts of an entire civilization be anything but epic? This beats the Xeelee ring in scale. The choices of some of the characters near the end make little sense to me, though it is still a great book overall, and perhaps the not-making-sense part makes sense if one considers how far removed these characters are from traditional humans.

Introduction to Iterators and Generators in PHP

In this post I demonstrate an effective way to create iterators and generators in PHP and provide an example of a scenario in which using them makes sense.

Generators have been around since PHP 5.5, and iterators have been around since the Planck epoch. Even so, a lot of PHP developers do not know how to use them well and cannot recognize situations in which they are helpful. In this blog post I share insights I have gained over the years that, when shared, always got an interested response from fellow developers. The post goes beyond the basics, provides a real-world example, and includes a few tips and tricks. To not leave out those unfamiliar with iterators, the post starts with the “What are Iterators” section, which you can safely skip if you can already answer that question.

What are Iterators

PHP has an Iterator interface that you can implement to represent a collection. You can loop over an instance of an Iterator just like you can loop over an array:
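For example, using PHP’s built-in ArrayIterator:

```php
$texts = new ArrayIterator( [ 'first text', 'second text' ] );

// Works exactly like looping over an array.
foreach ( $texts as $text ) {
	echo $text, "\n";
}
```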

Why would you bother implementing your own Iterator rather than just using an array? Let’s look at an example.

Imagine you have a directory with a bunch of text files. One of the files contains an ASCII NyanCat (~=[,,_,,]:3). It is the task of our code to find which file the NyanCat is hiding in.

We can get all the files by doing glob( $path . '*.txt' ) and we can get the contents of a file with file_get_contents. We could just have a foreach going over the glob result that does the file_get_contents. Luckily we realize this would violate separation of concerns and make the “does this file contain NyanCat” logic hard to test, since it would be bound to the filesystem access code. Hence we create a function that gets the contents of the files, and one with our logic in it:
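A sketch of those two functions:

```php
function getTextsFromFiles( string $path ): array {
	// Eagerly reads ALL files into memory before returning.
	return array_map( 'file_get_contents', glob( $path . '*.txt' ) );
}

function findTextWithNyanCat( iterable $texts ): ?string {
	foreach ( $texts as $text ) {
		if ( strpos( $text, '~=[,,_,,]:3' ) !== false ) {
			return $text;
		}
	}

	return null;
}
```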

While this approach is decoupled, a big drawback is that now we need to fetch the contents of all files and keep all of that in memory before we even start executing any of our logic. If NyanCat is hiding in the first file, we’ll have fetched the contents of all others for nothing. We can avoid this by using an Iterator, as they can fetch their values on demand: they are lazy.

Our TextFileIterator gives us a nice place to put all the filesystem code, while to the outside just looking like a collection of texts. The function housing our logic, findTextWithNyanCat, does not know that the text comes from the filesystem. This means that if you decide to get texts from the database, you could just create a new DatabaseTextBlobIterator and pass it to the logic function without making any changes to the latter. Similarly, when testing the logic function, you can give it an ArrayIterator.

I wrote more about basic Iterator functionality in Lazy iterators in PHP and Python and Some fun with iterators. I also blogged about a library that provides some (Wikidata-specific) iterators and a CLI tool built around an Iterator. For more on how generators work, see the off-site post Generators in PHP.

PHP’s collection type hierarchy

Let’s start by looking at PHP’s type hierarchy for collections as of PHP 7.1. These are the core types that I think are most important:

  •  iterable
    • array
    • Traversable
      • Iterator
        • Generator
      • IteratorAggregate

At the very top we have iterable, the supertype of both array and Traversable. If you are not familiar with this type or are using a version of PHP older than 7.1, don’t worry, we don’t need it for the rest of this blog post.

Iterator is a subtype of Traversable, and the same goes for IteratorAggregate. The standard library iterator_ functions such as iterator_to_array all take a Traversable. This is important since it means you can give them an IteratorAggregate, even though it is not an Iterator. Later on in this post we’ll get back to what exactly an IteratorAggregate is and why it is useful.

Finally we have Generator, which is a subtype of Iterator. That means all functions that accept an Iterator can be given a Generator, and, by extension, that you can use generators in combination with the Iterator classes in the Standard PHP Library such as LimitIterator and CachingIterator.

IteratorAggregate + Generator = <3

Generators are a nice and easy way to create iterators. Often you’ll only loop over them once and not have any problem. However, beware that generators create iterators that are not rewindable, which means that if you loop over them more than once, you’ll get an exception.

Imagine the scenario where you pass in a generator to a service that accepts an instance of Traversable:
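A sketch of such a service (the class name is made up):

```php
class AwesomeThingProcessor {

	public function doStuff( Traversable $things ): void {
		// First iteration: fine for any Traversable, including generators.
		foreach ( $things as $thing ) {
			// ...
		}

		// Second iteration: throws "Cannot rewind a generator that was
		// already run" when $things is a Generator.
		foreach ( $things as $thing ) {
			// ...
		}
	}
}

// Passing in a generator:
( new AwesomeThingProcessor() )->doStuff( ( function() {
	yield 'foo';
	yield 'bar';
} )() );
```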

The service class in which doStuff resides does not know it is getting a Generator; it just knows it is getting a Traversable. When working on this class, it is entirely reasonable to iterate through $things a second time.

This blows up if the provided $things is a Generator, because generators are non-rewindable. Note that it does not matter how you iterate through the value. Calling iterator_to_array with $things has the exact same result as using it in a foreach loop. Most, if not all, generators I have written do not use resources or state that inherently prevents them from being rewindable. So the double-iteration issue can be unexpected and seemingly silly.

There is a simple and easy way to get around it though. This is where IteratorAggregate comes in. Classes implementing IteratorAggregate must implement the getIterator() method, which returns a Traversable. Creating one of these is extremely trivial:
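A minimal example (the class name is mine):

```php
class AwesomeThings implements IteratorAggregate {

	// A method containing yield is a generator function, so each
	// call to getIterator() returns a fresh Generator.
	public function getIterator(): Generator {
		yield 'foo';
		yield 'bar';
	}
}
```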

If you call getIterator, you’ll get a Generator instance, just like you’d expect. However, normally you never call this method. Instead you use the IteratorAggregate just as if it was an Iterator, by passing it to code that expects a Traversable. (This is also why usually you want to accept Traversable and not just Iterator.) We can now call our service that loops over the $things twice without any problem:
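Continuing the sketch from above:

```php
$things = new AwesomeThings();

// Every foreach inside doStuff calls getIterator() anew,
// producing a fresh Generator, so iterating twice works fine.
( new AwesomeThingProcessor() )->doStuff( $things );
```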

By using IteratorAggregate we did not just solve the non-rewindable problem; we also found a good way to share our code. Sometimes it makes sense to use the code of a Generator in multiple classes, and sometimes it makes sense to have dedicated tests for the Generator. In both cases having a dedicated class and file to put it in is very helpful, and a lot nicer than exposing the generator via some public static function.

For cases where it does not make sense to share a Generator and you want to keep it entirely private, you might need to deal with the non-rewindable problem. For those cases you can use my Rewindable Generator library, which allows making your generators rewindable by wrapping their creation function:
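A usage sketch, assuming the library’s RewindableGenerator wrapper class:

```php
// RewindableGenerator ships with the rewindable-generator library.
$numbers = new RewindableGenerator( function() {
	yield 1;
	yield 2;
	yield 3;
} );

foreach ( $numbers as $number ) {
	// first iteration
}

// On rewind the wrapper invokes the creation function again,
// so a second iteration does not throw.
foreach ( $numbers as $number ) {
	// second iteration
}
```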

Edit: You can, and probably should, use PHP’s native CachingIterator instead.

A real-world example

A few months ago I refactored some code that is part of the Wikimedia Deutschland fundraising codebase. This code gets the filesystem paths of email templates by looking in a set of specified directories.
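The old code looked something like this (reconstructed from the description below, not the exact original; names and file extension are illustrative):

```php
private function getTemplateFilePaths( array $subDirectories ): array {
	// array_walk mutates its by-reference parameter...
	array_walk( $subDirectories, function( &$directory ) {
		$directory = $this->templateBasePath . '/' . $directory;
	} );

	$paths = [];

	foreach ( $subDirectories as $directory ) {
		// ...and this assignment mutates the return variable
		// on every iteration of the loop.
		$paths = array_merge( $paths, glob( $directory . '/*.twig' ) );
	}

	return $paths;
}
```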

This code bound the class to the filesystem, which made it hard to test. In fact, this code was not tested. Furthermore, it irked me, since I like code to be on the functional side: the array_walk mutates its by-reference variable, and the assignment at the end of the loop mutates the return variable.

This was refactored using the awesome IteratorAggregate + Generator combo:
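The result looked roughly like this (a simplified sketch; names and file extension are illustrative):

```php
class TemplateFilePaths implements IteratorAggregate {

	private $templateDirectories;

	public function __construct( string ...$templateDirectories ) {
		$this->templateDirectories = $templateDirectories;
	}

	// Lazily yields the template paths, one directory at a time,
	// without mutating any state.
	public function getIterator(): Generator {
		foreach ( $this->templateDirectories as $directory ) {
			yield from glob( $directory . '/*.twig' );
		}
	}
}
```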

Much easier to read/understand code, no state mutation whatsoever, good separation of concerns, easier testing and reusability of this collection-building code elsewhere.

See also: Use cases for PHP generators (off-site post).

Tips and Tricks

Generators can yield key value pairs:
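For example (a made-up generator):

```php
function getFileTexts( string ...$fileNames ): Generator {
	foreach ( $fileNames as $fileName ) {
		yield $fileName => file_get_contents( $fileName );
	}
}

foreach ( getFileTexts( 'one.txt', 'two.txt' ) as $name => $text ) {
	// $name is the yielded key, $text the yielded value.
}
```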

You can use yield in PHPUnit data providers.
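For instance (a sketch; EmailValidator is a made-up class):

```php
/**
 * @dataProvider invalidEmailProvider
 */
public function testGivenInvalidEmail_validationFails( string $email ): void {
	$this->assertFalse( ( new EmailValidator() )->isValid( $email ) );
}

public function invalidEmailProvider(): iterable {
	// yield gives each data set a readable name without
	// first building up an array.
	yield 'missing at sign' => [ 'example.com' ];
	yield 'empty string' => [ '' ];
}
```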

You can yield from an iterable.
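This works with any iterable, as this sketch shows:

```php
function moreTexts(): Generator {
	yield 'fifth';
}

function getAllTexts(): Generator {
	yield 'first';

	// Delegate to an array...
	yield from [ 'second', 'third' ];

	// ...to an Iterator...
	yield from new ArrayIterator( [ 'fourth' ] );

	// ...or to another generator.
	yield from moreTexts();
}
```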

Thanks to Leszek Manicki and Jan Dittrich for reviewing this blog post.