Me, Myself and UI - My experiences with UI testing at HM Land Registry
The views and opinions expressed in this article are those of the author and do not necessarily reflect the official policy or position of HM Land Registry or the wider UK Civil Service.
I’ve worked with HM Land Registry (HMLR) for over five years. In that time I worked on the Local Land Charges (LLC) programme, which included four delivery (scrum) teams from both HMLR and external providers. We worked to migrate and standardise information from local authorities into one central register.
I’m still involved with the programme but no longer embedded in the team. I'll mostly cover what I learned during 2 years embedded in those teams, and how I’ve applied it to other teams.
We worked to deliver two key services:
- A citizen service to buy information
- A professional service to maintain the information
I'll explore our challenges, thinking, and lessons learned, and how our approaches to many of the problems boiled down to research and experimentation. We looked at design patterns and at what others had already tried. Looking back now, these lines of thought are hardly groundbreaking. I hope that by sharing this story, you can see that change and improvement don’t need to wait for a new project. It is worth trying to improve all the time.
Like any project, the timeline of events isn’t linear. Many of the things we tried happened around the same time.
CI Builds Are Failing
You guessed it, I’m talking about flakey tests. Like many other teams, we got to the point where some tests seemed to fail at random. When this became the norm rather than the exception, we knew something had to change. A small group analysed and fixed these tests alongside other work. We had some success, but it was short-lived.
We were fixing tests after they had been written and merged into our test suite, not preventing them from being written in the first place. We reacted to symptoms but didn’t address the root cause. This was time-consuming, and it didn’t prevent new flakey tests.
What it did do, however, was force us to look at the tests and architecture in more detail. We based our test suite on the test skeleton. We could create tests but lacked structure and consistency on how we should do it. More on that a bit later.
When I looked at our test estate I saw that many could be improved. Some tests were perhaps appropriate at some point in the past, but not anymore. Tests had become bloated with many assertions, and so had no clear purpose. Other tests used some “interesting” approaches to data management and other quirks. The suite showed us the debt we’d built up over the months of patching the tests.
As an example, the suite created users at startup, which most tests used at some point. This reused existing API tests to create users for the UI tests. This meant we had introduced dependencies between tests. This was something I implemented early on as a simple way to manage test users. Over time as people added more tests, they added more users to this setup. That approach wasn’t appropriate or sustainable anymore. I implemented it as a short term solution but we never revisited it, and as such, it became technical debt.
Overall the standard of our testing was high. I’m proud of what we did. I learned more about automation and code design working on that project than I had anywhere else. I learned by making mistakes, dealing with the consequences, and getting better. It’s fair to say a large proportion (most) of those interesting choices were mine. Like anything, they were choices that made sense at the time but not so much as time went on and we learned more.
Until we looked at our approach, we couldn't see these types of issues. We needed to rethink. How could we remove flakey tests, when we were actually introducing them all the time? Question after question surfaced and we needed to tackle them.
In general, we concluded:
- We had lots of UI tests
  - What are they doing for us?
  - Were they all relevant?
- Most of our tests were not atomic
  - Too much being done in a single test
- Tests were not independent
Thinking back to conversations with my colleague Paul, we often talked about how we could apply good engineering practices. What is the next thing we should do? How do we create worthwhile automation? Design patterns, Object Oriented Design (OOD), and standards were often the topics we debated.
Why Do We Have This Test?
Asking “but why?” turns out to be a pretty effective approach to most things. It’s also pretty annoying to deal with, so it earns bonus points. Faced with lots of tests, we needed a way to categorise them. If a test was checking lots of things, we immediately flagged it for further review. Tests written earlier in the project tended to cover low-level actions. These tests generally didn’t have much value anymore, as we tested the key functionality elsewhere. So in the bin they went! After some due diligence, of course.
Content Checks Rant
Speaking of things that are low value and belong in the bin ...
Page content only checks are garbage, change my mind.
This is anecdotal. Your context and mileage may vary. I removed tests that were only checking for content. I did this with extreme prejudice because in my view they brought us no value. They took up execution, maintenance, and development time. Generally, content defects were of low priority. This should be reason enough to not have them at this level.
Consider this simple scenario. An automated test passes if some text is present on the screen. This test doesn’t explicitly verify that the text is where it should be on the screen, or that it is in the correct font. If this test passes, does it tell us much? In my view, no, it doesn’t. We’d need to add lots more checking to the test to get more than simple verification. This scenario can be tested in different ways that may be more appropriate, whether by a human or by other automated checks at different test levels.
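To make the scenario concrete, here is a minimal sketch in Python of a content-only check. The function name and the HTML snippets are made up for illustration and are not taken from our suite. The check passes for both pages, even though the second one shows the text in the wrong place.

```python
def text_is_present(page_html: str, expected: str) -> bool:
    """A content-only check: passes if the text appears anywhere in the page."""
    return expected in page_html


# Hypothetical page where the message sits in the main content, as intended.
intended_page = "<main><p>No charges found</p></main><footer>Help</footer>"

# Hypothetical page where the same message has ended up in the footer.
misplaced_page = "<main></main><footer>No charges found</footer>"

# Both checks pass, so this test cannot tell the two pages apart.
results = [
    text_is_present(page, "No charges found")
    for page in (intended_page, misplaced_page)
]
```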
A manual check was often more effective than automated tests bloating the suite, which supported my question of why these tests existed at all. The answer was that we had created the tests without thinking enough about their value. I banned these tests from our suite. We rejected merge requests unless there was justification. It did keep those tests out, because nobody could articulate why they were valuable to us.
Nowadays, visual testing is far more sophisticated, offering more than pixel comparisons by using AI and a host of advanced features. Tools offer much smarter checking, aiming to simulate how a human would check, i.e. not just the HTML but also how things appear. If we had had access to that tooling, this rant would be redundant, because the tests would never have existed. It is something I fully intend to explore in future.
Refactor And Delete Tests
I didn't include that rant for clickbait. The experience of removing tests, and saying no to new tests, gave me the confidence to delete more tests. That might sound odd to some, and feels odd to me now. Why wouldn’t you remove things that aren’t relevant? As things change, so should our tests. Code gets refactored and deleted all the time, and the same should go for our tests. I’m not suggesting we delete tests without thought. We should consider what’s important, and what tests are valuable.
In my experience, it’s a problem of perception. Once we wrote a test, it had implied value. It implied that it covered some risk. There was a fear that if we didn’t run all the tests, the ones we left out would have caught issues. We clung to a “run all the tests” mentality rather than a “run relevant tests” mentality.
To move forward, we ensured our key flows were tested quickly and removed many of the larger regression tests. For example, we kept the “buy some data” and “update the data” flows, and removed tests that checked the form error messages. In other words, we focussed more on acceptance flows, rather than system level tests.
CI/CD And Running Tests
We had a good understanding of which tests were important for CI and which tests were good for larger regression runs. I wanted to reduce the number of tests run in our CI/CD pipeline, so we could get quicker feedback. I identified areas of lesser risk or infrequent change.
For example, our account management features hadn’t changed in about 3 months. We still ran a robust set of tests against this feature, even though we weren’t making changes in that area. Did we need to run all our tests against this feature? Of course not. Did we need to run any tests against the feature? That is a more interesting question. I would say we didn’t. But, to manage risk and people’s expectations, I chose to run a few key tests. A decent compromise in my book.
We were being more selective with what tests needed to run “all the time” in CI. We could change this if needed.
Once again, I’m not suggesting we exclude tests on a whim, but rather that we think critically about the value of those tests.
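One way to picture this kind of selection is with tags on each test. The sketch below is in Python with made-up scenario and tag names. It is not how our pipeline was configured, just an illustration of choosing a small set for CI while keeping the full set for larger regression runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    tags: frozenset


def select(scenarios, tag):
    """Return the scenarios carrying the given tag, in their original order."""
    return [s for s in scenarios if tag in s.tags]


# Hypothetical suite: everything is tagged "regression", but only the key
# flows (plus a few account checks) are tagged "ci".
SUITE = [
    Scenario("buy some data", frozenset({"ci", "regression"})),
    Scenario("update the data", frozenset({"ci", "regression"})),
    Scenario("log in to account", frozenset({"ci", "regression"})),
    Scenario("change account password", frozenset({"regression"})),
    Scenario("close account", frozenset({"regression"})),
]

ci_run = select(SUITE, "ci")
regression_run = select(SUITE, "regression")
```

Changing what runs in CI then means changing a tag rather than rewriting tests. Cucumber supports a similar idea through its own tags.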
What Is The Point Of This Suite?
Expanding our previous line of questioning: if we want to know why we have a test, surely we should ask why we have a whole suite of tests. It’s easy to get swept up in writing and maintaining tests, and to forget to do the testing and use the most appropriate tools for the job.
At this point, we’d removed lots of tests that didn’t belong in our suite. But we didn’t have a shared understanding of what should be in there to begin with, so it is hardly surprising that we ended up here. How could we get the right test at the right time?
We worked out our goals for the suite by asking more questions. Our primary goal was to test the key flows of our services. Maintaining detailed regression tests was a secondary concern. Getting that quick feedback in CI was generally preferable to long test runs, where most tests weren’t relevant.
We still needed something to help us keep thinking about the goal. We worked out that we could remind ourselves in many places to think about our tests. We incorporated this into our peer reviews, analysis, and our story mind maps. Nothing prescriptive, but prompts to stop and think about things.
As we developed the goal of the suite, we identified tests that didn’t sit well at the UI level. Over time, we had included API tests in our UI testing suite, due to a lack of guidelines, delivery pressures, and a habit of using UI tests for anything that’s not a unit test. I can’t pin down when this happened. I suspect it happened gradually, and once they were in the suite, they set a precedent: an example that people followed. Well, why wouldn’t they?
Using the “Test Pyramid” as a guide, we reviewed our tests (yet again). We weren't dogmatic about following the model but used it to prompt discussion. Can we test this at the API level? Is there any benefit to testing it at the UI level? Can we minimise the number of UI steps? This all helped us to figure out at which level our tests should live.
Independence Day (And Months)
We had removed the low-hanging fruit and were being a bit smarter about running tests. But we still had flakey tests. Sure, the suite ran a bit faster, and we had fewer tests to maintain. That was positive, but it wasn’t the problem we wanted to resolve.
I’ve mentioned the user account management “solution”. It encouraged dependent tests. Turns out we had similar problems throughout the suite. We needed our tests to be more independent. One test shouldn’t impact another.
We also had lots of code that did mostly the same thing. I put this down to 4 development teams constantly adding code. It wasn’t always obvious where you should look for things, or where to put useful code.
Helpers And Models
We needed to create a way to support independent tests. We needed ways to manage users, charges, and session information away from the UI. This is why we created data models and helpers. Helpers in our case are classes used primarily to set up and tear down state for our tests.
The model classes contained all the useful information about a data item. For example, the user model included names, email, and password. These models abstracted the details into classes. A test could use the model whenever it needed a user object.
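As a rough illustration, a user model along these lines could look like the Python sketch below. Our suite was not written in Python, and the field names, the `build_user` function, and the `example.test` email domain are assumptions made for the example. Generating a unique email per user is one way to stop tests sharing data.

```python
import uuid
from dataclasses import dataclass


@dataclass
class User:
    """Everything a test needs to know about a user, in one place."""
    first_name: str
    last_name: str
    email: str
    password: str


def build_user(first_name: str = "Test", last_name: str = "User") -> User:
    """Build a user with a unique email so tests do not share accounts."""
    unique = uuid.uuid4().hex[:8]
    email = f"{first_name}.{last_name}.{unique}@example.test".lower()
    return User(first_name, last_name, email, password=f"Pw-{unique}")


first = build_user()
second = build_user("Ada", "Tester")
```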
The helper classes perform useful actions in our code base, often with models. To create a user, the create_user helper interacts with the relevant API or database.
Figure 1. Flow diagram of create a user helper
We stored these in one place, so anyone could update or create functions if needed. Not only did we have code following better Object Oriented principles, we also now had ways to set up and tear down data.
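A helper of this kind might be sketched as follows. This is Python for illustration only. `InMemoryAccountApi` is a hypothetical stand-in for the real API or database, and the method names on it are invented. The point is that the helper both creates the data and remembers what it created, so it can clean up afterwards.

```python
from dataclasses import dataclass


@dataclass
class User:
    email: str
    password: str


class InMemoryAccountApi:
    """Hypothetical stand-in for the real account API or database."""

    def __init__(self):
        self.users = {}

    def post_user(self, payload):
        self.users[payload["email"]] = payload

    def delete_user(self, email):
        self.users.pop(email, None)


class UserHelper:
    """Creates users away from the UI and tracks them for teardown."""

    def __init__(self, api):
        self.api = api
        self._created = []

    def create_user(self, user):
        self.api.post_user({"email": user.email, "password": user.password})
        self._created.append(user)
        return user

    def teardown(self):
        # Remove every user this helper created, most recent first.
        while self._created:
            self.api.delete_user(self._created.pop().email)


api = InMemoryAccountApi()
helper = UserHelper(api)
helper.create_user(User("helper.demo@example.test", "not-a-real-password"))
emails_after_create = sorted(api.users)
helper.teardown()
emails_after_teardown = sorted(api.users)
```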
Independence of Test(s)
In general, UI tests exercise most of the application stack, which means it can be hard to nail down issues. We used our helpers to get our UI tests into the correct state, ready to perform the test. Most tests follow a simple pattern: get onto the right page, perform our action, check our result. In short, the “Arrange-Act-Assert” approach, which is well explained by Automation Panda.
This idea was key to our approach for more atomic UI tests. We used our helpers to arrange our users, charges, and browser sessions, so we used the UI as little as possible. Then our UI test would perform its action and assertion.
Each test was responsible for its own arrangement and cleanup. We leveraged Cucumber hooks for this, so that tests (passed or failed) had no bearing on one another. This reduced flakiness, because we were only using the UI to perform the actual test, not to set up or clear state.
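The shape of such a test can be sketched in Python using `unittest`, where `setUp` and `tearDown` play a similar role to the Cucumber hooks we used. `FakeService` and its methods are invented for the example and stand in for the real application. Only the log in step goes through the (pretend) UI.

```python
import unittest


class FakeService:
    """In-memory stand-in for the application under test (illustrative only)."""

    def __init__(self):
        self.users = set()

    # Back-door calls, as a helper might make through an API.
    def api_create_user(self, email):
        self.users.add(email)

    def api_delete_user(self, email):
        self.users.discard(email)

    # The one action this test exercises through the "UI".
    def ui_log_in(self, email):
        if email not in self.users:
            return "Unknown user"
        return f"Welcome {email}"


class LoginTest(unittest.TestCase):
    def setUp(self):
        # Arrange: create this test's own user away from the UI.
        self.service = FakeService()
        self.email = "login.test@example.test"
        self.service.api_create_user(self.email)

    def tearDown(self):
        # Cleanup runs whether the test passed or failed.
        self.service.api_delete_user(self.email)

    def test_user_can_log_in(self):
        banner = self.service.ui_log_in(self.email)  # Act
        self.assertEqual(banner, "Welcome login.test@example.test")  # Assert
```

Saved in a file, this can be run with `python -m unittest`. Because the test creates and removes its own user, it does not matter which other tests ran before it.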
It took us a lot of learning and a lot of work to reduce flakiness. We still encounter it today, as does every company using UI automation. While the technical learning and approach are interesting, my takeaway from the experience is that we need to keep asking questions. If we ask questions, we can find better ways to do things. We can learn from what others have done, and apply that to our situation. It’s also difficult to keep doing the “right things”, and not slip back into bad habits. I’m guilty of doing the wrong thing sometimes, as is any team.
A year or two later I worked with people in our test community to create guidelines on how to approach UI testing. I won’t go into the detail here, but we came up with a Page Object Model implementation. I added a wiki and example tests to our existing skeleton. Feel free to have a look at the repository. The standard provided guidance on how we could write tests, something we had lacked before. It incorporated lots of different experiences and views from the practice. It’s intended to provide support to teams, and a way for us to share experiences and better ways of doing things.
If I Could Start Again?
Since my time on Local Land Charges I’ve worked on different projects. Most recently, a new project to provide a “common HMLR account solution”. The UI testing uses the new UI standard, helpers and models to create more atomic tests. This project has more manageable UI automation thanks in part to lessons learned on LLC. I appreciate it’s a different project with different constraints. However, having those foundations and goals in place from the beginning has made all the difference. But it’s never too late to improve.
So what can you take away?
- Keep trying to improve
  - Even little things help
- People have dealt with similar problems, use their experience
  - Find what helps you
- Ask lots of questions
  - About what you are doing
  - Why you are doing it
- Try new things
  - Sticking to “what's always been done” is expensive and a road to madness
Aaron Flynn works with delivery teams at HMLR in Plymouth, UK, where he’s lived for the last 5 years. Originally from Dublin, he’s worked across the UK and Ireland. He’s passionate about communities, accessibility, and technical testing, to name a few things.
He’s a community lead for the HMLR test community, works with other communities at HMLR, has spoken at cross-government meetups, and (of course) is active on MoT. He loves collaborating with people to share ideas and experiences.
- [1] https://github.com/LandRegistry/skeleton-acceptance-tests
- [2] https://testing.googleblog.com/2015/04/just-say-no-to-more-end-to-end-tests.html
- [3] https://martinfowler.com/articles/microservice-testing/
- [4] https://martinfowler.com/bliki/PageObject.html
- [5] https://www.browserstack.com/guide/visual-testing-beginners-guide