Navigating ChromeDriver crashes in Kubernetes: A tale of test automation resilience

Overcome ChromeDriver crashes and resource limitations by testing on Kubernetes

Why test on Kubernetes?

In my day-to-day work as a Software Developer in Test, I was tasked with developing automated UI (user interface) tests for a complex web application with dynamically generated content. This meant that the web elements I needed to interact with often lacked static attributes that could be easily referenced, so I had to use more complex strategies to locate and interact with them. After some web element localization gymnastics to make sure I could click the right buttons and access the right iframes, the tests were complete.

The web application runs as a separate Kubernetes (K8s) instance for each client company, each on a different K8s cluster, where the resources for that particular instance are grouped inside namespaces. The UI tests are triggered automatically just before a major version update of the web application and again immediately after it, to make sure that the update has not negatively impacted usability. This was achieved by containerizing the UI test code in a Docker image, pushing it to an organisational repository, and using a K8s Job to deploy the tests in the specific instance namespace whenever an update was pending and immediately after it completed.

Because it was deployed on hundreds of resources, the Docker image needed to be light, so the tests ran in a Linux environment. Running on Linux with no display support meant that the tests couldn’t open a normal browser window and instead had to use the browser’s headless mode. All development and testing is done in the Chrome web browser company-wide, so naturally I used ChromeDriver to drive the Chromium browser set up in the container. A server that kept track of the upgrade schedule triggered the tests for a particular instance of the web application, and the tests reported a JSON summary of the results back to the server.

Diagram of the server requesting the tests and the Kubernetes service returning a test result summary

Starting your journey

My first attempt at running the tests in a container, using Docker Desktop, produced the following error:

selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: exited abnormally.

(unknown error: DevToolsActivePort file doesn't exist)

(The process started from chrome location /usr/lib/chromium/chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)

Double-check your web results

Searching for the error online [1], I came across a solution that seemed to work: passing the --no-sandbox flag in the WebDriver options. My tests now ran, so I could have moved on, but I wondered what that flag actually does and whether using it could negatively impact the tests or the environment they run in.

According to the same source [1], “The sandbox removes unnecessary privileges from the processes that don't need them in Chrome for security purposes. Disabling the sandbox makes your PC more vulnerable to exploits via web pages, so Google doesn't recommend it.” The flag seems to be an option that is often needed to run headless Chrome on Unix-like systems, particularly when the browser runs as root inside a container, as it did in my image. Given that my tests ran in an ephemeral K8s Job that is discarded some time after it completes, I did not need to worry about environment security concerns.

This, however, introduced a new issue when I ran the tests from my development environment, a Windows 11 Pro (version 22H2) machine that is not discarded after use. In about 50% of the runs, after the tests finished and the WebDriver.quit() line was executed, Chrome left 2 lingering background processes. These had to be killed manually from Task Manager, or else repeated test runs would drive CPU usage up to 100%, making the PC unusable without a restart. The testing community on the world wide web again came to the rescue, as this problem is documented in a GitHub issue [2].

Try running in the final environment

All seemed to work fine on Docker Desktop, so I could have published the final image with the testing code and been done with it, chanting the infamous “Works on my machine”. I was even delivering “my machine”, that is, the container environment where the tests worked, to the location they were supposed to run in, the K8s cluster, so no issues there, right? Wrong.

When running the tests in the environment they were supposed to run in, namely K8s, only some of the tests from the whole suite executed correctly, and then all the others failed.

Be creative

Having already worked on these tests for quite a long time, I had to deliver. With no time to investigate further why the UI tests failed without any apparent set-up error, I got the idea of thinking outside of the company box by using Firefox. That seemed to solve the issue, and the tests finally worked in the environment they were supposed to run in. The code was delivered and it did what it was supposed to do, but the question remained: why did the UI tests fail in a K8s environment?

Be curious

After delivering the code, I could have patted myself on the back for a job well done and closed the chapter on the subject, but the question of why some of the tests worked on Firefox and not on Chrome kept bugging me, so I started investigating, without the pressure of delivering results.

For debugging purposes, I set the Dockerfile to run a sleep infinity command when the container starts, instead of the usual command that triggered the UI tests. This kept the container alive while I executed commands inside it to run the tests. It also gave me a way to transfer files between the K8s pod that was running my tests and my local machine. This is the full Dockerfile I used for debugging:

FROM alpine:3.15

WORKDIR /app

# Copy the repository to the container
COPY src/ ./src/
COPY test/ ./test/
COPY requirements.txt .

# Switch to root user
USER root

# Install necessary software
RUN apk add --update --no-cache \
    bash \
    sudo \
    nano \
    python3 \
    python3-dev \
    py3-pip \
    firefox \
    chromium \
    chromium-chromedriver \
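    && pip3 install --upgrade pip \
    && pip3 install --ignore-installed --trusted-host pypi.python.org -r requirements.txt

# python3 and chromedriver are installed under /usr/bin, which is already on PATH.
# An "export PATH=..." inside RUN would not persist into the running container.

# Keep the container alive for debugging instead of running pytest
CMD sleep infinity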

I wanted to see what the UI looked like when the tests failed, so I wrote some logic to take screenshots on test failure. As I could not view the images in the Linux server environment inside the K8s pod, I downloaded the files to my PC. They showed that during one of the failing tests, the web page was trying to load an iframe, and the iframe did not load when the page was accessed from inside the K8s pod. This issue did not occur when running the tests headless from localhost, nor when running them inside a Docker Desktop container based on the same image that was used in the Kubernetes environment. I now knew why the tests had failed: the iframes did not load. But what could be the cause of that?
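The screenshot logic itself is not shown in this article, so the following is only a minimal sketch of what it could look like. It assumes a Selenium WebDriver object, whose save_screenshot method is part of Selenium's Python API. The helper name save_failure_screenshot and the screenshots directory are hypothetical.

```python
import re
import time
from pathlib import Path


def save_failure_screenshot(driver, test_name, out_dir="screenshots"):
    """Save a PNG of the current page and return its path.

    `driver` is expected to be a Selenium WebDriver; only its
    save_screenshot(filename) method is used here.
    """
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    # Replace characters that are awkward in file names (e.g. "::" or "[")
    safe_name = re.sub(r"[^A-Za-z0-9_.-]+", "_", test_name)
    target = folder / f"{safe_name}_{int(time.time())}.png"
    driver.save_screenshot(str(target))
    return target
```

Called from a test failure hook or an except block, a helper like this leaves one image per failed test, which can then be copied out of the pod (for example with kubectl cp) for inspection.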

Be methodical

I thought about the processes that were going on during the tests and imagined what could go wrong such that the iframe would not load. The first thing that came to mind was that the browser could not reach the iframe URL from inside the K8s pod. I got the iframe URL from the DevTools in my local browser and pinged it from inside the K8s pod, and it responded fine. This ruled out a connection issue.
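A connectivity check like the one above can also be scripted from inside the pod. The sketch below is an assumption about how one might do it with Python's standard library, not the command used in this story. Note that it sends an HTTP request rather than an ICMP ping, so it also exercises DNS, TLS, and the web server. The function name is_reachable is hypothetical.

```python
import urllib.error
import urllib.request


def is_reachable(url, timeout=5.0):
    """Return True if an HTTP(S) request to `url` gets any response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just with an error status code
        return True
    except OSError:
        # DNS failure, refused connection, or timeout (URLError is an OSError)
        return False
```

A True result only says that the pod can talk to the server. As the rest of this story shows, the browser can still fail to load the resource for other reasons.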

If the browser reached the iframe, maybe the WebDriver wasn’t handling the response correctly, so some logs would be helpful. I tried saving the WebDriver logs and going through them line by line while comparing them to the WebDriver logs that I got from the Docker Desktop container. They were identical, so the WebDriver handling of the connection to the iframe was not the problem.

Was it a problem in the way the browser handled displaying the information it got from the iframe? Were there any problems in the DevTools console logs? I wrote some logic in the code to fetch the DevTools console logs and save them to a file while running the tests. When going through the file, I found the culprit: 

{'level': 'SEVERE', 'message': 'https://www.example.com - Failed to load resource: net::ERR_INSUFFICIENT_RESOURCES', 'source': 'network', 'timestamp': 1704886353348}
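The code that fetched these logs is not part of this article. As a hedged sketch, this is one way to capture them with Selenium's Python bindings: the goog:loggingPrefs capability asks ChromeDriver to record console messages, and get_log("browser") returns entries shaped like the one above. The function names and the log file path are hypothetical.

```python
import json


def enable_browser_logging(options):
    """Ask ChromeDriver to record all console messages.

    Call this on the ChromeOptions object before creating the driver.
    """
    options.set_capability("goog:loggingPrefs", {"browser": "ALL"})
    return options


def dump_console_logs(driver, path):
    """Append the buffered console entries to `path`, one JSON object per line,
    and return the entries logged at SEVERE level."""
    entries = driver.get_log("browser")  # each call drains ChromeDriver's buffer
    with open(path, "a", encoding="utf-8") as handle:
        for entry in entries:
            handle.write(json.dumps(entry) + "\n")
    return [entry for entry in entries if entry.get("level") == "SEVERE"]
```

Calling dump_console_logs after each test, or at least after each failure, keeps the entries close to the step that produced them, because the buffer is emptied on every read.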

It seemed that the browser did not have sufficient resources to load a simple iframe. I checked the resources allocated to the pod, and they were more than enough. I then turned to Stack Overflow for answers, which pointed to a Chromium bug from 2011 in which the browser on Linux hits a memory limit [3]. Still no solution in sight, as the bug was not resolved and the last reply was from 2018.

Be resilient

Not wanting to give up, I researched what each of the Selenium ChromeOptions means, to see if one of them could fix my problem. I stumbled upon the --disable-dev-shm-usage option in this post [4], which states that “The /dev/shm partition is too small in certain VM environments, causing Chrome to fail or crash. Use this flag to work-around this issue (a temporary directory will always be used to create anonymous shared memory files).” It seems to be related to another Chromium bug, as linked in the post. Apparently, the way the Chrome browser used resources in the K8s environment was different from the way it used them in Docker Desktop and on my local machine.
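One cheap way to test whether /dev/shm is a plausible suspect is to compare its size across environments. The sketch below uses only Python's standard library and is an illustration, not something that was run as part of this investigation. The function name shared_memory_report is hypothetical. For reference, Docker gives a container 64 MB of /dev/shm unless --shm-size is set.

```python
import shutil


def shared_memory_report(path="/dev/shm"):
    """Return the total, used, and free size of the filesystem at `path`, in MiB."""
    usage = shutil.disk_usage(path)
    mib = 1024 * 1024
    return {
        "total_mib": usage.total // mib,
        "used_mib": usage.used // mib,
        "free_mib": usage.free // mib,
    }
```

Running print(shared_memory_report()) inside the pod, inside the Docker Desktop container, and on a local Linux machine gives three numbers to compare before reaching for the flag.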

Be successful

I initialised my ChromeDriver with this flag and ran the tests. After such a long journey, it finally worked. I was now running UI tests inside a Chromium browser in a Linux environment on a K8s cluster. I changed the code back to use the Chromium browser for running the tests and deployed the new image to the company container registry. I then checked for an instance that had an upcoming update and waited for the test results to roll in. Everything worked, and this journey brought a sense of accomplishment and enriched my testing experience.

To wrap up

If you ever feel the gentle nudge of the nagging question “Why?”, take the time and resources to pursue an answer. It might take you where few others have gone before and prove to be an advantage in navigating the ever-changing world of software testing. If you do find something interesting, take the time to share it with others, because together we test the world, one bug at a time.

For those facing the challenge of testing within a K8s environment, be sure to use the two options that made all this work possible:

chrome_options.add_argument('--no-sandbox')

chrome_options.add_argument('--disable-dev-shm-usage')
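Put together, the driver setup could look like the sketch below. It is a minimal example under stated assumptions, not the production code: the function names chrome_args and create_driver are hypothetical, and applying the two flags only on Linux is a design choice made here because the lingering-process problem described earlier appeared on Windows with --no-sandbox set.

```python
import sys


def chrome_args(platform=sys.platform):
    """Return the Chrome command-line flags to use on the given platform."""
    args = ["--headless"]  # no display is available in the container
    if platform.startswith("linux"):
        # The two flags from this article, applied only on Linux
        args += ["--no-sandbox", "--disable-dev-shm-usage"]
    return args


def create_driver():
    """Start Chrome/Chromium through ChromeDriver; the caller must call quit()."""
    # Imported here so the flag logic above can be used without Selenium installed
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    chrome_options = Options()
    for arg in chrome_args():
        chrome_options.add_argument(arg)
    return webdriver.Chrome(options=chrome_options)
```

Keeping the flag selection in its own function makes it easy to unit test and to adjust per environment without touching the code that starts the browser.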

Navigating ChromeDriver crashes in Kubernetes is a journey of test automation resilience. The task was UI test development for a dynamically generated web app, and the initial tests in Docker Desktop ran into Chrome crashes. These were resolved by the --no-sandbox flag, yet lingering processes on Windows required further community solutions.

Deploying the tests in Kubernetes revealed further failures, prompting the adoption of Firefox as a workaround. Curiosity drove a post-delivery investigation, which uncovered iframe loading issues. Resource checks, bug searches, and ChromeOptions exploration led to success with --disable-dev-shm-usage, ensuring the UI tests run smoothly in Kubernetes.

This journey underscores the importance of curiosity, resilience, and resourcefulness in overcoming testing challenges, with lessons shared for navigating similar hurdles.

References:

  1. What does the Chromium option `--no-sandbox` mean?
  2. Chrome process still running in background after driver.quit()
  3. Chrome fails to load more than ~440 images on one page.
  4. Meaning of Selenium ChromeOptions

Dan Burlacu

Software Developer in Test

Civil engineering Ph.D. turned Software Developer in Test. Skilled in Python, Selenium, Kubernetes, Docker, Visual Basic for Applications, Excel, Microsoft Azure, Microsoft Power Automate, PHP, JavaScript, HTML, CSS, and APIs. Proficient in cloud-based software testing. Strong communicator.
