Running Down a Bug
One thing I love about writing software is that you can almost always count on finding a solution to a problem. It is probably also why I like crossword puzzles and sudoku but get annoyed with solitaire. Very rarely in my 20+ years of development have I come across an issue where I threw my hands up and said, “it can’t be fixed”. I would like to say it happens less now because I have written and reviewed so many lines of code that I know all the patterns and can bend the frameworks to my will. The truth is that experience has taught me there is always a logical reason why an issue is happening, and it just takes time and effort to figure it out. Being one of the lead developers on the team probably factors in too, since I can’t just pass a problem off when I am at my wit’s end.
I recently worked on a bug that really gave me a run for my money, and it is likely to come up at my next interview when I get the “Tell me about a bug you have fixed and how you figured it out” question.
I am currently working on a mobile app, and to support automated testing by QA we deploy the app to a Digital.ai Continuous Testing digital assurance lab, formerly known as SeeTest. The app is built with a combination of the Ionic and Angular frameworks and targets both Android and iOS. The Jenkins pipeline build automatically uploads a supported version of the app binaries to the SeeTest server. The apps can then be installed on over 100 different devices, either in the browser or through the Appium automated testing tools. Everything was working fine until a few weeks ago, when installing the app onto an iOS device started to fail. We had just finished a release, and since the failure only affected the next release, nobody really noticed for a couple of weeks. When QA started working on their automated testing for the next release, they filed a blocking issue and it was assigned to the development team. Since I had been working on build-related changes, it was assumed that the failure was related to them, so I volunteered to take a look.
The first task was to reproduce the issue, and I quickly confirmed that the latest build failed to install on a device in SeeTest. Next I confirmed that the recently released version still installed correctly. Now I had to narrow down when it broke. Since this was a build bug, I had to reproduce the Jenkins build process locally to generate similar IPA files. I grabbed an old shell script, which I had used when our build machine broke, and could now generate the artifact locally in about 10 minutes. The fun was just getting started as I pulled down one revision at a time, stored the generated artifacts for reference and tried installing them on SeeTest devices. After a few hours and a couple of false positives I narrowed it down to one specific commit. Well, that was easy. Not so fast!
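For what it is worth, the revision-by-revision search above can also be done as a binary search, which matters when every check costs a 10 minute build. The sketch below is not what I ran at the time. It assumes the oldest revision in the list is known good, the newest is known bad, and the failure never goes away once it appears. The names first_bad_revision and is_bad are made up for illustration; is_bad stands in for "build this revision and try to install it on a SeeTest device".

```python
from typing import Callable, Sequence


def first_bad_revision(revisions: Sequence[str],
                       is_bad: Callable[[str], bool]) -> str:
    """Return the first revision for which is_bad() is true.

    Assumes revisions are ordered oldest to newest, revisions[0] is known
    good, revisions[-1] is known bad, and there is a single good-to-bad
    transition in between.
    """
    lo, hi = 0, len(revisions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(revisions[mid]):
            hi = mid  # the break happened at mid or earlier
        else:
            lo = mid  # the break happened after mid
    return revisions[hi]
```

With ten candidate revisions this needs three or four builds instead of up to eight. A false positive sends the search down the wrong half, though, so a flaky install check should be repeated before trusting the answer.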
The changes in the commit included some new and complicated integrations, but nothing jumped out as problematic for the packaging of the application, especially since the TestFlight build worked just fine. I started the next round of troubleshooting by removing things one at a time to see when the install started working again. After a few rounds of 15 minute builds and testing in SeeTest I still wasn’t able to isolate the issue. Some of the changes were intertwined with other changes, so it was hard to pull out individual pieces. Nothing was jumping out at me, so I reached out to the team that manages SeeTest to see if they could provide more details, but I was not hopeful. It was slow going, and one zip file in the change, containing a bunch of source code from a different project, wasn’t giving me a good feeling.
After meeting with the rest of the team, we decided it would be better for me to add things from the commit one at a time instead of removing them. I was overjoyed to have a new strategy and immediately set out with the new plan. It didn’t take long to see that the big zip file was the problem, but it made no sense to me why it would cause trouble, since it was very isolated from the iOS build and packaging process. I continued with the same strategy and, one folder at a time, updated the old zip file with changes from the new zip file. After six attempts the installation still worked fine, and I was getting to some of the bigger, more complicated changes. I knew this was going to yield results, but it could take a few more days.
Thankfully my luck changed when I heard back from the person managing our SeeTest installation. They were able to confirm the issue and escalated it to Digital.ai. I still figured it had something to do with our code change, but log output with more detail than “Installation fail” would help me find the problem. The next morning I was invited onto a call with Digital.ai, where we reproduced the issue and got some detailed logs from the server. I had the bug on the run at this point, and it only took about 20 minutes for me to track down exactly why the issue was happening.
During the installation process on the SeeTest device, the IPA is uploaded to a server, unpacked and re-signed with a different certificate. The log files showed that one of the files inside our IPA could not be re-signed because it couldn’t be read. The file in question has a ridiculously long name because it is meant to mock an HTTP request with a bunch of querystring parameters. A quick reading of the logs showed that the re-signing process runs on Windows and that the full path to the file was 334 characters, which exceeds the default maximum path length on Windows (MAX_PATH, 260 characters). After I removed the file from the zip file the SeeTest install succeeded, and when I put it back in the install failed again. Rather than wait for a fix to SeeTest, I just removed the file from our packaging process since it wasn’t really necessary.
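In hindsight, a small check like the following would have flagged the file before it ever reached SeeTest. It is only a sketch: the function name find_long_paths is made up, and the unpack directory is an assumption, because I don’t know where the re-signing server actually unpacks the IPA. An IPA is a zip archive, so the standard zipfile module can list its entries.

```python
import zipfile

# Default Windows limit (MAX_PATH). It includes the terminating NUL,
# so a path of 260 characters or more is already too long.
WINDOWS_MAX_PATH = 260


def find_long_paths(archive, unpack_root, limit=WINDOWS_MAX_PATH):
    """List (full_length, entry_name) for entries that would hit the limit.

    archive is a path or file-like object for an IPA/zip file.
    unpack_root is the (assumed) directory the archive gets unpacked into.
    """
    root_length = len(unpack_root.rstrip("\\/"))
    offenders = []
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            # +1 for the separator between the root and the entry name
            full_length = root_length + 1 + len(name)
            if full_length >= limit:
                offenders.append((full_length, name))
    return sorted(offenders, reverse=True)
```

Calling find_long_paths("App.ipa", r"C:\some\unpack\dir") with placeholder values returns the longest offenders first. Because the real unpack directory is unknown, picking a generous root length, or lowering the limit, is the safer choice for a build-time check.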
After spending four days on the issue, I was happy to have found a solution and just as happy to see it wasn’t related to my build changes. I learned a bunch of different things about SeeTest and also realized how problematic it is to just drop in a giant zip file with an unknown number of changes. The other key takeaway is how much I prefer the strategy of adding changes one at a time over trying to remove them. Now we wait for the next bug to come my way and torpedo my planned work.