Search - "cannot debug properly"
-
Depressed since yesterday.
Updated all our clients’ dialers. Stellar performance. Suddenly one of 15 can’t hang up three-way calls.
It’s one of our biggest clients. And they just started. We upgraded the dialers so the answering machine detection would improve for them and it did, along with vast performance upgrades as well. Suddenly, this issue.
Two days in, they pulled the plug until we fix it. The issue is sporadic and we cannot reproduce it. No one else is having it. I can’t even debug it properly, as it’s a third-party dialer with no customizations on it. I found out where the error is, but I have no idea what workflow makes it happen, or why. It’s so frustrating. It happens both through the dialer’s native interface and through our integration via API calls. The channel doesn’t get sent to the command for some random reason, and only sometimes.
So even if it’s fixed they don’t trust the system. Now they are losing the full integration we have with the crm and dialer and it’s going to be a mess of data for them. All because of this one issue. They love the CRM though...
If they had just stayed on one more day I’m sure I could have found it. Now I have to play forensic scientist and look through old data, without being able to see the client code that was causing the issue.
Just threw some cash down to be able to talk to the dialer engineers and hopefully see what’s up. What a nightmare. And I have so many other projects for the platform due so soon...
Sigh. Super depressing.
-
When you are debugging a function, dump all the DBG variables, and then put in an exit() to stop the execution....
.
.
.
.
.
.
.
But now it runs perfectly without any bugs... 😭😭😭
FML.
-
dev vs QA rant (n + 1)
So our QA is done by the China team, so naturally the time difference is quite irritating,
I cannot change code
I cannot debug the issue
So today I fix a critical issue, and before I push it my seniors send it to QA
> QA unavailable
> I wait for QA because nobody notifies me when the code is tested and I can move ahead
> I get a review saying that my fix generated another issue (page gets redirected)
> I'm angry and astonished, I check on the same link, same circumstances, and no such issue is found
> My seniors tell me to read the issue properly, and I do; no positive response when I contradict QA
> QA leaves for home on Friday and the critical issue still remains on live
I cannot believe the laziness of QA, I mean it's their loss at the end of the day.
> On top of that, I waited 2 hours for QA to check the issue
-
In the last episode of "How SystemD screwed me over", we talked about SystemD's PrivateTmp and how it stopped me from generating SSL certificates.
In today's episode - SystemD vs CGroups!
Mister Poettering and his team apparently felt that CGroups are underused (as they can be quite difficult to set up), and so decided to integrate them into SystemD by default, as well as to provide a friendlier interface for controlling their values.
One can read about these interactions in the manual page "systemd.resource-control".
All is cool so far. So what happened to me today?
Imagine you did a major system release upgrade of a production server, previously tested on a standalone server. This upgrade doesn't only upgrade the distribution, however; it also includes the switch from SysVInit to SystemD. Still, everything went smoothly on the test server, so there is nothing to worry about now, right? Wrong.
The test server was never properly stress-tested. This would prove to be an issue.
When the upgrade finishes, it is 4 AM. I am happy to go to bed at last. At 6 AM, however, I am woken up again as the server's webservices are unavailable, and the machine is under 100% CPU load. Weird. I check htop and see that Apache now eats up all 32 virtual cores. So I restart it, writing it off as some weird bug or something as the load returns to normal.
2 hours later, however, the same situation occurs. This time, I scour all the logs I can, and find something weird: many mentions that Apache couldn't create a worker thread. That's weird.
Several hours of research and tinkering later, I found out the following:
1 - By default, all processes of a system that runs SystemD are part of several CGroups. One of these CGroups is the PID CGroup, meant to stop a runaway process from exhausting all PIDs/TIDs of a system.
This limit is, by default, set to a certain amount of the total available PIDs. If a process exhausts this limit, it can no longer perform operations like fork().
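To see how close a service is to that limit, something like the following might do. It is only a rough sketch: it assumes the kernel's pids controller exposes the usual `pids.current` and `pids.max` files in the unit's cgroup directory, and the path in the comment (cgroup v2, unit named apache2.service) is a guess that will differ between setups.

```python
from pathlib import Path
from typing import Optional, Tuple


def parse_pids_max(raw: str) -> Optional[int]:
    """Return the task limit as an int, or None when the file says 'max' (no limit)."""
    value = raw.strip()
    return None if value == "max" else int(value)


def read_task_usage(cgroup_dir: str) -> Tuple[int, Optional[int]]:
    """Read (current task count, task limit) from a cgroup directory."""
    base = Path(cgroup_dir)
    current = int((base / "pids.current").read_text().strip())
    limit = parse_pids_max((base / "pids.max").read_text())
    return current, limit


def headroom(current: int, limit: Optional[int]) -> Optional[int]:
    """Tasks left before fork()/clone() starts failing; None means unlimited."""
    return None if limit is None else max(limit - current, 0)


# Example call (assumed path, adjust to the real cgroup layout):
# current, limit = read_task_usage("/sys/fs/cgroup/system.slice/apache2.service")
# print(current, limit, headroom(current, limit))
```

Had I been watching the headroom shrink towards zero, the "couldn't create a worker thread" messages would have made sense a lot sooner.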
So now I know the how and the why, but how should I solve this? The sanest option would be to get a rough estimate of just how many threads the Apache webserver might need. This option, though, is harder than it looks. I cannot just take the MaxRequestWorkers number... The instance already has roughly double that many threads. The cause, as I found out, is the HTTP/2 module, which spawns additional threads that do not count towards that setting. So I have no idea what limit to set.
Or I could... disable the limit for just the webserver via the TasksAccounting switch. I thought this would work. And it did seem to... until I ran out of TIDs again. Although systemctl status apache2.service no longer reported the number of tasks or a task limit for the process, the PID CGroup stayed set to the previous limit. Later I found out that I can only really disable Task Accounting for all the units of a given slice and its parents at once.
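For a rough idea of how many threads the webserver really uses, one could add up the `Threads:` field from `/proc/<pid>/status` for every matching process. This is a minimal sketch, not what I actually ran: the process name `apache2` is an assumption (other distributions call it `httpd`), and the number is only a snapshot, so it should be sampled under real load.

```python
from pathlib import Path


def status_field(status_text: str, field: str) -> str:
    """Return the value of one 'Field:' line from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith(field + ":"):
            return line.split(":", 1)[1].strip()
    return ""


def count_threads(process_name: str, proc_root: str = "/proc") -> int:
    """Sum the thread counts of all processes whose Name matches process_name."""
    total = 0
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # not a PID directory
        try:
            text = (entry / "status").read_text()
        except OSError:
            continue  # process exited or is not readable
        if status_field(text, "Name") == process_name:
            total += int(status_field(text, "Threads") or 0)
    return total


# Example call (assumed process name):
# print(count_threads("apache2"))
```

Comparing that number against MaxRequestWorkers shows how far off the naive estimate is.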
This, though, systemctl didn't exactly make apparent (and I skimmed the manual, that part was my fault).
So... The only remaining option I had was to... Just set the limit to infinite. And that worked, at last.
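For the record, "infinite" presumably means `TasksMax=infinity` from systemd.resource-control, set through a drop-in for the unit. The sketch below just writes such a drop-in. The unit name `apache2.service` and the file name `override.conf` are assumptions, and it writes under a directory you pass in rather than touching /etc by default.

```python
from pathlib import Path


def render_tasks_override(tasks_max: str = "infinity") -> str:
    """Build the text of a drop-in that overrides the unit's task limit."""
    return f"[Service]\nTasksMax={tasks_max}\n"


def write_tasks_override(unit: str, root: str, tasks_max: str = "infinity") -> Path:
    """Write <root>/<unit>.d/override.conf and return its path.

    On a real system root would be /etc/systemd/system (needs root privileges).
    """
    dropin_dir = Path(root) / f"{unit}.d"
    dropin_dir.mkdir(parents=True, exist_ok=True)
    target = dropin_dir / "override.conf"
    target.write_text(render_tasks_override(tasks_max))
    return target


# Example call (assumed unit name):
# write_tasks_override("apache2.service", "/etc/systemd/system")
```

After that, `systemctl daemon-reload` and a restart of the service are needed for the new limit to apply.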
It took me several hours to debug this issue. And I once again feel like uninstalling SystemD in favor of SysVInit.
What did I learn? RTFM, carefully. Everything is important; it is not enough to read *half* the paragraph of a given configuration option...
Oh, and apache + http/2 = huge TID sink.