Join devRant
Do all the things like
++ or -- rants, post your own rants, comment on others' rants and build your customized dev avatar
Sign Up
Pipeless API
From the creators of devRant, Pipeless lets you power real-time personalized recommendations and activity feeds using a simple API
Learn More
Search - "bug on prod"
-
I worked with a good dev at one of my previous jobs, but one of his faults was that he was a bit scattered and would sometimes forget things.
The story goes that one day we had this massive bug on our web app and we had a large portion of our dev team trying to figure it out. We thought we narrowed down the issue to a very specific part of the code, but something weird happened. No matter how often we looked at the piece of code where we all knew the problem had to be, no one could see any problem with it. And there want anything close to explaining how we could be seeing the issue we were in production.
We spent hours going through this. It was driving everyone crazy. All of a sudden, my co-worker (one referenced above) gasps “oh shit.” And we’re all like, what’s up? He proceeds to tell us that he thinks he might have been testing a line of code on one of our prod servers and left it in there by accident and never committed it into the actual codebase. Just to explain this - we had a great deploy process at this company but every so often a dev would need to test something quickly on a prod machine so we’d allow it as long as they did it and removed it quickly. It was meant for being for a select few tasks that required a prod server and was just going to be a single line to test something. Bad practice, but was fine because everyone had been extremely careful with it.
Until this guy came along. After he said he thought he might have left a line change in the code on a prod server, we had to manually go in to 12 web servers and check. Eventually, we found the one that had the change and finally, the issue at hand made sense. We never thought for a second that the committed code in the git repo that we were looking at would be inaccurate.
Needless to say, he was never allowed to touch code on a prod server ever again.8 -
Kudos to the devrant team for epic response time!
<50min from bug report on Twitter to workaround on prod!2 -
That moment when you make a bug fix on your local dev deployment and keep refreshing PROD web page hoping to reflect the change.1
-
Oh boy, my riskiest coding decision was certainly that one time when I refactored some 50k lines of critical legacy shit code in 3 days, straight up merged everything into master and then deployed to prod.
Luckily there was only one minor bug I had to fix after that... phew...
(To my defense: I was solo-working on it - the infamous CMS Of Doom™)2 -
Somewhere in a lonely break room
There's a guy starting to realize that eternal hell has been unleashed unto him.
It's two a.m.
It's two a.m.
The boss has gone
I'm sitting here waitin'
This desktop's slow
I am getting tired of fixin' all my coworkers' problems
Yeah there's a bug on the loose
Errors in the code
This is unreadable
Rubber ducky can't help
I cannot debug, my whole life spins into a frenzy
Help I'm slippin' into the programming zone
Git push to the prod
Set up a repo
My hard drive just crashed
All my code is gone
Where am I to go
Now that I've broke my distro
Soon you will come to know
When you need Stack Overflow
Soon you will come to know
When you need Stack Overflow
I'm falling down a spiral
Solution unkown
Disgusting legacy, ugly code
Can't get no connection
Can't get through to commit
Well the night weights heavy
On my confused mind
Where's the error on this line
When the CEO comes
He knows damn well
To keep his distance
And he says
Help I'm slippin' into the programming zone
Git push to the prod
Set up a repo
My hard drive just crashed
All my code is gone
Where am I to go
Now that I've broke my distro
Soon you will come to know
When you need Stack Overflow
Soon you will come to know
When you need Stack Overflow
When you need Stack Overflow
When you need Stack Overflow, a ha
When you need Stack Overflow
When you need Stack Overflow, a ha
When you need Stack Overflow
When you need Stack Overflow, a ha
When you need Stack Overflow
When you need Stack Overflow, a ha
When you need Stack Overflow4 -
When will I fuckin learn that
a) customers lie
b) customers are sloppy
c) customers are wrong
d) customers do not do their work (properly)
e) customers want us to do their (dirty) work
f) possibly all of the freakinly above?! + khm....
They will fuckin aaaalwaaaays say sth is not working after the update..
And I will alwaaaays assume I fucked up something..even if I didn't touch that part of the code/data..
And almost aaaaalways it turns out that the bug they complain about is how the system worked (or didn't work) before the update and/or some fuckup from their side..
Anyhow, I rushed over, grabbed the files went testing in dev..wtf, output is different, mine is ok, theirs is..wtf is that shit?!
Transfer newly built dll to test..same shit as on prod..wtf?! How?!
I assumed they have thing A correctly linked to thing B.. ofc thing A was linked to thing C in their case and in another case (our test) to correct thing B..
I got chillies when grabbing files, that
I should have tripple checked that they didn't fuck up something on the link part, but I just assumed they know what they were doing & that they checked they linked correct files with correct content already, before being pissy that the update fucked up things.. riiiight!! :/
I wanted to find solutions to this fuckup asap so I disregarded my gut feeling..yet again!! Fuuuck!
I've spent too much time trying to find ways to fix a bug that wasn't even a real bug to begin with.. :/
Fuuuuuck!!
So yeah, always treat the customers like they are 3yrs old & have no clue what they are doing & check exactly wtf they were indeed trying to do..it will save you time & nerves..
And note to self: reread this shit daily!! And imprint it in your brain that everything is not always your fault!!11 -
My most satisfying bug fix?
I found a core concurrency issue in this gnarly homegrown ORM and reported it to the lead devs, who (very defensively, having written the damn thing) argued that it would never pop up in a prod environment and I was stupid for even bringing it up. Theoretically, this bug could cause pretty much every foreign key to be assigned to the wrong parent, but only if multiple instances of the application were open/running at once. They were so certain it would never happen on live that they explicitly instructed me not to fix it. After all, this bug had been active for many years on a previous project and nobody complained.
Problem was, that previous project was something that only a single user had open most of the time (think: a manager). The new project was something that would be used by multiple people at the same time (think: all the employees). Once we released this new-project-with-old-orm, it didn't take long at all for our customers to start complaining.
After that, they let me fix the bug. :) -
This was some time ago. A Legendary bug appeared. It worked in the dev environment, but not in the test and production environment.
It had been a week since I was working on the issue. I couldn't pinpoint the problem. We CANNOT change the code that was already there, so we needed to override the code that was written. As I was going at it, something happened.
---
Manager: "Hey, it's working now. What did you do?"
Me: *Very confused because I know I was nowhere close to finding the real source of the problem* Oh, it is? Let me check.
Also me: *Goes and check on the test and prod environment and indeed, it's already working*
Also me to the power of three: *Contemplates on life, the meaning of it, of why I am here, who's going to throw out the trash later, asking myself whether my buddies and I will be drinking tonight, only to realize that I am still on the phone with my manager*
Me again: "Oh wow, it's working."
Manager: "Great job. What were the changes in the code?"
Me: "All I did was put console logs and pushed the changes to test and prod if they were producing the same log results."
Manager: "So there were no changes whatsoever, is that what you mean?"
Me: "Yep. I've no idea why it just suddenly worked."
Manager: "Well, as long as it's working! Just remove those logs and deploy them again to the test and prod environment and add 'Test and prod fix' to the commit comment."
Me: "But what if the problem comes up again? I mean technically we haven't resolved the issue. The only change I made were like 20 lines of console logs! "
Manager: "It's working, isn't it? If it becomes a problem, we'll work it out later."
---
I did as I was told, and Lo and Behold, the problem never occurred again.
Was the system playing a joke on me? The system probably felt sorry for me and thought, "Look at this poor fucker, having such a hard time on a problem he can't even comprehend. That idiotic programmer had so many sleepless nights and yet still couldn't find the solution. Guess I gotta do my job and fix it for him. I'm the only one doing the work around here. Pathetic Homo sapiens!"
Don't get me wrong, I'm glad that it's over but..
What the fuck happened?5 -
Not really a hack but still worth telling:
I was working in the QA team for a big project. I tried to do some automation when I realized some radio button behaved weird... out of curiosity I checked the source and saw that there was a hidden option for a unimplemented payment option.
I was like: Let’s see how the system behaves if I just submit that form with that hidden value...
Well I was very surprised when I received the email that my order has been processed successfully.
During the investigation we found out that this bug was in prod for over two years. And it requires a one liner executed in the browsers console to skip the payment.
It was kind of a big deal and although I was (and am) still a trainee (in apprenticeship) I got invited to meet up with the client and the bosses.
It was kind of a door opener! After that they trusted me more. I have more responsibility, more interesting tasks and more client contact ever since.
To make a long story short:
Validate everything on the server side ;-)1 -
*Assigns coworker as reviewer for a PR*
*Reviewer comments on something he thinks should be changed*
*Reviewer realizes I went home as it is Friday afternoon*
*Reviewer is super impatient, and chooses to push the change himself. He then accepts the PR (his own code) , merges, and makes a release*
*Team lead starts yelling since an obvious bug made it to prod - Literally white frontpage*
*Reviewer blames me, since the bug comes from my PR*
... Thanks, former employee. F*cking thanks. I know you got fired for being a d*ck, but I hope karma kicks you in the butt for the rest of your life..
And ps. to you: Don't blame coworkers when you can check the history.
F*ck some people..4 -
You know how my 'about' here says I create features from bugs?!? Well if you didn't before, now you do.. :P
Anyhow, today the most bizzare thing happened.. Customer played 'reverse uno card' on me..
Meaning?!
They reported a feature missing on uat env, when in fact I fixed a bug they have on prod..
Not sure how I should feel about this, but it sure made my day! (: I just hope they will not open a bug report for this missing bug..4 -
Managers at my jobs for the most part leave me be. Though I often have no clue whether I'm doing OK. I guess no news is good news right?
My worst experience isn't that bad. At one place, I was the only tester working on things coming from 20-30 devs. After about a year+, the company finally hired more testers, but it was still only 3 of us.
We were in the final stages of releasing a build to prod. It was going smoothly, or so we thought. At the last minute, I found a buried bug that was a showstopper.
A lot of hatred on me that day, that once it was fixed, and the release was finally deployed, I just shut off my laptop and left. I took all the blame because I was the one who found it rather than blaming the team as a whole for not finding it earlier. Oh well. Stuff happens.
Let's knock on wood that I don't run into worse higher up stories. -
We had 1 Android app to be developed for charity org for data collection for ground water level increase competition among villages.
Initial scope was very small & feasible. Around 10 forms with 3-4 fields in each to be developed in 2 months (1 for dev, 1 for testing). There was a prod version which had similar forms with no validations etc.
We had received prod source, which was total junk. No KT was given.
In existing source, spelling mistakes were there in the era of spell/grammar checking tools.
There were rural names of classes, variables in regional language in English letters & that regional language is somewhat known to some developers but even they don't know those rural names' meanings. This costed us at great length in visualizing data flow between entities. Even Google translate wasn't reliable for this language due to low Internet penetration in that language region.
OOP wasn't followed, so at 10 places exact same code exists. If error or bug needed to be fixed it had to be fixed at all those 10 places.
No foreign key relationships was there in database while actually there were logical relations among different entites.
No created, updated timestamps in records at app side to have audit trail.
Small part of that existing source was quite good with Fragments, MVP etc. while other part was ancient Activities with business logic.
We have to support Android 4.0 to 9.0 of many screen sizes & resolutions without any target devices issued to us by the client.
Then Corona lockdown happened & during that suddenly client side professionals became over efficient.
Client started adding requirements like very complex validation which has inter-entity dependencies. Then they started filing bugs from prod version on us.
Let's come to the developers' expertise,
2 developers with 8+ years of experience & they're not knowing how to resolve conflicts in git merge which were created by them only due to not following git best practice for coding like only appending new implementation in existing classes for easy auto merge etc.
They are thinking like handling click events is called development.
They don't want to think about OOP, well structured code. They don't want to re-use code mostly & when they copy paste, they think it's called re-use.
They wanted to follow old school Java development in memory scarce Android app life cycle in end user phone. They don't understand memory leaks, even though it's pin pointed by memory leak detection tools (Leak canary etc.).
Now 3.5 months are over, that competition was called off for this year due to Corona & development is still ongoing.
We are nowhere close to completion even for initial internal QA round.
On top of this, nothing is billable so it's like financial suicide.
Remember whatever said here is only 10% of what is faced.
- An Engineering lead in a half billion dollar company.4 -
A couple of years ago I was working on a fairly large system with a complex (by necessity) access control architecture.
As is usually the case with those projects, it's awkward for developers to repro bugs that have to do with a user's accesses in production when we are not allowed to replicate production data in test, let alone locally.
We had a bug where I ended up making myself a new row in the production database for a thing I could have access to without affecting real data to repro it safely. I identified the bug so I could repro it in dev/test and removed the row and ensured everything worked normally, whew scary.
Have you ever walked into the office one day, and everyone is hunched over in a semicircle around one person's workstation, before one turns around to look at you and says - after a pause - "... ltlian?.."
Turns out I had basically "poisoned the well" with my dummy entity in a way where production now threw 500 for everyone BUT me who had transitive access to this post-non-entity. Due to the scope of the system, it had taken about a day for this to gradually propagate in terms of caching and eventual consistencies; new entities coming in was expected, but not that they disappear.
Luckily I had a decent track record for this to be a one-off. I sometimes think about how I would explain testing in prod and making it faceplant before going home for the day, other than "I assumed it would be fine". I would fire me.3 -
I've now worked on both monolithic solutions and microapps/microservices. I gotta say I'm not sold on the new approach. There's so much overhead! You don't have to know your way around one solution -- no, now you need to know your way around 100 solutions. Debugging? Yeah, good luck with that. You don't have to provision one environment for dev, test, staging, and prod. No, now you need 100 environments per... environment. Now, you need a dedicated fulltime devops person. Now devs can check in breaking changes because their code compiles fine in that one tiny microapp. The extra costs go on and on and on. I get the theoretical benefits but holy crap you pay for it dearly. Going back to monolithic is so satisfying. You just address the bug or new feature head on without the ceremony and complexity. You know you're not crapping on other people's day (compilation-wise) because the entire solution compiles.
...and yeah, I'm getting old. So get off the lawn! ;)2 -
A loooong time ago...
I've started my first serious job as a developer. I was young yet enthusiastic as well as a kind of a greenhorn. First time working in a business, working with a team full of experienced full-lowered ultra-seniors which were waiting to teach me the everything about software engineering.
Kind of.
Beside one senior which was the team lead as well there were two other devs. One of them was very experienced and a pretty nice guy, I could ask him anytime and he would sit down with me a give me advice. I've learned a lot of him.
Fast forward three months (yes, three months).
I was not that full kind of greenhorn anymore and people started to give me serious tasks. I had some experience in doing deployments and stuff from my other job as a sysadmin before so I was soon known as the "deployment guy", setting up deployments for our projects the right way and monitoring as well as executing them. But as it should be in every good team we had to share our knowledge so one can be on vacation or something and another colleague was able to do the task as well.
So now we come to the other teammate. The one I was not talking about till now. And that for a reason.
He was very nice too and had a couple of years as a dev on his CV, but...yeah...like...
When I switched some production systems to Linux he had to learn something about Linux. Everytime he encountered an error message he turned around and asked me how to fix it. Even. For. The. Simplest. Error. He. Could. Google. Up.
I mean okay, when one's new to a system it's not that easy, but when you have an error message which prints out THE SOLUTION FOR THE ERROR and he asks me how to fix it...excuse me?
This happened over 30 times.
A. Week.
Later on I had to introduce him to the deployment workflow for a project, so he could eventually deploy the staging environment and the production environment by hisself.
I introduced him. Not for 10 minutes. I explained him the whole workflow and the very main techniques and tools used for like two hours. Every then and when I stopped and asked him if he had any questions. He had'nt! Wonderful!
Haha. Oh no.
So he had to do his first production deployment. I sat by his side to monitor everything. He did well. One or two questions but he did well.
The same when he did his second prod deploy. Everythings fine.
And then. It. Frikkin. Begins.
I was working on the project, did some changes to the code. Okay, deploy it to dev, time for testing.
Hm.
Error checking out git. Okay, awkward. Got to investigate...
On the dev server were some files changed. Strange. The repo was all up to date. But these changes seemed newer because they were fixing at least one bug I was working on.
This doubles the strangeness.
I want over to my colleague's desk.
I asked him about any recent changes to the codebase.
"Yeah, there was a bug you were working on right? But the ticket was open like two days so I thought I'll fix it"
What the Heck dude, this bug was not critical at all and I had other tasks which were more important. Okay, but what about the changed files?
"Oh yeah, I could not remember the exact deployment steps (hint from the author: I wrote them down into our internal Wiki, he wrote them done by hisself when introducing him and after all it's two frikkin commands), so I uploaded them via FTP"
"Uhm... that's not how we do it buddy. We have to follow the procedure to avoid..."
"The boss said it was fine so I uploaded the changes directly to the production servers. It's so much easier via FTP and not this deployment crap, sorry to say that"
You. Did. What?
I could not resist and asked the boss about this. But this had not Effect at all, was the long-time best-buddy-schmuddy-friend of the boss colleague's father.
So in the end I sat there reverting, committing and deploying.
Yep
It's soooo much harder this deployment crap.
Years later, a long time after I quit the job and moved to another company, I get to know that the colleague now is responsible for technical project management.
Hm.
Project Management.
Karma's a bitch, right? -
Lessions I learned so far from my first big node/npm project with tons of users:
1) If you didn't build something for a while, expect 3 hours of resolving version conflicts for every two weeks since the last build.
2) Even if the tests pass, run the containers on your own machine and make sure that the app doesn't randomly crash before deploying
3) Even if the app seemed to work on your own machine, run the tests again in an environment mimicking prod at most 15 minutes before replacing the running containers.
4) Even if all else indicates that the app will work, only ever deploy if you expect to be available within the 4 hours following a deployment.
5) Don't use shrinkwrap for anything other than locking every version down completely. A partial shrinkwrap will produce bugs that are dependent on the exact hour you built the app _and_ the shrinkwrap file, and therefore no one will ever have seen them other than you.
6) Avoid gyp, and generally try not to interface too much with anything that doesn't run on node. If parts of your solution use very different toolchains, your problems will be approximately proportional to the amount of code. And you'd be surprised just how much code you're running. (otherwise it's more logarithmic because the more code the less likely a new assumption is unique)
7) Do not update webpack or its plugins or anything they might call unless you absolutely need to
8) Containers are cool but the alpine ones are pretty much useless if you have even just one gyp module.
9) There's always another cache. To save yourself a lot of pain, include the build time in every file or its name that the browser can download, and compare these to a fresh build while debugging to assert that the bug is still present in the code you're reading
+1) Although it may look like it, SQLite is far from a simple solution because the code and the bindings aren't maintained. In fact, it'll probably be more time consuming than using a proper database.3 -
So we have this project that we are hosting on our testing server for presentation purposes ( already provisioning prod server ).
Our boss was presenting it to investors and my superior committed a bug there and was asking me help to figure out how to fix it (yeah.. he doesn't know how to checkout last commits in git... fml), and I realised the presentation might still be going on... so I asked: isn't boss showing it to investors?
superior: lol, idk maybe.
me: right... ( I proceed to roll back changes ) bye, have a good lunch.
And here I am having lunch considering my life choices. -
Confession: a very important feature of the website I'm developping wasn't working for a certain time. The boss wasn't aware because he doesn't go on the site, and I only found out last week because I needed to implement a new feature that used the previous one. Problem: the bug was only on production, not on local (and of course we don't have test server).
I took advantage of the absence of my boss today to clear the situation by making all of my tests on prod. I hope no customer tried to pass a command today, but it's finally repaired. I am both proud and shameful.3 -
Yesterday whole 12 hours we were working on deployment about a feature X that has deadline yesterday itself.
Everything damn perfectly running on Test env but not on Prod.
We made Prod into Dev/Test/Fucking garabage env. Haha.
I was laughing to myself at same time crying hard in my deep heart.
Business guys chasing PM
PM chasing us
And from morning till night we were in same room. Had lunch, and dinner only went out for toilet and to refil water bottles.
And found that feature Y is not working at same time that is related to our feature X. Fucking we have been wasted hours on it.
One of my devs got so fucked up emotionally that he messed up the code (not his fault) he didnt had his lunch and dinner. Had to console him later that its not his fault. Poor guy not sure whether he slept or not; will find out in few hours.
Anyways reported a bug.
But that bug assigned to us for fixing.
Are you fucking kidding me.
Anyways no choice. Had to do it.
Hope today everything goes good or horribly bad. FYI no deployments on Friday damn we are in stalememt till Monday.
Fuck that bug
Or
May be fuck our stupiditiy while makiing mistakes.1 -
This got me fucked up. Listen yo.
So we have this issue on our modal right. The issue keeps poppin. It's a hotfix because its in prod. So my senior and I were on it. After a few hours, I showed him the part of the code that is buggy. It's 50 lines of code of nested if-else, else-if. And so we're still fighting it. He redid everything since we're using angular2 he did a subject, behavior-subject all that bs and I was still trying to understand what's the bug, because it's happening on the second click and so I did my own thing and found the cause bug and showed it to him, its this:
setTimeout( () => {}, 0)
the bootstrap-modal doesn't allow async inside it (I dont why, its in the package). So he explained to me why it's there. So I did my own thing again and find a workaround which I did, a one-line of angular property, showed it to him he didn't accept it because we'll still have to redo it with subjects and he was on it. I said ok. Went back to my previous issue. The director came in and ask for a fixed, my senior came up to me and told me to push my fix. Alright no problem. So we good now. Went back to our thing bla bla bla, then got an email that we will have a meeting, So we went, bla bla bla. The internal team wants a support for mobile, senior said no problem bla bla bla, after the meeting he approaches me and said (THIS IS WHERE IT GOT FUCKED UP) we wont be supporting bootstrap4 anymore because of the modal issue and since we're going to support mobile and BOOTSTRAP4 grid system is NONINTUITIVE we are moving to material design because the grid system is easier. I was blown away man. we have more than 100 components and just because of that modal and mobile support shit he decided to abandon bootstrap. Mater of fact its the modal its his code. I'm not expert in frontend but I looked at the material design implementation its the same thing other than the class names. OHHH LAWD!3 -
I've just spent 4 hours trying to fix a bug on prod. that can be fixed in 30 secs. At the end I remembered that I should check the error log. FML (error reporting turned off, logs only)
-
Not exactly related to the topic but the exact thing is chilling the fuck out .
I always was anxious and was completely paranoid about minor bugs in my application during prod deployments(that is when I didn't know about testing utils and so on) , till the point that I couldn't fix a minor bug in the CSS and I puked 5 times over.
It was rough times but then I got over it and it really helped me alot.
I know bugs are like really not the kind of things you'd want to see in any application but it will arise in every application :3 -
My team is pretty small right now. It's myself and two other guys. One lead, who's been here for five years. A senior who we brought on 2 weeks ago. And me, a regular app dev. The lead put his two weeks in last week and has been trying to brain dump as much as he can onto us.
I've been building a list of prioritization to compensate for when he leaves based on what he was saying was the most important. This list has gotten pretty massive after reviewing most of the processes in place.
I was hired mainly to quell new requests coming in and not to maintain our systems, so that's what I did. I didn't examine our prod code base too closely. I wish I had. It's in a sorry state. I'm pretty sure I have about 2 years of tech debt for a crew of two guys constantly working on it.
I've been trying to prioritize based on what gets the most bug fixes and change requests. These apps will see the biggest changes and will undergo the most maintenance.
Since I'm just a regular app dev it feels weird trying to come up with this and try to prioritize this and come up with a plan. It feels like someone else should have. If it needs done then I guess it needs done. I need to be able to collaborate and work with my co worker and be able to plan for what projects are coming next.
If anyone has any suggestions to tackle tech debt please make them. Or if there's any help for managing priorities in a different manner that may prove helpful I'm open. Honestly, I don't want to tackle this completely blind, it feels like a lot.1 -
One nightmarish project that was doomed from the beginning, had me as the sole developer. I could hardly sleep when we began testing on a separate test system, but with (nearly) all the config stored in shared memory and copied from the production system, I dreaded, half awake, that the production server data base connection was still configured in the test system and that it was shooting all it's test data repeatedly to prod.
Finally drove to company in middle of the night at 4 o'clock. Checked everything was OK, tried to sleep 3 hours before the start of the work day.
This system also had the most hideous memory corruption in some shared memory that was used across several processes and should have been thoroughly protected by a mutex, but somehow, sometimes this crucial map, that was used to speed up the access to all the customer data just contained garbage.
Still haunts me to that day. (Like xkcd's unresolved tension of a non-matching parenthesis - an unresolved bug.