What caused today's outage?

Chainer
Sep 07, 2022 - permalink

Today GWM was down for about an hour.

What happened was this: Every time someone loads a page, there's a PageView entry written to the database that has a small amount of information about the page being loaded. These are deleted shortly thereafter; they mostly exist for the purpose of gathering usage stats about traffic to the site.

However, each new PageView entry has a unique integer ID, incremented by 1 for each new entry. This is similar to how every new image uploaded to the site has a new integer ID, in that case visible in the URL.

The database stores integers in such a way that it imposes a maximum size on them. For the PageView ID field, this maximum is 2^31 - 1, or 2,147,483,647. This was reached this afternoon. As a result, every time the site tried to write a new PageView entry to the database, it failed, which in turn caused the entire page load to fail and returned a 500 internal server error to you.
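For anyone who wants to see the limit in action, here is a minimal Python sketch. It does not touch a database; it just uses the standard `struct` module to check whether a number fits in a signed 32-bit integer. The helper name `fits_int32` is made up for illustration, and it assumes the ID column behaves like an ordinary signed 32-bit integer.

```python
import struct

# Largest value a signed 32-bit integer can hold: 2^31 - 1.
INT32_MAX = 2**31 - 1


def fits_int32(value: int) -> bool:
    """Return True if value can be packed as a signed 32-bit integer."""
    try:
        struct.pack("<i", value)  # "<i" = little-endian signed 32-bit
        return True
    except struct.error:
        return False


last_ok = fits_int32(INT32_MAX)          # the final ID that still fits
next_fails = fits_int32(INT32_MAX + 1)   # the next ID does not fit
print(INT32_MAX, last_ok, next_fails)
```

The ID after 2,147,483,647 is the first one that cannot be stored, which is the point at which every insert starts failing.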

I fixed the immediate problem by clearing the PageViews and resetting the ID counter back to 1.
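Roughly, that kind of reset looks like the sketch below. This is only an illustration using Python's built-in `sqlite3` with an in-memory database, not the site's actual database or schema; the table name `page_view`, the `path` column, and the function `reset_page_views` are all assumptions. The way a counter is reset differs between database systems, and in SQLite it means clearing the table's row in `sqlite_sequence`.

```python
import sqlite3

# In-memory stand-in for the real database (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page_view (id INTEGER PRIMARY KEY AUTOINCREMENT, path TEXT)"
)
conn.executemany(
    "INSERT INTO page_view (path) VALUES (?)", [("/a",), ("/b",), ("/c",)]
)


def reset_page_views(conn: sqlite3.Connection) -> None:
    """Delete all page views and restart the ID counter (SQLite-specific)."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM page_view")
        # SQLite keeps AUTOINCREMENT counters in the sqlite_sequence table.
        conn.execute("DELETE FROM sqlite_sequence WHERE name = 'page_view'")


reset_page_views(conn)
cur = conn.execute("INSERT INTO page_view (path) VALUES (?)", ("/d",))
new_id = cur.lastrowid  # ID assigned to the first row after the reset
print(new_id)
```

After the reset, the next inserted row starts again from ID 1.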

The longer-term fix is to use a 64-bit integer type for this, whose limit is on the order of 2^64. This number is so large that if 10 million people each loaded a page every second, it would take about 60,000 years to reach it.
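As a rough sanity check on how long a 2^64 counter lasts, here is a small Python sketch. The function name `years_to_exhaust` and the load rates are just illustrative assumptions, and it assumes one new ID per page load at a constant rate. It works out to roughly 58,000 years at 10 million page loads per second, and roughly 58 years at 10 billion per second.

```python
# Average year length in seconds, counting leap years.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def years_to_exhaust(limit: int, ids_per_second: float) -> float:
    """Years until a counter reaches limit at a constant rate of new IDs."""
    return limit / ids_per_second / SECONDS_PER_YEAR


# 10 million page loads per second
years_at_10_million = years_to_exhaust(2**64, 10_000_000)
# 10 billion page loads per second
years_at_10_billion = years_to_exhaust(2**64, 10_000_000_000)

print(round(years_at_10_million), round(years_at_10_billion, 1))
```

Either way, the site's real traffic is far below these rates, so the new limit is not a practical concern.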

Sep 07, 2022 - permalink

But what will we do in 60,000 years? We must plan ahead! ;)

Sep 07, 2022 - permalink

This makes me curious. I have no data to back this up, but I feel like the site has grown a lot in the past year or two.

Sep 07, 2022 - permalink

Happy PageView entry ID reset day!

Sep 07, 2022 - permalink

Had this happen to a large client of mine when they stored too many events because of a rule syntax mistake. They filled it in 6 months.