How can I get status updates if this site is offline?
You can follow @wikispot_status on Twitter.
Also, you can join us in IRC in channel #wikispot on server irc.freenode.net. However, if we're busy resolving an outage, we're probably not going to be very chatty!
Who maintains the software that runs this site?
Who maintains the hardware that runs this site?
What is your server setup?
As of Mar 7, 2009, we are running on a Xen virtual private server (VPS) named Leo, hosted on Amit's personal hardware. The host system has two Opteron 275s (2.2GHz) for a total of four processor cores, 8GB of DDR1-400MHz ECC/Registered memory, and a four-drive 1.5TB hardware RAID5 array on a 3ware 9550SXU controller (with a battery backup for the write cache). Of that, Leo has 4GB of memory, two processor cores, and 300GB of storage to itself.
The former production server, Baxter, has a 2.0GHz Athlon64 dual core processor (an x2 3200+) with 4GB of ECC memory and 200GB of RAID1 storage. The server is based on a Tyan Transport barebone package and used to run Gentoo Linux. We're planning to move primary serving responsibilities back to Baxter once we purchase some new hard drives for it, and replace Gentoo with Ubuntu 8.04 LTS.
The choice was mainly due to Graham's suggestion, and the fact that nobody else protested. Personally, I like Ubuntu because of the fact that it has a much more recent "stable" release than Debian, and that it's already on all of the servers I maintain at work... it gives me a single security mailing list to watch for all of my sysadmin hats. Ubuntu 8.04 also has recent enough versions of all the software needed to run Sycamore available via apt (except a tiny module, python-memcached) (upon checking, this appears to be true for Debian stable/lenny as well). -AV
If it matters, Amit, I didn't protest because I very much supported the decision (there just wasn't much to add because, as you say, nobody really objected). I've found the LTS series to be quite excellent: stable and well crafted. I had earlier assumed you had meant to use the Ubuntu Server Edition... did you? —JabberWokky
We also have another VPS dedicated to software development and testing. All three servers are colocated in the Cernio Tech Co-op's cabinets at United Layer's facility in the 200 Paul datacenter in San Francisco, California. In addition, we have access on short notice to server capacity in London, Minneapolis, and Santa Clara.
See also the page on Wiki hosting.
2009-8-23 Wikispot sysadmins addressed system performance issues by terminating a hung database process and improving the way the webserver handles requests. The hung database process resulted in approximately 5 minutes of downtime, and approximately 1 hour of sub-par performance. - Philip, Amit and Graham (with troubleshooting assistance from Jason)
OS maintenance - 8 Aug 2009
We updated a number of packages to slightly newer point releases, including python, PostgreSQL, and the kernel. We also changed the network configuration so as to clearly separate Wiki Spot's billable network traffic from Amit's billable network traffic. This resulted in an initial outage of approximately 30 seconds (while PostgreSQL restarted) and a second outage of approximately 2 minutes (while the server restarted). Things went very smoothly. - Amit and Graham
Big move/Reorganization - Mar 7 2009
For the weekend of Mar 7, Wikispot's host Cernio had scheduled a move of the entire contents of its two cabinets on the third floor of 200 Paul into three cabinets in a brand new data room on the first floor. In addition to having a ton of free space, this new room also theoretically wouldn't suffer from the cooling problems that plagued the old room. At the same time, our primary server baxter was in dire need of an OS replacement because its install of Gentoo had become unmaintainable due to unpredictable behavior during routine upgrades. Additionally, the SQL database was heavily bloated because we had left the PostgreSQL max_fsm_pages parameter at its default value, which was way too low. This was causing PostgreSQL to "forget" about regions of the database that needed to be garbage collected, and as a result the database was growing without bound: a freshly restored copy consumed only about 13GB, whereas the live one was nearly 90GB. The database bloat was the main reason for the progressive slowdown we had seen over the preceding few months.
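To put the bloat figures above in perspective, here is a rough back-of-the-envelope sketch. It assumes the quoted sizes are in binary gigabytes, that the database used PostgreSQL's default 8 kB page size, and that all of the excess space consisted of reclaimable pages; none of these were stated in the original notes, and the function names are purely illustrative.

```python
# Rough estimate of database bloat from the two sizes quoted above.
# Assumptions (not from the original notes): sizes are in GiB, the page
# size is PostgreSQL's default of 8 kB, and all excess space is dead pages.

PAGE_SIZE_BYTES = 8192
GIB = 1024 ** 3


def bloat_factor(live_gib: float, fresh_gib: float) -> float:
    """How many times larger the live database is than a fresh restore."""
    return live_gib / fresh_gib


def excess_pages(live_gib: int, fresh_gib: int) -> int:
    """Number of 8 kB pages taken up by the excess (bloated) space."""
    return (live_gib - fresh_gib) * GIB // PAGE_SIZE_BYTES


factor = bloat_factor(90, 13)
pages = excess_pages(90, 13)
print(f"bloat factor: about {factor:.1f}x")
print(f"excess pages: about {pages:,}")
```

Under those assumptions the live database was roughly seven times its restored size, with something on the order of ten million pages of excess space, which gives a feel for how far a too-small free space map could fall behind.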
Amit, Philip, and Graham tackled all of these issues simultaneously in the following way:
Amit moved his personal server downstairs first. Once he verified that it was in working order after the move, Philip made wikispot read-only and then performed a database dump.
Amit's personal server is configured with Xen, and he provisioned a virtual machine named leo, running Ubuntu 8.04 LTS, with resources comparable to those of baxter. He also copied over the Sycamore configuration/setup ahead of time, so that all that was needed for leo to take over wikispot responsibilities was a database restore.
After bringing up the VPS leo, they moved the wikispot database dump over to it and loaded it up. Everything worked as expected, and they silently switched IP addresses between leo and baxter.
After leo had taken over for wikispot, write access to the site was restored for the general public. Baxter was shut off, brought downstairs and racked. Amit then took two new hard drives that had been donated by Graham and installed them into baxter as its new primary drives. He installed Ubuntu 8.04 LTS on them, while leaving the old Gentoo drives untouched.
At this moment, wikispot is being served from the VPS leo until a later time when we will transition the services back over to baxter.
This whole process took about 9 hours, from roughly 5PM to 2AM (or 3AM by the clock, since the daylight saving time switch happened at 2AM).
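The 2AM versus 3AM wrinkle can be checked with a little date arithmetic. This is a minimal sketch, assuming the work started at exactly 17:00 Pacific Standard Time on Mar 7, 2009 and took exactly 9 hours (both are rounded figures from the notes above). It uses fixed UTC offsets rather than a timezone database so that it stays self-contained.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets for Pacific time before and after the spring-forward switch.
PST = timezone(timedelta(hours=-8), "PST")
PDT = timezone(timedelta(hours=-7), "PDT")

# Assumed start time and duration (rounded figures, see lead-in).
start = datetime(2009, 3, 7, 17, 0, tzinfo=PST)
elapsed = timedelta(hours=9)

end_pst = start + elapsed           # end time if clocks had not changed
end_pdt = end_pst.astimezone(PDT)   # same instant, read on a post-switch clock

print(end_pst.strftime("%Y-%m-%d %H:%M %Z"))
print(end_pdt.strftime("%Y-%m-%d %H:%M %Z"))
```

Both values describe the same instant: nine elapsed hours end at 2AM on the old offset, which a wall clock shows as 3AM once daylight saving time is in effect.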
Many thanks to Graham and the other Cernio people that took time to help us out despite the insanity of trying to get the rest of Cernio in order before the move-out deadline. This would've been a hell of a lot more difficult without your hard work and assistance!
Network maintenance - Jan 2009
I activated an additional network interface on the production server. -Graham (31 Jan 09 at about 15:30 California time)
Network maintenance - April 2007
What: At about 22:30 (UTC-7) on Mon 2 April 07, all Wiki Spot-hosted wikis went offline and remained offline for approximately 20 minutes, when service was restored at 22:50 (UTC-7).
Why: This was the result of two things: (1) my incomplete understanding of how Gentoo Linux init scripts work, and (2) Gentoo's use of an alpha-quality init script for the network functionality.
In coordination with Philip Neustrom, I began the process of switching Wiki Spot's main production server over to a new IP address in accordance with a migration process we agreed to with our colocation vendor. (Paraphrasing: Graham: "This should incur no downtime, but just in case, when can we afford 20 minutes of downtime?" Philip: "Actually, right now would be best." Graham: "OK, I'll get started.")
As per the documentation at http://www.gentoo.org/doc/en/handbook/handbook-x86.xml?part=4&chap=2 I started this by replacing the following lines in /etc/conf.d/net :
config_eth0=( "188.8.131.52 netmask 255.255.255.192 brd 184.108.40.206" )
routes_eth0=( "default gw 220.127.116.11" )
with:
config_eth0=( "18.104.22.168 netmask 255.255.255.192 brd 22.214.171.124" )
routes_eth0=( "default gw 126.96.36.199" )
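For readers unfamiliar with the netmask and broadcast ("brd") fields in a configuration like the one above, the following sketch shows how they relate. It deliberately does not use the addresses from this page: 192.0.2.10 is a made-up example from the range reserved for documentation, and the describe_interface helper is hypothetical.

```python
import ipaddress


def describe_interface(address: str, netmask: str) -> dict:
    """Derive the network details implied by an address and a netmask."""
    iface = ipaddress.ip_interface(f"{address}/{netmask}")
    net = iface.network
    return {
        "network": str(net.network_address),
        "prefix_length": net.prefixlen,
        "broadcast": str(net.broadcast_address),
        # The network and broadcast addresses are not assignable to hosts.
        "usable_hosts": net.num_addresses - 2,
    }


# Example address only; not one of Wiki Spot's real addresses.
info = describe_interface("192.0.2.10", "255.255.255.192")
print(info)
```

A netmask of 255.255.255.192 corresponds to a /26 network of 64 addresses, so the broadcast value follows from the address and netmask rather than being chosen freely.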
I then misread a section at http://www.gentoo.org/doc/en/handbook/handbook-x86.xml?part=2&chap=4 to mean that I could safely run "/etc/init.d/net restart". Unfortunately, while the "start()" function is required, the "restart()" function is not. Furthermore, the script lacked sanity checking sufficient to warn me that I had passed an invalid argument, and apparently just ran its usual "start()" function instead, despite the fact that the network and dependent services (including the various hosted wikis) were already running. This caused all of the wikis to "stop()" but not "start()" again, and did not put the network changes into effect. I then decided to restart the server (using "shutdown -r now") and hopped in the car for the 15-minute drive to the datacenter.
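The kind of sanity check that was missing can be sketched in a few lines. This is purely illustrative Python, not Gentoo's actual init system (which is written as shell scripts); the InitScript class and dispatch function are hypothetical names invented for the example.

```python
class InitScript:
    """Toy stand-in for a service init script (not Gentoo's real code)."""

    def __init__(self):
        self.running = True   # the service is already up, as in the incident
        self.log = []

    def start(self):
        self.running = True
        self.log.append("start")

    def stop(self):
        self.running = False
        self.log.append("stop")


def dispatch(script: InitScript, action: str) -> None:
    """Run an action only if the script supports it; otherwise fail loudly."""
    handlers = {"start": script.start, "stop": script.stop}
    if action not in handlers:
        supported = ", ".join(sorted(handlers))
        raise ValueError(f"unsupported action {action!r}; expected one of: {supported}")
    handlers[action]()


net = InitScript()
try:
    dispatch(net, "restart")   # this toy script does not define restart
    outcome = "ran"
except ValueError as exc:
    outcome = f"rejected: {exc}"
print(outcome)
```

Rejecting the unknown argument up front leaves the running service untouched, rather than falling through to some default behaviour.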
By the time I arrived at the datacenter, the server had come back online after a long filesystem check (due to it having been quite a while since the last one) and the various hosted wikis had come back online as well.
Because all of the services appear to have shut down cleanly, the only likely data loss would have been the result of wiki contributors being in the middle of making an edit and then clicking "save" while the server was still down.
We're documenting procedures such as these for future reference, and I intend to suggest a more widely-known and -supported Linux distribution for future production server deployments.
While we are all volunteers - indeed, none of us has ever been paid for our involvement in Wiki Spot, and many of us have donated considerable sums of money to make all of this happen - it's safe to say that none of us want to operate in a mode wherein 20 minutes of unplanned downtime is acceptable. I do apologize and will strive to avoid such mishaps in the future.
Network maintenance - February 2007
At about 19h00 on 19 Feb 07 there were two 3-second network outages in the course of collaborative maintenance between Cernio and Cernio's vendor as we troubleshot sub-par performance issues. We solved these problems by changing the network port configurations from auto-negotiate to manually-configured full-duplex 100Mb. —Graham Freeman
Network hardware change - December '06
Shortly before midnight on Thursday 7 Dec '06, Graham moved the server's network connection from one network switch to another. This resulted in two periods of downtime of approximately 15 seconds each. This was done in preparation for making significant changes to the configuration on the first switch. When that's done, we'll move back to the first one and make similar changes to the second switch. The end result will be better network management, greater expansion capabilities, and better protection against hardware failures.
Total downtime was about 30 seconds.
Rack re-install - November '06
Just after midnight on Sunday 19 Nov '06, Amit and Graham re-installed the server in a better position in the rack. Rather than sitting on a pair of rack trays, the server is now using rackmount rails. During this process, Graham and Amit moved the server's power outlet to a managed power distribution unit (PDU), which allows authorized personnel to remotely control the power feed to the server if necessary. Depending on the circumstances, this could save a trip in to the datacenter and could therefore result in greater system uptime in the event of a problem with the server.
Total downtime was about 20 minutes.
Moving Datacenters - October '06
In the early morning of Oct 22, 2006, GrahamFreeman and Users/AmitVainsencher moved DavisWiki's server from Cernio's cabinets at Sonic.net in Santa Rosa to United Layer's facility at 200 Paul Ave, San Francisco. This move was in line with Cernio's overall transition to United Layer as its colocation provider. The transition was made because the new data center offered better connectivity (for example, to countries in Eastern Europe and the Middle East) coupled with nearly a 50% reduction in bandwidth costs.
The transition went relatively painlessly. First, Amit made the wiki read-only, to prevent possible data loss. Then, he dumped the database and all files associated with the wiki to a system already in the new data center in SF. This was done to ensure that the site could be quickly restored if the primary server was damaged or destroyed in transit. After backing up, Graham and Amit pulled the server and loaded it into the back of Amit's car, and proceeded to drive to San Francisco (approximately 1AM). Fortunately, the backup precautions proved unnecessary, as the server was still alive and kicking on arrival.
Once Graham and Amit got the server running at the new location, Graham transitioned the DNS for all of the system's domains (see above for a list) to the server's new IP address, and configured a web traffic proxy from the system's old IP in Santa Rosa to the new one in San Francisco. This was done to ensure no traffic would be lost while the new DNS information was propagating (DNS changes can take anywhere from 30 minutes to 24 hours to fully propagate).
Once the proxy configuration was completed, Amit restored write access to the Wikis. He and Graham then left the San Francisco facility (around 4AM) in a triumphant, caffeinated haze.
Well done, gentlemen. -Users/JimStewart
Are we big in Eastern Europe and the Middle East? -Users/MattJurach
There are other sites hosted on this server, you know. -WilliamLewis
That, and there are plenty of other servers in the cabinet. The Cernio Tech Co-op has participants in California, New York, Hawai'i, British Columbia, England, the Czech Republic, Australia, etc... So, it's pretty important that these folks have good connectivity to the California-based servers. For Davis Wiki, it's maybe not as important, but as the world's best city wiki it's good that it remain responsive to everyone too. —GrahamFreeman
Why do you think, that DavisWiki is the world's best city wiki? What makes it best? —Users/WilhelmBuehler
I'm biased. That's why it's best. :) Seriously, I mean no disrespect whatsoever to the folks who've worked on other city/community wikis - I simply enjoy this one the most because it's focused on the community I'm most familiar with. —GrahamFreeman
What are some statistics on the site?
There are 1883 pages and 5144 registered users on this wiki.
You can find some auto-generated stats here. They are updated every 5 minutes or so.
Here is our bandwidth usage as of early October '06:
Here was our bandwidth usage in 2005 at some random point:
The difference between 2005 and 2006 seems to indicate we are nearly 10 times as active, at least on the days the figures were grabbed.
We also have lots of individual User Statistics.
You may be interested in the Wiki Community/Technical Discussion.
* Not really. :) It's actually LIGO