System Info


War_Room.jpg

The top secret wiki control center located hundreds of feet underground in a nuclear-safe bunker.*

How can I get status updates if this site is offline?

You can follow [WWW]@wikispot_status on Twitter.

You can also join us on IRC in the #wikispot channel on irc.freenode.net. However, if we're busy resolving an outage, we're probably not going to be very chatty!

Who maintains the software that runs this site?

See Development

Who maintains the hardware that runs this site?

[davis]Amit Vainsencher and Graham Freeman

What is your server setup?

As of Mar 7, 2009, we are running on a Xen virtual private server (VPS) named Leo, hosted on Amit's personal hardware. The host machine has two Opteron 275s (2.2GHz) for a total of four processor cores, 8GB of DDR1-400MHz ECC/Registered memory, and a four-drive 1.5TB hardware RAID5 array running atop a 3ware 9550SXU (with a battery backup for the write cache). Of these resources, Leo has 4GB of memory, two processor cores, and 300GB of storage to itself.

The former production server, Baxter, has a 2.0GHz Athlon64 dual core processor (an x2 3200+) with 4GB of ECC memory and 200GB of RAID1 storage. The server is based on a [WWW]Tyan Transport barebone package and used to run [WWW]Gentoo Linux. We're planning to move primary serving responsibilities back to Baxter once we purchase some new hard drives for it, and replace Gentoo with [WWW]Ubuntu 8.04 LTS.

We also have another VPS dedicated to software development and testing. All three servers are colocated in the [WWW]Cernio Tech Co-op's cabinets at United Layer's facility in the 200 Paul datacenter in San Francisco, California. In addition, we have access on short notice to server capacity in London, Minneapolis, and Santa Clara.

See also the page on [davis]Wiki hosting.

Recent Maintenance


2009-8-23 Wikispot sysadmins addressed system performance issues by terminating a hung database process and improving the way the webserver handles requests. The hung database process resulted in approximately 5 minutes of downtime, and approximately 1 hour of sub-par performance. - Philip, [davis]Amit and Graham (with troubleshooting assistance from Jason)


OS maintenance - 8 Aug 2009

We updated a number of packages to slightly newer point releases, including python, PostgreSQL, and the kernel. We also changed the network configuration so as to clearly separate Wiki Spot's billable network traffic from Amit's billable network traffic. This resulted in an initial outage of approximately 30 seconds (while PostgreSQL restarted) and a second outage of approximately 2 minutes (while the server restarted). Things went very smoothly. - [davis]Amit and Graham


Big move/Reorganization - Mar 7 2009

The weekend of Mar 7, Wikispot's host Cernio was scheduled to move the entirety of its two cabinets on the third floor of 200 Paul into three cabinets on the first floor in a brand new data room. In addition to having a ton of free space, this new room also theoretically wouldn't suffer from the cooling problems that plagued the old room. At the same time, our primary server Baxter was in dire need of an OS replacement because its install of Gentoo had become unmaintainable due to unpredictable behavior during routine upgrades. Additionally, the SQL database was heavily bloated because we had left the PostgreSQL max_fsm_pages parameter at its default value, which was way too low. This was causing PostgreSQL to "forget" about regions of the database that needed to be garbage collected, and as a result the database was growing without bound: a freshly restored copy consumed only about 13GB, whereas the live one was nearly 90GB. The database bloat was the main reason for the progressive slowdown we've seen in the past few months.
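As a rough illustration of the sizes quoted above, the following Python sketch computes how far the database had grown beyond a fresh restore. The function names are hypothetical, and the limit check is only a simplified model of how a too-low max_fsm_pages lets free space go untracked; it is not PostgreSQL's actual bookkeeping.

```python
def bloat_factor(current_gb: float, fresh_gb: float) -> float:
    """Return how many times larger the live database is than a fresh restore."""
    if fresh_gb <= 0:
        raise ValueError("fresh_gb must be positive")
    return current_gb / fresh_gb


def untracked_pages(pages_with_free_space: int, max_fsm_pages: int) -> int:
    """Simplified model (hypothetical helper): pages beyond the configured
    limit are not tracked, so their free space cannot be reused."""
    return max(0, pages_with_free_space - max_fsm_pages)


# Sizes reported above: about 13GB freshly restored, nearly 90GB live.
factor = bloat_factor(90, 13)
wasted_gb = 90 - 13
print(f"bloat factor: {factor:.1f}x, reclaimable: about {wasted_gb}GB")
```

With the figures from this entry, the live database was roughly seven times the size of a clean copy, which is consistent with the slowdown described.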

Amit, Philip, and Graham tackled all of these issues simultaneously in the following way:

Network maintenance - Jan 2009

I activated an additional network interface on the production server. -Graham (31 Jan 09 at about 15:30 California time)

Network maintenance - April 2007

What: At about 22:30 (UTC-7) on Mon 2 April 07, all Wiki Spot-hosted wikis went offline and remained offline for approximately 20 minutes, when service was restored at 22:50 (UTC-7).

Why: This was the result of two things: (1) my incomplete understanding of how Gentoo Linux init scripts work, and (2) Gentoo's use of an alpha-quality init script for the network functionality.

Background:

In coordination with Philip Neustrom, I began the process of switching Wiki Spot's main production server over to a new IP address in accordance with a migration process we agreed to with our colocation vendor. (Paraphrasing: Graham: "This should incur no downtime, but just in case, when can we afford 20 minutes of downtime?" Philip: "Actually, right now would be best." Graham: "OK, I'll get started.")

As per the documentation at [WWW]http://www.gentoo.org/doc/en/handbook/handbook-x86.xml?part=4&chap=2 I started this by replacing the following lines in /etc/conf.d/net :

with these:

I then misread a section at [WWW]http://www.gentoo.org/doc/en/handbook/handbook-x86.xml?part=2&chap=4 to mean that I could safely run "/etc/init.d/net restart". Unfortunately, while an init script is required to provide a "start()" function, it is not required to provide a "restart()" function. Furthermore, the script lacked sanity checking sufficient to warn me that I had passed an invalid argument, and apparently it just ran its usual "start()" function instead, despite the fact that the network and dependent services (including the various hosted wikis) were already running. This caused all of the wikis to "stop()" but not "start()" again, and did not put the network changes into effect. I then decided to restart the server (using "shutdown -r now") and hopped in the car for the 15-minute drive to the datacenter.

By the time I arrived at the datacenter, the server had come back online after a long filesystem check (due to it having been quite a while since the last one) and the various hosted wikis had come back online as well.

Because all of the services appear to have shut down cleanly, the only likely data loss would have been the result of wiki contributors being in the middle of making an edit and then clicking "save" while the server was still down.
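The missing sanity check described above can be sketched in Python. This is not the Gentoo init script (which is a shell script); the Service class and dispatch function are hypothetical and only illustrate rejecting an unsupported action instead of falling through to "start()".

```python
class Service:
    """Toy stand-in for a service managed by an init script (hypothetical)."""

    def __init__(self, running: bool = False) -> None:
        self.running = running

    def start(self) -> None:
        # Refuse to start twice rather than disturbing a running service.
        if self.running:
            raise RuntimeError("service is already running")
        self.running = True

    def stop(self) -> None:
        self.running = False


def dispatch(service: Service, action: str) -> None:
    """Run only actions that are actually implemented; reject the rest."""
    handlers = {"start": service.start, "stop": service.stop}
    if action not in handlers:
        raise ValueError(f"unsupported action: {action!r}")
    handlers[action]()


network = Service(running=True)
try:
    dispatch(network, "restart")  # not implemented in this sketch
except ValueError as error:
    print(error)
```

In this sketch the unsupported "restart" request fails loudly and the service keeps running, which is the behavior the real script did not provide.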

Follow-up:

We're documenting procedures such as these for future reference, and I intend to suggest a more widely-known and -supported Linux distribution for future production server deployments.

While we are all volunteers - indeed, none of us has ever been paid for our involvement in Wiki Spot, and many of us have donated considerable sums of money to make all of this happen - it's safe to say that none of us want to operate in a mode wherein 20 minutes of unplanned downtime is acceptable. I do apologize and will strive to avoid such mishaps in the future.

Graham

Network maintenance - February 2007

At about 19h00 on 19 Feb 07 there were two 3-second network outages in the course of collaborative maintenance between Cernio and Cernio's vendor as we troubleshot sub-par performance issues. We solved these problems by changing the network port configurations from auto-negotiate to manually-configured full-duplex 100Mb. —Graham Freeman

Network hardware change - December '06

Shortly before midnight on Thursday 7 Dec '06, Graham moved the server's network connection from one network switch to another. This resulted in two periods of downtime of approximately 15 seconds each. This was done in preparation for making significant changes to the configuration on the first switch. When that's done, we'll move back to the first one and make similar changes to the second switch. The end result will be better network management, greater expansion capabilities, and better protection against hardware failures.

Total downtime was about 30 seconds.

Rack re-install - November '06

Just after midnight on Sunday 19 Nov '06, Amit and Graham re-installed the server in a better position in the rack. Rather than sitting on a pair of rack trays, the server is now using rackmount rails. During this process, Graham and Amit moved the server's power outlet to a managed power distribution unit (PDU), which allows authorized personnel to remotely control the power feed to the server if necessary. Depending on the circumstances, this could save a trip in to the datacenter and could therefore result in greater system uptime in the event of a problem with the server.

Total downtime was about 20 minutes.

Moving Datacenters - October '06

In the early morning of Oct 22, 2006, GrahamFreeman and [davis]Users/AmitVainsencher moved DavisWiki's server from the [davis]Cernio cabinets at [WWW]Sonic.net in Santa Rosa to [WWW]United Layer's facility at 200 Paul Ave, San Francisco. This move was in line with Cernio's overall transition to United Layer as its colocation provider. The transition was made because the new data center offered better connectivity (for example, to countries in Eastern Europe and the Middle East) coupled with nearly a 50% reduction in bandwidth costs.

The transition went relatively painlessly. First, Amit made the wiki read-only, to prevent possible data loss. Then, he dumped the database and all files associated with the wiki to a system already in the new data center in SF. This was done to ensure that the site could be quickly restored if the primary server was damaged or destroyed in transit. After backing up, Graham and Amit pulled the server and loaded it into the back of Amit's car, and proceeded to drive to San Francisco (approximately 1AM). Fortunately, the backup precautions proved unnecessary, as the server was still alive and kicking on arrival.

Once Graham and Amit got the server running at the new location, Graham transitioned the DNS for all of the system's domains (see above for a list) to the server's new IP address, and configured a web traffic proxy from the system's old IP in Santa Rosa to the new one in San Francisco. This was done to ensure no traffic would be lost while the new DNS information was propagating (DNS changes can take anywhere from 30 minutes to 24 hours to fully propagate).

Once the proxy configuration was completed, Amit restored write access to the Wikis. He and Graham then left the San Francisco facility (around 4AM) in a triumphant, caffeinated haze.
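For anyone repeating a move like this, one way to watch the DNS cutover is to check which hostnames already resolve to the new address. The Python sketch below is an assumption-laden illustration, not the procedure Graham and Amit used: the function name, hostnames, and addresses are hypothetical (the addresses come from the range reserved for documentation), and it only reports what the local resolver sees.

```python
import socket
from typing import Callable, Dict, Iterable


def propagation_status(
    hostnames: Iterable[str],
    new_ip: str,
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> Dict[str, bool]:
    """Map each hostname to True if it currently resolves to new_ip."""
    status = {}
    for name in hostnames:
        try:
            status[name] = resolve(name) == new_ip
        except OSError:
            # Resolution failed; treat the name as not yet moved.
            status[name] = False
    return status


# Offline demonstration with a fake resolver and documentation-only addresses.
fake_records = {"wiki.example.org": "192.0.2.20", "old.example.org": "192.0.2.10"}
result = propagation_status(
    fake_records, "192.0.2.20", resolve=fake_records.__getitem__
)
print(result)
```

Names that still resolve to the old address are the ones that depend on the proxy from the old IP to the new one until their cached records expire.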

What are some statistics on the site?

There are 1891 pages and 5179 registered users on this wiki.

You can find some auto-generated stats [WWW]here. They are updated every 5 minutes or so.

Here is our bandwidth usage as of early October '06:

07oct06-baxter-net-stats.png

Here was our bandwidth usage in 2005 at some random point:

bandwidth05.png

The difference between 2005 and 2006 seems to indicate we are nearly 10 times as active, at least on the day the figure was grabbed.

We also have lots of individual User Statistics.

You may be interested in the [davis]Wiki Community/Technical Discussion.

Thanks to...

The Wiki Spot project wouldn't be possible without the awesomeness that is: [WWW]GNU/[WWW]Linux, [WWW]Python, [WWW]PostgreSQL, [WWW]Xapian, [WWW]memcached, and [WWW]lighttpd.


* Not really. :) It's actually [wikipedia]LIGO

This is a Wiki Spot wiki. Wiki Spot is a 501(c)3 non-profit organization that helps communities collaborate via wikis.