Free Salary Data from Thousands of Companies

July 3rd, 2008

This site has salary data from over forty thousand companies. It tells you the actual salaries of real employees, along with the company they work for, the state they work in, what their job title is, and the industry sector.

The search works well.

U.S. Census Data

May 12th, 2008

The U.S. Census makes a tremendous amount of data available on its website. I’ve downloaded a small chunk of their 2000 census data for Massachusetts but haven’t gotten around to figuring out the format yet.

Free Data from the NOAA

May 11th, 2008

If you want climate data, the National Climatic Data Center offers a great site with lots of free downloadable data sets.

As is typical of many government sites, much of the data is available only on CD-ROM or as PDF, or in formats that you will have to spend some time parsing before you can work with them.

Free Economic, Demographic & Financial Data from FreeLunch.com

May 11th, 2008

A great site for downloading high-quality data sets is FreeLunch.com. You have to register to download data, but once you are registered it’s easy to grab a ton of datasets.

Some of them are scant; some are pretty comprehensive.

For the former, see the Government spending set: 20 rows that show only aggregate USG spending. For an example of the latter, try “Nasdaq: Composite Index, (Index Feb 05 1971=100)”, which has over 9,000 rows.

24 Good Datasets

May 11th, 2008

Here’s a list of various datasets compiled by someone at Drexel University.

A lot of these are a bit old but interesting nonetheless.

The sets include:

Netflix Prize Dataset

May 11th, 2008

The Netflix Prize dataset is huge and freely available to download. Here is a description from Wikipedia:

Netflix provided a training data set of over 100 million ratings that over 480,000 users gave to nearly 18,000 movies. Each training rating is a quadruplet <user, movie, date of grade, grade>. The user and movie fields are integer IDs, while grades are from 1 to 5 (integral) stars.[3]

The qualifying data set contains over 2.8 million triplets <user, movie, date of grade>, with grades known only to the jury. A participating team’s algorithm must predict grades on the entire qualifying set, but they are only informed of the score for half of the data, the quiz set. The other half is the test set, and performance on this is used by the jury to determine potential prize winners. Only the judges know which ratings are in the quiz set, and which are in the test set—this arrangement is intended to make it difficult to hill climb on the test set. Submitted predictions are scored against the true grades in terms of root mean squared error (RMSE), and the goal is to reduce this error as much as possible. Note that while the actual grades are integers in the range 1 to 5, submitted predictions need not be.

For each movie, title and year of release is provided in a separate dataset. No information at all is provided about users.[2]

In order to protect the privacy of customers, “some of the rating data for some customers in the training and qualifying sets have been deliberately perturbed in one or more of the following ways: deleting ratings; inserting alternative ratings and dates; and modifying rating dates”.[2]

The training set is such that the average user rated over 200 movies, and the average movie was rated by over 5000 users. But there is wide variance in the data—some movies in the training set have as little as 3 ratings,[4] while one user rated over 17,000 movies.[5]
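
The scoring rule in the quoted description is easy to try out on a toy example. Here is a minimal sketch in Python, assuming the predictions and true grades are already lined up in two lists of equal length. The `Rating` record and the `rmse` function are my own illustrative names, not part of the Netflix download, and the sample values are made up.

```python
import math
from typing import NamedTuple, Sequence


class Rating(NamedTuple):
    """One training row: <user, movie, date of grade, grade> (hypothetical layout)."""
    user: int    # integer user ID
    movie: int   # integer movie ID
    date: str    # date of the grade; the string format here is an assumption
    grade: int   # integer from 1 to 5


def rmse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Root mean squared error between predictions and true grades."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_error / len(actual))


# Made-up rows, only to show the shape of the data.
truth = [
    Rating(user=1, movie=10, date="2005-01-01", grade=4),
    Rating(user=2, movie=10, date="2005-01-02", grade=2),
]
# Predictions need not be integers.
guesses = [3.5, 2.5]
score = rmse(guesses, [r.grade for r in truth])
```

A lower score is better; a perfect set of predictions would give an RMSE of zero.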