Concept
datamasher.org is a website that hosts many datasets with results broken down state by state. I wanted to mine those datasets and have a script automatically compare them against each other to see which ones are correlated.
Process
I mine the highest rated list on datamasher.org for links to datasets, then I open each dataset and store its state data. Once I have all of the datasets, I compare each one against all the others to calculate a distance. I insert each dataset comparison into a list kept sorted by distance value. The lower the distance value, the higher the correlation.
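The compare-and-sort step can be sketched roughly like this in Python. This is only an illustration, not the code from the zip: the function names (normalize, distance, rank_pairs) are made up for the example, and the write-up doesn't say which distance measure was used, so the sketch assumes Euclidean distance over min-max normalized values. It also assumes each dataset has already been loaded into a dict mapping state to a number.

```python
from bisect import insort
from itertools import combinations
from math import sqrt


def normalize(values):
    """Rescale a {state: value} dict to the 0..1 range (min-max).

    Assumption: datasets use different units, so they are put on a
    common scale before being compared.
    """
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    return {s: (v - lo) / span if span else 0.0 for s, v in values.items()}


def distance(a, b):
    """Euclidean distance over the states present in both datasets.

    Assumption: the original distance measure is not stated; this is
    one plausible choice.
    """
    shared = a.keys() & b.keys()
    return sqrt(sum((a[s] - b[s]) ** 2 for s in shared))


def rank_pairs(datasets):
    """Compare every dataset against every other one.

    Returns a list of (distance, name_a, name_b) tuples, kept sorted
    so the smallest distance comes first.
    """
    scaled = {name: normalize(vals) for name, vals in datasets.items()}
    ranked = []
    for name_a, name_b in combinations(sorted(scaled), 2):
        d = distance(scaled[name_a], scaled[name_b])
        insort(ranked, (d, name_a, name_b))  # insert in sorted position
    return ranked
```

Calling rank_pairs with a dict of {dataset name: {state: value}} gives back every pair exactly once, closest pair first. One caveat with this particular measure: two datasets that move in opposite directions end up with a large distance, so it only surfaces positive relationships.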
Results
The results are stored in two files, results.txt and summary.txt. I didn’t create a graph comparison because I’ve already invested so much time in the data mining process and have to prepare for a career fair.
Here are the output files: summary.txt results.txt
The topmost comparisons in summary.txt and results.txt are the ones with the lowest distance, meaning the highest correlation.
Interesting results are correlations between:
% of Population Covered by Health Insurance AND 2008 ACT Average Composite State Scores
% of Population Covered by Health Insurance AND Median Age: Census 2008
Population: Census 2008 AND # of US Representatives
… there are more cool ones, just scroll through summary.txt
Code: OSTRANDER3.zip
