CodeHerb home

Analyzing the H-1B Data - Part 1
Published on: 30 Apr 2011

Recently I stumbled upon this corpus of H-1B data. Now it seemed like a good data set to explore so I decided to download the H-1B iCert LCA Data files for the years 2002-2010.

../images/h1bvisa.jpg

After downloading the first step was to try and load it into a mySQL database. While the downloading was effortless using wget, loading the data into a mySQL database turned out to be a much bigger hassle than I anticipated - specially since some of the files had different headers than the other. Anyway to cut a long story short, I ended up writing this script to take a CSV file and create a table from the header! Now the table doesn't do the data types but it does give out the SQL / create the table so that you can edit the table data types either in text or using SQLyog Community Edition.

 func getTableSQL($header, $tableName, :$defaultLen=100, :$engine="InnoDB", :$dbh=undef) {
    my ($rSql, $cols);
    my $csv = Text::CSV->new();
    if ($csv->parse($header)) {
        foreach ($csv->fields()){
            $cols .= "`$_` varchar($defaultLen) DEFAULT NULL, \n";
        }
        $rSql = "CREATE TABLE `$tablename` ( $cols ) ENGINE=$engine DEFAULT CHARSET=latin1;";
        substr $rSql, rindex($rSql,','), 1, ''; #replace last comma
    } else {
        say "Failed to parse line:".$csv->error_input;
    }

    if($dbh) {
        return $dbh->do($rSql);
    }
    else {
        return $rSql;
    }
}

The code is fed the header as a string, and it produces a CREATE TABLE command from the header. Optionally, if you provide it a database handle it will go ahead and create the table for you as well. I know a lot of people seem to think perl is ugly yada yada yada, but its quick and CPAN has a module for everything. So while it might not be as glamorous as Ruby, Python etc. it is kind of awesome in its own way

http://www.perlbuzz.come

Anyway now that I had the tables ready to go I decided to load the data in using the the MySQL LOAD DATA INFILE command only to hit this unresolved bug. Really MySQL - you can't load a freaking data file into the database! Anyways, after doing some more nifty perl coding (to come in a later post) I was finally able to get all the data loaded into the database.

A quick snapshot of the approximate number of LCA efilings for each year!

../images/h1filings.png

YearFilings
FY02123712
FY03222789
FY04310008
FY05307499
FY06381513
FY07425396
FY08405940
FY09269641
FY10360375

As you can see from the graph and the data table, there was an expected dip in the H-1B filings during the recession(2009). I'll be doing some more analysis on the data soon so if there is anything in particular you will like to know about this data, leave some comments and I'll see what I can do!