Wisconsin DPI All Staff

Number of Rows: 4,970,698

Data Coverage: 1994-1995 through 2018-2019 school years

Want to know which teachers at your school make the most? How old they are? How many years they've been teaching? Which schools employ the most staff? You've come to the right place!

The Wisconsin Department of Public Instruction publishes data on all public schools in the State of Wisconsin in their Staff Data Collections.

The most interesting of these datasets is the All Staff dataset, which comprises files dating back to the 1994-1995 school year.

We have compiled the raw data from the Wisconsin Department of Public Instruction into a format that is ready for importing into a database. You can download either the CSVs (recommended) and directly import them into the DBMS of your choice or download the SQL query if you are using a MySQL compatible DBMS such as MariaDB. Registered users can download the CSVs or the SQL below:

1995-2019, Standardized Into CSV Files (85.8 MB compressed, 7Z)
1995-2019, Imported Into MariaDB, SQL Query (177 MB compressed, ZIP)

We do not own or have any special rights to this data. This data belongs to the Wisconsin DPI and if you download it either from them or from us, you should credit them for it. We have simply taken the DPI's data, standardized it, combined it and made it publicly queryable. Below is the standardization process used to generate these files, if you wish to download the data directly from DPI using the link at the top of this page and perform the steps yourself.

The calendar year used for a particular school year refers to the year in which the school year concluded (e.g. 1995 represents the 1994-1995 school year).

At a high level, DPI's original all staff datasets comprise of the following formats:

1995-2008: Fixed-width TXT and DAT files
2009-2016: XLSX (Excel) files
2017-2019: XLSX reports downloaded from the new web reporting system

The original data is not very useful initially, largely because:

Fixed-width columnar files, such as those used from 1995-2008, must be transformed before they can be imported into a DBMS. This is a time-consuming and technical process.
The schemas used have changed over the years. From 1995-2019, there were 8 different schemas used. These schemas must all be reconciled and combined before analysis can be done between all these years.

Thus, once all the data is standardized into CSV format, an extensive ETL process follows which entails standardizing the disparate schemas into one schema, or consolidating all the data into a single table. Some details are provided below.

Part A: Data Standardization

Download the documentation from WI DPI for all corresponding data files. These were provided from 1995-2016. (Not necessary for 2017-2019.)
For each year:
Copy and paste the file record layout table into Excel.
Use an Excel formula to subtract 1 from the Start Position column.
Copy and paste these values (which we'll call modified position values) into a separate worksheet, labeled with the year corresponding to these values.
In Excel, select and copy all the column headers (for the 1995 file, delete the blank row from your copy). Then, right-click and transpose the data so the column names appear in one row instead of one column. Copy these into a separate worksheet, labeled with the year corresponding to the column names.
The 1996 TXT contains commas. Use Notepad's Replace All operation to replace "," with "$" (without quotes). The $ will be a placeholder character. We can't escape it at this point.
This next step uses GNU/*nix tools. We copied the TXT/DAT files to our database server (which is Linux-based) for this step. For all fixed-width files (1995-2008), perform the following:
1. Create an empty file called positionsfile. Paste the modified position values from Excel for a specific year into this file (the ones which subtracted 1 from Start Position).
2. Run the following SED command on each file, modifying the file names appropriately:
```
sed -f <(sort -rn positionsfile | sed -n 's:^[1-9][0-9]*$:s/./\&,/&:p') 95staff.txt > 95staff.csv
```
  This command scans the file and inserts commas at all appropriate locations to turn this into a CSV file. This takes about 5-10 minutes per file to run.
3. Copy the new CSV files back to your local machine.
4. Open the CSV file in Excel and insert a blank row at the top. Copy the row containing the column names from your other worksheet for this year and paste this into the top row. Now, the CSV file contain headings.
The next step needs to be performed on all XLSX files from 2009-2016:
1. Open the file in Excel and save the page containing the data as a CSV file.
2. Insert a blank row at the top. Copy the row containing the column names from your other worksheet for this year and paste this into the top row. Now, the CSV file contain headings.
For files from 2017-2019, first save as CSV; then, delete the row(s) at the top which contain metadata about which columns and rows were downloaded. The column headings should be the first row.
At this point, you have transformed all of DPI's raw data into CSV files, the first row of which contains the column names.
The following are known problems or errors in the data you will need to fix at this point:
- In the file for 2014, search for "Francis E Finco, Ed. D." (without quotes). Because this name has a comma, it interferes with the CSV format. Use Excel's Find+Replace to replace the "," with "\," (both without quotes). This escapes the comma so it won't be used as a delimeter when importing it into the database.
- The file from 2001 contains a formatting error on row 154,761. The value from the "Former last name" column has spilled over into two columns. Replace "Wiesend" with "Wiesend\, Molkentine" and shift the data in the remaining columns to the right to the left by one column.
- The total salary and total fringe columns in the 2017-2019 files are formatted as currency. In Excel, select these columns and change the formatting to TEXT instead. You should see the format change in this way: $42,031.00 to 42031
- The files from 2017-2019 contain lots of columns. Run a Find+Replace in Excel to replace "," with "\," (again, without quotes).
Copy all of the CSV files to a *nix server (preferably the database server). (You can delete any TXT or DAT files that are still there).

The CSV files at this point are the final, standardized, error-free CSV files that are published for download at the top of this page. We have simply compressed them all into a 7-Zip archive file. The compression ratio is approximately 97% (appx. 3 MB in the 7 ZIP file corresponding to 100 MB of CSV files).

Part B: Staging Table Creation

Although most DBMSs can easily import the data at this point, they cannot automatically create the table for you.
First, create a new database to house the tables. You can call it something like "widpi".
Copy all of the CSV files to a *nix server. (You can delete any TXT or DAT files that are still there).

Create the following shell script and call it createTable.sh:

#!/bin/sh
DELIM=","
CSV="$1"
FIELDS=$(head -1 "$CSV" | sed -e 's/'$DELIM'/` varchar(255),\n`/g' -e 's/\r//g')
FIELDS='`'"$FIELDS"'` varchar(255)'
echo "create table $1 ($FIELDS);" | sed -e 's/ `/`/g'

Make sure to use chmod to make the script executable.

Run the script for each disparate schema. What constitutes a disparate schema? Any easy way to see this is to create a checksum column in Excel using the row-format list of column names. To the left, use a formula like
```
=SUM(LEN($C2:$BL2))
```
. You will see 8 unique values across all 25 years. Or, you can take our word for it that the 8 distinct schemas are 1995, 1996, 1997-1998, 1999-2003, 2004-2007, 2008, 2009-2016, and 2017-2019. However, there's no harm in running this script for all 25 files.
Run the script by passing in the CSV file as an argument, e.g.
```
./createTable.sh 95staff.csv
```
It will echo out the SQL code to create a VARCHAR(255) column for every single column in the CSV code. Yes, this is lazy, but it doesn't matter since we're going to be combining these staging tables and revising the schema at the end.
Paste the echoed output from your terminal into a text file on your local machine. We recommend using a text editor like Notepad++. Manually change the table name so that it does not start with a number (e.g change 95staff.csv to staff95). Additionally, search the list of columns for "Filler". You cannot have duplicate column names, so rename each "Filler" column to something distinct, e.g. Filler1, Filler2, etc. These will get deleted later, anyways. Additionally, if there is a blank column at the end with no name (e.g. "", such as with the 1995 CSV file), give it a unique name.
Run the SQL code in your DBMS to create the table. The shell script above is MySQL-compatible; we don't guarantee it will work necessarily for other DBMSs. If you get an error from your DBMS, you may need to tweak it slightly.
Once the table has been created, you can load the CSV files into your database.
Repeat this process until you have 25 tables, one for each year. Make sure to use the correct SQL code to create the table for each year!

Part C: Loading Data Into Database

Connect to your DBMS and navigate to the correct database; then, issue a command like the following to load a file into your database:
```
LOAD DATA INFILE '/tmp/dpi/95staff.csv' INTO TABLE staff95 COLUMNS TERMINATED BY ',' LINES TERMINATED BY '\n' IGNORE 1 LINES;
	
```
Repeat the above for all 25 files. Relatively speaking, this is the easy part; it only takes 5-10 seconds to import each file into the appropriate table. Make sure that there are 0 warnings and 0 errors on each import. If there are errors, it's likely caused by extraneous commas in your CSV files that are not escaped, or currency formatting in the 2017-2019 files. If this occurs, truncate the table to delete all its data, fix the CSV file, copy it back to your database server, then re-import the data and ensure there are no problems.

Now, fix the errors in your tables caused by comma-escaping in the CSV files by running the following queries:

DELETE FROM `staff96` WHERE `ID number` = ""
UPDATE `staff14` SET `Admin Name` = "Francis E Finco, Ed. D." WHERE `Admin Name` LIKE CONCAT('%', "Francis E Finco, Ed. D.", '%');
UPDATE `staff17` SET `Assignment Grades Served` = TRIM(BOTH '"' FROM `Assignment Grades Served`);
UPDATE `staff18` SET `Assignment Grades Served` = TRIM(BOTH '"' FROM `Assignment Grades Served`);
UPDATE `staff19` SET `Assignment Grades Served` = TRIM(BOTH '"' FROM `Assignment Grades Served`);

This code gets rid of the double quotes surrounding values containing a comma that was escaped by a backslash. In other words, it restores it to the way the original data format before it was escaped using a backslash (which was required to import it into the database properly).

In the instructions, we fixed the CSV earlier so you don't need to do this, but it was at this point we realized the 2001 CSV contained a formatting error, so the following SQL command was used to fix it:
```
UPDATE staff01 SET `Former last name` = "Wiesend, Molkentine", `Contracted employee` = "N" WHERE `Former last name` = "Wiesend"
	
```
Again, we retroactively patched the CSV, so if you download the CSV from us, no action is needed. Otherwise, as indicated earlier in the instructions, you will need to fix this; if you didn't do this earlier, you can use the command above to fix it now. Afterwards, you will also need to delete the blank column at the end of the staff01 table. (If you fixed the CSV before importing, there won't be a blank column - it's the result of this formatting error pushing the rest of the row one cell to the right.)

At this point, we exported the database into a SQL file as follows:

mysqldump -u root -p widpi > dump.sql
zip -r dump.zip /tmp/dpi/dump.sql

The ZIP archive generated at this point is the one we have published for download at the top of this page. It has a compression ratio of approximately 93% (7 MB compressed for each 100 MB of SQL code).

Schema Merges

Now is the more difficult part. All of the data is now in our database, but it's in 25 separate tables. If you want to query data from a single year, that's fine, but if you want to work with data from multiple years, even with JOINs, it gets messy and difficult very quickly. We now need to merge all 25 staging tables into a single master table. Because there's no one "correct" way to proceed from here, we will not be publicly publishing any downloads past this point or explaining the documentation as thoroughly. The final table in use (which you can freely query) is the one that resulted from the process below. This is one method to reach a unified schema but certainly not the only one.

The good news is that we can start by merging tables that have identical schemas without much hassle. In fact, we can start by noting the difference between the 1996 schema and the 1997-1998 schema is the latter contains exactly one column in the same location which is now empty as opposed to occupy. In other words, we can use the 1996 schema for the 1997 and 1998 tables with no data loss.
First, we duplicated the structure of the 1996, 1999, 2004, 2009, and 2017 tables (a representative from each schema). If you are using phpMyAdmin, go to Operations → Copy table to and choose "Structure only" and name the table "multYYYYtoZZZZ", replacing the letters with the years involved. Otherwise, the CREATE TABLE statements below should do the trick.

These commands were used to merge tables of identical (or near-identical, in the case of 1996 and 1997/1998) schemas together:

CREATE TABLE mult1996to1998 LIKE staff1996; 
INSERT INTO mult1996to1998 SELECT * FROM staff1996 UNION ALL SELECT * FROM staff1997;
INSERT INTO mult1996to1998 SELECT * FROM staff1998;
CREATE TABLE mult1999to2003 LIKE staff1999;
INSERT INTO mult1999to2003 SELECT * FROM staff1999 UNION ALL SELECT * FROM staff2000;
INSERT INTO mult1999to2003 SELECT * FROM staff2001 UNION ALL SELECT * FROM staff2002;
INSERT INTO mult1999to2003 SELECT * FROM staff2003;
CREATE TABLE mult2004to2007 LIKE staff2004;
INSERT INTO mult2004to2007 SELECT * FROM staff2004 UNION ALL SELECT * FROM staff2005;
INSERT INTO mult2004to2007 SELECT * FROM staff2006 UNION ALL SELECT * FROM staff2007;
INSERT INTO mult2009to2016 SELECT * FROM staff2009 UNION ALL SELECT * FROM staff2010;
INSERT INTO mult2009to2016 SELECT * FROM staff2011 UNION ALL SELECT * FROM staff2012;
INSERT INTO mult2009to2016 SELECT * FROM staff2013 UNION ALL SELECT * FROM staff2014;
INSERT INTO mult2009to2016 SELECT * FROM staff2015 UNION ALL SELECT * FROM staff2016;
INSERT INTO mult2017to2019 SELECT * FROM staff2017 UNION ALL SELECT * FROM staff2018;
INSERT INTO mult2017to2019 SELECT * FROM staff2019;

Now, we're down to 5 merged tables as well as the 1995 and 2008 tables, which have their own unique schemas. That means we're down from 25 tables to 7 tables (you can now delete the other individual tables to free up disk space, but be careful not to delete the 1995 and 2008 ones yet). Now comes the harder part — our columns will need to be manually renamed, deleted, and our tables plain "refactored" before we can continue merging down until we get to one single table.

First, let's set empty monetary values to NULL instead of an empty string, so that we can set the datatypes of all the monetary columns to INT, not VARCHAR:

UPDATE mult2017to2019 SET `Total Salary` = NULL WHERE `Total Salary` = "";
UPDATE mult2017to2019 SET `Total Fringe` = NULL WHERE `Total Fringe` = "";
ALTER TABLE `mult2017to2019` CHANGE `Total Salary` `Total Salary` INT(8) NULL DEFAULT NULL;
ALTER TABLE `mult2017to2019` CHANGE `Total Fringe` `Total Fringe` INT(8) NULL DEFAULT NULL;

UPDATE mult2009to2016 SET `Tot Salary` = NULL WHERE `Tot Salary` = "";
UPDATE mult2009to2016 SET `Tot Fringe` = NULL WHERE `Tot Fringe` = "";
ALTER TABLE `mult2009to2016` CHANGE `Tot Salary` `Tot Salary` INT(8) NULL DEFAULT NULL;
ALTER TABLE `mult2009to2016` CHANGE `Tot Fringe` `Tot Fringe` INT(8) NULL DEFAULT NULL;

UPDATE mult2004to2007 SET `Salary` = NULL WHERE `Salary` = "";
UPDATE mult2004to2007 SET `Fringe` = NULL WHERE `Fringe` = "";
ALTER TABLE `mult2004to2007` CHANGE `Salary` `Salary` INT(8) NULL DEFAULT NULL;
ALTER TABLE `mult2004to2007` CHANGE `Fringe` `Fringe` INT(8) NULL DEFAULT NULL;

UPDATE mult1999to2003 SET `Salary` = NULL WHERE `Salary` = "";
UPDATE mult1999to2003 SET `Fringe` = NULL WHERE `Fringe` = "";
ALTER TABLE `mult1999to2003` CHANGE `Salary` `Salary` INT(8) NULL DEFAULT NULL;
ALTER TABLE `mult1999to2003` CHANGE `Fringe` `Fringe` INT(8) NULL DEFAULT NULL;

UPDATE mult1996to1998 SET `Salary` = NULL WHERE `Salary` = "";
UPDATE mult1996to1998 SET `Fringe` = NULL WHERE `Fringe` = "";
ALTER TABLE `mult1996to1998` CHANGE `Salary` `Salary` INT(8) NULL DEFAULT NULL;
ALTER TABLE `mult1996to1998` CHANGE `Fringe` `Fringe` INT(8) NULL DEFAULT NULL;

Now, ORDER BY on these columns will work as expected.

The 2008 table is a superset of the 2004-2007 table, so we can actually combine them without much effort — we just need to create 4 empty columns at the end of the 2004-2007 table before combining them (use the structure of the 2008 table) — if you don't have a Filler5 column, rename Filler4 to Filler5:

ALTER TABLE `mult2004to2007` ADD `filler11` VARCHAR(0) NOT NULL AFTER `Long term substitute`, ADD `filler12` VARCHAR(0) NOT NULL AFTER `filler11`;
ALTER TABLE `mult2004to2007` ADD `filler13` VARCHAR(0) NOT NULL AFTER `filler12`, ADD `filler14` VARCHAR(0) NOT NULL AFTER `filler13`;
CREATE TABLE mult2004to2008 LIKE staff2008;
INSERT INTO mult2004to2008 SELECT * FROM mult2004to2007 UNION ALL SELECT * FROM staff2008;

Now, let's combine the 1996-1998 and 1999-2003 tables — we'll need to add 2 empty columns to both, first — then clone the structure of the 1999-2003 table:

ALTER TABLE `mult1996to1998` ADD `filler11` VARCHAR(0) NOT NULL AFTER `Filler5`, ADD `filler12` VARCHAR(0) NOT NULL AFTER `filler11`;
ALTER TABLE `mult1999to2003` CHANGE `Filler3` `Fiscal agent (Special Education)` VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL;
ALTER TABLE `mult1999to2003` ADD `Program (Special Education)` VARCHAR(255) NOT NULL AFTER `Fiscal agent (Special Education)`, ADD `Program sequence (Special Education)` VARCHAR(255) NOT NULL AFTER `Program (Special Education)`;
CREATE TABLE mult1996to2003 LIKE mult1999to2003;
INSERT INTO mult1996to2003 SELECT * FROM mult1996to1998 UNION ALL SELECT * FROM mult1999to2003;

Now, we will combine the 1996-2003 and 2004-2008 tables. This involves adding multiple columns again in both tables in different locations to line them all up. We will clone the 1996-2003 table structure and use that to create the structure of the 1996-2008 table.

ALTER TABLE `mult1996to2003` ADD `filler11` VARCHAR(255) NOT NULL AFTER `Months employed`;
ALTER TABLE `mult2004to2008` ADD `filler11` VARCHAR(0) NOT NULL AFTER `Filler5`, ADD `filler12` VARCHAR(0) NOT NULL AFTER `filler11`, ADD `filler13` VARCHAR(0) NOT NULL AFTER `filler12`, ADD `filler14` VARCHAR(0) NOT NULL AFTER `filler13`;
ALTER TABLE `mult1996to2003` ADD `Contracted Agency` VARCHAR(255) NOT NULL AFTER `Long term substitute`, ADD `Contracted Agency Site` VARCHAR(255) NOT NULL AFTER `Contracted Agency`, ADD `b1` VARCHAR(255) NOT NULL AFTER `Contracted Agency Site`, ADD `b2` VARCHAR(255) NOT NULL AFTER `b1`;
CREATE TABLE mult1996to2008 LIKE mult1996to2003;
INSERT INTO mult1996to2008 SELECT * FROM mult1996to2003 UNION ALL SELECT * FROM mult2004to2008;

You may need to be patient. The last INSERT statement above took 86 seconds for our server to run.

At this point, we are now down to 4 schemas! 1995, 1996-2008, 2009-2016, and 2017-2019.

ALTER TABLE `mult1996to2008` DROP `filler11`;
ALTER TABLE `staff1995` CHANGE `Mail Zip code (99999?9999)` `Mail Zip code (99999-9999)` VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL;
ALTER TABLE `staff1995` CHANGE `Ship Zip code (99999?9999)` `Ship Zip code (99999-9999)` VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL;
ALTER TABLE `mult1996to2008` CHANGE `Mail Zip code (99999?9999)` `Mail Zip code (99999-9999)` VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL;
ALTER TABLE `mult1996to2008` CHANGE `Ship Zip code (99999?9999)` `Ship Zip code (99999-9999)` VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL;
ALTER TABLE `mult1996to2008` CHANGE `Filler5` `Total 3rd Friday Enrollment` VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL;

The resulting table at this point is the one you can freely analyze by using our Insights platform to query the data (yes, all this work just to get DPI's raw data into a uniform database table without any data loss, really!). While we are not publishing the final database for download, we've documented this process transparently so interested parties can recreate it themselves somewhat easily (although it will still take hours), but if you're just interesting in analyzing the data and forming your own insights, please feel free to use our platform to do so — that's why we've made it available!