The National Alliance of Concurrent Enrollment Partnerships

2015-16 Civil Rights Data Collection (CRDC)

Advanced Placement (AP) v. Dual Enrollment (DE)

Initial Filtration

Alijah O'Connor - 2018



This entire project rests on accurately establishing a dataset that includes only 'Traditional High Schools.' The definition we have chosen centers on schools that serve 11th or 12th grade and are not special education, juvenile justice, or alternative schools; additional filters remove virtual schools, adult schools, and (with some exceptions) schools without a matching National Center for Education Statistics (NCES) identifier. Note that the dataset used herein includes both the 2015-2016 CRDC school information and information gathered in the 2015-2016 NCES National Survey. Below is the official filtration procedure (a condensed sketch of the combined filters follows the list):

- Join the separate (uncompiled) 2015-2016 NCES files into one dataset
- Join the compiled NCES dataset with the 2015-2016 CRDC dataset

- Filter Out (Dataset Attribute in Parentheses)
    - Special Education, Alternative, Juvenile Justice Schools (CRDC)
    - Schools without 11th or 12th Grade (CRDC)
    - Virtual Schools (NCES)
    - Schools reported as 'Elementary', 'Middle', or 'Other' (NCES)
    - Special Education, Alternative/Other, and "Adult" Schools (NCES)
- Recover Some Schools that did not have matching NCES identifiers
    - Join recovered schools with dataset
    - Remove any remaining schools that did not match
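
The steps above are implemented cell-by-cell in this notebook. As a condensed sketch, the combined filters amount to the following boolean mask (field names are the CRDC/NCES columns used later; the keyword-based name filters are applied separately):

def traditional_high_school_mask(df):
    # Condensed sketch of the filters applied below, not the cell-by-cell implementation
    has_11_or_12 = (df.SCH_GRADE_G11 == 'Yes') | (df.SCH_GRADE_G12 == 'Yes')                    # CRDC
    not_special = (df.JJ == 'No') & (df.SCH_STATUS_ALT == 'No') & (df.SCH_STATUS_SPED == 'No')  # CRDC
    not_virtual = df.VIRTUAL != 'Yes'                                                           # NCES
    high_level = df.LEVEL.isin(['3', '4', 'Missing'])                                           # NCES: not Elementary/Middle/'N'
    regular_or_vocational = df.SCH_TYPE.isin([1, 3, 'Missing'])                                 # NCES: not SpEd/Alt/Adult
    return has_11_or_12 & not_special & not_virtual & high_level & regular_or_vocational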


In [1]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
Out[1]:
In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from my_functions import combokey_converter

%matplotlib inline
sns.set_style('whitegrid')
plt.rc('axes', titlesize = 14, titleweight = 'bold', labelweight = 'bold')

Data Cleaning/Joining

crdc_1516 Data

Used combokey_converter.convert to create a CSV-compatible "COMBOKEY"

In [3]:
crdc_1516 = pd.read_csv('../filtered_data/00_crdc_1516_initial.csv', 
                        dtype = {'LEAID':np.object})
In [4]:
crdc_1516['COMBOKEY'] = combokey_converter.convert(crdc_1516, 'LEAID', 'SCHID')
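
combokey_converter.convert is a small project helper that is not shown in this notebook. Judging from the COMBOKEY values that appear later (e.g. ='130129003727'), it presumably behaves roughly like the sketch below; the exact zero-padding widths are an assumption.

# Rough sketch of what combokey_converter.convert likely does (an assumption, not the actual helper):
# concatenate the 7-digit LEAID with the 5-digit, zero-padded SCHID and wrap the result in ='...'
# so that spreadsheet programs preserve the leading zeros when the CSV is opened.
def convert_sketch(df, leaid_col, schid_col):
    leaid = df[leaid_col].astype(str).str.zfill(7)
    schid = df[schid_col].astype(str).str.zfill(5)
    return "='" + leaid + schid + "'"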

Compiling NCES 1516 Data

The nces_1516 data was recorded in separate files (each with a different number of schools), so they are joined here to avoid loss or corruption of data.
Files

1. Characteristics 
2. Directory 
3. Geographic 

Like the crdc data, the combokey field was generated using my combokey_converter.convert function.

In [5]:
nces_1516_characteristics = pd.read_csv('../filtered_data/01_nces_1516_initial_school_characteristics.csv')
In [6]:
nces_1516_characteristics['combokey'] = combokey_converter.convert(nces_1516_characteristics, 'LEAID', 'SCHID')
In [7]:
nces_1516_directory = pd.read_csv('../filtered_data/01_nces_1516_initial_school_directory.csv')
In [8]:
nces_1516_directory['combokey'] = combokey_converter.convert(nces_1516_directory, 'LEAID', 'SCHID')

First Join: Directory + Characteristics

In [9]:
nces_1516 = nces_1516_characteristics.set_index('combokey').join(nces_1516_directory.set_index('combokey'),
                                                                 how = 'inner', lsuffix = 'dir_')
In [10]:
str(len(nces_1516.index)) + ' Schools Matched.'
Out[10]:
'100232 Schools Matched.'
In [11]:
nces_1516 = nces_1516.drop(['LEAIDdir_', 'SCHIDdir_', 'SCH_NAMEdir_'], axis = 1)

Second Join: combined + geo

In [12]:
nces_1516_geo = pd.read_csv('../filtered_data/01_nces_1516_initial_geographic.csv',  dtype = {'LOCALE15': np.object})
In [13]:
nces_1516_geo['combokey'] = combokey_converter.convert(nces_1516_geo, 'LEAID', 'SCHID')
In [14]:
nces_1516_test = nces_1516.join(nces_1516_geo.set_index('combokey'), how = 'inner', rsuffix = 'dir_')
In [15]:
nces_1516_full = nces_1516_test.drop(['LEAIDdir_', 'SCHIDdir_', 'NAME'], axis = 1)
In [16]:
str(len(nces_1516_full.index)) + ' Schools Matched'
Out[16]:
'100096 Schools Matched'
In [17]:
# nces_1516_full.to_csv('../filtered_data/01_nces_1516_initial_combined_ccd.csv')

NCES (combined) and 2015-2016 CRDC join

In [18]:
crdc_nces1516_test = crdc_1516.set_index('COMBOKEY').join(nces_1516_full, how = 'left', rsuffix=('_'))
In [19]:
str(crdc_nces1516_test[crdc_nces1516_test.SCH_NAME_.isnull()].LEAID.count()) + ' Schools Did Not Have Matching NCES Identifiers'
Out[19]:
'3854 Schools Did Not Have Matching NCES Identifiers'
In [20]:
crdc_nces_1516 = crdc_nces1516_test.drop(['LEA_NAME_', 'LEAID_', 'SCHID_', 'SCH_NAME_'], axis = 1)
In [21]:
str(len(crdc_nces_1516.index)) + ' Total Schools in the Combined DataFrame'
Out[21]:
'96360 Total Schools in the Combined DataFrame'
In [22]:
crdc_nces_1516 = crdc_nces_1516.fillna('Missing')
In [23]:
# crdc_nces_1516.to_csv('../filtered_data/03_crdc_nces_1516_raw_combined.csv')

Filtration

Remove Schools without 11th or 12th Grade (CRDC)

In [24]:
filter1_crdc_nces_1516 = crdc_nces_1516.copy()
In [25]:
from my_functions.extra_functions import students_in_11_or_12
filter1_crdc_nces_1516['Students_in_11_12'] = filter1_crdc_nces_1516.apply(lambda row: students_in_11_or_12(row['SCH_GRADE_G11'], row['SCH_GRADE_G12']), axis = 1)
In [26]:
filtered_out = filter1_crdc_nces_1516[(filter1_crdc_nces_1516.Students_in_11_12 == 'No')]
filter1_crdc_nces_1516 = filter1_crdc_nces_1516[(filter1_crdc_nces_1516.Students_in_11_12 == 'Yes')]
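
students_in_11_or_12 is a project helper that is not shown here. Based on how it is used above, it presumably collapses the two CRDC grade flags into a single 'Yes'/'No' value, roughly:

# Sketch of what students_in_11_or_12 presumably does (an assumption based on its usage above):
# report 'Yes' if the school offers either 11th or 12th grade, 'No' otherwise.
def students_in_11_or_12_sketch(grade_11_flag, grade_12_flag):
    return 'Yes' if 'Yes' in (grade_11_flag, grade_12_flag) else 'No'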

Select Non-[Juvenile Justice, Special Education, and Alternative] Schools (CRDC)

- Schools that answered 'No' to each of those three questions on the CRDC survey.
- I also used a keyword filter to remove any remaining "Juvenile Justice"-esque institutions.
In [27]:
filter2_crdc_nces_1516 = filter1_crdc_nces_1516.copy()
filtered_out = filtered_out.append(filter2_crdc_nces_1516[(filter2_crdc_nces_1516.JJ == 'Yes') |
                                      (filter2_crdc_nces_1516.SCH_STATUS_ALT == 'Yes') |
                                      (filter2_crdc_nces_1516.SCH_STATUS_SPED == 'Yes')])
filter2_crdc_nces_1516 = filter2_crdc_nces_1516[(filter2_crdc_nces_1516.JJ == 'No') &
                                                (filter2_crdc_nces_1516.SCH_STATUS_ALT == 'No') &
                                                (filter2_crdc_nces_1516.SCH_STATUS_SPED == 'No')]
In [28]:
def jj_keyword_remove(name):
    # Return False if the school/district name contains a juvenile-justice-related keyword
    kws = ['behavioral', 'juvenile', 'correction']
    for kw in kws:
        if kw in name.strip().lower():
            return False
    return True

filter2_crdc_nces_1516 = filter2_crdc_nces_1516[filter2_crdc_nces_1516.SCH_NAME.apply(lambda x: jj_keyword_remove(x))]
filter2_crdc_nces_1516 = filter2_crdc_nces_1516[filter2_crdc_nces_1516.LEA_NAME.apply(lambda x: jj_keyword_remove(x))]

Remove Virtual Schools (NCES)

- Remove any schools that reported 'Yes' to the virtual schools question
- Remove schools whose names contain keywords that likely indicate an online school
In [29]:
filter3_crdc_nces_1516 = filter2_crdc_nces_1516.copy()
filtered_out = filtered_out.append(filter3_crdc_nces_1516[filter3_crdc_nces_1516.VIRTUAL == 'Yes'])
filter3_crdc_nces_1516 = filter3_crdc_nces_1516[filter3_crdc_nces_1516.VIRTUAL != 'Yes']
In [30]:
def any_missed_virtuals(name):
    # Return False if the school name contains a keyword suggesting an online/virtual school
    kws = ['virtual', 'cyber', 'electronic', 'internet', 'online', 'distance']
    for kw in kws:
        if kw in name.strip().lower():
            return False
    return True
filtered_out = filtered_out.append(filter3_crdc_nces_1516[~filter3_crdc_nces_1516.SCH_NAME.apply(lambda x: any_missed_virtuals(x))])
filter3_crdc_nces_1516 = filter3_crdc_nces_1516[filter3_crdc_nces_1516.SCH_NAME.apply(lambda x: any_missed_virtuals(x))]

Remove schools reported as elementary, middle, or 'N' (NCES)

Even with the lowest/highest grade filter, I wanted to ensure that no non-typical high schools (as reported by the NCES LEVEL field) were retained.

In [31]:
filter4_crdc_nces_1516 = filter3_crdc_nces_1516.copy()
In [32]:
filtered_out = filtered_out.append(filter4_crdc_nces_1516[(filter4_crdc_nces_1516.LEVEL == 'N') |
                                                          (filter4_crdc_nces_1516.LEVEL == '1') |
                                                          (filter4_crdc_nces_1516.LEVEL == '2')])
filter4_crdc_nces_1516 = filter4_crdc_nces_1516[(filter4_crdc_nces_1516.LEVEL == 'Missing') |
                                                (filter4_crdc_nces_1516.LEVEL == '3') |
                                                (filter4_crdc_nces_1516.LEVEL == '4')]

Remove Special Education, Alternative/Other, and "Adult" Schools (NCES)

Remove schools whose SCH_TYPE is not 1 (Regular) or 3 (Vocational). This culls additional "Special Education" and "Alternative/Other" schools.

In [33]:
filter5_crdc_nces_1516 = filter4_crdc_nces_1516.copy()
In [34]:
filtered_out = filtered_out.append(filter5_crdc_nces_1516[(filter5_crdc_nces_1516.SCH_TYPE == 2) |
                                                          (filter5_crdc_nces_1516.SCH_TYPE == 4)])
filter5_crdc_nces_1516 = filter5_crdc_nces_1516[(filter5_crdc_nces_1516.SCH_TYPE == 'Missing') |
                                                (filter5_crdc_nces_1516.SCH_TYPE == 1) |
                                                (filter5_crdc_nces_1516.SCH_TYPE == 3)]

Mini-Filter: Remove schools with 'adult' in the Name (CRDC)

In [35]:
filtered_out = filtered_out.append(filter5_crdc_nces_1516[filter5_crdc_nces_1516.SCH_NAME.str.contains('adult', case=False)])
filter5_crdc_nces_1516 = filter5_crdc_nces_1516[~filter5_crdc_nces_1516.SCH_NAME.str.contains('adult', case=False)]

Dealing with Missing Values

With nearly 1,200 schools missing NCES data, including schools from prominent districts like "NEW YORK CITY PUBLIC SCHOOLS" and "Green Dot Public Schools," it is important to recover as many of these schools as possible.

The problem that I found was that the CRDC lumped a number of school districts together; therefore, the combokeys of schools in these districts do not match those of the NCES.

I tried a number of methods to properly join these missing schools:

- Using only the school name:  many schools share the same name, so a join on name alone gives each school
the values of every other school with that name (i.e. it creates a lot of duplicate records).
- Using the NCES data from 2013:  this was also problematic, as most of the schools missing from this
dataset had the same problem in the 2013-2014 dataset.
- Using the district and the name together:  this suffered from the fact that the CRDC data combines some
school districts; therefore, the district names still did not match.
- **Finally, I used a combination of the school name and the state:  only a handful of schools in the
dataset had duplicate name-state combinations (and these were removed from the dataset).**


Next, I recovered the remaining schools in the 'New York City Public Schools District', because it was clear that they were missing simply due to a LEA reporting error in the CRDC data. This process had two parts:

- First, because most of these remaining New York schools appeared to have an incorrect LEAID, 
I used the school ID and state abbreviation to create a unique identifier.
- Second, I used the NCES database to manually search for the remaining schools and correct their combokeys.

Finally, I performed the same NCES-field filtration steps on the recovered data. Then, I hand-removed duplicate records by checking the original filtered data for matching entries.

In [36]:
filter5_missing_leas = filter5_crdc_nces_1516[filter5_crdc_nces_1516.LEVEL == 'Missing'].groupby('LEA_NAME')['LEAID'].count().sort_values(ascending = False)
In [37]:
# filter5_missing_leas.to_csv('../filtered_data/04_inital_filter_missing_LEAs.csv')
In [38]:
"""How many missing schools?"""
filter5_missing_schools = filter5_crdc_nces_1516[filter5_crdc_nces_1516.LEVEL == 'Missing']
len(filter5_missing_schools.index);
In [39]:
# filter5_missing_schools.to_csv('../filtered_data/04_intital_filter_missing_schools.csv')

Manipulate the missing schools and the original NCES data --> join

In [40]:
filter5_schname_state = filter5_missing_schools.copy()
In [41]:
filter5_schname_state = filter5_schname_state.reset_index()
In [42]:
filter5_schname_state['SCH_NAME'] = filter5_schname_state['SCH_NAME'].apply(lambda x: x.lower())
filter5_schname_state['SCH_NAME_ST_NUM'] = filter5_schname_state.SCH_NAME + filter5_schname_state.LEA_STATE
In [43]:
"""How many duplicate schools in the filter5 dataset?"""
filter5_schname_state.groupby('SCH_NAME_ST_NUM')['SCH_NAME_ST_NUM'].count().sort_values(ascending = False).head(10);
In [44]:
filter5_schname_state[filter5_schname_state.SCH_NAME_ST_NUM == 'performance learning centerGA']
Out[44]:
COMBOKEY LEA_STATE LEA_STATE_NAME LEAID LEA_NAME SCHID SCH_NAME JJ SCH_GRADE_PS SCH_GRADE_KG ... LEVEL VIRTUAL GSLO GSHI NMCNTY15 LOCALE15 LAT1516 LON1516 Students_in_11_12 SCH_NAME_ST_NUM
301 ='130129003727' GA GEORGIA 1301290 Cobb County 3727 performance learning center No No No ... Missing Missing Missing Missing Missing Missing Missing Missing Yes performance learning centerGA
313 ='130270003728' GA GEORGIA 1302700 Harris County 3728 performance learning center No No No ... Missing Missing Missing Missing Missing Missing Missing Missing Yes performance learning centerGA

2 rows × 125 columns

In [45]:
nces_1516_schname_state = nces_1516_full.copy()
In [46]:
nces_1516_schname_state = nces_1516_schname_state.reset_index()
In [47]:
nces_1516_schname_state['SCH_NAME'] = nces_1516_schname_state['SCH_NAME'].apply(lambda x: x.lower())
nces_1516_schname_state['SCH_NAME_ST_NUM'] = nces_1516_schname_state.SCH_NAME + nces_1516_schname_state.STABR
In [48]:
"""Join the NCES and filter5 datasets on the SCH_NAME_ST_NUM column"""
schname_combined = filter5_schname_state.set_index('SCH_NAME_ST_NUM').join(nces_1516_schname_state.set_index('SCH_NAME_ST_NUM'), how = 'left', rsuffix = '_')
In [49]:
"""How many schools have duplicated values?"""
schname_combined.SCH_NAME_.value_counts().sort_values(ascending = False).head(10);
In [50]:
"""How may more schools were matched?"""
len(schname_combined[schname_combined.SCH_NAME_.notnull()].index)
Out[50]:
688
In [51]:
"""How many schools still did not have a match?"""
len(schname_combined[schname_combined.SCH_NAME_.isnull()].index)
Out[51]:
312

Recover the NY Schools

In [52]:
schname_combined_missing = schname_combined.copy()
schname_combined_missing = schname_combined_missing[schname_combined_missing.SCH_NAME_.isnull()]

schname_combined_missing_ny = schname_combined_missing.copy()
schname_combined_missing_ny = schname_combined_missing_ny[schname_combined_missing_ny['LEA_NAME'] == 'NEW YORK CITY PUBLIC SCHOOLS']
In [53]:
print(len(schname_combined_missing_ny.index))
print(schname_combined_missing_ny.SCHID.nunique())
22
22
In [54]:
schname_combined_missing_ny = schname_combined_missing_ny.drop(['TITLEI_', 'STABR_', 'SCH_TYPE_TEXT_', 'SCH_TYPE_',
                                                                'LEVEL_', 'VIRTUAL_', 'GSLO_', 'GSHI_', 
                                            'NMCNTY15_', 'LOCALE15_', 'LAT1516_', 'LON1516_', 'combokey',
                                            'LEAID_', 'LEA_NAME_', 'SCH_NAME_', 'SCHID_'], axis = 1)
In [55]:
def schid_state_maker(schid, state):
    # Build a unique identifier from the 5-digit, zero-padded school ID and the state abbreviation
    schid = str(schid).zfill(5)
    return schid + state
In [56]:
schname_combined_missing_ny['schid_state'] = schname_combined_missing_ny.apply(lambda row: schid_state_maker(row['SCHID'], row['LEA_STATE']), axis = 1)
In [57]:
nces_for_missing_ny = nces_1516_full.copy()

nces_for_missing_ny['schid_state'] = nces_for_missing_ny.apply(lambda row: schid_state_maker(row['SCHID'], row['STABR']), axis = 1)
In [58]:
missing_ny_joined = schname_combined_missing_ny.set_index('schid_state').join(nces_for_missing_ny.reset_index().set_index('schid_state'), how = 'left', rsuffix = "_")
In [59]:
""" Join the missing NY schools with NCES """
missing_ny_joined[missing_ny_joined.LEVEL_.notnull()][['SCH_NAME','SCH_NAME_']]
Out[59]:
SCH_NAME SCH_NAME_
schid_state
01409NY law, government and community service high school LAW GOVERNMENT AND COMMUNITY SERVICE HIGH SCHOOL
02961NY bronx school for law, government and justice BRONX SCHOOL FOR LAW GOVERNMENT AND JUSTICE
03091NY high school of enterprise, business & technology HIGH SCHOOL OF ENTERPRISE BUSINESS & TECHNOLOGY
04873NY new explorations into science,tech and math hi... NEW EXPLORATIONS INTO SCIENCETECH AND MATH HIG...
05113NY high school for law, advocacy and community ju... HIGH SCHOOL FOR LAW ADVOCACY AND COMMUNITY JUS...
05516NY science, tech & research high school at erasmus SCIENCE TECH & RESEARCH HIGH SCHOOL AT ERASMUS
05521NY ms 223 laboratory school of finance and techno... MS 223 LABORATORY SCHOOL OF FINANCE AND TECHNO...
05536NY queens high school of teaching, liberal arts a... QUEENS HIGH SCHOOL OF TEACHING LIBERAL ARTS AN...
05677NY marie curie high sch-nursing, medicine & appli... MARIE CURIE HIGH SCH-NURSING MEDICINE & APPLIE...
05774NY high school for arts, imagination and inquiry HIGH SCHOOL FOR ARTS IMAGINATION AND INQUIRY
05899NY high sch-construction, trades, engineering & a... HIGH SCH-CONSTRUCTION TRADES ENGINEERING & ARC...
05962NY brooklyn community high sch-communication, art... BROOKLYN COMMUNITY HIGH SCH-COMMUNICATION ARTS...
05985NY archimedes acad-math, sci, tech ARCHIMEDES ACAD-MATH SCI TECH
05999NY queens high school for information, research a... QUEENS HIGH SCHOOL FOR INFORMATION RESEARCH AN...
06105NY high school for language and diplomacy HIGH SCHOOL FOR LANGUAGE AND DIPLOMACY (THE)
06122NY the brooklyn academy of global finance BROOKLYN ACADEMY OF GLOBAL FINANCE (THE)
06148NY urban assembly school for green careers URBAN ASSEMBLY SCHOOL FOR GREEN CAREERS (THE)
In [60]:
""" Dealing with remaining missing NY Schools """
missing_ny_2 = missing_ny_joined.copy()
missing_ny_2 = missing_ny_2[missing_ny_2.LEVEL_.isnull()]
In [61]:
missing_ny_2 = missing_ny_2.drop(['TITLEI_', 'STABR_', 'SCH_TYPE_TEXT_', 'SCH_TYPE_',
                   'LEVEL_', 'VIRTUAL_', 'GSLO_', 'GSHI_', 
                   'NMCNTY15_', 'LOCALE15_', 'LAT1516_', 'LON1516_', 'combokey',
                   'LEAID_', 'LEA_NAME_', 'SCH_NAME_', 'SCHID_'], axis = 1)
In [62]:
"""Recovering the missing Combokeys using the NCES website and manually filling the values in"""
missing_ny_2['actual_combokey'] = pd.Series(np.resize(0, len(missing_ny_2.index)), dtype = np.object)

missing_ny_2.at["99874NY", 'actual_combokey'] = "='360007706372'"
missing_ny_2.at["99933NY", 'actual_combokey'] = "='360008106380'"
missing_ny_2.at["99968NY", 'actual_combokey'] = "='360007606296'"
missing_ny_2.at["99992NY", 'actual_combokey'] = "='360009706274'"
missing_ny_2.at["99995NY", 'actual_combokey'] = "='360009506273'"
In [63]:
""" Join again on the NCES """
missing_ny_2_joined = missing_ny_2.set_index('actual_combokey').join(nces_1516_full, how = 'left', rsuffix = '_')
In [64]:
"""How many matched?"""
len(missing_ny_2_joined[missing_ny_2_joined.LEVEL_.notnull()].index)
Out[64]:
5

Combine the recovered schools and perform the filters

Concatenate the two recovered Missing NY Schools sets

In [65]:
missing_ny_joined_matching = missing_ny_joined[missing_ny_joined.LEVEL_.notnull()]
In [66]:
all_missing_ny_recovered = missing_ny_2_joined.append(missing_ny_joined_matching)

Combine the originally recovered schools (matched via the school name + state identifier) with the recovered NY schools

In [67]:
recovered_schools = schname_combined.copy()
recovered_schools = recovered_schools.fillna("Missing")
In [68]:
recovered_schools = recovered_schools[recovered_schools['SCH_NAME_'] != "Missing"]
In [69]:
recovered_schools_all = recovered_schools.append(all_missing_ny_recovered)

Reformat the columns -- the recovered schools dataset's columns need to match the original filtered dataset's columns (required for concatenating the two sets properly)

In [70]:
"""Drop original nces columns (the ones with missing values)"""    
recovered_schools_all = recovered_schools_all.drop(['TITLEI', 'STABR', 'SCH_TYPE_TEXT', 'SCH_TYPE', 'LEVEL', 'VIRTUAL', 'GSLO', 'GSHI', 
                                            'NMCNTY15', 'LOCALE15', 'LAT1516', 'LON1516', 'combokey',
                                            'LEAID_', 'LEA_NAME_', 'SCH_NAME_', 'SCHID_'], axis = 1)
"""Rename new matching columns to replace the columns above (necessary for a proper concatenation later)"""
recovered_schools_all = recovered_schools_all.rename(lambda x: x.strip('_'), axis = 'columns')
recovered_schools_all = recovered_schools_all.set_index('COMBOKEY')
In [71]:
"""Do the columns between the original filtered set and recovered missing values set match"""
print(len(recovered_schools_all.columns.values))
print(len(filter5_crdc_nces_1516.columns.values))
123
123
In [72]:
""" How many schools recovered? """
len(recovered_schools_all.index)
Out[72]:
710

NCES-Reported High Schools

In [73]:
recovered_schools_filter1 = recovered_schools_all.copy()
In [74]:
filtered_out = filtered_out.append(recovered_schools_filter1[(recovered_schools_filter1.LEVEL == '1') | 
                                           (recovered_schools_filter1.LEVEL == '2') |
                                           (recovered_schools_filter1.LEVEL == 'N')])
recovered_schools_filter1 = recovered_schools_filter1[(recovered_schools_filter1.LEVEL == '3') |
                                                      (recovered_schools_filter1.LEVEL == '4')]

Non-Virtual Schools

In [75]:
recovered_schools_filter2 = recovered_schools_filter1.copy()
In [76]:
filtered_out = filtered_out.append(recovered_schools_filter2[recovered_schools_filter2.VIRTUAL == 'Yes'])
recovered_schools_filter2 = recovered_schools_filter2[recovered_schools_filter2.VIRTUAL != 'Yes']

NCES-Reported Regular / Vocational

In [77]:
recovered_schools_filter3 = recovered_schools_filter2.copy()
In [78]:
filtered_out = filtered_out.append(recovered_schools_filter3[(recovered_schools_filter3.SCH_TYPE == 2) |
                                                             (recovered_schools_filter3.SCH_TYPE == 4)])
recovered_schools_filter3 = recovered_schools_filter3[(recovered_schools_filter3.SCH_TYPE == 1) |
                                                      (recovered_schools_filter3.SCH_TYPE == 3)]

Remove Schools with 'Adult' in the Name

In [79]:
filtered_out = filtered_out.append(recovered_schools_filter3[recovered_schools_filter3.SCH_NAME.str.contains('Adult', case=False)])
recovered_schools_filter3 = recovered_schools_filter3[~recovered_schools_filter3.SCH_NAME.str.contains('Adult', case=False)]

Clean Duplicate Values

In [80]:
recovered_schools_filter3.groupby('SCH_NAME')['SCH_NAME'].count().sort_values(ascending = False).head(5)
"""NOTE: the community collaborative charter duplication appears to be legit (two campuses of the same school?)"""
Out[80]:
'NOTE: the community collaborative charter duplication appears to be legit (two campuses of the same school?)'
In [81]:
"""Beacon High School in Dutchess County is already in the filter5 dataset -- Remove"""
recovered_schools_filter4 = recovered_schools_filter3.copy()
recovered_schools_filter4 = recovered_schools_filter4[(recovered_schools_filter4.SCH_NAME != 'beacon high school') | (recovered_schools_filter4.NMCNTY15 != 'Dutchess County')]
In [82]:
"""Both of the performance learning centers here actually matched to a different 'performance learning center' record;
therefore, they should both be removed"""
recovered_schools_filter4 = recovered_schools_filter4[recovered_schools_filter4.SCH_NAME != 'performance learning center']
In [83]:
"""The University High in Irvine was already accounted for; therefore, needs to be removed from the recovered"""
recovered_schools_filter4 = recovered_schools_filter4[(recovered_schools_filter4.SCH_NAME != 'university high') | (recovered_schools_filter4.NMCNTY15 != 'Orange County')]
In [84]:
"""'How many final recovered values?'"""
str(len(recovered_schools_filter4.index)) + ' Recovered High Schools'
Out[84]:
'661 Recovered High Schools'

Concatenating Recovered Missing Values with the original Filtered Dataset

Finally, I concatenated the recovered high schools with the original filtered set.

I ensured that no duplicate values were added in the process.

In [85]:
"""Remove the missing values"""
filter6_crdc_nces_1516 = filter5_crdc_nces_1516.copy()
filter6_crdc_nces_1516 = filter6_crdc_nces_1516[filter6_crdc_nces_1516.LEVEL != "Missing"]
In [86]:
"""How many initial Duplicates?
Interesting enough, these duplicates appear to legitimate; the problem seems to be that the schools actually have 
different names (e.g. "The ADAIR Co. High"'s are actually supposed to be labeled ADAIR Co. R-I High and ADAIR Co. R-II BRASHEAR)"""
filter6_crdc_nces_1516.groupby(['STABR','SCH_NAME','NMCNTY15'])['SCH_NAME'].count().sort_values(ascending=False).head()
Out[86]:
STABR  SCH_NAME           NMCNTY15        
TX     TAYLOR H S         Harris County       2
       LEE H S            Harris County       2
       STERLING H S       Harris County       2
MO     ADAIR CO. HIGH     Adair County        2
KS     Smoky Valley High  McPherson County    1
Name: SCH_NAME, dtype: int64
In [87]:
"""Any dulications in the recovered schools?
    The community collaborative charter schools are two different schools."""
recovered_schools_filter4.groupby(['STABR','SCH_NAME','NMCNTY15'])['SCH_NAME'].count().sort_values(ascending=False).head()
Out[87]:
STABR  SCH_NAME                                   NMCNTY15         
CA     community collaborative charter            Sacramento County    2
TX     ischool high of hickory creek              Denton County        1
NY     arts and media preparatory academy         Kings County         1
       baccalaureate school for global education  Queens County        1
       aviation career and technical high school  Queens County        1
Name: SCH_NAME, dtype: int64
In [88]:
# filtered_and_recovered = pd.concat([filter6_crdc_nces_1516, recovered_schools_filter4])
filtered_and_recovered = filter6_crdc_nces_1516.append(recovered_schools_filter4)
In [89]:
"""Do the numbers of columns match?"""
print(len(filter6_crdc_nces_1516.columns.values))
len(filtered_and_recovered.columns.values)
123
Out[89]:
123
In [90]:
"""Because Columns are stored as dictionaries, there is no inherent order to the columns -- Pandas automatically 
uses an alphabetical sort on an append/concatenation.  I reorded the columns to show the SCH Name first"""
schName = ['SCH_NAME']
reorder = schName + [c for c in filtered_and_recovered.columns if c not in schName]
filtered_and_recovered = filtered_and_recovered[reorder]
In [91]:
"""No added duplicate records"""
filtered_and_recovered.groupby(['STABR','SCH_NAME','NMCNTY15'])['SCH_NAME'].count().sort_values(ascending=False).head(6)
Out[91]:
STABR  SCH_NAME                            NMCNTY15         
CA     community collaborative charter     Sacramento County    2
TX     STERLING H S                        Harris County        2
       TAYLOR H S                          Harris County        2
MO     ADAIR CO. HIGH                      Adair County         2
TX     LEE H S                             Harris County        2
KS     Victoria Junior-Senior High School  Ellis County         1
Name: SCH_NAME, dtype: int64
In [92]:
# filtered_and_recovered.to_csv('../filtered_data/04_filter_final.csv')

Final Missing Schools

In [93]:
final_missing = schname_combined[(schname_combined.SCH_NAME_.isnull()) & (schname_combined.LEA_NAME != 'NEW YORK CITY PUBLIC SCHOOLS')]
In [94]:
""" How many final missing schools? """
len(final_missing.index)
Out[94]:
290
In [95]:
# final_missing.to_csv('../filtered_data/04_final_missing_NCES.csv')
In [96]:
""" Top remaining unaccounted districts """
final_missing.groupby('LEA_NAME')['LEAID'].count().sort_values(ascending = False).head(10)
Out[96]:
LEA_NAME
NORMAN                                                     7
Dept. of Svs. for Children Youth & Their Families          5
OFFICE OF EDUCATION DEPARTMENT OF CHILDREN AND FAMILIES    4
ERIE 2-CHAUTAUQUA-CATTARAUGUS BOCES                        3
Clayton County                                             3
NASSAU BOCES                                               3
Cherokee County                                            3
WINDSOR SCHOOL DISTRICT                                    3
Boston                                                     3
TULSA                                                      3
Name: LEAID, dtype: int64

Final Filtered Out Schools From the Dataset

In [97]:
filtered_out = filtered_out.append(final_missing)
# filtered_out.to_csv('../filtered_data/04_filtered_out_schools.csv')

Filtration Process Details

In [98]:
from my_functions.extra_functions import (hs_enrollment_averager, missing_value_mapper, 
                                          add_enrollment_columns, enrollment_summary)
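
add_enrollment_columns and enrollment_summary are project helpers that are not shown here. Judging from the summary table below, enrollment_summary reduces a filtered DataFrame to a single labeled row of totals; a minimal sketch, assuming hypothetical enrollment columns added by add_enrollment_columns, might look like:

# Minimal sketch of enrollment_summary (an assumption, not the project's actual helper).
# 'high_enrollment', 'de_enrollment', and 'ap_enrollment' are hypothetical columns assumed to be
# created by add_enrollment_columns from the CRDC enrollment count fields.
def enrollment_summary_sketch(df, label):
    return pd.DataFrame({'Number of Schools': [len(df.index)],
                         'High Students': [df['high_enrollment'].sum()],
                         'DE Students': [df['de_enrollment'].sum()],
                         'AP Students': [df['ap_enrollment'].sum()]},
                        index=[label])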
In [99]:
"""Adding Enrollment Columns to the Filtered Datasets"""
crdc_1516_enroll = add_enrollment_columns(crdc_1516)
filter_1_enroll = add_enrollment_columns(filter1_crdc_nces_1516)
filter_2_enroll = add_enrollment_columns(filter2_crdc_nces_1516)
filter_3_enroll = add_enrollment_columns(filter3_crdc_nces_1516)
filter_4_enroll = add_enrollment_columns(filter4_crdc_nces_1516)
filter_5_enroll = add_enrollment_columns(filter5_crdc_nces_1516)
filtered_and_recovered_enroll = add_enrollment_columns(filtered_and_recovered)
In [100]:
"""Set up the enrollment summary DataFrames"""
tot_enr = enrollment_summary(crdc_1516_enroll, 'Total CRDC Schools')
filter1_enr = enrollment_summary(filter_1_enroll, 'Remove Schools without 11th or 12th Grade (CRDC)')
filter2_enr = enrollment_summary(filter_2_enroll, 'Remove Special-Education, Alternative, and Juvenile Justice Schools (CRDC)')
filter3_enr = enrollment_summary(filter_3_enroll, 'Remove Virtual Schools (NCES)')
filter4_enr = enrollment_summary(filter_4_enroll, 'Remove Elementary, Middle, and "N" Schools (NCES)')
filter5_enr = enrollment_summary(filter_5_enroll, 'Remove Special Education, Alternative/Other, and "Adult" Schools (NCES)')
filter_final_enr = enrollment_summary(filtered_and_recovered_enroll, 'Remove the Non-Matching NCES Schools (Final Total)')
In [101]:
"""Compile enrollment summary DataFrames"""
pd.concat([tot_enr, filter1_enr, filter2_enr, filter3_enr, filter4_enr, filter5_enr, filter_final_enr])
Out[101]:
Number of Schools High Students DE Students AP Students
Total CRDC Schools 96360 15040262 1180054 2808786
Remove Schools without 11th or 12th Grade (CRDC) 25051 14774212 1175384 2793376
Remove Special-Education, Alternative, and Juvenile Justice Schools (CRDC) 20646 14403965 1164638 2784404
Remove Virtual Schools (NCES) 20269 14296572 1161385 2779722
Remove Elementary, Middle, and "N" Schools (NCES) 20113 14278284 1160178 2777305
Remove Special Education, Alternative/Other, and "Adult" Schools (NCES) 19012 14132585 1154017 2766557
Remove the Non-Matching NCES Schools (Final Total) 18684 14076669 1151835 2759810