CENSUS OF POPULATION AND HOUSING, 1990 [UNITED STATES]:
PUBLIC USE MICRODATA SAMPLE: 1-PERCENT SAMPLE
(ICPSR 9951)

Principal Investigator
United States Department of Commerce
Bureau of the Census

Second ICPSR Release
August 1993

Inter-university Consortium for Political and Social Research
P.O. Box 1248
Ann Arbor, Michigan 48106

BIBLIOGRAPHIC CITATION

Publications based on ICPSR data collections should acknowledge those sources by means of bibliographic citations. To ensure that such source attributions are captured for social science bibliographic utilities, citations must appear in footnotes or in the reference section of publications. The bibliographic citation for this data collection is:

U.S. Dept. of Commerce, Bureau of the Census. CENSUS OF POPULATION AND HOUSING, 1990 [UNITED STATES]: PUBLIC USE MICRODATA SAMPLE: 1-PERCENT SAMPLE [Computer file]. 2nd release. Washington, DC: U.S. Dept. of Commerce, Bureau of the Census [producer], 1993. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 1993.

REQUEST FOR INFORMATION ON USE OF ICPSR RESOURCES

To provide funding agencies with essential information about use of archival resources and to facilitate the exchange of information about ICPSR participants' research activities, users of ICPSR data are requested to send to ICPSR bibliographic citations for each completed manuscript or thesis abstract. Please indicate in a cover letter which data were used.

DATA DISCLAIMER

The original collector of the data, ICPSR, and the relevant funding agency bear no responsibility for uses of this collection or for interpretations or inferences based upon such uses.

DATA COLLECTION DESCRIPTION

United States Department of Commerce.
Bureau of the Census

CENSUS OF POPULATION AND HOUSING, 1990 [UNITED STATES]: PUBLIC USE MICRODATA SAMPLE: 1-PERCENT SAMPLE (ICPSR 9951)

SUMMARY: The Public Use Microdata Sample (PUMS) 1-Percent Sample contains household and person records for a sample of housing units that received the "long form" of the 1990 Census questionnaire. Data items include the full range of population and housing information collected in the 1990 Census, including 500 occupation categories, age by single years up to 90, and wages in dollars up to $140,000. Each person identified in the sample has an associated household record, containing information on household characteristics such as type of household and family income. CLASS IV

UNIVERSE: All persons and housing units in the United States.

SAMPLING: A stratified sample, consisting of a sub-sample of the household units that received the 1990 Census "long-form" questionnaire (approximately 15.9 percent of all housing units).

NOTE: (1) All PUMS files were resupplied by the Census Bureau during the summer of 1993. ICPSR has incorporated extensive user notes into the machine-readable codebook. Appendix G, consisting of maps for the PUMAs, is being released on a flow basis by the Census Bureau. These are available in hard copy from ICPSR. (2) Although all records are 231 characters in length, each file is hierarchical in structure, containing a housing unit record followed by a variable number of person records. Both record types contain approximately 120 variables. Two improvements over the 1980 PUMS files have been incorporated. First, the housing unit serial number is identified on both the housing unit record and on the person record, allowing the file to be processed as a rectangular file. In addition, each person record is assigned an individual weight, allowing users to more closely approximate published reports.
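The record structure described in note (2) can be illustrated with a minimal Python sketch. It groups housing unit and person records by serial number, pairs them into rectangular rows, and sums person weights to approximate a population count. The 231-character record length comes from this codebook; the column positions used here (record type in column 1, serial number in columns 2-8, person weight in columns 18-21) and all function names are assumptions made for illustration and should be checked against the Data Dictionary before use.

```python
RECORD_LENGTH = 231  # record length stated in this codebook

# ASSUMED column positions (1-based, inclusive); verify in the Data Dictionary.
RECTYPE_COLS = (1, 1)     # assumed: "H" = housing unit record, "P" = person record
SERIALNO_COLS = (2, 8)    # assumed: housing unit serial number, on both record types
PWGT_COLS = (18, 21)      # assumed: person weight, on person records only


def field(record, cols):
    """Return the substring covering 1-based inclusive columns."""
    start, end = cols
    return record[start - 1:end]


def read_pums(lines):
    """Group records by housing unit serial number.

    Because the serial number appears on both record types, the result
    does not depend on the order of the input lines.
    """
    units = {}
    for raw in lines:
        record = raw.rstrip("\r\n").ljust(RECORD_LENGTH)
        serial = field(record, SERIALNO_COLS)
        unit = units.setdefault(serial, {"housing": None, "persons": []})
        rectype = field(record, RECTYPE_COLS)
        if rectype == "H":
            unit["housing"] = record
        elif rectype == "P":
            unit["persons"].append(record)
    return units


def rectangularize(units):
    """Yield one (housing_record, person_record) pair per person.

    Vacant units have no person records and therefore yield nothing.
    """
    for unit in units.values():
        for person in unit["persons"]:
            yield unit["housing"], person


def weighted_person_count(units):
    """Estimate the population represented by summing person weights."""
    return sum(int(field(person, PWGT_COLS)) for _, person in rectangularize(units))
```

A typical call would pass an open file object for one state file, for example `read_pums(open(path))`, where `path` is whatever local file name the user has chosen.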
Unlike previous years, the 1990 PUMS 1-Percent and 5-Percent Samples have not been released in separate geographic series (known as "A," "B," etc. records). Instead, each sample has its own set of geographies, known as "Public Use Microdata Areas" (PUMAs), established by the Census Bureau with assistance from each State Data Center. The PUMAs in the 1-Percent Sample are based on a distinction between metropolitan and nonmetropolitan areas. Metropolitan areas encompass whole central cities, Primary Metropolitan Statistical Areas (PMSAs), Metropolitan Statistical Areas (MSAs), or groups thereof, except where the city or metropolitan area contains more than 200,000 inhabitants. In that case, the city or metropolitan area is divided into several PUMAs. Nonmetropolitan PUMAs are based on areas or groups of areas outside the central city, PMSA, or MSA. PUMAs in this 1-Percent Sample may cross state lines.

EXTENT OF COLLECTION: 1 data file per state + machine-readable documentation (text) + database dictionary + SAS Control Cards + SPSS Control Cards

EXTENT OF PROCESSING: MDATA/ NONNUM

DATA FORMAT: Logical Record Length with SAS and SPSS Control Cards

Part numbers correspond to FIPS codes of states
    File Structure: hierarchical
    Record Length: 231

Part 80: Data Dictionary for All Parts
    Record Length: 80

Part 81: Codebook Text for All Parts
    Record Length: 80

Part 82: SPSS Control Cards for All Parts
    Record Length: 80

Part 83: SAS Control Cards for All Parts
    Record Length: 80

Part 84: Geographic Equivalency File for the Entire Nation
    Record Length: 80

Part 99: Public Use Microdata Areas (PUMAs) Crossing State Lines
    Record Length: 231

CONTENTS
                                                              Page
Abstract------------------------------------------------------ 3
Introduction-------------------------------------------------- 11
How to Use This File------------------------------------------ 19
Accuracy of the Microdata Sample Estimates-------------------- 29
Sample Design and Estimation---------------------------------- 57
Record Contents----------------------------------------------- 71
    Indexes to Variables------------------------ 71
    Data Dictionary----------------------------- 85
User Notes---------------------------------------------------- 127

APPENDIXES

A. Area Classifications-------------------------------------- 137
B. Definitions of Subject Characteristics-------------------- 165
C. Notes on Selected Data Items------------------------------ 273
D. Collection and Processing Procedures---------------------- 295
E. Facsimiles of Respondent Instructions and Questionnaire
   Pages--------------------------------------------------- 305
F. Data Products and User Assistance------------------------- 307
G. Maps (will be released as user notes)--------------------- 335
H. Record Layout of Machine-Readable Data Dictionary--------- 351
I. Code Lists------------------------------------------------ 353

ICPSR Note: User Notes have been added to this second edition of the codebook, beginning on page 127.

ABSTRACT
                                                              Page
Citation------------------------------------------------------ 3
File Availability--------------------------------------------- 7
Geographic Coverage------------------------------------------- 4
Related Electronic Media Products----------------------------- 7
Related Printed Reports--------------------------------------- 5
Related Reference Materials----------------------------------- 8
Software Considerations--------------------------------------- 8
Subject Matter Description------------------------------------ 3
Technical Description----------------------------------------- 8
Type of File-------------------------------------------------- 3
Universe Description------------------------------------------ 3

CITATION

Census of Population and Housing, 1990: Public Use Microdata Samples (machine-readable data files) / prepared by the Bureau of the Census. Washington: The Bureau (producer and distributor), 1992.
TYPE OF FILE

Microdata

UNIVERSE DESCRIPTION

All persons and housing units in the United States.

SUBJECT MATTER DESCRIPTION

Public Use Microdata Samples (PUMS) contain records representing 5% or 1% samples of the housing units in the U.S. and the persons in them. Selected group quarters persons are also included. The file contains individual weights for each person and housing unit which, when applied to the individual records, expand the sample to the total population. Most population and housing items are listed below. Please see the Data Dictionary for a complete listing of variables and recodes. Both the 5% and 1% samples have the same subject content and vary only in geographic composition of the Public Use Microdata Area (PUMA). A 3% elderly sample will be available later.

Items on the housing record include:

    Allocation Flags for Housing Items
    Bedrooms
    Condominium Status
    Contract Rent
    Cost of Utilities
    Family Income in 1989
    Family, Subfamily and Relationship Recodes
    Farm Status and Value
    Fire, Hazard, Flood Insurance
    Fuels Used
    Gross Rent
    House Heating Fuel
    Household Income in 1989
    Household Type
    Housing Unit Weight
    Kitchen Facilities
    Linguistic Isolation
    Meals Included in Rent
    Mortgage Status and Selected Monthly Owner Costs
    Plumbing Facilities
    Presence and Age of Own Children
    Presence of Subfamilies in Household
    Property Value
    Real Estate Taxes
    Rooms
    Sewage Disposal
    Source of Water
    State (Residence)
    Telephone in Housing Unit
    Tenure
    Units in Structure
    Vacancy Status
    Vehicles Available
    Year Householder Moved into Unit
    Year Structure Built

Items on the person record include:

    Ability to Speak English
    Age
    Allocation Flags for Population Items
    Ancestry
    Children Ever Born
    Citizenship
    Class of Worker
    Disability Status
    Educational Attainment
    Hispanic Origin
    Hours Worked
    Income in 1989 by Type
    Industry
    Language Spoken at Home
    Marital Status
    Means of Transportation
    Migration PUMA
    Migration State
    Military Status, Periods of Active Duty Military Service, Veteran Period of Service
    Mobility Status
    Occupation
    Person's Weight
    Personal Care Limitation
    Place of Birth
    Place of Work PUMA
    Place of Work State
    Poverty Status in 1989
    Race
    Relationship
    School Enrollment and Type of School
    Time of Departure for Work
    Travel Time to Work
    Vehicle Occupancy
    Weeks Worked in 1989
    Work Status in 1989
    Work Limitation Status
    Year of Entry

GEOGRAPHIC COVERAGE

Each PUMS file provides records for States and many of their geographic levels. The hierarchy is shown below:

The 5% sample identifies every State and various subdivisions of States called "Public Use Microdata Areas", each with at least 100,000 persons. These PUMAs were primarily based on counties, and may be whole counties, groups of counties, and places. When these entities have more than 200,000 persons, PUMAs can represent parts of counties, places, etc. None of these PUMAs on the 5% sample crosses state lines.

On the other hand, the 1% sample was based primarily on metropolitan/nonmetropolitan areas, and contains PUMAs which were made from whole central cities, whole MSAs or PMSAs, MSAs or PMSAs outside the central city, groups of MSAs or PMSAs, and groups of areas outside MSAs or PMSAs. When the areas have more than 200,000 persons, 1% PUMAs can represent parts of central cities, MSA/PMSAs, and so forth. 1% PUMAs may cross State lines and in that case State codes are not shown. See examples of PUMAs in figures 2-4.

RELATED PRINTED REPORTS

Since individual weights are provided on PUMS, most estimates derived from PUMS tabulations can be checked for reasonableness against other 1990 printed reports, STFs or microfiche produced from sample data. Listed below are the 1990 census printed reports containing sample data from summary tape products STF 3 and STF 4 which may be used to verify estimates provided from PUMS files. These reports will be available from Superintendent of Documents, U.S. Government Printing Office, Washington, DC 20402.
An order form follows this abstract.

1990 CPH-3, Population and Housing Characteristics for Census Tracts and Block Numbering Areas. One report will be published for each metropolitan area (MA) and one for the non-metropolitan balance of each State, Puerto Rico and the U.S. Virgin Islands showing data for most of the population and housing subjects included in the 1990 census. Some tables will be based on the 100-percent tabulations, others on the sample tabulations. (Scheduled for release in 1992-93.)

1990 CPH-4, Population and Housing Characteristics for Congressional Districts of the 103rd Congress. A report for each State and the District of Columbia which provides both 100-percent and sample data for States, congressional districts, and, within congressional districts, counties, places of 10,000 or more inhabitants, county subdivisions of 10,000 or more inhabitants in 12 States, and American Indian and Alaska Native areas. (Scheduled for release in 1994.)

1990 CPH-5, Summary Social, Economic, and Housing Characteristics. These reports, issued for the United States, States, District of Columbia, Puerto Rico and the U.S. Virgin Islands, provide sample population and housing data for States and local government units (i.e., counties, places, towns, and townships), other county subdivisions, and American Indian and Alaska Native areas.

1990 CP-2, Social and Economic Characteristics. These reports are issued for the United States, States, District of Columbia, Puerto Rico, and the U.S. Virgin Islands. They focus on the population subjects collected on a sample basis in 1990. Data are shown for States (including summaries such as urban and rural), counties, places of 2,500 or more inhabitants, county subdivisions of 2,500 or more inhabitants in selected States, and the State portions of American Indian and Alaska Native areas. (Scheduled for release in 1993.)
1990 CP-2-1A, Social and Economic Characteristics for American Indian and Alaska Native Areas. Data are shown for American Indian and Alaska Native areas. (Scheduled for release in 1993.)

1990 CP-2-1B, Social and Economic Characteristics for Metropolitan Areas. Data are shown for MAs. (Scheduled for release in 1993.)

1990 CP-2-1C, Social and Economic Characteristics for Urbanized Areas. Data are shown for UAs.

1990 CP-3, Population Subject Reports. Thirty reports are planned covering population subjects and subgroups. These include migration, income, and the older population. Geographic areas generally will include the United States, regions, and divisions; some reports may include data for highly populated areas such as States, MAs, counties and large places. (Scheduled for release in 1993.)

1990 CH-2, Detailed Housing Characteristics. These reports, issued for the United States, States, District of Columbia, Puerto Rico, and the U.S. Virgin Islands, focus on the housing subjects collected on a sample basis in 1990. Data are shown for States (including summaries such as urban and rural), counties, places of 2,500 or more inhabitants, MCDs of 2,500 or more inhabitants in selected States, Alaska Native areas and the State portion of American Indian areas. (Scheduled for release in 1993.)

1990 CH-2-1A, Detailed Housing Characteristics for American Indian and Alaska Native Areas. Data are shown for American Indian and Alaska Native areas. (Scheduled for release in 1993.)

1990 CH-2-1B, Detailed Housing Characteristics for Metropolitan Areas. Data are shown for MAs. (Scheduled for release in 1993.)

1990 CH-2-1C, Detailed Housing Characteristics for Urbanized Areas. Data are shown for UAs. (Scheduled for release in 1993.)

1990 CH-3, Housing Subject Reports. Ten housing subject reports are planned covering 1990 census items such as structural characteristics and space utilization.
Geographic areas generally include the United States, regions, and divisions; some reports may include data for other highly populated geographic areas such as States, MAs, counties, and large places. (Scheduled for release in 1993.)

RELATED ELECTRONIC MEDIA PRODUCTS

PUMS data on compact disc-read only memory (CD-ROM) are issued after all the tape files are released. CENDATA, the Census Bureau's online system, carries PUMS Technical Documentation. STF 3 data are also available on CD-ROM and magnetic tape. Contact Customer Services (301-763-4100) for additional information on electronic media products.

FILE AVAILABILITY

PUMS files are provided for each State and the District of Columbia and are released on a State-by-State basis. All files and pricing information are available from Customer Services, Data User Services Division, Bureau of the Census, Washington, DC 20233. (See above for phone and FAX information.) A machine-readable data dictionary is included on the tape without charge. Options include 6250 or 1600 bpi, ASCII or EBCDIC, labeled or unlabeled. The files are also available on tape cartridges (IBM 3480 or compatible format) for the same price. When ordering, please use the order form at the end of this chapter.

Files for the individual States are priced according to the number of megabytes of data they contain; each megabyte is priced at $1.25 regardless of the tape specifications. The minimum charge for a computer tape is $175 for one or more files. See the enclosed order blank for prices of the various PUMS files. Although a user can order a single file, we have packaged the files by census division for sale, since many users order all of the States or at least the States which border their own. Discount prices are available where all files in a group are paid for at the time of ordering. See the order blank for specific prices.

RELATED REFERENCE MATERIALS

1990 Census Population and Housing Tabulation and Publication Program.
This booklet provides descriptions of the data products available from the 1990 census. Available without charge from Customer Services (see above).

Census '90 Basics. This booklet provides a general overview of census activities and detailed information on census content, geographic areas, and products. Available without charge from Customer Services (see above).

Census ABC's-Applications in Business and Community. This booklet highlights key information about the 1990 census and illustrates a variety of ways the data can be used. Available without charge from Customer Services (see above).

A comprehensive 1990 Census of Population and Housing Guide will be available in 1990. It will provide detailed information about all aspects of the census and a comprehensive glossary of census terms.

TECHNICAL DESCRIPTION

The file contains two record types, a "housing" record and a "person" record, each consisting of 231 characters of data. Each housing unit record is followed by a variable number of person records, one for each occupant. Vacant housing units will have no person record, and selected persons in group quarters will have a dummy housing record and a person record.

The 5% (A) sample includes a separate file for each State. The 1% (B) sample includes a file for each State and a file containing PUMAs which cross State lines. The 3% (O) sample (elderly file) has the same geographic composition as the 5% sample, but includes housing units with at least one person age 60 and over or group quarters persons age 60 and over.

The block size for the files varies with each user's specifications; however, the standard block size is 32,340 characters for 1990 PUMS.

SOFTWARE CONSIDERATIONS

The 1990 Public Use microdata files are a special type of nonrectangular file: hierarchical. That is, the file contains several record types, each with different variables, rather than one gigantic record with all the variables.
We release the PUMS in this format because of the tremendous amount of data contained in one record. The file is sorted to maintain the relationship between both record types. Although these records are extremely large, they can be handled by most statistical or report writing software.

There are two basic record types: the housing unit record and the person record. For 1990, each of the records contains a serial number which links the persons in the housing unit to the proper housing unit record, so that a user no longer needs to worry about keeping the record sequence as the file was delivered.

In today's information processing environment, most standard statistical software packages are now capable of handling the file in either format: hierarchical or rectangular structure. Most software packages, such as SAS, SPSS, BMDP, and some relational data base systems, will in fact rectangularize hierarchical files. Further, the manuals accompanying most packages contain samples of code showing how to process the files. Several of the packages also have extract procedures already coded into the software.

The 1990 PUMS will be accompanied by electronic data dictionaries in a format which will allow the user to read in ASCII characters and prepare statements transforming the variables and their corresponding descriptions and values to the proper statements required by the software package of choice.

The files will be ASCII, with no special software appended, so as to be compatible with most software packages. The technical documentation will include a section on "how to use this file", where software concerns will be addressed. The user must be familiar with the processing system's limitations and the efficiency of the procedures within the software packages. Users may also write their own code enabling them to perform custom tabulations on their system of choice.

CHAPTER 1.
INTRODUCTION

OVERVIEW

Public-use microdata samples are computer accessible files which contain records for a sample of housing units, with information on the characteristics of each unit and the people in it. We exclude information which would identify a household or an individual in order to protect the confidentiality of respondents. Within the limits of the sample size and geographic detail, these files allow users to prepare virtually any tabulations they require.

Separate public-use microdata samples are available, each representing five percent or one percent of the population and housing of the United States:

o 5% Sample, identifying all States and various subdivisions within them, including most counties with 100,000 or more inhabitants;

o 1% Sample, identifying all metropolitan territory and most MAs with 100,000 or more inhabitants individually, and groups of MAs elsewhere.

A 3% elderly sample will be available also.

WHAT IS MICRODATA?

We provide computer accessible data products in several formats, as summary data or as microdata. Summary data are the type of data found in census printed reports, summary tape files, microfiche, and most special tabulations; microdata are the information collected from each person and housing unit on the questionnaire. In summary data, the basic unit of analysis is a specific geographic area (for example, a census tract, county or State) for which counts of persons or housing units (or aggregated data) in particular categories are provided. In microdata, the basic unit is an individual housing unit and the persons who live in it. Figure 1 illustrates the basic distinctions between summary data and microdata.

There are two types of microdata: Confidential microdata include the census basic record types, computerized versions of the questionnaires collected from households, as coded and edited during census processing.
The Census Bureau tabulates these confidential microdata in order to produce the summary data that go into the various reports, summary tape files (STFs), and special tabulations. Public-use microdata samples are extracts from the confidential microdata taken in a manner that avoids disclosure of information about identifiable households or individuals.

PROTECTING CONFIDENTIAL INFORMATION

All data released (in print or electronic media) by the Bureau of the Census are subject to strict confidentiality measures imposed by the legislation under which our data are collected: Title 13, U.S. Code, which protects the confidentiality of individual respondents. Responses to the questionnaire can be used only for statistical purposes, and Census Bureau employees are sworn to protect respondents' identities.

Records on public-use microdata samples are selected after the confidentiality edit is performed, and contain no names or addresses. Also, the Bureau limits the detail (topcodes, recodes) on place of residence, place of work, high incomes, and other selected items to further protect the confidentiality of the records. Microdata records identify no geographic area with fewer than 100,000 inhabitants. Microdata samples include only a small fraction of the population, drastically limiting the chance that the record of a given individual is even contained in a public-use microdata file, much less identifiable.

USES OF MICRODATA FILES

Public-use microdata files essentially make possible "do-it-yourself" special tabulations. Since the 1990 files furnish nearly all of the detail recorded on long-form questionnaires in the census, subject to the limitations of sample size and geographic identification, users can construct an infinite variety of tabulations interrelating any desired set of variables.
Users have the same freedom to manipulate the data that they would have if they had collected the data in their own sample survey, yet these files offer the precision of census data collection techniques and sample sizes larger than would be feasible in most independent sample surveys.

Microdata samples will be useful to users (1) who are doing research that does not require the identification of specific small geographic areas or detailed cross tabulations for small populations, and (2) who have access to the programming and computer time needed to process the samples. Microdata users frequently study relationships among census variables not shown in existing census tabulations, or concentrate on the characteristics of certain specially defined populations, such as unemployed homeowners or families with four or more children.

SAMPLE DESIGN AND SIZE

Each microdata file is a stratified sample of the population, actually a subsample of the full census sample (approximately 15.9% of all housing units) that received census long-form questionnaires. Sampling was done housing unit-by-housing unit in order to allow study of family relationships and housing unit characteristics. Sampling of persons in institutions and other group quarters was done on a person-by-person basis. Vacant units were sampled also.

There are two independently drawn samples, designated "5% (A)" and "1% (B)," each featuring a different geographic scheme, as discussed below. Samples from the 1970 and 1960 censuses also employed a 1% sample size; the 5% sample was new for 1980.

Nationwide, the 1990 5% Sample gives the user records for over 12 million persons and over 5 million housing units. On the other hand, since processing a smaller sample is less expensive, some users will want to produce extracts using the subsample numbers provided in the housing record. Sample design is discussed more thoroughly in chapter 4.
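As a rough illustration of drawing a smaller extract with subsample numbers, the Python sketch below keeps housing units whose subsample number ends in a chosen digit and scales their weights to compensate. It assumes that subsample numbers are two-digit values from 00 to 99 assigned so that each final digit selects about one tenth of the sample; that assumption, the input shape, and the function names are hypothetical and should be checked against chapter 4 and the Data Dictionary.

```python
def in_extract(subsample, keep_digit="0"):
    """True when the (assumed two-digit) subsample number ends in keep_digit."""
    return subsample.strip().zfill(2)[-1] == keep_digit


def extract_units(units, keep_digit="0", inflation=10):
    """Select roughly one tenth of the units and inflate their weights.

    `units` is an iterable of (subsample_number, weight) pairs, a
    hypothetical shape chosen for this sketch. Each kept weight is
    multiplied by `inflation` so that weighted totals from the extract
    still estimate the full population.
    """
    return [
        (subsample, weight * inflation)
        for subsample, weight in units
        if in_extract(subsample, keep_digit)
    ]
```

Estimates from such an extract carry larger sampling error than estimates from the full file, so the choice of extract size involves the same precision trade-off as the choice of sample.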
Unlike 1980, each file contains individual weights for both the housing unit and the persons in the unit. The user can estimate the frequency of a particular characteristic for the entire population by summing the weight variables for records with that characteristic from the microdata file. A section of chapter 4 discusses the preparation and verification of estimates (see page 4-1).

Reliability improves with increases in sample size, so the choice of sample size must represent a balance between the level of precision desired and the resources available for working with microdata files. By using tables provided in chapter 3 (see page 3-2), one can estimate the degree to which sampling error will affect any specific estimate prepared from a microdata file of a particular sample size.

Many factors affect the user's decision on which file to use. Users of microdata files for State or MSA estimates would normally use a 1% or 5% sample, while users concerned only with national figures can frequently get by with a smaller sample, say a 0.1-percent (one-in-a-thousand) sample. Although we no longer provide the 0.1% file, we do provide subsample numbers which allow scientifically designed extracts of various sizes to be drawn. Even national users may need a 1% or a 5% sample if extremely detailed tabulations are needed, or if users are concerned with very small segments of the population, for example, females 75 years old or over of Italian ancestry. One of the examples in chapter 3 discusses the selection of the appropriate sample size for a particular study.

SUBJECT CONTENT

Microdata files contain the full range of population and housing information collected in the 1990 census: 500 occupation categories, age by single years up to 90, wages in dollars up to $140,000, and so forth.
Because the samples provide data for all persons living in a sampled household, users can study how characteristics of household members are interrelated (for example, income and educational attainment of husbands and wives).

Information for each housing unit in the sample appears on a 231-character record with geographic and housing items, followed by a variable number of 231-character records with person information, one record for each member of the household. Items on the housing record are listed beginning on page 5-1; items on the person record are listed beginning on page 5-4.

Although each of the items as collected is further defined in the glossary (reprinted from the 1990 Census Users' Guide) presented as appendix B to this document, it is important to note that we modified several items on the microdata file to provide protection for individual respondents. We also include many transformed variables (recodes), such as those appearing on the STF 3A files, so that users can analyze many complex relationships between records.

Data users will frequently want to generate additional variables or develop recodes to meet their individual needs. While it is impossible to predict all the transformations (recodes) required by data users, we included many of the more common ones (household income, selected monthly owner costs, poverty status, and so forth). Transformations such as these, as well as corrections that apply to certain subjects, are discussed in appendix C.

We edited the sample questionnaires for completeness and consistency, and made substitutions or allocations for any missing data. Allocation flags appear at the end of each record, indicating each item which has been allocated. Thus, a user desiring to tabulate only actually observed values can exclude values that are flagged as allocated. Editing and allocation flags are discussed beginning on page 3-15.

Figure 1.
Comparison of Summary Data With Information on Microdata Files

SUMMARY DATA

o Basic unit is an identified geographic area
o Data summarized on people and housing in areas
o Available for small areas

Illustrative Summary Data

                          Occupied  Number    Renter          Gross Rent
               Total      Housing   Persons   Occupied  Under   $100-   $150-
City           Pop        Units     Per Unit  Units     $100    149     199

Weston City    110,938    49,426    2.2       31,447    158     3,967   13,282
Smithville     21,970     7,261     3.1       2,492     37      190     1,766
Junction       17,152     5,494     2.7       822       11      29      238

PUBLIC-USE MICRODATA

o Basic unit is an unidentified housing unit and its occupants
o Unaggregated data to be summarized by the user
o Allows detailed study of relationships among characteristics
o Not available for small areas

Illustrative Microdata*
----------------------------------------------------------------------
                        Housing          Housing          Housing
                        unit #1          unit #2          unit #3

State of Residence      Virginia         Virginia         Virginia
PUMA                    Area name        Area name        Area name
                        or code          or code          or code
Persons in household    3                1                0
Telephone               Yes              Yes              N/A
Complete plumbing       Yes              Yes              Yes
Monthly rent            $525-549         $650-699         $300-324
Vehicles                2                1                N/A
Household type          Married-couple   Nonfamily        Vacant
                        family           householder

Persons:                1a            1b         1c         2a

Housing unit no.        1             1          1          2
Relationship            Householder   Spouse     Child      Householder
Sex                     M             F          M          F
Age                     37            35         6          62
Race                    W             W          W          B
Place of Birth          Kansas        Virginia   Virginia   Alabama
Occupation              Plumber       NA         NA         Postsecondary
                                                            economics teacher
Earnings                $28,100       0          0          $45,300
----------------------------------------------------------------------
* Public-use microdata samples do not actually contain alphabetic information. Such information is converted to numeric codes; for example, the State of Virginia has a numeric code of 51.

GEOGRAPHIC IDENTIFICATION

The 5% and 1% Samples each feature a different geographic scheme. We call the geographic areas PUMAs, for Public Use Microdata Areas. We use the term to apply to each of the areas identified on these files.
A 5-digit number, unique within State, identifies each PUMA. The first three digits are the PUMA code and the last two are the sub-PUMA code. The sub-PUMA code is used when counties or metropolitan areas are subdivided by groupings of census tracts. For example, the PUMAs for Bronx County, New York consist of several groups of census tracts numbered from 05101 through 05111, whereas the PUMA numbered 03500 is made up of three counties: Cortland, Tioga, and Tompkins.

In most states, the State Data Center provided the PUMAs. For the states of Georgia, Indiana, and Oregon, the Census Bureau developed the PUMAs with input from the respective State Data Center.

o The 5% Sample identifies every State, most individual counties or county equivalents with 100,000 or more inhabitants, and many individual cities or groups of places with 100,000 or more inhabitants. For counties with at least 200,000 inhabitants, groupings of census tracts are also identified. Areas with populations under 100,000 have been grouped into reasonable analytic units, often equivalent to State planning district boundaries. In New England, areas are defined in terms of cities and towns rather than counties.

o The 1% Sample identifies metropolitan areas (MAs) of 100,000 or more inhabitants. The remaining MAs are paired together so that metropolitan and nonmetropolitan territory can be separately analyzed. Many large cities, groups of cities, and counties are identified within large MAs. Outside MAs, counties are grouped according to State planning districts or into other reasonable analytic units with populations of 100,000 or more.

On the 1% Sample, when PUMAs cross state boundaries, states are not separately identified. All of these records appear on a separate file where the state is identified as "99" (see appendix G).

The characteristics of the different geographic schemes are compared in the maps and charts which follow in figures 2, 3, and 4.
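The structure of the PUMA identifier described at the start of this section can be illustrated with a minimal sketch. The function name `split_puma` is hypothetical and is not part of any Census Bureau software; the sketch only assumes that the identifier is held as a 5-character string with leading zeroes preserved.

```python
from typing import Tuple


def split_puma(identifier: str) -> Tuple[str, str]:
    """Split a 5-digit PUMA identifier into (PUMA code, sub-PUMA code).

    The identifier is kept as text so that leading zeroes survive.
    """
    if len(identifier) != 5 or not identifier.isdigit():
        raise ValueError(f"expected a 5-digit PUMA identifier, got {identifier!r}")
    # First three digits: PUMA code. Last two digits: sub-PUMA code.
    return identifier[:3], identifier[3:]


# Identifiers taken from the New York examples in the text.
for example in ("05101", "05111", "03500"):
    puma, sub_puma = split_puma(example)
    print(example, "-> PUMA", puma, "sub-PUMA", sub_puma)
```

Reading the field as text rather than as an integer is a deliberate choice here, since an integer conversion would turn 03500 into 3500 and make the split ambiguous.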
Purchasers of the 1% Sample for any of the States which include area in a county group crossing State lines may want to request that the "State Code 99" file be stacked onto a tape being purchased. At the time of this printing, we have not produced files for all States. Estimates of the number of tapes required for specified groups of files at a given density and blocking factor are available on request from Customer Services. We will issue a user note updating this information when all files are produced.

CORRESPONDING MICRODATA FROM EARLIER CENSUSES

PUMS files exist for the 1960, 1970 and 1980 censuses. Very little comparability exists between geographic identifiers on each of the previous files, but housing and population characteristics are similar. Because of this similarity, microdata files from the most recent censuses are a rich resource for analysis of trends. Items which were added, dropped, or substantially changed between 1980 and 1990 are listed in figure 5. Appendix B discusses historical comparability of items in greater detail.

CHAPTER 2. HOW TO USE THIS FILE

This chapter serves as a guide for data users to both the tape and the technical documentation. Novice users trying to understand how to use the documentation and the file should read this chapter first.

DOCUMENTATION CHAPTERS

The Abstract chapter in this documentation provides a quick overview of the file, including the formal title, geographic coverage, subject coverage, and file availability. Also shown are citations for related reference materials and printed reports. Their titles and geography are included in this section, along with purchasing information.
Chapter 1 describes microdata, chapter 3 describes accuracy of the data, and chapter 4 describes the sample design and estimation for PUMS.

USER NOTES

Information about file or documentation changes sometimes becomes available after the documentation has been printed. User notes inform the user community about these changes. These are issued in a numbered series. If there are technical documentation changes, revised pages usually accompany them. The revised pages should be inserted in their proper location, but the user note cover sheet should be filed in the User Notes chapter. Technical notes, which contain file errata, are also issued by the Census Bureau. We suggest filing these following appendix I.

DATA DICTIONARY

The data dictionary (code book) describes the file and provides character locations for each variable. The components include a short mnemonic or field name for use with software packages; field size; starting position; and a description of field contents with possible values. There also is a machine-readable data dictionary file on the data tape. This dictionary is designed to be converted for use with various software packages.

APPENDIXES

Detailed information on geographic areas is in appendix A followed by subject-matter definitions in appendix B. Appendix C provides information about the data changes on PUMS while appendix D outlines the data collection and processing procedures. Facsimiles of both the respondent instructions and 1990 census long-form questionnaire are in appendix E. Appendix F furnishes detailed information on all the data products of the 1990 census, as well as suggested sources of information and assistance. Map information is included in appendix G (to be supplied as user notes). The record layout for the machine-readable data dictionary file that accompanies each tape order is in appendix H. Appendix I contains the code lists used in processing the data for most sample products.
These are especially helpful in determining the components included in categories such as race and group quarters. On the PUMS, the information on these lists may be changed for disclosure protection purposes. Those changes are indicated in the data dictionary and further explained in appendix C.

INTERNAL FILE LABELS

System Labels

Tape orders which specify labeled tapes will have a standard American National Standards Institute (ANSI) label. The system label consists of 17 characters, but only the first 12 are active. The remaining five characters will be 'x' filled. The 1990 PUMS files have a Data Set Name (DSN) of PUMStXss.Fnnxxxxx where t is A, B, or O depending on the file, ss is the United States Postal Service (USPS) State abbreviation, and nn is a two-digit number with leading zeroes identifying the tape volume sequence. (The "X", "F", and "x" in the DSN remain constant.)

User Labels

Each user tape will have two user header labels and two user trailer labels. These labels combine information from the system label and the identification portion of the first and last record. These labels enable the user to quickly identify the beginning and ending records on each tape.

User Header Labels

The user header labels are designated UHL1 and UHL2. UHL1 and UHL2 repeat information from the system label in HDR1 and HDR2.

User Trailer Labels

The user trailer labels are designated UTL1 and UTL2. UTL1 and UTL2 contain information from the system trailer label.

STATE-SPECIFIC FILE INFORMATION

State-specific file information, such as record counts, is not provided in the technical documentation. However, each tape order is accompanied by a tape creation sheet. This sheet provides the file name, file label (HDR1), record size, block size, and record count. The tape creation sheet received with the tape should be filed in the technical documentation notebook or with other tape information maintained by the user.
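As an illustration of the Data Set Name pattern given under System Labels, the following sketch checks a label against the PUMStXss.Fnnxxxxx form and pulls out its variable parts. The function name `parse_dsn` and the sample label are hypothetical; the sketch does not say which sample each of the letters A, B, and O denotes, because that mapping is not stated here.

```python
import re
from typing import Dict

# PUMS + file letter (A, B, or O) + constant "X" + two-letter USPS state
# abbreviation + ".F" + two-digit volume number + five filler "x" characters.
DSN_PATTERN = re.compile(
    r"^PUMS(?P<file>[ABO])X(?P<state>[A-Z]{2})\.F(?P<volume>\d{2})x{5}$"
)


def parse_dsn(dsn: str) -> Dict[str, str]:
    """Return the file letter, state abbreviation, and volume number of a DSN."""
    match = DSN_PATTERN.match(dsn)
    if match is None:
        raise ValueError(f"not a 1990 PUMS data set name: {dsn!r}")
    return match.groupdict()


sample_label = "PUMSBXVA.F01xxxxx"  # made-up label for illustration only
print(len(sample_label), parse_dsn(sample_label))
```

The pattern accounts for all 17 characters of the system label: the first 12 are the active portion and the last five are the 'x' fill.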
FILE STRUCTURE

Each file consists of a series of 231-character logical records of two types: housing and person. Each housing unit record is followed by a variable number of person records, one for each member of the housing unit or none if vacant, as illustrated in figure 1. Each person in group quarters has two records--a dummy "housing unit" record (most nongeographic fields are not applicable), as well as a person record.

For 1990, we made several improvements to the file to aid in processing the data. Two improvements allowing users more processing flexibility are the inclusion of the housing unit serial number on both record types and the inclusion of individual weights on each record. Including the housing unit serial number on both records gives the user a choice of how to process the data, either rectangularly or hierarchically. With the introduction of individual weights, users can more closely approximate published data. Another improvement for 1990 is providing many of the recodes (data transformations) which appear on the summary tape file (STF 3A). While the changes increase the file size, we should see an associated increase in file utility.

In the text of this document, the numeric identification of a particular data item is the same as its character location within a record. Items on the housing record are prefixed with an H, items on the person record with a P. For instance, Race, item P12-14, is a code occupying characters 12 through 14 of the person record. We continue to provide in the data dictionary, or record layout, mnemonic identifiers, many of which are the same as those used in 1980.

Geographic identifiers and subsample identifiers appear only on the housing unit record. Thus, most tabulations of person characteristics require manipulation of both housing and person records. An item on the housing record indicates the exact number of person records following before the next housing record (PERSONS).
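The two processing styles mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the record type is assumed to sit in character 1 and the housing unit serial number in characters 2 through 8, and both positions are placeholders that should be replaced with the locations given in the data dictionary. The function names and the sample records are made up.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

RECORD_LENGTH = 231
# Hypothetical field positions (0-based slices); take the real ones
# from the data dictionary before using this on an actual file.
RECTYPE = slice(0, 1)
SERIALNO = slice(1, 8)


def households(lines: Iterable[str]) -> Iterator[Dict]:
    """Hierarchical view: each housing record with its person records."""
    current = None
    for raw in lines:
        record = raw.rstrip("\n")
        kind = record[RECTYPE]
        if kind == "H":
            if current is not None:
                yield current
            current = {"serial": record[SERIALNO], "housing": record, "persons": []}
        elif kind == "P":
            # The serial number on the person record ties it to its housing unit.
            if current is None or record[SERIALNO] != current["serial"]:
                raise ValueError("person record without a matching housing record")
            current["persons"].append(record)
        else:
            raise ValueError(f"unknown record type {kind!r}")
    if current is not None:
        yield current


def rectangular(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Rectangular view: one (housing record, person record) pair per person."""
    for unit in households(lines):
        for person in unit["persons"]:
            yield unit["housing"], person


# Made-up records: one occupied unit with three persons, then one vacant unit.
demo: List[str] = [
    "H0000001".ljust(RECORD_LENGTH),
    "P0000001".ljust(RECORD_LENGTH),
    "P0000001".ljust(RECORD_LENGTH),
    "P0000001".ljust(RECORD_LENGTH),
    "H0000002".ljust(RECORD_LENGTH),
]
for unit in households(demo):
    print(unit["serial"], len(unit["persons"]), "person record(s)")
```

The sketch groups records by serial number rather than by the PERSONS count because the serial number appears on both record types; a program could equally read the count from the housing record and consume that many person records.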
The PERSONS count allows a program to anticipate what type of record will appear next, if necessary.

In today's data manipulation environment, users have many options for processing data and are limited only by the amount and type of resources. Most statistical software packages (e.g., BMDP, SAS, and SPSS) are capable of handling the data either hierarchically or rectangularly. Many users may still want to create extract files with any desired household data repeated with each person's record. Users with limited resources (funds, personnel, software/hardware) may want to create or obtain extracts containing only those variables of interest.

All fields are numeric except the Record Type, which is coded "H" or "P."

FILE SIZE

Every file purchased from the Census Bureau includes a printout showing the total record count. Estimated file sizes are not shown here; a future user note will give record counts for each state.

RECORD SEQUENCE

We release these files on a state-by-state basis. Records on these files are sorted by geographic area within state. On the 5% and 1% Samples, all households sampled within a particular PUMA appear together. PUMAs are sequenced in ascending order within State. On the 1% Sample, this means that all PUMAs with State code suppressed (i.e., shown as 99) appear on a separate file. In order to provide an extra measure of protection from disclosure of individual households within each geographic area, we scramble the records to avoid any implication of geographic information beyond that which meets Census Bureau disclosure rules for the 1990 PUMS.

Person records within household are sequenced by relationship code (P2). The householder record always immediately follows the housing unit record for an occupied unit.
This feature simplifies tabulation of households or families by race of householder, ancestry of householder, and even poverty status--since the desired indicators are always on the first person record. Where the household contains more than one person of a given relationship, person records appear in sequence of decreasing age (P8-9). Persons sampled from within the same group quarters are not identifiable as such, since each has an independent dummy housing unit record.

MACHINE-READABLE DOCUMENTATION

Every file includes a machine-readable "data dictionary" or record layout. Irrespective of the PUMS sample used, the record layout is the same. A user can produce hard copy documentation for extract files or labels for tabulations created; or with minor modifications, can use the data dictionary file with software packages or user programs to automatically specify the layout of the microdata files.

Also available in machine-readable form is the PUMA Equivalency File, which lists the geographic components (counties or MCDs, places, tracts where available) and their assigned PUMA codes for the 5% and 1% samples.

HANDLING INVALID CODES

The data dictionary shows each category as having a unique representation. Although we reviewed test files for several states, we may have a small number of cases outside the specified range for a variable. We will correct these errors when found, but users may follow the standard census practice of assigning invalid codes to the next lower numbered valid category. For example, on an allocation flag with valid codes 0, 2 and 3, a 1 would be counted with code 0, and a code of 4 or more would be counted with 3. Exceptions to this rule occur in occupation and industry codes, where invalid codes are assigned to the next higher valid category.
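The invalid-code convention just described can be sketched as a small helper. The function name `recode_invalid` is hypothetical. Two details go beyond what the text states and are assumptions of this sketch: a code below the lowest valid category is assigned to the lowest category, and, for occupation or industry, a code above the highest valid category is assigned to the highest category.

```python
from bisect import bisect_left, bisect_right
from typing import Iterable


def recode_invalid(code: int, valid_codes: Iterable[int], next_higher: bool = False) -> int:
    """Map an out-of-range code to a neighbouring valid category.

    next_higher=False: next lower valid category (the general rule).
    next_higher=True:  next higher valid category (occupation and industry).
    """
    valid = sorted(set(valid_codes))
    if code in valid:
        return code
    if next_higher:
        position = bisect_left(valid, code)
        # Assumption: nothing higher exists, so fall back to the highest category.
        return valid[min(position, len(valid) - 1)]
    position = bisect_right(valid, code) - 1
    # Assumption: nothing lower exists, so fall back to the lowest category.
    return valid[max(position, 0)]


# Allocation flag example from the text: valid codes 0, 2, and 3.
flag_codes = [0, 2, 3]
print([recode_invalid(c, flag_codes) for c in (0, 1, 2, 3, 4, 7)])

# Made-up occupation codes, to show the "next higher" exception.
print(recode_invalid(4, [3, 5, 10], next_higher=True))
```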
PREPARING AND VERIFYING TABULATIONS

Estimation of totals - Estimates of complete-count census figures may be made from tabulations of public use microdata samples by using a simple inflation estimate, that is, by summing the weights of the records with the characteristic of interest (for housing characteristics, use the housing unit weight; for person characteristics, use the person weight). Users who use subsample numbers to vary the sample size must apply an appropriate factor, or otherwise adjust the weights, to derive an appropriate estimate of totals. We further explain the use of weights and subsample numbers in Chapter 4.

Estimation of percentages - A user can estimate percentages by simply dividing the weighted estimate of persons or housing units with a given characteristic by the weighted sample estimate for the base. Normally, this yields about the same result as would be obtained if one made the computation using sample tallies rather than weighted estimates. For example, the percentage of housing units with air conditioning in a one-in-one-hundred sample can be obtained by simply dividing the tally of sample housing units with air conditioning by the total number of sample housing units.

Verifying tabulations - Producing desired estimates from the public-use microdata samples is relatively easy. File structure and coding of items are straightforward. There are no missing data (see the section on allocations, page 3-38). Records not applicable for each item are assigned to specific "NA" categories, and it is frequently not necessary to determine in a separate operation whether a record is in the universe or not. PUMS "universe" and "variable" definitions may differ from other products produced from sample data primarily because of concerns about disclosure risks (e.g., PUMS files may have different topcodes from STF 3A, or the recodes may vary because the components were topcoded).
A user must, however, anticipate the possibility of errors in his or her own processing. Thus, user tabulations should be verified against other available tallies. Two ways for the user to verify estimates follow:

1. Using control tabulations from the samples. As each public-use microdata sample was produced, counts of persons, housing units, vacant housing units, and group quarters persons selected into the sample were tallied within each identified geographic area. These control counts will be published as a supplement to this documentation. (In the interim, counts for specific areas may be requested from Customer Services.) If users cannot replicate these exact counts, a review of the user's programs and of the shipping advices accompanying the files is in order.

2. Using published data from the 1990 census. Tabulations from the 1990 census data base are available in the printed census publications and on summary tape files. Users may check the reasonableness of statistics derived from public-use microdata samples against these sources. A familiarity with summary data already available may also facilitate planning of tabulations to be made from microdata. Those publications series likely to be of greatest use for this purpose are listed in Figure 5.

In comparing sample tabulations with published data one must carefully note the universe of the published tabulation. For instance, on microdata records, Industry (P87-89) is reported for the civilian labor force and for persons not in the labor force who reported having worked in 1985 or later. Industry tabulations in 1990 census publications are presented only for the employed population or the experienced civilian labor force. Thus, a tally of Industry for all persons for whom industry is reported in microdata records would not correspond directly to any published tabulation. A user should always pay particular attention to concept definitions as presented in the glossary.
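The inflation estimates described in this section, together with the need to restrict a tally to the right universe, can be sketched as follows. The field names (`weight`, `age`, `in_clf`) and the four sample records are made up for illustration; actual field names and codes come from the data dictionary.

```python
from typing import Callable, Dict, List

Record = Dict[str, float]


def weighted_total(records: List[Record], in_cell: Callable[[Record], bool],
                   weight: str = "weight") -> float:
    """Inflation estimate of a total: sum the weights of qualifying records."""
    return sum(r[weight] for r in records if in_cell(r))


def weighted_percent(records: List[Record], in_cell: Callable[[Record], bool],
                     in_base: Callable[[Record], bool], weight: str = "weight") -> float:
    """Weighted estimate for the cell divided by the weighted estimate for the base."""
    base = weighted_total(records, in_base, weight)
    if base == 0:
        raise ValueError("the base universe is empty")
    cell = weighted_total(records, lambda r: in_base(r) and in_cell(r), weight)
    return 100.0 * cell / base


# Made-up person records with individual weights.
persons: List[Record] = [
    {"weight": 20, "age": 37, "in_clf": 1},
    {"weight": 22, "age": 35, "in_clf": 0},
    {"weight": 18, "age": 6, "in_clf": 0},
    {"weight": 25, "age": 62, "in_clf": 1},
]


def aged_16_and_over(r: Record) -> bool:
    return r["age"] >= 16


def in_civilian_labor_force(r: Record) -> bool:
    return r["in_clf"] == 1


print(weighted_total(persons, lambda r: aged_16_and_over(r) and in_civilian_labor_force(r)))
print(round(weighted_percent(persons, in_civilian_labor_force, aged_16_and_over), 1))
```

Passing the universe as an explicit `in_base` test keeps the base of each percentage visible, which makes it easier to compare a tabulation with a published table that uses a particular universe.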
One cannot, of course, expect exact agreement between census publications, which are based on the complete census count, full sample estimates, or a subsample of the census sample, and user estimates based on tallies of a 5-percent or smaller sample. They will inevitably differ to some extent due to chance in selection of actual cases for Public Use Microdata Samples. Since the amount of likely chance variation for a given statistic can be measured, any discrepancy beyond a certain level can be identified as a likely error in programming. Chapter 3 discusses sampling variability and its measurement.

User experience has indicated that careful verification of sample tabulations is essential -- so important that it may frequently be advisable to include additional cells in a tabulation for no other reason than to provide counts or to yield marginal totals, not otherwise available, which may be verified against available tabulations.

Figure 5. 1980-1990 Subject Comparability

Most of the items for 1990 are comparable to 1980. Several items found in 1980 PUMS are not in the 1990 file primarily because the inquiries were not asked or because we are providing a measure of protection for respondents. Full descriptions of item comparability are given in appendix B. However, users should read appendix C for differences between PUMS definitions and those of other census products.
1990 Items Not on 1980 Files

o Condominium fees
o Employment status of parents recode
o Flag indicating all 100% person's data substituted
o Flag indicating all 100% housing unit data substituted
o Gross rent as a percentage of 1989 household income
o Household language recode
o Housing unit/GQ person serial number
o Housing unit weight
o Time of departure for work
o Linguistic isolation recode
o Married, spouse present/absent recode
o Mobile home costs
o Number of related children in household recode
o Number of stepchildren in household recode
o Number of persons in family recode
o Person's weight
o Presence of subfamilies in household
o Presence of person under 65 years in household
o Presence of person under 60 years in household
o Presence of nonrelatives in household
o Presence of person under 18 years in household
o Rental unit recode
o Selected monthly owner costs as a percentage of 1989 household income
o Value unit recode
o Workers in family recode
o Years of active military duty

1980 Items Not on 1990 Files

o Access to unit
o Age at first marriage
o Bathrooms
o Cooking fuel
o Heating equipment
o Passenger elevator
o Place of work SMSA recode
o Place of work place size recode
o Place of work central city recode
o Quarter of birth
o Spanish surname
o Stories in structure

CONCEPTS SUBSTANTIALLY CHANGED

o Grade & Finished Highest Grade - now combined and grouped to show highest level completed
o Race - Several categories added including 25 American Indian tribes
o Spanish origin - Now Hispanic origin showing an expanded list of countries

CHAPTER 3 - ACCURACY OF THE MICRODATA SAMPLE ESTIMATES

INTRODUCTION

The tabulations prepared from a public use microdata sample are based on a subset of the 1990 Census sample. The data summarized from this file are estimates of the actual figures that would have been obtained from a 100-percent enumeration.
Estimates derived from this sample are expected to be different from the 100-percent figures because they are subject to sampling and nonsampling errors. Sampling error in data arises from the selection of persons and housing units to be included in the sample. Nonsampling error affects both sample and 100 percent data. Errors are introduced during the collection and processing phases of the census. A more detailed discussion of both sampling and nonsampling error is given below. In microdata samples, the basic unit is an individual housing unit and the persons who live in occupied housing units or group quarters. However, microdata records in these samples do not contain names or addresses. A more detailed discussion of methods to protect confidentiality of individual responses follows. CONFIDENTIALITY OF THE DATA To maintain the confidentiality required by law (Title 13, United States Code), the Bureau of the Census applies a confidentiality edit to the 1990 census data to assure that published data do not disclose information about specific individuals, households, or housing units. As a result, a small amount of uncertainty is introduced into the estimates of census characteristics. The sample itself provides adequate protection for most areas for which sample data are published since the resulting data are estimates of the actual counts; however, small areas require more protection. The edit is controlled so that the basic structure of the data are preserved. The confidentiality edit is implemented by selecting a small subset of individual households from the internal sample data files and blanking a subset of the data items on these household records. Responses to those data items were then imputed using the same imputation procedures that were used for nonresponse. A larger subset of households is selected for the confidentiality edit for small areas to provide greater protection for these areas. 
The editing process is implemented in such a way that the quality and usefulness of the data are preserved.

Since microdata records are the actual housing unit and person records, the Bureau of the Census takes further steps to prevent the identification of specific individuals, households, or housing units. The main disclosure avoidance method used is to limit the geographic detail shown in the files. A geographic area must have a minimum of 100,000 population to be fully identified. Furthermore, certain variables are topcoded, or the actual value of the characteristics is replaced by a descriptive statistic, such as the median.

SOURCES OF ERRORS IN THE DATA

Since the estimates that users produce are based on a sample, they may differ somewhat from 100-percent figures that would have been obtained if all housing units, persons within those housing units, and persons living in group quarters had been enumerated using the same questionnaires, instructions, enumerators, and so forth. The sample estimate also would differ from other samples of housing units, persons within those housing units, and persons living in group quarters. The deviation of a sample estimate from the average of all possible samples is called the sampling error. The standard error of a sample estimate is a measure of the variation among the estimates from all the possible samples, and thus, is a measure of the precision with which an estimate from a particular sample approximates the average result of all possible samples. The sample estimate and its estimated standard error permit the construction of interval estimates with prescribed confidence that the interval includes the average result of all samples. The method of calculating standard errors and confidence intervals for the data in the microdata samples is described in the next section.
In addition to the variability which arises from the sampling procedures, both sample data and 100-percent data are subject to nonsampling error. Nonsampling error may be introduced during any of the various complex operations used to collect and process census data. For example, operations such as editing, reviewing, or handling questionnaires may introduce error into the data. A detailed discussion of the sources of nonsampling error is given in the section on "Control of Nonsampling Error" in this chapter.

Nonsampling error may affect the data in two ways. Errors that are introduced randomly will increase the variability of the data and should, therefore, be reflected in the standard error. Errors that tend to be consistent in one direction will make both sample and 100-percent data biased in that direction. For example, if respondents consistently tend to underreport their income, then the resulting counts of households or families by income category will tend to be understated for the higher income categories and overstated for the lower income categories. Such biases are not reflected in the standard error.

CALCULATIONS OF STANDARD ERRORS USING TABLES

A standard sampling theory text should be helpful if the user needs more information about confidence intervals and nonsampling errors.

Two methods for estimating standard errors of estimated totals and percentages are described in this section. The first method is very simple. This method uses already calculated standard errors for specific sizes of estimated totals and percentages given in tables A through F, shown later in this section. The estimated standard errors shown in tables A through F were calculated assuming simple random sampling while the microdata sample (and the census sample) were selected using a systematic sampling procedure.
The numbers shown in table G, referred to as design factors, are defined as the ratio of the standard error from the actual sample design to the standard error from a simple random sample. The standard errors in tables A through F used in conjunction with the appropriate design factors from table G produce a reasonable measure of reliability for microdata sample estimates. Public use microdata sample data users will receive table G, the Table of Design Factors, as a supplement to the technical documentation. An alternative methodology by which more precise standard errors can be obtained requires additional data processing and file manipulation. The trade off is more precision for more data processing. However, with the technology available today, the second method is preferable and strongly recommended. However, the standard error tables could be very useful. For instance, they would be useful when one is trying to determine, prior to purchase, whether a 1-percent sample will yield estimates of adequate precision for a given study, or whether it is necessary to use the 5-percent sample instead. For these purposes the method described in this section should produce an acceptable approximation. On the other hand, for many statistics, particularly from detailed cross-tabulations, standard errors using the second method are also applicable to a wider variety of statistics, e.g., means and ratios. To produce standard error estimates, one obtains (1) the unadjusted standard error for the characteristic that would result from a simple random sample design (of persons, families, or housing units) and estimation technique; and (2) a design factor, which partially reflects the effects of the actual sample design and estimation procedure used for the 1990 census public use microdata sample, for the particular characteristic estimated. 
The design factors provided in this chapter are based on computations from the full census sample and, as such, do not reflect the additional stratification used in the selection of the public use microdata samples (see Chapter 4). In general, these factors will provide conservative estimates of the standard error. In addition, these factors only pertain to individual data items (e.g., educational attainment, employment status) and are not entirely appropriate for use with detailed cross-tabulated data.

To calculate the approximate standard error of a 5-percent or 1-percent sample estimate, follow the steps given below.

1. Obtain the unadjusted standard error for the sampling rate to be used from table A, C, or E for estimated totals or from table B, D, or F for estimated percentages. Alternatively, the formula given at the bottom of each table may be used to calculate the unadjusted standard error (for sample sizes other than 5 or 1 percent, see the subsampling section). In using table A, C, or E, or the corresponding formulas for estimated totals, use weighted figures rather than unweighted sample counts to select the appropriate row. To select the applicable column for person characteristics, use the total population in the area being tabulated (not just the total of the universe being examined), or use the total count of housing units if the estimated total is a housing unit characteristic. Similarly, in using table B, D, or F, or the corresponding formula for estimated percentages, use weighted figures to select the appropriate column.

2. Use table G to obtain the design factor for the characteristic (e.g., place of work or educational attainment). If the estimate is a cross-tabulation of more than one characteristic, scan table G for each appropriate factor and use the largest factor.

3. Multiply the unadjusted standard error from step 1 by the factor obtained in step 2.
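The steps above can be sketched in a few lines. The sketch uses the simple-random-sampling formulas in the form they take in the worked examples that follow, where the multiplier 19 applies to the 5-percent sample. The default of 19 and the function names are choices of this sketch; the multiplier for another sampling rate should be taken from the formula printed with the corresponding table. The design factor of 1.2 is the supposed value used in the examples, not a figure from table G.

```python
import math
from typing import Iterable


def unadjusted_se_total(estimate: float, area_total: float, multiplier: float = 19.0) -> float:
    """Step 1 for a total: sqrt(multiplier * Y * (1 - Y / N)).

    estimate   -- weighted estimate Y of the characteristic
    area_total -- weighted count N of all persons (or housing units) in the area
    """
    return math.sqrt(multiplier * estimate * (1.0 - estimate / area_total))


def unadjusted_se_percent(percent: float, base: float, multiplier: float = 19.0) -> float:
    """Step 1 for a percentage: sqrt(multiplier * p * (100 - p) / B)."""
    return math.sqrt(multiplier * percent * (100.0 - percent) / base)


def adjusted_se(unadjusted: float, design_factors: Iterable[float]) -> float:
    """Steps 2 and 3: multiply by the largest applicable design factor."""
    return unadjusted * max(design_factors)


# Figures from the worked examples: 59,948 persons in the civilian labor force
# out of 131,220 persons in the county, and 62.6 percent of a base of 95,763.
se_total = unadjusted_se_total(59_948, 131_220)
se_percent = unadjusted_se_percent(62.6, 95_763)
print(round(se_total), round(se_percent, 2))
print(round(adjusted_se(se_percent, [1.2]), 2))
```

Taking the maximum over a list of design factors mirrors the instruction to use the largest factor when an estimate cross-tabulates more than one characteristic.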
Example 1: Standard Error of a Total - Suppose we tally a 5-percent public use microdata sample for state A. Further, suppose that for county A, the sum of the PUMS weights for all persons is 131,220. The sum of the PUMS weights for those persons who are age 16 years and over and in the civilian labor force is 59,948. The basic standard error for the estimated total is obtained from table A or from the formula given below table A. To avoid interpolation, the use of the formula will be demonstrated here. The formula for the basic standard error, SE, is:

$$SE(59{,}948) = \sqrt{19\,(59{,}948)\left(1 - \frac{59{,}948}{131{,}220}\right)} \approx 787 \text{ persons}$$

The standard error of the estimated 59,948 persons 16 years and over who were in the civilian labor force is found by multiplying the basic standard error 787 by the appropriate design factor (Employment Status) from table G. Suppose the design factor for Employment Status is 1.2; then the standard error is

$$SE(59{,}948) = 787 \times 1.2 \approx 944 \text{ persons}$$

Note that in this example the total weighted count of persons in county A of 131,220 was used.

Example 2: Standard Error of a Percent - Suppose there are 95,763 persons in county A in state A aged 16 years and over. The estimated percent of persons 16 years and over who were in the civilian labor force is 62.6. Using the formula given in table B, the unadjusted standard error is found to be approximately 0.68 percent. The standard error for the estimated 62.6 percent of persons 16 years and over who were in the civilian labor force is 0.68 (1.2) = 0.82 percentage points.

Note that in this example the base is defined as the weighted count of persons 16 years old and over.

A note of caution concerning numerical values is necessary. Standard errors of percentages derived in this manner are approximate. Calculations can be expressed to several decimal places, but to do so would indicate more precision in the data than is justifiable.
Final results should contain no more than two decimal places.

Sums and Differences - The standard errors estimated from these tables are not directly applicable to sums of and differences between two sample estimates. To estimate the standard error of a sum or difference, the tables are to be used somewhat differently in the following three situations:

1. For the sum of or difference between a sample estimate and a 100-percent value, use the standard error of the sample estimate. The complete count value is not subject to sampling error.

2. For the sum of or difference between two sample estimates, the appropriate standard error is approximately the square root of the sum of the two individual standard errors squared, that is, for standard errors $SE_X$ and $SE_Y$ of estimates X and Y:

$$SE_{(X+Y)} = SE_{(X-Y)} = \sqrt{(SE_X)^2 + (SE_Y)^2}$$

This method, however, will underestimate (overestimate) the standard error if the two items in a sum are highly positively (negatively) correlated or if the two items in a difference are highly negatively (positively) correlated. This method may also be used for the difference between (or sum of) sample estimates from two censuses or from a census sample and another survey. The standard error for estimates not based on the 1990 census sample must be obtained from an appropriate source outside of this appendix.

3. For the differences between two estimates, one of which is a subclass of the other, use the tables directly where the calculated difference is the estimate of interest. For example, to determine the estimate of non-Black teachers, one may subtract the estimate of Black teachers from the estimate of total teachers. To determine the standard error of the estimate of non-Black teachers apply the above formula directly.

Ratios - Frequently, the statistic of interest is the ratio of two variables, where the numerator is not a subset of the denominator.
An example is the ratio of teachers to students in public elementary schools. The standard error of the ratio between two sample estimates is estimated as follows:

1. If the ratio is a proportion, then follow the procedure outlined for "Totals and Percentages."

2. If the ratio is not a proportion, then approximate the standard error using the formula below.

$$SE_{(X/Y)} = \frac{X}{Y}\sqrt{\frac{(SE_X)^2}{X^2} + \frac{(SE_Y)^2}{Y^2}}$$

Medians - For the standard error of the median of a characteristic, it is necessary to examine the distribution from which the median is derived, as the size of the base and the distribution itself affect the standard error. An approximate method is given here. As the first step, compute one-half of the number on which the median is based (refer to this result as N/2). Treat N/2 as if it were an ordinary estimate and obtain its standard error as instructed above. Compute the desired confidence interval about N/2. Starting with the lowest value of the characteristic, cumulate the frequencies in each category of the characteristic until the sum equals or first exceeds the lower limit of the confidence interval about N/2. By linear interpolation, obtain a value of the characteristic corresponding to this sum. This is the lower limit of the confidence interval of the median. In a similar manner, continue cumulating frequencies until the sum equals or first exceeds the upper limit of the interval about N/2. Interpolate as before to obtain the upper limit of the confidence interval for the estimated median. When interpolation is required in the upper open-ended interval of a distribution to obtain a confidence bound, use 1.5 times the lower limit of the open-ended interval as the upper limit of that interval.
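The median procedure described above can be sketched in Python as follows. This is an illustration only, not part of the Census Bureau documentation. The function names, the representation of the distribution as (lower limit, upper limit, weighted count) categories, and the example figures are all hypothetical. The sketch assumes a 5-percent sample (factor 19), that N/2 is treated as a total within an area of a given size, and that the caller has already replaced any open-ended upper limit with 1.5 times its lower limit.

```python
import math


def se_total(y, area_size, k=19.0, design_factor=1.0):
    """Approximate standard error of a weighted total y.

    k is 1/f - 1 for sampling rate f (19 for a 5-percent sample).
    """
    return design_factor * math.sqrt(k * y * (1.0 - y / area_size))


def value_at_cumulative_count(bins, target):
    """Return the characteristic value at which the cumulative
    weighted count reaches `target`, by linear interpolation.

    bins: list of (lower, upper, weighted_count), in ascending order.
    """
    cumulative = 0.0
    for lower, upper, count in bins:
        if count > 0 and cumulative + count >= target:
            share = (target - cumulative) / count
            return lower + share * (upper - lower)
        cumulative += count
    raise ValueError("target exceeds the total of the distribution")


def median_confidence_interval(bins, area_size, z=1.645,
                               k=19.0, design_factor=1.0):
    """Confidence interval for a median (z=1.645 gives 90 percent)."""
    base = sum(count for _, _, count in bins)
    half = base / 2.0
    se_half = se_total(half, area_size, k, design_factor)
    lower_count = half - z * se_half
    upper_count = half + z * se_half
    return (value_at_cumulative_count(bins, lower_count),
            value_at_cumulative_count(bins, upper_count))


# Hypothetical income distribution (dollars, weighted counts).
example_bins = [
    (0, 10000, 2000),
    (10000, 20000, 3000),
    (20000, 30000, 3000),
    (30000, 50000, 2000),
]
low, high = median_confidence_interval(example_bins, area_size=20000)
print(round(low), round(high))
```

The interpolation step assumes values are spread evenly within each category, which is the same assumption the linear interpolation in the text makes.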
CONFIDENCE INTERVALS AND INFERENCES BASED ON THE SAMPLE

A sample estimate and its estimated standard error may be used to construct confidence intervals about the estimate. These intervals are ranges that will contain the average value of the estimated characteristic that results over all possible samples, with a known probability. For example, if all possible samples that could result under the 1990 census sample design were independently selected and surveyed under the same conditions, and if the estimate and its estimated standard error were calculated for each of these samples, then:

1. Approximately 68 percent of the intervals from one estimated standard error below the estimate to one estimated standard error above the estimate would contain the average result from all possible samples.

2. Approximately 90 percent of the intervals from 1.645 times the estimated standard error below the estimate to 1.645 times the estimated standard error above the estimate would contain the average result from all possible samples.

3. Approximately 95 percent of the intervals from two estimated standard errors below the estimate to two estimated standard errors above the estimate would contain the average result from all possible samples.
Table A: Unadjusted Standard Errors for Estimated Totals, 5 Percent Sample

  Estimated               Size of Geographic Area Tabulated (2)
  Total (1)  100,000  250,000  500,000  750,000      1 M      5 M     10 M     25 M
      1,000      140      140      140      140      140      140      140      140
      2,500      220      220      220      220      220      220      220      220
      5,000      300      310      310      310      310      310      310      310
     10,000      410      430      430      430      430      440      440      440
     15,000      490      520      530      530      530      530      530      530
     25,000      600      650      670      680      680      690      690      690
     75,000      600    1,000    1,100    1,130    1,150    1,180    1,190    1,190
    100,000        -    1,070    1,230    1,280    1,310    1,360    1,370    1,380
    250,000        -        -        -    1,280    1,890    2,120    2,150    2,170
    500,000        -        -        -    1,780    2,180    2,920    3,000    3,050
    750,000        -        -        -        -      968    3,480    3,630    3,717
  1,000,000        -        -        -        -        -    3,900    4,140    4,270
  5,000,000        -        -        -        -        -        -    6,980    8,720
 10,000,000        -        -        -        -        -        -        -   10,680

1. For estimated totals larger than 10,000,000, the standard error is somewhat larger than the table values. The formula given below should be used to calculate the standard error.

$$SE(Y) = \sqrt{19\,Y\left(1 - \frac{Y}{N}\right)}$$

Where: N = Size of area
       Y = Estimate of characteristic total

2. Total count of persons, housing units, or families in area if the estimated total is a person, housing unit, or family characteristic, respectively.
______________________________________________________________________

Table B: Unadjusted Standard Error for Estimated Percentages, 5 Percent Sample
(Standard errors expressed in percentage points)

Estimated                     Base (weighted total) of percentage (1)
Percent       1000    1500    2500    5000    7500   10000   25000   50000  100000  250000  500000
2 or 98        1.9     1.6     1.2     0.9     0.7     0.6     0.4     0.3     0.2     0.1     0.1
5 or 95        3.0     2.4     1.9     1.3     1.1     1.0     0.6     0.4     0.3     0.2     0.1
10 or 90       4.1     3.4     2.6     1.8     1.5     1.3     0.8     0.6     0.4     0.3     0.2
15 or 85       4.9     4.0     3.1     2.2     1.8     1.6     1.0     0.7     0.5     0.3     0.2
20 or 80       5.5     4.5     3.5     2.5     2.0     1.7     1.1     0.8     0.6     0.3     0.2
25 or 75       6.0     4.9     3.8     2.7     2.2     1.9     1.2     0.8     0.6     0.4     0.3
30 or 70       6.3     5.2     4.0     2.8     2.3     2.0     1.3     0.9     0.6     0.4     0.3
35 or 65       6.6     5.4     4.2     2.9     2.4     2.1     1.3     0.9     0.7     0.4     0.3
50             6.9     5.6     4.4     3.1     2.5     2.2     1.4     1.0     0.7     0.4     0.3

1. For a percentage and/or base of percentage not shown in the table, the formula given below may be used to calculate the standard error.

$$SE(p) = \sqrt{\frac{19}{B}\,p\,(100 - p)}$$

Where: B = Base of estimated percentage
       p = Estimated percentage

______________________________________________________________________

Table C: Unadjusted Standard Errors for Estimated Totals, 1 Percent Sample

  Estimated               Size of Geographic Area Tabulated (2)
  Total (1)  100,000  250,000  500,000  750,000      1 M      5 M     10 M     25 M
      1,000      310      310      310      310      310      310      310      310
      2,500      490      500      500      500      500      500      500      500
      5,000      690      700      700      700      700      700      700      700
     10,000      940      970      980      990      990      990      990      990
     15,000    1,120    1,180    1,200    1,210    1,210    1,210    1,210    1,210
     25,000    1,260    1,490    1,530    1,550    1,550    1,570    1,570    1,570
     75,000    1,360    2,280    2,510    2,590    2,620    2,700    2,710    2,720
    100,000        -    2,440    2,810    2,930    2,980    3,110    3,130    3,140
    250,000        -        -    3,520    4,060    4,310    4,850    4,910    4,950
    500,000        -        -        -    4,060    4,970    6,670    6,860    6,960
    750,000        -        -        -        -    7,462    7,944    8,287    8,787
  1,000,000        -        -        -        -        -    8,900    9,440    9,750
  5,000,000        -        -        -        -        -        -   15,730   19,900
 10,000,000        -        -        -        -        -        -        -   24,370

1. For estimated totals larger than 10,000,000, the standard error is somewhat larger than the table values. The formula given below should be used to calculate the standard error.

$$SE(Y) = \sqrt{99\,Y\left(1 - \frac{Y}{N}\right)}$$

Where: N = Size of area
       Y = Estimate of characteristic total

2. Total count of persons, housing units, or families in area if the estimated total is a person, housing unit, or family characteristic, respectively.

______________________________________________________________________

Table D: Unadjusted Standard Error for Estimated Percentages, 1 Percent Sample
(Standard errors expressed in percentage points)

Estimated                     Base (weighted total) of percentage (1)
Percent       1000    1500    2500    5000    7500   10000   25000   50000  100000  250000  500000
2 or 98        4.4     3.6     2.8     2.0     1.6     1.4     0.9     0.6     0.4     0.3     0.2
5 or 95        6.9     5.6     4.3     3.1     2.5     2.2     1.4     1.0     0.7     0.4     0.3
10 or 90       9.4     7.7     6.0     4.2     3.4     3.0     1.9     1.3     0.9     0.6     0.4
15 or 85      11.2     9.2     7.1     5.0     4.1     3.6     2.2     1.6     1.1     0.7     0.5
20 or 80      12.6    10.3     8.0     5.6     4.6     4.0     2.5     1.8     1.3     0.8     0.6
25 or 75      13.6    11.1     8.6     6.1     5.0     4.3     2.7     1.9     1.4     0.9     0.6
30 or 70      14.4    11.8     9.1     6.4     5.3     4.6     2.9     2.0     1.4     0.9     0.6
35 or 65      15.0    12.8     9.5     6.7     5.5     4.7     3.0     2.1     1.5     0.9     0.7
50            15.8    12.8     9.9     7.0     5.7     5.0     3.1     2.2     1.6     1.0     0.7

1. For a percentage and/or base of percentage not shown in the table, the formula given below may be used to calculate the standard error.

$$SE(p) = \sqrt{\frac{99}{B}\,p\,(100 - p)}$$

Where: B = Base of estimated percentage (weighted total)
       p = Estimated percentage

______________________________________________________________________

Table E: Unadjusted Standard Errors for Estimated Totals, 3 Percent Sample

  Estimated               Size of Geographic Area Tabulated (2)
  Total (1)      50K     100K     250K     500K   1,000K   5,000K  10,000K  25,000K
      1,000      180      180      180      180      180      180      180      180
      2,500      280      280      290      290      290      290      290      290
      5,000      390      390      400      400      410      410      410      410
     10,000      510      540      560      570      570      570      570      570
     15,000      590      650      680      690      700      700      700      700
     25,000      640      780      860      880      890      900      900      900
     75,000        -      780    1,310    1,440    1,500    1,550    1,560    1,560
    100,000        -        -    1,400    1,610    1,710    1,780    1,790    1,800
    250,000        -        -        -    2,010    2,470    2,780    2,810    2,830
    500,000        -        -        -        -    2,850    3,820    3,920    3,980
    750,000        -        -        -        -    2,460    4,540    4,736    4,580
  1,000,000        -        -        -        -        -    5,090    5,400    5,850
  5,000,000        -        -        -        -        -        -    8,990   11,380
 10,000,000        -        -        -        -        -        -        -   13,930

1. For estimated totals larger than 10,000,000, the standard error is somewhat larger than the table values. The formula given below should be used to calculate the standard error.

$$SE(Y) = \sqrt{\frac{97}{3}\,Y\left(1 - \frac{Y}{N}\right)}$$

Where: N = Size of area
       Y = Estimate of characteristic total

2. Total count of persons, housing units, or families in area if the estimated total is a person, housing unit, or family characteristic, respectively.
______________________________________________________________________

Table F: Unadjusted Standard Error for Estimated Percentages, 3 Percent Sample
(Standard errors expressed in percentage points)

Estimated                     Base (weighted total) of percentage (1)
Percent       1000    1500    2500    5000    7500   10000   25000   50000  100000  250000  500000
2 or 98        2.5     2.0     1.6     1.1      .9      .8      .5      .6      .2      .1      .1
5 or 95        3.9     3.2     2.4     1.7     1.4     1.2      .8      .0      .4      .2      .2
10 or 90       5.4     4.4     3.4     2.4     2.0     1.7     1.1      .3      .5      .3      .2
15 or 85       6.4     5.2     4.1     2.9     2.3     2.1     1.3      .6      .6      .4      .3
20 or 80       7.2     5.9     4.5     3.2     2.6     2.3     1.4     1.0      .7      .5      .3
25 or 75       7.8     6.4     4.9     3.5     2.8     2.5     1.6     1.1      .8      .5      .3
30 or 70       8.2     6.7     5.2     3.7     3.0     2.6     1.6     1.1      .8      .5      .3
35 or 65       8.6     7.0     5.4     3.8     3.1     2.7     1.7     1.2      .9      .5      .4
50             9.0     7.3     5.7     4.0     3.3     2.8     1.8     1.3      .9      .6      .4

1. For a percentage and/or base of percentage not shown in the table, the formula given below may be used to calculate the standard error.

$$SE(p) = \sqrt{\frac{97}{3B}\,p\,(100 - p)}$$

Where: B = Base of estimated percentage (weighted total)
       p = Estimated percentage

______________________________________________________________________

Table G. Standard Error Design Factors-United States
(Percent of persons or housing units in sample)

Characteristic                                        Design factors

POPULATION
  Age
  Sex
  Race
  Hispanic origin (of any race)
  Marital status
  Household type and relationship
  Children ever born
  Work disability and mobility limitation status
  Ancestry
  Place of birth
  Citizenship
  Migration (Residence in 1985)
  Year of entry
  Language spoken at home and ability to speak English
  Educational attainment
  School enrollment
  Type of residence (urban/rural)
  Household type
  Family type
  Group quarters
  Subfamily type and presence of children
  Employment status
  Industry
  Occupation
  Class of worker
  Hours per week and weeks worked in 1989
  Number of workers in family
  Place of work
  Means of transportation to work
  Travel time to work
  Vehicle occupancy
  Time of departure for work
  Type of income in 1989
  Household income in 1989
  Family income in 1989
  Poverty status in 1989 (persons)
  Poverty status in 1989 (families)
  Armed Forces and veteran status

HOUSING
  Age of householder
  Race of householder
  Hispanic origin of householder
  Type of residence (urban/rural)
  Condominium status
  Units in structure
  Tenure
  Occupancy status
  Value
  Gross rent
  Household income in 1989
  Year structure built
  Rooms, bedrooms
  Kitchen facilities
  Source of water, plumbing facilities
  Sewage disposal
  House heating fuel
  Telephone in housing unit
  Vehicles available
  Year householder moved into structure
  Mortgage status and monthly mortgage costs
  Mortgage status and selected monthly owner costs
  Gross rent as a percentage of household income in 1989
  Household income in 1989 by selected owner costs as a percentage of household income

The intervals are referred to as 68 percent, 90 percent, and 95 percent confidence intervals, respectively. The average value of the estimated characteristic that could be derived from all possible samples either is or is not contained in any particular computed interval. Thus, we cannot make the statement that the average value has a certain probability of falling between the limits of the calculated confidence interval.
Rather, one can say with a specified level of confidence that the calculated confidence interval includes the average estimate from all possible samples (approximately the 100-percent value).

Confidence intervals also may be constructed for the ratio, sum of, or difference between two sample figures. This is done by first computing the ratio, sum, or difference, then obtaining the standard error of the ratio, sum, or difference (using the formulas given earlier), and finally forming a confidence interval for this estimated ratio, sum, or difference as above. One can then say with specified confidence that this interval includes the ratio, sum, or difference that would have been obtained by averaging the results from all possible samples.

The estimated standard errors given in this chapter do not include all portions of the variability due to nonsampling error that may be present in the data. The standard errors reflect the effect of simple response variance, but not the effect of correlated errors introduced by enumerators, coders, or other field or processing personnel. Thus, the standard errors calculated represent a lower bound of the total error. As a result, confidence intervals formed using these estimated standard errors may not meet the stated levels of confidence (i.e., 68, 90, or 95 percent). Thus, some care must be exercised in the interpretation of the data in this data product based on the estimated standard errors.

In example 1, the standard error of the 59,948 persons 16 years and over in county A in state A who were in the civilian labor force was found to be 945. Thus, a 90 percent confidence interval for this estimated total is found to be:

(59,948 - 1.645(945)) to (59,948 + 1.645(945))

or

58,393 to 61,502

One can say, with about 90 percent confidence, that this interval includes the value that would have been obtained by averaging the results from all possible samples.
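The table formulas and the confidence interval construction can be sketched in Python as follows. This is an illustration only. The function names are hypothetical, the factor k = 19 assumes the 5-percent sample (99 would apply to the 1-percent sample), and the design factor of 1.2 is the supposed value used in the examples, not a value taken from table G.

```python
import math


def se_total(y, area_size, k=19.0, design_factor=1.0):
    """Standard error of an estimated total y within an area."""
    return design_factor * math.sqrt(k * y * (1.0 - y / area_size))


def se_percent(p, base, k=19.0, design_factor=1.0):
    """Standard error, in percentage points, of a percentage p."""
    return design_factor * math.sqrt(k / base * p * (100.0 - p))


def se_sum_or_difference(se_x, se_y):
    """Approximate standard error of X + Y or X - Y."""
    return math.sqrt(se_x ** 2 + se_y ** 2)


def se_ratio(x, y, se_x, se_y):
    """Approximate standard error of X / Y when X is not a subset of Y."""
    return (x / y) * math.sqrt((se_x / x) ** 2 + (se_y / y) ** 2)


def confidence_interval(estimate, se, z=1.645):
    """Interval of z standard errors either side (1.645 for 90 percent)."""
    return estimate - z * se, estimate + z * se


# Example 1: unadjusted standard error of the county A total.
basic_se = se_total(59948, 131220)
print(round(basic_se))              # unadjusted value, about 787

# The text rounds to 787 before applying the design factor of 1.2,
# which gives 945; that rounded value is used for the interval.
low, high = confidence_interval(59948, 945)
print(int(low), int(high))
```

Applying the design factor to the unrounded standard error gives a slightly smaller figure than 945, so small differences from the worked examples reflect rounding order rather than a different method.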
The following is an illustration of the calculation of standard errors and confidence intervals when a difference between two sample estimates is obtained. For example, suppose the number of persons in county B age 16 years and over who were in the civilian labor force was 69,314 and the total number of persons 16 years and over was 116,666. Further, suppose the population of county B was 225,225. Thus, the estimated percentage of persons 16 years and over who were in the civilian labor force is 59.4 percent. The unadjusted standard error from table B is 0.63 percentage points. The design factors table (table G) shows the design factor to be 1.2 for "Employment Status." Thus, the approximate standard error of the percentage (59.4 percent) is 0.63 x 1.2 = 0.76 percentage points.

Now suppose that one wished to obtain the standard error of the difference between county A and county B of the percentage of persons who were 16 years and over and who were in the civilian labor force. The difference in the percentages of interest for the two counties is:

62.6 - 59.4 = 3.2 percentage points.

Using the results of the previous example:

$$SE(3.2) = \sqrt{(SE(62.6))^2 + (SE(59.4))^2} = \sqrt{(0.82)^2 + (0.76)^2} = 1.12 \text{ percentage points}$$

The 90 percent confidence interval for the difference is formed as before:

(3.20 - 1.645(1.12)) to (3.20 + 1.645(1.12))

or

1.36 to 5.04

One can say with 90 percent confidence that the interval includes the difference that would have been obtained by averaging the results from all possible samples. When, as in this example, the interval does not include zero, one can conclude, again with 90 percent confidence, that the difference observed between the two counties on this characteristic is greater than can be attributed to sampling error.

For reasonably large samples, ratio estimates are approximately normally distributed, particularly for the census population. Therefore, if
we can calculate the standard error of a ratio estimate, then we can form a confidence interval around the ratio. Suppose that one wished to obtain the standard error of the ratio of the estimate of persons who were 16 years and over and who were in the civilian labor force in county A to the estimate of persons who were 16 years and over and who were in the civilian labor force in county B. The ratio of the two estimates of interest is:

59,948 / 69,314 = .86

$$SE(.86) = \left(\frac{59{,}948}{69{,}314}\right)\sqrt{\frac{(945)^2}{(59{,}948)^2} + \frac{(1{,}145)^2}{(69{,}314)^2}} = .02$$

Using the results above, the 90 percent confidence interval for this ratio would be:

(.86 - 1.645(.02)) to (.86 + 1.645(.02))

or

.83 to .89

SELECTING AN APPROPRIATE SAMPLE SIZE - One virtue in the use of tables A through F for calculating standard errors and confidence intervals is that this method can be employed prior to making any sample tabulation, and thus can help the user decide prior to purchase whether a 5-percent or 1-percent sample size is most appropriate for a proposed study. Suppose that in the foregoing example, the 59,948 figure was a guess, perhaps based on published data. The confidence interval could be calculated as above. In this case, tabulating a 5-percent sample for this particular characteristic would result in a 90 percent confidence interval of 58,393 to 61,502. The width of this interval is 3,109. Tabulating from a 1-percent sample for the same characteristic would result in a confidence interval of 56,403 to 63,492. The width of the interval from the 1-percent sample is 7,089 (over two times the width of the confidence interval from the 5-percent sample).

Another criterion used in making this type of decision is the coefficient of variation (CV). The CV is a measure of reliability and is defined as the ratio of the standard error of the estimate to the absolute value of the expected value of the estimate.
To get an estimate of the CV, substitute the estimate itself for the expected value in the CV formula. In this example, if the 59,948 estimate is obtained from the 5-percent sample, the CV would be 1.6 percent (945 / 59,948). If the 1-percent sample is tallied to get the estimate, then the CV would be 3.6 percent. The smaller the CV, the more reliable the estimate. There is no particular rule of thumb that dictates how large a confidence interval or CV is acceptable. This depends on the relative precision necessary for a particular application as balanced against the relative cost of tabulating microdata samples of the various sizes.

USING TABLES A THROUGH F FOR OTHER SAMPLE SIZES

Tables A through F may also be used to approximate the unadjusted standard errors for other sample sizes by adjusting for the sample size desired. The adjustment for sample size is obtained as follows: Let f1 be the sampling rate in any of the tables A through F, and f2 be the sampling rate for the sample size to be used. The adjustment for sample size can be read from the following table:

f2    Sample Size Adjustment Factor

For example, if the user were to select a subsample of one half of a 1-percent sample, i.e., f2 = .005, then the standard errors shown in tables C or D for a 1-percent sample must be multiplied by 1.42 to obtain the standard errors for a .005 sample. The factor of 1.42 shows that the standard errors increase by 42 percent when the sample size is halved.

The principle is also applicable when combining microdata samples to achieve a sample size larger than 5 percent. If, for example, all three samples are combined for the same area to obtain an estimate of a characteristic for the elderly population, the standard errors for this sample size (i.e., 11 percent) can be obtained by multiplying those shown in tables A and B by .65. Thus, the increase from a 5-percent to an 11-percent sample reduces the standard error by 35 percent.
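The two adjustment factors quoted above (1.42 and .65) are consistent with taking the square root of the ratio of the (1/f - 1) terms for the two sampling rates. The Python sketch below rests on that inference, which is an assumption and not a formula stated in the adjustment table; the function name is hypothetical.

```python
import math


def sample_size_adjustment(f1, f2):
    """Factor by which standard errors tabulated for sampling rate f1
    are multiplied to approximate those for sampling rate f2.

    Assumes the standard error is proportional to sqrt(1/f - 1).
    """
    return math.sqrt((1.0 / f2 - 1.0) / (1.0 / f1 - 1.0))


# Half of a 1-percent sample, relative to tables C and D.
print(round(sample_size_adjustment(0.01, 0.005), 2))

# An 11-percent combined sample, relative to tables A and B.
print(round(sample_size_adjustment(0.05, 0.11), 2))
```

A factor above 1 means the smaller sample has larger standard errors than the tabulated ones; a factor below 1 means the larger sample has smaller ones.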
Alternatively, the user may wish to use the following formulas to directly calculate the unadjusted standard errors. For estimated totals, calculate SE(Y) as

$$SE(Y) = \sqrt{\left(\frac{1}{f_2} - 1\right) Y \left(1 - \frac{Y}{N}\right)}$$

where N = size of area tabulated
      Y = estimate (weighted) of characteristic total.

Example 1 shows the unadjusted standard error for the figure 59,948 to be 787. Using the above formula with f2 = .11 yields an unadjusted standard error SE(59,948) = 513, for a 35 percent reduction in the standard error as shown in the above table.

For an estimated percentage, calculate

$$SE(P) = \sqrt{\left(\frac{1}{f_2} - 1\right) \frac{P\,(100 - P)}{B}}$$

where P = estimated percentage and
      B = base of estimated percentage (weighted estimate)

ESTIMATION OF STANDARD ERRORS DIRECTLY FROM THE MICRODATA SAMPLES

Use of tables or formulas to derive approximate standard errors as discussed above is simple and does not complicate processing. Nonetheless, a more accurate estimate of the standard error can be obtained from the samples themselves, using the random group method. Using this method it is also possible to compute standard errors for means, ratios, indexes, correlation coefficients, or other statistics for which the tables or formulas presented earlier do not apply.

The random group method does increase processing time somewhat since it requires that the statistic of interest, for example a total, be computed separately for each of up to 100 random groups. The variability of that statistic for the sample as a whole is estimated from the variability of the statistic among the various random groups within the sample. The procedure for calculating a standard error by the random group method for various statistics is given below.

TOTALS - To obtain the standard errors of estimated totals the following method should be used.
The random groups estimate of variance of X is given by

$$\operatorname{var}(X) = \frac{t}{t-1} \sum_{g=1}^{t} \left( X_g - \frac{1}{t} \sum_{g=1}^{t} X_g \right)^2$$

or the computational formula

$$\operatorname{var}(X) = \frac{t}{t-1} \left[ \sum_{g=1}^{t} X_g^2 - t\,\bar{X}_g^2 \right]$$

where t = number of random groups
      X_g = the weighted microdata sample total of the characteristic of interest from the g-th random group
      $\bar{X}_g = \sum_{g=1}^{t} X_g / t$, the average random group total

The standard error of the estimated total is the square root of var(X). It is suggested that t=100 for estimating the standard error of a total since, as is discussed in the next chapter, each of the sample records was assigned a two-digit subsample number sequentially from 00 to 99. The two-digit number can be used to form 100 random groups. For example, all sample cases with 01 as the two-digit number will be in random group 1. All sample cases with 02 as the two-digit number will be in random group 2, etc., up to 00 as the one-hundredth random group.

The reliability of the random group variance estimator is a function of both the kurtosis of the estimator and the number of groups t. If t is small, the coefficient of variation (CV) will be large, and therefore, the variance estimator will be of low precision. In general, the larger t is, the more reliable the variance estimator will be.1

PERCENTAGES, RATIOS, AND MEANS - To obtain the estimated standard error of a percent, ratio, or mean, the following method should be used. Let

$$r = \frac{x}{y}$$

be the estimated percent, ratio, or mean, where x and y = the estimated totals as defined above for the X and Y characteristics.
For the case where BOTH numerator and denominator are obtained from the full microdata sample, the variance of r is given by

$$\operatorname{var}(r) = \frac{t}{t-1} \left(\frac{1}{y}\right)^2 \sum_{g=1}^{t} \left(X_g - r\,Y_g\right)^2$$

where t and X_g are defined above, y = the weighted full microdata sample total for the Y characteristic, and Y_g = the corresponding weighted total for the g-th random group.

CORRELATION COEFFICIENTS, REGRESSION COEFFICIENTS, AND COMPLEX STATISTICS - The random group method for computing the variance of correlation coefficients, regression coefficients, and other complex nonlinear statistics may be expressed as:

$$\operatorname{var}(\theta) = \frac{t}{t-1} \sum_{g=1}^{t} \left(\theta_g - \theta\right)^2$$

where $\theta_g$ = the weighted estimate (at the tabulation area level) of the statistic of interest computed from the g-th random group, and $\theta$ = the corresponding weighted estimate computed from the full microdata sample.

Care must be exercised when using this variance estimator for complex nonlinear statistics as its properties have not been fully explored for such statistics. In particular, the choice of the number of random groups must be considered more carefully. When using the 5-percent sample, use of t=100 for all areas tabulated is recommended. When using the 1-percent sample or samples having a smaller sampling fraction, the user should consider using a smaller number of random groups to ensure that each random group contains at least 25 records.

Fewer than 100 random groups can be formed by appropriate combination of the two-digit subsample numbers. For example, to construct 50 random groups, assign all records in which the subsample number is 01 or 51 to the first random group; all records in which the subsample number is 02 or 52 to the second random group, etc. Finally, assign all records in which the subsample number is 00 or 50 to random group 50.
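The random group calculations for totals and ratios can be sketched in Python as follows. This is an illustration only. The function names and the (subsample number, weight) record layout are hypothetical, the group totals are assumed to add up to the full-sample total, and records are assigned to groups by taking the subsample number modulo t, which is assumed to match the combination of subsample numbers described in the text when t divides 100.

```python
def group_totals(records, t=100):
    """Weighted total of a characteristic within each random group.

    records: iterable of (subsample_number, weight) pairs for the
    sample cases that have the characteristic of interest.
    Group index 0 holds subsample 00 (the "last" group in the text).
    """
    if 100 % t != 0:
        raise ValueError("t must divide 100")
    totals = [0.0] * t
    for subsample, weight in records:
        totals[subsample % t] += weight
    return totals


def variance_of_total(xg):
    """Random group variance of the total X = sum of the group totals."""
    t = len(xg)
    mean = sum(xg) / t
    return t / (t - 1) * sum((x - mean) ** 2 for x in xg)


def variance_of_ratio(xg, yg):
    """Random group variance of r = x / y, with both totals taken
    from the full sample (x and y are sums of the group totals)."""
    t = len(xg)
    x, y = sum(xg), sum(yg)
    r = x / y
    squared = sum((a - r * b) ** 2 for a, b in zip(xg, yg))
    return t / (t - 1) * (1.0 / y) ** 2 * squared


# Hypothetical records: (subsample number, person weight).
records = [(1, 20.0), (51, 20.0), (0, 5.0), (50, 5.0)]
totals = group_totals(records, t=50)
print(totals[1], totals[0])
```

The standard error is the square root of the returned variance in each case.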
Ten random groups can be constructed by including all records having subsample numbers with the same "units" digit in a particular random group. For example, subsample numbers 00,10,...,90 would form one random group; subsample numbers 01,11,...,91 would form a second random group, etc.

STANDARD ERRORS FOR SMALL ESTIMATES

Percentage estimates of zero and estimated totals of zero are subject to both sampling and nonsampling error. While the magnitude of the error is difficult to quantify, users should be aware that such estimates are nevertheless subject to both sampling and nonsampling error even though in the case of zero estimates the corresponding random groups estimate of variance will be zero.

A second point concerning standard errors: the standard error estimates obtained using the random groups method do not include all components of the variability due to nonsampling error that may be present in the data. Therefore, the standard errors calculated using the methods described in this section represent a lower bound for the total error. Data users should be aware that in general confidence intervals formed using these estimated standard errors do not meet the stated levels of confidence. Data users are advised to be conservative when making inferences from the data provided in this data product.

CONTROL OF NONSAMPLING ERROR

As mentioned earlier, both sample and 100-percent data are subject to nonsampling error. This component of error could introduce serious bias into the data, and the total error could increase dramatically over that which would result purely from sampling. While it is impossible to eliminate nonsampling error completely from an operation as large and complex as the decennial census, the Bureau of the Census attempted to control the sources of such error during the collection and processing operations. Described below are the primary sources of nonsampling error and the programs instituted for control of this error.
The success of these programs, however, was contingent upon how well the instructions actually were carried out during the census. As part of the 1990 census evaluation program, both the effects of these programs and the amount of error remaining after their application will be evaluated.

UNDERCOVERAGE--It is possible for some households or persons to be missed entirely by the census. The undercoverage of persons and housing units can introduce biases into the data. Several coverage improvement programs were implemented during the development of the census address list and census enumeration and processing to minimize undercoverage of the population and housing units. These programs were developed based on experience from the 1980 census and results from the 1990 census testing cycle.

In developing and updating the census address list, the Census Bureau used a variety of specialized procedures in different parts of the country.

o In the large urban areas, the Census Bureau purchased and geocoded address lists. Concurrent with geocoding, the United States Postal Service (USPS) reviewed and updated this list. After the postal check, census enumerators conducted a dependent canvass and update operation. Prior to mailout, in the fall of 1989, local officials were given the opportunity to examine block counts of address listings (local review) and identify possible errors, and the USPS conducted a final review.

o In small cities, suburban areas, and selected rural parts of the country, the Census Bureau created the address list through a listing operation. The USPS reviewed and updated this list, and the Census Bureau reconciled USPS corrections and updated through a field operation. In the fall of 1989, local officials participated in reviewing block counts of address listings. Prior to mailout, the USPS conducted a final review.
o The Census Bureau (rather than the USPS) conducted a listing operation in the fall of 1989 and delivered census questionnaires in selected rural and seasonal housing areas in March of 1990. In some inner-city public housing developments, whose addresses had been obtained via the purchased address list noted above, census questionnaires were also delivered by Census Bureau enumerators.

Coverage improvement programs continued during and after mailout. A recheck of units initially classified as vacant or nonexistent further improved the coverage of persons and housing units. All local officials were given the opportunity to participate in a post-census local review, and census enumerators conducted an additional recanvass. In addition, efforts were made to improve the coverage of unique population groups, such as the homeless and parolees/probationers. Computer and clerical edits and telephone and personal visit followup also contributed to improved coverage.

More extensive discussion of the programs implemented to improve coverage will be published by the Census Bureau when the evaluation of the coverage improvement program is completed.

RESPONDENT AND ENUMERATOR ERROR--The persons answering the questionnaire or responding to the questions posed by an enumerator could serve as a source of error, although the questions were phrased as clearly as possible based on precensus tests, and detailed instructions for completing the questionnaire were provided to each household. In addition, respondents' answers were edited for completeness and consistency, and problems were followed up as necessary.

The enumerator may misinterpret or otherwise incorrectly record information given by a respondent; may fail to collect some of the information for a person or household; or may collect data for households that were not designated as part of the sample. To control these problems, the work of enumerators was monitored carefully.
Field staff were prepared for their tasks by using standardized training packages that included hands-on experience in using census materials. A sample of the households interviewed by enumerators for nonresponse was reinterviewed to guard against the possibility that enumerators had submitted data for fabricated persons. Also, the estimation procedure was designed to control for biases that would result from the collection of data from households not designated for the sample.

PROCESSING ERROR--The many phases involved in processing the census data represent potential sources for the introduction of nonsampling error. The processing of the census questionnaires includes the field editing, followup, and transmittal of completed questionnaires; the manual coding of write-in responses; and the electronic data processing. The various field, coding, and computer operations undergo a number of quality control checks to ensure their accurate application.

NONRESPONSE--Nonresponse to particular questions on the census questionnaire allows for the introduction of bias into the data, since the characteristics of the nonrespondents have not been observed and may differ from those reported by respondents. As a result, any imputation procedure using respondent data may not completely reflect this difference either at the elemental level (individual person or housing unit) or on the average. Some protection against the introduction of large biases is afforded by minimizing nonresponse. In the census, nonresponse was reduced substantially during the field operations by the various edit and followup operations aimed at obtaining a response for every question. Characteristics for the nonresponses remaining after this operation were imputed by the computer by using reported data for a person or housing unit with similar characteristics.
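
The idea of imputing a missing item from a record with similar characteristics can be illustrated with a small sketch. This is not the Census Bureau's actual procedure: the text above does not say which characteristics were matched or how the donor record was chosen. The sketch assumes records are processed in order and that the donor is the most recent record that reported the item and agrees on a set of matching fields. The function name, the field names (age_group, sex, hours), and the "_allocated" marker are all hypothetical.

```python
def impute_from_donors(records, item, match_keys):
    """Fill missing `item` values from an earlier, similar record.

    Assumption (not from the codebook): the donor is the most recently
    processed record that reported `item` and has the same values on
    every field in `match_keys`.
    """
    last_reported = {}  # matching cell -> most recent reported value
    filled = []
    for rec in records:
        cell = tuple(rec[key] for key in match_keys)
        out = dict(rec)  # copy so the input records are left unchanged
        if out.get(item) is None:
            donor_value = last_reported.get(cell)
            out[item] = donor_value  # stays None if no donor seen yet
            # Hypothetical marker: 1 if the value was imputed, else 0.
            out[item + "_allocated"] = 1 if donor_value is not None else 0
        else:
            last_reported[cell] = out[item]
            out[item + "_allocated"] = 0
        filled.append(out)
    return filled


# Made-up records for illustration only.
sample = [
    {"age_group": "30-39", "sex": "F", "hours": 40},
    {"age_group": "30-39", "sex": "F", "hours": None},
    {"age_group": "60-69", "sex": "M", "hours": None},
]

for row in impute_from_donors(sample, "hours", ["age_group", "sex"]):
    print(row)
```

In this sketch the second record receives its value from the first, which matches on both fields, while the third record stays missing because no matching donor has been seen. A production procedure would need a fallback for that case, which is beyond what the text above describes.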
EDITING OF UNACCEPTABLE DATA

The objective of the processing operation is to produce a set of data that describes the population as accurately and clearly as possible. To meet this objective, questionnaires were edited during field data collection operations for consistency, completeness, and acceptability. Questionnaires also were reviewed by census clerks for omissions, certain specific inconsistencies, and population coverage. For example, write-in entries such as "Don't know" or "NA" were considered unacceptable. For some district offices, the initial edit was automated; however, for the majority of the district offices, it was performed by clerks. As a result of this operation, a telephone or personal visit followup was made to obtain missing information. Potential coverage errors were included in the followup, as well as a sample of questionnaires with omissions and/or inconsistencies.

Subsequent to field operations, remaining incomplete or inconsistent information on the questionnaire was assigned using imputation procedures during the final automated edit of the collected data. Imputations, or computer assignments of acceptable codes in place of unacceptable entries or blanks, are needed most often when an entry for a given item is lacking or when the information reported for a person or housing unit on that item is inconsistent with other information for that same person or housing unit. As in previous censuses, the general procedure for changing unacceptable entries was to assign an entry for a person or housing unit that was consistent with entries for persons or housing units with similar characteristics. The assignment of acceptable codes in place of blanks or unacceptable entries enhances the usefulness of the data.

Another way in which corrections were made during the computer editing process was through substitution; that is, the assignment of a full set of characteristics for a person or housing unit.
When there was an indication that a housing unit was occupied but the questionnaire contained no information for the people within the household or the occupants were not listed on the questionnaire, a previously accepted household was selected as a substitute, and the full set of characteristics for the substitute was duplicated. The assignment of the full set of housing characteristics occurred when there was no housing information available. If the housing unit was determined to be occupied, the housing characteristics were assigned from a previously processed occupied unit. If the housing unit was vacant, the housing characteristics were assigned from a previously processed vacant unit.

USE OF ALLOCATION FLAGS IN THESE FILES

As a result of the editing, there are no blank fields or missing data in public use microdata sample files. Each field contains a data value or a "not applicable" indicator, except for the few items where allocation was not appropriate and a "not reported" indicator is included. For every subject item, the user can identify which entries were allocated by means of the "allocation flags" in items H161 through H198 and P186 through P233 in the microdata files. For all items it is possible to compute the allocation rate and, if the rate is appreciable, to compute the distribution of actually observed values (with allocated data omitted) and compare it with the overall distribution including allocated values.

The flags indicate the changes in values between input and output. These flags may indicate up to four possible types of allocations:

A. Pre-edit - When the original entry was rejected because it fell outside the range of acceptable values.
B. Consistency - Imputed missing characteri