CONCURRENT (ORGANIZED) SESSIONS F
Session 3: Modernizing the U.S. Census Bureau’s Statistical Foundation through Enterprise Frames
Demographic Frame: Leveraging Person-Level Data to Enhance Census and Survey Taking
- Dr Jennifer Ortman (U.S. Census Bureau) – Presenting Author
Realizing the full potential of the U.S. Census Bureau’s Frames Program requires building out the fourth enterprise frame – the Demographic Frame. The Demographic Frame is a comprehensive, secure, access-controlled database of person-level data consisting of demographic, social, and economic characteristics of individuals derived from census, survey, administrative, and third-party data sources. It includes unique person-level identifiers used to link individuals across datasets and links information about individuals across time and space. There are myriad potential applications for the Demographic Frame, both as a valuable database on its own and through linking the information about people in the Demographic Frame to information about places, jobs, and businesses available in the other enterprise Frames. This presentation will introduce the Demographic Frame, explaining what it is and how it is being built, highlight the source information used to derive it, detail the components of the current extract available to Census Bureau users, and summarize evaluations that are informing our initial assessment of the fitness for use of this resource to support census and survey taking.
Enterprise Frames: Creating the Infrastructure to Enable Transformation
- Dr Anthony Knapp (U.S. Census Bureau) – Presenting Author
- Dr Lori Zehr (U.S. Census Bureau)
The U.S. Census Bureau has long maintained frame-like data to support census and survey operations. A multitude of programs, censuses, and surveys collect and maintain universe-level information on individuals, households, businesses, and governments. However, these data are rarely used for enterprise-wide operations. There is a need for a modernized data infrastructure: a linked universe of information from which sampling can occur and statistical summaries can be produced directly. The Census Bureau has established the Frames Program to meet this need, with a focus on maximizing the utility of the Census Bureau’s demographic, economic, and geospatial data while minimizing the data collection burden on households and businesses and addressing redundancies in data processing. This presentation will discuss the vision and benefits of, and challenges faced by, the Frames Program. This presentation will also summarize the objectives and achievements of the nascent Frames Program, highlight the evolution of the existing Business, Job, and Geospatial Frames, briefly describe the Demographic Frame, and detail efforts to establish a linkage infrastructure to better leverage these resources.
Using a Demographic Frame to Potentially Enhance the American Community Survey
- Ms Deliverance Bougie (U.S. Census Bureau) – Presenting Author
The American Community Survey (ACS) is the premier source of detailed population and housing information for our nation and its communities. In the current landscape, there are concerns about data quality and the quality of survey estimates. To address some of these concerns, the U.S. Census Bureau is looking to innovate and utilize readily available data to inform decision making about data quality. The introduction of the Census Bureau’s Frames Program, and the creation of a Demographic Frame, allows surveys to learn more about their survey respondents and non-respondents. This could give the ACS the opportunity to shore up data quality. Information about location and basic demographics could be a major benefit to the ACS and other surveys at the Census Bureau. This presentation will detail possible applications of the Demographic Frame in the ACS. First, there will be a discussion about possible use cases such as quality checks, monitoring the quality of interviewer returns, and analyzing response patterns. We will also discuss the feasibility of using the Demographic Frame as part of ACS production and future goals for implementation.
Obtaining Non-Employer Business Owner Data from the Demographic Frame
- Mr Michael Ratcliffe (U.S. Census Bureau)
- Ms Erica Marquette (U.S. Census Bureau) – Presenting Author
The U.S. Census Bureau is undergoing a major transformation that will provide a robust, data-centric ecosystem to better meet the changing and emerging needs of our data users. The Frames Program will provide an easy and efficient way to link curated internal datasets for both familiar and unanticipated use cases. The first cross-frame linkage between the Business Frame and Demographic Frame will be to support the assignment of demographic characteristics to business owners. Currently, the Non-Employer Statistics by Demographics (NES-D) is an annual data product created by leveraging existing administrative and census records to assign demographic characteristics to the universe of approximately 25 million (as of 2016) nonemployer businesses. This paper will discuss the feasibility of NES-D acquiring demographic characteristics from the Demographic Frame, provide a high-level comparison and coverage assessment between the existing methodology and the proposed methodology used to assign demographic attributes to owners of self-owned businesses, and close with a discussion of future steps.
Demographic Frame Evaluation
- Ms Aliza Kwiat (U.S. Census Bureau)
- Mr Matt Herbstritt (U.S. Census Bureau) – Presenting Author
As part of the Census Bureau’s Frames Program, the Demographic Frame is a comprehensive database of person-level data in the United States. To meet the Census Bureau’s statistical quality standards, Census surveys require complete and accurate frames from which to draw their samples. For surveys to consider sampling from the Demographic Frame, they need assurances that their data quality and coverage will be at least as good as with their current frame. To this end, an evaluation of the Demographic Frame’s coverage is necessary. The evaluation will be conducted by comparing the Demographic Frame to the Census Bureau’s Master Address File (MAF), survey data, and potentially other administrative files. The MAF is a database containing an inventory of all known living quarters in the United States and consists of address information, Census geographic location codes, and housing characteristics. The MAF is used to support many surveys conducted by the Census Bureau, including the Decennial Census, the American Community Survey, and ongoing demographic surveys. In this paper, we’ll discuss the importance of coverage evaluations of survey frames, describe our method of evaluating the coverage of the Demographic Frame, and summarize the results of this evaluation.
Session 2: Bridging Survey and Twitter Data: Methodology and Application
Comparative Topic Analysis of Tweets and Open-Ended Survey Responses on Covid-19 Vaccinations, Financial Threats, and K-12 Education
- Dr Joshua Pasek (University of Michigan) – Presenting Author
- Dr Leticia Bode (Georgetown University)
- Dr Le Bao (Georgetown University)
Survey responses catalogue answers to questions based on a specific prompt and asked at a time of the researcher’s choosing (although with some variability in when responses need to be received). Data related to the same topic on Twitter, in contrast, reflect a user’s inclination to mention something relevant in response to the generic prompt “What’s happening?” or when commenting on another user’s post (cf. Marwick and Boyd 2011). The former are typically constrained to a series of pre-generated response options whereas the latter only concern the topic of interest because it seemed sufficiently salient to the tweet generator at the time of posting (Schober et al. 2016). This implies, then, that there are three central differences between the content of tweets and the content of a survey response in the meaning conferred: they may differ because of differences in the motivation to provide content, the open vs. closed nature of the content they are providing, or the implications of producing a response in these two very different contexts (for a researcher vs. for some imagined audience). This paper presents the results of a probability sample study linking Americans’ survey responses with content those same individuals share on Twitter. We use a combination of closed-ended survey questions, open-ended survey questions, and Twitter posts. The closed-ended survey questions provide us with demographic information for our panelists and some answers to targeted questions about the pandemic. The open-ended survey questions allow us to get more detailed information about the respondents’ decisions and opinions in their own words. They also give us insight into the vocabulary respondents are using to discuss the issue of interest. We then use the vocabulary and topics from the open-ended answers to identify similar threads of conversation on Twitter.
Across a series of questions on vaccine status and hesitancy, personal financial threats, and K-12 education concerns, we compare respondents’ open-ended survey responses with the Twitter content those same individuals generate. We find that both streams of data can be used to identify similar topics, and that the overall frequency of most topics is at least somewhat related across streams both in relative volume and over time. Notably, however, there are a number of circumstances where there are large differences in either topic volume or trends across these streams. The results of the study reinforce a body of scholarship that shows that differences between survey and social media data can only partially be accounted for by distinctions in who is posting content. Instead, at least some of the difference appears attributable to when and why individuals provide the content they do in surveys versus Twitter data.
Can We Gain Useful Knowledge of Public Opinion from Linked Twitter Data? Reweighting to Correct for Consent Bias
- Ms Jessica Stapleton (SSRS) – Presenting Author
- Mr Michael Jackson (SSRS)
- Ms Cameron McPhee (SSRS)
- Dr Lisa Singh (Georgetown University)
- Dr Trivellore Raghunathan (University of Michigan)
As an ever-growing platform for people across the globe, social media is an essential part of today’s society. Social media allows people to express themselves, discuss topics, connect with others, and gather information. This presentation will look at Twitter data specifically and assess the possibility of gathering and weighting data from a representative sample of Twitter users. To collect a representative sample of Twitter users, a survey was conducted on the SSRS Opinion Panel, a probability-based panel of U.S. adults. Respondents were asked if they had a Twitter account. Those who responded “Yes” are considered “Twitter users”. All Twitter users’ demographic information is available to us from surveys. Among Twitter users, each respondent was asked for their Twitter handle and consent to scrape their Twitter activity. The people who allowed data gathering from their Twitter accounts are called “consenters”, and those who did not are “non-consenters”. As other papers in this panel will discuss, we collected and cleaned data for analysis from the consenters’ Twitter accounts. This approach of linking consenting survey respondents to their social media posts has value because it allows the collection of additional information for survey respondents beyond what was provided in the survey. This social media data could allow for richer analysis of consenters than the survey data alone. However, the consent process may inhibit the representativeness of the sample for which Twitter data are available: though we have survey data (both demographic and substantive) for the full sample of Twitter users, we have Twitter data only for those who consented, who may differ in meaningful ways from non-consenters. This paper will therefore address the question of whether analytic results from the subpopulation of consenters can be generalized to the larger population of Twitter users.
We will use survey data to evaluate demographic and attitudinal differences between consenting and non-consenting Twitter users; and then compare different methods of reweighting Twitter consenters to correct for these differences and thereby draw conclusions about the Twitter user population. This process would entail using data collected in the survey (both quantitative and coded open-ended data) to estimate weighting targets for the population of Twitter users, and then weighting the consenters to match those targets. After weighting the consenters, testing will be done to measure the remaining differences between consenters and the full population of Twitter users, and to assess the impact of the weighting on substantive conclusions drawn using the Twitter data. Results will provide insight into whether weighting can make the subpopulation of consenters sufficiently representative of the full population of Twitter users to draw generalizable conclusions from linked Twitter data.
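The reweighting step described above can be illustrated with a minimal post-stratification sketch. It is only one of the weighting methods such a comparison might include, not the authors' procedure; the single weighting variable (age group), the cell labels, and the target shares are hypothetical values chosen for illustration.

```python
from collections import Counter


def poststratify(consenter_cells, target_shares):
    """Return one weight per consenter so that weighted cell shares match the targets.

    consenter_cells: the weighting cell (e.g. age group) of each consenter.
    target_shares: estimated share of each cell among all Twitter users,
                   which would come from the full survey sample.
    """
    counts = Counter(consenter_cells)
    n = len(consenter_cells)
    weights = []
    for cell in consenter_cells:
        sample_share = counts[cell] / n
        weights.append(target_shares[cell] / sample_share)
    return weights


# Hypothetical example: younger users are over-represented among consenters.
consenters = ["18-34", "18-34", "18-34", "35+"]
targets = {"18-34": 0.5, "35+": 0.5}
weights = poststratify(consenters, targets)

# Weighted share of the "35+" cell after adjustment.
weighted_35_plus = weights[3] / sum(weights)
```

With several weighting variables, an iterative method such as raking would replace the single-variable adjustment shown here.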
Lurk More (or Less): Differential Engagement in Twitter by Sociodemographics
- Dr Lisa Singh (Georgetown University) – Presenting Author
- Dr Michael Traugott (University of Michigan)
- Dr Nathan Wycoff (Georgetown University)
While coverage and content are clear areas in which we need to understand the measurement properties of social media, a third area crosses both coverage and content: activity level. There have been claims that optimizing platforms to maximize “engagement” can lead to radicalization pipelines alongside the desired end of higher ad revenue. Therefore, understanding the way that activity levels differ by sociodemographic categorizations can help us understand the differential risk presented to various communities by Twitter. Some use Twitter to get their daily news, rarely posting; others post regularly to promote their business while rarely reading the posts of others; and some are fully engaged in reading and responding to the tweets of others, engaging in online asynchronous conversations. Though all these types of survey respondents may identify as “Twitter Users”, they may have rather different experiences, and this mixture of users may confound relationships between social media use and other survey questions. Here we consider multiple measures of Twitter activity: a self-reported survey measure of time spent (Self-Reported Activity Level), a self-reported survey measure of activity type (Self-Reported Activity Type), and a constructed measure based on collecting posts through the Twitter Streaming API for consented Twitter users (Twitter-Constructed Posting Activity Level). We find that while there is a significant relationship between self-reported activity level and that empirically measured, there are also serious deviations in certain cases. We investigate the frequency of Twitter use as a function of demographic information, finding variations in activity level by race, gender, and political affiliation. We also see differences by demographics in the discrepancy between the observed activity on Twitter and the self-reported activity level.
Better understanding the relationship between self-reported and actual activity levels can facilitate the interpretation of surveys relying entirely on self-reported activity level, or of the nonconsenting subset of surveys that allow opt-out of reporting social media handles. These results can enable a more nuanced understanding of the interplay between social media usage and important behavioral outcomes such as vaccination, conspiratorial thought, or self-esteem in future studies.
Conversation Coverage: Comparing Topics of Conversation By Survey Respondents on Twitter and the General Twitter Population
- Dr Ceren Budak (University of Michigan) – Presenting Author
- Dr Rebecca Ryan (Georgetown University)
- Mr Yanchen Wang (Georgetown University)
Topic generation is not new for survey and social media research and has been an important tool for understanding the impacts of crucial events and people’s perceptions and opinions. However, determining topics for short social media posts has always been a challenge. Manual generation may be considered the gold standard for topic construction, but labeling hundreds of thousands or millions of posts defeats the near real-time benefits of using social media data. Moreover, the diffuse nature of social media, which leads to wide variation in language use, poses additional complexities for efficiently constructing topics for the domains of interest. Therefore, in this paper, leveraging data from both consented survey respondents who provided Twitter handles and the general Twitter population, we begin with a discussion of the use of computational topic models and human-in-the-loop approaches for efficiently constructing topics of social media posts. Specifically, we apply a semi-supervised, iterative human-in-the-loop approach to determine topics that are posted by the consented survey respondents and then generalize the topic construction to a more general Twitter population identified using the Twitter Decahose. Using Twitter posts from January 1, 2021 to October 31, 2021, we compare “apples to apples”: the topics of the same issue domains (covid vaccines, economy, and homeschooling) from our consented survey respondents and from random Twitter users. We then formulate an approach to capture the Twitter conversations better and more efficiently. Finally, we compare the dynamics of the overall conversation topics shared by both groups. We find that the confirmed Twitter users in our study and a broader population of users are very similar in many of our analyses. With the help of a more confined sample, we are able to provide a more principled approach that can better capture the dynamics of Twitter conversations.
CONCURRENT SESSIONS A
Session 1: My “socials” like your surveys! Exploring alignment, consistency and divergence of estimates based on survey, social media and digital trace data
(Room: Da Vinci 128)
Uncovering Hidden Alignment between Social Media Posts and Survey Responses
- Professor Frederick Conrad (University of Michigan) – Presenting Author
- Mr Mao Li (University of Michigan)
- Professor Michael Schober (New School for Social Research)
- Professor Johann Gagnon-Bartsch (University of Michigan)
- Dr Robyn Ferg (Westat)
- Ms Rebecca Dolgin (New School for Social Research)
A precondition for using content from social media in place of survey data is that the two data sources tell the same story, that the prevalence of posts and survey responses move up and down together over time. When this is the case, the two data sources are aligned. Alignment can be elusive, appearing at one point and later disappearing (e.g., Jungherr, et al., 2012; Conrad et al. 2021). It is also possible that alignment may be hidden, i.e., responses to survey questions that measure a particular opinion may align with posts that express the same opinion but this correspondence may be obscured by the diversity of opinion in the full set of posts. In the research reported here, we investigate whether hidden alignment can be uncovered by selecting just those tweets expressing an opinion, i.e., stance, corresponding to a target survey response. We test this idea by comparing patterns of responses to a question asked on the (online) Census Tracking Survey – whether the 2020 US Census would ask about the citizenship of household members – to the patterns of posts in a corpus of 3.5 million tweets about the 2020 Census and related topics. We measure alignment with co-movement, the fraction of times the two series move in the same direction over time. Alignment was not evident when we compared the prevalence of “Will not ask” responses to that of all tweets about the possible inclusion of a citizenship question, but we wondered whether it might be revealed by comparing the movement of “Will not ask” responses to the movement of only those tweets expressing the same (“Will not ask”) opinion. To test this, we trained a language model (XLNet) to distinguish between the stance (“Will not ask,” “Will ask,” or “Can’t tell”) of each tweet in the corpus and compared just those tweets which expressed the “Will not ask” stance to the corresponding survey responses.
We trained XLNet on 1000 example tweets labeled for their stance, based on judgments of 100 MTurkers, and asked the model to predict the stance of the ~350,000 tweets on this topic. This approach revealed significant comovement, suggesting that hidden alignment may be uncovered with the appropriate filtering and preprocessing. We then asked if this process could be carried out using automatically labeled training examples to conserve resources. To test this idea, we trained a model (SBERT) to estimate the semantic distance of tweets from several phrases we chose to capture the “Will not ask” and “Will ask” stances. This also produced strong comovement. Although an effort to extend automated labelling of stance to a question asking respondents whether they intend to complete the Census was less successful, the exercise highlighted subtle distinctions between stance of behavioral intentions and likelihood judgments. Overall, we conclude that filtering social media posts based on their stance holds promise for the use of social media to conduct social research.
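The co-movement measure defined in this abstract (the fraction of times two series move in the same direction) can be sketched as follows. The handling of flat periods, which count as agreement only when both series are flat, and the example numbers are assumptions made for illustration; the abstract does not specify them.

```python
def directions(series):
    """Sign of each period-to-period change: 1 for up, -1 for down, 0 for flat."""
    return [(later > earlier) - (later < earlier)
            for earlier, later in zip(series, series[1:])]


def comovement(series_a, series_b):
    """Fraction of consecutive periods in which both series move in the same direction."""
    if len(series_a) != len(series_b) or len(series_a) < 2:
        raise ValueError("series must have equal length of at least 2")
    moves_a = directions(series_a)
    moves_b = directions(series_b)
    same = sum(1 for a, b in zip(moves_a, moves_b) if a == b)
    return same / len(moves_a)


# Hypothetical weekly values: survey share giving a response, and tweet counts.
survey_share = [0.30, 0.35, 0.33, 0.40, 0.38]
tweet_counts = [120, 150, 160, 200, 180]
score = comovement(survey_share, tweet_counts)
```

In this toy example the two series move together in three of the four periods.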
New Data Sources and Signals on Presidential Approval: Daily Indicator Based on Google Trends
- Mr Bastián González-Bustamante (University of Oxford) – Presenting Author
Although Google Trends could be a novel source of signals on a number of different topics relevant to social sciences, some research has already noted that the time series could present some statistical inconsistencies. We use a novel sampling approach to generate more stable daily Google search volumes (GSV) indicators aligned and corrected with weekly and monthly series to elaborate daily presidential approval indicators for 12 presidential democracies from the mid-2000s to date, considering seasonal adjustment for time series. This indicator is entirely novel and constitutes a significant empirical contribution to the study of public opinion based on alternative sources and signals. Subsequently, we provide some plausibility checks to cross-validate our indicator. First, we establish face validity by regressing approval quarterly estimates using a dyad-ratios algorithm on our indicator and a vector of confounders such as macroeconomic indicators and dummies for controlling the quadratic effects of presidential approval. Second, we cross-validate our indicator using novel data on calls for the resignation of cabinet members elaborated using optical recognition algorithms on archives and press reports together with machine learning classifiers. Finally, we discuss alternatives to integrate our indicator with survey data to improve measurements.
Ground Truth References for Trends Estimated from Social Media Using Automated Entity Extraction. Example of Substance Use Discussions on Reddit
- Dr Georgiy Bobashev (RTI International) – Presenting Author
- Mr Alexander Preiss (RTI International)
- Mr Anthony Berghammer (RTI International)
- Dr Mark Edlund (RTI International)
We developed an approach to extract information (entities and associations) from social media and applied it to substance use discussions on Reddit. We identified specific substances and effects associated with these substances from nearly 4 million comments from the r/opiates and r/OpiatesRecovery subreddits. We built a bipartite network of substance and effect co-occurrence. For each of 16 symptoms of withdrawal, we identified the 10 most strongly associated substances. We identified 458 unique substances and 253 unique effects and validated them through clinical review. Of 130 potential remedies strongly associated with withdrawal symptoms, 41.54% were known treatments for the symptom; 13.08% could be potentially useful given their pharmacological profile; 5.38% were causes of the symptom; and 30.00% were other/unclear. While these results are useful for the identification of off-label use of certain medications, the question arises about the representativeness and meaning of these numbers. We evaluated temporal trends in the discussions of different drugs and drug use practices and searched for “Ground Truth” data to evaluate the observed trends. We show examples where, for the same trends in social media, several different ground truth trends are plausible, and conversely, for the same ground truth, trends in the numbers of posts, comments, and authors differ substantially. For social media studies as such, representativeness and validation on ground truth are not well defined. We discuss the results and future research directions.
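As a rough illustration of the bipartite co-occurrence network and the ranking of substances per symptom, the sketch below counts how often a substance and an effect appear in the same comment. It stands in for the automated entity extraction with plain token matching against small lexicons, and it measures association by raw co-occurrence counts; both choices, along with the placeholder terms and comments, are assumptions and not the authors' method.

```python
from collections import Counter


def cooccurrence_edges(comments, substances, effects):
    """Count substance-effect pairs that appear in the same comment.

    The result is the weighted edge list of a bipartite network:
    keys are (substance, effect) pairs, values are co-occurrence counts.
    """
    edges = Counter()
    for text in comments:
        tokens = set(text.lower().split())
        for substance in substances & tokens:
            for effect in effects & tokens:
                edges[(substance, effect)] += 1
    return edges


def top_substances(edges, effect, k):
    """Return the k substances that co-occur most often with the given effect."""
    ranked = sorted(
        ((count, substance) for (substance, eff), count in edges.items() if eff == effect),
        reverse=True,
    )
    return [substance for count, substance in ranked[:k]]


# Placeholder lexicons and comments, invented purely for illustration.
substance_lexicon = {"drug_a", "drug_b"}
effect_lexicon = {"symptom_x", "symptom_y"}
example_comments = [
    "drug_a helped symptom_x",
    "symptom_x again tried drug_a",
    "drug_b for symptom_y",
    "drug_b eased symptom_x",
]
edges = cooccurrence_edges(example_comments, substance_lexicon, effect_lexicon)
best = top_substances(edges, "symptom_x", 1)
```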
Examining Latino Political Engagement and Activity on Social Media: Combining Survey Responses with Digital Trace Data
- Professor Marisa Abrajano (University of California, San Diego)
- Ms Marianna Garcia (UC, San Diego)
- Dr Robert Vidigal (New York University)
- Mr Aaron Pope (New York University)
- Dr Edwin Kamau (New York University)
- Professor Joshua A. Tucker (New York University)
Social media is used by millions of Americans for political news and information. We combine detailed survey responses from over 4000 respondents recruited on Facebook with digital trace data supplied by a subset of over 1000 of those respondents covering web-browsing behavior, Facebook use, Twitter use, and YouTube use to examine this. Much research has focused on understanding how English-language social media consumption affects one’s political behavior and preferences. Yet much less is known about the way Latinos more generally, and Spanish language reliant Latinos specifically, use social media to access news and discuss politics. Our sample consists of approximately 3000 self-identified Hispanics, split almost equally between English-dominant, Spanish-dominant, and bilingual respondents. We examine the way Latinos use social media to acquire political information and how it compares to the behavior of white Americans. Based on previous research, we hypothesize Latino political engagement (as measured by the number of politicians followed on Twitter, frequency of discussing politics, sharing a news post, etc.) on Facebook, Twitter, WhatsApp and Instagram to be lower than it is for non-Hispanic whites. Within the Latino population, we expect to find variations based on their language proficiency. Namely, Latinos who rely on Spanish-language social media for news and politics should be more likely to believe in political misinformation, given that misinformation is more widespread in Spanish-language social media than on English-language platforms (Valencia 2021). We find minimal support for our first hypothesis; levels of political activity on social media do not vary significantly between white Americans and Latinos. In fact, Latino political interest on WhatsApp is significantly greater than it is for white Americans. Moreover, our findings reveal that Latinos relied more on social media for information about COVID-19 than did whites.
We also show that Latino reliance on Spanish-language social media is correlated with beliefs in political misinformation. We examine reliance on social media for health-related information (i.e., Covid). We also examine Latinos’ belief in several pieces of misinformation that circulated during the 2022 elections, and relate levels of belief to use of social media.
Session 2: Adept Adaptation: Using auxiliary data and advanced modelling methods to predict eligibility, contact, participation and response
(Room: Da Vinci 200)
Age-Eligibility Oversampling to Reduce Screening Costs in a Multimode Survey
- Dr Stephanie Zimmer (RTI International) – Presenting Author
- Mr Joe McMichael (RTI International)
- Dr Taylor Lewis (RTI International)
Some surveys have a narrow range of eligibility, including age subgroups and special populations such as smokers. It is expensive and inefficient to sample households that do not have any eligible people. The National Survey of Family Growth has a target population of Americans aged 15 through 49; thus, we would like to minimize sampling of households that contain only people aged 50 and older. In this paper, we discuss a method to oversample households within sampling units that are more likely to be age-eligible. RTI has an enhanced address frame which includes addresses as well as data from marketing vendors. Using the enhanced frame and historic survey data from a prior, unrelated study, we developed a model to predict whether households have people in the targeted age range. We will discuss the method to build the model, score the model on the sampling frame, and create age-eligibility strata to allocate more of the sample to households with higher likelihood of eligibility. We use data from 2022 to show how the model performed and how we will change the allocation in future years of data collection.
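The scoring, stratification, and allocation steps can be sketched as below. The abstract does not give the stratum cutoffs or the oversampling rates, so the cutoffs, relative sampling rates, and frame counts here are hypothetical, and the allocation rule (sample proportional to stratum size times a relative rate) is one simple choice among several.

```python
def assign_strata(scores, cutoffs):
    """Map each predicted eligibility probability to a stratum index (0 = least likely eligible)."""
    return [sum(score >= cutoff for cutoff in cutoffs) for score in scores]


def allocate_sample(stratum_sizes, relative_rates, total_sample):
    """Split the total sample across strata in proportion to size times relative sampling rate."""
    measures = [size * rate for size, rate in zip(stratum_sizes, relative_rates)]
    total_measure = sum(measures)
    return [round(total_sample * measure / total_measure) for measure in measures]


# Hypothetical model scores for six addresses and two cutoffs defining three strata.
predicted_eligibility = [0.10, 0.20, 0.50, 0.60, 0.90, 0.95]
strata = assign_strata(predicted_eligibility, cutoffs=[0.4, 0.8])

# Hypothetical frame counts per stratum; the high-eligibility stratum is sampled
# at three times the rate of the low-eligibility stratum.
allocation = allocate_sample(
    stratum_sizes=[1000, 1000, 1000],
    relative_rates=[1, 2, 3],
    total_sample=600,
)
```

Because the strata are sampled at different rates, base weights would need to reflect the unequal selection probabilities.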
Calling the Right Cases: Using Predictive Modeling to Direct Outbound Dialing Effort in an Address-Based Sample
- Mr Michael Jackson (SSRS) – Presenting Author
- Mr Todd Hughes (University of California – Los Angeles)
- Ms Cameron McPhee (SSRS)
Cost increases have caused many researchers to move away from computer-assisted telephone interviewing (CATI) as a primary contact mode. In the U.S., this has caused a shift away from the random digit dialing (RDD) frame. Address-based sampling (ABS), with mail as the primary contact mode, has become a popular replacement. Advantages of ABS include the ability to push respondents to self-administered Web instruments (by providing login credentials in mailings), avoiding the costs of live interviewing; and to append a wide array of auxiliary variables to the sample, allowing more extensive analysis and modeling of response behavior. However, under-surveyed populations, including lower-income households, immigrants, and non-native English-speakers, show a low propensity to respond to mailed push-to-Web contacts. Therefore, to ensure representative samples, it is often necessary to retain CATI as a secondary contact mode, by appending phone numbers to addresses and/or including a supplemental RDD sample. Researchers using phone with ABS face a tradeoff between cost control and representativeness: cost control argues for minimizing the number of outbound dialing attempts, but indiscriminate reduction of dialing may compromise representativeness. Adaptive survey designs may help to resolve this dilemma. Adaptive designs use predictive models of response behavior to direct more intensive data collection efforts to cases for which they are most likely to be effective and/or necessary. The availability of extensive auxiliary data, combined with machine-learning techniques, makes ABS samples a strong potential use case for adaptive designs. Using predictive modeling, it may be possible to prioritize CATI for units that are most likely to yield an interview in a target subpopulation, while reducing dialing effort for other units. 
This paper will evaluate the application of predictive modeling and adaptive design to CATI follow-up in the 2022 California Health Interview Survey (CHIS). CHIS transitioned from RDD to ABS in 2019, but retained a CATI component (using appended numbers and an RDD oversample of prepaid cell phones) to meet the detailed demographic targets necessary to ensure full representation of California’s diverse population. Beginning in 2022, an adaptive design was introduced to reduce outbound dialing costs without compromising demographic representativeness. This design uses random forest response propensity (RP) models to predict the probability that additional dialing attempts will yield a response, based on paradata such as the outcomes of earlier attempts. Dialing is ended early for low-RP cases. The exact threshold for cutting off dialing varies based on separate models of the likelihood that the address includes members of target demographics. This paper will evaluate the impact of this adaptive design on dialing effort, response rates, and representativeness in the CHIS sample. Because the adaptive design was implemented midway through the 2022 collection on a non-experimental basis, the analysis will rely on a quasi-experimental difference-in-differences design. Results will provide insight into the ability of predictive modeling and adaptive design to improve the efficiency of CATI as a secondary contact mode in ABS designs.
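The stopping rule described here, where dialing ends early for low-RP cases and the cutoff depends on how likely the address is to contain target demographics, might look like the sketch below. The linear form of the cutoff and every numeric value are hypothetical; the abstract does not state the actual thresholds, and in practice both probabilities would come from fitted models rather than being passed in by hand.

```python
def rp_cutoff(p_target, base_cutoff=0.05, min_cutoff=0.01):
    """Response-propensity cutoff below which dialing stops.

    The cutoff shrinks as the modeled probability that the address contains
    a target demographic (p_target) rises, so likely-target cases are dialed longer.
    base_cutoff and min_cutoff are illustrative values only.
    """
    return max(min_cutoff, base_cutoff * (1.0 - p_target))


def keep_dialing(response_propensity, p_target):
    """Return True if another dialing attempt should be made for this case."""
    return response_propensity >= rp_cutoff(p_target)


# Two hypothetical cases with the same low response propensity.
unlikely_target = keep_dialing(response_propensity=0.03, p_target=0.0)
likely_target = keep_dialing(response_propensity=0.03, p_target=0.9)
```

The two cases share the same response propensity, but only the one likely to contain a target demographic stays in the dialing queue.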
Longitudinal Nonresponse Prediction and Bias Mitigation with Machine Learning
- Mr John Collins (University of Mannheim) – Presenting Author
- Dr Christoph Kern (University of Muenchen)
We apply and compare novel time-series classification techniques to improve the prediction of participant nonresponse and reduce nonresponse bias in panel surveys. While panel surveys are an irreplaceable source for social science researchers, nonresponse can lead to significant loss of data quality. To prevent attrition, researchers have turned to predictive modelling to identify at-risk participants and support early interventions. In particular, machine learning-based approaches have shown promising results for predicting participant nonresponse. However, researchers have not yet fully exploited the time-series nature of panel data. Current research commonly accounts for time with ad-hoc approaches such as rolling-average features. These rolling averages are then fed into ‘time-static’ models, which are models that do not receive time as a variable. In these techniques, the order of events has no impact on prediction. In this research, we apply two sets of novel time-series classification techniques to predicting nonresponse. First, we apply recurrent neural networks (RNNs). RNNs reveal the importance of recent events when predicting future nonresponse. For example, a time-static model, such as random forest, may utilize a feature ‘average survey enjoyment rating over past three waves,’ but an RNN would utilize the un-aggregated array of ‘survey enjoyment ratings.’ In this scenario, two different arrays of ratings, with the same average, would be treated as different predictive factors, whereas the time-static model would consider them identical. Second, we additionally investigate kernel, distance, and feature-based time-series classification algorithms. These techniques transform time-series data into variables which describe the features of the time-series. Example variables include the seasonality or volatility of the time sequence. These variables are then used as predictors for another machine learner, such as random forest.
We apply these techniques to data from the GESIS Panel, a large-scale probability-based German panel study. We take the current best performing nonresponse prediction models as baseline and compare them to our novel approaches. We find that the RNN-based approaches show promising results. Regarding the kernel, distance, and feature-based time-series classification algorithms, we find that only feature-based approaches show adequate performance. Additionally, we demonstrate how to convert predictive analysis into actionable advice for survey practitioners. Minimizing nonresponse does not necessarily minimize nonresponse bias. We therefore compare the outlined prediction methods not only with respect to prediction performance, but also regarding their potential to reduce nonresponse bias based on simulated interventions. In this context, we investigate and compare the composition of at-risk panellists identified by different machine learning models. We then study the downstream bias implications of treatment regimes that draw on these predictions. Our experiments demonstrate how prediction-based interventions affect variable-level nonresponse bias scores. From our investigation, we provide panel survey practitioners with a more effective approach to predicting participant attrition, targeting interventions, and reducing biases.
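The contrast between time-static and order-aware inputs drawn in this abstract can be shown with a toy example. The sketch below is not an RNN and is not the authors' model: `recency_weighted` is a hypothetical stand-in that uses a simple recurrence to show how an order-sensitive summary separates two rating sequences that a rolling average treats as identical. The ratings are invented.

```python
def rolling_mean(ratings):
    """Time-static feature: the average ignores the order of the waves."""
    return sum(ratings) / len(ratings)


def recency_weighted(ratings, decay=0.5):
    """Order-sensitive summary via the recurrence
    h_t = decay * h_{t-1} + (1 - decay) * x_t, so later waves weigh more."""
    state = 0.0
    for rating in ratings:
        state = decay * state + (1 - decay) * rating
    return state


declining = [5, 3, 1]  # enjoyment falls over three waves
improving = [1, 3, 5]  # enjoyment rises over three waves

print(rolling_mean(declining), rolling_mean(improving))
print(recency_weighted(declining), recency_weighted(improving))
```

Both sequences share the same mean, so a model fed only the rolling average cannot tell them apart, while the recurrence yields a lower value for the declining sequence than for the improving one.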
Predicting Future Panel Participation in AmeriSpeak Panel
- Dr Stas Kolenikov (NORC) – Presenting Author
- Dr Ipek Bilgen (NORC)
- Dr David Dutwin (NORC)
Attrition and nonresponse are important concerns in probability-based panels such as AmeriSpeak, the highest quality panel available in the U.S., given the importance of achieving a representative panel with active panelists and the cost of keeping the panel highly representative. In this project, we model unit response of the recruited panelists over time, accounting for the paradata of the surveys they are invited to participate in, such as the incentive offered and the length of the survey. Each panelist is thought of as following a latent trajectory model with person-by-survey response propensity featuring person-specific intercepts (their propensity to respond to the very first survey they are invited to) and person-specific slopes (the extent to which they lose interest with each subsequent survey). We extend this idea to build a multilevel cross-classified two-way model to estimate panelists’ survey-specific propensity of unit response, incorporating predictors at appropriate levels, and estimate it using Bayesian methods (RStan). We find that slower attrition is associated with higher education, age, and being female. Racial and ethnic minorities have much lower initial propensities to respond than non-Hispanic whites, but attrite at slower rates. The past values of the self-reported willingness to participate in future surveys are highly predictive of subsequent unit response, validating it as a lead indicator of attrition. Predicting future response allows panel operations to implement interventions before attrition occurs.
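The person-specific intercept and slope described above can be sketched as a simple propensity function. This is only an illustration under stated assumptions: the abstract does not name the link function, so the logistic link here is an assumption, and the function `response_propensity`, the additive `survey_effect` term, and all parameter values are hypothetical rather than estimates from the AmeriSpeak model.

```python
import math


def response_propensity(intercept, slope, survey_number, survey_effect=0.0):
    """Assumed logistic form: logit(p) = intercept + slope * (survey_number - 1) + survey_effect.

    intercept     -- person-specific propensity (logit scale) at the first invitation
    slope         -- person-specific change per subsequent survey (negative = losing interest)
    survey_effect -- hypothetical survey-level term (e.g., incentive or length)
    """
    logit = intercept + slope * (survey_number - 1) + survey_effect
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical panelist: fairly responsive at first, slowly losing interest.
for k in (1, 5, 10):
    print(k, round(response_propensity(intercept=1.0, slope=-0.2, survey_number=k), 3))
```

With a negative slope the predicted propensity falls with each additional invitation, which is the attrition pattern that the person-specific slopes are meant to capture.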
Optimizing data collection interventions using personalized models
- Mr Michael Duprey (RTI International) – Presenting Author
- Dr Rebecca Powell (RTI International)
- Dr Jerry Timbrook (RTI International)
- Ms Melissa Hobbs (RTI International)
- Dr Brian Burke (RTI International)
- Professor Allison Aiello (Columbia University)
Maximizing sample representativeness and minimizing the potential for nonresponse bias is of paramount importance within nationally-representative longitudinal research studies. Contrasted against the persistent decline of response rates, researchers are presently challenged to identify new methods to increase response from sample members. While uniformly increasing data collection efforts across the entire sample may be one such option, resource constraints act as a contravening force, indicating that optimization of the effort applied to each case is a more efficacious approach. To this end, adapting intervention efforts (e.g., contact mode, time, and day) conditioned on sample members’ prior response behaviors and demographic characteristics may increase the effectiveness of data collection efforts without substantially increasing cost. In this paper, we describe the use of machine learning to develop and test a tailored strategy for contacting sample members and prompting them to complete a new wave of a major longitudinal survey. Our approach uses data from the web-based component of Wave VI of the National Longitudinal Study of Adolescent to Adult Health (Add Health; n=14,725), with interventions starting in January 2023. Sample members are contacted by a mix of paper mailings and email messages throughout the field period. Clustering and predictive modeling are employed at multiple time-points to group sample members using their demographic characteristics and response behaviors from past waves and previous current-wave data collection periods. We use these models to predict the current optimal contacting strategy (e.g., best contact mode, time, and day) for each class of characteristics. Our paper describes the methodology used to conduct the clustering and modeling, the tailored contacting strategy implemented during data collection, and results of the modeling approach to date.
Differential Privacy for Surveys – Things we know and problems that still need to be solved
- Mr Joerg Drechsler (Institute for Employment Research and University of Maryland) – Presenting Author
The concept of differential privacy gained substantial attention in recent years, most notably since the U.S. Census Bureau announced the adoption of the concept for the 2020 Decennial Census. However, despite its attractive theoretical properties, implementing the approach in practice is challenging, especially when it comes to survey data. In this talk I will present early results from a project funded by the U.S. Census Bureau that explores the possibilities and limitations of differential privacy for survey data. I will highlight some key findings from the project and also discuss some of the challenges that would still need to be addressed if the framework should become the new data protection standard at statistical agencies.
CONCURRENT SESSIONS A
Session 3: You will brag about this LAG: modern methods for creating Labels, Asking questions and Gathering data
(Room: Da Vinci 201)
Best practices in data donation: A workflow for studies using digital data donation
- Mr Thijs Carrière (Utrecht University) – Presenting Author
- Dr Laura Boeschoten (Utrecht University)
- Dr Heleen Janssen (University of Amsterdam)
Digital trace data are gaining popularity and are considered valuable for scientific research. These data can be collected through multiple methods. Boeschoten, Mendrik et al. (2022) argue that data donation is a fruitful and innovative method for collecting digital trace data while preserving the privacy of study participants. In data donation, participants first request a copy of the digital trace data that platforms have collected on them. Thanks to the European Union’s General Data Protection Regulation (GDPR) and similar legislation in other territories, digital platforms are obliged to comply with this request. They often provide these digital trace data in the form of data download packages (DDPs), a folder with files containing all of the data collected by the platform on the user. Participants can then actively consent to donate (part of) these DDPs to the researcher. Boeschoten, Mendrik et al. (2022) developed PORT, software specifically built for data donation. In PORT, processing and filtering the targeted data from the DDPs happens locally on the device of the participant, contributing to the preservation of the participant’s privacy. Each method of data collection introduces potential sources of error to the collected data. In methodological research, total error frameworks (TE frameworks) are constructed to summarize all these potential error sources for a data collection method. In addition to PORT, Boeschoten, Ausloos et al. (2022) designed a TE framework specifically for data donation. Based on this TE framework, as well as on the authors’ experience in designing and carrying out data donation studies, we propose a workflow that guides researchers through setting up a data donation study, in such a way that it accounts for the errors identified by the TE framework.
In the workflow, six main steps are identified: (1) operationalizing the research idea and specifying the DDPs and data of interest; (2) working out the study design; (3) constructing the feature extraction script; (4) configuring the software and donation tool; (5) ethical approval of the study and related legal issues; and (6) conducting a pilot study and improving towards a full study. Additionally, five different domains of expertise and skills needed for the different steps are specified, represented by different roles: (1) a main researcher with substantive knowledge on the subject and responsible for the study; (2) an expert in the field of methodology, highly involved with the study design; (3) a research engineer, coding the extraction script; (4) an IT expert responsible for the software implementation; and (5) an expert on legal and privacy issues, who can be consulted in the different steps of the study. We illustrate all steps of the workflow, discuss what expertise is required at what step, and show how these steps combined take into account all possible sources of error described by the TE framework. Therefore, the proposed workflow aims to improve the quality of future data donation studies.
A Practical Guide to (Successfully) Collect and Process Images in the Frame of Web Surveys
- Ms Patricia A. Iglesias (Research and Expertise Centre for Survey Methodology) – Presenting Author
- Mr Carlos Ochoa (Research and Expertise Centre for Survey Methodology)
- Dr Melanie Revilla (Research and Expertise Centre for Survey Methodology & Institut Barcelona d’Estudis Internacionals)
Asking web survey respondents to answer questions by sharing images is a practice that has gained prominence in recent years. Indeed, images are expected to help obtain more accurate and/or new insights, while also reducing the respondents’ burden. Although this new collection strategy may offer many advantages, it requires researchers to know how to operationalize, collect, process, and analyze this type of data, which is not yet widespread expertise among survey practitioners. Some of the steps considered are similar to the ones followed when measuring concepts through conventional survey questions, for instance, defining and operationalizing the concepts of interest. Nevertheless, we expect the collection and processing of images to present more/different challenges compared to conventional survey questions. First, specific tools are required to collect and store images. In addition, participants need to have the skills and be willing to share such data. Moreover, the data need to be available to be shared during the survey. Furthermore, once images are collected, a question such as “what do I do with these images?” turns out to be more challenging than expected for researchers using them for the first time. This is highly relevant, since the possibility to improve data quality and/or provide new insights by requesting images largely depends on how researchers deal with the images they receive. The process of extracting information and assigning labels to the items contained in an image is called “classification”. This can be compared to reading and coding responses in conventional open-ended questions. Although in the field of computer vision the classification of images is a central topic that has been extensively investigated, giving rise to numerous advances and actual applications in recent years, there is not much evidence on the specific case of images gathered through surveys.
This paper aims to guide researchers inexperienced in image analysis by presenting the main steps involved in the process of collecting images as a new data source: 1) operationalization, 2) definition of the labels, 3) choice of the most suitable classification method(s), 4) collection of the images, 5) classification of the images, 6) verification of the classification outcomes, and 7) data analysis. The seven-step process proposed in this paper will help practitioners get a better idea of the tasks to be accomplished and the associated risks and benefits. This can help them decide whether or not to use images to measure their concepts of interest. In addition, if they decide to use images, following the seven-step process proposed will help them get the best out of their data.
Identifying Common Abortion Measurement Across Studies and Wording: A New Technique For Leveraging Historical Survey Data
- Professor Josh Pasek (University of Michigan) – Presenting Author
The wealth of survey data that has been amassed over the last century represents an invaluable tool for understanding human beliefs, attitudes, and behaviors and how they have evolved over time. But although there are tens of thousands of datasets that have been made available to researchers, scholars are often unable to use more than a handful of these for any given project. One principal reason for this constraint is that many questions, even those asking about similar topics, employ different wordings and response options. This means that it is often difficult to tell whether differences between responses to questions are indicative of items that track subtly different topics, methodological choices in administration, or substantive changes over time. In practice, scholars examining trends over time often conduct analyses on the small subset of questions that are asked identically at multiple time points instead of engaging with the full richness of the available data. The current study proposes and illustrates a novel solution to identifying common survey questions across different data collections. By norming responses across items, estimating vectors of correlations between target items and demographic measures, and further correlating those correlations across datasets, we develop a way to measure the similarity of questions that were not asked concurrently. Further, using clustering techniques and evaluating common terminology across clusters (by applying NLP approaches to question wordings), we can identify which questions behave similarly within a given topic and highlight the common concepts that associate those questions.
We apply these techniques to a dataset evaluating hundreds of distinct measures of abortion attitudes and illustrate that this technique allows us to (1) identify the different types of historical questions that exist to measure views on abortion, (2) discern the similarity of those different types of questions, and (3) estimate how attitudes toward different types of questions have themselves trended over time both overall and within population subgroups. We discuss the benefits and limitations of the technique as well as its potential applicability across issues.
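The core idea in this abstract (norm the responses, compute each item's vector of correlations with demographic measures, then correlate those vectors across datasets) can be sketched in a few lines. This is a minimal illustration, not the author's code: the helper names `zscore`, `pearson`, `demographic_profile`, and `item_similarity` are hypothetical, the toy data are invented, and the sketch assumes both datasets carry the same demographic measures in the same order.

```python
from statistics import mean, pstdev


def zscore(values):
    """Norm responses to mean 0 and SD 1 so items on different scales are comparable."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        raise ValueError("constant variable cannot be normed")
    return [(v - m) / s for v in values]


def pearson(x, y):
    """Pearson correlation, computed from normed values."""
    zx, zy = zscore(x), zscore(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(zx)


def demographic_profile(item, demographics):
    """Vector of correlations between one survey item and each demographic measure."""
    return [pearson(item, demo) for demo in demographics]


def item_similarity(item_a, demos_a, item_b, demos_b):
    """Correlate the two correlation vectors; the items may come from different surveys."""
    return pearson(demographic_profile(item_a, demos_a),
                   demographic_profile(item_b, demos_b))


# Invented toy data: age, female indicator, education level for six respondents.
demos = [[20, 30, 40, 50, 60, 70], [0, 1, 0, 1, 0, 1], [1, 3, 2, 4, 2, 5]]
item = [1, 2, 2, 4, 3, 5]
reversed_item = [-v for v in item]  # same question with the response scale flipped

print(item_similarity(item, demos, item, demos))
print(item_similarity(item, demos, reversed_item, demos))
```

Because Pearson correlation is scale-invariant, the norming step does not change the correlations themselves here; it matters mainly when responses to differently scaled items are compared or pooled directly. A practical version would use many more demographic measures, since a correlation computed over three values is very unstable.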
A Conceptual Model of Labeling in Supervised Learning (with Implications for Survey Coding)
- Mr Robert Chew (RTI International) – Presenting Author
Supervised machine learning algorithms are primarily trained on human labeled data. While these labels are often assumed to be free of error and represent an objective “ground truth” during modeling, in practice, they often exhibit measurement error. If present in the training data, this can lead to a reduction in the model’s ability to learn and generalize. If present in the test set, it can severely bias hold-out error rate estimates. Despite these important impacts, there is little research devoted to understanding the drivers of label noise (difference between observed label and true class value). To help connect disparate strands in the literature and better understand how components of the labeling process may impact label variation, we propose a conceptual model of labeling. This conceptual model focuses on potential causes of variation in labeling decisions within and between annotators. Given its similarity in structure, we will also explore implications of this work to survey coding and total survey error.
CONCURRENT SESSIONS B
Session 1: Does the Trace save Face? Exploring how digital trace data can be leveraged to estimate social media usage
Surveys or digital trace data, which one should we use? Using Generalized MultiTrait-MultiMethod models to simultaneously estimate the measurement quality of surveys and digital trace data.
- Mr Oriol J. Bosch (University of Oxford / The London School of Economics) – Presenting Author
- Dr Melanie Revilla (IBEI)
- Professor Patrick Sturgis (The London School of Economics)
- Professor Jouni Kuha (The London School of Economics)
Measuring what people do online is crucial across all areas of social science research. Although self-reports are still the main instrument to measure online behaviours, there is reason to doubt their validity. Consequently, researchers are increasingly relying on digital trace data to measure online phenomena, assuming that it will lead to higher quality statistics. Recent evidence, nonetheless, suggests that digital trace data is also affected by measurement errors, questioning its gold standard status. Therefore, it is essential to understand (a) the size of the measurement errors in digital trace data, and (b) when it might be best to use each data source. To this aim, we adapt the Generalised MultiTrait-MultiMethod (GMTMM) model created by Oberski et al. (2017) to simultaneously estimate the measurement errors in survey and digital trace data. The GMTMM allows both survey and digital trace data to contain random and systematic measurement errors, while accommodating the specific characteristics of digital trace data (e.g., zero-inflation). To simultaneously assess the measurement quality of both sources of data, we use survey and digital trace data linked at the individual level (N = 1,200), collected using a metered online opt-in panel in Spain. Using this data, we estimated three separate GMTMM models focusing on the measurement quality of survey and digital trace data when measuring three different types of online behaviours: 1) news media exposure, 2) online communication and 3) entertainment. Specifically, for each type of behaviour, we measured three simple concepts (e.g., time spent reading articles about politics and current affairs) with both survey self-reports and digital traces. For each simple concept, we present the reliability and method effects of each data source. Results provide needed evidence about the size of digital trace data errors, as well as when the use of survey self-reports might still be justified.
Contrasting the accuracy of self-reported Facebook use with digital behavioral data
- Mr Ádám Stefkovics (Centre for Social Sciences – CSS Recens) – Presenting Author
- Professor Zoltan Kmetty (Centre for Social Sciences – CSS Recens)
Surveys have been the main tool for measuring online media behavior. Nevertheless, as digital traces have become increasingly available for researchers to contrast survey responses with actual behavioral data, a growing line of studies has highlighted that survey self-reports are often inaccurate, i.e., people tend to over- or underreport certain actions. Inaccuracy may originate from a lack of motivation or interest in the survey, the sensitivity or general frequency of the behavior in question, the device of completion, and various individual characteristics. Contrasting self-reported responses with “ground truth” (e.g., tracking data) can help to understand the extent to which survey responses are flawed and support the correction of systematic biases. In this paper, we use individual Facebook data to assess the accuracy of self-reported Facebook use, and evaluate two specific question order variations in an online survey experiment. We contribute to the literature in two ways. First, we pre-registered our study to replicate and extend the study of Guess et al. (2019). We use the same questions to measure the frequency of different activities on Facebook (posting, sharing, reacting, commenting) on different topics (in general, about politics, and health and health-preventive behaviors). The survey data is now being collected in Hungary from a non-probability-based access panel. After completing the survey, we ask respondents to donate their social media data via data download packages (DDPs). Participants who complete and share their Facebook and part of their Google data receive monetary incentives (around 13 euro). In our analysis, we will contrast self-reports with actual behavior observed in the historical Facebook data. Our second contribution is a question-order experiment. The survey methodological literature showed that providing an answer to a specific question can impact the way respondents answer a subsequent general question.
We randomly assigned respondents to two groups. Group A received questions in an order such that the frequency of social media activity (e.g. posting) was first asked about their activity in general topics followed by questions on specific topics in contrast to Group B who received the same questions in an order such that the specific questions preceded the general question. We hypothesized that the general self-reported frequency of each activity will be higher in the general-specific condition (H1), the specific self-reported frequency of each activity will be higher in the specific-general condition (H2), and correlations between the general and specific activities will be stronger in the specific-general condition. The results of the experiments will allow us to evaluate which question order yields the more accurate results. Our study will point the way forward for future online media behavior studies by validating survey measures and making recommendations for their refinement. The study to be replicated: Andrew Guess, Kevin Munger, Jonathan Nagler & Joshua Tucker (2019) How Accurate Are Survey Responses on Social Media and Politics?, Political Communication, 36:2, 241-258, DOI: 10.1080/10584609.2018.1504840
CONCURRENT SESSIONS B
Session 1: Humans chatting about chatbots: Promises and Challenges of Large Language Models for Survey Research
(Room: Da Vinci 128)
Seeing ChatGPT Through Students’ Eyes: An Analysis of TikTok Data
- Dr Anna-Carolina Haensch (LMU Munich) – Presenting Author
- Mrs Sarah Ball (LMU Munich)
- Mr Markus Herklotz (LMU Munich)
- Professor Frauke Kreuter (LMU Munich)
TikTok is a rapidly growing and exciting platform and a new opportunity for data collection, with over a billion active users worldwide. This provides an opportunity to collect data on user preferences, behaviors, and attitudes towards hot and emerging topics. One such topic in 2023 is AI models like ChatGPT, which have gained considerable attention among university students. However, limited understanding exists among lecturers and teachers on how students use and perceive ChatGPT. To address this, this study analyzed the content on ChatGPT that is available on TikTok. A total of 100 English-language videos tagged with the hashtag #chatgpt were examined, which collectively garnered over 250 million views until February 7th 2023. Our aim was to gain insights into the information and perspectives on ChatGPT that university students are exposed to and its potential effects on their academic pursuits. We identified different categories of #chatgpt videos on TikTok and plan to repeat our data collection in April and June 2023 to understand changes in the types of topics or prompts that are popular at different points in time. The results of our research note are valuable for educators and academic institutions, as they shed light on how students use and perceive ChatGPT, and how it may impact their academic pursuits. Our study also serves as a model case for using a new form of social media data in research, providing valuable insights into the advantages and challenges of using TikTok data for data science and social science research.
Evaluation of GPT models and prompts to create tailored questionnaires
- Ms Zoe Padgett (Momentive.ai) – Presenting Author
- Ms Laura Wronski (Momentive.ai)
- Mr Sam Gutierrez (Momentive.ai)
- Mr Tao Wang (Momentive.ai)
- Mr Misha Yakubovskiy (Momentive.ai)
With the release of ChatGPT by OpenAI, researchers at SurveyMonkey are wondering whether GPT models can be used to create high-quality automated questionnaires. Our engineers have fine-tuned GPT models to create customer experience (CX) questionnaires based on structured prompts developed by our researchers. To evaluate the questionnaires output by these models, we will score questions and questionnaires on multiple dimensions. These scores will be analyzed across several models to determine which models and prompt structures produce the most relevant, highest quality questionnaires. This presentation will detail the evaluation process and results. We will include scores by evaluation dimensions, such as clarity, bias, and flow, to gain a deeper understanding of what GPT models are good at, and not so good at, when it comes to creating survey questionnaires.
Exploring and predicting public opinion using large language models
- Mr Mathew Hardy (Princeton University) – Presenting Author
Large language models are revolutionizing artificial intelligence, leading to breakthroughs in machine translation, natural language understanding, and question answering. While traditional AI models are trained for a specific task, language models are trained on massive, diverse corpora to predict word sequences. We show that these models can be leveraged to predict people’s responses to both multiple-choice and open-ended survey questions. These predictions can be adjusted to account for personality traits, demographic features, and idiosyncratic habits. Additionally, these models can predict responses to the same question at different points in time, allowing for simple trend analysis and forecasting. While promising, we also show that using language models “out-of-the-box” can result in highly skewed estimates, even when using optimized or specialized prompting. These biases likely arise from bias in the models’ training data or fine-tuning procedures that differ from the general word-prediction objective. However, bias can be significantly reduced by calibrating general-purpose models on large survey and behavioral datasets. In domains where the model’s point estimates may still be unreliable, we show that calibrated models often preserve rank ordering between alternative questions, options, and target groups. Our results show that large language models can generate reliable, low-cost predictions of public opinion, providing a valuable complement to traditional survey methods.
While Chatbots have many answers, do they have good questions? An Experimental Study Exploring the Creation and Evaluation of Survey Questions using Automated Chatbot Tools
- Dr Trent Buskirk (Bowling Green State University) – Presenting Author
- Dr Adam Eck (Oberlin College)
- Dr Jerry Timbrook (RTI International)
- Mr Scott Crawford (Sound Rocket)
Take a glance through recent news or social media and you would be hard-pressed not to see mentions of chatbots and artificial intelligence (AI) methods aimed at generating text, images and other content. One tool, ChatGPT, has taken the spotlight and uses a complex autoregressive language model to predict the likely next word(s) in text with amazingly conversational results. Although these tools are nascent in their development, deployment and widespread use are rapidly increasing. Many people expect that this new line of AI technology will dramatically change the day-to-day operation within many industries. Naturally, we wonder: will this tech change the work in our field of survey research? In particular, how might chatbot technology be leveraged to support, enhance or expand our work? Bots are not new for survey researchers; we have increasingly encountered bot-generated content when collecting digital trace data or social media data. Similar technology is used in analyzing qualitative text data. However, an uninvestigated avenue of research is how chatbots may become part of the development and testing of survey research products such as questionnaires, cognitive interviews, and study/sampling designs. In this work, we explore how chatbots like ChatGPT might be used to generate survey questions under a wide array of topics (e.g., financial, mental, physical and emotional domains of health) and question types (e.g., attitude and behavioral questions). We compare these chatbot-generated questions to gold standard questions on these same topics from existing surveys using several different quality indicators and tools including: the Survey Quality Predictor tool (SQP: http://sqp.upf.edu/), the Question Appraisal System (QAS) and the Question Understanding Aid (QUAID: http://quaid.cohmetrix.com/).
Evaluating the quality of questions generated from ChatGPT establishes a baseline measure of usability and might also illuminate possible weaknesses of such chatbot tools (e.g., stemming from the fact that the chatbot models originally learned from a corpus of primarily non-question text that was intended for a different purpose than questionnaire development). A second feature of our research builds on these established baseline quality measures by investigating how the prompts provided to ChatGPT by a survey researcher affect the quality of the generated questions, also known as prompt engineering within the area of human-AI interactions. When issuing question-generation requests, we will prompt ChatGPT with: 1) an assumed role (e.g., “Assume you are a questionnaire designer and physical activity researcher who needs to ask questions of participants to learn more about their activity”) and 2) question context (e.g., “Consider the best way to ask this question in the context of a questionnaire that is designed to measure physical and mental health in humans”), as well as 3) specific requirements of survey and question characteristics (i.e., response categories, single vs. multi-response, open response, etc.). We present the results of our baseline question quality assessments and recommendations for how interested survey researchers might prompt ChatGPT to optimize the quality of chatbot-generated questions.
CONCURRENT SESSIONS B
Session 2: What’s your claim to FRAME? Sampling Design and Evaluation Using New Data Sources and Analytic Methods
(Room: Da Vinci 200)
Assessing the quality of commercial auxiliary data appended to an address-based sampling survey frame
- Dr Rebecca Medway (American Institutes for Research) – Presenting Author
- Dr Rachel Carroll (American Institutes for Research)
Targeted and adaptive designs are increasingly being incorporated in survey methods. In these types of designs, researchers may use information known about sample members prior to data collection to tailor the methods or materials that are used for particular subgroups of interest. Such designs rely on the availability of high-quality data that are predictive of outcomes of interest, such as eligibility or response (Buskirk, Malarek, & Bareham 2014; West et al. 2015). This creates challenges for mailed surveys that use address-based sampling frames, which tend to lack detailed information about household members prior to data collection. Commercial auxiliary data providers offer a wide array of variables that can be appended to address-based samples to facilitate targeted or adaptive designs, but questions remain about the quality and accuracy of these data (Disogra, Dennis, & Fahmi 2010; Pasek et al. 2014). In this presentation, we will report on the quality and utility of a newly acquired commercial auxiliary data source that was appended to the address-based sampling frame used for the 2023 National Household Education Survey (NHES). The NHES is sponsored by the U.S. National Center for Education Statistics and provides descriptive data about the educational activities of children and families in the United States. For NHES:2023, a sample of 205,000 U.S. addresses was drawn from a file of residential addresses that was based on the U.S. Postal Service's Computerized Delivery Sequence File. In addition to address variables, the sampling frame included only a handful of variables (e.g., gender and age of head of household, presence of children, number of adults in the household) that had been appended to the file by the frame vendor. By contrast, the new commercial data source is promising because it includes about 200 additional variables on topics ranging from household composition to receptiveness to various modes of communication to commercial behavior.
We will start by discussing the match rate (how many of the NHES records could be located on the commercial data files?). We will then report on the extent of missing data among matched cases, both at the address level (what percent of variables are missing for this address?) and the variable level (what percent of addresses are missing data on this variable?). We will also present findings related to the degree of consistency of the commercial data with (1) the variables available on the survey sampling frame and (2) self-reports from NHES respondents; for example, if the frame suggests there are children living in the household, does the commercial data suggest the same? Finally, we will discuss whether the auxiliary data appear to improve our ability to make model-based predictions of an address's propensity to respond to the NHES. We will also touch on whether these findings are consistent with a similar analysis conducted using NHES:2016 data and auxiliary data from a different vendor.
Targeted Random Door-to-Door Sampling Design for COVID-19 Informed by Community Wastewater
- Dr Katherine McLaughlin (Oregon State University) – Presenting Author
- Dr Jeffrey Bethel (Oregon State University)
- Dr Benjamin Dalziel (Oregon State University)
- Dr Roy Haggerty (Louisiana State University)
- Dr Kathryn Higley (Oregon State University)
- Dr Jane Lubchenco (Oregon State University)
TRACE-COVID-19 is an Oregon State University public health project designed to gather information about the presence of SARS-CoV-2, the virus that causes COVID-19, in communities. Throughout 2020-2021, TRACE (Team-based Rapid Assessment of community-level Coronavirus Epidemics) used random sampling to understand the prevalence of the virus, rather than relying on clinically reported cases, which are subject to biases that changed over time. In this talk, we provide an overview of a targeted random door-to-door sampling method that we developed, informed by community wastewater measurements, and implemented in two communities in Oregon in 2021. The sampling design is a three-stage design with strata informed by microsewershed boundaries, clusters corresponding to one or more adjacent census blocks selected with probability proportional to size, and systematic sampling of housing units within clusters. The details of the design, data requirements for implementation, design-based analysis strategies, methods to address nonresponse and measurement error, and results will be discussed. This design is intended to allow the allocation of field teams collecting nasal swabs such that an unbiased prevalence estimate can be obtained while attempting to discover positive individuals. We also detail a simulation study framework that can be used to consider adapting the design across communities with different characteristics and different pathogens. The survey design and analysis for this project involve the integration of data at different spatial and temporal scales, including time series of wastewater measurements collected at microsewersheds throughout a community, Census data at the block level, and information provided by local public health officials. We also discuss lessons learned and how our sampling and surveillance techniques have evolved over the course of the pandemic.
With an eye toward the future, we consider the role of survey data integration with epidemic modeling and passive surveillance to create cities that are more resilient to pandemics.
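The last two stages of the design described above (clusters selected with probability proportional to size, then systematic sampling of housing units within clusters) can be sketched generically as follows. This is a minimal textbook illustration for a single stratum, not the TRACE implementation; the function names, cluster sizes, and sample sizes are all made-up assumptions.

```python
import random


def pps_systematic(sizes, n, rng):
    """Pick n cluster indices with probability proportional to size.

    Uses systematic PPS: walk the cumulative size scale with a fixed step
    from a random start. A cluster larger than the step could be hit twice.
    """
    total = sum(sizes)
    step = total / n
    start = rng.random() * step
    picks, cum, i = [], 0.0, 0
    for k in range(n):
        target = start + k * step
        # Advance to the cluster whose cumulative size interval contains target.
        while i < len(sizes) - 1 and cum + sizes[i] <= target:
            cum += sizes[i]
            i += 1
        picks.append(i)
    return picks


def systematic_sample(units, n, rng):
    """Systematic sample of n units: random start, then every (N/n)-th unit."""
    step = len(units) / n
    start = rng.random() * step
    return [units[min(int(start + k * step), len(units) - 1)] for k in range(n)]


rng = random.Random(2021)
block_sizes = [40, 120, 75, 60, 200, 35]  # housing units per cluster (hypothetical)
clusters = pps_systematic(block_sizes, 2, rng)
for c in clusters:
    housing_units = list(range(block_sizes[c]))
    print(c, systematic_sample(housing_units, 5, rng))
```

In a stratified version, the same two functions would be applied separately within each stratum, with the number of clusters per stratum set by the allocation rule.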
Using machine learning methods to stratify the household surveys Sampling Frame
- Dr William Jesús Constante Erazo (Particular) – Presenting Author
- Miss Diana Carolina Simbaña Flores (Particular)
Any probabilistic sampling procedure requires a tool that makes it possible to identify, select, and locate each and every one of the elements that make up the target population (Gutiérrez, 2016). Kish (1965) mentions that the sampling frame is the cornerstone around which selection processes must be designed and that its nature is a very important issue in sample design. Guzmán (2020) points out that the importance of having a quality sampling frame with optimal stratification criteria to generate efficient and precise sample designs is often underappreciated. The best stratification is the one that minimizes the sampling errors of the estimators, expressed in the form of variances or standard errors (INE, 2020). The objective of the stratification is to reduce the variance of the parameter of interest (INEC, 2021); likewise, INE (2020) indicates that the objective of building strata is the formulation of sampling designs that induce greater efficiency and precision in the estimation of official statistics. In turn, stratification makes it possible to control sample sizes by geographic area, which affects the quality of the estimates provided by the survey (INE, 2022). To carry out this stratification, variables related to housing, household and population will be used, without neglecting the geographical or territorial level or the classification by urban or rural areas (INEC, 2021; INE, 2020; DANE, 2022; INEC, 2022; INDEC, 2012; INE, 2022). The Population and Housing Census (CPV) carried out in Ecuador in 2010 will be used as a data source.
According to the review of the literature, an algorithm frequently used for the formation of strata or groups from a multivariate perspective is k-means (INE, 2020; INEC, 2021; INE, 2022). Other methods that have been applied include optimal stratification (Lavallée & Hidiroglou, 1988) (INE, 2020), the cumulative root frequency method (Dalenius & Hodges, 1959) (INE, 2020; AMAI, 2018), and the quantile method for univariate stratification (INE, 2020). On the other hand, in other countries strata have been formed using regression models (INEC, 2012) or percentiles (INDEC, 2012). Given the above, and according to what has been reviewed so far for Latin America, no evidence has been found that machine learning methods are applied to stratify the sampling frame. Accordingly, the objective of this research is to use machine learning methods to stratify the sampling frame for household surveys. The hypothesis to be tested is that stratification carried out through machine learning techniques can maintain and even improve the efficiency and precision of the official statistics published by the National Statistics Institutes. As mentioned by Tillé et al. (2022), official statistics is a special field of statistics: the methods used have been developed to deal with original questions that essentially revolve around quality. The term “Big Data” encompasses a set of sources and new statistical methods, for which an evolution of the methodology will be necessary, which must be supported by quality fundamental research.
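For readers unfamiliar with the univariate baselines mentioned above, the cumulative root frequency rule (Dalenius & Hodges, 1959) can be sketched as follows: bin the stratification variable, accumulate the square roots of the bin counts, and cut that cumulative scale into equal parts. This is a simplified illustration, not the procedure of any of the cited statistical offices; the function name, the number of bins, and the toy data are assumptions, and the values must not all be identical.

```python
import math


def cum_root_f_boundaries(values, n_strata, n_bins=50):
    """Stratum boundaries via the cumulative square-root-of-frequency rule."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes hi > lo
    counts = [0] * n_bins
    for v in values:
        j = min(int((v - lo) / width), n_bins - 1)
        counts[j] += 1

    # Cumulative sum of sqrt(frequency) over the ordered bins.
    cum, running = [], 0.0
    for c in counts:
        running += math.sqrt(c)
        cum.append(running)
    total = cum[-1]

    boundaries = []
    for h in range(1, n_strata):
        target = h * total / n_strata
        # First bin whose cumulative sqrt(f) reaches the h-th equal cut point.
        j = next(idx for idx, c in enumerate(cum) if c >= target)
        boundaries.append(lo + (j + 1) * width)  # upper edge of that bin
    return boundaries


# Toy right-skewed variable (hypothetical), split into three strata.
toy_values = [x ** 2 for x in range(1, 201)]
cuts = cum_root_f_boundaries(toy_values, n_strata=3)
print(cuts)
```

A multivariate method such as k-means would replace the single variable with a vector of housing, household, and population indicators; the comparison of interest is then the variance of the resulting stratified estimators.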
Introducing ‘Designing and Implementing Gridded Population Surveys’: A New Manual with Step-by-Step Tutorials
- Dr Dana Thomson (University of Twente) – Presenting Author
Household surveys are a primary source of information for official statistics (including one-third of SDG indicators), opinion polls, rapid needs assessments, and program evaluations. In many lower- and middle-income countries, traditional surveys suffer from outdated or inaccurate census sample frames, while simultaneously facing complex urbanization defined by highly mobile and/or large informal populations. Survey practitioners increasingly turn to modelled gridded population estimates as sample frames in these settings. In the past, designers and implementers of gridded population surveys assembled their own data and tools and developed bespoke methods largely in isolation. However, we have collated best practices and published a generalized manual with step-by-step tutorials on “Designing and Implementing Gridded Population Surveys” (www.gridpopsurvey.com). This presentation by the manual’s author summarizes who uses gridded population sampling and for what, when gridded population sampling is and is not appropriate, and the strengths and limitations of available data, tools, and methods. This is a rapidly evolving discipline with new and improved gridded population datasets becoming available, as well as a range of tools that might be used to derive gridded population-based sample frames and select sampling units. Survey practitioners in government, the private sector, academia, and the non-profit sector should find this session informative whether or not they plan to use gridded population sampling. Resource: Thomson, DR. 2022. Designing and Implementing Gridded Population Surveys (DA Rhoda, Ed.). Available at: www.gridpopsurvey.com
CONCURRENT SESSIONS B
Session 3: When the Kaleidoscope becomes the Megaphone! New Tools and Resources for Improving Dissemination and Insights that are based on Multiple Data Sources
(Room: Da Vinci 201)
Census Household Panel: Opportunities for Data Triangulation, Harmonization and Integration
- Ms Jennifer Childs (U.S. Census Bureau) – Presenting Author
- Dr Jason Fields (U.S. Census Bureau)
- Dr Aleia Fobia (U.S. Census Bureau)
- Ms Cassandra Logan (U.S. Census Bureau)
- Dr Stephanie Coffey (U.S. Census Bureau)
- Dr Jennifer Ortman (U.S. Census Bureau)
Early research and development work has demonstrated the value of a high-quality panel to improve representativeness of and significantly reduce burden on households in the interest of collecting high-frequency data. This paper outlines plans for the development of the Census Household Panel (CHP) consisting of a pool of households carefully selected, recruited, and refreshed by the Census Bureau to reflect the diversity of our nation's population. Development of the CHP at the Census Bureau allows the agency to use representative respondent pools accurately and quickly, responding to the need for timely insights on an array of topics and improving data outputs inclusive of historically undercounted populations. The initial goal for the development phase of the CHP is 15,000 households linked to the Census Bureau's gold standard Master Address File (MAF). The MAF will be linked to administrative records securely maintained and curated by the Census Bureau to provide additional information to ensure representativeness and enhance the informative power of the CHP. Initial invitations to enroll in the CHP will be sent by mail and questionnaires will be mainly completed through internet self-response. The CHP will maintain representativeness by allowing respondents who do not use the internet to respond via in-bound computer-assisted telephone interviewing (CATI). All panelists will receive an incentive for each complete questionnaire. Periodic replenishment samples will maintain representativeness and panelists will be replaced after completing their initial term. This panel will become integral to the Demographic High Frequency Survey program, rapidly providing insight on national events that may impact social, economic, or demographic characteristics of the population. Traditionally, federal surveys are designed to collect and disseminate data on a slower timetable to produce statistically robust key measures of the society and economy.
In keeping with growing needs for more timely information, however, the Census Bureau seeks to complement these important, established surveys with new data sources, such as the Census Household Panel, which can produce data much closer to real time as the events develop, emphasizing timeliness over some of the more comprehensive survey processing procedures used in the long-standing federal surveys. We also will look at alternative methods for enhancing data with administrative and other external data sources and for developing modeled data. The frame for the CHP will begin with the MAF, but evolve to be a product of the Census Bureau's nascent Demographic Frame. The Demographic Frame comprises demographic, social, and economic characteristics of individuals derived from census, survey, administrative, and third-party data sources. With the CHP, we will have the opportunity to study the triangulation, harmonization, and integration of disparate data sources from various parts of the survey lifecycle, enhancing representativeness, data quality, data processing, and data products. We will be able to study various entry points for administrative and alternate data sources, including pre-data collection (sampling), during data collection (adaptive design in contact and incentive structure) and post processing (weighting). This paper will explore the opportunities for triangulation, harmonization, and integration of
The Social Impact Data Commons: Regional Data-Driven Decision-Making
- Ms Kathryn Linehan (University of Virginia)
- Dr Aaron Schroeder (University of Virginia) – Presenting Author
The Social Impact Data Commons is an open knowledge repository that co-locates data from a variety of sources, builds and curates data insights, and provides tools designed to track issues over time and geography. The Social Impact Data Commons allows governments and key stakeholders to learn continuously from their own data. Specifically, we include county and sub-county level data around issues of access and equity that are of interest to stakeholders, policy makers, and community members, allowing the Social Impact Data Commons to be a critical tool in policy and funding decisions. The Social Impact Data Commons is open source, publicly hosted on GitHub, and includes a project summary website and data dashboard for the National Capital Region (i.e., the area around Washington, D.C.). In this talk we will provide an overview of the Social Impact Data Commons and examples of turning data into actionable insights such as using data triangulation to understand equity of access to broadband in Arlington County, Virginia. We will also present applications of methods to assist in turning data into insights such as floating catchment areas for identifying spatial accessibility to services such as hospitals or supermarkets, and demographics redistribution to community defined geographies such as neighborhoods or civic associations.
Data for Health Intelligence
- Dr Ellie Graeden (Georgetown University) – Presenting Author
The study of global health security sits at the intersection of science, policy, and tactical operations. This work is informed and driven by data – quantitative and qualitative, numeric and textual, structured and unstructured. Building these data systems requires the combination of data engineering, ontology and taxonomy development, and visual communications. Here, we present a series of case studies describing data systems designed and built to predict spillover of zoonotic disease, use policy-as-data to define the governance environment in which outbreaks unfold, and guide vaccine deployments for vulnerable populations. These systems have been used to prioritize investments in global health based on spatial and population risk, define policy gaps, and report on the operational status of hospitals across the US during the COVID-19 outbreak. Together, these case studies provide key examples for how data from a wide array of domains can be structured, integrated, and used to implement global health. By addressing key issues in data privacy and security policy, we can make sure the data that are needed are available and useful to those who need them, when they need them.
Re-thinking data policy: An engineering-informed approach to global data regulation (Demo)
- Dr Ellie Graeden (Georgetown University) – Presenting Author
CONCURRENT SESSIONS C
Session 1: Heads AND Tails! Evaluating and leveraging survey and alternate data sources for improving estimation
Measuring Vulnerability and Resilience for Small Areas
- Ms Heather King (United States Census Bureau)
- Ms Jennifer Childs (United States Census Bureau) – Presenting Author
In March of 2020, the COVID-19 pandemic struck the world and highlighted a need for high quality data on the vulnerability of our communities. A metric was needed that could indicate, in simple terms, which communities were most at risk from the impacts of the pandemic. To answer this call, the United States Census Bureau released the Community Resilience Estimates (CRE). Combining survey data with auxiliary data, the CRE measures the capacity of individuals and households to cope with the external stresses of disaster. The CRE uses respondent data and small area modeling techniques to provide a more accurate and timely measure of vulnerability with complete geographic coverage. We release estimates on the number and percentage of people who can be considered low, medium, or high risk. This ability to provide an estimate of the number of people who may be socially vulnerable is a first for these types of indices, which typically create and compare county-level scores. For emergency managers, government officials, and others involved in what are often rapidly changing emergency events, timely access to detailed information about the affected population and workforce is critical for many planning, response, and recovery activities. Information is important for determining the number and location of people living and working within affected areas. Identifying the impact to various demographic groups and sectors of the economy is also a priority. Answering such questions is a challenge due to the lack of a single national source for social and economic data for local areas affected by a natural hazard or disaster event. Underlying the CRE is the understanding that heterogeneous outcomes following the pandemic could depend critically on individual and household characteristics. The CRE's granular data captures the differential abilities of these communities.
The CRE provides a critical tool for disaster preparedness, enabling community planners, public health officials and policymakers to effectively deploy resources in response to a disaster and to plan for future disasters. The CRE can measure social vulnerability in a more precise way because it starts by analyzing information about individual people and households using restricted microdata collected by the American Community Survey. By doing this, the Census Bureau can determine individual respondents' potential risk from a disaster. That information is then tabulated, modeled, and provided to data users. Other indices instead start their analyses by using publicly available information about counties and communities. While this information is valuable, these data can mask how at-risk an area is in a disaster if risks are not spread proportionally.
Are media exposure measures created with digital trace data any good? An approach to assess and predict the true-score reliability of web tracking data.
- Mr Oriol J. Bosch (University of Oxford / The London School of Economics) – Presenting Author
- Dr Melanie Revilla (IBEI)
- Professor Patrick Sturgis (The London School of Economics)
- Professor Jouni Kuha (The London School of Economics)
According to the very large literature on media effects, how much and what kind of news people read online matters. In recent years there has been a shift from survey self-reports to digital trace data to measure online media exposure, based on the assumption that the latter will yield higher-quality statistics. Recent evidence, nonetheless, suggests that digital trace data is also affected by errors which could harm its validity and reliability. On top of that, research has also suggested that operationalising concepts such as media exposure into web tracking measures is not straightforward, with many design decisions to be made (e.g., which devices to track, for how many days, what traces to combine, and how to transform them). Therefore, understanding how these different design choices might influence the quality of the measurements used is also crucial. Considering this, we investigate the reliability of news media exposure measures created using digital trace data, and the role that several design choices play in this. To do so, we use data from the TRI-POL dataset, a three-wave online survey conducted in opt-in panels in 2021/22 in Spain, Portugal, and Italy, matched at the individual level with web tracking data. Cross quotas for age, gender, educational level, and region were used in each country to guarantee that the sample is similar in these variables to the general Internet population. Using this data, we first created a comprehensive list of the different measurements that could be created to measure media exposure. We identified 28 different design choices (e.g., whether to measure seconds or visits, track smartphones or not, for how long?), which resulted in 8,070 potential variables to measure this simple concept, all of which we computed. Leveraging the longitudinal nature of our data, we then used the Quasi-Markov Simplex Model to measure the true-score reliability of the 8,070 variables created. 
Finally, we estimate the influence of the different design choices on the true-score reliability of the measures through the application of random forests of regression trees. Overall, results show that the average reliability of the 8,070 variables computed to measure media exposure is 0.68, i.e., 32% of the variance of the measures is due to errors. Although these results are slightly below what has been found for high-quality survey measures, some of the thousands of measures created achieve higher reliability levels. Results from the random forests of regression trees suggest that, in order to achieve the highest reliability possible, it is key to aim for as many days of tracking as possible. Specifically, using a month of tracking data (the maximum we got), and holding all else equal, measures could reach on average a predicted true-score reliability of around 0.80. Our results can help researchers understand the quality of web tracking measures, and how to better design them. We also exemplify how psychometrics and computational methods can be combined for methodological purposes.
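To make the notion of true-score reliability from three-wave data concrete, the sketch below uses the classic closed-form solution of the three-wave quasi-simplex model under the assumption of equal reliability across waves (Heise's formula), computed from the three between-wave correlations. This is a simplified stand-in for illustration only; the authors' actual estimation of the Quasi-Markov Simplex Model may differ, and the correlations used here are invented.

```python
def heise_reliability(r12, r23, r13):
    """Reliability under a three-wave quasi-simplex model.

    Assumes equal reliability at all waves, so that
    r12 = rel * s12, r23 = rel * s23, r13 = rel * s12 * s23.
    """
    if r13 == 0:
        raise ValueError("r13 must be non-zero")
    return r12 * r23 / r13


def heise_stabilities(r12, r23, r13):
    """True-score stability between waves 1-2 and waves 2-3."""
    return r13 / r23, r13 / r12


# Invented correlations between the same measure at waves 1, 2 and 3.
r12, r23, r13 = 0.72, 0.68, 0.612
reliability = heise_reliability(r12, r23, r13)
error_share = 1 - reliability  # share of observed variance due to error
print(round(reliability, 3), round(error_share, 3))
print(heise_stabilities(r12, r23, r13))
```

The same logic underlies the reading of the reported average: a reliability of 0.68 corresponds to an error share of 1 - 0.68 = 0.32.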
Using machine learning to downscale projected land conversion: Application to bioenergy expansion
- Dr Robert Beach (RTI International)
- Mr Graedon Martin (RTI International)
- Dr Stanley Lee (RTI International)
- Dr Jonathan Holt (RTI International) – Presenting Author
Decisionmakers need to better understand the effects of programs and policies impacting land conversion and associated environmental, economic, and social impacts to improve assessment of the costs and benefits of alternative policies. It is not only the quantity of land conversion taking place, but which land is being converted that determines outcomes for endangered species habitat, biodiversity, water quality, carbon sequestration, and other ecosystem services. However, there is a current lack of tools that can facilitate evaluation of impacts at a sufficiently spatially disaggregated level to adequately evaluate them. In this study, we develop a flexible framework enabling evaluation of the relative likelihood of land conversion at high spatial resolution using machine learning. The modeling framework is written in Python and R and places an emphasis on flexibility and modularity, allowing the utilization of alternative machine learning algorithms, land cover transition data, geographic region, grid size, and economic model input. This framework can be used to downscale the results of large-scale economic models of land use, which tend to operate at aggregate levels (e.g., state, province, region, nation). To generate 300m x 300m pixel-level estimates of the relative probability of conversion, we compile time series data for 1985-2021 from the US Geological Survey's Land Change Monitoring, Assessment, and Projection framework along with a number of explanatory variables expected to explain changes in land cover over time (e.g., population density, income, soil productivity, precipitation, temperature, etc.). We apply a machine learning model to predict relative likelihood of conversion at the pixel level based on our explanatory variables. In our specific case study, we utilize projections of changes in land cover and land use from the Forest and Agricultural Sector Optimization Model with Greenhouse Gases (FASOMGHG) under an expanded bioenergy scenario.
We then downscale projections of land cover change for U.S. state and sub-state regions from FASOMGHG based on the relative likelihood of conversion at the pixel level. We convert pixels in order of relative likelihood of conversion until matching the total regional land conversion projected for a given period by FASOMGHG. We then overlay spatial data on critical habitats, biodiversity, carbon sequestration, and other measures of interest to assess impacts of land conversion to cropland. This enables us to estimate the area of endangered species' critical habitats that would potentially be impacted by projected land conversion, for instance. We compare with values based on regional averages to assess the implications of utilizing a more spatially disaggregated characterization of land cover and land use change for assessment of environmental impacts. The information generated can inform design of data collection efforts as well as policies aimed at reducing the negative environmental impacts associated with cropland expansion.
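The allocation step described above (converting pixels in order of relative likelihood until the regional total is matched) can be sketched as a simple greedy ranking. This is an illustrative toy, not the authors' framework: the function name, the data layout, and the example numbers are assumptions, and a 300m x 300m pixel is taken as 9 hectares.

```python
def downscale_conversion(pixels, regional_target_ha, pixel_area_ha=9.0):
    """Greedy downscaling of a regional land-conversion total to pixels.

    pixels: list of (pixel_id, relative_likelihood) pairs for one region.
    Pixels are converted from most to least likely until the converted
    area reaches the regional target projected by the economic model.
    """
    ranked = sorted(pixels, key=lambda p: p[1], reverse=True)
    converted, area = [], 0.0
    for pixel_id, _likelihood in ranked:
        if area >= regional_target_ha:
            break
        converted.append(pixel_id)
        area += pixel_area_ha
    return converted


# Hypothetical region with four pixels and a projected 18 ha of conversion.
region_pixels = [("a", 0.1), ("b", 0.9), ("c", 0.5), ("d", 0.7)]
converted_pixels = downscale_conversion(region_pixels, regional_target_ha=18.0)
print(converted_pixels)
```

The resulting set of converted pixel identifiers is what would then be overlaid with habitat or carbon layers to summarize impacts.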
Analyzing Economic Views via Twitter versus Survey Data
- Professor Suzanne Linn (Penn State University)
- Dr Patrick Wu (New York University)
- Professor Joshua A. Tucker (New York University)
- Professor Jonathan Nagler (New York University) – Presenting Author
Each day, millions of Americans express public opinions via social media. Social media offers a way to observe opinion at fine-grained time intervals without resorting to expensive surveys. But if we are to measure sentiment from social media, we need to know how opinion expressed in social media compares to survey-based measures. Unlike opinions expressed in surveys, individuals posting on social media choose what to talk about and how. And unlike representative surveys, the opinions expressed on social media come from those motivated to share their views, rather than a random sample. Further, while we usually assume survey responses are truthful, social media posts are performative. Thus, there are ample reasons to expect the two measures to differ. Yet at base, we expect both to respond to available objective information. We measure public sentiment about the US economy using a representative sample of (US) Twitter users, Twitter users matched to the L2 voter file, and a random sample of tweets about the US economy. We weight the L2 sample to be representative of the US adult population. Using a supervised machine learning classifier, we measure the proportion of positive tweets each day from January 1, 2018, to December 31, 2022, in each sample. We compare these Twitter-based measures of sentiment to a daily survey-based measure of consumer sentiment over the same period. We test how highly correlated the measures are and how each responds to a number of objective indicators of economic performance. If social media and survey sentiment track each other closely, it opens up the possibility of using Twitter-based measures to study how economic sentiment responds to temporally fine-grained events, as well as how sentiment varies across different groups. If social media and survey sentiment differ, further analysis may help us to better understand what motivates social media users to post. 
To the extent that opinions on social media drive mass or elite behavior, understanding both if and how it responds to objective facts differently from representative measures of opinion is important for assessing the health of our democracy.
Estimation of purchases with a hybrid survey design
- Mr Matthew Shimoda (Bank of Canada) – Presenting Author
The Bank of Canada's Methods-of-Payment (MOP) survey is used as a template to analyze the trade-offs between recall surveys and diaries, and to present a hybrid approach which utilizes both components to estimate statistics on the purchases made by consumers. The MOP survey begins with the completion of a recall-based survey questionnaire (SQ), in which respondents are asked to recall their purchases made in the previous week. The second component of the MOP is a diary survey instrument (DSI), a three-day diary in which respondents record their daily purchases at the end of each day. Only 39 percent of SQ respondents go on to complete the DSI. Both the SQ and DSI can be used to produce estimates (e.g., mean number of purchases per day, payment volume shares, etc.), but we demonstrate that estimates derived from SQ data contain recall bias, while estimates from the DSI contain nonresponse bias. Using both the SQ and DSI data, we develop a hybrid approach using statistical (zero-inflated Poisson regression) and machine learning mixture models (K-nearest neighbour, random forest) to estimate the number of purchases per person during the DSI period. Using recall-based purchases data allows us to infer an individual-specific rate for each SQ respondent. By fitting a model with the number of diary-reported purchases per person in the sample of DSI respondents, we can estimate the number of purchases for respondents who only completed the SQ and not the DSI. Previous work on this topic has been completed by Hitcenko (2021). Our approach differs by estimating both the extensive margin (did the consumer make a purchase) and intensive margin (how many purchases did they make) of respondents' purchasing behaviour. We compare our hybrid approach to a more standard calibration exercise with the DSI and also assess the out-of-sample accuracy of the model.
We find that the hybrid approach, with the use of either statistical or machine learning models, produces estimates similar to standard calibration.
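As an illustration of the extensive and intensive margins mentioned in this abstract, the following is a minimal sketch of a zero-inflated Poisson count distribution. It is not the authors' model: the parameter names `pi_zero` (probability of a structural zero) and `lam` (Poisson rate) and the example values are assumptions chosen for illustration.

```python
import math


def zip_pmf(k, pi_zero, lam):
    """P(Y = k) under a zero-inflated Poisson distribution.

    pi_zero: probability of a structural zero (hypothetical parameter name).
    lam: rate of the Poisson component (hypothetical parameter name).
    """
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        # A zero can come from the structural-zero part or from the Poisson part.
        return pi_zero + (1 - pi_zero) * poisson
    return (1 - pi_zero) * poisson


def prob_any_purchase(pi_zero, lam):
    """Extensive margin: probability of at least one purchase."""
    return 1 - zip_pmf(0, pi_zero, lam)


def expected_purchases(pi_zero, lam):
    """Unconditional expected number of purchases."""
    return (1 - pi_zero) * lam


def expected_purchases_given_any(pi_zero, lam):
    """Intensive margin: expected count among those with at least one purchase."""
    return expected_purchases(pi_zero, lam) / prob_any_purchase(pi_zero, lam)


# Illustrative values only, not estimates from the MOP survey.
pi_zero, lam = 0.3, 2.0
print(prob_any_purchase(pi_zero, lam))
print(expected_purchases(pi_zero, lam))
print(expected_purchases_given_any(pi_zero, lam))
```

In a regression setting both `pi_zero` and `lam` would depend on respondent covariates, such as the recall-based purchase count from the SQ; the sketch keeps them fixed to show how the two margins combine into the expected count.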
CONCURRENT SESSIONS C
Session 2: Going Public About What’s Private! New Observations, Methods and Advances in Data Privacy
Synthetic population generation for nested data using differentially private posteriors
- Dr Hang Kim (University of Cincinnati)
- Dr Terrance Savitsky (U.S. Bureau of Labor Statistics)
- Dr Matthew Williams (RTI International) – Presenting Author
- Dr Jingchen Hu (Vassar College)
When working with restricted use data with a nested structure (e.g. students within schools, or patients within hospitals), both the individual and the group participation may be considered sensitive. Direct application of differential privacy methods by setting a privacy budget for individuals quickly compounds when considering the joint case of the aggregated group. The privacy weighted pseudo-posterior is a form of the popular exponential mechanism and can be used to generate differentially private parameter estimates as well as synthetic populations for non-nested data. We explore the challenges and trade-offs of extending the privacy weighted pseudo-posterior approach to a two-level hierarchical model such as an analysis of variance (ANOVA) model. Challenges include (i) establishing a group-level neighborhood definition for databases analogous to the usual Hamming-1 definition for neighboring databases that differ by one record and (ii) interrogating the privacy risk of latent variables (e.g. random group means). Trade-offs include the usual competing privacy and utility goals as well as the competing priorities between individual and group level objectives.
CONCURRENT SESSIONS C
Session 3: My Daddy is a Ballerina: Issues in Coding Occupations and Gender Identity in Surveys
Coding occupations: are occupational codes in administrative data consistent with survey self-reported occupations?
- Ms Ana Santiago-Vela (Federal Institute for Vocational Education and Training) – Presenting Author
The reliable measurement of occupations is central for sociological labour market research on inequality and working conditions as well as for indicators of individuals’ status in society. However, measuring and coding occupations is challenging due to occupational complexity (i.e., many occupational titles). More importantly, occupational codes may be inconsistent across data sources, as survey and administrative (i.e., process-generated) data may lead to different measurement errors. Occupational information is typically gathered in survey reports using open-ended questions, which are subsequently coded using previously standardised occupational classifications and applying semi-automated coding via rule-based coding schemes. Occupational coding faces different difficulties, given potential ambiguity in occupational responses that do not fit well into a unique occupational category or given potential lack of intercoder-reliability. In order to reduce costs of gathering occupational information without compromising quality, researchers have developed sophisticated methods of machine learning to fully automatise occupational coding. However, research has remained silent about potential inconsistencies between occupational codes in linked data sets (e.g., survey and administrative data). In this paper, we study how data linkage can help us understand and validate different occupational information and create more comprehensive datasets containing occupational codes. We argue that it is critical for researchers to carefully consider how administrative data has been produced, especially when focusing on occupations. Whereas administrative data is said to measure income more accurately than survey data (which is prone to survey measurement error due to social desirability for example), there is no evidence that administrative data may offer more accurate and valid measures of occupations than surveys that already focus on measuring occupational tasks.
As process-generated data has been increasingly examined in social science research, we argue that linking process-generated data and survey data contributes to improving the measurement quality of occupations. By using a unique linked dataset, we explore differences in reporting occupational information between employers (administrative data) and employees (survey data). We used the BIBB/BAuA-Employment Survey 2018 of employed persons in Germany, which focuses on the measurement of occupations and was linked to process-generated data containing social security records of the German Federal Employment Agency (ADIAB). The mechanism of gathering occupations differs between data sources: whereas employers are required to report detailed occupational information following official occupational schemes, occupational information in the survey is based on employees’ open-ended answers, which were subsequently categorised using semi-automated coding and the knowledge of professional coders. We investigate (1) whether occupational codes in survey data are consistent with those in process-generated data, (2) which socio-demographic, economic or regional characteristics may explain inconsistencies in occupational codes, and (3) what consequences inconsistencies in occupational codes have on the study of occupations and related indices. Our analyses may serve to improve occupation coding in general and interview coding in particular, to develop statistical tests to assess the viability of automatic occupational coding, and also to inform users of potential biases in occupational categories.
Trends in gender inclusion: a cross-national examination of non-binary gender options in surveys from 2012-2022
- Ms Zoe Padgett (Momentive.ai) – Presenting Author
- Ms Laura Wronski (Momentive.ai)
- Mr Sam Gutierrez (Momentive.ai)
How gender identity has been asked within surveys has undergone a drastic change within the past decade, evolving from a binary gender scale (male/female) to include non-binary gender options (non-binary, transgender, non-conforming, other, etc.). Analyzing millions of user-generated surveys on the SurveyMonkey platform, we explore the rise in prevalence of non-binary gender options across different countries and regions, as well as the most common terms used. In the US, for example, we find that the percentage of gender questions with binary answer options has fallen from 83% in 2012 to 36% in 2022. Because SurveyMonkey is available all around the world, we can take a country-by-country look at the patterns in how people ask about gender. Do certain countries or geographic regions show different levels of non-binary gender inclusion within their surveys, and which terms are most common? Our findings will not only provide insight into the changing nature of gender identity within surveys, but also reflect shifts in societal expectations and norms surrounding diversity, equity, and inclusion. We also look at top answer options among non-binary gender answer options, from “other” and “transgender” to “non-binary” and other possible terms used.
Parent Proxy Reporting on Multidimensional Measures of Adolescent Gender Using a Nationally Representative Sample from the U.S.
- Mr Christopher Hansen (Loyola University Chicago) – Presenting Author
Parents are frequently relied upon in survey research as proxy reporters for their children. Yet, research is mixed about the accuracy and reliability of parents’ report for certain constructs. In sexual orientation and gender identity (SOGI) measurement, gender is theorized as a multidimensional construct that is fundamental to individual identity, particularly during adolescence, a period of social and physical change. Concerns about data quality with proxy reporting may be especially relevant when measuring gender minority identities (e.g., transgender and non-binary identities). For example, parent-adolescent reports of gender identity may be discordant in circumstances in which the child is not “out” to their parent as a gender minority, or the child is “out” but the parent is not supportive. This paper shares results of a recent national survey of adolescents and parents that assessed concordance/discordance across multidimensional measures of gender, including adolescent gender identity and expression (i.e., degree of masculinity and femininity). The University of Vermont contracted NORC to conduct the ABCD Gender Identity Pilot Study of U.S. adolescents and parents. A total of 279 adolescents age 13-17 and their parents were surveyed in April 2021 using AmeriSpeak®, NORC’s probability-based survey panel. The study included a self-administered parent survey and a separate, self-administered adolescent survey which were paired to achieve dyadic interviews. The adolescent sample included 49.9% cisgender boys, 44.4% cisgender girls, and 5.5% transgender and non-binary adolescents. Parents were more likely than adolescents to report normative gender identity and expression (i.e., being cisgender, exclusively masculine or exclusively feminine). Across dimensions of gender, parent-adolescent concordance was highest for cisgender boys followed by cisgender girls and transgender and non-binary adolescents.
For example, concordance for cisgender boys was 98.9% for gender identity, 57.8% for degree of masculinity, and 70.3% for degree of femininity. Concordance for cisgender girls was 95.9% for gender identity, 49.9% for degree of masculinity, and 46.3% for degree of femininity. Concordance for transgender and non-binary adolescents was 13.9% for gender identity, 32.5% for degree of masculinity, and 45.9% for degree of femininity. Findings indicate that parents are generally reliable reporters of gender identity among cisgender adolescents. Surveys that rely on parents’ report of adolescent gender should be skeptical of estimates derived for gender minorities. Improvements in SOGI measurement, including parent proxy reporting, can improve our collective understandings of gender minority populations specifically as well as advance our understanding of how gender operates as a social category more broadly.
CONCURRENT SESSIONS C
Session 2: Zooming in on the Reality of Working with Multiple Data Sources in Practice: A Virtual Session
Online Media’s Agenda-Setting Effect: A Method Triangulation Approach with Survey and Text Mining
- Mr Hao Hsuan Wang (Academia Sinica)
- Mr Shih-Peng Wen (Academia Sinica)
- Mr You-Jian Wu (Academia Sinica) – Presenting Author
- Professor Ching-ching Chang (Academia Sinica)
- Dr Justin Chun-ting Ho (Academia Sinica)
- Dr Yu-ming Hsieh (Academia Sinica)
Social media have become important platforms for people to receive news and express their opinions, which could influence others’ views on the importance of various social issues. Agenda-setting theory suggests that audiences perceive the issues which are most frequently discussed as the most important. This implies that the salience of an issue can transfer from social media users’ agenda to the public agenda. To investigate the agenda-setting effect, this research explores the relationship between the salience of issues in social media and their importance as perceived by the public. We also examine the role of active users on different social media platforms. Against this background, this work addresses the following three research questions: first, whether the importance of issues perceived by social media users correlates with the overall salience on the same platform; second, if the most active users exhibit a similar pattern compared to other users in general; and third, if the importance of issues perceived by the most active commenters is in line with the overall salience on the same platform. This work employs a mix of survey and text-mining approaches to address the above research questions. To measure the perceived importance of the issues to the general public, we adopted the Comparative Agendas Project (CAP) framework as the main agenda structure. In a survey, we asked respondents from a representative sample (N = 5,102) of internet users in Taiwan to rank the importance of the issues in the framework. To aid in the measurement of issue salience in the next step, we also asked the respondents to write down relevant words for each issue. We also measured the respondents’ social media usage and their degree of participation in comments on social media. To measure the issues’ salience in social media, we harvested text data from Dcard, PTT, and LineToday, Taiwan’s three most popular news discussion forums.
Next, we use a combination of custom dictionaries and topic modeling to analyze the salience of different issues in social media. First, following established methods in dictionary construction, we build word vector models from each platform and use the words from respondents to filter out the top ten words correlated with each CAP issue. Next, we recruit annotators to verify the relevance of the words. In addition, the topic model method transforms texts into vectors using the BERT embedding model, and the vectors are then clustered using the K-means clustering algorithm. We hire annotators to certify every cluster’s agenda with the most representative words identified using term frequency-inverse document frequency (TF-IDF). The words are then combined into an issue dictionary. Finally, we use the dictionary to measure the salience of each issue in the CAP framework. We then address the three research questions by testing the association between the issue salience on social media platforms and the importance of the issue as perceived by all survey respondents, the most active users, and the most active commenters.
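The step of finding each cluster's most representative words with TF-IDF can be sketched as follows. This is an illustrative sketch only, not the authors' pipeline: it assumes the texts have already been tokenized and assigned to clusters, pools each cluster into one document, and uses the plain weighting tf × log(N / df). The function name `top_tfidf_terms` and the toy tokens are hypothetical.

```python
import math
from collections import Counter


def top_tfidf_terms(clusters, k=2):
    """Return the k highest-scoring TF-IDF terms for each cluster.

    clusters: dict mapping a cluster id to a list of tokenized texts.
    Each cluster is pooled into a single document before scoring.
    """
    pooled = {
        cid: Counter(token for text in texts for token in text)
        for cid, texts in clusters.items()
    }
    n_clusters = len(pooled)

    # Document frequency: in how many clusters does each term occur?
    doc_freq = Counter()
    for counts in pooled.values():
        doc_freq.update(counts.keys())

    top_terms = {}
    for cid, counts in pooled.items():
        total = sum(counts.values())
        scores = {
            term: (count / total) * math.log(n_clusters / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score; break ties alphabetically for stable output.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        top_terms[cid] = ranked[:k]
    return top_terms


# Toy clusters with made-up tokens, for illustration only.
toy_clusters = {
    0: [["tax", "budget", "tax"], ["budget", "news"]],
    1: [["hospital", "doctor", "news"], ["hospital", "nurse"]],
}
print(top_tfidf_terms(toy_clusters))
```

A term such as "news" that appears in every cluster receives a score of zero, which is why TF-IDF favours words that distinguish one cluster from the others. The resulting word lists would still need the human verification the abstract describes.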
Negativity Bias Effect in Issue Salience Transfer
- Mr Hao Hsuan Wang (Academia Sinica) – Presenting Author
- Mr You-Jian Wu (Academia Sinica)
- Mr Shih-Peng Wen (Academia Sinica)
- Professor Ching-ching Chang (Academia Sinica)
- Dr Justin Chun-ting Ho (Academia Sinica)
- Dr Yu-ming Hsieh (Academia Sinica)
Social media has become an important platform for politicians to interact with the public. Previous work on the flow of influence between political elites and citizens suggests that politicians can shape public opinion through a process known as priming. However, existing work focuses mainly on the salience of issues, while the sentimental aspect needs to be studied more. The negativity bias theory posits that negative contents have a more significant influence on people. Against this background, this work investigates the influence of politicians’ social presence on public opinion, with special consideration of the sentimental aspect. In this research, we address three research questions. First, what demographic factors influence the alignment between politicians’ and their supporters’ perceptions of the importance of various issues? Second, do political interests and party identification strength influence this alignment? Third, does negative content exert a more significant influence on public opinion about different issues? This work combines survey data and text data from social media. We adopted the Comparative Agendas Project (CAP) framework in the survey and asked 5,102 respondents (a representative sample of internet users in Taiwan) to rate each policy area’s importance. For the social media data, we harvested all posts published between January 2022 and December 2022 on politicians’ fan pages on Facebook, Taiwan’s most popular social media platform, with a penetration rate of 91%. To measure the salience of the issues, we built a custom issues dictionary in accordance with the CAP framework. To measure the sentiment of texts, we use the embedding model Ada from OpenAI to classify articles into positive or negative sentiments. Specifically, we use the zero-shot learning method to put the articles into the model and get the predicted positive or negative label from the Ada model without any model training.
Also, the model can predict the label using the cosine similarity contrast among the word embeddings present in the text. A sentiment score was calculated for each issue by comparing the frequency of positive and negative articles. We also employ the LIWC dictionary to get every issue’s sentiment scores from the articles to compare the accuracy to the innovative method. To estimate the negativity bias effect of politicians’ priming efforts on social media, we explore whether the issues mentioned in politicians’ negative posts align better with the survey respondents’ issue ranking when compared to positive and neutral content, accounting for the respondents’ gender, education level, political stance, and geographical area.
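The zero-shot labelling and sentiment scoring described in this abstract can be sketched as follows. This is a minimal sketch under stated assumptions: the two-dimensional vectors stand in for real text embeddings, no embedding service is called, and the score (positive minus negative, divided by their total) is one plausible way to compare frequencies, not necessarily the authors' formula. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_label(text_vec, label_vecs):
    """Pick the label whose embedding is most similar to the text embedding."""
    return max(label_vecs, key=lambda name: cosine(text_vec, label_vecs[name]))


def sentiment_score(labels):
    """Assumed score in [-1, 1]: (positive - negative) / (positive + negative)."""
    pos = labels.count("positive")
    neg = labels.count("negative")
    if pos + neg == 0:
        return 0.0
    return (pos - neg) / (pos + neg)


# Toy vectors standing in for embeddings of label descriptions and posts.
label_vecs = {"positive": [1.0, 0.0], "negative": [0.0, 1.0]}
post_vecs = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]

labels = [zero_shot_label(vec, label_vecs) for vec in post_vecs]
print(labels)
print(sentiment_score(labels))
```

With real embeddings the label vectors would come from embedding short label descriptions with the same model used for the posts, so that the similarity comparison is made in a shared vector space.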
Impact of environmental features and spatial economic diversity on social inclusion
- Dr Chan-Hoong Leong (Kantar Public) – Presenting Author
Contemporary studies in multiculturalism focus primarily on the influence of individual-level indicators on measurements of diversity and inclusion. The effects of social or built environmental features, such as geographic income level, population density, spatial clustering of ethnocultural minorities, and proximity to amenities, are commonly excluded from the empirical analyses. This approach has vastly constrained our insights into the rich and complex human-environment interaction effects. As such, the current paper proposes a structured and multi-layered approach on how the neighbourhood economic and demographic contours can be harnessed, quantified, and integrated across levels. The information obtained offers a fertile database to analyse the profound impact of residential environment on human motivations, emotions, and behaviours towards diversity, and in particular, which types of cross-level interaction processes and outcomes are crucial in policy making. This includes, but is not limited to, the types and effective range of built features that can help promote positive intercultural contact and change, and understanding how attitudes to public social assistance and wealth redistribution may vary as a function of residential diversity and segregation. Importantly, this paper aims to identify the optimal range of residential dissimilarity before the sense of ethnocultural security is eroded. Building on what was previously presented at BigSurv 2020, the current research uses Singapore, a compact city-state, as a case study, combining survey data, geographic information in urban planning, and residential housing databases.
Identifying factors associated with US adolescents’ behaviors and experiences during COVID-19 using LASSO regression on complex survey data
- Dr Wenna Xi (Weill Cornell Medicine) – Presenting Author
- Ms Peizi Wu (Weill Cornell Medicine)
Objective: To identify the factors associated with US adolescents’ behaviors and experiences during COVID-19. Method: We analyzed the 2021 Adolescent Behaviors and Experiences Survey (ABES) data. The ABES was a self-reported online survey conducted by the Centers for Disease Control and Prevention (CDC) with the goal of collecting a nationally representative sample of high school students during COVID-19. Eleven outcome variables regarding adolescents’ behaviors and experiences during COVID-19 were considered: poor mental health, parent/adult lost jobs, felt hungry due to lack of food, schoolwork was more difficult, verbally abused by parents, physically abused by parents, drank more alcohol, used more drugs, received medical care, received mental health care, and never or rarely able to spend time with family or friends. Independent variables included demographics, emotional well-being, tobacco use, alcohol and other drug use, violence-related behaviors, sexual behaviors, dietary behaviors, and physical activity. LASSO regression incorporating the complex survey design was used to identify factors associated with each outcome. Results: Among the 7705 (design-adjusted count) adolescents, 37.11% reported poor mental health, 28.5% had parents who lost jobs, 23.85% felt hungry, 66.6% felt schoolwork was more difficult, 55.07% were verbally abused by parents, 11.26% were physically abused by parents, 14.66% drank more alcohol, 12.15% used more drugs, 25.84% received medical care, 8.48% received mental health care, and 28.21% were never or rarely able to spend time with family or friends. Feeling sad or hopeless was positively associated with 10 of the outcomes (all but receiving medical care).
Having serious difficulty concentrating, remembering, or making decisions was positively associated with poor mental health, feeling hungry, feeling schoolwork was more difficult, being verbally abused, being physically abused, receiving medical care, and receiving mental health care (ORs=1.96, 1.42, 1.26, 1.82, 1.25, 1.28, and 2.21, respectively). Ever taking prescription pain medicine without a doctor’s prescription or differently than instructions was positively associated with poor mental health, parents losing jobs, feeling schoolwork was more difficult, being verbally abused, being physically abused, drinking more alcohol, and receiving medical care (ORs=1.30, 1.18, 1.07, 1.92, 1.09, 1.18, and 1.09, respectively). In addition, those who were homosexual or bisexual had higher odds of poor mental health, feeling hungry, being verbally abused, using more drugs, and receiving mental health care (ORs=1.50, 1.07, 1.24, 1.00, and 2.10, respectively). Males had higher odds of poor mental health, parents losing jobs, being verbally abused, and receiving medical care (ORs=1.31, 1.06, 1.22, and 1.16, respectively), and lower odds of using more drugs and not being able to spend time with family or friends (ORs=0.97 and 0.96, respectively). Conclusions: Adolescents who are male, feel sad or hopeless, have serious difficulty concentrating, misuse prescription pain medicine, or identify as sexual minorities require more support during COVID-19.
Expectations of Ecuadorian Higher Education in a Time of Uncertainty: A Comparison between the Perceptions of Students and Teachers during the COVID-19 Pandemic
- Dr Anne Carr (Universidad del Azuay) – Presenting Author
In a recent action research project, we took advantage of the complexity of online teaching to notice and investigate what we had not seen before. As teachers began to exercise reflexivity, certain dialogic characteristics appeared to demonstrate epistemological and pedagogical transformations to include practice with new roles and modes of interaction. This drew our attention to the extent to which teachers and students, as digital citizens, might go beyond or step outside of backgrounds, roles, jobs, politics and beliefs forged by the power differentials of the digital economy. Technological infrastructure, that is, digital platforms that are a combination of resources at our disposal enabling us to share ideas, are also online businesses that facilitate commercial interactions between suppliers and consumers, yet are often presented as empty spaces for others to interact on when as textually mediated literacies they are actually political and can increasingly gain control and governance over the rules of the game. The collection of digital data through online education platforms has been described as raising concerns over power, control and performativity by reinforcing and intensifying the culture of managerialism within education, with the potential risk of reducing teachers, students and their interactions to measurable data sets that increasingly shape educational processes, for example, standardization and competitiveness generationally, nationally and globally. Big tech has also been described in Gramscian terms as a powerful bourgeoisie that has developed sophisticated new techniques for extracting wealth from their users, who take on the role of the proletariat.
In our mixed methods paper, we discuss findings from our investigation of the experiences and perceptions of students compared to their teachers’ perceptions of them regarding modes of distance learning and remote classes during the COVID-19 pandemic in Ecuador. Whilst digital technology has simplified the communication process and expanded potential interactive communication opportunities, participation is structurally different from interaction. Interaction remains an important condition of participation, but it cannot be equated to participation. Interaction has no political meanings because it does not entail power dynamics as does participation. From analysis of our data in the Ecuadorian context, we suggest that through Third Space theory and innovative practice, ‘organic’ students and teachers might participate to negotiate transnational counter hegemonic social change even in our digitally divided Ecuadorian context. Gramsci founded L’Ordine Nuovo that has been described as a privileged ‘third space’ for the knowledge exchange ideals of a group of ‘organic intellectuals’ with the working class – the Educational Principle – where a new and liberating conception of the world promoted a cultural revolution giving more attention to public knowledge and inviting society through discourse. Gramsci believed that the student could teach the master in what we could describe as epistemic capability. Chapter to be published 23 July, 2023 in The Post Pandemic University, Elgar Press, London.
CONCURRENT SESSIONS D
Session 1: What did the robots hear the humans say? Advances in Coding Survey Open-Ends Using ML Methods
Considerations for data quality in open-ended embedded probes across population and methodological subgroups
- Mr Zachary Smith (National Center for Health Statistics, Centers for Disease Control and Prevention) – Presenting Author
- Dr Kristen Cibelli Hibben (National Center for Health Statistics, Centers for Disease Control and Prevention)
- Dr Paul Scanlon (National Center for Health Statistics, Centers for Disease Control and Prevention)
- Dr Valerie Ryan (National Center for Health Statistics, Centers for Disease Control and Prevention)
- Mr Benjamin Rogers (National Center for Health Statistics, Centers for Disease Control and Prevention)
- Dr Travis Hoppe (National Center for Health Statistics, Centers for Disease Control and Prevention)
Open-ended questions are a valuable tool in survey research, but their utility has been limited by the time and cost associated with processing and coding. Advances in data science offer new approaches to reducing this burden. However, these questions, particularly in the context of online surveys, are more susceptible to insufficient or irrelevant responses that can be burdensome and time-consuming to identify and remove manually. This can result in the potential inclusion of poor-quality data or, often, the underuse of open-ended questions. To address these challenges, we developed the Semi-Automated Nonresponse Detector for Surveys (SANDS). SANDS is based on a Bidirectional Encoder Representation from Transformers (BERT) model, fine-tuned using Simple Contrastive Sentence Embedding (SimCSE). This approach uses a pretrained language model, as opposed to existing rule-based approaches, to detect item nonresponse, and can be directly applied to open-ended responses without the need for model retraining or substantial text preprocessing, unlike existing bag-of-words approaches. SANDS categorizes responses into types of nonresponse and valid responses with high sensitivity and specificity. This presentation applies SANDS and other data quality metrics, including word count and response latency, to examine a specific type of open-ended question, embedded cognitive probes, that allows researchers to collect qualitative information about question interpretation while concurrently fielding survey questions of interest. Embedded probing is particularly useful when questions on novel topics need to be deployed quickly and before more traditional evaluation methods, such as cognitive interviewing, can be used. One unresolved question is whether these responses differ in quality by population and methodological subgroups.
If data are not of similar quality, then responses may not provide equal insights across subgroups of interest, with consequences for their use in subsequent question design efforts. Data for this study are from seven rounds of the NCHS Research and Development Survey (RANDS), which uses both probability-based and opt-in panels as its sample sources. Analyses focus on differences between subgroups of interest, including race, age, education, income, sex, geographic region, survey administration mode, and panel type, and the results of the SANDS model and related metrics are used as a proxy for data quality. The 15 probes evaluated cover a range of topics, including the COVID-19 pandemic, experiences with discrimination, gender identity, and religion, among other topics. The results demonstrate differential data quality among some, but not all, subgroups, particularly panel type, survey administration mode, and education level. While the use of embedded cognitive probes in survey questionnaires has increased alongside the use of commercial panels as reliable and inexpensive sources of survey samples, caution is warranted when using this method because of the potential for low data quality and the costs of manual coding and processing. The use of new natural language processing methods makes identification of low-quality responses in open-ended probes faster and more reliable, increasing the utility of embedded cognitive probes and allowing researchers to understand potential unintended consequences of divergent data quality by subgroups.
Multi-label classification of open-ended questions with BERT
- Professor Matthias Schonlau (University of Waterloo)
- Dr Julia Weiß (GESIS)
- Mr Jan Marquardt (GESIS) – Presenting Author
Open-ended questions in surveys are valuable because they do not constrain the respondent’s answer, thereby avoiding biases. However, answers to open-ended questions are text data which are harder to analyze. Traditionally, answers were manually classified as specified in the coding manual. In the last 10 years, researchers have tried to automate coding. Most of the effort has gone into the easier problem of single label prediction, where answers are classified into a single code. However, open-ends that require multi-label classification, i.e., that are assigned multiple codes, occur frequently. In social science surveys, such open-ends are also frequently mildly multi-label. In mildly multi-label classifications, the average number of labels per answer text is relatively low (e.g. <1.5). For example, the data set we analyze asks “What do you think is the most important political problem in Germany at the moment?” Even though the question asks for a single problem, some answers contain multiple problems. Of course, the average number of problems (or labels) per answer is still low. This paper focuses on multi-label classification of text answers to open-ended survey questions in social science surveys. We evaluate the performance of the transformer-based architecture BERT for the German language in comparison to traditional multi-label algorithms (Binary Relevance, Label Powerset, ECC) in a German social science survey, the GLES Panel (N=17,584, 55 labels). Because our data set requires at least one label per text answer, we also propose a modification in case the classification methods fail to predict any labels. We evaluate the algorithms on 0/1 loss: zero loss occurs only when all labels are predicted correctly; a mistake on one label incurs the full loss (1). This loss corresponds to the reality of manual text classification: you code the whole text answer with all labels, even if only a single suspicious label requires review.
We find that classification with BERT (forcing at least one label) has the smallest 0/1 loss (13.1%) among methods considered (18.8%-21.3%). As expected, it is much easier to correctly predict answer texts that correspond to a single label (7.1% loss) than those that correspond to multiple labels (~50% loss). Because BERT predicts zero labels for only 1.5% of the answers, forcing at least one label, while successful and recommended, ultimately does not lower the 0/1 loss by much. Our work has important implications for social scientists: 1) We have shown multi-label classification with BERT works in the German language for open-ends. 2) For mildly multi-label classification tasks, the loss now appears small enough to allow for fully automatic classification. Previously, the loss was more substantial, usually requiring semi-automatic approaches. 3) Multi-label classification with BERT requires only a single model. The leading competitor, ECC, is an iterative approach that iterates through individual single label predictions.
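The 0/1 loss and the idea of forcing at least one label can be sketched as follows. This is an illustration, not the authors' code: it assumes the classifier returns one probability per label, that labels are predicted at a 0.5 threshold, and that the fallback picks the single most probable label. The label names and probabilities are made up.

```python
def predict_labels(probs, threshold=0.5, force_one=True):
    """Turn per-label probabilities into a set of predicted labels.

    If no label reaches the threshold and force_one is True, fall back to
    the single most probable label (an assumed fallback rule).
    """
    predicted = {label for label, p in probs.items() if p >= threshold}
    if not predicted and force_one:
        predicted = {max(probs, key=probs.get)}
    return predicted


def zero_one_loss(true_sets, predicted_sets):
    """Share of answers whose predicted label set is not exactly right."""
    wrong = sum(
        1 for truth, pred in zip(true_sets, predicted_sets) if set(truth) != set(pred)
    )
    return wrong / len(true_sets)


# Made-up example: three answers, three possible labels.
true_sets = [{"economy"}, {"migration", "climate"}, {"climate"}]
probabilities = [
    {"economy": 0.90, "migration": 0.10, "climate": 0.20},
    {"economy": 0.10, "migration": 0.80, "climate": 0.40},
    {"economy": 0.20, "migration": 0.10, "climate": 0.45},
]

forced = [predict_labels(p, force_one=True) for p in probabilities]
unforced = [predict_labels(p, force_one=False) for p in probabilities]
print(zero_one_loss(true_sets, forced))
print(zero_one_loss(true_sets, unforced))
```

In the toy data the second answer counts as fully wrong even though one of its two labels is correct, which is what makes 0/1 loss a strict criterion. Forcing a label rescues the third answer, where no probability reaches the threshold.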
Machine Learning Assisted Autocoding Tools for Improving the Experience of Manual Coding of Real-World Big Text Data
- Ms Emily Hadley (RTI International) – Presenting Author
- Ms Caroline Kery (RTI International)
- Mr Durk Steed (RTI International)
- Ms Anna Godwin (RTI International)
- Mr Ethan Ritchie (RTI International)
- Mrs Donna Anderson (RTI International)
Text in survey data often requires manual review and coding by humans for downstream analysis and reporting. This process can be laborious and expensive, especially with large text data sources that require multiple coders. Automated coding models for open-ended text promise time and labor savings but can be challenging to implement in practice. Varying expectations of predictive accuracy, implementation costs, and complexity of integration into existing processes are common sources of frustration. We discuss the lessons learned from developing two autocoding tools that have both been used to assist with text coding in real-world settings. We first introduce SMART, an open-source computer assisted coding and labeling application. For human coders, SMART is a user-friendly approach to coding that provides both classical text analytics methods like deduplication, as well as modern methods like machine-learning recommendations. These recommendations are generated from a text embedding similarity search augmented with context-specific terms. For project leaders, SMART provides an intuitive management interface to quickly create and manage multi-coder projects. We describe how SMART has been used for coding on multiple large national surveys and has resulted in substantial improvements in efficiency and user experience. We then introduce ROTA, an open-source tool focused on automated assignment of National Corrections Reporting Program (NCRP) categories to offense descriptions for criminal charges documented in criminal justice data. Users upload a table of offense text records to ROTA which produces a table of predicted categories and the confidence of the predictions. ROTA implements a natural language processing technique known as a transformer model and is trained on a publicly available lookup table as well as hand-labeled data. ROTA has an overall accuracy of 93% when predicting across 84 unique NCRP charge categories. 
We discuss how ROTA can be used to assign charge categories with real data. We compare the development process for both tools, including the use of user-centered design, setting realistic accuracy benchmarks, and anticipating the challenges of integrating the model into existing workflows. We describe the varying user-oriented impact of these tools, such as improving the onboarding process for new human coders, maintaining knowledge across multiple survey instruments, and supporting standardization and reliability between coders. We consider opportunities and challenges for future implementation of assisted autocoding tools.
Linguistic shifts and topic drift: Building adaptive Natural Language Processing systems to code open-ended responses from multiple survey rounds
- Dr Sarah Staveteig Ford (U.S. State Department) – Presenting Author
- Dr Jon Krosnick (Stanford University)
- Dr Matthew DeBell (Stanford University)
Natural Language Processing (NLP) technology, which combines computational linguistics, artificial intelligence, and machine learning, offers survey implementers tools to automate the labeling (‘coding’) of open-ended responses in the survey data production process. Recent studies have found that supervised machine learning with NLP is increasingly viable for survey providers to automate the coding of many types of verbatim open-ended survey responses. Models show varying degrees of accuracy depending on the nature of the text corpus, methods, and label structure used (Schonlau et al., 2019; Gweon & Schonlau, 2022; Meidinger & Aßenmacher, 2021). For example, Ford (2023) employed modern NLP transformer embeddings on the 2008 American National Election Studies (ANES) set of 72 ‘most important problem’ labels for open-ended text data and found that around one-eighth of the 2008 responses could be leveraged to auto-code the remainder at previously recorded human levels of agreement (Berent et al., 2013) on a cleaned set of labels. For survey programs like ANES that ask similar open-ended questions over time, these NLP data processing models may be useful in a given year, but it is doubtful whether their accuracy would hold in perpetuity. Over time, survey responses naturally undergo what NLP researchers refer to as ‘semantic change’, which encompasses both linguistic shifts and substantive topic drifts. For example, in the ANES most important problem data from 2008 to 2020, there are clear temporal patterns in themes raised by respondents: most notably, the emergence of substantial public health concerns due to COVID in 2020 that did not exist in any prior year.
As expected, there is also evolution in the key political figures and controversies of the day, for example the emergence of concerns about fake news in 2016 and voter fraud in 2020, as well as temporal shifts in the verbatim language coded into themes that may reflect evolving terminology (e.g., from ‘homosexuality’ to ‘LGBT issues’), or how the nature of the problems themselves has shifted (e.g., ‘cartels’ and ‘illegal drugs’ early on, to terms like ‘opioid crisis’ and ‘legalization’ in later years). No research we are aware of to date has tackled the challenge of building an adaptive natural language processing model to maintain robustness across survey rounds. In this paper, we retrospectively simulate an adaptive NLP model to code open-ended English responses to the ANES survey’s most important problem set of questions during an unprecedented year (2020) based on adaptive training from earlier 2008 to 2016 ANES answers and a small set of new responses. We test whether it is uniformly better to incorporate earlier training sets for every topic when out-of-sample themes emerge (e.g., COVID), given sufficient weight toward newer data. We also explore the value of algorithms to detect the most significant semantic differences and prioritize selection toward the most unique new data. We discuss implications for survey programs considering NLP pipelines to automate coding of open-ends, as well as implications for NLP data coding processes in survey research writ large.
CONCURRENT SESSIONS D
Session 2: We Triangulated but Got A Rhombus!? Methods for Improving Insights Based on Data Combined from Multiple Sources
Validating matches of electronically reported fishing trips to investigate matching error
- Dr Benjamin Williams (University of Denver) – Presenting Author
- Dr Shalima Zalsha (NORC)
- Dr Lynne Stokes (Southern Methodist University)
- Dr Ryan McShane (Amherst College)
- Dr John Foster (NOAA Fisheries)
Recently, researchers have developed methods for combining probability samples and large non-probability samples. In recreational fisheries management, data from probability samples are typically counts of catch from a random sample of trips intercepted by a sampler, while non-probability samples consist of catch data that are collected in self-reports made to a fishery management agency. These reports are typically transmitted electronically and are known as an electronic logbook (ELB). Even when such reporting is mandated, compliance is not universal. Since the inclusion probability for any particular angler is unknown, the ELB sample is a non-probability sample, and can be considered a big data source. We used data from a 2017 Gulf of Mexico (GoM) pilot study in which charter captains volunteered to electronically report their catch. At the dock, they could also be intercepted by a sampler, at which time their catch was observed. Estimates of total catch can be generated if trips from the two data sets can be accurately matched. Several states in the GoM implement similar ELB reporting augmented with a probability sample. However, there is an apparent discrepancy between National Oceanic and Atmospheric Administration (NOAA) estimates of the total and the ELB estimates for the same geographies. We seek to investigate the extent to which matching errors contribute to the discrepancies. We employed probabilistic record linkage to match reports with intercepts and developed a validation tool to examine the matches. Using our validation tool, we examined several methods of estimating the total catch of Red Snapper in the GoM to investigate the potential cause of the discrepancy. We found the existing differences between the NOAA estimates and estimates resulting from combining ELB reports in this application were likely not due to matching error but instead were apparently derived from other sources of non-sampling error. 
This has implications for new and existing ELB implementations, which are gaining popularity. The tool and results we provide can allow other implementations to better match reports with intercepts and offer a way to examine the extent to which matching errors affect the bias of estimates. Our work also shows that agencies should focus on non-sampling errors besides matching error to reduce bias. Our tool may also be extended to examine such non-sampling errors, such as the assumption that reporting and interception are independent.
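The abstract refers to probabilistic record linkage without naming a specific model. One common formulation is the Fellegi-Sunter approach, in which each compared field adds a log-likelihood-ratio weight depending on whether the two records agree. The sketch below assumes that formulation; the field names and the m- and u-probabilities are hypothetical and are not taken from the study.

```python
import math
from typing import Dict, Tuple

# Hypothetical comparison fields with (m, u) probabilities:
# m = P(fields agree | records are a true match)
# u = P(fields agree | records are not a match)
FIELDS: Dict[str, Tuple[float, float]] = {
    "trip_date": (0.95, 0.05),
    "landing_site": (0.90, 0.10),
    "n_anglers": (0.80, 0.20),
}


def match_weight(report: Dict[str, object], intercept: Dict[str, object]) -> float:
    """Sum of Fellegi-Sunter log2 weights over the comparison fields."""
    total = 0.0
    for field, (m, u) in FIELDS.items():
        if report[field] == intercept[field]:
            total += math.log2(m / u)              # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight
    return total


elb_report = {"trip_date": "2017-06-03", "landing_site": "A12", "n_anglers": 6}
dock_intercept_same = {"trip_date": "2017-06-03", "landing_site": "A12", "n_anglers": 6}
dock_intercept_diff = {"trip_date": "2017-06-03", "landing_site": "A12", "n_anglers": 5}

full_agreement = match_weight(elb_report, dock_intercept_same)
partial_agreement = match_weight(elb_report, dock_intercept_diff)
```

Pairs whose total weight exceeds an upper cutoff would be declared matches, and pairs in an intermediate band are natural candidates for manual review with a validation tool such as the one the authors describe.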
Open-Ended Survey Questions: A comparison of information content in text and audio response formats
- Mrs Camille Landesvatter (MZES, University of Mannheim) – Presenting Author
- Mr Paul Bauer (MZES, University of Mannheim)
Open-ended survey questions (“OEQs”) provide an important and rich source of data in addition to closed-ended questions (“CEQs”). However, despite their usefulness, they pose challenges to respondents, for example, increased response burden and associated phenomena such as a decrease in answer length and response quality. Consequently, the question arises as to how survey practitioners should design OEQs so that respective answers contain a high degree of information. Our study examines the effect of requesting respondents to answer questions via voice input compared to text input. We use a U.S. sample (N=1,500) and questions adapted from popular social survey programs. By experimentally varying the response format we examine which format produces answers with a higher amount of information. The information content is measured in several ways in our study. First, we consider the response length and thus follow previous research on the effects of response format on response quality. In the next step, however, we argue that longer answers do not necessarily indicate higher information content. In particular, we investigate to what extent the content and complexity of answers, irrespective of the length of the answer, can provide insights about the information content. In this second step, information content is measured via the number of topics derived from topic models, an assessment of document similarity using cosine similarity as well as response entropy. Preliminary results show that oral responses are on average longer, but results with regard to information content are mixed. Possibly, open-ended audio responses are not the supposed secret weapon when we want to learn more about individuals’ motives and preferences in survey attitudes.
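Two of the measures named in this abstract, response entropy and cosine similarity, can be sketched as follows. The sketch assumes whitespace tokenization and raw word counts; the abstract does not say how the authors tokenize or weight terms, and the example answers are invented.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace (a deliberately simple assumption)."""
    return text.lower().split()


def shannon_entropy(text: str) -> float:
    """Shannon entropy (in bits) of the word distribution within one answer."""
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    return abs(-sum((c / total) * math.log2(c / total) for c in counts.values()))


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the word-count vectors of two answers."""
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


long_repetitive = "taxes taxes taxes taxes taxes taxes"
short_varied = "taxes housing climate"

entropy_long = shannon_entropy(long_repetitive)
entropy_short = shannon_entropy(short_varied)
overlap = cosine_similarity(short_varied, "taxes housing climate")
no_overlap = cosine_similarity("taxes housing", "climate health")
```

The six-word repetitive answer has zero entropy while the three-word varied answer has higher entropy, which is the abstract's point that length alone does not indicate information content.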
A Novel Methodology for Improving Applications of Modern Predictive Modeling Tools to Linked Data Sets Subject to Mismatch Error
- Dr Brady West (Institute for Social Research, University of Michigan-Ann Arbor)
- Dr Martin Slawski (Department of Statistics, George Mason University)
- Dr Emanuel Ben-David (U.S. Census Bureau)
- Dr Stas Kolenikov (NORC) – Presenting Author
Modern predictive modeling tools, such as random forests (and related ensemble methods), neural networks, and LASSO regression, to name a few, have become almost ubiquitous in research applications involving innovative combinations of survey methodology and data science. Whether the objective of a research project is accurate prediction of categorical survey outcomes (e.g., indicators of survey cooperation) or regression-based prediction of continuous outcomes (e.g., predictions of income for unmeasured individuals in a superpopulation modeling approach), these methods have proven quite useful for making good predictions in a variety of contexts (especially for processing of survey paradata). However, an important potential flaw in the widespread application of these methods has not received sufficient research attention to date. Researchers at the junction of computer and survey science frequently leverage linked data sets to study relationships between variables, where the techniques used to link two (or more) data sets may be probabilistic and non-deterministic in nature. If frequent mismatch errors occur when linking two (or more) data sets, the commonly desired outputs of predictive modeling tools describing relationships between variables in the linked data sets (e.g., variable importance, confusion matrices, RMSE, etc.) may be negatively affected, and the true predictive performance of these tools may not be realized. In this presentation, we will provide an overview of a new general methodology designed to adjust modern predictive modeling tools for the presence of mismatch errors in a linked data set. Briefly, this methodology is based on a two-component mixture model rooted in a latent binary mismatch indicator that can be modeled given contextual information associated with data linkage (e.g., match probabilities or quasi-identifiers). 
The proposed mixture model is designed to account for data contamination resulting from incorrect links while simultaneously identifying and quantifying potential determinants of such incorrect links. The approach is generic in that it can be applied in conjunction with a wide array of predictive modeling tools in a modular fashion. We will evaluate the performance of this new methodology in an application involving the use of observed Twitter activity measures and predicted socio-demographic features of Twitter users to accurately predict linked measures of political ideology that were collected in a designed survey, where respondents were asked for consent to link any Twitter activity data to their survey responses (exactly, based on Twitter handles). We find that the new methodology, which we have implemented in R, is able to largely recover results that would have been seen prior to the artificial introduction of mismatch errors in the linked data set. We will conclude with recommendations for future work in this general area.
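The two-component mixture described in this abstract can be written generically as follows. The notation is an illustrative assumption rather than the authors' own: y_i is the outcome, x_i the predictors from the linked file, z_i the linkage context (e.g., match probabilities), and m_i the latent mismatch indicator. The sketch also assumes that, for an incorrectly linked record, the outcome is unrelated to the predictors and therefore follows its marginal distribution.

```latex
\begin{align*}
  \Pr(m_i = 1 \mid z_i) &= h(z_i;\gamma), \\
  f(y_i \mid x_i, z_i)
    &= \bigl(1 - h(z_i;\gamma)\bigr)\, f(y_i \mid x_i;\theta)
     + h(z_i;\gamma)\, f_Y(y_i).
\end{align*}
```

Here f(y | x; θ) stands for whichever predictive model is being fitted, which is what would make the adjustment modular, and γ captures how the linkage context relates to the chance of an incorrect link.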
Beware of propensity score matching as a method for the integration of different data sets
- Dr Hans Kiesl (Ostbayerische Technische Hochschule Regensburg) – Presenting Author
- Dr Florian Meinfelder (Otto-Friedrich-Universität Bamberg)
Statistical Matching (or data fusion) is a generic term for various algorithms that aim to integrate different data sets (e.g. two or more household surveys or a probability and a nonprobability survey) with not exactly the same set of variables, i.e. there are some variables common to all data sets (“common variables”) and variables that are never jointly observed (“specific variables”). After a statistical matching step, all variables may be analyzed together, e.g. in the form of a prediction model using all variables. Recently, some papers proposed propensity score matching, a prominent and successful method in causal inference studies, for use in the context of data integration. While propensity score matching instead of, say, nearest neighbor matching on the common variables is valid in the causal inference setting, it lacks validity in the data integration context. We show that the “conditional independence assumptions” that are crucial in any statistical matching application slightly differ in the two settings of causal inference (estimation of average treatment effects) and data integration. Therefore, if propensity score matching is used for data integration, covariances in the fused data set will usually differ from values that reflect conditional independence. We show the quite erratic behavior of propensity score matching in the data fusion context with some simulation studies. For the special case of (approximate) multivariate normal variables, we give analytic expressions that show how covariances between the specific variables in the matched data file depend on the coefficients of the logistic regression model used for the calculation of the propensity scores.
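One way to write the two conditional independence assumptions side by side is given below. This uses standard textbook notation as an illustration, not the authors' own formulation: X denotes the common variables, Y and Z the specific variables observed only in one file each, T a treatment indicator (or, in data fusion, a file-membership indicator), and e(X) = P(T = 1 | X) the propensity score.

```latex
\begin{align*}
  \text{Causal inference:} \quad
    & \bigl(Y(0), Y(1)\bigr) \perp T \mid X
      \;\Longrightarrow\;
      \bigl(Y(0), Y(1)\bigr) \perp T \mid e(X), \\
  \text{Data fusion:} \quad
    & Y \perp Z \mid X
      \;\;\not\!\!\Longrightarrow\;
      Y \perp Z \mid e(X).
\end{align*}
```

In the causal setting the independence involves T itself, so the balancing property of e(X) carries it over. In data fusion the independence is between Y and Z, and conditioning on the coarser summary e(X) generally does not preserve it, which is consistent with the distorted covariances the abstract reports.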
Coverage Error – Leveraging data science and external data sources to adjust for coverage error in online panels including the National Center for Health Statistics’ (NCHS) Research and Development Survey (RANDS) and create rapid surveys of health outcomes
- Katherine Irimata (CDC/DDPHSS/NCHS/DRM) – Presenting Author
CONCURRENT SESSIONS D
Session 3: Harmonize Your Vocals! Exploring Voice Capture and Processing for Collecting Open-Ended Survey Responses
Assessing Performance of Survey Questions through a CARI Machine Learning Pipeline
- Dr Ting Yan (Westat) – Presenting Author
- Mr Anil Battalahalli (Westat)
Computer Assisted Recorded Interviewing (CARI) has long been used by field management to monitor interviewer performance and to assess questionnaire items (e.g. Hicks, Edwards, Tourangeau, McBride, Harris-Kojetin and Moss 2010). Conventionally, a human coder needs to first listen to the audio recording of the interactions between the interviewer and the respondent, and then evaluate and code features of the question-and-answer sequence using a pre-specified coding scheme. Such coding tends to be labor intensive and time consuming. Due to resource constraints, often only a small proportion of completed interviews or a selected group of questionnaire items can be evaluated in a timely manner. In this study, we will present a pipeline we developed at Westat that heavily draws on the use of machine learning. The pipeline has demonstrated its ability to quickly and efficiently identify cases at a higher risk of interviewer falsification. In this talk, we will evaluate the performance of this pipeline to quickly and efficiently identify survey items with poor performance. Building on literature on behavior coding and question evaluation, we will show how machine learning is used in this pipeline to process recordings and to detect survey items at a higher risk of encountering problems during the survey response process through multiple metrics. We will evaluate the performance of the pipeline using both mock interviews produced in a laboratory setting and field interviews from a nationally representative survey. We also discuss the time and cost implications of using the pipeline as compared to the conventional human coding to achieve the same goal.
Innovating web probing: Comparing text and voice answers to open-ended probing questions in a smartphone survey
- Dr Jan Karem Höhne (University of Duisburg-Essen) – Presenting Author
- Dr Timo Lenzner (GESIS – Leibniz Institute for the Social Sciences)
- Dr Konstantin Gavras (Nesto Software GmbH)
Cognitive interviewing in the form of probing is key for developing methodologically sound question and questionnaire designs. For a long time, probing has been tied to the lab, inducing small sample sizes and a high burden on both researchers and participants. Therefore, researchers have recently started to implement probing techniques in web surveys where participants are asked to provide written answers. As observed in studies on open-ended questions, participants frequently provide very short or no answers at all because entering answers is tedious. This particularly applies when completing the web survey via a smartphone with a virtual on-screen keypad that shrinks the viewing space. In this study, we therefore compare written and oral answers to open-ended probing questions in a smartphone survey. Oral answers were collected via the open-source SurveyVoice (SVoice) tool. We conducted a survey experiment in the German Forsa Omninet Panel (N = 1,001) in November 2021 and probed two questions from the modules “National Identity” and “Citizenship” of the German questionnaires of the International Social Survey Programme (ISSP) in 2013/2014. Specifically, we probed for respondents’ understanding of key terms in both questions (comprehension probing). Preliminary analyses indicate that oral answers result in higher item non-response than their written counterparts. However, oral answers are longer, suggesting more in-depth information. In order to provide a clear-cut evaluation of written and oral answers to open-ended probing questions in web surveys we will conduct further, refined analyses. For example, we will analyze the answers with respect to the number and variety of themes mentioned and examine whether one answer format elicits more detailed and elaborated answers than the other. In addition, we will investigate respondent characteristics associated with high-quality written and oral answers to open-ended probing questions.
API vs. human coder: Comparing the performance of speech-to-text transcription using voice answers from a smartphone survey
- Dr Jan Karem Höhne (University of Duisburg-Essen) – Presenting Author
- Dr Timo Lenzner (GESIS – Leibniz Institute for the Social Sciences)
New advances in information and communication technology, coupled with a steady increase in web survey participation through smartphones, provide new avenues for collecting answers from respondents. Specifically, the built-in microphones of smartphones allow survey researchers and practitioners to collect voice instead of text answers to open-ended questions. The emergence of automatic speech-to-text APIs that transcribe voice answers into text offers a promising and efficient way to make voice answers accessible to text-as-data methods. Even though there are various studies indicating a high transcription performance of speech-to-text APIs, these studies usually do not consider voice answers from smartphone surveys. In this study, we therefore compare the performance of the Google Cloud Speech API and a human coder. We conducted a smartphone survey (N = 501) in the Forsa Omninet Panel in Germany in November 2021 including two open-ended questions with requests for voice answers. These two open questions were implemented to probe two questions from the modules “National Identity” and “Citizenship” of the German questionnaires of the International Social Survey Programme (ISSP) 2013/2014. The preliminary results indicate that human coders provide more accurate transcriptions than the Google Cloud Speech API. However, the API is much more cost- and time-efficient than the human coder. In what follows, we determine the error rate of the transcriptions for the API and distinguish between no errors, errors that do not affect the interpretability of the transcriptions (minor errors), and errors that affect the interpretability of the transcriptions (major errors). We also analyze the data with respect to error types, such as misspellings, word separation errors, and word transcription errors. Finally, we investigate the association between these transcription error forms and respondent characteristics, such as age and gender.
Our study helps to evaluate the usefulness and usability of automatic speech-to-text transcription in the framework of smartphone surveys and provides empirically driven guidelines for survey researchers and practitioners.
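The abstract does not specify how the transcription error rate is computed. A widely used metric for comparing an automatic transcript against a human reference is the word error rate (WER), the word-level edit distance divided by the number of reference words. The sketch below assumes that metric; the example sentences are invented.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)
    divided by the number of words in the reference transcript."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


human = "being born in germany is important"
api_dropped_word = "being born in germany important"
api_wrong_word = "being porn in germany is important"

wer_identical = word_error_rate(human, human)
wer_deletion = word_error_rate(human, api_dropped_word)
wer_substitution = word_error_rate(human, api_wrong_word)
```

WER treats every word error equally, so it would not by itself separate the minor and major errors the abstract distinguishes; that classification depends on whether interpretability is affected and would still need human judgment.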
Gauging public opinion on artificial intelligence with linked web content and survey data
- Dr Veronika Batzdorfer (GESIS-Leibniz Institute for the Social Sciences) – Presenting Author
We see a paradigm shift with language models that are trained on broad text data at scale (e.g., GPT-3) and which have gained substantial traction in the scientific community and the public. Among other topics, research has focused on ethical aspects of AI, such as biases in large language models (Bender, Gebru, McMillan-Major, & Shmitchell, 2021). Yet, considering public input at large, from multiple stakeholders such as citizens, research, or business institutions to shape ethical regulations has been limited and costly. Gaining knowledge on risk perception and regulation needs of public stakeholders can be fruitful to predict future initiatives and inform policy making efforts in general and on specific AI applications. The main goal of this study is to compare topics (i.e., relating to perceived chances, risks, and regulations) towards AI on an EU-level for multiple stakeholders, across countries, based on text data and respective survey measures. Public data on three EU consultation rounds (06/2020-08/2021) on AI regulations have been obtained by means of dynamic web crawling of public opinion pieces and website metadata (N = 1,669). These data have been linked to a survey (N = 1,216) for the first consultation round. Furthermore, multi-lingual text has been translated into English before obtaining measures with different off-the-shelf lexicon analysis techniques (e.g., Laver-Garry dictionary or the Policy-Agenda dictionary). These measures are compared to the BERTopic machine learning architecture. The latter combines transformer models with topic models and makes it possible to identify and compare clusters of topics (by means of c-TF-IDF) over time and across stakeholders. Finally, different types of error are discussed, as has been put forward in the “Total Error Framework for Digital Traces of Human Behavior on Online Platforms” by Sen et al. (2021).
The caveats to be considered when working with non-probability samples also include measurement errors that relate to trace augmentation or reduction. Leveraging new sources of open (text) data, whilst acknowledging potential biases at all phases of the research cycle, can enrich public opinion measurements.
Don’t Know! Don’t Care? We Should! ‘Don’t Know’ Responses in Digital and Financial Literacy Questions
- Dr Christopher Henry (Bank of Canada) – Presenting Author
- Dr Daniela Balutel (York University, Canada)
- Dr Kim Huynh (Bank of Canada)
- Dr Marcel Voia (Universite d’Orleans, France)
“Don’t Know” (DK) survey responses are an alternative option to skipping a question. Respondents may skip the question if they do not know how to answer. Or it could be a convenient way to skip the question. Survey designers reduce the effect of the former and maximize the effect of the latter. It is important to understand the information content of DK answers. Is it a measure of lack of knowledge or is it a convenient option to skip? We answer this question using the 2018, 2019, and 2021 Bank of Canada Bitcoin Omnibus Survey. The survey contains questions on Digital and Financial Literacy. We use econometric analysis to compare DK answers between Bitcoin owners and non-owners. Bitcoin non-owners are more likely to respond DK based on their personal knowledge. Bitcoin owners are more assertive about their answers. They prefer to choose an answer even if it is incorrect, making the DK option more of a random choice.
CONCURRENT SESSIONS E
Session 1: The Methodologists talked with the Data Scientists and it wasn’t fair! Here’s What to Do About It!
My Training Data May Need a Trainer: Applying Population Representation Metrics to Training Data to Assess Representativity, Machine Learning Model Performance and Fairness
- Dr Trent Buskirk (Bowling Green State University) – Presenting Author
- Dr Christoph Kern (Ludwig Maximilian University of Munich)
- Mr Patrick Schenk (Ludwig Maximilian University of Munich)
The successful use of machine learning is inevitably tied to the quality of the training data used for building the underlying prediction models. Biases in training data may not only limit prediction performance in general, but can also lead to differential accuracy across different groups or entities in model deployment (Rolf et al., 2021; Chen et al., 2018). A major source of bias is selective participation or insufficient coverage of social groups in training data generation and collection processes. Such “representation bias” (Mehrabi et al., 2022) can result in prediction models that fail to learn relationships for important subpopulations and thus has severe fairness implications. This is especially the case when members of minority groups are subject to misclassification traced back to underrepresentation, which can perpetuate discrimination of historically disadvantaged groups as highlighted by the Gender Shades study (Buolamwini and Gebru, 2018). Such biases may not be restricted just to minority groups. Caliskan and colleagues (2017) found measurable levels of human-like biases in semantics derived from models using text data gathered from the Web. Regardless of the field of application, composition, skewness and possible lack of representativeness in training data can induce possible biases in predictions from machine learning models learned using such training data. These issues have become a key focus of a growing body of literature in the data and computer sciences that focuses on algorithmic fairness. Drawing on the rich toolkit of survey research, this project aims to identify and quantify the impact that misrepresentation error in the training data may have on resulting models both in terms of model accuracy and fairness.
Our approach frames this issue in terms of a population sampling and representation perspective and attempts to leverage sample properties that can be measured prior to model development as early warning signals for potential fairness issues. By leveraging information at the data gathering stage we could potentially save time and resources through the model development cycle so that models are not developed where fairness risks have been identified and researchers could efficiently focus attention on data gathering methods to rectify data issues associated with fairness. This work adds to the research on fairness in AI by: (1) studying representativity of training data from a population inference perspective, (2) proposing the use of survey sampling metrics to quantify training data misrepresentation and (3) systematically tracing the link between representation bias and algorithmic fairness in downstream tasks. We present results based on an extensive simulation study using real-world data sets in which we vary the level of representation and amount of training data as well as other factors that may be related to fairness or model accuracy (e.g. number of features used in the model and machine learning method). Training data representation will be measured using multiple metrics including R-indicators and unequal weighting effects, and the relationship between these metrics and group fairness measures, such as equalized odds and predictive parity among others, will be reported across all simulations.
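The representation and fairness metrics named in this abstract can be sketched in a few lines. The sketch makes simplifying assumptions: the unequal weighting effect uses Kish's approximation, the R-indicator uses an unweighted population standard deviation of the estimated participation propensities, and equalized odds is summarized as the largest gap in true or false positive rates between two groups. All inputs are invented, and the authors' exact estimators may differ.

```python
import statistics
from typing import Sequence, Tuple


def unequal_weighting_effect(weights: Sequence[float]) -> float:
    """Kish's approximation 1 + CV^2, computed as n * sum(w^2) / (sum w)^2."""
    n = len(weights)
    return n * sum(w * w for w in weights) / sum(weights) ** 2


def r_indicator(propensities: Sequence[float]) -> float:
    """R = 1 - 2 * SD of participation propensities (1 = fully representative)."""
    return 1 - 2 * statistics.pstdev(propensities)


def tpr_fpr(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """True positive rate and false positive rate for binary labels."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    return sum(pos) / len(pos), sum(neg) / len(neg)


def equalized_odds_gap(group_a, group_b) -> float:
    """Largest absolute difference in TPR or FPR between two groups."""
    tpr_a, fpr_a = tpr_fpr(*group_a)
    tpr_b, fpr_b = tpr_fpr(*group_b)
    return max(abs(tpr_a - tpr_b), abs(fpr_a - fpr_b))


uwe_equal = unequal_weighting_effect([1, 1, 1, 1])
uwe_skewed = unequal_weighting_effect([1, 1, 1, 3])
r_balanced = r_indicator([0.5, 0.5, 0.5])
r_unbalanced = r_indicator([0.2, 0.8])

# (y_true, y_pred) per group: perfect predictions for A, errors for B.
group_a = ([1, 1, 0, 0], [1, 1, 0, 0])
group_b = ([1, 1, 0, 0], [1, 0, 1, 0])
gap = equalized_odds_gap(group_a, group_b)
```

The first two functions need only the training sample and its weights or propensities, so they can be computed before any model is fitted, which is what makes them usable as the early warning signals the abstract describes. The equalized odds gap requires model predictions and would be evaluated afterwards.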
Applied Strategies for Advancing Racial Equity and Addressing Bias in Big Data Research
- Ms Emily Hadley (RTI International) – Presenting Author
- Ms Rachel Dungan (Academy Health)
In the last decade, big data research studies have proliferated and, in some cases, offered considerable promise for humanity. Yet, numerous incidents have documented that without safeguards, this same research can reproduce and amplify existing societal biases and disparities. Given the increased public awareness of structural racism, bias based on race and ethnicity in big data research is particularly concerning. Big data researchers can mitigate the risk of perpetuating bias based on race and ethnicity by intentionally incorporating best practices that reduce or eliminate racial bias. We synthesize key findings and applied recommendations from over 140 sources for addressing race and ethnicity bias in big data research. We first describe broad principles for addressing race and ethnicity bias in big data research. We discuss important considerations when scoping a big data study that can improve racial equity, including identifying motivations and opportunities to collaborate with individuals who will be directly impacted by the results. We explain the importance of a high-quality big dataset for preventing bias and provide examples of how quality issues like measurement errors may differ by race or ethnicity. We consider the importance and challenge of collecting race and ethnicity in big data, including differences in uses of self-identified and perceived race and ethnicity and the inappropriate use of genetic variation to infer race and ethnicity. We discuss concerns regarding standardization of race and ethnicity, particularly in government big data sources, and the challenges that arise when linking multiple datasets with different race and ethnicity categories. We share tools and techniques that are useful for assessment of bias by race and ethnicity in algorithm development. We then consider three specific data concerns related to race and ethnicity in big data. 
First, we consider how the size of big data permits and perhaps augments concerns about unintentional proxies that may cause an algorithm to discriminate and cause harm unintentionally or unknowingly on the basis of race and ethnicity. We provide suggestions for recognizing and testing for wrongful proxy discrimination. Second, we consider the importance of data completeness and addressing missing race and ethnicity data in big data studies which often use blanket imputation or case completeness methods. We consider best practices and approaches to improve the accuracy and reliability of big data studies through improved treatment of missing race and ethnicity data. Finally, we consider race and ethnicity in the context of data representativeness; even when big datasets are large, they may not represent the underlying population and cannot be assumed to support unbiased population estimates. In particular, we discuss the Big Data Paradox where confidence intervals shrink but small biases become magnified, so underrepresentation of racial and ethnic subgroups can still directly impact the generalizability of big data findings. We share recommendations for how big data researchers can better address race and ethnicity representativeness. We close with discussion of additional resources that researchers can use and opportunities for further investigation.
Assessing the downstream effects of training data annotation methods on supervised machine learning models
- Mr Jacob Beck (LMU Munich)
- Dr Stephanie Eckman (Independent)
- Professor Christoph Kern (LMU Munich)
- Mr Rob Chew (RTI International) – Presenting Author
- Mr Bolei Ma (LMU Munich)
- Professor Frauke Kreuter (LMU Munich, University of Maryland)
Machine learning (ML) training datasets often rely on human-annotated data collected via online annotation instruments. These instruments have many similarities to web surveys, such as the provision of a stimulus and fixed response options. Survey methodologists know that item and response option wording and ordering, as well as annotator effects, impact survey data. Our previous research showed that these effects also occur when collecting annotations for ML model training and that small changes in the annotation instrument impacted the collected annotations. This new study builds on those results, exploring how instrument structure and annotator composition impact models trained on the resulting annotations. Using Twitter data on hate speech, we collect annotations with five experimental versions of an annotation instrument, randomly assigning annotators to versions. Our data include 3,000 Tweets labeled by 1,897 annotators, resulting in a total of over 90,000 annotations. We train and fine-tune state-of-the-art ML models such as LSTM and BERT for hate speech classification on five training data sets that consist of annotations collected with five different instrument versions and evaluate on the corresponding five test sets. By comparing model performance across the instruments, we aim to understand 1) whether the way annotations are collected impacts the predictions and errors of the trained models; and 2) which instrument version leads to robust and efficient models, judged by the performance scores across instrument versions and model learning curves. In addition, we expand upon our earlier findings that annotators’ demographic characteristics impact the annotations they make.
We find considerable performance differences across test sets and, to some extent, also across models trained with tweets that were labeled using different instrument versions. Our results emphasize the importance of careful annotation instrument design. Hate speech detection models are likely to hit a performance ceiling without increasing data quality. By paying additional attention to the training data collection process, researchers can better understand how their models perform and assess potential misalignment with the underlying concept of interest they are trying to predict.
What If? Using Multiverse Analysis to Evaluate the Influence of Model Design Decisions on Algorithmic Fairness
- Mr Jan Simson (LMU Munich) – Presenting Author
- Dr Florian Pfisterer (LMU Munich)
- Professor Christoph Kern (LMU Munich)
Across the world, more and more decisions are being made with the support of algorithms, so-called algorithmic decision making (ADM). Examples of such systems can be found in finance, the labor market, the criminal justice system and beyond. While these systems are very promising when designed well, raising hopes of more accurate, just and fair decisions, their impact can be quite the opposite when designed wrongly. There are many examples of unfair ADM systems discriminating against people in the wild, with the Dutch childcare benefits scandal providing an especially prominent and recent example. While these fairness problems often occur because algorithms replicate biases in the underlying training data, gathering perfectly fair data is usually not an option. Biases can also originate or increase in other parts of the typical machine-learning pipeline. As a result, preventing algorithms from reinforcing biases requires careful study and evaluation of the often implicit decisions made while designing ADM systems. To facilitate this, we introduce the method of multiverse analysis for algorithmic fairness. Multiverse analyses were introduced in psychology to improve reproducibility and to combat p-hacking and cherry-picking of results. This makes them particularly useful for assessing how susceptible the fairness of an ADM system is to design decisions. In the proposed adaptation of multiverse analysis for ADM, one starts by making the many implicit decisions required during the design of an ADM system explicit. One of the differences in the present analysis compared to a classic multiverse analysis is that we will evaluate machine learning systems in the end, whereas classical multiverse analyses will typically culminate in a null-hypothesis significance test (NHST). While many of the decision points apply to any machine-learning system (e.g. choice of algorithm, how to preprocess certain variables, cross-validation splits), many of them are also domain specific (e.g. 
coding of certain variables, how to set classification thresholds, how fairness is operationalized). In particular, we focus on decisions made during the pre-processing of data and in the translation of predictions into possible decisions. Using all possible unique combinations of these decisions, we create a grid of possible universes of decisions. For each of these universes, we compute the resulting fairness metric of the ADM system and collect it as a data point. The resulting dataset of decision universes and resulting fairness is treated as our source data for further analysis, where we evaluate how individual decisions relate back to fairness. In our work, we present a generalizable approach of using multiverse analysis to estimate the effect of decisions during the design of an ADM system on fairness metrics. We demonstrate the feasibility of this approach using a case study of predicting public health coverage in US census data, finding that even subtle design decisions, such as the encoding of features, which might go unscrutinized in typical fairness analyses, can have a considerable effect on a system’s fairness.
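The grid of decision universes described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the two decisions (`age_encoding`, `threshold`), their options, the toy records, the stand-in scoring rule, and the use of a demographic parity gap as the fairness metric are all hypothetical.

```python
from itertools import product

# Hypothetical design decisions and their options (illustrative only).
DECISIONS = {
    "age_encoding": ["continuous", "binned"],
    "threshold": [0.4, 0.5, 0.6],
}

# Toy records as (group, age) pairs; not real data.
RECORDS = [("a", 25), ("a", 42), ("a", 62), ("b", 23), ("b", 35), ("b", 58)]


def score(age, age_encoding):
    """Toy score in [0, 1]; a stand-in for a trained model's prediction."""
    if age_encoding == "binned":
        age = 30 if age < 45 else 65  # coarse two-bin encoding
    return age / 100


def parity_gap(age_encoding, threshold):
    """Absolute difference in positive-decision rates between the two groups."""
    rates = {}
    for group in ("a", "b"):
        scores = [score(age, age_encoding) for g, age in RECORDS if g == group]
        rates[group] = sum(s >= threshold for s in scores) / len(scores)
    return abs(rates["a"] - rates["b"])


def run_multiverse():
    """Evaluate the fairness metric once per unique combination of decisions."""
    names = list(DECISIONS)
    rows = []
    for combo in product(*(DECISIONS[name] for name in names)):
        universe = dict(zip(names, combo))
        rows.append({**universe, "parity_gap": parity_gap(**universe)})
    return rows


multiverse = run_multiverse()
for row in multiverse:
    print(row)
```

Each row is one universe; the resulting table can then be analyzed to see which decision moves the metric. In this toy setup the binned encoding removes the gap at every threshold, while the continuous encoding does not.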
CONCURRENT SESSIONS E
Session 2: Watch Out – Your Phone Answered My Survey! Processing, Compliance and Estimation Using Data Captured via Smartphone Meters and Wearable Devices
Evaluating Compliance and Churn in an Ongoing Passive-Metered Smartphone Panel
- Dr Robert Petrin (Ipsos Public Affairs)
- Ms Brittany Alexander (Ipsos Public Affairs)
- Mr August Warren (Ipsos Public Affairs)
- Ms Margie Strickland (Ipsos Public Affairs)
- Dr Michael Link (Ipsos Public Affairs) – Presenting Author
Passive metering of smartphones among those agreeing to be a part of an ongoing meter and survey panel can provide key insights into mobile phone-related behaviors of study populations. Data collection efforts such as this sit at the nexus of traditional survey and broader “big data” efforts to understand attitudes and behaviors. For the purposes of this presentation, passive metered data refers to information gathered via tracking devices installed with consent on panelists’ smartphones and used to understand smartphone, internet, social media, and other behavior. Panelists also agree to participate in 2-4 surveys per month on various topics. Passive metered panel data pose several implementation challenges in practice, such as erratic daily compliance (selectively turning the meter on or off), as well as nonstandard rates of study attrition. Different sets of panelists may have also been recruited via different sources and/or have different incentive structures, which can impact compliance and attrition, and thus the bias of subsequent analyses. This paper applies Bayesian failure time models to smartphone metered data collected on a nationally representative probability panel. These models are used to assess the demographic, attitudinal, and behavioral determinants of daily compliance with metering, as well as study attrition, to understand potential sources of bias and losses in power. The paper concludes with recommendations for enhancing study design and panel maintenance, and improving inferences from metered data.
Continuous Monitoring of Health and Wellness Using Wearable Sensors: New Data Source for Social Science
- Dr Dorota Temple (RTI International) – Presenting Author
- Dr Meghan Hegarty-Craver (RTI International)
- Dr Hope Davis-Wilson (RTI International)
- Dr Edward Preble (RTI International)
- Dr Jonathan Holt (RTI International)
- Dr Howard Walls (RTI International)
Wearable physiological sensors (“wearables”) are revolutionizing the collection of data pertaining to health and wellness. A recent study showed that 30% of adults in the United States own smartwatches incorporating heart rate and physical activity monitors, with nearly 50% using these devices every day [1]. The growing use of wearables has unlocked a new data source for population health research, one that requires minimal effort on the part of the monitored individual and can produce information continuously in near real time. Such data have the potential to provide a more complete picture of the daily rhythms of health, well-being, and disease, and the environment in which these take place, than survey questionnaires by themselves. In this paper, we describe a system for the collection and end-to-end processing of signals acquired by smartwatches integrating optical sensors for the measurement of heart rate, heart rate variability, respiration rate and blood oxygen level, and accelerometers for the measurement of physical activity. We show how the sensor signals are processed to remove artifacts and discuss the methodology for the extraction of physiological metrics and their standardization, the latter to enable comparison of trends from individual to individual and with respect to the level of physical activity. We describe how these standardized metrics are used as inputs to machine learning algorithms for the detection of specific health or disease states. We review applications of such data collection and analytics platforms, focusing on presymptomatic detection of respiratory infections [2] and wearables-based affect recognition. Finally, we describe how wearables data together with survey questionnaires can provide a more comprehensive picture of health and well-being.
References:
[1] Chandrasekaran R., Katthula V., Moustakas E. Patterns of Use and Key Predictors for the Use of Wearable Health Care Devices by US Adults: Insights from a National Survey. J Med Internet Res., 2020; doi: 10.2196/22443.
[2] Temple D. S., Hegarty-Craver M., Furberg R.D., et al. Wearable Sensor-Based Detection of Influenza in Presymptomatic and Asymptomatic Individuals. The Journal of Infectious Diseases, 2022; doi: 10.1093/infdis/jjac262.
Provide or Bring Your Own Wearable Device? An assessment of compliance, adherence, and representation in a national study.
- Dr Heidi Guyer (RTI International) – Presenting Author
- Ms Margaret Moakley (RTI International)
- Mr Carlos Macuada (RTI International)
- Professor Florian Keusch (University of Mannheim)
- Professor Bella Struminskaya (Utrecht University)
The use of wearables in data collection began over two decades ago with research-grade devices designed to measure physical activity as well as to detect other biological processes and environmental conditions such as heart rate, body temperature, light exposure, and sound. These devices allow for the collection of data that is not reportable by individuals yet has an important impact on health status such as the total number of daily steps, average and maximum daily heart rate, sleep duration and sleep stage. The ubiquity and advancement of consumer-grade wearable devices offer new and expanded opportunities for researchers to collect data in a natural environment, thereby decreasing respondent burden and, potentially, increasing data availability and quality. The All of Us Research Program (AoU) is a research initiative led by the United States’ National Institutes of Health (NIH) with the goal of enrolling 1,000,000 individuals to advance health care in the U.S. with a specific focus on populations typically underrepresented in research. AoU began in 2018 and currently includes over 600,000 participants. Participants are asked to share their health records as well as complete surveys and provide biological samples. These longitudinal data are available to registered researchers to support a wide array of health research. The program is currently achieving the objective of including a higher proportion of participants from under-represented backgrounds. For example, 39% of AoU participants are non-white compared to 37% in the U.S. population (and typically much lower in national surveys). In 2019, AoU expanded data collection to include wearable fitness trackers, first using a Bring-Your-Own-Device (BYOD) approach. Participants who already owned a FitBit device could link their FitBit data to the AoU database, with 12,880 providing data to date. To improve representation of the wearables data, FitBits were provided to 10,000 participants in 2021. 
We will describe the representativeness of each of the populations of FitBit users (those who owned a FitBit and those who were provided a FitBit) as well as measures of compliance and adherence. Additionally, we will analyze the demographic characteristics associated with compliance and adherence for three user groups: device owners, device provided, and all users. Our findings will help to assess sample representation and non-participation by participant characteristics for these groups of participants. We will provide guidance to researchers on how to determine the return on investment (amount of data and data quality versus cost) for the BYOD approach vs. providing devices.
Wearables Research and Analytics Platform (WRAP): Integrating wearables, surveys and monitoring systems
- Dr Heidi Guyer (RTI International) – Presenting Author
- Dr Eric Francisco (RTI International)
- Mr Vaughn Armbrister (RTI International)
- Mr Adam Miller (RTI International)
- Mr Ben Allaire (RTI International)
- Dr Vinay Tannan (RTI International)
The use of wearables and sensors in data collection began over two decades ago with research-grade devices designed to measure physical activity as well as to detect other biological processes and environmental conditions such as heart rate, body temperature, light exposure, and sound. These devices allow for the collection of data that is not reportable by individuals yet has an important impact on health status such as the total number of daily steps, average and maximum daily heart rate, sleep duration and sleep stage. The ubiquity and advancement of these devices for the general population offer new and expanded opportunities for researchers to access widely available data in a natural environment, thereby decreasing respondent burden and, potentially, increasing data availability and quality. Research Triangle Institute International (RTI) developed the Wearables Research and Analytics Platform (WRAP) to collect, store and analyze data collected using wearable devices. WRAP has since expanded to provide researchers with a system to seamlessly obtain data from a variety of wearable devices and apps, allow for real-time monitoring, and deliver ecological momentary assessment (EMA) surveys to participants. This session will feature a demonstration of WRAP including: – The system functionality – The system architecture – Database structures – A dashboard data monitoring platform – EMA functionality – Monitoring data from multiple apps/trackers concurrently – Examples from current projects. Important considerations regarding the measures, apps/sensors, population and data collection setting of interest will be discussed as well. The WRAP demonstration will provide an overview of the novel innovations available to researchers interested in collecting data using wearables, apps and sensors to ensure high quality acquisition of data that reduces the burden for participants as well as researchers.
CONCURRENT SESSIONS E
Session 3: Any Kinks in the Links? Exploring Data Linkage and Quality Frameworks for Modern Surveys
Burden, benefit, consent, and control: Moving beyond privacy and confidentiality in attitudes about administrative data in government data collection
- Dr Aleia Fobia (US Census Bureau) – Presenting Author
- Ms Jennifer Childs (US Census Bureau)
- Dr Shaun Genter (US Census Bureau)
In the face of declining survey participation rates and increasing demand for fast and efficient data products, governments are increasing their reliance on data from alternative sources including records and commercially available data. Public perception is one of the many challenges faced by this shift in federal data infrastructure. Research on public opinion around the use of administrative data has often focused on privacy and confidentiality concerns. Privacy and confidentiality protections often rely on strict prohibitions around data sharing and data use. The push to increase data sharing and data linkage has brought increased scrutiny to how changes might affect the trust relationship between respondent and government data collection efforts that rests on those prohibitions (see NASEM 2017). However, just as Big Data has changed the landscape that governs federal data collection, it has shifted the broader context within which respondents interpret their relationship with government data collection (see 2016 UN Report on Big Data for Official Statistics). In this presentation, we explore quantitative and qualitative data from multiple studies designed to assess attitudes towards data linkage, sharing across agencies, and the use of alternative data sources at the United States Census Bureau. Data include nationally representative opinion survey data and data from focus groups and cognitive interviews collected in 2022 and 2023. Moving beyond a paradigm of privacy and confidentiality concerns, we investigate how respondents negotiate a relationship with government data collection in a context where their data is “already everywhere.” We explore how themes of consent, control, burden, and benefit interact and merge with values of privacy and promises of confidentiality in the context of government data collection in the United States. 
Preliminary qualitative data suggest that some benefits and risks associated with increased use of administrative data are more complex than commonly assumed. For example, decreasing burden by using administrative data (to replace questions or survey requests) has been identified as one prospective benefit for respondents. However, some participants say that this violates their right to refuse a request for information. Using quantitative data, we explore how different populations view benefits to the use of alternative data sources by the federal government and their perceptions of control over data collection and data use. We conclude with implications for future opinion and attitude studies and considerations for communication around proposed and current uses of alternative data sources in government contexts.
Challenges and Opportunities in Using Big Data for Official Statistics: A Critical Review of Quality Frameworks and Twitter Sentiment Analysis
- Professor Marc Callens (Ghent University) – Presenting Author
- Professor Dries Verlet (Ghent University)
Using social media data, or more general big data, for the production of official statistics introduces quality concerns outside the scope of traditional quality frameworks because the characteristics and sources of big data are different from traditional survey data and because extracting information from big data requires different methodologies for collecting and processing data. In order to address these additional quality concerns, existing quality frameworks need to be adapted or new ones may need to be introduced. A big data quality framework does not exclusively refer to the quality of the big data itself, but also to the quality of the processes it undergoes and the quality of the datasets/information derived from it. It has been pointed out that the high variety in big data sources and analysis methods hampers the creation of a generally applicable big data quality framework (ESSnet Big Data II, 2020; UNECE big data quality task team, 2014). The relevancy, interpretation, and evaluation of specific quality dimensions vary depending on the big data class and the application (ESSnet Big Data II, 2020). Still, establishing the general structure of a big data quality framework may be valuable because it can function as the basis for developing data source-specific quality frameworks. In this paper we first critically review two traditional quality frameworks that are widely used in official statistics: the total survey error-framework (TSE) and the Code of Practice defined by Eurostat. Both quality-frameworks have been designed mainly with surveys and designed data in mind. The TSE framework has its origins mainly in the academic world, the Code of Practice originated in the world of official statistics. We then evaluate some newly developed big data quality frameworks. Several approaches to the construction of big data quality frameworks can be identified in the literature. 
Fairly general guidelines have been proposed specifically for official statistics (ESSnet Big Data II, 2020; UNECE big data quality task team, 2014). Another strategy has been to focus on one dimension, such as the adapted version of the TSE framework (Amaya, Biemer, & Kinyon, 2020). We also discuss some examples of very specific quality frameworks in the context of a particular big data source (e.g., Twitter: Hsieh & Murphy, 2017). The paper concludes by considering a hybrid framework that integrates aspects of the TSE and big data quality frameworks to address the unique challenges of using big data for official statistics.
Survey Design Considerations for Data Linkage
- Professor Sunshine Hillygus (Duke University) – Presenting Author
- Professor Kyle Endres (University of Northern Iowa)
There is considerable promise in the marriage of “made” (e.g., surveys) and “found” data, whether government administrative records, social media activity, mortgage and financial information, or the like. Auxiliary data can enrich surveys by providing leverage to diagnose and mitigate survey error, reduce survey burden, or validate sensitive behaviors. While there are myriad opportunities and potential benefits, there are also unique challenges to ensuring successful linkage. The benefits from record linkage are only achievable if researchers are able to access external data sources, obtain consent from survey participants to perform this match if required, and can successfully match the two data sources. In this paper, we outline various design considerations in developing a survey questionnaire for data linkage. We discuss the need to 1) prepare for data harmonization in the questionnaire development process; 2) specify in advance data quality measures and diagnostics; 3) optimize design of the informed consent if required; and 4) evaluate the privacy and confidentiality implications of a combined data set. In addition to reviewing the research of relevance, we will present results from data linkage between the American National Election Study and Facebook and a series of original survey experiments.
A Data Quality Scorecard to Assess a Data Source’s Fitness for Use
- Ms Lisa Mirel (NCSES/NSF)
- Dr John Finamore (NCSES/NSF)
- Dr Elizabeth Mannshardt (NCSES/NSF)
- Dr Julie Banks (NORC)
- Dr Don Jang (NORC) – Presenting Author
- Dr Jay Breidt (NORC)
Assessing data quality is critical to determine a data source’s fitness for use. The US Federal Committee on Statistical Methodology (FCSM) has developed a data quality framework that provides a common language for federal agencies and researchers to make decisions about data quality. The FCSM framework is based on three core domains: utility, objectivity, and integrity. Each domain has varying dimensions. Recently, the National Center for Science and Engineering Statistics (NCSES) within the U.S. National Science Foundation and NORC at the University of Chicago have collaboratively developed an approach to generate scorecards based on the eleven dimensions outlined in the FCSM Data Quality Framework. Our “Federal Data Quality Assessment Framework” (FDQAF) scorecard generates a score in each of the eleven FCSM data quality dimensions by first defining a specific use case and then answering a series of binary questions about that use case, its data sources, and any organizations that provided the data. Each binary item is answered by using available documentation and referencing the relevant metadata for each answer. Any items that cannot be answered have a default score of zero. Zero-one scores for binary items sum to preliminary dimension scores that are rescaled between zero and nine to give equally weighted dimension scores. The overall score is the sum of the rescaled scores across the eleven dimensions, with higher scores reflecting overall higher data quality. The dimension scores give insight on strengths and weaknesses of a data source for a particular use case. We illustrate the FDQAF scorecard approach with use cases from NCSES practice and describe some of the challenges in developing a multipurpose assessment system. Specific attention within the presentation will focus on use of the scorecard across different data types including survey, administrative, and linked data.
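The scoring arithmetic described above (binary items, a default of zero for unanswerable items, rescaling to a zero-to-nine range, and a sum across dimensions) can be sketched as follows. This is an illustration only, not the FDQAF implementation: the linear rescaling rule, the number of items per dimension, and the example answers are assumptions, and only three of the eleven dimensions are shown.

```python
def dimension_score(items, scale_max=9):
    """Rescale a list of binary item answers to the range [0, scale_max].

    `items` holds 1 (yes), 0 (no), or None (could not be answered).
    Unanswerable items count as zero. Linear rescaling by the number of
    items is an assumption of this sketch.
    """
    if not items:
        raise ValueError("a dimension needs at least one item")
    raw = sum(1 for answer in items if answer == 1)
    return scale_max * raw / len(items)


def overall_score(scorecard):
    """Sum the rescaled dimension scores; higher means higher assessed quality."""
    return sum(dimension_score(items) for items in scorecard.values())


# Hypothetical answers for three dimensions of one use case.
example_scorecard = {
    "relevance": [1, 1, 0, 1],
    "timeliness": [1, None, 0],
    "accuracy": [1, 1, 1, 1, 1, 1],
}

for name, items in example_scorecard.items():
    print(name, dimension_score(items))
print("overall", overall_score(example_scorecard))
```

Because every dimension is rescaled to the same range, a dimension with many items does not outweigh one with few, which is what gives equally weighted dimension scores.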
Supplementing Sensitive Survey Data by Leveraging Social Listening with Machine Learning Models
- Professor Heather Kitada Smalley (Willamette University)
- Professor John Walker Orr (George Fox University) – Presenting Author
In December 2017, the US Department of Justice requested the reinstatement of the citizenship question on the 2020 Decennial Census questionnaire; however, perceptions of the Trump administration’s stance on immigration caused concern for how these data would be used and gave rise to Twitter movements. Although this request for the 2020 Decennial Census was ultimately rejected by the Supreme Court in June 2019, the American Community Survey (ACS) continued to include the citizenship question. We therefore explore the effect of movements such as #leaveitblank on rates of missingness in the ACS from 2017 to 2020. This project is novel because we combine data from a long-established government survey obtained from a probability sample with social media data, which come from a non-probability sample with different frequencies. There are 73 gigabytes of ACS data, containing both individual and household records from 2005 to 2020. The data from Twitter consist of 13.3 million tweets collected from 2011 to 2022. We employ supervised and unsupervised machine learning methods for natural language processing, including S-BERT, neural networks, and clustering, in order to geolocate and classify relevant tweets. These Twitter data are aggregated over space and time in order to assess possible correlations with trends in non-response.
CONCURRENT SESSIONS F
Session 1: I’m Biased Towards Accuracy! Advances in Evaluating and Adjusting Estimates within Finite Population Frameworks
The Sensitivity of Selection Bias Estimators: A Diagnostic based on a case study and simulation
- Mr Santiago Gómez (Vrije Universiteit Amsterdam) – Presenting Author
- Mr Dimitris Pavlopoulos (Vrije Universiteit Amsterdam)
- Mr Ton De Waal (Statistics Netherlands)
- Mr Reinoud Stoel (Statistics Netherlands)
- Mr Arnout van Delden (Statistics Netherlands)
Selection bias is one of the most prevalent concerns in sampling and the principal motivation behind the various random sampling methods. However, non-probabilistic data are increasingly available given the advent of Big Data, and thus selection bias needs to be revisited. More specifically, there is a need for estimates that capture the degree of systematic error due to selection. Several authors have proposed sensible approaches to this problem, which have already been applied to issues such as the bias in voting polls in the 2016 United States election. However, several proposed estimators depend on unobserved parameters that are determined arbitrarily. Given this, in this study, we detail the approaches from Meng (2018) and Little et al. (2020) and estimate their sensitivity to data variations that are frequent in practice, like skewed distributions and high selectivity. To do so, we conduct a series of simulations and employ a case study to evaluate the performance of these novel selection bias estimators and some alternatives. Our analyses indicate that a high correlation between the selection variable and the target variable implies less precise estimates for most of the estimators, though this increased variance could be reduced considerably when employing highly informative auxiliary variables. In addition, the simulation results indicate that the leading predictor of the bias of the estimates is the skewness of the target distribution. We conclude by making some remarks on the current state of the literature on selection bias estimators and future research paths.
“balance” – a Python package for balancing biased data samples
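For readers unfamiliar with the Meng (2018) approach named above, its starting point is an exact identity: the error of an unweighted sample mean equals the finite-population correlation between the inclusion indicator and the target variable, times sqrt((1 - f) / f) where f is the sampling fraction, times the population standard deviation of the target. The sketch below checks that identity numerically. The population, the skewed target variable, and the selection mechanism are synthetic assumptions chosen for illustration; this is not the authors' simulation design.

```python
import math
import random

random.seed(0)

N = 10_000
# Synthetic skewed target variable (illustrative, not real data).
y = [random.expovariate(1.0) for _ in range(N)]
# Selective inclusion: units with larger y are more likely to be recorded.
r = [1 if random.random() < min(0.05 + 0.05 * v, 1.0) else 0 for v in y]

n = sum(r)
f = n / N  # sampling fraction
pop_mean = sum(y) / N
sample_mean = sum(v for v, keep in zip(y, r) if keep) / n

# Finite-population quantities (divisor N, not N - 1).
sigma_y = math.sqrt(sum((v - pop_mean) ** 2 for v in y) / N)
cov_ry = sum((keep - f) * (v - pop_mean) for keep, v in zip(r, y)) / N
rho_ry = cov_ry / (math.sqrt(f * (1 - f)) * sigma_y)

actual_error = sample_mean - pop_mean
identity_error = rho_ry * math.sqrt((1 - f) / f) * sigma_y

print(f"sampling fraction f = {f:.3f}")
print(f"actual error        = {actual_error:.6f}")
print(f"identity error      = {identity_error:.6f}")
```

In practice the correlation term is unobserved, which is why estimators built on this identity rely on assumptions or auxiliary variables, the sensitivity issue the abstract investigates.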
- Dr Tal Galili (Meta) – Presenting Author
- Dr Tal Sarig (Meta)
- Mr Steve Mandala (Meta)
The ìbalanceî Python package is a new open-source software by Meta (released in late 2022) [1, 2, 3]. The package offers a simple workflow and methods for dealing with biased data samples when looking to infer from them to a population of interest. Bias in survey data is often the result of survey non-response or when the data collection suffers from sampling bias. Directly inferring insights from data with such biases can result in erroneous estimates. Hence, it is important for practitioners to understand if and how data is biased and, when possible, use statistical methods to minimize such biases. The ìbalanceî package addresses this issue by providing a simple, easy-to-use, framework for weighing data and evaluating its biases. The package is designed to provide best practices for weight fitting and offers several modeling approaches. The methodology in ìbalanceî can support ongoing automated survey data processing, as well as ad-hoc analyses of survey data. The main workflow API of balance includes three steps: (1) understanding the initial bias in the data relative to a target we would like to infer, (2) adjusting the data to correct for the bias by producing weights for each unit in the sample based on propensity scores, and (3) evaluating the final biases and the variance inflation after applying the fitted weights. The adjustment step provides a few alternatives for the researcher to choose from: Inverse propensity weighting using logistic regression model based on LASSO (Least Absolute Shrinkage and Selection Operator ), Covariate Balancing Propensity Scores , and post-stratification. The focus is on providing a simple to use API, based on Pandas data-frame structure, which can be used by researchers from a wide spectrum of fields. In this talk, we present the capabilities of the balance package, demonstrating the flow of the package and its ease of use. 
References
[1] balance website: https://import-balance.org/
[2] balance GitHub repository: https://github.com/facebookresearch/balance
[3] balance release post: https://import-balance.org/blog/2023/01/09/bringing-balance-to-your-data/
[4] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267-288.
[5] Imai, K., & Ratkovic, M. (2014). Covariate balancing propensity score. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1), 243-263.
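The three-step workflow described in this abstract (diagnose, adjust, evaluate) can be illustrated with a minimal standard-library sketch. It deliberately does not use the balance package's own API. It uses post-stratification, one of the adjustment alternatives named in the abstract, and Kish's approximate design effect as one common measure of variance inflation. The age cells, sample counts, and target shares are hypothetical values chosen for illustration.

```python
from collections import Counter


def cell_shares(cells):
    """Return the share of units that fall in each cell."""
    counts = Counter(cells)
    total = len(cells)
    return {cell: n / total for cell, n in counts.items()}


def post_stratification_weights(sample_cells, target_shares):
    """Weight each unit by (target share) / (sample share) of its cell."""
    sample_shares = cell_shares(sample_cells)
    return [target_shares[c] / sample_shares[c] for c in sample_cells]


def kish_design_effect(weights):
    """Kish's approximate variance inflation caused by unequal weights."""
    n = len(weights)
    return n * sum(w * w for w in weights) / sum(weights) ** 2


# Hypothetical biased sample: younger respondents are over-represented.
sample_cells = ["18-39"] * 80 + ["40+"] * 20
target_shares = {"18-39": 0.5, "40+": 0.5}  # hypothetical population shares

# Step 1: understand the initial bias (sample shares vs. target shares).
shares_before = cell_shares(sample_cells)

# Step 2: adjust by fitting one weight per unit.
weights = post_stratification_weights(sample_cells, target_shares)

# Step 3: evaluate the remaining bias and the variance inflation.
weighted_share_young = (
    sum(w for w, c in zip(weights, sample_cells) if c == "18-39") / sum(weights)
)
design_effect = kish_design_effect(weights)

print(shares_before, weighted_share_young, design_effect)
```

In this toy case the weights remove the bias in the age distribution, at the price of a design effect above 1, which is the trade-off that step (3) is meant to make visible.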
Rethinking the Test Set: A Finite Population Perspective
- Mr Robert Chew (RTI International) – Presenting Author
Test set error metrics are universally used to assess how well a supervised learning model will perform on new data, under the assumption that both the training and test set are drawn i.i.d. from the same fixed distribution. Though often useful, the vagueness introduced by invoking an unknown joint distribution, rather than a specific target population of interest, can mask challenges to evaluating generalization in practice. For example, assessing objectives such as fairness or the need for domain adaptation can be obscured if there is ambiguity about what types of observations the model should be expected to generalize to. This talk considers test set evaluation through the lens of finite population sampling, where a finite population is defined as a collection of distinct units (people, businesses, schools, etc.). By focusing on how model errors on the test set generalize to all distinct units in our target population, we can make different kinds of statements about model errors than the generalizability paradigms often used in ML allow. These include being able to estimate the number of observations that model errors impact in the target population (which can be used in complementary evaluations, such as risk impact assessments) and constructing valid confidence intervals on test set metrics that hold for the target population under various sampling designs. We argue that this perspective is particularly relevant when evaluating model performance on human populations, where a finite listing of all observations is possible to construct and is conceptually meaningful, and when the test set evaluation metrics need to be interpreted as model error rates for a target population as opposed to heuristics for comparing generalization properties of competing algorithms.
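As a rough illustration of the finite-population view, the sketch below assumes the simplest design: the test set is a simple random sample drawn without replacement from a listed population of known size. Under that assumption it estimates the error rate, the number of population units affected by model errors, and a normal-approximation confidence interval with a finite population correction. The function name and the example counts are hypothetical, and other sampling designs would need different variance formulas.

```python
import math


def srs_error_estimates(n_errors, n_test, population_size, z=1.96):
    """Design-based estimates of model error under simple random sampling
    without replacement (an assumed design, not the only possible one)."""
    p_hat = n_errors / n_test
    # Finite population correction: variance shrinks as the test set
    # covers a larger fraction of the population.
    fpc = 1.0 - n_test / population_size
    se = math.sqrt(fpc * p_hat * (1.0 - p_hat) / (n_test - 1))
    low = max(0.0, p_hat - z * se)
    high = min(1.0, p_hat + z * se)
    return {
        "error_rate": p_hat,
        "error_rate_ci": (low, high),
        # Estimated number of population units the model gets wrong.
        "total_errors": population_size * p_hat,
        "total_errors_ci": (population_size * low, population_size * high),
    }


# Hypothetical example: 50 errors in a test set of 500 drawn from 10,000 units.
result = srs_error_estimates(n_errors=50, n_test=500, population_size=10_000)
print(result)
```

When the test set is the entire population, the correction factor is zero and the interval collapses to the observed error rate, which matches the intuition that nothing remains to be inferred.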
Composite Weighting for Hybrid Samples
- Dr Mansour Fahimi (Marketing Systems Group) – Presenting Author
Increasingly, survey researchers rely on hybrid samples to improve coverage or secure the needed number of respondents in a cost-effective manner by combining two or more independent samples. For instance, it is possible to combine two probability samples, with one relying on RDD and another on ABS. More commonly, however, researchers are compelled to supplement expensive probability-based samples with those from online panels that are substantially less costly. If carried out effectively, such hybrid samples can address both the cost and coverage challenges of single-frame surveys. Traditionally, the method of Composite Estimation has been used to blend results from different surveys to improve the robustness of the resulting estimates. This means individual point estimates from different surveys are produced separately and then pooled together, one estimate at a time. Given that for a typical study one has to produce dozens of estimates for key outcome measures, this computationally intensive methodology can require serious time and resources. Moreover, component point estimates used for composition are subject to the inferential limitations of the individual surveys that are used in this process. During this presentation the author will start with a quick review of the composite estimation methodology and then introduce the method of Composite Weighting, which is significantly more efficient, both computationally and inferentially, when pooling data from multiple surveys. For empirical illustrations, results from three surveys will be presented, with each survey relying on hybrid samples comprised of probability-based components from the USPS address database and supplemental samples from online panels.
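The abstract does not give formulas, so the following is only a sketch of the general idea under stated assumptions. Composite estimation blends two separate estimates as lambda * estimate_A + (1 - lambda) * estimate_B, one outcome at a time. One way to obtain the same blend for every weighted mean at once is to rescale the two weight vectors so that sample A carries total weight lambda and sample B carries 1 - lambda, then pool the files. Choosing lambda in proportion to each sample's effective sample size is an assumption made here for illustration, not necessarily the author's rule, and all data values are hypothetical.

```python
def effective_sample_size(weights):
    """Kish's effective sample size for a set of weights."""
    return sum(weights) ** 2 / sum(w * w for w in weights)


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def composite_weights(weights_a, weights_b):
    """Rescale two weight vectors so the pooled file blends the samples
    with factor lam (assumed: proportional to effective sample size)."""
    n_eff_a = effective_sample_size(weights_a)
    n_eff_b = effective_sample_size(weights_b)
    lam = n_eff_a / (n_eff_a + n_eff_b)
    total_a, total_b = sum(weights_a), sum(weights_b)
    scaled_a = [lam * w / total_a for w in weights_a]
    scaled_b = [(1.0 - lam) * w / total_b for w in weights_b]
    return lam, scaled_a + scaled_b


# Hypothetical probability sample (A) and online panel sample (B).
y_a, w_a = [1, 0, 1, 1], [1.0, 1.0, 2.0, 2.0]
y_b, w_b = [1, 1, 0, 1, 0, 1], [1.0] * 6

lam, pooled_weights = composite_weights(w_a, w_b)

# Composite estimation: blend two separately computed estimates.
composite_estimate = lam * weighted_mean(y_a, w_a) + (1 - lam) * weighted_mean(y_b, w_b)

# Composite weighting: one pooled file, one set of weights, any outcome.
pooled_estimate = weighted_mean(y_a + y_b, pooled_weights)

print(lam, composite_estimate, pooled_estimate)
```

For weighted means the two routes agree exactly, but the pooled weights are computed once and can then be reused for every outcome measure.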
Poster Session (actively presented 8:30-10:00)
Technological Developments Influence the Cybercrime in Juja Sub-County
- Mr Ndirangu Ngunjiri (University of Nairobi) – Presenting Author
- Miss Elizabeth Kingoo (Jomo Kenyatta University of Science and Technology)
Before there was technological growth, the world only had physical threats. However, the emergence of technology has created cybercrime that may be experienced by anyone. Cybercrimes have grown steadily, with perpetrators developing newer and more sophisticated techniques each day. The most prominent offenses are stalking, hacking, phishing, online fraud, identity theft, and distributing viruses. These crimes cause harm to personal identity, fraud, forgery, threats, and financial losses. Cybercrimes cause severe damage in developing countries driving toward a cashless economy. Purpose: This article aims to study the effect of the technology boom on cybercrime in Juja sub-county, as a form of crime that develops with the advancement of technology. Methodology: The concepts of cybercrime are introduced, and different types of cybercrime are explored as examples of some of the impacts caused by cybercrime activities. The analysis in this paper is based on an extensive review of published research works, which provide theoretical and empirical evidence on the effects and impact of the internet on development. Findings: Society suffers many negative effects from cybercrime, and the paper examines why computers and networks are targets for crime. The paper generally concludes that the internet is overwhelmingly an effective tool for development. Paradoxically, the internet is a "double-edged sword", providing many opportunities for individuals and organizations to grow but at the same time bringing with it new opportunities to commit crime. The paper argues that the internet presents new challenges to law enforcement in both developed and developing countries.
However, developing countries suffer substantially more from internet crime than their developed counterparts, as developing countries have inadequate technology and infrastructure and insufficient law enforcement expertise. Development in technology enables cybercrime and creates fear. Information technology has widened the communication sphere, making it borderless and transnational. Practical implications: This paper reminds individuals, organizations, and policy makers alike that cybercrime has become a global issue that requires the full participation and cooperation of both developed and developing countries at the international level, as internet crime investigations often require that evidence be traced and collected in more than one country. ICT industries should focus now on designing products that are resistant to crime and can facilitate the detection and investigation of crime. Value: The main novelty of this paper is that it traces the historical evolution of the technology and then sketches out some of the improvements the internet has brought, while also considering the negative consequences associated with this technology and its effect on development. Recommendations: Introduction of a concrete legal framework, and the establishment and strengthening of cybercrime law enforcement agencies, complete with high-technology monitoring devices and modern infrastructure. Empower the youth while in college with entrepreneurial skills. Cybercrime may be prevented by increasing security in the corporate network while communicating to the
ML Applications to Survey Quality Control and Fraud Detection Abstract 2023
- Mrs Kriscel Berrum (D3 Systems) – Presenting Author
- Mr Liam Spoletini (D3 Systems)
- Mr Timothy Van Blarcom (D3 Systems)
The usefulness of survey data is heavily reliant on the assumption that the collected data are legitimate. D3's proprietary pre-, in-, and post-field quality control measures are therefore essential components of the survey data collection process for minimizing the number of fraudulent cases in the resulting data. Traditional fraud detection methods for survey data are based on a set of known phenomena relating to answers on the questionnaire, the interviewer, and other available characteristics of the survey. For example, a completed survey may be flagged for review if its responses are identical or highly similar to those of another completed survey; interviewers can be flagged for having very low or very high nonresponse rates; and a completed survey may be flagged for review if its elapsed time differs greatly from the average. Though these traditional techniques achieve satisfactory performance, there is always a chance that fraudulent cases escape detection. Machine learning models may capture what traditional quality control misses by considering higher-dimensional features of post-field data. We plan to compare the performance of machine learning models against the traditional techniques to investigate this possibility and to evaluate the traditional quality control measures for their ability to predict fraudulent cases. We will use multiple waves of data from a multi-national face-to-face survey, with metadata on quality control results, paradata on how and when the survey was administered, and final determinations of legitimacy as our input dataset. This data is unstructured and will require careful feature extraction and/or dimensionality reduction. This dataset also includes verified cases of fraud, allowing the application of supervised models like k-Nearest-Neighbors and kernelized Support Vector Machines. We will also investigate unsupervised anomaly detection techniques like Gaussian Mixture Models and Local Outlier Factor (LOF).
The models' performance will be evaluated against traditional fraud detection methods by comparing the number of correctly identified cases of fraud. Additionally, the efficacy of the tests will be evaluated to determine the best indicators of falsified data.
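Two of the traditional checks mentioned in this abstract, near-duplicate responses and unusual interview length, can be sketched in a few lines. This is not D3's proprietary procedure. The thresholds (85% matching answers, and a duration below half or above twice the median) and the interview data are hypothetical, and the median is used instead of the average only because it is less sensitive to the outliers being searched for.

```python
from itertools import combinations
from statistics import median


def percent_match(answers_a, answers_b):
    """Share of questions on which two interviews give the same answer."""
    matches = sum(a == b for a, b in zip(answers_a, answers_b))
    return matches / len(answers_a)


def flag_near_duplicates(interviews, threshold=0.85):
    """Flag every interview that is highly similar to another one."""
    flagged = set()
    for (id_a, ans_a), (id_b, ans_b) in combinations(interviews.items(), 2):
        if percent_match(ans_a, ans_b) >= threshold:
            flagged.update((id_a, id_b))
    return flagged


def flag_unusual_durations(durations, low=0.5, high=2.0):
    """Flag interviews much shorter or longer than the median duration."""
    typical = median(durations.values())
    return {
        case_id
        for case_id, minutes in durations.items()
        if minutes < low * typical or minutes > high * typical
    }


# Hypothetical completed interviews (answers to ten closed questions).
interviews = {
    "A": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    "B": [1, 2, 3, 4, 5, 1, 2, 3, 4, 4],  # 90% identical to A
    "C": [5, 4, 3, 2, 1, 5, 4, 3, 2, 1],
}
durations = {"A": 30, "B": 31, "C": 9}  # minutes, hypothetical

duplicate_flags = flag_near_duplicates(interviews)
duration_flags = flag_unusual_durations(durations)
print(duplicate_flags, duration_flags)
```

Rule-based flags like these could also serve as input features for the supervised and unsupervised models the abstract proposes to compare them with.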
Volatility and irregularity Capturing in stock price indices using time series Generative adversarial networks.
- Mr Leonard Mushunje (Columbia University) – Presenting Author
This paper attempts to capture irregularities in financial time series data, particularly DAX stock data in the presence of the Covid-19 pandemic shock. We conjectured that jumps and irregularities are embedded in stock data due to the pandemic shock, which brings forth periodic trends in the time series data. We put forward that efficient and robust forecasting methods are needed to predict stock closing prices in the presence of the pandemic shock. This information is helpful to investors as far as confidence risk and return boost are concerned. Generative adversarial networks for time series are used, which offer new ways of modeling and learning a suitable distribution for financial time series data under complex setups. In addition, our models yield lower forecasting errors than traditional time series models such as the LSTM, GARCH, ARCH, and ARIMA models. These traditional models are liable to produce high forecasting errors and are not robust enough to capture dependency structures and other stylized facts, such as volatility, in stock markets. The GAN and WGAN models are used, which effectively deal with this risk of poor forecasts. We trained our models on daily DAX stock index data over a 12-year period, and the LSTM model was fitted as a robustness check.
Tracking and explaining the support for the use of large linked datasets within government, during a time of crises and data breaches
- Professor Nicholas Biddle (Australian National University) – Presenting Author
Like many policy communities, public policy in Australia is increasingly reliant on the analysis of data from large, linked administrative datasets. The author, for example, has been involved in a project analysing COVID-19 vaccine uptake, using data linked from the 2021 Census, immunisation registry, and tax/social security system. This type of research is broadly supported by the Australian public, with a recent nationally representative survey (also conducted by the author) showing that 84.2 per cent of Australians (in August 2022) thought that the government should be using data 'within government to evaluate the effectiveness of government programs' and 70.1 per cent thought that government should share data with researchers for similar purposes. In April 2022, the Australian government passed new legislation (the Data Access and Transparency Act) to make it easier and safer to share data within government and with approved researchers. While the legislation passed with bipartisan support, there were some concerns expressed by the community that the legislation would reduce privacy protections, and that government cannot be trusted with its citizens' data. For example, in the same survey cited above, only 29.6 per cent of Australians agreed or strongly agreed that the Australian government 'can be trusted to use data responsibly'. Part of the scepticism towards the government's use of data comes from high-profile examples of the misuse of data within government, as well as a number of data breaches that have occurred within the private and public sector. Another source of scepticism is the lack of perceived benefits from the use of linked administrative datasets and associated Machine Learning, Artificial Intelligence, and general big data analytical approaches to the data held by governments.
This scepticism, or contingent social licence to utilise large datasets within government or to share them with researchers, has the potential to put at risk the real public policy benefits that such data and new analytical approaches present. This paper reports Australians' views about data trust, cybercrime and data breaches, and how these have changed over the pandemic period and beyond. It is based on data from seven waves of the ANUpoll collected over the period October 2018 to April 2023. It includes tracking questions, as well as a number of survey experiments that causally identify the impact of data breaches and different usage of data on public attitudes. Comparisons are also made between the use of data within the public sector and within the private sector. From this survey data, it is possible to estimate the short- and medium-run impacts of two major and high-profile breaches of data, held by a major Australian telecommunications company in September 2022 and a major health insurance company in October 2022, on Australians' trust in various institutions to maintain data privacy and on views about how various groups should be able to use data.
Representativeness of push-to-web Generations and Gender Survey (GGS) in the United Kingdom
- Miss Grace Chang (University of Southampton) – Presenting Author
- Dr Olga Maslovskaya (University of Southampton)
- Professor Brienna Perelli-Harris (University of Southampton)
There is growing interest in online survey data collection because of a rise in internet penetration and a decrease in responses in offline social surveys. In 2020, 96% of households in Great Britain had internet access, and the England and Wales Census 2021 showed that online data collection is feasible, with 89% of households responding to the census online (vs. a target of 75%). However, one of the main challenges in collecting United Kingdom (UK) social surveys online is the absence of an individual-level sampling frame, in contrast to countries which have population register data. A potential solution is to use a 'push-to-web' approach, but there is little evidence about the representativeness of the samples collected. This is especially challenging because response rates for online surveys are lower compared to other modes, low for hard-to-reach groups, and vary by ability and willingness to participate in online surveys. We use data from the first UK Generations and Gender Survey (GGS), which implemented a push-to-web survey design with an online-only mode of data collection. The survey's goal is to examine demographic transitions and complex partnerships of 18–59 year olds. We examine whether the UK GGS sample is representative of those aged 18–59 in the UK, based on gender, age, ethnicity, highest education, country of birth, marital and cohabitation status, fertility (number of biological children), whether in full-time work, occupation, deprivation by small areas in the UK, and urban-rural locality, compared to gold-standard external benchmark measures obtained from the 2021 Annual Population Survey (APS) and the England and Wales 2021 Census. We contribute new evidence about the representativeness of a long (approximately 50 minutes) push-to-web online-only survey, where most surveys of similar length in the UK typically use mixed-mode designs (e.g., the Understanding Society survey).
This study will discuss the advantages and limitations of push-to-web online only survey design in the UK, which will provide useful insights for other countries which do not have population registers and desire to move to push-to-web data collection. We make unweighted and weighted comparisons of the GGS sample to the external benchmarks. We will also compute the average absolute error that takes account of the number of categories in the variables of interest to compare its similarities and dissimilarities to variables in the external benchmarks. Our preliminary analysis of the unweighted data at the UK-level suggests similar representation by those born in the UK (84% GGS, 86% APS), age groups of 18-29 (21% GGS, 20% APS) and 40-49 (24% GGS, 25% APS). The UK GGS over-represents women (63% GGS, 53% APS), ages 30-39 (28% GGS, 22% APS), respondents with partners (married/cohabiting/civil partnership) (70% GGS, 65% APS), and a degree level qualification (52% GGS, 37% APS), and surprisingly underrepresents those with White ethnic backgrounds (83% GGS, 89% APS). Final data will be available by May 2023 for a full and comprehensive analysis.
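The average absolute error mentioned above can be sketched as the mean, over the categories of a variable, of the absolute difference between the sample percentage and the benchmark percentage. The exact formula used by the authors is not given, so this definition is an assumption. The example uses the unweighted shares of women reported in the abstract (63% GGS, 53% APS); treating the remainder as a single second category is a simplification made only for illustration.

```python
def average_absolute_error(sample_pct, benchmark_pct):
    """Mean absolute difference between sample and benchmark percentages,
    averaged over the categories of one variable."""
    categories = list(benchmark_pct)
    total = sum(abs(sample_pct[c] - benchmark_pct[c]) for c in categories)
    # Dividing by the number of categories makes variables with many
    # categories comparable to variables with few.
    return total / len(categories)


# Shares of women are from the abstract; the complementary category is
# an illustrative assumption.
ggs_gender = {"women": 63.0, "other": 37.0}
aps_gender = {"women": 53.0, "other": 47.0}

gender_aae = average_absolute_error(ggs_gender, aps_gender)
print(gender_aae)
```

Computing the same quantity before and after weighting, variable by variable, gives a compact summary of how much the weights improve representativeness.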
Latent Class Dynamic Mediation Model with Application to Smoking Cessation Data
- Professor Ying Yuan (University of Texas MD Anderson Cancer Center) – Presenting Author
Traditional mediation analysis assumes that a study population is homogeneous and the mediation effect is constant over time, which may not hold in some applications. Motivated by smoking cessation data, we propose a latent class dynamic mediation model that explicitly accounts for the fact that the study population may consist of different subgroups and the mediation effect may vary over time. We use a proportional odds model to accommodate the subject heterogeneities and identify latent subgroups. Conditional on the subgroups, we employ a Bayesian hierarchical nonparametric time-varying coefficient model to capture the time-varying mediation process, while allowing each subgroup to have its individual dynamic mediation process. A simulation study shows that the proposed method has good performance in estimating the mediation effect. We illustrate the proposed methodology by applying it to analyze smoking cessation data.
Statistical learning methods to estimate sales forecasts for products that affect the supply chain of a mass consumption company in the city of Guayaquil
- Professor Francisco Morales (ESPOL – Escuela Politécnica del Ejército) – Presenting Author
- Dr Sergio Baúz (ESPOL – Escuela Politécnica del Ejército)
- Dr Johny Pambabay (ESPOL – Escuela Politécnica del Ejército)
Currently, sales in the food sector in Ecuador are in constant growth, especially in the retail industry; sales forecasting is therefore of high interest for improving competitiveness within the industry. To model the behavior of individual customer purchases and their group characteristics, multivariate clustering with the K-means method is needed, together with characterization through the CHAID method; both help to simulate real sales behavior, which is one of the main problems in forecasting models. Based on these techniques, 7 homogeneous customer segments with similar purchasing characteristics were identified, which helped to identify relevant variables for generating more reliable and accurate sales forecasts at the category level and by number of orders processed. Using statistical learning, we applied the following forecasting models: linear regression, K-nearest neighbors (KNN) and regression trees. The forecast results showed that the best model for fitting the data for the categories of the different customer segments is the linear regression model, which presents lower error measures in terms of MAPE and RMSE than the KNN and regression tree models. It obtains outstanding results especially in the Toilet Paper category of customer segment C, with a MAPE of only 2.77%, and in the Wet Towels category, with a MAPE of 3.14%.
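For readers unfamiliar with the two error measures used to compare the models, the sketch below computes MAPE (mean absolute percentage error) and RMSE (root mean squared error) from their standard definitions. The sales figures are hypothetical and unrelated to the study's data.

```python
import math


def mape(actual, forecast):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    ratios = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(ratios) / len(ratios)


def rmse(actual, forecast):
    """Root mean squared error, in the units of the data."""
    squared = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical weekly sales and model forecasts for one product category.
actual_sales = [100.0, 200.0, 400.0]
forecast_sales = [110.0, 190.0, 400.0]

category_mape = mape(actual_sales, forecast_sales)
category_rmse = rmse(actual_sales, forecast_sales)
print(category_mape, category_rmse)
```

MAPE is scale-free, which makes it convenient for comparing categories with very different sales volumes, while RMSE penalizes large individual misses more heavily.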
Unraveling the Correlation between Perceived Issue Importance and Issue Salience On the Internet among Users with Different Media Repertoires
- Mr You-Jian Wu (Academia Sinica) – Presenting Author
- Mr Hao-Hsuan Wang (Academia Sinica)
- Mr Shih-Peng Wen (Academia Sinica)
- Professor Ching-ching Chang (Academia Sinica)
- Dr Yu-ming Hsieh (Academia Sinica)
- Dr Justin Chun-ting Ho (Academia Sinica)
With the advent of online media and social platforms, people now have access to a plethora of channels for receiving information. This has led to a diversity of media diets, which in turn may be associated with distinct perspectives on various issues. Consequently, a growing body of research has adopted a media repertoires approach to identify patterns of media use and explore how perceptions or attitudes vary among different types of media users. This study aims to provide evidence of the relationship between media diets, the perceived importance of various issues, and online public opinion on issues, using a representative sample from Taiwan. By conducting an in-depth analysis, we seek to shed light on the complex interplay between these variables. We employed latent class analysis (LCA) to identify four distinct classes of media diets among survey respondents (n = 3,982). A binary observation variable was derived from the four-point scale frequency of news information consumption by using the median split, with a frequency above the median considered high usage. Based on the three criteria: 1) the lowest AIC and BIC values, 2) entropy values above 0.6, and 3) a moderate share of each class, we identified a four-class model. Omnivores, consisting of 30.85% of the respondents, have the highest probability of high usage of all platforms. Old School, 27.37% of the respondents, have a lower probability of high usage of forums (PTT, Dcard) and Instagram. Netizens, 27.19% of the respondents, mainly use forums and have lower usage of newspapers, TV, and online news. Traditionalists, 14.59% of the respondents, have a higher probability of high usage of traditional media but a lower probability of usage of online news and social media platforms. We also asked survey respondents to rank thirty-two issues according to their perceived level of importance across three major topics: eighteen long-term social issues, eight communication issues, and six hot issues. 
To gauge online public opinion on the importance of issues, we detected the number of articles retrieved through keyword searches related to each issue. For each class of media diets, we used Spearman’s rank correlation to assess the correlation between the perceived level of issue importance and online public opinion as observed in online news and forums. Among Netizens, who browse forums to a greater degree, the correlations between their perceptions of issue importance and public opinion observed in online news and forums are .62 and .58 respectively, which are the highest among all four classes. Among Traditionalists, the two correlations were the lowest (.40 and .37). Extending prior literature, this paper suggests that whether people's perception of the importance of issues reflects online opinions depends on their media diets.
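Spearman's rank correlation, used above to compare perceived issue importance with online salience, is the Pearson correlation of the two rank vectors, with tied values given their average rank. A self-contained sketch follows; the issue scores and article counts are hypothetical, not the study's data.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data for five issues: mean perceived importance in one
# media-diet class, and number of online articles mentioning the issue.
perceived_importance = [1, 2, 3, 4, 5]
article_counts = [20, 10, 40, 30, 50]

rho = spearman(perceived_importance, article_counts)
print(rho)
```

Because only ranks enter the calculation, the large differences in raw article counts between issues do not dominate the result.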
Estimators of the Sensitive Proportion in Item Count Models under Some Assumptions Violation
- Dr Barbara Kowalczyk (SGH Warsaw School of Economics) – Presenting Author
- Dr Robert Wieczorkowski (Statistics Poland)
Various item count techniques (ICTs) are established and widely applicable methods for surveys with sensitive questions. Estimation of the unconditional probability of possessing the sensitive attribute, i.e. estimation of the sensitive proportion, is of main importance. Because some latent masking (control) variable (or variables) is used to protect respondents’ privacy in all item count models, the efficiency of the estimation is an especially important problem, since it is often not very high. Although moment-based estimators are widely used in social science practice, the modern methodology of item count techniques treats the problem as one of incomplete data, and therefore ML estimators obtained via either the EM or the Newton-Raphson algorithm are employed. But the use of a parametric approach to various item count methods introduces new problems of modelling the latent masking (control) variable. To the best of our knowledge, the robustness or non-robustness of various item count models to violation of the assumptions about the latent masking variable distribution has not been studied so far. In the paper we analyze different approaches to estimation in various item count techniques, taking into account violation of model assumptions. We conduct a comprehensive Monte Carlo simulation study and compare moment-based estimates and ML estimates obtained via the EM/Newton-Raphson algorithm in various ICTs, including Poisson and negative binomial item count techniques and item count techniques with a continuous control variable, and we address the consequences of assumption violations.
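A small simulation may help fix ideas about the moment-based estimator. In the simplest two-group design assumed here, the control group reports only a masking count X, the treatment group reports X + Z, where Z = 1 for respondents with the sensitive attribute, and the sensitive proportion is estimated by the difference in group means. The Poisson masking variable, its rate, the true proportion, and the sample sizes are all hypothetical choices. The ML estimators and the assumption violations studied in the paper are not reproduced.

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return k
        k += 1


def simulate_item_count(rng, n_per_group, sensitive_prop, lam):
    """Simulate reported counts for the control and treatment groups."""
    control = [poisson_draw(rng, lam) for _ in range(n_per_group)]
    treatment = [
        poisson_draw(rng, lam) + (1 if rng.random() < sensitive_prop else 0)
        for _ in range(n_per_group)
    ]
    return control, treatment


def moment_estimator(control, treatment):
    """Difference in means: estimates the sensitive proportion."""
    return sum(treatment) / len(treatment) - sum(control) / len(control)


rng = random.Random(2023)  # fixed seed so the run is reproducible
control, treatment = simulate_item_count(
    rng, n_per_group=20_000, sensitive_prop=0.3, lam=2.0
)
estimate = moment_estimator(control, treatment)
print(estimate)
```

The estimator is unbiased under this design but noisy, because the variance of the masking variable enters twice. That inefficiency is the reason the abstract gives for turning to model-based ML estimation.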
Multivariate Statistical Modeling to Identify Atmospheric and Sociodemographic Variables Associated with Covid-19 in the City of Guayaquil
- Professor Jhon Peña (ESPOL – Escuela Politécnica del Litoral) – Presenting Author
- Dr Sergio Bauz (ESPOL – Escuela Politécnica del Litoral)
- Dr Johny Pambabay (ESPOL – Escuela Politécnica del Litoral)
- Professor Cesar Menendez (ESPOL – Escuela Politécnica del Litoral)
In 2019 a new virus appeared that would become known as COVID-19, and on March 11, 2020 the WHO declared the COVID-19 pandemic. The virus spread throughout the world, reaching almost every corner, and one of the cities most affected was Guayaquil, Ecuador, which became an epicenter of the pandemic. For this reason, the present work uses multivariate and spatial statistics to analyze the variables that may have an impact on the spread of the virus. The variables studied were divided into two groups: atmospheric variables, such as precipitation and temperature, provided by INOCAR; and sociodemographic variables, such as socioeconomic level, population density, education level and coverage of basic services, downloaded from the Geoportal of the IGM. COVID-19 data were taken from a pandemic sentinel hospital located in the south of the city. In addition, other data sources were used to spatialize the COVID-19 cases: to georeference the location of each patient, data from the Municipality of Guayaquil, Google Maps, and the former SENPLADES (distribution of territory) were used. With all these databases, the data were cleaned and each patient was geolocated at the block or street-intersection level; the atmospheric and sociodemographic analyses were then carried out separately. In the modeling we used principal component analysis (PCA), multiple linear regression (MLR), and Bayesian inference with the R package INLA, using a Poisson distribution, nearest neighbors, random noise effects, and posterior distributions. The results are several maps of zones (in this case circuits, represented as polygons) showing the standardized incidence risk (SIR) and relative risk (RR) of COVID-19 contagion in 2020 and 2021.
The results show that the atmospheric variables do not contribute significantly to the analysis. Among the sociodemographic variables, education level and socioeconomic level contribute significantly, as the covariates with the greatest influence on the spread of COVID-19, whereas the supply of basic services and population density have lower weights in the fitted models; there is an inverse correlation of these weights between 2020 and 2021.
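The abstract does not define how the SIR was computed. A common definition, assumed here, divides the observed cases in each zone by the cases expected if the city-wide rate applied to that zone's population. The sketch below computes this raw quantity for hypothetical zones; the Bayesian smoothing done with INLA in the study is not reproduced.

```python
def standardized_incidence(observed_cases, population):
    """Observed / expected cases per zone, where expected cases apply the
    overall rate to each zone's population (an assumed definition)."""
    overall_rate = sum(observed_cases.values()) / sum(population.values())
    return {
        zone: observed_cases[zone] / (population[zone] * overall_rate)
        for zone in observed_cases
    }


# Hypothetical circuits with equal populations but different case counts.
observed_cases = {"circuit_1": 30, "circuit_2": 10}
population = {"circuit_1": 1000, "circuit_2": 1000}

sir_by_zone = standardized_incidence(observed_cases, population)
print(sir_by_zone)
```

Values above 1 mark zones with more cases than the overall rate would predict. Raw ratios of this kind are unstable for small zones, which is one usual motivation for model-based estimates such as those described in the abstract.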
Reliable Inference from Imperfect Data
- Dr Mansour Fahimi (Marketing Systems Group) – Presenting Author
The survey research landscape is rapidly evolving. Above all, in an era of diminishing response rates and escalating costs, more effective sampling alternatives are no longer academic curiosities. While the new realities suggest that departures from traditional sampling methods are becoming inevitable, they also beckon an immediate question as survey researchers continue to experiment with hybrid sampling techniques. That is, are such sampling alternatives conducive to the inferential needs of scientific surveys by reaching a representative subset of the target population in a pragmatic and cost-effective manner? While the literature on how to improve the external validity of survey estimates from nonprobability samples is maturing, existing investigations have focused solely on surveys of the general population. In particular, there are no recent studies on how to improve the inferential integrity of surveys of hard-to-reach cohorts, such as young adults, that rely on hybrid samples. Moreover, proposed methodologies are often theoretical in nature or pertain to ad-hoc techniques with limited scalability. This presentation shares results from two large surveys of individuals 15 to 24 for whom the employed hybrid samples were secured from two sources: a probability-based sample of addresses from the USPS delivery database, and supplementary samples from various online panels. Specifically, we will illustrate effective calibration procedures that go beyond basic geodemographic weighting adjustments to improve the representation of surveys of teens and young adults from online panels that are often subject to compromised representations.
Optimism and cryptoasset ownership
- Dr Kim Huynh (Bank of Canada)
- Dr Christopher Henry (Bank of Canada) – Presenting Author
- Dr David Jacho-Chavez (Emory University, USA)
- Miss Gabriela Coada (Emory University, USA)
The level of optimism can affect the economic behavior and outcomes of agents (Brunnermeier and Parker, 2005). A novel measure of optimism was developed in Puri and Robinson (2007) and shown to be empirically relevant using data from the U.S. Federal Reserve’s Survey of Consumer Finances. The Bank of Canada included a question on self-reported life expectancy in their 2021 and 2022 Bitcoin Omnibus Survey (BTCOS). This survey helps the Bank to monitor trends in Canadians' awareness, ownership, and use of Bitcoin. Balutel et al. (2022) provides the survey instrument and methodological details for the 2021 version. Fieldwork for the 2022 BTCOS just concluded. The total sample size is approximately 2,000 individuals, with roughly 200 Bitcoin owners. Besides self-reported life expectancy, the BTCOS also contains information on ownership of Bitcoin and altcoins, price expectations of Bitcoin, duration of Bitcoin ownership, etc., in addition to socio-demographic information and financial literacy measures of the respondents. Bitcoin ownership has increased remarkably in the last several years. However, the relationship between various characteristics and behaviors of individuals and cryptocurrency ownership still remains an open question. Using various high-dimensional regression methods, this research proposes to conduct an empirical analysis of optimism and cryptoasset ownership. Since people are asked to self-report how long they think they will live and the survey contains their socio-demographic information, one can consult Canada's Actuarial Life Tables to assign a life expectancy to each individual, hence giving us an objective measure of how optimistic they are, as in Puri and Robinson (2007). Specifically, using regression methods I plan to uncover whether optimism is correlated with future price expectations, and whether optimism relates to financial or Bitcoin literacy. The latest high-dimensional regression tools of Belloni et al. (2014) and Chernozhukov et al. (2018) will be used for this research, as well as common visualization devices such as specification curves as in Gao et al. (2021).
References:
Balutel, D., W. Engert, C. S. Henry, K. P. Huynh, and M. Voia (2022): "Private digital cryptoassets as investment? Bitcoin ownership and use in Canada, 2016-2021," Staff Working Paper 2022-44, Bank of Canada.
Belloni, A., V. Chernozhukov, and C. Hansen (2014): "Inference on treatment effects after selection among high-dimensional controls," The Review of Economic Studies, 81, 608–650.
Brunnermeier, M. K. and J. A. Parker (2005): "Optimal Expectations," American Economic Review, 95, 1092–1118.
Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins (2018): "Double/debiased Machine Learning for Treatment and Structural Parameters," The Econometrics Journal, 21, C1–C68.
Gao, M., H. Leung, and B. Qiu (2021): "Organization Capital and Executive Performance Incentives," Journal of Banking & Finance, 123, 106017.
Puri, M. and D. Robinson (2007): "Optimism and Economic Choice," Journal of Financial Economics.
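The optimism measure described in this abstract, self-reported life expectancy compared with an actuarial benchmark for a person with the same characteristics, can be sketched as a simple difference. The life-table values below are hypothetical placeholders, not figures from Canada's actuarial life tables, and conditioning only on sex and age is a simplification.

```python
# Hypothetical expected age at death by (sex, current age).
# These numbers are placeholders for illustration only.
HYPOTHETICAL_LIFE_TABLE = {
    ("F", 30): 84.0,
    ("M", 30): 80.0,
    ("F", 50): 85.0,
    ("M", 50): 81.5,
}


def optimism(self_reported_age_at_death, sex, age, life_table):
    """Self-reported minus actuarial expected age at death.
    Positive values indicate optimism, negative values pessimism."""
    return self_reported_age_at_death - life_table[(sex, age)]


# Hypothetical respondents: (self-reported age at death, sex, current age).
respondents = [(90.0, "F", 30), (75.0, "M", 50)]

optimism_scores = [
    optimism(reported, sex, age, HYPOTHETICAL_LIFE_TABLE)
    for reported, sex, age in respondents
]
print(optimism_scores)
```

A score built this way could then enter the proposed regressions as an explanatory variable for Bitcoin ownership or price expectations.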