Data Science and Consequences in Mathematical Modeling


Jedediyah Williams

This page contains resources for a session at the 2021 NCTM Fall Virtual Conference.

slides: link to slides        video: link to video of talk

Below is an overview of the material from the session along with additional resources. At the core of what I want to communicate are two ideas:

  1. Math is awesome
  2. Doing harm is bad
These are not original ideas.

A Data Science Process

Overview

The figure below depicts a Data Science Process. This process developed as multiple disciplines converged on modeling with data (Chapman 1999, Han 2012 Figure P.1, Schutt and O'Neil 2014 p. 41, Estrellado 2020 Chp. 3, Nantasenamat 2020, Kolter 2021).


Roughly, this process involves getting data, cleaning it up, inspecting it, modeling it, and sharing the results. In practice the process is not so linear; rather, like the engineering design process, it iteratively converges toward a solution.

This process is a structure for modeling with data and a tool for incorporating concepts of data science into existing courses. Below is a brief description of the five stages.

Data: Design and Acquire
The first stage involves getting data. This might require the design of an experiment or survey, or it could involve simply downloading a dataset. In any case, we require a data problem and data with which to solve it.

Preprocess: Wrangle and Prepare
Preprocessing data might be the stage with which students (and teachers) are least familiar, but by many accounts it is where data scientists spend the majority of their time (Kelleher and Tierney 2018, 65-67). Preprocessing involves cleaning up and wrangling data into usable formats. If you have ever combed through a spreadsheet to find cells that interpreted an entry as a date when it wasn't, you have a sense of the preprocessing experience and how messy it can be.
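As a concrete illustration, here is a minimal wrangling sketch in Python with pandas. The file name measurements.csv and the column height_cm are hypothetical placeholders, not part of any particular project.

```python
import pandas as pd

# Read everything as text first so the parser cannot silently reinterpret entries.
df = pd.read_csv("measurements.csv", dtype=str)

# Coerce a measurement column back to numbers; anything that is not numeric
# (e.g., an entry a spreadsheet turned into a date) becomes NaN.
df["height_cm"] = pd.to_numeric(df["height_cm"], errors="coerce")

# Inspect the rows that failed to parse, then drop them.
print(df[df["height_cm"].isna()])
df = df.dropna(subset=["height_cm"])
```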

Explore: Visualize and Inspect
The third stage of Explore almost always involves visualizing data, particularly for the data projects our students work on. In fact, it is not unusual to see this stage called "Visualize". Exploring can also involve analyzing descriptive statistics to become familiar with the data and get a sense of its shape.
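For instance, a quick exploration in Python might look like the sketch below (hypothetical file and column names); students could just as easily do this in Google Sheets or Desmos.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file with two numeric columns named "x" and "y".
df = pd.read_csv("measurements.csv")

# Descriptive statistics give a first sense of center, spread, and range.
print(df.describe())

# A quick scatter plot to notice and wonder about a possible relationship.
plt.scatter(df["x"], df["y"])
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```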

Model: Model and Validate
This is the most important stage in some respects. Chances are good that your curriculum documents emphasize this stage and largely ignore the others! A critical component of modeling which is often left out is validation. Validation involves actually checking that your model works.

Communicate: Communicate and Deploy
The stage of Communicate could also include "Deploy", i.e., incorporating a model into a product or implementing a policy decision. While some students hopefully will deploy products they develop, we emphasize communication as the critical component after developing a mathematical model.


The Data Science Process is useful as a framework for structuring student modeling projects, for bringing concepts of data science into existing courses, and for critically analyzing data technologies.

The expanded figure below lists some of the components of each stage in the process. This partial list includes steps that students might utilize when working with introductory models.


An expanded view of the Data Science Process. In addition to iterating on these stages, some stages may blend together. For example, exploration can easily blend into the modeling stage.

When first introducing the framework, we might scaffold a data modeling project by providing the data, providing explicit walk-throughs of the middle stages, or offering templates for the last stage. As students familiarize themselves with the process and we remove scaffolding, we are working toward a point when the students can execute an entire data modeling project from a single prompt.

While the structure is useful and captures an abstraction of what modern data science often looks like, it is just one conceptualization. Projects often don't flow in a linear progression from the first stage to the last. For example, after the stage of Explore, we might revisit the stage of Preprocess to do some work we missed. After the stage of Model, we might realize we need to find different features, so we revisit Explore. There can also be a lot of overlap between stages.

Example Projects

The following examples are intended to demonstrate the framing of data projects with the Data Science Process. The complete projects are not necessarily presented, but I try to offer enough information that you could do these projects or facilitate them with students.


Example: Bouncy Ball

Data
  • Design
    Pose the question to students:
    Question
    How high does a bouncy ball bounce?

    Discuss and translate this into a data question:

    Data Question
    Given a particular bouncy ball dropped from rest at a height \(h_{i}\) on a consistent surface, what is the predicted peak height \(h_{p}\) of the first bounce?
  • Acquire
    My students use the slo-mo features on their cell phones to record several trials.
Preprocess
  • Wrangle
    The videos tend to be spread across multiple students, and sometimes those students are absent the next class! Students parse the video data while applying some quality control to how measurements are read from the video. Some students develop a procedure in which they independently read the measurements from the video and then average their readings.
  • Prepare
    They create tables in a spreadsheet and label the columns.
Explore
  • Visualize
    We create scatter-plots of the bounce-height \(h_{p}\) as a function of drop-height \(h_{i}\). Some students use Google Sheets and some use Desmos.
  • Inspect
    We notice and wonder about the relationship between the variables we are exploring. Students notice that \(h_p\) appears to have a linear relationship to \(h_i\). They wonder if that relationship continues forever as \(h_i\) increases. Some students notice that there are data points which are far away from the others and they wonder if they should revisit their video data. Some students wonder about the bounce-height for a drop-height of zero!
Model
  • Model
    Students model their data with an appropriate choice of function. If they can do a regression, they find the best-fit.
  • Validate
    We use our models to make several predictions of unrecorded drop heights, then validate our model by assessing those predictions against experiment. Our validation tests include interpolated predictions, extrapolated predictions, and extreme extrapolated predictions (a code sketch of this step follows the example).
Communicate
  • Report
    Students present their work. This could be a presentation, podcast, or report. An important piece of this project is to reflect on the performance of the model on out-of-domain drop-heights.
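For readers who want to see the Model and Validate steps in code, below is a minimal sketch using NumPy. The measurements are hypothetical stand-ins for student data; students could do the same regression in Google Sheets or Desmos.

```python
import numpy as np

# Hypothetical measurements: drop height h_i and peak bounce height h_p, in cm.
h_i = np.array([20, 40, 60, 80, 100, 120])
h_p = np.array([15, 29, 45, 58, 74, 88])

# Model: fit h_p = a * h_i + b by least-squares regression.
a, b = np.polyfit(h_i, h_p, deg=1)
print(f"model: h_p = {a:.2f} * h_i + {b:.2f}")

# Validate: compare predictions against held-out trials, including an
# interpolated, an extrapolated, and an extreme extrapolated drop height.
validation_trials = {50: 37, 110: 82, 250: 168}  # drop height -> measured bounce (hypothetical)
for drop, measured in validation_trials.items():
    predicted = a * drop + b
    print(f"drop {drop} cm: predicted {predicted:.1f} cm, measured {measured} cm")
```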

Example: Blue Fishing

Data
  • Design
    Pose the question to students:
    Question
    What time should Zach go casting for blues this Saturday?

    Discuss and help them translate this into a data question:

    Data Question
    Using the tide data from the NOAA Brant Point station NTKM3, when do we predict high tide will be on Saturday?
  • Acquire
    We download a .csv file of tide data from NOAA, or I will pre-download the data over a specific time interval and send it to them.
Preprocess
  • Wrangle
    We import the data into a spreadsheet, remove any corrupted rows (with missing data), and label the columns.
  • Prepare
    We split the data into training and testing sets. For my students, this is the first lab we do where we utilize this type of validation. There are some data-dependent conventions for this but we often use 70% of our data for training our models and 30% for validation of those models.
Explore
  (Important to note: the testing data is hidden at this stage and we are working only with the training data.)
  • Visualize
    We plot the water level as a function of time for the training data.
  • Inspect
    We notice and wonder about the relationship between the variables we are exploring. "It looks like the sine wave things we studied last week!"
Model
  • Model
    Students model their data with an appropriate choice of function. If they can do a regression, they find the best-fit.
  • Validate
    Only after students are satisfied that they have a good model do we move to the step of validation. To validate their model, they test how well it predicts, or "fits," the testing data (a code sketch of this step follows the example).
    It's a good fit! We can predict with some level of confidence when Zach should go casting for blues on Saturday.
Communicate
  • Report
    Students present their work. This could be a presentation, podcast, or report. We will want to be conscientious of variability and the reliability of our model for predicting future tides. There are also well-known factors that were not present in the training data, e.g., weather. An instructive project would be to collect more NOAA data and test model predictive power over longer time scales.
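For a sense of what the split, fit, and validation can look like in code, here is a minimal sketch using SciPy. The file tides.csv and its columns hours and water_level are hypothetical placeholders for the NOAA export.

```python
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit

# Hypothetical NOAA export with columns "hours" (since start) and "water_level".
df = pd.read_csv("tides.csv")

# Chronological 70/30 split: train on earlier data, hold out the rest for validation.
cut = int(0.7 * len(df))
train, test = df.iloc[:cut], df.iloc[cut:]

# Sinusoidal model: level(t) = A * sin(B*t + C) + D.
def tide(t, A, B, C, D):
    return A * np.sin(B * t + C) + D

# Initial guess: a semidiurnal tide has a period of roughly 12.4 hours.
p0 = [1.0, 2 * np.pi / 12.4, 0.0, train["water_level"].mean()]
params, _ = curve_fit(tide, train["hours"], train["water_level"], p0=p0)

# Validate only now, against the held-out testing data.
pred = tide(test["hours"], *params)
rmse = np.sqrt(np.mean((pred - test["water_level"]) ** 2))
print(f"test RMSE: {rmse:.3f}")
```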

Example: Fruit Classifier

Fruit Classification Colaboratory Project

Data
  • Design
    Pose the question to students:
    Question
    What fruit is in the picture?

    Discuss and help them translate this into a data question:

    Data Question
    What are the distinguishing image features that can be used to accurately classify fruit?
  • Acquire
    The images are downloadable through the Colab.
Preprocess
  • Wrangle
    We parse the image files into Python lists of Image data.
  • Prepare
    The data is prelabeled for us. We split the data randomly into training and testing sets.
Explore
  • Visualize
    We display sample images from the dataset.
  • Inspect
    We notice and wonder about the relationship between the variables we are exploring. There are several features we could identify. The simplest is color! We could look at the amount of red, green, and blue in our attempt to distinguish the fruit.
Model
  • Model
    Students model their data with an appropriate choice of function. This particular model involves some conditional statements about color values (a code sketch follows this example).
  • Validate
    We check the classifier against the held-out testing set to see how often it labels fruit correctly.
Communicate
  • Report
    Students present their work. This particular project focuses on computational thinking and some technical skills.
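Below is a minimal sketch of the kind of conditional, color-based classifier described above. The fruit names and the decision rule are hypothetical; the actual Colab project defines its own classes and features.

```python
import numpy as np
from PIL import Image

def classify_fruit(path):
    """Toy classifier: compare the mean red, green, and blue values of an image."""
    pixels = np.asarray(Image.open(path).convert("RGB"), dtype=float).reshape(-1, 3)
    mean_r, mean_g, mean_b = pixels.mean(axis=0)
    # Hypothetical rule: redder images are apples, greener are limes, otherwise bananas.
    if mean_r > mean_g and mean_r > mean_b:
        return "apple"
    if mean_g > mean_r and mean_g > mean_b:
        return "lime"
    return "banana"

def accuracy(paths, labels):
    """Validate on the held-out testing set: fraction of images labeled correctly."""
    predictions = [classify_fruit(p) for p in paths]
    return np.mean([pred == label for pred, label in zip(predictions, labels)])
```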

Concern

When students are familiar with the Data Science Process, we can leverage its structure to critically analyze data technologies, including those that we are building in class.

When our models are useful, which is to say they make good predictions, it is important to identify how and why they work; this is a critical step in anticipating when they will fail. Given the simplicity of most of the models our students will see, this often involves questions of inference and variability. However, students should understand that as their models grow more complex, so do the complications of applying those models, even to the extent that models are not applicable (Birhane 2021).

It can be incredibly difficult to identify harm done by technologies, let alone predict or anticipate the source of it in increasingly complex data systems that are interacting with complex social systems, often with multiple feedback loops. In critically analyzing data science products, context is everything and there is no limit of nuance. When we deploy models or products to make or aid in decisions that affect lives, we have a moral obligation to validate and monitor those products to avoid doing harm.

Harm from technologies may include, but is not limited to, disproportionate harm done to subpopulations. We have seen that historically marginalized populations are often further marginalized by data technologies (Benjamin 2019, Criado-Perez 2020, Eubanks 2019, Wachter-Boettcher 2017). Bias and harm may be incorporated at each stage of the Data Science Process. Assuming we wish to avoid causing harm, we should be conscientious of these issues and pass this understanding on to students who are active consumers and future practitioners of data technologies.

The list below contains some of the considerations we may face at each stage of the data science process.

Data
  • Problematic Problems
    At the very first step of the process, we translate a problem into a data problem. It is entirely possible that the problem is not one that should be addressed. This could be a value judgment or moral dilemma.
    Another issue is presuming a metric in the framing of the problem, e.g., we want students to do better on a timed test, so we ask: how do we get students to solve problems faster?
  • Harmful Data Collection
    This can be passive, e.g., presuming cultural conventions of name formats in a survey. It can also be aggressive, manipulating vulnerable populations or violently claiming access.
  • Biased Data
    Perhaps the most notorious concern, biased data, e.g., sampling bias, may lead to poorer outcomes for some. Consider medical studies conducted primarily on men.
  • Privacy
    There is far too much to say about this bullet. Remember when a fitness tracking company released customer data that revealed the locations of military personnel?
    When collecting and working with data, security considerations can be critical.
  • Consent
    Each of us has had data collected about us without our consent. At what point does this cross a line?
  • Environmental Impact
    Data consumes resources. There is A LOT of data.
Preprocess
  • Labor Exploitation
    Outsourcing and underpaying.
  • Bad Labeling
    Having non-experts label, e.g., medical data.
  • Trauma Experienced by Labelers
    Labelers who spend hours each day parsing horrific content.
Explore
  • Feature Bias
    Engineering bias into the model through feature choices.
  • Bad Data Visualization
    Bad data visualizations can mislead. This can be exploited maliciously to influence decisions, or it could inadvertently mislead, possibly amplifying or even creating new biases.
  • Data Manipulation
    Not seeing the pattern you were hoping for, then making it.
Model
  • What problem are we solving?
    Algorithms do what we tell them to do, not what we want them to do (Shane 2019, Chapter 5). Connecting back to the question we set out to answer at the first stage of the Data Science Process: having the best data about the wrong thing means we aren't modeling what we expected or assumed. We may discover such a disconnect at this stage.
  • Bias in Model Choice
    The choice of model itself embeds assumptions and can introduce bias.
  • Model-Amplified Bias
    While data may be biased, so are models. This could be due to simple technical issues like getting "stuck" in a local minimum or being anchored by initial training conditions.
  • Environmental Impact
    Many modern models require enormous computational resources.
  • Lack of Robustness
    How sensitive is the model output to variability in input? Where are the corner-cases and how does the model behave there?
  • Peripheral Modeling
    Did the model really learn what we think it did?
Communicate
  • Poor Model Interpretation
    Models only model what they model; it is the humans who built them who interpret the results.
  • Ignoring Variance
    What good are predictions when the errors are so large?
  • Ignoring Conclusions
    Even when predictions are good, it is up to people to utilize them or not. Consider the case of a résumé parser that actually works well but whose results are ignored.
  • Deploying Harmful Products
    There are unfortunately many instances of products being deployed that simply do harm. This can be related to automation bias.
Meta
  • Feedback Loops
    The concept of a feedback loop is used in several fields, e.g., electrical engineering or control systems, and isn't necessarily harmful. Feedback loops lead to harm in data technologies when they amplify bias or exhibit runaway, self-reinforcing performance optimization. Consider the example of predictive policing: an algorithm tells police where to look for crime, and of course police find crime where they look for it. That policing not only has the appearance of affirming the model's prediction but reinforces it, so they police there even more. Even without considering the bias of the data used in the original prediction, it becomes clear that such a system will perpetuate itself by seeming to justify its use until we step back and realize the effect of the feedback loop. Without a critical analysis of such tools, the loop simply continues unchecked (a toy simulation follows this list).
  • Susceptibility to Adversarial Attack
    Algorithms are brittle.
  • Lack of Oversight or Auditing
    "All models are wrong", so all deployed models need some mechanism for reporting issues.

Consequence

Many people are working on issues of technology and society, and particularly data technologies. Some years ago there was a flood of media attention to the sudden demand for philosophers in tech. There were any number of articles about the modern-day reformulation of the trolley problem in the context of self-driving cars. Engineers creating self-driving cars are in a situation where they need to decide how a car should respond in an accident before that accident happens. Who is at fault when a self-driving car causes harm? This is, unfortunately, not hypothetical.

Here are some interesting articles documenting consequences. I kind of want to start collecting them in an organized way. Maybe let me know if you'd like that (and like to help)?

Conclusions

"Our success, happiness, and wellbeing are never fully of our own making. Others' decisions can profoundly affect the course of our lives."
– Barocas et al. 2019, p. 1

When we create models trained on data, we are creating optimization problems to maximize or minimize some set of metrics that we have identified as valuable or have let an algorithm identify as valuable. We are encoding values. The extent to which these algorithms might cause harm can be incredibly difficult to predict or even detect, so it is important that we are able to critically analyze methods and applications so as to minimize harm.

When handing over the tools of mathematics, we are responsible as educators for teaching their responsible use. It is a sin of omission when we fail to acknowledge the consequences of the content we are teaching. If we teach mathematics because it is practical, it is because mathematics is applicable to solving problems, and applied mathematics has consequences. From the problems they choose to solve, to the solutions they choose to find, to the choices embedded in the process of modeling, practitioners of mathematics make decisions that affect us all, and we are their educators.

Resources

  • Fast AI's Practical Data Ethics course.
  • Ethics in Mathematics Readings, curated by Allison N. Miller.
  • The fantastic videos below are all related to books which I also highly recommend. The videos are a good introduction to each author and their work.
  • Automating Inequality - Virginia Eubanks
  • The New Jim Code? - Ruha Benjamin
  • The collapse of artificial intelligence - Melanie Mitchell
  • Invisible Women - Caroline Criado Perez
  • The danger of AI is weirder than you think - Janelle Shane
  • The future of human-robot interaction - Kate Darling
References

    Barocas, Solon, Moritz Hardt, and Arvind Narayanan. 2019. Fairness and Machine Learning. https://www.fairmlbook.org
              
    Benjamin, Ruha. 2019. Race after Technology: Abolitionist Tools for the New Jim Code. Medford, MA: Polity.
            
    Birhane, Abeba. 2021. “The Impossibility of Automating Ambiguity.” Artificial Life 27, no. 1 (June): 44-61.  10.1162/artl_a_00336.
    
    Brokaw, Galen. 2010. A History of the Khipu. Cambridge, United Kingdom: Cambridge University Press.
    
    Chapman, Pete, Julian Clinton, Randy Kerber, Thomas Khabaza, Thomas Reinartz, Colin Shearer, and Rüdiger Wirth. 1999. “CRISP-DM 1.0: Step-by-step Data Mining Guide.” https://www.the-modeling-agency.com/crisp-dm.pdf.
    
    Criado-Perez, Caroline. 2020. Invisible Women: Exposing Data Bias in a World Designed for Men. London, England: Vintage.
    
    Estrellado, Ryan A., Emily A. Freer, Jesse Mostipak, Joshua M. Rosenberg, and Isabella C. Velázquez. 2020. Data Science in Education Using R. New York, NY: Routledge.
    
    Eubanks, Virginia. 2019. Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor. New York, NY: Picador.
    
    Friendly, Michael. 2006. “A Brief History of Data Visualization.” In Handbook of Computational Statistics: Data Visualization, edited by C. Chen, W. Hardle, and A. Unwin. Heidelberg: Springer-Verlag.
    
    Fry, Hannah. 2018. Hello World: How to Be Human in the Age of the Machine. London, England: Transworld
    Publishers.
    
    Gebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2018. “Datasheets for Datasets.” (March). https://arxiv.org/abs/1803.09010.
    
    Grus, Joel. 2019. Data Science from Scratch: First Principles with Python. 2nd ed. Sebastopol, CA: O'Reilly Media, Inc.
    
    Han, Jiawei, Micheline Kamber, and Jian Pei. 2012. Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems.
    
    Hill, Kashmir, and Aaron Krolik. 2020. “Your Kids' Photos May be Powering Surveillance.” New York Times:
    Artificial Intelligence Edition (New York), 2020, 40-45.
    
    Hooker, Sara. 2021. “Moving beyond 'algorithmic bias is a data problem.'” Patterns 2 (4).
    https://doi.org/10.1016/j.patter.2021.100241.
    
    James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical
    Learning: with Applications in R. 1st ed. New York, NY: Springer.
    
    Kelleher, John D., and Brendan Tierney. 2018. Data Science. Cambridge, MA: The MIT Press.
    
    Kolter, Zico. 2021. “Introduction.” Practical Data Science. http://www.datasciencecourse.org/notes/intro/.
    
    Mason, Hilary, and Chris Wiggins. 2010. “A Taxonomy of Data Science.” dataists.
    http://www.dataists.com/2010/09/a-taxonomy-of-data-science/.
    
    Metz, Cade. 2020. “Machine Learning Takes Many Human Teachers.” New York Times: Artificial Intelligence Edition (New York), 2020, 18-24.
    
    Mitchell, Melanie. 2019. Artificial Intelligence: A Guide for Thinking Humans. New York, NY: Farrar, Straus and Giroux.
    
    Nantasenamat, Chanin. 2020. “The Data Science Process.” Towards Data Science.
    https://towardsdatascience.com/the-data-science-process-a19eb7ebc41b.
    
    O'Neil, Cathy. 2016. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. New York, NY: Crown Publishing Group.
    
    Schutt, Rachel, and Cathy O'Neil. 2014. Doing Data Science. Sebastopol, CA: O'Reilly Media, Inc.
    
    Schweinsberg, Martin, et al. 2021. “Same data, different conclusions: Radical dispersion in empirical results when independent analysts operationalize and test the same hypothesis.” Organizational Behavior and Human Decision Processes 165: 228-249. https://doi.org/10.1016/j.obhdp.2021.02.003.
    
    Seth Jones, Ryan, Zhigang Jia, and Joel Bezaire. 2020. “Giving Birth to Inferential Reasoning.” Mathematics Teacher: Learning and Teaching PK-12 113, no. 4 (April): 287-92.
    
    Shane, Janelle. 2019. You Look Like a Thing and I Love You. New York, NY: Voracious / Little, Brown and Company.
    
    Wachter-Boettcher, Sara. 2017. Technically Wrong: Sexist Apps, Biased Algorithms, and Other Threats of Toxic Tech. New York, NY: W. W. Norton & Company.