Week 3 – Working with Data

Now that we’ve learned a bit more about how humanities data is created, collected, and structured, we’re going to explore some tools for making sense of large datasets and for “tidying” them up. We’ll try out OpenRefine, a tool that is useful for exploring large datasets and preparing them for projects. This requires a software install. Please see the OpenRefine installation documentation in this tutorial, and if you need help, message me on Slack, visit me during office hours, or contact Emily.

In this lab we will:

  • Gain familiarity with a couple of tools for making sense of and manipulating data
  • Follow some sample exercises in OpenRefine and see how they help us deal with large data sets
  • Examine a CSV (comma-separated values) export of the Omeka item metadata from last week’s lab and identify one point of interest in the collection metadata
  • Work with a partner to identify the different ways people approach cleaning and clustering data and to consider how bias and cultural expectations enter data sets

Specs

  • 750-1000 words
  • Report applies ideas from this week’s readings to the questions
  • Author offers meaningful research questions for the datasets
  • Author demonstrates explicit awareness of how data and data structures are affected by their creators’ cultural expectations or biases 
  • Author makes their points in clear and concise ways
  • The work contains no more than 3 grammatical, spelling, or other “mechanical” errors
  • The work contains no more than 2 minor factual inaccuracies and no major factual inaccuracies
  • Upload PDF to Canvas

Lab Instructions

  1. Watch this video, which introduces OpenRefine and demonstrates the tutorial exercises. 
  2. Follow the OpenRefine tutorial with exercises to get familiar with the tool on a typical archival dataset (or the smaller 1000-object dataset if your computer is balking).
  3. Download this CSV of item metadata from last week’s Omeka exercise.
  4. Create a new project in OpenRefine, import the Omeka export data, and try two or three of the techniques you’ve learned on it. 
  5. Meet with your partner to discuss your results from steps 2-4, using the prompts below to guide your discussion.
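If you’re curious what OpenRefine’s clustering feature is doing behind the scenes, here is a minimal Python sketch of its default “fingerprint” key-collision method (lowercase, strip punctuation, then sort and deduplicate tokens, so variant spellings collide on the same key). This is a simplified approximation, not OpenRefine’s exact code, and the sample creator names are invented for illustration.

```python
import re
from collections import defaultdict

def fingerprint(value):
    """Simplified fingerprint key: lowercase, drop punctuation,
    then sort and deduplicate the remaining tokens."""
    tokens = re.sub(r"[^\w\s]", "", value.strip().lower()).split()
    return " ".join(sorted(set(tokens)))

def cluster(values):
    """Group values whose fingerprints collide (key-collision clustering)."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return [g for g in groups.values() if len(g) > 1]

# Hypothetical messy "Creator" column values
creators = ["Dürer, Albrecht", "Albrecht Dürer", "albrecht dürer", "Rembrandt"]
print(cluster(creators))  # the three Dürer variants collide into one cluster
```

In OpenRefine you would review each proposed cluster and choose a canonical value before merging; the algorithm only suggests candidates, and deciding which variants really refer to the same thing is a human judgment call worth discussing with your partner.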

Report (about 750 words)

Please address these prompts in your report. You should draw on your readings as well as your lab experience for the last two prompts. You and your partner are encouraged to discuss these questions, but you must write your reports independently.

  • Describe your experience working with OpenRefine. Were you able to complete all the exercises? What problems did you encounter? Did you have any questions about how it worked? Did you and your partner differ at all in your approaches? 
  • Based on your discussions with your partner, describe at least one notable observation of the Omeka export data or the WCMA dataset after working with it in OpenRefine (this is a useful place to draw on tools like faceting). 
  • What kinds of research questions or projects could you imagine pursuing with the collections you engaged with this week or last week and data-refining tools like OpenRefine? Give at least one example of a project idea. What fields would you need to use to work on that project, and how might they need to be “tidied,” restructured, or clustered? 
  • D’Ignazio and Klein explain why we should see data as already “cooked.” Identify a way that the data from WCMA’s collection or one of the collections you worked with in Week 2 is cooked, and discuss how you could mitigate that or use it to expose structural bias.

Submission Details

  • Submit the lab report as a PDF to Canvas by the end of the day (local time) on Saturday, July 2
  • You can write the report in Google Docs, Word, Pages, or another application; just be sure to save it as a PDF