Day 1: Introduction to Working with Open Data

Raymond Yee

January 21, 2014 (http://is.gd/wwod1401)

Goals Today

Course Overview

INFO 290T - Working with Open Data
http://www.ischool.berkeley.edu/courses/290t-wod
Spring 2014 / CCN: 41620
T,Th 2:00-3:30pm 202 South Hall
Office Hours: T, Th 3:30-4:30pm, 302 South Hall (along with possible virtual office hours)
Instructor: Raymond Yee, Ph.D.
Contact info:
Twitter: @WorkingOpenData / @rdhyee
Tutor: AJ Renold

bcourses site to be unveiled soon...

Course Description

Open data -- data that is free for use, reuse, and redistribution -- is an intellectual treasure-trove that has given rise to many unexpected and often fruitful applications. In this course, students will

  1. learn how to access, visualize, clean, interpret, and share data, especially open data, using Python, Python-based libraries, and supplementary computational frameworks

  2. understand the theoretical underpinnings of open data and their connections to implementations in the physical and life sciences, government, social sciences, and journalism.

Working with Open Data (WwOD) is a technical course with a strong focus on the social-political context and domains of application of open data.

Prerequisite

Info 206 Distributed Computing Applications and Infrastructure or equivalent background with Python.

Grading Scheme

Subject to Change

  1. problem sets (25%)
  2. mid-term exam (30%)
  3. project proposal (5%)
  4. final project (25%)
  5. participation (15%)

Coverage vs Discovery

Main Textbook

Wes McKinney. Python for Data Analysis. (O'Reilly Media, 2012). I strongly recommend getting a paper copy as well as accessing any electronic versions.

IPython Notebook Integral to Course

Working through IPython Notebooks created by the instructor is the primary vehicle for learning at the beginning of the course.

See "A gallery of interesting IPython Notebooks" on the ipython/ipython wiki.

Supplementary Materials

I plan to supplement the book with materials covering the following topics:

In addition to survey materials on the public domain, creative commons, and open data movements, I'll focus us on

and other data sets still to be determined, probably large open scientific data sets

What we did in 2013

A narrative about last year's course co-written by Fernando Perez and Raymond Yee: Exploring Open Data with Pandas and IPython at the Berkeley I School -- includes abstracts of last year's projects.

Flow of Logic for Course

Course Outline

Last revised: 2014.03.20

  1. Introduction to Working with Open Data
  2. Setting Up for Python & IPython
  3. Setting Up Cont'd: Environments and Contexts
  4. Numpy & Pandas Intro
  5. Geographical Hierarchies in the Census
  6. Generators for Geographic Entities
  7. Calculating Diversity I
  8. Calculating Diversity II
  9. Creating Projects
  10. Calculating Diversity III
  11. Looking Ahead to Projects, Plotting, and Baby Names
  12. Baby Names and Plotting
  13. Baby Names II and mpld3
  14. Baby Names III
  15. Preparing for the Midterm I
  16. Preparing for Midterm II
  17. MIDTERM (Day 17, 2014-03-18)
  18. Looking ahead
  19. Class Presentations (I)
  20. Class Presentations (II)
  21. Open House

To be scheduled:

Major Deadlines

It is the student’s responsibility to notify the instructor(s) in writing by the second week of the semester of any potential conflict(s) and to recommend a solution, with the understanding that an earlier deadline or date of examination may be the most practicable solution.

It is the student’s responsibility to inform him/herself about material missed because of an absence, whether or not he/she has been formally excused.

Stay at home if you are sick

Campus Flu Guideline:

Students should not come to class if they become ill. The University has adopted the CDC recommendation that members of the campus community who develop flu-like illness should self-isolate until at least 24 hours after they are free of fever or signs of fever without the use of medication. Let your students know that they should follow this recommendation in deciding whether or not to come to class.

In return: there will be flexibility and good judgment in how course requirements will be handled.

Projects

Laptops in classroom

I would like everyone to bring a notebook computer to class so that we can work together in class on programming assignments. If you are not able to do so, check in with me.

Why Python?

See McKinney's narration: http://proquest.safaribooksonline.com/book/programming/python/9781449323592/1dot-preliminaries/id2700570

Working definition of open data

From http://en.wikipedia.org/w/index.php?title=Special:Cite&page=Open_data&id=532390265:

Open data is the idea that certain data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control.

http://opendefinition.org/:

A piece of content or data is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and/or share-alike.

OKFestival / OKCon as an indicator of the vibrancy of the international open data community

list of working groups

http://okfn.org/wg/ includes:

In development:

other ties

A Motivating Example: Racial Dot Map

http://bit.ly/rdotmap

and

http://bit.ly/rdotmapintro

With any luck, by the end of the semester we will not only understand how the map works but also be able to reproduce and enhance it. That is, we will learn how to turn Census 2010 data into a map.

Activity 1

Group activity -- discuss and enter answers at http://bit.ly/wwod1401Q

Activity 2: Setting up for Programming

We'll study the population of countries before we dive into the US Census.

Upcoming events

Homework

Try to get IPython installed on your computer before class

AJ wrote a nice set of notes on how to do so: https://github.com/rdhyee/working-open-data-2014/wiki/IPython-Installation-Options

Readings

Optional video to give you a sense of the huge possibilities of IPython: SciPy 2013 :: IPython in depth

APPENDICES

Random and not-so random questions for me that open data can help answer

Reading the news, world news, local news, tech news, understanding new contexts, deepening old interests, controlled serendipity.

Some Big Questions for the Course

data.gov as a good example

http://www.data.gov/

http://www.data.gov/about:

The purpose of Data.gov is to increase public access to high value, machine readable datasets generated by the Executive Branch of the Federal Government.

A primary goal of Data.gov is to improve access to Federal data and expand creative use of those data beyond the walls of government by encouraging innovative ideas (e.g., web applications). Data.gov strives to make government more transparent and is committed to creating an unprecedented level of openness in Government. The openness derived from Data.gov will strengthen our Nation's democracy and promote efficiency and effectiveness in Government.

datasets in data.gov

Data.gov -> Earthquake Feeds - Data.gov -> Real-time Feeds & Notifications -> KML Format -> feed http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/1.0_week_age.kml to Google Maps:

https://www.google.com/maps?q=http:%2F%2Fearthquake.usgs.gov%2Fearthquakes%2Ffeed%2Fv1.0%2Fsummary%2F1.0_week_age.kml
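
The Google Maps link above is just the USGS feed URL percent-encoded and passed as the q parameter. Here is a minimal sketch of how that link can be built in Python. The function name google_maps_url is made up for illustration, and the sketch only constructs the string; whether Google Maps renders a given KML feed this way is an assumption based on the link above.

```python
from urllib.parse import quote

# KML feed URL taken from the Data.gov earthquake feed path above
FEED_URL = "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/1.0_week_age.kml"


def google_maps_url(kml_url):
    """Build a Google Maps link whose q parameter is the percent-encoded KML URL.

    Hypothetical helper for illustration; it only assembles the string.
    """
    # safe=":" leaves the colon as-is and encodes each "/" as %2F,
    # which matches the form of the link shown above.
    return "https://www.google.com/maps?q=" + quote(kml_url, safe=":")


maps_url = google_maps_url(FEED_URL)
print(maps_url)
```

Swapping in a different feed URL gives the matching link for that feed.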

Motivation: why I care about open data and why you might care

Traditional motivations given for open government data:

My personal interests in the area:

Like open source software, open data is an enabler, catalyst, and foundation: you can stand on the shoulders of giants.

Examples of Open Data