As you’ve seen in the last two weeks, most of the work in data journalism involves getting data into a form you can use, whether in Excel, SQL or some other program.

We’ve looked at how to extract tables from PDFs and how to extract data from the Web, and we’ve done a quick review of strategies for converting reports into datasets.

Now that you can envision what success looks like, we’ll spend some time with Open Refine, a tool that many data scientists as well as reporters use to clean up messy data.

If you want to bring your laptop to class, we can make sure Open Refine installs properly before you leave for the day. It requires a reasonably up-to-date installation of Java, a free program from Oracle. Most Macs already have it. A Windows computer may also have it, depending on what else is installed, but that’s less likely.

Open Refine has been installed on the lab computers, which I suggest you use for class work.

If you want to put it on your computer, download it here

PLEASE NOTE: You will need to use Open Refine for the tutorial later this week. We’re going to start walking through it together on Wednesday, but you may need to finish it after class. It is due on Friday.

Here are some more resources on Open Refine.

Monday, April 9

As examples, we’ll use the hazards incident-level reports, in CSV form, and the individual campaign finance contributions that we looked at before.

Files for class:

Wednesday, April 11

Work on tutorial, due April 13

You should be sure to come to class if you are at all unsure about how to use Open Refine. This will be a challenging exercise, and I think you’ll want the opportunity to go through it in person.

We’re going to begin a walkthrough of an Open Refine exercise that will serve as your last exam. I’ll be able to answer questions on this one in class. I’m not sure we’ll finish, so you may need to complete it after class. It is due on Friday, April 13.