The newly formed Alan Turing Institute last week hosted a summit on Data Science for Media at the University of Edinburgh’s Informatics Forum. The Forum is an exciting space for some of the finest young minds to come together to collaborate on projects that could actually shape the world of tomorrow.
It’s also a sleek, airy building that wasn’t even in existence when I studied English here in the early Noughties, so it made me feel slightly old.
I was lucky enough to attend along with some colleagues from Johnston Press, and it proved to be a fascinating insight into how data is increasingly underpinning contemporary journalism. I was there for the first of two days, but my notes here are from the first keynote and panel discussion only, as this was most relevant to my interests.
Mike Dewar from the New York Times Labs gave the opening keynote speech on his work on the 28th floor of the NYT building on Eighth Avenue. His speech was titled ‘Design Values for Data Scientists’, but it could also have been called ‘How to Make Data Science Work for a Media Company’.
Mike explained that he works with “creative technologists” in an effort to future-proof the NYT for “three to five years ahead”. He added that the media needs to aspire to more than optimisation, which otherwise becomes what he called a “race to clickbait”.
To structure his talk, he identified three key values of data science: that it should be trustworthy, legible and live. For each of these values he gave examples of NYT Labs projects he thinks embody it, whether they were successful or not.
Curriculum: this was a tool that displayed, on a screen in their office, the browser text of whatever Dewar and his colleagues were looking at online. The question it explored was what it’s like to be watched. Unsurprisingly, seeing their supposedly private internet habits on a public screen made people uncomfortable, so Dewar and his colleagues modified the tool to make it trustworthy: open-source, clearly written, and with the option to disengage.
Editor: this tool was a basic Word-style program that sent text to the NYT corpus as it was typed, delivering live analysis and markup back to the journalist. However, the question of whether this was trustworthy had not yet been addressed.
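To make the idea concrete, here’s a minimal sketch of that kind of live markup. Everything in it is invented for illustration — the entity list, tags and matching logic are hypothetical stand-ins, not the NYT’s actual implementation, which presumably queries the full corpus on each keystroke:

```python
import re

# Hypothetical stand-in for a newsroom corpus: entities already
# written about, mapped to a tag.
KNOWN_ENTITIES = {
    "Alan Turing Institute": "organisation",
    "Edinburgh": "place",
    "Mike Dewar": "person",
}

def annotate(text):
    """Return (entity, tag, position) triples found in the draft so far.

    A real tool would run a live corpus lookup as the journalist
    types; here it is a simple substring scan over a toy dictionary.
    """
    annotations = []
    for entity, tag in KNOWN_ENTITIES.items():
        for match in re.finditer(re.escape(entity), text):
            annotations.append((entity, tag, match.start()))
    return sorted(annotations, key=lambda a: a[2])

# Each new version of the draft gets re-annotated as it is typed:
draft = "Mike Dewar spoke in Edinburgh."
print(annotate(draft))
```

Even this toy version shows why trust matters: the journalist’s unfinished draft has to leave their machine for the markup to come back.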
Iris: this was essentially a feature of the NYT iPad app that displayed three related articles at the end of any piece. But unlike simple keyword-based recommendations, Iris surfaced three labelled options: ‘similar in style’, ‘similar topic’ and, most interestingly, ‘readers like you like this’. The result is a data-based tool that’s clearly legible to even the most casual reader.
Behavioural Segmentation: my notes on this are scant, but it was a simple model that treated a reading session on the NYT website as a “set of transitional possibilities”. It ran on live data about different types of users, and the interesting thing was the visualisation: each user type was clearly represented by the shape of a different column, adding complexity to the NYT’s understanding of its audience.
Delta: this was a tool to allow the NYT to “see all the audience right now”. It’s essentially a global view of NYT live traffic, with each pixel representing one page view. To me, it looked more like an executive-pleasing toy than a project with practical use. But it was quite cool, admittedly.
Streamtools: again, my notes are brief here, but this was an open-source project to allow journalists to work with live data, giving a graphical overview.
After Mike’s speech there was a panel discussion on Data Journalism, moderated by my colleague Frank O’Donnell, managing editor of The Scotsman and Scotland on Sunday.
On the panel were Crina Boros (a data journalist at Greenpeace), Evan Hensleigh (visual data journalist at The Economist), Rachel Schutt (chief data scientist at Newscorp) and Jacqui Maher (interactive journalist at BBC News Labs).
Rather than give an exhaustive report, here are some of the quotes and topics that caught my attention.
Making data journalism mainstream
Frank opened the debate with a challenge: given the current state of the industry, how can media companies make data journalism worth investing in? One look at the panel underlines the fact that this is, for the most part, a field still dominated by the global news brands and publicly funded organisations, rather than regional and local publications.
My personal suspicion here is that it’s not always cost that’s held newspapers back, but rather a reluctance by traditional journalists and editors to really interrogate non-human sources like databases.
Crina did not have much sympathy for this kind of wilful ignorance, saying that the tools are there to “interview the data source”. Jacqui made the point that most news orgs “still have a problem with silos”, that “reporters, designers and programmers should work together, and all be called journalists”.
Ethics and bias in data journalism
While it was pointed out that data journalism still lacks a code of ethics, Rachel said that, for now, it’s “about being an ethical person”.
One question discussed was whether, if you train traditional journalists in data journalism, there is a risk of bias — of having their desired story in mind before they even look at the numbers.
Crina didn’t think this was a risk: “The story is in the data, start with the facts and figures.”
Evan added that “there’s nothing wrong with a hunch, but you have to be willing to be proved wrong.”
Rachel, meanwhile, underlined the importance of not treating the data itself as sacred and beyond question: data journalists should ask where the data came from. For example, if it consists of survey answers, how were the questions framed?
There were other points raised and topics covered, but Mike’s keynote, followed by the panel discussion, provided a useful insight into an area of journalism that’s growing in importance. From the NSA files to local government budgets, so often the story is in the data, and if journalists can’t coax it out then it’s a problem not just for the industry but for democracy at large.