Text analysis to anticipate genocide

A new database project attempts to identify impending genocide by spotting key textual indicators.  It’s crowdsourced, called Hatebase, and a co-sponsor describes it like so:

Hatebase, an authoritative, multilingual, usage-based repository of structured hate speech which data-driven NGOs can use to better contextualize conversations from known conflict zones.

A fascinating idea, one part digital humanities, one part pre-crime.  It can also be localized, as a

critical concept in Hatebase is regionality: users can associate hate speech with geography, thus building a parallel dataset of “sightings” which can be monitored for frequency, localization, migration, and transformation.
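As a rough illustration of what monitoring such a “sightings” dataset could look like (the field names and records below are invented for illustration, not Hatebase’s actual schema), one could aggregate sightings by region and term to watch for spikes in frequency or localization:

```python
from collections import Counter

# Hypothetical sighting records -- field names are made up for
# illustration and do not reflect Hatebase's real data model.
sightings = [
    {"term": "term_a", "region": "Region X", "date": "2013-03-01"},
    {"term": "term_a", "region": "Region X", "date": "2013-03-04"},
    {"term": "term_b", "region": "Region Y", "date": "2013-03-02"},
]

# Count sightings per (region, term) pair; a rising count for one
# region is the kind of signal the project describes monitoring.
freq = Counter((s["region"], s["term"]) for s in sightings)

for (region, term), n in freq.most_common():
    print(region, term, n)
```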

(via Slashdot)


Against Nate Silver

Critiques of Nate Silver’s work have appeared from the right this season, so it’s unusual to read one coming from another direction.

Short version: “Silver Ignores Politics and Loves Experts.”

Longer version: Cathy O’Neil notes that Silver blames the 2008 financial crash on bad modeling… but not on corruption, or unethical business practices.

We didn’t have a financial crisis because of a bad model or a few bad models. We had bad models because of a corrupt and criminally fraudulent financial system.

That’s an important distinction, because we could fix a few bad models with a few good mathematicians, but we can’t fix the entire system so easily. There’s no math band-aid that will cure these boo-boos.

Similarly, with regard to bad modeling in medical research,

flaws in these medical models will be hard to combat, because they advance the interests of the insiders: competition among academic researchers to publish and get tenure is fierce, and there are enormous financial incentives for pharmaceutical companies.

I haven’t gotten to either of those chapters in Silver’s book yet, but yikes!  What blinders.

Predicting movie ticket sales through Wikipedia edits

How can we use big data to grapple with the future?  An intriguing paper presents a new take on this question, using Wikipedia edits to forecast movie box office.  “Early Prediction of Movie Box Office Success based on Wikipedia Activity Big Data” (Márton Mestyán, Taha Yasseri, János Kertész) (pdf) shows a model owing much to previous efforts relying on Twitter mentions, but moving in a different direction.

We show that the popularity of a movie could be predicted well in advance by measuring and analyzing the activity level of editors and viewers of the corresponding entry to the movie in Wikipedia.

Note that this is a contextual approach, not based on the actual content of a movie (“the Wikipedia model does not require any complex content analysis”).
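A minimal sketch of that general approach (not the paper’s actual model; their features, data, and method differ) might regress box office revenue on pre-release Wikipedia activity signals such as edit counts, distinct editors, and page views:

```python
import numpy as np

# Toy activity features per movie: [edits, distinct editors, page views].
# All numbers are invented for illustration; the paper uses real
# Wikipedia activity data.
X = np.array([
    [120,  40,  50_000],
    [300,  90, 200_000],
    [ 80,  25,  30_000],
    [500, 150, 400_000],
], dtype=float)
y = np.array([10.0, 45.0, 6.0, 90.0])  # opening box office, $M (made up)

# Fit ordinary least squares with an intercept column.
A = np.hstack([np.ones((len(X), 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Predict for a hypothetical upcoming movie from its activity so far.
new_movie = np.array([1, 200, 60, 120_000], dtype=float)
print(coef @ new_movie)
```

The point of the contextual framing is visible here: nothing about the movie’s content enters the model, only how much attention its Wikipedia entry attracts.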

The maths go over my head, which reminds me to do some studying.

One minor note: I like the way they define rigor.

(via Technology Review)