Tag Archives: big data

police officers accidentally film themselves planting evidence

According to The Intercept,

Last month, the city public defender’s office discovered body camera footage showing a local cop placing a bag of heroin in a pile of trash in an alley. The cop, unaware he was being filmed, walked out of the alley, “turned on” his camera, and went back to “find” the drugs. The cop then arrested a man for the heroin, placed him in jail. The man, who couldn’t afford to post the $50,000 bail, languished there for seven months. He was finally released two weeks ago, after the public defender’s office sent the video to the state attorney.

So, I think I support body cameras, and I don’t really see why honest police officers wouldn’t support them too. Maybe they shouldn’t even have an “off” button.

what is a p-value?

FiveThirtyEight has a video of statisticians trying to explain what a p-value is. Well, what’s disturbing to me is that they won’t really try. Then again, the maker of the video may very well have cherry-picked the most entertaining answers. I can’t reproduce the research, so I have no way of knowing.

Here’s another article slamming the humble p-value. It’s true: run enough tests on a large enough data set and some false positives will always turn up. As an engineer, I try to use statistics to back up (or not) a tentative conclusion I have reached based on my understanding of a system. In the other direction, I will question a statistically significant result using that same understanding. That way statistics and system thinking can reinforce each other, rather than relying exclusively on one or the other. Another way to think about this is that as data sets grow and our traditional engineering system analysis methods take too long to apply, we can use statistics to weed out a lot of the data that is clearly just noise, and then focus our brains on a reduced data set that we are pretty sure contains the signal, although we know there are some false positives in there. So I say relax, use statistics, but don’t expect statistics to be a substitute for thinking. Thinking still works.
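To make the false-positive point concrete, here is a minimal sketch of my own (not taken from either article). It runs many comparisons on pure noise, where there is no real effect to find, and counts how often the p-value still falls below 0.05. The function names, the sample sizes, and the choice of a simple z-test with a known standard deviation are all assumptions made for illustration.

```python
import random
from statistics import NormalDist, mean


def two_sided_p_value(group_a, group_b, sigma=1.0):
    """Two-sided z-test for a difference in means.

    Assumes both groups share a known standard deviation `sigma`
    (an illustrative simplification, not a general-purpose test).
    """
    n_a, n_b = len(group_a), len(group_b)
    standard_error = sigma * (1 / n_a + 1 / n_b) ** 0.5
    z = (mean(group_a) - mean(group_b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(num_tests=2000, sample_size=30, alpha=0.05, seed=0):
    """Fraction of noise-only comparisons that look 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_tests):
        # Both groups come from the same distribution, so any
        # "significant" difference is a false positive by construction.
        a = [rng.gauss(0, 1) for _ in range(sample_size)]
        b = [rng.gauss(0, 1) for _ in range(sample_size)]
        if two_sided_p_value(a, b) < alpha:
            hits += 1
    return hits / num_tests


if __name__ == "__main__":
    print(f"share of noise-only tests with p < 0.05: {false_positive_rate():.3f}")
```

The share should land somewhere near the 0.05 threshold itself, which is the point: screen enough noise and a predictable slice of it passes the test, so the survivors still need a sanity check against what you know about the system.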


how do you value data?

This article lists six ways a company or organization can try to value its data:

  1. Intrinsic value of information. The model quantifies data quality by breaking it into characteristics such as accuracy, accessibility and completeness.
  2. Business value of information. This model measures data characteristics in relation to one or more business processes. Accuracy and completeness, for example, are evaluated, as is timeliness…
  3. Performance value of information…measures the data’s impact on one or more key performance indicators (KPIs) over time
  4. Cost value of information. This model measures the cost of “acquiring or replacing lost information.”
  5. Economic value of information. This model measures how an information asset contributes to the revenue of an organization.
  6. Market value of information. This model measures revenue generated by “selling, renting or bartering” corporate data

Another article says that algorithms are becoming less valuable as data becomes more valuable.

Google is not risking much by putting its algorithms out there.

That’s because the real secret sauce that differentiates Google from everybody else in the world isn’t the algorithms—it’s the data, and in particular, the training data needed to get the algorithms performing at a high level.

“A company’s intellectual property and its competitive advantages are moving from their proprietary technology and algorithms to their proprietary data,” Biewald says. “As data becomes a more and more critical asset and algorithms less and less important, expect lots of companies to open source more and more of their algorithms.”


where are the refugees from?

Here’s a pretty awesome data analysis on where (legal) refugees who enter the U.S. come from, and where they go. It’s great both for the information, and for the presentation of the information, which is simple yet highly effective. Click on the link, but here are a few facts to whet your appetite:

  • The country of origin for the most refugees to the U.S. in 2014 was Iraq, at 19,651.
  • Surprisingly (to me at least), next is Burma at 14,577.
  • Rounding out the top five are Somalia (9,011), Bhutan (8,316), and D.R. Congo (4,502).
  • After Cuba (4,063), the next highest country from Central or South America is Colombia at 243.

I might have guessed Iraq, but I don’t think I would have guessed anything else on this list. In a number of cases, there are groups of essentially stateless people living in various places (Bhutan and Burma, for example) that the U.S. has agreed to resettle in fairly large groups. In other cases, there are just a handful of people from a given country granted refugee status in a given year. It is a little hard to make sense of why one group is allowed and the next is not.

Edward Tufte

Here’s a fun interview with Edward Tufte, insult comic and author of The Visual Display of Quantitative Information. Here are a couple of his snappy retorts:

…highly produced visualizations look like marketing, movie trailers, and video games and so have little inherent credibility for already skeptical viewers, who have learned by their bruising experiences in the marketplace about the discrepancy between ads and reality (think phone companies)…

…overload, clutter, and confusion are not attributes of information, they are failures of design. So if something is cluttered, fix your design, don’t throw out information. If something is confusing, don’t blame your victim — the audience — instead, fix the design. And if the numbers are boring, get better numbers. Chartoons can’t add interest, which is a content property. Chartoons are disinformation design, designed to distract rather than inform. Thus they reduce the credibility of your presentation. To distract, hire a magician instead of a chartoonist, for magicians are honest liars…

Sensibly-designed tables usually outperform graphics for data sets under 100 numbers. The average numbers of numbers in a sports or weather or financial table is 120 numbers (which hundreds of million people read daily); the average number of numbers in a PowerPoint table is 12 (which no one can make sense of because the ability to make smart multiple comparisons is lost). Few commercial artists can count and many merely put lipstick on a tiny pig. They have done enormous harm to data reasoning, thankfully partially compensated for by data in sports and weather reports. The metaphor for most data reporting should be the tables on ESPN.com. Why can’t our corporate reports be as smart as the sports and weather reports, or have we suddenly gotten stupid just because we’ve come to work?

It’s a very interesting point, actually, that people are willing to look at very complex data on sports sites, really study it and think about it, and do that voluntarily, considering it fun rather than boring, hard work. It’s childlike in a way – I mean in a positive sense, that for children the world is fresh and new and learning is fun. What is the secret of not shutting down this ability in adults? I think it’s context.


Algorithms don’t sound like a topic for riveting reading, but these two articles are pretty good.

The first is from a marketing magazine, Adbusters. The claim it makes – that markets have never really worked before, but are starting to work now because of computer algorithms – is a bit of a stretch, but entertaining. Here’s a quote:

The critical flaw in Hayek’s vision of the hand was that a “central body” could never gather enough information. We know this to be untrue, and with big data and the analysis and manipulation of that data through algorithmic equation, the missing link between money and the machine was discovered.

The searches we make, the news we read, the dates we go on, the advertisements we see, the products we buy and the music we listen to. The stock market … All informed by this marriage between mathematics and capital, all working together in perfect harmony to achieve a singular goal — equilibrium. But it’s a curious sort of equilibrium. Less to do with the relationship between supply and demand, and more about the man and the market.

All these algorithms we encounter throughout the day, they’re working toward a greater goal: solving problems and learning how to think. Like the advent and rise of high–frequency trading, they’re part of an optimization trend that leads to a strange brand of perfection: automated profit.

The second, from ESPN, is about how numbers are being crunched by big-time professional sports gamblers:

Eventually, he grew to understand one of Walters’ keys to success: Some of his bets were intentional losers, designed to manipulate the bookmakers’ odds. Walters might bet $50,000 on a team giving 3 points, then $75,000 more on the same team when the line reaches 3.5. The moment the line gets to 4, a runner is instructed to immediately place a larger bet — perhaps $250,000 — on the other team. The $125,000 on the initial lines will be lost, but if things go according to plan, the $250,000 on the other side will win enough to make up for it many times over. Walters uses the same method on multiple games, often risking millions each weekend.

Since the days of the Computer Group, analytically inclined professional gamblers have relied on technology as well as research to produce what is called a delta: the difference between the Vegas line and what the bettors conclude the point spread should be. The greater the delta, the more money a gambler like Walters will bet. There’s nothing illegal about manipulating lines, and many prominent gamblers have the ability to move a line with as little as $1,000. Walters’ strategy is simply more sophisticated and uses more people, better information and, of course, more dollars bet in far more places than anyone else’s, insiders say…

The vast Walters network also includes a guy on the East Coast known as The Reader, who scans local newspapers, websites, blogs and Twitter for revealing tidbits or injury updates. That information is weighed and plugged into the computer alongside other statistical data — from field conditions to intricate breakdowns of officiating crews. Armed with algorithms and probability theories, the objective is to find the mispriced team, then hammer the line to where Walters wants it.
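The “delta” in the ESPN excerpt is simple arithmetic, so here is a rough sketch of it. The article only says that the delta is the gap between the Vegas line and the bettor’s own number, and that a bigger delta means a bigger bet. Everything else below is my assumption for illustration: the function names, the linear sizing rule, and the dollar figures are hypothetical and are not Walters’ actual method.

```python
def delta(vegas_line, model_line):
    """Gap, in points, between the bettor's own spread and the Vegas line.

    Both lines are expressed as points the favorite gives.
    """
    return model_line - vegas_line


def stake(delta_points, dollars_per_point=50_000, max_stake=250_000):
    """Hypothetical sizing rule: bet more as the delta grows, up to a cap.

    The linear scaling and both dollar amounts are made-up illustrative
    values, not figures from the article.
    """
    return min(max_stake, abs(delta_points) * dollars_per_point)


if __name__ == "__main__":
    # Example: Vegas has the favorite giving 3 points; the model says 5.5.
    gap = delta(vegas_line=3.0, model_line=5.5)
    print(f"delta: {gap} points, stake: ${stake(gap):,.0f}")
```

The sign of the delta says which side looks mispriced and its size drives the bet. The line-moving tactics described in the excerpt are a separate layer on top of this and are not modeled here.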