Tag Archives: data science

April 2017 in Review

Most frightening stories:

Most hopeful stories:

Most interesting stories, that were not particularly frightening or hopeful, or perhaps were a mixture of both:

  • I first heard of David Fleming, who wrote a “dictionary” that provides “deft and original analysis of how our present market-based economy is destroying the very foundations―ecological, economic, and cultural― on which it depends, and his core focus: a compelling, grounded vision for a cohesive society that might weather the consequences.”
  • Judges are relying on algorithms to inform probation, parole, and sentencing decisions.
  • I finished reading Rainbows End, a fantastic Vernor Vinge novel about augmented reality in the near future, among other things.

Richard Berk

Here’s a Bloomberg article on Richard Berk, a statistician at the University of Pennsylvania whose algorithms are used for parole, probation, and sentencing decisions.

Risk scores, generated by algorithms, are an increasingly common factor in sentencing. Computers crunch data—arrests, type of crime committed, and demographic information—and a risk rating is generated. The idea is to create a guide that’s less likely to be subject to unconscious biases, the mood of a judge, or other human shortcomings. Similar tools are used to decide which blocks police officers should patrol, where to put inmates in prison, and who to let out on parole. Supporters of these tools claim they’ll help solve historical inequities, but their critics say they have the potential to aggravate them, by hiding old prejudices under the veneer of computerized precision. Some people see them as a sterilized version of what brought protesters into the streets at Black Lives Matter rallies…


reproducible research in hydrology

This October 2016 article in Water Resources Research on reproducible research got some attention.

Hutton, C., T. Wagener, J. Freer, D. Han, C. Duffy, and B. Arheimer (2016), Most computational hydrology is not reproducible, so is it really science?, Water Resour. Res., 52, 7548–7555, doi:10.1002/2016WR019285.

Reproducibility is a foundational principle in scientific research. Yet in computational hydrology the code and data that actually produces published results are not regularly made available, inhibiting the ability of the community to reproduce and verify previous findings. In order to overcome this problem we recommend that reuseable code and formal workflows, which unambiguously reproduce published scientific results, are made available for the community alongside data, so that we can verify previous findings, and build directly from previous work. In cases where reproducing large-scale hydrologic studies is computationally very expensive and time-consuming, new processes are required to ensure scientific rigor. Such changes will strongly improve the transparency of hydrological research, and thus provide a more credible foundation for scientific advancement and policy support.

slot machines and pseudo-random numbers

Russian hackers have been able to beat a certain brand of slot machine by buying old models, studying their coding, and figuring out the pattern of random numbers they generate. From Wired:

But as the “pseudo” in the name suggests, the numbers aren’t truly random. Because human beings create them using coded instructions, PRNGs can’t help but be a bit deterministic. (A true random number generator must be rooted in a phenomenon that is not manmade, such as radioactive decay.) PRNGs take an initial number, known as a seed, and then mash it together with various hidden and shifting inputs—the time from a machine’s internal clock, for example—in order to produce a result that appears impossible to forecast. But if hackers can identify the various ingredients in that mathematical stew, they can potentially predict a PRNG’s output. That process of reverse engineering becomes much easier, of course, when a hacker has physical access to a slot machine’s innards.

Knowing the secret arithmetic that a slot machine uses to create pseudorandom results isn’t enough to help hackers, though. That’s because the inputs for a PRNG vary depending on the temporal state of each machine. The seeds are different at different times, for example, as is the data culled from the internal clocks. So even if they understand how a machine’s PRNG functions, hackers would also have to analyze the machine’s gameplay to discern its pattern…

… the operatives use their phones to record about two dozen spins on a game they aim to cheat. They upload that footage to a technical staff in St. Petersburg, who analyze the video and calculate the machine’s pattern based on what they know about the model’s pseudorandom number generator. Finally, the St. Petersburg team transmits a list of timing markers to a custom app on the operative’s phone; those markers cause the handset to vibrate roughly 0.25 seconds before the operative should press the spin button.
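The point in the Wired excerpt, that a PRNG is deterministic once you know its ingredients, is easy to show with a toy example. The sketch below is my own illustration, not the algorithm from any real slot machine (the article doesn't describe one). It assumes a simple linear congruential generator with made-up constants, a seed that falls in a window the attacker can guess (say, from a clock), and "spins" that show only the last digit of the hidden state. After watching two dozen spins, the attacker tries every candidate seed, keeps the ones that reproduce the recording, and can then predict the next spin.

```python
# Toy linear congruential generator (LCG). The constants and the seed window
# are illustrative assumptions, not those of any real slot machine.
M = 2**31 - 1   # modulus
A = 48271       # multiplier
C = 0           # increment


def lcg(state):
    """Advance the hidden state by one step."""
    return (A * state + C) % M


def spins(seed, n):
    """Return the n visible spin results (digits 0-9) produced from a seed."""
    results, state = [], seed
    for _ in range(n):
        state = lcg(state)
        results.append(state % 10)
    return results


def recover_seeds(observed, candidates):
    """Keep every candidate seed whose output matches the recorded spins."""
    return [s for s in candidates if spins(s, len(observed)) == observed]


# The machine's side: a seed the attacker does not know.
secret_seed = 1_234_567
observed = spins(secret_seed, 24)   # about two dozen recorded spins

# The attacker's side: knows the algorithm and a plausible range for the seed.
matches = recover_seeds(observed, range(1_230_000, 1_240_000))
predicted_next = spins(matches[0], 25)[-1]
actual_next = spins(secret_seed, 25)[-1]
print(predicted_next == actual_next)
```

Real machines mix in more inputs than this, which is why the operatives in the story needed video of actual gameplay, but the principle is the same: once the state is pinned down, everything after it is predictable.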

Obama’s word clouds

Obama read 10 letters a day while in office. He received about 10,000 pieces of mail and email a day, and his staff had to pick the 10. Also interesting: the staff made word clouds out of all the emails received and posted them in the White House. I find that kind of nice, the idea that words you write were received and might have had an influence in some form even if they weren’t all read.

I’ve always liked the idea of elected officials setting up some kind of voting site where constituents could weigh in on issues or even specific bills. The official could get a daily “report card” of where his or her constituents stand, and this could help to influence his or her decisions. If there were a concern that people logging on to the website were not representative, some form of demographic weighting could be used.
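Here is a minimal sketch of what that demographic weighting could look like, using a simple form of post-stratification: compute the support rate within each group, then combine the groups in proportion to their share of the district rather than their share of the people who logged on. The group names, shares, and responses are all made up for illustration.

```python
def weighted_support(responses, population_share):
    """Estimate support for a bill, reweighted to match the district.

    responses: list of (group, supports) pairs, one per site visitor.
    population_share: dict mapping group -> that group's share of constituents.
    """
    counts, yes = {}, {}
    for group, supports in responses:
        counts[group] = counts.get(group, 0) + 1
        yes[group] = yes.get(group, 0) + (1 if supports else 0)

    # Only groups that actually responded can be weighted; renormalize over them.
    total_share = sum(population_share[g] for g in counts)
    return sum(
        population_share[g] * yes[g] / counts[g] for g in counts
    ) / total_share


# Hypothetical data: younger constituents are overrepresented on the site.
responses = (
    [("under_40", True)] * 60 + [("under_40", False)] * 20
    + [("40_plus", True)] * 5 + [("40_plus", False)] * 15
)
population_share = {"under_40": 0.4, "40_plus": 0.6}

raw = sum(1 for _, supports in responses if supports) / len(responses)
adjusted = weighted_support(responses, population_share)
print(f"raw support: {raw:.2f}, weighted support: {adjusted:.2f}")
```

In this made-up case the raw tally shows a clear majority in favor, while the weighted estimate does not, which is exactly the kind of gap the weighting is meant to catch. It only corrects for the characteristics you weight on, of course.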

how the ATF traces guns

This is one of those in-depth articles that GQ puts out every now and then.

Anytime a cop in any jurisdiction in America wants to connect a gun to its owner, the request for help ends up here, at the National Tracing Center, in a low, flat, boring building that belies its past as an IRS facility, just off state highway 9 in Martinsburg, West Virginia, in the eastern panhandle of the state, a town of some 17,000 people, a Walmart, a JCPenney, and various dollar stores sucking the life out of a quaint redbrick downtown. On any given day, agents here are running about 1,500 traces; they do about 370,000 a year…

The National Tracing Center is not allowed to have centralized computer data… That’s been a federal law, thanks to the NRA, since 1986: No searchable database of America’s gun owners. So people here have to use paper, sort through enormous stacks of forms and record books that gun stores are required to keep and to eventually turn over to the feds when requested. It’s kind of like a library in the old days—but without the card catalog. They can use pictures of paper, like microfilm (they recently got the go-ahead to convert the microfilm to PDFs), as long as the pictures of paper are not searchable. You have to flip through and read. No searching by gun owner. No searching by name…

All the out-of-business records that come in here—2 million last month—are eventually imaged and organized according to the store that sent them. It might be 50,000 Form 4473s from one Dick’s Sporting Goods in some suburb of Cleveland. So, say you need to find one particular 4473 from that store. “We go through them,” Charlie tells me. “Just like photographs from your Christmas party, and we look through every one. Until we find it.”

social interaction in cities

Here’s an interesting article from the University of Bern, Switzerland, on social interaction in cities. The engineer in me likes to see some hard data and theory applied in the social sciences.

Cities and the Structure of Social Interactions: Evidence from Mobile Phone Data

Social interactions are considered pivotal to urban agglomeration forces. This study employs a unique dataset on mobile phone calls to examine how social interactions differ across cities and peripheral areas. We first show that geographical distance is highly detrimental to interpersonal exchange. We then reveal that individuals residing in high-density locations do not benefit from larger social networks, but from a more efficient structure in terms of higher matching quality and lower clustering. These results are derived from two complementary approaches: Based on a link formation model, we examine how geographical distance, network overlap, and sociodemographic (dis)similarities impact the likelihood that two agents interact. We further decompose the effects from individual, location, and time specific determinants on micro-level network measures by exploiting information on mobile phone users who change their place of residence.

And here’s a more touchy-feely article in Vox on how the U.S. suburban development pattern discourages social interaction.

the key ingredient for the formation of friendships is repeated spontaneous contact. That’s why we make friends in school — because we are forced into regular contact with the same people. It is the natural soil out of which friendship grows…

This kind of spontaneous social mixing doesn’t disappear in post-collegiate life. We bond with co-workers, especially in those scrappy early jobs, and the people who share our rented homes and apartments.

But when we marry and start a family, we are pushed, by custom, policy, and expectation, to move into our own houses. And when we have kids, we find ourselves tied to those houses. Many if not most neighborhoods these days are not safe for unsupervised kid frolicking. In lower-income areas there are no sidewalks; in higher-income areas there are wide streets abutted by large garages. In both cases, the neighborhoods are made for cars, not kids. So kids stay inside playing Xbox, and families don’t leave except to drive somewhere.

I buy this about 75%. I am lucky to live and work in a highly walkable urban neighborhood, and I do have a lot of friendly spontaneous interactions with people around the neighborhood. I have a “scrappy” job where I bond with my co-workers, like soldiers in the trenches. I am also a middle-aged family person and somewhat of an introvert. Part of the reason I don’t have a lot of close adult friendships outside of work and family is that between work and family, I have all the human interaction I can really handle. If I have 15 minutes free on a given day, I would rather spend it alone than interacting with yet another person. I suppose this could change when the kids get a little older and/or when I don’t have to work so much, assuming I live long enough for these things to happen. So I’m just saying there are family pressures, financial and career pressures, and personality differences that influence these things alongside urban form.

Back to the first article: it suggests that high school, college, band camp, and even most workplaces might not be the best model of the most fulfilling and productive social interactions that can develop among adults in the best cities. In high school and college we tend to form small, tight-knit groups where most people in the group network only within the group. The first article above, if I am interpreting it correctly, describes a case where not only are individuals interacting frequently within a social network, but relatively open social networks themselves are interacting with each other as individuals within them interact in random and freewheeling ways. It’s wonderful. Now if you’ll excuse me I’m going to sit on my couch for a little while, watch some TV, unwind and recharge so I can handle the social interaction that will be thrown at me tomorrow.

100 is about it

This Nature article makes an argument that pushing human life span much beyond 100 years is not likely to happen. However, there has been criticism of the statistical methods used in this study.

Evidence for a limit to human lifespan

Driven by technological progress, human life expectancy has increased greatly since the nineteenth century. Demographic evidence has revealed an ongoing reduction in old-age mortality and a rise of the maximum age at death, which may gradually extend human longevity. Together with observations that lifespan in various animal species is flexible and can be increased by genetic or pharmaceutical intervention, these results have led to suggestions that longevity may not be subject to strict, species-specific genetic constraints. Here, by analysing global demographic data, we show that improvements in survival with age tend to decline after age 100, and that the age at death of the world’s oldest person has not increased since the 1990s. Our results strongly suggest that the maximum lifespan of humans is fixed and subject to natural constraints.

flow maps

Here is an interesting paper proposing design principles for flow maps, which “visualize movement using a static image and demonstrate not only which places have been affected by movement but also the direction and volume of movement.”

Design principles for origin-destination flow maps

Origin-destination flow maps are often difficult to read due to overlapping flows. Cartographers have developed design principles in manual cartography for origin-destination flow maps to reduce overlaps and increase readability. These design principles are identified and documented using a quantitative content analysis of 97 geographic origin-destination flow maps without branching or merging flows. The effectiveness of selected design principles is verified in a user study with 215 participants. Findings show that (a) curved flows are more effective than straight flows, (b) arrows indicate direction more effectively than tapered line widths, and (c) flows between nodes are more effective than flows between areas. These findings, combined with results from user studies in graph drawing, conclude that effective and efficient origin-destination flow maps should be designed according to the following design principles: overlaps between flows are minimized; symmetric flows are preferred to asymmetric flows; longer flows are curved more than shorter or peripheral flows; acute angles between crossing flows are avoided; sharp bends in flow lines are avoided; flows do not pass under unconnected nodes; flows are radially distributed around nodes; flow direction is indicated with arrowheads; and flow width is scaled with represented quantity.

Tobler’s first law of geography

Since I seem to be on a kick of writing about key theories I didn’t learn in school (and perhaps I am a bit burned out thinking about politics and climate change, and I don’t have any amazing new technologies to share today), here is the first law of geography:

The first law of geography was developed by Waldo Tobler in 1970 and it makes the observation that ‘everything is usually related to all else but those which are near to each other are more related when compared to those that are further away’. This observation which Tobler made is closely related to the ‘Law of Universal Gravitation’ and the ‘Law of Demand’ as well. The concept was first applied by Tobler to urban growth systems and was not popularly received when it was first published. It wasn’t until the 1990s when this formulation of the concept of spatial autocorrelation became an important underlying concept in the field of GIS.
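Spatial autocorrelation can be put into numbers. One common measure is Moran's I, which is positive when neighboring places tend to have similar values (Tobler's law in action) and negative when neighbors tend to differ. The sketch below is my own illustration, not something from the quoted source. It assumes the simplest possible geography, a row of locations where each one neighbors only the locations directly beside it, and the values are made up.

```python
def morans_i(values, weights):
    """Moran's I for a list of values and a matrix of spatial weights.

    weights[i][j] says how strongly location j counts as a neighbor of i.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_total = sum(sum(row) for row in weights)
    # Sum of products of deviations over every pair of neighbors.
    cross = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    return (n / w_total) * cross / sum(d * d for d in dev)


def line_neighbors(n):
    """Weights for n locations in a row: adjacent locations are neighbors."""
    return [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]


w = line_neighbors(6)
smooth = [1, 2, 3, 4, 5, 6]        # near things are similar
checkerboard = [1, 6, 1, 6, 1, 6]  # near things are as different as possible

i_smooth = morans_i(smooth, w)
i_checker = morans_i(checkerboard, w)
print(i_smooth, i_checker)
```

The smooth gradient comes out positive and the alternating pattern comes out negative. Real analyses use weights based on actual distances or shared borders, but the calculation has the same shape.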