Tag Archives: data science

how the ATF traces guns

This is one of those in-depth articles that GQ puts out every now and then.

Anytime a cop in any jurisdiction in America wants to connect a gun to its owner, the request for help ends up here, at the National Tracing Center, in a low, flat, boring building that belies its past as an IRS facility, just off state highway 9 in Martinsburg, West Virginia, in the eastern panhandle of the state, a town of some 17,000 people, a Walmart, a JCPenney, and various dollar stores sucking the life out of a quaint redbrick downtown. On any given day, agents here are running about 1,500 traces; they do about 370,000 a year…

The National Tracing Center is not allowed to have centralized computer data… That’s been a federal law, thanks to the NRA, since 1986: No searchable database of America’s gun owners. So people here have to use paper, sort through enormous stacks of forms and record books that gun stores are required to keep and to eventually turn over to the feds when requested. It’s kind of like a library in the old days—but without the card catalog. They can use pictures of paper, like microfilm (they recently got the go-ahead to convert the microfilm to PDFs), as long as the pictures of paper are not searchable. You have to flip through and read. No searching by gun owner. No searching by name…

All the out-of-business records that come in here—2 million last month—are eventually imaged and organized according to the store that sent them. It might be 50,000 Form 4473s from one Dick’s Sporting Goods in some suburb of Cleveland. So, say you need to find one particular 4473 from that store. “We go through them,” Charlie tells me. “Just like photographs from your Christmas party, and we look through every one. Until we find it.”

social interaction in cities

Here’s an interesting article from the University of Bern, Switzerland, on social interaction in cities. The engineer in me likes to see some hard data and theory applied in the social sciences.

Cities and the Structure of Social Interactions: Evidence from Mobile Phone Data

Social interactions are considered pivotal to urban agglomeration forces. This study employs a unique dataset on mobile phone calls to examine how social interactions differ across cities and peripheral areas. We first show that geographical distance is highly detrimental to interpersonal exchange. We then reveal that individuals residing in high-density locations do not benefit from larger social networks, but from a more efficient structure in terms of higher matching quality and lower clustering. These results are derived from two complementary approaches: Based on a link formation model, we examine how geographical distance, network overlap, and sociodemographic (dis)similarities impact the likelihood that two agents interact. We further decompose the effects from individual, location, and time specific determinants on micro-level network measures by exploiting information on mobile phone users who change their place of residence.
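The link formation model the abstract mentions can be pictured as a logistic model over pair-level features: distance hurts, network overlap helps, sociodemographic gaps hurt. This is only a toy Python sketch with made-up coefficients, not the paper's actual specification:

```python
import math

def link_probability(distance_km, network_overlap, age_gap_years,
                     b0=-1.0, b_dist=-0.8, b_overlap=2.0, b_gap=-0.1):
    """Toy logistic link-formation model: the probability that two
    agents interact falls with distance and sociodemographic
    dissimilarity and rises with network overlap.
    All coefficients are illustrative, not the paper's estimates."""
    score = (b0
             + b_dist * math.log1p(distance_km)
             + b_overlap * network_overlap
             + b_gap * age_gap_years)
    return 1.0 / (1.0 + math.exp(-score))

# Distance is detrimental: nearby pairs are far more likely to connect.
near = link_probability(distance_km=1, network_overlap=0.3, age_gap_years=5)
far = link_probability(distance_km=100, network_overlap=0.3, age_gap_years=5)
```

The log of distance is a common functional form in gravity-style interaction models, which is why it appears here, but the real paper's estimation is considerably richer.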

And here’s a more touchy-feely article in Vox on how the U.S. suburban development pattern discourages social interaction.

the key ingredient for the formation of friendships is repeated spontaneous contact. That’s why we make friends in school — because we are forced into regular contact with the same people. It is the natural soil out of which friendship grows…

This kind of spontaneous social mixing doesn’t disappear in post-collegiate life. We bond with co-workers, especially in those scrappy early jobs, and the people who share our rented homes and apartments.

But when we marry and start a family, we are pushed, by custom, policy, and expectation, to move into our own houses. And when we have kids, we find ourselves tied to those houses. Many if not most neighborhoods these days are not safe for unsupervised kid frolicking. In lower-income areas there are no sidewalks; in higher-income areas there are wide streets abutted by large garages. In both cases, the neighborhoods are made for cars, not kids. So kids stay inside playing Xbox, and families don’t leave except to drive somewhere.

I buy this about 75%. I am lucky to live and work in a highly walkable urban neighborhood, and I do have a lot of friendly spontaneous interactions with people around the neighborhood. I have a “scrappy” job where I bond with my co-workers, like soldiers in the trenches. I am also a middle-aged family person and somewhat of an introvert. Part of the reason I don’t have a lot of close adult friendships outside of work and family is that between work and family, I have all the human interaction I can really handle. If I have 15 minutes free on a given day, I would rather spend it alone than interacting with yet another person. I suppose this could change when the kids get a little older and/or when I don’t have to work so much, assuming I live long enough for these things to happen. So I’m just saying there are family pressures, financial and career pressures, and personality differences that influence these things alongside urban form.

Back to the first article, it suggests that high school, college, band camp, and even most workplaces might not be the best model of the most fulfilling and productive social interactions that can develop among adults in the best cities. In high school and college we tend to form small, tight-knit groups where most people in the group network only within the group. The first article above, if I am interpreting it correctly, describes a case where not only are individuals interacting frequently within a social network, but relatively open social networks themselves are interacting with each other as individuals within them interact in random and freewheeling ways. It’s wonderful. Now if you’ll excuse me I’m going to sit on my couch for a little while, watch some TV, unwind and recharge so I can handle the social interaction that will be thrown at me tomorrow.

100 is about it

This Nature article argues that human life span is unlikely to be pushed much beyond 100 years. The statistical methods used in the study have drawn some criticism, however.


Evidence for a limit to human lifespan

Driven by technological progress, human life expectancy has increased greatly since the nineteenth century. Demographic evidence has revealed an ongoing reduction in old-age mortality and a rise of the maximum age at death, which may gradually extend human longevity1, 2. Together with observations that lifespan in various animal species is flexible and can be increased by genetic or pharmaceutical intervention, these results have led to suggestions that longevity may not be subject to strict, species-specific genetic constraints. Here, by analysing global demographic data, we show that improvements in survival with age tend to decline after age 100, and that the age at death of the world’s oldest person has not increased since the 1990s. Our results strongly suggest that the maximum lifespan of humans is fixed and subject to natural constraints.

flow maps

Here is an interesting paper proposing design principles for flow maps, which “visualize movement using a static image and demonstrate not only which places have been affected by movement but also the direction and volume of movement.”

Design principles for origin-destination flow maps

Origin-destination flow maps are often difficult to read due to overlapping flows. Cartographers have developed design principles in manual cartography for origin-destination flow maps to reduce overlaps and increase readability. These design principles are identified and documented using a quantitative content analysis of 97 geographic origin-destination flow maps without branching or merging flows. The effectiveness of selected design principles is verified in a user study with 215 participants. Findings show that (a) curved flows are more effective than straight flows, (b) arrows indicate direction more effectively than tapered line widths, and (c) flows between nodes are more effective than flows between areas. These findings, combined with results from user studies in graph drawing, conclude that effective and efficient origin-destination flow maps should be designed according to the following design principles: overlaps between flows are minimized; symmetric flows are preferred to asymmetric flows; longer flows are curved more than shorter or peripheral flows; acute angles between crossing flows are avoided; sharp bends in flow lines are avoided; flows do not pass under unconnected nodes; flows are radially distributed around nodes; flow direction is indicated with arrowheads; and flow width is scaled with represented quantity.
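Two of those principles, scaling flow width with the represented quantity and curving longer flows more than shorter ones, are easy to turn into a styling rule. The constants below are illustrative choices of mine, not from the paper:

```python
import math

def flow_style(x0, y0, x1, y1, volume, max_volume,
               max_width=8.0, max_curvature=0.3):
    """Toy styling rule for one flow line on an origin-destination map:
    line width is proportional to the represented quantity, and
    curvature grows with flow length (capped at max_curvature)."""
    length = math.hypot(x1 - x0, y1 - y0)
    width = max_width * volume / max_volume               # width scaled with quantity
    curvature = max_curvature * min(length / 100.0, 1.0)  # longer flows curved more
    return width, curvature

short = flow_style(0, 0, 10, 0, volume=50, max_volume=100)
long_ = flow_style(0, 0, 90, 0, volume=100, max_volume=100)
```

In matplotlib, for instance, the curvature could feed a `FancyArrowPatch` via `connectionstyle="arc3,rad=..."` and the width its `linewidth`, which would also satisfy the arrowhead principle.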

Tobler’s first law of geography

Since I seem to be on a kick of writing about key theories I didn’t learn in school (and perhaps I am a bit burned out thinking about politics and climate change, and I don’t have any amazing new technologies to share today), here is the first law of geography:

The first law of geography was formulated by Waldo Tobler in 1970: "everything is related to everything else, but near things are more related than distant things." The observation is closely related to the Law of Universal Gravitation and to the Law of Demand. Tobler first applied the concept to urban growth systems, and it was not popularly received when first published. Not until the 1990s did this formulation of spatial autocorrelation become an important underlying concept in the field of GIS.
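Spatial autocorrelation, the modern formulation of Tobler's law, is most often measured with Moran's I. A minimal dependency-free sketch (real work would use a library like PySAL):

```python
def morans_i(values, weights):
    """Moran's I, the classic spatial autocorrelation statistic:
    positive when nearby sites (high weight) have similar values,
    negative when nearby sites have dissimilar values.
    `weights[i][j]` is the spatial weight between sites i and j."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(weights[i][j] for i in range(n) for j in range(n))
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_sum) * (num / den)

# Four sites on a line; adjacent sites get weight 1, others 0.
w = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
clustered = morans_i([10, 9, 2, 1], w)     # similar values adjacent: positive I
alternating = morans_i([10, 1, 10, 1], w)  # dissimilar values adjacent: negative I
```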

best practices for writing code

Here’s another R post I am saving for my own reference – some best practices for writing code. This is something I actually can say I learned in engineering school – it was covered in 15 minutes or so in a required intro to computer science course I took around 1994. Perhaps it’s time to brush up. Again, these are skills that are useful these days in many fields beyond just computer science and software development.
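I won't reproduce the linked post's list, but a few of the usual recommendations (descriptive names, named constants, docstrings, input validation) can be illustrated on one small function:

```python
# A few commonly recommended practices on one small function.
# The linked post's exact list may differ.

GRAVITY_M_PER_S2 = 9.81  # named constant instead of a magic number

def fall_time_seconds(height_m: float) -> float:
    """Time for an object to free-fall from `height_m` metres.

    Compare a terse one-liner like `t = (2*h/9.81)**0.5`: same math,
    but without names, units, or any guard against bad input.
    """
    if height_m < 0:
        raise ValueError("height_m must be non-negative")
    return (2 * height_m / GRAVITY_M_PER_S2) ** 0.5
```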

relational algebra

R-bloggers has a nice post on the theory behind database organization, and some tools that can be used to manage and manipulate data through R. Maybe this seems very specialized, but many of our jobs involve dealing with data these days, so this knowledge and these tools are potentially relevant to many of us, and yet I don’t think many of us even in technical fields outside math and computer science learn this stuff in school.
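For the flavor of it, the three core relational-algebra operators – select (σ), project (π), and join (⋈) – can be sketched over plain lists of dicts. The linked post works in R; this is just the algebra itself, in Python, on toy data:

```python
# Toy tables: each relation is a list of rows, each row a dict.
employees = [{"id": 1, "name": "Ada", "dept": "eng"},
             {"id": 2, "name": "Grace", "dept": "eng"},
             {"id": 3, "name": "Edgar", "dept": "sales"}]
depts = [{"dept": "eng", "floor": 2},
         {"dept": "sales", "floor": 1}]

def select(rows, predicate):          # σ: keep rows matching a condition
    return [r for r in rows if predicate(r)]

def project(rows, cols):              # π: keep only the named columns
    return [{c: r[c] for c in cols} for r in rows]

def natural_join(left, right, key):   # ⋈: merge rows that share a key value
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

eng = select(employees, lambda r: r["dept"] == "eng")
names = project(eng, ["name"])
joined = natural_join(employees, depts, "dept")
```

SQL's `WHERE`, column lists, and `JOIN`, or dplyr's `filter`, `select`, and joins, are these same three operators dressed up.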

November 2016 in Review

Sometimes you look back on a month and feel like nothing very important happened. But November 2016 was obviously not one of those months! I am not going to make any attempt to be apolitical here. I was once a registered independent and still do not consider myself a strong partisan. However, I like to think of myself as being on the side of facts, logic, problem solving, morality and basic goodness. Besides, this blog is about the future of our human civilization and human race. I can’t pretend our chances didn’t just take a turn for the worse.

3 most frightening stories

  • Is there really any doubt what the most frightening story of November 2016 was? The United Nations Environment Program says we are on track for 3 degrees C over pre-industrial temperatures, not the “less than 2” almost all serious people (a category that excludes 46% of U.S. voters, apparently) agree is needed. This story was released before the U.S. elected an immoral science denier as its leader. One theory is that our culture has lost all ability to separate fact from fiction. Perhaps states could take on more of a leadership role if the federal government is going to be immoral? Washington State voters considered a carbon tax that could have been a model for other states, and voted it down, in part because environmental groups didn’t like that it was revenue neutral. Adding insult to injury, WWF released its 2016 Living Planet Report, which along with more fun climate change info includes fun facts like 58% of all wild animals having disappeared. There is a 70-99% chance of a U.S. Southwest “mega-drought” lasting 35 years or longer this century. But don’t worry, this is only “if emissions of greenhouse gases remain unchecked”. Oh, and climate change is going to begin to strain the food supply worldwide, which is already strained by population, demand growth, and water resources depletion even without it.
  • Technological unemployment may be starting to take hold, and might be an underlying reason behind some of the resentment directed at mainstream politicians. If you want a really clear and concise explanation of this issue, you could ask a smart person like, say, Barack Obama.
  • According to left wing sources like Forbes, an explosion of debt-financed spending on conventional and nuclear weapons is an expected consequence of the election. Please, Mr. Trump, prove them wrong!

3 most hopeful stories

3 most interesting stories

Nate Silver and college football

I thought Nate Silver only looked at professional sports. I was wrong – here is a cool interactive web page he has put together for college football. The numbers don’t always give you the answers you want to hear though – even if my beloved Gators somehow win all the rest of their games, which would include beating Alabama in the conference championship game, he gives them only a 13% chance of winning the national championship. Another nice thing about Nate Silver – he always explains his methodology.

We’ll be updating the numbers twice weekly: first, on Sunday morning (or very late Saturday evening) after the week’s games are complete; and second, on Tuesday evening after the new committee rankings come out. In addition to a probabilistic estimate of each team’s chances of winning its conference, making the playoff, and winning the national championship, we’ll also list three inputs to the model: their current committee ranking, FPI, and Elo. Let me explain the role that each of these play…

FPI is ESPN’s Football Power Index. We consider it the best predictor of future college games so that’s the role it plays in the model: if we say Team A has a 72 percent chance of beating Team B, that prediction is derived from FPI. Technically speaking, we’re using a simplified version of FPI that accounts for only each team’s current rating and home field advantage; the FPI-based predictions you see on ESPN.com may differ slightly because they also account for travel distance and days of rest…

Our college football Elo ratings are a little different, however. Instead of being designed to maximize predictive accuracy — we have FPI for that — they’re designed to mimic how humans rank the teams instead. Their parameters are set so as to place a lot of emphasis on strength of schedule and especially on recent “big wins,” because that’s what human voters have historically done too. They aren’t very forgiving of losses, conversely, even if they came by a narrow margin under tough circumstances. And they assume that, instead of everyone starting with a truly blank slate, human beings look a little bit at how a team fared in previous seasons. Alabama is more likely to get the benefit of the doubt than Vanderbilt, for example, other factors held equal.
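The core Elo machinery behind ratings like these is just a logistic expected score plus a proportional update; FiveThirtyEight's college version layers schedule-strength and prior-season adjustments on top of this. A standard sketch (the K-factor here is a generic choice, not theirs):

```python
def elo_expected(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, a_won, k=20):
    """One game's standard Elo update: winners gain what losers lose,
    and upsets move the ratings more than expected results do."""
    expected = elo_expected(rating_a, rating_b)
    actual = 1.0 if a_won else 0.0
    delta = k * (actual - expected)
    return rating_a + delta, rating_b - delta

# A favorite beating an underdog gains only a few points.
fav, dog = elo_update(1700, 1500, a_won=True)
```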

R code to read Nate Silver’s data

Thanks to Nate Silver for posting all his polling data in a convenient text file that anyone can read! It’s a nice thing to do. Even though not many of us can do as interesting things with it as Nate Silver, it is a fun data set to play and practice with. Here is an R-bloggers post with some ideas on how to play with it.
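As a taste of the kind of playing around the linked post describes (it works in R; same idea here in Python), here is a sketch that averages polling numbers by state. The column layout below is hypothetical, not necessarily what the real file uses:

```python
import csv
import io

# Hypothetical slice of a polling file, in the same spirit as the
# posted data set; the real column names and values may differ.
raw = """state,pollster,trump,clinton
PA,PollA,46,48
PA,PollB,44,47
FL,PollA,47,46
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Collect each candidate's shares by state, then average them.
by_state = {}
for row in rows:
    d = by_state.setdefault(row["state"], {"trump": [], "clinton": []})
    d["trump"].append(float(row["trump"]))
    d["clinton"].append(float(row["clinton"]))

averages = {state: {cand: sum(vals) / len(vals) for cand, vals in d.items()}
            for state, d in by_state.items()}
```

Swapping `io.StringIO(raw)` for an `open(...)` call (or `pandas.read_csv`) would read the downloaded file directly.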