
The Math Behind Peoplemaps

For the last few years I’ve been working on mapping out social relationships in cities with my project peoplemaps.org. I had the chance to speak about this work recently at the TED Global conference, and the talk was featured this week on the ted.com website. This has led to many, many inquiries about the project, how it works, and the limits of its applicability.

So I wanted to take an opportunity to explain in high-level terms how I’m doing this work, what it tells us, and what it does not. I’ll attempt here to cover the approach in enough detail so others can reproduce it, without going into deep implementation details which are frankly not important to understanding the concepts. However, I’m happy to collaborate with anyone who would like to discuss implementations.

Theoretical Foundation

In the real world, each of us has personal relationships: our family, friends, co-workers, people we interact with in social settings, children and their parents and friends, and the like. The conventional wisdom is that we can maintain roughly 150 real-world relationships; this is called the “Dunbar Number,” named after the anthropologist who identified this phenomenon. Some people may have more, some fewer, but 150 is probably about average.

In an online setting, people may have many more relationships — perhaps a few thousand that they interact with in some meaningful way. However, the offline, real-world relationships that people have will likely be an overlapping subset of the relationships they have in an online setting. Online relationships are often more flexible (they can be global and operate at all times of day), though not always as meaningful. Still, if you are able to look at people’s online relationships in a specific geography, you have at least a proxy for understanding their offline, real-world relationships as well.

The ability to infer things about both offline and online interactions is derived from the principle of homophily, which sociologists identified as the powerful tendency for people to cluster into groups of people who are similar to themselves. Colloquially we know this through the saying, “birds of a feather flock together,” and it is powerfully demonstrated in network data.

So if we accept the notion that people do, in fact, have relationships that both shape and are shaped by their interactions, then it follows that there may be some ways to measure these relationships with at least some level of fidelity. Social network data appears to offer at least a window into these real-world relationships, though each dataset has biases which are not yet well understood. However, by comparing results from different networks, we can start to get a sense for what those biases might be.

This is where the state of the art for this research is right now: trying to understand how these different data sets are biased, the nature of those biases, and whether the biases are material in terms of distorting the “real world” network we are trying to understand. However, this lack of perfect understanding isn’t preventing people from using this data to do all kinds of things: from recommending movies to helping you find “people you may know” to identifying terror cells using cell-phone “metadata.” All of these activities use essentially the same approach.

These Maps Are NOT Geographic

A quick caveat: while we typically think of city maps as geographic, these maps are explicitly NOT geographic in nature. Rather, they show communities and their relationships to each other. The position of communities relative to the page is always arbitrary; their position relative to each other is determined by the presence or lack of relationships between them. I bring this up now to dissuade any notion you might have that these maps are geographic, despite whatever resemblance (real or imagined) they might have to geography. That said, there are ways to tie these maps back to geography and use it as an additional investigative tool, but you should assume that all discussion of maps here is non-geographic unless otherwise noted.

Gathering the Data — and Avoiding Bieber Holes

Depending on the dataset you are looking to explore, the exact details of how you gather data will vary somewhat; I have used data from Twitter, Facebook, LinkedIn, AngelList, email, and other sources. In all cases, you will want to gather a set of “nodes” (people, users, or companies, depending on the data source) and “edges” (relationships between them, typically “friend” or “follow” relationships).

For the case of the city network maps, this is the approach I have used with Twitter data:

  1. Define a target geography using a geographic bounding box.
  2. Determine a set of keywords and location names that users inside the target geography may use to identify their location/geography.
  3. Identify a set of “seed” users within the bounding box, or which otherwise appear to verifiably be within the target geography.
  4. Determine which user identifiers are followed by a given user, and record that in a list that maintains the number of followers for a given user identifier.
  5. When a given user identifier is followed by a number of people exceeding a threshold, request the full user information for that user to see if it appears to be within the target geography.
  6. If that user is within the target geography, then feed it into step 4, requesting the user identifiers followed by that user.
  7. Repeat steps 4-6 until there are few new additions to the dataset.
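The crawl loop in the steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code behind the project: `get_following` and `in_target` are hypothetical callables standing in for the real API requests and the location test, and the default threshold is an arbitrary choice.

```python
from collections import Counter, deque


def crawl(seeds, get_following, in_target, threshold=1):
    """Snowball outward from seed users, keeping only in-target accounts.

    get_following(user) -> iterable of user ids that `user` follows
    in_target(user)     -> True if the profile appears to be in the geography
    Both callables are hypothetical stand-ins for real API requests.
    """
    accepted = set(seeds)        # users judged to be inside the geography
    checked = set(seeds)         # ids whose full profile was already requested
    follower_counts = Counter()  # step 4: accepted followers seen per user id
    edges = set()                # (follower, followed) pairs
    queue = deque(seeds)

    while queue:                 # step 7: stop when nothing new turns up
        user = queue.popleft()
        for followed in get_following(user):
            edges.add((user, followed))
            follower_counts[followed] += 1
            # Step 5: look up a profile only once its count exceeds the threshold.
            if followed in checked or follower_counts[followed] <= threshold:
                continue
            checked.add(followed)
            # Step 6: users inside the geography are crawled in turn.
            if in_target(followed):
                accepted.add(followed)
                queue.append(followed)

    # Keep only relationships where both ends were accepted.
    kept = {(a, b) for a, b in edges if a in accepted and b in accepted}
    return accepted, kept
```

Refining the dataset then amounts to changing `in_target` (for example, to reject a same-named city elsewhere) and running the crawl again.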

Doing this process once produces a “first draft” data set, which can then be visualized for inspection — to look for improper inclusions, obvious exclusions, and any particular data artifacts or pathologies.

At this stage, various problems may appear. As an example, if you are trying to visualize Birmingham UK, you will likely end up with some data for Birmingham, Alabama, due to legitimate confusion which may exist between the two communities. At this point, you can modify the test used to determine whether something is inside the target or not, and regenerate the dataset, as well as perform additional data gathering iterations to get more of the “right” data. This process typically takes a few iterations to really drill down into the data you’re looking for.

One persistent problem I’ve come across is a phenomenon called Bieber Holes, which are essentially regions of the network occupied by Justin Bieber fans. They are so virulent, and their networks so dense, that only aggressive exclusion filters can prevent the algorithm from diving down into these holes and unearthing millions of Beliebers — only a fraction of which may pass location tests. Anyway, I’ve developed good techniques to avoid Bieber holes (and similar phenomena) but it’s a reminder that when working with data from the public with algorithmic approaches, editorial discretion is required.

Laying out the Network Graph

There are dozens of algorithms for laying out network graph data, each optimized to illustrate different properties of the graph. Since I’m primarily interested in homophily and clustering, I’m looking for layouts that express communities of relationships. A good way to do that is to use a force-directed graph layout algorithm; with this approach, relationships act like springs (expressing Hooke’s law), and each user or node repels nearby nodes (expressing Coulomb’s law). By iteratively drawing the graph based on these forces, the graph will eventually reach a steady state which exhibits the following properties:

  1. People with many relationships between them will be arranged into tight clusters.
  2. People with the fewest relationships between them will appear at opposite edges of the graph.
  3. People who have many relationships at both ends of the graph will appear in the middle.
  4. Clusters with few or no relationships between them will appear very far apart on the graph.

You can think of this in very simple terms. If you have a room full of 10 people, there can be a total of 45 relationships between them (n * (n-1)/2, since no one is friends with themselves). If every person is friends with every other person, this network will appear as a perfect and symmetric “ball” under the rules of a force-directed graph layout.

Likewise, let’s suppose that same room of 10 people was grouped into two groups of 5 people, and that those two groups hated each other and refused to speak, but each member of each group knows every other member of their own group. You would see two perfect “ball” layouts (each with 10 relationships expressed), but with no connection between them.

When we visualize data from a city in this way, we are essentially measuring the separateness of communities — whether we see several of these separate groups or whether we see one unified community.
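To make the spring-and-repulsion idea concrete, here is a toy force-directed layout in plain Python. It is only a sketch: the force constants, step size, iteration count, and circular starting positions are all assumptions chosen for illustration, and real tools use far more efficient methods on large graphs.

```python
import math
from itertools import combinations


def force_layout(nodes, edges, iterations=300, step=0.05):
    """Toy force-directed layout: edges pull like springs, all nodes repel.

    nodes: list of node ids; edges: iterable of (a, b) pairs.
    Returns {node: (x, y)}. All constants are illustrative, not tuned.
    """
    n = len(nodes)
    edges = list(edges)
    # Start on a unit circle (in list order) so the run is deterministic.
    pos = {
        node: [math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n)]
        for i, node in enumerate(nodes)
    }
    for _ in range(iterations):
        force = {node: [0.0, 0.0] for node in nodes}
        # Coulomb-style repulsion between every pair of nodes.
        for a, b in combinations(nodes, 2):
            dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
            dist = max(math.hypot(dx, dy), 1e-3)
            push = 1.0 / dist ** 2
            force[a][0] += push * dx / dist
            force[a][1] += push * dy / dist
            force[b][0] -= push * dx / dist
            force[b][1] -= push * dy / dist
        # Hooke-style attraction along each relationship (rest length 1).
        for a, b in edges:
            dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
            dist = max(math.hypot(dx, dy), 1e-3)
            pull = dist - 1.0
            force[a][0] -= pull * dx / dist
            force[a][1] -= pull * dy / dist
            force[b][0] += pull * dx / dist
            force[b][1] += pull * dy / dist
        # Move each node a small, capped distance along its net force.
        for node in nodes:
            fx, fy = force[node]
            size = math.hypot(fx, fy)
            if size > 0:
                move = min(step * size, 0.5)
                pos[node][0] += move * fx / size
                pos[node][1] += move * fy / size
    return {node: (x, y) for node, (x, y) in pos.items()}
```

Fed the two mutually silent groups of 5 described above (each fully connected internally, with the nodes listed group by group), members of the same group should settle much closer to each other than to anyone in the other group.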

Note that in the example with 10 people in one group, we have a total of 45 relationships, while in the example with two groups of 5, there are only 20 (2 x 10). Now scale that up to a city of 500,000 people. If the city were fully meshed, there would be 124,999,750,000 possible relationships. If that city is segregated into two groups of 250,000, there can only be 31,249,875,000 relationships in each group, or 62,499,750,000 across both groups, which is a little less than half of the number possible if the two groups merged.
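The arithmetic is easy to check with a few lines of Python; the function name here is just an illustrative choice.

```python
def possible_relationships(n):
    """Maximum number of undirected relationships among n people."""
    return n * (n - 1) // 2


print(possible_relationships(10))           # 45
print(2 * possible_relationships(5))        # 20
print(possible_relationships(500_000))      # 124999750000
print(2 * possible_relationships(250_000))  # 62499750000
```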

Detecting Communities and Adding Color

Within the network, we can use algorithms to detect distinct communities. A community is a subgroup whose members share more relationships with each other than with the rest of the network. We use the Louvain community detection algorithm, which iteratively determines communities of interest within the larger network and assigns a community membership to each user accordingly.
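The Louvain method works by repeatedly moving nodes between communities to increase a score called modularity. The sketch below computes only that score for a given assignment, treating relationships as undirected. It is not a Louvain implementation, and the function name and toy data are assumptions for illustration.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q of an undirected graph under a community assignment.

    edges:     iterable of (a, b) pairs, each relationship listed once
    community: dict mapping node -> community label
    """
    edges = list(edges)
    m = len(edges)
    internal = defaultdict(int)    # edges with both ends in the community
    degree_sum = defaultdict(int)  # total degree of the community's nodes
    for a, b in edges:
        degree_sum[community[a]] += 1
        degree_sum[community[b]] += 1
        if community[a] == community[b]:
            internal[community[a]] += 1
    # Share of edges inside each community, minus the share expected by chance.
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Two separate triangles: splitting them scores higher than lumping them together.
triangles = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6)]
split = {1: "x", 2: "x", 3: "x", 4: "y", 5: "y", 6: "y"}
merged = {node: "x" for node in split}
print(modularity(triangles, split))   # 0.5
print(modularity(triangles, merged))  # 0.0
```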

We can then assign each of these assigned communities a color. For my work so far, I have assigned these colors arbitrarily, with the only goal of visually differentiating one community from another. This helps to generate an aesthetically pleasing visual representation.

While communities often correlate to the clustering exhibited by the layout algorithm, for nodes that are not clearly members of only one cluster, color can be used to indicate their primary community affiliation. This is helpful in the visualization because it generates blended color fields that can give a sense of the boundary between two communities.

For example, a group primarily concerned with politics (blue) and a group primarily concerned with music (yellow) may have a mix of both blue and yellow nodes in the space between those distinct communities. You can get a sense that those people are very interested in both communities, and see what their primary affiliation might be.

Assigning Node Size with In-Degree

In a graph of Twitter users, it can be helpful to indicate how many people are following a given user within our specific graph (note that this is calculated for our subgraph, not taken from a user’s “global” follower count as displayed by Twitter).

We can do that by making the “dot” associated with each user bigger or smaller based on the number of followers that user has within the graph (the user’s in-degree). This has no real effect on the shape of the network, but it can be helpful in determining what kinds of users are where, and how people have organized themselves into communities of interest.
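As a rough sketch, in-degree within the subgraph can be counted directly from the follow edges and mapped to a dot size. The square-root scaling and the default constants below are illustrative assumptions, not the sizing used for the published maps.

```python
import math
from collections import Counter


def node_sizes(follow_edges, min_size=2.0, scale=3.0):
    """Map each user to a dot radius based on in-graph follower count.

    follow_edges: iterable of (follower, followed) pairs within the subgraph
    The square-root scaling and default constants are illustrative choices.
    """
    follow_edges = list(follow_edges)
    # In-degree: how many users in this subgraph follow each user.
    in_degree = Counter(followed for _, followed in follow_edges)
    users = {user for edge in follow_edges for user in edge}
    return {u: min_size + scale * math.sqrt(in_degree[u]) for u in users}
```

A user followed by nobody in the subgraph gets the minimum size, no matter how large their global follower count is.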

Determining Community Interests

After you have a colored graph with communities and clusters, it’s time to figure out what these clusters seem to be organized around. The first step is to manually inspect nodes to see who they are, often starting with the biggest nodes. Typically you’ll find that people organize themselves into groups like these: sports, music, mainstream media, politics, food, technology, arts, books, culture, and the like. These clusterings vary somewhat from city to city, but you’ll see some common patterns between cities.

After communities are detected, we can start to monitor traffic coming out of these communities to look for topics of conversation and other characteristics which may be helpful in explicating the observed clustering. For instance, we can gather a corpus of Twitter traffic for each community, recording:

  1. hashtags
  2. commonly shared links
  3. languages in use
  4. operating systems in use (desktop vs. mobile et al)
  5. client software in use
  6. geographic coordinates (as provided in GPS)
  7. age of user accounts
  8. user mentions

With this kind of data recorded as a histogram for each community, we can start to get a pretty good sense that a given group is mostly concerned with sports, music, politics, and the like. By working with a collaborator who is well-versed with the culture of a given place, we can also get a sense of local subtleties that might not be immediately obvious to an outside observer. These insights can be used to generate legends for a final graph product, and other editorial content.
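A per-community histogram of this kind can be sketched with standard-library counters. The example covers hashtags only, and the input shapes (a list of `(user, text)` pairs and a user-to-community dictionary) are assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict


def hashtag_histograms(tweets, community):
    """Count hashtags per community.

    tweets:    iterable of (user, text) pairs
    community: dict mapping user -> community label
    """
    histograms = defaultdict(Counter)
    for user, text in tweets:
        if user not in community:
            continue  # skip traffic from users outside the graph
        tags = re.findall(r"#(\w+)", text.lower())
        histograms[community[user]].update(tags)
    return histograms
```

Calling `most_common(10)` on each community’s counter gives a quick first read on what that group talks about; the same pattern extends to links, languages, or client software.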

Investigating Race

Another phenomenon is very apparent in American cities: people separate by race. American cities like Baltimore and St. Louis are polarized into black and white communities, with some people bridging in the middle. These cities present roughly as eccentric polygons, with differences in mesh density at each end of a racially polarized spectrum. Relatively prosperous European cities, like Munich or Barcelona, present more like “balls,” with no clear majority/minority tension displayed. Istanbul, by contrast, shows a strong divide between the rich establishment and a large emergent cohort of frustrated young men.

In many American cities, we do observe strong racial homophily in the data. For example, in the data for a city like Baltimore, people at opposite ends of the spectrum are generally strongly identifiable as “very white” or “very black.” This is a touchy subject, and it’s difficult to discuss this topic without offending people; however, what we are aiming to do here is to try to understand what the data is showing us, not make generalizations or prescriptive measures about race.

One way to ground this discussion about race is to look at the profile photos associated with user accounts. We can gather these photos in bulk, and when displaying them together, it becomes quickly clear that people have often organized themselves around skin color. When they have not organized themselves around skin color, they are organized around other cultural signifiers like fashion or style. This can give us a concrete sense that certain communities consist primarily of one race or another. However, this is not to suggest that outliers do not exist, nor to make any statement about any given individual, and it certainly does not suggest an inverse relationship between race and the ability or proclivity to participate in any given cluster or clusters.

The Final Product

Once we have refined the data, identified communities, investigated trends and topics within communities, and potentially looked at photos, profiles, and other cultural signifiers, we are in a position to annotate the network map with editorial legends. This is a fundamentally human process. I generally use a tool like Photoshop or Keynote to annotate an image, but this could be done in a number of ways. Once this step is done, a final product can be exported.

This entire process of taking data from one or more biased sources, refining data, and then using a human editor who also has a bias produces a final end product which is comparable to an “op-ed” in a newspaper: it’s an expression of one possible mental model of the world, informed by a combination of facts, errors, and pre-existing opinions. It’s up to the reader to determine its utility, but to the extent it offers a novel view and the reader deems the data and editorial biases acceptable, such renderings can be an informative lens through which to see a city.

A Note About Bias, Inclusions, and Exclusions

A common criticism of large-scale social network analysis is that it is not representative because it is biased in some way. A typical statement is, “I don’t use Twitter, so the results don’t include me, therefore I question the validity of the approach.” Likewise, people may say the same thing about LinkedIn or Facebook, or start into a long explanation of how their online habits are very different from their offline reality. These are important facts to consider, but the question is really whether network analysis is representative enough to start to deliver information that we didn’t have until now.

To answer this, it’s helpful to think about this in terms of recent history. In 2004 or 2007, it probably wasn’t helpful to say anything in particular about data from social networks, because none of them had enough penetration to deliver insights beyond a very biased community of early adopters: whether it was geeks in the case of Twitter, or young music fans in the case of MySpace, or college students in the case of Facebook.

As these networks have continued to evolve, however, their penetration is increasing rapidly. They are also accreting a great deal of historical data about people and their positions in social networks, which gives us clues as to their offline interactions. This accretion of data will only continue, and it will begin to paint a more complete and multi-dimensional picture of our culture, especially if you correlate the data from multiple social networks.

I believe we are now at a point where these data sets are large and detailed enough to offer important insights about our “real world” culture. This belief is based on two facts. First, it’s possible to get a good working image of a community by gathering data over just a few hours. While we make every effort to gather as much data as is realistically possible before making statements about a community, the effect of adding more data is additive: we add members to the community, but we do not fundamentally alter its shape.

To understand this effect, it is helpful to consider an analogy from astronomy. A more powerful telescope can yield a better, sharper image of a star formation, but it changes neither the shape of the star formation nor our basic understanding of its structure. Better tools simply yield more detailed data. I believe we are at a point where we have enough data to begin to understand structures, and that more data will yield more detailed understanding but will not alter the fundamental shapes we are beginning to uncover. This line of thinking is helpful in comprehending “exclusions.”

On the subject of “false inclusions,” these are generally sussed out in the layout and review process, but all of the graphs I have produced have generated a limited number of false inclusions as a practically inevitable artifact of the process. As a result, any statements one may make about a specific individual based on this kind of analysis may or may not be valid: they should be viewed through the lens of the biases disclosed alongside the visualization. However, repeated experience has shown that removing a small number of false inclusions does not have a material effect on the community structure.

On the subject of “network bias” (I don’t use x, therefore conclusions are not valid), when comparing data from Facebook, Twitter, LinkedIn, AngelList and other sources, the same patterns of homophily are exhibited. While each network has its own bias (LinkedIn towards professionals, Twitter towards both youths and professionals, Facebook towards grandparents), if we limit analysis to a given geography, we will inevitably see homophilic tendencies which are correlated in each network.

While it’s difficult to speak about this in detail yet, as obtaining full comparable data sets from each source is currently quite challenging, early investigations indicate that the same patterns are exhibited across all networks. This squares with the notion that what we are really examining in these networks is a proxy for real-world relationships. Ultimately, these networks must converge into something that approximates this abstract, Euclidean reality, and as we get more data and correlate it, it’s likely that a unified data set will closely reflect the actual geometry of our communities.

Future Research Directions

This kind of analysis is forming the basis for the emergent field of “computational sociology,” which is currently being explored by a variety of researchers around the world. This work has a number of important implications, and poses questions like:

  • What do we mean by diversity? If we should be looking to bridge networks, is race a helpful proxy for that function? Or should we be looking to develop new, better measures of diversity?
  • What is the nature of segregation? Is physical segregation a product of our social networks? Or is it a manifestation of them?
  • What is the role of urban planning? Is our social fabric something that’s shaped by urban planning, or are our cities simply a manifestation of our social fabric?
  • What kind of interventions might we undertake to improve a city’s health? Should they be based first on creating and improving relationships?
  • What is creative capital and how can we maximize it? If creative capital is a byproduct of relationships between people with diverse backgrounds, then we should be able to increase creative capital by orders of magnitude by bridging networks together and increasing the number of relationships overall.

The level of detail we are able to extract from network data in cities exceeds that of any data source we have ever had, and it can supplement many other indicators currently used to measure community health and make decisions about resource allocation. For example, the census may tell us someone’s race, family name and physical address, but it tells us very little about their participation in the social fabric. And if we believe that the city is primarily manifested as the sum total of social relationships, then clearly data surrounding social interactions is more useful than other attributes we may harvest.

Discerning Stable Structures vs. Topical Discussions

Likewise, many researchers have been using social network data like Tweets to characterize conversations around a topic; and it is true that by harvesting massive quantities of Tweets, one can discern trends, conversation leaders, and other insights. However, this type of analysis tends to be topical and shift a great deal depending on whether people are active online, and contains biases about why they might be active online. Regardless, this kind of research is well worth understanding, and may ultimately lead to a better understanding of network structure formation. However, I want to differentiate it from the kind of inquiry I am pursuing.

So, rather than “you are what you tweet,” I believe the truth is something more like “you tweet what you are.” Network follow structures do change and evolve, but they tend to be very stable over time, so the results of the process I am using are repeatable. That is to say, if I analyze Baltimore one day, and then repeat the analysis a month later, I will obtain a comparable result. This is not likely to be the case with semantic analysis, as topics and chatter may change over the course of a few weeks.

Tracking (and Animating) Changes Over Time

Because network structure analysis is fundamentally labor intensive and somewhat difficult to compute, it’s difficult to automate and to perform on an ongoing basis. For example, it would be nice in Baltimore to know whether things are getting “better” or “worse.” Right now, only by taking periodic snapshots and comparing them can we get a sense for this. It would be helpful to be able to animate changes over time, and while this is theoretically possible, tools to do this have to be built by hand right now. My hope is to apply such tools to the process and dramatically increase our ability to monitor network structure in real time, and spot trends.

Ultimately, this will give us some ability to see our social structure change in near real-time, developing a sense as to whether interventions are having a positive effect, or any effect at all.

Healthy vs. Unhealthy Network Patterns – and Brain Development

Dr. Sandy Pentland, an MIT researcher who is perhaps the leading figure in the field of computational sociology, has suggested that there are certain patterns that characterize “healthy” networks, as well as patterns that characterize “unhealthy” networks.

Healthy networks are characterized by:

  • Frequent, short interactions
  • Broad participation by all nodes and meshing
  • Acknowledgement of contributions
  • A tendency to explore other parts of the network

By contrast, unhealthy networks exhibit the opposite patterns: broadcast vs. peer-to-peer interactions; fewer network connections; isolated subnetworks; a lack of exploration of other parts of the network.

Perhaps the most telling finding is that these patterns are not limited to just human networks, but also appear in other colony-based organisms, like bees. It appears that there are universal patterns of network health that apply to life in general.

The other important finding is that lack of network exploration affects brain development in young people. Specifically, young people who grow up in an environment where network exploration is not valued tend to exhibit structural changes in the brain which appear to dampen their desire to explore networks as adults. This produces a multi-generational effect, where children who grow up in isolated networks tend to persist in isolated networks, and to pass that on to their children.

This suggests that one possible intervention is to promote network exploration at an early age across the entire population. How we might do this is certainly open to discussion, but it seems to be a very powerful tool in breaking down divisions in our social networks.

What’s next?

As this project advances, we are gathering ever-larger datasets. This requires more and more computational power and distributed algorithms for visualization. If this is something you’re also working on, I would like to talk to you — please contact me. There are some interesting challenges in scaling this up but there are interesting opportunities emerging to apply these approaches!

Mapping Your City

We have a long list of places that we’re looking at mapping, and trying to prioritize opportunities. If you have a mapping project you would like us to consider, please contact us. We will try to get back to you as quickly as possible, but in general, we’re looking for projects that can make a serious social impact. It takes a serious amount of effort to generate these maps, so be thinking about possible partners who could potentially fund this work and help advance this science.

Baltimore Is Egypt

Newly-elected Maryland State Senator Bill Ferguson was recently named to the Baltimore Business Journal’s Power 20. This week they asked me, as a friend of Bill’s and member of a previous Power 20 cohort, to comment on Bill’s relationship with and use of power.

“Bill is a curious, humble, and earnest young man, and he represents a true shift in how power is conferred in this town,” I said. “He didn’t work his way up through the ranks and spend a few years as a city council person, or wait his turn. Bill was able to win because of a shift in political power that’s taking place right now. He derives his power from the people, not from the system.”

Political power is now being conferred through the accumulation of weak and strong ties with citizens, and no longer by top-down power structures, power-brokers, and kingmakers. Don’t get me wrong; those folks still have an impact (they did in Bill Ferguson’s race – they got behind him when it was clear he was onto something), but that impact is waning. And things that were previously unthinkable are now possible.

It may seem like hyperbole to compare the situation in Baltimore to what took place over the last three weeks in Egypt. But it’s an apt comparison.

For decades in both places, people have felt marginalized by a top-down, tone-deaf government that was more interested in its own well-being than that of its citizens. In both places, decades of neglect and mismanagement have led to a serious crisis of confidence.

People are fed up. They’re tired of feeling marginalized, the failed programs, the broken promises, the lack of accountability and the inability to implement imaginative solutions. For 60 years, Baltimore’s population has been in decline, and places in decline have not had the benefit of oversight, dollars, or creative leaders. Instead, corruption (explicit or implicit) festers.

The Perfect Storm

Several factors are emerging all at once:

  • Young people want to live near their work and are tired of commuting (and they’ll accept a pay cut to do it)
  • Our roads are full and can no longer be meaningfully expanded due to lack of space and funds
  • Fuel costs are projected to rise as China’s demand grows exponentially
  • Online networks are having a meaningful impact on real-world relationships and politics

These factors, combined, have made Baltimore the most important jurisdiction in Maryland – practically overnight. Yet our leadership has not caught up with this reality.

Baltimore’s recent rise to relevance combined with the power of communications networks will create stark shifts in the power structure.

Two Kinds of Leaders

Today we have a choice between two kinds of leaders. We can choose between the leaders that the system hands us, or we can choose to put our faith in new, emerging leaders with whom citizens have a legitimate connection and a voice.

Legacy | Next Generation
Product of the system | Newcomers, inspired to serve
Disproportionate influence of money | Driven by small donations, connection with people
Ideas come from insiders and developers | Ideas come from anywhere and from study of best practices globally
Power comes from the top-down | Power comes from legitimate engagement with citizens
“Openness” is skin deep, only ‘fauxpenness’ | Transparency at every level; data is a strategic driver
Secrecy and private realities drive decisions | One shared view of reality drives all decisions
Treat Symptoms: Problems (poverty, crime) are “mitigated” | Address Root Causes: Focus on wealth creation
Social media is a “one way,” Orwellian broadcast tool | Social Media is a “two-way” engagement tool
Over-Confident that the system knows best | Open to Questioning: People know best
Boomer-centric: top-down, command and control | Gen-Y Centered: Collaborative, flat organizations
People are engaged to placate them | People are legitimately engaged
Fear of reprisal keeps people in line | May the best ideas and people win
Career politician | Will serve only as long as effective
Prideful | Humble

It is sadly telling that Mayor Stephanie Rawlings-Blake’s much-promoted (Orwellian, broadcast-oriented) Safer City social media campaign follows just one person on Twitter: the Mayor herself. And it has just 78 followers. Why? Because it’s all for show, and no one legitimately cares about a program to mitigate a problem – people actually want to solve it at the root. To hell with a Safer City: give me a city where everyone can earn a living, and I can bet you it’ll be safer.

But our politicians don’t know that, because they have not taken the time to benchmark Baltimore against other cities or learn from best practices elsewhere. Baltimore has more cops per capita than any other city. Why is that?

Because we need them. Why do we need them? Because we have a lot of crime. Why do we have a lot of crime? Because we have no middle class. Why do we have no middle class? Because we have not seriously focused on enabling small business formation, which is the number one driver of jobs. Instead we have given tax handouts to fatcat developers so they can build big projects and enrich their cronies.

Yes, clearly the cure is more cops. As the Mayor told the Baltimore Sun’s Justin Fenton, “Maybe we could do without as many officers, but that’s not what the public wants. They want more patrolmen on the street. They want more police in the neighborhood.”

No, Madam Mayor. What the public really wants is for these root cause issues to be addressed. It takes true leadership and understanding to go beyond just treating the symptoms.

Accelerating Change

Some have called the recent events in Egypt “the Twitter and Facebook revolution.” A few have scoffed at the idea that these tools could spark a revolution and cite eons of revolutionary precedent as proof. But it’s a mistake to dismiss their role.

Online networks are accelerants. They create connections passively where none might otherwise exist. Critical mass for change comes when the density of connections between people reaches a threshold level. Ideas spread between networks instantly. What might have taken 10 years before now takes 1 year.

The Soviet regime could never have survived in the age of networks. Iraq would have collapsed under its own weight if given time and these tools.

And the same repressive structures will fall in Baltimore, for the same reasons.

To quote a line often attributed to Gandhi: “First they ignore you. Then they laugh at you. Then they fight you. Then you win.”

Is Groupon the new “Jesus Startup?”


50% Off Loaves and Fishes…

Every few years a company emerges that grows so swiftly that it manages to define the zeitgeist and often helps to inflate a bubble that defies any rational explanation. Often these businesses are driven by new, disruptive ideas that take the market by storm and create a real shift in how people do things. Amazon (and online shopping), Google (and the search business), and Apple (music, smartphones, and touch computing) fall into this category. They created real, thick value.

For every one of these, there are others that grow, get tremendous buzz, and then seem to dissipate as quickly as they emerged. Or they settle into a kind of staid middle age, their torrid teen years long forgotten. Think about ’90s darlings like Iomega, Boston Chicken, eBay, and Home Depot. It can be difficult to predict which businesses will stick around and which will fall away (or become low-growth, boring enterprises).

Groupon has emerged as the “Jesus Startup” of 2010-2011. The industry always needs one, and they tend to conform to an archetype and have a mythical story: the visionary CEO (Marc Andreessen, Evan Williams, Mark Zuckerberg) who experiences a remarkable rise to greatness. For this story and for these 15 minutes, we have Andrew Mason, the humorous and self-deprecating everyman who declares of the fledgling Groupon, “We could still fuck this up.”

The implication is that they’ve done something to “ace” it so far. But the truth is that they are just regular guys that started out doing something else (some kind of social mission charity stuff – blech – don’t talk about that, it’s not compatible with the visionary myth). And after executing on their original idea and experimenting a bit, they found themselves in the middle of a new exploding business model. Kudos for that. But as is the case with most “Jesus Startups,” there’s been a notable lack of critical thinking about what happens next.

Here’s where I think Groupon is weak.

1. Over-reliance on hypergrowth.

Groupon has posted some crazy huge numbers as they push through massive expansion into new markets. When you are turning up a new major metropolitan area every few days, gross revenue numbers are going to grow very quickly as businesses rush to be part of something that’s got so much buzz. As their geographic footprint stabilizes, top-line revenue will start to level out. When that happens, the business becomes much less interesting and has a lower upside (see Home Depot, Gap, Boston Chicken, Microsoft). This is why a push to IPO while this hypergrowth is happening seems to be a priority for the company.

2. Customer fatigue.

If you have been using Groupon, Living Social, GILT, HauteLook, or any of the countless other sites that rely on daily emails to get their message out, I’ll bet your experience has been something like this: at first you reviewed the emails every day; you bought a few things; you are now buying almost nothing; now, you may not look at the emails at all; you still have unused Groupons. Time is money, and people have too much crap. Eventually, people are not going to take the time with this. And when Groupon has exhausted all the “easy hits” that drive people to buy, then what? Besides, I thought email was “dead” and for “old people.” Right? Or did I miss something? (Sure, the deals spread through Facebook or whatever social channels, but email is a huge part of the business model.) As younger folks steer away from email, it’s an open question whether the current “daily deal” model can be sustained.

3. Business fatigue.

Businesses are tripping over themselves to be part of the latest new thing and expose themselves to thousands of customers at a shot. And sure, a Groupon deal can be a great opportunity for some businesses. But many businesses (some say up to 40%) have found that doing a Groupon deal can be a costly mistake that actually damages their business. The economics of the deals deliver to the business only a fraction (typically 25%) of the face value, which often does not cover its costs. While there is some breakage (unused deal revenue that can offset losses), this still may not cover the cost and hassle the promotion entails. Additionally, businesses that engage in smart advertising can promote themselves all year round. A business can do a Groupon deal at most once every few months – otherwise the deal just doesn’t seem “special” enough. Groupon is a great novelty that can help some businesses become better established, but I really wonder if many businesses would participate more than once or twice, when compared to ongoing targeted marketing initiatives.

4. Scale as the only barrier to competition.

There are now thousands of competitors to Groupon (Living Social is the largest). There will be thousands more. The reason why both companies have received such massive investments to date is that they need to get big to create a local sales force in every market in the world, which is obviously an expensive proposition. If they can get sufficiently big, they can build a sustainable business that will dissuade new market entrants simply because any competitor would have to build a worldwide localized sales force. And if you’ve ever had to run a local sales force, you know that it’s a very expensive, messy, people-driven business. The business that Groupon will eventually most resemble structurally is the Yellow Pages. With sales teams in every city, the major directory publishers were able to exert a near monopoly control over the interface between local businesses and consumers, and Groupon is going after the same market. The difference is in Groupon’s use of technology and use of social. Otherwise, the two businesses are nearly indistinguishable. The assumption is that Groupon’s scale will prevent competitors from gaining a foothold, but I don’t see any real reason a focused local competitor couldn’t develop a sustainable business.

5. Tone-deaf on China.

Groupon has undertaken a massive push to expand into China. That sounds great, and any US investor would likely salivate over such an aggressive, prescient-sounding move. Ah, that Mason guy, he really knows his stuff. My friend, China expert Christine Lu, tells me that Groupon’s Berlin office has recruited 1,000 new hires for China in the last three months – many recent college graduates. But here’s the thing. I’m currently getting a daily deal from a site in Shanghai called Wufantuan that’s indistinguishable from Groupon. (50% off Mexican food in Shanghai was one recent deal.) If you know anything about the Chinese market, you know it favors locals and cloning is part of the culture. To expect Groupon to be able to achieve anything meaningful in China is wishful thinking. Google got run out of the country on a rail. You expect the powers that be there to allow a US firm to “split” revenues with Chinese businesses to provide its budding bourgeoisie with deals on burgers, skydiving, and cupcakes? Um, yeah. OK. If there’s a business there, it will be Chinese. The entire Groupon strategy with China is theater, designed to show investors that they’re “paying attention to that market” while they ready the IPO.
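
Coming back to the business-fatigue point (number 3 above), the arithmetic is easy to sketch. Here is a rough Python illustration of what a business might net per voucher. Only the 25% share comes from the discussion above; the 40% cost of goods, the redemption rates, the function name, and the assumption that the business gets paid for every voucher sold (redeemed or not) are all mine, picked purely to show the shape of the problem.

```python
def merchant_net_per_voucher(face_value, merchant_share=0.25,
                             cost_ratio=0.40, redemption_rate=1.0):
    """Estimate what a business nets on each voucher sold.

    merchant_share  -- fraction of face value paid to the business
                       (the "typically 25%" figure from the text)
    cost_ratio      -- ASSUMED cost of delivering the goods or service,
                       as a fraction of face value
    redemption_rate -- ASSUMED fraction of vouchers actually redeemed;
                       the rest is breakage
    """
    # ASSUMPTION: the business is paid its share on every voucher sold.
    revenue = face_value * merchant_share
    # Only redeemed vouchers cost the business anything to fulfill.
    cost = face_value * cost_ratio * redemption_rate
    return revenue - cost


if __name__ == "__main__":
    for rate in (1.0, 0.8, 0.6):
        net = merchant_net_per_voucher(100.0, redemption_rate=rate)
        print(f"redemption {rate:.0%}: net per $100 voucher = ${net:.2f}")
```

With these made-up inputs, the business is underwater on every voucher until more than a third of them go unredeemed, which is one way to see why breakage alone may not rescue the deal.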

So, the real deal of the day is for Groupon itself. The question is whether there’s enough upside in the model – and enough “bigger suckers” out there for the average Joe to make any money on the offering before the business model settles out and becomes the next eBay, Home Depot, or Gap. These are fine, sustainable businesses, to be sure, but all are way less sexy than they once seemed. (Yes, for about 6 months in 1995, Gap was incredibly sexy.)

Before you decide that Groupon’s the next hot young thing, it’s worth asking whether you want to jump on this model right now. I believe there’s a really nice, long-term, but ultimately very boring business in there that should pay a nice dividend. Meantime, the visions of hypergrowth are likely much exaggerated.

I certainly can’t criticize the trajectory that Andrew Mason and company have managed to carve out for themselves. It’s an incredible story and it’ll be fascinating to see how it unfolds. The expectations are so high, they really can’t be met.

My bet is that they will need to move on to more sustainable forms of year-round marketing for businesses and away from the aggressive 50% discount model. That’s a much less sexy place to be and it will require some real creativity to carve out a niche there. But I just don’t buy the idea that they can continue to build a business based solely on deals of the day at such aggressive discounts.

The Groupon model right now is based primarily on creating new relationships between businesses and customers. They’ll be on to something really interesting when they can help to nurture and sustain those same relationships profitably.


I originally posted this as a Facebook Note on January 22nd, and have reposted it here with a few slight editorial modifications. Some of the comments regarding China are worth repeating here, and there are many other good comments on that Note that are worth checking out.

From my friend Christine Lu (@christinelu):
Thanks for the mention Dave. I think they’re hiring 1K in the next few months. As in currently in the process of. Things over there have just sounded a bit weird to be a sustainable market entry strategy so I think it’s all a nice way to have a China story to prop up the IPO. The elusive vision of 1.3 billion people using Groupon. Nevermind that clones are already saturating the market and they’ll have Alibaba’s Taobao to deal with. Anyways, we discussed it a bit on Quora.

From my friend Vivian Wang (@vivwang):
The JV is a positive differentiator for both companies and will accelerate market consolidation. There are 1686 other group shopping sites as of December, yet only 29 sites have CIECC licenses to legally operate. Some believe there are only 10 serious contenders that can attractively compete. The real threat is Alibaba and Taobao, so a more international footprint into China seems warranted. One of the smarter things Groupon did was buy Mob.ly back in May, which has been developing on all mobile platforms. For a sector that’s already doing about $79B in transactions, I think the risk seems worth taking.

Hope something truly uniquely innovative comes out of this that the world has yet to see. I’d personally love to see Tencent migrate from selling a $1B of games & virtual goods to some seriously tangible merchandise. The foolish side of me actually thinks they’ll have a fair shot at it. Should be fascinating.

And from my friend Francine Hardaway (@hardaway):
I believe all this bargain stuff, especially in the US, is part of the recession and will go away when it is over and we all relax. I agree with you 100% on Groupon’s model; I am done buying stuff I don’t need, even at half price. All the people I know who love coupons (I never have) are armed with sheaves of them, and all that happens is the merchants are in price wars with one another in a race to the bottom. Sites like Groupon and Haute Look might be marketing front ends, but they are also margin-shavers for the people in the businesses they market. This HAS to be unsustainable at the end of the day, whether China is successful or not (and I bet it won’t be, because of all the people who, when we were in China, got up and said they would clone our products in half an hour).

What do you think about Groupon?

Drop Everything and Pay Attention to Firesheep Now

Firesheep is a startling plugin that allows anyone to easily impersonate other logged-in users on dozens of sites. It works on any unencrypted WiFi connection and is stupid-simple to set up. It can be done by anyone in a matter of minutes.

Just to illustrate how easy it is to set up: I was on Virgin America flight VX67 from Washington to San Francisco yesterday.

All I had to do to get going with Firesheep was download Firefox (onto my new MacBook Air) using the in-flight WiFi, and then download the Firesheep plugin for Firefox. Just drag the plugin into Firefox and it installs. Reload Firefox and you’re ready to go.

Click “Start Capturing” and you are instantly snooping on every interaction occurring on the WiFi network. In my case yesterday, that meant snooping on everybody who was using the WiFi on my flight.

What’s At Risk?

Within just a couple of minutes, I was able to impersonate 3 people on Facebook (updating their status, exploring friends, doing anything I wanted to – of course I didn’t). Twitter is also at risk. So is Gmail. And so is Amazon.

Access to Amazon is perhaps the most worrying. Once I realized I was in under someone else’s Amazon account, I quickly shut down Firesheep: this is some scary stuff. What if I had changed the shipping address for the account and done a one-click order on a $10,000 watch or a $2,000 plasma TV?

This was all at 37,000 feet in an airplane (and way more entertaining than SkyMall). Like taking candy from a baby.

Even More Shocking…

Later in the afternoon I was at one of the Internet industry’s high-profile events: Web 2.0 Summit produced by O’Reilly. There on the hotel’s WiFi, which was set up to serve the summit, I ran Firesheep. Within seconds I had compromised about 25 accounts, including the Twitter accounts of O’Reilly Media and TechCrunch writer Alexia Tsotsis. Change passwords, tweet-as-them, friend and de-friend people? No problem. Here’s what I saw. (Note that my accounts were vulnerable as well.)


How It Works

I have not studied this exploit carefully enough yet to explain it in full detail, but my understanding is that on an open WiFi network, it’s trivial to capture in cleartext all of the web interactions of the users around you on the same IP network. Once you can do that (something Firesheep achieves using the pcap library, capturing traffic on port 80), you can sniff for credential information specific to particular websites. Firesheep supports a couple of dozen out of the box, including all major social networking sites (Facebook, Twitter, Gmail, Gowalla, Foursquare) but also some more obscure sites relevant to coders (GitHub, Pivotal Tracker). Ouch. It even has an “import” function so others can write exploits for sites that Firesheep doesn’t know about yet.

The bottom line is that these sites all need to enforce the use of HTTPS (secure HTTP) rather than HTTP *before* the login handshake occurs. This will force some emergency changes by many sites over the next few days.

This is not a new exploit – it’s always been possible to do this; Firesheep just makes it stupid easy.

A Note On Passwords vs. Encryption

You’ve encountered WiFi networks that require WEP or WPA encryption passwords. These are secure from Firesheep’s reach. However, there are a lot of WiFi networks that require “passwords” (such as those at coffee shops, hotels, etc.) that are in fact open networks. Many do not even require you to log in to them to exploit them via Firesheep. To put it in perspective, every Starbucks location is vulnerable to attack.

The only for-sure ways to stay safe from Firesheep for now are to 1) use only encrypted WiFi networks (that use WPA or equivalent), or 2) use wired networks that you trust. Any open WiFi network can and will be vulnerable to this attack until vulnerable sites switch to using HTTPS for all authentication. Be very careful out there, folks.


Update: After talking with a few folks and thinking through this exploit a little further, I can offer a somewhat more complete explanation of how it works and why blocking it is so difficult.

The exploit does not actually capture the *password* itself (which is actually transmitted using HTTPS) but rather captures the authentication credentials which are stored (and visible) in the session cookie *after* HTTPS authentication has completed.

So, even a one-time password will not address this. And the reason boils down to ads and other insecure content that folks want to serve as part of the site experience. Fixing this problem would require serving ads (and images) via HTTPS, which would require major computing resources and would have a major impact on the web.
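
For what it’s worth, here is a minimal Python sketch of one piece of such a fix: marking the session cookie “Secure” so that a browser will only ever send it over HTTPS. This is my own illustration, not anything taken from Firesheep or from any of the affected sites. The cookie name session_id and both function names are made up for the example, and it only helps if the site then serves every page that needs the login over HTTPS.

```python
from http.cookies import SimpleCookie


def build_session_cookie(session_id: str) -> str:
    """Return a Set-Cookie header value for a session cookie limited to HTTPS."""
    cookie = SimpleCookie()
    cookie["session_id"] = session_id          # hypothetical cookie name
    cookie["session_id"]["secure"] = True      # browser sends it only over HTTPS
    cookie["session_id"]["httponly"] = True    # hidden from page scripts
    cookie["session_id"]["path"] = "/"
    return cookie["session_id"].OutputString()


def is_sent_over_plain_http(set_cookie_value: str) -> bool:
    """True if a browser would also send this cookie over unencrypted HTTP."""
    # Everything after the first ";" is an attribute; look for "Secure".
    attributes = [part.strip().lower()
                  for part in set_cookie_value.split(";")[1:]]
    return "secure" not in attributes


if __name__ == "__main__":
    header = build_session_cookie("abc123")
    print(header)
    print("exposed on open WiFi:", is_sent_over_plain_http(header))
```

A cookie set without that attribute rides along on every plain HTTP request to the site, which is exactly the traffic that is visible on an open network.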

According to one security researcher I spoke to this evening (who formerly ran Yahoo mail), there’s no obvious way around this other than to allow both HTTP and HTTPS content to be served from the same site during the same session, something which presently triggers an alert to the user (which tends to freak them out). Such an alert is a good thing; turning it off is not a net gain. It shouldn’t be up to the user to sort out which of the resources a site requests need to be secure and which do not.

So, it’s a real dilemma. No one seems to be sure how to really address it other than to eliminate or curb the use of open networks, which is probably where it’s going to end up. So open WiFi is now basically over. Expect places that had been using it to post publicly available WPA passwords, which blocks Firesheep’s simple capture (though anyone who knows the shared password and captures your connection handshake can still decrypt your traffic).