Larger JPEG version of map | Huge (3.7MB) PDF version of map
Map data represents news.google.com, searched on April 9, 2003. Coloration reflects variation from predicted number of news stories.
Results from the GAP bots in HTML format. You may want to save them and open them in Excel.

Global Attention Profile - Version 0.1


Summary
GAP is a project designed to document the attention media sources pay to the different nations of the world. GAP performs automated searches on media websites and calculates how many stories each website offers per million people in a nation. On three of the four sites currently monitored, GAP uses statistical regression to estimate how many stories we'd expect per nation and reports the variation from these estimates. The map above displays these variations on data from news.google.com on 4/9/2003. Countries in white are experiencing average attention (between half as many and twice as many stories as anticipated); countries in deep red are experiencing more than 4 times as many stories as anticipated, and countries in light red are experiencing 2x to 4x as much attention as predicted. Similarly, deep blue countries are experiencing 1/4 as many stories as anticipated or fewer, while light blue are experiencing 1/2 to 1/4 as many stories as predicted.

This research is in an extremely early stage, and I'm certainly not ready to publish it. I have serious concerns about methodology (most of which are documented on this page) and am hoping to get feedback and help from friends in addressing some of these methodology concerns before publication. That said, there are already some very interesting implications of the data: nations in South America, Central Asia and Africa appear to be systematically underrepresented, while nations in Western Europe and the Middle East seem to be systematically overrepresented.

Current data sets produced by GAP 'bots are available here.


Why?
Why is global attention important? I see three major reasons: trade, aid and intervention.

Trade - As trade becomes global, it becomes crucial for nations to be globally visible as possible trading partners. India's IT revolution has been a triumph of both education and marketing - not only have India's universities developed tremendous capacity for training top IT professionals, India has also "branded" Bangalore and Hyderabad as world-class IT centers. As a result, multinational corporations have felt comfortable outsourcing major IT projects to Indian firms, spurring a high-value industry. Some middle-income nations have been engaging in branding that is almost corporate, producing inserts for magazines like Newsweek International to promote their nations as product. GAP attempts to look at how successful different nations have been at "getting their brand out".

Aid - There is a small, finite amount of money contributed by individuals and governments to provide humanitarian aid in developing and conflict-ridden nations. This money has a tendency to go towards the conflict most visible at any particular moment - one might term this the "Live Aid" effect. Nations with less well-publicized needs tend to go wanting. After US intervention in Afghanistan, substantial commitments were made by organizations and governments to rebuilding that nation. At the time, many international aid groups expressed concern that other nations were also in need of assistance and that aid to Afghanistan - the popular conflict - might detract from aid to other nations. Now that Afghanistan is no longer as prominent in global media, it's becoming clear that some of these pledged funds will not arrive and Afghanistan, too, may find itself short on reconstruction funding, as those new funds head to Iraq.

Intervention - Individual nations and multilateral coalitions have a tendency to intervene in high-visibility conflicts and to ignore conflicts in less visible nations. As a number of activists have pointed out, a justification of US intervention in high-visibility Iraq on human rights grounds ignores the low-visibility, but severe, human rights violations occurring in Sudan. Global attention makes it more likely that the UN and other peacekeeping organizations will involve themselves in the prevention of genocides - global attention likely prevented many deaths in the Balkan conflicts, while lack of attention permitted the massacres in Rwanda in 1994 to occur without outside interference.

Here's another more personal - perhaps more honest - set of reasons why I think global attention is important.


Methodology
At its most basic level, GAP is simply a set of websearches performed automatically. A simple 'bot program (many thanks to Chris Warren for this key bit of code) reads a list of search keywords and presents them to a search engine. It accepts the page returned by the engine, parses it to find the total number of results returned for that particular search and writes that number to memory. Once all searches on a particular engine are performed, the bot crunches some numbers and presents an HTML chart of results.
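
The loop described above can be sketched roughly as follows. This is a minimal illustration and not Chris Warren's actual bot: the function names, the `fetch` callable and the "of about N" pattern are all assumptions made for the example, and each real engine formats its result count differently.

```python
import re
import time

# Hypothetical pattern: assumes the results page contains text such as
# "of about 12,300". Every real engine words this differently.
RESULT_COUNT = re.compile(r"of (?:about )?(\d[\d,]*)")


def parse_result_count(page):
    """Return the total number of results reported on a page, or 0 if none."""
    match = RESULT_COUNT.search(page)
    if match is None:
        return 0
    return int(match.group(1).replace(",", ""))


def run_bot(keywords, fetch, delay_seconds=2.0):
    """Query each keyword in turn and keep the result counts in memory.

    `fetch` is any callable that takes a keyword and returns the page text.
    It is passed in so the sketch does not depend on a particular site.
    """
    counts = {}
    for keyword in keywords:
        counts[keyword] = parse_result_count(fetch(keyword))
        time.sleep(delay_seconds)  # pause between requests to stay polite
    return counts
```

Passing the page-fetching step in as a function keeps the parsing logic testable without touching a live search engine.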

Currently bots are surveying four sites: news.google.com, AltaVista, CNN.com and the New York Times (NYT).

A slightly different set of keywords is used for each engine, because each engine handles boolean search queries differently. CNN does not appear to handle them at all, and NYT handles them poorly - if more than two keywords are included on a "NOT", the NYT engine appears to handle them as an "AND". I'm aware that I'm ineptly querying Google - with my new copy of _Google Hacks_ in hand, I plan to revise the Google keywords in the next revision. The keywords I'm using are visible in the second column of the HTML results pages. A discussion of why the keywords are screwed up and how to improve them is in the "problems" section of this page.

Using population numbers from the 2002 CIA World Factbook, the bot calculates stories per million inhabitants of a nation and includes that in the output as well. In three of the four cases, it goes on to calculate a few more numbers - estimated hits, estimated stories per million and variance. The source of these three numbers is a little complicated. In a perfect world, we would expect each nation to have the same number of stories per citizen - in other words, we'd expect Mexico, with 100 million citizens, to have ten times as many stories as Malawi at 10 million. (To nobody's surprise, Malawi gets a lot fewer than 10% of Mexico's stories...)

Graph stories per million against population and a pattern becomes quickly apparent. Big countries have a lot fewer stories per million than small countries. There's a pretty logical explanation for this. Assume for the moment that China's ratio of stories per million (observed at roughly 16 on AltaVista) holds true for the rest of the globe. We then anticipate 1.6 stories for tiny little Tonga. However, every nation, no matter how small, turns up multiple search matches on the large engines - nations tend to get press for appearing in world cup soccer qualifiers or attending UN summits. Even if Tonga loses every soccer match, it's still going to show up enough to skew statistics. The same problem appears in reverse if we attempt to use a Tongan ratio for the rest of the world - we suddenly discover China needs roughly 3 million stories to be proportional, outpacing everyone's story load by a factor of five.
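
The arithmetic behind the China and Tonga example can be checked with a few lines. This is only a sketch: the helper names are invented for illustration, and Tonga's population is taken as a round 100,000, which is the figure implied by the 1.6 stories quoted above rather than a Factbook number.

```python
def stories_per_million(stories, population):
    """Stories per million inhabitants."""
    return stories / (population / 1_000_000)


def expected_at_flat_ratio(ratio_per_million, population):
    """Stories a nation would get if every nation shared one per-million ratio."""
    return ratio_per_million * population / 1_000_000


CHINA_RATIO = 16            # roughly, as observed on AltaVista in the text
TONGA_POPULATION = 100_000  # assumed round figure

# Applying China's ratio to Tonga gives the "1.6 stories" quoted in the text.
tonga_expected = expected_at_flat_ratio(CHINA_RATIO, TONGA_POPULATION)
```

Under the same flat-ratio assumption, a nation of 100 million is expected to get exactly ten times the stories of a nation of 10 million, which is the Mexico and Malawi comparison.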

Starting from the assumption that the number of stories in a given nation was a function - probably a nonlinear one - of population, I graphed stories against population on different data sets. Four outliers quickly became obvious - Iraq, Kuwait, Qatar and Guam. The first three are receiving an unusual volume of media attention because of the conflict in Iraq; Guam comes up high because both AltaVista and Google index Agana Pacific Daily News and KUAM-TV, both of which produce a high volume of Guam news stories in relation to Guam's modest population. Story counts for these four nations were more than two standard deviations above the mean stories per million on my initial Google data sets, so I removed them from my correlation curves.
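
A two-standard-deviation screen of this kind might look like the sketch below. The function name is hypothetical, and two details are assumptions rather than anything stated above: it uses the sample standard deviation, and it only flags nations on the high side of the mean.

```python
from statistics import mean, stdev


def split_outliers(per_million, threshold=2.0):
    """Separate nations whose stories per million sit far above the mean.

    `per_million` maps a nation name to its stories per million. Returns two
    dicts: the nations kept for curve fitting and the nations removed.
    """
    values = list(per_million.values())
    centre = mean(values)
    spread = stdev(values)  # sample standard deviation (an assumption)
    kept, removed = {}, {}
    for nation, value in per_million.items():
        if value > centre + threshold * spread:
            removed[nation] = value
        else:
            kept[nation] = value
    return kept, removed
```

Because the outliers themselves inflate both the mean and the standard deviation, a screen like this is conservative: only very extreme values get flagged.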

The curve that best fits the remaining 188 data points is of the following form: stories = m * population^n. Variants of this curve fit Google, AltaVista and CNN data sets with correlations ranging between R=0.49 and R=0.51. What's especially interesting on those three data sets is that, while there's a great deal of variation in "m", there's almost none in "n". On April 11th, 2003:

Site        Value of m   Value of n
AltaVista   0.0194       0.6853
CNN         0.0108       0.6718
Google      0.0421       0.6793

In other words, while these three different sites return different volumes of results (proportional to m), the distribution of these results in relationship to population fits similarly shaped curves (a function of x^n) for each data set. While it's not especially surprising that a single source's curves would not vary much over a series of days - after all, 29/30ths of Google and AltaVista's collections are unchanged from one day to the following - it is quite surprising that three different sites would show such similar power-law distributions. I was especially surprised to discover that CNN, which is indexing 7 years of content, revealed a near-identical curve to the search engines indexing 30 days' worth of content.
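
To see what these curves imply, the values of m and n from the table above can be plugged into stories = m * population^n. The sketch below assumes population is measured in individual people (the text does not state the unit), and the function names are made up for the example.

```python
# (m, n) per site, copied from the April 11th, 2003 table.
FITS = {
    "AltaVista": (0.0194, 0.6853),
    "CNN": (0.0108, 0.6718),
    "Google": (0.0421, 0.6793),
}


def expected_stories(site, population):
    """Stories predicted by stories = m * population**n for one site."""
    m, n = FITS[site]
    return m * population ** n


def expected_per_million(site, population):
    """The same prediction expressed as stories per million inhabitants."""
    return expected_stories(site, population) / (population / 1_000_000)
```

Because n is below 1 on all three sites, the predicted stories per million fall as population rises. A tenfold increase in population multiplies the predicted story count by 10^n, which is a little under five for n near 0.68, rather than by ten.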

It has been harder to find a correlation for NY Times data because the story counts are much smaller, including numerous zero results. I'm using Excel to do curve fitting, and it refuses to fit power curves against data sets that include zeros. (If anyone feels like running these sets on Data Desk or Mathematica, please feel free and let me know what correlations you get...) Trying a NYT data set with zero points removed gives an R=0.41 correlation against a power law where n=0.58, a significantly steeper curve than on the other three data sets. Until I have more confidence on the NYT numbers, I'm not using them to calculate estimated story numbers.

After m and n are calculated for the three large sources, they are plugged back into the bot, which then uses the resulting equation to predict how many stories each country should produce, for comparison with how many it actually does. Each source has its own values of m and n. Currently, I'm calculating these values based on historical data - in the future, I'd like to be able to calculate them on the fly for each data set and look for variations from those predictions. (This would allow the numbers to remain accurate even if Google decided to run with a catalog half its usual size, for instance.) To do this, I'd need to be able to do non-linear regression within the Perl script, rather than in Excel - if anyone has good formulas to do so, please let me know.
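
One common way to estimate m and n without a general non-linear solver is to take logarithms: log(stories) = log(m) + n * log(population) is a straight line, so ordinary least squares on the logged values gives n as the slope and log(m) as the intercept. The sketch below shows that technique in Python rather than Perl. It is not necessarily what Excel computes, the function name is hypothetical, and it skips zero-story nations because log(0) is undefined.

```python
import math


def fit_power_law(populations, stories):
    """Estimate (m, n) in stories = m * population**n by log-log least squares."""
    points = [
        (math.log(p), math.log(s))
        for p, s in zip(populations, stories)
        if p > 0 and s > 0  # log is undefined for zero counts
    ]
    if len(points) < 2:
        raise ValueError("need at least two nations with non-zero story counts")

    count = len(points)
    mean_x = sum(x for x, _ in points) / count
    mean_y = sum(y for _, y in points) / count
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("all populations are identical; slope is undefined")

    n = sxy / sxx                         # slope of the log-log line
    m = math.exp(mean_y - n * mean_x)     # intercept, converted back from logs
    return m, n
```

Fitting in log space minimises relative rather than absolute error, so small nations count as much as large ones. Whether that weighting is the right one for this data is a judgment call.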

Finally, the script calculates the variation between how many stories were expected (as a function of population) and how many were actually found by the bot. Results are color-coded and displayed on the table. Countries that received more than four times as many results as anticipated by the equation are colored deep red. Countries that received two to four times as many as anticipated are coded in light red. Less than a quarter as many stories as expected and the color is deep blue; half to a quarter of anticipated stories is light blue. Remaining nations are colored in white or beige to distinguish them from untracked nations, which will be colored grey. Currently, this data is mapped on world maps by hand, a painstaking and time-consuming process. I'm currently writing a Perl script that will place appropriately colored dots, proportional in size to a nation's population, on the map.
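
The color coding reduces to a handful of thresholds on the ratio of actual to expected stories. A minimal sketch follows; the function name is invented, and the treatment of values that land exactly on a boundary (2x, 1/2, 1/4) is an assumption, since the text does not say which side they fall on.

```python
def attention_color(actual, expected):
    """Map the ratio of actual to expected stories onto the map colors."""
    ratio = actual / expected
    if ratio > 4:
        return "deep red"      # more than four times as many as anticipated
    if ratio >= 2:
        return "light red"     # two to four times as many
    if ratio > 0.5:
        return "white"         # roughly average attention
    if ratio > 0.25:
        return "light blue"    # half to a quarter as many
    return "deep blue"         # a quarter as many or fewer
```

For instance, a nation with 300 stories against 100 expected comes out light red, and one with 10 against 100 comes out deep blue.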


Problems and concerns
I've got lots of them. Here are some of the highlights:

Keywords - It's very difficult to use the same keywords for every nation. While most nations give believable responses from using the common name for the nation as a quoted phrase, there are some notable exceptions.

Chad, Georgia - A search for "Georgia" will give you very little information about the Caucasus and a great deal about sports teams in the southern US. A search for "Chad" gets you lots of guys with that nickname and very few stories about the Sahara. In both cases, I'm using the capital of the nation as the search term, knowing it's skewing my results low.
Guinea - Guinea, Guinea Bissau, Papua New Guinea and Equatorial Guinea are all nations. There's also a Gulf of Guinea, a disease called Guinea Worm, not to mention guinea pigs and guinea hens. An accurate search for the nation of Guinea would require a search string that looks something like "Guinea NOT "Guinea Bissau" NOT "Papua New Guinea" NOT "Equatorial Guinea" NOT etc." Some engines will handle that, others won't. Two fixes I know need to be made in the next generation: "Niger" needs to become "Niger NOT "Niger Delta" NOT Nigeria", and Democratic Republic of Congo needs to become "Congo NOT Brazzaville".
Guam, Australia, Canada - As discussed above, Guam gets a huge amount of coverage because two news sources that primarily cover Guam are indexed by Google. Similarly, Canada and Australia have a large number of media sources listed in major engines, and tend to get heavy coverage. Bug? Feature? Given that this reflects the reality of the current indexing situation, I'm removing Guam from correlation studies, but otherwise including these results.
American Underrepresentation - The USA appears to be pretty well represented in the charts produced by GAP. I suspect it's actually underrepresented by a factor of ten or more. That's because most stories set in the USA don't involve the phrase "United States" - instead, they've got the state or city name where the story is taking place. One could obtain a great deal more precision by doing a search for all US states and major cities. Of course, if I did this, I'd need to do the same thing for Canada, Britain and other large, media-rich nations. Instead of opening that can of worms, I'm thinking about expanding US to "UNITED STATES or US or USA".
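
The exclusion strings discussed under "Guinea" above could be generated rather than typed by hand. The sketch below is only an illustration: `build_query` is a hypothetical helper, and the plain "NOT" syntax it emits is an assumption that, as noted earlier, not every engine handles.

```python
def build_query(name, exclusions=()):
    """Build a search string that excludes look-alike names and phrases.

    Multi-word exclusions are wrapped in double quotes so that an engine
    treats them as phrases.
    """
    parts = [name]
    for phrase in exclusions:
        term = '"%s"' % phrase if " " in phrase else phrase
        parts.append("NOT " + term)
    return " ".join(parts)


guinea_query = build_query(
    "Guinea", ["Guinea Bissau", "Papua New Guinea", "Equatorial Guinea"]
)
congo_query = build_query("Congo", ["Brazzaville"])
```

Keeping the exclusions as a per-nation list would also make it easier to emit a different syntax for each engine.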

Is that a nation?
For a number of the "nations" in the data set, the question of that nation's existence is a political one. I began this project with a data set derived from the CIA World Factbook. I trimmed the data to include only entries with a population of greater than 100,000. The existence of some strange entities - West Bank and Gaza Strip as separate areas - is the result of the CIA's decision, which I've not yet been moved to change. Mayotte is a territory of France, but is also claimed by Comoros. At the moment, I'm listing it as an independent entity - should it be incorporated into France? I don't know. Western Sahara claims to be independent, but is also claimed by Morocco. It's hard to know where to draw and not draw these lines.

All PR is good PR
GAP doesn't attempt to make any distinction between "good" and "bad" PR. I believe this is a feature, not a bug, but I'm anticipating possible criticism here. There are two reasons for not making this distinction. One is that it's very difficult for humans to make this distinction and near impossible for an AI system to do so. Even if I had the skills - or time! - to code a text analysis system that could attempt to classify search results as "good" or "bad", it would not be able to do so with a high degree of confidence. Furthermore, I'm not convinced this distinction needs to be made. Countries that are seeking international aid or intervention would benefit from "bad" PR. I think it would be a highly interesting study to look at perceptions of a nation based on search engine queries, but that's way beyond the scope of this study.

Bad Bots
Google explicitly prohibits automated querying of its engine. In fairness, Google tries to soften the blow of this prohibition by offering an API. Unfortunately, said API only allows queries of the main database, not the special news collection. I suspect, were I to search through the TOS on the other three sites I'm searching, I'd discover that this sort of research is frowned upon. Chris and I have modified the bots to make them fairly polite - they sleep after each request and try not to overwhelm servers. Still, I've had some unusually repeatable errors on Google and I'm beginning to wonder whether my bot will be able to continue searching news.google.com for an extended period of time. I certainly plan on building an API-compliant bot once the interface makes searching of news.google.com possible, and would happily do so for the other sites should the APIs be made available. In the meantime, though, I'll just worry about bots being blocked.

Dirty Data
I picked the worst possible week to start this project, from a statistical point of view. Global media obsession with the war in Iraq and the spread of SARS have caused a small number of nations to dominate international media. I've tried to counter the most egregious influences by pulling four countries out of my correlation data sets, but I still may be dealing with a radically distorted media picture. That said - I don't think this is that huge a problem. The CNN data, which draws from a much longer time period, and hence is less susceptible to Iraq and SARS distortions, fits a very similar curve to the curves Google and AltaVista are fitting, suggesting to me that the stories per population distribution may be fairly constant over time. We'll know more after we get a "normal" month of data sometime in the future.

M, N and how much correlation is enough?
I'm not a statistician, but I know enough to know that R=0.5 is not an overwhelming positive correlation. Were I concerned purely with demonstrating that web stories are a nonlinear, continuous function of population, I'd be more obsessed with that figure. Instead, I've got a trickier statistical problem. If R=1.0, that would demonstrate perfect correlation between population and web stories... which we know not to be true. (For a simple example of this, take a look at Japan and Nigeria, which have almost identical populations. Japan generally gets 3-4x as many stories as Nigeria across search sites.) In other words, there is no possible "perfect" model for media distribution, because media distribution in the real world is imperfect. Instead, we need to model a curve that we know will have substantial deviation. I feel reasonably confident that I'm taking the right approach to this, but would be very, very grateful for outside opinions, especially those of my friends in the hard sciences.

Slow data change
Search engines change slowly - most attempt to grow, rather than replace their current catalogs. I chose news.google.com as my first target because, if they are to be believed, their collection turns over every 30 days, allowing us to see different data sets on a monthly basis. We won't have that opportunity with CNN or several other possible targets. And searches that look at a search engine's entire catalog rather than a subset are unlikely to change very much on a daily basis. Again, a bug, or just a reflection of reality? I don't know yet.


Conclusions and correlations
Or, "so what does this tell us, anyway?" Well, it's early, and I haven't tried very hard to cross-correlate data yet. Still, here are a couple of observations I've made so far:

People talk about you if you're rich
Looking at the results of Google, AltaVista and CNN.com searches on April 13, 2003, the top twenty countries, in terms of GDP per capita, are pretty well represented in the media. In fact, Google colors 2 white (between half and twice expected results), 13 light red (two to four times expected results) and five deep red (more than four times expected results). AltaVista colors one (Austria) white, 11 light red and eight deep red. CNN also colors Austria white, 12 light red and 7 deep red. (Don't cry for Austria - they still average 60% more stories than most nations their size.) Just as a side note, this subset includes enormous nations like the US, Japan and Germany, as well as tiny ones like Luxembourg and Iceland.

People don't talk about you if you're poor. Unless you're being shot at. Sometimes.
Looking at the same data set, the results for the twenty poorest nations in terms of GDP per capita are radically different. On all three sets, the majority of countries are colored light blue (half to a quarter of stories anticipated) or dark blue (less than one quarter of stories anticipated.) Afghanistan is deep red on all three sets, and the Gaza Strip is light red on CNN, dark red on the other two. The distribution is slightly less stark than the top 20. Google has 2 deep red, 5 white, 3 light red, and 10 deep blue. AltaVista has 2 deep red, 4 white, 5 light blue and 9 deep blue. CNN has 1 deep red, 1 light red, 7 white, 4 light blue and 7 deep blue.

Conflict doesn't guarantee attention, however. Of those bottom 20, 9 have had major conflicts in the past 5 years. While Gaza and Afghanistan have received substantial attention, Rwanda and Eritrea have received only average attention. Burundi, Democratic Republic of Congo and Sierra Leone have received less than average attention, and Guinea-Bissau and Ethiopia have received far less than average attention.

Violence doesn't necessarily lead to attention
The good folks at the Department of Peace and Conflict Research at Uppsala University in Sweden publish a fascinating report on conflict in the world from 1946-2001. The report attempts to document every conflict in the world, civil or international, small or large, during that time period. According to the report, 34 nations have been involved in "intermediate" conflicts or "war" level conflicts between 1998 and 2001, the end of the study. ("Intermediate" conflicts are ones where more than 1000 people die in total, but not in a single year. Wars have more than 1000 deaths in a single year. "Minor" conflicts cause fewer than 1000 deaths.) Nations make the list if they've hosted a conflict - the US makes it for 9/11 and the UK for the "troubles" in Northern Ireland. The list also includes perpetual hotspots like Israel and the nascent Palestinian states, Afghanistan and the Indian/Pakistani border.

It also includes conflicts in a lot of places people can't place on a map, including Uganda, Sudan and Burundi. As it turns out, conflict doesn't look like a strong correlator to attention. Of the 34 nations, 7 are colored dark red (on average, from our three engines on 4/13), 3 light red, 9 white, 11 light blue and 4 dark blue. In pure numerical terms, it looks like conflict areas may be slightly below average in terms of media attention. I hope to take a closer look at more recent data, and also to attempt to correlate attention to casualties. If anyone has a good source of current conflicts and estimated casualties, I'd be very grateful for the reference.

GDP and attention
There appears to be a correlation between total GDP and attention that's much stronger than the correlation between GDP per capita and stories. Graphing a nation's total GDP in purchasing power parity dollars versus stories on Google on 4/13 gives us the graph that follows below. (Iraq, Kuwait, Qatar and Guam are not in this set for reasons mentioned previously. Western Sahara is eliminated because I lack consistent GDP information.) Correlation is R=0.6659, suggesting a fairly strong correlation. Of course, discovering that a nation's wealth correlates to the attention we pay to it surprises no one. Still, it's nice to have hard data to support that cynical presupposition.


Next steps

I'm open to suggestions for other next steps, as well as any and all questions about methodology, source data, etc. Please email me at ethan@geekcorps.org. Should there be sufficient interest, I'd be happy to start a mail list for discussions about next steps on the project - let me know if you'd be interested in commenting or participating on an ongoing basis.


Thanks
Many thanks to Chris Warren for coding the original bot, to the Berkman Center for hosting these pages (and bots!), to Rachel and everyone else who's patiently listened to me babble about this project over the past couple of weeks, and in advance to anyone who sends me helpful feedback, encouragement or criticism.

-Ethan Zuckerman, 4/14/03