November 23, 2013

Screenshots from BOUNDLESSINFORMANT can be misleading


Over the last months, a number of European newspapers published screenshots from an NSA tool codenamed BOUNDLESSINFORMANT, which were said to show the amount of data that NSA collected from those countries.

Most recently, a dispute about the numbers mentioned in a screenshot about Norway prompted Snowden-journalist Glenn Greenwald to publish a similar screenshot about Afghanistan. But as this article will show, Greenwald's interpretation of the latter was wrong, which also raises new questions about how to make sense of the screenshots about other countries.


Norway vs Afghanistan

On November 19, the website of the Norwegian tabloid Dagbladet published a BOUNDLESSINFORMANT screenshot which, according to the paper, showed that NSA apparently monitored 33 million Norwegian phone calls (although actually, the NSA tool only presents metadata).

The report by Dagbladet was almost immediately corrected by the Norwegian military intelligence agency Etterretningstjenesten (or E-tjenesten), which said that they collected the data "to support Norwegian military operations in conflict areas abroad, or connected to the fight against terrorism, also abroad" and that "this was not data collection from Norway against Norway, but Norwegian data collection that is shared with the Americans".

Earlier, a very similar explanation was given about the data from France, Spain and Germany. They too were said to be collected by French, Spanish and German intelligence agencies outside their borders, such as in war zones, and then shared with NSA. NSA director Keith Alexander added that these data were from a system that contained phone records collected by the US and NATO countries "in defense of our countries and in support of military operations".

Glenn Greenwald strongly contradicted this explanation in an article written for Dagbladet on November 22. In trying to prove his argument, he also released a screenshot from BOUNDLESSINFORMANT about Afghanistan (shown down below) and explained it as follows:
"What it shows is that the NSA collects on average of 1.2-1.5 million calls per day from that country: a small subset of the total collected by the NSA for Spain (4 million/day) and Norway (1.2 million).

Clearly, the NSA counts the communications it collects from Afghanistan in the slide labeled «Afghanistan» — not the slides labeled «Spain» or «Norway». Moreover, it is impossible that the slide labeled «Spain» and the slide labeled «Norway» only show communications collected from Afghanistan because the total collected from Afghanistan is so much less than the total collected from Spain and Norway."


Global overview

But Greenwald apparently forgot some documents he released earlier:

Last September, the Indian paper The Hindu published three lesser-known versions of the BOUNDLESSINFORMANT global overview page, showing the total amounts of data sorted in three different ways: Aggregate, DNI and DNR. Each results in a slightly different top 5 of countries, which is also reflected in the colors of the heat map.

In the overall (aggregated) counting, Afghanistan is in second place, with a total of over 2 billion internet records (DNI) and almost 22 billion telephony records (DNR) counted:




The screenshot about Afghanistan published by Greenwald only shows information about some 35 million telephony (DNR) records, collected by a facility only known by its SIGAD US-962A5 and processed or analysed by DRTBox. This number is just a tiny fraction of the billions of records from both internet and telephone communications from Afghanistan as listed in the global overview.
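To put that gap in numbers, here is a rough back-of-the-envelope calculation using only the rounded figures quoted above. Treating "almost 22 billion" as exactly 22 billion is an approximation, and the variable names are just illustrative:

```python
# Rounded figures quoted in the article; "almost 22 billion" is approximated as 22e9.
screenshot_dnr = 35_000_000        # DNR records in the Afghanistan screenshot
overview_dnr = 22_000_000_000      # DNR records for Afghanistan in the global overview

# Share of the overview's telephony total that the single screenshot accounts for.
share = screenshot_dnr / overview_dnr
print(f"Screenshot covers about {share:.2%} of the overview's DNR total")
```

With these rounded inputs the screenshot accounts for roughly 0.16% of the telephony records in the global overview.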


Differences

With these big differences, it's clear that this screenshot about Afghanistan is not showing all data which NSA collected from that country, not even all telephony data. The most likely option is that it only shows metadata from telephone communications intercepted by the facility designated US-962A5.

That fits the fact that this SIGAD denotes a sub- or even sub-sub-facility of US-962, which means there are more locations under this collection program. Afghanistan is undoubtedly being monitored by numerous SIGINT collection stations and facilities (like US-3217, codenamed SHIFTINGSHADOW, which targets the MTN Afghanistan and Roshan GSM telecommunication companies), so seeing only one SIGAD in this screenshot means that it cannot be showing the whole collection from that country.

This means that Greenwald's argument against the data being collected abroad no longer holds (although there may be other arguments against it). Glenn Greenwald was asked via Twitter to comment on the findings of this article, but there was no reaction.


More questions

The new insight about the Afghanistan data means that the interpretation of the screenshots about other countries can be wrong too. Especially those showing only one collection facility, like France, Spain and Norway (and maybe also Italy and The Netherlands), might not be showing information about that specific country, but maybe only about the specific intercept location.

This also leads to other questions, like: are these really screenshots (why is there no classification marking)? Are they part of other documents or did Snowden himself make them? And how did he make the selection: by country, by facility, or otherwise?

There are many questions about NSA capabilities and operations which Snowden cannot answer, but he can answer how exactly he got to these documents and what their proper context is. Maybe Glenn Greenwald also knows more about this, and if so, it's about time to tell that part of the story too.


> See also: The BOUNDLESSINFORMANT interface


Links and Sources
- Le Monde/BugBrother: La NSA n’espionne pas tant la France que ça
- Volkskrant.nl: Bespioneerde de NSA ons of hebben wij zelf afgeluisterd?
- MatthewAid.com: Greenwald’s Interpretation of BOUNDLESSINFORMANT NSA Documents Is Oftentimes Wrong
- Dagbladet.no: NSA-files repeatedly show collection of data «against countries» - not «from»
- WSJ.com: Europeans Shared Spy Data With U.S.
- Cryptome.org: Some thoughts and explanations about the BOUNDLESSINFORMANT numbers

11 comments:

Anonymous said...

Greenwald is closer to correct than Officials.


Collection in Norway is on SIGAD US-987F
In France: US-985D
Afghanistan: US-962A5 (or -AS).
Spain: US-987S
Germany: US-987LA and US-987LB

France, Afghanistan and Spain all use DRTBOX

If the data was being collected by Norway and/or France in Afghanistan, then giftwrapped for the NSA, Why are the SIGADS all US? Shouldn't they be FR-xxxx or at least USJ-XXXX, or maybe a DS-something? A designation of US indicates full US control and ownership of infrastructure.

The Spanish and Dutch reactions are different from the responses of Norway, France, and Italy.

The Dutch have confirmed that their citizens are being spied on. The Spanish want to prosecute someone, it's not entirely clear who, but it's not the Newspaper or Greenwald.

The Italians aren't saying anything specific other than saying that the rights of Italians aren't being violated. Which is very very far from a denial that they are being spied on. I notice that the NSA did not deny the Cryptome report on Italy and others, only the newspaper reporting.

That said, Greenwald is quite mistaken about the Brazil numbers, or else the O Globo report is wrong. The O Globo report said that there were no precise numbers for Brazil but that it was behind the US, which was 2.3 billion, (DNR).

So the US is 2.3bn not Brazil.

Anonymous said...

While some of your points re. misinterpretation of the metadata stats as displayed on those slides may have merit, the slides from The Hindu you refer to don't serve to clear things up.

Yes, the aggregate stats for Afghanistan are ~2 billion DNI and ~22 billion DNR, respectively. But, as one would expect for aggregates, those very likely represent *all records pertaining to collection against Afghanistan in the database*.

The ~35 million DNR figure, however, quite clearly only covers the *last 30 days* as indicated in the respective slide, a timeframe spanning from ~ Dec. 10 2012 to beginning of Jan. 2013. This figure is consistent with what one would expect from a country like Afghanistan -- largely rural, limited technical infrastructure. Eyeballing and adding the frequency bars in the 30 days slide supports this notion -- it seems to work out to ~35 million records.

Besides, NSA, in various documents clearly states that BO counts collection *against a given country* in the regional / map view (as opposed to the SIGAD view of which we have not seen much). Those documents also make clear, that they *process the actual DNR / DNI metadata* to produce the output stats. I.e., a phone connection involving an Afghan party intercepted by / in, say, Norway would *still* count as collection against Afghanistan -- not Norway -- and show up in the DNR stats for Afghanistan. That's all according to NSA's own material.

So, either Greenwald et al. are mostly right or NSA has been misrepresenting their own system to their own staff (which would be quite disturbing).

Anonymous said...

@anonymous (November 24, 2013 at 12:39 PM)

The map view is also last 30 days. There is a green text box below the "OVERVIEW" frame. It's hard to read in this version. But you can spot it in Le Monde's version.

http://cryptome.org/2013/10/nsa-boundless-informant.pdf

You can do better than eyeball, you can measure pixels and calculate the scale. Having done that calculation in other cases, I used GIMP's measuring tool, but there are other ways of counting pixels. I know you will come VERY close to the true numbers. (Yes, I'm the dude who came up with the 46 million figure for Italy and posted it to Cryptome) I haven't run the calculation for Norway and Afghanistan yet.

That said, the balance of truth is with Greenwald at this juncture.

http://cryptome.org/2013/10/nsa-boundless-informant-images.htm
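The pixel-measuring approach described in the comment above can be sketched as follows. This is only an illustration that assumes a linear vertical axis; the function name `pixels_to_records` and all measurements are hypothetical and not taken from any real screenshot:

```python
def pixels_to_records(bar_heights_px, ref_px, ref_value):
    """Convert bar heights in pixels to record counts on a linear axis.

    ref_px is the pixel height of a labelled axis tick above the baseline,
    ref_value is the number printed next to that tick.
    """
    records_per_px = ref_value / ref_px
    return [round(h * records_per_px) for h in bar_heights_px]


# Hypothetical measurements, NOT taken from any published slide.
heights = [120, 95, 60, 150]
counts = pixels_to_records(heights, ref_px=200, ref_value=2_000_000)
print(counts, sum(counts))
```

The accuracy of such an estimate depends on the image resolution: at this made-up scale one pixel already stands for 10,000 records.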

P/K said...

As I said in my article, there are now even more questions about how to interpret the screenshots. For example the US-SIGADs and what and where they are collecting.

But remember that Der Spiegel wrote earlier that US-987 was from a range of 3rd party SIGADs, which seems to indicate that there are also US-SIGADs used for collection facilities of 3rd party agencies. Maybe a reason for that is that NSA provided equipment, training, personnel, etc.

One big question is also: why are these data counted as being Norwegian, Spanish, Dutch, etc? According to the Boundless FAQ document, this results from metadata records containing at least one phone number from a particular country. This means at least one end of the communication had to be a Norwegian, Spanish, Dutch number. It also means that a phone call between someone in say Norway and someone in Afghanistan would be counted twice: once for Norway and once for Afghanistan.

It's also puzzling how a country like Afghanistan can produce several billions of communications records. One guess of someone else is that a single communication is intercepted multiple times because of all the different systems targeting that country.

But even if each communication were collected 10 times, there would still be some 2 billion telephone records from Afghanistan a month (the BoundlessInformant global overview is also showing "last 30 days") - which is far more than the 35 million shown in the screenshot presented by Greenwald.
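The counting rule mentioned in this comment (a record counts for every country that has at least one phone number in it) can be illustrated with a small sketch. The records and country codes below are made up, and the rule itself is only the reading of the Boundless FAQ given above, not a confirmed description of the tool:

```python
from collections import Counter

# Hypothetical metadata records: (calling country, called country).
records = [
    ("NO", "AF"),
    ("NO", "NO"),
    ("AF", "AF"),
    ("ES", "AF"),
]

per_country = Counter()
for caller, callee in records:
    # A record counts once for every distinct country it involves.
    for country in {caller, callee}:
        per_country[country] += 1

print(dict(per_country))
print(sum(per_country.values()), "country counts from", len(records), "records")
```

Under this rule the per-country totals add up to more than the number of records, because every cross-border call appears under two countries.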

Anonymous said...

@Anonymous (November 24, 2013 at 7:55 PM)
@P/K

I stand corrected. Thank you! The mystery remains.

Perhaps it's just a bug in BOUNDLESSINFORMANT. Now, if we only had access to FLAWMILL, we could actually report it... ;)

Anonymous said...

@PK I agree, we have now many more questions. All parties involved need to produce more documents (Greenwald, the NSA and the various internationals); then we can start comparing notes!

We could also do with some documents that mention BOUNDLESSINFORMANT but aren't actually about it. Those would be most revealing.

jbond@MI5.mil.gov.uk said...

I'm seeing a great deal of confusion out there about NSA databases and how reports are generated from their architecture. Here is how it works:
Let's begin with rows and columns making up a matrix, variously called a table, array, grid, flatfile database, or spreadsheet. In the database world, rows are called records, columns are called fields, and the individual boxes specified by row and column coordinates -- which hold the actual data -- are called cells.
For cell phone metadata, each call generates one record. NSA currently collects 13 fields for that call, such as To, From, IMEI, IMSI, Time, Location, CountryOrigin, Packet etc etc, primarily from small Boeing DRTBOXs placed on or near cell towers.
Because metadata from a single call can be intercepted multiple times along its path, generating duplicative records, NSA runs an ingest filtering tool to reduce redundancy, which is possible but not trivial because metadata acquisitions may not be entirely identical (eg timing). After this refinement, one call = one metadata record = one row x 13 columns in the BOUNDLESS INFORMANT's matrix.
Cell phone metadata is structured, unlike content (he said she said). However, as collected from various provider SIGADs, it is not cleanly or consistently structured -- see the messy example at wikipedia IMSI. So another refinement is needed: NSA programmers write many small extractors to get the metadata out of its various native protocols into the uniformly formatted taut database fields that it wants.
After all this, for a hundred calls, a metadata database such as BOUNDLESS INFORMANT consists of 100 records and 13 fields so 100 x 13 = 1300 cells. A counting field (all 1's) and consecutive serial numbers (indexing field) for each record may be added to facilitate report generation and linkage to other databases, see below.
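The redundancy filtering described above can be sketched roughly like this. It is a toy illustration, not NSA's actual ingest tool: the `CallRecord` fields, the `deduplicate` function and the 2-second tolerance are all assumptions made for the example:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    caller: str
    callee: str
    start: float  # call start time in seconds


def deduplicate(records, tolerance=2.0):
    """Keep one record per call: records with the same caller and callee whose
    start time lies within `tolerance` seconds of the last kept one are dropped."""
    kept = []
    last_kept = {}  # (caller, callee) -> start time of the most recently kept record
    for rec in sorted(records, key=lambda r: r.start):
        key = (rec.caller, rec.callee)
        prev = last_kept.get(key)
        if prev is not None and rec.start - prev <= tolerance:
            continue  # near-identical intercept of a call we already have
        kept.append(rec)
        last_kept[key] = rec.start
    return kept


intercepts = [
    CallRecord("A", "B", 100.0),
    CallRecord("A", "B", 100.8),  # same call seen by a second collector
    CallRecord("A", "B", 500.0),  # a later, separate call
    CallRecord("C", "B", 100.3),
]
unique = deduplicate(intercepts)
print(len(intercepts), "intercepts ->", len(unique), "calls")
```

The tolerance is what makes this "possible but not trivial": set it too small and duplicates slip through, too large and separate calls get merged.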

-1- The first point of confusion is between BOUNDLESS INFORMANT as a flatfile database (we've never seen a single row, column or cell of it) and the one-page summary reports that can be generated using BOUNDLESS INFORMANT as the driving database (eg, the Norway slide).
These BOUNDLESS INFORMANT reports give the number of records (rows) in the table after various filters have been applied (eg country, 1EF = one end foreign, specified month, DNR type, intercept technology used, legal authority cited FISA vs FAA vs EO 12333).

BOUNDLESS INFORMANT does NOT report the number of cells nor gigabytes of storage taken up. It easily could, but it doesn't. Instead, it reports the main object of interest: the number of calls, after some filtering scheme has been applied.
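Counting rows after applying filters, as described above, might look like this in miniature. The rows and field names are invented for the example and say nothing about the real schema:

```python
# Hypothetical rows; field names are illustrative, not NSA's actual schema.
rows = [
    {"country": "NO", "type": "DNR", "day": 10, "tech": "DRTBOX"},
    {"country": "NO", "type": "DNR", "day": 11, "tech": "DRTBOX"},
    {"country": "NO", "type": "DNI", "day": 11, "tech": "OTHER"},
    {"country": "AF", "type": "DNR", "day": 11, "tech": "DRTBOX"},
]


def count_records(rows, **filters):
    """Count the rows whose fields match every given filter value."""
    return sum(all(row.get(k) == v for k, v in filters.items()) for row in rows)


no_dnr = count_records(rows, country="NO", type="DNR")
drtbox = count_records(rows, tech="DRTBOX")
print(no_dnr, drtbox)
```

The same table gives different totals depending on which filters are applied, which is why a single number on a slide is hard to read without knowing the filter behind it.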

-2- The second point of confusion arises over database viewing options. Myself, I like scrolling down row after row, page after page, plain black text in 8 pt courier font, lots of records per screen, thin lines separating cells, no html tables. A lot of people don't.
So a cottage industry has evolved around generating pretty monitor displays, web pages, and ppts from databases; these typically display one record per screen. All database views are equivalent: given a presentation, you can recover the database; given the database, you can make the pretty user interface.
Views are dressed up injecting the data fields into a fixed but fancy template (eg dept of motor vehicles putting your picture field into an antique wood frame and your name field into drop-shadow text). Nothing but a warmed-over version of spewing out form letters by mail-merging an address database into a letter template.
We've not seen *any* view of BOUNDLESS INFORMANT records to date, only summary reports it has generated. You cannot recover the underlying database from a few summary reports, only information about the number of records and a few of the 13 fields.
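The "mail-merge" idea of injecting summary fields into a fixed template can be shown with Python's standard `string.Template`. The template text and the values are purely hypothetical and do not reproduce the layout of any real slide:

```python
from string import Template

# Hypothetical summary values and template; not the layout of any real slide.
summary = {"country": "Exampleland", "dnr": 1_250_000, "days": 30}
slide = Template("$country - last $days days\nDNR records: $dnr")

# Fill the placeholders with the summary fields.
report = slide.substitute(summary)
print(report)
```

Only the three summary values end up in the report, so nothing about the individual records behind the total can be recovered from it.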

jbond@MI5.mil.gov.uk said...

-3- The third point of confusion: a given database like BOUNDLESS INFORMANT is capable of self-generating many summary reports about itself. Summary reports can have views too -- injections into templates. We've seen 3 of them for BOUNDLESS INFORMANT, Aggregate, DNI and DNR.
Databases can be sorted, according to the values in any column. For example, if NSA sorted by IMSI, that would pull together all the call records made from a particular cell phone with that id. Using the counting field, the activity of each phone could then be tallied. Or they could sort to pull up the least active phones -- to identify the user who tosses her 'burner' phones in the trash after one use.
Databases can be restricted. If NSA wanted to count the number of distinct cell phone calls during a given month that originated in Norway and terminated abroad (1EF one end foreign), it can restrict the records to the relevant time and location fields, masking out the others. They could compress each cell phone to a single line and count rows to get summary data on the number of phones doing 1EF. That summary data could be injected into a template for a BOUNDLESS INFORMANT slide.
Databases can be queried (tasked) to pull out only those records satisfying some string of selector logic. For example, you could submit a FOIA request to NSA in the form of a query that consisted of your selectors and a database like BOUNDLESS INFORMANT to see what call metadata they have on you in storage.
Here you would be wise to request simple output (rows of plain text with column values separated by commas, i.e. CSV format), to keep file size down. Then you could make your own mail-merge templates and spew out colorful BOUNDLESS INFORMANT graphs and reports about yourself, or just use the default templates provided by Excel.
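The sorting, restricting and CSV steps from this comment can be sketched on a toy table. All IMSI values, field names and records are made up for the illustration:

```python
import csv
import io
from collections import Counter

# Hypothetical call records; IMSI values and field names are made up.
calls = [
    {"imsi": "001", "origin": "NO", "dest": "AF"},
    {"imsi": "002", "origin": "NO", "dest": "NO"},
    {"imsi": "001", "origin": "NO", "dest": "ES"},
    {"imsi": "003", "origin": "AF", "dest": "AF"},
]

# Tally activity per phone (grouping by IMSI and counting).
activity = Counter(c["imsi"] for c in calls)
least_active = [imsi for imsi, n in activity.items() if n == 1]

# Restrict to calls that start in Norway and end abroad, then count distinct phones.
one_end_foreign = [c for c in calls if c["origin"] == "NO" and c["dest"] != "NO"]
phones_1ef = {c["imsi"] for c in one_end_foreign}

# Plain CSV output of the restricted records.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["imsi", "origin", "dest"])
writer.writeheader()
writer.writerows(one_end_foreign)
csv_text = buf.getvalue()
print(csv_text)
```

Note that two call records collapse into a single phone once the restricted rows are grouped by IMSI, so "number of calls" and "number of phones" are different summary figures.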


jbond@MI5.gov.uk said...

-4- Next up on confusion, relational databases. NSA maintains hundreds of separate flatfile databases that might however share a field or two in common, for example someone texting, google searching, or shopping as well as making phone calls with a given phone, the number or IMSI being the common field.
Those other activities involve different fields from those already in BOUNDLESS INFORMANT, such as your login to eBay or search term text instead of email subject line.
It could all be put into BOUNDLESS INFORMANT by expanding the number of fields. However this doesn't scale very well: it results in the voice call fields being massively blank for an IMSI making lots of google searches, creating a huge sparse table that is very slow to process, wasting analysts' time (called high latency by NSA).
Instead, BOUNDLESS INFORMANT will just link to all the other databases which share a field. And those in turn could link to other simple databases sharing some other field that BOUNDLESS INFORMANT might lack. And so on -- it's how all the little constituent databases can be seamlessly integrated.
A query now calls through to this whole federation of linked databases, which can reside geographically anywhere on the Five Eyes network (though NSA is moving to one stop shopping from their Bluffdale cloud to improve security and reduce latency).
The primary provider of relational database software of this complexity is Oracle. However you can do about all of it free and friendly with open source MySQL. The Q is for querying -- what NSA calls tasking -- sending off some long-winded boolean logic string of field selector values and constituent databases that does the filtering you want.
The result of the query is a new little database, usually temporary, that you can use to generate fancy views and summary reports. The databases being updated continuously and storage retention varying, the same query tomorrow will give a slightly different outcome.
Your all-about-me FOIA request could be formulated in MySQL (first need to know names of linked databases) and surprisingly, the query string would be recognized and fulfilled by Oracle or whatever big relational database NSA ended up using/developing, it's that standardized.
If you're online or call a lot, that could still be a big file given 12 agencies keeping tabs, notably NSA, Homeland Security, and FBI's DITU. But if you wrote the query right, it would only take a small data center in the garage to host the response.
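Linking two small tables through a shared field, as described above, can be tried out with SQLite from Python's standard library. SQLite is used here only as a convenient stand-in for Oracle or MySQL, and the table names, columns and values are all hypothetical:

```python
import sqlite3

# In-memory database with two small tables that share the "imsi" field.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE calls (imsi TEXT, callee TEXT);
    CREATE TABLE searches (imsi TEXT, term TEXT);
    INSERT INTO calls VALUES ('001', '555-0100'), ('002', '555-0101');
    INSERT INTO searches VALUES ('001', 'weather'), ('001', 'train times');
""")

# One query pulls matching rows from both tables via the common field.
query = """
    SELECT c.imsi, c.callee, s.term
    FROM calls AS c
    JOIN searches AS s ON s.imsi = c.imsi
    WHERE c.imsi = ?
    ORDER BY s.term
"""
result = conn.execute(query, ("001",)).fetchall()
print(result)
conn.close()
```

Each table keeps only its own fields, and the join builds the combined view on demand, which avoids the huge sparse table described above.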

John Connor said...

Is this the DRTBOX? I think so
http://drti.com/

P/K said...

@ jbond:
Thank you very much for your very interesting and detailed explanation! It's very helpful to get a better idea of how these systems work.

@ John Connor:
Yes it is, I'll publish an article about DRT/BOX soon.