
Benford Numbers and Government Accounting Fraud



Benford's Law states that the leading digits of numbers in many naturally occurring data sets follow the same recognizable logarithmic distribution. Benford-distributed numbers -- like the Fibonacci sequence -- are commonly found in nature wherever properly randomized quantities can be expected (such as the digits of building heights, population figures and accounting balances).
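Concretely, the law predicts that a number's leading digit d occurs with probability log10(1 + 1/d); a minimal sketch:

```python
import math

# Expected frequency of each leading digit d under Benford's Law:
# P(d) = log10(1 + 1/d), for d in 1..9.
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# The digit 1 leads roughly 30.1% of values; the digit 9 only about 4.6%.
```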

In times of financial distress it is helpful to look for overall financial oddities, rather than examining every account in detail, when searching for specific infractions (such as fraud, colloquially known as cooked books). For many organizations in or near the red, there is no obvious or direct trail of brickwork leading to the collapse. Local governments are particularly susceptible to the sort of unsophisticated invention of numbers that produces figures violating Benford's Law: they are less visible than state or federal governments, and their financial and accounting departments tend to be less organized and less well funded.

As an example, capital expenditures (CapEx) in some county (x) in Iowa in 2009 amounted to $1,840,056. By taking each individual account balance contributing to that sum and comparing it to historical data, we can find significant repetitions of the digit '4' that may amount to unusual, non-randomized patterns. The same theory of fraudulent stratifications was applied during the EU debt crisis to find 'cooked books' in countries like Greece: the study "Fact and Fiction in EU Government: Economic Data", by Stefan Engel, found that Greece ranked as the EU country most likely to have committed acts of fraud. Unfortunately, the numbers were only crunched ten years after the primary fraud period.

By looking at aggregate number sets for each individual county, our analysis will rank the areas where "under-randomization" is most pronounced, and therefore where investigating accounts for evidence of human-generated numbers not representative of the real county budget values is likely to be most productive.

As such, we propose to obtain and analyze financial data from counties around the US, looking for statistically significant deviations from Benford distributions.
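One plausible deviation measure (our exact metric may differ) is Pearson's chi-square statistic comparing observed leading-digit counts against Benford's expected frequencies; the balances and county labels below are hypothetical, for illustration only:

```python
import math
from collections import Counter

# Benford's expected leading-digit probabilities.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def chi_square_vs_benford(values):
    """Pearson chi-square of observed leading digits against Benford's Law.
    Larger values mean stronger deviation (8 degrees of freedom)."""
    digits = [int(str(abs(v)).lstrip("0.")[0]) for v in values if v]
    n = len(digits)
    observed = Counter(digits)
    return sum(
        (observed.get(d, 0) - n * p) ** 2 / (n * p)
        for d, p in BENFORD.items()
    )

# Hypothetical per-county balances; counties with larger statistics are
# flagged for closer scrutiny, not declared fraudulent.
counties = {
    "A": [1840056, 14200, 1733, 190, 1065, 12, 158, 17],
    "B": [44000, 41200, 4400, 47, 443, 4999, 41, 46],  # suspiciously 4-heavy
}
ranking = sorted(counties, key=lambda c: chi_square_vs_benford(counties[c]),
                 reverse=True)
```

A real analysis would also need a significance threshold (e.g. the chi-square critical value at 8 degrees of freedom) before flagging any county.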

It is important to note that, rather than proving fraud, the aim of our analysis is to help journalists, accountants and anyone with a vested interest in the topic direct their investigative efforts by showing where, in the data, there is cause for suspicion of fiscal infractions.

Goals include:
#1 -- present graphics and statistics of our findings to demonstrate the potential of the approach
#2 -- create an online package (such as an R package) replicating the Benford functionality to flag potentially fraudulent activity in comparable environments.

Goals do not include:
#1 -- prove fraud; this program will NOT show conclusive evidence of fraud. Rather, it should more accurately direct the efforts and scrutiny of journalists.

Our data comes from many places for local government, most notably:

Location: Columbia University

Category: Best in Potential.

Languages: R, Python
Python Packages: ipython, matplotlib, pandas, numpy



Source data was cleaned and put into consistent csv formats. Accessing functions were written in R to parse the data. Statistical functions and metrics were written in R to analyze the data.



Data Cleaning:
Source data were downloaded from the Iowa and North Carolina government websites. The data were initially in the .xls format and were then converted into .csv files. This conversion was done by hand to ensure data cleanliness and consistency.
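The conversion here was manual, but the equivalent step could be scripted; a pandas sketch, with the filename and column names hypothetical (a stand-in frame replaces the real spreadsheet below):

```python
import pandas as pd

# Hypothetical: read one state's balance sheet and write a normalized CSV.
# df = pd.read_excel("iowa_county_balances.xls")  # the real .xls input
df = pd.DataFrame({                               # stand-in for the sheet
    "county": ["Adair", "Adams"],
    "year": [2009, 2009],
    "balance": [1840056, 912345],
})
csv_text = df.to_csv(index=False)  # or df.to_csv("iowa.csv", index=False)
```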

Accessing Functions: Once the data is in .csv format, it is easy to load into R, but one still needs some basic data structure to use it effectively. To this end a basic API was programmed, allowing the data to be accessed by descriptors such as "county", "year", etc.
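The actual accessors were written in R and are not reproduced here; an equivalent sketch in pandas (function name, column names, and data all hypothetical):

```python
import pandas as pd

def get_rows(df, county=None, year=None):
    """Select balances by descriptors such as county and year."""
    if county is not None:
        df = df[df["county"] == county]
    if year is not None:
        df = df[df["year"] == year]
    return df

# Stand-in for the cleaned CSV data.
data = pd.DataFrame({
    "county": ["Adair", "Adair", "Adams"],
    "year": [2008, 2009, 2009],
    "balance": [1200, 1840056, 912345],
})
subset = get_rows(data, county="Adair", year=2009)
```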

Statistics: The functional aspects of the package were separated from the API. Initially this allowed the simultaneous use of Python and R: R provided the data structure and passed subdata (i.e. specific columns/rows/etc.) to Python statistical functions by way of the command line. Once implemented cleanly, this became a clear bottleneck due to the extra IO operations involved. Luckily, the (now quite clean) Python code could easily be rewritten in R. As a result no Python is called any longer, which makes the program faster and simplifies the codebase and its dependencies.



Sunday, 13:55 - CHECK IN - pedroavila
Tested code and produced output.

Sunday, 12:00 - CHECK IN - pedroavila
Pedro and Thomas discussed merits of Python to R interface, and decided to scrap it, converting all Python functions to R. Carlos and Leon continued to put together the final presentation, determining which graphics would be necessary.

Sunday, 09:19 - CHECK IN - pedroavila
Thomas, Pedro, Leon and Carlos regrouped to discuss current state and next steps.
We agree that we have shown (both in Python and in R) how the number distributions and the formulas work and sufficiently understood how to apply the data to it.
We discussed the best way to tighten up the PURPOSE and SUBSTANCE of what we can show with our data processing and analysis thus far and decided that the Python function Thomas has put together will be black-boxed.
Pedro and Carlos will put together R functions to properly break data files down into the right format for the Python black box (to be called from R), as well as functions to appropriately show the results of the Python function (including empirical and theoretical distributions). Visualizations will follow.

Saturday, <time> - SUMMARY

Saturday, 15:57 - CHECK IN
Michael: Finishing importing full data sets from the web. Was transforming raw files into .csv's, then found that the sites actually provide Excel spreadsheets.
Thomas: Fine-tuning the imported data, then comparing it against statistical outcomes for data that shouldn't deviate -- example: historical populations (obviously rounded, but still randomized).
Leon: Testing metrics to make sure they're statistically valid, then straightforwardly implementing the numbers.
Pablo: Developing and cross-checking the same formula Thomas is developing, to ensure the two implementations don't produce different outcomes.

15:13 - Coffee Break

12:00 - CHECK IN

1. Leon Kautsky - Organise the data, overseeing the operational work process and structure
2. Thomas Nyberg - Finalize the mathematical equation to compute the data
3. Pedro Avila - Attended fusion table conference. Pre-processing data into csv files as well as coding R functions to confirm Python development
4. Carlos Ramos - Attended census data workshop and is working on the R-data
5.  Michael Lawson - Putting together balance sheets from Iowa counties
6. Karuna Kumar - Documenting progress and updating drop-box

09:58 - AGENDA - Iowa
Determined that Iowa, Texas, Florida and California are good places to start to demonstrate capabilities.
1. Get balance sheets, define data type
2. Write Benford's Law package in R/Python
3. Pre-process county data
4. Run our own package

9:45 - SOURCES
Influenceexplorer - Ask for data, independent expenditures - digital copy of ad database from SEC? - clustering capabilities for showing

from data via - APIs for US state companies, company structure - sorting through a bunch of information in redefining the wealth gap

Sunlight Foundation gets data FROM - use as cross comparison, but doesn't create an easily accessible API - Sunlight does.

9:24 - Finding alternate sets.
1. County officials salaries
2. Investigating individual states for reliable county data

9:30 - First Hurdle: Dante's data doesn't suffice to determine fraud.