
Congressional Financial Disclosures: Bi-Coastal Team Judged at Stanford

How good are our elected officials' stock picks? Can we find any unusual transactions that generate story leads for further reporting?

http://www.streamingtopfilms.com/findis/ is our staging (um, actually live) server.

These are among the questions we hope to answer by processing House and Senate Financial Disclosure forms. These forms were previously available only at the Library of Congress, but they are now online as a result of the STOCK Act. Unfortunately, they are stored as scanned PDFs (House) or GIFs (Senate). We plan to use parsed data from OpenSecrets.org, and also to try hacking the more recently filed disclosures - which we have already downloaded.

We're nominating in the 'insight' category (although we do hope that there's potential and innovation as well).

We're starting with a raw data set from the Center for Responsive Politics, incorporating Yahoo! stock performance information and the St. Louis Fed's record of S&P 500 performance. Story leads are being generated by scraping and OCR-ing the raw PDF files.

Full list of contributors
Co-ordination and more from:
Fergus Pitt (New York) 
Marc Joffe (Bay Area) 

West Coast
Vinay Pandey  - Data Manipulation and more
Johann Posch - Data Manipulation, Yahoo! Finance API Interactions and more
Renee Chu
Carlos Riquelme Ruiz - PhD student at Stanford working in Computer Science and Maths
Matt Gedigan
Raghu Suribhatla
Xiaohuo Cui

East Coast
Mohammad Hadhrawi - Interaction Design & Development and more
Ken Leland - Interaction Design & Development and more
James Weston - Interaction Design & Development and more
Janet Lee - Catching Data Limitations and more
Yazeed Awwad - Stock Market Domain Knowledge and more.
Miriam
Kareem



We're doing most of our documentation and tracking on this Google Doc




Process


Background:
The officials disclose the assets that they hold and the transactions that they make. The date of each transaction is recorded. The value of the transaction or asset holding is recorded, but only as a range, such as $1-$200 or $15,000-$50,000. This has big implications for the conclusions that can be drawn; these are discussed elsewhere on this page.


Structure
The work has progressed in two mostly parallel branches, stemming from the two sources we have for the officials' shareholding information.
One source of stockholding data is what the government publishes as PDFs on the Senate and House websites. This is extraordinarily messy, combining handwritten form submissions with follow-up letters. However, it is the most current information.

The second resource comes via OpenSecrets.org's data store. This covers transactions and holdings to the end of 2011, so it's not as current, but it did come as structured data.

During the actual DataFest Weekend, we've also split the responsibilities across the West and East Coasts. In Stanford, the team has focussed on getting the stockholdings data into a usable format, and combining it with the performance data from Yahoo!
At Columbia, the team has mainly been designing and building the UI, and specifying a data format for the officials' stock holdings and performance.
With a huge number of people working on the project, everybody has been putting effort into communication and coordination.

Data Processing
SF Team - Technical Description

Senate and House disclosures are stored as scanned GIFs and PDFs (respectively) on congressional websites that do not provide bulk download functionality.

Using wget and HTTrack Website Copier - both open source tools - Marc downloaded all the disclosures for 2012 and 2013 to a shared Dropbox account. Fergus obtained CSV files from OpenSecrets.org that contained a digest of the 2011 disclosure files.
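
For anyone scripting this step rather than using wget, a minimal Python sketch of the bulk download would look something like the following. The index URL and link pattern are hypothetical stand-ins; the real congressional pages are structured differently.

import os
import re
import requests

INDEX_URL = "http://example.gov/disclosures/2012/"  # hypothetical index page
OUT_DIR = "disclosures/2012"

os.makedirs(OUT_DIR, exist_ok=True)
html = requests.get(INDEX_URL).text

# Find links to the scanned files (Senate GIFs, House PDFs).
for href in re.findall(r'href="([^"]+\.(?:gif|jpg|pdf))"', html, re.I):
    url = requests.compat.urljoin(INDEX_URL, href)
    target = os.path.join(OUT_DIR, os.path.basename(href))
    with open(target, "wb") as f:
        f.write(requests.get(url).content)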

Analysis of Raw Disclosure Files

To generate fresh news from these disclosures, a news organization needs to be able to process disclosures as they become available. We worked on a number of technologies to partially automate analysis of these disclosures.

Marc ran the 2012 and 2013 files through optical character recognition software to produce PDFs with embedded text and plain text files. He used Abbyy FineReader Professional 10 - a low-cost commercial tool.

Matt used Tesseract, an open source OCR tool, and ImageMagick, an open source tool that can rotate images from portrait to landscape format, to reproduce the results achieved with Abbyy FineReader. He also used Poppler to separate multi-page House PDFs into individual pages before submitting them to Tesseract.
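
In outline, that pipeline chains three command-line tools. Here is a Python sketch; filenames and flags are illustrative, not a record of Matt's exact invocations, and the ImageMagick PDF step assumes a Ghostscript delegate is installed.

import glob
import subprocess

# 1. Split a multi-page House PDF into single pages (Poppler).
subprocess.run(["pdfseparate", "house_disclosure.pdf", "page-%d.pdf"], check=True)

for page in sorted(glob.glob("page-*.pdf")):
    png = page.replace(".pdf", ".png")
    # 2. Rasterize and rotate the scan (ImageMagick).
    subprocess.run(["convert", "-density", "300", page, "-rotate", "90", png], check=True)
    # 3. OCR the image; Tesseract writes <base>.txt.
    subprocess.run(["tesseract", png, page.replace(".pdf", "")], check=True)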

Only about 20% of the disclosure pages include financial holdings or transactions, so Renee wrote a Python program that analyzes the plain text outputs to identify the pages that do. Because the Senate and House use different reporting formats, the script was designed to accept the classifying keywords as input parameters.
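
The heart of such a classifier is a keyword scan over the OCR text. A minimal sketch of the idea (our reconstruction, not Renee's actual script):

import sys

def relevant_pages(paths, keywords):
    # Yield the OCR'd pages whose text mentions any of the keywords.
    lowered = [kw.lower() for kw in keywords]
    for path in paths:
        with open(path, errors="ignore") as f:
            text = f.read().lower()
        if any(kw in text for kw in lowered):
            yield path

if __name__ == "__main__":
    # Usage: python classify_pages.py keywords.txt page-*.txt
    with open(sys.argv[1]) as f:
        keywords = [line.strip() for line in f if line.strip()]
    for page in relevant_pages(sys.argv[2:], keywords):
        print(page)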

Carlos wrote a Java program to extract the names of Senators, company names and transaction dates from the relevant OCR outputs. Our goal was to focus on publicly traded equities rather than mutual funds or bonds, so Carlos used a list of names of actively traded equities to filter the holdings data. Whether a given transaction was a purchase or a sale, and the amount of each transaction, cannot be determined through OCR and automated text analysis; these variables, and the many handwritten forms, require human processing. If we were to regularize this process, we would recommend using either Amazon Mechanical Turk or CrowdFlower.com to obtain human data management services at reasonable cost.
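
Carlos's program was Java; the filtering idea translates to a few lines of Python (the equity names below are stand-ins for the full actively-traded list):

def equity_lines(ocr_text, equity_names):
    # Keep only lines that mention a known, actively traded equity,
    # dropping mutual funds, bonds and other holdings.
    names = [n.lower() for n in equity_names]
    for line in ocr_text.splitlines():
        if any(name in line.lower() for name in names):
            yield line

sample = "SALE General Electric Co 05/01/2011\nVanguard 500 Index Fund"
print(list(equity_lines(sample, ["General Electric", "Apple"])))
# -> ['SALE General Electric Co 05/01/2011']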

Processing OpenSecrets Data

OpenSecrets.org provided comma-delimited files representing its capture of 2011 disclosures, along with a data dictionary. The files could not immediately be loaded into Excel because of extra delimiters. Johann and Vinay stripped these extra delimiters, and then Vinay created a set of Excel files cross-referenced to the codes and abbreviations in the OpenSecrets data dictionary. Vinay also used a list of actively traded equities to establish which financial holdings and transactions related to NASDAQ or NYSE securities.
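
The cleanup itself is mechanical. A sketch, assuming the data dictionary fixes an expected column count; the count and filenames below are illustrative, and the rule the team actually applied may have differed.

import csv

EXPECTED_COLS = 14  # illustrative; taken from the data dictionary in practice

with open("transactions_2011.csv", newline="") as src, \
     open("transactions_2011_clean.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        if len(row) > EXPECTED_COLS:
            # Extra commas split one field in two; fold the overflow
            # back into the final column.
            row = row[:EXPECTED_COLS - 1] + [",".join(row[EXPECTED_COLS - 1:])]
        writer.writerow(row)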

Incorporating Stock Performance

Johann obtained historical equity prices from the Yahoo Finance API at no cash cost. He used fuzzy text matching to assign ticker symbols to the financial transactions provided by others on the team. He then cross-referenced the transactions and the ticker symbols to obtain purchase and sale prices for each holding, and packaged the holdings, prices and some other relevant statistics into Congressperson-specific JSON files for inclusion in the UI.
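
Compressed into a sketch, that pipeline looks like this. difflib's get_close_matches stands in for whatever fuzzy matcher Johann actually used, and fetch_price is a placeholder for the historical-price lookup (the Yahoo Finance endpoint available at the time has since been retired).

import json
from difflib import get_close_matches

TICKERS = {"GENERAL ELECTRIC": "GE", "APPLE INC": "AAPL"}  # stand-in map

def match_ticker(company_name):
    # Fuzzy-match a disclosed company name to a ticker symbol.
    hits = get_close_matches(company_name.upper(), TICKERS, n=1, cutoff=0.8)
    return TICKERS[hits[0]] if hits else None

def fetch_price(ticker, date):
    # Placeholder for the historical closing-price lookup.
    raise NotImplementedError

def build_official_json(name, transactions, path):
    holdings = []
    for t in transactions:
        ticker = match_ticker(t["company"])
        if ticker is None:
            continue  # not an actively traded equity we can price
        buy = fetch_price(ticker, t["buy_date"])
        sell = fetch_price(ticker, t["sell_date"])
        holdings.append({
            "CompanyName": t["company"],
            "Ticker": ticker,
            "TotalPercentageChange": round(100 * (sell - buy) / buy, 2),
        })
    with open(path, "w") as f:
        json.dump({"Official": {"Name": name, "Stockholdings": holdings}}, f, indent=2)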


Steps: 

Prep 
(Team Formation, Acquiring Data, Assessing): 
For the week before the DataFest, Marc, Fergus, Johann and Vinay focussed on getting the data, understanding its possibilities and limitations as much as possible, and preparing to work with the Yahoo! Finance API - our source for individual stocks' performance.
Marc also spent time seeing how much data could be extracted from the PDFs using OCR.

DataFest Weekend:
Saturday Morning Columbia - The new recruits came in, familiarized themselves with the data to understand what information could theoretically be extracted, and discussed some potential directions.

Midday Columbia/Morning Stanford - The whole team had our first standup, where we confirmed roles and updated on status. 
In Stanford the team split into a group extracting the most recent data possible from the PDFs and a group transforming the structured data from OpenSecrets into a format from which they could query Yahoo! per stock.

In Columbia the front-end team sketched out some UIs that might be able to tell the story, started to assemble the visualization libraries, and coded up some demo pages.
The Columbia team also discovered some of the limitations inherent in the data. These are discussed elsewhere on this page.

Afternoon Session Columbia/Midday Session Stanford
In Columbia, the front-end team was aiming to have a 90%-decided design by the end of the day: we would understand how the stocks would be plotted, what information would be on each graph, and how the user would interact with it.

The team also locked down a data spec for the information flowing from Stanford to Columbia, and decided on the serving and presentation architecture.

In Stanford the team was aiming to have a sample of the OpenSecrets-sourced data for a single official, which could then be used by the Columbia team. 
The Stanford team was also examining the recent PDF-scraped data for any transactions which could be story leads. (One such lead is written into an example story on the presentation page).

Afternoon Session Stanford (deep into the evening)
The team produced real stock performance data for a single senator, Ben Nelson.

Morning Session Columbia
The test data was incorporated into the 90% design, which was further refined to improve usability, legibility, performance and fidelity.

Midday Session Columbia/Morning Session Stanford
Stanford worked hard on producing as many officials' records as possible.
Documenting the process and its limitations.
Writing demo stories.
Polishing the UI and preparing to push to live.



Limitations inherent in the Data:
  • Because the data only contains the potential value range of a transaction, as opposed to the exact amount, we are only able to show the performance of individual stocks (and those in percentage terms), as opposed to portfolios or aggregations.
  • We are not able to say conclusively whether a sale transaction represents the official selling their total holding or just a portion.
  • There are also significant problems which are, we think, caused by ambiguous entries in the original PDF files. These are documented in this Google Spreadsheet


Limitations introduced through the process:
The data that came from OpenSecrets was formatted with extra delimiters that had to be stripped before it could be loaded (see Data Processing above).



Data & Code Formats
Example Data:
{
 "Official": {
   "Id": 939428384,
   "Name": "Lindsey Graham",
   "Position": "Senator",
   "Stockholdings": [
     {
       "Beta": 1.1,
       "CompanyName": "General Electric",
       "Ticker": "GE",
       "Buy": {
         "Date": "2011-05-01",
         "ValueRange": [ 5000, 10000 ]
       },
       "Sell": {
         "Date": "2011-05-01",
         "ValueRange": [ 1000, 5000 ]
       },
       "TotalPercentageChange": 5.2,
       # These Daily Percentage Changes are NOT DAY OVER DAY
       # They are absolute changes from the buy date
       "DailyPercentageChanges": [
         {
           "Date": "2011-05-02",
           "PercentageChange": 5
         },
         {
           "Date": "2011-05-03",
           "PercentageChange": -5
         },
         {
           "Date": "2011-05-03",
           "PercentageChange": 0.2
         }
       ]
     }
   ]
 }
}

Process for Comparing the Stock Data with the S&P500

We have plotted the S&P 500 for a superset of the dates of the politician transaction data we have.
This superset runs from 1/1/2006 to 12/31/2012.

Any politician's holding is superimposed on this S&P 500 series in such a way that you can compare multiple holdings against the S&P 500's performance in one image.

This intermediate processing is performed by sp_shifter.rb.

In simple terms, each stock is superimposed by shifting its performance graph vertically so that its start point is wherever the S&P 500 is at the start date.
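
sp_shifter.rb is Ruby; this Python fragment just restates the arithmetic (the names are ours, not the script's):

def shift_to_index(stock_prices, sp500_at_start):
    # Shift the whole series vertically so its first point sits on the
    # S&P 500's level at the purchase date.
    offset = sp500_at_start - stock_prices[0]
    return [price + offset for price in stock_prices]

# e.g. a stock bought at $20 while the S&P 500 stood at 1300:
print(shift_to_index([20.0, 21.5, 19.0], 1300.0))
# -> [1300.0, 1301.5, 1299.0]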

