Parsing the LV Municipal Court database, part 1

Mon 26 August 2013

Parsing the LVMC Case Summary Report

Finally got the time to put together the omnibus post I had mentioned in the previous entry. Doing this in R for two reasons: first, because I'm rather not get rusty with R (Anki helps), and second, as I mentioned, to get some experience with the XML library.

Preliminaries and scraping

Here we load the library and select all tables nested within a parent table cell in the tree.

library(XML)
library("data.table", lib.loc = "/home/james/R")
library("plyr", lib.loc = "/home/james/R")
library("ggplot2", lib.loc = "/home/james/R")
library("stringr", lib.loc = "/home/james/R")
library("xtable", lib.loc = "/home/james/R")

We can make use of the script we used, and include it in its entirety thanks to the knitr python engine.

import gzip
import sys
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

def get_lvmc_case_report(caseNumber):
    driver = webdriver.Firefox()
    driver.get('https://secure2.lasvegasnevada.gov/defendantreport/Default.aspx')
    casenum = driver.find_element_by_name('txt_CaseNo')
    casenum.send_keys(caseNumber)
    casenum.send_keys(Keys.RETURN)
    driver.implicitly_wait("3")

    for handle in driver.window_handles:
        driver.switch_to_window(handle)
        if driver.current_url == 'https://secure2.lasvegasnevada.gov/defendantreport/report.aspx':
            filename = caseNumber + '.html.gz'
            outfile = gzip.open(filename, 'wb')
            outfile.write(driver.page_source)
            outfile.close()
            print "Wrote %s" % filename
    driver.quit()

cases = ['C' + str(x) + 'A' for x in range(1000001, 1099263, 1)]

for case in cases:
    print case
    get_lvmc_case_report(case)
    # The time it takes to open a browser window is enough,
    # no need to sleep
    # sleep(randint(0,1))

See the original scraping post for documentation.

results.list <- list.files(path = "/home/james/Dropbox/legal/scraping/lvmc", 
    pattern = "*.html.gz", recursive = TRUE, full.names = TRUE)
names(results.list) <- gsub("/home/james/Dropbox/legal/scraping/lvmc/\\d\\d/(C\\d+A).html.gz", 
    "\\1", results.list)
head(results.list)
##                                                      C1000001A 
## "/home/james/Dropbox/legal/scraping/lvmc/00/C1000001A.html.gz" 
##                                                      C1000002A 
## "/home/james/Dropbox/legal/scraping/lvmc/00/C1000002A.html.gz" 
##                                                      C1000003A 
## "/home/james/Dropbox/legal/scraping/lvmc/00/C1000003A.html.gz" 
##                                                      C1000004A 
## "/home/james/Dropbox/legal/scraping/lvmc/00/C1000004A.html.gz" 
##                                                      C1000005A 
## "/home/james/Dropbox/legal/scraping/lvmc/00/C1000005A.html.gz" 
##                                                      C1000006A 
## "/home/james/Dropbox/legal/scraping/lvmc/00/C1000006A.html.gz"
tail(results.list)
##                                                      C1099255A 
## "/home/james/Dropbox/legal/scraping/lvmc/05/C1099255A.html.gz" 
##                                                      C1099259A 
## "/home/james/Dropbox/legal/scraping/lvmc/05/C1099259A.html.gz" 
##                                                      C1099260A 
## "/home/james/Dropbox/legal/scraping/lvmc/05/C1099260A.html.gz" 
##                                                      C1099261A 
## "/home/james/Dropbox/legal/scraping/lvmc/05/C1099261A.html.gz" 
##                                                      C1099262A 
## "/home/james/Dropbox/legal/scraping/lvmc/05/C1099262A.html.gz" 
##                                                      C1099263A 
## "/home/james/Dropbox/legal/scraping/lvmc/05/C1099263A.html.gz"

It wasn't necessary to download local copies of our files; we could have constructed and fed the URLs directly into the XML parser, we'd just replace the following strings with the appropriate URL generator.1

TreeParse <- function(x) {
    tree <- htmlTreeParse(x, useInternalNodes = TRUE, asText = FALSE)
    srctbl <- getNodeSet(tree, "//span/div/table/tbody/tr/td/table")
    return(srctbl)
}

Case Identifiers

We also create a class for the LVMC's date format:

setClass("caseDate")
setAs("character", "caseDate", function(from) as.POSIXct(from, format = "%m/%d/%Y %I:%M %p"))

And for the ID table, we need to split table cells by colon.

pairVectors <- function(x) {
    # Splits cells from the identifiers table of LVMC case reports into
    # variables
    # 
    # Args: x: A single character string to be split on its colon 'Court Date:
    # 10/4/2012 8:30 AM'
    # 
    # Returns: A single-column data frame
    name <- sub(" ", ".", tolower(x[1]))
    if (name %in% c("violation.date", "court.date")) {
        val <- as.POSIXct(x[2], format = "%m/%d/%Y %I:%M %p")
    } else {
        val <- x[2]
    }
    out <- data.frame(val)
    names(out) <- name
    return(out)
}

And we include that function in the main one extracting entries from the third table of the case summary.

extractIdentifiers <- function(raw) {
    # Parses the identifiers table (no. 3) of LVMC case reports
    # 
    # Args: ids.raw: An XML node set containing a single HTML table
    # 
    # Returns: A data frame containing: history.number case.number
    # citation.number violation violation.date name department court.date

    # Extract the table containing case identifiers
    ids.raw.df <- readHTMLTable(raw[[3]], stringsAsFactors = FALSE)
    # Split each cell into id:value pairs. Not every case has a citation
    # number so we split on a colon ending a line or followed by a space
    ids.raw.lst <- c(strsplit(ids.raw.df$V1, ": |:$"), strsplit(ids.raw.df$V2[!is.na(ids.raw.df$V2)], 
        ": |:$"))
    # Combine each id:value string vector into a data frame
    ids.pairs <- sapply(ids.raw.lst, pairVectors)
    # Merge the list of data frames and exit
    ids <- data.frame(ids.pairs)
    return(ids)
}

Case Docket

This table is a bit more complex in that a lot of the data we're after is encoded in the case activity column as id:value pairs (e.g., Plea: NOLO or Finding: GUILTY). We could repeat the steps we used to extract the ids table, but there's additional data that doesn't seem to obey the colon delimited format, most importantly the type of attorney (Public Attorney SMITH, BRIAN Bar# 11279).

Activity entry Variables Type
Case Closed case.closed = TRUE logical
Plea: NOLO) plea=NC or (NG, GT, SU) factor
Finding: GUILTY finding=GTor (NG, DS?) factor
Probable Cause Found pc.hearing=TRUE, pc.found=TRUE logical
Bail Due: \$1130 bail due=1130 numeric
Sentence: JAIL 3 days CTS: 3 days list("sentence.type"=J, "sentence.val"=3, "sentence.suspended"=FALSE), jail.served=3 list,numeric
Complaint Filed 12/2/2010 9:10 AM complaint.date=date POSIXct
Public Attorney SMITH, BRIAN Bar# 11279 attorney.type=(PDR,PVT), attorney.name="SMITH, BRIAN",attorney.bar=11279 factor
BW Ordered bw.issued=TRUE logical
Suspend Sentence for 1Y AND Sentence: Suspend JAIL 179 days `list("sentence.type"=J, "sentence.val"=3, "sentence.suspended"=TRUE) list

There's a separate table for sentences, so we'll set that aside for the moment. We begin by establishing whether the case is closed, and whether the court conducted a probable cause hearing:

isClosed <- function(docket.df) {
    if (length(grep("Case Closed", docket.df$activity, fixed = TRUE)) == 0) {
        case.closed <- FALSE
    } else {
        case.closed <- TRUE
    }
    return(case.closed)
}

GetProbableCause <- function(docket.df) {
    if (length(grep("Probable Cause", docket.df$activity, fixed = FALSE, ignore.case = TRUE)) == 
        0) {
        pc.hearing.held <- FALSE
        pc.found <- NA
    } else {
        pc.hearing.held <- TRUE
        if (length(grep("Probable Cause Found", docket.df$activity, fixed = FALSE, 
            ignore.case = TRUE)) > 0) {
            pc.found <- TRUE
        } else {
            pc.found <- FALSE
        }
    }
    pc.hearing <- data.frame(pc.hearing.held, pc.found)
    return(pc.hearing)
}

Getting the attorney name and type requires extracting it via regex. We also have to consider that multiple attorney entries may exist, as when a private attorney is replaced by a publicly provided one. We can do this by setting up the function to only record the first (most recent) entry, and to set certain logical flags if there are multiple attorneys.

parseAttorneyName <- function(atty.str) {
    atty.str <- sub("(Public|Private) Attorney ([A-Za-z]+, [A-Za-z]+|[A-Za-z]+, [A-Za-z]+ [A-Za-z]\\.) Bar# (\\d+)", 
        "\\2~\\3", atty.str)
    atty.entry <- unlist(strsplit(atty.str, "~", fixed = TRUE))
    a <- list(attorney.name = atty.entry[1], attorney.bar.num = atty.entry[2])
    if (length(a) == 2) {
        # check for a nam
        return(a)
    } else {
        a <- list(attorney.name = NA, attorney.bar.num = NA)
        return(a)
    }
}

getAttorney <- function(docket.df) {
    # Parses the case docket for attorney of record entries in LVMC case
    # reports
    # 
    # Args: docket.df: data frame with date and activity cols
    # 
    # Returns: A data frame containing: Attorney Name, if mult, first (latest)
    # attorney bar number, if mult, first (latest) assigned counsel multiple
    # attorneys

    pri.len <- length(grep("Private Attorney", docket.df$activity, fixed = TRUE))
    pub.len <- length(grep("Public Attorney", docket.df$activity, fixed = TRUE))
    if (pri.len == 0 && pub.len == 0) {
        # No attorney
        atty <- data.frame(attorney.is.assigned = FALSE, attorney.is.private = FALSE, 
            attorney.name = NA, attorney.bar.num = NA)
        return(atty)
    } else if (pub.len == 1) {
        # public attorney
        atty.name <- parseAttorneyName(grep("Public Attorney", docket.df$activity, 
            value = TRUE, ignore.case = TRUE))
        atty <- data.frame(attorney.is.assigned = TRUE, attorney.is.private = FALSE, 
            atty.name, stringsAsFactors = FALSE)
        return(atty)
    } else if (pri.len == 1) {
        # private attorney
        atty.name <- parseAttorneyName(grep("Private Attorney", docket.df$activity, 
            value = TRUE, ignore.case = TRUE))
        atty <- data.frame(attorney.is.assigned = FALSE, attorney.is.private = TRUE, 
            atty.name, stringsAsFactors = FALSE)
        return(atty)
    } else {
        atty <- data.frame(attorney.is.assigned = NA, attorney.is.private = NA, 
            attorney.name = NA, attorney.bar.num = NA)
        return(atty)
    }

}

Here we create a simpler version of the pairVectors function, which we'll use to extract pleas and findings (verdicts). We also want to check whether the City Attorney declines to pursue charges and/or the court dismissed the charge.2

activitySplit <- function(x) str_trim(unlist(strsplit(x, ": |:$")))

isDenied <- function(docket.df) {
    if (length(grep("(Charge Denied|(Case|Charge) dismissed)", docket.df$activity, 
        ignore.case = TRUE)) == 0) {
        charge.denied <- FALSE
    } else {
        charge.denied <- TRUE
    }
    return(charge.denied)
}

Putting it all together, we have:

parseDocket <- function(docket.df) {
    charge.denied <- isDenied(docket.df)
    if (charge.denied) {
        finding <- plea <- "NONE"
    } else {
        plea <- activitySplit(grep("Plea", docket.df$activity, value = TRUE))[2]
        finding <- activitySplit(grep("Finding", docket.df$activity, value = TRUE))[2]
        try(if (plea == "Plea" || finding == "Finding") {
            plea <- activitySplit(grep("Plea", docket.df$activity, value = TRUE))[3]
            finding <- activitySplit(grep("Finding", docket.df$activity, value = TRUE))[3]
        })
    }
    attorney <- getAttorney(docket.df)
    pc.hearing <- GetProbableCause(docket.df)
    dkt <- as.data.frame(list(charge.denied = charge.denied, plea = plea, finding = finding))
    dkt <- cbind(dkt, attorney, pc.hearing)
    return(dkt)
}

We first detect whether the charges have been denied, and, if so, skip the plea and verdict. Because these are sometimes prefaced by "Correction:" we select the third element when we extract the word "plea" or "finding." We then get our attorney data and PC hearing status. We cbind these together into a single data frame and return it.

Depending on whether various flags (bench warrants, clerk flags or alerts, etc.) have been set, the location of the docket can vary by up to three slots. But because the record flags are one column HTML tables we can check for the two columns of our docket table ("Activity date" and "Activity"). We set stringsAsFactors = FALSE to keep the attorney entries from being returned as a factor and subsequently being converted into an integer when the data frames are combined.

extractDocket <- function(raw.tree) {
    docket.raw.df <- readHTMLTable(raw.tree[[5]], stringsAsFactors = FALSE, 
        skip.rows = 1, header = TRUE, colClasses = c("caseDate", "character"))
    if (length(docket.raw.df) == 2) {
        names(docket.raw.df) <- c("activity.date", "activity")
        docket.dt <- as.data.table(docket.raw.df)
        return(docket.dt)
    } else {
        docket.raw.df <- readHTMLTable(raw.tree[[6]], stringsAsFactors = FALSE, 
            skip.rows = 1, header = TRUE, colClasses = c("caseDate", "character"))
        if (length(docket.raw.df) == 2) {
            names(docket.raw.df) <- c("activity.date", "activity")
            docket.dt <- as.data.table(docket.raw.df)
            return(docket.dt)
        } else {
            docket.raw.df <- readHTMLTable(raw.tree[[7]], stringsAsFactors = FALSE, 
                skip.rows = 1, header = TRUE, colClasses = c("caseDate", "character"))
            names(docket.raw.df) <- c("activity.date", "activity")
            docket.dt <- as.data.table(docket.raw.df)
            return(docket.dt)
        }
    }
}

We put all of that together:

ParseSummaryTree <- function(chr) {
    node <- TreeParse(chr)
    docket.df <- extractDocket(node)
    case.closed <- isClosed(docket.df)
    if (case.closed) {
        ids <- extractIdentifiers(node)
        dkt <- parseDocket(docket.df)
        case <- cbind(ids, dkt)
        return(case)
    }
}

This returns a one row data frame containing:

  • history.number
  • case.number
  • citation.number
  • violation
  • violation.date
  • name
  • department
  • court.date
  • charge.denied
  • plea
  • finding
  • attorney.is.assigned
  • attorney.is.private
  • attorney.name
  • attorney.bar.num
  • pc.hearing.held
  • pc.found

Extract the dockets

My original approach kept all of the XMLNodeSets in memory, a mistake that quickly filled up 8GB of RAM.

library("parallel", lib.loc = "/usr/lib/R/library")

if (file.exists("docketsDT.RData")) {
    load("docketsDT.RData")
} else {
    system.time(dockets.l <- mclapply(results.list, ParseSummaryTree))
    system.time(dockets <- do.call("rbind", dockets.l))
    dockets.DT <- data.table(dockets, key = "case.number")
    save(dockets.DT, file = "docketsDT.RData", compress = TRUE)
    write.csv(dockets, file = "docketsDT.csv")
}

A quick glance at the resulting data frame (omitting the ID numbers, defendants, dates, and attorney bar number):

x.dockets <- xtable(summary(dockets.DT[,c("violation", "plea", 
                                          "finding", 
                                          "attorney.is.assigned", 
                                          "attorney.is.private", 
                                          "attorney.name", 
                                          "pc.hearing.held", 
                                          "pc.found"), with = FALSE]), 
                    caption = "Selected result summaries for the dockets table."
                    )
print(x.dockets, type = "html")
Selected result summaries for the dockets table.
violation plea finding attorney.is.assigned attorney.is.private attorney.name pc.hearing.held pc.found
1 BATTERY/DOMESTIC VIOLENCE : 6871 NONE :29640 GUILTY :33632 Mode :logical Mode :logical Length:63516 Mode :logical Mode :logical
2 TRESPASSING : 6192 NOLO :19442 NONE :29640 FALSE:42513 FALSE:53745 Class :character FALSE:40056 FALSE:254
3 UNLAWFUL USE/POSSESSION OF DRUG PARAPHERNALIA: 4697 GUILTY :11841 NOT GUILTY: 111 TRUE :18864 TRUE :7632 Mode :character TRUE :23460 TRUE :23206
4 PETIT LARCENY : 3873 SUBMIT : 2100 Finding : 3 NA's :2139 NA's :2139 NA's :0 NA's :40056
5 POSSESS LESS THAN 1 OUNCE OF MARIJUANA : 3836 NOT GUILTY: 342 NA's : 130
6 (Other) :38046 (Other) : 91
7 NA's : 1 NA's : 60

The most eyepopping result? The 111 acquittals, or 0.175 percent of 63516 cases total. That isn't very many at all. A closer look:

acquitals <- dockets.DT[finding == 'NOT GUILTY',
                        c("violation",
                          "plea",
                          "finding",
                          "attorney.is.assigned",
                          "attorney.is.private",
                          "attorney.name",
                          "pc.hearing.held",
                          "pc.found"), 
                        with = FALSE]
x.acq <- xtable(summary(acquitals), caption = "Result summaries for the acquitals table.")
print(x.acq, type = "html")
Result summaries for the acquitals table.
violation plea finding attorney.is.assigned attorney.is.private attorney.name pc.hearing.held pc.found
1 BATTERY/DOMESTIC VIOLENCE :43 NOT GUILTY:106 GUILTY : 0 Mode :logical Mode :logical Length:111 Mode :logical Mode:logical
2 DUI LIQUOR :11 SUBMIT : 2 NONE : 0 FALSE:62 FALSE:56 Class :character FALSE:89 TRUE:22
3 BATTERY : 7 GUILTY : 0 NOT GUILTY:111 TRUE :40 TRUE :46 Mode :character TRUE :22 NA's:89
4 TRESPASSING : 6 NONE : 0 Finding : 0 NA's :9 NA's :9 NA's :0
5 POSSESS LESS THAN 1 OUNCE OF MARIJUANA: 3 NOLO : 0
6 RECKLESS DRIVING : 3 (Other) : 0
7 (Other) :38 NA's : 3

And we can also generate top ten tables for all dockets and for acquittals:

Top ten violations for all dockets.
Count
BATTERY/DOMESTIC VIOLENCE 6871
TRESPASSING 6192
UNLAWFUL USE/POSSESSION OF DRUG PARAPHERNALIA 4697
PETIT LARCENY 3873
POSSESS LESS THAN 1 OUNCE OF MARIJUANA 3836
CONSUME ALCHOL ON PREMISE OFF/SALE ONLY 3238
BATTERY 3200
DUI LIQUOR 2860
OBSTRUCTING/FALSE INFO TO P. O. 1932
CONVICTED PERSON FAIL/CHANGE ADDRESS 1900
Top ten violations for dockets of cases resulting in acquittal.
Count
BATTERY/DOMESTIC VIOLENCE 43
DUI LIQUOR 11
BATTERY 7
TRESPASSING 6
RECKLESS DRIVING 3
POSSESS LESS THAN 1 OUNCE OF MARIJUANA 3
NOISE DISTURBANCE 2
DUI DRUGS CHEMICALS ORGANIC SOLVENT 2
OBSTRUCTING/FALSE INFO TO P. O. 2
PETIT LARCENY 2

The other interesting result is the 1.08 percent of probable cause hearings that failed to find probable cause.

Sentencing

Relatively straight forward. We have five columns (sentence name, cash (T/F), due, paid, balance) to extract, and need only check for the presence of a dollar sign in the second column, which indicates a fine.

extractSentence <- function(chr, n) {
    # Parses the case sentencing table (no. 4) of LVMC case reports
    # 
    # Args: chr: filename or url pointing to case summary
    # 
    # Returns: A data frame containing: id item.name cash - todo: character
    # dollar sign -> logical due paid balance
    sen.raw <- TreeParse(chr)
    sen.raw.df <- readHTMLTable(sen.raw[[4]], stringsAsFactors = FALSE, skip.rows = 2, 
        header = c("item", "cash", "due", "paid", "balance"))
    if (length(sen.raw.df) == 5) {
        try(sen.raw.df$cash <- sen.raw.df$cash %in% "$", silent = TRUE)
        sents <- data.frame(.ids = n, sen.raw.df)
    } else {
        sents <- data.frame(.ids = n, item = NA, cash = NA, due = NA, paid = NA, 
            balance = NA)
    }
    return(sents)
}

And run it here, after checking to see whether our data file exists.

if (file.exists("sentsDT.RData")) {
    load("sentsDT.RData")
} else {
    sentences.l <- mclapply(seq_along(results.list), function(i) {
        extractSentence(results.list[[i]], n = names(results.list)[[i]])
    })
    sents <- rbind.fill(sentences.l)
    sents.DT <- data.table(sents, key = ".ids")
    save(sents.DT, file = "sentsDT.RData", compress = TRUE)
    write.csv(sents.DT, file = "sentsDT.csv")
}

Not evaluated. So the first run:

    > system.time(
    + sentences.l <- mclapply(seq_along(results.list), function(i) 
    + {
    + extractSentence(results.list[[i]], n=names(results.list)[[i]]))
    + }
    + )
       user  system elapsed 
    870.088   9.112 436.309 
    > 
    > 
    > system.time(
    + sents <- rbind.fill(sentences.l)
    + )
        user   system  elapsed 
    2856.880    3.672 2863.360 

Which is obviously hosed. It would probably be faster to write each data frame to disk and then paste or cat them together than to use rbind.fill. The parallel package is particularly useful here for parsing both the docket and sentence tables, cutting processing time in half on my machine (an aging i5-650 dual-core CPU).

Housekeeping

ls()
##  [1] "acquitals"          "acqviol.df"         "activitySplit"     
##  [4] "dockets.DT"         "extractDocket"      "extractIdentifiers"
##  [7] "extractSentence"    "getAttorney"        "GetProbableCause"  
## [10] "isClosed"           "isDenied"           "pairVectors"       
## [13] "parseAttorneyName"  "parseDocket"        "ParseSummaryTree"  
## [16] "results.list"       "sents.DT"           "TreeParse"         
## [19] "violdf"             "x.acq"              "x.acq.viol"        
## [22] "x.dockets"          "x.viol"
tables()
##      NAME          NROW MB
## [1,] acquitals      111 13
## [2,] dockets.DT  63,516 19
## [3,] sents.DT   329,070 18
##      COLS                                                                            
## [1,] history.number,case.number,citation.number,violation,violation.date,name,departm
## [2,] history.number,case.number,citation.number,violation,violation.date,name,departm
## [3,] .ids,item,cash,due,paid,balance                                                 
##      KEY        
## [1,] case.number
## [2,] case.number
## [3,] .ids       
## Total: 50MB
sessionInfo()
## R version 3.0.1 (2013-05-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=C                 LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] xtable_1.7-1     stringr_0.6.2    ggplot2_0.9.3.1  plyr_1.8        
## [5] data.table_1.8.8 XML_3.98-1.1     knitr_1.2       
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.2-2   dichromat_2.0-0    digest_0.6.3      
##  [4] evaluate_0.4.4     formatR_0.8        grid_3.0.1        
##  [7] gtable_0.1.2       labeling_0.2       MASS_7.3-27       
## [10] munsell_0.4.2      proto_0.3-10       RColorBrewer_1.0-5
## [13] reshape2_1.2.2     scales_0.2.3       tools_3.0.1

Data

I've posted the resulting files to Figshare.3


  1. Xiao Nan has a nice slide deck on web scraping with R available here

  2. This may need to change. I need to figure out whether "submittals" that result in the case being dismissed (similar to Adjournment in contemplation of dismissal) belong with cases in which the DA denies the charge. 

  3. James Estevez, "Criminal Misdemeanor Dockets from the Las Vegas Municipal Court (2009-2013)", FigShare (August 8, 2013), Link