Scraping with Python and Selenium

Mon 26 August 2013

Case information from the Las Vegas Municipal Court can be accessed through an ASP.Net Web Form. Attempts to get this public data in bulk having been rebuffed, we'll be doing a bit of scraping. Bleech.

Case lookup.

Case lookup is available here. That page contains a frame with a single form element.

That form takes a single argument of at most 16 characters (case numbers mix letters and digits). I was going to tackle this with the mechanize library. First run, based on the ScraperWiki mechanize cheat sheet:

import mechanize

br = mechanize.Browser()
# set_all_readonly(False) is a form method; it only works once a form is selected, and we don't need it here
br.set_handle_robots(False) # no robots
br.set_handle_refresh(False) # can sometimes hang without this
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

And get our target url:

target = 'https://secure2.lasvegasnevada.gov/defendantreport/Default.aspx'

response = br.open(target)
print response.read()
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
        <head><title>
        City of Las Vegas Court Case Lookup
    </title><link rel="stylesheet" type="text/css" href="http://www.lasvegasnevada.gov/includes/stylesheetSmall.css" />

    [...]

We know that there's only one form, but let's see:

 
for form in br.forms():
    print "Form name:", form.name
    print form
Form name: None
<POST https://secure2.lasvegasnevada.gov/defendantreport/Default.aspx application/x-www-form-urlencoded
  <HiddenControl(__EVENTTARGET=) (readonly)>
  <HiddenControl(__EVENTARGUMENT=) (readonly)>
  <HiddenControl(__VIEWSTATE=/wEPDwULLTE3MDc4NTU3NDBkZBFZvEXEM/AfxRuZRWuEEWJrOA5E0fGHGTe0sOVSnEAZ) (readonly)>
  <HiddenControl(__EVENTVALIDATION=/wEWAwLdxISPDgL43ua9AQK3leriC2s/4orhhWHuQloqbi0JGVVNeFua4wyBGgXiQ+jkAsML) (readonly)>
  <TextControl(txt_CaseNo=)>
  <SubmitControl(btn_GetCase=Get Report) (readonly)>>

As we expected, a single form. We'll select this form and iterate through the controls:

br.form = list(br.forms())[0]
for control in br.form.controls:
    print control
    print "type=%s, name=%s value=%s" % (control.type, control.name, br[control.name])
<HiddenControl(__EVENTTARGET=) (readonly)>
type=hidden, name=__EVENTTARGET value=
<HiddenControl(__EVENTARGUMENT=) (readonly)>
type=hidden, name=__EVENTARGUMENT value=
<HiddenControl(__VIEWSTATE=/wEPDwULLTE3MDc4NTU3NDBkZBFZvEXEM/AfxRuZRWuEEWJrOA5E0fGHGTe0sOVSnEAZ) (readonly)>
type=hidden, name=__VIEWSTATE value=/wEPDwULLTE3MDc4NTU3NDBkZBFZvEXEM/AfxRuZRWuEEWJrOA5E0fGHGTe0sOVSnEAZ
<HiddenControl(__EVENTVALIDATION=/wEWAwLdxISPDgL43ua9AQK3leriC2s/4orhhWHuQloqbi0JGVVNeFua4wyBGgXiQ+jkAsML) (readonly)>
type=hidden, name=__EVENTVALIDATION value=/wEWAwLdxISPDgL43ua9AQK3leriC2s/4orhhWHuQloqbi0JGVVNeFua4wyBGgXiQ+jkAsML
<TextControl(txt_CaseNo=)>
type=text, name=txt_CaseNo value=
<SubmitControl(btn_GetCase=Get Report) (readonly)>
type=submit, name=btn_GetCase value=Get Report

More of the same. Create a single test number and submit our query:

test_num = 'C1002073A'
br["txt_CaseNo"] = test_num

response = br.submit()
print response.read()
[...]
<div>
<input name="txt_CaseNo" type="text" value="C1002073A" maxlength="16" id="txt_CaseNo" style="font-size:12px;height:20px;width:175px;" /> &nbsp;
<input type="submit" name="btn_GetCase" value="Get Report" onclick="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(&quot;btn_GetCase&quot;, &quot;&quot;, true, &quot;&quot;, &quot;&quot;, false, false))" id="btn_GetCase" class="formButton" /><br />
<span id="validMessage" class="alertMssg" style="display:none;">Please enter a case ID</span>
<span id="lblError"></span>
</div>
[...]

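Which isn't what we wanted: the server has returned the lookup form again, with our case number filled in, instead of a report.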
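
For reference, the body that br.submit() posts can be approximated with the standard library alone. This is only a sketch, written in Python 3 syntax (unlike the Python 2 session above): the hidden field values are copied from the form dump earlier and would differ on every page load, and build_postback is a made-up helper name, not part of mechanize.

```python
from urllib.parse import urlencode, parse_qs


def build_postback(hidden_fields, case_number):
    """Return a form-encoded body resembling what a browser would POST."""
    fields = dict(hidden_fields)            # copy so the caller's dict is untouched
    fields["txt_CaseNo"] = case_number      # the one field we fill in
    fields["btn_GetCase"] = "Get Report"    # the submit button's name/value pair
    return urlencode(fields)


# Values copied from the form dump above; a fresh page load would issue new ones.
hidden = {
    "__EVENTTARGET": "",
    "__EVENTARGUMENT": "",
    "__VIEWSTATE": "/wEPDwULLTE3MDc4NTU3NDBkZBFZvEXEM/AfxRuZRWuEEWJrOA5E0fGHGTe0sOVSnEAZ",
    "__EVENTVALIDATION": "/wEWAwLdxISPDgL43ua9AQK3leriC2s/4orhhWHuQloqbi0JGVVNeFua4wyBGgXiQ+jkAsML",
}

body = build_postback(hidden, "C1002073A")
print(body)
```

The point is that the POST itself is unremarkable; nothing in it asks for a report page directly.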

Selenium deals with the JavaScript for us

It would appear that the problem here is JavaScript. I'm too lazy to figure out what's happening here so I'm going to just throw in a grenade: Selenium.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

I needed to figure out how Selenium works, so here is a rough approximation of the example on Read the Docs:

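url = 'https://secure2.lasvegasnevada.gov/defendantreport/Default.aspx'  # same page as target above

p = webdriver.FirefoxProfile()
p.set_preference("webdriver.log.file", "/tmp/firefox_console")
driver = webdriver.Firefox(p)
driver.get(url)
assert "City of Las Vegas Court Case Lookup" in driver.title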

Which works fine.

It would appear that our issue with getting the second page is that the lookup delivers the report via a call to window.open, and that the pop-up has null for its window title. Looking up the pop-up by page title would be brittle at best, so I pull all the window handles into a for loop and match on URL before writing the page source to disk.
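
The handle-matching idea can be tried in isolation. What follows is a sketch in Python 3 syntax, not the code used for the scrape: switch_to_window_with_url is a made-up helper name, it uses the newer driver.switch_to.window(...) spelling rather than switch_to_window, and FakeDriver is a hypothetical stand-in so the logic can be exercised without launching a browser. The nope.aspx URL is invented for the miss case.

```python
def switch_to_window_with_url(driver, wanted_url):
    """Switch to the first window whose URL equals wanted_url.

    Returns the matching handle, or None (after switching back to the
    window we started in) when no window matches.
    """
    start = driver.current_window_handle
    for handle in driver.window_handles:
        driver.switch_to.window(handle)
        if driver.current_url == wanted_url:
            return handle
    driver.switch_to.window(start)
    return None


class FakeDriver:
    """Hypothetical stand-in for a WebDriver; only mimics what the helper touches."""

    def __init__(self, windows):
        self._windows = windows                  # mapping of handle -> URL
        self.window_handles = list(windows)
        self.current_window_handle = self.window_handles[0]
        self.switch_to = self                    # lets driver.switch_to.window(...) work

    def window(self, handle):
        self.current_window_handle = handle

    @property
    def current_url(self):
        return self._windows[self.current_window_handle]


base = "https://secure2.lasvegasnevada.gov/defendantreport/"
fake = FakeDriver({"w1": base + "Default.aspx", "w2": base + "report.aspx"})

found = switch_to_window_with_url(fake, base + "report.aspx")    # a hit
missing = switch_to_window_with_url(fake, base + "nope.aspx")    # invented URL, a miss
print(found, missing)
```

Returning None on a miss, rather than raising, matches how a bad case number simply never opens a report window.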

Got a new browser window with Selenium, so now let's define a function to save our report to disk...

import io

def get_lvmc_case_report(caseNumber):
    driver = webdriver.Firefox()
    try:
        # implicitly_wait takes seconds, not milliseconds; set it before any lookups
        driver.implicitly_wait(30)
        driver.get(url)
        casenum = driver.find_element_by_name('txt_CaseNo')
        casenum.send_keys(caseNumber)
        casenum.send_keys(Keys.RETURN)

        # The report pop-up has no title, so match on URL instead
        for handle in driver.window_handles:
            driver.switch_to_window(handle)
            if driver.current_url == 'https://secure2.lasvegasnevada.gov/defendantreport/report.aspx':
                filename = caseNumber + '.html'
                # page_source is unicode, so encode explicitly on write
                with io.open(filename, 'w', encoding='utf-8') as outfile:
                    outfile.write(driver.page_source)
                print "Wrote %s" % filename
    finally:
        driver.quit()  # close the browser even if the lookup blows up

In order to test the scraper we'll need some case numbers. There exists a multitude of arrest/mugshot sites, so that's not particularly difficult. For this project I'll be using Jailbase, given that it produces an archived snapshot of the inmate entry used to source each arrestee's page, as well as an API which we may make use of later. The question then becomes whether or not the case number pattern holds true for citations, i.e. those seizures that did not culminate in a custodial arrest. Some manual checking shows at least two "CA"-numbers from non-custodial arrests.

Also, reviewing our manual downloads, I noticed that the case numbers appear to be sequential: the numeric part forms an arithmetic progression. I manually queried a run of consecutive case numbers and all were hits, which simplifies matters considerably. We can use a list comprehension to generate the case numbers and be reasonably sure that we'll get hits.

test_nums = ['C1012474A','C101200474A', 'C1096047A']
test_seq = ['C' + str(x) + 'A' for x in range(1096048, 1096051, 1)]
test_seq
['C1096048A', 'C1096049A', 'C1096050A']

So test_nums[1] is a bad case number, one not found in the DB. When this happens the server raises an alert message and fails to open a report.aspx window. This fails relatively gracefully.

Now, to avoid hammering the city's server let's introduce a random wait, then use a for loop:

from random import randint
from time import sleep
 
for test_num in test_nums:
    print test_num
    get_lvmc_case_report(test_num)
    sleep(randint(1,10))
C1012474A
Wrote C1012474A.html
C101200474A
C1096047A
Wrote C1096047A.html

A quick inspection of the files shows HTML tables, which was what we wanted. Now, let's try the sequential set:

for test_num in test_seq:
    print test_num
    get_lvmc_case_report(test_num)
    sleep(randint(1,10))
C1096048A
Wrote C1096048A.html
C1096049A
Wrote C1096049A.html
C1096050A
Wrote C1096050A.html

W5. Everything else looks good. We'll run this over range(1010001, 1099263, 1) for a total of 89,262 queries dating from 26 February 2010 to 3 August 2013. I'll shorten the sleep time upper bound to three seconds, which cuts the expected time spent sleeping from about five and a half days to about two.
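
A rough check of the time budget, counting only the pauses between queries (browser startup and page loads come on top, so this is a floor rather than a forecast). It is a sketch in Python 3 syntax; expected_sleep_days is a made-up helper, and reading "upper bound of three seconds" as randint(1, 3) is an assumption.

```python
n_queries = len(range(1010001, 1099263))   # 89262 case numbers


def expected_sleep_days(n, low, high):
    """Expected total of n pauses drawn from randint(low, high), in days."""
    mean_pause = (low + high) / 2.0    # randint is uniform and inclusive at both ends
    return n * mean_pause / 86400.0    # 86400 seconds in a day


days_before = expected_sleep_days(n_queries, 1, 10)   # about 5.7 days
days_after = expected_sleep_days(n_queries, 1, 3)     # about 2.1 days, assuming randint(1, 3)
print(round(days_before, 1), round(days_after, 1))
```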

Run it overnight and on Sunday and start work on parsing the tables into a database tomorrow.

Final notes:

  1. Should add gzip compression on file save.
  2. Progress bar or a counter would be nice.
  3. The connection at home is spotty, and I may need/want to put this on EC2, so making the script take the range from the command line is the next logical step.
  4. Running this in a second X session keeps it out of the way.
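
Notes 1 through 3 could look something like the following. This is a sketch under assumptions, in Python 3 syntax, and not the script in the gist: parse_range, case_numbers, save_gzipped and run are all made-up names, and fetch is a hypothetical callable that takes a case number and returns the page source, or None when no report window opens.

```python
import argparse
import gzip
import os


def parse_range(argv=None):
    """Read the numeric start/stop of the case number range from the command line."""
    parser = argparse.ArgumentParser(description="Scrape a range of case numbers.")
    parser.add_argument("start", type=int, help="first numeric part, inclusive")
    parser.add_argument("stop", type=int, help="last numeric part, exclusive")
    return parser.parse_args(argv)


def case_numbers(start, stop):
    """Yield case numbers like C1096048A, one at a time."""
    for n in range(start, stop):
        yield "C%dA" % n


def save_gzipped(case_number, html, directory="."):
    """Write the page source as <case>.html.gz and return the path."""
    path = os.path.join(directory, case_number + ".html.gz")
    with gzip.open(path, "wt", encoding="utf-8") as outfile:
        outfile.write(html)
    return path


def run(argv, fetch, directory="."):
    """Fetch every case in the range, printing a simple counter as it goes."""
    args = parse_range(argv)
    total = args.stop - args.start
    written = []
    for count, case in enumerate(case_numbers(args.start, args.stop), 1):
        html = fetch(case)
        if html is not None:               # None means no report came back
            written.append(save_gzipped(case, html, directory))
        print("%d/%d %s" % (count, total, case))
    return written
```

Called as run(["1096048", "1096051"], fetch) it would cover the same three numbers as test_seq above. The random sleep between queries is left out here and would go inside the loop.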

And we finally start getting data, after making a public records request for that very same data that took seven weeks to deny. Script took about a half-hour to figure out. It's absurd that I thought I could just ask the government to give it to me. If it's in the public domain, don't ask, just take it.

Gist here.