Aren Cambre's picture

Even more TxDPS progress

I can report even more TxDPS progress.

It turns out the problems pictured in the prior entry are for tickets were written on I-20 Business in or near Odessa:

What's going on here is I have a C# program that correlates each TxDPS ticket with a TxDOT reference marker. TxDOT's data has GPS coordinates.

These tickets on the left side of the picture were written at I-20 Business's reference markers. These markers have the same numbers as I-20 mainlane mile markers several miles east of Abilene, on the right side of the picture.

Why are the same numbers so far apart?

As explained in prior entries here, all Texas roads except Interstates use a reference marker system.

A road's reference marker increments only after the road crosses a gridline. This is on a grid superimposed on the state. These reference markers don't exactly indicate a mile of pavement unless the road direction is exactly N/S or E/W.

Texas Interstates use a mile marker system, where each mile of roadbed = 1 mile marker.

While "I-20 Business" in and near Odessa has "Interstate" in its name, it is in fact part of the state highway system. It is not part of the Interstate system or even the US highway system. So these Interstate business spurs/routes/loops use the reference marker system instead of the mile marker system.

It's just coincidence that the I-20 Business reference markers' numbers were matching Interstate mile marker numbers from many miles away.

US 181 through Beeville, TXI thought I had designed my reference marker locator C# program to only find reference markers within the same county. Turns out I had only done 1 of 2 steps needed to make this happen. After correcting the program and letting it run another 9 hours, I have much-improved data.

But I still found problems. An example is in Beeville, TX, where you have US 181 and US 181 Business. In the image at right, the business route appears on the left, and the main route appears on the right.

(What's with the J? It's actually Business US 181-J. Each business route along a road--and there can be several if the main road bypasses towns--has its own, unique letter suffix.)

While both these roads use the reference marker system, their numbering isn't parallel. For example, notice how the business route's reference marker 570 is parallel to markers 590/588 on the main route. This is an example where the system is not always contigious, although that doesn't matter much because I do have reasonable latitude and longitude for each reference marker.

Back to the point: if a ticket was written at the business route's reference marker 570, the TxDPS ticket data says it was written at US 181 rm 570. If this is all the TxDPS data gives me, how do I know if this refers to the US 181 business route rm 570 or the US 181 main lanes rm 570? While the main lanes' rm 570 is a few miles north, they are both in the same county. So my prior scheme, where I only match reference markers in the same county, can't handle this.

I scratched my head for a few hours until I stumbled on a strangely-named field in the TxDPS data. Turns out that this field has the route's TxDOT prefix! For tickets written on this business route, the field contains BU. For tickets written on the main route, it has US. Excellent!

So I went back to Odessa, where I had the original problem with I-20 Business. Well, it's better but still inconsistent: I'm seeing both IH and BI prefixes on I-20 Business. So looks like I need to sample a lot more routes to see if this is a systemwide problem.

While re-checking Odessa, I found something really weird: there's a gigantic number of tickets near I-20 and FM 2227, and they are all clustered on the northeast side of the intersection:

What??

I zoomed in more and added back the road prefix label:

Huh?

I inspected the area and saw no roads or anything special except--a DPS office! Yup, DPS officers appear to be waiting until they get back to their station before filing tickets.

I haven't found a rhyme or reason for this. All I can ascertain at the moment is only about 4 of these are speeding tickets, which is unusually low. So far, it appears speeding tickets are a large plurality, possibly a majority, of traffic-related tickets.

There's got to be another reason why they are filing at the station. If I'm lucky, these will mostly be non-traffic offenses and therefore not in my research.

Now I need to re-tweak my TxDPS reference marker locator C# program tonight thanks to the discovery of this new field. Tomorrow morning, I will have even better reference marker matching.

Aren Cambre's picture

TxDPS geocoding, continued

TxDPS ticketing along Dallas North TollwayI've continued to work on the TxDPS dataset.

The dataset has two types of tickets, for location purposes: those with GPS coordinates, and those without.

It's straightforward to geolocate the tickets with GPS coordinates. To the right is an example from the Dallas North Tollway. You'll notice that the tickets match the road's actual path really well. Sure, there are some stray tickets here and there, but they are minimal. I probably need to inspect the average straight-line distance of each ticket from the road's centerline to ascertain the true variance over one dimenson, but I imagine it wil be minimal.

Now the other challenge is for tickets without GPS coordinates. This generally applies to almost all tickets issued before 2008.

The way the TxDPS records ticket locations is to note the road's class and name and the nearest reference marker. (Reference markers discussed in an earlier post.)

Fortunately, I have a separate database of all Texas Department of Transportation reference markers with GPS coordinates, so as long as I can correlate to the nearest reference marker, I can geolocate the ticket within roughly 1 mile.

The TxDPS road class designations are:

  • 1 = Interstate (also used for tollways)
  • 2 = US or state highways
  • 3 = Farm to Market or Ranch to Market
  • 4 = county road
  • 5 = city street
  • & = other

Now here's a problem: both US 71 and TX 71 show up in the TxDPS database with name 0071 and road class 2. How do you tell roads like these apart?

If you have GPS, it's easy. But for the majority of tickets without GPS, it's more challenging.

One way is to hope both roads aren't in the same county. TxDPS also stores a county code, so you do the lookup on road name, road class, reference marker, and county.

But what if a state highway and US highway are both in the same county? This is the case for US 70 and TX 70!

In that case, all you can really do is hope that the reference markers are sufficiently apart to prevent an incorrect match. And in the case of US 70 and TX 70, where they intersect, US 70 will have higher reference markers that do not overlap with TX 70's markers.

Two more problems, that seem to be relatively minor, are that officers and clerical staff make mistakes. I've seen many examples where the GPS coordinates are a good distance from the indicated reference marker, and the error isn't explained by a keystroke. All I can guess is the officer wasn't paying attention and just remembered the most recent reference marker.

Another case are likely problems with clerical staff. I've seen some tickets where the GPS location is clearly on one road, but the route class and name are on a different road. This is only a guess, but I figure that if an officer focuses on road X and turns in a stack of tickets from that road, but that stack includes a small number of tickets from road Y, the typist may make some mistakes with road Y.

Now for an error I made.

The TxDPS county code correlates to a TxDOT county code, but indirectly.

The table that holds the tickets contains the county code, and I figured the corresponded directly to the TxDOT county codes.

Well, not exactly. If you look at the TxDOT county codes, you'll notice they don't all increment perfectly. For example, look at Kenedy County. The neighboring counties in alpha sort are Kendall (131) and Kent (132). Kenedy's number is 66.

I have no idea why it is ordered this way. All I do know is that, being formed in 1921, Kenedy County is among the newest Texas counties. So that might explain the discontinuity. But why number 66? Was there a now-disbanded county whose name was between Donley (65) and Duval (67)? Even if this happened way back when, why would it matter in the late 20th century, when this numeric convention was likely set? That would have been decades after Kenedy County's formation.

That curiosity notwithstanding, the TxDPS county reference was offset by 1 for many counties. In fact, the county number referred to a different table in the TxDPS database that correlated the TxDPS table refernce number to the TxDOT number.

It admittedly took me way too long to figure this out. Until I did, I was flummoxed by things like this:

This is where I was analyzing variance in my data. The red lines are between the locations of reference markers and the average location of traffic tickets for a given reference marker. Why was there such a strong correlation between Eastland County (Eastland, TX) and Ector County (Odessa, TX)?

Well, turns out that the tickets referenced county ID 68. County 68 in the TxDOT dataset is Eastland County. But the problem is that county 68 in the TxDPS dataset is in fact county 69 in the TxDOT data, which is Ector County!

As I'm typing, I have a program running re-analyzing all my TxDPS tickets, and this time I have corrected for the bad county numbers. Hopefully the 2nd time through will work! I'll find out when it finishes running in a few hours.

Aren Cambre's picture

Transposition error

OK, I admit it--it took me a while to figure out why my data was askew until I zoomed in:

Yup, transposed latitude and longitude. That's Texas mirrored and rotated.

Redoing this multi-hour calculation now!

Aren Cambre's picture

First glimpse of TxDPS ticket data

I now have my first glimpse of citation data from the Texas Department of Public Safety. It is probably the most prolific ticket-writing force in the state. For about 10.5 years' data, I have about 11.5 million tickets.

Here's TxDPS ticketwriting activity in the DFW area:
TxDPS ticketwriting data for DFW metro area

It took a while to get here.

For the past few years, TxDPS has included latitude/longitude data with its tickets, so this is helpful. But the first several years, TxDPS generally used highway reference markers to say where the ticket was written.

Reference markers for Interstate highways are (relatively) easy: they are mileposts that start counting from 0 at the western or southernmost extent. In this case, "(relatively) easy" was still a couple of weeks of work for me to figure out line traversal and other spiffy PostGIS features.

All other highways--US routes all the way down to FM/RM/RR routes--use an obscure, grid-based reference marker system. It's hard to explain, but in a nutshell, a grid is superimposed on the state, and I think the grid has lines 1 mile apart. A highway's reference marker doesn't increment until it crosses the grid line. If a highway runs diagonal, then it has longer intervals between successive reference markers.

But it's not that simple. I couldn't figure out a rational way to translate reference markers to latitude and longitude. Even manual calculations on some easy highways didn't work.

For example, TxDOT uses US 82 as an example highway in its explanation of the reference marker system. US 82 generally runs ENE across the state. Because it's not a strictly E/W road, it should have more miles than reference markers. But it didn't work out that way. The reference marker where US 82 enters the state from NM is 222, and where it leaves the state at AR is 798. 798 - 222 = 576 reference marker increments. But a huge problem: US 82 only runs 505 miles through the state! US 82 has more reference markers than miles, which ought to be impossible!

I gave up and did an informal open records request on TxDOT. Turns out they had a dataset already that correlates reference markers to latitude and longitude. They warned me that the data may be a off. However, at least on Interstates, their data pretty closely matches the mile markers I calculated using line traversals. They are usually close but may occasionally wander less than 1/3 of a mile off. This, by the way, is based on informal sampling, not population analysis.

Now that I have geocoded reference markers, I can use that to geocode TxDPS tickets that don't have lat/long, which is what you see above.

But there's a major problem with the above. Each blue dot represents the location of TxDPS activity, where at least one ticket was written. And each blue dot may represent MANY instances of TxDPS activity. Further, the closer you get to the DFW city centers, the fewer tickets you'll uncover over each dot.

My next step is to convert this into some kind of heat map. The heat map will show intensity of activity. Rural Interstates should show far greater activity than urban cores, which should show minimal or no activity.

Something already quite interesting are the gaps in the data. You may notice that roughly at the Dallas County line, TxDPS activity goes to zero on I-35E. Is this because TxDPS doesn't patrol I-35E north of the line? Is there a dataset error? Do TxDPS officers mis-enter data? Why does TxDPS appear to be active on I-45 all the way to downtown?

Also, why no enforcement on NTTA roads? I already know this is a problem because NTTA contracts out traffic enforcement to TxDPS. I think it's because my geocoding excluded all but state and federal roads, and NTTA roads are locally-owned. This is another example where I need to fine-tune my C# geocoding program, possibly having to run the entire 15+ hour process all over again.

These are all the next questions I need to assess, and the heatmap will help make this data human-understanadable and ease this analysis and data correction.

Aren Cambre's picture

GIS

As mentioned in my previous post, I've had to learn geographical information systems in the past few months. This was mostly new to me.

I'm glad I switched to Postgres. Its GIS support is excellent thanks to PostGIS.

I had to figure out which GIS tool to use. ArcGIS is the popular answer, but it's costly. While SMU has ArcGIS licenses, it would take teeth pulling to get it installed anywhere besides computers in an academic computer lab. I have to do the vast majority of my research work at home, so that won't work. Plus, what if I wanted to go more GIS stuff after I finish my praxis? I'd rather not pay an ArcGIS license, so I turned to the open source community.

I spent a couple of weeks figuring out the open source GIS market. In short, Quantum GIS is probably the best choice for general GIS needs. GRASS GIS appeared to be a popular alternative, but it's UI is inferior, and it's included with Quantum GIS, so that just helped me choose Quantum GIS.

I'll post more later. Just choosing a GIS product was enough of a challenge!

 

Aren Cambre's picture

Converting mileposts to latitude and longitude

In the past few months, I've been learning graphical information systems (GIS). This is a challenge; I've never done GIS before.

The latest learning task is figuring out how to convert highway mileposts to latitude and longitude. Why do I need this? Highways generally don't have street addresses. For example, almost all TxDPS tickets reference a milepost on a highway, such as milepost 283 on I-20.

That introduces another problem: there's no existing dataset that maps mile markers to latitude and longitude. So I have to create it myself.

Back to GIS, I have a GIS ShapeFile file from TxDOT. It has precise coordinates for all highways, and it also has the highway's length.

I've converted this ShapeFile to a PostGIS-enabled PostGres database. Now I get to use PostGIS's linear referencing functions to trace these routes and discover their mile markers.

But even that creates interesting challenges. In the ShapeFile, each highway's distance is measured in degrees. As in degrees of latitude. (Not longitude--each degree of longitude is a different distance apart depending on distance from the equator.)

So I have to convert degrees of latitude to miles. Not too hard--take the earth's circumference in miles and divide by 360.

NASA says the equatorial circumference is 24,901.55 (source). (Can I use that as a serious source when I write my praxis? Ha!) That divided by 360, the number of degrees of latitude, means a degree of latitude is 69.17 miles.

Yay, now I can find mile markers!

Aren Cambre's picture

Database changes to PostgreSQL

This project's database technology is changing to PostgreSQL.

Microsoft MapPoint 2010 is too primitive to be my comprehensive GIS solution. After a little searching, it was clear that PostgreSQL, thanks to PostGIS, dominates the database layer of GIS tools.

On top of this, I need good geospatial query support. MySql's geospatial features are minimal and apparently slow.

Fortunately, existing data is very simple tables, so data conversion is easy. But unfortunately, I put a lot of effort into learning .NET Framework 4's Entity Framework v2. The free Npgsql PostgreSQL ADO.NET provider doesn't support the DDEX necessary for full Entity Framework compatibility. (OK, so you can do Entity Framework stuff manually, but you can't do much in Visual Studio, so it is still incomplete.) I could pay for Devart's dotConnect for PostgreSQL for $180, which does support Entity Framework.

But given that no Entity Framework provider seems to support geospatial field types, it may not matter.

Decisions, decisions.

Aren Cambre's picture

Dallas data geocoded but...

I geocoded my Dallas citation data and plotted it on a map. It appears Microsoft MapPoint 2010 is making some errors, putting some Dallas citations far outside Dallas. This is weird because MapPoint accepts the name of the city in its own field during geocoding.

Also, poking around the data more, I've got to move to some other program to do visualizations. Microsoft MapPoint does a bad job at visualizations with more than 2 dimensions. In my case, I need to have some measure of intensity. For example, if 100 traffic citations were issued at one spot, I need for that to look much different than another spot with only 1 citation. MapPoint can't do this in a useful way.

Texas map that has lots of pushpins outside of expected areas

Aren Cambre's picture

Database choice

My database choice has not been without consternation.

So far I've been using MySql, and I lean towards sticking with it. A few reasons:

  • Even though SQL Server Express 2008 R2 is free and now has a 10GB data size limit, it still has significant limitations (to steer you into the paid version), and I don't know how those will affect me as the project plays out.
  • Even though my data may have multidimensional aspects, and even though SQL Server has powerful analytics and OLAP services, I think I can handle this fine with R (the statistical analysis language), C#, and MySql.
  • MySql's Visual Studio integration and .Net Framework compatibility is really good. It even supports .Net Framework 4's Entity Framework ORM. I believe that SQL Server still has superior integration and compatibility, but I don't think those enhancements are useful to me.
  • I have plenty of opportunities to use SQL Server in my Microsoft-centric work environment. I want an opportunity to learn MySql more deeply.
  • MySql may have better compatibility with the tools that come with R, although R does have an ODBC plugin...
  • If I need to move the database to school servers, MySql will probably be an easier sale.
  • Better ecosystem of free tools with MySql (important when you're on a shoestring budget!).
  • Ultimately, I can convert my data to SQL Server if required.
Aren Cambre's picture

Garland data imported

Garland's data is now imported.

Garland gave me three XLSX (Excel 2007) files with these dates:

  • 1/1/1997 through 1/1/1999 (10.1 MB)
  • 1/2/1999 through 12/31/2005 (26.4 MB)
  • 1/1/2006 through 10/31/2009 (57.0 MB)

I converted these to CSV and then stitched together into one consolidated file.

The CSV had several backslashes. These have special meaning during the import because they escape other characters. I reviewed many of these, and in fact each use of a backslash was done when a forward slash would have made more sense, so I just did a search and replace to convert the backslahses. I wanted to not replace any double backslashes, so I searched for \\([^\\]) and replaced with /\1. This means I found any instance of a backslash that wasn't followed by another backslash and converted it to a forward slash. But I then searched more and found that remaining instances of backslash made no sense, so I just converted them all to forward slashes.

Here's the SQL:

CREATE TABLE de.raw_garland (
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
Officer_Id VARCHAR(5),
Driver_License VARCHAR(20),
Race VARCHAR(1),
Sex VARCHAR(1),
Veh_License VARCHAR(12),
Veh_Year VARCHAR(4),
Veh_State VARCHAR(2),
Veh_Make VARCHAR(15),
Veh_Color VARCHAR(8),
Arr_Code VARCHAR(4),
Arr_Description VARCHAR(60),
Allg_Speed VARCHAR(3),
Posted_Speed VARCHAR(3),
Arr_Date VARCHAR(10),
Arr_Address VARCHAR(169),
Arr_City VARCHAR(7),
Final_Disposition VARCHAR(39),
Disp_Date VARCHAR(10),
Fine VARCHAR(9)
);
 
LOAD DATA INFILE 'C:/Users/Aren Cambre/Desktop/g/Consolidated.csv'
INTO TABLE de.raw_garland
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
(Officer_Id, Driver_License, Race, Sex, Veh_License, Veh_Year, Veh_State, Veh_Make, Veh_Color, Arr_Code, Arr_Description, Allg_Speed, Posted_Speed, Arr_Date, Arr_Address, Arr_City, Final_Disposition, Disp_Date, Fine);
Syndicate content