Best Practices for Preparing Data

One of the many powerful features of SpatialKey is its ability to work with virtually any kind of information and the simplicity of getting that information into the application quickly and easily. With that said, in order to best utilize SpatialKey, there are some things that you should be familiar with. What follows in this article, are various tidbits and tips to make your experience using SpatialKey the best it can be!

What kind of data works best in SpatialKey?
Great vs. Not So Great Data in SpatialKey
The Import Format: CSV Files
Temporal Data (Time based)
Geospatial Data
Other Data Import Tips
Final Words of Encouragement

What kind of data works best in SpatialKey?

SpatialKey is an excellent business intelligence tool for analyzing and aggregating virtually any kind of information, but where it really shines is when you have data that acts like a good journalistic article in a newspaper. What does that mean? Well, the data should tell you all of the basic facts: that is, the who and/or what, the when, and the where of what is being reported. In our technical terms we like to call this telling a temporal/geospatial story. Temporal… Geospatial… what’s that? Don’t worry too much about the terms, all it comes down to is that each record or item in your data should say something like, “Event A happened at location B, at date and time C”. That’s it, a simple news report.

Do you have to have all three elements of this “story” for SpatialKey to work? Absolutely not, but it works best if you do.

Now for an example.

Lets say that we work in the real-estate industry and we have collected data about the local property sales over the past few years. Within this data are plenty of facts: the addresses of the properties, the amounts the properties sold for, who the current owner of the property is, when the property went on the market, when it sold, etc. This data is valuable to those in the real-estate industry and could be to consumers as well if interested in the facts that are available. The real issue with business intelligence and data visualization is how best to present that information in a way that a human being can work with and understand it. The data could consist of many thousands of rows, and even if presented in a sortable spreadsheet like format cannot truly give a person a high impact, dynamic and visual representation of what the data is saying. This is where SpatialKey comes in! In fact, this kind of data is exactly the kind that SpatialKey loves as it contains individual records (you will see why this is important in a moment) that show the who/what, the where and the when.

So, how is this type of data different from what a lot of current “spreadsheet” data has? Often times the information that a business user (what I mean by that term is someone who is trying to use collected information to make business decisions, i.e. not a data analyst, SQL database guru, or an IT tech person) has is already in an aggregated format and it takes a lot of power away that SpatialKey can give you. Aggregated… what does that mean? By aggregated I mean that instead of having the “raw” data, i.e. the individual events or data points, an analyst, or perhaps a different reporting tool has combined the information into a more meaningful view to you as a user.

So why does this take power away from SpatialKey? Because the magic of SpatialKey allows ANYONE to aggregate the “raw” information on the fly in a very simple, fast, and innovative way. A report that contains aggregated information may show you what you need to know for a particular set of criteria, but what if the criteria changes, or you would rather see different information altogether? Once created, that pre-aggregated information cannot help you with changed criteria and a new report would have to be written. By using SpatialKey and having imported the “raw” data you can change the view on an existing report by simply changing the report filters, data pods, and manipulation of the map!

If you want to see a few examples, check out our Sample CSV Data files.

Great vs. Not So Great Data in SpatialKey

Here is an example of what would not work so well in SpatialKey with sample data showing employment figures from 2005 – 2007.

Not So Great Data Example
State	2005 Employed	2005 Unemployed	2006 Employed	2006 Unemployed	2007 Employed	2007 Unemployed
California	6	2	7	1.6	8	1
Kansas	3	.5	4	.5	4.6	.6

See the date information in the headers with the supporting data in cells below. This data structure contains human readable information, but would be fairly useless in SpatialKey as it is already aggregated. Here is what SpatialKey would like to see.

“Raw” Data that SpatialKey Likes
State	Year	# Employed	# Unemployed
California	2005	6	2
Kansas	2005	3	.5
California	2006	7	1.6
Kansas	2006	4	.5
California	2007	8	1
Kansas	2007	4.6	.6

This data is certainly not as readable by a human as the data from the first table as it is not summarized (aggregated). Instead we have many more “raw” data points that SpatialKey can aggregate based on user interaction with a report in real time! No analyst or IT is needed for a business user to get access to the aggregated information they may need other than providing the original data to import. The difference can be subtle but very important for getting the most out of SpatialKey!

The Import Format: CSV (Comma Separated Value) Files

SpatialKey needs to process records prior to being able to view them in a report. This is a multi-step procedure that prepares the “raw” data for quick aggregations and presentation on a map. The CSV file is the transport of choice for moving this data into the SpatialKey application.

Why the CSV file?

First of all, CSV has been around a very long time and there is a great deal of data already assembled into this format.

Secondly, it is easy to work with. The file itself is a simple text document, meaning you should be able to open it in Notepad on a Windows machine or TextEdit on a MAC. But it is also a spreadsheet format that can be manipulated in a more native spreadsheet program like Excel as well.

Third, it is simple and lightweight. There are many other options for data transport between systems. XML, for instance, is a powerful and descriptive way of transporting data. We feel, however, for our purposes, XML would simply bloat what can already become an extremely large file with useless text (as we do need access to attribute based information, sub tags, etc. that XML can provide). This is especially true when you are talking of records numbering in the tens or hundreds of thousands of rows.

The CSV file does not really have a single one size fits all standards based specification although there is a commonly used one that we follow as well. In addition, our CSV file needs to have as the very first row a header line containing the column header names that the user would like to refer to the data by. Omitting the header is not allowed and will result in the first row of data being used as a header if one is not provided. In order for your CSV to be properly parsed into SpatialKey the following guidelines should be followed:

Each row of data should be on its own line followed by a line feed or carriage return/line feed.
Each “field” or column within the data is separated by a comma (even if no data exists for that column).
Data within a column, if it contains a comma as a part of the text, must be prepended and appended with double quotation marks:
1. For example if your data was: Hello, world
2. Then in the CSV you should see: “Hello, world”
3. If you don’t follow this step, the CSV parser will think that Hello and world are two separate fields
Data within a column, if it contains a double quote as part of the text, the data must be prepended and appended with double quotation marks and the double quotes within must be escaped by changing each instance of ” to “”:
1. For example if your data was: The book was called “Hello World”
2. Then in the CSV you should see: “The book was called “”Hello World“””
3. Note the double – double quotes around “”Hello World“”
4. This is known as “escaping” the double quotes
Data within a column, if it contains a line break, must be prepended and appended with double quotation marks.
Data within a column, if it contains leading or trailing spaces that you wish not to lose (trimmed by default), should be prepended and appended with double quotation marks.
Fields can always be double quoted even if it is not required.
Any rows that contain more comma delimited fields than there are headers will be dropped as bad data since there is no accurate way for us to be able to dynamically determine what data would go in which column otherwise.

Here is an example CSV containing many of the above (borrowed from Wikipedia: http://en.wikipedia.org/wiki/Comma-separated_values):

year,make,model,options,price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00

Temporal Data:

Temporal data is simply data that can be plotted on a time line. In many cases this temporal (date and time based) data is readily available in a data set. In order to view or work with this data in SpatialKey, we simply need to include the data in the import process and be sure that the fields are “typed” as a date. The granularity, or how accurate the date is, should be is set as well. For instance if a date field in your data included both date and time information, but all of the times were set to midnight (meaning that there is no differentiation from one record to another at the time level), then the time information is insignificant to the data set and the date granularity should really be set as Day. If the times were accurate down to the second of when the event occurred, however, then it would make more sense to have the granularity set at Second. In most cases the import wizard can detect this for you and set the option appropriately or at least limit the choice based on the preponderance of data in the column.

Is temporal data a necessity? No, it is not needed to run a map based report template. It is only important if you wish to aggregate based on dates/times and/or view your data in relation to a time line which is a powerful feature of SpatialKey.

Geospatial Data:

Geospatial data consists of the where an event or item took place or exists. This data could be represented by something as abstract as a country to as detailed as a full address including: address, city, state, postal code, and country. The more data available, the better the accuracy of location plotting SpatialKey can do. The entire process of going from raw data to being able to plot that information on a map consists of a step we call geocoding.

What is geocoding? Geocoding is taking street address information or other types of geographic data like an internet IP Address and getting a latitude and longitude from that. With a latitude and longitude, along with SpatialKey’s special sauce, we can produce those beautiful aggregated heats maps in our application reports. This geospatial data will also give us the ability to filter that data in amazing ways as well; Imagine being able to draw on the map a boundary area (polygon) and filtering for all related data within x-number of miles from that, or points that only exist within. Stay tuned, it’s going to get interesting.

In order to be able to geocode address based information, the data needs to be set up in a way that SpatialKey can work with. This is especially true for street address information. Some databases may store the street address as a collection of several fields. A field for the house number, a field for the street name, a field for the street type, as an example. SpatialKey cannot currently geocode a street address that is split in such a way. In order to solve this problem, when extracting your data from a database, combine all street address information into a single CSV column. If you do not have access to a database for your CSV data and already have your data in a CSV format split out, use existing spreadsheet capabilities to combine the columns.

It is important to note that even if SpatialKey cannot geocode your data to the address level, it could still geocode to the postal code, city, state/region, or county level as well. In fact, if SpatialKey cannot match an address (perhaps the address was invalid in the data) then the next available level (postal code) would be checked next. The accuracy of geocoding is what we call location granularity and as a user you can filter out records from reports that were not accurately geocoded if needed in the final steps of the import wizard or in the data set manager/report filters at any time.

In addition to Geocoding, SpatialKey also supports data that already contains latitude and longitude coordinates.

Both columns must be available in the data
The data should be formatted in decimal degrees (example latitude: 75.00034)
Empty values for latitude or longitude will default to 0.0
Hours, Minutes, Seconds is not currently supported

Other Data Import Tips

Here are some other tips as you are preparing your data for import:

Use the capabilities of spreadsheet applications
- Many of these have filtering capabilities that can help you to correct these issues before importing
If you are getting your data from a back end system or a data administrator, you will likely have to go back to the person generating your “reports” in order to get “raw” data.
- When you do, be explicit with them about how you want the data structured, show them examples if possible.
- IT folks like explicit examples as it removes verbal ambiguity.
- Many of them have been cutting data to be consumed using traditional reports and tools and will automatically assume those formats will work for you as well.
- You may even want to show them how SpatialKey uses data. It will give them a much better appreciation for why you are asking for data formatted this way.
If you are exporting data, select the option to quote all fields (depends on the specific application or database you are exporting from). If you do this, you will not need to worry about the CSV specifications mentioned above.
Dates and Times
- It is impossible to support every conceivable date and time combination that a person could come up with, although we do try!
- Try to use the most commonly used date and time text formats if possible
- Just a few examples (by no means exhaustive)
  - mm/dd/yyyy
  - mmddyyyy
  - mmyyyy
  - yyyy/mm/dddd
  - yyyy
  - HH:MM:SS
  - 12 JAN 2008 12:34:55
Text data
- Text data is restricted to 255 characters per field
- Text that has characters beyond this limit will be truncated down to the limit
Attempt smaller data sets first (a smaller subset of the total amount of data) to make sure that the results look correct before going for the tens of thousands. It is much better to see an issue with a data import after a few seconds to a minute than waiting many minutes only to discover an issue that requires you to re-import the data set.

Final Words of Encouragement

By taking a little bit of time with your data up front, you can save a lot of time and energy once in the SpatialKey application. The results that SpatialKey can produce for you is only as good as the source data itself, the Data is Key! We can’t wait to see the stories your data has to tell. Please contact us at any time, we love hearing your feedback and ideas.

It is a brave new day for business intelligence, geospatial tools, and data analysis and we encourage you to see your data for the first time by unlocking it with SpatialKey!