Linux Utilities and the Neustar GeoPoint Data File
As part of the sales team, we occasionally receive urgent emails that often deal with slicing and dicing a Comma Separated Values (CSV) text file that is tens of millions of lines in length and tens of gigabytes in size.
This CSV file is a Neustar service branded as GeoPoint, which is a geo-IP database that maps IP address ranges to geography, network ownership, proxy identification, and other contextual information.
Excel is a great tool, and mastery of pivot tables makes summarizing, sorting, reorganizing, grouping, counting, totaling, or averaging data relatively easy. Unfortunately, the data set I most often work with will not load into Excel, which currently has a limit of 1,048,576 rows.
The purpose of this guide is to showcase examples of manipulating large files of structured text, a task often required when joining disparate data sets or importing into a database, typically a relational one (RDBMS) in this case. This process is often referred to as ETL: Extracting, Transforming, and Loading. This article focuses primarily on transforming the data; loading the data will be covered in a subsequent article.
Having a basic understanding of these (POSIX) commands helped me in providing much more efficient customer support. It certainly beats writing JIRA tickets and being dependent on Engineering.
Disclaimer: I am not a power user, so there are likely more efficient ways to solve any one of the following problems. Also, many of the examples chain commands together rather than referencing previously saved output, which hopefully simplifies them.
All feedback is welcome and I will update this article with any interesting examples, explanations, or recommendations that I receive.
The following examples use an abridged version of the database which contains the header record, the first two data rows, five chosen ranges and the last two lines of data.
Sample CSV file
The compressed version of the sample data can be found here or copy the following:
The last entry stood out given the Organization is "apnic debogon project". I thought all Regional Internet Registries (APNIC, RIPE, ARIN, LACNIC) had entered "IPv4 exhaustion", but it appears that IPv4 space is being further fragmented. The Linux kernel even recently accepted a patch to allow 0.0.0.0/8 as a valid address block. The examples in this article focus on IPv4, but it is important to note that Neustar also offers IPv6 data.
- file type is ASCII; compressed using gzip
- checksum is derived from the compressed (.gz) file
- all fields/values in CSV are double-quoted ("")
- country_code is ISO 3166-1 (Alpha-2)
The included checksum hashes for the gzipped CSV file are:
- MD5 = 30a811e676e692015e14ac23080c532f
- SHA2 = d10b88f0e04dda256cf8fa1bd31bf02bfcb439fb85bf44afeb103bbd87073133
Part 1 - Retrieving the File
The production file is stored on a Neustar SFTP server and it is recommended you use the gzipped sample database if following the first two steps in Part 1.
Step 1. Check File Integrity
The first data integrity check should use either the SHA2 (recommended) or MD5 checksum to verify that no errors were detected during the transmission.
Calculate a message-digest fingerprint (checksum) for a file:
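On Linux these are typically computed with `md5sum` and `sha256sum` (macOS ships `md5` and `shasum -a 256` instead). A minimal sketch; the filename `GeoPoint.csv.gz` is an assumption:

```shell
# Stand-in for the real download (filename is hypothetical); with the
# actual file, compare the output against the published MD5/SHA2 values.
printf 'example data' > GeoPoint.csv.gz
md5sum GeoPoint.csv.gz       # 32-hex-character MD5 digest
sha256sum GeoPoint.csv.gz    # 64-hex-character SHA-256 digest
```

Only an exact match against the published digest proves the transfer arrived intact.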
Step 2. Decompress the File
The GeoPoint data file is compressed using gzip.
Deflate the .gz file and keep the compressed file:
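A sketch using `gunzip -k` (the `-k`/`--keep` flag requires gzip 1.6 or newer; on older versions, `gunzip -c file.gz > file` achieves the same result):

```shell
# Build a small gzipped stand-in, then decompress it while keeping
# the original .gz alongside the inflated copy.
printf '"a","b"\n' > sample.csv
gzip -f sample.csv            # creates sample.csv.gz
gunzip -k sample.csv.gz       # restores sample.csv, keeps sample.csv.gz
ls -l sample.csv sample.csv.gz
```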
Step 3. Validate the CSV File
Assuming the checksums match, we can be highly confident the data was not modified during transmission. Checking attributes like the line count or the column count may be considered as well, along with ensuring all values match the expected data type and fall within the maximum length.
List recently modified files/directories last, in human readable sizes:
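For example:

```shell
# -l long listing, -h human-readable sizes, -t sort by modification
# time, -r reverse the order so the newest entries print last
ls -lhtr
```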
Verify the file type:
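A sketch with a stand-in file; the exact description varies by `file` version:

```shell
printf '"a","b"\n"1","2"\n' > sample.csv
file sample.csv    # e.g. "sample.csv: ASCII text" (newer versions may report "CSV text")
```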
Count the number of lines:
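With `wc -l`:

```shell
printf 'header\nrow1\nrow2\n' > sample.csv
wc -l sample.csv    # 3 lines (including the header)
```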
Count the number of columns:
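One way is to strip everything except the commas from the header and count the characters that remain (the column names below are illustrative, not the real schema):

```shell
# 7-column header (field names are hypothetical)
printf '"start","end","cc","region","org","city","proxy"\n' > sample.csv
# Keep only the commas in the first line, then count characters
head -1 sample.csv | sed 's/[^,]//g' | wc -c    # 6 commas + newline = 7
```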
wc -c counts the new line character thus adds one to sed's output of the commas, which correctly equals the column count.
Part 2 - Modifying the File
This part covers modifications to the CSV file, like reducing the number of fields (i.e. columns) or filtering the records (i.e. rows) by matching a specific value or pattern. There is no specific order to the following commands as most use cases are slightly different. Part 3 showcases some real-world questions.
Extract the first (1st) and second (2nd) columns:
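With `cut`, using a small stand-in file (the values are illustrative):

```shell
cat > sample.csv <<'EOF'
"start","end","cc"
"16777216","16777471","au"
"16777472","16778239","cn"
EOF
# -d sets the delimiter, -f selects the fields to keep
cut -d, -f1,2 sample.csv
```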
*Print the third (3rd) column without the double quotes:
* Will not work on the first or last columns as the quotation mark will either be prepended or appended.
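Using the three-character string `","` as the field separator makes awk strip the quotes from interior fields; a sketch:

```shell
printf '"16777216","16777471","au","apnic"\n' > sample.csv
# Interior fields lose their quotes; $1 and the last field keep one
awk -F'","' '{print $3}' sample.csv    # prints: au
```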
Print the values from the first two columns, adding a comma, and remove the double quotations:
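One awk approach; `gsub()` rewrites the whole line, which re-splits the fields:

```shell
printf '"16777216","16777471","au"\n' > sample.csv
# Strip every double quote, then print the first two fields joined by a comma
awk -F, '{gsub(/"/, ""); print $1 "," $2}' sample.csv    # prints: 16777216,16777471
```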
Time how long it takes to extract the first two columns and remove all double quotes:
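Wrapping the previous pipeline in `time`; redirecting the output to a file (or `/dev/null`) keeps terminal printing from dominating the measurement:

```shell
printf '"1","2","x"\n"3","4","y"\n' > sample.csv
time awk -F, '{gsub(/"/, ""); print $1 "," $2}' sample.csv > two_columns.csv
head two_columns.csv
```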
Most databases can ignore double quotations on import so it may not be necessary to remove, but they are used for the awk example below.
Do not use this if the data contains double quotes within the values; if double quotes are escaped anywhere, this awk command will not work.
Remove all double quotations; cut two columns; subtract the first column from the second; do not print the header:
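A sketch; `NR > 1` skips the header, and the quotes must be stripped before the subtraction so awk treats the fields as numbers:

```shell
cat > sample.csv <<'EOF'
"start","end"
"16777216","16777471"
"16777472","16778239"
EOF
awk -F, 'NR > 1 { gsub(/"/, ""); print $2 - $1 }' sample.csv
# prints: 255 then 767
```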
Not all IP ranges fit into a single CIDR subnet. For instance, the 767 IPs in line 2 (above example) have a starting address of 18.104.22.168 and an ending address of 22.214.171.124.
So far, the examples have included the header but to remove the first line:
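Either `tail` or `sed` works:

```shell
printf 'header\nrow1\nrow2\n' > sample.csv
tail -n +2 sample.csv    # everything from line 2 onward
# equivalently: sed 1d sample.csv
```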
Count the number of unique values in the third (3rd) column:
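A portable awk sketch; the `seen[]` array records each distinct value of column 3, and `n` counts first sightings:

```shell
cat > sample.csv <<'EOF'
"s","e","cc"
"1","2","us"
"3","4","us"
"5","6","jp"
EOF
awk -F, '!seen[$3]++ { n++ } END { print n }' sample.csv    # prints 3 ("cc", "us", "jp")
```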
"uniq" can be used if the data is first sorted.
Sort the file by the third column (3rd), extract the sorted column, count the number of unique lines:
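For example (the header line counts as one of the "unique" values here):

```shell
cat > sample.csv <<'EOF'
"s","e","cc"
"1","2","us"
"3","4","jp"
"5","6","us"
EOF
sort -t, -k3,3 sample.csv | cut -d, -f3 | uniq | wc -l    # 3 distinct values
```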
Count the number of non-empty values in the sixth (6th) column:
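Since every value is double-quoted, an empty value is the two-character string `""`; a sketch (column layout is illustrative):

```shell
cat > sample.csv <<'EOF'
"s","e","cc","r","org","city","proxy"
"1","2","au","q","apnic","brisbane",""
"3","4","cn","f","chinanet","",""
EOF
# Count rows whose 6th field is not the quoted empty string
awk -F, '$6 != "\"\"" { n++ } END { print n }' sample.csv    # header + 1 data row = 2
```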
Occasionally the exact line numbers in the file are known but more often searching is required.
Select only line numbers 4 to 5:
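With `sed`:

```shell
printf 'line1\nline2\nline3\nline4\nline5\nline6\n' > sample.txt
# -n suppresses default output; 4,5p prints only that line range
sed -n '4,5p' sample.txt
```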
awk makes it easy to search within structured data, in this case within a column of a CSV file.
Print all rows where "neustar" is in the value in the fifth (5th) column:
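A sketch; `~` is awk's regex-match operator, so only rows whose 5th field contains "neustar" are printed:

```shell
cat > sample.csv <<'EOF'
"s","e","cc","r","org"
"1","2","us","va","neustar inc"
"3","4","au","q","apnic"
EOF
awk -F, '$5 ~ /neustar/' sample.csv
```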
Grep would be used if searching multiple files or directories, or if returning rows containing the same search string in different columns.
Return all rows that contain "neustar":
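For example:

```shell
printf '"1","2","neustar inc"\n"3","4","apnic"\n' > sample.csv
grep neustar sample.csv    # prints only the matching row
```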
Delete all characters up to and including the last found colon ":" char; print remaining chars on a new line:
Do not use if the data being searched uses the colon ":" character
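For example, stripping the `filename:` prefix that grep adds when searching multiple files; `.*:` is greedy, so everything through the last colon on each line is deleted:

```shell
printf 'sample.csv:"1","2","neustar inc"\n' | sed 's/.*://'
# prints: "1","2","neustar inc"
```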
The data contains a header row by default, but if the file was sorted before the header was removed, the header likely ended up on the last line (when sorting by a numeric column).
Using sed to delete the last line in a file:
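In sed's address syntax, `$` matches the last line and `d` deletes it:

```shell
printf 'row1\nrow2\nheader\n' > sorted.csv
sed '$d' sorted.csv    # prints everything except the last line
```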
If using MacOS and not piping to standard input then to delete the last line:
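BSD sed (macOS) requires an explicit backup-suffix argument after `-i`, which may be empty, while GNU sed takes the suffix attached to the flag. A sketch of the in-place versions:

```shell
printf 'a\nb\nc\n' > sample.csv
sed -i '$d' sample.csv        # GNU sed (Linux): edit in place, no backup
# sed -i '' '$d' sample.csv   # BSD sed (macOS): empty backup suffix
cat sample.csv                # a and b remain
```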
Part 3 - Real World Examples
Many examples have been provided, as this document is largely intended to be a reference sheet, but these commands all originated from the following real-world requests.
How many different countries are covered in the IPv4 data set?
Find the column number of the country code and perform a distinct count. In the sample data, the third column is the country code but it is not grouped by default.
The awk example (counting the distinct values in a column) is the most concise, given the file is not already sorted. Subtract one from this number, as the original file includes a header record.
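A sketch, assuming the country code is column 3 as in the sample data:

```shell
cat > sample.csv <<'EOF'
"s","e","cc"
"1","2","us"
"3","4","us"
"5","6","jp"
EOF
# Distinct values in column 3, minus 1 for the header record
awk -F, '!seen[$3]++ { n++ } END { print n - 1 }' sample.csv    # prints 2
```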
What percentage of IP ranges are assigned to a proxy?
This question also manifests as a request for city or postal code coverage. Find the total line count, find the number of records that have no value (i.e. an empty string "") in the target column and then convert to a percentage.
See the earlier example that counts all lines with a value in the target column (column 7 in this example); change the operator to "==" to test for an empty value instead.
Count the lines that have a value in the target column, count the total lines in the file, and convert the ratio to a percentage.
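A sketch, assuming the proxy flag is column 7 and a quoted empty string marks unassigned rows; the column number and names are assumptions:

```shell
cat > sample.csv <<'EOF'
"s","e","cc","r","org","city","proxy"
"1","2","us","va","x","a","anonymous"
"3","4","us","va","y","b",""
"5","6","jp","ky","z","c",""
EOF
total=$(awk 'END { print NR - 1 }' sample.csv)                                # data rows
with=$(awk -F, 'NR > 1 && $7 != "\"\"" { n++ } END { print n+0 }' sample.csv) # non-empty rows
awk -v w="$with" -v t="$total" 'BEGIN { printf "%.1f%%\n", 100 * w / t }'
# prints: 33.3%
```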
Compute the number of IP addresses in each range.
The sample data contains 9 IP ranges, split into network start and end boundaries, where the largest value is 3758096383 (i.e. 223.255.255.255). This will likely be stored as a LONG, BIGINT, or a built-in network address data type.
I could not get this to work while the values were double-quoted, so I used awk to remove all double quotes and then extract and subtract the two columns, yielding the number of individual IP addresses in the range, which does not necessarily correspond to a single CIDR prefix.
Although these examples were based on Neustar's geo-IP data, hopefully they help solve other problems. We did not explicitly cover the use of regular expressions, which would have added complexity but allowed a single command to answer a more specific question, something like: return all the proxies based in the US. This is likely possible in awk as well. Regular expressions can be an exhausting topic and are thus saved for a future article.
If interested in learning more about any of Neustar's services then please email me at: email@example.com.