October 24th, 2019

Linux Utilities and the Neustar GeoPoint Data File


As part of the sales team, we occasionally receive urgent emails that involve slicing and dicing a Comma Separated Values (CSV) text file that is tens of millions of lines long and tens of gigabytes in size.

This CSV file is a Neustar service branded as GeoPoint, which is a geo-IP database that maps IP address ranges to geography, network ownership, proxy identification, and other contextual information.

Excel is a great tool and mastery of pivot tables makes summarizing, sorting, reorganizing, grouping, counting, totaling or averaging data relatively easy. Unfortunately, the data set that I most often work with will not load into Excel since it currently has a limit of 1,048,576 rows.

The purpose of this guide is to showcase examples of manipulating large files of structured text, a task often required when joining disparate data sets or importing into a database, in this case typically a relational database (RDBMS). This process is often referred to as ETL, for "Extract, Transform, Load". This article focuses primarily on transforming the data; loading the data will be covered in a subsequent article.

Having a basic understanding of these (POSIX) commands has helped me provide much more efficient customer support. It certainly beats writing JIRA tickets and being dependent on Engineering.

Disclaimer: I am not a power user, so there are likely more efficient ways to solve any one of the following problems. Also, many of the examples chain commands together rather than referencing previously saved output; hopefully this simplifies the examples.

All feedback is welcome and I will update this article with any interesting examples, explanations, or recommendations that I receive.

The following examples use an abridged version of the database which contains the header record, the first two data rows, five chosen ranges and the last two lines of data.

Sample CSV file

The compressed version of the sample data can be found here or copy the following:

"start_ip_int","end_ip_int","country_code","country_cf","organization","anonymizer_status","proxy_type"
"16777216","16777471","au","86","apnic and cloudflare dns resolver project","",""
"16777472","16778239","cn","86","chinanet fujian province network","",""
"2627375104","2627375359","gb","76","neustar  inc.","",""
"2627377408","2627377663","us","99","neustar  inc.","",""
"2627391498","2627391498","us","86","neustar  inc.","active","http"
"2857327478","2857327478","co","99","tv azteca sucursal colombia","private","web"
"3122297186","3122297186","cr","86","claro cr telecomunicaciones s.a.","suspect","socks"
"3758095872","3758096127","sg","99","marina bay sands pte ltd","",""
"3758096128","3758096383","au","86","apnic debogon project","",""

The last entry stood out given the Organization is "apnic debogon project". I thought all Regional Internet Registries (APNIC, RIPE, ARIN, LACNIC) had entered “IPv4 exhaustion”, but it appears that IPv4 is being further fragmented. Linux even recently released a patch to allow 0.0.0.0/8 as a valid address block. The examples in this article focus on IPv4, but it is important to note that Neustar also offers IPv6 data.

Assumptions

  • file type is ASCII; compressed using gzip
  • checksum is derived from the compressed (.gz) file
  • all fields/values in CSV are double-quoted ("")
  • country_code is ISO 3166-1 (Alpha-2)

The included checksum hashes for the gzipped CSV file are:

  • MD5 = 30a811e676e692015e14ac23080c532f
  • SHA-256 = d10b88f0e04dda256cf8fa1bd31bf02bfcb439fb85bf44afeb103bbd87073133

 

Part 1 - Retrieving the File

The production file is stored on a Neustar SFTP server and it is recommended you use the gzipped sample database if following the first two steps in Part 1.
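
Retrieving the file over SFTP might look like the following sketch; the user, host, and remote path are placeholders rather than the real Neustar values:

$ sftp jdoe@sftp.example.com:/exports/gpu.csv.gz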

Step 1. Check File Integrity

The first data integrity check should use either the SHA-256 (recommended) or MD5 checksum to verify that the file was not corrupted during transmission.

SHA

$ shasum -a 256 gpu.csv.gz
d10b88f0e04dda256cf8fa1bd31bf02bfcb439fb85bf44afeb103bbd87073133 gpu.csv.gz
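
Rather than comparing digests by eye, shasum can also verify the file against the expected value (a small sketch using the hash above):

$ echo "d10b88f0e04dda256cf8fa1bd31bf02bfcb439fb85bf44afeb103bbd87073133  gpu.csv.gz" | shasum -a 256 -c
gpu.csv.gz: OK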

MD5

Calculate a message-digest fingerprint (checksum) for a file:

# Linux

$ md5sum gpu.csv.gz
30a811e676e692015e14ac23080c532f gpu.csv.gz

# MacOS

$ md5 gpu.csv.gz
MD5 (gpu.csv.gz) = 30a811e676e692015e14ac23080c532f

Step 2. Decompress the File

The GeoPoint data file is compressed using gzip.

Deflate the .gz file and keep the compressed file:

$ gzip -dk gpu.csv.gz
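
gzip can also test the archive's integrity without extracting it (no output means the file is intact), and if the installed gzip predates the -k flag, redirecting the decompressed output keeps the original as well:

$ gzip -t gpu.csv.gz
$ gzip -dc gpu.csv.gz > gpu.csv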

Step 3. Validate the CSV File

Assuming the checksums match, we can be highly confident the data has not been modified during transmission. Additional checks might include the line count, the column count, and ensuring all values match the expected data type and maximum length.

List recently modified files/directories last, in human readable sizes:

$ ls -lrth
total 16
-rw-rw-rw-@ 1 ben  staff   432B Sep 27 11:21 gpu.csv.gz
-rw-rw-rw-  1 ben  staff   862B Sep 28 16:48 gpu.csv

Verify the file type:

$ file gpu.csv
gpu.csv: ASCII text

Count the number of lines:

$ wc -l gpu.csv
      10 gpu.csv

Count the number of columns:

$ head -1 gpu.csv | sed 's/[^,]//g' | wc -c
       7

wc -c counts the newline character, so it adds one to the number of commas left by sed, which correctly equals the column count.

Using awk:

$ awk -F, '{print NF; exit}' gpu.csv
7
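
The same check can be extended to every row to confirm the file is rectangular; this sketch assumes no values contain embedded commas, and no output means every row has seven fields:

$ awk -F, 'NF != 7 {print "line " NR " has " NF " fields"}' gpu.csv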

 

Part 2 - Modifying the File

This part covers modifications to the CSV file such as reducing the number of fields (i.e. columns) or filtering the records (i.e. rows) by matching a specific value or pattern. There is no specific order to the following commands as most use cases are slightly different. Part 3 showcases some real world questions.

Extract the first (1st) and second (2nd) columns:

$ cut -d "," -f1,2 gpu.csv
"start_ip_int","end_ip_int"
"16777216","16777471"
...
"3758096128","3758096383"

*Print the third (3rd) column without the double quotes:

$ awk -F "\"*,\"*" '{print $3}' gpu.csv
country_code
au
...
au

* This will not work for the first or last columns, as the leading or trailing quotation mark will remain.

Print the values from the first two columns, adding a comma, and remove the double quotations:

$ awk -F '","' '{print $1","$2}' gpu.csv | sed -e 's/.*"//g'
start_ip_int,end_ip_int
16777216,16777471
...
3758096128,3758096383

Time how long it takes to extract the first two columns and remove all double quotes:

$ time cat gpu.csv | cut -d "," -f1,2 | awk '{gsub(/\"/,"")};1'
start_ip_int,end_ip_int
16777216,16777471
...
sys 0m0.006s

Most databases can ignore double quotations on import, so removing them may not be necessary, but the quote stripping is used in the awk example below.

Do not use this if the data contains double quotes within the values; if any double quotes are escaped, this awk command will not work.

Remove all double quotations; cut two columns; subtract the first column from the second (the header row evaluates to 0):

$ awk '{gsub(/\"/,"")};1' gpu.csv | cut -d "," -f1,2 | awk 'BEGIN { FS=","; OFS="," } {print $2 - $1}'
0
255
...
255

Not all IP ranges fit into a single CIDR subnet. For instance, line 2 of the sample (a difference of 767 in the example above, i.e. 768 addresses) starts at 1.0.1.0 and ends at 1.0.3.255, which requires two CIDR blocks; the integer-to-dotted-quad conversion is sketched after this list:

  • 1.0.1.0/24
  • 1.0.2.0/23
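
The conversion from the integer form to dotted-quad notation can be sanity-checked with a short awk sketch (using the start value from that row):

$ awk 'BEGIN { ip = 16777472; printf "%d.%d.%d.%d\n", int(ip/16777216)%256, int(ip/65536)%256, int(ip/256)%256, ip%256 }'
1.0.1.0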

So far, the examples have included the header. To print the file without the first line, or to delete it in place with sed:

$ tail -n+2 gpu.csv
"16777216","16777471","au","86","apnic and cloudflare dns resolver project","",""
"16777472","16778239","cn","86","chinanet fujian province network","",""
...
"3758096128","3758096383","au","86","apnic debogon project","",""

# Linux

$ sed -i '1d' gpu.csv 

# MacOS

$ sed -i '' -e '1d' gpu.csv 

Count the number of unique values in the third (3rd) column:

$ awk -F"\"*,\"*" '!_[$3]++' gpu.csv | wc -l
      8

"uniq" can be used if the data is first sorted.

Sort the file by the third column (3rd), extract the sorted column, count the number of unique lines:

$ sort -k3 -t, gpu.csv | cut -f3 -d , | uniq | wc -l
       8

Count the number of non-empty values in the sixth (6th) column:

$ awk -F'","' '{if ($6!="") print NR}' gpu.csv | wc -l
      4

Occasionally the exact line numbers in the file are known but more often searching is required.

Select only line numbers 4 to 5:

$ sed -n 4,5p gpu.csv
"2627375104","2627375359","gb","76","neustar  inc.","",""
"2627377408","2627377663","us","99","neustar  inc.","",""

awk makes it easy to search within structured data, in this case within a column of a CSV file.

Print all rows where "neustar" is in the value in the fifth (5th) column:

$ awk -F'","' '$5 ~ /neustar/' gpu.csv
"2627375104","2627375359","gb","76","neustar  inc.","",""
"2627377408","2627377663","us","99","neustar  inc.","",""
"2627391498","2627391498","us","86","neustar  inc.","active","http"

grep would be used when searching multiple files or directories, or when returning rows containing the search string in different columns.

Return all rows that contain "neustar":

$ grep -nr "neustar" gpu.csv
gpu.csv:4:"2627375104","2627375359","gb","76","neustar  inc.","",""
gpu.csv:5:"2627377408","2627377663","us","99","neustar  inc.","",""
gpu.csv:6:"2627391498","2627391498","us","86","neustar  inc.","active","http"

Delete all characters up to and including the last colon (":") on each line; print the remaining characters:

$ grep -nr "neustar" gpu.csv | sed -e 's/.*://g'
"2627375104","2627375359","gb","76","neustar  inc.","",""
"2627377408","2627377663","us","99","neustar  inc.","",""
"2627391498","2627391498","us","86","neustar  inc.","active","http"

Do not use this if the data being searched contains the colon (":") character.

The data contains a header row by default, but if the file was sorted before the header was removed then the header likely ended up on the last line (when sorting by a numeric column).

Using sed to delete the last line in a file:

$ sort -k4 -t, gpu.csv | sed -e '$ d'
"2627375104","2627375359","gb","76","neustar  inc.","",""
"16777216","16777471","au","86","apnic and cloudflare dns resolver project","",""
...
"2857327478","2857327478","co","99","tv azteca sucursal colombia","private","web"

If using MacOS and editing the file in place rather than piping to standard output, delete the last line with:

$ sed -i '' -e '$ d' gpu.csv

 

Part 3 - Real World Examples

Many examples have been provided as this document is largely intended to be a reference sheet, but these commands all originated from the following real world requests.

How many different countries are covered in the IPv4 data set?

Find the column number of the country code and perform a distinct count. In the sample data, the third column is the country code but it is not grouped by default.

The awk example that counts the distinct values in a column is the most concise, given the file is not already sorted. Subtract one from the result, as the original file includes a header record.
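
Putting it together as one hedged sketch (NR>1 skips the header, so no subtraction is needed):

$ awk -F'","' 'NR>1 && !seen[$3]++' gpu.csv | wc -l
       7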

What percentage of IP ranges are assigned to a proxy?

This question also manifests as a request for city or postal code coverage. Find the total line count, find the number of records that have a value (i.e. not an empty string "") in the target column, and then convert to a percentage.

See the earlier example that counts the non-empty values in a column; the target column here is the seventh (proxy_type), and changing the operator to "==" would instead test for an empty value.

Count the lines that have a value in the target column, count the total lines in the file, and convert the ratio to a percentage.
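
A hedged one-pass sketch (quotes stripped first, assuming no values contain embedded commas, testing the seventh column, proxy_type):

$ awk -F, '{gsub(/"/,"")} NR>1 {total++; if ($7 != "") proxy++} END {printf "%.1f%%\n", 100*proxy/total}' gpu.csv
33.3%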

Compute the number of IP addresses in each range.

The sample data contains 9 IP ranges, each split into a network start and end boundary, where the largest value is 3758096383 (i.e. 223.255.255.255). These values will likely be stored as a LONG, BIGINT, or a built-in network address data type.

I could not get this to work while the values are double-quoted, so I used awk to remove all double quotes and then extracted and subtracted the two columns. Adding one to the difference gives the number of single IP addresses in the range, which does not necessarily correspond to a single CIDR prefix.
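
Combined into one hedged sketch (quotes stripped first, header skipped, and assuming no values contain embedded commas):

$ awk -F, '{gsub(/"/,"")} NR>1 {print $2 - $1 + 1}' gpu.csv
256
768
...
256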

Conclusion

Although these examples were based on Neustar's geo-IP data, hopefully they help to solve other problems. We did not explicitly cover the use of Regular Expressions, which would have added complexity but allowed a single command to answer a more specific question, something like... return all the proxies based in the US. This is likely possible in awk as well. Regular Expressions can be an exhausting topic and are saved for a future article.
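
As a teaser, a hedged awk sketch that returns the US-based proxies from the sample file using exact matches rather than regular expressions:

$ awk -F'","' '$3 == "us" && $6 != ""' gpu.csv
"2627391498","2627391498","us","86","neustar  inc.","active","http"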

If interested in learning more about any of Neustar's services then please email me at: ben.jones@team.neustar.
