...
 
 
Site Search
 
November 18th, 2020

Importing UltraGeoPoint into SQLite

The first article published in the series, Linux Utilities and the Neustar UltraGeoPoint Data File, focused on transforming raw geo-IP data by removing data fields and filtering values of a Comma Separated Values (CSV) file. This article will focus on loading the transformed data into a SQLite database engine, making it easier and faster to find the desired tuples for more complex queries involving multiple variables (e.g. all distinct ASNs in the US). The previous article used Sed or Awk often requiring Regular Expressions to filter the records, which is hard and error prone when compared to writing SQL statements.

Why SQLite

SQLite is an embedded SQL database engine that is simple, highly portable, is the most ubiquitous in the world (e.g. Android, iOS), and serves well for handling the UltraGeoPoint data in this tutorial. If a Postgres or MySQL schema is needed then please contact us.

End in Mind

The end goal of this post is to provide useful examples and code snippets to assist in loading the UltraGeoPoint file into SQLite.

There also may be additional source data, like requesting IP address, that needs to be imported and this may be in octet format (223.255.255.255). SQLite does not have a native data type for IP addresses and will need to be stored as an Integer. Thus, if the input file format is in dotted decimal notation, it will have to be converted into decimal format (3758096383).

Step 1 - Get the Data

For the convenience of this tutorial a subset of the database is being used. The full UltraGeoPoint schema is pasted at the end of this article.

Copy, paste, and save the following records to a file named, "geopoint.csv"

"start_ip_int","end_ip_int","country_code","country_cf","organization","anonymizer_status","proxy_type"
"16777216","16777471","au","86","apnic and cloudflare dns resolver project","",""
"16777472","16778239","cn","86","chinanet fujian province network","",""
"2627375104","2627375359","gb","76","neustar  inc.","",""
"2627377408","2627377663","us","99","neustar  inc.","",""
"2627391498","2627391498","us","86","neustar  inc.","active","http"
"2857327478","2857327478","co","99","tv azteca sucursal colombia","private","web"
"3122297186","3122297186","cr","86","claro cr telecomunicaciones s.a.","suspect","socks"
"3758095872","3758096127","sg","99","marina bay sands pte ltd","",""
"3758096128","3758096383","au","86","apnic debogon project","",""    

Step 2 - Build the Schema

For the convenience of this tutorial, the following examples are based on the sample data above, in a filed saved locally as "geopoint.csv". The full database contains dozens more data fields and its complete schema is pasted at the end of this article.

Given the ubiquity of SQLite, it is likely already installed on your *nix machine (including MacOS) and is also available for Windows.

To create or open a SQLite database named "ipi":

$ sqlite3 ipi
SQLite version 3.24.0 2018-06-04 14:10:15
Enter ".help" for usage hints.
sqlite>    

If installed, this should return the version number and the command prompt as shown above. To stop SQLite:

sqlite> Ctrl+D

To create a SQLite table, you can copy and paste the following into the sqlite> prompt:

CREATE TABLE geopoint (
    start_ip INTEGER PRIMARY KEY,
    end_ip INTEGER NOT NULL UNIQUE, 
    country_code TEXT,
    country_cf INTEGER,
    organization TEXT,
    anonymizer_status TEXT,
    proxy_type TEXT
);

Another option to create the table is to call an external file that contains the table definition.

If the previous create-table definition was saved in a file named "sqlite-create-table-geopoint", then to run it:

sqlite> .read sqlite-create-table-geopoint
sqlite> .tables
geopoint    

To list the tables and then view the schema:

sqlite> .tables
geopoint
sqlite> .schema geopoint    

Step 3 - Load the Data

Assuming at this point you have a database named "ipi" containing a single table named "geopoint" created in Steps 1 and 2, the next step is to load the data. For a quick test, the following insert statements should work:

INSERT INTO geopoint (start_ip, end_ip, country_code, country_cf, organization, anonymizer_status, proxy_type) VALUES (16777216,16777471,'au',86,'apnic and cloudflare dns resolver project','','');
INSERT INTO geopoint (start_ip, end_ip, country_code, country_cf, organization, anonymizer_status, proxy_type) VALUES (3758096128,3758096383,'au','86','apnic debogon project','','');    

The production data file is not composed of SQL statements, but instead is a Comma Separated Values (CSV) file, like the abridged sample data at the start of this tutorial. Similar to how the create-table statement file was loaded in Step 2, to import the data:

sqlite> .mode csv
sqlite> .import geopoint.csv geopoint    

If you notice the following error message:

gpu.csv:1: INSERT failed: datatype mismatch

Don’t worry... the header record was probably still in the file. The remaining records should still correctly import.

Step 4 - Optimize the performance

A full copy of the Neustar database will be tens of millions of records thus the start and end IP address columns should be indexed. These are the first two columns.

CREATE INDEX start_end_ip_index ON geopoint (start_ip, end_ip);

Maybe the end goal is to have a separate table of the source IPs to be queried. Running a join of hundreds of thousands of hundreds of IP addresses against tens of millions of IP ranges may require optimization beyond this tutorial.

Step 5 - Query the Data

A typical use case is to check the IP address of a new connection to the HTTP server and thus query the database for a single IP address at a time - versus the batch example alluded to in Step 4. If that is the case, then the following SQL select statement will work:

SELECT * FROM geopoint a 
WHERE a.start_ip = 
(SELECT MAX(start_ip) FROM geopoint b WHERE b.start_ip <= 3758096380) 
AND a.end_ip >= 3758096380;    

This assumes the IP address is already in decimal format where 3758096380 equals 223.255.255.252.

The application software serving the HTML will probably have a library to convert the IP address to an integer.

Step 5a - Output Query to File

One final example is to execute a prepared SQL statement and output the results to a CSV file. Maintaining a set of stored procedures provides a simple way to have easy access for common queries. Let's assume there is a file named, "sql-select-org-neustar.sql"

SELECT * FROM geopoint WHERE organization LIKE '%neustar%';

In the following example, the output file, "results.csv", will contain headers and execute the query as defined in the referenced file:

sqlite> .headers on
sqlite> .mode csv
sqlite> .output results.csv
sqlite> .read sql-select-org-neustar.sql
sqlite> .output stdout    

The length of this post would more than double if we barely touched upon 1) using bash/python to convert the octet IP addresses and 2) joining two large tables on a range of foreign key values.

Conclusion

SQLite offers a simple way to store Neustar's UltraGeoPoint in a database engine extending the GNU Linux utilities covered in the first article.

And finally, as promised in the beginning of the article, following is the SQLite schema for the full UltraGeoPoint data file:

DROP TABLE IF EXISTS geopoint;
CREATE TABLE geopoint (
start_ip_int INTEGER PRIMARY KEY,
end_ip_int INTEGER NOT NULL UNIQUE,
start_ip_oct TEXT NOT NULL UNIQUE,
end_ip_oct TEXT NOT NULL UNIQUE,
continent TEXT,
country TEXT,
country_code TEXT,
country_cf INTEGER,
region TEXT,
state TEXT,
state_code TEXT,
state_cf INTEGER,
city TEXT,
city_cf INTEGER,
postal_code TEXT,
postal_code_cf INTEGER,
area_code INTEGER,
time_zone REAL,
latitude REAL,
longitude REAL,
dma TEXT,
msa TEXT,
connection_type TEXT,
line_speed TEXT,
ip_routing_type TEXT,
asn INTEGER,
sld TEXT,
tld TEXT,
organization TEXT,
carrier TEXT,
anonymizer_status TEXT,
home INTEGER,
organization_type TEXT,
naics_code INTEGER,
isic_code INTEGER,
geonames_id INTEGER,
proxy_type TEXT,
proxy_level TEXT,
proxy_last_detected TEXT,
hosting_facility INTEGER
);
INSERT INTO geopoint (start_ip_int,end_ip_int,start_ip_oct,end_ip_oct,continent,country,country_code,country_cf,region,state,state_code,state_cf,city,city_cf,postal_code,postal_code_cf,area_code,time_zone,latitude,longitude,dma,msa,connection_type,line_speed,ip_routing_type,asn,sld,tld,organization,carrier,anonymizer_status,home,organization_type,naics_code,isic_code,geonames_id,proxy_type,proxy_level,proxy_last_detected,hosting_facility) VALUES ("16777216","16777343","1.0.0.0","1.0.0.127","oceania","australia","au","86","","new south wales","","68","sydney","50","2000","2","","10","-33.86714","151.20711","","","tx","high","fixed","13335","one","one","apnic and cloudflare dns resolver project","cloudflare","active","false","Internet Hosting Services","518210","J6311","2147714","http","elite","2020-11-05","false");
INSERT INTO geopoint (start_ip_int,end_ip_int,start_ip_oct,end_ip_oct,continent,country,country_code,country_cf,region,state,state_code,state_cf,city,city_cf,postal_code,postal_code_cf,area_code,time_zone,latitude,longitude,dma,msa,connection_type,line_speed,ip_routing_type,asn,sld,tld,organization,carrier,anonymizer_status,home,organization_type,naics_code,isic_code,geonames_id,proxy_type,proxy_level,proxy_last_detected,hosting_facility) VALUES ("3758096128","3758096383","223.255.255.0","223.255.255.255","oceania","australia","au","86","","queensland","","73","highgate hill","61","4101","2","","10","-27.4877","153.01895","","","","","","","","","apnic debogon project","","","false","Telecommunications","518210","J6311","","1140850828","","1157844277","false");    

Let's Connect

Learn How Your Company Can Benefit from the Power of Trusted Connections.

Contact Us   Give us a call 1-855-898-0036