General notes on the process of importing data
The process of importing data - from a spreadsheet, for example - involves a number of steps. It is important to note that Logos Flow uses a relational database as the basis for its products and projects.
Below is a general outline of the various steps involved. Some import projects will involve many more steps and complex manipulation of data:
- Consultation and review of the data to be imported. This is very important because you do not want to import data incorrectly and have to start the process all over again.
- Preparation and supply of a quote. Some clients question the fee charged to import data, so it is always better to prepare the costing before actually doing the import.
- We then arrange how to send the spreadsheets and other data sources, such as delimited text files (CSV, tab, etc.). These can be emailed or, for large files, sent via Dropbox or even by posting a CD/DVD (if the data runs into gigabytes).
- If the task is to import into an existing database, then we will confirm the number of records to make sure the correct data file has been received.
- We always make an archival backup of the data file prior to importing in case there is a rollback or a problem with the imported data. It is also important to keep an original copy in case there is a problem and the data needs to be reviewed. The backup file, or files, are normally compressed, named by date and stored in the client's archive.
- Preparation of the spreadsheet prior to importing. The more time taken in preparation, the better the quality of the import. This can involve standardising data, checking for possible duplication and manipulating data to suit the structure of the database.
- Importing of spreadsheet data into the Logos Flow data file as a temp table, or tables. This also involves checking the integrity of the data and confirming that all spreadsheet rows and columns have actually been imported.
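As a rough illustration of this staging step, here is a minimal sketch using Python's csv and sqlite3 modules as a stand-in for the actual database engine; the spreadsheet content and the `import_stage` table name are hypothetical:

```python
import csv
import io
import sqlite3

# Hypothetical spreadsheet exported as CSV text.
csv_text = "artist,title\nSmith,Sunrise\nJones,Harbour\n"

conn = sqlite3.connect(":memory:")
rows = list(csv.reader(io.StringIO(csv_text)))
header, data = rows[0], rows[1:]

# Stage the raw spreadsheet in a temp table before touching primary tables.
cols = ", ".join(f'"{c}" TEXT' for c in header)
conn.execute(f"CREATE TEMP TABLE import_stage ({cols})")
conn.executemany(
    f"INSERT INTO import_stage VALUES ({', '.join('?' * len(header))})", data
)

# Integrity check: every spreadsheet row and column made it into the table.
(count,) = conn.execute("SELECT COUNT(*) FROM import_stage").fetchone()
assert count == len(data)
assert len(conn.execute("SELECT * FROM import_stage").description) == len(header)
print(count)
```

The row-count and column-count assertions are the point of the exercise: a silent mismatch here is far cheaper to catch before any data reaches the primary tables.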
- Manipulation of data to import into the primary tables and adding additional information to assist in the import, e.g. adding '/2017_import/' into a filter field so the imported data can be identified in the future.
- Often a new sequence of unique Item IDs is required or, to make things more complicated, the sequence must continue from a previous ID. For example: if '2016/025' is the last ID, the sequence must then start from '2016/026'.
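Continuing such a sequence can be sketched as a small helper; this is an assumption about the ID pattern ('YYYY/NNN' with zero-padding), not Logos Flow's actual numbering code:

```python
def next_item_id(last_id: str) -> str:
    """Continue a 'YYYY/NNN' style ID sequence from the last issued ID.

    The width of the numeric part is preserved, so '2016/025' -> '2016/026'.
    """
    prefix, _, number = last_id.rpartition("/")
    width = len(number)
    return f"{prefix}/{int(number) + 1:0{width}d}"

print(next_item_id("2016/025"))  # -> 2016/026
```

Preserving the zero-padded width matters: IDs are usually text fields, and '2016/26' would sort and match differently from '2016/026'.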
- The process of creating queries then starts. A query manipulates data in, or adds data to, a database. Queries are programming code, normally written in SQL (Structured Query Language). This is a powerful language that can involve many thousands of records - so spectacular errors can also occur. These queries can be relatively simple or very complicated, depending on the type of data to be imported.
- There are a number of types of queries that are used when importing - the main ones are:
- Append query to add new records into a database table.
- Update query to update records in a database table.
- Other query types that might be used are delete queries and make-table queries.
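The two main query types can be sketched as follows, again using sqlite3 as a stand-in; the table names, the quoted "filter" column and the '/2017_import/' tag are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE artists (id INTEGER PRIMARY KEY, name TEXT, "filter" TEXT);
    CREATE TEMP TABLE import_stage (name TEXT);
    INSERT INTO import_stage VALUES ('Smith'), ('Jones');
""")

# Append query: add every staged row as a new record in the primary table.
conn.execute("INSERT INTO artists (name) SELECT name FROM import_stage")

# Update query: tag the imported records so they can be identified later.
conn.execute("""UPDATE artists SET "filter" = '/2017_import/'
                WHERE "filter" IS NULL""")

for row in conn.execute('SELECT name, "filter" FROM artists ORDER BY id'):
    print(row)
```

An append query is an `INSERT ... SELECT`, an update query an `UPDATE ... WHERE`; both can touch thousands of records in one statement, which is exactly why the backups described above matter.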
- Frequently a temp ID from the import tables needs to be added. This is often used to preserve the integrity of the data, especially when tables are related one-to-many or many-to-many, e.g. one artist with many artworks.
- Spreadsheets come in many formats and these need to be resolved prior to importing. Any rule violations (date, number and other formats) need to be identified and the data format adjusted. This can be very time-consuming, especially when trying to locate the data that is causing the violation.
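Locating the offending values can be partly automated; this sketch assumes a hypothetical date column that should be in ISO 'YYYY-MM-DD' form and reports the row numbers that would violate a date field's format:

```python
import re

# Hypothetical spreadsheet column with mixed date formats.
dates = ["2017-03-01", "1/3/2017", "2017-03-05", "March 5"]

iso = re.compile(r"^\d{4}-\d{2}-\d{2}$")

# Collect (row number, value) for every value that breaks the expected format.
bad = [(i + 1, v) for i, v in enumerate(dates) if not iso.match(v)]
print(bad)
```

Reporting row numbers rather than just a pass/fail flag is what saves the time: the violating cells can be fixed in the spreadsheet directly instead of hunting for them after a failed import.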
- When importing into a relational database the flat-file structure needs to be split into various tables. For example: an artist links to multiple artworks, but there is an intermediary table that manages the links. Hence, a simple artist name and artwork title on a spreadsheet needs to be split into three tables, and sequences of unique IDs are used to keep the data integrity. This becomes much more complex when associating more and more tables that all inter-relate. A product such as the HumanitiesCMD uses over 70 tables to effectively manage the data.
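The three-table split can be sketched like this; the `artists`, `artworks` and `artist_artwork` names are hypothetical, and sqlite3 again stands in for the real engine:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE artists  (artist_id  INTEGER PRIMARY KEY, name  TEXT UNIQUE);
    CREATE TABLE artworks (artwork_id INTEGER PRIMARY KEY, title TEXT);
    -- Intermediary table managing the artist-artwork links.
    CREATE TABLE artist_artwork (artist_id INTEGER, artwork_id INTEGER);
""")

# Flat spreadsheet rows: one artist name and one artwork title per row.
flat_rows = [("Smith", "Sunrise"), ("Smith", "Harbour"), ("Jones", "Still Life")]

for name, title in flat_rows:
    # Insert each artist only once; the UNIQUE constraint absorbs repeats.
    conn.execute("INSERT OR IGNORE INTO artists (name) VALUES (?)", (name,))
    (artist_id,) = conn.execute(
        "SELECT artist_id FROM artists WHERE name = ?", (name,)
    ).fetchone()
    cur = conn.execute("INSERT INTO artworks (title) VALUES (?)", (title,))
    # The unique IDs keep the link table's integrity.
    conn.execute(
        "INSERT INTO artist_artwork VALUES (?, ?)", (artist_id, cur.lastrowid)
    )

# Two artists, three artworks, three links: the flat rows are now relational.
for table in ("artists", "artworks", "artist_artwork"):
    (n,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    print(table, n)
```

The intermediary table is what allows the relationship to grow into many-to-many later (e.g. a collaborative artwork with two artists) without restructuring the other tables.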
- The actual import process then starts, and a number of queries are usually created to complete the task.
- During the import process a series of random tests are done to make sure the data has been imported correctly. Each step needs to be checked, and often a number of backup files are created so it is possible to roll back if any problems occur. This means that if an error occurs you go back to a previous backup and re-start from that step.
- Throughout the process a record of all import steps is kept. This is very time-consuming but essential: notes are kept about the number of records imported, where they are imported, any problems, etc.
- Once the import is complete an archived backup file is created. This will include the working queries, tables, etc.
- The import is thoroughly tested:
- Random checks from the initial spreadsheet with the database are done. This is to make sure all the initial rows on a spreadsheet match the various split tables. One test will often involve checking data in a large number of tables.
- All the temp fields created to preserve the data integrity need to be cleared or removed.
- Complete the import notes record and supply it to the client. This can be anything from a simple email to a large document.
- Liaison with client about new data file with imported data.
- Make the data file available via email, FTP or website download - depending on file size.
- Liaison with the client to check the imported data by comparing the number of records.
- Liaison with the client to confirm the imported data is correct.
- If it is not correct, then identify the problems, return to the initial pre-import data file and re-do most of the above steps.
Example of a spreadsheet - or flat-file - with simple rows and columns
Example of a relational database with interlinking tables - the red highlights show the interlinking using unique IDs
Example of a database query - using a grid format - to interact with data
Example of a database query - using a SQL format - to interact with data