
Hand-cranking a built-in schema
In this recipe, we are presented with a CSV file that does not have a heading row, and we need to create a schema for the data. This is a basic recipe with which most readers should be familiar; however, it does provide a framework for discussing some of the more important principles of Talend schemas.
The record we will be defining is as follows:
John Smith,27/11/1990,2012-01-10 10:24:54.953
As you can see, this contains the following fields: a name (first name and last name), date of birth, timestamp, and age. Note that age is an empty string.
Getting ready
Open a new Talend Job (jo_cook_ch02_0000_handCrankedSchema), so that the right-hand palette becomes available.
How to do it…
- Drag a tFileInputDelimited component from the palette, and open it by double-clicking it.
- Click the Edit Schema button (…), shown in the following screenshot, to open the schema editor:
- Click the + button to add a column:
- Type name into the column, and set the length to 50.
- Click the + button three more times to add three more columns.
- Type dateOfBirth into the second column, select a type of Date, and set the date pattern to dd/MM/yyyy. Alternatively, press Ctrl+Space to open a list of common patterns and select this one.
- Type timestamp into the third column, select a type of Date, and set the date pattern to yyyy-MM-dd HH:mm:ss.SSS.
- Type age into the fourth column, set the type to Integer, tick the Null box, and set the length to 3. Your schema should now look like the following screenshot:
- Click OK to return to the component view.
How it works…
The schema has now been defined for the component, and data can be read into the job by linking a flow from tFileInputDelimited to tLogRow, for example.
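To make the role of the schema more concrete, the following is a minimal plain-Java sketch of what applying these four columns to the sample record amounts to. It is not the code that Talend generates: the class HandCrankedRow and its parse method are hypothetical names used only for illustration, and treating an empty age field as null is an assumption based on the Null box being ticked.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical illustration class; Talend generates its own row structures.
class HandCrankedRow {
    // Columns mirror the schema defined in the recipe.
    String name;
    Date dateOfBirth;
    Date timestamp;
    Integer age; // nullable column, so the object type is used

    static HandCrankedRow parse(String line) {
        // The -1 limit keeps a trailing empty field if the line ends with a comma.
        String[] fields = line.split(",", -1);
        HandCrankedRow row = new HandCrankedRow();
        try {
            row.name = fields[0];
            row.dateOfBirth = new SimpleDateFormat("dd/MM/yyyy").parse(fields[1]);
            row.timestamp = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS").parse(fields[2]);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Value does not match the date pattern: " + line, e);
        }
        // Assumption: a missing or empty age is treated as null.
        String rawAge = fields.length > 3 ? fields[3].trim() : "";
        row.age = rawAge.isEmpty() ? null : Integer.valueOf(rawAge);
        return row;
    }

    public static void main(String[] args) {
        HandCrankedRow row = parse("John Smith,27/11/1990,2012-01-10 10:24:54.953");
        System.out.println(row.name + "|" + row.dateOfBirth + "|" + row.timestamp + "|" + row.age);
    }
}
```

Each column's type and pattern decides how the raw text is converted, which is why a wrong pattern or type in the schema surfaces as a parse error at run time.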
There’s more...
As you saw in the preceding section, Talend can handle many different types of data format. The following sections describe some of the common ones in a little more detail.
Date patterns within Talend conform to the Java date format, and full definitions of the possible values to be used can be found at:
http://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html
Date patterns are case sensitive in Java, and the upper and lower case forms of the same letter often have very different functions.
Note
In the timestamp, there are MM and mm characters. These are the month and minute definitions, and care should be taken to ensure that they are used correctly in the date and time portions of a date field.
Note also the ss and SSS fields. These are seconds and milliseconds. Again, care must be taken in their use within the time portion of a date.
HH and hh are also case sensitive. HH is the hour portion of a 24-hour timestamp, whereas hh is 12-hour time.
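The effect of mixing up these letters can be seen with a short Java sketch that uses java.text.SimpleDateFormat directly. The class DatePatternCase and its reformat helper are hypothetical names for illustration only. The sketch relies on the default lenient parsing of SimpleDateFormat, which does not raise an error when mm is used in place of MM.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical helper class used only to demonstrate pattern case sensitivity.
class DatePatternCase {
    // Parses value with inPattern, then formats the result with outPattern.
    static String reformat(String value, String inPattern, String outPattern) {
        try {
            Date parsed = new SimpleDateFormat(inPattern).parse(value);
            return new SimpleDateFormat(outPattern).format(parsed);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Cannot parse '" + value + "' with " + inPattern, e);
        }
    }

    public static void main(String[] args) {
        String out = "dd/MM/yyyy HH:mm";

        // Correct pattern: MM is the month.
        System.out.println(reformat("27/11/1990", "dd/MM/yyyy", out)); // 27/11/1990 00:00

        // Wrong pattern: mm reads 11 as minutes, so the month falls back to January.
        System.out.println(reformat("27/11/1990", "dd/mm/yyyy", out)); // 27/01/1990 00:11

        // HH is the 24-hour clock, hh is the 12-hour clock.
        System.out.println(reformat("14:05", "HH:mm", "hh:mm")); // 02:05
    }
}
```

Note that the wrong pattern does not fail; it silently produces a different date, which is why these mistakes are easy to miss.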
All Talend data types have the potential to be set to null, but in some cases, this may result in a type change, as described in the following section.
Try removing the tick from the Null box for age. You will notice that the type changes from Integer to int. This is because int is a primitive Java type that cannot be null, whereas for the Object type Integer, null is an acceptable value.
A good example of the use of int over Integer is when mandatory values are required for, say, a database table. If the field is set as int, a null value will cause an error to be thrown, highlighting either a data or job error.
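The difference between the two types can be illustrated at the Java level with the following sketch. The class NullableAge and its two methods are hypothetical names, and the exact error raised inside a real Talend job may differ; the sketch only shows that an Integer can carry null, whereas forcing a null into an int fails.

```java
// Hypothetical illustration of a nullable column versus a mandatory one.
class NullableAge {
    // Nullable column: the object type Integer can hold null.
    static Integer parseNullable(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return null;
        }
        return Integer.valueOf(raw.trim());
    }

    // Mandatory column: the primitive int cannot hold null.
    static int parseMandatory(String raw) {
        Integer boxed = parseNullable(raw);
        // Auto-unboxing a null Integer throws a NullPointerException here.
        return boxed;
    }

    public static void main(String[] args) {
        System.out.println(parseNullable(""));    // null
        System.out.println(parseMandatory("27")); // 27
        try {
            parseMandatory("");
        } catch (NullPointerException e) {
            System.out.println("Empty value rejected for a mandatory int column");
        }
    }
}
```

Failing fast in this way is the point of choosing int for mandatory fields: the missing value is reported where it is read rather than further downstream.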
Field lengths
Talend will generally ignore field lengths in a schema, but that does not mean that they are unimportant. In fact, it is best practice to ensure that field lengths are completed and accurate for all schemas, especially database schemas.
Keys
Most schemas will not require any keys; however, like field lengths, they become very important for database schemas.