Talend Open Studio Cookbook

Hand-cranking a built-in schema

In this recipe, we are presented with a CSV file that does not have a heading row, and we need to create a schema for the data. This is a basic recipe with which most readers should be familiar; however, it does provide a framework for discussing some of the more important principles of Talend schemas.

The record we will be defining is as follows:

John Smith,27/11/1990,2012-01-10 10:24:54.953

As you can see, this contains the following fields: name (first and last name together), date of birth, timestamp, and age. Note that age is an empty string.

Getting ready

Open a new Talend Job (jo_cook_ch02_0000_handCrankedSchema), so that the right-hand palette becomes available.

How to do it…

  1. Drag a tFileInputDelimited component from the palette, and open it by double clicking it.
  2. Click the Edit Schema button to open the schema editor.
  3. Click the + button to add a column.
  4. Type name into the column, and set the length to 50.
  5. Click the + button three more times to add three more columns.
  6. Type dateOfBirth into the second column, select a type of Date, and set the date pattern to dd/MM/yyyy. Alternatively, press Ctrl+Space to open a list of common patterns and select this one.
  7. Type timestamp into the third column, select a type of Date, and set the date pattern to yyyy-MM-dd HH:mm:ss.SSS.
  8. Type age into the fourth column, set the type to Integer, tick the Null box, and set the length to 3. Your schema should now contain four columns: name, dateOfBirth, timestamp, and age.
  9. Click OK to return to the component view.

How it works…

The schema has now been defined for the component, and data may then be read into the job by linking a flow from tFileInputDelimited to tLogRow, for example.
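To make the effect of the schema concrete, the following is a minimal plain-Java sketch of what reading the sample record with these four columns amounts to. It is not the code that Talend generates; the class HandCrankedSchemaSketch, its Row type, and the parse helper are hypothetical names used only for illustration.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

class HandCrankedSchemaSketch {

    /** One row shaped like the four-column schema built in the steps above. */
    static class Row {
        String name;       // String, length 50
        Date dateOfBirth;  // Date, pattern dd/MM/yyyy
        Date timestamp;    // Date, pattern yyyy-MM-dd HH:mm:ss.SSS
        Integer age;       // Integer with the Null box ticked
    }

    /** Hypothetical helper: splits one comma-delimited line and converts each field. */
    static Row parse(String line) {
        // A limit of -1 keeps trailing empty fields, such as an empty age.
        String[] fields = line.split(",", -1);
        Row row = new Row();
        try {
            row.name = fields[0];
            row.dateOfBirth = new SimpleDateFormat("dd/MM/yyyy").parse(fields[1]);
            row.timestamp = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS").parse(fields[2]);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Field does not match its date pattern: " + line, e);
        }
        // A missing or empty age becomes null, which only the object type Integer can hold.
        boolean hasAge = fields.length > 3 && !fields[3].isEmpty();
        row.age = hasAge ? Integer.valueOf(fields[3]) : null;
        return row;
    }

    public static void main(String[] args) {
        Row row = parse("John Smith,27/11/1990,2012-01-10 10:24:54.953");
        String dob = new SimpleDateFormat("yyyy-MM-dd").format(row.dateOfBirth);
        System.out.println(row.name + " | " + dob + " | age=" + row.age);
        // prints: John Smith | 1990-11-27 | age=null
    }
}
```

Each column's type and pattern decides how the raw text of a field is converted, which is why the patterns and the Null box matter even for a simple file.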

There’s more...

As you saw in the preceding section, Talend can handle many different types of data format. The following sections describe some of the common ones in a little more detail.

Date patterns

Date patterns within Talend conform to the Java date format (the pattern letters of java.text.SimpleDateFormat), and full definitions of the possible values to be used can be found at:

http://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html

Date patterns are case sensitive in Java, and upper and lower case letters often have a very different function.

Note

In the timestamp, there are MM and mm characters. These are the month and minute definitions and care should be taken to ensure that they are used correctly in the date and time portions of a date field.

Note also the ss and SSS fields. These are seconds and milliseconds. Again, care must be taken in their use within the time portion of a date.

The hour letters are case sensitive too. HH is the hour portion of a 24-hour timestamp (00 to 23), whereas hh is the hour in 12-hour time (01 to 12).
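The effect of letter case can be seen with a short plain-Java sketch that uses java.text.SimpleDateFormat directly. The class DatePatternCaseSketch and its reformat helper are hypothetical, and the evening timestamp is an assumed value chosen so that HH and hh give different results.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

class DatePatternCaseSketch {

    /** Hypothetical helper: parses a value with one pattern and prints it with another. */
    static String reformat(String value, String inPattern, String outPattern) {
        try {
            Date parsed = new SimpleDateFormat(inPattern).parse(value);
            return new SimpleDateFormat(outPattern).format(parsed);
        } catch (ParseException e) {
            throw new IllegalArgumentException(value + " does not match " + inPattern, e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample value: an evening time, so that HH and hh differ.
        String value = "2012-01-10 22:24:54.953";
        String pattern = "yyyy-MM-dd HH:mm:ss.SSS";

        System.out.println(reformat(value, pattern, "yyyy-MM-dd")); // 2012-01-10
        System.out.println(reformat(value, pattern, "yyyy-mm-dd")); // 2012-24-10 (mm is minutes)
        System.out.println(reformat(value, pattern, "HH:mm"));      // 22:24 (24-hour clock)
        System.out.println(reformat(value, pattern, "hh:mm"));      // 10:24 (12-hour clock)
        System.out.println(reformat(value, pattern, "ss.SSS"));     // 54.953
    }
}
```

Note that the wrong pattern does not raise an error here; it simply produces a wrong value, which is why these mistakes are easy to miss.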

Nullable elements

All Talend data types have the potential to be set to null, but in some cases, this may result in a type change, as described in the following section.

Try removing the tick from the Null box for age. You will notice that the type changes from Integer to int. This is because int is a primitive Java type that cannot be null, whereas for the object type Integer, null is an acceptable value.

A good example of the use of int over Integer is when mandatory values are required for, say, a database table. If the field is set as int, a null value will cause an error to be thrown, highlighting either a data or job error.

Tip

The distinction between primitives and objects becomes more important as you use Talend and Java more frequently, because primitive types do not always act in the same way or have the same range of features as object types.
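The difference between int and Integer can be demonstrated in plain Java. The following is a minimal sketch rather than Talend's generated code; the class NullableAgeSketch and its two helpers are hypothetical names, and the exact error a real job reports may differ.

```java
class NullableAgeSketch {

    /** Hypothetical helper: an empty field becomes null, as with a nullable Integer column. */
    static Integer toNullableAge(String field) {
        return (field == null || field.isEmpty()) ? null : Integer.valueOf(field);
    }

    /** Hypothetical helper: a mandatory age held as a primitive int; empty input is rejected. */
    static int toMandatoryAge(String field) {
        Integer age = toNullableAge(field);
        if (age == null) {
            throw new IllegalArgumentException("age is mandatory but was empty");
        }
        return age; // auto-unboxing is safe here because age is not null
    }

    public static void main(String[] args) {
        Integer nullableAge = toNullableAge("");
        System.out.println(nullableAge); // prints: null

        try {
            int primitiveAge = nullableAge; // auto-unboxing null throws NullPointerException
            System.out.println(primitiveAge);
        } catch (NullPointerException e) {
            System.out.println("null cannot be stored in an int");
        }

        System.out.println(toMandatoryAge("27") + 1); // prints: 28
    }
}
```

An Integer can carry the "no value" case through the job, whereas an int forces the problem to surface at the point where the empty value is read.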

Field lengths

Talend will generally ignore field lengths in a schema, but that does not mean that they are unimportant. In fact, it is best practice to ensure that field lengths are completed and accurate for all schemas, especially database schemas.

Tip

When creating a temporary table in a database using Talend, all field lengths must be present for the DBMS to create the table. Failure to do so will result in job errors.

Keys

Most schemas will not require any keys; however, like field lengths, they become very important for database schemas.

Tip

Key fields are used during database update statements to match records to be updated. If the insert or update method is used to populate a table, then failure to specify the correct key(s) will result in a record being inserted rather than updated.
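The role of the key in an insert or update can be sketched with an in-memory list standing in for the database table. This is an illustration of the matching logic only, not how Talend or any DBMS implements it; the class InsertOrUpdateSketch, its methods, and the id, name, and age column layout are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

class InsertOrUpdateSketch {

    /**
     * Hypothetical in-memory stand-in for an insert or update against a table.
     * Returns "update" if a stored row matched on every key column, otherwise "insert".
     */
    static String insertOrUpdate(List<String[]> table, String[] incoming, int... keyColumns) {
        for (int i = 0; i < table.size(); i++) {
            if (matchesOnKeys(table.get(i), incoming, keyColumns)) {
                table.set(i, incoming); // key matched: replace the stored row
                return "update";
            }
        }
        table.add(incoming); // no key match: the row is added as new
        return "insert";
    }

    static boolean matchesOnKeys(String[] stored, String[] incoming, int... keyColumns) {
        for (int column : keyColumns) {
            if (!stored[column].equals(incoming[column])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Assumed column layout for illustration: id, name, age.
        List<String[]> table = new ArrayList<>();
        table.add(new String[] {"1", "John Smith", "27"});

        // Correct key (id only): the existing row is updated.
        String first = insertOrUpdate(table, new String[] {"1", "John Smith", "28"}, 0);
        System.out.println(first + ", rows=" + table.size()); // update, rows=1

        // Wrong key (id plus age): the changed age prevents a match, so a duplicate is inserted.
        String second = insertOrUpdate(table, new String[] {"1", "John Smith", "29"}, 0, 2);
        System.out.println(second + ", rows=" + table.size()); // insert, rows=2
    }
}
```

The second call shows the failure described in the tip: with the wrong key columns, the same person ends up in the table twice.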