Creating a python package for a python job

This blog introduces how we create a python package for a python job with package structure and some details like SQL queries and data extraction with python.

As a member of data science team, we do not only analyse / explore data with terrific algorithm, but also create some reports which can help others departments to make decisions or map out strategies. For creating reports, we create a python package which contains multiple elements. In this blog, I’ll talk about how we create the python package with following points:

  • Structure of python package
  • Detail for each element

Structure of python package

$ tree
├── data
│   ├──
│   └── job.sql
│── tests
│   ├──
│   └──
├── .gitlab-ci.yml
└── environment.yml

Detail for each element

The files are required to make Python treat the directories as containing packages; this is done to prevent directories with a common name, such as string, from unintentionally hiding valid modules that occur later on the module search path. In the simplest case, can just be an empty file, but it can also execute initialization code for the package or set the __all__ variable. If you want more information, click here.

SQL file

In the SQL file, we specify all queries that we need for the report, with following format:

WITH    date_start AS
            SELECT CAST('{}' AS DATE) as date_start

        date_end AS
            SELECT CAST('{}' AS DATE) as date_end

        stores AS
            SELECT store_id
            FROM stores_table

        sales AS
                    , SUM(turnover_euros) AS turnover
            FROM sales_table
            WHERE sale_store IN (SELECT * FROM stores)
            AND turnover_euros <> 0
            AND sale_date BETWEEN (SELECT * FROM date_start) AND (SELECT * FROM date_end)
            GROUP BY product_id

As you read, we put date_start and date_end as input, search stores data directly from database, and use all of them to filter sales data.

Python file tells you that the module/package you are about to install has been packaged and distributed with Distutils, which is the standard for distributing Python Modules.

from setuptools import setup

      url='gitlab/job', # url of job's repo on gitlab
      packages=['job'], # package name
          '':['*.sql'], } # search .sql file in all folders of this job

Python file

This file describes how to create report with SQL file above.

from datetime import date, timedelta

import pandas as pd
import teradata
from pkg_resources import resource_filename

# start date & end date
date_start = - timedelta(days=90)
date_end = - timedelta(days=1)

# connect to Teradata database
udaExec = teradata.UdaExec(appName="HelloWorld",
session = udaExec.connect(method="odbc",

# find & read SQL file
filename = resource_filename('job', 'data/job.sql')

with open(filename, 'r') as myfile:
    ctes_ =

# prepare date_start and date_end
ctes = ctes_.format(date_start.strftime('%Y-%m-%d'),

# extract "sales" data
q = "{}\nSELECT * FROM {}".format(ctes, 'sales')
df = pd.read_sql(q, session)

This .py script defines start date and end date of report with, connects to Teradata database with teradata package, read SQL queries with pkg_resources.resource_filename(), import SQL query’s input with .format(), and extract data with pandas.read_sql().

Other files

  • Test file We write tests (e.g. unit tests) in this file. I’ll write another blog about how to write unit test with python in the near furture.

  • .gitlab-ci.yml & environment.yml They are used for automating job with gitlab runner. If you are interesting in it, you can find more information here.