As a member of data science team, we do not only analyse / explore data with terrific algorithm, but also create some reports which can help others departments to make decisions or map out strategies. For creating reports, we create a python package which contains multiple elements. In this blog, I’ll talk about how we create the python package with following points:
- Structure of python package
- Detail for each element
Structure of python package
$ tree
.
├── __init__.py
├── data
│ ├── __init__.py
│ └── job.sql
├── setup.py
├── job.py
│── tests
│ ├── __init__.py
│ └── job_test.py
├── .gitlab-ci.yml
└── environment.ymlDetail for each element
init.py
The __init__.py files are required to make Python treat the directories as
containing packages; this is done to prevent directories with a common name,
such as string, from unintentionally hiding valid modules that occur later on
the module search path. In the simplest case, __init__.py can just be an
empty file, but it can also execute initialization code for the package or set
the __all__ variable. If you want more information, click here.
SQL file
In the SQL file, we specify all queries that we need for the report, with following format:
WITH date_start AS
(
SELECT CAST('{}' AS DATE) as date_start
),
date_end AS
(
SELECT CAST('{}' AS DATE) as date_end
),
stores AS
(
SELECT store_id
FROM stores_table
),
sales AS
(
SELECT
product_id
, SUM(turnover_euros) AS turnover
FROM sales_table
WHERE sale_store IN (SELECT * FROM stores)
AND turnover_euros <> 0
AND sale_date BETWEEN (SELECT * FROM date_start) AND (SELECT * FROM date_end)
GROUP BY product_id
)As you read, we put date_start and date_end as input, search stores data
directly from database, and use all of them to filter sales data.
Python file setup.py
setup.py tells you that the module/package you are about to install has been
packaged and distributed with Distutils, which is the standard for distributing
Python Modules.
from setuptools import setup
setup(name='job_name',
version='0.1',
url='gitlab/job', # url of job's repo on gitlab
author='author_name',
author_email='author_email',
packages=['job'], # package name
package_data={
'':['*.sql'], } # search .sql file in all folders of this job
)Python file job.py
This file describes how to create report with SQL file above.
from datetime import date, timedelta
import pandas as pd
import teradata
from pkg_resources import resource_filename
# start date & end date
date_start = date.today() - timedelta(days=90)
date_end = date.today() - timedelta(days=1)
# connect to Teradata database
udaExec = teradata.UdaExec(appName="HelloWorld",
version="1.0",
logConsole=False)
session = udaExec.connect(method="odbc",
USEREGIONALSETTINGS="N",
system="tdprod",
username="xxx",
password="xxx");
# find & read SQL file
filename = resource_filename('job', 'data/job.sql')
with open(filename, 'r') as myfile:
ctes_ = myfile.read()
# prepare date_start and date_end
ctes = ctes_.format(date_start.strftime('%Y-%m-%d'),
date_end.strftime('%Y-%m-%d'))
# extract "sales" data
q = "{}\nSELECT * FROM {}".format(ctes, 'sales')
df = pd.read_sql(q, session)This .py script defines start date and end date of report with datetime.date,
connects to Teradata database with teradata package, read
SQL queries with pkg_resources.resource_filename(), import SQL query’s input
with .format(), and extract data with pandas.read_sql().
Other files
-
Test file
job_test.pyWe write tests (e.g. unit tests) in this file. I’ll write another blog about how to write unit test with python in the near furture. -
.gitlab-ci.yml&environment.ymlThey are used for automating job with gitlab runner. If you are interesting in it, you can find more information here.
Reference
- “6.4. Packages”, doc.python.org. [Online]. Available: https://docs.python.org/3/tutorial/modules.html#packages
- TimHill, “Tree image”, pixabay. [Online]. Available: https://pixabay.com/photos/tree-sun-sunshine-summer-meadow-2916763
- Silfverstrom, “What is setup.py?”, stackoverflow. [Online]. Available: https://stackoverflow.com/questions/1471994/what-is-setup-py