Review of 2024

Introduction

2024 is over, and on the first day of 2025 I'd like to look back on the past year, as I do every year. In 2024, I broadened my skills in data engineering, data ops, and data management. I'll cover each of them below.

Working in retail (In The Memory)

This year, I was on maternity leave for the first three months and returned to work at the end of March. After that, I completed several projects, not only in data science but also in data engineering and data ops, which continued to fill out my data skill tree.

Data catalog

One big project I spent a lot of time on was setting up the data catalog on Databricks.

Memory provides various data analytics tools to help retailers and brands make shopper-based decisions by applying retail Augmented Intelligence to all the richness of the available data (transactional data, shopper data, product data…) and delivering recommendations at a truly operational granularity level.

However, several things have always been a challenge: managing all data efficiently across multiple tenants, keeping an overview of the whole data transformation process, building a stable data model, creating a homogeneous data layer to harmonize the code base across tenants, communicating all of this transparently to different teams (data, consulting, product, developer), ensuring security and compliance, and enabling seamless access. To address these issues, we turned to Databricks Unity Catalog, a unified data governance solution that enables better control, discovery, and collaboration on data. In this article, we'll talk about how Memory implements Unity Catalog and how it revolutionises our data operations to drive business growth and innovation.

We operate one workspace per tenant and environment. With Unity Catalog, we centralize data governance across all workspaces by creating one metastore per region. Each metastore contains rich metadata about the data assets, such as data types, descriptions, and tags. Moreover, it can pick up data automatically: once we create a table in the data catalog by specifying the data source path, the catalog updates automatically without manual code execution.

Inside each metastore, for every tenant and environment, we create a catalog using SQL code (managed by the data ops team), together with an external location and a storage credential (created by the devops team with Terraform) for importing datasets from an Azure Storage Account. For each catalog, we create different schemas for various purposes:

  • Schema bronze: Contains bronze layer datasets transformed by the data engineering team
  • Schema silver: Contains silver layer datasets transformed by the data engineering team
  • Schema gold: Contains gold tables used for reporting and analytics
  • Schema analytics: Contains views used for harmonizing code base across tenants
  • Schema aggregates: Contains views of aggregate tables used for harmonizing code base across tenants
  • Schema for solutions: Contains delta tables created for specific solutions
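As an illustration of the layout above, here is a minimal Python sketch that builds the SQL statements for one catalog and its schemas. The catalog naming scheme (`<tenant>_<environment>`), the helper names, and the example tenant and solution schema are assumptions made for this example, not the actual conventions used at Memory.

```python
# Sketch only: generates setup SQL for one tenant/environment catalog.
# The naming convention and all identifiers below are hypothetical.

# Fixed schemas described in the list above.
SCHEMAS = ["bronze", "silver", "gold", "analytics", "aggregates"]


def catalog_name(tenant: str, environment: str) -> str:
    """Assumed convention: one catalog named '<tenant>_<environment>'."""
    return f"{tenant}_{environment}"


def build_setup_statements(tenant, environment, solution_schemas=()):
    """Return idempotent CREATE statements for the catalog and its schemas."""
    catalog = catalog_name(tenant, environment)
    statements = [f"CREATE CATALOG IF NOT EXISTS {catalog}"]
    # Solution-specific schemas come after the fixed ones.
    for schema in [*SCHEMAS, *solution_schemas]:
        statements.append(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}")
    return statements


if __name__ == "__main__":
    for statement in build_setup_statements("acme", "dev", ["pricing"]):
        print(statement + ";")
```

Using `IF NOT EXISTS` keeps the statements safe to re-run, which matters when the same files are executed again after every change.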

Moreover, we have implemented CD to automatically run the updated SQL files for the data catalog after merging into production, and set up CI to check that the SQL queries for creating schemas, tables, and views work.
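A CI or CD step of this kind could look roughly like the Python sketch below. It is not our actual pipeline: the function names are hypothetical, the statement splitting is deliberately naive, and `execute` stands for whatever callable sends one statement to the target workspace (for example a database cursor's `execute` method).

```python
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


def split_statements(sql_text: str) -> List[str]:
    """Naive split on ';'. Assumes no semicolons inside string literals or comments."""
    return [part.strip() for part in sql_text.split(";") if part.strip()]


def run_sql_files(
    paths: Iterable[Path], execute: Callable[[str], None]
) -> List[Tuple[str, str]]:
    """Run every statement of every file in sorted path order.

    Failures are collected instead of raised, so one CI run reports
    every broken statement rather than only the first one.
    """
    failures = []
    for path in sorted(paths):
        for statement in split_statements(path.read_text(encoding="utf-8")):
            try:
                execute(statement)
            except Exception as error:
                failures.append((path.name, f"{statement[:60]}: {error}"))
    return failures
```

In CI, `execute` would point at a test catalog and the job would fail if the returned list is non-empty; in CD, the same function would run against production after the merge.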

As Memory scales, Databricks Unity Catalog plays a crucial role in streamlining our daily data operations. Its robust metadata management enables us to enhance data discovery, management, and security globally. By automating permissions and integrating these processes into our CI/CD pipelines, we’ve cut operational costs, bolstered data security, and sped up the delivery of our data-driven products and insights. Looking ahead, we plan to leverage more of Unity Catalog’s features. These include Data Lineage to track data transformation processes, Data Insights to analyze usage patterns and statistics, and Data Quality to establish check rules, monitor quality, and alert us to any issues.

Data quality

This is another interesting project that I completed. We created a tool to check the data that we receive from our clients. We check data quality from a data engineering point of view and from a business point of view as well, for example by checking daily transaction amounts, client counts, products, and sales. We also check whether the promotional generosity respects its constraints and whether any data is missing. We run this tool in Databricks jobs and also in our data integration pipeline on Airflow. In addition, we built some dashboards in Power BI to share the data quality check results with all teams. Thanks to this, we can monitor the data, detect problems, and raise alerts in time.
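To give a feel for such checks, here is a small self-contained Python sketch. It is not the actual tool: the field names (`date`, `client_id`, `amount`, `discount`), the thresholds, and the reading of the promotional generosity constraint as a maximum discount rate are all assumptions made for illustration.

```python
from collections import defaultdict


def check_missing_values(rows, required_fields):
    """Return (row_index, field) pairs where a required field is absent or None."""
    return [
        (index, field)
        for index, row in enumerate(rows)
        for field in required_fields
        if row.get(field) is None
    ]


def check_daily_amounts(rows, min_total, max_total):
    """Return {day: total} for days whose summed amount is outside [min_total, max_total]."""
    totals = defaultdict(float)
    for row in rows:
        if row.get("date") is not None and row.get("amount") is not None:
            totals[row["date"]] += row["amount"]
    return {
        day: total
        for day, total in totals.items()
        if not min_total <= total <= max_total
    }


def check_discount_cap(rows, max_discount_rate):
    """Return indexes of rows whose discount exceeds the allowed share of the amount.

    Assumption: the promotion constraint is a cap on discount / amount.
    """
    return [
        index
        for index, row in enumerate(rows)
        if row.get("amount")
        and (row.get("discount") or 0) > max_discount_rate * row["amount"]
    ]
```

Each function returns the offending rows or days rather than a plain pass/fail flag, so the results can be written to a table and surfaced in a dashboard.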

Unit tests for internal tools & configuration setup for different tenants

I added unit tests for our internal tools and adapted and set up the configurations for different tenants. The unit tests help us ensure that the functions behave as expected and catch problems when the code changes. The per-tenant configuration setup lays a solid foundation for code harmonization.
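As a hedged illustration of how the two fit together, the sketch below tests a tiny per-tenant configuration loader with Python's built-in `unittest`. The loader, the configuration keys, and the tenant names are invented for this example and do not reflect our internal tools.

```python
import unittest

# Hypothetical shared defaults that every tenant inherits.
DEFAULT_CONFIG = {"currency": "EUR", "week_start": "monday", "min_daily_sales": 0}


def load_tenant_config(tenant, overrides_by_tenant):
    """Merge the shared defaults with the overrides of one tenant."""
    if tenant not in overrides_by_tenant:
        raise KeyError(f"unknown tenant: {tenant}")
    return {**DEFAULT_CONFIG, **overrides_by_tenant[tenant]}


class LoadTenantConfigTest(unittest.TestCase):
    OVERRIDES = {"tenant_a": {"currency": "USD"}, "tenant_b": {}}

    def test_override_replaces_default(self):
        config = load_tenant_config("tenant_a", self.OVERRIDES)
        self.assertEqual(config["currency"], "USD")

    def test_missing_keys_fall_back_to_defaults(self):
        config = load_tenant_config("tenant_b", self.OVERRIDES)
        self.assertEqual(config, DEFAULT_CONFIG)

    def test_unknown_tenant_is_rejected(self):
        with self.assertRaises(KeyError):
            load_tenant_config("tenant_c", self.OVERRIDES)
```

Saved as a module, the tests can be run with `python -m unittest`. Keeping tenant differences in configuration rather than in code is what lets the same functions run unchanged for every tenant.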

Role as a mother

In my first year as a mother, the main challenge has been balancing work and life (with a child). It is not easy, but here are some tips:

  • Set Realistic Expectations: Identify the most important responsibilities at work and home, and focus on those.
  • Establish a Routine: Block time for work, family, and self-care. Consistency helps you manage your energy and commitments.
  • Leverage Support Systems
    • Partner and family support: Share household and parenting responsibilities with your partner or family members.
    • Professional help: Consider hiring help for childcare, cleaning, or meal preparation, if possible.
    • Parenting groups: Join local or online communities for advice, emotional support, and encouragement.
  • Communicate with Your Employer: Discuss options such as remote work, if possible. Here I'd like to say a big thank-you to Memory: the remote working policy saves me a lot of commuting time and gives me more flexibility in my working hours.

Hope to see you in 2025!
