The year 2023 has almost passed. Before 2024 arrives, I'd like to review this year, as I do every year. This year I enriched my skills in data engineering, data ops and management. I'll talk about each of them in the following.
This year I started different projects, not only in data science but also in data engineering and data ops, which is a new domain for me and completes my skill tree on data.
Same as last year, I'm in charge of data management for some clients: maintaining the data integration process, troubleshooting, designing the workflow for new data integrations, building the retro planning with the consulting team, and coordinating among clients, the consulting team and the data team. The project became richer and more challenging this year: I created a data audit flow to ensure data quality, which contains two parts, data engineering checks and business rules checks. The data engineering checks verify the data’s format, missing values, column names, etc. The business rules checks verify whether the data makes sense in terms of business: for example, is the start date earlier than the end date for each product, can we find all sold products in the table “product”, are the brand labels all valid, etc. With the audit results, a report is exported, and I discussed it with our consultants, data engineers and data analysts, and with our clients as well.
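As an illustration, the business-rules part of such an audit could look like the following minimal pandas sketch (the table and column names here are hypothetical, not the real ones):

import pandas as pd

def business_rules_report(sales: pd.DataFrame, products: pd.DataFrame) -> dict:
    """Collect business-rule check results in a report (illustrative schema)."""
    report = {}
    # the start date must be earlier than the end date for each product
    report["bad_date_ranges"] = int((sales["start_date"] >= sales["end_date"]).sum())
    # every sold product must exist in the "product" table
    report["unknown_products"] = len(set(sales["product_id"]) - set(products["product_id"]))
    # brand labels must belong to the known list of brands
    report["bad_brand_labels"] = int((~sales["brand"].isin(set(products["brand"]))).sum())
    return report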
Moreover, I collaborated with the DevOps team to carry out the migration process on Eldorado: we migrated the code base to GitHub and applied configuration files to standardize the code across clients.
These projects enriched my experience in data engineering, project management and communication, and gave me my first experience in data ops.
This is a pretty interesting project that I accomplished with my intern and my lead. We created a consumer segmentation based on consumers’ purchases for our client several years ago; because of COVID-19, purchase behaviours have changed a lot, so it was time to update the definition of each consumer group.
We took last year’s transaction data, cleaned it, and applied PCA and k-means clustering to all loyal customers and all consumer products. We first obtained the product groups, then based on them we studied loyal clients’ purchase behaviours and defined various loyal client groups. The study is industrialized: it will be triggered every quarter and send results to the client automatically.
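Here is a minimal sketch of the clustering step (the feature matrix is a random stand-in, and the component and cluster counts are illustrative):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# stand-in for the real features: one row per loyal customer,
# one column per product-related purchase feature
rng = np.random.default_rng(42)
X = rng.random((1000, 40))

X_scaled = StandardScaler().fit_transform(X)          # normalise the features
X_pca = PCA(n_components=10).fit_transform(X_scaled)  # reduce dimensionality
labels = KMeans(n_clusters=6, random_state=42).fit_predict(X_pca)  # customer groups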
This project aims to overhaul the last level of the customer hierarchy at Intermarché (level 5), which contains the UBCs (Unités de Besoin Consommateur in French, consumer need units in English). The project was started by a former colleague, and I took it over after he left. Based on his work, I finished reassigning the filtered products to new UBCs. After verification by the consultants, we’ll apply the new level 5 on our platform in 2 steps: integrate it into the database, then apply the new table on the platform. For each step, we’ll first test on staging and move to the production environment once everything looks good on staging.
This project is on standby since I’m on maternity leave.
One of the things I most appreciated this year is that I recruited an intern and finished a great project with her. Thanks to this experience, I developed the following skills:
Overwatch was built to enable Databricks’ customers, employees, and partners to quickly and easily understand operations within Databricks deployments. As enterprise adoption increases, there is an ever-growing need for strong governance. Overwatch aims to enable users to quickly answer questions and then drill down to make effective operational changes. Common examples of operational activities Overwatch assists with are:
Thanks to Overwatch, we got a first understanding of our costs and found that most of the clusters we used in Data Factory pipelines were all-purpose clusters, which cost more than job clusters. Based on this, I switched the all-purpose clusters to job clusters via the Azure CLI.
This year I also took on a brand-new role: I became a mother! It’s really a special experience and a new challenge for us! Hope we can be good parents ;)
Hope to see you in 2024!
The composition of the data team can vary depending on the size and needs of an organization. However, some common roles are typically included in a data team. These roles include data engineers, data analysts, data scientists, data architects and data administrators, etc. They work to collect, store, analyze, and interpret data to drive business decisions and improve organizational performance. In this blog, I’ll focus on data engineers, and talk about their role and how they collaborate with others.
Data engineers are responsible for designing, building, testing, and maintaining the infrastructure that supports an organization’s data needs. Their role is critical in ensuring that data is properly collected, stored, and made available to data scientists and other users in a timely and efficient manner. Some of the key responsibilities of data engineers include:
When data engineers design the data-storage structure, they need to communicate with data scientists and data analysts to understand their needs, pain points and use cases, and to ensure that the required data is accessible and available in the appropriate formats. Raw data is often not immediately ready for analysis; once they understand the needs, data engineers clean and transform the data into a format suitable for analysis: they define which tables to create, with which columns and data types, decide what transformation and aggregation are needed, create the data pipeline and define its running frequency. Moreover, they also need to create an automated data quality checking process, which verifies the received data’s format, volume, value consistency, business-related logic, etc.
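For illustration, a minimal format check might look like this (the expected schema is a made-up example):

import pandas as pd

# hypothetical expected schema: column name -> pandas dtype
EXPECTED_SCHEMA = {"product_id": "int64", "store_id": "int64",
                   "quantity": "int64", "turnover": "float64"}

def check_format(df: pd.DataFrame) -> list:
    """Return the list of format problems found in an incoming dataframe."""
    problems = []
    for col in set(EXPECTED_SCHEMA) - set(df.columns):
        problems.append(f"missing column: {col}")
    for col, dtype in EXPECTED_SCHEMA.items():
        if col in df.columns and str(df[col].dtype) != dtype:
            problems.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    null_counts = df.isna().sum()
    for col, n in null_counts[null_counts > 0].items():
        problems.append(f"{col}: {n} missing values")
    return problems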
They also work together to establish data quality standards, data validation processes, and data monitoring mechanisms. Data engineers help data scientists and data analysts understand the data lineage, metadata, and data quality issues, ensuring that they can rely on the accuracy and integrity of the data for their analyses. Finally, data engineers gather feedback on data availability, performance, and infrastructure requirements from the data science and analysis processes, and continuously improve the data infrastructure and systems based on this feedback, ensuring that they meet the evolving needs of the data science and analysis work.
Before designing new products or new product features, we need support from the data side, which means ensuring that the data infrastructure and systems align with the goals and requirements of the product or project.
Data engineers work closely with product owners to understand the data requirements for a specific product or project and gather information about the data sources, data formats, data volume, data quality standards and validation rules to meet the product objectives.
Furthermore, product owners often have specific data needs for their products. Data engineers work with them to define the data pipelines that capture, process, and transform the required data. They collaborate on the data flow, transformations, and integration points to ensure that the data is collected and processed correctly. Throughout the development and deployment lifecycle, data engineers collaborate with product owners to gather feedback, identify areas for improvement, and make necessary adjustments to the data infrastructure. They work together to iterate on the data pipelines and systems, aligning them with evolving product requirements.
As a data engineer in a SaaS (Software as a Service) company, I work with data that comes from the client side. Thus, communicating with clients on data integration is one of my missions. We engage in discussions and meetings to gather information about the types of data the clients need, the sources of the data, and the specific use cases or analytics they want to perform. This helps data engineers gain a clear understanding of the client’s data needs. They also work together to determine which data systems, databases, APIs, or external sources are necessary to collect the required data.
Based on the client’s data requirements, data engineers design data integration processes and pipelines. They collaborate with clients to define the data flow, transformations, and any data cleaning or enrichment steps that may be required. Data engineers work with clients to ensure that the data pipelines align with their specific needs and can provide the desired outputs.
Besides, data engineers need to maintain regular communication with clients throughout the project lifecycle. They provide progress updates, discuss any challenges or issues encountered, and seek client feedback to ensure that the data engineering processes align with the client’s expectations and requirements. This helps maintain transparency and fosters effective collaboration.
Data engineers play a very important role on both the data side and the business side by providing the infrastructure necessary to support data-driven decision-making. Without effective data engineering, organizations would struggle to make sense of the vast amounts of data they collect and would miss out on the valuable insights that data can provide.
Collaboration among data engineers, data scientists, and data analysts is crucial for successful data-driven initiatives: by working together, they ensure that data is accessible, reliable, and ready for analysis, enabling meaningful insights and valuable outcomes. Collaboration with product owners ensures that the data infrastructure supports the product’s needs, enabling data-driven decision-making and delivering value to end-users. And by actively engaging with clients, data engineers can design and implement data solutions that meet their specific needs, leading to successful data-driven outcomes.
During the year 2022, our life almost went back to normal: we started to take off the masks and have more contact with colleagues and friends. As in previous years, it’s time to review the whole year. This year, I levelled up not only my technical skills but also my project management and client communication skills. In this blog, I’ll review my year 2022 with the following points:
In The Memory is a retail-tech company that helps retail players make the best use of different internal and external data sources to meet their strategic and operational business challenges. Our products allow distributors and brands to accelerate their decision-making to attract more customers and make the best assortment, merchandising, pricing, and promotional choices in their various physical and online sales channels. We build tailored Augmented Intelligence solutions to meet clients’ priority challenges and serve their strategies by supporting their teams in change management, defining together the best KPIs to meet clients’ challenges, and adapting our solutions to the client’s needs, constraints and processes. Moreover, this year we were labelled happyatwork 2022 (this label rewards companies in which employees are the most committed and motivated), won the LSA RetailTech trophy in the “Data, knowledge and customer personalization” category, and were selected by Business France to represent France at the next Retail’s Big Show in New York in January 2023; we now have nearly 70 colleagues vs. 50 in 2021.
This year, I was promoted to Senior Data Scientist and levelled up my skills, not only technically but also in project management and communication with clients.
This year, I started to be in charge of data management for some clients. Thanks to this project, I improved my skills in multiple aspects: maintaining the data integration process, troubleshooting, designing the workflow for new data integrations, building the retro planning with the consulting team, and coordinating among clients, the consulting team and the data team. All of this allows our clients to integrate their data well into our system, which is the base of our data products; it also allows In The Memory to explore and create more products on the platform.
After more than a year of development, we finally released these 2 promotional modules for our client. Thanks to this project, I enriched my knowledge of the promotion business, learnt how to manage a project from the data point of view, and discussed the project with different teams (consulting and software). This product helps users save lots of time in defining promotion leaflets, and it can also attract other retailers to use these amazing modules.
We created some tools to ensure data reliability. For example, we created the data team’s first API, which checks data format and quality after the user imports input files, ensuring the data is ready to be used to run our calculations and models. I also set up a staging environment for the data team, isolated from the production environment, which allows us to modify or develop new features without worrying about breaking the data in production: it’s a product guarantee and it’s a playground!
It has been 6 years since the graduation ceremony, this year I came back to Toulouse School of Economics again to help In The Memory recruit talents. It was an enriching experience for us to interact with students and teachers throughout the day.
This year I focused on my job and wrote only six blogs: some summarize what I learnt at work, some talk about data science in retail discounts, and some analyse open-source second-hand apartment transaction data.
Since August 2016, I’ve written 116 blogs on various topics: Python, data analysis, data visualisation, machine learning, and data science in retailing; more than 212k users have visited my blogs more than 301k times.
Don’t hesitate if you want to ask questions or write comments, they’re welcome!!
Hope to see you in 2023!
Retail discounts play an important role in increasing turnover, building customer loyalty and attracting customers. To build a good discount strategy, data mining can help retailers create a recommendation engine that recommends products to customers and retailers. In this blog, I’ll share my experience of applying data mining to retail discounts with the following points:
Retail discounting is used to decrease the price of specific products for a set amount of time. In some cases, retailers offer a store-wide discount to move excess inventory and create space for new collections. Retailers usually run discounts to attract new customers, increase sales, and clear out old inventory. Large retailers have an easier time selling low-priced merchandise in high volumes, but this strategy doesn’t always work for small to mid-sized retail boutiques. With discounting, it’s important to keep an eye on your profit margins and break-even point, avoid conditioning customers to wait for a sale, and understand exactly why and when you want to discount products.
A recommendation system is a platform that provides its users with various content based on their preferences and likings. A recommendation system takes the information about the users and their behaviours as inputs. This information can be in the form of the past usage of the product or the ratings that were provided for the product. It then processes this information to predict how much the user would rate or prefer the product. A recommendation system makes use of a variety of machine learning algorithms.
Another important role that a recommendation system plays today is searching for similarities between different products. In the retail domain, the recommendation system searches for products that are similar to the ones you have purchased previously. This is an important method for scenarios that involve a cold start, where the retailer does not have much user data available to generate recommendations. In that case, based on the products that are sold, the engine can recommend products that share a degree of similarity or satisfy the discount rules. There are three types of recommendation engine:
In a content-based recommendation system, the background knowledge of the products and customer information is taken into consideration. Based on the products you have purchased in a retail chain, it provides you with similar suggestions. For example, if you have purchased a product that belongs to the “alcohol” category, the content-based recommendation system will suggest similar products from the same category.
Unlike content-based filtering, which recommends similar products, collaborative filtering provides recommendations based on the similar profiles of its users. One key advantage of collaborative filtering is that it is independent of product knowledge; rather, it relies on the users, with the basic assumption that what the users liked in the past they will also like in the future. For example, if person A purchases the alcohol, snacking and bakery categories and person B purchases the snacking, bakery and ice-cream categories, then A will probably also like ice cream, and B the alcohol category.
There is also a third type of recommendation system that combines both Content and Collaborative techniques.
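As an illustration, here is a minimal user-based collaborative-filtering sketch on a customer × category purchase matrix (all the data is made up):

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# rows: customers, columns: product categories, values: purchase counts (made up)
purchases = pd.DataFrame(
    [[5, 3, 2, 0],
     [0, 4, 3, 5],
     [4, 0, 1, 0]],
    index=["A", "B", "C"],
    columns=["alcohol", "snacking", "bakery", "ice-cream"])

# similarity between customers, based on their purchase profiles
user_sim = pd.DataFrame(cosine_similarity(purchases),
                        index=purchases.index, columns=purchases.index)

# score the categories for customer A with the other customers' purchases,
# weighted by their similarity to A, and keep only what A hasn't bought yet
weights = user_sim.loc["A"].drop("A")
scores = purchases.drop(index="A").T @ weights
print(scores[purchases.loc["A"] == 0].sort_values(ascending=False))

Here the only unbought category for customer A is ice-cream, and it gets a positive score, matching the intuition of the example above.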
There are many use cases in the retailing domain like recommending products that are complementary to the product the shopper has chosen, offering a discount to the potential customers to encourage the purchase or even recommending some new products that might be interesting for customers. Here I’ll talk about two use cases.
Customer relationship management (CRM) refers to the principles, practices, and guidelines that a retailer follows when interacting with its customers. One of the CRM approaches is offering a discount on different products to various profiles of customers, which is an application of the Collaborative Filtering Recommendation System. The discounts target different objectives:
Usually, retailers start to build the promotion plan several months or even a year in advance, since it takes lots of time to negotiate with suppliers, define the category and brand for each leaflet, and design the discount for different products. The promotion plan recommendation system is not that common, but it exists and it’s pretty helpful for retailers: it can recommend products for different leaflets by considering various elements like target turnover, target product count, discount periods, category distribution, brand distribution, etc., and it helps retailers save lots of time.
After implementing a promotion, we need to do some analysis to understand its effects. For example, among all target customers, how many of them benefited from the discount? Thanks to the promotion, how much did turnover increase? Furthermore, the following figure shows the effects of a retailer promotion on the sales of the promoted product (Gedenk 2002, Neslin 2002).
We distinguish between short-term effects, which occur during the promotion, and long-term effects, which involve behaviour that takes place after the promotion. Sales for the promoted brand can increase during the promotion by attracting customers from other stores (store switching), inducing customers to switch brands (brand switching), inducing customers to buy from the promoted category rather than another category (category switching), inducing customers who normally do not use the product category to purchase it (new users), or inducing customers to move their purchases forward in time (purchase acceleration). Purchase acceleration can occur because consumers purchase earlier or because they purchase more than they would have done without the promotion. Consumers can either stockpile the extra quantity for future use or consume it at a faster rate. Total category consumption can also increase owing to category switching or if the promotion attracts new users.
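On the measurement side, the first two questions above can be answered with a simple comparison; here is a sketch (the inputs and their schema are made up):

import pandas as pd

def promotion_report(target_ids: pd.Series, buyer_ids: pd.Series,
                     promo_turnover: float, baseline_turnover: float) -> dict:
    """target_ids / buyer_ids: customer ids; turnovers: totals over comparable periods."""
    share_reached = target_ids.isin(buyer_ids).mean()   # share of targets who bought
    uplift = promo_turnover / baseline_turnover - 1     # relative turnover increase
    return {"share_of_targets_who_bought": share_reached,
            "turnover_uplift": uplift}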
In this blog, we talked about how recommendation engines contribute to retail promotions and how we can follow the performance of a promotion. Since promotion plays an important role in the retail domain, it’s important to build a suitable promotion plan and analyse its effects. In The Memory can be the expert that helps retailers accelerate and improve decision-making, define the promotion plan, optimise category management levers (promotion, assortment, CRM, etc.) and follow business performance.
Recently I delivered a product feature that checks input data quality after users upload their datasets to our platform. To return the checking result to the backend, I created an HTTP API to simplify the communication, and since we use Microsoft Azure, we chose to build it with Azure Functions. In this blog, I’ll talk about this with the following points:
Azure Function is a serverless solution that allows you to write less code, maintain less infrastructure, and save on costs. Instead of worrying about deploying and maintaining servers, the cloud infrastructure provides all the up-to-date resources needed to keep your applications running.
You focus on the pieces of code that matter most to you, and Azure Functions handles the rest.
You can build an Azure function to react to a series of critical events, for example building a web API, responding to database changes, processing IoT data streams, or managing message queues, and with your preferred language (C#, Java, JavaScript, PowerShell, Python, etc.). In this blog, I’ll only talk about building a web API with Python.
The specific prerequisites for Core Tools depend on the features you plan to use:
The recommended folder structure for an Azure Functions project in Python looks like the following example:
<project_root>/
| - .venv/
| - .vscode/
| - my_first_function/
| | - __init__.py
| | - function.json
| | - example.py
| - my_second_function/
| | - __init__.py
| | - function.json
| - shared_code/
| | - __init__.py
| | - my_first_helper_function.py
| | - my_second_helper_function.py
| - tests/
| | - test_my_second_function.py
| - .funcignore
| - host.json
| - local.settings.json
| - requirements.txt
| - Dockerfile
The main project folder can contain the following files:

- requirements.txt: Contains the list of Python packages that the system installs when you’re publishing to Azure.
- host.json: Contains configuration options that affect all functions in a function app instance. This file is published to Azure. Not all options are supported when functions are running locally.
- .vscode/: (Optional) Contains stored Visual Studio Code configurations.
- .venv/: (Optional) Contains a Python virtual environment that’s used for local development.
- Dockerfile: (Optional) Used when you’re publishing your project in a custom container.
- tests/: (Optional) Contains the test cases of your function app.
- .funcignore: (Optional) Declares files that shouldn’t be published to Azure. Usually, this file contains .vscode/ to ignore your editor settings, .venv/ to ignore the local Python virtual environment, tests/ to ignore test cases, and local.settings.json to prevent local app settings from being published.
- local.settings.json: Used to store app settings and connection strings when functions are running locally. This file isn’t published to Azure.
{
"IsEncrypted": false,
"Values": {
"FUNCTIONS_WORKER_RUNTIME": "<language worker>",
"AzureWebJobsStorage": "<connection-string>",
"MyBindingConnection": "<binding-connection-string>",
"AzureWebJobs.HttpExample.Disabled": "true"
},
"Host": {
"LocalHttpPort": 7071,
"CORS": "*",
"CORSCredentials": false
},
"ConnectionStrings": {
"SQLConnectionString": "<sqlclient-connection-string>"
}
}
A few notable settings:

- IsEncrypted: When this setting is set to true, all values are encrypted with a local machine key. Used with func settings commands. The default value is false.
- FUNCTIONS_WORKER_RUNTIME: Indicates the targeted language of the Functions runtime.
- AzureWebJobsStorage: Contains the connection string for an Azure storage account. Required when using triggers other than HTTP.
- Host: Settings in this section customize the Functions host process when you run projects locally. These settings are separate from the host.json settings, which also apply when you run projects in Azure.

You can find more information here.

A function is the primary concept in Azure Functions. A function contains two important pieces - your code, which can be written in a variety of languages, and some config, the function.json file. For compiled languages, this config file is generated automatically from annotations in your code. For scripting languages, you must provide the config file yourself.
The function.json
file defines the function’s trigger, bindings, and other
configuration settings. Every function has one and only one trigger. The runtime
uses this config file to determine the events to monitor and how to pass data
into and return data from function execution. The following is an example
function.json
file.
{
"scriptFile": "__init__.py",
"bindings": [
{
"authLevel": "function",
"type": "httpTrigger",
"direction": "in",
"name": "req",
"methods": [
"get",
"post"
]
},
{
"type": "http",
"direction": "out",
"name": "$return"
}
]
}
The bindings property is where you configure both triggers and bindings. Each binding shares a few common settings, plus some settings that are specific to a particular type of binding. Every binding requires the following settings:

- type: Name of the binding.
- direction: Indicates whether the binding is for receiving data into the function or sending data from the function.
- name: The name that is used for the bound data in the function.

Triggers cause a function to run. A trigger defines how a function is invoked and a function must have exactly one trigger. Triggers have associated data, which is often provided as the payload of the function.
Binding to a function is a way of declaratively connecting another resource to the function; bindings may be connected as input bindings, output bindings, or both. Data from bindings are provided to the function as parameters.
You can mix and match different bindings to suit your needs. Bindings are optional and a function might have one or multiple input and/or output bindings.
Triggers and bindings let you avoid hardcoding access to other services. Your function receives data (for example, the content of a queue message) in function parameters. You send data (for example, to create a queue message) by using the return value of the function.
The HTTP trigger is defined in the function.json file. The name
parameter of
the binding must match the named parameter in the function. The previous
examples use the binding name req
. This parameter is an HttpRequest
object, and an HttpResponse object is returned.
From the HttpRequest
object, you can get request headers, query parameters,
route parameters, and the message body.
Here is an example:
import azure.functions as func


def main(req: func.HttpRequest) -> func.HttpResponse:
headers = {"my-http-header": "some-value"}
name = req.params.get('name')
if not name:
try:
req_body = req.get_json()
except ValueError:
pass
else:
name = req_body.get('name')
if name:
return func.HttpResponse(f"Hello {name}!", headers=headers)
else:
return func.HttpResponse(
"Please pass a name on the query string or in the request body",
headers=headers, status_code=400
)
In this function, the value of the name
query parameter is obtained from the
params
parameter of the HttpRequest
object. The JSON-encoded message body is
read using the get_json
method. Likewise, you can set the status_code
and
headers
information for the response message in the returned HttpResponse
object.
Before running functions locally, you need to have Azure Functions Core Tools installed on your machine.
To run a Functions project, you run the Functions host from the root directory of your project. The host enables triggers for all functions in the project.
To test your functions locally, you start the Functions host and call endpoints on the local server using HTTP requests.
The command below must be run in a virtual environment.
# start the Functions host
# version 2.x
func start
Then we call the following endpoint to locally run HTTP and webhook triggered functions:
http://localhost:{port}/api/{function_name}
The following example is the function MyHttpTrigger
called from a POST request
passing name in the request body:
curl --request POST http://localhost:7071/api/MyHttpTrigger --data '{"name":"Azure Rocks"}'
When you’re ready to publish, make sure that all your publicly available dependencies are listed in the requirements.txt file. This file is at the root of your project directory. You can also find project files and folders that are excluded from publishing, including the virtual environment folder, in the root directory of your project.
Three build actions are supported for publishing your Python project to Azure: remote build, local build, and builds that use custom dependencies.
You can also use Azure Pipelines or GitHub Actions to build your dependencies and publish by using continuous delivery (CD), which is the way that I chose.
In GitHub Actions, a workflow is an automated process that you define in your
GitHub repository. This process tells GitHub how to build and deploy your
function app project on GitHub. A workflow is defined by a YAML (.yml) file in
the /.github/workflows/
path in your repository. This definition contains the
various steps and parameters that make up the workflow:
For the details of each step, you can find information here.
How to go further from here?
In this article, we talked about what we need to create an Azure function, what HTTP triggers and bindings are, how to test the function locally and how to publish it on Azure. Hope it’s useful for you :)
In recent work, I needed to write data into a PostgreSQL database. Before writing real data on staging, I learnt how to do it with Docker. In this blog, I’ll talk about this with the following points:
PostgreSQL is a powerful, open source object-relational database system that uses and extends the SQL language combined with many features that safely store and scale the most complicated data workloads. PostgreSQL comes with many features aimed to help developers build applications, administrators to protect data integrity and build fault-tolerant environments, and help you manage your data no matter how big or small the dataset. In addition to being free and open source, PostgreSQL is highly extensible. For example, you can define your own data types, build out custom functions, even write code from different programming languages without recompiling your database!
Developing apps today requires so much more than writing code. Multiple languages, frameworks, architectures, and discontinuous interfaces between tools for each lifecycle stage creates enormous complexity. Docker simplifies and accelerates your workflow, while giving developers the freedom to innovate with their choice of tools, application stacks, and deployment environments for each project.
Here, I used the official Docker image of postgres to create a database and tables.
CREATE DATABASE xxx;
We can create a table with CREATE TABLE
and insert value
into it with INSERT INTO table_name (column_name) VALUES (values)
.
CREATE TABLE jsonb_test (
id INT GENERATED ALWAYS AS IDENTITY,
parameters jsonb
);
INSERT INTO jsonb_test ("parameters") VALUES ('{"param1":"value1","param2":22,"param3":[3,33]}');
To delete rows from a table with conditions, or to empty a whole table, we can use
DELETE FROM table_name WHERE xxx;
or DELETE FROM table_name;
. To remove the table entirely, use
DROP TABLE table_name;
.
Here, I used the official Docker image of python3.
Now we will insert a pandas dataframe pdf
into the table jsonb_test
:
from sqlalchemy import create_engine
import pandas as pd
pdf = pd.DataFrame({'parameters':['{"param1":"v1", "param2": 2}']})
# PASSWORD, HOST_NAME and pgdb are placeholders for your own connection settings
eng_pg = create_engine("postgresql://postgres:{pw}@{host}/{dbname}".format(pw=PASSWORD,
host=HOST_NAME,
dbname=pgdb))
pdf.to_sql("jsonb_test", eng_pg, if_exists='append', index=False)
Before inserting the dataframe into PGSQL, we need to create a PGSQL engine with
create_engine
by specifying the host name and password, then insert the
dataframe with to_sql
.
With the check above, we can ensure that we insert the dataframe successfully.
First of all, I downloaded real estate transaction data from the government’s open data site. The dataset contains transaction information from January 2014 to June 2021: “nature_mutation” specifies the nature of the sale, “nombre_pieces_principales” indicates the number of rooms, “valeur_fonciere” gives the sale price, “code_commune”, “nom_commune” and “code_departement” specify the communes and departments, and “surface_reelle_bati” describes the real surface area.
For this analysis, I only took into account second-hand apartments’ transactions with a positive area in Île-de-France.
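As a sketch, the filtering can be done with pandas. The file path is illustrative, and the type_local column with its “Appartement” value is an assumption based on the public DVF schema:

import pandas as pd

idf_departments = ["75", "77", "78", "91", "92", "93", "94", "95"]

df = pd.read_csv("dvf_2014_2021.csv", dtype={"code_departement": str})  # illustrative path
apartments = df[
    (df["nature_mutation"] == "Vente")                  # sales only
    & (df["type_local"] == "Appartement")               # apartments only (assumed column)
    & (df["surface_reelle_bati"] > 0)                   # positive area
    & (df["nombre_pieces_principales"] > 0)             # known number of rooms
    & (df["code_departement"].isin(idf_departments))    # Île-de-France
].copy()

# room-count classes used below: T1..T4, with 5 rooms or more grouped into T5
apartments["class"] = "T" + apartments["nombre_pieces_principales"].clip(upper=5).astype(int).astype(str)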
I classed second-hand apartments into 5 groups by number of rooms: T1 means a one-room apartment of around 25 m2; T2 a two-room apartment of around 40 m2; T3 a three-room apartment of around 60 m2; T4 a four-room apartment of nearly 80 m2; T5 a five-room (or larger) apartment above 100 m2. This donut chart describes the share of each apartment class among the transactions. T2 and T3 hold nearly 60% of transactions, nearly 17% of transactions are T1 apartments, and the other purchases are larger apartments. Let’s go further into the details.
This graph describes the average price per m2 for second-hand apartments with different numbers of rooms in Ile-de-France, between January 2014 and June 2021, using the same T1-T5 classes as above. The average area of each class is similar across departments. T1 and T2 apartments in Paris are smaller than in the other departments; however, second-hand apartments of T4 or larger are larger in Paris than in the other departments.
According to the second graph, although T2 and T3 are much larger than T1, their unit prices are lower than T1’s: the gap to T1 in Paris is 6.7% and 6.3%, respectively; in the other departments the gap is much larger (12% and 23%). Moreover, while the average area of T4 is three times that of T1, its unit price is on average 20% lower than T1’s, except in Paris, where T4’s unit price is only 1% lower than T1’s. Why are T1 apartments so expensive per m2? That might be because there are many students and young workers who need to rent a big enough apartment, which drives investors to invest in T1 apartments and leads to higher demand for T1.
According to this map, we observe that second-hand apartments in the center and west of Paris are more expensive (> 8.5k euros per m2) than in the other districts of Paris (6k - 8.5k euros per m2), and second-hand apartments in Paris are more expensive than in the other departments of IDF. Among the departments other than Paris, second-hand apartments in Hauts-de-Seine are the most expensive (4k - 6k euros per m2).
Furthermore, in light of the stacked bar chart, it’s obvious that there are many more transactions in Paris, although it’s much more expensive than the other departments, whatever the class of apartment. In Paris, 60% of transactions are studios or 2-room apartments; in the other departments, the majority of transactions are 2-room or 3-room apartments. One reason might be that the unit price in Paris is higher: a studio or 2-room apartment can satisfy the needs of people who live alone or as a couple, as well as investors; on the contrary, people who live with family prefer apartments outside Paris, which are larger and less expensive.
This group of scatter plots shows the relationship between second-hand apartments’ prices and their areas. Each point stands for one transaction; points on the red dashed line have a price of exactly 10k euros per m2. Points above the dashed line have a unit price greater than 10k euros; points below it, less than 10k euros per m2.
For transactions in Paris, most of the sold apartments are smaller than 150 m2, with a unit price around 10k euros per m2. However, for transactions in the other departments, most apartments are smaller than 130 m2 and the unit price is lower than 10k euros per m2; in Seine-et-Marne (77), Essonne (91) and Val-d’Oise (95) especially, we can even get a 100 m2 second-hand apartment for only 0.25 million euros, much cheaper than in the other departments.
The line chart describes second-hand apartments’ average price per m2 in Ile-de-France between January 2014 and June 2021. Obviously, the average price in Paris is the highest in Ile-de-France and its growth is the strongest as well: it increased 37% (11.5/8.4 - 1). Besides, the average price in Hauts-de-Seine is the second highest in Ile-de-France, up 33% (7.3/5.5 - 1); the average price in Val-de-Marne increased 31% (5.5/4.2 - 1); the average prices of the other departments in Ile-de-France didn’t change much. Among the 8 departments, second-hand apartments are the most expensive in Paris as of June 2021, at 11.3k euros per m2, 55% higher than the price in Hauts-de-Seine and more than twice the price of second-hand apartments in Val-de-Marne.
The stacked area plot presents the number of second-hand apartment transactions in Ile-de-France over the same period as the line chart. We can easily see that Paris has more second-hand apartment transactions than the other departments of IDF: nearly 30% of all transactions in IDF. After Paris, Hauts-de-Seine and Val-de-Marne have the second and third largest transaction counts in IDF. The peak for Paris was December 2015 (4612 transactions); for Hauts-de-Seine and Val-de-Marne it was July 2019.
Then I used Time Series additive
model to decompose data into a trend
component, a seasonal component, and a residual component. The trend component
captures changes over time, the seasonal component captures cyclical effects
due to the time of year, the residual component captures the influences not
described by the trend and seasonal effects. Thanks to this model, we find that
except for July, there is another transaction peak in March, which we didn’t
find above. In June and August, the transactions arrive at their low points,
that might be because, during the transition period between 2 months, the desire
for purchasing or selling apartments is not that high.
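For reference, here is a minimal decomposition sketch with statsmodels (assuming a monthly average price series; the file name is illustrative):

import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# monthly average price per m2, indexed by month (load your own series here)
monthly_price = pd.read_csv("idf_monthly_price.csv",
                            index_col=0, parse_dates=True).squeeze()

result = seasonal_decompose(monthly_price, model="additive", period=12)
result.plot()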
Moreover, I used fbprophet
module to predict the price per m2. The black
points present actual values, the blue line indicates the forecasted values, and
the light blue shaded region is the uncertainty. The uncertainty’s region
increases for the prediction because of the initial uncertainty and it grows
over time. This can be impacted by policy, social elements, or some others.
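A minimal forecasting sketch, reusing the monthly series from the previous snippet (fbprophet expects the columns ds and y):

from fbprophet import Prophet

df = monthly_price.reset_index()
df.columns = ["ds", "y"]

m = Prophet()
m.fit(df)
future = m.make_future_dataframe(periods=12, freq="M")  # forecast one more year
forecast = m.predict(future)
m.plot(forecast)  # black points: actuals; blue line: forecast; shaded: uncertainty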
According to this analysis, we find that among all transactions of second-hand apartments in Île-de-France, T2 and T3 hold 60% of transactions. Second-hand apartments in the center and west of Paris are more expensive (> 8.5k euros per m2) than in the other arrondissements of Paris (6k - 8.5k euros per m2), and second-hand apartments in Paris are more expensive than in the other departments of IDF. Among the departments other than Paris, second-hand apartments in Hauts-de-Seine are the most expensive (4k - 6k euros per m2).
First of all, I downloaded real estate transaction data from the government’s open data site. The dataset contains transaction information from January 2014 to June 2021: “nature_mutation” specifies the nature of the sale, “nombre_pieces_principales” indicates the number of rooms, “valeur_fonciere” gives the sale price, “code_commune”, “nom_commune” and “code_departement” specify the communes and departments, and “surface_reelle_bati” describes the real surface area.
For this analysis, I only took into account second-hand apartments’ transactions with a positive area in Paris.
I classed second-hand apartments into 5 groups by number of rooms: T1 means a one-room apartment of around 23 m2; T2 a two-room apartment of around 40 m2; T3 a three-room apartment of around 63 m2; T4 a four-room apartment of nearly 93 m2; T5 a five-room apartment with a larger area of about 147 m2. This donut chart describes the share of each apartment class among the transactions. T1 and T2 hold 60% of transactions, 22% of transactions are T3 apartments, and the other purchases are larger apartments. Let’s go further into the details.
This graph describes the average price per m2 for second-hand apartments with different numbers of rooms in Paris, between January 2014 and June 2021. We find that although T2 and T3 are much larger than T1, their unit prices are 6.7% and 6.4% lower than T1’s. Moreover, while the average area of T4 is three times that of T1, its unit price is only 4% more expensive than T1’s; it’s similar for the other classes. Why are T1 apartments so expensive per m2? That might be because there are many students and young workers in Paris who need to rent a big enough apartment, which drives investors to invest in T1 apartments and leads to higher demand for T1.
According to this map, we observe that second-hand apartments in arrondissements 4, 6, 7 and 8 are much more expensive than in the other arrondissements: their average unit price is at least 11,800 euros. On the contrary, second-hand apartments in arrondissements 18, 19 and 20 are much cheaper: their average unit price is less than 8,000 euros. This might be explained by geographical position, number of rooms, the apartment’s state, energy performance, public security, etc. There is more public transport in the city center than in other areas, and its many shopping centers and tourist spots attract plenty of people, which makes the city center more valuable.
Furthermore, according to the stacked bar plot below, it’s obvious that there are many more transactions in the 18th arrondissement than in other areas; nearly 50% of the sold apartments there are 2-room apartments. The Sacré-Cœur Basilica and Montmartre make the 18th arrondissement famous. A real artists’ neighborhood, it is bohemian and cosmopolitan. If you like discovering atypical places and diverse personalities, you will find what you’re looking for: the popular flea market, many schools, and many nightlife venues, such as the cabarets around Pigalle. All this attracts couples to live in the 18th arrondissement.
Moreover, in the 16th arrondissement, the number of T4 transactions is larger than in all the other arrondissements. Paris 16 is eminently residential, as evidenced by its charming buildings with green courtyards and balconies. But it is also a Parisian cultural hotspot, with many museums and places emblematic from both a historical and an intellectual point of view. Moreover, it concentrates many schools and establishments of choice for the education of children and students. All this might explain why there are many more T4 transactions in Paris 16 than in the other arrondissements.
In the 1st, 2nd and 3rd arrondissements, more than one third of the sold apartments are T1 apartments; that might be because there aren’t that many apartments in the center of Paris, and the unit price there is high.
This group of scatter plots shows the relationship between second-hand apartments’ prices and their areas. Each point stands for one transaction; points on the red dashed line have a price of exactly 10k euros per m2. Points above the dashed line have a unit price greater than 10k euros; points below it, less than 10k euros per m2.
For transactions in the downtown area, most apartments are smaller than 50 m2, but their prices vary widely, up to nearly 2 million euros. On the other hand, in the 8th, 16th and 17th arrondissements, many sold apartments also reach more than 2 million euros, but their areas vary widely, up to 200 m2. Finally, there are arrondissements where neither unit price nor area varies that widely: in the 13th, 18th, 19th and 20th arrondissements, most apartments are smaller than 100 m2 and cheaper than 1 million euros, thus less than 10k euros per m2.
This graph describes second-hand apartments’ transaction count and average price per m2 in Paris, between January 2014 and June 2021. The orange line shows the monthly average price per m2; the blue area displays the monthly transaction count. Over 7.5 years, the average price per m2 increased 37% (11.5/8.4 - 1), and from 2017 alone it increased nearly 26% (11.5/9.1 - 1). Moreover, the transaction count reaches its yearly low point in August, which might be because people go on holiday at that time; on the contrary, transactions in July and September are higher than in other months, which suggests that people usually sign the purchase promise in May or July (assuming about 2 months to negotiate the credit between the purchase promise and the purchase agreement), so they can sign the agreement before their holiday or before school starts. Moreover, because of the COVID-19 pandemic, the transaction count dropped 50% in April 2020, then recovered after the first lockdown ended. Impacted by the pandemic, neither the transaction count nor the average price increased much in 2021.
Then I used Time Series additive
model to decompose data into a trend
component, a seasonal component, and a residual component. The trend component
captures changes over time, the seasonal component captures cyclical effects due
to the time of year, the residual component captures the influences not
described by the trend and seasonal effects. Thanks to this model, we find that
except for July, there is another transaction peak in January, which we didn’t
find above. In March and June, the transactions arrive at their low points,
that might be because, during the transition period between 2 months, the desire
for purchasing or selling apartments is not that high.
Moreover, I used fbprophet
module to predict the price per m2. The black
points present actual values, the blue line indicates the forecasted values,
and the light blue shaded region is the uncertainty. The uncertainty’s region
increases for the prediction because of the initial uncertainty and it grows
over time. This can be impacted by policy, social elements, or some others.
According to this analysis, we find that among all transactions of second-hand apartments in Paris, T1 and T2 hold 60% of transactions. Second-hand apartments in arrondissements 4, 6, 7 and 8 are much more expensive than in the other arrondissements, with an average unit price of at least 11,800 euros; on the contrary, second-hand apartments in arrondissements 18, 19 and 20 are much cheaper, with an average unit price below 8,000 euros.
The year 2021 was still a year of battling COVID-19. Thanks to vaccination, our life gradually returned to normal: we started going back to the office more frequently and had the chance to travel as before. This year I continued to go deeper into applying data science to the retail domain and did some data analysis with open-source data as well. In this blog, I’ll review my year 2021 with the following points:
In The Memory is a retail-tech company that helps retail players make the best use of different internal and external data sources to meet their strategic and operational business challenges. Our products allow distributors and brands to accelerate their decision-making to attract more customers and make the best assortment, merchandising, pricing, and promotional choices in their various physical and online sales channels. We build tailored Augmented Intelligence solutions to meet clients’ priority challenges and serve their strategies by supporting their teams in change management, defining together the best KPIs to meet clients’ challenges, and adapting our solutions to the client’s needs, constraints and processes. Moreover, this year we won the “Pépite du Retail 2020” trophy, voted for by LSA Live participants, and were elected best Microsoft 2021 partner in the “France Action Startup Award” category; we now have nearly 50 colleagues vs. 25 in 2020.
This year I accomplished about twenty CRM (Customer Relationship Management) projects; some are for distributors, some for industrialists. With our analysis, we help them achieve 10% more turnover per client. I also developed new features for a module that can extract about 50 KPIs over 1 year at different levels, such as per product/store, per product category/store group, or temporal levels x product/store, like month x product or day x store. The SLA (Service-Level Agreement) of this module is about 2-5 min, and within 2 weeks of release the module had already been used around 1500 times. Moreover, with my colleague, we created a model that estimates a product’s turnover and recommends products for different promotion operations, which will be applied in a new module. Since it’s confidential, I won’t talk about the details here ;)
Furthermore, as the company expands, we updated our information on Welcome to the Jungle. I participated in a video shoot presenting what the data team does in its daily work and how we cooperate with other teams like the consulting team and the dev team.
Working on different retail projects, I gained more knowledge of different indicators. Thanks to the CRM projects, I understand what we should focus on according to clients’ needs and how to segment customers by their purchases. In daily work, I learned how to slice a project and accomplish its different parts with colleagues. The biggest gain came from the sales promotion project: I enriched my knowledge of promotions and understood that different promotions operate with different mechanics and generosities; to define the products for each promotion, we need to reach various objectives, such as the turnover objective, product count, brand type distribution, generosity distribution, etc. Thanks to this project, I had closer contact with business people (category managers, purchasing, promotion, etc.), which let me better understand their needs and pain points, so that we can develop the right product to satisfy those needs or solve the pain points.
Since the beginning of 2020, people all over the world have struggled with the COVID-19 virus, and scientists have been actively looking for solutions. To achieve herd immunity, the most effective method at the moment is arguably vaccination. Various voices about the vaccines have been at the center of public opinion, and the praise and controversy around them have never stopped. And taking a vaccine from theoretical design to clinical trials requires enormous wisdom and effort from scientists.
With open-source datasets, I analyzed the adverse reactions to the Pfizer, CoronaVac, AstraZeneca and Moderna vaccines in different blogs:
Whether for local or systemic reactions, the reactions are most pronounced after the injection of Moderna: 86% of people have local reactions, such as pain, swelling, and redness at the injection site, and nearly 67% have systemic reactions, such as fatigue, chills, joint pain, and muscle pain. Next comes the CoronaVac (Sinovac) vaccine: 62% of people had a local reaction after injection, and 58% a systemic reaction. Among the four vaccines, the Pfizer-BioNTech vaccine caused the fewest adverse reactions: the probabilities of local and systemic reactions after vaccination were 29.5% and 22.4%, respectively.
This year I wrote 22 blogs (including this one) on various topics: retailing, COVID-19, population and employment. Moreover, the traffic of my blog increased by 21.4% compared with 2020. I’m pretty glad if my blogs can help you and solve problems for you.
Besides, I opened a WeChat Official Account, which is like a personal blog based on WeChat. On this platform, I translate some of my English blogs into Chinese and share them with my Chinese friends. I’ve written 11 blogs there and they’ve been read 8500 times.
Don’t hesitate if you want to ask questions or write comments, they’re welcome!!
Hope to see you in 2022!
People who are familiar with openpyxl
know that we can use it to read/write
Excel 2010 xlsx/xlsm/xltx/xltm files. As I presented in this blog,
we can create a workbook, assign values to some cells, apply number formats,
merge cells, etc. However, if we need to create an Excel dashboard like the
following, should we accomplish all the formatting with openpyxl
?
For the question above, we can approach it from another point of view: we can create an Excel template with the fixed formatting, such as the dashboard title and logo and the subtitles, then write values into this template. In this blog, I’ll show you how to do this with the following points:
We have an Excel template named “template.xlsx”, which contains two worksheets “category” and “product”:
The worksheet “category” shows the performance of each category with different indicators like turnover, volume, number of clients, etc.
The worksheet “product” shows the performance of each product with the same indicators.
And what we need to insert into these two worksheets are three pandas dataframes: classic_indicators_df, other_indicators_df and products_detail_df.
With all data preparation, the next target is writing the three dataframes into the template with several steps:
template = openpyxl.load_workbook('./template.xlsx')
We load the template with openpyxl.load_workbook
by indicating the path.
import pandas as pd
out_path = './final_report.xlsx'
writer = pd.ExcelWriter(out_path)
writer.book = template
We set the writer with pandas.ExcelWriter
that allows to write DataFrame
objects into excel sheets, and set the template file as the writer’s workbook.
import openpyxl
from openpyxl.styles.borders import Border, Side

def set_border(ws, cell_range):
    side = Side(border_style='thin', color="FF000000")
    border = Border(left=side, right=side, top=side, bottom=side)
    # apply a thin border to every cell of the given range
    for row in ws[cell_range]:
        for cell in row:
            cell.border = border
Before writing into the template, I create a function set_border()
by using
Side()
and Border()
to set borders for each cell in the given range.
df_sheet_list = [(classic_indicators_df, 'category'),
(other_indicators_df, 'category'),
(products_detail_df, 'product')]
for (df, sht) in df_sheet_list:
templ_sht = template[sht]
writer.sheets = {templ_sht.title:templ_sht}
if df is classic_indicators_df:
classic_indicators_df.to_excel(writer, sheet_name=sht, index=False,
header=False, startrow=13, startcol=2)
set_border(writer.sheets[sht], f"C14:G{14-1+len(df)}")
elif df is other_indicators_df:
other_indicators_df.to_excel(writer, sheet_name=sht, index=False,
header=False, startrow=13, startcol=8)
set_border(writer.sheets[sht], f"I14:M{14-1+len(df)}")
elif df is products_detail_df:
products_detail_df.to_excel(writer, sheet_name=sht, index=False,
header=False, startrow=12, startcol=4)
set_border(writer.sheets[sht], f"E13:K{13-1+len(df)}")
writer.save()
We assign the sheet where we will insert the dataframe to writer.sheets. Then we write the dataframe with .to_excel, specifying the writer we use, the worksheet we want to insert into, and the start row and start column as integers. After all these steps, we save the file with writer.save().
If you are curious about the scripts, you will find them here.